The production of bespoke synthetic teaching datasets without access to the original data

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Teaching datasets are a pivotal component of the data discovery pipeline. These datasets often serve as the initial point of interaction for data users, allowing them to explore the contents of a dataset and assess its relevance to their needs. However, there are instances where their viability is limited, particularly where source data is only accessible within restricted settings, such as trusted research environments (TREs). In response to this challenge, this paper proposes the production of synthetic datasets tailored for specific teaching purposes, using already cleared (and published) analyses as the basis for the synthesis. Unlike generic synthetic datasets, the datasets created are designed solely to reproduce the specific analyses. Crucially, the datasets can be generated without access to the original data. Two experiments with census data demonstrate the viability of the method, and a live use case is described. Issues arising, such as marginal disclosure risk, are then discussed.
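
As a rough illustration of the general idea only (not the authors' method, which this record does not detail), the sketch below evolves a small synthetic dataset so that its cross-tabulation approaches a published table of counts. The target table, the variable categories, and the function names `crosstab`, `error`, `mutate` and `evolve` are all hypothetical, and the simple (1+1) evolutionary search is an assumption chosen for brevity.

```python
import random

# Hypothetical "published analysis": a 2x2 cross-tabulation of counts.
# These numbers are invented for illustration; they come from no real dataset.
TARGET = {("A", "X"): 30, ("A", "Y"): 20, ("B", "X"): 10, ("B", "Y"): 40}
ROWS = ("A", "B")
COLS = ("X", "Y")


def crosstab(records):
    """Count how many records fall in each (row, col) cell."""
    counts = {cell: 0 for cell in TARGET}
    for rec in records:
        counts[rec] += 1
    return counts


def error(records):
    """Total absolute difference between candidate and published counts."""
    counts = crosstab(records)
    return sum(abs(counts[cell] - TARGET[cell]) for cell in TARGET)


def mutate(records, rng):
    """Copy the dataset and resample one randomly chosen record."""
    child = list(records)
    i = rng.randrange(len(child))
    child[i] = (rng.choice(ROWS), rng.choice(COLS))
    return child


def evolve(n_records, generations, seed=0):
    """(1+1) evolutionary search: keep the child only if it is no worse."""
    rng = random.Random(seed)
    # Start from random records; the original microdata is never consulted.
    parent = [(rng.choice(ROWS), rng.choice(COLS)) for _ in range(n_records)]
    initial_error = error(parent)
    for _ in range(generations):
        if error(parent) == 0:
            break  # the synthetic table already matches the published one
        child = mutate(parent, rng)
        if error(child) <= error(parent):
            parent = child
    return parent, initial_error


synthetic, initial_error = evolve(n_records=100, generations=5000)
final_error = error(synthetic)
```

Only the published counts enter the fitness function, which is why a search of this kind needs no access to the source records. Because the synthetic data is fitted to released tables, any disclosure risk carried by those tables' margins carries over, which is the kind of issue the abstract flags for discussion.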
Original language: English
Title of host publication: Privacy in Statistical Databases conference 2024
Publication status: Accepted/In press - 1 Jul 2024

Keywords

  • Data Synthesis
  • Evolutionary Algorithms
  • Data Utility
  • Disclosure Risk
