A Study of the Impact of Synthetic Data Generation Techniques on Data Utility using the 1991 UK Samples of Anonymised Records

Jennifer Taub, Mark Elliot, Joseph Sakshaug

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

Synthetic data is an alternative to controlling confidentiality risk through traditional statistical disclosure
control (SDC) methods. A barrier to the use of synthetic data for real analyses is uncertainty about its reliability and
validity. Surprisingly, there has been a relative dearth of research into the measurement of utility of synthetic data.
Utility measures developed to date have been either information-theoretic abstractions, such as the propensity
score mean-squared error, or somewhat arbitrary collations of statistics, and there has been no systematic
investigation into how synthetic data holds up under real data analyses.
In this paper, we adopt the methodology used by Purdam and Elliot (2007), in which they reran published
analyses on disclosure-controlled microdata and evaluated the impact of the disclosure control on the analytical
outcomes. We utilise the same studies as Purdam and Elliot to facilitate comparisons of data utility between
synthetic and disclosure controlled versions of the same data.
The results will be of interest to academics and practitioners who wish to know the extent to which
synthetic data delivers utility under a variety of analytic objectives.
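
For readers unfamiliar with the propensity score mean-squared error (pMSE) mentioned above: it is commonly computed by pooling the original and synthetic records, fitting a classifier to discriminate between them, and averaging the squared deviations of the predicted propensity scores from the synthetic-record share. The sketch below is illustrative only and is not drawn from the paper; the function name pmse is hypothetical, and it assumes numeric (or pre-encoded) features and scikit-learn's LogisticRegression as the discriminator.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def pmse(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
        # Pool the two datasets and label each record's provenance:
        # 0 = original, 1 = synthetic.
        combined = pd.concat([original, synthetic], ignore_index=True)
        labels = np.concatenate([np.zeros(len(original)), np.ones(len(synthetic))])

        # Fit a discriminator and obtain each record's propensity score,
        # i.e. the predicted probability that the record is synthetic.
        model = LogisticRegression(max_iter=1000).fit(combined, labels)
        scores = model.predict_proba(combined)[:, 1]

        # If the synthetic data were indistinguishable from the original,
        # every score would equal the synthetic share c, giving a pMSE near 0.
        c = len(synthetic) / len(combined)
        return float(np.mean((scores - c) ** 2))

Under this measure, lower values indicate higher utility: a pMSE of zero means the discriminator cannot tell the two datasets apart.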
Original language: English
Title of host publication: UNECE Work Session on Statistical Confidentiality
Publication status: Published - 2017

Research Beacons, Institutes and Platforms

  • Cathie Marsh Institute
  • Manchester Institute for Collaborative Research on Ageing
