Abstract
Synthetic data is an alternative to controlling confidentiality risk through traditional statistical disclosure
control (SDC) methods. A barrier to the use of synthetic data for real analyses is uncertainty about its reliability and
validity. Surprisingly, there has been a relative dearth of research into the measurement of utility of synthetic data.
Utility measures developed to date have been either information-theoretic abstractions, such as the Propensity Score
Measure mean-squared error, or somewhat arbitrary collations of statistics, and there has been no systematic
investigation into how well synthetic data holds up in real data analyses.
In this paper, we adopt the methodology used by Purdam and Elliot (2007), who reran published
analyses on disclosure-controlled microdata and evaluated the impact of the disclosure control on the analytical
outcomes. We utilise the same studies as Purdam and Elliot to facilitate comparisons of data utility between
synthetic and disclosure-controlled versions of the same data.
The results will be of interest to academics and practitioners who wish to know the extent to which
synthetic data delivers utility under a variety of analytic objectives.
Original language | English |
---|---|
Title of host publication | UNECE worksession on Statistical Confidentiality |
Publication status | Published - 2017 |
Research Beacons, Institutes and Platforms
- Cathie Marsh Institute
- Manchester Institute for Collaborative Research on Ageing
Publication title: 'A Study of the Impact of Synthetic Data Generation Techniques on Data Utility using the 1991 UK Samples of Anonymised Records'
Impacts
- Impact on the Statistical Confidentiality Practices of Data Stewardship Organisations
Elliot, M. (Participant), Purdam, K. (Participant), Mackey, E. (Participant), Smith, D. (Participant) & (Participant)
Impact: Economic impacts, Societal impacts, Legal impacts