Synthetic census microdata generation: a comparative study of synthesis methods examining the trade-off between disclosure risk and utility

Research output: Contribution to journal › Article › peer-review

Abstract

There is growing interest in synthetic data generation as a means of allowing access to useful data whilst preserving confidentiality. In particular, synthetic microdata generation could allow increased access to census and administrative data. An accurate understanding of the comparative performance of current synthetic data generators, in terms of the resulting data utility and disclosure risk for synthetic microdata, is important in allowing data owners to make informed decisions about the choice of method and parameter settings to use. Synthesising microdata can present challenges as the data typically contains predominantly categorical variables that standard statistical methods may struggle to process. In this paper we present the first in-depth evaluation of four state-of-the-art synthetic data generators originating from the statistical (synthpop, Data-Synthesizer) and deep learning (CTGAN, TVAE) communities, each capable of dealing with microdata. We use four real census microdatasets (Canada, Fiji, Rwanda, UK) to systematically validate and compare the synthetic data generators and their parameter settings in terms of the utility and disclosure risk of the resulting synthetic data, using statistical metrics and the risk-utility map for visualization. Our analysis shows that the performance of the synthetic data generators considered depends on their parameter settings and the dataset.

Original language: English
Journal: Journal of Official Statistics
Publication status: Accepted/In press - 17 Jun 2024

Keywords

  • Synthetic Data
  • Data Utility
  • Disclosure Risk
