Synthetic Data: An Exploration of Data Utility and Disclosure Risk

  • Jennifer Taub

Student thesis: PhD

Abstract

Synthetic data generation has been proposed as a flexible alternative to orthodox statistical disclosure control (SDC) methods for minimising disclosure risk. While traditional SDC techniques aim to suppress or perturb existing datasets, synthesis takes a different approach: it creates a brand new dataset. From this a secondary question arises as to how best to measure disclosure risk and utility for synthetic data. The thesis is based on a set of papers. The first paper of the thesis explores data utility by applying a methodology developed by Purdam and Elliot (2007), in which they replicated published analyses using disclosure-controlled versions of the same microdata used in said analyses to evaluate the impact of the disclosure control on the analytic outcomes. The paper utilises the same studies as Purdam and Elliot to facilitate comparisons of synthetic data utility between different utility metrics. The second paper explores disclosure risk. New metrics for measuring disclosure risk of synthetic data are needed, since reidentification, which has been an integral part of measuring disclosure risk for SDC, is not a meaningful concern for fully synthetic data. The paper develops a method called Differential Correct Attribution Probability (DCAP). Using DCAP, the paper explores the effect of multiple imputation on the disclosure risk of synthetic data. The third paper explores the trade-off between data utility and disclosure risk in synthetic data. Using Genetic Algorithms (GA) as a synthetic data generator, the DCAP score and full contingency tables were set as competing objectives for the GA. The GA synthetic data is then compared to parametric and CART synthetic data to establish it as a feasible synthetic data generator. The final paper explores how different synthetic datasets perform in a series of utility and disclosure risk tests. This paper was an outcome of the Isaac Newton Institute programme on Data Linkage and Anonymisation in 2016, in which the group decided to run a challenge amongst themselves to test various synthetic data generation methods against one another. Overall, I can conclude that synthetic data is a useful tool in terms of data privacy. In terms of data utility, synthetic data performs on par with traditional SDC techniques. Synthetic data may be inherently safer than SDC techniques.
Date of Award: 1 Aug 2021
Original language: English
Awarding Institution
  • The University of Manchester
Supervisors: Mark Elliot & Maria Pampaka

Keywords

  • Disclosure Risk
  • Data Utility
  • Synthetic Data
  • Anonymisation
