Federated Learning (FL) is a decentralized approach to statistical model training in which training is performed across multiple clients to produce a global model. It is suited to settings where multiple sites each hold data, but no single site has enough to achieve the required statistical power, and the sites cannot share their data for legal, commercial or ethical reasons. A paradigm case is randomized controlled trials for rare diseases. With FL, training data stays with each local client and is never shared or exchanged with other clients, so FL can reduce privacy and security risks (compared with methods that pool multiple data sources) while addressing data access and heterogeneity problems. This study explores the feasibility of using FL to generate synthetic microdata, allowing multiple organizations to contribute to the construction of combined synthetic datasets (possibly for wider release) without the need to share or distribute their own data. The primary question is whether synthetic data of sufficient quality can be produced in principle; the study focuses on this as a proof of concept before going on to discuss risk measurement. The results show that the approach is feasible and, crucially, that in the main experiment the synthetic datasets represented the full population better than random samples of that population did. However, the experiments use toy datasets, and the next step is to scale up the dataset size.
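The train-locally/aggregate-globally loop that the abstract describes can be sketched as a minimal FedAvg-style round. This is an illustrative sketch only, not the study's actual setup: the linear model, the function names (`local_update`, `fedavg`), and the two synthetic "sites" are all assumptions made for the example.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    # A few local gradient-descent steps on squared error.
    # The raw data (X, y) never leaves the client; only w is returned.
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def fedavg(client_data, rounds=20, dim=2):
    # The server holds the global model and only ever sees model weights.
    w = np.zeros(dim)
    for _ in range(rounds):
        updates = [local_update(w.copy(), X, y) for X, y in client_data]
        sizes = np.array([len(y) for _, y in client_data])
        # FedAvg aggregation: size-weighted average of the client models.
        w = np.average(updates, axis=0, weights=sizes)
    return w

# Two hypothetical "sites" drawn from the same model y = 3*x0 - 2*x1.
rng = np.random.default_rng(0)
true_w = np.array([3.0, -2.0])
clients = []
for n in (40, 60):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.01, size=n)))

w_global = fedavg(clients)
print(w_global)  # should be close to true_w
```

Here neither site could estimate the model from a pooled dataset (pooling is ruled out), yet the averaged global model recovers the underlying relationship; the study applies the same pattern to training synthetic-data generators rather than a regression.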
|Title of host publication
|UNECE Expert Meeting on Statistical Data Confidentiality 2023
|Accepted/In press - 1 Sept 2023