Probabilistic Approaches for Data Integration in Biomedical Research

  • Alexia Sampri

Student thesis: PhD

Abstract

Data generated by the numerous medical studies conducted worldwide have the potential to benefit the scientific and patient communities by generating new knowledge about health, disease, and treatments. This promise is well recognised by research communities, but many biomedical datasets remain underutilised. To realise their potential, such datasets must be integrated with other existing data to generate large-scale, research-ready data resources. However, datasets are often heterogeneous in content – i.e., they capture different information, or capture overlapping information at different levels of granularity. Traditional approaches for integrating heterogeneous datasets focus on harmonisation: they limit the combined dataset to information that was captured in all original datasets, which can be extremely wasteful. For instance, new biomarkers that were not measured in all original datasets may be left out of the combined dataset, and categorical data may be reduced to two or three levels even though some of the original datasets captured it in much more detail. We have developed new, probabilistic approaches for data integration that reduce content heterogeneity to a missing data problem, which is then resolved with well-established multiple imputation methods. We address three commonly occurring forms of content heterogeneity (i.e., variation in variables, varying granularity of categorical variables, and variation in variable types). For each form of content heterogeneity, we first outline the theoretical solution using probabilistic approaches to data integration. Then, we evaluate the suggested solution through simulation studies. Finally, we illustrate the solution through application to real-world datasets from studies in Systemic Lupus Erythematosus. We also do this for combinations of different forms of content heterogeneity.
The research in this thesis is methodological, but it has clear and direct benefits in application.
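
The core idea of treating a variable that one study did not measure as systematically missing in the combined dataset, and then imputing it several times, can be illustrated with a minimal sketch. This is not the method developed in the thesis: the study data, the variable names (`age`, `crp`), and the helper functions are hypothetical, and the imputation step is a simplified stochastic regression with a single predictor. A full FCS procedure would cycle over all incomplete variables and would also draw the model parameters, which this sketch omits.

```python
import random
import statistics


def stack_datasets(datasets):
    """Union the variables of all datasets; unmeasured variables become None."""
    variables = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name not in variables:
                    variables.append(name)
    return [{name: row.get(name) for name in variables}
            for rows in datasets for row in rows]


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x, plus the residual standard deviation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd


def impute_once(rows, target, predictor, rng):
    """Return a completed copy of rows; missing targets get prediction + noise."""
    observed = [r for r in rows if r[target] is not None]
    a, b, sd = fit_line([r[predictor] for r in observed],
                        [r[target] for r in observed])
    completed = []
    for r in rows:
        r = dict(r)  # copy, so the stacked dataset is left unchanged
        if r[target] is None:
            r[target] = a + b * r[predictor] + rng.gauss(0.0, sd)
        completed.append(r)
    return completed


def multiply_impute(rows, target, predictor, m=5, seed=0):
    """Create m completed datasets that differ in their imputed values."""
    rng = random.Random(seed)
    return [impute_once(rows, target, predictor, rng) for _ in range(m)]


# Hypothetical data: study A measured a biomarker (crp), study B did not.
study_a = [{"age": 30, "crp": 2.1}, {"age": 45, "crp": 3.0},
           {"age": 52, "crp": 3.9}, {"age": 61, "crp": 4.2}]
study_b = [{"age": 38}, {"age": 57}]

combined = stack_datasets([study_a, study_b])
imputed = multiply_impute(combined, target="crp", predictor="age", m=5, seed=1)

# Pool a simple estimate (the mean of crp) by averaging over the m datasets.
pooled_mean = statistics.fmean(
    statistics.fmean(r["crp"] for r in dataset) for dataset in imputed
)
print(combined[4])   # a study B row, with crp systematically missing
print(pooled_mean)
```

Unlike harmonisation, which would drop `crp` because study B never recorded it, the stacked dataset keeps every variable and lets the imputation model carry the uncertainty about the unmeasured values.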
Date of Award: 31 Dec 2022
Original language: English
Awarding Institution
  • The University of Manchester
Supervisors: Philip Couch, Niels Peek & Nophar Geifman

Keywords

  • systematically missing values
  • probabilistic methods
  • multiple imputation
  • mixed type
  • Systemic Lupus Erythematosus
  • FCS
  • data integration
  • content heterogeneity
  • granularity
