Handling missing data when validating and deploying clinical prediction models in health settings: an investigation of compatible methods

Student thesis: PhD


The clinical prediction model (CPM) pipeline is the process of taking a model from conception and validation through to its application in clinical settings. Missing data is a key issue at each of these stages. Existing guidance on handling missing data is generally not aligned with the goals of clinical prediction, although emerging research advocates that validation, whether internal or external, should handle missing data in a manner that relies solely on the data used for model development and remains applicable when making predictions for new individual patients. This thesis aims to understand how missing data is handled across the CPM pipeline, and to investigate how predictive performance estimated at validation under different imputation methods compares with performance at deployment. The objectives supporting this aim are: to describe CPMs used in UK healthcare with respect to missing data handling across their pipelines; to determine whether the approaches used at each stage are consistent; to quantify, through simulation, the bias in estimated model performance under incompatible combinations of methods; to explore the idea of incompatibility further, using models from cardio-thoracic surgery as real-world exemplars, to assess whether our findings generalise beyond synthetic data; and to develop guidelines for reporting and handling missing data in CPMs, for use by both applied and methodological researchers. We finish by setting out an agenda for future research. We find that whether a model allows for missingness at deployment can guide the choice of missing data handling method applied at validation. Matching the validation approach to the one used at model deployment can offer protection against bias. The findings of this thesis lay the foundations for a framework for missing data handling designed specifically for predictive modelling.
Date of Award: 1 Aug 2024
Original language: English
Awarding Institution
  • The University of Manchester
Supervisors: Matthew Sperrin, David Jenkins, Niels Peek & Glen Martin


Keywords
  • clinical prediction models
  • missing data
  • imputation methods
  • multiple imputation
  • simulation
