Propensity score diagnostics: assessing the accuracy of a propensity-adjusted effect estimate

  • Emily Granger

Student thesis: PhD


Propensity scores are commonly used to deal with confounding bias in observational studies by balancing covariate distributions between exposure groups. Poorly estimated propensity scores may not achieve adequate balance, leading to biased effect estimates. It is therefore essential to assess propensity scores using diagnostics. Unfortunately, there is currently no consensus on the best way to do this. The aims of this thesis are therefore to:

  1. Review and compare the performance of propensity score diagnostics;
  2. Where necessary, modify existing diagnostics to improve their performance;
  3. Produce guidelines on how best to build and assess propensity score models.

Diagnostics were categorised as either individual or overall: individual diagnostics assess balance in each covariate separately, whereas overall diagnostics assess the overall balance achieved by the propensity scores. This thesis presents a series of simulation studies comparing diagnostics within the same category.

Individual diagnostics were compared in terms of their ability to identify different types of model misspecification. Results indicated that diagnostics which work by comparing covariate means (e.g. standardised mean differences) can fail to identify when non-linear terms are misspecified in the propensity score model, whereas diagnostics which compare entire distributions (e.g. the Kolmogorov-Smirnov statistic) performed worst in small sample sizes. The best-performing individual diagnostics are new, and involve comparing the number of exposed subjects at each covariate value to the number predicted by the propensity score.

Overall diagnostics were compared in terms of their correlation with bias in the propensity-adjusted effect estimate. The simulated scenarios varied the type of baseline covariates and the associations between them, the amount of random error in the outcome model, the type of outcome, and the associations between baseline covariates and the outcome.
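The contrast between mean-based and distribution-based individual diagnostics can be sketched in a few lines of Python. This is an illustrative toy example, not code from the thesis; the function names and data are mine, and the toy covariate is constructed so that the two exposure groups have equal means but different spread:

```python
import math
from statistics import mean, variance

def standardised_mean_difference(treated, control):
    """Absolute difference in means, scaled by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return abs(mean(treated) - mean(control)) / pooled_sd

def ks_statistic(treated, control):
    """Largest vertical gap between the two empirical CDFs."""
    values = sorted(set(treated) | set(control))
    ecdf = lambda sample, x: sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(treated, x) - ecdf(control, x)) for x in values)

# Equal means, unequal spread: the SMD sees perfect balance,
# while the KS statistic still flags the distributional imbalance.
treated = [-2.0, -1.0, 0.0, 1.0, 2.0]
control = [-0.5, -0.25, 0.0, 0.25, 0.5]
print(standardised_mean_difference(treated, control))  # 0.0
print(ks_statistic(treated, control))                  # 0.4
```

This is the kind of failure mode described above: a diagnostic that compares only covariate means reports balance even when the distributions clearly differ, for example when a misspecified non-linear term leaves higher moments imbalanced.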
On average, the best-performing overall diagnostic was the standardised mean difference in prognostic scores (i.e. predicted outcomes under the control condition). The main limitation of prognostic scores is that their performance depended on how well the outcome model was specified. Prognostic scores are usually estimated using linear regression; I investigated a modified version in which non-parametric methods were used instead. Results indicated that this modification improved the performance of prognostic scores when there were strong non-linear or non-additive covariate-outcome associations. Findings from the simulation studies were used to develop a five-step procedure for building and assessing propensity score models using a combination of individual and overall diagnostics. An example of its implementation is given using data from the British Society for Rheumatology Biologics Register.
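The prognostic-score diagnostic described above can be sketched as follows: fit an outcome model in the control arm only, predict the outcome for everyone, and compute the standardised mean difference of those predictions between exposure groups. This minimal sketch uses the usual linear-regression version with a single covariate (the thesis also investigates a non-parametric variant); the function names and data are illustrative assumptions, not the thesis's code:

```python
import math
from statistics import mean, variance

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for one covariate (control arm only)."""
    xbar, ybar = mean(x), mean(y)
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return ybar - slope * xbar, slope

def prognostic_score_smd(x_treated, x_control, y_control):
    """SMD of predicted control-arm outcomes between exposure groups."""
    intercept, slope = fit_simple_ols(x_control, y_control)
    ps_t = [intercept + slope * xi for xi in x_treated]   # prognostic scores, treated
    ps_c = [intercept + slope * xi for xi in x_control]   # prognostic scores, control
    pooled_sd = math.sqrt((variance(ps_t) + variance(ps_c)) / 2)
    return abs(mean(ps_t) - mean(ps_c)) / pooled_sd

# Toy data: treated subjects sit at higher covariate values, so their
# predicted control-arm outcomes differ and the diagnostic flags imbalance.
smd = prognostic_score_smd([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.0, 3.0],
                           [1.0, 3.0, 5.0, 7.0])
print(round(smd, 3))
```

Because the prognostic score collapses all covariates into a single predicted outcome, imbalance in this one scalar can serve as an overall diagnostic; the dependence on correct outcome-model specification noted above is visible here in the choice of `fit_simple_ols`.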
Date of Award: 1 Aug 2020
Original language: English
Awarding Institution:
  • The University of Manchester
Supervisors: Mark Lunt (Supervisor) & Jamie Sergeant (Supervisor)


  • simulation study
  • propensity scores
  • balance diagnostics
  • observational data
  • confounding
