## Abstract

National Statistical Institutes are increasingly interested in combining available datasets that cover a wide range of social domains. Statistical matching integrates data sources through a common set of variables when each dataset contains different units from the same target population. A common problem, however, is that matching relies on the assumption that variables observed in different data sources are conditionally independent given the common variables. In this context, an auxiliary dataset in which all the variables are observed jointly can improve the statistical matching by providing information on the correlation structure of variables observed across different datasets. We propose to modify the prediction models estimated from the auxiliary dataset through a calibration step, and show that this improves the outcome of statistical matching in a variety of settings. We evaluate the proposed approach via simulation and an application based on the European Union Statistics on Income and Living Conditions and the Living Costs and Food Survey for the United Kingdom.

## Statement of Significance

National Statistical Institutes are increasingly interested in combining available datasets that cover a wide variety of social domains, such as social exclusion, wellbeing and poverty. Statistical matching approaches based on a common set of variables can be used when the data sources contain different units (e.g., households or persons) that belong to the same target population. However, estimating the relationships among variables observed in different data sources usually requires a conditional independence assumption, which is a common source of problems. In this article, we use an additional auxiliary dataset to obtain the correlation structure of the relevant variables. We propose a calibration step in the prediction models used to estimate the correlation matrices; this step improves the outcome of statistical matching, particularly when there are misspecification errors in the auxiliary dataset.

| Original language | English |
|---|---|
| Journal | Journal of Survey Statistics and Methodology |
| Publication status | Accepted/In press - 28 Nov 2022 |