AbstractThis work concerns the evaluation of statistical disclosure control risk by adopting the position of the data intruder. The underlying assertion is that risk metrics should be based on the actual inferences that an intruder can make. Ideally metrics would also take into account how sensitive the inferences would be, but that is subjective. A parallel theme is that of the knowledgeable data intruder; an intruder who has the technical skills to maximally exploit the information contained in released data. This also raises the issue of computational costs and the benefits of using good algorithms.A metric for attribution risk in tabular data is presented. It addresses the issue that most measures for tabular data are based on the risk of identification. The metric can also take into account assumed levels of intruder knowledge regarding the population, and it can be applied to both exact and perturbed collections of tables.An improved implementation of the Key Variable Mapping System (Elliot, et al., 2010) is presented. The problem is more precisely defined in terms of categorical variables rather than responses to survey questions. This allows much more efficient algorithms to be developed, leading to significant performance increases.The advantages and disadvantages of alternative matching strategies are investigated. Some are shown to dominate others. The costs of searching for a match are also considered, providing insight into how a knowledgeable intruder might tailor a strategy to balance the probability of a correct match and the time and effort required to find a match.A novel approach to model determination in decomposable graphical models is described. It offers purely computational advantages over existing schemes, but allows data sets to be more thoroughly checked for disclosure risk.It is shown that a Bayesian strategy for matching between a sample and a population offers much higher probabilities of a correct match than traditional strategies would suggest.The Special Uniques Detection Algorithm (Elliot et al., 2002) (Manning et al., 2008), for identifying risky sample counts of 1, is compared against Bayesian (using Markov Chain Monte Carlo and simulated annealing) alternatives. It is shown that the alternatives are better at identifying risky sample uniques, and can do so with reduced computational costs.
|Date of Award
|1 Aug 2012
|Mark Elliot (Supervisor) & Iain Buchan (Supervisor)