A PROBABILISTIC APPROACH TO UNCERTAINTY QUANTIFICATION IN PAY-AS-YOU-GO DATA INTEGRATION

  • Fernando Rene Sanchez Serrano

Student thesis: PhD

Abstract

The use of Web standards, compact publication guidelines, and open data initiatives has motivated many public and private organisations to publish data on the Web, giving rise to a global data space. Consuming data from heterogeneous data sources published on the Web requires integration at scale. The pay-as-you-go approach to data integration (PAYG) addresses integration at scale by relying on automatic techniques to provide candidate integrations. This heavy reliance on automatic techniques gives rise to uncertainty, which may arise in, and propagate through, every task in the life cycle of a PAYG approach, and whose effect may be manifested in the quality of an automatically generated integration. Quantifying the uncertainty in the outcomes of a bootstrapped integration is a crucial task: it helps in understanding the decisions made by the automatic algorithms and in reducing that uncertainty, which can ultimately improve the quality of an integration.
In this thesis, we address the issue of quantifying the uncertainty that arises during the bootstrapping phase of PAYG in the context of Dataspaces. In particular, two approaches are proposed: (i) an approach to quantify the uncertainty in mapping generation using internal evidence; (ii) an approach to quantify the uncertainty in the quality of an entire integration using user feedback in a pay-as-you-go manner.
More specifically, this thesis makes the following contributions: (i) a principled methodology to derive degrees of belief in mappings that builds on Bayesian inference to assimilate evidence in the form of fitness scores associated with mappings during mapping generation; (ii) a novel methodology to quantify the uncertainty in the quality of an entire integration by assimilating user feedback on tuple results; (iii) an experimental evaluation of the proposed techniques on a real-world integration scenario.
The experimental evaluation of the techniques presented in this dissertation provides empirical evidence of their cost-effectiveness, when applied in synthetic and real-world scenarios, in quantifying the quality of a pay-as-you-go data integration.
Date of Award: 1 Aug 2019
Original language: English
Awarding Institution
  • The University of Manchester
Supervisors: Norman Paton & Alvaro Fernandes

Keywords

  • UNCERTAINTY QUANTIFICATION
  • PROBABILISTIC APPROACH
  • DATA INTEGRATION
  • PAY AS YOU GO
