• Ruhaila Maskat

Student thesis: PhD


With the growing demand for information in various domains, sharing information from heterogeneous data sources is now a necessity. Data integration approaches promise to combine data from these different sources and present the user with a single, unified view of the data. However, although these approaches offer high-quality services for managing and integrating data, they come at a high cost, because setting up a data integration system requires a great deal of manual effort to form relationships across data sources. A newer variant of data integration, known as dataspaces, aims to spread this large up-front manual effort across the system's remaining phases. This is achieved by soliciting user feedback on a chosen artefact of the dataspace, either explicitly or implicitly. This practice is known as pay-as-you-go: a user continuously "pays" the data integration system by providing feedback, in return for improvements in the quality of data integration. This PhD addresses two challenges in data integration using pay-as-you-go approaches. The first is to identify instances relevant to a user's information need, which calls for semantic mappings to be closely considered. Our contribution is a technique that ranks mappings with the help of implicit user feedback (i.e., terms found in query logs). Our evaluation shows that our technique does not require large query logs to produce stable rankings, and that the generated ranking responds satisfactorily to the number of terms inclined towards a particular data source, which we describe as skew. The second challenge is the identification of duplicate instances from disparate data sources. We contribute a strategy that uses explicitly obtained user feedback to drive an evolutionary search algorithm to find suitable parameters for an underlying clustering algorithm. Our experiments show that optimising the algorithm's parameters and introducing attribute weights produces fitter clusters than clustering alone. However, our strategy for improving integration quality can be quite expensive. We therefore propose a pruning technique that selects from a dataset the records that are informative. Our experiment shows that, on most of the datasets, our pruner produces comparably fit clusters when more feedback is received.
Date of Award: 1 Aug 2016
Original language: English
Awarding Institution
  • The University of Manchester
Supervisors: Norman Paton & Suzanne Embury


  • Semantic mapping
  • Data integration
  • Pay-as-you-go
  • Dataspaces
  • Entity resolution
