Experimenting with a Big Data Framework for Scaling a Data Quality Query System

  • Sonia Cisneros Cabrera

Student thesis: Master of Philosophy

Abstract

The work presented in this thesis comprises the design, implementation and evaluation of extensions made to the Data Quality Query System (DQ2S), a state-of-the-art data-quality-aware query processing framework and query language, with the aim of testing and improving its scalability when working with increasing amounts of data. The purpose of the evaluation is to assess to what extent a big data framework, such as Apache Spark, can offer significant gains in performance, including runtime, required amount of memory, processing capacity, and resource utilisation, when running over different environments. DQ2S enables assessing and improving data quality within information management by facilitating profiling of the data in use and, in turn, supporting data cleansing tasks, which represent an important step in the big data life-cycle. Despite this, DQ2S, like the majority of data quality management systems, is not designed to process very large amounts of data. This research describes how data quality extensions from an earlier implementation, which processed two datasets of 50 000 rows each in 397 seconds, were designed, implemented and tested to achieve a big data solution capable of processing 105 000 000 rows in 145 seconds. The thesis provides a detailed account of the experimental journey followed to extend DQ2S in order to explore the capabilities of a popular big data framework (Apache Spark), including the experiments used to measure the scalability and usefulness of the approach. The study also provides a roadmap for researchers interested in re-purposing and porting existing information management systems and tools to explore the capabilities provided by big data frameworks. This is particularly useful because re-purposing and re-writing existing software to work with big data frameworks is less costly and less risky than greenfield engineering of information management systems and tools.
Date of Award: 1 Aug 2017
Original language: English
Awarding Institution
  • The University of Manchester
Supervisor: Sandra Sampaio (Supervisor) & Pedro Sampaio (Supervisor)

Keywords

  • Empirical Evaluation
  • Big Data
  • Data Quality
  • Data Profiling
  • Scalability
