COST-EFFECTIVE DATA WRANGLING IN DATA LAKES

  • Alex Bogatu

Student thesis: PhD

Abstract

Data analytics stands to benefit from the increased availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data preparation, also known as data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness and heterogeneity of data lakes, obtaining value from such targets often requires significant prior effort in preparing the data for analysis. For example, data wrangling is reported to take as much as 80% of data scientists' time. The question then arises of how to reduce this cost. This thesis investigates what makes data preparation costly and how it can become more cost-effective through automation. Specifically, it addresses two challenges that have been insufficiently covered by the state of the art, viz., how to automatically pull out of the data lake those datasets that might contribute to wrangling out a given target, and how to automatically homogenise the representation of their instance values. We refer to the former as the problem of dataset discovery and to the latter as the problem of format transformation. This thesis contributes effective and efficient solutions to both problems. The work described in this thesis should be of interest to researchers and professionals in the areas of data analysis and data wrangling who, in the process of preparing data for analysis, are confronted with heterogeneously represented data originating from many autonomous sources.
Date of Award: 1 Aug 2020
Original language: English
Awarding Institution
  • The University of Manchester
Supervisors: Norman Paton & Alvaro Fernandes

Keywords

  • data wrangling
  • data preparation
  • data discovery
  • format transformation
