Feedback Driven Improvement of Data Preparation Pipelines

Nikolaos Konstantinou, Norman Paton

Research output: Contribution to journal › Article › peer-review


Abstract

Data preparation, whether for populating enterprise data warehouses or as a precursor to more exploratory analyses, is recognised as being laborious, and as a result is a barrier to cost-effective data analysis. Several steps that recur within data preparation pipelines are amenable to automation, but it seems important that automated decisions can be refined in the light of user feedback on data products. There has been significant work on how individual data preparation steps can be refined in the light of feedback. This paper goes further, by proposing an approach in which feedback on the correctness of values in a data product can be used to revise the results of diverse data preparation components. Furthermore, the approach takes into account all the available feedback in determining which actions should be applied to refine the data preparation process. The approach has been implemented to refine the results of matching, mapping and data repair components in the VADA data preparation system, and is evaluated using deep web and open government data sets from the real estate domain. The experiments have shown how the approach enables feedback to be assimilated effectively for use with individual data preparation components, and furthermore that synergies result from applying the feedback to several data preparation components.
Original language: English
Article number: 101480
Journal: Information Systems
Volume: 92
Issue number: 0
Early online date: 6 Dec 2019
Publication status: Published - 6 Dec 2019

Keywords

  • data preparation
  • data wrangling
  • extract transform load
  • dataspace
  • feedback
