Data Preparation: A Technological Perspective and Review

Alvaro Fernandes, Martin Koehler, Nikolaos Konstantinou, Pavel Pankin, Norman Paton, Rizos Sakellariou

Research output: Contribution to journal › Article › peer-review


Data analysis often uses data sets that were collected for different purposes. Indeed, new insights are often obtained by combining data sets that were produced independently of each other, for example by combining data from outside an organisation with internal data resources. As a result, there is a need to discover, clean, integrate and restructure data into a form that is suitable for an intended analysis. Data preparation, also known as data wrangling, is the process by which data is transformed from its existing representation into a form that is suitable for analysis. In this paper, we review the state of the art in data preparation by: (i) describing functionalities that are central to data preparation pipelines, specifically profiling, matching, mapping, format transformation and data repair; and (ii) presenting how these capabilities surface in different approaches to data preparation that involve programming, writing workflows, interacting with individual data sets as tables, and automating aspects of the process. These functionalities and approaches are illustrated with reference to a running example that combines open government data with web-extracted real estate data.
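As a rough illustration of the five functionalities named above, the following minimal Python sketch applies each one to a toy scenario. Every record, column name (`price`, `postcode`, `beds`, `deprivation_rank`) and function name here is a hypothetical assumption made for illustration; none is taken from the paper or its running example, and real systems use far more sophisticated techniques for each step.

```python
import re
from collections import Counter

# Hypothetical toy records (assumed for illustration, not from the paper).
listings = [  # stands in for web-extracted real estate data
    {"price": "£250,000", "postcode": "m1 1ae", "beds": "3"},
    {"price": "£180,000", "postcode": "M2 5BQ", "beds": None},
    {"price": "£320,000", "postcode": "M1 1AE", "beds": "3"},
]
area_stats = [  # stands in for open government data
    {"Postcode": "M1 1AE", "deprivation_rank": 12},
    {"Postcode": "M2 5BQ", "deprivation_rank": 87},
]

def profile(rows):
    """Profiling: count missing values per column."""
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                missing[key] += 1
    return dict(missing)

def match_columns(left, right):
    """Matching: pair column names that are equal when case is ignored."""
    right_by_lower = {name.lower(): name for name in right}
    return {
        name: right_by_lower[name.lower()]
        for name in left
        if name.lower() in right_by_lower
    }

def transform(row):
    """Format transformation: normalise price, postcode and bedroom count."""
    return {
        "price": int(re.sub(r"[^\d]", "", row["price"])),
        "postcode": row["postcode"].upper(),
        "beds": int(row["beds"]) if row["beds"] is not None else None,
    }

def repair(rows):
    """Data repair: fill missing bedroom counts with the most common value."""
    known = [r["beds"] for r in rows if r["beds"] is not None]
    default = Counter(known).most_common(1)[0][0]
    return [{**r, "beds": default if r["beds"] is None else r["beds"]} for r in rows]

def map_to_target(rows, stats, correspondence):
    """Mapping: join both sources into one target table via the matched key."""
    key = correspondence["postcode"]
    stats_by_postcode = {row[key]: row for row in stats}
    return [
        {**row, "deprivation_rank": stats_by_postcode[row["postcode"]]["deprivation_rank"]}
        for row in rows
        if row["postcode"] in stats_by_postcode
    ]

missing = profile(listings)
correspondence = match_columns(listings[0].keys(), area_stats[0].keys())
cleaned = repair([transform(r) for r in listings])
prepared = map_to_target(cleaned, area_stats, correspondence)
```

In this sketch the steps are chained by hand, which corresponds loosely to the programming-based approach mentioned above; workflow, table-oriented and automated approaches would expose the same functionalities through different interfaces.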
Original language: English
Journal: SN Computer Science
Publication status: Accepted/In press - 10 Apr 2023


  • data preparation
  • data engineering
  • data wrangling
  • data analysis

