Mining Data Wrangling Workflows for Patterns, Reuse and Optimisation Opportunities

Research output: Chapter in Book/Conference proceeding > Conference contribution > peer-review

Abstract

Data wrangling (DW) is the complex process of preparing raw data for analysis. It is typically performed as a craft, in an ad hoc manner, and its complexity and success depend heavily on the data analysis task at hand, the quality of the input data, and the skill set of the data analyst. This makes DW pipelines difficult to reuse, often forcing data analysts to devise a new pipeline for each combination of data and analysis task, a process that is not only complex but also expensive, consuming between 50% and 80% of even the most experienced data analyst’s time. In this paper, we investigate a number of DW pipelines, in the form of workflows, to find commonalities or patterns in the way DW is performed in practice. The workflows cover a multitude of data analysis tasks and data sets and were devised by data analysts with varying levels of experience.
We present our investigation as a methodology that, from the selection of workflow sources to workflow mining techniques, describes how we dealt with the challenges of finding patterns in the way people prepare data for analysis, given the general lack of guidelines for best practices and standards. The results provide insights into the most commonly used DW operations, solution patterns, redundancies, optimisation opportunities, and opportunities for reusing experience and best practices in data engineering. We believe that these insights can help inexperienced data analysts construct DW solutions through the reuse of patterns and best practices in DW.
Original language: English
Title of host publication: CEUR Workshop Proceedings
Subtitle of host publication: DataPlat’23: 2nd International Workshop on Data Platform Design, Management, and Optimization
Number of pages: 10
Publication status: Published - 15 May 2023
Event: 2nd International Workshop on Data Platform Design, Management, and Optimization - Grand Serai Congress Hotel, Ioannina, Greece
Duration: 28 Mar 2023 → 28 Mar 2023
https://big.csr.unibo.it/dataplat2023/

Workshop

Workshop: 2nd International Workshop on Data Platform Design, Management, and Optimization
Abbreviated title: DATAPLAT 2023
Country/Territory: Greece
City: Ioannina
Period: 28/03/23 → 28/03/23

Keywords

  • Data wrangling, workflow patterns, workflow mining, pattern reuse
