Abstract
Data wrangling (DW) is the complex process of preparing raw data for analysis. It is typically performed as a craft in an ad hoc manner, and its complexity and success depend heavily on the data analysis task at hand, the quality of the input data, and the skill set of the data analyst. This makes the reuse of DW pipelines difficult, often forcing data analysts to devise a new pipeline for each combination of data and analysis task, a process that is not only complex but also expensive, consuming between 50 and 80% of even the most experienced data analyst’s time. In this paper, we investigate a number of DW pipelines in the form of workflows to find commonalities or patterns in the way DW is performed in practice, considering a multitude of data analysis tasks and data sets, devised by data analysts with varying levels of experience.
We present our investigation as a methodology that, from selection of workflow sources to workflow mining techniques, describes how we dealt with the challenges of finding patterns in the way people prepare data for analysis, given the general lack of guidelines for best practices and standards. The obtained results provide insights into the most commonly used DW operations, solution patterns, redundancies, and opportunities not only for optimisation but also for the reuse of experience and best practices in data engineering. We believe that the obtained insights can help inexperienced data analysts construct DW solutions by reusing patterns and best practices in DW.
Original language | English |
---|---|
Title of host publication | CEUR Workshop Proceedings |
Subtitle of host publication | DataPlat’23: 2nd International Workshop on Data Platform Design, Management, and Optimization |
Number of pages | 10 |
Publication status | Published - 15 May 2023 |
Event | 2nd International Workshop on Data Platform Design, Management, and Optimization - Grand Serai Congress Hotel, Ioannina, Greece. Duration: 28 Mar 2023 → 28 Mar 2023. https://big.csr.unibo.it/dataplat2023/ |
Workshop
Workshop | 2nd International Workshop on Data Platform Design, Management, and Optimization |
---|---|
Abbreviated title | DATAPLAT 2023 |
Country/Territory | Greece |
City | Ioannina |
Period | 28/03/23 → 28/03/23 |
Internet address |
Keywords
- Data wrangling, workflow patterns, workflow mining, pattern reuse