Abstract
One of the challenges in data analysis is the substantial cost of human involvement. Before any analysis can take place, data from heterogeneous sources needs to be cleaned, integrated and transformed into a uniform format. This task, also known as "data wrangling", often requires both technical skills and knowledge from domain experts. Because the effort expended during data wrangling, including format transformation, is usually task-dependent and often tailored to specific sources, it gives rise to a repetitive, time-consuming and labour-intensive process. Current tools support data scientists in conducting wrangling steps, such as the creation of format transformation rules, but the problem of iterative manual work to inform the creation of such rules remains. We propose an approach that observes the actions of data scientists at work correcting errors in a query result. Specifically, we aim to extract format transformation examples from manual corrections carried out by data scientists, which can be used to synthesize format transformation programs. In so doing, the objective is to re-use information about recurring manual corrections to automate subsequent transformations. In this paper, we propose example generation and filtering techniques for extracting format transformation examples from manual corrections, and evaluate the techniques empirically on a variety of format transformation tasks.
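As a rough illustration of the idea described in the abstract, the following minimal Python sketch pairs each manually corrected cell of a query result with its original value and then filters the pairs. It is not the authors' implementation: the function names (`extract_examples`, `looks_like_format_change`, `filter_examples`), the row representation and the sample data are all hypothetical, and the token-overlap filter is an assumed heuristic standing in for the filtering techniques the paper proposes.

```python
import re
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]
Example = Tuple[str, str, str]  # (column, original value, corrected value)


def extract_examples(before: List[Row], after: List[Row]) -> List[Example]:
    """Pair every manually changed cell with its corrected value."""
    examples: List[Example] = []
    for old_row, new_row in zip(before, after):
        for column, old_value in old_row.items():
            new_value = new_row.get(column, old_value)
            if new_value != old_value:
                examples.append((column, old_value, new_value))
    return examples


def tokens(value: str) -> Set[str]:
    """Lower-cased alphanumeric tokens of a cell value."""
    return set(re.findall(r"[a-z0-9]+", value.lower()))


def looks_like_format_change(old_value: str, new_value: str) -> bool:
    """Assumed heuristic: a format change reuses at least one token of the
    original value, whereas a value correction usually replaces it."""
    return bool(tokens(old_value) & tokens(new_value))


def filter_examples(examples: List[Example]) -> List[Example]:
    """Keep only the pairs that look like format transformations."""
    return [e for e in examples if looks_like_format_change(e[1], e[2])]


# Hypothetical query result before and after a data scientist's corrections.
before = [
    {"name": "Smith, John", "date": "05/09/2018", "price": "12"},
    {"name": "Doe, Jane", "date": "2018-09-02", "price": "15"},
]
after = [
    {"name": "John Smith", "date": "2018-09-05", "price": "12"},
    {"name": "Doe, Jane", "date": "2018-09-02", "price": "18"},
]

candidates = extract_examples(before, after)
kept = filter_examples(candidates)
print(kept)
```

In this sketch the name and date corrections survive as candidate input-output examples for a program synthesizer, while the price change is discarded because the corrected value shares no token with the original. A real filter would need to be considerably more careful than this.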
Original language | English |
---|---|
Title of host publication | New Trends in Databases and Information Systems |
Subtitle of host publication | ADBIS 2018 Short Papers and Workshops, AI*QA, BIGPMED, CSACDB, M2U, BigDataMAPS, ISTREND, DC, Budapest, Hungary, September, 2-5, 2018, Proceedings |
Publisher | Springer Nature |
Chapter | 1 |
Pages | 3-11 |
Number of pages | 8 |
Volume | 909 |
ISBN (Electronic) | 978-3-030-00063-9 |
ISBN (Print) | 978-3-030-00062-2 |
DOIs | |
Publication status | Published - 2018 |
Keywords
- Data wrangling
- Format transformation
- Data integration
- User feedback
- Implicit feedback