Abstract
Data wrangling is the process whereby data is cleaned and
integrated for analysis. Data wrangling, even with tool support, is typically
a labour-intensive process. One aspect of data wrangling involves
carrying out format transformations on attribute values, for example
so that names or phone numbers are represented consistently. Recent
research has developed techniques for synthesising format transformation
programs from examples of the source and target representations.
This is valuable, but still requires a user to provide suitable examples,
something that may be challenging in applications in which there are
huge data sets or numerous data sources. In this paper we investigate
the automatic discovery of examples that can be used to synthesise format
transformation programs. In particular, we propose an approach to
identifying candidate data examples and validating the transformations
that are synthesised from them. The approach is evaluated empirically
using data sets from open government data.
Original language | English |
---|---|
Title of host publication | Data Analytics - 31st British International Conference on Databases, BICOD 2017, London, UK, July 10-12, 2017, Proceedings. |
Editors | Andrea Cali, Peter Wood, Nigel Martin, Alexandra Poulovassilis |
Publisher | Springer Nature |
Pages | 36-48 |
Number of pages | 13 |
ISBN (Print) | 978-3-319-60794-8 |
Publication status | Published - 2017 |
Publication series
Name | Lecture Notes in Computer Science |
---|---|
Volume | 10375 |