Abstract
Data scientists are tasked with obtaining insights from data. However, suitable data is often not immediately to hand, and there may be many potentially relevant datasets in a data lake or in open data repositories. As a result, data discovery and exploration are necessary, but often time-consuming, steps in a data analysis workflow. Data discovery is the process of identifying datasets that may meet an information need. Data exploration is the process of understanding the properties of candidate datasets and the relationships between them. Data discovery and data exploration often go hand in hand, and benefit from tool support. This paper surveys research areas that can contribute to data discovery and exploration, in particular considering dataset search, data navigation, data annotation and schema inference. For each of these areas, we identify key dimensions that can be used to characterize approaches and the values they can hold, and apply the dimensions to describe and compare prominent results. In addition, by surveying several adjacent areas that are often considered in isolation, we identify recurring techniques and alternative approaches to related challenges, thereby placing results within a wider context than is generally considered.
Original language | English |
---|---|
Journal | ACM Computing Surveys |
Publication status | Published - 4 Oct 2023 |
Keywords
- data search
- data navigation
- data annotation
- data lake
- schema inference
- Information systems
- Data federation tools
- Mediators and data integration
- Data cleaning