Dataset Discovery and Exploration: A Survey

Research output: Contribution to journal (article, peer-reviewed)

Abstract

Data scientists are tasked with obtaining insights from data. However, suitable data is often not immediately to hand, and there may be many potentially relevant datasets in a data lake or in open data repositories. As a result, data discovery and exploration are necessary, but often time-consuming, steps in a data analysis workflow. Data discovery is the process of identifying datasets that may meet an information need. Data exploration is the process of understanding the properties of candidate datasets and the relationships between them. Data discovery and data exploration often go hand in hand, and both benefit from tool support. This paper surveys research areas that can contribute to data discovery and exploration, in particular dataset search, data navigation, data annotation and schema inference. For each of these areas, we identify key dimensions that can be used to characterize approaches and the values they can hold, and apply the dimensions to describe and compare prominent results. In addition, by surveying several adjacent areas that are often considered in isolation, we identify recurring techniques and alternative approaches to related challenges, thereby placing results within a wider context than is generally considered.
Original language: English
Journal: ACM Computing Surveys
Publication status: Published - 4 Oct 2023

Keywords

  • data search
  • data navigation
  • data annotation
  • data lake
  • schema inference
  • Information systems
  • Data federation tools
  • Mediators and data integration
  • Data cleaning
