Natural Language Inference over Tables: Enabling Explainable Data Exploration on Data Lakes

Mario Ramirez, Alex Bogatu, Norman Paton, Andre Freitas

Research output: Chapter in Book/Conference proceeding; Conference contribution; peer-review

Abstract

Data lakes are repositories of data with potential for analysis. Data lakes aim to liberate data from silos, thereby enabling cross-cutting analyses that were hitherto out of reach. This gives rise to significant challenges for data scientists, who must first discover which data sets may be relevant to the task in hand. Given a data set of interest, several proposals have been made for indexing schemes that can identify related data sets. However, such schemes tend to build on similarity metrics that stop short of providing a clear explanation as to how an identified data set relates to a provided target. We address this problem by applying Natural Language Inference (NLI) to provide explanations as to how the attributes of discovered data sets relate to those of the target, in terms of a collection of semantic relations. We provide two approaches to inferring semantic relations: (a) by performing unsupervised intensional and extensional analysis of the data sources using Natural Language Processing techniques; and (b) by performing supervised learning of semantic relations by applying BERT over source schema information. The contributions of this paper are: an NLI strategy for providing explicit characterisation of semantic relations between data sets; two approaches to inferring the semantic relations; and an empirical evaluation of the approaches using open government data.
Original language: English
Title of host publication: European Semantic Web Conference
Publication status: Accepted/In press - 23 Feb 2021
