Natural Language Processing over Tables: Enabling Data Exploration on Data Lakes

Student thesis: PhD

Abstract

The prevalence of Big Data, with overwhelming amounts of data being generated at a rapid pace, has exceeded the capabilities of organizations to manage their information resources with traditional data management systems (i.e., data warehouses), making it extremely difficult to leverage schema-on-write paradigms. As an alternative, platforms like data lakes are increasingly gaining attention as repositories that hold unprocessed data in its native format, thereby offering a flexible solution for managing data. Such an environment comes at a cost: it requires alternative techniques that support on-demand exploration and discovery of data, helping users understand the relationships between data sources in a data lake. Simultaneously, the field of Natural Language Processing (NLP), which leverages computational techniques to analyze and process natural language, has achieved outstanding performance across a range of tasks, boosted by deep learning techniques that make its methods transferable to other domains. This thesis investigates the use of NLP techniques in the context of data lakes, with the aim of assisting the exploration of data sources in table-based representations. Given that such data sources can be seen as structured linguistic artifacts that succinctly capture essential attributes and their inter-relationships, we investigate the extent to which NLP techniques can be used to encode table semantics to assist data exploration. More specifically, this thesis makes the following contributions: (i) an interpretable semantic entailment framework for tabular datasets based on Natural Language Inference; (ii) a systematic characterization of existing methods that investigate transformer-based models applied to tabular data; (iii) a methodology to transfer language-understanding knowledge from transformer-based language models to encode schema-level semantics.
The experimental evaluation associated with the techniques presented in this thesis provides empirical evidence for the use of NLP techniques in the area of data exploration and data analysis, aiming to inspire further related research.
Date of Award: 6 Jan 2025
Original language: English
Awarding Institution
  • The University of Manchester
Supervisors: Andre Freitas & Norman Paton

Keywords

  • Data Lakes
  • Transformers
  • NLP
