Exploiting Unlabelled Data for Relation Extraction

  • Thy Tran

Student thesis: PhD


Information extraction transforms unstructured text into structured data by annotating raw text with semantic information. A crucial step in information extraction is relation extraction, which identifies semantic relationships between named entities in text. The resulting relations can be used to construct and populate knowledge bases, and in applications such as information retrieval and question answering. Relation extraction has been widely studied using fully supervised and distantly supervised approaches, which require manually or automatically annotated data, respectively. In contrast, the massive amount of freely available unlabelled text is underused. We hence focus on leveraging unlabelled data to improve and extend relation extraction. We approach the use of unlabelled text from three directions: (i) pre-training word representations, (ii) unsupervised learning, and (iii) weak supervision.
In the first direction, we want to leverage syntactic information for relation extraction. Instead of directly tuning such information on a relation extraction corpus, we propose a novel graph neural model for learning syntactically-informed word representations. The proposed method allows us to enrich pretrained word representations with syntactic information rather than re-training language models from scratch, as previous work does. Through this work, we confirm that our novel representations are beneficial for relation extraction in two different domains.
In the second direction, we study unsupervised relation extraction, a promising approach because it requires neither manually nor automatically labelled data. We hypothesise that inductive biases are extremely important for guiding unsupervised relation extraction. We hence employ two simple methods that use only entity types to infer relations. Despite their simplicity, our methods outperform existing approaches on two popular datasets. These surprising results suggest that entity types provide a strong inductive bias for unsupervised relation extraction.
The last direction is inspired by recent evidence that large-scale pretrained language models capture some relational facts. We investigate whether these pretrained language models can serve as weak annotators. To this end, we evaluate three large pretrained language models by matching sentences against relations’ exemplars. The matching scores determine how likely a given sentence is to express a relation. The top-scoring relations are then used as weak annotations to train a relation classifier. We observe that pretrained language models are confused by highly similar relations; we therefore propose a method that models this labelling confusion to correct relation predictions. We validate the proposed method on two datasets with different characteristics, showing that it can effectively model labelling noise from our weak annotator.
Overall, we illustrate that exploring the use of unlabelled data is an important step towards improving relation extraction. It is a promising path that should receive more attention from researchers.
Date of Award: 1 Aug 2021
Original language: English
Awarding Institution
  • The University of Manchester
Supervisor: Sophia Ananiadou (Supervisor) & Riza Theresa Batista-Navarro (Supervisor)


Keywords
  • Unlabelled Data
  • Relation Extraction
