Exploring Automatic Text Summarisation Methods for the General and Clinical Domains

Student thesis: PhD

Abstract

Automatic Text Summarisation refers to the NLP task of automatically generating a concise and informative summary from a given long input. The two most common approaches to this task are extractive and abstractive summarisation. The extractive approach forms a summary by directly extracting textual segments from the input, while the abstractive approach generates a new sequence of words. The extract-then-abstract approach combines these two approaches sequentially. This thesis aims to address research gaps within these three approaches, in both the general and clinical domains, to enhance model performance.

Firstly, existing extractive approaches in the general domain usually formulate the task as extracting the sentences with the top-k predicted importance, where k is a fixed value. We argue that a more fine-grained segment, the Elementary Discourse Unit (EDU), is a better extractive unit than the sentence, and justify this argument from both theoretical and empirical perspectives. Building on this conclusion, we propose EDU-VL, an EDU-level extractive model that varies k across inputs to accommodate the realistic need for varying summary lengths. Experimental results on five summarisation datasets validate the efficacy of EDU-VL and of its two key components: extracting EDUs and varying k.

Secondly, most extract-then-abstract approaches treat the extractor and abstractor as two independent models and introduce extra learnable parameters to highlight the extractions for the abstractor. This independent training results in repeated document encoding and exposes the abstractor to errors from the extractor. We propose a parameter-free highlight method, the saliency mask, and a novel extract-and-abstract framework, ExtAbs, which unifies the extractor and abstractor within a single encoder-decoder model. Comparative analysis shows the superiority of the saliency mask over existing highlight methods. Experiments on three summarisation datasets and two encoder-decoder models suggest that ExtAbs outperforms strong extractive baselines without sacrificing abstractive performance on two of the datasets.

Thirdly, there is a notable lack of language models pre-trained on clinical corpora with clinical terms explicitly modelled. We propose a domain-specific pre-training objective, masked clinical term prediction, and apply it to pre-train language models with 3B and 11B parameters, named PULSAR. Additionally, existing clinical text augmentation methods often lack generality and diversity. We explore general Large Language Model-aided data augmentation methods that alleviate the scarcity of clinical data by paraphrasing original data while preserving key clinical terms, or by deducing input data from a summary. Experimental results indicate mixed efficacy of the proposed pre-training objective and data augmentation methods across four datasets, covering clinical note, conversation, and question summarisation tasks.
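The varying-k idea in EDU-VL can be illustrated with a minimal sketch: instead of always keeping a fixed top-k, the number of extracted EDUs is allowed to differ per input. The function name, the thresholding rule, and the fallback behaviour below are illustrative assumptions, not the thesis's actual selection procedure.

```python
# Illustrative sketch (assumed names and rule, not the thesis implementation):
# keep every EDU whose predicted importance clears a threshold, so short
# inputs can yield fewer extracted EDUs than long ones.

def select_edus(scores, threshold=0.5, min_k=1):
    """Return indices of EDUs whose predicted importance >= threshold,
    falling back to the top-scoring EDUs if none qualify."""
    chosen = [i for i, s in enumerate(scores) if s >= threshold]
    if len(chosen) < min_k:
        # Fall back to the best-scoring EDUs so the summary is never empty.
        chosen = sorted(range(len(scores)),
                        key=lambda i: scores[i], reverse=True)[:min_k]
    return sorted(chosen)

print(select_edus([0.9, 0.2, 0.7]))  # -> [0, 2]  (two EDUs extracted)
print(select_edus([0.1, 0.2, 0.3]))  # -> [2]    (fallback to one EDU)
```

The contrast with a fixed-k extractor is that the same threshold yields different k for different score distributions.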
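The "parameter-free" aspect of a saliency mask can be sketched as a binary vector over encoder positions: extracted spans are marked 1 and everything else 0, so the decoder can be biased toward salient tokens without any extra learned parameters. The span representation and function below are assumptions for illustration only.

```python
# Illustrative sketch (assumed interface): build a binary mask over encoder
# token positions marking the extractor's chosen spans. Such a mask can be
# combined with cross-attention without introducing new parameters.

def saliency_mask(num_tokens, extracted_spans):
    """Return a 0/1 list of length num_tokens; positions inside any
    extracted [start, end) span are set to 1."""
    mask = [0] * num_tokens
    for start, end in extracted_spans:
        for i in range(start, min(end, num_tokens)):
            mask[i] = 1
    return mask

print(saliency_mask(8, [(0, 2), (5, 7)]))  # -> [1, 1, 0, 0, 0, 1, 1, 0]
```

Because the mask is derived directly from the extractor's output, highlighting adds no trainable weights, unlike embedding- or tagging-based highlight methods.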
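Masked clinical term prediction can likewise be sketched at the data level: rather than masking random tokens, tokens belonging to known clinical terms are masked so the model must predict them. The tokenisation, mask symbol, and masking rate below are illustrative assumptions, not PULSAR's actual pre-processing.

```python
# Illustrative sketch (assumed pre-processing, not PULSAR's actual pipeline):
# mask tokens that match a clinical-term vocabulary instead of random tokens.
import random

MASK = "[MASK]"

def mask_clinical_terms(tokens, clinical_terms, rate=1.0, seed=0):
    """Replace clinical-term tokens with [MASK] at the given rate,
    leaving non-clinical tokens untouched."""
    rng = random.Random(seed)
    terms = set(clinical_terms)
    return [MASK if t in terms and rng.random() < rate else t
            for t in tokens]

tokens = "patient denies chest pain or dyspnoea".split()
print(mask_clinical_terms(tokens, {"chest", "pain", "dyspnoea"}))
# -> ['patient', 'denies', '[MASK]', '[MASK]', 'or', '[MASK]']
```

The same term-aware idea underlies the augmentation direction too: paraphrases are useful only if key clinical terms survive the rewrite.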
Date of Award: 10 Jan 2025
Original language: English
Awarding Institution
  • The University of Manchester
Supervisors: Xiaojun Zeng (Main Supervisor) & Goran Nenadic (Co-Supervisor)

Keywords

  • Natural Language Processing
  • Automatic Text Summarisation
