Emerging Evaluation Paradigms in Natural Language Understanding: A Case Study in Machine Reading Comprehension

Student thesis: PhD


Question Answering (QA) over unstructured textual data, also referred to as Machine Reading Comprehension (MRC), is advancing at an unprecedented rate. State-of-the-art language models are reported to outperform human-established baselines on multiple benchmarks aimed at evaluating Natural Language Understanding (NLU). Recent work, however, has called this seemingly superb performance into question: training and evaluation data may contain superficial lexical cues which neural networks can learn to exploit in order to achieve high performance on those benchmarks. Evaluating under the conventional machine learning assumptions, by splitting a dataset randomly into a training and an evaluation set, conceals these issues. This creates an opportunity to propose novel evaluation methodologies for MRC: researchers may investigate the quality of MRC training and evaluation data, propose evaluation methodologies that reveal models' dependence on superficial cues, or improve the performance of models optimised on data that may contain such cues. In this thesis we contribute to this developing research field. The specific contributions are outlined as follows: We carry out a literature survey, systematically categorising methods that investigate NLU training data, evaluation methodologies and models. We find that in MRC, as a testbed for NLU, there is a lack of investigation into models' capability to process linguistic phenomena. We propose a qualitative evaluation framework for MRC gold standards with regard to the linguistic and reasoning requirements present in gold standard data, as well as data quality. We find that state-of-the-art MRC gold standards lack challenging linguistic phenomena and reasoning forms, such as words that alter the semantics of the sentences in which they appear. Furthermore, we find that the factual correctness of evaluation data can be influenced by the data generation method.
We devise a methodology that evaluates a capability of interest by observing models' behaviour in reaction to controlled changes in input data. Alongside this, we propose a method to generate synthetic benchmarks, and we evaluate its quality and diversity through comparison with existing corpora, finding that it produces MRC data fit for the intended purpose. We apply this methodology in a large-scale empirical study of the capability of state-of-the-art MRC models to process semantic-altering modifications (SAM), such as "almost" or "nearly", in input data. SAM are interesting in that they can indicate a model's dependence on simplifying cues, because they change the expected answer while preserving a similar lexical surface form. We find that multiple state-of-the-art MRC architectures, optimised on various popular MRC datasets, fail to process SAM correctly. One possible reason we identify is the lack of relevant training examples. This thesis contributes towards an empirically grounded understanding of what current state-of-the-art MRC models are learning and where they still fail, which in turn yields specific proposals for building the next iteration of datasets and model architectures, thereby advancing research in MRC.
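The behavioural-probing idea above can be illustrated with a minimal, hypothetical sketch (the lexical-overlap "model" and the example passage are our own illustration, not taken from the thesis): apply a semantic-altering modifier to a passage, which flips the expected answer, and check whether the model's prediction changes accordingly. A model relying on surface cues will typically answer the same way before and after the modification.

```python
# Hypothetical SAM-style behavioural check (illustrative names and data,
# not from the thesis). A naive lexical-overlap "reader" picks the candidate
# answer whose supporting sentence overlaps most with the question. Because
# the modifier "almost" barely changes the surface form, the heuristic gives
# the same answer before and after the modification, exposing its reliance
# on simplifying lexical cues.

def lexical_overlap_model(question, passage_sentences, candidates):
    """Toy reader: score each candidate by word overlap between the
    question and the sentence containing that candidate."""
    q_words = set(question.lower().split())
    best, best_score = None, -1
    for cand, sent in zip(candidates, passage_sentences):
        score = len(q_words & set(sent.lower().split()))
        if score > best_score:
            best, best_score = cand, score
    return best

original = ["Ann crossed the finish line first.", "Bob came second."]
# SAM: inserting "almost" flips the expected answer from Ann to Bob,
# while the lexical surface form stays almost identical.
modified = ["Ann almost crossed the finish line first.", "Bob came second."]
question = "Who crossed the finish line first?"
candidates = ["Ann", "Bob"]

answer_before = lexical_overlap_model(question, original, candidates)
answer_after = lexical_overlap_model(question, modified, candidates)
# A SAM-robust model should change its answer here; the toy model does not.
print(answer_before, answer_after)
```

The check generalises to any model exposing a predict function: the diagnostic signal is the *difference* in behaviour between the original and modified inputs, not absolute accuracy on either.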
Date of Award: 31 Dec 2021
Original language: English
Awarding Institution
  • The University of Manchester
Supervisors: Goran Nenadic & Riza Theresa Batista-Navarro


  • Machine Reading Comprehension
  • Evaluation
  • Natural Language Understanding
