Investigating Data Quality in Question and Answer Reports

  • Mona Mohamed Zaki Ali

Student thesis: PhD

Abstract

Data Quality (DQ) has been a long-standing concern for a number of stakeholders in a variety of domains. It has become a critically important factor for the effectiveness of organisations and individuals. Previous work on DQ methodologies has mainly focused on either the analysis of structured data or the business-process level rather than on analysing the data itself. Question and Answer Reports (QAR) are gaining momentum as a way to collect responses that can be used by data analysts, for instance, in business, education or healthcare. Various stakeholders, such as data brokers and data providers, benefit from QAR. To effectively analyse and identify the common DQ problems in these reports, the perspectives of these stakeholders should be taken into account, which adds further complexity to the analysis.
This thesis investigates DQ in QAR through an in-depth DQ analysis and provides solutions that can highlight potential sources and causes of problems that result in "low-quality" collected data. The thesis proposes a DQ methodology that is appropriate for the context of QAR. The methodology consists of three modules: question analysis, medium analysis and answer analysis. In addition, a Question Design Support (QuDeS) framework is introduced to operationalise the proposed methodology through the automatic identification of DQ problems. The framework includes three components: question domain-independent profiling, question domain-dependent profiling and answers profiling. The proposed framework has been instantiated to address one example of DQ issues, namely the Multi-Focal Question (MFQ). We introduce MFQ as a question with multiple requirements; it asks for multiple answers. QuDeS-MFQ (the implemented instance of the QuDeS framework) implements two components of QuDeS for MFQ identification: question domain-independent profiling and question domain-dependent profiling.
The proposed methodology and the framework are designed, implemented and evaluated in the context of the Carbon Disclosure Project (CDP) case study. The experiments show that we can identify MFQs with 90% accuracy.
This thesis also demonstrates the challenges involved, including the lack of domain resources for domain knowledge representation (such as a domain ontology), the complexity and variability of the structure of QAR, the variability and ambiguity of terminology and language expressions, and the difficulty of understanding stakeholder or user needs.
Date of Award: 31 Dec 2016
Original language: English
Awarding Institution
  • The University of Manchester
Supervisor: Goran Nenadic (Supervisor) & Babis Theodoulidis (Supervisor)

Keywords

  • data quality methodology, question and answer reports, question and answer questionnaires
  • data quality, data analysis, natural language processing, data mining, text mining
