Amplifying Data Curation Efforts to Improve the Quality of Life Science Data

    Research output: Contribution to journal › Article › peer-review



    In the era of data science, and data-driven science, data sets are shared widely and used for many purposes unforeseen by the original creators of the data. In this context, defects in data sets can have far-reaching consequences, spreading from data set to data set, and affecting the consumers of that data in ways that are hard to predict or quantify. Some form of waste is typically the result. For example, scientists using defective data to propose promising hypotheses for experimentation may waste their limited wet lab resources chasing the wrong experimental targets. Scarce drug trial resources may be used to test drugs that actually have little chance of giving a cure. Because of this, database owners care about providing high quality data. Automated curation tools can be used to an extent to discover and correct some forms of defect. But, in some areas, human curation, performed by highly trained domain experts, is needed to ensure that the data represents our current interpretation of reality accurately. Human curators are expensive, and there is far more curation work to be done than there are curators available to perform it. Tools and techniques are needed to enable the full value to be obtained from the curation effort currently available.
    In this paper, we explore one possible approach to maximising the value obtained from human curators, by automatically extracting information about data defects and corrections from the work that the curators do. This information is packaged in a source-independent form, to allow it to be used by the owners of other databases (for which human curation effort is not available or is insufficient) to find out whether the same defects are present in their data. This amplifies the efforts of the human curators, allowing their work to be applied to other sources, without requiring any additional effort or change in their processes or tool sets. We show that this approach can discover significant numbers of defects, which can also be found in other sources.
    Original language: English
    Article number: 1
    Pages (from-to): 1-12
    Number of pages: 12
    Journal: International Journal of Digital Curation
    Publication status: Published - 16 Sept 2017
    Event: 12th International Digital Curation Conference - Edinburgh, United Kingdom
    Duration: 20 Feb 2017 to 23 Feb 2017


    • Data Quality
    • Data Curation
    • Scientific Data
    • Big Data


    • On the Feasibility of Crawling Linked Data Sets for Reusable Defect Corrections.

      Sampaio, S., Knuth, M. (ed.), Kontokostas, D. (ed.) & Sack, H. (ed.), 2 Sept 2014, Proceedings of the 1st Workshop on Linked Data Quality co-located with 10th International Conference on Semantic Systems, LDQ@SEMANTiCS 2014. Knuth, M., Kontokostas, D. & Sack, H. (eds.). RWTH Aachen University

      Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

