In the era of data science, and data-driven science, data sets are shared widely and used for many purposes unforeseen by the original creators of the data. In this context, defects in data sets can have far-reaching consequences, spreading from data set to data set, and affecting the consumers of that data in ways that are hard to predict or quantify. Some form of waste is typically the result. For example, scientists using defective data to propose promising hypotheses for experimentation may waste their limited wet lab resources chasing the wrong experimental targets. Similarly, scarce drug trial resources may be spent testing drugs that have little chance of providing a cure. Because of this, database owners care about providing high-quality data. Automated curation tools can, to an extent, discover and correct some forms of defect. But, in some areas, human curation, performed by highly trained domain experts, is needed to ensure that the data accurately represents our current interpretation of reality. Human curators are expensive, and there is far more curation work to be done than there are curators available to perform it. Tools and techniques are needed to enable the full value to be obtained from the curation effort currently available.
In this paper, we explore one possible approach to maximising the value obtained from human curators, by automatically extracting information about data defects and corrections from the work that the curators do. This information is packaged in a source-independent form, so that the owners of other databases (for which human curation effort is unavailable or insufficient) can use it to determine whether the same defects are present in their own data. This amplifies the efforts of the human curators, allowing their work to be applied to other sources, without requiring any additional effort or change in their processes or tool sets. We show that this approach can discover significant numbers of defects, which can also be found in other sources.
- Data Quality
- Data Curation
- Scientific Data
- Big Data