On assisting scientific data curation in collection-based dataflows using labels

Pinar Alper, Khalid Belhajjame, Carole Goble

    Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review


    Abstract

    Thanks to the proliferation of computational techniques and the availability of datasets, data-intensive research has become commonplace in science. Sharing and re-use of datasets is key to scientific progress. A critical requirement for enabling data re-use is that data be accompanied by lineage metadata describing the context in which the data was produced, the source datasets from which it was derived, and the tooling or settings involved in its generation. By and large, this metadata is provided through a manual curation process, which is tedious, repetitive and time-consuming. In this paper, we explore the problem of curating data artifacts generated from scientific workflows, which have become an established method for organizing computational data analyses. Most workflow systems can be instrumented to gather provenance, i.e., lineage, information about the data artifacts generated as a result of their execution. While this form of raw provenance provides elaborate information on localized lineage traced during a run, in the form of data derivation or activity causality relations, it is of little use when one needs to report on lineage in a broader scientific context. Consequently, datasets resulting from workflow-based analyses also require manual curation prior to publication. We argue that, by making the analysis process explicit, workflow-based investigations provide an opportunity for semi-automating data curation. We introduce a novel approach that semi-automates curation through a special kind of workflow, which we call a Labeling Workflow. Using (1) the description of a scientific workflow, (2) a set of semantic annotations characterizing the data processing in workflows, and (3) a library of label handling functions, we devise a Labeling Workflow, which can be executed over raw provenance in order to curate the data artifacts it refers to. We semi-formally describe the elements of our solution and showcase its usefulness using an example from biodiversity.
    Original language: English
    Title of host publication: Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science
    Publisher: Association for Computing Machinery
    Pages: 7-16
    Number of pages: 10
    ISBN (Print): 9781450325028
    Publication status: Published - 17 Nov 2013
    Event: Workflows in Support of Large-Scale Science
    Duration: 1 Jan 1824 → …

    Conference

    Conference: Workflows in Support of Large-Scale Science
    Period: 1/01/24 → …

    Keywords

    • workflow
    • taverna
    • metadata
    • provenance
