We are witnessing the era of Data-Oriented Science, in which investigations routinely involve computational data analysis. The research lifecycle has become more elaborate in order to support the sharing and re-use of scientific data. To establish the veracity of shared data, scientific communities aim to systematise 1) the process of analysing data, and 2) the reporting of analyses and results. Scientific workflows are a prominent mechanism for systematising analyses: they encode analyses as automated processes and document process executions with Workflow Provenance. Meanwhile, systematic reporting calls for discipline-specific Experimental Metadata that outlines the context of data analysis, such as the source/reference datasets and community resources used, and the analytical methods and their parameter settings. A natural expectation would be that investigations which adopt a systematic, workflow-based approach to analysis are at an advantage when it comes to reporting. This premise holds only weakly. While workflow provenance supports the streamlined enactment of analyses, as well as their auditability and verifiability, we conjecture that it makes a limited contribution to reporting. This dissertation focuses on eliciting the apparent disconnect between Workflow Provenance and Experimental Metadata, which we term the provenance gap. We identify complexity, mixed granularity, and genericity as the characteristics of workflow provenance that underlie this gap. In response, we develop techniques for provenance abstraction, analysis and annotation. We argue that workflow provenance is accompanied by implicit information that can be made explicit to inform these techniques. Through empirical evidence we show that workflow steps have common functional characteristics, which we capture in a taxonomy of Workflow Motifs. We show how formally defined Graph Transformations can exploit Motifs to identify causes of complexity in workflows and abstract them to structurally simpler forms.
We build on insights from prior research to show how the execution and provenance-collection behaviour of a workflow system can be used to anticipate the granularity characteristics of provenance. We provide declarative anticipatory rules for the static analysis of workflows in the Taverna system. We observe that scientific context is often available in embedded form in data, and argue that data can be lifted to become metadata by discipline-specific metadata extractors. We outline a framework into which extractors can be plugged, and provide operators that encapsulate generic procedures for annotating workflow provenance. We implement our techniques with technology-independent provenance models and showcase their benefit using real-world workflows.
|Date of Award||1 Aug 2016|
- The University of Manchester
|Supervisor||Carole Goble (Supervisor) & Sean Bechhofer (Supervisor)|
- scientific workflows