TY - JOUR
T1 - The Specimen Data Refinery: A Canonical Workflow Framework and FAIR Digital Object Approach to Speeding up Digital Mobilisation of Natural History Collections
T2 - A Canonical Workflow Framework and FAIR Digital Object Approach to Speeding up Digital Mobilisation of Natural History Collections
AU - Hardisty, Alex
AU - Brack, Paul
AU - Goble, Carole
AU - Livermore, Laurence
AU - Scott, Ben
AU - Groom, Quentin
AU - Owen, Stuart
AU - Soiland-Reyes, Stian
N1 - Funding Information:
This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement numbers 823827 (SYNTHESYS Plus), 871043 (DiSSCo Prepare), 823830 (BioExcel-2), 824087 (EOSC-Life).
Publisher Copyright:
© 2022 Chinese Academy of Sciences.
PY - 2022/4/1
Y1 - 2022/4/1
AB - A key limiting factor in organising and using information from physical specimens curated in natural science collections is making that information computable, with institutional digitization tending to focus more on imaging the specimens themselves than on efficiently capturing computable data about them. Label data are traditionally manually transcribed today with high cost and low throughput, rendering such a task constrained for many collection-holding institutions at current funding levels. We show how computer vision, optical character recognition, handwriting recognition, named entity recognition and language translation technologies can be implemented into canonical workflow component libraries with findable, accessible, interoperable, and reusable (FAIR) characteristics. These libraries are being developed in a cloud-based workflow platform—the ‘Specimen Data Refinery’ (SDR)—founded on Galaxy workflow engine, Common Workflow Language, Research Object Crates (RO-Crate) and WorkflowHub technologies. The SDR can be applied to specimens’ labels and other artefacts, offering the prospect of greatly accelerated and more accurate data capture in computable form. Two kinds of FAIR Digital Objects (FDO) are created by packaging
KW - Digital Specimen
KW - Workflow
KW - FAIR
KW - Digital Object
KW - RO-Crate
UR - http://www.scopus.com/inward/record.url?scp=85129619772&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/02b7d622-dbae-308b-be09-3d7479c80b78/
U2 - 10.1162/dint_a_00134
DO - 10.1162/dint_a_00134
M3 - Article
SN - 2641-435X
VL - 4
SP - 320
EP - 341
JO - Data Intelligence
JF - Data Intelligence
IS - 2
ER -