The Specimen Data Refinery: Using a scientific workflow approach for information extraction

Laurence Livermore, Paul Brack, Ben Scott, Stian Soiland-Reyes, Oliver Woolland

Research output: Contribution to journal › Meeting Abstract › peer-review


Over the past three years, we have been developing the Specimen Data Refinery (SDR) to automate the extraction of data from specimen images as part of the SYNTHESYS project (Walton et al. 2020). The SDR provides an easy-to-deploy, open-source, web-based interface to multiple workflows that enable a user to create new natural history specimen records or enhance existing ones. The SDR uses the Galaxy workflow platform as the basis for managing data analysis and, where possible, reuses existing Galaxy community tools and approaches (Jalili et al. 2020, Hardisty et al. 2022). We have developed a library of domain-specific tools, including semantic segmentation, optical character recognition, handwritten text recognition, barcode reading and natural language processing. These tools have been designed to work on standardised images of specimens, specifically herbarium sheets, pinned insects and microscope slides.

In this presentation, we describe our technical approach to developing the SDR, including the Galaxy workflow platform, application deployment, and tool interoperability using FAIR digital objects (e.g., RO-Crates and open Digital Specimen objects; Soiland-Reyes et al. 2022, Addink and Hardisty 2020). We also present an evaluation of the tools, including segmentation and text recognition, and discuss the new challenges of using the resulting data from both technical and social perspectives.
Original language: English
Article number: e93500
Journal: Biodiversity Information Science and Standards (BISS)
Publication status: Published - 23 Aug 2022
Event: Biodiversity Information Standards (TDWG 2022): Stronger Together: Standards for linking biodiversity data - Sofia, Bulgaria
Duration: 17 Oct 2022 to 21 Oct 2022
Conference number: 2022

