Mining Biodiversity: Enriching Biodiversity Heritage with Text Mining and Social Media

Impact: Society and culture, Technological

Narrative

The Biodiversity Heritage Library (BHL) stands as the world’s largest digital library of biodiversity literature, encompassing nearly 40 million pages of taxonomic literature. Although its keyword-matching search interface can be used to explore this vast collection, keyword matching alone is not always sufficient to resolve ambiguity. Challenges such as insufficient searchable content, the indexing of taxonomic names, and the creation of informative metadata often limit the availability and accessibility of collections to users. Thus, like many libraries around the world, the BHL faces difficulties in sharing its biodiversity knowledge.

The Mining Biodiversity project, a collaboration between the University of Manchester’s National Centre for Text Mining (UK), Missouri Botanical Garden (US), Dalhousie University’s Big Data Analytics Institute (Canada) and Toronto Metropolitan University’s Social Media Lab (Canada), improves the discoverability of information in BHL by transforming it into a next-generation social digital resource. The project was funded by AHRC, ESRC, Innovation.ca, the Institute of Museum and Library Services, JISC and NEH.

The project integrates novel text mining (TM) methods, visualisation, crowdsourcing, and social media into the BHL. The team first improved BHL content and enriched it with semantic metadata. Error correction was applied to improve the quality of the optical character recognition (OCR)-generated texts from BHL’s digital legacy documents, both in their original form (such as scanned images) and as plain text. TM techniques were then used to automatically annotate these documents with semantic information. A biodiversity term inventory, a collection of biodiversity-relevant terms, was then built from the BHL documents using the text-processing tools. Incorporating the TM-generated semantic metadata and the biodiversity term inventory enabled the search engine to return results based on semantic similarity. To further facilitate efficient navigation of search results, the search system was also enhanced with visualisation capabilities. Existing social media sites were integrated with the BHL to foster collaborative discussion and curation of biodiversity digital artefacts. The integration has allowed the BHL to reach an even wider audience of scholars and, at the same time, has facilitated the enrichment of the library with community-curated digital media, such as biodiversity-relevant images and video clips.

The resulting digital resource provides fully interlinked and indexed access to the full content of BHL library documents, via semantically enhanced and interactive browsing and searching capabilities, allowing users to precisely locate the information of interest to them easily and efficiently.

This project has instilled a deeper appreciation for text mining in taxonomists, biodiversity informaticians and digital librarians. Taxonomists have started to appreciate the advantages of going beyond keyword and string matching when indexing biodiversity literature. The digital libraries community has benefited from the project through the development of text mining skills. The tools developed as part of this project have been made available in the web-based text mining workbench Argo, which allows non-experts in text mining to build their own text mining solutions by designing workflows in a graphical user interface. This was supported by a tutorial accepted at the TPDL 2016 conference, which served as a knowledge transfer activity aimed at enabling digital librarians to use Argo to enrich their respective textual collections with semantic metadata. Other institutions, namely the German National Library of Science and Technology and the Information Library Complex of Peter the Great St. Petersburg Polytechnic University, have shown interest in using our text mining tools to enrich and extend their libraries. BHL also helps biodiversity content providers contribute to the construction of the national digital platform, which will be hosted by the Digital Public Library of America (DPLA).

Mining Biodiversity was one of the winning projects in the third round of the transatlantic Digging Into Data Challenge in 2013, a competition that aims to promote the development of innovative computational techniques that can be applied to big data in the humanities and social sciences.
Category of impact: Society and culture, Technological
Impact level: Adoption

Research Beacons, Institutes and Platforms

  • Biotechnology
  • Digital Futures
  • Institute for Data Science and AI
  • Manchester Institute of Biotechnology