A text mining framework for accelerating the semantic curation of literature

Riza Batista-Navarro, Jennifer Hammock, William Ulate, Sophia Ananiadou

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review


Abstract

The Biodiversity Heritage Library is the world’s largest digital library of biodiversity literature. Currently containing almost 40 million pages, the library can be explored with a search interface employing keyword-matching, which unfortunately fails to address issues brought about by ambiguity. Helping alleviate these issues are tools that automatically attach semantic metadata to documents, e.g., biodiversity concept recognisers. However, gold standard, semantically annotated textual corpora are critical for the development of these advanced tools. In the biodiversity domain, such corpora are almost non-existent, especially since the construction of semantically annotated resources is typically a time-consuming and laborious process. Aiming to accelerate the development of a corpus of biodiversity documents, we propose a text mining framework that hastens curation through an iterative feedback-loop process of (1) manual annotation, and (2) training and application of statistical concept recognition models. Even after only a few iterations, our curators were observed to have spent less time and effort on annotation.

Original language: English
Title of host publication: Research and Advanced Technology for Digital Libraries
Subtitle of host publication: 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Hannover, Germany, September 5-9, 2016, Proceedings
Publisher: Springer Nature
Pages: 459-462
Number of pages: 4
ISBN (Print): 9783319439969
Publication status: Published - 10 Aug 2016
Event: 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016 - Hannover, Germany
Duration: 5 Sept 2016 to 9 Sept 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9819
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016
Country/Territory: Germany
City: Hannover
Period: 5/09/16 to 9/09/16
