Abstract
Historical text archives constitute a rich and diverse source of information that is becoming increasingly accessible, owing to large-scale digitisation efforts. Searchable access is typically provided by applying Optical Character Recognition (OCR) software to scanned page images. Often, however, the automatically recognised text contains a large number of errors, since OCR systems are typically optimised for modern documents and can struggle with features of historical documents, including variable print characteristics and archaic vocabulary usage. Low-quality OCR text can reduce the efficiency of search systems over historical archives, particularly semantic systems that are based on the application of sophisticated text mining (TM) techniques. We report on a new OCR correction strategy, customised for historical medical documents. The method combines rule-based correction of regular errors with a medically-tuned spell-checking strategy, whose corrections are guided by information about subject-specific language usage from the publication period of the article to be corrected. The performance of our method compares favourably with other OCR post-correction strategies, improving the word-level accuracy of poor-quality documents by up to 16%.
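The two-stage idea in the abstract (rule-based fixes for regular errors, followed by spell checking guided by period-specific usage) can be illustrated with a minimal sketch. This is not the authors' implementation: the rules, the tiny per-period lexicon, the similarity cutoff, and all function names below are hypothetical, and `difflib` string similarity stands in for whatever spell-checking method the paper actually uses.

```python
import re
from difflib import get_close_matches

# Hypothetical rules for regular OCR errors (illustrative only).
REGULAR_ERROR_RULES = [
    (re.compile("ſ"), "s"),           # long s carried over from old typefaces
    (re.compile(r"\btbe\b"), "the"),  # 'h' commonly misread as 'b'
]

# Hypothetical word frequencies per publication period (illustrative only).
PERIOD_LEXICON = {
    1850: {"the": 5000, "patient": 800, "consumption": 300, "phthisis": 120},
    1950: {"the": 5000, "patient": 900, "tuberculosis": 400},
}


def apply_rules(text):
    """Stage 1: apply every regular-error rule in order."""
    for pattern, replacement in REGULAR_ERROR_RULES:
        text = pattern.sub(replacement, text)
    return text


def correct_word(word, lexicon):
    """Stage 2: replace an unknown word with the most frequent close match."""
    lower = word.lower()
    # Known words and tokens containing non-letters are left untouched.
    if lower in lexicon or not lower.isalpha():
        return word
    candidates = get_close_matches(lower, list(lexicon), n=3, cutoff=0.8)
    if not candidates:
        return word
    # Among similar candidates, prefer the one most used in that period.
    return max(candidates, key=lambda c: lexicon[c])


def correct_text(text, year):
    """Run both stages using the lexicon closest to the publication year."""
    period = min(PERIOD_LEXICON, key=lambda p: abs(p - year))
    lexicon = PERIOD_LEXICON[period]
    return " ".join(correct_word(w, lexicon) for w in apply_rules(text).split())


print(correct_text("tbe patlent had phthifis", 1852))
# prints: the patient had phthisis
```

Selecting the lexicon by publication year is what lets a period-appropriate term such as "phthisis" be recovered for an 1850s article, where a modern general-purpose dictionary would likely not suggest it.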
Original language | English |
---|---|
Title of host publication | 2015 Digital Heritage |
Editors | Gabriele Guidi, Roberto Scopigno, Juan Carlos Torres, Holger Graf |
Publisher | IEEE |
Pages | 35-41 |
Number of pages | 7 |
Volume | 1 |
ISBN (Print) | 978-1-5090-0254-2 |
Publication status | Published - Mar 2016 |
Event | Digital Heritage 2015 - Granada, Spain; Duration: 28 Sept 2015 → 2 Oct 2015 |
Conference
Conference | Digital Heritage 2015 |
---|---|
City | Granada, Spain |
Period | 28/09/15 → 2/10/15 |
Keywords
- OCR correction strategy
- historical document search
- medical history
- spell checking
- text mining
Fingerprint
Dive into the research topics of 'Customised OCR correction for historical medical text'. Together they form a unique fingerprint.
Projects
- 1 Finished
- Mining the History of Medicine.
Ananiadou, S. (PI), McNaught, J. (CoI), Timmermann, C. (CoI) & Worboys, M. (CoI)
1/01/14 → 31/03/15
Project: Research
Impacts
- Mining the History of Medicine: Semantically Enhanced Search System for Historical Medical Archives
Ananiadou, S. (Participant), Mcnaught, J. (Participant), Timmermann, C. (Participant), Worboys, M. (Participant), (Participant) & Toon, E. (Participant)
Impact: Society and culture, Health and wellbeing, Awareness and understanding