Customised OCR correction for historical medical text

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


    Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, owing to large-scale digitisation efforts. Searchable access is typically provided by applying Optical Character Recognition (OCR) software to scanned page images. Often, however, the automatically recognised text contains a large number of errors, since OCR systems are typically optimised to deal with modern documents, and can struggle with historical document features, including variable print characteristics and archaic vocabulary usage. Low quality OCR text can reduce the efficiency of search systems over historical archives, particularly semantic systems that are based on the application of sophisticated text mining (TM) techniques. We report on a new OCR correction strategy, customised for historical medical documents. The method combines rule-based correction of regular errors with a medically-tuned spell-checking strategy, whose corrections are guided by information about subject-specific language usage from the publication period of the article to be corrected. The performance of our method compares favourably to other OCR post-correction strategies, in improving word-level accuracy of poor-quality documents by up to 16%.
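The two-stage idea described in the abstract (rule-based fixes for regular errors, then spell-checking guided by period-specific vocabulary) can be illustrated with a minimal sketch. This is not the authors' implementation: the rules in `REGULAR_ERROR_RULES`, the frequency table `period_freq`, and all function names are hypothetical, and the candidate generation uses a generic edit-distance-1 approach as a stand-in for the medically tuned spell checker.

```python
import re
import string
from collections import Counter

# Hypothetical rules for regular, predictable OCR confusions (illustrative only).
REGULAR_ERROR_RULES = [
    (re.compile(r"\btbe\b"), "the"),  # 'h' commonly misread as 'b'
    (re.compile("\u017f"), "s"),      # long s normalised to modern 's'
]


def apply_rules(text):
    """Stage 1: apply each regular-error rule in order."""
    for pattern, replacement in REGULAR_ERROR_RULES:
        text = pattern.sub(replacement, text)
    return text


def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def correct_word(word, period_freq):
    """Stage 2: pick the candidate most frequent in the period vocabulary."""
    if word in period_freq:
        return word
    candidates = [c for c in edits1(word) if c in period_freq]
    if not candidates:
        return word  # leave unknown words untouched
    # Tie-break alphabetically so the result is deterministic.
    return max(candidates, key=lambda c: (period_freq[c], c))


def correct_text(text, period_freq):
    """Run both stages; only all-lowercase alphabetic tokens are spell-checked."""
    text = apply_rules(text)
    return re.sub(r"\b[a-z]+\b",
                  lambda m: correct_word(m.group(0), period_freq),
                  text)


# Toy frequency table standing in for period-specific medical usage (assumed data).
period_freq = Counter({"the": 50, "from": 30, "patient": 20,
                       "fever": 15, "suffered": 5})
print(correct_text("tbe patient fuffered from fevcr", period_freq))
```

In this sketch the frequency table is the only place where the publication period enters: swapping in counts drawn from a different decade would change which candidate wins without touching the correction logic.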
    Original language: English
    Title of host publication: 2015 Digital Heritage
    Editors: Gabriele Guidi, Roberto Scopigno, Juan Carlos Torres, Holger Graf
    Number of pages: 7
    ISBN (Print): 978-1-5090-0254-2
    Publication status: Published - Mar 2016
    Event: Digital Heritage 2015 - Granada, Spain
    Duration: 28 Sept 2015 to 2 Oct 2015


    Conference: Digital Heritage 2015
    City: Granada, Spain


    • OCR correction strategy
    • historical document search
    • medical history
    • spell checking
    • text mining

