{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,21]],"date-time":"2026-07-21T12:23:56Z","timestamp":1784636636267,"version":"3.55.0"},"reference-count":89,"publisher":"Cambridge University Press (CUP)","issue":"2","license":[{"start":{"date-parts":[[2022,3,18]],"date-time":"2022-03-18T00:00:00Z","timestamp":1647561600000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2023,3]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Named entities (NEs) are among the most relevant types of information that can be used to properly index digital documents and thus easily retrieve them. It has long been observed that NEs are key to accessing the contents of digital library portals as they are contained in most user queries. However, most digitized documents are indexed through their optical character recognition (OCRed) versions, which include numerous errors. Although OCR engines have considerably improved over the last few years, OCR errors still considerably impact document access. Previous work evaluated the impact of OCR errors on named entity recognition (NER) and named entity linking (NEL) techniques separately. In this article, we experimented with a variety of OCRed documents with different levels and types of OCR noise to assess in depth the impact of OCR on named entity processing. We provide a deep analysis of OCR errors that impact the performance of NER and NEL. 
We then present the resulting exhaustive study and subsequent recommendations on the adequate documents, the OCR quality levels, and the post-OCR correction strategies required to perform reliable NER and NEL.<\/jats:p>","DOI":"10.1017\/s1351324922000110","type":"journal-article","created":{"date-parts":[[2022,3,30]],"date-time":"2022-03-30T09:57:14Z","timestamp":1648634234000},"page":"425-448","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":31,"title":["In-depth analysis of the impact of OCR errors on named entity recognition and linking"],"prefix":"10.1017","volume":"29","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8964-2135","authenticated-orcid":false,"given":"Ahmed","family":"Hamdi","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9571-5193","authenticated-orcid":false,"given":"Elvys","family":"Linhares Pontes","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6719-5007","authenticated-orcid":false,"given":"Nicolas","family":"Sidere","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0123-439X","authenticated-orcid":false,"given":"Micka\u00ebl","family":"Coustaty","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6160-3356","authenticated-orcid":false,"given":"Antoine","family":"Doucet","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"56","published-online":{"date-parts":[[2022,3,18]]},"reference":[{"key":"S1351324922000110_ref27","unstructured":"Grave, E. , Bojanowski, P. , Gupta, P. , Joulin, A. and Mikolov, T. (2018). Learning word vectors for 157 languages. 
In Proceedings of the International Conference on Language Resources and Evaluation."},{"key":"S1351324922000110_ref76","unstructured":"Rodriquez, K.J. , Bryant, M. , Blanke, T. and Luszczynska, M. (2012). Comparison of named entity recognition tools for raw OCR text. In KONVENS, pp. 410\u2013414."},{"key":"S1351324922000110_ref85","first-page":"5998","volume-title":"Advances in Neural Information Processing Systems","volume":"30","author":"Vaswani","year":"2017"},{"key":"S1351324922000110_ref81","doi-asserted-by":"publisher","DOI":"10.1016\/0306-4573(95)00058-5"},{"key":"S1351324922000110_ref8","doi-asserted-by":"publisher","DOI":"10.7250\/csimq.2016-7.04"},{"key":"S1351324922000110_ref32","doi-asserted-by":"publisher","DOI":"10.1145\/2661829.2661887"},{"key":"S1351324922000110_ref78","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2014.2327028"},{"key":"S1351324922000110_ref19","unstructured":"Ehrmann, M. , Hamdi, A. , Pontes, E.L. , Romanello, M. and Doucet, A. (2021). Named entity recognition and classification on historical documents: a survey. CoRR, abs\/2109.11406."},{"key":"S1351324922000110_ref40","unstructured":"Ittner, D.J. , Lewis, D.D. and Ahn, D.D. (1995). Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval. Citeseer, pp. 301\u2013315."},{"key":"S1351324922000110_ref20","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313517"},{"key":"S1351324922000110_ref35","unstructured":"Han, X. and Zhao, J. (1999). NLPR_KBP in TAC 2009 KBP track: a two-stage method to entity linking. In Proceedings of Test Analysis Conference 2009 (TAC 09). MIT Press."},{"key":"S1351324922000110_ref44","doi-asserted-by":"publisher","DOI":"10.1145\/129875.129882"},{"key":"S1351324922000110_ref88","unstructured":"Zheng, Z. , Li, F. , Huang, M. and Zhu, X. (2010). Learning to link entities with knowledge base. 
In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT\u201910. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 483\u2013491."},{"key":"S1351324922000110_ref61","doi-asserted-by":"publisher","DOI":"10.3115\/1034678.1034710"},{"key":"S1351324922000110_ref37","unstructured":"Hoffart, J. , Yosef, M.A. , Bordino, I. , F\u00fcrstenau, H. , Pinkal, M. , Spaniol, M. , Taneva, B. , Thater, S. and Weikum, G. (2011). Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP\u201911. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 782\u2013792."},{"key":"S1351324922000110_ref15","unstructured":"Cucerzan, S. (2007). Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic. Association for Computational Linguistics, pp. 708\u2013716."},{"key":"S1351324922000110_ref48","doi-asserted-by":"publisher","DOI":"10.3233\/SW-140134"},{"key":"S1351324922000110_ref12","doi-asserted-by":"publisher","DOI":"10.1109\/JCDL.2017.7991582"},{"key":"S1351324922000110_ref17","unstructured":"Devlin, J. , Chang, M.-W. , Lee, K. and Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association for Computational Linguistics, pp. 
4171\u20134186."},{"key":"S1351324922000110_ref66","doi-asserted-by":"publisher","DOI":"10.1075\/li.30.1.03nad"},{"key":"S1351324922000110_ref33","doi-asserted-by":"publisher","DOI":"10.1109\/JCDL.2019.00057"},{"key":"S1351324922000110_ref39","volume-title":"Digital Libraries at Times of Massive Societal Transition - Collaborating and Connecting Community during Global Change","author":"Huynh","year":"2020"},{"key":"S1351324922000110_ref34","first-page":"87","author":"Hamdi","year":"2020"},{"key":"S1351324922000110_ref28","doi-asserted-by":"crossref","unstructured":"Grishman, R. and Sundheim, B. (1996). Message understanding conference-6: a brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics, vol. 1.","DOI":"10.3115\/992628.992709"},{"key":"S1351324922000110_ref74","unstructured":"Ravi, M.P.K. , Singh, K. , Mulang, I.O. , Shekarpour, S. , Hoffart, J. and Lehmann, J. (2021). Cholan: a modular approach for neural entity linking on wikipedia and wikidata. arXiv preprint arXiv:2101.09969."},{"key":"S1351324922000110_ref38","article-title":"How good can it get? analysing and improving ocr accuracy in large scale historic newspaper digitisation programs","volume":"15","author":"Holley","year":"2009","journal-title":"D-Lib Magazine"},{"key":"S1351324922000110_ref71","doi-asserted-by":"crossref","unstructured":"Peters, M.E. , Ammar, W. , Bhagavatula, C. and Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:1705.00108.","DOI":"10.18653\/v1\/P17-1161"},{"key":"S1351324922000110_ref41","doi-asserted-by":"crossref","unstructured":"Jing, H. , Lopresti, D. and Shih, C. (2003). Summarizing noisy documents. In Proceedings of the Symposium on Document Image Understanding Technology, pp. 
111\u2013119.","DOI":"10.3115\/1119467.1119471"},{"key":"S1351324922000110_ref82","doi-asserted-by":"publisher","DOI":"10.1093\/llc\/fqt067"},{"key":"S1351324922000110_ref57","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-2026"},{"key":"S1351324922000110_ref87","unstructured":"Zhang, W. , Sim, Y.C. , Su, J. and Tan, C.L. (2011). Entity linking with effective acronym expansion, instance selection and topic modeling. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Volume Three, IJCAI\u201911. AAAI Press, pp. 1909\u20131914."},{"key":"S1351324922000110_ref56","doi-asserted-by":"crossref","unstructured":"Ma, X. and Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.","DOI":"10.18653\/v1\/P16-1101"},{"key":"S1351324922000110_ref10","unstructured":"Cao, N.D. , Wu, L. , Popat, K. , Artetxe, M. , Goyal, N. , Plekhanov, M. , Zettlemoyer, L. , Cancedda, N. , Riedel, S. and Petroni, F. (2021). Multilingual autoregressive entity linking. CoRR. https:\/\/arxiv.org\/abs\/2103.12528."},{"key":"S1351324922000110_ref68","doi-asserted-by":"publisher","DOI":"10.1109\/JCDL.2019.00015"},{"key":"S1351324922000110_ref55","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1104"},{"key":"S1351324922000110_ref3","doi-asserted-by":"crossref","unstructured":"Bikel, D.M. , Miller, S. , Schwartz, R. and Weischedel, R. (1998). Nymble: a high-performance learning name-finder. arXiv preprint cmp-lg\/9803003.","DOI":"10.3115\/974557.974586"},{"key":"S1351324922000110_ref51","first-page":"102","author":"Linhares Pontes","year":"2019"},{"key":"S1351324922000110_ref30","doi-asserted-by":"publisher","DOI":"10.1145\/1571941.1571989"},{"key":"S1351324922000110_ref36","first-page":"120","author":"Heino","year":"2017"},{"key":"S1351324922000110_ref75","unstructured":"Ritter, A. , Clark, S. , Mausam, and Etzioni, O. (2011). Named entity recognition in tweets: an experimental study. 
In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 1524\u20131534."},{"key":"S1351324922000110_ref42","doi-asserted-by":"publisher","DOI":"10.3390\/jimaging3040062"},{"key":"S1351324922000110_ref46","unstructured":"Lawrie, D. , Mayfield, J. and Etter, D. (2020). Building OCR\/NER test collections. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 4639\u20134646."},{"key":"S1351324922000110_ref52","doi-asserted-by":"publisher","DOI":"10.1145\/3383583.3398597"},{"key":"S1351324922000110_ref53","doi-asserted-by":"publisher","DOI":"10.1145\/1066677.1066851"},{"key":"S1351324922000110_ref67","doi-asserted-by":"publisher","DOI":"10.1145\/3453476"},{"key":"S1351324922000110_ref65","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-04257-8_1"},{"key":"S1351324922000110_ref62","doi-asserted-by":"publisher","DOI":"10.3115\/974147.974191"},{"key":"S1351324922000110_ref21","doi-asserted-by":"publisher","DOI":"10.3115\/1220575.1220637"},{"key":"S1351324922000110_ref1","unstructured":"Akbik, A. , Bergmann, T. , Blythe, D. , Rasul, K. , Schweter, S. and Vollgraf, R. (2019). FLAIR: an easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota. Association for Computational Linguistics, pp. 54\u201359."},{"key":"S1351324922000110_ref89","unstructured":"Zuccon, G. , Nguyen, A.N. , Bergheim, A. , Wickman, S. and Grayson, N. (2012). The impact of OCR accuracy on automated cancer classification of pathology reports. In HIC, pp. 250\u2013256."},{"key":"S1351324922000110_ref7","unstructured":"Borthwick, A. , Sterling, J. , Agichtein, E. and Grishman, R. (1998). Nyu: description of the mene named entity system as used in muc-7. 
In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29\u2013May 1, 1998."},{"key":"S1351324922000110_ref24","doi-asserted-by":"crossref","unstructured":"Gefen, A. (2014). Les enjeux \u00e9pist\u00e9mologiques des humanit\u00e9s num\u00e9riques. Socio-La nouvelle revue des sciences sociales, (4), 61\u201374.","DOI":"10.4000\/socio.1296"},{"key":"S1351324922000110_ref16","doi-asserted-by":"crossref","unstructured":"Dernoncourt, F. , Lee, J.Y. and Szolovits, P. (2017). Neuroner: an easy-to-use program for named-entity recognition based on neural networks. arXiv preprint arXiv:1705.05487.","DOI":"10.18653\/v1\/D17-2017"},{"key":"S1351324922000110_ref63","doi-asserted-by":"publisher","DOI":"10.1016\/0306-4573(87)90116-6"},{"key":"S1351324922000110_ref83","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401416"},{"key":"S1351324922000110_ref25","unstructured":"Goldberg, Y. and Levy, O. (2014). word2vec explained: deriving mikolov et al.\u2019s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722."},{"key":"S1351324922000110_ref31","unstructured":"Guo, S. , Chang, M.-W. and Kiciman, E. (2013). To link or not to link? a study on end-to-end tweet entity linking. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia. Association for Computational Linguistics, pp. 1020\u20131030."},{"key":"S1351324922000110_ref59","doi-asserted-by":"crossref","unstructured":"Maynard, D. , Tablan, V. , Ursu, C. , Cunningham, H. and Wilks, Y. (2001). Named entity recognition from diverse text types. In Recent Advances in Natural Language Processing 2001 Conference, pp. 257\u2013274.","DOI":"10.1017\/S1351324902002930"},{"key":"S1351324922000110_ref45","doi-asserted-by":"crossref","unstructured":"Lample, G. , Ballesteros, M. , Subramanian, S. , Kawakami, K. and Dyer, C. (2016). 
Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.","DOI":"10.18653\/v1\/N16-1030"},{"key":"S1351324922000110_ref73","doi-asserted-by":"publisher","DOI":"10.1145\/1321440.1321542"},{"key":"S1351324922000110_ref58","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-2026"},{"key":"S1351324922000110_ref29","unstructured":"Grover, C. , Givon, S. , Tobin, R. and Ball, J. (2008). Named entity recognition for digitised historical texts. In LREC."},{"key":"S1351324922000110_ref69","doi-asserted-by":"publisher","DOI":"10.3115\/1072133.1072186"},{"key":"S1351324922000110_ref5","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.conll-1.35"},{"key":"S1351324922000110_ref22","unstructured":"Filannino, M. , Brown, G. and Nenadic, G. (2013). Mantime: temporal expression identification and normalization in the tempeval-3 challenge. arXiv preprint arXiv:1304.7942."},{"key":"S1351324922000110_ref11","unstructured":"Chen, H. , Zukov-Gregoric, A. , Li, X.D. and Wadhwa, S. (2019). Contextualized end-to-end neural entity linking. arXiv preprint arXiv:1911.03834."},{"key":"S1351324922000110_ref50","volume-title":"Digital Libraries at Times of Massive Societal Transition - Collaborating and Connecting Community during Global Change","author":"Linhares Pontes","year":"2020"},{"key":"S1351324922000110_ref54","doi-asserted-by":"publisher","DOI":"10.1007\/s10032-009-0094-8"},{"key":"S1351324922000110_ref47","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1148"},{"key":"S1351324922000110_ref86","unstructured":"Yaser, A.-O. (2005). Effect of degraded input on statistical machine translation. In 2005 Symposium on Document Image Understanding Technology, p. 103."},{"key":"S1351324922000110_ref2","doi-asserted-by":"publisher","DOI":"10.3115\/1073445.1073447"},{"key":"S1351324922000110_ref60","unstructured":"McDonald, D.D. (1993). Internal and external evidence in the identification and semantic categorization of proper names. 
In Acquisition of Lexical Knowledge from Text."},{"key":"S1351324922000110_ref26","doi-asserted-by":"publisher","DOI":"10.1098\/rsta.2000.0587"},{"key":"S1351324922000110_ref64","doi-asserted-by":"publisher","DOI":"10.1145\/3197026.3197055"},{"key":"S1351324922000110_ref79","doi-asserted-by":"publisher","DOI":"10.1145\/2505515.2505601"},{"key":"S1351324922000110_ref84","doi-asserted-by":"publisher","DOI":"10.5220\/0009169004840496"},{"key":"S1351324922000110_ref23","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1277"},{"key":"S1351324922000110_ref43","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K18-1050"},{"key":"S1351324922000110_ref13","first-page":"2493","article-title":"Natural language processing (almost) from scratch","volume":"12","author":"Collobert","year":"2011","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324922000110_ref18","unstructured":"Dredze, M. , McNamee, P. , Rao, D. , Gerber, A. and Finin, T. (2010). Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING\u201910, Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 277\u2013285."},{"key":"S1351324922000110_ref9","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K19-1063"},{"key":"S1351324922000110_ref77","unstructured":"Ruiz, P. and Poibeau, T. (2019). Mapping the Bentham Corpus: concept-based navigation. Journal of Data Mining and Digital Humanities. Special Issue: Digital Humanities between knowledge and know-how (Atelier Digit_Hum)."},{"key":"S1351324922000110_ref4","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"S1351324922000110_ref6","unstructured":"Boros, E. , Linhares Pontes, E. , Cabrera-Diego, L.A. , Hamdi, A. , Moreno, J.G. , Sid\u00e8re, N. and Doucet, A. (2020b). Robust named entity recognition and linking on historical multilingual documents. In Conference and Labs of the Evaluation Forum (CLEF 2020). 
Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, vol. 2696, Thessaloniki, Greece. CEUR-WS Working Notes, pp. 1\u201317."},{"key":"S1351324922000110_ref70","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"S1351324922000110_ref14","unstructured":"Croft, W. , Harding, S. , Taghva, K. and Borsack, J. (1994). An evaluation of information retrieval accuracy with simulated ocr output. In Symposium on Document Analysis and Information Retrieval, pp. 115\u2013126."},{"key":"S1351324922000110_ref49","doi-asserted-by":"publisher","DOI":"10.1145\/2487575.2487681"},{"key":"S1351324922000110_ref80","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242667"},{"key":"S1351324922000110_ref72","doi-asserted-by":"crossref","unstructured":"Peters, M.E. , Neumann, M. , Iyyer, M. , Gardner, M. , Clark, C. , Lee, K. and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.","DOI":"10.18653\/v1\/N18-1202"}],"container-title":["Natural Language 
Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324922000110","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,3,13]],"date-time":"2023-03-13T04:19:50Z","timestamp":1678681190000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324922000110\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3,18]]},"references-count":89,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,3]]}},"alternative-id":["S1351324922000110"],"URL":"https:\/\/doi.org\/10.1017\/s1351324922000110","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,3,18]]},"assertion":[{"value":"\u00a9 The Author(s), 2022. Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}}]}}