{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,5]],"date-time":"2026-02-05T06:01:59Z","timestamp":1770271319490,"version":"3.49.0"},"reference-count":35,"publisher":"MIT Press - Journals","license":[{"start":{"date-parts":[[2021,5,4]],"date-time":"2021-05-04T00:00:00Z","timestamp":1620086400000},"content-version":"vor","delay-in-days":123,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,5,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization.<\/jats:p>\n               <jats:p>For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model\u2019s correcting behavior.<\/jats:p>\n               <jats:p>Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.<\/jats:p>","DOI":"10.1162\/tacl_a_00379","type":"journal-article","created":{"date-parts":[[2021,5,5]],"date-time":"2021-05-05T00:21:22Z","timestamp":1620174082000},"page":"479-493","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":12,"title":["Neural OCR Post-Hoc Correction of Historical Corpora"],"prefix":"10.1162","volume":"9","author":[{"given":"Lijun","family":"Lyu","sequence":"first","affiliation":[{"name":"L3S Research Center, Leibniz University of Hannover \/ Hannover, Germany. lyu@L3S.de"}]},{"given":"Maria","family":"Koutraki","sequence":"additional","affiliation":[{"name":"L3S Research Center, Leibniz University of Hannover \/ Hannover, Germany. koutraki@L3S.de"}]},{"given":"Martin","family":"Krickl","sequence":"additional","affiliation":[{"name":"Austrian National Library \/ Vienna, Austria. martin.krickl@onb.ac.at"}]},{"given":"Besnik","family":"Fetahu","sequence":"additional","affiliation":[{"name":"L3S Research Center, Leibniz University of Hannover \/ Hannover, Germany"},{"name":"Amazon \/ Seattle, WA, USA. besnikf@amazon.com"}]}],"member":"281","published-online":{"date-parts":[[2021,5,4]]},"reference":[{"key":"2021060823401809100_bib1","article-title":"Using SMT for OCR error correction of historical texts","volume-title":"Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portoro\u017e, Slovenia, May 23\u201328, 2016","author":"Afli","year":"2016"},{"key":"2021060823401809100_bib2","article-title":"Neural machine translation by jointly learning to align and translate","volume-title":"3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7\u20139, 2015, Conference Track Proceedings","author":"Bahdanau","year":"2015"},{"key":"2021060823401809100_bib3","first-page":"349","article-title":"Improved transition-based parsing by modeling characters instead of words with lstms","volume-title":"Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17\u201321, 2015","author":"Ballesteros","year":"2015"},{"key":"2021060823401809100_bib4","article-title":"Bootstrapped OCR error detection for a less-resourced language variant","volume-title":"Proceedings of the 13th Conference on Natural Language Processing, KONVENS 2016, Bochum, Germany, September 19\u201321, 2016, volume 16 of Bochumer Linguistische Arbeitsberichte","author":"Barbaresi","year":"2016"},{"key":"2021060823401809100_bib5","first-page":"286","article-title":"An improved error model for noisy channel spelling correction","volume-title":"38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, China, October 1\u20138, 2000","author":"Brill","year":"2000"},{"key":"2021060823401809100_bib6","first-page":"1724","article-title":"Learning phrase representations using RNN encoder-decoder for statistical machine translation","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25\u201329, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL","author":"Cho","year":"2014"},{"key":"2021060823401809100_bib7","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/P16-1160","article-title":"A character-level decoder without explicit segmentation for neural machine translation","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7\u201312, 2016, Berlin, Germany, Volume 1: Long Papers","author":"Chung","year":"2016"},{"key":"2021060823401809100_bib8","first-page":"876","article-title":"Incorporating structural alignment biases into an attentional neural translation model","volume-title":"NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12\u201317, 2016","author":"Cohn","year":"2016"},{"key":"2021060823401809100_bib9","first-page":"933","article-title":"Language modeling with gated convolutional networks","volume-title":"Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6\u201311 August 2017","author":"Dauphin","year":"2017"},{"key":"2021060823401809100_bib10","first-page":"2363","article-title":"Multi-input attention for unsupervised OCR correction","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15\u201320, 2018, Volume 1: Long Papers","author":"Dong","year":"2018"},{"key":"2021060823401809100_bib11","first-page":"1080","article-title":"Latent-variable modeling of string transductions with finite-state methods","volume-title":"2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Proceedings of the Conference, 25\u201327 October 2008, Honolulu, Hawaii, USA, A meeting of SIGDAT, a Special Interest Group of the ACL","author":"Dreyer","year":"2008"},{"key":"2021060823401809100_bib12","first-page":"161","article-title":"Generalized character-level spelling error correction","volume-title":"Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22\u201327, 2014, Baltimore, MD, USA, Volume 2: Short Papers","author":"Farra","year":"2014"},{"key":"2021060823401809100_bib13","first-page":"123","article-title":"A convolutional encoder model for neural machine translation","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 \u2013 August 4, Volume 1: Long Papers","author":"Gehring","year":"2017"},{"key":"2021060823401809100_bib14","first-page":"1243","article-title":"Convolutional sequence to sequence learning","volume-title":"Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6\u201311 August 2017","author":"Gehring","year":"2017"},{"issue":"8","key":"2021060823401809100_bib15","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Computation"},{"key":"2021060823401809100_bib16","first-page":"1700","article-title":"Recurrent continuous translation models","volume-title":"Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18\u201321 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL","author":"Kalchbrenner","year":"2013"},{"key":"2021060823401809100_bib17","first-page":"2741","article-title":"Character-aware neural language models","volume-title":"Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12\u201317, 2016, Phoenix, Arizona, USA","author":"Kim","year":"2016"},{"issue":"10","key":"2021060823401809100_bib18","first-page":"1995","article-title":"Convolutional networks for images, speech, and time series","volume":"3361","author":"LeCun","year":"1995","journal-title":"The handbook of brain theory and neural networks"},{"key":"2021060823401809100_bib19","article-title":"Character-based neural machine translation","author":"Ling","year":"2015","journal-title":"CoRR"},{"key":"2021060823401809100_bib20","first-page":"86580R","article-title":"Combining multiple thresholding binarization values to improve OCR output","volume-title":"Document Recognition and Retrieval XX, part of the IS&T-SPIE Electronic Imaging Symposium, Burlingame, California, USA, February 5\u20137, 2013, Proceedings, volume 8658 of SPIE Proceedings","author":"Lund","year":"2013"},{"key":"2021060823401809100_bib21","first-page":"90210A","article-title":"How well does multiple OCR error correction generalize?","volume-title":"Document Recognition and Retrieval XXI, San Francisco, California, USA, February 5\u20136, 2014, volume 9021 of SPIE Proceedings","author":"Lund","year":"2014"},{"key":"2021060823401809100_bib22","first-page":"764","article-title":"Progressive alignment and discriminative error correction for multiple OCR engines","volume-title":"2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, September 18\u201321, 2011","author":"Lund","year":"2011"},{"key":"2021060823401809100_bib23","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1017\/CBO9781139058452","volume-title":"Mining of Massive Datasets","author":"Rajaraman","year":"2011"},{"key":"2021060823401809100_bib24","first-page":"423","article-title":"Improving OCR accuracy on early printed books by utilizing cross fold training and voting","volume-title":"13th IAPR International Workshop on Document Analysis Systems, DAS 2018, Vienna, Austria, April 24\u201327, 2018","author":"Reul","year":"2018"},{"key":"2021060823401809100_bib25","article-title":"State of the art optical character recognition of 19th century fraktur scripts using open source engines","author":"Reul","year":"2018","journal-title":"CoRR"},{"key":"2021060823401809100_bib26","first-page":"386","article-title":"Character-level models versus morphology in semantic role labeling","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15\u201320, 2018, Volume 1: Long Papers","author":"Sahin","year":"2018"},{"key":"2021060823401809100_bib27","first-page":"1703","article-title":"Still not there? Comparing traditional sequence-to-sequence models to encoder-decoder neural networks on monotone string translation tasks","volume-title":"COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11\u201316, 2016, Osaka, Japan","author":"Schnober","year":"2016"},{"key":"2021060823401809100_bib28","first-page":"2716","article-title":"Multi-modular domain-tailored OCR post-correction","volume-title":"Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9\u201311, 2017","author":"Schulz","year":"2017"},{"key":"2021060823401809100_bib29","doi-asserted-by":"crossref","first-page":"51","DOI":"10.18653\/v1\/W16-2406","article-title":"Data-driven spelling correction using weighted finite-state methods","volume-title":"Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata","author":"Silfverberg","year":"2016"},{"key":"2021060823401809100_bib30","first-page":"3104","article-title":"Sequence to sequence learning with neural networks","volume-title":"Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8\u201313 2014, Montreal, Quebec, Canada","author":"Sutskever","year":"2014"},{"key":"2021060823401809100_bib31","author":"Gagnon-Marchand","year":".."},{"key":"2021060823401809100_bib32","first-page":"5998","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4\u20139 December 2017, Long Beach, CA, USA","author":"Vaswani","year":"2017"},{"issue":"5","key":"2021060823401809100_bib33","doi-asserted-by":"crossref","first-page":"1063","DOI":"10.1109\/TKDE.2013.11","article-title":"A probabilistic approach to string transformation","volume":"26","author":"Wang","year":"2014","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"2021060823401809100_bib34","article-title":"Neural language correction with character-based attention","author":"Xie","year":"2016","journal-title":"CoRR"},{"key":"2021060823401809100_bib35","first-page":"269","article-title":"Retrieving and combining repeated passages to improve OCR","volume-title":"2017 ACM\/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19\u201323, 2017","author":"Shaobin","year":"2017"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00379\/1924062\/tacl_a_00379.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00379\/1924062\/tacl_a_00379.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,6,9]],"date-time":"2021-06-09T09:39:47Z","timestamp":1623231587000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00379\/100788\/Neural-OCR-Post-Hoc-Correction-of-Historical"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021]]},"references-count":35,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00379","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021]]},"published":{"date-parts":[[2021]]}}}