{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T17:54:45Z","timestamp":1775325285043,"version":"3.50.1"},"reference-count":167,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2021,7,13]],"date-time":"2021-07-13T00:00:00Z","timestamp":1626134400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100010676","name":"H2020 Societal Challenges","doi-asserted-by":"publisher","award":["770299"],"award-info":[{"award-number":["770299"]}],"id":[{"id":"10.13039\/100010676","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2022,7,31]]},"abstract":"<jats:p>Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones. While OCR engines can do well with modern text, their performance is unfortunately significantly reduced on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing quality of OCR results by studying their effects on information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. 
Furthermore, the work identifies the current trend and outlines some research directions of this field.<\/jats:p>","DOI":"10.1145\/3453476","type":"journal-article","created":{"date-parts":[[2021,7,13]],"date-time":"2021-07-13T16:48:08Z","timestamp":1626194888000},"page":"1-37","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":220,"title":["Survey of Post-OCR Processing Approaches"],"prefix":"10.1145","volume":"54","author":[{"given":"Thi Tuyet Hai","family":"Nguyen","sequence":"first","affiliation":[{"name":"L3i, University of La Rochelle"}]},{"given":"Adam","family":"Jatowt","sequence":"additional","affiliation":[{"name":"University of Innsbruck"}]},{"given":"Mickael","family":"Coustaty","sequence":"additional","affiliation":[{"name":"L3i, University of La Rochelle"}]},{"given":"Antoine","family":"Doucet","sequence":"additional","affiliation":[{"name":"L3i, University of La Rochelle"}]}],"member":"320","published-online":{"date-parts":[[2021,7,13]]},"reference":[{"key":"e_1_2_2_1_1","first-page":"175","article-title":"OCR error correction using statistical machine translation","volume":"7","author":"Afli Haithem","year":"2016","unstructured":"Haithem Afli , Lo\u00efc Barrault , and Holger Schwenk . 2016 . OCR error correction using statistical machine translation . Int. J. Comput. Ling. Appl. 7 , 1 (2016), 175 \u2013 191 . Haithem Afli, Lo\u00efc Barrault, and Holger Schwenk. 2016. OCR error correction using statistical machine translation. Int. J. Comput. Ling. Appl. 7, 1 (2016), 175\u2013191.","journal-title":"Int. J. Comput. Ling. Appl."},{"key":"e_1_2_2_2_1","volume-title":"Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC\u201916)","author":"Afli Haithem","year":"2016","unstructured":"Haithem Afli , Zhengwei Qiu , Andy Way , and P\u00e1raic Sheridan . 2016 . Using SMT for OCR error correction of historical texts . 
In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC\u201916) . 962\u2013966. Haithem Afli, Zhengwei Qiu, Andy Way, and P\u00e1raic Sheridan. 2016. Using SMT for OCR error correction of historical texts. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC\u201916). 962\u2013966."},{"key":"e_1_2_2_3_1","volume-title":"Proceedings of the 2014 11th IAPR International Workshop on Document Analysis Systems. IEEE, 116\u2013120","author":"Azawi Mayce Al","unstructured":"Mayce Al Azawi and Thomas M. Breuel . 2014. Context-dependent confusions rules for building error model using weighted finite state transducers for OCR post-processing . In Proceedings of the 2014 11th IAPR International Workshop on Document Analysis Systems. IEEE, 116\u2013120 . Mayce Al Azawi and Thomas M. Breuel. 2014. Context-dependent confusions rules for building error model using weighted finite state transducers for OCR post-processing. In Proceedings of the 2014 11th IAPR International Workshop on Document Analysis Systems. IEEE, 116\u2013120."},{"key":"e_1_2_2_4_1","volume-title":"Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR\u201915)","author":"Azawi Mayce Al","unstructured":"Mayce Al Azawi , Marcus Liwicki , and Thomas M. Breuel . 2015. Combination of multiple aligned recognition outputs using WFST and LSTM . In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR\u201915) . IEEE, 31\u201335. Mayce Al Azawi, Marcus Liwicki, and Thomas M. Breuel. 2015. Combination of multiple aligned recognition outputs using WFST and LSTM. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR\u201915). 
IEEE, 31\u201335."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/8871.8877"},{"key":"e_1_2_2_6_1","doi-asserted-by":"crossref","first-page":"49","DOI":"10.21248\/jlcl.33.2018.218","article-title":"Supervised OCR error detection and correction using statistical and neural machine translation methods","volume":"33","author":"Amrhein Chantal","year":"2018","unstructured":"Chantal Amrhein and Simon Clematide . 2018 . Supervised OCR error detection and correction using statistical and neural machine translation methods . J. Lang. Technol. Comput. Ling. 33 , 1 (2018), 49 \u2013 76 . Chantal Amrhein and Simon Clematide. 2018. Supervised OCR error detection and correction using statistical and neural machine translation methods. J. Lang. Technol. Comput. Ling. 33, 1 (2018), 49\u201376.","journal-title":"J. Lang. Technol. Comput. Ling."},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/0306-4573(83)90022-5"},{"key":"e_1_2_2_8_1","volume-title":"Proceedings of the 11th International Conference on Image Analysis and Recognition (ICIAR\u201914)","volume":"8814","author":"Ali Al Azawi Mayce Ibrahim","unstructured":"Mayce Ibrahim Ali Al Azawi , Adnan Ul-Hasan , Marcus Liwicki , and Thomas M. Breuel . 2014. Character-level alignment using WFST and LSTM for post-processing in multi-script recognition systems\u2014A comparative study . In Proceedings of the 11th International Conference on Image Analysis and Recognition (ICIAR\u201914) , Lecture Notes in Computer Science , Vol. 8814 . Springer, 379\u2013386. Mayce Ibrahim Ali Al Azawi, Adnan Ul-Hasan, Marcus Liwicki, and Thomas M. Breuel. 2014. Character-level alignment using WFST and LSTM for post-processing in multi-script recognition systems\u2014A comparative study. In Proceedings of the 11th International Conference on Image Analysis and Recognition (ICIAR\u201914), Lecture Notes in Computer Science, Vol. 8814. 
Springer, 379\u2013386."},{"key":"e_1_2_2_9_1","volume-title":"OCR context-sensitive error correction based on google web 1T 5-gram data set. Am. J. Sci. Res. 50","author":"Bassil Youssef","year":"2012","unstructured":"Youssef Bassil and Mohammad Alwani . 2012. OCR context-sensitive error correction based on google web 1T 5-gram data set. Am. J. Sci. Res. 50 ( 2012 ). Youssef Bassil and Mohammad Alwani. 2012. OCR context-sensitive error correction based on google web 1T 5-gram data set. Am. J. Sci. Res. 50 (2012)."},{"key":"e_1_2_2_10_1","article-title":"OCR post-processing error correction algorithm using google\u2019s online spelling suggestion","volume":"3","author":"Bassil Youssef","year":"2012","unstructured":"Youssef Bassil and Mohammad Alwani . 2012 . OCR post-processing error correction algorithm using google\u2019s online spelling suggestion . J. Emerg. Trends Comput. Inf. Sci. 3 , 1 (2012). Youssef Bassil and Mohammad Alwani. 2012. OCR post-processing error correction algorithm using google\u2019s online spelling suggestion. J. Emerg. Trends Comput. Inf. Sci. 3, 1 (2012).","journal-title":"J. Emerg. Trends Comput. Inf. Sci."},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-45442-5_13"},{"key":"e_1_2_2_12_1","unstructured":"Yoshua Bengio R\u00e9jean Ducharme Pascal Vincent and Christian Jauvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res. 3(Feb.2003) 1137\u20131155.  Yoshua Bengio R\u00e9jean Ducharme Pascal Vincent and Christian Jauvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res. 
3(Feb.2003) 1137\u20131155."},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390749.1390766"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2008.01.002"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1031442.1031446"},{"key":"e_1_2_2_18_1","volume-title":"Web 1T 5-gram Version 1 LDC2006T13","author":"Brants Thorsten","unstructured":"Thorsten Brants and Alex Franz . 2006. Web 1T 5-gram Version 1 LDC2006T13 . In Philadelphia : Linguistic Data Consortium . Google Inc. Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1 LDC2006T13. In Philadelphia: Linguistic Data Consortium. Google Inc."},{"key":"e_1_2_2_19_1","volume-title":"Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 286\u2013293","author":"Brill Eric","unstructured":"Eric Brill and Robert C. Moore . 2000. An improved error model for noisy channel spelling correction . In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 286\u2013293 . Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 286\u2013293."},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-52246-9_51"},{"key":"e_1_2_2_21_1","volume-title":"Proceedings of the 16th International Conference on Information Technology-New Generations (ITNG\u201919)","author":"Fonseca Cacho Jorge Ram\u00f3n","year":"2019","unstructured":"Jorge Ram\u00f3n Fonseca Cacho , Kazem Taghva , and Daniel Alvarez . 2019 . Using the Google web 1T 5-gram corpus for OCR error correction . 
In Proceedings of the 16th International Conference on Information Technology-New Generations (ITNG\u201919) . Springer, 505\u2013511. Jorge Ram\u00f3n Fonseca Cacho, Kazem Taghva, and Daniel Alvarez. 2019. Using the Google web 1T 5-gram corpus for OCR error correction. In Proceedings of the 16th International Conference on Information Technology-New Generations (ITNG\u201919). Springer, 505\u2013511."},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-92270-6_1"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2595188.2595221"},{"key":"e_1_2_2_24_1","volume-title":"Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH\u201914)","author":"Chelba Ciprian","year":"2014","unstructured":"Ciprian Chelba , Tomas Mikolov , Mike Schuster , Qi Ge , Thorsten Brants , Phillipp Koehn , and Tony Robinson . 2014 . One billion word benchmark for measuring progress in statistical language modeling . In Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH\u201914) . ISCA, 2635\u20132639. Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH\u201914). ISCA, 2635\u20132639."},{"key":"e_1_2_2_25_1","volume-title":"ICDAR2017 competition on post-OCR text correction. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR\u201917)","volume":"1","author":"Chiron Guillaume","year":"2017","unstructured":"Guillaume Chiron , Antoine Doucet , Micka\u00ebl Coustaty , and Jean-Philippe Moreux . 2017 . ICDAR2017 competition on post-OCR text correction. 
In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR\u201917) , Vol. 1 . IEEE, 1423\u20131428. Guillaume Chiron, Antoine Doucet, Micka\u00ebl Coustaty, and Jean-Philippe Moreux. 2017. ICDAR2017 competition on post-OCR text correction. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR\u201917), Vol. 1. IEEE, 1423\u20131428."},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/JCDL.2017.7991582"},{"key":"e_1_2_2_27_1","volume-title":"Digitalkoot: Making old archives accessible using crowdsourcing. In Human Computation, Papers from the 2011 AAAI Workshop (AAAI Workshops)","author":"Chrons Otto","year":"2011","unstructured":"Otto Chrons and Sami Sundell . 2011 . Digitalkoot: Making old archives accessible using crowdsourcing. In Human Computation, Papers from the 2011 AAAI Workshop (AAAI Workshops) , Vol. WS-11- 11 . AAAI. Otto Chrons and Sami Sundell. 2011. Digitalkoot: Making old archives accessible using crowdsourcing. In Human Computation, Papers from the 2011 AAAI Workshop (AAAI Workshops), Vol. WS-11-11. AAAI."},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01889984"},{"key":"e_1_2_2_29_1","volume-title":"Proceedings of the 5th European Conference on Speech Communication and Technology.","author":"Clarkson Philip","year":"1997","unstructured":"Philip Clarkson and Ronald Rosenfeld . 1997 . Statistical language modeling using the CMU-Cambridge toolkit . In Proceedings of the 5th European Conference on Speech Communication and Technology. Philip Clarkson and Ronald Rosenfeld. 1997. Statistical language modeling using the CMU-Cambridge toolkit. 
In Proceedings of the 5th European Conference on Speech Communication and Technology."},{"key":"e_1_2_2_30_1","volume-title":"Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC\u201916)","author":"Clematide Simon","year":"2016","unstructured":"Simon Clematide , Lenz Furrer , and Martin Volk . 2016 . Crowdsourcing an OCR gold standard for a german and french heritage corpus . In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC\u201916) . 975\u2013982. Simon Clematide, Lenz Furrer, and Martin Volk. 2016. Crowdsourcing an OCR gold standard for a german and french heritage corpus. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC\u201916). 975\u2013982."},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-2102"},{"key":"e_1_2_2_32_1","volume-title":"Proceedings of the Symposium on Document Analysis and Information Retrieval. 115\u2013126","author":"Croft W. B.","unstructured":"W. B. Croft , S. M. Harding , K. Taghva , and J. Borsack . 1994. An evaluation of information retrieval accuracy with simulated OCR output . In Proceedings of the Symposium on Document Analysis and Information Retrieval. 115\u2013126 . W. B. Croft, S. M. Harding, K. Taghva, and J. Borsack. 1994. An evaluation of information retrieval accuracy with simulated OCR output. In Proceedings of the Symposium on Document Analysis and Information Retrieval. 115\u2013126."},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/363958.363994"},{"key":"e_1_2_2_34_1","volume-title":"Proceedings of the Digital Humanities in the Nordic Countries 5th ConferenceCEUR Workshop Proceedings","volume":"2612","author":"Dann\u00e9lls Dana","year":"2020","unstructured":"Dana Dann\u00e9lls and Simon Persson . 2020 . Supervised OCR post-correction of historical swedish texts: what role does the OCR system play ? 
In Proceedings of the Digital Humanities in the Nordic Countries 5th ConferenceCEUR Workshop Proceedings , Vol. 2612 . CEUR-WS.org, 24\u201337. Dana Dann\u00e9lls and Simon Persson. 2020. Supervised OCR post-correction of historical swedish texts: what role does the OCR system play? In Proceedings of the Digital Humanities in the Nordic Countries 5th ConferenceCEUR Workshop Proceedings, Vol. 2612. CEUR-WS.org, 24\u201337."},{"key":"e_1_2_2_35_1","volume-title":"Proceedings of the International Conference on Document Analysis and Recognition (ICDAR\u201919)","author":"Das Deepayan","unstructured":"Deepayan Das , Jerin Philip , Minesh Mathew , and C. V. Jawahar . 2019. A cost efficient approach to correct OCR errors in large document collections . In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR\u201919) . IEEE, 655\u2013662. Deepayan Das, Jerin Philip, Minesh Mathew, and C. V. Jawahar. 2019. A cost efficient approach to correct OCR errors in large document collections. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR\u201919). IEEE, 655\u2013662."},{"key":"e_1_2_2_36_1","unstructured":"DBNL. 2019. DBNL OCR Data set.  DBNL. 2019. DBNL OCR Data set."},{"key":"e_1_2_2_37_1","doi-asserted-by":"crossref","unstructured":"Andreas Dengel Rainer Hoch Frank H\u00f6nes Thorsten J\u00e4ger Michael Malburg and Achim Weigel. 1997. Techniques for improving OCR results. In Handbook of Character Recognition and Document Image Analysis. World Scientific 227\u2013258.  Andreas Dengel Rainer Hoch Frank H\u00f6nes Thorsten J\u00e4ger Michael Malburg and Achim Weigel. 1997. Techniques for improving OCR results. In Handbook of Character Recognition and Document Image Analysis. 
World Scientific 227\u2013258.","DOI":"10.1142\/9789812830968_0008"},{"key":"e_1_2_2_38_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT\u201919)","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of deep bidirectional transformers for language understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT\u201919) . Association for Computational Linguistics, 4171\u20134186. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT\u201919). Association for Computational Linguistics, 4171\u20134186."},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W16-6108"},{"key":"e_1_2_2_40_1","volume-title":"Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP\u201917)","author":"D\u2019hondt Eva","year":"2017","unstructured":"Eva D\u2019hondt , Cyril Grouin , and Brigitte Grau . 2017 . Generating a training corpus for ocr post-correction using encoder-decoder model . In Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP\u201917) . Asian Federation of Natural Language Processing, 1006\u20131014. Eva D\u2019hondt, Cyril Grouin, and Brigitte Grau. 2017. Generating a training corpus for ocr post-correction using encoder-decoder model. In Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP\u201917). 
Asian Federation of Natural Language Processing, 1006\u20131014."},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1220"},{"key":"e_1_2_2_42_1","volume-title":"Proceedings of the 21st Nordic Conference on Computational Linguistics. 70\u201376","author":"Drobac Senka","year":"2017","unstructured":"Senka Drobac , Pekka Kauppinen , and Krister Lind\u00e9n . 2017 . OCR and post-correction of historical Finnish texts . In Proceedings of the 21st Nordic Conference on Computational Linguistics. 70\u201376 . Senka Drobac, Pekka Kauppinen, and Krister Lind\u00e9n. 2017. OCR and post-correction of historical Finnish texts. In Proceedings of the 21st Nordic Conference on Computational Linguistics. 70\u201376."},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10032-020-00359-9"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1515\/pralin-2016-0004"},{"key":"e_1_2_2_45_1","volume-title":"Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage. 19\u201324","author":"Englmeier Tobias","unstructured":"Tobias Englmeier , Florian Fink , and Klaus U. Schulz . 2019. AI-PoCoTo: Combining automated and interactive ocr postcorrection . In Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage. 19\u201324 . Tobias Englmeier, Florian Fink, and Klaus U. Schulz. 2019. AI-PoCoTo: Combining automated and interactive ocr postcorrection. In Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage. 19\u201324."},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/2595188.2595194"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/2595188.2595200"},{"key":"e_1_2_2_48_1","volume-title":"Proceedings of the ACM\/IEEE Joint Conference on Digital Libraries (JCDL\u201906)","author":"Feng Shaolei","unstructured":"Shaolei Feng and R. Manmatha . 2006. 
A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books . In Proceedings of the ACM\/IEEE Joint Conference on Digital Libraries (JCDL\u201906) . ACM, 109\u2013118. Shaolei Feng and R. Manmatha. 2006. A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books. In Proceedings of the ACM\/IEEE Joint Conference on Digital Libraries (JCDL\u201906). ACM, 109\u2013118."},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3078081.3078096"},{"key":"e_1_2_2_50_1","volume-title":"Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage. 97\u2013103","author":"Furrer Lenz","year":"2011","unstructured":"Lenz Furrer and Martin Volk . 2011 . Reducing OCR errors in gothic-script documents . In Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage. 97\u2013103 . Lenz Furrer and Martin Volk. 2011. Reducing OCR errors in gothic-script documents. In Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage. 97\u2013103."},{"key":"e_1_2_2_51_1","volume-title":"OCR17: GT for 17th French prints","author":"Gabay Simon","unstructured":"Simon Gabay . 2020. OCR17: GT for 17th French prints . Simon Gabay. 2020. OCR17: GT for 17th French prints."},{"key":"e_1_2_2_52_1","volume-title":"Proceedings of the First Italian Conference on Computational Linguistics CLiC-It","author":"G\u00e9n\u00e9reux Michel","year":"2014","unstructured":"Michel G\u00e9n\u00e9reux , Egon W. Stemle , Verena Lyding , and Lionel Nicolas . 2014 . Correcting OCR errors for german in fraktur font . In Proceedings of the First Italian Conference on Computational Linguistics CLiC-It . Pisa University Press, 186\u2013190. Michel G\u00e9n\u00e9reux, Egon W. Stemle, Verena Lyding, and Lionel Nicolas. 2014. Correcting OCR errors for german in fraktur font. 
In Proceedings of the First Italian Conference on Computational Linguistics CLiC-It. Pisa University Press, 186\u2013190."},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2013.72"},{"key":"e_1_2_2_54_1","volume-title":"Proceedings of Traitement Automatique des Langues Naturelles (TALN\u201911)","author":"G\u00f6hring Anne","year":"2011","unstructured":"Anne G\u00f6hring and Martin Volk . 2011 . The Text+ Berg corpus: An alpine french-german parallel resource . In Proceedings of Traitement Automatique des Langues Naturelles (TALN\u201911) . Anne G\u00f6hring and Martin Volk. 2011. The Text+ Berg corpus: An alpine french-german parallel resource. In Proceedings of Traitement Automatique des Langues Naturelles (TALN\u201911)."},{"key":"e_1_2_2_55_1","volume-title":"Proceedings of the 9th ACM Symposium on Document Engineering. ACM, 193\u2013200","author":"Gotscharek Annette","unstructured":"Annette Gotscharek , Ulrich Reffle , Christoph Ringlstetter , and Klaus U. Schulz . 2009. On lexical resources for digitization of historical documents . In Proceedings of the 9th ACM Symposium on Document Engineering. ACM, 193\u2013200 . Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter, and Klaus U. Schulz. 2009. On lexical resources for digitization of historical documents. In Proceedings of the 9th ACM Symposium on Document Engineering. ACM, 193\u2013200."},{"key":"e_1_2_2_56_1","doi-asserted-by":"crossref","unstructured":"Isabelle Guyon Robert M. Haralick Jonathan J. Hull and Ihsin Tsaiyun Phillips. 1997. Data sets for OCR and document image understanding research. In Handbook of Character Recognition and Document Image Analysis. World Scientific 779\u2013799.  Isabelle Guyon Robert M. Haralick Jonathan J. Hull and Ihsin Tsaiyun Phillips. 1997. Data sets for OCR and document image understanding research. In Handbook of Character Recognition and Document Image Analysis. 
World Scientific 779\u2013799.","DOI":"10.1142\/9789812830968_0030"},{"key":"e_1_2_2_57_1","volume-title":"Leveraging text repetitions and denoising autoencoders in OCR post-correction. CoRR abs\/1906.10907","author":"Hakala Kai","year":"2019","unstructured":"Kai Hakala , Aleksi Vesanto , Niko Miekka , Tapio Salakoski , and Filip Ginter . 2019. Leveraging text repetitions and denoising autoencoders in OCR post-correction. CoRR abs\/1906.10907 ( 2019 ). Kai Hakala, Aleksi Vesanto, Niko Miekka, Tapio Salakoski, and Filip Ginter. 2019. Leveraging text repetitions and denoising autoencoders in OCR post-correction. CoRR abs\/1906.10907 (2019)."},{"key":"e_1_2_2_58_1","volume-title":"Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP\u201919)","author":"H\u00e4m\u00e4l\u00e4inen Mika","year":"2019","unstructured":"Mika H\u00e4m\u00e4l\u00e4inen and Simon Hengchen . 2019 . From the paft to the fiiture: A fully automatic NMT and word embeddings method for OCR post-correction . In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP\u201919) . INCOMA Ltd., 431\u2013436. Mika H\u00e4m\u00e4l\u00e4inen and Simon Hengchen. 2019. From the paft to the fiiture: A fully automatic NMT and word embeddings method for OCR post-correction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP\u201919). INCOMA Ltd., 431\u2013436."},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/JCDL.2019.00057"},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3078081.3078107"},{"key":"e_1_2_2_61_1","unstructured":"Andreas W. Hauser. 2007. OCR-Postcorrection of Historical Texts. Master\u2019s thesis. Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen.  Andreas W. Hauser. 2007. OCR-Postcorrection of Historical Texts. Master\u2019s thesis. 
Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen."},{"key":"e_1_2_2_62_1","volume-title":"Proceedings of the 6th Workshop on Statistical Machine Translation. 187\u2013197","author":"Heafield Kenneth","year":"2011","unstructured":"Kenneth Heafield . 2011 . KenLM: Faster and smaller language model queries . In Proceedings of the 6th Workshop on Statistical Machine Translation. 187\u2013197 . Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the 6th Workshop on Statistical Machine Translation. 187\u2013197."},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-016-4185-5"},{"key":"e_1_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_2_65_1","volume-title":"Many Hands Make Light Work: Public Collaborative OCR Text Correction in Australian Historic Newspapers","author":"Holley Rose","unstructured":"Rose Holley . 2009. Many Hands Make Light Work: Public Collaborative OCR Text Correction in Australian Historic Newspapers . National Library of Australia . Rose Holley. 2009. Many Hands Make Light Work: Public Collaborative OCR Text Correction in Australian Historic Newspapers. National Library of Australia."},{"key":"e_1_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.3115\/1699648.1699670"},{"key":"e_1_2_2_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3357844"},{"key":"e_1_2_2_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2017.197"},{"key":"e_1_2_2_69_1","volume-title":"Human Language Technologies: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 697\u2013700","author":"Jiampojamarn Sittichai","year":"2010","unstructured":"Sittichai Jiampojamarn , Colin Cherry , and Grzegorz Kondrak . 2010 . Integrating joint n-gram features into a discriminative training framework . 
In Human Language Technologies: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 697\u2013700 . Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2010. Integrating joint n-gram features into a discriminative training framework. In Human Language Technologies: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 697\u2013700."},{"key":"e_1_2_2_70_1","doi-asserted-by":"publisher","DOI":"10.3115\/1119467.1119471"},{"key":"e_1_2_2_71_1","volume-title":"Daily Battle Communiques","author":"Jordan D. R.","unstructured":"D. R. Jordan . 1945. Daily Battle Communiques . Harold B. Lee Library . D. R. Jordan. 1945. Daily Battle Communiques. Harold B. Lee Library."},{"key":"e_1_2_2_72_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2017.307"},{"key":"e_1_2_2_73_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1009902609570"},{"key":"e_1_2_2_74_1","volume-title":"Proceedings of the Italian Research Conference on Digital Libraries. Springer, 95\u2013103","author":"Kettunen Kimmo","year":"2015","unstructured":"Kimmo Kettunen . 2015 . Keep, change or delete? Setting up a low resource ocr post-correction framework for a digitized old finnish newspaper collection . In Proceedings of the Italian Research Conference on Digital Libraries. Springer, 95\u2013103 . Kimmo Kettunen. 2015. Keep, change or delete? Setting up a low resource ocr post-correction framework for a digitized old finnish newspaper collection. In Proceedings of the Italian Research Conference on Digital Libraries. 
Springer, 95\u2013103."},{"key":"e_1_2_2_75_1","volume-title":"Proceedings of the IFLA World Library and Information Congress (IFLA\u201914)","author":"Kettunen Kimmo","year":"2014","unstructured":"Kimmo Kettunen , Timo Honkela , Krister Lind\u00e9n , Pekka Kauppinen , Tuula P\u00e4\u00e4kk\u00f6nen , Jukka Kervinen , et\u00a0al. 2014 . Analyzing and improving the quality of a historical news collection using language technology and statistical machine learning methods . In Proceedings of the IFLA World Library and Information Congress (IFLA\u201914) . Kimmo Kettunen, Timo Honkela, Krister Lind\u00e9n, Pekka Kauppinen, Tuula P\u00e4\u00e4kk\u00f6nen, Jukka Kervinen, et\u00a0al. 2014. Analyzing and improving the quality of a historical news collection using language technology and statistical machine learning methods. In Proceedings of the IFLA World Library and Information Congress (IFLA\u201914)."},{"key":"e_1_2_2_76_1","volume-title":"Proceedings of the Australasian Language Technology Association Workshop","author":"Khirbat Gitansh","year":"2017","unstructured":"Gitansh Khirbat . 2017 . OCR post-processing text correction using simulated annealing (OPTeCA) . In Proceedings of the Australasian Language Technology Association Workshop 2017. 119\u2013123. Gitansh Khirbat. 2017. OCR post-processing text correction using simulated annealing (OPTeCA). In Proceedings of the Australasian Language Technology Association Workshop 2017. 119\u2013123."},{"key":"e_1_2_2_77_1","volume-title":"Vecchi","author":"Kirkpatrick Scott","year":"1983","unstructured":"Scott Kirkpatrick , C. Daniel Gelatt , and Mario P . Vecchi . 1983 . Optimization by simulated annealing. Science 220, 4598 (1983), 671\u2013680. Scott Kirkpatrick, C. Daniel Gelatt, and Mario P. Vecchi. 1983. Optimization by simulated annealing. 
Science 220, 4598 (1983), 671\u2013680."},{"key":"e_1_2_2_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/DAS.2016.44"},{"key":"e_1_2_2_79_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-4012"},{"key":"e_1_2_2_80_1","doi-asserted-by":"publisher","DOI":"10.3115\/1557769.1557821"},{"key":"e_1_2_2_81_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073445.1073463"},{"key":"e_1_2_2_82_1","doi-asserted-by":"publisher","DOI":"10.3115\/1289189.1289208"},{"key":"e_1_2_2_83_1","doi-asserted-by":"publisher","DOI":"10.3115\/1220575.1220684"},{"key":"e_1_2_2_84_1","volume-title":"Computational Analysis of Present-Day American English","author":"Kucera Henry","unstructured":"Henry Ku\u010dera and Winthrop Nelson Francis. 1967. Computational Analysis of Present-Day American English. Brown University Press."},{"key":"e_1_2_2_85_1","unstructured":"Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady."},{"key":"e_1_2_2_86_1","doi-asserted-by":"publisher","DOI":"10.1093\/bib\/bbq015"},{"key":"e_1_2_2_87_1","doi-asserted-by":"publisher","DOI":"10.1117\/12.450731"},{"key":"e_1_2_2_88_1","volume-title":"Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings. IEEE, 284\u2013288","author":"Lin Xiaofan","year":"2003","unstructured":"Xiaofan Lin. 2003. Impact of imperfect OCR on part-of-speech tagging. 
In Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings. IEEE, 284\u2013288."},{"key":"e_1_2_2_89_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR.2010.498"},{"key":"e_1_2_2_90_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10032-009-0094-8"},{"key":"e_1_2_2_91_1","doi-asserted-by":"publisher","DOI":"10.1006\/cviu.1996.0502"},{"key":"e_1_2_2_92_1","volume-title":"Ringger","author":"Lund William B.","year":"2013","unstructured":"William B. Lund , Douglas J. Kennard , and Eric K . Ringger . 2013 . Combining multiple thresholding binarization values to improve OCR output. In Document Recognition and Retrieval XX (SPIE Proceedings), Vol. 8658 . SPIE , 86580R. William B. Lund, Douglas J. Kennard, and Eric K. Ringger. 2013. Combining multiple thresholding binarization values to improve OCR output. In Document Recognition and Retrieval XX (SPIE Proceedings), Vol. 8658. SPIE, 86580R."},{"key":"e_1_2_2_93_1","volume-title":"Proceedings of the 9th ACM\/IEEE-CS joint conference on Digital libraries. ACM, 231\u2013240","author":"William","unstructured":"William B. Lund and Eric K. Ringger. 2009. Improving optical character recognition through efficient multiple system alignment . In Proceedings of the 9th ACM\/IEEE-CS joint conference on Digital libraries. ACM, 231\u2013240 . William B. Lund and Eric K. Ringger. 2009. Improving optical character recognition through efficient multiple system alignment. In Proceedings of the 9th ACM\/IEEE-CS joint conference on Digital libraries. ACM, 231\u2013240."},{"key":"e_1_2_2_94_1","volume-title":"Proceedings of the 2011 International Conference on Document Analysis and Recognition. IEEE, 658\u2013662","author":"William","unstructured":"William B. Lund and Eric K. Ringger. 2011. Error correction with in-domain training across multiple OCR system outputs . In Proceedings of the 2011 International Conference on Document Analysis and Recognition. IEEE, 658\u2013662 . William B. 
Lund and Eric K. Ringger. 2011. Error correction with in-domain training across multiple OCR system outputs. In Proceedings of the 2011 International Conference on Document Analysis and Recognition. IEEE, 658\u2013662."},{"key":"e_1_2_2_95_1","volume-title":"How well does multiple OCR error correction generalize? In Document Recognition and Retrieval XXI","author":"Lund William B.","unstructured":"William B. Lund , Eric K. Ringger , and Daniel David Walker . 2014. How well does multiple OCR error correction generalize? In Document Recognition and Retrieval XXI , Vol. 9021 . SPIE , 76\u201388. William B. Lund, Eric K. Ringger, and Daniel David Walker. 2014. How well does multiple OCR error correction generalize? In Document Recognition and Retrieval XXI, Vol. 9021. SPIE, 76\u201388."},{"key":"e_1_2_2_96_1","volume-title":"Proceedings of the 2011 International Conference on Document Analysis and Recognition. IEEE, 764\u2013768","author":"Lund William B.","unstructured":"William B. Lund , Daniel D. Walker , and Eric K. Ringger . 2011. Progressive alignment and discriminative error correction for multiple OCR engines . In Proceedings of the 2011 International Conference on Document Analysis and Recognition. IEEE, 764\u2013768 . William B. Lund, Daniel D. Walker, and Eric K. Ringger. 2011. Progressive alignment and discriminative error correction for multiple OCR engines. In Proceedings of the 2011 International Conference on Document Analysis and Recognition. IEEE, 764\u2013768."},{"key":"e_1_2_2_97_1","volume-title":"Proceedings of the Actes de la Conf\u00e9rence TALN (CORIA-TALN-RJC\u201918)","volume":"1","author":"Magallon Thibault","year":"2018","unstructured":"Thibault Magallon , Fr\u00e9d\u00e9ric B\u00e9chet , and Beno\u00eet Favre . 2018 . Combining character level and word level RNNs for post-OCR error detection . In Proceedings of the Actes de la Conf\u00e9rence TALN (CORIA-TALN-RJC\u201918) , Volume 1 . ATALA, 233\u2013240. 
Thibault Magallon, Fr\u00e9d\u00e9ric B\u00e9chet, and Beno\u00eet Favre. 2018. Combining character level and word level RNNs for post-OCR error detection. In Proceedings of the Actes de la Conf\u00e9rence TALN (CORIA-TALN-RJC\u201918), Volume 1. ATALA, 233\u2013240."},{"key":"e_1_2_2_98_1","doi-asserted-by":"publisher","DOI":"10.1145\/3103010.3121032"},{"key":"e_1_2_2_99_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2018.06.001"},{"key":"e_1_2_2_100_1","doi-asserted-by":"publisher","DOI":"10.1126\/science.1199644"},{"key":"e_1_2_2_101_1","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.","author":"Mieskes Margot","year":"2019","unstructured":"Margot Mieskes and Stefan Schmunk . 2019 . OCR quality and NLP preprocessing . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Margot Mieskes and Stefan Schmunk. 2019. OCR quality and NLP preprocessing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics."},{"key":"e_1_2_2_102_1","volume-title":"Proceedings of Workshop on International Proofing Tools and Language Technologies","author":"Mihov Stoyan","year":"2004","unstructured":"Stoyan Mihov , Svetla Koeva , Christoph Ringlstetter , Klaus U. Schulz , and Christian Strohmaier . 2004 . Precise and efficient text correction using levenshtein automata, dynamic Web dictionaries and optimized correction models . In Proceedings of Workshop on International Proofing Tools and Language Technologies (2004). Stoyan Mihov, Svetla Koeva, Christoph Ringlstetter, Klaus U. Schulz, and Christian Strohmaier. 2004. Precise and efficient text correction using levenshtein automata, dynamic Web dictionaries and optimized correction models. 
In Proceedings of Workshop on International Proofing Tools and Language Technologies (2004)."},{"key":"e_1_2_2_103_1","volume-title":"Proceedings of the 1st International Conference on Learning Representations (ICLR\u201913)","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov , Kai Chen , Greg Corrado , and Jeffrey Dean . 2013 . Efficient estimation of word representations in vector space . In Proceedings of the 1st International Conference on Learning Representations (ICLR\u201913) . Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations (ICLR\u201913)."},{"key":"e_1_2_2_104_1","doi-asserted-by":"publisher","DOI":"10.3115\/974147.974191"},{"key":"e_1_2_2_105_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1026564708926"},{"key":"e_1_2_2_106_1","doi-asserted-by":"publisher","DOI":"10.1109\/DAS.2018.63"},{"key":"e_1_2_2_107_1","volume-title":"Proceedings of the Australasian Language Technology Association Workshop","author":"Molla Diego","year":"2017","unstructured":"Diego Molla and Steve Cassidy . 2017 . Overview of the 2017 ALTA shared task: Correcting ocr errors . In Proceedings of the Australasian Language Technology Association Workshop 2017. 115\u2013118. Diego Molla and Steve Cassidy. 2017. Overview of the 2017 ALTA shared task: Correcting ocr errors. In Proceedings of the Australasian Language Technology Association Workshop 2017. 115\u2013118."},{"key":"e_1_2_2_108_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-04257-8_1"},{"key":"e_1_2_2_109_1","volume-title":"Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC\u201918)","author":"Nastase Vivi","year":"2018","unstructured":"Vivi Nastase and Julian Hitschler . 2018 . Correction of OCR word segmentation errors in articles from the ACL collection through neural machine translation methods . 
In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC\u201918) . European Language Resources Association (ELRA). Vivi Nastase and Julian Hitschler. 2018. Correction of OCR word segmentation errors in articles from the ACL collection through neural machine translation methods. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC\u201918). European Language Resources Association (ELRA)."},{"key":"e_1_2_2_110_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2015.06.022"},{"key":"e_1_2_2_111_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICFHR.2010.126"},{"key":"e_1_2_2_112_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10044-020-00936-y"},{"key":"e_1_2_2_113_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-04257-8_29"},{"key":"e_1_2_2_114_1","doi-asserted-by":"publisher","DOI":"10.1109\/JCDL.2019.00015"},{"key":"e_1_2_2_115_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2019.00145"},{"key":"e_1_2_2_116_1","doi-asserted-by":"publisher","DOI":"10.1145\/3383583.3398605"},{"key":"e_1_2_2_117_1","volume-title":"Unsupervised post-correction of OCR errors. Master\u2019s thesis. Leibniz Universit\u00e4t Hannover","author":"Niklas Kai","year":"2010","unstructured":"Kai Niklas . 2010. Unsupervised post-correction of OCR errors. Master\u2019s thesis. Leibniz Universit\u00e4t Hannover ( 2010 ). Kai Niklas. 2010. Unsupervised post-correction of OCR errors. Master\u2019s thesis. Leibniz Universit\u00e4t Hannover (2010)."},{"key":"e_1_2_2_118_1","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311\u2013318","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni , Salim Roukos , Todd Ward , and Wei-Jing Zhu . 2002 . BLEU: a method for automatic evaluation of machine translation . In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311\u2013318 . 
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311\u2013318."},{"key":"e_1_2_2_119_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_2_2_120_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_2_2_121_1","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914)","author":"Pennington Jeffrey","unstructured":"Jeffrey Pennington , Richard Socher , and Christopher D. Manning . 2014. Glove: Global vectors for word representation . In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914) . ACL, 1532\u20131543. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914). ACL, 1532\u20131543."},{"key":"e_1_2_2_122_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR.2000.902944"},{"key":"e_1_2_2_123_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICFHR.2010.99"},{"key":"e_1_2_2_124_1","volume-title":"Proceedings of the 1st Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA\u201920)","author":"Poncelas Alberto","year":"2020","unstructured":"Alberto Poncelas , Mohammad Aboomar , Jan Buts , James Hadley , and Andy Way . 2020 . A tool for facilitating ocr postediting in historical documents . In Proceedings of the 1st Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA\u201920) . 47\u201351. Alberto Poncelas, Mohammad Aboomar, Jan Buts, James Hadley, and Andy Way. 2020. A tool for facilitating ocr postediting in historical documents. 
In Proceedings of the 1st Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA\u201920). 47\u201351."},{"key":"e_1_2_2_125_1","volume-title":"Proceedings of the International Conference on Asian Digital Libraries. Springer, 102\u2013115","author":"Pontes Elvys Linhares","year":"2019","unstructured":"Elvys Linhares Pontes , Ahmed Hamdi , Nicolas Sidere , and Antoine Doucet . 2019 . Impact of OCR quality on named entity linking . In Proceedings of the International Conference on Asian Digital Libraries. Springer, 102\u2013115 . Elvys Linhares Pontes, Ahmed Hamdi, Nicolas Sidere, and Antoine Doucet. 2019. Impact of OCR quality on named entity linking. In Proceedings of the International Conference on Asian Digital Libraries. Springer, 102\u2013115."},{"key":"e_1_2_2_126_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10032-009-0091-y"},{"key":"e_1_2_2_127_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2012.10.002"},{"key":"e_1_2_2_128_1","volume-title":"Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45\u201350","author":"\u0158eh\u016f\u0159ek Radim","year":"2010","unstructured":"Radim \u0158eh\u016f\u0159ek and Petr Sojka . 2010 . Software framework for topic modelling with large corpora . In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45\u201350 . Radim \u0158eh\u016f\u0159ek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45\u201350."},{"key":"e_1_2_2_129_1","doi-asserted-by":"publisher","DOI":"10.1109\/DAS.2018.30"},{"key":"e_1_2_2_130_1","volume-title":"Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201908)","author":"Reynaert Martin","year":"2008","unstructured":"Martin Reynaert . 2008 . 
All, and only, the errors: More complete and consistent spelling and ocr-error correction evaluation . In Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201908) . European Language Resources Association. Martin Reynaert. 2008. All, and only, the errors: More complete and consistent spelling and ocr-error correction evaluation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201908). European Language Resources Association."},{"key":"e_1_2_2_131_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-78135-6_53"},{"key":"e_1_2_2_132_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10032-010-0133-5"},{"key":"e_1_2_2_134_1","volume-title":"Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC\u201918)","author":"Richter Caitlin","year":"2018","unstructured":"Caitlin Richter , Matthew Wickes , Deniz Beser , and Mitch Marcus . 2018 . Low-resource post processing of noisy OCR output for historical corpus digitisation . In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC\u201918) . Caitlin Richter, Matthew Wickes, Deniz Beser, and Mitch Marcus. 2018. Low-resource post processing of noisy OCR output for historical corpus digitisation. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC\u201918)."},{"key":"e_1_2_2_135_1","volume-title":"ICDAR 2019 competition on post-OCR text correction. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR\u201919)","author":"Rigaud Christophe","year":"2019","unstructured":"Christophe Rigaud , Antoine Doucet , Micka\u00ebl Coustaty , and Jean-Philippe Moreux . 2019 . ICDAR 2019 competition on post-OCR text correction. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR\u201919) . IEEE, 1588\u20131593. 
Christophe Rigaud, Antoine Doucet, Micka\u00ebl Coustaty, and Jean-Philippe Moreux. 2019. ICDAR 2019 competition on post-OCR text correction. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR\u201919). IEEE, 1588\u20131593."},{"key":"e_1_2_2_136_1","volume-title":"Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI\u201907)","author":"Ringlstetter Christoph","year":"2007","unstructured":"Christoph Ringlstetter , Max Hadersbeck , Klaus U. Schulz , and Stoyan Mihov . 2007 . Text correction using domain dependent bigram models from web crawls . In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI\u201907) Workshop on Analytics for Noisy Unstructured Text Data. Christoph Ringlstetter, Max Hadersbeck, Klaus U. Schulz, and Stoyan Mihov. 2007. Text correction using domain dependent bigram models from web crawls. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI\u201907) Workshop on Analytics for Noisy Unstructured Text Data."},{"key":"e_1_2_2_137_1","doi-asserted-by":"publisher","DOI":"10.1145\/1289600.1289602"},{"key":"e_1_2_2_138_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2017.308"},{"key":"e_1_2_2_139_1","doi-asserted-by":"publisher","DOI":"10.1109\/DAS.2014.12"},{"key":"e_1_2_2_140_1","volume-title":"Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. 52\u201357","author":"Schaefer Robin","year":"2020","unstructured":"Robin Schaefer and Clemens Neudecker . 2020 . A two-step approach for automatic OCR post-correction . In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. 52\u201357 . Robin Schaefer and Clemens Neudecker. 2020. A two-step approach for automatic OCR post-correction. 
In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. 52\u201357."},{"key":"e_1_2_2_141_1","volume-title":"Proceedings of the 26th International Conference on Computational Linguistics (COLING\u201916)","author":"Schnober Carsten","year":"2016","unstructured":"Carsten Schnober , Steffen Eger , Erik-L\u00e2n Do Dinh , and Iryna Gurevych . 2016 . Still not there? comparing traditional sequence-to-sequence models to encoder-decoder neural networks on monotone string translation tasks . In Proceedings of the 26th International Conference on Computational Linguistics (COLING\u201916) . ACL, 1703\u20131714. Carsten Schnober, Steffen Eger, Erik-L\u00e2n Do Dinh, and Iryna Gurevych. 2016. Still not there? comparing traditional sequence-to-sequence models to encoder-decoder neural networks on monotone string translation tasks. In Proceedings of the 26th International Conference on Computational Linguistics (COLING\u201916). ACL, 1703\u20131714."},{"key":"e_1_2_2_142_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2007.4378754"},{"key":"e_1_2_2_143_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1288"},{"key":"e_1_2_2_144_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/E17-3017"},{"key":"e_1_2_2_145_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W16-2406"},{"key":"e_1_2_2_146_1","volume-title":"Smith and Ryan Cordell","author":"David","year":"2018","unstructured":"David A. Smith and Ryan Cordell . 2018 . A Research Agenda for Historical and Multilingual Optical Character Recognition. NUlab, Northeastern University ( 2018). David A. Smith and Ryan Cordell. 2018. A Research Agenda for Historical and Multilingual Optical Character Recognition. 
NUlab, Northeastern University (2018)."},{"key":"e_1_2_2_147_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-2513"},{"key":"e_1_2_2_148_1","doi-asserted-by":"crossref","first-page":"97","DOI":"10.21248\/jlcl.33.2018.220","article-title":"Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin","volume":"33","author":"Springmann Uwe","year":"2018","unstructured":"Uwe Springmann, Christian Reul, Stefanie Dipper, and Johannes Baiter. 2018. Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. J. Lang. Technol. Comput. Ling. 33, 1 (2018), 97\u2013114.","journal-title":"J. Lang. Technol. Comput. Ling."},{"key":"e_1_2_2_149_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2003.10031"},{"key":"e_1_2_2_150_1","volume-title":"Proceedings of the 7th International Conference on Document Analysis and Recognition. Citeseer, 1133\u20131137","author":"Strohmaier Christian M.","year":"2003","unstructured":"Christian M. Strohmaier, Christoph Ringlstetter, Klaus U. Schulz, and Stoyan Mihov. 2003. Lexical postcorrection of OCR-results: The web as a dynamic secondary dictionary?. In Proceedings of the 7th International Conference on Document Analysis and Recognition. 
Citeseer, 1133\u20131137."},{"key":"e_1_2_2_151_1","volume-title":"Document Recognition and Retrieval XXI","volume":"9021","author":"Taghva Kazem","year":"2014","unstructured":"Kazem Taghva and Shivam Agarwal . 2014 . Utilizing web data in identification and correction of OCR errors . In Document Recognition and Retrieval XXI , Vol. 9021 . International Society for Optics and Photonics, 902109. Kazem Taghva and Shivam Agarwal. 2014. Utilizing web data in identification and correction of OCR errors. In Document Recognition and Retrieval XXI, Vol. 9021. International Society for Optics and Photonics, 902109."},{"key":"e_1_2_2_152_1","doi-asserted-by":"publisher","DOI":"10.5555\/184656.180373"},{"key":"e_1_2_2_153_1","doi-asserted-by":"publisher","DOI":"10.1117\/12.304631"},{"key":"e_1_2_2_154_1","doi-asserted-by":"publisher","DOI":"10.1007\/PL00013558"},{"key":"e_1_2_2_155_1","volume-title":"SOUR CREAM: Toward Semantic Processing of Recipes. Technical Report CMU-LTI-08-005","author":"Tasse Dan","year":"2008","unstructured":"Dan Tasse and Noah A Smith . 2008 . SOUR CREAM: Toward Semantic Processing of Recipes. Technical Report CMU-LTI-08-005 (2008). Dan Tasse and Noah A Smith. 2008. SOUR CREAM: Toward Semantic Processing of Recipes. Technical Report CMU-LTI-08-005 (2008)."},{"key":"e_1_2_2_156_1","volume-title":"Proceedings of the Workshop on Computational Humanities Research (CHR\u201920)","volume":"2723","author":"Todorov Konstantin","year":"2020","unstructured":"Konstantin Todorov and Giovanni Colavizza . 2020 . Transfer learning for historical corpora: An assessment on post-OCR correction and named entity recognition . In Proceedings of the Workshop on Computational Humanities Research (CHR\u201920) , Vol. 2723 . CEUR-WS.org, 310\u2013339. Konstantin Todorov and Giovanni Colavizza. 2020. Transfer learning for historical corpora: An assessment on post-OCR correction and named entity recognition. 
In Proceedings of the Workshop on Computational Humanities Research (CHR\u201920), Vol. 2723. CEUR-WS.org, 310\u2013339."},{"key":"e_1_2_2_157_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01206331"},{"key":"e_1_2_2_158_1","doi-asserted-by":"publisher","DOI":"10.5220\/0009169004840496"},{"key":"e_1_2_2_159_1","volume-title":"Proceedings of the 1st International Conference on Digital Access to Textual Cultural Heritage. 57\u201361","author":"Vobl Thorsten","unstructured":"Thorsten Vobl , Annette Gotscharek , Uli Reffle , Christoph Ringlstetter , and Klaus U. Schulz . 2014. PoCoTo-an open source system for efficient interactive postcorrection of OCRed historical texts . In Proceedings of the 1st International Conference on Digital Access to Textual Cultural Heritage. 57\u201361 . Thorsten Vobl, Annette Gotscharek, Uli Reffle, Christoph Ringlstetter, and Klaus U. Schulz. 2014. PoCoTo-an open source system for efficient interactive postcorrection of OCRed historical texts. In Proceedings of the 1st International Conference on Digital Access to Textual Cultural Heritage. 57\u201361."},{"key":"e_1_2_2_160_1","volume-title":"Language Technology for Cultural Heritage","author":"Volk Martin","unstructured":"Martin Volk , Lenz Furrer , and Rico Sennrich . 2011. Strategies for reducing and correcting OCR errors . In Language Technology for Cultural Heritage . Springer , 3\u201322. Martin Volk, Lenz Furrer, and Rico Sennrich. 2011. Strategies for reducing and correcting OCR errors. In Language Technology for Cultural Heritage. Springer, 3\u201322."},{"key":"e_1_2_2_161_1","volume-title":"Proceedings of the ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH\u201910)","author":"Volk Martin","year":"2010","unstructured":"Martin Volk , Torsten Marek , and Rico Sennrich . 2010 . Reducing OCR errors by combining two OCR systems . 
In Proceedings of the ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH\u201910) . 61\u201365. Martin Volk, Torsten Marek, and Rico Sennrich. 2010. Reducing OCR errors by combining two OCR systems. In Proceedings of the ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH\u201910). 61\u201365."},{"key":"e_1_2_2_162_1","volume-title":"recaptcha: Human-based character recognition via web security measures. Science 321, 5895","author":"Ahn Luis Von","year":"2008","unstructured":"Luis Von Ahn , Benjamin Maurer , Colin McMillen , David Abraham , and Manuel Blum . 2008. recaptcha: Human-based character recognition via web security measures. Science 321, 5895 ( 2008 ), 1465\u20131468. Luis Von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum. 2008. recaptcha: Human-based character recognition via web security measures. Science 321, 5895 (2008), 1465\u20131468."},{"key":"e_1_2_2_163_1","volume-title":"Proceedings of the 2013 12th International Conference on Document Analysis and Recognition. IEEE, 160\u2013164","author":"Wemhoener David","unstructured":"David Wemhoener , Ismet Zeki Yalniz , and R. Manmatha . 2013. Creating an improved version using noisy OCR from multiple editions . In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition. IEEE, 160\u2013164 . David Wemhoener, Ismet Zeki Yalniz, and R. Manmatha. 2013. Creating an improved version using noisy OCR from multiple editions. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition. IEEE, 160\u2013164."},{"key":"e_1_2_2_164_1","volume-title":"Calamari - A high-performance tensorflow-based deep learning package for optical character recognition. Digit. Humanit. Q. 14, 2","author":"Wick Christoph","year":"2020","unstructured":"Christoph Wick , Christian Reul , and Frank Puppe . 2020. 
Calamari - A high-performance tensorflow-based deep learning package for optical character recognition. Digit. Humanit. Q. 14, 2 ( 2020 ). Christoph Wick, Christian Reul, and Frank Puppe. 2020. Calamari - A high-performance tensorflow-based deep learning package for optical character recognition. Digit. Humanit. Q. 14, 2 (2020)."},{"key":"e_1_2_2_165_1","volume-title":"Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR\u201907)","author":"Wick Michael L.","unstructured":"Michael L. Wick , Michael G. Ross , and Erik G . Learned-Miller. 2007. Context-sensitive error correction: Using topic models to improve OCR . In Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR\u201907) . IEEE Computer Society, 1168\u20131172. Michael L. Wick, Michael G. Ross, and Erik G. Learned-Miller. 2007. Context-sensitive error correction: Using topic models to improve OCR. In Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR\u201907). IEEE Computer Society, 1168\u20131172."},{"key":"e_1_2_2_166_1","unstructured":"L. Wilms R. Nijssen and T. Koster. 2020. Historical newspaper OCR ground-truth data set. KB Lab: The Hague.  L. Wilms R. Nijssen and T. Koster. 2020. Historical newspaper OCR ground-truth data set. 
KB Lab: The Hague."},{"key":"e_1_2_2_167_1","doi-asserted-by":"publisher","DOI":"10.1109\/JCDL.2017.7991587"},{"key":"e_1_2_2_168_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2011.157"},{"key":"e_1_2_2_169_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-30483-8_49"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3453476","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3453476","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:03:07Z","timestamp":1750197787000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3453476"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,13]]},"references-count":167,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2022,7,31]]}},"alternative-id":["10.1145\/3453476"],"URL":"https:\/\/doi.org\/10.1145\/3453476","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,13]]},"assertion":[{"value":"2020-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-07-13","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}