{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,4]],"date-time":"2026-06-04T15:54:44Z","timestamp":1780588484625,"version":"3.54.1"},"reference-count":48,"publisher":"MIT Press - Journals","license":[{"start":{"date-parts":[[2021,11,24]],"date-time":"2021-11-24T00:00:00Z","timestamp":1637712000000},"content-version":"vor","delay-in-days":327,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,11,22]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. 
Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15%\u201329%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements.<\/jats:p>","DOI":"10.1162\/tacl_a_00427","type":"journal-article","created":{"date-parts":[[2021,11,24]],"date-time":"2021-11-24T18:52:52Z","timestamp":1637779972000},"page":"1285-1302","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":6,"title":["Lexically Aware Semi-Supervised Learning for OCR Post-Correction"],"prefix":"10.1162","volume":"9","author":[{"given":"Shruti","family":"Rijhwani","sequence":"first","affiliation":[{"name":"Language Technologies Institute, Carnegie Mellon University, USA. srijhwan@cs.cmu.edu"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Daisy","family":"Rosenblum","sequence":"additional","affiliation":[{"name":"University of British Columbia, Canada. daisy.rosenblum@ubc.ca"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Antonios","family":"Anastasopoulos","sequence":"additional","affiliation":[{"name":"Department of Computer Science, George Mason University, USA. antonis@gmu.edu"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Graham","family":"Neubig","sequence":"additional","affiliation":[{"name":"Language Technologies Institute, Carnegie Mellon University, USA. 
gneubig@cs.cmu.edu"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"281","published-online":{"date-parts":[[2021,11,22]]},"reference":[{"key":"2021121611062917400_bib1","doi-asserted-by":"publisher","first-page":"1557","DOI":"10.18653\/v1\/D16-1162","article-title":"Incorporating discrete translation lexicons into neural machine translation","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing","author":"Arthur","year":"2016"},{"key":"2021121611062917400_bib2","article-title":"Neural machine translation by jointly learning to align and translate","volume-title":"3rd International Conference on Learning Representations, ICLR 2015","author":"Bahdanau","year":"2015"},{"key":"2021121611062917400_bib3","first-page":"207","article-title":"Unsupervised transcription of historical documents","volume-title":"Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Berg-Kirkpatrick","year":"2013"},{"issue":"4","key":"2021121611062917400_bib4","doi-asserted-by":"publisher","first-page":"708","DOI":"10.1525\/aa.1900.2.4.02a00080","article-title":"Sketch of the Kwakiutl language","volume":"2","author":"Boas","year":"1900","journal-title":"American Anthropologist"},{"key":"2021121611062917400_bib5","doi-asserted-by":"publisher","DOI":"10.1006\/csla.1999.0128","volume-title":"Ethnology of the Kwakiutl","author":"Boas","year":"1921"},{"issue":"4","key":"2021121611062917400_bib6","doi-asserted-by":"crossref","first-page":"359","DOI":"10.1006\/csla.1999.0128","article-title":"An empirical study of smoothing techniques for language modeling","volume":"13","author":"Chen","year":"1999","journal-title":"Computer Speech & Language"},{"key":"2021121611062917400_bib7","doi-asserted-by":"publisher","first-page":"876","DOI":"10.18653\/v1\/N16-1102","article-title":"Incorporating structural alignment biases into an attentional neural translation 
model","volume-title":"Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Cohn","year":"2016"},{"key":"2021121611062917400_bib8","first-page":"3079","article-title":"Semi-supervised sequence learning","volume-title":"Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2","author":"Dai","year":"2015"},{"key":"2021121611062917400_bib9","doi-asserted-by":"publisher","first-page":"2363","DOI":"10.18653\/v1\/P18-1220","article-title":"Multi-input attention for unsupervised OCR correction","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Dong","year":"2018"},{"key":"2021121611062917400_bib10","article-title":"MFST: A python openfst wrapper with support for custom semirings and jupyter notebooks","author":"Francis-Landau","year":"2020"},{"key":"2021121611062917400_bib11","doi-asserted-by":"publisher","first-page":"161","DOI":"10.1109\/ICDAR.2017.35","article-title":"Sequence-to-label script identification for multilingual OCR","volume-title":"2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)","author":"Fujii","year":"2017"},{"key":"2021121611062917400_bib12","first-page":"1631","article-title":"Incorporating copying mechanism in sequence-to-sequence learning","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Jiatao","year":"2016"},{"key":"2021121611062917400_bib13","doi-asserted-by":"publisher","first-page":"431","DOI":"10.26615\/978-954-452-056-4_051","article-title":"From the paft to the fiiture: A fully automatic NMT and word embeddings method for OCR post-correction","volume-title":"Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 
2019)","author":"H\u00e4m\u00e4l\u00e4inen","year":"2019"},{"key":"2021121611062917400_bib14","article-title":"Revisiting self-training for neural sequence generation","author":"He","year":"2020","journal-title":"8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Conference Track Proceedings"},{"key":"2021121611062917400_bib15","first-page":"690","article-title":"Scalable modified Kneser-Ney language model estimation","volume-title":"Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Heafield","year":"2013"},{"issue":"8","key":"2021121611062917400_bib16","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Computation"},{"key":"2021121611062917400_bib17","doi-asserted-by":"publisher","first-page":"1535","DOI":"10.18653\/v1\/P17-1141","article-title":"Lexically constrained decoding for sequence generation using grid beam search","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Hokamp","year":"2017"},{"key":"2021121611062917400_bib18","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2019.00013","article-title":"A scalable handwritten text recognition system","author":"Reeve Ingle","year":"2019","journal-title":"arXiv preprint arXiv:1904.09150"},{"key":"2021121611062917400_bib19","doi-asserted-by":"publisher","first-page":"6282","DOI":"10.18653\/v1\/2020.acl-main.560","article-title":"The state and fate of linguistic diversity and inclusion in the NLP world","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Joshi","year":"2020"},{"key":"2021121611062917400_bib20","volume-title":"Ainu Jojishi Y\u016bkara no Kenky\u016b [Research on Ainu 
Epic Yukar]","author":"Kindaichi","year":"1931"},{"key":"2021121611062917400_bib21","doi-asserted-by":"crossref","first-page":"181","DOI":"10.1109\/ICASSP.1995.479394","article-title":"Improved backing-off for m-gram language modeling","volume-title":"1995 International Conference on Acoustics, Speech, and Signal Processing","author":"Kneser","year":"1995"},{"key":"2021121611062917400_bib22","doi-asserted-by":"publisher","first-page":"867","DOI":"10.3115\/1220575.1220684","article-title":"OCR post-processing for low density languages","volume-title":"Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing","author":"Kolak","year":"2005"},{"key":"2021121611062917400_bib23","doi-asserted-by":"publisher","first-page":"345","DOI":"10.18653\/v1\/K18-1034","article-title":"Upcycle your OCR: Reusing OCRs for post-OCR text correction in Romanised Sanskrit","volume-title":"Proceedings of the 22nd Conference on Computational Natural Language Learning","author":"Krishna","year":"2018"},{"key":"2021121611062917400_bib24","article-title":"Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks","volume-title":"Workshop on Challenges in Representation Learning, ICML","author":"Lee","year":"2013"},{"key":"2021121611062917400_bib25","first-page":"272","article-title":"Neural finite-state transducers: Beyond rational relations","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Lin","year":"2019"},{"key":"2021121611062917400_bib26","doi-asserted-by":"crossref","first-page":"955","DOI":"10.18653\/v1\/D16-1096","article-title":"Coverage embedding models for neural machine translation","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language 
Processing","author":"Mi","year":"2016"},{"issue":"1","key":"2021121611062917400_bib27","doi-asserted-by":"publisher","first-page":"61","DOI":"10.1017\/S135132499600126X","article-title":"On some applications of finite-state automata theory to natural language processing","volume":"2","author":"Mohri","year":"1996","journal-title":"Natural Language Engineering"},{"issue":"1","key":"2021121611062917400_bib28","doi-asserted-by":"publisher","first-page":"69","DOI":"10.1006\/csla.2001.0184","article-title":"Weighted finite-state transducers in speech recognition","volume":"16","author":"Mohri","year":"2002","journal-title":"Computer Speech & Language"},{"key":"2021121611062917400_bib29","article-title":"DyNet: The dynamic neural network toolkit","author":"Neubig","year":"2017","journal-title":"arXiv preprint arXiv:1701.03980"},{"key":"2021121611062917400_bib30","unstructured":"Kai\n              Niklas\n            \n          . 2010. Unsupervised post-correction of OCR errors. Master\u2019s thesis. 
Leibniz Universit\u00e4t Hannover."},{"key":"2021121611062917400_bib31","doi-asserted-by":"publisher","first-page":"1314","DOI":"10.18653\/v1\/N18-1119","article-title":"Fast lexically constrained decoding with dynamic beam allocation for neural machine translation","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Post","year":"2018"},{"key":"2021121611062917400_bib32","doi-asserted-by":"publisher","first-page":"383","DOI":"10.18653\/v1\/D17-1039","article-title":"Unsupervised pretraining for sequence to sequence learning","volume-title":"Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing","author":"Ramachandran","year":"2017"},{"key":"2021121611062917400_bib33","doi-asserted-by":"publisher","first-page":"623","DOI":"10.18653\/v1\/N16-1076","article-title":"Weighting finite-state transductions with neural context","volume-title":"Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Rastogi","year":"2016"},{"key":"2021121611062917400_bib34","doi-asserted-by":"publisher","first-page":"1588","DOI":"10.1109\/ICDAR.2019.00255","article-title":"ICDAR 2019 competition on post-OCR text correction","volume-title":"2019 International Conference on Document Analysis and Recognition (ICDAR)","author":"Rigaud","year":"2019"},{"key":"2021121611062917400_bib35","doi-asserted-by":"publisher","first-page":"5931","DOI":"10.18653\/v1\/2020.emnlp-main.478","article-title":"OCR Post Correction for Endangered Language Texts","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing 
(EMNLP)","author":"Rijhwani","year":"2020"},{"key":"2021121611062917400_bib36","doi-asserted-by":"publisher","first-page":"198","DOI":"10.18653\/v1\/2021.sigmorphon-1.22","article-title":"Comparative error analysis in neural and finite-state models for unsupervised character-level transduction","volume-title":"Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology","author":"Ryskina","year":"2021"},{"key":"2021121611062917400_bib37","article-title":"Documentation and grammatical description of Yakkha, Nepal","author":"Schackow","year":"2012"},{"key":"2021121611062917400_bib38","first-page":"1703","article-title":"Still not there? Comparing traditional sequence-to-sequence models to encoder-decoder neural networks on monotone string translation tasks","volume-title":"Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers","author":"Schnober","year":"2016"},{"key":"2021121611062917400_bib39","doi-asserted-by":"publisher","first-page":"2716","DOI":"10.18653\/v1\/D17-1288","article-title":"Multi-modular domain-tailored OCR post-correction","volume-title":"Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing","author":"Schulz","year":"2017"},{"key":"2021121611062917400_bib40","doi-asserted-by":"crossref","first-page":"1073","DOI":"10.18653\/v1\/P17-1099","article-title":"Get to the point: Summarization with pointer-generator networks","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"See","year":"2017"},{"key":"2021121611062917400_bib41","doi-asserted-by":"publisher","DOI":"10.1086\/464321","volume-title":"Racconti greci inediti di Sternat\u00eda","author":"Stomeo","year":"1980"},{"issue":"2","key":"2021121611062917400_bib42","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1086\/464321","article-title":"Towards greater accuracy in 
lexicostatistic dating","volume":"21","author":"Swadesh","year":"1955","journal-title":"International Journal of American Linguistics"},{"key":"2021121611062917400_bib43","article-title":"A statistical approach to automatic OCR error correction in context","volume-title":"Fourth Workshop on Very Large Corpora","author":"Tong","year":"1996"},{"key":"2021121611062917400_bib44","first-page":"76","article-title":"Modeling coverage for neural machine translation","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Zhaopeng","year":"2016"},{"key":"2021121611062917400_bib45","doi-asserted-by":"publisher","first-page":"189","DOI":"10.3115\/981658.981684","article-title":"Unsupervised word sense disambiguation rivaling supervised methods","volume-title":"33rd Annual Meeting of the Association for Computational Linguistics","author":"Yarowsky","year":"1995"},{"key":"2021121611062917400_bib46","doi-asserted-by":"publisher","first-page":"1325","DOI":"10.18653\/v1\/N18-1120","article-title":"Guiding neural machine translation with retrieved translation pieces","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Zhang","year":"2018"},{"issue":"1","key":"2021121611062917400_bib47","doi-asserted-by":"crossref","first-page":"1","DOI":"10.2200\/S00196ED1V01Y200906AIM006","article-title":"Introduction to semi-supervised learning","volume":"3","author":"Zhu","year":"2009","journal-title":"Synthesis Lectures on Artificial Intelligence and Machine Learning"},{"key":"2021121611062917400_bib48","doi-asserted-by":"publisher","DOI":"10.2200\/S00196ED1V01Y200906AIM006","article-title":"Rethinking pre-training and self-training","volume":"33","author":"Zoph","year":"2020","journal-title":"Advances in Neural Information Processing 
Systems"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00427\/1974763\/tacl_a_00427.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00427\/1974763\/tacl_a_00427.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,12,16]],"date-time":"2021-12-16T11:11:34Z","timestamp":1639653094000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00427\/108475\/Lexically-Aware-Semi-Supervised-Learning-for-OCR"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021]]},"references-count":48,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00427","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021]]},"published":{"date-parts":[[2021]]}}}