{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,12]],"date-time":"2026-06-12T23:33:05Z","timestamp":1781307185496,"version":"3.54.1"},"reference-count":34,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T00:00:00Z","timestamp":1740096000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T00:00:00Z","timestamp":1740096000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Digit Libr"],"published-print":{"date-parts":[[2025,3]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>The digitization of historical documents is crucial for preserving a society's cultural heritage. An essential step in this process is converting scanned images to text using Optical Character Recognition (OCR), which enables downstream tasks such as search and information extraction. Unfortunately, this is a difficult problem, as standard OCR tools are not tailored to historical orthography or challenging layouts. Thus, it is standard practice to apply an additional text correction step to the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating OCR text correction on historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literary texts. 
We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with a diagonal attention loss and copy and coverage mechanisms, to improve post-OCR text correction. The proposed method reduces the errors introduced during recognition. It improves the quality of the documents by 25%, which is an increase of 16% compared to the state of the art on the ICDAR 2019 Bulgarian dataset. We release our data and code at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/angelbeshirov\/post-ocr-text-correction\" ext-link-type=\"uri\">https:\/\/github.com\/angelbeshirov\/post-ocr-text-correction<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s00799-025-00415-x","type":"journal-article","created":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T06:53:55Z","timestamp":1740120835000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Post-OCR text correction for Bulgarian historical 
documents"],"prefix":"10.1007","volume":"26","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0684-2730","authenticated-orcid":false,"given":"Angel","family":"Beshirov","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2579-7541","authenticated-orcid":false,"given":"Milena","family":"Dobreva","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1308-180X","authenticated-orcid":false,"given":"Dimitar","family":"Dimitrov","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8095-3570","authenticated-orcid":false,"given":"Momchil","family":"Hardalov","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3919-030X","authenticated-orcid":false,"given":"Ivan","family":"Koychev","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3600-1510","authenticated-orcid":false,"given":"Preslav","family":"Nakov","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2025,2,21]]},"reference":[{"key":"415_CR1","doi-asserted-by":"crossref","unstructured":"David, Miller., Sean, Boisen., Richard, Schwartz., Rebecca, Stone., Ralph, Weischede.: Named entity extraction from noisy input: Speech and OCR. In Proceedings of Sixth Applied Natural Language Processing Conference, pages 316\u2013324, (2000)","DOI":"10.3115\/974147.974191"},{"key":"415_CR2","doi-asserted-by":"crossref","unstructured":"Christophe, Rigaud., Antoine, Doucet., Mickael, Coustaty., Jean-Philipp, Moreux.: ICDAR 2019 competition on post-OCR text correction. 
In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1588\u20131593, (2019)","DOI":"10.1109\/ICDAR.2019.00255"},{"key":"415_CR3","doi-asserted-by":"crossref","unstructured":"Martin, Volk., Lenz, Furrer., Rico, Sennrich.: Strategies for reducing and correcting OCR errors. In Sporleder, C., van den Bosch, A., Zervanou, K. (eds) Language Technology for Cultural Heritage. Theory and Applications of Natural Language Processing, pages 3\u201322, (2011)","DOI":"10.1007\/978-3-642-20227-8_1"},{"key":"415_CR4","doi-asserted-by":"crossref","unstructured":"Thi-Tuyet-Hai, Nguyen., Adam, Jatowt., Mickael, Coustaty., Nhu-Van, Nguyen., Antoine, Doucet.: Deep statistical analysis of OCR errors for effective post-OCR processing. In 2019 ACM\/IEEE Joint Conference on Digital Libraries (JCDL), pages 29\u201338, (2019)","DOI":"10.1109\/JCDL.2019.00015"},{"key":"415_CR5","doi-asserted-by":"crossref","unstructured":"Christian, Strohmaier., Ch\u00a0Ringlstetter., Schulz, Klaus., Stoyan, Mihov.: Lexical postcorrection of OCR-results: The web as a dynamic secondary dictionary? In Proceedings of the 7th International Conference on Document Analysis and Recognition, pages 1133\u20131137, (2003)","DOI":"10.1109\/ICDAR.2003.1227833"},{"key":"415_CR6","doi-asserted-by":"crossref","unstructured":"Klaus, Schulz., Stoyan, Mihov., Petar, Mitankin.: Fast selection of small and precise candidate sets from dictionaries for text correction tasks. In Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR 2007), pages 471\u2013475, (2007)","DOI":"10.1109\/ICDAR.2007.4378754"},{"key":"415_CR7","doi-asserted-by":"crossref","unstructured":"Paula, Estrella., Pablo, Paliza.: OCR correction of documents generated during Argentina\u2019s National Reorganization Process. 
In Proceedings of the 1st International Conference on Digital Access to Textual Cultural Heritage, pages 119\u2013123, (2014)","DOI":"10.1145\/2595188.2595194"},{"key":"415_CR8","unstructured":"Vladimir\u00a0I. Levenshtein.: Binary codes capable of correcting deletions, insertions and reversals. In Soviet physics doklady, (1966)"},{"key":"415_CR9","doi-asserted-by":"crossref","unstructured":"Guillaume, Chiron., Antoine, Doucet., Micka\u00ebl, Coustaty., Jean-Philipp, Moreux.: ICDAR 2017 competition on post-OCR text correction. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 1423\u20131428, (2017)","DOI":"10.1109\/ICDAR.2017.232"},{"key":"415_CR10","doi-asserted-by":"crossref","unstructured":"Chantal, Amrhein., Simon, Clematide.: Supervised OCR error detection and correction using statistical and neural machine translation methods. Language Technology and Computational Linguistic, (2018)","DOI":"10.21248\/jlcl.33.2018.218"},{"key":"415_CR11","unstructured":"Jacob, Devlin., Ming-Wei, Chang., Kenton, Lee., Kristina, Toutanova.: BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171\u20134186, Minneapolis, Minnesota, June (2019). Association for Computational Linguistics"},{"key":"415_CR12","doi-asserted-by":"crossref","unstructured":"Thi, Tuyet\u00a0Hai Nguyen., Adam, Jatowt., Nhu-Van, Nguyen., Micka\u00ebl, Coustaty.: Neural machine translation with BERT for post-OCR error detection and correction. In Proceedings of the ACM\/IEEE Joint Conference on Digital Libraries, pages 333\u2013336, (2020)","DOI":"10.1145\/3383583.3398605"},{"key":"415_CR13","doi-asserted-by":"crossref","unstructured":"Shaohua, Zhang., Haoran, Huang., Jicong, Liu., Hang, Li.: Spelling error correction with soft-masked BERT. 
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 882\u2013890, Online, July (2020). Association for Computational Linguistics","DOI":"10.18653\/v1\/2020.acl-main.82"},{"key":"415_CR14","doi-asserted-by":"crossref","unstructured":"Rui, Dong., David, Smith.: Multi-input attention for unsupervised OCR correction. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2363\u20132372, Melbourne, Australia, July (2018). Association for Computational Linguistics","DOI":"10.18653\/v1\/P18-1220"},{"key":"415_CR15","doi-asserted-by":"crossref","unstructured":"Kiril, Simov.: Integrated language and knowledge resources for a bulgarian-centric knowledge graph. In Digital Presentation and Preservation of Cultural and Scientific Heritage, pages 65\u201374, (2019)","DOI":"10.55630\/dipp.2019.9.5"},{"key":"415_CR16","doi-asserted-by":"crossref","unstructured":"Armand, Joulin., Edouard, Grave., Piotr, Bojanowski., Tom\u00e1s, Mikolov.: Bag of tricks for efficient text classification. CoRR, arXiv:1607.01759, (2016)","DOI":"10.18653\/v1\/E17-2068"},{"key":"415_CR17","unstructured":"Pengcheng, He., Xiaodong, Liu., Jianfeng, Gao., Weizhu, Chen.: Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations, (2021)"},{"key":"415_CR18","unstructured":"Yinhan, Liu., Myle, Ott., Naman, Goyal., Jingfei, Du., Mandar, Joshi., Danqi, Chen., Omer, Levy., Mike, Lewis., Luke, Zettlemoyer., Veselin, Stoyanov.: Roberta: A robustly optimized BERT pretraining approach. 
CoRR, arXiv:1907.11692, (2019)"},{"key":"415_CR19","unstructured":"Yonghui, Wu., Mike, Schuster., Zhifeng, Chen., Quoc\u00a0V., Le., Mohammad, Norouzi., Wolfgang, Macherey., Maxim, Krikun., Yuan, Cao., Qin, Gao., Klaus, Macherey., Jeff, Klingner., Apurva, Shah., Melvin, Johnson., Xiaobing, Liu., Lukasz, Kaiser., Stephan, Gouws., Yoshikiyo, Kato., Taku, Kudo., Hideto, Kazawa., Keith, Stevens., George, Kurian., Nishant, Patil., Wei, Wang., Cliff, Young., Jason, Smith., Jason, Riesa., Alex, Rudnick., Oriol, Vinyals., Greg, Corrado., Macduff, Hughes., Jeffrey, Dean.: Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. CoRR, arXiv:1609.08144, (2016)"},{"key":"415_CR20","doi-asserted-by":"crossref","unstructured":"Rico, Sennrich., Barry, Haddow., Alexandra, Birch.: Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715\u20131725, Berlin, Germany, August (2016). Association for Computational Linguistics","DOI":"10.18653\/v1\/P16-1162"},{"key":"415_CR21","doi-asserted-by":"crossref","unstructured":"Taku, Kudo.: Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66\u201375, Melbourne, Australia, July (2018). Association for Computational Linguistics","DOI":"10.18653\/v1\/P18-1007"},{"key":"415_CR22","doi-asserted-by":"crossref","unstructured":"Sepp, Hochreiter., J\u00fcrgen, Schmidhuber.: Long short-term memory. In Neural computation, pages 1735\u20131780, (1997)","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"415_CR23","doi-asserted-by":"crossref","unstructured":"Shruti, Rijhwani., Antonios, Anastasopoulos., Graham, Neubig.: OCR Post Correction for Endangered Language Texts. 
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5931\u20135942, Online, November (2020). Association for Computational Linguistics","DOI":"10.18653\/v1\/2020.emnlp-main.478"},{"key":"415_CR24","unstructured":"Abigail See., Peter\u00a0J Liu., Christopher\u00a0D., Manning.: Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073\u20131083, Vancouver, Canada, July (2017). Association for Computational Linguistics"},{"key":"415_CR25","doi-asserted-by":"crossref","unstructured":"Christophe, Rigaud., Antoine, Doucet., Mickael, Coustaty., Jean-Philipp, Moreux.: Dataset of ICDAR 2019 competition on post-OCR text correction, (2019)","DOI":"10.1109\/ICDAR.2019.00255"},{"key":"415_CR26","volume-title":"Drinov orthography for post-OCR correction dataset","author":"Angel Beshirov","year":"2024","unstructured":"Beshirov, Angel, Dobreva, Milena, Dimitrov, Dimitar, Hardalov, Momchil, Koychev, Ivan, Nakov, Preslav: Drinov orthography for post-OCR correction dataset (2024)"},{"key":"415_CR27","volume-title":"Codification of the Norms of the Bulgarian Standard Language from the End of the 19th and the Beginning of the 20th Century (1879\u20131921)","author":"Katya Charalozova","year":"2022","unstructured":"Charalozova, Katya: Codification of the Norms of the Bulgarian Standard Language from the End of the 19th and the Beginning of the 20th Century (1879\u20131921). Prof. Marin Drinov Publishing House, Sofia (2022)"},{"key":"415_CR28","unstructured":"Nikola, Ljube\u0161i\u0107., Petya, Osenova., Kiril, Simov.: The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Bulgarian 1.0, 2020. Slovenian language resource repository CLARIN.SI"},{"key":"415_CR29","unstructured":"Eva, D\u2019hondt., Cyril, Grouin., Brigitte, Grau.: Generating a training corpus for OCR post-correction using encoder-decoder model. 
In Greg Kondrak and Taro Watanabe, editors, Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1006\u20131014, Taipei, Taiwan, November (2017). Asian Federation of Natural Language Processing"},{"key":"415_CR30","doi-asserted-by":"crossref","unstructured":"Thomas, Wolf., Lysandre, Debut., Victor, Sanh., Julien, Chaumond., Clement, Delangue., Anthony,Moi., Pierric, Cistac., Tim, Rault., R\u00e9mi, Louf., Morgan, Funtowicz., Jamie, Brew.: Huggingface\u2019s transformers: State-of-the-art natural language processing. CoRR, arXiv:1910.03771, (2020)","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"415_CR31","unstructured":"Victor, Sanh., Lysandre, Debut., Julien, Chaumond., Thomas, Wolf.: Distilbert, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, arXiv:1910.01108, (2019)"},{"key":"415_CR32","doi-asserted-by":"crossref","unstructured":"Alexis, Conneau., Kartikay, Khandelwal., Naman, Goyal., Vishrav, Chaudhary., Guillaume, Wenzek., Francisco, Guzm\u00e1n., Edouard, Grave., Myle, Ott., Luke, Zettlemoyer., Veselin, Stoyanov.: Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440\u20138451, Online, July (2020). Association for Computational Linguistics","DOI":"10.18653\/v1\/2020.acl-main.747"},{"key":"415_CR33","unstructured":"Guillaume, Lample., Alexis, Conneau.: Cross-lingual language model pretraining. CoRR, arXiv:1901.07291, (2019)"},{"key":"415_CR34","unstructured":"Vivi Nastase and Julian Hitschler. Correction of OCR word segmentation errors in articles from the ACL collection through neural machine translation methods. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. 
European Language Resources Association (ELRA)"}],"container-title":["International Journal on Digital Libraries"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00799-025-00415-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00799-025-00415-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00799-025-00415-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,24]],"date-time":"2025-03-24T06:18:28Z","timestamp":1742797108000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00799-025-00415-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,21]]},"references-count":34,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,3]]}},"alternative-id":["415"],"URL":"https:\/\/doi.org\/10.1007\/s00799-025-00415-x","relation":{},"ISSN":["1432-5012","1432-1300"],"issn-type":[{"value":"1432-5012","type":"print"},{"value":"1432-1300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,21]]},"assertion":[{"value":"15 May 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 September 2024","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 January 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 February 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"4"}}