{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,11]],"date-time":"2026-05-11T15:18:56Z","timestamp":1778512736325,"version":"3.51.4"},"reference-count":40,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2020,8,20]],"date-time":"2020-08-20T00:00:00Z","timestamp":1597881600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,8,20]],"date-time":"2020-08-20T00:00:00Z","timestamp":1597881600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"University of Helsinki including Helsinki University Central Hospital"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["IJDAR"],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is rather low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. 
There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (<jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/tmbdev\/ocropy\">https:\/\/github.com\/tmbdev\/ocropy<\/jats:ext-link>) and Tesseract (<jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/tesseract-ocr\/tesseract\">https:\/\/github.com\/tesseract-ocr\/tesseract<\/jats:ext-link>), but so far, none of the methods have managed to successfully train a mixed model that recognizes all of the data in the corpus, which would be essential for an efficient re-OCRing of the corpus. The difficulty lies in the fact that the corpus is printed in the two main languages of Finland (Finnish and Swedish) and in two font families (Blackletter and Antiqua). In this paper, we explore the training of a variety of OCR models with deep neural networks (DNN). First, we find an optimal DNN for our data and, with additional training data, successfully train high-quality mixed-language models. Furthermore, we revisit the effect of confidence voting on the OCR results with different model combinations. Finally, we perform post-correction on the new OCR results and perform error analysis. The results show a significant boost in accuracy, resulting in 1.7% CER on the Finnish and 2.7% CER on the Swedish test set. 
The greatest accomplishment of the study is the successful training of one mixed language model for the entire corpus and finding a voting setup that further improves the results.<\/jats:p>","DOI":"10.1007\/s10032-020-00359-9","type":"journal-article","created":{"date-parts":[[2020,8,20]],"date-time":"2020-08-20T02:02:15Z","timestamp":1597888935000},"page":"279-295","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":57,"title":["Optical character recognition with neural networks and post-correction with finite state methods"],"prefix":"10.1007","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7645-3079","authenticated-orcid":false,"given":"Senka","family":"Drobac","sequence":"first","affiliation":[]},{"given":"Krister","family":"Lind\u00e9n","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,8,20]]},"reference":[{"key":"359_CR1","doi-asserted-by":"crossref","unstructured":"Breuel, T.: Recent progress on the OCRopus OCR system. In: Proceedings of the International Workshop on Multilingual OCR, p.\u00a02. ACM (2009)","DOI":"10.1145\/1577802.1577805"},{"key":"359_CR2","doi-asserted-by":"crossref","unstructured":"Breuel, T.M.: The OCRopus open source OCR system. In: Electronic Imaging 2008, pp. 68150F\u201368150F. International Society for Optics and Photonics (2008)","DOI":"10.1117\/12.783598"},{"key":"359_CR3","doi-asserted-by":"crossref","unstructured":"Breuel, T.M., Ul-Hasan, A., Al-Azawi, M.A., Shafait, F.: High-performance OCR for printed English and Fraktur using LSTM networks. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 683\u2013687. IEEE (2013)","DOI":"10.1109\/ICDAR.2013.140"},{"key":"359_CR4","doi-asserted-by":"crossref","unstructured":"Dong, R., Smith, D.A.: Multi-input attention for unsupervised OCR correction. 
In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2363\u20132372 (2018)","DOI":"10.18653\/v1\/P18-1220"},{"key":"359_CR5","unstructured":"Drobac, S., Kauppinen, P., Lind\u00e9n, K.: OCR and post-correction of historical Finnish texts. In: Proceedings of the 21st Nordic Conference on Computational Linguistics, pp. 70\u201376 (2017)"},{"key":"359_CR6","doi-asserted-by":"crossref","unstructured":"Drobac, S., Kauppinen, P., Lind\u00e9n, K.: Improving OCR of historical newspapers and journals published in Finland. In: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, pp. 97\u2013102. ACM International (2019)","DOI":"10.1145\/3322905.3322914"},{"key":"359_CR7","unstructured":"Drobac, S., Lind\u00e9n, K.: Optical font family recognition using a neural network. In: Proceedings of the Research Data and Humanities (RDHUM) 2019 Conference: Data, Methods and Tools, p. 115. Studia Humaniora Ouluensia, Finland (2019)"},{"key":"359_CR8","doi-asserted-by":"publisher","first-page":"77","DOI":"10.1515\/pralin-2016-0001","volume":"105","author":"S Eger","year":"2016","unstructured":"Eger, S., vor der Br\u00fcck, T., Mehler, A.: A comparison of four character-level string-to-string translation models for (OCR) spelling error correction. Prague Bull. Math. Ling. 105, 77\u201399 (2016). https:\/\/doi.org\/10.1515\/pralin-2016-0001","journal-title":"Prague Bull. Math. Ling."},{"key":"359_CR9","doi-asserted-by":"crossref","unstructured":"Englmeier, T., Fink, F., Schulz, K.U.: AI-PoCoTo\u2014combining automated and interactive OCR postcorrection. In: Proceedings of the Third International Conference on Digital Access to Textual Cultural Heritage. ACM (2019)","DOI":"10.1145\/3322905.3322908"},{"key":"359_CR10","doi-asserted-by":"crossref","unstructured":"Evershed, J., Fitch, K.: Correcting noisy OCR: context beats confusion. 
In: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pp. 45\u201351. ACM (2014)","DOI":"10.1145\/2595188.2595200"},{"key":"359_CR11","unstructured":"G\u00e9n\u00e9reux, M., Stemle, E.W., Lyding, V., Nicolas, L.: Correcting OCR errors for German in Fraktur font. In: Proceedings of the First Italian Conference on Computational Linguistics (CLiC-It 2014) (2014)"},{"key":"359_CR12","doi-asserted-by":"crossref","unstructured":"Graves, A., Fern\u00e1ndez, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369\u2013376. ACM (2006)","DOI":"10.1145\/1143844.1143891"},{"key":"359_CR13","doi-asserted-by":"crossref","unstructured":"Guha, R., Das, N., Kundu, M., Nasipuri, M., Santosh, K., senior member, I.: Devnet: an efficient cnn architecture for handwritten Devanagari character recognition. Int. J. Pattern Recogn. Artif. Intell. (2019)","DOI":"10.1142\/S0218001420520096"},{"key":"359_CR14","doi-asserted-by":"crossref","unstructured":"H\u00e4m\u00e4l\u00e4inen, M., Hengchen, S.: From the Paft to the Fiiture: a fully automatic NMT and word embeddings method for OCR post-correction. In: Recent Advances in Natural Language Processing, pp. 432\u2013437. INCOMA (2019)","DOI":"10.26615\/978-954-452-056-4_051"},{"key":"359_CR15","unstructured":"Jauhiainen, T.S., Linden, B.K.J., Jauhiainen, H.A., et\u00a0al.: Heli, a word-based backoff method for language identification. 
In: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects VarDial3, Osaka, Japan, December 12 2016 (2016)"},{"key":"359_CR16","unstructured":"Kauppinen, P.: OCR post-processing by parallel replace rules implemented as weighted finite-state transducers (2016)"},{"key":"359_CR17","unstructured":"Kettunen, K., Kervinen, J., Koistinen, M.: Creating and using ground truth OCR sample data for Finnish historical newspapers and journals (2018)"},{"key":"359_CR18","unstructured":"Kettunen, K., Koistinen, M.: Open source Tesseract in re-OCR of Finnish Fraktur from 19th and early 20th century newspapers and journals-collected notes on quality improvement. In: DHN, pp. 270\u2013282 (2019)"},{"key":"359_CR19","doi-asserted-by":"crossref","unstructured":"Kissos, I., Dershowitz, N.: OCR error correction using character correction and feature-based word classification. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 198\u2013203. IEEE (2016)","DOI":"10.1109\/DAS.2016.44"},{"key":"359_CR20","unstructured":"Koistinen, M., Kettunen, K., Kervinen, J.: How to improve optical character recognition of historical Finnish newspapers using open source Tesseract OCR engine. In: Proceedings of the LTC, pp. 279\u2013283 (2017)"},{"key":"359_CR21","unstructured":"Koistinen, M., Kettunen, K., P\u00e4\u00e4kk\u00f6nen, T.: Improving optical character recognition of Finnish historical newspapers with a combination of Fraktur & Antiqua models and image preprocessing. In: Proceedings of the 21st Nordic Conference on Computational Linguistics, pp. 277\u2013283 (2017)"},{"key":"359_CR22","first-page":"707","volume":"10","author":"VI Levenshtein","year":"1966","unstructured":"Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10, 707 (1966)","journal-title":"Sov. Phys. 
Dokl."},{"key":"359_CR23","unstructured":"Lind\u00e9n, K., Silfverberg, M., Pirinen, T., Hardwick, S., Drobac, S., Axelson, E.: HFST\u2014An Environment for Creating Language Technology Applications. Studies in Computational Intelligence. Springer, Berlin (2012)"},{"key":"359_CR24","doi-asserted-by":"crossref","unstructured":"Llobet, R., Cerdan-Navarro, J.R., Perez-Cortes, J.C., Arlandis, J.: OCR post-processing using weighted finite-state transducers. In: 2010 20th International Conference on Pattern Recognition, pp. 2021\u20132024 (2010)","DOI":"10.1109\/ICPR.2010.498"},{"key":"359_CR25","doi-asserted-by":"crossref","unstructured":"Lund, W.B., Kennard, D.J., Ringger, E.K.: Combining multiple thresholding binarization values to improve OCR output. In: Document Recognition and Retrieval XX, vol. 8658, p. 86580R. International Society for Optics and Photonics (2013)","DOI":"10.1117\/12.2006228"},{"key":"359_CR26","doi-asserted-by":"crossref","unstructured":"Lund, W.B., Walker, D.D., Ringger, E.K.: Progressive alignment and discriminative error correction for multiple OCR engines. In: 2011 International Conference on Document Analysis and Recognition, pp. 764\u2013768. IEEE (2011)","DOI":"10.1109\/ICDAR.2011.303"},{"key":"359_CR27","unstructured":"Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)"},{"key":"359_CR28","doi-asserted-by":"crossref","unstructured":"Reul, C., Christ, D., Hartelt, A., Balbach, N., Wehner, M., Springmann, U., Wick, C., Grundig, C., B\u00fcttner, A., Puppe, F.: Ocr4all\u2014an open-source tool providing a (semi-) automatic OCR workflow for historical printings. arXiv preprint arXiv:1909.04032 (2019)","DOI":"10.20944\/preprints201909.0101.v1"},{"key":"359_CR29","unstructured":"Reul, C., Springmann, U., Wick, C., Puppe, F.: State of the art optical character recognition of 19th century Fraktur scripts using open source engines. 
arXiv preprint arXiv:1810.03436 (2018)"},{"key":"359_CR30","unstructured":"Reynaert, M.: Ocr post-correction evaluation of early Dutch books online-revisited (2016)"},{"issue":"2","key":"359_CR31","doi-asserted-by":"publisher","first-page":"173","DOI":"10.1007\/s10032-010-0133-5","volume":"14","author":"MW Reynaert","year":"2010","unstructured":"Reynaert, M.W.: Character confusion versus focus word-based correction of spelling and OCR variants in corpora. Int. J. Doc. Anal. Recogn. (IJDAR) 14(2), 173\u2013187 (2010)","journal-title":"Int. J. Doc. Anal. Recogn. (IJDAR)"},{"key":"359_CR32","doi-asserted-by":"publisher","DOI":"10.1142\/8394","volume-title":"Multimodal Interactive Handwritten Text Transcription","author":"V Romero","year":"2012","unstructured":"Romero, V., Toselli, A.H., Vidal, E.: Multimodal Interactive Handwritten Text Transcription, vol. 80. World Scientific, Singapore (2012)"},{"key":"359_CR33","doi-asserted-by":"crossref","unstructured":"Shafait, F.: Document image analysis with OCRopus. In: Multitopic Conference, 2009. INMIC 2009. IEEE 13th International, pp. 1\u20136. IEEE (2009)","DOI":"10.1109\/INMIC.2009.5383078"},{"key":"359_CR34","doi-asserted-by":"crossref","unstructured":"Silfverberg, M., Kauppinen, P., Lind\u00e9n, K.: Data-driven spelling correction using weighted finite-state methods. In: Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, pp. 51\u201359. Association for Computational Linguistics, Berlin (2016). http:\/\/anthology.aclweb.org\/W16-2406","DOI":"10.18653\/v1\/W16-2406"},{"key":"359_CR35","unstructured":"Springmann, U., L\u00fcdeling, A.: OCR of historical printings with an application to building diachronic corpora: a case study using the RIDGES herbal corpus. 
arXiv preprint arXiv:1608.02153 (2016)"},{"key":"359_CR36","doi-asserted-by":"crossref","unstructured":"Springmann, U., Najock, D., Morgenroth, H., Schmid, H., Gotscharek, A., Fink, F.: OCR of historical printings of latin texts: problems, prospects, progress. In: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pp. 71\u201375. ACM (2014)","DOI":"10.1145\/2595188.2595205"},{"issue":"1","key":"359_CR37","first-page":"1929","volume":"15","author":"N Srivastava","year":"2014","unstructured":"Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929\u20131958 (2014)","journal-title":"J. Mach. Learn. Res."},{"key":"359_CR38","doi-asserted-by":"crossref","unstructured":"Vobl, T., Gotscharek, A., Reffle, U., Ringlstetter, C., Schulz, K.U.: Pocoto\u2014an open source system for efficient interactive postcorrection of OCRed historical texts. In: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pp. 57\u201361. ACM (2014)","DOI":"10.1145\/2595188.2595197"},{"key":"359_CR39","unstructured":"Wick, C., Reul, C., Puppe, F.: Calamari\u2014a high-performance tensorflow-based deep learning package for optical character recognition. arXiv preprint arXiv:1807.02004 (2018)"},{"key":"359_CR40","doi-asserted-by":"crossref","first-page":"79","DOI":"10.21248\/jlcl.33.2018.219","volume":"33","author":"C Wick","year":"2018","unstructured":"Wick, C., Reul, C., Puppe, F.: Comparison of OCR accuracy on early printed books using the open source engines Calamari and OCRopus. 
JLCL 33, 79\u201396 (2018)","journal-title":"JLCL"}],"container-title":["International Journal on Document Analysis and Recognition (IJDAR)"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10032-020-00359-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10032-020-00359-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10032-020-00359-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,10,6]],"date-time":"2023-10-06T01:34:52Z","timestamp":1696556092000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10032-020-00359-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,8,20]]},"references-count":40,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["359"],"URL":"https:\/\/doi.org\/10.1007\/s10032-020-00359-9","relation":{},"ISSN":["1433-2833","1433-2825"],"issn-type":[{"value":"1433-2833","type":"print"},{"value":"1433-2825","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,8,20]]},"assertion":[{"value":"12 December 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 June 2020","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 August 2020","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 August 2020","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}