{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,3,26]],"date-time":"2025-03-26T06:07:34Z","timestamp":1742969254037,"version":"3.40.3"},"publisher-location":"Cham","reference-count":8,"publisher":"Springer Nature Switzerland","isbn-type":[{"type":"print","value":"9783031657931"},{"type":"electronic","value":"9783031657948"}],"license":[{"start":{"date-parts":[[2024,1,1]],"date-time":"2024-01-01T00:00:00Z","timestamp":1704067200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,8,15]],"date-time":"2024-08-15T00:00:00Z","timestamp":1723680000000},"content-version":"vor","delay-in-days":227,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Correcting Optical Character Recognition (OCR) errors is a major challenge in preprocessing datasets consisting of legacy PDF files. In this study, we develop Large Language Models specially finetuned to correct OCR errors. We experimented with the mT5 model (both the mT5-small and mT5-large configurations), a Text-to-Text Transfer Transformer-based machine translation model, for the post-correction of texts with OCR errors. We compiled a parallel corpus consisting of text corrupted with OCR errors as well as corresponding clean data. Our findings suggest that the mT5 model can be successfully applied to OCR error correction with improving accuracy. The results affirm the mT5 model as an effective tool for OCR post-correction, with prospects for achieving greater efficiency in future research.<\/jats:p>","DOI":"10.1007\/978-3-031-65794-8_4","type":"book-chapter","created":{"date-parts":[[2024,8,14]],"date-time":"2024-08-14T06:02:44Z","timestamp":1723615364000},"page":"49-58","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["OCR Cleaning of\u00a0Scientific Texts with\u00a0LLMs"],"prefix":"10.1007","author":[{"given":"G\u00e1bor","family":"Madar\u00e1sz","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0851-7621","authenticated-orcid":false,"given":"No\u00e9mi","family":"Ligeti-Nagy","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6873-3425","authenticated-orcid":false,"given":"Andr\u00e1s","family":"Holl","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5765-3908","authenticated-orcid":false,"given":"Tam\u00e1s","family":"V\u00e1radi","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,8,15]]},"reference":[{"key":"4_CR1","unstructured":"Amrhein, C.: Post-correcting OCR errors using neural machine translation. Ph.D. thesis, Universit\u00e4t Z\u00fcrich (2017). https:\/\/api.semanticscholar.org\/CorpusID:231696696"},{"key":"4_CR2","doi-asserted-by":"publisher","unstructured":"Gupta, H., Del\u00a0Corro, L., Broscheit, S., Hoffart, J., Brenner, E.: Unsupervised multi-view post-OCR error correction with language models. In: Moens, M.F., Huang, X., Specia, L., Yih, S.W.T. (eds.) Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8647\u20138652. Association for Computational Linguistics, Online and Punta Cana (2021). https:\/\/doi.org\/10.18653\/v1\/2021.emnlp-main.680, https:\/\/aclanthology.org\/2021.emnlp-main.680","DOI":"10.18653\/v1\/2021.emnlp-main.680"},{"key":"4_CR3","unstructured":"Laki, L.J., et al.: OCR hib\u00e1k jav\u00edt\u00e1sa neur\u00e1lis technol\u00f3gi\u00e1k seg\u00edts\u00e9g\u00e9vel [Correction of OCR errors using neural technologies]. In: XVIII. Magyar Sz\u00e1m\u00edt\u00f3g\u00e9pes Nyelv\u00e9szeti Konferencia, pp. 417\u2013430. Szegedi Tudom\u00e1nyegyetem, Informatikai Int\u00e9zet, Szeged (2022). Original text in Hungarian"},{"key":"4_CR4","doi-asserted-by":"publisher","unstructured":"Maheshwari, A., Singh, N., Krishna, A., Ramakrishnan, G.: A benchmark and dataset for post-OCR text correction in Sanskrit. In: Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 6287\u20136294. Association for Computational Linguistics (2022). https:\/\/doi.org\/10.18653\/v1\/2022.findings-emnlp.527, https:\/\/aclanthology.org\/2022.findings-emnlp.527","DOI":"10.18653\/v1\/2022.findings-emnlp.527"},{"key":"4_CR5","unstructured":"Piotrowski, M.: Post-correction of OCR results using pre-trained language model (2021). http:\/\/poleval.pl\/files\/2021\/09.pdf. Presentation slides"},{"issue":"140","key":"4_CR6","first-page":"1","volume":"21","author":"C Raffel","year":"2020","unstructured":"Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1\u201367 (2020)","journal-title":"J. Mach. Learn. Res."},{"key":"4_CR7","doi-asserted-by":"publisher","unstructured":"Rigaud, C., Doucet, A., Coustaty, M., Moreux, J.P.: ICDAR 2019 competition on post-OCR text correction. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1588\u20131593 (2019). https:\/\/doi.org\/10.1109\/ICDAR.2019.00255","DOI":"10.1109\/ICDAR.2019.00255"},{"key":"4_CR8","unstructured":"Schaefer, R., Neudecke, C.: A two-step approach for automatic OCR post-correction. In: Proceedings of the Workshop on Computational Humanities Research (LaTeCH-CLfL 2020), pp. 52\u201357. Association for Computational Linguistics (2020)"}],"container-title":["Lecture Notes in Computer Science","Natural Scientific Language Processing and Research Knowledge Graphs"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-031-65794-8_4","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,14]],"date-time":"2024-08-14T06:03:18Z","timestamp":1723615398000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-031-65794-8_4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"ISBN":["9783031657931","9783031657948"],"references-count":8,"URL":"https:\/\/doi.org\/10.1007\/978-3-031-65794-8_4","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"type":"print","value":"0302-9743"},{"type":"electronic","value":"1611-3349"}],"subject":[],"published":{"date-parts":[[2024]]},"assertion":[{"value":"15 August 2024","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"NSLP","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"International Workshop on Natural Scientific Language Processing and Research Knowledge Graphs","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Hersonissos, Crete","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Greece","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2024","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"26 May 2024","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"26 May 2024","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"1","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"nslp2024","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/nfdi4ds.github.io\/nslp2024\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}}]}}