{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:17:30Z","timestamp":1750220250937,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":24,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,6,20]],"date-time":"2022-06-20T00:00:00Z","timestamp":1655683200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,6,20]]},"DOI":"10.1145\/3529372.3533298","type":"proceedings-article","created":{"date-parts":[[2022,6,6]],"date-time":"2022-06-06T20:57:52Z","timestamp":1654549072000},"page":"1-5","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["A prototype gutenberg-hathitrust sentence-level parallel corpus for OCR error analysis"],"prefix":"10.1145","author":[{"given":"Ming","family":"Jiang","sequence":"first","affiliation":[{"name":"University of Illinois Urbana-Champaign"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ryan C","family":"Dubnicek","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Glen","family":"Worthey","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ted","family":"Underwood","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"J. Stephen","family":"Downie","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,6,20]]},"reference":[{"key":"e_1_3_2_1_1_1","first-page":"102","article-title":"Assessing the impact of ocr errors in information retrieval","volume":"12036","author":"Bazzo Guilherme Torresan","year":"2020","unstructured":"Guilherme Torresan Bazzo , Gustavo Acauan Lorentz , Danny Suarez Vargas , and Viviane P Moreira . 2020 . Assessing the impact of ocr errors in information retrieval . Advances in Information Retrieval 12036 (2020), 102 . Guilherme Torresan Bazzo, Gustavo Acauan Lorentz, Danny Suarez Vargas, and Viviane P Moreira. 2020. Assessing the impact of ocr errors in information retrieval. Advances in Information Retrieval 12036 (2020), 102.","journal-title":"Advances in Information Retrieval"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.3115\/1690219.1690231"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1220"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2595188.2595200"},{"volume-title":"From Old English to Standard English: A course book in language variation across time","author":"Freeborn Dennis","key":"e_1_3_2_1_5_1","unstructured":"Dennis Freeborn . 1998. From Old English to Standard English: A course book in language variation across time . University of Ottawa Press. Dennis Freeborn. 1998. From Old English to Standard English: A course book in language variation across time. University of Ottawa Press."},{"key":"e_1_3_2_1_6_1","volume-title":"OCR17: Ground truth and models for 17th c. French prints (and hopefully more).(May","author":"Gabay Simon","year":"2020","unstructured":"Simon Gabay , Thibault Cl\u00e9rice , and Christian Reul . 2020. OCR17: Ground truth and models for 17th c. French prints (and hopefully more).(May 2020 ). Simon Gabay, Thibault Cl\u00e9rice, and Christian Reul. 2020. OCR17: Ground truth and models for 17th c. French prints (and hopefully more).(May 2020)."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/11510888_56"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.680"},{"key":"e_1_3_2_1_9_1","volume-title":"2019 ACM\/IEEE Joint Conference on Digital Libraries (JCDL). IEEE, 29--38","author":"Jatowt Adam","year":"2019","unstructured":"Adam Jatowt , Mickael Coustaty , Nhu-Van Nguyen , Antoine Doucet , 2019 . Deep statistical analysis of OCR errors for effective post-OCR processing . In 2019 ACM\/IEEE Joint Conference on Digital Libraries (JCDL). IEEE, 29--38 . Adam Jatowt, Mickael Coustaty, Nhu-Van Nguyen, Antoine Doucet, et al. 2019. Deep statistical analysis of OCR errors for effective post-OCR processing. In 2019 ACM\/IEEE Joint Conference on Digital Libraries (JCDL). IEEE, 29--38."},{"key":"e_1_3_2_1_10_1","volume-title":"International Journal on Digital Libraries","author":"Jiang Ming","year":"2021","unstructured":"Ming Jiang , Jennifer D'Souza , S\u00f6ren Auer , and J Stephen Downie . 2021. Evaluating BERT-based scientific relation classifiers for scholarly knowledge graph construction on digital library collections . International Journal on Digital Libraries ( 2021 ), 1--19. Ming Jiang, Jennifer D'Souza, S\u00f6ren Auer, and J Stephen Downie. 2021. Evaluating BERT-based scientific relation classifiers for scholarly knowledge graph construction on digital library collections. International Journal on Digital Libraries (2021), 1--19."},{"key":"e_1_3_2_1_11_1","volume-title":"The Gutenberg-HathiTrust parallel corpus: A real-world dataset for noise investigation in uncorrected OCR texts. iConference 2021 (Poster)","author":"Jiang Ming","year":"2021","unstructured":"Ming Jiang , Yuerong Hu , Glen Worthey , Ryan C Dubnicek , Boris Capitanu , Deren Kudeki , and J Stephen Downie . 2021. The Gutenberg-HathiTrust parallel corpus: A real-world dataset for noise investigation in uncorrected OCR texts. iConference 2021 (Poster) ( 2021 ). Ming Jiang, Yuerong Hu, Glen Worthey, Ryan C Dubnicek, Boris Capitanu, Deren Kudeki, and J Stephen Downie. 2021. The Gutenberg-HathiTrust parallel corpus: A real-world dataset for noise investigation in uncorrected OCR texts. iConference 2021 (Poster) (2021)."},{"key":"e_1_3_2_1_12_1","volume-title":"Proceedings of the Second Conference on Computational Humanities Research 1613","author":"Jiang Ming","year":"2021","unstructured":"Ming Jiang , Yuerong Hu , Glen Worthey , Ryan C Dubnicek , Ted Underwood , and J Stephen Downie . 2021 . Impact of OCR quality on BERT embeddings in the domain classification of book excerpts . Proceedings of the Second Conference on Computational Humanities Research 1613 (2021), 266--279. Ming Jiang, Yuerong Hu, Glen Worthey, Ryan C Dubnicek, Ted Underwood, and J Stephen Downie. 2021. Impact of OCR quality on BERT embeddings in the domain classification of book excerpts. Proceedings of the Second Conference on Computational Humanities Research 1613 (2021), 266--279."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1009902609570"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10032-009-0094-8"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00379"},{"key":"e_1_3_2_1_16_1","volume-title":"Proceedings of the Australasian Language Technology Association Workshop","author":"Molla Diego","year":"2017","unstructured":"Diego Molla and Steve Cassidy . 2017 . Overview of the 2017 ALTA shared task: Correcting OCR errors . In Proceedings of the Australasian Language Technology Association Workshop 2017. 115--118. Diego Molla and Steve Cassidy. 2017. Overview of the 2017 ALTA shared task: Correcting OCR errors. In Proceedings of the Australasian Language Technology Association Workshop 2017. 115--118."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3453476"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3383583.3398605"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2501115.2501130"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.672"},{"key":"e_1_3_2_1_21_1","volume-title":"ICDAR 2019 competition on post-OCR text correction. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 1588--1593","author":"Rigaud Christophe","year":"2019","unstructured":"Christophe Rigaud , Antoine Doucet , Micka\u00ebl Coustaty , and Jean-Philippe Moreux . 2019 . ICDAR 2019 competition on post-OCR text correction. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 1588--1593 . Christophe Rigaud, Antoine Doucet, Micka\u00ebl Coustaty, and Jean-Philippe Moreux. 2019. ICDAR 2019 competition on post-OCR text correction. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 1588--1593."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.478"},{"key":"e_1_3_2_1_23_1","volume-title":"Ground truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. arXiv preprint arXiv:1809.05501","author":"Springmann Uwe","year":"2018","unstructured":"Uwe Springmann , Christian Reul , Stefanie Dipper , and Johannes Baiter . 2018. Ground truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. arXiv preprint arXiv:1809.05501 ( 2018 ). Uwe Springmann, Christian Reul, Stefanie Dipper, and Johannes Baiter. 2018. Ground truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. arXiv preprint arXiv:1809.05501 (2018)."},{"key":"e_1_3_2_1_24_1","volume-title":"Kasra Hosseini, Barbara McGillivray, and Giovanni Colavizza.","author":"van Strien Daniel","year":"2020","unstructured":"Daniel van Strien , Kaspar Beelen , Mariona Coll Ardanuy , Kasra Hosseini, Barbara McGillivray, and Giovanni Colavizza. 2020 . Assessing the impact of OCR quality on downstream NLP tasks. (2020). Daniel van Strien, Kaspar Beelen, Mariona Coll Ardanuy, Kasra Hosseini, Barbara McGillivray, and Giovanni Colavizza. 2020. Assessing the impact of OCR quality on downstream NLP tasks. (2020)."}],"event":{"name":"JCDL '22: The ACM\/IEEE Joint Conference on Digital Libraries in 2022","sponsor":["SIGWEB ACM Special Interest Group on Hypertext, Hypermedia, and Web","SIGIR ACM Special Interest Group on Information Retrieval","IEEE Technical Committee on Digital Libraries (TC DL)"],"location":"Cologne Germany","acronym":"JCDL '22"},"container-title":["Proceedings of the 22nd ACM\/IEEE Joint Conference on Digital Libraries"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3529372.3533298","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3529372.3533298","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:30:39Z","timestamp":1750188639000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3529372.3533298"}},"subtitle":["pilot investigations"],"short-title":[],"issued":{"date-parts":[[2022,6,20]]},"references-count":24,"alternative-id":["10.1145\/3529372.3533298","10.1145\/3529372"],"URL":"https:\/\/doi.org\/10.1145\/3529372.3533298","relation":{},"subject":[],"published":{"date-parts":[[2022,6,20]]},"assertion":[{"value":"2022-06-20","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}