{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T01:54:01Z","timestamp":1772502841896,"version":"3.50.1"},"reference-count":62,"publisher":"Emerald","issue":"7","license":[{"start":{"date-parts":[[2023,5,23]],"date-time":"2023-05-23T00:00:00Z","timestamp":1684800000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emerald.com\/insight\/site-policies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["JD"],"published-print":{"date-parts":[[2023,12,18]]},"abstract":"<jats:sec><jats:title content-type=\"abstract-subheading\">Purpose<\/jats:title><jats:p>This study aims to identify user perception of different qualities of optical character recognition (OCR) in texts. The purpose of this paper is to study the effect of different quality OCR on users' subjective perception through an interactive information retrieval task with a collection of one digitized historical Finnish newspaper.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Design\/methodology\/approach<\/jats:title><jats:p>This study is based on the simulated work task model used in interactive information retrieval. Thirty-two users made searches to an article collection of Finnish newspaper Uusi Suometar 1869\u20131918 which consists of ca. 1.45 million autosegmented articles. The article search database had two versions of each article with different quality OCR. Each user performed six pre-formulated and six self-formulated short queries and evaluated subjectively the top 10 results using a graded relevance scale of 0\u20133. Users were not informed about the OCR quality differences of the otherwise identical articles.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Findings<\/jats:title><jats:p>The main result of the study is that improved OCR quality affects subjective user perception of historical newspaper articles positively: higher relevance scores are given to better-quality texts.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Originality\/value<\/jats:title><jats:p>To the best of the authors\u2019 knowledge, this simulated interactive work task experiment is the first one showing empirically that users' subjective relevance assessments are affected by a change in the quality of an optically read text.<\/jats:p><\/jats:sec>","DOI":"10.1108\/jd-01-2023-0002","type":"journal-article","created":{"date-parts":[[2023,5,31]],"date-time":"2023-05-31T03:10:40Z","timestamp":1685502640000},"page":"137-156","source":"Crossref","is-referenced-by-count":2,"title":["Optical character recognition quality affects subjective user perception of historical newspaper clippings"],"prefix":"10.1108","volume":"79","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2747-1382","authenticated-orcid":false,"given":"Kimmo","family":"Kettunen","sequence":"first","affiliation":[]},{"given":"Heikki","family":"Keskustalo","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7016-257X","authenticated-orcid":false,"given":"Sanna","family":"Kumpulainen","sequence":"additional","affiliation":[]},{"given":"Tuula","family":"P\u00e4\u00e4kk\u00f6nen","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5223-9940","authenticated-orcid":false,"given":"Juha","family":"Rautiainen","sequence":"additional","affiliation":[]}],"member":"140","published-online":{"date-parts":[[2023,5,23]]},"reference":[{"key":"key2024010911250686900_ref001","doi-asserted-by":"publisher","first-page":"561","DOI":"10.1145\/1458082.1458157","article-title":"Retrievability: an evaluation measure for higher order information access tasks","year":"2008"},{"key":"key2024010911250686900_ref002","doi-asserted-by":"crossref","unstructured":"Bazzo, G.T., Lorentz, G.A., Suarez Vargas, D. and Moreira, V.P. (2020), \u201cAssessing the impact of OCR errors in information retrieval\u201d, in Jose, J.M., Yilmaz, E., Magalh\u00e3es, J., Castells, P., Ferro, N., Silva, M.J and Martins, F. (Eds), Advances in Information Retrieval. ECIR 2020. Lecture Notes in Computer Science, Springer, Cham, Vol.\u00a012036, pp. 102-109.","DOI":"10.1007\/978-3-030-45442-5_13"},{"key":"key2024010911250686900_ref003","doi-asserted-by":"publisher","article-title":"The Atlas of digitised newspapers and metadata: reports from oceanic Exchanges. Loughborough: 2020","year":"2020","DOI":"10.6084\/m9.figshare.11560059"},{"issue":"1","key":"key2024010911250686900_ref004","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1108\/EUM0000000007110","article-title":"Experimental components for the evaluation of interactive information retrieval systems","volume":"56","year":"2000","journal-title":"Journal of Documentation"},{"key":"key2024010911250686900_ref005","doi-asserted-by":"publisher","first-page":"324","DOI":"10.1145\/290941.291019","article-title":"Measures of relative relevance and ranked half-life: performance indicators for interactive IR","year":"1998"},{"key":"key2024010911250686900_ref006","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/JCDL.2017.7991582","article-title":"Impact of OCR errors on the use of digital libraries: towards a better access to information","year":"2017"},{"key":"key2024010911250686900_ref007","doi-asserted-by":"publisher","article-title":"Scenario driven in-depth performance evaluation of document layout analysis methods","year":"2011","DOI":"10.1109\/ICDAR.2011.282"},{"key":"key2024010911250686900_ref008","doi-asserted-by":"crossref","unstructured":"Clausner, C., Pletshacher, S. and Antonacopoulos, A. (2017), \u201cICDAR2017 competition on recognition of documents with complex layouts \u2013 RDCL2017\u201d, available at: https:\/\/ieeexplore.ieee.org\/document\/8270160","DOI":"10.1109\/ICDAR.2017.229"},{"key":"key2024010911250686900_ref009","first-page":"1521","article-title":"ICDAR2019 competition on recognition of documents with complex layouts \u2013 RDCL2019","year":"2019"},{"key":"key2024010911250686900_ref010","volume-title":"Search Engines. Information Retrieval in Practice","year":"2010"},{"key":"key2024010911250686900_ref011","doi-asserted-by":"crossref","unstructured":"Dengel, A. and Shafait, F. (2014), \u201cAnalysis of the logical layout of documents\u201d, in Doerman, D. and Tombre, K. (Eds), Handbook of Document Image Processing and Recognition, Springer, London, pp. 177-222.","DOI":"10.1007\/978-0-85729-859-1_6"},{"issue":"10","key":"key2024010911250686900_ref012","doi-asserted-by":"publisher","first-page":"1297","DOI":"10.1080\/00140139208967394","article-title":"Reading from paper versus screens: a critical review of the empirical literature","volume":"35","year":"1992","journal-title":"Ergonomics"},{"key":"key2024010911250686900_ref013","unstructured":"Dunning, A. (2012), \u201cEuropean newspaper survey report\u201d, available at: http:\/\/www.europeana-newspapers.eu\/wp-content\/uploads\/2012\/04\/D4.1-Europeana-newspapers-survey-report.pdf (accessed 15 December 2022)."},{"key":"key2024010911250686900_ref014","volume-title":"Historic Newspapers in the Digital Age. Search All about it!","year":"2018"},{"key":"key2024010911250686900_ref015","doi-asserted-by":"publisher","first-page":"159","DOI":"10.1007\/s10032-010-0132-6","article-title":"Towards information retrieval on historical document collections: the role of matching procedures and special lexica","volume":"14","year":"2011","journal-title":"International Journal on Document Analysis and Recognition"},{"key":"key2024010911250686900_ref016","doi-asserted-by":"publisher","first-page":"7","DOI":"10.1145\/29933.30853","article-title":"Why reading was slower from CRT displays than from paper","year":"1986"},{"issue":"5","key":"key2024010911250686900_ref017","doi-asserted-by":"crossref","first-page":"497","DOI":"10.1177\/001872088702900501","article-title":"Reading from CRT displays can be as fast as reading from paper","volume":"29","year":"1987","journal-title":"Human Factors"},{"key":"key2024010911250686900_ref018","doi-asserted-by":"publisher","first-page":"173","DOI":"10.1145\/2595188.2595217","article-title":"PIVAJ: displaying and augmenting digitized newspapers on the web experimental feedback from the \u201cJournal de Rouen\u201d collection","year":"2014"},{"key":"key2024010911250686900_ref019","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1145\/2595188.2595195","article-title":"Automatic article extraction in old newspapers digitized collections","year":"2014"},{"issue":"4","key":"key2024010911250686900_ref020","doi-asserted-by":"publisher","first-page":"825","DOI":"10.1093\/llc\/fqz024","article-title":"Quantifying the impact of dirty OCR on historical text analysis: eighteenth Century Collections Online as a case study","volume":"34","year":"2019","journal-title":"Digital Scholarship in the Humanities"},{"key":"key2024010911250686900_ref021","unstructured":"Hynynen, M.-L. (2019), \u201cBuilding a bilingual nation\u201d, available at: https:\/\/www.newseye.eu\/blog\/news\/building-a-bilingual-nation\/ (accessed 15 December 2022)."},{"key":"key2024010911250686900_ref022","volume-title":"The Turn. Integration of Information Seeking and Retrieval in Context","year":"2005"},{"issue":"12","key":"key2024010911250686900_ref023","doi-asserted-by":"crossref","first-page":"2928","DOI":"10.1002\/asi.23379","article-title":"Information retrieval from historical newspaper collections in highly inflectional languages: a query expansion approach","volume":"67","year":"2016","journal-title":"Journal of the Association for Information Science and Technology"},{"issue":"2","key":"key2024010911250686900_ref024","doi-asserted-by":"publisher","first-page":"207","DOI":"10.1016\/S0306-4573(99)00056-4","article-title":"Real life, real users, and real needs: a study and analysis of user queries on the Web","volume":"36","year":"2000","journal-title":"Information Processing and Management"},{"issue":"6","key":"key2024010911250686900_ref025","doi-asserted-by":"publisher","first-page":"1228","DOI":"10.1108\/JD-09-2016-0106","article-title":"Cultural heritage as digital noise: nineteenth century newspapers in the digital archive","volume":"73","year":"2017","journal-title":"Journal of Documentation"},{"key":"key2024010911250686900_ref026","doi-asserted-by":"crossref","unstructured":"Karlgren, J., Hedlund, T., J\u00e4rvelin, K., Keskustalo, H. and Kettunen, K. (2019), \u201cThe challenges of language variation in information access\u201d, in Ferro, N. and Peters, C. (Eds), From Multilingual to Multimodal: The Evolution of CLEF over Two Decades. Lessons Learned from 20 Years of CLEF, Springer, Switzerland, pp. 201-216.","DOI":"10.1007\/978-3-030-22948-1_8"},{"issue":"1-2","key":"key2024010911250686900_ref027","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1561\/1500000012","article-title":"Methods for evaluating interactive information retrieval systems with users","volume":"3","year":"2009","journal-title":"Foundations and Trends\u00ae in Information Retrieval"},{"key":"key2024010911250686900_ref028","doi-asserted-by":"crossref","unstructured":"Kettunen, K. and Koistinen, M. (2019), \u201cOpen source Tesseract in Re-OCR of Finnish Fraktur from 19th and early 20th century newspapers and journals \u2013 collected notes on quality improvement\u201d, DHN2019, available at: https:\/\/ceur-ws.org\/Vol-2364\/25_paper.pdf","DOI":"10.5617\/dhnbpub.11102"},{"key":"key2024010911250686900_ref029","article-title":"Measuring lexical quality of a historical Finnish newspaper collection \u2013 analysis of garbled OCR data with basic language technology tools and means. LREC 2016","year":"2016"},{"key":"key2024010911250686900_ref030","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1145\/3322905.3322911","article-title":"Detecting articles in a digitized Finnish historical newspaper collection 1771-1929: early results using the PIVAJ software","year":"2019"},{"key":"key2024010911250686900_ref031","doi-asserted-by":"crossref","unstructured":"Kettunen, K., P\u00e4\u00e4kk\u00f6nen, T. and Liukkonen, E. (2019b), \u201cClipping the page \u2013 automatic article detection and marking software in production of newspaper clippings of a digitized historical journalistic collection\u201d, in Doucet, A., Isaac, A., Golub, K., Aalberg, T. and Jatowt, A. (Eds), TPDL 2019, LNCS 11799, Springer Cham, Switzerland, pp.\u00a0356-360, doi: 10.1007\/978-3-030-30760-8.","DOI":"10.1007\/978-3-030-30760-8_33"},{"key":"key2024010911250686900_ref032","doi-asserted-by":"publisher","article-title":"Reusing the model and components of an IIR study for perceived effects of OCR quality change. BIIRRR 2022","year":"2022","DOI":"10.5281\/zenodo.6513586"},{"key":"key2024010911250686900_ref033","doi-asserted-by":"crossref","unstructured":"Kise, K. (2014), \u201cPage segmentation techniques in document analysis\u201d, in Doerman, D. and Tombre, K. (Eds), Handbook of Document Image Processing and Recognition, Springer, London, pp. 135-175.","DOI":"10.1007\/978-0-85729-859-1_5"},{"issue":"5","key":"key2024010911250686900_ref034","doi-asserted-by":"publisher","first-page":"615","DOI":"10.1080\/00140139.2015.1100757","article-title":"Reading from computer screen versus reading from paper: does it still make a difference?","volume":"59","year":"2016","journal-title":"Ergonomics"},{"key":"key2024010911250686900_ref035","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1145\/3295750.3298931","article-title":"Interacting with digital documents: a real life study of historians' task processes, actions and goals","year":"2019"},{"issue":"7","key":"key2024010911250686900_ref036","doi-asserted-by":"publisher","first-page":"1012","DOI":"10.1002\/asi.24608","article-title":"Struggling with digitized historical newspapers: contextual barriers to information interaction in history research activities","volume":"73","year":"2022","journal-title":"Journal of the Association for Information Science and Technology"},{"issue":"7","key":"key2024010911250686900_ref037","doi-asserted-by":"publisher","first-page":"106","DOI":"10.1108\/JD-04-2021-0078","article-title":"Interacting with digitised historical newspapers: understanding the use of digital surrogates as primary sources","volume":"78","year":"2021","journal-title":"Journal of Documentation"},{"key":"key2024010911250686900_ref038","doi-asserted-by":"publisher","first-page":"141","DOI":"10.1007\/s10032-009-0094-8","article-title":"Optical character recognition errors and their effects on natural language processing","volume":"12","year":"2009","journal-title":"International Journal on Document Analysis and Recognition"},{"key":"key2024010911250686900_ref039","first-page":"55","article-title":"Interdisciplinary collaboration in studying newspaper materiality","year":"2019"},{"issue":"1","key":"key2024010911250686900_ref040","doi-asserted-by":"publisher","first-page":"54","DOI":"10.21825\/jeps.v4i1.10483","article-title":"A\u00a0national public sphere? Analyzing the language, location, and form of newspapers in Finland, 1771-1917","volume":"4","year":"2019","journal-title":"Journal of European Periodical Studies"},{"issue":"3","key":"key2024010911250686900_ref041","doi-asserted-by":"publisher","first-page":"189","DOI":"10.1023\/A:1026564708926","article-title":"Information retrieval can cope with many errors","volume":"3","year":"2000","journal-title":"Information Retrieval"},{"issue":"5","key":"key2024010911250686900_ref042","doi-asserted-by":"crossref","first-page":"954","DOI":"10.1108\/JD-07-2018-0114","article-title":"Transforming scholarship in the archives through handwritten text recognition: transkribus as a case study","volume":"75","year":"2019","journal-title":"Journal of Documentation"},{"key":"key2024010911250686900_ref043","doi-asserted-by":"crossref","unstructured":"Neudecker, C. and Antonacopoulos, A. (2016), \u201cMaking europe's historical newspapers searchable\u201d, 12th IAPR Workshop on Document Analysis Systems (DAS), Santorini, Greece, pp.\u00a0405-410, 2016, doi: 10.1109\/DAS.2016.83.","DOI":"10.1109\/DAS.2016.83"},{"issue":"6","key":"key2024010911250686900_ref044","doi-asserted-by":"publisher","first-page":"37","DOI":"10.1145\/3453476","article-title":"Survey of post-OCR processing approaches","volume":"54","year":"2021","journal-title":"ACM Computing Survey"},{"issue":"2","key":"key2024010911250686900_ref045","doi-asserted-by":"publisher","first-page":"225","DOI":"10.1002\/asi.24565","article-title":"Integrated interdisciplinary workflows for research on historical newspapers: perspectives from humanities scholars, computer scientists, and librarians","volume":"73","year":"2021","journal-title":"Journal of the Association for Information Science and Technology"},{"issue":"2","key":"key2024010911250686900_ref046","doi-asserted-by":"publisher","first-page":"130","DOI":"10.1080\/10790195.2022.2028593","article-title":"Reading from screen vs reading from paper: does it really matter?","volume":"52","year":"2022","journal-title":"Journal of College Reading and Learning"},{"issue":"2","key":"key2024010911250686900_ref047","doi-asserted-by":"publisher","first-page":"317","DOI":"10.1002\/asi.24547","article-title":"Giving shape to large digital libraries through exploratory data analysis","volume":"73","year":"2021","journal-title":"Journal of the Association for Information Science and Technology"},{"key":"key2024010911250686900_ref048","doi-asserted-by":"publisher","first-page":"2021","DOI":"10.46298\/jdmdh.6121","article-title":"Digital interfaces of historical newspapers: opportunities, restrictions and recommendations","volume":"11","year":"2021","journal-title":"Journal of Data Mining and Digital Humanities, January"},{"key":"key2024010911250686900_ref049","volume-title":"Natural Language Processing for Historical Texts","year":"2012"},{"issue":"1","key":"key2024010911250686900_ref050","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1080\/01615440.2020.1803166","article-title":"The reuse of texts in Finnish newspapers and journals, 1771-1920: a digital humanities perspective","volume":"54","year":"2020","journal-title":"Historical Methods: A Journal of Quantitative and Interdisciplinary History"},{"key":"key2024010911250686900_ref051","doi-asserted-by":"publisher","first-page":"527","DOI":"10.5555\/2028299.2028394","article-title":"Comparative information retrieval evaluation for scanned documents","year":"2011"},{"issue":"1","key":"key2024010911250686900_ref052","article-title":"Mining for the meanings of a murder: the impact of OCR quality on the use of digitized historical newspapers","volume":"8","year":"2014","journal-title":"Digital Humanitites Quarterly"},{"issue":"1","key":"key2024010911250686900_ref053","doi-asserted-by":"publisher","first-page":"64","DOI":"10.1145\/214174.214180","article-title":"Evaluation of model-based retrieval effectiveness with OCR text","volume":"14","year":"1996","journal-title":"ACM Transactions on Information Systems"},{"issue":"7\/8","key":"key2024010911250686900_ref054","doi-asserted-by":"publisher","DOI":"10.1045\/july2009-munoz","article-title":"Measuring mass text digitization quality and usefulness. Lessons learned from assessing the OCR accuracy of the British library's 19th century online newspaper archive","volume":"15","year":"2009","journal-title":"D-lib Magazine"},{"key":"key2024010911250686900_ref055","doi-asserted-by":"crossref","unstructured":"Torget, A.J. (2022), \u201cMapping texts: examining the effects of OCR noise on historical newspaper collections\u201d, in Bunout, E., Ehrmann, M. and Clavert, F. (Eds), Digitised Newspapers \u2013 A New Eldorado for Historians?: Reflections on Tools, Methods and Epistemology, De Gruyter Oldenbourg, Berlin, Boston, pp.\u00a047-66, 2023, doi: 10.1515\/9783110729214-003.","DOI":"10.1515\/9783110729214-003"},{"key":"key2024010911250686900_ref056","doi-asserted-by":"crossref","unstructured":"Traub, M.C., van Ossenbruggen, J. and Hardman, L. (2015), \u201cImpact analysis of OCR quality on research tasks in digital archives\u201d, in Kapidakis, S., Mazurek, C. and Werla, M. (Eds), Research and Advanced Technology for Digital Libraries. TPDL 2015. Lecture Notes in Computer Science, Vol.\u00a09316, Springer, Cham, doi: 10.1007\/978-3-319-24592-8_19.","DOI":"10.1007\/978-3-319-24592-8_19"},{"key":"key2024010911250686900_ref057","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1145\/3197026.3197046","article-title":"Impact of crowdsourcing OCR improvements on retrievability bias","year":"2018"},{"key":"key2024010911250686900_ref058","doi-asserted-by":"publisher","first-page":"484","DOI":"10.5220\/0009169004840496","article-title":"Assessing the impact of OCR quality on downstream NLP tasks","year":"2020"},{"key":"key2024010911250686900_ref059","volume-title":"What Impacts Success in Proofreading? A Literature Review of Proofreading on Screen vs on Paper","year":"2022"},{"key":"key2024010911250686900_ref060","volume-title":"Maailmanhistorian pikkuj\u00e4ttil\u00e4inen","year":"1988"},{"key":"key2024010911250686900_ref061","volume-title":"Suomen historian pikkuj\u00e4ttil\u00e4inen","year":"1989"},{"issue":"2","key":"key2024010911250686900_ref062","first-page":"165","article-title":"The TREC-5 confusion track: comparing retrieval methods for scanned text","volume":"2","year":"2000","journal-title":"Information Retrieval"}],"container-title":["Journal of Documentation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/JD-01-2023-0002\/full\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/JD-01-2023-0002\/full\/html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T22:33:11Z","timestamp":1753396391000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/jd\/article\/79\/7\/137-156\/194961"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,23]]},"references-count":62,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2023,5,23]]},"published-print":{"date-parts":[[2023,12,18]]}},"alternative-id":["10.1108\/JD-01-2023-0002"],"URL":"https:\/\/doi.org\/10.1108\/jd-01-2023-0002","relation":{},"ISSN":["0022-0418"],"issn-type":[{"value":"0022-0418","type":"print"}],"subject":[],"published":{"date-parts":[[2023,5,23]]}}}