{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T21:51:06Z","timestamp":1740174666123,"version":"3.37.3"},"reference-count":57,"publisher":"Oxford University Press (OUP)","issue":"2","license":[{"start":{"date-parts":[[2022,12,22]],"date-time":"2022-12-22T00:00:00Z","timestamp":1671667200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,5,31]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The availability of large digital archives of historical newspaper content has transformed the historical sciences. However, the scale of these archives can limit the direct application of advanced text processing methods. Even if it is computationally feasible to apply sophisticated language processing to an entire digital archive, if the material of interest is a small fraction of the archive, the results are unlikely to be useful. Methods for generating smaller specialized corpora from large archives are required to solve this problem. This article presents such a method for historical newspaper archives digitized using the METS\/ALTO XML standard (Veridian Software, n.d.). The method is an \u2018iterative bootstrapping\u2019 approach in which candidate corpora are evaluated using text mining techniques, items are manually labelled, and Na\u00efve Bayes text classifiers are trained and applied in order to produce new candidate corpora. The method is illustrated by a case study that investigates philosophical content, broadly construed, in pre-1900 English-language New Zealand newspapers. 
Extensive code is provided in Supplementary Materials.<\/jats:p>","DOI":"10.1093\/llc\/fqac079","type":"journal-article","created":{"date-parts":[[2022,12,22]],"date-time":"2022-12-22T18:23:17Z","timestamp":1671733397000},"page":"779-797","source":"Crossref","is-referenced-by-count":1,"title":["Creating specialized corpora from digitized historical newspaper archives"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8272-5763","authenticated-orcid":false,"given":"Joshua","family":"Wilson Black","sequence":"first","affiliation":[{"name":"UC Arts Digital Lab, University of Canterbury , Christchurch, New Zealand"},{"name":"New Zealand Institute of Language, Brain and Behaviour, University of Canterbury , Christchurch, New Zealand"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2022,12,22]]},"reference":[{"volume-title":"What Is Philosophy?","year":"2017","author":"Agamben","key":"2023053108323634600_fqac079-B1"},{"issue":"1","key":"2023053108323634600_fqac079-B2","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1007\/s10790-017-9602-0","article-title":"Identifying virtues and values through obituary data-mining","volume":"52","author":"Alfano","year":"2018","journal-title":"Journal of Value Inquiry"},{"key":"2023053108323634600_fqac079-B3","doi-asserted-by":"crossref","DOI":"10.5040\/9781350933996","volume-title":"Using Corpora in Discourse Analysis","author":"Baker","year":"2006"},{"key":"2023053108323634600_fqac079-B4","first-page":"47","article-title":"Reading the newspaper in Colonial Otago","volume":"12","author":"Ballantyne","year":"2012","journal-title":"The Journal of New Zealand Studies"},{"key":"2023053108323634600_fqac079-B5","first-page":"19","article-title":"Corpus construction: a principle for qualitative data collection","author":"Bauer","year":"2000","journal-title":"Qualitative Researching with Text, Image and Sound: A Practical 
Handbook"},{"issue":"1","key":"2023053108323634600_fqac079-B6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1515\/ling-2017-0032","article-title":"Reproducible research in linguistics: a position statement on data citation and attribution in our field","volume":"56","author":"Berez-Kroeker","year":"2017","journal-title":"Linguistics"},{"issue":"2","key":"2023053108323634600_fqac079-B7","doi-asserted-by":"crossref","first-page":"225","DOI":"10.1093\/tcbh\/hwq007","article-title":"The digitization of newspaper archives: opportunities and challenges for historians","volume":"21","author":"Bingham","year":"2010","journal-title":"Twentieth Century British History"},{"volume-title":"Natural Language Processing with Python","year":"2009","author":"Bird","key":"2023053108323634600_fqac079-B8"},{"key":"2023053108323634600_fqac079-B9","first-page":"993","article-title":"Latent Dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"Journal of Machine Learning Research"},{"issue":"7","key":"2023053108323634600_fqac079-B10","doi-asserted-by":"crossref","first-page":"897","DOI":"10.1177\/0963662518771400","article-title":"1891: The Collins\u2013Hosking debate, Christchurch","volume":"27","author":"Bush","year":"2018","journal-title":"Public Understanding of Science"},{"key":"2023053108323634600_fqac079-B11","first-page":"39","article-title":"A diachronic corpus of New Zealand newspapers","volume":"25","author":"Calude","year":"2011","journal-title":"New Zealand English Journal"},{"key":"2023053108323634600_fqac079-B12","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511694295","volume-title":"On the Origin of Species","author":"Darwin","year":"2009"},{"key":"2023053108323634600_fqac079-B13","first-page":"36","article-title":"A dangerous visionary? The lectures of the evolutionist T.J. 
Parker","volume":"15","author":"Crane","year":"2013","journal-title":"The Journal of New Zealand Studies"},{"key":"2023053108323634600_fqac079-B14","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1007\/978-94-007-6958-8_2","volume-title":"History of Philosophy in Australia and New Zealand","author":"Davies","year":"2014"},{"volume-title":"What Is Philosophy","year":"1991","author":"Deleuze","key":"2023053108323634600_fqac079-B15"},{"key":"2023053108323634600_fqac079-B16","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1007\/s10032-020-00359-9","article-title":"Optical character recognition with neural networks and post-correction with finite state methods","volume":"23","author":"Drobac","year":"2020","journal-title":"International Journal on Document Analysis and Recognition"},{"issue":"2","key":"2023053108323634600_fqac079-B17","doi-asserted-by":"crossref","first-page":"261","DOI":"10.1353\/vpr.2015.0014","article-title":"Technologies of serendipity","volume":"48","author":"Fyfe","year":"2015","journal-title":"Victorian Periodicals Review"},{"year":"2018","author":"Froehlich","key":"2023053108323634600_fqac079-B18"},{"volume-title":"An Introduction to Statistical Learning: with Applications in R","year":"2021","author":"Gareth","key":"2023053108323634600_fqac079-B151"},{"volume-title":"Writing History in the Digital Age","year":"2013","author":"Gibbs","key":"2023053108323634600_fqac079-B19"},{"volume-title":"Philosophy and Its History: Aims and Methods in the Study of Early Modern Philosophy.","year":"2013","author":"Goldenbaum","key":"2023053108323634600_fqac079-B20"},{"volume-title":"Exploring Newspaper Language: Using the Web to Create and Investigate a Large Corpus of Modern Norwegian, Amsterdam, pp. 
111\u2013130","year":"2012","author":"Hagen","key":"2023053108323634600_fqac079-B21"},{"issue":"1","key":"2023053108323634600_fqac079-B22","first-page":"9","article-title":"Confronting the digital","volume":"10","author":"Hitchcock","year":"2013","journal-title":"The Journal of the Social History Society"},{"issue":"1","key":"2023053108323634600_fqac079-B23","first-page":"168","article-title":"The case of a change in meaning and its impact","volume":"16","author":"Keelan","year":"2021","journal-title":"K\u014dtuitui: New Zealand Journal of Social Sciences Online"},{"issue":"1","key":"2023053108323634600_fqac079-B24","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1075\/ijcl.6.1.05kil","article-title":"Comparing corpora","volume":"6","author":"Kilgarriff","year":"2001","journal-title":"International Journal of Corpus Linguistics"},{"issue":"2","key":"2023053108323634600_fqac079-B25","doi-asserted-by":"crossref","first-page":"368","DOI":"10.1093\/llc\/fqy048","article-title":"Toward a model for digital tool criticism","volume":"34","author":"Koolen","year":"2019","journal-title":"Digital Scholarship in the Humanities"},{"key":"2023053108323634600_fqac079-B26","doi-asserted-by":"crossref","DOI":"10.1093\/acprof:oso\/9780199857142.001.0001","volume-title":"Philosophy and Its History: Aims and Methods in the Study of Early Modern Philosophy","author":"Laerke","year":"2013"},{"key":"2023053108323634600_fqac079-B27","doi-asserted-by":"crossref","DOI":"10.1093\/acprof:oso\/9780199857142.001.0001","volume-title":"Philosophy and Its History: Aims and Methods in the Study of Early Modern Philosophy.","author":"Laerke","year":"2013"},{"issue":"4","key":"2023053108323634600_fqac079-B28","doi-asserted-by":"crossref","first-page":"765","DOI":"10.1162\/coli_a_00364","article-title":"Argument mining: a survey","volume":"45","author":"Lawrence","year":"2020","journal-title":"Computational 
Linguistics"},{"issue":"1","key":"2023053108323634600_fqac079-B29","doi-asserted-by":"crossref","first-page":"72","DOI":"10.3366\/jvc.2005.10.1.72","article-title":"Googling the Victorians","volume":"10","author":"Leary","year":"2005","journal-title":"Journal of Victorian Culture"},{"issue":"2","key":"2023053108323634600_fqac079-B30","doi-asserted-by":"crossref","first-page":"308","DOI":"10.1080\/09608788.2020.1774863","article-title":"\u2018Friendly to all beings\u2019: Annie Besant as ethicist","volume":"29","author":"Leland","year":"2021","journal-title":"British Journal for the History of Philosophy"},{"key":"2023053108323634600_fqac079-B32","first-page":"56","volume-title":"Proceedings of the 9th Python in Science Conference, Austin","author":"McKinney","year":"2010"},{"volume-title":"Distant Reading","year":"2013","author":"Moretti","key":"2023053108323634600_fqac079-B33"},{"issue":"1","key":"2023053108323634600_fqac079-B34","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1080\/13688804.2012.752963","article-title":"The digital turn: exploring the methodological possibilities of digital newspaper archives","volume":"19","author":"Nicholson","year":"2013","journal-title":"Media History"},{"year":"2020","author":"Niekler","key":"2023053108323634600_fqac079-B35"},{"key":"2023053108323634600_fqac079-B152","doi-asserted-by":"crossref","first-page":"325","DOI":"10.1007\/s42803-020-00028-7","article-title":"Digital sources and digital archives: historical evidence in the digital age","volume":"1","author":"Owens","year":"2021","journal-title":"International Journal of Digital Humanities"},{"volume-title":"Colonial Discourses: Niupepa M\u0101ori, 1855\u20131863","year":"2006","author":"Paterson","key":"2023053108323634600_fqac079-B36"},{"volume-title":"He Reo W\u0101hine: M\u0101ori Women\u2019s Voices from the Nineteenth 
Century","year":"2017","author":"Paterson","key":"2023053108323634600_fqac079-B37"},{"key":"2023053108323634600_fqac079-B38","first-page":"2825","article-title":"Scikit-learn: machine learning in python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"Journal of Machine Learning Research"},{"year":"2021","author":"Plotly Technologies Inc","key":"2023053108323634600_fqac079-B39"},{"key":"2023053108323634600_fqac079-B40","article-title":"From optical to digital (and back again)","volume":"6","author":"Plunkett","year":"2008","journal-title":"19: Interdisciplinary Studies in the Long Nineteenth Century"},{"issue":"2","key":"2023053108323634600_fqac079-B41","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1017\/S0031819106316026","article-title":"What is philosophy?","volume":"81","author":"Priest","year":"2006","journal-title":"Philosophy"},{"issue":"2","key":"2023053108323634600_fqac079-B42","doi-asserted-by":"crossref","first-page":"377","DOI":"10.1093\/ahr\/121.2.377","article-title":"The transnational and the text-searchable","volume":"121","author":"Putnam","year":"2016","journal-title":"American Historical Review"},{"first-page":"45","year":"2010","author":"\u0158eh\u016f\u0159ek","key":"2023053108323634600_fqac079-B43"},{"volume-title":"Pastplay: Teaching and Learning History with Technology","year":"2014","author":"Ramsay","key":"2023053108323634600_fqac079-B44"},{"key":"2023053108323634600_fqac079-B45","doi-asserted-by":"crossref","first-page":"291","DOI":"10.1111\/1468-229X.12969","article-title":"State of the field: digital history","volume":"105","author":"Romein","year":"2020","journal-title":"History"},{"issue":"4","key":"2023053108323634600_fqac079-B46","doi-asserted-by":"crossref","first-page":"807","DOI":"10.1007\/s10579-019-09458-4","article-title":"Historical corpora meet the digital humanities: the Jerusalem corpus of emergent modern Hebrew","volume":"53","author":"Rubinstein","year":"2019","journal-title":"Language 
Resources & Evaluation"},{"issue":"1","key":"2023053108323634600_fqac079-B47","doi-asserted-by":"crossref","first-page":"204","DOI":"10.1093\/llc\/fqu058","article-title":"The sense of a connection: automatic tracing of intertextuality by meaning","volume":"31","author":"Scheirer","year":"2016","journal-title":"Digital Scholarship in the Humanities"},{"key":"2023053108323634600_fqac079-B50","doi-asserted-by":"publisher","DOI":"10.1109\/jcdl.2014.6970166","volume-title":"IEEE\/ACM Joint Conference on Digital Libraries","author":"Smith","year":"2014"},{"issue":"2","key":"2023053108323634600_fqac079-B51","doi-asserted-by":"crossref","DOI":"10.16995\/dscn.235","article-title":"Patterns of sentimentality in Victorian novels","volume":"3","author":"Steger","year":"2013","journal-title":"Digital Studies\/Le Champ Num\u00e9rique"},{"issue":"2","key":"2023053108323634600_fqac079-B52","doi-asserted-by":"crossref","first-page":"544","DOI":"10.17723\/aarc.74.2.644851p6gmg432h0","article-title":"Archival theory and digital historiography: selection, search, and metadata as archival processes for assessing historical contextualization","volume":"74","author":"Sternfeld","year":"2011","journal-title":"The American Archivist"},{"issue":"1","key":"2023053108323634600_fqac079-B53","article-title":"Mining for the meanings of a murder: the impact of OCR quality on the use of digitized historical newspapers","volume":"8","author":"Strange","year":"2014","journal-title":"Digital Humanities Quarterly"},{"issue":"7\/8","key":"2023053108323634600_fqac079-B54","doi-asserted-by":"crossref","DOI":"10.1045\/july2009-munoz","article-title":"Measuring mass text digitization quality and usefulness: lessons learned from assessing the OCR accuracy of the British Library\u2019s 19th Century online newspaper archive","volume":"15","author":"Tanner","year":"2009","journal-title":"D-Lib Magazine"},{"volume-title":"A Companion to Digital 
Humanities","year":"2004","author":"Thomas","key":"2023053108323634600_fqac079-B55"},{"year":".","author":"Veridian Software","key":"2023053108323634600_fqac079-B57"},{"issue":"1","key":"2023053108323634600_fqac079-B58","doi-asserted-by":"crossref","first-page":"100","DOI":"10.1080\/03036758.2016.1252408","article-title":"R\u0101hui and conservation? M\u0101ori voices in the nineteenth century Niupepa M\u0101ori","volume":"47","author":"Whaanga","year":"2017","journal-title":"Journal of the Royal Society of New Zealand"},{"issue":"4","key":"2023053108323634600_fqac079-B59","doi-asserted-by":"crossref","first-page":"535","DOI":"10.1111\/1467-9809.12089","article-title":"The reign of grace: liberalism and heresy in the new world","volume":"38","author":"Wood","year":"2014","journal-title":"Journal of Religious History"}],"container-title":["Digital Scholarship in the Humanities"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/dsh\/article-pdf\/38\/2\/779\/50488248\/fqac079.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/dsh\/article-pdf\/38\/2\/779\/50488248\/fqac079.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,9]],"date-time":"2024-10-09T20:07:40Z","timestamp":1728504460000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/dsh\/article\/38\/2\/779\/6957053"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,22]]},"references-count":57,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2022,12,22]]},"published-print":{"date-parts":[[2023,5,31]]}},"URL":"https:\/\/doi.org\/10.1093\/llc\/fqac079","relation":{},"ISSN":["2055-7671","2055-768X"],"issn-type":[{"type":"print","value":"2055-7671"},{"type":"electronic","value":"2055-768X"}],"subject":[],"published-other":{"date-parts":[[2023,6,1]]},"published":{"date
-parts":[[2022,12,22]]}}}