{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,25]],"date-time":"2026-02-25T17:21:48Z","timestamp":1772040108862,"version":"3.50.1"},"reference-count":45,"publisher":"Emerald","issue":"7","license":[{"start":{"date-parts":[[2023,7,31]],"date-time":"2023-07-31T00:00:00Z","timestamp":1690761600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emerald.com\/insight\/site-policies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["JD"],"published-print":{"date-parts":[[2023,12,18]]},"abstract":"<jats:sec><jats:title content-type=\"abstract-subheading\">Purpose<\/jats:title><jats:p>Many libraries and archives maintain collections of research documents, such as administrative records, with paper-based formats that limit the documents' access to in-person use. Digitization transforms paper-based collections into more accessible and analyzable formats. As collections are digitized, there is an opportunity to incorporate deep learning techniques, such as Document Image Analysis (DIA), into workflows to increase the usability of information extracted from archival documents. This paper describes the authors' approach using digital scanning, optical character recognition (OCR) and deep learning to create a digital archive of administrative records related to the mortgage guarantee program of the Servicemen's Readjustment Act of 1944, also known as the G.I. Bill.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Design\/methodology\/approach<\/jats:title><jats:p>The authors used a collection of 25,744 semi-structured paper-based records from the administration of G.I. Bill Mortgages from 1946 to 1954 to develop a digitization and processing workflow. These records include the name and city of the mortgagor, the amount of the mortgage, the location of the Reconstruction Finance Corporation agent, one or more identification numbers and the name and location of the bank handling the loan. The authors extracted structured information from these scanned historical records in order to create a tabular data file and link them to other authoritative individual-level data sources.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Findings<\/jats:title><jats:p>The authors compared the flexible character accuracy of five OCR methods. The authors then compared the character error rate (CER) of three text extraction approaches (regular expressions, DIA and named entity recognition (NER)). The authors were able to obtain the highest quality structured text output using DIA with the Layout Parser toolkit by post-processing with regular expressions. Through this project, the authors demonstrate how DIA can improve the digitization of administrative records to automatically produce a structured data resource for researchers and the public.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Originality\/value<\/jats:title><jats:p>The authors' workflow is readily transferable to other archival digitization projects. Through the use of digital scanning, OCR and DIA processes, the authors created the first digital microdata file of administrative records related to the G.I. Bill mortgage guarantee program available to researchers and the general public. These records offer research insights into the lives of veterans who benefited from loans, the impacts on the communities built by the loans and the institutions that implemented them.<\/jats:p><\/jats:sec>","DOI":"10.1108\/jd-03-2023-0055","type":"journal-article","created":{"date-parts":[[2023,8,10]],"date-time":"2023-08-10T08:22:04Z","timestamp":1691655724000},"page":"225-239","source":"Crossref","is-referenced-by-count":3,"title":["Digitizing and parsing semi-structured historical administrative documents from the\u00a0G.I. Bill mortgage guarantee program"],"prefix":"10.1108","volume":"79","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5896-7295","authenticated-orcid":false,"given":"Sara","family":"Lafia","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7715-4348","authenticated-orcid":false,"given":"David A.","family":"Bleckley","sequence":"additional","affiliation":[]},{"given":"J. Trent","family":"Alexander","sequence":"additional","affiliation":[]}],"member":"140","published-online":{"date-parts":[[2023,7,31]]},"reference":[{"key":"key2024010911251121000_ref001","unstructured":"ABBYY (2019), \u201cABBYY FineReader PDF (version 15) [computer software]\u201d, available at: https:\/\/pdf.abbyy.com\/media\/1676\/users_guide.pdf"},{"key":"key2024010911251121000_ref002","unstructured":"Adobe (2022), \u201cAcrobat Pro 64-bit (version 2022) [computer software]\u201d, available at: https:\/\/www.adobe.com\/acrobat\/acrobat-pro.html"},{"key":"key2024010911251121000_ref003","doi-asserted-by":"publisher","DOI":"10.1016\/j.eeh.2022.101469","article-title":"Digitization and data frames for card index records","volume":"87","year":"2023","journal-title":"Explorations in Economic History"},{"key":"key2024010911251121000_ref004","doi-asserted-by":"publisher","first-page":"296","DOI":"10.1109\/ICDAR.2009.271","article-title":"A realistic dataset for performance evaluation of document layout analysis","year":"2009"},{"issue":"4","key":"key2024010911251121000_ref005","doi-asserted-by":"publisher","first-page":"171","DOI":"10.1515\/MFIR.2004.171","volume":"33","year":"2004","journal-title":"Recognizing Digitization as a Preservation Reformatting Method"},{"issue":"5","key":"key2024010911251121000_ref006","doi-asserted-by":"publisher","first-page":"659","DOI":"10.1108\/00220411211256021","article-title":"Open source optical character recognition for historical research","volume":"68","year":"2012","journal-title":"Journal of Documentation"},{"key":"key2024010911251121000_ref007","first-page":"1","volume-title":"Optimizing OCR Accuracy on Older Documents: A Study of Scan Mode, File Enhancement, and Software Products","year":"2006"},{"key":"key2024010911251121000_ref008","unstructured":"Brahney, K. (2015), \u201cInformation extraction from semi-structured documents MSci. Computer science with industrial experience\u201d, available at: http:\/\/miami-nice.co.uk\/information-extraction-from-docs.pdf (accessed 05 June 2015)."},{"key":"key2024010911251121000_ref009","doi-asserted-by":"publisher","first-page":"390","DOI":"10.1016\/j.patrec.2020.02.003","article-title":"Flexible character accuracy measure for reading-order-independent evaluation","volume":"131","year":"2020","journal-title":"Pattern Recognition Letters"},{"key":"key2024010911251121000_ref010","first-page":"73","volume-title":"Technical Guidelines for Digitizing Cultural Heritage Materials","author":"Federal Agencies Digital Guidelines Initiative","year":"2022"},{"key":"key2024010911251121000_ref011","first-page":"1440","article-title":"Fast R-CNN","year":"2015"},{"key":"key2024010911251121000_ref012","unstructured":"Hoffstaetter, S. (2021), \u201cPython-tesseract (version 0.3.8) [computer software]\u201d, available at: https:\/\/github.com\/madmaze\/pytesseract"},{"key":"key2024010911251121000_ref013","unstructured":"Index to Loans on Veterans Administration Guaranteed Mortgages, 1946 \u2013 1954 (n.d.), \u201cData set\u201d, in National Archives NextGen Catalog, available at: https:\/\/catalog.archives.gov\/id\/783095"},{"key":"key2024010911251121000_ref014","doi-asserted-by":"publisher","first-page":"461","DOI":"10.1145\/1772690.1772738","article-title":"A scalable machine-learning approach for semi-structured named entity recognition","year":"2010"},{"issue":"1","key":"key2024010911251121000_ref015","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1007\/bf02703309","article-title":"Document image analysis: a primer","volume":"27","year":"2002","journal-title":"Sadhana"},{"issue":"3","key":"key2024010911251121000_ref016","doi-asserted-by":"publisher","first-page":"519","DOI":"10.1017\/s1537592708081267","article-title":"On race and policy history: a dialogue about the G.I. Bill","volume":"6","year":"2008","journal-title":"Perspectives on Politics"},{"key":"key2024010911251121000_ref017","doi-asserted-by":"publisher","first-page":"3055","DOI":"10.1145\/3340531.3412767","article-title":"The newspaper navigator dataset: extracting headlines and visual content from 16 million historic newspaper pages in chronicling America","year":"2020"},{"issue":"10","key":"key2024010911251121000_ref018","doi-asserted-by":"publisher","DOI":"10.5210\/fm.v13i10.2101","article-title":"Mass book digitization: the deeper story of Google books and the open content alliance","volume":"13","year":"2008","journal-title":"First Monday"},{"issue":"2","key":"key2024010911251121000_ref019","doi-asserted-by":"publisher","first-page":"167","DOI":"10.1109\/TIT.1962.1057696","article-title":"Design factors in the development of an optical character recognition machine","volume":"8","year":"1962","journal-title":"IRE Transactions on Information Theory"},{"issue":"6","key":"key2024010911251121000_ref020","doi-asserted-by":"publisher","first-page":"1249","DOI":"10.1108\/JD-04-2021-0080","article-title":"The emergence of digital reformatting in the history of preservation knowledge: 1823-2015","volume":"78","year":"2022","journal-title":"Journal of Documentation"},{"issue":"2","key":"key2024010911251121000_ref021","first-page":"135","article-title":"Web entity detection for semi-structured text data records with unlabeled data","volume":"4","year":"2013","journal-title":"International Journal Of. Computational Linguistics and Applications"},{"key":"key2024010911251121000_ref022","unstructured":"Montani, I., Honnibal, M., Boyd, A., Van Landeghem, S., Peters, H., O'Leary McCann, P., Geovedi, J., O'Regan, J., Samsonov, M., de Kok, D., Orosz, G., Bl\u00e4ttermann, M., Altinok, D., Mitsch, R., Kannan, M., Lind Kristiansen, S., Miranda, L., Bournhonesque, R., Baumgartner, P., Hudson, R., Fiedler, L., Daniels, R. and Phatthiyaphaibun, W. (2020), \u201cspaCy: industrial-strength natural language processing in Python (Version v3) [Computer software]\u201d, Zenodo, doi: 10.5281\/zenodo.1212303."},{"issue":"7","key":"key2024010911251121000_ref023","doi-asserted-by":"publisher","first-page":"1093","DOI":"10.1109\/5.156472","article-title":"\u2018At the frontiers of OCR\u2019","volume":"80","year":"1992","journal-title":"Proceedings of the IEEE. Institute of Electrical and Electronics Engineers"},{"key":"key2024010911251121000_ref045","unstructured":"National Archives and Records Administration (n.d), \u201cNational archives catalog\u201d, available at: https:\/\/catalog.archives.gov\/"},{"issue":"1","key":"key2024010911251121000_ref024","first-page":"12","article-title":"Museum libraries: how digitization can enhance the value of the museum","volume":"1","year":"2011","journal-title":"Palabra Clave (La Plata)"},{"key":"key2024010911251121000_ref025","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1145\/3476887.3476888","article-title":"A\u00a0survey of OCR evaluation tools and metrics","year":"2021"},{"key":"key2024010911251121000_ref026","volume-title":"OmniPage Professional (Version 18) [Computer Software]","author":"Nuance Communications, Inc","year":"2011"},{"key":"key2024010911251121000_ref027","doi-asserted-by":"publisher","first-page":"67","DOI":"10.1145\/2037342.2037354","article-title":"Performing information extraction to improve OCR error detection in semi-structured historical documents","year":"2011"},{"key":"key2024010911251121000_ref028","unstructured":"Padilla, T., Allen, L., Frost, H., Potvin, S., Roke, E.R. and Varner, S. (2019), \u201cAlways already computational: collections as data: final report\u201d, available at: https:\/\/digitalcommons.unl.edu\/scholcom\/181\/"},{"key":"key2024010911251121000_ref029","doi-asserted-by":"publisher","first-page":"237","DOI":"10.1109\/IWSSIP48289.2020.9145130","article-title":"A survey on performance metrics for object-detection algorithms","year":"2020"},{"key":"key2024010911251121000_ref030","doi-asserted-by":"publisher","first-page":"30","DOI":"10.1109\/ICDAR.2017.325","article-title":"Exploiting state-of-the-art deep learning methods for document image analysis","year":"2017"},{"key":"key2024010911251121000_ref031","unstructured":"PRImA Research Lab (2018), \u201cPRImA text evaluation tool (version 1.5) [computer software]\u201d, available at: https:\/\/www.primaresearch.org\/tools\/PerformanceEvaluation"},{"key":"key2024010911251121000_ref032","volume-title":"Technical Guidelines for Digitizing Archival Materials for Electronic Access: Creation of Production Master Files - Raster Images","year":"2005"},{"key":"key2024010911251121000_ref033","unstructured":"Puigcerver, J. (2014), \u201cxer\u201d, available at: https:\/\/github.com\/jpuigcerver\/xer"},{"key":"key2024010911251121000_ref034","unstructured":"Rice, S.V. (1996), \u201cMeasuring the accuracy of page-reading systems\u201d, in Nartker, T.A. (Ed.), University of Nevada, Las Vegas, available at: https:\/\/www.proquest.com\/dissertations-theses\/measuring-accuracy-page-reading-systems\/docview\/304329395\/se-2"},{"key":"key2024010911251121000_ref035","unstructured":"Servicemen\u2019s Readjustment Act of 1944 (1944), \u201c78th Congress, Pub. L. 346, 18\u201d, available at: https:\/\/hdl-handle-net.proxy.lib.umich.edu\/2027\/umn.31951d03569283l"},{"key":"key2024010911251121000_ref036","doi-asserted-by":"publisher","first-page":"131","DOI":"10.1007\/978-3-030-86549-8_9","article-title":"LayoutParser: a unified toolkit for deep learning based document image analysis","volume":"2021","year":"2021","journal-title":"Document Analysis and Recognition \u2013 ICDAR"},{"key":"key2024010911251121000_ref037","doi-asserted-by":"publisher","first-page":"629","DOI":"10.1109\/ICDAR.2007.4376991","article-title":"An overview of the tesseract OCR engine","year":"2007"},{"issue":"4","key":"key2024010911251121000_ref038","doi-asserted-by":"publisher","first-page":"545","DOI":"10.1108\/AJIM-11-2019-0326","article-title":"Optimisation of archival processes involving digitisation of typewritten documents","volume":"72","year":"2020","journal-title":"Aslib Journal of Information Management"},{"key":"key2024010911251121000_ref039","volume-title":"Automatic Character Recognition: A State-Of-The-Art Report","year":"1961"},{"key":"key2024010911251121000_ref040","unstructured":"Tesseract (2021), \u201cTesseract OCR (version 5.0) [computer software]\u201d, available at: https:\/\/github.com\/tesseract-ocr\/tesseract"},{"key":"key2024010911251121000_ref041","unstructured":"Tkachenko, M., Malyuk, M., Shevchenko, N., Holmanyuk, A. and Liubimov, N. (2020), \u201cLabelStudio:Data labeling software (version 1.7) [computer software]\u201d, available at: https:\/\/github.com\/heartexlabs\/label-studio"},{"key":"key2024010911251121000_ref042","unstructured":"United States Department of Veterans Affairs (2013), \u201cHistory and timeline\u2014education and training\u201d, available at: https:\/\/www.va.gov\/education\/about-gi-bill-benefits\/"},{"key":"key2024010911251121000_ref043","volume-title":"Preliminary Inventory of the Records of the Reconstruction Finance Corporation, 1932-1964","year":"1973"},{"key":"key2024010911251121000_ref044","doi-asserted-by":"publisher","first-page":"1015","DOI":"10.1109\/ICDAR.2019.00166","article-title":"PubLayNet: largest dataset ever for document layout analysis","year":"2019"}],"container-title":["Journal of Documentation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/JD-03-2023-0055\/full\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/JD-03-2023-0055\/full\/html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T22:33:41Z","timestamp":1753396421000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/jd\/article\/79\/7\/225-239\/194935"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,31]]},"references-count":45,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2023,7,31]]},"published-print":{"date-parts":[[2023,12,18]]}},"alternative-id":["10.1108\/JD-03-2023-0055"],"URL":"https:\/\/doi.org\/10.1108\/jd-03-2023-0055","relation":{},"ISSN":["0022-0418"],"issn-type":[{"value":"0022-0418","type":"print"}],"subject":[],"published":{"date-parts":[[2023,7,31]]}}}