{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T17:57:34Z","timestamp":1770227854081,"version":"3.49.0"},"reference-count":85,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T00:00:00Z","timestamp":1714694400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T00:00:00Z","timestamp":1714694400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000208","name":"Institute of Museum and Library Services","doi-asserted-by":"publisher","award":["LG-37-19-0078-19"],"award-info":[{"award-number":["LG-37-19-0078-19"]}],"id":[{"id":"10.13039\/100000208","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Digit Libr"],"published-print":{"date-parts":[[2024,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Despite the millions of electronic theses and dissertations (ETDs) publicly available online, digital library services for ETDs have not evolved past simple search and browse at the metadata level. We need better digital library services that allow users to discover and explore the content buried in these long documents. Recent advances in machine learning have shown promising results for decomposing documents into their constituent parts, but these models and techniques require data for training and evaluation. In this article, we present high-quality datasets to train, evaluate, and compare machine learning methods in tasks that are specifically suited to identify and extract key elements of ETD documents. 
We explain how we construct the datasets by manually labeling the data or by deriving labeled data through synthetic processes. We demonstrate how our datasets can be used to develop downstream applications and to evaluate, retrain, or fine-tune pre-trained machine learning models. We describe our ongoing work to compile benchmark datasets and exploit machine learning techniques to build intelligent digital libraries for ETDs.<\/jats:p>","DOI":"10.1007\/s00799-024-00395-4","type":"journal-article","created":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T08:02:16Z","timestamp":1714723336000},"page":"175-196","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Building datasets to support information extraction and structure parsing from electronic theses and dissertations"],"prefix":"10.1007","volume":"25","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8307-8844","authenticated-orcid":false,"given":"William A.","family":"Ingram","sequence":"first","affiliation":[]},{"given":"Jian","family":"Wu","sequence":"additional","affiliation":[]},{"given":"Sampanna Yashwant","family":"Kahu","sequence":"additional","affiliation":[]},{"given":"Javaid Akbar","family":"Manzoor","sequence":"additional","affiliation":[]},{"given":"Bipasha","family":"Banerjee","sequence":"additional","affiliation":[]},{"given":"Aman","family":"Ahuja","sequence":"additional","affiliation":[]},{"given":"Muntabir Hasan","family":"Choudhury","sequence":"additional","affiliation":[]},{"given":"Lamia","family":"Salsabil","sequence":"additional","affiliation":[]},{"given":"Winston","family":"Shields","sequence":"additional","affiliation":[]},{"given":"Edward A.","family":"Fox","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,5,3]]},"reference":[{"key":"395_CR1","unstructured":"Artifex: PyMuPDF (2016). 
https:\/\/pymupdf.readthedocs.io\/"},{"issue":"12","key":"395_CR2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1167\/13.12.1","volume":"13","author":"S Barthelm\u00e9","year":"2013","unstructured":"Barthelm\u00e9, S., Trukenbrod, H., Engbert, R., et al.: Modelling fixation locations using spatial point processes. J. Vis. 13(12), 1 (2013). https:\/\/doi.org\/10.1167\/13.12.1","journal-title":"J. Vis."},{"key":"395_CR3","doi-asserted-by":"publisher","unstructured":"Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. In: Inui, K., Jiang, J., Ng, V. et\u00a0al. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3\u20137, 2019 pp 3613\u20133618. Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/D19-1371","DOI":"10.18653\/v1\/D19-1371"},{"key":"395_CR4","unstructured":"Belval, E.: pdf2image (2017). https:\/\/pypi.org\/project\/pdf2image\/"},{"key":"395_CR5","unstructured":"Bengio, Y., Ducharme, R., Vincent, P.: A neural probabilistic language model. In: Leen, T.K., Dietterich, T.G., Tresp, V.: (eds.) Advances in Neural Information Processing Systems 13, Papers from Neural Information Processing Systems (NIPS) 2000, Denver, CO, USA, pp. 932\u2013938. MIT Press (2000). https:\/\/proceedings.neurips.cc\/paper\/2000\/hash\/728f206c2a01bf572b5940d7d9a8fa4c-Abstract.html"},{"key":"395_CR6","unstructured":"Bochkovskiy, A., Wang, C., Liao, H.M.: YOLOv4: optimal speed and accuracy of object detection (2020). arXiv:2004.10934"},{"key":"395_CR7","doi-asserted-by":"publisher","unstructured":"Bojanowski, P., Grave, E., Joulin, A., et\u00a0al.: Enriching word vectors with subword information (2016). 
https:\/\/doi.org\/10.48550\/arXiv.1607.04606","DOI":"10.48550\/arXiv.1607.04606"},{"key":"395_CR8","doi-asserted-by":"publisher","unstructured":"Chacon, I.A., Sosnovsky, S.A.: Expanding the web of knowledge: one textbook at a time. In: Atzenbeck, C., Rubart, J., Millard, D.E. (eds.) Proceedings of the 30th ACM Conference on Hypertext and Social Media, HT 2019, Hof, Germany, September 17\u201320, 2019, pp. 9\u201318. ACM (2019). https:\/\/doi.org\/10.1145\/3342220.3343671","DOI":"10.1145\/3342220.3343671"},{"key":"395_CR9","doi-asserted-by":"publisher","unstructured":"Chacon, I.A., Sosnovsky, S.A.: Order out of chaos: construction of knowledge models from PDF textbooks. In: DocEng \u201920: ACM Symposium on Document Engineering 2020, Virtual Event, CA, USA, September 29\u2013October 1, 2020, pp. 8:1\u20138:10. ACM (2020). https:\/\/doi.org\/10.1145\/3395027.3419585","DOI":"10.1145\/3395027.3419585"},{"issue":"9","key":"395_CR10","doi-asserted-by":"publisher","first-page":"3826","DOI":"10.1109\/TVCG.2021.3054916","volume":"27","author":"J Chen","year":"2021","unstructured":"Chen, J., Ling, M., Li, R., et al.: VIS30K: a collection of figures and tables from IEEE visualization conference publications. IEEE Trans. Visual. Comput. Graph. 27(9), 3826\u20133833 (2021). https:\/\/doi.org\/10.1109\/TVCG.2021.3054916","journal-title":"IEEE Trans. Visual. Comput. Graph."},{"key":"395_CR11","doi-asserted-by":"publisher","unstructured":"Choudhury, M.H., Wu, J., Ingram, W. A., et\u00a0al.: A heuristic baseline method for metadata extraction from scanned electronic theses and dissertations. In: Huang, R., Wu, D., Marchionini, G. et\u00a0al. (eds.) JCDL \u201920: Proceedings of the ACM\/IEEE Joint Conference on Digital Libraries in 2020, Virtual Event, China, August 1\u20135, 2020, pp. 515\u2013516. ACM (2020). 
https:\/\/doi.org\/10.1145\/3383583.3398590","DOI":"10.1145\/3383583.3398590"},{"key":"395_CR12","doi-asserted-by":"publisher","unstructured":"Choudhury, M.H., Jayanetti, H.R., Wu, J., et\u00a0al.: Automatic metadata extraction incorporating visual features from scanned electronic theses and dissertations. In: Downie, J.S., McKay, D., Suleman, H. et\u00a0al. (eds.) ACM\/IEEE Joint Conference on Digital Libraries, JCDL 2021, Champaign, IL, USA, September 27\u201330, 2021, pp. 230\u2013233. IEEE (2021). https:\/\/doi.org\/10.1109\/JCDL52503.2021.00066","DOI":"10.1109\/JCDL52503.2021.00066"},{"key":"395_CR13","doi-asserted-by":"publisher","unstructured":"Choudhury, S.R., Tuarob, S., Mitra, P., et\u00a0al.: A figure search engine architecture for a chemistry digital library. In: 13th ACM\/IEEE-CS Joint Conference on Digital Libraries, JCDL \u201913, Indianapolis, IN, USA, July 22\u201326, 2013, pp. 369\u2013370 (2013). https:\/\/doi.org\/10.1145\/2467696.2467757","DOI":"10.1145\/2467696.2467757"},{"key":"395_CR14","unstructured":"Clark, C.A., Divvala, S.K.: Looking beyond text: extracting figures, tables and captions from computer science papers. In: Scholarly Big Data: AI Perspectives, Challenges, and Ideas, Papers from the 2015 AAAI Workshop, Austin, Texas, USA, January 2015 (2015). http:\/\/aaai.org\/ocs\/index.php\/WS\/AAAIW15\/paper\/view\/10092"},{"key":"395_CR15","doi-asserted-by":"publisher","unstructured":"Clark, C.A., Divvala, S.K.: PDFFigures 2.0: mining figures from research papers. In: Adam, N.R., Cassel L.B., Yesha Y. et\u00a0al. (eds.) Proceedings of the 16th ACM\/IEEE-CS on Joint Conference on Digital Libraries, JCDL 2016, Newark, NJ, USA, June 19\u201323, 2016, pp. 143\u2013152. ACM (2016). 
https:\/\/doi.org\/10.1145\/2910896.2910904","DOI":"10.1145\/2910896.2910904"},{"key":"395_CR16","unstructured":"Cornell: arXiv: a free distribution service and an open-access archive for 2,151,776 scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics (2022). https:\/\/arxiv.org\/"},{"key":"395_CR17","unstructured":"Councill, I., Giles, C.L., Kan, M. Y.: ParsCit: an open-source CRF reference string parsing package. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC\u201908), Marrakech, Morocco (2008). https:\/\/aclanthology.org\/L08-1291\/"},{"key":"395_CR18","doi-asserted-by":"publisher","unstructured":"Devlin, J., Chang, M., Lee, K., et\u00a0al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2\u20137, 2019, Volume 1 (Long and Short Papers), pp. 4171\u20134186. Association for Computational Linguistics(2019). https:\/\/doi.org\/10.18653\/v1\/n19-1423","DOI":"10.18653\/v1\/n19-1423"},{"key":"395_CR19","unstructured":"Dong, L., Yang, N., Wang, W., et al.: Unified language model pre-training for natural language understanding and generation. In: Wallach, H. M., Larochelle, H., Beygelzimer, A. et\u00a0al. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\u201314, 2019, Vancouver, BC, Canada, pp. 13,042\u201313,054 (2019). 
https:\/\/proceedings.neurips.cc\/paper\/2019\/hash\/c20bb2d9a50d5ac1f713f8b34d9aac5a-Abstract.html"},{"key":"395_CR20","doi-asserted-by":"publisher","unstructured":"Dutta, A., Zisserman, A.: The VIA annotation software for images, audio and video. In: Amsaleg, L., Huet, B., Larson, M.A. et\u00a0al. (eds.) Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, October 21\u201325, 2019, pp 2276\u20132279. ACM (2019). https:\/\/doi.org\/10.1145\/3343031.3350535","DOI":"10.1145\/3343031.3350535"},{"key":"395_CR21","unstructured":"Dutta, A., Gupta, A., Zissermann, A.: VGG image annotator (VIA) Version: 2.0.9(2016). http:\/\/www.robots.ox.ac.uk\/~vgg\/software\/via\/"},{"key":"395_CR22","doi-asserted-by":"publisher","unstructured":"Fox, E.A.: How to make intelligent digital libraries. In: Ras, Z.W., Zemankova, M. (eds.) Methodologies for Intelligent Systems, 8th International Symposium, ISMIS \u201994, Charlotte, North Carolina, USA, October 16\u201319, 1994, Proceedings, Lecture Notes in Computer Science, vol 869, pp. 27\u201338. Springer (1994). https:\/\/doi.org\/10.1007\/3-540-58495-1_3","DOI":"10.1007\/3-540-58495-1_3"},{"key":"395_CR23","unstructured":"Gong, M., Wei, X., Oyen, D., et\u00a0al.: Recognizing figure labels in patents. In: Veyseh, A.P.B., Dernoncourt, F., Nguyen, T.H. et\u00a0al. (eds.) Proceedings of the Workshop on Scientific Document Understanding co-located with 35th AAAI Conference on Artificial Inteligence, SDUAAAI 2021, Virtual Event, February 9, 2021, CEUR Workshop Proceedings, vol 2831. CEUR-WS.org (2021). http:\/\/ceur-ws.org\/Vol-2831\/paper11.pdf"},{"key":"395_CR24","doi-asserted-by":"publisher","unstructured":"Han, H., Giles, C.L., Manavoglu, E., et\u00a0al.: Automatic document metadata extraction using support vector machines. In: ACM\/IEEE 2003 Joint Conference on Digital Libraries (JCDL 2003), May 27\u201331 2003, Houston, Texas, USA, Proceedings, pp. 37\u201348. IEEE Computer Society (2003). 
https:\/\/doi.org\/10.1109\/JCDL.2003.1204842","DOI":"10.1109\/JCDL.2003.1204842"},{"key":"395_CR25","doi-asserted-by":"publisher","DOI":"10.3390\/technologies7030065","author":"M Hansen","year":"2019","unstructured":"Hansen, M., Pomp, A., Erki, K., et al.: Data-driven recognition and extraction of PDF document elements. Technologies (2019). https:\/\/doi.org\/10.3390\/technologies7030065","journal-title":"Technologies"},{"key":"395_CR26","doi-asserted-by":"publisher","unstructured":"He, K., Zhang, X., Ren, S., et\u00a0al.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: Fleet, D.J,, Pajdla, T., Schiele, B. et\u00a0al. (eds.) Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, 6\u201312 September 2014, Proceedings, Part III, Lecture Notes in Computer Science, vol 8691, pp. 346\u2013361. Springer (2014). https:\/\/doi.org\/10.1007\/978-3-319-10578-9_23","DOI":"10.1007\/978-3-319-10578-9_23"},{"key":"395_CR27","doi-asserted-by":"publisher","unstructured":"He, K., Zhang, X., Ren, S., et\u00a0al.: Deep residual learning for image recognition (2015). https:\/\/doi.org\/10.48550\/arXiv.1512.03385","DOI":"10.48550\/arXiv.1512.03385"},{"key":"395_CR28","doi-asserted-by":"publisher","unstructured":"Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15\u201320, 2018. Volume 1: Long Papers, pp. 328\u2013339. Association for Computational Linguistics (2018). https:\/\/doi.org\/10.18653\/v1\/P18-1031","DOI":"10.18653\/v1\/P18-1031"},{"key":"395_CR29","doi-asserted-by":"publisher","unstructured":"Ingram, W.A., Banerjee, B., Fox, E.A.: Summarizing ETDs with deep learning. Cadernos BAD (Cadernos de Biblioteconomia, Arquiv\u00edstica e Documenta\u00e7\u00e3o) 1, 46\u201352 (2020). 
https:\/\/doi.org\/10.48798\/cadernosbad.2014","DOI":"10.48798\/cadernosbad.2014"},{"key":"395_CR30","doi-asserted-by":"publisher","unstructured":"Jelinek, F.: Markov Source Modeling of Text Generation, pp. 569\u2013591. Springer, Dordrecht (1985). https:\/\/doi.org\/10.1007\/978-94-009-5113-6_28","DOI":"10.1007\/978-94-009-5113-6_28"},{"key":"395_CR31","unstructured":"Jude, P. M.: Increasing accessibility of electronic theses and dissertations (ETDs) Through Chapter-level Classification. Thesis, Virginia Tech (2020). http:\/\/hdl.handle.net\/10919\/99294"},{"key":"395_CR32","doi-asserted-by":"publisher","unstructured":"Kahu, S., Ingram, W.A., Fox, E.A., et\u00a0al.: SampannaKahu\/ScanBank: v0.2 (2021a). https:\/\/doi.org\/10.5281\/zenodo.4663540","DOI":"10.5281\/zenodo.4663540"},{"key":"395_CR33","doi-asserted-by":"publisher","unstructured":"Kahu, S., Ingram, W.A., Fox, E.A., et al.: The ScanBank Dataset (2021b). https:\/\/doi.org\/10.5281\/zenodo.4663578","DOI":"10.5281\/zenodo.4663578"},{"key":"395_CR34","doi-asserted-by":"publisher","unstructured":"Kahu, S.Y., Ingram, W.A., Fox, E.A., et\u00a0al.: ScanBank: a benchmark dataset for figure extraction from scanned electronic theses and dissertations. In: Downie, J.S., McKay, D., Suleman, H. et\u00a0al. (eds.) ACM\/IEEE Joint Conference on Digital Libraries, JCDL 2021, Champaign, IL, USA, September 27\u201330, 2021, pp. 180\u2013191. IEEE (2021c). https:\/\/doi.org\/10.1109\/JCDL52503.2021.00030","DOI":"10.1109\/JCDL52503.2021.00030"},{"key":"395_CR35","doi-asserted-by":"publisher","DOI":"10.1045\/july2012-kern","author":"R Kern","year":"2012","unstructured":"Kern, R., Jack, K., Hristakeva, M., et al.: TeamBeam\u2014meta-data extraction from scientific literature. D. Lib. Mag. (2012). https:\/\/doi.org\/10.1045\/july2012-kern","journal-title":"D. Lib. 
Mag."},{"issue":"5","key":"395_CR36","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1371\/journal.pone.0093949","volume":"9","author":"M Khabsa","year":"2014","unstructured":"Khabsa, M., Giles, C.L.: The number of scholarly documents on the public web. PLOS ONE 9(5), 1\u20136 (2014). https:\/\/doi.org\/10.1371\/journal.pone.0093949","journal-title":"PLOS ONE"},{"key":"395_CR37","doi-asserted-by":"publisher","unstructured":"Koudas, N., Li, R., Xarchakos, I.: Video monitoring queries. In: 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20\u201324, 2020. IEEE, pp. 1285\u20131296 (2020). https:\/\/doi.org\/10.1109\/ICDE48307.2020.00115","DOI":"10.1109\/ICDE48307.2020.00115"},{"key":"395_CR38","doi-asserted-by":"publisher","unstructured":"Kunze, J.A., Baker, T.: The Dublin Core metadata element set (2007). https:\/\/doi.org\/10.17487\/RFC5013","DOI":"10.17487\/RFC5013"},{"key":"395_CR39","doi-asserted-by":"publisher","unstructured":"Laroca, R., Severo, E., Zanlorensi, L.A., et\u00a0al.: A robust real-time automatic license plate recognition based on the YOLO detector. In: 2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8\u201313, 2018, pp. 1\u201310. IEEE (2018). https:\/\/doi.org\/10.1109\/IJCNN.2018.8489629","DOI":"10.1109\/IJCNN.2018.8489629"},{"key":"395_CR40","doi-asserted-by":"publisher","unstructured":"Lee, B.C.G., Mears, J., Jakeway, E., et\u00a0al.: The newspaper navigator dataset: extracting and analyzing visual content from 16 million historic newspaper pages in chronicling America (2020). https:\/\/doi.org\/10.48550\/arXiv.2005.01583","DOI":"10.48550\/arXiv.2005.01583"},{"key":"395_CR41","unstructured":"Li, M., Cui, L., Huang, S., et\u00a0al.: TableBank: table benchmark for image-based table detection and recognition. In: Calzolari, N., B\u00e9chet, F., Blache, P. et\u00a0al. (eds.) 
Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11\u201316, 2020, pp. 1918\u20131925. European Language Resources Association (2020a). https:\/\/aclanthology.org\/2020.lrec-1.236\/"},{"key":"395_CR42","doi-asserted-by":"publisher","unstructured":"Li, M., Xu, Y., Cui, L., et\u00a0al.: DocBank: a benchmark dataset for document layout analysis. In: Scott, D., Bel, N., Zong, C. (eds.) Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8\u201313, 2020, pp. 949\u2013960. International Committee on Computational Linguistics (2020b). https:\/\/doi.org\/10.18653\/v1\/2020.coling-main.82","DOI":"10.18653\/v1\/2020.coling-main.82"},{"key":"395_CR43","doi-asserted-by":"publisher","unstructured":"Lin, T., Maire, M., Belongie, S.J., et\u00a0al.: Microsoft COCO: common objects in context. In: Fleet, D.J,, Pajdla, T., Schiele, B. et\u00a0al. (eds.) Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6\u201312, 2014, Proceedings, Part V, Lecture Notes in Computer Science, vol. 8693, pp. 740\u2013755. Springer (2014). https:\/\/doi.org\/10.1007\/978-3-319-10602-1_48","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"395_CR44","doi-asserted-by":"publisher","unstructured":"Ling, M., Chen, J., M\u00f6ller, T., et\u00a0al.: Document domain randomization for deep learning document layout extraction (2021). https:\/\/doi.org\/10.48550\/arXiv.2105.14931","DOI":"10.48550\/arXiv.2105.14931"},{"key":"395_CR45","doi-asserted-by":"publisher","unstructured":"Liu, Y., Ott, M., Goyal, N., et\u00a0al.: RoBERTa: a robustly optimized BERT pretraining approach (2019). https:\/\/doi.org\/10.48550\/arXiv.1907.11692","DOI":"10.48550\/arXiv.1907.11692"},{"key":"395_CR46","doi-asserted-by":"publisher","unstructured":"Lo, K., Wang, L.L., Neumann, M., et\u00a0al.: S2ORC: the semantic scholar open research corpus. 
In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4969\u20134983. Association for Computational Linguistics, Online (2020). https:\/\/doi.org\/10.18653\/v1\/2020.acl-main.447","DOI":"10.18653\/v1\/2020.acl-main.447"},{"key":"395_CR47","doi-asserted-by":"publisher","unstructured":"Lopez, P.: GROBID: combining automatic bibliographic data recognition and term extraction for scholarship publications. In: Agosti, M., Borbinha, J., Kapidakis, S. et\u00a0al. (eds.) Research and Advanced Technology for Digital Libraries, 13th European Conference, ECDL 2009, Corfu, Greece, September 27\u2013October 2, 2009. Proceedings, Lecture Notes in Computer Science, vol 5714, pp. 473\u2013474. Springer (2009). https:\/\/doi.org\/10.1007\/978-3-642-04346-8_62","DOI":"10.1007\/978-3-642-04346-8_62"},{"key":"395_CR48","doi-asserted-by":"publisher","unstructured":"Lynch, C.A., Parastatidis, S., Jacobs, N., et\u00a0al.: The OAI-ORE effort: progress, challenges, synergies. In: Rasmussen, E.M., Larson, R.R., Toms, E.G. et\u00a0al. (eds.) ACM\/IEEE Joint Conference on Digital Libraries, JCDL 2007, Vancouver, BC, Canada, June 18\u201323, 2007. Proceedings, p. 80. ACM (2007).https:\/\/doi.org\/10.1145\/1255175.1255190","DOI":"10.1145\/1255175.1255190"},{"key":"395_CR49","doi-asserted-by":"publisher","unstructured":"Mali, P., Kukkadapu, P., Mahdavi, M., et\u00a0al.: ScanSSD: scanning single shot detector for mathematical formulas in PDF document images (2020). https:\/\/doi.org\/10.48550\/arXiv.2003.08005","DOI":"10.48550\/arXiv.2003.08005"},{"key":"395_CR50","unstructured":"Manzoor, J.A.: Segmenting electronic theses and dissertations by chapters. MS thesis, Virginia Tech, Computer Science, defended September 23, 2022 (2022). 
http:\/\/hdl.handle.net\/10919\/113246"},{"issue":"3","key":"395_CR51","doi-asserted-by":"publisher","first-page":"1931","DOI":"10.1007\/s11192-018-2921-5","volume":"117","author":"Z Nasar","year":"2018","unstructured":"Nasar, Z., Jaffry, S.W., Malik, M.K.: Information extraction from scientific articles: a survey. Scientometrics 117(3), 1931\u20131990 (2018). https:\/\/doi.org\/10.1007\/s11192-018-2921-5","journal-title":"Scientometrics"},{"key":"395_CR52","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: EMNLP, pp. 1532\u20131543 (2014). https:\/\/nlp.stanford.edu\/projects\/glove\/","DOI":"10.3115\/v1\/D14-1162"},{"key":"395_CR53","unstructured":"Perez, L., Wang, J: The effectiveness of data augmentation in image classification using deep learning (2017). arXiv:1712.04621"},{"issue":"4","key":"395_CR54","doi-asserted-by":"publisher","first-page":"323","DOI":"10.1007\/s00799-018-0242-1","volume":"19","author":"A Prasad","year":"2018","unstructured":"Prasad, A., Kaur, M., Kan, M.Y.: Neural ParsCit: a deep learning-based reference string parser. Int. J. Digit. Libr. 19(4), 323\u2013337 (2018). https:\/\/doi.org\/10.1007\/s00799-018-0242-1","journal-title":"Int. J. Digit. Libr."},{"key":"395_CR55","doi-asserted-by":"publisher","unstructured":"Rausch, J., Martinez, O., Bissig, F., et\u00a0al.: DocParser: Hierarchical document structure parsing from renderings. In: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2\u20139, 2021, pp. 4328\u20134338. AAAI Press (2021). 
https:\/\/doi.org\/10.1609\/aaai.v35i5.16558","DOI":"10.1609\/aaai.v35i5.16558"},{"key":"395_CR56","doi-asserted-by":"publisher","unstructured":"Redmon, J., Divvala, S.K., Girshick, R.B., et\u00a0al.: You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27\u201330, 2016, pp. 779\u2013788. IEEE Computer Society (2016). https:\/\/doi.org\/10.1109\/CVPR.2016.91","DOI":"10.1109\/CVPR.2016.91"},{"key":"395_CR57","doi-asserted-by":"publisher","DOI":"10.3390\/f9050272","author":"Z Ren","year":"2018","unstructured":"Ren, Z., He, X., Zheng, H., et al.: Spatio-temporal patterns of urban forest basal area under China\u2019s rapid urban expansion and greening: Implications for urban green infrastructure management. Forests (2018). https:\/\/doi.org\/10.3390\/f9050272","journal-title":"Forests"},{"issue":"3","key":"395_CR58","doi-asserted-by":"publisher","first-page":"3085","DOI":"10.1007\/s11192-020-03382-z","volume":"125","author":"T Saier","year":"2020","unstructured":"Saier, T., F\u00e4rber, M.: unarXive: a large scholarly data set with publications\u2019 full-text, annotated in-text citations, and links to metadata. Scientometrics 125(3), 3085\u20133108 (2020). https:\/\/doi.org\/10.1007\/s11192-020-03382-z","journal-title":"Scientometrics"},{"key":"395_CR59","doi-asserted-by":"publisher","unstructured":"Salsabil, L., Wu, J., Choudhury, M.H., et\u00a0al.: A Study of Computational Reproducibility using URLs Linking to Open Access Datasets and Software. In: Laforest, F., Troncy, R., Simperl, E. et\u00a0al. (eds.) Companion of The Web Conference 2022, Virtual Event \/ Lyon, France, April 25\u201329, 2022, pp. 784\u2013788. ACM (2022). 
https:\/\/doi.org\/10.1145\/3487553.3524658","DOI":"10.1145\/3487553.3524658"},{"key":"395_CR60","doi-asserted-by":"publisher","unstructured":"Sermanet, P., Eigen, D., Zhang, X., et\u00a0al.: OverFeat: integrated recognition, localization and detection using convolutional networks. In: Bengio, Y., LeCun, Y. (eds.) 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14\u201316, 2014. Conference Track Proceedings (2014). https:\/\/doi.org\/10.48550\/arXiv.1312.6229","DOI":"10.48550\/arXiv.1312.6229"},{"key":"395_CR61","unstructured":"Seymore, K., Mccallum, A., Rosenfeld, R.: Learning hidden Markov model structure for information extraction. In: AAAI \u201999 Workshop on Machine Learning for Information Extraction (1999a). https:\/\/www.aaai.org\/Papers\/Workshops\/1999\/WS-99-11\/WS99-11-007.pdf"},{"key":"395_CR62","unstructured":"Seymore, K., Mccallum, A., Rosenfeld, R.: Learning hidden markov model structure for information extraction. In: AAAI \u201999 Workshop on Machine Learning for Information Extraction (1999b)"},{"key":"395_CR63","doi-asserted-by":"publisher","unstructured":"Shah, A.K., Dey, A., Zanibbi, R.: A Math Formula Extraction and Evaluation Framework for PDF Documents. In: Document Analysis and Recognition\u2014ICDAR2021: 16th International Conference, Lausanne, Switzerland, September 5\u201310, 2021, pp. 19\u201334, Proceedings, Part II. Springer-Verlag, Berlin, Heidelberg, (2021) https:\/\/doi.org\/10.1007\/978-3-030-86331-9_2","DOI":"10.1007\/978-3-030-86331-9_2"},{"key":"395_CR64","doi-asserted-by":"publisher","unstructured":"Siegel, N., Lourie, N., Power, R., et\u00a0al: Extracting scientific figures with distantly supervised neural networks. In: Chen, J., Gon\u00e7alves, M.A., Allen, J.M. et\u00a0al. (eds.) Proceedings of the 18th ACM\/IEEE on Joint Conference on Digital Libraries, JCDL 2018, Fort Worth, TX, USA, June 3\u20137, 2018, pp. 223\u2013232. 
ACM (2018).https:\/\/doi.org\/10.1145\/3197026.3197040","DOI":"10.1145\/3197026.3197040"},{"key":"395_CR65","doi-asserted-by":"publisher","unstructured":"Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). https:\/\/doi.org\/10.48550\/ARXIV.1409.1556","DOI":"10.48550\/ARXIV.1409.1556"},{"key":"395_CR66","unstructured":"Singer-Vine, J., Jain, S.: PDFPlumber (2022). https:\/\/github.com\/jsvine\/pdfplumber"},{"key":"395_CR67","doi-asserted-by":"publisher","DOI":"10.1045\/january2003-smith","author":"M Smith","year":"2003","unstructured":"Smith, M., Barton, M., Branschofsky, M., et al.: DSpace: an open source dynamic digital repository. D-Lib Mag. (2003). https:\/\/doi.org\/10.1045\/january2003-smith","journal-title":"D-Lib Mag."},{"key":"395_CR68","doi-asserted-by":"publisher","unstructured":"Smith, R.: An Overview of the Tesseract OCR Engine. In: 9th International Conference on Document Analysis and Recognition (ICDAR 2007), September 23\u201326, 2007, Curitiba, Paran\u00e1, Brazil, pp. 629\u2013633. IEEE Computer Society (2007). https:\/\/doi.org\/10.1109\/ICDAR.2007.4376991","DOI":"10.1109\/ICDAR.2007.4376991"},{"key":"395_CR69","unstructured":"Solawetz, J.: YOLOv5 New Version\u2014Improvements And Evaluation (2020). https:\/\/blog.roboflow.com\/yolov5-improvements-and-evaluation\/"},{"key":"395_CR70","doi-asserted-by":"publisher","unstructured":"Song, F., Croft, W.B.: A general language model for information retrieval. In: Proceedings of the 1999 ACM CIKM International Conference on Information and Knowledge Management, Kansas City, Missouri, USA, November 2\u20136, 1999, pp. 316\u2013321. ACM (1999). https:\/\/doi.org\/10.1145\/319950.320022","DOI":"10.1145\/319950.320022"},{"key":"395_CR71","unstructured":"Taira, R.K., Soderland, S.G.: A statistical natural language processor for medical reports. In: Proceedings AMIA Symposium, pp. 970\u2013974 (1999). 
https:\/\/pubmed.ncbi.nlm.nih.gov\/10566505"},{"key":"395_CR72","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4684-0510-1","author":"MA Tanner","year":"2012","unstructured":"Tanner, M.A.: Tools for statistical inference: observed data and data augmentation methods, vol 67. Springer Science & Business Media (2012). https:\/\/doi.org\/10.1007\/978-1-4684-0510-1","journal-title":"Springer Science & Business Media"},{"key":"395_CR73","doi-asserted-by":"publisher","DOI":"10.1045\/november14-tkaczyk","author":"D Tkaczyk","year":"2014","unstructured":"Tkaczyk, D., Szostek, P., Bolikowski, L.: GROTOAP2 - the methodology of creating a large ground truth dataset of scientific articles. D. Lib. Mag. (2014). https:\/\/doi.org\/10.1045\/november14-tkaczyk","journal-title":"D. Lib. Mag."},{"issue":"4","key":"395_CR74","doi-asserted-by":"publisher","first-page":"317","DOI":"10.1007\/s10032-015-0249-8","volume":"18","author":"D Tkaczyk","year":"2015","unstructured":"Tkaczyk, D., Szostek, P., Fedoryszak, M., et al.: CERMINE: automatic extraction of structured metadata from scientific literature. Int. J. Document Anal. Recognit. (IJDAR) 18(4), 317\u2013335 (2015). https:\/\/doi.org\/10.1007\/s10032-015-0249-8","journal-title":"Int. J. Document Anal. Recognit. (IJDAR)"},{"key":"395_CR75","doi-asserted-by":"publisher","unstructured":"Uddin, M.S.: TransParsCit: a transformer-based citation parser trained on large-scale synthesized data. Master of Science Thesis, Old Dominion University (2022). https:\/\/doi.org\/10.25777\/qrv9-m891","DOI":"10.25777\/qrv9-m891"},{"key":"395_CR76","doi-asserted-by":"publisher","unstructured":"Uddin, S., Banerjee, B., Wu, J., et\u00a0al.: Building A large collection of multi-domain electronic theses and dissertations. In: Chen, Y., Ludwig, H., Tu, Y. et\u00a0al. (eds.) 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, December 15\u201318, 2021, pp. 6043\u20136045. IEEE (2021). 
https:\/\/doi.org\/10.1109\/BigData52589.2021.9672058","DOI":"10.1109\/BigData52589.2021.9672058"},{"key":"395_CR77","unstructured":"Ultralytics YOLOv5 (2020). https:\/\/github.com\/ultralytics\/yolov5"},{"key":"395_CR78","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., et\u00a0al.: Attention is all you need. In: Guyon, I., von Luxburg, U., Bengio, S. et\u00a0al. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4\u20139, 2017, Long Beach, CA, USA, pp. 5998\u20136008 (2017). https:\/\/proceedings.neurips.cc\/paper\/2017\/hash\/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html"},{"key":"395_CR79","doi-asserted-by":"publisher","unstructured":"Wang, C., Liao, H.M., Wu, Y., et\u00a0al.: CSPNet: a new backbone that can enhance learning capability of CNN. In: 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, June 14\u201319 2020 IEEE, pp. 1571\u20131580 (2020). https:\/\/doi.org\/10.1109\/CVPRW50498.2020.00203","DOI":"10.1109\/CVPRW50498.2020.00203"},{"key":"395_CR80","doi-asserted-by":"publisher","unstructured":"Wang, K., Liew, J.H., Zou, Y., et\u00a0al.: PANet: Few-shot image semantic segmentation with prototype alignment. In: 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27\u2013November 2, 2019, pp. 9196\u20139205. IEEE (2019). https:\/\/doi.org\/10.1109\/ICCV.2019.00929","DOI":"10.1109\/ICCV.2019.00929"},{"key":"395_CR81","doi-asserted-by":"publisher","DOI":"10.1045\/december2000-weibel","author":"SL Weibel","year":"2000","unstructured":"Weibel, S.L., Koch, T.: Dublin Core Metadata Initiative: Mission, current activities, and future directions. D. Lib. Mag. (2000). https:\/\/doi.org\/10.1045\/december2000-weibel","journal-title":"D. Lib. 
Mag."},{"key":"395_CR82","doi-asserted-by":"publisher","unstructured":"Wu, J., Sefid, A., Ge, A.C., et\u00a0al.: A supervised learning approach to entity matching between scholarly big datasets. In: Corcho \u00d3, Janowicz, K., Rizzo, G. et\u00a0al. (eds.) Proceedings of the Knowledge Capture Conference, K-CAP 2017, Austin, TX, USA, December 4\u20136, 2017, pp. 41:1\u201341:4. ACM (2017). https:\/\/doi.org\/10.1145\/3148011.3154470","DOI":"10.1145\/3148011.3154470"},{"key":"395_CR83","doi-asserted-by":"publisher","unstructured":"Xu, Y., Li, M., Cui, L., et\u00a0al.: LayoutLM: Pre-training of text and layout for document image understanding. In: Gupta, R., Liu, Y., Tang, J. et\u00a0al. (eds.) KDD \u201920: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23\u201327, 2020, pp. 1192\u20131200. ACM (2020). https:\/\/doi.org\/10.1145\/3394486.3403172","DOI":"10.1145\/3394486.3403172"},{"issue":"4","key":"395_CR84","doi-asserted-by":"publisher","first-page":"331","DOI":"10.1007\/s10032-011-0174-4","volume":"15","author":"R Zanibbi","year":"2012","unstructured":"Zanibbi, R., Blostein, D.: Recognition and retrieval of mathematical expressions. Int. J. Document Anal. Recognit. 15(4), 331\u2013357 (2012). https:\/\/doi.org\/10.1007\/s10032-011-0174-4","journal-title":"Int. J. Document Anal. Recognit."},{"key":"395_CR85","doi-asserted-by":"crossref","unstructured":"Zhong, X., Tang, J., Jimeno-Yepes, A.: PubLayNet: largest dataset ever for document layout analysis (2019). 
https:\/\/doi.org\/10.48550\/arXiv.1908.07836","DOI":"10.1109\/ICDAR.2019.00166"}],"container-title":["International Journal on Digital Libraries"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00799-024-00395-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00799-024-00395-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00799-024-00395-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,26]],"date-time":"2024-06-26T06:11:00Z","timestamp":1719382260000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00799-024-00395-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,3]]},"references-count":85,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,6]]}},"alternative-id":["395"],"URL":"https:\/\/doi.org\/10.1007\/s00799-024-00395-4","relation":{},"ISSN":["1432-5012","1432-1300"],"issn-type":[{"value":"1432-5012","type":"print"},{"value":"1432-1300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,3]]},"assertion":[{"value":"31 March 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 January 2024","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 January 2024","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 May 2024","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}