{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T02:47:01Z","timestamp":1760150821842,"version":"build-2065373602"},"reference-count":35,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2022,1,17]],"date-time":"2022-01-17T00:00:00Z","timestamp":1642377600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country\u2019s history. Currently, most Portuguese archives have made their finding aids available to the public in digital format, however, these data do not have any annotation, so it is not always easy to analyze their content. In this work, Named Entity Recognition solutions were created that allow the identification and classification of several named entities from the archival finding aids. These named entities translate into crucial information about their context and, with high confidence results, they can be used for several purposes, for example, the creation of smart browsing tools by using entity linking and record linking techniques. In order to achieve high result scores, we annotated several corpora to train our own Machine Learning algorithms in this context domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made available to the public with a developed web platform, NER@DI.<\/jats:p>","DOI":"10.3390\/make4010003","type":"journal-article","created":{"date-parts":[[2022,1,17]],"date-time":"2022-01-17T08:20:42Z","timestamp":1642407642000},"page":"42-65","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["NER in Archival Finding Aids: Extended"],"prefix":"10.3390","volume":"4","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1365-0080","authenticated-orcid":false,"given":"Lu\u00eds Filipe da Costa","family":"Cunha","sequence":"first","affiliation":[{"name":"Department of Informatics, University of Minho, 4710-057 Braga, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8574-1574","authenticated-orcid":false,"given":"Jos\u00e9 Carlos","family":"Ramalho","sequence":"additional","affiliation":[{"name":"Department of Informatics, University of Minho, 4710-057 Braga, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2022,1,17]]},"reference":[{"key":"ref_1","first-page":"127","article-title":"HITEX: Um Sistema em Desenvolvimento para Historiadores e Arquivistas","volume":"23","author":"Oliveira","year":"1992","journal-title":"Forum"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1016\/j.ipm.2014.10.006","article-title":"Analysis of named entity recognition and linking for tweets","volume":"51","author":"Derczynski","year":"2015","journal-title":"Inf. Process. Manag."},{"key":"ref_3","unstructured":"Bellot, P., Crestan, E., El-B\u00e8ze, M., Gillard, L., and de Loupy, C. (2021, November 30). Coupling Named Entity Recognition, Vector-Space Model and Knowledge Bases for TREC 11 Question Answering Track. Available online: https:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.14.7561&rep=rep1&type=pdf."},{"key":"ref_4","unstructured":"Santos, D., Seco, N., Cardoso, N., and Vilela, R. (2006, January 22\u201328). HAREM: An Advanced NER Evaluation Contest for Portuguese. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC\u201906), Genoa, Italy."},{"key":"ref_5","unstructured":"Freitas, C., Mota, C., Santos, D., Oliveira, H.G., and Carvalho, P. (2010, January 17\u201323). Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC\u201910), Valletta, Malta."},{"key":"ref_6","unstructured":"Carvalho, P., Oliveira, H.G., Santos, D., Freitas, C., and Mota, C. (2020, November 16). Segundo HAREM: Modelo geral, novidades e avalia\u00e7\u00e3o. Available online: https:\/\/www.researchgate.net\/publication\/236445432_Segundo_HAREM_Modelo_geral_novidades_e_avaliacao."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Vieira, R., Quaresma, P., Nunes, M.d.G.V., Mamede, N.J., Oliveira, C., and Dias, M.C. (2006). A Golden Resource for Named Entity Recognition in Portuguese. Computational Processing of the Portuguese Language, Springer.","DOI":"10.1007\/11751984"},{"key":"ref_8","unstructured":"Pires, A.R.O. (2017). Named Entity Extraction from Portuguese Web Text. [Master\u2019s Thesis, Faculdade de Engenharia da Universidade do Porto]."},{"key":"ref_9","unstructured":"Ferreira, J., Oliveira, H.G., and Rodrigues, R. (2020, November 16). NLPyPort: Named Entity Recognition with CRF and Rule-Based Relation Extraction. IberLEF@SEPLN. Available online: http:\/\/ceur-ws.org\/Vol-2421\/NER_Portuguese_paper_7.pdf."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Sekine, S., and Ranchhod, E. (2009). Named Entities: Recognition, Classification and Use, John Benjamins Publishing Company.","DOI":"10.1075\/bct.19"},{"key":"ref_11","unstructured":"Ingersoll, G.S., Morton, T.S., and Farris, A.L. (2013). Taming Text: How to Find, Organize, and Manipulate It, Manning. OCLC: ocn772977853."},{"key":"ref_12","unstructured":"Rodrigues, A.M., Guimar\u00e3es, C., Barbedo, F., Santos, G., Runa, L., and Penteado, P. (2020, October 17). Orienta\u00e7\u00f5es para a Descri\u00e7\u00e3o Arquiv\u00edstica, Available online: https:\/\/arquivos.dglab.gov.pt\/wp-content\/uploads\/sites\/16\/2013\/10\/oda1-2-3.pdf."},{"key":"ref_13","unstructured":"Lagoze, C., Van de Sompel, H., Nelson, M., and Warner, S. (2020, October 14). Open Archives Initiative\u2014Protocol for Metadata Harvesting\u2014V.2.0. Available online: http:\/\/www.openarchives.org\/OAI\/openarchivesprotocol.html."},{"key":"ref_14","unstructured":"OpenNLP, A. (2020, October 18). Welcome to Apache OpenNLP. Available online: https:\/\/opennlp.apache.org\/."},{"key":"ref_15","unstructured":"Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning (Adaptive Computation and Machine Learning Series), The MIT Press."},{"key":"ref_16","unstructured":"Manning, C.D., and Sch\u00fctze, H. (1999). Foundations of Statistical Natural Language Processing, The MIT Press."},{"key":"ref_17","unstructured":"Ratnaparkhi, A. (1998). Maximum Entropy Models for Natural Language Ambiguity Resolution. [Ph.D. Thesis, University of Pennsylva]."},{"key":"ref_18","unstructured":"Manning, C. (2021, January 22). MaxentModels and Discriminative Estimation. Available online: https:\/\/web.stanford.edu\/class\/cs124\/lec\/Maximum_Entropy_Classifiers.pdf."},{"key":"ref_19","first-page":"39","article-title":"A Maximum Entropy Approach to Natural Language Processing","volume":"22","author":"Berger","year":"1996","journal-title":"Comput. Linguist."},{"key":"ref_20","unstructured":"Morais, M. (2020, October 20). NEU 560: Statistical Modeling and Analysis of Neural Data: Lecture 8: Information Theory and Maximum Entropy. Available online: http:\/\/pillowlab.princeton.edu\/teaching\/statneuro2018\/slides\/notes08_infotheory.pdf."},{"key":"ref_21","unstructured":"(2021, January 14). spaCy. Model Architecture. Available online: https:\/\/spacy.io\/models."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016, January 12\u201317). Neural architectures for named entity recognition. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016, San Diego, CA, USA.","DOI":"10.18653\/v1\/N16-1030"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Graves, A., Mohamed, A.R., and Hinton, G. (2013, January 26\u201331). Speech recognition with deep recurrent neural networks. Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing\u2014Proceedings, Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6638947"},{"key":"ref_24","unstructured":"Olah, C. (2021, March 10). Understanding Lstm Networks. Available online: http:\/\/colah.github.io\/posts\/2015-08-Understanding-LSTMs."},{"key":"ref_25","unstructured":"Huang, Z., Xu, W., and Yu, K. (2015). Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Ma, X., and Hovy, E. (2016, January 7\u201312). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016\u2014Long Papers, Berlin, Germany.","DOI":"10.18653\/v1\/P16-1101"},{"key":"ref_27","unstructured":"Cunha, L.F., and Ramalho, J.C. (2021, October 09). NER@DI. Available online: http:\/\/ner.epl.di.uminho.pt."},{"key":"ref_28","unstructured":"Community, A.O.D. (2021, April 10). Apache OpenNLP Developer Documentation. Available online: https:\/\/opennlp.apache.org\/docs\/1.8.2\/manual\/opennlp."},{"key":"ref_29","unstructured":"(2021, January 14). spaCy. Training spaCy\u2019s Statistical Models \u00b7 spaCy Usage Documentation. Available online: https:\/\/spacy.io\/usage\/training."},{"key":"ref_30","unstructured":"Rademaker, A., Chalub, F., Real, L., Freitas, C., Bick, E., and de Paiva, V. (2017, January 18\u201320). Universal Dependencies for Portuguese. Proceedings of the Fourth International Conference on Dependency Linguistics (Depling), Pisa, Italy."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"151","DOI":"10.1016\/j.artint.2012.03.006","article-title":"Learning multilingual named entity recognition from Wikipedia","volume":"194","author":"Nothman","year":"2017","journal-title":"Artif. Intell."},{"key":"ref_32","unstructured":"Derczynski, L. (2016, January 23\u201328). Complementarity, F-score, and NLP evaluation. Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, Portoro\u017e, Slovenia."},{"key":"ref_33","unstructured":"(2021, April 10). Express\u2014Node.js Web Application Framework, n.d.. Available online: https:\/\/expressjs.com."},{"key":"ref_34","unstructured":"(2021, March 17). Node.js v16.4.0 Documentation, n.d.. Available online: https:\/\/nodejs.org\/api\/child_process.html."},{"key":"ref_35","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., \u0141ukasz, K., and Polosukhin, I. (2021, September 11). Attention Is All You Need. Available online: https:\/\/arxiv.org\/abs\/1706.03762."}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/4\/1\/3\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:02:25Z","timestamp":1760133745000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/4\/1\/3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,1,17]]},"references-count":35,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2022,3]]}},"alternative-id":["make4010003"],"URL":"https:\/\/doi.org\/10.3390\/make4010003","relation":{},"ISSN":["2504-4990"],"issn-type":[{"type":"electronic","value":"2504-4990"}],"subject":[],"published":{"date-parts":[[2022,1,17]]}}}