{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T01:44:47Z","timestamp":1773020687999,"version":"3.50.1"},"reference-count":41,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2020,3,27]],"date-time":"2020-03-27T00:00:00Z","timestamp":1585267200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"FEDER - Lisbon 2020","award":["POCI-01-0247-FEDER-038539"],"award-info":[{"award-number":["POCI-01-0247-FEDER-038539"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Applied Sciences"],"abstract":"<jats:p>The process of protecting sensitive data is continually growing and becoming increasingly important, especially as a result of the directives and laws imposed by the European Union. The effort to create automatic systems is continuous, but, in most cases, the processes behind them are still manual or semi-automatic. In this work, we have developed a component that can extract and classify sensitive data from unstructured text in European Portuguese. The objective was to create a system that allows organizations to understand their data and meet legal and security requirements. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques such as rule-based\/lexical-based models, machine learning algorithms, and neural networks. The rule-based and lexical-based approaches were used only for a set of specific classes. For the remaining classes of entities, two statistical models were tested (Conditional Random Fields and Random Forest) and, finally, a Bidirectional-LSTM approach was evaluated. Regarding the statistical models, Conditional Random Fields obtained the best results, with an F1-score of 65.50%. 
With the Bi-LSTM approach, we have achieved a result of 83.01%. The corpora used for training and testing were HAREM Golden Collection, SIGARRA News Corpus, and DataSense NER Corpus.<\/jats:p>","DOI":"10.3390\/app10072303","type":"journal-article","created":{"date-parts":[[2020,3,31]],"date-time":"2020-03-31T13:27:19Z","timestamp":1585661239000},"page":"2303","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":29,"title":["Named Entity Recognition for Sensitive Data Discovery in Portuguese"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6563-9858","authenticated-orcid":false,"given":"Mariana","family":"Dias","sequence":"first","affiliation":[{"name":"Inov Inesc Inova\u00e7\u00e3o\u2014Instituto De Novas Tecnologias, 1000-029 Lisbon, Portugal"},{"name":"ISTAR-IUL, Instituto Universit\u00e1rio de Lisboa (ISCTE-IUL), 1649-026 Lisboa, Portugal"}]},{"given":"Jo\u00e3o","family":"Bon\u00e9","sequence":"additional","affiliation":[{"name":"Inov Inesc Inova\u00e7\u00e3o\u2014Instituto De Novas Tecnologias, 1000-029 Lisbon, Portugal"},{"name":"ISTAR-IUL, Instituto Universit\u00e1rio de Lisboa (ISCTE-IUL), 1649-026 Lisboa, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6662-0806","authenticated-orcid":false,"given":"Jo\u00e3o C.","family":"Ferreira","sequence":"additional","affiliation":[{"name":"Inov Inesc Inova\u00e7\u00e3o\u2014Instituto De Novas Tecnologias, 1000-029 Lisbon, Portugal"},{"name":"ISTAR-IUL, Instituto Universit\u00e1rio de Lisboa (ISCTE-IUL), 1649-026 Lisboa, Portugal"}]},{"given":"Ricardo","family":"Ribeiro","sequence":"additional","affiliation":[{"name":"ISTAR-IUL, Instituto Universit\u00e1rio de Lisboa (ISCTE-IUL), 1649-026 Lisboa, Portugal"},{"name":"INESC-ID Lisboa, 1000-029 Lisbon, Portugal"}]},{"given":"Rui","family":"Maia","sequence":"additional","affiliation":[{"name":"Inov Inesc Inova\u00e7\u00e3o\u2014Instituto De Novas Tecnologias, 1000-029 
Lisbon, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2020,3,27]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Dale, R., Moisl, H., and Somers, H. (2000). Handbook of Natural Language Processing, CRC Press.","DOI":"10.1201\/9780824746346"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1075\/li.30.1.03nad","article-title":"A survey of named entity recognition and classification","volume":"30","author":"Nadeau","year":"2007","journal-title":"Lingvist. Investig."},{"key":"ref_3","unstructured":"Dias, M., Maia, R., Ferreira, J., Ribeiro, R., and Martins, A. (2019, January 11\u201312). DataSense Platform. Proceedings of the IASTEM\u2014586th International Conference on Science Technology and Management (ICSTM), Bandar Seri Begawan, Brunei."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Clough, P. (2005, January 4). Extracting metadata for spatially-aware information retrieval on the internet. Proceedings of the 2005 Workshop on Geographic Information Retrieval, Bremen, Germany.","DOI":"10.1145\/1096985.1096992"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Korba, L., Wang, Y., Geng, L., Song, R., Yee, G., Patrick, A.S., Buffett, S., Liu, H., and You, Y. (2008). Private data discovery for privacy compliance in collaborative environments. International Conference on Cooperative Design, Visualization and Engineering, Springer.","DOI":"10.1007\/978-3-540-88011-0_18"},{"key":"ref_6","first-page":"65","article-title":"Empirical methods in information extraction","volume":"18","author":"Cardie","year":"1997","journal-title":"AI Mag."},{"key":"ref_7","unstructured":"Ciravegna, F. (2001, January 4\u201310). 2, an adaptive algorithm for information extraction from web-related texts. Proceedings of the IJCAI\u20142001 Workshop on Adaptive Text Extraction and Mining, Seattle, WA, USA."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Grishman, R., and Sundheim, B. 
(1995, January 6\u20138). Design of the MUC-6 evaluation. Proceedings of the 6th Conference on Message Understanding, Columbia, MD, USA.","DOI":"10.3115\/1072399.1072401"},{"key":"ref_9","first-page":"13","article-title":"An overview of empirical natural language processing","volume":"18","author":"Brill","year":"1997","journal-title":"AI Mag."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Mikheev, A., Moens, M., and Grover, C. (1999, January 8\u201312). Named entity recognition without gazetteers. Proceedings of the Ninth Conference on European chapter of the Association for Computational Linguistics, Bergen, Norway.","DOI":"10.3115\/977035.977037"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1126","DOI":"10.14778\/2536222.2536237","article-title":"Entity extraction, linking, classification, and tagging for social media: A wikipedia-based approach","volume":"6","author":"Gattani","year":"2013","journal-title":"Proc. VLDB Endow."},{"key":"ref_12","unstructured":"Torisawa, K. (2008, January 15\u201320). Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. Proceedings of the ACL-08: HLT, Columbus, OH, USA."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Zhou, G., and Su, J. (2002, January 7\u201312). Named entity recognition using an HMM-based chunk tagger. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073163"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Finkel, J.R., Grenager, T., and Manning, C. (2005, January 25\u201330). Incorporating non-local information into information extraction systems by gibbs sampling. 
Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Ann Arbor, MI, USA.","DOI":"10.3115\/1219840.1219885"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition. arXiv.","DOI":"10.18653\/v1\/N16-1030"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"De Castro, P.V.Q., da Silva, N.F.F., and da Silva Soares, A. (2018). Portuguese Named Entity Recognition Using LSTM-CRF. International Conference on Computational, Processing of the Portuguese Language, Springer.","DOI":"10.1007\/978-3-319-99722-3_9"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1162\/tacl_a_00104","article-title":"Named entity recognition with bidirectional LSTM-CNNs","volume":"4","author":"Chiu","year":"2016","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_18","unstructured":"Mota, C., Santos, D., and Ranchhod, E. (2006, January 13\u201317). Avalia\u00e7\u00e3o de reconhecimento de entidades mencionadas: Princ\u00edpio de AREM. Proceedings of the Avalia\u00e7\u00e3o Conjunta: Um Novo Paradigma no Processamento Computacional da l\u00edngua Portuguesa, Computational Processing of the Portuguese Language: 7th International Workshop. PROPOR, Itatiaia, Brazil."},{"key":"ref_19","first-page":"7","article-title":"Preprocessing Techniques for Text Mining","volume":"5","author":"Kannan","year":"2014","journal-title":"Int. J. Comput. Sci. Commun. Networks"},{"key":"ref_20","unstructured":"Paz Suarez Araujo, C. (2002). Florestasint\u00e1(c)tica: A treebank for Portuguese. In quot. Manuel Gonz\u00e1lez Rodrigues, Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas de Gran Canaria, Spain, 29\u201331 May, 2002, ELRA."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Arnold, T. (2017). 
A tidy data model for natural language processing using cleannlp. arXiv.","DOI":"10.32614\/RJ-2017-035"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Ramshaw, L.A., and Marcus, M.P. (1999). Text chunking using transformation-based learning. Natural Language Processing Using Very Large Corpora, Springer.","DOI":"10.1007\/978-94-017-2390-9_10"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Ratinov, L., and Roth, D. (2009, January 4). Design challenges and misconceptions in named entity recognition. Proceedings of the Thirteenth Conference on Computational Natural Language Learning, Boulder, CO, USA.","DOI":"10.3115\/1596374.1596399"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Santos, D., and Cardoso, N. (2006). A golden resource for named entity recognition in Portuguese. International Workshop on Computational, Processing of the Portuguese Language, Springer.","DOI":"10.1007\/11751984_8"},{"key":"ref_25","unstructured":"M\u00f4ro, D.K. (2018). Reconhecimento de Entidades Nomeadas em Documentos de L\u00edngua Portuguesa, TCC- Universidade Federal de Santa Catarina Ararangu\u00e1, Tecnologias de Informa\u00e7\u00e3o e Comunica\u00e7\u00e3o."},{"key":"ref_26","first-page":"1030","article-title":"Phrase clustering for discriminative learning","volume":"Volume 2","author":"Lin","year":"2009","journal-title":"Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"151","DOI":"10.1016\/j.artint.2012.03.006","article-title":"Learning multilingual named entity recognition from Wikipedia","volume":"194","author":"Nothman","year":"2013","journal-title":"Artif. 
Intell."},{"key":"ref_28","first-page":"41","article-title":"NERP-CRF: Uma ferramenta para o reconhecimento de entidades nomeadas por meio de Conditional Random Fields","volume":"6","author":"Vieira","year":"2014","journal-title":"Linguam\u00e1tica"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"188","DOI":"10.3115\/1119176.1119206","article-title":"Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons","volume":"Volume 4","author":"McCallum","year":"2003","journal-title":"Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Sears, T.D., and Sunehag, P. (2007). Induced semantics for undirected graphs: Another look at the Hammersley-Clifford theorem. AIP Conference Proceedings, American Institute of Physics.","DOI":"10.1063\/1.2821254"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Jin, N. (2015, January 31). Ncsu-sas--Ning: Candidate generation and feature engineering for supervised lexical normalization. Proceedings of the Workshop on Noisy User-Generated Text, Beijing, China.","DOI":"10.18653\/v1\/W15-4313"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"W339","DOI":"10.1093\/nar\/gkm368","article-title":"MiPred: Classification of real and pseudo microRNA precursors using random forest prediction model with combined features","volume":"35","author":"Jiang","year":"2007","journal-title":"Nucleic Acids Res."},{"key":"ref_33","unstructured":"Yadav, V., and Bethard, S. (2019). A survey on recent advances in named entity recognition from deep learning models. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1177\/0361198106196400121","article-title":"Arriving-on-time problem: Discrete algorithm that ensures convergence","volume":"1964","author":"Nie","year":"2006","journal-title":"Transp. Res. 
Rec."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"2673","DOI":"10.1109\/78.650093","article-title":"Bidirectional recurrent neural networks","volume":"45","author":"Schuster","year":"1997","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Athiwaratkun, B., Wilson, A.G., and Anandkumar, A. (2018). Probabilistic FastText for multi-sense word embeddings. arXiv.","DOI":"10.18653\/v1\/P18-1001"},{"key":"ref_37","unstructured":"Ferreira, L., Teixeira, A., and Cunha, J.P.S. (2008). REMMA-Reconhecimento de entidades mencionadas do MedAlert. Desafios na Avalia\u00e7ao Conjunta do Reconhecimento de Entidades Mencionadas: O Segundo HAREM, Linguateca."},{"key":"ref_38","unstructured":"Pires, A.R.O. (2017). Named Entity Extraction from Portuguese Web Text. [Master\u2019s Thesis, FEUP - Faculdade de Engenharia]."},{"key":"ref_39","unstructured":"Amaral, D.O.F.D. (2017). Reconhecimento de Entidades Nomeadas na \u00e1rea da Geologia: Bacias Sedimentares Brasileiras, PUCRS."},{"key":"ref_40","unstructured":"Pirovani, J.P.C. (2019). CRF+ LG: Uma abordagem h\u00edbrida para o reconhecimento de entidades nomeadas em portugu\u00eas. [Ph.D. Thesis, Universidade Federal do Esp\u00edrito Santo]."},{"key":"ref_41","unstructured":"Li, P.H., Fu, T.J., and Ma, W.Y. (2019). Remedying BiLSTM-CNN Deficiency in Modeling Cross-Context for NER. 
arXiv."}],"container-title":["Applied Sciences"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2076-3417\/10\/7\/2303\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T09:12:30Z","timestamp":1760173950000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2076-3417\/10\/7\/2303"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,27]]},"references-count":41,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2020,4]]}},"alternative-id":["app10072303"],"URL":"https:\/\/doi.org\/10.3390\/app10072303","relation":{},"ISSN":["2076-3417"],"issn-type":[{"value":"2076-3417","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,3,27]]}}}