{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T16:13:12Z","timestamp":1774368792128,"version":"3.50.1"},"reference-count":26,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2022,7,27]],"date-time":"2022-07-27T00:00:00Z","timestamp":1658880000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Omigrade S.r.l."}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"<jats:p>The General Data Protection Regulation (GDPR) has allowed EU citizens and residents to have more control over their personal data, simplifying the regulatory environment affecting international business and unifying and homogenising privacy legislation within the EU. This regulation affects all companies that process data of European residents regardless of the place in which they are processed and their registered office, providing for a strict discipline of data protection. These companies must comply with the GDPR and be aware of the content of the data they manage; this is especially important if they are holding sensitive data, that is, any information regarding racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, data relating to the sexual life or sexual orientation of the person, as well as data on physical and mental health. These classes of data are hardly structured, and most frequently they appear within a document such as an email message, a review or a post. It is extremely difficult to know if a company is in possession of sensitive data at the risk of not protecting them properly. 
The goal of the study described in this paper is to use Machine Learning, in particular the Transformer deep-learning model, to develop classifiers capable of detecting documents that are likely to include sensitive data. Additionally, we want the classifiers to recognize the particular type of sensitive topic with which they deal, in order for a company to have a better knowledge of the data they own. We expect to make the model described in this paper available as a web service, customized to private data of possible customers, or even in a free-to-use version based on the freely available data set we have built to train the classifiers.<\/jats:p>","DOI":"10.3390\/fi14080228","type":"journal-article","created":{"date-parts":[[2022,7,27]],"date-time":"2022-07-27T21:11:05Z","timestamp":1658956265000},"page":"228","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["Automatic Detection of Sensitive Data Using Transformer-Based Classifiers"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0829-5822","authenticated-orcid":false,"given":"Michael","family":"Petrolini","sequence":"first","affiliation":[{"name":"Department of Engineering and Architecture, University of Parma, Parco Area delle Scienze 181a, 43124 Parma, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4669-512X","authenticated-orcid":false,"given":"Stefano","family":"Cagnoni","sequence":"additional","affiliation":[{"name":"Department of Engineering and Architecture, University of Parma, Parco Area delle Scienze 181a, 43124 Parma, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5916-9770","authenticated-orcid":false,"given":"Monica","family":"Mordonini","sequence":"additional","affiliation":[{"name":"Department of Engineering and Architecture, University of Parma, Parco Area delle Scienze 181a, 43124 Parma, 
Italy"}]}],"member":"1968","published-online":{"date-parts":[[2022,7,27]]},"reference":[{"key":"#cr-split#-ref_1.1","unstructured":"European Commission (2016). Regulation"},{"key":"#cr-split#-ref_1.2","unstructured":"(EU) 2016\/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95\/46\/EC (General Data Protection Regulation) (Text with EEA Relevance), European Commission."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1016\/j.clsr.2018.02.002","article-title":"Normative challenges of identification in the Internet of Things: Privacy, profiling, discrimination, and the GDPR","volume":"34","author":"Wachter","year":"2018","journal-title":"Comput. Law Secur. Rev."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Mondschein, C.F., and Monda, C. (2019). The EU\u2019s General Data Protection Regulation (GDPR) in a research context. Fundamentals of Clinical Data Science, Springer.","DOI":"10.1007\/978-3-319-99713-1_5"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3466722","article-title":"Cookie banners and privacy policies: Measuring the impact of the GDPR on the web","volume":"15","author":"Kretschmer","year":"2021","journal-title":"Acm Trans. Web (TWEB)"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Bhaskar, R., Laxman, S., Smith, A., and Thakurta, A. (2010, January 24\u201328). Discovering frequent patterns in sensitive data. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA.","DOI":"10.1145\/1835804.1835869"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"McSherry, F., and Talwar, K. (2007, January 21\u201323). Mechanism Design via Differential Privacy. 
Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS\u201907), Providence, RI, USA.","DOI":"10.1109\/FOCS.2007.66"},{"key":"ref_7","unstructured":"Halevi, S., and Rabin, T. (2006). Calibrating Noise to Sensitivity in Private Data Analysis. Theory of Cryptography, Springer."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Ku\u017eina, V., Vu\u0161ak, E., and Jovi\u0107, A. (October, January 27). Methods for Automatic Sensitive Data Detection in Large Datasets: A Review. Proceedings of the 2021 44th International Convention on Information, Communication and Electronic Technology (MIPRO), Opatija, Croatia.","DOI":"10.23919\/MIPRO52101.2021.9596735"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Pattanayak, S., and Ludwig, S.A. (2019, January 18\u201321). Improving Data Privacy Using Fuzzy Logic and Autoencoder Neural Network. Proceedings of the 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Lafayette, LA, USA.","DOI":"10.1109\/FUZZ-IEEE.2019.8858823"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Attaullah, H., Anjum, A., Kanwal, T., Malik, S.U.R., Asheralieva, A., Malik, H., Zoha, A., Arshad, K., and Imran, M.A. (2021). F-classify: Fuzzy rule based classification method for privacy preservation of multiple sensitive attributes. Sensors, 21.","DOI":"10.3390\/s21144933"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1109\/TFUZZ.2004.825969","article-title":"Complex dynamics through fuzzy chains","volume":"12","author":"Bucolo","year":"2004","journal-title":"IEEE Trans. Fuzzy Syst."},{"key":"ref_12","unstructured":"(2022, July 27). IBM Security Guardium Data Protection. Available online: https:\/\/www.ibm.com\/products\/ibm-guardium-data-protection."},{"key":"ref_13","unstructured":"(2022, July 27). Azure Information Protection. 
Available online: https:\/\/azure.microsoft.com\/solutions\/information-protection\/."},{"key":"ref_14","unstructured":"(2022, July 27). Rubrik. Available online: https:\/\/www.rubrik.com\/."},{"key":"ref_15","unstructured":"Lin, T., Wang, Y., Liu, X., and Qiu, X. (2021). A survey of transformers. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1872","DOI":"10.1007\/s11431-020-1647-3","article-title":"Pre-trained models for natural language processing: A survey","volume":"63","author":"Qiu","year":"2020","journal-title":"Sci. China Technol. Sci."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_18","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_19","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Polosukhin, L., and Kaiser, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the NIPS\u201916: Proceedings of the 30th International Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., and Lee, K. (2018, January 1\u20136). Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA.","DOI":"10.18653\/v1\/N18-1202"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Kaur, M., and Mohta, A. (2019, January 27\u201329). A Review of Deep Learning with Recurrent Neural Network. 
Proceedings of the 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India.","DOI":"10.1109\/ICSSIT46314.2019.8987837"},{"key":"ref_22","unstructured":"Daniel Jurafsky, J.H.M. (2022, July 27). N-gram Language Models. Speech and Language Processing, Available online: https:\/\/web.stanford.edu\/~jurafsky\/slp3\/ed3book.pdf."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. (2015, January 7\u201313). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.11"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Adorni, G., Cagnoni, S., Gori, M., and Maratea, M. (December, January 29). Flat and Hierarchical Classifiers for Detecting Emotion in Tweets. Proceedings of the AI*IA 2016 Advances in Artificial Intelligence: XVth International Conference of the Italian Association for Artificial Intelligence, Genova, Italy.","DOI":"10.1007\/978-3-319-49130-1"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"2622","DOI":"10.1016\/j.eswa.2007.05.028","article-title":"An empirical study of sentiment analysis for chinese documents","volume":"34","author":"Tan","year":"2008","journal-title":"Expert Syst. 
Appl."}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/14\/8\/228\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:57:18Z","timestamp":1760140638000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/14\/8\/228"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,7,27]]},"references-count":26,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2022,8]]}},"alternative-id":["fi14080228"],"URL":"https:\/\/doi.org\/10.3390\/fi14080228","relation":{},"ISSN":["1999-5903"],"issn-type":[{"value":"1999-5903","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,7,27]]}}}