{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,23]],"date-time":"2025-12-23T18:23:54Z","timestamp":1766514234484,"version":"build-2065373602"},"reference-count":32,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2024,8,6]],"date-time":"2024-08-06T00:00:00Z","timestamp":1722902400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"state government of the Land Tirol"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"<jats:p>Background: The recent rise of large language models has triggered renewed interest in medical free text data, which holds critical information about patients and diseases. However, medical free text is also highly sensitive. Therefore, de-identification is typically required but is complicated since medical free text is mostly unstructured. With the Masketeer algorithm, we present an effective tool to de-identify German medical text. Methods: We used an ensemble of different masking classes to remove references to identifiable data from over 35,000 clinical notes in accordance with the HIPAA Safe Harbor Guidelines. To retain additional context for readers, we implemented an entity recognition scheme and corpus-wide pseudonymization. Results: The algorithm performed with a sensitivity of 0.943 and specificity of 0.933. Further performance analyses showed linear runtime complexity (O(n)) with both increasing text length and corpus size. Conclusions: In the future, large language models will likely be able to de-identify medical free text more effectively and thoroughly than handcrafted rules. However, such gold-standard de-identification tools based on large language models are yet to emerge. In the current absence of such, we hope to provide best practices for a robust rule-based algorithm designed with expert domain knowledge.<\/jats:p>","DOI":"10.3390\/fi16080281","type":"journal-article","created":{"date-parts":[[2024,8,6]],"date-time":"2024-08-06T11:54:19Z","timestamp":1722945259000},"page":"281","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Masketeer: An Ensemble-Based Pseudonymization Tool with Entity Recognition for German Unstructured Medical Free Text"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6731-6873","authenticated-orcid":false,"given":"Martin","family":"Baumgartner","sequence":"first","affiliation":[{"name":"Center for Health and Bioresources, AIT Austrian Institute of Technology, 8020 Graz, Austria"},{"name":"Institute of Neural Engineering, Graz University of Technology, 8010 Graz, Austria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6066-9708","authenticated-orcid":false,"given":"Karl","family":"Kreiner","sequence":"additional","affiliation":[{"name":"Center for Health and Bioresources, AIT Austrian Institute of Technology, 8020 Graz, Austria"}]},{"given":"Fabian","family":"Wiesm\u00fcller","sequence":"additional","affiliation":[{"name":"Center for Health and Bioresources, AIT Austrian Institute of Technology, 8020 Graz, Austria"},{"name":"Institute of Neural Engineering, Graz University of Technology, 8010 Graz, Austria"},{"name":"Ludwig Boltzmann Institute for Digital Health and Prevention, 5020 Salzburg, Austria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1822-9033","authenticated-orcid":false,"given":"Dieter","family":"Hayn","sequence":"additional","affiliation":[{"name":"Center for Health and Bioresources, AIT Austrian Institute of Technology, 8020 Graz, Austria"},{"name":"Ludwig Boltzmann Institute for Digital Health and Prevention, 5020 Salzburg, Austria"}]},{"given":"Christian","family":"Puelacher","sequence":"additional","affiliation":[{"name":"Department of Internal Medicine III, Cardiology and Angiology, University Hospital Innsbruck, Medical University Innsbruck, 6020 Innsbruck, Austria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3724-4255","authenticated-orcid":false,"given":"G\u00fcnter","family":"Schreier","sequence":"additional","affiliation":[{"name":"Center for Health and Bioresources, AIT Austrian Institute of Technology, 8020 Graz, Austria"},{"name":"Institute of Neural Engineering, Graz University of Technology, 8010 Graz, Austria"}]}],"member":"1968","published-online":{"date-parts":[[2024,8,6]]},"reference":[{"key":"ref_1","first-page":"29","article-title":"Only You, Your Doctor, and Many Others May Know","volume":"2015092903","author":"Sweeney","year":"2015","journal-title":"Technol. Sci."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Meystre, S.M., Friedlin, F.J., South, B.R., Shen, S., and Samore, M.H. (2010). Automatic De-Identification of Textual Documents in the Electronic Health Record: A Review of Recent Research. BMC Med. Res. Methodol., 10.","DOI":"10.1186\/1471-2288-10-70"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"228","DOI":"10.1177\/1073110520917025","article-title":"Lost in Anonymization\u2014A Data Anonymization Reference Classification Merging Legal and Technical Considerations","volume":"48","author":"Vokinger","year":"2020","journal-title":"J. Law Med. Ethics"},{"key":"#cr-split#-ref_4.1","unstructured":"(2016). European Parliament Regulation"},{"key":"#cr-split#-ref_4.2","unstructured":"(EU) 2016\/679 of the European Parliament (General Data Protection Regulation), European Union."},{"key":"ref_5","unstructured":"(1996). United States Congress Health Insurance Portability and Accountability Act, United States Congress."},{"key":"ref_6","unstructured":"Honnibal, M., Montani, I., Van Landeghem, S., and Boyd, A. (2024, May 29). SpaCy: Industrial-Strength Natural Language Processing in Python 2020. Available online: https:\/\/spacy.io\/."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"e11","DOI":"10.2196\/cardio.9936","article-title":"HerzMobil, an Integrated and Collaborative Telemonitoring-Based Disease Management Program for Patients with Heart Failure: A Feasibility Study Paving the Way to Routine Care","volume":"2","author":"Ammenwerth","year":"2018","journal-title":"JMIR Cardio"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"574","DOI":"10.1197\/jamia.M2441","article-title":"State-of-the-Art Anonymization of Medical Records Using an Iterative Machine Learning Framework","volume":"14","author":"Szarvas","year":"2007","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1038\/s41746-020-0258-y","article-title":"Protected Health Information Filter (Philter): Accurately and Securely de-Identifying Free-Text Clinical Notes","volume":"3","author":"Norgeot","year":"2020","journal-title":"NPJ Digit. Med."},{"key":"ref_10","unstructured":"Marimon, M., Gonzalez-Agirre, A., Intxaurrondo, A., Rodriguez, H., Martin, J.L., Villegas, M., and Krallinger, M. (2019, January 24). Automatic De-Identification of Medical Texts in Spanish: The MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results. Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), Bilbao, Spain."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"727","DOI":"10.1016\/j.tele.2017.08.002","article-title":"DEDUCE: A Pattern Matching Method for Automatic de-Identification of Dutch Medical Text","volume":"35","author":"Menger","year":"2018","journal-title":"Telemat. Inform."},{"key":"ref_12","unstructured":"Trienes, J., Trieschnigg, D., Seifert, C., and Hiemstra, D. (2020). Comparing Rule-Based, Feature-Based and Deep Neural Methods for de-Identification of Dutch Medical Records. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Berg, H., and Dalianis, H. (2019, January 30). Augmenting a De-Identification System for Swedish Clinical Text Using Open Resources and Deep Learning. Proceedings of the Workshop on NLP and Pseudonymisation, Turku, Finland.","DOI":"10.18653\/v1\/D19-6215"},{"key":"ref_14","first-page":"83","article-title":"Medical Text Data Anonymization","volume":"16","author":"Marciniak","year":"2010","journal-title":"J. Med. Inform. Technol."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Mamede, N., Baptista, J., and Dias, F. (2016, January 24\u201329). Automated Anonymization of Text Documents. Proceedings of the 2016 IEEE Congress on Evolutionary Computation (CEC), Vancouver, BC, Canada.","DOI":"10.1109\/CEC.2016.7743936"},{"key":"ref_16","first-page":"33","article-title":"Automated De-Identification of Arabic Medical Records","volume":"2023","author":"Kocaman","year":"2023","journal-title":"Proc. Arab."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1","DOI":"10.5121\/ijsptm.2019.8201","article-title":"De-Identification of Protected Health Information Phi from Free Text in Medical Records","volume":"8","author":"Sreenivasan","year":"2019","journal-title":"Int. J. Secur. Priv. Trust Manag."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Kajiyama, K., Horiguchi, H., Okumura, T., Morita, M., and Kano, Y. (2020). De-Identifying Free Text of Japanese Electronic Health Records. J. Biomed. Semant., 11.","DOI":"10.1186\/s13326-020-00227-9"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Xu, Y., Zhou, T., Tian, Y., and Li, J. (2015, January 19\u201322). Application of Chinese Medical Document Anonymization in EMR System. Proceedings of the 2015 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Ningbo, China.","DOI":"10.1109\/ICSPCC.2015.7338760"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"ooad045","DOI":"10.1093\/jamiaopen\/ooad045","article-title":"A Certified De-Identification System for All Clinical Text Documents for Information Extraction at Scale","volume":"6","author":"Radhakrishnan","year":"2023","journal-title":"JAMIA Open"},{"key":"ref_21","unstructured":"Larbi, I.B.C., Burchardt, A., and Roller, R. (2023, January 2\u20136). Clinical Text Anonymization, Its Influence on Downstream NLP Tasks and the Risk of Re-Identification. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, Dubrovnik, Croatia."},{"key":"ref_22","first-page":"165","article-title":"De-Identification of German Medical Admission Notes","volume":"253","author":"Riezler","year":"2018","journal-title":"Stud. Health Technol. Inform."},{"key":"ref_23","first-page":"101","article-title":"Deep Learning Approaches Outperform Conventional Strategies in De-Identification of German Medical Reports","volume":"267","author":"Amr","year":"2019","journal-title":"Ger. Med. Data Sci. Shap. Chang.\u2014Creat. Solut. Innov. Med."},{"key":"ref_24","first-page":"203","article-title":"Annotating German Clinical Documents for De-Identification","volume":"264","author":"Kolditz","year":"2019","journal-title":"Stud. Health Technol. Inform."},{"key":"ref_25","first-page":"189","article-title":"Impact Analysis of De-Identification in Clinical Notes Classification","volume":"293","author":"Baumgartner","year":"2022","journal-title":"Stud. Health Technol. Inform."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"2170","DOI":"10.1002\/sim.2677","article-title":"Confidence Intervals for Predictive Values with an Emphasis to Case\u2013Control Studies","volume":"26","author":"Mercaldo","year":"2007","journal-title":"Stat. Med."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1038\/s41597-023-02128-9","article-title":"A Distributable German Clinical Corpus Containing Cardiovascular Clinical Routine Doctor\u2019s Letters","volume":"10","author":"Wiesenbach","year":"2023","journal-title":"Sci. Data"},{"key":"ref_28","unstructured":"Borchert, F., Lohr, C., Modersohn, L., Witt, J., Langer, T., Follmann, M., Gietzelt, M., Arnrich, B., Hahn, U., and Schapranow, M.-P. (2022, January 20\u201325). GGPONC 2.0\u2014The German Clinical Guideline Corpus for Oncology: Curation Workflow, Annotation Policy, Baseline NER Taggers. Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France."},{"key":"ref_29","unstructured":"Liu, Z., Huang, Y., Yu, X., Zhang, L., Wu, Z., Cao, C., Dai, H., Zhao, L., Li, Y., and Shu, P. (2023). DeID-GPT: Zero-Shot Medical Text de-Identification by GPT-4. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Kollapally, N.M., and Geller, J. (2024, January 21\u201323). Safeguarding Ethical AI: Detecting Potentially Sensitive Data Re-Identification and Generation of Misleading or Abusive Content from Quantized Large Language Models. Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2024), Rome, Italy.","DOI":"10.5220\/0012411900003657"},{"key":"ref_31","unstructured":"Wang, J.G., Wang, J., Li, M., and Neel, S. (2024). Pandora\u2019s White-Box: Increased Training Data Leakage in Open LLMs. arXiv."}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/16\/8\/281\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T15:30:46Z","timestamp":1760110246000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/16\/8\/281"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,6]]},"references-count":32,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2024,8]]}},"alternative-id":["fi16080281"],"URL":"https:\/\/doi.org\/10.3390\/fi16080281","relation":{},"ISSN":["1999-5903"],"issn-type":[{"type":"electronic","value":"1999-5903"}],"subject":[],"published":{"date-parts":[[2024,8,6]]}}}