{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T20:58:20Z","timestamp":1777582700418,"version":"3.51.4"},"reference-count":40,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,1,30]],"date-time":"2020-01-30T00:00:00Z","timestamp":1580342400000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"},{"start":{"date-parts":[[2020,1,30]],"date-time":"2020-01-30T00:00:00Z","timestamp":1580342400000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Med Inform Decis Mak"],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n<jats:title>Background<\/jats:title>\n<jats:p>Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance when applied to new datasets.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Objective<\/jats:title>\n<jats:p>We present practical options for clinical note de-identification, assessing performance of machine learning systems ranging from off-the-shelf to fully customized.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Methods<\/jats:title>\n<jats:p>We implement a state-of-the-art machine learning de-identification system, training and testing on pairs of datasets that match the deployment scenarios. We use clinical notes from two i2b2 competition corpora, the Physionet Gold Standard corpus, and parts of the MIMIC-III dataset.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Results<\/jats:title>\n<jats:p>Fully customized systems remove 97\u201399% of personally identifying information. Performance of off-the-shelf systems varies by dataset, with performance mostly above 90%. Providing a small labeled dataset or large unlabeled dataset allows for fine-tuning that improves performance over off-the-shelf systems.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Conclusion<\/jats:title>\n<jats:p>Health organizations should be aware of the levels of customization available when selecting a de-identification deployment solution, in order to choose the one that best matches their resources and target performance level.<\/jats:p>\n<\/jats:sec>","DOI":"10.1186\/s12911-020-1026-2","type":"journal-article","created":{"date-parts":[[2020,1,30]],"date-time":"2020-01-30T15:03:38Z","timestamp":1580396618000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":28,"title":["Customization scenarios for de-identification of clinical notes"],"prefix":"10.1186","volume":"20","author":[{"given":"Tzvika","family":"Hartman","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael D.","family":"Howell","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jeff","family":"Dean","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8166-4428","authenticated-orcid":false,"given":"Shlomo","family":"Hoory","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ronit","family":"Slyper","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Itay","family":"Laish","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Oren","family":"Gilon","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Danny","family":"Vainstein","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Greg","family":"Corrado","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Katherine","family":"Chou","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ming Jack","family":"Po","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jutta","family":"Williams","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Scott","family":"Ellis","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gavin","family":"Bee","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Avinatan","family":"Hassidim","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rony","family":"Amira","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Genady","family":"Beryozkin","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Idan","family":"Szpektor","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yossi","family":"Matias","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2020,1,30]]},"reference":[{"issue":"Suppl 1","key":"1026_CR1","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1186\/s12911-018-0594-x","volume":"18","author":"X Chen","year":"2018","unstructured":"Chen X, Xie H, Wang FL, Liu Z, Xu J, Hao T. A bibliometric analysis of natural language processing in medical research. BMC Med Inform Decis Mak. 2018;18(Suppl 1):14.","journal-title":"BMC Med Inform Decis Mak"},{"key":"1026_CR2","unstructured":"PubMed search conducted 23 April 2018 using the following URL https:\/\/www.ncbi.nlm.nih.gov\/pubmed\/?term=%22free+text%22+OR+%22unstructured+text%22 showed 89 results in 2007 and 460 results in 2018."},{"issue":"1","key":"1026_CR3","first-page":"194","volume":"10","author":"A N\u00e9v\u00e9ol","year":"2015","unstructured":"N\u00e9v\u00e9ol A, Zweigenbaum P. Clinical natural language processing in 2014: foundational methods supporting efficient healthcare. Yearb Med Inform. 2015;10(1):194\u20138.","journal-title":"Yearb Med Inform"},{"key":"1026_CR4","doi-asserted-by":"publisher","first-page":"142","DOI":"10.1016\/j.jbi.2014.01.011","volume":"50","author":"SM Meystre","year":"2014","unstructured":"Meystre SM, Ferr\u00e1ndez \u00d3, Friedlin FJ, South BR, Shen S, Samore MH. Text de-identification for privacy protection: a study of its impact on clinical text information content. J Biomed Inform. 2014;50:142\u201350.","journal-title":"J Biomed Inform"},{"issue":"3","key":"1026_CR5","doi-asserted-by":"crossref","first-page":"596","DOI":"10.1093\/jamia\/ocw156","volume":"24","author":"F Dernoncourt","year":"2017","unstructured":"Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. J Am Med Inform Assoc. 2017;24(3):596\u2013606.","journal-title":"J Am Med Inform Assoc"},{"key":"1026_CR6","doi-asserted-by":"publisher","first-page":"S34","DOI":"10.1016\/j.jbi.2017.05.023","volume":"75S","author":"Z Liu","year":"2017","unstructured":"Liu Z, Tang B, Wang X, Chen Q. De-identification of clinical notes via recurrent neural network and conditional random field. J Biomed Inform. 2017;75S:S34\u201342.","journal-title":"J Biomed Inform"},{"key":"1026_CR7","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1186\/1472-6947-8-32","volume":"8","author":"I Neamatullah","year":"2008","unstructured":"Neamatullah I, Douglass MM, Lehman L-WH, et al. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak. 2008;8:32.","journal-title":"BMC Med Inform Decis Mak"},{"issue":"Suppl","key":"1026_CR8","doi-asserted-by":"publisher","first-page":"S82","DOI":"10.1097\/MLR.0b013e3182585355","volume":"50","author":"CA Kushida","year":"2012","unstructured":"Kushida CA, Nichols DA, Jadrnicek R, Miller R, Walsh JK, Griffin K. Strategies for de-identification and anonymization of electronic health record data for use in multicenter research studies. Med Care. 2012;50(Suppl):S82\u2013S101.","journal-title":"Med Care"},{"issue":"Suppl","key":"1026_CR9","doi-asserted-by":"publisher","first-page":"S11","DOI":"10.1016\/j.jbi.2015.06.007","volume":"58","author":"A Stubbs","year":"2015","unstructured":"Stubbs A, Kotfila C, Uzuner \u00d6. Automated systems for the de-identification of longitudinal clinical narratives: overview of 2014 i2b2\/UTHealth shared task track 1. J Biomed Inform. 2015;58(Suppl):S11\u20139.","journal-title":"J Biomed Inform"},{"key":"1026_CR10","unstructured":"Sweeney L. Replacing personally-identifying information in medical records, the Scrub system. In Proceedings of the AMIA annual fall symposium 1996. American Medical Informatics Association. Washington, DC: Hanley & Belfus, Inc; 2016. p. 333"},{"issue":"2","key":"1026_CR11","doi-asserted-by":"publisher","first-page":"176","DOI":"10.1309\/E6K33GBPE5C27FYU","volume":"121","author":"D Gupta","year":"2004","unstructured":"Gupta D, Saul M, Gilbertson J. Evaluation of a deidentification (De-id) software engine to share pathology reports and clinical documents for research. Am J Clin Pathol. 2004;121(2):176\u201386.","journal-title":"Am J Clin Pathol"},{"key":"1026_CR12","doi-asserted-by":"crossref","unstructured":"Szarvas G, Farkas R, Kocsor A. A multilingual named entity recognition system using boosting and c4. 5 decision tree learning algorithms. In International Conference on Discovery Science 2006 Oct 7. Berlin: Springer; 2006. p. 267\u2013278.","DOI":"10.1007\/11893318_27"},{"key":"1026_CR13","unstructured":"Guo Y, Gaizauskas R, Roberts I, Demetriou G, Hepple M. Identifying personal health information using support vector machines. In i2b2 workshop on challenges in natural language processing for clinical data 2006 Nov 10. p. 10\u201311."},{"issue":"1","key":"1026_CR14","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1016\/j.artmed.2007.10.001","volume":"42","author":"O Uzuner","year":"2008","unstructured":"Uzuner O, Sibanda TC, Luo Y, Szolovits P. A de-identifier for medical discharge summaries. Artif Intell Med. 2008;42(1):13\u201335.","journal-title":"Artif Intell Med"},{"key":"1026_CR15","first-page":"10","volume-title":"i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data","author":"K Hara","year":"2006","unstructured":"Hara K. Others. Applying a SVM based chunker and a text classifier to the deid challenge. In: i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data; 2006. p. 10\u20131."},{"key":"1026_CR16","unstructured":"Yogarajan V, Mayo M, Pfahringer B. A survey of automatic de-identification of longitudinal clinical narratives. arXiv [csAI]. 2018; http:\/\/arxiv.org\/abs\/1810.06765."},{"key":"1026_CR17","doi-asserted-by":"publisher","first-page":"575","DOI":"10.1007\/978-3-319-50496-4_51","volume-title":"Natural Language Understanding and Intelligent Applications","author":"Kun Li","year":"2016","unstructured":"Li K, Chai Y, Zhao H, Nan X, Zhao Y. Learning to Recognize Protected Health Information in Electronic Health Records with Recurrent Neural Network. In Natural Language Understanding and Intelligent Applications 2016 Dec 2. Champ: Springer; 2016. p. 575\u2013582."},{"key":"1026_CR18","doi-asserted-by":"publisher","first-page":"S19","DOI":"10.1016\/j.jbi.2017.06.006","volume":"75S","author":"H-J Lee","year":"2017","unstructured":"Lee H-J, Wu Y, Zhang Y, Xu J, Xu H, Roberts K. A hybrid approach to automatic de-identification of psychiatric notes. J Biomed Inform. 2017;75S:S19\u201327.","journal-title":"J Biomed Inform"},{"key":"1026_CR19","first-page":"1044","volume":"2017","author":"M Kayaalp","year":"2017","unstructured":"Kayaalp M. Modes of De-identification. AMIA Annu Symp Proc. 2017;2017:1044\u201350.","journal-title":"AMIA Annu Symp Proc"},{"key":"1026_CR20","volume-title":"Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12","author":"JY Lee","year":"2018","unstructured":"Lee JY, Dernoncourt F, Szolovits P. Transfer Learning for Named-Entity Recognition with Neural Networks. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12; 2018. http:\/\/www.lrec-conf.org\/proceedings\/lrec2018\/pdf\/878.pdf."},{"key":"1026_CR21","first-page":"1070","volume":"2017","author":"H-J Lee","year":"2017","unstructured":"Lee H-J, Zhang Y, Roberts K, Xu H. Leveraging existing corpora for de-identification of psychiatric notes using domain adaptation. AMIA Annu Symp Proc. 2017;2017:1070\u20139.","journal-title":"AMIA Annu Symp Proc"},{"key":"1026_CR22","first-page":"737","volume":"2015","author":"Y Kim","year":"2015","unstructured":"Kim Y, Riloff E, Hurdle JF. A study of concept extraction across different types of clinical notes. AMIA Annu Symp Proc. 2015;2015:737\u201346.","journal-title":"AMIA Annu Symp Proc"},{"key":"1026_CR23","first-page":"1","volume-title":"Proceedings of the BioNLP 2018 Workshop, Melbourne, Australia, July 19","author":"D Newman-Griffis","year":"2018","unstructured":"Newman-Griffis D, Zirikly A. Embedding Transfer for Low-Resource Medical Named Entity Recognition: A Case Study on Patient Mobility. In: Proceedings of the BioNLP 2018 Workshop, Melbourne, Australia, July 19; 2018. p. 1\u201311."},{"key":"1026_CR24","volume-title":"The Central Role of the Propensity Score in Observational Studies for Causal Effects","author":"PR Rosenbaum","year":"1982","unstructured":"Rosenbaum PR, Rubin DB. The Central Role of the Propensity Score in Observational Studies for Causal Effects; 1982."},{"key":"1026_CR25","unstructured":"Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. US Department of Health & Human Services: Health Information Privacy. https:\/\/www.hhs.gov\/hipaa\/for-professionals\/privacy\/special-topics\/de-identification\/index.html. Accessed September 5, 2019."},{"issue":"5","key":"1026_CR26","doi-asserted-by":"publisher","first-page":"550","DOI":"10.1197\/jamia.M2444","volume":"14","author":"O Uzuner","year":"2007","unstructured":"Uzuner O, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc. 2007;14(5):550\u201363.","journal-title":"J Am Med Inform Assoc"},{"issue":"Suppl","key":"1026_CR27","doi-asserted-by":"publisher","first-page":"S20","DOI":"10.1016\/j.jbi.2015.07.020","volume":"58","author":"OU Amber Stubbs","year":"2015","unstructured":"Amber Stubbs OU. Annotating longitudinal clinical narratives for de-identification: the 2014 i2b2\/UTHealth Corpus. J Biomed Inform. 2015;58(Suppl):S20.","journal-title":"J Biomed Inform"},{"issue":"23","key":"1026_CR28","doi-asserted-by":"publisher","first-page":"e215","DOI":"10.1161\/01.CIR.101.23.e215","volume":"101","author":"AL Goldberger","year":"2000","unstructured":"Goldberger AL, Amaral LAN, Glass L, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. 2000;101(23):e215\u201320.","journal-title":"Circulation"},{"key":"1026_CR29","doi-asserted-by":"publisher","unstructured":"Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3. https:\/\/doi.org\/10.1038\/sdata.2016.35.","DOI":"10.1038\/sdata.2016.35"},{"key":"1026_CR30","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/s18-2021","volume-title":"Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics","author":"V Yadav","year":"2018","unstructured":"Yadav V, Sharp R, Bethard S. Deep Affix Features Improve Neural Named Entity Recognizers. In: Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics; 2018. https:\/\/doi.org\/10.18653\/v1\/s18-2021."},{"key":"1026_CR31","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/d17-2017","volume-title":"Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations","author":"F Dernoncourt","year":"2017","unstructured":"Dernoncourt F, Lee JY, Szolovits P. NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations; 2017. https:\/\/doi.org\/10.18653\/v1\/d17-2017."},{"key":"1026_CR32","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/d14-1162","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"J Pennington","year":"2014","unstructured":"Pennington J, Socher R, Manning C. Glove: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014. https:\/\/doi.org\/10.3115\/v1\/d14-1162."},{"issue":"Jul","key":"1026_CR33","first-page":"2121","volume":"12","author":"J Duchi","year":"2011","unstructured":"Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res. 2011;12(Jul):2121\u201359.","journal-title":"J Mach Learn Res"},{"key":"1026_CR34","volume-title":"Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS)","author":"E EET","year":"2017","unstructured":"EET E, Schain M, Mackey A, Gordon A, Saurous RA, Elidan G. Scalable Learning of Non-Decomposable Objectives. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS); 2017."},{"key":"1026_CR35","doi-asserted-by":"publisher","DOI":"10.1145\/1102351.1102399","volume-title":"Proceedings of the 22nd International Conference on Machine Learning - ICML \u201805","author":"T Joachims","year":"2005","unstructured":"Joachims T. A support vector method for multivariate performance measures. In: Proceedings of the 22nd International Conference on Machine Learning - ICML \u201805; 2005. https:\/\/doi.org\/10.1145\/1102351.1102399."},{"key":"1026_CR36","doi-asserted-by":"publisher","first-page":"160","DOI":"10.18653\/v1\/W18-5618","volume-title":"Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis","author":"G Sheikhshabbafghi","year":"2018","unstructured":"Sheikhshabbafghi G, Birol I, Sarkar A. In-domain Context-aware Token Embeddings Improve Biomedical Named Entity Recognition. In: Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis. Brussels: Association for Computational Linguistics; 2018. p. 160\u20134."},{"key":"1026_CR37","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1016\/j.jbi.2018.09.008","volume":"87","author":"Y Wang","year":"2018","unstructured":"Wang Y, Liu S, Afzal N, et al. A comparison of word embeddings for the biomedical natural language processing. J Biomed Inform. 2018;87:12\u201320.","journal-title":"J Biomed Inform"},{"key":"1026_CR38","volume-title":"Efficient Estimation of Word Representations in Vector Space","author":"T Mikolov","year":"2013","unstructured":"Mikolov T, Chen K, Corrado G, Dean J. Efficient Estimation of Word Representations in Vector Space. 2013. http:\/\/arxiv.org\/abs\/1301.3781. Accessed 9 2019."},{"key":"1026_CR39","first-page":"3111","volume-title":"Advances in Neural Information Processing Systems","author":"T Mikolov","year":"2013","unstructured":"Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems; 2013. p. 3111\u20139."},{"key":"1026_CR40","unstructured":"El Emam K, Arbuckle L. Anonymizing health data: case studies and methods to get you started. California: O\u2019Reilly Media, Inc.; 2013."}],"container-title":["BMC Medical Informatics and Decision Making"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s12911-020-1026-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/s12911-020-1026-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s12911-020-1026-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,1,29]],"date-time":"2021-01-29T00:10:39Z","timestamp":1611879039000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcmedinformdecismak.biomedcentral.com\/articles\/10.1186\/s12911-020-1026-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,1,30]]},"references-count":40,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["1026"],"URL":"https:\/\/doi.org\/10.1186\/s12911-020-1026-2","relation":{},"ISSN":["1472-6947"],"issn-type":[{"value":"1472-6947","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,1,30]]},"assertion":[{"value":"11 March 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 January 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 January 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Formal approval was obtained from each respective dataset owner.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not Applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors are employed by Google, LLC and own equity in Alphabet, Inc.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"14"}}