{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T06:20:31Z","timestamp":1774333231670,"version":"3.50.1"},"reference-count":25,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,4,14]],"date-time":"2020-04-14T00:00:00Z","timestamp":1586822400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,4,14]],"date-time":"2020-04-14T00:00:00Z","timestamp":1586822400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100008227","name":"Achievement Rewards for College Scientists Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100008227","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000923","name":"Silicon Valley Community Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000923","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"U.S. Department of Health & Human Services | National Institutes of Health","doi-asserted-by":"publisher","award":["P30 AR070155"],"award-info":[{"award-number":["P30 AR070155"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"U.S. Department of Health & Human Services | National Institutes of Health","doi-asserted-by":"publisher","award":["P30 AR070155"],"award-info":[{"award-number":["P30 AR070155"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"U.S. 
Department of Health & Human Services | National Institutes of Health","doi-asserted-by":"publisher","award":["U24 CA195858"],"award-info":[{"award-number":["U24 CA195858"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000092","name":"U.S. Department of Health & Human Services | NIH | U.S. National Library of Medicine","doi-asserted-by":"publisher","award":["K01 LM012381"],"award-info":[{"award-number":["K01 LM012381"]}],"id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000092","name":"U.S. Department of Health & Human Services | NIH | U.S. National Library of Medicine","doi-asserted-by":"publisher","award":["K23 AR063770"],"award-info":[{"award-number":["K23 AR063770"]}],"id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000133","name":"U.S. Department of Health & Human Services | Agency for Healthcare Research and Quality","doi-asserted-by":"publisher","award":["R01 HS024412"],"award-info":[{"award-number":["R01 HS024412"]}],"id":[{"id":"10.13039\/100000133","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000133","name":"U.S. Department of Health & Human Services | Agency for Healthcare Research and Quality","doi-asserted-by":"publisher","award":["R01 HS024412"],"award-info":[{"award-number":["R01 HS024412"]}],"id":[{"id":"10.13039\/100000133","id-type":"DOI","asserted-by":"publisher"}]},{"name":"U.S. Department of Health & Human Services | National Institutes of Health"},{"name":"U.S. Department of Health & Human Services | Agency for Healthcare Research and Quality"},{"name":"U.S. Department of Health & Human Services | NIH | U.S. National Library of Medicine"},{"name":"U.S. Department of Health & Human Services | National Institutes of Health"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. 
Med."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>There is a great and growing need to ascertain what exactly is the state of a patient, in terms of disease progression, actual care practices, pathology, adverse events, and much more, beyond the paucity of data available in structured medical record data. Ascertaining these harder-to-reach data elements is now critical for the accurate phenotyping of complex traits, detection of adverse outcomes, efficacy of off-label drug use, and longitudinal patient surveillance. Clinical notes often contain the most detailed and relevant digital information about individual patients, the nuances of their diseases, the treatment strategies selected by physicians, and the resulting outcomes. However, notes remain largely unused for research because they contain Protected Health Information (PHI), which is synonymous with individually identifying data. Previous clinical note de-identification approaches have been rigid and still too inaccurate to see any substantial real-world use, primarily because they have been trained with too small medical text corpora. To build a new de-identification tool, we created the largest manually annotated clinical note corpus for PHI and develop a customizable open-source de-identification software called Philter (\u201cProtected Health Information filter\u201d). 
Here we describe the design and evaluation of Philter, and show how it offers substantial real-world improvements over prior methods.<\/jats:p>","DOI":"10.1038\/s41746-020-0258-y","type":"journal-article","created":{"date-parts":[[2020,4,14]],"date-time":"2020-04-14T10:03:12Z","timestamp":1586858592000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":85,"title":["Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes"],"prefix":"10.1038","volume":"3","author":[{"given":"Beau","family":"Norgeot","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0840-614X","authenticated-orcid":false,"given":"Kathleen","family":"Muenzen","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2562-6574","authenticated-orcid":false,"given":"Thomas A.","family":"Peterson","sequence":"additional","affiliation":[]},{"given":"Xuancheng","family":"Fan","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4515-8090","authenticated-orcid":false,"given":"Benjamin 
S.","family":"Glicksberg","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1240-9949","authenticated-orcid":false,"given":"Gundolf","family":"Schenk","sequence":"additional","affiliation":[]},{"given":"Eugenia","family":"Rutenberg","sequence":"additional","affiliation":[]},{"given":"Boris","family":"Oskotsky","sequence":"additional","affiliation":[]},{"given":"Marina","family":"Sirota","sequence":"additional","affiliation":[]},{"given":"Jinoos","family":"Yazdany","sequence":"additional","affiliation":[]},{"given":"Gabriela","family":"Schmajuk","sequence":"additional","affiliation":[]},{"given":"Dana","family":"Ludwig","sequence":"additional","affiliation":[]},{"given":"Theodore","family":"Goldstein","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7433-2740","authenticated-orcid":false,"given":"Atul J.","family":"Butte","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,4,14]]},"reference":[{"key":"258_CR1","doi-asserted-by":"publisher","first-page":"1046","DOI":"10.1093\/jamia\/ocv202","volume":"23","author":"JC Kirby","year":"2016","unstructured":"Kirby, J. C. et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J. Am. Med. Inf. Assoc. 23, 1046\u20131052 (2016).","journal-title":"J. Am. Med. Inf. Assoc."},{"key":"258_CR2","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1038\/s41591-018-0320-3","volume":"25","author":"B Norgeot","year":"2019","unstructured":"Norgeot, B., Glicksberg, B. S. & Butte, A. J. A call for deep-learning healthcare. Nat. Med. 25, 14\u201315 (2019).","journal-title":"Nat. Med."},{"key":"258_CR3","doi-asserted-by":"publisher","first-page":"i2139","DOI":"10.1136\/bmj.i2139","volume":"353","author":"MA Makary","year":"2016","unstructured":"Makary, M. A. & Daniel, M. Medical error-the third leading cause of death in the US. 
BMJ 353, i2139 (2016).","journal-title":"BMJ"},{"key":"258_CR4","doi-asserted-by":"publisher","first-page":"1620","DOI":"10.1111\/j.1475-6773.2005.00444.x","volume":"40","author":"KJ O\u2019Malley","year":"2005","unstructured":"O\u2019Malley, K. J. et al. Measuring diagnoses: ICD code accuracy. Health Serv. Res. 40, 1620\u20131639 (2005).","journal-title":"Health Serv. Res."},{"key":"258_CR5","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0187121","volume":"12","author":"E Iqbal","year":"2017","unstructured":"Iqbal, E. et al. ADEPt, a semantically-enriched pipeline for extracting adverse drug events from free-text electronic health records. PLoS ONE 12, e0187121 (2017).","journal-title":"PLoS ONE"},{"key":"258_CR6","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0089324","volume":"9","author":"K Jung","year":"2014","unstructured":"Jung, K. et al. Automated detection of off-label drug use. PLoS ONE 9, e89324 (2014).","journal-title":"PLoS ONE"},{"key":"258_CR7","first-page":"28","volume":"2017","author":"N Afzal","year":"2017","unstructured":"Afzal, N. et al. Surveillance of peripheral arterial disease cases using natural language processing of clinical notes. AMIA Jt Summits Transl. Sci. Proc. 2017, 28\u201336 (2017).","journal-title":"AMIA Jt Summits Transl. Sci. Proc."},{"key":"258_CR8","doi-asserted-by":"publisher","DOI":"10.1038\/sdata.2014.32","volume":"1","author":"SG Finlayson","year":"2014","unstructured":"Finlayson, S. G., LePendu, P. & Shah, N. H. Building the graph of medicine from millions of clinical narratives. Sci. Data 1, 140032 (2014).","journal-title":"Sci. Data"},{"key":"258_CR9","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2288-12-109","volume":"12","author":"O Ferrandez","year":"2012","unstructured":"Ferrandez, O. et al. Evaluating current automatic de-identification methods with Veteran\u2019s health administration clinical documents. BMC Med. Res. Methodol. 12, 109 (2012).","journal-title":"BMC Med. Res. 
Methodol."},{"key":"258_CR10","first-page":"E215","volume":"101","author":"AL Goldberger","year":"2000","unstructured":"Goldberger, A. L. et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, E215\u2013E220 (2000).","journal-title":"Circulation"},{"key":"258_CR11","doi-asserted-by":"publisher","DOI":"10.1186\/1472-6947-8-32","volume":"8","author":"I Neamatullah","year":"2008","unstructured":"Neamatullah, I. et al. Automated de-identification of free-text medical records. BMC Med. Inf. Decis. Mak. 8, 32 (2008).","journal-title":"BMC Med. Inf. Decis. Mak."},{"issue":"Suppl","key":"258_CR12","doi-asserted-by":"publisher","first-page":"S11","DOI":"10.1016\/j.jbi.2015.06.007","volume":"58","author":"A Stubbs","year":"2015","unstructured":"Stubbs, A., Kotfila, C. & Uzuner, O. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2\/UTHealth shared task Track 1. J. Biomed. Inf. 58(Suppl), S11\u2013S19 (2015).","journal-title":"J. Biomed. Inf."},{"issue":"Suppl","key":"258_CR13","doi-asserted-by":"publisher","first-page":"S20","DOI":"10.1016\/j.jbi.2015.07.020","volume":"58","author":"A Stubbs","year":"2015","unstructured":"Stubbs, A. & Uzuner, O. Annotating longitudinal clinical narratives for de-identification: the 2014 i2b2\/UTHealth corpus. J. Biomed. Inf. 58(Suppl), S20\u2013S29 (2015).","journal-title":"J. Biomed. Inf."},{"key":"258_CR14","doi-asserted-by":"publisher","first-page":"550","DOI":"10.1197\/jamia.M2444","volume":"14","author":"O Uzuner","year":"2007","unstructured":"Uzuner, O., Luo, Y. & Szolovits, P. Evaluating the state-of-the-art in automatic de-identification. J. Am. Med. Inf. Assoc. 14, 550\u2013563 (2007).","journal-title":"J. Am. Med. Inf. 
Assoc."},{"key":"258_CR15","doi-asserted-by":"publisher","first-page":"84","DOI":"10.1136\/amiajnl-2012-001012","volume":"20","author":"L Deleger","year":"2013","unstructured":"Deleger, L. et al. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. J. Am. Med. Inf. Assoc. 20, 84\u201394 (2013).","journal-title":"J. Am. Med. Inf. Assoc."},{"key":"258_CR16","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2288-10-70","volume":"10","author":"SM Meystre","year":"2010","unstructured":"Meystre, S. M., Friedlin, F. J., South, B. R., Shen, S. & Samore, M. H. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med. Res. Methodol. 10, 70 (2010).","journal-title":"BMC Med. Res. Methodol."},{"key":"258_CR17","unstructured":"Sibanda, T. & Uzuner, O. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference. Association for Computational Linguistics. http:\/\/www.lrec-conf.org\/proceedings\/lrec2016\/workshops\/LREC2016Workshop-ISA12proceedings.pdf."},{"key":"258_CR18","doi-asserted-by":"publisher","first-page":"596","DOI":"10.1093\/jamia\/ocw156","volume":"24","author":"F Dernoncourt","year":"2017","unstructured":"Dernoncourt, F., Lee, J. Y., Uzuner, O. & Szolovits, P. De-identification of patient notes with recurrent neural networks. J. Am. Med Inf. Assoc. 24, 596\u2013606 (2017).","journal-title":"J. Am. Med Inf. Assoc."},{"key":"258_CR19","doi-asserted-by":"publisher","first-page":"S34","DOI":"10.1016\/j.jbi.2017.05.023","volume":"75S","author":"Z Liu","year":"2017","unstructured":"Liu, Z., Tang, B., Wang, X. & Chen, Q. De-identification of clinical notes via recurrent neural network and conditional random field. J. Biomed. Inf. 75S, S34\u2013S42 (2017).","journal-title":"J. Biomed. 
Inf."},{"key":"258_CR20","doi-asserted-by":"publisher","first-page":"849","DOI":"10.1016\/j.ijmedinf.2010.09.007","volume":"79","author":"J Aberdeen","year":"2010","unstructured":"Aberdeen, J. et al. The MITRE Identification Scrubber Toolkit: design, training, and assessment. Int J. Med Inf. 79, 849\u2013859 (2010).","journal-title":"Int J. Med Inf."},{"key":"258_CR21","unstructured":"Rim, K. Mae2: Portable annotation tool for general natural language use. In Proc 12th Joint ACL-ISO Workshop on Interoperable Semantic Annotation. 75\u201380 (2016)."},{"key":"258_CR22","doi-asserted-by":"publisher","first-page":"173","DOI":"10.1016\/j.jbi.2014.01.014","volume":"50","author":"L Deleger","year":"2014","unstructured":"Deleger, L. et al. Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research. J. Biomed. Inf. 50, 173\u2013183 (2014).","journal-title":"J. Biomed. Inf."},{"key":"258_CR23","doi-asserted-by":"crossref","unstructured":"McMurry, A. J., Fitch, B., Savova, G., Kohane, I. S. & Reis, B. Y. Improved de-identification of physician notes through integrative modeling of both public and private medical text. BMC Med. Inf. Decis. Mak. 13, 112 (2013)..","DOI":"10.1186\/1472-6947-13-112"},{"key":"258_CR24","doi-asserted-by":"publisher","first-page":"507","DOI":"10.1136\/jamia.2009.001560","volume":"17","author":"GK Savova","year":"2010","unstructured":"Savova, G. K. et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J. Am. Med. Inf. Assoc. 17, 507\u2013513 (2010).","journal-title":"J. Am. Med. Inf. Assoc."},{"key":"258_CR25","doi-asserted-by":"publisher","first-page":"327","DOI":"10.1017\/S1351324904003523","volume":"10","author":"D Ferrucci","year":"2004","unstructured":"Ferrucci, D., Lally, A. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Nat. Lang. Eng. 
10, 327\u2013348 (2004).","journal-title":"Nat. Lang. Eng"}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-020-0258-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-020-0258-y","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-020-0258-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,7]],"date-time":"2022-12-07T02:08:43Z","timestamp":1670378923000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-020-0258-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,4,14]]},"references-count":25,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2020,12]]}},"alternative-id":["258"],"URL":"https:\/\/doi.org\/10.1038\/s41746-020-0258-y","relation":{},"ISSN":["2398-6352"],"issn-type":[{"value":"2398-6352","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,4,14]]},"assertion":[{"value":"17 May 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 March 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 April 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"A.J.B., B.N., and E.R. are inventors on a filed disclosure on the Philter technology at the University of California. A.J.B., B.N., and E.R. 
are inventors on a filed patent, some of the components of which are described in this paper.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"57"}}