{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,1]],"date-time":"2026-06-01T14:23:14Z","timestamp":1780323794195,"version":"3.54.1"},"reference-count":30,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2020,5,11]],"date-time":"2020-05-11T00:00:00Z","timestamp":1589155200000},"content-version":"vor","delay-in-days":1,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000038","name":"Natural Sciences and Engineering Research Council","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100000038","id-type":"DOI","asserted-by":"crossref"}]},{"name":"CIFAR Chair in Artificial Intelligence"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,6,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Objective<\/jats:title>\n                  <jats:p>In this work, we introduce a privacy technique for anonymizing clinical notes that guarantees all private health information is secured (including sensitive data, such as family history, that are not adequately covered by current techniques).<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Materials and Methods<\/jats:title>\n                  <jats:p>We employ a new \u201crandom replacement\u201d paradigm (replacing each token in clinical notes with neighboring word vectors from the embedding space) to achieve 100% recall on the removal of sensitive information, unachievable with current \u201csearch-and-secure\u201d paradigms. We demonstrate the utility of this paradigm on multiple corpora in a diverse set of classification tasks.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We empirically evaluate the effect of our anonymization technique both on upstream and downstream natural language processing tasks to show that our perturbations, while increasing security (ie, achieving 100% recall on any dataset), do not greatly impact the results of end-to-end machine learning approaches.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Discussion<\/jats:title>\n                  <jats:p>As long as current approaches utilize precision and recall to evaluate deidentification algorithms, there will remain a risk of overlooking sensitive information. Inspired by differential privacy, we sought to make it statistically infeasible to recreate the original data, although at the cost of readability. We hope that the work will serve as a catalyst to further research into alternative deidentification methods that can address current weaknesses.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Conclusion<\/jats:title>\n                  <jats:p>Our proposed technique can secure clinical texts at a low cost and extremely high recall with a readability trade-off while remaining useful for natural language processing classification tasks. We hope that our work can be used by risk-averse data holders to release clinical texts to researchers.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/jamia\/ocaa038","type":"journal-article","created":{"date-parts":[[2020,3,23]],"date-time":"2020-03-23T20:09:27Z","timestamp":1584994167000},"page":"901-907","source":"Crossref","is-referenced-by-count":13,"title":["Using word embeddings to improve the privacy of clinical notes"],"prefix":"10.1093","volume":"27","author":[{"given":"Mohamed","family":"Abdalla","sequence":"first","affiliation":[{"name":"ICES, Toronto, Canada"},{"name":"The Vector Institute for Artificial Intelligence, Toronto, Canada"},{"name":"Department of Computer Science, University of Toronto, Toronto, Canada"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Moustafa","family":"Abdalla","sequence":"additional","affiliation":[{"name":"Computational Statistics & Machine Learning Group, Department of Statistics, University of Oxford, Oxford, UK"},{"name":"Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK"},{"name":"Harvard Medical School, Harvard University, Boston, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Frank","family":"Rudzicz","sequence":"additional","affiliation":[{"name":"The Vector Institute for Artificial Intelligence, Toronto, Canada"},{"name":"Department of Computer Science, University of Toronto, Toronto, Canada"},{"name":"International Centre for Surgical Safety, Li Ka Shing Knowledge Institute, St Michael\u2019s Hospital, Toronto, Canada"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Graeme","family":"Hirst","sequence":"additional","affiliation":[{"name":"The Vector Institute for Artificial Intelligence, Toronto, Canada"},{"name":"Department of Computer Science, University of Toronto, Toronto, Canada"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"286","published-online":{"date-parts":[[2020,5,10]]},"reference":[{"key":"2020110613092827600_ocaa038-B1","article-title":"De-identifying free text of Japanese dummy electronic health records","author":"Kajiyama","journal-title":"Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis"},{"issue":"3","key":"2020110613092827600_ocaa038-B2","doi-asserted-by":"crossref","first-page":"596","DOI":"10.1093\/jamia\/ocw156","article-title":"De-identification of patient notes with recurrent neural networks","volume":"24","author":"Dernoncourt","year":"2017","journal-title":"J Am Med Inform Assoc"},{"issue":"1","key":"2020110613092827600_ocaa038-B3","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1186\/1472-6947-8-32","article-title":"Automated de-identification of free-text medical records","volume":"8","author":"Neamatullah","year":"2008","journal-title":"BMC Med Inform Decis Mak"},{"key":"2020110613092827600_ocaa038-B4","author":"Fernandes","year":"2018"},{"key":"2020110613092827600_ocaa038-B5","author":"Schakel","year":"2015"},{"key":"2020110613092827600_ocaa038-B6","first-page":"1334","author":"Gong","year":"2018"},{"key":"2020110613092827600_ocaa038-B7","author":"Miller","year":"2001"},{"key":"2020110613092827600_ocaa038-B8","first-page":"777","author":"Thomas","year":"2002"},{"key":"2020110613092827600_ocaa038-B9","first-page":"714","author":"Sibanda","year":"2006"},{"key":"2020110613092827600_ocaa038-B10","doi-asserted-by":"crossref","first-page":"S34","DOI":"10.1016\/j.jbi.2017.05.023","article-title":"De-identification of clinical notes via recurrent neural network and conditional random field","volume":"75","author":"Liu","year":"2017","journal-title":"J Biomed Inform"},{"key":"2020110613092827600_ocaa038-B11","doi-asserted-by":"crossref","first-page":"S4","DOI":"10.1016\/j.jbi.2017.06.011","article-title":"De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID Shared Tasks Track 1","volume":"75","author":"Stubbs","year":"2017","journal-title":"J Biomed Inform"},{"key":"2020110613092827600_ocaa038-B12","doi-asserted-by":"crossref","first-page":"S30","DOI":"10.1016\/j.jbi.2015.06.015","article-title":"Automatic detection of protected health information from clinic narratives","volume":"58","author":"Yang","year":"2015","journal-title":"J Biomed Inform"},{"key":"2020110613092827600_ocaa038-B13","doi-asserted-by":"crossref","first-page":"S11","DOI":"10.1016\/j.jbi.2015.06.007","article-title":"Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2\/UTHealth shared task Track 1","volume":"58","author":"Stubbs","year":"2015","journal-title":"J Biomed Inform"},{"key":"2020110613092827600_ocaa038-B14","doi-asserted-by":"crossref","first-page":"142","DOI":"10.1016\/j.jbi.2014.01.011","article-title":"Text de-identification for privacy protection: a study of its impact on clinical text information content","volume":"50","author":"Meystre","year":"2014","journal-title":"J Biomed Inform"},{"issue":"2","key":"2020110613092827600_ocaa038-B15","doi-asserted-by":"crossref","first-page":"342","DOI":"10.1136\/amiajnl-2012-001034","article-title":"Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text","volume":"20","author":"Carrell","year":"2013","journal-title":"J Am Med Inform Assoc"},{"issue":"1","key":"2020110613092827600_ocaa038-B16","first-page":"33","article-title":"The distributional hypothesis","volume":"20","author":"Sahlgren","year":"2008","journal-title":"Italian J Linguist"},{"key":"2020110613092827600_ocaa038-B17","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1016\/j.jbi.2018.09.008","article-title":"A comparison of word embeddings for the biomedical natural language processing","volume":"87","author":"Wang","year":"2018","journal-title":"J Biomed Inform"},{"key":"2020110613092827600_ocaa038-B18","year":"2019"},{"key":"2020110613092827600_ocaa038-B19","year":"2019"},{"issue":"3","key":"2020110613092827600_ocaa038-B20","doi-asserted-by":"crossref","first-page":"288","DOI":"10.1016\/j.jbi.2006.06.004","article-title":"Measures of semantic similarity and relatedness in the biomedical domain","volume":"40","author":"Pedersen","year":"2007","journal-title":"J Biomed Inform"},{"key":"2020110613092827600_ocaa038-B21","author":"Hliaoutakis","year":"2005"},{"issue":"2","key":"2020110613092827600_ocaa038-B22","doi-asserted-by":"crossref","first-page":"251","DOI":"10.1016\/j.jbi.2010.10.004","article-title":"Towards a framework for developing semantic relatedness reference standards","volume":"44","author":"Pakhomov","year":"2011","journal-title":"J Biomed Inform"},{"key":"2020110613092827600_ocaa038-B23","first-page":"572","article-title":"Semantic similarity and relatedness between clinical terms: an experimental study","volume":"2010","author":"Pakhomov","year":"2010","journal-title":"AMIA Annu Symp Proc"},{"key":"2020110613092827600_ocaa038-B24","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1162\/tacl_a_00051","article-title":"Enriching word vectors with subword information","volume":"5","author":"Bojanowski","year":"2017","journal-title":"Trans Assoc Comput Linguist"},{"issue":"1","key":"2020110613092827600_ocaa038-B25","doi-asserted-by":"crossref","DOI":"10.1038\/sdata.2016.35","article-title":"MIMIC-III, a freely accessible critical care database","volume":"3","author":"Johnson","year":"2016","journal-title":"Sci Data"},{"key":"2020110613092827600_ocaa038-B26","article-title":"Deep contextualized word representations","author":"Peters","journal-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies"},{"key":"2020110613092827600_ocaa038-B27","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin","journal-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies"},{"key":"2020110613092827600_ocaa038-B28","author":"Liendo"},{"key":"2020110613092827600_ocaa038-B29","article-title":"Learning word vectors for sentiment analysis","author":"Maas","journal-title":"Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies"},{"key":"2020110613092827600_ocaa038-B30","author":"Kim"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/27\/6\/901\/34152514\/ocaa038.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/27\/6\/901\/34152514\/ocaa038.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,11,6]],"date-time":"2020-11-06T19:27:19Z","timestamp":1604690839000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/27\/6\/901\/5835527"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,5,10]]},"references-count":30,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2020,5,10]]},"published-print":{"date-parts":[[2020,6,1]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocaa038","relation":{},"ISSN":["1527-974X"],"issn-type":[{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,6]]},"published":{"date-parts":[[2020,5,10]]}}}