{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T23:36:39Z","timestamp":1774481799340,"version":"3.50.1"},"reference-count":26,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,9,21]],"date-time":"2020-09-21T00:00:00Z","timestamp":1600646400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,9,21]],"date-time":"2020-09-21T00:00:00Z","timestamp":1600646400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Japanese Health Labour Sciences Research Grant","award":["H28-ICT-\u4e00\u822c-007"],"award-info":[{"award-number":["H28-ICT-\u4e00\u822c-007"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Biomed Semant"],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n<jats:title>Background<\/jats:title>\n<jats:p>Recently, more electronic data sources are becoming available in the healthcare domain. Electronic health records (EHRs), with their vast amounts of potentially available data, can greatly improve healthcare. Although EHR de-identification is necessary to protect personal information, automatic de-identification of Japanese language EHRs has not been studied sufficiently. This study was conducted to raise de-identification performance for Japanese EHRs through classic machine learning, deep learning, and rule-based methods, depending on the dataset.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Results<\/jats:title>\n<jats:p>Using three datasets, we implemented de-identification systems for Japanese EHRs and compared the de-identification performances found for rule-based, Conditional Random Fields (CRF), and Long Short-Term Memory (LSTM)-based methods. 
Gold standard tags for de-identification were annotated manually for <jats:italic>age, hospital<\/jats:italic>, <jats:italic>person<\/jats:italic>, <jats:italic>sex<\/jats:italic>, and <jats:italic>time<\/jats:italic>. We used different combinations of our datasets to train and evaluate our three methods. Our best F1-scores were 84.23, 68.19, and 81.67 points, respectively, for evaluations of the MedNLP dataset, a dummy EHR dataset that was virtually written by a medical doctor, and a Pathology Report dataset. Our LSTM-based method was the best performing, except for the MedNLP dataset. The rule-based method was best for the MedNLP dataset. The LSTM-based method achieved a good score of 83.07 points for this MedNLP dataset, which differs by 1.16 points from the best score obtained using the rule-based method. Results suggest that LSTM adapted well to different characteristics of our datasets. Our LSTM-based method performed better than our CRF-based method, with an F1-score 7.41 points higher, when applied to our Pathology Report dataset. This report is the first study to apply this LSTM-based method to a de-identification task for Japanese EHRs.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Conclusions<\/jats:title>\n<jats:p>Our LSTM-based machine learning method was able to extract named entities to be de-identified with better performance, in general, than that of our rule-based methods. However, machine learning methods are inadequate for processing expressions with low occurrence. Our future work will specifically examine the combination of LSTM and rule-based methods to achieve better performance.<\/jats:p>\n<jats:p>Our currently achieved level of performance is sufficiently higher than that of publicly available Japanese de-identification tools. 
Therefore, our system will be applied to actual de-identification tasks in hospitals.<\/jats:p>\n<\/jats:sec>","DOI":"10.1186\/s13326-020-00227-9","type":"journal-article","created":{"date-parts":[[2020,9,21]],"date-time":"2020-09-21T11:02:53Z","timestamp":1600686173000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["De-identifying free text of Japanese electronic health records"],"prefix":"10.1186","volume":"11","author":[{"given":"Kohei","family":"Kajiyama","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hiromasa","family":"Horiguchi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Takashi","family":"Okumura","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mizuki","family":"Morita","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7864-842X","authenticated-orcid":false,"given":"Yoshinobu","family":"Kano","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2020,9,21]]},"reference":[{"key":"227_CR1","unstructured":"Act on the Protection of Personal Information. Japan, 2003.."},{"key":"227_CR2","volume-title":"Health insurance portability and accountability act of 1996 (HIPAA)","author":"R Mullner","year":"1996","unstructured":"Mullner R, Rafalski EM. Health insurance portability and accountability act of 1996 (HIPAA). U.S.: Public Law; 1996."},{"key":"227_CR3","unstructured":"Act on Anonymously Processed Medical Information to Contribute to Medical Research and Development. 
Japan, 2017."},{"issue":"Suppl","key":"227_CR4","doi-asserted-by":"publisher","first-page":"S11","DOI":"10.1016\/j.jbi.2015.06.007","volume":"58","author":"A Stubbs","year":"2015","unstructured":"Stubbs A, Kotfila C, Uzuner \u00d6. Automated systems for the de-identification of longitudinal clinical narratives: overview of 2014 i2b2\/UTHealth shared task track 1. J Biomed Inform. 2015;58(Suppl):S11\u20139.","journal-title":"J Biomed Inform"},{"key":"227_CR5","first-page":"476","volume":"192","author":"C Grouin","year":"2013","unstructured":"Grouin C, Zweigenbaum P. Automatic De-identification of French clinical records: comparison of rule-based and machine-learning approaches. Stud Health Technol Inform. 2013;192:476\u201380.","journal-title":"Stud Health Technol Inform"},{"key":"227_CR6","doi-asserted-by":"publisher","first-page":"151","DOI":"10.1016\/j.jbi.2013.12.014","volume":"50","author":"C Grouin","year":"2014","unstructured":"Grouin C, N\u00e9v\u00e9ol A. De-identification of clinical notes in French: towards a protocol for reference corpus development. J Biomed Inform. 2014;50:151\u201361.","journal-title":"J Biomed Inform"},{"key":"227_CR7","first-page":"1","volume-title":"Proceedings of the 14th International Symposium Health Informatics Management Research","author":"H Dalianis","year":"2009","unstructured":"Dalianis H, Hassel M, Velupillai S. The Stockholm EPR corpus \u2013 Characteristics and some initial findings. In: Proceedings of the 14th International Symposium Health Informatics Management Research; 2009. p. 1\u20137."},{"issue":"6","key":"227_CR8","first-page":"1","volume":"1","author":"H Dalianis","year":"2010","unstructured":"Dalianis H, Velupillai S. De-identifying Swedish clinical text \u2013 refinement of a gold standard and experiments with conditional random fields. J Biomed Sem. 
2010;1(6):1\u20136.","journal-title":"J Biomed Sem"},{"key":"227_CR9","doi-asserted-by":"publisher","first-page":"76","DOI":"10.1016\/j.jbi.2017.07.017","volume":"73","author":"Z Jian","year":"2017","unstructured":"Jian Z, Guo X, Liu S, Ma H, Zhang S, Zhang R, Lei J. A cascaded approach for Chinese clinical text de-identification with less annotation effort. J Biomed Inform. 2017;73:76\u201383.","journal-title":"J Biomed Inform"},{"key":"227_CR10","doi-asserted-by":"publisher","first-page":"24","DOI":"10.1016\/j.ijmedinf.2018.05.010","volume":"116","author":"L Du","year":"2018","unstructured":"Du L, Xia C, Deng Z, Lu G, Xia S, Ma J. A machine learning based approach to identify protected health information in Chinese clinical text. Int J Med Inform. 2018;116:24\u201332.","journal-title":"Int J Med Inform"},{"key":"227_CR11","first-page":"696","volume-title":"Proceedings of the NTCIR-10 conference","author":"M Morita","year":"2013","unstructured":"Morita M, Kano Y, Ohkuma T, Miyabe M, Aramaki E. Overview of the NTCIR-10 MedNLP Task. In: Proceedings of the NTCIR-10 conference; 2013. p. 696\u2013701."},{"key":"227_CR12","first-page":"147","volume-title":"Proceedings of the NTCIR-11 conference","author":"E Aramaki","year":"2014","unstructured":"Aramaki E, Morita M, Kano Y, Ohkuma T. Overview of the NTCIR-11 MedNLP-2 Task. In: Proceedings of the NTCIR-11 conference; 2014. p. 147\u201354."},{"issue":"3","key":"227_CR13","first-page":"273","volume":"20","author":"C Cortes","year":"1995","unstructured":"Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273\u201397.","journal-title":"Mach Learn"},{"key":"227_CR14","first-page":"282","volume-title":"Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001)","author":"J Lafferty","year":"2001","unstructured":"Lafferty J, McCallum A, Pereira F. Conditional random fields : Probabilistic models for segmenting and labeling sequence data. 
In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001); 2001. p. 282\u20139."},{"key":"227_CR15","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735\u201380.","journal-title":"Neural Comput"},{"key":"227_CR16","first-page":"260","volume-title":"Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2016)","author":"G Lample","year":"2016","unstructured":"Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural Architectures for Named Entity Recognition. In: Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2016); 2016. p. 260\u201370."},{"key":"227_CR17","first-page":"1","volume-title":"Proceedings of the Sixth Conference on Natural Language Learning (CoNLL 2002)","author":"E Sang","year":"2002","unstructured":"Sang E. Introduction to the CoNLL-2002 Shared Task: Language-independent Named Entity Recognition. In: Proceedings of the Sixth Conference on Natural Language Learning (CoNLL 2002); 2002. p. 1\u20134."},{"key":"227_CR18","first-page":"142","volume-title":"Proceedings of the Seventh Conference on Natural Language Learning (HLT-NAACL 2003)","author":"E Sang","year":"2016","unstructured":"Sang E, Fen M, Hovy E. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In: Proceedings of the Seventh Conference on Natural Language Learning (HLT-NAACL 2003); 2016. p. 
142\u20137."},{"key":"227_CR19","first-page":"97","volume-title":"Proceedings of the First Workshop on Subword and Character Level Models in NLP (SCLeM 2017), 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)","author":"S Misawa","year":"2017","unstructured":"Misawa S, Taniguchi M, Miura Y, Ohkuma T. Character-based Bidirectional LSTM-CRF with words and characters for Japanese Named Entity Recognition. In: Proceedings of the First Workshop on Subword and Character Level Models in NLP (SCLeM 2017), 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017); 2017. p. 97\u2013102."},{"key":"227_CR20","doi-asserted-by":"crossref","unstructured":"Kajiyama K, Horiguchi H, Okumura T, Morita M, Kano Y. De-identifying Free Text of Japanese Dummy Electronic Health Records. In: Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis (LOUHI 2018), 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018). 2018. p. 65\u201370.","DOI":"10.18653\/v1\/W18-5608"},{"key":"227_CR21","first-page":"859","volume-title":"Proceedings of the American Medical Informatics Association (AMIA) Annual Symposium","author":"K Hatano","year":"2003","unstructured":"Hatano K, Ohe K. Information retrieval system for Japanese Standard Disease-code Master Using XML Web Service. In: Proceedings of the American Medical Informatics Association (AMIA) Annual Symposium; 2003. p. 859."},{"key":"227_CR22","first-page":"38","volume-title":"Proceedings of the First Workshop on Natural Language Processing for Medical and Healthcare Fields, The Sixth International Joint Conference on Natural Language Processing (IJCNLP 2013)","author":"O Imaichi","year":"2013","unstructured":"Imaichi O, Yanase T, Niwa Y. A Comparison of Rule-Based and Machine Learning Methods for Medical Information Extraction. 
In: Proceedings of the First Workshop on Natural Language Processing for Medical and Healthcare Fields, The Sixth International Joint Conference on Natural Language Processing (IJCNLP 2013); 2013. p. 38\u201342."},{"key":"227_CR23","first-page":"1","volume-title":"Proceedings of the Advances in Neural Information Processing Systems 26 (NIPS 2013)","author":"T Mikolov","year":"2013","unstructured":"Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed Representations of Words and Phrases and their Compositionality. In: Proceedings of the Advances in Neural Information Processing Systems 26 (NIPS 2013); 2013. p. 1\u20139."},{"key":"227_CR24","first-page":"173","volume-title":"Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL 1999)","author":"E Sang","year":"1999","unstructured":"Sang E, Veenstra J. Representing text chunks. In: Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL 1999); 1999. p. 173\u20139."},{"issue":"3","key":"227_CR25","doi-asserted-by":"crossref","first-page":"596","DOI":"10.1093\/jamia\/ocw156","volume":"24","author":"F Dernoncourt","year":"2017","unstructured":"Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. J Amer Med Info Assoc. 2017;24(3):596\u2013606.","journal-title":"J Amer Med Info Assoc"},{"key":"227_CR26","doi-asserted-by":"publisher","first-page":"160035","DOI":"10.1038\/sdata.2016.35","volume":"3","author":"A Johnson","year":"2016","unstructured":"Johnson A, Pollard T, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi L, Mark R. MIMIC-III, a freely accessible critical care database. Sci Data. 
2016;3:160035.","journal-title":"Sci Data"}],"container-title":["Journal of Biomedical Semantics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13326-020-00227-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13326-020-00227-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13326-020-00227-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,20]],"date-time":"2021-09-20T23:45:43Z","timestamp":1632181543000},"score":1,"resource":{"primary":{"URL":"https:\/\/jbiomedsem.biomedcentral.com\/articles\/10.1186\/s13326-020-00227-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,9,21]]},"references-count":26,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["227"],"URL":"https:\/\/doi.org\/10.1186\/s13326-020-00227-9","relation":{},"ISSN":["2041-1480"],"issn-type":[{"value":"2041-1480","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,9,21]]},"assertion":[{"value":"13 May 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 August 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 September 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The Pathology Reports dataset was used under approval by the ethics committee and the research committee of the Japanese Society of Pathology under a research grant from the Japan Agency for Medical Research and Development (AMED), \u201cJapan Pathology AI 
Diagnostics Project (JP-AID)\u201d.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"All the authors have agreed to publication of this manuscript.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"N\/A","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"11"}}