{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,17]],"date-time":"2026-06-17T22:53:33Z","timestamp":1781736813848,"version":"3.54.5"},"reference-count":32,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,5,14]],"date-time":"2020-05-14T00:00:00Z","timestamp":1589414400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,5,14]],"date-time":"2020-05-14T00:00:00Z","timestamp":1589414400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>A serious obstacle to the development of Natural Language Processing (NLP) methods in the clinical domain is the accessibility of textual data. The mental health domain is particularly challenging, partly because clinical documentation relies heavily on free text that is difficult to de-identify completely. This problem could be tackled by using artificial medical data. In this work, we present an approach to generate artificial clinical documents. We apply this approach to discharge summaries from a large mental healthcare provider and discharge summaries from an intensive care unit. We perform an extensive intrinsic evaluation where we (1) apply several measures of text preservation; (2) measure how much the model memorises training data; and (3) estimate clinical validity of the generated text based on a human evaluation task. Furthermore, we perform an extrinsic evaluation by studying the impact of using artificial text in a downstream NLP text classification task. We found that using this artificial data as training data can lead to classification results that are comparable to the original results. Additionally, using only a small amount of information from the original data to condition the generation of the artificial data is successful, which holds promise for reducing the risk of these artificial data retaining rare information from the original data. This is an important finding for our long-term goal of being able to generate artificial clinical data that can be released to the wider research community and accelerate advances in developing computational methods that use healthcare data.<\/jats:p>","DOI":"10.1038\/s41746-020-0267-x","type":"journal-article","created":{"date-parts":[[2020,5,14]],"date-time":"2020-05-14T10:03:18Z","timestamp":1589450598000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":63,"title":["Generation and evaluation of artificial mental health records for Natural Language Processing"],"prefix":"10.1038","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3931-3392","authenticated-orcid":false,"given":"Julia","family":"Ive","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2205-2322","authenticated-orcid":false,"given":"Natalia","family":"Viani","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Joyce","family":"Kam","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0814-5197","authenticated-orcid":false,"given":"Lucia","family":"Yin","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4756-8675","authenticated-orcid":false,"given":"Somain","family":"Verma","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4397-2435","authenticated-orcid":false,"given":"Stephen","family":"Puntis","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8751-5167","authenticated-orcid":false,"given":"Rudolf N.","family":"Cardinal","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4570-9801","authenticated-orcid":false,"given":"Angus","family":"Roberts","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4435-6397","authenticated-orcid":false,"given":"Robert","family":"Stewart","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4178-2980","authenticated-orcid":false,"given":"Sumithra","family":"Velupillai","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2020,5,14]]},"reference":[{"key":"267_CR1","doi-asserted-by":"publisher","first-page":"540","DOI":"10.1136\/amiajnl-2011-000465","volume":"18","author":"WW Chapman","year":"2011","unstructured":"Chapman, W. W. et al. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. J. Am. Med. Info. Assoc. 18, 540\u2013543 (2011).","journal-title":"J. Am. Med. Info. Assoc."},{"key":"267_CR2","first-page":"4826","volume":"29","author":"P Bachman","year":"2016","unstructured":"Bachman, P. An architecture for deep, hierarchical generative models. Adv. Neural Inf. Process. Syst. 29, 4826\u20134834 (2016).","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"267_CR3","unstructured":"Gulrajani, I. et al. PixelVAE: A latent variable model for natural images. In Proceedings of International Conference on Learning Representations (ICLR) (2016)."},{"key":"267_CR4","doi-asserted-by":"crossref","unstructured":"Jing, B., Xie, P. & Xing, E. On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2577\u20132586 (2018).","DOI":"10.18653\/v1\/P18-1240"},{"key":"267_CR5","unstructured":"Liu, P. J. Learning to write notes in electronic health records. Preprint at CoRR https:\/\/arxiv.org\/abs\/1808.02622 (2018)."},{"key":"267_CR6","doi-asserted-by":"publisher","DOI":"10.1038\/s41746-018-0070-0","volume":"1","author":"SH Lee","year":"2018","unstructured":"Lee, S. H. Natural language generation for electronic health records. npj Digital Med. 1, 63 (2018).","journal-title":"npj Digital Med."},{"key":"267_CR7","doi-asserted-by":"publisher","DOI":"10.1038\/sdata.2016.35","volume":"3","author":"AEW Johnson","year":"2016","unstructured":"Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).","journal-title":"Sci. Data"},{"key":"267_CR8","unstructured":"Johnson, A. & Pollard, T. The MIMIC-III clinical database. PhysioNet (2016)."},{"key":"267_CR9","doi-asserted-by":"publisher","unstructured":"Perera, G. et al. Cohort profile of the South London and Maudsley NHS Foundation Trust Biomedical Research Centre (SLaM BRC) Case Register: current status and recent enhancement of an Electronic Mental Health Record-derived data resource. BMJ Open https:\/\/doi.org\/10.1136\/bmjopen-2015-008721 (2016).","DOI":"10.1136\/bmjopen-2015-008721"},{"key":"267_CR10","doi-asserted-by":"publisher","first-page":"71","DOI":"10.1186\/1472-6947-13-71","volume":"13","author":"AC Fernandes","year":"2013","unstructured":"Fernandes, A. C. et al. Development and evaluation of a de-identification procedure for a case register sourced from mental health electronic records. BMC Med. Inform. Decis. Mak. 13, 71\u201371 (2013).","journal-title":"BMC Med. Inform. Decis. Mak."},{"key":"267_CR11","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0192360","volume":"13","author":"S Gehrmann","year":"2018","unstructured":"Gehrmann, S. et al. Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives. PLoS ONE 13, e0192360 (2018).","journal-title":"PLoS ONE"},{"key":"267_CR12","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1186\/1472-6947-8-32","volume":"8","author":"I Neamatullah","year":"2008","unstructured":"Neamatullah, I. et al. Automated de-identification of free-text medical records. BMC Med. Inform. Decis. Mak. 8, 32\u201332 (2008).","journal-title":"BMC Med. Inform. Decis. Mak."},{"key":"267_CR13","unstructured":"Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of ACL workshop on Text Summarization Branches Out (2004)."},{"key":"267_CR14","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, 311\u2013318 (2002).","DOI":"10.3115\/1073083.1073135"},{"key":"267_CR15","unstructured":"Snover, M., Dorr, B., Schwartz, R., Micciulla, L. & Makhoul, J. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, 223\u2013231 (2006)."},{"key":"267_CR16","unstructured":"Carlini, N., Liu, C., Kos, J., Erlingsson, \u00da. & Song, D. The Secret Sharer: Measuring unintended neural network memorization & extracting secrets. In Proceedings of the 28th USENIX Security Symposium, 267\u2013284 (2019)."},{"key":"267_CR17","unstructured":"Blei, D., Ng, A. & Jordan, M. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993\u20131022 (2003)."},{"key":"267_CR18","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","volume":"45","author":"L Breiman","year":"2001","unstructured":"Breiman, L. Random forests. Machine Learning 45, 5\u201332 (2001).","journal-title":"Machine Learning"},{"key":"267_CR19","doi-asserted-by":"publisher","unstructured":"Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1746\u20131751, https:\/\/doi.org\/10.3115\/v1\/D14-1181 (2014).","DOI":"10.3115\/v1\/D14-1181"},{"key":"267_CR20","unstructured":"Berg-Kirkpatrick, T., Burkett, D. & Klein, D. An empirical investigation of statistical significance in NLP. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 995\u20131005 (2012)."},{"key":"267_CR21","doi-asserted-by":"publisher","first-page":"148","DOI":"10.1002\/asi.23363","volume":"67","author":"D S\u00e1nchez","year":"2016","unstructured":"S\u00e1nchez, D. & Batet, M. C-sanitized: a privacy model for document redaction and sanitization. J. Assoc. Inform. Sci. Technol. 67, 148\u2013163 (2016).","journal-title":"J. Assoc. Inform. Sci. Technol."},{"key":"267_CR22","unstructured":"Anandan, B. et al. t-Plausibility: Generalizing words to desensitize text. Trans. Data Privacy 5, 505\u2013534 (2012)."},{"key":"267_CR23","first-page":"5998","volume":"30","author":"A Vaswani","year":"2017","unstructured":"Vaswani, A. et al. Attention is all you need. Adv. Neural Inform. Process. Syst. 30, 5998\u20136008 (2017).","journal-title":"Adv. Neural Inform. Process. Syst."},{"key":"267_CR24","first-page":"3104","volume":"27","author":"I Sutskever","year":"2014","unstructured":"Sutskever, I., Vinyals, O. & Le, Q. V. V. Sequence to sequence learning with neural networks. Adv. Neural Inform. Process. Syst. 27, 3104\u20133112 (2014).","journal-title":"Adv. Neural Inform. Process. Syst."},{"key":"267_CR25","unstructured":"Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of International Conference on Learning Representations (ICLR) (2015)."},{"key":"267_CR26","doi-asserted-by":"crossref","unstructured":"Peng, N., Ghazvininejad, M., May, J. & Knight, K. Towards controllable story generation. In Proceedings of the First Workshop on Storytelling, 43\u201349 (2018).","DOI":"10.18653\/v1\/W18-1505"},{"key":"267_CR27","doi-asserted-by":"publisher","unstructured":"Rose, S., Engel, D., Cramer, N. & Cowley, W. Automatic keyword extraction from individual documents. In Berry, M. & Kogan, J. (eds) Text Mining: Applications and Theory, 1\u201320, https:\/\/doi.org\/10.1002\/9780470689646.ch1 (2010).","DOI":"10.1002\/9780470689646.ch1"},{"key":"267_CR28","doi-asserted-by":"crossref","unstructured":"Klein, G., Kim, Y., Deng, Y., Senellart, J. & Rush, A. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, 67\u201372 (2017).","DOI":"10.18653\/v1\/P17-4012"},{"key":"267_CR29","unstructured":"\u0158eh\u016f\u0159ek, R. & Sojka, P. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45\u201350 (2010)."},{"key":"267_CR30","doi-asserted-by":"crossref","unstructured":"Bird, S. & Loper, E. NLTK: the natural language toolkit. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Interactive poster and demonstration sessions (ACL), 214\u2013217 (2004).","DOI":"10.3115\/1219044.1219075"},{"key":"267_CR31","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825\u20132830 (2011).","journal-title":"J. Mach. Learn. Res."},{"key":"267_CR32","first-page":"3111","volume":"26","author":"T Mikolov","year":"2013","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inform. Process. Syst. 26, 3111\u20133119 (2013).","journal-title":"Adv. Neural Inform. Process. Syst."}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-020-0267-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-020-0267-x","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-020-0267-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,7]],"date-time":"2022-12-07T02:14:29Z","timestamp":1670379269000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-020-0267-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,5,14]]},"references-count":32,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2020,12]]}},"alternative-id":["267"],"URL":"https:\/\/doi.org\/10.1038\/s41746-020-0267-x","relation":{},"ISSN":["2398-6352"],"issn-type":[{"value":"2398-6352","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,5,14]]},"assertion":[{"value":"19 December 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 March 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 May 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"69"}}