{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T20:04:28Z","timestamp":1770062668694,"version":"3.49.0"},"reference-count":24,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,2,6]],"date-time":"2025-02-06T00:00:00Z","timestamp":1738800000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Artif. Intell."],"abstract":"<jats:p>Automatic identification of metastatic sites in cancer patients from electronic health records is a challenging yet crucial task with significant implications for diagnosis and treatment. In this study, we demonstrate how advancements in natural language processing, namely the instruction-following capability of recent large language models and extensive model pretraining, made it possible to automate metastases detection from radiology reports texts with a limited amount of gold-labeled data. Specifically, we prompt Llama3, an open-source instruction-tuned large language model, to generate synthetic training data to expand our limited labeled data and adapt BERT, a small pretrained language model, to the task. We further investigate three targeted data augmentation techniques which selectively expand the original training samples, leading to comparable or superior performance compared to vanilla data augmentation, in most cases, while being substantially more computationally efficient. In our experiments, data augmentation improved the average F1-score by 2.3, 3.5, and 3.9 points for lung, liver, and adrenal glands, the organs for which we had access to expert-annotated data. 
This observation suggests that Llama3, which has not been specifically tailored to this task or clinical data in general, can generate high-quality synthetic data through paraphrasing in the clinical context. We also compare metastasis identification accuracy between models utilizing institutionally standardized reports vs. non-structured reports, which complicate the extraction of relevant information, and show how including patient history with a customized model architecture narrows the gap between those two setups from 7.3 to 4.5 points on F1-score under LoRA tuning. Our work delivers a broadly applicable solution with remarkable performance that does not require model customization for each institution, making large-scale, low-cost spatio-temporal cancer progression pattern extraction possible.<\/jats:p>","DOI":"10.3389\/frai.2025.1513674","type":"journal-article","created":{"date-parts":[[2025,2,6]],"date-time":"2025-02-06T07:12:24Z","timestamp":1738825944000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Targeted generative data augmentation for automatic metastases detection from free-text radiology reports"],"prefix":"10.3389","volume":"8","author":[{"given":"Maede","family":"Ashofteh Barabadi","sequence":"first","affiliation":[]},{"given":"Xiaodan","family":"Zhu","sequence":"additional","affiliation":[]},{"given":"Wai Yip","family":"Chan","sequence":"additional","affiliation":[]},{"given":"Amber L.","family":"Simpson","sequence":"additional","affiliation":[]},{"given":"Richard K. 
G.","family":"Do","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2025,2,6]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2303.08774","article-title":"Gpt-4 technical report","author":"Achiam","year":"2023","journal-title":"arXiv"},{"key":"B2","doi-asserted-by":"publisher","first-page":"7383","DOI":"10.1609\/aaai.v34i05.6233","volume":"34","author":"Anaby-Tavor","year":"2020"},{"key":"B3","doi-asserted-by":"crossref","DOI":"10.1109\/CCECE59415.2024.10667245","article-title":"\u201cAdapting large language models for automatic annotation of radiology reports for metastases detection,\u201d","volume-title":"IEEE Canadian Conference on Electrical and Computer Engineering (CCECE)","author":"Barabadi","year":"2024"},{"key":"B4","doi-asserted-by":"publisher","first-page":"826402","DOI":"10.3389\/frai.2022.826402","article-title":"Developing a cancer digital twin: supervised metastases detection from consecutive structured radiology reports","volume":"5","author":"Batch","year":"2022","journal-title":"Front. Artif. Intell"},{"key":"B5","doi-asserted-by":"publisher","first-page":"191","DOI":"10.1162\/tacl_a_00542","article-title":"An empirical survey of data augmentation for limited data learning in NLP","volume":"11","author":"Chen","year":"2023","journal-title":"Trans. Assoc. Comput. 
Linguist"},{"key":"B6","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2302.13007","article-title":"Auggpt: leveraging chatgpt for text data augmentation","author":"Dai","year":"2023","journal-title":"arXiv"},{"key":"B7","doi-asserted-by":"publisher","first-page":"5574","DOI":"10.1002\/cam4.2474","article-title":"Are 90% of deaths from cancer caused by metastases?","volume":"8","author":"Dillek\u00e5s","year":"2019","journal-title":"Cancer Med"},{"key":"B8","doi-asserted-by":"publisher","first-page":"210043","DOI":"10.1148\/radiol.2021210043","article-title":"Patterns of metastatic disease in patients with cancer derived from natural language processing of structured ct radiology reports over a 10-year period","volume":"301","author":"Do","year":"2021","journal-title":"Radiology"},{"key":"B9","unstructured":"Dubey A., Jauhri A., Pandey A., Kadian A., Al-Dahle A., Letman A. (2024). The llama 3 herd of models. arXiv [Preprint]."},{"key":"B10","unstructured":"Guo Z., Wang P., Wang Y., Yu S. (2023). Improving small language models on PubMedQA via generative data augmentation. arXiv [Preprint]."},
{"key":"B11","unstructured":"Hu E. J., Shen Y., Wallis P., Allen-Zhu Z., Li Y., Wang S. (2021). LoRA: low-rank adaptation of large language models. arXiv [Preprint]."},{"key":"B12","doi-asserted-by":"crossref","first-page":"7871","DOI":"10.18653\/v1\/2020.acl-main.703","article-title":"\u201cBART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,\u201d","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Lewis","year":"2020"},{"key":"B13","doi-asserted-by":"crossref","first-page":"1463","DOI":"10.18653\/v1\/2023.eacl-main.107","article-title":"\u201cSelective in-context data augmentation for intent detection using pointwise V-information,\u201d","volume-title":"Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics","author":"Lin","year":"2023"},{"key":"B14","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2110.07602","article-title":"P-tuning v2: prompt tuning can be comparable to fine-tuning universally across scales and tasks","author":"Liu","year":"2021","journal-title":"arXiv"},{"key":"B15","first-page":"179","article-title":"\u201cThe parrot dilemma: human-labeled vs. 
LLM-augmented data in classification tasks,\u201d","author":"M\u00f8ller","year":"2024","journal-title":"Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)"},{"key":"B16","first-page":"311","article-title":"\u201cBleu: a method for automatic evaluation of machine translation,\u201d","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni","year":"2002"},{"key":"B17","doi-asserted-by":"crossref","first-page":"15606","DOI":"10.18653\/v1\/2023.findings-emnlp.1044","article-title":"\u201cIs ChatGPT the ultimate data augmentation algorithm?\u201d","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2023","author":"Piedboeuf","year":"2023"},{"key":"B18","doi-asserted-by":"crossref","first-page":"5316","DOI":"10.18653\/v1\/2023.emnlp-main.323","article-title":"\u201cPromptmix: a class boundary augmentation method for large language model distillation,\u201d","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Sahu","year":"2023"},{"key":"B19","doi-asserted-by":"publisher","first-page":"2293","DOI":"10.1056\/NEJMsb1609216","article-title":"Real-world evidence \u2013 what is it and what can it tell us?","volume":"375","author":"Sherman","year":"2016","journal-title":"N. Engl. J. 
Med"},{"key":"B20","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1903.09244","article-title":"Low resource text classification with ulmfit and backtranslation","author":"Shleifer","year":"2019","journal-title":"arXiv"},{"key":"B21","doi-asserted-by":"crossref","first-page":"6382","DOI":"10.18653\/v1\/D19-1670","article-title":"\u201cEDA: easy data augmentation techniques for boosting performance on text classification tasks,\u201d","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Wei","year":"2019"},{"key":"B22","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1007\/978-3-030-22747-0_7","article-title":"\u201cConditional Bert contextual augmentation,\u201d","volume-title":"Computational Science-ICCS 2019: 19th International Conference, Faro, Portugal, June 12-14, 2019, Proceedings, Part IV 19","author":"Wu","year":"2019"},{"key":"B23","doi-asserted-by":"publisher","first-page":"1922","DOI":"10.18653\/v1\/2023.findings-eacl.144","article-title":"\u201cData augmentation for radiology report simplification,\u201d","author":"Yang","year":"2023","journal-title":"Findings of the Association for Computational Linguistics: EACL 2023"},{"key":"B24","first-page":"1097","article-title":"\u201cTexygen: a benchmarking platform for text generation models,\u201d","volume-title":"The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18","author":"Zhu","year":"2018"}],"container-title":["Frontiers in Artificial 
Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1513674\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,6]],"date-time":"2025-02-06T07:12:35Z","timestamp":1738825955000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1513674\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,6]]},"references-count":24,"alternative-id":["10.3389\/frai.2025.1513674"],"URL":"https:\/\/doi.org\/10.3389\/frai.2025.1513674","relation":{},"ISSN":["2624-8212"],"issn-type":[{"value":"2624-8212","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,6]]},"article-number":"1513674"}}