{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,25]],"date-time":"2026-06-25T20:47:19Z","timestamp":1782420439822,"version":"3.54.5"},"reference-count":78,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T00:00:00Z","timestamp":1769990400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T00:00:00Z","timestamp":1769990400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"National Science Foundation","award":["IIS-2245920"],"award-info":[{"award-number":["IIS-2245920"]}]},{"DOI":"10.13039\/100006086","name":"NIH Office of Strategic Coordination","doi-asserted-by":"crossref","award":["R03OD038389"],"award-info":[{"award-number":["R03OD038389"]}],"id":[{"id":"10.13039\/100006086","id-type":"DOI","asserted-by":"crossref"}]},{"name":"National Cancer Institute","award":["R01CA258193"],"award-info":[{"award-number":["R01CA258193"]}]},{"name":"Agency for Healthcare Research and Quality Award","award":["R21HS029969"],"award-info":[{"award-number":["R21HS029969"]}]},{"DOI":"10.13039\/100000092","name":"National Library of Medicine","doi-asserted-by":"crossref","award":["R21LM013911"],"award-info":[{"award-number":["R21LM013911"]}],"id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000025","name":"National Institute of Mental Health","doi-asserted-by":"crossref","award":["R21MH137736"],"award-info":[{"award-number":["R21MH137736"]}],"id":[{"id":"10.13039\/100000025","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000049","name":"National Institute on 
Aging","doi-asserted-by":"publisher","award":["R01AG064529"],"award-info":[{"award-number":["R01AG064529"]}],"id":[{"id":"10.13039\/100000049","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Healthc Inform Res"],"published-print":{"date-parts":[[2026,6]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Synthetic data generation using large language models (LLMs) demonstrates substantial promise in addressing biomedical data challenges and shows increasing adoption in biomedical research. This study systematically reviews recent advances in synthetic data generation for biomedical applications and clinical research, focusing on how LLMs address data scarcity, utility, and quality issues with different modalities. We conducted a scoping review following PRISMA-ScR guidelines and searched literature published between 2020 and 2025 through PubMed, ACM, Web of Science, and Google Scholar. A total of 59 studies were included based on relevance to synthetic data generation in biomedical contexts. Among the reviewed studies, the predominant data modalities were unstructured texts (78.0%), tabular data (13.6%), and multimodal sources (8.4%). Common generation methods included LLM prompting (74.6%), fine-tuning (20.3%), and specialized models (5.1%). Evaluations were heterogeneous: intrinsic metrics (27.1%), human-in-the-loop assessments (44.1%), and LLM-based evaluations (13.6%). However, limitations and key barriers persist in data modalities, domain utility, resource and model accessibility, and standardized evaluation protocols. 
Future efforts may focus on developing standardized, transparent evaluation frameworks and expanding accessibility to support effective applications in biomedical research.<\/jats:p>","DOI":"10.1007\/s41666-026-00229-9","type":"journal-article","created":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T06:26:27Z","timestamp":1770013587000},"page":"367-392","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives"],"prefix":"10.1007","volume":"10","author":[{"given":"Hanshu","family":"Rao","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Weisi","family":"Liu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Haohan","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"I-Chan","family":"Huang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhe","family":"He","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xiaolei","family":"Huang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2026,2,2]]},"reference":[{"key":"229_CR1","unstructured":"Liu W, He Z, Huang X (2024) Time matters: Examine temporal effects on biomedical language models, pp. 723\u2013732. American Medical Informatics Association, San Francisco, CA, USA (2024). http:\/\/www.ncbi.nlm.nih.gov\/pubmed\/40417490"},{"key":"229_CR2","doi-asserted-by":"publisher","unstructured":"Jones P, Liu W, Huang I-C, Huang X (2025) Examining imbalance effects on performance and demographic fairness of clinical language models. 
In: 2025 IEEE 13th International Conference on Healthcare Informatics (ICHI), pp 58\u201368. https:\/\/doi.org\/10.1109\/ICHI64645.2025.00016","DOI":"10.1109\/ICHI64645.2025.00016"},{"key":"229_CR3","doi-asserted-by":"publisher","unstructured":"Stade EC, Stirman SW, Ungar LH, Boland CL, Schwartz HA, Yaden DB, Sedoc J, DeRubeis RJ, Willer R, Eichstaedt JC (2024) Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. npj Mental Health Res. 3(1):12. https:\/\/doi.org\/10.1038\/s44184-024-00056-z","DOI":"10.1038\/s44184-024-00056-z"},{"issue":"6","key":"229_CR4","doi-asserted-by":"publisher","first-page":"891","DOI":"10.3390\/ph16060891","volume":"16","author":"A Blanco-Gonzalez","year":"2023","unstructured":"Blanco-Gonzalez A, Cabezon A, Seco-Gonzalez A, Conde-Torres D, Antelo-Riveiro P, Pineiro A, Garcia-Fandino R (2023) The role of ai in drug discovery: challenges, opportunities, and strategies. Pharmaceuticals 16(6):891. https:\/\/doi.org\/10.3390\/ph16060891","journal-title":"Pharmaceuticals"},{"key":"229_CR5","doi-asserted-by":"publisher","unstructured":"Chen BY, Antaki F, Gonzalez M, Uchino K, Albahra S, Robertson S, Ibrikji S, Aube E, Russman A, Hussain MS (2025) Automated identification of stroke thrombolysis contraindications from synthetic clinical notes: A proof-of-concept study. Cerebrovasc Diseases Extra. 15:130\u2013136. https:\/\/doi.org\/10.1159\/000545317","DOI":"10.1159\/000545317"},{"key":"229_CR6","doi-asserted-by":"publisher","DOI":"10.1109\/JBHI.2024.3435085","author":"Y Wu","year":"2024","unstructured":"Wu Y, Mao K, Zhang Y, Chen J (2024) Callm: Enhancing clinical interview analysis through data augmentation with large language models. IEEE J Biomed Health Inform. 
https:\/\/doi.org\/10.1109\/JBHI.2024.3435085","journal-title":"IEEE J Biomed Health Inform"},{"key":"229_CR7","doi-asserted-by":"publisher","unstructured":"Liu J, Koopman B, Brown NJ, Chu K, Nguyen A (2025) Generating synthetic clinical text with local large language models to identify misdiagnosed limb fractures in radiology reports. Artif Intell Med. 159 https:\/\/doi.org\/10.1016\/j.artmed.2024.103027","DOI":"10.1016\/j.artmed.2024.103027"},{"key":"229_CR8","unstructured":"Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, et al (2023) GPT-4 Technical Report. Technical report, OpenAI. arXiv:2303.08774"},{"key":"229_CR9","unstructured":"Dubey A, Grattafiori A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, Touvron H, et al (2024) The llama 3 herd of models. Technical report, Meta AI. arXiv:2407.21783"},{"key":"229_CR10","doi-asserted-by":"publisher","unstructured":"Han G, Liu W, Huang X, Borsari B (2024) Chain-of-interaction: Enhancing large language models for psychiatric behavior understanding by dyadic contexts. In: Proceedings of the 12th IEEE International Conference on Healthcare Informatics (ICHI), pp 392\u2013401. IEEE, Orlando, FL, USA. https:\/\/doi.org\/10.1109\/ICHI61247.2024.00057","DOI":"10.1109\/ICHI61247.2024.00057"},{"key":"229_CR11","doi-asserted-by":"publisher","unstructured":"Ghanadian H, Nejadgholi I, Osman HA (2024) Socially aware synthetic data generation for suicidal ideation detection using large language models. IEEE Access. 12:14350\u201314363. https:\/\/doi.org\/10.1109\/ACCESS.2024.3358206","DOI":"10.1109\/ACCESS.2024.3358206"},{"key":"229_CR12","doi-asserted-by":"publisher","unstructured":"Bucur A-M (2024) Leveraging llm-generated data for detecting depression symptoms on social media. 
In: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 15th International Conference of the CLEF Association, CLEF 2024, Grenoble, France, September 9\u201312, 2024, Proceedings, Part I, pp 193\u2013204. Springer, Berlin, Heidelberg. https:\/\/doi.org\/10.1007\/978-3-031-71736-9_14","DOI":"10.1007\/978-3-031-71736-9_14"},{"key":"229_CR13","doi-asserted-by":"publisher","unstructured":"Sarkar AR, Chuang Y-S, Mohammed N, Jiang X (2024) De-identification is not enough: a comparison between de-identified and synthetic clinical notes. Sci Rep 14(1):29669. https:\/\/doi.org\/10.1038\/s41598-024-81170-y","DOI":"10.1038\/s41598-024-81170-y"},{"key":"229_CR14","doi-asserted-by":"publisher","unstructured":"Xu R, Cui H, Yu Y, Kan X, Shi W, Zhuang Y, Wang MD, Jin W, Ho J, Yang C (2024) Knowledge-infused prompting: Assessing and advancing clinical text data generation with large language models. In: Ku L-W, Martins A, Srikumar V (eds) Findings of the Association for Computational Linguistics: ACL 2024, pp 15496\u201315523. Association for Computational Linguistics, Bangkok, Thailand. https:\/\/doi.org\/10.18653\/v1\/2024.findings-acl.916","DOI":"10.18653\/v1\/2024.findings-acl.916"},{"key":"229_CR15","doi-asserted-by":"publisher","unstructured":"Li R, Wang X, Yu H (2023) Two directions for clinical data generation with large language models: Data-to-label and label-to-data. In: Bouamor H, Pino J, Bali K (eds) Findings of the Association for Computational Linguistics: EMNLP 2023, pp 7129\u20137143. Association for Computational Linguistics, Singapore. https:\/\/doi.org\/10.18653\/v1\/2023.findings-emnlp.474","DOI":"10.18653\/v1\/2023.findings-emnlp.474"},{"key":"229_CR16","doi-asserted-by":"publisher","unstructured":"Wang J, Yao Z, Yang Z, Zhou H, Li R, Wang X, Xu Y, Yu H (2024) NoteChat: A dataset of synthetic patient-physician conversations conditioned on clinical notes. 
In: Ku L-W, Martins A, Srikumar V (eds) Findings of the Association for Computational Linguistics: ACL 2024, pp 15183\u201315201. Association for Computational Linguistics, Bangkok, Thailand. https:\/\/doi.org\/10.18653\/v1\/2024.findings-acl.901","DOI":"10.18653\/v1\/2024.findings-acl.901"},{"issue":"15","key":"229_CR17","doi-asserted-by":"publisher","first-page":"2733","DOI":"10.3390\/math10152733","volume":"10","author":"A Figueira","year":"2022","unstructured":"Figueira A, Vaz B (2022) Survey on synthetic data generation, evaluation methods and gans. Mathematics 10(15):2733. https:\/\/doi.org\/10.3390\/math10152733","journal-title":"Mathematics"},{"key":"229_CR18","doi-asserted-by":"publisher","unstructured":"Long L, Wang R, Xiao R, Zhao J, Ding X, Chen G, Wang H (2024) On LLMs-driven synthetic data generation, curation, and evaluation: A survey. In: Ku L-W, Martins A, Srikumar V (eds) Findings of the Association for Computational Linguistics: ACL 2024, pp 11065\u201311082. Association for Computational Linguistics, Bangkok, Thailand. https:\/\/doi.org\/10.18653\/v1\/2024.findings-acl.658","DOI":"10.18653\/v1\/2024.findings-acl.658"},{"key":"229_CR19","doi-asserted-by":"publisher","unstructured":"Li Z, Zhu H, Lu Z, Yin M (2023) Synthetic data generation with large language models for text classification: Potential and limitations. In: Bouamor H, Pino J, Bali K (eds) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp 10443\u201310461. Association for Computational Linguistics, Singapore. https:\/\/doi.org\/10.18653\/v1\/2023.emnlp-main.647","DOI":"10.18653\/v1\/2023.emnlp-main.647"},{"key":"229_CR20","doi-asserted-by":"publisher","unstructured":"Murtaza H, Ahmed M, Khan NF, Murtaza G, Zafar S, Bano A (2023) Synthetic data generation: State of the art in health care domain. Comput Sci Rev. 
48:100546 https:\/\/doi.org\/10.1016\/j.cosrev.2023.100546","DOI":"10.1016\/j.cosrev.2023.100546"},{"issue":"1","key":"229_CR21","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1371\/journal.pdig.0000082","volume":"2","author":"A Gonzales","year":"2023","unstructured":"Gonzales A, Guruswamy G, Smith SR (2023) Synthetic data in health care: A narrative review. PLOS Digital Health 2(1):1\u201316. https:\/\/doi.org\/10.1371\/journal.pdig.0000082","journal-title":"PLOS Digital Health"},{"issue":"4","key":"229_CR22","doi-asserted-by":"publisher","first-page":"114","DOI":"10.1093\/jamiaopen\/ooae114","volume":"7","author":"D Smolyak","year":"2024","unstructured":"Smolyak D, Bjarnad\u00f3ttir MV, Crowley K, Agarwal R (2024) Large language models and synthetic health data: progress and prospects. JAMIA Open 7(4):114. https:\/\/doi.org\/10.1093\/jamiaopen\/ooae114","journal-title":"JAMIA Open"},{"key":"229_CR23","doi-asserted-by":"publisher","unstructured":"Loni M, Poursalim F, Asadi M, Gharehbaghi A (2025) A review on generative AI models for synthetic medical text, time series, and longitudinal data. npj Digital Med. 8(1):281 https:\/\/doi.org\/10.1038\/s41746-024-01409-w","DOI":"10.1038\/s41746-024-01409-w"},{"issue":"1","key":"229_CR24","doi-asserted-by":"publisher","first-page":"159","DOI":"10.2307\/2529310","volume":"33","author":"JR Landis","year":"1977","unstructured":"Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159\u2013174. https:\/\/doi.org\/10.2307\/2529310","journal-title":"Biometrics"},{"key":"229_CR25","doi-asserted-by":"publisher","unstructured":"Shakeri S, Santos C, Zhu H, Ng P, Nan F, Wang Z, Nallapati R, Xiang B (2020) End-to-end synthetic data generation for domain adaptation of question answering systems. In: Webber B, Cohn T, He Y, Liu Y (eds) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 5445\u20135460. 
Association for Computational Linguistics, Online. https:\/\/doi.org\/10.18653\/v1\/2020.emnlp-main.439","DOI":"10.18653\/v1\/2020.emnlp-main.439"},{"key":"229_CR26","doi-asserted-by":"publisher","unstructured":"Ive J, Viani N, Kam J, Yin L, Verma S, Puntis S, Cardinal RN, Roberts A, Stewart R, Velupillai S (2020) Generation and evaluation of artificial mental health records for natural language processing. npj Digit Med. 3 https:\/\/doi.org\/10.1038\/s41746-020-0267-x","DOI":"10.1038\/s41746-020-0267-x"},{"key":"229_CR27","doi-asserted-by":"publisher","unstructured":"Li J, Zhou Y, Jiang X, Natarajan K, Pakhomov SV, Liu H, Xu H (2021) Are synthetic clinical notes useful for real natural language processing tasks: A case study on clinical entity recognition. J Am Med Inf Ass. 28:2193\u20132201. https:\/\/doi.org\/10.1093\/jamia\/ocab112","DOI":"10.1093\/jamia\/ocab112"},{"key":"229_CR28","doi-asserted-by":"publisher","unstructured":"Libbi CA, Trienes J, Trieschnigg D, Seifert C (2021) Generating synthetic training data for supervised de-identification of electronic health records. Future Int 13 https:\/\/doi.org\/10.3390\/fi13050136","DOI":"10.3390\/fi13050136"},{"key":"229_CR29","doi-asserted-by":"publisher","unstructured":"Chintagunta B, Katariya N, Amatriain X, Kannan A (2021) Medically aware GPT-3 as a data generator for medical dialogue summarization. In: Shivade C, Gangadharaiah R, Gella S, Konam S, Yuan S, Zhang Y, Bhatia P, Wallace B (eds) Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations, pp 66\u201376. Association for Computational Linguistics, Online. https:\/\/doi.org\/10.18653\/v1\/2021.nlpmc-1.9","DOI":"10.18653\/v1\/2021.nlpmc-1.9"},{"key":"229_CR30","doi-asserted-by":"publisher","unstructured":"Lu Q, Dou D, Nguyen TH (2021) Textual data augmentation for patient outcomes prediction. In: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp 2817\u20132821. 
https:\/\/doi.org\/10.1109\/BIBM52615.2021.9669861","DOI":"10.1109\/BIBM52615.2021.9669861"},{"key":"229_CR31","doi-asserted-by":"publisher","unstructured":"Peng C, Yang X, Chen A, Smith KE, PourNejatian N, Costa AB, Martin C, Flores MG, Zhang Y, Magoc T, Lipori G, Mitchell DA, Ospina NS, Ahmed MM, Hogan WR, Shenkman EA, Guo Y, Bian J, Wu Y (2023) A study of generative large language model for medical research and healthcare. npj Digital Med. 6(1):210 https:\/\/doi.org\/10.1038\/s41746-023-00958-w","DOI":"10.1038\/s41746-023-00958-w"},{"key":"229_CR32","doi-asserted-by":"publisher","unstructured":"Hiebel N, Ferret O, Fort K, N\u00e9v\u00e9ol A (2023) Can synthetic text help clinical named entity recognition? a study of electronic health records in French. In: Vlachos A, Augenstein I (eds) Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp 2320\u20132338. Association for Computational Linguistics, Dubrovnik, Croatia . https:\/\/doi.org\/10.18653\/v1\/2023.eacl-main.170","DOI":"10.18653\/v1\/2023.eacl-main.170"},{"key":"229_CR33","doi-asserted-by":"publisher","unstructured":"Khademi S, Palmer C, Dimaguila GL, Javed M, Buttery J, Black J (2023) Data augmentation to improve syndromic detection from emergency department notes. In: Proceedings of the 2023 Australasian Computer Science Week. ACSW \u201923, pp 198\u2013205. Association for Computing Machinery, New York, NY, USA. https:\/\/doi.org\/10.1145\/3579375.3579401","DOI":"10.1145\/3579375.3579401"},{"key":"229_CR34","doi-asserted-by":"publisher","unstructured":"Spitale G, Schneider G, Germani F, Biller-Andorno N (2023) Exploring the role of ai in classifying, analyzing, and generating case reports on assisted suicide cases: feasibility and ethical implications. 
Frontier Artif Intell 6 https:\/\/doi.org\/10.3389\/frai.2023.1328865","DOI":"10.3389\/frai.2023.1328865"},{"key":"229_CR35","doi-asserted-by":"publisher","unstructured":"Theodorou B, Xiao C, Sun J (2023) Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nature Commun 14(1). https:\/\/doi.org\/10.1038\/s41467-023-41093-0","DOI":"10.1038\/s41467-023-41093-0"},{"key":"229_CR36","doi-asserted-by":"publisher","unstructured":"Sufi F (2024) Addressing data scarcity in the medical domain: A gpt-based approach for synthetic data generation and feature extraction. Information (Switzerland). 15 https:\/\/doi.org\/10.3390\/info15050264","DOI":"10.3390\/info15050264"},{"key":"229_CR37","doi-asserted-by":"publisher","unstructured":"Chen X, Xu P, Li Y, Zhang W, Song F, He M, Shi D (2024) Chatffa: An ophthalmic chat system for unified vision-language understanding and question answering for fundus fluorescein angiography. iScience. 27(7):110021 https:\/\/doi.org\/10.1016\/j.isci.2024.110021","DOI":"10.1016\/j.isci.2024.110021"},{"issue":"6","key":"229_CR38","doi-asserted-by":"publisher","first-page":"1404","DOI":"10.1093\/jamia\/ocae081","volume":"31","author":"O Litake","year":"2024","unstructured":"Litake O, Park BH, Tully JL, Gabriel RA (2024) Constructing synthetic datasets with generative artificial intelligence to train large language models to classify acute renal failure from clinical notes. J Am Med Inform Assoc 31(6):1404\u20131410. https:\/\/doi.org\/10.1093\/jamia\/ocae081","journal-title":"J Am Med Inform Assoc"},{"issue":"9","key":"229_CR39","doi-asserted-by":"publisher","first-page":"1953","DOI":"10.1093\/jamia\/ocae073","volume":"31","author":"M Nievas","year":"2024","unstructured":"Nievas M, Basu A, Wang Y, Singh H (2024) Distilling large language models for matching patients to clinical trials. J Am Med Inform Assoc 31(9):1953\u20131963. 
https:\/\/doi.org\/10.1093\/jamia\/ocae073","journal-title":"J Am Med Inform Assoc"},{"key":"229_CR40","doi-asserted-by":"publisher","unstructured":"Wang Y, Wang Z, Wang W, Chen Q, Huang K, Nguyen A, De S (2024) DKE-research at SemEval-2024 task 2: Incorporating data augmentation with generative models and biomedical knowledge to enhance inference robustness. In: Ojha AK, Do\u011fru\u00f6z AS, Tayyar\u00a0Madabushi H, Da\u00a0San\u00a0Martino G, Rosenthal S, Ros\u00e1 A (eds) Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pp 88\u201394. Association for Computational Linguistics, Mexico City, Mexico. https:\/\/doi.org\/10.18653\/v1\/2024.semeval-1.15","DOI":"10.18653\/v1\/2024.semeval-1.15"},{"key":"229_CR41","doi-asserted-by":"publisher","unstructured":"Moser D, Bender M, Sariyar M (2024) Generating synthetic healthcare dialogues in emergency medicine using large language models. Stud Health Technol Inf 321:235\u2013239. https:\/\/doi.org\/10.3233\/SHTI241099","DOI":"10.3233\/SHTI241099"},{"key":"229_CR42","doi-asserted-by":"publisher","unstructured":"Bird JJ, Wright D, Sumich A, Lotfi A (2024) Generative ai in psychological therapy: Perspectives on computational linguistics and large language models in written behaviour monitoring. In: Proceedings of the 17th International Conference on PErvasive Technologies Related to Assistive Environments. PETRA \u201924, pp 322\u2013328. Association for Computing Machinery, New York, NY, USA. https:\/\/doi.org\/10.1145\/3652037.3663893","DOI":"10.1145\/3652037.3663893"},{"key":"229_CR43","doi-asserted-by":"publisher","unstructured":"Jeong M, Sohn J, Sung M, Kang J (2024) Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics. 
40:119\u2013129 https:\/\/doi.org\/10.1093\/bioinformatics\/btae238","DOI":"10.1093\/bioinformatics\/btae238"},{"key":"229_CR44","doi-asserted-by":"publisher","unstructured":"Zafar A, Sahoo SK, Bhardawaj H, Das A, Ekbal A (2024) Ki-mag: A knowledge-infused abstractive question answering system in medical domain. Neurocomputing. 571 https:\/\/doi.org\/10.1016\/j.neucom.2023.127141","DOI":"10.1016\/j.neucom.2023.127141"},{"issue":"1","key":"229_CR45","doi-asserted-by":"publisher","first-page":"6","DOI":"10.1038\/s41746-023-00970-0","volume":"7","author":"M Guevara","year":"2024","unstructured":"Guevara M, Chen S, Thomas S, Chaunzwa TL, Franco I, Kann BH, Moningi S, Qian JM, Goldstein M, Harper S et al (2024) Large language models to identify social determinants of health in electronic health records. NPJ Digital Med 7(1):6. https:\/\/doi.org\/10.1038\/s41746-023-00970-0","journal-title":"NPJ Digital Med"},{"key":"229_CR46","doi-asserted-by":"publisher","unstructured":"Ehrett C, Hegde S, Andre K, Liu D, Wilson T (2024) Leveraging open-source large language models for data augmentation in hospital staff surveys: Mixed methods study. JMIR Med Educ. 10:51433\u201351433 https:\/\/doi.org\/10.2196\/51433","DOI":"10.2196\/51433"},{"key":"229_CR47","doi-asserted-by":"publisher","unstructured":"Gabriel RA, Litake O, Simpson S, Burton BN, Waterman RS, Macias AA (2024) On the development and validation of large language model-based classifiers for identifying social determinants of health. Proceed National Acad Sci United States of America. 121 https:\/\/doi.org\/10.1073\/pnas.2320716121","DOI":"10.1073\/pnas.2320716121"},{"key":"229_CR48","doi-asserted-by":"publisher","unstructured":"Gao Y, Zhang W, Ren J, Zheng R, Jin Y, Wu D, Shu L, Xu X, Jin Z (2024) Pressinpose: Integrating pressure and inertial sensors for full-body pose estimation in activities. Proc. 
ACM Interact Mob Wearable Ubiquit Technol 8(4) https:\/\/doi.org\/10.1145\/3699773","DOI":"10.1145\/3699773"},{"key":"229_CR49","doi-asserted-by":"publisher","unstructured":"Kweon S, Kim J, Kim J, Im S, Cho E, Bae S, Oh J, Lee G, Moon JH, You SC, Baek S, Han CH, Jung YB, Jo Y, Choi E (2024) Publicly shareable clinical large language model built on synthetic clinical notes. In: Ku L-W, Martins A, Srikumar V (eds) Findings of the Association for Computational Linguistics: ACL 2024, pp 5148\u20135168. Association for Computational Linguistics, Bangkok, Thailand. https:\/\/doi.org\/10.18653\/v1\/2024.findings-acl.305","DOI":"10.18653\/v1\/2024.findings-acl.305"},{"key":"229_CR50","doi-asserted-by":"publisher","unstructured":"Weerasinghe K, Janapati S, Ge X, Kim S, Iyer S, Stankovic JA, Alemzadeh H (2024) Real-time multimodal cognitive assistant for emergency medical services. In: 2024 IEEE\/ACM Ninth International Conference on Internet-of-Things Design and Implementation (IoTDI), pp 85\u201396. https:\/\/doi.org\/10.1109\/IoTDI61053.2024.00012","DOI":"10.1109\/IoTDI61053.2024.00012"},{"issue":"3","key":"229_CR51","doi-asserted-by":"publisher","first-page":"953","DOI":"10.1162\/coli_a_00520","volume":"50","author":"M Delmas","year":"2024","unstructured":"Delmas M, Wysocka M, Freitas A (2024) Relation extraction in underexplored biomedical domains: A diversity-optimized sampling and synthetic data generation approach. Comput Linguist 50(3):953\u20131000. https:\/\/doi.org\/10.1162\/coli_a_00520","journal-title":"Comput Linguist"},{"key":"229_CR52","doi-asserted-by":"publisher","unstructured":"Zeinali N, Albashayreh A, Fan W, White SG (2024) Symptom-bert: Enhancing cancer symptom detection in ehr clinical notes. 
J Pain Symp Manage 68:190\u20131981 https:\/\/doi.org\/10.1016\/j.jpainsymman.2024.05.015","DOI":"10.1016\/j.jpainsymman.2024.05.015"},{"key":"229_CR53","doi-asserted-by":"publisher","unstructured":"Mishra P, Yao Z, Vashisht P, Ouyang F, Wang B, Mody VD, Yu H (2024) SYNFAC-EDIT: Synthetic imitation edit feedback for factual alignment in clinical summarization. In: Al-Onaizan Y, Bansal M, Chen Y-N (eds) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp 20061\u201320083. Association for Computational Linguistics, Miami, Florida, USA. https:\/\/doi.org\/10.18653\/v1\/2024.emnlp-main.1120","DOI":"10.18653\/v1\/2024.emnlp-main.1120"},{"key":"229_CR54","doi-asserted-by":"publisher","unstructured":"Zhang J, Cui W, Huang Y, Das K, Kumar S (2024) Synthetic knowledge ingestion: Towards knowledge refinement and injection for enhancing large language models. In: Al-Onaizan Y, Bansal M, Chen Y-N (eds) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp 21456\u201321473. Association for Computational Linguistics, Miami, Florida, USA. https:\/\/doi.org\/10.18653\/v1\/2024.emnlp-main.1196","DOI":"10.18653\/v1\/2024.emnlp-main.1196"},{"key":"229_CR55","doi-asserted-by":"publisher","unstructured":"Dobhal U, Garcia C, Inoue S (2024) Synthetic skeleton data generation using large language model for nurse activity recognition. In: Companion of the 2024 on ACM International Joint Conference on Pervasive and Ubiquitous Computing. UbiComp \u201924, pp 493\u2013499. Association for Computing Machinery, New York, NY, USA. https:\/\/doi.org\/10.1145\/3675094.3678445","DOI":"10.1145\/3675094.3678445"},{"key":"229_CR56","doi-asserted-by":"publisher","unstructured":"Wang N, Treewaree S, Zirikly A, Lu YL, Nguyen MH, Agarwal B, Shah J, Stevenson JM, Taylor CO (2024) Taxonomy-based prompt engineering to generate synthetic drug-related patient portal messages. J Biomed Inf. 
160 https:\/\/doi.org\/10.1016\/j.jbi.2024.104752","DOI":"10.1016\/j.jbi.2024.104752"},{"key":"229_CR57","unstructured":"Jones E, Palangi H, Sim\u00f5es C, Chandrasekaran V, Mukherjee S, Mitra A, Awadallah A, Kamar E (2024) Teaching language models to hallucinate less with synthetic tasks. In: 12th International Conference on Learning Representations, ICLR 2024, p 12. https:\/\/openreview.net\/forum?id=xpw7V0P136"},{"key":"229_CR58","doi-asserted-by":"publisher","unstructured":"Alghamdi HM, Mostafa A (2024) Towards reliable healthcare llm agents: A case study for pilgrims during hajj. Information (Switzerland). 15 https:\/\/doi.org\/10.3390\/info15070371","DOI":"10.3390\/info15070371"},{"key":"229_CR59","doi-asserted-by":"publisher","DOI":"10.1145\/3674838","author":"Y Wang","year":"2024","unstructured":"Wang Y, Fu T, Xu Y, Ma Z, Xu H, Du B, Lu Y, Gao H, Wu J, Chen J (2024) Twin-gpt: Digital twins for clinical trials via large language model. ACM Trans Multimed Comput Commun Appl. https:\/\/doi.org\/10.1145\/3674838","journal-title":"ACM Trans Multimed Comput Commun Appl"},{"key":"229_CR60","unstructured":"Woolsey CR, Bisht P, Rothman J, Leroy G (2024) Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Transl Sci. 2024, 429\u2013438"},{"issue":"1","key":"229_CR61","doi-asserted-by":"publisher","first-page":"47","DOI":"10.1111\/jnu.13004","volume":"57","author":"JK Scroggins","year":"2024","unstructured":"Scroggins JK, Topaz M, Song J, Zolnoori M (2024) Does synthetic data augmentation improve the performances of machine learning classifiers for identifying health problems in patient-nurse verbal communications in home healthcare settings? J Nurs Scholarsh 57(1):47\u201358. 
https:\/\/doi.org\/10.1111\/jnu.13004","journal-title":"J Nurs Scholarsh"},{"key":"229_CR62","doi-asserted-by":"publisher","unstructured":"Albayrak A, Xiao Y, Mukherjee P, Barnett SS, Marcou CA, Hart SN (2025) Enhancing human phenotype ontology term extraction through synthetic case reports and embedding-based retrieval: A novel approach for improved biomedical data annotation. J Pathol Inf. 16 https:\/\/doi.org\/10.1016\/j.jpi.2024.100409","DOI":"10.1016\/j.jpi.2024.100409"},{"key":"229_CR63","doi-asserted-by":"publisher","unstructured":"Wang Z, Jiang J, Zhan Y, Zhou B, Li Y, Zhang C, Yu B, Ding L, Jin H, Peng J, Lin X, Liu W (2025) Hypnos: A domain-specific large language model for anesthesiology. Neurocomputing. 624 https:\/\/doi.org\/10.1016\/j.neucom.2025.129389","DOI":"10.1016\/j.neucom.2025.129389"},{"key":"229_CR64","doi-asserted-by":"publisher","unstructured":"Theodorou B, Danek B, Tummala V, Kumar SP, Malin B, Sun J (2025) Improving medical machine learning models with generative balancing for equity and excellence. npj Digital Med. 8 (2025) https:\/\/doi.org\/10.1038\/s41746-025-01438-z","DOI":"10.1038\/s41746-025-01438-z"},{"key":"229_CR65","doi-asserted-by":"publisher","unstructured":"Cai Z, Fang H, Liu J, Xu G, Long Y, Guan Y, Ke T (2025) Improving unified information extraction in chinese mental health domain with instruction-tuned llms and type-verification component. Artif Intell Med. 162 https:\/\/doi.org\/10.1016\/j.artmed.2025.103087","DOI":"10.1016\/j.artmed.2025.103087"},{"key":"229_CR66","doi-asserted-by":"publisher","unstructured":"Barr AA, Quan J, Guo E, Sezgin E (2025) Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data. Frontier Artif Intell. 
8 https:\/\/doi.org\/10.3389\/frai.2025.1533508","DOI":"10.3389\/frai.2025.1533508"},{"issue":"5","key":"229_CR67","doi-asserted-by":"publisher","first-page":"885","DOI":"10.1093\/jamia\/ocaf037","volume":"32","author":"Y-S Chuang","year":"2025","unstructured":"Chuang Y-S, Sarkar AR, Hsu Y-C, Mohammed N, Jiang X (2025) Robust privacy amidst innovation with large language models through a critical assessment of the risks. J Am Med Inform Assoc 32(5):885\u2013892. https:\/\/doi.org\/10.1093\/jamia\/ocaf037","journal-title":"J Am Med Inform Assoc"},{"issue":"1","key":"229_CR68","doi-asserted-by":"publisher","first-page":"240","DOI":"10.1038\/s41746-025-01653-8","volume":"8","author":"H Kim","year":"2025","unstructured":"Kim H, Hwang H, Lee J, Park S, Kim D, Lee T, Yoon C, Sohn J, Park J, Reykhart O et al (2025) Small language models learn enhanced reasoning skills from medical textbooks. NPJ Digital Med 8(1):240. https:\/\/doi.org\/10.1038\/s41746-025-01653-8","journal-title":"NPJ Digital Med"},{"key":"229_CR69","doi-asserted-by":"publisher","unstructured":"Li J, Wan Z, Yu L, Liu H, Song H (2025) Synthetic data-driven approaches for chinese medical abstract sentence classification: Computational study. JMIR Format Res. 9 https:\/\/doi.org\/10.2196\/54803","DOI":"10.2196\/54803"},{"key":"229_CR70","doi-asserted-by":"publisher","unstructured":"Barabadi MA, Zhu X, Chan WY, Simpson AL, Do RKG (2025) Targeted generative data augmentation for automatic metastases detection from free-text radiology reports. Frontier Artif Intell. 8 https:\/\/doi.org\/10.3389\/frai.2025.1513674","DOI":"10.3389\/frai.2025.1513674"},{"key":"229_CR71","doi-asserted-by":"publisher","unstructured":"Peter OOE, Adeniran OT, John-Otumu AMG, Khalifa F, Rahman MM (2025) Text-guided synthesis in medical multimedia retrieval: A framework for enhanced colonoscopy image classification and segmentation. Algorithms. 
18 https:\/\/doi.org\/10.3390\/a18030155","DOI":"10.3390\/a18030155"},{"key":"229_CR72","doi-asserted-by":"publisher","DOI":"10.1109\/TMI.2025.3548872","author":"J Li","year":"2025","unstructured":"Li J, Zhu C, Zheng S, Chen P, Sun Y, Li H, Yang L (2025) Topofm: Topology-guided pathology foundation model for high-resolution pathology image synthesis with cellular-level control. IEEE Trans Med Imaging. https:\/\/doi.org\/10.1109\/TMI.2025.3548872","journal-title":"IEEE Trans Med Imaging"},{"key":"229_CR73","doi-asserted-by":"publisher","unstructured":"\u0160uvalov H, Lepson M, Kukk V, Malk M, Ilves N, Kuulmets HA, Kolde R (2025) Using synthetic health care data to leverage large language models for named entity recognition: Development and validation study. J Med Int Res. 27 https:\/\/doi.org\/10.2196\/66279","DOI":"10.2196\/66279"},{"key":"229_CR74","doi-asserted-by":"publisher","unstructured":"Miletic M, Sariyar M (2025) Utility-based analysis of statistical approaches and deep learning models for synthetic data generation with focus on correlation structures: Algorithm development and validation. JMIR AI. 4 https:\/\/doi.org\/10.2196\/65729","DOI":"10.2196\/65729"},{"issue":"2","key":"229_CR75","doi-asserted-by":"publisher","first-page":"151","DOI":"10.1016\/j.imed.2025.03.002","volume":"5","author":"X Chen","year":"2025","unstructured":"Chen X, Xiang J, Lu S, Liu Y, He M, Shi D (2025) Evaluating large language models and agents in healthcare: key challenges in clinical applications. Intell Med 5(2):151\u2013163. 
https:\/\/doi.org\/10.1016\/j.imed.2025.03.002","journal-title":"Intell Med"},{"key":"229_CR76","doi-asserted-by":"publisher","first-page":"n71","DOI":"10.1136\/bmj.n71","volume":"372","author":"MJ Page","year":"2021","unstructured":"Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hrobjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, McGuinness LA, Stewart LA, Thomas J, Tricco AC, Welch VA, Whiting P, Moher D (2021) The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372:n71. https:\/\/doi.org\/10.1136\/bmj.n71","journal-title":"BMJ"},{"issue":"7","key":"229_CR77","doi-asserted-by":"publisher","first-page":"467","DOI":"10.7326\/M18-0850","volume":"169","author":"AC Tricco","year":"2018","unstructured":"Tricco AC, Lillie E, Zarin W, O\u2019Brien KK, Colquhoun H, Levac D, Moher D, Peters MD, Horsley T, Weeks L et al (2018) PRISMA extension for scoping reviews (PRISMA-ScR): Checklist and explanation. Ann Intern Med 169(7):467\u2013473. https:\/\/doi.org\/10.7326\/M18-0850","journal-title":"Ann Intern Med"},{"key":"229_CR78","doi-asserted-by":"publisher","unstructured":"Johnson AEW, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, Pollard TJ, Hao S, Moody B, Gow B, Lehman L-wH, Celi LA, Mark RG (2023) MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 
10(1):1 https:\/\/doi.org\/10.1038\/s41597-022-01899-x","DOI":"10.1038\/s41597-022-01899-x"}],"container-title":["Journal of Healthcare Informatics Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41666-026-00229-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s41666-026-00229-9","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41666-026-00229-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,8]],"date-time":"2026-05-08T06:37:00Z","timestamp":1778222220000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s41666-026-00229-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,2]]},"references-count":78,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,6]]}},"alternative-id":["229"],"URL":"https:\/\/doi.org\/10.1007\/s41666-026-00229-9","relation":{},"ISSN":["2509-4971","2509-498X"],"issn-type":[{"value":"2509-4971","type":"print"},{"value":"2509-498X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,2]]},"assertion":[{"value":"1 September 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 December 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 January 2026","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 February 2026","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing Interests"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics Approval and Consent to Participate"}},{"value":"Not applicable.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for Publication"}}]}}