{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,20]],"date-time":"2026-04-20T18:16:23Z","timestamp":1776708983133,"version":"3.51.2"},"reference-count":143,"publisher":"Springer Science and Business Media LLC","issue":"12","license":[{"start":{"date-parts":[[2025,10,17]],"date-time":"2025-10-17T00:00:00Z","timestamp":1760659200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,10,17]],"date-time":"2025-10-17T00:00:00Z","timestamp":1760659200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001799","name":"Murdoch University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001799","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Artif Intell Rev"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    While Large Language Models (LLMs) have shown remarkable proficiency in text-based tasks, they struggle to interact effectively with the more realistic world without the perceptions of other modalities such as visual and audio. Multi-modal LLMs, which integrate these additional modalities, have become increasingly important across various domains. Despite the significant advancements and potential of multi-modal LLMs, there has been no comprehensive PRISMA-based systematic review that examines their applications across different domains. The objective of this work is to fill this gap by systematically reviewing and synthesising the quantitative research literature on domain-specific applications of multi-modal LLMs. 
This systematic review follows the PRISMA guidelines to analyse research literature published after 2022, the release of OpenAI\u2019s ChatGPT-3.5. The literature search was conducted across several online databases, including Nature, Scopus, and Google Scholar. A total of 22 studies were identified, with 11 focusing on the medical domain, 3 on autonomous driving, and 2 on geometric analysis. The remaining studies covered a range of topics, with one each on climate, music, e-commerce, sentiment analysis, human-robot interaction, and construction. This review provides a comprehensive overview of the current state of multi-modal LLMs, highlights their domain-specific applications, and identifies gaps and future research directions.\n                  <\/jats:p>","DOI":"10.1007\/s10462-025-11398-1","type":"journal-article","created":{"date-parts":[[2025,10,17]],"date-time":"2025-10-17T02:48:29Z","timestamp":1760669309000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["A systematic review of multi-modal large language models on domain-specific applications"],"prefix":"10.1007","volume":"58","author":[{"given":"Sirui","family":"Li","sequence":"first","affiliation":[]},{"given":"Kok Wai","family":"Wong","sequence":"additional","affiliation":[]},{"given":"Guanjin","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Thach-Thao","family":"Duong","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,10,17]]},"reference":[{"key":"11398_CR1","unstructured":"Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, et al (2023) Gpt-4 technical report. 
arXiv preprint arXiv:2303.08774"},{"key":"11398_CR2","unstructured":"Adebayo J, Gilmer J, Muelly M, Goodfellow I, Hardt M, Kim B (2018) Sanity checks for saliency maps. Advances in neural information processing systems 31"},{"key":"11398_CR3","doi-asserted-by":"publisher","first-page":"74","DOI":"10.1016\/j.enbuild.2014.10.074","volume":"87","author":"M Aksoezen","year":"2015","unstructured":"Aksoezen M, Daniel M, Hassler U, Kohler N (2015) Building age as an indicator for energy consumption. Energy Build 87:74\u201386","journal-title":"Energy Build"},{"key":"11398_CR4","unstructured":"Alexey D (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929"},{"key":"11398_CR5","doi-asserted-by":"crossref","unstructured":"Amjad H, Ashraf MS, Sherazi SZA, Khan S, Fraz MM, Hameed T, Bukhari SAC (2023) Attention-based explainability approaches in healthcare natural language processing. HEALTHINF 689\u2013696","DOI":"10.5220\/0011927300003414"},{"key":"11398_CR6","doi-asserted-by":"crossref","unstructured":"Arnab A, Dehghani M, Heigold G, Sun C, Lu\u010di\u0107 M, Schmid C (2021) Vivit: A video vision transformer. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 6836\u20136846","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"11398_CR7","doi-asserted-by":"crossref","unstructured":"Augenstein I, Baldwin T, Cha M, Chakraborty T, Ciampaglia GL, Corney D, DiResta R, Ferrara E, Hale S, Halevy A et al (2024) Factuality challenges in the era of large language models and opportunities for fact-checking. Nature Machine Intelligence 1\u201312","DOI":"10.1038\/s42256-024-00881-z"},{"key":"11398_CR8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-015-0069-3","volume":"7","author":"D Bajusz","year":"2015","unstructured":"Bajusz D, R\u00e1cz A, H\u00e9berger K (2015) Why is tanimoto index an appropriate choice for fingerprint-based similarity calculations? 
J Cheminf 7:1\u201313","journal-title":"J Cheminf"},{"key":"11398_CR9","doi-asserted-by":"publisher","DOI":"10.1017\/9781108676649","volume-title":"Human-robot interaction: an introduction","author":"C Bartneck","year":"2020","unstructured":"Bartneck C, Belpaeme T, Eyssel F, Kanda T, Keijsers M, \u0160abanovi\u0107 S (2020) Human-robot interaction: an introduction. Cambridge University Press, Cambridge"},{"key":"11398_CR10","doi-asserted-by":"crossref","unstructured":"Beltagy I, Lo K, Cohan A (2019) Scibert: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676","DOI":"10.18653\/v1\/D19-1371"},{"key":"11398_CR11","doi-asserted-by":"crossref","unstructured":"Belyaeva A, Cosentino J, Hormozdiari F, Eswaran K, Shetty S, Corrado G, Carroll A, McLean CY, Furlotte NA (2023) Multimodal llms for health grounded in individual-specific data. In: Workshop on Machine Learning for Multimodal Healthcare Data, pp. 86\u2013102. Springer","DOI":"10.1007\/978-3-031-47679-2_7"},{"key":"11398_CR12","doi-asserted-by":"crossref","unstructured":"Bodenreider O (2004) The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research 32(suppl_1):267\u2013270","DOI":"10.1093\/nar\/gkh061"},{"key":"11398_CR13","doi-asserted-by":"crossref","unstructured":"Boecking B, Usuyama N, Bannur S, Castro DC, Schwaighofer A, Hyland S, Wetscherek M, Naumann T, Nori A, Alvarez-Valle J, et al (2022) Making the most of text semantics to improve biomedical vision\u2013language processing. In: European Conference on Computer Vision, pp. 1\u201321. Springer","DOI":"10.1007\/978-3-031-20059-5_1"},{"key":"11398_CR14","unstructured":"Bojarski M, Testa DD, Dworakowski D, Firner B, Flepp B, Goyal P, Jackel LD, Monfort M, Muller U, Zhang J, Zhang X, Zhao J, Zieba K (2016) End to end learning for self-driving cars. 
CoRR arXiv:1604.07316"},{"issue":"10","key":"11398_CR15","doi-asserted-by":"publisher","first-page":"259","DOI":"10.1007\/s10462-024-10902-3","volume":"57","author":"F Bolanos","year":"2024","unstructured":"Bolanos F, Salatino A, Osborne F, Motta E (2024) Artificial intelligence for literature reviews: opportunities and challenges. Artif Intell Rev 57(10):259","journal-title":"Artif Intell Rev"},{"issue":"6293","key":"11398_CR16","doi-asserted-by":"publisher","first-page":"1573","DOI":"10.1126\/science.aaf2654","volume":"352","author":"J-F Bonnefon","year":"2016","unstructured":"Bonnefon J-F, Shariff A, Rahwan I (2016) The social dilemma of autonomous vehicles. Science 352(6293):1573\u20131576","journal-title":"Science"},{"issue":"4","key":"11398_CR17","doi-asserted-by":"publisher","first-page":"1061","DOI":"10.1037\/0033-295X.111.4.1061","volume":"111","author":"D Borsboom","year":"2004","unstructured":"Borsboom D, Mellenbergh GJ, Van Heerden J (2004) The concept of validity. Psychol Rev 111(4):1061","journal-title":"Psychol Rev"},{"key":"11398_CR18","first-page":"1877","volume":"33","author":"T Brown","year":"2020","unstructured":"Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A et al (2020) Language models are few-shot learners. Adv Neural Inf Process Syst 33:1877\u20131901","journal-title":"Adv Neural Inf Process Syst"},{"key":"11398_CR19","doi-asserted-by":"crossref","unstructured":"Caesar H, Bankiti V, Lang AH, Vora S, Liong VE, Xu Q, Krishnan A, Pan Y, Baldan G, Beijbom O (2020) nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621\u201311631","DOI":"10.1109\/CVPR42600.2020.01164"},{"key":"11398_CR20","doi-asserted-by":"crossref","unstructured":"Cai S, Bao K, Guo H, Zhang J, Song J, Zheng B (2024) Geogpt4v: Towards geometric multi-modal large language models with geometric image generation. 
arXiv preprint arXiv:2406.11503","DOI":"10.18653\/v1\/2024.emnlp-main.44"},{"key":"11398_CR21","unstructured":"Carolan K, Fennelly L, Smeaton AF (2024) A review of multi-modal large language and vision models. arXiv preprint arXiv:2404.01322"},{"issue":"1","key":"11398_CR22","doi-asserted-by":"publisher","first-page":"111","DOI":"10.1038\/s41746-024-01101-z","volume":"7","author":"X Chen","year":"2024","unstructured":"Chen X, Zhang W, Xu P, Zhao Z, Zheng Y, Shi D, He M (2024) Ffa-gpt: an automated pipeline for fundus fluorescein angiography interpretation and question-answer. NPJ Digital Med 7(1):111","journal-title":"NPJ Digital Med"},{"key":"11398_CR23","doi-asserted-by":"crossref","unstructured":"Chen Z, Zhou Y, Tran A, Zhao J, Wan L, Ooi GSK, Cheng LT-E, Thng CH, Xu X, Liu Y, et al (2023) Medical phrase grounding with region-phrase context contrastive alignment. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 371\u2013381. Springer","DOI":"10.1007\/978-3-031-43990-2_35"},{"key":"11398_CR24","doi-asserted-by":"crossref","unstructured":"Chen L, Li J, Dong X, Zhang P, He C, Wang J, Zhao F, Lin D (2023) Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793","DOI":"10.1007\/978-3-031-72643-9_22"},{"key":"11398_CR25","doi-asserted-by":"crossref","unstructured":"Cheng P, Mao C, Tang J, Yang S, Cheng Y, Wang W, Gu Q, Han W, Chen H, Li S et al (2024) Zero-shot prediction of mutation effects with multimodal deep representation learning guides protein engineering. Cell Res 1\u201318","DOI":"10.1038\/s41422-024-00989-2"},{"key":"11398_CR26","unstructured":"Chiang W-L, Li Z, Lin Z, Sheng Y, Wu Z, Zhang H, Zheng L, Zhuang S, Zhuang Y, Gonzalez JE, Stoica I, Xing EP (2023) Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. 
https:\/\/lmsys.org\/blog\/2023-03-30-vicuna\/"},{"key":"11398_CR27","doi-asserted-by":"publisher","unstructured":"Chopra S, Hadsell R, LeCun Y (2005) Learning a similarity metric discriminatively, with application to face verification. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, pp. 539\u2013546. IEEE Computer Society. https:\/\/doi.org\/10.1109\/CVPR.2005.202","DOI":"10.1109\/CVPR.2005.202"},{"issue":"240","key":"11398_CR28","first-page":"1","volume":"24","author":"A Chowdhery","year":"2023","unstructured":"Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, Barham P, Chung HW, Sutton C, Gehrmann S et al (2023) Palm: scaling language modeling with pathways. J Mach Learn Res 24(240):1\u2013113","journal-title":"J Mach Learn Res"},{"key":"11398_CR29","first-page":"240","volume":"24","author":"A Chowdhery","year":"2023","unstructured":"Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, Barham P, Chung HW, Sutton C, Gehrmann S, Schuh P, Shi K, Tsvyashchenko S, Maynez J, Rao A, Barnes P, Tay Y, Shazeer N, Prabhakaran V, Reif E, Du N, Hutchinson B, Pope R, Bradbury J, Austin J, Isard M, Gur-Ari G, Yin P, Duke T, Levskaya A, Ghemawat S, Dev S, Michalewski H, Garcia X, Misra V, Robinson K, Fedus L, Zhou D, Ippolito D, Luan D, Lim H, Zoph B, Spiridonov A, Sepassi R, Dohan D, Agrawal S, Omernick M, Dai AM, Pillai TS, Pellat M, Lewkowycz A, Moreira E, Child R, Polozov O, Lee K, Zhou Z, Wang X, Saeta B, Diaz M, Firat O, Catasta M, Wei J, Meier-Hellstern K, Eck D, Dean J, Petrov S, Fiedel N (2023) Palm: scaling language modeling with pathways. J Mach Learn Res 24:240\u20131240113","journal-title":"J Mach Learn Res"},{"key":"11398_CR30","unstructured":"Copet J, Kreuk F, Gat I, Remez T, Kant D, Synnaeve G, Adi Y, D\u00e9fossez A (2023) Simple and controllable music generation. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S (eds.) 
Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http:\/\/papers.nips.cc\/paper_files\/paper\/2023\/hash\/94b472a1842cd7c56dcb125fb2765fbd-Abstract-Conference.html"},{"key":"11398_CR31","doi-asserted-by":"crossref","unstructured":"Deng J, Yang Z, Chen T, Zhou W, Li H (2021) Transvg: End-to-end visual grounding with transformers. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 1769\u20131779","DOI":"10.1109\/ICCV48922.2021.00179"},{"key":"11398_CR32","doi-asserted-by":"publisher","unstructured":"Deruyttere T, Vandenhende S, Grujicic D, Van\u00a0Gool L, Moens M-F (2019) Talk2Car: Taking control of your self-driving car. In: Inui K, Jiang J, Ng V, Wan X (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2088\u20132098. Association for Computational Linguistics, Hong Kong, China. https:\/\/doi.org\/10.18653\/v1\/D19-1215. https:\/\/aclanthology.org\/D19-1215\/","DOI":"10.18653\/v1\/D19-1215"},{"key":"11398_CR33","unstructured":"Devlin J (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805"},{"key":"11398_CR34","unstructured":"Ding Y, Fan W, Ning L, Wang S, Li H, Yin D, Chua T-S, Li Q (2024) A survey on rag meets llms: Towards retrieval-augmented large language models. arXiv preprint arXiv:2405.06211"},{"key":"11398_CR35","first-page":"279","volume":"121","author":"K Donnelly","year":"2006","unstructured":"Donnelly K et al (2006) Snomed-ct: the advanced terminology and coding system for ehealth. 
Stud Health Technol Inf 121:279","journal-title":"Stud Health Technol Inf"},{"key":"11398_CR36","unstructured":"Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https:\/\/openreview.net\/forum?id=YicbFdNTTy"},{"key":"11398_CR37","doi-asserted-by":"publisher","unstructured":"Du Y, Fu Z, Liu Q, Wang Y (2022) Visual grounding with transformers. In: IEEE International Conference on Multimedia and Expo, ICME 2022, Taipei, Taiwan, July 18-22, 2022, pp. 1\u20136. IEEE. https:\/\/doi.org\/10.1109\/ICME52920.2022.9859880","DOI":"10.1109\/ICME52920.2022.9859880"},{"key":"11398_CR38","unstructured":"Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, Letman A, Mathur A, Schelten A, Yang A, Fan A, et al (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783"},{"key":"11398_CR39","doi-asserted-by":"crossref","unstructured":"Edwards C, Lai T, Ros K, Honke G, Cho K, Ji H (2022) Translation between molecules and natural language. arXiv preprint arXiv:2204.11817","DOI":"10.18653\/v1\/2022.emnlp-main.26"},{"key":"11398_CR40","doi-asserted-by":"crossref","unstructured":"Eslami S, Meinel C, De\u00a0Melo G (2023) Pubmedclip: How much does clip benefit visual question answering in the medical domain? In: Findings of the Association for Computational Linguistics: EACL 2023, pp. 1181\u20131193","DOI":"10.18653\/v1\/2023.findings-eacl.88"},{"key":"11398_CR41","doi-asserted-by":"crossref","unstructured":"Faiz A, Kaneda S, Wang R, Osi RC, Sharma P, Chen F, Jiang L (2024) Llmcarbon: Modeling the end-to-end carbon footprint of large language models. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. 
https:\/\/openreview.net\/forum?id=aIok3ZD9to","DOI":"10.1109\/CODES-ISSS60120.2024.00011"},{"issue":"12","key":"11398_CR42","doi-asserted-by":"publisher","first-page":"2911","DOI":"10.1109\/TCYB.2015.2492999","volume":"46","author":"M Ficocelli","year":"2015","unstructured":"Ficocelli M, Terao J, Nejat G (2015) Promoting interactions between humans and robots using robotic emotional behavior. IEEE Trans Cybernet 46(12):2911\u20132923","journal-title":"IEEE Trans Cybernet"},{"key":"11398_CR43","unstructured":"Gao J, Pi R, Zhang J, Ye J, Zhong W, Wang Y, Hong L, Han J, Xu H, Li Z, et al (2023) G-llava: solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370"},{"key":"11398_CR44","doi-asserted-by":"publisher","unstructured":"Gao P, Han J, Zhang R, Lin Z, Geng S, Zhou A, Zhang W, Lu P, He C, Yue X, Li H, Qiao Y (2023) Llama-adapter V2: parameter-efficient visual instruction model. CoRR arXiv:2304.15010. https:\/\/doi.org\/10.48550\/ARXIV.2304.15010","DOI":"10.48550\/ARXIV.2304.15010"},{"issue":"4","key":"11398_CR45","doi-asserted-by":"publisher","first-page":"359","DOI":"10.1006\/enfo.2001.0061","volume":"2","author":"TD Gauthier","year":"2001","unstructured":"Gauthier TD (2001) Detecting trends using spearman\u2019s rank correlation coefficient. Environmental forensics 2(4):359\u2013362","journal-title":"Environmental forensics"},{"issue":"3","key":"11398_CR46","doi-asserted-by":"publisher","first-page":"534","DOI":"10.1016\/j.pec.2020.09.017","volume":"104","author":"L Gerchow","year":"2021","unstructured":"Gerchow L, Burka LR, Miner S, Squires A (2021) Language barriers between nurses and patients: a scoping review. Patient Educ Couns 104(3):534\u2013553","journal-title":"Patient Educ Couns"},{"key":"11398_CR47","doi-asserted-by":"publisher","unstructured":"Han Z, Gao C, Liu J, Zhang J, Zhang SQ (2024) Parameter-efficient fine-tuning for large models: a comprehensive survey. CoRR arXiv:2403.14608. 
https:\/\/doi.org\/10.48550\/ARXIV.2403.14608","DOI":"10.48550\/ARXIV.2403.14608"},{"key":"11398_CR48","doi-asserted-by":"publisher","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770\u2013778. IEEE Computer Society. https:\/\/doi.org\/10.1109\/CVPR.2016.90","DOI":"10.1109\/CVPR.2016.90"},{"issue":"12","key":"11398_CR49","doi-asserted-by":"publisher","first-page":"5954","DOI":"10.1109\/TCYB.2020.2974688","volume":"51","author":"A Hong","year":"2020","unstructured":"Hong A, Lunscher N, Hu T, Tsuboi Y, Zhang X, Reis Alves SF, Nejat G, Benhabib B (2020) A multimodal emotional human-robot interaction architecture for social robots engaged in bidirectional communication. IEEE Trans Cybernet 51(12):5954\u20135968","journal-title":"IEEE Trans Cybernet"},{"key":"11398_CR50","doi-asserted-by":"crossref","unstructured":"Hu Y, Yang J, Chen L, Li K, Sima C, Zhu X, Chai S, Du S, Lin T, Wang W, et al (2023) Planning-oriented autonomous driving. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 17853\u201317862","DOI":"10.1109\/CVPR52729.2023.01712"},{"key":"11398_CR51","doi-asserted-by":"crossref","unstructured":"Huang S-C, Shen L, Lungren MP, Yeung S (2021) Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 3942\u20133951","DOI":"10.1109\/ICCV48922.2021.00391"},{"key":"11398_CR52","unstructured":"Hussain AS, Liu S, Sun C, Shan Y (2023) M2ugen: Multi-modal music understanding and generation with the power of large language models. 
arXiv preprint arXiv:2311.11255"},{"key":"11398_CR53","doi-asserted-by":"crossref","unstructured":"Izacard G, Grave E (2020) Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282","DOI":"10.18653\/v1\/2021.eacl-main.74"},{"key":"11398_CR54","unstructured":"Jang J, Ye S, Yang S, Shin J, Han J, Kim G, Choi SJ, Seo M (2022) Towards continual knowledge learning of language models. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https:\/\/openreview.net\/forum?id=vfsRB5MImo9"},{"issue":"12","key":"11398_CR55","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3571730","volume":"55","author":"Z Ji","year":"2023","unstructured":"Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, Ishii E, Bang YJ, Madotto A, Fung P (2023) Survey of hallucination in natural language generation. ACM Comput Surv 55(12):1\u201338","journal-title":"ACM Comput Surv"},{"issue":"1","key":"11398_CR56","doi-asserted-by":"publisher","first-page":"317","DOI":"10.1038\/s41597-019-0322-0","volume":"6","author":"AE Johnson","year":"2019","unstructured":"Johnson AE, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng C-Y, Mark RG, Horng S (2019) Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 6(1):317","journal-title":"Sci Data"},{"key":"11398_CR57","doi-asserted-by":"crossref","unstructured":"Karra SR, Tulabandhula T (2024) Interarec: Interactive recommendations using multimodal large language models. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 32\u201343. 
Springer","DOI":"10.1007\/978-981-97-2650-9_3"},{"issue":"D1","key":"11398_CR58","doi-asserted-by":"publisher","first-page":"1102","DOI":"10.1093\/nar\/gky1033","volume":"47","author":"S Kim","year":"2019","unstructured":"Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B et al (2019) Pubchem 2019 update: improved access to chemical data. Nucleic Acids Res 47(D1):1102\u20131109","journal-title":"Nucleic Acids Res"},{"key":"11398_CR59","doi-asserted-by":"crossref","unstructured":"Kim J, Rohrbach A, Darrell T, Canny J, Akata Z (2018) Textual explanations for self-driving vehicles. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 563\u2013578","DOI":"10.1007\/978-3-030-01216-8_35"},{"key":"11398_CR60","doi-asserted-by":"crossref","unstructured":"Kirkpatrick J, Pascanu R, Rabinowitz NC, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, Hassabis D, Clopath C, Kumaran D, Hadsell R (2016) Overcoming catastrophic forgetting in neural networks. CoRR arXiv:1612.00796","DOI":"10.1073\/pnas.1611835114"},{"issue":"5","key":"11398_CR61","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3342240","volume":"10","author":"S Law","year":"2019","unstructured":"Law S, Paige B, Russell C (2019) Take a look around: using street view and satellite images to estimate house prices. ACM Trans Intell Syst Technol (TIST) 10(5):1\u201319","journal-title":"ACM Trans Intell Syst Technol (TIST)"},{"key":"11398_CR62","doi-asserted-by":"publisher","unstructured":"Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer L (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Jurafsky D, Chai J, Schluter N, Tetreault JR (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 7871\u20137880. 
Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/V1\/2020.ACL-MAIN.703","DOI":"10.18653\/V1\/2020.ACL-MAIN.703"},{"key":"11398_CR63","first-page":"19652","volume":"34","author":"M Li","year":"2021","unstructured":"Li M, Sigal L (2021) Referring transformer: a one-step approach to multi-task visual grounding. Adv Neural Inf Process Syst 34:19652\u201319664","journal-title":"Adv Neural Inf Process Syst"},{"issue":"12","key":"11398_CR64","doi-asserted-by":"publisher","first-page":"5289","DOI":"10.1007\/s10115-023-01923-5","volume":"65","author":"S Li","year":"2023","unstructured":"Li S, Wong KW, Zhu D, Fung CC (2023) Drug-cov: a drug-origin knowledge graph discovering drug repurposing targeting covid-19. Knowl Inf Syst 65(12):5289\u20135308","journal-title":"Knowl Inf Syst"},{"key":"11398_CR65","unstructured":"Li Y, Chen Y, Rajabifard A, Khoshelham K, Aleksandrov M (2018) Estimating building age from google street view images using deep learning (short paper). In: 10th International Conference on Geographic Information Science (GIScience 2018). Schloss-Dagstuhl-Leibniz Zentrum f\u00fcr Informatik"},{"key":"11398_CR66","unstructured":"Li J, Li D, Xiong C, Hoi S (2022) Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888\u201312900. PMLR"},{"key":"11398_CR67","unstructured":"Li J, Li D, Savarese S, Hoi S (2023) Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning, pp. 19730\u201319742. PMLR"},{"key":"11398_CR68","unstructured":"Li C, Wong C, Zhang S, Usuyama N, Liu H, Yang J, Naumann T, Poon H, Gao J (2024) Llava-med: Training a large language-and-vision assistant for biomedicine in one day. 
Adv Neural Inf Process Syst 36"},{"key":"11398_CR69","unstructured":"Li J, Yuan Y, Zhang Z (2024) Enhancing llm factual accuracy with rag to counter hallucinations: A case study on domain-specific queries in private knowledge-bases. arXiv preprint arXiv:2403.10446"},{"key":"11398_CR70","doi-asserted-by":"crossref","unstructured":"Li Y, Wang S, Ding H, Chen H (2023) Large language models in finance: A survey. In: Proceedings of the Fourth ACM International Conference on AI in Finance, pp. 374\u2013382","DOI":"10.1145\/3604237.3626869"},{"issue":"5","key":"11398_CR71","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2024.103805","volume":"61","author":"X Liang","year":"2024","unstructured":"Liang X, Wang D, Zhong H, Wang Q, Li R, Jia R, Wan B (2024) Candidate-heuristic in-context learning: a new framework for enhancing medical visual question answering with llms. Inf Process Manag 61(5):103805","journal-title":"Inf Process Manag"},{"key":"11398_CR72","unstructured":"Liang PP, Goindani A, Chafekar T, Mathur L, Yu H, Salakhutdinov R, Morency L-P (2024) Hemm: Holistic evaluation of multimodal foundation models. arXiv preprint arXiv:2407.03418"},{"key":"11398_CR73","doi-asserted-by":"publisher","unstructured":"Liao H, Shen H, Li Z, Wang C, Li G, Bie Y, Xu C (2023) GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models. CoRR arXiv:2312.03543. https:\/\/doi.org\/10.48550\/ARXIV.2312.03543","DOI":"10.48550\/ARXIV.2312.03543"},{"key":"11398_CR74","doi-asserted-by":"publisher","DOI":"10.1016\/j.compbiomed.2024.108073","volume":"171","author":"P Liu","year":"2024","unstructured":"Liu P, Ren Y, Tao J, Ren Z (2024) Git-mol: a multi-modal large language model for molecular science with graph, image, and text. 
Comput Biol Med 171:108073","journal-title":"Comput Biol Med"},{"key":"11398_CR75","unstructured":"Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: A robustly optimized BERT pretraining approach. CoRR arXiv:1907.11692"},{"key":"11398_CR76","unstructured":"Liu J, Wang Z, Ye Q, Chong D, Zhou P, Hua Y (2023) Qilin-med-vl: towards chinese large vision-language model for general healthcare. arXiv preprint arXiv:2310.17956"},{"key":"11398_CR77","doi-asserted-by":"crossref","unstructured":"Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 10012\u201310022","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"11398_CR78","unstructured":"Liu H, Li C, Wu Q, Lee YJ (2024) Visual instruction tuning. Adv Neural Inf Process Syst 36"},{"key":"11398_CR79","doi-asserted-by":"crossref","unstructured":"Liu H, Li C, Li Y, Lee YJ (2024) Improved baselines with visual instruction tuning. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296\u201326306","DOI":"10.1109\/CVPR52733.2024.02484"},{"key":"11398_CR80","doi-asserted-by":"crossref","unstructured":"Liu S, Hussain AS, Sun C, Shan Y (2024) Music understanding llama: Advancing text-to-music generation with question answering and captioning. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 286\u2013290. IEEE","DOI":"10.1109\/ICASSP48485.2024.10447027"},{"key":"11398_CR81","doi-asserted-by":"crossref","unstructured":"Liu H, Yuan Y, Liu X, Mei X, Kong Q, Tian Q, Wang Y, Wang W, Wang Y, Plumbley MD (2024) Audioldm 2: Learning holistic audio generation with self-supervised pretraining. 
IEEE\/ACM Transactions on Audio, Speech, and Language Processing","DOI":"10.1109\/TASLP.2024.3399607"},{"issue":"3","key":"11398_CR82","doi-asserted-by":"publisher","first-page":"863","DOI":"10.1038\/s41591-024-02856-4","volume":"30","author":"MY Lu","year":"2024","unstructured":"Lu MY, Chen B, Williamson DF, Chen RJ, Liang I, Ding T, Jaume G, Odintsov I, Le LP, Gerber G et al (2024) A visual-language foundation model for computational pathology. Nat Med 30(3):863\u2013874","journal-title":"Nat Med"},{"key":"11398_CR83","doi-asserted-by":"crossref","unstructured":"Lu MY, Chen B, Williamson DF, Chen RJ, Zhao M, Chow AK, Ikemura K, Kim A, Pouli D, Patel A et al (2024) A multimodal generative ai copilot for human pathology. Nature 1\u20133","DOI":"10.1038\/s41586-024-07618-3"},{"key":"11398_CR84","unstructured":"Lu P, Bansal H, Xia T, Liu J, Li C, Hajishirzi H, Cheng H, Chang K-W, Galley M, Gao J (2023) Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models. arXiv e-prints, 2310"},{"issue":"1","key":"11398_CR85","doi-asserted-by":"publisher","first-page":"9603","DOI":"10.1038\/s41598-024-60210-7","volume":"14","author":"MSU Miah","year":"2024","unstructured":"Miah MSU, Kabir MM, Sarwar TB, Safran M, Alfarhood S, Mridha M (2024) A multimodal approach to cross-lingual sentiment analysis with ensemble of transformer and llm. Sci Rep 14(1):9603","journal-title":"Sci Rep"},{"key":"11398_CR86","unstructured":"Minaee S, Mikolov T, Nikzad N, Chenaghlu M, Socher R, Amatriain X, Gao J (2024) Large language models: a survey. arXiv preprint arXiv:2402.06196"},{"issue":"9","key":"11398_CR87","doi-asserted-by":"publisher","first-page":"10137","DOI":"10.1007\/s10462-023-10423-5","volume":"56","author":"SK Mondal","year":"2023","unstructured":"Mondal SK, Zhang H, Kabir HD, Ni K, Dai H-N (2023) Machine translation and its evaluation: a study. 
Artif Intell Rev 56(9):10137\u201310226","journal-title":"Artif Intell Rev"},{"issue":"2","key":"11398_CR88","doi-asserted-by":"publisher","first-page":"235","DOI":"10.1901\/jeab.2001.76-235","volume":"76","author":"J Myerson","year":"2001","unstructured":"Myerson J, Green L, Warusawitharana M (2001) Area under the curve as a measure of discounting. J Exp Anal Behav 76(2):235\u2013243","journal-title":"J Exp Anal Behav"},{"key":"11398_CR89","doi-asserted-by":"publisher","unstructured":"Nie M, Peng R, Wang C, Cai X, Han J, Xu H, Zhang L (2023) Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. CoRR arXiv:2312.03661. https:\/\/doi.org\/10.48550\/ARXIV.2312.03661","DOI":"10.48550\/ARXIV.2312.03661"},{"key":"11398_CR90","doi-asserted-by":"publisher","unstructured":"Ogawa Y, Zhao C, Oki T, Chen S, Sekimoto Y (2023) Deep learning approach for classifying the built year and structure of individual buildings by automatically linking street view images and GIS building data. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 16, 1740\u20131755. https:\/\/doi.org\/10.1109\/JSTARS.2023.3237509","DOI":"10.1109\/JSTARS.2023.3237509"},{"key":"11398_CR91","doi-asserted-by":"crossref","unstructured":"Orenstrakh MS, Karnalim O, Suarez CA, Liut M (2023) Detecting llm-generated text in computing education: a comparative study for chatgpt cases. arXiv preprint arXiv:2307.07411","DOI":"10.1109\/COMPSAC61105.2024.00027"},{"key":"11398_CR92","doi-asserted-by":"crossref","unstructured":"Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE et al (2021) The prisma 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372","DOI":"10.1136\/bmj.n71"},{"key":"11398_CR93","unstructured":"Pandya K, Holia M (2023) Automating customer service using langchain: building custom open-source gpt chatbot for organizations. 
arXiv preprint arXiv:2310.05421"},{"issue":"8","key":"11398_CR94","first-page":"9","volume":"1","author":"A Radford","year":"2019","unstructured":"Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9","journal-title":"OpenAI blog"},{"key":"11398_CR95","unstructured":"Radford A, Narasimhan K, Salimans T, Sutskever I, et al (2018) Improving language understanding by generative pre-training"},{"key":"11398_CR96","unstructured":"Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, et al (2021) Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748\u20138763. PMLR"},{"key":"11398_CR97","first-page":"140","volume":"21","author":"C Raffel","year":"2020","unstructured":"Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 21:140\u2013114067","journal-title":"J Mach Learn Res"},{"key":"11398_CR98","unstructured":"Reddy RG, Fung YR, Zeng Q, Li M, Wang Z, Sullivan P, Ji H (2023) Smartbook: Ai-assisted situation report generation. arXiv preprint arXiv:2303.14337"},{"key":"11398_CR99","doi-asserted-by":"publisher","unstructured":"Reimers N, Gurevych I (2019) Sentence-bert: Sentence embeddings using siamese bert-networks. In: Inui K, Jiang J, Ng V, Wan X (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp. 3980\u20133990. Association for Computational Linguistics. 
https:\/\/doi.org\/10.18653\/V1\/D19-1410","DOI":"10.18653\/V1\/D19-1410"},{"key":"11398_CR100","doi-asserted-by":"crossref","unstructured":"Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 658\u2013666","DOI":"10.1109\/CVPR.2019.00075"},{"issue":"7954","key":"11398_CR101","doi-asserted-by":"publisher","first-page":"773","DOI":"10.1038\/d41586-023-00816-5","volume":"615","author":"K Sanderson","year":"2023","unstructured":"Sanderson K (2023) Gpt-4 is here: what scientists think. Nature 615(7954):773","journal-title":"Nature"},{"issue":"1","key":"11398_CR102","doi-asserted-by":"publisher","first-page":"187","DOI":"10.1146\/annurev-control-060117-105157","volume":"1","author":"W Schwarting","year":"2018","unstructured":"Schwarting W, Alonso-Mora J, Rus D (2018) Planning and decision-making for autonomous vehicles. Ann Rev Control Robot Auton Syst 1(1):187\u2013210","journal-title":"Ann Rev Control Robot Auton Syst"},{"key":"11398_CR103","doi-asserted-by":"crossref","unstructured":"Su C, Wen J, Kang J, Wang Y, Pan H, Hossain MS (2024) Hybrid rag-empowered multi-modal llm for secure healthcare data management: A diffusion-based contract theory approach. arXiv preprint arXiv:2407.00978","DOI":"10.1109\/JIOT.2024.3521425"},{"key":"11398_CR104","unstructured":"Su B, Du D, Yang Z, Zhou Y, Li J, Rao A, Sun H, Lu Z, Wen J-R (2022) A molecular multimodal foundation model associating molecule graphs with natural language. arXiv preprint arXiv:2209.05481"},{"key":"11398_CR105","doi-asserted-by":"publisher","DOI":"10.1016\/j.cities.2022.103787","volume":"128","author":"M Sun","year":"2022","unstructured":"Sun M, Zhang F, Duarte F, Ratti C (2022) Understanding architecture age and style through deep learning. 
Cities 128:103787","journal-title":"Cities"},{"key":"11398_CR106","doi-asserted-by":"crossref","unstructured":"Tan Z, Shen Y, Cheng X, Zong C, Zhang W, Shao J, Lu W, Zhuang Y (2024) Learning global controller in latent space for parameter-efficient fine-tuning. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4044\u20134055","DOI":"10.18653\/v1\/2024.acl-long.222"},{"issue":"8","key":"11398_CR107","doi-asserted-by":"publisher","first-page":"1930","DOI":"10.1038\/s41591-023-02448-8","volume":"29","author":"AJ Thirunavukarasu","year":"2023","unstructured":"Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW (2023) Large language models in medicine. Nat Med 29(8):1930\u20131940","journal-title":"Nat Med"},{"issue":"10","key":"11398_CR108","doi-asserted-by":"publisher","first-page":"50","DOI":"10.1109\/MSPEC.2021.9563954","volume":"58","author":"NC Thompson","year":"2021","unstructured":"Thompson NC, Greenewald K, Lee K, Manso GF (2021) Deep learning\u2019s diminishing returns: the cost of improvement is becoming unsustainable. IEEE Spectr 58(10):50\u201355","journal-title":"IEEE Spectr"},{"key":"11398_CR109","unstructured":"Thoppilan R, Freitas DD, Hall J, Shazeer N, Kulshreshtha A, Cheng H, Jin A, Bos T, Baker L, Du Y, Li Y, Lee H, Zheng HS, Ghafouri A, Menegali M, Huang Y, Krikun M, Lepikhin D, Qin J, Chen D, Xu Y, Chen Z, Roberts A, Bosma M, Zhou Y, Chang C, Krivokon I, Rusch W, Pickett M, Meier-Hellstern KS, Morris MR, Doshi T, Santos RD, Duke T, Soraker J, Zevenbergen B, Prabhakaran V, Diaz M, Hutchinson B, Olson K, Molina A, Hoffman-John E, Lee J, Aroyo L, Rajakumar R, Butryna A, Lamm M, Kuzmina V, Fenton J, Cohen A, Bernstein R, Kurzweil R, Arcas BA, Cui C, Croak M, Chi EH, Le Q (2022) Lamda: Language models for dialog applications. 
CoRR arXiv:2201.08239"},{"key":"11398_CR110","doi-asserted-by":"publisher","unstructured":"Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, Batra S, Bhargava P, Bhosale S, Bikel D, Blecher L, Canton-Ferrer C, Chen M, Cucurull G, Esiobu D, Fernandes J, Fu J, Fu W, Fuller B, Gao C, Goswami V, Goyal N, Hartshorn A, Hosseini S, Hou R, Inan H, Kardas M, Kerkez V, Khabsa M, Kloumann I, Korenev A, Koura PS, Lachaux M, Lavril T, Lee J, Liskovich D, Lu Y, Mao Y, Martinet X, Mihaylov T, Mishra P, Molybog I, Nie Y, Poulton A, Reizenstein J, Rungta R, Saladi K, Schelten A, Silva R, Smith EM, Subramanian R, Tan XE, Tang B, Taylor R, Williams A, Kuan JX, Xu P, Yan Z, Zarov I, Zhang Y, Fan A, Kambadur M, Narang S, Rodriguez A, Stojnic R, Edunov S, Scialom T (2023) Llama 2: Open foundation and fine-tuned chat models. CoRR arXiv:2307.09288. https:\/\/doi.org\/10.48550\/ARXIV.2307.09288","DOI":"10.48550\/ARXIV.2307.09288"},{"issue":"1","key":"11398_CR111","doi-asserted-by":"publisher","first-page":"480","DOI":"10.1038\/s43247-023-01084-x","volume":"4","author":"SA Vaghefi","year":"2023","unstructured":"Vaghefi SA, Stammbach D, Muccione V, Bingler J, Ni J, Kraus M, Allen S, Colesanti-Senni C, Wekhof T, Schimanski T et al (2023) Chatclimate: grounding conversational ai in climate science. Commun Earth Environ 4(1):480","journal-title":"Commun Earth Environ"},{"key":"11398_CR112","unstructured":"Vaswani A (2017) Attention is all you need. arXiv preprint arXiv:1706.03762"},{"key":"11398_CR113","doi-asserted-by":"crossref","unstructured":"Vedantam R, Zitnick CL, Parikh D. Consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
4566\u20134575","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"11398_CR114","doi-asserted-by":"crossref","unstructured":"Vu T, Krishna K, Alzubi S, Tar C, Faruqui M, Sung Y-H (2024) Foundational autoraters: Taming large language models for better automatic evaluation. arXiv preprint arXiv:2407.10817","DOI":"10.18653\/v1\/2024.emnlp-main.949"},{"key":"11398_CR115","doi-asserted-by":"crossref","unstructured":"Wan P, Huang Z, Tang W, Nie Y, Pei D, Deng S, Chen J, Zhou Y, Duan H, Chen Q et al (2024) Outpatient reception via collaboration between nurses and a large language model: a randomized controlled trial. Nat Med 1\u20138","DOI":"10.1038\/s41591-024-03148-7"},{"key":"11398_CR116","unstructured":"Wang W, Xie J, Hu C, Zou H, Fan J, Tong W, Wen Y, Wu S, Deng H, Li Z, et al (2023) Drivemlm: aligning multi-modal large language models with behavioral planning states for autonomous driving. arXiv preprint arXiv:2312.09245"},{"key":"11398_CR117","doi-asserted-by":"publisher","unstructured":"Wang C, Hasler S, Tanneberg D, Ocker F, Joublin F, Ceravola A, Deigmoeller J, Gienger M (2024) Lami: Large language models for multi-modal human-robot interaction. In: Mueller FF, Kyburz P, Williamson JR, Sas C (eds.) Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA 2024, Honolulu, HI, USA, May 11-16, 2024, pp. 218\u2013121810. ACM. https:\/\/doi.org\/10.1145\/3613905.3651029","DOI":"10.1145\/3613905.3651029"},{"key":"11398_CR118","unstructured":"Wang K, Pan J, Shi W, Lu Z, Zhan M, Li H (2024) Measuring multimodal mathematical reasoning with math-vision dataset. arXiv preprint arXiv:2402.14804"},{"key":"11398_CR119","doi-asserted-by":"publisher","unstructured":"Webb TW, Holyoak KJ, Lu H (2022) Emergent analogical reasoning in large language models. CoRR arXiv:2212.09196. 
https:\/\/doi.org\/10.48550\/ARXIV.2212.09196","DOI":"10.48550\/ARXIV.2212.09196"},{"key":"11398_CR120","unstructured":"Wei J, Tay Y, Bommasani R, Raffel C, Zoph B, Borgeaud S, Yogatama D, Bosma M, Zhou D, Metzler D, Chi EH, Hashimoto T, Vinyals O, Liang P, Dean J, Fedus W (2022) Emergent abilities of large language models. Trans. Mach. Learn. Res. 2022"},{"issue":"1","key":"11398_CR121","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/CI00057A005","volume":"28","author":"D Weininger","year":"1988","unstructured":"Weininger D (1988) Smiles, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31\u201336. https:\/\/doi.org\/10.1021\/CI00057A005","journal-title":"J Chem Inf Comput Sci"},{"issue":"1","key":"11398_CR122","doi-asserted-by":"publisher","first-page":"257","DOI":"10.1038\/s41746-024-01233-2","volume":"7","author":"IC Wiest","year":"2024","unstructured":"Wiest IC, Ferber D, Zhu J, Treeck M, Meyer SK, Juglan R, Carrero ZI, Paech D, Kleesiek J, Ebert MP et al (2024) Privacy-preserving large language models for structured medical information retrieval. NPJ Digital Med 7(1):257","journal-title":"npj Digital Medicine"},{"key":"11398_CR123","doi-asserted-by":"crossref","unstructured":"Wu C, Zhang X, Zhang Y, Wang Y, Xie W (2023) Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 21372\u201321383","DOI":"10.1101\/2023.01.10.23284412"},{"issue":"2","key":"11398_CR124","doi-asserted-by":"publisher","first-page":"219","DOI":"10.1007\/s40894-022-00194-5","volume":"8","author":"Z Xie","year":"2023","unstructured":"Xie Z, Man W, Liu C, Fu X (2023) A prisma-based systematic review of measurements for school bullying. 
Adolescent Res Rev 8(2):219\u2013259","journal-title":"Adolescent research review"},{"issue":"1","key":"11398_CR125","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1186\/s13321-022-00624-5","volume":"14","author":"Z Xu","year":"2022","unstructured":"Xu Z, Li J, Yang Z, Li S, Li H (2022) Swinocsr: end-to-end optical chemical structure recognition using a swin transformer. J Cheminf 14(1):41","journal-title":"J Cheminf"},{"issue":"4","key":"11398_CR126","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2024.103724","volume":"61","author":"L Yang","year":"2024","unstructured":"Yang L, Wang Z, Li Z, Na J-C, Yu J (2024) An empirical study of multimodal entity-based sentiment analysis with chatgpt: improving in-context learning via entity-aware contrastive learning. Information Processing & Management 61(4):103724","journal-title":"Information Processing & Management"},{"key":"11398_CR127","doi-asserted-by":"publisher","DOI":"10.1016\/J.KNOSYS.2023.110823","volume":"278","author":"L Yang","year":"2023","unstructured":"Yang L, Wang J, Na J, Yu J (2023) Generating paraphrase sentences for multimodal entity-category-sentiment triple extraction. Knowl Based Syst 278:110823. https:\/\/doi.org\/10.1016\/J.KNOSYS.2023.110823","journal-title":"Knowl Based Syst"},{"key":"11398_CR128","unstructured":"Yang Z, Li L, Lin K, Wang J, Lin C-C, Liu Z, Wang L (2023) The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421 9(1), 1"},{"key":"11398_CR129","unstructured":"Yizhi L, Yuan R, Zhang G, Ma Y, Chen X, Yin H, Xiao C, Lin C, Ragni A, Benetos E, et al (2023) Mert: Acoustic music understanding model with large-scale self-supervised training. In: The Twelfth International Conference on Learning Representations"},{"key":"11398_CR130","doi-asserted-by":"publisher","unstructured":"Yu J, Jiang J, Yang L, Xia R (2020) Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. 
In: Jurafsky D, Chai J, Schluter N, Tetreault JR (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 3342\u20133352. Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/V1\/2020.ACL-MAIN.306","DOI":"10.18653\/V1\/2020.ACL-MAIN.306"},{"key":"11398_CR131","unstructured":"Yu J, Wang Z, Vasudevan V, Yeung L, Seyedhosseini M, Wu Y (2022) Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917"},{"key":"11398_CR132","doi-asserted-by":"publisher","unstructured":"Yuan J, Sun S, Omeiza D, Zhao B, Newman P, Kunze L, Gadd M (2024) Rag-driver: Generalisable driving explanations with retrieval-augmented in-context learning in multi-modal large language model. CoRR arXiv:2402.10828. https:\/\/doi.org\/10.48550\/ARXIV.2402.10828","DOI":"10.48550\/ARXIV.2402.10828"},{"issue":"6","key":"11398_CR133","doi-asserted-by":"publisher","first-page":"1091","DOI":"10.1109\/TPAMI.2007.1078","volume":"29","author":"L Yujian","year":"2007","unstructured":"Yujian L, Bo L (2007) A normalized levenshtein distance metric. IEEE Trans Pattern Anal Mach Intell 29(6):1091\u20131095","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"11398_CR134","doi-asserted-by":"publisher","unstructured":"Zeng Z, Goo JM, Wang X, Chi B, Wang M, Boehm J (2024) Zero-shot building age classification from facade image using GPT-4. CoRR arXiv:2404.09921. https:\/\/doi.org\/10.48550\/ARXIV.2404.09921","DOI":"10.48550\/ARXIV.2404.09921"},{"key":"11398_CR135","doi-asserted-by":"crossref","unstructured":"Zeppelzauer M, Despotovic M, Sakeena M, Koch D, D\u00f6ller M (2018) Automatic prediction of building age from photographs. In: Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 
126\u2013134","DOI":"10.1145\/3206025.3206060"},{"issue":"1","key":"11398_CR136","doi-asserted-by":"publisher","first-page":"4542","DOI":"10.1038\/s41467-023-40260-7","volume":"14","author":"X Zhang","year":"2023","unstructured":"Zhang X, Wu C, Zhang Y, Xie W, Wang Y (2023) Knowledge-enhanced visual-language pre-training on chest radiology images. Nat Commun 14(1):4542","journal-title":"Nat Commun"},{"key":"11398_CR137","unstructured":"Zhang Y, Jiang H, Miura Y, Manning CD, Langlotz CP (2022) Contrastive learning of medical visual representations from paired images and text. In: Machine Learning for Healthcare Conference, pp. 2\u201325. PMLR"},{"key":"11398_CR138","unstructured":"Zhang S, Xu Y, Usuyama N, Bagga J, Tinn R, Preston S, Rao R, Wei M, Valluri N, Wong C, et al (2023) Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint arXiv:2303.00915 2(3), 6"},{"key":"11398_CR139","unstructured":"Zhang R, Zhang Y, Shao K, Shan Y, Xia G (2022) Vis2mus: Exploring multimodal representation mapping for controllable music generation. arXiv preprint arXiv:2211.05543"},{"issue":"1","key":"11398_CR140","doi-asserted-by":"publisher","first-page":"5649","DOI":"10.1038\/s41467-024-50043-3","volume":"15","author":"J Zhou","year":"2024","unstructured":"Zhou J, He X, Sun L, Xu J, Chen X, Chu Y, Zhou L, Liao X, Zhang B, Afvari S et al (2024) Pre-trained multimodal large language model enhances dermatological diagnosis using skingpt-4. Nat Commun 15(1):5649","journal-title":"Nat Commun"},{"key":"11398_CR141","unstructured":"Zhu B, Lin B, Ning M, Yan Y, Cui J, Wang H, Pang Y, Jiang W, Zhang J, Li Z, et al (2023) Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. 
arXiv preprint arXiv:2310.01852"},{"key":"11398_CR142","doi-asserted-by":"publisher","unstructured":"Zhu C, Zhou Y, Shen Y, Luo G, Pan X, Lin M, Chen C, Cao L, Sun X, Ji R (2022) Seqtr: A simple yet universal network for visual grounding. In: Avidan S, Brostow GJ, Ciss\u00e9 M, Farinella GM, Hassner T (eds.) Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXV. Lecture Notes in Computer Science, vol. 13695, pp. 598\u2013615. Springer. https:\/\/doi.org\/10.1007\/978-3-031-19833-5_35","DOI":"10.1007\/978-3-031-19833-5_35"},{"key":"11398_CR143","unstructured":"Zou K, Bai Y, Chen Z, Zhou Y, Chen Y, Ren K, Wang M, Yuan X, Shen X, Fu H (2024) Medrg: Medical report grounding with multi-modal large language model. arXiv preprint arXiv:2404.06798"}],"container-title":["Artificial Intelligence Review"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10462-025-11398-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10462-025-11398-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10462-025-11398-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T02:02:38Z","timestamp":1764986558000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10462-025-11398-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,17]]},"references-count":143,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["11398"],"URL":"https:\/\/doi.org\/10.1007\/s10462-025-11398-1","relation":{},"ISSN":["1573-7462"],"issn-type":[{"value":"1573-7462","type":"electr
onic"}],"subject":[],"published":{"date-parts":[[2025,10,17]]},"assertion":[{"value":"7 October 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 September 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 October 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare to have no Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"383"}}