{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,30]],"date-time":"2026-03-30T20:50:01Z","timestamp":1774903801406,"version":"3.50.1"},"reference-count":120,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T00:00:00Z","timestamp":1764115200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T00:00:00Z","timestamp":1764115200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100012165","name":"Key Technologies Research and Development Program","doi-asserted-by":"publisher","award":["2024YFB3908503, 2024YFB3908500"],"award-info":[{"award-number":["2024YFB3908503, 2024YFB3908500"]}],"id":[{"id":"10.13039\/501100012165","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62322608"],"award-info":[{"award-number":["62322608"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Vis. Intell."],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Large language models (LLMs) have achieved superior performance in powering text-based AI agents, endowing them with decision-making and reasoning abilities that are analogous to those exhibited by humans. Concurrently, an emerging research trend is focused on extending these LLM-powered AI agents into the\n                    <jats:italic>multimodal<\/jats:italic>\n                    domain. 
This extension enables AI agents to interpret and respond to diverse multimodal user queries, thereby handling more intricate and nuanced tasks. In this paper, we conduct a systematic review of LLM-driven multimodal agents, which we refer to as\n                    <jats:italic>large multimodal agents<\/jats:italic>\n                    (LMAs for short). First, we introduce the essential components involved in developing LMAs and categorize the current body of research into four distinct types. Subsequently, we review the collaborative frameworks that integrate multiple LMAs, with the aim of enhancing collective efficacy. One of the critical challenges in this field is the diverse evaluation methods used across existing studies, which impedes effective comparison among different LMAs. Therefore, we compile these evaluation methodologies and establish a comprehensive framework to bridge the gaps. This framework aims to standardize evaluations, facilitating more meaningful comparisons. Concluding our review, we highlight the extensive applications of LMAs and propose potential future research directions. 
Our discussion aims to provide valuable insights and guidelines for future research in this rapidly evolving field.\n                  <\/jats:p>","DOI":"10.1007\/s44267-025-00093-y","type":"journal-article","created":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T06:02:12Z","timestamp":1764136932000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["Large multimodal agents: a survey"],"prefix":"10.1007","volume":"3","author":[{"given":"Junlin","family":"Xie","sequence":"first","affiliation":[]},{"given":"Zhihong","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Ruifei","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Guanbin","family":"Li","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,11,26]]},"reference":[{"issue":"2","key":"93_CR1","doi-asserted-by":"publisher","first-page":"115","DOI":"10.1017\/S0269888900008122","volume":"10","author":"M. Wooldridge","year":"1995","unstructured":"Wooldridge, M., & Jennings, N. R. (1995). Intelligent agents: theory and practice. Knowledge Engineering Review, 10(2), 115\u2013152.","journal-title":"Knowledge Engineering Review"},{"key":"93_CR2","unstructured":"Osoba, O.A., Vardavas, R., Grana, J., Zutshi, R., & Jaycocks, A. (2020). Policy-focused agent-based modeling using RL behavioral models. arXiv preprint. arXiv:2006.05048."},{"key":"93_CR3","doi-asserted-by":"publisher","DOI":"10.1016\/j.amc.2020.125312","volume":"382","author":"X. Wang","year":"2020","unstructured":"Wang, X., & Su, H. (2020). Completely model-free RL-based consensus of continuous-time multi-agent systems. Applied Mathematics and Computation, 382, 125312.","journal-title":"Applied Mathematics and Computation"},{"issue":"1","key":"93_CR4","first-page":"1","volume":"1","author":"J. Z. Pan","year":"2023","unstructured":"Pan, J. 
Z., Razniewski, S., Kalo, J.-C., Singhania, S., Chen, J., Dietze, S., Jabeen, H., Omeliyanenko, J., Zhang, W., Lissandrini, M., et al. (2023). Large language models and knowledge graphs: opportunities and challenges. Transactions on Graph Data and Knowledge, 1(1), 1\u201338.","journal-title":"Transactions on Graph Data and Knowledge"},{"key":"93_CR5","doi-asserted-by":"publisher","first-page":"8289","DOI":"10.18653\/v1\/2023.emnlp-main.516","volume-title":"Proceedings of the 2023 conference on empirical methods in natural language processing","author":"Z. Zhang","year":"2023","unstructured":"Zhang, Z., Fang, M., Chen, L., Namazi-Rad, M.-R., & Wang, J. (2023). How do large language models capture the ever-changing world knowledge? A review of recent advances. In Proceedings of the 2023 conference on empirical methods in natural language processing (pp. 8289\u20138311). Stroudsburg: ACL."},{"key":"93_CR6","first-page":"19594","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"Y. A. Li","year":"2023","unstructured":"Li, Y. A., Han, C., Raghavan, V., Mischler, G., & Mesgarani, N. (2023). Styletts 2: towards human-level text-to-speech through style diffusion and adversarial training with large speech language models. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 19594\u201319621). Red Hook: Curran Associates."},{"key":"93_CR7","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"L. Yang","year":"2024","unstructured":"Yang, L., Zhang, S., Yu, Z., Bao, G., Wang, Y., Wang, J., Xu, R., Ye, W., Xie, X., Chen, W., et al. (2024). Supervised knowledge makes large language models better in-context learners. In Proceedings of the 12th international conference on learning representations (pp. 1\u201317). 
Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=bAMPOUF227."},{"key":"93_CR8","first-page":"68539","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"T. Schick","year":"2023","unstructured":"Schick, T., Dwivedi-Yu, J., Dess\u00ec, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: language models can teach themselves to use tools. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 68539\u201368551). Red Hook: Curran Associates."},{"key":"93_CR9","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"Y. Qin","year":"2024","unstructured":"Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al. (2024). Toolllm: facilitating large language models to master 16000+ real-world APIs. In Proceedings of the 12th international conference on learning representations (pp. 1\u201323). Retrieved August 7, 2025, from https:\/\/openreview.net\/pdf?id=dHng2O0Jjr."},{"issue":"2","key":"93_CR10","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-024-4222-0","volume":"68","author":"Z. Xi","year":"2025","unstructured":"Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., et al. (2025). The rise and potential of large language model based agents: a survey. Science China. Information Sciences, 68(2), 121101.","journal-title":"Science China. Information Sciences"},{"key":"93_CR11","first-page":"1","volume":"2","author":"T. Sumers","year":"2024","unstructured":"Sumers, T., Yao, S., Narasimhan, K., & Griffiths, T. (2024). Cognitive architectures for language agents. 
Transactions on Machine Learning Research, 2, 1\u201332.","journal-title":"Transactions on Machine Learning Research"},{"key":"93_CR12","doi-asserted-by":"publisher","DOI":"10.1007\/s11704-024-40231-1","volume":"18","author":"L. Wang","year":"2024","unstructured":"Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z.-Y., Tang, J., Chen, X., Lin, Y., et al. (2024). A survey on large language model based autonomous agents. Frontiers of Computer Science, 18, 186345.","journal-title":"Frontiers of Computer Science"},{"key":"93_CR13","first-page":"14953","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"T. Gupta","year":"2023","unstructured":"Gupta, T., & Kembhavi, A. (2023). Visual programming: compositional visual reasoning without training. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 14953\u201314962). Piscataway: IEEE."},{"key":"93_CR14","first-page":"23802","volume-title":"Proceedings of the 38th AAAI conference on artificial intelligence and 36th conference on innovative applications of artificial intelligence and 14th symposium on educational advances in artificial intelligence","author":"R. Huang","year":"2024","unstructured":"Huang, R., Li, M., Yang, D., Shi, J., Chang, X., Ye, Z., Wu, Y., Hong, Z., Huang, J., Liu, J., et al. (2024). AudioGPT: understanding and generating speech, music, sound, and talking head. In Proceedings of the 38th AAAI conference on artificial intelligence and 36th conference on innovative applications of artificial intelligence and 14th symposium on educational advances in artificial intelligence (pp. 23802\u201323804). Palo Alto: AAAI Press."},{"key":"93_CR15","unstructured":"Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed, F., Liu, Z., Liu, C., Zeng, M., & Wang, L. (2023). MM-REACT: prompting ChatGPT for multimodal reasoning and action. arXiv preprint. 
arXiv:2303.11381."},{"key":"93_CR16","unstructured":"Wang, J., Chen, D., Luo, C., Dai, X., Yuan, L., Wu, Z., & Jiang, Y.-G. (2023). ChatVideo: a tracklet-centric multimodal and versatile video understanding system. arXiv preprint. arXiv:2304.14407."},{"key":"93_CR17","first-page":"11854","volume-title":"Proceedings of the 2023 IEEE\/CVF international conference on computer vision","author":"D. Sur\u00eds","year":"2023","unstructured":"Sur\u00eds, D., Menon, S., & Vondrick, C. (2023). ViperGPT: visual inference via Python execution for reasoning. In Proceedings of the 2023 IEEE\/CVF international conference on computer vision (pp. 11854\u201311864). Piscataway: IEEE."},{"key":"93_CR18","unstructured":"Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., & Duan, N. (2023). Visual ChatGPT: talking, drawing and editing with visual foundation models. arXiv preprint. arXiv:2303.04671."},{"key":"93_CR19","first-page":"71995","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"R. Yang","year":"2023","unstructured":"Yang, R., Song, L., Li, Y., Zhao, S., Ge, Y., Li, X., & Shan, Y. (2023). GPT4Tools: teaching large language model to use tools via self-instruction. In Proceedings of the 37th international conference on neural information processing systems (pp. 71995\u201372007). Red Hook: Curran Associates."},{"key":"93_CR20","unstructured":"Shen, Y., Song, K., Tan, X., Li, D., Lu, W., & Zhuang, Y. (2023). HuggingGPT: solving AI tasks with ChatGPT and its friends in huggingface. arXiv preprint. arXiv:2303.17580."},{"key":"93_CR21","volume-title":"Autonomous driving: technical, legal and social aspects","author":"M. Maurer","year":"2016","unstructured":"Maurer, M., Gerdes, J.C., Lenz, B., & Winner, H. (2016). Autonomous driving: technical, legal and social aspects. Berlin: Springer."},{"key":"93_CR22","unstructured":"Gao, D., Ji, L., Zhou, L., Lin, K.Q., Chen, J., Fan, Z., & Shou, M.Z. (2023). 
AssistGPT: a general multi-modal assistant that can plan, execute, inspect, and learn. arXiv preprint. arXiv:2306.08640."},{"key":"93_CR23","first-page":"867","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"Z. Hu","year":"2023","unstructured":"Hu, Z., Iscen, A., Sun, C., Chang, K.-W., Sun, Y., Ross, D., Schmid, C., & Fathi, A. (2023). AVIS: autonomous visual information seeking with large language model agent. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 867\u2013878). Red Hook: Curran Associates."},{"issue":"1","key":"93_CR24","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41467-025-60802-5","volume":"16","author":"D. Cai","year":"2025","unstructured":"Cai, D., Wang, S., Peng, C., Zhang, Z., Lu, Z., Qi, T., Lane, N. D., & Xu, M. (2025). Ubiquitous memory augmentation via mobile multimodal embedding system. Nature Communications, 16(1), 1\u201312.","journal-title":"Nature Communications"},{"key":"93_CR25","first-page":"17380","volume-title":"Proceedings of the IEEE international conference on robotics and automation","author":"Y. Long","year":"2024","unstructured":"Long, Y., Li, X., Cai, W., & Dong, H. (2024). Discuss before moving: visual language navigation via multi-expert discussions. In Proceedings of the IEEE international conference on robotics and automation (pp. 17380\u201317387). Piscataway: IEEE."},{"key":"93_CR26","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"X. Liu","year":"2024","unstructured":"Liu, X., Li, R., Ji, W., & Lin, T. (2024). Towards robust multi-modal reasoning via model selection. In Proceedings of the 12th international conference on learning representations (pp. 1\u201322). 
Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=KTf4DGAzus."},{"key":"93_CR27","first-page":"246","volume-title":"Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations","author":"D. Yu","year":"2023","unstructured":"Yu, D., Song, K., Lu, P., He, T., Tan, X., Ye, W., Zhang, S., & Bian, J. (2023). MusicAgent: an AI agent for music understanding and generation with large language models. In Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations (pp. 246\u2013255). Stroudsburg: ACL."},{"key":"93_CR28","unstructured":"Xie, T., Zhou, F., Cheng, Z., Shi, P., Weng, L., Liu, Y., Hua, T.J., Zhao, J., Liu, Q., Liu, C., et\u00a0al. (2023). OpenAgents: an open platform for language agents in the wild. arXiv preprint. arXiv:2310.10634."},{"key":"93_CR29","first-page":"20","volume-title":"Proceedings of the 18th European conference on computer vision","author":"J. Yang","year":"2024","unstructured":"Yang, J., Dong, Y., Liu, S., Li, B., Wang, Z., Tan, H., Jiang, C., Kang, J., Zhang, Y., Zhou, K., et al. (2024). Octopus: embodied vision-language programmer from environmental feedback. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 20\u201338). Cham: Springer."},{"key":"93_CR30","unstructured":"Tang, W., Zhou, Y., Xu, E., Cheng, K., Li, M., & Xiao, L. (2025). DSGBench: a diverse strategic game benchmark for evaluating LLM-based agents in complex decision-making environments. arXiv preprint. arXiv:2503.06047."},{"key":"93_CR31","unstructured":"Vemprala, S., Chen, S., Shukla, A., Narayanan, D., & Kapoor, A. (2023). GRID: a platform for general robot intelligence development. arXiv preprint. arXiv:2310.00887."},{"key":"93_CR32","doi-asserted-by":"crossref","unstructured":"Tao, H., Sethuraman, T. V., Shlapentokh-Rothman, M., Hoiem, D., & Ji, H. (2023). 
WebWISE: web interface control and sequential exploration with large language models. arXiv preprint. arXiv:2310.16042.","DOI":"10.18653\/v1\/2024.findings-naacl.234"},{"key":"93_CR33","first-page":"5168","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"G. Zheng","year":"2023","unstructured":"Zheng, G., Yang, B., Tang, J., Zhou, H.-Y., & Yang, S. (2023). DDCoT: duty-distinct chain-of-thought prompting for multimodal reasoning in language models. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 5168\u20135191). Red Hook: Curran Associates."},{"key":"93_CR34","first-page":"89","volume-title":"Proceedings of the 18th European conference on computer vision","author":"Z. Liu","year":"2024","unstructured":"Liu, Z., Lai, Z., Gao, Z., Cui, E., Li, Z., Zhu, X., Lu, L., Chen, Q., Qiao, Y., Dai, J., et al. (2024). ControlLLM: augment language models with tools by searching on graphs. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 89\u2013105). Cham: Springer."},{"key":"93_CR35","first-page":"43447","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"P. Lu","year":"2023","unstructured":"Lu, P., Peng, B., Cheng, H., Galley, M., Chang, K.-W., Wu, Y.N., Zhu, S.-C., & Gao, J. (2023). Chameleon: plug-and-play compositional reasoning with large language models. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 43447\u201343478). Red Hook: Curran Associates."},{"key":"93_CR36","unstructured":"Mao, J., Qian, Y., Zhao, H., & Wang, Y. (2023). GPT-Driver: learning to drive with GPT. arXiv preprint. 
arXiv:2310.01415."},{"key":"93_CR37","first-page":"126","volume-title":"Proceedings of the 18th European conference on computer vision","author":"S. Liu","year":"2024","unstructured":"Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., et al. (2024). LLaVA-Plus: learning to use tools for creating multimodal agents. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 126\u2013142). Cham: Springer."},{"key":"93_CR38","unstructured":"Yan, A., Yang, Z., Zhu, W., Lin, K., Li, L., Wang, J., Yang, J., Zhong, Y., McAuley, J., Gao, J., et\u00a0al. (2023). GPT-4v in wonderland: large multimodal models for zero-shot smartphone GUI navigation. arXiv preprint. arXiv:2311.07562."},{"key":"93_CR39","unstructured":"Chen, W.-G., Spiridonova, I., Yang, J., Gao, J., & Li, C. (2023). LLaVA-Interactive: an all-in-one demo for image chat, segmentation, generation and editing. arXiv preprint. arXiv:2311.00571."},{"key":"93_CR40","first-page":"16307","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. Qin","year":"2024","unstructured":"Qin, Y., Zhou, E., Liu, Q., Yin, Z., Sheng, L., Zhang, R., Qiao, Y., & Shao, J. (2024). MP5: a multi-modal open-ended embodied system in minecraft via active perception. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 16307\u201316316). Piscataway: IEEE."},{"key":"93_CR41","doi-asserted-by":"crossref","unstructured":"Lee, S., Choi, J., Lee, J., Choi, H., Ko, S. Y., Oh, S., & Shin, I. (2023). Explore, select, derive, and recall: augmenting LLM with human-like memory for mobile task automation. arXiv preprint. arXiv:2312.03003.","DOI":"10.1145\/3636534.3690682"},{"key":"93_CR42","first-page":"26275","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. 
Yang","year":"2024","unstructured":"Yang, Y., Zhou, T., Li, K., Tao, D., Li, L., Shen, L., He, X., Jiang, J., & Shi, Y. (2024). Embodied multi-modal agent trained by an LLM from a parallel textworld. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 26275\u201326285). Piscataway: IEEE."},{"key":"93_CR43","first-page":"187","volume-title":"Proceedings of the 18th European conference on computer vision","author":"Z. Zhao","year":"2024","unstructured":"Zhao, Z., Chai, W., Wang, X., Li, B., Hao, S., Cao, S., Ye, T., & Wang, G. (2024). See and think: embodied agent in virtual environment. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 187\u2013204). Cham: Springer."},{"key":"93_CR44","first-page":"1","volume-title":"Proceedings of the 2025 CHI conference on human factors in computing systems","author":"C. Zhang","year":"2025","unstructured":"Zhang, C., Yang, Z., Liu, J., Li, Y., Han, Y., Chen, X., Huang, Z., Fu, B., & Yu, G. (2025). AppAgent: multimodal agents as smartphone users. In Proceedings of the 2025 CHI conference on human factors in computing systems (pp. 1\u201320). New York: ACM."},{"issue":"3","key":"93_CR45","doi-asserted-by":"publisher","first-page":"1894","DOI":"10.1109\/TPAMI.2024.3511593","volume":"47","author":"Z. Wang","year":"2025","unstructured":"Wang, Z., Cai, S., Liu, A., Jin, Y., Hou, J., Zhang, B., Lin, H., He, Z., Zheng, Z., Yang, Y., et al. (2025). JARVIS-1: open-world multi-task agents with memory-augmented multimodal language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3), 1894\u20131907.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"93_CR46","unstructured":"Wen, H., Wang, H., Liu, J., & Li, Y. (2023). DroidBot-GPT: GPT-powered UI automation for Android. arXiv preprint. 
arXiv:2304.07061."},{"key":"93_CR47","first-page":"6678","volume-title":"Proceedings of the IEEE\/CVF winter conference on applications of computer vision","author":"C. Wang","year":"2025","unstructured":"Wang, C., Luo, W., Dong, S., Xuan, X., Li, Z., Ma, L., & Gao, S. (2025). MLLM-Tool: a multimodal large language model for tool agent learning. In Proceedings of the IEEE\/CVF winter conference on applications of computer vision (pp. 6678\u20136687). Piscataway: IEEE."},{"key":"93_CR48","first-page":"910","volume-title":"Proceedings of the IEEE\/CVF winter conference on applications of computer vision workshops","author":"D. Fu","year":"2024","unstructured":"Fu, D., Li, X., Wen, L., Dou, M., Cai, P., Shi, B., & Qiao, Y. (2024). Drive like a human: rethinking autonomous driving with large language models. In Proceedings of the IEEE\/CVF winter conference on applications of computer vision workshops (pp. 910\u2013919). Piscataway: IEEE."},{"key":"93_CR49","first-page":"128374","volume-title":"Proceedings of the 38th international conference on neural information processing systems","author":"Z. Wang","year":"2024","unstructured":"Wang, Z., Li, A., Li, Z., & Liu, X. (2024). GenArtist: multimodal LLM as an agent for unified image generation and editing. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, & C. Zhang (Eds.), Proceedings of the 38th international conference on neural information processing systems (pp. 128374\u2013128395). Red Hook: Curran Associates."},{"key":"93_CR50","first-page":"1","volume-title":"Proceedings of the 13th international conference on learning representations","author":"B. Gou","year":"2024","unstructured":"Gou, B., Wang, R., Zheng, B., Xie, Y., Chang, C., Shu, Y., Sun, H., & Su, Y. (2024). Navigating the digital world as humans do: universal visual grounding for GUI agents. In Proceedings of the 13th international conference on learning representations (pp. 1\u201333). 
Retrieved September 5, 2025, from https:\/\/openreview.net\/forum?id=kxnoqaisCT."},{"key":"93_CR51","first-page":"27529","volume-title":"Proceedings of the 38th international conference on neural information processing systems","author":"P. Mazzaglia","year":"2024","unstructured":"Mazzaglia, P., Verbelen, T., Dhoedt, B., Courville, A. C., & Mudumba, S. R. (2024). GenRL: multimodal-foundation world models for generalization in embodied agents. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, & C. Zhang (Eds.), Proceedings of the 38th international conference on neural information processing systems (pp. 27529\u201327555). Red Hook: Curran Associates."},{"key":"93_CR52","first-page":"25981","volume-title":"Proceedings of the 38th international conference on neural information processing systems","author":"S. Wu","year":"2024","unstructured":"Wu, S., Zhao, S., Huang, Q., Huang, K., Yasunaga, M., Cao, K., Ioannidis, V., Subbian, K., Leskovec, J., & Zou, J. Y. (2024). Avatar: optimizing LLM agents for tool usage via contrastive reasoning. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, & C. Zhang (Eds.), Proceedings of the 38th international conference on neural information processing systems (pp. 25981\u201326010). Red Hook: Curran Associates."},{"key":"93_CR53","unstructured":"Wang, J., Xu, H., Ye, J., Yan, M., Shen, W., Zhang, J., Huang, F., & Sang, J. (2024). Mobile-Agent: autonomous multi-modal mobile device agent with visual perception. arXiv preprint. arXiv:2401.16158."},{"key":"93_CR54","first-page":"9039","volume-title":"Proceedings of the computer vision and pattern recognition conference","author":"Z. Li","year":"2025","unstructured":"Li, Z., Xie, Y., Shao, R., Chen, G., Jiang, D., & Nie, L. (2025). Optimus-2: multimodal minecraft agent with goal-observation-action conditioned policy. In Proceedings of the computer vision and pattern recognition conference (pp. 9039\u20139049). 
Piscataway: IEEE."},{"key":"93_CR55","first-page":"19477","volume-title":"Proceedings of the computer vision and pattern recognition conference","author":"Y. Sun","year":"2025","unstructured":"Sun, Y., Zhao, S., Yu, T., Wen, H., Va, S., Xu, M., Li, Y., & Zhang, C. (2025). GUI-Xplore: empowering generalizable GUI agents with one exploration. In Proceedings of the computer vision and pattern recognition conference (pp. 19477\u201319486). Piscataway: IEEE."},{"key":"93_CR56","unstructured":"Zhang, Z., Zhu, L., Fang, Z., Huang, Z., & Luo, Y. (2025). Provable ordering and continuity in vision-language pretraining for generalizable embodied agents. arXiv preprint. arXiv:2502.01218."},{"key":"93_CR57","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Song, L., & Mam, J. S. (2025). Modular multi-agent framework for multi-modal medical diagnosis via role-specialized collaboration. arXiv preprint. arXiv:2506.19835.","DOI":"10.18653\/v1\/2025.findings-acl.1298"},{"key":"93_CR58","first-page":"13258","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Gao","year":"2024","unstructured":"Gao, Z., Du, Y., Zhang, X., Ma, X., Han, W., Zhu, S.-C., & Li, Q. (2024). CLOVA: a closed-loop visual assistant with tool usage and update. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 13258\u201313268). Piscataway: IEEE."},{"key":"93_CR59","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"L. Yuan","year":"2024","unstructured":"Yuan, L., Chen, Y., Wang, X., Fung, Y. R., Peng, H., & Ji, H. (2024). CRAFT: customizing LLMs by creating and retrieving from specialized toolsets. In Proceedings of the 12th international conference on learning representations (pp. 1\u201329). 
Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=G0vdDSt9XM."},{"key":"93_CR60","first-page":"70115","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"L. Chen","year":"2023","unstructured":"Chen, L., Li, B., Shen, S., Yang, J., Li, C., Keutzer, K., Darrell, T., & Liu, Z. (2023). Large language models are visual reasoning coordinators. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 70115\u201370140). Red Hook: Curran Associates."},{"key":"93_CR61","unstructured":"Wang, Z., Cai, S., Liu, A., Ma, X., & Liang, Y. (2023). Describe, explain, plan and select: interactive planning with large language models enables open-world multi-task agents. arXiv preprint. arXiv:2302.01560."},{"key":"93_CR62","doi-asserted-by":"crossref","unstructured":"Gao, D., Ji, L., Bai, Z., Ouyang, M., Li, P., Mao, D., Wu, Q., Zhang, W., Wang, P., Guo, X., et\u00a0al. (2023). ASSISTGUI: task-oriented desktop graphical user interface automation. arXiv preprint. arXiv:2312.13108.","DOI":"10.1109\/CVPR52733.2024.01262"},{"key":"93_CR63","unstructured":"Li, S., Wang, R., Hsieh, C.-J., Cheng, M., & Zhou, T. (2024). MuLan: multimodal-LLM agent for progressive multi-object diffusion. arXiv preprint. arXiv:2402.12741."},{"key":"93_CR64","first-page":"3132","volume-title":"Findings of the Association for Computational Linguistics","author":"Z. Zhang","year":"2024","unstructured":"Zhang, Z., & Zhang, A. (2024). You only look at screens: multimodal chain-of-action agents. In Findings of the Association for Computational Linguistics (pp. 3132\u20133149). Stroudsburg: ACL."},{"key":"93_CR65","first-page":"55976","volume-title":"Proceedings of the international conference on machine learning","author":"Z. Yang","year":"2024","unstructured":"Yang, Z., Chen, G., Li, X., Wang, W., & Yang, Y. (2024). 
DoraemonGPT: toward understanding dynamic scenes with large language models (exemplified as a video agent). In Proceedings of the international conference on machine learning (pp. 55976\u201355997). Retrieved August 8, 2025, from https:\/\/openreview.net\/forum?id=QMy2RLnxGN."},{"key":"93_CR66","unstructured":"Wu, Z., Han, C., Ding, Z., Weng, Z., Liu, Z., Yao, S., Yu, T., & Kong, L. (2024). OS-Copilot: towards generalist computer agents with self-improvement. arXiv preprint. arXiv:2402.07456."},{"key":"93_CR67","unstructured":"Liu, Y., Song, X., Jiang, K., Chen, W., Luo, J., Li, G., & Lin, L. (2024). Multimodal embodied interactive agent for cafe scene. arXiv preprint. arXiv:2402.00290."},{"key":"93_CR68","unstructured":"Zhang, Y., Maezawa, A., Xia, G., Yamamoto, K., & Dixon, S. (2023). Loop copilot: conducting AI ensembles for music generation and iterative editing. arXiv preprint. arXiv:2310.12404."},{"key":"93_CR69","unstructured":"Liu, X., Zhu, Z., Liu, H., Yuan, Y., Cui, M., Huang, Q., Liang, J., Cao, Y., Kong, Q., Plumbley, M. D., et\u00a0al. (2023). Wavjourney: compositional audio creation with large language models. arXiv preprint. arXiv:2307.14335."},{"key":"93_CR70","first-page":"19730","volume-title":"Proceedings of the international conference on machine learning","author":"J. Li","year":"2023","unstructured":"Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, & J. Scarlett (Eds.), Proceedings of the international conference on machine learning (pp. 19730\u201319742). Retrieved August 7, 2025, from https:\/\/proceedings.mlr.press\/v202\/li23q.html."},{"key":"93_CR71","first-page":"38","volume-title":"Proceedings of the 18th European conference on computer vision","author":"S. 
Liu","year":"2024","unstructured":"Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al. (2024). Grounding DINO: marrying DINO with grounded pre-training for open-set object detection. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 38\u201355). Cham: Springer."},{"key":"93_CR72","first-page":"12888","volume-title":"Proceedings of the international conference on machine learning","author":"J. Li","year":"2022","unstructured":"Li, J., Li, D., Xiong, C., & Hoi, S. (2022). Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the international conference on machine learning (pp. 12888\u201312900). Retrieved August 7, 2025, from https:\/\/proceedings.mlr.press\/v162\/li22n.html."},{"key":"93_CR73","first-page":"1","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"W. Dai","year":"2023","unstructured":"Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., & Hoi, S. C. H. (2023). InstructBLIP: towards general-purpose vision-language models with instruction tuning. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 1\u201318). Red Hook: Curran Associates."},{"key":"93_CR74","unstructured":"Xu, J., Wang, X., Cao, Y.-P., Cheng, W., Shan, Y., & Gao, S. (2023). InstructP2P: learning to edit 3D point clouds with text instructions. arXiv preprint. arXiv:2306.07154."},{"key":"93_CR75","first-page":"10684","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"R. Rombach","year":"2022","unstructured":"Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). 
High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 10684\u201310695). Piscataway: IEEE."},{"key":"93_CR76","unstructured":"Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et\u00a0al. Improving image generation with better captions. Retrieved August 7, 2025, from https:\/\/cdn.openai.com\/papers\/dall-e-3.pdf."},{"key":"93_CR77","first-page":"4015","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"A. Kirillov","year":"2023","unstructured":"Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., et al. (2023). Segment anything. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 4015\u20134026). Piscataway: IEEE."},{"key":"93_CR78","unstructured":"Liu, Y., Chu, L., Chen, G., Wu, Z., Chen, Z., Lai, B., & Hao, Y. (2021). PaddleSeg: a high-efficient development toolkit for image segmentation. arXiv preprint. arXiv:2101.06175."},{"key":"93_CR79","first-page":"341","volume-title":"Proceedings of the 17th European conference on computer vision","author":"B. Ye","year":"2022","unstructured":"Ye, B., Chang, H., Ma, B., Shan, S., & Chen, X. (2022). Joint feature learning and relation modeling for tracking: a one-stream framework. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 341\u2013357). Cham: Springer."},{"issue":"12","key":"93_CR80","doi-asserted-by":"publisher","first-page":"2649","DOI":"10.1109\/TVCG.2012.291","volume":"18","author":"N. Cao","year":"2012","unstructured":"Cao, N., Lin, Y.-R., Sun, X., Lazer, D., Liu, S., & Qu, H. (2012). Whisper: tracing the spatiotemporal process of information diffusion in real time. 
IEEE Transactions on Visualization and Computer Graphics, 18(12), 2649\u20132658.","journal-title":"IEEE Transactions on Visualization and Computer Graphics"},{"key":"93_CR81","first-page":"302","volume-title":"Proceedings of the conference on robot learning","author":"J. Zhang","year":"2023","unstructured":"Zhang, J., Zhang, J., Pertsch, K., Liu, Z., Ren, X., Chang, M., Sun, S.H., & Lim, J. J. (2023). Bootstrap your own skills: learning to solve new tasks with large language model guidance. In Proceedings of the conference on robot learning (pp. 302\u2013325). Retrieved August 7, 2025, from https:\/\/proceedings.mlr.press\/v229\/zhang23a.html."},{"key":"93_CR82","first-page":"1","volume-title":"Proceedings of the 13th international conference on learning representations","author":"J.-Y. He","year":"2025","unstructured":"He, J.-Y., Cheng, Z.-Q., Li, C., Sun, J., He, Q., Xiang, W., Chen, H., Lan, J.-P., Lin, X., Zhu, K., et al. (2025). MetaDesigner: advancing artistic typography through AI-driven, user-centric, and multilingual wordart synthesis. In Proceedings of the 13th international conference on learning representations (pp. 1\u201324). Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=Mv3GAYJGcW."},{"key":"93_CR83","first-page":"881","volume-title":"Proceedings of the 62nd annual meeting of the Association for Computational Linguistics","author":"J. Y. Koh","year":"2024","unstructured":"Koh, J. Y., Lo, R., Jang, L., Duvvur, V., Lim, M., Huang, P.-Y., Neubig, G., Zhou, S., Salakhutdinov, R., & Fried, D. (2024). VisualWebArena: evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd annual meeting of the Association for Computational Linguistics (pp. 881\u2013905). Stroudsburg: ACL."},{"key":"93_CR84","first-page":"1","volume-title":"Proceedings of the 13th international conference on learning representations","author":"L. 
Zheng","year":"2024","unstructured":"Zheng, L., Huang, Z., Xue, Z., Wang, X., An, B., & Yan, S. (2024). Agentstudio: a toolkit for building general virtual agents. In Proceedings of the 13th international conference on learning representations (pp. 1\u201342). Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=axUf8BOjnH."},{"key":"93_CR85","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"Y. Wu","year":"2024","unstructured":"Wu, Y., Tang, X., Mitchell, T. M., & Li, Y. (2024). SmartPlay: a benchmark for LLMs as intelligent agents. In Proceedings of the 12th international conference on learning representations (pp. 1\u201319). Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=S2oTVrlcp3."},{"key":"93_CR86","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"G. Mialon","year":"2024","unstructured":"Mialon, G., Fourrier, C., Wolf, T., LeCun, Y., & Scialom, T. (2024). GAIA: a benchmark for general AI assistants. In Proceedings of the 12th international conference on learning representations (pp. 1\u201325). Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=fibxvahvs3."},{"key":"93_CR87","unstructured":"L\u00f9, X.H., Kasner, Z., & Reddy, S. (2024). Weblinx: real-world website navigation with multi-turn dialogue. arXiv preprint. arXiv:2402.05930."},{"key":"93_CR88","first-page":"1","volume-title":"Proceedings of the 41st international conference on machine learning","author":"J. Xie","year":"2024","unstructured":"Xie, J., Zhang, K., Chen, J., Zhu, T., Lou, R., Tian, Y., Xiao, Y., & Su, Y. (2024). Travelplanner: a benchmark for real-world planning with language agents. In Proceedings of the 41st international conference on machine learning (pp. 1\u201324). 
Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=l5XQzNkAOe."},{"key":"93_CR89","first-page":"28091","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"X. Deng","year":"2023","unstructured":"Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., & Su, Y. (2023). Mind2web: towards a generalist agent for the web. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 28091\u201328114). Red Hook: Curran Associates."},{"key":"93_CR90","volume-title":"Proceedings of the 12th international conference on learning representations","author":"S. Zhou","year":"2024","unstructured":"Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al. (2024). Webarena: a realistic web environment for building autonomous agents. In Proceedings of the 12th international conference on learning representations. Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=oKn9c6ytLx."},{"key":"93_CR91","first-page":"1","volume-title":"Proceedings of the 13th international conference on learning representations","author":"J. Chen","year":"2024","unstructured":"Chen, J., Yuen, D., Xie, B., Yang, Y., Chen, G., Wu, Z., Yixing, L., Zhou, X., Liu, W., Wang, S., et al. (2024). Spa-Bench: a comprehensive benchmark for smartphone agent evaluation. In Proceedings of the 13th international conference on learning representations (pp. 1\u201338). Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=OZbFRNhpwr."},{"key":"93_CR92","unstructured":"Paglieri, D., Cupia\u0142, B., Coward, S., Piterbarg, U., Wolczyk, M., Khan, A., Pignatelli, E., Kuci\u0144ski, \u0141., Pinto, L., Fergus, R., et\u00a0al. (2024). BALROG: benchmarking agentic LLM and VLM reasoning on games. arXiv preprint. 
arXiv:2411.13543."},{"key":"93_CR93","unstructured":"Yang, J., Shao, S., Liu, D., & Shao, J. (2025). RiOSWorld: benchmarking the risk of multimodal computer-use agents. arXiv preprint. arXiv:2506.00618."},{"key":"93_CR94","unstructured":"Wen, H., Li, Y., Liu, G., Zhao, S., Yu, T., Li, T. J.-J., Jiang, S., Liu, Y., Zhang, Y., & Liu, Y. (2023). Empowering LLM to use smartphone for intelligent task automation. arXiv preprint. arXiv:2308.15272."},{"key":"93_CR95","doi-asserted-by":"crossref","unstructured":"Huq, F., Wang, Z. Z., Xu, F. F., Ou, T., Zhou, S., Bigham, J. P., & Neubig, G. (2025). Cowpilot: a framework for autonomous and human-agent collaborative web navigation. arXiv preprint. arXiv:2501.16609.","DOI":"10.18653\/v1\/2025.naacl-demo.17"},{"key":"93_CR96","unstructured":"Wang, J., Xu, H., Zhang, X., Yan, M., Zhang, J., Huang, F., & Sang, J. (2025). Mobile-Agent-V: learning mobile device operation through video-guided multi-agent collaboration. arXiv preprint. arXiv:2502.17110."},{"key":"93_CR97","first-page":"29490","volume-title":"Proceedings of the computer vision and pattern recognition conference","author":"Z. Huang","year":"2025","unstructured":"Huang, Z., Cheng, Z., Pan, J., Hou, Z., & Zhan, M. (2025). Spiritsight agent: advanced GUI agent with one look. In Proceedings of the computer vision and pattern recognition conference (pp. 29490\u201329500). Piscataway: IEEE."},{"key":"93_CR98","first-page":"2990","volume-title":"Findings of the Association for Computational Linguistics","author":"J. Kim","year":"2025","unstructured":"Kim, J., Kim, M.-S., Chung, J., Cho, J., Kim, J., Kim, S., Sim, G., & Yu, Y. (2025). Egospeak: learning when to speak for egocentric conversational agents in the wild. In Findings of the Association for Computational Linguistics (pp. 2990\u20133005). Stroudsburg: ACL."},{"key":"93_CR99","first-page":"1","volume-title":"Proceedings of the 38th international conference on neural information processing systems","author":"H.
Wang","year":"2024","unstructured":"Wang, H., Liu, P., Cai, W., Wu, M., Qian, Z., & Dong, H. (2024). MO-DDN: a coarse-to-fine attribute-based exploration agent for multi-object demand-driven navigation. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, & C. Zhang (Eds.), Proceedings of the 38th international conference on neural information processing systems (pp. 1\u201339). Red Hook: Curran Associates."},{"key":"93_CR100","first-page":"1","volume-title":"Proceedings of the 13th international conference on learning representations","author":"M. T. Matthews","year":"2024","unstructured":"Matthews, M. T., Beukman, M., Lu, C., & Foerster, J. N. (2024). Kinetix: investigating the training of general agents through open-ended physics-based control tasks. In Proceedings of the 13th international conference on learning representations (pp. 1\u201350). Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=zCxGCdzreM."},{"key":"93_CR101","unstructured":"Zhou, X., Liu, M., Zagar, B.L., Yurtsever, E., & Knoll, A. C. (2023). Vision language models in autonomous driving and intelligent transportation systems. arXiv preprint. arXiv:2310.14414."},{"key":"93_CR102","unstructured":"Wen, L., Yang, X., Fu, D., Wang, X., Cai, P., Li, X., Ma, T., Li, Y., Xu, L., Shang, D., et\u00a0al. (2023). On the road with GPT-4V(ision): early explorations of visual-language model on autonomous driving. arXiv preprint. arXiv:2311.05332."},{"key":"93_CR103","unstructured":"Wei, K., Zhou, Z., Wang, B., Araki, J., Lange, L., Huang, R., & Feng, Z. (2025). Premind: multi-agent video understanding for advanced indexing of presentation-style videos. arXiv preprint. arXiv:2503.00162."},{"key":"93_CR104","first-page":"1","volume-title":"Proceedings of the 13th international conference on learning representations","author":"G. Sun","year":"2025","unstructured":"Sun, G., Jin, M., Wang, Z., Wang, C.-L., Ma, S., Wang, Q., Geng, T., Wu, Y.N., Zhang, Y., & Liu, D.
(2025). Visual agents as fast and slow thinkers. In Proceedings of the 13th international conference on learning representations (pp. 1\u201326). Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=ncCuiD3KJQ."},{"key":"93_CR105","unstructured":"Fan, S., Guo, M.-H., & Yang, S. (2025). Agentic keyframe search for video question answering. arXiv preprint. arXiv:2503.16032."},{"key":"93_CR106","unstructured":"Wu, W., Zhu, Z., & Shou, M. Z. (2025). Automated movie generation via multi-agent CoT planning. arXiv preprint. arXiv:2503.07314."},{"key":"93_CR107","unstructured":"Xu, X., Mei, J., Li, C., Wu, Y., Yan, M., Lai, S., Zhang, J., & Wu, M. (2025). MM-StoryAgent: immersive narrated storybook video generation with a multi-agent paradigm across text, image and audio. arXiv preprint. arXiv:2503.05242."},{"key":"93_CR108","unstructured":"Liao, X., Zeng, X., Wang, L., Yu, G., Lin, G., & Zhang, C. (2025). Motionagent: fine-grained controllable video generation via motion field agent. arXiv preprint. arXiv:2502.03207."},{"key":"93_CR109","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"S. Karthik","year":"2024","unstructured":"Karthik, S., Roth, K., Mancini, M., & Akata, Z. (2024). Vision-by-language for training-free compositional image retrieval. In Proceedings of the 12th international conference on learning representations (pp. 1\u201316). Retrieved August 7, 2025, from https:\/\/openreview.net\/forum?id=EDPxCjXzSb."},{"key":"93_CR110","first-page":"1","volume-title":"Proceedings of the 38th international conference on neural information processing systems","author":"Y. Guo","year":"2024","unstructured":"Guo, Y., Zhuang, S., Li, K., Qiao, Y., & Wang, Y. (2024). Transagent: transfer vision-language foundation models with heterogeneous agent collaboration. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, & C.
Zhang (Eds.), Proceedings of the 38th international conference on neural information processing systems (pp. 1\u201326). Red Hook: Curran Associates."},{"key":"93_CR111","unstructured":"Gao, Z., Zhang, B., Li, P., Ma, X., Yuan, T., Fan, Y., Wu, Y., Jia, Y., Zhu, S.-C., & Li, Q. (2024). Multi-modal agent tuning: building a VLM-driven agent for efficient tool usage. arXiv preprint. arXiv:2412.15606."},{"key":"93_CR112","doi-asserted-by":"crossref","unstructured":"Yan, Y., Wang, S., Huo, J., Yu, P. S., Hu, X., & Wen, Q. (2025). Mathagent: leveraging a mixture-of-math-agent framework for real-world multimodal mathematical error detection. arXiv preprint. arXiv:2503.18132.","DOI":"10.18653\/v1\/2025.acl-industry.7"},{"key":"93_CR113","unstructured":"Zhang, Z., Pham, P., Zhao, W., Wan, K., Li, Y.-J., Zhou, J., Miranda, D., Kale, A., & Xu, C. (2024). Treat visual tokens as text? But your MLLM only needs fewer efforts to see. arXiv preprint. arXiv:2410.06169."},{"key":"93_CR114","unstructured":"Zhang, H., Guo, H., Guo, S., Cao, M., Huang, W., Liu, J., & Zhang, G. (2024). ING-VP: MLLMs cannot play easy vision-based games yet. arXiv preprint. arXiv:2410.06555."},{"issue":"4","key":"93_CR115","doi-asserted-by":"publisher","first-page":"591","DOI":"10.1007\/s00424-024-02984-3","volume":"477","author":"T. Vandemeulebroucke","year":"2025","unstructured":"Vandemeulebroucke, T. (2025). The ethics of artificial intelligence systems in healthcare and medicine: from a local to a global perspective, and back. Pfl\u00fcgers Archiv-European Journal of Physiology, 477(4), 591\u2013601.","journal-title":"Pfl\u00fcgers Archiv-European Journal of Physiology"},{"key":"93_CR116","first-page":"12289","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. Luo","year":"2024","unstructured":"Luo, Y., Shi, M., Khan, M.O., Afzal, M.M., Huang, H., Yuan, S., Tian, Y., Song, L., Kouhana, A., Elze, T., et al. (2024). 
Fairclip: harnessing fairness in vision-language learning. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 12289\u201312301). Piscataway: IEEE."},{"key":"93_CR117","first-page":"14203","volume-title":"Proceedings of the computer vision and pattern recognition conference","author":"J. Yang","year":"2025","unstructured":"Yang, J., Tan, R., Wu, Q., Zheng, R., Peng, B., Liang, Y., Gu, Y., Cai, M., Ye, S., Jang, J., et al. (2025). Magma: a foundation model for multimodal AI agents. In Proceedings of the computer vision and pattern recognition conference (pp. 14203\u201314214). Piscataway: IEEE."},{"issue":"3","key":"93_CR118","doi-asserted-by":"publisher","first-page":"323","DOI":"10.3233\/AIC-220301","volume":"37","author":"C. Leturc","year":"2024","unstructured":"Leturc, C., & Bonnet, G. (2024). Using n-ary multi-modal logics in argumentation frameworks to reason about ethics. AI Communications, 37(3), 323\u2013355.","journal-title":"AI Communications"},{"key":"93_CR119","unstructured":"Cemri, M., Pan, M. Z., Yang, S., Agrawal, L. A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., et\u00a0al. (2025). Why do multi-agent LLM systems fail? arXiv preprint. arXiv:2503.13657."},{"key":"93_CR120","unstructured":"Durante, Z., Huang, Q., Wake, N., Gong, R., Park, J. S., Sarkar, B., Taori, R., Noda, Y., Terzopoulos, D., Choi, Y., et\u00a0al. (2024). Agent AI: surveying the horizons of multimodal interaction. arXiv preprint. 
arXiv:2401.03568."}],"updated-by":[{"DOI":"10.1007\/s44267-025-00104-y","type":"correction","label":"Correction","source":"publisher","updated":{"date-parts":[[2026,1,14]],"date-time":"2026-01-14T00:00:00Z","timestamp":1768348800000}}],"container-title":["Visual Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-025-00093-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44267-025-00093-y","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-025-00093-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,15]],"date-time":"2026-01-15T08:08:18Z","timestamp":1768464498000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44267-025-00093-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,26]]},"references-count":120,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["93"],"URL":"https:\/\/doi.org\/10.1007\/s44267-025-00093-y","relation":{},"ISSN":["2097-3330","2731-9008"],"issn-type":[{"value":"2097-3330","type":"print"},{"value":"2731-9008","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,26]]},"assertion":[{"value":"27 April 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 October 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 October 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 November 2025","order":4,"name":"first_online","label":"First 
Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 December 2025","order":6,"name":"change_date","label":"Change Date","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Update","order":7,"name":"change_type","label":"Change Type","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The original online version of this article was revised: The name of author Junlin Xie has been corrected.","order":8,"name":"change_details","label":"Change Details","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 January 2026","order":9,"name":"change_date","label":"Change Date","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Correction","order":10,"name":"change_type","label":"Change Type","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"A Correction to this paper has been published:","order":11,"name":"change_details","label":"Change Details","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"https:\/\/doi.org\/10.1007\/s44267-025-00104-y","URL":"https:\/\/doi.org\/10.1007\/s44267-025-00104-y","order":12,"name":"change_details","label":"Change Details","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors have no competing interests to declare that are relevant to the content of this article.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"24"}}