{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T16:06:25Z","timestamp":1775145985932,"version":"3.50.1"},"reference-count":107,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T00:00:00Z","timestamp":1764115200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T00:00:00Z","timestamp":1764115200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2022ZD0161300"],"award-info":[{"award-number":["2022ZD0161300"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["U24A20325"],"award-info":[{"award-number":["U24A20325"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62321005"],"award-info":[{"award-number":["62321005"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62376134"],"award-info":[{"award-number":["62376134"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Vis. Intell."],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Large language models (LLMs) have opened up new possibilities for intelligent agents, endowing them with human-like thinking and cognitive abilities. In this work, we delve into the potential of large language models (LLMs) in autonomous driving (AD). We introduce DriveMLM, an LLM-based AD framework that can perform close-loop autonomous driving in realistic simulators. To this end, (1) we bridge the gap between the language decisions and the vehicle control commands by standardizing the decision states according to the off-the-shelf motion planning module. (2) We employ a multimodal LLM (MLLM) to model the behavior planning module of a module AD system, which uses driving rules, user commands, and inputs from various sensors (e.g., camera, LiDAR) as input and makes driving decisions and provide explanations. This model can plug-and-play in existing AD systems such as Autopilot and Apollo for close-loop driving. (3) We design an effective data engine to collect a dataset that includes decision state and corresponding explanation annotation for model training and evaluation. We conduct extensive experiments and show that replacing the decision-making modules of the Autopilot and Apollo with DriveMLM resulted in significant improvements of 3.2 and 4.7 points on the CARLA Town05 Long, respectively, demonstrating the effectiveness of our model. 
We hope this work can serve as a baseline for autonomous driving with LLMs.<\/jats:p>","DOI":"10.1007\/s44267-025-00095-w","type":"journal-article","created":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T01:44:31Z","timestamp":1764121471000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["DriveMLM: aligning multi-modal large language models with behavioral planning states for autonomous driving"],"prefix":"10.1007","volume":"3","author":[{"given":"Erfei","family":"Cui","sequence":"first","affiliation":[]},{"given":"Wenhai","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Zhiqi","family":"Li","sequence":"additional","affiliation":[]},{"given":"Jiangwei","family":"Xie","sequence":"additional","affiliation":[]},{"given":"Haoming","family":"Zou","sequence":"additional","affiliation":[]},{"given":"Hanming","family":"Deng","sequence":"additional","affiliation":[]},{"given":"Gen","family":"Luo","sequence":"additional","affiliation":[]},{"given":"Lewei","family":"Lu","sequence":"additional","affiliation":[]},{"given":"Xizhou","family":"Zhu","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6785-0785","authenticated-orcid":false,"given":"Jifeng","family":"Dai","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,11,26]]},"reference":[{"key":"95_CR1","unstructured":"DriveLM contributors (2023). DriveLM: drive on language. Retrieved September 22, 2025, from https:\/\/github.com\/OpenDriveLab\/DriveLM"},{"key":"95_CR2","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"L. Wen","year":"2024","unstructured":"Wen, L., Fu, D., Li, X., Cai, X., Ma, T., Cai, P., Dou, M., Shi, B., He, L., & Qiao, Y. (2024). DiLu: a knowledge-driven approach to autonomous driving with large language models. In Proceedings of the 12th international conference on learning representations (pp. 1\u201320). Retrieved September 22, 2025, from https:\/\/openreview.net\/forum?id=OqTMUPuLuC."},{"key":"95_CR3","first-page":"14093","volume-title":"Proceedings of the IEEE international conference on robotics and automation","author":"L. Chen","year":"2023","unstructured":"Chen, L., Sinavski, O., H\u00fcnermann, J., Karnsund, A., Willmott, A. J., Birch, D., Maund, D., & Shotton, J. (2023). Driving with LLMs: fusing object-level vector modality for explainable autonomous driving. In Proceedings of the IEEE international conference on robotics and automation (pp. 14093\u201314100). Piscataway: IEEE."},{"key":"95_CR4","first-page":"22442","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"S. Wang","year":"2025","unstructured":"Wang, S., Yu, Z., Jiang, X., Lan, S., Shi, M., Chang, N., Kautz, J., Li, Y., & Alvarez, J. M. (2025). Omnidrive: a holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 22442\u201322452). Piscataway: IEEE."},{"key":"95_CR5","unstructured":"Zheng, Y., Xing, Z., Zhang, Q., Jin, B., Li, P., Zheng, Y., Xia, Z., Zhan, K., Lang, X., Chen, Y., et\u00a0al. (2024). PlanAgent: a multi-modal large language agent for closed-loop vehicle motion planning. arXiv preprint. arXiv:2406.01587."},{"key":"95_CR6","unstructured":"Baidu Apollo auto. 
Retrieved September 22, 2025, from https:\/\/github.com\/ApolloAuto\/apollo."},{"key":"95_CR7","unstructured":"The Autoware Foundation (2018). Autoware: Open-source software for urban autonomous driving. Retrieved September 22, 2025, from https:\/\/github.com\/CPFL\/Autoware."},{"key":"95_CR8","unstructured":"Fontana, F. (2021). Self-driving cars and openpilot: a complete overview of the framework. Master\u2019s thesis, Politecnico di Milano."},{"key":"95_CR9","first-page":"1","volume-title":"Proceedings of the 1st annual conference on robot learning","author":"A. Dosovitskiy","year":"2017","unstructured":"Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., & Koltun, V. (2017). CARLA: an open urban driving simulator. In Proceedings of the 1st annual conference on robot learning (pp. 1\u201316). Retrieved September 22, 2025, from http:\/\/proceedings.mlr.press\/v78\/dosovitskiy17a.html."},{"key":"95_CR10","unstructured":"Tesla (2023). Autopilot and full self-driving capability. Retrieved September 22, 2025, from https:\/\/www.tesla.com\/support\/autopilot."},{"key":"95_CR11","unstructured":"Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. Retrieved September 22, 2025, from https:\/\/openai.com\/index\/language-unsupervised\/."},{"key":"95_CR12","unstructured":"Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. Retrieved September 22, 2025, from https:\/\/api.semanticscholar.org\/CorpusID:160025533."},{"key":"95_CR13","first-page":"1877","volume-title":"Proceedings of the 34th international conference on neural information processing systems","author":"T. Brown","year":"2020","unstructured":"Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Proceedings of the 34th international conference on neural information processing systems (pp. 1877\u20131901). Red Hook: Curran Associates."},{"key":"95_CR14","first-page":"27730","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"L. Ouyang","year":"2022","unstructured":"Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 27730\u201327744). Red Hook: Curran Associates."},{"key":"95_CR15","unstructured":"OpenAI. (2023). GPT-4 technical report. Retrieved September 22, 2025, from https:\/\/cdn.openai.com\/papers\/gpt-4.pdf."},{"key":"95_CR16","first-page":"23716","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"J.-B. Alayrac","year":"2022","unstructured":"Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). Flamingo: a visual language model for few-shot learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 23716\u201323736). 
Red Hook: Curran Associates."},{"key":"95_CR17","first-page":"51993","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"S. Huang","year":"2023","unstructured":"Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., Lv, T., Cui, L., Mohammed, O. K., Liu, Q., et al. (2023). Language is not all you need: aligning perception with language models. In A. Oh, T. Naumann, A. Globerson, K. S\u00e1Nchez, A. Vazquez, & Y. Bengio (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 51993\u201352007). Red Hook: Curran Associates."},{"key":"95_CR18","first-page":"52819","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"H. Liu","year":"2023","unstructured":"Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. S\u00e1Nchez, A. Vazquez, & Y. Bengio (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 52819\u201352832). Red Hook: Curran Associates."},{"key":"95_CR19","first-page":"26286","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"H. Liu","year":"2024","unstructured":"Liu, H., Li, C., Li, Y., & Lee, Y. J. (2024). Improved baselines with visual instruction tuning. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 26286\u201326296). Piscataway: IEEE."},{"key":"95_CR20","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"D. Zhu","year":"2024","unstructured":"Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2024). MiniGPT-4: enhancing vision-language understanding with advanced large language models. In Proceedings of the 12th international conference on learning representations (pp. 1\u201320). Retrieved September 22, 2025, from https:\/\/openreview.net\/forum?id=1tZbq88f27."},{"key":"95_CR21","first-page":"49250","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"W. Dai","year":"2023","unstructured":"Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., & Hoi, S. (2023). InstructBLIP: towards general-purpose vision-language models with instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. S\u00e1Nchez, A. Vazquez, & Y. Bengio (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 49250\u201349267). Red Hook: Curran Associates."},{"key":"95_CR22","unstructured":"Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et\u00a0al. (2023). mPLUG-Owl: modularization empowers large language models with multimodality. arXiv preprint. arXiv:2304.14178."},{"key":"95_CR23","unstructured":"Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., & Zhao, R. (2023). Shikra: unleashing multimodal LLM\u2019s referential dialogue magic. arXiv preprint. arXiv:2306.15195."},{"key":"95_CR24","first-page":"61501","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"W. Wang","year":"2023","unstructured":"Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G., Luo, P., Lu, T., Zhou, J., Qiao, Y., et al. (2023). VisionLLM: large language model is also an open-ended decoder for vision-centric tasks. In A. Oh, T. Naumann, A. 
Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 61501\u201361513). Red Hook: Curran Associates."},{"key":"95_CR25","unstructured":"Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., & Wei, F. (2023). Kosmos-2: grounding multimodal large language models to the world. arXiv preprint. arXiv:2306.14824."},{"key":"95_CR26","first-page":"9579","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"X. Lai","year":"2024","unstructured":"Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., & Jia, J. (2024). LISA: reasoning segmentation via large language model. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 9579\u20139589). Piscataway: IEEE."},{"key":"95_CR27","unstructured":"Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., & Zhou, J. (2023). Qwen-VL: a versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint. arXiv:2308.12966."},{"key":"95_CR28","first-page":"24185","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Chen","year":"2024","unstructured":"Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al. (2024). InternVL: scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 24185\u201324198). Piscataway: IEEE."},{"issue":"12","key":"95_CR29","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-024-4231-5","volume":"67","author":"Z. Chen","year":"2024","unstructured":"Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al. (2024). How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. Science China. Information Sciences, 67(12), 220101.","journal-title":"Science China. Information Sciences"},{"key":"95_CR30","doi-asserted-by":"publisher","first-page":"7421","DOI":"10.18653\/v1\/2024.emnlp-main.422","volume-title":"Proceedings of the 2024 conference on empirical methods in natural language processing","author":"Y. Li","year":"2024","unstructured":"Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). Eagle-2: faster inference of language models with dynamic draft trees. In Proceedings of the 2024 conference on empirical methods in natural language processing (pp. 7421\u20137432). Stroudsburg: ACL."},{"key":"95_CR31","doi-asserted-by":"publisher","first-page":"11975","DOI":"10.1007\/978-3-030-96530-3","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"X. Zhai","year":"2023","unstructured":"Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid loss for language image pre-training. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 11975\u201311986). Piscataway: IEEE."},{"key":"95_CR32","first-page":"11976","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Liu","year":"2022","unstructured":"Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 11976\u201311986). 
Piscataway: IEEE."},{"key":"95_CR33","unstructured":"Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., & Qiao, Y. (2023). VideoChat: chat-centric video understanding. arXiv preprint. arXiv:2305.06355."},{"key":"95_CR34","doi-asserted-by":"publisher","first-page":"543","DOI":"10.18653\/v1\/2023.emnlp-demo.49","volume-title":"Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations","author":"H. Zhang","year":"2023","unstructured":"Zhang, H., Li, X., & Bing, L. (2023). Video-LLaMA: an instruction-tuned audio-visual language model for video understanding. In Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations (pp. 543\u2013553). Stroudsburg: ACL."},{"key":"95_CR35","first-page":"1","volume-title":"Proceedings of the 41st international conference on machine learning","author":"S. Wu","year":"2024","unstructured":"Wu, S., Fei, H., Qu, L., Ji, W., & Chua, T.-S. (2024). Next-GPT: any-to-any multimodal LLM. In A. Krause, E. Brunskill, C. Szepesv\u00e1ri, K. Chaudhuri, & J. Zhu (Eds.), Proceedings of the 41st international conference on machine learning (pp. 1\u201314). Retrieved September 22, 2025, from https:\/\/icml.cc\/virtual\/2024\/poster\/34200."},{"key":"95_CR36","unstructured":"Junqing, H., Kunhao, P., Xiaoqun, D., Zhuoyang, S., Yibo, L., Yuxin, L., Hao, W., Qianguo, S., Songxin, Z., Zejian, X., et\u00a0al. (2023). Never lost in the middle: improving large language models via attention strengthening question answering. arXiv preprint. arXiv:2311.09198."},{"key":"95_CR37","first-page":"52","volume-title":"Proceedings of the 18th European conference on computer vision workshops","author":"S. Zhang","year":"2024","unstructured":"Zhang, S., Sun, P., Chen, S., Xiao, M., Shao, W., Zhang, W., Chen, K., & Luo, P. (2024). GPT4RoI: instruction tuning large language model on region-of-interest. In A. Del Bue, C. Canton, J. Pont-Tuset, & T. Tommasi (Eds.), Proceedings of the 18th European conference on computer vision workshops (pp. 52\u201370). Cham: Springer."},{"key":"95_CR38","unstructured":"Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., & Duan, N. (2023). Visual ChatGPT: talking, drawing and editing with visual foundation models. arXiv preprint. arXiv:2303.04671."},{"key":"95_CR39","unstructured":"Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed, F., Liu, Z., Liu, C., Zeng, M., & Wang, L. (2023). MM-REACT: prompting ChatGPT for multimodal reasoning and action. arXiv preprint. arXiv:2303.11381."},{"key":"95_CR40","unstructured":"Shen, Y., Song, K., Tan, X., Li, D., Lu, W., & Zhuang, Y. (2023). HuggingGPT: solving AI tasks with ChatGPT and its friends in huggingface. arXiv preprint. arXiv:2303.17580."},{"key":"95_CR41","unstructured":"Liu, Z., He, Y., Wang, W., Wang, W., Wang, Y., Chen, S., Zhang, Q., Yang, Y., Li, Q., Yu, J., et\u00a0al. (2023). InternChat: solving vision-centric tasks by interacting with chatbots beyond language. arXiv preprint. arXiv:2305.05662."},{"key":"95_CR42","first-page":"11854","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"D. Sur\u00eds","year":"2023","unstructured":"Sur\u00eds, D., Menon, S., & Vondrick, C. (2023). ViperGPT: visual inference via python execution for reasoning. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 11854\u201311864). 
Piscataway: IEEE."},{"key":"95_CR43","first-page":"89","volume-title":"Proceedings of the 18th European conference on computer vision","author":"Z. Liu","year":"2024","unstructured":"Liu, Z., Lai, Z., Gao, Z., Cui, E., Zhu, X., Lu, L., Chen, Q., Qiao, Y., Dai, J., & Wang, W. (2024). ControlLLM: augment language models with tools by searching on graphs. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 89\u2013105). Cham: Springer."},{"key":"95_CR44","first-page":"71995","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"R. Yang","year":"2023","unstructured":"Yang, R., Song, L., Li, Y., Zhao, S., Ge, Y., Li, X., & Shan, Y. (2023). GPT4Tools: teaching large language model to use tools via self-instruction. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 71995\u201372007). Red Hook: Curran Associates."},{"key":"95_CR45","first-page":"1","volume-title":"Proceedings of the 37th conference on neural information processing systems","author":"G. Li","year":"2023","unstructured":"Li, G., Hammoud, H. A. A. K., Itani, H., Khizbullin, D., & Ghanem, B. (2023). Camel: communicative agents for \u201cmind\u201d exploration of large language model society. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th conference on neural information processing systems (pp. 1\u201318). Red Hook: Curran Associates."},{"key":"95_CR46","unstructured":"Yang, H., Yue, S., & He, Y. (2023). Auto-GPT for online decision making: benchmarks and additional opinions. arXiv preprint. arXiv:2306.02224."},{"key":"95_CR47","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"S. Hong","year":"2024","unstructured":"Hong, S., Zheng, X., Chen, J., Cheng, Y., Zhang, C., Wang, Z., Yau, S. K. S., Lin, Z., Zhou, L., Ran, C., et al. (2024). MetaGPT: meta programming for multi-agent collaborative framework. In Proceedings of the 12th international conference on learning representations (pp. 1\u201329). Retrieved September 22, 2025, from https:\/\/openreview.net\/forum?id=VtmBAGCN7o."},{"key":"95_CR48","first-page":"1","volume-title":"Proceedings of the 36th annual ACM symposium on user interface software and technology","author":"J. S. Park","year":"2023","unstructured":"Park, J. S., O\u2019Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: interactive simulacra of human behavior. In Proceedings of the 36th annual ACM symposium on user interface software and technology (pp. 1\u201322). New York: ACM."},{"key":"95_CR49","first-page":"8469","volume-title":"Proceedings of the 40th international conference on machine learning","author":"D. Driess","year":"2023","unstructured":"Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. (2023). PaLM-E: an embodied multimodal language model. In A. Krause, E. Brunskill, C. Szepesv\u00e1ri, K. Chaudhuri, & J. Zhu (Eds.), Proceedings of the 40th international conference on machine learning (pp. 8469\u20138488). 
Retrieved September 22, 2025, from https:\/\/proceedings.mlr.press\/v202\/driess23a.html."},{"key":"95_CR50","first-page":"25081","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"Y. Mu","year":"2023","unstructured":"Mu, Y., Zhang, Q., Hu, M., Wang, W., Ding, M., Jin, J., Wang, B., Dai, J., Qiao, Y., & Luo, P. (2023). EmbodiedGPT: vision-language pre-training via embodied chain of thought. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 25081\u201325094). Red Hook: Curran Associates."},{"key":"95_CR51","volume-title":"Proceedings of the robotics: science and systems conference","author":"A. Brohan","year":"2023","unstructured":"Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. (2023). RT-1: robotics transformer for real-world control at scale. In Proceedings of the robotics: science and systems conference. Robotics: RSS Foundation."},{"key":"95_CR52","first-page":"2165","volume-title":"Proceedings of the conference on robot learning","author":"A. Brohan","year":"2023","unstructured":"Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al. (2023). RT-2: vision-language-action models transfer web knowledge to robotic control. In J. Tan, M. Toussaint, & K. Darvish (Eds.), Proceedings of the conference on robot learning (pp. 2165\u20132183). Retrieved September 22, 2025, from https:\/\/proceedings.mlr.press\/v229\/zitkovich23a.html."},{"key":"95_CR53","first-page":"6892","volume-title":"Proceedings of the IEEE international conference on robotics and automation","author":"A. Padalkar","year":"2024","unstructured":"Padalkar, A., Pooley, A., Jain, A., Bewley, A., Herzog, A., Irpan, A., Khazatsky, A., Rai, A., Singh, A., Brohan, A., et al. (2024). Open X-embodiment: robotic learning datasets and RT-X models. In Proceedings of the IEEE international conference on robotics and automation (pp. 6892\u20136903). Piscataway: IEEE."},{"key":"95_CR54","first-page":"1","volume-title":"Proceedings of the 17th European conference on computer vision","author":"Z. Li","year":"2022","unstructured":"Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Qiao, Y., & Dai, J. (2022). BEVFormer: learning bird\u2019s-eye-view representation from multi-camera images via spatiotemporal transformers. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 1\u201318). Cham: Springer."},{"key":"95_CR55","first-page":"17830","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"C. Yang","year":"2023","unstructured":"Yang, C., Chen, Y., Tian, H., Tao, C., Zhu, X., Zhang, Z., Huang, G., Li, H., Qiao, Y., Lu, L., et al. (2023). BEVFormer V2: adapting modern image backbones to bird\u2019s-eye-view recognition via perspective supervision. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 17830\u201317839). Piscataway: IEEE."},{"key":"95_CR56","first-page":"10421","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"T. Liang","year":"2022","unstructured":"Liang, T., Xie, H., Yu, K., Xia, Z., Lin, Z., Wang, Y., Tang, T., Wang, B., & Tang, Z. 
(2022). BEVFusion: a simple and robust LiDAR-camera fusion framework. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 10421\u201310434). Red Hook: Curran Associates."},{"key":"95_CR57","first-page":"3235","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision workshops","author":"A. Singh","year":"2023","unstructured":"Singh, A., & Bankiti, V. (2023). Surround-view vision-based 3D detection for autonomous driving: a survey. In Proceedings of the IEEE\/CVF international conference on computer vision workshops (pp. 3235\u20133244). Piscataway: IEEE."},{"key":"95_CR58","first-page":"8406","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"W. Tong","year":"2023","unstructured":"Tong, W., Sima, C., Wang, T., Chen, L., Wu, S., Deng, H., Gu, Y., Lu, L., Luo, P., Lin, D., et al. (2023). Scene as occupancy. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 8406\u20138415). Piscataway: IEEE."},{"issue":"7","key":"95_CR59","doi-asserted-by":"publisher","first-page":"11814","DOI":"10.1109\/TNNLS.2024.3495045","volume":"36","author":"Y. Shi","year":"2025","unstructured":"Shi, Y., Jiang, K., Li, J., Wen, J., Qian, Z., Yang, M., Wang, K., & Yang, D. (2025). Grid-centric traffic scenario perception for autonomous driving: a comprehensive review. IEEE Transactions on Neural Networks and Learning Systems, 36(7), 11814\u201311834.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"95_CR60","unstructured":"Li, Z., Yu, Z., Austin, D., Fang, M., Lan, S., Kautz, J., & Alvarez, J. M. (2023). FB-OCC: 3D occupancy prediction based on forward-backward view transformation. arXiv preprint. arXiv:2307.01492."},{"key":"95_CR61","unstructured":"Renz, K., Chen, L., Marcu, A.-M., H\u00fcnermann, J., Hanotte, B., Karnsund, A., Shotton, J., Arani, E., & Sinavski, O. (2024). CarLLaVA: vision language models for camera-only closed-loop driving. arXiv preprint. arXiv:2406.10165."},{"key":"95_CR62","first-page":"11993","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"K. Renz","year":"2025","unstructured":"Renz, K., Chen, L., Arani, E., & Sinavski, O. (2025). SimLingo: vision-only closed-loop autonomous driving with language-action alignment. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 11993\u201312003). Piscataway: IEEE."},{"key":"95_CR63","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-018-9850-9","volume":"62","author":"S. Chen","year":"2019","unstructured":"Chen, S., Jian, Z., Huang, Y., Chen, Y., Zhou, Z., & Zheng, N. (2019). Autonomous driving: cognitive construction and situation understanding. Science China. Information Sciences, 62, 81101.","journal-title":"Science China. Information Sciences"},{"key":"95_CR64","first-page":"17853","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. Hu","year":"2023","unstructured":"Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., et al. (2023). Planning-oriented autonomous driving. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 17853\u201317862). 
Piscataway: IEEE."},{"key":"95_CR65","first-page":"87","volume-title":"Proceedings of the 18th European conference on computer vision","author":"W. Zheng","year":"2024","unstructured":"Zheng, W., Song, R., Guo, X., Zhang, C., & Chen, L. (2024). GenAD: generative end-to-end autonomous driving. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 87\u2013104). Cham: Springer."},{"key":"95_CR66","first-page":"3962","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"E. Vinitsky","year":"2022","unstructured":"Vinitsky, E., Lichtl\u00e9, N., Yang, X., Amos, B., & Foerster, J. (2022). Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 3962\u20133974). Red Hook: Curran Associates."},{"key":"95_CR67","unstructured":"Zhou, M., Luo, J., Villella, J., Yang, Y., Rusu, D., Miao, J., Zhang, W., Alban, M., Fadakar, I., Chen, Z., et\u00a0al. (2020). SMARTS: scalable multi-agent reinforcement learning training school for autonomous driving. arXiv preprint. arXiv:2010.09776."},{"key":"95_CR68","first-page":"21983","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"X. Jia","year":"2023","unstructured":"Jia, X., Wu, P., Chen, L., Xie, J., He, C., Yan, J., & Li, H. (2023). Think twice before driving: towards scalable decoders for end-to-end autonomous driving. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 21983\u201321994). Piscataway: IEEE."},{"key":"95_CR69","first-page":"13723","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"H. Shao","year":"2023","unstructured":"Shao, H., Wang, L., Chen, R., Waslander, S. L., Li, H., & Liu, Y. (2023). ReasonNet: end-to-end driving with temporal and global reasoning. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 13723\u201313733). Piscataway: IEEE."},{"key":"95_CR70","first-page":"726","volume-title":"Proceedings of the conference on robot learning","author":"H. Shao","year":"2023","unstructured":"Shao, H., Wang, L., Chen, R., Li, H., & Liu, Y. (2023). Safety-enhanced autonomous driving using interpretable sensor fusion transformer. In Proceedings of the conference on robot learning (pp. 726\u2013737). Retrieved September 22, 2025, from https:\/\/proceedings.mlr.press\/v205\/shao23a.html."},{"key":"95_CR71","first-page":"8306","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"B. Jiang","year":"2023","unstructured":"Jiang, B., Chen, S., Xu, Q., Liao, B., Chen, J., Zhou, H., Zhang, Q., Liu, W., Huang, C., & Wang, X. (2023). VAD: vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 8306\u20138316). Piscataway: IEEE."},{"issue":"11","key":"95_CR72","doi-asserted-by":"publisher","first-page":"12878","DOI":"10.1109\/TPAMI.2022.3200245","volume":"45","author":"K. Chitta","year":"2023","unstructured":"Chitta, K., Prakash, A., Jaeger, B., Yu, Z., Renz, K., & Geiger, A. (2023). 
TransFuser: imitation with transformer-based sensor fusion for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11), 12878\u201312895.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"95_CR73","first-page":"66","volume-title":"Proceedings of the conference on robot learning","author":"D. Chen","year":"2020","unstructured":"Chen, D., Zhou, B., Koltun, V., & Kr\u00e4henb\u00fchl, P. (2020). Learning by cheating. In Proceedings of the conference on robot learning (pp. 66\u201375). Retrieved September 22, 2025, from https:\/\/proceedings.mlr.press\/v100\/chen20a.html."},{"key":"95_CR74","first-page":"15590","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"D. Chen","year":"2021","unstructured":"Chen, D., Koltun, V., & Kr\u00e4henb\u00fchl, P. (2021). Learning to drive from a world on rails. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 15590\u201315599). Piscataway: IEEE."},{"key":"95_CR75","first-page":"17222","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"D. Chen","year":"2022","unstructured":"Chen, D., & Kr\u00e4henb\u00fchl, P. (2022). Learning from all vehicles. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 17222\u201317231). Piscataway: IEEE."},{"key":"95_CR76","volume-title":"Proceedings of the 38th international conference on neural information processing systems (datasets and benchmarks track)","author":"X. Jia","year":"2024","unstructured":"Jia, X., Yang, Z., Li, Q., Zhang, Z., & Yan, J. (2024). Bench2Drive: towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In Proceedings of the 38th international conference on neural information processing systems (datasets and benchmarks track). Retrieved September 22, 2025, from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2024\/hash\/017761f94a1cd66d01c041aff85492c4-Abstract-Datasets_and_Benchmarks_Track.html."},{"key":"95_CR77","unstructured":"Zhang, T., Jin, T., Wang, L., Liu, J., Liang, S., Zhang, M., Liu, A., & Liu, X. (2025). Bench2ADVLM: a closed-loop benchmark for vision-language models in autonomous driving. arXiv preprint. arXiv:2508.02028."},{"issue":"10","key":"95_CR78","doi-asserted-by":"publisher","first-page":"8186","DOI":"10.1109\/LRA.2024.3440097","volume":"9","author":"Z. Xu","year":"2024","unstructured":"Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K. K., Li, Z., & Zhao, H. (2024). DriveGPT4: interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 9(10), 8186\u20138193.","journal-title":"IEEE Robotics and Automation Letters"},{"key":"95_CR79","unstructured":"Mao, J., Qian, Y., Zhao, H., & Wang, Y. (2023). GPT-driver: learning to drive with GPT. arXiv preprint. arXiv:2310.01415."},{"key":"95_CR80","first-page":"5154","volume-title":"Proceedings of the 26th IEEE international conference on intelligent transportation systems","author":"J. Liu","year":"2023","unstructured":"Liu, J., Hang, P., Qi, X., Wang, J., & Sun, J. (2023). MTD-GPT: a multi-task decision-making GPT model for autonomous driving at unsignalized intersections. In Proceedings of the 26th IEEE international conference on intelligent transportation systems (pp. 5154\u20135161). Piscataway: IEEE."},{"key":"95_CR81","unstructured":"Sha, H., Mu, Y., Jiang, Y., Chen, L., Xu, C., Luo, P., Li, S. 
E., Tomizuka, M., Zhan, W., & Ding, M. (2023). LanguageMPC: large language models as decision makers for autonomous driving. arXiv preprint. arXiv:2310.03026."},{"key":"95_CR82","first-page":"15120","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"H. Shao","year":"2024","unstructured":"Shao, H., Hu, Y., Wang, L., Waslander, S. L., Liu, Y., & Li, H. (2024). Lmdrive: closed-loop end-to-end driving with large language models. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 15120\u201315130). Piscataway: IEEE."},{"key":"95_CR83","first-page":"10020","volume-title":"Proceedings of the IEEE\/RSJ international conference on intelligent robots and systems","author":"P. Paul","year":"2024","unstructured":"Paul, P., Garg, A., Choudhary, T., Singh, A. K., & Krishna, K. M. (2024). Lego-drive: language-enhanced goal-oriented closed-loop end-to-end autonomous driving. In Proceedings of the IEEE\/RSJ international conference on intelligent robots and systems (pp. 10020\u201310026). Piscataway: IEEE."},{"key":"95_CR84","unstructured":"Fu, H., Zhang, D., Zhao, Z., Cui, J., Liang, D., Zhang, C., Zhang, D., Xie, H., Wang, B., & Bai, X. (2025). Orion: a holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint. arXiv:2503.19755."},{"key":"95_CR85","first-page":"8748","volume-title":"Proceedings of the 38th international conference on machine learning","author":"A. Radford","year":"2021","unstructured":"Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th international conference on machine learning (pp. 8748\u20138763). Retrieved September 22, 2025, from http:\/\/proceedings.mlr.press\/v139\/radford21a.html."},{"key":"95_CR86","first-page":"8458","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"L. Fan","year":"2022","unstructured":"Fan, L., Pang, Z., Zhang, T., Wang, Y.-X., Zhao, H., Wang, F., Wang, N., & Zhang, Z. (2022). Embracing single stride 3D object detector with sparse transformer. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 8458\u20138468). Piscataway: IEEE."},{"key":"95_CR87","unstructured":"Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A., Zhang, W., Lu, P., He, C., Yue, X., et\u00a0al. (2023). LLaMA-Adapter V2: parameter-efficient visual instruction model. arXiv preprint. arXiv:2304.15010."},{"key":"95_CR88","first-page":"19358","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. Fang","year":"2023","unstructured":"Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., & Cao, Y. (2023). EVA: exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 19358\u201319369). Piscataway: IEEE."},{"key":"95_CR89","unstructured":"Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi\u00e8re, B., Goyal, N., Hambro, E., Azhar, F., et\u00a0al. (2023). LLaMA: open and efficient foundation language models. arXiv preprint. 
arXiv:2302.13971."},{"key":"95_CR90","first-page":"9403","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"H. Yang","year":"2023","unstructured":"Yang, H., He, T., Liu, J., Chen, H., Wu, B., Lin, B., He, X., & Ouyang, W. (2023). GD-MAE: generative decoder for MAE pre-training on LiDAR point clouds. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 9403\u20139414). Piscataway: IEEE."},{"key":"95_CR91","first-page":"1","volume-title":"Proceedings of the 35th international conference on neural information processing systems","author":"J. Mao","year":"2021","unstructured":"Mao, J., Niu, M., Jiang, C., Liang, H., Chen, J., Liang, X., Li, Y., Ye, C., Zhang, W., Li, Z., et al. (2021). One million scenes for autonomous driving: ONCE dataset. In Proceedings of the 35th international conference on neural information processing systems (pp. 1\u201313). Red Hook: Curran Associates."},{"key":"95_CR92","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s44267-024-00067-6","volume":"2","author":"Z. Gao","year":"2024","unstructured":"Gao, Z., Chen, Z., Cui, E., Ren, Y., Wang, W., Zhu, J., Tian, H., Ye, S., He, J., Zhu, X., et al. (2024). Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance. Visual Intelligence, 2, 1\u201317.","journal-title":"Visual Intelligence"},{"key":"95_CR93","unstructured":"Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et\u00a0al. (2024). Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint. arXiv:2412.05271."},{"key":"95_CR94","first-page":"311","volume-title":"Proceedings of the 40th annual meeting of the Association for Computational Linguistics","author":"K. Papineni","year":"2002","unstructured":"Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311\u2013318). Stroudsburg: ACL."},{"key":"95_CR95","first-page":"4566","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"R. Vedantam","year":"2015","unstructured":"Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4566\u20134575). Piscataway: IEEE."},{"key":"95_CR96","first-page":"65","volume-title":"Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization","author":"S. Banerjee","year":"2005","unstructured":"Banerjee, S., & Lavie, A. (2005). Meteor: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization (pp. 65\u201372). Stroudsburg: ACL."},{"key":"95_CR97","first-page":"911","volume-title":"Proceedings of the IEEE\/CVF winter conference on applications of computer vision workshops","author":"S. Xing","year":"2025","unstructured":"Xing, S., Qian, C., Wang, Y., Hua, H., Tian, K., Zhou, Y., & Tu, Z. (2025). OpenEMMA: open-source multimodal model for end-to-end autonomous driving. In Proceedings of the IEEE\/CVF winter conference on applications of computer vision workshops (pp. 
911\u2013919). Piscataway: IEEE."},{"key":"95_CR98","first-page":"8359","volume-title":"Proceedings of the AAAI conference on artificial intelligence","author":"D. Wu","year":"2025","unstructured":"Wu, D., Han, W., Wang, T., Liu, Y., Zhang, X., & Shen, J. (2025). Language prompt for autonomous driving. In Proceedings of the AAAI conference on artificial intelligence (pp. 8359\u20138367). Palo Alto: AAAI Press."},{"key":"95_CR99","first-page":"4542","volume-title":"Proceedings of the AAAI conference on artificial intelligence","author":"T. Qian","year":"2024","unstructured":"Qian, T., Chen, J., Zhuo, L., Jiao, Y., & Jiang, Y.-G. (2024). NuScenes-QA: a multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI conference on artificial intelligence (pp. 4542\u20134550). Palo Alto: AAAI Press."},{"key":"95_CR100","first-page":"7498","volume-title":"Proceedings of the IEEE\/CVF winter conference on applications of computer vision","author":"E. Sachdeva","year":"2024","unstructured":"Sachdeva, E., Agarwal, N., Chundi, S., Roelofs, S., Li, J., Dariush, B., Choi, C., & Kochenderfer, M. (2024). Rank2Tell: a multimodal driving dataset for joint importance ranking and reasoning. In Proceedings of the IEEE\/CVF winter conference on applications of computer vision (pp. 7498\u20137507). Piscataway: IEEE."},{"key":"95_CR101","unstructured":"Movva, R., Balachandar, S., Peng, K., Agostini, G., Garg, N., & Pierson, E. (2023). Large language models shape and are shaped by society: a survey of arXiv publication patterns. arXiv preprint. arXiv:2307.10700."},{"key":"95_CR102","first-page":"533","volume-title":"Proceedings of the 17th European conference on computer vision","author":"S. Hu","year":"2022","unstructured":"Hu, S., Chen, L., Wu, P., Li, H., Yan, J., & Tao, D. (2022). ST-P3: end-to-end vision-based autonomous driving via spatial-temporal feature learning. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 533\u2013549). Cham: Springer."},{"key":"95_CR103","first-page":"17853","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. Hu","year":"2023","unstructured":"Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., et al. (2023). Planning-oriented autonomous driving. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 17853\u201317862). Piscataway: IEEE."},{"key":"95_CR104","unstructured":"Zhai, J.-T., Feng, Z., Du, J., Mao, Y., Liu, J.-J., Tan, Z., Zhang, Y., Ye, X., & Wang, J. (2023). Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint. arXiv:2305.10430."},{"key":"95_CR105","first-page":"14864","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Li","year":"2024","unstructured":"Li, Z., Yu, Z., Lan, S., Li, J., Kautz, J., Lu, T., & Alvarez, J. M. (2024). Is ego status all you need for open-loop end-to-end autonomous driving? In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 14864\u201314873). Piscataway: IEEE."},{"key":"95_CR106","first-page":"15222","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Z. Zhang","year":"2021","unstructured":"Zhang, Z., Liniger, A., Dai, D., Yu, F., & Van Gool, L. (2021). 
End-to-end urban driving by imitating a reinforcement learning coach. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 15222\u201315232). Piscataway: IEEE."},{"key":"95_CR107","first-page":"20703","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"A. Hu","year":"2022","unstructured":"Hu, A., Corrado, G., Griffiths, N., Murez, Z., Gurau, C., Yeo, H., Kendall, A., Cipolla, R., & Shotton, J. (2022). Model-based imitation learning for urban driving. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 20703\u201320716). Red Hook: Curran Associates."}],"container-title":["Visual Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-025-00095-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44267-025-00095-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-025-00095-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T03:03:18Z","timestamp":1764126198000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44267-025-00095-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,26]]},"references-count":107,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["95"],"URL":"https:\/\/doi.org\/10.1007\/s44267-025-00095-w","relation":{},"ISSN":["2097-3330","2731-9008"],"issn-type":[{"value":"2097-3330","type":"print"},{"value":"2731-9008","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,26]]},"assertion":[{"value":"26 June 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 October 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 October 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 November 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Wenhai Wang is an Associate Editor at Visual Intelligence and was not involved in the editorial review of this article or the decision to publish it. The authors declare that they have no other competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"22"}}