{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T17:52:49Z","timestamp":1775065969950,"version":"3.50.1"},"reference-count":267,"publisher":"Association for Computing Machinery (ACM)","issue":"3","funder":[{"name":"National Key Research and Development Program of China","award":["2024YFC3307603"],"award-info":[{"award-number":["2024YFC3307603"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62476152 and U24B20180"],"award-info":[{"award-number":["62476152 and U24B20180"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>\n            The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT-4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. This survey offers a comprehensive review of the literature on world models. Generally, world models are regarded as tools for either understanding the present state of the world or predicting its future dynamics. This review presents a systematic categorization of world models, emphasizing two primary functions: (1) constructing internal representations to understand the mechanisms of the world, and (2) predicting future states to simulate and guide decision-making. Initially, we examine the current progress in these two categories. We then explore the application of world models in key domains, including autonomous driving, robotics, and social simulacra, with a focus on how each domain utilizes these aspects. Finally, we outline key challenges and provide insights into potential future research directions. 
We summarize the representative articles along with their code repositories in\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"url\" xlink:href=\"https:\/\/github.com\/tsinghua-fib-lab\/World-Model\">https:\/\/github.com\/tsinghua-fib-lab\/World-Model<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3746449","type":"journal-article","created":{"date-parts":[[2025,6,27]],"date-time":"2025-06-27T07:11:23Z","timestamp":1751008283000},"page":"1-38","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":22,"title":["Understanding World or Predicting Future? A Comprehensive Survey of World Models"],"prefix":"10.1145","volume":"58","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7985-6263","authenticated-orcid":false,"given":"Jingtao","family":"Ding","sequence":"first","affiliation":[{"name":"Tsinghua University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0187-6015","authenticated-orcid":false,"given":"Yunke","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tsinghua University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-9049-8483","authenticated-orcid":false,"given":"Yu","family":"Shang","sequence":"additional","affiliation":[{"name":"Tsinghua University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-3795-2505","authenticated-orcid":false,"given":"Yuheng","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tsinghua University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3236-6640","authenticated-orcid":false,"given":"Zefang","family":"Zong","sequence":"additional","affiliation":[{"name":"Tsinghua University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3279-7117","authenticated-orcid":false,"given":"Jie","family":"Feng","sequence":"additional","affiliation":[{"name":"Tsinghua 
University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1701-2588","authenticated-orcid":false,"given":"Yuan","family":"Yuan","sequence":"additional","affiliation":[{"name":"Electronic Engineering, Tsinghua University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-1110-2124","authenticated-orcid":false,"given":"Hongyuan","family":"Su","sequence":"additional","affiliation":[{"name":"Tsinghua University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4689-2289","authenticated-orcid":false,"given":"Nian","family":"Li","sequence":"additional","affiliation":[{"name":"Tsinghua University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-6009-5108","authenticated-orcid":false,"given":"Nicholas","family":"Sukiennik","sequence":"additional","affiliation":[{"name":"Tsinghua University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5720-4026","authenticated-orcid":false,"given":"Fengli","family":"Xu","sequence":"additional","affiliation":[{"name":"Tsinghua University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5617-1659","authenticated-orcid":false,"given":"Yong","family":"Li","sequence":"additional","affiliation":[{"name":"Tsinghua University","place":["Beijing, China"]}]}],"member":"320","published-online":{"date-parts":[[2025,9,9]]},"reference":[{"key":"e_1_3_3_2_2","unstructured":"Josh Achiam Steven Adler Sandhini Agarwal Lama Ahmad Ilge Akkaya Florencia Leoni Aleman Diogo Almeida Janko Altenschmidt Sam Altman Shyamal Anadkat et\u00a0al. 2023. Gpt-4 technical report. arXiv:2303.08774. Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_3_3_2","unstructured":"Ali Agha Kyohei Otsu Benjamin Morrell David D Fan Rohan Thakker Angel Santamaria-Navarro Sung-Kyun Kim Amanda Bouman Xianmei Lei Jeffrey Edlund et\u00a0al. 2021. 
Nebula: Quest for robotic autonomy in challenging environments; team costar at the darpa subterranean challenge. arXiv:2103.11470. Retrieved from https:\/\/arxiv.org\/abs\/2103.11470"},{"key":"e_1_3_3_4_2","unstructured":"Ilge Akkaya Marcin Andrychowicz Maciek Chociej Mateusz Litwin Bob McGrew Arthur Petron Alex Paino Matthias Plappert Glenn Powell Raphael Ribas et\u00a0al. 2019. Solving rubik\u2019s cube with a robot hand. arXiv:1910.07113. Retrieved from https:\/\/arxiv.org\/abs\/1910.07113"},{"key":"e_1_3_3_5_2","unstructured":"Altera AL Andrew Ahn Nic Becker Stephanie Carroll Nico Christie Manuel Cortes Arda Demirci Melissa Du Frankie Li Shuying Luo et\u00a0al. 2024. Project sid: Many-agent simulations toward AI civilization. arXiv:2411.00114. Retrieved from https:\/\/arxiv.org\/abs\/2411.00114"},{"key":"e_1_3_3_6_2","unstructured":"Jorge Aldaco Travis Armstrong Robert Baruch Jeff Bingham Sanky Chan Kenneth Draper Debidatta Dwibedi Chelsea Finn Pete Florence Spencer Goodrich et\u00a0al. 2024. ALOHA 2: An enhanced low-cost hardware for bimanual teleoperation. arXiv:2405.02292. Retrieved from https:\/\/arxiv.org\/abs\/2405.02292"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ITSC.2017.8317913"},{"key":"e_1_3_3_8_2","unstructured":"Genesis Authors. 2024. Genesis: A Generative and Universal Physics Engine for Robotics and Beyond. Retrieved June 11 2025 from https:\/\/github.com\/Genesis-Embodied-AI\/Genesis"},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00116"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1038\/s43588-024-00606-7"},{"key":"e_1_3_3_11_2","volume-title":"Dynamic Programming and Optimal Control: Volume I","author":"Bertsekas Dimitri","year":"2012","unstructured":"Dimitri Bertsekas. 2012. Dynamic Programming and Optimal Control: Volume I. Vol. 4. 
Athena scientific."},{"key":"e_1_3_3_12_2","unstructured":"Kevin Black Mitsuhiko Nakamoto Pranav Atreya Homer Walke Chelsea Finn Aviral Kumar and Sergey Levine. 2023. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv:2310.10639. Retrieved from https:\/\/arxiv.org\/abs\/2310.10639"},{"key":"e_1_3_3_13_2","first-page":"arXiv\u20132311","article-title":"Muvo: A multimodal generative world model for autonomous driving with geometric representations","author":"Bogdoll Daniel","year":"2023","unstructured":"Daniel Bogdoll, Yitian Yang, and J. Marius Z\u00f6llner. 2023. Muvo: A multimodal generative world model for autonomous driving with geometric representations. arXiv e-prints (2023), arXiv\u20132311.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0165-1889(98)00011-6"},{"key":"e_1_3_3_15_2","unstructured":"Tim Brooks Bill Peebles Connor Holmes Will DePue Yufei Guo Li Jing David Schnurr Joe Taylor Troy Luhman Eric Luhman Clarence Ng Ricky Wang and Aditya Ramesh. 2024. Video generation models as world simulators. (2024). Retrieved June 11 2025 from https:\/\/openai.com\/research\/video-generation-models-as-world-simulators"},{"key":"e_1_3_3_16_2","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et\u00a0al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877\u20131901.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_17_2","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"Bruce Jake","year":"2024","unstructured":"Jake Bruce, Michael D. 
Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et\u00a0al. 2024. Genie: Generative interactive environments. In Proceedings of the 41st International Conference on Machine Learning."},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00204"},{"key":"e_1_3_3_19_2","article-title":"Matterport3D: Learning from RGB-D data in indoor environments","author":"Chang Angel","year":"2017","unstructured":"Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. 2017. Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (2017). IEEE, 667\u2013676.","journal-title":"International Conference on 3D Vision"},{"key":"e_1_3_3_20_2","unstructured":"Chi-Lam Cheang Guangzeng Chen Ya Jing Tao Kong Hang Li Yifeng Li Yuxiao Liu Hongtao Wu Jiafeng Xu Yichu Yang et\u00a0al. 2024. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv:2410.06158. Retrieved from https:\/\/arxiv.org\/abs\/2410.06158"},{"key":"e_1_3_3_21_2","unstructured":"Jianyu Chen Bodi Yuan and Masayoshi Tomizuka. 2019. Model-free Deep Reinforcement Learning for Urban Autonomous Driving. arxiv:1904.09503. Retrieved from https:\/\/arxiv.org\/abs\/1904.09503"},{"key":"e_1_3_3_22_2","unstructured":"Zhili Cheng Zhitong Wang Jinyi Hu Shengding Hu An Liu Yuge Tu Pengkai Li Lei Shi Zhiyuan Liu and Maosong Sun. 2024. LEGENT: Open platform for embodied agents. arXiv:2404.18243. Retrieved from https:\/\/arxiv.org\/abs\/2404.18243"},{"key":"e_1_3_3_23_2","first-page":"027836492412736","article-title":"Diffusion policy: Visuomotor policy learning via action diffusion","author":"Chi Cheng","year":"2024","unstructured":"Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. 2024. 
Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 0, 0 (2024), 02783649241273668.","journal-title":"The International Journal of Robotics Research"},{"key":"e_1_3_3_24_2","unstructured":"Xiaowei Chi Hengyuan Zhang Chun-Kai Fan Xingqun Qi Rongyu Zhang Anthony Chen Chi-min Chan Wei Xue Wenhan Luo Shanghang Zhang et\u00a0al. 2024. EVA: An embodied world model for future video anticipation. arXiv:2410.15461. Retrieved from https:\/\/arxiv.org\/abs\/2410.15461"},{"key":"e_1_3_3_25_2","unstructured":"Joseph Cho Fachrina Dewi Puspitasari Sheng Zheng Jingyao Zheng Lik-Hang Lee Tae-Ho Kim Choong Seon Hong and Chaoning Zhang. 2024. Sora as an agi world model? A complete survey on text-to-video generation. arXiv:2403.05131. Retrieved from https:\/\/arxiv.org\/abs\/2403.05131"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/IV47402.2020.9304564"},{"key":"e_1_3_3_27_2","unstructured":"Wei Chow Jiageng Mao Boyi Li Daniel Seita Vitor Guizilini and Yue Wang. 2025. PhysBench: Benchmarking and enhancing vision-language models for physical world understanding. arXiv:2501.16411. Retrieved from https:\/\/arxiv.org\/abs\/2501.16411"},{"key":"e_1_3_3_28_2","article-title":"Deep reinforcement learning in a handful of trials using probabilistic dynamics models","volume":"31","author":"Chua Kurtland","year":"2018","unstructured":"Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. 2018. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. 
Advances in Neural Information Processing Systems 31 (2018).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2019.8793868"},{"issue":"1","key":"e_1_3_3_30_2","first-page":"99","article-title":"A feasibility study on RUNWAY GEN-2 for generating realistic style images","volume":"16","author":"Cui Yifan","year":"2024","unstructured":"Yifan Cui, Xinyi Shan, and Jeanhun Chung. 2024. A feasibility study on RUNWAY GEN-2 for generating realistic style images. International Journal of Internet, Broadcasting and Communication 16, 1 (2024), 99\u2013105.","journal-title":"International Journal of Internet, Broadcasting and Communication"},{"key":"e_1_3_3_31_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-024-08025-4"},{"key":"e_1_3_3_32_2","first-page":"5982","article-title":"ProcTHOR: Large-scale embodied AI using procedural generation","volume":"35","author":"Deitke Matt","year":"2022","unstructured":"Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. 2022. ProcTHOR: Large-scale embodied AI using procedural generation. Advances in Neural Information Processing Systems 35 (2022), 5982\u20135994.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3641519.3657513"},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10514-018-9748-z"},{"key":"e_1_3_3_35_2","unstructured":"Alexey Dosovitskiy. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_3_36_2","first-page":"5547","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Du Nan","year":"2022","unstructured":"Nan Du, Yanping Huang, Andrew M. 
Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et\u00a0al. 2022. Glam: Efficient scaling of language models with mixture-of-experts. In Proceedings of the International Conference on Machine Learning. PMLR, 5547\u20135569."},{"key":"e_1_3_3_37_2","article-title":"Learning universal policies via text-guided video generation","volume":"36","author":"Du Yilun","year":"2024","unstructured":"Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. 2024. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_38_2","unstructured":"Haoyi Duan Hong-Xing Yu Sirui Chen Li Fei-Fei and Jiajun Wu. 2025. Worldscore: A unified evaluation benchmark for world generation. arXiv:2504.00983. Retrieved from https:\/\/arxiv.org\/abs\/2504.00983"},{"key":"e_1_3_3_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/MRA.2006.1638022"},{"key":"e_1_3_3_40_2","article-title":"Video prediction models as rewards for reinforcement learning","volume":"36","author":"Escontrela Alejandro","year":"2024","unstructured":"Alejandro Escontrela, Ademi Adeniji, Wilson Yan, Ajay Jain, Xue Bin Peng, Ken Goldberg, Youngwoon Lee, Danijar Hafner, and Pieter Abbeel. 2024. Video prediction models as rewards for reinforcement learning. 
Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00675"},{"key":"e_1_3_3_42_2","first-page":"18343","article-title":"Minedojo: Building open-ended embodied agents with internet-scale knowledge","volume":"35","author":"Fan Linxi","year":"2022","unstructured":"Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. 2022. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems 35 (2022), 18343\u201318362.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_43_2","unstructured":"Jie Feng Yuwei Du Tianhui Liu Siqi Guo Yuming Lin and Yong Li. 2024. CityGPT: Empowering urban spatial cognition of large language models. arXiv:2406.13948. Retrieved from https:\/\/arxiv.org\/abs\/2406.13948"},{"key":"e_1_3_3_44_2","unstructured":"Jie Feng Jinwei Zeng Qingyue Long Hongyi Chen Jie Zhao Yanxin Xi Zhilun Zhou Yuan Yuan Shengyuan Wang Qingbin Zeng et\u00a0al. 2025. A survey of large language model-powered spatial intelligence across scales: Advances in embodied agents smart cities and earth science. arXiv:2504.09848. Retrieved from https:\/\/arxiv.org\/abs\/2504.09848"},{"key":"e_1_3_3_45_2","unstructured":"Jie Feng Jun Zhang Junbo Yan Xin Zhang Tianjian Ouyang Tianhui Liu Yuwei Du Siqi Guo and Yong Li. 2024. CityBench: Evaluating the capabilities of large language model as world model. arXiv:2406.13945. 
Retrieved from https:\/\/arxiv.org\/abs\/2406.13945"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-023-05732-2"},{"key":"e_1_3_3_47_2","article-title":"Unsupervised learning for physical interaction through video prediction","volume":"29","author":"Finn Chelsea","year":"2016","unstructured":"Chelsea Finn, Ian Goodfellow, and Sergey Levine. 2016. Unsupervised learning for physical interaction through video prediction. Advances in Neural Information Processing Systems 29 (2016).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2017.7989324"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72933-1_4"},{"key":"e_1_3_3_50_2","unstructured":"Chuang Gan Jeremy Schwartz Seth Alter Damian Mrowca Martin Schrimpf James Traer Julian De Freitas Jonas Kubilius Abhishek Bhandwaldar Nick Haber et\u00a0al. 2020. Threedworld: A platform for interactive multi-modal physical simulation. arXiv:2007.04954. Retrieved from https:\/\/arxiv.org\/abs\/2007.04954"},{"key":"e_1_3_3_51_2","doi-asserted-by":"publisher","DOI":"10.1057\/s41599-024-03611-3"},{"key":"e_1_3_3_52_2","doi-asserted-by":"crossref","unstructured":"Chen Gao Xiaochong Lan Zhihong Lu Jinzhu Mao Jinghua Piao Huandong Wang Depeng Jin and Yong Li. 2023. S3: Social-network simulation system with large language model-empowered agents. arXiv:2307.14984. Retrieved from https:\/\/arxiv.org\/abs\/2307.14984","DOI":"10.2139\/ssrn.4607026"},{"key":"e_1_3_3_53_2","unstructured":"Chen Gao Baining Zhao Weichen Zhang Jinzhu Mao Jun Zhang Zhiheng Zheng Fanhang Man Jianjie Fang Zile Zhou Jinqiang Cui et\u00a0al. 2024. EmbodiedCity: A benchmark platform for embodied agent in real-world city environment. arXiv:2410.09604. 
Retrieved from https:\/\/arxiv.org\/abs\/2410.09604"},{"key":"e_1_3_3_54_2","article-title":"Alexa arena: A user-centric interactive platform for embodied ai","volume":"36","author":"Gao Qiaozi","year":"2024","unstructured":"Qiaozi Gao, Govind Thattai, Suhaila Shakiah, Xiaofeng Gao, Shreyas Pansare, Vasu Sharma, Gaurav Sukhatme, Hangjie Shi, Bofei Yang, Desheng Zhang, et\u00a0al. 2024. Alexa arena: A user-centric interactive platform for embodied ai. Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_55_2","unstructured":"Shenyuan Gao Jiazhi Yang Li Chen Kashyap Chitta Yihang Qiu Andreas Geiger Jun Zhang and Hongyang Li. 2024. Vista: A generalizable driving world model with high fidelity and versatile controllability. arXiv:2405.17398. Retrieved from https:\/\/arxiv.org\/abs\/2405.17398"},{"key":"e_1_3_3_56_2","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"Georgiev Ignat","year":"2025","unstructured":"Ignat Georgiev, Varun Giridhar, Nicklas Hansen, and Animesh Garg. 2025. PWM: Policy learning with multi-task world models. In Proceedings of the 13th International Conference on Learning Representations."},{"key":"e_1_3_3_57_2","unstructured":"Elliot Gestrin Marco Kuhlmann and Jendrik Seipp. 2024. NL2Plan: Robust LLM-driven planning from minimal text descriptions. arXiv:2405.04215. Retrieved from https:\/\/arxiv.org\/abs\/2405.04215"},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.81"},{"key":"e_1_3_3_59_2","doi-asserted-by":"crossref","unstructured":"Jiahui Gong Jingtao Ding Fanjin Meng Chen Yang Hong Chen Zuojian Wang Haisheng Lu and Yong Li. 2025. BehaveGPT: A foundation model for large-scale user behavior modeling. arXiv:2505.17631. 
Retrieved from https:\/\/arxiv.org\/abs\/2505.17631","DOI":"10.3390\/math13152505"},{"key":"e_1_3_3_60_2","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-024-00863-1"},{"key":"e_1_3_3_61_2","unstructured":"Zhouhong Gu Xiaoxuan Zhu Haoran Guo Lin Zhang Yin Cai Hao Shen Jiangjie Chen Zheyu Ye Yifei Dai Yan Gao et\u00a0al. 2024. Agent group chat: An interactive group chat simulacra for better eliciting collective emergent behavior. arXiv:2403.13433. Retrieved from https:\/\/arxiv.org\/abs\/2403.13433"},{"key":"e_1_3_3_62_2","first-page":"79081","article-title":"Leveraging pre-trained large language models to construct and utilize world models for model-based task planning","volume":"36","author":"Guan Lin","year":"2023","unstructured":"Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. 2023. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. Advances in Neural Information Processing Systems 36 (2023), 79081\u201379094.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_63_2","article-title":"World models for autonomous driving: An initial survey","author":"Guan Yanchen","year":"2024","unstructured":"Yanchen Guan, Haicheng Liao, Zhenning Li, Jia Hu, Runze Yuan, Yunjian Li, Guohui Zhang, and Chengzhong Xu. 2024. World models for autonomous driving: An initial survey. IEEE Transactions on Intelligent Vehicles (2024).","journal-title":"IEEE Transactions on Intelligent Vehicles"},{"key":"e_1_3_3_64_2","article-title":"Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research","volume":"36","author":"Gulino Cole","year":"2024","unstructured":"Cole Gulino, Justin Fu, Wenjie Luo, George Tucker, Eli Bronstein, Yiren Lu, Jean Harb, Xinlei Pan, Yan Wang, Xiangyu Chen, et\u00a0al. 2024. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research. 
Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_65_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Gurnee Wes","year":"2024","unstructured":"Wes Gurnee and Max Tegmark. 2024. Language models represent space and time. In Proceedings of the 12th International Conference on Learning Representations."},{"key":"e_1_3_3_66_2","article-title":"Recurrent world models facilitate policy evolution","volume":"31","author":"Ha David","year":"2018","unstructured":"David Ha and J\u00fcrgen Schmidhuber. 2018. Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems 31 (2018).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_67_2","unstructured":"David Ha and J\u00fcrgen Schmidhuber. 2018. World models. arXiv:1803.10122. Retrieved from https:\/\/arxiv.org\/abs\/1803.10122"},{"key":"e_1_3_3_68_2","unstructured":"Sehoon Ha Peng Xu Zhenyu Tan Sergey Levine and Jie Tan. 2020. Learning to walk in the real world with minimal human effort. arXiv:2002.08550. Retrieved from https:\/\/arxiv.org\/abs\/2002.08550"},{"key":"e_1_3_3_69_2","unstructured":"Danijar Hafner Timothy Lillicrap Jimmy Ba and Mohammad Norouzi. 2019. Dream to control: Learning behaviors by latent imagination. arXiv:1912.01603. Retrieved from https:\/\/arxiv.org\/abs\/1912.01603"},{"key":"e_1_3_3_70_2","first-page":"2555","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Hafner Danijar","year":"2019","unstructured":"Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. 2019. Learning latent dynamics for planning from pixels. In Proceedings of the International Conference on Machine Learning. 
PMLR, 2555\u20132565."},{"key":"e_1_3_3_71_2","unstructured":"Danijar Hafner Timothy Lillicrap Mohammad Norouzi and Jimmy Ba. 2020. Mastering atari with discrete world models. arXiv:2010.02193. Retrieved from https:\/\/arxiv.org\/abs\/2010.02193"},{"key":"e_1_3_3_72_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-025-08744-2"},{"key":"e_1_3_3_73_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Hansen Nicklas","year":"2024","unstructured":"Nicklas Hansen, Hao Su, and Xiaolong Wang. 2024. TD-MPC2: Scalable, robust world models for continuous control. In Proceedings of the 12th International Conference on Learning Representations."},{"key":"e_1_3_3_74_2","volume-title":"Proceedings of the 38th Annual Conference on Neural Information Processing Systems","author":"He Haoran","year":"2024","unstructured":"Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, and Xuelong Li. 2024. Learning an actionable discrete diffusion policy via large-scale actionless video pre-training. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems."},{"key":"e_1_3_3_75_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11425-019-9547-2"},{"key":"e_1_3_3_76_2","first-page":"6840","article-title":"Denoising diffusion probabilistic models","volume":"33","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840\u20136851.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_77_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASE.2021.3064065"},{"key":"e_1_3_3_78_2","unstructured":"Anthony Hu Lloyd Russell Hudson Yeo Zak Murez George Fedoseev Alex Kendall Jamie Shotton and Gianluca Corrado. 2023. Gaia-1: A generative world model for autonomous driving. arXiv:2309.17080. 
Retrieved from https:\/\/arxiv.org\/abs\/2309.17080"},{"key":"e_1_3_3_79_2","unstructured":"Anthony Hu Lloyd Russell Hudson Yeo Zak Murez George Fedoseev Alex Kendall Jamie Shotton and Gianluca Corrado. 2023. GAIA-1: A Generative World Model for Autonomous Driving. arxiv:2309.17080. Retrieved from https:\/\/arxiv.org\/abs\/2309.17080"},{"key":"e_1_3_3_80_2","unstructured":"Yihan Hu Jiazhi Yang Li Chen Keyu Li Chonghao Sima Xizhou Zhu Siqi Chai Senyao Du Tianwei Lin Wenhai Wang Lewei Lu Xiaosong Jia Qiang Liu Jifeng Dai Yu Qiao and Hongyang Li. 2023. Planning-oriented Autonomous Driving. arxiv:2212.10156. Retrieved from https:\/\/arxiv.org\/abs\/2212.10156"},{"key":"e_1_3_3_81_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.10927"},{"key":"e_1_3_3_82_2","unstructured":"Pu Hua Minghuan Liu Annabella Macaluso Yunfeng Lin Weinan Zhang Huazhe Xu and Lirui Wang. 2024. GenSim2: Scaling robot data generation with multi-modal and reasoning LLMs. arXiv:2410.03645. Retrieved from https:\/\/arxiv.org\/abs\/2410.03645"},{"key":"e_1_3_3_83_2","unstructured":"Wenlong Huang Fei Xia Ted Xiao Harris Chan Jacky Liang Pete Florence Andy Zeng Jonathan Tompson Igor Mordatch Yevgen Chebotar et\u00a0al. 2022. Inner monologue: Embodied reasoning through planning with language models. arXiv:2207.05608. Retrieved from https:\/\/arxiv.org\/abs\/2207.05608"},{"key":"e_1_3_3_84_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIV.2022.3167103"},{"key":"e_1_3_3_85_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA46639.2022.9812060"},{"key":"e_1_3_3_86_2","unstructured":"Hakan Inan Kartikeya Upasani Jianfeng Chi Rashi Rungta Krithika Iyer Yuning Mao Michael Tontchev Qing Hu Brian Fuller Davide Testuggine et\u00a0al. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv:2312.06674. 
Retrieved from https:\/\/arxiv.org\/abs\/2312.06674"},{"key":"e_1_3_3_87_2","unstructured":"Anna A Ivanova Aalok Sathe Benjamin Lipkin Unnathi Kumar Setayesh Radkani Thomas H Clark Carina Kauf Jennifer Hu RT Pramod Gabriel Grand et\u00a0al. 2024. Elements of world knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models. arXiv:2405.09605. Retrieved from https:\/\/arxiv.org\/abs\/2405.09605"},{"key":"e_1_3_3_88_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00772"},{"key":"e_1_3_3_89_2","first-page":"991","volume-title":"Proceedings of the Conference on Robot Learning","author":"Jang Eric","year":"2022","unstructured":"Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. 2022. Bc-z: Zero-shot task generalization with robotic imitation learning. In Proceedings of the Conference on Robot Learning. PMLR, 991\u20131002."},{"key":"e_1_3_3_90_2","article-title":"When to trust your model: Model-based policy optimization","volume":"32","author":"Janner Michael","year":"2019","unstructured":"Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. 2019. When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems 32 (2019).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_91_2","first-page":"1273","article-title":"Offline reinforcement learning as one big sequence modeling problem","volume":"34","author":"Janner Michael","year":"2021","unstructured":"Michael Janner, Qiyang Li, and Sergey Levine. 2021. Offline reinforcement learning as one big sequence modeling problem. 
Advances in Neural Information Processing Systems 34 (2021), 1273\u20131286.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_92_2","doi-asserted-by":"crossref","unstructured":"Jiarui Ji Yang Li Hongtao Liu Zhicheng Du Zhewei Wei Weiran Shen Qi Qi and Yankai Lin. 2024. SRAP-Agent: Simulating and optimizing scarce resource allocation policy with LLM-based agent. arXiv:2410.14152. Retrieved from https:\/\/arxiv.org\/abs\/2410.14152","DOI":"10.18653\/v1\/2024.findings-emnlp.15"},{"key":"e_1_3_3_93_2","unstructured":"Chiyu Max Jiang Andre Cornman Cheolho Park Ben Sapp Yin Zhou and Dragomir Anguelov. 2023. MotionDiffuser: Controllable Multi-Agent Motion Prediction using Diffusion. arxiv:2306.03083. Retrieved from https:\/\/arxiv.org\/abs\/2306.03083"},{"key":"e_1_3_3_94_2","doi-asserted-by":"publisher","DOI":"10.5555\/3692070.3692961"},{"key":"e_1_3_3_95_2","volume-title":"Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness","author":"Johnson-Laird Philip Nicholas","year":"1983","unstructured":"Philip Nicholas Johnson-Laird. 1983. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Number 6. Harvard University Press."},{"key":"e_1_3_3_96_2","unstructured":"Gregory Kahn Adam Villaflor Vitchyr Pong Pieter Abbeel and Sergey Levine. 2017. Uncertainty-aware reinforcement learning for collision avoidance. arXiv:1702.01182. Retrieved from https:\/\/arxiv.org\/abs\/1702.01182"},{"key":"e_1_3_3_97_2","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"Kambhampati Subbarao","year":"2024","unstructured":"Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Paul Saldyt, and Anil B. Murthy. 2024. Position: LLMs can\u2019t plan, but can help planning in LLM-modulo frameworks. 
In Proceedings of the 41st International Conference on Machine Learning."},{"key":"e_1_3_3_98_2","unstructured":"Bingyi Kang Yang Yue Rui Lu Zhijie Lin Yang Zhao Kaixin Wang Gao Huang and Jiashi Feng. 2024. How far is video generation from world model: A physical law perspective. arXiv:2411.02385. Retrieved from https:\/\/arxiv.org\/abs\/2411.02385"},{"key":"e_1_3_3_99_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA40945.2020.9196738"},{"key":"e_1_3_3_100_2","doi-asserted-by":"publisher","DOI":"10.1109\/JRA.1987.1087068"},{"key":"e_1_3_3_101_2","doi-asserted-by":"publisher","DOI":"10.1007\/s43154-020-00021-6"},{"key":"e_1_3_3_102_2","unstructured":"Eric Kolve Roozbeh Mottaghi Winson Han Eli VanderBilt Luca Weihs Alvaro Herrasti Matt Deitke Kiana Ehsani Daniel Gordon Yuke Zhu et\u00a0al. 2017. Ai2-thor: An interactive 3d environment for visual ai. arXiv:1712.05474. Retrieved from https:\/\/arxiv.org\/abs\/1712.05474"},{"key":"e_1_3_3_103_2","first-page":"13","article-title":"Model predictive control","volume":"38","author":"Kouvaritakis Basil","year":"2016","unstructured":"Basil Kouvaritakis and Mark Cannon. 2016. Model predictive control. Switzerland: Springer International Publishing 38 (2016), 13\u201356.","journal-title":"Switzerland: Springer International Publishing"},{"key":"e_1_3_3_104_2","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_105_2","article-title":"Kling AI: Next-Generation AI Creative Studio.","year":"2024","unstructured":"Kuaishou. 2024. Kling AI: Next-Generation AI Creative Studio. 
Retrieved May 06, 2025 from https:\/\/www.klingai.com\/global\/.","journal-title":"https:\/\/www.klingai.com\/global\/"},{"key":"e_1_3_3_106_2","unstructured":"Aounon Kumar Chirag Agarwal Suraj Srinivas Soheil Feizi and Hima Lakkaraju. 2023. Certifying llm safety against adversarial prompting. arXiv:2309.02705. Retrieved from https:\/\/arxiv.org\/abs\/2309.02705"},{"key":"e_1_3_3_107_2","doi-asserted-by":"crossref","unstructured":"Ashish Kumar Zipeng Fu Deepak Pathak and Jitendra Malik. 2021. Rma: Rapid motor adaptation for legged robots. arXiv:2107.04034. Retrieved from https:\/\/arxiv.org\/abs\/2107.04034","DOI":"10.15607\/RSS.2021.XVII.011"},{"key":"e_1_3_3_108_2","unstructured":"Varun Ravi Kumar Senthil Yogamani Hazem Rashed Ganesh Sistu Christian Witt Isabelle Leang Stefan Milz and Patrick M\u00e4der. 2023. OmniDet: Surround View Cameras based Multi-task Visual Perception Network for Autonomous Driving. arXiv:2102.07448. Retrieved from https:\/\/arxiv.org\/abs\/2102.07448"},{"key":"e_1_3_3_109_2","unstructured":"Thanard Kurutach Ignasi Clavera Yan Duan Aviv Tamar and Pieter Abbeel. 2018. Model-ensemble trust-region policy optimization. arXiv:1802.10592. Retrieved from https:\/\/arxiv.org\/abs\/1802.10592"},{"issue":"1","key":"e_1_3_3_110_2","first-page":"1","article-title":"A path towards autonomous machine intelligence version 0.9.2, 2022-06-27","volume":"62","author":"LeCun Yann","year":"2022","unstructured":"Yann LeCun. 2022. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review 62, 1 (2022), 1\u201362.","journal-title":"Open Review"},{"key":"e_1_3_3_111_2","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_3_3_112_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00698"},{"key":"e_1_3_3_113_2","unstructured":"Lincan Li Wei Shao Wei Dong Yijun Tian Qiming Zhang Kaixiang Yang and Wenjie Zhang. 2024. 
Data-centric evolution in autonomous driving: A comprehensive survey of big data system data mining and closed-loop technologies. arXiv:2401.12888. Retrieved from https:\/\/arxiv.org\/abs\/2401.12888"},{"key":"e_1_3_3_114_2","first-page":"100428","article-title":"Embodied agent interface: Benchmarking llms for embodied decision making","volume":"37","author":"Li Manling","year":"2024","unstructured":"Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Erran Li Li, Ruohan Zhang, et\u00a0al. 2024. Embodied agent interface: Benchmarking llms for embodied decision making. Advances in Neural Information Processing Systems 37 (2024), 100428\u2013100534.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_115_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.829"},{"key":"e_1_3_3_116_2","unstructured":"Qinbin Li Junyuan Hong Chulin Xie Jeffrey Tan Rachel Xin Junyi Hou Xavier Yin Zhun Wang Dan Hendrycks Zhangyang Wang et\u00a0al. 2024. Llm-pbe: Assessing data privacy in large language models. arXiv:2408.12787. Retrieved from https:\/\/arxiv.org\/abs\/2408.12787"},{"issue":"3","key":"e_1_3_3_117_2","first-page":"3461","article-title":"Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning","volume":"45","author":"Li Quanyi","year":"2022","unstructured":"Quanyi Li, Zhenghao Peng, Lan Feng, Qihang Zhang, Zhenghai Xue, and Bolei Zhou. 2022. Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 3 (2022), 3461\u20133475.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_3_118_2","unstructured":"Yuxiao Li Eric J. Michaud David D. Baek Joshua Engels Xiaoqing Sun and Max Tegmark. 2024. The Geometry of Concepts: Sparse Autoencoder Feature Structure. arxiv:2410.19750. 
Retrieved from https:\/\/arxiv.org\/abs\/2410.19750"},{"key":"e_1_3_3_119_2","unstructured":"Zhiqi Li Wenhai Wang Hongyang Li Enze Xie Chonghao Sima Tong Lu Yu Qiao and Jifeng Dai. 2022. BEVFormer: Learning bird\u2019s-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv:2203.17270. Retrieved from https:\/\/arxiv.org\/abs\/2203.17270"},{"key":"e_1_3_3_120_2","unstructured":"Jessy Lin Yuqing Du Olivia Watkins Danijar Hafner Pieter Abbeel Dan Klein and Anca Dragan. 2024. Learning to Model the World with Language. arxiv:2308.01399. Retrieved from https:\/\/arxiv.org\/abs\/2308.01399"},{"key":"e_1_3_3_121_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10514-023-10131-7"},{"key":"e_1_3_3_122_2","unstructured":"Haotian Liu Chunyuan Li Qingyang Wu and Yong Jae Lee. 2023. Visual Instruction Tuning. Advances in Neural Information Processing Systems 36 (2023) 34892\u201334916."},{"key":"e_1_3_3_123_2","article-title":"Visual instruction tuning","volume":"36","author":"Liu Haotian","year":"2024","unstructured":"Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_124_2","unstructured":"Hao Liu Wilson Yan Matei Zaharia and Pieter Abbeel. 2024. World model on million-length video and language with ringattention. arXiv:2402.08268. Retrieved from https:\/\/arxiv.org\/abs\/2402.08268"},{"key":"e_1_3_3_125_2","first-page":"360","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Liu Shaowei","year":"2024","unstructured":"Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. 2024. Physgen: Rigid-body physics-grounded image-to-video generation. In Proceedings of the European Conference on Computer Vision. 
Springer, 360\u2013378."},{"key":"e_1_3_3_126_2","unstructured":"Zhihan Liu Hao Hu Shenao Zhang Hongyi Guo Shuqi Ke Boyi Liu and Zhaoran Wang. 2024. Reason for Future Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency. arxiv:2309.17382. Retrieved from https:\/\/arxiv.org\/abs\/2309.17382"},{"key":"e_1_3_3_127_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA57147.2024.10611565"},{"key":"e_1_3_3_128_2","article-title":"Microscopic traffic simulation using SUMO","author":"Lopez Pablo Alvarez","year":"2018","unstructured":"Pablo Alvarez Lopez, Michael Behrisch, Laura Bieker-Walz, Jakob Erdmann, Yun-Pang Fl\u00f6tter\u00f6d, Robert Hilbrich, Leonhard L\u00fccken, Johannes Rummel, Peter Wagner, and Evamarie Wie\u00dfner. 2018. Microscopic traffic simulation using SUMO. In Proceedings of the 21st IEEE Intelligent Transportation Systems Conference. Retrieved from https:\/\/elib.dlr.de\/124092\/","journal-title":"Proceedings of the 21st IEEE Intelligent Transportation Systems Conference"},{"key":"e_1_3_3_129_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-022-3696-5"},{"key":"e_1_3_3_130_2","unstructured":"Yuping Luo Huazhe Xu Yuanzhi Li Yuandong Tian Trevor Darrell and Tengyu Ma. 2018. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv:1807.03858. Retrieved from https:\/\/arxiv.org\/abs\/1807.03858"},{"key":"e_1_3_3_131_2","unstructured":"Xinji Mai Zeng Tao Junxiong Lin Haoran Wang Yang Chang Yanlan Kang Yan Wang and Wenqiang Zhang. 2024. From efficient multimodal models to world models: A survey. arXiv:2407.00118. Retrieved from https:\/\/arxiv.org\/abs\/2407.00118"},{"key":"e_1_3_3_132_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01560"},{"key":"e_1_3_3_133_2","unstructured":"Rohin Manvi Samar Khanna Marshall Burke David Lobell and Stefano Ermon. 2024. Large language models are geographically biased. arXiv:2402.02680. 
Retrieved from https:\/\/arxiv.org\/abs\/2402.02680"},{"key":"e_1_3_3_134_2","unstructured":"Rohin Manvi Samar Khanna Gengchen Mai Marshall Burke David Lobell and Stefano Ermon. 2023. GeoLLM: Extracting geospatial knowledge from large language models. arXiv:2310.06213. Retrieved from https:\/\/arxiv.org\/abs\/2310.06213"},{"key":"e_1_3_3_135_2","doi-asserted-by":"crossref","unstructured":"Russell Mendonca Shikhar Bahl and Deepak Pathak. 2023. Structured world models from human videos. arXiv:2308.10901. Retrieved from https:\/\/arxiv.org\/abs\/2308.10901","DOI":"10.15607\/RSS.2023.XIX.012"},{"key":"e_1_3_3_136_2","unstructured":"Chen Min Dawei Zhao Liang Xiao Yiming Nie and Bin Dai. 2023. Uniworld: Autonomous driving pre-training via world models. arXiv:2308.07234. Retrieved from https:\/\/arxiv.org\/abs\/2308.07234"},{"key":"e_1_3_3_137_2","unstructured":"Marvin Minsky. 1974. A framework for representing knowledge. MIT Cambridge."},{"key":"e_1_3_3_138_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature14236"},{"key":"e_1_3_3_139_2","unstructured":"Thomas M. Moerland Joost Broekens Aske Plaat and Catholijn M. Jonker. 2018. A0c: Alpha zero in continuous action space. arXiv:1805.09613. Retrieved from https:\/\/arxiv.org\/abs\/1805.09613"},{"key":"e_1_3_3_140_2","unstructured":"Saman Motamed Laura Culp Kevin Swersky Priyank Jaini and Robert Geirhos. 2025. Do generative video models learn physical principles from watching videos? arXiv:2501.09038. Retrieved from https:\/\/arxiv.org\/abs\/2501.09038"},{"key":"e_1_3_3_141_2","unstructured":"Fangwen Mu Lin Shi Song Wang Zhuohao Yu Binquan Zhang Chenxue Wang Shichao Liu and Qing Wang. 2023. ClarifyGPT: Empowering LLM-based code generation with intention clarification. arXiv:2310.10996. 
Retrieved from https:\/\/arxiv.org\/abs\/2310.10996"},{"key":"e_1_3_3_142_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2018.8463189"},{"key":"e_1_3_3_143_2","doi-asserted-by":"crossref","unstructured":"Nigamaa Nayakanti Rami Al-Rfou Aurick Zhou Kratarth Goel Khaled S. Refaat and Benjamin Sapp. 2022. Wayformer: Motion Forecasting via Simple and Efficient Attention Networks. arxiv:2207.05844. Retrieved from https:\/\/arxiv.org\/abs\/2207.05844","DOI":"10.1109\/ICRA48891.2023.10160609"},{"key":"e_1_3_3_144_2","unstructured":"Jiquan Ngiam Benjamin Caine Vijay Vasudevan Zhengdong Zhang Hao-Tien Lewis Chiang Jeffrey Ling Rebecca Roelofs Alex Bewley Chenxi Liu Ashish Venugopal et\u00a0al. 2021. Scene transformer: A unified multi-task model for behavior prediction and planning. arXiv:2106.08417. Retrieved from https:\/\/arxiv.org\/abs\/2106.08417"},{"key":"e_1_3_3_145_2","article-title":"Value prediction network","volume":"30","author":"Oh Junhyuk","year":"2017","unstructured":"Junhyuk Oh, Satinder Singh, and Honglak Lee. 2017. Value prediction network. Advances in Neural Information Processing Systems 30 (2017).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_146_2","article-title":"Introducing ChatGPT","year":"2022","unstructured":"OpenAI. 2022. Introducing ChatGPT. Retrieved May 06, 2025 from https:\/\/openai.com\/blog\/chatgpt.","journal-title":"https:\/\/openai.com\/blog\/chatgpt"},{"key":"e_1_3_3_147_2","article-title":"Sora: Creating video from text.","year":"2024","unstructured":"OpenAI. 2024. Sora: Creating video from text. Retrieved May 06, 2025 from https:\/\/openai.com\/sora.","journal-title":"https:\/\/openai.com\/sora"},{"key":"e_1_3_3_148_2","doi-asserted-by":"crossref","unstructured":"Marios Papachristou and Yuan Yuan. 2024. Network formation and dynamics among multi-LLMs. arXiv:2402.10659. 
Retrieved from https:\/\/arxiv.org\/abs\/2402.10659","DOI":"10.2139\/ssrn.5410522"},{"key":"e_1_3_3_149_2","doi-asserted-by":"publisher","DOI":"10.1145\/3586183.3606763"},{"key":"e_1_3_3_150_2","doi-asserted-by":"publisher","DOI":"10.1145\/3526113.3545616"},{"key":"e_1_3_3_151_2","first-page":"6236","article-title":"Avlen: Audio-visual-language embodied navigation in 3d environments","volume":"35","author":"Paul Sudipta","year":"2022","unstructured":"Sudipta Paul, Amit Roy-Chowdhury, and Anoop Cherian. 2022. Avlen: Audio-visual-language embodied navigation in 3d environments. Advances in Neural Information Processing Systems 35 (2022), 6236\u20136249.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_152_2","doi-asserted-by":"crossref","unstructured":"Judea Pearl. 2009. Causal inference in statistics: An overview. (2009).","DOI":"10.1214\/09-SS057"},{"key":"e_1_3_3_153_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01408"},{"key":"e_1_3_3_154_2","unstructured":"Jinghua Piao Yuwei Yan Jun Zhang Nian Li Junbo Yan Xiaochong Lan Zhihong Lu Zhiheng Zheng Jing Yi Wang Di Zhou et\u00a0al. 2025. AgentSociety: Large-scale simulation of LLM-driven generative agents advances understanding of human behaviors and society. arXiv:2502.08691. Retrieved from https:\/\/arxiv.org\/abs\/2502.08691"},{"key":"e_1_3_3_155_2","unstructured":"Giorgio Piatti Zhijing Jin Max Kleiman-Weiner Bernhard Sch\u00f6lkopf Mrinmaya Sachan and Rada Mihalcea. 2024. Cooperate or collapse: Emergence of sustainability behaviors in a society of LLM agents. arXiv:2404.16698. 
Retrieved from https:\/\/arxiv.org\/abs\/2404.16698"},{"key":"e_1_3_3_156_2","doi-asserted-by":"publisher","DOI":"10.1017\/S0140525X00076512"},{"key":"e_1_3_3_157_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00886"},{"key":"e_1_3_3_158_2","first-page":"652","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Qi Charles R.","year":"2017","unstructured":"Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 652\u2013660."},{"key":"e_1_3_3_159_2","unstructured":"Charles R. Qi Li Yi Hao Su and Leonidas J. Guibas. 2017. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. arxiv:1706.02413. Retrieved from https:\/\/arxiv.org\/abs\/1706.02413"},{"key":"e_1_3_3_160_2","first-page":"23192","article-title":"Pointnext: Revisiting pointnet++ with improved training and scaling strategies","volume":"35","author":"Qian Guocheng","year":"2022","unstructured":"Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. 2022. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in Neural Information Processing Systems 35 (2022), 23192\u201323204.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_161_2","unstructured":"Yiran Qin Zhelun Shi Jiwen Yu Xijun Wang Enshen Zhou Lijun Li Zhenfei Yin Xihui Liu Lu Sheng Jing Shao et\u00a0al. 2024. Worldsimbench: Towards video generation models as world simulators. arXiv:2410.18072. Retrieved from https:\/\/arxiv.org\/abs\/2410.18072"},{"key":"e_1_3_3_162_2","unstructured":"Alec Radford Jong Wook Kim Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark Gretchen Krueger and Ilya Sutskever. 2021. 
Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020. Retrieved from https:\/\/arxiv.org\/abs\/2103.00020"},{"key":"e_1_3_3_163_2","article-title":"Direct preference optimization: Your language model is secretly a reward model","volume":"36","author":"Rafailov Rafael","year":"2024","unstructured":"Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_164_2","first-page":"7953","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Rajeswaran Aravind","year":"2020","unstructured":"Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. 2020. A game theoretic framework for model based reinforcement learning. In Proceedings of the International Conference on Machine Learning. PMLR, 7953\u20137963."},{"key":"e_1_3_3_165_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2022.3154404"},{"key":"e_1_3_3_166_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems.","volume":"28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems. C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28, Curran Associates, Inc. Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2015\/file\/14bfa6bb14875e45bba028a21ed38046-Paper.pdf"},{"key":"e_1_3_3_167_2","unstructured":"Weiming Ren Huan Yang Ge Zhang Cong Wei Xinrun Du Wenhao Huang and Wenhu Chen. 2024. Consisti2v: Enhancing visual consistency for image-to-video generation. arXiv:2402.04324. 
Retrieved from https:\/\/arxiv.org\/abs\/2402.04324"},{"key":"e_1_3_3_168_2","unstructured":"Jonathan Richens David Abel Alexis Bellot and Tom Everitt. 2025. General agents need world models. arXiv:2506.01622. Retrieved from https:\/\/arxiv.org\/abs\/2506.01622"},{"key":"e_1_3_3_169_2","doi-asserted-by":"crossref","unstructured":"Marc Rigter Tarun Gupta Agrin Hilmkil and Chao Ma. 2024. AVID: Adapting video diffusion models to world models. arXiv:2410.12822. Retrieved from https:\/\/arxiv.org\/abs\/2410.12822","DOI":"10.32388\/H7BFDW"},{"key":"e_1_3_3_170_2","unstructured":"Jonathan Roberts Timo L\u00fcddecke Sowmen Das Kai Han and Samuel Albanie. 2023. GPT4GEO: How a language model sees the world\u2019s geography. arXiv:2306.00020. Retrieved from https:\/\/arxiv.org\/abs\/2306.00020"},{"key":"e_1_3_3_171_2","first-page":"91","volume-title":"Proceedings of the Conference on Robot Learning","author":"Rudin Nikita","year":"2022","unstructured":"Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. 2022. Learning to walk in minutes using massively parallel deep reinforcement learning. In Proceedings of the Conference on Robot Learning. PMLR, 91\u2013100."},{"key":"e_1_3_3_172_2","unstructured":"Mohammad Reza Samsami Artem Zholus Janarthanan Rajendran and Sarath Chandar. 2024. Mastering memory tasks with world models. arXiv:2403.04253. Retrieved from https:\/\/arxiv.org\/abs\/2403.04253"},{"key":"e_1_3_3_173_2","first-page":"8459","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Sanchez-Gonzalez Alvaro","year":"2020","unstructured":"Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter Battaglia. 2020. Learning to simulate complex physics with graph networks. In Proceedings of the International Conference on Machine Learning. PMLR, 8459\u20138468."},{"key":"e_1_3_3_174_2","unstructured":"Maarten Sap Ronan LeBras Daniel Fried and Yejin Choi. 2022. Neural theory-of-mind? 
On the limits of social intelligence in large lms. arXiv:2210.13312. Retrieved from https:\/\/arxiv.org\/abs\/2210.13312"},{"key":"e_1_3_3_175_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00943"},{"key":"e_1_3_3_176_2","doi-asserted-by":"publisher","DOI":"10.1109\/icra40945.2020.9197132"},{"key":"e_1_3_3_177_2","doi-asserted-by":"publisher","DOI":"10.1080\/0022250X.1971.9989794"},{"key":"e_1_3_3_178_2","unstructured":"Ingmar Schubert Jingwei Zhang Jake Bruce Sarah Bechtle Emilio Parisotto Martin Riedmiller Jost Tobias Springenberg Arunkumar Byravan Leonard Hasenclever and Nicolas Heess. 2023. A generalist dynamics model for control. arXiv:2305.10912. Retrieved from https:\/\/arxiv.org\/abs\/2305.10912"},{"key":"e_1_3_3_179_2","unstructured":"John Schulman Filip Wolski Prafulla Dhariwal Alec Radford and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv:1707.06347. Retrieved from https:\/\/arxiv.org\/abs\/1707.06347"},{"key":"e_1_3_3_180_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA48891.2023.10161227"},{"key":"e_1_3_3_181_2","unstructured":"Yu Shang Jiansheng Chen Hangyu Fan Jingtao Ding Jie Feng and Yong Li. 2024. UrbanWorld: An urban world model for 3D city generation. arXiv:2407.11965. Retrieved from https:\/\/arxiv.org\/abs\/2407.11965"},{"key":"e_1_3_3_182_2","unstructured":"Yu Shang Yu Li Fengli Xu and Yong Li. 2024. DefInt: A default-interventionist framework for efficient reasoning with hybrid large language models. arXiv:2402.02563. Retrieved from https:\/\/arxiv.org\/abs\/2402.02563"},{"key":"e_1_3_3_183_2","unstructured":"Chenyang Shao Fengli Xu Bingbing Fan Jingtao Ding Yuan Yuan Meng Wang and Yong Li. 2024. Beyond imitation: Generating human mobility from context-aware reasoning with large language models. arXiv:2402.09836. 
Retrieved from https:\/\/arxiv.org\/abs\/2402.09836"},{"key":"e_1_3_3_184_2","doi-asserted-by":"publisher","DOI":"10.1109\/IROS51168.2021.9636667"},{"key":"e_1_3_3_185_2","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"Shi Hongzhi","year":"2022","unstructured":"Hongzhi Shi, Jingtao Ding, Yufan Cao, Li Liu, Yong Li, et\u00a0al. 2022. Learning symbolic models for graph-structured physical mechanism. In Proceedings of the 11th International Conference on Learning Representations."},{"key":"e_1_3_3_186_2","doi-asserted-by":"publisher","DOI":"10.1177\/02783649231219020"},{"key":"e_1_3_3_187_2","unstructured":"Haojun Shi Suyu Ye Xinyu Fang Chuanyang Jin Layla Isik Yen-Ling Kuo and Tianmin Shu. 2024. MuMA-ToM: Multi-modal multi-agent theory of mind. arXiv:2408.12574. Retrieved from https:\/\/arxiv.org\/abs\/2408.12574"},{"key":"e_1_3_3_188_2","first-page":"6531","article-title":"Motion transformer with global intention localization and local movement refinement","volume":"35","author":"Shi Shaoshuai","year":"2022","unstructured":"Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. 2022. Motion transformer with global intention localization and local movement refinement. Advances in Neural Information Processing Systems 35 (2022), 6531\u20136543.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_189_2","unstructured":"Mohit Shridhar Xingdi Yuan Marc-Alexandre C\u00f4t\u00e9 Yonatan Bisk Adam Trischler and Matthew Hausknecht. 2020. Alfworld: Aligning text and embodied environments for interactive learning. arXiv:2010.03768. 
Retrieved from https:\/\/arxiv.org\/abs\/2010.03768"},{"key":"e_1_3_3_190_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature16961"},{"key":"e_1_3_3_191_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature24270"},{"key":"e_1_3_3_192_2","doi-asserted-by":"publisher","DOI":"10.1109\/70.285583"},{"key":"e_1_3_3_193_2","doi-asserted-by":"crossref","unstructured":"Laura Smith Ilya Kostrikov and Sergey Levine. 2022. A walk in the park: Learning to walk in 20 minutes with model-free reinforcement learning. arXiv:2208.07860. Retrieved from https:\/\/arxiv.org\/abs\/2208.07860","DOI":"10.15607\/RSS.2023.XIX.056"},{"key":"e_1_3_3_194_2","doi-asserted-by":"crossref","unstructured":"Yang Song Jascha Sohl-Dickstein Diederik P. Kingma Abhishek Kumar Stefano Ermon and Ben Poole. 2020. Score-based generative modeling through stochastic differential equations. arXiv:2011.13456. Retrieved from https:\/\/arxiv.org\/abs\/2011.13456","DOI":"10.1155\/2011\/613695"},{"key":"e_1_3_3_195_2","first-page":"1","article-title":"Testing theory of mind in large language models and humans","author":"Strachan James WA","year":"2024","unstructured":"James WA Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, et\u00a0al. 2024. Testing theory of mind in large language models and humans. Nature Human Behaviour (2024), 1\u201311.","journal-title":"Nature Human Behaviour"},{"key":"e_1_3_3_196_2","unstructured":"Winnie Street John Oliver Siy Geoff Keeling Adrien Baranes Benjamin Barnett Michael McKibben Tatenda Kanyere Alison Lentz Robin IM Dunbar et\u00a0al. 2024. LLMs achieve adult human performance on higher-order theory of mind tasks. arXiv:2405.18870. 
Retrieved from https:\/\/arxiv.org\/abs\/2405.18870"},{"key":"e_1_3_3_197_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2019.101469"},{"key":"e_1_3_3_198_2","article-title":"SpatialLM: Large Language Model for Spatial Understanding","author":"Team ManyCore Research","year":"2025","unstructured":"ManyCore Research Team. 2025. SpatialLM: Large Language Model for Spatial Understanding. Retrieved June 11, 2025 from https:\/\/github.com\/manycore-research\/SpatialLM.","journal-title":"https:\/\/github.com\/manycore-research\/SpatialLM"},{"key":"e_1_3_3_199_2","doi-asserted-by":"crossref","unstructured":"Marvin Teichmann Michael Weber Marius Zoellner Roberto Cipolla and Raquel Urtasun. 2018. MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving. arxiv:1612.07695. Retrieved from https:\/\/arxiv.org\/abs\/1612.07695","DOI":"10.1109\/IVS.2018.8500504"},{"key":"e_1_3_3_200_2","unstructured":"Ran Tian Boyi Li Xinshuo Weng Yuxiao Chen Edward Schmerling Yue Wang Boris Ivanovic and Marco Pavone. 2024. Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving. arxiv:2407.00959. Retrieved from https:\/\/arxiv.org\/abs\/2407.00959"},{"key":"e_1_3_3_201_2","doi-asserted-by":"publisher","DOI":"10.1037\/h0061626"},{"key":"e_1_3_3_202_2","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar et\u00a0al. 2023. Llama: Open and efficient foundation language models. arXiv:2302.13971. Retrieved from https:\/\/arxiv.org\/abs\/2302.13971"},{"key":"e_1_3_3_203_2","article-title":"Attention is all you need","author":"Vaswani A","year":"2017","unstructured":"A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_204_2","unstructured":"Dat Vu Bao Ngo and Hung Phan. 2022. 
HybridNets: End-to-End Perception Network. arxiv:2203.09035. Retrieved from https:\/\/arxiv.org\/abs\/2203.09035"},{"key":"e_1_3_3_205_2","first-page":"15909","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Wang Ao","year":"2024","unstructured":"Ao Wang, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. 2024. Repvit: Revisiting mobile cnn from vit perspective. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 15909\u201315920."},{"key":"e_1_3_3_206_2","unstructured":"Hanqing Wang Jiahe Chen Wensi Huang Qingwei Ben Tai Wang Boyu Mi Tao Huang Siheng Zhao Yilun Chen Sizhe Yang et\u00a0al. 2024. Grutopia: Dream general robots in a city at scale. arXiv:2407.10943. Retrieved from https:\/\/arxiv.org\/abs\/2407.10943"},{"key":"e_1_3_3_207_2","unstructured":"Lirui Wang Yiyang Ling Zhecheng Yuan Mohit Shridhar Chen Bao Yuzhe Qin Bailin Wang Huazhe Xu and Xiaolong Wang. 2023. Gensim: Generating robotic simulation tasks via large language models. arXiv:2310.01361. Retrieved from https:\/\/arxiv.org\/abs\/2310.01361"},{"key":"e_1_3_3_208_2","unstructured":"Lening Wang Wenzhao Zheng Yilong Ren Han Jiang Zhiyong Cui Haiyang Yu and Jiwen Lu. 2024. OccSora: 4D occupancy generation models as world simulators for autonomous driving. arXiv:2405.20337. Retrieved from https:\/\/arxiv.org\/abs\/2405.20337"},{"key":"e_1_3_3_209_2","unstructured":"Tingwu Wang and Jimmy Ba. 2019. Exploring model-based planning with policy networks. arXiv:1906.08649. Retrieved from https:\/\/arxiv.org\/abs\/1906.08649"},{"key":"e_1_3_3_210_2","unstructured":"Xiaofeng Wang Zheng Zhu Guan Huang Xinze Chen Jiagang Zhu and Jiwen Lu. 2023. Drivedreamer: Towards real-world-driven world models for autonomous driving. arXiv:2309.09777. Retrieved from https:\/\/arxiv.org\/abs\/2309.09777"},{"key":"e_1_3_3_211_2","unstructured":"Xiaofeng Wang Zheng Zhu Guan Huang Xinze Chen Jiagang Zhu and Jiwen Lu. 2023. 
DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving. arxiv:2309.09777. Retrieved from https:\/\/arxiv.org\/abs\/2309.09777"},{"key":"e_1_3_3_212_2","unstructured":"Xiaofeng Wang Zheng Zhu Guan Huang Boyuan Wang Xinze Chen and Jiwen Lu. 2024. Worlddreamer: Towards general world models for video generation via predicting masked tokens. arXiv:2401.09985. Retrieved from https:\/\/arxiv.org\/abs\/2401.09985"},{"key":"e_1_3_3_213_2","unstructured":"Yuqi Wang Jiawei He Lue Fan Hongxin Li Yuntao Chen and Zhaoxiang Zhang. 2023. Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving. arxiv:2311.17918. Retrieved from https:\/\/arxiv.org\/abs\/2311.17918"},{"key":"e_1_3_3_214_2","doi-asserted-by":"crossref","unstructured":"Yi Ru Wang Jiafei Duan Dieter Fox and Siddhartha Srinivasa. 2023. NEWTON: Are large language models capable of physical reasoning? arXiv:2310.07018. Retrieved from https:\/\/arxiv.org\/abs\/2310.07018","DOI":"10.18653\/v1\/2023.findings-emnlp.652"},{"key":"e_1_3_3_215_2","doi-asserted-by":"publisher","DOI":"10.22215\/timreview\/1282"},{"key":"e_1_3_3_216_2","unstructured":"Alex Wilf Sihyun Shawn Lee Paul Pu Liang and Louis-Philippe Morency. 2023. Think Twice: Perspective-Taking Improves Large Language Models\u2019 Theory-of-Mind Capabilities. arxiv:2311.10227. Retrieved from https:\/\/arxiv.org\/abs\/2311.10227"},{"key":"e_1_3_3_217_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11633-022-1339-y"},{"key":"e_1_3_3_218_2","unstructured":"Hongtao Wu Ya Jing Chilam Cheang Guangzeng Chen Jiafeng Xu Xinghang Li Minghuan Liu Hang Li and Tao Kong. 2023. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv:2312.13139. Retrieved from https:\/\/arxiv.org\/abs\/2312.13139"},{"key":"e_1_3_3_219_2","unstructured":"Jincenzi Wu Zhuang Chen Jiawen Deng Sahand Sabour Helen Meng and Minlie Huang. 2024. Coke: A cognitive knowledge graph for machine theory of mind. 
arXiv:2305.05390. Retrieved from https:\/\/arxiv.org\/abs\/2305.05390"},{"key":"e_1_3_3_220_2","unstructured":"Jialong Wu Shaofeng Yin Ningya Feng Xu He Dong Li Jianye Hao and Mingsheng Long. 2024. iVideoGPT: Interactive VideoGPTs are scalable world models. arXiv:2405.15223. Retrieved from https:\/\/arxiv.org\/abs\/2405.15223"},{"key":"e_1_3_3_221_2","first-page":"2226","volume-title":"Proceedings of the Conference on Robot Learning","author":"Wu Philipp","year":"2023","unstructured":"Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. 2023. Daydreamer: World models for physical robot learning. In Proceedings of the Conference on Robot Learning. PMLR, 2226\u20132240."},{"key":"e_1_3_3_222_2","unstructured":"Wayne Wu Honglin He Yiran Wang Chenda Duan Jack He Zhizheng Liu Quanyi Li and Bolei Zhou. 2024. MetaUrban: A simulation platform for embodied AI in urban spaces. arXiv:2407.08725. Retrieved from https:\/\/arxiv.org\/abs\/2407.08725"},{"key":"e_1_3_3_223_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01111"},{"key":"e_1_3_3_224_2","unstructured":"Jiannan Xiang Guangyi Liu Yi Gu Qiyue Gao Yuting Ning Yuheng Zha Zeyu Feng Tianhua Tao Shibo Hao Yemin Shi et\u00a0al. 2024. Pandora: Towards general world model with natural language actions and video states. arXiv:2406.09455. Retrieved from https:\/\/arxiv.org\/abs\/2406.09455"},{"key":"e_1_3_3_225_2","article-title":"Language models meet world models: Embodied experiences enhance language models","volume":"36","author":"Xiang Jiannan","year":"2024","unstructured":"Jiannan Xiang, Tianhua Tao, Yi Gu, Tianmin Shu, Zirui Wang, Zichao Yang, and Zhiting Hu. 2024. Language models meet world models: Embodied experiences enhance language models. 
Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_226_2","unstructured":"Fengli Xu Qianyue Hao Zefang Zong Jingwei Wang Yunke Zhang Jingyi Wang Xiaochong Lan Jiahui Gong Tianjian Ouyang Fanjin Meng et\u00a0al. 2025. Towards large reasoning models: A survey of reinforced reasoning with large language models. arXiv:2501.09686. Retrieved from https:\/\/arxiv.org\/abs\/2501.09686"},{"key":"e_1_3_3_227_2","unstructured":"Fengli Xu Jun Zhang Chen Gao Jie Feng and Yong Li. 2023. Urban generative intelligence (ugi): A foundational platform for agents in embodied city environment. arXiv:2312.11813. Retrieved from https:\/\/arxiv.org\/abs\/2312.11813"},{"key":"e_1_3_3_228_2","unstructured":"Wenrui Xu Dalin Lyu Weihang Wang Jie Feng Chen Gao and Yong Li. 2025. Defining and evaluating visual language models\u2019 basic spatial abilities: A perspective from psychometrics. arXiv:2502.11859. Retrieved from https:\/\/arxiv.org\/abs\/2502.11859"},{"key":"e_1_3_3_229_2","unstructured":"Yuzhuang Xu Shuo Wang Peng Li Fuwen Luo Xiaolong Wang Weidong Liu and Yang Liu. 2023. Exploring large language models for communication games: An empirical study on werewolf. arXiv:2309.04658. Retrieved from https:\/\/arxiv.org\/abs\/2309.04658"},{"key":"e_1_3_3_230_2","first-page":"39062","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Yan Wilson","year":"2023","unstructured":"Wilson Yan, Danijar Hafner, Stephen James, and Pieter Abbeel. 2023. Temporally consistent transformers for video generation. In Proceedings of the International Conference on Machine Learning. PMLR, 39062\u201339098."},{"key":"e_1_3_3_231_2","unstructured":"Wilson Yan Yunzhi Zhang Pieter Abbeel and Aravind Srinivas. 2021. Videogpt: Video generation using vq-vae and transformers. arXiv:2104.10157. 
Retrieved from https:\/\/arxiv.org\/abs\/2104.10157"},{"key":"e_1_3_3_232_2","unstructured":"Xu Yan Haiming Zhang Yingjie Cai Jingming Guo Weichao Qiu Bin Gao Kaiqiang Zhou Yue Zhao Huan Jin Jiantao Gao et\u00a0al. 2024. Forging vision foundation models for autonomous driving: Challenges methodologies and opportunities. arXiv:2401.08045. Retrieved from https:\/\/arxiv.org\/abs\/2401.08045"},{"key":"e_1_3_3_233_2","unstructured":"Yuwei Yan Qingbin Zeng Zhiheng Zheng Jingzhe Yuan Jie Feng Jun Zhang Fengli Xu and Yong Li. 2024. OpenCity: A scalable platform to simulate urban activities with massive LLM agents. arXiv:2410.21286. Retrieved from https:\/\/arxiv.org\/abs\/2410.21286"},{"key":"e_1_3_3_234_2","unstructured":"Deshun Yang Luhui Hu Yu Tian Zihao Li Chris Kelly Bang Yang Cindy Yang and Yuexian Zou. 2024. WorldGPT: A sora-inspired video AI agent as rich world models from text and image inputs. arXiv:2403.07944. Retrieved from https:\/\/arxiv.org\/abs\/2403.07944"},{"key":"e_1_3_3_235_2","unstructured":"Mengjiao Yang Yilun Du Bo Dai Dale Schuurmans Joshua B. Tenenbaum and Pieter Abbeel. 2023. Probabilistic adaptation of text-to-video models. arXiv:2306.01872. Retrieved from https:\/\/arxiv.org\/abs\/2306.01872"},{"key":"e_1_3_3_236_2","unstructured":"Mengjiao Yang Yilun Du Kamyar Ghasemipour Jonathan Tompson Dale Schuurmans and Pieter Abbeel. 2023. Learning interactive real-world simulators. arXiv:2310.06114. Retrieved from https:\/\/arxiv.org\/abs\/2310.06114"},{"key":"e_1_3_3_237_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Yang Sherry","year":"2024","unstructured":"Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. 2024. Learning interactive real-world simulators. 
In Proceedings of the 12th International Conference on Learning Representations."},{"key":"e_1_3_3_238_2","unstructured":"Sherry Yang Jacob Walker Jack Parker-Holder Yilun Du Jake Bruce Andre Barreto Pieter Abbeel and Dale Schuurmans. 2024. Video as the new language for real-world decision making. arXiv:2402.17139. Retrieved from https:\/\/arxiv.org\/abs\/2402.17139"},{"key":"e_1_3_3_239_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01536"},{"key":"e_1_3_3_240_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01538"},{"key":"e_1_3_3_241_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.hcc.2024.100211"},{"key":"e_1_3_3_242_2","unstructured":"Shengming Yin Chenfei Wu Huan Yang Jianfeng Wang Xiaodong Wang Minheng Ni Zhengyuan Yang Linjie Li Shuguang Liu Fan Yang et\u00a0al. 2023. Nuwa-xl: Diffusion over diffusion for extremely long video generation. arXiv:2303.12346. Retrieved from https:\/\/arxiv.org\/abs\/2303.12346"},{"key":"e_1_3_3_243_2","unstructured":"Jifan Yu Xiaozhi Wang Shangqing Tu Shulin Cao Daniel Zhang-Li Xin Lv Hao Peng Zijun Yao Xiaohan Zhang Hanming Li et\u00a0al. 2023. Kola: Carefully benchmarking world knowledge of large language models. arXiv:2306.09296. Retrieved from https:\/\/arxiv.org\/abs\/2306.09296"},{"key":"e_1_3_3_244_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01008"},{"key":"e_1_3_3_245_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2022.3145064"},{"key":"e_1_3_3_246_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v30i1.10289"},{"key":"e_1_3_3_247_2","doi-asserted-by":"publisher","DOI":"10.1093\/pnasnexus\/pgaf081"},{"key":"e_1_3_3_248_2","unstructured":"Jintian Zhang Xin Xu and Shumin Deng. 2023. Exploring collaboration mechanisms for llm agents: A social psychology view. arXiv:2310.02124. Retrieved from https:\/\/arxiv.org\/abs\/2310.02124"},{"key":"e_1_3_3_249_2","unstructured":"Lunjun Zhang Yuwen Xiong Ze Yang Sergio Casas Rui Hu and Raquel Urtasun. 2024. 
Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion. arxiv:2311.01017. Retrieved from https:\/\/arxiv.org\/abs\/2311.01017"},{"key":"e_1_3_3_250_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72627-9_22"},{"key":"e_1_3_3_251_2","unstructured":"Wenqi Zhang Ke Tang Hai Wu Mengna Wang Yongliang Shen Guiyang Hou Zeqi Tan Peng Li Yueting Zhuang and Weiming Lu. 2024. Agent-pro: Learning to evolve via policy-level reflection and optimization. arXiv:2402.17574. Retrieved from https:\/\/arxiv.org\/abs\/2402.17574"},{"key":"e_1_3_3_252_2","doi-asserted-by":"crossref","unstructured":"Zeyu Zhang Xiaohe Bo Chen Ma Rui Li Xu Chen Quanyu Dai Jieming Zhu Zhenhua Dong and Ji-Rong Wen. 2024. A survey on the memory mechanism of large language model based agents. arXiv:2404.13501. Retrieved from https:\/\/arxiv.org\/abs\/2404.13501","DOI":"10.1145\/3748302"},{"key":"e_1_3_3_253_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA48891.2023.10161243"},{"key":"e_1_3_3_254_2","unstructured":"Zhejun Zhang Alexander Liniger Christos Sakaridis Fisher Yu and Luc Van Gool. 2023. Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding. arxiv:2310.12970. Retrieved from https:\/\/arxiv.org\/abs\/2310.12970"},{"key":"e_1_3_3_255_2","doi-asserted-by":"crossref","unstructured":"Baining Zhao Jianjie Fang Zichao Dai Ziyou Wang Jirong Zha Weichen Zhang Chen Gao Yue Wang Jinqiang Cui Xinlei Chen et\u00a0al. 2025. Urbanvideo-bench: Benchmarking vision-language models on embodied intelligence with video data in urban spaces. arXiv:2503.06157. Retrieved from https:\/\/arxiv.org\/abs\/2503.06157","DOI":"10.18653\/v1\/2025.acl-long.1558"},{"key":"e_1_3_3_256_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01542"},{"key":"e_1_3_3_257_2","unstructured":"Guosheng Zhao Xiaofeng Wang Zheng Zhu Xinze Chen Guan Huang Xiaoyi Bao and Xingang Wang. 2024. 
DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation. arxiv:2403.06845. Retrieved from https:\/\/arxiv.org\/abs\/2403.06845"},{"key":"e_1_3_3_258_2","article-title":"Large language models as commonsense knowledge for large-scale task planning","volume":"36","author":"Zhao Zirui","year":"2024","unstructured":"Zirui Zhao, Wee Sun Lee, and David Hsu. 2024. Large language models as commonsense knowledge for large-scale task planning. Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_259_2","unstructured":"Haoyu Zhen Xiaowen Qiu Peihao Chen Jincheng Yang Xin Yan Yilun Du Yining Hong and Chuang Gan. 2024. 3d-vla: A 3d vision-language-action generative world model. arXiv:2403.09631. Retrieved from https:\/\/arxiv.org\/abs\/2403.09631"},{"key":"e_1_3_3_260_2","doi-asserted-by":"publisher","DOI":"10.1126\/sciadv.abk2607"},{"key":"e_1_3_3_261_2","doi-asserted-by":"crossref","unstructured":"Wenzhao Zheng Weiliang Chen Yuanhui Huang Borui Zhang Yueqi Duan and Jiwen Lu. 2023. Occworld: Learning a 3d occupancy world model for autonomous driving. arXiv:2311.16038. Retrieved from https:\/\/arxiv.org\/abs\/2311.16038","DOI":"10.1007\/978-3-031-72624-8_4"},{"key":"e_1_3_3_262_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72624-8_4"},{"key":"e_1_3_3_263_2","doi-asserted-by":"crossref","unstructured":"Hongyu Zhou Zheng Ge Zeming Li and Xiangyu Zhang. 2022. MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception. arxiv:2211.10593. Retrieved from https:\/\/arxiv.org\/abs\/2211.10593","DOI":"10.1109\/ICCV51070.2023.00785"},{"key":"e_1_3_3_264_2","unstructured":"Siyuan Zhou Yilun Du Jiaben Chen Yandong Li Dit-Yan Yeung and Chuang Gan. 2024. RoboDreamer: Learning compositional world models for robot imagination. arXiv:2404.12377. 
Retrieved from https:\/\/arxiv.org\/abs\/2404.12377"},{"key":"e_1_3_3_265_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01713"},{"key":"e_1_3_3_266_2","unstructured":"Fangqi Zhu Hongtao Wu Song Guo Yuxiao Liu Chilam Cheang and Tao Kong. 2024. Irasim: Learning interactive real-robot action simulators. arXiv:2406.14540. Retrieved from https:\/\/arxiv.org\/abs\/2406.14540"},{"key":"e_1_3_3_267_2","unstructured":"Zheng Zhu Xiaofeng Wang Wangbo Zhao Chen Min Nianchen Deng Min Dou Yuqi Wang Botian Shi Kai Wang Chi Zhang et\u00a0al. 2024. Is sora a world simulator? A comprehensive survey on general world models and beyond. arXiv:2405.03520. Retrieved from https:\/\/arxiv.org\/abs\/2405.03520"},{"key":"e_1_3_3_268_2","doi-asserted-by":"publisher","DOI":"10.1109\/TITS.2019.2913166"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746449","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,9]],"date-time":"2025-09-09T14:33:45Z","timestamp":1757428425000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746449"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,9]]},"references-count":267,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3746449"],"URL":"https:\/\/doi.org\/10.1145\/3746449","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,9]]},"assertion":[{"value":"2024-11-21","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-13","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2025-09-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}