{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T21:15:44Z","timestamp":1775682944554,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":65,"publisher":"ACM","license":[{"start":{"date-parts":[[2025,5,8]],"date-time":"2025-05-08T00:00:00Z","timestamp":1746662400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,5,8]]},"DOI":"10.1145\/3701716.3717565","type":"proceedings-article","created":{"date-parts":[[2025,5,23]],"date-time":"2025-05-23T16:12:56Z","timestamp":1748016776000},"page":"2181-2186","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["CubeRobot: Grounding Language in Rubik's Cube Manipulation via Vision-Language Model"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-8122-7568","authenticated-orcid":false,"given":"Feiyang","family":"Wang","sequence":"first","affiliation":[{"name":"Central South University, Changsha, Hunan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4846-3162","authenticated-orcid":false,"given":"Xiaomin","family":"Yu","sequence":"additional","affiliation":[{"name":"Great Bay University, Guangdong, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8404-9489","authenticated-orcid":false,"given":"Wangyu","family":"Wu","sequence":"additional","affiliation":[{"name":"The University of Liverpool, Liverpool, United Kingdom"}]}],"member":"320","published-online":{"date-parts":[[2025,5,23]]},"reference":[{"key":"e_1_3_2_2_1_1","unstructured":"Jean-Baptiste Alayrac Jeff Donahue Pauline Luc Antoine Miech Iain Barr Yana Hasson Karel Lenc Arthur Mensch Katie Millican Malcolm Reynolds Roman Ring Eliza Rutherford Serkan Cabi Tengda Han Zhitao Gong 
Sina Samangooei Marianne Monteiro Jacob Menick Sebastian Borgeaud Andrew Brock Aida Nematzadeh Sahand Sharifzadeh Mikolaj Binkowski Ricardo Barreira Oriol Vinyals Andrew Zisserman and Karen Simonyan. 2022. Flamingo: a Visual Language Model for Few-Shot Learning. arxiv: 2204.14198 [cs.CV]"},{"key":"e_1_3_2_2_2_1","unstructured":"Kirolos Ataallah Xiaoqian Shen Eslam Abdelrahman Essam Sleiman Deyao Zhu Jian Ding and Mohamed Elhoseiny. 2024. MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens. arxiv: 2404.03413 [cs.CV] https:\/\/arxiv.org\/abs\/2404.03413"},{"key":"e_1_3_2_2_3_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems Vol. 33 (2020) 1877--1901."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01750"},{"key":"e_1_3_2_2_5_1","volume-title":"Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions. arXiv preprint arXiv:2304.04227","author":"Chen Jun","year":"2023","unstructured":"Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, and Mohamed Elhoseiny. 2023a. Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions. arXiv preprint arXiv:2304.04227 (2023)."},{"key":"e_1_3_2_2_6_1","unstructured":"Jun Chen Deyao Zhu Xiaoqian Shen Xiang Li Zechun Liu Pengchuan Zhang Raghuraman Krishnamoorthi Vikas Chandra Yunyang Xiong and Mohamed Elhoseiny. 2023b. MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. arxiv: 2310.09478 [cs.CV]"},{"key":"e_1_3_2_2_7_1","volume-title":"Xing","author":"Chiang Wei-Lin","year":"2023","unstructured":"Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. 
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https:\/\/vicuna.lmsys.org"},{"key":"e_1_3_2_2_8_1","volume-title":"Charles Sutton, Sebastian Gehrmann, et al.","author":"Chowdhery Aakanksha","year":"2022","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)."},{"key":"e_1_3_2_2_9_1","unstructured":"Hyung Won Chung Le Hou Shayne Longpre Barret Zoph Yi Tay William Fedus Eric Li Xuezhi Wang Mostafa Dehghani Siddhartha Brahma et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1088\/1751-8121\/ac2596"},{"key":"e_1_3_2_2_11_1","unstructured":"Denes Ferenc. 2012. Ruwix Twisty Puzzle Wiki More Rubik's Patterns. https:\/\/ruwix.com\/the-rubiks-cube\/rubiks-cube-patterns-algorithms\/more-rubiks-patterns\/ Last accessed on 2024-5-22."},{"key":"e_1_3_2_2_12_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_13_1","volume-title":"Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al.","author":"Driess Danny","year":"2023","unstructured":"Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023a. Palm-e: An embodied multimodal language model. 
arXiv preprint arXiv:2303.03378 (2023)."},{"key":"e_1_3_2_2_14_1","unstructured":"Danny Driess Fei Xia Mehdi S. M. Sajjadi Corey Lynch Aakanksha Chowdhery Brian Ichter Ayzaan Wahid Jonathan Tompson Quan Vuong Tianhe Yu Wenlong Huang Yevgen Chebotar Pierre Sermanet Daniel Duckworth Sergey Levine Vincent Vanhoucke Karol Hausman Marc Toussaint Klaus Greff Andy Zeng Igor Mordatch and Pete Florence. 2023b. PaLM-E: An Embodied Multimodal Language Model. arxiv: 2303.03378 [cs.LG]"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"crossref","unstructured":"Zhizhao Duan Hao Cheng Duo Xu Xi Wu Xiangxie Zhang Xi Ye and Zhen Xie. 2024. CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario. arxiv: 2405.03194 [cs.CV]","DOI":"10.1109\/CVPRW63382.2024.00713"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"crossref","unstructured":"Xiaojiao Guo Xuhang Chen Shenghong Luo Shuqiang Wang and Chi-Man Pun. 2024. Dual-Hybrid Attention Network for Specular Highlight Removal. In ACM MM. 10173--10181.","DOI":"10.1145\/3664647.3680745"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2025.3525593"},{"key":"e_1_3_2_2_18_1","volume-title":"Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al.","author":"Hoffmann Jordan","year":"2022","unstructured":"Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022)."},{"key":"e_1_3_2_2_19_1","volume-title":"Qiang Liu, et al.","author":"Huang Shaohan","year":"2023","unstructured":"Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. 2023. Language is not all you need: Aligning perception with language models. 
arXiv preprint arXiv:2302.14045 (2023)."},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME52920.2022.9859889"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01303"},{"key":"e_1_3_2_2_22_1","unstructured":"Mamadou Keita Wassim Hamidouche Hassen Bougueffa Abdenour Hadid and Abdelmalik Taleb-Ahmed. 2024. Harnessing the Power of Large Vision Language Models for Synthetic Image Detection. arxiv: 2404.02726 [cs.CV]"},{"key":"e_1_3_2_2_23_1","unstructured":"Junnan Li Dongxu Li Silvio Savarese and Steven Hoi. 2023a. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In ICML."},{"key":"e_1_3_2_2_24_1","volume-title":"Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023b. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)."},{"key":"e_1_3_2_2_25_1","volume-title":"International Conference on Machine Learning. PMLR, 12888--12900","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning. PMLR, 12888--12900."},{"key":"e_1_3_2_2_26_1","volume-title":"Emerging Cutting-Edge Developments in Intelligent Traffic and Transportation Systems","author":"Li Xiwen","unstructured":"Xiwen Li, Tristalee Mangin, Surojit Saha, Rehman Mohammed, Evan Blanchard, Dillon Tang, Henry Poppe, Ouk Choi, Kerry Kelly, and Ross Whitaker. 2024a. Real-time idling vehicles detection using combined audio-visual deep learning. 
In Emerging Cutting-Edge Developments in Intelligent Traffic and Transportation Systems. IOS Press, 142--158."},{"key":"e_1_3_2_2_27_1","volume-title":"Joint audio-visual idling vehicle detection with streamlined input dependencies. arXiv preprint arXiv:2410.21170","author":"Li Xiwen","year":"2024","unstructured":"Xiwen Li, Rehman Mohammed, Tristalee Mangin, Surojit Saha, Ross T Whitaker, Kerry E Kelly, and Tolga Tasdizen. 2024b. Joint audio-visual idling vehicle detection with streamlined input dependencies. arXiv preprint arXiv:2410.21170 (2024)."},{"key":"e_1_3_2_2_28_1","unstructured":"Haotian Liu Chunyuan Li Yuheng Li and Yong Jae Lee. 2023. Improved Baselines with Visual Instruction Tuning. arxiv: 2310.03744 [cs.CV]"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"crossref","unstructured":"Aoran Mei Jianhua Wang Guo-Niu Zhu and Zhongxue Gan. 2024. GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games. arxiv: 2405.13751 [cs.RO]","DOI":"10.1109\/ICMA61710.2024.10633088"},{"key":"e_1_3_2_2_30_1","volume-title":"https:\/\/openai.com\/blog\/chatgpt","author":"OpenAI","year":"2022","unstructured":"OpenAI. 2022. Introducing ChatGPT. https:\/\/openai.com\/blog\/chatgpt (2022)."},{"key":"e_1_3_2_2_31_1","unstructured":"OpenAI. 2023. GPT-4 Technical Report."},{"key":"e_1_3_2_2_32_1","volume-title":"Solving Rubik's Cube with a Robot Hand","author":"OpenAI, Ilge Akkaya, et al.","year":"2019","unstructured":"OpenAI, Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, Jonas Schneider, Nikolas Tezak, Jerry Tworek, Peter Welinder, Lilian Weng, Qiming Yuan, Wojciech Zaremba, and Lei Zhang. 2019. Solving Rubik's Cube with a Robot Hand. 
arxiv: 1910.07113 [cs.LG]"},{"key":"e_1_3_2_2_33_1","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang Long","year":"2022","unstructured":"Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, Vol. 35 (2022), 27730--27744.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_34_1","volume-title":"Percy Liang, and Michael S.","author":"Park Joon Sung","year":"2023","unstructured":"Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. arxiv: 2304.03442 [cs.HC]"},{"key":"e_1_3_2_2_35_1","unstructured":"Alec Radford Jeffrey Wu Rewon Child David Luan Dario Amodei Ilya Sutskever et al. 2019. Language models are unsupervised multitask learners. OpenAI blog Vol. 1 8 (2019) 9."},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.5555\/3455716.3455856"},{"key":"e_1_3_2_2_37_1","volume-title":"Fran\u00e7ois Yvon, Matthias Gall\u00e9, et al.","author":"Scao Teven Le","year":"2022","unstructured":"Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ili\u0107, Daniel Hesslow, Roman Castagn\u00e9, Alexandra Sasha Luccioni, Fran\u00e7ois Yvon, Matthias Gall\u00e9, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022)."},{"key":"e_1_3_2_2_38_1","unstructured":"Shaden Smith Mostofa Patwary Brandon Norick Patrick LeGresley Samyam Rajbhandari Jared Casper Zhun Liu Shrimai Prabhumoye George Zerveas Vijay Korthikanti et al. 2022. Using deepspeed and megatron to train megatron-turing nlg 530b a large-scale generative language model. 
arXiv preprint arXiv:2201.11990 (2022)."},{"key":"e_1_3_2_2_39_1","volume-title":"ViperGPT: Visual Inference via Python Execution for Reasoning. arXiv preprint arXiv:2303.08128","author":"Sur\u00eds D\u00eddac","year":"2023","unstructured":"D\u00eddac Sur\u00eds, Sachit Menon, and Carl Vondrick. 2023. ViperGPT: Visual Inference via Python Execution for Reasoning. arXiv preprint arXiv:2303.08128 (2023)."},{"key":"e_1_3_2_2_40_1","volume-title":"Hashimoto","author":"Taori Rohan","year":"2023","unstructured":"Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https:\/\/github.com\/tatsu-lab\/stanford_alpaca."},{"key":"e_1_3_2_2_41_1","volume-title":"Gemini: A Family of Highly Capable Multimodal Models. arxiv: 2312.11805 [cs.CL] https:\/\/arxiv.org\/abs\/2312.11805","author":"Team Gemini","year":"2024","unstructured":"Gemini Team. 2024. Gemini: A Family of Highly Capable Multimodal Models. arxiv: 2312.11805 [cs.CL] https:\/\/arxiv.org\/abs\/2312.11805"},{"key":"e_1_3_2_2_42_1","unstructured":"XAgent Team. 2023. XAgent: An Autonomous Agent for Complex Task Solving."},{"key":"e_1_3_2_2_43_1","volume-title":"Plug-and-play vqa: Zero-shot vqa by conjoining large pretrained models with zero training. arXiv preprint arXiv:2210.08773","author":"Huat Tiong Anthony Meng","year":"2022","unstructured":"Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, and Steven CH Hoi. 2022. Plug-and-play vqa: Zero-shot vqa by conjoining large pretrained models with zero training. arXiv preprint arXiv:2210.08773 (2022)."},{"key":"e_1_3_2_2_44_1","volume-title":"Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_3_2_2_45_1","first-page":"200","article-title":"Multimodal few-shot learning with frozen language models","volume":"34","author":"Tsimpoukelli Maria","year":"2021","unstructured":"Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, Vol. 34 (2021), 200--212.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"crossref","unstructured":"Prateek Verma Minh-Hao Van and Xintao Wu. 2024. Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis. arxiv: 2405.00876 [cs.CV]","DOI":"10.1109\/BigData62323.2024.10825000"},{"key":"e_1_3_2_2_47_1","volume-title":"Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus.","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022). https:\/\/openreview.net\/forum?id=yzkSU5zdwD Survey Certification."},{"key":"e_1_3_2_2_48_1","volume-title":"Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv preprint arXiv:2303.04671","author":"Wu Chenfei","year":"2023","unstructured":"Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 
2023. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv preprint arXiv:2303.04671 (2023)."},{"key":"e_1_3_2_2_49_1","first-page":"109626","article-title":"Adaptive Patch Contrast for Weakly Supervised Semantic Segmentation","volume":"139","author":"Wu Wangyu","year":"2025","unstructured":"Wangyu Wu, Tianhong Dai, Zhenhong Chen, Xiaowei Huang, Jimin Xiao, Fei Ma, and Renrong Ouyang. 2025. Adaptive Patch Contrast for Weakly Supervised Semantic Segmentation. EAAI, Vol. 139 (2025), 109626.","journal-title":"EAAI"},{"key":"e_1_3_2_2_50_1","volume-title":"ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Wu Wangyu","unstructured":"Wangyu Wu, Tianhong Dai, Xiaowei Huang, Fei Ma, and Jimin Xiao. 2024a. Image Augmentation with Controlled Diffusion for Weakly-Supervised Semantic Segmentation. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6175--6179."},{"key":"e_1_3_2_2_51_1","volume-title":"ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Wu Wangyu","unstructured":"Wangyu Wu, Tianhong Dai, Xiaowei Huang, Fei Ma, and Jimin Xiao. 2024b. Image Augmentation with Controlled Diffusion for Weakly-Supervised Semantic Segmentation. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6175--6179."},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/SMC54092.2024.10831685"},{"key":"e_1_3_2_2_53_1","volume-title":"2024 d. Prompt Categories Cluster for Weakly Supervised Semantic Segmentation. arXiv preprint arXiv:2412.13823","author":"Wu Wangyu","year":"2024","unstructured":"Wangyu Wu, Xianglin Qiu, Siqi Song, Xiaowei Huang, Fei Ma, and Jimin Xiao. 2024 d. Prompt Categories Cluster for Weakly Supervised Semantic Segmentation. 
arXiv preprint arXiv:2412.13823 (2024)."},{"key":"e_1_3_2_2_54_1","volume-title":"Proceedings of the 29th International Conference on Computational Linguistics. 2561--2571","author":"Xiao Zhaomin","year":"2022","unstructured":"Zhaomin Xiao and Eduardo Blanco. 2022. Are people located in the places they mention in their tweets? a multimodal approach. In Proceedings of the 29th International Conference on Computational Linguistics. 2561--2571."},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3655497.3655500"},{"key":"e_1_3_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCMI59957.2023.10458651"},{"key":"e_1_3_2_2_57_1","volume-title":"Zero-shot video question answering via frozen bidirectional language models. arXiv preprint arXiv:2206.08155","author":"Yang Antoine","year":"2022","unstructured":"Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. 2022. Zero-shot video question answering via frozen bidirectional language models. arXiv preprint arXiv:2206.08155 (2022)."},{"key":"e_1_3_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/lra.2020.2969912"},{"key":"e_1_3_2_2_59_1","unstructured":"Zhengyuan Yang Linjie Li Kevin Lin Jianfeng Wang Chung-Ching Lin Zicheng Liu and Lijuan Wang. 2023. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arxiv: 2309.17421 [cs.CV]"},{"key":"e_1_3_2_2_60_1","unstructured":"Zhengyuan Yang* Linjie Li* Jianfeng Wang* Kevin Lin* Ehsan Azarnasab* Faisal Ahmed* Zicheng Liu Ce Liu Michael Zeng and Lijuan Wang. 2023. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action. (2023)."},{"key":"e_1_3_2_2_61_1","volume-title":"Swift Sampler: Efficient Learning of Sampler by 10 Parameters. arXiv preprint arXiv:2410.05578","author":"Yao Jiawei","year":"2024","unstructured":"Jiawei Yao, Chuming Li, and Canran Xiao. 2024. Swift Sampler: Efficient Learning of Sampler by 10 Parameters. 
arXiv preprint arXiv:2410.05578 (2024)."},{"key":"e_1_3_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2023.08.195"},{"key":"e_1_3_2_2_63_1","volume-title":"Xi Victoria Lin, et al","author":"Zhang Susan","year":"2022","unstructured":"Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)."},{"key":"e_1_3_2_2_64_1","volume-title":"BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions. arXiv preprint arXiv:2303.06594","author":"Zhu Deyao","year":"2023","unstructured":"Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, and Mohamed Elhoseiny. 2023a. ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions. arXiv preprint arXiv:2303.06594 (2023)."},{"key":"e_1_3_2_2_65_1","unstructured":"Deyao Zhu Jun Chen Xiaoqian Shen Xiang Li and Mohamed Elhoseiny. 2023b. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. 
arxiv: 2304.10592 [cs.CV]"}],"event":{"name":"WWW '25: The ACM Web Conference 2025","location":"Sydney NSW Australia","acronym":"WWW '25","sponsor":["SIGWEB ACM Special Interest Group on Hypertext, Hypermedia, and Web"]},"container-title":["Companion Proceedings of the ACM on Web Conference 2025"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3701716.3717565","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3701716.3717565","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,8]],"date-time":"2025-10-08T03:05:42Z","timestamp":1759892742000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3701716.3717565"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,8]]},"references-count":65,"alternative-id":["10.1145\/3701716.3717565","10.1145\/3701716"],"URL":"https:\/\/doi.org\/10.1145\/3701716.3717565","relation":{},"subject":[],"published":{"date-parts":[[2025,5,8]]},"assertion":[{"value":"2025-05-23","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}