{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,12]],"date-time":"2025-12-12T02:11:55Z","timestamp":1765505515762,"version":"3.48.0"},"publisher-location":"New York, NY, USA","reference-count":40,"publisher":"ACM","funder":[{"name":"National Science and Technology Council, Taiwan","award":["NSTC 113-2634-F-002-003- and 114-2221-E-002-070-MY3"],"award-info":[{"award-number":["NSTC 113-2634-F-002-003- and 114-2221-E-002-070-MY3"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,10]]},"DOI":"10.1145\/3746252.3760963","type":"proceedings-article","created":{"date-parts":[[2025,11,8]],"date-time":"2025-11-08T00:36:36Z","timestamp":1762562196000},"page":"4659-4664","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["VQA-Induct: Instruction Induction for Visual Question Answering"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2888-7278","authenticated-orcid":false,"given":"Po-Chun","family":"Chen","sequence":"first","affiliation":[{"name":"Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9169-3081","authenticated-orcid":false,"given":"Hen-Hsen","family":"Huang","sequence":"additional","affiliation":[{"name":"Institute of Information Science, Academia Sinica, Taipei, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9757-9423","authenticated-orcid":false,"given":"Hsin-Hsi","family":"Chen","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan and AI Research Center (AINTU), National Taiwan University, Taipei, 
Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,11,10]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.","author":"Brown Tom B.","year":"2020","unstructured":"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. ArXiv, Vol. abs\/2005.14165 (2020). https:\/\/api.semanticscholar.org\/CorpusID:218971783"},{"key":"e_1_3_2_1_2_1","volume-title":"Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research","volume":"6518","author":"Chen Lichang","year":"2024","unstructured":"Lichang Chen, Jiuhai Chen, Tom Goldstein, Heng Huang, and Tianyi Zhou. 2024a. InstructZero: Efficient Instruction Optimization for Black-Box Large Language Models. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (Eds.). PMLR, 6503-6518. 
https:\/\/proceedings.mlr.press\/v235\/chen24e.html"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.emnlp-main.297"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-acl.962"},{"key":"e_1_3_2_1_5_1","volume-title":"Yew Ken Chia, and Soujanya Poria","author":"Ghosal Deepanway","year":"2025","unstructured":"Deepanway Ghosal, Vernon Toh, Yew Ken Chia, and Soujanya Poria. 2025. AlgoPuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Algorithmic Multimodal Puzzles. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.). Association for Computational Linguistics, Albuquerque, New Mexico, 9615-9632. https:\/\/aclanthology.org\/2025.naacl-long.486\/"},{"key":"e_1_3_2_1_6_1","volume-title":"Flash-Lite and Pro. https:\/\/developers.googleblog.com\/en\/gemini-2-family-expands\/ FEB. 5","author":"Gemini","year":"2025","unstructured":"Google. 2025. Gemini 2.0: Flash, Flash-Lite and Pro. https:\/\/developers.googleblog.com\/en\/gemini-2-family-expands\/ FEB. 5, 2025."},{"key":"e_1_3_2_1_7_1","volume-title":"MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale. ArXiv","author":"Guo Jarvis","year":"2024","unstructured":"Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. 2024. MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale. ArXiv, Vol. abs\/2412.05237 (2024). https:\/\/api.semanticscholar.org\/CorpusID:274581700"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","unstructured":"Or Honovich Uri Shaham Samuel R. Bowman and Omer Levy. 2023. Instruction Induction: From Few Examples to Natural Language Task Descriptions. 
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Anna Rogers Jordan Boyd-Graber and Naoaki Okazaki (Eds.). Association for Computational Linguistics Toronto Canada 1935-1952. doi:10.18653\/v1\/2023.acl-long.108","DOI":"10.18653\/v1\/2023.acl-long.108"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-acl.634"},{"key":"e_1_3_2_1_10_1","volume-title":"Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models. ArXiv","author":"Hu Yushi","year":"2024","unstructured":"Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke S. Zettlemoyer, Noah A. Smith, and Ranjay Krishna. 2024a. Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models. ArXiv, Vol. abs\/2406.09403 (2024). https:\/\/api.semanticscholar.org\/CorpusID:270440440"},{"key":"e_1_3_2_1_11_1","volume-title":"Prompting Large Language Models with Rationale Heuristics for Knowledge-based Visual Question Answering. ArXiv","author":"Hu Zhongjian","year":"2024","unstructured":"Zhongjian Hu, Peng Yang, Bing Li, and Fengyuan Liu. 2024b. Prompting Large Language Models with Rationale Heuristics for Knowledge-based Visual Question Answering. ArXiv, Vol. abs\/2412.16936 (2024). https:\/\/api.semanticscholar.org\/CorpusID:274982704"},{"key":"e_1_3_2_1_12_1","volume-title":"Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa.","author":"Kojima Takeshi","year":"2022","unstructured":"Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. ArXiv, Vol. abs\/2205.11916 (2022). https:\/\/api.semanticscholar.org\/CorpusID:249017743"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612389"},{"key":"e_1_3_2_1_14_1","volume-title":"LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception. 
ArXiv","author":"Liao Yuan-Hong","year":"2025","unstructured":"Yuan-Hong Liao, Sven Elflein, Liu He, Laura Leal-Taixé, Yejin Choi, Sanja Fidler, and David Acuna. 2025. LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception. ArXiv, Vol. abs\/2504.15362 (2025). https:\/\/api.semanticscholar.org\/CorpusID:277993790"},{"key":"e_1_3_2_1_15_1","volume-title":"Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models. ArXiv","author":"Liu Zuyan","year":"2024","unstructured":"Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. 2024. Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models. ArXiv, Vol. abs\/2403.12966 (2024). https:\/\/api.semanticscholar.org\/CorpusID:268532518"},{"key":"e_1_3_2_1_16_1","volume-title":"PKRD-CoT: A Unified Chain-of-thought Prompting for Multi-Modal Large Language Models in Autonomous Driving. ArXiv","author":"Luo Xuewen","year":"2025","unstructured":"Xuewen Luo, Fan Ding, Yinsheng Song, Xiaofeng Zhang, and Junn Yong Loo. 2024. PKRD-CoT: A Unified Chain-of-thought Prompting for Multi-Modal Large Language Models in Autonomous Driving. ArXiv, Vol. abs\/2412.02025 (2024). https:\/\/api.semanticscholar.org\/CorpusID:274445818"},{"key":"e_1_3_2_1_17_1","volume-title":"The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https:\/\/ai.meta.com\/blog\/llama-4-multimodal-intelligence\/","author":"AI.","year":"2025","unstructured":"MetaAI. 2025. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https:\/\/ai.meta.com\/blog\/llama-4-multimodal-intelligence\/ April 5, 2025."},{"key":"e_1_3_2_1_18_1","volume-title":"https:\/\/mistral.ai\/news\/pixtral-12b","author":"Announcing AI.","year":"2024","unstructured":"MistralAI. 2024a. Announcing Pixtral 12B. https:\/\/mistral.ai\/news\/pixtral-12b Sep 17, 2024."},{"key":"e_1_3_2_1_19_1","volume-title":"Pixtral Large. 
https:\/\/mistral.ai\/news\/pixtral-large","author":"AI.","year":"2024","unstructured":"MistralAI. 2024b. Pixtral Large. https:\/\/mistral.ai\/news\/pixtral-large Nov 18, 2024."},{"key":"e_1_3_2_1_20_1","volume-title":"Medium is the new large. https:\/\/mistral.ai\/news\/mistral-medium-3","author":"AI.","year":"2025","unstructured":"MistralAI. 2025a. Medium is the new large. https:\/\/mistral.ai\/news\/mistral-medium-3 May 7, 2025."},{"key":"e_1_3_2_1_21_1","volume-title":"https:\/\/mistral.ai\/news\/mistral-small-3-1","author":"Mistral AI.","year":"2025","unstructured":"MistralAI. 2025b. Mistral Small 3.1. https:\/\/mistral.ai\/news\/mistral-small-3-1 Mar 17, 2025."},{"key":"e_1_3_2_1_22_1","first-page":"14420","volume-title":"Compositional Chain-of-Thought Prompting for Large Multimodal Models. 2024 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Mitra Chancharik","year":"2023","unstructured":"Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. 2023. Compositional Chain-of-Thought Prompting for Large Multimodal Models. 2024 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023), 14420-14431. https:\/\/api.semanticscholar.org\/CorpusID:265498786"},{"key":"e_1_3_2_1_23_1","volume-title":"KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning. In AAAI Conference on Artificial Intelligence. https:\/\/api.semanticscholar.org\/CorpusID:267095090","author":"Mondal Debjyoti","year":"2024","unstructured":"Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. 2024. KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning. In AAAI Conference on Artificial Intelligence. https:\/\/api.semanticscholar.org\/CorpusID:267095090"},{"key":"e_1_3_2_1_24_1","volume-title":"GPT-4o mini: advancing cost-efficient intelligence. 
https:\/\/openai.com\/index\/gpt-4o-mini-advancing-cost-efficient-intelligence\/","author":"AI.","year":"2024","unstructured":"OpenAI. 2024a. GPT-4o mini: advancing cost-efficient intelligence. https:\/\/openai.com\/index\/gpt-4o-mini-advancing-cost-efficient-intelligence\/ July 18, 2024."},{"key":"e_1_3_2_1_25_1","volume-title":"https:\/\/openai.com\/index\/hello-gpt-4o\/","author":"AI.","year":"2024","unstructured":"OpenAI. 2024b. Hello GPT-4o. https:\/\/openai.com\/index\/hello-gpt-4o\/ May 13, 2024."},{"key":"e_1_3_2_1_26_1","volume-title":"https:\/\/openai.com\/index\/gpt-4-1\/","author":"AI.","year":"2025","unstructured":"OpenAI. 2025a. Introducing GPT-4.1 in the API. https:\/\/openai.com\/index\/gpt-4-1\/ April 14, 2025."},{"key":"e_1_3_2_1_27_1","volume-title":"https:\/\/openai.com\/index\/introducing-gpt-4-5\/","author":"AI.","year":"2025","unstructured":"OpenAI. 2025b. Introducing GPT-4.5. https:\/\/openai.com\/index\/introducing-gpt-4-5\/ February 27, 2025."},{"key":"e_1_3_2_1_28_1","unstructured":"OpenAI. 2025c. OpenAI Platform: Images and Vision. https:\/\/platform.openai.com\/docs\/guides\/images-vision?api-mode=chat#gpt-4o-gpt-4-1-gpt-4o-mini-cua-and-o-series-except-o4-mini Accessed: 2025-06-04."},{"key":"e_1_3_2_1_29_1","unstructured":"Hao Shao Shengju Qian Han Xiao Guanglu Song Zhuofan Zong Letian Wang Yu Liu and Hongsheng Li. 2024. Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning. In Neural Information Processing Systems. https:\/\/api.semanticscholar.org\/CorpusID:271051212"},{"key":"e_1_3_2_1_30_1","volume-title":"Jian Jiao, and Denis Xavier Charles.","author":"Sun Hong","year":"2023","unstructured":"Hong Sun, Xue Li, Yi Xu, Youkow Homma, Qinhao Cao, Min man Wu, Jian Jiao, and Denis Xavier Charles. 2023. AutoHint: Automatic Prompt Optimization with Hint Generation. ArXiv, Vol. abs\/2307.07415 (2023). 
https:\/\/api.semanticscholar.org\/CorpusID:259924936"},{"key":"e_1_3_2_1_31_1","volume-title":"AAAI Conference on Artificial Intelligence. https:\/\/api.semanticscholar.org\/CorpusID:258546810","author":"Wang Lei","year":"2023","unstructured":"Lei Wang, Yilang Hu, Jiabang He, Xingdong Xu, Ning Liu, Hui juan Liu, and Hengtao Shen. 2023. T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering. In AAAI Conference on Artificial Intelligence. https:\/\/api.semanticscholar.org\/CorpusID:258546810"},{"key":"e_1_3_2_1_32_1","volume-title":"Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation. ArXiv","author":"Wang Yu","year":"2024","unstructured":"Yu Wang, Shiwan Zhao, Zhihu Wang, Heyuan Huang, Ming Fan, Yubo Zhang, Zhixing Wang, Haijun Wang, and Ting Liu. 2024. Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation. ArXiv, Vol. abs\/2409.03271 (2024). https:\/\/api.semanticscholar.org\/CorpusID:272423721"},{"volume-title":"European Conference on Computer Vision. https:\/\/api.semanticscholar.org\/CorpusID:268532065","author":"Wu Yixuan","key":"e_1_3_2_1_33_1","unstructured":"Yixuan Wu, Yizhou Wang, Shixiang Tang, Wenhao Wu, Tong He, Wanli Ouyang, Jian Wu, and Philip H. S. Torr. 2024. DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM. In European Conference on Computer Vision. https:\/\/api.semanticscholar.org\/CorpusID:268532065"},{"key":"e_1_3_2_1_34_1","volume-title":"LLaVA-CoT: Let Vision Language Models Reason Step-by-Step. ArXiv","author":"Xu Guowei","year":"2024","unstructured":"Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. 2024. LLaVA-CoT: Let Vision Language Models Reason Step-by-Step. ArXiv, Vol. abs\/2411.10440 (2024). 
https:\/\/api.semanticscholar.org\/CorpusID:274116688"},{"key":"e_1_3_2_1_35_1","volume-title":"PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering. ArXiv","author":"Zhang Xiaoman","year":"2023","unstructured":"Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023b. PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering. ArXiv, Vol. abs\/2305.10415 (2023). https:\/\/api.semanticscholar.org\/CorpusID:258741360"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-emnlp.659"},{"key":"e_1_3_2_1_37_1","article-title":"Multimodal Chain-of-Thought Reasoning in Language","volume":"2024","author":"Zhang Zhuosheng","year":"2023","unstructured":"Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alexander J. Smola. 2023c. Multimodal Chain-of-Thought Reasoning in Language Models. Trans. Mach. Learn. Res., Vol. 2024 (2023). https:\/\/api.semanticscholar.org\/CorpusID:256504063","journal-title":"Models. Trans. Mach. Learn. Res."},{"key":"e_1_3_2_1_38_1","volume-title":"DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models. ArXiv","author":"Zheng Ge","year":"2023","unstructured":"Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. 2023. DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models. ArXiv, Vol. abs\/2310.16436 (2023). https:\/\/api.semanticscholar.org\/CorpusID:264451538"},{"key":"e_1_3_2_1_39_1","volume-title":"Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework. ArXiv","author":"Zhi Zhuo","year":"2025","unstructured":"Zhuo Zhi, Chen Feng, Adam Daneshmend, Mine Orlu, Andreas Demosthenous, Lu Yin, Da Li, Ziquan Liu, and Miguel Rodrigues. 2025. Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework. ArXiv, Vol. abs\/2503.08308 (2025). 
https:\/\/api.semanticscholar.org\/CorpusID:276928791"},{"key":"e_1_3_2_1_40_1","volume-title":"Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba.","author":"Zhou Yongchao","year":"2022","unstructured":"Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large Language Models Are Human-Level Prompt Engineers. ArXiv, Vol. abs\/2211.01910 (2022). https:\/\/api.semanticscholar.org\/CorpusID:253265328"}],"event":{"name":"CIKM '25: The 34th ACM International Conference on Information and Knowledge Management","sponsor":["SIGIR ACM Special Interest Group on Information Retrieval","SIGWEB ACM Special Interest Group on Hypertext, Hypermedia, and Web"],"location":"Seoul Republic of Korea","acronym":"CIKM '25"},"container-title":["Proceedings of the 34th ACM International Conference on Information and Knowledge Management"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746252.3760963","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,12]],"date-time":"2025-12-12T02:08:23Z","timestamp":1765505303000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746252.3760963"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,10]]},"references-count":40,"alternative-id":["10.1145\/3746252.3760963","10.1145\/3746252"],"URL":"https:\/\/doi.org\/10.1145\/3746252.3760963","relation":{},"subject":[],"published":{"date-parts":[[2025,11,10]]},"assertion":[{"value":"2025-11-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}