{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:56:19Z","timestamp":1781538979549,"version":"3.54.5"},"publisher-location":"New York, NY, USA","reference-count":43,"publisher":"ACM","license":[{"start":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T00:00:00Z","timestamp":1781481600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/legalcode"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,6,16]]},"DOI":"10.1145\/3805622.3810807","type":"proceedings-article","created":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T14:42:57Z","timestamp":1781534577000},"page":"787-796","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Beyond Inconsistent Reasoning: Intermediate Process Direct Preference Optimization on Small Multi-modal Large Language Models for Visual Question Answering"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-8127-499X","authenticated-orcid":false,"given":"Yongzhu","family":"Miao","sequence":"first","affiliation":[{"name":"School of Computer Science, National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-7613-1192","authenticated-orcid":false,"given":"Puzhen","family":"Su","sequence":"additional","affiliation":[{"name":"School of Computer Science, National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-7471-0996","authenticated-orcid":false,"given":"Haoran","family":"Yin","sequence":"additional","affiliation":[{"name":"School of Computer Science, National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6508-5119","authenticated-orcid":false,"given":"Shasha","family":"Li","sequence":"additional","affiliation":[{"name":"School of Computer Science, National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8802-3906","authenticated-orcid":false,"given":"Jintao","family":"Tang","sequence":"additional","affiliation":[{"name":"School of Computer Science, National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7780-2330","authenticated-orcid":false,"given":"Ting","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Computer Science, National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,6,15]]},"reference":[{"key":"e_1_3_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_3_1_3_2","unstructured":"Jinze Bai Shuai Bai Shusheng Yang Shijie Wang Sinan Tan Peng Wang Junyang Lin Chang Zhou and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding Localization Text Reading and Beyond. arxiv:https:\/\/arXiv.org\/abs\/2308.12966\u00a0[cs.CV] https:\/\/arxiv.org\/abs\/2308.12966"},{"key":"e_1_3_3_1_4_2","unstructured":"Shuai Bai Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge Sibo Song Kai Dang Peng Wang Shijie Wang Jun Tang et\u00a0al. 2025. Qwen2. 5-vl technical report. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2502.13923 (2025)."},{"key":"e_1_3_3_1_5_2","doi-asserted-by":"crossref","unstructured":"Lin Chen Jinsong Li Xiaoyi Dong Pan Zhang Yuhang Zang Zehui Chen Haodong Duan Jiaqi Wang Yu Qiao Dahua Lin et\u00a0al. 2024. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems 37 (2024) 27056\u201327087.","DOI":"10.52202\/079017-0850"},{"key":"e_1_3_3_1_6_2","doi-asserted-by":"crossref","unstructured":"Qiguang Chen Libo Qin Jin Zhang Zhi Chen Xiao Xu and Wanxiang Che. 2024. M 3 CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2405.16473 (2024).","DOI":"10.18653\/v1\/2024.acl-long.446"},{"key":"e_1_3_3_1_7_2","unstructured":"Tianzhe Chu Yuexiang Zhai Jihan Yang Shengbang Tong Saining Xie Dale Schuurmans Quoc\u00a0V Le Sergey Levine and Yi Ma. 2025. Sft memorizes rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2501.17161 (2025)."},{"key":"e_1_3_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.00847"},{"key":"e_1_3_3_1_9_2","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et\u00a0al. 2024. The llama 3 herd of models. arXiv e-prints (2024) arXiv\u20132407."},{"key":"e_1_3_3_1_10_2","unstructured":"Jinlan Fu Shenzhen Huangfu Hao Fei Xiaoyu Shen Bryan Hooi Xipeng Qiu and See-Kiong Ng. 2025. Chip: Cross-modal hierarchical direct preference optimization for multimodal llms. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2501.16629 (2025)."},{"key":"e_1_3_3_1_11_2","unstructured":"Yuhan Fu Ruobing Xie Xingwu Sun Zhanhui Kang and Xirong Li. 2024. Mitigating hallucination in multimodal large language model via hallucination-targeted direct preference optimization. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2411.10436 (2024)."},{"key":"e_1_3_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.670"},{"key":"e_1_3_3_1_13_2","unstructured":"Wenxuan Huang Bohan Jia Zijie Zhai Shaosheng Cao Zheyu Ye Fei Zhao Zhe Xu Yao Hu and Shaohui Lin. 2025. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2503.06749 (2025)."},{"key":"e_1_3_3_1_14_2","unstructured":"Aaron Hurst Adam Lerer Adam\u00a0P Goucher Adam Perelman Aditya Ramesh Aidan Clark AJ Ostrow Akila Welihinda Alan Hayes Alec Radford et\u00a0al. 2024. Gpt-4o system card. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2410.21276 (2024)."},{"key":"e_1_3_3_1_15_2","doi-asserted-by":"crossref","unstructured":"Byeong\u00a0Su Kim Jieun Kim Deokwoo Lee and Beakcheol Jang. 2025. Visual question answering: A survey of methods datasets evaluation and challenges. Comput. Surveys 57 10 (2025) 1\u201335.","DOI":"10.1145\/3728635"},{"key":"e_1_3_3_1_16_2","unstructured":"Xin Lai Zhuotao Tian Yukang Chen Senqiao Yang Xiangru Peng and Jiaya Jia. 2024. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2406.18629 (2024)."},{"key":"e_1_3_3_1_17_2","unstructured":"Bo Li Yuanhan Zhang Dong Guo Renrui Zhang Feng Li Hao Zhang Kaichen Zhang Peiyuan Zhang Yanwei Li Ziwei Liu et\u00a0al. 2024. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2408.03326 (2024)."},{"key":"e_1_3_3_1_18_2","unstructured":"Feng Li Renrui Zhang Hao Zhang Yuanhan Zhang Bo Li Wei Li Zejun Ma and Chunyuan Li. 2024. Llava-next-interleave: Tackling multi-image video and 3d in large multimodal models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2407.07895 (2024)."},{"key":"e_1_3_3_1_19_2","doi-asserted-by":"crossref","unstructured":"Yujie Lin Ante Wang Moye Chen Jingyao Liu Hao Liu Jinsong Su and Xinyan Xiao. 2025. Investigating inference-time scaling for chain of multi-modal thought: A preliminary study. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2502.11514 (2025).","DOI":"10.18653\/v1\/2025.findings-acl.808"},{"key":"e_1_3_3_1_20_2","unstructured":"Bingshuai Liu Chenyang Lyu Zijun Min Zhanyu Wang Jinsong Su and Longyue Wang. 2023. Retrieval-augmented multi-modal chain-of-thoughts reasoning for large language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2312.01714 (2023)."},{"key":"e_1_3_3_1_21_2","first-page":"216","volume-title":"European conference on computer vision","author":"Liu Yuan","year":"2024","unstructured":"Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et\u00a0al. 2024. Mmbench: Is your multi-modal model an all-around player?. In European conference on computer vision. Springer, 216\u2013233."},{"key":"e_1_3_3_1_22_2","unstructured":"Pan Lu Hritik Bansal Tony Xia Jiacheng Liu Chunyuan Li Hannaneh Hajishirzi Hao Cheng Kai-Wei Chang Michel Galley and Jianfeng Gao. 2023. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2310.02255 (2023)."},{"key":"e_1_3_3_1_23_2","unstructured":"Lee Micheal. 2025. Generalization in LLM Reasoning: A Meta-Learned Approach to Optimal Imitation and Exploration. (2025)."},{"key":"e_1_3_3_1_24_2","unstructured":"Yingzhe Peng Gongrui Zhang Miaosen Zhang Zhiyuan You Jie Liu Qipeng Zhu Kai Yang Xingzhong Xu Xin Geng and Xu Yang. 2025. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2503.07536 (2025)."},{"key":"e_1_3_3_1_25_2","doi-asserted-by":"crossref","unstructured":"Hao Shao Shengju Qian Han Xiao Guanglu Song Zhuofan Zong Letian Wang Yu Liu and Hongsheng Li. 2024. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems 37 (2024) 8612\u20138642.","DOI":"10.52202\/079017-0275"},{"key":"e_1_3_3_1_26_2","unstructured":"Zhiqing Sun Sheng Shen Shengcao Cao Haotian Liu Chunyuan Li Yikang Shen Chuang Gan Liang-Yan Gui Yu-Xiong Wang Yiming Yang et\u00a0al. 2023. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2309.14525 (2023)."},{"key":"e_1_3_3_1_27_2","unstructured":"Gemini Team Petko Georgiev Ving\u00a0Ian Lei Ryan Burnell Libin Bai Anmol Gulati Garrett Tanzer Damien Vincent Zhufeng Pan Shibo Wang et\u00a0al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2403.05530 (2024)."},{"key":"e_1_3_3_1_28_2","doi-asserted-by":"crossref","unstructured":"Omkar Thawakar Dinura Dissanayake Ketan More Ritesh Thawkar Ahmed Heakl Noor Ahsan Yuhao Li Mohammed Zumri Jean Lahoud Rao\u00a0Muhammad Anwer et\u00a0al. 2025. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2501.06186 (2025).","DOI":"10.18653\/v1\/2025.findings-acl.1247"},{"key":"e_1_3_3_1_29_2","unstructured":"Dingzirui Wang Xuanliang Zhang Keyan Xu Qingfu Zhu Wanxiang Che and Yang Deng. 2025. Bounds of Chain-of-Thought Robustness: Reasoning Steps Embed Norms and Beyond. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2509.21284 (2025)."},{"key":"e_1_3_3_1_30_2","doi-asserted-by":"crossref","unstructured":"Fei Wang Wenxuan Zhou James\u00a0Y Huang Nan Xu Sheng Zhang Hoifung Poon and Muhao Chen. 2024. mdpo: Conditional preference optimization for multimodal large language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2406.11839 (2024).","DOI":"10.18653\/v1\/2024.emnlp-main.460"},{"key":"e_1_3_3_1_31_2","unstructured":"Yaoting Wang Shengqiong Wu Yuecheng Zhang Shuicheng Yan Ziwei Liu Jiebo Luo and Hao Fei. 2025. Multimodal chain-of-thought reasoning: A comprehensive survey. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2503.12605 (2025)."},{"key":"e_1_3_3_1_32_2","unstructured":"Hongyan Xie Yitong Yao Yikun Ban Zixuan Huang Deqing Wang Zhenhe Wu Haoxiang Su Chao Wang and Shuangyong Song. 2025. Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2509.05602 (2025)."},{"key":"e_1_3_3_1_33_2","unstructured":"Ning Xie Farley Lai Derek Doran and Asim Kadav. 2019. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1901.06706 (2019)."},{"key":"e_1_3_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51701.2025.00202"},{"key":"e_1_3_3_1_35_2","unstructured":"Zhengyuan Yang Linjie Li Kevin Lin Jianfeng Wang Chung-Ching Lin Zicheng Liu and Lijuan Wang. 2023. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2309.17421 (2023)."},{"key":"e_1_3_3_1_36_2","unstructured":"Huanjin Yao Jiaxing Huang Wenhao Wu Jingyi Zhang Yibo Wang Shunyu Liu Yingjie Wang Yuxin Song Haocheng Feng Li Shen et\u00a0al. 2024. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2412.18319 (2024)."},{"key":"e_1_3_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01310"},{"key":"e_1_3_3_1_38_2","unstructured":"Weihao Yu Zhengyuan Yang Linjie Li Jianfeng Wang Kevin Lin Zicheng Liu Xinchao Wang and Lijuan Wang. 2023. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2308.02490 (2023)."},{"key":"e_1_3_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00913"},{"key":"e_1_3_3_1_40_2","doi-asserted-by":"crossref","unstructured":"Jingyi Zhang Jiaxing Huang Huanjin Yao Shunyu Liu Xikun Zhang Shijian Lu and Dacheng Tao. 2025. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2503.12937 (2025).","DOI":"10.1109\/ICCV51701.2025.00181"},{"key":"e_1_3_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.542"},{"key":"e_1_3_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2025.acl-long.82"},{"key":"e_1_3_3_1_43_2","unstructured":"Zhiyuan Zhao Bin Wang Linke Ouyang Xiaoyi Dong Jiaqi Wang and Conghui He. 2023. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2311.16839 (2023)."},{"key":"e_1_3_3_1_44_2","unstructured":"Guanghao Zhou Panjia Qiu Cen Chen Jie Wang Zheming Yang Jian Xu and Minghui Qiu. 2025. Reinforced mllm: A survey on rl-based reasoning in multimodal large language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2504.21277 (2025)."}],"event":{"name":"ICMR '26: International Conference on Multimedia Retrieval","location":"Amsterdam The Netherlands","acronym":"ICMR '26","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2026 International Conference on Multimedia Retrieval"],"original-title":[],"deposited":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:30:21Z","timestamp":1781537421000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3805622.3810807"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,6,15]]},"references-count":43,"alternative-id":["10.1145\/3805622.3810807","10.1145\/3805622"],"URL":"https:\/\/doi.org\/10.1145\/3805622.3810807","relation":{},"subject":[],"published":{"date-parts":[[2026,6,15]]},"assertion":[{"value":"2026-06-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}