{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T04:10:28Z","timestamp":1765339828647,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":51,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,27]]},"DOI":"10.1145\/3746027.3755393","type":"proceedings-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T07:38:54Z","timestamp":1761377934000},"page":"4339-4348","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["DR-VQA: Decompose-then-Reconstruct for Visual Question Answering in BLV Assistance"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-8294-0007","authenticated-orcid":false,"given":"Bocheng","family":"Pan","sequence":"first","affiliation":[{"name":"Institute of Microelectronics, Chinese Academy of Sciences, Beijing, China and University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-6545-1999","authenticated-orcid":false,"given":"Hailong","family":"Shi","sequence":"additional","affiliation":[{"name":"Institute of Microelectronics, Chinese Academy of Sciences, Beijing, China and University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4660-8092","authenticated-orcid":false,"given":"Xingyu","family":"Gao","sequence":"additional","affiliation":[{"name":"Institute of Microelectronics, Chinese Academy of Sciences, Beijing, China and University of Chinese Academy of Sciences, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Jean-Baptiste Alayrac Jeff Donahue Pauline Luc Antoine Miech Iain Barr Yana Hasson Karel Lenc Arthur Mensch Katherine Millican Malcolm Reynolds et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems Vol. 35 (2022) 23716-23736."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_1_3_1","volume-title":"Smart Navigation System For Visually Impaired. In International Conference on Emerging Engineering Trends andScience (ICEETS-2016)","author":"Anita X","year":"2016","unstructured":"X Anita, R Abirami, and M Epsi Vennila. 2016. Smart Navigation System For Visually Impaired. In International Conference on Emerging Engineering Trends andScience (ICEETS-2016)."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_2_1_5_1","unstructured":"Jinze Bai Shuai Bai Shusheng Yang Shijie Wang Sinan Tan Peng Wang Junyang Lin Chang Zhou and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding Localization Text Reading and Beyond. arXiv:2308.12966 [cs.CV] https:\/\/arxiv.org\/abs\/2308.12966"},{"key":"e_1_3_2_1_6_1","volume-title":"Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930","author":"Bai Zechen","year":"2024","unstructured":"Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. 2024. Hallucination of multimodal large language models: A survey. 
arXiv preprint arXiv:2404.18930 (2024)."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1866029.1866080"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2470654.2481291"},{"key":"e_1_3_2_1_9_1","volume-title":"Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi.","author":"Dai Wenliang","year":"2023","unstructured":"Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500 [cs.CV] https:\/\/arxiv.org\/abs\/2305.06500"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00179"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i16.29771"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58520-4_25"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01219-9_47"},{"key":"e_1_3_2_1_14_1","volume-title":"Prompting Large Language Models with Rationale Heuristics for Knowledge-based Visual Question Answering. arXiv preprint arXiv:2412.16936","author":"Hu Zhongjian","year":"2024","unstructured":"Zhongjian Hu, Peng Yang, Bing Li, and Fengyuan Liu. 2024. Prompting Large Language Models with Rationale Heuristics for Knowledge-based Visual Question Answering. arXiv preprint arXiv:2412.16936 (2024)."},{"key":"e_1_3_2_1_15_1","volume-title":"Visual hallucinations of multi-modal large language models. arXiv preprint arXiv:2402.14683","author":"Huang Wen","year":"2024","unstructured":"Wen Huang, Hongbin Liu, Minxin Guo, and Neil Zhenqiang Gong. 2024. Visual hallucinations of multi-modal large language models. arXiv preprint arXiv:2402.14683 (2024)."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3571730"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2017.06.005"},{"key":"e_1_3_2_1_18_1","volume-title":"Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa.","author":"Kojima Takeshi","year":"2022","unstructured":"Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, Vol. 35 (2022), 22199-22213."},{"key":"e_1_3_2_1_19_1","first-page":"95","article-title":"A Deep Learning Based Model to Assist Blind People in Their Navigation","volume":"21","author":"Kumar Nitin","year":"2022","unstructured":"Nitin Kumar and Anuj Jain. 2022. A Deep Learning Based Model to Assist Blind People in Their Navigation. J. Inf. Technol. Educ. Innov. Pract., Vol. 21 (2022), 95-114.","journal-title":"J. Inf. Technol. Educ. Innov. Pract."},{"key":"e_1_3_2_1_20_1","volume-title":"International conference on machine learning. PMLR, 12888-12900","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning. PMLR, 12888-12900."},{"key":"e_1_3_2_1_21_1","volume-title":"Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision-ECCV 2020: 16th European Conference","author":"Li Xiujun","year":"2020","unstructured":"Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al., 2020. 
Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX 16. Springer, 121-137."},{"key":"e_1_3_2_1_22_1","volume-title":"International Conference on Machine Learning. PMLR, 6565-6576","author":"Liang Paul Pu","year":"2021","unstructured":"Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2021. Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning. PMLR, 6565-6576."},{"key":"e_1_3_2_1_23_1","unstructured":"Haotian Liu Chunyuan Li Yuheng Li and Yong Jae Lee. 2024a. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744 [cs.CV] https:\/\/arxiv.org\/abs\/2310.03744"},{"key":"e_1_3_2_1_24_1","volume-title":"A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253","author":"Liu Hanchao","year":"2024","unstructured":"Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. 2024b. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253 (2024)."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-019-01247-4"},{"key":"e_1_3_2_1_26_1","volume-title":"Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, Vol. 32 (2019)."},{"key":"e_1_3_2_1_27_1","volume-title":"Advances in Neural Information Processing Systems","volume":"36","author":"Luo Gen","year":"2024","unstructured":"Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. 2024. Cheap and quick: Efficient vision-language instruction tuning for large language models. Advances in Neural Information Processing Systems, Vol. 36 (2024)."},{"key":"e_1_3_2_1_28_1","first-page":"4171","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"38","author":"Oscar Ma","year":"2024","unstructured":"Oscar Ma nas, Benno Krojer, and Aishwarya Agrawal. 2024. Improving automatic vqa evaluation using large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 4171-4179."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2063176.2063200"},{"key":"e_1_3_2_1_30_1","first-page":"33","article-title":"e-vision: an AI-powered system for promoting the autonomy of visually impaired","volume":"3","author":"Migkotzidis Panagiotis","year":"2020","unstructured":"Panagiotis Migkotzidis, Fotis Kalaganis, Kostas Georgiadis, Elisavet Chatzilari, George Pehlivanides, Spyros Tsafaras, Kostas Monastiridis, George Martinidis, Spiros Nikolopoulos, and Ioannis Kompatsiaris. 2020. e-vision: an AI-powered system for promoting the autonomy of visually impaired. European Journal of Creative Practices in Cities and Landscapes, Vol. 
3, 2 (2020), 33-53.","journal-title":"European Journal of Creative Practices in Cities and Landscapes"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01367"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3173633"},{"key":"e_1_3_2_1_33_1","first-page":"7357","article-title":"Soat: A scene-and object-aware transformer for vision-and-language navigation","volume":"34","author":"Moudgil Abhinav","year":"2021","unstructured":"Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, Stefan Lee, and Dhruv Batra. 2021. Soat: A scene-and object-aware transformer for vision-and-language navigation. Advances in Neural Information Processing Systems, Vol. 34 (2021), 7357-7367.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_34_1","volume-title":"Frozen transformers in language models are effective visual encoder layers. arXiv preprint arXiv:2310.12973","author":"Pang Ziqi","year":"2023","unstructured":"Ziqi Pang, Ziyang Xie, Yunze Man, and Yu-Xiong Wang. 2023. Frozen transformers in language models are effective visual encoder layers. arXiv preprint arXiv:2310.12973 (2023)."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445040"},{"key":"e_1_3_2_1_36_1","volume-title":"International conference on machine learning. PMLR, 8748-8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748-8763."},{"key":"e_1_3_2_1_37_1","volume-title":"Visual hallucination: Definition, quantification, and prescriptive remediations. arXiv preprint arXiv:2403.17306","author":"Rani Anku","year":"2024","unstructured":"Anku Rani, Vipula Rawte, Harshad Sharma, Neeraj Anand, Krishnav Rajbangshi, Amit Sheth, and Amitava Das. 2024. Visual hallucination: Definition, quantification, and prescriptive remediations. arXiv preprint arXiv:2403.17306 (2024)."},{"key":"e_1_3_2_1_38_1","volume-title":"A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922","author":"Rawte Vipula","year":"2023","unstructured":"Vipula Rawte, Amit Sheth, and Amitava Das. 2023. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922 (2023)."},{"key":"e_1_3_2_1_39_1","volume-title":"Kaylee Burns, Trevor Darrell, and Kate Saenko.","author":"Rohrbach Anna","year":"2018","unstructured":"Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156 (2018)."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01438"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01519"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2818048.2820013"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i17.29884"},{"key":"e_1_3_2_1_44_1","first-page":"8483","article-title":"Language models with image descriptors are strong few-shot video-language learners","volume":"35","author":"Wang Zhenhailong","year":"2022","unstructured":"Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al., 2022. 
Language models with image descriptors are strong few-shot video-language learners. Advances in Neural Information Processing Systems, Vol. 35 (2022), 8483-8497.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_45_1","volume-title":"Zihang Dai, Yulia Tsvetkov, and Yuan Cao.","author":"Wang Zirui","year":"2021","unstructured":"Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2021. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904 (2021)."},{"key":"e_1_3_2_1_46_1","volume-title":"Denny Zhou, et al.","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al., 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, Vol. 35 (2022), 24824-24837."},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00563"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i3.20215"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-29364-1_2"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2024.3369699"},{"key":"e_1_3_2_1_51_1","first-page":"5168","article-title":"Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models","volume":"36","author":"Zheng Ge","year":"2023","unstructured":"Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. 2023. Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. Advances in Neural Information Processing Systems, Vol. 36 (2023), 5168-5191.","journal-title":"Advances in Neural Information Processing Systems"}],"event":{"name":"MM '25: The 33rd ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Dublin Ireland","acronym":"MM '25"},"container-title":["Proceedings of the 33rd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746027.3755393","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T04:08:54Z","timestamp":1765339734000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746027.3755393"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":51,"alternative-id":["10.1145\/3746027.3755393","10.1145\/3746027"],"URL":"https:\/\/doi.org\/10.1145\/3746027.3755393","relation":{},"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"2025-10-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
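The record above is a standard Crossref "work" message (envelope fields "status", "message-type", "message"; the bibliographic fields live under "message"). As a minimal sketch of how such a record could be retrieved and read, the snippet below queries the public Crossref REST API works endpoint with Python's standard library only; the helper name fetch_crossref_work and the printed field selection are illustrative choices, not part of the record itself.

```python
import json
import urllib.request

def fetch_crossref_work(doi: str) -> dict:
    # Hypothetical helper: GET https://api.crossref.org/works/{doi} and
    # return the "message" object, i.e. the work metadata shown above.
    url = f"https://api.crossref.org/works/{doi}"
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    return payload["message"]

if __name__ == "__main__":
    work = fetch_crossref_work("10.1145/3746027.3755393")
    # Read a few of the fields present in the record above.
    print(work["title"][0])                      # paper title
    print(work["DOI"], work["type"])             # DOI and work type
    authors = [f'{a.get("given", "")} {a.get("family", "")}'.strip()
               for a in work.get("author", [])]
    print("; ".join(authors))                    # author list
    print("references:", work.get("references-count"))
```

Note that the "reference" array in the record mirrors what the publisher deposited, so individual entries may carry either a DOI ("doi-asserted-by") or only an "unstructured" citation string, and a parser should handle both cases.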