{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:57:50Z","timestamp":1781539070134,"version":"3.54.5"},"publisher-location":"New York, NY, USA","reference-count":27,"publisher":"ACM","license":[{"start":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T00:00:00Z","timestamp":1781481600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/legalcode"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,6,16]]},"DOI":"10.1145\/3805622.3810610","type":"proceedings-article","created":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T14:42:57Z","timestamp":1781534577000},"page":"206-214","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Empowering Vision Language Models for Training-Free Visual Search via Context-Aware Scanpath Simulation"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-8326-4980","authenticated-orcid":false,"given":"Haoran","family":"Wang","sequence":"first","affiliation":[{"name":"Renmin University of China, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-6310-710X","authenticated-orcid":false,"given":"Dan","family":"Wan","sequence":"additional","affiliation":[{"name":"Renmin University of China, Beijing, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,6,15]]},"reference":[{"key":"e_1_3_3_1_2_2","unstructured":"Shuai Bai Yuxuan Cai Ruizhe Chen Keqin Chen Xionghui Chen Zesen Cheng Lianghao Deng Wei Ding Chang Gao Chunjiang Ge Wenbin Ge Zhifang Guo Qidong Huang Jie Huang Fei Huang Binyuan Hui Shutong Jiang Zhaohai Li Mingsheng Li Mei Li Kaixin Li Zicheng Lin Junyang Lin Xuejing Liu Jiawei Liu 
Chenglong Liu Yang Liu Dayiheng Liu Shixuan Liu Dunjie Lu Ruilin Luo Chenxu Lv Rui Men Lingchen Meng Xuancheng Ren Xingzhang Ren Sibo Song Yuchong Sun Jun Tang Jianhong Tu Jianqiang Wan Peng Wang Pengfei Wang Qiuyue Wang Yuxuan Wang Tianbao Xie Yiheng Xu Haiyang Xu Jin Xu Zhibo Yang Mingkun Yang Jianxin Yang An Yang Bowen Yu Fei Zhang Hang Zhang Xi Zhang Bo Zheng Humen Zhong Jingren Zhou Fan Zhou Jing Zhou Yuanzhi Zhu and Ke Zhu. 2025. Qwen3-VL Technical Report. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2511.21631 (2025)."},{"key":"e_1_3_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3379155.3391314"},{"key":"e_1_3_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/CBMI62980.2024.10859215"},{"key":"e_1_3_3_1_5_2","doi-asserted-by":"crossref","unstructured":"Davide Caffagni Federico Cocchi Luca Barsellotti Nicholas Moratelli Sara Sarto Lorenzo Baraldi Marcella Cornia and Rita Cucchiara. 2024. The revolution of multimodal large language models: a survey. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2402.12451 (2024).","DOI":"10.18653\/v1\/2024.findings-acl.807"},{"key":"e_1_3_3_1_6_2","doi-asserted-by":"crossref","unstructured":"Declan Campbell Sunayana Rane Tyler Giallanza Camillo\u00a0Nicol\u00f2 De\u00a0Sabbata Kia Ghods Amogh Joshi Alexander Ku Steven Frankland Tom Griffiths Jonathan\u00a0D Cohen et\u00a0al. 2024. Understanding the limits of vision language models through the lens of the binding problem. Advances in Neural Information Processing Systems 37 (2024) 113436\u2013113460.","DOI":"10.52202\/079017-3604"},{"key":"e_1_3_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01073"},{"key":"e_1_3_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02402"},{"key":"e_1_3_3_1_9_2","doi-asserted-by":"crossref","unstructured":"Yupei Chen Zhibo Yang Seoyoung Ahn Dimitris Samaras Minh Hoai and Gregory Zelinsky. 2021. Coco-search18 fixation dataset for predicting goal-directed attention control. 
Scientific reports 11 1 (2021) 8776.","DOI":"10.1038\/s41598-021-87715-9"},{"key":"e_1_3_3_1_10_2","unstructured":"Bo Li Yuanhan Zhang Dong Guo Renrui Zhang Feng Li Hao Zhang Kaichen Zhang Peiyuan Zhang Yanwei Li Ziwei Liu et\u00a0al. 2024. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2408.03326 (2024)."},{"key":"e_1_3_3_1_11_2","doi-asserted-by":"crossref","unstructured":"Peizhao Li Junfeng He Gang Li Rachit Bhargava Shaolei Shen Nachiappan Valliappan Youwei Liang Hongxiang Gu Venky Ramachandran Yang Li et\u00a0al. 2024. UniAR: A Unified model for predicting human Attention and Responses on visual content. Advances in Neural Information Processing Systems 37 (2024) 106346\u2013106369.","DOI":"10.52202\/079017-3374"},{"key":"e_1_3_3_1_12_2","unstructured":"Jiaying Lin Shuquan Ye and Rynson\u00a0WH Lau. 2024. Do Multimodal Large Language Models See Like Humans? arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2412.09603 (2024)."},{"key":"e_1_3_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51701.2025.00263"},{"key":"e_1_3_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00145"},{"key":"e_1_3_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00067"},{"key":"e_1_3_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3746027.3758208"},{"key":"e_1_3_3_1_17_2","doi-asserted-by":"crossref","unstructured":"Hosnieh Sattar Mario Fritz and Andreas Bulling. 2020. Deep gaze pooling: Inferring and visually decoding search intents from human gaze fixations. Neurocomputing 387 (2020) 369\u2013382.","DOI":"10.1016\/j.neucom.2020.01.028"},{"key":"e_1_3_3_1_18_2","doi-asserted-by":"crossref","unstructured":"Zhiqi Shao Haoning Xi David\u00a0A Hensher Ze Wang Xiaolin Gong and Junbin Gao. 2025. A spatial-temporal dynamic attention-based Mamba model for multi-type passenger demand prediction in multimodal public transit systems. 
Transportation Research Part E: Logistics and Transportation Review 202 (2025) 104282.","DOI":"10.1016\/j.tre.2025.104282"},{"key":"e_1_3_3_1_19_2","doi-asserted-by":"crossref","unstructured":"Rakshith\u00a0Sharma Srinivasa Jaejin Cho Chouchang Yang Yashas\u00a0Malur Saidutta Ching-Hua Lee Yilin Shen and Hongxia Jin. 2023. CWCL: Cross-modal transfer with continuously weighted contrastive loss. Advances in Neural Information Processing Systems 36 (2023) 78496\u201378513.","DOI":"10.52202\/075280-3432"},{"key":"e_1_3_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV57701.2024.00219"},{"key":"e_1_3_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/BigData59044.2023.10386743"},{"key":"e_1_3_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01243"},{"key":"e_1_3_3_1_23_2","doi-asserted-by":"crossref","unstructured":"Haoning Xi Zhiqi Shao David\u00a0A Hensher John\u00a0D Nelson Huaming Chen and Kasun Wijayaratna. 2025. A multi-task Transformer with mixture-of-experts for personalized periodic predictions of individual travel behavior in multimodal public transport. Transportation Research Part C: Emerging Technologies 179 (2025) 105287.","DOI":"10.1016\/j.trc.2025.105287"},{"key":"e_1_3_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.01260"},{"key":"e_1_3_3_1_25_2","unstructured":"Hao Yan Xingchen Liu Hao Wang Zhenbiao Cao Handong Zheng Liang Yin Xinxing Su Zihao Chen Jihao Wu Minghui Liao et\u00a0al. 2025. Visuriddles: Fine-grained perception is a primary bottleneck for multimodal large language models in abstract visual reasoning. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2506.02537 (2025)."},{"key":"e_1_3_3_1_26_2","volume-title":"Computer Vision and Pattern Recognition","author":"Yang Zhibo","year":"2023","unstructured":"Zhibo Yang, Sounak Mondal, Seoyoung Ahn, Gregory Zelinsky, Minh Hoai, and Dimitris Samaras. 2023. Predicting human attention using computational attention. 
In Computer Vision and Pattern Recognition."},{"key":"e_1_3_3_1_27_2","volume-title":"Eye movements and vision","author":"Yarbus Alfred\u00a0L","year":"2013","unstructured":"Alfred\u00a0L Yarbus. 2013. Eye movements and vision. Springer."},{"key":"e_1_3_3_1_28_2","unstructured":"Jiarui Zhang Jinyi Hu Mahyar Khayatkhoei Filip Ilievski and Maosong Sun. 2024. Exploring perceptual limitation of multimodal large language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2402.07384 (2024)."}],"event":{"name":"ICMR '26: International Conference on Multimedia Retrieval","location":"Amsterdam The Netherlands","acronym":"ICMR '26","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2026 International Conference on Multimedia Retrieval"],"original-title":[],"deposited":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:45:39Z","timestamp":1781538339000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3805622.3810610"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,6,15]]},"references-count":27,"alternative-id":["10.1145\/3805622.3810610","10.1145\/3805622"],"URL":"https:\/\/doi.org\/10.1145\/3805622.3810610","relation":{},"subject":[],"published":{"date-parts":[[2026,6,15]]},"assertion":[{"value":"2026-06-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}