{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:57:41Z","timestamp":1781539061462,"version":"3.54.5"},"publisher-location":"New York, NY, USA","reference-count":48,"publisher":"ACM","license":[{"start":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T00:00:00Z","timestamp":1781481600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/legalcode"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,6,16]]},"DOI":"10.1145\/3805622.3810864","type":"proceedings-article","created":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T14:42:57Z","timestamp":1781534577000},"page":"69-78","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["MURE: Hierarchical Multi-Resolution Encoding via Vision-Language Models for Visual Document Retrieval"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6776-2040","authenticated-orcid":false,"given":"Fengbin","family":"Zhu","sequence":"first","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-6691-7908","authenticated-orcid":false,"given":"Zijing","family":"Cai","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-2676-6177","authenticated-orcid":false,"given":"Yuzhe","family":"Wang","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2838-1987","authenticated-orcid":false,"given":"Pengyang","family":"Shao","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5199-1428","authenticated-orcid":false,"given":"Wenjie","family":"Wang","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5828-9842","authenticated-orcid":false,"given":"Fuli","family":"Feng","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5461-3986","authenticated-orcid":false,"given":"Richang","family":"Hong","sequence":"additional","affiliation":[{"name":"Hefei University of Technology, Hefei, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6097-7807","authenticated-orcid":false,"given":"Tat-Seng","family":"Chua","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,6,15]]},"reference":[{"key":"e_1_3_3_1_2_2","unstructured":"Lucas Beyer Andreas Steiner Andr\u00e9\u00a0Susano Pinto Alexander Kolesnikov Xiao Wang Daniel Salz Maxim Neumann Ibrahim Alabdulmohsin Michael Tschannen Emanuele Bugliarello et\u00a0al. 2024. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2407.07726 (2024)."},{"key":"e_1_3_3_1_3_2","unstructured":"Florian Bordes Richard\u00a0Yuanzhe Pang Anurag Ajay Alexander\u00a0C. 
Li Adrien Bardes Suzanne Petryk Oscar Ma\u00f1as Zhiqiu Lin Anas Mahmoud Bargav Jayaraman Mark Ibrahim Melissa Hall Yunyang Xiong Jonathan Lebensold Candace Ross Srihari Jayakumar Chuan Guo Diane Bouchacourt Haider Al-Tahan Karthik Padthe Vasu Sharma Hu Xu Xiaoqing\u00a0Ellen Tan Megan Richards Samuel Lavoie Pietro Astolfi Reyhane\u00a0Askari Hemmat Jun Chen Kushal Tirumala Rim Assouel Mazda Moayeri Arjang Talattof Kamalika Chaudhuri Zechun Liu Xilun Chen Quentin Garrido Karen Ullrich Aishwarya Agrawal Kate Saenko Asli Celikyilmaz and Vikas Chandra. 2024. An Introduction to Vision-Language Modeling. arxiv:https:\/\/arXiv.org\/abs\/2405.17247\u00a0[cs.LG] https:\/\/arxiv.org\/abs\/2405.17247"},{"key":"e_1_3_3_1_4_2","volume-title":"The Thirteenth International Conference on Learning Representations","author":"Cai Mu","unstructured":"Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong\u00a0Jae Lee. [n. d.]. Matryoshka Multimodal Models. In The Thirteenth International Conference on Learning Representations."},{"key":"e_1_3_3_1_5_2","unstructured":"Haonan Chen Hong Liu Yuping Luo Liang Wang Nan Yang Furu Wei and Zhicheng Dou. 2025. MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2506.23115 (2025)."},{"key":"e_1_3_3_1_6_2","doi-asserted-by":"crossref","unstructured":"Haonan Chen Liang Wang Nan Yang Yutao Zhu Ziliang Zhao Furu Wei and Zhicheng Dou. 2025. mme5: Improving multimodal multilingual embeddings via high-quality synthetic data. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2502.08468 (2025).","DOI":"10.18653\/v1\/2025.findings-acl.433"},{"key":"e_1_3_3_1_7_2","unstructured":"Benjamin Clavi\u00e9 Antoine Chaffin and Griffin Adams. 2024. Reducing the footprint of multi-vector retrieval with minimal performance impact via token pooling. 
arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2409.14683 (2024)."},{"key":"e_1_3_3_1_8_2","unstructured":"Wanqing Cui Wei Huang Yazhi Guo Yibo Hu Meiguang Jin Junfeng Ma and Keping Bi. 2025. Attention Grounded Enhancement for Visual Document Retrieval. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2511.13415 (2025)."},{"key":"e_1_3_3_1_9_2","doi-asserted-by":"crossref","unstructured":"Mostafa Dehghani Basil Mustafa Josip Djolonga Jonathan Heek Matthias Minderer Mathilde Caron Andreas Steiner Joan Puigcerver Robert Geirhos Ibrahim\u00a0M Alabdulmohsin et\u00a0al. 2023. Patch n\u2019pack: Navit a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems 36 (2023) 2252\u20132274.","DOI":"10.52202\/075280-0106"},{"key":"e_1_3_3_1_10_2","volume-title":"The Thirteenth International Conference on Learning Representations","author":"Faysse Manuel","unstructured":"Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, CELINE HUDELOT, and Pierre Colombo. [n. d.]. ColPali: Efficient Document Retrieval with Vision Language Models. In The Thirteenth International Conference on Learning Representations."},{"key":"e_1_3_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2025.mrl-main.36"},{"key":"e_1_3_3_1_12_2","unstructured":"Edward\u00a0J Hu Yelong Shen Phillip Wallis Zeyuan Allen-Zhu Yuanzhi Li Shean Wang Lu Wang Weizhu Chen et\u00a0al. 2022. Lora: Low-rank adaptation of large language models. ICLR 1 2 (2022) 3."},{"key":"e_1_3_3_1_13_2","doi-asserted-by":"crossref","unstructured":"Wenbo Hu Zi-Yi Dou Liunian Li Amita Kamath Nanyun Peng and Kai-Wei Chang. 2024. Matryoshka query transformer for large vision-language models. Advances in Neural Information Processing Systems 37 (2024) 50168\u201350188.","DOI":"10.52202\/079017-1588"},{"key":"e_1_3_3_1_14_2","unstructured":"Weijian Jian Yajun Zhang Dawei Liang Chunyu Xie Yixiao He Dawei Leng and Yuhui Yin. 2025. 
Rzenembed: Towards comprehensive multimodal retrieval. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2510.27350 (2025)."},{"key":"e_1_3_3_1_15_2","unstructured":"Ting Jiang Minghui Song Zihan Zhang Haizhen Huang Weiwei Deng Feng Sun Qi Zhang Deqing Wang and Fuzhen Zhuang. 2024. E5-v: Universal embeddings with multimodal large language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2407.12580 (2024)."},{"key":"e_1_3_3_1_16_2","volume-title":"ICLR","author":"Jiang Ziyan","year":"2025","unstructured":"Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. 2025. VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks. In ICLR."},{"key":"e_1_3_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401075"},{"key":"e_1_3_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19815-1_29"},{"key":"e_1_3_3_1_19_2","doi-asserted-by":"crossref","unstructured":"Aditya Kusupati Gantavya Bhatt Aniket Rege Matthew Wallingford Aditya Sinha Vivek Ramanujan William Howard-Snyder Kaifeng Chen Sham Kakade Prateek Jain et\u00a0al. 2022. Matryoshka representation learning. Advances in Neural Information Processing Systems 35 (2022) 30233\u201330249.","DOI":"10.52202\/068431-2192"},{"key":"e_1_3_3_1_20_2","first-page":"18893","volume-title":"International Conference on Machine Learning","author":"Lee Kenton","year":"2023","unstructured":"Kenton Lee, Mandar Joshi, Iulia\u00a0Raluca Turc, Hexiang Hu, Fangyu Liu, Julian\u00a0Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. 2023. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning. PMLR, 18893\u201318912."},{"key":"e_1_3_3_1_21_2","first-page":"19730","volume-title":"International conference on machine learning","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. 
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730\u201319742."},{"key":"e_1_3_3_1_22_2","unstructured":"Wentong Li Yuqian Yuan Jian Liu Dongqi Tang Song Wang Jie Qin Jianke Zhu and Lei Zhang. 2025. Tokenpacker: Efficient visual projector for multimodal llm. International Journal of Computer Vision (2025) 1\u201319."},{"key":"e_1_3_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02527"},{"key":"e_1_3_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02484"},{"key":"e_1_3_3_1_25_2","unstructured":"Haotian Liu Chunyuan Li Yuheng Li Bo Li Yuanhan Zhang Sheng Shen and Yong\u00a0Jae Lee. 2024. Llavanext: Improved reasoning ocr and world knowledge."},{"key":"e_1_3_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.emnlp-main.373"},{"key":"e_1_3_3_1_27_2","unstructured":"Yubo Ma Jinsong Li Yuhang Zang Xiaobao Wu Xiaoyi Dong Pan Zhang Yuhang Cao Haodong Duan Jiaqi Wang Yixin Cao et\u00a0al. 2025. Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2506.04997 (2025)."},{"key":"e_1_3_3_1_28_2","unstructured":"Quentin Mac\u00e9 Ant\u00f3nio Loison and Manuel Faysse. 2025. ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval. arxiv:https:\/\/arXiv.org\/abs\/2505.17166\u00a0[cs.IR] https:\/\/arxiv.org\/abs\/2505.17166"},{"key":"e_1_3_3_1_29_2","first-page":"2071","volume-title":"Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track","author":"Masry Ahmed","year":"2025","unstructured":"Ahmed Masry, Megh Thakkar, Patrice Bechard, Sathwik\u00a0Tejaswi Madhusudhan, Rabiul Awal, Shambhavi Mishra, Akshay\u00a0Kalkunte Suresh, Srivatsava Daruru, Enamul Hoque, Spandana Gella, et\u00a0al. 2025. 
ColMate: Contrastive late interaction and masked text for multimodal document retrieval. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2071\u20132080."},{"key":"e_1_3_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00225"},{"key":"e_1_3_3_1_31_2","unstructured":"Rui Meng Ziyan Jiang Ye Liu Mingyi Su Xinyi Yang Yuepeng Fu Can Qin Zeyuan Chen Ran Xu Caiming Xiong Yingbo Zhou Wenhu Chen and Semih Yavuz. 2025. VLM2Vec-V2: Advancing Multimodal Embedding for Videos Images and Visual Documents. arxiv:https:\/\/arXiv.org\/abs\/2507.04590\u00a0[cs.CV] https:\/\/arxiv.org\/abs\/2507.04590"},{"key":"e_1_3_3_1_32_2","unstructured":"Junbo Niu Yuanhong Zheng Ziyang Miao Hejun Dong Chunjiang Ge Hao Liang Ma Lu Bohan Zeng Qiahao Zheng Conghui He et\u00a0al. 2025. Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2506.12776 (2025)."},{"key":"e_1_3_3_1_33_2","volume-title":"The Thirty-eighth Annual Conference on Neural Information Processing Systems","author":"Tong Shengbang","year":"2024","unstructured":"Shengbang Tong, Ellis L\u00a0Brown II, Penghao Wu, Sanghyun Woo, ADITHYA\u00a0JAIRAM IYER, Sai\u00a0Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, Xichen Pan, Rob Fergus, Yann LeCun, and Saining Xie. 2024. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=Vi8AepAXGy"},{"key":"e_1_3_3_1_34_2","unstructured":"Michael Tschannen Alexey Gritsenko Xiao Wang Muhammad\u00a0Ferjad Naeem Ibrahim Alabdulmohsin Nikhil Parthasarathy Talfan Evans Lucas Beyer Ye Xia Basil Mustafa et\u00a0al. 2025. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding. 
Localization and Dense Features 6 (2025)."},{"key":"e_1_3_3_1_35_2","unstructured":"Peng Wang Shuai Bai Sinan Tan Shijie Wang Zhihao Fan Jinze Bai Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge et\u00a0al. 2024. Qwen2-vl: Enhancing vision-language model\u2019s perception of the world at any resolution. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2409.12191 (2024)."},{"key":"e_1_3_3_1_36_2","unstructured":"Peng Wang Shijie Wang Junyang Lin Shuai Bai Xiaohuan Zhou Jingren Zhou Xinggang Wang and Chang Zhou. 2023. One-peace: Exploring one general representation model toward unlimited modalities. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2305.11172 (2023)."},{"key":"e_1_3_3_1_37_2","unstructured":"Weiyun Wang Zhangwei Gao Lixin Gu Hengjun Pu Long Cui Xingguang Wei Zhaoyang Liu Linglin Jing Shenglong Ye Jie Shao et\u00a0al. 2025. Internvl3. 5: Advancing open-source multimodal models in versatility reasoning and efficiency. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2508.18265 (2025)."},{"key":"e_1_3_3_1_38_2","unstructured":"Haoran Wei Yaofeng Sun and Yukun Li. 2026. DeepSeek-OCR 2: Visual Causal Flow. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2601.20552 (2026)."},{"key":"e_1_3_3_1_39_2","unstructured":"Zilin Xiao Qi Ma Mengting Gu Chun-cheng\u00a0Jason Chen Xintao Chen Vicente Ordonez and Vijai Mohan. 2025. Metaembed: Scaling multimodal retrieval at test-time with flexible late interaction. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2509.18095 (2025)."},{"key":"e_1_3_3_1_40_2","unstructured":"Mengyao Xu Wenfei Zhou Yauhen Babakhin Gabriel Moreira Ronay Ak Radek Osmulski Bo Liu Even Oldridge and Benedikt Schifferer. 2025. Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text Image Audio and Video. 
arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2510.03458 (2025)."},{"key":"e_1_3_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403172"},{"key":"e_1_3_3_1_42_2","doi-asserted-by":"crossref","unstructured":"Linli Yao Long Xing Yang Shi Sida Li Yuanxin Liu Yuhao Dong Yi-Fan Zhang Lei Li Qingxiu Dong Xiaoyi Dong et\u00a0al. 2025. Towards Efficient Multimodal Large Language Models: A Survey on Token Compression. Authorea Preprints (2025).","DOI":"10.36227\/techrxiv.176823010.07236701\/v1"},{"key":"e_1_3_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-emnlp.187"},{"key":"e_1_3_3_1_44_2","volume-title":"The Thirteenth International Conference on Learning Representations","author":"Yu Shi","unstructured":"Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et\u00a0al. [n. d.]. VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. In The Thirteenth International Conference on Learning Representations."},{"key":"e_1_3_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01100"},{"key":"e_1_3_3_1_46_2","first-page":"1393","volume-title":"Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track","author":"Zhang Xin","year":"2024","unstructured":"Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, et\u00a0al. 2024. mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. 1393\u20131412."},{"key":"e_1_3_3_1_47_2","unstructured":"Xin Zhang Yanzhao Zhang Wen Xie Mingxin Li Ziqi Dai Dingkun Long Pengjun Xie Meishan Zhang Wenjie Li and Min Zhang. 2024. GME: Improving Universal Multimodal Retrieval by Multimodal LLMs. 
arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2412.16855 (2024)."},{"key":"e_1_3_3_1_48_2","unstructured":"Yanzhao Zhang Mingxin Li Dingkun Long Xin Zhang Huan Lin Baosong Yang Pengjun Xie An Yang Dayiheng Liu Junyang Lin et\u00a0al. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2506.05176 (2025)."},{"key":"e_1_3_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548422"}],"event":{"name":"ICMR '26: International Conference on Multimedia Retrieval","location":"Amsterdam The Netherlands","acronym":"ICMR '26","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2026 International Conference on Multimedia Retrieval"],"original-title":[],"deposited":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:39:02Z","timestamp":1781537942000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3805622.3810864"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,6,15]]},"references-count":48,"alternative-id":["10.1145\/3805622.3810864","10.1145\/3805622"],"URL":"https:\/\/doi.org\/10.1145\/3805622.3810864","relation":{},"subject":[],"published":{"date-parts":[[2026,6,15]]},"assertion":[{"value":"2026-06-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}