{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T10:54:58Z","timestamp":1770720898357,"version":"3.49.0"},"reference-count":84,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"name":"Zhejiang Province Pioneer Research and Development Project","award":["2024C01017"],"award-info":[{"award-number":["2024C01017"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>\n                    Recently, Referring Image Segmentation (RIS) frameworks that pair the Multimodal Large Language Model (MLLM) with the Segment Anything Model (SAM) have achieved impressive results. However, adapting MLLM to segmentation is computationally intensive, primarily due to visual token redundancy. We observe that traditional patch-wise visual projectors struggle to strike a balance between reducing the number of visual tokens and preserving semantic clarity, often retaining overly long token sequences to avoid performance drops. Inspired by text tokenizers, we propose a novel semantic visual projector that leverages semantic superpixels generated by SAM to identify \u201cvisual words\u201d in an image. By compressing and projecting semantic superpixels as visual tokens, our approach adaptively shortens the token sequence according to scene complexity while minimizing semantic loss in compression. To mitigate loss of information, we propose a semantic superpixel positional embedding to strengthen MLLM\u2019s awareness of superpixel geometry and position, alongside a semantic superpixel aggregator to preserve both fine-grained details inside superpixels and global context outside. Experiments show that our method cuts visual tokens by\n                    <jats:inline-formula content-type=\"math\/tex\">\n                      <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\sim\\)<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    93% without compromising performance, notably speeding up MLLM training and inference, and outperforming existing compressive visual projectors on RIS.\n                  <\/jats:p>","DOI":"10.1145\/3777472","type":"journal-article","created":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T16:05:03Z","timestamp":1763568303000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Re-purposing SAM into Efficient Visual Projectors for MLLM-based Referring Image Segmentation"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-7885-302X","authenticated-orcid":false,"given":"Xiaobo","family":"Yang","sequence":"first","affiliation":[{"name":"College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9955-3569","authenticated-orcid":false,"given":"Xiaojin","family":"Gong","sequence":"additional","affiliation":[{"name":"College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2026,2,9]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"PyTorch. 2024. Accelerating Generative AI with PyTorch: Segment Anything Fast\u2014pytorch.org. 
Retrieved September 30 2024 from https:\/\/pytorch.org\/blog\/accelerating-generative-ai\/"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.120"},{"key":"e_1_3_1_4_2","first-page":"23716","article-title":"Flamingo: A visual language model for few-shot learning","volume":"35","author":"Alayrac Jean-Baptiste","year":"2022","unstructured":"Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: A visual language model for few-shot learning. In Advances in Neural Information Processing Systems, Vol. 35, 23716\u201323736.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_5_2","unstructured":"Jinze Bai Shuai Bai Shusheng Yang Shijie Wang Sinan Tan Peng Wang Junyang Lin Chang Zhou and Jingren Zhou. 2023. Qwen-VL: A versatile vision-language model for understanding localization text reading and beyond. arXiv:2308.12966. Retrieved from https:\/\/arxiv.org\/abs\/2308.12966"},{"key":"e_1_3_1_6_2","unstructured":"Shuai Bai Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge Sibo Song Kai Dang Peng Wang Shijie Wang Jun Tang et al. 2025. Qwen2.5-VL technical report. arXiv:2502.13923. Retrieved from https:\/\/arxiv.org\/abs\/2502.13923"},{"key":"e_1_3_1_7_2","unstructured":"Ethan Baron Idan Tankel Peter Tu and Guy Ben-Yosef. 2024. Real classification by description: Extending CLIP\u2019s limits of part attributes recognition. arXiv:2412.13947. Retrieved from https:\/\/arxiv.org\/abs\/2412.13947"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01311"},{"key":"e_1_3_1_10_2","unstructured":"Jun Chen Deyao Zhu Xiaoqian Shen Xiang Li Zechu Liu Pengchuan Zhang Raghuraman Krishnamoorthi Vikas Chandra Yunyang Xiong and Mohamed Elhoseiny. 2023. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv:2310.09478. Retrieved from https:\/\/arxiv.org\/abs\/2310.09478"},{"key":"e_1_3_1_11_2","unstructured":"Keqin Chen Zhao Zhang Weili Zeng Richong Zhang Feng Zhu and Rui Zhao. 2023. Shikra: Unleashing multimodal LLM\u2019s referential dialogue magic. arXiv:2306.15195. Retrieved from https:\/\/arxiv.org\/abs\/2306.15195"},{"key":"e_1_3_1_12_2","doi-asserted-by":"crossref","unstructured":"Liang Chen Haozhe Zhao Tianyu Liu Shuai Bai Junyang Lin Chang Zhou and Baobao Chang. 2024. An image is worth 1\/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. arXiv:2403.06764. Retrieved from https:\/\/arxiv.org\/abs\/2403.06764","DOI":"10.1007\/978-3-031-73004-7_2"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-024-4231-5"},{"key":"e_1_3_1_14_2","first-page":"24185","volume-title":"CVPR","author":"Chen Zhe","year":"2024","unstructured":"Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. 2024. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 24185\u201324198."},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00135"},{"key":"e_1_3_1_16_2","unstructured":"Xiangxiang Chu Limeng Qiao Xinyang Lin Shuang Xu Yang Yang Yiming Hu Fei Wei Xinyu Zhang Bo Zhang Xiaolin Wei et al. 2023. 
MobileVLM: A fast reproducible and strong vision language assistant for mobile devices. arXiv:2312.16886. Retrieved from https:\/\/arxiv.org\/abs\/2312.16886"},{"key":"e_1_3_1_17_2","unstructured":"Xiangxiang Chu Limeng Qiao Xinyu Zhang Shuang Xu Fei Wei Yang Yang Xiaofei Sun Yiming Hu Xinyang Lin Bo Zhang et al. 2024. MobileVLM V2: Faster and stronger baseline for vision language model. arXiv:2402.03766. Retrieved from https:\/\/arxiv.org\/abs\/2402.03766"},{"key":"e_1_3_1_18_2","unstructured":"Wenliang Dai Junnan Li Dongxu Li Anthony Meng Huat Tiong Junqi Zhao Weisheng Wang Boyang Li Pascale Fung and Steven Hoi. 2023. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500. Retrieved from https:\/\/arxiv.org\/abs\/2305.06500"},{"key":"e_1_3_1_19_2","volume-title":"ICLR","author":"Dao Tri","year":"2024","unstructured":"Tri Dao. 2024. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR."},{"key":"e_1_3_1_20_2","first-page":"16344","volume-title":"Advances in Neural Information Processing Systems","volume":"35","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, Vol. 35, 16344\u201316359."},{"key":"e_1_3_1_21_2","first-page":"4171","volume-title":"NAACL-HLT","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT. Jill Burstein, Christy Doran, and Thamar Solorio (Eds.), Association for Computational Linguistics, 4171\u20134186."},{"key":"e_1_3_1_22_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1007\/s44267-024-00067-6"},{"key":"e_1_3_1_24_2","first-page":"3","article-title":"LoRA: Low-rank adaptation of large language models","volume":"1","author":"Hu Edward J.","year":"2022","unstructured":"Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In ICLR, Vol. 1, 3.","journal-title":"ICLR"},{"key":"e_1_3_1_25_2","unstructured":"Donggon Jang Yucheol Cho Suin Lee Taehyeon Kim and Dae-Shik Kim. 2025. MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. arXiv:2503.13881. Retrieved from https:\/\/arxiv.org\/abs\/2503.13881"},{"issue":"1","key":"e_1_3_1_26_2","first-page":"18","article-title":"Toward complex-query referring image segmentation: A novel benchmark","volume":"21","author":"Ji Wei","year":"2024","unstructured":"Wei Ji, Li Li, Hao Fei, Xiangyan Liu, Xun Yang, Juncheng Li, and Roger Zimmermann. 2024. Toward complex-query referring image segmentation: A novel benchmark. 
ACM Transactions on Multimedia Computing Communications, and Applications 21, 1, Article 40 (2024), 18.","journal-title":"ACM Transactions on Multimedia Computing Communications, and Applications"},{"key":"e_1_3_1_27_2","unstructured":"Dongsheng Jiang Yuchen Liu Songlin Liu Jin\u2019 Zhao Hao Zhang Zhen Gao Xiaopeng Zhang Jin Li and Hongkai Xiong. 2023. From CLIP to DINO: Visual encoders shout in multi-modal large language models. arXiv:2310.08825. Retrieved from https:\/\/arxiv.org\/abs\/2310.08825"},{"key":"e_1_3_1_28_2","doi-asserted-by":"crossref","unstructured":"Alexander Kirillov Eric Mintun Nikhila Ravi Hanzi Mao Chloe Rolland Laura Gustafson Tete Xiao Spencer Whitehead Alexander C. Berg Wan-Yen Lo et al. 2023. Segment anything. arXiv:2304.02643. Retrieved from https:\/\/arxiv.org\/abs\/2304.02643","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00915"},{"key":"e_1_3_1_30_2","unstructured":"Hongliang Li Jiaxin Zhang Wenhui Liao Dezhi Peng Kai Ding and Lianwen Jin. 2025. Beyond token compression: A training-free reduction framework for efficient visual processing in MLLMs. arXiv:2501.19036. Retrieved from https:\/\/arxiv.org\/abs\/2501.19036"},{"key":"e_1_3_1_31_2","volume-title":"ICML","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML."},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3698771"},{"key":"e_1_3_1_33_2","first-page":"19652","article-title":"Referring transformer: A one-step approach to multi-task visual grounding","volume":"34","author":"Li Muchen","year":"2021","unstructured":"Muchen Li and Leonid Sigal. 2021. Referring transformer: A one-step approach to multi-task visual grounding. In Advances in Neural Information Processing Systems, Vol. 34, 19652\u201319664.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_34_2","unstructured":"Shuai Li Jian Xu Xiao-Hui Li Chao Deng and Lin-Lin Huang. 2025. QG-VTC: Question-guided visual token compression in MLLMs for efficient VQA. arXiv:2504.00654. Retrieved from https:\/\/arxiv.org\/abs\/2504.00654"},{"key":"e_1_3_1_35_2","unstructured":"Wentong Li Yuqian Yuan Jian Liu Dongqi Tang Song Wang Jie Qin Jianke Zhu and Lei Zhang. 2024. TokenPacker: Efficient visual projector for multimodal LLM. arXiv:2407.02392. Retrieved from https:\/\/arxiv.org\/abs\/2407.02392"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2024.111664"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02259"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02484"},{"key":"e_1_3_1_39_2","first-page":"34892","article-title":"Visual instruction tuning","volume":"36","author":"Liu Haotian","year":"2023","unstructured":"Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. In Advances in Neural Information Processing Systems, Vol. 36, 34892\u201334916.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01789"},{"key":"e_1_3_1_41_2","unstructured":"Shilong Liu Zhaoyang Zeng Tianhe Ren Feng Li Hao Zhang Jie Yang Chunyuan Li Jianwei Yang Hang Su Jun Zhu et al. 2023. 
Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv:2303.05499. Retrieved from https:\/\/arxiv.org\/abs\/2303.05499"},{"key":"e_1_3_1_42_2","unstructured":"Dongchen Lu Yuyao Sun Zilu Zhang Leping Huang Jianliang Zeng Mao Shu and Huo Cao. 2025. InternVL-X: Advancing and accelerating InternVL series with efficient visual token compression. arXiv:2503.21307. Retrieved from https:\/\/arxiv.org\/abs\/2503.21307"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.9"},{"key":"e_1_3_1_44_2","unstructured":"Maxime Oquab Timoth\u00e9e Darcet Th\u00e9o Moutakanni Huy Vo Marc Szafraniec Vasil Khalidov Pierre Fernandez Daniel Haziza Francisco Massa Alaaeldin El-Nouby et al. 2023. Dinov2: Learning robust visual features without supervision. arXiv:2304.07193. Retrieved from https:\/\/arxiv.org\/abs\/2304.07193"},{"key":"e_1_3_1_45_2","unstructured":"Zhiliang Peng Wenhui Wang Li Dong Yaru Hao Shaohan Huang Shuming Ma and Furu Wei. 2023. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824. Retrieved from https:\/\/arxiv.org\/abs\/2306.14824"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02561"},{"key":"e_1_3_1_47_2","unstructured":"Dustin Podell Zion English Kyle Lacey Andreas Blattmann Tim Dockhorn Jonas M\u00fcller Joe Penna and Robin Rombach. 2023. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv:2307.01952. Retrieved from https:\/\/arxiv.org\/abs\/2307.01952"},{"key":"e_1_3_1_48_2","first-page":"8748","volume-title":"ICML","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In ICML, 8748\u20138763."},{"issue":"140","key":"e_1_3_1_49_2","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1\u201367.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_1_50_2","doi-asserted-by":"crossref","unstructured":"Hanoona Rasheed Muhammad Maaz Sahal Shaji Abdelrahman Shaker Salman Khan Hisham Cholakkal Rao M. Anwer Eric Xing Ming-Hsuan Yang and Fahad S. Khan. 2024. GLaMM: Pixel grounding large multimodal model. In CVPR.","DOI":"10.1109\/CVPR52733.2024.01236"},{"key":"e_1_3_1_51_2","unstructured":"Tianhe Ren Shilong Liu Ailing Zeng Jing Lin Kunchang Li He Cao Jiayu Chen Xinyu Huang Yukang Chen Feng Yan et al. 2024. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv:2401.14159. Retrieved from https:\/\/arxiv.org\/abs\/2401.14159"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02491"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01443"},{"key":"e_1_3_1_54_2","unstructured":"Rico Sennrich Barry Haddow and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv:1508.07909. 
Retrieved from https:\/\/arxiv.org\/abs\/1508.07909"},{"key":"e_1_3_1_55_2","unstructured":"Yuzhang Shang Mu Cai Bingxin Xu Yong Jae Lee and Yan Yan. 2024. LlaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. arXiv:2403.15388. Retrieved from https:\/\/arxiv.org\/abs\/2403.15388"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2023.127063"},{"key":"e_1_3_1_57_2","first-page":"87310","article-title":"Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs","volume":"37","author":"Tong Peter","year":"2024","unstructured":"Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri Iyer, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. 2024. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. In Advances in Neural Information Processing Systems, Vol. 37, 87310\u201387356.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_58_2","first-page":"5998","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30, 5998\u20136008.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_59_2","unstructured":"Peng Wang Shuai Bai Sinan Tan Shijie Wang Zhihao Fan Jinze Bai Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge et al. 2024. Qwen2-VL: Enhancing vision-language model\u2019s perception of the world at any resolution. arXiv:2409.12191. Retrieved from https:\/\/arxiv.org\/abs\/2409.12191"},{"key":"e_1_3_1_60_2","first-page":"61501","article-title":"VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks","volume":"36","author":"Wang Wenhai","year":"2023","unstructured":"Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. 2023. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. In Advances in Neural Information Processing Systems, Vol. 36, 61501\u201361513.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01139"},{"key":"e_1_3_1_62_2","unstructured":"Zichen Wen Yifeng Gao Shaobo Wang Junyuan Zhang Qintong Zhang Weijia Li Conghui He and Linfeng Zhang. 2025. Stop looking for important tokens in multimodal language models: Duplication matters more. arXiv:2502.11494. Retrieved from https:\/\/arxiv.org\/abs\/2502.11494"},{"key":"e_1_3_1_63_2","unstructured":"Chenyun Wu and Subhransu Maji. 2022. How well does CLIP understand texture? arXiv:2203.11449. Retrieved from https:\/\/arxiv.org\/abs\/2203.11449"},{"key":"e_1_3_1_64_2","first-page":"69925","article-title":"VisionLLM v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks","volume":"37","author":"Wu Jiannan","year":"2024","unstructured":"Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, et al. 2024. VisionLLM v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks. In Advances in Neural Information Processing Systems, Vol. 
37, 69925\u201369975.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_65_2","unstructured":"Size Wu Sheng Jin Wenwei Zhang Lumin Xu Wentao Liu Wei Li and Chen Change Loy. 2024. F-LMM: Grounding frozen large multimodal models. arXiv:2406.05821. Retrieved from https:\/\/arxiv.org\/abs\/2406.05821"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2023.111243"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.3233\/FAIA240541"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01313"},{"key":"e_1_3_1_69_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV57701.2024.00058"},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01762"},{"key":"e_1_3_1_71_2","unstructured":"Linli Yao Lei Li Shuhuai Ren Lean Wang Yuanxin Liu Xu Sun and Lu Hou. 2024. DeCo: Decoupling token compression from semantic abstraction in multimodal large language models. arXiv:2405.20985. Retrieved from https:\/\/arxiv.org\/abs\/2405.20985"},{"key":"e_1_3_1_72_2","volume-title":"ICLR","author":"You Haoxuan","year":"2024","unstructured":"Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. 2024. FERRET: Refer and ground anything anywhere at any granularity. In ICLR."},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46475-6_5"},{"key":"e_1_3_1_74_2","unstructured":"Haobo Yuan Xiangtai Li Tao Zhang Zilong Huang Shilin Xu Shunping Ji Yunhai Tong Lu Qi Jiashi Feng and Ming-Hsuan Yang. 2025. Sa2VA: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos. arXiv:2501.04001. Retrieved from https:\/\/arxiv.org\/abs\/2501.04001"},{"key":"e_1_3_1_75_2","unstructured":"Weili Zeng Ziyuan Huang Kaixiang Ji and Yichao Yan. 2025. Skip-vision: A comprehensive framework for accelerating vision-language models. arXiv:2503.21817. Retrieved from https:\/\/arxiv.org\/abs\/2503.21817"},{"key":"e_1_3_1_76_2","unstructured":"Chaoning Zhang Dongshen Han Jung Uk Yu Qiao Sung-Ho Kim Seungkyu Bae Choong Lee and Seon Hong. 2023. Faster segment anything: Towards lightweight SAM for mobile applications. arXiv:2306.14289. Retrieved from https:\/\/arxiv.org\/abs\/2306.14289"},{"key":"e_1_3_1_77_2","unstructured":"Qizhe Zhang Aosong Cheng Ming Lu Zhiyong Zhuo Minqi Wang Jiajun Cao Shaobo Guo Qi She and Shanghang Zhang. 2024. Attention is all you need for training-free visual token pruning: Make VLM inference faster. arXiv:2412.01818. Retrieved from https:\/\/arxiv.org\/abs\/2412.01818"},{"key":"e_1_3_1_78_2","unstructured":"Shilong Zhang Peize Sun Shoufa Chen Min Xiao Wenqi Shao Wenwei Zhang Kai Chen and Ping Luo. 2023. GPT4RoI: Instruction tuning large language model on region-of-interest. arXiv:2307.03601. Retrieved from https:\/\/arxiv.org\/abs\/2307.03601"},{"key":"e_1_3_1_79_2","first-page":"71737","article-title":"OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding","volume":"37","author":"Zhang Tao","year":"2024","unstructured":"Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Change Loy Chen, and Shuicheng Yan. 2024. OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding. In Advances in Neural Information Processing System, Vol. 
37, 71737\u201371767.","journal-title":"Advances in Neural Information Processing System"},{"key":"e_1_3_1_80_2","unstructured":"Yuan Zhang Chun-Kai Fan Junpeng Ma Wenzhao Zheng Tao Huang Kuan Cheng Denis Gudovskiy Tomoyuki Okuno Yohei Nakata Kurt Keutzer et al. 2024. SparseVLM: Visual token sparsification for efficient vision-language model inference. arXiv:2410.04417. Retrieved from https:\/\/arxiv.org\/abs\/2410.04417"},{"key":"e_1_3_1_81_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.emnlp-demos.4"},{"key":"e_1_3_1_82_2","unstructured":"Tiancheng Zhao Tianqi Zhang Mingwei Zhu Haozhan Shen Kyusong Lee Xiaopeng Lu and Jianwei Yin. 2022. Vl-checklist: Evaluating pre-trained vision-language models with objects attributes and relations. arXiv:2207.00221. Retrieved from https:\/\/arxiv.org\/abs\/2207.00221"},{"key":"e_1_3_1_83_2","unstructured":"Deyao Zhu Jun Chen Xiaoqian Shen Xiang Li and Mohamed Elhoseiny. 2023. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592. Retrieved from https:\/\/arxiv.org\/abs\/2304.10592"},{"key":"e_1_3_1_84_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01451"},{"key":"e_1_3_1_85_2","first-page":"19769","article-title":"Segment everything everywhere all at once","volume":"36","author":"Zou Xueyan","year":"2023","unstructured":"Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. 2023. Segment everything everywhere all at once. In Advances in Neural Information Processing System, Vol. 36, 19769\u201319782.","journal-title":"Advances in Neural Information Processing System"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3777472","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T14:57:02Z","timestamp":1770649022000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3777472"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,9]]},"references-count":84,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3777472"],"URL":"https:\/\/doi.org\/10.1145\/3777472","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,9]]},"assertion":[{"value":"2025-06-12","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-21","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
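The abstract describes the projector at a high level: SAM proposes semantic superpixels, the vision-encoder patch features falling inside each superpixel are compressed into a single visual token, and the resulting variable-length sequence is projected into the LLM's embedding space, so token count tracks scene complexity rather than the fixed patch grid. Below is a minimal PyTorch sketch of that pooling-and-projection step only. It is an illustration under assumptions, not the authors' implementation: the class name `SemanticSuperpixelProjector`, the mean-pooling choice, and the two-layer MLP are hypothetical stand-ins, and the paper's positional embedding and aggregator modules are omitted.

```python
import torch
import torch.nn as nn

class SemanticSuperpixelProjector(nn.Module):
    """Hypothetical sketch: pool ViT patch features under SAM-style
    superpixel masks, so the number of visual tokens equals the number
    of superpixels (S) rather than the number of patches (P)."""

    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        # Two-layer MLP projector, as commonly used in LLaVA-style MLLMs.
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, patch_feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # patch_feats: (P, D) patch embeddings from the vision encoder.
        # masks: (S, P) boolean superpixel-to-patch assignment (e.g., from SAM).
        w = masks.float()
        # Mean-pool the patches belonging to each superpixel; clamp avoids
        # division by zero for superpixels that cover no patches.
        w = w / w.sum(dim=1, keepdim=True).clamp(min=1.0)
        tokens = w @ patch_feats          # (S, D), with S << P
        return self.proj(tokens)          # (S, llm_dim) visual tokens for the LLM

# Toy usage: 576 patches (a 24x24 ViT grid) compressed into 40 superpixel
# tokens with random assignments, roughly the ~93% reduction cited above.
feats = torch.randn(576, 1024)
masks = torch.zeros(40, 576, dtype=torch.bool)
masks[torch.randint(0, 40, (576,)), torch.arange(576)] = True
out = SemanticSuperpixelProjector(1024, 4096)(feats, masks)
print(out.shape)  # torch.Size([40, 4096])
```

Because the token count is the number of SAM superpixels, a simple scene might yield a handful of tokens while a cluttered one yields more, which is the adaptive-length behavior the abstract claims; the mean-pooling here is the simplest plausible compressor, whereas the paper additionally recovers intra-superpixel detail and global context through its aggregator.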