{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T19:41:51Z","timestamp":1765309311701,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":53,"publisher":"ACM","funder":[{"name":"Ministry of Agriculture and Rural Affairs, Key Research and Development Program of Heilongjiang Province","award":["2022ZX01A22, 2021ZXJ05A03"],"award-info":[{"award-number":["2022ZX01A22, 2021ZXJ05A03"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62350710797, 61972114, 62106061"],"award-info":[{"award-number":["62350710797, 61972114, 62106061"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Science and Technology Major Project of China","award":["2021ZD0110901"],"award-info":[{"award-number":["2021ZD0110901"]}]},{"name":"Collaborative Innovation and Promotion System of the Modern Agricultural Industry Technology For Watermelon and Melon"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,27]]},"DOI":"10.1145\/3746027.3755305","type":"proceedings-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T06:54:17Z","timestamp":1761375257000},"page":"4058-4067","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["BOLT: Fewer Tokens but More Performance Retention for Efficient Vision-Language Models Inference"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0610-4321","authenticated-orcid":false,"given":"Jiahua","family":"Bao","sequence":"first","affiliation":[{"name":"Harbin Institute of Technology, Harbin, China, National Key Laboratory of SFTaS, Harbin, China, and China Mobile 5G Institute, Harbin, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1263-9907","authenticated-orcid":false,"given":"Siyao","family":"Cheng","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Harbin, China, National Key Laboratory of SFTaS, Harbin, China, and China Mobile 5G Institute, Harbin, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-6451-0181","authenticated-orcid":false,"given":"Jiaxing","family":"Du","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Harbin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4447-1130","authenticated-orcid":false,"given":"Changjiang","family":"He","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Harbin, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-1154-3157","authenticated-orcid":false,"given":"Zeming","family":"Lang","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Harbin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6769-2115","authenticated-orcid":false,"given":"Hao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Harbin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6209-6886","authenticated-orcid":false,"given":"Jie","family":"Liu","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Harbin, China, National Key Laboratory of SFTaS, Harbin, Heilongjiang, China, and China Mobile 5G Institute, Harbin, China"}]}],"member":"320","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Anurag Ajay, Alexander C Li, Adrien Bardes, Suzanne Petryk, Oscar Ma\u00f1as, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, et al.","author":"Bordes Florian","year":"2024","unstructured":"Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C Li, Adrien Bardes, Suzanne Petryk, Oscar Ma\u00f1as, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, et al., 2024. An introduction to vision-language modeling. 
arXiv preprint arXiv:2405.17247 (2024)."},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"crossref","unstructured":"Lin Chen Jinsong Li Xiaoyi Dong Pan Zhang Yuhang Zang Zehui Chen Haodong Duan Jiaqi Wang Yu Qiao Dahua Lin et al. 2024a. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330 (2024).","DOI":"10.52202\/079017-0850"},{"key":"e_1_3_2_2_3_1","volume-title":"European Conference on Computer Vision. Springer, 19-35","author":"Chen Liang","year":"2024","unstructured":"Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. 2024b. An image is worth 1\/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision. Springer, 19-35."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACVW60836.2024.00106"},{"key":"e_1_3_2_2_5_1","volume-title":"Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, Vol. 35 (2022), 16344-16359."},{"key":"e_1_3_2_2_6_1","volume-title":"UniMIC: Towards Universal Multi-modality Perceptual Image Compression. arXiv preprint arXiv:2412.04912","author":"Gao Yixin","year":"2024","unstructured":"Yixin Gao, Xin Li, Xiaohan Pan, Runsen Feng, Zongyu Guo, Yiting Lu, Yulin Ren, and Zhibo Chen. 2024. UniMIC: Towards Universal Multi-modality Perceptual Image Compression. arXiv preprint arXiv:2412.04912 (2024)."},{"key":"e_1_3_2_2_7_1","unstructured":"Aaron Grattafiori Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Alex Vaughan et al. 2024. The llama 3 herd of models. 
arXiv preprint arXiv:2407.21783 (2024)."},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01363"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00686"},{"key":"e_1_3_2_2_10_1","unstructured":"Christopher Keith Michael Robinson Francis Duncan Allan Worthington Joseph Wilson and Sofia Harris. 2024. Optimizing large language models: A novel approach through dynamic token pruning. (2024)."},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_15"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1039\/c1ay05566f"},{"key":"e_1_3_2_2_13_1","first-page":"70","article-title":"Matrix inversion using Cholesky decomposition. In 2013 signal processing: Algorithms, architectures, arrangements, and applications (SPA)","author":"Krishnamoorthy Aravindh","year":"2013","unstructured":"Aravindh Krishnamoorthy and Deepak Menon. 2013. Matrix inversion using Cholesky decomposition. In 2013 signal processing: Algorithms, architectures, arrangements, and applications (SPA). IEEE, 70-72.","journal-title":"IEEE"},{"key":"e_1_3_2_2_14_1","volume-title":"Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125","author":"Li Bohao","year":"2023","unstructured":"Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. 2023b. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125 (2023)."},{"key":"e_1_3_2_2_15_1","volume-title":"LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models. arXiv preprint arXiv:2407.07895","author":"Li Feng","year":"2024","unstructured":"Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. 2024c. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models. 
arXiv preprint arXiv:2407.07895 (2024)."},{"key":"e_1_3_2_2_16_1","volume-title":"Tokenpacker: Efficient visual projector for multimodal llm. arXiv preprint arXiv:2407.02392","author":"Li Wentong","year":"2024","unstructured":"Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. 2024b. Tokenpacker: Efficient visual projector for multimodal llm. arXiv preprint arXiv:2407.02392 (2024)."},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.20"},{"key":"e_1_3_2_2_18_1","volume-title":"Visual Large Language Models for Generalized and Specialized Applications. arXiv preprint arXiv:2501.02765","author":"Li Yifan","year":"2025","unstructured":"Yifan Li, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, and Yu Kong. 2025a. Visual Large Language Models for Generalized and Specialized Applications. arXiv preprint arXiv:2501.02765 (2025)."},{"key":"e_1_3_2_2_19_1","volume-title":"European Conference on Computer Vision. Springer, 323-340","author":"Li Yanwei","year":"2024","unstructured":"Yanwei Li, Chengyao Wang, and Jiaya Jia. 2024a. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision. Springer, 323-340."},{"key":"e_1_3_2_2_20_1","unstructured":"Zongxia Li Xiyang Wu Hongyang Du Fuxiao Liu Huy Nghiem and Guangyao Shi. [n.d.]. A Survey of State of the Art Large Vision Language Models: Alignment Benchmark Evaluations and Challenges. ( [n. d.])."},{"key":"e_1_3_2_2_21_1","volume-title":"Benchmark evaluations, applications, and challenges of large vision language models: A survey. arXiv preprint arXiv:2501.02189","author":"Li Zongxia","year":"2025","unstructured":"Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, and Guangyao Shi. 2025b. Benchmark evaluations, applications, and challenges of large vision language models: A survey. 
arXiv preprint arXiv:2501.02189 (2025)."},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME52920.2022.9859720"},{"key":"e_1_3_2_2_23_1","volume-title":"Visual instruction tuning. Advances in neural information processing systems","author":"Liu Haotian","year":"2023","unstructured":"Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems, Vol. 36 (2023), 34892-34916."},{"key":"e_1_3_2_2_24_1","volume-title":"Mmbench: Is your multi-modal model an all-around player?","author":"Liu Yuan","year":"2025","unstructured":"Yuan Liu and et al., 2025. Mmbench: Is your multi-modal model an all-around player?. In ECCV. Springer, 216-233."},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.acha.2023.101601"},{"key":"e_1_3_2_2_26_1","first-page":"2507","article-title":"Learn to explain: Multimodal reasoning via thought chains for science question answering","volume":"35","author":"Lu Pan","year":"2022","unstructured":"Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, Vol. 35 (2022), 2507-2521.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2019.00156"},{"key":"e_1_3_2_2_28_1","first-page":"606","article-title":"Efficiently scaling transformer inference","volume":"5","author":"Pope Reiner","year":"2023","unstructured":"Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, Vol. 
5 (2023), 606-624.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_2_29_1","volume-title":"International conference on machine learning. PMLR, 8748-8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748-8763."},{"key":"e_1_3_2_2_30_1","volume-title":"Large vision-language model alignment and misalignment: A survey through the lens of explainability. arXiv preprint arXiv:2501.01346","author":"Shu Dong","year":"2025","unstructured":"Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Ali Payani, Lu Cheng, and Mengnan Du. 2025. Large vision-language model alignment and misalignment: A survey through the lens of explainability. arXiv preprint arXiv:2501.01346 (2025)."},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"crossref","unstructured":"Amanpreet Singh Vivek Natarajan Meet Shah Yu Jiang Xinlei Chen Dhruv Batra Devi Parikh and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 8317-8326.","DOI":"10.1109\/CVPR.2019.00851"},{"key":"e_1_3_2_2_32_1","volume-title":"Juliette Love, et al.","author":"Team Gemma","year":"2024","unstructured":"Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivi\u00e8re, Mihir Sanjay Kale, Juliette Love, et al., 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295 (2024)."},{"key":"e_1_3_2_2_33_1","volume-title":"Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al., 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_3_2_2_34_1","volume-title":"Attention is all you need. Advances in neural information processing systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, Vol. 30 (2017)."},{"key":"e_1_3_2_2_35_1","first-page":"52792","article-title":"Tuning multi-mode token-level prompt alignment across modalities","volume":"36","author":"Wang Dongsheng","year":"2023","unstructured":"Dongsheng Wang, Miaoge Li, Xinyang Liu, MingSheng Xu, Bo Chen, and Hanwang Zhang. 2023b. Tuning multi-mode token-level prompt alignment across modalities. Advances in Neural Information Processing Systems, Vol. 36 (2023), 52792-52810.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_36_1","volume-title":"FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance. arXiv preprint arXiv:2501.02430","author":"Wang Haicheng","year":"2025","unstructured":"Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Qu\u00e9tu, and Enzo Tartaglione. 2025. FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance. arXiv preprint arXiv:2501.02430 (2025)."},{"key":"e_1_3_2_2_37_1","unstructured":"Jiaqi Wang Hanqi Jiang Yiheng Liu Chong Ma Xu Zhang Yi Pan Mengyuan Liu Peiran Gu Sichen Xia Wenjun Li et al. 2024. A comprehensive review of multimodal large language models: Performance and challenges across different tasks. 
arXiv preprint arXiv:2408.01319 (2024)."},{"key":"e_1_3_2_2_38_1","volume-title":"SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models. arXiv preprint arXiv:2305.15033","author":"Wang Zekun","year":"2023","unstructured":"Zekun Wang, Jingchang Chen, Wangchunshu Zhou, Haichao Zhu, Jiafeng Liang, Liping Shan, Ming Liu, Dongliang Xu, Qing Yang, and Bing Qin. 2023a. SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models. arXiv preprint arXiv:2305.15033 (2023)."},{"key":"e_1_3_2_2_39_1","volume-title":"Efficient vision-language models by summarizing visual tokens into compact registers. arXiv preprint arXiv:2410.14072","author":"Wen Yuxin","year":"2024","unstructured":"Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, and Mahyar Najibi. 2024. Efficient vision-language models by summarizing visual tokens into compact registers. arXiv preprint arXiv:2410.14072 (2024)."},{"key":"e_1_3_2_2_40_1","volume-title":"Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images. arXiv preprint arXiv:2502.13928","author":"Wu Shengguang","year":"2025","unstructured":"Shengguang Wu, Fan-Yun Sun, Kaiyue Wen, and Nick Haber. 2025. Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images. arXiv preprint arXiv:2502.13928 (2025)."},{"key":"e_1_3_2_2_41_1","unstructured":"X.ai. 2024. Grok 1.5v. https:\/\/x.ai\/blog\/grok-1.5v"},{"key":"e_1_3_2_2_42_1","volume-title":"Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv preprint arXiv:2410.17247","author":"Xing Long","year":"2024","unstructured":"Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al., 2024. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. 
arXiv preprint arXiv:2410.17247 (2024)."},{"key":"e_1_3_2_2_43_1","volume-title":"Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization. arXiv preprint arXiv:2502.13146","author":"Xing Shuo","year":"2025","unstructured":"Shuo Xing, Yuping Wang, Peiran Li, Ruizheng Bai, Yueqi Wang, Chengxuan Qian, Huaxiu Yao, and Zhengzhong Tu. 2025. Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization. arXiv preprint arXiv:2502.13146 (2025)."},{"key":"e_1_3_2_2_44_1","volume-title":"Visionzip: Longer is better but not necessary in vision language models. arXiv preprint arXiv:2412.04467","author":"Yang Senqiao","year":"2024","unstructured":"Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. 2024. Visionzip: Longer is better but not necessary in vision language models. arXiv preprint arXiv:2412.04467 (2024)."},{"key":"e_1_3_2_2_45_1","volume-title":"LDP: Learnable dynamic precision for efficient deep neural network training and inference. arXiv preprint arXiv:2203.07713","author":"Yu Zhongzhi","year":"2022","unstructured":"Zhongzhi Yu, Yonggan Fu, Shang Wu, Mengquan Li, Haoran You, and Yingyan Lin. 2022. LDP: Learnable dynamic precision for efficient deep neural network training and inference. arXiv preprint arXiv:2203.07713 (2022)."},{"key":"e_1_3_2_2_46_1","volume-title":"A Survey of Multimodal Learning: Methods, Applications, and Future. Comput. Surveys","author":"Yuan Yuan","year":"2025","unstructured":"Yuan Yuan, Zhaojian Li, and Bin Zhao. 2025. A Survey of Multimodal Learning: Methods, Applications, and Future. Comput. Surveys (2025)."},{"key":"e_1_3_2_2_47_1","volume-title":"Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601","author":"Zhang Duzhen","year":"2024","unstructured":"Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024c. 
Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601 (2024)."},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"crossref","unstructured":"Jingyi Zhang Jiaxing Huang et al. 2024b. Vision-language models for vision tasks: A survey. IEEE TPAMI (2024).","DOI":"10.1007\/s11704-024-40051-3"},{"key":"e_1_3_2_2_49_1","first-page":"4388","article-title":"Self-distillation: Towards efficient and compact neural networks","volume":"44","author":"Zhang Linfeng","year":"2021","unstructured":"Linfeng Zhang, Chenglong Bao, and Kaisheng Ma. 2021. Self-distillation: Towards efficient and compact neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44, 8 (2021), 4388-4403.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_2_2_50_1","unstructured":"Shengyu Zhang Linfeng Dong Xiaoya Li Sen Zhang Xiaofei Sun Shuhe Wang Jiwei Li Runyi Hu Tianwei Zhang Fei Wu et al. 2023. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792 (2023)."},{"key":"e_1_3_2_2_51_1","volume-title":"Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417","author":"Zhang Yuan","year":"2024","unstructured":"Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al., 2024a. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417 (2024)."},{"key":"e_1_3_2_2_52_1","volume-title":"The singular value decomposition, applications and beyond. arXiv preprint arXiv:1510.08532","author":"Zhang Zhihua","year":"2015","unstructured":"Zhihua Zhang. 2015. The singular value decomposition, applications and beyond. 
arXiv preprint arXiv:1510.08532 (2015)."},{"key":"e_1_3_2_2_53_1","first-page":"46595","article-title":"Judging llm-as-a-judge with mt-bench and chatbot arena","volume":"36","author":"Zheng Lianmin","year":"2023","unstructured":"Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al., 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, Vol. 36 (2023), 46595-46623.","journal-title":"Advances in Neural Information Processing Systems"}],"event":{"name":"MM '25: The 33rd ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Dublin Ireland","acronym":"MM '25"},"container-title":["Proceedings of the 33rd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746027.3755305","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T19:39:26Z","timestamp":1765309166000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746027.3755305"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":53,"alternative-id":["10.1145\/3746027.3755305","10.1145\/3746027"],"URL":"https:\/\/doi.org\/10.1145\/3746027.3755305","relation":{},"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"2025-10-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}