{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,29]],"date-time":"2026-03-29T01:10:55Z","timestamp":1774746655180,"version":"3.50.1"},"reference-count":217,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,12,1]],"date-time":"2025-12-01T00:00:00Z","timestamp":1764547200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T00:00:00Z","timestamp":1765238400000},"content-version":"vor","delay-in-days":8,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62302167, U23A20343, 72192821"],"award-info":[{"award-number":["62302167, U23A20343, 72192821"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Vis. Intell."],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>In the past years, multimodal large language models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering and visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, this survey summarizes the timeline of representative efficient MLLMs, the current state of research in structures and strategies, and the applications. Finally, the limitations of current efficient MLLM research and promising future directions are discussed.<\/jats:p>","DOI":"10.1007\/s44267-025-00099-6","type":"journal-article","created":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T07:44:25Z","timestamp":1765266265000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Efficient multimodal large language models: a survey"],"prefix":"10.1007","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-6892-2964","authenticated-orcid":false,"given":"Yizhang","family":"Jin","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0242-6481","authenticated-orcid":false,"given":"Jian","family":"Li","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0009-0001-4757-2685","authenticated-orcid":false,"given":"Tianjun","family":"Gu","sequence":"additional","affiliation":[]},{"given":"Yexin","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Bo","family":"Zhao","sequence":"additional","affiliation":[]},{"given":"Jinxiang","family":"Lai","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2431-1159","authenticated-orcid":false,"given":"Zhenye","family":"Gan","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6592-8411","authenticated-orcid":false,"given":"Yabiao","family":"Wang","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4216-8090","authenticated-orcid":false,"given":"Chengjie","family":"Wang","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9346-1196","authenticated-orcid":false,"given":"Xin","family":"Tan","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1653-4341","authenticated-orcid":false,"given":"Lizhuang","family":"Ma","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,12,9]]},"reference":[{"key":"99_CR1","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et\u00a0al. (2023). GPT-4 technical report. arXiv preprint. arXiv:2303.08774."},{"key":"99_CR2","unstructured":"Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et\u00a0al. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint. arXiv:2312.11805."},{"key":"99_CR3","unstructured":"Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et\u00a0al. (2023). mPLUG-Owl: modularization empowers large language models with multimodality. arXiv preprint. arXiv:2304.14178."},{"key":"99_CR4","first-page":"13040","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Q. Ye","year":"2024","unstructured":"Ye, Q., Xu, H., Ye, J., Yan, M., Hu, A., Liu, H., Qian, Q., Zhang, J., & Huang, F. (2024). mPLUG-Owl2: revolutionizing multi-modal large language model with modality collaboration. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 13040\u201313051). Piscataway: IEEE."},{"key":"99_CR5","first-page":"24185","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Chen","year":"2024","unstructured":"Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al. (2024). InternVL: scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 24185\u201324198). Piscataway: IEEE."},{"key":"99_CR6","unstructured":"Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., & Zhou, J. (2023). Qwen-VL: a frontier large vision-language model with versatile abilities. arXiv preprint. arXiv:2308.12966."},{"key":"99_CR7","first-page":"14398","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Q. Sun","year":"2024","unstructured":"Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Wang, Y., Rao, Y., Liu, J., Huang, T., & Wang, X. (2024). Generative multimodal models are in-context learners. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 14398\u201314409). Piscataway: IEEE."},{"key":"99_CR8","first-page":"34892","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"H. Liu","year":"2023","unstructured":"Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 34892\u201334916). Red Hook: Curran Associates."},{"key":"99_CR9","first-page":"49250","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"W. Dai","year":"2023","unstructured":"Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P. N., & Hoi, S. (2023). InstructBLIP: towards general-purpose vision-language models with instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 49250\u201349267). Red Hook: Curran Associates."},{"key":"99_CR10","unstructured":"Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., & Elhoseiny, M. (2023). MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint. arXiv:2310.09478."},{"key":"99_CR11","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"D. Zhu","year":"2024","unstructured":"Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2024). MiniGPT-4: enhancing vision-language understanding with advanced large language models. In Proceedings of the 12th international conference on learning representations (pp. 1\u201317). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=1tZbq88f27."},{"key":"99_CR12","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1007\/s44267-024-00067-6","volume":"2","author":"Z. Gao","year":"2024","unstructured":"Gao, Z., Chen, Z., Cui, E., Ren, Y., Wang, W., Zhu, J., Tian, H., Ye, S., He, J., Zhu, X., et al. (2024). Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance. Visual Intelligence, 2, 32.","journal-title":"Visual Intelligence"},{"key":"99_CR13","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Zhu, M., Liu, N., Ou, Z., Mou, X., & Tang, J. (2024). LLaVA-phi: efficient multi-modal assistant with small language model. arXiv preprint. arXiv:2401.02330.","DOI":"10.1145\/3688863.3689575"},{"key":"99_CR14","unstructured":"Hinck, M., Olson, M. L., Cobbley, D., Tseng, S.-Y., & Lal, V. (2024). LLaVA-Gemma: accelerating multimodal foundation models with a compact language model. arXiv preprint. arXiv:2404.01331."},{"key":"99_CR15","unstructured":"Yuan, Z., Li, Z., & Sun, L. (2023). TinyGPT-V: efficient multimodal large language model via small backbones. arXiv preprint. arXiv:2312.16862."},{"key":"99_CR16","unstructured":"Chu, X., Qiao, L., Zhang, X., Xu, S., Wei, F., Yang, Y., Sun, X., Hu, Y., Lin, X., Zhang, B., et\u00a0al. (2024). MobileVLM V2: faster and stronger baseline for vision language model. arXiv preprint. arXiv:2402.03766."},{"key":"99_CR17","first-page":"102","volume-title":"Proceedings of the 4th NeurIPS efficient natural language and speech processing workshop","author":"Y. Qiao","year":"2024","unstructured":"Qiao, Y., Yu, Z., Zhao, Z., Chen, S., Sun, M., Guo, L., Wu, Q., & Liu, J. (2024). VL-Mamba: exploring state space models for multimodal learning. In M. Rezagholizadeh, P. Passban, S. Samiee, V. P. Nia, Y. Cheng, Y. Deng, Q. Liu, & B. Chen (Eds.), Proceedings of the 4th NeurIPS efficient natural language and speech processing workshop (pp. 102\u2013113). Red Hook: Curran Associates."},{"key":"99_CR18","doi-asserted-by":"crossref","unstructured":"Shi, B., Wu, Z., Mao, M., Wang, X., & Darrell, T. (2024). When do we not need larger vision models? arXiv preprint. arXiv:2403.13043.","DOI":"10.1007\/978-3-031-73242-3_25"},{"key":"99_CR19","first-page":"390","volume-title":"Proceedings of the 18th European conference on computer vision","author":"Z. Guo","year":"2024","unstructured":"Guo, Z., Xu, R., Yao, Y., Cui, J., Ni, Z., Ge, C., Chua, T., Liu, Z., & Huang, G. (2024). LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 390\u2013406). Cham: Springer."},{"key":"99_CR20","unstructured":"Lin, B., Tang, Z., Ye, Y., Cui, J., Zhu, B., Jin, P., Zhang, J., Ning, M., & Yuan, L. (2024). MoE-LLaVA: mixture of experts for large vision-language models. arXiv preprint. arXiv:2401.15947."},{"key":"99_CR21","first-page":"10421","volume-title":"Proceedings of the 39th AAAI conference on artificial intelligence","author":"H. Zhao","year":"2025","unstructured":"Zhao, H., Zhang, M., Zhao, W., Ding, P., Huang, S., & Wang, D. (2025). Cobra: extending mamba to multi-modal large language model for efficient inference. In T. Walsh, J. Shah, & Z. Kolter (Eds.), Proceedings of the 39th AAAI conference on artificial intelligence (pp. 10421\u201310429). Palo Alto: AAAI Press."},{"key":"99_CR22","first-page":"1","volume":"2024","author":"Z. Wan","year":"2024","unstructured":"Wan, Z., Wang, X., Liu, C., Alam, S., Zheng, Y., Liu, J., Qu, Z., Yan, S., Zhu, Y., Zhang, Q., et al. (2024). Efficient large language models: a survey. Transactions on Machine Learning Research, 2024, 1\u201367.","journal-title":"Transactions on Machine Learning Research"},{"key":"99_CR23","doi-asserted-by":"crossref","unstructured":"Cha, J., Kang, W., Mun, J., & Roh, B. (2024). Honeybee: locality-enhanced projector for multimodal LLM. arXiv preprint. arXiv:2312.06742.","DOI":"10.1109\/CVPR52733.2024.01311"},{"key":"99_CR24","doi-asserted-by":"crossref","unstructured":"Kar, O. F., Tonioni, A., Poklukar, P., Kulshrestha, A., Zamir, A., & Tombari, F. (2024). BRAVE: broadening the visual encoding of vision-language models. arXiv preprint. arXiv:2404.07204.","DOI":"10.1007\/978-3-031-72640-8_7"},{"key":"99_CR25","first-page":"42566","volume-title":"Proceedings of the 38th international conference on neural information processing systems","author":"X. Dong","year":"2024","unstructured":"Dong, X., Zhang, P., Zang, Y., Cao, Y., Wang, B., Ouyang, L., Zhang, S., Duan, H., Zhang, W., Li, Y., et al. (2024). InternLM-XComposer2-4KHD: a pioneering large vision-language model handling resolutions from 336 pixels to 4K HD. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, & C. Zhang (Eds.), Proceedings of the 38th international conference on neural information processing systems (pp. 42566\u201342592). Red Hook: Curran Associates."},{"key":"99_CR26","first-page":"32400","volume-title":"Proceedings of the 41st international conference on machine learning","author":"D. Liu","year":"2024","unstructured":"Liu, D., Zhang, R., Qiu, L., Huang, S., Lin, W., Zhao, S., Geng, S., Lin, Z., Jin, P., Zhang, K., et al. (2024). SPHINX-X: scaling data and parameters for a family of multi-modal large language models. In Proceedings of the 41st international conference on machine learning (pp. 32400\u201332420). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=tDMlQkJRhZ."},{"issue":"1\u20132","key":"99_CR27","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1561\/0600000110","volume":"16","author":"C. Li","year":"2024","unstructured":"Li, C., Gan, Z., Yang, Z., Yang, J., Li, L., Wang, L., & Gao, J. (2024). Multimodal foundation models: from specialists to general-purpose assistants. Foundations and Trends in Computer Graphics and Vision, 16(1\u20132), 1\u2013214.","journal-title":"Foundations and Trends in Computer Graphics and Vision"},{"key":"99_CR28","doi-asserted-by":"crossref","unstructured":"Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., & Chen, E. (2023). A survey on multimodal large language models. arXiv preprint. arXiv:2306.13549.","DOI":"10.1093\/nsr\/nwae403"},{"key":"99_CR29","first-page":"12401","volume-title":"Findings of the Association for Computational Linguistics","author":"D. Zhang","year":"2024","unstructured":"Zhang, D., Yu, Y., Dong, J., Li, C., Su, D., Chu, C., & Yu, D. (2024). MM-LLMs: recent advances in multimodal large language models. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Findings of the Association for Computational Linguistics (pp. 12401\u201312430). Stroudsburg: ACL."},{"key":"99_CR30","first-page":"12954","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"J. Chen","year":"2024","unstructured":"Chen, J., Yu, Q., Shen, X., Yuille, A. L., & Chen, L. (2024). ViTamin: designing scalable vision models in the vision-language era. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 12954\u201312966). Piscataway: IEEE."},{"key":"99_CR31","first-page":"23716","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"J.-B. Alayrac","year":"2022","unstructured":"Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). Flamingo: a visual language model for few-shot learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 23716\u201323736). Red Hook: Curran Associates."},{"key":"99_CR32","unstructured":"Chu, X., Qiao, L., Lin, X., Xu, S., Yang, Y., Hu, Y., Wei, F., Zhang, X., Zhang, B., Wei, X., et\u00a0al. (2023). MobileVLM: a fast, strong and open vision language assistant for mobile devices. arXiv preprint. arXiv:2312.16886."},{"key":"99_CR33","doi-asserted-by":"publisher","first-page":"2961","DOI":"10.1109\/TMM.2025.3557680","volume":"27","author":"Z. Shao","year":"2025","unstructured":"Shao, Z., Yu, Z., Yu, J., Ouyang, X., Zheng, L., Gai, Z., Wang, M., Kuang, Z., & Ding, J. (2025). Imp: highly capable large multimodal models for mobile devices. IEEE Transactions on Multimedia, 27, 2961\u20132974.","journal-title":"IEEE Transactions on Multimedia"},{"key":"99_CR34","unstructured":"Zhou, B., Hu, Y., Weng, X., Jia, J., Luo, J., Liu, X., Wu, J., & Huang, L. (2024). TinyLLaVA: a framework of small-scale large multimodal models. arXiv preprint. arXiv:2402.14289."},{"key":"99_CR35","unstructured":"He, M., Liu, Y., Wu, B., Yuan, J., Wang, Y., Huang, T., & Zhao, B. (2024). Efficient multimodal learning from data-centric perspective. arXiv preprint. arXiv:2402.11530."},{"key":"99_CR36","doi-asserted-by":"crossref","unstructured":"Li, Y., Zhang, Y., Wang, C., Zhong, Z., Chen, Y., Chu, R., Liu, S., & Jia, J. (2024). Mini-Gemini: mining the potential of multi-modality vision language models. arXiv preprint. arXiv:2403.18814.","DOI":"10.1109\/TPAMI.2025.3637265"},{"key":"99_CR37","unstructured":"Wei, H., Kong, L., Chen, J., Zhao, L., Ge, Z., Yu, E., Sun, J., Han, C., & Zhang, X. (2024). Small language model meets with reinforced vision vocabulary. arXiv preprint. arXiv:2401.12503."},{"key":"99_CR38","unstructured":"Chen, G. H., Chen, S., Zhang, R., Chen, J., Wu, X., Zhang, Z., Chen, Z., Li, J., Wan, X., & Wang, B. (2024). ALLaVA: harnessing GPT4V-synthesized data for a lite vision-language model. arXiv preprint. arXiv:2402.11684."},{"key":"99_CR39","doi-asserted-by":"crossref","unstructured":"McKinzie, B., Gan, Z., Fauconnier, J.-P., Dodge, S., Zhang, B., Dufter, P., Shah, D., Du, X., Peng, F., Weers, F., et\u00a0al. (2024). MM1: methods, analysis & insights from multimodal LLM pre-training. arXiv preprint. arXiv:2403.09611.","DOI":"10.1007\/978-3-031-73397-0_18"},{"key":"99_CR40","first-page":"10986","volume-title":"Proceedings of the 39th AAAI conference on artificial intelligence","author":"M. Zhu","year":"2025","unstructured":"Zhu, M., Zhu, Y., Liu, N., Liu, X., Xu, Z., Shen, C., & Peng, Y. (2025). A comprehensive overhaul of multimodal assistant with small language models. In T. Walsh, J. Shah, & Z. Kolter (Eds.), Proceedings of the 39th AAAI conference on artificial intelligence (pp. 10986\u201310994). Palo Alto: AAAI Press."},{"key":"99_CR41","unstructured":"Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Li, H., Zhao, W., He, Z., et\u00a0al. (2024). MiniCPM-V: a GPT-4V level MLLM on your phone. arXiv preprint. arXiv:2408.01800."},{"key":"99_CR42","unstructured":"Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Yang, H., et\u00a0al. (2024). DeepSeek-VL: towards real-world vision-language understanding. arXiv preprint. arXiv:2403.05525."},{"key":"99_CR43","unstructured":"Yu, Y.-Q., Liao, M., Wu, J., Liao, Y., Zheng, X., & Zeng, W. (2024). TextHawk: exploring efficient fine-grained perception of multimodal large language models. arXiv preprint. arXiv:2404.09204."},{"key":"99_CR44","doi-asserted-by":"publisher","first-page":"1882","DOI":"10.18653\/v1\/2024.emnlp-main.112","volume-title":"Proceedings of the 2024 conference on empirical methods in natural language processing","author":"L. Zhang","year":"2024","unstructured":"Zhang, L., Hu, A., Xu, H., Yan, M., Xu, Y., Jin, Q., Zhang, J., & Huang, F. (2024). TinyChart: efficient chart understanding with program-of-thoughts learning and visual token merging. In Y. Al-Onaizan, M. Bansal, & Y. Chen (Eds.), Proceedings of the 2024 conference on empirical methods in natural language processing (pp. 1882\u20131898). Stroudsburg: ACL."},{"key":"99_CR45","unstructured":"Chen, J., Liu, Y., Li, D., An, X., Feng, Z., Zhao, Y., & Xie, Y. (2024). Plug-and-play grounding of reasoning in multimodal large language models. arXiv preprint. arXiv:2403.19322."},{"key":"99_CR46","unstructured":"Shang, Y., Cai, M., Xu, B., Lee, Y. J., & Yan, Y. (2024). LLaVA-PruMerge: adaptive token reduction for efficient large multimodal models. arXiv preprint. arXiv:2403.15388."},{"key":"99_CR47","first-page":"15710","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"J. Cao","year":"2024","unstructured":"Cao, J., Ye, P., Li, S., Yu, C., Tang, Y., Lu, J., & Chen, T. (2024). MADTP: multimodal alignment-guided dynamic token pruning for accelerating vision-language transformer. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 15710\u201315719). Piscataway: IEEE."},{"key":"99_CR48","first-page":"103305","volume-title":"Proceedings of the 38th international conference on neural information processing systems","author":"Z. Zong","year":"2024","unstructured":"Zong, Z., Ma, B., Shen, D., Song, G., Shao, H., Jiang, D., Li, H., & Liu, Y. (2024). MoVA: adapting mixture of vision experts to multimodal context. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, & C. Zhang (Eds.), Proceedings of the 38th international conference on neural information processing systems (pp. 103305\u2013103333). Red Hook: Curran Associates."},{"key":"99_CR49","doi-asserted-by":"publisher","first-page":"5971","DOI":"10.18653\/v1\/2024.emnlp-main.342","volume-title":"Proceedings of the 2024 conference on empirical methods in natural language processing","author":"B. Lin","year":"2024","unstructured":"Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., & Yuan, L. (2024). Video-LLaVA: learning united visual representation by alignment before projection. In Y. Al-Onaizan, M. Bansal, & Y. Chen (Eds.), Proceedings of the 2024 conference on empirical methods in natural language processing (pp. 5971\u20135984). Stroudsburg: ACL."},{"key":"99_CR50","first-page":"8285","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition workshops","author":"M. Gagrani","year":"2024","unstructured":"Gagrani, M., Goel, R., Jeon, W., Park, J., Lee, M., & Lott, C. (2024). On speculative decoding for multimodal large language models. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition workshops (pp. 8285\u20138289). Piscataway: IEEE."},{"key":"99_CR51","first-page":"19","volume-title":"Proceedings of the 18th European conference on computer vision","author":"L. Chen","year":"2024","unstructured":"Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., & Chang, B. (2024). An image is worth 1\/2 tokens after layer 2: plug-and-play inference acceleration for large vision-language models. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 19\u201335). Cham: Springer."},{"key":"99_CR52","first-page":"5334","volume-title":"Proceedings of the 39th AAAI conference on artificial intelligence","author":"Z. Lin","year":"2025","unstructured":"Lin, Z., Lin, M., Lin, L., & Ji, R. (2025). Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In T. Walsh, J. Shah, & Z. Kolter (Eds.), Proceedings of the 39th AAAI conference on artificial intelligence (pp. 5334\u20135342). Palo Alto: AAAI Press."},{"key":"99_CR53","first-page":"87874","volume-title":"Proceedings of the 38th international conference on neural information processing systems","author":"H. Lauren\u00e7on","year":"2024","unstructured":"Lauren\u00e7on, H., Tronchon, L., Cord, M., & Sanh, V. (2024). What matters when building vision-language models? In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, & C. Zhang (Eds.), Proceedings of the 38th international conference on neural information processing systems (pp. 87874\u201387907). Red Hook: Curran Associates."},{"key":"99_CR54","doi-asserted-by":"crossref","unstructured":"Lin, J., Yin, H., Ping, W., Lu, Y., Molchanov, P., Tao, A., Mao, H., Kautz, J., Shoeybi, M., & Han, S. (2023). VILA: on pre-training for visual language models. arXiv preprint. arXiv:2312.07533.","DOI":"10.1109\/CVPR52733.2024.02520"},{"key":"99_CR55","unstructured":"Luo, G., Zhou, Y., Ren, T., Chen, S., Sun, X., & Ji, R. (2023). Cheap and quick: efficient vision-language instruction tuning for large language models. arXiv preprint. arXiv:2305.15023."},{"key":"99_CR56","unstructured":"Zhang, W., Lin, T., Liu, J., Shu, F., Li, H., Zhang, L., Wanggui, H., Zhou, H., Lv, Z., Jiang, H., et\u00a0al. (2024). HyperLLaVA: dynamic visual and language expert tuning for multimodal large language models. arXiv preprint. arXiv:2403.13447."},{"key":"99_CR57","unstructured":"Wu, Q., Ye, W., Zhou, Y., Sun, X., & Ji, R. (2024). Not all attention is needed: parameter and computation efficient transfer learning for multi-modal large language models. arXiv preprint. arXiv:2403.15226."},{"key":"99_CR58","first-page":"22062","volume-title":"Proceedings of the 41st international conference on machine learning","author":"S. Jie","year":"2024","unstructured":"Jie, S., Tang, Y., Ding, N., Deng, Z., Han, K., & Wang, Y. (2024). Memory-space visual prompting for efficient vision-language fine-tuning. In Proceedings of the 41st international conference on machine learning (pp. 22062\u201322074). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=FHkavpr5Ze."},{"key":"99_CR59","first-page":"26286","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"H. Liu","year":"2024","unstructured":"Liu, H., Li, C., Li, Y., & Lee, Y. J. (2024). Improved baselines with visual instruction tuning. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 26286\u201326296). Piscataway: IEEE."},{"key":"99_CR60","first-page":"370","volume-title":"Proceedings of the 18th European conference on computer vision","author":"L. Chen","year":"2024","unstructured":"Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., & Lin, D. (2024). ShareGPT4V: improving large multi-modal models with better captions. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 370\u2013387). Cham: Springer."},{"key":"99_CR61","unstructured":"Wang, J., Meng, L., Weng, Z., He, B., Wu, Z., & Jiang, Y.-G. (2023). To see is to believe: prompting GPT-4V for better visual instruction tuning. arXiv preprint. arXiv:2311.07574."},{"key":"99_CR62","first-page":"6325","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"Y. Goyal","year":"2017","unstructured":"Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). Making the V in VQA matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6325\u20136334). Piscataway: IEEE."},{"key":"99_CR63","first-page":"8317","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"A. Singh","year":"2019","unstructured":"Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., & Rohrbach, M. (2019). Towards VQA models that can read. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 8317\u20138326). Piscataway: IEEE."},{"key":"99_CR64","first-page":"6700","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"D. A. Hudson","year":"2019","unstructured":"Hudson, D. A., & Manning, C. D. (2019). GQA: a new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 6700\u20136709). Piscataway: IEEE."},{"key":"99_CR65","unstructured":"Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et\u00a0al. (2023). MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint. arXiv:2306.13394."},{"key":"99_CR66","first-page":"216","volume-title":"Proceedings of the 18th European conference on computer vision","author":"Y. Liu","year":"2025","unstructured":"Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. (2025). MMBench: is your multi-modal model an all-around player? In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 216\u2013233). Cham: Springer."},{"key":"99_CR67","doi-asserted-by":"publisher","first-page":"292","DOI":"10.18653\/v1\/2023.emnlp-main.20","volume-title":"Proceedings of the 2023 conference on empirical methods in natural language processing","author":"Y. Li","year":"2023","unstructured":"Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., & Wen, J.-R. (2023). Evaluating object hallucination in large vision-language models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 conference on empirical methods in natural language processing (pp. 292\u2013305). Stroudsburg: ACL."},{"key":"99_CR68","unstructured":"Chaves, J. M. Z., Huang, S.-C., Xu, Y., Xu, H., Usuyama, N., Zhang, S., Wang, F., Xie, Y., Khademi, M., Yang, Z., et\u00a0al. (2024). Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation. arXiv preprint. arXiv:2403.08002."},{"key":"99_CR69","first-page":"3843","volume-title":"Proceedings of the 2024 conference on empirical methods in natural language processing","author":"S. Jiang","year":"2024","unstructured":"Jiang, S., Zheng, T., Zhang, Y., Jin, Y., Yuan, L., & Liu, Z. (2024). Med-MoE: mixture of domain-specific experts for lightweight medical vision-language models. In Y. Al-Onaizan, M. Bansal, & Y.-N. Chen (Eds.), Proceedings of the 2024 conference on empirical methods in natural language processing (pp. 3843\u20133860). Stroudsburg: ACL."},{"key":"99_CR70","first-page":"26753","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Li","year":"2024","unstructured":"Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., & Bai, X. (2024). Monkey: image resolution and text label are important things for large multi-modal models. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 26753\u201326763). Piscataway: IEEE."},{"key":"99_CR71","first-page":"15534","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"C. Liu","year":"2024","unstructured":"Liu, C., Yin, K., Cao, H., Jiang, X., Li, X., Liu, Y., Jiang, D., Sun, X., & Xu, L. (2024). HRVDA: high-resolution visual document assistant. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 15534\u201315545). Piscataway: IEEE."},{"key":"99_CR72","first-page":"38728","volume-title":"Proceedings of the international conference on machine learning","author":"H. Xu","year":"2023","unstructured":"Xu, H., Ye, Q., Yan, M., Shi, Y., Ye, J., Xu, Y., Li, C., Bi, B., Qian, Q., Wang, W., et al. (2023). mPLUG-2: a modularized multi-modal foundation model across text, image and video. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, & J. Scarlett (Eds.), Proceedings of the international conference on machine learning (pp. 38728\u201338748). Retrieved October 17, 2025, from https:\/\/proceedings.mlr.press\/v202\/xu23s.html."},{"key":"99_CR73","first-page":"13504","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"B. He","year":"2024","unstructured":"He, B., Li, H., Jang, Y. K., Jia, M., Cao, X., Shah, A., Shrivastava, A., & Lim, S.-N. (2024). MA-LMM: memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 13504\u201313514). Piscataway: IEEE."},{"key":"99_CR74","first-page":"323","volume-title":"Proceedings of the 18th European conference on computer vision","author":"Y. Li","year":"2024","unstructured":"Li, Y., Wang, C., & Jia, J. (2024). LLaMA-VID: an image is worth 2 tokens in large language models. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 323\u2013340). Cham: Springer."},{"issue":"10","key":"99_CR75","doi-asserted-by":"publisher","first-page":"8186","DOI":"10.1109\/LRA.2024.3440097","volume":"9","author":"Z. Xu","year":"2024","unstructured":"Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K. K., Li, Z., & Zhao, H. (2024). DriveGPT4: interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 9(10), 8186\u20138193.","journal-title":"IEEE Robotics and Automation Letters"},{"key":"99_CR76","first-page":"403","volume-title":"Proceedings of the 18th European conference on computer vision","author":"Y. Ma","year":"2024","unstructured":"Ma, Y., Cao, Y., Sun, J., Pavone, M., & Xiao, C. (2024). Dolphins: multimodal language model for driving. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 403\u2013420). Cham: Springer."},{"key":"99_CR77","unstructured":"Elgendy, H., Sharshar, A., Aboeitta, A., Ashraf, Y., & Guizani, M. (2024). GeoLLaVA: efficient fine-tuned vision-language models for temporal change detection in remote sensing. arXiv preprint. arXiv:2410.19552."},{"key":"99_CR78","unstructured":"Wang, F., Chen, M., Li, Y., Wang, D., Wang, H., Guo, Z., Wang, Z., Shan, B., Lan, L., Wang, Y., et\u00a0al. (2025). GeoLLaVA-8K: scaling remote-sensing multimodal large language models to 8k resolution. arXiv preprint. arXiv:2505.21375."},{"key":"99_CR79","doi-asserted-by":"crossref","unstructured":"Koksal, A., & Alatan, A. A. (2025). TinyRS-R1: compact multimodal language model for remote sensing. arXiv preprint. arXiv:2505.12099.","DOI":"10.1109\/LGRS.2025.3623244"},{"key":"99_CR80","first-page":"8748","volume-title":"Proceedings of the 38th international conference on machine learning","author":"A. Radford","year":"2021","unstructured":"Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th international conference on machine learning (pp. 8748\u20138763). Retrieved October 17, 2025, from http:\/\/proceedings.mlr.press\/v139\/radford21a.html."},{"issue":"3","key":"99_CR81","first-page":"3","volume":"1","author":"M. Javaheripi","year":"2023","unstructured":"Javaheripi, M., Bubeck, S., Abdin, M., Aneja, J., Bubeck, S., Mendes, C. C. T., Chen, W., Del Giorno, A., Eldan, R., Gopi, S., et al. (2023). Phi-2: the surprising power of small language models. Microsoft Research Blog, 1(3), 3.","journal-title":"Microsoft Research Blog"},{"key":"99_CR82","doi-asserted-by":"publisher","first-page":"11975","DOI":"10.1007\/978-3-030-96530-3","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"X. Zhai","year":"2023","unstructured":"Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid loss for language image pre-training. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 11975\u201311986). Piscataway: IEEE."},{"key":"99_CR83","first-page":"1","volume":"2024","author":"M. Oquab","year":"2024","unstructured":"Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. (2024). DINOv2: learning robust visual features without supervision. Transactions on Machine Learning Research, 2024, 1\u201332.","journal-title":"Transactions on Machine Learning Research"},{"key":"99_CR84","volume-title":"First conference on language modeling","author":"A. Gu","year":"2024","unstructured":"Gu, A., & Dao, T. (2024). Mamba: linear-time sequence modeling with selective state spaces. In First conference on language modeling. Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=tEYskw1VY2."},{"key":"99_CR85","unstructured":"Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivi\u00e8re, M., Kale, M. S., Love, J., et\u00a0al. (2024). Gemma: open models based on Gemini research and technology. arXiv preprint. arXiv:2403.08295."},{"key":"99_CR86","unstructured":"Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et\u00a0al. (2023). Qwen technical report. arXiv preprint. arXiv:2309.16609."},{"key":"99_CR87","first-page":"19358","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. Fang","year":"2023","unstructured":"Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., & Cao, Y. (2023). EVA: exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 19358\u201319369). Piscataway: IEEE."},{"key":"99_CR88","first-page":"19730","volume-title":"Proceedings of the 40th international conference on machine learning","author":"J. Li","year":"2023","unstructured":"Li, J., Li, D., Savarese, S., & Hoi, S. C. H. (2023). BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, & J. Scarlett (Eds.), Proceedings of the 40th international conference on machine learning (pp. 19730\u201319742). Retrieved October 17, 2025, from https:\/\/proceedings.mlr.press\/v202\/li23q.html."},{"key":"99_CR89","first-page":"11976","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Liu","year":"2022","unstructured":"Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 11976\u201311986). Piscataway: IEEE."},{"key":"99_CR90","unstructured":"Zhang, P., Zeng, G., Wang, T., & Lu, W. (2024). TinyLlama: an open-source small language model. arXiv preprint. arXiv:2401.02385."},{"key":"99_CR91","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"A. Fang","year":"2024","unstructured":"Fang, A., Jose, A. M., Jain, A., Schmidt, L., Toshev, A. T., & Shankar, V. (2024). Data filtering networks. In Proceedings of the 12th international conference on learning representations (pp. 1\u201317). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=KAk6ngZ09F."},{"key":"99_CR92","unstructured":"Hu, S., Tu, Y., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhao, W., et\u00a0al. (2024). MiniCPM: unveiling the potential of small language models with scalable training strategies. arXiv preprint. arXiv:2404.06395."},{"key":"99_CR93","unstructured":"Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et\u00a0al. (2024). DeepSeek LLM: scaling open-source language models with longtermism. arXiv preprint. arXiv:2401.02954."},{"key":"99_CR94","unstructured":"Abdin, M., Aneja, J., Awadalla, H., Awadallah, A., Awan, A. A., Bach, N., Bahree, A., Bakhtiari, A., Bao, J., Behl, H., et\u00a0al. (2024). Phi-3 technical report: a highly capable language model locally on your phone. arXiv preprint. arXiv:2404.14219."},{"key":"99_CR95","first-page":"240","volume-title":"Proceedings of the 18th European conference on computer vision","author":"K. You","year":"2024","unstructured":"You, K., Zhang, H., Schoop, E., Weers, F., Swearngin, A., Nichols, J., Yang, Y., & Gan, Z. (2024). Ferret-UI: grounded mobile UI understanding with multimodal LLMs. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 240\u2013255). Cham: Springer."},{"key":"99_CR96","volume-title":"Proceedings of the 42nd international conference on machine learning","author":"Y. Zhang","year":"2025","unstructured":"Zhang, Y., Fan, C.-K., Ma, J., Zheng, W., Huang, T., Cheng, K., Gudovskiy, D. A., Okuno, T., Nakata, Y., Keutzer, K., et al. (2025). SparseVLM: visual token sparsification for efficient vision-language model inference. In Proceedings of the 42nd international conference on machine learning. Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=80faIPZ67S."},{"key":"99_CR97","unstructured":"Zhang, Q., Cheng, A., Lu, M., Zhang, R., Zhuo, Z., Cao, J., Guo, S., She, Q., & Zhang, S. (2024). Beyond text-visual attention: exploiting visual cues for effective token pruning in VLMs. arXiv preprint. arXiv:2412.01818."},{"key":"99_CR98","first-page":"19803","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"C. Yang","year":"2025","unstructured":"Yang, C., Sui, Y., Xiao, J., Huang, L., Gong, Y., Li, C., Yan, J., Bai, Y., Sadayappan, P., Hu, X., et al. (2025). TopV: compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 19803\u201319813). Piscataway: IEEE."},{"key":"99_CR99","first-page":"18992","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"K. Tao","year":"2025","unstructured":"Tao, K., Qin, C., You, H., Sui, Y., & Wang, H. (2025). DyCoke: dynamic compression of tokens for fast video large language models. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 18992\u201319001). Piscataway: IEEE."},{"key":"99_CR100","unstructured":"Shen, L., Gong, G., He, T., Zhang, Y., Liu, P., Zhao, S., & Ding, G. (2025). FastVID: dynamic density pruning for fast video large language models. arXiv preprint. arXiv:2503.11187."},{"key":"99_CR101","unstructured":"Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d.\u00a0l., Hanna, E. B., Bressand, F., et\u00a0al. (2024). Mixtral of experts. arXiv preprint. arXiv:2401.04088."},{"key":"99_CR102","first-page":"166","volume-title":"Proceedings of the 18th European conference on computer vision","author":"H. Wang","year":"2024","unstructured":"Wang, H., Ye, Y., Wang, Y., Nie, Y., & Huang, C. (2024). Elysium: exploring object-level perception in videos via MLLM. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 166\u2013185). Cham: Springer."},{"key":"99_CR103","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"B. Zhu","year":"2024","unstructured":"Zhu, B., Lin, B., Ning, M., Yan, Y., Cui, J., Wang, H., Pang, Y., Jiang, W., Zhang, J., Li, Z., et al. (2024). LanguageBind: extending video-language pretraining to n-modality by language-based semantic alignment. In Proceedings of the 12th international conference on learning representations (pp. 1\u201321). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=QmZKc7UZCy."},{"issue":"6","key":"99_CR104","doi-asserted-by":"publisher","first-page":"7900","DOI":"10.1109\/TPAMI.2022.3217852","volume":"45","author":"H. Ding","year":"2022","unstructured":"Ding, H., Liu, C., Wang, S., & Jiang, X. (2022). VLT: vision-language transformer and query generation for referring segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6), 7900\u20137916.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"99_CR105","first-page":"1","volume-title":"Proceedings of the 9th international conference on learning representations","author":"A. Dosovitskiy","year":"2021","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2021). An image is worth 16x16 words: transformers for image recognition at scale. In Proceedings of the 9th international conference on learning representations (pp. 1\u201321). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=YicbFdNTTy."},{"issue":"5","key":"99_CR106","doi-asserted-by":"publisher","first-page":"3123","DOI":"10.1109\/TPAMI.2023.3341806","volume":"46","author":"W. Wang","year":"2023","unstructured":"Wang, W., Chen, W., Qiu, Q., Chen, L., Wu, B., Lin, B., He, X., & Liu, W. (2023). Crossformer++: a versatile vision transformer hinging on cross-scale attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5), 3123\u20133136.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"99_CR107","unstructured":"Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., & Shi, H. (2021). Escaping the big data paradigm with compact transformers. arXiv preprint. arXiv:2104.05704."},{"key":"99_CR108","first-page":"1","volume-title":"Proceedings of the 8th international conference on learning representations","author":"N. Kitaev","year":"2020","unstructured":"Kitaev, N., Kaiser, L., & Levskaya, A. (2020). Reformer: the efficient transformer. In Proceedings of the 8th international conference on learning representations (pp. 1\u201312). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=rkgNKkHtvB."},{"key":"99_CR109","first-page":"12934","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"Y. Li","year":"2022","unstructured":"Li, Y., Yuan, G., Wen, Y., Hu, J., Evangelidis, G., Tulyakov, S., Wang, Y., & Ren, J. (2022). EfficientFormer: vision transformers at MobileNet speed. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 12934\u201312949). Red Hook: Curran Associates."},{"key":"99_CR110","first-page":"16889","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Y. Li","year":"2023","unstructured":"Li, Y., Hu, J., Wen, Y., Evangelidis, G., Salahi, K., Wang, Y., Tulyakov, S., & Ren, J. (2023). Rethinking vision transformers for MobileNet size and speed. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 16889\u201316900). Piscataway: IEEE."},{"key":"99_CR111","first-page":"4931","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"A. Chavan","year":"2022","unstructured":"Chavan, A., Shen, Z., Liu, Z., Liu, Z., Cheng, K.-T., & Xing, E. P. (2022). Vision transformer slimming: multi-dimension searching in continuous optimization space. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 4931\u20134941). Piscataway: IEEE."},{"key":"99_CR112","first-page":"12270","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"M. Chen","year":"2021","unstructured":"Chen, M., Peng, H., Fu, J., & Ling, H. (2021). AutoFormer: searching transformers for visual recognition. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 12270\u201312280). Piscataway: IEEE."},{"key":"99_CR113","first-page":"1","volume-title":"Proceedings of the 10th international conference on learning representations","author":"C. Gong","year":"2022","unstructured":"Gong, C., Wang, D., Li, M., Chen, X., Yan, Z., Tian, Y., Liu, Q., & Chandra, V. (2022). NASViT: neural architecture search for efficient vision transformers with gradient conflict aware supernet training. In Proceedings of the 10th international conference on learning representations (pp. 1\u201318). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=Qaw16njk6L."},{"key":"99_CR114","first-page":"10894","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Q. Zhou","year":"2022","unstructured":"Zhou, Q., Sheng, K., Zheng, X., Li, K., Sun, X., Tian, Y., Chen, J., & Ji, R. (2022). Training-free transformer architecture search. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 10894\u201310903). Piscataway: IEEE."},{"key":"99_CR115","first-page":"33","volume-title":"Proceedings of the 17th European conference on computer vision","author":"J. Liu","year":"2022","unstructured":"Liu, J., Huang, X., Song, G., Li, H., & Liu, Y. (2022). UniNet: unified architecture search with convolution, transformer, and MLP. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 33\u201349). Cham: Springer."},{"key":"99_CR116","unstructured":"Renggli, C., Pinto, A. S., Houlsby, N., Mustafa, B., Puigcerver, J., & Riquelme, C. (2022). Learning to merge tokens in vision transformers. arXiv preprint. arXiv:2202.12015."},{"key":"99_CR117","first-page":"13937","volume-title":"Proceedings of the 35th international conference on neural information processing systems","author":"Y. Rao","year":"2021","unstructured":"Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., & Hsieh, C. (2021). DynamicViT: efficient vision transformers with dynamic token sparsification. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, & J. W. Vaughan (Eds.), Proceedings of the 35th international conference on neural information processing systems (pp. 13937\u201313949). Red Hook: Curran Associates."},{"key":"99_CR118","unstructured":"Li, W., Wang, X., Xia, X., Wu, J., Xiao, X., Zheng, M., & Wen, S. (2022). SepViT: separable vision transformer. arXiv preprint. arXiv:2203.15380."},{"key":"99_CR119","first-page":"28805","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"D. Kuznedelev","year":"2023","unstructured":"Kuznedelev, D., Kurtic, E., Frantar, E., & Alistarh, D. (2023). CAP: correlation-aware pruning for highly-accurate sparse vision models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 28805\u201328831). Red Hook: Curran Associates."},{"key":"99_CR120","doi-asserted-by":"crossref","unstructured":"Wang, A., Chen, H., Lin, Z., Zhao, S., Han, J., & Ding, G. (2025). CAIT: triple-win compression towards high accuracy, fast inference, and favorable transferability for ViTs. arXiv preprint. arXiv:2309.15755.","DOI":"10.1109\/TPAMI.2025.3616854"},{"key":"99_CR121","first-page":"3143","volume-title":"Proceedings of the 36th AAAI conference on artificial intelligence","author":"F. Yu","year":"2022","unstructured":"Yu, F., Huang, K., Wang, M., Cheng, Y., Chu, W., & Cui, L. (2022). Width & depth pruning for vision transformers. In Proceedings of the 36th AAAI conference on artificial intelligence (pp. 3143\u20133151). Palo Alto: AAAI Press."},{"key":"99_CR122","first-page":"24355","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"L. Yu","year":"2023","unstructured":"Yu, L., & Xiang, W. (2023). X-Pruner: explainable pruning for vision transformers. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 24355\u201324363). Piscataway: IEEE."},{"key":"99_CR123","unstructured":"Zhu, M., Tang, Y., & Han, K. (2021). Vision transformer pruning. arXiv preprint. arXiv:2104.08500."},{"key":"99_CR124","first-page":"12165","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. Tang","year":"2022","unstructured":"Tang, Y., Han, K., Wang, Y., Xu, C., Guo, J., Xu, C., & Tao, D. (2022). Patch slimming for efficient vision transformers. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 12165\u201312174). Piscataway: IEEE."},{"key":"99_CR125","first-page":"620","volume-title":"Proceedings of the 17th European conference on computer vision","author":"Z. Kong","year":"2022","unstructured":"Kong, Z., Dong, P., Ma, X., Meng, X., Niu, W., Sun, M., Shen, X., Yuan, G., Ren, B., Tang, H., et al. (2022). SPViT: enabling faster vision transformers via latency-aware soft token pruning. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 620\u2013640). Cham: Springer."},{"key":"99_CR126","first-page":"10347","volume-title":"Proceedings of the 38th international conference on machine learning","author":"H. Touvron","year":"2021","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & J\u00e9gou, H. (2021). Training data-efficient image transformers & distillation through attention. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th international conference on machine learning (pp. 10347\u201310357). Retrieved October 17, 2025, from http:\/\/proceedings.mlr.press\/v139\/touvron21a.html."},{"key":"99_CR127","first-page":"68","volume-title":"Proceedings of the 17th European conference on computer vision","author":"K. Wu","year":"2022","unstructured":"Wu, K., Zhang, J., Peng, H., Liu, M., Xiao, B., Fu, J., & Yuan, L. (2022). TinyViT: fast pretraining distillation for small vision transformers. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 68\u201385). Cham: Springer."},{"key":"99_CR128","unstructured":"Lo, K. M., Liang, Y., Du, W., Fan, Y., Wang, Z., Huang, W., Ma, L., & Fu, J. (2024). m2mKD: module-to-module knowledge distillation for modular transformers. arXiv preprint. arXiv:2402.16918."},{"key":"99_CR129","first-page":"9164","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"Z. Hao","year":"2022","unstructured":"Hao, Z., Guo, J., Jia, D., Han, K., Tang, Y., Zhang, C., Hu, H., & Wang, Y. (2022). Learning efficient vision transformers via fine-grained manifold distillation. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 9164\u20139175). Red Hook: Curran Associates."},{"key":"99_CR130","first-page":"12145","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"J. Zhang","year":"2022","unstructured":"Zhang, J., Peng, H., Wu, K., Liu, M., Xiao, B., Fu, J., & Yuan, L. (2022). MiniViT: compressing vision transformers with weight multiplexing. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 12145\u201312154). Piscataway: IEEE."},{"key":"99_CR131","first-page":"12052","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"X. Chen","year":"2022","unstructured":"Chen, X., Cao, Q., Zhong, Y., Zhang, J., Gao, S., & Tao, D. (2022). DearKD: data-efficient early knowledge distillation for vision transformers. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 12052\u201312062). Piscataway: IEEE."},{"key":"99_CR132","first-page":"16773","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"S. Ren","year":"2022","unstructured":"Ren, S., Gao, Z., Hua, T., Xue, Z., Tian, Y., He, S., & Zhao, H. (2022). Co-advise: cross inductive bias distillation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 16773\u201316782). Piscataway: IEEE."},{"key":"99_CR133","first-page":"191","volume-title":"Proceedings of the 17th European conference on computer vision","author":"Z. Yuan","year":"2022","unstructured":"Yuan, Z., Xue, C., Chen, Y., Wu, Q., & Sun, G. (2022). PTQ4ViT: post-training quantization for vision transformers with twin uniform quantization. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 191\u2013207). Cham: Springer."},{"key":"99_CR134","doi-asserted-by":"publisher","first-page":"5380","DOI":"10.1145\/3503161.3547826","volume-title":"Proceedings of the 30th ACM international conference on multimedia","author":"Y. Ding","year":"2022","unstructured":"Ding, Y., Qin, H., Yan, Q., Chai, Z., Liu, J., Wei, X., & Liu, X. (2022). Towards accurate post-training quantization for vision transformer. In J. Magalh\u00e3es, A. D. Bimbo, S. Satoh, N. Sebe, X. Alameda-Pineda, Q. Jin, V. Oria, & L. Toni (Eds.), Proceedings of the 30th ACM international conference on multimedia (pp. 5380\u20135388). New York: ACM."},{"key":"99_CR135","first-page":"20321","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. Liu","year":"2023","unstructured":"Liu, Y., Yang, H., Dong, Z., Keutzer, K., Du, L., & Zhang, S. (2023). NoisyQuant: noisy bias-enhanced post-training activation quantization for vision transformers. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 20321\u201320330). Piscataway: IEEE."},{"issue":"7","key":"99_CR136","doi-asserted-by":"publisher","first-page":"8813","DOI":"10.1109\/TPAMI.2022.3229313","volume":"45","author":"Z. Wang","year":"2022","unstructured":"Wang, Z., Wang, C., Xu, X., Zhou, J., & Lu, J. (2022). Quantformer: learning extremely low-precision vision transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7), 8813\u20138826.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"99_CR137","first-page":"16196","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"C. Lin","year":"2023","unstructured":"Lin, C., Peng, B., Li, Z., Tan, W., Ren, Y., Xiao, J., & Pu, S. (2023). Bit-shrinking: limiting instantaneous sharpness for improving post-training quantization. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 16196\u201316205). Piscataway: IEEE."},{"key":"99_CR138","first-page":"34451","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"Y. Li","year":"2022","unstructured":"Li, Y., Xu, S., Zhang, B., Cao, X., Gao, P., & Guo, G. (2022). Q-ViT: accurate and fully quantized low-bit vision transformer. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 34451\u201334463). Red Hook: Curran Associates."},{"key":"99_CR139","unstructured":"Xu, S., Li, Y., Ma, T., Zeng, B., Zhang, B., Gao, P., & Lv, J. (2022). TerViT: an efficient ternary vision transformer. arXiv preprint. arXiv:2201.08050."},{"key":"99_CR140","first-page":"5651","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Y. He","year":"2023","unstructured":"He, Y., Lou, Z., Zhang, L., Liu, J., Wu, W., Zhou, H., & Zhuang, B. (2023). BiViT: extremely compressed binary vision transformers. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 5651\u20135663). Piscataway: IEEE."},{"key":"99_CR141","first-page":"9015","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"P. Dong","year":"2023","unstructured":"Dong, P., Lu, L., Wu, C., Lyu, C., Yuan, G., Tang, H., & Wang, Y. (2023). PackQViT: faster sub-8-bit vision transformers via full and packed quantization on the mobile. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 9015\u20139028). Red Hook: Curran Associates."},{"key":"99_CR142","first-page":"4665","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition workshops","author":"P.-H. C. Le","year":"2023","unstructured":"Le, P.-H. C., & Li, X. (2023). BinaryViT: pushing binary vision transformers towards convolutional models. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition workshops (pp. 4665\u20134674). Piscataway: IEEE."},{"key":"99_CR143","first-page":"22658","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"C. Yu","year":"2023","unstructured":"Yu, C., Chen, T., Gan, Z., & Fan, J. (2023). Boost vision transformer with GPU-friendly sparsity and quantization. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 22658\u201322668). Piscataway: IEEE."},{"key":"99_CR144","first-page":"109","volume-title":"Proceedings of the 32nd international conference on field-programmable logic and applications","author":"Z. Li","year":"2022","unstructured":"Li, Z., Sun, M., Lu, A., Ma, H., Yuan, G., Xie, Y., Tang, H., Li, Y., Leeser, M., Wang, Z., et al. (2022). Auto-ViT-Acc: an FPGA-aware automatic acceleration framework for vision transformer with mixed-scheme quantization. In Proceedings of the 32nd international conference on field-programmable logic and applications (pp. 109\u2013116). Piscataway: IEEE."},{"key":"99_CR145","first-page":"396","volume-title":"Proceedings of the 17th European conference on computer vision","author":"M. Fayyaz","year":"2022","unstructured":"Fayyaz, M., Koohpayegani, S. A., Jafari, F. R., Sengupta, S., Joze, H. R. V., Sommerlade, E., Pirsiavash, H., & Gall, J. (2022). Adaptive token sampling for efficient vision transformers. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 396\u2013414). Cham: Springer."},{"key":"99_CR146","first-page":"1389","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"J. Zhang","year":"2023","unstructured":"Zhang, J., Li, X., Li, J., Liu, L., Xue, Z., Zhang, B., Jiang, Z., Huang, T., Wang, Y., & Wang, C. (2023). Rethinking mobile block for efficient attention-based models. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 1389\u20131400). Piscataway: IEEE."},{"key":"99_CR147","first-page":"1","volume-title":"Proceedings of the 10th international conference on learning representations","author":"S. Yu","year":"2022","unstructured":"Yu, S., Chen, T., Shen, J., Yuan, H., Tan, J., Yang, S., Liu, J., & Wang, Z. (2022). Unified visual transformer compression. In Proceedings of the 10th international conference on learning representations (pp. 1\u201317). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=9jsZiUgkCZP."},{"key":"99_CR148","first-page":"19974","volume-title":"Proceedings of the 35th international conference on neural information processing systems","author":"T. Chen","year":"2021","unstructured":"Chen, T., Cheng, Y., Gan, Z., Yuan, L., Zhang, L., & Wang, Z. (2021). Chasing sparsity in vision transformers: an end-to-end exploration. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, & J. W. Vaughan (Eds.), Proceedings of the 35th international conference on neural information processing systems (pp. 19974\u201319988). Red Hook: Curran Associates."},{"key":"99_CR149","unstructured":"Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint. arXiv:1503.02531."},{"key":"99_CR150","unstructured":"Du, D., Gong, G., & Chu, X. (2024). Model quantization and hardware acceleration for vision transformers: a comprehensive survey. arXiv preprint. arXiv:2405.00314."},{"key":"99_CR151","first-page":"28092","volume-title":"Proceedings of the 35th international conference on neural information processing systems","author":"Z. Liu","year":"2021","unstructured":"Liu, Z., Wang, Y., Han, K., Zhang, W., Ma, S., & Gao, W. (2021). Post-training quantization for vision transformer. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, & J. W. Vaughan (Eds.), Proceedings of the 35th international conference on neural information processing systems (pp. 28092\u201328103). Red Hook: Curran Associates."},{"key":"99_CR152","doi-asserted-by":"crossref","unstructured":"Ainslie, J., Lee-Thorp, J., De Jong, M., Zemlyanskiy, Y., Lebr\u00f3n, F., & Sanghai, S. (2023). GQA: training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint. arXiv:2305.13245.","DOI":"10.18653\/v1\/2023.emnlp-main.298"},{"key":"99_CR153","unstructured":"Shazeer, N. (2019). Fast transformer decoding: one write-head is all you need. arXiv preprint. arXiv:1911.02150."},{"key":"99_CR154","first-page":"4271","volume-title":"Proceedings of the 34th international conference on neural information processing systems","author":"Z. Dai","year":"2020","unstructured":"Dai, Z., Lai, G., Yang, Y., & Le, Q. (2020). Funnel-transformer: filtering out sequential redundancy for efficient language processing. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, & H. Lin (Eds.), Proceedings of the 34th international conference on neural information processing systems (pp. 4271\u20134282). Red Hook: Curran Associates."},{"key":"99_CR155","first-page":"3744","volume-title":"Proceedings of the 36th international conference on machine learning","author":"J. Lee","year":"2019","unstructured":"Lee, J., Lee, Y., Kim, J., Kosiorek, A. R., Choi, S., & Teh, Y. W. (2019). Set transformer: a framework for attention-based permutation-invariant neural networks. In K. Chaudhuri & R. Salakhutdinov (Eds.), Proceedings of the 36th international conference on machine learning (pp. 3744\u20133753). Retrieved October 17, 2025, from http:\/\/proceedings.mlr.press\/v97\/lee19d.html."},{"key":"99_CR156","unstructured":"Wang, S., Li, B. Z., Khabsa, M., Fang, H., & Ma, H. (2020). Linformer: self-attention with linear complexity. arXiv preprint. arXiv:2006.04768."},{"key":"99_CR157","unstructured":"Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Belanger, D., Colwell, L., et\u00a0al. (2020). Masked language modeling for proteins via linearly scalable long-context transformers. arXiv preprint. arXiv:2006.03555."},{"key":"99_CR158","first-page":"1","volume-title":"Proceedings of the 9th international conference on learning representations","author":"D. Lepikhin","year":"2021","unstructured":"Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2021). GShard: scaling giant models with conditional computation and automatic sharding. In Proceedings of the 9th international conference on learning representations (pp. 1\u201323). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=qrwe7XHTmYb."},{"issue":"120","key":"99_CR159","first-page":"1","volume":"23","author":"W. Fedus","year":"2022","unstructured":"Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1\u201339.","journal-title":"Journal of Machine Learning Research"},{"key":"99_CR160","doi-asserted-by":"crossref","unstructured":"Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Cao, H., Cheng, X., Chung, M., Grella, M., GV, K. K., et\u00a0al. (2023). RWKV: reinventing RNNs for the transformer era. arXiv preprint. arXiv:2305.13048.","DOI":"10.18653\/v1\/2023.findings-emnlp.936"},{"key":"99_CR161","unstructured":"Gu, A., Goel, K., & R\u00e9, C. (2021). Efficiently modeling long sequences with structured state spaces. arXiv preprint. arXiv:2111.00396."},{"key":"99_CR162","first-page":"22982","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"A. Gupta","year":"2022","unstructured":"Gupta, A., Gu, A., & Berant, J. (2022). Diagonal state spaces are as effective as structured state spaces. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 22982\u201322994). Red Hook: Curran Associates."},{"key":"99_CR163","doi-asserted-by":"publisher","first-page":"5254","DOI":"10.18653\/v1\/2023.emnlp-main.319","volume-title":"Proceedings of the 2023 conference on empirical methods in natural language processing","author":"Z. Hu","year":"2023","unstructured":"Hu, Z., Wang, L., Lan, Y., Xu, W., Lim, E., Bing, L., Xu, X., Poria, S., & Lee, R. K. (2023). LLM-adapters: an adapter family for parameter-efficient fine-tuning of large language models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 conference on empirical methods in natural language processing (pp. 5254\u20135276). Stroudsburg: ACL."},{"key":"99_CR164","first-page":"1950","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"H. Liu","year":"2022","unstructured":"Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., & Raffel, C. (2022). Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 1950\u20131965). Red Hook: Curran Associates."},{"key":"99_CR165","unstructured":"Zhang, L., Zhang, L., Shi, S., Chu, X., & Li, B. (2023). LoRA-FA: memory-efficient low-rank adaptation for large language models fine-tuning. arXiv preprint. arXiv:2308.03303."},{"key":"99_CR166","first-page":"3266","volume-title":"Proceedings of the 17th conference of the European chapter of the Association for Computational Linguistics","author":"M. Valipour","year":"2023","unstructured":"Valipour, M., Rezagholizadeh, M., Kobyzev, I., & Ghodsi, A. (2023). DyLoRA: parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In A. Vlachos & I. Augenstein (Eds.), Proceedings of the 17th conference of the European chapter of the Association for Computational Linguistics (pp. 3266\u20133279). Stroudsburg: ACL."},{"key":"99_CR167","first-page":"8187","volume-title":"Proceedings of the 62nd annual meeting of the Association for Computational Linguistics","author":"K. Lv","year":"2024","unstructured":"Lv, K., Yang, Y., Liu, T., Guo, Q., & Qiu, X. (2024). Full parameter fine-tuning for large language models with limited resources. In L. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd annual meeting of the Association for Computational Linguistics (pp. 8187\u20138198). Stroudsburg: ACL."},{"key":"99_CR168","first-page":"53038","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"S. Malladi","year":"2023","unstructured":"Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 53038\u201353075). Red Hook: Curran Associates."},{"key":"99_CR169","unstructured":"Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: accurate post-training quantization for generative pre-trained transformers. arXiv preprint. arXiv:2210.17323."},{"key":"99_CR170","first-page":"87","volume":"6","author":"J. Lin","year":"2024","unstructured":"Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., & Han, S. (2024). AWQ: activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems, 6, 87\u2013100.","journal-title":"Proceedings of Machine Learning and Systems"},{"issue":"1","key":"99_CR171","doi-asserted-by":"publisher","first-page":"195","DOI":"10.1109\/TCSVT.2024.3457610","volume":"35","author":"J. Xiao","year":"2025","unstructured":"Xiao, J., Li, Z., Li, J., Yang, L., & Gu, Q. (2025). BinaryViT: toward efficient and accurate binary vision transformers. IEEE Transactions on Circuits and Systems for Video Technology, 35(1), 195\u2013206.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"99_CR172","unstructured":"Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et\u00a0al. (2023). Llama 2: open foundation and fine-tuned chat models. arXiv preprint. arXiv:2307.09288."},{"key":"99_CR173","first-page":"1","volume-title":"Proceedings of the 10th international conference on learning representations","author":"E. J. Hu","year":"2022","unstructured":"Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: low-rank adaptation of large language models. In Proceedings of the 10th international conference on learning representations (pp. 1\u201313). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=nZeVKeeFYf9."},{"key":"99_CR174","first-page":"2556","volume-title":"Proceedings of the 56th annual meeting of the Association for Computational Linguistics","author":"P. Sharma","year":"2018","unstructured":"Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018). Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In I. Gurevych & Y. Miyao (Eds.), Proceedings of the 56th annual meeting of the Association for Computational Linguistics (pp. 2556\u20132565). Stroudsburg: ACL."},{"key":"99_CR175","first-page":"3558","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"S. Changpinyo","year":"2021","unstructured":"Changpinyo, S., Sharma, P., Ding, N., & Soricut, R. (2021). Conceptual 12m: pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 3558\u20133568). Piscataway: IEEE."},{"key":"99_CR176","first-page":"1143","volume-title":"Proceedings of the 25th international conference on neural information processing systems","author":"V. Ordonez","year":"2011","unstructured":"Ordonez, V., Kulkarni, G., & Berg, T. L. (2011). Im2Text: describing images using 1 million captioned photographs. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Proceedings of the 25th international conference on neural information processing systems (pp. 1143\u20131151). Red Hook: Curran Associates."},{"key":"99_CR177","first-page":"25278","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"C. Schuhmann","year":"2022","unstructured":"Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al. (2022). LAION-5B: an open large-scale dataset for training next generation image-text models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 25278\u201325294). Red Hook: Curran Associates."},{"key":"99_CR178","unstructured":"Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., & Komatsuzaki, A. (2021). LAION-400m: open dataset of clip-filtered 400 million image-text pairs. arXiv preprint. arXiv:2111.02114."},{"key":"99_CR179","unstructured":"Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., & Kim, S. (2022). Coyo-700m: image-text pair dataset. Retrieved October 17, 2025, from https:\/\/github.com\/kakaobrain\/coyo-dataset."},{"key":"99_CR180","unstructured":"Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Doll\u00e1r, P., & Zitnick, C. L. (2015). Microsoft COCO captions: data collection and evaluation server. arXiv preprint. arXiv:1504.00325."},{"key":"99_CR181","first-page":"787","volume-title":"Proceedings of the 2014 conference on empirical methods in natural language processing","author":"S. Kazemzadeh","year":"2014","unstructured":"Kazemzadeh, S., Ordonez, V., Matten, M., & Berg, T. L. (2014). Referitgame: referring to objects in photographs of natural scenes. In A. Moschitti, B. Pang, & W. Daelemans (Eds.), Proceedings of the 2014 conference on empirical methods in natural language processing (pp. 787\u2013798). Stroudsburg: ACL."},{"key":"99_CR182","first-page":"2200","volume-title":"Proceedings of the IEEE\/CVF winter conference on applications of computer vision","author":"M. Mathew","year":"2021","unstructured":"Mathew, M., Karatzas, D., & Jawahar, C. (2021). DocVQA: a dataset for VQA on document images. In Proceedings of the IEEE\/CVF winter conference on applications of computer vision (pp. 2200\u20132209). Piscataway: IEEE."},{"key":"99_CR183","first-page":"8958","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"W. Zhu","year":"2023","unstructured":"Zhu, W., Hessel, J., Awadalla, A., Gadre, S. Y., Dodge, J., Fang, A., Yu, Y., Schmidt, L., Wang, W. Y., & Choi, Y. (2023). Multimodal C4: an open, billion-scale corpus of images interleaved with text. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 8958\u20138974). Red Hook: Curran Associates."},{"key":"99_CR184","first-page":"71683","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"H. Lauren\u00e7on","year":"2023","unstructured":"Lauren\u00e7on, H., Saulnier, L., Tronchon, L., Bekman, S., Singh, A., Lozhkov, A., Wang, T., Karamcheti, S., Rush, A. M., Kiela, D., et al. (2023). OBELICS: an open web-scale filtered dataset of interleaved image-text documents. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 71683\u201371702). Red Hook: Curran Associates."},{"key":"99_CR185","first-page":"740","volume-title":"Proceedings of the 13th European conference on computer vision","author":"T. Lin","year":"2014","unstructured":"Lin, T., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., & Zitnick, C. L. (2014). Microsoft COCO: common objects in context. In D. J. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Proceedings of the 13th European conference on computer vision (pp. 740\u2013755). Cham: Springer."},{"key":"99_CR186","first-page":"4015","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"A. Kirillov","year":"2023","unstructured":"Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., et al. (2023). Segment anything. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 4015\u20134026). Piscataway: IEEE."},{"key":"99_CR187","first-page":"742","volume-title":"Proceedings of the 16th European conference on computer vision","author":"O. Sidorov","year":"2020","unstructured":"Sidorov, O., Hu, R., Rohrbach, M., & Singh, A. (2020). Textcaps: a dataset for image captioning with reading comprehension. In A. Vedaldi, H. Bischof, T. Brox, & J. Frahm (Eds.), Proceedings of the 16th European conference on computer vision (pp. 742\u2013758). Cham: Springer."},{"key":"99_CR188","unstructured":"Saleh, B., & Elgammal, A. (2015). Large-scale classification of fine-art paintings: learning the right metric on the right feature. arXiv preprint. arXiv:1505.00855."},{"issue":"1","key":"99_CR189","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","volume":"123","author":"R. Krishna","year":"2017","unstructured":"Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., et al. (2017). Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1), 32\u201373.","journal-title":"International Journal of Computer Vision"},{"key":"99_CR190","unstructured":"OpenAI (2023). ShareGPT. Retrieved October 17, 2025, from https:\/\/sharegpt.com\/."},{"key":"99_CR191","first-page":"146","volume-title":"Proceedings of the 17th European conference on computer vision","author":"D. Schwenk","year":"2022","unstructured":"Schwenk, D., Khandelwal, A., Clark, C., Marino, K., & Mottaghi, R. (2022). A-OKVQA: a benchmark for visual question answering using world knowledge. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 146\u2013162). Cham: Springer."},{"key":"99_CR192","first-page":"3195","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"K. Marino","year":"2019","unstructured":"Marino, K., Rastegari, M., Farhadi, A., & Mottaghi, R. (2019). OK-VQA: a visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 3195\u20133204). Piscataway: IEEE."},{"key":"99_CR193","first-page":"11","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"J. Mao","year":"2016","unstructured":"Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A. L., & Murphy, K. (2016). Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 11\u201320). Piscataway: IEEE."},{"key":"99_CR194","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"F. Liu","year":"2024","unstructured":"Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., & Wang, L. (2024). Mitigating hallucination in large multi-modal models via robust instruction tuning. In Proceedings of the 12th international conference on learning representations (pp. 1\u201345). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=J44HfH4JCg."},{"key":"99_CR195","first-page":"5356","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"A. Gupta","year":"2019","unstructured":"Gupta, A., Dollar, P., & Girshick, R. (2019). LVIS: a dataset for large vocabulary instance segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 5356\u20135364)."},{"key":"99_CR196","unstructured":"LAION (2023). GPT-4V dataset. Retrieved October 17, 2025, from https:\/\/huggingface.co\/datasets\/laion\/gpt4v-dataset."},{"key":"99_CR197","unstructured":"Zhao, B., Wu, B., & Huang, T. (2023). SVIT: scaling up visual instruction tuning. arXiv preprint. arXiv:2307.04087."},{"key":"99_CR198","unstructured":"Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., & Jiang, D. (2023). WizardLM: empowering large language models to follow complex instructions. arXiv preprint. arXiv:2304.12244."},{"key":"99_CR199","first-page":"2507","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"P. Lu","year":"2022","unstructured":"Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K., Zhu, S., Tafjord, O., Clark, P., & Kalyan, A. (2022). Learn to explain: multimodal reasoning via thought chains for science question answering. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 2507\u20132521). Red Hook: Curran Associates."},{"key":"99_CR200","first-page":"3608","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"D. Gurari","year":"2018","unstructured":"Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., & Bigham, J. P. (2018). VizWiz grand challenge: answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3608\u20133617). Piscataway: IEEE."},{"key":"99_CR201","first-page":"9556","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"X. Yue","year":"2024","unstructured":"Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al. (2024). MMMU: a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 9556\u20139567). Piscataway: IEEE."},{"key":"99_CR202","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"P. Lu","year":"2024","unstructured":"Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K., Galley, M., & Gao, J. (2024). MathVista: evaluating mathematical reasoning of foundation models in visual contexts. In Proceedings of the 12th international conference on learning representations (pp. 1\u2013116). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=KUNzEQMWU7."},{"key":"99_CR203","doi-asserted-by":"crossref","unstructured":"Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., & Shan, Y. (2023). SEED-Bench: benchmarking multimodal LLMs with generative comprehension. arXiv preprint. arXiv:2307.16125.","DOI":"10.1109\/CVPR52733.2024.01263"},{"key":"99_CR204","first-page":"57730","volume-title":"Proceedings of the 41st international conference on machine learning","author":"W. Yu","year":"2024","unstructured":"Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., & Wang, L. (2024). MM-Vet: evaluating large multimodal models for integrated capabilities. In Proceedings of the 41st international conference on machine learning (pp. 57730\u201357754). Retrieved October 17, 2025, from https:\/\/openreview.net\/forum?id=KOTutrSR2y."},{"issue":"12","key":"99_CR205","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-024-4235-6","volume":"67","author":"Y. Liu","year":"2024","unstructured":"Liu, Y., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.-C., Liu, C.-L., Jin, L., & Bai, X. (2024). OCRBench: on the hidden mystery of OCR in large multimodal models. Science China. Information Sciences, 67(12), 220102.","journal-title":"Science China. Information Sciences"},{"key":"99_CR206","first-page":"2263","volume-title":"Findings of the Association for Computational Linguistics","author":"A. Masry","year":"2022","unstructured":"Masry, A., Long, D. X., Tan, J. Q., Joty, S. R., & Hoque, E. (2022). ChartQA: a benchmark for question answering about charts with visual and logical reasoning. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Findings of the Association for Computational Linguistics (pp. 2263\u20132279). Stroudsburg: ACL."},{"key":"99_CR207","unstructured":"OpenCompass contributors. Opencompass: a universal evaluation platform for foundation models. Retrieved October 17, 2025, from https:\/\/github.com\/open-compass\/opencompass."},{"key":"99_CR208","first-page":"2694","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"H. Ding","year":"2023","unstructured":"Ding, H., Liu, C., He, S., Jiang, X., & Loy, C. C. (2023). MeViS: a large-scale benchmark for video segmentation with motion expressions. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 2694\u20132703). Piscataway: IEEE."},{"key":"99_CR209","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2025.3600507","author":"H. Ding","year":"2025","unstructured":"Ding, H., Liu, C., He, S., Ying, K., Jiang, X., Loy, C. C., & Jiang, Y.-G. (2025). MeViS: a multi-modal dataset for referring motion expression video segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Advance online publication. https:\/\/doi.org\/10.1109\/TPAMI.2025.3600507.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"99_CR210","unstructured":"Ding, H., Tang, S., He, S., Liu, C., Wu, Z., & Jiang, Y.-G. (2025). Multimodal referring segmentation: a survey. arXiv preprint. arXiv:2508.00265."},{"issue":"3","key":"99_CR211","volume":"1","author":"T. Tu","year":"2024","unstructured":"Tu, T., Azizi, S., Driess, D., Schaekermann, M., Amin, M., Chang, P.-C., Carroll, A., Lau, C., Tanno, R., Ktena, I., et al. (2024). Towards generalist biomedical AI. New England Journal of Medicine Artificial Intelligence, 1(3), AIoa2300138.","journal-title":"New England Journal of Medicine Artificial Intelligence"},{"key":"99_CR212","first-page":"3478","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"S. Azizi","year":"2021","unstructured":"Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., Deaton, J., Loh, A., Karthikesalingam, A., Kornblith, S., Chen, T., et al. (2021). Big self-supervised models advance medical image classification. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 3478\u20133488). Piscataway: IEEE."},{"key":"99_CR213","unstructured":"Han, Y., Zhang, C., Chen, X., Yang, X., Wang, Z., Yu, G., Fu, B., & Zhang, H. (2023). ChartLlama: a multimodal LLM for chart understanding and generation. arXiv preprint. arXiv:2311.16483."},{"key":"99_CR214","unstructured":"Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., & Qiao, Y. (2023). VideoChat: chat-centric video understanding. arXiv preprint. arXiv:2305.06355."},{"key":"99_CR215","first-page":"543","volume-title":"Proceedings of the 2023 conference on empirical methods in natural language processing","author":"H. Zhang","year":"2023","unstructured":"Zhang, H., Li, X., & Bing, L. (2023). Video-LLaMA: an instruction-tuned audio-visual language model for video understanding. In Y. Feng & E. Lefever (Eds.), Proceedings of the 2023 conference on empirical methods in natural language processing (pp. 543\u2013553). Stroudsburg: ACL."},{"key":"99_CR216","unstructured":"Mai, G., Huang, W., Sun, J., Song, S., Mishra, D., Liu, N., Gao, S., Liu, T., Cong, G., Hu, Y., et\u00a0al. (2023). On the opportunities and challenges of foundation models for geospatial artificial intelligence. arXiv preprint. arXiv:2304.06798."},{"key":"99_CR217","unstructured":"Shuai, X., Ding, H., Ma, X., Tu, R., Jiang, Y.-G., & Tao, D. (2024). A survey of multimodal-guided image editing with text-to-image diffusion models. arXiv preprint. arXiv:2406.14555."}],"container-title":["Visual Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-025-00099-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44267-025-00099-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-025-00099-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T09:51:21Z","timestamp":1765273881000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44267-025-00099-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12]]},"references-count":217,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["99"],"URL":"https:\/\/doi.org\/10.1007\/s44267-025-00099-6","relation":{},"ISSN":["2097-3330","2731-9008"],"issn-type":[{"value":"2097-3330","type":"print"},{"value":"2731-9008","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12]]},"assertion":[{"value":"24 April 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 November 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 November 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 December 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors have no relevant financial or non-financial interests to disclose.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"27"}}