{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T19:47:10Z","timestamp":1765309630177,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":51,"publisher":"ACM","funder":[{"name":"National Science Foundation of China","award":["62206048"],"award-info":[{"award-number":["62206048"]}]},{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"publisher","award":["BK20220819"],"award-info":[{"award-number":["BK20220819"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"publisher","award":["2242025K30024"],"award-info":[{"award-number":["2242025K30024"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Big Data Computing Center of Southeast University"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,27]]},"DOI":"10.1145\/3746027.3755764","type":"proceedings-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T06:54:17Z","timestamp":1761375257000},"page":"5130-5139","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Enhancing Multimodal In-Context Learning for Image Classification through Coreset Optimization"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-9263-7190","authenticated-orcid":false,"given":"Huiyi","family":"Chen","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-1751-239X","authenticated-orcid":false,"given":"Jiawei","family":"Peng","sequence":"additional","affiliation":[{"name":"School of 
Computer Science and Engineering, Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1008-7053","authenticated-orcid":false,"given":"Kaihua","family":"Tang","sequence":"additional","affiliation":[{"name":"Huawei Singapore Research Center, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7729-0622","authenticated-orcid":false,"given":"Xin","family":"Geng","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8276-2679","authenticated-orcid":false,"given":"Xu","family":"Yang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Southeast University, Nanjing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Jean-Baptiste Alayrac Jeff Donahue Pauline Luc Antoine Miech Iain Barr Yana Hasson Karel Lenc Arthur Mensch Katherine Millican Malcolm Reynolds et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems Vol. 35 (2022) 23716-23736."},{"key":"e_1_3_2_1_2_1","volume-title":"Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390","author":"Awadalla Anas","year":"2023","unstructured":"Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al., 2023. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390 (2023)."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00161"},{"key":"e_1_3_2_1_4_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. 
Language models are few-shot learners. Advances in neural information processing systems Vol. 33 (2020) 1877-1901."},{"key":"e_1_3_2_1_5_1","volume-title":"Data curation alone can stabilize in-context learning. arXiv preprint arXiv:2212.10378","author":"Chang Ting-Yun","year":"2022","unstructured":"Ting-Yun Chang and Robin Jia. 2022. Data curation alone can stabilize in-context learning. arXiv preprint arXiv:2212.10378 (2022)."},{"key":"e_1_3_2_1_6_1","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV).","author":"Das Anurag","year":"2024","unstructured":"Anurag Das, Xinting Hu, Li Jiang, and Bernt Schiele. 2024. MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment. In Proceedings of the European Conference on Computer Vision (ECCV)."},{"key":"e_1_3_2_1_7_1","unstructured":"Qingxiu Dong Lei Li Damai Dai Ce Zheng Jingyuan Ma Rui Li Heming Xia Jingjing Xu Zhiyong Wu Tianyu Liu et al. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022)."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.01719"},{"key":"e_1_3_2_1_9_1","volume-title":"African or european swallow? benchmarking large vision-language models for fine-grained object classification. arXiv preprint arXiv:2406.14496","author":"Geigle Gregor","year":"2024","unstructured":"Gregor Geigle, Radu Timofte, and Goran Glava\u0161. 2024. African or european swallow? benchmarking large vision-language models for fine-grained object classification. arXiv preprint arXiv:2406.14496 (2024)."},{"key":"e_1_3_2_1_10_1","volume-title":"Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models. arXiv preprint arXiv:2501.15140","author":"He Hulingxiao","year":"2025","unstructured":"Hulingxiao He, Geng Li, Zijun Geng, Jinglin Xu, and Yuxin Peng. 2025. Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models. 
arXiv preprint arXiv:2501.15140 (2025)."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00384"},{"key":"e_1_3_2_1_13_1","volume-title":"Visual instruction tuning towards general-purpose multimodal model: A survey. arXiv preprint arXiv:2312.16602","author":"Huang Jiaxing","year":"2023","unstructured":"Jiaxing Huang, Jingyi Zhang, Kai Jiang, Han Qiu, and Shijian Lu. 2023. Visual instruction tuning towards general-purpose multimodal model: A survey. arXiv preprint arXiv:2312.16602 (2023)."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.02776"},{"key":"e_1_3_2_1_15_1","volume-title":"Proc. CVPR workshop on fine-grained visual categorization (FGVC)","volume":"2","author":"Khosla Aditya","year":"2011","unstructured":"Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. 2011. Novel dataset for fine-grained image categorization: Stanford dogs. In Proc. CVPR workshop on fine-grained visual categorization (FGVC), Vol. 2."},{"key":"e_1_3_2_1_16_1","first-page":"71683","article-title":"Obelics: An open web-scale filtered dataset of interleaved image-text documents","volume":"36","author":"Lauren\u00e7on Hugo","year":"2023","unstructured":"Hugo Lauren\u00e7on, Lucile Saulnier, L\u00e9o Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al., 2023. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, Vol. 
36 (2023), 71683-71702.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_17_1","volume-title":"Advances in Neural Information Processing Systems","volume":"36","author":"Lauren\u00e7on Hugo","year":"2024","unstructured":"Hugo Lauren\u00e7on, Lucile Saulnier, L\u00e9o Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al., 2024. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, Vol. 36 (2024)."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.78"},{"key":"e_1_3_2_1_19_1","volume-title":"2023 f. Otterhd: A high-resolution multi-modality model. arXiv preprint arXiv:2311.04219","author":"Li Bo","year":"2023","unstructured":"Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. 2023 f. Otterhd: A high-resolution multi-modality model. arXiv preprint arXiv:2311.04219 (2023)."},{"key":"e_1_3_2_1_20_1","volume-title":"Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425","author":"Li Bo","year":"2023","unstructured":"Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. 2023d. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425 (2023)."},{"key":"e_1_3_2_1_21_1","volume-title":"2023 e. Otter: A Multi-Modal Model with In-Context Instruction Tuning. arXiv preprint arXiv:2305.03726","author":"Li Bo","year":"2023","unstructured":"Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023 e. Otter: A Multi-Modal Model with In-Context Instruction Tuning. arXiv preprint arXiv:2305.03726 (2023)."},{"key":"e_1_3_2_1_22_1","volume-title":"International conference on machine learning. PMLR","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 
2023a. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730-19742."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02522"},{"key":"e_1_3_2_1_24_1","unstructured":"Lei Li Yuwei Yin Shicheng Li Liang Chen Peiyi Wang Shuhuai Ren Mukai Li Yazheng Yang Jingjing Xu Xu Sun et al. 2023c. M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning. arXiv preprint arXiv:2306.04387 (2023)."},{"key":"e_1_3_2_1_25_1","volume-title":"Unified demonstration retriever for in-context learning. arXiv preprint arXiv:2305.04320","author":"Li Xiaonan","year":"2023","unstructured":"Xiaonan Li, Kai Lv, Hang Yan, Tianyang Lin, Wei Zhu, Yuan Ni, Guotong Xie, Xiaoling Wang, and Xipeng Qiu. 2023b. Unified demonstration retriever for in-context learning. arXiv preprint arXiv:2305.04320 (2023)."},{"key":"e_1_3_2_1_26_1","volume-title":"Finding support examples for in-context learning. arXiv preprint arXiv:2302.13539","author":"Li Xiaonan","year":"2023","unstructured":"Xiaonan Li and Xipeng Qiu. 2023. Finding support examples for in-context learning. arXiv preprint arXiv:2302.13539 (2023)."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3672758.3672824"},{"key":"e_1_3_2_1_28_1","volume-title":"Visual instruction tuning. Advances in neural information processing systems","author":"Liu Haotian","year":"2023","unstructured":"Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems, Vol. 36 (2023), 34892-34916."},{"key":"e_1_3_2_1_29_1","volume-title":"A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253","author":"Liu Hanchao","year":"2024","unstructured":"Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. 2024. 
A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253 (2024)."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.deelio-1.10"},{"key":"e_1_3_2_1_31_1","volume-title":"Which examples to annotate for in-context learning? towards effective and efficient selection. arXiv preprint arXiv:2310.20046","author":"Mavromatis Costas","year":"2023","unstructured":"Costas Mavromatis, Balasubramaniam Srinivasan, Zhengyuan Shen, Jiani Zhang, Huzefa Rangwala, Christos Faloutsos, and George Karypis. 2023. Which examples to annotate for in-context learning? towards effective and efficient selection. arXiv preprint arXiv:2310.20046 (2023)."},{"key":"e_1_3_2_1_32_1","volume-title":"Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837","author":"Min Sewon","year":"2022","unstructured":"Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837 (2022)."},{"volume-title":"What in-context learning ''learns'' in-context: Disentangling task recognition and task learning. Master's thesis","author":"Pan Jane","key":"e_1_3_2_1_33_1","unstructured":"Jane Pan. 2023. What in-context learning ''learns'' in-context: Disentangling task recognition and task learning. Master's thesis. Princeton University."},{"key":"e_1_3_2_1_34_1","volume-title":"Sub-SA: Strengthen In-context Learning via Submodular Selective Annotation. arXiv preprint arXiv:2407.05693","author":"Qian Jian","year":"2024","unstructured":"Jian Qian, Miao Sun, Sifan Zhou, Ziyu Zhao, Ruizhi Hun, and Patrick Chiang. 2024. Sub-SA: Strengthen In-context Learning via Submodular Selective Annotation. arXiv preprint arXiv:2407.05693 (2024)."},{"key":"e_1_3_2_1_35_1","volume-title":"International conference on machine learning. 
PMLR, 8748-8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748-8763."},{"key":"e_1_3_2_1_36_1","volume-title":"A survey of deep active learning. ACM computing surveys (CSUR)","author":"Ren Pengzhen","year":"2021","unstructured":"Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B Gupta, Xiaojiang Chen, and Xin Wang. 2021. A survey of deep active learning. ACM computing surveys (CSUR), Vol. 54, 9 (2021), 1-40."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"crossref","unstructured":"Olga Russakovsky Jia Deng Hao Su Jonathan Krause Sanjeev Satheesh Sean Ma Zhiheng Huang Andrej Karpathy Aditya Khosla Michael Bernstein et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision Vol. 115 (2015) 211-252.","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_2_1_38_1","volume-title":"Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489","author":"Sener Ozan","year":"2017","unstructured":"Ozan Sener and Silvio Savarese. 2017. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489 (2017)."},{"key":"e_1_3_2_1_39_1","first-page":"200","article-title":"Multimodal few-shot learning with frozen language models","volume":"34","author":"Tsimpoukelli Maria","year":"2021","unstructured":"Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, Vol. 
34 (2021), 200-212.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_40_1","unstructured":"Peng Wang Shuai Bai Sinan Tan Shijie Wang Zhihao Fan Jinze Bai Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge et al. 2024. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)."},{"key":"e_1_3_2_1_41_1","unstructured":"Peter Welinder Steve Branson Takeshi Mita Catherine Wah Florian Schroff Serge Belongie and Pietro Perona. 2010. Caltech-UCSD birds 200. (2010)."},{"key":"e_1_3_2_1_42_1","volume-title":"Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models. arXiv e-prints","author":"Yang Xu","year":"2023","unstructured":"Xu Yang, Yingzhe Peng, Haoxuan Ma, Shuo Xu, Chi Zhang, Yucheng Han, and Hanwang Zhang. 2023a. Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models. arXiv e-prints (2023), arXiv-2312."},{"key":"e_1_3_2_1_43_1","first-page":"40924","article-title":"Exploring diverse in-context configurations for image captioning","volume":"36","author":"Yang Xu","year":"2023","unstructured":"Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, and Xin Geng. 2023b. Exploring diverse in-context configurations for image captioning. Advances in Neural Information Processing Systems, Vol. 36 (2023), 40924-40943.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_44_1","unstructured":"Qinghao Ye Haiyang Xu Guohai Xu Jiabo Ye Ming Yan Yiyang Zhou Junyang Wang Anwen Hu Pengcheng Shi Yaya Shi et al. 2023. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)."},{"key":"e_1_3_2_1_45_1","volume-title":"End-to-end spoken conversational question answering: Task, dataset and model. arXiv preprint arXiv:2204.14272","author":"You Chenyu","year":"2022","unstructured":"Chenyu You, Nuo Chen, Fenglin Liu, Shen Ge, Xian Wu, and Yuexian Zou. 
2022. End-to-end spoken conversational question answering: Task, dataset and model. arXiv preprint arXiv:2204.14272 (2022)."},{"key":"e_1_3_2_1_46_1","first-page":"3985","article-title":"MRD-Net: Multi-Modal Residual Knowledge Distillation for Spoken Question Answering","author":"You Chenyu","year":"2021","unstructured":"Chenyu You, Nuo Chen, and Yuexian Zou. 2021a. MRD-Net: Multi-Modal Residual Knowledge Distillation for Spoken Question Answering. In IJCAI. 3985-3991.","journal-title":"IJCAI."},{"key":"e_1_3_2_1_47_1","unstructured":"Chenyu You Nuo Chen and Yuexian Zou. 2021b. Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering. In Findings of the Association for Computational Linguistics: EMNLP."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01100"},{"key":"e_1_3_2_1_49_1","volume-title":"Why are visually-grounded language models bad at image classification? arXiv preprint arXiv:2405.18415","author":"Zhang Yuhui","year":"2024","unstructured":"Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, and Serena Yeung-Levy. 2024. Why are visually-grounded language models bad at image classification? arXiv preprint arXiv:2405.18415 (2024)."},{"key":"e_1_3_2_1_50_1","first-page":"17773","article-title":"What makes good examples for visual in-context learning","volume":"36","author":"Zhang Yuanhan","year":"2023","unstructured":"Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. 2023. What makes good examples for visual in-context learning? Advances in Neural Information Processing Systems, Vol. 36 (2023), 17773-17794.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_51_1","volume-title":"Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592","author":"Zhu Deyao","year":"2023","unstructured":"Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 
2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)."}],"event":{"name":"MM '25: The 33rd ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Dublin Ireland","acronym":"MM '25"},"container-title":["Proceedings of the 33rd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746027.3755764","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T19:44:52Z","timestamp":1765309492000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746027.3755764"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":51,"alternative-id":["10.1145\/3746027.3755764","10.1145\/3746027"],"URL":"https:\/\/doi.org\/10.1145\/3746027.3755764","relation":{},"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"2025-10-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}