{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T05:03:57Z","timestamp":1765343037580,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":35,"publisher":"ACM","funder":[{"name":"JSPS KAKENHI","award":["JP21H04907 and JP24H00732"],"award-info":[{"award-number":["JP21H04907 and JP24H00732"]}]},{"name":"JST CREST","award":["JPMJCR20D3"],"award-info":[{"award-number":["JPMJCR20D3"]}]},{"name":"JST AIP Acceleration","award":["JPMJCR24U3"],"award-info":[{"award-number":["JPMJCR24U3"]}]},{"name":"JST K Program","award":["JPMJKP24C2"],"award-info":[{"award-number":["JPMJKP24C2"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,27]]},"DOI":"10.1145\/3746027.3755623","type":"proceedings-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T07:27:39Z","timestamp":1761377259000},"page":"4808-4816","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-5902-1567","authenticated-orcid":false,"given":"Futa","family":"Waseda","sequence":"first","affiliation":[{"name":"The University of Tokyo, Tokyo, Japan and National Institute of Informatics, Tokyo, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0061-0680","authenticated-orcid":false,"given":"Saku","family":"Sugawara","sequence":"additional","affiliation":[{"name":"National Institute of Informatics, Tokyo, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4908-1860","authenticated-orcid":false,"given":"Isao","family":"Echizen","sequence":"additional","affiliation":[{"name":"National Institute of Informatics, Tokyo, Japan and The University of Tokyo, Tokyo, Japan"}]}],"member":"320","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Jean-Baptiste Alayrac Jeff Donahue Pauline Luc Antoine Miech Iain Barr Yana Hasson Karel Lenc Arthur Mensch Katherine Millican Malcolm Reynolds et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35 (2022) 23716--23736."},{"key":"e_1_3_2_1_2_1","volume-title":"Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390","author":"Awadalla Anas","year":"2023","unstructured":"Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. 2023. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390 (2023)."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1001\/jama.2017.14585"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10599-4_29"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2017.49"},{"key":"e_1_3_2_1_6_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 24185--24198","author":"Chen Zhe","year":"2024","unstructured":"Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. 2024. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 24185--24198."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.461"},{"key":"e_1_3_2_1_8_1","volume-title":"Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 215--223","author":"Coates Adam","year":"2011","unstructured":"Adam Coates, Andrew Ng, and Honglak Lee. 2011. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 215--223."},{"key":"e_1_3_2_1_9_1","volume-title":"International conference on machine learning. PMLR, 2206--2216","author":"Croce Francesco","year":"2020","unstructured":"Francesco Croce and Matthias Hein. 2020. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning. PMLR, 2206--2216."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2006.79"},{"key":"e_1_3_2_1_12_1","unstructured":"Ian J Goodfellow Jonathon Shlens and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In ICLR."},{"key":"e_1_3_2_1_13_1","unstructured":"Gregory Griffin Alex Holub Pietro Perona et al. 2007. Caltech-256 object category dataset. Technical Report. Technical Report 7694 California Institute of Technology Pasadena."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSTARS.2019.2918242"},{"key":"e_1_3_2_1_15_1","volume-title":"The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization. ICCV","author":"Hendrycks Dan","year":"2021","unstructured":"Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. 2021. The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization. ICCV (2021)."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2013.77"},{"key":"e_1_3_2_1_17_1","unstructured":"Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report. University of Toronto."},{"key":"e_1_3_2_1_18_1","volume-title":"Tiny imagenet visual recognition challenge. CS 231N 7, 7","author":"Le Yann","year":"2015","unstructured":"Yann Le and Xuan Yang. 2015. Tiny imagenet visual recognition challenge. CS 231N 7, 7 (2015), 3."},{"key":"e_1_3_2_1_19_1","volume-title":"International conference on machine learning. PMLR","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730--19742."},{"key":"e_1_3_2_1_20_1","volume-title":"Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34","author":"Li Junnan","year":"2021","unstructured":"Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34 (2021), 9694--9705."},{"key":"e_1_3_2_1_21_1","unstructured":"Haotian Liu Chunyuan Li Qingyang Wu and Yong Jae Lee. 2023. Visual Instruction Tuning."},{"key":"e_1_3_2_1_22_1","volume-title":"Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083","author":"Madry Aleksander","year":"2017","unstructured":"Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)."},{"key":"e_1_3_2_1_23_1","volume-title":"Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151","author":"Maji Subhransu","year":"2013","unstructured":"Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. 2013. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)."},{"key":"e_1_3_2_1_24_1","volume-title":"Understanding zero-shot adversarial robustness for large-scale models. ICLR","author":"Mao Chengzhi","year":"2022","unstructured":"Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl Vondrick. 2022. Understanding zero-shot adversarial robustness for large-scale models. ICLR (2022)."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICVGIP.2008.47"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2012.6248092"},{"key":"e_1_3_2_1_27_1","volume-title":"International conference on machine learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748--8763."},{"key":"e_1_3_2_1_28_1","volume-title":"Francesco Croce, and Matthias Hein.","author":"Schlarmann Christian","year":"2024","unstructured":"Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. 2024. Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models. arXiv preprint arXiv:2402.12336 (2024)."},{"key":"e_1_3_2_1_29_1","unstructured":"Christian Szegedy Wojciech Zaremba Ilya Sutskever Joan Bruna Dumitru Erhan Ian Goodfellow and Rob Fergus. 2014. Intriguing properties of neural networks. In ICLR."},{"key":"e_1_3_2_1_30_1","unstructured":"Haohan Wang Songwei Ge Zachary Lipton and Eric P Xing. 2019. Learning Robust Global Representations by Penalizing Local Predictive Power. In Advances in Neural Information Processing Systems. 10506--10518."},{"key":"e_1_3_2_1_31_1","volume-title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191","author":"Wang Peng","year":"2024","unstructured":"Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191 (2024)."},{"key":"e_1_3_2_1_32_1","volume-title":"Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness. CVPR","author":"Wang Sibo","year":"2024","unstructured":"Sibo Wang, Jie Zhang, Zheng Yuan, and Shiguang Shan. 2024. Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness. CVPR (2024)."},{"volume-title":"Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition","author":"Xiao Jianxiong","key":"e_1_3_2_1_33_1","unstructured":"Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. 2010. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 3485--3492."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01522"},{"key":"e_1_3_2_1_35_1","first-page":"96424","article-title":"Text-guided attention is all you need for zero-shot robustness in vision-language models","volume":"37","author":"Yu Lu","year":"2025","unstructured":"Lu Yu, Haiyang Zhang, and Changsheng Xu. 2025. Text-guided attention is all you need for zero-shot robustness in vision-language models. Advances in Neural Information Processing Systems 37 (2025), 96424--96448.","journal-title":"Advances in Neural Information Processing Systems"}],"event":{"name":"MM '25: The 33rd ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Dublin Ireland","acronym":"MM '25"},"container-title":["Proceedings of the 33rd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746027.3755623","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T04:59:00Z","timestamp":1765342740000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746027.3755623"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":35,"alternative-id":["10.1145\/3746027.3755623","10.1145\/3746027"],"URL":"https:\/\/doi.org\/10.1145\/3746027.3755623","relation":{},"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"2025-10-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}