{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,23]],"date-time":"2026-06-23T08:46:27Z","timestamp":1782204387846,"version":"3.54.5"},"reference-count":32,"publisher":"Springer Science and Business Media LLC","issue":"8","license":[{"start":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T00:00:00Z","timestamp":1779148800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T00:00:00Z","timestamp":1779148800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001638","name":"Dublin City University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001638","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Vis Comput"],"published-print":{"date-parts":[[2026,6]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Large Multimodal Models (LMMs) have achieved remarkable performance across vision-language tasks, yet their robustness against adversarial attacks remains critically underexplored. While LMMs are vulnerable to visual encoder attacks, they exhibit surprising resilience due to encoder diversity\u2014attacks optimized for CLIP fail to transfer to EVA-CLIP, especially when textual context is provided. 
We introduce the <jats:bold>Adaptive Ensemble PGD (AE-PGD)<\/jats:bold> attack, which simultaneously targets both encoders through three key innovations: (1) <jats:italic>dynamic adversarial caption selection<\/jats:italic>, combining gradient magnitude with global semantic displacement to identify the most attack-effective caption per model; (2) an <jats:italic>adaptive weight controller<\/jats:italic>, dynamically balancing each encoder\u2019s contribution using real-time loss, gradient norm, and confidence metrics; and (3) an <jats:italic>Expectation over Transforms (EoT)<\/jats:italic> gradient update ensuring robustness against input-transformation defenses. Evaluated on COCO 2014 images, AE-PGD reduces accuracy from a 75.42% baseline to 0.0% across all three evaluation metrics\u2014visual encoding, image-to-text recall, and LLM answer recall\u2014achieving complete model collapse. Manifold analysis confirms that adversarial perturbations push image embeddings to antipodal regions of the joint embedding space, activating semantically opposite concept clusters and producing structured hallucinations. WordNet WUP similarity analysis reveals a 33.5 percentage point semantic drop across the test set. AE-PGD causes state-of-the-art LMMs (LLaVA, Qwen-VL, GPT-4V) to catastrophically misidentify a bullet train as a \u201chelicopter crash,\u201d with strong black-box transfer yielding a 65 percentage point recall collapse on unseen encoders. 
This work exposes critical vulnerabilities in current LMM architectures and underscores the urgent need for ensemble-aware defense mechanisms.\n                  <\/jats:p>","DOI":"10.1007\/s00371-026-04480-4","type":"journal-article","created":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T16:46:07Z","timestamp":1779209167000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Adaptive ensemble attack: breaking Large Multimodal Models via dynamic caption selection and weighted gradients"],"prefix":"10.1007","volume":"42","author":[{"given":"Sudhir Kumar","family":"Pandey","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jian-Xun","family":"Mi","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Israr","family":"Ahmad","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Muhammad Salman","family":"Pathan","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2026,5,19]]},"reference":[{"key":"4480_CR1","doi-asserted-by":"crossref","unstructured":"Cui, X., Aparcedo, A., Jang, Y.K., Lim, S.-N.: On the robustness of large multimodal models against image adversarial attacks. In: CVPR (2024)","DOI":"10.1109\/CVPR52733.2024.02325"},{"key":"4480_CR2","doi-asserted-by":"crossref","unstructured":"Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS, (2023)","DOI":"10.52202\/075280-1516"},{"key":"4480_CR3","unstructured":"Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. 
In: ICML, (2023)"},{"key":"4480_CR4","doi-asserted-by":"crossref","unstructured":"Dai, W., Li, J., Li, D., et al.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In: NeurIPS, (2023)","DOI":"10.52202\/075280-2142"},{"key":"4480_CR5","unstructured":"Radford, A., Kim, J.W., Hallacy, C., et al.: Learning transferable visual models from natural language supervision. In: ICML, (2021)"},{"key":"4480_CR6","doi-asserted-by":"crossref","unstructured":"Fang, Y., Wang, W., Xie, B. et al.: EVA: Exploring the limits of masked visual representation learning at scale. In: CVPR, (2023)","DOI":"10.1109\/CVPR52729.2023.01855"},{"key":"4480_CR7","unstructured":"Chiang, W.-L., Li, Z., Lin, Z., et al.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. (2023)"},{"key":"4480_CR8","unstructured":"Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR, (2018)"},{"key":"4480_CR9","unstructured":"Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML, (2020)"},{"key":"4480_CR10","doi-asserted-by":"crossref","unstructured":"Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: IEEE S&P, (2017)","DOI":"10.1109\/SP.2017.49"},{"key":"4480_CR11","unstructured":"Liu, Y., Chen, X., Liu, C., Song, D.: Delving into transferable adversarial examples and black-box attacks. In: ICLR, (2017)"},{"key":"4480_CR12","unstructured":"Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: ICML, (2018)"},{"key":"4480_CR13","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Maire, M., Belongie, S., et al.: Microsoft COCO: Common objects in context. 
In: ECCV, (2014)","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"4480_CR14","unstructured":"Bai, J., Bai, S., Yang, S., et al.: Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, (2023)"},{"key":"4480_CR15","unstructured":"OpenAI. GPT-4V(ision) system card. Technical report, OpenAI, (2023)"},{"key":"4480_CR16","doi-asserted-by":"crossref","unstructured":"Xie, Y., Zhong, J.-X., Wang, K., Ding, Y., Shan, S., Chen, X.: Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-based Adversarial Attacks. In: CVPR, (2025)","DOI":"10.1109\/CVPR52734.2025.01368"},{"key":"4480_CR17","doi-asserted-by":"crossref","unstructured":"Waseda, F., Tejero-de-Pablos, A., Echizen, I.: Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships. arXiv preprint arXiv:2405.18770, (2025)","DOI":"10.1109\/WACV61042.2026.00673"},{"key":"4480_CR18","unstructured":"Rashid, M.B., Rivas, P., et al.: A Framework for Evaluating Vision-Language Model Safety: Building Trust in AI for Public Sector Applications. arXiv preprint arXiv:2502.16361, (2025)"},{"key":"4480_CR19","unstructured":"Szegedy, C., Zaremba, W., Sutskever, I., et al.: Intriguing properties of neural networks. In: ICLR, (2014)"},{"key":"4480_CR20","unstructured":"Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: ICLR, (2015)"},{"key":"4480_CR21","doi-asserted-by":"crossref","unstructured":"Dong, Y., Liao, F., Pang, T., et al.: Boosting adversarial attacks with momentum. In: CVPR, (2018)","DOI":"10.1109\/CVPR.2018.00957"},{"key":"4480_CR22","doi-asserted-by":"crossref","unstructured":"Schlarmann, C., Hein, M.: On the adversarial robustness of multi-modal foundation models. 
In: ICCV Workshops, (2023)","DOI":"10.1109\/ICCVW60793.2023.00395"},{"key":"4480_CR23","doi-asserted-by":"crossref","unstructured":"Xie, C., Zhang, Z., Zhou, Y., et al.: Improving transferability of adversarial examples with input diversity. In: CVPR, (2019)","DOI":"10.1109\/CVPR.2019.00284"},{"key":"4480_CR24","unstructured":"Li, J., Selvaraju, R., Gotmare, A., et al.: Align before fuse: Vision and language representation learning with momentum distillation. In: NeurIPS, (2022)"},{"key":"4480_CR25","unstructured":"Kim, W., Son, B., Kim, I.: ViLT: Vision-and-Language Transformer without convolution or region supervision. In: ICML, (2021)"},{"key":"4480_CR26","unstructured":"Nagrani, A., Yang, S., Arnab, A., et al.: Attention bottlenecks for multimodal fusion. In: NeurIPS, (2021)"},{"key":"4480_CR27","unstructured":"Tram\u00e8r, F., Carlini, N., Brendel, W., Madry, A.: On adaptive attacks to adversarial example defenses. In: NeurIPS, (2020)"},{"issue":"11","key":"4480_CR28","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1145\/219717.219748","volume":"38","author":"GA Miller","year":"1995","unstructured":"Miller, G.A.: WordNet: A lexical database for English. Commun. ACM 38(11), 39\u201341 (1995)","journal-title":"Commun. ACM"},{"key":"4480_CR29","doi-asserted-by":"crossref","unstructured":"Xing, M., Feng, Z., Su, Y., Oh, C.: Learning by Erasing: Conditional Entropy Based Transferable Out-of-Distribution Detection. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), (2024)","DOI":"10.1609\/aaai.v38i6.28444"},{"key":"4480_CR30","doi-asserted-by":"crossref","unstructured":"Yang, Y., Su, Yi., An, S.: VDSSA: Ventral & Dorsal Sequential Self-attention AutoEncoder for Cognitive-Consistency Disentanglement. In: Pattern Recognition and Computer Vision (PRCV), Lecture Notes in Computer Science, vol. 
13536, Springer, Cham, (2022)","DOI":"10.1007\/978-3-031-18910-4_55"},{"key":"4480_CR31","doi-asserted-by":"crossref","unstructured":"Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: ICCV, (2023)","DOI":"10.1109\/ICCV51070.2023.01100"},{"key":"4480_CR32","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-025-16148-5","author":"I Ahmad","year":"2025","unstructured":"Ahmad, I., Shang, F., Pathan, M.S., Wajahat, A., Kim, Y.-S.: Dual-stream hybrid architecture with adaptive multi-scale boundary-aware mechanisms for robust urban change detection in smart cities. Sci. Rep. (2025). https:\/\/doi.org\/10.1038\/s41598-025-16148-5","journal-title":"Sci. Rep."}],"container-title":["The Visual Computer"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00371-026-04480-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00371-026-04480-4","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00371-026-04480-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,6,23]],"date-time":"2026-06-23T07:46:37Z","timestamp":1782200797000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00371-026-04480-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,5,19]]},"references-count":32,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2026,6]]}},"alternative-id":["4480"],"URL":"https:\/\/doi.org\/10.1007\/s00371-026-04480-4","relation":{},"ISSN":["0178-2789","1432-2315"],"issn-type":[{"value":"0178-2789","type":"print"},{"value":"1432-2315","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,5,19]]},"assertion":[{"value":"17 December 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 March 2026","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 May 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"304"}}