{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T19:24:50Z","timestamp":1757618690454,"version":"3.44.0"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2025,7,7]],"date-time":"2025-07-07T00:00:00Z","timestamp":1751846400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,7,7]],"date-time":"2025-07-07T00:00:00Z","timestamp":1751846400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach. Intell. Res."],"published-print":{"date-parts":[[2025,8]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Recent advances in deep learning research have shown remarkable achievements across many tasks in computer vision (CV) and natural language processing (NLP). At the intersection of CV and NLP is the problem of image captioning, where the related models\u2019 robustness against adversarial attacks has not been well studied. This paper presents a novel adversarial attack strategy, attention-based image captioning attack (AICAttack), designed to attack image captioning models through subtle perturbations to images. Operating within a black-box attack scenario, our algorithm requires no access to the target model\u2019s architecture, parameters, or gradient information. We introduce an attention-based candidate selection mechanism that identifies the optimal pixels for attack, followed by a customized differential evolution method to optimize the perturbations of the pixels\u2019 RGB values. We demonstrate AICAttack\u2019s effectiveness through extensive experiments on benchmark datasets against multiple victim models. The experimental results demonstrate that our method outperforms current leading-edge techniques by achieving consistently higher attack success rates.<\/jats:p>","DOI":"10.1007\/s11633-024-1535-z","type":"journal-article","created":{"date-parts":[[2025,7,7]],"date-time":"2025-07-07T01:29:52Z","timestamp":1751851792000},"page":"769-782","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["AICAttack: Adversarial Image Captioning Attack with Attention-based Optimization"],"prefix":"10.1007","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-8595-0944","authenticated-orcid":false,"given":"Jiyao","family":"Li","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mingze","family":"Ni","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yifei","family":"Dong","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tianqing","family":"Zhu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yongshun","family":"Gong","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3003-1313","authenticated-orcid":false,"given":"Wei","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2025,7,7]]},"reference":[{"issue":"1","key":"1535_CR1","doi-asserted-by":"publisher","first-page":"1003","DOI":"10.1109\/JSEN.2021.3130268","volume":"22","author":"C C Lin","year":"2022","unstructured":"C. C. Lin, C. H. Kuo, H. T. Chiang. CNN-based classification for point cloud object with bearing angle image. IEEE Sensors Journal, vol. 22, no. 1, pp. 1003\u20131011, 2022. DOI: https:\/\/doi.org\/10.1109\/JSEN.2021.3130268.","journal-title":"IEEE Sensors Journal"},{"key":"1535_CR2","doi-asserted-by":"publisher","first-page":"770","DOI":"10.1109\/CVPR.2016.90","volume-title":"Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA","author":"K He","year":"2016","unstructured":"K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 770\u2013778, 2016. DOI: https:\/\/doi.org\/10.1109\/CVPR.2016.90."},{"key":"1535_CR3","first-page":"1097","volume-title":"Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, USA","author":"A Krizhevsky","year":"2012","unstructured":"A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, pp. 1097\u20131105, 2012."},{"key":"1535_CR4","volume-title":"Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA","author":"K Simonyan","year":"2015","unstructured":"K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015."},{"key":"1535_CR5","volume-title":"Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA","author":"I J Goodfellow","year":"2015","unstructured":"I. J. Goodfellow, J. Shlens, C. Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015."},{"key":"1535_CR6","first-page":"12888","volume-title":"Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA","author":"J Li","year":"2022","unstructured":"J. Li, D. Li, C. Xiong, S. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, pp. 12888\u201312900, 2022."},{"key":"1535_CR7","doi-asserted-by":"publisher","first-page":"3242","DOI":"10.1109\/CVPR.2017.345","volume-title":"Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA","author":"J Lu","year":"2017","unstructured":"J. Lu, C. Xiong, D. Parikh, R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 3242\u20133250, 2017. DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.345."},{"key":"1535_CR8","doi-asserted-by":"publisher","first-page":"4565","DOI":"10.1109\/CVPR.2016.494","volume-title":"Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA","author":"J Johnson","year":"2016","unstructured":"J. Johnson, A. Karpathy, Li F. F. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 4565\u20134574, 2016. DOI: https:\/\/doi.org\/10.1109\/CVPR.2016.494."},{"key":"1535_CR9","doi-asserted-by":"publisher","first-page":"889","DOI":"10.18653\/v1\/P18-1082","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia","author":"A Fan","year":"2018","unstructured":"A. Fan, M. Lewis, Y. Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, pp. 889\u2013898, 2018. DOI: https:\/\/doi.org\/10.18653\/v1\/P18-1082."},{"key":"1535_CR10","doi-asserted-by":"publisher","first-page":"4130","DOI":"10.1109\/CVPR.2019.00426","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA","author":"Y Xu","year":"2019","unstructured":"Y. Xu, B. Wu, F. Shen, Y. Fan, Y. Zhang, H. T. Shen, W. Liu. Exact adversarial attack to image captioning via structured output learning with latent variables. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, pp. 4130\u20134139, 2019. DOI: https:\/\/doi.org\/10.1109\/CVPR.2019.00426."},{"issue":"1","key":"1535_CR11","doi-asserted-by":"publisher","first-page":"539","DOI":"10.1109\/TPAMI.2022.3148210","volume":"45","author":"M Stefanini","year":"2023","unstructured":"M. Stefanini, M. Cornia, L. Baraldi, S. Cascianelli, G. Fiameni, R. Cucchiara. From show to tell: A survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 539\u2013559, 2023. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2022.3148210.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1535_CR12","doi-asserted-by":"publisher","first-page":"626","DOI":"10.1109\/TIFS.2022.3226905","volume":"18","author":"N Aafaq","year":"2023","unstructured":"N. Aafaq, N. Akhtar, W. Liu, M. Shah, A. Mian. Language model agnostic gray-box adversarial attack on image captioning. IEEE Transactions on Information Forensics and Security, vol. 18, pp. 626\u2013638, 2023. DOI: https:\/\/doi.org\/10.1109\/TIFS.2022.3226905.","journal-title":"IEEE Transactions on Information Forensics and Security"},{"issue":"1","key":"1535_CR13","doi-asserted-by":"publisher","first-page":"43","DOI":"10.1109\/TCSVT.2021.3067449","volume":"32","author":"C Yan","year":"2022","unstructured":"C. Yan, Y. Hao, L. Li, J. Yin, A. Liu, Z. Mao, Z. Y. Chen, X. Y. Gao. Task-adaptive attention for image captioning. IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 1, pp. 43\u201351, 2022. DOI: https:\/\/doi.org\/10.1109\/TCSVT.2021.3067449.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"issue":"4","key":"1535_CR14","doi-asserted-by":"publisher","first-page":"1293","DOI":"10.1007\/s10209-022-00906-7","volume":"22","author":"M Leotta","year":"2023","unstructured":"M. Leotta, F. Mori, M. Ribaudo. Evaluating the effectiveness of automatic image captioning for web accessibility. Universal Access in the Information Society, vol. 22, no. 4, pp. 1293\u20131313, 2023. DOI: https:\/\/doi.org\/10.1007\/s10209-022-00906-7.","journal-title":"Universal Access in the Information Society"},{"key":"1535_CR15","doi-asserted-by":"publisher","first-page":"85","DOI":"10.1109\/ICDIS.2018.00020","volume-title":"Proceedings of the 1st International Conference on Data Intelligence and Security, South Padre Island, USA","author":"F Ahmed","year":"2018","unstructured":"F. Ahmed, M. S. Mahmud, R. Al-Fahad, S. Alam, M. Yeasin. Image captioning for ambient awareness on a sidewalk. In Proceedings of the 1st International Conference on Data Intelligence and Security, South Padre Island, USA, pp. 85\u201391, 2018. DOI: https:\/\/doi.org\/10.1109\/ICDIS.2018.00020."},{"key":"1535_CR16","doi-asserted-by":"publisher","first-page":"5162","DOI":"10.18653\/v1\/2021.emnlp-main.419","volume-title":"Proceedings of Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic","author":"X Yang","year":"2021","unstructured":"X. Yang, S. Karaman, J. Tetreault, A. Jaimes. Journalistic guidelines aware news image captioning. In Proceedings of Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, pp. 5162\u20135175, 2021. DOI: https:\/\/doi.org\/10.18653\/v1\/2021.emnlp-main.419."},{"key":"1535_CR17","volume-title":"Enhancing journalism with AI: A study of contextualized image captioning for news articles using LLMs and LMMs","author":"A Anagnostopoulou","year":"2024","unstructured":"A. Anagnostopoulou, T. Gouvea, D. Sonntag. Enhancing journalism with AI: A study of contextualized image captioning for news articles using LLMs and LMMs, [Online], Available: https:\/\/arxiv.org\/abs\/2408.04331, 2024."},{"key":"1535_CR18","doi-asserted-by":"publisher","first-page":"1378","DOI":"10.1109\/IV48863.2021.9575562","volume-title":"Proceedings of IEEE Intelligent Vehicles Symposium, Nagoya, Japan","author":"Y Mori","year":"2021","unstructured":"Y. Mori, T. Hirakawa, T. Yamashita, H. Fujiyoshi. Image captioning for near-future events from vehicle camera images and motion information. In Proceedings of IEEE Intelligent Vehicles Symposium, Nagoya, Japan, pp. 1378\u20131384, 2021. DOI: https:\/\/doi.org\/10.1109\/IV48863.2021.9575562."},{"key":"1535_CR19","doi-asserted-by":"publisher","first-page":"1420","DOI":"10.1109\/ACCESS.2020.3047091","volume":"9","author":"W Li","year":"2021","unstructured":"W. Li, Z. Qu, H. Song, P. Wang, B. Xue. The traffic scene understanding and prediction based on image captioning. IEEE Access, vol. 9, pp. 1420\u20131427, 2021. DOI: https:\/\/doi.org\/10.1109\/ACCESS.2020.3047091.","journal-title":"IEEE Access"},{"key":"1535_CR20","volume-title":"Proceedings of the 5th International Conference on Learning Representations, Toulon, France","author":"A Kurakin","year":"2017","unstructured":"A. Kurakin, I. Goodfellow, S. Bengio. Adversarial machine learning at scale. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 2017."},{"key":"1535_CR21","volume-title":"Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada","author":"A Madry","year":"2018","unstructured":"A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu. Towards deep learning models resistant to adversarial attacks. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, 2018."},{"doi-asserted-by":"publisher","unstructured":"Z. Yin, Y. Zhuo, Z. Ge. Transfer adversarial attacks across industrial intelligent systems. Reliability Engineering & System Safety, vol. 237, Article number 109299, 2023. DOI: https:\/\/doi.org\/10.1016\/j.ress.2023.109299.","key":"1535_CR22","DOI":"10.1016\/j.ress.2023.109299"},{"key":"1535_CR23","doi-asserted-by":"publisher","first-page":"3905","DOI":"10.24963\/ijcai.2018\/543","volume-title":"Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden","author":"C Xiao","year":"2018","unstructured":"C. Xiao, B. Li, J. Y. Zhu, W. He, M. Liu, D. Song. Generating adversarial examples with adversarial networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pp. 3905\u20133911, 2018. DOI: https:\/\/doi.org\/10.24963\/ijcai.2018\/543."},{"key":"1535_CR24","doi-asserted-by":"publisher","first-page":"1378","DOI":"10.1109\/ICCV.2017.153","volume-title":"Proceedings of IEEE International Conference on Computer Vision, Venice, Italy","author":"C Xie","year":"2017","unstructured":"C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, A. Yuille. Adversarial examples for semantic segmentation and object detection. In Proceedings of IEEE International Conference on Computer Vision, Venice, Italy, pp. 1378\u20131387, 2017. DOI: https:\/\/doi.org\/10.1109\/ICCV.2017.153."},{"key":"1535_CR25","doi-asserted-by":"publisher","first-page":"2574","DOI":"10.1109\/CVPR.2016.282","volume-title":"Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA","author":"S M Moosavi-Dezfooli","year":"2016","unstructured":"S. M. Moosavi-Dezfooli, A. Fawzi, P. Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 2574\u20132582, 2016. DOI: https:\/\/doi.org\/10.1109\/CVPR.2016.282."},{"key":"1535_CR26","doi-asserted-by":"publisher","first-page":"86","DOI":"10.1109\/CVPR.2017.17","volume-title":"Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA","author":"S M Moosavi-Dezfooli","year":"2017","unstructured":"S. M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, P. Frossard. Universal adversarial perturbations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 86\u201394, 2017. DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.17."},{"issue":"5","key":"1535_CR27","doi-asserted-by":"publisher","first-page":"828","DOI":"10.1109\/TEVC.2019.2890858","volume":"23","author":"J Su","year":"2019","unstructured":"J. Su, D. V. Vargas, K. Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, vol. 23, no. 5, pp. 828\u2013841, 2019. DOI: https:\/\/doi.org\/10.1109\/TEVC.2019.2890858.","journal-title":"IEEE Transactions on Evolutionary Computation"},{"issue":"6","key":"1535_CR28","doi-asserted-by":"publisher","first-page":"4980","DOI":"10.1109\/JIOT.2020.3034899","volume":"8","author":"X Yang","year":"2021","unstructured":"X. Yang, W. Liu, S. Zhang, W. Liu, D. Tao. Targeted attention attack on deep learning models in road sign recognition. IEEE Internet of Things Journal, vol. 8, no. 6, pp. 4980\u20134990, 2021. DOI: https:\/\/doi.org\/10.1109\/JIOT.2020.3034899.","journal-title":"IEEE Internet of Things Journal"},{"issue":"6","key":"1535_CR29","doi-asserted-by":"publisher","first-page":"1164","DOI":"10.1109\/TKDE.2018.2790928","volume":"30","author":"Z Yin","year":"2018","unstructured":"Z. Yin, F. Wang, W. Liu, S. Chawla. Sparse feature attacks in adversarial learning. IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 6, pp. 1164\u20131177, 2018. DOI: https:\/\/doi.org\/10.1109\/TKDE.2018.2790928.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"1535_CR30","doi-asserted-by":"publisher","first-page":"2758","DOI":"10.1109\/IJCNN.2017.7966196","volume-title":"Proceedings of International Joint Conference on Neural Networks, Anchorage, USA","author":"A S Chivukula","year":"2017","unstructured":"A. S. Chivukula, W. Liu. Adversarial learning games with deep learning models. In Proceedings of International Joint Conference on Neural Networks, Anchorage, USA, pp. 2758\u20132767, 2017. DOI: https:\/\/doi.org\/10.1109\/IJCNN.2017.7966196."},{"key":"1535_CR31","volume-title":"Proceedings of the 2nd International Conference on Learning Representations, Banff, Canada","author":"D P Kingma","year":"2014","unstructured":"D. P. Kingma, M. Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, Banff, Canada, 2014."},{"key":"1535_CR32","doi-asserted-by":"publisher","first-page":"2587","DOI":"10.18653\/v1\/P18-1241","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia","author":"H Chen","year":"2018","unstructured":"H. Chen, H. Zhang, P. Y. Chen, J. Yi, C. J. Hsieh. Attacking visual language grounding with adversarial examples: A case study on neural image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, pp. 2587\u20132597, 2018. DOI: https:\/\/doi.org\/10.18653\/v1\/P18-1241."},{"key":"1535_CR33","volume-title":"Contextual LSTM (CLSTM) models for large scale NLP tasks","author":"S Ghosh","year":"2016","unstructured":"S. Ghosh, O. Vinyals, B. Strope, S. Roy, T. Dean, L. Heck. Contextual LSTM (CLSTM) models for large scale NLP tasks, [Online], Available: https:\/\/arxiv.org\/abs\/1602.06291, 2016."},{"key":"1535_CR34","doi-asserted-by":"publisher","first-page":"1070","DOI":"10.1109\/CVPR.2017.120","volume-title":"Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA","author":"R Vedantam","year":"2017","unstructured":"R. Vedantam, S. Bengio, K. Murphy, D. Parikh, G. Chechik. Context-aware captions from context-agnostic supervision. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 1070\u20131079, 2017. DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.120."},{"issue":"1","key":"1535_CR35","doi-asserted-by":"publisher","first-page":"8142","DOI":"10.1609\/aaai.v33i01.33018142","volume":"33","author":"C Chen","year":"2019","unstructured":"C. Chen, S. Mu, W. Xiao, Z. Ye, L. Wu, Q. Ju. Improving image captioning with conditional generative adversarial nets. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA, vol. 33, no. 1, pp. 8142\u20138150, 2019. DOI: https:\/\/doi.org\/10.1609\/aaai.v33i01.33018142.","journal-title":"Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA"},{"key":"1535_CR36","doi-asserted-by":"publisher","first-page":"4951","DOI":"10.1109\/CVPR.2018.00520","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA","author":"X Xu","year":"2018","unstructured":"X. Xu, X. Chen, C. Liu, A. Rohrbach, T. Darrell, D. Song. Fooling vision and language models despite localization and attention mechanism. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, pp. 4951\u20134961, 2018. DOI: https:\/\/doi.org\/10.1109\/CVPR.2018.00520."},{"key":"1535_CR37","volume-title":"Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA","author":"D Bahdanau","year":"2015","unstructured":"D. Bahdanau, K. Cho, Y. Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015."},{"key":"1535_CR38","first-page":"2048","volume-title":"Proceedings of the 32nd International Conference on Machine Learning, Lille, France","author":"K Xu","year":"2015","unstructured":"K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp. 2048\u20132057, 2015."},{"key":"1535_CR39","doi-asserted-by":"publisher","first-page":"740","DOI":"10.1007\/978-3-319-10602-1_48","volume-title":"Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland","author":"T Y Lin","year":"2014","unstructured":"T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, pp. 740\u2013755, 2014. DOI: https:\/\/doi.org\/10.1007\/978-3-319-10602-1_48."},{"key":"1535_CR40","doi-asserted-by":"publisher","first-page":"853","DOI":"10.1613\/jair.3994","volume":"47","author":"M Hodosh","year":"2013","unstructured":"M. Hodosh, P. Young, J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, vol. 47, pp. 853\u2013899, 2013. DOI: https:\/\/doi.org\/10.1613\/jair.3994.","journal-title":"Journal of Artificial Intelligence Research"},{"key":"1535_CR41","doi-asserted-by":"publisher","first-page":"311","DOI":"10.3115\/1073083.1073135","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, USA","author":"K Papineni","year":"2002","unstructured":"K. Papineni, S. Roukos, T. Ward, W. J. Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, USA, pp. 311\u2013318, 2002. DOI: https:\/\/doi.org\/10.3115\/1073083.1073135."},{"key":"1535_CR42","first-page":"74","volume-title":"Proceedings of Text Summarization Branches Out, Barcelona, Spain","author":"C Y Lin","year":"2004","unstructured":"C. Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of Text Summarization Branches Out, Barcelona, Spain, pp. 74\u201381, 2004."},{"key":"1535_CR43","doi-asserted-by":"publisher","first-page":"1085","DOI":"10.18653\/v1\/P19-1103","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy","author":"S Ren","year":"2019","unstructured":"S. Ren, Y. Deng, K. He, W. Che. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1085\u20131097, 2019. DOI: https:\/\/doi.org\/10.18653\/v1\/P19-1103."},{"key":"1535_CR44","doi-asserted-by":"publisher","first-page":"7241","DOI":"10.18653\/v1\/2022.emnlp-main.488","volume-title":"Proceedings of Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, UAE","author":"C Li","year":"2022","unstructured":"C. Li, H. Xu, J. Tian, W. Wang, M. Yan, B. Bi, J. Ye, H. Chen, G. Xu, Z. Cao, J. Zhang, S. Huang, F. Huang, J. Zhou, L. Si. mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. In Proceedings of Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, UAE, pp. 7241\u20137259, 2022. DOI: https:\/\/doi.org\/10.18653\/v1\/2022.emnlp-main.488."},{"key":"1535_CR45","first-page":"19730","volume-title":"Proceedings of the 40th International Conference on Machine Learning, Honolulu, USA","author":"J Li","year":"2023","unstructured":"J. Li, D. Li, S. Savarese, S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, USA, pp. 19730\u201319742, 2023."},{"key":"1535_CR46","doi-asserted-by":"publisher","first-page":"2173","DOI":"10.1109\/BigData59044.2023.10386812","volume-title":"Proceedings of IEEE International Conference on Big Data, Sorrento, Italy","author":"J C Hu","year":"2023","unstructured":"J. C. Hu, R. Cavicchioli, A. Capotondi. Exploiting multiple sequence lengths in fast end to end training for image captioning. In Proceedings of IEEE International Conference on Big Data, Sorrento, Italy, pp. 2173\u20132182, 2023. DOI: https:\/\/doi.org\/10.1109\/BigData59044.2023.10386812."}],"container-title":["Machine Intelligence Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-024-1535-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11633-024-1535-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-024-1535-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,7]],"date-time":"2025-09-07T01:56:00Z","timestamp":1757210160000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11633-024-1535-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,7]]},"references-count":46,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,8]]}},"alternative-id":["1535"],"URL":"https:\/\/doi.org\/10.1007\/s11633-024-1535-z","relation":{},"ISSN":["2731-538X","2731-5398"],"issn-type":[{"type":"print","value":"2731-538X"},{"type":"electronic","value":"2731-5398"}],"subject":[],"published":{"date-parts":[[2025,7,7]]},"assertion":[{"value":"9 September 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 December 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 July 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declared that they have no conflicts of interest to this work.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations of conflict of interest"}}]}}