{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,20]],"date-time":"2026-04-20T13:59:23Z","timestamp":1776693563475,"version":"3.51.2"},"reference-count":66,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2024,2,10]],"date-time":"2024-02-10T00:00:00Z","timestamp":1707523200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,2,10]],"date-time":"2024-02-10T00:00:00Z","timestamp":1707523200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61806218"],"award-info":[{"award-number":["61806218"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2024,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Nowadays, Artificial Intelligence Generated Content (AIGC) has shown promising prospects in both computer vision and natural language processing communities. Meanwhile, as an essential aspect of AIGC, image-to-caption generation has received increasing attention. Recent vision-language research is shifting from bulky region visual representations based on object detectors toward more convenient and flexible grid ones. However, this kind of research typically concentrates on image understanding tasks like image classification, with less attention paid to content generation tasks. In this paper, we explore how to capitalize on the expressive features embedded in the grid visual representations for better image captioning. 
To this end, we present a Transformer-based image captioning model, dubbed FeiM, with two straightforward yet effective designs. We first design the feature queries that consist of a limited set of learnable vectors, which act as the local signals to capture specific visual information from global grid features. Then, taking augmented global grid features and the local feature queries as inputs, we develop a feature interaction module to query relevant visual concepts from grid features, and to enable interaction between the local signal and overall context. Finally, the refined grid visual representations and the linguistic features pass through a Transformer architecture for multi-modal fusion. With these two simple yet effective designs, FeiM can fully leverage meaningful visual knowledge to improve image captioning performance. Extensive experiments are performed on the competitive MSCOCO benchmark to confirm the effectiveness of the proposed approach, and the results show that FeiM outperforms existing advanced captioning models.<\/jats:p>","DOI":"10.1007\/s40747-023-01341-8","type":"journal-article","created":{"date-parts":[[2024,2,10]],"date-time":"2024-02-10T14:02:08Z","timestamp":1707573728000},"page":"3541-3556","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["Exploring better image captioning with grid 
features"],"prefix":"10.1007","volume":"10","author":[{"given":"Jie","family":"Yan","sequence":"first","affiliation":[]},{"given":"Yuxiang","family":"Xie","sequence":"additional","affiliation":[]},{"given":"Yanming","family":"Guo","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4568-551X","authenticated-orcid":false,"given":"Yingmei","family":"Wei","sequence":"additional","affiliation":[]},{"given":"Xidao","family":"Luan","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,2,10]]},"reference":[{"key":"1341_CR1","doi-asserted-by":"crossref","unstructured":"Fang H, Gupta S, Iandola F, Srivastava RK, Deng L, Dollar P, Gao J, He X, Mitchell M, Platt JC, Zitnick CL, Zweig G (2015) From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473\u20131482","DOI":"10.1109\/CVPR.2015.7298754"},{"key":"1341_CR2","doi-asserted-by":"crossref","unstructured":"Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128\u20133137","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"1341_CR3","unstructured":"Mao J, Xu W, Yang Y, Wang J, Huang Z, Yuille A (2015) Deep captioning with multimodal recurrent neural networks (m-rnn). In ICLR"},{"key":"1341_CR4","doi-asserted-by":"crossref","unstructured":"Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156\u20133164","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"1341_CR5","unstructured":"Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. 
In Advances in Neural Information Processing Systems, pages 3104\u20133112"},{"issue":"4","key":"1341_CR6","doi-asserted-by":"publisher","first-page":"652","DOI":"10.1109\/TPAMI.2016.2587640","volume":"39","author":"O Vinyals","year":"2016","unstructured":"Vinyals O, Toshev A, Bengio S, Erhan D (2016) Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans Pattern Anal Mach Intell 39(4):652\u2013663","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"1341_CR7","unstructured":"Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048\u20132057"},{"key":"1341_CR8","doi-asserted-by":"crossref","unstructured":"Cornia M, Baraldi L, Cucchiara R (2019) Show, control and tell: a framework for generating controllable and grounded captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8307\u20138316","DOI":"10.1109\/CVPR.2019.00850"},{"key":"1341_CR9","unstructured":"Mao J, Xu W, Yang Y, Wang J, Yuille AL (2014) Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090"},{"key":"1341_CR10","doi-asserted-by":"crossref","unstructured":"Jiang W, Ma L, Jiang Y-G, Liu W, Zhang T (2018) Recurrent fusion network for image captioning. In Proceedings of the European Conference on Computer Vision, pages 499\u2013515","DOI":"10.1007\/978-3-030-01216-8_31"},{"key":"1341_CR11","doi-asserted-by":"crossref","unstructured":"You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4651\u20134659","DOI":"10.1109\/CVPR.2016.503"},{"key":"1341_CR12","unstructured":"Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. 
In Advances in Neural Information Processing Systems, pages 91\u201399"},{"key":"1341_CR13","doi-asserted-by":"crossref","unstructured":"Jiang H, Misra I, Rohrbach M, Learned-Miller E, Chen X (2020) In defense of grid features for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10267\u201310276","DOI":"10.1109\/CVPR42600.2020.01028"},{"key":"1341_CR14","doi-asserted-by":"crossref","unstructured":"Zhang X, Sun X, Luo Y, Ji J, Zhou Y, Wu Y, Huang F, Ji R (2021) RSTNet: captioning with adaptive attention on visual and non-visual words. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 15465\u201315474","DOI":"10.1109\/CVPR46437.2021.01521"},{"key":"1341_CR15","doi-asserted-by":"crossref","unstructured":"Farhadi A, Hejrati M, Sadeghi MA, Young P, Rashtchian C, Hockenmaier J, Forsyth D (2010) Every picture tells a story: generating sentences from images. In Proceedings of the European Conference on Computer Vision, pages 15\u201329","DOI":"10.1007\/978-3-642-15561-1_2"},{"key":"1341_CR16","doi-asserted-by":"crossref","unstructured":"Gupta A, Verma Y, Jawahar CV (2012) Choosing linguistics over vision to describe images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 606\u2013612","DOI":"10.1609\/aaai.v26i1.8205"},{"key":"1341_CR17","first-page":"1143","volume":"24","author":"V Ordonez","year":"2011","unstructured":"Ordonez V, Kulkarni G, Berg TL (2011) Im2text: describing images using 1 million captioned photographs. 
Adv Neural Inform Process Syst 24:1143\u20131151","journal-title":"Adv Neural Inform Process Syst"},{"issue":"12","key":"1341_CR18","doi-asserted-by":"publisher","first-page":"2891","DOI":"10.1109\/TPAMI.2012.162","volume":"35","author":"G Kulkarni","year":"2013","unstructured":"Kulkarni G, Premraj V, Ordonez V, Dhar S, Li S, Choi Y, Berg AC, Berg TL (2013) Babytalk: understanding and generating simple image descriptions. IEEE Trans Pattern Anal Mach Intell 35(12):2891\u20132903","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"1341_CR19","unstructured":"Mitchell M, Dodge J, Goyal A, Yamaguchi K, Stratos K, Han X, Mensch A, Berg A, Berg T, Daum\u00e9 III H (2012) Midge: generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747\u2013756"},{"key":"1341_CR20","doi-asserted-by":"crossref","unstructured":"Ushiku Y, Yamaguchi M, Mukuta Y, Harada T (2015) Common subspace for model and similarity: Phrase learning for caption generation from images. In Proceedings of the IEEE International Conference on Computer Vision, pages 2668\u20132676","DOI":"10.1109\/ICCV.2015.306"},{"issue":"5","key":"1341_CR21","doi-asserted-by":"publisher","first-page":"1361","DOI":"10.1109\/JAS.2020.1003300","volume":"7","author":"X Li","year":"2020","unstructured":"Li X, Liu Y, Wang K, Wang F-Y (2020) A recurrent attention and interaction model for pedestrian trajectory prediction. IEEE\/CAA J Autom Sinica 7(5):1361\u20131370","journal-title":"IEEE\/CAA J Autom Sinica"},{"issue":"7","key":"1341_CR22","doi-asserted-by":"publisher","first-page":"1243","DOI":"10.1109\/JAS.2020.1003402","volume":"8","author":"P Liu","year":"2020","unstructured":"Liu P, Zhou Y, Peng D, Wu D (2020) Global-attention-based neural networks for vision language intelligence. 
IEEE\/CAA J Autom Sinica 8(7):1243\u20131252","journal-title":"IEEE\/CAA J Autom Sinica"},{"key":"1341_CR23","doi-asserted-by":"crossref","unstructured":"Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008\u20137024","DOI":"10.1109\/CVPR.2017.131"},{"issue":"6","key":"1341_CR24","doi-asserted-by":"publisher","first-page":"1489","DOI":"10.1109\/JAS.2020.1003180","volume":"7","author":"T Zhou","year":"2020","unstructured":"Zhou T, Chen M, Zou J (2020) Reinforcement learning based data fusion method for multi-sensors. IEEE\/CAA J Autom Sinica 7(6):1489\u20131497","journal-title":"IEEE\/CAA J Autom Sinica"},{"key":"1341_CR25","doi-asserted-by":"crossref","unstructured":"Seo PH, Sharma P, Levinboim T, Han B, Soricut R (2020) Reinforcing an image caption generator using off-line human feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, 34(3): 2693\u20132700","DOI":"10.1609\/aaai.v34i03.5655"},{"key":"1341_CR26","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805"},{"key":"1341_CR27","doi-asserted-by":"crossref","unstructured":"Zhang P, Li X, Hu X, Yang J, Zhang L, Wang L, Choi Y, Gao J (2021) VINVL: revisiting visual representations in vision-language models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5579\u20135588","DOI":"10.1109\/CVPR46437.2021.00553"},{"key":"1341_CR28","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I (2017) Attention is all you need. 
In Advances in Neural Information Processing Systems, pages 5998\u20136008"},{"key":"1341_CR29","doi-asserted-by":"crossref","unstructured":"Li G, Zhu L, Liu P, Yang Y (2019) Entangled transformer for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pages 8928\u20138937","DOI":"10.1109\/ICCV.2019.00902"},{"key":"1341_CR30","doi-asserted-by":"crossref","unstructured":"Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10578\u201310587","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"1341_CR31","doi-asserted-by":"crossref","unstructured":"Pan Y, Yao T, Li Y, Mei T (2020) X-linear attention networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10971\u201310980","DOI":"10.1109\/CVPR42600.2020.01098"},{"key":"1341_CR32","doi-asserted-by":"crossref","unstructured":"Sharma P, Ding N, Goodman S, Soricut R (2018) Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2556\u20132565","DOI":"10.18653\/v1\/P18-1238"},{"key":"1341_CR33","doi-asserted-by":"crossref","unstructured":"Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077\u20136086","DOI":"10.1109\/CVPR.2018.00636"},{"key":"1341_CR34","doi-asserted-by":"crossref","unstructured":"Datta S, Sikka K, Roy A, Ahuja K, Parikh D, Divakaran A (2019) Align2Ground: weakly supervised phrase grounding guided by image-caption alignment. 
In Proceedings of the IEEE International Conference on Computer Vision, pages 2601\u20132610","DOI":"10.1109\/ICCV.2019.00269"},{"key":"1341_CR35","doi-asserted-by":"crossref","unstructured":"Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1\u20139","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"1341_CR36","doi-asserted-by":"crossref","unstructured":"Wu Q, Shen C, Liu L, Dick A, van den Hengel A (2016) What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 203\u2013212","DOI":"10.1109\/CVPR.2016.29"},{"key":"1341_CR37","doi-asserted-by":"crossref","unstructured":"Lu J, Xiong C, Parikh D, Socher R (2017) Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 375\u2013383","DOI":"10.1109\/CVPR.2017.345"},{"key":"1341_CR38","unstructured":"Herdade S, Kappeler A, Boakye K, Soares J (2019) Image captioning: Transforming objects into words. In Advances in Neural Information Processing Systems, pages 11137\u201311147"},{"key":"1341_CR39","unstructured":"Qiu L, Zhang R, Guo Z, Zeng Z, Li Y, Zhang G (2021) VT-CLIP: enhancing vision-language models with visual-guided texts. arXiv preprint arXiv:2112.02399"},{"key":"1341_CR40","unstructured":"Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748\u20138763"},{"key":"1341_CR41","doi-asserted-by":"crossref","unstructured":"Jiang Z, Xu FF, Araki J, Neubig G (2020) How can we know what language models know? 
Transactions of the Association for Computational Linguistics, pages 423\u2013438","DOI":"10.1162\/tacl_a_00324"},{"key":"1341_CR42","doi-asserted-by":"crossref","unstructured":"Shin T, Razeghi Y, Logan RL IV, Wallace E, Singh S (2020) Autoprompt: eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980","DOI":"10.18653\/v1\/2020.emnlp-main.346"},{"issue":"9","key":"1341_CR43","doi-asserted-by":"publisher","first-page":"2337","DOI":"10.1007\/s11263-022-01653-1","volume":"130","author":"K Zhou","year":"2022","unstructured":"Zhou K, Yang J, Loy CC, Liu Z (2022) Learning to prompt for vision-language models. Int J Comput Vis 130(9):2337\u20132348","journal-title":"Int J Comput Vis"},{"key":"1341_CR44","unstructured":"Zhu Y, Liu H, Song Y, Yuan Z, Han X, Yuan C, Chen Q, Wang J (2022) One model to edit them all: Free-form text-driven image manipulation with semantic modulations. In Advances in Neural Information Processing Systems, pages 25146\u201325159"},{"key":"1341_CR45","doi-asserted-by":"crossref","unstructured":"Kim D-J, Choi J, Oh T-H, Kweon IS (2019) Dense relational captioning: Triple-stream networks for relationship-based captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6271\u20136280","DOI":"10.1109\/CVPR.2019.00643"},{"issue":"3","key":"1341_CR46","doi-asserted-by":"publisher","first-page":"2013","DOI":"10.1007\/s11042-019-08209-5","volume":"79","author":"S Wang","year":"2020","unstructured":"Wang S, Lan L, Zhang X, Dong G, Luo Z (2020) Object-aware semantics of attention for image captioning. Multimed Tools Appl 79(3):2013\u20132030","journal-title":"Multimed Tools Appl"},{"key":"1341_CR47","doi-asserted-by":"crossref","unstructured":"Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Doll\u00e1r P, Zitnick CL (2014) Microsoft coco: Common objects in context. 
In Proceedings of the European Conference on Computer Vision, pages 740\u2013755","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"1341_CR48","doi-asserted-by":"crossref","unstructured":"Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311\u2013318","DOI":"10.3115\/1073083.1073135"},{"key":"1341_CR49","unstructured":"Banerjee S, Lavie A (2005) Meteor: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization, pages 65\u201372"},{"key":"1341_CR50","unstructured":"Lin C-Y (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pages 74\u201381"},{"key":"1341_CR51","doi-asserted-by":"crossref","unstructured":"Vedantam R, Zitnick CL, Parikh D (2015) Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566\u20134575","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"1341_CR52","doi-asserted-by":"crossref","unstructured":"Yao T, Pan Y, Li Y, Mei T (2018) Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision, pages 684\u2013699","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"1341_CR53","doi-asserted-by":"crossref","unstructured":"Yang X, Tang K, Zhang H, Cai J (2019) Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10685\u201310694","DOI":"10.1109\/CVPR.2019.01094"},{"key":"1341_CR54","doi-asserted-by":"crossref","unstructured":"Huang L, Wang W, Chen J, Wei X-Y (2019) Attention on attention for image captioning. 
In Proceedings of the IEEE International Conference on Computer Vision, pages 4634\u20134643","DOI":"10.1109\/ICCV.2019.00473"},{"key":"1341_CR55","doi-asserted-by":"crossref","unstructured":"Yang X, Gao C, Zhang H, Cai J (2021) Auto-parsing network for image captioning and visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2197\u20132207","DOI":"10.1109\/ICCV48922.2021.00220"},{"key":"1341_CR56","doi-asserted-by":"crossref","unstructured":"Guo L, Liu J, Zhu X, Yao P, Lu S, Lu H (2020) Normalized and geometry-aware self-attention network for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10327\u201310336","DOI":"10.1109\/CVPR42600.2020.01034"},{"issue":"2","key":"1341_CR57","doi-asserted-by":"publisher","first-page":"710","DOI":"10.1109\/TPAMI.2019.2909864","volume":"44","author":"Z-J Zha","year":"2019","unstructured":"Zha Z-J, Liu D, Zhang H, Zhang Y, Feng W (2019) Context-aware visual policy network for fine-grained image captioning. IEEE Trans Pattern Anal Mach Intell 44(2):710\u2013722","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"1341_CR58","doi-asserted-by":"crossref","unstructured":"Yang X, Zhang H, Qi G, Cai J (2021) Causal attention for vision-language tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9847\u20139857","DOI":"10.1109\/CVPR46437.2021.00972"},{"key":"1341_CR59","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2021.10.014","author":"Y Wang","year":"2022","unstructured":"Wang Y, Xu J, Sun Y (2022) A visual persistence model for image captioning. Neurocomputing. https:\/\/doi.org\/10.1016\/j.neucom.2021.10.014","journal-title":"Neurocomputing"},{"key":"1341_CR60","doi-asserted-by":"crossref","unstructured":"Zhang Z, Qiang W, Wang Y, Chen F (2021) 
Exploring pairwise relationships adaptively from linguistic context in image captioning. IEEE Transactions on Multimedia, pp 3101\u20133113","DOI":"10.1109\/TMM.2021.3093725"},{"key":"1341_CR61","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2023.109420","volume":"138","author":"Y Ma","year":"2023","unstructured":"Ma Y, Ji J, Sun X, Zhou Y, Ji R (2023) Towards local visual modeling for image captioning. Pattern Recognit 138:109420","journal-title":"Pattern Recognit"},{"key":"1341_CR62","doi-asserted-by":"crossref","unstructured":"Li XL, Liang P (2021) Prefix-tuning: optimizing continuous prompts for generation. In Proceedings of ACL","DOI":"10.18653\/v1\/2021.acl-long.353"},{"key":"1341_CR63","unstructured":"He J, Zhou C, Ma X, Berg-Kirkpatrick T, Neubig G (2022) Towards a unified view of parameter-efficient transfer learning. In ICLR"},{"key":"1341_CR64","doi-asserted-by":"crossref","unstructured":"Anderson P, Fernando B, Johnson M, Gould S (2016) Spice: semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision, pages 382\u2013398","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"1341_CR65","unstructured":"Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In Proceedings of the International Conference for Learning Representations"},{"key":"1341_CR66","doi-asserted-by":"crossref","unstructured":"Chauhan S, Singh M, Aggarwal AK (2021) Experimental analysis of effect of tuning parameters on the performance of diversity-driven multi-parent evolutionary algorithm. 
2021 IEEE 2nd International Conference On Electrical Power and Energy Systems (ICEPES), pages 1\u20136","DOI":"10.1109\/ICEPES52894.2021.9699655"}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-023-01341-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-023-01341-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-023-01341-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,16]],"date-time":"2024-05-16T18:13:33Z","timestamp":1715883213000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-023-01341-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,10]]},"references-count":66,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,6]]}},"alternative-id":["1341"],"URL":"https:\/\/doi.org\/10.1007\/s40747-023-01341-8","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"value":"2199-4536","type":"print"},{"value":"2198-6053","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,10]]},"assertion":[{"value":"13 September 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 December 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 February 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that we have no 
known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}