{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T15:46:58Z","timestamp":1762530418958,"version":"build-2065373602"},"reference-count":54,"publisher":"Springer Science and Business Media LLC","issue":"15","license":[{"start":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T00:00:00Z","timestamp":1759104000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.springernature.com\/gp\/researchers\/text-and-data-mining"},{"start":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T00:00:00Z","timestamp":1759104000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.springernature.com\/gp\/researchers\/text-and-data-mining"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62276054","61877009"],"award-info":[{"award-number":["62276054","61877009"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Appl Intell"],"published-print":{"date-parts":[[2025,10]]},"DOI":"10.1007\/s10489-025-06866-8","type":"journal-article","created":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T09:38:31Z","timestamp":1759138711000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Visual residual aggregation network for visual-language prompt tuning"],"prefix":"10.1007","volume":"55","author":[{"given":"Yunqian","family":"Yu","sequence":"first","affiliation":[]},{"given":"Feng","family":"Guo","sequence":"additional","affiliation":[]},{"given":"Xianlong","family":"Tian","sequence":"additional","affiliation":[]},{"given":"Biao","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Mengmeng","family":"Jing","sequence":"additional","affiliation":[]},{"given":"Lin","family":"Zuo","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,9,29]]},"reference":[{"key":"6866_CR1","first-page":"23716","volume":"35","author":"J-B Alayrac","year":"2022","unstructured":"Alayrac J-B, Donahue J, Luc P, Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M et al (2022) Flamingo: a visual language model for few-shot learning. Adv. Neural. Inf. Process. Syst. 35:23716\u201323736","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"6866_CR2","unstructured":"Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J et al (2021) Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748\u20138763. PMLR"},{"issue":"2","key":"6866_CR3","first-page":"3","volume":"1","author":"EJ Hu","year":"2022","unstructured":"Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W et al (2022) Lora: Low-rank adaptation of large language models. ICLR. 1(2):3","journal-title":"ICLR."},{"issue":"9","key":"6866_CR4","doi-asserted-by":"publisher","first-page":"2337","DOI":"10.1007\/s11263-022-01653-1","volume":"130","author":"K Zhou","year":"2022","unstructured":"Zhou K, Yang J, Loy CC, Liu Z (2022) Learning to prompt for vision-language models. Int. J. Comput. Vision 130(9):2337\u20132348","journal-title":"Int. J. Comput. 
Vision"},{"key":"6866_CR5","doi-asserted-by":"crossref","unstructured":"Zhou K, Yang J, Loy CC, Liu Z (2022) Conditional prompt learning for vision-language models. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 16816\u201316825","DOI":"10.1109\/CVPR52688.2022.01631"},{"key":"6866_CR6","doi-asserted-by":"crossref","unstructured":"Yao H, Zhang R, Xu C (2023) Visual-language prompt tuning with knowledge-guided context optimization. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 6757\u20136767","DOI":"10.1109\/CVPR52729.2023.00653"},{"key":"6866_CR7","unstructured":"Alexey D (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929"},{"key":"6866_CR8","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I (2017) Attention is all you need. Advances in neural information processing systems. 30"},{"key":"6866_CR9","unstructured":"Jia C, Yang Y, Xia Y, Chen Y-T, Parekh Z, Pham H, Le Q, Sung Y-H, Li Z, Duerig T (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904\u20134916. PMLR"},{"key":"6866_CR10","unstructured":"Schuhmann C, Vencu R, Beaumont R, Kaczmarczyk R, Mullis C, Katta A, Coombes T, Jitsev J, Komatsuzaki A (2021) Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv:2111.02114"},{"key":"6866_CR11","first-page":"25278","volume":"35","author":"C Schuhmann","year":"2022","unstructured":"Schuhmann C, Beaumont R, Vencu R, Gordon C, Wightman R, Cherti M, Coombes T, Katta A, Mullis C, Wortsman M et al (2022) Laion-5b: An open large-scale dataset for training next generation image-text models. Adv. Neural. Inf. Process. Syst. 35:25278\u201325294","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"6866_CR12","unstructured":"Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597\u20131607. PMLR"},{"key":"6866_CR13","unstructured":"Kim W, Son B, Kim I (2021) Vilt: Vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning, pp. 5583\u20135594. PMLR"},{"key":"6866_CR14","unstructured":"Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems 32"},{"key":"6866_CR15","doi-asserted-by":"crossref","unstructured":"He K, Chen X, Xie S, Li Y, Doll\u00e1r P, Girshick R (2022) Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000\u201316009","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"6866_CR16","first-page":"9694","volume":"34","author":"J Li","year":"2021","unstructured":"Li J, Selvaraju R, Gotmare A, Joty S, Xiong C, Hoi SCH (2021) Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural. Inf. Process. Syst. 34:9694\u20139705","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"6866_CR17","doi-asserted-by":"crossref","unstructured":"Li J, Gao M, Wei L, Tang S, Zhang W, Li M, Ji W, Tian Q, Chua T-S, Zhuang Y (2023) Gradient-regulated meta-prompt learning for generalizable vision-language models. 
In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 2551\u20132562","DOI":"10.1109\/ICCV51070.2023.00241"},{"key":"6866_CR18","unstructured":"Zang Y, Li W, Zhou K, Huang C, Loy CC (2022) Unified vision and language prompt learning. arXiv:2210.07225"},{"key":"6866_CR19","doi-asserted-by":"publisher","DOI":"10.1016\/j.cag.2024.01.012","volume":"119","author":"J Xing","year":"2024","unstructured":"Xing J, Liu J, Wang J, Sun L, Chen X, Gu X, Wang Y (2024) A survey of efficient fine-tuning methods for vision-language models\u2014prompt and adapter. Computers & Graphics. 119:103885","journal-title":"Computers & Graphics."},{"key":"6866_CR20","doi-asserted-by":"crossref","unstructured":"Zhu B, Niu Y, Han Y, Wu Y, Zhang H (2023) Prompt-aligned gradient for prompt tuning. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 15659\u201315669","DOI":"10.1109\/ICCV51070.2023.01435"},{"key":"6866_CR21","doi-asserted-by":"crossref","unstructured":"Zhang J, Wu S, Gao L, Shen HT, Song J (2024) Dept: Decoupled prompt tuning. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 12924\u201312933","DOI":"10.1109\/CVPR52733.2024.01228"},{"key":"6866_CR22","doi-asserted-by":"publisher","first-page":"8718","DOI":"10.1609\/aaai.v39i8.32942","volume":"39","author":"J Xie","year":"2025","unstructured":"Xie J, Zhang Y, Peng J, Huang Z, Cao L (2025) Textrefiner: Internal visual feature as efficient refiner for vision-language models prompt tuning. Proceedings of the AAAI Conference on Artificial Intelligence 39:8718\u20138726","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"6866_CR23","unstructured":"Hou S, Shang X, Gowda SN, Lu Y, Wu C, Yan Y, Wang H (2025) Capt: Class-aware prompt tuning for federated long-tailed learning with vision-language model. arXiv:2503.06993"},{"key":"6866_CR24","doi-asserted-by":"crossref","unstructured":"Tian X, Zou S, Yang Z, Zhang J (2024) Argue: Attribute-guided prompt tuning for vision-language models. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 28578\u201328587","DOI":"10.1109\/CVPR52733.2024.02700"},{"key":"6866_CR25","doi-asserted-by":"crossref","unstructured":"Yao H, Zhang R, Xu C (2024) Tcp: Textual-based class-aware prompt tuning for visual-language model. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 23438\u201323448","DOI":"10.1109\/CVPR52733.2024.02212"},{"key":"6866_CR26","doi-asserted-by":"crossref","unstructured":"Kim G, Kim S, Lee S (2024) Aapl: Adding attributes to prompt learning for vision-language models. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 1572\u20131582","DOI":"10.1109\/CVPRW63382.2024.00164"},{"key":"6866_CR27","unstructured":"Yoon HS, Yoon E, Tee JTJ, Hasegawa-Johnson M, Li Y, Yoo CD (2024) C-tpt: Calibrated test-time prompt tuning for vision-language models via text feature dispersion. arXiv:2403.14119"},{"key":"6866_CR28","doi-asserted-by":"crossref","unstructured":"Jia M, Tang L, Chen B-C, Cardie C, Belongie S, Hariharan B, Lim S-N (2022) Visual prompt tuning. In: European Conference on Computer Vision, pp. 709\u2013727. 
Springer","DOI":"10.1007\/978-3-031-19827-4_41"},{"key":"6866_CR29","first-page":"45206","volume":"37","author":"M Wu","year":"2024","unstructured":"Wu M, Cai X, Ji J, Li J, Huang O, Luo G, Fei H, Jiang G, Sun X, Ji R (2024) Controlmllm: Training-free visual prompt learning for multimodal large language models. Adv. Neural. Inf. Process. Syst. 37:45206\u201345234","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"6866_CR30","doi-asserted-by":"crossref","unstructured":"Pei W, Xia T, Chen F, Li J, Tian J, Lu G (2024) Sa2vp: Spatially aligned-and-adapted visual prompt. Proceedings of the AAAI Conference on Artificial Intelligence 38:4450\u20134458","DOI":"10.1609\/aaai.v38i5.28243"},{"key":"6866_CR31","unstructured":"Le M, Nguyen A, Nguyen H, Nguyen C, Ho N (2025) Adaptive prompt: Unlocking the power of visual prompt tuning. arXiv:2501.18936"},{"issue":"2","key":"6866_CR32","doi-asserted-by":"publisher","first-page":"511","DOI":"10.1007\/s11263-024-02172-x","volume":"133","author":"C Xu","year":"2025","unstructured":"Xu C, Zhu Y, Shen H, Chen B, Liao Y, Chen X, Wang L (2025) Progressive visual prompt learning with contrastive feature re-formation. Int. J. Comput. Vision 133(2):511\u2013526","journal-title":"Int. J. Comput. Vision"},{"key":"6866_CR33","unstructured":"Wang Y, Cheng L, Fang C, Zhang D, Duan M, Wang M (2024) Revisiting the power of prompt for visual tuning. arXiv:2402.02382"},{"key":"6866_CR34","first-page":"5552","volume":"37","author":"R Zeng","year":"2024","unstructured":"Zeng R, Han C, Wang Q, Wu C, Geng T, Huangg L, Wu YN, Liu D (2024) Visual fourier prompt tuning. Adv. Neural. Inf. Process. Syst. 37:5552\u20135585","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"6866_CR35","doi-asserted-by":"crossref","unstructured":"Khattak MU, Rasheed H, Maaz M, Khan S, Khan FS (2023) Maple: Multi-modal prompt learning. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 19113\u201319122","DOI":"10.1109\/CVPR52729.2023.01832"},{"key":"6866_CR36","doi-asserted-by":"crossref","unstructured":"Cho E, Kim J, Kim HJ (2023) Distribution-aware prompt tuning for vision-language models. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 22004\u201322013","DOI":"10.1109\/ICCV51070.2023.02011"},{"key":"6866_CR37","unstructured":"Chen G, Yao W, Song X, Li X, Rao Y, Zhang K (2022) Plot: Prompt learning with optimal transport for vision-language models. arXiv:2210.01253"},{"key":"6866_CR38","doi-asserted-by":"crossref","unstructured":"Khattak MU, Wasim ST, Naseer M, Khan S, Yang M-H, Khan FS (2023) Self-regulating prompts: Foundational model adaptation without forgetting. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 15190\u201315200","DOI":"10.1109\/ICCV51070.2023.01394"},{"key":"6866_CR39","doi-asserted-by":"crossref","unstructured":"Wang T, Liu Y, Liang JC, Cui Y, Mao Y, Nie S, Liu J, Feng F, Xu Z, Han C et al (2024) M$$^2$$pt: Multimodal prompt tuning for zero-shot instruction learning. arXiv:2409.15657","DOI":"10.18653\/v1\/2024.emnlp-main.218"},{"key":"6866_CR40","doi-asserted-by":"crossref","unstructured":"Yao H, Zhang R, Yu L, Zhang Y, Xu C (2024) Sep: Self-enhanced prompt tuning for visual-language model. arXiv:2405.15549","DOI":"10.1109\/CVPR52733.2024.02212"},{"key":"6866_CR41","unstructured":"Peng W, Liu K, Hu J, Zhang M (2025) Biomed-dpt: Dual modality prompt tuning for biomedical vision-language models. 
arXiv:2505.05189"},{"key":"6866_CR42","unstructured":"Zhang Q (2024) Generalizable prompt tuning for vision-language models. arXiv:2410.03189"},{"key":"6866_CR43","doi-asserted-by":"crossref","unstructured":"Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248\u2013255. Ieee","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"6866_CR44","unstructured":"Fei-Fei L, Fergus R, Perona P (2004) Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 178\u2013178. IEEE"},{"key":"6866_CR45","doi-asserted-by":"crossref","unstructured":"Parkhi OM, Vedaldi A, Zisserman A, Jawahar C (2012) Cats and dogs. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3498\u20133505 . IEEE","DOI":"10.1109\/CVPR.2012.6248092"},{"key":"6866_CR46","doi-asserted-by":"crossref","unstructured":"Krause J, Stark M, Deng J, Fei-Fei L (2013) 3d object representations for fine-grained categorization. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554\u2013561","DOI":"10.1109\/ICCVW.2013.77"},{"key":"6866_CR47","doi-asserted-by":"crossref","unstructured":"Nilsback M-E, Zisserman A (2008) Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722\u2013729. IEEE","DOI":"10.1109\/ICVGIP.2008.47"},{"key":"6866_CR48","doi-asserted-by":"crossref","unstructured":"Bossard L, Guillaumin M, Van\u00a0Gool L (2014) Food-101\u2013mining discriminative components with random forests. In: Computer vision\u2013ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pp. 446\u2013461. Springer","DOI":"10.1007\/978-3-319-10599-4_29"},{"key":"6866_CR49","unstructured":"Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. (2013)"},{"issue":"7","key":"6866_CR50","doi-asserted-by":"publisher","first-page":"2217","DOI":"10.1109\/JSTARS.2019.2918242","volume":"12","author":"P Helber","year":"2019","unstructured":"Helber P, Bischke B, Dengel A, Borth D (2019) Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. 12(7):2217\u20132226","journal-title":"IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing."},{"key":"6866_CR51","unstructured":"Soomro K, Zamir AR, Shah M (2012) Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402"},{"key":"6866_CR52","doi-asserted-by":"crossref","unstructured":"Cimpoi M, Maji S, Kokkinos I, Mohamed S, Vedaldi A (2014) Describing textures in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606\u20133613","DOI":"10.1109\/CVPR.2014.461"},{"key":"6866_CR53","doi-asserted-by":"crossref","unstructured":"Xiao J, Hays J, Ehinger KA, Oliva A, Torralba A (2010) Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485\u20133492 . 
IEEE","DOI":"10.1109\/CVPR.2010.5539970"},{"key":"6866_CR54","doi-asserted-by":"crossref","unstructured":"Zanella M, Ben\u00a0Ayed I (2024) Low-rank few-shot adaptation of vision-language models. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 1593\u20131603","DOI":"10.1109\/CVPRW63382.2024.00166"}],"container-title":["Applied Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10489-025-06866-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10489-025-06866-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10489-025-06866-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T15:41:45Z","timestamp":1762530105000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10489-025-06866-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,29]]},"references-count":54,"journal-issue":{"issue":"15","published-print":{"date-parts":[[2025,10]]}},"alternative-id":["6866"],"URL":"https:\/\/doi.org\/10.1007\/s10489-025-06866-8","relation":{},"ISSN":["0924-669X","1573-7497"],"issn-type":[{"type":"print","value":"0924-669X"},{"type":"electronic","value":"1573-7497"}],"subject":[],"published":{"date-parts":[[2025,9,29]]},"assertion":[{"value":"13 November 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 August 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 September 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"988"}}