{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T20:03:20Z","timestamp":1781553800464,"version":"3.54.5"},"reference-count":58,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T00:00:00Z","timestamp":1698192000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T00:00:00Z","timestamp":1698192000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2024,4]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Soft prompt learning has emerged as a promising direction for adapting V&amp;L models to a downstream task using a few training examples. However, current methods significantly overfit the training data suffering from large accuracy degradation when tested on unseen classes from the same domain. In addition, all prior methods operate exclusively under the assumption that both vision and language data is present. To this end, we make the following 5 contributions: (1) To alleviate base class overfitting, we propose a novel Language-Aware Soft Prompting (LASP) learning method by means of a text-to-text cross-entropy loss that maximizes the probability of the learned prompts to be correctly classified with respect to pre-defined hand-crafted textual prompts. (2) To increase the representation capacity of the prompts, we also propose <jats:italic>grouped<\/jats:italic> LASP where each group of prompts is optimized with respect to a separate subset of textual prompts. 
(3) Moreover, we identify a visual-language misalignment introduced by prompt learning and LASP, and more importantly, propose a re-calibration mechanism to address it. (4) Importantly, we show that LASP is inherently amenable to including, during training, <jats:italic>virtual classes<\/jats:italic>, i.e. class names for which no visual samples are available, further increasing the robustness of the learned prompts. Expanding for the first time the setting to language-only adaptation, (5) we present a novel zero-shot variant of LASP where no visual samples at all are available for the downstream task. Through evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets. Finally, (c) we show that our zero-shot variant improves upon CLIP without requiring any extra data. Code will be made available.<\/jats:p>","DOI":"10.1007\/s11263-023-01904-9","type":"journal-article","created":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T07:02:21Z","timestamp":1698217341000},"page":"1108-1125","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Language-Aware Soft Prompting: Text-to-Text Optimization for Few- and Zero-Shot Adaptation of V&amp;L Models"],"prefix":"10.1007","volume":"132","author":[{"given":"Adrian","family":"Bulat","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Georgios","family":"Tzimiropoulos","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2023,10,25]]},"reference":[{"key":"1904_CR1","unstructured":"Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., & Hasson, Y., et al. (2022). 
Flamingo: A visual language model for few-shot learning. arXiv:2204.14198"},{"key":"1904_CR2","unstructured":"Albuquerque, I., Naik, N., Li, J., Keskar, N., & Socher, R. (2020). Improving out-of-distribution generalization via multi-task self-supervised pretraining. arXiv:2003.13525"},{"key":"1904_CR3","unstructured":"Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450"},{"key":"1904_CR4","unstructured":"Balaji, Y., Sankaranarayanan, S., & Chellappa, R. (2018). Metareg: Towards domain generalization using meta-regularization. In Advances in neural information processing systems (Vol. 31)."},{"key":"1904_CR5","doi-asserted-by":"crossref","unstructured":"Bossard, L., Guillaumin, M., & Gool, L. V. (2014). Food-101-mining discriminative components with random forests. In European conference on computer vision (pp. 446\u2013461).","DOI":"10.1007\/978-3-319-10599-4_29"},{"key":"1904_CR6","first-page":"1877","volume":"33","author":"T Brown","year":"2020","unstructured":"Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., & Dhariwal, P. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877\u20131901.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"1904_CR7","first-page":"9912","volume":"33","author":"M Caron","year":"2020","unstructured":"Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912\u20139924.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"1904_CR8","unstructured":"Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 
1597\u20131607)."},{"key":"1904_CR9","doi-asserted-by":"crossref","unstructured":"Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 3606\u20133613).","DOI":"10.1109\/CVPR.2014.461"},{"key":"1904_CR10","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 ieee conference on computer vision and pattern recognition (pp. 248\u2013255).","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"1904_CR11","unstructured":"Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805"},{"key":"1904_CR12","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., & Unterthiner, T., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929"},{"key":"1904_CR13","unstructured":"Dou, Q., Coelho de Castro, D., Kamnitsas, K., & Glocker, B. (2019). Domain generalization via model-agnostic learning of semantic features. Advances in Neural Information Processing Systems, 32."},{"key":"1904_CR14","doi-asserted-by":"crossref","unstructured":"Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 Conference on computer vision and pattern recognition workshop (pp. 178\u2013178).","DOI":"10.1109\/CVPR.2004.383"},{"issue":"7","key":"1904_CR15","doi-asserted-by":"publisher","first-page":"2217","DOI":"10.1109\/JSTARS.2019.2918242","volume":"12","author":"P Helber","year":"2019","unstructured":"Helber, P., Bischke, B., Dengel, A., & Borth, D. (2019). Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. 
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7), 2217\u20132226.","journal-title":"IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing"},{"key":"1904_CR16","doi-asserted-by":"crossref","unstructured":"Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., & Dorundo, E., et al. (2021). The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the ieee\/cvf international conference on computer vision (pp. 8340\u20138349).","DOI":"10.1109\/ICCV48922.2021.00823"},{"key":"1904_CR17","doi-asserted-by":"crossref","unstructured":"Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., & Song, D. (2021). Natural adversarial examples. In Proceedings of the ieee\/cvf conference on computer vision and pattern recognition (pp. 15262\u201315271).","DOI":"10.1109\/CVPR46437.2021.01501"},{"key":"1904_CR18","unstructured":"Hinton, G., Vinyals, O., & Dean, J., et al. (2015). Distilling the knowledge in a neural network. arXiv:1503.02531"},{"key":"1904_CR19","unstructured":"Hu, S., Zhang, K., Chen, Z., & Chan, L. (2020). Domain generalization via multidomain discriminant analysis. In Uncertainty in artificial intelligence (pp. 292\u2013302)."},{"key":"1904_CR20","unstructured":"Huang, T., Chu, J., & Wei, F. (2022). Unsupervised prompt learning for vision-language models. arXiv:2204.03649"},{"key":"1904_CR21","unstructured":"Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., & Pham, H., et al. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning (pp. 4904\u20134916)."},{"key":"1904_CR22","doi-asserted-by":"crossref","unstructured":"Krause, J., Stark, M., Deng, J., Fei-Fei, L. (2013). 3d object representations for fine-grained categorization. In Proceedings of the ieee international conference on computer vision workshops (pp. 
554\u2013561).","DOI":"10.1109\/ICCVW.2013.77"},{"issue":"6","key":"1904_CR23","doi-asserted-by":"publisher","first-page":"84","DOI":"10.1145\/3065386","volume":"60","author":"A Krizhevsky","year":"2017","unstructured":"Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84\u201390.","journal-title":"Communications of the ACM"},{"key":"1904_CR24","first-page":"19884","volume":"33","author":"M Laskin","year":"2020","unstructured":"Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., & Srinivas, A. (2020). Reinforcement learning with augmented data. Advances in Neural Information Processing Systems, 33, 19884\u201319895.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"1904_CR25","doi-asserted-by":"crossref","unstructured":"Lester, B., Al-Rfou, R., & Constant, N. (2021). The power of scale for parameter efficient prompt tuning. arXiv:2104.08691","DOI":"10.18653\/v1\/2021.emnlp-main.243"},{"key":"1904_CR26","doi-asserted-by":"crossref","unstructured":"Li, X. L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. arXiv:2101.00190","DOI":"10.18653\/v1\/2021.acl-long.353"},{"key":"1904_CR27","unstructured":"Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., & Shao, J., et al. (2021). Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv:2110.05208"},{"key":"1904_CR28","unstructured":"Li, Z., Zhou, F., Chen, F., & Li, H. (2017). Metasgd: Learning to learn quickly for few-shot learning. arXiv:1707.09835"},{"key":"1904_CR29","unstructured":"Liang, W., Zhang, Y., Kwon, Y., Yeung, S., & Zou, J. (2022). Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. arXiv:2203.02053"},{"key":"1904_CR30","doi-asserted-by":"crossref","unstructured":"Lu, Y., Liu, J., Zhang, Y., Liu, Y., & Tian, X. (2022). Prompt distribution learning. 
In Ieee conference on computer vision and pattern recognition.","DOI":"10.1109\/CVPR52688.2022.00514"},{"key":"1904_CR31","unstructured":"Mahajan, D., Tople, S., & Sharma, A. (2021). Domain generalization using causal matching. In International conference on machine learning (pp. 7313\u20137324)."},{"key":"1904_CR32","unstructured":"Maji, S., Rahtu, E., Kannala, J., Blaschko, M., & Vedaldi, A. (2013). Fine-grained visual classification of aircraft. arXiv:1306.5151"},{"key":"1904_CR33","unstructured":"Nichol, A., Achiam, J., & Schulman, J. (2018). On first-order meta-learning algorithms. arXiv:1803.02999"},{"key":"1904_CR34","doi-asserted-by":"crossref","unstructured":"Nilsback, M.-E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics and image processing (pp. 722\u2013729).","DOI":"10.1109\/ICVGIP.2008.47"},{"key":"1904_CR35","doi-asserted-by":"crossref","unstructured":"Parkhi, O. M., Vedaldi, A., Zisserman, A., & Jawahar, C. (2012). Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition (pp. 3498\u20133505).","DOI":"10.1109\/CVPR.2012.6248092"},{"key":"1904_CR36","unstructured":"Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., & DeVito, Z., et al. (2017). Automatic differentiation in pytorch."},{"key":"1904_CR37","unstructured":"Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748\u20138763)."},{"issue":"8","key":"1904_CR38","first-page":"9","volume":"1","author":"A Radford","year":"2019","unstructured":"Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.","journal-title":"OpenAI blog"},{"key":"1904_CR39","unstructured":"Rajeswaran, A., Finn, C., Kakade, S. 
M., & Levine, S. (2019). Meta-learning with implicit gradients. Advances in Neural Information Processing Systems, 32."},{"key":"1904_CR40","unstructured":"Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). Do imagenet classifiers generalize to imagenet? In International conference on machine learning (pp. 5389\u20135400)."},{"key":"1904_CR41","doi-asserted-by":"crossref","unstructured":"Ren, S., Li, L., Ren, X., Zhao, G., & Sun, X. (2022). Rethinking the openness of clip. arXiv:2206.01986","DOI":"10.18653\/v1\/2023.findings-acl.610"},{"key":"1904_CR42","doi-asserted-by":"crossref","unstructured":"Schick, T., & Sch\u00fctze, H. (2020a). Exploiting cloze questions for few shot text classification and natural language inference. arXiv:2001.07676","DOI":"10.18653\/v1\/2021.eacl-main.20"},{"key":"1904_CR43","doi-asserted-by":"crossref","unstructured":"Schick, T., & Sch\u00fctze, H. (2020b). It\u2019s not just size that matters: Small language models are also few-shot learners. arXiv:2009.07118","DOI":"10.18653\/v1\/2021.naacl-main.185"},{"key":"1904_CR44","doi-asserted-by":"crossref","unstructured":"Shao, R., Lan, X., Li, J., & Yuen, P. C. (2019). Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In Proceedings of the ieee\/cvf conference on computer vision and pattern recognition (pp. 10023\u201310031).","DOI":"10.1109\/CVPR.2019.01026"},{"key":"1904_CR45","doi-asserted-by":"crossref","unstructured":"Shi, Y., Yu, X., Sohn, K., Chandraker, M., & Jain, A. K. (2020). Towards universal representation learning for deep face recognition. In Proceedings of the ieee\/cvf conference on computer vision and pattern recognition (pp. 6817\u20136826).","DOI":"10.1109\/CVPR42600.2020.00685"},{"key":"1904_CR46","doi-asserted-by":"crossref","unstructured":"Song, Y., Wang, T., Cai, P., Mondal, S. K., & Sahoo, J. P. (2023). A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. 
ACM Computing Surveys.","DOI":"10.1145\/3582688"},{"key":"1904_CR47","unstructured":"Soomro, K., Zamir, A.R., & Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402"},{"key":"1904_CR48","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30."},{"key":"1904_CR49","unstructured":"Wang, H., Ge, S., Lipton, Z., & Xing, E. P. (2019). Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32."},{"key":"1904_CR50","doi-asserted-by":"crossref","unstructured":"Xian, Y., Schiele, B., & Akata, Z. (2017). Zero-shot learning-the good, the bad and the ugly. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 4582\u20134591).","DOI":"10.1109\/CVPR.2017.328"},{"key":"1904_CR51","doi-asserted-by":"crossref","unstructured":"Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition (pp. 3485\u20133492).","DOI":"10.1109\/CVPR.2010.5539970"},{"key":"1904_CR52","unstructured":"Yao, L., Huang, R., Hou, L., Lu, G., Niu, M., Xu, H., et al. (2021). Filip: Fine-grained interactive language-image pretraining. arXiv:2111.07783"},{"key":"1904_CR53","doi-asserted-by":"crossref","unstructured":"Yarats, D., Zhang, A., Kostrikov, I., Amos, B., Pineau, J., & Fergus, R. (2021). Improving sample efficiency in model-free reinforcement learning from images. In Proceedings of the aaai conference on artificial intelligence (Vol. 35, pp. 10674\u201310681).","DOI":"10.1609\/aaai.v35i12.17276"},{"key":"1904_CR54","unstructured":"Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., & Wu, Y. (2022). 
Coca: Contrastive captioners are image-text foundation models. arXiv:2205.01917."},{"key":"1904_CR55","unstructured":"Zhou, K., Loy, C.C., & Liu, Z. (2023). Semi-supervised domain generalization with stochastic stylematch. International Journal of Computer Vision, 1\u201311."},{"key":"1904_CR56","doi-asserted-by":"crossref","unstructured":"Zhou, K., Yang, J., Loy, C.C., & Liu, Z. (2022a). Conditional prompt learning for vision-language models. In Proceedings of the ieee\/cvf conference on computer vision and pattern recognition (pp. 16816\u201316825).","DOI":"10.1109\/CVPR52688.2022.01631"},{"issue":"9","key":"1904_CR57","doi-asserted-by":"publisher","first-page":"2337","DOI":"10.1007\/s11263-022-01653-1","volume":"130","author":"K Zhou","year":"2022","unstructured":"Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337\u20132348.","journal-title":"International Journal of Computer Vision"},{"key":"1904_CR58","doi-asserted-by":"crossref","unstructured":"Zhu, B., Niu, Y., Han, Y., Wu, Y., & Zhang, H. (2022). Prompt-aligned gradient for prompt tuning. 
arXiv:2205.14865","DOI":"10.1109\/ICCV51070.2023.01435"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-023-01904-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-023-01904-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-023-01904-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,3,26]],"date-time":"2024-03-26T11:11:13Z","timestamp":1711451473000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-023-01904-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,25]]},"references-count":58,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,4]]}},"alternative-id":["1904"],"URL":"https:\/\/doi.org\/10.1007\/s11263-023-01904-9","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,10,25]]},"assertion":[{"value":"2 April 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 September 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 October 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}