{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T08:00:23Z","timestamp":1777622423911,"version":"3.51.4"},"reference-count":49,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2025,4,17]],"date-time":"2025-04-17T00:00:00Z","timestamp":1744848000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000923","name":"Australian Research Council","doi-asserted-by":"crossref","award":["DE220101075"],"award-info":[{"award-number":["DE220101075"]}],"id":[{"id":"10.13039\/501100000923","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Intell. Syst. Technol."],"published-print":{"date-parts":[[2025,4,30]]},"abstract":"<jats:p>Vision-language models, such as the Contrastive Language-Image Pre-Training (CLIP) model, have achieved significant success in image classification tasks. CLIP demonstrates high expressive power in few-shot learning scenarios due to its pairing of text and image encoders. However, CLIP still faces over-fitting when trained with a limited number of samples. To mitigate this, image augmentation techniques have been proposed in few-shot learning tasks to prevent over-fitting by enriching the dataset. Existing image augmentation methods, primarily designed for single-modal image models, focus solely on transformations within the image itself. However, for CLIP, merely increasing visual variety without considering textual content can reduce generalization ability and may even mislead the model. To address this issue, we introduce a novel image augmentation approach\u2014Integrated Image-Text Augmentation (ITA)\u2014 for CLIP model in few-shot learning tasks. This method generates new and diverse augmented images to increase the diversity of the training data and reduce over-fitting. Additionally, ITA establishes an alignment between the augmented images and their textual descriptions. Through this alignment, the model not only learns to recognize visual elements in the images but also understands the semantic connections between these elements and the text descriptions. This dual-modal approach enhances the model\u2019s flexibility and accuracy in processing few-shot learning tasks. Extensive experiments in few-shot image classification scenarios have demonstrated that ITA shows significant improvements compared to various image augmentation techniques.<\/jats:p>","DOI":"10.1145\/3712700","type":"journal-article","created":{"date-parts":[[2025,1,20]],"date-time":"2025-01-20T17:20:48Z","timestamp":1737393648000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Integrated Image-Text Augmentation for Few-Shot Learning in Vision-Language Models"],"prefix":"10.1145","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-8397-3410","authenticated-orcid":false,"given":"Ran","family":"Wang","sequence":"first","affiliation":[{"name":"Australian Artificial Intelligence Institute, Faculty of Engineering and IT, University of Technology Sydney, Sydney, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9122-0775","authenticated-orcid":false,"given":"Hua","family":"Zuo","sequence":"additional","affiliation":[{"name":"Australian Artificial Intelligence Institute, Faculty of Engineering and IT, University of Technology Sydney, Sydney, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0602-6255","authenticated-orcid":false,"given":"Zhen","family":"Fang","sequence":"additional","affiliation":[{"name":"Australian Artificial Intelligence Institute, Faculty of Engineering and IT, University of Technology Sydney, Sydney, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0690-4732","authenticated-orcid":false,"given":"Jie","family":"Lu","sequence":"additional","affiliation":[{"name":"Australian Artificial Intelligence Institute, Faculty of Engineering and IT, University of Technology Sydney, Sydney, Australia"}]}],"member":"320","published-online":{"date-parts":[[2025,4,17]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10599-4_29"},{"key":"e_1_3_1_3_2","first-page":"1","article-title":"PLOT: Prompt learning with optimal transport for vision-language models","author":"Chen Guangyi","year":"2023","unstructured":"Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. 2023. PLOT: Prompt learning with optimal transport for vision-language models. In ICLR. OpenReview.Net, 1\u201315.","journal-title":"ICLR"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33013379"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.461"},{"key":"e_1_3_1_6_2","unstructured":"Ekin Dogus Cubuk Barret Zoph Dandelion Man\u00e9 Vijay Vasudevan and Quoc V. Le. 2018. AutoAugment: Learning augmentation policies from data. arXiv:1805.09501. Retrieved from https:\/\/arxiv.org\/abs\/1805.09501"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_1_8_2","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT. Association for Computational Linguistics, 4171\u20134186.","journal-title":"NAACL-HLT"},{"key":"e_1_3_1_9_2","unstructured":"Terrance Devries and Graham W. Taylor. 2017. Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552. Retrieved from https:\/\/arxiv.org\/abs\/1708.04552"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01129"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-demo.10"},{"key":"e_1_3_1_12_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et\u00a0al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2005.09.012"},{"key":"e_1_3_1_14_2","unstructured":"Peng Gao Shijie Geng Renrui Zhang Teli Ma Rongyao Fang Yongfeng Zhang Hongsheng Li and Yu Qiao. 2021. CLIP-adapter: Better vision-language models with feature adapters. arXiv:2110.04544. Retrieved from https:\/\/arxiv.org\/abs\/2110.04544"},{"key":"e_1_3_1_15_2","first-page":"1","volume-title":"ICLR","author":"Gu Xiuye","year":"2022","unstructured":"Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. 2022. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR. OpenReview.net, 1\u201310."},{"key":"e_1_3_1_16_2","unstructured":"Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2015. Deep residual learning for image recognition. arXiv:1512.03385. Retrieved from https:\/\/arxiv.org\/abs\/1512.03385"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSTARS.2019.2918242"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3510030"},{"key":"e_1_3_1_19_2","unstructured":"Muhammad Uzair Khattak Hanoona Abdul Rasheed Muhammad Maaz Salman Khan and Fahad Shahbaz Khan. 2022. MaPLe: Multi-modal prompt learning. arXiv:2210.03117. Retrieved from https:\/\/arxiv.org\/abs\/2210.03117"},{"key":"e_1_3_1_20_2","first-page":"554","article-title":"3D object representations for fine-grained categorization","author":"Krause Jonathan","year":"2013","unstructured":"Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 2013. 3D object representations for fine-grained categorization. In ICCV. IEEE Computer Society, 554\u2013561.","journal-title":"ICCV"},{"key":"e_1_3_1_21_2","first-page":"1","article-title":"Language-driven semantic segmentation","author":"Li Boyi","year":"2022","unstructured":"Boyi Li, Kilian Q. Weinberger, Serge J. Belongie, Vladlen Koltun, and Ren\u00e9 Ranftl. 2022. Language-driven semantic segmentation. In ICLR. OpenReview.net, 1\u201312.","journal-title":"ICLR"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3560815"},{"key":"e_1_3_1_23_2","unstructured":"Subhransu Maji Esa Rahtu Juho Kannala Matthew B. Blaschko and Andrea Vedaldi. 2013. Fine-grained visual classification of aircraft. arXiv:1306.5151. Retrieved from https:\/\/arxiv.org\/abs\/1306.5151"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3609483"},{"key":"e_1_3_1_25_2","series-title":"Proceedings of Machine Learning Research","first-page":"8152","volume-title":"ICML","volume":"139","author":"Ni Renkun","year":"2021","unstructured":"Renkun Ni, Micah Goldblum, Amr Sharaf, Kezhi Kong, and Tom Goldstein. 2021. Data augmentation for meta-learning. In ICML, Proceedings of Machine Learning Research, Vol. 139, PMLR, 8152\u20138161."},{"key":"e_1_3_1_26_2","first-page":"722","article-title":"Automated flower classification over a large number of classes","author":"Nilsback Maria-Elena","year":"2008","unstructured":"Maria-Elena Nilsback and Andrew Zisserman. 2008. Automated flower classification over a large number of classes. In ICVGIP. IEEE Computer Society, 722\u2013729.","journal-title":"ICVGIP"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2012.6248092"},{"key":"e_1_3_1_28_2","unstructured":"Fang Peng Xiaoshan Yang Linhui Xiao Yaowei Wang and Changsheng Xu. 2023. SgVA-CLIP: Semantic-guided visual adapting of vision-language models for few-shot image classification. arXiv:2211.16191. Retrieved from https:\/\/arxiv.org\/abs\/2211.16191"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1250"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00755"},{"key":"e_1_3_1_31_2","unstructured":"Alec Radford Jong Wook Kim Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark et\u00a0al. 2021. Learning transferable visual models from natural language supervision. arXiv:2103.00020. Retrieved from https:\/\/arxiv.org\/abs\/2103.00020"},{"key":"e_1_3_1_32_2","unstructured":"Alec Radford Jeff Wu Rewon Child D. Luan Dario Amodei and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog. Retrieved from https:\/\/cdn.openai.com\/better-language-models\/language_models_are_unsupervised_multitask_learners.pdf"},{"key":"e_1_3_1_33_2","first-page":"6545","article-title":"Fine-tuned CLIP models are efficient video learners","author":"Rasheed Hanoona Abdul","year":"2023","unstructured":"Hanoona Abdul Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman H. Khan, and Fahad Shahbaz Khan. 2023. Fine-tuned CLIP models are efficient video learners. In CVPR. IEEE, 6545\u20136554.","journal-title":"CVPR"},{"key":"e_1_3_1_34_2","first-page":"1","article-title":"Bridging the gap between object and image-level representations for open-vocabulary detection","author":"Rasheed Hanoona Abdul","year":"2022","unstructured":"Hanoona Abdul Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman H. Khan, and Fahad Shahbaz Khan. 2022. Bridging the gap between object and image-level representations for open-vocabulary detection. In NeurIPS, 1\u201313.","journal-title":"NeurIPS"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2021.02.007"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN60899.2024.10650901"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/TFUZZ.2024.3389705"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3682067"},{"key":"e_1_3_1_39_2","unstructured":"Khurram Soomro Amir Roshan Zamir and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402. Retrieved from https:\/\/arxiv.org\/abs\/1212.0402"},{"key":"e_1_3_1_40_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. arXiv:1706.03762. Retrieved from https:\/\/arxiv.org\/abs\/1706.03762"},{"key":"e_1_3_1_41_2","first-page":"3630","volume-title":"NIPS.","author":"Vinyals Oriol","year":"2016","unstructured":"Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching networks for one shot learning. In NIPS. Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (Eds.), 3630\u20133638."},{"key":"e_1_3_1_42_2","doi-asserted-by":"crossref","unstructured":"Ran Wang Hua Zuo Zhen Fang and Jie Lu. 2024. Towards robustness prompt tuning with fully test-time adaptation for CLIP\u2019S zero-shot generalization. In MM. Retrieved from https:\/\/openreview.net\/forum?id=BVFAVis7ui","DOI":"10.1145\/3664647.3681213"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3293318"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3386252"},{"key":"e_1_3_1_45_2","first-page":"3485","article-title":"SUN database: Large-scale scene recognition from Abbey to zoo","author":"Xiao Jianxiong","year":"2010","unstructured":"Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. 2010. SUN database: Large-scale scene recognition from Abbey to zoo. In CVPR. IEEE Computer Society, 3485\u20133492.","journal-title":"CVPR"},{"key":"e_1_3_1_46_2","unstructured":"Xiaoyu Yang Jie Lu and En Yu. 2024. Adapting multi-modal large language model to concept drift from Pre-training onwards. arXiv:2405.13459. Retrieved from https:\/\/arxiv.org\/abs\/2405.13459"},{"key":"e_1_3_1_47_2","first-page":"6022","article-title":"CutMix: Regularization strategy to train strong classifiers with localizable features","author":"Yun Sangdoo","year":"2019","unstructured":"Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. 2019. CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV. IEEE, 6022\u20136031.","journal-title":"ICCV"},{"key":"e_1_3_1_48_2","first-page":"1","article-title":"Mixup: Beyond empirical Risk minimization","author":"Zhang Hongyi","year":"2018","unstructured":"Hongyi Zhang, Moustapha Ciss\u00e9, Yann N. Dauphin, and David Lopez-Paz. 2018. Mixup: Beyond empirical Risk minimization. In ICLR. OpenReview.net, 1\u201313.","journal-title":"ICLR"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01631"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-022-01653-1"}],"container-title":["ACM Transactions on Intelligent Systems and Technology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3712700","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3712700","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:18:08Z","timestamp":1750295888000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3712700"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,17]]},"references-count":49,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,4,30]]}},"alternative-id":["10.1145\/3712700"],"URL":"https:\/\/doi.org\/10.1145\/3712700","relation":{},"ISSN":["2157-6904","2157-6912"],"issn-type":[{"value":"2157-6904","type":"print"},{"value":"2157-6912","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,4,17]]},"assertion":[{"value":"2024-02-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-12-29","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-04-17","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}