{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,2]],"date-time":"2026-07-02T03:29:19Z","timestamp":1782962959464,"version":"3.54.5"},"reference-count":99,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2023,9,7]],"date-time":"2023-09-07T00:00:00Z","timestamp":1694044800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,9,7]],"date-time":"2023-09-07T00:00:00Z","timestamp":1694044800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001774","name":"University of Sydney","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001774","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2024,2]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Transferring knowledge from pre-trained deep models for downstream tasks, particularly with limited labeled samples, is a fundamental problem in computer vision research. Recent advances in large-scale, task-agnostic vision-language pre-trained models, which are learned with billions of samples, have shed new light on this problem. In this study, we investigate how to efficiently transfer aligned visual and textual knowledge for downstream visual recognition tasks. We first revisit the role of the linear classifier in the vanilla transfer learning framework, and then propose a new paradigm where the parameters of the classifier are initialized with semantic targets from the textual encoder and remain fixed during optimization. To provide a comparison, we also initialize the classifier with knowledge from various resources. 
In the empirical study, we demonstrate that our paradigm improves the performance and training speed of transfer learning tasks. With only minor modifications, our approach proves effective across 17 visual datasets that span three different data domains: image, video, and 3D point cloud.<\/jats:p>","DOI":"10.1007\/s11263-023-01876-w","type":"journal-article","created":{"date-parts":[[2023,9,7]],"date-time":"2023-09-07T12:04:58Z","timestamp":1694088298000},"page":"392-409","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":31,"title":["Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective"],"prefix":"10.1007","volume":"132","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8511-743X","authenticated-orcid":false,"given":"Wenhao","family":"Wu","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhun","family":"Sun","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yuxin","family":"Song","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jingdong","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Wanli","family":"Ouyang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2023,9,7]]},"reference":[{"key":"1876_CR1","doi-asserted-by":"crossref","unstructured":"Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lu\u010di\u0107, M., & Schmid, C. (2021). Vivit: A video vision transformer. In ICCV (pp. 6836\u20136846).","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"1876_CR2","unstructured":"Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? In ICML, PMLR (pp. 
813\u2013824)."},{"key":"1876_CR3","doi-asserted-by":"crossref","unstructured":"Bossard, L., Guillaumin, M., & Van\u00a0Gool, L. (2014). Food-101\u2013mining discriminative components with random forests. In ECCV.","DOI":"10.1007\/978-3-319-10599-4_29"},{"key":"1876_CR4","doi-asserted-by":"crossref","unstructured":"Brattoli, B., Tighe, J., Zhdanov, F., Perona, P., & Chalupka, K. (2020). Rethinking zero-shot video classification: End-to-end training for realistic applications. In CVPR (pp. 4613\u20134623).","DOI":"10.1109\/CVPR42600.2020.00467"},{"key":"1876_CR5","unstructured":"Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., & Kim, S. (2022). Coyo-700m: Image-text pair dataset. https:\/\/github.com\/kakaobrain\/coyo-dataset"},{"key":"1876_CR6","doi-asserted-by":"crossref","unstructured":"Caba\u00a0Heilbron, F., Escorcia, V., Ghanem, B., & Carlos\u00a0Niebles, J. (2015). Activitynet: A large-scale video benchmark for human activity understanding. In CVPR (pp. 961\u2013970).","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"1876_CR7","doi-asserted-by":"crossref","unstructured":"Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR.","DOI":"10.1109\/CVPR.2017.502"},{"key":"1876_CR8","unstructured":"Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., & Zisserman, A. (2018). A short note about kinetics-600. arXiv preprint arXiv:1808.01340"},{"key":"1876_CR9","doi-asserted-by":"crossref","unstructured":"Chen, S., & Huang, D. (2021). Elaborative rehearsal for zero-shot action recognition. In ICCV (pp. 13638\u201313647).","DOI":"10.1109\/ICCV48922.2021.01338"},{"key":"1876_CR10","doi-asserted-by":"crossref","unstructured":"Chen, X., Xie, S., & He, K. (2021). An empirical study of training self-supervised vision transformers. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00950"},{"key":"1876_CR11","doi-asserted-by":"crossref","unstructured":"Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). 
Describing textures in the wild. In CVPR.","DOI":"10.1109\/CVPR.2014.461"},{"key":"1876_CR12","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In CVPR (pp. 248\u2013255).","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"1876_CR13","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et\u00a0al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929"},{"key":"1876_CR14","doi-asserted-by":"crossref","unstructured":"Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., & Feichtenhofer, C. (2021). Multiscale vision transformers. In ICCV (pp. 6824\u20136835).","DOI":"10.1109\/ICCV48922.2021.00675"},{"key":"1876_CR15","doi-asserted-by":"crossref","unstructured":"Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In Computer vision and pattern recognition workshop.","DOI":"10.1109\/CVPR.2004.383"},{"key":"1876_CR16","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C. (2020). X3d: Expanding architectures for efficient video recognition. In CVPR (pp. 203\u2013213).","DOI":"10.1109\/CVPR42600.2020.00028"},{"key":"1876_CR17","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In ICCV (pp. 6202\u20136211).","DOI":"10.1109\/ICCV.2019.00630"},{"key":"1876_CR18","doi-asserted-by":"crossref","unstructured":"Gao, J., Zhang, T., & Xu, C. (2019). I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. In AAAI (vol. 33, pp. 
8303\u20138311).","DOI":"10.1609\/aaai.v33i01.33018303"},{"key":"1876_CR19","unstructured":"Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., & Qiao, Y. (2021). Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544"},{"key":"1876_CR20","doi-asserted-by":"crossref","unstructured":"Gao, R., Oh, T. H., Grauman, K., & Torresani, L. (2020). Listen to look: Action recognition by previewing audio. In CVPR (pp. 10457\u201310467).","DOI":"10.1109\/CVPR42600.2020.01047"},{"key":"1876_CR21","doi-asserted-by":"crossref","unstructured":"Ghadiyaram, D., Tran, D., & Mahajan, D. (2019). Large-scale weakly-supervised pre-training for video action recognition. In CVPR (pp. 12046\u201312055).","DOI":"10.1109\/CVPR.2019.01232"},{"key":"1876_CR22","unstructured":"Goyal, A., Law, H., Liu, B., Newell, A., & Deng, J. (2021). Revisiting point cloud shape classification with a simple and effective baseline. In International conference on machine learning, PMLR (pp. 3809\u20133820)."},{"key":"1876_CR23","unstructured":"Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., & Wang, Y. (2021). Transformer in transformer. In NeurIPS (pp. 15908\u201315919)."},{"key":"1876_CR24","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770\u2013778).","DOI":"10.1109\/CVPR.2016.90"},{"key":"1876_CR25","doi-asserted-by":"crossref","unstructured":"He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In CVPR (pp. 9729\u20139738).","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"1876_CR26","doi-asserted-by":"crossref","unstructured":"He, K., Chen, X., Xie, S., Li, Y., Doll\u00e1r, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 
16000\u201316009).","DOI":"10.1109\/CVPR52688.2022.01553"},{"issue":"7","key":"1876_CR27","doi-asserted-by":"publisher","first-page":"2217","DOI":"10.1109\/JSTARS.2019.2918242","volume":"12","author":"P Helber","year":"2019","unstructured":"Helber, P., Bischke, B., Dengel, A., & Borth, D. (2019). Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing., 12(7), 2217\u20132226.","journal-title":"IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing."},{"key":"1876_CR28","unstructured":"Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, PMLR (pp. 448\u2013456)."},{"key":"1876_CR29","unstructured":"Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., Le, Q., Sung, Y. H., Li, Z., & Duerig, T. (2021a). Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, PMLR (pp. 4904\u20134916)."},{"key":"1876_CR30","unstructured":"Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., Le, Q., Sung, Y. H., Li, Z., & Duerig, T. (2021b). Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, PMLR (pp. 4904\u20134916)."},{"key":"1876_CR31","doi-asserted-by":"crossref","unstructured":"Jiang, B., Wang, M., Gan, W., Wu, W., & Yan, J. (2019). Stm: Spatiotemporal and motion encoding for action recognition. In ICCV (pp. 2000\u20132009).","DOI":"10.1109\/ICCV.2019.00209"},{"key":"1876_CR32","doi-asserted-by":"crossref","unstructured":"Ju, C., Han, T., Zheng, K., Zhang, Y., & Xie, W. (2022). Prompting visual-language models for efficient video understanding. In ECCV (pp. 
105\u2013124), Springer.","DOI":"10.1007\/978-3-031-19833-5_7"},{"key":"1876_CR33","unstructured":"Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et\u00a0al. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950"},{"key":"1876_CR34","doi-asserted-by":"crossref","unstructured":"Kim, T. S., Jones, J., Peven, M., Xiao, Z., Bai, J., Zhang, Y., Qiu, W., Yuille, A., & Hager, G. D. (2021). Daszl: Dynamic action signatures for zero-shot learning. AAAI, (vol. 35, pp. 1817\u20131826).","DOI":"10.1609\/aaai.v35i3.16276"},{"key":"1876_CR35","doi-asserted-by":"crossref","unstructured":"Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3D object representations for fine-grained categorization. In 4th International IEEE workshop on 3D representation and recognition (3dRR-13), Sydney, Australia.","DOI":"10.1109\/ICCVW.2013.77"},{"key":"1876_CR36","unstructured":"Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NeurIPS (pp. 25)."},{"key":"1876_CR37","doi-asserted-by":"crossref","unstructured":"Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). Hmdb: A large video database for human motion recognition. In ICCV (pp. 2556\u20132563).","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"1876_CR38","unstructured":"Li, B., Weinberger, K. Q., Belongie, S., Koltun, V., & Ranftl, R. (2022a). Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546"},{"key":"1876_CR39","unstructured":"Li, J., Li, D., Xiong, C., & Hoi, S. (2022b). Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086"},{"issue":"4","key":"1876_CR40","doi-asserted-by":"publisher","first-page":"453","DOI":"10.1007\/s10115-006-0013-y","volume":"10","author":"T Li","year":"2006","unstructured":"Li, T., Zhu, S., & Ogihara, M. (2006). 
Using discriminant analysis for multi-class classification: An experimental investigation. Knowledge and Information Systems, 10(4), 453\u2013472.","journal-title":"Knowledge and Information Systems"},{"key":"1876_CR41","doi-asserted-by":"crossref","unstructured":"Lin, C. C., Lin, K., Wang, L., Liu, Z., & Li, L. (2022a). Cross-modal representation learning for zero-shot action recognition. In CVPR (pp. 19978\u201319988).","DOI":"10.1109\/CVPR52688.2022.01935"},{"key":"1876_CR42","doi-asserted-by":"crossref","unstructured":"Lin, J., Gan, C., & Han, S. (2019). Tsm: Temporal shift module for efficient video understanding. In ICCV.","DOI":"10.1109\/ICCV.2019.00718"},{"key":"1876_CR43","doi-asserted-by":"crossref","unstructured":"Lin, Z., Geng, S., Zhang, R., Gao, P., de\u00a0Melo, G., Wang, X., Dai, J., Qiao, Y., & Li, H. (2022b). Frozen clip models are efficient video learners. In ECCV (pp. 388\u2013404), Springer.","DOI":"10.1007\/978-3-031-19833-5_23"},{"key":"1876_CR44","doi-asserted-by":"crossref","unstructured":"Liu, Z., Luo, D., Wang, Y., Wang, L., Tai, Y., Wang, C., Li, J., Huang, F., & Lu, T. (2020). Teinet: Towards an efficient architecture for video recognition. In AAAI (pp. 11669\u201311676).","DOI":"10.1609\/aaai.v34i07.6836"},{"key":"1876_CR45","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV (pp. 10012\u201310022).","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"1876_CR46","doi-asserted-by":"crossref","unstructured":"Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., & Hu, H. (2022). Video swin transformer. In CVPR (pp. 3202\u20133211).","DOI":"10.1109\/CVPR52688.2022.00320"},{"key":"1876_CR47","doi-asserted-by":"crossref","unstructured":"L\u00fcddecke, T., & Ecker, A. (2022). Image segmentation using text and image prompts. 
In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 7086\u20137096).","DOI":"10.1109\/CVPR52688.2022.00695"},{"key":"1876_CR48","doi-asserted-by":"crossref","unstructured":"Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., & Li, T. (2021). Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860","DOI":"10.1016\/j.neucom.2022.07.028"},{"key":"1876_CR49","unstructured":"Maji, S., Rahtu, E., Kannala, J., Blaschko, M., & Vedaldi, A. (2013). Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151"},{"key":"1876_CR50","doi-asserted-by":"crossref","unstructured":"Mishra, A., Verma, V. K., Reddy, M. S. K., Arulkumar, S., Rai, P., & Mittal, A. (2018). A generative approach to zero-shot and few-shot action recognition. In WACV (pp. 372\u2013380).","DOI":"10.1109\/WACV.2018.00047"},{"key":"1876_CR51","unstructured":"Mokady, R., Hertz, A., & Bermano, A. H. (2021). Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734"},{"key":"1876_CR52","doi-asserted-by":"crossref","unstructured":"Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., & Ling, H. (2022). Expanding language-image pretrained models for general video recognition. In ECCV.","DOI":"10.1007\/978-3-031-19772-7_1"},{"key":"1876_CR53","doi-asserted-by":"crossref","unstructured":"Nilsback, M. E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In ICVGIP.","DOI":"10.1109\/ICVGIP.2008.47"},{"key":"1876_CR54","unstructured":"Van\u00a0den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv e-prints (pp. arXiv\u20131807)."},{"key":"1876_CR55","unstructured":"Pan, J., Lin, Z., Zhu, X., Shao, J., & Li, H. (2022). St-adapter: Parameter-efficient image-to-video transfer learning for action recognition. 
arXiv preprint arXiv:2206.13559"},{"key":"1876_CR56","doi-asserted-by":"crossref","unstructured":"Parkhi, O. M., Vedaldi, A., Zisserman, A., & Jawahar, C. (2012). Cats and dogs. In CVPR.","DOI":"10.1109\/CVPR.2012.6248092"},{"key":"1876_CR57","doi-asserted-by":"crossref","unstructured":"Qiu, Z., Yao, T., & Mei, T. (2017). Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV (pp. 5533\u20135541).","DOI":"10.1109\/ICCV.2017.590"},{"key":"1876_CR58","unstructured":"Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et\u00a0al. (2021). Learning transferable visual models from natural language supervision. In ICML, PMLR (pp. 8748\u20138763)."},{"key":"1876_CR59","unstructured":"Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021). Zero-shot text-to-image generation. In ICML, PMLR (pp. 8821\u20138831)."},{"key":"1876_CR60","doi-asserted-by":"crossref","unstructured":"Rao, Y., Zhao, W., Chen, G., Tang, Y., Zhu, Z., Huang, G., Zhou, J., & Lu, J. (2022). Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 18082\u201318091).","DOI":"10.1109\/CVPR52688.2022.01755"},{"key":"1876_CR61","doi-asserted-by":"crossref","unstructured":"Ribani, R., & Marengoni, M. (2019). A survey of transfer learning for convolutional neural networks. In 2019 32nd SIBGRAPI conference on graphics, patterns and images tutorials (SIBGRAPI-T) (pp. 47\u201357), IEEE.","DOI":"10.1109\/SIBGRAPI-T.2019.00010"},{"key":"1876_CR62","unstructured":"Ryoo, M. S., Piergiovanni, A., Arnab, A., Dehghani, M., & Angelova, A. (2021). Tokenlearner: What can 8 learned tokens do for images and videos? arXiv preprint arXiv:2106.11297"},{"key":"1876_CR63","unstructured":"Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). 
Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108"},{"key":"1876_CR64","unstructured":"Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et\u00a0al. (2022). Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402"},{"key":"1876_CR65","doi-asserted-by":"crossref","unstructured":"Sigurdsson, G. A., Varol, G., Wang, X., Farhadi, A., Laptev, I., & Gupta, A. (2016). Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer vision\u2013ECCV 2016: 14th European conference, Amsterdam, The Netherlands, proceedings, part I 14, (pp. 510\u2013526), Springer.","DOI":"10.1007\/978-3-319-46448-0_31"},{"key":"1876_CR66","unstructured":"Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556"},{"key":"1876_CR67","unstructured":"Soomro, K., Zamir, A. R., & Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402"},{"key":"1876_CR68","doi-asserted-by":"crossref","unstructured":"Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In ICCV (pp. 843\u2013852).","DOI":"10.1109\/ICCV.2017.97"},{"key":"1876_CR69","unstructured":"Sun, Q., Fang, Y., Wu, L., Wang, X., & Cao, Y. (2023). Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389"},{"key":"1876_CR70","unstructured":"Sun, Z. (2022). Design of the topology for contrastive visual-textual alignment. arXiv preprint arXiv:2209.02127"},{"key":"1876_CR71","doi-asserted-by":"crossref","unstructured":"Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., & Liu, C. (2018). A survey on deep transfer learning. 
In Artificial neural networks and machine learning\u2013ICANN 2018: 27th international conference on artificial neural networks, Rhodes, Greece, proceedings, part III 27 (pp. 270\u2013279), Springer.","DOI":"10.1007\/978-3-030-01424-7_27"},{"key":"1876_CR72","doi-asserted-by":"crossref","unstructured":"Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In CVPR (pp. 6450\u20136459).","DOI":"10.1109\/CVPR.2018.00675"},{"key":"1876_CR73","doi-asserted-by":"crossref","unstructured":"Tran, D., Wang, H., Torresani, L., & Feiszli, M. (2019). Video classification with channel-separated convolutional networks. In ICCV (pp. 5552\u20135561).","DOI":"10.1109\/ICCV.2019.00565"},{"key":"1876_CR74","doi-asserted-by":"crossref","unstructured":"Wang, L., Li, W., Li, W., & Van\u00a0Gool, L. (2018a). Appearance-and-relation networks for video classification. In CVPR.","DOI":"10.1109\/CVPR.2018.00155"},{"key":"1876_CR75","doi-asserted-by":"crossref","unstructured":"Wang, L., Tong, Z., Ji, B., & Wu, G. (2021a). Tdn: Temporal difference networks for efficient action recognition. In CVPR (pp. 1895\u20131904).","DOI":"10.1109\/CVPR46437.2021.00193"},{"key":"1876_CR76","unstructured":"Wang, M., Xing, J., & Liu, Y. (2021b). Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472"},{"key":"1876_CR77","doi-asserted-by":"crossref","unstructured":"Wang, X., Girshick, R., Gupta, A., & He, K. (2018b). Non-local neural networks. In CVPR (pp. 7794\u20137803).","DOI":"10.1109\/CVPR.2018.00813"},{"key":"1876_CR78","doi-asserted-by":"crossref","unstructured":"Wu, C. Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., & Girshick, R. (2019a). Long-term feature banks for detailed video understanding. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 
284\u2013293).","DOI":"10.1109\/CVPR.2019.00037"},{"key":"1876_CR79","doi-asserted-by":"crossref","unstructured":"Wu, W., He, D., Tan, X., Chen, S., & Wen, S. (2019b). Multi-agent reinforcement learning based frame sampling for effective untrimmed video recognition. In ICCV (pp. 6222\u20136231).","DOI":"10.1109\/ICCV.2019.00632"},{"key":"1876_CR80","doi-asserted-by":"crossref","unstructured":"Wu, W., He, D., Lin, T., Li, F., Gan, C., & Ding, E. (2021). Mvfnet: Multi-view fusion network for efficient video recognition. AAAI (vol. 35, pp. 2943\u20132951).","DOI":"10.1609\/aaai.v35i4.16401"},{"key":"1876_CR81","doi-asserted-by":"crossref","unstructured":"Wu, W., Zhao, Y., Xu, Y., Tan, X., He, D., Zou, Z., Ye, J., Li, Y., Yao, M., Dong, Z., et\u00a0al. (2021b). Dsanet: Dynamic segment aggregation network for video-level representation learning. In ACM MM (pp. 1903\u20131911).","DOI":"10.1145\/3474085.3475344"},{"key":"1876_CR82","unstructured":"Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, J. (2015). 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1912\u20131920)."},{"key":"1876_CR83","doi-asserted-by":"crossref","unstructured":"Xia, B., Wang, Z., Wu, W., Wang, H., & Han, J. (2022a). Temporal saliency query network for efficient video recognition. In ECCV (pp. 741\u2013759).","DOI":"10.1007\/978-3-031-19830-4_42"},{"key":"1876_CR84","doi-asserted-by":"crossref","unstructured":"Xia, B., Wu, W., Wang, H., Su, R., He, D., Yang, H., Fan, X., & Ouyang, W. (2022b). Nsnet: Non-saliency suppression sampler for efficient video recognition. In ECCV (pp. 705\u2013723).","DOI":"10.1007\/978-3-031-19830-4_40"},{"key":"1876_CR85","doi-asserted-by":"crossref","unstructured":"Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. 
In CVPR.","DOI":"10.1109\/CVPR.2010.5539970"},{"key":"1876_CR86","doi-asserted-by":"crossref","unstructured":"Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K. (2018). Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV (pp. 305\u2013321).","DOI":"10.1007\/978-3-030-01267-0_19"},{"key":"1876_CR87","doi-asserted-by":"crossref","unstructured":"Yan, S., Xiong, X., Arnab, A., Lu, Z., Zhang, M., Sun, C., & Schmid, C. (2022). Multiview transformers for video recognition. In CVPR (pp. 3333\u20133343).","DOI":"10.1109\/CVPR52688.2022.00333"},{"key":"1876_CR88","doi-asserted-by":"crossref","unstructured":"Yang, J., Li, C., Zhang, P., Xiao, B., Liu, C., Yuan, L., & Gao, J. (2022). Unified contrastive learning in image-text-label space. In CVPR, (pp. 19163\u201319173).","DOI":"10.1109\/CVPR52688.2022.01857"},{"key":"1876_CR89","unstructured":"Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., & Wu, Y. (2022). Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917"},{"key":"1876_CR90","unstructured":"Yuan, L., Chen, D., Chen, Y. L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., et\u00a0al. (2021). Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432"},{"key":"1876_CR91","doi-asserted-by":"crossref","unstructured":"Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2021). Scaling vision transformers. arXiv preprint arXiv:2106.04560","DOI":"10.1109\/CVPR52688.2022.01179"},{"key":"1876_CR92","unstructured":"Zhang, B., Yu, J., Fifty, C., Han, W., Dai, A. M., Pang, R., & Sha, F. (2021a). Co-training transformer with videos and images improves action recognition. arXiv preprint arXiv:2112.07175"},{"key":"1876_CR93","unstructured":"Zhang, R., Fang, R., Zhang, W., Gao, P., Li, K., Dai, J., Qiao, Y., & Li, H. (2021b). Tip-adapter: Training-free clip-adapter for better vision-language modeling. 
arXiv preprint arXiv:2111.03930"},{"key":"1876_CR94","doi-asserted-by":"crossref","unstructured":"Zhang, R., Guo, Z., Zhang, W., Li, K., Miao, X., Cui, B., Qiao, Y., Gao, P., & Li, H. (2022). Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 8552\u20138562).","DOI":"10.1109\/CVPR52688.2022.00836"},{"key":"1876_CR95","doi-asserted-by":"crossref","unstructured":"Zhao, S., Zhu, L., Wang, X., & Yang, Y. (2022). Centerclip: Token clustering for efficient text-video retrieval. In SIGIR.","DOI":"10.1145\/3477495.3531950"},{"key":"1876_CR96","doi-asserted-by":"crossref","unstructured":"Zhou, B., Andonian, A., Oliva, A., & Torralba, A. (2018). Temporal relational reasoning in videos. In ECCV.","DOI":"10.1007\/978-3-030-01246-5_49"},{"key":"1876_CR97","unstructured":"Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2021). Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134"},{"key":"1876_CR98","doi-asserted-by":"crossref","unstructured":"Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Conditional prompt learning for vision-language models. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 16816\u201316825).","DOI":"10.1109\/CVPR52688.2022.01631"},{"issue":"1","key":"1876_CR99","doi-asserted-by":"publisher","first-page":"43","DOI":"10.1109\/JPROC.2020.3004555","volume":"109","author":"F Zhuang","year":"2020","unstructured":"Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., & He, Q. (2020). A comprehensive survey on transfer learning. 
Proceedings of the IEEE, 109(1), 43\u201376.","journal-title":"Proceedings of the IEEE"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-023-01876-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-023-01876-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-023-01876-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,1,30]],"date-time":"2024-01-30T07:18:56Z","timestamp":1706599136000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-023-01876-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,7]]},"references-count":99,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,2]]}},"alternative-id":["1876"],"URL":"https:\/\/doi.org\/10.1007\/s11263-023-01876-w","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,7]]},"assertion":[{"value":"28 February 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 August 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 September 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}