{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,10]],"date-time":"2026-01-10T02:23:24Z","timestamp":1768011804361,"version":"3.49.0"},"reference-count":91,"publisher":"Springer Science and Business Media LLC","issue":"9","license":[{"start":{"date-parts":[[2024,4,29]],"date-time":"2024-04-29T00:00:00Z","timestamp":1714348800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,4,29]],"date-time":"2024-04-29T00:00:00Z","timestamp":1714348800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"crossref","award":["JP22KF0119"],"award-info":[{"award-number":["JP22KF0119"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"crossref","award":["JP20H04205"],"award-info":[{"award-number":["JP20H04205"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100002241","name":"Japan Science and Technology Agency","doi-asserted-by":"publisher","award":["JPMJCR20U1"],"award-info":[{"award-number":["JPMJCR20U1"]}],"id":[{"id":"10.13039\/501100002241","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2024,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The task of few-shot action recognition aims to recognize novel action classes using only a small number of labeled training samples. How to better describe the action in each video and how to compare the similarity between videos are two of the most critical factors in this task. Directly describing the video globally or by its individual frames cannot well represent the spatiotemporal dependencies within an action. On the other hand, naively matching the global representations of two videos is also not optimal since action can happen at different locations in a video with different speeds. In this work, we propose a novel approach that describes each video using multiple types of prototypes and then computes the video similarity with a particular matching strategy for each type of prototypes. To better model the spatiotemporal dependency, we describe the video by generating prototypes that model the multi-level spatiotemporal relations via transformers. There are a total of three types of prototypes. The first type of prototypes are trained to describe specific aspects of the action in the video e.g., the start of the action, regardless of its timestamp. These prototypes are directly matched one-to-one between two videos to compare their similarity. The second type of prototypes are the timestamp-centered prototypes that are trained to focus on specific timestamps of the video. To deal with the temporal variation of actions in a video, we apply bipartite matching to allow the matching of prototypes of different timestamps. The third type of prototypes are generated from the timestamp-centered prototypes, which regularize their temporal consistency while serving as an auxiliary summarization of the whole video. 
Experiments demonstrate that our proposed method achieves state-of-the-art results on multiple benchmarks.<\/jats:p>","DOI":"10.1007\/s11263-024-02017-7","type":"journal-article","created":{"date-parts":[[2024,4,29]],"date-time":"2024-04-29T13:02:06Z","timestamp":1714395726000},"page":"3977-4002","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["Matching Compound Prototypes for Few-Shot Action Recognition"],"prefix":"10.1007","volume":"132","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8067-6227","authenticated-orcid":false,"given":"Yifei","family":"Huang","sequence":"first","affiliation":[]},{"given":"Lijin","family":"Yang","sequence":"additional","affiliation":[]},{"given":"Guo","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Hongjie","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Feng","family":"Lu","sequence":"additional","affiliation":[]},{"given":"Yoichi","family":"Sato","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,4,29]]},"reference":[{"key":"2017_CR1","doi-asserted-by":"crossref","unstructured":"Afrasiyabi, A., Larochelle, H., Lalonde, J. F., & Gagn\u00e9, C. (2022). Matching feature sets for few-shot image classification. In CVPR.","DOI":"10.1109\/CVPR52688.2022.00881"},{"key":"2017_CR2","unstructured":"Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., & De Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In NeurIPS."},{"key":"2017_CR3","unstructured":"Antoniou, A., Edwards, H., & Storkey, A. (2019). How to train your MAML. In ICML."},{"key":"2017_CR4","doi-asserted-by":"crossref","unstructured":"Bateni, P., Barber, J., Van de Meent, J. W., & Wood, F. (2022). Enhancing few-shot image classification with unlabelled examples. In WACV.","DOI":"10.1109\/WACV51458.2022.00166"},{"key":"2017_CR5","unstructured":"Bishay, M., Zoumpourlis, G., & Patras, I. (2019). Tarn: Temporal attentive relation network for few-shot and zero-shot action recognition. In BMVC."},{"key":"2017_CR6","doi-asserted-by":"crossref","unstructured":"Cao, K., Ji, J., Cao, Z., Chang, C. Y., & Niebles, J. C. (2020). Few-shot video classification via temporal alignment. In CVPR.","DOI":"10.1109\/CVPR42600.2020.01063"},{"key":"2017_CR7","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In ECCV.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"2017_CR8","doi-asserted-by":"crossref","unstructured":"Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR.","DOI":"10.1109\/CVPR.2017.502"},{"key":"2017_CR9","doi-asserted-by":"crossref","unstructured":"Chang, C. Y., Huang, D. A., Sui, Y., Fei-Fei, L., & Niebles, J. C. (2019). D3tw: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In CVPR.","DOI":"10.1109\/CVPR.2019.00366"},{"key":"2017_CR10","doi-asserted-by":"crossref","unstructured":"Chowdhury, A., Jiang, M., Chaudhuri, S., & Jermaine, C. (2021). Few-shot image classification: Just use a library of pre-trained feature extractors and a simple classifier. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00931"},{"key":"2017_CR11","doi-asserted-by":"crossref","unstructured":"Cong, Y., Liao, W., Ackermann, H., Rosenhahn, B., & Yang, M. Y. 
(2021). Spatial-temporal transformer for dynamic scene graph generation. In ICCV.","DOI":"10.1109\/ICCV48922.2021.01606"},{"key":"2017_CR12","doi-asserted-by":"crossref","unstructured":"Damen, D., Doughty, H., Farinella, G. M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al. (2018). Scaling egocentric vision: The epic-kitchens dataset. In ECCV.","DOI":"10.1007\/978-3-030-01225-0_44"},{"key":"2017_CR13","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"2017_CR14","doi-asserted-by":"crossref","unstructured":"Deng, J., Yang, Z., Chen, T., Zhou, W., & Li, H. (2021). TransVG: End-to-end visual grounding with transformers. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00179"},{"key":"2017_CR15","unstructured":"Dhillon, G. S., Chaudhari, P., Ravichandran, A., & Soatto, S. (2019). A baseline for few-shot image classification. In ICLR."},{"key":"2017_CR16","unstructured":"Doersch, C., Gupta, A., & Zisserman, A. (2020). Crosstransformers: Spatially-aware few-shot transfer. In NeurIPS."},{"key":"2017_CR17","doi-asserted-by":"crossref","unstructured":"Fan, Q., Zhuo, W., Tang, C. K., & Tai, Y. W. (2020). Few-shot object detection with attention-RPN and multi-relation detector. In CVPR.","DOI":"10.1109\/CVPR42600.2020.00407"},{"key":"2017_CR18","unstructured":"Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning. PMLR."},{"key":"2017_CR19","doi-asserted-by":"crossref","unstructured":"Fu, Y., Zhang, L., Wang, J., Fu, Y., & Jiang, Y. G. (2020). Depth guided adaptive meta-fusion network for few-shot video recognition. In ACM MM.","DOI":"10.1145\/3394171.3413502"},{"key":"2017_CR20","doi-asserted-by":"crossref","unstructured":"Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., et al. (2017). The \u201csomething something\u201d video database for learning and evaluating visual common sense. In ICCV.","DOI":"10.1109\/ICCV.2017.622"},{"key":"2017_CR21","unstructured":"Grauman, K., Westbury, A., Byrne, E., et al. (2021). Ego4D: Around the world in 3000 hours of egocentric video. arXiv:2110.07058"},{"key":"2017_CR22","doi-asserted-by":"crossref","unstructured":"Gui, L. Y., Wang, Y. X., Ramanan, D., & Moura, J. M. (2018). Few-shot human motion prediction via meta-learning. In ECCV.","DOI":"10.1007\/978-3-030-01237-3_27"},{"key":"2017_CR23","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., & Girshick, R. (2017). Mask R-CNN. In ICCV.","DOI":"10.1109\/ICCV.2017.322"},{"key":"2017_CR24","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.","DOI":"10.1109\/CVPR.2016.90"},{"key":"2017_CR25","doi-asserted-by":"crossref","unstructured":"Hong, J., Fisher, M., Gharbi, M., & Fatahalian, K. (2021). Video pose distillation for few-shot, fine-grained sports action recognition. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00912"},{"key":"2017_CR26","doi-asserted-by":"publisher","first-page":"7795","DOI":"10.1109\/TIP.2020.3007841","volume":"29","author":"Y Huang","year":"2020","unstructured":"Huang, Y., Cai, M., Li, Z., Lu, F., & Sato, Y. (2020). Mutual context network for jointly estimating egocentric gaze and action. 
IEEE Transactions on Image Processing, 29, 7795\u20137806.","journal-title":"IEEE Transactions on Image Processing"},{"key":"2017_CR27","doi-asserted-by":"crossref","unstructured":"Huang, Y., Cai, M., Li, Z., & Sato, Y. (2018). Predicting gaze in egocentric video by learning task-dependent attention transition. In ECCV.","DOI":"10.1007\/978-3-030-01225-0_46"},{"issue":"4","key":"2017_CR28","doi-asserted-by":"publisher","first-page":"306","DOI":"10.1109\/THMS.2020.2965429","volume":"50","author":"Y Huang","year":"2020","unstructured":"Huang, Y., Cai, M., & Sato, Y. (2020). An ego-vision system for discovering human joint attention. IEEE Transactions on Human-Machine Systems, 50(4), 306\u2013316.","journal-title":"IEEE Transactions on Human-Machine Systems"},{"key":"2017_CR29","doi-asserted-by":"crossref","unstructured":"Huang, Y., Yang, L., & Sato, Y. (2022). Compound prototype matching for few-shot action recognition. In ECCV.","DOI":"10.1007\/978-3-031-19772-7_21"},{"key":"2017_CR30","doi-asserted-by":"crossref","unstructured":"Huang, Y., Yang, L., & Sato, Y. (2023). Weakly supervised temporal sentence grounding with uncertainty-guided self-training. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 18908\u201318918).","DOI":"10.1109\/CVPR52729.2023.01813"},{"key":"2017_CR31","doi-asserted-by":"crossref","unstructured":"Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., & Darrell, T. (2019). Few-shot object detection via feature reweighting. In ICCV.","DOI":"10.1109\/ICCV.2019.00851"},{"key":"2017_CR32","doi-asserted-by":"crossref","unstructured":"Kang, D., Kwon, H., Min, J., & Cho, M. (2021). Relational embedding for few-shot classification. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00870"},{"key":"2017_CR33","doi-asserted-by":"crossref","unstructured":"Kliper-Gross, O., Hassner, T., & Wolf, L. (2011). One shot similarity metric learning for action recognition. In SIMBAD.","DOI":"10.1007\/978-3-642-24471-1_3"},{"key":"2017_CR34","unstructured":"Koch, G., Zemel, R., Salakhutdinov, R., et al. (2015). Siamese neural networks for one-shot image recognition. In ICML."},{"key":"2017_CR35","doi-asserted-by":"crossref","unstructured":"Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In ICCV.","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"2017_CR36","doi-asserted-by":"crossref","unstructured":"Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval research logistics quarterly.","DOI":"10.1002\/nav.3800020109"},{"key":"2017_CR37","doi-asserted-by":"crossref","unstructured":"Kumar Dwivedi, S., Gupta, V., Mitra, R., Ahmed, S., & Jain, A. (2019). Protogan: Towards few shot learning for action recognition. In CVPRW.","DOI":"10.1109\/ICCVW.2019.00166"},{"key":"2017_CR38","doi-asserted-by":"crossref","unstructured":"Li, H., Eigen, D., Dodge, S., Zeiler, M., & Wang, X. (2019). Finding task-relevant features for few-shot learning by category traversal. In CVPR.","DOI":"10.1109\/CVPR.2019.00009"},{"key":"2017_CR39","doi-asserted-by":"crossref","unstructured":"Li, S., Liu, H., Qian, R., Li, Y., See, J., Fei, M., Yu, X., & Lin, W. (2021). Ta2n: Two-stage action alignment network for few-shot action recognition. arXiv:2107.04782","DOI":"10.1609\/aaai.v36i2.20029"},{"key":"2017_CR40","doi-asserted-by":"crossref","unstructured":"Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., & Zitnick, C. L. (2014). 
Microsoft coco: Common objects in context. In ECCV.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"2017_CR41","doi-asserted-by":"crossref","unstructured":"Liu, W., Zhang, C., Lin, G., & Liu, F. (2020). CRNet: Cross-reference networks for few-shot segmentation. In CVPR.","DOI":"10.1109\/CVPR42600.2020.00422"},{"key":"2017_CR42","doi-asserted-by":"crossref","unstructured":"Liu, Y., Zhang, X., Zhang, S., & He, X. (2020). Part-aware prototype network for few-shot semantic segmentation. In ECCV.","DOI":"10.1007\/978-3-030-58545-7_9"},{"key":"2017_CR43","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"2017_CR44","doi-asserted-by":"crossref","unstructured":"Lu, Z., He, S., Zhu, X., Zhang, L., Song, Y. Z., & Xiang, T. (2021). Simpler is better: Few-shot semantic segmentation with classifier weight transformer. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00862"},{"key":"2017_CR45","unstructured":"Luo, X., Xu, J., & Xu, Z. (2022). Channel importance matters in few-shot image classification. In ICML."},{"key":"2017_CR46","doi-asserted-by":"crossref","unstructured":"Mishra, A., Verma, V. K., Reddy, M. S. K., Arulkumar, S., Rai, P., & Mittal, A. (2018). A generative approach to zero-shot and few-shot action recognition. In WACV.","DOI":"10.1109\/WACV.2018.00047"},{"key":"2017_CR47","doi-asserted-by":"crossref","unstructured":"Nguyen, K. D., Tran, Q. H., Nguyen, K., Hua, B. S., & Nguyen, R. (2022). Inductive and transductive few-shot video classification via appearance and temporal alignments. In ECCV.","DOI":"10.1007\/978-3-031-20044-1_27"},{"key":"2017_CR48","doi-asserted-by":"crossref","unstructured":"Patravali, J., Mittal, G., Yu, Y., Li, F., & Chen, M. (2021). Unsupervised few-shot action recognition via action-appearance aligned meta-adaptation. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00837"},{"key":"2017_CR49","doi-asserted-by":"crossref","unstructured":"Perrett, T., Masullo, A., Burghardt, T., Mirmehdi, M., & Damen, D. (2021). Temporal-relational crosstransformers for few-shot action recognition. In CVPR.","DOI":"10.1109\/CVPR46437.2021.00054"},{"key":"2017_CR50","doi-asserted-by":"crossref","unstructured":"Qiao, S., Liu, C., Shen, W., & Yuille, A. L. (2018). Few-shot image recognition by predicting parameters from activations. In CVPR.","DOI":"10.1109\/CVPR.2018.00755"},{"key":"2017_CR51","unstructured":"Ravi, S., & Larochelle, H. (2017). Optimization as a model for few-shot learning. In ICLR."},{"key":"2017_CR52","doi-asserted-by":"crossref","unstructured":"Samarasinghe, S., Rizve, M. N., Kardan, N., & Shah, M. (2023). CDFSL-V: Cross-domain few-shot learning for videos. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 11643\u201311652).","DOI":"10.1109\/ICCV51070.2023.01069"},{"key":"2017_CR53","unstructured":"Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical networks for few-shot learning. In NeurIPS."},{"key":"2017_CR54","unstructured":"Soomro, K., Zamir, A. R., & Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402"},{"key":"2017_CR55","doi-asserted-by":"crossref","unstructured":"Sun, R., Li, Y., Zhang, T., Mao, Z., Wu, F., & Zhang, Y. (2021). Lesion-aware transformers for diabetic retinopathy grading. 
In CVPR.","DOI":"10.1109\/CVPR46437.2021.01079"},{"key":"2017_CR56","doi-asserted-by":"crossref","unstructured":"Thatipelli, A., Narayan, S., Khan, S., Anwer, R. M., Khan, F. S., & Ghanem, B. (2021). Spatio-temporal relation modeling for few-shot action recognition. arXiv:2112.05132","DOI":"10.1109\/CVPR52688.2022.01933"},{"key":"2017_CR57","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, \u0141., & Polosukhin, I. (2017). Attention is all you need. In NeurIPS."},{"key":"2017_CR58","unstructured":"Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. (2016). Matching networks for one shot learning. In NeurIPS."},{"key":"2017_CR59","doi-asserted-by":"crossref","unstructured":"Wang, H., Zhang, X., Hu, Y., Yang, Y., Cao, X., & Zhen, X. (2020). Few-shot semantic segmentation with democratic attention networks. In ECCV.","DOI":"10.1007\/978-3-030-58601-0_43"},{"key":"2017_CR60","doi-asserted-by":"crossref","unstructured":"Wang, K., Liew, J. H., Zou, Y., Zhou, D., & Feng, J. (2019). Panet: Few-shot image semantic segmentation with prototype alignment. In ICCV.","DOI":"10.1109\/ICCV.2019.00929"},{"key":"2017_CR61","doi-asserted-by":"crossref","unstructured":"Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In ECCV.","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"2017_CR62","unstructured":"Wang, X., Huang, T. E., Darrell, T., Gonzalez, J. E., & Yu, F. (2020). Frustratingly simple few-shot object detection. In ICML."},{"key":"2017_CR63","doi-asserted-by":"crossref","unstructured":"Wang, X., Ye, W., Qi, Z., Zhao, X., Wang, G., Shan, Y., & Wang, H. (2021). Semantic-guided relation propagation network for few-shot action recognition. In ACM MM.","DOI":"10.1145\/3474085.3475253"},{"key":"2017_CR64","doi-asserted-by":"crossref","unstructured":"Wang, X., Zhang, S., Qing, Z., Gao, C., Zhang, Y., Zhao, D., & Sang, N. (2023). Molo: Motion-augmented long-short contrastive learning for few-shot action recognition. In CVPR.","DOI":"10.1109\/CVPR52729.2023.01727"},{"key":"2017_CR65","doi-asserted-by":"crossref","unstructured":"Wang, X., Zhang, S., Qing, Z., Tang, M., Zuo, Z., Gao, C., Jin, R., & Sang, N. (2022). Hybrid relation guided set matching for few-shot action recognition. In CVPR.","DOI":"10.1109\/CVPR52688.2022.01932"},{"key":"2017_CR66","doi-asserted-by":"crossref","unstructured":"Wang, X., Zhang, S., Qing, Z., Zuo, Z., Gao, C., Jin, R., & Sang, N. (2023). Hyrsm++: Hybrid relation guided temporal set matching for few-shot action recognition. arXiv:2301.03330","DOI":"10.1109\/CVPR52688.2022.01932"},{"key":"2017_CR67","doi-asserted-by":"crossref","unstructured":"Wanyan, Y., Yang, X., Chen, C., & Xu, C. (2023). Active exploration of multimodal complementarity for few-shot action recognition. In CVPR.","DOI":"10.1109\/CVPR52729.2023.00628"},{"key":"2017_CR68","doi-asserted-by":"crossref","unstructured":"Wei, X. S., Wang, P., Liu, L., Shen, C., & Wu, J. (2019). Piecewise classifier mappings: Learning fine-grained learners for novel categories with few examples. In TIP.","DOI":"10.1109\/TIP.2019.2924811"},{"key":"2017_CR69","doi-asserted-by":"crossref","unstructured":"Wu, J., Zhang, T., Zhang, Z., Wu, F., & Zhang, Y. (2022). Motion-modulated temporal fragment alignment network for few-shot action recognition. In CVPR (pp. 
9151\u20139160).","DOI":"10.1109\/CVPR52688.2022.00894"},{"key":"2017_CR70","doi-asserted-by":"crossref","unstructured":"Xia, H., Li, K., Min, M. R., & Ding, Z. (2023). Few-shot video classification via representation fusion and promotion learning. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 19311\u201319320).","DOI":"10.1109\/ICCV51070.2023.01769"},{"key":"2017_CR71","doi-asserted-by":"crossref","unstructured":"Xian, Y., Korbar, B., Douze, M., Schiele, B., Akata, Z., & Torresani, L. (2020). Generalized many-way few-shot video classification. In ECCV.","DOI":"10.1007\/978-3-030-65414-6_10"},{"key":"2017_CR72","doi-asserted-by":"crossref","unstructured":"Xian, Y., Korbar, B., Douze, M., Torresani, L., Schiele, B., & Akata, Z. (2021). Generalized few-shot video classification with video retrieval and feature generation. In TPAMI.","DOI":"10.1007\/978-3-030-65414-6_10"},{"key":"2017_CR73","doi-asserted-by":"crossref","unstructured":"Xing, J., Wang, M., Liu, Y., & Mu, B. (2023). Revisiting the spatial and temporal modeling for few-shot action recognition. In Proceedings of the AAAI conference on artificial intelligence (pp. 3001\u20133009).","DOI":"10.1609\/aaai.v37i3.25403"},{"key":"2017_CR74","doi-asserted-by":"crossref","unstructured":"Xing, J., Wang, M., Ruan, Y., Chen, B., Guo, Y., Mu, B., Dai, G., Wang, J., & Liu, Y. (2023). Boosting few-shot action recognition with graph-guided hybrid matching. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 1740\u20131750).","DOI":"10.1109\/ICCV51070.2023.00167"},{"key":"2017_CR75","doi-asserted-by":"crossref","unstructured":"Xu, B., Ye, H., Zheng, Y., Wang, H., Luwang, T., & Jiang, Y. G. (2018). Dense dilated network for few shot action recognition. In ICMR.","DOI":"10.1145\/3206025.3206028"},{"key":"2017_CR76","doi-asserted-by":"crossref","unstructured":"Xu, C., Fu, Y., Liu, C., Wang, C., Li, J., Huang, F., Zhang, L., & Xue, X. (2021). Learning dynamic alignment via meta-filter for few-shot learning. In CVPR.","DOI":"10.1109\/CVPR46437.2021.00514"},{"key":"2017_CR77","doi-asserted-by":"crossref","unstructured":"Xu, M., Zhao, C., Rojas, D. S., Thabet, A., & Ghanem, B. (2020). G-tad: Sub-graph localization for temporal action detection. In CVPR.","DOI":"10.1109\/CVPR42600.2020.01017"},{"key":"2017_CR78","unstructured":"Yang, J., Li, C., Zhang, P., Dai, X., Xiao, B., Yuan, L., & Gao, J. (2021). Focal self-attention for local-global interactions in vision transformers. In NeurIPS."},{"key":"2017_CR79","doi-asserted-by":"crossref","unstructured":"Yang, L., Huang, Y., Sugano, Y., & Sato, Y. (2022). Interact before align: Leveraging cross-modal knowledge for domain adaptive action recognition. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 14722\u201314732).","DOI":"10.1109\/CVPR52688.2022.01431"},{"key":"2017_CR80","doi-asserted-by":"crossref","unstructured":"Ye, H. J., Hu, H., Zhan, D. C., & Sha, F. (2020). Few-shot learning via embedding adaptation with set-to-set functions. In CVPR (pp. 8808\u20138817).","DOI":"10.1109\/CVPR42600.2020.00883"},{"key":"2017_CR81","doi-asserted-by":"crossref","unstructured":"Zeng, R., Huang, W., Tan, M., Rong, Y., Zhao, P., Huang, J., & Gan, C. (2019). Graph convolutional networks for temporal action localization. In ICCV.","DOI":"10.1109\/ICCV.2019.00719"},{"key":"2017_CR82","doi-asserted-by":"crossref","unstructured":"Zhang, C., Cai, Y., Lin, G., & Shen, C. (2020). 
DeepEMD: Few-shot image classification with differentiable earth mover\u2019s distance and structured classifiers. In CVPR.","DOI":"10.1109\/CVPR42600.2020.01222"},{"key":"2017_CR83","doi-asserted-by":"crossref","unstructured":"Zhang, C., Gupta, A., & Zisserman, A. (2021). Temporal query networks for fine-grained video understanding. In CVPR.","DOI":"10.1109\/CVPR46437.2021.00446"},{"key":"2017_CR84","doi-asserted-by":"crossref","unstructured":"Zhang, H., Zhang, L., Qi, X., Li, H., Torr, P. H., & Koniusz, P. (2020). Few-shot action recognition with permutation-invariant attention. In ECCV.","DOI":"10.1007\/978-3-030-58558-7_31"},{"key":"2017_CR85","doi-asserted-by":"crossref","unstructured":"Zhang, S., Zhou, J., & He, X. (2021). Learning implicit temporal alignment for few-shot video classification. In IJCAI.","DOI":"10.24963\/ijcai.2021\/181"},{"key":"2017_CR86","doi-asserted-by":"crossref","unstructured":"Zheng, S., Chen, S., & Jin, Q. (2022). Few-shot action recognition with hierarchical matching and contrastive learning. In ECCV. Springer.","DOI":"10.1007\/978-3-031-19772-7_18"},{"key":"2017_CR87","doi-asserted-by":"crossref","unstructured":"Zhou, B., Andonian, A., Oliva, A., & Torralba, A. (2018). Temporal relational reasoning in videos. In ECCV.","DOI":"10.1007\/978-3-030-01246-5_49"},{"key":"2017_CR88","doi-asserted-by":"crossref","unstructured":"Zhu, L., & Yang, Y. (2018). Compound memory networks for few-shot video classification. In ECCV.","DOI":"10.1007\/978-3-030-01234-2_46"},{"key":"2017_CR89","doi-asserted-by":"crossref","unstructured":"Zhu, L., & Yang, Y. (2020). Label independent memory for semi-supervised few-shot video classification. In TPAMI.","DOI":"10.1109\/TPAMI.2020.3007511"},{"key":"2017_CR90","unstructured":"Zhu, X., Toisoul, A., Perez-Rua, J.M., Zhang, L., Martinez, B., & Xiang, T. (2021). Few-shot action recognition with prototype-centered attentive learning. In BMVC."},{"key":"2017_CR91","unstructured":"Zhu, Z., Wang, L., Guo, S., & Wu, G. (2021). A closer look at few-shot video classification: A new baseline and benchmark. 
In BMVC."}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-024-02017-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-024-02017-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-024-02017-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,27]],"date-time":"2024-08-27T07:44:18Z","timestamp":1724744658000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-024-02017-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,29]]},"references-count":91,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2024,9]]}},"alternative-id":["2017"],"URL":"https:\/\/doi.org\/10.1007\/s11263-024-02017-7","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,4,29]]},"assertion":[{"value":"4 August 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 January 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 April 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}