{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T17:45:35Z","timestamp":1767980735862,"version":"3.49.0"},"reference-count":72,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,1,11]],"date-time":"2024-01-11T00:00:00Z","timestamp":1704931200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"The National Key Research and Development Program of China","award":["2018YFE0118400"],"award-info":[{"award-number":["2018YFE0118400"]}]},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["U20B2052, 61936011"],"award-info":[{"award-number":["U20B2052, 61936011"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,4,30]]},"abstract":"<jats:p>Transformer has exhibited promising performance in various video recognition tasks but brings a huge computational cost in modeling spatial-temporal cues. This work aims to boost the efficiency of existing video transformers for action recognition through eliminating redundancies in their tokens and efficiently learning motion cues of moving objects. We propose a lightweight and plug-and-play module, namely Spatial-temporal Token Merger (STTM), to merge the tokens belonging to the same object into a more compact representation. STTM first adaptively identifies crucial object clues underlying the video as meta tokens. Similarity scores between input tokens and meta tokens are hence computed and used to guide the fusion of similar tokens in both spatial and temporal domains, respectively. 
To compensate for motion cues lost in the merging procedure, we compute the linear aggregation of spatial-temporal positions of tokens as motion features. STTM hence outputs a compact set of tokens fusing both appearance and motion features of moving objects. This procedure substantially decreases the number of tokens that need to be processed by each Transformer block and boosts the efficiency. As a general module, STTM can be applied to different layers of various video Transformers. Extensive experiments on the action recognition datasets Kinetics-400 and SSv2 demonstrate its promising performance. For example, it reduces the computation complexity of ViT by 38% while maintaining a similar performance on Kinetics-400. It also brings a 1.7% gain in top-1 accuracy on SSv2 under the same computational cost.<\/jats:p>","DOI":"10.1145\/3633781","type":"journal-article","created":{"date-parts":[[2023,12,4]],"date-time":"2023-12-04T11:48:40Z","timestamp":1701690520000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["Efficient Video Transformers via Spatial-temporal Token Merging for Action Recognition"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-0071-454X","authenticated-orcid":false,"given":"Zhanzhou","family":"Feng","sequence":"first","affiliation":[{"name":"National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-5142-9177","authenticated-orcid":false,"given":"Jiaming","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Electronic Engineering and Computer Science, Peking University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6024-3854","authenticated-orcid":false,"given":"Lei","family":"Ma","sequence":"additional","affiliation":[{"name":"National Biomedical Imaging Center, College of Future Technology, Peking 
University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing Academy of Artificial Intelligence, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9053-9314","authenticated-orcid":false,"given":"Shiliang","family":"Zhang","sequence":"additional","affiliation":[{"name":"National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, China"}]}],"member":"320","published-online":{"date-parts":[[2024,1,11]]},"reference":[{"key":"e_1_3_1_2_2","article-title":"Quantifying attention flow in transformers","author":"Abnar Samira","year":"2020","unstructured":"Samira Abnar and Willem Zuidema. 2020. Quantifying attention flow in transformers. arXiv preprint arXiv:2005.00928 (2020).","journal-title":"arXiv preprint arXiv:2005.00928"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"e_1_3_1_4_2","unstructured":"Gedas Bertasius Heng Wang and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding? In International Conference on Machine Learning PMLR 813\u2013824."},{"key":"e_1_3_1_5_2","article-title":"Token merging: Your vit but faster","author":"Bolya Daniel","year":"2022","unstructured":"Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. 2022. Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461 (2022).","journal-title":"arXiv preprint arXiv:2210.09461"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_1_7_2","first-page":"1910","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Chen Jiawei","year":"2022","unstructured":"Jiawei Chen and Chiu Man Ho. 2022. MM-ViT: Multi-modal video transformer for compressed video action recognition. 
In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 1910\u20131921."},{"key":"e_1_3_1_8_2","first-page":"3435","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Chen Yunpeng","year":"2019","unstructured":"Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. 2019. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 3435\u20133444."},{"key":"e_1_3_1_9_2","article-title":"Complementary coarse-to-fine matching for video object segmentation","author":"Chen Zhen","year":"2023","unstructured":"Zhen Chen, Ming Yang, and Shiliang Zhang. 2023. Complementary coarse-to-fine matching for video object segmentation. ACM Transactions on Multimedia Computing, Communications and Applications (TOMM\u201923) 19, 6 (2023), 1\u201321.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications (TOMM\u201923)"},{"key":"e_1_3_1_10_2","article-title":"Bert: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).","journal-title":"arXiv preprint arXiv:1810.04805"},{"key":"e_1_3_1_11_2","article-title":"An image is worth 16x16 words: Transformers for image recognition at scale","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Uszkoreit Jakob, and Houlsby Neil. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929 (2020).","journal-title":"arXiv preprint arXiv:2010.11929"},{"key":"e_1_3_1_12_2","unstructured":"Quanfu Fan Chun-Fu Richard Chen Hilde Kuehne Marco Pistoia and David Cox. 2019. More is less: learning efficient video representations by big-little network and depthwise temporal aggregation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems . 2264\u20132273."},{"key":"e_1_3_1_13_2","unstructured":"Quanfu Fan and Rameswar Panda. 2021. An image classifier can suffice for video understanding. arXiv preprint arXiv:2106.14104 2 (2021)."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00028"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00630"},{"key":"e_1_3_1_16_2","doi-asserted-by":"crossref","unstructured":"Zhanzhou Feng and Shiliang Zhang. 2023. Efficient vision transformer via token merger. IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society 32 (2023) 4156\u20134169.","DOI":"10.1109\/TIP.2023.3293763"},{"key":"e_1_3_1_17_2","first-page":"10386","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Feng Zhanzhou","year":"2023","unstructured":"Zhanzhou Feng and Shiliang Zhang. 2023. Evolved part masking for self-supervised learning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 10386\u201310395."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i2.16235"},{"key":"e_1_3_1_19_2","doi-asserted-by":"crossref","unstructured":"Raghav Goyal Samira Ebrahimi Kahou Vincent Michalski Joanna Materzynska Susanne Westphal Heuna Kim Valentin Haenel Ingo Fruend Peter Yianilos Moritz Mueller-Freitag Florian Hoppe Christian Thurau Ingo Bax and Roland Memisevic. 2017. The \u201csomething something\u201d video database for learning and evaluating visual common sense. 
In Proceedings of the IEEE International Conference on Computer Vision . 5842\u20135850.","DOI":"10.1109\/ICCV.2017.622"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01204"},{"key":"e_1_3_1_21_2","article-title":"Turbo training with token dropout","author":"Han Tengda","year":"2022","unstructured":"Tengda Han, Weidi Xie, and Andrew Zisserman. 2022. Turbo training with token dropout. arXiv preprint arXiv:2210.04889 (2022).","journal-title":"arXiv preprint arXiv:2210.04889"},{"key":"e_1_3_1_22_2","article-title":"Masked autoencoders are scalable vision learners","author":"He Kaiming","year":"2021","unstructured":"Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll\u00e1r, and Ross Girshick. 2021. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021).","journal-title":"arXiv preprint arXiv:2111.06377"},{"key":"e_1_3_1_23_2","first-page":"22690","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Huang Huaibo","year":"2023","unstructured":"Huaibo Huang, Xiaoqiang Zhou, Jie Cao, Ran He, and Tieniu Tan. 2023. Vision transformer with super token sampling. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 22690\u201322699."},{"key":"e_1_3_1_24_2","volume-title":"International Conference on Learning Representations","author":"Huang Ziyuan","year":"2022","unstructured":"Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Mingqian Tang, Ziwei Liu, and Marcelo H. Ang Jr. 2022. TAda! Temporally-adaptive convolutions for video understanding. In International Conference on Learning Representations."},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00209"},{"key":"e_1_3_1_26_2","unstructured":"Zi-Hang Jiang Qibin Hou Li Yuan Daquan Zhou Yujun Shi Xiaojie Jin Anran Wang and Jiashi Feng. 2021. All tokens matter: Token labeling for training better vision transformers. 
Advances in Neural Information Processing Systems 34 (2021) 18590\u201318602."},{"key":"e_1_3_1_27_2","unstructured":"Will Kay Joao Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola Tim Green Trevor Back Paul Natsev Suleyman Mustafa and Zisserman Andrew. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01576"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00633"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3485472"},{"key":"e_1_3_1_31_2","first-page":"arXiv\u20132104","article-title":"VidTr: Video transformer without convolutions","author":"Li Xinyu","year":"2021","unstructured":"Xinyu Li, Yanyi Zhang, Chunhui Liu, Bing Shuai, Yi Zhu, Biagio Brattoli, Hao Chen, Ivan Marsic, and Joseph Tighe. 2021. VidTr: Video transformer without convolutions. arXiv e-prints (2021), arXiv\u20132104.","journal-title":"arXiv e-prints"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00099"},{"key":"e_1_3_1_33_2","unstructured":"Zhaowen Li Zhiyang Chen Fan Yang Wei Li Yousong Zhu Chaoyang Zhao Rui Deng Liwei Wu Rui Zhao Ming Tang and Wang Jinqiao. 2021. Mst: Masked self-supervised transformer for visual representation. Advances in Neural Information Processing Systems 34 (2021) 13165\u201313176."},{"key":"e_1_3_1_34_2","article-title":"Not all patches are what you need: Expediting vision transformers via token reorganizations","author":"Liang Youwei","year":"2022","unstructured":"Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. 2022. Not all patches are what you need: Expediting vision transformers via token reorganizations. 
arXiv preprint arXiv:2202.07800 (2022).","journal-title":"arXiv preprint arXiv:2202.07800"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00718"},{"key":"e_1_3_1_36_2","article-title":"Transformer in convolutional neural networks","volume":"2106","author":"Liu Yun","year":"2021","unstructured":"Yun Liu, Guolei Sun, Yu Qiu, Le Zhang, Ajad Chhatkuli, and Luc Van Gool. 2021. Transformer in convolutional neural networks. CoRR abs\/2106.03180 (2021). arXiv:2106.03180https:\/\/arxiv.org\/abs\/2106.03180","journal-title":"CoRR"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00320"},{"key":"e_1_3_1_38_2","first-page":"10334","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Long Sifan","year":"2023","unstructured":"Sifan Long, Zhen Zhao, Jimin Pi, Shengsheng Wang, and Jingdong Wang. 2023. Beyond attentive tokens: Incorporating token importance and diversity for efficient vision transformers. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 10334\u201310343."},{"key":"e_1_3_1_39_2","unstructured":"Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)."},{"key":"e_1_3_1_40_2","article-title":"Token pooling in vision transformers","author":"Marin Dmitrii","year":"2021","unstructured":"Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, and Oncel Tuzel. 2021. Token pooling in vision transformers. arXiv preprint arXiv:2110.03860 (2021).","journal-title":"arXiv preprint arXiv:2110.03860"},{"key":"e_1_3_1_41_2","first-page":"3163","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Neimark Daniel","year":"2021","unstructured":"Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. 2021. Video transformer network. 
In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 3163\u20133172."},{"key":"e_1_3_1_42_2","first-page":"160","volume-title":"European Conference on Computer Vision","author":"Park Seong Hyeon","year":"2022","unstructured":"Seong Hyeon Park, Jihoon Tack, Byeongho Heo, Jung-Woo Ha, and Jinwoo Shin. 2022. K-centered patch sampling for efficient video recognition. In European Conference on Computer Vision. Springer, 160\u2013176."},{"key":"e_1_3_1_43_2","first-page":"12493","article-title":"Keeping your eye on the ball: Trajectory attention in video transformers","volume":"34","author":"Patrick Mandela","year":"2021","unstructured":"Mandela Patrick, Dylan Campbell, Yuki Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, and Jo\u00e3o F. Henriques. 2021. Keeping your eye on the ball: Trajectory attention in video transformers. Advances in Neural Information Processing Systems 34 (2021), 12493\u201312506.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_44_2","first-page":"13937","article-title":"DynamicViT: Efficient vision transformers with dynamic token sparsification","volume":"34","author":"Rao Yongming","year":"2021","unstructured":"Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. 2021. DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems 34 (2021), 13937\u201313949.","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"3","key":"e_1_3_1_45_2","first-page":"1110","article-title":"Hierarchical long short-term concurrent memory for human interaction recognition","volume":"43","author":"Shu Xiangbo","year":"2019","unstructured":"Xiangbo Shu, Jinhui Tang, Guo-Jun Qi, Wei Liu, and Jian Yang. 2019. Hierarchical long short-term concurrent memory for human interaction recognition. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 3 (2019), 1110\u20131118.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3050918"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3571735"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2928540"},{"key":"e_1_3_1_49_2","first-page":"10347","volume-title":"International Conference on Machine Learning","author":"Touvron Hugo","year":"2021","unstructured":"Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv\u00e9 J\u00e9gou. 2021. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning. PMLR, 10347\u201310357."},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00675"},{"key":"e_1_3_1_51_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6000\u20136010."},{"key":"e_1_3_1_52_2","first-page":"69","volume-title":"European Conference on Computer Vision","author":"Wang Junke","year":"2022","unstructured":"Junke Wang, Xitong Yang, Hengduo Li, Li Liu, Zuxuan Wu, and Yu-Gang Jiang. 2022. Efficient video transformers with spatial-temporal token selection. In European Conference on Computer Vision. 
Springer, 69\u201386."},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00813"},{"issue":"3","key":"e_1_3_1_55_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3569584","article-title":"A differentiable parallel sampler for efficient video classification","volume":"19","author":"Wang Xiaohan","year":"2023","unstructured":"Xiaohan Wang, Linchao Zhu, Fei Wu, and Yi Yang. 2023. A differentiable parallel sampler for efficient video classification. ACM Transactions on Multimedia Computing, Communications and Applications (TOMM) 19, 3 (2023), 1\u201318.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications (TOMM)"},{"key":"e_1_3_1_56_2","doi-asserted-by":"crossref","unstructured":"Zejia Weng Zuxuan Wu Hengduo Li Jingjing Chen and Yu-Gang Jiang. 2023. HCMS: Hierarchical and conditional modality selection for efficient video recognition. ACM Transactions on Multimedia Computing Communications and Applications 20 2 (2023) 1\u201318.","DOI":"10.1145\/3572776"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00023"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00631"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00137"},{"key":"e_1_3_1_60_2","unstructured":"Enze Xie Wenhai Wang Zhiding Yu Anima Anandkumar Jose M Alvarez and Ping Luo. 2021. SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 34 (2021) 12077\u201312090."},{"key":"e_1_3_1_61_2","article-title":"Pyramid self-attention polymerization learning for semi-supervised skeleton-based action recognition","author":"Xu Binqian","year":"2023","unstructured":"Binqian Xu and Xiangbo Shu. 2023. 
Pyramid self-attention polymerization learning for semi-supervised skeleton-based action recognition. arXiv preprint arXiv:2302.02327 (2023).","journal-title":"arXiv preprint arXiv:2302.02327"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2023.3247103"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/3538749"},{"key":"e_1_3_1_64_2","first-page":"5752","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Xu Ruihan","year":"2023","unstructured":"Ruihan Xu, Haokui Zhang, Wenze Hu, Shiliang Zhang, and Xiaoyu Wang. 2023. ParCNetV2: Oversized kernel with enhanced attention. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 5752\u20135762."},{"key":"e_1_3_1_65_2","article-title":"GPViT: A high resolution non-hierarchical vision transformer with group propagation","author":"Yang Chenhongyi","year":"2022","unstructured":"Chenhongyi Yang, Jiarui Xu, Shalini De Mello, Elliot J. Crowley, and Xiaolong Wang. 2022. GPViT: A high resolution non-hierarchical vision transformer with group propagation. arXiv preprint arXiv:2212.06795 (2022).","journal-title":"arXiv preprint arXiv:2212.06795"},{"key":"e_1_3_1_66_2","first-page":"11101","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zeng Wang","year":"2022","unstructured":"Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang, and Xiaogang Wang. 2022. Not all tokens are equal: Human-centric visual analysis via token clustering transformer. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
11101\u201311111."},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475272"},{"key":"e_1_3_1_68_2","first-page":"13577","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV \u201921)","author":"Zhang Yanyi","year":"2021","unstructured":"Yanyi Zhang, Xinyu Li, Chunhui Liu, Bing Shuai, Yi Zhu, Biagio Brattoli, Hao Chen, Ivan Marsic, and Joseph Tighe. 2021. VidTr: Video transformer without convolutions. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV \u201921). 13577\u201313587."},{"key":"e_1_3_1_69_2","doi-asserted-by":"crossref","unstructured":"Mengyi Zhao Hao Tang Pan Xie Shuling Dai Nicu Sebe and Wei Wang. 2023. Bidirectional transformer gan for long-term human motion prediction. ACM Transactions on Multimedia Computing Communications and Applications 19 5 (2023) 1\u201319.","DOI":"10.1145\/3579359"},{"key":"e_1_3_1_70_2","doi-asserted-by":"crossref","unstructured":"Sixiao Zheng Jiachen Lu Hengshuang Zhao Xiatian Zhu Zekun Luo Yabiao Wang Yanwei Fu Jianfeng Feng Tao Xiang Philip HS Torr and Zhang Li. 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition . 6881\u20136890.","DOI":"10.1109\/CVPR46437.2021.00681"},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1145\/3576857"},{"key":"e_1_3_1_72_2","article-title":"Deformable DETR: Deformable transformers for end-to-end object detection","author":"Zhu Xizhou","year":"2020","unstructured":"Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2020. Deformable DETR: Deformable transformers for end-to-end object detection. 
arXiv preprint arXiv:2010.04159 (2020).","journal-title":"arXiv preprint arXiv:2010.04159"},{"key":"e_1_3_1_73_2","first-page":"695","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV \u201918)","author":"Zolfaghari Mohammadreza","year":"2018","unstructured":"Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas Brox. 2018. Eco: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV \u201918). 695\u2013712."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3633781","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3633781","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:35:48Z","timestamp":1750178148000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3633781"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,11]]},"references-count":72,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,4,30]]}},"alternative-id":["10.1145\/3633781"],"URL":"https:\/\/doi.org\/10.1145\/3633781","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,11]]},"assertion":[{"value":"2023-05-28","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-11-08","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2024-01-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}