{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,21]],"date-time":"2026-04-21T15:46:37Z","timestamp":1776786397049,"version":"3.51.2"},"reference-count":58,"publisher":"Association for Computing Machinery (ACM)","issue":"5","funder":[{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62576131, 62272144"],"award-info":[{"award-number":["62576131, 62272144"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"University Synergy Innovation Program of Anhui Province","award":["GXXT-2023-015"],"award-info":[{"award-number":["GXXT-2023-015"]}]},{"name":"Anhui Natural Science Foundation","award":["2408085MF157"],"award-info":[{"award-number":["2408085MF157"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,5,31]]},"abstract":"<jats:p>\n                    Gaze target detection aims to localize a person\u2019s gaze target. During gaze transition in video, the absence of accurate temporal variation modeling (TVM) may lead to errors in gaze target localization. In this work, we propose a Transition-aware Gaze Model (TGM), which focuses on analyzing temporal differences to achieve accurate location variation modeling. The TGM contains four key components: a frame gaze model, and three transition-aware modules (path variation, direction variation, and fusion).\n                    <jats:italic toggle=\"yes\">First<\/jats:italic>\n                    , the frame Transformer extracts gaze location and direction features.\n                    <jats:italic toggle=\"yes\">Second<\/jats:italic>\n                    , to analyze the feature difference among transition frames, we introduce TVM guided by transition-aware loss. 
TVM analyzes the location features to capture the moving trajectory of targets (defined as\n                    <jats:italic toggle=\"yes\">path variation<\/jats:italic>\n                    ), which facilitates the search for target locations near the path.\n                    <jats:italic toggle=\"yes\">Third<\/jats:italic>\n                    , TVM also analyzes the direction features to capture the transition-aware direction area (defined as\n                    <jats:italic toggle=\"yes\">direction variation<\/jats:italic>\n                    ), which facilitates the search for target locations within this area.\n                    <jats:italic toggle=\"yes\">Fourth<\/jats:italic>\n                    , since gaze directions dynamically adjust to track gaze targets, path variation and direction variation are inherently aligned with the natural movement of a person\u2019s gaze. Thus, these two variations are fused into a unified transition-aware feature, which helps cover all potential target locations. To search for accurate target locations, we embed this transition-aware feature into frame features with cross-attention, which can enhance gaze target detection in transition frames. 
Extensive experiments demonstrate that our method achieves state-of-the-art performance on two datasets, namely VideoAttentionTarget and VideoCoAtt.\n                  <\/jats:p>","DOI":"10.1145\/3799429","type":"journal-article","created":{"date-parts":[[2026,3,14]],"date-time":"2026-03-14T09:46:39Z","timestamp":1773481599000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Transition-aware Path and Direction Variation Modeling for Gaze Target Detection in Video"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-3801-1377","authenticated-orcid":false,"given":"Xingming","family":"Yang","sequence":"first","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5232-7086","authenticated-orcid":false,"given":"Jing","family":"Jin","sequence":"additional","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7332-5653","authenticated-orcid":false,"given":"Kewei","family":"Wu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9834-4730","authenticated-orcid":false,"given":"Zhao","family":"Xie","sequence":"additional","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-6540-1159","authenticated-orcid":false,"given":"Chongjia","family":"Zhu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2594-254X","authenticated-orcid":false,"given":"Dan","family":"Guo","sequence":"additional","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China and Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,4,21]]},"reference":[{"issue":"2","key":"e_1_3_1_2_2","first-page":"38:1","article-title":"Look at me! Correcting eye gaze in live video communication","volume":"15","author":"Hsu Chih-Fan","year":"2019","unstructured":"Chih-Fan Hsu, Yu-Shuen Wang, Chin-Laung Lei, and Kuan-Ta Chen. 2019. Look at me! Correcting eye gaze in live video communication. ACM Transactions on Multimedia Computing, Communications, and Applications 15, 2 (2019), 38:1\u201338:21.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIV.2022.3141071"},{"key":"e_1_3_1_4_2","first-page":"5017","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops \u201922)","author":"Chen Jianhang","year":"2022","unstructured":"Jianhang Chen, Xu Zhang, Yue Wu, Shalini Ghosh, Pradeep Natarajan, Shih-Fu Chang, and Jan P. Allebach. 2022. One-stage object referring with gaze estimation. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops \u201922), 5017\u20135026."},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2024.3358415"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3688973"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2022.3222077"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2022.3223688"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3085755"},{"key":"e_1_3_1_10_2","first-page":"199","volume-title":"Advances in Neural Information Processing Systems","volume":"28","author":"Recasens Adri\u00e0","year":"2015","unstructured":"Adri\u00e0 Recasens, Aditya Khosla, Carl Vondrick, and Antonio Torralba. 2015. Where are they looking? In Advances in Neural Information Processing Systems, Vol. 28, 199\u2013207."},{"key":"e_1_3_1_11_2","first-page":"35","volume-title":"Proceedings of the 14th Asian Conference on Computer Vision (ACCV \u201918), Lecture Notes in Computer Science","volume":"11363","author":"Lian Dongze","year":"2018","unstructured":"Dongze Lian, Zehao Yu, and Shenghua Gao. 2018. Believe it or not, we know what you are looking at! In Proceedings of the 14th Asian Conference on Computer Vision (ACCV \u201918). Lecture Notes in Computer Science, Vol. 11363, 35\u201350."},{"key":"e_1_3_1_12_2","doi-asserted-by":"crossref","first-page":"3316","DOI":"10.1109\/WACV45572.2020.9093515","volume-title":"Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV \u201920)","author":"S\u00fcmer \u00d6mer","year":"2020","unstructured":"\u00d6mer S\u00fcmer, Peter Gerjets, Ulrich Trautwein, and Enkelejda Kasneci. 2020. Attention flow: End-to-end joint attention estimation. 
In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV \u201920), 3316\u20133325."},{"key":"e_1_3_1_13_2","first-page":"2192","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922)","author":"Tu Danyang","year":"2022","unstructured":"Danyang Tu, Xiongkuo Min, Huiyu Duan, Guodong Guo, Guangtao Zhai, and Wei Shen. 2022. End-to-end human-gaze-target detection with transformers. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922), 2192\u20132200."},{"key":"e_1_3_1_14_2","first-page":"21803","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV \u201923)","author":"Tonini Francesco","year":"2023","unstructured":"Francesco Tonini, Nicola Dall\u2019Asen, Cigdem Beyan, and Elisa Ricci. 2023. Object-aware gaze target detection. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV \u201923). IEEE, 21803\u201321812."},{"key":"e_1_3_1_15_2","first-page":"5395","volume-title":"Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201920)","author":"Chong Eunji","year":"2020","unstructured":"Eunji Chong, Yongxin Wang, Nataniel Ruiz, and James M. Rehg. 2020. Detecting attended visual targets in video. In Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201920), 5395\u20135405."},{"key":"e_1_3_1_16_2","doi-asserted-by":"crossref","first-page":"880","DOI":"10.1109\/WACV56688.2023.00094","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV \u201923)","author":"Miao Qiaomu","year":"2023","unstructured":"Qiaomu Miao, Minh Hoai, and Dimitris Samaras. 2023. Patch-level gaze distribution prediction for gaze following. 
In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV \u201923), 880\u2013889."},{"key":"e_1_3_1_17_2","first-page":"397","volume-title":"Proceedings of the 15th European Conference on Computer Vision (ECCV \u201918), Lecture Notes in Computer Science","volume":"11209","author":"Chong Eunji","year":"2018","unstructured":"Eunji Chong, Nataniel Ruiz, Yongxin Wang, Yun Zhang, Agata Rozga, and James M. Rehg. 2018. Connecting gaze, scene, and attention: Generalized attention estimation via joint modeling of gaze and scene saliency. In Proceedings of the 15th European Conference on Computer Vision (ECCV \u201918), Lecture Notes in Computer Science, Vol. 11209, 397\u2013412."},{"key":"e_1_3_1_18_2","first-page":"6460","volume-title":"Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201918)","author":"Fan Lifeng","year":"2018","unstructured":"Lifeng Fan, Yixin Chen, Ping Wei, Wenguan Wang, and Song-Chun Zhu. 2018. Inferring shared attention in social scene videos. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201918), Computer Vision Foundation\/IEEE Computer Society, 6460\u20136468."},{"key":"e_1_3_1_19_2","first-page":"4997","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops \u201922)","author":"Gideon John","year":"2022","unstructured":"John Gideon, Shan Su, and Simon Stent. 2022. Unsupervised multi-view gaze representation learning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops \u201922), 4997\u20135005."},{"key":"e_1_3_1_20_2","first-page":"4187","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922)","author":"Zhang Mingfang","year":"2022","unstructured":"Mingfang Zhang, Yunfei Liu, and Feng Lu. 2022. 
Gazeonce: Real-time multi-person gaze estimation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922), 4187\u20134196."},{"key":"e_1_3_1_21_2","first-page":"598","volume-title":"Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201916)","author":"Pan Junting","year":"2016","unstructured":"Junting Pan, Elisa Sayrol, Xavier Gir\u00f3-I-Nieto, Kevin McGuinness, and Noel E. O\u2019Connor. 2016. Shallow and deep convolutional networks for saliency prediction. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201916), 598\u2013606."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2023.3312392"},{"key":"e_1_3_1_23_2","first-page":"4988","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops \u201922)","author":"Oh Jun O.","year":"2022","unstructured":"Jun O. Oh, Hyung Jin Chang, and Sang-Il Choi. 2022. Self-attention with convolution and deconvolution for efficient eye gaze estimation from a full face image. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops \u201922), 4988\u20134996."},{"key":"e_1_3_1_24_2","first-page":"22035","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201923)","author":"Cai Xin","year":"2023","unstructured":"Xin Cai, Jiabei Zeng, Shiguang Shan, and Xilin Chen. 2023. Source-free adaptive gaze estimation by uncertainty reduction. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201923), 22035\u201322045."},{"key":"e_1_3_1_25_2","first-page":"2008","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201924)","author":"Tafasca Samy","year":"2024","unstructured":"Samy Tafasca, Anshul Gupta, and Jean-Marc Odobez. 2024. Sharingan: A transformer architecture for multi-person gaze following. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201924). IEEE, 2008\u20132017."},{"key":"e_1_3_1_26_2","first-page":"87","volume-title":"Proceedings of the 17th European Conference on Computer Vision (ECCV \u201922), Lecture Notes in Computer Science","volume":"13664","author":"Tu Danyang","year":"2022","unstructured":"Danyang Tu, Xiongkuo Min, Huiyu Duan, Guodong Guo, Guangtao Zhai, and Wei Shen. 2022. Iwin: Human-object interaction detection via transformer with irregular windows. In Proceedings of the 17th European Conference on Computer Vision (ECCV \u201922). Shai Avidan, Gabriel J. Brostow, Moustapha Ciss\u00e9, Giovanni Maria Farinella, and Tal Hassner (Eds.), Lecture Notes in Computer Science, Vol. 13664, Springer, 87\u2013103."},{"key":"e_1_3_1_27_2","first-page":"8310","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201925)","author":"Mazzamuto Michele","year":"2025","unstructured":"Michele Mazzamuto, Antonino Furnari, Yoichi Sato, and Giovanni Maria Farinella. 2025. Gazing into missteps: Leveraging eye-gaze for unsupervised mistake detection in egocentric videos of skilled human activities. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201925), Computer Vision Foundation\/IEEE, 8310\u20138320."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2025.113774"},{"key":"e_1_3_1_29_2","first-page":"420","volume-title":"Proceedings of the International Conference on Multimodal Interaction (ICMI \u201922)","author":"Tonini Francesco","year":"2022","unstructured":"Francesco Tonini, Cigdem Beyan, and Elisa Ricci. 2022. Multimodal across domains gaze target detection. In Proceedings of the International Conference on Multimodal Interaction (ICMI \u201922), 420\u2013431."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2022.104924"},{"key":"e_1_3_1_31_2","first-page":"11390","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201921)","author":"Fang Yi","year":"2021","unstructured":"Yi Fang, Jiapeng Tang, Wang Shen, Wei Shen, Xiao Gu, Li Song, and Guangtao Zhai. 2021. Dual attention guided gaze target detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201921), 11390\u201311399."},{"key":"e_1_3_1_32_2","first-page":"2688","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201923)","author":"Balim Haldun","year":"2023","unstructured":"Haldun Balim, Seonwook Park, Xi Wang, Xucong Zhang, and Otmar Hilliges. 2023. EFE: End-to-end frame-to-gaze estimation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201923), 2688\u20132697."},{"key":"e_1_3_1_33_2","first-page":"14106","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922)","author":"Bao Jun","year":"2022","unstructured":"Jun Bao, Buyu Liu, and Jun Yu. 2022. Escnet: Gaze target detection with the understanding of 3d scenes. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922), 14106\u201314115."},{"issue":"11","key":"e_1_3_1_34_2","first-page":"359:1","article-title":"Depth matters: Spatial proximity-based gaze cone generation for gaze following in wild","volume":"20","author":"Liu Feiyang","year":"2024","unstructured":"Feiyang Liu, Kun Li, Zhun Zhong, Wei Jia, Bin Hu, Xun Yang, Meng Wang, and Dan Guo. 2024. Depth matters: Spatial proximity-based gaze cone generation for gaze following in wild. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 11 (2024), 359:1\u2013359:24.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"issue":"1","key":"e_1_3_1_35_2","first-page":"20:1","article-title":"Robust unsupervised gaze calibration using conversation and manipulation attention priors","volume":"18","author":"Siegfried R\u00e9my","year":"2022","unstructured":"R\u00e9my Siegfried and Jean-Marc Odobez. 2022. Robust unsupervised gaze calibration using conversation and manipulation attention priors. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 1 (2022), 20:1\u201320:27.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2023.3244269"},{"key":"e_1_3_1_37_2","article-title":"GazeHTA: End-to-end gaze target detection with head-target association","author":"Lin Zhi-Yi","year":"2024","unstructured":"Zhi-Yi Lin, Jouh Yeong Chew, Jan van Gemert, and Xucong Zhang. 2024. GazeHTA: End-to-end gaze target detection with head-target association. 
In Proceedings of the IEEE\/CVF Conference on Robotics and Automation.","journal-title":"Proceedings of the IEEE\/CVF Conference on Robotics and Automation"},{"key":"e_1_3_1_38_2","first-page":"1441","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201923)","author":"Mondal Sounak","year":"2023","unstructured":"Sounak Mondal, Zhibo Yang, Seoyoung Ahn, Dimitris Samaras, Gregory J. Zelinsky, and Minh Hoai. 2023. Gazeformer: Scalable, effective and fast prediction of goal-directed human attention. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201923), 1441\u20131450."},{"key":"e_1_3_1_39_2","first-page":"23891","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201925)","author":"Cheng Yihua","year":"2025","unstructured":"Yihua Cheng, Hengfei Wang, Zhongqun Zhang, Yang Yue, Boeun Kim, Feng Lu, and Hyung Jin Chang. 2025. 3D prior is all you need: Cross-task few-shot 2D gaze estimation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201925). Computer Vision Foundation\/IEEE, 23891\u201323900."},{"key":"e_1_3_1_40_2","first-page":"28874","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201925)","author":"Ryan Fiona","year":"2025","unstructured":"Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, and James M. Rehg. 2025. Gaze-LLE: Gaze target estimation via large-scale learned encoders. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201925), Computer Vision Foundation\/IEEE, 28874\u201328884."},{"key":"e_1_3_1_41_2","first-page":"1895","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201921)","author":"Wang Limin","year":"2021","unstructured":"Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. 2021. 
TDN: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201921), 1895\u20131904."},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3224327"},{"key":"e_1_3_1_43_2","first-page":"6092","volume-title":"Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI \u201923)","author":"Wang Lintao","year":"2023","unstructured":"Lintao Wang, Kun Hu, Lei Bai, Yu Ding, Wanli Ouyang, and Zhiyong Wang. 2023. Multi-scale control signal-aware transformer for motion synthesis without phase. In Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI \u201923), 6092\u20136100."},{"key":"e_1_3_1_44_2","first-page":"1231","volume-title":"Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI \u201923)","author":"Lee Taeryung","year":"2023","unstructured":"Taeryung Lee, Gyeongsik Moon, and Kyoung Mu Lee. 2023. Multiact: Long-term 3D human motion generation from multiple action labels. In Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI \u201923), 1231\u20131239."},{"key":"e_1_3_1_45_2","first-page":"18011","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201923)","author":"Wang Xiang","year":"2023","unstructured":"Xiang Wang, Shiwei Zhang, Zhiwu Qing, Changxin Gao, Yingya Zhang, Deli Zhao, and Nong Sang. 2023. MoLo: Motion-augmented long-short contrastive learning for few-shot action recognition. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201923), 18011\u201318021."},{"key":"e_1_3_1_46_2","first-page":"9141","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922)","author":"Wu Jiamin","year":"2022","unstructured":"Jiamin Wu, Tianzhu Zhang, Zhe Zhang, Feng Wu, and Yongdong Zhang. 2022. 
Motion-modulated temporal fragment alignment network for few-shot action recognition. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922), 9141\u20139150."},{"key":"e_1_3_1_47_2","first-page":"1404","volume-title":"Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI \u201922)","author":"Li Shuyuan","year":"2022","unstructured":"Shuyuan Li, Huabin Liu, Rui Qian, Yuxi Li, John See, Mengjuan Fei, Xiaoyuan Yu, and Weiyao Lin. 2022. TA2N: Two-stage action alignment network for few-shot action recognition. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI \u201922), 1404\u20131411."},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2022.3223955"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2024.3521712"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2025.3533573"},{"key":"e_1_3_1_51_2","first-page":"6218","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence (AAAI \u201925)","author":"Nguyen Thong Thanh","year":"2025","unstructured":"Thong Thanh Nguyen, Xiaobao Wu, Yi Bin, Cong-Duy T. Nguyen, See-Kiong Ng, and Anh Tuan Luu. 2025. Motion-aware contrastive learning for temporal panoptic scene graph generation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI \u201925). Toby Walsh, Julie Shah, and Zico Kolter (Eds.), AAAI Press, 6218\u20136226."},{"key":"e_1_3_1_52_2","first-page":"8438","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201925)","author":"Thoker Fida Mohammad","year":"2025","unstructured":"Fida Mohammad Thoker, Letian Jiang, Chen Zhao, and Bernard Ghanem. 2025. SMILE: Infusing spatial and motion semantics in masked video learning. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201925), Computer Vision Foundation\/IEEE, 8438\u20138449."},{"key":"e_1_3_1_53_2","first-page":"1701","volume-title":"Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI \u201923)","author":"Liu Mengyuan","year":"2023","unstructured":"Mengyuan Liu, Fanyang Meng, Chen Chen, and Songtao Wu. 2023. Novel motion patterns matter for practical skeleton-based action recognition. In Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI \u201923), 1701\u20131709."},{"issue":"10","key":"e_1_3_1_54_2","first-page":"320:1","article-title":"EOGT: Video anomaly detection with enhanced object information and global temporal dependency","volume":"20","author":"Pi Ruoyan","year":"2024","unstructured":"Ruoyan Pi, Peng Wu, Xiangteng He, and Yuxin Peng. 2024. EOGT: Video anomaly detection with enhanced object information and global temporal dependency. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 10 (2024), 320:1\u2013320:21.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_1_55_2","article-title":"Mamba: Linear-time sequence modeling with selective state spaces","author":"Gu Albert","year":"2023","unstructured":"Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the 1st Conference on Language Modeling.","journal-title":"Proceedings of the 1st Conference on Language Modeling"},{"key":"e_1_3_1_56_2","volume-title":"Proceedings of the 10th International Conference on Learning Representations (ICLR \u201922)","author":"Gu Albert","year":"2022","unstructured":"Albert Gu, Karan Goel, and Christopher R\u00e9. 2022. Efficiently modeling long sequences with structured state spaces. In Proceedings of the 10th International Conference on Learning Representations (ICLR \u201922). 
OpenReview.net."},{"key":"e_1_3_1_57_2","first-page":"1","volume-title":"Proceedings of the 18th European Conference on Computer Vision (ECCV \u201924), Lecture Notes in Computer Science","volume":"15083","author":"Park Jinyoung","year":"2024","unstructured":"Jinyoung Park, Hee-Seon Kim, Kangwook Ko, Minbeom Kim, and Changick Kim. 2024. VideoMamba: Spatio-temporal selective state space model. In Proceedings of the 18th European Conference on Computer Vision (ECCV \u201924). Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and G\u00fcl Varol (Eds.), Lecture Notes in Computer Science, Vol. 15083, Springer, 1\u201318."},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_59_2","first-page":"213","volume-title":"Proceedings of the 16th European Conference on Computer Vision (ECCV \u201920), Lecture Notes in Computer Science","volume":"12346","author":"Carion Nicolas","year":"2020","unstructured":"Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Proceedings of the 16th European Conference on Computer Vision (ECCV \u201920). Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.), Lecture Notes in Computer Science, Vol. 
12346, 213\u2013229."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3799429","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,21]],"date-time":"2026-04-21T14:57:36Z","timestamp":1776783456000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3799429"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,4,21]]},"references-count":58,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2026,5,31]]}},"alternative-id":["10.1145\/3799429"],"URL":"https:\/\/doi.org\/10.1145\/3799429","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,4,21]]},"assertion":[{"value":"2025-09-15","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-14","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-04-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}