{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T11:48:43Z","timestamp":1774352923096,"version":"3.50.1"},"reference-count":76,"publisher":"Association for Computing Machinery (ACM)","issue":"10","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,10,31]]},"abstract":"<jats:p>\n            Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this article, we focus on the recently introduced text-motion retrieval tasks, which aim to search for database motions that are the most relevant to a specified natural language textual description (\n            <jats:italic toggle=\"yes\">text-to-motion<\/jats:italic>\n            ) and vice-versa (\n            <jats:italic toggle=\"yes\">motion-to-text<\/jats:italic>\n            ). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning\u2014where we train on multiple text-motion datasets simultaneously\u2014together with the introduction of a Cross-Consistent Contrastive Loss (CCCL) function, which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process skeleton data sequences. We demonstrate the benefits of the proposed approaches on the widely used KIT Motion Language and HumanML3D datasets, including also some results on the recent Motion-X dataset. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods. The code for reproducing our results is available here:\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/mesnico\/MOTpp\">https:\/\/github.com\/mesnico\/MOTpp<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3744565","type":"journal-article","created":{"date-parts":[[2025,6,12]],"date-time":"2025-06-12T10:48:50Z","timestamp":1749725330000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3011-2487","authenticated-orcid":false,"given":"Nicola","family":"Messina","sequence":"first","affiliation":[{"name":"CNR ISTI, Pisa, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7668-8521","authenticated-orcid":false,"given":"Jan","family":"Sedmidubsky","sequence":"additional","affiliation":[{"name":"Masaryk University, Brno, Czech Republic"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6258-5313","authenticated-orcid":false,"given":"Fabrizio","family":"Falchi","sequence":"additional","affiliation":[{"name":"CNR ISTI, Pisa, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2331-7671","authenticated-orcid":false,"given":"Tom\u00e1\u0161","family":"Rebok","sequence":"additional","affiliation":[{"name":"Masaryk University, Brno, Czech Republic"}]}],"member":"320","published-online":{"date-parts":[[2025,10,14]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","unstructured":"Emre Aksan Manuel Kaufmann Peng Cao and Otmar Hilliges. 2020. A spatio-temporal transformer for 3D Human motion prediction. arXiv:2004.08692. DOI: 10.48550\/ARXIV.2004.08692","DOI":"10.48550\/ARXIV.2004.08692"},{"key":"e_1_3_2_3_2","first-page":"6836","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Arnab Anurag","year":"2021","unstructured":"Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lu\u010di\u0107, and Cordelia Schmid. 2021. ViViT: A video vision transformer. In IEEE\/CVF International Conference on Computer Vision (ICCV), 6836\u20136846."},{"key":"e_1_3_2_4_2","unstructured":"Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. arxiv:2005.14165. Retrieved from https:\/\/arxiv.org\/abs\/2005.14165"},{"key":"e_1_3_2_5_2","first-page":"10","volume-title":"International Conference on Multimedia Retrieval (ICMR)","author":"Budikova Petra","year":"2021","unstructured":"Petra Budikova, Jan Sedmidubsky, and Pavel Zezula. 2021. Efficient indexing of 3D human motions. In International Conference on Multimedia Retrieval (ICMR). ACM, 10\u201318. Retrieved from https:\/\/dl.acm.org\/doi\/10.1145\/3460426.3463646"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2023.123061"},{"key":"e_1_3_2_7_2","first-page":"1","volume-title":"IEEE International Conference on Multimedia and Expo (ICME)","author":"Cheng Yi-Bin","year":"2021","unstructured":"Yi-Bin Cheng, Xipeng Chen, Junhong Chen, Pengxu Wei, Dongyu Zhang, and Liang Lin. 2021. Hierarchical transformer: Unsupervised representation learning for skeleton-based human action recognition. In IEEE International Conference on Multimedia and Expo (ICME), 1\u20136."},{"key":"e_1_3_2_8_2","volume-title":"2nd ACM International Conference on Multimedia in Asia (MMAsia)","author":"Cheng Yi-Bin","year":"2021","unstructured":"Yi-Bin Cheng, Xipeng Chen, Dongyu Zhang, and Liang Lin. 2021. Motion-transformer: Self-supervised pre-training for skeleton-based action recognition. In 2nd ACM International Conference on Multimedia in Asia (MMAsia). ACM, New York, NY."},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","unstructured":"Haodong Duan Jiaqi Wang Kai Chen and Dahua Lin. 2022. DG-STGCN: Dynamic spatial-temporal modeling for skeleton-based action recognition. arXiv:2210.05895. DOI: 10.48550\/ARXIV.2210.05895","DOI":"10.48550\/ARXIV.2210.05895"},{"key":"e_1_3_2_10_2","first-page":"1","article-title":"A comprehensive survey on human pose estimation approaches","author":"Dubey Shradha","year":"2022","unstructured":"Shradha Dubey and Manish Dixit. 2022. A comprehensive survey on human pose estimation approaches. Multimedia Systems 29 (2022), 1\u201329. Retrieved from https:\/\/link.springer.com\/article\/10.1007\/s00530-022-00980-0","journal-title":"Multimedia Systems"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","unstructured":"Han Fang Pengfei Xiong Luhui Xu and Yu Chen. 2021. Clip2video: Mastering video-text retrieval via image clip. arXiv:2106.11097. Retrieved from 10.48550\/arXiv.2106.11097","DOI":"10.48550\/arXiv.2106.11097"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-73636-0_19"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612097"},{"key":"e_1_3_2_14_2","first-page":"1396","volume-title":"IEEE\/CVF International Conference on Computer Vision","author":"Ghosh Anindita","year":"2021","unstructured":"Anindita Ghosh, Noshaba Cheema, Cennet Oguz, Christian Theobalt, and Philipp Slusallek. 2021. Synthesis of compositional animations from textual descriptions. In IEEE\/CVF International Conference on Computer Vision, 1396\u20131406."},{"key":"e_1_3_2_15_2","first-page":"5152","article-title":"Generating diverse and natural 3D human motions from text","author":"Guo Chuan","year":"2022","unstructured":"Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. 2022. Generating diverse and natural 3D human motions from text. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5152\u20135161.","journal-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)"},{"key":"e_1_3_2_16_2","first-page":"580","volume-title":"European Conference on Computer Vision (ECCV)","author":"Guo Chuan","year":"2022","unstructured":"Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. 2022. TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts. In European Conference on Computer Vision (ECCV). Springer Nature Switzerland, Cham, 580\u2013597."},{"key":"e_1_3_2_17_2","first-page":"2021","volume-title":"28th ACM International Conference on Multimedia","author":"Guo Chuan","year":"2020","unstructured":"Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. 2020. Action2motion: Conditioned generation of 3d human motions. In 28th ACM International Conference on Multimedia, 2021\u20132029."},{"key":"e_1_3_2_18_2","first-page":"762","volume-title":"36th AAAI Conference on Artificial Intelligence (AAAI)","author":"Guo Tianyu","year":"2022","unstructured":"Tianyu Guo, Hong Liu, Zhan Chen, Mengyuan Liu, Tao Wang, and Runwei Ding. 2022. Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In 36th AAAI Conference on Artificial Intelligence (AAAI), 762\u2013770."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i16.29789"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/582415.582418"},{"key":"e_1_3_2_21_2","first-page":"4904","volume-title":"International Conference on Machine Learning","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning. PMLR, 4904\u20134916."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-023-3140-y"},{"key":"e_1_3_2_23_2","first-page":"4171","volume-title":"2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171\u20134186."},{"key":"e_1_3_2_24_2","unstructured":"Jihoon Kim Youngjae Yu Seungyoun Shin Taehyun Byun and Sungjoon Choi. 2022. Learning joint representation of human motion and language. arXiv:2210.15187. Retrieved from https:\/\/arxiv.org\/abs\/2210.15187"},{"key":"e_1_3_2_25_2","first-page":"5081","volume-title":"IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV)","author":"Kim Taehoon","year":"2024","unstructured":"Taehoon Kim, ChanHee Kang, JaeHyuk Park, Daun Jeong, ChangHee Yang, Suk-Ju Kang, and Kyeongbo Kong. 2024. Human motion aware text-to-video generation with explicit camera control. In IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV), 5081\u20135090."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1561\/2200000056"},{"key":"e_1_3_2_27_2","first-page":"3298","volume-title":"IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV)","author":"Lee Sumin","year":"2023","unstructured":"Sumin Lee, Sangmin Woo, Yeonju Park, Muhammad Adi Nugroho, and Changick Kim. 2023. Modality mixer for multi-modal action recognition. In IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV), 3298\u20133307."},{"key":"e_1_3_2_28_2","article-title":"Motion-X: A large-scale 3D expressive whole-body human motion dataset","author":"Lin Jing","year":"2023","unstructured":"Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. 2023. Motion-X: A large-scale 3D expressive whole-body human motion dataset. In Advances in Neural Information Processing Systems.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413548"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2023.3276796"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3596711.3596800"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2022.07.028"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9413505"},{"key":"e_1_3_2_34_2","first-page":"5442","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Mahmood Naureen","year":"2019","unstructured":"Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. 2019. AMASS: Archive of motion capture as surface shapes. In IEEE\/CVF International Conference on Computer Vision (ICCV), 5442\u20135451."},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3451390"},{"key":"e_1_3_2_36_2","unstructured":"Nicola Messina Davide Alessandro Coccomini Andrea Esuli and Fabrizio Falchi. 2022. Transformer-based multi-modal proposal and re-rank for wikipedia image-caption matching. arXiv:2206.10436. Retrieved from https:\/\/arxiv.org\/abs\/2206.10436"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3592069"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3549555.3549576"},{"key":"e_1_3_2_39_2","first-page":"26","volume-title":"18th International Conference on Computer Analysis of Images and Patterns (CAIP)","volume":"11678","author":"Papadopoulos Konstantinos","year":"2019","unstructured":"Konstantinos Papadopoulos, Enjie Ghorbel, Renato Baptista, Djamila Aouada, and Bj\u00f6rn E. Ottersten. 2019. Two-stage RGB-based action detection using augmented 3D poses. In 18th International Conference on Computer Analysis of Images and Patterns (CAIP), Vol. 11678, Springer, 26\u201335."},{"key":"e_1_3_2_40_2","first-page":"10985","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Petrovich Mathis","year":"2021","unstructured":"Mathis Petrovich, Michael J. Black, and G\u00fcl Varol. 2021. Action-conditioned 3D human motion synthesis with transformer VAE. In IEEE\/CVF International Conference on Computer Vision (ICCV), 10985\u201310995."},{"key":"e_1_3_2_41_2","first-page":"480","volume-title":"European Conference on Computer Vision (ECCV)","author":"Petrovich Mathis","year":"2022","unstructured":"Mathis Petrovich, Michael J. Black, and G\u00fcl Varol. 2022. TEMOS: Generating diverse human motions from textual descriptions. In European Conference on Computer Vision (ECCV). Springer Nature Switzerland, Cham, 480\u2013497."},{"key":"e_1_3_2_42_2","first-page":"9488","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Petrovich Mathis","year":"2023","unstructured":"Mathis Petrovich, Michael J. Black, and G\u00fcl Varol. 2023. TMR: Text-to-motion retrieval using contrastive 3D human motion synthesis. In IEEE\/CVF International Conference on Computer Vision (ICCV), 9488\u20139497."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1089\/big.2016.0028"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","unstructured":"Alec Radford Jong Wook Kim Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark et al. 2021. Learning transferable visual models from natural language supervision. arXiv:2103.00020. Retrieved from 10.48550\/arXiv.2103.00020","DOI":"10.48550\/arXiv.2103.00020"},{"issue":"8","key":"e_1_3_2_45_2","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.","journal-title":"OpenAI Blog"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-28238-6_8"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3075766"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664815"},{"key":"e_1_3_2_49_2","first-page":"20020","volume-title":"2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Shvetsova Nina","year":"2022","unstructured":"Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio S. Feris, David Harwath, James Glass, and Hilde Kuehne. 2022. Everything at once-multi-modal fusion transformer for video retrieval. In 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 20020\u201320029."},{"key":"e_1_3_2_50_2","unstructured":"Nyle Siddiqui Praveen Tirupattur and Mubarak Shah. 2023. DVANet: Disentangling view and action features for multi-view action recognition. arXiv:2312.05719. Retrieved from https:\/\/arxiv.org\/abs\/2312.05719"},{"key":"e_1_3_2_51_2","first-page":"16857","article-title":"Mpnet: Masked and permuted pre-training for language understanding","volume":"33","author":"Song Kaitao","year":"2020","unstructured":"Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding. In Advances in Neural Information Processing Systems, Vol. 33, 16857\u201316867.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2818328"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3324835"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612449"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472722"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/HUMANOIDS.2014.7041470"},{"key":"e_1_3_2_57_2","volume-title":"11th International Conference on Learning Representations (ICLR)","author":"Tevet Guy","year":"2023","unstructured":"Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-Or, and Amit Haim Bermano. 2023. Human motion diffusion model. In 11th International Conference on Learning Representations (ICLR). Retrieved from https:\/\/openreview.net\/forum?id=SJ1kSyO2jwu"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-023-17276-8"},{"key":"e_1_3_2_59_2","first-page":"30","article-title":"Attention is all you need. In","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_60_2","first-page":"56","volume-title":"IEEE International Conference on Image Processing (ICIP)","author":"Wang Guoquan","year":"2023","unstructured":"Guoquan Wang, Hong Liu, Tianyu Guo, Jingwen Guo, Ti Wang, and Yidi Li. 2023. Self-supervised 3D skeleton representation learning with active sampling and adaptive relabeling for action recognition. In IEEE International Conference on Image Processing (ICIP), 56\u201360."},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612490"},{"key":"e_1_3_2_62_2","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1007\/978-981-99-8141-0_22","volume-title":"Neural Information Processing","author":"Weng Libo","year":"2024","unstructured":"Libo Weng, Weidong Lou, and Fei Gao. 2024. Language guided graph transformer for skeleton action recognition. In Neural Information Processing. Springer Nature Singapore, Singapore, 283\u2013299."},{"key":"e_1_3_2_63_2","first-page":"11535","article-title":"MLP: Motion label prior for temporal sentence localization in untrimmed 3D human motions","author":"Yan Sheng","year":"2024","unstructured":"Sheng Yan, Mengyuan Liu, Yong Wang, Yang Liu, and Hong Liu. 2024. MLP: Motion label prior for temporal sentence localization in untrimmed 3D human motions. IEEE Transactions on Circuits and Systems for Video Technology 34 (2024), 11535\u201311550. Retrieved from https:\/\/ieeexplore.ieee.org\/document\/10584551","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/3595916.3626459"},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2024.3425283"},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3194350"},{"key":"e_1_3_2_67_2","first-page":"1083","volume-title":"47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)","author":"Yang Yang","year":"2024","unstructured":"Yang Yang, Haoyu Shi, and Huaiwen Zhang. 2024. Hierarchical semantics alignment for 3D human motion retrieval. In 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM, New York, NY, 1083\u20131092."},{"key":"e_1_3_2_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00095"},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","unstructured":"Jianrong Zhang Yangsong Zhang Xiaodong Cun Shaoli Huang Yong Zhang Hongwei Zhao Hongtao Lu and Xi Shen. 2023. T2M-GPT: Generating human motion from textual descriptions with discrete representations. arXiv:2301.06052. DOI: 10.48550\/ARXIV.2301.06052","DOI":"10.48550\/ARXIV.2301.06052"},{"key":"e_1_3_2_70_2","doi-asserted-by":"publisher","unstructured":"Mingyuan Zhang Zhongang Cai Liang Pan Fangzhou Hong Xinying Guo Lei Yang and Ziwei Liu. 2022. MotionDiffuse: Text-driven human motion generation with diffusion model. arXiv:2208.15001. DOI: 10.48550\/ARXIV.2208.15001","DOI":"10.48550\/ARXIV.2208.15001"},{"key":"e_1_3_2_71_2","first-page":"5579","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhang Pengchuan","year":"2021","unstructured":"Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 5579\u20135588."},{"key":"e_1_3_2_72_2","doi-asserted-by":"publisher","unstructured":"Yaqi Zhang Di Huang Bin Liu Shixiang Tang Yan Lu Lu Chen Lei Bai Qi Chu Nenghai Yu and Wanli Ouyang. 2023. MotionGPT: Finetuned LLMs are general-purpose motion generators. arXiv:2306.10900. Retrieved from 10.48550\/arXiv.2306.10900","DOI":"10.48550\/arXiv.2306.10900"},{"key":"e_1_3_2_73_2","unstructured":"Yuhao Zhang Hang Jiang Yasuhide Miura Christopher D. Manning and Curtis P. Langlotz. 2020. Contrastive learning of medical visual representations from paired images and text. arXiv:2010.00747. Retrieved from https:\/\/arxiv.org\/abs\/2010.00747"},{"key":"e_1_3_2_74_2","first-page":"5745","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhou Yi","year":"2019","unstructured":"Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. 2019. On the continuity of rotation representations in neural networks. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 5745\u20135753."},{"key":"e_1_3_2_75_2","article-title":"LanguageBind: Extending video-language pretraining to N-modality by language-based semantic alignment","author":"Zhu Bin","year":"2024","unstructured":"Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. 2024. LanguageBind: Extending video-language pretraining to N-modality by language-based semantic alignment. In International Conference on Learning Representations (ICLR).","journal-title":"International Conference on Learning Representations (ICLR)"},{"issue":"30","key":"e_1_3_2_76_2","first-page":"6","article-title":"Recall, precision and average precision","volume":"2","author":"Zhu Mu","year":"2004","unstructured":"Mu Zhu. 2004. Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo 2, 30 (2004), 6.","journal-title":"Department of Statistics and Actuarial Science, University of Waterloo, Waterloo"},{"issue":"4","key":"e_1_3_2_77_2","first-page":"1","article-title":"Temporal refinement graph convolutional network for skeleton-based action recognition","volume":"5","author":"Zhuang Tianming","year":"2024","unstructured":"Tianming Zhuang, Zhen Qin, Yi Ding, Fuhu Deng, LeDuo Chen, Zhiguang Qin, and Kim-Kwang Raymond Choo. 2024. Temporal refinement graph convolutional network for skeleton-based action recognition. IEEE Transactions on Artificial Intelligence 5, 4 (2024), 1\u201314. Retrieved from https:\/\/ieeexplore.ieee.org\/document\/10310028","journal-title":"IEEE Transactions on Artificial Intelligence"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3744565","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,14]],"date-time":"2025-10-14T21:23:31Z","timestamp":1760477011000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3744565"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,14]]},"references-count":76,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2025,10,31]]}},"alternative-id":["10.1145\/3744565"],"URL":"https:\/\/doi.org\/10.1145\/3744565","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,14]]},"assertion":[{"value":"2024-06-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-05","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-14","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}