{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T11:55:47Z","timestamp":1772279747524,"version":"3.50.1"},"reference-count":59,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T00:00:00Z","timestamp":1772236800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T00:00:00Z","timestamp":1772236800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"publisher","award":["BK20230833"],"award-info":[{"award-number":["BK20230833"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62302093"],"award-info":[{"award-number":["62302093"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Vis. Intell."],"published-print":{"date-parts":[[2026,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structures present across different instances of the same action. This leads to semantic misalignments and physically implausible motions. 
To address this limitation, we propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, coupled with a masked motion transformer for text-conditioned motion generation. Additionally, a kinematic constraint block mitigates discretization artifacts to ensure physical plausibility. Experiments on HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance, highlighting the importance of temporal consistency in robust and coherent T2M generation.<\/jats:p>","DOI":"10.1007\/s44267-026-00110-8","type":"journal-article","created":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T03:59:45Z","timestamp":1772251185000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Temporal consistency-aware text-to-motion generation"],"prefix":"10.1007","volume":"4","author":[{"given":"Hongsong","family":"Wang","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenjing","family":"Yan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6872-5540","authenticated-orcid":false,"given":"Qiuxia","family":"Lai","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xin","family":"Geng","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2026,2,28]]},"reference":[{"key":"110_CR1","first-page":"5152","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"C. Guo","year":"2022","unstructured":"Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., & Cheng, L. (2022). Generating diverse and natural 3D human motions from text. 
In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 5152\u20135161). Piscataway: IEEE."},{"key":"110_CR2","first-page":"1541","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"M. Kappel","year":"2021","unstructured":"Kappel, M., Golyanik, V., Elgharib, M., Henningson, J.-O., Seidel, H.-P., Castillo, S., Theobalt, C., & Magnor, M. (2021). High-fidelity neural human motion transfer from monocular video. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 1541\u20131550). Piscataway: IEEE."},{"key":"110_CR3","first-page":"319","volume-title":"Proceedings of the international conference on intelligent virtual agents","author":"A. Antakli","year":"2018","unstructured":"Antakli, A., Hermann, E., Zinnikus, I., Du, H., & Fischer, K. (2018). Intelligent distributed human motion simulation in human-robot collaboration environments. In Proceedings of the international conference on intelligent virtual agents (pp. 319\u2013324). New York: ACM."},{"key":"110_CR4","first-page":"792","volume-title":"Proceedings of the 30th international conference on machine learning","author":"H. Koppula","year":"2013","unstructured":"Koppula, H., & Saxena, A. (2013). Learning spatio-temporal structure from RGB-D videos for human activity detection and anticipation. In Proceedings of the 30th international conference on machine learning (pp. 792\u2013800). Retrieved January 19, 2026, from http:\/\/proceedings.mlr.press\/v28\/koppula13.pdf."},{"issue":"1","key":"110_CR5","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1109\/TPAMI.2015.2430335","volume":"38","author":"H. S. Koppula","year":"2016","unstructured":"Koppula, H. S., & Saxena, A. (2016). Anticipating human activities using object affordances for reactive robotic response. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1), 14\u201329.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"110_CR6","doi-asserted-by":"publisher","first-page":"719","DOI":"10.1109\/3DV.2019.00084","volume-title":"Proceedings of the 2019 international conference on 3D vision","author":"C. Ahuja","year":"2019","unstructured":"Ahuja, C., & Morency, L.-P. (2019). Language2pose: natural language grounded pose forecasting. In Proceedings of the 2019 international conference on 3D vision (pp. 719\u2013728). Piscataway: IEEE."},{"key":"110_CR7","first-page":"1396","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"A. Ghosh","year":"2021","unstructured":"Ghosh, A., Cheema, N., Oguz, C., Theobalt, C., & Slusallek, P. (2021). Synthesis of compositional animations from textual descriptions. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 1396\u20131406). Piscataway: IEEE."},{"key":"110_CR8","first-page":"1900","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"C. Guo","year":"2024","unstructured":"Guo, C., Mu, Y., Javed, M. G., Wang, S., & Cheng, L. (2024). Momask: generative masked modeling of 3D human motions. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 1900\u20131910). Piscataway: IEEE."},{"key":"110_CR9","first-page":"1","volume-title":"Proceedings of the 32nd international conference on neural information processing systems","author":"A. S. Lin","year":"2018","unstructured":"Lin, A. S., Wu, L., Corona, R., Tai, K., Huang, Q., & Mooney, R. J. (2018). Generating animated videos of human activities from natural language descriptions. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Proceedings of the 32nd international conference on neural information processing systems (pp. 
1\u20137). Red Hook: Curran Associates."},{"key":"110_CR10","first-page":"1546","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"E. Pinyoanuntapong","year":"2024","unstructured":"Pinyoanuntapong, E., Wang, P., Lee, M., & Chen, C. (2024). MMM: generative masked motion model. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 1546\u20131555). Piscataway: IEEE."},{"key":"110_CR11","doi-asserted-by":"publisher","first-page":"414","DOI":"10.1109\/3DV57658.2022.00053","volume-title":"Proceedings of the 2022 international conference on 3D vision","author":"N. Athanasiou","year":"2022","unstructured":"Athanasiou, N., Petrovich, M., Black, M. J., & Varol, G. (2022). TEACH: temporal action composition for 3D humans. In Proceedings of the 2022 international conference on 3D vision (pp. 414\u2013423). Piscataway: IEEE."},{"key":"110_CR12","first-page":"23222","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"J. Lin","year":"2023","unstructured":"Lin, J., Chang, J., Liu, L., Li, G., Lin, L., Tian, Q., & Chen, C. (2023). Being comes from not-being: open-vocabulary text-to-motion generation with wordless training. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 23222\u201323231). Piscataway: IEEE."},{"key":"110_CR13","first-page":"480","volume-title":"Proceedings of the 17th European conference on computer vision","author":"M. Petrovich","year":"2022","unstructured":"Petrovich, M., Black, M. J., & Varol, G. (2022). Temos: generating diverse human motions from textual descriptions. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 480\u2013497). 
Cham: Springer."},{"key":"110_CR14","first-page":"580","volume-title":"Proceedings of the 17th European conference on computer vision","author":"C. Guo","year":"2022","unstructured":"Guo, C., Zuo, X., Wang, S., & Cheng, L. (2022). TM2T: stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 580\u2013597). Cham: Springer."},{"key":"110_CR15","first-page":"1","volume-title":"Proceedings of the IEEE international conference on acoustics, speech and signal processing","author":"S. R. Hosseyni","year":"2025","unstructured":"Hosseyni, S. R., Rahmani, A. A., Seyedmohammadi, S. J., Seyedin, S., & Mohammadi, A. (2025). BAD: bidirectional auto-regressive diffusion for text-to-motion generation. In Proceedings of the IEEE international conference on acoustics, speech and signal processing (pp. 1\u20135). Piscataway: IEEE."},{"key":"110_CR16","first-page":"1144","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"H. Liu","year":"2024","unstructured":"Liu, H., Zhu, Z., Becherini, G., Peng, Y., Su, M., Zhou, Y., Zhe, X., Iwamoto, N., Zheng, B., & Black, M. J. (2024). Emage: towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 1144\u20131154). Piscataway: IEEE."},{"key":"110_CR17","first-page":"1566","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. Liu","year":"2024","unstructured":"Liu, Y., Cao, Q., Wen, Y., Jiang, H., & Ding, C. (2024). Towards variable and coordinated holistic co-speech motion generation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 1566\u20131576). 
Piscataway: IEEE."},{"key":"110_CR18","first-page":"172","volume-title":"Proceedings of the 18th European conference on computer vision","author":"E. Pinyoanuntapong","year":"2024","unstructured":"Pinyoanuntapong, E., Saleem, M. U., Wang, P., Lee, M., Das, S., & Chen, C. (2024). BAMM: bidirectional autoregressive motion model. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 172\u2013190). Cham: Springer."},{"key":"110_CR19","first-page":"25615","volume-title":"Proceedings of the 39th AAAI conference on artificial intelligence","author":"H. Yang","year":"2025","unstructured":"Yang, H., Su, K., Zhang, Y., Chen, J., Qian, K., Liu, G., & Gan, C. (2025). UniMuMo: unified text, music and motion generation. In T. Walsh, J. Shah, & Z. Kolter (Eds.), Proceedings of the 39th AAAI conference on artificial intelligence (pp. 25615\u201325623). Palo Alto: AAAI Press."},{"key":"110_CR20","first-page":"469","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"H. Yi","year":"2023","unstructured":"Yi, H., Liang, H., Liu, Y., Cao, Q., Wen, Y., Bolkart, T., Tao, D., & Black, M. J. (2023). Generating holistic 3D human motion from speech. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 469\u2013480). Piscataway: IEEE."},{"key":"110_CR21","first-page":"14730","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"J. Zhang","year":"2023","unstructured":"Zhang, J., Zhang, Y., Cun, X., Zhang, Y., Zhao, H., Lu, H., Shen, X., & Shan, Y. (2023). Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 14730\u201314740). 
Piscataway: IEEE."},{"key":"110_CR22","first-page":"1524","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"R. Li","year":"2024","unstructured":"Li, R., Zhang, Y., Zhang, Y., Zhang, H., Guo, J., Zhang, Y., Liu, Y., & Li, X. (2024). Lodge: a coarse to fine diffusion network for long dance generation guided by the characteristic dance primitives. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 1524\u20131534). Piscataway: IEEE."},{"issue":"4","key":"110_CR23","doi-asserted-by":"publisher","first-page":"236","DOI":"10.1089\/big.2016.0028","volume":"4","author":"M. Plappert","year":"2016","unstructured":"Plappert, M., Mandery, C., & Asfour, T. (2016). The kit motion-language dataset. Big Data, 4(4), 236\u2013252.","journal-title":"Big Data"},{"key":"110_CR24","first-page":"1","volume-title":"Proceedings of the 39th AAAI conference on artificial intelligence","author":"W. Weng","year":"2025","unstructured":"Weng, W., Tan, X., Wang, J., Xie, G.-S., Zhou, P., & Wang, H. (2025). ReAlign: text-to-motion generation via step-aware reward-guided alignment. In T. Walsh, J. Shah, & Z. Kolter (Eds.), Proceedings of the 39th AAAI conference on artificial intelligence (pp. 1\u20139). Palo Alto: AAAI Press."},{"key":"110_CR25","first-page":"1","volume-title":"Proceedings of the 39th international conference on neural information processing systems","author":"X. Tan","year":"2025","unstructured":"Tan, X., Wang, H., Geng, X., & Zhou, P. (2025). Sopo: text-to-motion generation using semi-online preference optimization. In Proceedings of the 39th international conference on neural information processing systems (pp. 1\u201314). Red Hook: Curran Associates."},{"key":"110_CR26","first-page":"6306","volume-title":"Proceedings of the 31st international conference on neural information processing systems","author":"A. 
van den Oord","year":"2017","unstructured":"van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural discrete representation learning. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, & R. Garnett (Eds.), Proceedings of the 31st international conference on neural information processing systems (pp. 6306\u20136315). Red Hook: Curran Associates."},{"key":"110_CR27","first-page":"1191","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"T. Zhou","year":"2015","unstructured":"Zhou, T., Lee, Y. J., Yu, S. X., & Efros, A. A. (2015). Flowweb: joint image set alignment by weaving consistent, pixel-wise correspondences. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1191\u20131200). Piscataway: IEEE."},{"key":"110_CR28","first-page":"117","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"T. Zhou","year":"2016","unstructured":"Zhou, T., Krahenbuhl, P., Aubry, M., Huang, Q., & Efros, A. A. (2016). Learning dense correspondence via 3D-guided cycle consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 117\u2013126). Piscataway: IEEE."},{"key":"110_CR29","first-page":"4032","volume-title":"Proceedings of the IEEE international conference on computer vision","author":"X. Zhou","year":"2015","unstructured":"Zhou, X., Zhu, M., & Daniilidis, K. (2015). Multi-image matching via fast alternating minimization. In Proceedings of the IEEE international conference on computer vision (pp. 4032\u20134040). Piscataway: IEEE."},{"key":"110_CR30","first-page":"849","volume-title":"Proceedings of the IEEE international conference on computer vision","author":"F. Wang","year":"2013","unstructured":"Wang, F., Huang, Q., & Guibas, L. J. (2013). Image co-segmentation via consistent functional maps. 
In Proceedings of the IEEE international conference on computer vision (pp. 849\u2013856). Piscataway: IEEE."},{"key":"110_CR31","first-page":"3142","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"F. Wang","year":"2014","unstructured":"Wang, F., Huang, Q., Ovsjanikov, M., & Guibas, L. J. (2014). Unsupervised multi-class joint image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3142\u20133149). Piscataway: IEEE."},{"key":"110_CR32","first-page":"1152","volume-title":"Proceedings of the 32nd international conference on neural information processing systems","author":"T.-C. Wang","year":"2018","unstructured":"Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Liu, G., Tao, A., Kautz, J., & Catanzaro, B. (2018). Video-to-video synthesis. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Proceedings of the 32nd international conference on neural information processing systems (pp. 1152\u20131164). Red Hook: Curran Associates."},{"key":"110_CR33","first-page":"1801","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"D. Dwibedi","year":"2019","unstructured":"Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., & Zisserman, A. (2019). Temporal cycle-consistency learning. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 1801\u20131810). Piscataway: IEEE."},{"key":"110_CR34","first-page":"2223","volume-title":"Proceedings of the IEEE international conference on computer vision","author":"J.-Y. Zhu","year":"2017","unstructured":"Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (pp. 2223\u20132232). 
Piscataway: IEEE."},{"key":"110_CR35","first-page":"6116","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"Y. Li","year":"2018","unstructured":"Li, Y., Duan, N., Zhou, B., Chu, X., Ouyang, W., Wang, X., & Zhou, M. (2018). Visual question generation as dual task of visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6116\u20136124). Piscataway: IEEE."},{"key":"110_CR36","first-page":"6649","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"M. Shah","year":"2019","unstructured":"Shah, M., Chen, X., Rohrbach, M., & Parikh, D. (2019). Cycle-consistency for robust visual question answering. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 6649\u20136658). Piscataway: IEEE."},{"key":"110_CR37","unstructured":"Tang, D., Duan, N., Qin, T., & Zhou, M. (2017). Question answering and question generation as dual tasks. arXiv preprint. arXiv:1706.02027."},{"key":"110_CR38","first-page":"15471","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"H. Wang","year":"2022","unstructured":"Wang, H., Liang, W., Shen, J., Van Gool, L., & Wang, W. (2022). Counterfactual cycle-consistent learning for instruction following and generation in vision-language navigation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 15471\u201315481). Piscataway: IEEE."},{"issue":"4","key":"110_CR39","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3386569.3392457","volume":"39","author":"M. Chu","year":"2020","unstructured":"Chu, M., Xie, Y., Mayer, J., Leal-Taix\u00e9, L., & Thuerey, N. (2020). Learning temporal coherence via self-supervision for GAN-based video generation. 
ACM Transactions on Graphics, 39(4), 1\u201375.","journal-title":"ACM Transactions on Graphics"},{"key":"110_CR40","first-page":"14042","volume-title":"Proceedings of the 35th international conference on neural information processing systems","author":"G. Le Moing","year":"2021","unstructured":"Le Moing, G., Ponce, J., & Schmid, C. (2021). CCVS: context-aware controllable video synthesis. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, & J. W. Vaughan (Eds.), Proceedings of the 35th international conference on neural information processing systems (pp. 14042\u201314055). Red Hook: Curran Associates."},{"key":"110_CR41","first-page":"39062","volume-title":"Proceedings of the international conference on machine learning","author":"W. Yan","year":"2023","unstructured":"Yan, W., Hafner, D., James, S., & Abbeel, P. (2023). Temporally consistent transformers for video generation. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, & J. Scarlett (Eds.), Proceedings of the international conference on machine learning (pp. 39062\u201339098). Retrieved August 2, 2025, from https:\/\/proceedings.mlr.press\/v202\/yan23b.html."},{"key":"110_CR42","first-page":"1481","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Xu","year":"2024","unstructured":"Xu, Z., Zhang, J., Liew, J. H., Yan, H., Liu, J.-W., Zhang, C., Feng, J., & Shou, M. Z. (2024). Magicanimate: temporally consistent human image animation using diffusion model. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 1481\u20131490). Piscataway: IEEE."},{"key":"110_CR43","first-page":"8207","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"F. Liang","year":"2024","unstructured":"Liang, F., Wu, B., Wang, J., Yu, L., Li, K., Zhao, Y., Misra, I., Huang, J.-B., Zhang, P., Vajda, P., et al. (2024). 
Flowvid: taming imperfect optical flows for consistent video-to-video synthesis. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 8207\u20138216). Piscataway: IEEE."},{"key":"110_CR44","first-page":"62352","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"Y. Liu","year":"2023","unstructured":"Liu, Y., Li, L., Ren, S., Gao, R., Li, S., Chen, S., Sun, X., & Hou, L. (2023). FETV: a benchmark for fine-grained evaluation of open-domain text-to-video generation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 62352\u201362387). Red Hook: Curran Associates."},{"key":"110_CR45","first-page":"5442","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"N. Mahmood","year":"2019","unstructured":"Mahmood, N., Ghorbani, N., Troje, N.\u00a0F., Pons-Moll, G., & Black, M.\u00a0J. (2019). AMASS: archive of motion capture as surface shapes. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 5442\u20135451). Piscataway: IEEE."},{"key":"110_CR46","first-page":"2021","volume-title":"Proceedings of the ACM international conference on multimedia","author":"C. Guo","year":"2020","unstructured":"Guo, C., Zuo, X., Wang, S., Zou, S., Sun, Q., Deng, A., Gong, M., & Cheng, L. (2020). Action2Motion: conditioned generation of 3D human motions. In Proceedings of the ACM international conference on multimedia (pp. 2021\u20132029). New York: ACM."},{"key":"110_CR47","first-page":"1","volume-title":"Proceedings of the ACM international conference on multimedia in Asia","author":"S. Yan","year":"2023","unstructured":"Yan, S., Liu, Y., Wang, H., Du, X., Liu, M., & Liu, H. (2023). Cross-modal retrieval for motion and text via droptriple loss. 
In Proceedings of the ACM international conference on multimedia in Asia (pp. 1\u20137). New York: ACM."},{"key":"110_CR48","volume-title":"Proceedings of the 11th international conference on learning representations","author":"G. Tevet","year":"2023","unstructured":"Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., & Bermano, A. H. (2023). Human motion diffusion model. In Proceedings of the 11th international conference on learning representations. Retrieved February 2, 2025, from https:\/\/openreview.net\/forum?id=SJ1kSyO2jwu."},{"key":"110_CR49","first-page":"18000","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"X. Chen","year":"2023","unstructured":"Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., & Yu, G. (2023). Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 18000\u201318010). Piscataway: IEEE."},{"issue":"6","key":"110_CR50","doi-asserted-by":"publisher","first-page":"4115","DOI":"10.1109\/TPAMI.2024.3355414","volume":"46","author":"M. Zhang","year":"2024","unstructured":"Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., & Liu, Z. (2024). Motiondiffuse: text-driven human motion generation with diffusion model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6), 4115\u20134128.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"110_CR51","first-page":"390","volume-title":"Proceedings of the 18th European conference on computer vision","author":"W. Dai","year":"2024","unstructured":"Dai, W., Chen, L.-H., Wang, J., Liu, J., Dai, B., & Tang, Y. (2024). MotionLCM: real-time controllable motion generation via latent consistency model. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 390\u2013408). 
Cham: Springer."},{"key":"110_CR52","first-page":"265","volume-title":"Proceedings of the 18th European conference on computer vision","author":"Z. Zhang","year":"2024","unstructured":"Zhang, Z., Liu, A., Reid, I., Hartley, R., Zhuang, B., & Tang, H. (2024). Motion mamba: efficient and long sequence motion generation. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 265\u2013282). Cham: Springer."},{"key":"110_CR53","first-page":"180","volume-title":"Proceedings of the 18th European conference on computer vision","author":"Y. Huang","year":"2024","unstructured":"Huang, Y., Wan, W., Yang, Y., Callison-Burch, C., Yatskar, M., & Liu, L. (2024). CoMo: controllable motion generation through language guided pose code editing. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Proceedings of the 18th European conference on computer vision (pp. 180\u2013196). Cham: Springer."},{"key":"110_CR54","first-page":"14806","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"H. Kong","year":"2023","unstructured":"Kong, H., Gong, K., Lian, D., Mi, M. B., & Wang, X. (2023). Priority-centric human motion generation in discrete latent space. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 14806\u201314816). Piscataway: IEEE."},{"key":"110_CR55","first-page":"20067","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"B. Jiang","year":"2023","unstructured":"Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., & Chen, T. (2023). MotionGPT: human motion as a foreign language. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 20067\u201320079). 
Red Hook: Curran Associates."},{"key":"110_CR56","first-page":"7368","volume-title":"Proceedings of the 38th AAAI conference on artificial intelligence","author":"Y. Zhang","year":"2024","unstructured":"Zhang, Y., Huang, D., Liu, B., Tang, S., Lu, Y., Chen, L., Bai, L., Chu, Q., Yu, N., & Ouyang, W. (2024). MotionGPT: finetuned LLMs are general-purpose motion generators. In M. J. Wooldridge, J. G. Dy, & S. Natarajan (Eds.), Proceedings of the 38th AAAI conference on artificial intelligence (pp. 7368\u20137376). Palo Alto: AAAI Press."},{"key":"110_CR57","first-page":"9797","volume-title":"Proceedings of the 39th AAAI conference on artificial intelligence","author":"L.-A. Zeng","year":"2025","unstructured":"Zeng, L.-A., Huang, G., Wu, G., & Zheng, W.-S. (2025). Light-T2M: a lightweight and fast model for text-to-motion generation. In T. Walsh, J. Shah, & Z. Kolter (Eds.), Proceedings of the 39th AAAI conference on artificial intelligence (pp. 9797\u20139805). Palo Alto: AAAI Press."},{"key":"110_CR58","first-page":"27849","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"B. Wu","year":"2025","unstructured":"Wu, B., Xie, J., Shen, K., Kong, Z., Ren, J., Bai, R., Qu, R., & Shen, L. (2025). MG-MotionLLM: a unified framework for motion comprehension and generation across multiple granularities. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 27849\u201327858). Piscataway: IEEE."},{"key":"110_CR59","first-page":"13129","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"J. Shi","year":"2025","unstructured":"Shi, J., Liu, L., Sun, Y., Zhang, Z., Zhou, J., & Nie, Q. (2025). Genm3: generative pretrained multi-path motion model for text conditional human motion generation. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 13129\u201313139). 
Piscataway: IEEE."}],"container-title":["Visual Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-026-00110-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44267-026-00110-8","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-026-00110-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T09:21:27Z","timestamp":1772270487000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44267-026-00110-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,28]]},"references-count":59,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,12]]}},"alternative-id":["110"],"URL":"https:\/\/doi.org\/10.1007\/s44267-026-00110-8","relation":{},"ISSN":["2097-3330","2731-9008"],"issn-type":[{"value":"2097-3330","type":"print"},{"value":"2731-9008","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,28]]},"assertion":[{"value":"30 September 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 February 2026","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 February 2026","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 February 2026","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Xin Geng is an Associate Editor at Visual Intelligence 
and was not involved in the editorial review of this article or the decision to publish it. The authors declare that they have no other competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"7"}}