{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T18:11:47Z","timestamp":1772647907957,"version":"3.50.1"},"reference-count":59,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T00:00:00Z","timestamp":1772582400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Robotics"],"abstract":"<jats:p>Most existing research in human pose estimation focuses on predicting joint positions, paying limited attention to recovering the full 6D human pose, which comprises both 3D joint positions and bone orientations. Position-only methods treat joints as independent points, often resulting in structurally implausible poses and increased sensitivity to depth ambiguities\u2014cases where poses share nearly identical joint positions but differ significantly in limb orientations. Incorporating bone orientation information helps enforce geometric consistency, yielding more anatomically plausible skeletal structures. Additionally, many state-of-the-art methods rely on large, computationally expensive models, which limit their applicability in real-time scenarios, such as human\u2013robot collaboration. In this work, we propose STAG-Net, a novel 2D-to-6D lifting network that integrates Graph Convolutional Networks (GCNs), attention mechanisms, and Temporal Convolutional Networks (TCNs). By simultaneously learning joint positions and bone orientations, STAG-Net promotes geometrically consistent skeletal structures while remaining lightweight and computationally efficient. On the Human3.6M benchmark, STAG-Net achieves an MPJPE of 41.8 mm using 243 input frames. 
In addition, we introduce a lightweight single-frame variant, STG-Net, which achieves 50.8 mm MPJPE while operating in real time at 60 FPS using a single RGB camera. Extensive experiments on multiple large-scale datasets demonstrate the effectiveness and efficiency of the proposed approach.<\/jats:p>","DOI":"10.3390\/robotics15030054","type":"journal-article","created":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T15:01:07Z","timestamp":1772636467000},"page":"54","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["STAG-Net: A Lightweight Spatial\u2013Temporal Attention GCN for Real-Time 6D Human Pose Estimation in Human\u2013Robot Collaboration Scenarios"],"prefix":"10.3390","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-8176-9068","authenticated-orcid":false,"given":"Chunxin","family":"Yang","sequence":"first","affiliation":[{"name":"Interfaculty Initiative in Information Studies, Graduate School of Interdisciplinary Information Studies, The University of Tokyo, Tokyo 113-0033, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6221-2899","authenticated-orcid":false,"given":"Ruoyu","family":"Jia","sequence":"additional","affiliation":[{"name":"Graduate School of Engineering, The University of Tokyo, Tokyo 113-0033, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-9415-802X","authenticated-orcid":false,"given":"Qitong","family":"Guo","sequence":"additional","affiliation":[{"name":"Graduate School of Engineering, The University of Tokyo, Tokyo 113-0033, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-1565-238X","authenticated-orcid":false,"given":"Xiaohang","family":"Shi","sequence":"additional","affiliation":[{"name":"Graduate School of Engineering, The University of Tokyo, Tokyo 113-0033, 
Japan"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-4544-1955","authenticated-orcid":false,"given":"Masahiro","family":"Hirano","sequence":"additional","affiliation":[{"name":"Institute of Industrial Science, The University of Tokyo, Tokyo 153-8505, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2880-7055","authenticated-orcid":false,"given":"Yuji","family":"Yamakawa","sequence":"additional","affiliation":[{"name":"Graduate School of Interdisciplinary Information Studies, The University of Tokyo, Tokyo 113-0033, Japan"}]}],"member":"1968","published-online":{"date-parts":[[2026,3,4]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Pavllo, D., Feichtenhofer, C., Grangier, D., and Auli, M. (2019, January 15\u201320). 3d human pose estimation in video with temporal convolutions and semi-supervised training. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00794"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Martinez, J., Hossain, R., Romero, J., and Little, J.J. (2017, January 22\u201329). A simple yet effective baseline for 3d human pose estimation. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.288"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1282","DOI":"10.1109\/TMM.2022.3141231","article-title":"Exploiting temporal contexts with strided transformer for 3d human pose estimation","volume":"25","author":"Li","year":"2022","journal-title":"IEEE Trans. Multimed."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Li, W., Liu, H., Tang, H., Wang, P., and Van Gool, L. (2022, January 18\u201322). Mhformer: Multi-hypothesis transformer for 3d human pose estimation. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01280"},{"key":"ref_5","unstructured":"Kipf, T.N., and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Yu, B.X., Zhang, Z., Liu, Y., Zhong, S.h., Liu, Y., and Chen, C.W. (2023, January 2\u20136). Gla-gcn: Global-local adaptive graph convolutional network for 3d human pose estimation from monocular video. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00810"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"107000","DOI":"10.1016\/j.patcog.2019.107000","article-title":"Dynamic graph convolutional networks","volume":"97","author":"Manessi","year":"2020","journal-title":"Pattern Recognit."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Liu, K., Ding, R., Zou, Z., Wang, L., and Tang, W. (2020). A comprehensive study of weight sharing in graph networks for 3d human pose estimation. Proceedings of the Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK, 23\u201328 August 2020, Springer. Proceedings, Part X 16.","DOI":"10.1007\/978-3-030-58607-2_19"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Zou, Z., and Tang, W. (2021, January 11\u201317). Modulated graph convolutional network for 3D human pose estimation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.01128"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Zhao, L., Peng, X., Tian, Y., Kapadia, M., and Metaxas, D.N. (2019, January 15\u201320). Semantic graph convolutional networks for 3d human pose regression. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00354"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Pavlakos, G., Zhou, X., Derpanis, K.G., and Daniilidis, K. (2017, January 21\u201326). Coarse-to-fine volumetric prediction for single-image 3D human pose. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.139"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Sosa, J., and Hogg, D. (2023, January 17\u201324). Self-supervised 3d human pose estimation from a single image. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPRW59228.2023.00507"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Tome, D., Russell, C., and Agapito, L. (2017, January 21\u201326). Lifting from the deep: Convolutional 3d pose estimation from a single image. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.603"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Liu, R., Shen, J., Wang, H., Chen, C., Cheung, S.C., and Asari, V. (2020, January 13\u201319). Attention mechanism exploits temporal contexts: Real-time 3d human pose reconstruction. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00511"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"1781","DOI":"10.1109\/TPAMI.2022.3164344","article-title":"From human pose similarity metric to 3D human pose estimator: Temporal propagating LSTM networks","volume":"45","author":"Lee","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Kanazawa, A., Black, M.J., Jacobs, D.W., and Malik, J. (2018, January 18\u201323). 
End-to-end recovery of human shape and pose. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00744"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Zhang, H., Tian, Y., Zhou, X., Ouyang, W., Liu, Y., Wang, L., and Sun, Z. (2021, January 10\u201317). Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Nashville, TN, USA.","DOI":"10.1109\/ICCV48922.2021.01125"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"10145","DOI":"10.1109\/TPAMI.2021.3136136","article-title":"Orientation keypoints for 6D human pose estimation","volume":"44","author":"Fisch","year":"2021","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Banik, S., Avagyan, E., Auddy, S., Gracia, A.M., and Knoll, A. (2023). PoseGraphNet++: Enriching 3D human pose with orientation estimation. arXiv.","DOI":"10.2139\/ssrn.4821028"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Cai, Y., Ge, L., Liu, J., Cai, J., Cham, T.J., Yuan, J., and Thalmann, N.M. (2019, January 27\u201328). Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00236"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Yan, S., Xiong, Y., and Lin, D. (2018, January 2\u20137). Spatial temporal graph convolutional networks for skeleton-based action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.12328"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Song, X., Li, Z., Chen, S., and Demachi, K. (2024). 
Quater-gcn: Enhancing 3d human pose estimation with orientation and semi-supervised training. arXiv.","DOI":"10.3233\/FAIA240479"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Xu, T., and Takano, W. (2021, January 19\u201325). Graph stacked hourglass networks for 3d human pose estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01584"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"104174","DOI":"10.1016\/j.jvcir.2024.104174","article-title":"Multi-hop graph transformer network for 3D human pose estimation","volume":"101","author":"Islam","year":"2024","journal-title":"J. Vis. Commun. Image Represent."},{"key":"ref_25","unstructured":"Aouaidjia, K., Li, A., Zhang, W., and Zhang, C. (2025). 3D Human Pose Estimation via Spatial Graph Order Attention and Temporal Body Aware Transformer. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"7075","DOI":"10.1109\/TPAMI.2020.3029762","article-title":"Co-embedding of nodes and edges with graph neural networks","volume":"45","author":"Jiang","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Barnes, C., Lu, J., Yang, J., and Li, H. (2019, January 15\u201320). On the continuity of rotation representations in neural networks. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00589"},{"key":"ref_28","unstructured":"Holschneider, M., Kronland-Martinet, R., Morlet, J., and Tchamitchian, P. (1987, January 14\u201318). A real-time algorithm for signal analysis with the help of the wavelet transform. 
Proceedings of the Wavelets: Time-Frequency Methods and Phase Space Proceedings of the International Conference, Marseille, France."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"82-1","DOI":"10.1145\/3386569.3392410","article-title":"XNect: Real-time multi-person 3D motion capture with a single RGB camera","volume":"39","author":"Mehta","year":"2020","journal-title":"ACM Trans. Graph. (TOG)"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Tang, Z., Qiu, Z., Hao, Y., Hong, R., and Yao, T. (2023, January 17\u201324). 3d human pose estimation with spatio-temporal criss-cross attention. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00464"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1007\/s10851-009-0161-2","article-title":"Metrics for 3D rotations: Comparison and analysis","volume":"35","author":"Huynh","year":"2009","journal-title":"J. Math. Imaging Vis."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"1325","DOI":"10.1109\/TPAMI.2013.248","article-title":"Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments","volume":"36","author":"Ionescu","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., and Theobalt, C. (2017). Monocular 3d human pose estimation in the wild using improved cnn supervision. 
Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10\u201312 October 2017, IEEE.","DOI":"10.1109\/3DV.2017.00064"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1007\/s11263-009-0273-6","article-title":"Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion","volume":"87","author":"Sigal","year":"2010","journal-title":"Int. J. Comput. Vis."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Li, J., Xu, C., Chen, Z., Bian, S., Yang, L., and Lu, C. (2021, January 19\u201325). Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00339"},{"key":"ref_36","unstructured":"Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2026, March 01). Automatic Differentiation in Pytorch. Available online: https:\/\/pytorch.org."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., and Sun, J. (2018, January 18\u201323). Cascaded pyramid network for multi-person pose estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00742"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/s44443-025-00023-4","article-title":"MSTFormer: Multi-granularity spatial-temporal transformers for 3D human pose estimation","volume":"37","author":"Lin","year":"2025","journal-title":"J. King Saud Univ. Comput. Inf. Sci."},{"key":"ref_39","unstructured":"Hao, X., and Li, H. (2025, January 19\u201323). Perspose: 3d human pose estimation with perspective encoding and perspective rotation. 
Proceedings of the IEEE\/CVF International Conference on Computer Vision, Honolulu, HI, USA."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Fang, H.S., Xu, Y., Wang, W., Liu, X., and Zhu, S.C. (2018, January 2\u20137). Learning pose grammar to encode human body configuration for 3d pose estimation. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.12270"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Pavlakos, G., Zhou, X., and Daniilidis, K. (2018, January 18\u201323). Ordinal depth supervision for 3d human pose estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00763"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Lee, K., Lee, I., and Lee, S. (2018, January 8\u201314). Propagating lstm: 3d pose estimation based on joint interdependency. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_8"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Ci, H., Wang, C., Ma, X., and Wang, Y. (2019, January 27\u201328). Optimizing network structure for 3d human pose estimation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00235"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Zhao, W., Wang, W., and Tian, Y. (2022, January 18\u201324). Graformer: Graph-oriented transformer for 3d pose estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01979"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Zeng, A., Sun, X., Huang, F., Liu, M., Xu, Q., and Lin, S. (2020). Srnet: Improving generalization in 3d human pose estimation with a split-and-recombine approach. 
Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23\u201328 August 2020, Springer.","DOI":"10.1007\/978-3-030-58568-6_30"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"198","DOI":"10.1109\/TCSVT.2021.3057267","article-title":"Anatomy-aware 3d human pose estimation with bone-based pose decomposition","volume":"32","author":"Chen","year":"2021","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Zhao, Q., Zheng, C., Liu, M., Wang, P., and Chen, C. (2023, January 17\u201324). Poseformerv2: Exploring frequency domain for efficient and robust 3d human pose estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00857"},{"key":"ref_48","unstructured":"Luo, C., Chu, X., and Yuille, A. (2018). Orinet: A fully convolutional network for 3d human pose estimation. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Wandt, B., and Rosenhahn, B. (2019, January 15\u201320). Repnet: Weakly supervised training of an adversarial reprojection network for 3d human pose estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00797"},{"key":"ref_50","first-page":"16","article-title":"Metrabs: Metric-scale truncation-robust heatmaps for absolute 3d human pose estimation","volume":"3","author":"Linder","year":"2020","journal-title":"IEEE Trans. Biom. Behav. Identity Sci."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., and Ding, Z. (2021, January 11\u201317). 3d human pose estimation with spatial and temporal transformers. 
Proceedings of the IEEE\/CVF International Conference on Computer Vision, Virtual Event.","DOI":"10.1109\/ICCV48922.2021.01145"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Gong, K., Li, B., Zhang, J., Wang, T., Huang, J., Mi, M.B., Feng, J., and Wang, X. (2022, January 18\u201324). PoseTriplet: Co-evolving 3D human pose estimation, imitation, and hallucination under self-supervision. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01074"},{"key":"ref_53","unstructured":"Oreshkin, B.N. (2023). 3d human pose and shape estimation via hybrik-transformer. arXiv."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Shetty, K., Birkhold, A., Jaganathan, S., Strobel, N., Kowarschik, M., Maier, A., and Egger, B. (2023, January 17\u201324). Pliks: A pseudo-linear inverse kinematic solver for 3d human body estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00063"},{"key":"ref_55","unstructured":"Qian, X., Tang, Y., Zhang, N., Han, M., Xiao, J., Huang, M.C., and Lin, R.S. (2023). Hstformer: Hierarchical spatial-temporal transformers for 3d human pose estimation. arXiv."},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"7914","DOI":"10.1109\/TIP.2021.3109517","article-title":"Learning dynamical human-joint affinity for 3d pose estimation in videos","volume":"30","author":"Zhang","year":"2021","journal-title":"IEEE Trans. Image Process."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Hossain, M.R.I., and Little, J.J. (2018, January 8\u201314). Exploiting temporal information for 3d human pose estimation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01249-6_5"},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Mehraban, S., Adeli, V., and Taati, B. 
(2024, January 3\u20138). MotionAGFormer: Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA.","DOI":"10.1109\/WACV57701.2024.00677"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Zhu, W., Ma, X., Liu, Z., Liu, L., Wu, W., and Wang, Y. (2023, January 2\u20136). Motionbert: A unified perspective on learning human motion representations. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.01385"}],"container-title":["Robotics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2218-6581\/15\/3\/54\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T15:26:59Z","timestamp":1772638019000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2218-6581\/15\/3\/54"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,4]]},"references-count":59,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2026,3]]}},"alternative-id":["robotics15030054"],"URL":"https:\/\/doi.org\/10.3390\/robotics15030054","relation":{},"ISSN":["2218-6581"],"issn-type":[{"value":"2218-6581","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,4]]}}}