{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,8]],"date-time":"2026-05-08T16:31:01Z","timestamp":1778257861914,"version":"3.51.4"},"reference-count":61,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T00:00:00Z","timestamp":1714694400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Neurorobot."],"abstract":"<jats:p>In Human-Robot Interaction (HRI), accurate 3D hand pose and mesh estimation hold critical importance. However, inferring reasonable and accurate poses in severe self-occlusion and high self-similarity remains an inherent challenge. In order to alleviate the ambiguity caused by invisible and similar joints during HRI, we propose a new Topology-aware Transformer network named HandGCNFormer with depth image as input, incorporating prior knowledge of hand kinematic topology into the network while modeling long-range contextual information. Specifically, we propose a novel Graphformer decoder with an additional Node-offset Graph Convolutional layer (NoffGConv). The Graphformer decoder optimizes the synergy between the Transformer and GCN, capturing long-range dependencies and local topological connections between joints. On top of that, we replace the standard MLP prediction head with a novel Topology-aware head to better exploit local topological constraints for more reasonable and accurate poses. Our method achieves state-of-the-art 3D hand pose estimation performance on four challenging datasets, including Hands2017, NYU, ICVL, and MSRA. To further demonstrate the effectiveness and scalability of our proposed Graphformer Decoder and Topology aware head, we extend our framework to HandGCNFormer-Mesh for the 3D hand mesh estimation task. 
The extended framework efficiently integrates a shape regressor with the original Graphformer Decoder and Topology-aware head, producing MANO parameters. The results on the HO-3D dataset, which contains various and challenging occlusions, show that our HandGCNFormer-Mesh achieves competitive results compared to previous state-of-the-art 3D hand mesh estimation methods.<\/jats:p>","DOI":"10.3389\/fnbot.2024.1395652","type":"journal-article","created":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T13:21:02Z","timestamp":1714742462000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["3D hand pose and mesh estimation via a generic Topology-aware Transformer model"],"prefix":"10.3389","volume":"18","author":[{"given":"Shaoqi","family":"Yu","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yintong","family":"Wang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lili","family":"Chen","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaolin","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jiamao","family":"Li","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1965","published-online":{"date-parts":[[2024,5,3]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2109.02860","article-title":"GCST: graph convolutional skeleton transformer for action recognition","author":"Bai","year":"2021","journal-title":"arXiv preprint arXiv:2109.02860"},{"key":"B2","first-page":"3419","article-title":"\u201cActive learning for bayesian 3d hand pose estimation,\u201d","author":"Caramalau","year":"2021","journal-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer 
Vision"},{"key":"B3","first-page":"213","article-title":"\u201cEnd-to-end object detection with transformers,\u201d","author":"Carion","year":"2020"},{"key":"B4","doi-asserted-by":"publisher","first-page":"138","DOI":"10.1016\/j.neucom.2018.06.097","article-title":"Pose guided structured region ensemble network for cascaded hand pose estimation","volume":"395","author":"Chen","year":"2020","journal-title":"Neurocomputing"},{"key":"B5","first-page":"769","article-title":"\u201cPose2Mesh: graph convolutional network for 3d human pose and mesh recovery from a 2d human pose,\u201d","author":"Choi","year":"2020"},{"key":"B6","doi-asserted-by":"publisher","first-page":"9375","DOI":"10.48550\/arXiv.1606.09375","article-title":"Convolutional neural networks on graphs with fast localized spectral filtering","volume":"29","author":"Defferrard","year":"2016","journal-title":"Adv. Neural Inform. Process. Syst"},{"key":"B7","first-page":"6608","article-title":"\u201cHope-net: a graph-based model for hand-object pose estimation,\u201d","author":"Doosti","year":"2020","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B8","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2010.11929","article-title":"An image is worth 16x16 words: transformers for image recognition at scale","author":"Dosovitskiy","year":"2020","journal-title":"arXiv preprint arXiv:2010.11929"},{"key":"B9","first-page":"9896","article-title":"\u201cCrossInfoNet: multi-task information sharing based hand pose estimation,\u201d","author":"Du","year":"2019","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B10","first-page":"120","article-title":"\u201cJGR-P2O: joint graph reasoning based pixel-to-offset prediction network for 3d hand pose estimation from a single depth image,\u201d","author":"Fang","year":"2020"},{"key":"B11","first-page":"409","article-title":"\u201cFirst-person hand action 
benchmark with RGB-D videos and 3d hand pose annotations,\u201d","author":"Garcia-Hernando","year":"2018","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"B12","first-page":"8417","article-title":"\u201cHand PointNet: 3d hand pose estimation using point sets,\u201d","author":"Ge","year":""},{"key":"B13","first-page":"475","article-title":"\u201cPoint-to-point regression pointnet for 3d hand pose estimation,\u201d","author":"Ge","year":"","journal-title":"Proceedings of the European Conference on Computer Vision (ECCV)"},{"key":"B14","first-page":"249","article-title":"\u201cUnderstanding the difficulty of training deep feedforward neural networks,\u201d","author":"Glorot","year":"2010","journal-title":"Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics; JMLR Workshop and Conference Proceedings"},{"key":"B15","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1707.07248","article-title":"Towards good practices for deep 3d hand pose estimation","author":"Guo","year":"","journal-title":"arXiv preprint arXiv:1707.07248"},{"key":"B16","first-page":"4512","article-title":"\u201cRegion ensemble network: improving convolutional network for hand pose estimation,\u201d","author":"Guo","year":""},{"key":"B17","first-page":"3196","article-title":"\u201cHonnotate: a method for 3d annotation of hand and object poses,\u201d","author":"Hampali","year":"2020","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B18","first-page":"11090","article-title":"\u201cKeypoint transformer: solving joint identification in challenging hands and object interactions for accurate 3d pose estimation,\u201d","author":"Hampali","year":"2022","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B19","first-page":"571","article-title":"\u201cLeveraging photometric consistency over time 
for sparsely supervised hand-object reconstruction,\u201d","author":"Hasson","year":"2020","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B20","first-page":"11807","article-title":"\u201cLearning joint reconstruction of hands and manipulated objects,\u201d","author":"Hasson","year":"2019","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B21","first-page":"770","article-title":"\u201cDeep residual learning for image recognition,\u201d","author":"He","year":"2016","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"B22","first-page":"17","article-title":"\u201cHand-transformer: non-autoregressive structured modeling for 3d hand pose estimation,\u201d","author":"Huang","year":""},{"key":"B23","first-page":"3136","article-title":"\u201cHot-net: non-autoregressive transformer for 3d hand-object pose estimation,\u201d","author":"Huang","year":"","journal-title":"Proceedings of the 28th ACM International Conference on Multimedia"},{"key":"B24","first-page":"11061","article-title":"\u201cAWR: Adaptive weighting regression for 3d hand pose estimation,\u201d","author":"Huang","year":"","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
34"},{"key":"B25","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2009.12473","article-title":"SIA-GCN: a spatial information aware graph neural network with 2d convolutions for hand pose estimation","author":"Kong","year":"2020","journal-title":"arXiv preprint arXiv:2009.12473"},{"key":"B26","first-page":"1944","article-title":"\u201cPose recognition with cascade transformers,\u201d","author":"Li","year":"","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B27","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2109.05488","article-title":"ArtiBoost: boosting articulated 3d hand-object pose estimation via online exploration and synthesis","author":"Li","year":"","journal-title":"arXiv preprint arXiv:2109.05488"},{"key":"B28","first-page":"13147","article-title":"\u201cMHFormer: multi-hypothesis transformer for 3d human pose estimation,\u201d","author":"Li","year":"2022","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B29","first-page":"3706","article-title":"\u201cCategory-level articulated object pose estimation,\u201d","author":"Li","year":"2020","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B30","first-page":"1954","article-title":"\u201cEnd-to-end human pose and mesh reconstruction with transformers,\u201d","author":"Lin","year":"2021","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B31","first-page":"10012","article-title":"\u201cSwin transformer: hierarchical vision transformer using shifted windows,\u201d","author":"Liu","year":"2021","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},{"key":"B32","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1711.05101","article-title":"Decoupled weight decay 
regularization","author":"Loshchilov","year":"2017","journal-title":"arXiv preprint arXiv:1711.05101"},{"key":"B33","doi-asserted-by":"publisher","first-page":"50","DOI":"10.1049\/cvi2.12064","article-title":"End-to-end global to local convolutional neural network learning for hand pose recovery in depth data","volume":"16","author":"Madadi","year":"2022","journal-title":"IET Comput. Vis"},{"key":"B34","first-page":"7113","author":"Malik","year":"2020","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B35","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2103.15320","article-title":"TFpose: direct human pose estimation with transformers","author":"Mao","year":"2021","journal-title":"arXiv preprint arXiv:2103.15320"},{"key":"B36","first-page":"2276","article-title":"\u201cPose transformers (potr): human motion prediction with non-autoregressive transformers,\u201d","author":"Mart\u00ednez-Gonz\u00e1lez","year":"2021","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},{"key":"B37","first-page":"5079","article-title":"\u201cV2V-PoseNet: voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map,\u201d","author":"Moon","year":"2018","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"B38","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1706.04758","article-title":"Holistic planimetric prediction to local volumetric prediction for 3d human pose estimation","author":"Moon","year":"2017","journal-title":"arXiv preprint arXiv:1706.04758"},{"key":"B39","first-page":"752","article-title":"\u201cI2L-MeshNet: image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single RGB image,\u201d","author":"Moon","year":"2020"},{"key":"B40","first-page":"585","article-title":"\u201cDeepPrior++: improving fast and accurate 3d hand pose 
estimation,\u201d","author":"Oberweger","year":"2017","journal-title":"Proceedings of the IEEE International Conference on Computer Vision Workshops"},{"key":"B41","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1502.06807","article-title":"Hands deep in deep learning for hand pose estimation","author":"Oberweger","year":"2015","journal-title":"arXiv preprint arXiv:1502.06807"},{"key":"B42","first-page":"488","article-title":"\u201cPeeking into occluded joints: a novel framework for crowd pose estimation,\u201d","author":"Qiu","year":"2020"},{"key":"B43","doi-asserted-by":"publisher","first-page":"315","DOI":"10.1109\/TCYB.2021.3083637","article-title":"Pose-guided hierarchical graph reasoning for 3-d hand pose estimation from a single depth image","volume":"53","author":"Ren","year":"","journal-title":"IEEE Trans. Cybernet"},{"key":"B44","doi-asserted-by":"publisher","first-page":"42","DOI":"10.1016\/j.neucom.2021.01.045","article-title":"Spatial-aware stacked regression network for real-time 3d hand pose estimation","volume":"437","author":"Ren","year":"","journal-title":"Neurocomputing"},{"key":"B45","unstructured":"SRN: Stacked regression network for real-time 3d hand pose estimation\n            RenP.\n            SunH.\n            QiQ.\n            WangJ.\n            HuangW.\n          BMVC2019"},{"key":"B46","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2201.02610","article-title":"Embodied hands: modeling and capturing hands and bodies together","author":"Romero","year":"2022","journal-title":"arXiv preprint arXiv:2201.02610"},{"key":"B47","first-page":"824","article-title":"\u201cCascaded hand pose regression,\u201d","author":"Sun","year":"2015","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"B48","first-page":"529","article-title":"\u201cIntegral human pose regression,\u201d","author":"Sun","year":"2018","journal-title":"Proceedings of the European Conference on Computer Vision 
(ECCV)"},{"key":"B49","first-page":"3786","article-title":"\u201cLatent regression forest: structured estimation of 3d articulated hand posture,\u201d","author":"Tang","year":"2014","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"B50","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2629500","article-title":"Real-time continuous pose recovery of human hands using convolutional networks","volume":"33","author":"Tompson","year":"2014","journal-title":"ACM Trans. Graph"},{"key":"B51","first-page":"31","article-title":"\u201cPose-based sign language recognition using GCN and BERT,\u201d","author":"Tunga","year":"2021","journal-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision"},{"key":"B52","doi-asserted-by":"publisher","first-page":"3762","DOI":"10.48550\/arXiv.1706.03762","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inform. Process. 
Syst"},{"key":"B53","first-page":"5147","article-title":"\u201cDense 3d regression for hand pose estimation,\u201d","author":"Wan","year":"2018","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"B54","first-page":"5675","article-title":"\u201cHandGCNFormer: a novel topology-aware transformer network for 3d hand pose estimation,\u201d","author":"Wang","year":"2023","journal-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision"},{"key":"B55","article-title":"\u201cSemi-supervised classification with graph convolutional networks,\u201d","author":"Welling","year":"2016"},{"key":"B56","first-page":"793","article-title":"\u201cA2J: Anchor-to-joint regression network for 3d articulated pose estimation from a single depth image,\u201d","author":"Xiong","year":"2019","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},{"key":"B57","doi-asserted-by":"crossref","DOI":"10.1609\/aaai.v32i1.12328","article-title":"\u201cSpatial temporal graph convolutional networks for skeleton-based action recognition,\u201d","author":"Yan","year":"2018"},{"key":"B58","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2012.14214","article-title":"Transpose: towards explainable human pose estimation by transformer","author":"Yang","year":"2020","journal-title":"arXiv preprint arXiv:2012.14214"},{"key":"B59","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1707.02237","article-title":"The 2017 hands in the million challenge on 3d hand pose estimation","author":"Yuan","year":"2017","journal-title":"arXiv preprint arXiv:1707.02237"},{"key":"B60","first-page":"3425","article-title":"\u201cSemantic graph convolutional networks for 3d human pose regression,\u201d","author":"Zhao","year":"2019","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern 
Recognition"},{"key":"B61","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2010.04159","article-title":"Deformable DETR: deformable transformers for end-to-end object detection","author":"Zhu","year":"2020","journal-title":"arXiv preprint arXiv:2010.04159"}],"container-title":["Frontiers in Neurorobotics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2024.1395652\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T13:21:27Z","timestamp":1714742487000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2024.1395652\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,3]]},"references-count":61,"alternative-id":["10.3389\/fnbot.2024.1395652"],"URL":"https:\/\/doi.org\/10.3389\/fnbot.2024.1395652","relation":{},"ISSN":["1662-5218"],"issn-type":[{"value":"1662-5218","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,3]]},"article-number":"1395652"}}