{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,5]],"date-time":"2025-12-05T12:22:26Z","timestamp":1764937346413,"version":"build-2065373602"},"reference-count":35,"publisher":"MDPI AG","issue":"23","license":[{"start":{"date-parts":[[2021,12,3]],"date-time":"2021-12-03T00:00:00Z","timestamp":1638489600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61971393, 61871361"],"award-info":[{"award-number":["61971393, 61871361"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>The analysis of hand\u2013object poses from RGB images is important for understanding and imitating human behavior and acts as a key factor in various applications. In this paper, we propose a novel coarse-to-fine two-stage framework for hand\u2013object pose estimation, which explicitly models hand\u2013object relations in 3D pose refinement rather than in the process of converting 2D poses to 3D poses. Specifically, in the coarse stage, 2D heatmaps of hand and object keypoints are obtained from RGB image and subsequently fed into pose regressor to derive coarse 3D poses. As for the fine stage, an interaction-aware graph convolutional network called InterGCN is introduced to perform pose refinement by fully leveraging the hand\u2013object relations in 3D context. One major challenge in 3D pose refinement lies in the fact that relations between hand and object change dynamically according to different HOI scenarios. In response to this issue, we leverage both general and interaction-specific relation graphs to significantly enhance the capacity of the network to cover variations of HOI scenarios for successful 3D pose refinement. Extensive experiments demonstrate state-of-the-art performance of our approach on benchmark hand\u2013object datasets.<\/jats:p>","DOI":"10.3390\/s21238092","type":"journal-article","created":{"date-parts":[[2021,12,6]],"date-time":"2021-12-06T03:10:38Z","timestamp":1638760238000},"page":"8092","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Coarse-to-Fine Hand\u2013Object Pose Estimation with Interaction-Aware Graph Convolutional Network"],"prefix":"10.3390","volume":"21","author":[{"given":"Maomao","family":"Zhang","sequence":"first","affiliation":[{"name":"School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ao","family":"Li","sequence":"additional","affiliation":[{"name":"School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Honglei","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Minghui","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,12,3]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2629500","article-title":"Real-time continuous pose recovery of human hands using convolutional networks","volume":"33","author":"Tompson","year":"2014","journal-title":"ACM Trans. Graph."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Oberweger, M., Wohlhart, P., and Lepetit, V. (2015, January 7\u201313). Training a Feedback Loop for Hand Pose Estimation. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.379"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Zimmermann, C., and Brox, T. (2017, January 22\u201329). Learning to Estimate 3D Hand Pose from Single RGB Images. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.525"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Iqbal, U., Molchanov, P., Gall, T.B.J., and Kautz, J. (2018, January 8\u201314). Hand Pose Estimation via Latent 2.5D Heatmap Regression. Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01252-6_8"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Wan, C., Probst, T., Van Gool, L., and Yao, A. (2018, January 18\u201322). Dense 3D Regression for Hand Pose Estimation. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00540"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Ge, L., Ren, Z., Li, Y., Xue, Z., Wang, Y., Cai, J., and Yuan, J. (2019, January 15\u201320). 3D Hand Shape and Pose Estimation from a Single RGB Image. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01109"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Huang, W., Ren, P., Wang, J., Qi, Q., and Sun, H. (2020, January 7\u201312). AWR: Adaptive Weighting Regression for 3D Hand Pose Estimation. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA.","DOI":"10.1609\/aaai.v34i07.6761"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Kehl, W., Manhardt, F., Tombari, F., Ilic, S., and Navab, N. (2017, January 22\u201329). SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.169"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Tekin, B., Sinha, S.N., and Fua, P. (2018, January 18\u201322). Real-Time Seamless Single Shot 6D Object Pose Prediction. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00038"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Peng, S., Liu, Y., Huang, Q., Zhou, X., and Bao, H. (2019, January 15\u201320). PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00469"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Kao, Y., Li, W., Wang, Q., Lin, Z., Kim, W., and Hong, S. (2020, January 7\u201312). Synthetic Depth Transfer for Monocular 3D Object Pose Estimation in the Wild. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA.","DOI":"10.1609\/aaai.v34i07.6781"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Huang, L., Tan, J., Meng, J., Liu, J., and Yuan, J. (2020, January 12\u201316). HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation. Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA.","DOI":"10.1145\/3394171.3413775"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Hamer, H., Schindler, K., Koller-Meier, E., and Van Gool, L. (October, January 29). Tracking a Hand Manipulating an Object. Proceedings of the 2009 IEEE International Conference on Computer Vision (ICCV), Kyoto, Japan.","DOI":"10.1109\/ICCV.2009.5459282"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Choi, C., Ho Yoon, S., Chen, C.-N., and Ramani, K. (2017, January 22\u201329). Robust Hand Pose Estimation during the Interaction with an Unknown Object. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.339"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Mueller, F., Bernard, F., Sotnychenko, O., Mehta, D., Sridhar, S., Casas, D., and Theobalt, C. (2018, January 18\u201322). GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00013"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black, M.J., Laptev, I., and Schmid, C. (2019, January 15\u201320). Learning Joint Reconstruction of Hands and Manipulated Objects. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01208"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1898","DOI":"10.1109\/TPAMI.2019.2907951","article-title":"Generalized feedback loop for joint hand-object pose estimation","volume":"42","author":"Oberweger","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Tekin, B., Bogo, F., and Pollefeys, M. (2019, January 15\u201320). H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00464"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Doosti, B., Naha, S., Mirbagheri, M., and Crandall, D.J. (2020, January 16\u201318). HOPE-Net: A Graph-based Model for Hand-Object Pose Estimation. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00664"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Cai, Y., Ge, L., Liu, J., Cai, J., Cham, T.-J., Yuan, J., and Thalmann, N.M. (November, January 27). Exploiting Spatial-temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00236"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Sridhar, S., Mueller, F., Zollh\u00f6fer, M., Casas, D., Oulasvirta, A., and Theobalt, C. (2016, January 8\u201316). Real-Time Joint Tracking of a Hand Manipulating an Object from RGB-D Input. Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46475-6_19"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Tsoli, A., and Argyros, A.A. (2018, January 8\u201314). Joint 3D Tracking of a Deformable Object in Interaction with a Hand. Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01264-9_30"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Baek, S., Kim, K.I., and Kim, T.-K. (2020, January 16\u201318). Weakly-supervised Domain Adaptation via GAN and Mesh Model for Estimating 3D Hand Poses Interacting Objects. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00616"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3130800.3130883","article-title":"Embodied hands: Modeling and capturing hands and bodies together","volume":"36","author":"Romero","year":"2017","journal-title":"ACM Trans. Graph."},{"key":"ref_25","unstructured":"Kipf, T.N., and Welling, M. (2017, January 24\u201326). Semi-supervised Classification with Graph Convolutional Networks. Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Li, R., Wang, S., Zhu, F., and Huang, J. (2018, January 2\u20137). Adaptive Graph Convolutional Neural Networks. Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11691"},{"key":"ref_27","unstructured":"Niepert, M., Ahmed, M., and Kutzkov, K. (2016, January 19\u201324). Learning Convolutional Neural Networks for Graphs. Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Zhao, L., Peng, X., Tian, Y., Kapadia, M., and Metaxas, D.N. (2019, January 15\u201320). Semantic Graph Convolutional Networks for 3D Human Pose Regression. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00354"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Shi, L., Zhang, Y., Cheng, J., and Lu, H. (2019, January 15\u201320). Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01230"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Wei, S.-E., Ramakrishna, V., Kanade, T., and Sheikh, Y. (2016, January 27\u201330). Convolutional Pose Machines. Proceedings of the 2016 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.511"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Zhou, X., Huang, Q., Sun, X., Xue, X., and Wei, Y. (2017, January 22\u201329). Towards 3D Human Pose Estimation in the Wild: A Weakly-supervised Approach. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.51"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Garcia-Hernando, G., Yuan, S., Baek, S., and Kim, T.-K. (2018, January 18\u201322). First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00050"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Hampali, S., Rad, M., Oberweger, M., and Lepetit, V. (2020, January 16\u201318). HOnnotate: A method for 3D Annotation of Hand and Object Poses. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00326"},{"key":"ref_34","unstructured":"Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., and Su, H. (2015). Shapenet: An information-rich 3d model repository. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the 2016 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/23\/8092\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:39:19Z","timestamp":1760168359000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/23\/8092"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,12,3]]},"references-count":35,"journal-issue":{"issue":"23","published-online":{"date-parts":[[2021,12]]}},"alternative-id":["s21238092"],"URL":"https:\/\/doi.org\/10.3390\/s21238092","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2021,12,3]]}}}