{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,21]],"date-time":"2026-02-21T00:09:38Z","timestamp":1771632578651,"version":"3.50.1"},"reference-count":54,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,4,19]],"date-time":"2025-04-19T00:00:00Z","timestamp":1745020800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100002347","name":"Federal Ministry of Education and Research","doi-asserted-by":"publisher","award":["01IS23047B"],"award-info":[{"award-number":["01IS23047B"]}],"id":[{"id":"10.13039\/501100002347","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Robotics"],"abstract":"<jats:p>Human\u2013Robot Interaction (HRI) depends on robust perception systems that enable intuitive and seamless interaction between humans and robots. This work introduces a multi-view perception framework designed for HRI, incorporating object detection and tracking, human body and hand pose estimation, unified hand\u2013object pose estimation, and action recognition. We use the state-of-the-art object detection architecture to understand the scene for object detection and segmentation, ensuring high accuracy and real-time performance. In interaction environments, 3D whole-body pose estimation is necessary, and we integrate an existing work with high inference speed. We propose a novel architecture for 3D unified hand\u2013object pose estimation and tracking, capturing real-time spatial relationships between hands and objects. Furthermore, we incorporate action recognition by leveraging whole-body pose, unified hand\u2013object pose estimation, and object tracking to determine the handover interaction state. The proposed architecture is evaluated on large-scale, open-source datasets, demonstrating competitive accuracy and faster inference times, making it well-suited for real-time HRI applications.<\/jats:p>","DOI":"10.3390\/robotics14040053","type":"journal-article","created":{"date-parts":[[2025,4,20]],"date-time":"2025-04-20T20:31:36Z","timestamp":1745181096000},"page":"53","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Action Recognition via Multi-View Perception Feature Tracking for Human\u2013Robot Interaction"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7339-8425","authenticated-orcid":false,"given":"Chaitanya","family":"Bandi","sequence":"first","affiliation":[{"name":"Robotics and Human Machine Interaction Lab, Technical University of Chemnitz, Reichenhainer Stra\u00dfe 70, 09126 Chemnitz, Germany"}]},{"given":"Ulrike","family":"Thomas","sequence":"additional","affiliation":[{"name":"Robotics and Human Machine Interaction Lab, Technical University of Chemnitz, Reichenhainer Stra\u00dfe 70, 09126 Chemnitz, Germany"}]}],"member":"1968","published-online":{"date-parts":[[2025,4,19]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1561\/1100000005","article-title":"Human-robot interaction: A survey","volume":"1","author":"Goodrich","year":"2007","journal-title":"Found. Trends -Hum.-Comput. Interact."},{"key":"ref_2","first-page":"927","article-title":"Predicting human intentions in collaborative tasks with robot partners","volume":"23","author":"Croft","year":"2007","journal-title":"IEEE Trans. Robot."},{"key":"ref_3","unstructured":"Nikolaidis, S., and Shah, J. (2015, January 2\u20135). Intention recognition for human-robot interaction. Proceedings of the 2015 ACM\/IEEE International Conference on Human-Robot Interaction, Portland, OR, USA."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"126827","DOI":"10.1016\/j.neucom.2023.126827","article-title":"Human\u2013robot collaborative interaction with human perception and action recognition","volume":"563","author":"Yu","year":"2024","journal-title":"Neurocomputing"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 38.","DOI":"10.1109\/TPAMI.2015.2437384"},{"key":"ref_7","unstructured":"Jocher, G., Chaurasia, A., and Qiu, J. (2023). Ultralytics YOLOv8, Ultralytics."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Cao, Z., Simon, T., Wei, S.E., and Sheikh, Y. (2017, January 21\u201326). Realtime multi-person 2D pose estimation using part affinity fields. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.143"},{"key":"ref_9","unstructured":"Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.L., Yong, M., and Lee, J. (2019). Mediapipe: A framework for building perception pipelines. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"7157","DOI":"10.1109\/TPAMI.2022.3222784","article-title":"AlphaPose: Whole-Body Regional Multi-Person Pose Estimation and Tracking in Real-Time","volume":"45","author":"Fang","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Yang, Z., Zeng, A., Yuan, C., and Li, Y. (2023, January 2\u20133). Effective Whole-body Pose Estimation with Two-stages Distillation. Proceedings of the 2023 IEEE\/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France.","DOI":"10.1109\/ICCVW60793.2023.00455"},{"key":"ref_12","unstructured":"Jiang, T., Lu, P., Zhang, L., Ma, N., Han, R., Lyu, C., Li, Y., and Chen, K. (2023). RTMPose: Real-Time Multi-Person Pose Estimation Based on MMPose. arXiv."},{"key":"ref_13","unstructured":"Contributors, M. (2024, October 01). OpenMMLab Pose Estimation Toolbox and Benchmark. Available online: https:\/\/github.com\/open-mmlab\/mmpose."},{"key":"ref_14","first-page":"2678","article-title":"Multimodal human pose estimation with RGB and thermal images","volume":"31","author":"Zhang","year":"2022","journal-title":"IEEE Trans. Image Process."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Lin, J., Zeng, A., Wang, H., Zhang, L., and Li, Y. (2023, January 17\u201324). One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.02027"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black, M.J., Laptev, I., and Schmid, C. (2020, January 13\u201319). Learning joint reconstruction of hands and manipulated objects. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR.2019.01208"},{"key":"ref_17","unstructured":"Chen, C.F., Fan, Q., and Panda, R. (2021). CrossFormer: A versatile vision transformer hinging on cross-scale attention. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Bandi, C., and Thomas, U. (2021, January 15\u201318). Skeleton-based Action Recognition for Human-Robot Interaction using Self-Attention Mechanism. Proceedings of the 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), Jodhpur, India.","DOI":"10.1109\/FG52635.2021.9666948"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Toshev, A., and Szegedy, C. (2014, January 23\u201328). DeepPose: Human pose estimation via deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.214"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Newell, A., Yang, K., and Deng, J. (2016, January 11\u201314). Stacked hourglass networks for human pose estimation. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46484-8_29"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Li, Y., Wang, S., Wu, Y., Lu, Y., Gao, J., and Li, H. (2021, January 11\u201317). TokenPose: Learning keypoint tokens for human pose estimation. Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.01112"},{"key":"ref_23","first-page":"7281","article-title":"HRFormer: High-Resolution Transformer for Dense Prediction","volume":"34","author":"Yuan","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Choutas, V., Pavlakos, G., Bolkart, T., Tzionas, D., and Black, M.J. (2020, January 23\u201328). Monocular Expressive Body Regression through Body-Driven Attention. Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK.","DOI":"10.1007\/978-3-030-58607-2_2"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Rong, Y., Shiratori, T., and Joo, H. (2021, January 11\u201317). FrankMocap: A Monocular 3D Whole-Body Pose Estimation System via Regression and Integration. Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada.","DOI":"10.1109\/ICCVW54120.2021.00201"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"248","DOI":"10.1145\/2816795.2818013","article-title":"SMPL: A Skinned Multi-Person Linear Model","volume":"34","author":"Loper","year":"2015","journal-title":"ACM Trans. Graph."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Feng, Y., Choutas, V., Bolkart, T., Tzionas, D., and Black, M.J. (2021, January 1\u20133). Collaborative Regression of Expressive Bodies using Moderation. Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK.","DOI":"10.1109\/3DV53792.2021.00088"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Moon, G., Choi, H., and Lee, K.M. (2022, January 19\u201320). Accurate 3D Hand Pose Estimation for Whole-Body 3D Human Mesh Estimation. Proceedings of the Computer Vision and Pattern Recognition Workshop (CVPRW), New Orleans, LA, USA.","DOI":"10.1109\/CVPRW56347.2022.00257"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1007\/s11263-008-0152-6","article-title":"EPnP: An Accurate O(n) Solution to the PnP Problem","volume":"81","author":"Lepetit","year":"2009","journal-title":"Int. J. Comput. Vision"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Xiang, Y., Schmidt, T., Narayanan, V., and Fox, D. (2018, January 26\u201330). PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. Proceedings of the Robotics: Science and Systems (RSS), Pittsburgh, PA, USA.","DOI":"10.15607\/RSS.2018.XIV.019"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Peng, S., Liu, Y., Huang, Q., Zhou, X., and Bao, H. (2019, January 15\u201320). PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00469"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Hu, Y., Wang, G., Yang, J., Xia, J., and Xu, K. (2019, January 15\u201320). Segmentation-Driven 6D Object Pose Estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00350"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Labb\u00e9, Y., Carpentier, J., Aubry, M., Sivic, J., and Laptev, I. (2020, January 23\u201328). CosyPose: Consistent Multi-View Multi-Object 6D Pose Estimation. Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK.","DOI":"10.1007\/978-3-030-58520-4_34"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Doosti, B., Naha, S., Mirbagheri, M., and Crandall, D. (2020, January 13\u201319). HOPE-Net: A Graph-Based Model for Hand-Object Pose Estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00664"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Liu, S., Jiang, H., Xu, J., Liu, S., and Wang, X. (2021, January 20\u201325). Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01445"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Tse, T.H.E., Kim, K.I., Leonardis, A., and Chang, H.J. (2022, January 18\u201324). Collaborative Learning for Hand and Object Reconstruction with Attention-guided Graph Convolution. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00171"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Wang, R., Mao, W., and Li, H. (2023, January 2\u20137). Interacting Hand-Object Pose Estimation via Dense Mutual Attention. Proceedings of the 2023 IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA.","DOI":"10.1109\/WACV56688.2023.00569"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Chen, Z., Hasson, Y., Schmid, C., and Laptev, I. (2022, January 23\u201327). AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction. Proceedings of the ECCV, 17th European Conference, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-19769-7_14"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"183","DOI":"10.5220\/0012370700003660","article-title":"Hand Mesh and Object Pose Reconstruction Using Cross Model Autoencoder","volume":"Volume 4: VISAPP. INSTICC","author":"Bandi","year":"2024","journal-title":"Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"797","DOI":"10.5220\/0013373800003912","article-title":"Multi-Modal Multi-View Perception Feature Tracking for Handover Human Robot Interaction Applications","volume":"Volume 3: VISAPP. INSTICC","author":"Bandi","year":"2025","journal-title":"Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3450626.3459875","article-title":"Embodied hands: Modeling and capturing hands and bodies together","volume":"40","author":"Romero","year":"2021","journal-title":"ACM Trans. Graph."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Wojke, N., Bewley, A., and Paulus, D. (2017, January 17\u201320). Simple online and realtime tracker with a deep association metric. Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China.","DOI":"10.1109\/ICIP.2017.8296962"},{"key":"ref_43","unstructured":"Jocher, G. (2020). Ultralytics YOLOv5, Ultralytics."},{"key":"ref_44","unstructured":"Tan, M., and Le, Q.V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv."},{"key":"ref_45","unstructured":"Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. (2020). Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv."},{"key":"ref_46","unstructured":"Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, January 4\u20139). Attention is All you Need. Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Yu, B., Yin, H., and Zhu, Z. (2018, January 13\u201319). Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm Sweden. IJCAI\u201918.","DOI":"10.24963\/ijcai.2018\/505"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Hampali, S., Rad, M., Oberweger, M., and Lepetit, V. (2020, January 13\u201319). HOnnotate: A method for 3D Annotation of Hand and Object Poses. Proceedings of the CVPR, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00326"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Chao, Y.W., Yang, W., Xiang, Y., Molchanov, P., Handa, A., Tremblay, J., Narang, Y.S., Van Wyk, K., Iqbal, U., and Birchfield, S. (2021, January 20\u201325). DexYCB: A Benchmark for Capturing Hand Grasping of Objects. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00893"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"\u00c7alli, B., Walsman, A., Singh, A., Srinivasa, S.S., Abbeel, P., and Dollar, A.M. (2015). Benchmarking in Manipulation Research: The YCB Object and Model Set and Benchmarking Protocols. arXiv.","DOI":"10.1109\/MRA.2015.2448951"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Hasson, Y., Tekin, B., Bogo, F., Laptev, I., Pollefeys, M., and Schmid, C. (2020, January 13\u201319). Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00065"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Lin, Z., Ding, C., Yao, H., Kuang, Z., and Huang, S. (2023, January 17\u201324). Harmonious Feature Learning for Interactive Hand-Object Pose Estimation. Proceedings of the 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01248"},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"54339","DOI":"10.1109\/ACCESS.2024.3388870","article-title":"Multi-Modal Hand-Object Pose Estimation With Adaptive Fusion and Interaction Learning","volume":"12","author":"Hoang","year":"2024","journal-title":"IEEE Access"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Chen, Z., Chen, S., Schmid, C., and Laptev, I. (2023, January 17\u201324). gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01239"}],"container-title":["Robotics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2218-6581\/14\/4\/53\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:18:04Z","timestamp":1760030284000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2218-6581\/14\/4\/53"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,19]]},"references-count":54,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,4]]}},"alternative-id":["robotics14040053"],"URL":"https:\/\/doi.org\/10.3390\/robotics14040053","relation":{},"ISSN":["2218-6581"],"issn-type":[{"value":"2218-6581","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,4,19]]}}}