{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T15:34:54Z","timestamp":1772724894571,"version":"3.50.1"},"reference-count":40,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,12,4]],"date-time":"2025-12-04T00:00:00Z","timestamp":1764806400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Regional Innovation System & Education (RISE) program through the (Gwangju RISE Center), the Ministry of Education (MOE) and the (Gwangju Metropolitan City), Republic of Korea","award":["2025-RISE-05-013"],"award-info":[{"award-number":["2025-RISE-05-013"]}]},{"name":"Korean government","award":["NRF-2023R1A2C1005950"],"award-info":[{"award-number":["NRF-2023R1A2C1005950"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Virtual Worlds"],
"abstract":"<jats:p>Egocentric hand gesture recognition is vital for natural human\u2013computer interaction in augmented and virtual reality (AR\/VR) systems. However, most deep learning models struggle to balance accuracy and efficiency, limiting real-time use on wearable devices. This paper introduces a Two-Stream Temporal Shift Module Transformer Fusion Network (TSMTFN) that achieves high recognition accuracy with low computational cost. The model integrates Temporal Shift Modules (TSMs) for efficient motion modeling and a Transformer-based fusion mechanism for long-range temporal understanding, operating on dual RGB-D streams to capture complementary visual and depth cues. Training stability and generalization are enhanced through full-layer training from epoch 1 and MixUp\/CutMix augmentations. Evaluated on the EgoGesture dataset, TSMTFN attained 96.18% top-1 accuracy and 99.61% top-5 accuracy on the independent test set with only 16 GFLOPs and 21.3M parameters, offering a 2.4\u20134.7\u00d7 reduction in computation compared to recent state-of-the-art methods. The model runs at 15.10 samples\/s, achieving real-time performance. The results demonstrate robust recognition across over 95% of gesture classes and minimal inter-class confusion, establishing TSMTFN as an efficient, accurate, and deployable solution for next-generation wearable AR\/VR gesture interfaces.<\/jats:p>",
"DOI":"10.3390\/virtualworlds4040058","type":"journal-article","created":{"date-parts":[[2025,12,5]],"date-time":"2025-12-05T16:37:02Z","timestamp":1764952622000},"page":"58","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["TSMTFN: Two-Stream Temporal Shift Module Network for Efficient Egocentric Gesture Recognition in Virtual Reality"],"prefix":"10.3390","volume":"4","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-0863-3346","authenticated-orcid":false,"given":"Muhammad Abrar","family":"Hussain","sequence":"first","affiliation":[{"name":"Department of Computer Engineering, Chosun University, Dong-gu, Gwangju 61452, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3361-8360","authenticated-orcid":false,"given":"Chanjun","family":"Chun","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, Chosun University, Dong-gu, Gwangju 61452, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2664-3632","authenticated-orcid":false,"given":"SeongKi","family":"Kim","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, Chosun University, Dong-gu, Gwangju 61452, Republic of Korea"}]}],"member":"1968","published-online":{"date-parts":[[2025,12,4]]},
"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"e218","DOI":"10.7717\/peerj-cs.218","article-title":"A Systematic Review on Hand Gesture Recognition Techniques, Challenges and Applications","volume":"5","author":"Yasen","year":"2019","journal-title":"PeerJ Comput. Sci."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"15836","DOI":"10.48084\/etasr.7670","article-title":"Dynamic Adaptation in Deep Learning for Enhanced Hand Gesture Recognition","volume":"14","author":"Hashi","year":"2024","journal-title":"Eng. Technol. Appl. Sci. Res."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Oudah, M., Al-Naji, A., and Chahl, J. (2020). Hand Gesture Recognition Based on Computer Vision: A Review of Techniques. J. Imaging, 6.","DOI":"10.3390\/jimaging6080073"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"125929","DOI":"10.1016\/j.eswa.2024.125929","article-title":"A Comparative Study of Advanced Technologies and Methods in Hand Gesture Analysis and Recognition Systems","volume":"266","author":"Rahman","year":"2025","journal-title":"Expert. Syst. Appl."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"135373","DOI":"10.1109\/ACCESS.2025.3593428","article-title":"Survey on Hand Gesture Recognition from Visual Input","volume":"13","author":"Linardakis","year":"2025","journal-title":"IEEE Access"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"104435","DOI":"10.1016\/j.cviu.2025.104435","article-title":"Continuous Hand Gesture Recognition: Benchmarks and Methods","volume":"259","author":"Emporio","year":"2025","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Ho, H.D., Nguyen, H.Q., Nguyen, T.B., Vu, S.T., and Le, T.L. (2022, January 7\u201310). Dynamic Hand Gesture Recognition from Egocentric Videos Based on SlowFast Architecture. Proceedings of the 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand.","DOI":"10.23919\/APSIPAASC55919.2022.9980250"},
{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1038","DOI":"10.1109\/TMM.2018.2808769","article-title":"EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition","volume":"20","author":"Zhang","year":"2018","journal-title":"IEEE Trans. Multimed."},{"key":"ref_9","unstructured":"(2025, November 12). EgoEvGesture: Gesture Recognition Based on Egocentric Event Camera. Available online: https:\/\/www.researchgate.net\/publication\/389917022_EgoEvGesture_Gesture_Recognition_Based_on_Egocentric_Event_Camera."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Cao, C., Zhang, Y., Wu, Y., Lu, H., and Cheng, J. (2017, January 22\u201329). Egocentric Gesture Recognition Using Recurrent 3D Convolutional Neural Networks with Spatiotemporal Transformer Modules. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.406"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"K\u00f6p\u00fckl\u00fc, O., Gunduz, A., Kose, N., and Rigoll, G. (2019, January 14\u201318). Real-Time Hand Gesture Detection and Classification Using Convolutional Neural Networks. Proceedings of the 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019), Lille, France.","DOI":"10.1109\/FG.2019.8756576"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Abavisani, M., Joze, H.R.V., and Patel, V.M. (2019, January 15\u201320). Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition with Multimodal Training. Proceedings of the 2019 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00126"},
{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Tran, D., Wang, H., Torresani, L., Ray, J., Lecun, Y., and Paluri, M. (2018, January 18\u201323). A Closer Look at Spatiotemporal Convolutions for Action Recognition. Proceedings of the 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00675"},{"key":"ref_14","first-page":"2760","article-title":"TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Devices","volume":"44","author":"Lin","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Fan, H., Malik, J., and He, K. (November, January 27). SlowFast Networks for Video Recognition. Proceedings of the 2019 IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00630"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Montazerin, M., Rahimian, E., Naderkhani, F., Atashzar, S.F., Yanushkevich, S., and Mohammadi, A. (2023). Transformer-Based Hand Gesture Recognition from Instantaneous to Fused Neural Decomposition of High-Density EMG Signals. Sci. Rep., 13.","DOI":"10.1038\/s41598-023-36490-w"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Dai, R., Das, S., Kahatapitiya, K., Ryoo, M.S., and Bremond, F. (2022, January 18\u201324). MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection. Proceedings of the 2022 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01941"},
{"key":"ref_18","doi-asserted-by":"crossref","first-page":"80","DOI":"10.1109\/LSP.2023.3241857","article-title":"Multiscaled Multi-Head Attention-Based Video Transformer Network for Hand Gesture Recognition","volume":"30","author":"Garg","year":"2023","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_19","unstructured":"Sinha, A., Raj, M.S., Wang, P., Helmy, A., and Das, S. (2025). MS-Temba: Multi-Scale Temporal Mamba for Efficient Temporal Action Detection. arXiv."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Chalasani, T., and Smolic, A. (2019, January 27\u201328). Simultaneous Segmentation and Recognition: Towards More Accurate Ego Gesture Recognition. Proceedings of the 2019 International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea.","DOI":"10.1109\/ICCVW.2019.00537"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"161","DOI":"10.15701\/kcgs.2025.31.3.161","article-title":"A Comparative Study of Deep Learning Models for Static Hand Gesture Recognition: DenseNet vs. ViT with HaGRID","volume":"31","author":"Abrar","year":"2025","journal-title":"J. Korea Comput. Graph. Soc."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Wunsch, L., Tenorio, C.G., Anding, K., Golomoz, A., and Notni, G. (2024). Data Fusion of RGB and Depth Data with Image Enhancement. J. Imaging, 10.","DOI":"10.3390\/jimaging10030073"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3769084","article-title":"Semi-Supervised RGB-D Hand Gesture Recognition via Mutual Learning of Self-Supervised Models","volume":"21","author":"Zhang","year":"2025","journal-title":"ACM Trans. Multimed. Comput. Commun. Appl."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Ding, I.J., and Zheng, N.W. (2022). CNN Deep Learning with Wavelet Image Fusion of CCD RGB-IR and Depth-Grayscale Sensor Data for Hand Gesture Intention Recognition. Sensors, 22.","DOI":"10.3390\/s22030803"},
{"key":"ref_25","first-page":"1","article-title":"TWO-HANDED DYNAMIC GESTURE RECOGNITION USING RGB-D SENSORS","volume":"12","author":"Pu","year":"2025","journal-title":"Int. J. Eng. Technol. Manag. Res."},{"key":"ref_26","unstructured":"Zhang, H., Cisse, M., Dauphin, Y.N., and Lopez-Paz, D. (May, January 30). Mixup: Beyond Empirical Risk Minimization. Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Yun, S., Han, D., Chun, S., Oh, S.J., Choe, J., and Yoo, Y. (November, January 27). CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. Proceedings of the 2019 IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00612"},{"key":"ref_28","unstructured":"Loshchilov, I., and Hutter, F. (2019, January 6\u20139). Decoupled Weight Decay Regularization. Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 21\u201326). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Miao, Q., Li, Y., Ouyang, W., Ma, Z., Xu, X., Shi, W., and Cao, X. (2017, January 22\u201329). Multimodal Gesture Recognition Based on the ResC3D Network. Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy.","DOI":"10.1109\/ICCVW.2017.360"},
{"key":"ref_31","doi-asserted-by":"crossref","first-page":"114499","DOI":"10.1016\/j.eswa.2020.114499","article-title":"Selective Spatiotemporal Features Learning for Dynamic Gesture Recognition","volume":"169","author":"Tang","year":"2021","journal-title":"Expert. Syst. Appl."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.Y., and Kweon, I.S. (2018). CBAM: Convolutional Block Attention Module, Springer. Lecture Notes in Computer Science.","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"444","DOI":"10.1109\/TBIOM.2025.3532416","article-title":"Joint Coarse to Fine-Grained Spatio-Temporal Modeling for Video Action Recognition","volume":"7","author":"Li","year":"2025","journal-title":"IEEE Trans. Biom. Behav. Identity Sci."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"102957","DOI":"10.1016\/j.rcim.2025.102957","article-title":"MuViH: Multi-View Hand Gesture Dataset and Recognition Pipeline for Human\u2013Robot Interaction in a Collaborative Robotic Finishing Platform","volume":"94","author":"Hubert","year":"2025","journal-title":"Robot. Comput. Integr. Manuf."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"84922","DOI":"10.1109\/ACCESS.2023.3289389","article-title":"Smart Healthcare Hand Gesture Recognition Using CNN-Based Detector and Deep Belief Network","volume":"11","author":"Alonazi","year":"2023","journal-title":"IEEE Access"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Alabdullah, B.I., Ansar, H., Al Mudawi, N., Alazeb, A., Alshahrani, A., Alotaibi, S.S., and Jalal, A. (2023). Smart Home Automation-Based Hand Gesture Recognition Using Feature Fusion and Recurrent Neural Network. Sensors, 23.","DOI":"10.3390\/s23177523"},
{"key":"ref_37","unstructured":"Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"8281","DOI":"10.1007\/s00521-024-09509-0","article-title":"MXception and Dynamic Image for Hand Gesture Recognition","volume":"36","author":"Karsh","year":"2024","journal-title":"Neural Comput. Appl."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Hu, Y. (2025). Attention-Based Spatio-Temporal Modeling with 3D Convolutional Neural Networks for Dynamic Gesture Recognition, Springer. Lecture Notes in Computer Science.","DOI":"10.1007\/978-981-97-8511-7_33"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"1958","DOI":"10.1109\/TPAMI.2024.3511621","article-title":"Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection","volume":"47","author":"Tang","year":"2025","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."}],"container-title":["Virtual Worlds"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2813-2084\/4\/4\/58\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,5]],"date-time":"2025-12-05T17:16:30Z","timestamp":1764954990000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2813-2084\/4\/4\/58"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,4]]},"references-count":40,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["virtualworlds4040058"],"URL":"https:\/\/doi.org\/10.3390\/virtualworlds4040058","relation":{},"ISSN":["2813-2084"],"issn-type":[{"value":"2813-2084","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,4]]}}}