{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,2]],"date-time":"2026-06-02T09:22:03Z","timestamp":1780392123219,"version":"3.54.1"},"reference-count":51,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2023,1,5]],"date-time":"2023-01-05T00:00:00Z","timestamp":1672876800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Competitive Research Fund of The University of Aizu, Japan"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>The definition of human-computer interaction (HCI) has changed in the current year because people are interested in their various ergonomic devices ways. Many researchers have been working to develop a hand gesture recognition system with a kinetic sensor-based dataset, but their performance accuracy is not satisfactory. In our work, we proposed a multistage spatial attention-based neural network for hand gesture recognition to overcome the challenges. We included three stages in the proposed model where each stage is inherited the CNN; where we first apply a feature extractor and a spatial attention module by using self-attention from the original dataset and then multiply the feature vector with the attention map to highlight effective features of the dataset. Then, we explored features concatenated with the original dataset for obtaining modality feature embedding. In the same way, we generated a feature vector and attention map in the second stage with the feature extraction architecture and self-attention technique. After multiplying the attention map and features, we produced the final feature, which feeds into the third stage, a classification module to predict the label of the correspondent hand gesture. Our model achieved 99.67%, 99.75%, and 99.46% accuracy for the senz3D, Kinematic, and NTU datasets.<\/jats:p>","DOI":"10.3390\/computers12010013","type":"journal-article","created":{"date-parts":[[2023,1,5]],"date-time":"2023-01-05T02:00:57Z","timestamp":1672884057000},"page":"13","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":61,"title":["Multistage Spatial Attention-Based Neural Network for Hand Gesture Recognition"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1238-0464","authenticated-orcid":false,"given":"Abu Saleh Musa","family":"Miah","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Fukushima, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Md. Al Mehedi","family":"Hasan","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Fukushima, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7476-2468","authenticated-orcid":false,"given":"Jungpil","family":"Shin","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Fukushima, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yuichi","family":"Okuyama","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Fukushima, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yoichi","family":"Tomioka","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Fukushima, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2023,1,5]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1110","DOI":"10.1109\/TMM.2013.2246148","article-title":"Robust part-based hand gesture recognition using kinect sensor","volume":"15","author":"Ren","year":"2013","journal-title":"IEEE Trans. Multimed."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1145\/1897816.1897838","article-title":"Vision-based hand-gesture applications","volume":"54","author":"Wachs","year":"2011","journal-title":"Commun. ACM"},{"key":"ref_3","unstructured":"Jalal, A., and Rasheed, Y.A. (2007, January 26). Collaboration achievement along with performance maintenance in video streaming. Proceedings of the IEEE Conference on Interactive Computer Aided Learning, Villach, Austria."},{"key":"ref_4","unstructured":"Jalal, A., and Shahzad, A. (2007, January 26\u201328). Multiple facial feature detection using vertex-modeling structure. Proceedings of the ICL, Villach, Austria."},{"key":"ref_5","unstructured":"Jalal, A., Kim, S., and Yun, B. (2005, January 23\u201325). Assembled algorithm in the real-time H. 263 codec for advanced performance. Proceedings of the IEEE 7th International Workshop on Enterprise Networking and Computing in Healthcare Industry (HEALTHCOM 2005), Busan, Republic of Korea."},{"key":"ref_6","first-page":"27","article-title":"Advanced performance achievement using multi-algorithmic approach of video transcoder for low bit rate wireless communication","volume":"5","author":"Jalal","year":"2005","journal-title":"ICGST Int. J. Graph. Vis. Image Process."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Jalal, A., and Uddin, I. (2007, January 12\u201313). Security architecture for third generation (3G) using GMHS cellular network. Proceedings of the 2007 IEEE International Conference on Emerging Technologies, Rawalpindi, Pakistan.","DOI":"10.1109\/ICET.2007.4516319"},{"key":"ref_8","unstructured":"Jalal, A., and Zeb, M.A. (2008). Security enhancement for e-learning portal. IJCSNS Int. J. Comput. Sci. Netw. Secur., 8."},{"key":"ref_9","unstructured":"Jalal, A., and Kim, S. (2022, June 08). The mechanism of edge detection using the block matching criteria for the motion estimation. \ud55c\uad6d HCI \ud559\ud68c \ud559\uc220\ub300\ud68c, Available online: https:\/\/www.dbpia.co.kr\/Journal\/articleDetail?nodeId=NODE01886372."},{"key":"ref_10","unstructured":"Jalal, A., and Kim, S. (2006, January 27\u201328). Algorithmic implementation and efficiency maintenance of real-time environment using low-bitrate wireless communication. Proceedings of the Fourth IEEE Workshop on Software Technologies for Future Embedded and Ubiquitous Systems, and the Second International Workshop on Collaborative Computing, Integration, and Assurance (SEUS-WCCIA\u201906), Gyeongju, Republic of Korea."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"10496","DOI":"10.1109\/ACCESS.2017.2703783","article-title":"Non-touch character input system based on hand tapping gestures using Kinect sensor","volume":"5","author":"Shin","year":"2017","journal-title":"IEEE Access"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"278460","DOI":"10.1155\/2014\/278460","article-title":"Hand gesture and character recognition based on kinect sensor","volume":"10","author":"Murata","year":"2014","journal-title":"Int. J. Distrib. Sens. Netw."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Shin, J., Matsuoka, A., Hasan, M.A.M., and Srizon, A.Y. (2021). American sign language alphabet recognition by extracting feature from hand pose estimation. Sensors, 21.","DOI":"10.3390\/s21175856"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Marin, G., Dominio, F., and Zanuttigh, P. (2014, January 27\u201330). Hand gesture recognition with leap motion and kinect devices. Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France.","DOI":"10.1109\/ICIP.2014.7025313"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Moeslund, T.B., St\u00f6rring, M., and Granum, E. (2001). A natural interface to a virtual environment through computer vision-estimated pointing gestures. International Gesture Workshop, Springer.","DOI":"10.1007\/3-540-47873-6_6"},{"key":"ref_16","first-page":"578","article-title":"Roomware: Towards the next generation of human\u2013computer interaction based on an integrated design of real and virtual worlds","volume":"553","author":"Streitz","year":"2001","journal-title":"Hum.-Comput. Interact. New Millenn."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Dewaele, G., Devernay, F., and Horaud, R. (2004). Hand motion from 3d point trajectories and a smooth surface model. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-540-24670-1_38"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Miah, A.S.M., Shin, J., Hasan, M.A.M., Rahim, M.A., and Okuyama, Y. Rotation, Translation Furthermore, Scale Invariant Sign Word Recognition Using Deep Learning. Computer Systems Science and Engineering, Available online: https:\/\/doi.org\/10.32604\/csse.2023.029336.","DOI":"10.32604\/csse.2023.029336"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Miah, A.S.M., Shin, J., Hasan, M.A.M., and Rahim, M.A. (2022). BenSignNet: Bengali Sign Language Alphabet Recognition Using Concatenated Segmentation and Convolutional Neural Network. Appl. Sci., 12.","DOI":"10.3390\/app12083933"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"52","DOI":"10.1016\/j.cviu.2006.10.012","article-title":"Vision-based hand pose estimation: A review","volume":"108","author":"Erol","year":"2007","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_21","first-page":"405","article-title":"A review of vision-based hand gestures recognition","volume":"2","author":"Murthy","year":"2009","journal-title":"Int. J. Inf. Technol. Knowl. Manag."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., and Blake, A. (2011, January 20\u201325). Real-time human pose recognition in parts from single depth images. Proceedings of the CVPR 2011, Colorado Springs, CO, USA.","DOI":"10.1109\/CVPR.2011.5995316"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Mohla, S., Pande, S., Banerjee, B., and Chaudhuri, S. (2020, January 14\u201319). Fusatnet: Dual attention based spectrospatial multimodal fusion network for hyperspectral and lidar classification. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA.","DOI":"10.21203\/rs.3.rs-32802\/v1"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"14991","DOI":"10.1007\/s11042-015-2451-6","article-title":"Hand gesture recognition with jointly calibrated leap motion and depth sensor","volume":"75","author":"Marin","year":"2016","journal-title":"Multimed. Tools Appl."},{"key":"ref_25","unstructured":"Zhou, R. (2020). Shape Based Hand Gesture Recognition. [Ph.D. Thesis, Nanyang Technological University]."},{"key":"ref_26","unstructured":"Biasotti, S., Tarini, M., and Giachetti, A. (2022, December 01). Exploiting Silhouette Descriptors and Synthetic Data for Hand Gesture Recognition. Available online: https:\/\/diglib.eg.org\/bitstream\/handle\/10.2312\/stag20151288\/015-023.pdf."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"183","DOI":"10.1016\/j.vrih.2021.05.001","article-title":"Review of dynamic gesture recognition","volume":"3","author":"Yuanyuan","year":"2021","journal-title":"Virtual Real. Intell. Hardw."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"80","DOI":"10.1016\/j.patcog.2017.10.033","article-title":"Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition","volume":"76","author":"Nunez","year":"2018","journal-title":"Pattern Recognit."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"276","DOI":"10.1109\/5326.868448","article-title":"A fuzzy rule-based approach to spatio-temporal hand gesture recognition","volume":"30","author":"Su","year":"2000","journal-title":"IEEE Trans. Syst. Man, Cybern. Part C (Appl. Rev.)"},{"key":"ref_30","unstructured":"Jetley, S., Lord, N.A., Lee, N., and Torr, P.H. (2018). Learn to pay attention. arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"110","DOI":"10.1109\/TGRS.2019.2933609","article-title":"Learning to pay attention on spectral domain: A spectral attention module-based convolutional network for hyperspectral image classification","volume":"58","author":"Mou","year":"2019","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_32","unstructured":"Iwai, Y., Watanabe, K., Yagi, Y., and Yachida, M. (1996, January 14\u201317). Gesture recognition by using colored gloves. Proceedings of the 1996 IEEE International Conference on Systems, Man and Cybernetics. Information Intelligence and Systems (Cat. No. 96CH35929), Beijing, China."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"884","DOI":"10.1109\/34.790429","article-title":"Parametric hidden markov models for gesture recognition","volume":"21","author":"Wilson","year":"1999","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"961","DOI":"10.1109\/34.799904","article-title":"An HMM-based threshold model approach for gesture recognition","volume":"21","author":"Lee","year":"1999","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_35","unstructured":"Kwok, C., Fox, D., and Meila, M. (2002, January 9\u201314). Real-time particle filters. Proceedings of the Advances in Neural Information Processing Systems 15 (NIPS 2002), Vancouver, BC, Canada."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Doucet, A., De Freitas, N., and Gordon, N.J. (2001). Sequential Monte Carlo Methods in Practice, Springer.","DOI":"10.1007\/978-1-4757-3437-9"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Nagi, J., Ducatelle, F., Di Caro, G.A., Cire\u015fan, D., Meier, U., Giusti, A., Nagi, F., Schmidhuber, J., and Gambardella, L.M. (2011, January 16\u201318). Max-pooling convolutional neural networks for vision-based hand gesture recognition. Proceedings of the 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), Kuala Lumpur, Malaysia.","DOI":"10.1109\/ICSIPA.2011.6144164"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"202","DOI":"10.1016\/j.engappai.2018.09.006","article-title":"American Sign Language alphabet recognition using Convolutional Neural Networks with multiview augmentation and inference fusion","volume":"76","author":"Tao","year":"2018","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Naguri, C.R., and Bunescu, R.C. (2017, January 18\u201321). Recognition of dynamic hand gestures from 3D motion data using LSTM and CNN architectures. Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico.","DOI":"10.1109\/ICMLA.2017.00013"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1007\/s11042-016-4223-3","article-title":"Head-mounted gesture-controlled interface for human-computer interaction","volume":"77","author":"Memo","year":"2018","journal-title":"Multimed. Tools Appl."},{"key":"ref_41","unstructured":"Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., and Gool, L.V. (2017, January 4\u20139). Pose Guided Person Image Generation. Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201323). Squeeze-and-excitation network. Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_43","unstructured":"Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., and Devin, M. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Tock, K. (2019). Google CoLaboratory as a platform for Python coding with students. RTSRE Proc., 2.","DOI":"10.32374\/rtsre.2019.013"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Gollapudi, S. (2019). OpenCV with Python. Learn Computer Vision Using OpenCV, Springer.","DOI":"10.1007\/978-1-4842-4261-2"},{"key":"ref_46","unstructured":"Glorot, X., and Bengio, Y. (2010, January 13\u201315). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the thirteenth international artificial intelligence and statistics conference. JMLR Workshop and Conference Proceedings, Sardinia, Italy."},{"key":"ref_47","unstructured":"Dozat, T. (2022, December 01). Incorporating Nesterov Momentum into Adam. Available online: https:\/\/openreview.net\/forum?id=OM0jvwB8jIp57ZJjtNEZ."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Tang, H., Wang, W., Xu, D., Yan, Y., and Sebe, N. (2018, January 18\u201323). GestureGAN for Hand Gesture-to-Gesture Translation in the Wild. Proceedings of the CVPR 2018 (IEEE), Salt Lake City, UT, USA.","DOI":"10.1145\/3240508.3240704"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Siarohin, A., Sangineto, E., Lathuili\u00e8re, S., and Sebe, N. (2018, January 18\u201323). Deformable GANs for Pose-based Human Image Generation. Proceedings of the CVPR 2018 (IEEE), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00359"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Ma, L., Sun, Q., Georgoulis, S., Van Gool, L., Schiele, B., and Fritz, M. (2018, January 18\u201323). Disentangled Person Image Generation. Proceedings of the CVPR 2018 (IEEE), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00018"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Yan, Y., Xu, J., Ni, B., Zhang, W., and Yang, X. (2017, January 23\u201327). Skeleton-aided articulated motion generation. Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA.","DOI":"10.1145\/3123266.3123277"}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/12\/1\/13\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T17:59:23Z","timestamp":1760119163000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/12\/1\/13"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,5]]},"references-count":51,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,1]]}},"alternative-id":["computers12010013"],"URL":"https:\/\/doi.org\/10.3390\/computers12010013","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,1,5]]}}}