{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T12:37:29Z","timestamp":1775911049321,"version":"3.50.1"},"reference-count":52,"publisher":"MDPI AG","issue":"17","license":[{"start":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T00:00:00Z","timestamp":1630454400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100014188","name":"Ministry of Science and ICT, South Korea","doi-asserted-by":"publisher","award":["No.2020-Data-We81-1"],"award-info":[{"award-number":["No.2020-Data-We81-1"]}],"id":[{"id":"10.13039\/501100014188","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Speech signals are used as a primary input source in human\u2013computer interaction (HCI) to develop several applications, such as automatic speech recognition (ASR), speech emotion recognition (SER), and gender and age recognition. Classifying speakers according to their age and gender is a challenging task in speech processing owing to the inability of current methods to extract salient high-level speech features and to build effective classification models. To address these problems, we introduce a novel end-to-end convolutional neural network (CNN) with a specially designed multi-attention module (MAM) for age and gender recognition from speech signals. Our proposed model uses the MAM to effectively extract spatially and temporally salient features from the input data. The MAM uses a rectangular filter as a kernel in its convolution layers and comprises two separate time and frequency attention mechanisms. The time attention branch learns to detect temporal cues, whereas the frequency attention module extracts the features most relevant to the target by focusing on spatial frequency features. 
The two extracted spatial and temporal feature sets complement one another and provide high performance in terms of age and gender classification. The proposed age and gender classification system was tested using the Common Voice dataset and a locally developed Korean speech recognition dataset. Our model achieved accuracy scores of 96%, 73%, and 76% for gender, age, and age-gender classification, respectively, on the Common Voice dataset. On the Korean speech recognition dataset, the results were 97%, 97%, and 90% for gender, age, and age-gender recognition, respectively. The prediction performance obtained in the experiments demonstrates the superiority and robustness of our proposed model for age, gender, and age-gender recognition from speech signals.<\/jats:p>","DOI":"10.3390\/s21175892","type":"journal-article","created":{"date-parts":[[2021,9,2]],"date-time":"2021-09-02T23:05:12Z","timestamp":1630623912000},"page":"5892","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":83,"title":["Age and Gender Recognition Using a Convolutional Neural Network with a Specially Designed Multi-Attention Module through Speech Spectrograms"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7419-494X","authenticated-orcid":false,"given":"Anvarjon","family":"Tursunov","sequence":"first","affiliation":[{"name":"Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8020-3590","authenticated-orcid":false,"family":"Mustaqeem","sequence":"additional","affiliation":[{"name":"Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6604-5944","authenticated-orcid":false,"given":"Joon Yeon","family":"Choeh","sequence":"additional","affiliation":[{"name":"Intelligent Contents 
Laboratory, Department of Software, Sejong University, Seoul 05006, Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5451-8815","authenticated-orcid":false,"given":"Soonil","family":"Kwon","sequence":"additional","affiliation":[{"name":"Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea"}]}],"member":"1968","published-online":{"date-parts":[[2021,9,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Park, D.S., Zhang, Y., Jia, Y., Han, W., Chiu, C.-C., Li, B., Wu, Y., and Le, Q.V.J. (2020). Improved noisy student training for automatic speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2020-1470"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Anvarjon, T., Kwon, S. (2020). Deep-net: A lightweight CNN-based speech emotion recognition system using deep frequency features. Sensors, 20.","DOI":"10.3390\/s20185212"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Ghahremani, P., Nidadavolu, P.S., Chen, N., Villalba, J., Povey, D., Khudanpur, S., and Dehak, N. (2018, January 2\u20136). End-to-end Deep Neural Network Age Estimation. Proceedings of the INTERSPEECH 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-2015"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"S\u00e1nchez-Hevia, H.A., Gil-Pita, R., Utrilla-Manso, M., and Rosa-Zurera, M. (2019, January 17\u201319). Convolutional-recurrent neural network for age and gender prediction from speech. Proceedings of the 2019 Signal Processing Symposium (SPSympo), Krakow, Poland.","DOI":"10.1109\/SPS.2019.8881961"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1016\/j.engappai.2014.05.003","article-title":"Speaker age estimation using i-vectors","volume":"34","author":"Bahari","year":"2014","journal-title":"Eng. Appl. Artif. 
Intell."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"29","DOI":"10.1016\/j.compeleceng.2016.06.002","article-title":"A new approach with score-level fusion for the classification of a speaker age and gender","volume":"53","author":"Nabiyev","year":"2016","journal-title":"Comput. Electr. Eng."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Kalluri, S.B., Vijayasenan, D., and Ganapathy, S. (2019, January 12\u201317). A deep neural network based end to end model for joint height and age estimation from short duration speech. Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683397"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"22524","DOI":"10.1109\/ACCESS.2018.2816163","article-title":"Age estimation in short speech utterances based on LSTM recurrent neural networks","volume":"6","author":"Zazo","year":"2018","journal-title":"IEEE Access"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/s11357-015-9854-1","article-title":"Effects of age on the amplitude, frequency and perceived quality of voice","volume":"37","author":"Lortie","year":"2015","journal-title":"Age"},{"key":"ref_10","first-page":"14","article-title":"Analysis of variations in speech in different age groups using prosody technique","volume":"126","author":"Landge","year":"2015","journal-title":"Int. J. Comput. Appl."},{"key":"ref_11","unstructured":"Zhang, Y., Qin, J., Park, D.S., Han, W., Chiu, C.-C., Pang, R., Le, Q.V., and Wu, Y.J. (2020). Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Gong, Y., Chung, Y.-A., and Glass, J. (2021). AST: Audio Spectrogram Transformer. 
arXiv.","DOI":"10.21437\/Interspeech.2021-698"},{"key":"ref_13","unstructured":"Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F.M., and Weber, G. (2019). Common voice: A massively-multilingual speech corpus. arXiv."},{"key":"ref_14","unstructured":"Poggio, B., Brunelli, R., and Poggio, T. (1992, January 26\u201329). HyperBF networks for gender classification. Proceedings of the Image Understanding Workshop, San Diego, CA, USA."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"5116","DOI":"10.1002\/int.22505","article-title":"Optimal feature selection based speech emotion recognition using two-stream deep convolutional neural network","volume":"36","author":"Kwon","year":"2021","journal-title":"Int. J. Intell. Syst."},{"key":"ref_16","unstructured":"Ng, C.B., Tay, Y.H., and Goi, B.M.J. (2012, January 3\u20137). Vision-based human gender recognition: A survey. Proceedings of the Pacific Rim International Conference on Artificial Intelligence, Kuching, Malaysia."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"17","DOI":"10.32628\/IJSRSET196110","article-title":"A Hybrid Approach to Gender Classification using Speech Signal","volume":"6","author":"Pir","year":"2019","journal-title":"IJSRSET"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"2133","DOI":"10.3390\/math8122133","article-title":"CLSTM: Deep feature-based speech emotion recognition using the hierarchical ConvLSTM network","volume":"8","author":"Kwon","year":"2020","journal-title":"Mathematics"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1","DOI":"10.5121\/ijcseit.2012.2101","article-title":"Gender recognition system using speech signal","volume":"2","author":"Ali","year":"2012","journal-title":"IJCSEIT"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"175","DOI":"10.1090\/S0025-5718-1978-0468306-4","article-title":"On computing the discrete Fourier 
transform","volume":"32","author":"Winograd","year":"1978","journal-title":"Math. Comput."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Martin, A.F., and Przybocki, M.A. (2001, January 3\u20137). Speaker recognition in a multi-speaker environment. Proceedings of the Seventh European Conference on Speech Communication and Technology, Aalborg, Denmark.","DOI":"10.21437\/Eurospeech.2001-246"},{"key":"ref_22","first-page":"14344","article-title":"Speech Based Gender Identification Using Fuzzy Logic","volume":"6","author":"Khan","year":"2017","journal-title":"Int. J. Innov. Res. Sci. Eng. Technol."},{"key":"ref_23","first-page":"477","article-title":"Gender classification in speech recognition using fuzzy logic and neural network","volume":"10","author":"Meena","year":"2013","journal-title":"Int. Arab J. Inf. Technol."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"79861","DOI":"10.1109\/ACCESS.2020.2990405","article-title":"Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM","volume":"8","author":"Sajjad","year":"2020","journal-title":"IEEE Access"},{"key":"ref_25","first-page":"183","article-title":"A CNN-assisted enhanced audio signal processing for speech emotion recognition","volume":"20","author":"Kwon","year":"2020","journal-title":"Sensors"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"94262","DOI":"10.1109\/ACCESS.2021.3093053","article-title":"Short-Term Energy Forecasting Framework Using an Ensemble Deep Learning Approach","volume":"9","author":"Ishaq","year":"2021","journal-title":"IEEE Access"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"820","DOI":"10.1016\/j.future.2021.06.045","article-title":"Human action recognition using attention based LSTM network with dilated CNN features","volume":"125","author":"Muhammad","year":"2021","journal-title":"Future Gener. Comput. 
Syst."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"114177","DOI":"10.1016\/j.eswa.2020.114177","article-title":"MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach","volume":"167","author":"Mustaqeem","year":"2021","journal-title":"Expert Syst. Appl."},{"key":"ref_29","first-page":"4039","article-title":"1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features","volume":"67","author":"Mustaqeem","year":"2021","journal-title":"Comput. Mater. Contin."},{"key":"ref_30","unstructured":"Khanum, S., and Sora, M. (2015, January 21\u201322). Speech based gender identification using feed forward neural networks. Proceedings of the National Conference on Recent Trends in Information Technology (NCIT 2015), Gujarat, India."},{"key":"ref_31","first-page":"118","article-title":"Advanced Gender Recognition System Using Speech Signal","volume":"6","author":"Prabha","year":"2016","journal-title":"IJCSET"},{"key":"ref_32","first-page":"646","article-title":"Machine Learning Based Gender Recognition and Emotion Detection","volume":"7","author":"Kaur","year":"2014","journal-title":"Int. J. Eng. Sci. Emerg. Technol."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"107101","DOI":"10.1016\/j.asoc.2021.107101","article-title":"Att-Net: Enhanced emotion recognition system using lightweight self-attention module","volume":"102","author":"Kwon","year":"2021","journal-title":"Appl. Soft Comput."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018, January 15\u201320). X-vectors: Robust dnn embeddings for speaker recognition. 
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461375"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"465","DOI":"10.1016\/j.tics.2004.08.008","article-title":"How the brain separates sounds. Trends in cognitive sciences","volume":"8","author":"Carlyon","year":"2004","journal-title":"Trends Cogn. Sci."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Hou, W., Dong, Y., Zhuang, B., Yang, L., Shi, J., and Shinozaki, T. (2020, January 25\u201329). Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning. Proceedings of the INTERSPEECH 2020, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-2164"},{"key":"ref_37","unstructured":"Tan, M., and Le, Q. (2019, January 9\u201315). Efficientnet: Rethinking model scaling for convolutional neural networks. Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Tan, M., Pang, R., and Le, Q.V. (2020, January 13\u201319). Efficientdet: Scalable and efficient object detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01079"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"386","DOI":"10.1016\/j.future.2019.01.029","article-title":"Action recognition using optimized deep autoencoder and CNN for surveillance data streams of non-stationary environments","volume":"96","author":"Ullah","year":"2019","journal-title":"Future Gener. Comput. Syst."},{"key":"ref_40","first-page":"1261","article-title":"A hybrid of deep CNN and bidirectional LSTM for automatic speech recognition","volume":"29","author":"Passricha","year":"2020","journal-title":"Int. J. Intell. 
Syst."},{"key":"ref_41","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"325","DOI":"10.1016\/j.neucom.2019.01.078","article-title":"Bidirectional LSTM with attention mechanism and convolutional layer for text classification","volume":"337","author":"Liu","year":"2019","journal-title":"Neurocomputing"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Miyazaki, K., Komatsu, T., Hayashi, T., Watanabe, S., Toda, T., and Takeda, K. (2020, January 4\u20138). Weakly-supervised sound event detection with self-attention. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053609"},{"key":"ref_44","unstructured":"(2021, April 05). The Korean Speech Recognition Dataset. Available online: https:\/\/aihub.or.kr\/aidata\/33305."},{"key":"ref_45","unstructured":"Henretty, M., and Davis, K. (2021, April 02). Common Voice. Available online: https:\/\/www.kaggle.com\/mozillaorg\/common-voice."},{"key":"ref_46","unstructured":"Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., and Devin, M.J. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv."},{"key":"ref_47","unstructured":"Van Rossum, G., and Drake, F.L. (2009). Python 3 Reference Manual, CreateSpace."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"McFee, B., Raffel, C., Liang, D., Ellis, D.P., McVicar, M., Battenberg, E., and Nieto, O. (2015, January 6\u201312). Librosa: Audio and music signal analysis in python. 
Proceedings of the 14th Python in Science Conference, Austin, TX, USA.","DOI":"10.25080\/Majora-7b98e3ed-003"},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"274","DOI":"10.1016\/j.jvoice.2013.10.012","article-title":"Fundamental frequency changes of persian speakers across the life span","volume":"28","author":"Soltani","year":"2014","journal-title":"J. Voice"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"24","DOI":"10.14500\/aro.10072","article-title":"Objective gender and age recognition from speech sentences","volume":"3","author":"Faek","year":"2015","journal-title":"ARO"},{"key":"ref_51","unstructured":"Simonyan, K., and Zisserman, A.J. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Kao, M.-Y. (2008). Support Vector Machines. Encyclopedia of Algorithms, Springer.","DOI":"10.1007\/978-3-642-27848-8"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/17\/5892\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:54:34Z","timestamp":1760165674000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/17\/5892"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,1]]},"references-count":52,"journal-issue":{"issue":"17","published-online":{"date-parts":[[2021,9]]}},"alternative-id":["s21175892"],"URL":"https:\/\/doi.org\/10.3390\/s21175892","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,9,1]]}}}