{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T15:28:39Z","timestamp":1775230119132,"version":"3.50.1"},"reference-count":37,"publisher":"MDPI AG","issue":"14","license":[{"start":{"date-parts":[[2021,7,13]],"date-time":"2021-07-13T00:00:00Z","timestamp":1626134400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>The speech signal contains a vast spectrum of information about the speaker such as speakers\u2019 gender, age, accent, or health state. In this paper, we explored different approaches to automatic speaker\u2019s gender classification and age estimation system using speech signals. We applied various Deep Neural Network-based embedder architectures such as x-vector and d-vector to age estimation and gender classification tasks. Furthermore, we have applied a transfer learning-based training scheme with pre-training the embedder network for a speaker recognition task using the Vox-Celeb1 dataset and then fine-tuning it for the joint age estimation and gender classification task. 
The best performing system achieves new state-of-the-art results on the age estimation task using popular TIMIT dataset with a mean absolute error (MAE) of 5.12 years for male and 5.29 years for female speakers and a root-mean square error (RMSE) of 7.24 and 8.12 years for male and female speakers, respectively, and an overall gender recognition accuracy of 99.60%.<\/jats:p>","DOI":"10.3390\/s21144785","type":"journal-article","created":{"date-parts":[[2021,7,13]],"date-time":"2021-07-13T22:25:31Z","timestamp":1626215131000},"page":"4785","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":63,"title":["Gender and Age Estimation Methods Based on Speech Using Deep Neural Networks"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2276-5288","authenticated-orcid":false,"given":"Damian","family":"Kwasny","sequence":"first","affiliation":[{"name":"Department of Measurement and Electronics, AGH University of Science and Technology, 30-059 Krakow, Poland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2193-7690","authenticated-orcid":false,"given":"Daria","family":"Hemmerling","sequence":"additional","affiliation":[{"name":"Department of Measurement and Electronics, AGH University of Science and Technology, 30-059 Krakow, Poland"}]}],"member":"1968","published-online":{"date-parts":[[2021,7,13]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1016\/j.csl.2012.02.005","article-title":"Paralinguistics in speech and language\u2014State-of-the-art and the challenge","volume":"27","author":"Schuller","year":"2013","journal-title":"Comput. Speech Lang."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"631","DOI":"10.1515\/amcs-2015-0046","article-title":"Acoustic analysis assessment in speech pathology detection","volume":"25","author":"Panek","year":"2015","journal-title":"Int. J. Appl. Math. Comput. 
Sci."},{"key":"ref_3","unstructured":"(2021, February 12). Techmo. Available online: https:\/\/www.techmo.pl."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"22524","DOI":"10.1109\/ACCESS.2018.2816163","article-title":"Age Estimation in Short Speech Utterances Based on LSTM Recurrent Neural Networks","volume":"6","author":"Zazo","year":"2018","journal-title":"IEEE Access"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Mahmoodi, D., Marvi, H., Taghizadeh, M., Soleimani, A., Razzazi, F., and Mahmoodi, M. (2011, January 13\u201314). Age Estimation Based on Speech Features and Support Vector Machine. Proceedings of the 2011 3rd Computer Science and Electronic Engineering Conference (CEEC), Colchester, UK.","DOI":"10.1109\/CEEC.2011.5995826"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"788","DOI":"10.1109\/TASL.2010.2064307","article-title":"Front-End Factor Analysis for Speaker Verification","volume":"19","author":"Dehak","year":"2011","journal-title":"IEEE Trans. Audio Speech Lang."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Villalba, J., Chen, N., Snyder, D., Garcia-Romero, D., McCree, A., Sell, G., Borgstrom, J., Richardson, F., Shon, S., and Grondin, F. (2019, January 15\u201319). State-of-the-Art Speaker Recognition for Telephone and Video Speech: The JHU-MIT Submission for NIST SRE18. Proceedings of the INTERSPEECH 2019, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-2713"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018, January 15\u201320). X-vectors: Robust dnn embeddings for speaker recognition. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461375"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"McLaren, M., Lawson, A., Ferrer, L., Castan, D., and Graciarena, M. (2015, January 8\u201312). 
The speakers in the wild speaker recognition challenge plan. Proceedings of the Interspeech 2016 Special Session, San Francisco, CA, USA.","DOI":"10.21437\/Interspeech.2016-1129"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Wan, L., Wang, Q., Papir, A., and Moreno, I.L. (2018, January 15\u201320). Generalized end-to-end loss for speaker verification. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462665"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Jasuja, L., Rasool, A., and Hajela, G. (2020, January 10\u201312). Voice Gender Recognizer Recognition of Gender from Voice using Deep Neural Networks. Proceedings of the 2020 International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India.","DOI":"10.1109\/ICOSEC49089.2020.9215254"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Djemili, R., Bourouba, H., and Korba, M.C.A. (2012, January 10\u201312). A speech signal based gender identification system using four classifiers. Proceedings of the 2012 International Conference on Multimedia Computing and Systems, Tangiers, Morocco.","DOI":"10.1109\/ICMCS.2012.6320122"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Buyukyilmaz, M., and Cibikdiken, A.O. (2016, January 18\u201319). Voice gender recognition using deep learning. Proceedings of the 2016 International Conference on Modeling, Simulation and Optimization Technologies and Applications, Xiamen, China.","DOI":"10.2991\/msota-16.2016.90"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Alhussein, M., Ali, Z., Imran, M., and Abdul, W. (2016). Automatic gender detection based on characteristics of vocal folds for mobile healthcare system. Mob. Inf. Syst., 2016.","DOI":"10.1155\/2016\/7805217"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Uddin, M.A., Hossain, M.S., Pathan, R.K., and Biswas, M. 
(2020, January 24\u201326). Gender Recognition from Human Voice using Multi-Layer Architecture. Proceedings of the 2020 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), Novi Sad, Serbia.","DOI":"10.1109\/INISTA49547.2020.9194654"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"351","DOI":"10.1016\/j.apacoust.2019.07.033","article-title":"An effective gender recognition approach using voice data via deeper LSTM networks","volume":"156","author":"Ertam","year":"2019","journal-title":"Appl. Acoust."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"492","DOI":"10.3390\/make1010030","article-title":"Gender recognition by voice using an improved self-labeled algorithm","volume":"1","author":"Livieris","year":"2019","journal-title":"Mach. Learn. Knowl. Extr."},{"key":"ref_18","first-page":"3","article-title":"GMM-based speaker age and gender classification in Czech and Slovak","volume":"68","author":"Pribil","year":"2017","journal-title":"J. Electr. Eng."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Maka, T., and Dziurzanski, P. (2014, January 11\u201313). An analysis of the influence of acoustical adverse conditions on speaker gender identification. Proceedings of the XXII Annual Pacific Voice Conference (PVC), Krakow, Poland.","DOI":"10.1109\/PVC.2014.6845419"},{"key":"ref_20","unstructured":"Craig, G., Alvin, M., David, G., Linda, B., and Kevin, W. (2020, August 29). 2010 NIST Speaker Recognition Evaluation Test Set. Available online: https:\/\/catalog.ldc.upenn.edu\/LDC2017S06."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Ghahremani, P., Nidadavolu, P.S., Chen, N., Villalba, J., Povey, D., Khudanpur, S., and Dehak, N. (2018, January 2\u20136). End-to-end Deep Neural Network Age Estimation. 
Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-2015"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Kalluri, S.B., Vijayasenan, D., and Ganapathy, S. (2019, January 12\u201317). A Deep Neural Network Based End to End Model for Joint Height and Age Estimation from Short Duration Speech. Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683397"},{"key":"ref_23","unstructured":"Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N., and Zue, V. (1992). TIMIT Acoustic-phonetic Continuous Speech Corpus. Linguist. Data Consort."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1016\/j.specom.2020.03.008","article-title":"Automatic speaker profiling from short duration speech data","volume":"121","author":"Kalluri","year":"2020","journal-title":"Speech Commun."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Nagrani, A., Chung, J.S., and Zisserman, A. (2017). Voxceleb: A large-scale speaker identification dataset. arXiv.","DOI":"10.21437\/Interspeech.2017-950"},{"key":"ref_26","unstructured":"Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F.M., and Weber, G. (2019). Common Voice: A Massively-Multilingual Speech Corpus. arXiv."},{"key":"ref_27","unstructured":"(2020, December 30). Common Voice Database. Available online: https:\/\/www.kaggle.com\/mozillaorg\/common-voice."},{"key":"ref_28","unstructured":"(2020, December 30). DARPA-TIMIT dataset. Available online: https:\/\/www.kaggle.com\/mfekadu\/darpa-timit-acousticphonetic-continuous-speech."},{"key":"ref_29","unstructured":"(2020, September 05). Resemblyzer. Available online: https:\/\/github.com\/resemble-ai\/Resemblyzer."},{"key":"ref_30","unstructured":"Peddinti, V., Povey, D., and Khudanpur, S. (, January 6\u201310). 
A time delay neural network architecture for efficient modeling of long temporal contexts. Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"328","DOI":"10.1109\/29.21701","article-title":"Phoneme recognition using time-delay neural networks","volume":"37","author":"Waibel","year":"1989","journal-title":"IEEE Trans. Acoust. Speech Signal"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Kriman, S., Beliaev, S., Ginsburg, B., Huang, J., Kuchaiev, O., Lavrukhin, V., Leary, R., Li, J., and Zhang, Y. (2020, January 4\u20138). Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions. Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053889"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Sak, H., Senior, A.W., and Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling. arXiv.","DOI":"10.21437\/Interspeech.2014-80"},{"key":"ref_34","unstructured":"Lee, S.W., Kim, J.H., Jun, J., Ha, J.W., and Zhang, B.T. (2017, January 4\u20139). Overcoming catastrophic forgetting by incremental moment matching. Proceedings of the Advances in neural information processing systems, Long Beach, CA, USA."},{"key":"ref_35","unstructured":"Baevski, A., Zhou, H., rahman Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Pratap, V., Xu, Q., Sriram, A., Synnaeve, G., and Collobert, R. (2020). MLS: A Large-Scale Multilingual Dataset for Speech Research. arXiv.","DOI":"10.21437\/Interspeech.2020-2826"},{"key":"ref_37","unstructured":"(2020, December 30). VoxCeleb1. 
Available online: https:\/\/www.robots.ox.ac.uk\/~vgg\/data\/voxceleb\/index.html#portfolio."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/14\/4785\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:29:56Z","timestamp":1760164196000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/14\/4785"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,13]]},"references-count":37,"journal-issue":{"issue":"14","published-online":{"date-parts":[[2021,7]]}},"alternative-id":["s21144785"],"URL":"https:\/\/doi.org\/10.3390\/s21144785","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,13]]}}}