{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,30]],"date-time":"2026-03-30T19:22:44Z","timestamp":1774898564835,"version":"3.50.1"},"reference-count":49,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2025,7,23]],"date-time":"2025-07-23T00:00:00Z","timestamp":1753228800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computation"],"abstract":"<jats:p>Speaker profiling systems are often evaluated on a single corpus, which complicates reliable comparison. We present a fully reproducible evaluation pipeline that trains Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) models independently on three speech corpora representing distinct recording conditions\u2014studio-quality TIMIT, crowdsourced Mozilla Common Voice, and in-the-wild VoxCeleb1. All models share the same architecture, optimizer, and data preprocessing; no corpus-specific hyperparameter tuning is applied. We perform a detailed preprocessing and feature extraction procedure, evaluating multiple configurations and validating their applicability and effectiveness in improving the obtained results. A feature analysis shows that Mel spectrograms benefit CNNs, whereas Mel Frequency Cepstral Coefficients (MFCCs) suit LSTMs, and that the optimal Mel-bin count grows with corpus Signal-to-Noise Ratio (SNR). With this fixed recipe, EfficientNet achieves 99.82% gender accuracy on Common Voice (+1.25 pp over the previous best) and 98.86% on VoxCeleb1 (+0.57 pp). MobileNet attains 99.86% age-group accuracy on Common Voice (+2.86 pp) and a 5.35-year MAE for age estimation on TIMIT using a lightweight configuration. The consistent, near-state-of-the-art results across three acoustically diverse datasets substantiate the robustness and versatility of the proposed pipeline. 
Code and pre-trained weights are released to facilitate downstream research.<\/jats:p>","DOI":"10.3390\/computation13080177","type":"journal-article","created":{"date-parts":[[2025,7,23]],"date-time":"2025-07-23T16:04:51Z","timestamp":1753286691000},"page":"177","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Multi-Corpus Benchmarking of CNN and LSTM Models for Speaker Gender and Age Profiling"],"prefix":"10.3390","volume":"13","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-8646-7588","authenticated-orcid":false,"given":"Jorge","family":"Jorrin-Coz","sequence":"first","affiliation":[{"name":"ESIME Culhuacan, Instituto Politecnico Nacional, Av. Santa Ana 1000, Mexico City 04440, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1346-7825","authenticated-orcid":false,"given":"Mariko","family":"Nakano","sequence":"additional","affiliation":[{"name":"ESIME Culhuacan, Instituto Politecnico Nacional, Av. Santa Ana 1000, Mexico City 04440, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7786-2050","authenticated-orcid":false,"given":"Hector","family":"Perez-Meana","sequence":"additional","affiliation":[{"name":"ESIME Culhuacan, Instituto Politecnico Nacional, Av. Santa Ana 1000, Mexico City 04440, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4555-8695","authenticated-orcid":false,"given":"Leobardo","family":"Hernandez-Gonzalez","sequence":"additional","affiliation":[{"name":"ESIME Culhuacan, Instituto Politecnico Nacional, Av. 
Santa Ana 1000, Mexico City 04440, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,7,23]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"342","DOI":"10.3758\/BF03195462","article-title":"Interactive voice response: Review of studies 1989\u20132000","volume":"34","author":"Corkrey","year":"2002","journal-title":"Behav. Res. Methods Instrum. Comput."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Jaid, U.H., and Hassan, A.K.A. (2023). Review of Automatic Speaker Profiling: Features, Methods, and Challenges. Iraqi J. Sci., 6548\u20136571.","DOI":"10.24996\/ijs.2023.64.12.36"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Humayun, M.A., Shuja, J., and Abas, P.E. (2023). Speaker Profiling Based on the Short-Term Acoustic Features of Vowels. Technologies, 11.","DOI":"10.3390\/technologies11050119"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"V\u00e1squez-Correa, J.C., and \u00c1lvarez Muniain, A. (2023). Novel speech recognition systems applied to forensics within child exploitation: Wav2vec2.0 vs. whisper. Sensors, 23.","DOI":"10.3390\/s23041843"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1016\/j.specom.2020.03.008","article-title":"Automatic speaker profiling from short duration speech data","volume":"121","author":"Kalluri","year":"2020","journal-title":"Speech Commun."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Schuller, B.W., Steidl, S., Batliner, A., Marschik, P.B., Baumeister, H., Dong, F., and Zafeiriou, S. (2018, January 2\u20136). The interspeech 2018 computational paralinguistics challenge: Atypical & self-assessed affect, crying & heart beats. 
Proceedings of the 19th Annual Conference of the International Speech Communication, INTERSPEECH 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-51"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"3462","DOI":"10.1121\/10.0011471","article-title":"Acoustic voice variation in spontaneous speech","volume":"151","author":"Lee","year":"2022","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Al-Maashani, T., Mendon\u00e7a, I., and Aritsugi, M. (2023, January 11\u201313). Age classification based on voice using Mel-spectrogram and MFCC. Proceedings of the 2023 24th International Conference on Digital Signal Processing (DSP), Rhodes, Greece.","DOI":"10.1109\/DSP58604.2023.10167887"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1109\/TIT.1967.1053964","article-title":"Nearest neighbor pattern classification","volume":"13","author":"Cover","year":"1967","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_10","first-page":"036106","article-title":"Near linear time algorithm to detect community structures in large-scale networks","volume":"76","author":"Raghavan","year":"2007","journal-title":"Physical Rev. E"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"3535","DOI":"10.1007\/s11042-021-11614-4","article-title":"Age group classification and gender recognition from speech with temporal convolutional neural networks","volume":"81","year":"2022","journal-title":"Multimed. Tools Appl."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Kwasny, D., and Hemmerling, D. (2021). Gender and age estimation methods based on speech using deep neural networks. Sensors, 21.","DOI":"10.3390\/s21144785"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"4444388","DOI":"10.1155\/2022\/4444388","article-title":"Speaker gender recognition based on deep neural networks and ResNet50","volume":"2022","author":"Alnuaim","year":"2022","journal-title":"Wirel. Commun. 
Mob. Comput."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Hechmi, K., Trong, T.N., Hautam\u00e4ki, V., and Kinnunen, T. (2021, January 13\u201317). Voxceleb enrichment for age and gender recognition. Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia.","DOI":"10.1109\/ASRU51503.2021.9688085"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Tursunov, A., Mustaqeem Choeh, J.Y., and Kwon, S. (2021). Age and gender recognition using a convolutional neural network with a specially designed multi-attention module through speech spectrograms. Sensors, 21.","DOI":"10.3390\/s21175892"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Zheng, W., Yang, P., Lai, R., Zhu, K., Zhang, T., Zhang, J., and Fu, H. (2022, January 18\u201322). Exploring Multi-task Learning Based Gender Recognition and Age Estimation for Class-imbalanced Data. Proceedings of the 23rd INTERSPEECH, Incheon, Republic of Korea.","DOI":"10.21437\/Interspeech.2022-682"},{"key":"ref_17","unstructured":"Nowakowski, A., and Kasprzak, W. (2023, January 17\u201320). Automatic speaker\u2019s age classification in the Common Voice database. Proceedings of the 2023 18th Conference on Computer Science and Intelligence Systems (FedCSIS), Warsaw, Poland."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Halu\u0161ka, R., Popovi\u010d, M., Pleva, M., and Frohman, M. (2023, January 21\u201322). Detection of Gender and Age Category from Speech. Proceedings of the 2023 World Symposium on Digital Intelligence for Systems and Machines (DISA), Ko\u0161ice, Slovakia.","DOI":"10.1109\/DISA59116.2023.10308943"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"3065","DOI":"10.1007\/s00521-023-09153-0","article-title":"Speaker age and gender recognition using 1D and 2D convolutional neural networks","volume":"36","year":"2024","journal-title":"Neural Comput. 
Appl."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Y\u00fccesoy, E. (2024). Automatic Age and Gender Recognition Using Ensemble Learning. Appl. Sci., 14.","DOI":"10.3390\/app14166868"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"114591","DOI":"10.1016\/j.eswa.2021.114591","article-title":"Speaker identification through artificial intelligence techniques: A comprehensive review and research challenges","volume":"171","author":"Jahangir","year":"2021","journal-title":"Expert Syst. Appl."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Rabiner, L.R., and Schafer, R.W. (2007). Introduction to Digital Speech Processing, Now Publishers Inc.","DOI":"10.1561\/9781601980717"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Kacur, J., Puterka, B., Pavlovicova, J., and Oravec, M. (2022). Frequency, Time, Representation and Modeling Aspects for Major Speech and Audio Processing Applications. Sensors, 22.","DOI":"10.3390\/s22166304"},{"key":"ref_24","first-page":"1","article-title":"Audio signal filtering with low-pass and high-pass filters","volume":"2","author":"Tun","year":"2020","journal-title":"Int. J. All Res. Writ."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1016\/j.jvoice.2009.08.004","article-title":"Effects of low-pass filtering on acoustic analysis of voice","volume":"25","author":"MacCallum","year":"2011","journal-title":"J. Voice"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"125868","DOI":"10.1109\/ACCESS.2019.2938007","article-title":"Speech emotion recognition from 3D log-mel spectrograms with deep learning network","volume":"7","author":"Meng","year":"2019","journal-title":"IEEE Access"},{"key":"ref_27","unstructured":"Ittichaichareon, C., Suksri, S., and Yingthawornsuk, T. (2012, January 28\u201329). Speech recognition using MFCC. 
Proceedings of the International Conference on Computer Graphics, Simulation and Modeling, Pattaya, Thailand."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Bhandari, B. (2021). Comparative study of popular deep learning models for machining roughness classification using sound and force signals. Micromachines, 12.","DOI":"10.3390\/mi12121484"},{"key":"ref_29","first-page":"1963","article-title":"Differential Time-frequency Log-mel Spectrogram Features for Vision Transformer Based Infant Cry Recognition","volume":"2022","author":"Xu","year":"2022","journal-title":"Proc. Interspeech"},{"key":"ref_30","unstructured":"Kingma, D.P., and Ba, J. (2015, January 7\u20139). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA."},{"key":"ref_31","unstructured":"Mukherjee, K., Khare, A., and Verma, A. (2019). A simple dynamic learning-rate tuning algorithm for automated training of DNNs. arXiv."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Orr, G.B., and M\u00fcller, K.-R. (1998). Early stopping\u2014But when?. Neural Networks: Tricks of the Trade, Springer.","DOI":"10.1007\/3-540-49430-8"},{"key":"ref_33","unstructured":"Krogh, A., and Hertz, J.A. (1991, January 2\u20135). A simple weight decay can improve generalization. Proceedings of the 4th Conference on Neural Information Processing Systems (NIPS 1991), Denver, CO, USA."},{"key":"ref_34","first-page":"1929","article-title":"Dropout: A simple way to prevent neural networks from overfitting","volume":"15","author":"Srivastava","year":"2014","journal-title":"J. Mach. Learn. Res."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Shim, J.W. (2024). Enhancing cross entropy with a linearly adaptive loss function for optimized classification performance. Sci. 
Rep., 14.","DOI":"10.1038\/s41598-024-78858-6"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"5481","DOI":"10.5194\/gmd-15-5481-2022","article-title":"Root-mean-square error (RMSE) or mean absolute error (MAE): When to use them or not","volume":"15","author":"Hodson","year":"2022","journal-title":"Geosci. Model Dev."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). ImageNet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2009, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_38","first-page":"1097","article-title":"ImageNet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_40","unstructured":"Tan, M., and Le, Q.V. (2019, January 9\u201315). EfficientNet: Rethinking model scaling for convolutional neural networks. Proceedings of the 36th International Conference on Machine Learning (ICML) 2019, Long Beach, CA, USA."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2017, January 21\u201326). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_42","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. 
arXiv."},{"key":"ref_43","unstructured":"Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., and Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"1345","DOI":"10.1109\/TKDE.2009.191","article-title":"A survey on transfer learning","volume":"22","author":"Pan","year":"2010","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_45","first-page":"3320","article-title":"How transferable are features in deep neural networks?","volume":"27","author":"Yosinski","year":"2014","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_46","first-page":"2616","article-title":"VoxCeleb: A large-scale speaker identification dataset","volume":"2017","author":"Nagrani","year":"2017","journal-title":"Proc. Interspeech"},{"key":"ref_47","unstructured":"Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F.M., and Weber, G. (2020, January 20\u201325). Common voice: A massively-multilingual speech corpus. Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"351","DOI":"10.1016\/0167-6393(90)90010-7","article-title":"Speech database development at MIT: Timit and beyond","volume":"9","author":"Zue","year":"1990","journal-title":"Speech Commun."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Mitsui, K., and Sawada, K. (2022, January 18\u201322). MSR-NV: Neural Vocoder Using Multiple Sampling Rates. 
Proceedings of the INTERSPEECH 2022, Incheon, Republic of Korea.","DOI":"10.21437\/Interspeech.2022-295"}],"container-title":["Computation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2079-3197\/13\/8\/177\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:14:55Z","timestamp":1760033695000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2079-3197\/13\/8\/177"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,23]]},"references-count":49,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2025,8]]}},"alternative-id":["computation13080177"],"URL":"https:\/\/doi.org\/10.3390\/computation13080177","relation":{},"ISSN":["2079-3197"],"issn-type":[{"value":"2079-3197","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,23]]}}}