{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T22:43:55Z","timestamp":1772750635398,"version":"3.50.1"},"reference-count":41,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2025,8,31]],"date-time":"2025-08-31T00:00:00Z","timestamp":1756598400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Ministry of Higher Education, Science, and Technology (Kemdiktisaintek) of the Republic of Indonesia","award":["127\/C3\/DT.05.00\/PL\/2025"],"award-info":[{"award-number":["127\/C3\/DT.05.00\/PL\/2025"]}]},{"name":"Ministry of Higher Education, Science, and Technology (Kemdiktisaintek) of the Republic of Indonesia","award":["026\/LL6\/AL.04\/2025"],"award-info":[{"award-number":["026\/LL6\/AL.04\/2025"]}]},{"name":"Ministry of Higher Education, Science, and Technology (Kemdiktisaintek) of the Republic of Indonesia","award":["081\/DPPMP\/UNISBANK\/UM\/VI\/2025"],"award-info":[{"award-number":["081\/DPPMP\/UNISBANK\/UM\/VI\/2025"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>Speech Emotion Recognition (SER) plays a vital role in supporting applications such as healthcare, human\u2013computer interaction, and security. However, many existing approaches still face challenges in achieving robust generalization and maintaining high recall, particularly for emotions related to stress and anxiety. This study proposes a dual-stream hybrid model that combines prosodic features with spatio-temporal representations derived from the Multitaper Mel-Frequency Spectrogram (MTMFS) and the Constant-Q Transform Spectrogram (CQTS). Prosodic cues, including pitch, intensity, jitter, shimmer, HNR, pause rate, and speech rate, were processed using dense layers, while MTMFS and CQTS features were encoded with CNN and BiGRU. A Multi-Head Attention mechanism was then applied to adaptively fuse the two feature streams, allowing the model to focus on the most relevant emotional cues. Evaluations conducted on the RAVDESS dataset with subject-independent 5-fold cross-validation demonstrated an accuracy of 97.64% and a macro F1-score of 0.9745. These results confirm that combining prosodic and advanced spectrogram features with attention-based fusion improves precision, recall, and overall robustness, offering a promising framework for more reliable SER systems.<\/jats:p>","DOI":"10.3390\/computers14090361","type":"journal-article","created":{"date-parts":[[2025,9,2]],"date-time":"2025-09-02T08:23:38Z","timestamp":1756801418000},"page":"361","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Prosodic Spatio-Temporal Feature Fusion with Attention Mechanisms for Speech Emotion Recognition"],"prefix":"10.3390","volume":"14","author":[{"given":"Kristiawan","family":"Nugroho","sequence":"first","affiliation":[{"name":"Department of Information Technology, Faculty of Information Technology and Industry, Universitas Stikubank, Semarang 50241, Indonesia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4998-5456","authenticated-orcid":false,"given":"Imam Husni","family":"Al Amin","sequence":"additional","affiliation":[{"name":"Department of Industrial Engineering, Faculty of Information Technology and Industry, Universitas Stikubank, Semarang 50241, Indonesia"}]},{"given":"Nina Anggraeni","family":"Noviasari","sequence":"additional","affiliation":[{"name":"Faculty of Medicine, Universitas Muhammadiyah Semarang, Semarang 50273, Indonesia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6615-4457","authenticated-orcid":false,"given":"De Rosal Ignatius Moses","family":"Setiadi","sequence":"additional","affiliation":[{"name":"Research Centre for Quantum Computing and Materials Informatics, Faculty of Computer Science, Universitas Dian Nuswantoro, Semarang 50131, Indonesia"}]}],"member":"1968","published-online":{"date-parts":[[2025,8,31]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"109613","DOI":"10.1016\/j.apacoust.2023.109613","article-title":"Speech Emotion Recognition Using the Novel PEmoNet (Parallel Emotion Network)","volume":"212","author":"Bhangale","year":"2023","journal-title":"Appl. Acoust."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Waleed, G.T., and Shaker, S.H. (2025). Speech Emotion Recognition on MELD and RAVDESS Datasets Using CNN. Information, 16.","DOI":"10.3390\/info16070518"},{"key":"ref_3","unstructured":"Liztio, L.M., Sari, C.A., Setiadi, D.R.I.M., and Rachmawanto, E.H. (2020, January 19\u201320). Gender Identification Based on Speech Recognition Using Backpropagation Neural Network. Proceedings of the 2020 International Seminar on Application for Technology of Information and Communication (iSemantic), Semarang, Indonesia."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"121270","DOI":"10.1016\/j.neuroimage.2025.121270","article-title":"Neural Entrainment to Pitch Changes of Auditory Targets in Noise","volume":"314","author":"Guo","year":"2025","journal-title":"Neuroimage"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"106169","DOI":"10.1016\/j.cognition.2025.106169","article-title":"Prosody Enhances Learning of Statistical Dependencies from Continuous Speech Streams in Adults","volume":"262","author":"Kuuluvainen","year":"2025","journal-title":"Cognition"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"103271","DOI":"10.1016\/j.specom.2025.103271","article-title":"Prosodic Modulation of Discourse Markers: A Cross-Linguistic Analysis of Conversational Dynamics","volume":"173","author":"Shan","year":"2025","journal-title":"Speech Commun."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Guo, P., Huang, S., and Li, M. (2025). DDA-MSLD: A Multi-Feature Speech Lie Detection Algorithm Based on a Dual-Stream Deep Architecture. Information, 16.","DOI":"10.3390\/info16050386"},{"key":"ref_8","first-page":"5511","article-title":"Automatic Speaker Recognition Using Mel-Frequency Cepstral Coefficients Through Machine Learning","volume":"71","author":"Ayvaz","year":"2022","journal-title":"Comput. Mater. Contin."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"012009","DOI":"10.1088\/1742-6596\/1717\/1\/012009","article-title":"Speech Processing: MFCC Based Feature Extraction Techniques- An Investigation","volume":"1717","author":"Prabakaran","year":"2021","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Sood, M., and Jain, S. (2021). Speech Recognition Employing MFCC and Dynamic Time Warping Algorithm. Innovations in Information and Communication Technologies (IICT-2020), Springer.","DOI":"10.1007\/978-3-030-66218-9_27"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"243","DOI":"10.62411\/jcta.9655","article-title":"Music-Genre Classification Using Bidirectional Long Short-Term Memory and Mel-Frequency Cepstral Coefficients","volume":"1","author":"Wijaya","year":"2024","journal-title":"J. Comput. Theor. Appl."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"107914","DOI":"10.1016\/j.knosys.2021.107914","article-title":"DeepResGRU: Residual Gated Recurrent Neural Network-Augmented Kalman Filtering for Speech Enhancement and Recognition","volume":"238","author":"Saleem","year":"2022","journal-title":"Knowl.-Based Syst."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"469","DOI":"10.1049\/iet-spr.2016.0477","article-title":"Deep Neural Network-based Linear Predictive Parameter Estimations for Speech Enhancement","volume":"11","author":"Li","year":"2017","journal-title":"IET Signal Process."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Karapiperis, S., Ellinas, N., Vioni, A., Oh, J., Jho, G., Hwang, I., and Raptis, S. (2024, January 2\u20135). Investigating Disentanglement in a Phoneme-Level Speech Codec for Prosody Modeling. Proceedings of the 2024 IEEE Spoken Language Technology Workshop (SLT), Macao, China.","DOI":"10.1109\/SLT61566.2024.10832258"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Sivasathiya, G., Kumar, A.D., Ar, H.R., and Kanishkaa, R. (2024, January 4\u20136). Emotion-Aware Multimedia Synthesis: A Generative AI Framework for Personalized Content Generation Based on User Sentiment Analysis. Proceedings of the 2024 2nd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT), Bengaluru, India.","DOI":"10.1109\/IDCIoT59759.2024.10467761"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Colunga-Rodriguez, A.A., Mart\u00ednez-Rebollar, A., Estrada-Esquivel, H., Clemente, E., and Pliego-Mart\u00ednez, O.A. (2025). Developing a Dataset of Audio Features to Classify Emotions in Speech. Computation, 13.","DOI":"10.3390\/computation13020039"},{"key":"ref_17","first-page":"617","article-title":"A Research on HMM Based Speech Recognition in Spoken English","volume":"14","author":"Wang","year":"2021","journal-title":"Recent Adv. Electr. Electron. Eng. (Former Recent Pat. Electr. Electron. Eng.)"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1878","DOI":"10.1016\/j.matpr.2021.10.097","article-title":"Speech Recognition Using HMM and Soft Computing","volume":"51","author":"Srivastava","year":"2022","journal-title":"Mater. Today Proc."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Turki, T., and Roy, S.S. (2022). Novel Hate Speech Detection Using Word Cloud Visualization and Ensemble Learning Coupled with Count Vectorizer. Appl. Sci., 12.","DOI":"10.3390\/app12136611"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1948159","DOI":"10.1155\/2022\/1948159","article-title":"Simulation of English Speech Recognition Based on Improved Extreme Random Forest Classification","volume":"2022","author":"Hao","year":"2022","journal-title":"Comput. Intell. Neurosci."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Dua, S., Kumar, S.S., Albagory, Y., Ramalingam, R., Dumka, A., Singh, R., Rashid, M., Gehlot, A., Alshamrani, S.S., and AlGhamdi, A.S. (2022). Developing a Speech Recognition System for Recognizing Tonal Speech Signals Using a Convolutional Neural Network. Appl. Sci., 12.","DOI":"10.3390\/app12126223"},{"key":"ref_22","first-page":"3425","article-title":"Combining Audio and Visual Speech Recognition Using LSTM and Deep Convolutional Neural Network","volume":"14","author":"Shashidhar","year":"2022","journal-title":"Int. J. Inf. Technol."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"109492","DOI":"10.1016\/j.apacoust.2023.109492","article-title":"Emotional Speech Recognition Using CNN and Deep Learning Techniques","volume":"211","author":"Hema","year":"2023","journal-title":"Appl. Acoust."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"30069","DOI":"10.1109\/ACCESS.2022.3159339","article-title":"Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition","volume":"10","author":"Oruh","year":"2022","journal-title":"IEEE Access"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Orken, M., Dina, O., Keylan, A., Tolganay, T., and Mohamed, O. (2022). A Study of Transformer-Based End-to-End Speech Recognition System for Kazakh Language. Sci. Rep., 12.","DOI":"10.1038\/s41598-022-12260-y"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"10028","DOI":"10.1109\/TNNLS.2022.3163771","article-title":"Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition","volume":"34","author":"Song","year":"2023","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_27","first-page":"198","article-title":"Multi-Features Audio Extraction for Speech Emotion Recognition Based on Deep Learning","volume":"14","author":"Gondohanindijo","year":"2023","journal-title":"Int. J. Adv. Comput. Sci. Appl."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Tyagi, S., and Sz\u00e9n\u00e1si, S. (2024). Optimizing Speech Emotion Recognition with Deep Learning and Grey Wolf Optimization: A Multi-Dataset Approach. Algorithms, 17.","DOI":"10.3390\/a17030090"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Bhanbhro, J., Memon, A.A., Lal, B., Talpur, S., and Memon, M. (2025). Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models. Signals, 6.","DOI":"10.3390\/signals6020022"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yu, S., Meng, J., Fan, W., Chen, Y., Zhu, B., Yu, H., Xie, Y., and Sun, Q. (2024). Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion. Electronics, 13.","DOI":"10.3390\/electronics13112191"},{"key":"ref_31","first-page":"316","article-title":"A Deep Learning Model for Speech Emotion Recognition on RAVDESS Dataset","volume":"16","author":"Wei","year":"2025","journal-title":"Int. J. Adv. Comput. Sci. Appl."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Makhmudov, F., Kutlimuratov, A., and Cho, Y.-I. (2024). Hybrid LSTM\u2013Attention and CNN Model for Enhanced Speech Emotion Recognition. Appl. Sci., 14.","DOI":"10.3390\/app142311342"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"128039","DOI":"10.1109\/ACCESS.2024.3447770","article-title":"Accuracy Enhancement Method for Speech Emotion Recognition From Spectrogram Using Temporal Frequency Correlation and Positional Information Learning Through Knowledge Transfer","volume":"12","author":"Kim","year":"2024","journal-title":"IEEE Access"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"4152","DOI":"10.21437\/Interspeech.2022-726","article-title":"ADFF: Attention Based Deep Feature Fusion Approach for Music Emotion Recognition","volume":"Volume 2022-Septe","author":"Huang","year":"2022","journal-title":"Proceedings of the Interspeech 2022"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"2028","DOI":"10.1109\/LSP.2022.3208411","article-title":"Multitaper-Mel Spectrograms for Keyword Spotting","volume":"29","author":"Bakri","year":"2022","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"McAllister, T., and Gamb\u00e4ck, B. (2022). Music Style Transfer Using Constant-Q Transform Spectrograms. Artificial Intelligence in Music, Sound, Art and Design, Springer International Publishing.","DOI":"10.1007\/978-3-031-03789-4_13"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"498","DOI":"10.1016\/j.aej.2024.07.081","article-title":"Enhancing Emotion Prediction Using Deep Learning and Distributed Federated Systems with SMOTE Oversampling Technique","volume":"108","author":"Raju","year":"2024","journal-title":"Alex. Eng. J."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Ding, Z., Wang, Z., Zhang, Y., Cao, Y., Liu, Y., Shen, X., Tian, Y., and Dai, J. (2025). Trade-Offs between Machine Learning and Deep Learning for Mental Illness Detection on Social Media. Sci. Rep., 15.","DOI":"10.1038\/s41598-025-99167-6"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"166","DOI":"10.1007\/s44163-025-00412-8","article-title":"Physiological Signal-Based Mental Stress Detection Using Hybrid Deep Learning Models","volume":"5","author":"Modi","year":"2025","journal-title":"Discov. Artif. Intell."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"124","DOI":"10.62411\/faith.2024-22","article-title":"A Reinforcement Learning-Based Approach for Promoting Mental Health Using Multimodal Emotion Recognition","volume":"1","author":"Pathirana","year":"2024","journal-title":"J. Futur. Artif. Intell. Technol."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Wang, Y., Huang, J., Zhao, Z., Lan, H., and Zhang, X. (2024). Speech Emotion Recognition Using Multi-Scale Global\u2013Local Representation Learning with Feature Pyramid Network. Appl. Sci., 14.","DOI":"10.20944\/preprints202410.1002.v1"}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/9\/361\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:36:32Z","timestamp":1760034992000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/9\/361"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,31]]},"references-count":41,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2025,9]]}},"alternative-id":["computers14090361"],"URL":"https:\/\/doi.org\/10.3390\/computers14090361","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,31]]}}}