{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,15]],"date-time":"2026-05-15T18:49:22Z","timestamp":1778870962021,"version":"3.51.4"},"reference-count":56,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2025,8,14]],"date-time":"2025-08-14T00:00:00Z","timestamp":1755129600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Multimedia University","award":["PostDoc(MMUI\/240029)"],"award-info":[{"award-number":["PostDoc(MMUI\/240029)"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Imaging"],"abstract":"<jats:p>Emotion recognition in speech is essential for enhancing human\u2013computer interaction (HCI) systems. Despite progress in Bangla speech emotion recognition, challenges remain, including low accuracy, speaker dependency, and poor generalization across emotional expressions. Previous approaches often rely on traditional machine learning or basic deep learning models, struggling with robustness and accuracy in noisy or varied data. In this study, we propose a novel multi-stream deep learning feature fusion approach for Bangla speech emotion recognition, addressing the limitations of existing methods. Our approach begins with various data augmentation techniques applied to the training dataset, enhancing the model\u2019s robustness and generalization. We then extract a comprehensive set of handcrafted features, including Zero-Crossing Rate (ZCR), chromagram, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Mel-spectrogram. Although these features are used as 1D numerical vectors, some of them are computed from time\u2013frequency representations (e.g., chromagram, Mel-spectrogram) that can themselves be depicted as images, which is conceptually close to imaging-based analysis. These features capture key characteristics of the speech signal, providing valuable insights into the emotional content. Sequentially, we utilize a multi-stream deep learning architecture to automatically learn complex, hierarchical representations of the speech signal. This architecture consists of three distinct streams: the first stream uses 1D convolutional neural networks (1D CNNs), the second integrates 1D CNN with Long Short-Term Memory (LSTM), and the third combines 1D CNNs with bidirectional LSTM (Bi-LSTM). These models capture intricate emotional nuances that handcrafted features alone may not fully represent. For each of these models, we generate predicted scores and then employ ensemble learning with a soft voting technique to produce the final prediction. This fusion of handcrafted features, deep learning-derived features, and ensemble voting enhances the accuracy and robustness of emotion identification across multiple datasets. Our method demonstrates the effectiveness of combining various learning models to improve emotion recognition in Bangla speech, providing a more comprehensive solution compared with existing methods. We utilize three primary datasets\u2014SUBESCO, BanglaSER, and a merged version of both\u2014as well as two external datasets, RAVDESS and EMODB, to assess the performance of our models. Our method achieves impressive results with accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% for the SUBESCO, BanglaSER, merged SUBESCO and BanglaSER, RAVDESS, and EMODB datasets, respectively. These results demonstrate the effectiveness of combining handcrafted features with deep learning-based features through ensemble learning for robust emotion recognition in Bangla speech.<\/jats:p>","DOI":"10.3390\/jimaging11080273","type":"journal-article","created":{"date-parts":[[2025,8,14]],"date-time":"2025-08-14T15:16:37Z","timestamp":1755184597000},"page":"273","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Bangla Speech Emotion Recognition Using Deep Learning-Based Ensemble Learning and Feature Fusion"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-9610-9867","authenticated-orcid":false,"given":"Md. Shahid Ahammed","family":"Shakil","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, Pabna University of Science and Technology, Pabna 6600, Bangladesh"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2625-2348","authenticated-orcid":false,"given":"Fahmid Al","family":"Farid","sequence":"additional","affiliation":[{"name":"Centre for Image and Vision Computing (CIVC), COE for Artificial Intelligence, Faculty of Artificial Intelligence and Engineering (FAIE), Multimedia University, Cyberjaya 63100, Selangor, Malaysia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nitun Kumar","family":"Podder","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Pabna University of Science and Technology, Pabna 6600, Bangladesh"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"S. M. Hasan Sazzad","family":"Iqbal","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Pabna University of Science and Technology, Pabna 6600, Bangladesh"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1238-0464","authenticated-orcid":false,"given":"Abu Saleh Musa","family":"Miah","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Bangladesh Army University of Science and Technology (BAUST), Saidpur 5311, Bangladesh"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2300-1420","authenticated-orcid":false,"given":"Md Abdur","family":"Rahim","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Pabna University of Science and Technology, Pabna 6600, Bangladesh"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7613-4596","authenticated-orcid":false,"given":"Hezerul Abdul","family":"Karim","sequence":"additional","affiliation":[{"name":"Centre for Image and Vision Computing (CIVC), COE for Artificial Intelligence, Faculty of Artificial Intelligence and Engineering (FAIE), Multimedia University, Cyberjaya 63100, Selangor, Malaysia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,8,14]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"020105","DOI":"10.1063\/1.5005438","article-title":"Speech emotion recognition methods: A literature review","volume":"1891","author":"Moradhaseli","year":"2017","journal-title":"AIP Conf. Proc."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"102974","DOI":"10.1016\/j.specom.2023.102974","article-title":"Speech emotion recognition approaches: A systematic review","volume":"154","author":"Alghamdi","year":"2023","journal-title":"Speech Commun."},{"key":"ref_3","first-page":"191393","article-title":"Eye Disease Detection Enhancement Using a Multi-Stage Deep Learning Approach","volume":"12","author":"Muntaqim","year":"2024","journal-title":"IEEE Access"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Hossain, M.M., Chowdhury, Z.R., Akib, S.M.R.H., Ahmed, M.S., Hossain, M.M., and Miah, A.S.M. (2023, January 13\u201315). Crime Text Classification and Drug Modeling from Bengali News Articles: A Transformer Network-Based Deep Learning Approach. Proceedings of the 2023 26th International Conference on Computer and Information Technology (ICCIT), Cox\u2019s Bazar, Bangladesh.","DOI":"10.1109\/ICCIT60459.2023.10441195"},{"key":"ref_5","first-page":"1690","article-title":"An Enhanced Hybrid Model Based on CNN and BiLSTM for Identifying Individuals via Handwriting Analysis","volume":"140","author":"Rahim","year":"2024","journal-title":"CMES-Comput. Model. Eng. Sci."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"117327","DOI":"10.1109\/ACCESS.2019.2936124","article-title":"Speech emotion recognition using deep learning techniques: A review","volume":"7","author":"Khalil","year":"2019","journal-title":"IEEE Access"},{"key":"ref_7","unstructured":"Saad, F., Mahmud, H., Shaheen, M., Hasan, M.K., and Farastu, P. (2021). Is Speech Emotion Recognition Language-Independent? Analysis of English and Bangla Languages using Language-Independent Vocal Features. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Chakraborty, C., Dash, T.K., Panda, G., and Solanki, S.S. (2022). Phase-based Cepstral features for Automatic Speech Emotion Recognition of Low Resource Indian languages. Transactions on Asian and Low-Resource Language Information Processing, Association for Computing Machinery.","DOI":"10.1145\/3563944"},{"key":"ref_9","unstructured":"Ma, E. (2023, March 08). Data Augmentation for Audio. Medium, Available online: https:\/\/medium.com\/@makcedward\/data-augmentation-for-audio-76912b01fdf6."},{"key":"ref_10","unstructured":"Rintala, J. (2020). Speech Emotion Recognition from Raw Audio using Deep Learning, School of Electrical Engineering and Computer Science Royal Institute of Technology (KTH)."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Tusher, M.M.R., Al Farid, F., Kafi, H.M., Miah, A.S.M., Rinky, S.R., Islam, M., Rahim, M.A., Mansor, S., and Karim, H.A. (Comput. Vis. Pattern Recognit., 2024). BanTrafficNet: Bangladeshi Traffic Sign Recognition Using A Lightweight Deep Learning Approach, Comput. Vis. Pattern Recognit., preprint.","DOI":"10.21203\/rs.3.rs-4216970\/v1"},{"key":"ref_12","unstructured":"Siddiqua, A., Hasan, R., Rahman, A., and Miah, A.S.M. (2024). Computer-Aided Osteoporosis Diagnosis Using Transfer Learning with Enhanced Features from Stacked Deep Learning Modules. arXiv."},{"key":"ref_13","first-page":"2633","article-title":"Development of a Lightweight Model for Handwritten Dataset Recognition: Bangladeshi City Names in Bangla Script","volume":"80","author":"Tusher","year":"2024","journal-title":"Comput. Mater. Contin."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"564","DOI":"10.1109\/ACCESS.2021.3136251","article-title":"Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks","volume":"10","author":"Sultana","year":"2021","journal-title":"IEEE Access"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Rahman, M.M., Dipta, D.R., and Hasan, M.M. (2018, January 8\u20139). Dynamic time warping assisted SVM classifier for Bangla speech recognition. Proceedings of the 2018 International Conference on Computer, Communication, Chemical, Material and Electronic Engineering (IC4ME2), Rajshahi, Bangladesh.","DOI":"10.1109\/IC4ME2.2018.8465640"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Issa, D., Demirci, M.F., and Yazici, A. (2020). Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control, 59.","DOI":"10.1016\/j.bspc.2020.101894"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"312","DOI":"10.1016\/j.bspc.2018.08.035","article-title":"Speech emotion recognition using deep 1D & 2D CNN LSTM networks","volume":"47","author":"Zhao","year":"2019","journal-title":"Biomed. Signal Process. Control"},{"key":"ref_18","first-page":"4039","article-title":"1D-CNN: Speech emotion recognition system using a stacked network with dilated CNN features","volume":"67","author":"Mustaqeem","year":"2021","journal-title":"CMC-Comput. Mater. Contin."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Badshah, A.M., Ahmad, J., Rahim, N., and Baik, S.W. (2017, January 13\u201315). Speech emotion recognition from spectrograms with deep convolutional neural network. Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea.","DOI":"10.1109\/PlatCon.2017.7883728"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Etienne, C., Fidanza, G., Petrovskii, A., Devillers, L., and Schmauch, B. (2018). Cnn+ lstm architecture for speech emotion recognition with data augmentation. arXiv.","DOI":"10.21437\/SMM.2018-5"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"199909","DOI":"10.1109\/ACCESS.2020.3035910","article-title":"Ensemble learning with attention-integrated convolutional recurrent neural network for imbalanced speech emotion recognition","volume":"8","author":"Ai","year":"2020","journal-title":"IEEE Access"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Kwon, S. (2019). A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors, 20.","DOI":"10.3390\/s20010183"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Zheng, W., Yu, J., and Zou, Y. (2015, January 21\u201324). An experimental study of speech emotion recognition based on deep convolutional neural networks. Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), Xi\u2019an, China.","DOI":"10.1109\/ACII.2015.7344669"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1845","DOI":"10.1007\/s40747-020-00250-4","article-title":"Cross corpus multi-lingual speech emotion recognition using ensemble learning","volume":"7","author":"Zehra","year":"2021","journal-title":"Complex Intell. Syst."},{"key":"ref_25","first-page":"75","article-title":"Exploring Deep Learning Methods for Audio Speech Emotion Detection: An Ensemble MFCCs, CNNs and LSTM","volume":"19","author":"Basha","year":"2025","journal-title":"Appl. Math"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1109\/MSP.2021.3106895","article-title":"Emotion Recognition From Multiple Modalities: Fundamentals and methodologies","volume":"38","author":"Zhao","year":"2021","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Fu, B., Gu, C., Fu, M., Xia, Y., and Liu, Y. (2023). A novel feature fusion network for multimodal emotion recognition from EEG and eye movement signals. Front. Neurosci., 17.","DOI":"10.3389\/fnins.2023.1234162"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Sultana, S., Rahman, M.S., Selim, M.R., and Iqbal, M.Z. (2021). SUST Bangla Emotional Speech Corpus (SUBESCO): An audio-only emotional speech corpus for Bangla. PLoS ONE, 16.","DOI":"10.1371\/journal.pone.0250173"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"108091","DOI":"10.1016\/j.dib.2022.108091","article-title":"BanglaSER: A speech emotion recognition dataset for the Bangla language","volume":"42","author":"Das","year":"2022","journal-title":"Data Brief"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Livingstone, S.R., and Russo, F.A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13.","DOI":"10.1371\/journal.pone.0196391"},{"key":"ref_31","unstructured":"Burkhardt, F., Paeschke, A., Kienast, M., Sendlmeier, W.F., and Weiss, B. (2022, December 05). Berlin EmoDB (1.3.0) [Data Set]. Zenodo, Available online: https:\/\/zenodo.org\/records\/7447302."},{"key":"ref_32","unstructured":"Paiva, L.F., Alfaro-Espinoza, E., Almeida, V.M., Felix, L.B., and Neves, R.V.A. (2022). A Survey of Data Augmentation for Audio Classification. Proceedings of the XXIV Congresso Brasileiro de Autom\u00e1tica (CBA), Available online: https:\/\/sba.org.br\/open_journal_systems\/index.php\/cba\/article\/view\/3469."},{"key":"ref_33","unstructured":"McFee, B., Raffel, C., Liang, D., Ellis, D.P., Battenberg, E., Nieto, O., Dieleman, S., Tokunaga, H., McQuin, P., and NumPy (2024, January 04). librosa\/librosa: 0.10.1. Available online: https:\/\/zenodo.org\/records\/8252662."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"McFee, B., Raffel, C., Liang, D., Ellis, D.P., McVicar, M., Battenberg, E., and Nieto, O. (2015, January 6\u201312). librosa: Audio and music signal analysis in python. Proceedings of the 14th Python in Science Conference, Austin, TX, USA.","DOI":"10.25080\/Majora-7b98e3ed-003"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1038\/s41586-020-2649-2","article-title":"Array programming with NumPy","volume":"585","author":"Harris","year":"2020","journal-title":"Nature"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"3","DOI":"10.4316\/AECE.2023.03001","article-title":"Exploring the Impact of Data Augmentation Techniques on Automatic Speech Recognition System Development: A Comparative Study","volume":"23","author":"Galic","year":"2023","journal-title":"Adv. Electr. Comput. Eng."},{"key":"ref_37","unstructured":"Titeux, N. (2023). Everything You Need to Know About Pitch Shifting, Nicolas Titeux. Available online: https:\/\/www.nicolastiteux.com\/en\/blog\/everything-you-need-to-know-about-pitch-shifting\/."},{"key":"ref_38","unstructured":"Jordal, I. (2023, January 04). Gain; Gain\u2014Audiomentations Documentation. Available online: https:\/\/iver56.github.io\/audiomentations\/waveform_transforms\/gain\/."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"1477","DOI":"10.1109\/PROC.1986.13663","article-title":"Spectral analysis and discrimination by zero-crossings","volume":"74","author":"Kedem","year":"1986","journal-title":"Proc. IEEE"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Rezapour Mashhadi, M.M., and Osei-Bonsu, K. (2023). Speech Emotion Recognition Using Machine Learning Techniques: Feature Extraction and Comparison of Convolutional Neural Network and Random Forest. PLoS ONE, 18.","DOI":"10.1371\/journal.pone.0291500"},{"key":"ref_41","unstructured":"Shah, A., Kattel, M., Nepal, A., and Shrestha, D. (2025, January 05). Chroma Feature Extraction Using Fourier Transform. In Proceedings of the Conference at Kathmandu University, Nepal, January 2019. Available online: https:\/\/www.researchgate.net\/publication\/330796993_Chroma_Feature_Extraction."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"107020","DOI":"10.1016\/j.apacoust.2019.107020","article-title":"Trends in Audio Signal Feature Extraction Methods","volume":"158","author":"Sharma","year":"2020","journal-title":"Appl. Acoust."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"87","DOI":"10.3991\/ijes.v9i2.22983","article-title":"An Analysis of the Impact of Spectral Contrast Feature in Speech Emotion Recognition","volume":"9","author":"Kumar","year":"2021","journal-title":"Int. J. Recent Contrib. Eng. Sci. IT (iJES)"},{"key":"ref_44","unstructured":"West, K., and Cox, S. (2005, January 11\u201315). Finding An Optimal Segmentation for Audio Genre Classification. Proceedings of the 6th International Conference on Music Information Retrieval, ISMIR 2005, London, UK."},{"key":"ref_45","unstructured":"Peeters, G. (2004). A Large Set of Audio Features for Sound Description (Similarity and Classification) in the CUIDADO Project, Ircam."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1109\/TAFFC.2020.3032373","article-title":"Audio Features for Music Emotion Recognition: A Survey","volume":"14","author":"Panda","year":"2023","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_47","unstructured":"Roberts, L. (2023, January 05). Understanding the Mel Spectrogram. Medium, Available online: https:\/\/medium.com\/analytics-vidhya\/understanding-the-mel-spectrogram-fca2afa2ce53."},{"key":"ref_48","first-page":"495","article-title":"Introduction to convolutional neural networks","volume":"5","author":"Wu","year":"2017","journal-title":"Natl. Key Lab Nov. Softw. Technol. Nanjing Univ. China"},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1186\/s40537-021-00444-8","article-title":"Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions","volume":"8","author":"Alzubaidi","year":"2021","journal-title":"J. Big Data"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Zafar, A., Aamir, M., Mohd Nawi, N., Arshad, A., Riaz, S., Alruban, A., Dutta, A., and Alaybani, S. (2022). A Comparison of Pooling Methods for Convolutional Neural Networks. Appl. Sci., 12.","DOI":"10.3390\/app12178643"},{"key":"ref_51","unstructured":"Ioffe, S., and Szegedy, C. (2015). Batch Normalization: Accelerating deep network training by reducing internal covariate shift. arXiv."},{"key":"ref_52","first-page":"1929","article-title":"Dropout: A simple way to prevent neural networks from overfitting","volume":"15","author":"Srivastava","year":"2014","journal-title":"J. Mach. Learn. Res."},{"key":"ref_53","unstructured":"GeeksforGeeks (2024). What Is a Neural Network Flatten Layer?, GeeksforGeeks. Available online: https:\/\/www.geeksforgeeks.org\/what-is-a-neural-network-flatten-layer\/."},{"key":"ref_54","unstructured":"Nwankpa, C., Ijomah, W., Gachagan, A., and Marshall, S. (2018). Activation Functions: Comparison of trends in Practice and Research for Deep Learning. arXiv."},{"key":"ref_55","unstructured":"Kingma, D.P., and Ba, J. (2017). Adam: A Method for Stochastic Optimization. arXiv."},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"757","DOI":"10.1016\/j.jksuci.2023.01.014","article-title":"A Comprehensive Review on Ensemble Deep Learning: Opportunities and Challenges","volume":"35","author":"Mohammed","year":"2023","journal-title":"J. King Saud Univ. Comput. Inf. Sci."}],"container-title":["Journal of Imaging"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2313-433X\/11\/8\/273\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:27:34Z","timestamp":1760034454000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2313-433X\/11\/8\/273"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,14]]},"references-count":56,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2025,8]]}},"alternative-id":["jimaging11080273"],"URL":"https:\/\/doi.org\/10.3390\/jimaging11080273","relation":{},"ISSN":["2313-433X"],"issn-type":[{"value":"2313-433X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,14]]}}}