{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,11]],"date-time":"2026-06-11T16:08:50Z","timestamp":1781194130266,"version":"3.54.1"},"reference-count":43,"publisher":"MDPI AG","issue":"16","license":[{"start":{"date-parts":[[2022,8,9]],"date-time":"2022-08-09T00:00:00Z","timestamp":1660003200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001863","name":"New Energy and Industrial Technology Development Organization (NEDO)","doi-asserted-by":"publisher","award":["JPNP20006"],"award-info":[{"award-number":["JPNP20006"]}],"id":[{"id":"10.13039\/501100001863","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Data augmentation techniques have recently gained more adoption in speech processing, including speech emotion recognition. Although more data tend to be more effective, there may be a trade-off in which more data will not provide a better model. This paper reports experiments on investigating the effects of data augmentation in speech emotion recognition. The investigation aims at finding the most useful type of data augmentation and the number of data augmentations for speech emotion recognition in various conditions. The experiments are conducted on the Japanese Twitter-based emotional speech and IEMOCAP datasets. The results show that for speaker-independent data, two data augmentations with glottal source extraction and silence removal exhibited the best performance among others, even with more data augmentation techniques. For the text-independent data (including speaker and text-independent), more data augmentations tend to improve speech emotion recognition performances. 
The results highlight the trade-off between the number of data augmentations and the performance of speech emotion recognition showing the necessity to choose a proper data augmentation technique for a specific condition.<\/jats:p>","DOI":"10.3390\/s22165941","type":"journal-article","created":{"date-parts":[[2022,8,10]],"date-time":"2022-08-10T04:20:32Z","timestamp":1660105232000},"page":"5941","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":35,"title":["Effects of Data Augmentations on Speech Emotion Recognition"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1560-2824","authenticated-orcid":false,"given":"Bagus Tris","family":"Atmaja","sequence":"first","affiliation":[{"name":"National Institute of Advanced Industrial Science and Technology, Tsukuba 305-8560, Japan"},{"name":"Institut Teknologi Sepuluh Nopember, Surabaya 60111, Indonesia"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Akira","family":"Sasou","sequence":"additional","affiliation":[{"name":"National Institute of Advanced Industrial Science and Technology, Tsukuba 305-8560, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2022,8,9]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"572","DOI":"10.1016\/j.patcog.2010.09.020","article-title":"Survey on speech emotion recognition: Features, classification schemes, and databases","volume":"44","author":"Kamel","year":"2011","journal-title":"Pattern Recognit."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1016\/j.specom.2019.12.001","article-title":"Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers","volume":"116","author":"Oguz","year":"2020","journal-title":"Speech Commun."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Zhang, Y., 
Park, D.S., Han, W., Qin, J., Gulati, A., Shor, J., Jansen, A., Xu, Y., Huang, Y., and Wang, S. (IEEE J. Sel. Top. Signal Process., 2022). BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition, IEEE J. Sel. Top. Signal Process., early access.","DOI":"10.1109\/JSTSP.2022.3182537"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. (2015, January 6\u201310). Audio Augmentations for Speech Recognition. Proceedings of the Interspeech 2015, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-711"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Casanova, E., Candido, A., Fernandes, R.C., Finger, M., Gris, L.R.S., Ponti, M.A., and Pinto da Silva, D.P. (September, January 30). Transfer Learning and Data Augmentation Techniques to the COVID-19 Identification Tasks in ComParE 2021. Proceedings of the 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, Czech Republic.","DOI":"10.21437\/Interspeech.2021-1798"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"362","DOI":"10.18178\/ijmlc.2021.11.5.1062","article-title":"Effect of Training Data Selection for Speech Recognition of Emotional Speech","volume":"11","author":"Yamada","year":"2021","journal-title":"Int. J. Mach. Learn. Comput."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Horii, D., Ito, A., and Nose, T. (2021, January 10\u201312). Analysis of Feature Extraction by Convolutional Neural Network for Speech Emotion Recognition. Proceedings of the 2021 IEEE 10th Global Conference on Consumer Electronics, Las Vegas, NV, USA.","DOI":"10.1109\/GCCE53005.2021.9621964"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1007\/s10579-008-9076-6","article-title":"IEMOCAP: Interactive emotional dyadic motion capture database","volume":"42","author":"Busso","year":"2008","journal-title":"Lang. Resour. 
Eval."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Yang, S.w., Chi, P.H., Chuang, Y.S., Lai, C.I.J., Lakhotia, K., Lin, Y.Y., Liu, A.T., Shi, J., Chang, X., and Lin, G.T. (September, January 30). SUPERB: Speech Processing Universal PERformance Benchmark. Proceedings of the Interspeech 2021, Brno, Czech Republic.","DOI":"10.21437\/Interspeech.2021-1775"},{"key":"ref_10","first-page":"6269","article-title":"Emotion Recognition by Fusing Time Synchronous and Time Asynchronous Representations","volume":"Volume 2021-June","author":"Wu","year":"2021","journal-title":"Proceedings of the ICASSP 2021\u20142021 IEEE International Conference on Acoustics, Speech and Signal Processing"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Chen, S., Wu, Y., Wang, C., Chen, Z., Chen, Z., Liu, S., Wu, J., Qian, Y., Wei, F., and Li, J. (2022, January 22\u201327). Unispeech-Sat: Universal Speech Representation Learning With Speaker Aware Pre-Training. Proceedings of the ICASSP 2022\u20142022 IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore.","DOI":"10.1109\/ICASSP43922.2022.9747077"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Takeishi, E., Nose, T., Chiba, Y., and Ito, A. (2016, January 26\u201328). Construction and analysis of phonetically and prosodically balanced emotional speech database. Proceedings of the 2016 Conference of the Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2016, Bali, Indonesia.","DOI":"10.1109\/ICSDA.2016.7918977"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Lee, S.w. (2019, January 12\u201317). The Generalization Effect for Multilingual Speech Emotion Recognition across Heterogeneous Languages. 
Proceedings of the ICASSP 2019\u20142019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683046"},{"key":"ref_14","unstructured":"Nagase, R., Fukumori, T., and Yamashita, Y. (2021, January 14\u201317). Speech Emotion Recognition with Fusion of Acoustic- and Linguistic-Feature-Based Decisions. Proceedings of the APSIPA Annual Summit and Conference, Tokyo, Japan."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Atmaja, B.T., and Sasou, A. (2021, January 7\u201310). Effect of different splitting criteria on the performance of speech emotion recognition. Proceedings of the TENCON 2021\u20142021 IEEE Region 10 Conference (TENCON), Auckland, New Zealand.","DOI":"10.1109\/TENCON54134.2021.9707265"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Chiba, Y., Nose, T., and Ito, A. (2020, January 25\u201329). Multi-Stream Attention-Based BLSTM with Feature Segmentation for Speech Emotion Recognition. Proceedings of the Interspeech 2020, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-1199"},{"key":"ref_17","unstructured":"Atmaja, B.T., and Akagi, M. (2020, January 7\u201310). Deep Multilayer Perceptrons for Dimensional Speech Emotion Recognition. Proceedings of the 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020, Auckland, New Zealand."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Atmaja, B.T., and Akagi, M. (2019, January 16\u201318). Speech Emotion Recognition Based on Speech Segment Using LSTM with Attention Model. 
Proceedings of the 2019 IEEE International Conference on Signals and Systems, Bandung, Indonesia.","DOI":"10.1109\/ICSIGSYS.2019.8811080"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"107316","DOI":"10.1016\/j.knosys.2021.107316","article-title":"A multimodal hierarchical approach to speech emotion recognition from audio and text","volume":"229","author":"Singh","year":"2021","journal-title":"Knowl. Based Syst."},{"key":"ref_20","unstructured":"Chernykh, V., and Prikhodko, P. (2017). Emotion Recognition From Speech With Recurrent Neural Networks. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Busso, C., and Narayanan, S.S. (2008, January 22\u201326). Scripted dialogs versus improvisation: Lessons learned about emotional elicitation techniques from the IEMOCAP database. Proceedings of the Annual Conference of the Interspeech, Brisbane, Australia.","DOI":"10.21437\/Interspeech.2008-463"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"354","DOI":"10.1080\/17445760.2019.1626854","article-title":"Speech emotion recognition based on hierarchical attributes using feature nets","volume":"35","author":"Zhao","year":"2019","journal-title":"Int. J. Parallel Emergent Distrib. Syst."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"88","DOI":"10.1121\/1.413664","article-title":"Analysis of the glottal excitation of emotionally styled and stressed speech","volume":"98","author":"Cummings","year":"1995","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"313","DOI":"10.1006\/jpho.1996.0017","article-title":"Physical variations related to stress and emotional state: A preliminary study","volume":"24","author":"Laukkanen","year":"1996","journal-title":"J. 
Phon."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"30","DOI":"10.1016\/j.jvoice.2008.04.004","article-title":"Perception of Emotional Valences and Activity Levels from Vowel Segments of Continuous Speech","volume":"24","author":"Waaramaa","year":"2010","journal-title":"J. Voice"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1016\/S0167-6393(02)00082-1","article-title":"The role of voice quality in communicating emotion, mood and attitude","volume":"40","author":"Gobl","year":"2003","journal-title":"Speech Commun."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"328","DOI":"10.1016\/j.neucom.2020.06.010","article-title":"Exploration of glottal characteristics and the vocal folds behavior for the speech under emotion","volume":"410","author":"Yao","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"113","DOI":"10.1016\/j.specom.2018.07.002","article-title":"Glottal inverse filtering by combining a constrained LP and an HMM-based generative model of glottal flow derivative","volume":"104","author":"Sasou","year":"2018","journal-title":"Speech Commun."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"E7856","DOI":"10.1073\/pnas.1612524113","article-title":"Statistics of natural reverberation enable perceptual separation of sound and space","volume":"113","author":"Traer","year":"2016","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_30","unstructured":"(2022, July 12). EchoThief Impulse Response Library. Available online: http:\/\/www.echothief.com\/downloads\/."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Piczak, K.J. (2015, January 26\u201330). ESC: Dataset for environmental sound classification. 
Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia.","DOI":"10.1145\/2733373.2806390"},{"key":"ref_32","unstructured":"Johannes, W., Andreas, T., Hagen, W., Maximilian, S., Florian, E., and Schuller, B.W. (2022). Model for Dimensional Speech Emotion Recognition based on Wav2vec 2.0 (1.1.0). Zenodo."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Wagner, J., Triantafyllopoulos, A., Wierstorf, H., Schmitt, M., Burkhardt, F., Eyben, F., and Schuller, B.W. (2022). Dawn of the transformer era in speech emotion recognition: Closing the valence gap. arXiv.","DOI":"10.1109\/TPAMI.2023.3263585"},{"key":"ref_34","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1109\/TAFFC.2016.2515617","article-title":"MSP-IMPROV: An Acted Corpus of Dyadic Interactions to Study Emotion Perception","volume":"8","author":"Busso","year":"2017","journal-title":"Trans. Affect. Comput."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1016\/j.specom.2022.03.002","article-title":"Survey on bimodal speech emotion recognition from acoustic and linguistic information fusion","volume":"140","author":"Atmaja","year":"2022","journal-title":"Speech Commun."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"012004","DOI":"10.1088\/1742-6596\/1896\/1\/012004","article-title":"Evaluation of Error and Correlation-Based Loss Functions For Multitask Learning Dimensional Speech Emotion Recognition","volume":"1896","author":"Atmaja","year":"2020","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Satt, A., Rozenberg, S., and Hoory, R. (2017, January 20\u201324). Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms. 
Proceedings of the Interspeech 2017, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-200"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Horii, D., Ito, A., and Nose, T. (2021, January 27\u201329). Analysis of Effectiveness of Feature Extraction by CNN for Speech Emotion Recognition. Proceedings of the ASJ Autumn Meeting, Austin, TX, USA.","DOI":"10.1109\/GCCE53005.2021.9621964"},{"key":"ref_40","unstructured":"Rintala, J. (2020). Speech Emotion Recognition from Raw Audio Using Deep Learning. [Ph.D. Thesis, Royal Institute of Technology (KTH)]."},{"key":"ref_41","first-page":"2825","article-title":"Scikit-learn: Machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_42","unstructured":"Jordal, I. (2022, June 13). Audiomentations. Available online: https:\/\/github.com\/iver56\/audiomentations."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Pepino, L., Riera, P., Ferrer, L., and Gravano, A. (2020, January 4\u20138). Fusion Approaches for Emotion Recognition from Speech Using Acoustic and Text-Based Features. 
Proceedings of the ICASSP 2020\u20142020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9054709"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/16\/5941\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T00:06:20Z","timestamp":1760141180000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/16\/5941"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,8,9]]},"references-count":43,"journal-issue":{"issue":"16","published-online":{"date-parts":[[2022,8]]}},"alternative-id":["s22165941"],"URL":"https:\/\/doi.org\/10.3390\/s22165941","relation":{"has-preprint":[{"id-type":"doi","id":"10.20944\/preprints202208.0109.v1","asserted-by":"object"}]},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,8,9]]}}}