{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T15:05:55Z","timestamp":1778079955902,"version":"3.51.4"},"reference-count":60,"publisher":"MDPI AG","issue":"14","license":[{"start":{"date-parts":[[2021,7,19]],"date-time":"2021-07-19T00:00:00Z","timestamp":1626652800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Science Foundation","award":["1846658"],"award-info":[{"award-number":["1846658"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Decades of scientific research have been conducted on developing and evaluating methods for automated emotion recognition. With exponentially growing technology, there is a wide range of emerging applications that require emotional state recognition of the user. This paper investigates a robust approach for multimodal emotion recognition during a conversation. Three separate models for audio, video and text modalities are structured and fine-tuned on the MELD. In this paper, a transformer-based crossmodality fusion with the EmbraceNet architecture is employed to estimate the emotion. The proposed multimodal network architecture can achieve up to 65% accuracy, which significantly surpasses any of the unimodal models. 
We provide multiple evaluation techniques applied to our work to show that our model is robust and can even outperform the state-of-the-art models on the MELD.<\/jats:p>","DOI":"10.3390\/s21144913","type":"journal-article","created":{"date-parts":[[2021,7,19]],"date-time":"2021-07-19T10:07:37Z","timestamp":1626689257000},"page":"4913","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":87,"title":["Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5080-198X","authenticated-orcid":false,"given":"Baijun","family":"Xie","sequence":"first","affiliation":[{"name":"Department of Biomedical Engineering, School of Engineering and Applied Science, George Washington University, Washington, DC 20052, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9381-9506","authenticated-orcid":false,"given":"Mariia","family":"Sidulova","sequence":"additional","affiliation":[{"name":"Department of Biomedical Engineering, School of Engineering and Applied Science, George Washington University, Washington, DC 20052, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0742-6541","authenticated-orcid":false,"given":"Chung Hyuk","family":"Park","sequence":"additional","affiliation":[{"name":"Department of Biomedical Engineering, School of Engineering and Applied Science, George Washington University, Washington, DC 20052, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,7,19]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Han, K., Yu, D., and Tashev, I. (2014, January 14\u201318). Speech emotion recognition using deep neural network and extreme learning machine. 
Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association, Singapore.","DOI":"10.21437\/Interspeech.2014-57"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1440","DOI":"10.1109\/LSP.2018.2860246","article-title":"3-D convolutional recurrent neural networks with attention model for speech emotion recognition","volume":"25","author":"Chen","year":"2018","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Nediyanchath, A., Paramasivam, P., and Yenigalla, P. (2020, January 4\u20138). Multi-head attention for speech emotion recognition with auxiliary learning of gender recognition. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9054073"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"309","DOI":"10.1016\/j.chb.2018.12.029","article-title":"Understanding emotions in text using deep learning and big data","volume":"93","author":"Chatterjee","year":"2019","journal-title":"Comput. Hum. Behav."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"111866","DOI":"10.1109\/ACCESS.2019.2934529","article-title":"Semantic-emotion neural network for emotion recognition from text","volume":"7","author":"Batbaatar","year":"2019","journal-title":"IEEE Access"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1175","DOI":"10.1016\/j.procs.2017.05.025","article-title":"Emotion recognition using facial expressions","volume":"108","author":"Tarnowski","year":"2017","journal-title":"Procedia Comput. Sci."},{"key":"ref_7","unstructured":"Cohen, I., Garg, A., and Huang, T.S. (2000). Emotion recognition from facial expressions using multilevel HMM. 
Neural Information Processing Systems, Citeseer."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"995","DOI":"10.1080\/02699931.2011.631296","article-title":"The differential contribution of facial expressions, prosody, and speech content to empathy","volume":"26","author":"Regenbogen","year":"2012","journal-title":"Cogn. Emot."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"2346","DOI":"10.1016\/j.neuroimage.2012.02.043","article-title":"Multimodal human communication\u2014Targeting facial expressions, speech content and prosody","volume":"60","author":"Regenbogen","year":"2012","journal-title":"Neuroimage"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"665","DOI":"10.1016\/j.neuroimage.2011.06.035","article-title":"The temporal dynamics of processing emotions from vocal, facial, and bodily expressions","volume":"58","author":"Jessen","year":"2011","journal-title":"Neuroimage"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"2257","DOI":"10.1016\/j.neuroimage.2010.10.047","article-title":"Incongruence effects in crossmodal emotional integration","volume":"54","author":"Habel","year":"2011","journal-title":"Neuroimage"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"840","DOI":"10.1109\/TRO.2007.907484","article-title":"Enabling multimodal human\u2013robot interaction for the karlsruhe humanoid robot","volume":"23","author":"Stiefelhagen","year":"2007","journal-title":"IEEE Trans. Robot."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Hong, A., Lunscher, N., Hu, T., Tsuboi, Y., Zhang, X., dos Reis Alves, S.F., Nejat, G., and Benhabib, B. (2020). A Multimodal Emotional Human-Robot Interaction Architecture for Social Robots Engaged in Bi-directional Communication. IEEE Trans. Cybern.","DOI":"10.1109\/TCYB.2020.2974688"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Kim, J.C., Azzi, P., Jeon, M., Howard, A.M., and Park, C.H. (July, January 28). 
Audio-based emotion estimation for interactive robotic therapy for children with autism spectrum disorder. Proceedings of the 2017 IEEE 14th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), Jeju, Korea.","DOI":"10.1109\/URAI.2017.7992881"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Xie, B., Kim, J.C., and Park, C.H. (2020). Musical emotion recognition with spectral feature extraction based on a sinusoidal model with model-based and deep-learning approaches. Appl. Sci., 10.","DOI":"10.3390\/app10030902"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Maat, L., and Pantic, M. (2007). Gaze-X: Adaptive, affective, multimodal interface for single-user office scenarios. Artifical Intelligence for Human Computing, Springer.","DOI":"10.1145\/1180995.1181032"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"724","DOI":"10.1016\/j.ijhcs.2007.02.003","article-title":"Automatic prediction of frustration","volume":"65","author":"Kapoor","year":"2007","journal-title":"Int. J. Hum. Comput. Stud."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Murray, I.R., and Arnott, J.L. (1996, January 3\u20136). Synthesizing emotions in speech: Is it time to get excited?. Proceedings of the IEEE Fourth International Conference on Spoken Language Processing (ICSLP\u201996), Philadelphia, PA, USA.","DOI":"10.21437\/ICSLP.1996-461"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Walker, M.A., Cahn, J.E., and Whittaker, S.J. (1997, January 5\u20138). Improvising linguistic style: Social and affective bases for agent personality. Proceedings of the First International Conference on Autonomous Agents, Marina Del Rey, CA, USA.","DOI":"10.1145\/267658.267680"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Schr\u00f6der, M. (2001, January 3\u20137). Emotional speech synthesis: A review. 
Proceedings of the Seventh European Conference on Speech Communication and Technology, Aalborg, Denmark.","DOI":"10.21437\/Eurospeech.2001-150"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"829","DOI":"10.1109\/10.846676","article-title":"Acoustical properties of speech as indicators of depression and suicidal risk","volume":"47","author":"France","year":"2000","journal-title":"IEEE Trans. Biomed. Eng."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"789","DOI":"10.1016\/S0272-7358(02)00130-7","article-title":"Emotion recognition via facial expression and affective prosody in schizophrenia: A methodological review","volume":"22","author":"Edwards","year":"2002","journal-title":"Clin. Psychol. Rev."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"311","DOI":"10.1016\/S0920-9964(96)00126-0","article-title":"Facial-affect recognition and visual scanning behaviour in the course of schizophrenia","volume":"24","author":"Streit","year":"1997","journal-title":"Schizophr. Res."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Sebe, N., Cohen, I., and Huang, T.S. (2005). Multimodal emotion recognition. Handbook of Pattern Recognition and Computer Vision, World Scientific.","DOI":"10.1142\/9789812775320_0021"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1007\/s12193-009-0025-5","article-title":"Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis","volume":"3","author":"Kessous","year":"2010","journal-title":"J. Multimodal User Interfaces"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"6","DOI":"10.5772\/45662","article-title":"A multidisciplinary artificial intelligence model of an affective robot","volume":"9","author":"Samani","year":"2012","journal-title":"Int. J. Adv. Robot. Syst."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Barros, P., Magg, S., Weber, C., and Wermter, S. (2014). 
A multichannel convolutional neural network for hand posture recognition. International Conference on Artificial Neural Networks, Springer.","DOI":"10.1007\/978-3-319-11179-7_51"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"43","DOI":"10.3389\/frobt.2020.00043","article-title":"Toward an Automated Measure of Social Engagement for Children With Autism Spectrum Disorder\u2014A Personalized Computational Modeling Approach","volume":"7","author":"Javed","year":"2020","journal-title":"Front. Robot. AI"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"e12","DOI":"10.1017\/ATSIP.2014.11","article-title":"Survey on audiovisual emotion recognition: Databases, features, and data fusion strategies","volume":"3","author":"Wu","year":"2014","journal-title":"APSIPA Trans. Signal Inf. Process."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Schuller, B., Valstar, M., Eyben, F., Cowie, R., and Pantic, M. (2012, January 22\u201326). Avec 2012: The continuous audio\/visual emotion challenge. Proceedings of the 14th ACM International Conference on Multimodal Interaction, Santa Monica, CA, USA.","DOI":"10.1145\/2388676.2388776"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Huang, J., Li, Y., Tao, J., Lian, Z., Wen, Z., Yang, M., and Yi, J. (2017, January 23\u201327). Continuous multimodal emotion prediction based on long short term memory recurrent neural network. Proceedings of the 7th Annual Workshop on Audio\/Visual Emotion Challenge, Mountain View, CA, USA.","DOI":"10.1145\/3133944.3133946"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Liu, M., Wang, R., Li, S., Shan, S., Huang, Z., and Chen, X. (2014, January 12\u201316). Combining multiple kernel methods on riemannian manifold for emotion recognition in the wild. 
Proceedings of the 16th International Conference on multimodal interaction, Istanbul, Turkey.","DOI":"10.1145\/2663204.2666274"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Chen, S., and Jin, Q. (2016, January 15\u201319). Multi-modal conditional attention fusion for dimensional emotion prediction. Proceedings of the 24th ACM international conference on Multimedia, Amsterdam, The Netherlands.","DOI":"10.1145\/2964284.2967286"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"259","DOI":"10.1016\/j.inffus.2019.02.010","article-title":"EmbraceNet: A robust deep learning architecture for multimodal classification","volume":"51","author":"Choi","year":"2019","journal-title":"Inf. Fusion"},{"key":"ref_35","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"61672","DOI":"10.1109\/ACCESS.2020.2984368","article-title":"Multimodal approach of speech emotion recognition using multi-level multihead fusion attention-based recurrent neural network","volume":"8","author":"Ho","year":"2020","journal-title":"IEEE Access"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Huang, J., Tao, J., Liu, B., Lian, Z., and Niu, M. (2020, January 4\u20138). Multimodal transformer fusion for continuous emotion recognition. 
Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053762"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"176274","DOI":"10.1109\/ACCESS.2020.3026823","article-title":"Multimodal Emotion Recognition With Transformer-Based Self Supervised Feature Fusion","volume":"8","author":"Siriwardhana","year":"2020","journal-title":"IEEE Access"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"985","DOI":"10.1109\/TASLP.2021.3049898","article-title":"CTNet: Conversational transformer network for emotion recognition","volume":"29","author":"Lian","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"6558","DOI":"10.18653\/v1\/P19-1656","article-title":"Multimodal transformer for unaligned multimodal language sequences","volume":"2019","author":"Tsai","year":"2019","journal-title":"Proc. Conf. Assoc. Comput. Linguist. Meet."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., and Mihalcea, R. (2018). Meld: A multimodal multi-party dataset for emotion recognition in conversations. arXiv.","DOI":"10.18653\/v1\/P19-1050"},{"key":"ref_42","unstructured":"Chen, S.Y., Hsu, C.C., Kuo, C.C., and Ku, L.W. (2018). Emotionlines: An emotion corpus of multi-party conversations. arXiv."},{"key":"ref_43","first-page":"31","article-title":"How might emotions affect learning","volume":"3","author":"Bower","year":"1992","journal-title":"Handb. Emot. Mem. Res. Theory"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"1345","DOI":"10.1109\/TKDE.2009.191","article-title":"A survey on transfer learning","volume":"22","author":"Pan","year":"2009","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_45","unstructured":"Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. 
(2021, July 18). Improving Language Understanding by Generative Pre-Training. Available online: https:\/\/www.cs.ubc.ca\/~amuham01\/LING530\/papers\/radford2018improving.pdf."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. (2015, January 7\u201313). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. Proceedings of the IEEE international conference on computer vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.11"},{"key":"ref_47","unstructured":"Wolf, T., Sanh, V., Chaumond, J., and Delangue, C. (2019). Transfertransfo: A transfer learning approach for neural network based conversational agents. arXiv."},{"key":"ref_48","unstructured":"Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., Oord, A., Dieleman, S., and Kavukcuoglu, K. (15, January 10). Efficient neural audio synthesis. In International Conference on Machine Learning. Proceedings of the International Conference on Machine Learning, Stockholm, Sweden."},{"key":"ref_49","unstructured":"Ito, K., and Johnson, L. (2021, July 18). The LJ Speech Dataset. Available online: https:\/\/keithito.com\/LJ-Speech-Dataset\/."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Ouyang, X., Kawaai, S., Goh, E.G.H., Shen, S., Ding, W., Ming, H., and Huang, D.Y. (2017, January 13\u201317). Audio-visual emotion recognition using deep transfer learning and multiple temporal models. Proceedings of the 19th ACM International Conference on Multimodal Interaction, Glasgow, UK.","DOI":"10.1145\/3136755.3143012"},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"14","DOI":"10.18178\/ijmlc.2019.9.1.759","article-title":"Facial emotion recognition from videos using deep convolutional neural networks","volume":"9","author":"Abdulsalam","year":"2019","journal-title":"Int. J. Mach. Learn. 
Comput."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Leong, F.H. (2020, January 26\u201328). Deep learning of facial embeddings and facial landmark points for the detection of academic emotions. Proceedings of the 5th International Conference on Information and Education Innovations, London, UK.","DOI":"10.1145\/3411681.3411684"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Schroff, F., Kalenichenko, D., and Philbin, J. (2015, January 7\u201312). Facenet: A unified embedding for face recognition and clustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"1499","DOI":"10.1109\/LSP.2016.2603342","article-title":"Joint face detection and alignment using multitask cascaded convolutional networks","volume":"23","author":"Zhang","year":"2016","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Cao, Q., Shen, L., Xie, W., Parkhi, O.M., and Zisserman, A. (2018, January 15\u201319). Vggface2: A dataset for recognising faces across pose and age. Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi\u2019an, China.","DOI":"10.1109\/FG.2018.00020"},{"key":"ref_56","unstructured":"Yi, D., Lei, Z., Liao, S., and Li, S.Z. (2014). Learning face representation from scratch. arXiv."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Cho, K., Van Merri\u00ebnboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. 
arXiv.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_59","first-page":"2579","article-title":"Visualizing data using t-SNE","volume":"9","author":"Hinton","year":"2008","journal-title":"J. Mach. Learn. Res."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Franzoni, V., Vallverd\u00f9, J., and Milani, A. (2019, January 14\u201317). Errors, biases and overconfidence in artificial emotional modeling. Proceedings of the IEEE\/WIC\/ACM International Conference on Web Intelligence-Companion Volume, Thessaloniki, Greece.","DOI":"10.1145\/3358695.3361749"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/14\/4913\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:31:55Z","timestamp":1760164315000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/14\/4913"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,19]]},"references-count":60,"journal-issue":{"issue":"14","published-online":{"date-parts":[[2021,7]]}},"alternative-id":["s21144913"],"URL":"https:\/\/doi.org\/10.3390\/s21144913","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,19]]}}}