{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,13]],"date-time":"2026-03-13T15:11:11Z","timestamp":1773414671848,"version":"3.50.1"},"reference-count":87,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2023,5,30]],"date-time":"2023-05-30T00:00:00Z","timestamp":1685404800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004427","name":"Universidad de Valpara\u00edso","doi-asserted-by":"publisher","award":["INICI UVA20993"],"award-info":[{"award-number":["INICI UVA20993"]}],"id":[{"id":"10.13039\/501100004427","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004427","name":"Universidad de Valpara\u00edso","doi-asserted-by":"publisher","award":["2022-21221429"],"award-info":[{"award-number":["2022-21221429"]}],"id":[{"id":"10.13039\/501100004427","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Agency for Research and Development (ANID)","award":["INICI UVA20993"],"award-info":[{"award-number":["INICI UVA20993"]}]},{"name":"National Agency for Research and Development (ANID)","award":["2022-21221429"],"award-info":[{"award-number":["2022-21221429"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Multimodal emotion recognition implies the use of different resources and techniques for identifying and recognizing human emotions. A variety of data sources such as faces, speeches, voices, texts and others have to be processed simultaneously for this recognition task. However, most of the techniques, which are based mainly on Deep Learning, are trained using datasets designed and built in controlled conditions, making their applicability in real contexts with real conditions more difficult. For this reason, the aim of this work is to assess a set of in-the-wild datasets to show their strengths and weaknesses for multimodal emotion recognition. Four in-the-wild datasets are evaluated: AFEW, SFEW, MELD and AffWild2. A multimodal architecture previously designed is used to perform the evaluation and classical metrics such as accuracy and F1-Score are used to measure performance in training and to validate quantitative results. However, strengths and weaknesses of these datasets for various uses indicate that by themselves they are not appropriate for multimodal recognition due to their original purpose, e.g., face or speech recognition. Therefore, we recommend a combination of multiple datasets in order to obtain better results when new samples are being processed and a good balance in the number of samples by class.<\/jats:p>","DOI":"10.3390\/s23115184","type":"journal-article","created":{"date-parts":[[2023,5,31]],"date-time":"2023-05-31T02:57:10Z","timestamp":1685501830000},"page":"5184","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":19,"title":["An Assessment of In-the-Wild Datasets for Multimodal Emotion Recognition"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0726-1759","authenticated-orcid":false,"given":"Ana","family":"Aguilera","sequence":"first","affiliation":[{"name":"Escuela de Ingenier\u00eda Inform\u00e1tica, Universidad de Valpara\u00edso, Valpara\u00edso 2340000, Chile"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8078-253X","authenticated-orcid":false,"given":"Diego","family":"Mellado","sequence":"additional","affiliation":[{"name":"Doctorado en Ciencias e Ingenier\u00eda para la Salud, Universidad de Valpara\u00edso, Valpara\u00edso 2340000, Chile"}]},{"given":"Felipe","family":"Rojas","sequence":"additional","affiliation":[{"name":"Escuela de Ingenier\u00eda Inform\u00e1tica, Universidad de Valpara\u00edso, Valpara\u00edso 2340000, Chile"}]}],"member":"1968","published-online":{"date-parts":[[2023,5,30]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Dzedzickis, A., Kaklauskas, A., and Bucinskas, V. (2020). Human Emotion Recognition: Review of Sensors and Methods. Sensors, 20.","DOI":"10.3390\/s20030592"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"35553","DOI":"10.1007\/s11042-019-08328-z","article-title":"A Review of Emotion Sensing: Categorization Models and Algorithms","volume":"79","author":"Wang","year":"2020","journal-title":"Multimed. Tools Appl."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1061","DOI":"10.1037\/0022-3514.52.6.1061","article-title":"Emotion Knowledge: Further Exploration of a Prototype Approach","volume":"52","author":"Shaver","year":"1987","journal-title":"J. Pers. Soc. Psychol."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"169","DOI":"10.1080\/02699939208411068","article-title":"An Argument for Basic Emotions","volume":"6","author":"Ekman","year":"1992","journal-title":"Cognit. Emo"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"749933","DOI":"10.3389\/fpsyg.2021.749933","article-title":"Facial Expressions and Emotion Labels Are Separate Initiators of Trait Inferences from the Face","volume":"12","author":"Stahelski","year":"2021","journal-title":"Front. Psychol."},{"key":"ref_6","unstructured":"Schulz, A., Thanh, T.D., Paulheim, H., and Schweizer, I. (2013, January 12\u201315). A Fine-Grained Sentiment Analysis Approach for Detecting Crisis Related Microposts. Proceedings of the 10th International ISCRAM Conference, Baden-Baden, Germany."},{"key":"ref_7","first-page":"71","article-title":"The Underlying Structure of Emotions: A Tri-Dimensional Model of Core Affect and Emotion Concepts for Sports","volume":"7","author":"Latinjak","year":"2012","journal-title":"Rev. Iberoam. Psicol. Ejecicio Deporte (Iberoam. J. Exerc. Sport Psychol.)"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"9","DOI":"10.3389\/fcomp.2020.00009","article-title":"A Review of Generalizable Transfer Learning in Automatic Emotion Recognition","volume":"2","author":"Feng","year":"2020","journal-title":"Front. Comput. Sci."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Calvo, R.A., D\u2019Mello, S.K., Gratch, J., and Kappas, A. (2014). Oxford Handbook of Affective Computing, Oxford University Press.","DOI":"10.1093\/oxfordhb\/9780199942237.013.040"},{"key":"ref_10","unstructured":"Pease, A., and Chandler, J. (1997). Body Language, Sheldon Press."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"349","DOI":"10.1037\/amp0000488","article-title":"What the Face Displays: Mapping 28 Emotions Conveyed by Naturalistic Expression","volume":"75","author":"Cowen","year":"2020","journal-title":"Am. Psychol."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Mittal, T., Guhan, P., Bhattacharya, U., Chandra, R., Bera, A., and Manocha, D. (2020, January 13\u201319). EmotiCon: Context-Aware Multimodal Emotion Recognition Using Frege\u2019s Principle. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01424"},{"key":"ref_13","first-page":"1359","article-title":"M3ER: Multiplicative Multimodal Emotion Recognition using Facial, Textual and Speech Cues","volume":"34","author":"Mittal","year":"2020","journal-title":"Proc. Aaai Conf. Artif. Intell. AAAI"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Subramanian, G., Cholendiran, N., Prathyusha, K., Balasubramanain, N., and Aravinth, J. (2021, January 25\u201327). Multimodal Emotion Recognition Using Different Fusion Techniques. Proceedings of the 2021 Seventh International Conference on Bio Signals, Images and Instrumentation (ICBSII), Chennai, India.","DOI":"10.1109\/ICBSII51839.2021.9445146"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"20727","DOI":"10.1109\/ACCESS.2022.3149214","article-title":"Adaptive Multimodal Emotion Detection Architecture for Social Robots","volume":"10","author":"Heredia","year":"2022","journal-title":"IEEE Access"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Poria, S., Chaturvedi, I., Cambria, E., and Hussain, A. (2016, January 12\u201315). Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis. Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain.","DOI":"10.1109\/ICDM.2016.0055"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1016\/j.dss.2018.09.002","article-title":"Deep Learning for Affective Computing: Text-Based Emotion Recognition in Decision Support","volume":"115","author":"Kratzwald","year":"2018","journal-title":"Decis. Support. Syst."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1016\/j.imavis.2017.08.003","article-title":"A survey of Multimodal Sentiment Analysis","volume":"65","author":"Soleymani","year":"2017","journal-title":"Image Vis. Comput."},{"key":"ref_19","first-page":"200171","article-title":"A systematic Survey on Multimodal Emotion Recognition using Learning Algorithms","volume":"17","author":"Ahmed","year":"2023","journal-title":"Intell. Syst. Appl."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Xu, H., Zhang, H., Han, K., Wang, Y., Peng, Y., and Li, X. (2019). Learning Alignment for Multimodal Emotion Recognition from Speech. arXiv.","DOI":"10.21437\/Interspeech.2019-3247"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1016\/j.eij.2020.07.005","article-title":"A 3D-convolutional Neural Network Framework with Ensemble Learning Techniques for Multi-Modal Emotion recognition","volume":"22","author":"Salama","year":"2021","journal-title":"Egypt. Inform. J."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"168865","DOI":"10.1109\/ACCESS.2020.3023871","article-title":"Cross-Subject Multimodal Emotion Recognition Based on Hybrid Fusion","volume":"8","author":"Cimtay","year":"2020","journal-title":"IEEE Access"},{"key":"ref_23","unstructured":"Tripathi, S., Tripathi, S., and Beigi, H. (2018). Multi-Modal Emotion Recognition on IEMOCAP Dataset using Deep Learning. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"102185","DOI":"10.1016\/j.ipm.2019.102185","article-title":"Exploring Temporal Representations by Leveraging Attention-Based Bidirectional LSTM-RNNs for Multi-Modal Emotion Recognition","volume":"57","author":"Li","year":"2020","journal-title":"Inf. Process. Manag."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"715","DOI":"10.1109\/TCDS.2021.3071170","article-title":"Comparing Recognition Performance and Robustness of Multimodal Deep Learning Models for Multimodal Emotion Recognition","volume":"14","author":"Liu","year":"2021","journal-title":"IEEE Trans. Cogn. Develop. Syst."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Ranganathan, H., Chakraborty, S., and Panchanathan, S. (2016, January 7\u201310). Multimodal Emotion Recognition Using Deep Learning Architectures. Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA.","DOI":"10.1109\/WACV.2016.7477679"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"52","DOI":"10.38094\/jastt20291","article-title":"Multimodal Emotion Recognition Using Deep Learning","volume":"2","author":"Abdullah","year":"2021","journal-title":"J. Appl. Sci. Technol. Trends."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1301","DOI":"10.1109\/JSTSP.2017.2764438","article-title":"End-to-End Multimodal Emotion Recognition Using Deep Neural Networks","volume":"11","author":"Tzirakis","year":"2017","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Alaba, S.Y., Nabi, M.M., Shah, C., Prior, J., Campbell, M.D., Wallace, F., Ball, J.E., and Moorhead, R. (2022). Class-Aware Fish Species Recognition Using Deep Learning for an Imbalanced Dataset. Sensors, 22.","DOI":"10.3390\/s22218268"},{"key":"ref_30","unstructured":"Zhao, M., Liu, Q., Jha, A., Deng, R., Yao, T., Mahadevan-Jansen, A., Tyska, M.J., Millis, B.A., and Huo, Y. (2021). Machine Learning in Medical Imaging, Springer."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"21780","DOI":"10.1109\/JSEN.2022.3197235","article-title":"Pseudo RGB-D Face Recognition","volume":"22","author":"Jin","year":"2022","journal-title":"IEEE Sens. J."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Yao, T., Qu, C., Liu, Q., Deng, R., Tian, Y., Xu, J., Jha, A., Bao, S., Zhao, M., and Fogo, A.B. (2021, January 1). Compound Figure Separation of Biomedical Images with Side Loss. Proceedings of the Deep Generative Models and Data Augmentation, Labelling and Imperfections: First Workshop, DGM4MICCAI 2021 and First Workshop, DALI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France.","DOI":"10.1007\/978-3-030-88210-5_16"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"123649","DOI":"10.1109\/ACCESS.2020.3005687","article-title":"Deep Facial Diagnosis: Deep Transfer Learning from Face Recognition to Facial Diagnosis","volume":"8","author":"Jin","year":"2020","journal-title":"IEEE Access"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"7723","DOI":"10.1007\/s00521-020-05514-1","article-title":"Spectrum Interference-based Two-Level data Augmentation Method in Deep Learning for Automatic Modulation Classification","volume":"33","author":"Zheng","year":"2020","journal-title":"Neural Comput. Appl."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"239","DOI":"10.1007\/s11042-022-13254-8","article-title":"Building a Three-Level Multimodal Emotion Recognition Framework","volume":"82","author":"Lozano","year":"2023","journal-title":"Multimed. Tools Appl."},{"key":"ref_36","unstructured":"Samadiani, N., Huang, G., Luo, W., Shu, Y., Wang, R., and Kocaturk, T. (2020). Data Science, Springer."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"e5764","DOI":"10.1002\/cpe.5764","article-title":"A multiple Feature Fusion Framework for Video Emotion Recognition in the Wild","volume":"34","author":"Samadiani","year":"2022","journal-title":"Concurr. Computat. Pract. Exper."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"103594","DOI":"10.1016\/j.infrared.2020.103594","article-title":"Facial Expression Recognition Method with Multi-Label Distribution Learning for Non-Verbal Behavior Understanding in the Classroom","volume":"112","author":"Liu","year":"2021","journal-title":"Infrared Phys. Technol."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"104457","DOI":"10.1016\/j.infrared.2022.104457","article-title":"Learning Fusion Feature Representation for Garbage Image Classification Model in Human\u2013Robot Interaction","volume":"128","author":"Li","year":"2023","journal-title":"Infrared Phys. Technol."},{"key":"ref_40","unstructured":"Kollias, D., and Zafeiriou, S. (2019). Exploiting Multi-CNN Features in CNN-RNN based Dimensional Emotion Recognition on the OMG in-the-Wild Dataset. arXiv."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"8669","DOI":"10.1007\/s00521-020-05616-w","article-title":"HEU Emotion: A Large-Scale Database for Multimodal Emotion Recognition in the Wild","volume":"33","author":"Chen","year":"2021","journal-title":"Neural Comput. Applic."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"1195","DOI":"10.1109\/TAFFC.2020.2981446","article-title":"Deep Facial Expression Recognition: A Survey","volume":"13","author":"Li","year":"2020","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Riaz, M.N., Shen, Y., Sohail, M., and Guo, M. (2020). eXnet: An Efficient Approach for Emotion Recognition in the Wild. Sensors, 20.","DOI":"10.3390\/s20041087"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Dhall, A., Sharma, G., Goecke, R., and Gedeon, T. (2020, January 25\u201329). EmotiW 2020: Driver Gaze, Group Emotion, Student Engagement and Physiological Signal based Challenges. Proceedings of the ICMI \u201920: 2020 International Conference on Multimodal Interaction, Virtual Event, The Netherlands.","DOI":"10.1145\/3382507.3417973"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Hu, P., Cai, D., Wang, S., Yao, A., and Chen, Y. (2017, January 13\u201317). Learning Supervised Scoring Ensemble for Emotion recognition in the wild. Proceedings of the ICMI\u201917: 19th ACM International Conference on Multimodal Interaction, Glasgow, UK.","DOI":"10.1145\/3136755.3143009"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Li, S., Zheng, W., Zong, Y., Lu, C., Tang, C., Jiang, X., Liu, J., and Xia, W. (2019, January 14\u201318). Bi-modality Fusion for Emotion Recognition in the Wild. Proceedings of the ICMI\u201919: 2019 International Conference on Multimodal Interaction, Suzhou, China.","DOI":"10.1145\/3340555.3355719"},{"key":"ref_47","unstructured":"Salah, A.A., Kaya, H., and G\u00fcrp\u0131nar, F. (2019). Multimodal Behavior Analysis in the Wild, Academic Press."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Yu, Z., and Zhang, C. (2015, January 9\u201313). Image Based Static Facial Expression Recognition with Multiple Deep Network Learning. Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA.","DOI":"10.1145\/2818346.2830595"},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"1016","DOI":"10.1016\/j.ijleo.2018.01.003","article-title":"Illumination Invariant Facial Expression Recognition using Selected Merged Binary Patterns for Real World Images","volume":"158","author":"Munir","year":"2018","journal-title":"Optik"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Cai, J., Meng, Z., Khan, A.S., Li, Z., O\u2019Reilly, J., and Tong, Y. (2018, January 15\u201319). Island Loss for Learning Discriminative Features in Facial Expression Recognition. Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi\u2019an, China.","DOI":"10.1109\/FG.2018.00051"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Ruan, D., Yan, Y., Lai, S., Chai, Z., Shen, C., and Wang, H. (2021, January 20\u201325). Feature Decomposition and Reconstruction Learning for Effective Facial Expression Recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00757"},{"key":"ref_52","unstructured":"Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., and Mihalcea, R. (August, January 28). MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Xie, B., Sidulova, M., and Park, C.H. (2021). Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion. Sensors, 21.","DOI":"10.3390\/s21144913"},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"61672","DOI":"10.1109\/ACCESS.2020.2984368","article-title":"Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network","volume":"8","author":"Ho","year":"2020","journal-title":"IEEE Access"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Hu, J., Liu, Y., Zhao, J., and Jin, Q. (2021). MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation. arXiv.","DOI":"10.18653\/v1\/2021.acl-long.440"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Kollias, D., Tzirakis, P., Nicolaou, M.A., Papaioannou, A., Zhao, G., Schuller, B., Kotsia, I., and Zafeiriou, S. (2018). Deep Affect Prediction in-the-Wild: AffWild Database and Challenge, Deep Architectures and Beyond. arXiv.","DOI":"10.1007\/s11263-019-01158-4"},{"key":"ref_57","unstructured":"Kollias, D., and Zafeiriou, S. (2019). Aff-Wild2: Extending the AffWild Database for Affect Recognition. arXiv."},{"key":"ref_58","unstructured":"Barros, P., and Sciutti, A. (2020). The FaceChannelS: Strike of the Sequences for the AffWild 2 Challenge. arXiv."},{"key":"ref_59","unstructured":"Liu, Y., Zhang, X., Kauttonen, J., and Zhao, G. (2022). Uncertain Facial Expression Recognition via Multi-task Assisted Correction. arXiv."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2017, January 21\u201326). Densely Connected Convolutional Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Guo, Y., Zhang, L., Hu, Y., He, X., and Gao, J. (2016, January 11\u201314). MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. Proceedings of the Computer Vision\u2013ECCV 2016: 14th European Conference, Amsterdam, The Netherlands. Proceedings, Part III 14.","DOI":"10.1007\/978-3-319-46487-9_6"},{"key":"ref_63","unstructured":"Yu, J., Cai, Z., He, P., Xie, G., and Ling, Q. (2022). Multi-Model Ensemble Learning Method for Human Expression Recognition. arXiv."},{"key":"ref_64","unstructured":"Tan, M., and Le, Q. (2019). International Conference on Machine Learning, PMLR."},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, January 7\u201312). Going Deeper with Convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Zhang, W., Qiu, F., Wang, S., Zeng, H., Zhang, Z., An, R., Ma, B., and Ding, Y. (2022, January 19\u201320). Transformer-based Multimodal Information Fusion for Facial Expression Analysis. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA.","DOI":"10.1109\/CVPRW56347.2022.00271"},{"key":"ref_67","unstructured":"Mollahosseini, A., Hasani, B., and Mahoor, M.H. (2017). AffectNet: A Database for Facial Expression, Valence and Arousal Computing in the Wild. arXiv."},{"key":"ref_68","doi-asserted-by":"crossref","first-page":"34","DOI":"10.1109\/MMUL.2012.26","article-title":"Collecting Large, Richly Annotated Facial-Expression Databases from Movies","volume":"19","author":"Dhall","year":"2012","journal-title":"IEEE Multimed."},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Dhall, A., Ramana Murthy, O.V., Goecke, R., Joshi, J., and Gedeon, T. (2015, January 9\u201313). Video and Image Based Emotion Recognition Challenges in the Wild: EmotiW 2015. Proceedings of the ICMI \u201915: 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA.","DOI":"10.1145\/2818346.2829994"},{"key":"ref_70","doi-asserted-by":"crossref","unstructured":"Dhall, A., Goecke, R., Joshi, J., Hoey, J., and Gedeon, T. (2016, January 12\u201316). EmotiW 2016: Video and Group-Level Emotion Recognition Challenges. Proceedings of the ICMI \u201916: 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan.","DOI":"10.1145\/2993148.2997638"},{"key":"ref_71","doi-asserted-by":"crossref","unstructured":"Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., and Matthews, I. (2010, January 13\u201318). The Extended Cohn-Kanade Dataset (CK+): A Complete Dataset for Action Unit and Emotion-Specified Expression. Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition\u2014Workshops, San Francisco, CA, USA.","DOI":"10.1109\/CVPRW.2010.5543262"},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., and Lee, D.H. (2013). Challenges in Representation Learning: A Report on Three Machine Learning Contests. arXiv.","DOI":"10.1007\/978-3-642-42051-1_16"},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Dhall, A., Goecke, R., Lucey, S., and Gedeon, T. (2011, January 6\u201313). Static Facial Expression Analysis in Tough Conditions: Data, Evaluation Protocol and Benchmark. Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain.","DOI":"10.1109\/ICCVW.2011.6130508"},{"key":"ref_74","doi-asserted-by":"crossref","unstructured":"Li, S., Deng, W., and Du, J. (2017, January 21\u201326). Reliable Crowdsourcing and Deep Locality-Preserving Learning for Expression Recognition in the Wild. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.277"},{"key":"ref_75","unstructured":"Zadeh, A., Liang, P.P., Poria, S., Cambria, E., and Morency, L.P. (2018, January 15\u201320). Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia."},{"key":"ref_76","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1007\/s10579-008-9076-6","article-title":"IEMOCAP: Interactive Emotional Dyadic Motion Capture Database","volume":"42","author":"Busso","year":"2008","journal-title":"Lang. Resour. Eval."},{"key":"ref_77","unstructured":"Dhall, A., Goecke, R., Lucey, S., and Gedeon, T. (2011). Technical Report TR-CS-11-02, Australian National University."},{"key":"ref_78","unstructured":"Chen, S.Y., Hsu, C.C., Kuo, C.C., Huang, T.-H., and Ku, L.W. (2018). EmotionLines: An Emotion Corpus of Multi-Party Conversations. arXiv."},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Kollias, D., Nicolaou, M.A., Kotsia, I., Zhao, G., and Zafeiriou, S. (2017, January 21\u201326). Recognition of Affect in the Wild Using Deep Neural Networks. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA.","DOI":"10.1109\/CVPRW.2017.247"},{"key":"ref_80","doi-asserted-by":"crossref","unstructured":"Kollias, D. (2022, January 19\u201324). Abaw: Valence-Arousal Estimation, Expression Recognition, Action Unit Detection & Multi-Task Learning Challenges. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPRW56347.2022.00259"},{"key":"ref_81","first-page":"10","article-title":"Converting Video Formats with FFmpeg","volume":"2006","author":"Tomar","year":"2006","journal-title":"Linux J."},{"key":"ref_82","doi-asserted-by":"crossref","unstructured":"McFee, B., Raffel, C., Liang, D., Ellis, D.P., McVicar, M., Battenberg, E., and Nieto, O. (2015, January 6\u201312). Librosa: Audio and Music Signal Analysis in Python. Proceedings of the 14th Python in Science Conference, Austin, TX, USA.","DOI":"10.25080\/Majora-7b98e3ed-003"},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"14","DOI":"10.3389\/fcomp.2020.00014","article-title":"Real-Time Speech Emotion Recognition Using a Pre-trained Image Classification Network: Effects of Bandwidth Reduction and Companding","volume":"2","author":"Lech","year":"2020","journal-title":"Front. Comput. Sci."},{"key":"ref_84","unstructured":"Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., and Zhong, J. (2021). SpeechBrain: A General-Purpose Speech Toolkit. arXiv."},{"key":"ref_85","doi-asserted-by":"crossref","unstructured":"Shen, W., Chen, J., Quan, X., and Xie, Z. (2020). DialogXL: All-in-One XLNet for Multi-Party Conversation Emotion Recognition. arXiv.","DOI":"10.1609\/aaai.v35i15.17625"},{"key":"ref_86","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very deep convolutional networks for large-scale image recognition. Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA."},{"key":"ref_87","unstructured":"Venkataramanan, K., and Rajamohan, H.R. (2019). Emotion Recognition from Speech. arXiv."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/11\/5184\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:44:56Z","timestamp":1760125496000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/11\/5184"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,30]]},"references-count":87,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2023,6]]}},"alternative-id":["s23115184"],"URL":"https:\/\/doi.org\/10.3390\/s23115184","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,5,30]]}}}