{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T16:47:39Z","timestamp":1781542059688,"version":"3.54.5"},"reference-count":88,"publisher":"MDPI AG","issue":"14","license":[{"start":{"date-parts":[[2021,7,20]],"date-time":"2021-07-20T00:00:00Z","timestamp":1626739200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003725","name":"National Research Foundation of Korea","doi-asserted-by":"publisher","award":["NRF-2021R1A2C2006895"],"award-info":[{"award-number":["NRF-2021R1A2C2006895"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Music videos contain a great deal of visual and acoustic information. Each information source within a music video influences the emotions conveyed through the audio and video, suggesting that only a multimodal approach is capable of achieving efficient affective computing. This paper presents an affective computing system that relies on music, video, and facial expression cues, making it useful for emotional analysis. We applied the audio\u2013video information exchange and boosting methods to regularize the training process and reduced the computational costs by using a separable convolution strategy. In sum, our empirical findings are as follows: (1) Multimodal representations efficiently capture all acoustic and visual emotional clues included in each music video, (2) the computational cost of each neural network is significantly reduced by factorizing the standard 2D\/3D convolution into separate channels and spatiotemporal interactions, and (3) information-sharing methods incorporated into multimodal representations are helpful in guiding individual information flow and boosting overall performance. 
We tested our findings across several unimodal and multimodal networks against various evaluation metrics and visual analyzers. Our best classifier attained 74% accuracy, an f1-score of 0.73, and an area under the curve score of 0.926.<\/jats:p>","DOI":"10.3390\/s21144927","type":"journal-article","created":{"date-parts":[[2021,7,20]],"date-time":"2021-07-20T11:26:10Z","timestamp":1626780370000},"page":"4927","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":76,"title":["Deep-Learning-Based Multimodal Emotion Classification for Music Videos"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9842-8704","authenticated-orcid":false,"given":"Yagya Raj","family":"Pandeya","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, Jeonbuk National University, Jeonju-City 54896, Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7014-4868","authenticated-orcid":false,"given":"Bhuwan","family":"Bhattarai","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Jeonbuk National University, Jeonju-City 54896, Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Joonwhoan","family":"Lee","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Jeonbuk National University, Jeonju-City 54896, Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2021,7,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Yang, Y.H., and Chen, H.H. (2012). Machine Recognition of Music Emotion: A Review. ACM Trans. Intell. Syst. 
Technol.","DOI":"10.1145\/2168752.2168754"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"217","DOI":"10.1080\/0929821042000317813","article-title":"Expression, Perception, and Induction of Musical Emotions: A Review and a Questionnaire Study of Everyday Listening","volume":"33","author":"Juslin","year":"2004","journal-title":"J. New Music Res."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"307","DOI":"10.1177\/0305735617707354","article-title":"Music Listening as Self-enhancement: Effects of Empowering Music on Momentary Explicit and Implicit Self-esteem","volume":"46","author":"Elvers","year":"2018","journal-title":"Psychol. Music"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"68","DOI":"10.5498\/wjp.v5.i1.68","article-title":"Effects of Music and Music Therapy on Mood in Neurological Patients","volume":"5","author":"Raglio","year":"2015","journal-title":"World J. Psychiatry"},{"key":"ref_5","unstructured":"Patricia, E.B. (2017, June 07). Music as a Mood Modulator. Retrospective Theses and Dissertations, 1992, 17311. Available online: https:\/\/lib.dr.iastate.edu\/rtd\/17311."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Eerola, T., and Peltola, H.R. (2016). Memorable Experiences with Sad Music\u2014Reasons, Reactions and Mechanisms of Three Types of Experiences. PLoS ONE, 11.","DOI":"10.1371\/journal.pone.0157444"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1177\/0305735619849622","article-title":"Sad Music Depresses Sad Adolescents: A Listener\u2019s Profile","volume":"49","author":"Bogt","year":"2019","journal-title":"Psychol. Music"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"61","DOI":"10.1016\/j.concog.2016.06.015","article-title":"Metaphor and Music Emotion: Ancient Views and Future Directions","volume":"44","author":"Pannese","year":"2016","journal-title":"Conscious. 
Cogn."},{"key":"ref_9","first-page":"2056305119847514","article-title":"Genres as Social Affect: Cultivating Moods and Emotions through Playlists on Spotify","volume":"5","author":"Siles","year":"2019","journal-title":"Soc. Media Soc."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"217","DOI":"10.3389\/fpubh.2016.00217","article-title":"Music Streaming Services as Adjunct Therapies for Depression, Anxiety, and Bipolar Symptoms: Convergence of Digital Technologies, Mobile Apps, Emotions, and Global Mental Health","volume":"4","author":"Schriewer","year":"2016","journal-title":"Front. Public Health"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Pandeya, Y.R., Kim, D., and Lee, J. (2018). Domestic Cat Sound Classification Using Learned Features from Deep Neural Nets. Appl. Sci., 8.","DOI":"10.3390\/app8101949"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"162625","DOI":"10.1109\/ACCESS.2020.3022058","article-title":"Visual Object Detector for Cow Sound Event Detection","volume":"8","author":"Pandeya","year":"2020","journal-title":"IEEE Access"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"154","DOI":"10.5391\/IJFIS.2018.18.2.154","article-title":"Domestic Cat Sound Classification Using Transfer Learning","volume":"18","author":"Pandeya","year":"2018","journal-title":"Int. J. Fuzzy Log. Intell. Syst."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Pandeya, Y.R., Bhattarai, B., and Lee, J. (2020, January 21\u201323). Sound Event Detection in Cowshed using Synthetic Data and Convolutional Neural Network. 
Proceedings of the 2020 International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Korea.","DOI":"10.1109\/ICTC49870.2020.9289545"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"206016","DOI":"10.1109\/ACCESS.2020.3037773","article-title":"Parallel Stacked Hourglass Network for Music Source Separation","volume":"8","author":"Bhattarai","year":"2020","journal-title":"IEEE Access"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"2887","DOI":"10.1007\/s11042-020-08836-3","article-title":"Deep Learning-based Late Fusion of Multimodal Information for Emotion Classification of Music Video","volume":"80","author":"Pandeya","year":"2020","journal-title":"Multimed. Tools Appl."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019\u20132, January 27). SlowFast Networks for Video Recognition. Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00630"},{"key":"ref_18","unstructured":"Joze, H.R.V., Shaban, A., Iuzzolino, M.L., and Koishida, K. (2020, January 13\u201319). MMTM: Multimodal Transfer Module for CNN Fusion. Proceedings of the CVPR 2020, Seattle, WA, USA."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., Albanie, S., Sun, G., and Wu, E. (2018, January 18\u201322). Squeeze-and-Excitation Networks. Proceedings of the CVPR 2018, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"209","DOI":"10.1109\/TAFFC.2017.2695460","article-title":"Modelling Affect for Horror Soundscapes","volume":"10","author":"Lopes","year":"2019","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_21","unstructured":"Naoki, N., Katsutoshi, I., Hiromasa, F., Goto, M., Ogata, T., and Okuno, H.G. (2011\u20131, January 28). 
A Musical Mood Trajectory Estimation Method Using Lyrics and Acoustic Features. Proceedings of the 1st international ACM workshop on Music information retrieval with user-centered and multimodal strategies, Scottsdale, AZ, USA."},{"key":"ref_22","unstructured":"Song, Y., Dixon, S., and Pearce, M. (2012, January 8\u201312). Evaluation of Musical Features for Music Emotion Classification. Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), Porto, Portugal."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Lin, C., Liu, M., Hsiung, W., and Jhang, J. (2016, January 10\u201313). Music Emotion Recognition Based on Two-level Support Vector Classification. Proceedings of the 2016 International Conference on Machine Learning and Cybernetics (ICMLC), Jeju Island, Korea.","DOI":"10.1109\/ICMLC.2016.7860930"},{"key":"ref_24","first-page":"53","article-title":"Extraction of Audio Features for Emotion Recognition System Based on Music","volume":"5","author":"Han","year":"2016","journal-title":"Int. J. Sci. Technol. Res."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"614","DOI":"10.1109\/TAFFC.2018.2820691","article-title":"Novel Audio Features for Music Emotion Recognition","volume":"11","author":"Panda","year":"2020","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Aljanaki, A., Yang, Y.H., and Soleymani, M. (2017). Developing a Benchmark for Emotional Analysis of Music. PLoS ONE, 12.","DOI":"10.1371\/journal.pone.0173392"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Malik, M., Adavanne, A., Drossos, K., Virtanen, T., Ticha, D., and Jarina, R. (2017). Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition. arXiv, Available online: https:\/\/arxiv.org\/abs\/1706.02292.","DOI":"10.23919\/EUSIPCO.2017.8081505"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Jakubik, J., and Kwa\u015bnicka, H. 
(2017, January 3\u20135). Music Emotion Analysis using Semantic Embedding Recurrent Neural Networks. Proceedings of the 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), Gdynia, Poland.","DOI":"10.1109\/INISTA.2017.8001169"},{"key":"ref_29","unstructured":"Liu, X., Chen, Q., Wu, X., Yan, L., and Yang, L. (2017). CNN Based Music Emotion Classification. arXiv, Available online: https:\/\/arxiv.org\/abs\/1704.05665."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Tsunoo, E., Akase, T., Ono, N., and Sagayama, S. (2010, January 14\u201319). Music mood classification by rhythm and bass-line unit pattern analysis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA.","DOI":"10.1109\/ICASSP.2010.5495964"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Turnbull, D., Barrington, L., Torres, D., and Lanckriet, G. (2007, January 23\u201327). Towards musical query-by-semantic description using the cal500 data set. Proceedings of the ACM SIGIR, Amsterdam, The Netherlands.","DOI":"10.1145\/1277741.1277817"},{"key":"ref_32","unstructured":"Li, S., and Huang, L. (2018, January 13\u201315). Music Emotions Recognition Based on Feature Analysis. Proceedings of the 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Wang, S., Wang, J., Yang, Y., and Wang, H. (2014, January 14\u201318). Towards time-varying music auto-tagging based on cal500 expansion. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Chengdu, China.","DOI":"10.1109\/ICME.2014.6890290"},{"key":"ref_34","unstructured":"Berardinis, J., Cangelosi, A., and Coutinho, E. (2020, January 11\u201316). 
The Multiple Voices of Music Emotions: Source Separation for Improving Music Emotion Recognition Models and Their Interpretability. Proceedings of the ISMIR 2020, Montr\u00e9al, QC, Canada."},{"key":"ref_35","unstructured":"Chaki, S., Doshi, P., Bhattacharya, S., and Patnaik, P. (2020, January 11\u201316). Explaining Perceived Emotions in Music: An Attentive Approach. Proceedings of the ISMIR 2020, Montr\u00e9al, QC, Canada."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Orjesek, R., Jarina, R., Chmulik, M., and Kuba, M. (2019, January 16\u201318). DNN Based Music Emotion Recognition from Raw Audio Signal. Proceedings of the 29th International Conference Radioelektronika (RADIOELEKTRONIKA), Pardubice, Czech Republic.","DOI":"10.1109\/RADIOELEK.2019.8733572"},{"key":"ref_37","unstructured":"Choi, W., Kim, M., Chung, J., Lee, D., and Jung, S. (2020, January 11\u201316). Investigating U-nets with Various Intermediate blocks for Spectrogram-Based Singing Voice Separation. Proceedings of the ISMIR 2020, Montr\u00e9al, QC, Canada."},{"key":"ref_38","unstructured":"Yin, D., Luo, C., Xiong, Z., and Zeng, W. (2019). Phasen: A phase-and-harmonics-aware speech enhancement network. arXiv, Available online: https:\/\/www.isca-speech.org\/archive\/Interspeech_2018\/abstracts\/1773.html."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Takahashi, N., Agrawal, P., Goswami, N., and Mitsufuji, Y. (2018). Phasenet: Discretized phase modeling with deep neural networks for audio source separation. Interspeech, 2713\u20132717.","DOI":"10.21437\/Interspeech.2018-1773"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Zhang, H., and Xu, M. (2016, January 25\u201328). Modeling temporal information using discrete fourier transform for recognizing emotions in user-generated videos. 
Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA.","DOI":"10.1109\/ICIP.2016.7532433"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1109\/TAFFC.2016.2622690","article-title":"Heterogeneous Knowledge Transfer in Video Emotion Recognition, Attribution and Summarization","volume":"9","author":"Xu","year":"2018","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"148","DOI":"10.1109\/TMM.2019.2922129","article-title":"A Multi-Task Neural Approach for Emotion Attribution, Classification, and Summarization","volume":"22","author":"Tu","year":"2020","journal-title":"IEEE Trans. Multimed."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Lee, J., Kim, S., Kim, S., and Sohn, K. (2018, January 15\u201320). Spatiotemporal Attention Based Deep Neural Networks for Emotion Recognition. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461920"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Sun, M., Hsu, S., Yang, M., and Chien, J. (2018, January 20\u201322). Context-aware Cascade Attention-based RNN for Video Emotion Recognition. Proceedings of the 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia), Beijing, China.","DOI":"10.1109\/ACIIAsia.2018.8470372"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Xu, B., Zheng, Y., Ye, H., Wu, C., Wang, H., and Sun, G. (2019, January 8\u201312). Video Emotion Recognition with Concept Selection. 
Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China.","DOI":"10.1109\/ICME.2019.00077"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"523","DOI":"10.1109\/TMM.2010.2051871","article-title":"Affective Audio-Visual Words and Latent Topic Driving Model for Realizing Movie Affective Scene Classification","volume":"12","author":"Irie","year":"2010","journal-title":"IEEE Trans. Multimedia"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1016\/j.neucom.2018.02.052","article-title":"A Novel Feature Set for Video Emotion Recognition","volume":"291","author":"Mo","year":"2018","journal-title":"Neurocomputing"},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"66","DOI":"10.1016\/j.imavis.2017.01.012","article-title":"Video-based Emotion Recognition in the Wild using Deep Transfer Learning and Score Fusion","volume":"65","author":"Kaya","year":"2017","journal-title":"Image Vis. Comput."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Li, H., Kumar, N., Chen, R., and Georgiou, P. (2018, January 15\u201320). A Deep Reinforcement Learning Framework for Identifying Funny Scenes in Movies. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462686"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"124","DOI":"10.1037\/h0030377","article-title":"Constants Across Cultures in the Face and Emotion","volume":"17","author":"Ekman","year":"1971","journal-title":"J. Pers. Soc. Psychol."},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"1424","DOI":"10.1109\/34.895976","article-title":"Automatic Analysis of Facial Expressions: The State of the art","volume":"22","author":"Pantic","year":"2000","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_52","unstructured":"Li, S., and Deng, W. (2020). Deep Facial Expression Recognition: A Survey. 
IEEE Trans. Affect. Comput."},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1109\/TCYB.2016.2625419","article-title":"Automatic Facial Expression Recognition System Using Deep Network-Based Data Fusion","volume":"48","author":"Majumder","year":"2018","journal-title":"IEEE Trans. Cybern."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Kuo, C., Lai, S., and Sarkis, M. (2018, January 18\u201322). A Compact Deep Learning Model for Robust Facial Expression Recognition. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPRW.2018.00286"},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"8","DOI":"10.1016\/j.patrec.2020.11.002","article-title":"Combined Center Dispersion Loss Function for Deep Facial Expression Recognition","volume":"141","author":"Nanda","year":"2021","journal-title":"Pattern Recognit. Lett."},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TMM.2020.2975922","article-title":"End-to-End Audiovisual Speech Recognition System with Multitask Learning","volume":"23","author":"Tao","year":"2021","journal-title":"IEEE Trans. Multimed."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Eskimez, S.E., Maddox, R.K., Xu, C., and Duan, Z. (2019, January 16). Noise-Resilient Training Method for Face Landmark Generation from Speech. Proceedings of the IEEE\/ACM Transactions on Audio, Speech, and Language Processing, Los Altos, CA, USA.","DOI":"10.1109\/TASLP.2019.2947741"},{"key":"ref_58","first-page":"927","article-title":"EmoCo: Visual Analysis of Emotion Coherence in Presentation Videos","volume":"26","author":"Zeng","year":"2020","journal-title":"IEEE Trans. Vis. Comput. Graph."},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Seanglidet, Y., Lee, B.S., and Yeo, C.K. (2016, January 18\u201320). 
Mood prediction from facial video with music \u201ctherapy\u201d on a smartphone. Proceedings of the 2016 Wireless Telecommunications Symposium (WTS), London, UK.","DOI":"10.1109\/WTS.2016.7482034"},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Kostiuk, B., Costa, Y.M.G., Britto, A.S., Hu, X., and Silla, C.N. (2019, January 4\u20136). Multi-label Emotion Classification in Music Videos Using Ensembles of Audio and Video Features. Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA.","DOI":"10.1109\/ICTAI.2019.00078"},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Acar, E., Hopfgartner, F., and Albayrak, S. (2014, January 10\u201318). Understanding Affective Content of Music Videos through Learned Representations. Proceedings of the International Conference on Multimedia Modeling, Dublin, Ireland.","DOI":"10.1007\/978-3-319-04114-8_26"},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Ekman, P. (1999). Basic Emotions in Handbook of Cognition and Emotion, Wiley.","DOI":"10.1002\/0470013494.ch3"},{"key":"ref_63","doi-asserted-by":"crossref","first-page":"1161","DOI":"10.1037\/h0077714","article-title":"A Circumplex Model of Affect","volume":"39","author":"Russell","year":"1980","journal-title":"J. Personal. Soc. Psychol."},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Thayer, R.E. (1989). The Biopsychology of Mood and Arousal, Oxford University Press.","DOI":"10.1093\/oso\/9780195068276.001.0001"},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Plutchik, R. (1980). A General Psychoevolutionary Theory of Emotion in Theories of Emotion, Academic Press. 
[4th ed.].","DOI":"10.1016\/B978-0-12-558701-3.50007-7"},{"key":"ref_66","doi-asserted-by":"crossref","first-page":"26","DOI":"10.1186\/1687-5281-2013-26","article-title":"Multimedia Content Analysis for Emotional Characterization of Music Video Clips","volume":"2013","author":"Skodras","year":"2013","journal-title":"EURASIP J. Image Video Process."},{"key":"ref_67","unstructured":"G\u00f3mez-Ca\u00f1\u00f3n, J.S., Cano, E., Herrera, P., and G\u00f3mez, E. (2020, January 11\u201316). Joyful for You and Tender for Us: The Influence of Individual Characteristics and Language on Emotion Labeling and Classification. Proceedings of the ISMIR 2020, Montr\u00e9al, QC, Canada."},{"key":"ref_68","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1177\/0305735610362821","article-title":"A comparison of the discrete and dimensional models of emotion in music","volume":"39","author":"Eerola","year":"2011","journal-title":"Psychol. Music"},{"key":"ref_69","unstructured":"Makris, D., Kermanidis, K.L., and Karydis, I. (2014, January 19\u201321). The Greek Audio Dataset. Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, Rhodes, Greece."},{"key":"ref_70","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1016\/j.ipm.2015.03.004","article-title":"Studying emotion induced by music through a crowdsourcing game","volume":"52","author":"Aljanaki","year":"2016","journal-title":"Inf. Process. Manag."},{"key":"ref_71","doi-asserted-by":"crossref","first-page":"448","DOI":"10.1109\/TASL.2007.911513","article-title":"A Regression Approach to Music Emotion Recognition","volume":"16","author":"Yang","year":"2008","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Livingstone, S.R., and Russo, R.A. (2018). 
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13.","DOI":"10.1371\/journal.pone.0196391"},{"key":"ref_73","unstructured":"Lee, J., Kim, S., Kim, S., Park, J., and Sohn, K. (November, January 27). Context-Aware Emotion Recognition Networks. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea."},{"key":"ref_74","doi-asserted-by":"crossref","unstructured":"Malandrakis, N., Potamianos, A., Evangelopoulos, G., and Zlatintsi, A. (2011, January 22\u201327). A supervised approach to movie emotion tracking. Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic.","DOI":"10.1109\/ICASSP.2011.5946961"},{"key":"ref_75","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1109\/TAFFC.2015.2396531","article-title":"LIRIS-ACCEDE: A video database for affective content analysis","volume":"6","author":"Baveye","year":"2015","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_76","doi-asserted-by":"crossref","unstructured":"Yang, Y.H., and Chen, H.H. (2011). Music Emotion Recognition, CRC Press.","DOI":"10.1201\/b10731"},{"key":"ref_77","unstructured":"Geirhos, R., Jacobsen, J.H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F.A. (2021). Shortcut Learning in Deep Neural Networks. arXiv, Available online: https:\/\/arxiv.org\/abs\/2004.07780."},{"key":"ref_78","unstructured":"CJ-Moore, B. (2012). An Introduction to the Psychology of Hearing, Brill."},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201313). Learning Spatiotemporal Features with 3D Convolutional Networks. 
Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_80","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2018). Quo vadis, action recognition? A new model and the kinetics dataset. arXiv.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_81","unstructured":"Du, T., Heng, W., Lorenzo, T., and Matt, F. (2019). Video Classification with Channel-Separated Convolutional Networks. arXiv, Available online: https:\/\/arxiv.org\/abs\/1904.02811."},{"key":"ref_82","doi-asserted-by":"crossref","unstructured":"Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. (2018, January 18\u201323). A Closer Look at Spatiotemporal Convolutions for Action Recognition. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00675"},{"key":"ref_83","doi-asserted-by":"crossref","unstructured":"Pons, J., Lidy, T., and Serra, X. (2016, January 15\u201317). Experimenting with musically motivated convolutional neural networks. Proceedings of the 2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI), Bucharest, Romania.","DOI":"10.1109\/CBMI.2016.7500246"},{"key":"ref_84","doi-asserted-by":"crossref","unstructured":"Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014, January 23\u201328). Large-Scale Video Classification with Convolutional Neural Networks. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.223"},{"key":"ref_85","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1016\/j.inffus.2017.02.003","article-title":"A review of Affective Computing: From Unimodal Analysis to Multimodal Fusion","volume":"37","author":"Poria","year":"2017","journal-title":"Inf. 
Fusion"},{"key":"ref_86","first-page":"518","article-title":"The Effects of Music on Emotional Response, Brand Attitude, and Purchase Intent in an Emotional Advertising Condition","volume":"25","author":"Morris","year":"1998","journal-title":"Adv. Consum. Res."},{"key":"ref_87","doi-asserted-by":"crossref","first-page":"1880","DOI":"10.3389\/fpsyg.2018.01880","article-title":"The Effects of User Engagements for User and Company Generated Videos on Music Sales: Empirical Evidence from YouTube","volume":"9","author":"Park","year":"2018","journal-title":"Front. Psychol."},{"key":"ref_88","doi-asserted-by":"crossref","first-page":"473","DOI":"10.1177\/1470593117692021","article-title":"Music in advertising and consumer identity: The search for Heideggerian authenticity","volume":"17","author":"Abolhasani","year":"2017","journal-title":"Mark. Theory"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/14\/4927\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:32:11Z","timestamp":1760164331000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/14\/4927"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,20]]},"references-count":88,"journal-issue":{"issue":"14","published-online":{"date-parts":[[2021,7]]}},"alternative-id":["s21144927"],"URL":"https:\/\/doi.org\/10.3390\/s21144927","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,20]]}}}