{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T16:02:28Z","timestamp":1774368148245,"version":"3.50.1"},"reference-count":38,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2021,8,24]],"date-time":"2021-08-24T00:00:00Z","timestamp":1629763200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Rui Zhang","award":["ZDKT2018-006"],"award-info":[{"award-number":["ZDKT2018-006"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Multimodal sentiment analysis and emotion recognition represent a major research direction in natural language processing (NLP). With the rapid development of online media, people often express their emotions on a topic in the form of video, and the signals it transmits are multimodal, including language, visual, and audio. Therefore, the traditional unimodal sentiment analysis method is no longer applicable, which requires the establishment of a fusion model of multimodal information to obtain sentiment understanding. In previous studies, scholars used the feature vector cascade method when fusing multimodal data at each time step in the middle layer. This method puts each modal information in the same position and does not distinguish between strong modal information and weak modal information among multiple modalities. At the same time, this method does not pay attention to the embedding characteristics of multimodal signals across the time dimension. In response to the above problems, this paper proposes a new method and model for processing multimodal signals, which takes into account the delay and hysteresis characteristics of multimodal signals across the time dimension. The purpose is to obtain a multimodal fusion feature emotion analysis representation. 
We evaluate our method on the multimodal sentiment analysis benchmark dataset CMU Multimodal Opinion Sentiment and Emotion Intensity Corpus (CMU-MOSEI). We compare our proposed method with the state-of-the-art model and show excellent results.<\/jats:p>","DOI":"10.3390\/info12090342","type":"journal-article","created":{"date-parts":[[2021,8,24]],"date-time":"2021-08-24T22:12:22Z","timestamp":1629843142000},"page":"342","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["Feature Extraction Network with Attention Mechanism for Data Enhancement and Recombination Fusion for Multimodal Sentiment Analysis"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4637-5196","authenticated-orcid":false,"given":"Qingfu","family":"Qi","sequence":"first","affiliation":[{"name":"College of Electronic Information and Automation, Tianjin University of Science & Technology, Tianjin 300222, China"},{"name":"School of Software and Communications, Tianjin Sino-German University of Applied Sciences, Tianjin 300222, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Liyuan","family":"Lin","sequence":"additional","affiliation":[{"name":"College of Electronic Information and Automation, Tianjin University of Science & Technology, Tianjin 300222, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7875-5193","authenticated-orcid":false,"given":"Rui","family":"Zhang","sequence":"additional","affiliation":[{"name":"College of Electronic Information and Automation, Tianjin University of Science & Technology, Tianjin 300222, China"},{"name":"School of Software and Communications, Tianjin Sino-German University of Applied Sciences, Tianjin 300222, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,8,24]]},"reference":[{"key":"ref_1","first-page":"11","article-title":"The Method of Semantic Structuring of Virtual Community Content","volume":"Volume 1044","author":"Korobiichuk","year":"2020","journal-title":"Mechatronics 2019: Recent Advances Towards Industry 4.0. MECHATRONICS 2019"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Cambria, E., Hazarika, D., Poria, S., Hussain, A., and Subramanyam, R.B.V. (2018). Benchmarking multimodal sentiment analysis. Computational Linguistics and Intelligent Text Processing, Springer.","DOI":"10.1007\/978-3-319-77116-8_13"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Reiter, E., and Dale, R. (2000). Building Natural Language Generation Systems, Cambridge University Press. [1st ed.].","DOI":"10.1017\/CBO9780511519857"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"61","DOI":"10.1016\/j.csl.2014.09.005","article-title":"A survey on the application of recurrent neural networks to statistical language modeling","volume":"30","author":"Bethard","year":"2015","journal-title":"Comput. Speech Lang."},{"key":"ref_5","first-page":"292","article-title":"Experiments with Open-Domain Textual Question Answering","volume":"2","author":"Harabagiu","year":"2000","journal-title":"Coling"},{"key":"ref_6","first-page":"597","article-title":"Advances in Open Domain Question Answering","volume":"33","author":"Strzalkowski","year":"2007","journal-title":"Comput. Linguist."},{"key":"ref_7","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv."},{"key":"ref_8","unstructured":"Dodge, J., Gane, A., Zhang, X., Bordes, A., Chopra, S., Miller, A.H., Szlam, A., and Weston, J. (2016). Evaluating prerequisite qualities for learning end-to-end dialog systems. 
arXiv."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., and Jurafsky, D. (2016, January 1\u20135). Deep Reinforcement Learning for Dialogue Generation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA.","DOI":"10.18653\/v1\/D16-1127"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Li, H., Lin, Z., Shen, X., Brandt, J., and Hua, G. (2015, January 7\u201312). A convolutional neural network cascade for face detection. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299170"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Jiang, H., and Learned-Miller, E. (June, January 30). Face Detection with the Faster R-CNN. Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA.","DOI":"10.1109\/FG.2017.82"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Badshah, A.M., Ahmad, J., Rahim, N., and Baik, S.W. (2017, January 13\u201315). Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network. Proceedings of the 2017 International Conference on Platform Technology and Service, PlatCon 2017, Busan, Korea.","DOI":"10.1109\/PlatCon.2017.7883728"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Thongtan, T., and Phienthrakul, T. (August, January 28). Sentiment Classification Using Document Embeddings Trained with Cosine Similarity. Proceedings of the ACL 2019\u201457th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop, Florence, Italy.","DOI":"10.18653\/v1\/P19-2057"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Pham, H., Manzini, T., Liang, P.P., and Poczos, B. (2018). Seq2seq2sentiment: Multimodal sequence to sequence models for sentiment analysis. 
arXiv.","DOI":"10.18653\/v1\/W18-3308"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Akhtar, M.S., Chauhan, D.S., Ghosal, D., Poria, S., Ekbal, A., and Bhattacharyya, P. (2019, January 2\u20137). Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis. Proceedings of the NAACL HLT 2019, Minneapolis, MN, USA.","DOI":"10.18653\/v1\/N19-1034"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Huang, Y., Wang, W., Wang, L., and Tan, T. (2013, January 15\u201318). Multi-task deep neural network for multi-label learning. Proceedings of the 2013 IEEE International Conference on Image Processing, ICIP 2013, Melbourne, Australia.","DOI":"10.1109\/ICIP.2013.6738596"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Yoon, S., Byun, S., and Jung, K. (2018, January 18\u201321). Multimodal Speech Emotion Recognition Using Audio and Text. Proceedings of the 2018 IEEE Spoken Language Technology Workshop, SLT 2018, Athens, Greece.","DOI":"10.1109\/SLT.2018.8639583"},{"key":"ref_18","first-page":"359","article-title":"Multiple Classifier Systems for the Classification of Audio-Visual Emotional States","volume":"Volume 6975","author":"Glodek","year":"2011","journal-title":"Computational Linguistics and Intelligent Text Processing"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Ghosh, S., Laksana, E., Morency, L.-P., and Scherer, S. (2016, January 8\u201312). Representation Learning for Speech Emotion Recognition. Proceedings of the Annual Conference of the International Speech Communication Association, San Francisco, CA, USA.","DOI":"10.21437\/Interspeech.2016-692"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Hu, A., and Flaxman, S. (2018, January 19\u201323). Multimodal Sentiment Analysis To Explore the Structure of Emotions. 
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK.","DOI":"10.1145\/3219819.3219853"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Wang, H., Meghawat, A., Morency, L.P., and Xing, E.P. (2017, January 10\u201314). Select-additive learning: Improving generalization in multimodal sentiment analysis. Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China.","DOI":"10.1109\/ICME.2017.8019301"},{"key":"ref_22","unstructured":"Zadeh, A., Zellers, R., Pincus, E., and Morency, L.P. (2016). Mosi: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Williams, J., Kleinegesse, S., Comanescu, R., and Radu, O. (2018, January 20). Recognizing emotions in video using multimodal dnn feature fusion. Proceedings of the Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), Melbourne, Australia.","DOI":"10.18653\/v1\/W18-3302"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"124","DOI":"10.1016\/j.knosys.2018.07.041","article-title":"Multimodal sentiment analysis using hierarchical fusion with context modeling","volume":"161","author":"Majumder","year":"2018","journal-title":"Knowl.-Based Syst."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Poria, S., Mazumder, N., Cambria, E., Hazarika, D., Morency, L.-P., and Zadeh, A. (August, January 30). Context-Dependent Sentiment Analysis in User-Generated Videos. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics ACL 2017, Vancouver, BC, Canada.","DOI":"10.18653\/v1\/P17-1081"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Liang, P.P., Liu, Z., Zadeh, A., and Morency, L.-P. (November, January 31). Multimodal Language Analysis with Recurrent Multistage Fusion. 
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.","DOI":"10.18653\/v1\/D18-1014"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Zadeh, A., Poria, S., Liang, P.P., Cambria, E., Mazumder, N., and Morency, L.-P. (2018, January 2\u20138). Memory Fusion Network for Multi-view Sequential Learning. Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.12021"},{"key":"ref_28","unstructured":"Wang, Y., Shen, Y., Liu, Z., Liang, P.P., Zadeh, A., and Morency, L.-P. (February, January 27). Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors. Proceedings of the AAAI Conference on Artificial Intelligence AAAI, Honolulu, HI, USA."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Delbrouck, J.-B., Tits, N., Brousmiche, M., and Dupont, S. (2020, January 10). A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis. Proceedings of the Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML), Seattle, WA, USA.","DOI":"10.18653\/v1\/2020.challengehml-1.1"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Cho, K., Van Merri\u00ebnboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_32","unstructured":"Zadeh, A., Liang, P.P., Vanbriesen, J., Poria, S., Tong, E., Cambria, E., Chen, M., and Morency, L.-P. (2018, January 15\u201320). Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. 
Proceedings of the ACL 2018, Melbourne, Australia."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C.D. (2014, January 25\u201329). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014, January 23\u201328). DeepFace: Closing the Gap to Human-Level Performance in Face Verification. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.220"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Schroff, F., Kalenichenko, D., and Philbin, J. (2015, January 7\u201312). FaceNet: A unified embedding for face recognition and clustering. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., and Song, L. (2017, January 21\u201326). SphereFace: Deep Hypersphere Embedding for Face Recognition. Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.713"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Degottex, G., Kane, J., Drugman, T., Raitio, T., and Scherer, S. (2014, January 4\u20139). COVAREP\u2014A collaborative voice analysis repository for speech technologies. Proceedings of the ICASSP, Florence, Italy.","DOI":"10.1109\/ICASSP.2014.6853739"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Akhtar, M.S., Chauhan, D.S., Ghosal, D., Poria, S., Ekbal, A., and Bhattacharyya, P. (2019). Multi-task learning for multi-modal emotion recognition and sentiment analysis. 
arXiv.","DOI":"10.18653\/v1\/N19-1034"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/12\/9\/342\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:50:48Z","timestamp":1760165448000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/12\/9\/342"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,24]]},"references-count":38,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2021,9]]}},"alternative-id":["info12090342"],"URL":"https:\/\/doi.org\/10.3390\/info12090342","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,8,24]]}}}