{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,8]],"date-time":"2026-05-08T09:59:41Z","timestamp":1778234381719,"version":"3.51.4"},"reference-count":56,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2023,1,6]],"date-time":"2023-01-06T00:00:00Z","timestamp":1672963200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Natural Science Foundation of Autonomous Region","award":["2021D01C118"],"award-info":[{"award-number":["2021D01C118"]}]},{"name":"Natural Science Foundation of Autonomous Region","award":["042419006"],"award-info":[{"award-number":["042419006"]}]},{"name":"Autonomous Region High-Level Innovative Talent Project","award":["2021D01C118"],"award-info":[{"award-number":["2021D01C118"]}]},{"name":"Autonomous Region High-Level Innovative Talent Project","award":["042419006"],"award-info":[{"award-number":["042419006"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Sentiment classification is a key task in exploring people\u2019s opinions; improved sentiment classification can help individuals make better decisions. Social media users are increasingly using both images and text to express their opinions and share their experiences, instead of only using text in conventional social media. As a result, understanding how to fully utilize them is critical in a variety of activities, including sentiment classification. In this work, we provide a fresh multimodal sentiment classification approach: visual distillation and attention network or VisdaNet. 
First, this method proposes a knowledge augmentation module, which overcomes the lack of information in short text by integrating the information of image captions and short text; secondly, aimed at the information control problem in the multi-modal fusion process in the product review scene, this paper proposes a knowledge distillation based on the CLIP module to reduce the noise information of the original modalities and improve the quality of the original modal information. Finally, regarding the single-text multi-image fusion problem in the product review scene, this paper proposes visual aspect attention based on the CLIP module, which correctly models the text-image interaction relationship in special scenes and realizes feature-level fusion across modalities. The results of the experiment on the Yelp multimodal dataset reveal that our model outperforms the previous SOTA model. Furthermore, the ablation experiment results demonstrate the efficacy of various tactics in the suggested model.<\/jats:p>","DOI":"10.3390\/s23020661","type":"journal-article","created":{"date-parts":[[2023,1,9]],"date-time":"2023-01-09T06:38:27Z","timestamp":1673246307000},"page":"661","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":19,"title":["VisdaNet: Visual Distillation and Attention Network for Multimodal Sentiment Classification"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0258-4893","authenticated-orcid":false,"given":"Shangwu","family":"Hou","sequence":"first","affiliation":[{"name":"Xinjiang Multilingual Information Technology Laboratory, Xinjiang Multilingual Information Technology Research Center, College of Information Science and Engineering, Xinjiang University, Urumqi 830017, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2441-2483","authenticated-orcid":false,"given":"Gulanbaier","family":"Tuerhong","sequence":"additional","affiliation":[{"name":"Xinjiang Multilingual Information Technology Laboratory, Xinjiang Multilingual Information Technology Research Center, College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9480-6915","authenticated-orcid":false,"given":"Mairidan","family":"Wushouer","sequence":"additional","affiliation":[{"name":"Xinjiang Multilingual Information Technology Laboratory, Xinjiang Multilingual Information Technology Research Center, College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2023,1,6]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"423","DOI":"10.1109\/TPAMI.2018.2798607","article-title":"Multimodal machine learning: A survey and taxonomy","volume":"41","author":"Baltrusaitis","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. (2016). Hierarchical attention networks for document classification. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics.","DOI":"10.18653\/v1\/N16-1174"},{"key":"ref_3","first-page":"305","article-title":"VistaNet: Visual aspect attention network for multimodal sentiment analysis","volume":"33","author":"Truong","year":"2019","journal-title":"Proc. AAAI Conf. Artif. 
Intell."},{"key":"ref_4","first-page":"9749","article-title":"Multimodal summarization with guidance of multimodal reference","volume":"34","author":"Zhu","year":"2020","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_5","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning, Virtual."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"108107","DOI":"10.1016\/j.knosys.2021.108107","article-title":"Gated attention fusion network for multimodal sentiment classification","volume":"240","author":"Du","year":"2022","journal-title":"Knowl.-Based Syst."},{"key":"ref_7","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_9","unstructured":"Lu, J., Batra, D., Parikh, D., and Lee, S. (2019, January 8\u201314). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_10","unstructured":"Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., and Dai, J. (2020, January 26\u201330). VL-BERT: Pre-training of generic visual-linguistic representations. 
Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Li, L.H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-W. (2020, January 5\u201310). What does BERT with vision look at? Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online.","DOI":"10.18653\/v1\/2020.acl-main.469"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"429","DOI":"10.1109\/TASLP.2019.2957872","article-title":"Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification","volume":"28","author":"Yu","year":"2020","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"104827","DOI":"10.1016\/j.knosys.2019.06.035","article-title":"Sparse Attention Based Separable Dilated Convolutional Neural Network for Targeted Sentiment Analysis","volume":"188","author":"Gan","year":"2020","journal-title":"Knowl.-Based Syst."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"268","DOI":"10.1016\/j.ins.2019.06.050","article-title":"Gated Recurrent Neural Network with Sentimental Relations for Sentiment Classification","volume":"502","author":"Chen","year":"2019","journal-title":"Inf. Sci."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"292","DOI":"10.1016\/j.future.2018.12.018","article-title":"Sentiment Analysis through Recurrent Variants Latterly on Convolutional Neural Network of Twitter","volume":"95","author":"Abid","year":"2019","journal-title":"Future Gener. Comput. Syst."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1016\/j.ins.2018.10.030","article-title":"Three-way enhanced convolutional neural networks for sentence-level sentiment classification","volume":"477","author":"Zhang","year":"2019","journal-title":"Inf. 
Sci."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014, January 22\u201327). A convolutional neural network for modelling sentences. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA.","DOI":"10.3115\/v1\/P14-1062"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Kim, Y. (2014, January 25\u201329). Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1181"},{"key":"ref_19","first-page":"2267","article-title":"Recurrent convolutional neural networks for text classification","volume":"29","author":"Lai","year":"2015","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1016\/j.future.2020.08.005","article-title":"ABCDM: An attention-based bidirectional CNN-RNN deep model for sentiment analysis","volume":"115","author":"Basiri","year":"2021","journal-title":"Future Gener. Comput. Syst."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"64","DOI":"10.1109\/MCI.2019.2954667","article-title":"How intense are you? Predicting intensities of emotions and sentiments using stacked ensemble [Application Notes]","volume":"15","author":"Akhtar","year":"2020","journal-title":"IEEE Comput. Intell. Mag."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. 
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics.","DOI":"10.18653\/v1\/N18-1202"},{"key":"ref_23","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics."},{"key":"ref_24","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, Curran Associates, Inc."},{"key":"ref_25","unstructured":"Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., and Le, Q.V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, Curran Associates, Inc."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"126","DOI":"10.1016\/j.inffus.2018.03.007","article-title":"Consensus vote models for detecting and filtering neutrality in sentiment analysis","volume":"44","author":"Valdivia","year":"2018","journal-title":"Inf. Fusion"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"683","DOI":"10.1142\/S0218488520500294","article-title":"Multi-level fine-scaled sentiment sensing with ambivalence handling","volume":"28","author":"Wang","year":"2020","journal-title":"Int. J. Uncertain. Fuzziness Knowl.-Based Syst."},{"key":"ref_28","first-page":"8002","article-title":"Real-time emotion recognition via attention gated hierarchical memory network","volume":"34","author":"Jiao","year":"2020","journal-title":"Proc. AAAI Conf. Artif. 
Intell."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Ghosal, D., Majumder, N., Gelbukh, A., Mihalcea, R., and Poria, S. (2020, January 16\u201320). COSMIC: Common-sense knowledge for emotion identification in conversations. Proceedings of the Findings of the Association for Computational Linguistics, EMNLP 2020, Online Event.","DOI":"10.18653\/v1\/2020.findings-emnlp.224"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1016\/j.neucom.2021.09.057","article-title":"BiERU: Bidirectional emotional recurrent unit for conversational sentiment analysis","volume":"467","author":"Li","year":"2022","journal-title":"Neurocomputing"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Borth, D., Ji, R., Chen, T., Breuel, T., and Chang, S.-F. (2013, January 21). Large-scale visual sentiment ontology and detectors using adjective noun pairs. Proceedings of the 21st ACM International Conference on Multimedia, Barcelona, Spain.","DOI":"10.1145\/2502081.2502282"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Yu, Y., Lin, H., Meng, J., and Zhao, Z. (2016). Visual and textual sentiment analysis of a microblog using deep convolutional neural networks. Algorithms, 9.","DOI":"10.3390\/a9020041"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Xu, N., and Mao, W. (2017, January 6). MultiSentiNet: A deep semantic network for multimodal sentiment analysis. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore.","DOI":"10.1145\/3132847.3133142"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Xu, N., Mao, W., and Chen, G. (2018, January 27). A co-memory network for multimodal sentiment analysis. Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA.","DOI":"10.1145\/3209978.3210093"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Cai, Y., Cai, H., and Wan, X. (2019). 
Multi-modal sarcasm detection in twitter with hierarchical fusion model. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics.","DOI":"10.18653\/v1\/P19-1239"},{"key":"ref_36","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very deep convolutional networks for large-scale image recognition. Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"102048","DOI":"10.1016\/j.media.2021.102048","article-title":"Faster mean-shift: GPU-accelerated clustering for cosine embedding-based cell segmentation and tracking","volume":"71","author":"Zhao","year":"2021","journal-title":"Med. Image Anal."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Engelhardt, S., Oksuz, I., Zhu, D., Yuan, Y., Mukhopadhyay, A., Heller, N., Huang, S.X., Nguyen, H., Sznitman, R., and Xue, Y. (2021). Compound figure separation of biomedical images with side loss. Deep Generative Models, and Data Augmentation, Labelling, and Imperfections, Springer International Publishing.","DOI":"10.1007\/978-3-030-88210-5"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"21780","DOI":"10.1109\/JSEN.2022.3197235","article-title":"Pseudo RGB-D face recognition","volume":"22","author":"Jin","year":"2022","journal-title":"IEEE Sens. J."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"15844","DOI":"10.1109\/ACCESS.2018.2810849","article-title":"Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process","volume":"6","author":"Zheng","year":"2018","journal-title":"IEEE Access"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Wu, Y., Guo, H., Chakraborty, C., Khosravi, M., Berretti, S., and Wan, S. (2022). Edge computing driven low-light image dynamic enhancement for object detection. IEEE Trans. Netw. Sci. 
Eng.","DOI":"10.1109\/TNSE.2022.3151502"},{"key":"ref_42","unstructured":"Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv."},{"key":"ref_43","unstructured":"Zagoruyko, S., and Komodakis, N. (2017). Paying more attention to attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. arXiv."},{"key":"ref_44","unstructured":"Furlanello, T., Lipton, Z., Tschannen, M., Itti, L., and Anandkumar, A. (2018, January 10\u201315). Born again neural networks. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden."},{"key":"ref_45","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J\u00e9gou, H. (2020). Training data-efficient image transformers & distillation through Attention. arXiv."},{"key":"ref_46","unstructured":"Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Xiang, T., Hospedales, T.M., and Lu, H. (2018, January 18\u201322). Deep mutual learning. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00454"},{"key":"ref_48","unstructured":"Anil, R., Pereyra, G., Passos, A., Ormandi, R., Dahl, G.E., and Hinton, G.E. (2018). Large scale distributed neural network training through Online distillation. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Cho, K., van Merri\u00ebnboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014, January 25\u201329). Learning phrase representations using RNN encoder\u2013decoder for statistical machine translation. 
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_50","unstructured":"Kim, J.-H., and On, K.-W. (2016). Hadamard product for low-rank bilinear pooling. arXiv."},{"key":"ref_51","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Loper, E., and Bird, S. (2002). NLTK: The natural language toolkit. Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Association for Computational Linguistics.","DOI":"10.3115\/1118108.1118117"},{"key":"ref_53","first-page":"26","article-title":"Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude","volume":"4","author":"Tieleman","year":"2012","journal-title":"COURSERA Neural Netw. Mach. Learn."},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1162\/tacl_a_00051","article-title":"Enriching word vectors with subword information","volume":"5","author":"Bojanowski","year":"2017","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Tang, D., Qin, B., and Liu, T. (2015, January 17\u201321). Document modeling with gated recurrent neural network for sentiment classification. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal.","DOI":"10.18653\/v1\/D15-1167"},{"key":"ref_56","unstructured":"Clark, K., Luong, M.-T., and Le, Q.V. (2020). ELECTRA: Pre-training text encoders as discriminators rather than generators. 
arXiv."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/2\/661\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T18:01:28Z","timestamp":1760119288000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/2\/661"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,6]]},"references-count":56,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2023,1]]}},"alternative-id":["s23020661"],"URL":"https:\/\/doi.org\/10.3390\/s23020661","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,1,6]]}}}