{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T22:57:17Z","timestamp":1781650637537,"version":"3.54.5"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,1,5]],"date-time":"2023-01-05T00:00:00Z","timestamp":1672876800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2023,1,31]]},"abstract":"<jats:p>Multimodal sentiment analysis has attracted increasing attention with broad application prospects. Most of the existing methods have focused on a single modality, which fails to handle social media data due to its multiple modalities. Moreover, in multimodal learning, most of the works have focused on simply combining the two modalities without exploring the complicated correlations between them. This resulted in dissatisfying performance for multimodal sentiment classification. Motivated by the status quo, we propose a Deep Multi-level Attentive network (DMLANet), which exploits the correlation between image and text modalities to improve multimodal learning. Specifically, we generate the bi-attentive visual map along the spatial and channel dimensions to magnify Convolutional neural network representation power. Then, we model the correlation between the image regions and semantics of the word by extracting the textual features related to the bi-attentive visual features by applying semantic attention. Finally, self-attention is employed to fetch the sentiment-rich multimodal features for the classification automatically. 
We conduct extensive evaluations on four real-world datasets, namely, MVSA-Single, MVSA-Multiple, Flickr, and Getty Images, which verify our method's superiority.<\/jats:p>","DOI":"10.1145\/3517139","type":"journal-article","created":{"date-parts":[[2022,3,16]],"date-time":"2022-03-16T10:18:50Z","timestamp":1647425930000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":87,"title":["A Deep Multi-level Attentive Network for Multimodal Sentiment Analysis"],"prefix":"10.1145","volume":"19","author":[{"given":"Ashima","family":"Yadav","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, Bennett University, Greater Noida, Uttar Pradesh, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dinesh Kumar","family":"Vishwakarma","sequence":"additional","affiliation":[{"name":"Department of Information Technology, Delhi Technological University, Rohini, New Delhi, India"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2023,1,5]]},"reference":[{"key":"e_1_3_3_2_2","first-page":"1","article-title":"ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks","volume":"32","author":"Lu J.","year":"2019","unstructured":"J. Lu, D. Batra, D. Parikh, and S. Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In 33rd Conference on Neural Information Processing Systems 32 (2019), 1\u201311.","journal-title":"33rd Conference on Neural Information Processing Systems"},{"key":"e_1_3_3_3_2","volume-title":"35th Conference on Neural Information Processing Systems","author":"Akbari H.","year":"2021","unstructured":"H. Akbari, L. Yuan, R. Qian, W.-H. Chuang, S.-F. Chang, Y. Cui, and B. Gong. 2021. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. 
In 35th Conference on Neural Information Processing Systems."},{"key":"e_1_3_3_4_2","volume-title":"38th International Conference on Machine Learning","author":"Radford A.","year":"2021","unstructured":"A. Radford, J. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, and J. Clark. 2021. Learning transferable visual models from natural language supervision. In 38th International Conference on Machine Learning."},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00530-020-00656-7"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.artmed.2015.03.006"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2020.06.038"},{"key":"e_1_3_3_8_2","doi-asserted-by":"crossref","DOI":"10.1016\/j.asoc.2020.106624","article-title":"A unified framework of deep networks for genre classification using movie trailer","volume":"96","author":"Yadav A.","year":"2020","unstructured":"A. Yadav and D. K. Vishwakarma. 2020. A unified framework of deep networks for genre classification using movie trailer. Appl. Soft Comput. 96 (2020).","journal-title":"Appl. Soft Comput."},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.dss.2016.10.006"},{"key":"e_1_3_3_10_2","volume-title":"IEEE 16th International Conference on Data Mining","author":"Poria S.","year":"2016","unstructured":"S. Poria, I. Chaturvedi, E. Cambria, and A. Hussain. 2016. Convolutional MKL based multimodal emotion recognition and sentiment analysis. 
In IEEE 16th International Conference on Data Mining."},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2019.04.018"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2019.01.019"},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2867718"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-015-2646-x"},{"issue":"12","key":"e_1_3_3_15_2","doi-asserted-by":"crossref","first-page":"2281\u20132296","DOI":"10.1109\/TMM.2015.2491019","article-title":"Word-of-mouth understanding: Entity-centric multimodal aspect-opinion mining in social media","volume":"17","author":"Fang Q.","year":"2015","unstructured":"Q. Fang, C. Xu, J. Sang, M. S. Hossain, and G. Muhammad. 2015. Word-of-mouth understanding: Entity-centric multimodal aspect-opinion mining in social media. IEEE Trans. Multim. 17, 12 (2015), 2281\u20132296.","journal-title":"IEEE Trans. Multim."},{"key":"e_1_3_3_16_2","volume-title":"IEEE Conference on Multimedia Information Processing and Retrieval","author":"Dai S.","year":"2018","unstructured":"S. Dai and H. Man. 2018. Integrating visual and textual affective descriptors for sentiment analysis of social media posts. In IEEE Conference on Multimedia Information Processing and Retrieval."},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10462-019-09794-5"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3311969"},{"key":"e_1_3_3_19_2","volume-title":"IEEE International Conference on Intelligence and Security Informatics (ISI)","author":"Xu N.","year":"2017","unstructured":"N. Xu. 2017. Analyzing multimodal public sentiment based on hierarchical semantic attentional network. 
In IEEE International Conference on Intelligence and Security Informatics (ISI)."},{"key":"e_1_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2757769"},{"issue":"6","key":"e_1_3_3_21_2","doi-asserted-by":"crossref","DOI":"10.1016\/j.ipm.2019.102097","article-title":"An image-text consistency driven multimodal sentiment analysis approach for social media","volume":"56","author":"Zhao Z.","year":"2019","unstructured":"Z. Zhao, H. Zhu, Z. Xue, Z. Liu, J. Tian, M. Chua, and M. Liu. 2019. An image-text consistency driven multimodal sentiment analysis approach for social media. Inf. Process. Manag. 56, 6 (2019).","journal-title":"Inf. Process. Manag."},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2019.2957872"},{"issue":"35","key":"e_1_3_3_23_2","first-page":"1","article-title":"Affective computing for large-scale heterogeneous multimedia data: A survey","volume":"15","author":"Zhao S.","year":"2020","unstructured":"S. Zhao, S. Wang, M. Soleymani, D. Joshi, and Q. Ji. 2020. Affective computing for large-scale heterogeneous multimedia data: A survey. ACM Trans. Multim. Comput., Commun. Applic. 15, 35 (2020), 1\u201332.","journal-title":"ACM Trans. Multim. Comput., Commun. Applic."},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2019.04.010"},{"key":"e_1_3_3_25_2","doi-asserted-by":"crossref","DOI":"10.1016\/j.patcog.2019.107075","article-title":"Learning visual relationship and context-aware attention for image captioning","volume":"98","author":"Wang J.","year":"2020","unstructured":"J. Wang, W. Wang, L. Wang, Z. Wang, D. D. Feng, and T. Tan. 2020. Learning visual relationship and context-aware attention for image captioning. Pattern Recog. 
98 (2020).","journal-title":"Pattern Recog"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_3_28_2","volume-title":"28th International Joint Conference on Artificial Intelligence","author":"Ma H.","year":"2019","unstructured":"H. Ma, W. Li, X. Zhang, S. Gao, and S. Lu. 2019. AttnSense: Multi-level attention mechanism for multimodal human activity recognition. In 28th International Joint Conference on Artificial Intelligence."},{"key":"e_1_3_3_29_2","first-page":"2017","article-title":"Spatial transformer networks","volume":"28","author":"Jaderberg M.","year":"2015","unstructured":"M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. 2015. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 28 (2015), 2017\u20132025.","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.319"},{"key":"e_1_3_3_31_2","volume-title":"22nd Conference on Computational Natural Language Learning","author":"Barrett M.","year":"2018","unstructured":"M. Barrett, J. Bingel, N. Hollenstein, M. Rei, and A. S\u00f8gaard. 2018. Sequence classification with human attention. In 22nd Conference on Computational Natural Language Learning."},{"key":"e_1_3_3_32_2","unstructured":"P. Battaglia J. Hamrick and V. Bapst. 2018. Relational inductive biases deep learning and graph networks. arXiv preprint arXiv:1806.01261 2018."},{"key":"e_1_3_3_33_2","volume-title":"International Conference on Multimedia Modeling","author":"Niu T.","year":"2016","unstructured":"T. Niu. 2016. Sentiment analysis on multi-view social data. 
In International Conference on Multimedia Modeling."},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/2502081.2502282"},{"key":"e_1_3_3_35_2","volume-title":"9th ACM International Conference on Web Search and Data Mining","author":"You Q.","year":"2016","unstructured":"Q. You, J. Luo, H. Jin, and J. Yang. 2016. Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia. In 9th ACM International Conference on Web Search and Data Mining."},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2016.94"},{"key":"e_1_3_3_37_2","volume-title":"IEEE International Conference on Computer Vision Workshops","author":"Vadicamo L.","year":"2017","unstructured":"L. Vadicamo, F. Carrara, A. Cimino, S. Cresci, F. Dell'Orletta, F. Falchi, and M. Tesconi. 2017. Cross-media learning for image sentiment analysis in the wild. In IEEE International Conference on Computer Vision Workshops."},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3380688.3380693"},{"key":"e_1_3_3_39_2","volume-title":"IEEE 2nd International Conference on Big Data Analysis (ICBDA)","author":"Xu N.","year":"2017","unstructured":"N. Xu and W. Mao. 2017. A residual merged neutral network for multimodal sentiment analysis. In IEEE 2nd International Conference on Big Data Analysis (ICBDA)."},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3209978.3210093"},{"key":"e_1_3_3_41_2","volume-title":"ACM Conference on Information and Knowledge Management","author":"Xu N.","year":"2017","unstructured":"N. Xu and W. Mao. 2017. MultiSentiNet: A deep semantic network for multimodal sentiment analysis. In ACM Conference on Information and Knowledge Management."},{"key":"e_1_3_3_42_2","volume-title":"Pacific-Asia Conference on Knowledge Discovery and Data Mining","author":"Jiang T.","year":"2020","unstructured":"T. Jiang, J. Wang, Z. Liu, and Y. Ling. 2020. 
Fusion-extraction network for multimodal sentiment analysis. In Pacific-Asia Conference on Knowledge Discovery and Data Mining."},{"key":"e_1_3_3_43_2","first-page":"1","article-title":"Social image sentiment analysis by exploiting multimodal content and heterogeneous relations","author":"Xu J.","year":"2020","unstructured":"J. Xu, Z. Li, F. Huang, C. Li, and P. S. Yu. 2020. Social image sentiment analysis by exploiting multimodal content and heterogeneous relations. IEEE Trans. Industr. Inform. 17, 4 (2020), 1\u20138.","journal-title":"IEEE Trans. Industr. Inform."},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3388861"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.2975036"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413690"},{"key":"e_1_3_3_47_2","volume-title":"Conference Association for Computational Linguistics","author":"Rahman W.","year":"2020","unstructured":"W. Rahman, M. Hasan, S. Lee, A. Zadeh, C. Mao, L.-P. Morency, and E. Hoque. 2020. Integrating multimodal information in large pretrained transformers. In Conference Association for Computational Linguistics."},{"key":"e_1_3_3_48_2","unstructured":"R. R. Selvaraju A. Das R. Vedantam M. Cogswell D. Parikh and D. Batra. 2016. Grad-CAM: Why did you say that? 
arXiv preprint arXiv:1611.07450 2016."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3517139","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3517139","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:02Z","timestamp":1750183742000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3517139"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,5]]},"references-count":47,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,1,31]]}},"alternative-id":["10.1145\/3517139"],"URL":"https:\/\/doi.org\/10.1145\/3517139","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,1,5]]},"assertion":[{"value":"2021-04-07","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-02-05","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-01-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}