{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,5]],"date-time":"2026-05-05T17:39:04Z","timestamp":1778002744498,"version":"3.51.4"},"reference-count":59,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2020,5,13]],"date-time":"2020-05-13T00:00:00Z","timestamp":1589328000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Knowl. Discov. Data"],"published-print":{"date-parts":[[2020,6,30]]},"abstract":"<jats:p>\n            Multi-modal affect analysis (e.g., sentiment and emotion analysis) is an interdisciplinary study and has been an emerging and prominent field in Natural Language Processing and Computer Vision. The effective fusion of multiple modalities (e.g.,\n            <jats:italic>text<\/jats:italic>\n            ,\n            <jats:italic>acoustic,<\/jats:italic>\n            or\n            <jats:italic>visual frames<\/jats:italic>\n            ) is a non-trivial task, as these modalities, often, carry distinct and diverse information, and do not contribute equally. The issue further escalates when these data contain noise. In this article, we study the concept of multi-task learning for multi-modal affect analysis and explore a contextual inter-modal attention framework that aims to leverage the association among the neighboring utterances and their multi-modal information. In general, sentiments and emotions have inter-dependence on each other (e.g.,\n            <jats:italic>anger \u2192 negative<\/jats:italic>\n            or\n            <jats:italic>happy \u2192 positive<\/jats:italic>\n            ). In our current work, we exploit the relatedness among the participating tasks in the multi-task framework. We define three different multi-task setups, each having two tasks, i.e., sentiment 8 emotion classification, sentiment classification 8 sentiment intensity prediction, and emotion classificati on 8 emotion intensity prediction. Our evaluation of the proposed system on the CMU-Multi-modal Opinion Sentiment and Emotion Intensity benchmark dataset suggests that, in comparison with the single-task learning framework, our multi-task framework yields better performance for the inter-related participating tasks. 
Further, comparative studies show that our proposed approach attains state-of-the-art performance in most cases.\n          <\/jats:p>","DOI":"10.1145\/3380744","type":"journal-article","created":{"date-parts":[[2020,5,19]],"date-time":"2020-05-19T10:42:16Z","timestamp":1589884936000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":37,"title":["A Deep Multi-task Contextual Attention Framework for Multi-modal Affect Analysis"],"prefix":"10.1145","volume":"14","author":[{"given":"Md Shad","family":"Akhtar","sequence":"first","affiliation":[{"name":"Indraprastha Institute of Information Technology - Delhi, New Delhi, India"}]},{"given":"Dushyant Singh","family":"Chauhan","sequence":"additional","affiliation":[{"name":"Indian Institute of Technology Patna, Patna, India"}]},{"given":"Asif","family":"Ekbal","sequence":"additional","affiliation":[{"name":"Indian Institute of Technology Patna, Patna, India"}]}],"member":"320","published-online":{"date-parts":[[2020,5,13]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1034"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-434"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3077136.3080702"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2016.7477553"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K18-1025"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-3301"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1010933404324"},{"key":"e_1_2_1_8_1","volume-title":"Multimodal Analytics for Next-Generation Big Data Technologies and Applications","author":"Cambria Erik","unstructured":"Erik Cambria, Soujanya Poria, and Amir Hussain. 2019. Speaker-independent multimodal sentiment analysis for big data. In Multimodal Analytics for Next-Generation Big Data Technologies and Applications. Springer, 13--43."},{"key":"e_1_2_1_9_1","volume-title":"Embodied multimodal multitask learning. CoRR abs\/1902.01385","author":"Chaplot Devendra Singh","year":"2019","unstructured":"Devendra Singh Chaplot, Lisa Lee, Ruslan Salakhutdinov, Devi Parikh, and Dhruv Batra. 2019. Embodied multimodal multitask learning. CoRR abs\/1902.01385 (2019). arxiv:1902.01385"},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing","author":"Chauhan Dushyant Singh","year":"2019","unstructured":"Dushyant Singh Chauhan, Md Shad Akhtar, Asif Ekbal, and Pushpak Bhattacharyya. 2019. Context-aware interactive attention for multi-modal sentiment and emotion analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Hong Kong, China, 5651--5661."}
,{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3136755.3136801"},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV'19)","author":"Chowdhuri S.","unstructured":"S. Chowdhuri, T. Pankaj, and K. Zipser. 2019. MultiNet: Multi-Modal multi-task learning for autonomous driving. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV'19). 1496--1504."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8462660"},{"key":"e_1_2_1_14_1","volume-title":"Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing","author":"Degottex G.","unstructured":"G. Degottex, J. Kane, T. Drugman, T. Raitio, and S. Scherer. 2014. COVAREP - A collaborative voice analysis repository for speech technologies. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy. 960--964."},{"key":"e_1_2_1_15_1","volume-title":"Multimodal utterance-level affect analysis using visual, audio and text features. arXiv preprint arXiv:1805.00625","author":"Deng Didan","year":"2018","unstructured":"Didan Deng, Yuqian Zhou, Jimin Pi, and Bertram E. Shi. 2018. Multimodal utterance-level affect analysis using visual, audio and text features. arXiv preprint arXiv:1805.00625 (2018)."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.4000\/books.aaccademia.2009"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1168"},{"key":"e_1_2_1_18_1","volume-title":"Handbook of Cognition and Emotion","author":"Ekman Paul","unstructured":"Paul Ekman. 1999. Basic emotions. In Handbook of Cognition and Emotion. Wiley Online Library, 45--60."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00530-017-0547-8"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1382"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/MIPR.2018.00043"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8682283"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.59"},{"key":"e_1_2_1_25_1","volume-title":"Convolutional attention networks for multimodal emotion recognition from speech and text data. arXiv preprint arXiv:1805.06606","author":"Lee Chan Woo","year":"2018","unstructured":"Chan Woo Lee, Kyu Ye Song, Jihoon Jeong, and Woo Yong Choi. 2018. Convolutional attention networks for multimodal emotion recognition from speech and text data. arXiv preprint arXiv:1805.06606 (2018)."}
,{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2070481.2070509"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2993148.2993176"},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the ACM Workshop on Crossmodal Learning and Application","author":"Fortin Mathieu Pag\u00e9","year":"2019","unstructured":"Mathieu Pag\u00e9 Fortin and Brahim Chaib-draa. 2019. Multimodal multitask emotion recognition using images, texts and tags. In Proceedings of the ACM Workshop on Crossmodal Learning and Application. ACM, 3--10."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.3115\/1219840.1219855"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CESYS.2017.8321250"},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing","author":"Pennington Jeffrey","unstructured":"Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar. 1532--1543."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2017.02.003"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2016.06.009"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1081"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2017.134"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2016.0055"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2016.09.117"},{"key":"e_1_2_1_38_1","first-page":"171","article-title":"A multimodal emotion recognition system using facial landmark analysis","volume":"43","author":"Rahdari Farhad","year":"2019","unstructured":"Farhad Rahdari, Esmat Rashedi, and Mahdi Eftekhari. 2019. A multimodal emotion recognition system using facial landmark analysis. Iranian Journal of Science and Technology, Transactions of Electrical Engineering 43, 1 (2019), 171--189.","journal-title":"Iranian Journal of Science and Technology, Transactions of Electrical Engineering"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46478-7_21"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2016.7477679"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-3303"},{"key":"e_1_2_1_42_1","volume-title":"Neural Information Processing","author":"Sangwan Suyash","year":"2019","unstructured":"Suyash Sangwan, Dushyant Singh Chauhan, Md. Shad Akhtar, Asif Ekbal, and Pushpak Bhattacharyya. 2019. Multi-task gated contextual cross-modal attention framework for sentiment and emotion analysis. In Neural Information Processing, Tom Gedeon, Kok Wai Wong, and Minho Lee (Eds.). Springer International Publishing, Cham, 662--669."}
,{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.5555\/3122009.3176840"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-3305"},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the 3rd International Conference on Learning Representations (ICLR'15)","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR'15), San Diego, CA, USA, May 7-9, 2015."},{"key":"e_1_2_1_46_1","volume-title":"Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Teney Damien","unstructured":"Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. 2018. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT. 4223--4232."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1142"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2017.2764438"},{"key":"e_1_2_1_49_1","volume-title":"Proceedings of the 2017 IEEE International Conference on Multimedia and Expo","author":"Wang Haohan","unstructured":"Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P. Xing. 2017. Select-additive learning: Improving generalization in multimodal sentiment analysis. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo, Hong Kong. 949--954."}
,{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-3302"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2015.2512598"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132847.3133142"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/2835776.2835779"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1115"},{"key":"e_1_2_1_55_1","volume-title":"Proceedings of the 32nd AAAI Conference on Artificial Intelligence","author":"Zadeh Amir","year":"2018","unstructured":"Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Memory fusion network for multi-view sequential learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA. 5634--5641."},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1208"},{"key":"e_1_2_1_57_1","volume-title":"Proceedings of the 32nd AAAI Conference on Artificial Intelligence","author":"Zadeh Amir","year":"2018","unstructured":"Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018. Multi-attention recurrent network for human communication comprehension. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA. 5642--5649."}
,{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2016.94"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2018.04.029"}],"container-title":["ACM Transactions on Knowledge Discovery from Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3380744","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3380744","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:31:32Z","timestamp":1750195892000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3380744"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,5,13]]},"references-count":59,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2020,6,30]]}},"alternative-id":["10.1145\/3380744"],"URL":"https:\/\/doi.org\/10.1145\/3380744","relation":{},"ISSN":["1556-4681","1556-472X"],"issn-type":[{"value":"1556-4681","type":"print"},{"value":"1556-472X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,5,13]]},"assertion":[{"value":"2019-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-05-13","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}