{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T08:16:28Z","timestamp":1765354588487,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":39,"publisher":"ACM","license":[{"start":{"date-parts":[[2018,10,15]],"date-time":"2018-10-15T00:00:00Z","timestamp":1539561600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Institutes of Health","award":["R01LM011834"],"award-info":[{"award-number":["R01LM011834"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2018,10,15]]},"DOI":"10.1145\/3240508.3240714","type":"proceedings-article","created":{"date-parts":[[2018,10,18]],"date-time":"2018-10-18T17:52:08Z","timestamp":1539885128000},"page":"537-545","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":26,"title":["Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder"],"prefix":"10.1145","author":[{"given":"Yue","family":"Gu","sequence":"first","affiliation":[{"name":"Rutgers University, Piscataway, NJ, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xinyu","family":"Li","sequence":"additional","affiliation":[{"name":"Rutgers University & Amazon Inc., Piscataway, NJ, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kaixiang","family":"Huang","sequence":"additional","affiliation":[{"name":"Meitu Inc.
& Rutgers University, Xiamen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shiyu","family":"Fu","sequence":"additional","affiliation":[{"name":"Rutgers University, Piscataway, NJ, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kangning","family":"Yang","sequence":"additional","affiliation":[{"name":"Rutgers University, Piscataway, NJ, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shuhong","family":"Chen","sequence":"additional","affiliation":[{"name":"Rutgers University, Piscataway, NJ, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Moliang","family":"Zhou","sequence":"additional","affiliation":[{"name":"Amazon Inc., Boston, MA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ivan","family":"Marsic","sequence":"additional","affiliation":[{"name":"Rutgers University, Piscataway, NJ, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2018,10,15]]},"reference":[{"doi-asserted-by":"publisher","key":"e_1_3_2_1_1_1","DOI":"10.1109\/TASLP.2014.2339736"},{"key":"e_1_3_2_1_2_1","volume-title":"Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473","author":"Bahdanau Dzmitry","year":"2014","unstructured":"Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_3_1","DOI":"10.1023\/A:1010933404324"},{"doi-asserted-by":"crossref","unstructured":"Carlos Busso Murtaza Bulut Shrikanth Narayanan J Gratch and S Marsella. 2013. Toward effective automatic recognition systems of emotion in speech.
Social emotions in nature and artifact: emotions in human and human-computer interaction J. Gratch and S. Marsella Eds (2013) 110--127.","key":"e_1_3_2_1_4_1","DOI":"10.1093\/acprof:oso\/9780195387643.003.0008"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_5_1","DOI":"10.1145\/3136755.3136801"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_6_1","DOI":"10.1023\/A:1022627411411"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_7_1","DOI":"10.1109\/ICASSP.2014.6853739"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_8_1","DOI":"10.1007\/s12193-009-0032-6"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_9_1","DOI":"10.1145\/1873951.1874246"},{"key":"e_1_3_2_1_10_1","volume-title":"Deep Multimodal Learning for Emotion Recognition in Spoken Language. arXiv preprint arXiv:1802.08332","author":"Gu Yue","year":"2018","unstructured":"Yue Gu, Shuhong Chen, and Ivan Marsic. 2018a. Deep Multimodal Learning for Emotion Recognition in Spoken Language. arXiv preprint arXiv:1802.08332 (2018)."},{"key":"e_1_3_2_1_11_1","volume-title":"Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment. arXiv preprint arXiv:1805.08660","author":"Gu Yue","year":"2018","unstructured":"Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, and Ivan Marsic. 2018b.
Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment. arXiv preprint arXiv:1805.08660 (2018)."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_12_1","DOI":"10.1007\/978-3-319-46493-0_38"},{"unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097--1105.","key":"e_1_3_2_1_13_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_14_1","DOI":"10.1145\/3123266.3123365"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_15_1","DOI":"10.1016\/j.eswa.2014.08.036"},{"unstructured":"Tomas Mikolov Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems.
3111--3119.","key":"e_1_3_2_1_16_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_17_1","DOI":"10.1109\/ICASSP.2017.7952552"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_18_1","DOI":"10.1145\/2070481.2070509"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_19_1","DOI":"10.1145\/2993148.2993176"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_20_1","DOI":"10.1145\/2663204.2663260"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_21_1","DOI":"10.18653\/v1\/D15-1303"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_22_1","DOI":"10.18653\/v1\/P17-1081"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_23_1","DOI":"10.1109\/ICDM.2017.134"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_24_1","DOI":"10.1109\/ICDM.2016.0055"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_25_1","DOI":"10.1109\/TPAMI.2007.1124"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_26_1","DOI":"10.1007\/978-3-319-46478-7_21"},{"key":"e_1_3_2_1_27_1","volume-title":"Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition","author":"Ranjan Rajeev","year":"2017","unstructured":"Rajeev Ranjan, Vishal M Patel, and Rama Chellappa. 2017. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_28_1","DOI":"10.1109\/MIS.2013.9"},{"key":"e_1_3_2_1_29_1","volume-title":"Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC)","author":"Rozgic Viktor","year":"2012","unstructured":"Viktor Rozgic, Sankaranarayanan Ananthakrishnan, Shirin Saleem, Rohit Kumar, and Rohit Prasad. 2012. Ensemble of svm trees for multimodal emotion recognition. In Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific. IEEE, 1--4."},{"key":"e_1_3_2_1_30_1","volume-title":"Dynamic programming algorithm optimization for spoken word recognition","author":"Sakoe Hiroaki","year":"1978","unstructured":"Hiroaki Sakoe and Seibi Chiba. 1978. Dynamic programming algorithm optimization for spoken word recognition. IEEE transactions on acoustics, speech, and signal processing, Vol. 26, 1 (1978), 43--49."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_31_1","DOI":"10.1145\/2388676.2388781"},{"key":"e_1_3_2_1_32_1","volume-title":"Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2120--2127","author":"Song Yale","year":"2012","unstructured":"
Yale Song, Louis-Philippe Morency, and Randall Davis. 2012. Multi-view latent variable discriminative models for action recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2120--2127."},{"key":"e_1_3_2_1_33_1","volume-title":"Residual attention network for image classification. arXiv preprint arXiv:1704.06904","author":"Wang Fei","year":"2017","unstructured":"Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. 2017. Residual attention network for image classification. arXiv preprint arXiv:1704.06904 (2017)."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_34_1","DOI":"10.1109\/MIS.2013.34"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_35_1","DOI":"10.18653\/v1\/N16-1174"},{"key":"e_1_3_2_1_36_1","volume-title":"Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250","author":"Zadeh Amir","year":"2017","unstructured":"Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250 (2017)."},{"key":"e_1_3_2_1_37_1","volume-title":"Multi-attention recurrent network for human communication comprehension. arXiv preprint arXiv:1802.00923","author":"Zadeh Amir","year":"2018","unstructured":"
Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018. Multi-attention recurrent network for human communication comprehension. arXiv preprint arXiv:1802.00923 (2018)."},{"key":"e_1_3_2_1_38_1","volume-title":"MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259","author":"Zadeh Amir","year":"2016","unstructured":"Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259 (2016)."},{"key":"e_1_3_2_1_39_1","volume-title":"Learning affective features with a hybrid deep model for audio-visual emotion recognition","author":"Zhang Shiqing","year":"2017","unstructured":"Shiqing Zhang, Shiliang Zhang, Tiejun Huang, Wen Gao, and Qi Tian. 2017. Learning affective features with a hybrid deep model for audio-visual emotion recognition.
IEEE Transactions on Circuits and Systems for Video Technology (2017)."}],"event":{"sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"acronym":"MM '18","name":"MM '18: ACM Multimedia Conference","location":"Seoul Republic of Korea"},"container-title":["Proceedings of the 26th ACM international conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3240508.3240714","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3240508.3240714","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:43:32Z","timestamp":1750207412000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3240508.3240714"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,10,15]]},"references-count":39,"alternative-id":["10.1145\/3240508.3240714","10.1145\/3240508"],"URL":"https:\/\/doi.org\/10.1145\/3240508.3240714","relation":{},"subject":[],"published":{"date-parts":[[2018,10,15]]},"assertion":[{"value":"2018-10-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}