{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,1]],"date-time":"2026-06-01T12:35:08Z","timestamp":1780317308757,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":42,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,8,24]],"date-time":"2021-08-24T00:00:00Z","timestamp":1629763200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100017440","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61972047"],"award-info":[{"award-number":["61972047"]}],"id":[{"id":"10.13039\/100017440","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2018YFC0831500"],"award-info":[{"award-number":["2018YFC0831500"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,8,24]]},"DOI":"10.1145\/3460426.3463635","type":"proceedings-article","created":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T22:50:28Z","timestamp":1630536628000},"page":"164-172","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Relation-aware Hierarchical Attention Framework for Video Question Answering"],"prefix":"10.1145","author":[{"given":"Fangtao","family":"Li","sequence":"first","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ting","family":"Bai","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chenyu","family":"Cao","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zihe","family":"Liu","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chenghao","family":"Yan","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Bin","family":"Wu","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2021,9]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Jamie Ryan Kiros, and Geoffrey E Hinton","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba , Jamie Ryan Kiros, and Geoffrey E Hinton . 2016 . Layer normalization. arXiv preprint arXiv:1607.06450 (2016). Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)."},{"key":"e_1_3_2_1_2_1","volume-title":"Multimodal machine learning: A survey and taxonomy","author":"Tadas Baltruvs","year":"2018","unstructured":"Tadas Baltruvs aitis, Chaitanya Ahuja , and Louis-Philippe Morency . 2018. Multimodal machine learning: A survey and taxonomy . IEEE transactions on pattern analysis and machine intelligence , Vol. 41 , 2 ( 2018 ), 423--443. Tadas Baltruvs aitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence , Vol. 41, 2 (2018), 423--443."},{"key":"e_1_3_2_1_3_1","article-title":"Classification with a Reject Option using a Hinge Loss","volume":"9","author":"Bartlett Peter L","year":"2008","unstructured":"Peter L Bartlett and Marten H Wegkamp . 2008 . Classification with a Reject Option using a Hinge Loss . Journal of Machine Learning Research , Vol. 9 , 8 (2008). Peter L Bartlett and Marten H Wegkamp. 2008. Classification with a Reject Option using a Hinge Loss. Journal of Machine Learning Research , Vol. 9, 8 (2008).","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.285"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00209"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2019.00198"},{"key":"e_1_3_2_1_7_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00688"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6713"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.169"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-24571-8_47"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00378"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6737"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.149"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6767"},{"key":"e_1_3_2_1_16_1","volume-title":"Proceedings of the 32nd International Conference on Neural Information Processing Systems. 1571--1581","author":"Kim Jin-Hwa","year":"2018","unstructured":"Jin-Hwa Kim , Jaehyun Jun , and Byoung-Tak Zhang . 2018 . Bilinear attention networks . In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 1571--1581 . Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 1571--1581."},{"key":"e_1_3_2_1_17_1","volume-title":"Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907","author":"Kipf Thomas N","year":"2016","unstructured":"Thomas N Kipf and Max Welling . 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 ( 2016 ). Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00999"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1167"},{"key":"e_1_3_2_1_20_1","unstructured":"Jie Lei Licheng Yu Tamara Berg and Mohit Bansal. 2020. TVQA  Jie Lei Licheng Yu Tamara Berg and Mohit Bansal. 2020. TVQA"},{"key":"e_1_3_2_1_21_1","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . 8211--8225","unstructured":": Spatio-Temporal Grounding for Video Question Answering . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . 8211--8225 . : Spatio-Temporal Grounding for Video Question Answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . 8211--8225."},{"key":"e_1_3_2_1_22_1","volume-title":"Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition. In International Conference on Multimedia Modeling . Springer, 75--86","author":"Li Fangtao","year":"2021","unstructured":"Fangtao Li , Wenzhe Wang , Zihe Liu , Haoran Wang , Chenghao Yan , and Bin Wu . 2021 . Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition. In International Conference on Multimedia Modeling . Springer, 75--86 . Fangtao Li, Wenzhe Wang, Zihe Liu, Haoran Wang, Chenghao Yan, and Bin Wu. 2021. Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition. In International Conference on Multimedia Modeling . Springer, 75--86."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.01041"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.199"},{"key":"e_1_3_2_1_25_1","volume-title":"et almbox","author":"Liu Yuanliu","year":"2018","unstructured":"Yuanliu Liu , Bo Peng , Peipei Shi , He Yan , Yong Zhou , Bing Han , Yi Zheng , Chao Lin , Jianbin Jiang , Yin Fan , et almbox . 2018 a. iqiyi-vid: A large dataset for multi-modal person identification. arXiv preprint arXiv:1811.07548 (2018). Yuanliu Liu, Bo Peng, Peipei Shi, He Yan, Yong Zhou, Bing Han, Yi Zheng, Chao Lin, Jianbin Jiang, Yin Fan, et almbox. 2018a. iqiyi-vid: A large dataset for multi-modal person identification. arXiv preprint arXiv:1811.07548 (2018)."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1209"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10489-019-01430-7"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00713"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-24571-8_51"},{"key":"e_1_3_2_1_31_1","unstructured":"Shaoqing Ren Kaiming He Ross B Girshick and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS .  Shaoqing Ren Kaiming He Ross B Girshick and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS ."},{"key":"e_1_3_2_1_32_1","volume-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems . 4974--4983","author":"Santoro Adam","year":"2017","unstructured":"Adam Santoro , David Raposo , David GT Barrett , Mateusz Malinowski , Razvan Pascanu , Peter Battaglia , and Timothy Lillicrap . 2017 . A simple neural network module for relational reasoning . In Proceedings of the 31st International Conference on Neural Information Processing Systems . 4974--4983 . Adam Santoro, David Raposo, David GT Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. 2017. A simple neural network module for relational reasoning. In Proceedings of the 31st International Conference on Neural Information Processing Systems . 4974--4983."},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.499"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6431"},{"key":"e_1_3_2_1_35_1","volume-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems. 6000--6010","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , \u0141ukasz Kaiser , and Illia Polosukhin . 2017 . Attention is all you need . In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6000--6010 . Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6000--6010."},{"key":"e_1_3_2_1_36_1","volume-title":"Graph attention networks. arXiv preprint arXiv:1710.10903","author":"Petar Velivc","year":"2017","unstructured":"Petar Velivc kovi\u0107 , Guillem Cucurull , Arantxa Casanova , Adriana Romero , Pietro Lio , and Yoshua Bengio . 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 ( 2017 ). Petar Velivc kovi\u0107 , Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)."},{"key":"e_1_3_2_1_37_1","volume-title":"Multi-Cue and Temporal Attention for Person Recognition in Videos. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 369--380","author":"Wang Wenzhe","year":"2020","unstructured":"Wenzhe Wang , Bin Wu , Fangtao Li , and Zihe Liu . 2020 . Multi-Cue and Temporal Attention for Person Recognition in Videos. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 369--380 . Wenzhe Wang, Bin Wu, Fangtao Li, and Zihe Liu. 2020. Multi-Cue and Temporal Attention for Person Recognition in Videos. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 369--380."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.347"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"crossref","unstructured":"Amir Zadeh Minghai Chen Soujanya Poria Erik Cambria and Louis-Philippe Morency. 2017. Tensor Fusion Network for Multimodal Sentiment Analysis. In EMNLP .  Amir Zadeh Minghai Chen Soujanya Poria Erik Cambria and Louis-Philippe Morency. 2017. Tensor Fusion Network for Multimodal Sentiment Analysis. In EMNLP .","DOI":"10.18653\/v1\/D17-1115"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12021"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.11238"}],"event":{"name":"ICMR '21: International Conference on Multimedia Retrieval","location":"Taipei Taiwan","acronym":"ICMR '21","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2021 International Conference on Multimedia Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460426.3463635","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3460426.3463635","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:17:03Z","timestamp":1750191423000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460426.3463635"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,24]]},"references-count":42,"alternative-id":["10.1145\/3460426.3463635","10.1145\/3460426"],"URL":"https:\/\/doi.org\/10.1145\/3460426.3463635","relation":{},"subject":[],"published":{"date-parts":[[2021,8,24]]},"assertion":[{"value":"2021-09-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}