{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T17:43:51Z","timestamp":1777657431583,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":54,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3548035","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:46Z","timestamp":1665416566000},"page":"4714-4722","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":30,"title":["Equivariant and Invariant Grounding for Video Question Answering"],"prefix":"10.1145","author":[{"given":"Yicong","family":"Li","sequence":"first","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiang","family":"Wang","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Junbin","family":"Xiao","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tat-Seng","family":"Chua","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation 
Instructions in Real Environments","author":"Anderson Peter","unstructured":"Peter Anderson , Qi Wu , Damien Teney , Jake Bruce , Mark Johnson , Niko S\u00fcnderhauf , Ian D. Reid , Stephen Gould , and Anton van den Hengel . 2018. Vision-and- Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments . In IEEE CVPR. 3674--3683. Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S\u00fcnderhauf, Ian D. Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and- Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In IEEE CVPR. 3674--3683."},{"key":"e_1_3_2_2_2_1","unstructured":"Mart\u00edn Arjovsky L\u00e9on Bottou Ishaan Gulrajani and David Lopez-Paz. 2019. Invariant Risk Minimization. Mart\u00edn Arjovsky L\u00e9on Bottou Ishaan Gulrajani and David Lopez-Paz. 2019. Invariant Risk Minimization."},{"key":"e_1_3_2_2_3_1","volume-title":"This looks like that: deep learning for interpretable image recognition. CoRR","author":"Chen Chaofan","year":"2018","unstructured":"Chaofan Chen , Oscar Li , Alina Barnett , Jonathan Su , and Cynthia Rudin . 2018. This looks like that: deep learning for interpretable image recognition. CoRR ( 2018 ). Chaofan Chen, Oscar Li, Alina Barnett, Jonathan Su, and Cynthia Rudin. 2018. This looks like that: deep learning for interpretable image recognition. CoRR (2018)."},{"key":"e_1_3_2_2_4_1","volume-title":"Counterfactual Samples Synthesizing for Robust Visual Question Answering","author":"Chen Long","unstructured":"Long Chen , Xin Yan , Jun Xiao , Hanwang Zhang , Shiliang Pu , and Yueting Zhuang . 2020. Counterfactual Samples Synthesizing for Robust Visual Question Answering . In IEEE CVPR. 10797--10806. Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, Shiliang Pu, and Yueting Zhuang. 2020. Counterfactual Samples Synthesizing for Robust Visual Question Answering. In IEEE CVPR. 
10797--10806."},{"key":"e_1_3_2_2_5_1","volume-title":"Environment Inference for Invariant Learning","author":"Creager Elliot","year":"2021","unstructured":"Elliot Creager , J\u00f6rn-Henrik Jacobsen , and Richard S . Zemel . 2021 . Environment Inference for Invariant Learning. In ICML (Proceedings of Machine Learning Research) . 2189--2200. Elliot Creager, J\u00f6rn-Henrik Jacobsen, and Richard S. Zemel. 2021. Environment Inference for Invariant Learning. In ICML (Proceedings of Machine Learning Research). 2189--2200."},{"key":"e_1_3_2_2_6_1","volume-title":"Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering","author":"Dang Long Hoang","year":"2021","unstructured":"Long Hoang Dang , Thao Minh Le , Vuong Le, and Truyen Tran. 2021 . Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering . , 636--642 pages. Long Hoang Dang, Thao Minh Le, Vuong Le, and Truyen Tran. 2021. Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering. , 636--642 pages."},{"key":"e_1_3_2_2_7_1","volume-title":"Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering","author":"Fan Chenyou","year":"2019","unstructured":"Chenyou Fan , Xiaofan Zhang , Shu Zhang , Wensheng Wang , Chi Zhang , and Heng Huang . 2019. Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering . In IEEE CVPR. 1999 --2007. Chenyou Fan, Xiaofan Zhang, Shu Zhang, Wensheng Wang, Chi Zhang, and Heng Huang. 2019. Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering. In IEEE CVPR. 1999--2007."},{"key":"e_1_3_2_2_8_1","volume-title":"Motion-Appearance Co-Memory Networks for Video Question Answering","author":"Gao Jiyang","unstructured":"Jiyang Gao , Runzhou Ge , Kan Chen , and Ram Nevatia . 2018. Motion-Appearance Co-Memory Networks for Video Question Answering . In IEEE CVPR. Jiyang Gao, Runzhou Ge, Kan Chen, and Ram Nevatia. 2018. Motion-Appearance Co-Memory Networks for Video Question Answering. 
In IEEE CVPR."},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33013681"},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-short.122"},{"key":"e_1_3_2_2_11_1","volume-title":"Momentum Contrast for Unsupervised Visual Representation Learning. arXiv preprint arXiv:1911.05722","author":"He Kaiming","year":"2019","unstructured":"Kaiming He , Haoqi Fan , Yuxin Wu , Saining Xie , and Ross Girshick . 2019. Momentum Contrast for Unsupervised Visual Representation Learning. arXiv preprint arXiv:1911.05722 ( 2019 ). Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2019. Momentum Contrast for Unsupervised Visual Representation Learning. arXiv preprint arXiv:1911.05722 (2019)."},{"key":"e_1_3_2_2_12_1","volume-title":"Fooling neural network interpretations via adversarial model manipulation. NeurIPS 32","author":"Heo Juyeon","year":"2019","unstructured":"Juyeon Heo , Sunghwan Joo , and Taesup Moon . 2019. Fooling neural network interpretations via adversarial model manipulation. NeurIPS 32 ( 2019 ). Juyeon Heo, Sunghwan Joo, and Taesup Moon. 2019. Fooling neural network interpretations via adversarial model manipulation. NeurIPS 32 (2019)."},{"key":"e_1_3_2_2_13_1","volume-title":"Long Short-Term Memory. Neural Computation 9, 8 (11","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber . 1997. Long Short-Term Memory. Neural Computation 9, 8 (11 1997 ), 1735--1780. Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (11 1997), 1735--1780."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"crossref","unstructured":"Deng Huang Peihao Chen Runhao Zeng Qing Du Mingkui Tan and Chuang Gan. 2020. Location-Aware Graph Convolutional Networks for Video Question Answering. 11021--11028 pages. Deng Huang Peihao Chen Runhao Zeng Qing Du Mingkui Tan and Chuang Gan. 2020. 
Location-Aware Graph Convolutional Networks for Video Question Answering. 11021--11028 pages.","DOI":"10.1609\/aaai.v34i07.6737"},{"key":"e_1_3_2_2_15_1","volume-title":"MMM","volume":"6524","author":"Huang Shih-Shinh","year":"2011","unstructured":"Shih-Shinh Huang , Hsin-Ming Tsai , Pei-Yung Hsiao , Meng-Qui Tu , and Er-Liang Jian . 2011 . Combining Histograms of Oriented Gradients with Global Feature for Human Detection. In Advances in Multimedia Modeling - 17th International Multimedia Modeling Conference , MMM 2011, Vol. 6524 . Springer, 208--218. Shih-Shinh Huang, Hsin-Ming Tsai, Pei-Yung Hsiao, Meng-Qui Tu, and Er-Liang Jian. 2011. Combining Histograms of Oriented Gradients with Global Feature for Human Detection. In Advances in Multimedia Modeling - 17th International Multimedia Modeling Conference, MMM 2011, Vol. 6524. Springer, 208--218."},{"key":"e_1_3_2_2_16_1","unstructured":"Eric Jang Shixiang Gu and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In ICLR. Eric Jang Shixiang Gu and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In ICLR."},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"crossref","unstructured":"Pin Jiang and Yahong Han. 2020. Reasoning with Heterogeneous Graph Alignment for Video Question Answering. In AAAI. 11109--11116. Pin Jiang and Yahong Han. 2020. Reasoning with Heterogeneous Graph Alignment for Video Question Answering. In AAAI. 11109--11116.","DOI":"10.1609\/aaai.v34i07.6767"},{"key":"e_1_3_2_2_18_1","volume-title":"The Dangers of Post-hoc Interpretability: Unjustified Counterfactual Explanations. In IJCAI 2019","author":"Laugel Thibault","year":"2019","unstructured":"Thibault Laugel , Marie-Jeanne Lesot , Christophe Marsala , Xavier Renard , and Marcin Detyniecki . 2019 . The Dangers of Post-hoc Interpretability: Unjustified Counterfactual Explanations. In IJCAI 2019 , Macao, China, August 10--16 , 2019. 2801--2807. 
Thibault Laugel, Marie-Jeanne Lesot, Christophe Marsala, Xavier Renard, and Marcin Detyniecki. 2019. The Dangers of Post-hoc Interpretability: Unjustified Counterfactual Explanations. In IJCAI 2019, Macao, China, August 10--16, 2019. 2801--2807."},{"key":"e_1_3_2_2_19_1","unstructured":"Thao Minh Le Vuong Le Svetha Venkatesh and Truyen Tran. 2020. Hierarchical Conditional Relation Networks for Video Question Answering. (2020) 9969--9978. Thao Minh Le Vuong Le Svetha Venkatesh and Truyen Tran. 2020. Hierarchical Conditional Relation Networks for Video Question Answering. (2020) 9969--9978."},{"key":"e_1_3_2_2_20_1","volume-title":"Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering","author":"Li Xiangpeng","unstructured":"Xiangpeng Li , Jingkuan Song , Lianli Gao , Xianglong Liu , Wenbing Huang , Xiangnan He , and Chuang Gan . 2019. Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering . In AAAI. AAAI Press , 8658--8665. Xiangpeng Li, Jingkuan Song, Lianli Gao, Xianglong Liu, Wenbing Huang, Xiangnan He, and Chuang Gan. 2019. Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering. In AAAI. AAAI Press, 8658--8665."},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00294"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475540"},{"key":"e_1_3_2_2_23_1","volume-title":"HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering","author":"Liu Fei","year":"2021","unstructured":"Fei Liu , Jing Liu , Weining Wang , and Hanqing Lu . 2021 . HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering . In IEEE ICCV. 1678--1687. Fei Liu, Jing Liu, Weining Wang, and Hanqing Lu. 2021. HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering. In IEEE ICCV. 
1678--1687."},{"key":"e_1_3_2_2_24_1","unstructured":"Tom Monnier Thibault Groueix and Mathieu Aubry. 2020. Deep Transformation- Invariant Clustering. In NeurIPS. Tom Monnier Thibault Groueix and Mathieu Aubry. 2020. Deep Transformation- Invariant Clustering. In NeurIPS."},{"key":"e_1_3_2_2_25_1","volume-title":"Counterfactual VQA: A Cause-Effect Look at Language Bias","author":"Niu Yulei","unstructured":"Yulei Niu , Kaihua Tang , Hanwang Zhang , Zhiwu Lu , Xian-Sheng Hua , and Ji-Rong Wen . 2021. Counterfactual VQA: A Cause-Effect Look at Language Bias . In IEEECVPR. Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. 2021. Counterfactual VQA: A Cause-Effect Look at Language Bias. In IEEECVPR."},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"crossref","unstructured":"Jungin Park Jiyoung Lee and Kwanghoon Sohn. 2021. Bridge To Answer: Structure-Aware Graph Interaction Network for Video Question Answering. 15526--15535 pages. Jungin Park Jiyoung Lee and Kwanghoon Sohn. 2021. Bridge To Answer: Structure-Aware Graph Interaction Network for Video Question Answering. 15526--15535 pages.","DOI":"10.1109\/CVPR46437.2021.01527"},{"key":"e_1_3_2_2_27_1","volume-title":"Causal inference in statistics: An overview. Statistics surveys","author":"Pearl Judea","year":"2009","unstructured":"Judea Pearl . 2009. Causal inference in statistics: An overview. Statistics surveys ( 2009 ), 96--146. Judea Pearl. 2009. Causal inference in statistics: An overview. Statistics surveys (2009), 96--146."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511803161"},{"key":"e_1_3_2_2_29_1","unstructured":"Judea Pearl Madelyn Glymour and Nicholas P Jewell. 2016. Causal inference in statistics: A primer. Judea Pearl Madelyn Glymour and Nicholas P Jewell. 2016. Causal inference in statistics: A primer."},{"key":"e_1_3_2_2_30_1","volume-title":"Progressive Graph Attention Network for Video Question Answering. 
In MM '21: ACM Multimedia Conference.","author":"Peng Liang","year":"2021","unstructured":"Liang Peng , Shuangji Yang , Yi Bin , and Guoqing Wang . 2021 . Progressive Graph Attention Network for Video Question Answering. In MM '21: ACM Multimedia Conference. Liang Peng, Shuangji Yang, Yi Bin, and Guoqing Wang. 2021. Progressive Graph Attention Network for Video Question Answering. In MM '21: ACM Multimedia Conference."},{"key":"e_1_3_2_2_31_1","first-page":"8748","article-title":"Learning Transferable Visual Models From Natural Language Supervision","volume":"139","author":"Radford Alec","year":"2021","unstructured":"Alec Radford , Jong Wook Kim , Chris Hallacy , Aditya Ramesh , Gabriel Goh , Sandhini Agarwal , Girish Sastry , Amanda Askell , Pamela Mishkin , Jack Clark , Gretchen Krueger , and Ilya Sutskever . 2021 . Learning Transferable Visual Models From Natural Language Supervision . In ICML , Vol. 139. 8748 -- 8763 . Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML, Vol. 139. 8748--8763.","journal-title":"ICML"},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"crossref","unstructured":"Marco T\u00falio Ribeiro Sameer Singh and Carlos Guestrin. 2016. \"Why Should I Trust You?\": Explaining the Predictions of Any Classifier. In KDD. 1135--1144. Marco T\u00falio Ribeiro Sameer Singh and Carlos Guestrin. 2016. \"Why Should I Trust You?\": Explaining the Predictions of Any Classifier. In KDD. 1135--1144.","DOI":"10.18653\/v1\/N16-3020"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"crossref","unstructured":"Andrew Slavin Ross Michael C. Hughes and Finale Doshi-Velez. 2017. Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations. In IJCAI. 2662--2670. Andrew Slavin Ross Michael C. Hughes and Finale Doshi-Velez. 2017. 
Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations. In IJCAI. 2662--2670.","DOI":"10.24963\/ijcai.2017\/371"},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-019-0048-x"},{"key":"e_1_3_2_2_35_1","volume-title":"Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization","author":"Selvaraju Ramprasaath R.","unstructured":"Ramprasaath R. Selvaraju , Michael Cogswell , Abhishek Das , Ramakrishna Vedantam , Devi Parikh , and Dhruv Batra . 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization . In IEEE ICCV. 618--626. Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In IEEE ICCV. 618--626."},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"crossref","unstructured":"Ahjeong Seo Gi-Cheon Kang Joonhan Park and Byoung-Tak Zhang. 2021. Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering. In ACL\/IJCNLP. 6167--6177. Ahjeong Seo Gi-Cheon Kang Joonhan Park and Byoung-Tak Zhang. 2021. Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering. In ACL\/IJCNLP. 6167--6177.","DOI":"10.18653\/v1\/2021.acl-long.481"},{"key":"e_1_3_2_2_37_1","volume-title":"InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs. TPAMI","author":"Shen Yujun","year":"2020","unstructured":"Yujun Shen , Ceyuan Yang , Xiaoou Tang , and Bolei Zhou . 2020. InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs. TPAMI ( 2020 ). Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. 2020. InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs. 
TPAMI (2020)."},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3375627.3375830"},{"key":"e_1_3_2_2_39_1","volume-title":"Unbiased look at dataset bias","author":"Torralba Antonio","year":"2011","unstructured":"Antonio Torralba and Alexei A . Efros . 2011 . Unbiased look at dataset bias. In IEEECVPR. IEEE Computer Society , 1521--1528. Antonio Torralba and Alexei A. Efros. 2011. Unbiased look at dataset bias. In IEEECVPR. IEEE Computer Society, 1521--1528."},{"key":"e_1_3_2_2_40_1","volume-title":"Representation Learning with Contrastive Predictive Coding. CoRR","author":"van den Oord A\u00e4ron","year":"2018","unstructured":"A\u00e4ron van den Oord , Yazhe Li , and Oriol Vinyals . 2018. Representation Learning with Contrastive Predictive Coding. CoRR ( 2018 ). A\u00e4ron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. CoRR (2018)."},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"crossref","unstructured":"Hui Wang Dan Guo Xian-Sheng Hua and Meng Wang. 2021. Pairwise VLAD Interaction Network for Video Question Answering. 5119--5127. Hui Wang Dan Guo Xian-Sheng Hua and Meng Wang. 2021. Pairwise VLAD Interaction Network for Video Question Answering. 5119--5127.","DOI":"10.1145\/3474085.3475620"},{"key":"e_1_3_2_2_42_1","volume-title":"DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering. CoRR abs\/2107.04768","author":"Wang Jianyu","year":"2021","unstructured":"Jianyu Wang , Bing-Kun Bao , and Changsheng Xu. 2021. DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering. CoRR abs\/2107.04768 ( 2021 ). Jianyu Wang, Bing-Kun Bao, and Changsheng Xu. 2021. DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering. CoRR abs\/2107.04768 (2021)."},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"crossref","unstructured":"Tan Wang Chang Zhou Qianru Sun and Hanwang Zhang. 2021. Causal Attention for Unbiased Visual Recognition. Tan Wang Chang Zhou Qianru Sun and Hanwang Zhang. 
2021. Causal Attention for Unbiased Visual Recognition.","DOI":"10.1109\/ICCV48922.2021.00308"},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3462962"},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3485447.3512251"},{"key":"e_1_3_2_2_46_1","volume-title":"Videos as Space-Time Region Graphs","author":"Wang Xiaolong","unstructured":"Xiaolong Wang and Abhinav Gupta . 2018. Videos as Space-Time Region Graphs . In IEEE ECCV. 413--431. Xiaolong Wang and Abhinav Gupta. 2018. Videos as Space-Time Region Graphs. In IEEE ECCV. 413--431."},{"key":"e_1_3_2_2_47_1","unstructured":"Ying-Xin Wu Xiang Wang An Zhang Xiangnan He and Tat seng Chua. 2022. Discovering Invariant Rationales for Graph Neural Networks. In ICLR. Ying-Xin Wu Xiang Wang An Zhang Xiangnan He and Tat seng Chua. 2022. Discovering Invariant Rationales for Graph Neural Networks. In ICLR."},{"key":"e_1_3_2_2_48_1","volume-title":"NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions","author":"Xiao Junbin","unstructured":"Junbin Xiao , Xindi Shang , Angela Yao , and Tat-Seng Chua . 2021. NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions . In IEEE CVPR. 9777--9786. Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. 2021. NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions. In IEEE CVPR. 9777--9786."},{"key":"e_1_3_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i3.20184"},{"key":"e_1_3_2_2_50_1","unstructured":"Dejing Xu Zhou Zhao Jun Xiao Fei Wu Hanwang Zhang Xiangnan He and Yueting Zhuang. 2017. Video Question Answering via Gradually Refined Attention over Appearance and Motion. In ACM MM. 1645--1653. Dejing Xu Zhou Zhao Jun Xiao Fei Wu Hanwang Zhang Xiangnan He and Yueting Zhuang. 2017. Video Question Answering via Gradually Refined Attention over Appearance and Motion. In ACM MM. 
1645--1653."},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"crossref","unstructured":"Xu Yang Hanwang Zhang Guojun Qi and Jianfei Cai. 2021. Causal Attention for Vision-Language Tasks. In CVPR. 9847--9857. Xu Yang Hanwang Zhang Guojun Qi and Jianfei Cai. 2021. Causal Attention for Vision-Language Tasks. In CVPR. 9847--9857.","DOI":"10.1109\/CVPR46437.2021.00972"},{"key":"e_1_3_2_2_52_1","volume-title":"Leveraging Video Descriptions to Learn Video Question Answering","author":"Zeng Kuo-Hao","year":"2017","unstructured":"Kuo-Hao Zeng , Tseng-Hung Chen , Ching-Yao Chuang , Yuan-Hong Liao , Juan Carlos Niebles, and Min Sun . 2017 . Leveraging Video Descriptions to Learn Video Question Answering . , 4334--4340 pages. Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, and Min Sun. 2017. Leveraging Video Descriptions to Learn Video Question Answering. , 4334--4340 pages."},{"key":"e_1_3_2_2_53_1","unstructured":"Hongyi Zhang Moustapha Ciss\u00e9 Yann N. Dauphin and David Lopez-Paz. 2018. mixup: Beyond Empirical Risk Minimization. In ICLR. Hongyi Zhang Moustapha Ciss\u00e9 Yann N. Dauphin and David Lopez-Paz. 2018. mixup: Beyond Empirical Risk Minimization. In ICLR."},{"key":"e_1_3_2_2_54_1","volume-title":"Video Question Answering: Datasets, Algorithms and Challenges. arXiv preprint arXiv:2203.01225","author":"Zhong Yaoyao","year":"2022","unstructured":"Yaoyao Zhong , Wei Ji , Junbin Xiao , Yicong Li , Weihong Deng , and Tat-Seng Chua . 2022. Video Question Answering: Datasets, Algorithms and Challenges. arXiv preprint arXiv:2203.01225 ( 2022 ). Yaoyao Zhong, Wei Ji, Junbin Xiao, Yicong Li, Weihong Deng, and Tat-Seng Chua. 2022. Video Question Answering: Datasets, Algorithms and Challenges. 
arXiv preprint arXiv:2203.01225 (2022)."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548035","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548035","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:29Z","timestamp":1750186949000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548035"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":54,"alternative-id":["10.1145\/3503161.3548035","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3548035","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}