{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,6]],"date-time":"2026-06-06T16:58:37Z","timestamp":1780765117289,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":54,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key Research and Development Program of China","award":["No. 2020AAA0106300"],"award-info":[{"award-number":["No. 2020AAA0106300"]}]},{"name":"National Natural Science Foundation of China","award":["No. 62250008 No. 62102222"],"award-info":[{"award-number":["No. 62250008 No. 62102222"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3548061","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:46Z","timestamp":1665416566000},"page":"4466-4477","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Dynamic Spatio-Temporal Modular Network for Video Question Answering"],"prefix":"10.1145","author":[{"given":"Zi","family":"Qian","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xin","family":"Wang","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xuguang","family":"Duan","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Hong","family":"Chen","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Wenwu","family":"Zhu","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198","author":"Alayrac Jean-Baptiste","year":"2022","unstructured":"Jean-Baptiste Alayrac , Jeff Donahue, Pauline Luc , Antoine Miech , Iain Barr , Yana Hasson , Karel Lenc , Arthur Mensch , Katie Millican , Malcolm Reynolds , Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198 , 2022 . Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022."},{"key":"e_1_3_2_2_2_1","volume-title":"Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705","author":"Andreas Jacob","year":"2016","unstructured":"Jacob Andreas , Marcus Rohrbach , Trevor Darrell , and Dan Klein . Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705 , 2016 . Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, 2016."},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.12"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_2_2_5_1","volume-title":"An analysis of a logical machine using parenthesis-free notation. Mathematical tables and other aids to computation, 8(46):53--57","author":"Burks Arthur W","year":"1954","unstructured":"Arthur W Burks , Don W Warren , and Jesse B Wright . 
An analysis of a logical machine using parenthesis-free notation. Mathematical tables and other aids to computation, 8(46):53--57 , 1954 . Arthur W Burks, Don W Warren, and Jesse B Wright. An analysis of a logical machine using parenthesis-free notation. Mathematical tables and other aids to computation, 8(46):53--57, 1954."},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00070"},{"key":"e_1_3_2_2_7_1","first-page":"31","article-title":"Weakly supervised dense event captioning in videos","author":"Duan Xuguang","year":"2018","unstructured":"Xuguang Duan , Wenbing Huang , Chuang Gan , Jingdong Wang, Wenwu Zhu , and Junzhou Huang . Weakly supervised dense event captioning in videos . Advances in Neural Information Processing Systems , 31 , 2018 . Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, and Junzhou Huang. Weakly supervised dense event captioning in videos. Advances in Neural Information Processing Systems, 31, 2018.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00210"},{"key":"e_1_3_2_2_9_1","volume-title":"Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681","author":"Fu Tsu-Jui","year":"2021","unstructured":"Tsu-Jui Fu , Linjie Li , Zhe Gan , Kevin Lin , William Yang Wang , Lijuan Wang, and Zicheng Liu. Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681 , 2021 . Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End-to-end video-language transformers with masked visual-token modeling. 
arXiv preprint arXiv:2111.12681, 2021."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00688"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01113"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3051756"},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.93"},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_4"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6737"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.1469-8137.1912.tb05611.x"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.149"},{"key":"e_1_3_2_2_18_1","first-page":"10236","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Ji Jingwei","year":"2020","unstructured":"Jingwei Ji , Ranjay Krishna , Li Fei-Fei , and Juan Carlos Niebles . Action genome : Actions as compositions of spatio-temporal scene graphs . In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition , pages 10236 -- 10247 , 2020 . Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as compositions of spatio-temporal scene graphs. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 10236--10247, 2020."},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6767"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3076556"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.215"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.325"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.2307\/2332226"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00853"},{"key":"e_1_3_2_2_25_1","volume-title":"Deepstory: Video story qa by deep embedded memory networks. arXiv preprint arXiv:1707.00836","author":"Kim Kyung-Min","year":"2017","unstructured":"Kyung-Min Kim , Min-Oh Heo , Seong-Ho Choi , and Byoung-Tak Zhang . Deepstory: Video story qa by deep embedded memory networks. arXiv preprint arXiv:1707.00836 , 2017 . Kyung-Min Kim, Min-Oh Heo, Seong-Ho Choi, and Byoung-Tak Zhang. Deepstory: Video story qa by deep embedded memory networks. arXiv preprint arXiv:1707.00836, 2017."},{"key":"e_1_3_2_2_26_1","volume-title":"Learning to reason with relational video representation for question answering. arXiv preprint arXiv:1907.04553, 2","author":"Le Thao Minh","year":"2019","unstructured":"Thao Minh Le , Vuong Le , Svetha Venkatesh , and Truyen Tran . Learning to reason with relational video representation for question answering. arXiv preprint arXiv:1907.04553, 2 , 2019 . Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran. Learning to reason with relational video representation for question answering. arXiv preprint arXiv:1907.04553, 2, 2019."},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00999"},{"key":"e_1_3_2_2_28_1","volume-title":"Localized, compositional video question answering. 
arXiv preprint arXiv:1809.01696","author":"Lei Jie","year":"2018","unstructured":"Jie Lei , Licheng Yu , Mohit Bansal , and Tamara L Berg . Tvqa : Localized, compositional video question answering. arXiv preprint arXiv:1809.01696 , 2018 . Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696, 2018."},{"key":"e_1_3_2_2_29_1","volume-title":"Tvqa: Spatio-temporal grounding for video question answering. arXiv preprint arXiv:1904.11574","author":"Lei Jie","year":"2019","unstructured":"Jie Lei , Licheng Yu , Tamara L Berg , and Mohit Bansal . Tvqa: Spatio-temporal grounding for video question answering. arXiv preprint arXiv:1904.11574 , 2019 . Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. Tvqa: Spatio-temporal grounding for video question answering. arXiv preprint arXiv:1904.11574, 2019."},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00725"},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350922"},{"key":"e_1_3_2_2_32_1","first-page":"8658","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"33","author":"Li Xiangpeng","year":"2019","unstructured":"Xiangpeng Li , Jingkuan Song , Lianli Gao , Xianglong Liu , Wenbing Huang , Xiangnan He , and Chuang Gan . Beyond rnns : Positional self-attention with co-attention for video question answering . In Proceedings of the AAAI Conference on Artificial Intelligence , volume 33 , pages 8658 -- 8665 , 2019 . Xiangpeng Li, Jingkuan Song, Lianli Gao, Xianglong Liu, Wenbing Huang, Xiangnan He, and Chuang Gan. Beyond rnns: Positional self-attention with co-attention for video question answering. 
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8658--8665, 2019."},{"key":"e_1_3_2_2_33_1","first-page":"1698","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Liu Fei","year":"2021","unstructured":"Fei Liu , Jing Liu , Weining Wang , and Hanqing Lu. Hair : Hierarchical visual-semantic relational reasoning for video question answering . In Proceedings of the IEEE\/CVF International Conference on Computer Vision , pages 1698 -- 1707 , 2021 . Fei Liu, Jing Liu, Weining Wang, and Hanqing Lu. Hair: Hierarchical visual-semantic relational reasoning for video question answering. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, pages 1698--1707, 2021."},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00519"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.80"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01527"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475193"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.74"},{"key":"e_1_3_2_2_39_1","volume-title":"Attend what you need: Motion-appearance synergistic networks for video question answering. arXiv preprint arXiv:2106.10446","author":"Seo Ahjeong","year":"2021","unstructured":"Ahjeong Seo , Gi-Cheon Kang , Joonhan Park , and Byoung-Tak Zhang . Attend what you need: Motion-appearance synergistic networks for video question answering. arXiv preprint arXiv:2106.10446 , 2021 . Ahjeong Seo, Gi-Cheon Kang, Joonhan Park, and Byoung-Tak Zhang. Attend what you need: Motion-appearance synergistic networks for video question answering. arXiv preprint arXiv:2106.10446, 2021."},{"key":"e_1_3_2_2_40_1","volume-title":"Towards out-of-distribution generalization: A survey. 
arXiv preprint arXiv:2108.13624","author":"Shen Zheyan","year":"2021","unstructured":"Zheyan Shen , Jiashuo Liu , Yue He , Xingxuan Zhang , Renzhe Xu , Han Yu , and Peng Cui . Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624 , 2021 . Zheyan Shen, Jiashuo Liu, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, and Peng Cui. Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624, 2021."},{"key":"e_1_3_2_2_41_1","volume-title":"Reinforcement learning: An introduction","author":"Sutton Richard S","year":"2018","unstructured":"Richard S Sutton and Andrew G Barto . Reinforcement learning: An introduction . MIT press , 2018 . Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018."},{"key":"e_1_3_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475620"},{"key":"e_1_3_2_2_43_1","article-title":"A dual-visual graph reasoning unit for video question answering","author":"Wang Jianyu","year":"2021","unstructured":"Jianyu Wang , Bingkun Bao , and Changsheng Xu. Dualvgr : A dual-visual graph reasoning unit for video question answering . IEEE Transactions on Multimedia , 2021 . Jianyu Wang, Bingkun Bao, and Changsheng Xu. Dualvgr: A dual-visual graph reasoning unit for video question answering. IEEE Transactions on Multimedia, 2021.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_2_44_1","volume-title":"Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)","author":"Wu Bo","year":"2021","unstructured":"Bo Wu , Shoubin Yu , Zhenfang Chen , Joshua B Tenenbaum , and Chuang Gan . Star : A benchmark for situated reasoning in real-world videos . In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , 2021 . Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. 
In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021."},{"key":"e_1_3_2_2_45_1","volume-title":"Video as conditional graph hierarchy for multi-granular question answering. arXiv preprint arXiv:2112.06197","author":"Xiao Junbin","year":"2021","unstructured":"Junbin Xiao , Angela Yao , Zhiyuan Liu , Yicong Li , Wei Ji , and Tat-Seng Chua . Video as conditional graph hierarchy for multi-granular question answering. arXiv preprint arXiv:2112.06197 , 2021 . Junbin Xiao, Angela Yao, Zhiyuan Liu, Yicong Li, Wei Ji, and Tat-Seng Chua. Video as conditional graph hierarchy for multi-granular question answering. arXiv preprint arXiv:2112.06197, 2021."},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123427"},{"key":"e_1_3_2_2_47_1","first-page":"9878","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Xu Li","year":"2021","unstructured":"Li Xu , He Huang , and Jun Liu . Sutd-trafficqa : A question answering benchmark and an efficient network for video reasoning over traffic events . In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition , pages 9878 -- 9888 , 2021 . Li Xu, He Huang, and Jun Liu. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 9878--9888, 2021."},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2746267"},{"key":"e_1_3_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2859820"},{"key":"e_1_3_2_2_50_1","volume-title":"Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442","author":"Yi Kexin","year":"2019","unstructured":"Kexin Yi , Chuang Gan , Yunzhu Li , Pushmeet Kohli , Jiajun Wu , Antonio Torralba , and Joshua B Tenenbaum . 
Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442 , 2019 . Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019."},{"key":"e_1_3_2_2_51_1","first-page":"23634","article-title":"Multimodal neural script knowledge models","volume":"34","author":"Zellers Rowan","year":"2021","unstructured":"Rowan Zellers , Ximing Lu , Jack Hessel , Youngjae Yu , Jae Sung Park , Jize Cao , Ali Farhadi , and Yejin Choi . Merlot : Multimodal neural script knowledge models . Advances in Neural Information Processing Systems , 34 : 23634 -- 23651 , 2021 . Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634--23651, 2021.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_52_1","first-page":"16375","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zellers Rowan","year":"2022","unstructured":"Rowan Zellers , Jiasen Lu , Ximing Lu , Youngjae Yu , Yanpeng Zhao , Mohammadreza Salehi , Aditya Kusupati , Jack Hessel , Ali Farhadi , and Yejin Choi . Merlot reserve : Neural script knowledge through vision and language and sound . In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition , pages 16375 -- 16387 , 2022 . Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 16375--16387, 2022."},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.5555\/3172077.3172381"},{"key":"e_1_3_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/512"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548061","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548061","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:30Z","timestamp":1750186950000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548061"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":54,"alternative-id":["10.1145\/3503161.3548061","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3548061","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}