{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T09:53:12Z","timestamp":1761126792095,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":51,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,6,27]],"date-time":"2022-06-27T00:00:00Z","timestamp":1656288000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Key R&D Program of China","award":["No. 2020AAA0106900"],"award-info":[{"award-number":["No. 2020AAA0106900"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["No. U19B2037, No. 61876152"],"award-info":[{"award-number":["No. U19B2037, No. 61876152"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Shaanxi Provincial Key R&D Program","award":["No. 2021KWZ-03"],"award-info":[{"award-number":["No. 2021KWZ-03"]}]},{"DOI":"10.13039\/501100017596","name":"Natural Science Basic Research Program of Shaanxi Province","doi-asserted-by":"publisher","award":["No. 2021JCW-03"],"award-info":[{"award-number":["No. 
2021JCW-03"]}],"id":[{"id":"10.13039\/501100017596","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,6,27]]},"DOI":"10.1145\/3512527.3531380","type":"proceedings-article","created":{"date-parts":[[2022,6,23]],"date-time":"2022-06-23T22:23:32Z","timestamp":1656023012000},"page":"219-228","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Dual-Level Decoupled Transformer for Video Captioning"],"prefix":"10.1145","author":[{"given":"Yiqi","family":"Gao","sequence":"first","affiliation":[{"name":"Northwestern Polytechnical University &amp; National Engineering Lab for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi'an, China"}]},{"given":"Xinglin","family":"Hou","sequence":"additional","affiliation":[{"name":"Alibaba Group, Beijing, China"}]},{"given":"Wei","family":"Suo","sequence":"additional","affiliation":[{"name":"Northwestern Polytechnical University &amp; National Engineering Lab for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi'an, China"}]},{"given":"Mengyang","family":"Sun","sequence":"additional","affiliation":[{"name":"School of Computer Science, Northwestern Polytechnical University;National Engineering Lab for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi'an, China"}]},{"given":"Tiezheng","family":"Ge","sequence":"additional","affiliation":[{"name":"Alibaba Group, Beijing, China"}]},{"given":"Yuning","family":"Jiang","sequence":"additional","affiliation":[{"name":"Alibaba Group, Beijing, China"}]},{"given":"Peng","family":"Wang","sequence":"additional","affiliation":[{"name":"Northwestern Polytechnical University &amp; National Engineering Lab for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi'an, 
China"}]}],"member":"320","published-online":{"date-parts":[[2022,6,27]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01277"},{"key":"e_1_3_2_2_2_1","volume-title":"Vivit: A video vision transformer. arXiv preprint arXiv:2103.15691","author":"Arnab Anurag","year":"2021","unstructured":"Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. 2021. Vivit: A video vision transformer. arXiv preprint arXiv:2103.15691 (2021)."},{"key":"e_1_3_2_2_3_1","unstructured":"Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, et al. 2012. Video in sentences out. arXiv preprint arXiv:1204.2742 (2012)."},{"key":"e_1_3_2_2_4_1","volume-title":"Is Space-Time Attention All You Need for Video Understanding? arXiv preprint arXiv:2102.05095","author":"Bertasius Gedas","year":"2021","unstructured":"Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is Space-Time Attention All You Need for Video Understanding? arXiv preprint arXiv:2102.05095 (2021)."},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.675"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.5555\/2002472.2002497"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58548-8_20"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33018191"},{"key":"e_1_3_2_2_10_1","volume-title":"Twins: Revisiting the Design of Spatial Attention in Vision Transformers. In Advances in neural information processing systems.","author":"Chu Xiangxiang","year":"2021","unstructured":"Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. In Advances in neural information processing systems."},{"key":"e_1_3_2_2_11_1","volume-title":"What does bert look at? an analysis of bert's attention. arXiv preprint arXiv:1906.04341","author":"Clark Kevin","year":"2019","unstructured":"Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What does bert look at? an analysis of bert's attention. arXiv preprint arXiv:1906.04341 (2019)."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"e_1_3_2_2_13_1","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_2_14_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"33","author":"Fang Kuncheng","year":"2019","unstructured":"Kuncheng Fang, Lian Zhou, Cheng Jin, Yuejie Zhang, Kangnian Weng, Tao Zhang, and Weiguo Fan. 2019. Fully convolutional video captioning with coarse-to-fine and inherited attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8271--8278."},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2729019"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01034"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.450"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00901"},{"key":"e_1_3_2_2_20_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1020346032608"},{"key":"e_1_3_2_2_22_1","volume-title":"Shafiq Joty, Caiming Xiong, and Steven Hoi.","author":"Li Junnan","year":"2021","unstructured":"Junnan Li, Ramprasaath R Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. 2021. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. arXiv preprint arXiv:2107.07651 (2021)."},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2017\/307"},{"key":"e_1_3_2_2_24_1","unstructured":"Fenglin Liu, Xuancheng Ren, Xian Wu, Bang Yang, Shen Ge, and Xu Sun. [n. d.]. O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning. ([n. d.])."},{"key":"e_1_3_2_2_25_1","volume-title":"Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030","author":"Liu Ze","year":"2021","unstructured":"Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)."},{"key":"e_1_3_2_2_26_1","volume-title":"Video swin transformer. arXiv preprint arXiv:2106.13230","author":"Liu Ze","year":"2021","unstructured":"Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2021. Video swin transformer. arXiv preprint arXiv:2106.13230 (2021)."},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00990"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01088"},{"key":"e_1_3_2_2_29_1","volume-title":"Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019), 8026--8037."},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00854"},{"key":"e_1_3_2_2_31_1","volume-title":"Do Vision Transformers See Like Convolutional Neural Networks? arXiv preprint arXiv:2108.08810","author":"Raghu Maithra","year":"2021","unstructured":"Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. 2021. Do Vision Transformers See Like Convolutional Neural Networks? arXiv preprint arXiv:2108.08810 (2021)."},{"key":"e_1_3_2_2_32_1","volume-title":"Faster R-CNN: towards real-time object detection with region proposal networks","author":"Ren Shaoqing","year":"2016","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39, 6 (2016), 1137--1149."},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i3.16353"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123354"},{"key":"e_1_3_2_2_36_1","volume-title":"Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729","author":"Venugopalan Subhashini","year":"2014","unstructured":"Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2014. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729 (2014)."},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00273"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00795"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240677"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00468"},{"key":"e_1_3_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.571"},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i4.16421"},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.512"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00060"},{"key":"e_1_3_2_2_47_1","volume-title":"Volo: Vision outlooker for visual recognition. arXiv preprint arXiv:2106.13112","author":"Yuan Li","year":"2021","unstructured":"Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, and Shuicheng Yan. 2021. Volo: Vision outlooker for visual recognition. arXiv preprint arXiv:2106.13112 (2021)."},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00852"},{"key":"e_1_3_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00971"},{"key":"e_1_3_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01329"},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01311"}],"event":{"name":"ICMR '22: International Conference on Multimedia Retrieval","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Newark NJ USA","acronym":"ICMR '22"},"container-title":["Proceedings of the 2022 International Conference on Multimedia Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3512527.3531380","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3512527.3531380","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:30:12Z","timestamp":1750188612000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3512527.3531380"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,27]]},"references-count":51,"alternative-id":["10.1145\/3512527.3531380","10.1145\/3512527"],"URL":"https:\/\/doi.org\/10.1145\/3512527.3531380","relation":{},"subject":[],"published":{"date-parts":[[2022,6,27]]},"assertion":[{"value":"2022-06-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}