{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:21:18Z","timestamp":1750220478166,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":20,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T00:00:00Z","timestamp":1634428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Natural Science Foundation of China","award":["61921006","62076119"],"award-info":[{"award-number":["61921006","62076119"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,17]]},"DOI":"10.1145\/3474085.3479221","type":"proceedings-article","created":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T08:03:42Z","timestamp":1634544222000},"page":"4799-4802","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["NJU MCG - Sensetime Team Submission to Pre-training for Video Understanding Challenge Track II"],"prefix":"10.1145","author":[{"given":"Liwei","family":"Jin","sequence":"first","affiliation":[{"name":"Nanjing University, Nanjing, Jiangsu, China"}]},{"given":"Haoyue","family":"Cheng","sequence":"additional","affiliation":[{"name":"Nanjing University, Nanjing, Jiangsu, China"}]},{"given":"Su","family":"Xu","sequence":"additional","affiliation":[{"name":"Sensetime Research, Hong Kong, China"}]},{"given":"Wayne","family":"Wu","sequence":"additional","affiliation":[{"name":"Sensetime Research, Hong Kong, China"}]},{"given":"Limin","family":"Wang","sequence":"additional","affiliation":[{"name":"Nanjing University, Nanjing, Jiangsu, China"}]}],"member":"320","published-online":{"date-parts":[[2021,10,17]]},"reference":[
{"unstructured":"Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021a. Facebook Timesformer. https:\/\/github.com\/facebookresearch\/TimeSformer","key":"e_1_3_2_1_1_1"},
{"key":"e_1_3_2_1_2_1","volume-title":"2021b. Is Space-Time Attention All You Need for Video Understanding? arXiv preprint arXiv:2102.05095","author":"Bertasius Gedas","year":"2021","unstructured":"Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021b. Is Space-Time Attention All You Need for Video Understanding? arXiv preprint arXiv:2102.05095 (2021)."},
{"unstructured":"MMAction2 Contributors. 2020. OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark. https:\/\/github.com\/open-mmlab\/mmaction2.","key":"e_1_3_2_1_3_1"},
{"key":"e_1_3_2_1_4_1","volume-title":"et al.","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},
{"doi-asserted-by":"publisher","key":"e_1_3_2_1_5_1","DOI":"10.1109\/ICCV.2019.00630"},
{"doi-asserted-by":"publisher","key":"e_1_3_2_1_6_1","DOI":"10.1109\/CVPR.2016.90"},
{"key":"e_1_3_2_1_7_1","volume-title":"et al.","author":"Kay Will","year":"2017","unstructured":"Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)."},
{"key":"e_1_3_2_1_8_1","volume-title":"Learning spatiotemporal features via video and text pair discrimination. arXiv preprint arXiv:2001.05691","author":"Li Tianhao","year":"2020","unstructured":"Tianhao Li and Limin Wang. 2020. Learning spatiotemporal features via video and text pair discrimination. arXiv preprint arXiv:2001.05691 (2020)."},
{"doi-asserted-by":"crossref","unstructured":"Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021a. Microsoft Swin Transformer. https:\/\/github.com\/microsoft\/Swin-Transformer","key":"e_1_3_2_1_9_1","DOI":"10.1109\/CVPR52688.2022.00320"},
{"key":"e_1_3_2_1_10_1","volume-title":"2021b. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030","author":"Liu Ze","year":"2021","unstructured":"Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021b. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)."},
{"doi-asserted-by":"crossref","unstructured":"Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2021c. Microsoft Video Swin Transformer. https:\/\/github.com\/SwinTransformer\/Video-Swin-Transformer","key":"e_1_3_2_1_11_1","DOI":"10.1109\/CVPR52688.2022.00320"},
{"key":"e_1_3_2_1_12_1","volume-title":"2021d. Video Swin Transformer. arXiv preprint arXiv:2106.13230","author":"Liu Ze","year":"2021","unstructured":"Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2021d. Video Swin Transformer. arXiv preprint arXiv:2106.13230 (2021)."},
{"unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021a. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021).","key":"e_1_3_2_1_13_1"},
{"unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021b. OpenAI CLIP. https:\/\/github.com\/openai\/CLIP","key":"e_1_3_2_1_14_1"},
{"key":"e_1_3_2_1_15_1","volume-title":"Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877","author":"Touvron Hugo","year":"2020","unstructured":"Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv\u00e9 J\u00e9gou. 2020. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877 (2020)."},
{"unstructured":"Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv\u00e9 J\u00e9gou. 2021. Facebook Deit. https:\/\/github.com\/facebookresearch\/deit","key":"e_1_3_2_1_16_1"},
{"doi-asserted-by":"publisher","key":"e_1_3_2_1_17_1","DOI":"10.1109\/ICCV.2019.00565"},
{"doi-asserted-by":"publisher","key":"e_1_3_2_1_18_1","DOI":"10.1109\/CVPR46437.2021.00193"},
{"doi-asserted-by":"publisher","key":"e_1_3_2_1_19_1","DOI":"10.1109\/TPAMI.2018.2868668"},
{"doi-asserted-by":"publisher","key":"e_1_3_2_1_20_1","DOI":"10.1007\/978-3-319-46484-8_2"}],
"event":{"sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"acronym":"MM '21","name":"MM '21: ACM Multimedia Conference","location":"Virtual Event China"},"container-title":["Proceedings of the 29th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3479221","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3474085.3479221","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:48Z","timestamp":1750193328000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3479221"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,17]]},"references-count":20,"alternative-id":["10.1145\/3474085.3479221","10.1145\/3474085"],"URL":"https:\/\/doi.org\/10.1145\/3474085.3479221","relation":{},"subject":[],"published":{"date-parts":[[2021,10,17]]},"assertion":[{"value":"2021-10-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}