{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,15]],"date-time":"2025-10-15T10:33:38Z","timestamp":1760524418058,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":41,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"the Fundamental Research Funds for the Central Universities","award":["No. 226-2022-00087"],"award-info":[{"award-number":["No. 226-2022-00087"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3547909","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:01Z","timestamp":1665416581000},"page":"1819-1827","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["In-N-Out Generative Learning for Dense Unsupervised Video Segmentation"],"prefix":"10.1145","author":[{"given":"Xiao","family":"Pan","sequence":"first","affiliation":[{"name":"Zhejiang University & Alibaba DAMO Academy, Hangzhou, China"}]},{"given":"Peike","family":"Li","sequence":"additional","affiliation":[{"name":"Alibaba DAMO Academy & University of Technology Sydney, Hangzhou, China"}]},{"given":"Zongxin","family":"Yang","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"given":"Huiling","family":"Zhou","sequence":"additional","affiliation":[{"name":"Alibaba DAMO Academy, Hangzhou, China"}]},{"given":"Chang","family":"Zhou","sequence":"additional","affiliation":[{"name":"Alibaba DAMO Academy, Hangzhou, China"}]},{"given":"Hongxia","family":"Yang","sequence":"additional","affiliation":[{"name":"Alibaba 
DAMO Academy, Hangzhou, China"}]},{"given":"Jingren","family":"Zhou","sequence":"additional","affiliation":[{"name":"Alibaba DAMO Academy, Hangzhou, China"}]},{"given":"Yi","family":"Yang","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Advances in Neural Information Processing Systems","volume":"34","author":"Araslanov Nikita","year":"2021","unstructured":"Nikita Araslanov , Simone Schaub-Meyer , and Stefan Roth . 2021 . Dense Unsupervised Learning for Video Segmentation . Advances in Neural Information Processing Systems , Vol. 34 (2021). Nikita Araslanov, Simone Schaub-Meyer, and Stefan Roth. 2021. Dense Unsupervised Learning for Video Segmentation. Advances in Neural Information Processing Systems, Vol. 34 (2021)."},{"key":"e_1_3_2_2_2_1","volume-title":"Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254","author":"Bao Hangbo","year":"2021","unstructured":"Hangbo Bao , Li Dong , and Furu Wei . 2021 . Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021). Hangbo Bao, Li Dong, and Furu Wei. 2021. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)."},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"e_1_3_2_2_5_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. 
Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_6_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly etal 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).  Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_2_7_1","first-page":"21271","article-title":"Bootstrap your own latent-a new approach to self-supervised learning","volume":"33","author":"Grill Jean-Bastien","year":"2020","unstructured":"Jean-Bastien Grill , Florian Strub , Florent Altch\u00e9 , Corentin Tallec , Pierre Richemond , Elena Buchatskaya , Carl Doersch , Bernardo Avila Pires , Zhaohan Guo , Mohammad Gheshlaghi Azar , 2020 . Bootstrap your own latent-a new approach to self-supervised learning . Advances in Neural Information Processing Systems , Vol. 33 (2020), 21271 -- 21284 . Jean-Bastien Grill, Florian Strub, Florent Altch\u00e9, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, Vol. 33 (2020), 21271--21284.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_8_1","volume-title":"Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377","author":"He Kaiming","year":"2021","unstructured":"Kaiming He , Xinlei Chen , Saining Xie , Yanghao Li , Piotr Doll\u00e1r , and Ross Girshick . 
2021. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 ( 2021 ). Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll\u00e1r, and Ross Girshick. 2021. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)."},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_11_1","volume-title":"Space-time correspondence as a contrastive random walk. Advances in neural information processing systems","author":"Jabri Allan","year":"2020","unstructured":"Allan Jabri , Andrew Owens , and Alexei Efros . 2020. Space-time correspondence as a contrastive random walk. Advances in neural information processing systems , Vol. 33 ( 2020 ), 19545--19560. Allan Jabri, Andrew Owens, and Alexei Efros. 2020. Space-time correspondence as a contrastive random walk. Advances in neural information processing systems, Vol. 33 (2020), 19545--19560."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00109"},{"key":"e_1_3_2_2_13_1","unstructured":"Will Kay Joao Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola Tim Green Trevor Back Paul Natsev etal 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).  Will Kay Joao Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola Tim Green Trevor Back Paul Natsev et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00651"},{"key":"e_1_3_2_2_15_1","volume-title":"Self-supervised learning for video correspondence flow. arXiv preprint arXiv:1905.00875","author":"Lai Zihang","year":"2019","unstructured":"Zihang Lai and Weidi Xie . 2019. 
Self-supervised learning for video correspondence flow. arXiv preprint arXiv:1905.00875 ( 2019 ). Zihang Lai and Weidi Xie. 2019. Self-supervised learning for video correspondence flow. arXiv preprint arXiv:1905.00875 (2019)."},{"key":"e_1_3_2_2_16_1","volume-title":"Rethinking cross-modal interaction from a top-down perspective for referring video object segmentation. arXiv preprint arXiv:2106.01061","author":"Liang Chen","year":"2021","unstructured":"Chen Liang , Yu Wu , Tianfei Zhou , Wenguan Wang , Zongxin Yang , Yunchao Wei , and Yi Yang . 2021. Rethinking cross-modal interaction from a top-down perspective for referring video object segmentation. arXiv preprint arXiv:2106.01061 ( 2021 ). Chen Liang, Yu Wu, Tianfei Zhou, Wenguan Wang, Zongxin Yang, Yunchao Wei, and Yi Yang. 2021. Rethinking cross-modal interaction from a top-down perspective for referring video object segmentation. arXiv preprint arXiv:2106.01061 (2021)."},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3479205"},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01038"},{"key":"e_1_3_2_2_20_1","volume-title":"TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. In The European Conference on Computer Vision (ECCV).","author":"Muller Matthias","year":"2018","unstructured":"Matthias Muller , Adel Bibi , Silvio Giancola , Salman Alsubaihi , and Bernard Ghanem . 2018 . TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. In The European Conference on Computer Vision (ECCV). Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. 2018. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. 
In The European Conference on Computer Vision (ECCV)."},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00932"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.85"},{"key":"e_1_3_2_2_23_1","volume-title":"The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675","author":"Pont-Tuset Jordi","year":"2017","unstructured":"Jordi Pont-Tuset , Federico Perazzi , Sergi Caelles , Pablo Arbel\u00e1ez , Alex Sorkine-Hornung , and Luc Van Gool . 2017. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675 ( 2017 ). Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbel\u00e1ez, Alex Sorkine-Hornung, and Luc Van Gool. 2017. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017)."},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_31"},{"key":"e_1_3_2_2_25_1","volume-title":"International Conference on Machine Learning. PMLR, 10347--10357","author":"Touvron Hugo","year":"2021","unstructured":"Hugo Touvron , Matthieu Cord , Matthijs Douze , Francisco Massa , Alexandre Sablayrolles , and Herv\u00e9 J\u00e9gou . 2021 . Training data-efficient image transformers & distillation through attention . In International Conference on Machine Learning. PMLR, 10347--10357 . Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv\u00e9 J\u00e9gou. 2021. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning. PMLR, 10347--10357."},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01219-9_41"},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01261-8_24"},{"key":"e_1_3_2_2_28_1","volume-title":"Contrastive transformation for self-supervised correspondence learning. 
arXiv preprint arXiv:2012.05057","author":"Wang Ning","year":"2020","unstructured":"Ning Wang , Wengang Zhou , and Houqiang Li. 2020. Contrastive transformation for self-supervised correspondence learning. arXiv preprint arXiv:2012.05057 ( 2020 ). Ning Wang, Wengang Zhou, and Houqiang Li. 2020. Contrastive transformation for self-supervised correspondence learning. arXiv preprint arXiv:2012.05057 (2020)."},{"key":"e_1_3_2_2_29_1","volume-title":"A survey on deep learning technique for video segmentation. arXiv preprint arXiv:2107.01153","author":"Wang Wenguan","year":"2021","unstructured":"Wenguan Wang , Tianfei Zhou , Fatih Porikli , David Crandall , and Luc Van Gool . 2021b. A survey on deep learning technique for video segmentation. arXiv preprint arXiv:2107.01153 ( 2021 ). Wenguan Wang, Tianfei Zhou, Fatih Porikli, David Crandall, and Luc Van Gool. 2021b. A survey on deep learning technique for video segmentation. arXiv preprint arXiv:2107.01153 (2021)."},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00267"},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00863"},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41377-021-00658-8"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00992"},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01228-1_36"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00097"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00085"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58558-7_20"},{"key":"e_1_3_2_2_38_1","first-page":"2491","article-title":"Associating objects with transformers for video object segmentation","volume":"34","author":"Yang Zongxin","year":"2021","unstructured":"Zongxin Yang , Yunchao Wei , and Yi Yang . 2021 a. 
Associating objects with transformers for video object segmentation . Advances in Neural Information Processing Systems , Vol. 34 (2021), 2491 -- 2502 . Zongxin Yang, Yunchao Wei, and Yi Yang. 2021a. Associating objects with transformers for video object segmentation. Advances in Neural Information Processing Systems, Vol. 34 (2021), 2491--2502.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3081597"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00981"},{"key":"e_1_3_2_2_41_1","volume-title":"ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832","author":"Zhou Jinghao","year":"2021","unstructured":"Jinghao Zhou , Chen Wei , Huiyu Wang , Wei Shen , Cihang Xie , Alan Yuille , and Tao Kong . 2021. ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832 ( 2021 ). Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. 2021. ibot: Image bert pre-training with online tokenizer. 
arXiv preprint arXiv:2111.07832 (2021)."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Lisboa Portugal","acronym":"MM '22"},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547909","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3547909","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:30Z","timestamp":1750186830000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547909"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":41,"alternative-id":["10.1145\/3503161.3547909","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3547909","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}