{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T17:16:33Z","timestamp":1777655793463,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":61,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T00:00:00Z","timestamp":1634428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"SenseTime Collaborative Reasearch Grant"},{"name":"Shanghai Science and Technology RD Program of China","award":["20511100300"],"award-info":[{"award-number":["20511100300"]}]},{"name":"Shanghai Municipal Science and Technology Major Project","award":["2021SHZDZX0102"],"award-info":[{"award-number":["2021SHZDZX0102"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["Grant No.61902247"],"award-info":[{"award-number":["Grant No.61902247"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Key R&D Program of China","award":["2018AAA0100704"],"award-info":[{"award-number":["2018AAA0100704"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,17]]},"DOI":"10.1145\/3474085.3475409","type":"proceedings-article","created":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T06:35:51Z","timestamp":1634538951000},"page":"59-68","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":43,"title":["Video Semantic Segmentation via Sparse Temporal Transformer"],"prefix":"10.1145","author":[{"given":"Jiangtong","family":"Li","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wentao","family":"Wang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Junjie","family":"Chen","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Li","family":"Niu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jianlou","family":"Si","sequence":"additional","affiliation":[{"name":"SenseTime Research, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chen","family":"Qian","sequence":"additional","affiliation":[{"name":"SenseTime Research, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Liqing","family":"Zhang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,10,17]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2644615"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-88682-2_5"},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_2_4_1","volume-title":"2020 b. Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364","author":"Chen Hanting","year":"2020","unstructured":"Hanting Chen , Yunhe Wang , Tianyu Guo , Chang Xu , Yiping Deng , Zhenhua Liu , Siwei Ma , Chunjing Xu , Chao Xu , and Wen Gao . 2020 b. Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364 ( 2020 ). Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. 2020 b. Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364 (2020)."},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2699184"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2699184"},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_49"},{"key":"e_1_3_2_2_8_1","volume-title":"ICML","author":"Chen Mark","year":"2020","unstructured":"Mark Chen , Alec Radford , Rewon Child , Jeffrey Wu , Heewoo Jun , David Luan , and Ilya Sutskever . 2020 a. Generative pretraining from pixels . In ICML 2020. Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. 2020 a. Generative pretraining from pixels. In ICML 2020."},{"key":"e_1_3_2_2_9_1","volume-title":"arXiv preprint arXiv:2103.15436","author":"Chen Xin","year":"2021","unstructured":"Xin Chen , Bin Yan , Jiawen Zhu , Dong Wang , Xiaoyun Yang , and Huchuan Lu. 2021. Transformer Tracking . arXiv preprint arXiv:2103.15436 ( 2021 ). Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. 2021. Transformer Tracking. arXiv preprint arXiv:2103.15436 (2021)."},{"key":"e_1_3_2_2_10_1","volume-title":"Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509","author":"Child Rewon","year":"2019","unstructured":"Rewon Child , Scott Gray , Alec Radford , and Ilya Sutskever . 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 ( 2019 ). Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019)."},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.350"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1223"},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1285"},{"key":"e_1_3_2_2_14_1","volume-title":"NAACL","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding . In NAACL 2018. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2018."},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6699"},{"key":"e_1_3_2_2_16_1","volume-title":"ICLR","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy , Lucas Beyer , Alexander Kolesnikov , Dirk Weissenborn , Xiaohua Zhai , Thomas Unterthiner , Mostafa Dehghani , Matthias Minderer , Georg Heigold , Sylvain Gelly , 2021 . An image is worth 16x16 words: Transformers for image recognition at scale . In ICLR 2021. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR 2021."},{"key":"e_1_3_2_2_17_1","volume-title":"SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation. In CVPR","author":"Duke Brendan","year":"2021","unstructured":"Brendan Duke , Abdalla Ahmed , Christian Wolf , Parham Aarabi , and Graham W Taylor . 2021 . SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation. In CVPR 2021. Brendan Duke, Abdalla Ahmed, Christian Wolf, Parham Aarabi, and Graham W Taylor. 2021. SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation. In CVPR 2021."},{"key":"e_1_3_2_2_18_1","volume-title":"ACCV","author":"Fayyaz Mohsen","year":"2016","unstructured":"Mohsen Fayyaz , Mohammad Hajizadeh Saffar , Mohammad Sabokrou , Mahmood Fathy , Fay Huang , and Reinhard Klette . 2016 . STFCN: spatio-temporal fully convolutional neural network for semantic segmentation of street scenes . In ACCV 2016. Mohsen Fayyaz, Mohammad Hajizadeh Saffar, Mohammad Sabokrou, Mahmood Fathy, Fay Huang, and Reinhard Klette. 2016. STFCN: spatio-temporal fully convolutional neural network for semantic segmentation of street scenes. In ACCV 2016."},{"key":"e_1_3_2_2_19_1","volume-title":"TapLab: A Fast Framework for Semantic Video Segmentation Tapping into Compressed-Domain Knowledge","author":"Feng Junyi","year":"2020","unstructured":"Junyi Feng , Songyuan Li , Xi Li , Fei Wu , Qi Tian , Ming-Hsuan Yang , and Haibin Ling . 2020. TapLab: A Fast Framework for Semantic Video Segmentation Tapping into Compressed-Domain Knowledge . IEEE Transactions on Pattern Analysis and Machine Intelligence ( 2020 ). Junyi Feng, Songyuan Li, Xi Li, Fei Wu, Qi Tian, Ming-Hsuan Yang, and Haibin Ling. 2020. TapLab: A Fast Framework for Semantic Video Segmentation Tapping into Compressed-Domain Knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020)."},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.477"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413593"},{"key":"e_1_3_2_2_22_1","volume-title":"BoundarySqueeze: Image Segmentation as Boundary Squeezing. arXiv preprint arXiv:2105.11668","author":"He Hao","year":"2021","unstructured":"Hao He , Xiangtai Li , Kuiyuan Yang , Guangliang Cheng , Jianping Shi , Yunhai Tong , Zhengjun Zha , and Lubin Weng . 2021. BoundarySqueeze: Image Segmentation as Boundary Squeezing. arXiv preprint arXiv:2105.11668 ( 2021 ). Hao He, Xiangtai Li, Kuiyuan Yang, Guangliang Cheng, Jianping Shi, Yunhai Tong, Zhengjun Zha, and Lubin Weng. 2021. BoundarySqueeze: Image Segmentation as Boundary Squeezing. arXiv preprint arXiv:2105.11668 (2021)."},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_24_1","volume-title":"CVPR","author":"Howard Andrew G","year":"2017","unstructured":"Andrew G Howard , Menglong Zhu , Bo Chen , Dmitry Kalenichenko , Weijun Wang , Tobias Weyand , Marco Andreetto , and Hartwig Adam . 2017 . MobileNets: Efficient convolutional neural networks for mobile vision applications . In CVPR 2017. Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. In CVPR 2017."},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00884"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.243"},{"key":"e_1_3_2_2_27_1","volume-title":"Ccnet: Criss-cross attention for semantic segmentation. In ICCV.","author":"Huang Zilong","year":"2019","unstructured":"Zilong Huang , Xinggang Wang , Lichao Huang , Chang Huang , Yunchao Wei , and Wenyu Liu . 2019 . Ccnet: Criss-cross attention for semantic segmentation. In ICCV. Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. 2019. Ccnet: Criss-cross attention for semantic segmentation. In ICCV."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00907"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2019.2962216"},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.595"},{"key":"e_1_3_2_2_31_1","volume-title":"ICCV Workshops","author":"Kreso Ivan","year":"2017","unstructured":"Ivan Kreso , Sinisa Segvic , and Josip Krapac . 2017 . Ladder-style densenets for semantic segmentation of large natural images . In ICCV Workshops 2017. 238--245. Ivan Kreso, Sinisa Segvic, and Josip Krapac. 2017. Ladder-style densenets for semantic segmentation of large natural images. In ICCV Workshops 2017. 238--245."},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00420"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00628"},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.549"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413915"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.114"},{"key":"e_1_3_2_2_37_1","volume-title":"Efficient Semantic Video Segmentation with Per-frame Inference. In ECCV","author":"Liu Yifan","year":"2020","unstructured":"Yifan Liu , Chunhua Shen , Changqian Yu , and Jingdong Wang . 2020 b . Efficient Semantic Video Segmentation with Per-frame Inference. In ECCV 2020. Yifan Liu, Chunhua Shen, Changqian Yu, and Jingdong Wang. 2020 b. Efficient Semantic Video Segmentation with Per-frame Inference. In ECCV 2020."},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2572683"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00222"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00713"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01289"},{"key":"e_1_3_2_2_42_1","volume-title":"ICML","author":"Parmar Niki","year":"2018","unstructured":"Niki Parmar , Ashish Vaswani , Jakob Uszkoreit , Lukasz Kaiser , Noam Shazeer , Alexander Ku , and Dustin Tran . 2018 . Image transformer . In ICML 2018. Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image transformer. In ICML 2018."},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-49409-8_69"},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_2_2_47_1","unstructured":"Jingdong Wang Ke Sun Tianheng Cheng Borui Jiang Chaorui Deng Yang Zhao Dong Liu Yadong Mu Mingkui Tan Xinggang Wang etal 2020. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2020).  Jingdong Wang Ke Sun Tianheng Cheng Borui Jiang Chaorui Deng Yang Zhao Dong Liu Yadong Mu Mingkui Tan Xinggang Wang et al. 2020. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2020)."},{"key":"e_1_3_2_2_48_1","volume-title":"End-to-End Video Instance Segmentation with Transformers. In CVPR","author":"Wang Yuqing","year":"2021","unstructured":"Yuqing Wang , Zhaoliang Xu , Xinlong Wang , Chunhua Shen , Baoshan Cheng , Hao Shen , and Huaxia Xia . 2021 . End-to-End Video Instance Segmentation with Transformers. In CVPR 2021. Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. 2021. End-to-End Video Instance Segmentation with Transformers. In CVPR 2021."},{"key":"e_1_3_2_2_49_1","volume-title":"Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677","author":"Wu Bichen","year":"2020","unstructured":"Bichen Wu , Chenfeng Xu , Xiaoliang Dai , Alvin Wan , Peizhao Zhang , Masayoshi Tomizuka , Kurt Keutzer , and Peter Vajda . 2020. Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677 ( 2020 ). Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, and Peter Vajda. 2020. Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677 (2020)."},{"key":"e_1_3_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00686"},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00583"},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3454804"},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01261-8_20"},{"key":"e_1_3_2_2_54_1","volume-title":"ICLR","author":"Yu Fisher","year":"2016","unstructured":"Fisher Yu and Vladlen Koltun . 2016 . Multi-scale context aggregation by dilated convolutions . In ICLR 2016. Fisher Yu and Vladlen Koltun. 2016. Multi-scale context aggregation by dilated convolutions. In ICLR 2016."},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01219-9_25"},{"key":"e_1_3_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.660"},{"key":"e_1_3_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00681"},{"key":"e_1_3_2_2_58_1","volume-title":"Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. AAAI 2021","author":"Zhou Haoyi","year":"2021","unstructured":"Haoyi Zhou , Shanghang Zhang , Jieqi Peng , Shuai Zhang , Jianxin Li , Hui Xiong , and Wancai Zhang . 2021 . Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. AAAI 2021 (2021). Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 2021. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. AAAI 2021 (2021)."},{"key":"e_1_3_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00753"},{"key":"e_1_3_2_2_60_1","volume-title":"Deformable DETR: Deformable Transformers for End-to-End Object Detection. In ICLR","author":"Zhu Xizhou","year":"2021","unstructured":"Xizhou Zhu , Weijie Su , Lewei Lu , Bin Li , Xiaogang Wang , and Jifeng Dai . 2021 . Deformable DETR: Deformable Transformers for End-to-End Object Detection. In ICLR 2021. Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2021. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In ICLR 2021."},{"key":"e_1_3_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.441"}],"event":{"name":"MM '21: ACM Multimedia Conference","location":"Virtual Event China","acronym":"MM '21","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 29th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475409","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3474085.3475409","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:32Z","timestamp":1750193312000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475409"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,17]]},"references-count":61,"alternative-id":["10.1145\/3474085.3475409","10.1145\/3474085"],"URL":"https:\/\/doi.org\/10.1145\/3474085.3475409","relation":{},"subject":[],"published":{"date-parts":[[2021,10,17]]},"assertion":[{"value":"2021-10-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}