{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,14]],"date-time":"2026-04-14T21:06:33Z","timestamp":1776200793715,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":61,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3547761","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:01Z","timestamp":1665416581000},"page":"4416-4425","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":45,"title":["Multi-Attention Network for Compressed Video Referring Object Segmentation"],"prefix":"10.1145","author":[{"given":"Weidong","family":"Chen","sequence":"first","affiliation":[{"name":"University of Chinese Academy of Science, Beijing, China"}]},{"given":"Dexiang","family":"Hong","sequence":"additional","affiliation":[{"name":"University of Chinese Academy of Science, Beijing, China"}]},{"given":"Yuankai","family":"Qi","sequence":"additional","affiliation":[{"name":"Australian Institute for Machine Learning, The University of Adelaide, Adelaide, Australia"}]},{"given":"Zhenjun","family":"Han","sequence":"additional","affiliation":[{"name":"University of Chinese Academy of Science, Beijing, China"}]},{"given":"Shuhui","family":"Wang","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China"}]},{"given":"Laiyun","family":"Qing","sequence":"additional","affiliation":[{"name":"University of Chinese Academy of Science, Beijing, China"}]},{"given":"Qingming","family":"Huang","sequence":"additional","affiliation":[{"name":"University of Chinese Academy of Science, Beijing, China"}]},{"given":"Guorong","family":"Li","sequence":"additional","affiliation":[{"name":"University of Chinese Academy of Science, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00493"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413840"},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475534"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475251"},{"key":"e_1_3_2_2_6_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00326"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00680"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00624"},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00161"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58545-7_32"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_13_1","volume-title":"MV2Flow: Learning motion representation for fast compressed video action recognition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)","author":"Hu Hezhen","year":"2020","unstructured":"Hezhen Hu , Wengang Zhou , Xingze Li , Ning Yan , and Houqiang Li. 2020. MV2Flow: Learning motion representation for fast compressed video action recognition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) ( 2020 ). Hezhen Hu, Wengang Zhou, Xingze Li, Ning Yan, and Houqiang Li. 2020. MV2Flow: Learning motion representation for fast compressed video action recognition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) (2020)."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00417"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00180"},{"key":"e_1_3_2_2_16_1","unstructured":"Jin-Hwa Kim Jaehyun Jun and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Advances in neural information processing systems.  Jin-Hwa Kim Jaehyun Jun and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Advances in neural information processing systems."},{"key":"e_1_3_2_2_17_1","volume-title":"You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation. In AAAI Conference on Artificial Intelligence.","author":"Li Dezhuang","year":"2022","unstructured":"Dezhuang Li , Ruoqi Li , Lijun Wang , Yifan Wang , Jinqing Qi , Lu Zhang , Ting Liu , Qingquan Xu , and Huchuan Lu . 2022 . You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation. In AAAI Conference on Artificial Intelligence. Dezhuang Li, Ruoqi Li, Lijun Wang, Yifan Wang, Jinqing Qi, Lu Zhang, Ting Liu, Qingquan Xu, and Huchuan Lu. 2022. You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation. In AAAI Conference on Artificial Intelligence."},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413943"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413641"},{"key":"e_1_3_2_2_20_1","volume-title":"Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557","author":"Li Liunian Harold","year":"2019","unstructured":"Liunian Harold Li , Mark Yatskar , Da Yin , Cho-Jui Hsieh , and Kai-Wei Chang . 2019 . Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019). Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)."},{"key":"e_1_3_2_2_21_1","volume-title":"Referring transformer: A one-step approach to multi-task visual grounding. Advances in neural information processing systems","author":"Li Muchen","year":"2021","unstructured":"Muchen Li and Leonid Sigal . 2021. Referring transformer: A one-step approach to multi-task visual grounding. Advances in neural information processing systems ( 2021 ). Muchen Li and Leonid Sigal. 2021. Referring transformer: A one-step approach to multi-task visual grounding. Advances in neural information processing systems (2021)."},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413924"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3079993"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_2_2_25_1","volume-title":"Video swin transformer. arXiv preprint arXiv:2106.13230","author":"Liu Ze","year":"2021","unstructured":"Ze Liu , Jia Ning , Yue Cao , Yixuan Wei , Zheng Zhang , Stephen Lin , and Han Hu. 2021. Video swin transformer. arXiv preprint arXiv:2106.13230 ( 2021 ). Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2021. Video swin transformer. arXiv preprint arXiv:2106.13230 (2021)."},{"key":"e_1_3_2_2_26_1","unstructured":"Jiasen Lu Dhruv Batra Devi Parikh and Stefan Lee. 2019. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in neural information processing systems.  Jiasen Lu Dhruv Batra Devi Parikh and Stefan Lee. 2019. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in neural information processing systems."},{"key":"e_1_3_2_2_27_1","unstructured":"Jiasen Lu Jianwei Yang Dhruv Batra and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in neural information processing systems.  Jiasen Lu Jianwei Yang Dhruv Batra and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in neural information processing systems."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12240"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00996"},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"crossref","unstructured":"Ke Ning Lingxi Xie Fei Wu and Qi Tian. 2020. Polar Relative Positional Encoding for Video-Language Segmentation. In IJCAI.  Ke Ning Lingxi Xie Fei Wu and Qi Tian. 2020. Polar Relative Positional Encoding for Video-Language Segmentation. In IJCAI.","DOI":"10.24963\/ijcai.2020\/132"},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413954"},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3059923"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413618"},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3414053"},{"key":"e_1_3_2_2_35_1","volume-title":"video coding for next generation multimedia to Black Holes","author":"Richardson Iain E. G","unstructured":"Iain E. G Richardson . 2003. H.264 and MPEG-4 video compression : video coding for next generation multimedia to Black Holes . Chichester, Hoboken, NJ : Wiley . Iain E. G Richardson. 2003. H.264 and MPEG-4 video compression : video coding for next generation multimedia to Black Holes. Chichester, Hoboken, NJ: Wiley."},{"key":"e_1_3_2_2_36_1","volume-title":"U-net: Convolutional networks for biomedical image segmentation","author":"Ronneberger Olaf","year":"2015","unstructured":"Olaf Ronneberger , Philipp Fischer , and Thomas Brox . 2015 . U-net: Convolutional networks for biomedical image segmentation . In MICCAI. Springer . Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In MICCAI. Springer."},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58555-6_13"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00136"},{"key":"e_1_3_2_2_39_1","unstructured":"Weijie Su Xizhou Zhu Yue Cao Bin Li Lewei Lu Furu Wei and Jifeng Dai. 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR.  Weijie Su Xizhou Zhu Yue Cao Bin Li Lewei Lu Furu Wei and Jifeng Dai. 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR."},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00756"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"crossref","unstructured":"Wei Suo Mengyang Sun Peng Wang and Qi Wu. 2021. Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention. In IJCAI.  Wei Suo Mengyang Sun Peng Wang and Qi Wu. 2021. Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention. In IJCAI.","DOI":"10.24963\/ijcai.2021\/143"},{"key":"e_1_3_2_2_42_1","volume-title":"J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov.","author":"Hubert Tsai Yao-Hung","year":"2019","unstructured":"Yao-Hung Hubert Tsai , Shaojie Bai , Paul Pu Liang , J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019 . Multimodal Transformer for Unaligned Multimodal Language Sequences. In ACL. Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal Transformer for Unaligned Multimodal Language Sequences. In ACL."},{"key":"e_1_3_2_2_43_1","volume-title":"Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries. In AAAI Conference on Artificial Intelligence.","author":"Wang Hao","year":"2020","unstructured":"Hao Wang , Cheng Deng , Fan Ma , and Yi Yang . 2020 . Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries. In AAAI Conference on Artificial Intelligence. Hao Wang, Cheng Deng, Fan Ma, and Yi Yang. 2020. Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries. In AAAI Conference on Artificial Intelligence."},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00404"},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240535"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00813"},{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00863"},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00631"},{"key":"e_1_3_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.336"},{"key":"e_1_3_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298839"},{"key":"e_1_3_2_2_51_1","unstructured":"Kelvin Xu Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron Courville Ruslan Salakhudinov Rich Zemel and Yoshua Bengio. 2015. Show attend and tell: Neural image caption generation with visual attention. In ICML.  Kelvin Xu Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron Courville Ruslan Salakhudinov Rich Zemel and Yoshua Bengio. 2015. Show attend and tell: Neural image caption generation with visual attention. In ICML."},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.115"},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.10"},{"key":"e_1_3_2_2_54_1","unstructured":"Zhao Yang Yansong Tang Luca Bertinetto Hengshuang Zhao and Philip HS Torr. 2021. Hierarchical interaction network for video object segmentation from referring expressions. In BMVC.  Zhao Yang Yansong Tang Luca Bertinetto Hengshuang Zhao and Philip HS Torr. 2021. Hierarchical interaction network for video object segmentation from referring expressions. In BMVC."},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01075"},{"key":"e_1_3_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3054384"},{"key":"e_1_3_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2019.2947482"},{"key":"e_1_3_2_2_58_1","doi-asserted-by":"crossref","unstructured":"Youngjae Yu Sangho Lee Gunhee Kim and Yale Song. 2020. Self-supervised learning of compressed video representations. In ICLR.  Youngjae Yu Sangho Lee Gunhee Kim and Yale Song. 2020. Self-supervised learning of compressed video representations. In ICLR.","DOI":"10.1109\/ICPR48806.2021.9412942"},{"key":"e_1_3_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.297"},{"key":"e_1_3_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2791180"},{"key":"e_1_3_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123292"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547761","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3547761","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:30:40Z","timestamp":1750188640000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547761"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":61,"alternative-id":["10.1145\/3503161.3547761","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3547761","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}