{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T16:01:05Z","timestamp":1772208065930,"version":"3.50.1"},"reference-count":70,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,1,5]],"date-time":"2023-01-05T00:00:00Z","timestamp":1672876800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Italy-China Collaboration Project TALENT","award":["2018YFE0118400"],"award-info":[{"award-number":["2018YFE0118400"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61772494, 61620106009, 61872333, 61931008, 61836002, 61976069, 62022083"],"award-info":[{"award-number":["61772494, 61620106009, 61872333, 61931008, 61836002, 61976069, 62022083"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100004739","name":"Youth Innovation Promotion Association CAS","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100004739","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Fundamental Research Funds for Central Universities"},{"name":"China Postdoctoral Science Foundation Funded Project","award":["2021M691683"],"award-info":[{"award-number":["2021M691683"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2023,1,31]]},"abstract":"<jats:p>In real-world scenarios, it is common that a video contains multiple actors and their activities. Selectively localizing one specific actor and its action spatially and temporally via a language query becomes a vital and challenging task. 
Existing fully supervised methods require extensive elaborately annotated data and are sensitive to the class labels, which cannot satisfy real-world applications\u2019 needs. Thus, we introduce the task of weakly supervised actor-action video segmentation from a sentence query (AAVSS) in this work, where only the video-sentence pairs are provided. To the best of our knowledge, our work is the first to perform AAVSS under weakly supervised situations. However, this task is extremely challenging not only because the task aims to learn the complex interactions between two heterogeneous modalities but also because the task needs to learn fine-grained analysis of video content without pixel-level annotations. To overcome the challenges, we propose a two-stage network. The network first follows the sentence guidance to localize the candidate region and then performs segmentation to achieve selective segmentation. Specifically, a novel tracker-based clip-level multiple instance learning paradigm is proposed in this article to learn the matches between regions and sentences, which makes our two-stage network robust to the region proposal network. Furthermore, two intrinsic characteristics of the video, temporal consistency and motion information, are utilized in companion with the weak supervision to facilitate the region-query matching. 
Through extensive experiments, the proposed method achieves comparable performance to state-of-the-art fully supervised approaches on two large-scale benchmarks, including A2D Sentences and J-HMDB Sentences.<\/jats:p>","DOI":"10.1145\/3514250","type":"journal-article","created":{"date-parts":[[2022,7,18]],"date-time":"2022-07-18T12:19:57Z","timestamp":1658146797000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["Weakly Supervised Text-based Actor-Action Video Segmentation by Clip-level Multi-instance Learning"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2774-2875","authenticated-orcid":false,"given":"Weidong","family":"Chen","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, Key Lab of Big Data Mining and Knowledge Management, UCAS, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3954-2387","authenticated-orcid":false,"given":"Guorong","family":"Li","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Key Lab of Big Data Mining and Knowledge Management, UCAS, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7517-3868","authenticated-orcid":false,"given":"Xinfeng","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5931-0527","authenticated-orcid":false,"given":"Shuhui","family":"Wang","sequence":"additional","affiliation":[{"name":"Key Lab of Intelligent Information Processing, ICT, CAS, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1943-8219","authenticated-orcid":false,"given":"Liang","family":"Li","sequence":"additional","affiliation":[{"name":"Key Lab of Intelligent Information Processing, ICT, CAS, Beijing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7542-296X","authenticated-orcid":false,"given":"Qingming","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Key Lab of Big Data Mining and Knowledge Management, UCAS, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2023,1,5]]},"reference":[{"key":"e_1_3_1_2_2","volume-title":"Proceedings of the ICCV","author":"Hendricks Lisa Anne","year":"2017","unstructured":"Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the ICCV."},{"key":"e_1_3_1_3_2","volume-title":"Proceedings of the ACM MM","author":"Chen Junwen","year":"2020","unstructured":"Junwen Chen, Wentao Bao, and Yu Kong. 2020. Activity-driven weakly supervised spatio-temporal grounding from untrimmed videos. In Proceedings of the ACM MM."},{"key":"e_1_3_1_4_2","volume-title":"Proceedings of the CVPR","author":"Chen Jie","year":"2020","unstructured":"Jie Chen, Zhiheng Li, Jiebo Luo, and Chenliang Xu. 2020. Learning a weakly supervised video actor-action segmentation model with a wise selection. In Proceedings of the CVPR."},{"key":"e_1_3_1_5_2","volume-title":"Proceedings of the CVPR","author":"Chen Kan","year":"2018","unstructured":"Kan Chen, Jiyang Gao, and Ram Nevatia. 2018. Knowledge aided consistency for weakly supervised phrase grounding. In Proceedings of the CVPR."},{"issue":"10","key":"e_1_3_1_6_2","doi-asserted-by":"crossref","first-page":"2723","DOI":"10.1109\/TMM.2019.2959977","article-title":"Relation attention for temporal action localization","volume":"22","author":"Chen Peihao","year":"2019","unstructured":"Peihao Chen, Chuang Gan, Guangyao Shen, Wenbing Huang, Runhao Zeng, and Mingkui Tan. 2019. Relation attention for temporal action localization. IEEE Trans. Multimedia 22, 10 (2019), 2723\u20132733.","journal-title":"IEEE Trans. 
Multimedia"},{"key":"e_1_3_1_7_2","first-page":"4053","volume-title":"Proceedings of the ACM MM","author":"Chen Weidong","year":"2021","unstructured":"Weidong Chen, Guorong Li, Xinfeng Zhang, Hongyang Yu, Shuhui Wang, and Qingming Huang. 2021. Cascade cross-modal attention network for video actor and action segmentation from a sentence. In Proceedings of the ACM MM. 4053\u20134062."},{"key":"e_1_3_1_8_2","volume-title":"Proceedings of the ECCV","author":"Chen Yangyu","year":"2018","unstructured":"Yangyu Chen, Shuhui Wang, Weigang Zhang, and Qingming Huang. 2018. Less is more: Picking informative frames for video captioning. In Proceedings of the ECCV."},{"key":"e_1_3_1_9_2","volume-title":"Proceedings of the ACL","author":"Chen Zhenfang","year":"2019","unstructured":"Zhenfang Chen, Lin Ma, Wenhan Luo, and Kwan-Yee Kenneth Wong. 2019. Weakly supervised spatio-temporally grounding natural sentence in video. In Proceedings of the ACL."},{"issue":"3","key":"e_1_3_1_10_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3441628","article-title":"Part-wise spatio-temporal attention driven CNN-based 3D human action recognition","volume":"17","author":"Dhiman Chhavi","year":"2021","unstructured":"Chhavi Dhiman, Dinesh Kumar Vishwakarma, and Paras Agarwal. 2021. Part-wise spatio-temporal attention driven CNN-based 3D human action recognition. ACM Trans. Multimedia Comput. Commun. Appl. 17, 3 (2021), 1\u201324.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_11_2","volume-title":"Proceedings of the CVPR","author":"Fan Junsong","year":"2020","unstructured":"Junsong Fan, Zhaoxiang Zhang, Chunfeng Song, and Tieniu Tan. 2020. Learning integral objects with intra-class discriminator for weakly supervised semantic segmentation. 
In Proceedings of the CVPR."},{"key":"e_1_3_1_12_2","volume-title":"Proceedings of the AAAI","author":"Fan Junsong","year":"2020","unstructured":"Junsong Fan, Zhaoxiang Zhang, Tieniu Tan, Chunfeng Song, and Jun Xiao. 2020. Cian: Cross-image affinity net for weakly supervised semantic segmentation. In Proceedings of the AAAI."},{"key":"e_1_3_1_13_2","volume-title":"Proceedings of the ECCV","author":"Fan Ruochen","year":"2018","unstructured":"Ruochen Fan, Qibin Hou, Ming-Ming Cheng, Gang Yu, Ralph R Martin, and Shi-Min Hu. 2018. Associating inter-image salient instances for weakly supervised semantic segmentation. In Proceedings of the ECCV."},{"key":"e_1_3_1_14_2","volume-title":"Proceedings of the ICCV","author":"Gao Jiyang","year":"2017","unstructured":"Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. Tall: Temporal activity localization via language query. In Proceedings of the ICCV."},{"key":"e_1_3_1_15_2","volume-title":"Proceedings of the CVPR","author":"Gavrilyuk Kirill","year":"2018","unstructured":"Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees G. M. Snoek. 2018. Actor and action video segmentation from a sentence. In Proceedings of the CVPR."},{"key":"e_1_3_1_16_2","volume-title":"Proceedings of the NeurIPS","author":"Han Tengda","year":"2020","unstructured":"Tengda Han, Weidi Xie, and Andrew Zisserman. 2020. Self-supervised co-training for video representation learning. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_1_18_2","volume-title":"Proceedings of the NeurIPS","author":"Hou Qibin","year":"2018","unstructured":"Qibin Hou, Peng-Tao Jiang, Yunchao Wei, and Ming-Ming Cheng. 2018. Self-erasing network for integral object attention. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_19_2","volume-title":"Proceedings of the ECCV","author":"Hu Ronghang","year":"2016","unstructured":"Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. 2016. 
Segmentation from natural language expressions. In Proceedings of the ECCV."},{"key":"e_1_3_1_20_2","volume-title":"Proceedings of the CVPR","author":"Huang De-An","year":"2018","unstructured":"De-An Huang, Shyamal Buch, Lucio Dery, Animesh Garg, Li Fei-Fei, and Juan Carlos Niebles. 2018. Finding \u201cit\u201d: Weakly supervised reference-aware visual grounding in instructional videos. In Proceedings of the CVPR."},{"key":"e_1_3_1_21_2","volume-title":"Proceedings of the CVPR","author":"Huang Zilong","year":"2018","unstructured":"Zilong Huang, Xinggang Wang, Jiasi Wang, Wenyu Liu, and Jingdong Wang. 2018. Weakly supervised semantic segmentation network with deep seeded region growing. In Proceedings of the CVPR."},{"issue":"2","key":"e_1_3_1_22_2","first-page":"1","article-title":"A multi-instance multi-label dual learning approach for video captioning","volume":"17","author":"Ji Wanting","year":"2021","unstructured":"Wanting Ji and Ruili Wang. 2021. A multi-instance multi-label dual learning approach for video captioning. ACM Trans. Multimedia Comput. Commun. Appl. 17, 2s (2021), 1\u201318.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_23_2","volume-title":"Proceedings of the CVPR","author":"Khoreva Anna","year":"2017","unstructured":"Anna Khoreva, Rodrigo Benenson, Jan Hosang, Matthias Hein, and Bernt Schiele. 2017. Simple does it: Weakly supervised instance and semantic segmentation. In Proceedings of the CVPR."},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_1_25_2","volume-title":"Proceedings of the CVPR","author":"Lee Jungbeom","year":"2019","unstructured":"Jungbeom Lee, Eunji Kim, Sungmin Lee, Jangho Lee, and Sungroh Yoon. 2019. Ficklenet: Weakly and semi-supervised semantic image segmentation using stochastic inference. 
In Proceedings of the CVPR."},{"key":"e_1_3_1_26_2","volume-title":"Proceedings of the CVPR","author":"Li Shuang","year":"2017","unstructured":"Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, and Xiaogang Wang. 2017. Person search with natural language description. In Proceedings of the CVPR."},{"key":"e_1_3_1_27_2","volume-title":"Proceedings of the AAAI","author":"Li Xueyi","year":"2021","unstructured":"Xueyi Li, Tianfei Zhou, Jianwu Li, Yi Zhou, and Zhaoxiang Zhang. 2021. Group-wise semantic mining for weakly supervised semantic segmentation. In Proceedings of the AAAI."},{"key":"e_1_3_1_28_2","volume-title":"Proceedings of the CVPR","author":"Li Zhenyang","year":"2017","unstructured":"Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders. 2017. Tracking by natural language specification. In Proceedings of the CVPR."},{"key":"e_1_3_1_29_2","volume-title":"Proceedings of the CVPR","author":"Lin Di","year":"2016","unstructured":"Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. 2016. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the CVPR."},{"key":"e_1_3_1_30_2","volume-title":"Proceedings of the ICCV","author":"Liu Xuejing","year":"2019","unstructured":"Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Dechao Meng, and Qingming Huang. 2019. Adaptive reconstruction network for weakly supervised referring expression grounding. In Proceedings of the ICCV."},{"key":"e_1_3_1_31_2","volume-title":"Proceedings of the ACM MM","author":"Liu Xuejing","year":"2019","unstructured":"Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Li Su, and Qingming Huang. 2019. Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding. 
In Proceedings of the ACM MM."},{"issue":"3","key":"e_1_3_1_32_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3441577","article-title":"Single-shot semantic matching network for moment localization in videos","volume":"17","author":"Liu Xinfang","year":"2021","unstructured":"Xinfang Liu, Xiushan Nie, Junya Teng, Li Lian, and Yilong Yin. 2021. Single-shot semantic matching network for moment localization in videos. ACM Trans. Multimedia Comput. Commun. Appl. 17, 3 (2021), 1\u201314.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_33_2","volume-title":"Proceedings of the CVPR","author":"Liu Yongfei","year":"2021","unstructured":"Yongfei Liu, Bo Wan, Lin Ma, and Xuming He. 2021. Relation-aware instance refinement for weakly supervised visual grounding. In Proceedings of the CVPR."},{"key":"e_1_3_1_34_2","volume-title":"Proceedings of the CVPR","author":"Lu Xiankai","year":"2020","unstructured":"Xiankai Lu, Wenguan Wang, Jianbing Shen, Yu-Wing Tai, David J. Crandall, and Steven C. H. Hoi. 2020. Learning video object segmentation from unlabeled videos. In Proceedings of the CVPR."},{"key":"e_1_3_1_35_2","volume-title":"Proceedings of the CVPR","author":"Luo Ruotian","year":"2017","unstructured":"Ruotian Luo and Gregory Shakhnarovich. 2017. Comprehension-guided referring expressions. In Proceedings of the CVPR."},{"key":"e_1_3_1_36_2","volume-title":"Proceedings of the ACL (System Demonstrations)","author":"Manning Christopher D.","year":"2014","unstructured":"Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The stanford CoreNLP natural language processing toolkit. In Proceedings of the ACL (System Demonstrations)."},{"key":"e_1_3_1_37_2","volume-title":"Proceedings of the CVPR","author":"Mao Junhua","year":"2016","unstructured":"Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. 
Generation and comprehension of unambiguous object descriptions. In Proceedings of the CVPR."},{"key":"e_1_3_1_38_2","volume-title":"Proceedings of the CVPR","author":"McIntosh Bruce","year":"2020","unstructured":"Bruce McIntosh, Kevin Duarte, Yogesh S. Rawat, and Mubarak Shah. 2020. Visual-textual capsule routing for text-based video segmentation. In Proceedings of the CVPR."},{"key":"e_1_3_1_39_2","volume-title":"Proceedings of the IJCAI","author":"Ning Ke","year":"2020","unstructured":"Ke Ning, Lingxi Xie, Fei Wu, and Qi Tian. 2020. Polar relative positional encoding for video-language segmentation. In Proceedings of the IJCAI."},{"key":"e_1_3_1_40_2","first-page":"5152","volume-title":"Proceedings of the ICML","author":"Piergiovanni AJ","year":"2019","unstructured":"AJ Piergiovanni and Michael Ryoo. 2019. Temporal gaussian mixture layer for videos. In Proceedings of the ICML. PMLR, 5152\u20135161."},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/78.650093"},{"key":"e_1_3_1_42_2","volume-title":"Proceedings of the ECCV","author":"Shi Hengcan","year":"2018","unstructured":"Hengcan Shi, Hongliang Li, Fanman Meng, and Qingbo Wu. 2018. Key-word-aware network for referring expression image segmentation. In Proceedings of the ECCV."},{"key":"e_1_3_1_43_2","volume-title":"Proceedings of the CVPR","author":"Shi Jing","year":"2019","unstructured":"Jing Shi, Jia Xu, Boqing Gong, and Chenliang Xu. 2019. Not all frames are equal: Weakly supervised video grounding with contextual similarity and visual clustering losses. In Proceedings of the CVPR."},{"key":"e_1_3_1_44_2","volume-title":"Proceedings of the ICLR","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. 
In Proceedings of the ICLR."},{"key":"e_1_3_1_45_2","volume-title":"Proceedings of the CVPR","author":"Song Chunfeng","year":"2019","unstructured":"Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. 2019. Box-driven class-wise region masking and filling rate guided loss for weakly supervised semantic segmentation. In Proceedings of the CVPR."},{"key":"e_1_3_1_46_2","volume-title":"Proceedings of the ECCV","author":"Sun Guolei","year":"2020","unstructured":"Guolei Sun, Wenguan Wang, Jifeng Dai, and Luc Van Gool. 2020. Mining cross-image semantics for weakly supervised semantic segmentation. In Proceedings of the ECCV."},{"key":"e_1_3_1_47_2","article-title":"Discriminative triad matching and reconstruction for weakly referring expression grounding","author":"Sun Mingjie","year":"2021","unstructured":"Mingjie Sun, Jimin Xiao, Enggee Lim, Si Liu, and John Yannis Goulermas. 2021. Discriminative triad matching and reconstruction for weakly referring expression grounding. IEEE Trans. PAMI 43, 11 (2021), 4189\u20134195.","journal-title":"IEEE Trans. PAMI"},{"key":"e_1_3_1_48_2","volume-title":"Proceedings of the ECCV","author":"Tang Meng","year":"2018","unstructured":"Meng Tang, Federico Perazzi, Abdelaziz Djelouah, Ismail Ben Ayed, Christopher Schroers, and Yuri Boykov. 2018. On regularized losses for weakly supervised CNN segmentation. In Proceedings of the ECCV."},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3303083"},{"key":"e_1_3_1_50_2","volume-title":"Proceedings of the CVPR","author":"Vernaza Paul","year":"2017","unstructured":"Paul Vernaza and Manmohan Chandraker. 2017. Learning random-walk label propagation for weakly supervised semantic segmentation. In Proceedings of the CVPR."},{"key":"e_1_3_1_51_2","volume-title":"Proceedings of the AAAI","author":"Wang Hao","year":"2020","unstructured":"Hao Wang, Cheng Deng, Fan Ma, and Yi Yang. 2020. 
Context modulated dynamic networks for actor and action video segmentation with language queries. In Proceedings of the AAAI."},{"key":"e_1_3_1_52_2","volume-title":"Proceedings of the ICCV","author":"Wang Hao","year":"2019","unstructured":"Hao Wang, Cheng Deng, Junchi Yan, and Dacheng Tao. 2019. Asymmetric cross-guided attention network for actor and action video segmentation from natural language query. In Proceedings of the ICCV."},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3419842"},{"key":"e_1_3_1_54_2","volume-title":"CVPR","author":"Xu Chenliang","year":"2016","unstructured":"Chenliang Xu and Jason J. Corso. 2016. Actor-action semantic segmentation with grouping process models. In CVPR."},{"key":"e_1_3_1_55_2","volume-title":"Proceedings of the CVPR","author":"Xu Chenliang","year":"2015","unstructured":"Chenliang Xu, Shao-Hang Hsieh, Caiming Xiong, and Jason J. Corso. 2015. Can humans fly? Action understanding with multiple classes of actors. In Proceedings of the CVPR."},{"key":"e_1_3_1_56_2","volume-title":"Proceedings of the ICML","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the ICML."},{"key":"e_1_3_1_57_2","first-page":"10156","volume-title":"Proceedings of the IEEE\/CVF CVPR","author":"Xu Mengmeng","year":"2020","unstructured":"Mengmeng Xu, Chen Zhao, David S. Rojas, Ali Thabet, and Bernard Ghanem. 2020. G-tad: Sub-graph localization for temporal action detection. In Proceedings of the IEEE\/CVF CVPR. 10156\u201310165."},{"key":"e_1_3_1_58_2","volume-title":"Proceedings of the ICCV","author":"Yamaguchi Masataka","year":"2017","unstructured":"Masataka Yamaguchi, Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. 2017. Spatio-temporal person retrieval via natural language queries. 
In Proceedings of the ICCV."},{"key":"e_1_3_1_59_2","volume-title":"Proceedings of the CVPR","author":"Yan Yan","year":"2017","unstructured":"Yan Yan, Chenliang Xu, Dawen Cai, and Jason J Corso. 2017. Weakly supervised actor-action segmentation via robust multi-task ranking. In Proceedings of the CVPR."},{"key":"e_1_3_1_60_2","volume-title":"Proceedings of the ACM MM","author":"Yang Xun","year":"2020","unstructured":"Xun Yang, Xueliang Liu, Meng Jian, Xinjian Gao, and Meng Wang. 2020. Weakly supervised video object grounding by exploring spatio-temporal contexts. In Proceedings of the ACM MM."},{"key":"e_1_3_1_61_2","volume-title":"Proceedings of the CVPR","author":"Ye Linwei","year":"2019","unstructured":"Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. 2019. Cross-modal self-attention network for referring image segmentation. In Proceedings of the CVPR."},{"key":"e_1_3_1_62_2","article-title":"Multimodal transformer with multi-view visual representation for image captioning","author":"Yu Jun","year":"2019","unstructured":"Jun Yu, Jing Li, Zhou Yu, and Qingming Huang. 2019. Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans. Circ. Syst. Video Technol. 30, 12 (2019), 4467\u20134480.","journal-title":"IEEE Trans. Circ. Syst. Video Technol."},{"key":"e_1_3_1_63_2","volume-title":"Proceedings of the CVPR","author":"Yu Licheng","year":"2018","unstructured":"Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. 2018. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the CVPR."},{"key":"e_1_3_1_64_2","volume-title":"Pattern Recognition","author":"Zach Christopher","year":"2007","unstructured":"Christopher Zach, Thomas Pock, and Horst Bischof. 2007. A duality based approach for realtime TV-L1 optical flow. 
In Pattern Recognition."},{"key":"e_1_3_1_65_2","first-page":"7094","volume-title":"Proceedings of the IEEE\/CVF ICCV","author":"Zeng Runhao","year":"2019","unstructured":"Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou Huang, and Chuang Gan. 2019. Graph convolutional networks for temporal action localization. In Proceedings of the IEEE\/CVF ICCV. 7094\u20137103."},{"key":"e_1_3_1_66_2","volume-title":"Proceedings of the CVPR","author":"Zeng Yu","year":"2019","unstructured":"Yu Zeng, Yunzhi Zhuge, Huchuan Lu, Lihe Zhang, Mingyang Qian, and Yizhou Yu. 2019. Multi-source weak supervision for saliency detection. In Proceedings of the CVPR."},{"key":"e_1_3_1_67_2","volume-title":"Proceedings of the AAAI","author":"Zhang Bingfeng","year":"2020","unstructured":"Bingfeng Zhang, Jimin Xiao, Yunchao Wei, Mingjie Sun, and Kaizhu Huang. 2020. Reliability does matter: An end-to-end weakly supervised semantic segmentation approach. In Proceedings of the AAAI."},{"key":"e_1_3_1_68_2","volume-title":"Proceedings of the CVPR","author":"Zhang Hanwang","year":"2018","unstructured":"Hanwang Zhang, Yulei Niu, and Shih-Fu Chang. 2018. Grounding referring expressions in images by variational context. In Proceedings of the CVPR."},{"key":"e_1_3_1_69_2","volume-title":"Proceedings of the BMVC","author":"Zhou Luowei","year":"2018","unstructured":"Luowei Zhou, Nathan Louis, and Jason J. Corso. 2018. Weakly supervised video object grounding from text by loss weighting and object interaction. In Proceedings of the BMVC."},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1145\/3361845"},{"key":"e_1_3_1_71_2","volume-title":"Proceedings of the ICCV","author":"Zhu Xizhou","year":"2017","unstructured":"Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. 2017. Flow-guided feature aggregation for video object detection. 
In Proceedings of the ICCV."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3514250","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3514250","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:10:14Z","timestamp":1750183814000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3514250"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,5]]},"references-count":70,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,1,31]]}},"alternative-id":["10.1145\/3514250"],"URL":"https:\/\/doi.org\/10.1145\/3514250","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,1,5]]},"assertion":[{"value":"2021-11-05","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-01-28","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-01-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}