{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,11]],"date-time":"2026-02-11T02:42:27Z","timestamp":1770777747070,"version":"3.50.0"},"reference-count":142,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,2,6]],"date-time":"2023-02-06T00:00:00Z","timestamp":1675641600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2020AAA0106300"],"award-info":[{"award-number":["2020AAA0106300"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62102222"],"award-info":[{"award-number":["62102222"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2023,5,31]]},"abstract":"<jats:p>Temporal sentence grounding in videos\u00a0(TSGV), which aims at localizing one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attention in the research community over the past few years. Different from the task of temporal action localization, TSGV is more flexible since it can locate complicated activities via natural language, without restrictions from predefined action categories. Meanwhile, TSGV is more challenging since it requires both textual and visual understanding for semantic alignment between two modalities\u00a0(i.e., text and video). 
In this survey, we give a comprehensive overview of TSGV, which (i) summarizes the taxonomy of existing methods, (ii) provides a detailed description of the evaluation protocols\u00a0(i.e., datasets and metrics) used in TSGV, and (iii) discusses in depth potential problems in current benchmarking designs and promising research directions for further investigation. To the best of our knowledge, this is the first systematic survey on temporal sentence grounding. More specifically, we first discuss existing TSGV approaches by grouping them into four categories, i.e., two-stage methods, single-stage methods, reinforcement learning-based methods, and weakly supervised methods. Then we present the benchmark datasets and evaluation metrics to assess current research progress. Finally, we discuss some limitations in TSGV by pointing out potential problems left unresolved by the current evaluation protocols, which may push forward more cutting-edge research in TSGV. In addition, we share our insights on several promising directions, including four typical tasks with new and practical settings based on TSGV.<\/jats:p>","DOI":"10.1145\/3532626","type":"journal-article","created":{"date-parts":[[2022,5,20]],"date-time":"2022-05-20T12:28:13Z","timestamp":1653049693000},"page":"1-33","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":33,"title":["A Survey on Temporal Sentence Grounding in Videos"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5382-6699","authenticated-orcid":false,"given":"Xiaohan","family":"Lan","sequence":"first","affiliation":[{"name":"Tsinghua Shenzhen International Graduate School, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8701-7689","authenticated-orcid":false,"given":"Yitian","family":"Yuan","sequence":"additional","affiliation":[{"name":"Meituan, Chaoyang District, Beijing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0351-2939","authenticated-orcid":false,"given":"Xin","family":"Wang","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5462-6178","authenticated-orcid":false,"given":"Zhi","family":"Wang","sequence":"additional","affiliation":[{"name":"Tsinghua Shenzhen International Graduate School, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2236-9290","authenticated-orcid":false,"given":"Wenwu","family":"Zhu","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2023,2,6]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"6077","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Anderson Peter","year":"2018","unstructured":"Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6077\u20136086."},{"key":"e_1_3_1_3_2","first-page":"2425","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Antol Stanislaw","year":"2015","unstructured":"Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 2425\u20132433."},{"key":"e_1_3_1_4_2","first-page":"920","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Bao Peijun","year":"2021","unstructured":"Peijun Bao, Qian Zheng, and Yadong Mu. 2021. Dense events grounding in video. In Proceedings of the AAAI Conference on Artificial Intelligence. 
920\u2013928."},{"key":"e_1_3_1_5_2","volume-title":"Proceedings of the British Machine Vision Conference 2017","author":"Buch Shyamal","year":"2017","unstructured":"Shyamal Buch, Victor Escorcia, Bernard Ghanem, Li Fei-Fei, and Juan Carlos Niebles. 2017. End-to-end, single-stream temporal action detection in untrimmed videos. In Proceedings of the British Machine Vision Conference 2017."},{"key":"e_1_3_1_6_2","first-page":"961","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Heilbron Fabian Caba","year":"2015","unstructured":"Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 961\u2013970."},{"key":"e_1_3_1_7_2","volume-title":"Proceedings of the 28th ACM International Conference on Multimedia","author":"Cao Da","year":"2020","unstructured":"Da Cao, Yawen Zeng, Meng Liu, Xiangnan He, Meng Wang, and Zheng Qin. 2020. STRONG: Spatio-temporal reinforcement learning for cross-modal video moment localization. In Proceedings of the 28th ACM International Conference on Multimedia."},{"key":"e_1_3_1_8_2","volume-title":"Proceedings of the 28th ACM International Conference on Multimedia","author":"Cao Da","year":"2020","unstructured":"Da Cao, Yawen Zeng, Xiaochi Wei, Liqiang Nie, Richang Hong, and Zheng Qin. 2020. Adversarial video moment retrieval by jointly modeling ranking and localization. In Proceedings of the 28th ACM International Conference on Multimedia."},{"key":"e_1_3_1_9_2","first-page":"1","volume-title":"Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement","author":"Cha Meeyoung","year":"2007","unstructured":"Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, and Sue Moon. 2007. I tube, you tube, everybody tubes: Analyzing the world\u2019s largest user generated content video system. 
In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement. 1\u201314."},{"key":"e_1_3_1_10_2","first-page":"119","volume-title":"Proceedings of the International Symposium on Neural Networks","author":"Chen Cheng","year":"2020","unstructured":"Cheng Chen and Xiaodong Gu. 2020. Semantic modulation based residual network for temporal language queries grounding in video. In Proceedings of the International Symposium on Neural Networks. Springer, 119\u2013129."},{"key":"e_1_3_1_11_2","first-page":"1870","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics","author":"Chen Danqi","year":"2017","unstructured":"Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 1870\u20131879."},{"key":"e_1_3_1_12_2","doi-asserted-by":"crossref","first-page":"162","DOI":"10.18653\/v1\/D18-1015","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Chen Jingyuan","year":"2018","unstructured":"Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. 2018. Temporally grounding natural sentence in video. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 162\u2013171."},{"key":"e_1_3_1_13_2","first-page":"8175","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Chen Jingyuan","year":"2019","unstructured":"Jingyuan Chen, Lin Ma, Xinpeng Chen, Zequn Jie, and Jiebo Luo. 2019. Localizing natural language in videos. In Proceedings of the AAAI Conference on Artificial Intelligence. 
8175\u20138182."},{"key":"e_1_3_1_14_2","first-page":"10551","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Chen Long","year":"2020","unstructured":"Long Chen, Chujie Lu, Siliang Tang, Jun Xiao, Dong Zhang, Chilie Tan, and Xiaolin Li. 2020. Rethinking the bottom-up framework for query-based video localization. In Proceedings of the AAAI Conference on Artificial Intelligence. 10551\u201310558."},{"key":"e_1_3_1_15_2","first-page":"333","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Chen Shaoxiang","year":"2020","unstructured":"Shaoxiang Chen, Wenhao Jiang, Wei Liu, and Yu-Gang Jiang. 2020. Learning modality interaction for temporal sentence localization and event captioning in videos. In Proceedings of the European Conference on Computer Vision. Springer, 333\u2013351."},{"key":"e_1_3_1_16_2","first-page":"8199","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Chen Shaoxiang","year":"2019","unstructured":"Shaoxiang Chen and Yu-Gang Jiang. 2019. Semantic proposal for activity localization in videos via sentence query. In Proceedings of the AAAI Conference on Artificial Intelligence. 8199\u20138206."},{"key":"e_1_3_1_17_2","first-page":"601","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Chen Shaoxiang","year":"2020","unstructured":"Shaoxiang Chen and Yu-Gang Jiang. 2020. Hierarchical visual-textual graph for temporal activity localization via language. In Proceedings of the European Conference on Computer Vision. Springer, 601\u2013618."},{"key":"e_1_3_1_18_2","first-page":"8425","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Chen Shaoxiang","year":"2021","unstructured":"Shaoxiang Chen and Yu-Gang Jiang. 2021. Towards bridging event captioner and sentence localizer for weakly supervised dense event captioning. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 8425\u20138435."},{"key":"e_1_3_1_19_2","article-title":"Microsoft coco captions: Data collection and evaluation server","author":"Chen Xinlei","year":"2015","unstructured":"Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Doll\u00e1r, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv:1504.00325. Retrieved from https:\/\/arxiv.org\/abs\/1504.00325.","journal-title":"arXiv:1504.00325."},{"key":"e_1_3_1_20_2","first-page":"104","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Chen Yen-Chun","year":"2020","unstructured":"Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision. Springer, 104\u2013120."},{"key":"e_1_3_1_21_2","unstructured":"Zhenfang Chen Lin Ma Wenhan Luo Peng Tang and Kwan-Yee K. Wong. 2020. Look closer to ground better: Weakly-supervised temporal grounding of sentence in video. arXiv:2001.09308. Retrieved from https:\/\/arxiv.org\/abs\/2001.09308."},{"key":"e_1_3_1_22_2","doi-asserted-by":"crossref","first-page":"1884","DOI":"10.18653\/v1\/P19-1183","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Chen Zhenfang","year":"2019","unstructured":"Zhenfang Chen, Lin Ma, Wenhan Luo, and Kwan-Yee Kenneth Wong. 2019. Weakly-supervised spatio-temporally grounding natural sentence in video. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 
1884\u20131894."},{"issue":"8","key":"e_1_3_1_23_2","doi-asserted-by":"crossref","first-page":"2439","DOI":"10.1109\/78.852023","article-title":"Partial encryption of compressed images and videos","volume":"48","author":"Cheng Howard","year":"2000","unstructured":"Howard Cheng and Xiaobo Li. 2000. Partial encryption of compressed images and videos. IEEE Transactions on Signal Processing 48, 8 (2000), 2439\u20132451.","journal-title":"IEEE Transactions on Signal Processing"},{"key":"e_1_3_1_24_2","first-page":"3584","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Chu Wen-Sheng","year":"2015","unstructured":"Wen-Sheng Chu, Yale Song, and Alejandro Jaimes. 2015. Video co-summarization: Video summarization by visual co-occurrence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3584\u20133592."},{"key":"e_1_3_1_25_2","doi-asserted-by":"crossref","first-page":"797","DOI":"10.1145\/3474085.3475251","volume-title":"Proceedings of the 29th ACM International Conference on Multimedia","author":"Cui Yuhao","year":"2021","unstructured":"Yuhao Cui, Zhou Yu, Chunqi Wang, Zhongzhou Zhao, Ji Zhang, Meng Wang, and Jun Yu. 2021. ROSITA: Enhancing vision-and-language semantic alignments via cross-and intra-modal knowledge integration. In Proceedings of the 29th ACM International Conference on Multimedia. 797\u2013806."},{"key":"e_1_3_1_26_2","first-page":"11573","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Ding Xinpeng","year":"2021","unstructured":"Xinpeng Ding, Nannan Wang, Shiwei Zhang, De Cheng, Xiaomeng Li, Ziyuan Huang, Mingqian Tang, and Xinbo Gao. 2021. Support-set based cross-supervision for video grounding. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 
11573\u201311582."},{"key":"e_1_3_1_27_2","first-page":"3063","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Duan Xuguang","year":"2018","unstructured":"Xuguang Duan, Wen-bing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, and Junzhou Huang. 2018. Weakly supervised dense event captioning in videos. In Proceedings of the Advances in Neural Information Processing Systems. 3063\u20133073."},{"key":"e_1_3_1_28_2","unstructured":"Victor Escorcia Mattia Soldan Josef Sivic Bernard Ghanem and Bryan Russell. 2019. Temporal localization of moments in video collections with natural language. arXiv:1907.12763. Retrieved from https:\/\/arxiv.org\/abs\/1907.12763."},{"key":"e_1_3_1_29_2","first-page":"1999","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Fan Chenyou","year":"2019","unstructured":"Chenyou Fan, Xiaofan Zhang, Shu Zhang, Wensheng Wang, Chi Zhang, and Heng Huang. 2019. Heterogeneous memory enhanced multimodal attention model for video question answering. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 1999\u20132007."},{"key":"e_1_3_1_30_2","first-page":"5277","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Gao Jiyang","year":"2017","unstructured":"Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. TALL: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision. 5277\u20135285."},{"key":"e_1_3_1_31_2","first-page":"1523","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Gao Junyu","year":"2021","unstructured":"Junyu Gao and Changsheng Xu. 2021. Fast video moment retrieval. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 
1523\u20131532."},{"key":"e_1_3_1_32_2","first-page":"1481","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing","author":"Gao Mingfei","year":"2019","unstructured":"Mingfei Gao, Larry Davis, Richard Socher, and Caiming Xiong. 2019. WSLLN:Weakly supervised natural language localization networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 1481\u20131487."},{"key":"e_1_3_1_33_2","first-page":"245","volume-title":"Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision","author":"Ge Runzhou","year":"2019","unstructured":"Runzhou Ge, Jiyang Gao, Kan Chen, and Ram Nevatia. 2019. Mac: Mining activity concepts for language-based temporal localization. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision. IEEE, 245\u2013253."},{"key":"e_1_3_1_34_2","first-page":"1984","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Ghosh Soham","year":"2019","unstructured":"Soham Ghosh, Anuva Agarwal, Zarana Parekh, and Alexander Hauptmann. 2019. ExCL: Extractive clip localization using natural language descriptions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1984\u20131990."},{"key":"e_1_3_1_35_2","doi-asserted-by":"crossref","first-page":"580","DOI":"10.1109\/CVPR.2014.81","volume-title":"Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition","author":"Girshick Ross B.","year":"2014","unstructured":"Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. 
Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. 580\u2013587."},{"key":"e_1_3_1_36_2","volume-title":"Proceedings of the 31st British Machine Vision Conference","author":"Hahn Meera","year":"2020","unstructured":"Meera Hahn, Asim Kadav, James M. Rehg, and Hans Peter Graf. 2020. Tripping through time: Efficient localization of activities in videos. In Proceedings of the 31st British Machine Vision Conference."},{"key":"e_1_3_1_37_2","first-page":"8393","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"He Dongliang","year":"2019","unstructured":"Dongliang He, Xiang Zhao, Jizhou Huang, Fu Li, Xiao Liu, and Shilei Wen. 2019. Read, watch, and move: Reinforcement learning for temporally grounding natural language descriptions in videos. In Proceedings of the AAAI Conference on Artificial Intelligence. 8393\u20138400."},{"key":"e_1_3_1_38_2","doi-asserted-by":"crossref","unstructured":"Lisa Anne Hendricks Oliver Wang Eli Shechtman Josef Sivic Trevor Darrell and Bryan Russell. 2018. Localizing moments in video with temporal language. In EMNLP . 1380\u20131390.","DOI":"10.18653\/v1\/D18-1168"},{"key":"e_1_3_1_39_2","first-page":"5804","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Hendricks Lisa Anne","year":"2017","unstructured":"Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan C. Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision. 5804\u20135813."},{"key":"e_1_3_1_40_2","first-page":"2352","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing","author":"Hori Chiori","year":"2019","unstructured":"Chiori Hori, Huda AlAmri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. 
Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, and Devi Parikh. 2019. End-to-end audio visual scene-aware dialog using multimodal attention-based video features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2352\u20132356."},{"key":"e_1_3_1_41_2","first-page":"4555","volume-title":"Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition","author":"Hu Ronghang","year":"2016","unstructured":"Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. 2016. Natural language object retrieval. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 4555\u20134564."},{"key":"e_1_3_1_42_2","first-page":"7199","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Huang Jiabo","year":"2021","unstructured":"Jiabo Huang, Yang Liu, Shaogang Gong, and Hailin Jin. 2021. Cross-sentence temporal and semantic relations in video activity localisation. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 7199\u20137208."},{"key":"e_1_3_1_43_2","doi-asserted-by":"crossref","first-page":"217","DOI":"10.1145\/3323873.3325019","volume-title":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","author":"Jiang Bin","year":"2019","unstructured":"Bin Jiang, Xin Huang, Chao Yang, and Junsong Yuan. 2019. Cross-modal video moment retrieval with spatial and language-temporal attention. In Proceedings of the 2019 on International Conference on Multimedia Retrieval. 
217\u2013225."},{"issue":"10","key":"e_1_3_1_44_2","doi-asserted-by":"crossref","first-page":"2693","DOI":"10.1109\/TMM.2018.2815998","article-title":"Three-dimensional attention-based deep ranking model for video highlight detection","volume":"20","author":"Jiao Yifan","year":"2018","unstructured":"Yifan Jiao, Zhetao Li, Shucheng Huang, Xiaoshan Yang, Bin Liu, and Tianzhu Zhang. 2018. Three-dimensional attention-based deep ranking model for video highlight detection. IEEE Transactions on Multimedia 20, 10 (2018), 2693\u20132705.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_1_45_2","volume-title":"Proceedings of the ECCV THUMOS Workshop","author":"Karaman Svebor","year":"2014","unstructured":"Svebor Karaman, Lorenzo Seidenari, and Alberto Del Bimbo. 2014. Fast saliency based pooling of fisher encoded dense trajectories. In Proceedings of the ECCV THUMOS Workshop."},{"key":"e_1_3_1_46_2","first-page":"787","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing","author":"Kazemzadeh Sahar","year":"2014","unstructured":"Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 787\u2013798."},{"key":"e_1_3_1_47_2","volume-title":"Proceedings of the 5th International Conference on Learning Representations","author":"Kipf Thomas N.","year":"2017","unstructured":"Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations."},{"key":"e_1_3_1_48_2","first-page":"706","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Krishna Ranjay","year":"2017","unstructured":"Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. 
Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision. 706\u2013715."},{"key":"e_1_3_1_49_2","doi-asserted-by":"crossref","unstructured":"Jie Lei Licheng Yu Mohit Bansal and Tamara L. Berg. 2018. TVQA: Localized compositional video question answering. In EMNLP .","DOI":"10.18653\/v1\/D18-1167"},{"key":"e_1_3_1_50_2","first-page":"447","volume-title":"Proceedings of the 16th European Conference on Computer Vision.","author":"Lei Jie","year":"2020","unstructured":"Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. 2020. Tvr: A large-scale dataset for video-subtitle moment retrieval. In Proceedings of the 16th European Conference on Computer Vision. Springer, 447\u2013463."},{"key":"e_1_3_1_51_2","first-page":"8658","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Li Xiangpeng","year":"2019","unstructured":"Xiangpeng Li, Jingkuan Song, Lianli Gao, Xianglong Liu, Wenbing Huang, Xiangnan He, and Chuang Gan. 2019. Beyond rnns: Positional self-attention with co-attention for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence. 8658\u20138665."},{"key":"e_1_3_1_52_2","first-page":"988","volume-title":"Proceedings of the 2017 ACM on Multimedia Conference","author":"Lin Tianwei","year":"2017","unstructured":"Tianwei Lin, Xu Zhao, and Zheng Shou. 2017. Single shot temporal action detection. In Proceedings of the 2017 ACM on Multimedia Conference. 988\u2013996."},{"key":"e_1_3_1_53_2","first-page":"11539","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Lin Zhijie","year":"2020","unstructured":"Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, and Huasheng Liu. 2020. Weakly-supervised video moment retrieval via semantic completion network. In Proceedings of the AAAI Conference on Artificial Intelligence. 
11539\u201311546."},{"key":"e_1_3_1_54_2","first-page":"552","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Liu Bingbin","year":"2018","unstructured":"Bingbin Liu, Serena Yeung, Edward Chou, De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. 2018. Temporal modular networks for retrieving complex compositional activities in videos. In Proceedings of the European Conference on Computer Vision. 552\u2013568."},{"key":"e_1_3_1_55_2","first-page":"1841","volume-title":"Proceedings of the 28th International Conference on Computational Linguistics","author":"Liu Daizong","year":"2020","unstructured":"Daizong Liu, Xiaoye Qu, Jianfeng Dong, and Pan Zhou. 2020. Reasoning step-by-step: Temporal sentence localization in videos via deep rectification-modulation network. In Proceedings of the 28th International Conference on Computational Linguistics. 1841\u20131851."},{"key":"e_1_3_1_56_2","first-page":"11235","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Liu Daizong","year":"2021","unstructured":"Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Yu Cheng, Wei Wei, Zichuan Xu, and Yulai Xie. 2021. Context-aware biaffine localizing network for temporal sentence grounding. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 11235\u201311244."},{"key":"e_1_3_1_57_2","first-page":"4070","volume-title":"Proceedings of the28th ACM International Conference on Multimedia","author":"Liu Daizong","year":"2020","unstructured":"Daizong Liu, Xiaoye Qu, Xiao-Yang Liu, Jianfeng Dong, Pan Zhou, and Zichuan Xu. 2020. Jointly cross- and self-modal graph attention network for query-based moment localization. In Proceedings of the28th ACM International Conference on Multimedia. 
4070\u20134078."},{"key":"e_1_3_1_58_2","first-page":"15","volume-title":"Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval","author":"Liu Meng","year":"2018","unstructured":"Meng Liu, Xiang Wang, Liqiang Nie, Xiangnan He, Baoquan Chen, and Tat-Seng Chua. 2018. Attentive moment retrieval in videos. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 15\u201324."},{"key":"e_1_3_1_59_2","first-page":"843","volume-title":"Proceedings of the 2018 ACM Multimedia Conference on Multimedia Conference","author":"Liu Meng","year":"2018","unstructured":"Meng Liu, Xiang Wang, Liqiang Nie, Qi Tian, Baoquan Chen, and Tat-Seng Chua. 2018. Cross-modal moment localization in videos. In Proceedings of the 2018 ACM Multimedia Conference on Multimedia Conference. 843\u2013851."},{"key":"e_1_3_1_60_2","article-title":"A survey on natural language video localization","author":"Liu Xinfang","year":"2021","unstructured":"Xinfang Liu, Xiushan Nie, Zhifang Tan, Jie Guo, and Yilong Yin. 2021. A survey on natural language video localization. arXiv:2104.00234. Retrieved from https:\/\/arxiv.org\/abs\/2104.00234.","journal-title":"arXiv:2104.00234."},{"key":"e_1_3_1_61_2","first-page":"5144","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing","author":"Lu Chujie","year":"2019","unstructured":"Chujie Lu, Long Chen, Chilie Tan, Xiaolin Li, and Jun Xiao. 2019. DEBUG: A dense bottom-up grounding approach for natural language video localization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 
5144\u20135153."},{"key":"e_1_3_1_62_2","article-title":"ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks","volume":"32","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems 32 (2019), 13\u201323.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_63_2","first-page":"5600","volume-title":"Proceedings of the 29th ACM International Conference on Multimedia","author":"Luo Jianjie","year":"2021","unstructured":"Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Hongyang Chao, and Tao Mei. 2021. CoCo-BERT: Improving video-language pre-training with contrastive cross-modal matching and denoising. In Proceedings of the 29th ACM International Conference on Multimedia. 5600\u20135608."},{"key":"e_1_3_1_64_2","first-page":"156","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Ma Minuk","year":"2020","unstructured":"Minuk Ma, Sunjae Yoon, Junyeong Kim, Youngjoon Lee, Sunghun Kang, and Chang D. Yoo. 2020. Vlanet: Video-language alignment network for weakly-supervised video moment retrieval. In Proceedings of the European Conference on Computer Vision. Springer, 156\u2013171."},{"key":"e_1_3_1_65_2","first-page":"1942","volume-title":"Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition","author":"Ma Shugao","year":"2016","unstructured":"Shugao Ma, Leonid Sigal, and Stan Sclaroff. 2016. Learning activity progression in LSTMs for activity detection and early detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 
1942\u20131950."},{"key":"e_1_3_1_66_2","first-page":"533","volume-title":"Proceedings of the 10th ACM International Conference on Multimedia","author":"Ma Yu-Fei","year":"2002","unstructured":"Yu-Fei Ma, Lie Lu, Hong-Jiang Zhang, and Mingjing Li. 2002. A user attention model for video summarization. In Proceedings of the 10th ACM International Conference on Multimedia. 533\u2013542."},{"key":"e_1_3_1_67_2","first-page":"202","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Mahasseni Behrooz","year":"2017","unstructured":"Behrooz Mahasseni, Michael Lam, and Sinisa Todorovic. 2017. Unsupervised video summarization with adversarial lstm networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 202\u2013211."},{"key":"e_1_3_1_68_2","first-page":"11592","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Mithun Niluthpol Chowdhury","year":"2019","unstructured":"Niluthpol Chowdhury Mithun, Sujoy Paul, and Amit K. Roy-Chowdhury. 2019. Weakly supervised video moment retrieval from text queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11592\u201311601."},{"key":"e_1_3_1_69_2","first-page":"10807","volume-title":"Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Mun Jonghwan","year":"2020","unstructured":"Jonghwan Mun, Minsu Cho, and Bohyung Han. 2020. Local-global video-text interactions for temporal grounding. In Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 10807\u201310816."},{"key":"e_1_3_1_70_2","first-page":"2765","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Nan Guoshun","year":"2021","unstructured":"Guoshun Nan, Rui Qiao, Yao Xiao, Jun Liu, Sicong Leng, Hao Zhang, and Wei Lu. 2021. 
Interventional video grounding with dual contrastive learning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 2765\u20132775."},{"key":"e_1_3_1_71_2","volume-title":"Proceedings of the 31st British Machine Vision Conference","author":"Otani Mayu","year":"2020","unstructured":"Mayu Otani, Yuta Nakashima, Esa Rahtu, and Janne Heikkil\u00e4. 2020. Uncovering hidden challenges in query-based video moment retrieval. In Proceedings of the 31st British Machine Vision Conference."},{"key":"e_1_3_1_72_2","article-title":"Auto-captions on GIF: A large-scale video-sentence dataset for vision-language pre-training","author":"Pan Yingwei","year":"2020","unstructured":"Yingwei Pan, Yehao Li, Jianjie Luo, Jun Xu, Ting Yao, and Tao Mei. 2020. Auto-captions on GIF: A large-scale video-sentence dataset for vision-language pre-training. arXiv:2007.02375. Retrieved from https:\/\/arxiv.org\/abs\/2007.02375.","journal-title":"arXiv:2007.02375."},{"key":"e_1_3_1_73_2","first-page":"4594","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Pan Yingwei","year":"2016","unstructured":"Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4594\u20134602."},{"key":"e_1_3_1_74_2","first-page":"6504","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Pan Yingwei","year":"2017","unstructured":"Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. 2017. Video captioning with transferred semantic attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
6504\u20136512."},{"key":"e_1_3_1_75_2","first-page":"4280","volume-title":"Proceedings of the 28th ACM International Conference on Multimedia","author":"Qu Xiaoye","year":"2020","unstructured":"Xiaoye Qu, Pengwei Tang, Zhikang Zou, Yu Cheng, Jianfeng Dong, Pan Zhou, and Zichuan Xu. 2020. Fine-grained iterative attention network for temporal language localization in videos. In Proceedings of the 28th ACM International Conference on Multimedia. 4280\u20134288."},{"key":"e_1_3_1_76_2","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1162\/tacl_a_00207","article-title":"Grounding action descriptions in videos","volume":"1","author":"Regneri Michaela","year":"2013","unstructured":"Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics 1 (2013), 25\u201336.","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"e_1_3_1_77_2","first-page":"2464","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Rodriguez Cristian","year":"2020","unstructured":"Cristian Rodriguez, Edison Marrese-Taylor, Fatemeh Sadat Saleh, Hongdong Li, and Stephen Gould. 2020. Proposal-free temporal moment localization of a natural-language query in video using guided attention. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 2464\u20132473."},{"key":"e_1_3_1_78_2","first-page":"1079","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Rodriguez-Opazo Cristian","year":"2021","unstructured":"Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Basura Fernando, Hongdong Li, and Stephen Gould. 2021. DORi: Discovering object relationships for moment localization of a natural language query in a video. 
In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 1079\u20131088."},{"key":"e_1_3_1_79_2","first-page":"144","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Rohrbach Marcus","year":"2012","unstructured":"Marcus Rohrbach, Michaela Regneri, Mykhaylo Andriluka, Sikandar Amin, Manfred Pinkal, and Bernt Schiele. 2012. Script data for attribute-based recognition of composite activities. In Proceedings of the European Conference on Computer Vision. Springer, 144\u2013157."},{"key":"e_1_3_1_80_2","first-page":"10414","volume-title":"Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Sadhu Arka","year":"2020","unstructured":"Arka Sadhu, Kan Chen, and Ram Nevatia. 2020. Video object grounding using semantic roles in language description. In Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 10414\u201310424."},{"key":"e_1_3_1_81_2","first-page":"200","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Shao Dian","year":"2018","unstructured":"Dian Shao, Yu Xiong, Yue Zhao, Qingqiu Huang, Yu Qiao, and Dahua Lin. 2018. Find and focus: Retrieve and localize video events with natural language queries. In Proceedings of the European Conference on Computer Vision. 200\u2013216."},{"key":"e_1_3_1_82_2","first-page":"4788","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Sharghi Aidean","year":"2017","unstructured":"Aidean Sharghi, Jacob S. Laurel, and Boqing Gong. 2017. Query-focused video summarization: Dataset, evaluation, and a memory network based approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
4788\u20134797."},{"key":"e_1_3_1_83_2","first-page":"1049","volume-title":"Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition","author":"Shou Zheng","year":"2016","unstructured":"Zheng Shou, Dongang Wang, and Shih-Fu Chang. 2016. Temporal action localization in untrimmed videos via multi-stage CNNs. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 1049\u20131058."},{"key":"e_1_3_1_84_2","first-page":"510","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Sigurdsson Gunnar A.","year":"2016","unstructured":"Gunnar A. Sigurdsson, G\u00fcl Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of the European Conference on Computer Vision. Springer, 510\u2013526."},{"key":"e_1_3_1_85_2","first-page":"1961","volume-title":"Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition","author":"Singh Bharat","year":"2016","unstructured":"Bharat Singh, Tim K. Marks, Michael J. Jones, Oncel Tuzel, and Ming Shao. 2016. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 1961\u20131970."},{"key":"e_1_3_1_86_2","first-page":"3224","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Soldan Mattia","year":"2021","unstructured":"Mattia Soldan, Mengmeng Xu, Sisi Qu, Jesper Tegner, and Bernard Ghanem. 2021. VLG-Net: Video-language graph matching network for video grounding. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 3224\u20133234."},{"key":"e_1_3_1_87_2","first-page":"340","volume-title":"Proceedings of the Pacific Rim Conference on Multimedia","author":"Song Xiaomeng","year":"2018","unstructured":"Xiaomeng Song and Yahong Han. 2018. 
Val: Visual-attention action localizer. In Proceedings of the Pacific Rim Conference on Multimedia. Springer, 340\u2013350."},{"key":"e_1_3_1_88_2","unstructured":"Yijun Song Jingwen Wang Lin Ma Zhou Yu and Jun Yu. 2020. Weakly-supervised multi-level attentional reconstruction network for grounding textual queries in videos. arXiv:2003.07048. Retrieved from https:\/\/arxiv.org\/abs\/2003.07048."},{"key":"e_1_3_1_89_2","article-title":"Compositional temporal visual grounding of natural language event descriptions","author":"Stroud Jonathan C.","year":"2019","unstructured":"Jonathan C. Stroud, Ryan McCaffrey, Rada Mihalcea, Jia Deng, and Olga Russakovsky. 2019. Compositional temporal visual grounding of natural language event descriptions. arXiv:1912.02256. Retrieved from https:\/\/arxiv.org\/abs\/1912.02256.","journal-title":"arXiv:1912.02256."},{"key":"e_1_3_1_90_2","first-page":"1533","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Su Rui","year":"2021","unstructured":"Rui Su, Qian Yu, and Dong Xu. 2021. Stvgbert: A visual-linguistic transformer based framework for spatio-temporal video grounding. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 1533\u20131542."},{"key":"e_1_3_1_91_2","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing","author":"Tan Hao","year":"2019","unstructured":"Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing."},{"key":"e_1_3_1_92_2","first-page":"2083","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Tan Reuben","year":"2021","unstructured":"Reuben Tan, Huijuan Xu, Kate Saenko, and Bryan A. Plummer. 2021. Logan: Latent graph co-attention network for weakly-supervised video moment retrieval. 
In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 2083\u20132092."},{"key":"e_1_3_1_93_2","article-title":"Human-centric spatio-temporal video grounding with visual transformers","author":"Tang Zongheng","year":"2021","unstructured":"Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, and Dong Xu. 2021. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology (2021).","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_94_2","first-page":"1","volume-title":"Proceedings of the ACM International Conference on Image and Video Retrieval","author":"Tellex Stefanie","year":"2009","unstructured":"Stefanie Tellex and Deb Roy. 2009. Towards surveillance video search by natural language query. In Proceedings of the ACM International Conference on Image and Video Retrieval. 1\u20138."},{"key":"e_1_3_1_95_2","first-page":"5998","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. 5998\u20136008."},{"key":"e_1_3_1_96_2","doi-asserted-by":"crossref","first-page":"4116","DOI":"10.1145\/3394171.3413975","volume-title":"Proceedings of the 28th ACM International Conference on Multimedia","author":"Wang Hao","year":"2020","unstructured":"Hao Wang, Zheng-Jun Zha, Xuejin Chen, Zhiwei Xiong, and Jiebo Luo. 2020. Dual path interaction network for video moment localization. In Proceedings of the 28th ACM International Conference on Multimedia. 
4116\u20134124."},{"key":"e_1_3_1_97_2","first-page":"7026","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Wang Hao","year":"2021","unstructured":"Hao Wang, Zheng-Jun Zha, Liang Li, Dong Liu, and Jiebo Luo. 2021. Structured multi-level interaction network for video moment localization via language query. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 7026\u20137035."},{"key":"e_1_3_1_98_2","first-page":"12168","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Wang Jingwen","year":"2020","unstructured":"Jingwen Wang, Lin Ma, and Wenhao Jiang. 2020. Temporally grounding language queries in videos by contextual boundary-aware prediction. In Proceedings of the AAAI Conference on Artificial Intelligence. 12168\u201312175."},{"key":"e_1_3_1_99_2","first-page":"14090","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Wang Liwei","year":"2021","unstructured":"Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, and Dong Yu. 2021. Improving weakly supervised visual grounding by contrastive knowledge distillation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 14090\u201314100."},{"issue":"2","key":"e_1_3_1_100_2","first-page":"2","article-title":"Action recognition and detection by combining motion and appearance features","volume":"1","author":"Wang Limin","year":"2014","unstructured":"Limin Wang, Yu Qiao, and Xiaoou Tang. 2014. Action recognition and detection by combining motion and appearance features. 
THUMOS14 Action Recognition Challenge 1, 2 (2014), 2.","journal-title":"THUMOS14 Action Recognition Challenge"},{"key":"e_1_3_1_101_2","first-page":"334","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Wang Weining","year":"2019","unstructured":"Weining Wang, Yan Huang, and Liang Wang. 2019. Language-driven temporal activity localization: A semantic matching reinforcement learning model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 334\u2013343."},{"key":"e_1_3_1_102_2","first-page":"89","volume-title":"Proceedings of the Findings of the Association for Computational Linguistics","author":"Wang Yuechen","year":"2021","unstructured":"Yuechen Wang, Wengang Zhou, and Houqiang Li. 2021. Fine-grained semantic alignment network for weakly supervised temporal language grounding. In Proceedings of the Findings of the Association for Computational Linguistics. 89\u201399."},{"key":"e_1_3_1_103_2","doi-asserted-by":"crossref","first-page":"1459","DOI":"10.1145\/3474085.3475278","volume-title":"Proceedings of the 29th ACM International Conference on Multimedia","author":"Wang Zheng","year":"2021","unstructured":"Zheng Wang, Jingjing Chen, and Yu-Gang Jiang. 2021. Visual co-occurrence alignment learning for weakly-supervised video moment retrieval. In Proceedings of the 29th ACM International Conference on Multimedia. 1459\u20131468."},{"key":"e_1_3_1_104_2","first-page":"1029","volume-title":"Proceedings of the 27th International Joint Conference on Artificial Intelligence","author":"Wu Aming","year":"2018","unstructured":"Aming Wu and Yahong Han. 2018. Multi-modal circulant fusion for video-to-language and backward. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 
1029\u20131035."},{"key":"e_1_3_1_105_2","first-page":"1283","volume-title":"Proceedings of the 28th ACM International Conference on Multimedia","author":"Wu Jie","year":"2020","unstructured":"Jie Wu, Guanbin Li, Xiaoguang Han, and Liang Lin. 2020. Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos. In Proceedings of the 28th ACM International Conference on Multimedia. 1283\u20131291."},{"key":"e_1_3_1_106_2","first-page":"12386","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Wu Jie","year":"2020","unstructured":"Jie Wu, Guanbin Li, Si Liu, and Liang Lin. 2020. Tree-structured policy based progressive reinforcement learning for temporally language grounding in video. In Proceedings of the AAAI Conference on Artificial Intelligence. 12386\u201312393."},{"key":"e_1_3_1_107_2","first-page":"2986","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Xiao Shaoning","year":"2021","unstructured":"Shaoning Xiao, Long Chen, Songyang Zhang, Wei Ji, Jian Shao, Lu Ye, and Jun Xiao. 2021. Boundary proposal network for two-stage natural language video localization. In Proceedings of the AAAI Conference on Artificial Intelligence. 2986\u20132994."},{"key":"e_1_3_1_108_2","first-page":"1645","volume-title":"Proceedings of the 25th ACM International Conference on Multimedia","author":"Xu Dejing","year":"2017","unstructured":"Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. 2017. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia. 1645\u20131653."},{"key":"e_1_3_1_109_2","first-page":"9062","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Xu Huijuan","year":"2019","unstructured":"Huijuan Xu, Kun He, Bryan A. Plummer, Leonid Sigal, Stan Sclaroff, and Kate Saenko. 2019. 
Multilevel language and vision integration for text-to-clip retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence. 9062\u20139069."},{"key":"e_1_3_1_110_2","first-page":"5288","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Xu Jun","year":"2016","unstructured":"Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5288\u20135296."},{"key":"e_1_3_1_111_2","first-page":"7220","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Xu Mengmeng","year":"2021","unstructured":"Mengmeng Xu, Juan-Manuel P\u00e9rez-R\u00faa, Victor Escorcia, Brais Martinez, Xiatian Zhu, Li Zhang, Bernard Ghanem, and Tao Xiang. 2021. Boundary-sensitive pre-training for temporal localization in videos. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 7220\u20137230."},{"key":"e_1_3_1_112_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2019.05.027"},{"key":"e_1_3_1_113_2","first-page":"1","volume-title":"Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Yang Xun","year":"2021","unstructured":"Xun Yang, Fuli Feng, Wei Ji, Meng Wang, and Tat-Seng Chua. 2021. Deconfounded video moment retrieval with causal intervention. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1\u201310."},{"key":"e_1_3_1_114_2","doi-asserted-by":"crossref","first-page":"596","DOI":"10.1109\/ICCST50977.2020.00123","volume-title":"Proceedings of the 2020 International Conference on Culture-oriented Science & Technology","author":"Yang Yulan","year":"2020","unstructured":"Yulan Yang, Zhaohui Li, and Gangyan Zeng. 2020. 
A survey of temporal activity localization via language in untrimmed videos. In Proceedings of the 2020 International Conference on Culture-oriented Science & Technology. IEEE, 596\u2013601."},{"key":"e_1_3_1_115_2","first-page":"982","volume-title":"Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition","author":"Yao Ting","year":"2016","unstructured":"Ting Yao, Tao Mei, and Yong Rui. 2016. Highlight detection with pairwise deep ranking for first-person video summarization. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 982\u2013990."},{"key":"e_1_3_1_116_2","first-page":"684","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Yao Ting","year":"2018","unstructured":"Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision. 684\u2013699."},{"key":"e_1_3_1_117_2","first-page":"4894","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Yao Ting","year":"2017","unstructured":"Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2017. Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision. 4894\u20134902."},{"key":"e_1_3_1_118_2","first-page":"2678","volume-title":"Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition","author":"Yeung Serena","year":"2016","unstructured":"Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. 2016. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 
2678\u20132687."},{"key":"e_1_3_1_119_2","first-page":"1307","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Yu Licheng","year":"2018","unstructured":"Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. 2018. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1307\u20131315."},{"key":"e_1_3_1_120_2","first-page":"1860","volume-title":"Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Yu Xinli","year":"2021","unstructured":"Xinli Yu, Mohsen Malmir, Xin He, Jiangning Chen, Tong Wang, Yue Wu, Yue Liu, and Yang Liu. 2021. Cross interaction network for natural language guided video moment retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1860\u20131864."},{"key":"e_1_3_1_121_2","first-page":"6281","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Yu Zhou","year":"2019","unstructured":"Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. 2019. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 6281\u20136290."},{"key":"e_1_3_1_122_2","doi-asserted-by":"crossref","first-page":"13","DOI":"10.1145\/3475723.3484247","volume-title":"Proceedings of the 2nd International Workshop on Human-centric Multimedia Analysis","author":"Yuan Yitian","year":"2021","unstructured":"Yitian Yuan, Xiaohan Lan, Xin Wang, Long Chen, Zhi Wang, and Wenwu Zhu. 2021. A closer look at temporal sentence grounding in videos: Dataset and metric. In Proceedings of the 2nd International Workshop on Human-centric Multimedia Analysis. 
13\u201321."},{"key":"e_1_3_1_123_2","first-page":"534","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Yuan Yitian","year":"2019","unstructured":"Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, and Wenwu Zhu. 2019. Semantic conditioned dynamic modulation for temporal sentence grounding in videos. In Proceedings of the Advances in Neural Information Processing Systems. 534\u2013544."},{"issue":"1","key":"e_1_3_1_124_2","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1109\/TCSVT.2017.2771247","article-title":"Video summarization by learning deep side semantic embedding","volume":"29","author":"Yuan Yitian","year":"2017","unstructured":"Yitian Yuan, Tao Mei, Peng Cui, and Wenwu Zhu. 2017. Video summarization by learning deep side semantic embedding. IEEE Transactions on Circuits and Systems for Video Technology 29, 1 (2017), 226\u2013237.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_125_2","first-page":"9159","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Yuan Yitian","year":"2019","unstructured":"Yitian Yuan, Tao Mei, and Wenwu Zhu. 2019. To find where you talk: Temporal sentence localization in video with attention based location regression. In Proceedings of the AAAI Conference on Artificial Intelligence. 9159\u20139166."},{"key":"e_1_3_1_126_2","first-page":"10284","volume-title":"Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zeng Runhao","year":"2020","unstructured":"Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, and Chuang Gan. 2020. Dense regression network for video grounding. In Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
10284\u201310293."},{"key":"e_1_3_1_127_2","first-page":"2215","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zeng Yawen","year":"2021","unstructured":"Yawen Zeng, Da Cao, Xiaochi Wei, Meng Liu, Zhou Zhao, and Zheng Qin. 2021. Multi-modal relational graph for cross-modal video moment retrieval. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 2215\u20132224."},{"key":"e_1_3_1_128_2","article-title":"A hierarchical multi-modal encoder for moment localization in video corpus","author":"Zhang Bowen","year":"2020","unstructured":"Bowen Zhang, Hexiang Hu, Joonseok Lee, Ming Zhao, Sheide Chammas, Vihan Jain, Eugene Ie, and Fei Sha. 2020. A hierarchical multi-modal encoder for moment localization in video corpus. abs\/2011.09046 arXiv:2011.09046. Retrieved from https:\/\/arxiv.org\/abs\/2011.09046.","journal-title":"abs\/2011.09046"},{"key":"e_1_3_1_129_2","first-page":"1247","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Zhang Da","year":"2019","unstructured":"Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, and Larry S. Davis. 2019. MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1247\u20131257."},{"key":"e_1_3_1_130_2","volume-title":"Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Zhang Hao","year":"2021","unstructured":"Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. 2021. Video corpus moment retrieval with contrastive learning. 
In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval."},{"key":"e_1_3_1_131_2","doi-asserted-by":"crossref","first-page":"6543","DOI":"10.18653\/v1\/2020.acl-main.585","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Zhang Hao","year":"2020","unstructured":"Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. 2020. Span-based localizing network for natural language video localization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6543\u20136554."},{"key":"e_1_3_1_132_2","first-page":"383","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Zhang Ke","year":"2018","unstructured":"Ke Zhang, Kristen Grauman, and Fei Sha. 2018. Retrospective encoders for video summarization. In Proceedings of the European Conference on Computer Vision. 383\u2013399."},{"key":"e_1_3_1_133_2","first-page":"682","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Zhang Lingyu","year":"2022","unstructured":"Lingyu Zhang and Richard J. Radke. 2022. Natural language video moment localization through query-controlled temporal convolution. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 682\u2013690."},{"key":"e_1_3_1_134_2","first-page":"12669","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhang Mingxing","year":"2021","unstructured":"Mingxing Zhang, Yang Yang, Xinghan Chen, Yanli Ji, Xing Xu, Jingjing Li, and Heng Tao Shen. 2021. Multi-stage aggregated transformer network for temporal language localization in videos. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
12669\u201312678."},{"key":"e_1_3_1_135_2","first-page":"12870","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Zhang Songyang","year":"2020","unstructured":"Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. 2020. Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence. 12870\u201312877."},{"key":"e_1_3_1_136_2","doi-asserted-by":"crossref","first-page":"1230","DOI":"10.1145\/3343031.3350879","volume-title":"Proceedings of the 27th ACM International Conference on Multimedia","author":"Zhang Songyang","year":"2019","unstructured":"Songyang Zhang, Jinsong Su, and Jiebo Luo. 2019. Exploiting temporal relationships in video moment localization with natural language. In Proceedings of the 27th ACM International Conference on Multimedia. 1230\u20131238."},{"key":"e_1_3_1_137_2","doi-asserted-by":"crossref","first-page":"655","DOI":"10.1145\/3331184.3331235","volume-title":"Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Zhang Zhu","year":"2019","unstructured":"Zhu Zhang, Zhijie Lin, Zhou Zhao, and Zhenxin Xiao. 2019. Cross-modal interaction networks for query-based moment retrieval in videos. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 655\u2013664."},{"key":"e_1_3_1_138_2","doi-asserted-by":"crossref","first-page":"4098","DOI":"10.1145\/3394171.3413967","volume-title":"Proceedings of the 28th ACM International Conference on Multimedia","author":"Zhang Zhu","year":"2020","unstructured":"Zhu Zhang, Zhijie Lin, Zhou Zhao, Jieming Zhu, and Xiuqiang He. 2020. Regularized two-branch proposal networks for weakly-supervised moment retrieval in videos. In Proceedings of the 28th ACM International Conference on Multimedia. 
4098\u20134106."},{"key":"e_1_3_1_139_2","first-page":"18123","article-title":"Counterfactual contrastive learning for weakly-supervised vision-language grounding","volume":"33","author":"Zhang Zhu","year":"2020","unstructured":"Zhu Zhang, Zhou Zhao, Zhijie Lin, Xiuqiang He, and Jieming Zhu. 2020. Counterfactual contrastive learning for weakly-supervised vision-language grounding. Advances in Neural Information Processing Systems 33 (2020), 18123\u201318134.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_140_2","first-page":"10665","volume-title":"Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhang Zhu","year":"2020","unstructured":"Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. 2020. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 10665\u201310674."},{"key":"e_1_3_1_141_2","first-page":"7405","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Zhao Bin","year":"2018","unstructured":"Bin Zhao, Xuelong Li, and Xiaoqiang Lu. 2018. Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7405\u20137414."},{"key":"e_1_3_1_142_2","first-page":"4197","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhao Yang","year":"2021","unstructured":"Yang Zhao, Zhou Zhao, Zhu Zhang, and Zhijie Lin. 2021. Cascaded prediction network via segment tree for temporal video grounding. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
4197\u20134206."},{"key":"e_1_3_1_143_2","first-page":"8445","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhou Hao","year":"2021","unstructured":"Hao Zhou, Chongyang Zhang, Yan Luo, Yanjun Chen, and Chuanping Hu. 2021. Embracing uncertainty: Decoupling and de-bias for robust temporal grounding. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 8445\u20138454."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3532626","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3532626","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:04Z","timestamp":1750182664000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3532626"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,6]]},"references-count":142,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,5,31]]}},"alternative-id":["10.1145\/3532626"],"URL":"https:\/\/doi.org\/10.1145\/3532626","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,6]]},"assertion":[{"value":"2021-09-16","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-04-17","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-02-06","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}