{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T15:31:17Z","timestamp":1774539077570,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":37,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,11,23]],"date-time":"2021-11-23T00:00:00Z","timestamp":1637625600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Tsinghua GuoQiang Research Center Grant","award":["2020GQG1014"],"award-info":[{"award-number":["2020GQG1014"]}]},{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2020AAA0106301"],"award-info":[{"award-number":["2020AAA0106301"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62050110"],"award-info":[{"award-number":["62050110"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,11,23]]},"DOI":"10.1145\/3475723.3484247","type":"proceedings-article","created":{"date-parts":[[2021,11,25]],"date-time":"2021-11-25T17:06:06Z","timestamp":1637859966000},"page":"13-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":51,"title":["A Closer Look at Temporal Sentence Grounding in Videos"],"prefix":"10.1145","author":[{"given":"Yitian","family":"Yuan","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Xiaohan","family":"Lan","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, 
China"}]},{"given":"Xin","family":"Wang","sequence":"additional","affiliation":[{"name":"Tsinghua University & Pengcheng Laboratory, Beijing, China"}]},{"given":"Long","family":"Chen","sequence":"additional","affiliation":[{"name":"Columbia University, New York City, NY, USA"}]},{"given":"Zhi","family":"Wang","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Wenwu","family":"Zhu","sequence":"additional","affiliation":[{"name":"Tsinghua University & Pengcheng Laboratory, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2021,11,25]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Lisa Anne Hendricks Oliver Wang Eli Shechtman Josef Sivic Trevor Darrell and Bryan Russell. 2017. Localizing moments in video with natural language. In ICCV .","DOI":"10.1109\/ICCV.2017.618"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Joao Carreira and Andrew Zisserman. 2017. Quo vadis action recognition? a new model and the kinetics dataset. In CVPR .","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"crossref","unstructured":"Jingyuan Chen Xinpeng Chen Lin Ma Zequn Jie and Tat-Seng Chua. 2018. Temporally grounding natural sentence in video. In EMNLP .","DOI":"10.18653\/v1\/D18-1015"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"crossref","unstructured":"
Long Chen Chujie Lu Siliang Tang Jun Xiao Dong Zhang Chilie Tan and Xiaolin Li. 2020. Rethinking the Bottom-Up Framework for Query-Based Video Localization. In AAAI .","DOI":"10.1609\/aaai.v34i07.6627"},{"key":"e_1_3_2_1_5_1","unstructured":"Xuguang Duan Wenbing Huang Chuang Gan Jingdong Wang Wenwu Zhu and Junzhou Huang. 2018. Weakly supervised dense event captioning in videos. In NeurIPS ."},{"key":"e_1_3_2_1_6_1","volume-title":"Tall: Temporal activity localization via language query. In ICCV .","author":"Gao Jiyang","year":"2017"},{"key":"e_1_3_2_1_7_1","volume-title":"WSLLN: Weakly Supervised Natural Language Localization Networks. In EMNLP .","author":"Gao Mingfei","year":"2019"},{"key":"e_1_3_2_1_8_1","volume-title":"Mac: Mining activity concepts for language-based temporal localization. In WACV .","author":"Ge Runzhou","year":"2019"},{"key":"e_1_3_2_1_9_1","unstructured":"Meera Hahn Asim Kadav James M Rehg and Hans Peter Graf. 2019. Tripping through time: Efficient localization of activities in videos. In arXiv ."},{"key":"e_1_3_2_1_10_1","unstructured":"Dongliang He Xiang Zhao Jizhou Huang Fu Li Xiao Liu and Shilei Wen. 2019. Read watch and move: Reinforcement learning for temporally grounding natural language descriptions in videos. In AAAI ."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"crossref","unstructured":"
Lisa Anne Hendricks Oliver Wang Eli Shechtman Josef Sivic Trevor Darrell and Bryan Russell. 2018. Localizing moments in video with temporal language. In EMNLP .","DOI":"10.18653\/v1\/D18-1168"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"crossref","unstructured":"Bin Jiang Xin Huang Chao Yang and Junsong Yuan. 2019. Cross-modal video moment retrieval with spatial and language-temporal attention. In ICMR .","DOI":"10.1145\/3323873.3325019"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"crossref","unstructured":"Ranjay Krishna Kenji Hata Frederic Ren Li Fei-Fei and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In ICCV .","DOI":"10.1109\/ICCV.2017.83"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Meng Liu Xiang Wang Liqiang Nie Xiangnan He Baoquan Chen and Tat-Seng Chua. 2018a. Attentive moment retrieval in videos. In SIGIR .","DOI":"10.1145\/3209978.3210003"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Meng Liu Xiang Wang Liqiang Nie Qi Tian Baoquan Chen and Tat-Seng Chua. 2018b. Cross-modal moment localization in videos. In ACM MM .","DOI":"10.1145\/3240508.3240549"},{"key":"e_1_3_2_1_16_1","volume-title":"Debug: A dense bottom-up grounding approach for natural language video localization. In EMNLP .","author":"Lu Chujie","year":"2019"},{"key":"e_1_3_2_1_17_1","unstructured":"Niluthpol Chowdhury Mithun Sujoy Paul and Amit K Roy-Chowdhury. 2019. 
Weakly supervised video moment retrieval from text queries. In CVPR ."},{"key":"e_1_3_2_1_18_1","unstructured":"Mayu Otani Yuta Nakashima Esa Rahtu and Janne Heikkil\u00e4. 2020. Uncovering Hidden Challenges in Query-Based Video Moment Retrieval. In BMVC ."},{"key":"e_1_3_2_1_19_1","volume-title":"Glove: Global vectors for word representation. In EMNLP .","author":"Pennington Jeffrey","year":"2014"},{"key":"e_1_3_2_1_20_1","volume-title":"Grounding action descriptions in videos. TACL","author":"Regneri Michaela","year":"2013"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"crossref","unstructured":"Zheng Shou Dongang Wang and Shih-Fu Chang. 2016. Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR .","DOI":"10.1109\/CVPR.2016.119"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"crossref","unstructured":"Gunnar A Sigurdsson G\u00fcl Varol Xiaolong Wang Ali Farhadi Ivan Laptev and Abhinav Gupta. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV .","DOI":"10.1007\/978-3-319-46448-0_31"},{"key":"e_1_3_2_1_23_1","volume-title":"Val: Visual-attention action localizer. In PCM .","author":"Song Xiaomeng","year":"2018"},{"key":"e_1_3_2_1_24_1","unstructured":"Yijun Song Jingwen Wang Lin Ma Zhou Yu and Jun Yu. 2020. Weakly-supervised multi-level attentional reconstruction network for grounding textual queries in videos. 
In arXiv ."},{"key":"e_1_3_2_1_25_1","unstructured":"Reuben Tan Huijuan Xu Kate Saenko and Bryan A Plummer. 2019. wman: Weakly-supervised moment alignment network for text-based video segment retrieval. In arXiv ."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"crossref","unstructured":"Du Tran Lubomir Bourdev Rob Fergus Lorenzo Torresani and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In ICCV .","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Limin Wang Yuanjun Xiong Zhe Wang Yu Qiao Dahua Lin Xiaoou Tang and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In ECCV .","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"crossref","unstructured":"Weining Wang Yan Huang and Liang Wang. 2019. Language-driven temporal activity localization: A semantic matching reinforcement learning model. In CVPR .","DOI":"10.1109\/CVPR.2019.00042"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"crossref","unstructured":"Jie Wu Guanbin Li Si Liu and Liang Lin. 2020. 
Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video. In AAAI .","DOI":"10.1609\/aaai.v34i07.6924"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"crossref","unstructured":"Shaoning Xiao Long Chen Songyang Zhang Wei Ji Jian Shao Lu Ye and Jun Xiao. 2021. Boundary Proposal Network for Two-Stage Natural Language Video Localization. In AAAI .","DOI":"10.1609\/aaai.v35i4.16406"},{"key":"e_1_3_2_1_31_1","unstructured":"Huijuan Xu Kun He Bryan A Plummer Leonid Sigal Stan Sclaroff and Kate Saenko. 2019. Multilevel language and vision integration for text-to-clip retrieval. In AAAI ."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"crossref","unstructured":"Yitian Yuan Lin Ma Jingwen Wang Wei Liu and Wenwu Zhu. 2019a. Semantic conditioned dynamic modulation for temporal sentence grounding in videos. In NeurIPS .","DOI":"10.1109\/TPAMI.2020.3038993"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"crossref","unstructured":"Yitian Yuan Tao Mei and Wenwu Zhu. 2019b. To find where you talk: Temporal sentence localization in video with attention based location regression. 
In AAAI .","DOI":"10.1609\/aaai.v33i01.33019159"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Runhao Zeng Haoming Xu Wenbing Huang Peihao Chen Mingkui Tan and Chuang Gan. 2020. Dense regression network for video grounding. In CVPR .","DOI":"10.1109\/CVPR42600.2020.01030"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"crossref","unstructured":"Da Zhang Xiyang Dai Xin Wang Yuan-Fang Wang and Larry S Davis. 2019a. Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In CVPR .","DOI":"10.1109\/CVPR.2019.00134"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"crossref","unstructured":"Songyang Zhang Houwen Peng Jianlong Fu and Jiebo Luo. 2020. Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language. In AAAI .","DOI":"10.1609\/aaai.v34i07.6984"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"crossref","unstructured":"Zhu Zhang Zhijie Lin Zhou Zhao and Zhenxin Xiao. 2019b. Cross-modal interaction networks for query-based moment retrieval in videos. 
In SIGIR .","DOI":"10.1145\/3331184.3331235"}],"event":{"name":"MM '21: ACM Multimedia Conference","location":"Virtual Event China","acronym":"MM '21","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2nd International Workshop on Human-centric Multimedia Analysis"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3475723.3484247","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3475723.3484247","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:18Z","timestamp":1750193298000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3475723.3484247"}},"subtitle":["Dataset and Metric"],"short-title":[],"issued":{"date-parts":[[2021,11,23]]},"references-count":37,"alternative-id":["10.1145\/3475723.3484247","10.1145\/3475723"],"URL":"https:\/\/doi.org\/10.1145\/3475723.3484247","relation":{},"subject":[],"published":{"date-parts":[[2021,11,23]]},"assertion":[{"value":"2021-11-25","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}