{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,2]],"date-time":"2026-01-02T07:34:31Z","timestamp":1767339271896,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":36,"publisher":"ACM","license":[{"start":{"date-parts":[[2020,10,12]],"date-time":"2020-10-12T00:00:00Z","timestamp":1602460800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2020,10,12]]},"DOI":"10.1145\/3394171.3413614","type":"proceedings-article","created":{"date-parts":[[2020,10,12]],"date-time":"2020-10-12T13:12:00Z","timestamp":1602508320000},"page":"3789-3797","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos"],"prefix":"10.1145","author":[{"given":"Junwen","family":"Chen","sequence":"first","affiliation":[{"name":"Rochester Institute of Technology, Rochester, NY, USA"}]},{"given":"Wentao","family":"Bao","sequence":"additional","affiliation":[{"name":"Rochester Institute of Technology, Rochester, NY, USA"}]},{"given":"Yu","family":"Kong","sequence":"additional","affiliation":[{"name":"Rochester Institute of Technology, Rochester, NY, USA"}]}],"member":"320","published-online":{"date-parts":[[2020,10,12]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.5555\/3298023.3298199"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00209"},{"key":"e_1_3_2_2_3_1","volume-title":"OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields","author":"Cao Z","year":"2019","unstructured":"Z Cao , G Martinez Hidalgo , T Simon , SE Wei , and YA Sheikh . 2019. 
OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE transactions on pattern analysis and machine intelligence (2019)."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00425"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00177"},{"key":"e_1_3_2_2_6_1","volume-title":"Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video. arXiv preprint arXiv:2001.09308","author":"Chen Zhenfang","year":"2020","unstructured":"Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, and Kwan-Yee K Wong. 2020. Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video. arXiv preprint arXiv:2001.09308 (2020)."},{"key":"e_1_3_2_2_7_1","volume-title":"2019 a. Weakly-supervised spatio-temporally grounding natural sentence in video. ACL","author":"Chen Zhenfang","year":"2019","unstructured":"Zhenfang Chen, Lin Ma, Wenhan Luo, and Kwan-Yee K Wong. 2019 a. Weakly-supervised spatio-temporally grounding natural sentence in video. ACL (2019)."},{"key":"e_1_3_2_2_8_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In ACL. 4171--4186.","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In ACL. 4171--4186."},{"key":"e_1_3_2_2_9_1","unstructured":"Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, and Junzhou Huang. 2018. Weakly supervised dense event captioning in videos. In Advances in Neural Information Processing Systems. 3059--3069."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00552"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1157"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.322"},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1162"},{"key":"e_1_3_2_2_14_1","volume-title":"Reference-Aware Visual Grounding in Instructional Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).","author":"De-An","year":"2018","unstructured":"De-An Huang*, Shyamal Buch*, Lucio Dery, Animesh Garg, Li Fei-Fei, and Juan Carlos Niebles. 2018. Finding \"It\": Weakly-Supervised, Reference-Aware Visual Grounding in Instructional Videos. 
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.363"},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-5010"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01186"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2012.6248065"},{"key":"e_1_3_2_2_22_1","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91--99."},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_49"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.509"},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"crossref","unstructured":"Jing Shi, Jia Xu, Boqing Gong, and Chenliang Xu. 2019. Not all frames are equal: Weakly-supervised video grounding with contextual similarity and visual clustering losses. In CVPR. 10444--10452.","DOI":"10.1109\/CVPR.2019.01069"},{"key":"e_1_3_2_2_26_1","volume-title":"Interactive visual grounding of referring expressions for human-robot interaction. 
arXiv preprint arXiv:1806.03831","author":"Shridhar Mohit","year":"2018","unstructured":"Mohit Shridhar and David Hsu. 2018. Interactive visual grounding of referring expressions for human-robot interaction. arXiv preprint arXiv:1806.03831 (2018)."},{"key":"e_1_3_2_2_27_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00140"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00267"},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00427"},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00478"},{"key":"e_1_3_2_2_32_1","volume-title":"2019 b. Grounding-Tracking-Integration. arXiv preprint arXiv:1912.06316","author":"Yang Zhengyuan","year":"2019","unstructured":"Zhengyuan Yang, Tushar Kumar, Tianlang Chen, and Jiebo Luo. 2019 b. Grounding-Tracking-Integration. arXiv preprint arXiv:1912.06316 (2019)."},{"key":"e_1_3_2_2_33_1","volume-title":"Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 53--63","author":"Yu Haonan","year":"2013","unstructured":"Haonan Yu and Jeffrey Mark Siskind. 2013. 
Grounded language learning from video described with sentences. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 53--63."},{"key":"e_1_3_2_2_34_1","volume-title":"Grounded Video Description. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Zhou Luowei","year":"2019","unstructured":"Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, and Marcus Rohrbach. 2019. Grounded Video Description. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_3_2_2_35_1","volume-title":"Weakly-supervised video object grounding from text by loss weighting and object interaction. BMVC","author":"Zhou Luowei","year":"2018","unstructured":"Luowei Zhou, Nathan Louis, and Jason J Corso. 2018a. Weakly-supervised video object grounding from text by loss weighting and object interaction. 
BMVC (2018)."},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12342"}],"event":{"name":"MM '20: The 28th ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Seattle WA USA","acronym":"MM '20"},"container-title":["Proceedings of the 28th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394171.3413614","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3394171.3413614","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:47:15Z","timestamp":1750193235000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394171.3413614"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,12]]},"references-count":36,"alternative-id":["10.1145\/3394171.3413614","10.1145\/3394171"],"URL":"https:\/\/doi.org\/10.1145\/3394171.3413614","relation":{},"subject":[],"published":{"date-parts":[[2020,10,12]]},"assertion":[{"value":"2020-10-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}