{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:18:42Z","timestamp":1750220322280,"version":"3.41.0"},"reference-count":64,"publisher":"Association for Computing Machinery (ACM)","issue":"3s","license":[{"start":{"date-parts":[[2022,10,31]],"date-time":"2022-10-31T00:00:00Z","timestamp":1667174400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100008192","name":"Automotive Research Center","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100008192","id-type":"DOI","asserted-by":"crossref"}]},{"name":"University of Michigan in accordance with Cooperative Agreement","award":["W56HZV-19-2-0001"],"award-info":[{"award-number":["W56HZV-19-2-0001"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2022,10,31]]},"abstract":"<jats:p>We consider the task of temporal human action localization in lifestyle vlogs. We introduce a novel dataset consisting of manual annotations of temporal localization for 13,000 narrated actions in 1,200 video clips. We present an extensive analysis of this data, which allows us to better understand how the language and visual modalities interact throughout the videos. We propose a simple yet effective method to localize the narrated actions based on their expected duration. 
Through several experiments and analyses, we show that our method brings complementary information with respect to previous methods, and leads to improvements over previous work for the task of temporal action localization.<\/jats:p>","DOI":"10.1145\/3495211","type":"journal-article","created":{"date-parts":[[2022,2,18]],"date-time":"2022-02-18T19:42:08Z","timestamp":1645213328000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs"],"prefix":"10.1145","volume":"18","author":[{"given":"Oana","family":"Ignat","sequence":"first","affiliation":[{"name":"University of Michigan, Ann Arbor, Michigan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Santiago","family":"Castro","sequence":"additional","affiliation":[{"name":"University of Michigan, Ann Arbor, Michigan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuhang","family":"Zhou","sequence":"additional","affiliation":[{"name":"University of Michigan, Ann Arbor, Michigan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jiajun","family":"Bao","sequence":"additional","affiliation":[{"name":"University of Michigan, Ann Arbor, Michigan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dandan","family":"Shan","sequence":"additional","affiliation":[{"name":"University of Michigan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rada","family":"Mihalcea","sequence":"additional","affiliation":[{"name":"University of Michigan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,11,2]]},"reference":[{"key":"e_1_3_2_2_2","article-title":"Youtube-8m: A large-scale video classification benchmark","author":"Abu-El-Haija Sami","year":"2016","unstructured":"Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George 
Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675. Retrieved from https:\/\/arxiv.org\/abs\/1609.08675.","journal-title":"arXiv:1609.08675."},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.495"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.618"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.618"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1082"},{"key":"e_1_3_2_10_2","volume-title":"Proceedings of the NAACL-HLT","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT."},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"e_1_3_2_12_2","article-title":"Temporal localization of moments in video collections with natural language","author":"Escorcia Victor","year":"2019","unstructured":"Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, and Bryan Russell. 2019. Temporal localization of moments in video collections with natural language. arXiv:1907.12763. 
Retrieved from https:\/\/arxiv.org\/abs\/1907.12763.","journal-title":"arXiv:1907.12763."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1177\/001316447303300309"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00524"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.563"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2019.00032"},{"key":"e_1_3_2_17_2","article-title":"ExCL: Extractive clip localization using natural language descriptions","author":"Ghosh Soham","year":"2019","unstructured":"Soham Ghosh, Anuva Agarwal, Zarana Parekh, and Alexander Hauptmann. 2019. ExCL: Extractive clip localization using natural language descriptions. arXiv:1904.02755. Retrieved from https:\/\/arxiv.org\/abs\/1904.02755.","journal-title":"arXiv:1904.02755."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00033"},{"key":"e_1_3_2_19_2","first-page":"6325","article-title":"Making the V in VQA matter: Elevating the role of image understanding in visual question answering","author":"Goyal Yash","year":"2016","unstructured":"Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2016. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition.6325\u20136334.","journal-title":"Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_20_2","volume-title":"Proceedings of the Studies in Computational Intelligence","author":"Graves Alex","year":"2008","unstructured":"Alex Graves. 2008. Supervised sequence labelling with recurrent neural networks. 
In Proceedings of the Studies in Computational Intelligence."},{"key":"e_1_3_2_21_2","first-page":"6047","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.","year":"2018","unstructured":"Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. 2018. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.6047\u20136056."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"e_1_3_2_23_2","article-title":"Localizing moments in video with temporal language","author":"Hendricks Lisa Anne","year":"2018","unstructured":"Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2018. Localizing moments in video with temporal language. arXiv:1809.01337. Retrieved from https:\/\/arxiv.org\/abs\/1809.01337.","journal-title":"arXiv:1809.01337."},{"key":"e_1_3_2_24_2","article-title":"Gqa: A new dataset for compositional question answering over real-world images","author":"Hudson Drew A.","year":"2019","unstructured":"Drew A. Hudson and Christopher D. Manning. 2019. Gqa: A new dataset for compositional question answering over real-world images. arXiv:1902.09506. Retrieved from https:\/\/arxiv.org\/abs\/1902.09506.","journal-title":"arXiv:1902.09506."},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1643"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/1282280.1282352"},{"key":"e_1_3_2_27_2","unstructured":"Will Kay Joao Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola Tim Green Trevor Back Paul Natsev Mustafa Suleyman and Andrew Zisserman. 2017. The kinetics human action video dataset. arXiv:1705.06950. 
Retrieved from https:\/\/arxiv.org\/abs\/1705.06950."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1177\/001316447003000105"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.83"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"e_1_3_2_31_2","doi-asserted-by":"crossref","unstructured":"Jie Lei Licheng Yu Tamara L. Berg and Mohit Bansal. 2019. TVQA+: Spatio-Temporal grounding for video question answering. arXiv:1904.11574. Retrieved from https:\/\/arxiv.org\/abs\/1904.11574.","DOI":"10.18653\/v1\/2020.acl-main.730"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-2050"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123343"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3209978.3210003"},{"key":"e_1_3_2_35_2","first-page":"13","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the Advances in Neural Information Processing Systems. 13\u201323."},{"key":"e_1_3_2_36_2","article-title":"End-to-End learning of visual representations from uncurated instructional videos","author":"Miech Antoine","year":"2019","unstructured":"Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. 2019. End-to-End learning of visual representations from uncurated instructional videos. arXiv:1912.06430. 
Retrieved from https:\/\/arxiv.org\/abs\/1912.06430.","journal-title":"arXiv:1912.06430."},{"key":"e_1_3_2_37_2","article-title":"HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips","author":"Miech Antoine","year":"2019","unstructured":"Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. arXiv:1906.03327. Retrieved from https:\/\/arxiv.org\/abs\/1906.03327.","journal-title":"arXiv:1906.03327."},{"key":"e_1_3_2_38_2","unstructured":"Tomas Mikolov Kai Chen G. S. Corrado and J. Dean. 2013. Efficient estimation of word representations in vector space. arXiv:1301.3781. Retrieved from https:\/\/arxiv.org\/abs\/1301.3781."},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3206025.3206064"},{"key":"e_1_3_2_40_2","unstructured":"IEEE Transactions on Pattern Analysis and Machine Intelligence 2019 Moments in time dataset: One million videos for event understanding"},{"key":"e_1_3_2_41_2","volume-title":"Proceedings of the ECAI","author":"Motwani Tanvi S.","year":"2012","unstructured":"Tanvi S. Motwani and Raymond J. Mooney. 2012. Improving video activity recognition using object recognition and text mining. In Proceedings of the ECAI."},{"key":"e_1_3_2_42_2","article-title":"Multimodal abstractive summarization for how2 videos","author":"Palaskar Shruti","year":"2019","unstructured":"Shruti Palaskar, Jindrich Libovick\u1ef3, Spandana Gella, and Florian Metze. 2019. Multimodal abstractive summarization for how2 videos. arXiv:1906.07901. 
Retrieved from https:\/\/arxiv.org\/abs\/1906.07901.","journal-title":"arXiv:1906.07901."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.118"},{"key":"e_1_3_2_45_2","first-page":"779","article-title":"You only look once: Unified, real-time object detection","author":"Redmon Joseph","year":"2015","unstructured":"Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition.779\u2013788.","journal-title":"Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00207"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2012.6247801"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1641"},{"key":"e_1_3_2_49_2","article-title":"Charades-ego: A large-scale dataset of paired third and first person videos","author":"Sigurdsson Gunnar A.","year":"2018","unstructured":"Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. 2018. Charades-ego: A large-scale dataset of paired third and first person videos. arXiv:1804.09626. Retrieved from https:\/\/arxiv.org\/abs\/1804.09626.","journal-title":"arXiv:1804.09626."},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.235"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_31"},{"key":"e_1_3_2_52_2","article-title":"A corpus for reasoning about natural language grounded in photographs","author":"Suhr Alane","year":"2018","unstructured":"Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2018. A corpus for reasoning about natural language grounded in photographs. arXiv:1811.00491. 
Retrieved from https:\/\/arxiv.org\/abs\/1811.00491.","journal-title":"arXiv:1811.00491."},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00756"},{"key":"e_1_3_2_54_2","article-title":"Lxmert: Learning cross-modality encoder representations from transformers","author":"Tan Hao","year":"2019","unstructured":"Hao Tan and Mohit Bansal. 2019. Lxmert: Learning cross-modality encoder representations from transformers. arXiv:1908.07490. Retrieved from https:\/\/arxiv.org\/abs\/1908.07490.","journal-title":"arXiv:1908.07490."},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00130"},{"key":"e_1_3_2_56_2","doi-asserted-by":"crossref","unstructured":"Du Tran Lubomir Bourdev Rob Fergus Lorenzo Torresani and Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. arXiv:1412.0767. Retrieved from https:\/\/arxiv.org\/abs\/1412.0767.","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_57_2","unstructured":"Du Tran Lubomir D. Bourdev Rob Fergus Lorenzo Torresani and Manohar Paluri. 2014. C3D: Generic features for video analysis. arXiv:1412.0767. Retrieved from https:\/\/arxiv.org\/abs\/1412.0767."},{"key":"e_1_3_2_58_2","volume-title":"Proceedings of the NIPS","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the NIPS."},{"key":"e_1_3_2_59_2","first-page":"3169","article-title":"Action recognition by dense trajectories","author":"Wang Heng","year":"2011","unstructured":"Heng Wang, Alexander Kl\u00e4ser, Cordelia Schmid, and Cheng-Lin Liu. 2011. Action recognition by dense trajectories. 
CVPR (2011), 3169\u20133176.","journal-title":"CVPR"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_42"},{"key":"e_1_3_2_61_2","first-page":"21","article-title":"Stacked attention networks for image question answering","author":"Yang Zichao","year":"2015","unstructured":"Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alexander J. Smola. 2015. Stacked attention networks for image question answering. 2016 IEEE Conference on Computer Vision and Pattern Recognition. 21\u201329.","journal-title":"2016 IEEE Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"},{"key":"e_1_3_2_63_2","unstructured":"Yitian Yuan, Tao Mei, and Wenwu Zhu. 2018. To find where you talk: Temporal sentence localization in video with attention based location regression. arXiv:1804.07014. Retrieved from https:\/\/arxiv.org\/abs\/1804.07014."},{"key":"e_1_3_2_64_2","volume-title":"Proceedings of the AAAI","author":"Zhang Songyang","year":"2020","unstructured":"Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. 2020. Learning 2D temporal adjacent networks for moment localization with natural language. 
In Proceedings of the AAAI."},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00365"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3495211","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3495211","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:12:02Z","timestamp":1750191122000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3495211"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,31]]},"references-count":64,"journal-issue":{"issue":"3s","published-print":{"date-parts":[[2022,10,31]]}},"alternative-id":["10.1145\/3495211"],"URL":"https:\/\/doi.org\/10.1145\/3495211","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2022,10,31]]},"assertion":[{"value":"2021-02-15","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-11-02","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}