{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,21]],"date-time":"2026-03-21T02:19:37Z","timestamp":1774059577140,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":85,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3548268","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:46Z","timestamp":1665416566000},"page":"5407-5416","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["Transcript to Video: Efficient Clip Sequencing from Texts"],"prefix":"10.1145","author":[{"given":"Yu","family":"Xiong","sequence":"first","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, Hong Kong"}]},{"given":"Fabian Caba","family":"Heilbron","sequence":"additional","affiliation":[{"name":"Adobe Research, San Jose, CA, USA"}]},{"given":"Dahua","family":"Lin","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, Hong Kong"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/69.738360"},{"key":"e_1_3_2_2_2_1","volume-title":"We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos. arXiv preprint arXiv:2008.05596","author":"Andonian Alex","year":"2020","unstructured":"Alex Andonian , Camilo Fosco , Mathew Monfort , Allen Lee , Rogerio Feris , Carl Vondrick , and Aude Oliva . 2020. 
We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos. arXiv preprint arXiv:2008.05596 ( 2020 ). Alex Andonian, Camilo Fosco, Mathew Monfort, Allen Lee, Rogerio Feris, Carl Vondrick, and Aude Oliva. 2020. We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos. arXiv preprint arXiv:2008.05596 (2020)."},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.618"},{"key":"e_1_3_2_2_4_1","volume-title":"Grammar of the film language","author":"Arijon Daniel","unstructured":"Daniel Arijon . 1991. Grammar of the film language . Silman-James Press . Daniel Arijon. 1991. Grammar of the film language. Silman-James Press."},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00175"},{"key":"e_1_3_2_2_6_1","volume-title":"Grammar of the Edit","author":"Bowen Christopher J","unstructured":"Christopher J Bowen and Roy Thompson . 2017. Grammar of the Edit . Taylor & Francis . Christopher J Bowen and Roy Thompson. 2017. Grammar of the Edit. Taylor & Francis."},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_2_9_1","unstructured":"Brandon Castellano. 2012. PySceneDetect. https:\/\/github.com\/Breakthrough\/PySceneDetect.  Brandon Castellano. 2012. PySceneDetect. https:\/\/github.com\/Breakthrough\/PySceneDetect."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2018.00191"},{"key":"e_1_3_2_2_11_1","first-page":"148","article-title":"Declarative camera control for automatic cinematography","volume":"1","author":"Christianson David B","year":"1996","unstructured":"David B Christianson , Sean E Anderson , Li-wei He, David H Salesin , Daniel S Weld , and Michael F Cohen . 1996 . Declarative camera control for automatic cinematography . In AAAI\/IAAI , Vol. 1. 148 -- 155 . 
David B Christianson, Sean E Anderson, Li-wei He, David H Salesin, Daniel S Weld, and Michael F Cohen. 1996. Declarative camera control for automatic cinematography. In AAAI\/IAAI, Vol. 1. 148--155.","journal-title":"AAAI\/IAAI"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/211430.211431"},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.3167\/proj.2011.050102"},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00190"},{"key":"e_1_3_2_2_15_1","volume-title":"Jamie Ryan Kiros, and Sanja Fidler","author":"Faghri Fartash","year":"2017","unstructured":"Fartash Faghri , David J Fleet , Jamie Ryan Kiros, and Sanja Fidler . 2017 . VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612 (2017). Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2017. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612 (2017)."},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.607"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"crossref","unstructured":"David F. Fouhey Weicheng Kuo Alexei A. Efros and Jitendra Malik. 2018. From Lifestyle VLOGs to Everyday Interactions. In CVPR.  David F. Fouhey Weicheng Kuo Alexei A. Efros and Jitendra Malik. 2018. From Lifestyle VLOGs to Everyday Interactions. In CVPR.","DOI":"10.1109\/CVPR.2018.00524"},{"key":"e_1_3_2_2_18_1","volume-title":"Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems. 2121--2129.","author":"Frome Andrea","year":"2013","unstructured":"Andrea Frome , Greg S Corrado , Jon Shlens , Samy Bengio , Jeff Dean , Marc'Aurelio Ranzato , and Tomas Mikolov . 2013 . Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems. 2121--2129. Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. 
Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems. 2121--2129."},{"key":"e_1_3_2_2_19_1","volume-title":"Multi-modal Transformer for Video Retrieval. In European Conference on Computer Vision (ECCV)","volume":"5","author":"Gabeur Valentin","year":"2020","unstructured":"Valentin Gabeur , Chen Sun , Karteek Alahari , and Cordelia Schmid . 2020 . Multi-modal Transformer for Video Retrieval. In European Conference on Computer Vision (ECCV) , Vol. 5 . Springer. Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. 2020. Multi-modal Transformer for Video Retrieval. In European Conference on Computer Vision (ECCV), Vol. 5. Springer."},{"key":"e_1_3_2_2_20_1","volume-title":"Clip2tv: An empirical study on transformer-based methods for video-text retrieval. arXiv preprint arXiv:2111.05610","author":"Gao Zijian","year":"2021","unstructured":"Zijian Gao , Jingyu Liu , Sheng Chen , Dedan Chang , Hao Zhang , and Jinwei Yuan . 2021. Clip2tv: An empirical study on transformer-based methods for video-text retrieval. arXiv preprint arXiv:2111.05610 ( 2021 ). Zijian Gao, Jingyu Liu, Sheng Chen, Dedan Chang, Hao Zhang, and Jinwei Yuan. 2021. Clip2tv: An empirical study on transformer-based methods for video-text retrieval. arXiv preprint arXiv:2111.05610 (2021)."},{"key":"e_1_3_2_2_21_1","volume-title":"Advances in Neural Information Processing Systems","volume":"33","author":"Ging Simon","year":"2020","unstructured":"Simon Ging , Mohammadreza Zolfaghari , Hamed Pirsiavash , and Thomas Brox . 2020 . COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning . Advances in Neural Information Processing Systems , Vol. 33 (2020). Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, and Thomas Brox. 2020. COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. Advances in Neural Information Processing Systems , Vol. 
33 (2020)."},{"key":"e_1_3_2_2_22_1","unstructured":"Boqing Gong Wei-Lun Chao Kristen Grauman and Fei Sha. 2014. Diverse sequential subset selection for supervised video summarization. In Advances in neural information processing systems. 2069--2077.  Boqing Gong Wei-Lun Chao Kristen Grauman and Fei Sha. 2014. Diverse sequential subset selection for supervised video summarization. In Advances in neural information processing systems. 2069--2077."},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00633"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.337"},{"key":"e_1_3_2_2_25_1","volume-title":"Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 297--304","author":"Gutmann Michael","year":"2010","unstructured":"Michael Gutmann and Aapo Hyv\u00e4rinen. 2010 . Noise-contrastive estimation: A new estimation principle for unnormalized statistical models . In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 297--304 . Michael Gutmann and Aapo Hyv\u00e4rinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 297--304."},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00186"},{"key":"e_1_3_2_2_27_1","volume-title":"Memory-augmented dense predictive coding for video representation learning. arXiv preprint arXiv:2008.01065","author":"Han Tengda","year":"2020","unstructured":"Tengda Han , Weidi Xie , and Andrew Zisserman . 2020. Memory-augmented dense predictive coding for video representation learning. arXiv preprint arXiv:2008.01065 ( 2020 ). Tengda Han, Weidi Xie, and Andrew Zisserman. 2020. Memory-augmented dense predictive coding for video representation learning. 
arXiv preprint arXiv:2008.01065 (2020)."},{"key":"e_1_3_2_2_28_1","volume-title":"The Steadicam\u00ae Operator's Handbook","author":"Holway Jerry","unstructured":"Jerry Holway and Laurie Hayball . 2013. The Steadicam\u00ae Operator's Handbook . CRC Press . Jerry Holway and Laurie Hayball. 2013. The Steadicam\u00ae Operator's Handbook. CRC Press."},{"key":"e_1_3_2_2_29_1","volume-title":"Finding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5948--5957","author":"Huang De-An","year":"2018","unstructured":"De-An Huang , Shyamal Buch , Lucio Dery , Animesh Garg , Li Fei-Fei , and Juan Carlos Niebles . 2018 . Finding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5948--5957 . De-An Huang, Shyamal Buch, Lucio Dery, Animesh Garg, Li Fei-Fei, and Juan Carlos Niebles. 2018. Finding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5948--5957."},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3372278.3390695"},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58548-8_41"},{"key":"e_1_3_2_2_32_1","unstructured":"American Film Industry. 2020. AFI Catalog of Feature Films The First 100 Years 1893--1993 THE GREAT TRAIN ROBBERY. https:\/\/catalog.afi.com\/Catalog\/moviedetails\/31841.  American Film Industry. 2020. AFI Catalog of Feature Films The First 100 Years 1893--1993 THE GREAT TRAIN ROBBERY. https:\/\/catalog.afi.com\/Catalog\/moviedetails\/31841."},{"key":"e_1_3_2_2_33_1","volume-title":"Video Representation Learning by Recognizing Temporal Transformations. arXiv preprint arXiv:2007.10730","author":"Jenni Simon","year":"2020","unstructured":"Simon Jenni , Givi Meishvili , and Paolo Favaro . 2020. Video Representation Learning by Recognizing Temporal Transformations. arXiv preprint arXiv:2007.10730 ( 2020 ). Simon Jenni, Givi Meishvili, and Paolo Favaro. 2020. 
Video Representation Learning by Recognizing Temporal Transformations. arXiv preprint arXiv:2007.10730 (2020)."},{"key":"e_1_3_2_2_34_1","unstructured":"Will Kay Joao Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola Tim Green Trevor Back Paul Natsev etal 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).  Will Kay Joao Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola Tim Green Trevor Back Paul Natsev et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)."},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.83"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073653"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.79"},{"key":"e_1_3_2_2_38_1","volume-title":"Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696","author":"Lei Jie","year":"2018","unstructured":"Jie Lei , Licheng Yu , Mohit Bansal , and Tamara L Berg . 2018 . Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696 (2018). Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. 2018. Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696 (2018)."},{"key":"e_1_3_2_2_39_1","volume-title":"Tvr: A large-scale dataset for video-subtitle moment retrieval. arXiv preprint arXiv:2001.09099","author":"Lei Jie","year":"2020","unstructured":"Jie Lei , Licheng Yu , Tamara L Berg , and Mohit Bansal . 2020 . Tvr: A large-scale dataset for video-subtitle moment retrieval. arXiv preprint arXiv:2001.09099 (2020). Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. 2020. Tvr: A large-scale dataset for video-subtitle moment retrieval. 
arXiv preprint arXiv:2001.09099 (2020)."},{"key":"e_1_3_2_2_40_1","volume-title":"HERO: Hierarchical Encoder for Video Language Omni-representation Pre-training. arXiv preprint arXiv:2005.00200","author":"Li Linjie","year":"2020","unstructured":"Linjie Li , Yen-Chun Chen , Yu Cheng , Zhe Gan , Licheng Yu , and Jingjing Liu . 2020 . HERO: Hierarchical Encoder for Video Language Omni-representation Pre-training. arXiv preprint arXiv:2005.00200 (2020). Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. HERO: Hierarchical Encoder for Video Language Omni-representation Pre-training. arXiv preprint arXiv:2005.00200 (2020)."},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01237-3_10"},{"key":"e_1_3_2_2_42_1","volume-title":"Use what you have: Video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487","author":"Liu Yang","year":"2019","unstructured":"Yang Liu , Samuel Albanie , Arsha Nagrani , and Andrew Zisserman . 2019. Use what you have: Video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487 ( 2019 ). Yang Liu, Samuel Albanie, Arsha Nagrani, and Andrew Zisserman. 2019. Use what you have: Video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487 (2019)."},{"key":"e_1_3_2_2_43_1","volume-title":"Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860","author":"Luo Huaishao","year":"2021","unstructured":"Huaishao Luo , Lei Ji , Ming Zhong , Yang Chen , Wen Lei , Nan Duan , and Tianrui Li. 2021. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860 ( 2021 ). Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2021. Clip4clip: An empirical study of clip for end to end video clip retrieval. 
arXiv preprint arXiv:2104.08860 (2021)."},{"key":"e_1_3_2_2_44_1","unstructured":"Oded Maron and Tom\u00e1s Lozano-P\u00e9rez. 1998. A framework for multiple-instance learning. In Advances in neural information processing systems. 570--576.  Oded Maron and Tom\u00e1s Lozano-P\u00e9rez. 1998. A framework for multiple-instance learning. In Advances in neural information processing systems. 570--576."},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00990"},{"key":"e_1_3_2_2_46_1","volume-title":"Learning a text-video embedding from incomplete and heterogeneous data. arXiv preprint arXiv:1804.02516","author":"Miech Antoine","year":"2018","unstructured":"Antoine Miech , Ivan Laptev , and Josef Sivic . 2018. Learning a text-video embedding from incomplete and heterogeneous data. arXiv preprint arXiv:1804.02516 ( 2018 ). Antoine Miech, Ivan Laptev, and Josef Sivic. 2018. Learning a text-video embedding from incomplete and heterogeneous data. arXiv preprint arXiv:1804.02516 (2018)."},{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00272"},{"key":"e_1_3_2_2_48_1","volume-title":"Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov , Kai Chen , Greg Corrado , and Jeffrey Dean . 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 ( 2013 ). Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)."},{"key":"e_1_3_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_32"},{"key":"e_1_3_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3206025.3206064"},{"key":"e_1_3_2_2_51_1","volume-title":"The Early Cinema of Edwin S. Porter","author":"Musser Charles","year":"2011","unstructured":"Charles Musser . 2011. 
The Early Cinema of Edwin S. Porter . The Wiley-Blackwell History of American Film ( 2011 ). Charles Musser. 2011. The Early Cinema of Edwin S. Porter. The Wiley-Blackwell History of American Film (2011)."},{"key":"e_1_3_2_2_52_1","volume-title":"Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748","author":"van den Oord Aaron","year":"2018","unstructured":"Aaron van den Oord , Yazhe Li , and Oriol Vinyals . 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 ( 2018 ). Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)."},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00778"},{"key":"e_1_3_2_2_54_1","volume-title":"Spatiotemporal contrastive video representation learning. arXiv preprint arXiv:2008.03800","author":"Qian Rui","year":"2020","unstructured":"Rui Qian , Tianjian Meng , Boqing Gong , Ming-Hsuan Yang , Huisheng Wang , Serge Belongie , and Yin Cui . 2020. Spatiotemporal contrastive video representation learning. arXiv preprint arXiv:2008.03800 ( 2020 ). Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. 2020. Spatiotemporal contrastive video representation learning. arXiv preprint arXiv:2008.03800 (2020)."},{"key":"e_1_3_2_2_55_1","volume-title":"International Conference on Machine Learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford , Jong Wook Kim , Chris Hallacy , Aditya Ramesh , Gabriel Goh , Sandhini Agarwal , Girish Sastry , Amanda Askell , Pamela Mishkin , Jack Clark , 2021 . Learning transferable visual models from natural language supervision . In International Conference on Machine Learning. PMLR, 8748--8763 . 
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748--8763."},{"key":"e_1_3_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298940"},{"key":"e_1_3_2_2_57_1","volume-title":"How2: a large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347","author":"Sanabria Ramon","year":"2018","unstructured":"Ramon Sanabria , Ozan Caglayan , Shruti Palaskar , Desmond Elliott , Lo\u00efc Barrault, Lucia Specia , and Florian Metze . 2018. How2: a large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347 ( 2018 ). Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Lo\u00efc Barrault, Lucia Specia, and Florian Metze. 2018. How2: a large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347 (2018)."},{"key":"e_1_3_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01240-3_13"},{"key":"e_1_3_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01219-9_32"},{"key":"e_1_3_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.229"},{"key":"e_1_3_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/1518701.1518825"},{"key":"e_1_3_2_2_62_1","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition. 5179--5187","author":"Song Yale","year":"2015","unstructured":"Yale Song , Jordi Vallmitjana , Amanda Stent , and Alejandro Jaimes . 2015 . Tvsum: Summarizing web videos using titles . In Proceedings of the IEEE conference on computer vision and pattern recognition. 5179--5187 . Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. 2015. Tvsum: Summarizing web videos using titles. 
In Proceedings of the IEEE conference on computer vision and pattern recognition. 5179--5187."},{"key":"e_1_3_2_2_63_1","volume-title":"Learning Video Representations using Contrastive Bidirectional Transformer. arXiv preprint arXiv:1906.05743","author":"Sun Chen","year":"2019","unstructured":"Chen Sun , Fabien Baradel , Kevin Murphy , and Cordelia Schmid . 2019. Learning Video Representations using Contrastive Bidirectional Transformer. arXiv preprint arXiv:1906.05743 ( 2019 ). Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019. Learning Video Representations using Contrastive Bidirectional Transformer. arXiv preprint arXiv:1906.05743 (2019)."},{"key":"e_1_3_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.501"},{"key":"e_1_3_2_2_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/2984511.2984569"},{"key":"e_1_3_2_2_67_1","volume-title":"Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729","author":"Venugopalan Subhashini","year":"2014","unstructured":"Subhashini Venugopalan , Huijuan Xu , Jeff Donahue , Marcus Rohrbach , Raymond Mooney , and Kate Saenko . 2014. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729 ( 2014 ). Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2014. Translating videos to natural language using deep recurrent neural networks. 
arXiv preprint arXiv:1412.4729 (2014)."},{"key":"e_1_3_2_2_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_2_2_69_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.18"},{"key":"e_1_3_2_2_70_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01261-8_24"},{"key":"e_1_3_2_2_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00413"},{"key":"e_1_3_2_2_72_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"e_1_3_2_2_73_1","doi-asserted-by":"publisher","DOI":"10.1145\/3355089.3356520"},{"key":"e_1_3_2_2_74_1","unstructured":"Wikipedia. 2020a. Montage (filmmaking). https:\/\/en.wikipedia.org\/wiki\/Montage_(filmmaking).  Wikipedia. 2020a. Montage (filmmaking). https:\/\/en.wikipedia.org\/wiki\/Montage_(filmmaking)."},{"key":"e_1_3_2_2_75_1","unstructured":"Wikipedia. 2020b. Soviet montage theory. https:\/\/en.wikipedia.org\/wiki\/Soviet_montage_theory.  Wikipedia. 2020b. Soviet montage theory. https:\/\/en.wikipedia.org\/wiki\/Soviet_montage_theory."},{"key":"e_1_3_2_2_76_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01267-0_19"},{"key":"e_1_3_2_2_77_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00469"},{"key":"e_1_3_2_2_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01058"},{"key":"e_1_3_2_2_79_1","volume-title":"Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084","author":"Xu Hu","year":"2021","unstructured":"Hu Xu , Gargi Ghosh , Po-Yao Huang , Dmytro Okhonko , Armen Aghajanyan , Florian Metze , Luke Zettlemoyer , and Christoph Feichtenhofer . 2021 . Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084 (2021). Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. 2021. Videoclip: Contrastive pre-training for zero-shot video-text understanding. 
arXiv preprint arXiv:2109.14084 (2021)."},{"key":"e_1_3_2_2_80_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.571"},{"key":"e_1_3_2_2_81_1","volume-title":"Video representation learning with visual tempo consistency. arXiv preprint arXiv:2006.15489","author":"Yang Ceyuan","year":"2020","unstructured":"Ceyuan Yang , Yinghao Xu , Bo Dai , and Bolei Zhou . 2020. Video representation learning with visual tempo consistency. arXiv preprint arXiv:2006.15489 ( 2020 ). Ceyuan Yang, Yinghao Xu, Bo Dai, and Bolei Zhou. 2020. Video representation learning with visual tempo consistency. arXiv preprint arXiv:2006.15489 (2020)."},{"key":"e_1_3_2_2_82_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_29"},{"key":"e_1_3_2_2_83_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00198"},{"key":"e_1_3_2_2_84_1","volume-title":"Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. arXiv preprint arXiv:1801.00054","author":"Zhou Kaiyang","year":"2017","unstructured":"Kaiyang Zhou , Yu Qiao , and Tao Xiang . 2017a. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. arXiv preprint arXiv:1801.00054 ( 2017 ). Kaiyang Zhou, Yu Qiao, and Tao Xiang. 2017a. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. arXiv preprint arXiv:1801.00054 (2017)."},{"key":"e_1_3_2_2_85_1","volume-title":"Towards automatic learning of procedures from web instructional videos. arXiv preprint arXiv:1703.09788","author":"Zhou Luowei","year":"2017","unstructured":"Luowei Zhou , Chenliang Xu , and Jason J Corso . 2017b. Towards automatic learning of procedures from web instructional videos. arXiv preprint arXiv:1703.09788 ( 2017 ). Luowei Zhou, Chenliang Xu, and Jason J Corso. 2017b. Towards automatic learning of procedures from web instructional videos. 
arXiv preprint arXiv:1703.09788 (2017)."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548268","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548268","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:42Z","timestamp":1750186842000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548268"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":85,"alternative-id":["10.1145\/3503161.3548268","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3548268","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}