{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T17:41:20Z","timestamp":1757612480635,"version":"3.44.0"},"reference-count":124,"publisher":"Association for Computing Machinery (ACM)","issue":"10","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,6]]},"abstract":"<jats:p>Recently, Video-Language Models (VideoLMs) have demonstrated remarkable capabilities, offering significant potential for flexible and powerful video query systems. These models typically rely on Vision Transformers (ViTs), which process video frames individually to extract visual embeddings. However, generating embeddings for large-scale videos requires ViT inferencing across numerous frames, posing a major hurdle to real-world deployment and necessitating solutions for integration into scalable video data management systems. This paper introduces D\u00e9j\u00e0 Vu, a video-language query engine that accelerates ViT-based VideoLMs by reusing computations across consecutive frames. At its core is ReuseViT, a modified ViT model specifically designed for VideoLM tasks, which learns to detect inter-frame reuse opportunities, striking an effective balance between accuracy and reuse. Although ReuseViT significantly reduces computation, these savings do not directly translate into performance gains on GPUs. To overcome this, D\u00e9j\u00e0 Vu integrates memory-compute joint compaction techniques that convert the FLOP savings into tangible performance gains. 
Evaluations on three VideoLM tasks show that D\u00e9j\u00e0 Vu accelerates embedding generation by up to 2.64\u00d7 within a 2% error bound, dramatically enhancing the practicality of VideoLMs for large-scale video analytics.<\/jats:p>","DOI":"10.14778\/3748191.3748195","type":"journal-article","created":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T13:50:16Z","timestamp":1756993816000},"page":"3284-3298","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["D\u00e9j\u00e0 Vu: Efficient Video-Language Query Engine with Learning-Based Inter-Frame Computation Reuse"],"prefix":"10.14778","volume":"18","author":[{"given":"Jinwoo","family":"Hwang","sequence":"first","affiliation":[{"name":"KAIST"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Daeun","family":"Kim","sequence":"additional","affiliation":[{"name":"KAIST"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sangyeop","family":"Lee","sequence":"additional","affiliation":[{"name":"KAIST"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yoonsung","family":"Kim","sequence":"additional","affiliation":[{"name":"KAIST"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guseul","family":"Heo","sequence":"additional","affiliation":[{"name":"KAIST"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hojoon","family":"Kim","sequence":"additional","affiliation":[{"name":"KAIST"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yunseok","family":"Jeong","sequence":"additional","affiliation":[{"name":"KAIST"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tadiwos","family":"Meaza","sequence":"additional","affiliation":[{"name":"KAIST"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Eunhyeok","family":"Park","sequence":"additional","affiliation":[{"name":"POSTECH"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jeongseob","
family":"Ahn","sequence":"additional","affiliation":[{"name":"Korea University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jongse","family":"Park","sequence":"additional","affiliation":[{"name":"KAIST"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,9,4]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Boggart: Towards General-Purpose acceleration of retrospective video analytics. In NSDI.","author":"Agarwal Neil","year":"2023","unstructured":"Neil Agarwal and Ravi Netravali. 2023. Boggart: Towards General-Purpose acceleration of retrospective video analytics. In NSDI."},{"volume-title":"Physical representation-based predicate optimization for a visual analytics database","author":"Anderson Michael R","key":"e_1_2_1_2_1","unstructured":"Michael R Anderson, Michael Cafarella, German Ros, and Thomas F Wenisch. 2019. Physical representation-based predicate optimization for a visual analytics database. In ICDE. IEEE, 1466\u20131477."},{"key":"e_1_2_1_3_1","volume-title":"Vivit: A video vision transformer. In ICCV. 6836\u20136846.","author":"Arnab Anurag","year":"2021","unstructured":"Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lu\u010di\u0107, and Cordelia Schmid. 2021. Vivit: A video vision transformer. In ICCV. 6836\u20136846."},{"key":"e_1_2_1_4_1","volume-title":"Pramod Chunduri, Subrata Mitra, and Joy Arulraj.","author":"Bang Jaeho","year":"2023","unstructured":"Jaeho Bang, Gaurav Tarlok Kakkar, Pramod Chunduri, Subrata Mitra, and Joy Arulraj. 2023. Seiden: Revisiting query processing in video database systems. VLDB 16, 9 (2023)."},{"key":"e_1_2_1_5_1","volume-title":"Miris: Fast object track queries in video. In SIGMOD. 1907\u20131921.","author":"Bastani Favyen","year":"2020","unstructured":"Favyen Bastani, Songtao He, Arjun Balasingam, Karthik Gopalakrishnan, Mohammad Alizadeh, Hari Balakrishnan, Michael Cafarella, Tim Kraska, and Sam Madden. 2020. 
Miris: Fast object track queries in video. In SIGMOD. 1907\u20131921."},{"key":"e_1_2_1_6_1","volume-title":"OTIF: Efficient tracker pre-processing over large video datasets. In SIGMOD. 2091\u20132104.","author":"Bastani Favyen","year":"2022","unstructured":"Favyen Bastani and Samuel Madden. 2022. OTIF: Efficient tracker pre-processing over large video datasets. In SIGMOD. 2091\u20132104."},{"key":"e_1_2_1_7_1","doi-asserted-by":"crossref","unstructured":"Markus Billeter Ola Olsson and Ulf Assarsson. 2009. Efficient stream compaction on wide SIMD many-core architectures. In HPG. 159\u2013166.","DOI":"10.1145\/1572769.1572795"},{"key":"e_1_2_1_8_1","volume-title":"Token merging: Your vit but faster. ICLR","author":"Bolya Daniel","year":"2023","unstructured":"Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. 2023. Token merging: Your vit but faster. ICLR (2023)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Mark Buckler Philip Bedoukian Suren Jayasuriya and Adrian Sampson. 2018. EVA2: Exploiting Temporal Redundancy in Live Computer Vision. In ISCA. 533\u2013546.","DOI":"10.1109\/ISCA.2018.00051"},{"key":"e_1_2_1_10_1","first-page":"19594","article-title":"Space-time mixing attention for video transformer","volume":"34","author":"Bulat Adrian","year":"2021","unstructured":"Adrian Bulat, Juan Manuel Perez Rua, Swathikiran Sudhakaran, Brais Martinez, and Georgios Tzimiropoulos. 2021. Space-time mixing attention for video transformer. NeurIPS 34 (2021), 19594\u201319607.","journal-title":"NeurIPS"},{"key":"e_1_2_1_11_1","volume-title":"Figo: Fine-grained query optimization in video analytics. In SIGMOD. 559\u2013572.","author":"Cao Jiashen","year":"2022","unstructured":"Jiashen Cao, Karan Sarkar, Ramyad Hadidi, Joy Arulraj, and Hyesoon Kim. 2022. Figo: Fine-grained query optimization in video analytics. In SIGMOD. 
559\u2013572."},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 12890\u201312903","author":"Cao Qingqing","year":"2023","unstructured":"Qingqing Cao, Bhargavi Paranjape, and Hannaneh Hajishirzi. 2023. PuMer: Pruning and Merging Tokens for Efficient Vision Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 12890\u201312903."},{"key":"e_1_2_1_13_1","doi-asserted-by":"crossref","unstructured":"Mathilde Caron Hugo Touvron Ishan Misra Herv\u00e9 J\u00e9gou Julien Mairal Piotr Bojanowski and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In ICCV. 9650\u20139660.","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"e_1_2_1_14_1","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision. 17164\u201317174","author":"Chen Mengzhao","year":"2023","unstructured":"Mengzhao Chen, Wenqi Shao, Peng Xu, Mingbao Lin, Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, and Ping Luo. 2023. Diffrate: Differentiable compression rate for efficient vision transformers. In Proceedings of the IEEE\/CVF international conference on computer vision. 17164\u201317174."},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Yinpeng Chen Xiyang Dai Mengchen Liu Dongdong Chen Lu Yuan and Zicheng Liu. 2020. Dynamic convolution: Attention over convolution kernels. In CVPR. 11030\u201311039.","DOI":"10.1109\/CVPR42600.2020.01104"},{"key":"e_1_2_1_16_1","doi-asserted-by":"crossref","unstructured":"Feng Cheng Xizi Wang Jie Lei David Crandall Mohit Bansal and Gedas Bertasius. 2023. VindLU: A Recipe for Effective Video-and-Language Pretraining. In CVPR.","DOI":"10.1109\/CVPR52729.2023.01034"},{"key":"e_1_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Joonmyung Choi Sanghyeok Lee Jaewon Chu Minhyuk Choi and Hyunwoo J Kim. 2024. 
vid-TLDR: Training Free Token merging for Light-weight Video Transformer. In CVPR. 18771\u201318781.","DOI":"10.1109\/CVPR52733.2024.01776"},{"key":"e_1_2_1_18_1","first-page":"9355","article-title":"Twins: Revisiting the design of spatial attention in vision transformers","volume":"34","author":"Chu Xiangxiang","year":"2021","unstructured":"Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021. Twins: Revisiting the design of spatial attention in vision transformers. NIPS 34 (2021), 9355\u20139366.","journal-title":"NIPS"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the Twelfth Language Resources and Evaluation Conference. 5450\u20135455","author":"Colas Anthony","year":"2020","unstructured":"Anthony Colas, Seokhwan Kim, Franck Dernoncourt, Siddhesh Gupte, Zhe Wang, and Doo Soon Kim. 2020. TutorialVQA: Question Answering Dataset for Tutorial Videos. In Proceedings of the Twelfth Language Resources and Evaluation Conference. 5450\u20135455."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TII.2022.3162598"},{"key":"e_1_2_1_21_1","volume-title":"TASM: A Tile-Based Storage Manager for Video Analytics. In ICDE. 1775\u20131786.","author":"Daum Maureen","year":"2021","unstructured":"Maureen Daum, Brandon Haynes, Dong He, Amrita Mazumdar, and Magdalena Balazinska. 2021. TASM: A Tile-Based Storage Manager for Video Analytics. In ICDE. 1775\u20131786."},{"key":"e_1_2_1_22_1","first-page":"4188","article-title":"Vocalexplore: Pay-as-you-go video data exploration and model building","volume":"16","author":"Daum Maureen","year":"2023","unstructured":"Maureen Daum, Enhao Zhang, Dong He, Stephen Mussmann, Brandon Haynes, Ranjay Krishna, and Magdalena Balazinska. 2023. Vocalexplore: Pay-as-you-go video data exploration and model building. 
VLDB 16, 13 (2023), 4188\u20134201.","journal-title":"VLDB"},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the Human Factors and Ergonomics Society Annual Meeting.","author":"Deb Shuchisnigdha","year":"2018","unstructured":"Shuchisnigdha Deb, Christopher R Hudson, Daniel W Carruth, and Darren Frey. 2018. Pedestrians Receptivity in Autonomous Vehicles: Exploring a Video-based Assessment. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting."},{"key":"e_1_2_1_24_1","volume-title":"Heatvit: Hardware-efficient adaptive token pruning for vision transformers","author":"Dong Peiyan","year":"2023","unstructured":"Peiyan Dong, Mengshu Sun, Alec Lu, Yanyue Xie, Kenneth Liu, Zhenglun Kong, Xin Meng, Zhengang Li, Xue Lin, Zhenman Fang, et al. 2023. Heatvit: Hardware-efficient adaptive token pruning for vision transformers. In HPCA. IEEE."},{"key":"e_1_2_1_25_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR."},{"key":"e_1_2_1_26_1","doi-asserted-by":"crossref","unstructured":"Matthew Dutson Yin Li and Mohit Gupta. 2023. Eventful transformers: leveraging temporal redundancy in vision transformers. In ICCV. 16911\u201316923.","DOI":"10.1109\/ICCV51070.2023.01551"},{"key":"e_1_2_1_27_1","volume-title":"Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and J\u00fcrgen Gall.","author":"Fayyaz Mohsen","year":"2022","unstructured":"Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and J\u00fcrgen Gall. 2022. Adaptive token sampling for efficient vision transformers. In ECCV. 
Springer."},{"key":"e_1_2_1_28_1","unstructured":"Wensheng Gan Zhenlian Qi Jiayang Wu and Jerry Chun-Wei Lin. 2023. Large language models in education: Vision and opportunities. In BigData."},{"key":"e_1_2_1_29_1","volume-title":"Comprehensive Review, and Challenges. Recent Trends in Computational Intelligence","author":"Gawande Ujwalla","year":"2020","unstructured":"Ujwalla Gawande, Kamal Hajari, and Yogesh Golhar. 2020. Pedestrian Detection and Tracking in Video Surveillance System: Issues, Comprehensive Review, and Challenges. Recent Trends in Computational Intelligence (2020), 1\u201324."},{"key":"e_1_2_1_30_1","doi-asserted-by":"crossref","unstructured":"Benjamin Graham Alaaeldin El-Nouby Hugo Touvron Pierre Stock Armand Joulin Herv\u00e9 J\u00e9gou and Matthijs Douze. 2021. Levit: a vision transformer in convnet's clothing for faster inference. In ICCV.","DOI":"10.1109\/ICCV48922.2021.01204"},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the 21st Workshop on Biomedical Language Processing.","author":"Gupta Deepak","year":"2022","unstructured":"Deepak Gupta and Dina Demner-Fushman. 2022. Overview of the MedVidQA 2022 shared task on medical video question-answering. In Proceedings of the 21st Workshop on Biomedical Language Processing."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2018.2856261"},{"key":"e_1_2_1_33_1","volume-title":"VSS: A Storage System for Video Analytics. In SIGMOD.","author":"Haynes Brandon","year":"2021","unstructured":"Brandon Haynes, Maureen Daum, Dong He, Amrita Mazumdar, Magdalena Balazinska, Alvin Cheung, and Luis Ceze. 2021. VSS: A Storage System for Video Analytics. In SIGMOD."},{"key":"e_1_2_1_34_1","unstructured":"Byeongho Heo Sangdoo Yun Dongyoon Han Sanghyuk Chun Junsuk Choe and Seong Joon Oh. 2021. Rethinking spatial dimensions of vision transformers. In ICCV."},{"key":"e_1_2_1_35_1","volume-title":"Focus: Querying Large Video Datasets with Low Latency and Low Cost. 
In OSDI.","author":"Hsieh Kevin","year":"2018","unstructured":"Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B. Gibbons, and Onur Mutlu. 2018. Focus: Querying Large Video Datasets with Low Latency and Low Cost. In OSDI."},{"key":"e_1_2_1_36_1","unstructured":"Jinwoo Hwang Minsu Kim Daeun Kim Seungho Nam Yoonsung Kim Dohee Kim Hardik Sharma and Jongse Park. 2022. CoVA: Exploiting Compressed-Domain Analysis to Accelerate Video Analytics. In ATC."},{"key":"e_1_2_1_37_1","unstructured":"Chao Jia Yinfei Yang Ye Xia Yi-Ting Chen Zarana Parekh Hieu Pham Quoc Le Yun-Hsuan Sung Zhen Li and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML."},{"key":"e_1_2_1_38_1","volume-title":"Chameleon: Scalable Adaptation of Video Analytics. In SIGCOMM.","author":"Jiang Junchen","year":"2018","unstructured":"Junchen Jiang, Ganesh Ananthanarayanan, Peter Bod\u00edk, Siddhartha Sen, and Ion Stoica. 2018. Chameleon: Scalable Adaptation of Video Analytics. In SIGCOMM."},{"volume-title":"Prompting visual-language models for efficient video understanding","author":"Ju Chen","key":"e_1_2_1_39_1","unstructured":"Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. 2022. Prompting visual-language models for efficient video understanding. In ECCV. Springer."},{"key":"e_1_2_1_40_1","volume-title":"Hydro: Adaptive Query Processing of ML Queries. arXiv preprint arXiv:2403.14902","author":"Kakkar Gaurav Tarlok","year":"2024","unstructured":"Gaurav Tarlok Kakkar, Jiashen Cao, Aubhro Sengupta, Joy Arulraj, and Hyesoon Kim. 2024. Hydro: Adaptive Query Processing of ML Queries. arXiv preprint arXiv:2403.14902 (2024)."},{"key":"e_1_2_1_41_1","volume-title":"BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics. In","author":"Kang Daniel","year":"2019","unstructured":"Daniel Kang, Peter Bailis, and Matei Zaharia. 
2019. BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics. In PVLDB."},{"key":"e_1_2_1_42_1","doi-asserted-by":"crossref","unstructured":"Daniel Kang John Emmons Firas Abuzaid Peter Bailis and Matei Zaharia. 2017. NoScope: Optimizing Neural Network Queries over Video at Scale. In PVLDB.","DOI":"10.14778\/3137628.3137664"},{"key":"e_1_2_1_43_1","doi-asserted-by":"crossref","unstructured":"Daniel Kang John Guibas Peter Bailis Tatsunori Hashimoto and Matei Zaharia. 2021. Task-agnostic Indexes for Deep Learning-based Queries over Unstructured Data. In PVLDB.","DOI":"10.1145\/3514221.3517897"},{"key":"e_1_2_1_44_1","volume-title":"VIVA: An End-to-End System for Interactive Video Analytics. In CIDR.","author":"Kang Daniel","year":"2022","unstructured":"Daniel Kang, Francisco Romero, Peter D Bailis, Christos Kozyrakis, and Matei Zaharia. 2022. VIVA: An End-to-End System for Interactive Video Analytics. In CIDR."},{"key":"e_1_2_1_45_1","volume-title":"Vilt: Vision-and-language transformer without convolution or region supervision. In ICML.","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In ICML."},{"key":"e_1_2_1_46_1","volume-title":"Spatialyze: A Geospatial Video Analytics System with Spatial-Aware Optimizations. VLDB","author":"Kittivorawong Chanwut","year":"2024","unstructured":"Chanwut Kittivorawong, Yongming Ge, Yousef Helal, and Alvin Cheung. 2024. Spatialyze: A Geospatial Video Analytics System with Spatial-Aware Optimizations. VLDB (2024)."},{"key":"e_1_2_1_47_1","volume-title":"Spvit: Enabling faster vision transformers via soft token pruning. ECCV","author":"Kong Zhenglun","year":"2022","unstructured":"Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Mengshu Sun, Wei Niu, Xuan Shen, Geng Yuan, Bin Ren, Minghai Qin, et al. 2022. 
Spvit: Enabling faster vision transformers via soft token pruning. ECCV (2022)."},{"key":"e_1_2_1_48_1","volume-title":"Everest: A top-k deep video analytics system. In SIGMOD. 2357\u20132360.","author":"Lai Ziliang","year":"2022","unstructured":"Ziliang Lai, Chris Liu, Chenxia Han, Pengfei Zhang, Eric Lo, and Ben Kao. 2022. Everest: A top-k deep video analytics system. In SIGMOD. 2357\u20132360."},{"key":"e_1_2_1_49_1","volume-title":"LVS: A Learned Video Storage for Fast and Efficient Video Understanding. In CVPRW.","author":"Lee Yunghee","year":"2024","unstructured":"Yunghee Lee and Jongse Park. 2024. LVS: A Learned Video Storage for Fast and Efficient Video Understanding. In CVPRW."},{"key":"e_1_2_1_50_1","volume-title":"Juan Carlos Niebles, and Steven CH Hoi","author":"Li Dongxu","year":"2022","unstructured":"Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, and Steven CH Hoi. 2022. Align and prompt: Video-and-language pre-training with entity prompts. In CVPR."},{"key":"e_1_2_1_51_1","volume-title":"ISEE: an intelligent scene exploration and evaluation platform for large-scale visual surveillance. TPDS","author":"Li Da","year":"2019","unstructured":"Da Li, Zhang Zhang, Kai Yu, Kaiqi Huang, and Tieniu Tan. 2019. ISEE: an intelligent scene exploration and evaluation platform for large-scale visual surveillance. TPDS (2019)."},{"key":"e_1_2_1_52_1","volume-title":"Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML.","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML."},{"key":"e_1_2_1_53_1","volume-title":"Align before fuse: Vision and language representation learning with momentum distillation. 
NIPS","author":"Li Junnan","year":"2021","unstructured":"Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. NIPS (2021)."},{"key":"e_1_2_1_54_1","unstructured":"Kunchang Li Yali Wang Yizhuo Li Yi Wang Yinan He Limin Wang and Yu Qiao. 2023. Unmasked teacher: Towards training-efficient video foundation models. In ICCV."},{"key":"e_1_2_1_55_1","volume-title":"HERO: Hierarchical Encoder for Video+ Language Omni-representation Pre-training. In EMNLP.","author":"Li Linjie","year":"2020","unstructured":"Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. HERO: Hierarchical Encoder for Video+ Language Omni-representation Pre-training. In EMNLP."},{"key":"e_1_2_1_56_1","volume-title":"Hero: Hierarchical spatiotemporal reasoning with contrastive action correspondence for end-to-end video object grounding. In MM.","author":"Li Mengze","year":"2022","unstructured":"Mengze Li, Tianbao Wang, Haoyu Zhang, Shengyu Zhang, Zhou Zhao, Wenqiao Zhang, Jiaxu Miao, Shiliang Pu, and Fei Wu. 2022. Hero: Hierarchical spatiotemporal reasoning with contrastive action correspondence for end-to-end video object grounding. In MM."},{"key":"e_1_2_1_57_1","first-page":"1763","article-title":"Elf: Erasing-based lossless floating-point compression","volume":"16","author":"Li Ruiyuan","year":"2023","unstructured":"Ruiyuan Li, Zheng Li, Yi Wu, Chao Chen, and Yu Zheng. 2023. Elf: Erasing-based lossless floating-point compression. VLDB 16, 7 (2023), 1763\u20131776.","journal-title":"VLDB"},{"key":"e_1_2_1_58_1","volume-title":"VidToMe: Video Token Merging for Zero-Shot Video Editing. CVPR","author":"Li Xirui","year":"2024","unstructured":"Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. 2024. VidToMe: Video Token Merging for Zero-Shot Video Editing. 
CVPR (2024)."},{"key":"e_1_2_1_59_1","volume-title":"Guoqing Harry Xu, and Ravi Netravali","author":"Li Yuanqi","year":"2020","unstructured":"Yuanqi Li, Arthi Padmanabhan, Pengzhan Zhao, Yufei Wang, Guoqing Harry Xu, and Ravi Netravali. 2020. Reducto: On-camera filtering for resource-efficient real-time video analytics. In SIGCOMM."},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3470496.3527423"},{"key":"e_1_2_1_61_1","first-page":"3058","article-title":"Chimp: efficient lossless floating point compression for time series databases","volume":"15","author":"Liakos Panagiotis","year":"2022","unstructured":"Panagiotis Liakos, Katia Papakonstantinopoulou, and Yannis Kotidis. 2022. Chimp: efficient lossless floating point compression for time series databases. VLDB 15, 11 (2022), 3058\u20133070.","journal-title":"VLDB"},{"key":"e_1_2_1_62_1","unstructured":"Youwei Liang Chongjian Ge Zhan Tong Yibing Song Jue Wang and Pengtao Xie. 2022. Not all patches are what you need: Expediting vision transformers via token reorganizations. In ICLR."},{"key":"e_1_2_1_63_1","doi-asserted-by":"crossref","unstructured":"Ze Liu Yutong Lin Yue Cao Han Hu Yixuan Wei Zheng Zhang Stephen Lin and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_2_1_64_1","volume-title":"CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing","author":"Luo Huaishao","year":"2022","unstructured":"Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2022. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. 
Neurocomputing (2022)."},{"key":"e_1_2_1_65_1","volume-title":"Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, and Oncel Tuzel.","author":"Marin Dmitrii","year":"2021","unstructured":"Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, and Oncel Tuzel. 2021. Token Pooling in Vision Transformers. arXiv:2110.03860 [cs.CV]"},{"key":"e_1_2_1_66_1","volume-title":"SeeSaw: interactive ad-hoc search over image databases. PACMMOD","author":"Moll Oscar","year":"2023","unstructured":"Oscar Moll, Manuel Favela, Samuel Madden, Vijay Gadepally, and Michael Cafarella. 2023. SeeSaw: interactive ad-hoc search over image databases. PACMMOD (2023)."},{"key":"e_1_2_1_67_1","volume-title":"Ilchae Jung, and Bohyung Han.","author":"Mun Jonghwan","year":"2017","unstructured":"Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung, and Bohyung Han. 2017. Marioqa: Answering questions by watching gameplay videos. In ICCV."},{"key":"e_1_2_1_68_1","volume-title":"DINOv2: Learning Robust Visual Features without Supervision. TMLR","author":"Oquab Maxime","year":"2024","unstructured":"Maxime Oquab, Timoth\u00e9e Darcet, Th\u00e9o Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. 2024. DINOv2: Learning Robust Visual Features without Supervision. TMLR (2024)."},{"key":"e_1_2_1_69_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR).","author":"Pan Bowen","year":"2021","unstructured":"Bowen Pan, Rameswar Panda, Camilo Fosco, Chung-Ching Lin, Alex Andonian, Yue Meng, Kate Saenko, Aude Oliva, and Rogerio Feris. 2021. Video Adaptive Redundancy Reduction. 
In Proceedings of the International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_1_70_1","volume-title":"Edgevits: Competing light-weight cnns on mobile devices with vision transformers. In ECCV.","author":"Pan Junting","year":"2022","unstructured":"Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Lukasz Dudziak, Hongsheng Li, Georgios Tzimiropoulos, and Brais Martinez. 2022. Edgevits: Competing light-weight cnns on mobile devices with vision transformers. In ECCV."},{"key":"e_1_2_1_71_1","doi-asserted-by":"crossref","unstructured":"Mathias Parger Chengcheng Tang Thomas Neff Christopher D Twigg Cem Keskin Robert Wang and Markus Steinberger. 2023. MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos with Spherical Buffers and Padded Convolutions. In ICCV.","DOI":"10.1109\/ICCV51070.2023.01586"},{"key":"e_1_2_1_72_1","doi-asserted-by":"crossref","unstructured":"Mathias Parger Chengcheng Tang Christopher D Twigg Cem Keskin Robert Wang and Markus Steinberger. 2022. DeltaCNN: End-to-end CNN inference of sparse frame differences in videos. In CVPR.","DOI":"10.1109\/CVPR52688.2022.01217"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824078"},{"key":"e_1_2_1_74_1","doi-asserted-by":"crossref","unstructured":"AJ Piergiovanni Weicheng Kuo and Anelia Angelova. 2023. Rethinking video vits: Sparse video tubes for joint image and video learning. In CVPR.","DOI":"10.1109\/CVPR52729.2023.00220"},{"key":"e_1_2_1_75_1","volume-title":"Scanner: Efficient video analysis at scale. TOG","author":"Poms Alex","year":"2018","unstructured":"Alex Poms, Will Crichton, Pat Hanrahan, and Kayvon Fatahalian. 2018. Scanner: Efficient video analysis at scale. 
TOG (2018)."},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589057"},{"key":"e_1_2_1_77_1","volume-title":"Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al.","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In ICML. PMLR, 8748\u20138763."},{"key":"e_1_2_1_78_1","volume-title":"Dynamicvit: Efficient vision transformers with dynamic token sparsification. NIPS 34","author":"Rao Yongming","year":"2021","unstructured":"Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. 2021. Dynamicvit: Efficient vision transformers with dynamic token sparsification. NIPS 34 (2021)."},{"key":"e_1_2_1_79_1","volume-title":"TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding. In EMNLP.","author":"Ren Shuhuai","year":"2023","unstructured":"Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, and Lu Hou. 2023. TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding. In EMNLP."},{"key":"e_1_2_1_80_1","volume-title":"Zelda: Video analytics using vision-language models. arXiv preprint arXiv:2305.03785","author":"Romero Francisco","year":"2023","unstructured":"Francisco Romero, Caleb Winston, Johann Hauswald, Matei Zaharia, and Christos Kozyrakis. 2023. Zelda: Video analytics using vision-language models. arXiv preprint arXiv:2305.03785 (2023)."},{"key":"e_1_2_1_81_1","volume-title":"Accelerating Aggregation Queries on Unstructured Streams of Data. VLDB","author":"Russo Matthew","year":"2023","unstructured":"Matthew Russo, Tatsunori Hashimoto, Daniel Kang, Yi Sun, and Matei Zaharia. 2023. Accelerating Aggregation Queries on Unstructured Streams of Data. 
VLDB (2023)."},{"key":"e_1_2_1_82_1","volume-title":"Annual conference on medical image understanding and analysis. Springer.","author":"Sanderson Edward","year":"2022","unstructured":"Edward Sanderson and Bogdan J Matuszewski. 2022. FCN-transformer feature fusion for polyp segmentation. In Annual conference on medical image understanding and analysis. Springer."},{"key":"e_1_2_1_83_1","volume-title":"K-lite: Learning transferable visual models with external knowledge. NIPS","author":"Shen Sheng","year":"2022","unstructured":"Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, et al. 2022. K-lite: Learning transferable visual models with external knowledge. NIPS (2022)."},{"key":"e_1_2_1_84_1","volume-title":"Flexgen: High-throughput generative inference of large language models with a single gpu. In ICML.","author":"Sheng Ying","year":"2023","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R\u00e9, Ion Stoica, and Ce Zhang. 2023. Flexgen: High-throughput generative inference of large language models with a single gpu. In ICML."},{"key":"e_1_2_1_85_1","volume-title":"Flava: A foundational language and vision alignment model. In CVPR.","author":"Singh Amanpreet","year":"2022","unstructured":"Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 2022. Flava: A foundational language and vision alignment model. In CVPR."},{"key":"e_1_2_1_86_1","volume-title":"CMC: Video Transformer Acceleration via CODEC Assisted Matrix Condensing. In ASPLOS.","author":"Song Zhuoran","year":"2024","unstructured":"Zhuoran Song, Chunyu Qi, Fangxin Liu, Naifeng Jing, and Xiaoyao Liang. 2024. CMC: Video Transformer Acceleration via CODEC Assisted Matrix Condensing. 
In ASPLOS."},{"key":"e_1_2_1_87_1","volume-title":"Vr-dann: Real-time video recognition via decoder-assisted neural network acceleration","author":"Song Zhuoran","year":"2020","unstructured":"Zhuoran Song, Feiyang Wu, Xueyuan Liu, Jing Ke, Naifeng Jing, and Xiaoyao Liang. 2020. Vr-dann: Real-time video recognition via decoder-assisted neural network acceleration. In MICRO. IEEE, 698\u2013710."},{"key":"e_1_2_1_88_1","volume-title":"Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning. NIPS","author":"Sun Yuchong","year":"2022","unstructured":"Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, and Jianlong Fu. 2022. Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning. NIPS (2022)."},{"key":"e_1_2_1_89_1","volume-title":"ODIN: automated drift detection and recovery in video analytics. VLDB","author":"Suprem Abhijit","year":"2020","unstructured":"Abhijit Suprem, Joy Arulraj, Calton Pu, and Joao Ferreira. 2020. ODIN: automated drift detection and recovery in video analytics. VLDB (2020)."},{"key":"e_1_2_1_90_1","doi-asserted-by":"crossref","unstructured":"Ra\u00fal Taranco Jos\u00e9-Mar\u00eda Arnau and Antonio Gonz\u00e1lez. 2023. \u03b4LTA: Decoupling Camera Sampling from Processing to Avoid Redundant Computations in the Vision Pipeline. In MICRO.","DOI":"10.1145\/3613424.3614261"},{"key":"e_1_2_1_91_1","volume-title":"Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. NeurIPS","author":"Tong Zhan","year":"2022","unstructured":"Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. 2022. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. NeurIPS (2022)."},{"key":"e_1_2_1_92_1","volume-title":"Biometric surveillance using visual question answering. Pattern Recognition Letters","author":"Toor Andeep S","year":"2019","unstructured":"Andeep S Toor, Harry Wechsler, and Michele Nappi. 2019. 
Biometric surveillance using visual question answering. Pattern Recognition Letters (2019)."},{"key":"e_1_2_1_93_1","volume-title":"Proceedings of Machine Learning and Systems 3 (MLSys).","author":"Wakatsuki Toshiaki","year":"2021","unstructured":"Toshiaki Wakatsuki, Sekitoshi Kanai, and Yasuhiro Fujiwara. 2021. Accelerate Inference of CNNs for Video Analysis While Preserving Exactness Exploiting Activation Sparsity. In Proceedings of Machine Learning and Systems 3 (MLSys)."},{"key":"e_1_2_1_94_1","volume-title":"Spatten: Efficient sparse attention architecture with cascade token and head pruning. In HPCA.","author":"Wang Hanrui","year":"2021","unstructured":"Hanrui Wang, Zhekai Zhang, and Song Han. 2021. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In HPCA."},{"key":"e_1_2_1_95_1","doi-asserted-by":"crossref","unstructured":"Limin Wang Bingkun Huang Zhiyu Zhao Zhan Tong Yinan He Yi Wang Yali Wang and Yu Qiao. 2023. Videomae v2: Scaling video masked autoencoders with dual masking. In CVPR.","DOI":"10.1109\/CVPR52729.2023.01398"},{"key":"e_1_2_1_96_1","doi-asserted-by":"crossref","unstructured":"Wenhai Wang Enze Xie Xiang Li Deng-Ping Fan Kaitao Song Ding Liang Tong Lu Ping Luo and Ling Shao. 2021. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"e_1_2_1_97_1","volume-title":"Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media 8, 3","author":"Wang Wenhai","year":"2022","unstructured":"Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2022. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media 8, 3 (2022)."},{"key":"e_1_2_1_98_1","volume-title":"Skipnet: Learning dynamic routing in convolutional networks. 
In ECCV.","author":"Wang Xin","year":"2018","unstructured":"Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E Gonzalez. 2018. Skipnet: Learning dynamic routing in convolutional networks. In ECCV."},{"key":"e_1_2_1_99_1","volume-title":"Internvid: A large-scale video-text dataset for multimodal understanding and generation. ICLR","author":"Wang Yi","year":"2024","unstructured":"Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. 2024. Internvid: A large-scale video-text dataset for multimodal understanding and generation. ICLR (2024)."},{"key":"e_1_2_1_100_1","volume-title":"Internvideo: General video foundation models via generative and discriminative learning. ECCV","author":"Wang Yi","year":"2024","unstructured":"Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. 2024. Internvideo: General video foundation models via generative and discriminative learning. ECCV (2024)."},{"key":"e_1_2_1_101_1","volume-title":"Proceedings of the ACM on Management of Data","author":"Wu Renzhi","year":"2024","unstructured":"Renzhi Wu, Pramod Chunduri, Ali Payani, Xu Chu, Joy Arulraj, and Kexin Rong. 2024. SketchQL: Video Moment Querying with a Visual Query Interface. Proceedings of the ACM on Management of Data (2024)."},{"key":"e_1_2_1_102_1","volume-title":"Blockdrop: Dynamic inference paths in residual networks. In CVPR.","author":"Wu Zuxuan","year":"2018","unstructured":"Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. 2018. Blockdrop: Dynamic inference paths in residual networks. In CVPR."},{"key":"e_1_2_1_103_1","volume-title":"Liteeval: A coarse-to-fine framework for resource efficient video recognition. NeurIPS","author":"Wu Zuxuan","year":"2019","unstructured":"Zuxuan Wu, Caiming Xiong, Yu-Gang Jiang, and Larry S Davis. 2019. 
Liteeval: A coarse-to-fine framework for resource efficient video recognition. NeurIPS (2019)."},{"key":"e_1_2_1_104_1","doi-asserted-by":"crossref","unstructured":"Junbin Xiao Angela Yao Yicong Li and Tat-Seng Chua. 2024. Can i trust your answer? visually grounded video question answering. In CVPR.","DOI":"10.1109\/CVPR52733.2024.01254"},{"key":"e_1_2_1_105_1","volume-title":"DoveDB: A Declarative and Low-Latency Video Database. VLDB","author":"Xiao Ziyang","year":"2023","unstructured":"Ziyang Xiao, Dongxiang Zhang, Zepeng Li, Sai Wu, Kian-Lee Tan, and Gang Chen. 2023. DoveDB: A Declarative and Low-Latency Video Database. VLDB (2023)."},{"key":"e_1_2_1_106_1","volume-title":"Msr-vtt: A large video description dataset for bridging video and language. In CVPR.","author":"Xu Jun","year":"2016","unstructured":"Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In CVPR."},{"key":"e_1_2_1_107_1","doi-asserted-by":"publisher","DOI":"10.1145\/3302424.3303971"},{"key":"e_1_2_1_108_1","volume-title":"Evo-vit: Slow-fast token evolution for dynamic vision transformer. In AAAI.","author":"Xu Yifan","year":"2022","unstructured":"Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. 2022. Evo-vit: Slow-fast token evolution for dynamic vision transformer. In AAAI."},{"key":"e_1_2_1_109_1","volume-title":"Joy Arulraj, and Umakishore Ramachandran.","author":"Xu Zhuangdi","year":"2022","unstructured":"Zhuangdi Xu, Gaurav Tarlok Kakkar, Joy Arulraj, and Umakishore Ramachandran. 2022. EVA: A symbolic approach to accelerating exploratory video analytics with materialized views. In SIGMOD."},{"key":"e_1_2_1_110_1","doi-asserted-by":"crossref","unstructured":"Antoine Yang Antoine Miech Josef Sivic Ivan Laptev and Cordelia Schmid. 2021. Just ask: Learning to answer questions from millions of narrated videos. 
In ICCV.","DOI":"10.1109\/ICCV48922.2021.00171"},{"key":"e_1_2_1_111_1","unstructured":"Antoine Yang Antoine Miech Josef Sivic Ivan Laptev and Cordelia Schmid. 2022. Zero-Shot Video Question Answering via Frozen Bidirectional Language Models. In NIPS S. Koyejo S. Mohamed A. Agarwal D. Belgrave K. Cho and A. Oh (Eds.)."},{"key":"e_1_2_1_112_1","doi-asserted-by":"crossref","unstructured":"Shusheng Yang Xinggang Wang Yu Li Yuxin Fang Jiemin Fang Wenyu Liu Xun Zhao and Ying Shan. 2022. Temporally efficient vision transformer for video instance segmentation. In CVPR.","DOI":"10.1109\/CVPR52688.2022.00290"},{"volume-title":"AdapTiV: Sign-Similarity Based Image-Adaptive Token Merging for Vision Transformer Acceleration","author":"Yoo Seungjae","key":"e_1_2_1_113_1","unstructured":"Seungjae Yoo, Hangyeol Kim, and Joo-Young Kim. 2024. AdapTiV: Sign-Similarity Based Image-Adaptive Token Merging for Vision Transformer Acceleration. In MICRO. IEEE."},{"key":"e_1_2_1_114_1","volume-title":"Florence: A new foundation model for computer vision. arXiv","author":"Yuan Lu","year":"2021","unstructured":"Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. 2021. Florence: A new foundation model for computer vision. arXiv (2021)."},{"key":"e_1_2_1_115_1","doi-asserted-by":"crossref","unstructured":"Mu Yuan Lan Zhang Xuanke You and Xiang-Yang Li. 2023. PacketGame: Multi-Stream Packet Gating for Concurrent Video Inference at Scale. In SIGCOMM.","DOI":"10.1145\/3603269.3604825"},{"key":"e_1_2_1_116_1","volume-title":"Freedman","author":"Zhang Haoyu","year":"2017","unstructured":"Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. 2017. Live Video Analytics at Scale with Approximation and Delay-Tolerance. In NSDI."},{"key":"e_1_2_1_117_1","doi-asserted-by":"crossref","unstructured":"Hang Zhang Xin Li and Lidong Bing. 2023. 
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. In EMNLP Demo.","DOI":"10.18653\/v1\/2023.emnlp-demo.49"},{"key":"e_1_2_1_118_1","volume-title":"Poet: Product-oriented video captioner for e-commerce. In ACM MM.","author":"Zhang Shengyu","year":"2020","unstructured":"Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Jie Liu, Jingren Zhou, Hongxia Yang, and Fei Wu. 2020. Poet: Product-oriented video captioner for e-commerce. In ACM MM."},{"key":"e_1_2_1_119_1","volume-title":"Panorama: A Data System for Unbounded Vocabulary Querying over Video. In VLDB.","author":"Zhang Yuhao","year":"2020","unstructured":"Yuhao Zhang and Arun Kumar. 2020. Panorama: A Data System for Unbounded Vocabulary Querying over Video. In VLDB."},{"key":"e_1_2_1_120_1","volume-title":"Attract me to buy: Advertisement copywriting generation with multimodal multi-structured information. arXiv preprint arXiv:2205.03534","author":"Zhang Zhipeng","year":"2022","unstructured":"Zhipeng Zhang, Xinglin Hou, Kai Niu, Zhongzhen Huang, Tiezheng Ge, Yuning Jiang, Qi Wu, and Peng Wang. 2022. Attract me to buy: Advertisement copywriting generation with multimodal multi-structured information. arXiv preprint arXiv:2205.03534 (2022)."},{"key":"e_1_2_1_121_1","volume-title":"AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild. CoRR abs\/2010.13302","author":"Zhang Zhe","year":"2020","unstructured":"Zhe Zhang, Chunyu Wang, Weichao Qiu, Wenhu Qin, and Wenjun Zeng. 2020. AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild. CoRR abs\/2010.13302 (2020). arXiv:2010.13302 https:\/\/arxiv.org\/abs\/2010.13302"},{"key":"e_1_2_1_122_1","doi-asserted-by":"crossref","unstructured":"Yue Zhao Ishan Misra Philipp Kr\u00e4henb\u00fchl and Rohit Girdhar. 2023. Learning video representations from large language models. 
In CVPR.","DOI":"10.1109\/CVPR52729.2023.00637"},{"key":"e_1_2_1_123_1","volume-title":"TVM: A Tile-based Video Management Framework. VLDB","author":"Zhong Tianxiong","year":"2023","unstructured":"Tianxiong Zhong, Zhiwei Zhang, Guo Lu, Ye Yuan, Yu-Ping Wang, and Guoren Wang. 2023. TVM: A Tile-based Video Management Framework. VLDB (2023)."},{"key":"e_1_2_1_124_1","volume-title":"Crucio: End-to-End Coordinated Spatio-Temporal Redundancy Elimination for Fast Video Analytics. In INFOCOM.","author":"Zhu Andong","year":"2024","unstructured":"Andong Zhu, Sheng Zhang, Xiaohang Shi, Ke Cheng, Hesheng Sun, and Sanglu Lu. 2024. Crucio: End-to-End Coordinated Spatio-Temporal Redundancy Elimination for Fast Video Analytics. In INFOCOM."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3748191.3748195","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T13:53:02Z","timestamp":1756993982000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3748191.3748195"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6]]},"references-count":124,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2025,6]]}},"alternative-id":["10.14778\/3748191.3748195"],"URL":"https:\/\/doi.org\/10.14778\/3748191.3748195","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2025,6]]},"assertion":[{"value":"2025-09-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}