{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T01:10:26Z","timestamp":1755825026755,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":60,"publisher":"ACM","funder":[{"name":"European Commission under Horizon Europe Program","award":["101120237 - ELIAS"],"award-info":[{"award-number":["101120237 - ELIAS"]}]},{"name":"MUR PNRR project FAIR","award":["PE00000013"],"award-info":[{"award-number":["PE00000013"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,30]]},"DOI":"10.1145\/3731715.3733417","type":"proceedings-article","created":{"date-parts":[[2025,6,25]],"date-time":"2025-06-25T18:31:39Z","timestamp":1750876299000},"page":"1081-1090","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["R<scp>ag<\/scp>M<scp>e<\/scp>: Retrieval Augmented Video Generation for Enhanced Motion Realism"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0119-3783","authenticated-orcid":false,"given":"Elia","family":"Peruzzo","sequence":"first","affiliation":[{"name":"University of Trento, Trento, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8474-3095","authenticated-orcid":false,"given":"Dejia","family":"Xu","sequence":"additional","affiliation":[{"name":"University of Texas at Austin, Austin, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-7276-5566","authenticated-orcid":false,"given":"Xingqian","family":"Xu","sequence":"additional","affiliation":[{"name":"Picsart AI Research, Atlanta, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2922-5663","authenticated-orcid":false,"given":"Humphrey","family":"Shi","sequence":"additional","affiliation":[{"name":"Georgia Tech, Atlanta, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6597-7248","authenticated-orcid":false,"given":"Nicu","family":"Sebe","sequence":"additional","affiliation":[{"name":"University of Trento, Trento, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,6,30]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. In IEEE International Conference on Computer Vision.","author":"Bain Max","year":"2021","unstructured":"Max Bain, Arsha Nagrani, G\u00fcl Varol, and Andrew Zisserman. 2021. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. In IEEE International Conference on Computer Vision."},{"key":"e_1_3_2_1_2_1","unstructured":"Andreas Blattmann Tim Dockhorn Sumith Kulal Daniel Mendelevitch Maciej Kilian Dominik Lorenz Yam Levi Zion English Vikram Voleti Adam Letts et al. 2023. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)."},{"key":"e_1_3_2_1_3_1","volume-title":"Sanja Fidler, and Karsten Kreis.","author":"Blattmann Andreas","year":"2023","unstructured":"Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. 2023. Align Your Latents: High-Resolution Video Synthesis With Latent Diffusion Models. 
In CVPR."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02161"},{"key":"e_1_3_2_1_5_1","first-page":"15309","article-title":"Retrieval-augmented diffusion models","volume":"35","author":"Blattmann Andreas","year":"2022","unstructured":"Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas M\u00fcller, and Bj\u00f6rn Ommer. 2022. Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems 35 (2022), 15309--15324.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_6_1","volume-title":"Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research","volume":"2240","author":"Borgeaud Sebastian","year":"2022","unstructured":"Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack Rae, Erich Elsen, and Laurent Sifre. 2022. Improving Language Models by Retrieving from Trillions of Tokens. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 2206--2240."},{"key":"e_1_3_2_1_7_1","unstructured":"Tim Brooks Bill Peebles Connor Holmes Will DePue Yufei Guo Li Jing David Schnurr Joe Taylor Troy Luhman Eric Luhman Clarence Ng Ricky Wang and Aditya Ramesh. 2024. Video generation models as world simulators. (2024). 
https:\/\/openai.com\/research\/video-generation-models-as-world-simulators"},{"key":"e_1_3_2_1_8_1","first-page":"27517","article-title":"Instance-conditioned gan","volume":"34","author":"Casanova Arantxa","year":"2021","unstructured":"Arantxa Casanova, Marlene Careil, Jakob Verbeek, Michal Drozdzal, and Adriana Romero Soriano. 2021. Instance-conditioned gan. Advances in Neural Information Processing Systems 34 (2021), 27517--27529.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Duygu Ceylan Chun-Hao P Huang and Niloy J Mitra. 2023. Pix2video: Video editing using image diffusion. In ICCV.","DOI":"10.1109\/ICCV51070.2023.02121"},{"key":"e_1_3_2_1_10_1","volume-title":"Flatten: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922","author":"Cong Yuren","year":"2023","unstructured":"Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. 2023. Flatten: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922 (2023)."},{"key":"e_1_3_2_1_11_1","volume-title":"Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34","author":"Dhariwal Prafulla","year":"2021","unstructured":"Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34 (2021), 8780--8794."},{"key":"e_1_3_2_1_12_1","volume-title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. 
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR (2021)."},{"key":"e_1_3_2_1_13_1","unstructured":"Matthijs Douze Alexandr Guzhva Chengqi Deng Jeff Johnson Gergely Szilvasy Pierre-Emmanuel Mazar\u00e9 Maria Lomeli Lucas Hosseini and Herv\u00e9 J\u00e9gou. 2024. The Faiss library. (2024). arXiv:2401.08281 [cs.LG]"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Patrick Esser Johnathan Chiu Parmida Atighehchian Jonathan Granskog and Anastasis Germanidis. 2023. Structure and content-guided video synthesis with diffusion models. In ICCV.","DOI":"10.1109\/ICCV51070.2023.00675"},{"key":"e_1_3_2_1_15_1","unstructured":"Songwei Ge Seungjun Nah Guilin Liu Tyler Poon Andrew Tao Bryan Catanzaro David Jacobs Jia-Bin Huang Ming-Yu Liu and Yogesh Balaji. 2023. Preserve your own correlation: A noise prior for video diffusion models. In ICCV."},{"key":"e_1_3_2_1_16_1","volume-title":"Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint","author":"Geyer Michal","year":"2023","unstructured":"Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. 2023. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint (2023)."},{"key":"e_1_3_2_1_17_1","volume-title":"Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933","author":"Guo Yuwei","year":"2023","unstructured":"Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. 2023. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933 (2023)."},{"key":"e_1_3_2_1_18_1","volume-title":"Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint","author":"Guo Yuwei","year":"2023","unstructured":"Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. 2023. 
Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint (2023)."},{"key":"e_1_3_2_1_19_1","volume-title":"Animate-a-story: Storytelling with retrieval-augmented video generation. arXiv preprint arXiv:2307.06940","author":"He Yingqing","year":"2023","unstructured":"Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. 2023. Animate-a-story: Storytelling with retrieval-augmented video generation. arXiv preprint arXiv:2307.06940 (2023)."},{"key":"e_1_3_2_1_20_1","unstructured":"Jonathan Ho William Chan Chitwan Saharia Jay Whang Ruiqi Gao Alexey Gritsenko Diederik P Kingma Ben Poole Mohammad Norouzi David J Fleet et al. 2022. Imagen video: High definition video generation with diffusion models. arXiv preprint (2022)."},{"key":"e_1_3_2_1_21_1","volume-title":"Denoising diffusion probabilistic models. Advances in neural information processing systems 33","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840--6851."},{"key":"e_1_3_2_1_22_1","volume-title":"Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685","author":"Hu Edward J","year":"2021","unstructured":"Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02060"},{"key":"e_1_3_2_1_24_1","volume-title":"Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems 35","author":"Karras Tero","year":"2022","unstructured":"Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. 2022. 
Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems 35 (2022), 26565--26577."},{"key":"e_1_3_2_1_25_1","volume-title":"Try","author":"Karthik Shyamgopal","year":"2023","unstructured":"Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. 2023. If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection. arXiv preprint arXiv:2305.13308 (2023)."},{"key":"e_1_3_2_1_26_1","unstructured":"Will Kay Joao Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola Tim Green Trevor Back Paul Natsev et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)."},{"key":"e_1_3_2_1_27_1","volume-title":"Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint","author":"Khachatryan Levon","year":"2023","unstructured":"Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. 2023. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint (2023)."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","unstructured":"PKU-Yuan Lab and Tuzhan AI etc. 2024. Open-Sora-Plan. doi:10.5281\/zenodo.10948109","DOI":"10.5281\/zenodo.10948109"},{"key":"e_1_3_2_1_29_1","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive nlp tasks","volume":"33","author":"Lewis Patrick","year":"2020","unstructured":"Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K\u00fcttler, Mike Lewis, Wen-tau Yih, Tim Rockt\u00e4schel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. 
Advances in Neural Information Processing Systems 33 (2020), 9459--9474.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_30_1","volume-title":"FIT: Far-reaching Interleaved Transformers.","author":"Li Lala","year":"2023","unstructured":"Lala Li and Ting Chen. 2023. FIT: Far-reaching Interleaved Transformers. (2023)."},{"key":"e_1_3_2_1_31_1","volume-title":"Video-p2p: Video editing with cross-attention control. arXiv preprint","author":"Liu Shaoteng","year":"2023","unstructured":"Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. 2023. Video-p2p: Video editing with cross-attention control. arXiv preprint (2023)."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2022.07.028"},{"key":"e_1_3_2_1_33_1","volume-title":"Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048","author":"Ma Xin","year":"2024","unstructured":"Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. 2024. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048 (2024)."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547910"},{"key":"e_1_3_2_1_35_1","volume-title":"Customizing motion in text-to-video diffusion models. arXiv preprint arXiv:2312.04966","author":"Materzynska Joanna","year":"2023","unstructured":"Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, and Bryan Russell. 2023. Customizing motion in text-to-video diffusion models. arXiv preprint arXiv:2312.04966 (2023)."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00672"},{"key":"e_1_3_2_1_37_1","unstructured":"Maxime Oquab Timoth\u00e9e Darcet Th\u00e9o Moutakanni Huy Vo Marc Szafraniec Vasil Khalidov Pierre Fernandez Daniel Haziza Francisco Massa Alaaeldin El-Nouby et al. 2023. Dinov2: Learning robust visual features without supervision. 
arXiv preprint arXiv:2304.07193 (2023)."},{"key":"e_1_3_2_1_38_1","volume-title":"Vase: Object-centric appearance and shape manipulation of real videos. arXiv preprint arXiv:2401.02473","author":"Peruzzo Elia","year":"2024","unstructured":"Elia Peruzzo, Vidit Goel, Dejia Xu, Xingqian Xu, Yifan Jiang, Zhangyang Wang, Humphrey Shi, and Nicu Sebe. 2024. Vase: Object-centric appearance and shape manipulation of real videos. arXiv preprint arXiv:2401.02473 (2024)."},{"key":"e_1_3_2_1_39_1","volume-title":"Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint","author":"Qi Chenyang","year":"2023","unstructured":"Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. 2023. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint (2023)."},{"key":"e_1_3_2_1_40_1","volume-title":"International conference on machine learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748--8763."},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00605"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02155"},{"key":"e_1_3_2_1_44_1","volume-title":"Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849","author":"Sheynin Shelly","year":"2022","unstructured":"Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. 2022. Knn-diffusion: Image generation via large-scale retrieval. 
arXiv preprint arXiv:2204.02849 (2022)."},{"key":"e_1_3_2_1_45_1","volume-title":"Make-a-video: Text-to-video generation without text-video data. arXiv preprint","author":"Singer Uriel","year":"2022","unstructured":"Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. 2022. Make-a-video: Text-to-video generation without text-video data. arXiv preprint (2022)."},{"key":"e_1_3_2_1_46_1","volume-title":"Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502","author":"Song Jiaming","year":"2020","unstructured":"Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)."},{"key":"e_1_3_2_1_47_1","unstructured":"Spencer Sterling. 2023. Zeroscope. https:\/\/huggingface.co\/cerspense\/zeroscope_v2_576w"},{"key":"e_1_3_2_1_48_1","volume-title":"Retrievegan: Image synthesis via differentiable patch retrieval. In Computer Vision-ECCV 2020: 16th European Conference","author":"Tseng Hung-Yu","year":"2020","unstructured":"Hung-Yu Tseng, Hsin-Ying Lee, Lu Jiang, Ming-Hsuan Yang, and Weilong Yang. 2020. Retrievegan: Image synthesis via differentiable patch retrieval. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VIII 16. Springer, 242--257."},{"key":"e_1_3_2_1_49_1","volume-title":"Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly.","author":"Unterthiner Thomas","year":"2018","unstructured":"Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. 2018. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)."},{"key":"e_1_3_2_1_50_1","volume-title":"Modelscope text-to-video technical report. 
arXiv preprint arXiv:2308.06571","author":"Wang Jiuniu","year":"2023","unstructured":"Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. 2023. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571 (2023)."},{"key":"e_1_3_2_1_51_1","volume-title":"VideoComposer: Compositional Video Synthesis with Motion Controllability. arXiv preprint","author":"Wang Xiang","year":"2023","unstructured":"Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. 2023. VideoComposer: Compositional Video Synthesis with Motion Controllability. arXiv preprint (2023)."},{"key":"e_1_3_2_1_52_1","volume-title":"Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103","author":"Wang Yaohui","year":"2023","unstructured":"Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. 2023. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103 (2023)."},{"key":"e_1_3_2_1_53_1","volume-title":"Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou.","author":"Wu Jay Zhangjie","year":"2023","unstructured":"Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. 2023. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV."},{"key":"e_1_3_2_1_54_1","volume-title":"Freeinit: Bridging initialization gap in video diffusion models. arXiv preprint arXiv:2312.07537","author":"Wu Tianxing","year":"2023","unstructured":"Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, and Ziwei Liu. 2023. Freeinit: Bridging initialization gap in video diffusion models. 
arXiv preprint arXiv:2312.07537 (2023)."},{"key":"e_1_3_2_1_55_1","volume-title":"Good Seed Makes a Good Crop: Discovering Secret Seeds in Text-to-Image Diffusion Models. arXiv preprint arXiv:2405.14828","author":"Xu Katherine","year":"2024","unstructured":"Katherine Xu, Lingzhi Zhang, and Jianbo Shi. 2024. Good Seed Makes a Good Crop: Discovering Secret Seeds in Text-to-Image Diffusion Models. arXiv preprint arXiv:2405.14828 (2024)."},{"key":"e_1_3_2_1_56_1","volume-title":"Forty-first International Conference on Machine Learning.","author":"Yang Ling","year":"2024","unstructured":"Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and CUI Bin. 2024. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. In Forty-first International Conference on Machine Learning."},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3610548.3618160"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00809"},{"key":"e_1_3_2_1_59_1","volume-title":"European Conference on Computer Vision. Springer, 273--290","author":"Zhao Rui","year":"2024","unstructured":"Rui Zhao, Yuchao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jia-Wei Liu, Weijia Wu, Jussi Keppo, and Mike Zheng Shou. 2024. Motiondirector: Motion customization of text-to-video diffusion models. In European Conference on Computer Vision. Springer, 273--290."},{"key":"e_1_3_2_1_60_1","unstructured":"Zangwei Zheng Xiangyu Peng Tianji Yang Chenhui Shen Shenggui Li Hongxin Liu Yukun Zhou Tianyi Li and Yang You. 2024. Open-Sora: Democratizing Efficient Video Production for All. 
https:\/\/github.com\/hpcaitech\/Open-Sora"}],"event":{"name":"ICMR '25: International Conference on Multimedia Retrieval","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Chicago IL USA","acronym":"ICMR '25"},"container-title":["Proceedings of the 2025 International Conference on Multimedia Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3731715.3733417","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T04:14:48Z","timestamp":1755749688000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3731715.3733417"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,30]]},"references-count":60,"alternative-id":["10.1145\/3731715.3733417","10.1145\/3731715"],"URL":"https:\/\/doi.org\/10.1145\/3731715.3733417","relation":{},"subject":[],"published":{"date-parts":[[2025,6,30]]},"assertion":[{"value":"2025-06-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}