{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T16:19:50Z","timestamp":1771949990817,"version":"3.50.1"},"reference-count":80,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2024,11,19]],"date-time":"2024-11-19T00:00:00Z","timestamp":1731974400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61991451"],"award-info":[{"award-number":["61991451"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Shenzhen Science and Technology Program","award":["JSGG20220831093004008"],"award-info":[{"award-number":["JSGG20220831093004008"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2024,12,19]]},"abstract":"<jats:p>Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired artistic videos due to (i) text's inherent clumsiness in expressing specific styles and (ii) the generally degraded style fidelity. To address these challenges, we introduce StyleCrafter, a generic method that enhances pretrained T2V models with a style control adapter, allowing video generation in any style by feeding a reference image. Considering the scarcity of artistic video data, we propose to first train a style control adapter using style-rich image datasets, then transfer the learned stylization ability to video generation through a tailor-made finetuning paradigm. To promote content-style disentanglement, we employ carefully designed data augmentation strategies to enhance decoupled learning. 
Additionally, we propose a scale-adaptive fusion module to balance the influences of text-based content features and image-based style features, which helps generalization across various text and style combinations. StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images. Experiments demonstrate that our approach is more flexible and efficient than existing competitors. Project page: https:\/\/gongyeliu.github.io\/StyleCrafter.github.io\/<\/jats:p>","DOI":"10.1145\/3687975","type":"journal-article","created":{"date-parts":[[2024,11,19]],"date-time":"2024-11-19T10:46:04Z","timestamp":1732013164000},"page":"1-10","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["StyleCrafter: Taming Artistic Video Diffusion with Reference-Augmented Adapter Learning"],"prefix":"10.1145","volume":"43","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-6536-282X","authenticated-orcid":false,"given":"Gongye","family":"Liu","sequence":"first","affiliation":[{"name":"Tsinghua University, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9664-4967","authenticated-orcid":false,"given":"Menghan","family":"Xia","sequence":"additional","affiliation":[{"name":"Tencent AI Lab, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0066-3448","authenticated-orcid":false,"given":"Yong","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tencent AI Lab, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-6085-2107","authenticated-orcid":false,"given":"Haoxin","family":"Chen","sequence":"additional","affiliation":[{"name":"Tencent AI Lab, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2181-1879","authenticated-orcid":false,"given":"Jinbo","family":"Xing","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-1774-0039","authenticated-orcid":false,"given":"Yibo","family":"Wang","sequence":"additional","affiliation":[{"name":"Tsinghua University, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6585-8604","authenticated-orcid":false,"given":"Xintao","family":"Wang","sequence":"additional","affiliation":[{"name":"Tencent AI Lab, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7673-8325","authenticated-orcid":false,"given":"Ying","family":"Shan","sequence":"additional","affiliation":[{"name":"Tencent AI Lab, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6427-1024","authenticated-orcid":false,"given":"Yujiu","family":"Yang","sequence":"additional","affiliation":[{"name":"Tsinghua University, Shenzhen, China"}]}],"member":"320","published-online":{"date-parts":[[2024,11,19]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence.","author":"Ahn Namhyuk","year":"2023","unstructured":"Namhyuk Ahn, Junsoo Lee, Chunggi Lee, Kunhee Kim, Daesik Kim, Seung-Hun Nam, and Kibeom Hong. 2023. DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models. In Proceedings of the AAAI Conference on Artificial Intelligence."},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00092"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00175"},{"key":"e_1_2_2_4_1","unstructured":"Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. 2023. Stable video diffusion: Scaling latent video diffusion models to large datasets. 
arXiv preprint arXiv:2311.15127 (2023)."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1276377.1276507"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.126"},{"key":"e_1_2_2_8_1","volume-title":"VideoCrafter1: Open Diffusion Models for High-Quality Video Generation. preprint arXiv:2310.19512","author":"Chen Haoxin","year":"2023","unstructured":"Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. 2023a. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation. preprint arXiv:2310.19512 (2023)."},{"key":"e_1_2_2_9_1","volume-title":"Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047","author":"Chen Haoxin","year":"2024","unstructured":"Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. 2024. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047 (2024)."},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611819"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i2.16208"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01104"},{"key":"e_1_2_2_13_1","first-page":"16890","article-title":"Cogview2: Faster and better text-to-image generation via hierarchical transformers","volume":"35","author":"Ding Ming","year":"2022","unstructured":"Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. 2022. Cogview2: Faster and better text-to-image generation via hierarchical transformers. 
Advances in Neural Information Processing Systems 35 (2022), 16890--16902.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_14_1","volume-title":"An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint:2208.01618","author":"Gal Rinon","year":"2022","unstructured":"Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint:2208.01618 (2022)."},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093420"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.265"},{"key":"e_1_2_2_17_1","unstructured":"Gen-2. 2023. Gen-2. Accessed Nov. 1, 2023 [Online]. https:\/\/research.runwayml.com\/gen2"},{"key":"e_1_2_2_18_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Geyer Michal","year":"2024","unstructured":"Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. 2024. Tokenflow: Consistent diffusion features for consistent video editing. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_19_1","volume-title":"International Conference on Learning Representations (ICLR)","author":"Guo Yuwei","year":"2024","unstructured":"Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. 2024. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_20_1","volume-title":"Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint:2211.13221","author":"He Yingqing","year":"2022","unstructured":"Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. 2022. 
Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint:2211.13221 (2022)."},{"key":"e_1_2_2_21_1","volume-title":"Style aligned image generation via shared attention. arXiv preprint arXiv:2312.02133","author":"Hertz Amir","year":"2023","unstructured":"Amir Hertz, Andrey Voynov, Shlomi Fruchter, and Daniel Cohen-Or. 2023. Style aligned image generation via shared attention. arXiv preprint arXiv:2312.02133 (2023)."},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/383259.383295"},{"key":"e_1_2_2_23_1","unstructured":"Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. 2022a. Imagen video: High definition video generation with diffusion models. arXiv preprint:2210.02303 (2022)."},{"key":"e_1_2_2_24_1","first-page":"6840","article-title":"Denoising diffusion probabilistic models","volume":"33","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. 6840--6851.","journal-title":"Advances in Neural Information Processing Systems (NeurIPS)"},{"key":"e_1_2_2_25_1","volume-title":"Video diffusion models. arXiv:2204.03458","author":"Ho Jonathan","year":"2022","unstructured":"Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. 2022b. Video diffusion models. arXiv:2204.03458 (2022)."},{"key":"e_1_2_2_26_1","volume-title":"Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint:2205.15868","author":"Hong Wenyi","year":"2022","unstructured":"Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. 2022. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. 
arXiv preprint:2205.15868 (2022)."},{"key":"e_1_2_2_27_1","volume-title":"LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR).","author":"Hu Edward J.","year":"2022","unstructured":"Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_28_1","volume-title":"Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint:2302.09778","author":"Huang Lianghua","year":"2023","unstructured":"Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. 2023a. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint:2302.09778 (2023)."},{"key":"e_1_2_2_29_1","volume-title":"Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer. arXiv preprint:2305.05464","author":"Huang Nisha","year":"2023","unstructured":"Nisha Huang, Yuxin Zhang, and Weiming Dong. 2023b. Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer. arXiv preprint:2305.05464 (2023)."},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.167"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01459"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3306346.3323006"},{"key":"e_1_2_2_33_1","volume-title":"Text2videozero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint:2303.13439","author":"Khachatryan Levon","year":"2023","unstructured":"Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. 2023. Text2videozero: Text-to-image diffusion models are zero-shot video generators. 
arXiv preprint:2303.13439 (2023)."},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00192"},{"key":"e_1_2_2_35_1","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023a. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In ICML."},{"key":"e_1_2_2_36_1","volume-title":"Universal style transfer via feature transforms. Advances in neural information processing systems 30","author":"Li Yijun","year":"2017","unstructured":"Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. 2017. Universal style transfer via feature transforms. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02156"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00658"},{"key":"e_1_2_2_39_1","volume-title":"Evalcrafter: Benchmarking and evaluating large video generation models. arXiv preprint arXiv:2310.11440","author":"Liu Yaofang","year":"2023","unstructured":"Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. 2023. Evalcrafter: Benchmarking and evaluating large video generation models. arXiv preprint arXiv:2310.11440 (2023)."},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3618326"},{"key":"e_1_2_2_41_1","volume-title":"International Conference on Machine Learning (ICML). PMLR, 8162--8171","author":"Nichol Alexander Quinn","year":"2021","unstructured":"Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning (ICML). PMLR, 8162--8171."},{"key":"e_1_2_2_42_1","volume-title":"Technical report","author":"OpenAI","year":"2023","unstructured":"OpenAI. 2023. GPT-4V(ision) System Card. 
Technical report (2023)."},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00603"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.2308\/iace-50038"},{"key":"e_1_2_2_45_1","volume-title":"Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952","author":"Podell Dustin","year":"2023","unstructured":"Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M\u00fcller, Joe Penna, and Robin Rombach. 2023a. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)."},{"key":"e_1_2_2_46_1","volume-title":"Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint:2307.01952","author":"Podell Dustin","year":"2023","unstructured":"Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M\u00fcller, Joe Penna, and Robin Rombach. 2023b. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint:2307.01952 (2023)."},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01460"},{"key":"e_1_2_2_48_1","volume-title":"International Conference on Machine Learning (ICML).","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML)."},{"key":"e_1_2_2_49_1","volume-title":"Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1, 2","author":"Ramesh Aditya","year":"2022","unstructured":"Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:2204.06125 1, 2 (2022), 3."},{"key":"e_1_2_2_50_1","doi-asserted-by":"crossref","unstructured":"Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj\u00f6rn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In CVPR.","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-45886-1_3"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02155"},{"key":"e_1_2_2_53_1","first-page":"25278","article-title":"Laion-5b: An open large-scale dataset for training next generation image-text models","volume":"35","author":"Schuhmann Christoph","year":"2022","unstructured":"Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35 (2022), 25278--25294.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_54_1","volume-title":"Instantbooth: Personalized text-to-image generation without test-time finetuning. arXiv preprint arXiv:2304.03411","author":"Shi Jing","year":"2023","unstructured":"Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. 2023. Instantbooth: Personalized text-to-image generation without test-time finetuning. arXiv preprint arXiv:2304.03411 (2023)."},{"key":"e_1_2_2_56_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_2_2_57_1","volume-title":"Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint:2209.14792","author":"Singer Uriel","year":"2022","unstructured":"Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. 2022. Make-a-video: Text-to-video generation without text-video data. arXiv preprint:2209.14792 (2022)."},{"key":"e_1_2_2_58_1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Sohn Kihyuk","year":"2023","unstructured":"Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, et al. 2023. StyleDrop: Text-to-Image Generation in Any Style. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_2_59_1","volume-title":"Denoising Diffusion Implicit Models. In International Conference on Learning Representations (ICLR).","author":"Song Jiaming","year":"2021","unstructured":"Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021a. Denoising Diffusion Implicit Models. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_60_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Song Yang","year":"2021","unstructured":"Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021b. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_61_1","volume-title":"Separating style and content with bilinear models. Neural computation 12, 6","author":"Tenenbaum Joshua B","year":"2000","unstructured":"Joshua B Tenenbaum and William T Freeman. 2000. Separating style and content with bilinear models. 
Neural computation 12, 6 (2000), 1247--1283."},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cag.2020.01.002"},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3386569.3392453"},{"key":"e_1_2_2_64_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_2_2_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2004.1272726"},{"key":"e_1_2_2_66_1","volume-title":"Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571","author":"Wang Jiuniu","year":"2023","unstructured":"Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. 2023c. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571 (2023)."},{"key":"e_1_2_2_67_1","volume-title":"Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems 36","author":"Wang Xiang","year":"2024","unstructured":"Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. 2024. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_2_2_68_1","volume-title":"LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models. arXiv preprint arXiv:2309.15103","author":"Wang Yaohui","year":"2023","unstructured":"Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. 2023a. LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models. 
arXiv preprint arXiv:2309.15103 (2023)."},{"key":"e_1_2_2_69_1","volume-title":"StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation. arXiv preprint:2309.01770","author":"Wang Zhouxia","year":"2023","unstructured":"Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, and Ping Luo. 2023b. StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation. arXiv preprint:2309.01770 (2023)."},{"key":"e_1_2_2_70_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01461"},{"key":"e_1_2_2_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00701"},{"key":"e_1_2_2_72_1","volume-title":"ToonCrafter: Generative Cartoon Interpolation. arXiv preprint arXiv:2405.17933","author":"Xing Jinbo","year":"2024","unstructured":"Jinbo Xing, Hanyuan Liu, Menghan Xia, Yong Zhang, Xintao Wang, Ying Shan, and Tien-Tsin Wong. 2024. ToonCrafter: Generative Cartoon Interpolation. arXiv preprint arXiv:2405.17933 (2024)."},{"key":"e_1_2_2_73_1","volume-title":"Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190","author":"Xing Jinbo","year":"2023","unstructured":"Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. 2023. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190 (2023)."},{"key":"e_1_2_2_74_1","volume-title":"Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation. In ACM SIGGRAPH Asia 2023 Conference Proceedings.","author":"Yang Shuai","year":"2023","unstructured":"Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. 2023. Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation. 
In ACM SIGGRAPH Asia 2023 Conference Proceedings."},{"key":"e_1_2_2_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00831"},{"key":"e_1_2_2_76_1","volume-title":"IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv preprint arXiv:2308.06721","author":"Ye Hu","year":"2023","unstructured":"Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv preprint arXiv:2308.06721 (2023)."},{"key":"e_1_2_2_77_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2013.2265675"},{"key":"e_1_2_2_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00978"},{"key":"e_1_2_2_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/3528233.3530736"},{"key":"e_1_2_2_80_1","volume-title":"Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint:2211.11018","author":"Zhou Daquan","year":"2022","unstructured":"Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. 2022. Magicvideo: Efficient video generation with latent diffusion models. 
arXiv preprint:2211.11018 (2022)."}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3687975","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3687975","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,2]],"date-time":"2025-10-02T16:37:55Z","timestamp":1759423075000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3687975"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,19]]},"references-count":80,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,12,19]]}},"alternative-id":["10.1145\/3687975"],"URL":"https:\/\/doi.org\/10.1145\/3687975","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,19]]},"assertion":[{"value":"2024-11-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}