{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,6]],"date-time":"2026-06-06T14:37:23Z","timestamp":1780756643705,"version":"3.54.1"},"reference-count":75,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2023,12,5]],"date-time":"2023-12-05T00:00:00Z","timestamp":1701734400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2023,12,5]]},"abstract":"<jats:p>\n            We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions --- an especially challenging task when prompts feature imaginary elements and artistic styles, given the complexity of interpreting the semantics and motions of these images. We focus on cinemagraphs of fluid elements, such as flowing rivers, and drifting clouds, which exhibit continuous motion and repetitive textures. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods frequently introduce temporal inconsistencies, struggling to keep certain regions static. To address these challenges, we propose an idea of synthesizing image\n            <jats:italic toggle=\"yes\">twins<\/jats:italic>\n            from a single text prompt --- a pair of an artistic image and its pixel-aligned corresponding natural-looking twin. While the artistic image depicts the style and appearance detailed in our text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion given the semantic information. The predicted motion can then be transferred to the artistic image to create the final cinemagraph. 
Our method outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we demonstrate two extensions: animating existing paintings and controlling motion directions using text.\n          <\/jats:p>","DOI":"10.1145\/3618326","type":"journal-article","created":{"date-parts":[[2023,12,5]],"date-time":"2023-12-05T10:20:48Z","timestamp":1701771648000},"page":"1-13","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["Text-Guided Synthesis of Eulerian Cinemagraphs"],"prefix":"10.1145","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1480-7302","authenticated-orcid":false,"given":"Aniruddha","family":"Mahapatra","sequence":"first","affiliation":[{"name":"Carnegie Mellon University, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9252-1775","authenticated-orcid":false,"given":"Aliaksandr","family":"Siarohin","sequence":"additional","affiliation":[{"name":"Snap Inc., USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2442-0117","authenticated-orcid":false,"given":"Hsin-Ying","family":"Lee","sequence":"additional","affiliation":[{"name":"Snap Inc., USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3465-1592","authenticated-orcid":false,"given":"Sergey","family":"Tulyakov","sequence":"additional","affiliation":[{"name":"Snap Inc., USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8504-3410","authenticated-orcid":false,"given":"Jun-Yan","family":"Zhu","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, 
USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2023,12,5]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1186822.1073268"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2185520.2185562"},{"key":"e_1_2_2_3_1","volume-title":"Computer Graphics Forum","author":"Bai Jiamin","unstructured":"Jiamin Bai, Aseem Agarwala, Maneesh Agrawala, and Ravi Ramamoorthi. 2013. Automatic cinemagraph portraits. In Computer Graphics Forum, Vol. 32. Wiley Online Library, 17--25."},{"key":"e_1_2_2_4_1","volume-title":"Tuanfeng Y Wang, Meysam Madadi, Sergio Escalera, and Duygu Ceylan.","author":"Bertiche Hugo","year":"2023","unstructured":"Hugo Bertiche, Niloy J Mitra, Kuldeep Kulkarni, Chun-Hao Paul Huang, Tuanfeng Y Wang, Meysam Madadi, Sergio Escalera, and Duygu Ceylan. 2023. Blowing in the Wind: CycleNet for Human Cinemagraphs from Still Images. arXiv preprint arXiv:2303.08639 (2023)."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02161"},{"key":"e_1_2_2_6_1","volume-title":"Efros","author":"Brooks Tim","year":"2023","unstructured":"Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2023. InstructPix2Pix: Learning to Follow Image Editing Instructions. In CVPR."},{"key":"e_1_2_2_7_1","unstructured":"Kevin Burg and Jamie Beck. 2011. Cinemagraphs. 
http:\/\/cinemagraphs.com\/."},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00132"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1186822.1073273"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2006.281"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19836-6_6"},{"key":"e_1_2_2_12_1","first-page":"19822","article-title":"Cogview: Mastering text-to-image generation via transformers","volume":"34","author":"Ding Ming","year":"2021","unstructured":"Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. 2021. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems 34 (2021), 19822--19835.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_13_1","volume-title":"Animating landscape: self-supervised learning of decoupled motion and appearance for single-image video synthesis. arXiv preprint arXiv:1910.07192","author":"Endo Yuki","year":"2019","unstructured":"Yuki Endo, Yoshihiro Kanamori, and Shigeru Kuriyama. 2019. Animating landscape: self-supervised learning of decoupled motion and appearance for single-image video synthesis. arXiv preprint arXiv:1910.07192 (2019)."},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01268"},{"key":"e_1_2_2_15_1","volume-title":"Simulating Fluids in Real-World Still Images. arXiv preprint arXiv:2204.11335","author":"Fan Siming","year":"2022","unstructured":"Siming Fan, Jingtan Piao, Chen Qian, Kwan-Yee Lin, and Hongsheng Li. 2022. Simulating Fluids in Real-World Still Images. arXiv preprint arXiv:2204.11335 (2022)."},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1080\/14786440109462720"},{"key":"e_1_2_2_17_1","volume-title":"Make-a-scene: Scene-based text-to-image generation with human priors. 
arXiv preprint arXiv:2203.13131","author":"Gafni Oran","year":"2022","unstructured":"Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. 2022. Make-a-scene: Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131 (2022)."},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3528223.3530164"},{"key":"e_1_2_2_19_1","volume-title":"Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models. arXiv preprint arXiv:2305.10474","author":"Ge Songwei","year":"2023","unstructured":"Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. 2023a. Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models. arXiv preprint arXiv:2305.10474 (2023)."},{"key":"e_1_2_2_20_1","volume-title":"Expressive Text-to-Image Generation with Rich Text. arXiv preprint arXiv:2304.06720","author":"Ge Songwei","year":"2023","unstructured":"Songwei Ge, Taesung Park, Jun-Yan Zhu, and Jia-Bin Huang. 2023b. Expressive Text-to-Image Generation with Rich Text. arXiv preprint arXiv:2304.06720 (2023)."},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_20"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3450626.3459935"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073648"},{"key":"e_1_2_2_24_1","unstructured":"Yingqing He Haoxin Chen and Menghan Xia. 2023. VideoCrafter: A Toolkit for Text-to-Video Generation and Editing. https:\/\/github.com\/VideoCrafter\/VideoCrafter."},{"key":"e_1_2_2_25_1","volume-title":"Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626","author":"Hertz Amir","year":"2022","unstructured":"Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Prompt-to-prompt image editing with cross attention control. 
arXiv preprint arXiv:2208.01626 (2022)."},{"key":"e_1_2_2_26_1","unstructured":"Jonathan Ho William Chan Chitwan Saharia Jay Whang Ruiqi Gao Alexey Gritsenko Diederik P Kingma Ben Poole Mohammad Norouzi David J Fleet et al. 2022. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00575"},{"key":"e_1_2_2_28_1","volume-title":"Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868","author":"Hong Wenyi","year":"2022","unstructured":"Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. 2022. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022)."},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.632"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2380116.2380149"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00976"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00453"},{"key":"e_1_2_2_33_1","volume-title":"Imagic: Text-Based Real Image Editing with Diffusion Models. arXiv preprint arXiv:2210.09276","author":"Kawar Bahjat","year":"2022","unstructured":"Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. 2022. Imagic: Text-Based Real Image Editing with Diffusion Models. arXiv preprint arXiv:2210.09276 (2022)."},{"key":"e_1_2_2_34_1","volume-title":"Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439","author":"Khachatryan Levon","year":"2023","unstructured":"Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. 2023. 
Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439 (2023)."},{"key":"e_1_2_2_35_1","volume-title":"arXiv:2304.02643","author":"Kirillov Alexander","year":"2023","unstructured":"Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Doll\u00e1r, and Ross Girshick. 2023. Segment Anything. arXiv:2304.02643 (2023)."},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/1201775.882264"},{"key":"e_1_2_2_37_1","volume-title":"BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning (ICML).","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning (ICML)."},{"key":"e_1_2_2_38_1","volume-title":"3D Cinemagraphy from a Single Image. arXiv preprint arXiv:2303.05724","author":"Li Xingyi","year":"2023","unstructured":"Xingyi Li, Zhiguo Cao, Huiqiang Sun, Jianming Zhang, Ke Xian, and Guosheng Lin. 2023. 3D Cinemagraphy from a Single Image. arXiv preprint arXiv:2303.05724 (2023)."},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2816795.2818061"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2461912.2461950"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1982.1056489"},{"key":"e_1_2_2_42_1","volume-title":"Proceedings, Part XXIII 16","author":"Logacheva Elizaveta","year":"2020","unstructured":"Elizaveta Logacheva, Roman Suvorov, Oleg Khomenko, Anton Mashikhin, and Victor Lempitsky. 2020. Deeplandscape: Adversarial modeling of landscape videos. 
In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXIII 16. 256--272."},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00365"},{"key":"e_1_2_2_44_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Mansimov Elman","year":"2016","unstructured":"Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. 2016. Generating images from captions with attention. International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_45_1","volume-title":"On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems 14","author":"Ng Andrew","year":"2001","unstructured":"Andrew Ng, Michael Jordan, and Yair Weiss. 2001. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems 14 (2001)."},{"key":"e_1_2_2_46_1","unstructured":"NLTKTeam. 2023. Natural Language Toolkit. https:\/\/www.nltk.org\/."},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00244"},{"key":"e_1_2_2_48_1","volume-title":"Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu.","author":"Parmar Gaurav","year":"2023","unstructured":"Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. 2023. Zero-shot Image-to-Image Translation. arXiv preprint arXiv:2302.03027 (2023)."},{"key":"e_1_2_2_49_1","doi-asserted-by":"crossref","unstructured":"Or Patashnik Daniel Garibi Idan Azuri Hadar Averbuch-Elor and Daniel Cohen-Or. 2023. Localizing Object-level Shape Variations with Text-to-Image Diffusion Models. arXiv:2303.11306 [cs.CV]","DOI":"10.1109\/ICCV51070.2023.02107"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00209"},{"key":"e_1_2_2_51_1","volume-title":"International Conference on Machine Learning. 
PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748--8763."},{"key":"e_1_2_2_52_1","volume-title":"Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125","author":"Ramesh Aditya","year":"2022","unstructured":"Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)."},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_2_2_54_1","unstructured":"Robin Rombach Andreas Blattmann Dominik Lorenz Patrick Esser and Bj\u00f6rn Ommer. 2022b. Stable Diffusion. https:\/\/github.com\/CompVis\/stable-diffusion."},{"key":"e_1_2_2_55_1","volume-title":"U-net: Convolutional networks for biomedical image segmentation","author":"Ronneberger Olaf","year":"2015","unstructured":"Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In MICCAI. Springer, 234--241."},{"key":"e_1_2_2_56_1","volume-title":"Palette: Image-to-image diffusion models. In ACM SIGGRAPH. 1--10.","author":"Saharia Chitwan","year":"2022","unstructured":"Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. 2022a. Palette: Image-to-image diffusion models. In ACM SIGGRAPH. 
1--10."},{"key":"e_1_2_2_57_1","volume-title":"Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al.","author":"Saharia Chitwan","year":"2022","unstructured":"Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. 2022b. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487 (2022)."},{"key":"e_1_2_2_58_1","volume-title":"StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. International Conference on Machine Learning abs\/2301","author":"Sauer Axel","year":"2023","unstructured":"Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. 2023. StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. International Conference on Machine Learning abs\/2301.09515."},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/344779.345012"},{"key":"e_1_2_2_60_1","volume-title":"Computer Graphics Forum","author":"Sevilla-Lara Laura","unstructured":"Laura Sevilla-Lara, Jonas Wulff, Kalyan Sunkavalli, and Eli Shechtman. 2015. Smooth loops from unconstrained video. In Computer Graphics Forum, Vol. 34. Wiley Online Library, 99--107."},{"key":"e_1_2_2_61_1","volume-title":"Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792","author":"Singer Uriel","year":"2022","unstructured":"Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. 2022. Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:2209.14792 (2022)."},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00361"},{"key":"e_1_2_2_63_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Song Jiaming","year":"2021","unstructured":"Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_64_1","volume-title":"Proceedings, Part II 16","author":"Teed Zachary","year":"2020","unstructured":"Zachary Teed and Jia Deng. 2020. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part II 16. Springer, 402--419."},{"key":"e_1_2_2_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVMP.2011.16"},{"key":"e_1_2_2_66_1","volume-title":"Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. arXiv preprint arXiv:2211.12572","author":"Tumanyan Narek","year":"2022","unstructured":"Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. 2022. Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. arXiv preprint arXiv:2211.12572 (2022)."},{"key":"e_1_2_2_67_1","volume-title":"Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly.","author":"Unterthiner Thomas","year":"2018","unstructured":"Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. 2018. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)."},{"key":"e_1_2_2_68_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Villegas Ruben","year":"2023","unstructured":"Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. 2023. 
Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_69_1","volume-title":"Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952","author":"Wang Tengfei","year":"2022","unstructured":"Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. 2022. Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952 (2022)."},{"key":"e_1_2_2_70_1","volume-title":"High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Wang Ting-Chun","year":"2018","unstructured":"Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_2_71_1","volume-title":"Open-vocabulary panoptic segmentation with text-to-image diffusion models. arXiv preprint arXiv:2303.04803","author":"Xu Jiarui","year":"2023","unstructured":"Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. 2023. Open-vocabulary panoptic segmentation with text-to-image diffusion models. arXiv preprint arXiv:2303.04803 (2023)."},{"key":"e_1_2_2_72_1","volume-title":"Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al.","author":"Yu Jiahui","year":"2022","unstructured":"Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. 2022. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789 (2022)."},{"key":"e_1_2_2_73_1","volume-title":"Adding conditional control to text-to-image diffusion models. 
arXiv preprint arXiv:2302.05543","author":"Zhang Lvmin","year":"2023","unstructured":"Lvmin Zhang and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543 (2023)."},{"key":"e_1_2_2_74_1","doi-asserted-by":"crossref","unstructured":"Richard Zhang Phillip Isola Alexei A Efros Eli Shechtman and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.","DOI":"10.1109\/CVPR.2018.00068"},{"key":"e_1_2_2_75_1","volume-title":"Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018","author":"Zhou Daquan","year":"2022","unstructured":"Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. 2022. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018 (2022)."}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3618326","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3618326","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T10:49:38Z","timestamp":1755773378000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3618326"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,5]]},"references-count":75,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2023,12,5]]}},"alternative-id":["10.1145\/3618326"],"URL":"https:\/\/doi.org\/10.1145\/3618326","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,12,5]]},"assertion":[{"value":"2023-12-05","order":3,"name":"published","label":"Published","grou
p":{"name":"publication_history","label":"Publication History"}}]}}