{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T15:56:13Z","timestamp":1775577373157,"version":"3.50.1"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,7,26]],"date-time":"2023-07-26T00:00:00Z","timestamp":1690329600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2023,8]]},"abstract":"<jats:p>Text-driven image generation methods have shown impressive results recently, allowing casual users to generate high quality images by providing textual descriptions. However, similar capabilities for editing existing images are still out of reach. Text-driven image editing methods usually need edit masks, struggle with edits that require significant visual changes and cannot easily keep specific details of the edited portion. In this paper we make the observation that image-generation models can be converted to image-editing models simply by fine-tuning them on a single image. We also show that initializing the stochastic sampler with a noised version of the base image before the sampling and interpolating relevant details from the base image after sampling further increase the quality of the edit operation. Combining these observations, we propose UniTune, a novel image editing method. UniTune gets as input an arbitrary image and a textual edit description, and carries out the edit while maintaining high fidelity to the input image. UniTune does not require additional inputs, like masks or sketches, and can perform multiple edits on the same image without retraining. We test our method using the Imagen model in a range of different use cases. 
We demonstrate that it is broadly applicable and can perform a surprisingly wide range of expressive editing operations, including those requiring significant visual changes that were previously impossible.<\/jats:p>","DOI":"10.1145\/3592451","type":"journal-article","created":{"date-parts":[[2023,7,26]],"date-time":"2023-07-26T14:29:21Z","timestamp":1690381761000},"page":"1-10","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":53,"title":["UniTune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image"],"prefix":"10.1145","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4211-2866","authenticated-orcid":false,"given":"Dani","family":"Valevski","sequence":"first","affiliation":[{"name":"Google Research, Tel Aviv, Israel"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-8002-1148","authenticated-orcid":false,"given":"Matan","family":"Kalman","sequence":"additional","affiliation":[{"name":"Google Research, Mountain View, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-4353-9713","authenticated-orcid":false,"given":"Eyal","family":"Molad","sequence":"additional","affiliation":[{"name":"Google Research, Tel Aviv, Israel"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-6529-5361","authenticated-orcid":false,"given":"Eyal","family":"Segalis","sequence":"additional","affiliation":[{"name":"Google Research, Tel Aviv, Israel"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3960-6002","authenticated-orcid":false,"given":"Yossi","family":"Matias","sequence":"additional","affiliation":[{"name":"Google Research, Tel Aviv, Israel"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-4080-4845","authenticated-orcid":false,"given":"Yaniv","family":"Leviathan","sequence":"additional","affiliation":[{"name":"Google Research, Mountain View, United States of America"}]}],"member":"320","published-online":{"date-parts":[[2023,7,26]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","unstructured":"Rameen Abdal Peihao Zhu John Femiani Niloy J. Mitra and Peter Wonka. 2021. CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions. 10.48550\/ARXIV.2112.05219","DOI":"10.48550\/ARXIV.2112.05219"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","unstructured":"Omri Avrahami Ohad Fried and Dani Lischinski. 2022. Blended Latent Diffusion. 10.48550\/ARXIV.2206.02779","DOI":"10.48550\/ARXIV.2206.02779"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","unstructured":"Omri Avrahami Dani Lischinski and Ohad Fried. 2021. Blended Diffusion for Text-driven Editing of Natural Images. 10.48550\/ARXIV.2111.14818","DOI":"10.48550\/ARXIV.2111.14818"},{"key":"e_1_2_2_4_1","volume-title":"Text2LIVE: Text-Driven Layered Image and Video Editing. arXiv preprint arXiv:2204.02491","author":"Bar-Tal Omer","year":"2022","unstructured":"Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. 2022. Text2LIVE: Text-Driven Layered Image and Video Editing. arXiv preprint arXiv:2204.02491 (2022)."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","unstructured":"David Bau Alex Andonian Audrey Cui YeonHwan Park Ali Jahanian Aude Oliva and Antonio Torralba. 2021. Paint by Word. 10.48550\/ARXIV.2103.10951","DOI":"10.48550\/ARXIV.2103.10951"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","unstructured":"Andrew Brock Theodore Lim J. M. Ritchie and Nick Weston. 2016. Neural Photo Editing with Introspective Adversarial Networks. 
10.48550\/ARXIV.1609.07093","DOI":"10.48550\/ARXIV.1609.07093"},{"key":"e_1_2_2_7_1","volume-title":"Efros","author":"Brooks Tim","year":"2023","unstructured":"Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2023. InstructPix2Pix: Learning to Follow Image Editing Instructions. arXiv:2211.09800 [cs.CV]"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2108.02938"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","unstructured":"Prafulla Dhariwal and Alex Nichol. 2021. Diffusion Models Beat GANs on Image Synthesis. 10.48550\/ARXIV.2105.05233","DOI":"10.48550\/ARXIV.2105.05233"},{"key":"e_1_2_2_10_1","unstructured":"Ziyi Dong Pengxu Wei and Liang Lin. 2023. DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Contrastive Prompt-Tuning. arXiv:2211.11337 [cs.CV]"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","unstructured":"Rinon Gal Yuval Alaluf Yuval Atzmon Or Patashnik Amit H. Bermano Gal Chechik and Daniel Cohen-Or. 2022. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. 10.48550\/ARXIV.2208.01618","DOI":"10.48550\/ARXIV.2208.01618"},{"key":"e_1_2_2_12_1","doi-asserted-by":"crossref","unstructured":"Rinon Gal Or Patashnik Haggai Maron Gal Chechik and Daniel Cohen-Or. 2021. StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators. arXiv:2108.00946 [cs.CV]","DOI":"10.1145\/3528223.3530164"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","unstructured":"Ian J. Goodfellow Mehdi Mirza Da Xiao Aaron Courville and Yoshua Bengio. 2013. An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks. 10.48550\/ARXIV.1312.6211","DOI":"10.48550\/ARXIV.1312.6211"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","unstructured":"Ian J. Goodfellow Jean Pouget-Abadie Mehdi Mirza Bing Xu David Warde-Farley Sherjil Ozair Aaron Courville and Yoshua Bengio. 2014. Generative Adversarial Networks. 10.48550\/ARXIV.1406.2661","DOI":"10.48550\/ARXIV.1406.2661"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","unstructured":"Amir Hertz Ron Mokady Jay Tenenbaum Kfir Aberman Yael Pritch and Daniel Cohen-Or. 2022. Prompt-to-Prompt Image Editing with Cross Attention Control. 10.48550\/ARXIV.2208.01626","DOI":"10.48550\/ARXIV.2208.01626"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2210.02303"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","unstructured":"Jonathan Ho Ajay Jain and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. 10.48550\/ARXIV.2006.11239","DOI":"10.48550\/ARXIV.2006.11239"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","unstructured":"Jonathan Ho and Tim Salimans. 2022. Classifier-Free Diffusion Guidance. 10.48550\/ARXIV.2207.12598","DOI":"10.48550\/ARXIV.2207.12598"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","unstructured":"Tero Karras Samuli Laine and Timo Aila. 2018. A Style-Based Generator Architecture for Generative Adversarial Networks. 10.48550\/ARXIV.1812.04948","DOI":"10.48550\/ARXIV.1812.04948"},{"key":"e_1_2_2_20_1","volume-title":"Imagic: Text-Based Real Image Editing with Diffusion Models. arXiv:2210.09276 [cs.CV]","author":"Kawar Bahjat","year":"2023","unstructured":"Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. 2023. Imagic: Text-Based Real Image Editing with Diffusion Models. arXiv:2210.09276 [cs.CV]"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","unstructured":"Gwanghyun Kim Taesung Kwon and Jong Chul Ye. 
2021. DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation. 10.48550\/ARXIV.2110.02711","DOI":"10.48550\/ARXIV.2110.02711"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2112.05744"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","unstructured":"Chenlin Meng Yutong He Yang Song Jiaming Song Jiajun Wu Jun-Yan Zhu and Stefano Ermon. 2021. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. 10.48550\/ARXIV.2108.01073","DOI":"10.48550\/ARXIV.2108.01073"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2112.10741"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","unstructured":"Or Patashnik Zongze Wu Eli Shechtman Daniel Cohen-Or and Dani Lischinski. 2021. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. 10.48550\/ARXIV.2103.17249","DOI":"10.48550\/ARXIV.2103.17249"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2103.00020"},{"key":"e_1_2_2_27_1","first-page":"140","article-title":"2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1--67. http:\/\/jmlr.org\/papers\/v21\/20-074.html","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","unstructured":"Aditya Ramesh Prafulla Dhariwal Alex Nichol Casey Chu and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. 10.48550\/ARXIV.2204.06125","DOI":"10.48550\/ARXIV.2204.06125"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","unstructured":"Daniel Roich Ron Mokady Amit H. Bermano and Daniel Cohen-Or. 2021. Pivotal Tuning for Latent-based Editing of Real Images. 10.48550\/ARXIV.2106.05744","DOI":"10.48550\/ARXIV.2106.05744"},{"key":"e_1_2_2_30_1","doi-asserted-by":"crossref","unstructured":"Robin Rombach Andreas Blattmann Dominik Lorenz Patrick Esser and Bj\u00f6rn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","unstructured":"Olaf Ronneberger Philipp Fischer and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. 10.48550\/ARXIV.1505.04597","DOI":"10.48550\/ARXIV.1505.04597"},{"key":"e_1_2_2_32_1","doi-asserted-by":"crossref","unstructured":"Nataniel Ruiz Yuanzhen Li Varun Jampani Yael Pritch Michael Rubinstein and Kfir Aberman. 2022. DreamBooth: Fine Tuning Text-to-image Diffusion Models for Subject-Driven Generation.","DOI":"10.1109\/CVPR52729.2023.02155"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2205.11487"},{"key":"e_1_2_2_34_1","doi-asserted-by":"crossref","unstructured":"Chitwan Saharia Jonathan Ho William Chan Tim Salimans David J. Fleet and Mohammad Norouzi. 2021. Image Super-Resolution via Iterative Refinement. arXiv:2104.07636 [eess.IV]","DOI":"10.1109\/TPAMI.2022.3204461"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","unstructured":"Jascha Sohl-Dickstein Eric A. Weiss Niru Maheswaranathan and Surya Ganguli. 2015. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. 
10.48550\/ARXIV.1503.03585","DOI":"10.48550\/ARXIV.1503.03585"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","unstructured":"Yang Song and Stefano Ermon. 2019. Generative Modeling by Estimating Gradients of the Data Distribution. 10.48550\/ARXIV.1907.05600","DOI":"10.48550\/ARXIV.1907.05600"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","unstructured":"David Stap Maurits Bleeker Sarah Ibrahimi and Maartje ter Hoeve. 2020. Conditional Image Generation and Manipulation for User-Specified Content. 10.48550\/ARXIV.2005.04909","DOI":"10.48550\/ARXIV.2005.04909"},{"key":"e_1_2_2_38_1","unstructured":"Tengfei Wang Ting Zhang Bo Zhang Hao Ouyang Dong Chen Qifeng Chen and Fang Wen. 2022. Pretraining is All You Need for Image-to-Image Translation. In arXiv."},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","unstructured":"Weihao Xia Yujiu Yang Jing-Hao Xue and Baoyuan Wu. 2020. TediGAN: Text-Guided Diverse Face Image Generation and Manipulation. 10.48550\/ARXIV.2012.03308","DOI":"10.48550\/ARXIV.2012.03308"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","unstructured":"Weihao Xia Yulun Zhang Yujiu Yang Jing-Hao Xue Bolei Zhou and Ming-Hsuan Yang. 2021. GAN Inversion: A Survey. 10.48550\/ARXIV.2101.05278","DOI":"10.48550\/ARXIV.2101.05278"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.1609.03552"}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3592451","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3592451","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:48:59Z","timestamp":1750182539000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3592451"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,26]]},"references-count":41,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,8]]}},"alternative-id":["10.1145\/3592451"],"URL":"https:\/\/doi.org\/10.1145\/3592451","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,7,26]]},"assertion":[{"value":"2023-07-26","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
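
The abstract above pins down three concrete ingredients of UniTune: fine-tuning the text-to-image diffusion model on the single base image, initializing the stochastic sampler from a noised version of that image, and interpolating details from the base image back into the result after sampling. Neither the paper's code nor the Imagen model is public, so the following is only a minimal PyTorch sketch of those three steps under stated assumptions: `eps_model` stands in for any DDPM-style noise-prediction network, `alphas_cumprod` for its cumulative noise schedule (as in Ho et al. 2020, reference e_1_2_2_17_1), `cond` for the embedding of the textual edit description, and the final blend is a plain convex combination rather than the paper's exact interpolation procedure.

```python
import torch
import torch.nn.functional as F

# All names below are placeholders, not the authors' API:
# - eps_model(x_t, t, cond): any DDPM-style noise-prediction network.
# - alphas_cumprod: cumulative product of (1 - beta_t), shape [T].
# - cond: embedding of the textual edit description.

def finetune_step(eps_model, opt, x0, cond, alphas_cumprod):
    """One fine-tuning step on the single base image x0: the standard
    denoising objective, with a training set of exactly one image."""
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward diffusion
    loss = F.mse_loss(eps_model(x_t, t, cond), noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def noised_init(x0, t_start, alphas_cumprod):
    """Start the stochastic sampler from a noised version of the base
    image at timestep t_start, instead of from pure Gaussian noise."""
    a_bar = alphas_cumprod[t_start]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * torch.randn_like(x0)

def blend_base_details(edited, base, weight=0.2):
    """Illustrative stand-in for the paper's post-sampling interpolation:
    pull low-level detail from the base image back into the edited sample
    with a plain convex blend; `weight` is a guess, not a reported
    hyperparameter."""
    return (1.0 - weight) * edited + weight * base
```

An edit would then amount to: run `finetune_step` for some number of steps on the base image, sample with the usual reverse-diffusion loop conditioned on the edit prompt but starting from `noised_init(...)`, and finally apply `blend_base_details`. Step counts, `t_start`, and the blend weight are illustrative choices, not the paper's reported settings.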