{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T10:28:33Z","timestamp":1770719313206,"version":"3.49.0"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62325206, 62206132, 62532003"],"award-info":[{"award-number":["62325206, 62206132, 62532003"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Key Research and Development Program of Jiangsu Province","award":["BE2023016-4"],"award-info":[{"award-number":["BE2023016-4"]}]},{"name":"Nanjing Scientific and Technological Innovation Project for Overseas Returnees, and the Postgraduate Research and Practice Innovation Program of Jiangsu Province","award":["KYCX23_1023"],"award-info":[{"award-number":["KYCX23_1023"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>Text-conditioned image editing aims to modify a source image into a target image according to a specified text description, tackling two core challenges: locating target editing regions and ensuring consistency in non-target editing areas. Existing approaches utilize manual selection or cross-modal attention to define editing regions and deploy diffusion models to generate edited images. Despite these recent advancements, two problems remain. First, current methods fail to locate editing areas described in the text but invisible in the image. Second, they struggle to ensure spatial consistency in non-targeted regions due to the global noise addition along with excessive denoising during the diffusion process. To overcome these limitations, we propose AdaEdit, which comprises an adaptive mask localization module and an adaptive denoising strategy for text-conditioned image editing. AdaEdit can accurately identify the editing area via the measurement of cross-modal semantic mismatch, even when the visual details are not explicitly described in the text inputs. The adaptive denoising strategy applies varying noise levels to differentiate between targeted and non-targeted regions, enhancing the stability and consistency of the non-edited areas. Extensive experiments demonstrate that our proposed method achieves excellent performance on MS-COCO, MagicBrush, and Laion. We also expand our application to iterative editing tasks, thereby extending its utility for generalized editing scenarios.<\/jats:p>","DOI":"10.1145\/3778175","type":"journal-article","created":{"date-parts":[[2025,11,27]],"date-time":"2025-11-27T09:19:54Z","timestamp":1764235194000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["AdaEdit: Adaptive Diffusion Model for Invisible Target Oriented Text-Conditioned Image Editing"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2547-4646","authenticated-orcid":false,"given":"Yefei","family":"Sheng","sequence":"first","affiliation":[{"name":"Nanjing University of Posts and Telecommunications, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8662-9488","authenticated-orcid":false,"given":"Jie","family":"Wang","sequence":"additional","affiliation":[{"name":"Nanjing University of Posts and Telecommunications, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4662-7170","authenticated-orcid":false,"given":"Ming","family":"Tao","sequence":"additional","affiliation":[{"name":"Nanjing University of Posts and Telecommunications, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5956-831X","authenticated-orcid":false,"given":"Bing-Kun","family":"Bao","sequence":"additional","affiliation":[{"name":"Nanjing University of Posts and Telecommunications, Nanjing, China"}]}],"member":"320","published-online":{"date-parts":[[2026,2,9]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3592450"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01764"},{"key":"e_1_3_2_4_2","unstructured":"Mingdeng Cao Xintao Wang Zhongang Qi Ying Shan Xiaohu Qie and Yinqiang Zheng. 2023. MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE\/CVF International Conference on Computer Vision 22560\u201322570."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3137605"},{"key":"e_1_3_2_6_2","doi-asserted-by":"crossref","unstructured":"Tianrun Chen Lanyun Zhu Chaotao Ding Runlong Cao Shangzhan Zhang Yan Wang Zejian Li Lingyun Sun Papa Mao and Ying Zang. 2023. SAM fails to segment anything?\u2014SAM-adapter: Adapting SAM in underperformed scenes: Camouflage shadow and more. arXiv:2304.09148. Retrieved from https:\/\/arxiv.org\/abs\/2304.09148","DOI":"10.1109\/ICCVW60793.2023.00361"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3589002"},{"key":"e_1_3_2_8_2","unstructured":"Tsu-Jui Fu Wenze Hu Xianzhi Du William Yang Wang Yinfei Yang and Zhe Gan. 2023. Guiding instruction-based image editing via multimodal large language models. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2020.107384"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3219677"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01208"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco_a_01060"},{"key":"e_1_3_2_13_2","unstructured":"Amir Hertz Ron Mokady Jay Tenenbaum Kfir Aberman Yael Pritch and Daniel Cohen-Or. 2022. Prompt-to-prompt image editing with cross attention control. In The Eleventh International Conference on Learning Representations."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICBDA57405.2023.10104850"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00799"},{"key":"e_1_3_2_16_2","doi-asserted-by":"crossref","unstructured":"Kj Joseph Prateksha Udhayanan Tripti Shukla Aishwarya Agarwal Srikrishna Karanam Koustava Goswami and Balaji Vasan Srinivasan. 2023. Iterative multi-granular image editing using diffusion models. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision 8107\u20138116.","DOI":"10.1109\/WACV57701.2024.00792"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00582"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00790"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00598"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01245"},{"key":"e_1_3_2_21_2","doi-asserted-by":"crossref","unstructured":"Tsung-Yi Lin Michael Maire Serge Belongie James Hays Pietro Perona Deva Ramanan Piotr Doll\u00e1r and C. Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European Conference on Computer Vision. Springer International Publishing Cham 740\u2013755.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_22_2","doi-asserted-by":"crossref","unstructured":"Bingyan Liu Chengyu Wang Tingfeng Cao Kui Jia and Jun Huang. 2024. Towards understanding cross and self-attention in stable diffusion for text-guided image editing. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 7817\u20137826.","DOI":"10.1109\/CVPR52733.2024.00747"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3172548"},{"key":"e_1_3_2_24_2","doi-asserted-by":"crossref","unstructured":"Shilong Liu Zhaoyang Zeng Tianhe Ren Feng Li Hao Zhang Jie Yang Chunyuan Li Jianwei Yang Hang Su Jun Zhu et al. 2023. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision. Springer Nature Switzerland Cham 38\u201355.","DOI":"10.1007\/978-3-031-72970-6_3"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01117"},{"key":"e_1_3_2_26_2","unstructured":"Chenlin Meng Yutong He Yang Song Jiaming Song Jiajun Wu Jun-Yan Zhu and Stefano Ermon. 2021. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv:2108.01073. Retrieved from https:\/\/arxiv.org\/abs\/2108.01073"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00585"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3211473"},{"key":"e_1_3_2_29_2","unstructured":"Alex Nichol Prafulla Dhariwal Aditya Ramesh Pranav Shyam Pamela Mishkin Bob McGrew Ilya Sutskever and Mark Chen. 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning. PMLR 16784\u201316804."},{"key":"e_1_3_2_30_2","first-page":"8748","volume-title":"International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748\u20138763."},{"key":"e_1_3_2_31_2","first-page":"8748","volume-title":"International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_2_32_2","unstructured":"Scott Reed Zeynep Akata Xinchen Yan Lajanugen Logeswaran Bernt Schiele and Honglak Lee. 2016. Generative adversarial text to image synthesis. In International Conference on Machine Learning. PMLR 1060\u20131069."},{"key":"e_1_3_2_33_2","first-page":"217","article-title":"Learning what and where to draw","volume":"29","author":"Reed Scott E.","year":"2016","unstructured":"Scott E. Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. 2016. Learning what and where to draw. Advances in Neural Information Processing Systems 29 (2016), 217\u2013225.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_2_35_2","first-page":"36479","article-title":"Photorealistic text-to-image diffusion models with deep language understanding","volume":"35","author":"Saharia Chitwan","year":"2022","unstructured":"Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35 (2022), 36479\u201336494.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_36_2","first-page":"25278","article-title":"Laion-5b: An open large-scale dataset for training next generation image-text models","volume":"35","author":"Schuhmann Christoph","year":"2022","unstructured":"Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35 (2022), 25278\u201325294.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3650033"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00847"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3238554"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i8.26189"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01602"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00191"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02158"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01044"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3207000"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2024.3377540"},{"key":"e_1_3_2_47_2","doi-asserted-by":"crossref","unstructured":"Tao Xu Pengchuan Zhang Qiuyuan Huang Han Zhang Zhe Gan Xiaolei Huang and Xiaodong He. 2018. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1316\u20131324.","DOI":"10.1109\/CVPR.2018.00143"},{"key":"e_1_3_2_48_2","first-page":"1255","article-title":"Semantic distance adversarial learning for text-to-image synthesis","author":"Yuan Bowen","year":"2023","unstructured":"Bowen Yuan, Yefei Sheng, Bing-Kun Bao, Yi-Ping Phoebe Chen, and Changsheng Xu. 2023. Semantic distance adversarial learning for text-to-image synthesis. IEEE Transactions on Multimedia 26 (2023), 1255\u20131266.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2024.3382484"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2856256"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.629"},{"key":"e_1_3_2_52_2","unstructured":"Kai Zhang Lingbo Mo Wenhu Chen Huan Sun and Yu Su. 2023. MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems 36 (2023) 31428\u201331449."},{"issue":"3","key":"e_1_3_2_53_2","first-page":"3848","article-title":"TN-ZSTAD: Transferable network for zero-shot temporal activity detection","volume":"45","author":"Zhang Lingling","year":"2022","unstructured":"Lingling Zhang, Xiaojun Chang, Jun Liu, Minnan Luo, Zhihui Li, Lina Yao, and Alex Hauptmann. 2022. TN-ZSTAD: Transferable network for zero-shot temporal activity detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 3 (2022), 3848\u20133861.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1145\/3702999"},{"key":"e_1_3_2_55_2","unstructured":"Yufan Zhou Ruiyi Zhang Changyou Chen Chunyuan Li Chris Tensmeyer Tong Yu Jiuxiang Gu Jinhui Xu and Tong Sun. 2021. Lafite: Towards language-free training for text-to-image generation. arXiv:2111.13792. Retrieved from https:\/\/arxiv.org\/abs\/2111.13792"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2024.3369853"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3778175","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T14:57:44Z","timestamp":1770649064000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3778175"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,9]]},"references-count":55,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3778175"],"URL":"https:\/\/doi.org\/10.1145\/3778175","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,9]]},"assertion":[{"value":"2025-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-16","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}