{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T18:53:59Z","timestamp":1774637639079,"version":"3.50.1"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"4","funder":[{"DOI":"10.13039\/100026024","name":"Adobe Research","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100026024","id-type":"DOI","asserted-by":"crossref"}]},{"name":"UCL Centre for Artificial Intelligence"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2025,8,1]]},"abstract":"<jats:p>\n                    Retouching is an essential task in post-manipulation of raw photographs. Generative editing, guided by text or strokes, provides a new tool accessible to users but can easily change the identity of the original objects in unacceptable and unpredictable ways. In contrast, although traditional procedural edits, as commonly supported by photo-editing tools (e.g., GIMP, Lightroom), are conservative, they are still preferred by professionals. Unfortunately, professional-quality retouching involves many individual procedural editing operations that are challenging for most novices to plan. In this paper, we ask if a multimodal large language model (MLLM) can be\n                    <jats:italic toggle=\"yes\">taught<\/jats:italic>\n                    to critique raw photographs, suggest suitable remedies, and finally realize them with a given set of pre-authored procedural image operations. We demonstrate that MLLMs can be first made aware of the underlying image processing operations, by training them to solve specially-designed visual puzzles. Subsequently, such an operation-aware MLLM can both plan and propose edit sequences. 
To facilitate training, given a set of expert-edited photos, we synthesize a reasoning dataset by procedurally manipulating the expert edits and then grounding a pretrained LLM on the visual adjustments, to synthesize reasoning for finetuning. The proposed retouching operations are, by construction, understandable by the users, preserve object details and resolution, and can be optionally overridden. We evaluate our setup on a variety of test examples and show advantages, in terms of explainability and identity preservation, over existing generative and other procedural alternatives.\n                    <jats:italic toggle=\"yes\">Code, data, models, and supplementary results can be found via our project website at https:\/\/monetgpt.github.io.<\/jats:italic>\n                  <\/jats:p>","DOI":"10.1145\/3730926","type":"journal-article","created":{"date-parts":[[2025,7,27]],"date-time":"2025-07-27T04:02:22Z","timestamp":1753588942000},"page":"1-12","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills"],"prefix":"10.1145","volume":"44","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7423-2221","authenticated-orcid":false,"given":"Niladri Shekhar","family":"Dutt","sequence":"first","affiliation":[{"name":"University College London, London, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2307-9052","authenticated-orcid":false,"given":"Duygu","family":"Ceylan","sequence":"additional","affiliation":[{"name":"Adobe Research, London, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2597-0914","authenticated-orcid":false,"given":"Niloy J.","family":"Mitra","sequence":"additional","affiliation":[{"name":"University College London, London, United Kingdom"},{"name":"Adobe Research, London, United 
Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,7,27]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.","author":"Achiam Josh","year":"2023","unstructured":"Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)."},{"key":"e_1_2_2_2_1","volume-title":"Stewart Morris, Seung Jean Yoo, Aditya Ganeshan, R Kenny Jones, Qiuhong Anna Wei, Kailiang Fu, and Daniel Ritchie.","author":"Aguina-Kang Rio","year":"2024","unstructured":"Rio Aguina-Kang, Maxim Gumin, Do Heon Han, Stewart Morris, Seung Jean Yoo, Aditya Ganeshan, R Kenny Jones, Qiuhong Anna Wei, Kailiang Fu, and Daniel Ritchie. 2024. Open-Universe Indoor Scene Generation using LLM Program Synthesis and Uncurated Object Databases. arXiv preprint arXiv:2403.09675 (2024)."},{"key":"e_1_2_2_3_1","volume-title":"Efros","author":"Brooks Tim","year":"2022","unstructured":"Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2022. InstructPix2Pix: Learning to Follow Image Editing Instructions. In CVPR. 18392\u201318402. https:\/\/api.semanticscholar.org\/CorpusID:253581213"},{"key":"e_1_2_2_4_1","volume-title":"Efros","author":"Brooks Tim","year":"2023","unstructured":"Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2023. InstructPix2Pix: Learning to Follow Image Editing Instructions. In CVPR."},{"key":"e_1_2_2_5_1","doi-asserted-by":"crossref","unstructured":"Vladimir Bychkovsky Sylvain Paris Eric Chan and Fredo Durand. 2011. Learning photographic global tonal adjustment with a database of input \/ output image pairs. In CVPR. 
97\u2013104.","DOI":"10.1109\/CVPR.2011.5995413"},{"key":"e_1_2_2_6_1","doi-asserted-by":"crossref","unstructured":"Ming Cao Xintao Wang Zhongang Qi Ying Shan Xiaohu Qie and Yinqiang Zheng. 2023. MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing. In ICCV. 22503\u201322513. https:\/\/api.semanticscholar.org\/CorpusID:258179432","DOI":"10.1109\/ICCV51070.2023.02062"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.273"},{"key":"e_1_2_2_8_1","volume-title":"Xin Eric Wang, and William Yang Wang","author":"Feng Weixi","year":"2024","unstructured":"Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. 2024. LayoutGPT: Compositional Visual Planning and Generation with Large Language Models. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_2_2_9_1","volume-title":"NICER: Aesthetic image enhancement with humans in the loop. arXiv preprint arXiv:2012.01778","author":"Fischer Michael","year":"2020","unstructured":"Michael Fischer, Konstantin Kobs, and Andreas Hotho. 2020. NICER: Aesthetic image enhancement with humans in the loop. arXiv preprint arXiv:2012.01778 (2020)."},{"key":"e_1_2_2_10_1","unstructured":"Tsu-Jui Fu Wenze Hu Xianzhi Du William Wang Yinfei Yang and Zhe Gan. 2024. Guiding Instruction-based Image Editing via Multimodal Large Language Models. In ICLR. https:\/\/arxiv.org\/abs\/2309.17102"},{"key":"e_1_2_2_11_1","volume-title":"ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation. arXiv preprint arXiv:2410.01731","author":"Gal Rinon","year":"2024","unstructured":"Rinon Gal, Adi Haviv, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, and Gal Chechik. 2024. ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation. 
arXiv preprint arXiv:2410.01731 (2024)."},{"key":"e_1_2_2_12_1","volume-title":"Weinberger (Eds.)","volume":"27","author":"Goodfellow Ian","year":"2014","unstructured":"Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger (Eds.), Vol. 27. Curran Associates, Inc. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2014\/file\/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf"},{"key":"e_1_2_2_13_1","volume-title":"CCA: Collaborative Competitive Agents for Image Editing.","author":"Hang Tiankai","year":"2024","unstructured":"Tiankai Hang, Shuyang Gu, Dong Chen, Xin Geng, and Baining Guo. 2024. CCA: Collaborative Competitive Agents for Image Editing. (2024). arXiv:2401.13011 [cs.CV]"},{"key":"e_1_2_2_14_1","volume-title":"Seongmin Lee, and Polo Chau","author":"Helbling Alec","year":"2024","unstructured":"Alec Helbling, Seongmin Lee, and Polo Chau. 2024. ClickDiffusion: Harnessing LLMs for Interactive Precise Image Editing. arXiv preprint arXiv:2012.01778 (2024)."},{"key":"e_1_2_2_15_1","volume-title":"Prompt-to-Prompt Image Editing with Cross Attention Control. arXiv preprint arXiv:2208.01626","author":"Hertz Amir","year":"2022","unstructured":"Amir Hertz, Ron Mokady, Jay M. Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Prompt-to-Prompt Image Editing with Cross Attention Control. arXiv preprint arXiv:2208.01626 (2022). https:\/\/api.semanticscholar.org\/CorpusID:251252882"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3181974"},{"key":"e_1_2_2_17_1","doi-asserted-by":"crossref","unstructured":"Ian Huang Guandao Yang and Leonidas Guibas. 2024. BlenderAlchemy: Editing 3D Graphics with Vision-Language Models. 
arXiv:2404.17672 [cs.CV] https:\/\/arxiv.org\/abs\/2404.17672","DOI":"10.1007\/978-3-031-73024-5_18"},{"key":"e_1_2_2_18_1","volume-title":"Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al.","author":"Jiang Albert Q","year":"2023","unstructured":"Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023)."},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58595-2_21"},{"key":"e_1_2_2_20_1","volume-title":"PieNet: Personalized Image Enhancement. In European Conference on Computer Vision.","author":"Kim Han-Ul","year":"2020","unstructured":"Han-Ul Kim, Young Jun Koh, and Chang-Su Kim. 2020b. PieNet: Personalized Image Enhancement. In European Conference on Computer Vision."},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6790"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW59228.2023.00116"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00071"},{"key":"e_1_2_2_24_1","volume-title":"Niladri Shekhar Dutt, and Niloy J. Mitra","author":"Littlefair Gabrielle","year":"2025","unstructured":"Gabrielle Littlefair, Niladri Shekhar Dutt, and Niloy J. Mitra. 2025. FlairGPT: Repurposing LLMs for Interior Designs. arXiv:2501.04648 [cs.GR] https:\/\/arxiv.org\/abs\/2501.04648"},{"key":"e_1_2_2_25_1","volume-title":"Kwang-Ting Cheng, and Min-Hung Chen.","author":"Liu Shih-Yang","year":"2024","unstructured":"Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. 2024. DoRA: Weight-Decomposed Low-Rank Adaptation. 
arXiv:2402.09353 [cs.CL] https:\/\/arxiv.org\/abs\/2402.09353"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3179904"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cobeha.2020.11.002"},{"key":"e_1_2_2_28_1","volume-title":"Dragon-diffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421","author":"Mou Chong","year":"2023","unstructured":"Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. 2023. Dragon-diffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421 (2023)."},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01117"},{"key":"e_1_2_2_30_1","volume-title":"Kosmos-G: Generating Images in Context with Multimodal Large Language Models. arXiv preprint arXiv:2310.02992","author":"Pan Xichen","year":"2023","unstructured":"Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. 2023. Kosmos-G: Generating Images in Context with Multimodal Large Language Models. arXiv preprint arXiv:2310.02992 (2023). https:\/\/api.semanticscholar.org\/CorpusID:263620748"},{"key":"e_1_2_2_31_1","volume-title":"Kosmos-2: Grounding Multimodal Large Language Models to the World. arXiv preprint arXiv:2306.14824","author":"Peng Zhiliang","year":"2023","unstructured":"Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. 2023. Kosmos-2: Grounding Multimodal Large Language Models to the World. arXiv preprint arXiv:2306.14824 (2023). https:\/\/api.semanticscholar.org\/CorpusID:259262263"},{"key":"e_1_2_2_32_1","volume-title":"ShapeLLM: Universal 3D Object Understanding for Embodied Interaction. arXiv preprint arXiv:2402.17766","author":"Qi Zekun","year":"2024","unstructured":"Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. 2024. ShapeLLM: Universal 3D Object Understanding for Embodied Interaction. 
arXiv preprint arXiv:2402.17766 (2024)."},{"key":"e_1_2_2_33_1","doi-asserted-by":"crossref","unstructured":"Robin Rombach Andreas Blattmann Dominik Lorenz Patrick Esser and Bj\u00f6rn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_2_2_34_1","doi-asserted-by":"crossref","unstructured":"Rodrigo Santos Jo\u00e3o Silva and Ant\u00f3nio Branco. 2024. Leveraging LLMs for On-the-Fly Instruction Guided Image Editing. arXiv:2403.08004 [cs.CL] https:\/\/arxiv.org\/abs\/2403.08004","DOI":"10.1007\/978-3-031-73497-7_3"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01338"},{"key":"e_1_2_2_36_1","unstructured":"Gemini Team. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530 [cs.CL] https:\/\/arxiv.org\/abs\/2403.05530"},{"key":"e_1_2_2_37_1","volume-title":"Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_2_2_38_1","volume-title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191","author":"Wang Peng","year":"2024","unstructured":"Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024a. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. 
arXiv preprint arXiv:2409.12191 (2024)."},{"key":"e_1_2_2_39_1","unstructured":"Zhengyi Wang Jonathan Lorraine Yikai Wang Hang Su Jun Zhu Sanja Fidler and Xiaohui Zeng. 2024b. LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models. arXiv:2411.09595 [cs.LG] https:\/\/arxiv.org\/abs\/2411.09595"},{"key":"e_1_2_2_40_1","volume-title":"Proc. NeurIPS. Article","author":"Wei Jason","year":"2024","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2024. Chain-of-thought prompting elicits reasoning in large language models. In Proc. NeurIPS. Article 1800, 14 pages."},{"key":"e_1_2_2_41_1","volume-title":"OmniGen: Unified Image Generation. arXiv preprint arXiv:2409.11340","author":"Xiao Shitao","year":"2024","unstructured":"Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, and Zheng Liu. 2024. OmniGen: Unified Image Generation. arXiv preprint arXiv:2409.11340 (2024). https:\/\/api.semanticscholar.org\/CorpusID:272694523"},{"key":"e_1_2_2_42_1","volume-title":"Sing Bing Kang, and Xiaoou Tang","author":"Yan Jianzhou","year":"2014","unstructured":"Jianzhou Yan, Stephen Lin, Sing Bing Kang, and Xiaoou Tang. 2014. A Learning-to-Rank Approach for Image Color Enhancement. In CVPR. 2987\u20132994."},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2790296"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2312.09067"},{"key":"e_1_2_2_45_1","doi-asserted-by":"crossref","unstructured":"Lvmin Zhang Anyi Rao and Maneesh Agrawala. 2023. Adding Conditional Control to Text-to-Image Diffusion Models. In ICCV.","DOI":"10.1109\/ICCV51070.2023.00355"},{"key":"e_1_2_2_46_1","doi-asserted-by":"crossref","unstructured":"Richard Zhang Phillip Isola Alexei A Efros Eli Shechtman and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. 
In CVPR.","DOI":"10.1109\/CVPR.2018.00068"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-demos.38"},{"key":"e_1_2_2_48_1","volume-title":"GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing. arXiv preprint arXiv:2407.05600","author":"Zhenyu Wang","year":"2024","unstructured":"Wang Zhenyu, Li Aoxue, Li Zhenguo, and Liu Xihui. 2024. GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing. arXiv preprint arXiv:2407.05600 (2024)."}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3730926","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T17:53:05Z","timestamp":1774633985000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3730926"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,27]]},"references-count":48,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,8,1]]}},"alternative-id":["10.1145\/3730926"],"URL":"https:\/\/doi.org\/10.1145\/3730926","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,27]]},"assertion":[{"value":"2025-01-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-29","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}