{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T17:47:15Z","timestamp":1772905635170,"version":"3.50.1"},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T00:00:00Z","timestamp":1721347200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Hong Kong RGC General Research Fund","award":["CityU 11216122"],"award-info":[{"award-number":["CityU 11216122"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["Grant 62276128"],"award-info":[{"award-number":["Grant 62276128"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["Grant 62192783"],"award-info":[{"award-number":["Grant 62192783"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2024,7,19]]},"abstract":"<jats:p>Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively. Our project webpage is available at https:\/\/analogist2d.github.io.<\/jats:p>","DOI":"10.1145\/3658136","type":"journal-article","created":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T14:47:57Z","timestamp":1721400477000},"page":"1-15","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model"],"prefix":"10.1145","volume":"43","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9914-3922","authenticated-orcid":false,"given":"Zheng","family":"Gu","sequence":"first","affiliation":[{"name":"Nanjing University, Nanjing, China"},{"name":"City University of Hong Kong, Hong Kong, Hong Kong"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8213-5803","authenticated-orcid":false,"given":"Shiyuan","family":"Yang","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"},{"name":"City University of Hong Kong, Hong Kong, Hong Kong"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7014-5377","authenticated-orcid":false,"given":"Jing","family":"Liao","sequence":"additional","affiliation":[{"name":"City University of Hong Kong, Hong Kong, Hong Kong"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8504-455X","authenticated-orcid":false,"given":"Jing","family":"Huo","sequence":"additional","affiliation":[{"name":"Nanjing University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2488-1813","authenticated-orcid":false,"given":"Yang","family":"Gao","sequence":"additional","affiliation":[{"name":"Nanjing University, Nanjing, China"}]}],"member":"320","published-online":{"date-parts":[[2024,7,19]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Sequential Modeling Enables Scalable Learning for Large Vision Models. arXiv preprint arXiv:2312.00785","author":"Bai Yutong","year":"2023","unstructured":"Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, and Alexei A Efros. 2023. Sequential Modeling Enables Scalable Learning for Large Vision Models. arXiv preprint arXiv:2312.00785 (2023)."},{"key":"e_1_2_1_2_1","first-page":"25005","article-title":"Visual prompting via image inpainting","volume":"35","author":"Bar Amir","year":"2022","unstructured":"Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, and Alexei Efros. 2022. Visual prompting via image inpainting. Advances in Neural Information Processing Systems 35 (2022), 25005--25017.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3610287"},{"key":"e_1_2_1_4_1","unstructured":"James Betker Gabriel Goh Li Jing Tim Brooks Jianfeng Wang Linjie Li Long Ouyang Juntang Zhuang Joyce Lee Yufei Guo et al. 2023. Improving image generation with better captions. Computer Science. https:\/\/cdn.openai.com\/papers\/dall-e-3.pdf 2 (2023) 3."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01764"},{"key":"e_1_2_1_6_1","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877--1901.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.02062"},{"key":"e_1_2_1_8_1","volume-title":"MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer. arXiv preprint arXiv:2311.12052","author":"Chang Di","year":"2023","unstructured":"Di Chang, Yichun Shi, Quankai Gao, Jessica Fu, Hongyi Xu, Guoxian Song, Qing Yan, Xiao Yang, and Mohammad Soleymani. 2023. MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer. arXiv preprint arXiv:2311.12052 (2023)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3386569.3392386"},{"key":"e_1_2_1_10_1","volume-title":"Deep Retinex Decomposition for Low-Light Enhancement. In British Machine Vision Conference. British Machine Vision Association.","author":"Chen Wei","year":"2018","unstructured":"Wei Chen, Wang Wenjing, Yang Wenhan, and Liu Jiaying. 2018. Deep Retinex Decomposition for Low-Light Enhancement. In British Machine Vision Conference. British Machine Vision Association."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.261"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2699641"},{"key":"e_1_2_1_14_1","volume-title":"A survey for in-context learning. arXiv preprint arXiv:2301.00234","author":"Dong Qingxiu","year":"2022","unstructured":"Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234 (2022)."},{"key":"e_1_2_1_15_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2021.3086722"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01785"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/383259.383295"},{"key":"e_1_2_1_19_1","first-page":"6840","article-title":"Denoising diffusion probabilistic models","volume":"33","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840--6851.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3306346.3323006"},{"key":"e_1_2_1_21_1","volume-title":"International Conference on Machine Learning. PMLR, 12888--12900","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning. PMLR, 12888--12900."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073683"},{"key":"e_1_2_1_23_1","volume-title":"Context Diffusion: In-Context Aware Image Generation. arXiv preprint arXiv:2312.03584","author":"Najdenkoska Ivona","year":"2023","unstructured":"Ivona Najdenkoska, Animesh Sinha, Abhimanyu Dubey, Dhruv Mahajan, Vignesh Ramanathan, and Filip Radenovic. 2023. Context Diffusion: In-Context Aware Image Generation. arXiv preprint arXiv:2312.03584 (2023)."},{"key":"e_1_2_1_24_1","volume-title":"Thirty-seventh Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=l9BsCh8ikK","author":"Nguyen Thao","year":"2023","unstructured":"Thao Nguyen, Yuheng Li, Utkarsh Ojha, and Yong Jae Lee. 2023. Visual Instruction Inversion: Image Editing via Image Prompting. In Thirty-seventh Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=l9BsCh8ikK"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588432.3591513"},{"key":"e_1_2_1_26_1","volume-title":"Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00209"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.85"},{"key":"e_1_2_1_29_1","volume-title":"International Conference on Machine Learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748--8763."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_2_1_31_1","volume-title":"Diffusion Image Analogies. In ACM SIGGRAPH 2023 Conference Proceedings. 1--10","author":"\u0160ubrtov\u00e1 Ad\u00e9la","year":"2023","unstructured":"Ad\u00e9la \u0160ubrtov\u00e1, Michal Luk\u00e1\u010d, Jan \u010cech, David Futschik, Eli Shechtman, and Daniel S\u1ef3kora. 2023. Diffusion Image Analogies. In ACM SIGGRAPH 2023 Conference Proceedings. 1--10."},{"key":"e_1_2_1_32_1","unstructured":"Yasheng SUN Yifan Yang Houwen Peng Yifei Shen Yuqing Yang Han Hu Lili Qiu and Hideki Koike. 2023. ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation. In Thirty-seventh Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=EmOIP3t9nk"},{"key":"e_1_2_1_33_1","volume-title":"Manmohan Chandraker, Ravi Ramamoorthi, and Koki Nagano.","author":"Trevithick Alex","year":"2024","unstructured":"Alex Trevithick, Matthew Chan, Towaki Takikawa, Umar Iqbal, Shalini De Mello, Manmohan Chandraker, Ravi Ramamoorthi, and Koki Nagano. 2024. What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs. arXiv preprint arXiv:2401.02411 (2024)."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00660"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00110"},{"key":"e_1_2_1_36_1","volume-title":"In-Context Learning Unlocked for Diffusion Models. In Thirty-seventh Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=6BZS2EAkns","author":"Wang Zhendong","year":"2023","unstructured":"Zhendong Wang, Yifan Jiang, Yadong Lu, yelong shen, Pengcheng He, Weizhu Chen, Zhangyang Wang, and Mingyuan Zhou. 2023a. In-Context Learning Unlocked for Diffusion Models. In Thirty-seventh Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=6BZS2EAkns"},{"key":"e_1_2_1_37_1","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824--24837.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_38_1","first-page":"1","article-title":"IconShop: Text-Guided Vector Icon Synthesis with Autoregressive Transformers","volume":"42","author":"Wu Ronghuan","year":"2023","unstructured":"Ronghuan Wu, Wanchao Su, Kede Ma, and Jing Liao. 2023. IconShop: Text-Guided Vector Icon Synthesis with Autoregressive Transformers. ACM Transactions on Graphics 42, 6 (2023), 1--14.","journal-title":"ACM Transactions on Graphics"},{"key":"e_1_2_1_39_1","first-page":"3121","article-title":"Gan inversion: A survey","volume":"45","author":"Xia Weihao","year":"2022","unstructured":"Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. 2022. Gan inversion: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 3 (2022), 3121--3138.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_2_1_40_1","volume-title":"Small models are valuable plug-ins for large language models. arXiv preprint arXiv:2305.08848","author":"Xu Canwen","year":"2023","unstructured":"Canwen Xu, Yichong Xu, Shuohang Wang, Yang Liu, Chenguang Zhu, and Julian McAuley. 2023. Small models are valuable plug-ins for large language models. arXiv preprint arXiv:2305.08848 (2023)."},{"key":"e_1_2_1_41_1","volume-title":"Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441","author":"Yang Jianwei","year":"2023","unstructured":"Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023b. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441 (2023)."},{"key":"e_1_2_1_42_1","volume-title":"The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421 9, 1","author":"Yang Zhengyuan","year":"2023","unstructured":"Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. 2023a. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421 9, 1 (2023)."},{"key":"e_1_2_1_43_1","volume-title":"DiffMat: Latent diffusion models for image-guided material generation. Visual Informatics","author":"Yuan Liang","year":"2024","unstructured":"Liang Yuan, Dingkun Yan, Suguru Saito, and Issei Fujishiro. 2024. DiffMat: Latent diffusion models for image-guided material generation. Visual Informatics (2024)."},{"key":"e_1_2_1_44_1","volume-title":"Dwnet: Dense warp-based network for pose-guided human video generation. arXiv preprint arXiv:1910.09139","author":"Zablotskaia Polina","year":"2019","unstructured":"Polina Zablotskaia, Aliaksandr Siarohin, Bo Zhao, and Leonid Sigal. 2019. Dwnet: Dense warp-based network for pose-guided human video generation. arXiv preprint arXiv:1910.09139 (2019)."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00355"},{"key":"e_1_2_1_46_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Zhu Deyao","year":"2023","unstructured":"Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. In The Twelfth International Conference on Learning Representations."}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3658136","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3658136","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:05:54Z","timestamp":1750291554000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3658136"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,19]]},"references-count":46,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,7,19]]}},"alternative-id":["10.1145\/3658136"],"URL":"https:\/\/doi.org\/10.1145\/3658136","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,19]]},"assertion":[{"value":"2024-07-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}