{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,5]],"date-time":"2025-12-05T21:16:55Z","timestamp":1764969415401,"version":"3.46.0"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"6","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:p>Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. To enhance user control over subject placement, several layout-guided methods have been proposed. However, these methods face numerous challenges, particularly in compositional scenes. Unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, or attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while being in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluation demonstrates that our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.<\/jats:p>","DOI":"10.1145\/3763341","type":"journal-article","created":{"date-parts":[[2025,12,4]],"date-time":"2025-12-04T17:15:39Z","timestamp":1764868539000},"page":"1-12","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["MALeR: Improving Compositional Fidelity in Layout-Guided Generation"],"prefix":"10.1145","volume":"44","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-4964-2454","authenticated-orcid":false,"given":"Shivank","family":"Saxena","sequence":"first","affiliation":[{"name":"International Institute of Information Technology, Hyderabad, Hyderabad, India"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6789-4390","authenticated-orcid":false,"given":"Dhruv","family":"Srivastava","sequence":"additional","affiliation":[{"name":"International Institute of Information Technology, Hyderabad, Hyderabad, India"},{"name":"Adobe Research, Hyderabad, India"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8800-9015","authenticated-orcid":false,"given":"Makarand","family":"Tapaswi","sequence":"additional","affiliation":[{"name":"International Institute of Information Technology, Hyderabad, Hyderabad, India"}]}],"member":"320","published-online":{"date-parts":[[2025,12,4]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"A-STAR: Test-Time Attention Segregation and Retention for Text-to-Image Synthesis. In International Conference on Computer Vision (ICCV).","author":"Agarwal Aishwarya","year":"2023","unstructured":"Aishwarya Agarwal, Srikrishna Karanam, KJ Joseph, Apoorv Saxena, Koustava Goswami, and Balaji Vasan Srinivasan. 2023. A-STAR: Test-Time Attention Segregation and Retention for Text-to-Image Synthesis. In International Conference on Computer Vision (ICCV)."},{"key":"e_1_2_1_2_1","volume-title":"SpaText: Spatio-Textual Representation for Controllable Image Generation. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Avrahami Omri","year":"2023","unstructured":"Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. 2023. SpaText: Spatio-Textual Representation for Controllable Image Generation. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_3_1","volume-title":"Reliable and Scalable Benchmark for Text-to-Image Models. In International Conference on Computer Vision (ICCV).","author":"Bakr Eslam Mohamed","year":"2023","unstructured":"Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan, Li Erran Li, and Mohamed Elhoseiny. 2023. HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models. In International Conference on Computer Vision (ICCV)."},{"key":"e_1_2_1_4_1","unstructured":"Yogesh Balaji Seungjun Nah Xun Huang Arash Vahdat Jiaming Song Qinsheng Zhang Karsten Kreis Miika Aittala Timo Aila Samuli Laine et al. 2022. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. arXiv preprint arXiv:2211.01324 (2022)."},{"key":"e_1_2_1_5_1","volume-title":"Universal Guidance for Diffusion Models. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Bansal Arpit","year":"2023","unstructured":"Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2023. Universal Guidance for Diffusion Models. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_6_1","volume-title":"MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation. In International Conference on Machine Learning (ICML).","author":"Bar-Tal Omer","year":"2023","unstructured":"Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. 2023. MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation. In International Conference on Machine Learning (ICML)."},{"key":"e_1_2_1_7_1","volume-title":"Obtaining Favorable Layouts for Multiple Object Generation. arXiv preprint arXiv:2405.00791","author":"Battash Barak","year":"2024","unstructured":"Barak Battash, Amit Rozner, Lior Wolf, and Ofir Lindenbaum. 2024. Obtaining Favorable Layouts for Multiple Object Generation. arXiv preprint arXiv:2405.00791 (2024)."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3592116"},{"key":"e_1_2_1_9_1","volume-title":"Training-Free Layout Control with Cross-Attention Guidance. In Winter Conference on Applications of Computer Vision (WACV).","author":"Chen Minghao","year":"2024","unstructured":"Minghao Chen, Iro Laina, and Andrea Vedaldi. 2024. Training-Free Layout Control with Cross-Attention Guidance. In Winter Conference on Applications of Computer Vision (WACV)."},{"key":"e_1_2_1_10_1","volume-title":"Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation. In European Conference on Computer Vision (ECCV).","author":"Dahary Omer","year":"2024","unstructured":"Omer Dahary, Or Patashnik, Kfir Aberman, and Daniel Cohen-Or. 2024. Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation. In European Conference on Computer Vision (ECCV)."},{"key":"e_1_2_1_11_1","unstructured":"Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_1_12_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00371-023-03151-y"},{"key":"e_1_2_1_14_1","volume-title":"Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In International Conference on Machine Learning (ICML).","author":"Esser Patrick","year":"2024","unstructured":"Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M\u00fcller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. 2024. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In International Conference on Machine Learning (ICML)."},{"key":"e_1_2_1_15_1","volume-title":"Taming Transformers for High-Resolution Image Synthesis. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Esser Patrick","year":"2021","unstructured":"Patrick Esser, Robin Rombach, and Bjorn Ommer. 2021. Taming Transformers for High-Resolution Image Synthesis. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_16_1","volume-title":"Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis. In International Conference on Learning Representations (ICLR).","author":"Feng Weixi","year":"2023","unstructured":"Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. 2023. Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_1_17_1","volume-title":"Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Feng Yutong","year":"2024","unstructured":"Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, and Jingren Zhou. 2024. Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_18_1","volume-title":"International Conference on Computer Vision (ICCV).","author":"Gao Shanghua","year":"2023","unstructured":"Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. 2023. MDTV2: Masked Diffusion Transformer is a Strong Image Synthesizer. In International Conference on Computer Vision (ICCV)."},{"key":"e_1_2_1_19_1","volume-title":"ROICtrl: Boosting Instance Control for Visual Generation. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Gu Yuchao","year":"2025","unstructured":"Yuchao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, and Mike Zheng Shou. 2025. ROICtrl: Boosting Instance Control for Visual Generation. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00896"},{"key":"e_1_2_1_21_1","unstructured":"Yaru Hao Zewen Chi Li Dong and Furu Wei. 2023. Optimizing Prompts for Text-to-Image Generation. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_1_22_1","volume-title":"Prompt-to-Prompt Image Editing with Cross Attention Control. arXiv preprint arXiv:2208.01626","author":"Hertz Amir","year":"2022","unstructured":"Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Prompt-to-Prompt Image Editing with Cross Attention Control. arXiv preprint arXiv:2208.01626 (2022)."},{"key":"e_1_2_1_23_1","unstructured":"Jonathan Ho Ajay Jain and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_1_24_1","volume-title":"Classifier-Free Diffusion Guidance. In NeurIPS Workshop on Deep Generative Models and Downstream Applications.","author":"Ho Jonathan","year":"2021","unstructured":"Jonathan Ho and Tim Salimans. 2021. Classifier-Free Diffusion Guidance. In NeurIPS Workshop on Deep Generative Models and Downstream Applications."},{"key":"e_1_2_1_25_1","volume-title":"Composer: Creative and Controllable Image Synthesis with Composable Conditions. In International Conference on Machine Learning (ICML).","author":"Huang Lianghua","year":"2023","unstructured":"Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. 2023. Composer: Creative and Controllable Image Synthesis with Composable Conditions. In International Conference on Machine Learning (ICML)."},{"key":"e_1_2_1_26_1","volume-title":"Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory. arXiv preprint arXiv:2411.17472","author":"Jiang Eric Hanchen","year":"2024","unstructured":"Eric Hanchen Jiang, Yasi Zhang, Zhi Zhang, Yixin Wan, Andrew Lizarraga, Shufan Li, and Ying Nian Wu. 2024. Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory. arXiv preprint arXiv:2411.17472 (2024)."},{"key":"e_1_2_1_27_1","unstructured":"Diederik Kingma Tim Salimans Ben Poole and Jonathan Ho. 2021. Variational Diffusion Models. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177729694"},{"key":"e_1_2_1_29_1","volume-title":"Divide & Bind Your Attention for Improved Generative Semantic Nursing. In British Machine Vision Conference (BMVC).","author":"Li Yumeng","year":"2023","unstructured":"Yumeng Li, Margret Keuper, Dan Zhang, and Anna Khoreva. 2023a. Divide & Bind Your Attention for Improved Generative Semantic Nursing. In British Machine Vision Conference (BMVC)."},{"key":"e_1_2_1_30_1","volume-title":"GLIGEN: Open-Set Grounded Text-to-Image Generation. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Li Yuheng","year":"2023","unstructured":"Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. 2023b. GLIGEN: Open-Set Grounded Text-to-Image Generation. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_31_1","volume-title":"Place: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Lv Zhengyao","year":"2024","unstructured":"Zhengyao Lv, Yuxiang Wei, Wangmeng Zuo, and Kwan-Yee K Wong. 2024. Place: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_32_1","volume-title":"Conform: Contrast Is All You Need for High-Fidelity Text-to-Image Diffusion Models. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Salih Meral Tuna Han","year":"2024","unstructured":"Tuna Han Salih Meral, Enis Simsar, Federico Tombari, and Pinar Yanardag. 2024. Conform: Contrast Is All You Need for High-Fidelity Text-to-Image Diffusion Models. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_33_1","volume-title":"Dynamic Prompt Optimizing for Text-to-Image Generation. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Mo Wenyi","year":"2024","unstructured":"Wenyi Mo, Tianyu Zhang, Yalong Bai, Bing Su, Ji-Rong Wen, and Qing Yang. 2024. Dynamic Prompt Optimizing for Text-to-Image Generation. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_34_1","volume-title":"GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In International Conference on Machine Learning (ICML).","author":"Nichol Alex","year":"2022","unstructured":"Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In International Conference on Machine Learning (ICML)."},{"key":"e_1_2_1_35_1","volume-title":"Improved Denoising Diffusion Probabilistic Models. In International Conference on Machine Learning (ICML).","author":"Nichol Alexander Quinn","year":"2021","unstructured":"Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved Denoising Diffusion Probabilistic Models. In International Conference on Machine Learning (ICML)."},{"key":"e_1_2_1_36_1","volume-title":"Compositional Text-to-Image Generation with Dense Blob Representations. In International Conference on Machine Learning (ICML).","author":"Nie Weili","year":"2024","unstructured":"Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, and Arash Vahdat. 2024. Compositional Text-to-Image Generation with Dense Blob Representations. In International Conference on Machine Learning (ICML)."},{"key":"e_1_2_1_37_1","volume-title":"Scalable Diffusion Models with Transformers. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Peebles William","year":"2023","unstructured":"William Peebles and Saining Xie. 2023. Scalable Diffusion Models with Transformers. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_38_1","volume-title":"Grounded Text-to-Image Synthesis with Attention Refocusing. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Phung Quynh","year":"2024","unstructured":"Quynh Phung, Songwei Ge, and Jia-Bin Huang. 2024. Grounded Text-to-Image Synthesis with Attention Refocusing. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_39_1","volume-title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In International Conference on Learning Representations (ICLR).","author":"Podell Dustin","year":"2024","unstructured":"Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M\u00fcller, Joe Penna, and Robin Rombach. 2024. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_1_40_1","unstructured":"Leigang Qu Shengqiong Wu Hao Fei Liqiang Nie and Tat-Seng Chua. 2023. LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation. In ACM Multimedia (MM)."},{"key":"e_1_2_1_41_1","volume-title":"Zero-Shot Text-to-Image Generation. In International Conference on Machine Learning (ICML).","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. In International Conference on Machine Learning (ICML)."},{"key":"e_1_2_1_42_1","unstructured":"Royi Rassin Eran Hirsch Daniel Glickman Shauli Ravfogel Yoav Goldberg and Gal Chechik. 2023. Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_1_43_1","volume-title":"High-Resolution Image Synthesis with Latent Diffusion Models. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Rombach Robin","year":"2022","unstructured":"Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj\u00f6rn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_44_1","volume-title":"U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention.","author":"Ronneberger Olaf","year":"2015","unstructured":"Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention."},{"key":"e_1_2_1_45_1","volume-title":"Burcu Karagol Ayan, Tim Salimans, et al.","author":"Saharia Chitwan","year":"2022","unstructured":"Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_1_46_1","volume-title":"Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation. In SIGGRAPH Asia Conference Papers.","author":"Sauer Axel","year":"2024","unstructured":"Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. 2024. Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation. In SIGGRAPH Asia Conference Papers."},{"key":"e_1_2_1_47_1","volume-title":"FaceNet: A Unified Embedding for Face Recognition and Clustering. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Schroff Florian","year":"2015","unstructured":"Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_48_1","volume-title":"Denoising Diffusion Implicit Models. In International Conference on Learning Representations (ICLR).","author":"Song Jiaming","year":"2021","unstructured":"Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021a. Denoising Diffusion Implicit Models. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_1_49_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Song Yang","year":"2021","unstructured":"Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021b. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_1_50_1","volume-title":"CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis. arXiv preprint arXiv:2411.16783","author":"Sundaram Aravindan","year":"2024","unstructured":"Aravindan Sundaram, Ujjayan Pal, Abhimanyu Chauhan, Aishwarya Agarwal, and Srikrishna Karanam. 2024. CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis. arXiv preprint arXiv:2411.16783 (2024)."},{"volume-title":"What the DAAM: Interpreting Stable Diffusion Using Cross Attention","author":"Tang Raphael","key":"e_1_2_1_51_1","unstructured":"Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, and Ferhan Ture. 2023. What the DAAM: Interpreting Stable Diffusion Using Cross Attention. In Association of Computational Linguistics (ACL)."},{"key":"e_1_2_1_52_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_1_53_1","volume-title":"InstanceDiffusion: Instance-Level Control for Image Generation. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Wang Xudong","year":"2024","unstructured":"Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, and Ishan Misra. 2024. InstanceDiffusion: Instance-Level Control for Image Generation. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_54_1","volume-title":"Investigating Prompt Engineering in Diffusion Models. arXiv preprint arXiv:2211.15462","author":"Witteveen Sam","year":"2022","unstructured":"Sam Witteveen and Martin Andrews. 2022. Investigating Prompt Engineering in Diffusion Models. arXiv preprint arXiv:2211.15462 (2022)."},{"key":"e_1_2_1_55_1","volume-title":"Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis. In International Conference on Computer Vision (ICCV).","author":"Wu Qiucheng","year":"2023","unstructured":"Qiucheng Wu, Yujian Liu, Handong Zhao, Trung Bui, Zhe Lin, Yang Zhang, and Shiyu Chang. 2023. Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis. In International Conference on Computer Vision (ICCV)."},{"key":"e_1_2_1_56_1","volume-title":"Ifadapter: Instance feature control for grounded text-to-image generation. arXiv preprint arXiv:2409.08240","author":"Wu Yinwei","year":"2024","unstructured":"Yinwei Wu, Xianpan Zhou, Bing Ma, Xuefeng Su, Kai Ma, and Xinchao Wang. 2024. Ifadapter: Instance feature control for grounded text-to-image generation. arXiv preprint arXiv:2409.08240 (2024)."},{"key":"e_1_2_1_57_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Xiao Jiayu","year":"2024","unstructured":"Jiayu Xiao, Henglei Lv, Liang Li, Shuhui Wang, and Qingming Huang. 2024. R&B: Region and Boundary Aware Zero-Shot Grounded Text-to-Image Generation. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_1_58_1","volume-title":"BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion. In International Conference on Computer Vision (ICCV).","author":"Xie Jinheng","year":"2023","unstructured":"Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, and Mike Zheng Shou. 2023. BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion. In International Conference on Computer Vision (ICCV)."},{"key":"e_1_2_1_59_1","volume-title":"Imagereward: Learning and evaluating human preferences for text-to-image generation. In Advances in Neural Information Processing Systems (NeurIPS).","author":"Xu Jiazheng","year":"2023","unstructured":"Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. 2023. Imagereward: Learning and evaluating human preferences for text-to-image generation. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_1_60_1","volume-title":"ReCo: Region-Controlled Text-to-Image Generation. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Yang Zhengyuan","year":"2023","unstructured":"Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, et al. 2023. ReCo: Region-Controlled Text-to-Image Generation. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00355"},{"key":"e_1_2_1_62_1","volume-title":"LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis. arXiv preprint arXiv:2311.12342","author":"Zhao Peiang","year":"2023","unstructured":"Peiang Zhao, Han Li, Ruiyang Jin, and S Kevin Zhou. 2023. LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis. arXiv preprint arXiv:2311.12342 (2023)."},{"key":"e_1_2_1_63_1","volume-title":"LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Zheng Guangcong","year":"2023","unstructured":"Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, and Xi Li. 2023. LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_64_1","volume-title":"Fast Training of Diffusion Models with Masked Transformers. Transactions on Machine Learning Research (TMLR)","author":"Zheng Hongkai","year":"2024","unstructured":"Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. 2024. Fast Training of Diffusion Models with Masked Transformers. Transactions on Machine Learning Research (TMLR) (2024)."},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00651"}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3763341","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,5]],"date-time":"2025-12-05T21:13:45Z","timestamp":1764969225000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3763341"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12]]},"references-count":65,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["10.1145\/3763341"],"URL":"https:\/\/doi.org\/10.1145\/3763341","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"type":"print","value":"0730-0301"},{"type":"electronic","value":"1557-7368"}],"subject":[],"published":{"date-parts":[[2025,12]]},"assertion":[{"value":"2025-05-24","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-08-09","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}