{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T04:04:25Z","timestamp":1765339465238,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":53,"publisher":"ACM","funder":[{"name":"Ningbo \"Yongjiang Science Innovation 2035\" Key Technology Breakthrough Plan","award":["No. 2024Z254"],"award-info":[{"award-number":["No. 2024Z254"]}]},{"name":"Zhejiang \"Leading Goose\" R&D Program","award":["No. 2025C02211"],"award-info":[{"award-number":["No. 2025C02211"]}]},{"DOI":"10.13039\/501100002858","name":"China Postdoctoral Science Foundation","doi-asserted-by":"publisher","award":["No. 2024M752677"],"award-info":[{"award-number":["No. 2024M752677"]}],"id":[{"id":"10.13039\/501100002858","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,27]]},"DOI":"10.1145\/3746027.3755367","type":"proceedings-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T06:54:15Z","timestamp":1761375255000},"page":"10064-10073","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["ObjCtrl: Object-based Control Relaxation for Conditional Text-to-Image Generation"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-7700-4024","authenticated-orcid":false,"given":"Xinlong","family":"Zhang","sequence":"first","affiliation":[{"name":"College of Computer Science, Zhejiang University, Hangzhou, Zhejiang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5313-2742","authenticated-orcid":false,"given":"Zejian","family":"Li","sequence":"additional","affiliation":[{"name":"College of Computer Science, Zhejiang University, Hangzhou, Zhejiang, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4383-3905","authenticated-orcid":false,"given":"Wei","family":"Li","sequence":"additional","affiliation":[{"name":"School of Design, Southwest Jiaotong University, Chengdu, Sichuan, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-7094-884X","authenticated-orcid":false,"given":"Xiaoyu","family":"Zhang","sequence":"additional","affiliation":[{"name":"College of Computer Science, Zhejiang University, Hangzhou, Zhejiang, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-2996-203X","authenticated-orcid":false,"given":"Jia","family":"Wei","sequence":"additional","affiliation":[{"name":"College of Computer Science, Zhejiang University, Hangzhou, Zhejiang, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-2369-7419","authenticated-orcid":false,"given":"Chengyu","family":"Lin","sequence":"additional","affiliation":[{"name":"College of Computer Science, Zhejiang University, Hangzhou, Zhejiang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0157-7771","authenticated-orcid":false,"given":"Yongchuan","family":"Tang","sequence":"additional","affiliation":[{"name":"College of Computer Science, Zhejiang University, Hangzhou, Zhejiang, China"}]}],"member":"320","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01762"},{"key":"e_1_3_2_2_2_1","volume-title":"Proceedings of the 40th International Conference on Machine Learning","author":"Bar-Tal Omer","year":"2023","unstructured":"Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. 2023. MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation. In Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA) (ICML'23). Article 74, 16 pages."},{"key":"e_1_3_2_2_3_1","first-page":"8","article-title":"
Improving Image Generation with Better Captions","volume":"2","author":"Betker James","year":"2023","unstructured":"James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al., 2023. Improving Image Generation with Better Captions. Computer Science, Vol. 2, 3 (2023), 8.","journal-title":"Computer Science"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.1986.4767851"},{"key":"e_1_3_2_2_5_1","volume-title":"Article arXiv:2403.04279","author":"Cao Pu","year":"2024","unstructured":"Pu Cao, Feng Zhou, Qing Song, and Lu Yang. 2024. Controllable Generation with Text-to-Image Diffusion Models: A Survey. Article arXiv:2403.04279 (2024). arXiv:2403.04279 [cs.CV]"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.143"},{"key":"e_1_3_2_2_7_1","unstructured":"Hongyu Chen Yiqi Gao Min Zhou Peng Wang Xubin Li Tiezheng Ge and Bo Zheng. 2024a. Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion. arXiv:2404.14768 [cs.CV]"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV57701.2024.00526"},{"key":"e_1_3_2_2_9_1","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"Cheng Bo","year":"2025","unstructured":"Bo Cheng, Yuhang Ma, Liebucha Wu, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Dawei Leng, and Yuhui Yin. 2025. HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation. In Proceedings of the 38th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS '24). 
Curran Associates Inc., Red Hook, NY, USA, Article 4094, 25 pages."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP48485.2024.10446608"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.5555\/3692070.3692573"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i3.27951"},{"key":"e_1_3_2_2_13_1","volume-title":"Transactions on Machine Learning Research","author":"He Xuehai","year":"2024","unstructured":"Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit Bansal, Vicente Ordonez, Gunnar A. Sigurdsson, Nanyun Peng, and Xin Eric Wang. 2024. FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation. Transactions on Machine Learning Research (2024)."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295408"},{"key":"e_1_3_2_2_15_1","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"Hu Minghui","year":"2023","unstructured":"Minghui Hu, Jianbin Zheng, Daqing Liu, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, and Tat-Jen Cham. 2023. Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS '23). Curran Associates Inc., Red Hook, NY, USA, Article 1408, 21 pages."},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.5555\/3600270.3602196"},{"key":"e_1_3_2_2_17_1","volume-title":"Proceedings of the International Conference on Learning Representations.","author":"Kingma DP","year":"2014","unstructured":"DP Kingma. 2014. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_2_18_1","volume-title":"Proceedings of the European Conference on Computer Vision. 
Springer, 129-147","author":"Li Ming","year":"2024","unstructured":"Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen. 2024b. ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback. In Proceedings of the European Conference on Computer Vision. Springer, 129-147."},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3681284"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02156"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_2_22_1","volume-title":"Proceedings of the European Conference on Computer Vision. Springer, 38-55","author":"Liu Shilong","year":"2024","unstructured":"Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al., 2024b. Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection. In Proceedings of the European Conference on Computer Vision. Springer, 38-55."},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-73195-2_1"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3680658"},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612191"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00713"},{"key":"e_1_3_2_2_27_1","volume-title":"Proceedings of the 39th International Conference on Machine Learning. PMLR, 16784-16804","author":"Nichol Alexander Quinn","year":"2022","unstructured":"Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the 39th International Conference on Machine Learning. 
PMLR, 16784-16804."},{"key":"e_1_3_2_2_28_1","volume-title":"Proceedings of the International Conference on Learning Representations. 1862-1874","author":"Podell Dustin","year":"2024","unstructured":"Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M\u00fcller, Joe Penna, and Robin Rombach. 2024. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In Proceedings of the International Conference on Learning Representations. 1862-1874."},{"key":"e_1_3_2_2_29_1","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"Qin Can","year":"2023","unstructured":"Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, and Ran Xu. 2023. UniControl: A Unified Diffusion Model for Controllable Visual Generation In The Wild. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS '23). Curran Associates Inc., Red Hook, NY, USA, Article 1862, 32 pages."},{"key":"e_1_3_2_2_30_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning. PMLR, 8748-8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., 2021. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning. 
PMLR, 8748-8763."},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2020.3019967"},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_2_2_33_1","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"Schuhmann Christoph","year":"2022","unstructured":"Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. 2022. LAION-5B: An Open Large-scale Dataset for Training Next Generation Image-Text Models. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS '22). Curran Associates Inc., Red Hook, NY, USA, Article 1833, 17 pages."},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-73247-8_6"},{"key":"e_1_3_2_2_35_1","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision.","author":"Tan Zhenxiong","year":"2025","unstructured":"Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. 2025. OminiControl: Minimal and Universal Control for Diffusion Transformer. In Proceedings of the IEEE\/CVF International Conference on Computer Vision."},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00753"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72970-6_2"},{"key":"e_1_3_2_2_39_1","unstructured":"Rui Wang Hailong Guo Jiaming Liu Huaxia Li Haibo Zhao Xu Tang Yao Hu Hao Tang and Peipei Li. 2024b. StableGarment: Garment-Centric Generation via Stable Diffusion. 
arXiv:2403.10783 [cs.CV] https:\/\/arxiv.org\/abs\/2403.10783"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00596"},{"key":"e_1_3_2_2_41_1","unstructured":"Yinwei Wu Xianpan Zhou Bing Ma Xuefeng Su Kai Ma and Xinchao Wang. 2024. IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation. arXiv:2409.08240 [cs.CV]"},{"key":"e_1_3_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00685"},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3680692"},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-73223-2_20"},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00800"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00355"},{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGIP62525.2024.00035"},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00179"},{"volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"Zhao Shihao","key":"e_1_3_2_2_49_1","unstructured":"Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K. Wong. 2023. Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS '23). 
Article 491, 24 pages."},{"key":"e_1_3_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02154"},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2024.3510752"},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00651"},{"key":"e_1_3_2_2_53_1","volume-title":"Proceedings of the International Conference on Learning Representations.","author":"Zhou Dewei","year":"2025","unstructured":"Dewei Zhou, Ji Xie, Zongxin Yang, and Yi Yang. 2025b. 3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation. In Proceedings of the International Conference on Learning Representations."}],"event":{"name":"MM '25: The 33rd ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Dublin Ireland","acronym":"MM '25"},"container-title":["Proceedings of the 33rd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746027.3755367","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T04:00:08Z","timestamp":1765339208000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746027.3755367"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":53,"alternative-id":["10.1145\/3746027.3755367","10.1145\/3746027"],"URL":"https:\/\/doi.org\/10.1145\/3746027.3755367","relation":{},"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"2025-10-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}