{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,24]],"date-time":"2025-12-24T12:21:37Z","timestamp":1766578897670,"version":"3.41.0"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2025,5,22]],"date-time":"2025-05-22T00:00:00Z","timestamp":1747872000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,5,31]]},"abstract":"<jats:p>The ability to fine-tune generative models for text-to-image generation tasks is crucial, particularly when facing the complexity involved in accurately interpreting and visualizing textual inputs. While LoRA is efficient for language model adaptation, it often falls short in text-to-image tasks due to the intricate demands of image generation, such as accommodating a broad spectrum of styles and nuances. To bridge this gap, we introduce StyleInject, a specialized fine-tuning approach tailored for text-to-image models. StyleInject comprises multiple parallel low-rank parameter matrices, maintaining the diversity of visual features. It dynamically adapts to varying styles by adjusting the variance of visual features based on the characteristics of the input signal. This approach significantly minimizes the impact on the original model\u2019s text-image alignment capabilities while adeptly adapting to various styles in transfer learning. StyleInject proves particularly effective in learning from and enhancing a range of advanced, community-fine-tuned generative models. 
Our comprehensive experiments, including both small-sample and large-scale data fine-tuning as well as base model distillation, show that StyleInject surpasses traditional LoRA in both text-image semantic consistency and human preference evaluation, all while ensuring greater parameter efficiency.<\/jats:p>","DOI":"10.1145\/3730403","type":"journal-article","created":{"date-parts":[[2025,4,16]],"date-time":"2025-04-16T16:38:16Z","timestamp":1744821496000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3250-4978","authenticated-orcid":false,"given":"Mohan","family":"Zhou","sequence":"first","affiliation":[{"name":"Harbin Institute of Technology, Harbin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8416-9027","authenticated-orcid":false,"given":"Yalong","family":"Bai","sequence":"additional","affiliation":[{"name":"Du Xiaoman Technology, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8596-884X","authenticated-orcid":false,"given":"Qing","family":"Yang","sequence":"additional","affiliation":[{"name":"Du Xiaoman Technology, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4659-4935","authenticated-orcid":false,"given":"Tiejun","family":"Zhao","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Harbin, China"}]}],"member":"320","published-online":{"date-parts":[[2025,5,22]]},"reference":[{"key":"e_1_3_2_2_2","first-page":"25278","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"35","author":"Schuhmann C.","year":"2022","unstructured":"C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. 2022. 
LAION-5B: An open large-scale dataset for training next generation image-text models. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 35, 25278\u201325294."},{"key":"e_1_3_2_3_2","first-page":"36479","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"35","author":"Saharia C.","year":"2022","unstructured":"C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 35, 36479\u201336494."},{"key":"e_1_3_2_4_2","unstructured":"P. Mishkin L. Ahmad M. Brundage G. Krueger and G. Sastry. 2022. DALL\u00b7E 2 Preview\u2014Risks and Limitations. Retrieved from https:\/\/github.com\/openai\/dalle-2-preview\/blob\/main\/system-card.md"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_2_6_2","first-page":"16784","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Nichol A. Q.","year":"2022","unstructured":"A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen. 2022. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proceedings of the International Conference on Machine Learning. PMLR, 16784\u201316804."},{"key":"e_1_3_2_7_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Podell D.","year":"2024","unstructured":"D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M\u00fcller, J. Penna, and R. Rombach. 2024. SDXL: Improving latent diffusion models for high-resolution image synthesis. 
In Proceedings of the 12th International Conference on Learning Representations."},{"key":"e_1_3_2_8_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Hu E. J.","year":"2022","unstructured":"E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. 2022. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_9_2","unstructured":"Y. Liu M. Ott N. Goyal J. Du M. Joshi D. Chen O. Levy M. Lewis L. Zettlemoyer and V. Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692. Retrieved from https:\/\/arxiv.org\/abs\/1907.11692"},{"key":"e_1_3_2_10_2","first-page":"1877","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"33","author":"Brown T.","year":"2020","unstructured":"T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. 2020. Language models are few-shot learners. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 33, 1877\u20131901."},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.167"},{"key":"e_1_3_2_12_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"31","author":"Nam H.","year":"2018","unstructured":"H. Nam and H.-E. Kim. 2018. Batch-instance normalization for adaptively style-invariant neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 31."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00453"},{"key":"e_1_3_2_14_2","first-page":"6840","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"33","author":"Ho J.","year":"2020","unstructured":"J. Ho, A. Jain, and P. Abbeel. 2020. Denoising diffusion probabilistic models. 
In Proceedings of the Advances in Neural Information Processing Systems, Vol. 33, 6840\u20136851."},{"key":"e_1_3_2_15_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Song J.","year":"2021","unstructured":"J. Song, C. Meng, and S. Ermon. 2021. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_16_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"27","author":"Goodfellow I.","year":"2014","unstructured":"I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. 2014. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 27."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3589002"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3650033"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3545610"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3576858"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01764"},{"key":"e_1_3_2_22_2","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"Hertz A.","year":"2023","unstructured":"A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or. 2023. Prompt-to-prompt image editing with cross-attention control. In Proceedings of the 11th International Conference on Learning Representations."},{"key":"e_1_3_2_23_2","first-page":"22500","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Ruiz N.","year":"2023","unstructured":"N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman. 2023. DreamBooth: Fine-tuning text-to-image diffusion models for subject-driven generation. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 22500\u201322510."},{"key":"e_1_3_2_24_2","first-page":"3836","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV \u201923)","author":"Zhang L.","year":"2023","unstructured":"L. Zhang, A. Rao, and M. Agrawala. 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV \u201923), 3836\u20133847."},{"key":"e_1_3_2_25_2","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"Zhang Q.","year":"2023","unstructured":"Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, W. Chen, and T. Zhao. 2023. Adaptive budget allocation for parameter-efficient fine-tuning. In Proceedings of the 11th International Conference on Learning Representations."},{"key":"e_1_3_2_26_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Hyeon-Woo N.","year":"2022","unstructured":"N. Hyeon-Woo, M. Ye-Bin, and T.-H. Oh. 2022. FedPara: Low-rank Hadamard product for communication-efficient federated learning. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_27_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Yeh S.-Y.","year":"2023","unstructured":"S.-Y. Yeh, Y.-G. Hsieh, Z. Gao, B. B. Yang, G. Oh, and Y. Gong. 2023. Navigating text-to-image customization: From LyCORIS fine-tuning to model evaluation. In Proceedings of the 12th International Conference on Learning Representations."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01374"},{"key":"e_1_3_2_29_2","unstructured":"B.-K. Kim H.-K. Song T. Castells and S. Choi. 2023. On architectural compression of text-to-image diffusion models. arXiv:2305.15798. 
Retrieved from https:\/\/arxiv.org\/abs\/2305.15798"},{"key":"e_1_3_2_30_2","volume-title":"Proceedings of the Workshop on Efficient Systems for Foundation Models (ICML \u201923)","author":"Kim B.-K.","year":"2023","unstructured":"B.-K. Kim, H.-K. Song, T. Castells, and S. Choi. 2023. BK-SDM: Architecturally compressed stable diffusion for efficient text-to-image generation. In Proceedings of the Workshop on Efficient Systems for Foundation Models (ICML \u201923)."},{"key":"e_1_3_2_31_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Ha D.","year":"2022","unstructured":"D. Ha, A. M. Dai, and Q. V. Le. 2022. Hypernetworks. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612191"},{"key":"e_1_3_2_33_2","volume-title":"Proceedings of the 37th Conference on Neural Information Processing Systems","author":"Yu X.","year":"2023","unstructured":"X. Yu, X. Gu, H. Liu, and J. Sun. 2023. Constructing non-isotropic gaussian diffusion model using isotropic Gaussian diffusion model for image editing. In Proceedings of the 37th Conference on Neural Information Processing Systems. Retrieved from https:\/\/openreview.net\/forum?id=2Ibp83esmb"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.595"},{"key":"e_1_3_2_35_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"36","author":"Xu J.","year":"2024","unstructured":"J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong. 2024. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 36."},{"key":"e_1_3_2_36_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"36","author":"Kirstain Y.","year":"2024","unstructured":"Y. 
Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy. 2024. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 36."},{"key":"e_1_3_2_37_2","unstructured":"A. Yang J. Pan J. Lin R. Men Y. Zhang J. Zhou and C. Zhou. 2022. Chinese CLIP: Contrastive vision-language pretraining in Chinese. arXiv:2211.01335. Retrieved from https:\/\/arxiv.org\/abs\/2211.01335"},{"key":"e_1_3_2_38_2","first-page":"11","article-title":"Visualizing data using t-SNE","volume":"9","author":"Van der Maaten L.","year":"2008","unstructured":"L. Van der Maaten and G. Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9 (2008), 11.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_39_2","unstructured":"X. Wu Y. Hao K. Sun Y. Chen F. Zhu R. Zhao and H. Li. 2023. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv:2306.09341. Retrieved from https:\/\/arxiv.org\/abs\/2306.09341"},{"key":"e_1_3_2_40_2","unstructured":"M. Oquab T. Darcet T. Moutakanni H. V. Vo M. Szafraniec V. Khalidov P. Fernandez D. Haziza F. Massa A. El-Nouby et al. 2024. DINOv2: Learning robust visual features without supervision. In Transactions on Machine Learning Research. Retrieved from https:\/\/openreview.net\/forum?id=a68SUt6zFt"},{"key":"e_1_3_2_41_2","unstructured":"D. Friedman and A. B. Dieng. 2023. The Vendi score: A diversity evaluation metric for machine learning. In Transactions on Machine Learning Research. 
Retrieved from https:\/\/openreview.net\/forum?id=g97OHbQyk1"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3730403","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3730403","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:57:19Z","timestamp":1750298239000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3730403"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,22]]},"references-count":40,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2025,5,31]]}},"alternative-id":["10.1145\/3730403"],"URL":"https:\/\/doi.org\/10.1145\/3730403","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2025,5,22]]},"assertion":[{"value":"2024-06-26","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-21","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}