{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T17:38:22Z","timestamp":1777657102906,"version":"3.51.4"},"reference-count":75,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T00:00:00Z","timestamp":1772755200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T00:00:00Z","timestamp":1772755200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100005950","name":"Hong Kong University of Science and Technology","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100005950","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2026,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    The potential for higher-resolution image generation using pretrained diffusion models is immense. However, these models often struggle with object repetition and structural artifacts, especially when scaling to 4K resolution and beyond. Our analysis reveals the cause of the problem: a single prompt for generation at multiple scales is not sufficiently effective. To address this, we propose HiPrompt, a new tuning-free solution that tackles the above problems by introducing hierarchical prompts. The hierarchical prompts provide both global and local semantic guidance. Specifically, the global prompt captures overall scene semantics from user input, while local guidance comes from patch-wise descriptions generated by MLLMs to refine regional structures and textures. 
Furthermore, during inverse denoising, noise is decomposed into low- and high-frequency components, each conditioned on different prompt levels, facilitating prompt-guided denoising under hierarchical semantic guidance. It further allows the generation to focus more on local spatial regions and ensures the generated images maintain coherent local and global semantics, structures, and textures with high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality. The demo and code can be found on the project website:\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/liuxinyv.github.io\/HiPrompt\/\" ext-link-type=\"uri\">https:\/\/liuxinyv.github.io\/HiPrompt\/<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1007\/s11263-026-02736-z","type":"journal-article","created":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T09:34:57Z","timestamp":1772789697000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM 
Prompts"],"prefix":"10.1007","volume":"134","author":[{"given":"Xinyu","family":"Liu","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yingqing","family":"He","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lanqing","family":"Guo","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiang","family":"Li","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bu","family":"Jin","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yan","family":"Li","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chi-Min","family":"Chan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wei","family":"Xue","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5697-4168","authenticated-orcid":false,"given":"Wenhan","family":"Luo","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Qifeng","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yike","family":"Guo","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2026,3,6]]},"reference":[{"key":"2736_CR1","doi-asserted-by":"crossref","unstructured":"Avrahami, O., Lischinski, D., & Fried, O.(2022). Blended diffusion for text-driven editing of natural images. 
In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pages 18208\u201318218.","DOI":"10.1109\/CVPR52688.2022.01767"},{"key":"2736_CR2","volume-title":"and Tali Dekel","author":"O Bar-Tal","year":"2023","unstructured":"Bar-Tal, O., Yariv, L., & Lipman, Y. (2023). and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation."},{"key":"2736_CR3","unstructured":"Bi\u0144kowski, M., Sutherland, D.J., Arbel, M., & Gretton, A.(2021). Demystifying mmd gans. arxiv:1801.01401."},{"key":"2736_CR4","unstructured":"Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., & Rombach, R. (2023). Stable video diffusion: Scaling latent video diffusion models to large datasets, https:\/\/arxiv.org\/abs\/2311.15127."},{"key":"2736_CR5","doi-asserted-by":"crossref","unstructured":"Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et\u00a0al. (2023a). Pixart-$$\\alpha $$: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv:2310.00426.","DOI":"10.1007\/978-3-031-73411-3_5"},{"key":"2736_CR6","doi-asserted-by":"crossref","unstructured":"Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., & Lin, D.(2023b). Sharegpt4v: Improving large multi-modal models with better captions. arXiv:2311.12793","DOI":"10.1007\/978-3-031-72643-9_22"},{"key":"2736_CR7","doi-asserted-by":"crossref","unstructured":"Chen, M., Laina, I., & Vedaldi, A.(2024a). Training-free layout control with cross-attention guidance. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, pages 5343\u20135353.","DOI":"10.1109\/WACV57701.2024.00526"},{"key":"2736_CR8","doi-asserted-by":"crossref","unstructured":"Chen, Y., Wang, O., Zhang, R., Shechtman, E., Wang, X., & Gharbi, M.(2024b). Image neural field diffusion models. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 8007\u20138017.","DOI":"10.1109\/CVPR52733.2024.00765"},{"key":"2736_CR9","doi-asserted-by":"crossref","unstructured":"Choi, J., Kim, S., Jeong, Y., Gwon, Y., & Yoon, S.(2021). Ilvr: Conditioning method for denoising diffusion probabilistic models. arXiv:2108.02938.","DOI":"10.1109\/ICCV48922.2021.01410"},{"key":"2736_CR10","first-page":"19822","volume":"34","author":"M Ding","year":"2021","unstructured":"Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., Junyang Lin, X., Zou, Z. S., Yang, H., et al. (2021). Cogview: Mastering text-to-image generation via transformers. Advances in neural information processing systems, 34, 19822\u201319835.","journal-title":"Advances in neural information processing systems"},{"key":"2736_CR11","unstructured":"Ding, Z., Zhang, M., Wu, J., & Tu, Z.(2023). Patched denoising diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations."},{"key":"2736_CR12","doi-asserted-by":"crossref","unstructured":"Du, R., Chang, D., Hospedales, T., Song, Y-Z., & Ma, Z.(2024). Demofusion: Democratising high-resolution image generation with no \\$\\$\\$. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 6159\u20136168.","DOI":"10.1109\/CVPR52733.2024.00589"},{"key":"2736_CR13","doi-asserted-by":"crossref","unstructured":"Feng, Y., Gong, B., Chen, D., Shen, Y., Liu, Y., & Zhou, J. (2024). Ranni: Taming text-to-image diffusion for accurate instruction following. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 4744\u20134753.","DOI":"10.1109\/CVPR52733.2024.00454"},{"key":"2736_CR14","doi-asserted-by":"crossref","unstructured":"Geng, D., Park, I., & Owens, A. (2024). Visual anagrams: Generating multi-view optical illusions with diffusion models. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 24154\u201324163.","DOI":"10.1109\/CVPR52733.2024.02280"},{"key":"2736_CR15","unstructured":"Gu, J., Zhai, S., Zhang, Y., Susskind, J\u00a0M., & Jaitly, N. (2023). Matryoshka diffusion models. In The Twelfth International Conference on Learning Representations."},{"key":"2736_CR16","doi-asserted-by":"crossref","unstructured":"Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., & Guo, B.(2022). Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pages 10696\u201310706.","DOI":"10.1109\/CVPR52688.2022.01043"},{"key":"2736_CR17","doi-asserted-by":"crossref","unstructured":"Guo, L., He, Y., Chen, H., Xia, M., Cun, X., Wang, Y., Huang, S., Zhang, Y., Wang, X., Chen, Q. et\u00a0al. (2024). Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation. arXiv:2402.10491.","DOI":"10.1007\/978-3-031-72764-1_3"},{"key":"2736_CR18","doi-asserted-by":"crossref","unstructured":"Haji-Ali, M., Balakrishnan, G., & Ordonez, V. (2024). Elasticdiffusion: Training-free arbitrary size image generation through global-local content separation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 6603\u20136612.","DOI":"10.1109\/CVPR52733.2024.00631"},{"key":"2736_CR19","unstructured":"He, Y., Yang, S., Chen, H., Cun, X., Xia, M., Zhang, Y., Wang, X., He, R., Chen, Q., & Shan, Y. (2023a). Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. In The Twelfth International Conference on Learning Representations."},{"key":"2736_CR20","unstructured":"He, Y., Yang, T., Zhang, Y., Shan, Y., & Chen, Q. (2023b). Latent video diffusion models for high-fidelity long video generation. 
URL https:\/\/arxiv.org\/abs\/2211.13221."},{"key":"2736_CR21","unstructured":"Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., & Cohen-Or, D. (2022). Prompt-to-prompt image editing with cross attention control.(2022). arxiv:2208.01626."},{"key":"2736_CR22","unstructured":"Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2018). Gans trained by a two time-scale update rule converge to a local nash equilibrium. https:\/\/arxiv.org\/abs\/1706.08500."},{"key":"2736_CR23","first-page":"6840","volume":"33","author":"J Ho","year":"2020","unstructured":"Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840\u20136851.","journal-title":"Advances in neural information processing systems"},{"key":"2736_CR24","doi-asserted-by":"crossref","unstructured":"Hu, V\u00a0T., Baumann, S\u00a0A., Gui, M., Grebenkova, O., Ma, P., Schusterbauer, J., & Ommer, B. (2024). Zigma: A dit-style zigzag mamba diffusion model. https:\/\/arxiv.org\/abs\/2403.13802.","DOI":"10.1007\/978-3-031-72664-4_9"},{"key":"2736_CR25","doi-asserted-by":"crossref","unstructured":"Huang, L., Fang, R., Zhang, A., Song, G., Liu, S., Liu, Y., & Li, H. (2024). Fouriscale: A frequency perspective on training-free high-resolution image synthesis. arXiv:2403.12963.","DOI":"10.1007\/978-3-031-73254-6_12"},{"key":"2736_CR26","doi-asserted-by":"crossref","unstructured":"Jin, Z., Shen, X., Li, B., & Xue, X. (2024). Training-free diffusion model adaptation for variable-sized text-to-image synthesis. Advances in Neural Information Processing Systems, 36.","DOI":"10.52202\/075280-3103"},{"key":"2736_CR27","doi-asserted-by":"crossref","unstructured":"Kim, G., Kim, H., Seo, H., Kang, D\u00a0U., & Chun, S\u00a0Y. (2024a). Beyondscene: Higher-resolution human-centric scene generation with pretrained diffusion. 
https:\/\/arxiv.org\/abs\/2404.04544.","DOI":"10.1007\/978-3-031-73039-9_8"},{"key":"2736_CR28","unstructured":"Kim, Y., Hwang, G., Zhang, J., & Park, E. (2024b). Diffusehigh: Training-free progressive high-resolution image synthesis through structure guidance.https:\/\/arxiv.org\/abs\/2406.18459."},{"key":"2736_CR29","doi-asserted-by":"publisher","first-page":"4338","DOI":"10.1609\/aaai.v39i4.32456","volume":"39","author":"Y Kim","year":"2025","unstructured":"Kim, Y., Hwang, G., Zhang, J., & Park, E. (2025). Diffusehigh: Training-free progressive high-resolution image synthesis through structure guidance. In Proceedings of the AAAI conference on artificial intelligence, 39, 4338\u20134346.","journal-title":"In Proceedings of the AAAI conference on artificial intelligence"},{"key":"2736_CR30","unstructured":"Li, D., Kamko, A., Akhgari, E., Sabet, A., Xu, L., & Doshi, S. (2024). Playground v2. 5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv:2402.17245."},{"key":"2736_CR31","doi-asserted-by":"crossref","unstructured":"Lin, Z., Lin, M., Zhan, W., & Ji, R. (2024). Accdiffusion v2: Towards more accurate higher-resolution diffusion extrapolation. arXiv:2412.02099.","DOI":"10.1109\/TPAMI.2025.3576740"},{"key":"2736_CR32","doi-asserted-by":"crossref","unstructured":"Liu, H., Li, C., Li, Y., & Lee, Y\u00a0J. (2024). Improved baselines with visual instruction tuning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 26296\u201326306.","DOI":"10.1109\/CVPR52733.2024.02484"},{"key":"2736_CR33","doi-asserted-by":"crossref","unstructured":"Long, F., Qiu, Z., Yao, T., & Mei, T. (2024). Videostudio: Generating consistent-content and multi-scene videos. https:\/\/arxiv.org\/abs\/2401.01256.","DOI":"10.1007\/978-3-031-73027-6_27"},{"key":"2736_CR34","unstructured":"Lu, Z., Wang, Z., Huang, D., Wu, C., Liu, X., Ouyang, W., & Bai, L. (2024a). Fit: Flexible vision transformer for diffusion model. 
arXiv:2402.12376."},{"key":"2736_CR35","unstructured":"Lu, Z., Wang, Z., Huang, D., Wu, C., Liu, X., Ouyang, W., & Bai, L. (2024b). Fit: Flexible vision transformer for diffusion model. https:\/\/arxiv.org\/abs\/2402.12376."},{"key":"2736_CR36","doi-asserted-by":"publisher","first-page":"4296","DOI":"10.1609\/aaai.v38i5.28226","volume":"38","author":"C Mou","year":"2024","unstructured":"Mou, C., Wang, X., Xie, L., Yanze, W., Zhang, J., Qi, Z., & Shan, Y. (2024). T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, 38, 4296\u20134304.","journal-title":"In Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"2736_CR37","unstructured":"Nash, C., Menick, J., Dieleman, S., & Battaglia, P\u00a0W. (2021). Generating images with sparse representations. arXiv:2103.03841."},{"key":"2736_CR38","unstructured":"Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., & Chen, M. (2021). Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741."},{"key":"2736_CR39","doi-asserted-by":"crossref","unstructured":"Parmar, G., Zhang, R., & Zhu, J-Y. (2022). On aliased resizing and surprising subtleties in gan evaluation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 11410\u201311420.","DOI":"10.1109\/CVPR52688.2022.01112"},{"key":"2736_CR40","doi-asserted-by":"crossref","unstructured":"Peebles, W., & Xie, S. (2023). Scalable diffusion models with transformers. In Proceedings of the IEEE\/CVF international conference on computer vision, pages 4195\u20134205.","DOI":"10.1109\/ICCV51070.2023.00387"},{"key":"2736_CR41","unstructured":"Peng, B., Chen, X., Wang, Y., Lu, C., & Qiao, Y. (2024). Conditionvideo: Training-free condition-guided text-to-video generation. 
https:\/\/arxiv.org\/abs\/2310.07697."},{"key":"2736_CR42","unstructured":"Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., M\u00fcller, J., Penna, J., & Rombach, R. (2023a). Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv:2307.01952."},{"key":"2736_CR43","unstructured":"Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., M\u00fcller, J., Penna, J., & Rombach, R. (2023b). Sdxl: Improving latent diffusion models for high-resolution image synthesis. https:\/\/arxiv.org\/abs\/2307.01952."},{"key":"2736_CR44","unstructured":"Qing, Z., Zhang, S., Wang, J., Wang, X., Wei, Y., Zhang, Y., Gao, C., & Sang, N.(2023). Hierarchical spatio-temporal decoupling for text-to-video generation. https:\/\/arxiv.org\/abs\/2312.04483."},{"key":"2736_CR45","unstructured":"Radford, A., Kim, J\u00a0W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J. et\u00a0al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748\u20138763. PMLR."},{"key":"2736_CR46","doi-asserted-by":"crossref","unstructured":"Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022a). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pages 10684\u201310695.","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"2736_CR47","doi-asserted-by":"crossref","unstructured":"Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022b). High-resolution image synthesis with latent diffusion models. https:\/\/arxiv.org\/abs\/2112.10752.","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"2736_CR48","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. 
In Medical image computing and computer-assisted intervention\u2013MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234\u2013241. Springer.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"2736_CR49","doi-asserted-by":"crossref","unstructured":"Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E\u00a0L., Ghasemipour, K., Gontijo\u00a0Lopes, R., Karagol\u00a0Ayan, B., Salimans, T., et\u00a0al.(2022). Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479\u201336494.","DOI":"10.52202\/068431-2643"},{"key":"2736_CR50","unstructured":"Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training gans. arxiv:1606.03498."},{"key":"2736_CR51","doi-asserted-by":"crossref","unstructured":"Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., & Jitsev, J. (2022) Laion-5b: An open large-scale dataset for training next generation image-text models. arxiv:2210.08402.","DOI":"10.52202\/068431-1833"},{"key":"2736_CR52","doi-asserted-by":"crossref","unstructured":"Shi, S., Li, W., Zhang, Y., He, J., Gong, B., & Zheng, Y. (2024). Resmaster: Mastering high-resolution image generation via structural and fine-grained guidance. arXiv:2406.16476.","DOI":"10.1609\/aaai.v39i7.32739"},{"key":"2736_CR53","unstructured":"Si, C., Huang, Z., Jiang, Y., & Liu, Z.(2023). Freeu: Free lunch in diffusion u-net. arxiv:2309.11497."},{"key":"2736_CR54","doi-asserted-by":"crossref","unstructured":"Si, C., Huang, Z., Jiang, Y., & Liu, Z. (2024). Freeu: Free lunch in diffusion u-net. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 4733\u20134743.","DOI":"10.1109\/CVPR52733.2024.00453"},{"key":"2736_CR55","unstructured":"Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256\u20132265. PMLR."},{"key":"2736_CR56","unstructured":"Song, J., Meng, C., & Ermon, S. (2020a). Denoising diffusion implicit models. arXiv:2010.02502."},{"key":"2736_CR57","unstructured":"Song, Y., Sohl-Dickstein, J., Kingma, D\u00a0P., Kumar, A., Ermon, S., & Poole, B. (2020b). Score-based generative modeling through stochastic differential equations. arXiv:2011.13456."},{"key":"2736_CR58","unstructured":"Song, Y., Xie, H., Zhang, Z., Wen, B., Ma, L., Mi, Z., & Chen, H. (2024). Turbo sparse: Achieving llm sota performance with minimal activated parameters. arXiv:2406.05955."},{"key":"2736_CR59","unstructured":"Teng, J., Zheng, W., Ding, M., Hong, W., Wangni, J., Yang, Z., & Tang, J. (2023) Relay diffusion: Unifying diffusion process across resolutions for image synthesis. arXiv:2309.03350."},{"key":"2736_CR60","unstructured":"Wang, J., Yue, Z., Zhou, S., Chan, K. C.\u00a0K., & Loy, C.C. (2024a). Exploiting diffusion prior for real-world image super-resolution. arxiv:2305.07015."},{"key":"2736_CR61","unstructured":"Wang, W., Liu, J., Lin, Z., Yan, J., Chen, S., Low, C., Hoang, T., Wu, J., Liew, J.H., Yan, H., Zhou, D., & Feng, J. (2024b). Magicvideo-v2: Multi-stage high-aesthetic video generation, arxiv:2401.04468."},{"key":"2736_CR62","doi-asserted-by":"crossref","unstructured":"Wang, X., Kontkanen, J., Curless, B., Seitz, S.M., Kemelmacher-Shlizerman, I., Mildenhall, B., Srinivasan, P., Verbin, D., Holynski, A. (2024c). Generative powers of ten. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 7173\u20137182.","DOI":"10.1109\/CVPR52733.2024.00685"},{"key":"2736_CR63","unstructured":"Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., & Yang, P., et\u00a0al. (2023) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv:2309.15103."},{"key":"2736_CR64","doi-asserted-by":"crossref","unstructured":"Wu, H., Shen, S., Hu, Q., Zhang, X., Zhang, Y., & Wang, Y. (2024) Megafusion: Extend diffusion models towards higher-resolution image generation without further tuning. arxiv:2408.11001.","DOI":"10.1109\/WACV61041.2025.00388"},{"key":"2736_CR65","doi-asserted-by":"crossref","unstructured":"Xie, E., Yao, L., Shi, H., Liu, Z., Zhou, D., Liu, Z., Li, J. & Li, Z. (2023) Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, pages 4230\u20134239.","DOI":"10.1109\/ICCV51070.2023.00390"},{"key":"2736_CR66","unstructured":"Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., & Lu, Y., et\u00a0al. (2024) Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv:2410.10629"},{"key":"2736_CR67","doi-asserted-by":"crossref","unstructured":"Xing, J., Xia, M., Zhang, Y., Chen, H., Yu, W., Liu, H., Wang, X., Wong, T.-T., & Shan, Y. (2023) Dynamicrafter: Animating open-domain images with video diffusion priors. arxiv:2310.12190.","DOI":"10.1007\/978-3-031-72952-2_23"},{"key":"2736_CR68","unstructured":"Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., & Bin C. (2024a) Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. 
In Forty-first International Conference on Machine Learning."},{"key":"2736_CR69","unstructured":"Yang,Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., Yin, D., Gu, X., Zhang, Y., Wang, W., Cheng, Y., Liu, T., Xu, B., Dong, Y., & Tang, J. (2024b). Cogvideox: Text-to-video diffusion models with an expert transformer. arxiv:2408.06072."},{"key":"2736_CR70","doi-asserted-by":"crossref","unstructured":"Zhang, D\u00a0J., Wu, J\u00a0Z., Liu, J-W., Zhao, R., Ran, L., Gu, Y., Gao, D., & Shou, M\u00a0Z. (2023a). Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv:2309.15818.","DOI":"10.1007\/s11263-024-02271-9"},{"key":"2736_CR71","doi-asserted-by":"crossref","unstructured":"Zhang, K., Liang, J., Van\u00a0Gool, L., & Timofte, R. (2021). Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, pages 4791\u20134800.","DOI":"10.1109\/ICCV48922.2021.00475"},{"key":"2736_CR72","doi-asserted-by":"crossref","unstructured":"Zhang, L., Rao, A., & Agrawala, M.(2023b). Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, pages 3836\u20133847.","DOI":"10.1109\/ICCV51070.2023.00355"},{"key":"2736_CR73","volume-title":"and Jiajun Liang","author":"S Zhang","year":"2023","unstructured":"Zhang, S., Chen, Z., Zhao, Z., Chen, Z., Tang, Y., Chen, Y., & Cao, W. (2023). and Jiajun Liang. Hidiffusion: Unlocking high-resolution creativity and efficiency in low-resolution trained diffusion models. CoRR."},{"key":"2736_CR74","doi-asserted-by":"crossref","unstructured":"Zhang, S., Chen, Z., Zhao, Z., Chen, Y., Tang, Y., & Liang, J. (2024). Hidiffusion: Unlocking higher-resolution creativity and efficiency in pretrained diffusion models. 
arxiv:2311.17528.","DOI":"10.1007\/978-3-031-72983-6_9"},{"key":"2736_CR75","doi-asserted-by":"publisher","first-page":"7571","DOI":"10.1609\/aaai.v38i7.28589","volume":"38","author":"Q Zheng","year":"2024","unstructured":"Zheng, Q., Guo, Y., Deng, J., Han, J., Li, Y., Songcen, X., & Hang, X. (2024). Any-size-diffusion: Toward efficient text-driven synthesis for any-size hd images. In Proceedings of the AAAI Conference on Artificial Intelligence, 38, 7571\u20137578.","journal-title":"In Proceedings of the AAAI Conference on Artificial Intelligence"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-026-02736-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-026-02736-z","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-026-02736-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,18]],"date-time":"2026-04-18T05:44:25Z","timestamp":1776491065000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-026-02736-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,6]]},"references-count":75,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2026,4]]}},"alternative-id":["2736"],"URL":"https:\/\/doi.org\/10.1007\/s11263-026-02736-z","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,6]]},"assertion":[{"value":"30 December 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 
January 2026","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 March 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflicts of Interest"}}],"article-number":"147"}}