{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,24]],"date-time":"2026-06-24T16:14:40Z","timestamp":1782317680290,"version":"3.54.5"},"reference-count":54,"publisher":"Springer Science and Business Media LLC","issue":"8","license":[{"start":{"date-parts":[[2025,4,7]],"date-time":"2025-04-07T00:00:00Z","timestamp":1743984000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,4,7]],"date-time":"2025-04-07T00:00:00Z","timestamp":1743984000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001775","name":"University of Technology Sydney","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001775","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2025,8]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>We study the problem of creating high-fidelity and animatable 3D avatars from only textual descriptions. Existing text-to-avatar methods are either limited to static avatars which cannot be animated or struggle to generate animatable avatars with promising quality and precise pose control. To address these limitations, we propose AvatarStudio, a generative model that yields explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio proposes to incorporate articulation modeling into the explicit mesh representation to support high-resolution rendering and avatar animation. To ensure view consistency and pose controllability of the resulting avatars, we introduce a simple-yet-effective 2D diffusion model conditioned on DensePose for Score Distillation Sampling supervision. By effectively leveraging the synergy between the articulated mesh representation and DensePose-conditional diffusion model, AvatarStudio can create high-quality avatars from text ready for animation. Furthermore, it is competent for many applications, <jats:italic>e.g.<\/jats:italic>, multimodal avatar animations and style-guided avatar creation. Please refer to our <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/avatarstudio23.github.io\/\" ext-link-type=\"uri\">project page<\/jats:ext-link> for more results.\n<\/jats:p>","DOI":"10.1007\/s11263-025-02423-5","type":"journal-article","created":{"date-parts":[[2025,4,7]],"date-time":"2025-04-07T01:35:08Z","timestamp":1743989708000},"page":"5178-5196","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["AvatarStudio: High-Fidelity and Animatable 3D Avatar Creation from Text"],"prefix":"10.1007","volume":"133","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6939-4074","authenticated-orcid":false,"given":"Xuanmeng","family":"Zhang","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jianfeng","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chenxu","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jun Hao","family":"Liew","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Huichao","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yi","family":"Yang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jiashi","family":"Feng","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2025,4,7]]},"reference":[{"key":"2423_CR1","unstructured":"(2023) Metahuman. https:\/\/www.unrealengine.com\/en-US\/metahum-an"},{"key":"2423_CR2","unstructured":"(2023) Torchmetrics. https:\/\/torchmetrics.readthedocs.io\/en\/stable\/multimodal\/clip_score.html."},{"key":"2423_CR3","doi-asserted-by":"crossref","unstructured":"Alldieck, T., Xu, H., & Sminchisescu, C. (2021) imghum: Implicit generative models of 3d human shape and articulated pose.","DOI":"10.1109\/ICCV48922.2021.00541"},{"key":"2423_CR4","doi-asserted-by":"crossref","unstructured":"Cao, Y., Cao, Y. P., Han, K., Shan, Y., & Wong, K. Y. K. (2023). Dreamavatar: Text-and-shape guided 3d human avatar generation via diffusion models. arXiv.","DOI":"10.1109\/CVPR52733.2024.00097"},{"key":"2423_CR5","unstructured":"Chen, J., Zhang, Y., Kang, D., Zhe, X., Bao, L., Jia, X., & Lu, H. (2021). Animatable neural radiance fields from monocular RGB videos. arXiv."},{"key":"2423_CR6","doi-asserted-by":"crossref","unstructured":"Chen, R., Chen, Y., Jiao, N., & Jia, K. (2023). Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In ICCV.","DOI":"10.1109\/ICCV51070.2023.02033"},{"key":"2423_CR7","doi-asserted-by":"crossref","unstructured":"Desbrun, M., Meyer, M., Schroder, P., & Barr, A. H. (2023). Implicit fairing of irregular meshes using diffusion and curvature flow. In Seminal graphics papers: Pushing the boundaries.","DOI":"10.1145\/3596711.3596729"},{"key":"2423_CR8","doi-asserted-by":"publisher","first-page":"4101","DOI":"10.21105\/joss.04101","volume":"7","author":"NS Detlefsen","year":"2022","unstructured":"Detlefsen, N. S., Borovec, J., Schock, J., Jha, A. H., Koker, T., Di Liello, L., Stancl, D., Quan, C., Grechkin, M., & Falcon, W. (2022). Torchmetrics-measuring reproducibility in pytorch. Journal of Open Source Software, 7, 4101.","journal-title":"Journal of Open Source Software"},{"key":"2423_CR9","doi-asserted-by":"crossref","unstructured":"G\u00fcler, R. A., Neverova, N., & Kokkinos, I. (2018). Densepose: Dense human pose estimation in the wild.","DOI":"10.1109\/CVPR.2018.00762"},{"key":"2423_CR10","unstructured":"Guo, Y. C., Liu, Y. T., Shao, R., Laforte, C., Voleti, V., Luo, G., Chen, C. H., Zou, Z. X., Wang, C., Cao, Y. P., & Zhang, S. H. (2023). threestudio: A unified framework for 3d content generation. https:\/\/github.com\/threestudio-project\/threestudio."},{"key":"2423_CR11","unstructured":"Hong, F., Chen, Z., Lan, Y., Pan, L., & Liu, Z. (2022a). Eva3d: Compositional 3d human generation from 2d image collections. arXiv."},{"key":"2423_CR12","doi-asserted-by":"crossref","unstructured":"Hong, F., Zhang, M., Pan, L., Cai, Z., Yang, L., & Liu, Z. (2022b). AvatarCLIP: Zero-shot text-driven generation and animation of 3d avatars. ACM Transactions on Graphics, 161.","DOI":"10.1145\/3528223.3530094"},{"key":"2423_CR13","doi-asserted-by":"crossref","unstructured":"Huang, X., Shao, R., Zhang, Q., Zhang, H., Feng, Y., Liu, Y., & Wang, Q. (2024). Humannorm: Learning normal diffusion model for high-quality and realistic 3d human generation. In CVPR.","DOI":"10.1109\/CVPR52733.2024.00437"},{"key":"2423_CR14","unstructured":"Huang, Y., Wang, J., Zeng, A., Cao, H., Qi, X., Shi, Y., Zha, Z., & Zhang, L. (2023). Dreamwaltz: Make a scene with complex 3d animatable avatars. In NeurIPS."},{"key":"2423_CR15","doi-asserted-by":"crossref","unstructured":"Jain, A., Mildenhall, B., Barron, J. T., Abbeel, P., & Poole, B. (2021). Zero-shot text-guided object generation with dream fields.","DOI":"10.1109\/CVPR52688.2022.00094"},{"key":"2423_CR16","doi-asserted-by":"crossref","unstructured":"Jiang, R., Wang, C., Zhang, J., Chai, M., He, M., Chen, D., & Liao, J. (2023). Avatarcraft: Transforming text into neural human avatars with parameterized shape and pose control. arXiv.","DOI":"10.1109\/ICCV51070.2023.01322"},{"key":"2423_CR17","doi-asserted-by":"publisher","unstructured":"Kerbl, B., Kopanas, G., Leimk\u00fchler, T., & Drettakis, G. (2023). 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 139. https:\/\/doi.org\/10.1145\/3592433.","DOI":"10.1145\/3592433"},{"key":"2423_CR18","unstructured":"Khalid, N. M., Xie, T., Belilovsky, E., & Popa, T. (2022). Clip-mesh: Generating textured meshes from text using pretrained image-text models."},{"key":"2423_CR19","unstructured":"Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR."},{"key":"2423_CR20","doi-asserted-by":"crossref","unstructured":"Kocabas, M., Huang, C. H. P., Hilliges, O., & Black, M. J. (2021). Pare: Part attention regressor for 3d human body estimation.","DOI":"10.1109\/ICCV48922.2021.01094"},{"key":"2423_CR21","unstructured":"Kolotouros, N., Alldieck, T., Zanfir, A., Bazavan, E. G., Fieraru, M. & Sminchisescu, C. (2023). Dreamhuman: Animatable 3d avatars from text. In NeurIPS."},{"key":"2423_CR22","doi-asserted-by":"publisher","unstructured":"Laine, S., Hellsten, J., Karras, T., Seol, Y., Lehtinen, J., & Aila, T. (2020). Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics, 39(6), 194. https:\/\/doi.org\/10.1145\/3414685.3417861.","DOI":"10.1145\/3414685.3417861"},{"key":"2423_CR23","doi-asserted-by":"crossref","unstructured":"Li, J., Xu, C., Chen, Z., Bian, S., Yang, L., & Lu, C. (2021). Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation.","DOI":"10.1109\/CVPR46437.2021.00339"},{"key":"2423_CR24","doi-asserted-by":"crossref","unstructured":"Liao, T., Yi, H., Xiu, Y., Tang, J., Huang, Y., Thies, J., & Black, M. J. (2024). TADA! Text to Animatable Digital Avatars. In 3DV.","DOI":"10.1109\/3DV62453.2024.00150"},{"key":"2423_CR25","doi-asserted-by":"crossref","unstructured":"Lin, C. H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M. Y., & Lin, T. Y. (2023). Magic3d: High-resolution text-to-3d content creation.","DOI":"10.1109\/CVPR52729.2023.00037"},{"key":"2423_CR26","doi-asserted-by":"crossref","unstructured":"Lin, S., Liu, B., Li, J., & Yang, X. (2024). Common diffusion noise schedules and sample steps are flawed. In WACV.","DOI":"10.1109\/WACV57701.2024.00532"},{"key":"2423_CR27","doi-asserted-by":"publisher","unstructured":"Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., & Black, M. J. (2015). SMPL: A skinned multi-person linear model. ACM Transactions on Graphics, 34(6), 248. https:\/\/doi.org\/10.1145\/2816795.2818013.","DOI":"10.1145\/2816795.2818013"},{"key":"2423_CR28","doi-asserted-by":"crossref","unstructured":"Lorensen, W. E., & Cline, H. E. (1998). Marching cubes: A high resolution 3d surface construction algorithm.","DOI":"10.1145\/280811.281026"},{"key":"2423_CR29","doi-asserted-by":"crossref","unstructured":"Metzer, G., Richardson, E., Patashnik, O., Giryes, R., & Cohen-Or, D. (2022). Latent-nerf for shape-guided generation of 3d shapes and textures. arXiv.","DOI":"10.1109\/CVPR52729.2023.01218"},{"key":"2423_CR30","doi-asserted-by":"crossref","unstructured":"Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2020). Nerf: Representing scenes as neural radiance fields for view synthesis.","DOI":"10.1007\/978-3-030-58452-8_24"},{"key":"2423_CR31","doi-asserted-by":"publisher","unstructured":"M\u00fcller, T., Evans, A., Schied, C., & Keller, A. (2022). Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics, 41(4), 102. https:\/\/doi.org\/10.1145\/3528223.3530127.","DOI":"10.1145\/3528223.3530127"},{"key":"2423_CR32","doi-asserted-by":"crossref","unstructured":"Nealen, A., Igarashi, T., Sorkine, O., & Alexa, M. (2006). Laplacian mesh optimization. In CGIT.","DOI":"10.1145\/1174429.1174494"},{"key":"2423_CR33","doi-asserted-by":"crossref","unstructured":"Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A. A. A., Tzionas, D., & Black, M. J. (2019). Expressive body capture: 3d hands, face, and body from a single image. In CVPR.","DOI":"10.1109\/CVPR.2019.01123"},{"key":"2423_CR34","doi-asserted-by":"crossref","unstructured":"Peng, S., Dong, J., Wang, Q., Zhang, S., Shuai, Q., Bao, H., & Zhou, X. (2021). Animatable neural radiance fields for human body modeling.","DOI":"10.1109\/ICCV48922.2021.01405"},{"key":"2423_CR35","unstructured":"Poole, B., Jain, A., Barron, J. T., & Mildenhall, B. (2022). Dreamfusion: Text-to-3d using 2d diffusion."},{"key":"2423_CR36","unstructured":"Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision."},{"key":"2423_CR37","doi-asserted-by":"crossref","unstructured":"Richardson, E., Metzer, G., Alaluf, Y., Giryes, R., & Cohen-Or, D. (2023). Texture: Text-guided texturing of 3d shapes. arXiv.","DOI":"10.1145\/3588432.3591503"},{"key":"2423_CR38","doi-asserted-by":"crossref","unstructured":"Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2021). High-resolution image synthesis with latent diffusion models.","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"2423_CR39","doi-asserted-by":"crossref","unstructured":"Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., Salimans, T., Ho, J., Fleet, D. J., & Norouzi, M. (2022). Photorealistic text-to-image diffusion models with deep language understanding. arXiv.","DOI":"10.1145\/3528233.3530757"},{"key":"2423_CR40","doi-asserted-by":"crossref","unstructured":"Sanghi, A., Chu, H., Lambourne, J., Wang, Y., Cheng, C. Y., & Fumero, M. (2021). Clip-forge: Towards zero-shot text-to-shape generation.","DOI":"10.1109\/CVPR52688.2022.01805"},{"key":"2423_CR41","unstructured":"Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., & et\u00a0al. (2022) Laion-5b: An open large-scale dataset for training next generation image-text models."},{"key":"2423_CR42","unstructured":"Schwarz, K., Sauer, A., Niemeyer, M., Liao, Y., & Geiger, A. (2022). Voxgraf: Fast 3d-aware image synthesis with sparse voxel grids."},{"key":"2423_CR43","unstructured":"Shen, T., Gao, J., Yin, K., Liu, M. Y., & Fidler, S. (2021). Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. In NeurIPS."},{"key":"2423_CR44","unstructured":"Shi, Y., Wang, P., Ye, J., Mai, L., Li, K., & Yang, X. (2023). Mvdream: Multi-view diffusion for 3d generation."},{"key":"2423_CR45","unstructured":"Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-or, D., & Bermano, A. H. (2023). Human motion diffusion model. In ICLR."},{"key":"2423_CR46","unstructured":"Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., & Zhu, J. (2023). Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv."},{"key":"2423_CR47","unstructured":"Xu, Y., Yang, Z., & Yang, Y. (2023a). Seeavatar: Photorealistic text-to-3d avatar generation with constrained geometry and appearance. arXiv preprint arXiv:2312.08889."},{"key":"2423_CR48","unstructured":"Xu, Z., Zhang, J., Liew, J., Feng, J., & Shou, M. Z. (2023b). Xagen: 3d expressive human avatars generation. In NeurIPS."},{"key":"2423_CR49","unstructured":"Ye, H., Zhang, J., Liu, S., Han, X., & Yang, W. (2023). Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv."},{"key":"2423_CR50","unstructured":"Yifan, W., Rahmann, L., & Sorkine-Hornung, O. (2021). Geometry-consistent neural shape representation with implicit displacement fields. arXiv."},{"key":"2423_CR51","doi-asserted-by":"crossref","unstructured":"Zhang, H., Chen, B., Yang, H., Qu, L., Wang, X., Chen, L., Long, C., Zhu, F., Du, K., & Zheng, M. (2024). Avatarverse: High-quality & stable 3d avatar creation from text and pose.","DOI":"10.1609\/aaai.v38i7.28540"},{"key":"2423_CR52","doi-asserted-by":"crossref","unstructured":"Zhang, J., Jiang, Z., Yang, D., Xu, H., Shi, Y., Song, G., Xu, Z., Wang, X., & Feng, J. (2022). Avatargen: A 3d generative model for animatable human avatars. arXiv.","DOI":"10.1007\/978-3-031-25066-8_39"},{"key":"2423_CR53","doi-asserted-by":"crossref","unstructured":"Zhang, L., Rao, A., & Agrawala, M. (2023a). Adding conditional control to text-to-image diffusion models.","DOI":"10.1109\/ICCV51070.2023.00355"},{"key":"2423_CR54","doi-asserted-by":"crossref","unstructured":"Zhang, X., Zhang, J., Rohan, C., Xu, H., Song, G., Yang, Y., & Feng, J. (2023b). Getavatar: Generative textured meshes for animatable human avatars. In ICCV.","DOI":"10.1109\/ICCV51070.2023.00216"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02423-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-025-02423-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02423-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,6]],"date-time":"2025-09-06T10:35:57Z","timestamp":1757154957000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-025-02423-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,7]]},"references-count":54,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2025,8]]}},"alternative-id":["2423"],"URL":"https:\/\/doi.org\/10.1007\/s11263-025-02423-5","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,4,7]]},"assertion":[{"value":"20 July 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 March 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 April 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}