{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,6]],"date-time":"2026-06-06T17:11:28Z","timestamp":1780765888992,"version":"3.54.1"},"reference-count":98,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T00:00:00Z","timestamp":1721347200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2022YFF0902301"],"award-info":[{"award-number":["2022YFF0902301"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"name":"NSFC programs","award":["61976138"],"award-info":[{"award-number":["61976138"]}]},{"name":"NSFC programs","award":["61977047"],"award-info":[{"award-number":["61977047"]}]},{"DOI":"10.13039\/501100003399","name":"STCSM","doi-asserted-by":"crossref","award":["2015F0203-000-06"],"award-info":[{"award-number":["2015F0203-000-06"]}],"id":[{"id":"10.13039\/501100003399","id-type":"DOI","asserted-by":"crossref"}]},{"name":"SHMEC","award":["2019-01-07-00-01-E00003"],"award-info":[{"award-number":["2019-01-07-00-01-E00003"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2024,7,19]]},"abstract":"<jats:p>In the realm of digital creativity, our potential to craft intricate 3D worlds from imagination is often hampered by the limitations of existing digital tools, which demand extensive expertise and efforts. To narrow this disparity, we introduce CLAY, a 3D geometry and material generator designed to effortlessly transform human imagination into intricate 3D digital structures. 
CLAY supports classic text or image inputs as well as 3D-aware controls from diverse primitives (multi-view images, voxels, bounding boxes, point clouds, implicit representations, etc.). At its core is a large-scale generative model composed of a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT), to extract rich 3D priors directly from a diverse range of 3D geometries. Specifically, it adopts neural fields to represent continuous and complete surfaces and uses a geometry generative module with pure transformer blocks in latent space. We present a progressive training scheme to train CLAY on an ultra-large 3D model dataset obtained through a carefully designed processing pipeline, resulting in a 3D-native geometry generator with 1.5 billion parameters. For appearance generation, CLAY sets out to produce physically-based rendering (PBR) textures by employing a multi-view material diffusion model that can generate 2K resolution textures with diffuse, roughness, and metallic modalities. We demonstrate using CLAY for a range of controllable 3D asset creations, from sketchy conceptual designs to production-ready assets with intricate details. 
Even first time users can easily use CLAY to bring their vivid 3D imaginations to life, unleashing unlimited creativity.<\/jats:p>","DOI":"10.1145\/3658146","type":"journal-article","created":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T14:47:57Z","timestamp":1721400477000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":162,"title":["CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets"],"prefix":"10.1145","volume":"43","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8508-3359","authenticated-orcid":false,"given":"Longwen","family":"Zhang","sequence":"first","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"},{"name":"Deemos Technology, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4697-5183","authenticated-orcid":false,"given":"Ziyu","family":"Wang","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"},{"name":"Deemos Technology, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4837-7152","authenticated-orcid":false,"given":"Qixuan","family":"Zhang","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"},{"name":"Deemos Technology, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-0213-4744","authenticated-orcid":false,"given":"Qiwei","family":"Qiu","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"},{"name":"Deemos Technology, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2746-6946","authenticated-orcid":false,"given":"Anqi","family":"Pang","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9673-8545","authenticated-orcid":false,"given":"Haoran","family":"Jiang","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"},{"name":"Deemos Technology, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1189-1254","authenticated-orcid":false,"given":"Wei","family":"Yang","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8807-7787","authenticated-orcid":false,"given":"Lan","family":"Xu","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9198-6853","authenticated-orcid":false,"given":"Jingyi","family":"Yu","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2024,7,19]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Omer Bar-Tal Lior Yariv Yaron Lipman and Tali Dekel. 2023. MultiDiffusion: fusing diffusion paths for controlled image generation. 16 pages."},{"key":"e_1_2_1_2_1","unstructured":"Andreas Blattmann Tim Dockhorn Sumith Kulal Daniel Mendelevitch Maciej Kilian Dominik Lorenz Yam Levi Zion English Vikram Voleti Adam Letts Varun Jampani and Robin Rombach. 2023. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. arXiv:2311.15127 [cs.CV]"},{"key":"e_1_2_1_3_1","unstructured":"Blender Online Community. 2024. Blender - a 3D modelling and rendering package. http:\/\/www.blender.org."},{"key":"e_1_2_1_4_1","unstructured":"Angel X. 
Chang Thomas Funkhouser Leonidas Guibas Pat Hanrahan Qixing Huang Zimo Li Silvio Savarese Manolis Savva Shuran Song Hao Su Jianxiong Xiao Li Yi and Fisher Yu. 2015. ShapeNet: An Information-Rich 3D Model Repository. arXiv:1512.03012 [cs.GR]"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01701"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.02033"},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition.","author":"Chen Zilong","year":"2024","unstructured":"Zilong Chen, Feng Wang, and Huaping Liu. 2024. Text-to-3D using Gaussian Splatting. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00609"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00433"},{"key":"e_1_2_1_10_1","volume-title":"Proceedings, Part VIII 14","author":"Choy Christopher B","year":"2016","unstructured":"Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 2016. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part VIII 14. Springer, 628--644."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01263"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.264"},{"key":"e_1_2_1_13_1","unstructured":"Andrea Gesmundo and Kaitlin Maile. 2023. Composable Function-preserving Expansions for Transformer Architectures. 
arXiv:2308.06103 [cs.LG]"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00030"},{"key":"e_1_2_1_15_1","unstructured":"Yuan-Chen Guo Ying-Tian Liu Ruizhi Shao Christian Laforte Vikram Voleti Guan Luo Chia-Hao Chen Zi-Xin Zou Chen Wang Yan-Pei Cao and Song-Hai Zhang. 2023. threestudio: A unified framework for 3D content generation. https:\/\/github.com\/threestudio-project\/threestudio."},{"key":"e_1_2_1_16_1","unstructured":"Anchit Gupta Wenhan Xiong Yixin Nie Ian Jones and Barlas O\u011fuz. 2023. 3DGen: Triplane Latent Diffusion for Textured Mesh Generation. arXiv:2303.05371 [cs.CV]"},{"key":"e_1_2_1_17_1","volume-title":"Lin (Eds.)","volume":"33","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 6840--6851. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2020\/file\/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf"},{"key":"e_1_2_1_18_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Hong Yicong","year":"2024","unstructured":"Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. 2024. LRM: Large Reconstruction Model for Single Image to 3D. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_19_1","volume-title":"LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.","author":"Hu Edward J","year":"2022","unstructured":"Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. 
In International Conference on Learning Representations."},{"key":"e_1_2_1_20_1","volume-title":"Guibas","author":"Huang Jingwei","year":"2018","unstructured":"Jingwei Huang, Hao Su, and Leonidas J. Guibas. 2018a. Robust Watertight Manifold Surface Generation Method for ShapeNet Models. arXiv:1802.01698 http:\/\/arxiv.org\/abs\/1802.01698"},{"key":"e_1_2_1_21_1","unstructured":"Jingwei Huang Yichao Zhou and Leonidas Guibas. 2020. ManifoldPlus: A Robust and Scalable Watertight Manifold Surface Generation Method for Triangle Soups. arXiv:2005.11621 [cs.GR]"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1111\/cgf.13498"},{"key":"e_1_2_1_23_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Huang Yukun","year":"2024","unstructured":"Yukun Huang, Jianan Wang, Yukai Shi, Boshi Tang, Xianbiao Qi, and Lei Zhang. 2024. DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_24_1","unstructured":"Heewoo Jun and Alex Nichol. 2023. Shap-E: Generating Conditional 3D Implicit Functions. arXiv:2305.02463 [cs.CV]"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3592433"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589115"},{"key":"e_1_2_1_27_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Li Weiyu","year":"2024","unstructured":"Weiyu Li, Rui Chen, Xuelin Chen, and Ping Tan. 2024. SweetDreamer: Aligning Geometric Priors in 2D diffusion for Consistent Text-to-3D. 
In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00037"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV57701.2024.00532"},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition.","author":"Ling Huan","year":"2024","unstructured":"Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, and Karsten Kreis. 2024. Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition."},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition.","author":"Liu Minghua","year":"2024","unstructured":"Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. 2024b. One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition."},{"key":"e_1_2_1_32_1","volume-title":"Levine (Eds.)","volume":"36","author":"Liu Minghua","year":"2023","unstructured":"Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. 2023d. One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization. In Advances in Neural Information Processing Systems, A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 22226--22246. 
https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2023\/file\/4683beb6bab325650db13afd05d1a14a-Paper-Conference.pdf"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00853"},{"key":"e_1_2_1_34_1","unstructured":"Xian Liu Jian Ren Aliaksandr Siarohin Ivan Skorokhodov Yanyu Li Dahua Lin Xihui Liu Ziwei Liu and Sergey Tulyakov. 2023b. HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion. arXiv:2310.08579 [cs.CV]"},{"key":"e_1_2_1_35_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Liu Yuan","year":"2024","unstructured":"Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. 2024a. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_36_1","unstructured":"Zexiang Liu Yangguang Li Youtian Lin Xin Yu Sida Peng Yan-Pei Cao Xiaojuan Qi Xiaoshui Huang Ding Liang and Wanli Ouyang. 2023a. UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation. arXiv:2312.08754 [cs.CV]"},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition.","author":"Long Xiaoxiao","year":"2024","unstructured":"Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, and Wenping Wang. 2024. Wonder3D: Single Image to 3D using Cross-Domain Diffusion. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19824-3_13"},{"key":"e_1_2_1_39_1","unstructured":"Kleineberg Marian. 2021. mesh_to_sdf: Calculate signed distance fields for arbitrary meshes. 
https:\/\/github.com\/marian42\/mesh_to_sdf."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00459"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01218"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503250"},{"key":"e_1_2_1_43_1","volume-title":"International conference on machine learning. PMLR, 7220--7229","author":"Nash Charlie","year":"2020","unstructured":"Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. 2020. Polygen: An autoregressive generative model of 3d meshes. In International conference on machine learning. PMLR, 7220--7229."},{"key":"e_1_2_1_44_1","unstructured":"Alex Nichol Heewoo Jun Prafulla Dhariwal Pamela Mishkin and Mark Chen. 2022. Point-E: A System for Generating 3D Point Clouds from Complex Prompts. arXiv:2212.08751 [cs.CV]"},{"key":"e_1_2_1_45_1","unstructured":"OpenAI. 2023. GPT-4V: Generative Pre-trained Transformer 4 for Vision. https:\/\/www.openai.com\/."},{"key":"e_1_2_1_46_1","volume-title":"DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research","author":"Oquab Maxime","year":"2024","unstructured":"Maxime Oquab, Timoth\u00e9e Darcet, Th\u00e9o Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. 2024. DINOv2: Learning Robust Visual Features without Supervision. 
Transactions on Machine Learning Research (2024)."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00025"},{"key":"e_1_2_1_48_1","volume-title":"Proceedings, Part III 16","author":"Peng Songyou","year":"2020","unstructured":"Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. 2020. Convolutional occupancy networks. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part III 16. Springer, 523--540."},{"key":"e_1_2_1_49_1","volume-title":"Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, et al.","author":"Po Ryan","year":"2023","unstructured":"Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T Barron, Amit H Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, et al. 2023. State of the art on diffusion models for visual computing. arXiv preprint arXiv:2310.07204 (2023)."},{"key":"e_1_2_1_50_1","volume-title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952 [cs.CV]","author":"Podell Dustin","year":"2023","unstructured":"Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M\u00fcller, Joe Penna, and Robin Rombach. 2023. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952 [cs.CV]"},{"key":"e_1_2_1_51_1","volume-title":"The Eleventh International Conference on Learning Representations.","author":"Poole Ben","year":"2023","unstructured":"Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2023. DreamFusion: Text-to-3D using 2D Diffusion. In The Eleventh International Conference on Learning Representations."},{"key":"e_1_2_1_52_1","volume-title":"Guibas","author":"Qi Charles R.","year":"2017","unstructured":"Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. 2017. PointNet++: deep hierarchical feature learning on point sets in a metric space. 
30 (2017), 5105--5114."},{"key":"e_1_2_1_53_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Qian Guocheng","year":"2024","unstructured":"Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, and Bernard Ghanem. 2024. Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_54_1","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition.","author":"Qiu Lingteng","year":"2024","unstructured":"Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong, Liefeng Bo, and Xiaoguang Han. 2024. RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition."},{"key":"e_1_2_1_55_1","volume-title":"International conference on machine learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748--8763."},{"key":"e_1_2_1_56_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24","volume":"8831","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 
139), Marina Meila and Tong Zhang (Eds.). PMLR, 8821--8831. http:\/\/proceedings.mlr.press\/v139\/ramesh21a.html"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01072"},{"key":"e_1_2_1_58_1","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition.","author":"Ren Xuanchi","year":"2024","unstructured":"Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. 2024. XCube (X3): Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition."},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588432.3591503"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_2_1_61_1","volume-title":"Oh (Eds.)","volume":"35","author":"Saharia Chitwan","year":"2022","unstructured":"Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 36479--36494. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2022\/file\/ec795aeadae0b7d230fa35cbaf04c041-Paper-Conference.pdf"},{"key":"e_1_2_1_62_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Seo Junyoung","year":"2024","unstructured":"Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Hyeonsu Kim, Jaehoon Ko, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. 2024. Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation. 
In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_63_1","unstructured":"Tianchang Shen Jun Gao Kangxue Yin Ming-Yu Liu and Sanja Fidler. 2021. Deep Marching Tetrahedra: a Hybrid Representation for High-Resolution 3D Shape Synthesis. In Advances in Neural Information Processing Systems A. Beygelzimer Y. Dauphin P. Liang and J. Wortman Vaughan (Eds.)."},{"key":"e_1_2_1_64_1","unstructured":"Ruoxi Shi Hansheng Chen Zhuoyang Zhang Minghua Liu Chao Xu Xinyue Wei Linghao Chen Chong Zeng and Hao Su. 2023. Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model. arXiv:2310.15110 [cs.CV]"},{"key":"e_1_2_1_65_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Shi Yichun","year":"2024","unstructured":"Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. 2024. MV-Dream: Multi-view Diffusion for 3D Generation. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_66_1","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition.","author":"Siddiqui Yawar","year":"2024","unstructured":"Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai, and Matthias Nie\u00dfner. 2024. MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition."},{"key":"e_1_2_1_67_1","volume-title":"International Conference on Machine Learning, ICML 2023","volume":"31929","author":"Singer Uriel","year":"2023","unstructured":"Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, and Yaniv Taigman. 2023. Text-To-4D Dynamic Scene Generation. 
In International Conference on Machine Learning, ICML 2023, 23--29 July 2023, Honolulu, Hawaii, USA (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 31915--31929. https:\/\/proceedings.mlr.press\/v202\/singer23a.html"},{"key":"e_1_2_1_68_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Sun Jingxiang","year":"2024","unstructured":"Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu. 2024. DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00467"},{"key":"e_1_2_1_70_1","volume-title":"Skeletonnet: A topology-preserving solution for learning mesh reconstruction of object surfaces from rgb images","author":"Tang Jiapeng","year":"2021","unstructured":"Jiapeng Tang, Xiaoguang Han, Mingkui Tan, Xin Tong, and Kui Jia. 2021a. Skeletonnet: A topology-preserving solution for learning mesh reconstruction of object surfaces from rgb images. IEEE transactions on pattern analysis and machine intelligence 44, 10 (2021), 6454--6471."},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00644"},{"key":"e_1_2_1_72_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Tang Jiaxiang","year":"2024","unstructured":"Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. 2024. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. 
In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_73_1","volume-title":"\u0141 ukasz Kaiser, and Illia Polosukhin","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141 ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_2_1_74_1","first-page":"27171","article-title":"NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction","volume":"34","author":"Wang Peng","year":"2021","unstructured":"Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. 2021a. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. Advances in Neural Information Processing Systems 34 (2021), 27171--27183.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_75_1","volume-title":"PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction. In The Twelfth International Conference on Learning Representations.","author":"Wang Peng","year":"2024","unstructured":"Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. 2024. PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_76_1","unstructured":"Peng-Shuai Wang. 2022. mesh2sdf. https:\/\/github.com\/wang-ps\/mesh2sdf. 
Converts an input mesh to a signed distance field (SDF)."},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3528223.3530087"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW54120.2021.00217"},{"key":"e_1_2_1_79_1","volume-title":"Thirty-seventh Conference on Neural Information Processing Systems.","author":"Wang Zhengyi","year":"2023","unstructured":"Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. 2023. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. In Thirty-seventh Conference on Neural Information Processing Systems."},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV57701.2024.00317"},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00084"},{"key":"e_1_2_1_82_1","volume-title":"International Conference on Machine Learning. PMLR, 10524--10533","author":"Xiong Ruibin","year":"2020","unstructured":"Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. 2020. On layer normalization in the transformer architecture. In International Conference on Machine Learning. PMLR, 10524--10533."},{"key":"e_1_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.1145\/3592129"},{"key":"e_1_2_1_84_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Xu Yinghao","year":"2024","unstructured":"Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, and Kai Zhang. 2024. DMV3D: Denoising Multi-view Diffusion Using 3D Large Reconstruction Model. 
In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_85_1","volume-title":"Juan Carlos Niebles, and Silvio Savarese","author":"Xue Le","year":"2023","unstructured":"Le Xue, Ning Yu, Shu Zhang, Junnan Li, Roberto Mart\u00edn-Mart\u00edn, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. 2023. ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding. arXiv:2305.08275 [cs.CV]"},{"key":"e_1_2_1_86_1","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition.","author":"Yariv Lior","year":"2024","unstructured":"Lior Yariv, Omri Puny, Natalia Neverova, Oran Gafni, and Yaron Lipman. 2024. Mosaic-SDF for 3D Generative Models. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition."},{"key":"e_1_2_1_87_1","unstructured":"Hu Ye Jun Zhang Sibo Liu Xiao Han and Wei Yang. 2023. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv:2308.06721 [cs.CV]"},{"key":"e_1_2_1_88_1","unstructured":"Fukun Yin Xin Chen Chi Zhang Biao Jiang Zibo Zhao Jiayuan Fan Gang Yu Taihao Li and Tao Chen. 2023. ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model. arXiv:2311.17618 [cs.CV]"},{"key":"e_1_2_1_89_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612232"},{"key":"e_1_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00883"},{"key":"e_1_2_1_91_1","doi-asserted-by":"publisher","DOI":"10.1145\/3592442"},{"key":"e_1_2_1_92_1","doi-asserted-by":"publisher","DOI":"10.1145\/3592094"},{"key":"e_1_2_1_93_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00355"},{"key":"e_1_2_1_94_1","unstructured":"Youjia Zhang Junqing Yu Zikai Song and Wei Yang. 2023d. Optimized View and Geometry Distillation from Multi-view Diffuser. 
arXiv:2312.06198 [cs.CV]"},{"key":"e_1_2_1_95_1","volume-title":"Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. Advances in neural information processing systems","author":"Zhao Zibo","year":"2023","unstructured":"Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. 2023. Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. Advances in neural information processing systems (2023)."},{"key":"e_1_2_1_96_1","doi-asserted-by":"publisher","DOI":"10.1145\/3592103"},{"key":"e_1_2_1_97_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Zhu Junzhe","year":"2024","unstructured":"Junzhe Zhu, Peiye Zhuang, and Sanmi Koyejo. 2024. HIFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_98_1","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition.","author":"Zou Zi-Xin","year":"2024","unstructured":"Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Yan-Pei Cao, and Song-Hai Zhang. 2024. Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers. 
In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition."}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3658146","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3658146","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:05:54Z","timestamp":1750291554000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3658146"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,19]]},"references-count":98,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,7,19]]}},"alternative-id":["10.1145\/3658146"],"URL":"https:\/\/doi.org\/10.1145\/3658146","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,19]]},"assertion":[{"value":"2024-07-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}