{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,2]],"date-time":"2026-06-02T09:28:49Z","timestamp":1780392529981,"version":"3.54.1"},"reference-count":97,"publisher":"Association for Computing Machinery (ACM)","issue":"4","funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2022YFF0902301"],"award-info":[{"award-number":["2022YFF0902301"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"name":"NSFC programs","award":["61976138"],"award-info":[{"award-number":["61976138"]}]},{"name":"NSFC programs","award":["61977047"],"award-info":[{"award-number":["61977047"]}]},{"DOI":"10.13039\/501100003399","name":"STCSM","doi-asserted-by":"crossref","award":["2015F0203-000-06"],"award-info":[{"award-number":["2015F0203-000-06"]}],"id":[{"id":"10.13039\/501100003399","id-type":"DOI","asserted-by":"crossref"}]},{"name":"SHMEC","award":["2019-01-07-00-01-E00003"],"award-info":[{"award-number":["2019-01-07-00-01-E00003"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2025,8,1]]},"abstract":"<jats:p>Recovering high-quality 3D scenes from a single RGB image is a challenging task in computer graphics. Current methods often struggle with domain-specific limitations or low-quality object generation. To address these, we propose CAST (Component-Aligned 3D Scene Reconstruction from a Single RGB Image), a novel method for 3D scene reconstruction. CAST starts by extracting object-level 2D segmentation and relative depth information from the input image, followed by using a GPT-based model to analyze inter-object spatial relations. This enables understanding of how objects relate to each other within the scene, ensuring more coherent reconstruction. CAST then employs an occlusion-aware large-scale 3D generation model to independently generate each object's full geometry, using Masked Auto Encoder (MAE) and point cloud conditioning to mitigate the effects of occlusions and partial object information, ensuring accurate alignment with the source image's geometry and texture. To align each object with the scene, the alignment generation model computes the necessary transformations, allowing the generated meshes to be accurately placed and integrated into the scene's point cloud. Finally, CAST applies a physics-aware correction mechanism, which leverages a fine-grained relation graph to generate a constraint graph. This graph guides the optimization of object poses, ensuring physical consistency and spatial coherence. By utilizing Signed Distance Fields (SDF), the model effectively addresses issues such as occlusions, object penetration, and floating objects, ensuring that the generated scene accurately reflects real-world physical interactions. Experimental results demonstrate that CAST significantly improves the quality of single-image 3D scene reconstruction, offering enhanced realism and accuracy in scene understanding and reconstruction tasks. CAST has practical applications in virtual content creation, such as immersive game environments and film production, where real-world setups can be seamlessly integrated into virtual landscapes. Additionally, CAST can be leveraged in robotics, enabling efficient real-to-simulation workflows and providing realistic, scalable simulation environments for robotic systems.<\/jats:p>","DOI":"10.1145\/3730841","type":"journal-article","created":{"date-parts":[[2025,7,27]],"date-time":"2025-07-27T04:02:41Z","timestamp":1753588961000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image"],"prefix":"10.1145","volume":"44","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-2056-6057","authenticated-orcid":false,"given":"Kaixin","family":"Yao","sequence":"first","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"},{"name":"Deemos, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8508-3359","authenticated-orcid":false,"given":"Longwen","family":"Zhang","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"},{"name":"Deemos, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-4736-5664","authenticated-orcid":false,"given":"Xinhao","family":"Yan","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"},{"name":"Deemos, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-4612-8789","authenticated-orcid":false,"given":"Yan","family":"Zeng","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"},{"name":"Deemos, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4837-7152","authenticated-orcid":false,"given":"Qixuan","family":"Zhang","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"},{"name":"Deemos, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8807-7787","authenticated-orcid":false,"given":"Lan","family":"Xu","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1189-1254","authenticated-orcid":false,"given":"Wei","family":"Yang","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3207-7921","authenticated-orcid":false,"given":"Jiayuan","family":"Gu","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9198-6853","authenticated-orcid":false,"given":"Jingyi","family":"Yu","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,7,27]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.","author":"Achiam Josh","year":"2023","unstructured":"Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)."},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.1987.4767965"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00580"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00539"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.2312\/CONF\/EG2012\/STARS\/095-134"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.121791"},{"key":"e_1_2_2_7_1","volume-title":"Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288","author":"Bhat Shariq Farooq","year":"2023","unstructured":"Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias M\u00fcller. 2023. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288 (2023)."},{"key":"e_1_2_2_8_1","unstructured":"Andreas Blattmann Tim Dockhorn Sumith Kulal Daniel Mendelevitch Maciej Kilian Dominik Lorenz Yam Levi Zion English Vikram Voleti Adam Letts et al. 2023. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)."},{"key":"e_1_2_2_9_1","volume-title":"Forty-first International Conference on Machine Learning.","author":"Bruce Jake","year":"2024","unstructured":"Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. 2024. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning."},{"key":"e_1_2_2_10_1","volume-title":"Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012","author":"Chang Angel X","year":"2015","unstructured":"Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. 2015. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)."},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3203192"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/3DV62453.2024.00142"},{"key":"e_1_2_2_13_1","volume-title":"Ying Nian Wu, and Chenfanfu Jiang","author":"Chen Yunuo","year":"2024","unstructured":"Yunuo Chen, Tianyi Xie, Zeshun Zong, Xuan Li, Feng Gao, Yin Yang, Ying Nian Wu, and Chenfanfu Jiang. 2024b. Atlas3D: Physically Constrained Self-Supporting Text-to-3D for Simulation and Fabrication. arXiv preprint arXiv:2405.18515 (2024)."},{"key":"e_1_2_2_14_1","volume-title":"Navila: Legged robot vision-language-action model for navigation. arXiv preprint arXiv:2412.04453","author":"Cheng An-Chieh","year":"2024","unstructured":"An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. 2024. Navila: Legged robot vision-language-action model for navigation. arXiv preprint arXiv:2412.04453 (2024)."},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00478"},{"key":"e_1_2_2_16_1","first-page":"8282","article-title":"Panoptic 3d scene reconstruction from a single rgb image","volume":"34","author":"Dahnert Manuel","year":"2021","unstructured":"Manuel Dahnert, Ji Hou, Matthias Nie\u00dfner, and Angela Dai. 2021. Panoptic 3d scene reconstruction from a single rgb image. Advances in Neural Information Processing Systems 34 (2021), 8282\u20138293.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.261"},{"key":"e_1_2_2_18_1","volume-title":"Automated Creation of Digital Cousins for Robust Policy Learning. arXiv preprint arXiv:2410.07408","author":"Dai Tianyuan","year":"2024","unstructured":"Tianyuan Dai, Josiah Wong, Yunfan Jiang, Chen Wang, Cem Gokmen, Ruohan Zhang, Jiajun Wu, and Li Fei-Fei. 2024. Automated Creation of Digital Cousins for Robust Policy Learning. arXiv preprint arXiv:2410.07408 (2024)."},{"key":"e_1_2_2_19_1","volume-title":"Samir Yitzhak Gadre, et al","author":"Deitke Matt","year":"2024","unstructured":"Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. 2024. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01263"},{"key":"e_1_2_2_21_1","volume-title":"Generalizable 3D Scene Reconstruction via Divide and Conquer from a Single View. arXiv preprint arXiv:2404.03421","author":"Dogaru Andreea","year":"2024","unstructured":"Andreea Dogaru, Mert \u00d6zer, and Bernhard Egger. 2024. Generalizable 3D Scene Reconstruction via Divide and Conquer from a Single View. arXiv preprint arXiv:2404.03421 (2024)."},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01075"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3658236"},{"key":"e_1_2_2_24_1","volume-title":"Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314","author":"Gao Ruiqi","year":"2024","unstructured":"Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. 2024a. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314 (2024)."},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1177\/0278364913491297"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00174"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2007.4408933"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00399"},{"key":"e_1_2_2_29_1","volume-title":"Chuang Gan, Joshua B Tenenbaum, Kaiming He, and Wojciech Matusik.","author":"Guo Minghao","year":"2024","unstructured":"Minghao Guo, Bohan Wang, Pingchuan Ma, Tianyuan Zhang, Crystal Elaine Owens, Chuang Gan, Joshua B Tenenbaum, Kaiming He, and Wojciech Matusik. 2024. Physically Compatible 3D Object Modeling from a Single Image. arXiv preprint arXiv:2405.20510 (2024)."},{"key":"e_1_2_2_30_1","doi-asserted-by":"crossref","unstructured":"Jonathan Ho William Chan Chitwan Saharia Jay Whang Ruiqi Gao Alexey Gritsenko Diederik P Kingma Ben Poole Mohammad Norouzi David J Fleet et al. 2022a. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022).","DOI":"10.52202\/068431-0628"},{"key":"e_1_2_2_31_1","first-page":"8633","article-title":"Video diffusion models","volume":"35","author":"Ho Jonathan","year":"2022","unstructured":"Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. 2022b. Video diffusion models. Advances in Neural Information Processing Systems 35 (2022), 8633\u20138646.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_32_1","volume-title":"Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400","author":"Hong Yicong","year":"2023","unstructured":"Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. 2023. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400 (2023)."},{"key":"e_1_2_2_33_1","volume-title":"MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation. arXiv preprint arXiv:2412.03558","author":"Huang Zehuan","year":"2024","unstructured":"Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, and Lu Sheng. 2024. MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation. arXiv preprint arXiv:2412.03558 (2024)."},{"key":"e_1_2_2_34_1","unstructured":"Hyper3D. 2025. Omnicraft. https:\/\/hyper3d.ai\/omnicraft\/hdri"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.169"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3592433"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01236"},{"key":"e_1_2_2_39_1","volume-title":"Proceedings, Part XVII 16","author":"Labb\u00e9 Yann","year":"2020","unstructured":"Yann Labb\u00e9, Justin Carpentier, Mathieu Aubry, and Josef Sivic. 2020. Cosypose: Consistent multi-view multi-object 6d pose estimation. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part XVII 16. Springer, 574\u2013591."},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417861"},{"key":"e_1_2_2_41_1","volume-title":"SPARC: Sparse render-and-compare for CAD model alignment in a single RGB image. arXiv preprint arXiv:2210.01044","author":"Langer Florian","year":"2022","unstructured":"Florian Langer, Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. 2022. SPARC: Sparse render-and-compare for CAD model alignment in a single RGB image. arXiv preprint arXiv:2210.01044 (2022)."},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1093\/oso\/9780199256044.001.0001"},{"key":"e_1_2_2_43_1","volume-title":"LLM-enhanced Scene Graph Learning for Household Rearrangement. In SIGGRAPH Asia 2024 Conference Papers. 1\u201311","author":"Li Wenhao","year":"2024","unstructured":"Wenhao Li, Zhiyuan Yu, Qijin She, Zhinan Yu, Yuqing Lan, Chenyang Zhu, Ruizhen Hu, and Kai Xu. 2024b. LLM-enhanced Scene Graph Learning for Household Rearrangement. In SIGGRAPH Asia 2024 Conference Papers. 1\u201311."},{"key":"e_1_2_2_44_1","volume-title":"Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al.","author":"Li Xuanlin","year":"2024","unstructured":"Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. 2024a. Evaluating Real-World Robot Manipulation Policies in Simulation. arXiv preprint arXiv:2405.05941 (2024)."},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00623"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19769-7_25"},{"key":"e_1_2_2_47_1","volume-title":"One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems 36","author":"Liu Minghua","year":"2024","unstructured":"Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. 2024. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00853"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72970-6_3"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00085"},{"key":"e_1_2_2_51_1","unstructured":"Yuan Liu Cheng Lin Zijiao Zeng Xiaoxiao Long Lingjie Liu Taku Komura and Wenping Wang. 2023a. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image. In arXiv preprint arXiv:2309.03453."},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00951"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2009.5414068"},{"key":"e_1_2_2_54_1","volume-title":"Volumetric hierarchical approximate convex decomposition. Game engine gems 3","author":"Mamou Khaled","year":"2016","unstructured":"Khaled Mamou, E Lengyel, and A Peters. 2016. Volumetric hierarchical approximate convex decomposition. Game engine gems 3 (2016), 141\u2013158."},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01315"},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00921"},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_24"},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3528223.3530127"},{"key":"e_1_2_2_59_1","unstructured":"Junfeng Ni Yixin Chen Bohan Jing Nan Jiang Bin Wang Bo Dai Puhao Li Yixin Zhu Song-Chun Zhu and Siyuan Huang. 2024. PhyRecon: Physically Plausible Neural Scene Reconstruction. Advances in Neural Information Processing Systems."},{"key":"e_1_2_2_60_1","unstructured":"Maxime Oquab Timoth\u00e9e Darcet Th\u00e9o Moutakanni Huy Vo Marc Szafraniec Vasil Khalidov Pierre Fernandez Daniel Haziza Francisco Massa Alaaeldin El-Nouby et al. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)."},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00963"},{"key":"e_1_2_2_62_1","volume-title":"Dreamfusion: Text-to-3d using 2d diffusion. In arXiv preprint arXiv:2209.14988.","author":"Poole Ben","year":"2022","unstructured":"Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. 2022. Dreamfusion: Text-to-3d using 2d diffusion. In arXiv preprint arXiv:2209.14988."},{"key":"e_1_2_2_63_1","volume-title":"7th Annual Conference on Robot Learning.","author":"Rana Krishan","year":"2023","unstructured":"Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, and Niko Suenderhauf. 2023. Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning. In 7th Annual Conference on Robot Learning."},{"key":"e_1_2_2_64_1","unstructured":"Nikhila Ravi Valentin Gabeur Yuan-Ting Hu Ronghang Hu Chaitanya Ryali Tengyu Ma Haitham Khedr Roman R\u00e4dle Chloe Rolland Laura Gustafson et al. 2024. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)."},{"key":"e_1_2_2_65_1","unstructured":"Tianhe Ren Shilong Liu Ailing Zeng Jing Lin Kunchang Li He Cao Jiayu Chen Xinyu Huang Yukang Chen Feng Yan et al. 2024. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024)."},{"key":"e_1_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00314"},{"key":"e_1_2_2_67_1","volume-title":"Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image. arXiv preprint arXiv:2406.04343","author":"Szymanowicz Stanislaw","year":"2024","unstructured":"Stanislaw Szymanowicz, Eldar Insafutdinov, Chuanxia Zheng, Dylan Campbell, Jo\u00e3o F Henriques, Christian Rupprecht, and Andrea Vedaldi. 2024a. Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image. arXiv preprint arXiv:2406.04343 (2024)."},{"key":"e_1_2_2_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00972"},{"key":"e_1_2_2_69_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-73235-5_1"},{"key":"e_1_2_2_70_1","volume-title":"Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653","author":"Tang Jiaxiang","year":"2023","unstructured":"Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. 2023. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023)."},{"key":"e_1_2_2_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01641"},{"key":"e_1_2_2_72_1","volume-title":"Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation. arXiv preprint arXiv:2403.03949","author":"Torne Marcel","year":"2024","unstructured":"Marcel Torne, Anthony Simeonov, Zechu Li, April Chan, Tao Chen, Abhishek Gupta, and Pulkit Agrawal. 2024. Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation. arXiv preprint arXiv:2403.03949 (2024)."},{"key":"e_1_2_2_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.88573"},{"key":"e_1_2_2_74_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-73232-4_25"},{"key":"e_1_2_2_75_1","volume-title":"Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. arXiv preprint arXiv:2410.19115","author":"Wang Ruicheng","year":"2024","unstructured":"Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. 2024b. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. arXiv preprint arXiv:2410.19115 (2024)."},{"key":"e_1_2_2_76_1","volume-title":"Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in Neural Information Processing Systems 36","author":"Wang Zhengyi","year":"2024","unstructured":"Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. 2024a. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_2_2_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3528223.3530103"},{"key":"e_1_2_2_78_1","volume-title":"Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image. arXiv preprint arXiv:2405.20343","author":"Wu Kailu","year":"2024","unstructured":"Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, and Kaisheng Ma. 2024a. Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image. arXiv preprint arXiv:2405.20343 (2024)."},{"key":"e_1_2_2_79_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02036"},{"key":"e_1_2_2_80_1","volume-title":"Structured 3D Latents for Scalable and Versatile 3D Generation. arXiv preprint arXiv:2412.01506","author":"Xiang Jianfeng","year":"2024","unstructured":"Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. 2024. Structured 3D Latents for Scalable and Versatile 3D Generation. arXiv preprint arXiv:2412.01506 (2024)."},{"key":"e_1_2_2_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00461"},{"key":"e_1_2_2_82_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00420"},{"key":"e_1_2_2_83_1","volume-title":"Precise-Physics Driven Text-to-3D Generation. arXiv preprint arXiv:2403.12438","author":"Xu Qingshan","year":"2024","unstructured":"Qingshan Xu, Jiao Liu, Melvin Wong, Caishun Chen, and Yew-Soon Ong. 2024. Precise-Physics Driven Text-to-3D Generation. arXiv preprint arXiv:2403.12438 (2024)."},{"key":"e_1_2_2_84_1","volume-title":"Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv preprint arXiv:2310.11441","author":"Yang Jianwei","year":"2023","unstructured":"Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv preprint arXiv:2310.11441 (2023)."},{"key":"e_1_2_2_85_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00987"},{"key":"e_1_2_2_86_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01539"},{"key":"e_1_2_2_87_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00830"},{"key":"e_1_2_2_88_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00455"},{"key":"e_1_2_2_89_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00636"},{"key":"e_1_2_2_90_1","volume-title":"Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. Advances in neural information processing systems 35","author":"Yu Zehao","year":"2022","unstructured":"Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. 2022. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. Advances in neural information processing systems 35 (2022), 25018\u201325032."},{"key":"e_1_2_2_91_1","volume-title":"Vision as Bayesian inference: analysis by synthesis? Trends in cognitive sciences 10, 7","author":"Yuille Alan","year":"2006","unstructured":"Alan Yuille and Daniel Kersten. 2006. Vision as Bayesian inference: analysis by synthesis? Trends in cognitive sciences 10, 7 (2006), 301\u2013308."},{"key":"e_1_2_2_92_1","doi-asserted-by":"publisher","DOI":"10.1145\/3618342"},{"key":"e_1_2_2_93_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02439"},{"key":"e_1_2_2_94_1","doi-asserted-by":"publisher","DOI":"10.1145\/3658146"},{"key":"e_1_2_2_95_1","unstructured":"SUN Zhengwentai. 2023. clip-score: CLIP Score for PyTorch. https:\/\/github.com\/taited\/clip-score. Version 0.1.1."},{"key":"e_1_2_2_96_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72627-9_23"},{"key":"e_1_2_2_97_1","volume-title":"Open3D: A Modern Library for 3D Data Processing. ArXiv abs\/1801.09847","author":"Zhou Qian-Yi","year":"2018","unstructured":"Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. 2018. Open3D: A Modern Library for 3D Data Processing. ArXiv abs\/1801.09847 (2018)."}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3730841","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T17:59:14Z","timestamp":1774634354000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3730841"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,27]]},"references-count":97,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,8,1]]}},"alternative-id":["10.1145\/3730841"],"URL":"https:\/\/doi.org\/10.1145\/3730841","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,27]]},"assertion":[{"value":"2025-01-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-29","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}