{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,22]],"date-time":"2025-11-22T07:03:20Z","timestamp":1763795000034,"version":"3.45.0"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"12","funder":[{"name":"Guangdong Major Project of Basic and Applied Basic Research","award":["2023B0303000009"],"award-info":[{"award-number":["2023B0303000009"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,12,31]]},"abstract":"<jats:p>Monocular Depth Estimation (MDE) is a fundamental problem in computer vision with broad applications in various downstream tasks. While recent studies focus on designing increasingly complex and powerful deep learning methods to regress depth maps directly, we propose a novel approach by introducing the Virtual Point Cloud (VPC) as an intermediate representation to provide the approximate geometric prior for the MDE task. In this article, we design a multi-scale multi-space representation fusion-enhanced MDE framework to address the challenges of MDE. Specifically, to resolve the issue of scale ambiguity, we design a VPC feature extraction module to learn multi-scale 3D geometric information for the depth prior. Then, we explicitly introduce geometric constraints for global depth prediction by incorporating a multi-space representation fusion from both the texture features in 2D space and the geometric features in 3D space. To mitigate errors at object boundaries, we introduce a confidence map generated based on the quality of the VPC to refine the predicted depth map. Specifically, we construct convolution receptive fields based on 3D spatial distances in spherical coordinates, ensuring that the confidence map provides reliable geometric guidance at object boundaries. 
Furthermore, we propose an independent confidence geometric consistency loss to supervise the refinement process. Experimental results demonstrate that our method significantly outperforms state-of-the-art approaches across all evaluation metrics on the KITTI and NYU-Depth-v2 datasets, achieving RMSE improvements of 9.2% and 2.8%, respectively. Moreover, zero-shot evaluations on the nuScenes and SUN-RGBD datasets further validate the generalizability of our approach.<\/jats:p>","DOI":"10.1145\/3770076","type":"journal-article","created":{"date-parts":[[2025,10,1]],"date-time":"2025-10-01T13:43:03Z","timestamp":1759326183000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Multi-space Representation Fusion Enhanced Monocular Depth Estimation via Virtual Point Cloud"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4844-1353","authenticated-orcid":false,"given":"Lin","family":"Bie","sequence":"first","affiliation":[{"name":"BNRist, THUIBCS, BLBCI, School of Software, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9720-826X","authenticated-orcid":false,"given":"Siqi","family":"Li","sequence":"additional","affiliation":[{"name":"BNRist, THUIBCS, BLBCI, School of Software, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8761-9396","authenticated-orcid":false,"given":"Xiaopin","family":"Zhong","sequence":"additional","affiliation":[{"name":"College of Mechatronics and Control Engineering, Shenzhen University, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0597-1426","authenticated-orcid":false,"given":"Zongze","family":"Wu","sequence":"additional","affiliation":[{"name":"College of Mechatronics and Control Engineering, Shenzhen University, Shenzhen, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9705-3365","authenticated-orcid":false,"given":"Yue","family":"Gao","sequence":"additional","affiliation":[{"name":"BNRist, THUIBCS, BLBCI, School of Software, Tsinghua University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,11,21]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00947"},{"key":"e_1_3_1_3_2","first-page":"4009","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Bhat Shariq Farooq","year":"2021","unstructured":"Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. 2021. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 4009\u20134018."},{"key":"e_1_3_1_4_2","unstructured":"Shariq Farooq Bhat Reiner Birkl Diana Wofk Peter Wonka and Matthias M\u00fcller. 2023. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv:2302.12288. Retrieved from https:\/\/arxiv.org\/abs\/2302.12288"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3652583.3658074"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.02064"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01164"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.01012"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00966"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_1_11_2","first-page":"611","volume-title":"In Proceedings of the International Conference on Learning Representation","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, G. Heigold, S. 
Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representation, 611\u2013631."},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-73247-8_25"},{"key":"e_1_3_1_13_2","first-page":"2366","volume-title":"Proceedings of the Conference on Neural Information Processing Systems","author":"Eigen David","year":"2014","unstructured":"David Eigen, Christian Puhrsch, and Rob Fergus. 2014. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the Conference on Neural Information Processing Systems, 2366\u20132374."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00214"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2012.6248074"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00393"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00847"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA48506.2021.9561035"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2024.3444912"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01253"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/TII.2022.3210560"},{"key":"e_1_3_1_22_2","unstructured":"Jin Han Lee Myung-Kyu Han Dong Wook Ko and Il Hong Suh. 2019. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv:1907.10326. 
Retrieved from https:\/\/arxiv.org\/abs\/1907.10326"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2024.3416065"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3694978"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3690826"},{"key":"e_1_3_1_26_2","first-page":"14405","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Liu Ce","year":"2023","unstructured":"Ce Liu, Suryansh Kumar, Shuhang Gu, Radu Timofte, and Luc Van Gool. 2023. VA-DepthNet: A variational approach to single image depth prediction. In Proceedings of the International Conference on Learning Representations, 14405\u201314425."},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01664"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3638559"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01167"},{"key":"e_1_3_1_31_2","first-page":"9685","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Mahdi S.","year":"2021","unstructured":"S. Mahdi, H. Miangoleh, Sebastian Dille, Long Mai, Sylvain Paris, and Yagiz Aksoy. 2021. Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging. 
In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 9685\u20139694."},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIM.2023.3315416"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/1102351.1102426"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.5555\/1285266.1285269"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02672"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02057"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00963"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00592"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2020.3019967"},{"key":"e_1_3_1_40_2","doi-asserted-by":"crossref","first-page":"7660","DOI":"10.1109\/TMM.2022.3224810","article-title":"Towards comprehensive monocular depth estimation: Multiple heads are better than one","volume":"25","author":"Shao Shuwei","year":"2022","unstructured":"Shuwei Shao, Ran Li, Zhongcai Pei, Zhong Liu, Weihai Chen, Wentao Zhu, Xingming Wu, and Baochang Zhang. 2022. Towards comprehensive monocular depth estimation: Multiple heads are better than one. IEEE Trans. Multimed. 25 (2022), 7660\u20137671.","journal-title":"IEEE Trans. 
Multimed"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-024-02293-3"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2024.3411571"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIV.2023.3299935"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00729"},{"key":"e_1_3_1_45_2","first-page":"53025","volume-title":"Proceedings of the Conference on Neural Information Processing Systems","author":"Shao Shuwei","year":"2023","unstructured":"Shuwei Shao, Zhongcai Pei, Xingming Wu, Zhong Liu, Weihai Chen, and Zhengguo Li. 2023. IEBins: Iterative elastic bins for monocular depth estimation. In Proceedings of the Conference on Neural Information Processing Systems, 53025\u201353037."},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-022-01710-9"},{"key":"e_1_3_1_47_2","first-page":"746","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Silberman Nathan","year":"2012","unstructured":"Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor segmentation and support inference from RGB-D images. In Proceedings of the European Conference on Computer Vision. Springer, 746\u2013760."},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298655"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00651"},{"key":"e_1_3_1_50_2","first-page":"64629","volume-title":"Proceedings of the Conference on Neural Information Processing Systems","author":"Wang Kun","year":"2024","unstructured":"Kun Wang, Zhiqiang Yan, Junkai Fan, Wanlu Zhu, Xiang Li, Jun Li, and Jian Yang. 2024. DCDepth: Progressive monocular depth estimation in discrete cosine domain. 
In Proceedings of the Conference on Neural Information Processing Systems, 64629\u201364648."},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00114"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.634"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01391"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00466"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19812-0_13"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00987"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00027"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00802"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00389"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2019.8794160"},{"key":"e_1_3_1_61_2","first-page":"14128","volume-title":"Proceedings of the Conference on Neural Information Processing Systems","author":"Zhang Chi","year":"2022","unstructured":"Chi Zhang, Wei Yin, Zhibin Wang, Gang Yu, Bin Fu, and Chunhua Shen. 2022. Hierarchical normalization for robust monocular depth estimation. 
In Proceedings of the Conference on Neural Information Processing Systems, 14128\u201314139."},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3672397"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01777"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.660"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3770076","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,22]],"date-time":"2025-11-22T06:59:24Z","timestamp":1763794764000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3770076"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,21]]},"references-count":63,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,12,31]]}},"alternative-id":["10.1145\/3770076"],"URL":"https:\/\/doi.org\/10.1145\/3770076","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2025,11,21]]},"assertion":[{"value":"2024-12-28","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-08-21","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}