{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T18:17:51Z","timestamp":1775067471377,"version":"3.50.1"},"reference-count":84,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2025,11,20]],"date-time":"2025-11-20T00:00:00Z","timestamp":1763596800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>Monocular metric depth estimation (MMDE) aims to generate depth maps with an absolute metric scale from a single RGB image, which enables accurate spatial understanding, 3D reconstruction, and autonomous navigation. Unlike conventional monocular depth estimation that predicts only relative depth, MMDE maintains geometric consistency across frames and supports reliable integration with visual SLAM, high-precision 3D modeling, and novel view synthesis. This survey provides a comprehensive review of MMDE, tracing its evolution from geometry-based formulations to modern learning-based frameworks. The discussion emphasizes the importance of datasets, distinguishing metric datasets that supply absolute ground-truth depth from relative datasets that facilitate ordinal or normalized depth learning. Representative datasets, including KITTI, NYU-Depth, ApolloScape, and TartanAir, are analyzed with respect to scene composition, sensor modality, and intended application domain. Methodological progress is examined across several dimensions, including model architecture design, domain generalization, structural detail preservation, and the integration of synthetic data that complements real-world captures. Recent advances in patch-based inference, generative modeling, and loss design are compared to reveal their respective advantages and limitations. 
By summarizing the current landscape and outlining open research challenges, this work establishes a clear reference framework that supports future studies and facilitates the deployment of MMDE in real-world vision systems requiring precise and robust metric depth estimation.<\/jats:p>","DOI":"10.3390\/computers14110502","type":"journal-article","created":{"date-parts":[[2025,11,20]],"date-time":"2025-11-20T10:30:59Z","timestamp":1763634659000},"page":"502","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Survey on Monocular Metric Depth Estimation"],"prefix":"10.3390","volume":"14","author":[{"given":"Jiuling","family":"Zhang","sequence":"first","affiliation":[{"name":"CRRC Technology Innovation (Beijing) Co., Ltd., 12th Research Institute, Beijing 100083, China"}]},{"given":"Yurong","family":"Wu","sequence":"additional","affiliation":[{"name":"University of Chinese Academy of Sciences, Beijing 101408, China"}]},{"given":"Huilong","family":"Jiang","sequence":"additional","affiliation":[{"name":"CRRC Technology Innovation (Beijing) Co., Ltd., 12th Research Institute, Beijing 100083, China"},{"name":"CRRC Dalian Co., Ltd., Dalian 116045, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1145\/3503250","article-title":"Nerf: Representing scenes as neural radiance fields for view synthesis","volume":"65","author":"Mildenhall","year":"2021","journal-title":"Commun. ACM"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3592433","article-title":"3D gaussian splatting for real-time radiance field rendering","volume":"42","author":"Kerbl","year":"2023","journal-title":"ACM Trans. Graph."},{"key":"ref_3","unstructured":"Ye, C., Nie, Y., Chang, J., Chen, Y., Zhi, Y., and Han, X. (2024). Gaustudio: A modular framework for 3D gaussian splatting and beyond. 
arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Szeliski, R. (2022). Computer Vision: Algorithms and Applications, Springer Nature.","DOI":"10.1007\/978-3-030-34372-9"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Zheng, J., Lin, C., Sun, J., Zhao, Z., Li, Q., and Shen, C. (2024, January 16\u201322). Physical 3D adversarial attacks against monocular depth estimation in autonomous driving. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.02308"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Leduc, A., Cioppa, A., Giancola, S., Ghanem, B., and Van Droogenbroeck, M. (2024, January 16\u201322). SoccerNet-Depth: A scalable dataset for monocular depth estimation in sports videos. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPRW63382.2024.00333"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Zhang, L., Rao, A., and Agrawala, M. (2023, January 2\u20133). Adding conditional control to text-to-image diffusion models. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00355"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Khan, N., Xiao, L., and Lanman, D. (2023, January 2\u20133). Tiled multiplane images for practical 3D photography. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00959"},{"key":"ref_9","unstructured":"Liew, J.H., Yan, H., Zhang, J., Xu, Z., and Feng, J. (2023). Magicedit: High-fidelity and temporally coherent video editing. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Xu, D., Jiang, Y., Wang, P., Fan, Z., Wang, Y., and Wang, Z. (2023, January 17\u201324). Neurallift-360: Lifting an in-the-wild 2D photo to a 3D object with 360deg views. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00435"},{"key":"ref_11","unstructured":"Shahbazi, M., Claessens, L., Niemeyer, M., Collins, E., Tonioni, A., Van Gool, L., and Tombari, F. (2024). Inserf: Text-driven generative object insertion in neural 3D scenes. arXiv."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Shriram, J., Trevithick, A., Liu, L., and Ramamoorthi, R. (2024). Realmdreamer: Text-driven 3D scene generation with inpainting and depth diffusion. arXiv.","DOI":"10.1109\/3DV66043.2025.00086"},{"key":"ref_13","unstructured":"Deng, J., Yin, W., Guo, X., Zhang, Q., Hu, X., Ren, W., Long, X.X., and Tan, P. (2025, January 19\u201323). Boost 3D reconstruction using diffusion-based monocular camera calibration. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Honolulu, HI, USA."},{"key":"ref_14","unstructured":"Guo, J., Ding, Y., Chen, X., Chen, S., Li, B., Zou, Y., Lyu, X., Tan, F., Qi, X., and Li, Z. (2025, January 11\u201315). Dist-4d: Disentangled spatiotemporal diffusion with metric depth for 4D driving scene generation. Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA."},{"key":"ref_15","unstructured":"Daxberger, E., Wenzel, N., Griffiths, D., Gang, H., Lazarow, J., Kohavi, G., Kang, K., Eichner, M., Yang, Y., and Dehghan, A. (2025, January 19\u201323). Mm-spatial: Exploring 3D spatial understanding in multimodal llms. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Honolulu, HI, USA."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Yu, Y., Liu, S., Pautrat, R., Pollefeys, M., and Larsson, V. (2025, January 11\u201315). Relative pose estimation through affine corrections of monocular depth priors. 
Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA.","DOI":"10.1109\/CVPR52734.2025.01557"},{"key":"ref_17","unstructured":"Kim, H., Baik, S., and Joo, H. (2025, January 19\u201323). DAViD: Modeling dynamic affordance of 3D objects using pre-trained video diffusion models. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Honolulu, HI, USA."},{"key":"ref_18","unstructured":"Bhat, S.F., Birkl, R., Wofk, D., Wonka, P., and M\u00fcller, M. (2023). Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv."},{"key":"ref_19","unstructured":"Bochkovskiy, A., Delaunoy, A., Germain, H., Santos, M., Zhou, Y., Richter, S., and Koltun, V. (2025, January 24\u201328). Depth Pro: Sharp Monocular Metric Depth in Less Than a Second. Proceedings of the Thirteenth International Conference on Learning Representations, Singapore."},{"key":"ref_20","unstructured":"Saxena, S., Hur, J., Herrmann, C., Sun, D., and Fleet, D.J. (2023). Zero-shot metric depth with a field-of-view conditioned diffusion model. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., and Zhao, H. (2024, January 16\u201322). Depth anything: Unleashing the power of large-scale unlabeled data. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.00987"},{"key":"ref_22","unstructured":"Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., and Zhao, H. (2024). Depth anything v2. Advances in Neural Information Processing Systems 37, Curran Associates, Inc."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Guo, Y., Garg, S., Miangoleh, S.M.H., Huang, X., and Ren, L. (2025, January 10\u201317). Depth any camera: Zero-shot metric depth estimation from any camera. 
Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA.","DOI":"10.1109\/CVPR52734.2025.02514"},{"key":"ref_24","unstructured":"Bhoi, A. (2019). Monocular depth estimation: A survey. arXiv."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Khan, F., Salahuddin, S., and Javidnia, H. (2020). Deep learning-based monocular depth estimation methods\u2014A state-of-the-art review. Sensors, 20.","DOI":"10.3390\/s20082272"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1612","DOI":"10.1007\/s11431-020-1582-8","article-title":"Monocular depth estimation based on deep learning: An overview","volume":"63","author":"Zhao","year":"2020","journal-title":"Sci. China Technol. Sci."},{"key":"ref_27","unstructured":"Ruan, X., Yan, W., Huang, J., Guo, P., and Guo, W. (2020, January 6\u20138). Monocular depth estimation based on deep learning: A survey. Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"305","DOI":"10.3390\/vehicles6010013","article-title":"Deep learning-based stereopsis and monocular depth estimation techniques: A review","volume":"6","author":"Lahiri","year":"2024","journal-title":"Vehicles"},{"key":"ref_29","unstructured":"Tosi, F., Ramirez, P.Z., and Poggi, M. (October, January 29). Diffusion models for monocular depth estimation: Overcoming challenging conditions. Proceedings of the European Conference on Computer Vision, Milan, Italy."},{"key":"ref_30","unstructured":"Vyas, P., Saxena, C., Badapanda, A., and Goswami, A. (2022). Outdoor monocular depth estimation: A research review. arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"16940","DOI":"10.1109\/TITS.2022.3160741","article-title":"Towards real-time monocular depth estimation for robotics: A survey","volume":"23","author":"Dong","year":"2022","journal-title":"IEEE Trans. Intell. Transp. 
Syst."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Masoumian, A., Rashwan, H.A., Cristiano, J., Asif, M.S., and Puig, D. (2022). Monocular depth estimation using deep learning: A review. Sensors, 22.","DOI":"10.3390\/s22145353"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"2396","DOI":"10.1109\/TPAMI.2023.3330944","article-title":"Monocular depth estimation: A thorough review","volume":"46","author":"Arampatzakis","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3677327","article-title":"Deep learning-based depth estimation methods from monocular image and videos: A comprehensive survey","volume":"56","author":"Rajapaksha","year":"2024","journal-title":"ACM Comput. Surv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., and Schindler, K. (2024, January 16\u201322). Repurposing diffusion-based image generators for monocular depth estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.00907"},{"key":"ref_36","unstructured":"Bhat, S.F., Alhashim, I., and Wonka, P. (2021, January 20\u201325). Adabins: Depth estimation using adaptive bins. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Miangoleh, S.M.H., Dille, S., Mai, L., Paris, S., and Aksoy, Y. (2021, January 20\u201325). Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00956"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Jampani, V., Chang, H., Sargent, K., Kar, A., Tucker, R., Krainin, M., and Liu, C. (2021, January 11\u201317). Slide: Single image 3D photography with soft layering and depth-aware inpainting. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.01229"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., and Davison, A. (2011, January 16\u201319). Kinectfusion: Real-time 3D reconstruction and interaction using a moving depth camera. Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Santa Barbara, CA, USA.","DOI":"10.1145\/2047196.2047270"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"1917","DOI":"10.1109\/JSEN.2010.2101060","article-title":"Lock-in time-of-flight (ToF) cameras: A survey","volume":"11","author":"Foix","year":"2011","journal-title":"IEEE Sens. J."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.cviu.2015.05.006","article-title":"Kinect range sensing: Structured-light versus Time-of-Flight Kinect","volume":"139","author":"Sarbolandi","year":"2015","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"He, Y., Liang, B., Zou, Y., He, J., and Yang, J. (2017). Depth errors analysis and correction for time-of-flight (ToF) cameras. Sensors, 17.","DOI":"10.3390\/s17010092"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1023\/A:1014573219977","article-title":"A taxonomy and evaluation of dense two-frame stereo correspondence algorithms","volume":"47","author":"Scharstein","year":"2002","journal-title":"Int. 
J. Comput. Vis."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"328","DOI":"10.1109\/TPAMI.2007.1166","article-title":"Stereo processing by semiglobal matching and mutual information","volume":"30","author":"Hirschmuller","year":"2007","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"112","DOI":"10.37936\/ecti-cit.2019132.194324","article-title":"A review on stereo vision algorithm: Challenges and solutions","volume":"13","author":"Kok","year":"2019","journal-title":"ECTI Trans. Comput. Inf. Technol. (ECTI-CIT)"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Wofk, D., Ranftl, R., M\u00fcller, M., and Koltun, V. (June, January 29). Monocular visual-inertial depth estimation. Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK.","DOI":"10.1109\/ICRA48891.2023.10161013"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Singh, A.D., Ba, Y., Sarker, A., Zhang, H., Kadambi, A., Soatto, S., Srivastava, M., and Wong, A. (2023, January 17\u201324). Depth estimation from camera image and mmwave radar point cloud. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00895"},{"key":"ref_48","unstructured":"Huang, J., Zhou, Q., Rabeti, H., Korovko, A., Ling, H., Ren, X., Shen, T., Gao, J., Slepichev, D., and Lin, C.H. (2025). Vipe: Video pose engine for 3D geometric perception. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Garg, R., Bg, V.K., Carneiro, G., and Reid, I. (2016, January 11\u201314). Unsupervised cnn for single view depth estimation: Geometry to the rescue. Proceedings of the Computer Vision\u2014ECCV 2016: 14th European Conference, Amsterdam, The Netherlands. 
Proceedings, Part VIII 14.","DOI":"10.1007\/978-3-319-46484-8_45"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Ranftl, R., Bochkovskiy, A., and Koltun, V. (2021, January 10\u201317). Vision transformers for dense prediction. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.01196"},{"key":"ref_51","unstructured":"Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., and El-Nouby, A. (2023). Dinov2: Learning robust visual features without supervision. arXiv."},{"key":"ref_52","unstructured":"Eigen, D., Puhrsch, C., and Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems 27, Curran Associates, Inc."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Eigen, D., and Fergus, R. (2015, January 7\u201313). Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.304"},{"key":"ref_54","unstructured":"Birkl, R., Wofk, D., and M\u00fcller, M. (2023). Midas v3. 1\u2013a model zoo for robust monocular relative depth estimation. arXiv."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"87","DOI":"10.1109\/TPAMI.2022.3152247","article-title":"A survey on vision transformer","volume":"45","author":"Han","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Fu, H., Gong, M., Wang, C., Batmanghelich, K., and Tao, D. (2018, January 18\u201323). Deep ordinal regression network for monocular depth estimation. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00214"},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Yin, W., Zhang, C., Chen, H., Cai, Z., Yu, G., Wang, K., and Shen, C. (2023, January 2\u20133). Metric3d: Towards zero-shot metric 3D prediction from a single image. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00830"},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Guizilini, V., Vasiljevic, I., Chen, D., Ambru\u0219, R., and Gaidon, A. (2023, January 2\u20133). Towards zero-shot scale-aware monocular depth estimation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00847"},{"key":"ref_59","unstructured":"Spencer, J., Russell, C., Hadfield, S., and Bowden, R. (2024). Kick back & relax++: Scaling beyond ground-truth depth with slowtv & cribstv. arXiv."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Bhat, S.F., Alhashim, I., and Wonka, P. (2022, January 23\u201327). Localbins: Improving depth estimation by learning local distributions. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-19769-7_28"},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"3964","DOI":"10.1109\/TIP.2024.3416065","article-title":"Binsformer: Revisiting adaptive bins for monocular depth estimation","volume":"33","author":"Li","year":"2024","journal-title":"IEEE Trans. Image Process."},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Yuan, W., Gu, X., Dai, Z., Zhu, S., and Tan, P. (2022). NeW CRFs: Neural window fully-connected CRFs for monocular depth estimation. arXiv.","DOI":"10.1109\/CVPR52688.2022.00389"},{"key":"ref_63","unstructured":"Spencer, J., Tosi, F., Poggi, M., Arora, R.S., Russell, C., Hadfield, S., and Elder, J.H. 
(2024, January 16\u201322). The third monocular depth estimation challenge. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA."},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Marsal, R., Chabot, F., Loesch, A., Grolleau, W., and Sahbi, H. (2024, January 3\u20138). MonoProb: Self-supervised monocular depth estimation with interpretable uncertainty. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV57701.2024.00360"},{"key":"ref_65","doi-asserted-by":"crossref","first-page":"107189","DOI":"10.1016\/j.engappai.2023.107189","article-title":"Large-scale monocular depth estimation in the wild","volume":"127","author":"Montazer","year":"2024","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_66","doi-asserted-by":"crossref","first-page":"3664","DOI":"10.1109\/TCSVT.2024.3509619","article-title":"MonoDiffusion: Self-supervised monocular depth estimation using diffusion model","volume":"35","author":"Shao","year":"2024","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Wang, Y., Liang, Y., Xu, H., Jiao, S., and Yu, H. (2024, January 25\u201327). Sqldepth: Generalizable self-supervised fine-structured monocular depth estimation. Proceedings of the AAAI Conference on Artificial Intelligence, Stanford, CA, USA.","DOI":"10.1609\/aaai.v38i6.28383"},{"key":"ref_68","doi-asserted-by":"crossref","unstructured":"Piccinelli, L., Yang, Y.H., Sakaridis, C., Segu, M., Li, S., Van Gool, L., and Yu, F. (2024, January 16\u201322). UniDepth: Universal monocular metric depth estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.00963"},{"key":"ref_69","unstructured":"Sun, B., Jin, M., Yin, B., and Hou, Q. (2025). Depth Anything at Any Condition. 
arXiv."},{"key":"ref_70","doi-asserted-by":"crossref","unstructured":"Lin, H., Peng, S., Chen, J., Peng, S., Sun, J., Liu, M., Bao, H., Feng, J., Zhou, X., and Kang, B. (2025, January 11\u201315). Prompting depth anything for 4k resolution accurate metric depth estimation. Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA.","DOI":"10.1109\/CVPR52734.2025.01591"},{"key":"ref_71","doi-asserted-by":"crossref","unstructured":"Wang, R., Xu, S., Dai, C., Xiang, J., Deng, Y., Tong, X., and Yang, J. (2025, January 11\u201315). Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA.","DOI":"10.1109\/CVPR52734.2025.00496"},{"key":"ref_72","unstructured":"Zhang, Z., Yang, L., Yang, T., Yu, C., Guo, X., Lao, Y., and Zhao, H. (2025, January 19\u201323). StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Honolulu, HI, USA."},{"key":"ref_73","unstructured":"Wang, Z., Chen, S., Yang, L., Wang, J., Zhang, Z., Zhao, H., and Zhao, Z. (2025). Depth Anything with Any Prior. arXiv."},{"key":"ref_74","doi-asserted-by":"crossref","unstructured":"Wang, Y., Li, J., Hong, C., Li, R., Sun, L., Song, X., Wang, Z., Cao, Z., and Lin, G. (2025, January 11\u201315). TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion. Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA.","DOI":"10.1109\/CVPR52734.2025.00984"},{"key":"ref_75","doi-asserted-by":"crossref","unstructured":"Li, Z., Bhat, S.F., and Wonka, P. (2024, January 16\u201322). Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.00955"},{"key":"ref_76","unstructured":"Li, Z., Bhat, S.F., and Wonka, P. (October, January 29). PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation. Proceedings of the European Conference on Computer Vision, Milan, Italy."},{"key":"ref_77","unstructured":"Duan, Y., Guo, X., and Zhu, Z. (October, January 29). Diffusiondepth: Diffusion denoising approach for monocular depth estimation. Proceedings of the European Conference on Computer Vision, Milan, Italy."},{"key":"ref_78","doi-asserted-by":"crossref","unstructured":"Zavadski, D., Kal\u0161an, D., and Rother, C. (2024, January 8\u201312). Primedepth: Efficient monocular depth estimation with a stable diffusion preimage. Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam.","DOI":"10.1007\/978-981-96-0917-8_2"},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Patni, S., Agarwal, A., and Arora, C. (2024, January 16\u201322). Ecodepth: Effective conditioning of diffusion models for monocular depth estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.02672"},{"key":"ref_80","unstructured":"Fu, X., Yin, W., Hu, M., Wang, K., Ma, Y., Tan, P., and Long, X. (October, January 29). Geowizard: Unleashing the diffusion priors for 3D geometry estimation from a single image. Proceedings of the European Conference on Computer Vision, Milan, Italy."},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Pham, D.H., Do, T., Nguyen, P., Hua, B.S., Nguyen, K., and Nguyen, R. (2025, January 11\u201315). Sharpdepth: Sharpening metric depth predictions using diffusion distillation. 
Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA.","DOI":"10.1109\/CVPR52734.2025.01590"},{"key":"ref_82","doi-asserted-by":"crossref","first-page":"1623","DOI":"10.1109\/TPAMI.2020.3019967","article-title":"Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer","volume":"44","author":"Ranftl","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"10579","DOI":"10.1109\/TPAMI.2024.3444912","article-title":"Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation","volume":"46","author":"Hu","year":"2024","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_84","doi-asserted-by":"crossref","unstructured":"Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., and Novotny, D. (2025, January 11\u201315). Vggt: Visual geometry grounded transformer. Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA.","DOI":"10.1109\/CVPR52734.2025.00499"}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/11\/502\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,20]],"date-time":"2025-11-20T10:46:22Z","timestamp":1763635582000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/11\/502"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,20]]},"references-count":84,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2025,11]]}},"alternative-id":["computers14110502"],"URL":"https:\/\/doi.org\/10.3390\/computers14110502","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,20]]}}}