{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T19:03:58Z","timestamp":1772910238665,"version":"3.50.1"},"reference-count":63,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2025,9,1]],"date-time":"2025-09-01T00:00:00Z","timestamp":1756684800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Catalan government","award":["FI 2020"],"award-info":[{"award-number":["FI 2020"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["JSAN"],"abstract":"<jats:p>Monocular Depth Estimation (MDE) remains a challenging problem due to texture ambiguity, occlusion, and scale variation in real-world scenes. While recent deep learning methods have made significant progress, maintaining structural consistency and robustness across diverse environments remains difficult. In this paper, we propose DAR-MDE, a novel framework that combines an autoencoder backbone with a Multi-Scale Feature Aggregation (MSFA) module and a Refining Attention Network (RAN). The MSFA module enables the model to capture geometric details across multiple resolutions, while the RAN enhances depth predictions by attending to structurally important regions guided by depth-feature similarity. We also introduce a multi-scale loss based on curvilinear saliency to improve edge-aware supervision and depth continuity. The proposed model achieves robust and accurate depth estimation across varying object scales, cluttered scenes, and weak-texture regions. We evaluated DAR-MDE on the NYU Depth v2, SUN RGB-D, and Make3D datasets, demonstrating competitive accuracy and real-time inference speeds (19 ms per image) without relying on auxiliary sensors. Our method achieves a \u03b4 &lt; 1.25 accuracy of 87.25% and a relative error of 0.113 on NYU Depth v2, outperforming several recent state-of-the-art models. Our approach highlights the potential of lightweight RGB-only depth estimation models for real-world deployment in robotics and scene understanding.<\/jats:p>","DOI":"10.3390\/jsan14050090","type":"journal-article","created":{"date-parts":[[2025,9,2]],"date-time":"2025-09-02T16:05:22Z","timestamp":1756829122000},"page":"90","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["DAR-MDE: Depth-Attention Refinement for Multi-Scale Monocular Depth Estimation"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0902-7245","authenticated-orcid":false,"given":"Saddam","family":"Abdulwahab","sequence":"first","affiliation":[{"name":"Department of Computer Engineering and Mathematics, Universitat Rovira i Virgili, 43007 Tarragona, Spain"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5421-1637","authenticated-orcid":false,"given":"Hatem A.","family":"Rashwan","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering and Mathematics, Universitat Rovira i Virgili, 43007 Tarragona, Spain"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0146-5515","authenticated-orcid":false,"given":"Moumen T.","family":"El-Melegy","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, Assiut University, Assiut 71516, Egypt"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0562-4205","authenticated-orcid":false,"given":"Domenec","family":"Puig","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering and Mathematics, Universitat Rovira i Virgili, 43007 Tarragona, Spain"}]}],"member":"1968","published-online":{"date-parts":[[2025,9,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Chiu, C.H., Astuti, L., Lin, Y.C., and Hung, M.K. (2024, January 14\u201316). Dual-Attention Mechanism for Monocular Depth Estimation. Proceedings of the 2024 16th International Conference on Computer and Automation Engineering (ICCAE), Melbourne, Australia.","DOI":"10.1109\/ICCAE59995.2024.10569356"},{"key":"ref_2","unstructured":"Yang, Y., Wang, X., Li, D., Tian, L., Sirasao, A., and Yang, X. (2024). Towards Scale-Aware Full Surround Monodepth with Transformers. arXiv."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"109982","DOI":"10.1016\/j.patcog.2023.109982","article-title":"CATNet: Convolutional attention and transformer for monocular depth estimation","volume":"145","author":"Tang","year":"2024","journal-title":"Pattern Recognit."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"2244","DOI":"10.1109\/TIV.2022.3210274","article-title":"Self-supervised monocular depth estimation with geometric prior and pixel-level sensitivity","volume":"8","author":"Liu","year":"2022","journal-title":"IEEE Trans. Intell. Veh."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Hu, D., Peng, L., Chu, T., Zhang, X., Mao, Y., Bondell, H., and Gong, M. (2022, January 23\u201327). Uncertainty Quantification in Depth Estimation via Constrained Ordinal Regression. Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel. Available online: https:\/\/www.ecva.net\/papers\/eccv_2022\/papers_ECCV\/papers\/136620229.pdf.","DOI":"10.1007\/978-3-031-20086-1_14"},{"key":"ref_6","unstructured":"Wang, Y., and Piao, X. (2024, January 16\u201322). Scale-Aware Deep Networks for Monocular Depth Estimation. Proceedings of the CVPR, Seattle, WA, USA."},{"key":"ref_7","first-page":"1595","article-title":"A Low Light Image Enhancement Method Based on Dehazing Physical Model","volume":"143","author":"Wang","year":"2025","journal-title":"Comput. Model. Eng. Sci. (CMES)"},{"key":"ref_8","unstructured":"Zhang, Z. (June, January 29). Lightweight Deep Networks for Real-Time Monocular Depth Estimation. Proceedings of the ICRA, London, UK."},{"key":"ref_9","unstructured":"Yoon, S., and Lee, J. (2024, January 16\u201322). Context-Aware Depth Estimation via Multi-Modal Fusion. Proceedings of the CVPR, Seattle, WA, USA."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"104802","DOI":"10.1016\/j.dsp.2024.104802","article-title":"CDAN: Convolutional dense attention-guided network for low-light image enhancement","volume":"156","author":"Shakibania","year":"2025","journal-title":"Digit. Signal Process."},{"key":"ref_11","unstructured":"Ding, L. (October, January 29). Attention-Guided Depth Completion from Sparse Data. Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1583","DOI":"10.1007\/s13042-020-01251-y","article-title":"Attention-based context aggregation network for monocular depth estimation","volume":"12","author":"Chen","year":"2021","journal-title":"Int. J. Mach. Learn. Cybern."},{"key":"ref_13","first-page":"2366","article-title":"Depth map prediction from a single image using a multi-scale deep network","volume":"2","author":"Eigen","year":"2014","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Xu, D., Ricci, E., Ouyang, W., Wang, X., and Sebe, N. (2017, January 21\u201326). Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.25"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Liu, C., Gu, J., Kim, K., Narasimhan, S.G., and Kautz, J. (2019, January 15\u201320). Neural RGB(r)D sensing: Depth and uncertainty from a video camera. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01124"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Masoumian, A., Rashwan, H.A., Cristiano, J., Asif, M.S., and Puig, D. (2022). Monocular depth estimation using deep learning: A review. Sensors, 22.","DOI":"10.3390\/s22145353"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"5389","DOI":"10.1109\/LRA.2022.3155823","article-title":"Object-Aware Monocular Depth Prediction with Instance Convolutions","volume":"7","author":"Simsar","year":"2022","journal-title":"IEEE Robot. Autom. Lett."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"2321","DOI":"10.1109\/TIP.2022.3154931","article-title":"DMRA: Depth-induced multi-scale recurrent attention network for RGB-D saliency detection","volume":"31","author":"Ji","year":"2022","journal-title":"IEEE Trans. Image Process."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Aich, S., Vianney, J.M.U., Islam, M.A., and Liu, M.K.B. (June, January 30). Bidirectional attention network for monocular depth estimation. Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi\u2019an, China.","DOI":"10.1109\/ICRA48506.2021.9560885"},{"key":"ref_20","unstructured":"Lee, J.H., Han, M.K., Ko, D.W., and Suh, I.H. (2019). From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Jung, H., Kim, Y., Min, D., Oh, C., and Sohn, K. (2017, January 17\u201320). Depth prediction from a single image with conditional adversarial networks. Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China.","DOI":"10.1109\/ICIP.2017.8296575"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Wofk, D., Ma, F., Yang, T.J., Karaman, S., and Sze, V. (2019, January 20\u201324). Fastdepth: Fast monocular depth estimation on embedded systems. Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada.","DOI":"10.1109\/ICRA.2019.8794182"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Moukari, M., Picard, S., Simon, L., and Jurie, F. (2018, January 7\u201310). Deep multi-scale architectures for monocular depth estimation. Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece.","DOI":"10.1109\/ICIP.2018.8451408"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"2947","DOI":"10.1109\/TCSVT.2020.2973068","article-title":"Adversarial learning for depth and viewpoint estimation from a single image","volume":"30","author":"Abdulwahab","year":"2020","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"103753","DOI":"10.1016\/j.jvcir.2023.103753","article-title":"Depth estimation of supervised monocular images based on semantic segmentation","volume":"90","author":"Wang","year":"2023","journal-title":"J. Vis. Commun. Image Represent."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Xu, Y., Yang, Y., and Zhang, L. (2023). DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction. arXiv.","DOI":"10.1609\/aaai.v37i3.25411"},{"key":"ref_27","unstructured":"Chen, L. (2023, January 10\u201316). Multi-Scale Adaptive Feature Fusion for Monocular Depth Estimation. Proceedings of the NeurIPS, New Orleans, LA, USA."},{"key":"ref_28","unstructured":"Tan, M. (2023, January 10\u201316). Hierarchical Multi-Scale Learning for Monocular Depth Estimation. Proceedings of the NeurIPS, New Orleans, LA, USA."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"16423","DOI":"10.1007\/s00521-022-07663-x","article-title":"Monocular depth map estimation based on a multi-scale deep architecture and curvilinear saliency feature boosting","volume":"34","author":"Abdulwahab","year":"2022","journal-title":"Neural Comput. Appl."},{"key":"ref_30","first-page":"1228","article-title":"Refinenet: Multi-path refinement networks for dense prediction","volume":"42","author":"Lin","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., and He, X. (2018, January 18\u201323). Attngan: Fine-grained text to image generation with attentional generative adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00143"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"55230","DOI":"10.1109\/ACCESS.2020.2981842","article-title":"Contextual attention refinement network for real-time semantic segmentation","volume":"8","author":"Hao","year":"2020","journal-title":"IEEE Access"},{"key":"ref_33","first-page":"39","article-title":"KRAN: Knowledge Refining Attention Network for Recommendation","volume":"16","author":"Zhang","year":"2021","journal-title":"ACM Trans. Knowl. Discov. Data (TKDD)"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"363","DOI":"10.1007\/BF00336961","article-title":"The structure of images","volume":"50","author":"Koenderink","year":"1984","journal-title":"Biol. Cybern."},{"key":"ref_35","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201323). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_38","unstructured":"Lehtinen, J., Munkberg, J., Hasselgren, J., Laine, S., Karras, T., Aittala, M., and Aila, T. (2018). Noise2noise: Learning image restoration without clean data. arXiv."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"383","DOI":"10.1007\/s12021-018-9377-x","article-title":"SegAN: Adversarial network with multi-scale L1 loss for medical image segmentation","volume":"16","author":"Xue","year":"2018","journal-title":"Neuroinformatics"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Xue, Y., Xu, T., and Huang, X. (2018, January 4\u20137). Adversarial learning with multi-scale loss for skin lesion segmentation. Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA.","DOI":"10.1109\/ISBI.2018.8363707"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"515","DOI":"10.1049\/iet-cvi.2018.5645","article-title":"Fully convolutional multi-scale dense networks for monocular depth estimation","volume":"13","author":"Liu","year":"2019","journal-title":"IET Comput. Vis."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"7709","DOI":"10.1109\/ACCESS.2020.2964733","article-title":"Efficient and high-quality monocular depth estimation via gated multi-scale network","volume":"8","author":"Lin","year":"2020","journal-title":"IEEE Access"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"4429","DOI":"10.1109\/TIP.2019.2911484","article-title":"Using curvilinear features in focus for registering a single image to a 3D object","volume":"28","author":"Rashwan","year":"2019","journal-title":"IEEE Trans. Image Process."},{"key":"ref_44","unstructured":"Alhashim, I., and Wonka, P. (2018). High quality monocular depth estimation via transfer learning. arXiv."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. (2012, January 7\u201313). Indoor segmentation and support inference from rgbd images. Proceedings of the European Conference on Computer Vision, Florence, Italy.","DOI":"10.1007\/978-3-642-33715-4_54"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Song, S., Lichtenberg, S.P., and Xiao, J. (2015, January 7\u201312). Sun RGB-D: A RGB-D scene understanding benchmark suite. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298655"},{"key":"ref_47","unstructured":"Saxena, A., Sun, M., and Ng, A.Y. (2008, January 13\u201317). Make3D: Depth Perception from a Single Still Image. Proceedings of the AAAI, Chicago, IL, USA."},{"key":"ref_48","unstructured":"Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). PyTorch: An Open Source Machine Learning Framework, Facebook AI Research. Available online: https:\/\/pytorch.org."},{"key":"ref_49","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Liu, F., Shen, C., and Lin, G. (2015, January 7\u201312). Deep convolutional neural fields for depth estimation from a single image. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299152"},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"109067","DOI":"10.1016\/j.knosys.2022.109067","article-title":"Single image depth estimation based on sculpture strategy","volume":"250","author":"Chen","year":"2022","journal-title":"Knowl.-Based Syst."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Ramamonjisoa, M., Firman, M., Watson, J., Lepetit, V., and Turmukhambetov, D. (2021). Single Image Depth Estimation Using Wavelet Decomposition. arXiv.","DOI":"10.1109\/CVPR46437.2021.01094"},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"22640","DOI":"10.1109\/ACCESS.2021.3055497","article-title":"Encoder-Decoder Structure with the Feature Pyramid for Depth Estimation from a Single Image","volume":"9","author":"Tang","year":"2021","journal-title":"IEEE Access"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Guo, X., Zhao, H., Shao, S., Li, X., and Zhang, B. (2024). F2-Depth: Self-supervised Indoor Monocular Depth Estimation via Optical Flow Consistency and Feature Map Synthesis. arXiv.","DOI":"10.3390\/fi16100375"},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"2144","DOI":"10.1109\/TPAMI.2014.2316835","article-title":"Depth transfer: Depth extraction from video using non-parametric sampling","volume":"36","author":"Karsch","year":"2014","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Godard, C., Mac Aodha, O., and Brostow, G.J. (2017, January 21\u201326). Unsupervised monocular depth estimation with left-right consistency. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.699"},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Kuznietsov, Y., Stuckler, J., and Leibe, B. (2017, January 21\u201326). Semi-supervised deep learning for monocular depth map prediction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.238"},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Zhao, W., Rao, Y., Liu, Z., Liu, B., Zhou, J., and Lu, J. (2023, January 1\u20136). Unleashing text-to-image diffusion models for visual perception. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00527"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Chen, X., Chen, X., and Zha, Z.J. (2019). Structure-aware residual pyramid network for monocular depth estimation. arXiv.","DOI":"10.24963\/ijcai.2019\/98"},{"key":"ref_60","unstructured":"Yin, W., Liu, Y., Shen, C., and Yan, Y. (November, January 27). Enforcing geometric constraints of virtual normal for depth prediction. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_61","unstructured":"Bhat, S.F., Alhashim, I., and Wonka, P. (2021, January 20\u201325). Adabins: Depth estimation using adaptive bins. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA."},{"key":"ref_62","doi-asserted-by":"crossref","first-page":"3964","DOI":"10.1109\/TIP.2024.3416065","article-title":"Binsformer: Revisiting adaptive bins for monocular depth estimation","volume":"33","author":"Li","year":"2024","journal-title":"IEEE Trans. Image Process."},{"key":"ref_63","doi-asserted-by":"crossref","first-page":"113518","DOI":"10.1016\/j.knosys.2025.113518","article-title":"Out-of-distribution monocular depth estimation with local invariant regression","volume":"319","author":"Hu","year":"2025","journal-title":"Knowl.-Based Syst."}],"container-title":["Journal of Sensor and Actuator Networks"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2224-2708\/14\/5\/90\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:36:46Z","timestamp":1760035006000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2224-2708\/14\/5\/90"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,1]]},"references-count":63,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2025,10]]}},"alternative-id":["jsan14050090"],"URL":"https:\/\/doi.org\/10.3390\/jsan14050090","relation":{},"ISSN":["2224-2708"],"issn-type":[{"value":"2224-2708","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,1]]}}}