{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,19]],"date-time":"2026-01-19T02:40:19Z","timestamp":1768790419858,"version":"3.49.0"},"reference-count":38,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2025,7,1]],"date-time":"2025-07-01T00:00:00Z","timestamp":1751328000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62232004"],"award-info":[{"award-number":["62232004"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62272099"],"award-info":[{"award-number":["62272099"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["BK20231543"],"award-info":[{"award-number":["BK20231543"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["BK20230024"],"award-info":[{"award-number":["BK20230024"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"publisher","award":["62232004"],"award-info":[{"award-number":["62232004"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"publisher","award":["62272099"],"award-info":[{"award-number":["62272099"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"publisher","award":["BK20231543"],"award-info":[{"award-number":["BK20231543"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"publisher","award":["BK20230024"],"award-info":[{"award-number":["BK20230024"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education","award":["62232004"],"award-info":[{"award-number":["62232004"]}]},{"name":"Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education","award":["62272099"],"award-info":[{"award-number":["62272099"]}]},{"name":"Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education","award":["BK20231543"],"award-info":[{"award-number":["BK20231543"]}]},{"name":"Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education","award":["BK20230024"],"award-info":[{"award-number":["BK20230024"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Imaging"],"abstract":"<jats:p>Self-supervised depth estimation from monocular image sequences provides depth information without costly sensors like LiDAR, offering significant value for autonomous driving. Although self-supervised algorithms can reduce the dependence on labeled data, the performance is still affected by scene occlusions, lighting differences, and sparse textures. Existing methods do not consider the enhancement and interaction fusion of features. In this paper, we propose a novel parallel multi-scale semantic-depth interactive fusion network. First, we adopt a multi-stage feature attention network for feature extraction, and a parallel semantic-depth interactive fusion module is introduced to refine edges. Furthermore, we also employ a metric loss based on semantic edges to take full advantage of semantic geometric information. Our network is trained and evaluated on KITTI datasets. The experimental results show that the methods achieve satisfactory performance compared to other existing methods.<\/jats:p>","DOI":"10.3390\/jimaging11070218","type":"journal-article","created":{"date-parts":[[2025,7,1]],"date-time":"2025-07-01T04:04:22Z","timestamp":1751342662000},"page":"218","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Parallel Multi-Scale Semantic-Depth Interactive Fusion Network for Depth Estimation"],"prefix":"10.3390","volume":"11","author":[{"given":"Chenchen","family":"Fu","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, Southeast University, Nanjing 210000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7835-0105","authenticated-orcid":false,"given":"Sujunjie","family":"Sun","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Southeast University, Nanjing 210000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ning","family":"Wei","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Southeast University, Nanjing 210000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Vincent","family":"Chau","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Southeast University, Nanjing 210000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xueyong","family":"Xu","sequence":"additional","affiliation":[{"name":"North Information Control Research Academy Group Co., Ltd., Nanjing 210000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Weiwei","family":"Wu","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Southeast University, Nanjing 210000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,7,1]]},"reference":[{"key":"ref_1","first-page":"2366","article-title":"Depth map prediction from a single image using a multi-scale deep network","volume":"27","author":"Eigen","year":"2014","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_2","unstructured":"Li, B., Shen, C., Dai, Y., Van Den Hengel, A., and He, M. (2015, January 7\u201312). Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"2274","DOI":"10.1109\/TPAMI.2012.120","article-title":"SLIC superpixels compared to state-of-the-art superpixel methods","volume":"34","author":"Achanta","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Zhou, T., Brown, M., Snavely, N., and Lowe, D.G. (2017, January 21\u201326). Unsupervised learning of depth and ego-motion from video. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.700"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Mahjourian, R., Wicke, M., and Angelova, A. (2018, January 18\u201326). Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00594"},{"key":"ref_6","unstructured":"Godard, C., Mac Aodha, O., Firman, M., and Brostow, G.J. (November, January 27). Digging into self-supervised monocular depth estimation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Meng, Y., Lu, Y., Raj, A., Sunarjo, S., Guo, R., Javidi, T., Bansal, G., and Bharadia, D. (2019, January 15\u201320). Signet: Semantic instance aided unsupervised 3d geometry perception. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01004"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Klingner, M., Term\u00f6hlen, J.A., Mikolajczyk, J., and Fingscheidt, T. (2020, January 23\u201328). Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. Proceedings of the Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK. Proceedings, Part XX 16.","DOI":"10.1007\/978-3-030-58565-5_35"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., and Navab, N. (2016, January 25\u201328). Deeper depth prediction with fully convolutional residual networks. Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA.","DOI":"10.1109\/3DV.2016.32"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Mancini, M., Costante, G., Valigi, P., and Ciarfuglia, T.A. (2016, January 9\u201314). Fast robust monocular depth estimation for obstacle detection with fully convolutional networks. Proceedings of the 2016 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea.","DOI":"10.1109\/IROS.2016.7759632"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Yang, Z., Wang, P., Xu, W., Zhao, L., and Nevatia, R. (2018, January 2\u20137). Unsupervised learning of geometry from videos with edge-aware depth-normal consistency. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.12257"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Poggi, M., Aleotti, F., Tosi, F., and Mattoccia, S. (2020, January 13\u201319). On the uncertainty of self-supervised monocular depth estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00329"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., and Gaidon, A. (2020, January 13\u201319). 3d packing for self-supervised monocular depth estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00256"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Yin, Z., and Shi, J. (2018, January 18\u201323). Geonet: Unsupervised learning of dense depth, optical flow and camera pose. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00212"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"2548","DOI":"10.1007\/s11263-021-01484-6","article-title":"Unsupervised scale-consistent depth learning from video","volume":"129","author":"Bian","year":"2021","journal-title":"Int. J. Comput. Vis."},{"key":"ref_16","unstructured":"Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., Liu, D., Mu, Y., Wang, X., Liu, W., and Wang, J. (2019). High-resolution representations for labeling pixels and regions. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"2481","DOI":"10.1109\/TPAMI.2016.2644615","article-title":"Segnet: A deep convolutional encoder-decoder architecture for image segmentation","volume":"39","author":"Badrinarayanan","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Chen, P.Y., Liu, A.H., Liu, Y.C., and Wang, Y.C.F. (2019, January 15\u201320). Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00273"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Geiger, A., Lenz, P., and Urtasun, R. (2012, January 16\u201321). Are we ready for autonomous driving? the kitti vision benchmark suite. Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA.","DOI":"10.1109\/CVPR.2012.6248074"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Zhu, S., Brazil, G., and Liu, X. (2020, January 13\u201319). The edge of depth: Explicit constraints between segmentation and depth. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01313"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Jung, H., Park, E., and Yoo, S. (2021, January 11\u201317). Fine-grained semantics-aware representation enhancement for self-supervised monocular depth estimation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.01241"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Lyu, X., Liu, L., Wang, M., Kong, X., Liu, L., Liu, Y., Chen, X., and Yuan, Y. (2021, January 2\u20139). Hr-depth: High resolution self-supervised monocular depth estimation. Proceedings of the AAAI Conference on Artificial Intelligence, Online.","DOI":"10.1609\/aaai.v35i3.16329"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Hou, Q., Zhou, D., and Feng, J. (2021, January 20\u201325). Coordinate attention for efficient mobile network design. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01350"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Xu, D., Ouyang, W., Wang, X., and Sebe, N. (2018, January 18\u201323). Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00077"},{"key":"ref_25","unstructured":"Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual attention. Adv. Neural Inf. Process. Syst., 27, Available online: https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2014\/file\/3e456b31302cf8210edd4029292a40ad-Paper.pdf."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Dong, X., and Shen, J. (2018, January 8\u201314). Triplet loss in siamese network for object tracking. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01261-8_28"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"824","DOI":"10.1109\/TPAMI.2008.132","article-title":"Make3d: Learning 3d scene structure from a single still image","volume":"31","author":"Saxena","year":"2008","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Shu, C., Yu, K., Duan, Z., and Yang, K. (2020, January 23\u201328). Feature-metric loss for self-supervised learning of depth and egomotion. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58529-7_34"},{"key":"ref_29","unstructured":"Choi, J., Jung, D., Lee, D., and Kim, C. (2020). Safenet: Self-supervised monocular depth estimation with semantic-aware feature extraction. arXiv."},{"key":"ref_30","unstructured":"Casser, V., Pirk, S., Mahjourian, R., and Angelova, A. (February, January 27). Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., Wulff, J., and Black, M.J. (2019, January 15\u201320). Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01252"},{"key":"ref_32","unstructured":"Pnvr, K., Zhou, H., and Jacobs, D. (2020, January 13\u201319). Sharingan: Combining synthetic and real data for unsupervised geometry estimation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Chanduri, S.S., Suri, Z.K., Vozniak, I., and M\u00fcller, C. (2021). Camlessmonodepth: Monocular depth estimation with unknown camera parameters. arXiv.","DOI":"10.5244\/C.35.378"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Yan, J., Zhao, H., Bu, P., and Jin, Y. (2021, January 1\u20133). Channel-wise attention-based network for self-supervised monocular depth estimation. Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK.","DOI":"10.1109\/3DV53792.2021.00056"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1016\/j.neucom.2022.10.073","article-title":"GCNDepth: Self-supervised monocular depth estimation based on graph convolutional network","volume":"517","author":"Masoumian","year":"2023","journal-title":"Neurocomputing"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Liu, M., Salzmann, M., and He, X. (2014, January 23\u201328). Discrete-continuous depth estimation from a single image. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.97"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Godard, C., Mac Aodha, O., and Brostow, G.J. (2017, January 21\u201326). Unsupervised monocular depth estimation with left-right consistency. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.699"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Wang, C., Buenaposada, J.M., Zhu, R., and Lucey, S. (2018, January 18\u201323). Learning depth from monocular videos using direct methods. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00216"}],"container-title":["Journal of Imaging"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2313-433X\/11\/7\/218\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:02:09Z","timestamp":1760032929000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2313-433X\/11\/7\/218"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,1]]},"references-count":38,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2025,7]]}},"alternative-id":["jimaging11070218"],"URL":"https:\/\/doi.org\/10.3390\/jimaging11070218","relation":{},"ISSN":["2313-433X"],"issn-type":[{"value":"2313-433X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,1]]}}}