{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T01:45:03Z","timestamp":1760233503103,"version":"build-2065373602"},"reference-count":90,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2021,1,20]],"date-time":"2021-01-20T00:00:00Z","timestamp":1611100800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Semantic segmentation is one of the most widely studied problems in computer vision communities, which makes a great contribution to a variety of applications. A lot of learning-based approaches, such as Convolutional Neural Network (CNN), have made a vast contribution to this problem. While rich context information of the input images can be learned from multi-scale receptive fields by convolutions with deep layers, traditional CNNs have great difficulty in learning the geometrical relationship and distribution of objects in the RGB image due to the lack of depth information, which may lead to an inferior segmentation quality. To solve this problem, we propose a method that improves segmentation quality with depth estimation on RGB images. Specifically, we estimate depth information on RGB images via a depth estimation network, and then feed the depth map into the CNN which is able to guide the semantic segmentation. Furthermore, in order to parse the depth map and RGB images simultaneously, we construct a multi-branch encoder\u2013decoder network and fuse the RGB and depth features step by step. 
Extensive experimental evaluation on four baseline networks demonstrates that our proposed method can enhance the segmentation quality considerably and obtain better performance compared to other segmentation networks.<\/jats:p>","DOI":"10.3390\/s21030690","type":"journal-article","created":{"date-parts":[[2021,1,21]],"date-time":"2021-01-21T00:53:41Z","timestamp":1611190421000},"page":"690","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Semantic Segmentation Leveraging Simultaneous Depth Estimation"],"prefix":"10.3390","volume":"21","author":[{"given":"Wenbo","family":"Sun","sequence":"first","affiliation":[{"name":"School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3325-1183","authenticated-orcid":false,"given":"Zhi","family":"Gao","sequence":"additional","affiliation":[{"name":"School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7833-1876","authenticated-orcid":false,"given":"Jinqiang","family":"Cui","sequence":"additional","affiliation":[{"name":"Peng Cheng Laboratory, Shenzhen 518055, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8230-3803","authenticated-orcid":false,"given":"Bharath","family":"Ramesh","sequence":"additional","affiliation":[{"name":"The N.1 Institute for Health, National University of Singapore, Singapore 117411, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9545-2760","authenticated-orcid":false,"given":"Bin","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ziyao","family":"Li","sequence":"additional","affiliation":[{"name":"School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,1,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Zhang, H., Geiger, A., and Urtasun, R. (2013, January 1\u20138). Understanding high-level semantics by modeling traffic patterns. Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia.","DOI":"10.1109\/ICCV.2013.379"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Ros, G., Sellart, L., Materzynska, J., Vazquez, D., and Lopez, A.M. (2016, January 27\u201330). The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.352"},{"key":"ref_3","unstructured":"Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L.D., Monfort, M., Muller, U., and Zhang, J. (2016). End to end learning for self-driving cars. arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Valada, A., Oliveira, G.L., Brox, T., and Burgard, W. (2016, January 3\u20136). Deep multispectral semantic scene understanding of forested environments using multimodal fusion. Proceedings of the International Symposium on Experimental Robotics, Tokyo, Japan.","DOI":"10.1007\/978-3-319-50115-4_41"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Vineet, V., Miksik, O., Lidegaard, M., Nie\u00dfner, M., Golodetz, S., Prisacariu, V.A., K\u00e4hler, O., Murray, D.W., Izadi, S., and P\u00e9rez, P. (2015, January 26\u201330). Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction. 
Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA.","DOI":"10.1109\/ICRA.2015.7138983"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"\u00c7i\u00e7ek, \u00d6., Abdulkadir, A., Lienkamp, S.S., Brox, T., and Ronneberger, O. (2016, January 17\u201321). 3D U-Net: Learning dense volumetric segmentation from sparse annotation. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Athens, Greece.","DOI":"10.1007\/978-3-319-46723-8_49"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Miksik, O., Vineet, V., Lidegaard, M., Prasaath, R., Nie\u00dfner, M., Golodetz, S., Hicks, S.L., P\u00e9rez, P., Izadi, S., and Torr, P.H. (2015, January 18\u201323). The semantic paintbrush: Interactive 3d mapping and recognition in large outdoor spaces. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, Seoul, Korea.","DOI":"10.1145\/2702123.2702222"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Yang, M., Yu, K., Zhang, C., Li, Z., and Yang, K. (2018, January 18\u201322). Denseaspp for semantic segmentation in street scenes. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00388"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Kirillov, A., Wu, Y., He, K., and Girshick, R. (2020, January 14\u201319). Pointrend: Image segmentation as rendering. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00982"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Marin, D., He, Z., Vajda, P., Chatterjee, P., Tsai, S., Yang, F., and Boykov, Y. (2019, January 27\u201328). Efficient segmentation: Learning downsampling near semantic boundaries. 
Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00222"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Pan, F., Shin, I., Rameau, F., Lee, S., and Kweon, I.S. (2020, January 14\u201319). Unsupervised Intra-domain Adaptation for Semantic Segmentation through Self-Supervision. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00382"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"2481","DOI":"10.1109\/TPAMI.2016.2644615","article-title":"Segnet: A deep convolutional encoder\u2013decoder architecture for image segmentation","volume":"39","author":"Badrinarayanan","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Chaurasia, A., and Culurciello, E. (2017, January 10\u201313). Linknet: Exploiting encoder representations for efficient semantic segmentation. Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA.","DOI":"10.1109\/VCIP.2017.8305148"},{"key":"ref_14","unstructured":"Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., Liu, D., Mu, Y., Wang, X., Liu, W., and Wang, J. (2019). High-resolution representations for labeling pixels and regions. arXiv."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Hazirbas, C., Ma, L., Domokos, C., and Cremers, D. (2016, January 20\u201324). Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan.","DOI":"10.1007\/978-3-319-54181-5_14"},{"key":"ref_16","unstructured":"Jiang, J., Zheng, L., Luo, F., and Zhang, Z. (2018). Rednet: Residual encoder\u2013decoder network for indoor rgb-d semantic segmentation. arXiv."},{"key":"ref_17","unstructured":"Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). 
Indoor semantic segmentation using depth information. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Hu, X., Yang, K., Fei, L., and Wang, K. (2019, January 22\u201325). Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation. Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan.","DOI":"10.1109\/ICIP.2019.8803025"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"He, Y., Chiu, W.C., Keuper, M., and Fritz, M. (2017, January 21\u201326). Std2p: Rgbd semantic segmentation using spatio-temporal data-driven pooling. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.757"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Deng, Z., Todorovic, S., and Jan Latecki, L. (2015, January 7\u201313). Semantic segmentation of rgbd images with mutex constraints. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.202"},{"key":"ref_21","unstructured":"Yin, W., Wang, X., Shen, C., Liu, Y., Tian, Z., Xu, S., Sun, C., and Renyin, D. (2020). DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data. arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"302","DOI":"10.1007\/s11263-018-1140-0","article-title":"Semantic understanding of scenes through the ade20k dataset","volume":"127","author":"Zhou","year":"2019","journal-title":"Int. J. Comput. Vis."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"62","DOI":"10.1109\/TSMC.1979.4310076","article-title":"A threshold selection method from gray-level histograms","volume":"9","author":"Otsu","year":"1979","journal-title":"IEEE Trans. Syst. 
Man Cybern."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"205","DOI":"10.1016\/0004-3702(70)90008-1","article-title":"Scene analysis using regions","volume":"1","author":"Brice","year":"1970","journal-title":"Artif. Intell."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"225","DOI":"10.1109\/34.49050","article-title":"Integrating region growing and edge detection","volume":"12","author":"Pavlidis","year":"1990","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Cavallaro, A., Steiger, O., and Ebrahimi, T. (2003, January 6\u20139). Semantic segmentation and description for video transcoding. Proceedings of the 2003 International Conference on Multimedia and Expo, Baltimore, MD, USA.","DOI":"10.1109\/ICME.2003.1221382"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"343","DOI":"10.1055\/s-0038-1633889","article-title":"A semantic approach to segmentation of overlapping objects","volume":"43","author":"Wittenberg","year":"2004","journal-title":"Methods Inf. Med."},{"key":"ref_28","unstructured":"Doulamis, A.D., Doulamis, N.D., Ntalianis, K.S., and Kollias, S.D. (November, January 31). Unsupervised semantic object segmentation of stereoscopic video sequences. Proceedings of the 1999 International Conference on Information Intelligence and Systems, Bethesda, MD, USA."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"190","DOI":"10.1007\/s10489-010-0212-9","article-title":"A neural network based retrainable framework for robust object recognition with application to mobile robotics","volume":"35","author":"An","year":"2011","journal-title":"Appl. Intell."},{"key":"ref_30","unstructured":"Doulamis, A.D., Doulamis, N.D., and Kollias, S.D. (1997, January 12\u201315). Retrainable neural networks for image analysis and classification. Proceedings of the 1997 IEEE International Conference on Systems, Man, and Cybernetics. 
Computational Cybernetics and Simulation, Orlando, FL, USA."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1109\/72.822517","article-title":"On-line retrainable neural networks: Improving the performance of neural networks in image analysis problems","volume":"11","author":"Doulamis","year":"2000","journal-title":"IEEE Trans. Neural Netw."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"603","DOI":"10.1109\/34.1000236","article-title":"Mean shift: A robust approach toward feature space analysis","volume":"24","author":"Comaniciu","year":"2002","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1023\/B:VISI.0000022288.19776.77","article-title":"Efficient graph-based image segmentation","volume":"59","author":"Felzenszwalb","year":"2004","journal-title":"Int. J. Comput. Vis."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_35","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, January 7\u201312). Going deeper with convolutions. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"303","DOI":"10.1007\/s11263-009-0275-4","article-title":"The pascal visual object classes (voc) challenge","volume":"88","author":"Everingham","year":"2010","journal-title":"Int. J. Comput. 
Vis."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. (2012, January 7\u201313). Indoor segmentation and support inference from rgbd images. Proceedings of the European Conference on Computer Vision, Florence, Italy.","DOI":"10.1007\/978-3-642-33715-4_54"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Liu, C., Yuen, J., and Torralba, A. (2009, January 20\u201325). Nonparametric scene parsing: Label transfer via dense scene alignment. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206536"},{"key":"ref_40","unstructured":"Liu, W., Rabinovich, A., and Berg, A.C. (2015). Parsenet: Looking wider to see better. arXiv."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Li, Y., Qi, H., Dai, J., Ji, X., and Wei, Y. (2017, January 21\u201326). Fully convolutional instance-aware semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.472"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"1876","DOI":"10.1109\/TMI.2017.2695227","article-title":"Automatic skin lesion segmentation using deep fully convolutional networks with jaccard distance","volume":"36","author":"Yuan","year":"2017","journal-title":"IEEE Trans. Med Imaging"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Liu, N., Li, H., Zhang, M., Liu, J., Sun, Z., and Tan, T. (2016, January 13\u201316). Accurate iris segmentation in non-cooperative environments using fully convolutional networks. Proceedings of the 2016 International Conference on Biometrics (ICB), Halmstad, Sweden.","DOI":"10.1109\/ICB.2016.7550055"},{"key":"ref_44","unstructured":"Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A.L. (2014). Semantic image segmentation with deep convolutional nets and fully connected crfs. 
arXiv."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Lin, G., Shen, C., Van Den Hengel, A., and Reid, I. (2016, January 27\u201330). Efficient piecewise training of deep structured models for semantic segmentation. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.348"},{"key":"ref_46","unstructured":"Schwing, A.G., and Urtasun, R. (2015). Fully connected deep structured networks. arXiv."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., and Torr, P.H. (2015, January 7\u201313). Conditional random fields as recurrent neural networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.179"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Liu, Z., Li, X., Luo, P., Loy, C.C., and Tang, X. (2015, January 7\u201313). Semantic image segmentation via deep parsing network. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.162"},{"key":"ref_49","unstructured":"Fu, J., Liu, J., Wang, Y., Zhou, J., Wang, C., and Lu, H. (2019). Stacked deconvolutional network for semantic segmentation. IEEE Trans. Image Process."},{"key":"ref_50","unstructured":"Xia, X., and Kulis, B. (2017). W-net: A deep model for fully unsupervised image segmentation. arXiv."},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","article-title":"Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs","volume":"40","author":"Chen","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_52","unstructured":"Chen, L.C., Papandreou, G., Schroff, F., and Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. 
arXiv."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018, January 8\u201314). Encoder-decoder with atrous separable convolution for semantic image segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_49"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Zhou, L., Zhang, C., and Wu, M. (2018, January 18\u201322). D-LinkNet: LinkNet With Pretrained Encoder and Dilated Convolution for High Resolution Satellite Imagery Road Extraction. Proceedings of the CVPR Workshops, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPRW.2018.00034"},{"key":"ref_55","unstructured":"Wu, H., Zhang, J., Huang, K., Liang, K., and Yu, Y. (2019). Fastfcn: Rethinking dilated convolution in the backbone for semantic segmentation. arXiv."},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Deb, D., and Ventura, J. (2018, January 18\u201322). An aggregated multicolumn dilated convolution network for perspective-free counting. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPRW.2018.00057"},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Huang, Q., Xia, C., Wu, C., Li, S., Wang, Y., Song, Y., and Kuo, C.C.J. (2017). Semantic segmentation with reverse attention. arXiv.","DOI":"10.5244\/C.31.18"},{"key":"ref_58","unstructured":"Li, H., Xiong, P., An, J., and Wang, L. (2018). Pyramid attention network for semantic segmentation. arXiv."},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., and Lu, H. (2019, January 15\u201320). Dual attention network for scene segmentation. Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00326"},{"key":"ref_60","unstructured":"Yuan, Y., and Wang, J. (2018). 
Ocnet: Object context network for scene parsing. arXiv."},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., and Liu, W. (2019, January 27\u201328). Ccnet: Criss-cross attention for semantic segmentation. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00069"},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Ren, M., and Zemel, R.S. (2017, January 21\u201326). End-to-end instance segmentation with recurrent attention. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.39"},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Eigen, D., and Fergus, R. (2015, January 7\u201313). Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.304"},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Ma, L., St\u00fcckler, J., Kerl, C., and Cremers, D. (2017, January 24\u201328). Multi-view deep learning for consistent semantic mapping with rgb-d cameras. Proceedings of the 2017 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada.","DOI":"10.1109\/IROS.2017.8202213"},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Wang, J., Wang, Z., Tao, D., See, S., and Wang, G. (2016, January 11\u201314). Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46454-1_40"},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Gupta, S., Girshick, R., Arbel\u00e1ez, P., and Malik, J. (2014, January 6\u201312). 
Learning rich features from RGB-D images for object detection and segmentation. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10584-0_23"},{"key":"ref_67","first-page":"541","article-title":"LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling","volume":"Volume 9906","author":"Li","year":"2016","journal-title":"Computer Vision\u2014ECCV 2016. ECCV 2016. Lecture Notes in Computer Science"},{"key":"ref_68","unstructured":"Park, S.J., Hong, K.S., and Lee, S. (2017, January 22\u201329). Rdfnet: Rgb-d multi-level residual feature fusion for indoor semantic segmentation. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy."},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Song, S., and Xiao, J. (2016, January 27\u201330). Deep sliding shapes for amodal 3d object detection in rgb-d images. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.94"},{"key":"ref_70","doi-asserted-by":"crossref","unstructured":"Wang, W., and Neumann, U. (2018, January 8\u201314). Depth-aware cnn for rgb-d segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01252-6_9"},{"key":"ref_71","first-page":"2366","article-title":"Depth map prediction from a single image using a multi-scale deep network","volume":"27","author":"Eigen","year":"2014","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Fu, H., Gong, M., Wang, C., Batmanghelich, K., and Tao, D. (2018, January 18\u201322). Deep ordinal regression network for monocular depth estimation. 
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00214"},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Yin, W., Liu, Y., Shen, C., and Yan, Y. (2019, January 27\u201328). Enforcing geometric constraints of virtual normal for depth prediction. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00578"},{"key":"ref_74","doi-asserted-by":"crossref","unstructured":"Godard, C., Mac Aodha, O., Firman, M., and Brostow, G.J. (2019, January 27\u201328). Digging into self-supervised monocular depth estimation. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00393"},{"key":"ref_75","doi-asserted-by":"crossref","unstructured":"Zhou, T., Brown, M., Snavely, N., and Lowe, D.G. (2017, January 21\u201326). Unsupervised learning of depth and ego-motion from video. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.700"},{"key":"ref_76","unstructured":"Casser, V., Pirk, S., Mahjourian, R., and Angelova, A. (February, January 27). Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA."},{"key":"ref_77","doi-asserted-by":"crossref","unstructured":"Mahjourian, R., Wicke, M., and Angelova, A. (2018, January 18\u201322). Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00594"},{"key":"ref_78","doi-asserted-by":"crossref","unstructured":"Wang, C., Buenaposada, J.M., Zhu, R., and Lucey, S. (2018, January 18\u201322). Learning Depth from Monocular Videos using Direct Methods. 
Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00216"},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Yang, N., Wang, R., Stuckler, J., and Cremers, D. (2018, January 8\u201314). Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01237-3_50"},{"key":"ref_80","unstructured":"Zhou, L., Ye, J., Abello, M., Wang, S., and Kaess, M. (2018). Unsupervised learning of monocular depth estimation with bundle adjustment, super-resolution and clip loss. arXiv."},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017, January 21\u201326). Pyramid scene parsing network. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.660"},{"key":"ref_82","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_83","doi-asserted-by":"crossref","unstructured":"Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.C. (2018, January 18\u201322). Mobilenetv2: Inverted residuals and linear bottlenecks. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00474"},{"key":"ref_84","doi-asserted-by":"crossref","unstructured":"Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., and Gaidon, A. (2020, January 14\u201319). 3D Packing for Self-Supervised Monocular Depth Estimation. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00256"},{"key":"ref_85","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2015, December 7\u201313). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.123"},{"key":"ref_86","unstructured":"Yu, F., and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv."},{"key":"ref_87","doi-asserted-by":"crossref","unstructured":"Lin, G., Milan, A., Shen, C., and Reid, I. (2017, January 21\u201326). Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.549"},{"key":"ref_88","doi-asserted-by":"crossref","unstructured":"Xiao, T., Liu, Y., Zhou, B., Jiang, Y., and Sun, J. (2018, January 8\u201314). Unified perceptual parsing for scene understanding. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01228-1_26"},{"key":"ref_89","doi-asserted-by":"crossref","unstructured":"Liang, X., Zhou, H., and Xing, E. (2018, January 18\u201322). Dynamic-structured semantic propagation network. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00085"},{"key":"ref_90","doi-asserted-by":"crossref","unstructured":"Zhao, H., Zhang, Y., Liu, S., Shi, J., Change Loy, C., Lin, D., and Jia, J. (2018, January 8\u201314). Psanet: Point-wise spatial attention network for scene parsing. 
Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01240-3_17"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/3\/690\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:13:02Z","timestamp":1760159582000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/3\/690"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,1,20]]},"references-count":90,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2021,2]]}},"alternative-id":["s21030690"],"URL":"https:\/\/doi.org\/10.3390\/s21030690","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2021,1,20]]}}}