{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,31]],"date-time":"2025-10-31T08:01:34Z","timestamp":1761897694587,"version":"build-2065373602"},"reference-count":38,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2022,2,21]],"date-time":"2022-02-21T00:00:00Z","timestamp":1645401600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003141","name":"Consejo Nacional de Ciencia y Tecnolog\u00eda","doi-asserted-by":"publisher","award":["APN2017-5241"],"award-info":[{"award-number":["APN2017-5241"]}],"id":[{"id":"10.13039\/501100003141","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003069","name":"Instituto Polit\u00e9cnico Nacional","doi-asserted-by":"publisher","award":["2083"],"award-info":[{"award-number":["2083"]}],"id":[{"id":"10.13039\/501100003069","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Single image depth estimation works fail to separate foreground elements because they can easily be confounded with the background. To alleviate this problem, we propose the use of a semantic segmentation procedure that adds information to a depth estimator, in this case, a 3D Convolutional Neural Network (CNN)\u2014segmentation is coded as one-hot planes representing categories of objects. We explore 2D and 3D models. Particularly, we propose a hybrid 2D\u20133D CNN architecture capable of obtaining semantic segmentation and depth estimation at the same time. 
We tested our procedure on the SYNTHIA-AL dataset and obtained \u03c33=0.95, which is an improvement of 0.14 points (compared with the state of the art of \u03c33=0.81) by using manual segmentation, and \u03c33=0.89 using automatic semantic segmentation, proving that depth estimation is improved when the shape and position of objects in a scene are known.<\/jats:p>","DOI":"10.3390\/s22041669","type":"journal-article","created":{"date-parts":[[2022,2,21]],"date-time":"2022-02-21T20:48:41Z","timestamp":1645476521000},"page":"1669","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":17,"title":["Improving Depth Estimation by Embedding Semantic Segmentation: A Hybrid CNN Model"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4572-5713","authenticated-orcid":false,"given":"Jos\u00e9 E.","family":"Valdez-Rodr\u00edguez","sequence":"first","affiliation":[{"name":"Centro de Investigaci\u00f3n en Computaci\u00f3n, Instituto Polit\u00e9cnico Nacional, Av. Juan de Dios B\u00e1tiz s\/n, Ciudad de M\u00e9xico 07738, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2836-2102","authenticated-orcid":false,"given":"Hiram","family":"Calvo","sequence":"additional","affiliation":[{"name":"Centro de Investigaci\u00f3n en Computaci\u00f3n, Instituto Polit\u00e9cnico Nacional, Av. Juan de Dios B\u00e1tiz s\/n, Ciudad de M\u00e9xico 07738, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9828-3568","authenticated-orcid":false,"given":"Edgardo","family":"Felipe-River\u00f3n","sequence":"additional","affiliation":[{"name":"Centro de Investigaci\u00f3n en Computaci\u00f3n, Instituto Polit\u00e9cnico Nacional, Av. 
Juan de Dios B\u00e1tiz s\/n, Ciudad de M\u00e9xico 07738, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1028-9197","authenticated-orcid":false,"given":"Marco A.","family":"Moreno-Armend\u00e1riz","sequence":"additional","affiliation":[{"name":"Centro de Investigaci\u00f3n en Computaci\u00f3n, Instituto Polit\u00e9cnico Nacional, Av. Juan de Dios B\u00e1tiz s\/n, Ciudad de M\u00e9xico 07738, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,2,21]]},"reference":[{"key":"ref_1","unstructured":"Blake, R., and Sekuler, R. (2006). Perception, McGraw-Hill Companies Incorporated. McGraw-Hill Higher Education."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Howard, I.P. (2012). Perceiving in Depth, Volume 1: Basic Mechanisms, Oxford University Press.","DOI":"10.1093\/acprof:oso\/9780199764143.001.0001"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Valdez-Rodr\u00edguez, J.E., Calvo, H., and Felipe-River\u00f3n, E.M. (2017, January 23\u201328). Road perspective depth reconstruction from single images using reduce-refine-upsample CNNs. Proceedings of the Mexican International Conference on Artificial Intelligence, Ensenada, Mexico.","DOI":"10.1007\/978-3-030-02837-4_3"},{"key":"ref_4","unstructured":"Eigen, D., Puhrsch, C., and Fergus, R. (2014, January 8\u201313). Depth map prediction from a single image using a multi-scale deep network. Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Eigen, D., and Fergus, R. (2015, January 7\u201313). Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. 
Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.304"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"2024","DOI":"10.1109\/TPAMI.2015.2505283","article-title":"Learning depth from single monocular images using deep convolutional neural fields","volume":"38","author":"Liu","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Mousavian, A., Pirsiavash, H., and Ko\u0161eck\u00e1, J. (2016, January 25\u201328). Joint semantic segmentation and depth estimation with deep convolutional networks. Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA.","DOI":"10.1109\/3DV.2016.69"},{"key":"ref_8","unstructured":"Afifi, A.J., and Hellwich, O. (December, January 30). Object depth estimation from a single image using fully convolutional neural network. Proceedings of the International Conference on Digital Image Computing: Techniques and Applications (DICTA), Gold Coast, Australia."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., and Navab, N. (2016, January 25\u201328). Deeper depth prediction with fully convolutional residual networks. Proceedings of the 2016 Fourth International Conference 3D Vision (3DV), Stanford, CA, USA.","DOI":"10.1109\/3DV.2016.32"},{"key":"ref_10","unstructured":"Li, B., Dai, Y., Chen, H., and He, M. (2017). Single image depth estimation by dilated deep residual convolutional neural network and soft-weight-sum inference. arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Xu, D., Ricci, E., Ouyang, W., Wang, X., and Sebe, N. (2017, January 21\u201326). Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.25"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Koch, T., Liebel, L., Fraundorfer, F., and K\u00f6rner, M. (2018). Evaluation of CNN-based single-image depth estimation methods. arXiv.","DOI":"10.1007\/978-3-030-11015-4_25"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Atapour-Abarghouei, A., and Breckon, T.P. (2019, January 16\u201319). To complete or to estimate, that is the question: A multi-task approach to depth completion and monocular depth estimation. Proceedings of the 2019 International Conference on 3D Vision (3DV), Quebec, QC, Canada.","DOI":"10.1109\/3DV.2019.00029"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015, January 5\u20139). U-net: Convolutional networks for biomedical image segmentation. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Lin, X., S\u00e1nchez-Escobedo, D., Casas, J.R., and Pard\u00e0s, M. (2019). Depth estimation and semantic segmentation from a single RGB image using a hybrid convolutional neural network. Sensors, 19.","DOI":"10.3390\/s19081795"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"455","DOI":"10.1007\/s10846-020-01205-0","article-title":"Semi-Supervised Monocular Depth Estimation Based on Semantic Supervision","volume":"100","author":"Yue","year":"2020","journal-title":"J. Intell. Robot. Syst."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Sun, W., Gao, Z., Cui, J., Ramesh, B., Zhang, B., and Li, Z. (2021). Semantic Segmentation Leveraging Simultaneous Depth Estimation. 
Sensors, 21.","DOI":"10.3390\/s21030690"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Wang, H.M., Lin, H.Y., and Chang, C.C. (2021). Object Detection and Depth Estimation Approach Based on Deep Convolutional Neural Networks. Sensors, 21.","DOI":"10.3390\/s21144755"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Genovese, A., Piuri, V., Rundo, F., Scotti, F., and Spampinato, C. (2021, January 10\u201312). Driver attention assistance by pedestrian\/cyclist distance estimation from a single RGB image: A CNN-based semantic segmentation approach. Proceedings of the 2021 22nd IEEE International Conference on Industrial Technology (ICIT), Valencia, Spain.","DOI":"10.1109\/ICIT46573.2021.9453567"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017, January 21\u201326). Pyramid scene parsing network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.660"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"272","DOI":"10.30897\/ijegeo.737993","article-title":"Comparison of Fully Convolutional Networks (FCN) and U-Net for Road Segmentation from High Resolution Imageries","volume":"7","author":"Ozturk","year":"2020","journal-title":"Int. J. Environ. Geoinform."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Tran, L.A., and Le, M.H. (2019, January 20\u201321). Robust U-Net-based road lane markings detection for autonomous driving. 
Proceedings of the 2019 International Conference on System Science and Engineering (ICSSE), Dong Hoi, Vietnam.","DOI":"10.1109\/ICSSE.2019.8823532"},{"key":"ref_24","first-page":"439","article-title":"Single-Stage Refinement CNN for Depth Estimation in Monocular Images","volume":"24","author":"Calvo","year":"2020","journal-title":"Comput. Sist."},{"key":"ref_25","unstructured":"Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. (2016). Understanding Deep Neural Networks with Rectified Linear Units. arXiv."},{"key":"ref_26","unstructured":"LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., and Jackel, L.D. (1989, January 27\u201330). Handwritten digit recognition with a back-propagation network. Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Zeiler, M.D., and Fergus, R. (2014, January 6\u201312). Visualizing and understanding convolutional networks. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10590-1_53"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Xu, N., Price, B., Cohen, S., and Huang, T. (2017). Deep Image Matting. arXiv.","DOI":"10.1109\/CVPR.2017.41"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Zolfaghari Bengar, J., Gonzalez-Garcia, A., Villalonga, G., Raducanu, B., Aghdam, H.H., Mozerov, M., Lopez, A.M., and van de Weijer, J. (2019). Temporal Coherence for Active Learning in Videos. arXiv.","DOI":"10.1109\/ICCVW.2019.00120"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Geiger, A., Lenz, P., and Urtasun, R. (2012, January 16\u201321). Are we ready for autonomous driving? 
The kitti vision benchmark suite. Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA.","DOI":"10.1109\/CVPR.2012.6248074"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. (2016, January 27\u201330). The cityscapes dataset for semantic urban scene understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.350"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Ros, G., Sellart, L., Materzynska, J., Vazquez, D., and Lopez, A.M. (2016, January 27\u201330). The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.352"},{"key":"ref_34","unstructured":"Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. (2015). Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv."},{"key":"ref_35","unstructured":"Chollet, F., Duryea, E., and Hu, W. (2022, February 20). Keras. Available online: https:\/\/keras.io."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"LeCun, Y.A., Bottou, L., Orr, G.B., and M\u00fcller, K.R. (2012). Efficient backprop. Neural Networks: Tricks of the Trade, Springer.","DOI":"10.1007\/978-3-642-35289-8_3"},{"key":"ref_37","unstructured":"Honauer, K. (2019). Performance Metrics and Test Data Generation for Depth Estimation Algorithms. [Ph.D. Thesis, Faculty of Mathematics and Computer Science]."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Wang, Y., Tsai, Y.H., Hung, W.C., Ding, W., Liu, S., and Yang, M.H. (2022, January 3\u20138). 
Semi-supervised multi-task learning for semantics and depth. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV51458.2022.00272"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/4\/1669\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:23:52Z","timestamp":1760135032000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/4\/1669"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,2,21]]},"references-count":38,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2022,2]]}},"alternative-id":["s22041669"],"URL":"https:\/\/doi.org\/10.3390\/s22041669","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2022,2,21]]}}}