{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,20]],"date-time":"2026-01-20T03:23:27Z","timestamp":1768879407767,"version":"3.49.0"},"reference-count":56,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2022,1,18]],"date-time":"2022-01-18T00:00:00Z","timestamp":1642464000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["31971792"],"award-info":[{"award-number":["31971792"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"publisher","award":["31920200043"],"award-info":[{"award-number":["31920200043"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>To find an economical solution to infer the depth of the surrounding environment of unmanned agricultural vehicles (UAV), a lightweight depth estimation model called MonoDA based on a convolutional neural network is proposed. A series of sequential frames from monocular videos are used to train the model. The model is composed of two subnetworks\u2014the depth estimation subnetwork and the pose estimation subnetwork. The former is a modified version of U-Net that reduces the number of bridges, while the latter takes EfficientNet-B0 as its backbone network to extract the features of sequential frames and predict the pose transformation relations between the frames. The self-supervised strategy is adopted during the training, which means the depth information labels of frames are not needed. 
Instead, the adjacent frames in the image sequence and the reprojection relation of the pose are used to train the model. The subnetworks\u2019 outputs (depth map and pose relation) are used to reconstruct the input frame, and a self-supervised loss between the reconstructed frame and the original input is calculated. Finally, the loss is employed to update the parameters of the two subnetworks through the backward pass. Several experiments are conducted to evaluate the model\u2019s performance, and the results show that MonoDA achieves competitive accuracy on both the KITTI raw dataset and our vineyard dataset. In addition, our method is insensitive to color. On the NVIDIA Jetson TX2, the computing platform of our UAV\u2019s environment perception system, the model runs at 18.92 FPS. In summary, our approach provides an economical monocular-camera solution for depth estimation that achieves a good trade-off between accuracy and speed and can serve as a novel auxiliary depth detection paradigm for UAVs.<\/jats:p>","DOI":"10.3390\/s22030721","type":"journal-article","created":{"date-parts":[[2022,1,18]],"date-time":"2022-01-18T22:47:32Z","timestamp":1642546052000},"page":"721","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":18,"title":["Monocular Depth Estimation with Self-Supervised Learning for Vineyard Unmanned Agricultural Vehicle"],"prefix":"10.3390","volume":"22","author":[{"given":"Xue-Zhi","family":"Cui","sequence":"first","affiliation":[{"name":"School of Mechanical and Electrical Engineering, Gansu Agricultural University, Lanzhou 730070, China"}]},{"given":"Quan","family":"Feng","sequence":"additional","affiliation":[{"name":"School of Mechanical and Electrical Engineering, Gansu Agricultural University, Lanzhou 730070, China"}]},{"given":"Shu-Zhi","family":"Wang","sequence":"additional","affiliation":[{"name":"College of Electrical Engineering, Northwest 
University for Nationalities, Lanzhou 730030, China"}]},{"given":"Jian-Hua","family":"Zhang","sequence":"additional","affiliation":[{"name":"Agricultural Information Institute of CAAS, Beijing 100081, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,1,18]]},"reference":[{"key":"ref_1","first-page":"381","article-title":"Recent development in automatic guidance and autonomous vehicle for agriculture: A Review","volume":"44","author":"Han","year":"2018","journal-title":"Zhejiang. Univ. (Agric. Life. Sci.)"},{"key":"ref_2","first-page":"196","article-title":"Identification and counting method of orchard pests based on fusion method of infrared sensor and machine vision","volume":"32","author":"Tian","year":"2016","journal-title":"Trans. Chin. Soc. Agric. Eng."},{"key":"ref_3","first-page":"159","article-title":"Method on Ranging for Banana Tree with Laser and Ultrasonic Sensors Based on Fitting and Filtering","volume":"52","author":"Fu","year":"2021","journal-title":"Trans. Chin. Soc. Agric. Mach."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"105","DOI":"10.1016\/S1881-8366(13)80019-5","article-title":"Accurate Position Detecting during Asparagus Spear Harvesting using a Laser Sensor","volume":"6","author":"Hiroki","year":"2013","journal-title":"Eng. Agric. Environ. Food."},{"key":"ref_5","first-page":"59","article-title":"Research advance on vision system of apple picking robot","volume":"33","author":"Wang","year":"2017","journal-title":"Trans. Chin. Soc. Agric. Eng."},{"key":"ref_6","unstructured":"Zhang, M. (2019). Study on Binocular Range Measurement in Information Collection of Paddy Field Culture Area. 
[Master\u2019s Thesis, Kunming University of Science and Technology]."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"130561","DOI":"10.1109\/ACCESS.2020.3009387","article-title":"Coal mine rescue robots based on binocular vision: A review of the state of the art","volume":"8","author":"Zhai","year":"2020","journal-title":"IEEE Access"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"9599","DOI":"10.1007\/s11042-019-08140-9","article-title":"Target positioning method in binocular vision manipulator control based on improved Canny operator","volume":"79","author":"Han","year":"2020","journal-title":"Multimed Tools Appl."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"3002","DOI":"10.1051\/matecconf\/202133603002","article-title":"Binocular intelligent following robot based on YOLO\u2013LITE","volume":"336","author":"Zheng","year":"2021","journal-title":"MATEC Web Conf."},{"key":"ref_10","first-page":"46","article-title":"Obstacle Detection of Agricultural Vehicles Based on Millimeter Wave Radar and Camera","volume":"3","author":"Song","year":"2019","journal-title":"Modern. Inf. Technol."},{"key":"ref_11","unstructured":"Sun, K. (2019). Research on Obstacle Avoidance Technology of Plant Protection UAV Based on Millimeter Wave. [Master\u2019s Thesis, Hangzhou Dianzi University]."},{"key":"ref_12","first-page":"21","article-title":"Orchard Trunk Detection Algorithm for Agricultural Robot Based on Laser Radar","volume":"51","author":"Niu","year":"2020","journal-title":"Trans. Chin. Soc. Agric. Mach."},{"key":"ref_13","unstructured":"Hu, Z. (2017). Research on Automatic Spraying Method of Fruit Trees. [Master\u2019s Thesis, Yantai University]."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"547","DOI":"10.1002\/rob.21852","article-title":"Under canopy light detection and ranging-based autonomous navigation","volume":"36","author":"Vitor","year":"2019","journal-title":"J. 
Field Robot"},{"key":"ref_15","first-page":"334","article-title":"Intra\u2013row Path Extraction and Navigation for Orchards Based on LiDAR","volume":"51","author":"Li","year":"2020","journal-title":"Trans. Chin. Soc. Agric. Mach."},{"key":"ref_16","first-page":"80","article-title":"Development of dual\u2013lidar navigation system for greenhouse transportation robot","volume":"36","author":"Hou","year":"2020","journal-title":"Trans. Chin. Soc. Agric. Eng."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"14948","DOI":"10.1364\/OE.392386","article-title":"SPADnet: Deep RGB\u2013SPAD sensor fusion assisted by monocular depth estimation","volume":"28","author":"Sun","year":"2020","journal-title":"Opt. Express"},{"key":"ref_18","unstructured":"Saxena, A., Chung, S., and Ng, A.Y. (2006, January 4\u20137). Learning depth from single monocular images. Proceedings of the 2006 Advances in Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Konrad, J., Wang, M., and Ishwar, P. (2012, January 16\u201321). 2d\u2013to\u20133d image conversion by learning depth from examples. Proceedings of the 2012 Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA.","DOI":"10.1109\/CVPRW.2012.6238903"},{"key":"ref_20","unstructured":"Karsch, K., Liu, C., and Kang, S.B. (2006, January 7\u201313). Depth extraction from video using non\u2013parametric sampling. Proceedings of the 2006 European Conference on Computer Vision, Firenze, Italy."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Liu, M., Salzmann, M., and He, X. (2014, January 20\u201324). Discrete\u2013continuous depth estimation from a single image. 
Proceedings of the 2014 Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.97"},{"key":"ref_22","first-page":"1","article-title":"DepthTransfer: Depth Extraction from Video Using Non\u2013parametric Sampling","volume":"99","author":"Karsch","year":"2014","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_23","unstructured":"Eigen, D., Puhrsch, C., and Fergus, R. (2014, January 8\u201313). Depth Map Prediction from a Single Image using a Multi\u2013Scale Deep Network. Proceedings of the 2014 Conference and Workshop on Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1426","DOI":"10.1109\/TPAMI.2018.2839602","article-title":"Monocular Depth Estimation using Multi\u2013Scale Continuous CRFs as Sequential Deep Networks","volume":"41","author":"Xu","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Fu, H., Gong, M., Wang, C., Batmanghelich, K., and Tao, D. (2018, January 18\u201322). Deep Ordinal Regression Network for Monocular Depth Estimation. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00214"},{"key":"ref_26","unstructured":"Li, N., Shen, N., Dai, N., Hengel, A., and He, N. (2015, January 7\u201312). Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"3174","DOI":"10.1109\/TCSVT.2017.2740321","article-title":"Estimating depth from monocular images as classification using deep fully convolutional residual networks","volume":"28","author":"Cao","year":"2018","journal-title":"IEEE Trans. Circuits Syst. 
Video Technol."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., and Navab, N. (2016, October 25\u201328). Deeper depth prediction with fully convolutional residual networks. Proceedings of the 2016 Fourth International Conference on 3D Vision, Stanford, CA, USA.","DOI":"10.1109\/3DV.2016.32"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Wang, X., Fouhey, D., and Gupta, A. (2015, January 7\u201312). Designing deep networks for surface normal estimation. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298652"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Garg, R., Kumar, V., Carneiro, G., and Reid, I. (2016, January 11\u201314). Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. Proceedings of the 2016 European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46484-8_45"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Godard, C., Aodha, O., and Brostow, G. (2017, January 21\u201326). Unsupervised monocular depth estimation with left\u2013right consistency. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.699"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Flynn, J., Neulander, I., Philbin, J., and Snavely, N. (2016, January 27\u201330). Deep Stereo: Learning to Predict New Views from the World\u2019s Imagery. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.595"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Xie, J., Girshick, R., and Farhadi, A. (2016, January 11\u201314). Deep3d: Fully automatic 2d\u2013to\u20133d video conversion with deep convolutional neural networks. 
Proceedings of the 2016 European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46493-0_51"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Yang, Z., Wang, P., Xu, W., Zhao, L., and Nevatia, R. (2018, January 2\u20137). Unsupervised learning of geometry from videos with edge\u2013aware depth\u2013normal consistency. Proceedings of the 2018 Association for the Advance of Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.12257"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Yang, Z., Wang, P., Xu, W., Wang, Y., Zhao, L., and Nevatia, R. (2018, January 18\u201322). LEGO: Learning edge with geometry all at once by watching videos. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00031"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Mahjourian, R., Wicke, M., and Angelova, A. (2018, January 18\u201322). Unsupervised learning of depth and ego\u2013motion from monocular video using 3D geometric constraints. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00594"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Lin, R., Lu, Y., and Lu, G. (2019, January 17\u201320). APAC\u2013Net: Unsupervised Learning of Depth and Ego\u2013Motion from Monocular Video. Proceedings of the 2019 International Conference on Intelligent Science and Big Data Engineering, Nanjing, China.","DOI":"10.1007\/978-3-030-36189-1_28"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Yin, Z., and Shi, J. (2018, January 18\u201322). GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. 
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00212"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Wang, C., Buenaposada, J., Zhu, R., and Lucey, S. (2018, January 18\u201322). Learning depth from monocular videos using direct methods. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00216"},{"key":"ref_40","first-page":"8","article-title":"DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross\u2013Task Consistency","volume":"11209","author":"Zou","year":"2018","journal-title":"ECCV Comput. Vis."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"2624","DOI":"10.1109\/TPAMI.2019.2930258","article-title":"Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding","volume":"42","author":"Luo","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Casser, V., Pirk, S., Mahjourian, R., and Angelova, A. (2019, January 27). Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. Proceedings of the 2019 Association for the Advance of Artificial Intelligence, Honolulu, HI, USA.","DOI":"10.1609\/aaai.v33i01.33018001"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Godard, C., Aodha, O., Firman, M., and Brostow, G. (2019, October 27\u2013November 2). Digging Into Self\u2013Supervised Monocular Depth Estimation. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00393"},{"key":"ref_44","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, January 3\u20136). ImageNet classification with deep convolutional neural networks. 
Proceedings of the 2012 International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_45","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very Deep Convolutional Networks for Large\u2013Scale Image Recognition. Proceedings of the 2015 International Conference on Learning Representations, San Diego, CA, USA."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, January 7\u201312). Going deeper with convolutions. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Xie, S., Girshick, R., Doll\u00e1r, P., Tu, Z., and He, K. (2017, January 21\u201326). Aggregated residual transformations for deep neural networks. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.634"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You only look once: Unified, real\u2013time object detection. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Redmon, J., and Farhadi, A. (2017, January 21\u201326). YOLO9000: Better, faster, stronger. 
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.690"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., and Girshick, R. (2017, January 22\u201329). Mask R\u2013CNN. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.322"},{"key":"ref_52","unstructured":"Wang, X., Zhang, R., Kong, T., Li, L., and Shen, C. (2020). Solov2: Dynamic and fast instance segmentation. arXiv."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015, January 5\u20139). U\u2013Net:Convolutional Networks for Biomedical Image Segmentation. Proceedings of the 2015 International Conference on Medical Image Computing and Computer\u2013Assisted Intervention, Munich, Germany.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Xu, D., Wang, W., Tang, H., Liu, H., Sebe, N., and Ricci, E. (2018, January 18\u201322). Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00412"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Wofk, D., Ma, F., Yang, T., Karaman, S., and Sze, V. (2019, January 20\u201324). FastDepth: Fast Monocular Depth Estimation on Embedded Systems. Proceedings of the 2019 International Conference on Robotics and Automation, Montreal, QC, Canada.","DOI":"10.1109\/ICRA.2019.8794182"},{"key":"ref_56","unstructured":"Tan, M., and Le, Q. (2019, January 9\u201315). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 
Proceedings of the 2019 International Conference on Machine Learning, Long Beach, CA, USA."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/3\/721\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:03:22Z","timestamp":1760133802000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/3\/721"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,1,18]]},"references-count":56,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2022,2]]}},"alternative-id":["s22030721"],"URL":"https:\/\/doi.org\/10.3390\/s22030721","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,1,18]]}}}