{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,17]],"date-time":"2025-11-17T02:59:52Z","timestamp":1763348392555,"version":"build-2065373602"},"reference-count":53,"publisher":"MDPI AG","issue":"21","license":[{"start":{"date-parts":[[2021,10,20]],"date-time":"2021-10-20T00:00:00Z","timestamp":1634688000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key Research and Development Project","award":["2017YFE0125300"],"award-info":[{"award-number":["2017YFE0125300"]}]},{"DOI":"10.13039\/501100018617","name":"Liaoning Revitalization Talents Program","doi-asserted-by":"publisher","award":["XLYC1802056"],"award-info":[{"award-number":["XLYC1802056"]}],"id":[{"id":"10.13039\/501100018617","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>In recent years, self-supervised monocular depth estimation has gained popularity among researchers because it uses only a single camera at a much lower cost than the direct use of laser sensors to acquire depth. Although monocular self-supervised methods can obtain dense depths, the estimation accuracy needs to be further improved for better applications in scenarios such as autonomous driving and robot perception. In this paper, we innovatively combine soft attention and hard attention with two new ideas to improve self-supervised monocular depth estimation: (1) a soft attention module and (2) a hard attention strategy. We integrate the soft attention module in the model architecture to enhance feature extraction in both spatial and channel dimensions, adding only a small number of parameters. Unlike traditional fusion approaches, we use the hard attention strategy to enhance the fusion of generated multi-scale depth predictions. Further experiments demonstrate that our method can achieve the best self-supervised performance both on the standard KITTI benchmark and the Make3D dataset.<\/jats:p>","DOI":"10.3390\/s21216956","type":"journal-article","created":{"date-parts":[[2021,10,20]],"date-time":"2021-10-20T21:31:26Z","timestamp":1634765486000},"page":"6956","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Joint Soft\u2013Hard Attention for Self-Supervised Monocular Depth Estimation"],"prefix":"10.3390","volume":"21","author":[{"given":"Chao","family":"Fan","sequence":"first","affiliation":[{"name":"University of Chinese Academy of Sciences, Beijing 100049, China"},{"name":"Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China"},{"name":"Liaoning Key Laboratory of Domestic Industrial Control Platform Technology on Basic Hardware & Software, Shenyang 110168, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhenyu","family":"Yin","sequence":"additional","affiliation":[{"name":"Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China"},{"name":"Liaoning Key Laboratory of Domestic Industrial Control Platform Technology on Basic Hardware & Software, Shenyang 110168, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8690-4649","authenticated-orcid":false,"given":"Fulong","family":"Xu","sequence":"additional","affiliation":[{"name":"University of Chinese Academy of Sciences, Beijing 100049, China"},{"name":"Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China"},{"name":"Liaoning Key Laboratory of Domestic Industrial Control Platform Technology on Basic Hardware & Software, Shenyang 110168, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3696-9038","authenticated-orcid":false,"given":"Anying","family":"Chai","sequence":"additional","affiliation":[{"name":"University of Chinese Academy of Sciences, Beijing 100049, China"},{"name":"Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China"},{"name":"Liaoning Key Laboratory of Domestic Industrial Control Platform Technology on Basic Hardware & Software, Shenyang 110168, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Feiqing","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of Chinese Academy of Sciences, Beijing 100049, China"},{"name":"Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China"},{"name":"Liaoning Key Laboratory of Domestic Industrial Control Platform Technology on Basic Hardware & Software, Shenyang 110168, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,10,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1196","DOI":"10.1007\/s11431-015-5828-x","article-title":"Vision navigation for aircrafts based on 3D reconstruction from real-time image sequences","volume":"58","author":"Zhu","year":"2015","journal-title":"Sci. China Technol. Sci."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Geiger, A., Lenz, P., and Urtasun, R. (2012, January 16\u201321). Are we ready for autonomous driving? The KITTI vision benchmark suite. Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA.","DOI":"10.1109\/CVPR.2012.6248074"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"707","DOI":"10.1049\/iet-ipr.2018.5920","article-title":"Survey on depth perception in head mounted displays: Distance estimation in virtual reality, augmented reality, and mixed reality","volume":"13","author":"Marsh","year":"2019","journal-title":"IET Image Process."},{"key":"ref_4","unstructured":"Eigen, D., Puhrsch, C., and Fergus, R. (2014, January 8\u201313). Depth map prediction from a single image using a multi-scale deep network. Proceedings of the 28th Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., and Navab, N. (2016, January 25\u201328). Deeper depth prediction with fully convolutional residual networks. Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA.","DOI":"10.1109\/3DV.2016.32"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Fu, H., Gong, M., Wang, C., Batmanghelich, K., and Tao, D. (2018, January 18\u201323). Deep ordinal regression network for monocular depth estimation. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00214"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Zhou, T., Brown, M., Snavely, N., and Lowe, D.G. (2017, January 21\u201326). Unsupervised learning of depth and ego-motion from video. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.700"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Wang, C., Buenaposada, J.M., Zhu, R., and Lucey, S. (2018, January 18\u201323). Learning depth from monocular videos using direct methods. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00216"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Yin, Z., and Shi, J. (2018, January 18\u201323). GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00212"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Zou, Y., Luo, Z., and Huang, J.-B. (2018, January 8\u201314). Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01228-1_3"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Godard, C., Aodha, O.M., Firman, M., and Brostow, G. (November, January 27). Digging into self-supervised monocular depth estimation. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00393"},{"key":"ref_12","unstructured":"Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., and Fragkiadaki, K. (2017). Sfm-net: Learning of structure and motion from video. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"2624","DOI":"10.1109\/TPAMI.2019.2930258","article-title":"Every pixel counts ++: Joint learning of geometry and motion with 3D holistic understanding","volume":"42","author":"Luo","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Godard, C., Aodha, O.M., and Brostow, G.J. (2017, January 21\u201326). Unsupervised monocular depth estimation with left-right consistency. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.699"},{"key":"ref_15","unstructured":"Guizilini, V., Hou, R., Li, J., Ambrus, R., and Gaidon, A. (2019, January 6\u20139). Semantically-guided representation learning for self-supervised monocular depth. Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Klingner, M., Term\u00f6hlen, J.-A., Mikolajczyk, J., and Fingscheidt, T. (2020, January 23\u201328). Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58565-5_35"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Guizilini, V., Ambru\u0219, R., Pillai, S., Raventos, A., and Gaidon, A. (2020, January 13\u201319). 3D packing for self-supervised monocular depth estimation. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00256"},{"key":"ref_18","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_19","unstructured":"Gao, F., Yu, J., Shen, H., Wang, Y., and Yang, H. (2020, January 16\u201318). Attentional separation-and-aggregation network for self-supervised depth-pose learning in dynamic scenes. Proceedings of the 4th Conference on Robot Learning (CoRL), Cambridge, MA, USA."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Johnston, A., and Carneiro, G. (2020, January 13\u201319). Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00481"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Garg, R., Bg, V.K., Carneiro, G., and Reid, I. (2016, January 8\u201316). Unsupervised cnn for single view depth estimation: Geometry to the rescue. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46484-8_45"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1231","DOI":"10.1177\/0278364913491297","article-title":"Vision meets robotics: The kitti dataset","volume":"32","author":"Geiger","year":"2013","journal-title":"Int. J. Robot. Res."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"824","DOI":"10.1109\/TPAMI.2008.132","article-title":"Make3d: Learning 3d scene structure from a single still image","volume":"31","author":"Saxena","year":"2008","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"2024","DOI":"10.1109\/TPAMI.2015.2505283","article-title":"Learning depth from single monocular images using deep convolutional neural Fields","volume":"38","author":"Liu","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Lee, S., Lee, J., Kim, B., Yi, E., and Kim, J. (2021, January 2\u20139). Patch-wise attention network for monocular depth estimation. Proceedings of the AAAI Conference on Artificial Intelligence, Edinburgh, UK.","DOI":"10.1609\/aaai.v35i3.16282"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Watson, J., Firman, M., Brostow, G.J., and Turmukhambetov, D. (2019, January 11\u201317). Self-supervised monocular depth hints. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00225"},{"key":"ref_27","unstructured":"Casser, V., Pirk, S., Mahjourian, R., and Angelova, A. (February, January 27). Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Gordon, A., Li, H., Jonschkowski, R., and Angelova, A. (2019, January 11\u201317). Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00907"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Liu, C., Gu, J., Kim, K., Narasimhan, S.G., and Kautz, J. (2019, January 15\u201320). Neural RGB\u00ae D sensing: Depth and uncertainty from a video camera. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Angeles, CA, USA.","DOI":"10.1109\/CVPR.2019.01124"},{"key":"ref_30","first-page":"12626","article-title":"Forget about the lidar: Self-supervised depth estimators with med probability volumes","volume":"33","author":"GonzalezBello","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Lyu, X., Liu, L., Wang, M., Kong, X., Liu, L., Liu, Y., Chen, X., and Yuan, Y. (2021, January 2\u20139). HR-depth: High resolution self-supervised monocular depth estimation. Proceedings of the AAAI Conference on Artificial Intelligence, A Virtual Conference, Edinburgh, UK.","DOI":"10.1609\/aaai.v35i3.16329"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"6813","DOI":"10.1109\/LRA.2020.3017478","article-title":"Don\u2032t forget the past: Recurrent depth estimation from monocular video","volume":"5","author":"Patil","year":"2020","journal-title":"IEEE Robot. Autom. Lett."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Watson, J., Mac Aodha, O., Prisacariu, V., Brostow, G., and Firman, M. (2021, January 19\u201325). The temporal opportunist: Self-supervised multi-frame monocular depth. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, A Virtual Conference, Edinburgh, UK.","DOI":"10.1109\/CVPR46437.2021.00122"},{"key":"ref_34","unstructured":"Galassi, A., Lippi, M., and Torroni, P. (2019). Attention, please! A critical review of neural attention models in natural language processing. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201323). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.-Y., and Kweon, I.S. (2018, January 8\u201314). Cbam: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Wang, X., Girshick, R., Gupta, A., and He, K. (2018, January 18\u201323). Non-local neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00813"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015, January 5\u20139). U-net: Convolutional networks for biomedical image segmentation. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Wang, Y., Yang, Y., Yang, Z., Zhao, L., Wang, P., and Xu, W. (2018, January 18\u201323). Occlusion aware unsupervised learning of optical flow. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00513"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1023\/A:1014573219977","article-title":"A taxonomy and evaluation of dense two-frame stereo correspondence algorithms","volume":"47","author":"Scharstein","year":"2002","journal-title":"Int. J. Comput. Vis."},{"key":"ref_42","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"Imagenet large scale visual recognition challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Eigen, D., and Fergus, R. (2015, January 7\u201313). Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.304"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Chen, Y., Schmid, C., and Sminchisescu, C. (2019, January 27\u201328). Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00716"},{"key":"ref_46","first-page":"71","article-title":"Consistent video depth estimation","volume":"39","author":"Luo","year":"2020","journal-title":"ACM Trans. Graph. TOG"},{"key":"ref_47","unstructured":"Wang, J., Zhang, G., Wu, Z., Li, X., and Liu, L. (2020). Self-supervised joint learning framework of depth estimation via implicit cues. arXiv."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Kuznietsov, Y., Proesmans, M., and Van Gool, L. (2021, January 5\u20139). Comoda: Continuous monocular depth adaptation using past experiences. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikola, HI, USA.","DOI":"10.1109\/WACV48630.2021.00295"},{"key":"ref_49","unstructured":"McCraith, R., Neumann, L., Zisserman, A., and Vedaldi, A. (2020). Monocular depth estimation with self-supervised instance adaptation. arXiv."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., Wulff, J., and Black, M.J. (2019, January 15\u201320). Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Los Angeles, CA, USA.","DOI":"10.1109\/CVPR.2019.01252"},{"key":"ref_51","unstructured":"Li, H., Gordon, A., Zhao, H., Casser, V., and Angelova, A. (2020). Unsupervised monocular depth learning in dynamic scenes. arXiv."},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"2144","DOI":"10.1109\/TPAMI.2014.2316835","article-title":"Depth transfer: Depth extraction from video using non-parametric sampling","volume":"36","author":"Karsch","year":"2014","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Liu, M., Salzmann, M., and He, X. (2014, January 23\u201328). Discrete-continuous depth estimation from a single image. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.97"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/21\/6956\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:19:01Z","timestamp":1760167141000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/21\/6956"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,20]]},"references-count":53,"journal-issue":{"issue":"21","published-online":{"date-parts":[[2021,11]]}},"alternative-id":["s21216956"],"URL":"https:\/\/doi.org\/10.3390\/s21216956","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2021,10,20]]}}}