{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,30]],"date-time":"2026-03-30T21:40:37Z","timestamp":1774906837751,"version":"3.50.1"},"reference-count":62,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2023,9,13]],"date-time":"2023-09-13T00:00:00Z","timestamp":1694563200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,9,13]],"date-time":"2023-09-13T00:00:00Z","timestamp":1694563200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach. Intell. Res."],"published-print":{"date-parts":[[2023,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This paper aims to address the problem of supervised monocular depth estimation. We start with a meticulous pilot study to demonstrate that the long-range correlation is essential for accurate depth estimation. Moreover, the Transformer and convolution are good at long-range and close-range depth estimation, respectively. Therefore, we propose to adopt a parallel encoder architecture consisting of a Transformer branch and a convolution branch. The former can model global context with the effective attention mechanism and the latter aims to preserve the local information as the Transformer lacks the spatial inductive bias in modeling such contents. However, independent branches lead to a shortage of connections between features. To bridge this gap, we design a hierarchical aggregation and heterogeneous interaction module to enhance the Transformer features and model the affinity between the heterogeneous features in a set-to-set translation manner. Due to the unbearable memory cost introduced by the global attention on high-resolution feature maps, we adopt the deformable scheme to reduce the complexity. Extensive experiments on the KITTI, NYU, and SUN RGB-D datasets demonstrate that our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins. The effectiveness of each proposed module is elaborately evaluated through meticulous and intensive ablation studies.<\/jats:p>","DOI":"10.1007\/s11633-023-1458-0","type":"journal-article","created":{"date-parts":[[2023,9,13]],"date-time":"2023-09-13T04:01:25Z","timestamp":1694577685000},"page":"837-854","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":162,"title":["DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation"],"prefix":"10.1007","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2932-9179","authenticated-orcid":false,"given":"Zhenyu","family":"Li","sequence":"first","affiliation":[]},{"given":"Zehui","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Xianming","family":"Liu","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5694-505X","authenticated-orcid":false,"given":"Junjun","family":"Jiang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,9,13]]},"reference":[{"key":"1458_CR1","doi-asserted-by":"publisher","first-page":"770","DOI":"10.1109\/CVPR.2016.90","volume-title":"Deep residual learning for image recognition","author":"K M He","year":"2016","unstructured":"K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 770\u2013778, 2016. DOI: https:\/\/doi.org\/10.1109\/CVPR.2016.90."},{"key":"1458_CR2","doi-asserted-by":"publisher","first-page":"2002","DOI":"10.1109\/CVPR.2018.00214","volume-title":"Deep ordinal regression network for monocular depth estimation","author":"H Fu","year":"2018","unstructured":"H. Fu, M. M. Gong, C. H. Wang, K. Batmanghelich, D. C. Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 2002\u20132011, 2018. DOI: https:\/\/doi.org\/10.1109\/CVPR.2018.00214."},{"key":"1458_CR3","unstructured":"J. H. Lee, M. K. Han, D. W. Ko, I. H. Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation, [Online], Available: https:\/\/arxiv.org.abs\/1907.10326, 2019."},{"key":"1458_CR4","doi-asserted-by":"publisher","first-page":"4008","DOI":"10.1109\/CVPR46437.2021.00400","volume-title":"AdaBins: Depth estimation using adaptive bins","author":"S F Bhat","year":"2021","unstructured":"S. F. Bhat, I. Alhashim, P. Wonka. AdaBins: Depth estimation using adaptive bins. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 4008\u20134017, 2021. DOI: https:\/\/doi.org\/10.1109\/CVPR46437.2021.00400."},{"key":"1458_CR5","doi-asserted-by":"publisher","first-page":"12159","DOI":"10.1109\/ICCV48922.2021.01196","volume-title":"Vision transformers for dense prediction","author":"R Ranftl","year":"2021","unstructured":"R. Ranftl, A. Bochkovskiy, V. Koltun. Vision transformers for dense prediction. In Proceedings of IEEE\/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 12159\u201312168, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.01196."},{"key":"1458_CR6","unstructured":"A. Saxena, S. H. Chung, A. Y. Ng. Learning depth from single monocular images. In Proceedings of the 18th International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 1161\u20131168, 2005."},{"key":"1458_CR7","doi-asserted-by":"publisher","first-page":"234","DOI":"10.1007\/978-3-319-24574-4_28","volume-title":"U-Net: Convolutional networks for biomedical image segmentation","author":"O Ronneberger","year":"2015","unstructured":"O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Munich, Germany, pp. 234\u2013241, 2015. DOI: https:\/\/doi.org\/10.1007\/978-3-319-24574-4_28."},{"issue":"4","key":"1458_CR8","doi-asserted-by":"publisher","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","volume":"40","author":"L C Chen","year":"2018","unstructured":"L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834\u2013848, 2018. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2017.2699184.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1458_CR9","doi-asserted-by":"publisher","first-page":"6230","DOI":"10.1109\/CVPR.2017.660","volume-title":"Pyramid scene parsing network","author":"H S Zhao","year":"2017","unstructured":"H. S. Zhao, J. P. Shi, X. J. Qi, X. G. Wang, J. Y. Jia. Pyramid scene parsing network. In Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 6230\u20136239, 2017. DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.660."},{"key":"1458_CR10","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \u0141. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6000\u20136010, 2017."},{"key":"1458_CR11","doi-asserted-by":"publisher","first-page":"581","DOI":"10.1007\/978-3-030-58574-7_35","volume-title":"Guiding monocular depth estimation using depth-attention volume","author":"L Huynh","year":"2020","unstructured":"L. Huynh, P. Nguyen-Ha, J. Matas, E. Rahtu, J. Heikkil\u00e4. Guiding monocular depth estimation using depth-attention volume. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 581\u2013597, 2020. DOI: https:\/\/doi.org\/10.1007\/978-3-030-58574-7_35."},{"key":"1458_CR12","unstructured":"A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021."},{"key":"1458_CR13","unstructured":"G. Yang, H. Tang, M. Ding, N. Sebe, E. Ricci. Transformers solve the limited receptive field for monocular depth prediction. In Proceedings of International Confonference on Computer Vision, 2021."},{"key":"1458_CR14","doi-asserted-by":"publisher","first-page":"559","DOI":"10.1109\/ICCV48922.2021.00062","volume-title":"Incorporating convolution designs into visual transformers","author":"K Yuan","year":"2021","unstructured":"K. Yuan, S. P. Guo, Z. W. Liu, A. J. Zhou, F. W. Yu, W. Wu. Incorporating convolution designs into visual transformers. In Proceedings of IEEE\/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 559\u2013568, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00062."},{"key":"1458_CR15","unstructured":"Z. H. Dai, H. X. Liu, Q. V. Le, M. X. Tan. Coatnet: Marrying convolution and attention for all data sizes. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 3965\u20133977, 2021."},{"key":"1458_CR16","unstructured":"T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Doll\u00e1r, R. B. Girshick. Early convolutions help transformers see better. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 30392\u201330400, 2021."},{"key":"1458_CR17","doi-asserted-by":"publisher","first-page":"764","DOI":"10.1109\/ICCV.2017.89","volume-title":"Deformable convolutional networks","author":"J F Dai","year":"2017","unstructured":"J. F. Dai, H. Z. Qi, Y. W. Xiong, Y. Li, G. D. Zhang, H. Hu, Y. C. Wei. Deformable convolutional networks. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 764\u2013773, 2017. DOI: https:\/\/doi.org\/10.1109\/ICCV.2017.89."},{"key":"1458_CR18","unstructured":"X. Z. Zhu, W. J. Su, L. W. Lu, B. Li, X. G. Wang, J. F. Dai. Deformable detr: Deformable transformers for end-to-end object detection. In Proceedings of 9th International Conference on Learning Representations, 2021."},{"issue":"11","key":"1458_CR19","doi-asserted-by":"publisher","first-page":"1231","DOI":"10.1177\/0278364913491297","volume":"32","author":"A Geiger","year":"2013","unstructured":"A. Geiger, P. Lenz, C. Stiller, R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231\u20131237, 2013. DOI: https:\/\/doi.org\/10.1177\/0278364913491297.","journal-title":"The International Journal of Robotics Research"},{"key":"1458_CR20","doi-asserted-by":"publisher","first-page":"746","DOI":"10.1007\/978-3-642-33715-4_54","volume-title":"Indoor segmentation and support inference from RGBD images","author":"N Silberman","year":"2012","unstructured":"N. Silberman, D. Hoiem, P. Kohli, R. Fergus. Indoor segmentation and support inference from RGBD images. In Proceedings of the 12th European Conference on Computer Vision, Springer, Florence, Italy, pp. 746\u2013760, 2012. DOI: https:\/\/doi.org\/10.1007\/978-3-642-33715-4_54."},{"key":"1458_CR21","doi-asserted-by":"publisher","first-page":"567","DOI":"10.1109\/CVPR.2015.7298655","volume-title":"SUN RGB-D: A RGB-D scene understanding benchmark suite","author":"S R Song","year":"2015","unstructured":"S. R. Song, S. P. Lichtenberg, J. X. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 567\u2013576, 2015. DOI: https:\/\/doi.org\/10.1109\/CVPR.2015.7298655."},{"key":"1458_CR22","doi-asserted-by":"publisher","first-page":"353","DOI":"10.1007\/978-3-319-46487-9_22","volume-title":"Depth map super-resolution by deep multi-scale guidance","author":"T W Hui","year":"2016","unstructured":"T. W. Hui, C. C. Loy, X. O. Tang. Depth map super-resolution by deep multi-scale guidance. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 353\u2013369, 2016. DOI: https:\/\/doi.org\/10.1007\/978-3-319-46487-9_22."},{"issue":"7","key":"1458_CR23","doi-asserted-by":"publisher","first-page":"835","DOI":"10.1109\/TVCG.2015.2398440","volume":"21","author":"J Lee","year":"2015","unstructured":"J. Lee, Y. Kim, S. Lee, B. Kim, J. Noh. High-quality depth estimation using an exemplar 3D model for stereo conversion. IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 7, pp. 835\u2013847, 2015. DOI: https:\/\/doi.org\/10.1109\/TVCG.2015.2398440.","journal-title":"IEEE Transactions on Visualization and Computer Graphics"},{"issue":"11","key":"1458_CR24","doi-asserted-by":"publisher","first-page":"8355","DOI":"10.1109\/TPAMI.2021.3102575","volume":"44","author":"J X Dong","year":"2022","unstructured":"J. X. Dong, J. S. Pan, J. S. Ren, L. Lin, J. H. Tang, M. H. Yang. Learning spatially variant linear representation models for joint filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8355\u20138370, 2022. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2021.3102575.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1458_CR25","unstructured":"Z. Q. Zhang, X. G. Zhu, Y. W. Li, X. Q. Chen, Y. Guo. Adversarial attacks on monocular depth estimation, [Online], Available: https:\/\/arxiv.org\/abs\/2003.10315, 2020."},{"key":"1458_CR26","unstructured":"D. Eigen, C. Puhrsch, R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 2366\u20132374, 2014."},{"key":"1458_CR27","doi-asserted-by":"publisher","first-page":"1043","DOI":"10.1109\/WACV.2019.00116","volume-title":"Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries","author":"J J Hu","year":"2019","unstructured":"J. J. Hu, M. Ozay, Y. Zhang, T. Okatani. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In Proceedings of IEEE Winter Conference on Applications of Computer Vision, IEEE, Waikoloa, USA, pp. 1043\u20131051, 2019. DOI: https:\/\/doi.org\/10.1109\/WACV.2019.00116."},{"issue":"12","key":"1458_CR28","doi-asserted-by":"publisher","first-page":"3446","DOI":"10.1109\/TVCG.2020.3023634","volume":"26","author":"X B Yang","year":"2020","unstructured":"X. B. Yang, L. Y. Zhou, H. Q. Jiang, Z. L. Tang, Y. B. Wang, H. J. Bao, G. F. Zhang. Mobile3DRecon: Real-time monocular 3D reconstruction on a mobile phone. IEEE Transactions on Visualization and Computer Graphics, vol. 26, no. 12, pp. 3446\u20133456, 2020. DOI: https:\/\/doi.org\/10.1109\/TVCG.2020.3023634.","journal-title":"IEEE Transactions on Visualization and Computer Graphics"},{"key":"1458_CR29","unstructured":"M. X. Tan, Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, pp. 6105\u20136114, 2019."},{"key":"1458_CR30","doi-asserted-by":"publisher","first-page":"2261","DOI":"10.1109\/CVPR.2017.243","volume-title":"Densely connected convolutional networks","author":"G Huang","year":"2017","unstructured":"G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 2261\u20132269, 2017. DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.243."},{"key":"1458_CR31","unstructured":"I. Alhashim, P. Wonka. High quality monocular depth estimation via transfer learning, [Online], Available: https:\/\/arxiv.org\/abs\/1812.11941, 2018."},{"key":"1458_CR32","doi-asserted-by":"publisher","first-page":"9992","DOI":"10.1109\/ICCV48922.2021.00986","volume-title":"Swin Transformer: Hierarchical vision transformer using shifted windows","author":"Z Liu","year":"2021","unstructured":"Z. Liu, Y. T. Lin, Y. Cao, H. Hu, Y. X. Wei, Z. Zhang, S. Lin, B. N. Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE\/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 9992\u201310002, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00986."},{"key":"1458_CR33","doi-asserted-by":"publisher","first-page":"213","DOI":"10.1007\/978-3-030-58452-8_13","volume-title":"End-to-end object detection with transformers","author":"N Carion","year":"2020","unstructured":"N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko. End-to-end object detection with transformers. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 213\u2013229, 2020. DOI: https:\/\/doi.org\/10.1007\/978-3-030-58452-8_13."},{"key":"1458_CR34","doi-asserted-by":"publisher","first-page":"6877","DOI":"10.1109\/CVPR46437.2021.00681","volume-title":"Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers","author":"S X Zheng","year":"2021","unstructured":"S. X. Zheng, J. C. Lu, H. S. Zhao, X. T. Zhu, Z. K. Luo, Y. B. Wang, Y. W. Fu, J. F. Feng, T. Xiang, P. H. S. Torr, L. Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 6877\u20136886, 2021. DOI: https:\/\/doi.org\/10.1109\/CVPR46437.2021.00681."},{"key":"1458_CR35","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1007\/978-3-030-01267-0_4","volume-title":"Look deeper into depth: Monocular depth estimation with semantic booster and attention-driven loss","author":"J B Jiao","year":"2018","unstructured":"J. B. Jiao, Y. Cao, Y. B. Song, R. Lau. Look deeper into depth: Monocular depth estimation with semantic booster and attention-driven loss. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 55\u201371, 2018. DOI: https:\/\/doi.org\/10.1007\/978-3-030-01267-0_4."},{"key":"1458_CR36","doi-asserted-by":"publisher","first-page":"548","DOI":"10.1109\/ICCV48922.2021.00061","volume-title":"Pyramid vision transformer: A versatile backbone for dense prediction without convolutions","author":"W H.Wang","year":"2021","unstructured":"W. H.Wang, E. Z. Xie, X. Li, D. P. Fan, K. T. Song, D. Liang, T. Lu, P. Luo, L. Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of IEEE\/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 548\u2013558, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00061."},{"key":"1458_CR37","doi-asserted-by":"publisher","unstructured":"Z. Y. Li, Z. H. Chen, A. Li, L. J. Fang, Q. H. Jiang, X. M. Liu, J. J. Jiang, B. L. Zhou, H. Zhao. SimIPU: Simple 2D image and 3D point cloud unsupervised pre-training for spatial-aware visual representations. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 1500\u20131508, 2022. DOI: https:\/\/doi.org\/10.1609\/aaai.v36i2.20040.","DOI":"10.1609\/aaai.v36i2.20040"},{"key":"1458_CR38","doi-asserted-by":"publisher","first-page":"740","DOI":"10.1007\/978-3-319-46484-8_45","volume-title":"Unsupervised CNN for single view depth estimation: Geometry to the rescue","author":"R Garg","year":"2016","unstructured":"R. Garg, V. K. B.G., G. Carneiro, I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 740\u2013756, 2016. DOI: https:\/\/doi.org\/10.1007\/978-3-319-46484-8_45."},{"key":"1458_CR39","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1109\/3DV.2017.00012","volume-title":"Sparsity invariant CNNs","author":"J Uhrig","year":"2017","unstructured":"J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, A. Geiger. Sparsity invariant CNNs. In Proceedings of International Conference on 3D Vision, IEEE, Qingdao, China, pp. 11\u201320, 2017. DOI: https:\/\/doi.org\/10.1109\/3DV.2017.00012."},{"key":"1458_CR40","doi-asserted-by":"publisher","first-page":"1625","DOI":"10.1109\/ICCV.2013.458","volume-title":"SUN3D: A database of big spaces reconstructed using SfM and object labels","author":"J X Xiao","year":"2013","unstructured":"J. X. Xiao, A. Owens, A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Sydney, Australia, pp. 1625\u20131632, 2013. DOI: https:\/\/doi.org\/10.1109\/ICCV.2013.458."},{"key":"1458_CR41","doi-asserted-by":"publisher","first-page":"141","DOI":"10.1007\/978-1-4471-4640-7_8","volume-title":"Consumer Depth Cameras for Computer Vision","author":"A Janoch","year":"2013","unstructured":"A. Janoch, S. Karayev, Y. Q. Jia, J. T. Barron, M. Fritz, K. Saenko, T. Darrell. A category-level 3D object dataset: Putting the kinect to work. Consumer Depth Cameras for Computer Vision, A. Fossati, J. Gall, H. Grabner, X. F. Ren, K. Konolige, Eds., London, UK: Springer, pp. 141\u2013165, 2013. DOI: https:\/\/doi.org\/10.1007\/978-1-4471-4640-7_8."},{"key":"1458_CR42","unstructured":"M. Contributors. MMsegmentation: Openmmlab semantic segmentation toolbox and benchmark, [Online], Available: https:\/\/gitee.com\/deadkany\/mmsegmentation, 2020."},{"key":"1458_CR43","unstructured":"A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, pp. 1097\u20131105, 2012."},{"key":"1458_CR44","doi-asserted-by":"publisher","first-page":"12643","DOI":"10.1109\/ICCV48922.2021.01243","volume":"ProceedingsofIE","author":"B Y Li","year":"2021","unstructured":"B. Y. Li, Y. Huang, Z. Y. Liu, D. P. Zou, W. X. Yu. StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation. In Proceedings of IEEE\/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 12643\u201312653, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.01243.","journal-title":"StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation"},{"key":"1458_CR45","doi-asserted-by":"publisher","first-page":"12767","DOI":"10.1109\/ICCV48922.2021.01255","volume-title":"Monoindoor: Towards good practice of self-supervised monocular depth estimation for indoor environments","author":"P Ji","year":"2021","unstructured":"P. Ji, R. Z. Li, B. Bhanu, Y. Xu. Monoindoor: Towards good practice of self-supervised monocular depth estimation for indoor environments. In Proceedings of IEEE\/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 12767\u201312776, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.01255."},{"key":"1458_CR46","doi-asserted-by":"publisher","first-page":"239","DOI":"10.1109\/3DV.2016.32","volume-title":"Deeper depth prediction with fully convolutional residual networks","author":"I Laina","year":"2016","unstructured":"I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, N. Navab. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 4th International Conference on 3D Vision, IEEE, Standford, USA, pp. 239\u2013248, 2016. DOI: https:\/\/doi.org\/10.1109\/3DV.2016.32."},{"key":"1458_CR47","doi-asserted-by":"publisher","first-page":"3906","DOI":"10.1109\/CVPR52688.2022.00389","volume-title":"Neural window fully-connected CRFs for monocular depth estimation","author":"W H Yuan","year":"2022","unstructured":"W. H. Yuan, X. D. Gu, Z. Z. Dai, S. Y. Zhu, P. Tan. Neural window fully-connected CRFs for monocular depth estimation. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 3906\u20133915, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.00389."},{"key":"1458_CR48","doi-asserted-by":"publisher","first-page":"6602","DOI":"10.1109\/CVPR.2017.699","volume-title":"Unsupervised monocular depth estimation with left-right consistency","author":"C Godard","year":"2017","unstructured":"C. Godard, O. M. Aodha, G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 6602\u20136611, 2017. DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.699."},{"key":"1458_CR49","doi-asserted-by":"publisher","first-page":"4755","DOI":"10.1109\/CVPR42600.2020.00481","volume-title":"Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume","author":"A Johnston","year":"2020","unstructured":"A. Johnston, G. Carneiro. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 4755\u20134764, 2020. DOI: https:\/\/doi.org\/10.1109\/CVPR42600.2020.00481."},{"key":"1458_CR50","doi-asserted-by":"publisher","first-page":"232","DOI":"10.1007\/978-3-030-01219-9_14","volume-title":"Monocular depth estimation with affinity, vertical pooling, and label enhancement","author":"Y K Gan","year":"2018","unstructured":"Y. K. Gan, X. Y. Xu, W. X. Sun, L. Lin. Monocular depth estimation with affinity, vertical pooling, and label enhancement. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 232\u2013247, 2018. DOI: https:\/\/doi.org\/10.1007\/978-3-030-01219-9_14."},{"key":"1458_CR51","doi-asserted-by":"publisher","first-page":"5683","DOI":"10.1109\/ICCV.2019.00578","volume-title":"Enforcing geometric constraints of virtual normal for depth prediction","author":"W Yin","year":"2019","unstructured":"W. Yin, Y. F. Liu, C. H. Shen, Y. L. Yan. Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of IEEE\/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 5683\u20135692, 2019. DOI: https:\/\/doi.org\/10.1109\/ICCV.2019.00578."},{"issue":"5","key":"1458_CR52","doi-asserted-by":"publisher","first-page":"2673","DOI":"10.1109\/TPAMI.2020.3043781","volume":"44","author":"D Xu","year":"2022","unstructured":"D. Xu, X. Alameda-Pineda, W. L. Ouyang, E. Ricci, X. G. Wang, N. Sebe. Probabilistic graph attention network with conditional kernels for pixel-wise prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2673\u20132688, 2022. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2020.3043781.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1458_CR53","doi-asserted-by":"publisher","first-page":"11746","DOI":"10.1109\/ICRA48506.2021.9560885","volume-title":"Bidirectional attention network for monocular depth estimation","author":"S Aich","year":"2020","unstructured":"S. Aich, J. M. U. Vianney, M. A. Islam, M. K. B. Liu. Bidirectional attention network for monocular depth estimation. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Xi\u2019an, China, pp. 11746\u201311752, 2020. DOI: https:\/\/doi.org\/10.1109\/ICRA48506.2021.9560885."},{"key":"1458_CR54","doi-asserted-by":"publisher","unstructured":"S. Lee, J. Lee, B. Kim, E. Yi, J. Kim. Patch-wise attention network for monocular depth estimation. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pp. 1873\u20131881, 2021. DOI: https:\/\/doi.org\/10.1609\/aaai.v35i3.16282.","DOI":"10.1609\/aaai.v35i3.16282"},{"key":"1458_CR55","doi-asserted-by":"publisher","first-page":"3996","DOI":"10.1109\/CVPR46437.2021.00399","volume-title":"ViP-DeepLab: Learning visual perception with depth-aware video panoptic segmentation","author":"S Y Qiao","year":"2021","unstructured":"S. Y. Qiao, Y. K. Zhu, H. Adam, A. Yuille, L. C. Chen. ViP-DeepLab: Learning visual perception with depth-aware video panoptic segmentation. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 3996\u20134007, 2021. DOI: https:\/\/doi.org\/10.1109\/CVPR46437.2021.00399."},{"key":"1458_CR56","doi-asserted-by":"crossref","unstructured":"X. T. Chen, X. J. Chen, Z. J. Zha. Structure-aware residual pyramid network for monocular depth estimation. In Proceedings of the 28th International Joint conference on Artificial Intelligence, Macao, China, pp. 694\u2013700, 2019.","DOI":"10.24963\/ijcai.2019\/98"},{"key":"1458_CR57","doi-asserted-by":"publisher","first-page":"491","DOI":"10.1007\/978-3-030-58558-7_29","volume-title":"Big transfer (BiT): General visual representation learning","author":"A Kolesnikov","year":"2020","unstructured":"A. Kolesnikov, L. Beyer, X. H. Zhai, J. Puigcerver, J. Yung, S. Gelly, N. Houlsby. Big transfer (BiT): General visual representation learning. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 491\u2013507, 2020. DOI: https:\/\/doi.org\/10.1007\/978-3-030-58558-7_29."},{"key":"1458_CR58","doi-asserted-by":"publisher","unstructured":"A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, USA, pp. 1725\u20131732. DOI: https:\/\/doi.org\/10.1109\/CVPR.2014.223.","DOI":"10.1109\/CVPR.2014.223"},{"key":"1458_CR59","doi-asserted-by":"publisher","first-page":"7132","DOI":"10.1109\/CVPR.2018.00745","volume-title":"Squeeze-and-excitation networks","author":"J Hu","year":"2018","unstructured":"J. Hu, L. Shen, G. Sun. Squeeze-and-excitation networks. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 7132\u20137141, 2018. DOI: https:\/\/doi.org\/10.1109\/CVPR.2018.00745."},{"key":"1458_CR60","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1007\/978-3-030-01234-2_1","volume-title":"CBAM: Convolutional block attention module","author":"S Woo","year":"2018","unstructured":"S. Woo, J. Park, J. Y. Lee, I. S. Kweon. CBAM: Convolutional block attention module. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 3\u201319, 2018. DOI: https:\/\/doi.org\/10.1007\/978-3-030-01234-2_1."},{"key":"1458_CR61","doi-asserted-by":"publisher","first-page":"4661","DOI":"10.1109\/ICCV48922.2021.00464","volume-title":"Specificity-preserving RGB-D saliency detection","author":"T Zhou","year":"2021","unstructured":"T. Zhou, H. Z. Fu, G. Chen, Y. Zhou, D. P. Fan, L. Shao. Specificity-preserving RGB-D saliency detection. In Proceedings of IEEE\/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 4661\u20134671, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00464."},{"key":"1458_CR62","doi-asserted-by":"publisher","unstructured":"W. B. Zhang, G. P. Ji, Z. Wang, K. R. Fu, Q. J. Zhao. Depth quality-inspired feature manipulation for efficient RGB-D salient object detection. In Proceedings of the 29th ACM International Conference on Multimedia, ACM, pp. 731\u2013740, 2021. DOI: https:\/\/doi.org\/10.1145\/3474085.3475240.","DOI":"10.1145\/3474085.3475240"}],"container-title":["Machine Intelligence Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-023-1458-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11633-023-1458-0\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-023-1458-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,6]],"date-time":"2024-05-06T13:10:17Z","timestamp":1715001017000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11633-023-1458-0"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,13]]},"references-count":62,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2023,12]]}},"alternative-id":["1458"],"URL":"https:\/\/doi.org\/10.1007\/s11633-023-1458-0","relation":{},"ISSN":["2731-538X","2731-5398"],"issn-type":[{"value":"2731-538X","type":"print"},{"value":"2731-5398","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,13]]},"assertion":[{"value":"5 March 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 May 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 September 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declared that they have no conflicts of interest to this work.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations of conflict of interest"}}]}}