{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,2]],"date-time":"2026-06-02T08:40:38Z","timestamp":1780389638931,"version":"3.54.1"},"reference-count":47,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,3,14]],"date-time":"2023-03-14T00:00:00Z","timestamp":1678752000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,3,14]],"date-time":"2023-03-14T00:00:00Z","timestamp":1678752000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Industrial Artificial Intelligence"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Monocular depth estimation (MDE) has shown impressive performance recently, even in zero-shot or few-shot scenarios. In this paper, we consider the use of MDE on board low-altitude drone flights, which is required in a number of safety-critical and monitoring operations. In particular, we evaluate a state-of-the-art vision transformer (ViT) variant, pre-trained on a massive MDE dataset. We test it both in a zero-shot scenario and after fine-tuning on a dataset of flight records, and compare its performance to that of a classical fully convolutional network. In addition, we evaluate for the first time whether these models are susceptible to adversarial attacks, by optimizing a small adversarial patch that generalizes across scenarios. We investigate several variants of losses for this task, including weighted error losses in which we can customize the design of the patch to selectively decrease the performance of the model on a desired depth range. 
Overall, our results highlight that (a) ViTs can outperform convolutive models in this context after a proper fine-tuning, and (b) they appear to be more robust to adversarial attacks designed in the form of patches, which is a crucial property for this family of tasks.<\/jats:p>","DOI":"10.1007\/s44244-023-00005-3","type":"journal-article","created":{"date-parts":[[2023,3,14]],"date-time":"2023-03-14T00:02:27Z","timestamp":1678752147000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["On the robustness of vision transformers for in-flight monocular depth estimation"],"prefix":"10.1007","volume":"1","author":[{"given":"Simone","family":"Ercolino","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Alessio","family":"Devoto","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Luca","family":"Monorchio","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Matteo","family":"Santini","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Silvio","family":"Mazzaro","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Simone","family":"Scardapane","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2023,3,14]]},"reference":[{"key":"5_CR1","unstructured":"Eigen D, Puhrsch C, Fergus R (2014) Depth map prediction from a single image using a multi-scale deep network. Adv Neural Inf Process Syst 27. https:\/\/dl.acm.org\/doi\/10.5555\/2969033.2969091"},{"key":"5_CR2","doi-asserted-by":"crossref","unstructured":"Eigen D, Fergus R (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. 
In: Proceedings of the IEEE international conference on computer vision, pp 2650\u20132658","DOI":"10.1109\/ICCV.2015.304"},{"key":"5_CR3","doi-asserted-by":"crossref","unstructured":"Liu F, Shen C, Lin G (2015) Deep convolutional neural fields for depth estimation from a single image. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5162\u20135170","DOI":"10.1109\/CVPR.2015.7299152"},{"key":"5_CR4","doi-asserted-by":"crossref","unstructured":"Godard C, Mac Aodha O, Brostow GJ (2017) Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 270\u2013279","DOI":"10.1109\/CVPR.2017.699"},{"key":"5_CR5","doi-asserted-by":"crossref","unstructured":"Godard C, Mac Aodha O, Firman M, Brostow GJ (2019) Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 3828\u20133838","DOI":"10.1109\/ICCV.2019.00393"},{"key":"5_CR6","doi-asserted-by":"crossref","unstructured":"Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from rgbd images. In: ECCV","DOI":"10.1007\/978-3-642-33715-4_54"},{"key":"5_CR7","doi-asserted-by":"crossref","unstructured":"Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from rgbd images. In: Computer Vision, ECCV 2012-12th European conference on computer vision, proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp 746\u2013760","DOI":"10.1007\/978-3-642-33715-4_54"},{"issue":"5","key":"5_CR8","doi-asserted-by":"publisher","first-page":"824","DOI":"10.1109\/TPAMI.2008.132","volume":"31","author":"A Saxena","year":"2009","unstructured":"Saxena A, Sun M, Ng AY (2009) Make3d: Learning 3d scene structure from a single still image. 
IEEE Trans Pattern Anal Mach Intell 31(5):824\u2013840","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"5_CR9","doi-asserted-by":"crossref","unstructured":"Fonder M, Van Droogenbroeck M (2019) Mid-air: a multi-modal dataset for extremely low altitude drone flights. In: 2019 IEEE\/CVF Conference on computer vision and pattern recognition workshops (CVPRW), pp 553\u2013562","DOI":"10.1109\/CVPRW.2019.00081"},{"key":"5_CR10","unstructured":"Ranftl R, Lasinger K, Hafner D, Schindler K, Koltun V (2020) Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. IEEE Trans Pattern Anal Mach Intell. https:\/\/ieeexplore.ieee.org\/document\/9178977"},{"key":"5_CR11","doi-asserted-by":"crossref","unstructured":"Zhang Z, Xiong M, Xiong H (2019) Monocular depth estimation for uav obstacle avoidance. In: 2019 4th International conference on cloud computing and internet of things (CCIOT), pp 43\u201347. IEEE","DOI":"10.1109\/CCIOT48581.2019.8980350"},{"key":"5_CR12","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.isprsjprs.2021.03.024","volume":"176","author":"L Madhuanand","year":"2021","unstructured":"Madhuanand L, Nex F, Yang MY (2021) Self-supervised monocular depth estimation from oblique uav videos. ISPRS J Photogramm Remote Sens 176:1\u201314","journal-title":"ISPRS J Photogramm Remote Sens"},{"issue":"6","key":"5_CR13","doi-asserted-by":"publisher","first-page":"2097","DOI":"10.3390\/s22062097","volume":"22","author":"T Shimada","year":"2022","unstructured":"Shimada T, Nishikawa H, Kong X, Tomiyama H (2022) Pix2pix-based monocular depth estimation for drones with optical flow on airsim. 
Sensors 22(6):2097","journal-title":"Sensors"},{"issue":"7","key":"5_CR14","doi-asserted-by":"publisher","first-page":"8101","DOI":"10.1007\/s10489-021-02908-z","volume":"52","author":"Y Djenouri","year":"2022","unstructured":"Djenouri Y, Hatleskog J, Hjelmervik J, Bjorne E, Utstumo T, Mobarhan M (2022) Deep learning based decomposition for visual navigation in industrial platforms. Appl Intell 52(7):8101\u20138117","journal-title":"Appl Intell"},{"issue":"2","key":"5_CR15","doi-asserted-by":"publisher","first-page":"46","DOI":"10.3390\/drones6020046","volume":"6","author":"SO Ajakwe","year":"2022","unstructured":"Ajakwe SO, Ihekoronye VU, Kim D-S, Lee JM (2022) Dronet: multi-tasking framework for real-time industrial facility aerial surveillance and safety. Drones 6(2):46","journal-title":"Drones"},{"key":"5_CR16","unstructured":"Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S (2020) An image is worth 16x16 words: transformers for image recognition at scale. In: International conference on learning representations"},{"key":"5_CR17","doi-asserted-by":"crossref","unstructured":"Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 12179\u201312188","DOI":"10.1109\/ICCV48922.2021.01196"},{"key":"5_CR18","doi-asserted-by":"crossref","unstructured":"Eykholt K, Evtimov I, Fernandes E, Li B, Rahmati A, Xiao C, Prakash A, Kohno T, Song D (2018) Robust physical-world attacks on deep learning visual classification. In: Proc. IEEE conference on computer vision and pattern recognition, pp 1625\u20131634","DOI":"10.1109\/CVPR.2018.00175"},{"key":"5_CR19","doi-asserted-by":"crossref","unstructured":"Huang L, Gao C, Zhou Y, Xie C, Yuille AL, Zou C, Liu N (2020) Universal physical camouflage attacks on object detectors. 
In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 720\u2013729","DOI":"10.1109\/CVPR42600.2020.00080"},{"key":"5_CR20","unstructured":"Chiang P-Y, Ni R, Abdelkader A, Zhu C, Studer C, Goldstein T (2020) Certified defenses for adversarial patches. arXiv preprint arXiv:2003.06693"},{"key":"5_CR21","unstructured":"Liu X, Yang H, Liu Z, Song L, Li H, Chen Y (2018) Dpatch: an adversarial patch attack on object detectors. arXiv preprint arXiv:1806.02299"},{"key":"5_CR22","unstructured":"Brown TB, Man\u00e9 D, Roy A, Abadi M, Gilmer J (2017) Adversarial patch. arXiv preprint arXiv:1712.09665"},{"key":"5_CR23","doi-asserted-by":"publisher","first-page":"179094","DOI":"10.1109\/ACCESS.2020.3027372","volume":"8","author":"K Yamanaka","year":"2020","unstructured":"Yamanaka K, Matsumoto R, Takahashi K, Fujii T (2020) Adversarial patch attacks on monocular depth estimation networks. IEEE Access 8:179094\u2013179104","journal-title":"IEEE Access"},{"key":"5_CR24","doi-asserted-by":"publisher","first-page":"14410","DOI":"10.1109\/ACCESS.2018.2807385","volume":"6","author":"N Akhtar","year":"2018","unstructured":"Akhtar N, Mian A (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6:14410\u201314430","journal-title":"IEEE Access"},{"key":"5_CR25","doi-asserted-by":"publisher","first-page":"317","DOI":"10.1016\/j.patcog.2018.07.023","volume":"84","author":"B Biggio","year":"2018","unstructured":"Biggio B, Roli F (2018) Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recogn 84:317\u2013331","journal-title":"Pattern Recogn"},{"issue":"9","key":"5_CR26","doi-asserted-by":"publisher","first-page":"2805","DOI":"10.1109\/TNNLS.2018.2886017","volume":"30","author":"X Yuan","year":"2019","unstructured":"Yuan X, He P, Zhu Q, Li X (2019) Adversarial examples: attacks and defenses for deep learning. 
IEEE Trans Neural Netw Learn Syst 30(9):2805\u20132824","journal-title":"IEEE Trans Neural Netw Learn Syst"},{"key":"5_CR27","doi-asserted-by":"crossref","unstructured":"Dalvi NN, Domingos PM, Mausam Sanghai SK, Verma D (2004) Adversarial classification. Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining","DOI":"10.1145\/1014052.1014066"},{"key":"5_CR28","doi-asserted-by":"crossref","unstructured":"Lowd D, Meek C (2005) Adversarial learning. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining, pp 641\u2013647","DOI":"10.1145\/1081870.1081950"},{"key":"5_CR29","doi-asserted-by":"crossref","unstructured":"Zhou Y, Jorgensen Z, Inge M (2008) Countering good word attacks on statistical spam filters with instance differentiation and multiple instance learning. In: Tools in artificial intelligence. IntechOpen","DOI":"10.5772\/6068"},{"key":"5_CR30","unstructured":"Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, Fergus R (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199"},{"key":"5_CR31","doi-asserted-by":"crossref","unstructured":"Xie C, Wang J, Zhang Z, Zhou Y, Xie L, Yuille A (2017) Adversarial examples for semantic segmentation and object detection. In: Proceedings of IEEE international conference on computer vision (ICCV), pp 1369\u20131378","DOI":"10.1109\/ICCV.2017.153"},{"key":"5_CR32","unstructured":"Song D, Eykholt K, Evtimov I, Fernandes E, Li B, Rahmati A, Tramer F, Prakash A, Kohno T (2018) Physical adversarial examples for object detectors. In: Proceedings of 12th USENIX Workshop on Offensive Technologies (WOOT)"},{"key":"5_CR33","unstructured":"Cisse M, Adi Y, Neverova N, Keshet J (2017) Houdini: fooling deep structured prediction models. 
arXiv preprint arXiv:1707.05373"},{"key":"5_CR34","doi-asserted-by":"crossref","unstructured":"Wu Z, Lim S-N, Davis LS, Goldstein T (2020) Making an invisibility cloak: real world adversarial attacks on object detectors. In: Proceedings of European conference on computer vision (ECCV), pp 1\u201317. Springer","DOI":"10.1007\/978-3-030-58548-8_1"},{"key":"5_CR35","doi-asserted-by":"crossref","unstructured":"Arnab A, Miksik O, Torr PH (2018) On the robustness of semantic segmentation models to adversarial attacks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 888\u2013897","DOI":"10.1109\/CVPR.2018.00099"},{"key":"5_CR36","doi-asserted-by":"crossref","unstructured":"Moosavi-Dezfooli S-M, Fawzi A, Fawzi O, Frossard P (2017) Universal adversarial perturbations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1765\u20131773","DOI":"10.1109\/CVPR.2017.17"},{"issue":"5","key":"5_CR37","doi-asserted-by":"publisher","first-page":"828","DOI":"10.1109\/TEVC.2019.2890858","volume":"23","author":"J Su","year":"2019","unstructured":"Su J, Vargas DV, Sakurai K (2019) One pixel attack for fooling deep neural networks. IEEE Trans Evol Comput 23(5):828\u2013841","journal-title":"IEEE Trans Evol Comput"},{"key":"5_CR38","doi-asserted-by":"crossref","unstructured":"Mahmood K, Mahmood R, Van Dijk M (2021) On the robustness of vision transformers to adversarial examples. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 7838\u20137847","DOI":"10.1109\/ICCV48922.2021.00774"},{"key":"5_CR39","unstructured":"Bhoi A (2019) Monocular depth estimation: a survey. arXiv preprint arXiv:1901.09402"},{"key":"5_CR40","doi-asserted-by":"crossref","unstructured":"Xiaogang R, Wenjing Y, Jing H, Peiyuan G, Wei G (2020) Monocular depth estimation based on deep learning: a survey. In: 2020 Chinese Automation Congress (CAC), pp 2436\u20132440. 
IEEE","DOI":"10.1109\/CAC51589.2020.9327548"},{"key":"5_CR41","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1016\/j.neucom.2020.12.089","volume":"438","author":"Y Ming","year":"2021","unstructured":"Ming Y, Meng X, Fan C, Yu H (2021) Deep learning for monocular depth estimation: a review. Neurocomputing 438:14\u201333","journal-title":"Neurocomputing"},{"key":"5_CR42","unstructured":"Saxena A, Chung S, Ng A (2005) Learning depth from single monocular images. Adv Neural Inf Process Syst 18. https:\/\/dl.acm.org\/doi\/10.5555\/2976248.2976394"},{"key":"5_CR43","doi-asserted-by":"crossref","unstructured":"Aleotti F, Tosi F, Poggi M, Mattoccia S (2018) Generative adversarial networks for unsupervised monocular depth prediction. In: Proceedings of the European conference on computer vision (ECCV) workshops","DOI":"10.1007\/978-3-030-11009-3_20"},{"key":"5_CR44","doi-asserted-by":"crossref","unstructured":"Laina I, Rupprecht C, Belagiannis V, Tombari F, Navab N (2016) Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth international conference on 3D vision (3DV), pp 239\u2013248. IEEE","DOI":"10.1109\/3DV.2016.32"},{"key":"5_CR45","doi-asserted-by":"crossref","unstructured":"Watson J, Mac Aodha O, Prisacariu V, Brostow G, Firman M (2021) The temporal opportunist: self-supervised multi-frame monocular depth. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 1164\u20131174","DOI":"10.1109\/CVPR46437.2021.00122"},{"key":"5_CR46","unstructured":"Fonder M, Ernst D, Van Droogenbroeck M (2021) M4depth: a motion-based approach for monocular depth estimation on video sequences. arXiv preprint arXiv:2105.09847"},{"key":"5_CR47","doi-asserted-by":"crossref","unstructured":"Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. 
In: 2009 IEEE Conference on computer vision and pattern recognition, pp 248\u2013255","DOI":"10.1109\/CVPR.2009.5206848"}],"container-title":["Industrial Artificial Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44244-023-00005-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44244-023-00005-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44244-023-00005-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,3,14]],"date-time":"2023-03-14T00:24:43Z","timestamp":1678753483000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44244-023-00005-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3,14]]},"references-count":47,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["5"],"URL":"https:\/\/doi.org\/10.1007\/s44244-023-00005-3","relation":{},"ISSN":["2731-667X"],"issn-type":[{"value":"2731-667X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,3,14]]},"assertion":[{"value":"26 September 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 January 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 March 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors do not have competing interests to 
declare.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"1"}}