{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,12]],"date-time":"2026-04-12T17:45:39Z","timestamp":1776015939274,"version":"3.50.1"},"reference-count":98,"publisher":"Springer Science and Business Media LLC","issue":"14","license":[{"start":{"date-parts":[[2022,10,21]],"date-time":"2022-10-21T00:00:00Z","timestamp":1666310400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,10,21]],"date-time":"2022-10-21T00:00:00Z","timestamp":1666310400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2021YFB2802100"],"award-info":[{"award-number":["2021YFB2802100"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61862061"],"award-info":[{"award-number":["61862061"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62061045"],"award-info":[{"award-number":["62061045"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Multimed Tools Appl"],"published-print":{"date-parts":[[2023,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Object detection is one of the most important problems in computer vision. Since AlexNet was proposed, methods based on Convolutional Neural Networks (CNNs) have become mainstream in the computer vision field, and much research on neural networks and on different transformations of algorithm structures has appeared. 
Achieving fast and accurate detection requires stepping outside the existing CNN framework, which poses great challenges. The Transformer\u2019s relatively mature theoretical foundation and technological development in the field of Natural Language Processing have brought it to researchers\u2019 attention, and it has been shown that Transformer-based methods can be applied to computer vision tasks and even surpass existing CNN methods on some of them. To help more researchers better understand the development of object detection methods, the existing methods and frameworks, the challenging problems and the development trends, this paper introduces the classic CNN-based object detection methods and discusses the highlights, advantages and disadvantages of these algorithms. Drawing on a large body of literature, the paper compares different CNN detection methods and Transformer detection methods. Under fair conditions, 13 detection methods that have had a broad impact on the field and are among the most mainstream and promising are selected for comparison. The comparative data give us confidence in the development of the Transformer and in the convergence between different methods. The paper also presents recent innovative approaches to applying the Transformer to computer vision tasks. 
In the end, the challenges, opportunities and future prospects of this field are summarized.<\/jats:p>","DOI":"10.1007\/s11042-022-13801-3","type":"journal-article","created":{"date-parts":[[2022,10,21]],"date-time":"2022-10-21T05:02:45Z","timestamp":1666328565000},"page":"21353-21383","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":171,"title":["A survey: object detection methods from CNN to transformer"],"prefix":"10.1007","volume":"82","author":[{"given":"Ershat","family":"Arkin","sequence":"first","affiliation":[]},{"given":"Nurbiya","family":"Yadikar","sequence":"additional","affiliation":[]},{"given":"Xuebin","family":"Xu","sequence":"additional","affiliation":[]},{"given":"Alimjan","family":"Aysa","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7566-6494","authenticated-orcid":false,"given":"Kurban","family":"Ubul","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,10,21]]},"reference":[{"key":"13801_CR1","doi-asserted-by":"publisher","unstructured":"Arkin E, Yadikar N, Muhtar Y, Ubul K (2021) \"A Survey of Object Detection Based on CNN and Transformer,\" 2021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML), pp. 99\u2013108, https:\/\/doi.org\/10.1109\/PRML52754.2021.9520732.","DOI":"10.1109\/PRML52754.2021.9520732"},{"key":"13801_CR2","doi-asserted-by":"publisher","unstructured":"Bochkovskiy, A, Wang, CY, Liao, HYM (2020) Yolov4: Optimal speed and accuracy of object detection. https:\/\/doi.org\/10.48550\/arXiv.2004.10934.","DOI":"10.48550\/arXiv.2004.10934"},{"key":"13801_CR3","doi-asserted-by":"publisher","unstructured":"Brock, A, Donahue, J, Simonyan, K (2018) Large scale GAN training for high fidelity natural image synthesis. 
https:\/\/doi.org\/10.48550\/arXiv.1809.11096.","DOI":"10.48550\/arXiv.1809.11096"},{"key":"13801_CR4","doi-asserted-by":"publisher","unstructured":"Cai, Z, Fan, Q, Feris, RS, Vasconcelos, N (2016) A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds) Computer Vision \u2013 ECCV 2016. ECCV 2016. Lecture notes in computer science(), vol 9908. Springer, Cham. https:\/\/doi.org\/10.1007\/978-3-319-46493-0_22.","DOI":"10.1007\/978-3-319-46493-0_22"},{"key":"13801_CR5","doi-asserted-by":"publisher","unstructured":"Cao Y, Chen K, Loy CC, Lin D (2020) \"Prime Sample Attention in Object Detection,\" 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11580\u201311588, https:\/\/doi.org\/10.1109\/CVPR42600.2020.01160.","DOI":"10.1109\/CVPR42600.2020.01160"},{"key":"13801_CR6","doi-asserted-by":"publisher","unstructured":"Carion, N, Massa, F, Synnaeve, G, Usunier, N, Kirillov, A, Zagoruyko, S (2020) End-to-End Object Detection with Transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision \u2013 ECCV 2020. ECCV 2020. Lecture notes in computer science(), vol 12346. Springer, Cham. https:\/\/doi.org\/10.1007\/978-3-030-58452-8_13.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"13801_CR7","doi-asserted-by":"publisher","unstructured":"Chen K et al. (2019) \"Hybrid Task Cascade for Instance Segmentation,\" 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4969\u20134978, https:\/\/doi.org\/10.1109\/CVPR.2019.00511.","DOI":"10.1109\/CVPR.2019.00511"},{"key":"13801_CR8","doi-asserted-by":"publisher","unstructured":"Chen C, Liu M, Meng X, Xiao W, Ju Q (2020) \"RefineDetLite: A Lightweight One-stage Object Detection Framework for CPU-only Devices,\" 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 
2997\u20133007, https:\/\/doi.org\/10.1109\/CVPRW50498.2020.00358.","DOI":"10.1109\/CVPRW50498.2020.00358"},{"key":"13801_CR9","unstructured":"Chen, M, et al. (2020) \u201cGenerative Pretraining From Pixels.\u201d ICML 2020: 37th International Conference on Machine Learning, vol. 1, 2020, pp. 1691\u20131703"},{"key":"13801_CR10","unstructured":"Cheng, B, Schwing, A, Kirillov, A (2021) Per-pixel classification is not all you need for semantic segmentation Advances in Neural Information Processing Systems, 34"},{"key":"13801_CR11","unstructured":"Chu, X, et al. (2021) \"Twins: Revisiting the design of spatial attention in vision transformers.\" Advances in Neural Information Processing Systems 34 (NeurIPS 2021)"},{"key":"13801_CR12","doi-asserted-by":"publisher","unstructured":"Chu, X, Tian, Z, Zhang, B, Wang, X, Wei, X, Xia, H, Shen, C (2021) Conditional positional encodings for vision transformers. https:\/\/doi.org\/10.48550\/arXiv.2102.10882.","DOI":"10.48550\/arXiv.2102.10882"},{"key":"13801_CR13","doi-asserted-by":"publisher","unstructured":"Cordonnier, J-B, et al. (2020) \u201cOn the Relationship between Self-Attention and Convolutional Layers.\u201d ICLR 2020\u00a0: Eighth International Conference on Learning Representations. https:\/\/doi.org\/10.48550\/arXiv.1911.03584","DOI":"10.48550\/arXiv.1911.03584"},{"key":"13801_CR14","unstructured":"Dai J, Li Y, He K, Sun J. (2016) R-FCN: object detection via region-based fully convolutional networks. In proceedings of the 30th international conference on neural information processing systems (NIPS'16). Curran associates Inc., red hook, NY, USA, 379\u2013387"},{"key":"13801_CR15","doi-asserted-by":"publisher","unstructured":"Dalal N, Triggs B (2005) \"Histograms of oriented gradients for human detection,\" 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pp. 886\u2013893 vol. 
1, https:\/\/doi.org\/10.1109\/CVPR.2005.177.","DOI":"10.1109\/CVPR.2005.177"},{"key":"13801_CR16","doi-asserted-by":"publisher","unstructured":"Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) \"ImageNet: A large-scale hierarchical image database,\" 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248\u2013255, https:\/\/doi.org\/10.1109\/CVPR.2009.5206848.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"13801_CR17","doi-asserted-by":"publisher","unstructured":"Dong, X, Bao, J, Chen, D, Zhang, W, Yu, N, Yuan, L, ..., Guo, B. (2021) Cswin transformer: A general vision transformer backbone with cross-shaped windows. https:\/\/doi.org\/10.48550\/arXiv.2107.0065.","DOI":"10.48550\/arXiv.2107.0065"},{"key":"13801_CR18","doi-asserted-by":"publisher","unstructured":"Dosovitskiy, A, et al. (2020) \u201cAn Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.\u201d https:\/\/doi.org\/10.48550\/arXiv.2010.11929.","DOI":"10.48550\/arXiv.2010.11929"},{"key":"13801_CR19","doi-asserted-by":"publisher","unstructured":"Duan K, Bai S, Xie L, Qi H, Huang Q, Tian Q (2019) \"CenterNet: Keypoint Triplets for Object Detection,\" 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), pp. 6568\u20136577, https:\/\/doi.org\/10.1109\/ICCV.2019.00667.","DOI":"10.1109\/ICCV.2019.00667"},{"issue":"2","key":"13801_CR20","doi-asserted-by":"publisher","first-page":"303","DOI":"10.1007\/s11263-009-0275-4","volume":"88","author":"M Everingham","year":"2010","unstructured":"Everingham M et al (2010) The Pascal Visual Object Classes (VOC) Challenge. Int J Comput Vis 88(2):303\u2013338","journal-title":"Int J Comput Vis"},{"issue":"1","key":"13801_CR21","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1007\/s11263-014-0733-5","volume":"111","author":"M Everingham","year":"2015","unstructured":"Everingham M et al (2015) The Pascal Visual Object Classes Challenge: A Retrospective. 
Int J Comput Vis 111(1):98\u2013136","journal-title":"Int J Comput Vis"},{"key":"13801_CR22","doi-asserted-by":"publisher","unstructured":"Fang, Y, Liao, B, Wang, X, Fang, J, Qi, J, Wu, R, ..., Liu, W (2021) You only look at one sequence: rethinking transformer in vision through object detection. Adv Neural Inf Proces Syst, 34. https:\/\/doi.org\/10.48550\/arXiv.2106.00666","DOI":"10.48550\/arXiv.2106.00666"},{"key":"13801_CR23","doi-asserted-by":"publisher","unstructured":"Fu, CY, Liu, W, Ranga, A, Tyagi, A, Berg, AC (2017) Dssd: Deconvolutional single shot detector. https:\/\/doi.org\/10.48550\/arXiv.1701.06659.","DOI":"10.48550\/arXiv.1701.06659"},{"key":"13801_CR24","doi-asserted-by":"publisher","unstructured":"Ge, Z, Liu, S, Wang, F, Li, Z, Sun, J (2021) Yolox: Exceeding yolo series in 2021. https:\/\/doi.org\/10.48550\/arXiv.2107.08430.","DOI":"10.48550\/arXiv.2107.08430"},{"key":"13801_CR25","doi-asserted-by":"publisher","unstructured":"Girshick R (2015) \"Fast R-CNN,\" 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440\u20131448, https:\/\/doi.org\/10.1109\/ICCV.2015.169.","DOI":"10.1109\/ICCV.2015.169"},{"key":"13801_CR26","doi-asserted-by":"publisher","unstructured":"Girshick R, Donahue J, Darrell T, Malik J (2014) \"Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,\" 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580\u2013587, https:\/\/doi.org\/10.1109\/CVPR.2014.81.","DOI":"10.1109\/CVPR.2014.81"},{"key":"13801_CR27","unstructured":"Han, K, et al. (2021) \"Transformer in transformer.\" Advances in Neural Information Processing Systems 34 (NeurIPS 2021)"},{"key":"13801_CR28","doi-asserted-by":"publisher","unstructured":"Hassani, A, Walton, S, Li, J, Li, S, Shi, H (2022) Neighborhood Attention Transformer. 
https:\/\/doi.org\/10.48550\/arXiv.2106.03146.","DOI":"10.48550\/arXiv.2106.03146"},{"issue":"9","key":"13801_CR29","doi-asserted-by":"publisher","first-page":"1904","DOI":"10.1109\/TPAMI.2015.2389824","volume":"37","author":"K He","year":"2015","unstructured":"He K et al (2015) Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904\u20131916","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"2","key":"13801_CR30","doi-asserted-by":"publisher","first-page":"386","DOI":"10.1109\/TPAMI.2018.2844175","volume":"42","author":"K He","year":"2020","unstructured":"He K et al (2020) Mask R-CNN. IEEE Trans Pattern Anal Mach Intell 42(2):386\u2013397","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"13801_CR31","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/LGRS.2021.3103069","volume":"19","author":"M Hong","year":"2022","unstructured":"Hong M, Li S, Yang Y, Zhu F, Zhao Q, Lu L (2022, Art no 8018505) SSPNet: Scale Selection Pyramid Network for Tiny Person Detection From UAV Images. IEEE Geosci Remote Sens Lett 19:1\u20135. https:\/\/doi.org\/10.1109\/LGRS.2021.3103069","journal-title":"IEEE Geosci Remote Sens Lett"},{"key":"13801_CR32","doi-asserted-by":"publisher","unstructured":"Howard, AG, Zhu, M, Chen, B, Kalenichenko, D, Wang, W, Weyand, T, ..., Adam, H. (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. https:\/\/doi.org\/10.48550\/arXiv.1704.04861.","DOI":"10.48550\/arXiv.1704.04861"},{"key":"13801_CR33","doi-asserted-by":"publisher","unstructured":"Howard A et al. (2019) \"Searching for MobileNetV3,\" 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), pp. 
1314\u20131324, https:\/\/doi.org\/10.1109\/ICCV.2019.00140.","DOI":"10.1109\/ICCV.2019.00140"},{"key":"13801_CR34","doi-asserted-by":"publisher","unstructured":"Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) \"Densely Connected Convolutional Networks,\" 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261\u20132269, https:\/\/doi.org\/10.1109\/CVPR.2017.243.","DOI":"10.1109\/CVPR.2017.243"},{"key":"13801_CR35","doi-asserted-by":"publisher","unstructured":"Iandola, FN, Han, S, Moskewicz, MW, Ashraf, K, Dally, WJ, Keutzer, K (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 0.5 MB model size. https:\/\/doi.org\/10.48550\/arXiv.1602.07360.","DOI":"10.48550\/arXiv.1602.07360"},{"key":"13801_CR36","unstructured":"Jiang, Y, Chang, S, Wang, Z (2021) Transgan: two pure transformers can make one strong Gan, and that can scale up. Adv Neural Inf Proces Syst, 34"},{"issue":"10","key":"13801_CR37","doi-asserted-by":"publisher","first-page":"2896","DOI":"10.1109\/TCSVT.2017.2736553","volume":"28","author":"K Kang","year":"2018","unstructured":"Kang K, Li H, Yan J, Zeng X, Yang B, Xiao T, Zhang C, Wang Z, Wang R, Wang X, Ouyang W (Oct. 2018) T-CNN: Tubelets with convolutional neural networks for object detection from videos. IEEE Trans Circuits Syst Vid Technol 28(10):2896\u20132907. https:\/\/doi.org\/10.1109\/TCSVT.2017.2736553","journal-title":"IEEE Trans Circuits Syst Vid Technol"},{"key":"13801_CR38","doi-asserted-by":"publisher","unstructured":"Karlinsky L et al. (2019) \"RepMet: Representative-Based Metric Learning for Classification and Few-Shot Object Detection,\" 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 
5192\u20135201, https:\/\/doi.org\/10.1109\/CVPR.2019.00534.","DOI":"10.1109\/CVPR.2019.00534"},{"key":"13801_CR39","doi-asserted-by":"publisher","first-page":"1956","DOI":"10.1007\/s11263-020-01316-z","volume":"128","author":"A Kuznetsova","year":"2020","unstructured":"Kuznetsova A, Rom H, Alldrin N, Uijlings J, Krasin I, Pont-Tuset J, Kamali S, Popov S, Malloci M, Kolesnikov A, Duerig T, Ferrari V (2020) The open images dataset V4. Int J Comput Vis 128:1956\u20131981. https:\/\/doi.org\/10.1007\/s11263-020-01316-z","journal-title":"Int J Comput Vis"},{"key":"13801_CR40","doi-asserted-by":"publisher","first-page":"642","DOI":"10.1007\/s11263-019-01204-1","volume":"128","author":"H Law","year":"2020","unstructured":"Law H, Deng J (2020) CornerNet: detecting objects as paired Keypoints. Int J Comput Vis 128:642\u2013656. https:\/\/doi.org\/10.1007\/s11263-019-01204-1","journal-title":"Int J Comput Vis"},{"key":"13801_CR41","doi-asserted-by":"publisher","unstructured":"Li Y, Li J, Lin W, Li J (2018) Tiny-DSOD: lightweight object detection for resource-restricted usages. https:\/\/doi.org\/10.48550\/arXiv.1807.11013","DOI":"10.48550\/arXiv.1807.11013"},{"key":"13801_CR42","doi-asserted-by":"publisher","unstructured":"Li Y, Chen Y, Wang N, Zhang Z-X (2019) \"Scale-Aware Trident Networks for Object Detection,\" 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), pp. 6053\u20136062, https:\/\/doi.org\/10.1109\/ICCV.2019.00615.","DOI":"10.1109\/ICCV.2019.00615"},{"key":"13801_CR43","doi-asserted-by":"publisher","unstructured":"Liang T, Chu X, Liu Y, Wang Y, Tang Z, Chu W, ... Ling H (2021) Cbnetv2: a composite backbone network architecture for object detection. https:\/\/doi.org\/10.48550\/arXiv.2107.00420","DOI":"10.48550\/arXiv.2107.00420"},{"key":"13801_CR44","doi-asserted-by":"publisher","unstructured":"Lin, TY. et al. (2014) Microsoft COCO: Common Objects in Context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. 
(eds) Computer Vision \u2013 ECCV 2014. ECCV 2014. Lecture notes in computer science, vol 8693. Springer, Cham. https:\/\/doi.org\/10.1007\/978-3-319-10602-1_48.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"13801_CR45","doi-asserted-by":"publisher","unstructured":"Lin T-Y, Doll\u00e1r P, Girshick R, He K, Hariharan B, Belongie S (2017)\u00a0\u201cFeature pyramid networks for object detection,\u201d\u00a02017\u00a0IEEE conference on computer vision and pattern recognition (CVPR), pp 936\u2013944.\u00a0https:\/\/doi.org\/10.1109\/CVPR.2017.106","DOI":"10.1109\/CVPR.2017.106"},{"issue":"2","key":"13801_CR46","doi-asserted-by":"publisher","first-page":"318","DOI":"10.1109\/TPAMI.2018.2858826","volume":"42","author":"T-Y Lin","year":"2020","unstructured":"Lin T-Y, Goyal P, Girshick R, He K, Doll\u00e1r P (2020) Focal Loss for Dense Object Detection. IEEE Trans Pattern Anal Mach Intell 42(2):318\u2013327. https:\/\/doi.org\/10.1109\/TPAMI.2018.2858826","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"13801_CR47","doi-asserted-by":"publisher","unstructured":"Liu, W et al. (2016) SSD: Single Shot MultiBox Detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds) Computer Vision \u2013 ECCV 2016. ECCV 2016. Lecture notes in computer science(), vol 9905. Springer, Cham. 
https:\/\/doi.org\/10.1007\/978-3-319-46448-0_2.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"13801_CR48","doi-asserted-by":"publisher","unstructured":"Liu S, Johns E, Davison AJ (2019)\u00a0\u201cEnd-to-end multi-task learning with attention,\u201d\u00a02019\u00a0IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp\u00a01871\u20131880.\u00a0https:\/\/doi.org\/10.1109\/CVPR.2019.00197","DOI":"10.1109\/CVPR.2019.00197"},{"key":"13801_CR49","doi-asserted-by":"publisher","first-page":"261","DOI":"10.1007\/s11263-019-01247-4","volume":"128","author":"L Liu","year":"2020","unstructured":"Liu L, Ouyang W, Wang X, Fieguth P, Chen J, Liu X, Pietik\u00e4inen M (2020) Deep learning for generic object detection: a survey. Int J Comput Vis 128:261\u2013318. https:\/\/doi.org\/10.1007\/s11263-019-01247-4","journal-title":"Int J Comput Vis"},{"issue":"07","key":"13801_CR50","doi-asserted-by":"publisher","first-page":"11685","DOI":"10.1609\/aaai.v34i07.6838","volume":"34","author":"Z Liu","year":"2020","unstructured":"Liu Z, Zheng T, Xu G, Yang Z, Liu H, Cai D (2020) Training-time-friendly network for real-time object detection. Proceedings of the AAAI Conference on Artificial Intelligence 34(07):11685\u201311692. https:\/\/doi.org\/10.1609\/aaai.v34i07.6838","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"13801_CR51","doi-asserted-by":"publisher","unstructured":"Liu Z et al. (2021) \"Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,\" IEEE\/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992\u201310002, https:\/\/doi.org\/10.1109\/ICCV48922.2021.00986.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"13801_CR52","doi-asserted-by":"publisher","unstructured":"Liu, Z, Mao, H, Wu, CY, Feichtenhofer, C, Darrell, T, Xie, S (2022) A ConvNet for the 2020s. 
https:\/\/doi.org\/10.48550\/arXiv.2201.03545.","DOI":"10.48550\/arXiv.2201.03545"},{"key":"13801_CR53","doi-asserted-by":"publisher","unstructured":"Ma C, Huang J-B, Yang X, Yang M-H (2015) \"Hierarchical Convolutional Features for Visual Tracking,\" 2015 IEEE International Conference on Computer Vision (ICCV), pp. 3074\u20133082, https:\/\/doi.org\/10.1109\/ICCV.2015.352.","DOI":"10.1109\/ICCV.2015.352"},{"key":"13801_CR54","doi-asserted-by":"publisher","unstructured":"Ma, N, Zhang, X, Zheng, HT, Sun, J (2018) ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds) Computer Vision \u2013 ECCV 2018. ECCV 2018. Lecture notes in computer science(), vol 11218. Springer, Cham. https:\/\/doi.org\/10.1007\/978-3-030-01264-9_8.","DOI":"10.1007\/978-3-030-01264-9_8"},{"key":"13801_CR55","doi-asserted-by":"publisher","first-page":"107149","DOI":"10.1016\/j.patcog.2019.107149","volume":"100","author":"W Ma","year":"2020","unstructured":"Ma W et al (2020) MDFN: Multi-Scale Deep Feature Learning Network for Object Detection. Pattern Recog 100:107149","journal-title":"Pattern Recog"},{"key":"13801_CR56","doi-asserted-by":"publisher","unstructured":"Ma, T, Mao, M, Zheng, H, Gao, P, Wang, X, Han, S, ..., Doermann, D. (2021) Oriented object detection with transformer. https:\/\/doi.org\/10.48550\/arXiv.2106.03146.","DOI":"10.48550\/arXiv.2106.03146"},{"key":"13801_CR57","doi-asserted-by":"publisher","unstructured":"Mehta, S, Rastegari M (n.d.) \"Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer.\" https:\/\/doi.org\/10.48550\/arXiv.2110.02178.","DOI":"10.48550\/arXiv.2110.02178"},{"key":"13801_CR58","doi-asserted-by":"publisher","unstructured":"Newell, A, Yang, K, Deng, J (2016) Stacked Hourglass Networks for Human Pose Estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds) Computer Vision \u2013 ECCV 2016. ECCV 2016. 
Lecture notes in computer science(), vol 9912. Springer, Cham https:\/\/doi.org\/10.1007\/978-3-319-46484-8_29.","DOI":"10.1007\/978-3-319-46484-8_29"},{"key":"13801_CR59","doi-asserted-by":"publisher","unstructured":"Pang J, Chen K, Shi J, Feng H, Ouyang W, Lin D (2019) \u201cLibra R-CNN: towards balanced learning for object detection,\u201d\u00a02019 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 821\u2013830.\u00a0https:\/\/doi.org\/10.1109\/CVPR.2019.00091","DOI":"10.1109\/CVPR.2019.00091"},{"key":"13801_CR60","doi-asserted-by":"publisher","unstructured":"Peng Z et al. (2021) \"Conformer: Local Features Coupling Global Representations for Visual Recognition,\" 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), pp. 357\u2013366, https:\/\/doi.org\/10.1109\/ICCV48922.2021.00042.","DOI":"10.1109\/ICCV48922.2021.00042"},{"key":"13801_CR61","doi-asserted-by":"publisher","unstructured":"Qiao S, Chen L-C, Yuille A (2021) \"DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution,\" 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10208\u201310219, https:\/\/doi.org\/10.1109\/CVPR46437.2021.01008.","DOI":"10.1109\/CVPR46437.2021.01008"},{"key":"13801_CR62","doi-asserted-by":"publisher","unstructured":"Qin Z et al. (2019) \"ThunderNet: Towards Real-Time Generic Object Detection on Mobile Devices,\" 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), pp. 6717\u20136726, https:\/\/doi.org\/10.1109\/ICCV.2019.00682.","DOI":"10.1109\/ICCV.2019.00682"},{"key":"13801_CR63","doi-asserted-by":"publisher","unstructured":"Qiu H et al. (2021) \"CrossDet: Crossline Representation for Object Detection,\" 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), pp. 
3175\u20133184, https:\/\/doi.org\/10.1109\/ICCV48922.2021.00318.","DOI":"10.1109\/ICCV48922.2021.00318"},{"key":"13801_CR64","doi-asserted-by":"publisher","first-page":"2979","DOI":"10.1007\/s11263-020-01355-6","volume":"128","author":"S Rahman","year":"2020","unstructured":"Rahman S, Khan SH, Porikli F (2020) Zero-shot object detection: joint recognition and localization of novel concepts. Int J Comput Vis 128:2979\u20132999. https:\/\/doi.org\/10.1007\/s11263-020-01355-6","journal-title":"Int J Comput Vis"},{"key":"13801_CR65","doi-asserted-by":"publisher","unstructured":"Redmon J, Farhadi A (2017) \"YOLO9000: Better, Faster, Stronger,\" 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517\u20136525, https:\/\/doi.org\/10.1109\/CVPR.2017.690.","DOI":"10.1109\/CVPR.2017.690"},{"key":"13801_CR66","doi-asserted-by":"publisher","unstructured":"Redmon, J, Farhadi A (n.d.) \u201cYOLOv3: An Incremental Improvement.\u201d https:\/\/doi.org\/10.48550\/arXiv.1804.02767.","DOI":"10.48550\/arXiv.1804.02767"},{"key":"13801_CR67","doi-asserted-by":"publisher","unstructured":"Redmon J, Divvala S, Girshick R, Farhadi A (2016) \"You Only Look Once: Unified, Real-Time Object Detection,\" 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779\u2013788, https:\/\/doi.org\/10.1109\/CVPR.2016.91.","DOI":"10.1109\/CVPR.2016.91"},{"issue":"6","key":"13801_CR68","doi-asserted-by":"publisher","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","volume":"39","author":"S Ren","year":"2017","unstructured":"Ren S et al (2017) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137\u20131149","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"13801_CR69","doi-asserted-by":"publisher","unstructured":"Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. 
In: Navab N, Hornegger J, Wells W, Frangi A (eds) Medical image computing and computer-assisted intervention \u2013 MICCAI\u00a02015. MICCAI\u00a02015. Lecture notes in computer science, vol 9351. Springer, Cham. https:\/\/doi.org\/10.1007\/978-3-319-24574-4_28","DOI":"10.1007\/978-3-319-24574-4_28"},{"issue":"3","key":"13801_CR70","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","volume":"115","author":"O Russakovsky","year":"2015","unstructured":"Russakovsky O et al (2015) ImageNet Large Scale Visual Recognition Challenge. Int J Comput Vis 115(3):211\u2013252","journal-title":"Int J Comput Vis"},{"key":"13801_CR71","doi-asserted-by":"publisher","unstructured":"Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C (2018) \"MobileNetV2: Inverted Residuals and Linear Bottlenecks,\" 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510\u20134520, https:\/\/doi.org\/10.1109\/CVPR.2018.00474.","DOI":"10.1109\/CVPR.2018.00474"},{"key":"13801_CR72","doi-asserted-by":"publisher","unstructured":"Shen Z, Liu Z, Li J, Jiang Y, Chen Y, Xue X (2017) \"DSOD: learning deeply supervised object detectors from scratch,\" 2017 IEEE international conference on computer vision (ICCV), pp. 1937-1945, https:\/\/doi.org\/10.1109\/ICCV.2017.212.","DOI":"10.1109\/ICCV.2017.212"},{"key":"13801_CR73","doi-asserted-by":"publisher","unstructured":"Shrivastava A, Gupta A, Girshick R (2016) \"Training Region-Based Object Detectors with Online Hard Example Mining,\" 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 761\u2013769, https:\/\/doi.org\/10.1109\/CVPR.2016.89.","DOI":"10.1109\/CVPR.2016.89"},{"key":"13801_CR74","unstructured":"Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015). 
OpenReview.net, : 1\u201314"},{"key":"13801_CR75","doi-asserted-by":"publisher","first-page":"19753","DOI":"10.1007\/s11042-021-10711-8","volume":"80","author":"S Singh","year":"2021","unstructured":"Singh S, Ahuja U, Kumar M, Kumar K, Sachdeva M (2021) Face mask detection using YOLOv3 and faster R-CNN models: COVID-19 environment. Multimed Tools Appl 80:19753\u201319768. https:\/\/doi.org\/10.1007\/s11042-021-10711-8","journal-title":"Multimed Tools Appl"},{"key":"13801_CR76","doi-asserted-by":"publisher","first-page":"2852","DOI":"10.3390\/s21082852","volume":"21","author":"PN Srinivasu","year":"2021","unstructured":"Srinivasu PN, SivaSai JG, Ijaz MF, Bhoi AK, Kim W, Kang JJ (2021) Classification of skin disease using deep learning neural networks with MobileNet V2 and LSTM. Sensors 21:2852. https:\/\/doi.org\/10.3390\/s21082852","journal-title":"Sensors"},{"key":"13801_CR77","doi-asserted-by":"publisher","unstructured":"Tan, M, Le Q (2019) \"Efficientnet: rethinking model scaling for convolutional neural networks.\" International conference on machine learning. PMLR, https:\/\/doi.org\/10.48550\/arXiv.1905.11946","DOI":"10.48550\/arXiv.1905.11946"},{"key":"13801_CR78","doi-asserted-by":"publisher","unstructured":"Tan M et al. (2019) \"MnasNet: Platform-Aware Neural Architecture Search for Mobile,\" 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2815\u20132823, https:\/\/doi.org\/10.1109\/CVPR.2019.00293.","DOI":"10.1109\/CVPR.2019.00293"},{"key":"13801_CR79","doi-asserted-by":"publisher","unstructured":"Tan M, Pang R, Le QV (2020) \"EfficientDet: Scalable and Efficient Object Detection,\" 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10778\u201310787, https:\/\/doi.org\/10.1109\/CVPR42600.2020.01079.","DOI":"10.1109\/CVPR42600.2020.01079"},{"key":"13801_CR80","unstructured":"Touvron, H, et al. 
(2021) \u201cTraining Data-Efficient Image Transformers & Distillation through Attention.\u201d ICML 2021: 38th International Conference on Machine Learning, pp. 10347\u201310357."},{"issue":"2","key":"13801_CR81","doi-asserted-by":"publisher","first-page":"154","DOI":"10.1007\/s11263-013-0620-5","volume":"104","author":"JR Uijlings","year":"2013","unstructured":"Uijlings JR et al (2013) Selective search for object recognition. Int J Comput Vis 104(2):154\u2013171","journal-title":"Int J Comput Vis"},{"key":"13801_CR82","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I. 2017 Attention is all you need. In proceedings of the 31st international conference on neural information processing systems (NIPS'17). Curran associates Inc., red hook, NY, USA, 6000\u20136010"},{"key":"13801_CR83","doi-asserted-by":"publisher","unstructured":"Viola P, Jones M (2001) \"Rapid object detection using a boosted cascade of simple features,\" proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, pp. 511\u2013518, https:\/\/doi.org\/10.1109\/CVPR.2001.990517.","DOI":"10.1109\/CVPR.2001.990517"},{"key":"13801_CR84","doi-asserted-by":"publisher","first-page":"2988","DOI":"10.3390\/s22082988","volume":"22","author":"A Vulli","year":"2022","unstructured":"Vulli A, Srinivasu PN, Sashank MSK, Shafi J, Choi J, Ijaz MF (2022) Fine-tuned DenseNet-169 for breast Cancer metastasis prediction using FastAI and 1-cycle policy. Sensors 22:2988. https:\/\/doi.org\/10.3390\/s22082988","journal-title":"Sensors"},{"key":"13801_CR85","doi-asserted-by":"publisher","unstructured":"Wan F, Liu C, Ke W, Ji X, Jiao J, Ye Q (2019) \"C-MIL: Continuation Multiple Instance Learning for Weakly Supervised Object Detection,\" 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 
2194\u20132203, https:\/\/doi.org\/10.1109\/CVPR.2019.00230.","DOI":"10.1109\/CVPR.2019.00230"},{"key":"13801_CR86","unstructured":"Wang RJ et al (2018) \u201cPelee: a real-time object detection system on mobile devices.\u201d NIPS\u201918 Proceedings of the 32nd International Conference on Neural Information Processing Systems, vol 31, pp 1967\u20131976"},{"key":"13801_CR87","doi-asserted-by":"publisher","unstructured":"Wang W et al (2021) \u201cPyramid vision transformer: a versatile backbone for dense prediction without convolutions,\u201d 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), pp 548\u2013558. https:\/\/doi.org\/10.1109\/ICCV48922.2021.00061","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"13801_CR88","doi-asserted-by":"publisher","unstructured":"Wang Y, Huang R, Song S, Huang Z, Gao H (n.d.) Not All Images Are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length. Adv Neural Inf Process Syst 34. https:\/\/doi.org\/10.48550\/arXiv.2105.15075","DOI":"10.48550\/arXiv.2105.15075"},{"key":"13801_CR89","doi-asserted-by":"publisher","unstructured":"Xie S, Girshick R, Doll\u00e1r P, Tu Z, He K (2017) \"Aggregated Residual Transformations for Deep Neural Networks,\" 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5987\u20135995, https:\/\/doi.org\/10.1109\/CVPR.2017.634.","DOI":"10.1109\/CVPR.2017.634"},{"key":"13801_CR90","unstructured":"Xie E, Wang W, Yu Z, Anandkumar A, Alvarez JM, Luo P (2021) SegFormer: simple and efficient design for semantic segmentation with transformers. Adv Neural Inf Process Syst 34"},{"key":"13801_CR91","doi-asserted-by":"publisher","unstructured":"Xiong Y et al. (2021) \"MobileDets: Searching for Object Detection Architectures for Mobile Accelerators,\" 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 
3824\u20133833, https:\/\/doi.org\/10.1109\/CVPR46437.2021.00382.","DOI":"10.1109\/CVPR46437.2021.00382"},{"key":"13801_CR92","doi-asserted-by":"publisher","unstructured":"Yang J, Li C, Zhang P, Dai X, Xiao B, Yuan L, Gao J (2021) Focal self-attention for local-global interactions in vision transformers. https:\/\/doi.org\/10.48550\/arXiv.2107.00641.","DOI":"10.48550\/arXiv.2107.00641"},{"key":"13801_CR93","doi-asserted-by":"publisher","unstructured":"Yin T, Zhou X, Kr\u00e4henb\u00fchl P (2021) \"Center-based 3D Object Detection and Tracking,\" 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11779\u201311788, https:\/\/doi.org\/10.1109\/CVPR46437.2021.01161.","DOI":"10.1109\/CVPR46437.2021.01161"},{"key":"13801_CR94","doi-asserted-by":"publisher","unstructured":"Zeiler MD, Fergus R (2014) Visualizing and Understanding Convolutional Networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision \u2013 ECCV 2014. ECCV 2014. Lecture notes in computer science, vol 8689. Springer, Cham. https:\/\/doi.org\/10.1007\/978-3-319-10590-1_53.","DOI":"10.1007\/978-3-319-10590-1_53"},{"key":"13801_CR95","doi-asserted-by":"publisher","unstructured":"Zhang X, Zhou X, Lin M, Sun J (2018) \"ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,\" 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 6848\u20136856, https:\/\/doi.org\/10.1109\/CVPR.2018.00716.","DOI":"10.1109\/CVPR.2018.00716"},{"key":"13801_CR96","doi-asserted-by":"publisher","unstructured":"Zhou P, Ni B, Geng C, Hu J, Xu Y (2018) \"Scale-Transferrable Object Detection,\" 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 528\u2013537, https:\/\/doi.org\/10.1109\/CVPR.2018.00062.","DOI":"10.1109\/CVPR.2018.00062"},{"key":"13801_CR97","doi-asserted-by":"publisher","unstructured":"Zhou X, Koltun V, Kr\u00e4henb\u00fchl P (2021) Probabilistic two-stage detection. 
https:\/\/doi.org\/10.48550\/arXiv.2103.07461.","DOI":"10.48550\/arXiv.2103.07461"},{"key":"13801_CR98","unstructured":"Zhu X, Su W, Lu L, Li B, Wang X, Dai J (2020) Deformable DETR: deformable transformers for end-to-end object detection. In Proc. ICLR, 2021 Oral, pp. 1\u201316"}],"container-title":["Multimedia Tools and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11042-022-13801-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11042-022-13801-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11042-022-13801-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,25]],"date-time":"2023-05-25T09:16:51Z","timestamp":1685006211000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11042-022-13801-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,21]]},"references-count":98,"journal-issue":{"issue":"14","published-print":{"date-parts":[[2023,6]]}},"alternative-id":["13801"],"URL":"https:\/\/doi.org\/10.1007\/s11042-022-13801-3","relation":{},"ISSN":["1380-7501","1573-7721"],"issn-type":[{"value":"1380-7501","type":"print"},{"value":"1573-7721","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,10,21]]},"assertion":[{"value":"11 March 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 July 2022","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 September 2022","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article 
History"}},{"value":"21 October 2022","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interests"}}]}}