{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,22]],"date-time":"2025-11-22T11:09:30Z","timestamp":1763809770248,"version":"build-2065373602"},"reference-count":50,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2017,11,24]],"date-time":"2017-11-24T00:00:00Z","timestamp":1511481600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Vehicle detection in aerial images is an important and challenging task. Traditionally, many target detection models based on sliding-window fashion were developed and achieved acceptable performance, but these models are time-consuming in the detection phase. Recently, with the great success of convolutional neural networks (CNNs) in computer vision, many state-of-the-art detectors have been designed based on deep CNNs. However, these CNN-based detectors are inefficient when applied in aerial image data due to the fact that the existing CNN-based models struggle with small-size object detection and precise localization. To improve the detection accuracy without decreasing speed, we propose a CNN-based detection model combining two independent convolutional neural networks, where the first network is applied to generate a set of vehicle-like regions from multi-feature maps of different hierarchies and scales. Because the multi-feature maps combine the advantage of the deep and shallow convolutional layer, the first network performs well on locating the small targets in aerial image data. Then, the generated candidate regions are fed into the second network for feature extraction and decision making. Comprehensive experiments are conducted on the Vehicle Detection in Aerial Imagery (VEDAI) dataset and Munich vehicle dataset. 
The proposed cascaded detection model yields high performance, not only in detection accuracy but also in detection speed.<\/jats:p>","DOI":"10.3390\/s17122720","type":"journal-article","created":{"date-parts":[[2017,11,24]],"date-time":"2017-11-24T06:39:25Z","timestamp":1511505565000},"page":"2720","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":75,"title":["Robust Vehicle Detection in Aerial Images Based on Cascaded Convolutional Neural Networks"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5686-7955","authenticated-orcid":false,"given":"Jiandan","family":"Zhong","sequence":"first","affiliation":[{"name":"Institute of Optics and Electronics, Chinese Academy of Sciences, No. 1, Guangdian Avenue, Chengdu 610209, China"},{"name":"School of Optoelectronic Information, University of Electronic Science and Technology of China, No. 4, Section 2, North Jianshe Road, Chengdu 610054, China"},{"name":"University of Chinese Academy of Sciences, 19 A Yuquan Rd, Shijingshan District, Beijing 100039, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0900-1582","authenticated-orcid":false,"given":"Tao","family":"Lei","sequence":"additional","affiliation":[{"name":"Institute of Optics and Electronics, Chinese Academy of Sciences, No. 1, Guangdian Avenue, Chengdu 610209, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guangle","family":"Yao","sequence":"additional","affiliation":[{"name":"Institute of Optics and Electronics, Chinese Academy of Sciences, No. 1, Guangdian Avenue, Chengdu 610209, China"},{"name":"School of Optoelectronic Information, University of Electronic Science and Technology of China, No. 
4, Section 2, North Jianshe Road, Chengdu 610054, China"},{"name":"University of Chinese Academy of Sciences, 19 A Yuquan Rd, Shijingshan District, Beijing 100039, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2017,11,24]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"5817","DOI":"10.1007\/s11042-015-2520-x","article-title":"Vehicle detection and recognition for intelligent traffic surveillance system","volume":"76","author":"Tang","year":"2017","journal-title":"Multimedia Tools Appl."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"508","DOI":"10.1109\/TCSVT.2014.2358031","article-title":"Efficient Feature Selection and Classification for Vehicle Detection","volume":"25","author":"Wen","year":"2015","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_3","unstructured":"Xu, H., Zhou, Z., Sheng, B., and Ma, L. (2013, January 19\u201323). Fast vehicle detection based on feature and real-time prediction. Proceedings of the IEEE International Symposium on Circuits & Systems, Beijing, China."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"197","DOI":"10.1007\/s10043-015-0067-8","article-title":"Vision-based multi-scaled vehicle detection and distance relevant mix tracking for driver assistance system","volume":"22","author":"Gu","year":"2015","journal-title":"Opt. Rev."},{"key":"ref_5","unstructured":"Dalal, N., and Triggs, B. (2005, January 20\u201325). Histograms of oriented gradients for human detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1023\/B:VISI.0000029664.99615.94","article-title":"Distinctive image features from scale-invariant keypoints","volume":"60","author":"Lowe","year":"2004","journal-title":"Int. J. Comput. 
Vis."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"389","DOI":"10.1145\/1961189.1961199","article-title":"LIBSVM: A library for support vector machines","volume":"2","author":"Chang","year":"2011","journal-title":"ACM Trans. Intell. Syst. Technol."},{"key":"ref_8","unstructured":"Viola, P., and Jones, M. (2001, January 8\u201314). Rapid object detection using a boosted cascade of simple features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1627","DOI":"10.1109\/TPAMI.2009.167","article-title":"Object detection with discriminatively trained part based models","volume":"32","author":"Felzenszwalb","year":"2010","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"154","DOI":"10.1007\/s11263-013-0620-5","article-title":"Selective search for object recognition","volume":"104","author":"Uijlings","year":"2013","journal-title":"Int. J. Comput. Vis."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Cheng, M., Zhang, Z., Lin, W., and Torr, P. (2014, January 23\u201328). BING: Binarized Normed Gradients for Objectness Estimation at 300 fps. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.414"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"2189","DOI":"10.1109\/TPAMI.2012.28","article-title":"Measuring the objectness of image windows","volume":"34","author":"Alexe","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_13","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G. (2012, January 3\u20136). ImageNet classification with deep convolutional neural networks. 
Proceedings of the 25th International Conference on Neural Information Processing System, Lake Tahoe, NV, USA."},{"key":"ref_14","unstructured":"Deng, J., Berg, A., Satheesh, S., Su, H., Khosla, A., and Li, F. (2017, July 10). ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012). Available online: http:\/\/www.image-net.org\/challenges\/LSVRC\/2012."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_16","unstructured":"Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. (arXiv, 2013). OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1904","DOI":"10.1109\/TPAMI.2015.2389824","article-title":"Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition","volume":"37","author":"He","year":"2015","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015, January 7\u201313). Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.169"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., and Zitnick, L. (2014, January 6\u201312). 
Microsoft COCO: Common Objects in Context. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_21","unstructured":"Simonyan, K., and Zisserman, A. (arXiv, 2014). Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Ghodrati, A., Pedersoli, M., Tuytelaars, T., Diba, A., and Gool, L. (2015, January 7\u201313). Deepproposal: Hunting objects by cascading deep convolutional layers. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.296"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1312","DOI":"10.1109\/TPAMI.2011.231","article-title":"CPMC: Automatic object segmentation using constrained parametric min-cuts","volume":"34","author":"Carreira","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Zitnick, C., and Dollar, P. (2014, January 6\u201312). Edge boxes: Locating object proposals from edges. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10602-1_26"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"814","DOI":"10.1109\/TPAMI.2015.2465908","article-title":"What makes for effective detection proposals?","volume":"38","author":"Hosang","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Chavali, N., Agrawal, H., Mahendru, A., and Batra, D. (2016, January 27\u201330). Object-Proposal Evaluation Protocol is \u2018Gameable\u2019. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.97"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Arbel\u00e1ez, P., Pont-Tuset, J., Barron, J., Marques, F., and Malik, J. 
(2014, January 23\u201328). Multiscale combinatorial grouping. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.49"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Kuo, W., Hariharan, B., and Malik, J. (2015, January 7\u201313). Deepbox: Learning objectness with convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.285"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"2274","DOI":"10.1109\/TPAMI.2012.120","article-title":"SLIC superpixels compared to state-of-the-art superpixel methods","volume":"34","author":"Achanta","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Vedaldi, A., and Soatto, S. (2008, January 12\u201318). Quick shift and kernel methods for mode seeking. Proceedings of the European Conference on Computer Vision, Marseille, France.","DOI":"10.1007\/978-3-540-88693-8_52"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Veksler, O., Boykov, Y., and Mehrani, P. (2010, January 5\u201311). Superpixels and supervoxels in an energy optimization framework. Proceedings of the European Conference on Computer Vision, Heraklion, Crete, Greece.","DOI":"10.1007\/978-3-642-15555-0_16"},{"key":"ref_32","first-page":"1","article-title":"SEEDS: Superpixels Extracted via Energy-Driven Sampling","volume":"7578","author":"Bergh","year":"2013","journal-title":"Int. J. Comput. Vis."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Sohn, K., Villegas, R., Pan, G., and Lee, H. (2015, January 7\u201312). Improving object detection with deep convolutional networks via bayesian optimization and structured prediction. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298621"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Erhan, D., Szegedy, C., Toshev, A., and Anguelov, D. (2014, January 23\u201328). Scalable object detection using deep neural networks. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.276"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You only look once: Unified, real-time object detection. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Xu, Y., Yu, G., Wang, Y., Wu, X., and Ma, Y. (2016). A Hybrid Vehicle Detection Method Based on Viola-Jones and HOG + SVM from UAV Images. Sensors, 16.","DOI":"10.3390\/s16081325"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Ammour, N., Alhichri, H., Bazi, Y., Benjdira, B., and Alajlan, N. (2017). Deep Learning Approach for Car Detection in UAV Imagery. 
Remote Sens., 9.","DOI":"10.3390\/rs9040312"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"21651","DOI":"10.1007\/s11042-016-4043-5","article-title":"Vehicle detection from high-resolution aerial images using spatial pyramid pooling-based deep convolutional neural networks","volume":"76","author":"Qu","year":"2016","journal-title":"Multimedia Tools Appl."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Tang, T., Zhou, S., Deng, Z., Zou, H., and Lei, L. (2017). Vehicle Detection in Aerial Images Based on Region Convolutional Neural Networks and Hard Negative Example Mining. Sensors, 17.","DOI":"10.3390\/s17020336"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"3652","DOI":"10.1109\/JSTARS.2017.2694890","article-title":"Toward Fast and Accurate Vehicle Detection in Aerial Images Using Coupled Region-Based Convolutional Neural Networks","volume":"10","author":"Deng","year":"2017","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"187","DOI":"10.1016\/j.jvcir.2015.11.002","article-title":"Vehicle detection in aerial imagery: A small target detection benchmark","volume":"34","author":"Razakarivony","year":"2016","journal-title":"J. Vis. Commun. Image Represent."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"1938","DOI":"10.1109\/LGRS.2015.2439517","article-title":"Fast Multiclass Vehicle Detection on Aerial Images","volume":"12","author":"Liu","year":"2015","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Kong, T., Yao, A., Chen, Y., and Sun, F. (2016, January 27\u201330). Hypernet: Towards accurate region proposal generation and joint object detection. 
Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.98"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"541","DOI":"10.1162\/neco.1989.1.4.541","article-title":"Backpropagation applied to handwritten zip code recognition","volume":"1","author":"LeCun","year":"1989","journal-title":"Neural Comput."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., and Long, J. (2014, January 3\u20137). Caffe: Convolutional Architecture for Fast Feature Embedding. Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA.","DOI":"10.1145\/2647868.2654889"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"303","DOI":"10.1007\/s11263-009-0275-4","article-title":"The Pascal Visual Object Classes (VOC) Challenge","volume":"88","author":"Everingham","year":"2010","journal-title":"Int. J. Comput. Vis."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Lipton, Z., Elkan, C., and Naryanaswamy, B. (2014, January 15\u201319). Optimal Thresholding of Classifiers to Maximize F1 Measure. Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Nancy, France.","DOI":"10.1007\/978-3-662-44851-9_15"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Zeiler, M., and Fergus, R. (2014, January 6\u201312). Visualizing and Understanding Convolutional Networks. 
Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10590-1_53"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"11315","DOI":"10.3390\/rs61111315","article-title":"An operational system for estimating road traffic information from aerial images","volume":"6","author":"Leitloff","year":"2014","journal-title":"Remote Sens."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/17\/12\/2720\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T18:51:04Z","timestamp":1760208664000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/17\/12\/2720"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,11,24]]},"references-count":50,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2017,12]]}},"alternative-id":["s17122720"],"URL":"https:\/\/doi.org\/10.3390\/s17122720","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2017,11,24]]}}}