{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,9]],"date-time":"2026-04-09T20:28:26Z","timestamp":1775766506351,"version":"3.50.1"},"reference-count":65,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2022,2,14]],"date-time":"2022-02-14T00:00:00Z","timestamp":1644796800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Provincial Natural Science Foundation Project","award":["ZR2021MC099"],"award-info":[{"award-number":["ZR2021MC099"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>There has been substantial progress in small object detection in aerial images in recent years, due to the extensive applications and improved performances of convolutional neural networks (CNNs). Typically, traditional machine learning algorithms tend to prioritize inference speed over accuracy. Insufficient samples can cause problems for convolutional neural networks, such as instability, non-convergence, and overfitting. Additionally, detecting aerial images has inherent challenges, such as varying altitudes and illuminance situations, and blurred and dense objects, resulting in low detection accuracy. As a result, this paper adds a transformer backbone attention mechanism as a branch network, using the region-wide feature information. This paper also employs a generative model to expand the input aerial images ahead of the backbone. The respective advantages of the generative model and transformer network are incorporated. On the dataset presented in this study, the model achieves 96.77% precision, 98.83% recall, and 97.91% mAP by adding the Multi-GANs module to the one-stage detection network. These three indices are enhanced by 13.9%, 20.54%, and 10.27%, respectively, when compared to the other detection networks. Furthermore, this study provides an auto-pruning technique that may achieve 32.2 FPS inference speed with a minor performance loss while responding to the real-time detection task\u2019s usage environment. This research also develops a macOS application for the proposed algorithm using Swift development technology.<\/jats:p>","DOI":"10.3390\/rs14040923","type":"journal-article","created":{"date-parts":[[2022,2,14]],"date-time":"2022-02-14T20:58:03Z","timestamp":1644872283000},"page":"923","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":39,"title":["GANsformer: A Detection Network for Aerial Images with High Performance Combining Convolutional Network and Transformer"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2428-9211","authenticated-orcid":false,"given":"Yan","family":"Zhang","sequence":"first","affiliation":[{"name":"College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3034-0199","authenticated-orcid":false,"given":"Xi","family":"Liu","sequence":"additional","affiliation":[{"name":"College of Humanities and Development, China Agricultural University, Beijing 100083, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2949-5492","authenticated-orcid":false,"given":"Shiyun","family":"Wa","sequence":"additional","affiliation":[{"name":"College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6488-9297","authenticated-orcid":false,"given":"Shuyu","family":"Chen","sequence":"additional","affiliation":[{"name":"College of Engineering, China Agricultural University, Beijing 100083, China"}]},{"given":"Qin","family":"Ma","sequence":"additional","affiliation":[{"name":"College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,2,14]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1875","DOI":"10.1111\/2041-210X.13277","article-title":"Improving the precision and accuracy of animal population estimates with aerial image object detection","volume":"10","author":"Eikelboom","year":"2019","journal-title":"Methods Ecol. Evol."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Xiao, Z., Wang, K., Wan, Q., Tan, X., Xu, C., and Xia, F. (2021). A2S-Det: Efficiency Anchor Matching in Aerial Image Oriented Object Detection. Remote. Sens., 13.","DOI":"10.3390\/rs13010073"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Chen, C., Zhong, J., and Tan, Y. (2019). Multiple-oriented and small object detection with convolutional neural networks for aerial image. Remote. Sens., 11.","DOI":"10.3390\/rs11182176"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Wang, Y., Zorzi, S., and Bittner, K. (2021, January 19\u201325). Machine-learned 3D Building Vectorization from Satellite Imagery. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPRW53098.2021.00118"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/s11042-021-11344-7","article-title":"Visual object tracking using similarity transformation and adaptive optical flow","volume":"volume","author":"Abbasi","year":"2021","journal-title":"Multimed. Tools Appl."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Liu, M., Wang, X., Zhou, A., Fu, X., Ma, Y., and Piao, C. (2020). UAV-YOLO: Small object detection on unmanned aerial vehicle perspective. Sensors, 20.","DOI":"10.3390\/s20082238"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Zhang, W., Tang, P., and Zhao, L. (2019). Remote sensing image scene classification using CNN-CapsNet. Remote. Sens., 11.","DOI":"10.3390\/rs11050494"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Pham, M.T., Courtrai, L., Friguet, C., Lef\u00e8vre, S., and Baussard, A. (2020). YOLO-Fine: One-stage detector of small objects under various backgrounds in remote sensing images. Remote. Sens., 12.","DOI":"10.3390\/rs12152501"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Dollar, P., and Girshick, R. (2017, January 22\u201329). Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.322"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"He, J., Deng, Z., Zhou, L., Wang, Y., and Qiao, Y. (2019, January 15\u201320). Adaptive pyramid context network for semantic segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00770"},{"key":"ref_12","first-page":"3189691","article-title":"An evaluation of deep learning methods for small object detection","volume":"2020","author":"Nguyen","year":"2020","journal-title":"J. Electr. Comput. Eng."},{"key":"ref_13","first-page":"4546896","article-title":"Small object detection with multiscale features","volume":"2018","author":"Hu","year":"2018","journal-title":"Int. J. Digit. Multimed. Broadcast."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Liu, C., Wu, Y., Liu, J., and Han, J. (2021). MTI-YOLO: A Light-Weight and Real-Time Deep Neural Network for Insulator Detection in Complex Aerial Images. Energies, 14.","DOI":"10.3390\/en14051426"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Courtrai, L., Pham, M.T., and Lef\u00e8vre, S. (2020). Small Object Detection in Remote Sensing Images Based on Super-Resolution with Auxiliary Generative Adversarial Networks. Remote. Sens., 12.","DOI":"10.3390\/rs12193152"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Rabbi, J., Ray, N., Schubert, M., Chowdhury, S., and Chao, D. (2020). Small-Object Detection in Remote Sensing Images with End-to-End Edge-Enhanced GAN and Object Detector Network. Remote. Sens., 12.","DOI":"10.20944\/preprints202003.0313.v2"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Xu, D., and Wu, Y. (2020). Improved YOLO-V3 with DenseNet for multi-scale remote sensing target detection. Sensors, 20.","DOI":"10.3390\/s20154276"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Avola, D., Cinque, L., Diko, A., Fagioli, A., Foresti, G.L., Mecca, A., Pannone, D., and Piciarelli, C. (2021). MS-Faster R-CNN: Multi-stream backbone for improved Faster R-CNN object detection and aerial tracking from UAV images. Remote. Sens., 13.","DOI":"10.3390\/rs13091670"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"56214","DOI":"10.1109\/ACCESS.2021.3072067","article-title":"Toward efficient object detection in aerial images using extreme scale metric learning","volume":"9","author":"Jin","year":"2021","journal-title":"IEEE Access"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"244","DOI":"10.1016\/j.iatssr.2019.11.008","article-title":"Deep learning-based image recognition for autonomous driving","volume":"43","author":"Fujiyoshi","year":"2019","journal-title":"IATSS Res."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"4324878","DOI":"10.1155\/2019\/4324878","article-title":"Is deep learning for image recognition applicable to stock market prediction?","volume":"2019","author":"Sim","year":"2019","journal-title":"Complexity"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"104","DOI":"10.1109\/TRPMS.2019.2899538","article-title":"Machine (deep) learning methods for image processing and radiomics","volume":"3","author":"Hatt","year":"2019","journal-title":"IEEE Trans. Radiat. Plasma Med Sci."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"04006","DOI":"10.1051\/matecconf\/202133504006","article-title":"Feast In: A Machine Learning Image Recognition Model of Recipe and Lifestyle Applications","volume":"335","author":"Ann","year":"2021","journal-title":"MATEC Web Conf. EDP Sci."},{"key":"ref_24","unstructured":"Gu, H., Wen, F., Wang, B., Lee, A.K., and Xu, D. (2019). Machine Learning-Based Image Recognition for Visual Inspections, SNAME Maritime Convention."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., and Berg, A.C. (2016, January 8\u201316). Ssd: Single shot multibox detector. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_26","unstructured":"Li, Z., and Zhou, F. (2017). FSSD: Feature fusion single shot multibox detector. arXiv."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Zhang, S., Wen, L., Bian, X., Lei, Z., and Li, S.Z. (2018, January 18\u201323). Single-shot refinement neural network for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00442"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_29","unstructured":"Bochkovskiy, A., Wang, C.Y., and Liao, H.Y.M. (2020). Yolov4: Optimal speed and accuracy of object detection. arXiv."},{"key":"ref_30","unstructured":"Jocher, G. (2022, January 17). Yolov5. Available online: https:\/\/github.com\/ultralytics\/yolov5."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Tan, M., Pang, R., and Le, Q.V. (2020, January 14\u201319). Efficientdet: Scalable and efficient object detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01079"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Wa, S., Liu, Y., Zhou, X., Sun, P., and Ma, Q. (2021). High-Accuracy Detection of Maize Leaf Diseases CNN Based on Multi-Pathway Activation Function Module. Remote. Sens., 13.","DOI":"10.3390\/rs13214218"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Zhang, Y., He, S., Wa, S., Zong, Z., and Liu, Y. (2021). Using Generative Module and Pruning Inference for the Fast and Accurate Detection of Apple Flower in Natural Environments. Information, 12.","DOI":"10.3390\/info12120495"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Wa, S., Sun, P., and Wang, Y. (2021). Pear Defect Detection Method Based on ResNet and DCGAN. Information, 12.","DOI":"10.3390\/info12100397"},{"key":"ref_35","unstructured":"Wu, H., Zhang, J., Huang, K., Liang, K., and Yu, Y. (2019). Fastfcn: Rethinking dilated convolution in the backbone for semantic segmentation. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"1707","DOI":"10.1002\/mp.13416","article-title":"Deeply supervised 3D fully convolutional networks with group dilated convolution for automatic MRI prostate segmentation","volume":"46","author":"Wang","year":"2019","journal-title":"Med. Phys."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Li, X., Shen, X., Zhou, Y., Wang, X., and Li, T.Q. (2020). Classification of breast cancer histopathological images using interleaved DenseNet with SENet (IDSNet). PLoS ONE, 15.","DOI":"10.1371\/journal.pone.0232127"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Wang, S.H., Fernandes, S., Zhu, Z., and Zhang, Y.D. (2021). AVNC: Attention-based VGG-style network for COVID-19 diagnosis by CBAM. IEEE Sensors J.","DOI":"10.1109\/JSEN.2021.3062442"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Chen, L., Tian, X., Chai, G., Zhang, X., and Chen, E. (2021). A New CBAM-P-Net Model for Few-Shot Forest Species Classification Using Airborne Hyperspectral Images. Remote. Sens., 13.","DOI":"10.3390\/rs13071269"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"186","DOI":"10.26599\/TST.2020.9010053","article-title":"Can: Effective cross features by global attention mechanism and neural network for ad click prediction","volume":"27","author":"Cai","year":"2021","journal-title":"Tsinghua Sci. Technol."},{"key":"ref_41","first-page":"114272","article-title":"Research for image caption based on global attention mechanism","volume":"Volume 11427","author":"Wu","year":"2020","journal-title":"Proceedings of the Second Target Recognition and Artificial Intelligence Summit Forum"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"012041","DOI":"10.1088\/1742-6596\/1861\/1\/012041","article-title":"GAU-Net: U-Net Based on Global Attention Mechanism for brain tumor segmentation","volume":"1861","author":"Gan","year":"2021","journal-title":"J. Physics Conf. Ser."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Bello, I., Zoph, B., Vaswani, A., Shlens, J., and Le, Q.V. (2019, January 27\u201328). Attention augmented convolutional networks. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00338"},{"key":"ref_44","unstructured":"Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., and Xu, Y. (2020). A survey on visual transformer. arXiv."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Sajid, U., Chen, X., Sajid, H., Kim, T., and Wang, G. (2021, January 11\u201317). Audio-visual transformer based crowd counting. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCVW54120.2021.00254"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Truong, T.D., Duong, C.N., Pham, H.A., Raj, B., Le, N., and Luu, K. (2021, January 11\u201317). The Right to Talk: An Audio-Visual Transformer Approach. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00114"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1016\/j.isprsjprs.2014.10.002","article-title":"Multi-class geospatial object detection and geographic image classification based on collection of part detectors","volume":"98","author":"Cheng","year":"2014","journal-title":"ISPRS J. Photogramm. Remote. Sens."},{"key":"ref_48","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G. (2022, January 17). ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst., Available online: https:\/\/proceedings.neurips.cc\/paper\/2012\/file\/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Zhang, H., Cisse, M., Dauphin, Y.N., and Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv.","DOI":"10.1007\/978-1-4899-7687-1_79"},{"key":"ref_50","unstructured":"DeVries, T., and Taylor, G.W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., and Yoo, Y. (2019, January 27\u201328). Cutmix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00612"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Huang, S., Wang, X., and Tao, D. (2020). SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data. arXiv.","DOI":"10.1609\/aaai.v35i2.16255"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Redmon, J., and Farhadi, A. (2017, January 21\u201326). YOLO9000: Better, faster, stronger. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.690"},{"key":"ref_54","unstructured":"Redmon, J., and Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv."},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014, January 6\u201312). Microsoft coco: Common objects in context. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_56","unstructured":"Everingham, M. (2007). The PASCAL Visual Object Classes Challenge 2007, Springer."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Doll\u00e1r, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017, January 21\u201326). Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.106"},{"key":"ref_58","unstructured":"Arjovsky, M., and Bottou, L. (2017). Towards principled methods for training generative adversarial networks. arXiv."},{"key":"ref_59","unstructured":"Arjovsky, M., Chintala, S., and Bottou, L. (2017, January 6\u201311). Wasserstein generative adversarial networks. Proceedings of the International Conference on Machine Learning Sydney, Sydney, NSW, Australia."},{"key":"ref_60","unstructured":"Mariani, G., Scheidegger, F., Istrate, R., Bekas, C., and Malossi, C. (2018). Bagan: Data augmentation with balancing gan. arXiv."},{"key":"ref_61","unstructured":"Odena, A., Olah, C., and Shlens, J. (2017, January 6\u201311). Conditional image synthesis with auxiliary classifier gans. Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia."},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S. (2019, January 15\u201320). Generalized intersection over union: A metric and a loss for bounding box regression. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00075"},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., and Ren, D. (2020, January 26\u201328). Distance-IoU loss: Faster and better learning for bounding box regression. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.","DOI":"10.1609\/aaai.v34i07.6999"},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (July, January 26). Deep Residual Learning for Image Recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_65","unstructured":"Kaggle (2022, January 17). Global Wheat Detection. Available online: https:\/\/www.kaggle.com\/c\/global-wheat-detection."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/14\/4\/923\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:19:29Z","timestamp":1760134769000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/14\/4\/923"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,2,14]]},"references-count":65,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2022,2]]}},"alternative-id":["rs14040923"],"URL":"https:\/\/doi.org\/10.3390\/rs14040923","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,2,14]]}}}