{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T16:33:03Z","timestamp":1775579583740,"version":"3.50.1"},"reference-count":60,"publisher":"MDPI AG","issue":"24","license":[{"start":{"date-parts":[[2024,12,12]],"date-time":"2024-12-12T00:00:00Z","timestamp":1733961600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62101256"],"award-info":[{"award-number":["62101256"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["BE2022391"],"award-info":[{"award-number":["BE2022391"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100013058","name":"Jiangsu Provincial Key Research and Development Program","doi-asserted-by":"publisher","award":["62101256"],"award-info":[{"award-number":["62101256"]}],"id":[{"id":"10.13039\/501100013058","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100013058","name":"Jiangsu Provincial Key Research and Development Program","doi-asserted-by":"publisher","award":["BE2022391"],"award-info":[{"award-number":["BE2022391"]}],"id":[{"id":"10.13039\/501100013058","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Data augmentation methods offer a cost-effective and efficient alternative to the acquisition of additional data, significantly enhancing data diversity and model generalization, making them particularly favored in object detection tasks. 
However, existing data augmentation techniques primarily focus on the visible spectrum and are directly applied to RGB-T object detection tasks, overlooking the inherent differences in image data between the two tasks. Visible images capture rich color and texture information during the daytime, while infrared images are capable of imaging under low-light complex scenarios during the nighttime. By integrating image information from both modalities, their complementary characteristics can be exploited to improve the overall effectiveness of data augmentation methods. To address this, we propose a cross-modality data augmentation method tailored for RGB-T object detection, leveraging masked image modeling within representation learning. Specifically, we focus on the temporal consistency of infrared images and combine them with visible images under varying lighting conditions for joint data augmentation, thereby enhancing the realism of the augmented images. Utilizing the masked image modeling method, we reconstruct images by integrating multimodal features, achieving cross-modality data augmentation in feature space. Additionally, we investigate the differences and complementarities between data augmentation methods in data space and feature space. Building upon existing theoretical foundations, we propose an integrative framework that combines these methods for improved augmentation effectiveness. Furthermore, we address the slow convergence observed with the existing Mosaic method in aerial imagery by introducing a multi-scale training strategy and proposing a full-scale Mosaic method as a complement. This optimization significantly accelerates network convergence. 
The experimental results validate the effectiveness of our proposed method and highlight its potential for further advancements in cross-modality object detection tasks.<\/jats:p>","DOI":"10.3390\/rs16244649","type":"journal-article","created":{"date-parts":[[2024,12,12]],"date-time":"2024-12-12T03:52:49Z","timestamp":1733975569000},"page":"4649","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Cross-Modality Data Augmentation for Aerial Object Detection with Representation Learning"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0211-2922","authenticated-orcid":false,"given":"Chiheng","family":"Wei","sequence":"first","affiliation":[{"name":"The School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China"}]},{"given":"Lianfa","family":"Bai","sequence":"additional","affiliation":[{"name":"The School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1026-2824","authenticated-orcid":false,"given":"Xiaoyu","family":"Chen","sequence":"additional","affiliation":[{"name":"The School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China"}]},{"given":"Jing","family":"Han","sequence":"additional","affiliation":[{"name":"The School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,12,12]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Wong, S.C., Gatt, A., Stamatescu, V., and McDonnell, M.D. (December, January 30). Understanding data augmentation for classification: When to warp?
Proceedings of the 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Gold Coast, Australia.","DOI":"10.1109\/DICTA.2016.7797091"},{"key":"ref_2","unstructured":"Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. (2020, January 7\u201312). Random erasing data augmentation. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA."},{"key":"ref_3","unstructured":"Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., and Yoo, Y. (November, January 27). Cutmix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_4","unstructured":"Chen, P., Liu, S., Zhao, H., Wang, X., and Jia, J. (2020). Gridmask data augmentation. arXiv."},{"key":"ref_5","unstructured":"Bochkovskiy, A., Wang, C.Y., and Liao, H.Y.M. (2020). Yolov4: Optimal speed and accuracy of object detection. arXiv."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Georgievski, B. (2019, January 17\u201319). Image augmentation with neural style transfer. Proceedings of the International Conference on ICT Innovations, Ohrid, Macedonia.","DOI":"10.1007\/978-3-030-33110-8_18"},{"key":"ref_7","unstructured":"Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Adv. Neural Inf. Process. Syst., 27."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T., Cubuk, E.D., Le, Q.V., and Zoph, B. (2021, January 20\u201325). Simple copy-paste is a strong data augmentation method for instance segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00294"},{"key":"ref_9","unstructured":"Bao, H., Dong, L., Piao, S., and Wei, F. (2021). 
Beit: Bert pre-training of image transformers. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"He, K., Chen, X., Xie, S., Li, Y., Doll\u00e1r, P., and Girshick, R. (2022, January 18\u201324). Masked autoencoders are scalable vision learners. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Bachmann, R., Mizrahi, D., Atanov, A., and Zamir, A. (2022, January 23\u201327). Multimae: Multi-modal multi-task masked autoencoders. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-19836-6_20"},{"key":"ref_12","first-page":"6840","article-title":"Denoising diffusion probabilistic models","volume":"33","author":"Ho","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_13","unstructured":"Tian, Y., Fan, L., Isola, P., Chang, H., and Krishnan, D. (2024). Stablerep: Synthetic images from text-to-image models make strong visual representation learners. Adv. Neural Inf. Process. Syst., 36."},{"key":"ref_14","unstructured":"Azizi, S., Kornblith, S., Saharia, C., Norouzi, M., and Fleet, D.J. (2023). Synthetic data from diffusion models improves imagenet classification. arXiv."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Wu, Y., Wang, Z., Zeng, D., Shi, Y., and Hu, J. (2023, January 7\u201314). Synthetic data can also teach: Synthesizing effective data for unsupervised visual representation learning. Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA.","DOI":"10.1609\/aaai.v37i3.25388"},{"key":"ref_16","unstructured":"Wang, Y., Zhang, J., and Wang, Y. (2024). Do Generated Data Always Help Contrastive Learning? arXiv."},{"key":"ref_17","unstructured":"Ren, S. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. 
arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Lin, T. (2017). Focal Loss for Dense Object Detection. arXiv.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Cai, Z., and Vasconcelos, N. (2018, January 18\u201323). Cascade r-cnn: Delving into high quality object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00644"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Tan, M., Pang, R., and Le, Q.V. (2020, January 13\u201319). Efficientdet: Scalable and efficient object detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01079"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Cao, Z., Yang, H., Zhao, J., Guo, S., and Li, L. (2021). Attention fusion for one-stage multispectral pedestrian detection. Sensors, 21.","DOI":"10.3390\/s21124184"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Zhang, H., Fromont, E., Lef\u00e8vre, S., and Avignon, B. (2021, January 5\u20139). Guided attentive feature fusion for multispectral pedestrian detection. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Virtual.","DOI":"10.1109\/WACV48630.2021.00012"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1016\/j.patcog.2018.08.005","article-title":"Illumination-aware faster R-CNN for robust multispectral pedestrian detection","volume":"85","author":"Li","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_24","unstructured":"Zhou, K., Chen, L., and Cao, X. (2020). Improving multispectral pedestrian detection by addressing modality imbalance problems. Computer Vision\u2013ECCV 2020: Proceedings of the 16th European Conference, Glasgow, UK, 23\u201328 August 2020, Springer International Publishing. 
Proceedings, Part XVIII 16."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Chen, Y.T., Shi, J., Ye, Z., Mertz, C., Ramanan, D., and Kong, S. (2022, January 23\u201327). Multimodal object detection via probabilistic ensembling. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-20077-9_9"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"6700","DOI":"10.1109\/TCSVT.2022.3168279","article-title":"Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning","volume":"32","author":"Sun","year":"2022","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Bao, C., Cao, J., Hao, Q., Cheng, Y., Ning, Y., and Zhao, T. (2023). Dual-YOLO architecture from infrared and visible images for object detection. Sensors, 23.","DOI":"10.3390\/s23062934"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"102246","DOI":"10.1016\/j.inffus.2024.102246","article-title":"Improving RGB-infrared object detection with cascade alignment-guided transformer","volume":"105","author":"Yuan","year":"2024","journal-title":"Inf. Fusion"},{"key":"ref_29","unstructured":"DeVries, T. (2017). Improved Regularization of Convolutional Neural Networks with Cutout. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Kisantal, M. (2019). Augmentation for Small Object Detection. arXiv.","DOI":"10.5121\/csit.2019.91713"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Bongini, F., Berlincioni, L., Bertini, M., and Del Bimbo, A. (2021, January 20\u201324). Partially fake it till you make it: Mixing real and fake thermal images for improved object detection. Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China.","DOI":"10.1145\/3474085.3475679"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Zhu, J.Y., Park, T., Isola, P., and Efros, A.A. 
(2017, January 22\u201329). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.244"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Lu, Y., and Lu, G. (2021, January 16\u201319). Bridging the invisible and visible world: Translation between rgb and ir images through contour cycle gan. Proceedings of the 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Washington, DC, USA.","DOI":"10.1109\/AVSS52988.2021.9663750"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"15808","DOI":"10.1109\/TITS.2022.3145476","article-title":"Thermal infrared image colorization for nighttime driving scenes with top-down guided attention","volume":"23","author":"Luo","year":"2022","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"15841","DOI":"10.1109\/TITS.2024.3442871","article-title":"Memory-Guided Collaborative Attention for Nighttime Thermal Infrared Image Colorization of Traffic Scenes","volume":"25","author":"Luo","year":"2024","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"key":"ref_36","first-page":"1","article-title":"DR-AVIT: Towards Diverse and Realistic Aerial Visible-to-Infrared Image Translation","volume":"62","author":"Han","year":"2024","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Hao, X., Zhu, Y., Appalaraju, S., Zhang, A., Zhang, W., Li, B., and Li, M. (2023, January 2\u20137). Mixgen: A new multi-modal data augmentation. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACVW58289.2023.00042"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Wang, H., Lin, G., Hoi, S., and Miao, C. (2022, January 10\u201314). 
Paired cross-modal data augmentation for fine-grained image-to-text retrieval. Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal.","DOI":"10.1145\/3503161.3547809"},{"key":"ref_39","unstructured":"Lee, D., Liu, S., Gu, J., Liu, M.Y., Yang, M.H., and Kautz, J. (2018). Context-aware synthesis and placement of object instances. Adv. Neural Inf. Process. Syst., 31."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., and Catanzaro, B. (2018, January 18\u201323). High-resolution image synthesis and semantic manipulation with conditional gans. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00917"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Li, J., Chen, C., and Xiong, Z. (2022, January 18\u201324). Contextual outpainting with object-level contrastive learning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01116"},{"key":"ref_42","unstructured":"Dosovitskiy, A. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_43","unstructured":"Devlin, J. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., and Hu, H. (2022, January 18\u201324). Simmim: A simple framework for masked image modeling. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00943"},{"key":"ref_45","unstructured":"Xu, H., Ding, S., Zhang, X., Xiong, H., and Tian, Q. (2022). Masked autoencoders are robust data augmentors. 
arXiv."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Isola, P., Zhu, J.Y., Zhou, T., and Efros, A.A. (2017, January 21\u201326). Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.632"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"127449","DOI":"10.1016\/j.neucom.2024.127449","article-title":"Infrared colorization with cross-modality zero-shot learning","volume":"579","author":"Wei","year":"2024","journal-title":"Neurocomputing"},{"key":"ref_48","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., and Doll\u00e1r, P. (2014). Microsoft coco: Common objects in context. Computer Vision\u2013ECCV 2014: Proceedings of the 13th European Conference, Zurich, Switzerland, 6\u201312 September 2014, Springer International Publishing. Proceedings, Part V 13."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Yang, X., Zhang, G., Wang, J., Liu, Y., Hou, L., Jiang, X., Liu, X., Yan, J., and Lyu, C. (2022, January 10\u201314). Mmrotate: A rotated object detection benchmark using pytorch. Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal.","DOI":"10.1145\/3503161.3548541"},{"key":"ref_50","unstructured":"Ding, J., Xue, N., Long, Y., Xia, G.S., and Lu, Q. (November, January 27). Learning RoI transformer for oriented object detection in aerial images. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seoul, Republic of Korea."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Liu, J., Zhang, S., Wang, S., and Metaxas, D.N. (2016). Multispectral deep neural networks for pedestrian detection. 
arXiv.","DOI":"10.5244\/C.30.73"},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"187","DOI":"10.1016\/j.jvcir.2015.11.002","article-title":"Vehicle detection in aerial imagery: A small target detection benchmark","volume":"34","author":"Razakarivony","year":"2016","journal-title":"J. Vis. Commun. Image Represent."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Hore, A., and Ziou, D. (2010, January 23\u201326). Image quality metrics: PSNR vs. SSIM. Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey.","DOI":"10.1109\/ICPR.2010.579"},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"600","DOI":"10.1109\/TIP.2003.819861","article-title":"Image quality assessment: From error visibility to structural similarity","volume":"13","author":"Wang","year":"2004","journal-title":"IEEE Trans. Image Process."},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Zhang, H. (2017). mixup: Beyond empirical risk minimization. arXiv.","DOI":"10.1007\/978-1-4899-7687-1_79"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Yuan, M., Wang, Y., and Wei, X. (2022, January 23\u201327). Translation, scale and rotation: Cross-modal alignment meets RGB-infrared vehicle detection. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-20077-9_30"},{"key":"ref_57","first-page":"1","article-title":"C2 Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection","volume":"62","author":"Yuan","year":"2024","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"47773","DOI":"10.1007\/s11042-023-15333-w","article-title":"SLBAF-Net: Super-Lightweight bimodal adaptive fusion network for UAV detection in low recognition environment","volume":"82","author":"Cheng","year":"2023","journal-title":"Multimed. 
Tools Appl."},{"key":"ref_59","unstructured":"Zhou, M., Li, T., Qiao, C., Xie, D., Wang, G., Ruan, N., Mei, L., and Yang, Y. (2024). DMM: Disparity-guided Multispectral Mamba for Oriented Object Detection in Remote Sensing. arXiv."},{"key":"ref_60","doi-asserted-by":"crossref","first-page":"108786","DOI":"10.1016\/j.patcog.2022.108786","article-title":"Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery","volume":"130","author":"Qingyun","year":"2022","journal-title":"Pattern Recognit."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/16\/24\/4649\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:53:19Z","timestamp":1760115199000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/16\/24\/4649"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,12]]},"references-count":60,"journal-issue":{"issue":"24","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["rs16244649"],"URL":"https:\/\/doi.org\/10.3390\/rs16244649","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,12,12]]}}}