{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T20:10:37Z","timestamp":1773087037525,"version":"3.50.1"},"reference-count":218,"publisher":"MDPI AG","issue":"21","license":[{"start":{"date-parts":[[2024,11,4]],"date-time":"2024-11-04T00:00:00Z","timestamp":1730678400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Natural Science Foundation of China (NSFC)","award":["62076093"],"award-info":[{"award-number":["62076093"]}]},{"name":"National Natural Science Foundation of China (NSFC)","award":["61871182"],"award-info":[{"award-number":["61871182"]}]},{"name":"National Natural Science Foundation of China (NSFC)","award":["62206095"],"award-info":[{"award-number":["62206095"]}]},{"name":"National Natural Science Foundation of China (NSFC)","award":["2022MS078"],"award-info":[{"award-number":["2022MS078"]}]},{"name":"National Natural Science Foundation of China (NSFC)","award":["2023JG002"],"award-info":[{"award-number":["2023JG002"]}]},{"name":"National Natural Science Foundation of China (NSFC)","award":["2023JC006"],"award-info":[{"award-number":["2023JC006"]}]},{"name":"Fundamental Research Funds for the Central Universities","award":["62076093"],"award-info":[{"award-number":["62076093"]}]},{"name":"Fundamental Research Funds for the Central Universities","award":["61871182"],"award-info":[{"award-number":["61871182"]}]},{"name":"Fundamental Research Funds for the Central Universities","award":["62206095"],"award-info":[{"award-number":["62206095"]}]},{"name":"Fundamental Research Funds for the Central Universities","award":["2022MS078"],"award-info":[{"award-number":["2022MS078"]}]},{"name":"Fundamental Research Funds for the Central Universities","award":["2023JG002"],"award-info":[{"award-number":["2023JG002"]}]},{"name":"Fundamental Research Funds for the Central 
Universities","award":["2023JC006"],"award-info":[{"award-number":["2023JC006"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Remote sensing images contain a wealth of Earth-observation information. Efficient extraction and application of hidden knowledge from these images will greatly promote the development of resource and environment monitoring, urban planning and other related fields. Remote sensing image caption (RSIC) involves obtaining textual descriptions from remote sensing images by accurately capturing and describing the semantic-level relationships between objects and attributes in the images. However, there is currently no comprehensive review summarizing the progress in RSIC based on deep learning. After defining the scope of the papers to be discussed and summarizing them all, the paper begins by providing a comprehensive review of the recent advancements in RSIC, covering six key aspects: encoder\u2013decoder framework, attention mechanism, reinforcement learning, learning with auxiliary tasks, large visual language models and few-shot learning. Subsequently, a brief explanation of the datasets and evaluation metrics for RSIC is given. Furthermore, we compare and analyze the results of the latest models and the pros and cons of different deep learning methods. Lastly, future directions of RSIC are suggested. 
The primary objective of this review is to offer researchers a more profound understanding of RSIC.<\/jats:p>","DOI":"10.3390\/rs16214113","type":"journal-article","created":{"date-parts":[[2024,11,4]],"date-time":"2024-11-04T09:52:54Z","timestamp":1730713974000},"page":"4113","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["A Review of Deep Learning-Based Remote Sensing Image Caption: Methods, Models, Comparisons and Future Directions"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3271-3585","authenticated-orcid":false,"given":"Ke","family":"Zhang","sequence":"first","affiliation":[{"name":"Department of Electronic and Communication Engineering, North China Electric Power University, Baoding 071003, China"},{"name":"Hebei Key Laboratory of Power Internet of Things Technology, North China Electric Power University, Baoding 071003, China"}]},{"given":"Peijie","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Electronic and Communication Engineering, North China Electric Power University, Baoding 071003, China"}]},{"given":"Jianqiang","family":"Wang","sequence":"additional","affiliation":[{"name":"Department of Electronic and Communication Engineering, North China Electric Power University, Baoding 071003, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,11,4]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"120045","DOI":"10.1016\/j.jenvman.2024.120045","article-title":"The key to sustainability: In-depth investigation of environmental quality in G20 countries through the lens of renewable energy, economic complexity and geopolitical risk resilience","volume":"352","author":"Wang","year":"2024","journal-title":"J. Environ. 
Manag."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1814","DOI":"10.1109\/JSTARS.2022.3148139","article-title":"Progress and challenges in intelligent remote sensing satellite systems","volume":"15","author":"Zhang","year":"2022","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"987","DOI":"10.1109\/LGRS.2018.2884087","article-title":"Improving hypersharpening for WorldView-3 data","volume":"16","author":"Selva","year":"2018","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"034515","DOI":"10.1117\/1.JRS.15.034515","article-title":"Quality analysis of Worldview-4 DSMs generated by least squares matching and semiglobal matching","volume":"15","author":"Sefercik","year":"2021","journal-title":"J. Appl. Remote Sens."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"012019","DOI":"10.1088\/1742-6596\/1763\/1\/012019","article-title":"Satellite data receiving antenna system for pleiades neo observation satellite","volume":"1763","author":"Hestrio","year":"2021","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1865","DOI":"10.1109\/JPROC.2017.2675998","article-title":"Remote sensing image scene classification: Benchmark and state of the art","volume":"105","author":"Cheng","year":"2017","journal-title":"Proc. IEEE"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1926","DOI":"10.1109\/LGRS.2020.3011405","article-title":"Remote sensing image scene classification based on an enhanced attention module","volume":"18","author":"Zhao","year":"2020","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_8","first-page":"5607514","article-title":"Remote sensing image change detection with transformers","volume":"60","author":"Chen","year":"2021","journal-title":"IEEE Trans. Geosci. 
Remote Sens."},{"key":"ref_9","first-page":"5604816","article-title":"A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection","volume":"60","author":"Shi","year":"2021","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"116793","DOI":"10.1016\/j.eswa.2022.116793","article-title":"Remote sensing image super-resolution and object detection: Benchmark and state of the art","volume":"197","author":"Wang","year":"2022","journal-title":"Expert Syst. Appl."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1497","DOI":"10.1109\/JSTARS.2020.3041316","article-title":"YOLOrs: Object detection in multimodal remote sensing imagery","volume":"14","author":"Sharma","year":"2020","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_12","first-page":"5607713","article-title":"Multiattention network for semantic segmentation of fine-resolution remote sensing images","volume":"60","author":"Li","year":"2021","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_13","first-page":"5403913","article-title":"Semantic segmentation with attention mechanism for remote sensing images","volume":"60","author":"Zhao","year":"2021","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"7872","DOI":"10.1109\/TGRS.2020.2984703","article-title":"Similarity-based unsupervised deep transfer learning for remote sensing image retrieval","volume":"58","author":"Liu","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"740","DOI":"10.1080\/2150704X.2019.1647368","article-title":"Enhancing remote sensing image retrieval using a triplet deep metric learning network","volume":"41","author":"Cao","year":"2020","journal-title":"Int. J. 
Remote Sens."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"3623","DOI":"10.1109\/TGRS.2017.2677464","article-title":"Can a machine generate humanlike language descriptions for a remote sensing image?","volume":"55","author":"Shi","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"459","DOI":"10.1002\/rob.21756","article-title":"Post-disaster assessment with unmanned aerial vehicles: A survey on practical implementations and research approaches","volume":"35","author":"Recchiuto","year":"2018","journal-title":"J. Field Robot."},{"key":"ref_18","first-page":"20","article-title":"Risk assessment of storm surge disaster based on numerical models and remote sensing","volume":"68","author":"Liu","year":"2018","journal-title":"Int. J. Appl. Earth Obs. Geoinf."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"364","DOI":"10.1016\/j.isprsjprs.2019.11.018","article-title":"Remote sensing algorithms for estimation of fractional vegetation cover using pure vegetation index values: A review","volume":"159","author":"Gao","year":"2020","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"124905","DOI":"10.1016\/j.jhydrol.2020.124905","article-title":"A review of remote sensing applications in agriculture for food security: Crop growth and yield, irrigation, and crop losses","volume":"586","author":"Karthikeyan","year":"2020","journal-title":"J. 
Hydrol."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"3879","DOI":"10.3390\/rs6053879","article-title":"Supporting global environmental change research: A review of trends and knowledge gaps in urban remote sensing","volume":"6","author":"Wentz","year":"2014","journal-title":"Remote Sens."},{"key":"ref_22","first-page":"5603814","article-title":"High-resolution remote sensing image captioning based on structured attention","volume":"60","author":"Zhao","year":"2021","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"389","DOI":"10.5721\/EuJRS20144723","article-title":"A review of remote sensing image classification techniques: The role of spatio-contextual information","volume":"47","author":"Li","year":"2014","journal-title":"Eur. J. Remote Sens."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"232","DOI":"10.1080\/20964471.2019.1657720","article-title":"A survey of remote sensing image classification based on CNNs","volume":"3","author":"Song","year":"2019","journal-title":"Big Earth Data"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"606","DOI":"10.1109\/JSTSP.2011.2139193","article-title":"A survey of active learning algorithms for supervised remote sensing image classification","volume":"5","author":"Tuia","year":"2011","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"309","DOI":"10.1016\/j.isprsjprs.2021.01.020","article-title":"Remote sensing image segmentation advances: A meta-analysis","volume":"173","author":"Kotaridis","year":"2021","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"501","DOI":"10.1080\/07038992.2020.1805729","article-title":"A comprehensive survey of optical remote sensing image segmentation methods","volume":"46","author":"Wang","year":"2020","journal-title":"Can. J. 
Remote Sens."},{"key":"ref_28","first-page":"1667","article-title":"Review of remote sensing image segmentation techniques","volume":"4","author":"Kaur","year":"2015","journal-title":"Int. J. Adv. Res. Comput. Eng. Technol. (IJARCET)"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"126385","DOI":"10.1109\/ACCESS.2020.3008036","article-title":"Deep learning for change detection in remote sensing images: Comprehensive review and meta-analysis","volume":"8","author":"Khelifi","year":"2020","journal-title":"IEEE Access"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"101310","DOI":"10.1016\/j.ecoinf.2021.101310","article-title":"Analysis on change detection techniques for remote sensing applications: A review","volume":"63","author":"Afaq","year":"2021","journal-title":"Ecol. Inform."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"262","DOI":"10.1080\/10095020.2022.2085633","article-title":"Deep learning for change detection in remote sensing: A review","volume":"26","author":"Bai","year":"2023","journal-title":"Geo-Spat. Inf. Sci."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"154086","DOI":"10.1109\/ACCESS.2021.3128140","article-title":"A systematic survey of remote sensing image captioning","volume":"9","author":"Zhao","year":"2021","journal-title":"IEEE Access"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1109\/MGRS.2023.3316438","article-title":"Language Integration in Remote Sensing: Tasks, datasets, and future directions","volume":"11","author":"Bashmal","year":"2023","journal-title":"IEEE Geosci. Remote Sens. Magazine"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). Show and tell: A neural image caption generator. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Albawi, S., Mohammed, T.A., and Al-Zawi, S. (2017, January 21\u201323). Understanding of a convolutional neural network. Proceedings of the 2017 International Conference on Engineering and Technology (ICET), Antalya, Turkey.","DOI":"10.1109\/ICEngTechnol.2017.8308186"},{"key":"ref_36","unstructured":"Salehinejad, H., Sankar, S., Barfett, J., Colak, E., and Valaee, S. (2017). Recent advances in recurrent neural networks. arXiv."},{"key":"ref_37","unstructured":"Rubinstein, R.Y., and Kroese, D.P. (2004). The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning, Springer."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"10532","DOI":"10.1109\/TGRS.2020.3044054","article-title":"Word\u2013sentence framework for remote sensing image captioning","volume":"59","author":"Wang","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Qu, B., Li, X., Tao, D., and Lu, X. (2016, January 6\u20138). Deep semantic understanding of high resolution remote sensing image. Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China. IEEE.","DOI":"10.1109\/CITS.2016.7546397"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"2183","DOI":"10.1109\/TGRS.2017.2776321","article-title":"Exploring models and data for remote sensing image caption generation","volume":"56","author":"Lu","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Nanal, W., and Hajiarbabi, M. (2023, January 20\u201323). Captioning remote sensing images using transformer architecture. 
Proceedings of the 2023 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Bali, Indonesia.","DOI":"10.1109\/ICAIIC57133.2023.10067039"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"26661","DOI":"10.1007\/s11042-020-09294-7","article-title":"Remote sensing image caption generation via transformer and reinforcement learning","volume":"79","author":"Shen","year":"2020","journal-title":"Multimed. Tools Appl."},{"key":"ref_43","first-page":"6004505","article-title":"Remote Sensing Image Captioning with Sequential Attention and Flexible Word Correlation","volume":"21","author":"Wang","year":"2024","journal-title":"IEEE Geosci. Remote. Sens. Lett."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"012015","DOI":"10.1088\/1742-6596\/1712\/1\/012015","article-title":"Image Captioning Using Deep Convolutional Neural Networks (CNNs)","volume":"1712","author":"Geetha","year":"2020","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Zhu, Y., and Newsam, S. (2017, January 17\u201320). Densenet for dense flow. Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China.","DOI":"10.1109\/ICIP.2017.8296389"},{"key":"ref_46","unstructured":"Jastrz\u0119bski, S., Arpit, D., Ballas, N., Verma, V., Che, T., and Bengio, Y. (2017). Residual connections encourage iterative inference. arXiv."},{"key":"ref_47","first-page":"677","article-title":"Deep Attention Based DenseNet with Visual Switch Added BiLSTM for Caption Generation from Remote Sensing Images","volume":"16","author":"Badhe","year":"2023","journal-title":"Int. J. Intell. Eng. Syst."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1109\/LGRS.2020.2980933","article-title":"Denoising-based multiscale feature fusion for remote sensing image captioning","volume":"18","author":"Huang","year":"2020","journal-title":"IEEE Geosci. Remote Sens. 
Lett."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Fei-Fei, L. (2009, January 20). Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"105920","DOI":"10.1016\/j.knosys.2020.105920","article-title":"Remote sensing image captioning via variational autoencoder and reinforcement learning","volume":"203","author":"Shen","year":"2020","journal-title":"Knowl.-Based Syst."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Li, Z., Zhao, W., Du, X., Zhou, G., and Zhang, S. (2024). Cross-modal retrieval and semantic refinement for remote sensing image captioning. Remote Sens., 16.","DOI":"10.3390\/rs16010196"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks, Springer Nature.","DOI":"10.1007\/978-3-642-24797-2"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Dey, R., and Salem, F.M. (2017, January 6\u20139). Gate-variants of gated recurrent unit (GRU) neural networks. Proceedings of the 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA, USA.","DOI":"10.1109\/MWSCAS.2017.8053243"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Chouaf, S., Hoxha, G., Smara, Y., and Melgani, F. (2021, January 11\u201316). Captioning changes in bi-temporal remote sensing images. Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium.","DOI":"10.1109\/IGARSS47720.2021.9554419"},{"key":"ref_55","unstructured":"Guyon, I., Von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (2017). Attention is all you need. 
Advances in Neural Information Processing Systems 30, Neural Information Processing Systems Foundation, Inc."},{"key":"ref_56","unstructured":"Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberg, K.Q. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27, Neural Information Processing Systems Foundation, Inc."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D.F., and Chao, L.S. (2019). Learning deep transformer models for machine translation. arXiv.","DOI":"10.18653\/v1\/P19-1176"},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Liu, Y., and Lapata, M. (2019). Text summarization with pretrained encoders. arXiv.","DOI":"10.18653\/v1\/D19-1387"},{"key":"ref_59","unstructured":"Li, G., Zhu, L., Liu, P., and Yang, Y. (November, January 27). Entangled transformer for image captioning. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Li, J., Yao, P., Guo, L., and Zhang, W. (2019). Boosted transformer for image captioning. Appl. Sci., 9.","DOI":"10.3390\/app9163260"},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Suthaharan, S., and Suthaharan, S. (2016). Support vector machine. Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning, Springer Publishing.","DOI":"10.1007\/978-1-4899-7641-3"},{"key":"ref_62","unstructured":"Thrun, S., Saul, L., and Sch\u00f6lkopf, B. (2003). Margin maximizing loss functions. 
Advances in Neural Information Processing Systems 16, Neural Information Processing Systems Foundation, Inc."},{"key":"ref_63","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1016\/j.isprsjprs.2016.01.011","article-title":"Random forest in remote sensing: A review of applications and future directions","volume":"114","author":"Belgiu","year":"2016","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_64","doi-asserted-by":"crossref","first-page":"5627414","DOI":"10.1109\/TGRS.2022.3195692","article-title":"Change captioning: A new paradigm for multitemporal remote sensing image analysis","volume":"60","author":"Hoxha","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_65","first-page":"5404514","article-title":"A novel SVM-based decoder for remote sensing image captioning","volume":"60","author":"Hoxha","year":"2021","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Fu, K., Li, Y., Zhang, W., Yu, H., and Sun, X. (2020). Boosting memory with a persistent memory mechanism for remote sensing image captioning. Remote Sens., 12.","DOI":"10.3390\/rs12111874"},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Wang, J., Chen, Z., Ma, A., and Zhong, Y. (2022, January 17\u201322). Capformer: Pure transformer for remote sensing image caption. Proceedings of the 2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia.","DOI":"10.1109\/IGARSS46834.2022.9883199"},{"key":"ref_68","doi-asserted-by":"crossref","first-page":"87","DOI":"10.1109\/TPAMI.2022.3152247","article-title":"A survey on vision transformer","volume":"45","author":"Han","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_69","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. 
arXiv."},{"key":"ref_70","first-page":"6506605","article-title":"Remote-sensing image captioning based on multilayer aggregated transformer","volume":"19","author":"Liu","year":"2022","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_71","doi-asserted-by":"crossref","unstructured":"Zhang, X., Wang, Q., Chen, S., and Li, X. (August, January 28). Multi-scale cropping mechanism for remote sensing image captioning. Proceedings of the 2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan.","DOI":"10.1109\/IGARSS.2019.8900503"},{"key":"ref_72","doi-asserted-by":"crossref","first-page":"24852","DOI":"10.1109\/ACCESS.2022.3151874","article-title":"Using neural encoder-decoder models with continuous outputs for remote sensing image captioning","volume":"10","author":"Ramos","year":"2022","journal-title":"IEEE Access"},{"key":"ref_73","doi-asserted-by":"crossref","first-page":"6514005","DOI":"10.1109\/LGRS.2022.3192062","article-title":"TypeFormer: Multiscale transformer with type controller for remote sensing image caption","volume":"19","author":"Chen","year":"2022","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_74","doi-asserted-by":"crossref","first-page":"5612013","DOI":"10.1109\/TGRS.2023.3281334","article-title":"Improving image captioning systems with postprocessing strategies","volume":"61","author":"Hoxha","year":"2023","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_75","doi-asserted-by":"crossref","first-page":"4291","DOI":"10.1109\/TNNLS.2020.3019893","article-title":"Attention in natural language processing","volume":"32","author":"Galassi","year":"2020","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_76","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015, January 6\u201311). Show, attend and tell: Neural image caption generation with visual attention. 
Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_77","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018, January 18\u201323). Bottom-up and top-down attention for image captioning and visual question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"ref_78","unstructured":"Huang, L., Wang, W., Chen, J., and Wei, X.Y. (November, January 27). Attention on attention for image captioning. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Lu, J., Xiong, C., Parikh, D., and Socher, R. (2017, January 22\u201329). Knowing when to look: Adaptive attention via a visual sentinel for image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Venice, Italy.","DOI":"10.1109\/CVPR.2017.345"},{"key":"ref_80","doi-asserted-by":"crossref","first-page":"137355","DOI":"10.1109\/ACCESS.2019.2942154","article-title":"VAA: Visual aligning attention model for remote sensing image captioning","volume":"7","author":"Zhang","year":"2019","journal-title":"IEEE Access"},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Li, Y., Fang, S., Jiao, L., Liu, R., and Shang, R. (2020). A multi-level attention model for remote sensing image captions. 
Remote Sens., 12.","DOI":"10.3390\/rs12060939"},{"key":"ref_82","doi-asserted-by":"crossref","first-page":"109893","DOI":"10.1016\/j.patcog.2023.109893","article-title":"Learning consensus-aware semantic knowledge for remote sensing image captioning","volume":"145","author":"Li","year":"2024","journal-title":"Pattern Recognit."},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"5629419","DOI":"10.1109\/TGRS.2022.3201474","article-title":"NWPU-captions dataset and MLCA-net for remote sensing image captioning","volume":"60","author":"Cheng","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_84","doi-asserted-by":"crossref","first-page":"4848","DOI":"10.1080\/17538947.2023.2283482","article-title":"MC-Net: Multi-scale contextual information aggregation network for image captioning on remote sensing images","volume":"16","author":"Huang","year":"2023","journal-title":"Int. J. Digit. Earth"},{"key":"ref_85","doi-asserted-by":"crossref","unstructured":"Zhang, X., Li, Y., Wang, X., Liu, F., Wu, Z., Cheng, X., and Jiao, L. (2023). Multi-source interactive stair attention for remote sensing image captioning. Remote Sens., 15.","DOI":"10.3390\/rs15030579"},{"key":"ref_86","doi-asserted-by":"crossref","unstructured":"Li, Y., Qi, H., Dai, J., Ji, X., and Wei, Y. (2017, January 22\u201329). Fully convolutional instance-aware semantic segmentation. Proceedings of the IEEE Conference on Computer Vision And Pattern Recognition, Venice, Italy.","DOI":"10.1109\/CVPR.2017.472"},{"key":"ref_87","doi-asserted-by":"crossref","unstructured":"Wang, C., Jiang, Z., and Yuan, Y. (October, January 26). Instance-aware remote sensing image captioning with cross-hierarchy attention. 
Proceedings of the IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, Online.","DOI":"10.1109\/IGARSS39084.2020.9323213"},{"key":"ref_88","doi-asserted-by":"crossref","first-page":"2001","DOI":"10.1109\/LGRS.2020.3009243","article-title":"Multiscale methods for optical remote-sensing image captioning","volume":"18","author":"Ma","year":"2020","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_89","doi-asserted-by":"crossref","first-page":"2154","DOI":"10.1109\/JSTARS.2022.3153636","article-title":"Multiscale multiinteraction network for remote sensing image captioning","volume":"15","author":"Wang","year":"2022","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_90","unstructured":"He, Y., Carass, A., Zuo, L., Dewey, B.E., and Prince, J.L. (2020). Self domain adapted network. Medical Image Computing and Computer Assisted Intervention\u2013MICCAI 2020: 23rd International Conference, Lima, Peru, October 4\u20138, 2020, Proceedings, Part I 23, Springer International Publishing."},{"key":"ref_91","doi-asserted-by":"crossref","first-page":"2608","DOI":"10.1109\/ACCESS.2019.2962195","article-title":"Exploring multi-level attention and semantic relationship for remote sensing image captioning","volume":"8","author":"Yuan","year":"2019","journal-title":"IEEE Access"},{"key":"ref_92","doi-asserted-by":"crossref","unstructured":"Meng, Y., Gu, Y., Ye, X., Tian, J., Wang, S., Zhang, H., Hou, B., and Jiao, L. (2021, January 11\u201316). Multi-view attention network for remote sensing image captioning. Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium.","DOI":"10.1109\/IGARSS47720.2021.9555083"},{"key":"ref_93","first-page":"102741","article-title":"Transforming remote sensing images to textual descriptions","volume":"108","author":"Zia","year":"2022","journal-title":"Int. J. Appl. Earth Obs. 
Geoinf."},{"key":"ref_94","doi-asserted-by":"crossref","unstructured":"Cornia, M., Stefanini, M., Baraldi, L., and Cucchiara, R. (2020, January 13\u201319). Meshed-memory transformer for image captioning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"ref_95","doi-asserted-by":"crossref","first-page":"105076","DOI":"10.1016\/j.engappai.2022.105076","article-title":"Generating the captions for remote sensing images: A spatial-channel attention-based memory-guided transformer approach","volume":"114","author":"Gajbhiye","year":"2022","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_96","doi-asserted-by":"crossref","first-page":"7704","DOI":"10.1109\/JSTARS.2023.3305889","article-title":"From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning","volume":"16","author":"Du","year":"2023","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens."},{"key":"ref_97","doi-asserted-by":"crossref","first-page":"5643912","DOI":"10.1109\/TGRS.2024.3475633","article-title":"TrTr-CMR: Cross-Modal Reasoning Dual Transformer for Remote Sensing Image Captioning","volume":"62","author":"Wu","year":"2024","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_98","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 11\u201317). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_99","doi-asserted-by":"crossref","first-page":"4706213","DOI":"10.1109\/TGRS.2023.3328181","article-title":"Prior Knowledge-Guided Transformer for Remote Sensing Image Captioning","volume":"61","author":"Meng","year":"2023","journal-title":"IEEE Trans. Geosci. Remote. 
Sens."},{"key":"ref_100","doi-asserted-by":"crossref","first-page":"4703515","DOI":"10.1109\/TGRS.2024.3385500","article-title":"A Multiscale Grouping Transformer With CLIP Latents for Remote Sensing Image Captioning","volume":"62","author":"Meng","year":"2024","journal-title":"IEEE Trans. Geosci. Remote. Sens."},{"key":"ref_101","first-page":"103672","article-title":"Exploring region features in remote sensing image captioning","volume":"127","author":"Zhao","year":"2024","journal-title":"Int. J. Appl. Earth Obs. Geoinf."},{"key":"ref_102","doi-asserted-by":"crossref","unstructured":"Guo, J., Li, Z., Song, B., and Chi, Y. (2024). TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning. Remote Sens., 16.","DOI":"10.3390\/rs16111843"},{"key":"ref_103","first-page":"5607314","article-title":"Cooperative Connection Transformer for Remote Sensing Image Captioning","volume":"62","author":"Zhao","year":"2024","journal-title":"IEEE Trans. Geosci. Remote. Sens."},{"key":"ref_104","doi-asserted-by":"crossref","unstructured":"Cai, C., Wang, Y., and Yap, K.H. (2023). Interactive change-aware transformer network for remote sensing image change captioning. Remote Sens., 15.","DOI":"10.3390\/rs15235611"},{"key":"ref_105","first-page":"5624514","article-title":"Single-Stream Extractor Network With Contrastive Pre-Training for Remote-Sensing Change Captioning","volume":"62","author":"Zhou","year":"2024","journal-title":"IEEE Trans. Geosci. Remote. Sens."},{"key":"ref_106","doi-asserted-by":"crossref","unstructured":"Liu, C., Yang, J., Qi, Z., Zou, Z., and Shi, Z. (2023, January 16\u201321). Progressive scale-aware network for remote sensing image change captioning. Proceedings of the IGARSS 2023\u20132023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA.","DOI":"10.1109\/IGARSS52108.2023.10283451"},{"key":"ref_107","doi-asserted-by":"crossref","unstructured":"Zhang, X., Wang, X., Tang, X., Zhou, H., and Li, C. (2019). 
Description Generation for Remote Sensing Images Using Attribute Attention Mechanism. Remote. Sens., 11.","DOI":"10.3390\/rs11060612"},{"key":"ref_108","doi-asserted-by":"crossref","first-page":"22409","DOI":"10.1007\/s11042-023-16421-7","article-title":"GAF-Net: Global view guided attribute fusion network for remote sensing image captioning","volume":"83","author":"Peng","year":"2024","journal-title":"Multimed. Tools Appl."},{"key":"ref_109","first-page":"5608816","article-title":"Recurrent attention and semantic gate for remote sensing image captioning","volume":"60","author":"Li","year":"2021","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_110","first-page":"5615216","article-title":"Global visual feature and linguistic state guided attention for remote sensing image captioning","volume":"60","author":"Zhang","year":"2021","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_111","doi-asserted-by":"crossref","first-page":"6910","DOI":"10.1109\/TCYB.2022.3222606","article-title":"GLCM: Global\u2013Local Captioning Model for Remote Sensing Image Captioning","volume":"53","author":"Wang","year":"2022","journal-title":"IEEE Trans. Cybern."},{"key":"ref_112","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Diao, W., Zhang, W., Yan, M., Gao, X., and Sun, X. (2019). LAM: Remote sensing image captioning with label-attention mechanism. Remote Sens., 11.","DOI":"10.3390\/rs11202349"},{"key":"ref_113","doi-asserted-by":"crossref","unstructured":"Cheng, K., Wu, Z., Jin, H., and Li, X. (2024, January 7\u201312). Remote Sensing Image Captioning with Multi-Scale Feature and Small Target Attention. 
Proceedings of the IGARSS 2024\u20132024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece.","DOI":"10.1109\/IGARSS53475.2024.10642778"},{"key":"ref_114","doi-asserted-by":"crossref","first-page":"1985","DOI":"10.1109\/TGRS.2019.2951636","article-title":"Sound active attention framework for remote sensing image captioning","volume":"58","author":"Lu","year":"2019","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_115","doi-asserted-by":"crossref","first-page":"122136","DOI":"10.1109\/ACCESS.2022.3223444","article-title":"Mel frequency cepstral coefficient and its applications: A review","volume":"10","author":"Abdul","year":"2022","journal-title":"IEEE Access"},{"key":"ref_116","doi-asserted-by":"crossref","unstructured":"Zhang, H., Parkes, D.C., and Chen, Y. (2009, January 6\u201310). Policy teaching through reward function learning. Proceedings of the 10th ACM Conference on Electronic Commerce, Stanford, CA, USA.","DOI":"10.1145\/1566374.1566417"},{"key":"ref_117","doi-asserted-by":"crossref","first-page":"26","DOI":"10.1109\/MSP.2017.2743240","article-title":"Deep reinforcement learning: A brief survey","volume":"34","author":"Arulkumaran","year":"2017","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_118","doi-asserted-by":"crossref","first-page":"5246","DOI":"10.1109\/TGRS.2020.3010106","article-title":"Truncation cross entropy loss for remote sensing image captioning","volume":"59","author":"Li","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_119","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R., He, K., and Doll\u00e1r, P. (2017, January 22\u201329). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_120","doi-asserted-by":"crossref","unstructured":"Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. 
(2017, January 21\u201326). Self-critical sequence training for image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.131"},{"key":"ref_121","unstructured":"Ranzato, M.A., Chopra, S., Auli, M., and Zaremba, W. (2015). Sequence level training with recurrent neural networks. arXiv."},{"key":"ref_122","unstructured":"Luo, R. (2020). A better variant of self-critical sequence training. arXiv."},{"key":"ref_123","doi-asserted-by":"crossref","unstructured":"Ren, Z., Gou, S., Guo, Z., Mao, S., and Li, R. (2022). A mask-guided transformer network with topic token for remote sensing image captioning. Remote Sens., 14.","DOI":"10.3390\/rs14122939"},{"key":"ref_124","unstructured":"Drenkow, N., Sani, N., Shpitser, I., and Unberath, M. (2021). A systematic review of robustness in deep learning for computer vision: Mind the gap?. arXiv."},{"key":"ref_125","unstructured":"Zhang, L., Sung, F., Liu, F., Xiang, T., Gong, S., Yang, Y., and Hospedales, T.M. (2017). Actor-critic sequence training for image captioning. arXiv."},{"key":"ref_126","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1109\/MSP.2017.2765202","article-title":"Generative adversarial networks: An overview","volume":"35","author":"Creswell","year":"2018","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_127","doi-asserted-by":"crossref","unstructured":"Rui, X., Cao, Y., Yuan, X., Kang, Y., and Song, W. (2021). Disastergan: Generative adversarial networks for remote sensing disaster image generation. Remote Sens., 13.","DOI":"10.3390\/rs13214284"},{"key":"ref_128","unstructured":"Pfau, D., and Vinyals, O. (2016). Connecting generative adversarial networks and actor-critic methods. arXiv."},{"key":"ref_129","doi-asserted-by":"crossref","unstructured":"Chavhan, R., Banerjee, B., Zhu, X.X., and Chaudhuri, S. (2020, January 10\u201315). A novel actor dual-critic model for remote sensing image captioning. 
Proceedings of the 2020 25th International Conference on Pattern Recognition, Milan, Italy.","DOI":"10.1109\/ICPR48806.2021.9412486"},{"key":"ref_130","doi-asserted-by":"crossref","unstructured":"Tong, Y., Chen, Y., and Shi, X. (2021). A multi-task approach for improving biomedical named entity recognition by incorporating multi-granularity information. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics (ACL).","DOI":"10.18653\/v1\/2021.findings-acl.424"},{"key":"ref_131","unstructured":"Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv."},{"key":"ref_132","doi-asserted-by":"crossref","unstructured":"Toshniwal, S., Tang, H., Lu, L., and Livescu, K. (2017). Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2017-1118"},{"key":"ref_133","doi-asserted-by":"crossref","first-page":"324","DOI":"10.1007\/s11704-010-0102-7","article-title":"Three challenges in data mining","volume":"4","author":"Yang","year":"2010","journal-title":"Front. Comput. Sci. China"},{"key":"ref_134","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_135","doi-asserted-by":"crossref","unstructured":"Liu, C., Chen, K., Qi, Z., Liu, Z., Zhang, H., Zou, Z., and Shi, Z. (2024, January 7\u201312). Pixel-level change detection pseudo-label learning for remote sensing change captioning. 
Proceedings of the 2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece.","DOI":"10.1109\/IGARSS53475.2024.10642750"},{"key":"ref_136","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1109\/TIP.2022.3226418","article-title":"Transition is a process: Pair-to-video change detection networks for very high resolution remote sensing images","volume":"32","author":"Lin","year":"2022","journal-title":"IEEE Trans. Image Process."},{"key":"ref_137","doi-asserted-by":"crossref","unstructured":"Li, X., Sun, B., and Li, S. (2024, January 7\u201312). Detection Assisted Change Captioning for Remote Sensing Image. Proceedings of the 2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece.","DOI":"10.1109\/IGARSS53475.2024.10640971"},{"key":"ref_138","doi-asserted-by":"crossref","first-page":"2392847","DOI":"10.1080\/17538947.2024.2392847","article-title":"Incorporating object counts into remote sensing image captioning","volume":"17","author":"Ni","year":"2024","journal-title":"Int. J. Digit. Earth"},{"key":"ref_139","doi-asserted-by":"crossref","unstructured":"Zhao, W., Yang, W., Chen, D., and Wei, F. (2023). DFEN: Dual feature enhancement network for remote sensing image caption. Electron., 12.","DOI":"10.3390\/electronics12071547"},{"key":"ref_140","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1016\/j.procs.2020.01.067","article-title":"Region driven remote sensing image captioning","volume":"165","author":"Kumar","year":"2019","journal-title":"Procedia Comput. Sci."},{"key":"ref_141","doi-asserted-by":"crossref","first-page":"6514905","DOI":"10.1109\/LGRS.2022.3198234","article-title":"Exploring transformer and multilabel classification for remote sensing image captioning","volume":"19","author":"Kandala","year":"2022","journal-title":"IEEE Geosci. Remote Sens. 
Lett."},{"key":"ref_142","doi-asserted-by":"crossref","first-page":"4709616","DOI":"10.1109\/TGRS.2022.3224244","article-title":"A joint-training two-stage method for remote sensing image captioning","volume":"60","author":"Ye","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_143","doi-asserted-by":"crossref","first-page":"190","DOI":"10.1016\/j.isprsjprs.2022.02.001","article-title":"Meta captioning: A meta learning based remote sensing image captioning framework","volume":"186","author":"Yang","year":"2022","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_144","doi-asserted-by":"crossref","unstructured":"Hoxha, G., Melgani, F., and Slaghenauffi, J. (2020, January 9\u201311). A new CNN-RNN framework for remote sensing image captioning. Proceedings of the 2020 Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS), Tunis, Tunisia.","DOI":"10.1109\/M2GARSS47143.2020.9105191"},{"key":"ref_145","first-page":"691","article-title":"Experimental assessment of beam search algorithm for improvement in image caption generation","volume":"22","author":"Chowdhary","year":"2019","journal-title":"J. Appl. Sci. Eng."},{"key":"ref_146","doi-asserted-by":"crossref","first-page":"256","DOI":"10.1109\/JSTARS.2019.2959208","article-title":"Retrieval topic recurrent memory network for remote sensing image captioning","volume":"13","author":"Wang","year":"2020","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_147","doi-asserted-by":"crossref","unstructured":"Cui, W., He, X., Yao, M., Wang, Z., Li, J., Hao, Y., Wu, W., Zhao, H., Chen, X., and Cui, W. (2020). Landslide image captioning method based on semantic gate and bi-temporal LSTM. ISPRS Int. J. 
Geo-Inf., 9.","DOI":"10.3390\/ijgi9040194"},{"key":"ref_148","doi-asserted-by":"crossref","first-page":"6922","DOI":"10.1109\/TGRS.2020.3031111","article-title":"SD-RSIC: Summarization-driven deep remote sensing image captioning","volume":"59","author":"Sumbul","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_149","doi-asserted-by":"crossref","first-page":"8555","DOI":"10.1109\/TGRS.2020.2988782","article-title":"RSVQA: Visual question answering for remote sensing data","volume":"58","author":"Lobry","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_150","doi-asserted-by":"crossref","unstructured":"Murali, N., and Shanthi, A.P. (2022). Remote sensing image captioning via multilevel attention-based visual question answering. Innovations in Computational Intelligence and Computer Vision: Proceedings of ICICV 2021, Springer Nature.","DOI":"10.1007\/978-981-19-0475-2_41"},{"key":"ref_151","doi-asserted-by":"crossref","first-page":"2337","DOI":"10.1007\/s11263-022-01653-1","article-title":"Learning to prompt for vision-language models","volume":"130","author":"Zhou","year":"2022","journal-title":"Int. J. Comput. Vis."},{"key":"ref_152","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., and Anadkat, S. (2023). Gpt-4 technical report. arXiv."},{"key":"ref_153","doi-asserted-by":"crossref","unstructured":"Liu, H., Li, C., Li, Y., and Lee, Y.J. (2023). Improved Baselines with Visual Instruction Tuning. arXiv.","DOI":"10.1109\/CVPR52733.2024.02484"},{"key":"ref_154","unstructured":"Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., and Elhoseiny, M. (2023). Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv."},{"key":"ref_155","unstructured":"He, Y., and Sun, Q. (2023). 
Towards Automatic Satellite Images Captions Generation Using Large Language Models. arXiv."},{"key":"ref_156","unstructured":"Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., and Brockman, G. (2021). Evaluating large language models trained on code. arXiv."},{"key":"ref_157","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1145\/505241.505243","article-title":"An optimal minimum spanning tree algorithm","volume":"49","author":"Pettie","year":"2002","journal-title":"JACM"},{"key":"ref_158","doi-asserted-by":"crossref","first-page":"9","DOI":"10.23919\/JSEE.2023.000035","article-title":"VLCA: Vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning","volume":"34","author":"Wei","year":"2023","journal-title":"J. Syst. Eng. Electron."},{"key":"ref_159","doi-asserted-by":"crossref","first-page":"101983","DOI":"10.1016\/j.wpi.2020.101983","article-title":"Patent claim generation by fine-tuning OpenAI GPT-2","volume":"62","author":"Lee","year":"2020","journal-title":"World Pat. Inf."},{"key":"ref_160","doi-asserted-by":"crossref","first-page":"11809","DOI":"10.1109\/JSTARS.2024.3413323","article-title":"NLP-Based Fusion Approach to Robust Image Captioning","volume":"17","author":"Ricci","year":"2024","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_161","unstructured":"Hu, Y., Yuan, J., Wen, C., Lu, X., and Li, X. (2023). Rsgpt: A remote sensing vision language model and benchmark. arXiv."},{"key":"ref_162","first-page":"49250","article-title":"Instructblip: Towards general-purpose vision-language models with instruction tuning","volume":"36","author":"Dai","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_163","doi-asserted-by":"crossref","unstructured":"Bazi, Y., Bashmal, L., Al Rahhal, M.M., Ricci, R., and Melgani, F. (2024). 
RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery. Remote Sens., 16.","DOI":"10.3390\/rs16091477"},{"key":"ref_164","unstructured":"Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv."},{"key":"ref_165","unstructured":"Silva, J.D., Magalh\u00e3es, J., Tuia, D., and Martins, B. (2024). Large Language Models for Captioning and Retrieving Remote Sensing Images. arXiv."},{"key":"ref_166","unstructured":"Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (2024). Visual instruction tuning. Advances in Neural Information Processing Systems 36, Neural Information Processing Systems Foundation, Inc."},{"key":"ref_167","unstructured":"Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., and Chen, E. (2023). A survey on multimodal large language models. arXiv."},{"key":"ref_168","unstructured":"Zhan, Y., Xiong, Z., and Yuan, Y. (2024). Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model. arXiv."},{"key":"ref_169","doi-asserted-by":"crossref","unstructured":"Kuckreja, K., Danish, M.S., Naseer, M., Das, A., Khan, S., and Khan, F.S. (2024, January 17\u201321). Geochat: Grounded large vision-language model for remote sensing. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.02629"},{"key":"ref_170","unstructured":"Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (2024). Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing 36, Neural Information Processing Systems Foundation, Inc."},{"key":"ref_171","unstructured":"Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (2024). 
Samrs: Scaling-up remote sensing segmentation dataset with segment anything model. Advances in Neural Information Processing Systems 36, Neural Information Processing Systems Foundation, Inc."},{"key":"ref_172","unstructured":"Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., and Gonzalez, J.E. (2023, April 14). Vicuna: An Open-source Chatbot Impressing Gpt-4 with 90%* Chatgpt Quality. Available online: https:\/\/vicuna.lmsys.org."},{"key":"ref_173","first-page":"5917820","article-title":"Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain","volume":"62","author":"Zhang","year":"2024","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_174","first-page":"5917820","article-title":"Remoteclip: A vision language foundation model for remote sensing","volume":"62","author":"Liu","year":"2024","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_175","doi-asserted-by":"crossref","unstructured":"Yuan, Z., Zhang, W., Fu, K., Li, X., Deng, C., Wang, H., and Sun, X. (2022). Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. arXiv.","DOI":"10.1109\/TGRS.2021.3078451"},{"key":"ref_176","doi-asserted-by":"crossref","first-page":"104046","DOI":"10.1016\/j.imavis.2020.104046","article-title":"Deep learning-based object detection in low-altitude UAV datasets: A survey","volume":"104","author":"Mittal","year":"2020","journal-title":"Image Vis. Comput."},{"key":"ref_177","first-page":"5635616","article-title":"Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis","volume":"62","author":"Liu","year":"2024","journal-title":"IEEE Trans. Geosci. Remote. Sens."},{"key":"ref_178","first-page":"63","article-title":"Generalizing from a few examples: A survey on few-shot learning","volume":"53","author":"Wang","year":"2020","journal-title":"ACM Comput. 
Surv."},{"key":"ref_179","doi-asserted-by":"crossref","unstructured":"Chen, X., Jiang, M., and Zhao, Q. (2021, January 5\u20139). Self-distillation for few-shot image captioning. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Online.","DOI":"10.1109\/WACV48630.2021.00059"},{"key":"ref_180","unstructured":"Allen-Zhu, Z., and Li, Y. (2020). Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv."},{"key":"ref_181","doi-asserted-by":"crossref","unstructured":"Barraco, M., Stefanini, M., Cornia, M., Cascianelli, S., Baraldi, L., and Cucchiara, R. (2022, January 21\u201325). CaMEL: Mean teacher learning for image captioning. Proceedings of the 26th International Conference on Pattern Recognition, Montreal, QC, Canada.","DOI":"10.1109\/ICPR56361.2022.9955644"},{"key":"ref_182","unstructured":"Laina, I., Rupprecht, C., and Navab, N. (November, January 27). Towards unsupervised image captioning with shared multimodal embeddings. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_183","unstructured":"Rusu, A.A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. (2018). Meta-learning with latent embedding optimization. arXiv."},{"key":"ref_184","doi-asserted-by":"crossref","unstructured":"Zhou, H., Du, X., Xia, L., and Li, S. (2022). Self-learning for few-shot remote sensing image captioning. Remote Sens., 14.","DOI":"10.3390\/rs14184606"},{"key":"ref_185","doi-asserted-by":"crossref","first-page":"2337240","DOI":"10.1080\/17538947.2024.2337240","article-title":"FRIC: A framework for few-shot remote sensing image captioning","volume":"17","author":"Zhou","year":"2024","journal-title":"Int. J. Digit. Earth"},{"key":"ref_186","doi-asserted-by":"crossref","unstructured":"Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., and Lazebnik, S. (2015, January 7\u201313). 
Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.303"},{"key":"ref_187","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014). Microsoft coco: Common objects in context. Computer Vision\u2013ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6\u201312, 2014, Proceedings, Part V, Springer International Publishing.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_188","doi-asserted-by":"crossref","first-page":"5633520","DOI":"10.1109\/TGRS.2022.3218921","article-title":"Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset","volume":"60","author":"Liu","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_189","doi-asserted-by":"crossref","first-page":"118","DOI":"10.1016\/S0034-4257(01)00254-1","article-title":"Landsat-7 ETM+ as an observatory for land cover: Initial radiometric and geometric comparisons with Landsat-5 Thematic Mapper","volume":"78","author":"Masek","year":"2001","journal-title":"Remote Sens. Environ."},{"key":"ref_190","doi-asserted-by":"crossref","unstructured":"Yang, Y., and Newsam, S. (2010, January 2\u20135). Bag-of-visual-words and spatial extensions for land-use classification. Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA.","DOI":"10.1145\/1869790.1869829"},{"key":"ref_191","doi-asserted-by":"crossref","first-page":"2175","DOI":"10.1109\/TGRS.2014.2357078","article-title":"Saliency-guided unsupervised feature learning for scene classification","volume":"53","author":"Zhang","year":"2014","journal-title":"IEEE Trans. Geosci. 
Remote Sens."},{"key":"ref_192","doi-asserted-by":"crossref","unstructured":"Chen, H., and Shi, Z. (2020). A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens., 12.","DOI":"10.3390\/rs12101662"},{"key":"ref_193","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 6\u201312). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_194","unstructured":"Banerjee, S., and Lavie, A. (2005, January 29). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Ann Arbor, MI, USA."},{"key":"ref_195","unstructured":"Lin, C.Y., and Och, F.J. (2004, January 2\u20134). Looking for a few good metrics: ROUGE and its evaluation. Proceedings of the 4th NTCIR Workshop, Tokyo, Japan."},{"key":"ref_196","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015, January 7\u201312). Cider: Consensus-based image description evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_197","doi-asserted-by":"crossref","unstructured":"Anderson, P., Fernando, B., Johnson, M., and Gould, S. (2016). Spice: Semantic propositional image caption evaluation. 
Computer Vision\u2013ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11\u201314, 2016, Proceedings, Part V, Springer International Publishing.","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"ref_198","doi-asserted-by":"crossref","first-page":"532","DOI":"10.1017\/pan.2020.4","article-title":"Active learning approaches for labeling text: Review and assessment of the performance of active learning approaches","volume":"28","author":"Miller","year":"2020","journal-title":"Political Anal."},{"key":"ref_199","doi-asserted-by":"crossref","unstructured":"De Silva, V., and Sumanathilaka, T.G.D.K. (2024, January 21\u201324). A Survey on Image Captioning Using Object Detection and NLP. Proceedings of the 4th International Conference on Advanced Research in Computing, Belihuloya, Sri Lanka.","DOI":"10.1109\/ICARC61713.2024.10499755"},{"key":"ref_200","doi-asserted-by":"crossref","first-page":"231","DOI":"10.1007\/s10489-023-05198-9","article-title":"Integrating grid features and geometric coordinates for enhanced image captioning","volume":"54","author":"Zhao","year":"2024","journal-title":"Appl. Intell."},{"key":"ref_201","doi-asserted-by":"crossref","unstructured":"Guo, L., Liu, J., Zhu, X., Yao, P., Lu, S., and Lu, H. (2020, January 13\u201319). Normalized and geometry-aware self-attention network for image captioning. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01034"},{"key":"ref_202","doi-asserted-by":"crossref","first-page":"100211","DOI":"10.1016\/j.hcc.2024.100211","article-title":"A survey on large language model (llm) security and privacy: The good, the bad, and the ugly","volume":"4","author":"Yao","year":"2024","journal-title":"High-Confid. 
Comput."},{"key":"ref_203","doi-asserted-by":"crossref","first-page":"5227","DOI":"10.1109\/TPAMI.2024.3362475","article-title":"SpectralGPT: Spectral remote sensing foundation model","volume":"46","author":"Hong","year":"2024","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_204","doi-asserted-by":"crossref","first-page":"6047","DOI":"10.1109\/TIP.2023.3328224","article-title":"Changes to captions: An attentive network for remote sensing change captioning","volume":"32","author":"Chang","year":"2023","journal-title":"IEEE Trans. Image Processing"},{"key":"ref_205","first-page":"5622018","article-title":"A decoupling paradigm with prompt learning for remote sensing image change captioning","volume":"61","author":"Liu","year":"2023","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_206","doi-asserted-by":"crossref","first-page":"6006905","DOI":"10.1109\/LGRS.2024.3383163","article-title":"Change Captioning for Satellite Images Time Series","volume":"21","author":"Peng","year":"2024","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_207","doi-asserted-by":"crossref","unstructured":"Sun, Y., Lei, L., Guan, D., Kuang, G., Li, Z., and Liu, L. (2024). Locality Preservation for Unsupervised Multimodal Change Detection in Remote Sensing Imagery. IEEE Transactions on Neural Networks and Learning Systems, IEEE.","DOI":"10.1109\/TNNLS.2024.3401696"},{"key":"ref_208","doi-asserted-by":"crossref","first-page":"2507605","DOI":"10.1109\/LGRS.2022.3217348","article-title":"Change smoothness-based signal decomposition method for multimodal change detection","volume":"19","author":"Zheng","year":"2022","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_209","doi-asserted-by":"crossref","unstructured":"Cheng, Q., Xu, Y., and Huang, Z. (2024). VCC-DiffNet: Visual Conditional Control Diffusion Network for Remote Sensing Image Captioning. 
Remote Sens., 16.","DOI":"10.3390\/rs16162961"},{"key":"ref_210","first-page":"5607512","article-title":"Bootstrapping interactive image-text alignment for remote sensing image captioning","volume":"62","author":"Yang","year":"2024","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_211","first-page":"5624711","article-title":"HCNet: Hierarchical Feature Aggregation and Cross-Modal Feature Alignment for Remote Sensing Image Captioning","volume":"62","author":"Yang","year":"2024","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_212","doi-asserted-by":"crossref","first-page":"5163","DOI":"10.1007\/s11069-023-06399-8","article-title":"Pixel-based classification method for earthquake-induced landslide mapping using remotely sensed imagery, geospatial data and temporal change information","volume":"120","author":"Asadi","year":"2024","journal-title":"Nat. Hazards"},{"key":"ref_213","doi-asserted-by":"crossref","unstructured":"Amitrano, D., Di Martino, G., Di Simone, A., and Imperatore, P. (2024). Flood Detection with SAR: A Review of Techniques and Datasets. Remote Sens., 16.","DOI":"10.3390\/rs16040656"},{"key":"ref_214","doi-asserted-by":"crossref","unstructured":"Wang, B., and Yao, Y. (2024). Mountain Vegetation Classification Method Based on Multi-Channel Semantic Segmentation Model. Remote Sens., 16.","DOI":"10.3390\/rs16020256"},{"key":"ref_215","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1016\/j.comcom.2024.01.032","article-title":"Semantic segmentation of deep learning remote sensing images based on band combination principle: Application in urban planning and land use","volume":"217","author":"Jia","year":"2024","journal-title":"Comput. Commun."},{"key":"ref_216","doi-asserted-by":"crossref","unstructured":"Hess, G., Tonderski, A., Petersson, C., \u00c5str\u00f6m, K., and Svensson, L. (2024, January 3\u20138). Lidarclip or: How I learned to talk to point clouds. 
Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV57701.2024.00727"},{"key":"ref_217","doi-asserted-by":"crossref","first-page":"70","DOI":"10.1016\/j.isprsjprs.2024.01.002","article-title":"Similarity and dissimilarity relationships based graphs for multimodal change detection","volume":"208","author":"Sun","year":"2024","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_218","doi-asserted-by":"crossref","first-page":"829","DOI":"10.1162\/neco_a_01273","article-title":"A survey on deep learning for multimodal data fusion","volume":"32","author":"Gao","year":"2020","journal-title":"Neural Comput."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/16\/21\/4113\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:28:11Z","timestamp":1760113691000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/16\/21\/4113"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,4]]},"references-count":218,"journal-issue":{"issue":"21","published-online":{"date-parts":[[2024,11]]}},"alternative-id":["rs16214113"],"URL":"https:\/\/doi.org\/10.3390\/rs16214113","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,4]]}}}