{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T22:01:10Z","timestamp":1773180070361,"version":"3.50.1"},"reference-count":48,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2024,1,3]],"date-time":"2024-01-03T00:00:00Z","timestamp":1704240000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Two-stage remote sensing image captioning (RSIC) methods have achieved promising results by incorporating additional pre-trained remote sensing tasks to extract supplementary information and improve caption quality. However, these methods face limitations in semantic comprehension, as pre-trained detectors\/classifiers are constrained by predefined labels, leading to an oversight of the intricate and diverse details present in remote sensing images (RSIs). Additionally, the handling of auxiliary remote sensing tasks separately can introduce challenges in ensuring seamless integration and alignment with the captioning process. To address these problems, we propose a novel cross-modal retrieval and semantic refinement (CRSR) RSIC method. Specifically, we employ a cross-modal retrieval model to retrieve relevant sentences of each image. The words in these retrieved sentences are then considered as primary semantic information, providing valuable supplementary information for the captioning process. To further enhance the quality of the captions, we introduce a semantic refinement module that refines the primary semantic information, which helps to filter out misleading information and emphasize visually salient semantic information. A Transformer Mapper network is introduced to expand the representation of image features beyond the retrieved supplementary information with learnable queries. Both the refined semantic tokens and visual features are integrated and fed into a cross-modal decoder for caption generation. 
Through extensive experiments, we demonstrate the superiority of our CRSR method over existing state-of-the-art approaches on the RSICD, the UCM-Captions, and the Sydney-Captions datasets<\/jats:p>","DOI":"10.3390\/rs16010196","type":"journal-article","created":{"date-parts":[[2024,1,3]],"date-time":"2024-01-03T03:18:07Z","timestamp":1704251887000},"page":"196","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-8988-4545","authenticated-orcid":false,"given":"Zhengxin","family":"Li","sequence":"first","affiliation":[{"name":"The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"Key Laboratory of Spatial Information Processing and Application System Technology, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China"}]},{"given":"Wenzhe","family":"Zhao","sequence":"additional","affiliation":[{"name":"The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"Key Laboratory of Spatial Information Processing and Application System Technology, Chinese Academy of Sciences, Beijing 100190, China"}]},{"given":"Xuanyi","family":"Du","sequence":"additional","affiliation":[{"name":"The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China"}]},{"given":"Guangyao","family":"Zhou","sequence":"additional","affiliation":[{"name":"The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"Key Laboratory of Spatial Information Processing and Application System Technology, Chinese Academy of Sciences, Beijing 100190, China"}]},{"given":"Songlin","family":"Zhang","sequence":"additional","affiliation":[{"name":"The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"Key Laboratory of Spatial Information Processing and Application System Technology, Chinese Academy of Sciences, Beijing 100190, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,1,3]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"3623","DOI":"10.1109\/TGRS.2017.2677464","article-title":"Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image?","volume":"55","author":"Shi","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"2183","DOI":"10.1109\/TGRS.2017.2776321","article-title":"Exploring models and data for remote sensing image caption generation","volume":"56","author":"Lu","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"459","DOI":"10.1002\/rob.21756","article-title":"Post-disaster assessment with unmanned aerial vehicles: A survey on practical implementations and research approaches","volume":"35","author":"Recchiuto","year":"2018","journal-title":"J. 
Field Robot."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"263","DOI":"10.1016\/j.isprsjprs.2022.07.001","article-title":"Fully-weighted HGNN: Learning efficient non-local relations with hypergraph in aerial imagery","volume":"191","author":"Tian","year":"2022","journal-title":"ISPRS J. Photogram. Remote Sens."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3295748","article-title":"A comprehensive survey of deep learning for image captioning","volume":"51","author":"Hossain","year":"2019","journal-title":"ACM Comput. Surv."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"154086","DOI":"10.1109\/ACCESS.2021.3128140","article-title":"A systematic survey of remote sensing image captioning","volume":"9","author":"Zhao","year":"2021","journal-title":"IEEE Access"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"179","DOI":"10.1207\/s15516709cog1402_1","article-title":"Finding structure in time","volume":"14","author":"Elman","year":"1990","journal-title":"Cognit. Sci."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"140301","DOI":"10.1007\/s11432-022-3588-0","article-title":"From single- to multi-modal remote sensing imagery interpretation: A survey and taxonomy","volume":"66","author":"Sun","year":"2023","journal-title":"Sci. China Inf. Sci."},{"key":"ref_10","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015, January 6\u201311). Show, attend and tell: Neural image caption generation with visual attention. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"5610510","DOI":"10.1109\/TGRS.2023.3277626","article-title":"SFRNet: Fine-Grained Oriented Object Recognition via Separate Feature Refinement","volume":"61","author":"Cheng","year":"2023","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"5603018","DOI":"10.1109\/TGRS.2021.3065112","article-title":"Hybrid Multiple Attention Network for Semantic Segmentation in Aerial Images","volume":"60","author":"Niu","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Zhang, X., Wang, X., Tang, X., Zhou, H., and Li, C. (2019). Description generation for remote sensing images using attribute attention mechanism. Remote Sens., 11.","DOI":"10.3390\/rs11060612"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"256","DOI":"10.1109\/JSTARS.2019.2959208","article-title":"Retrieval topic recurrent memory network for remote sensing image captioning","volume":"13","author":"Wang","year":"2020","journal-title":"IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"6482","DOI":"10.1080\/01431161.2019.1594439","article-title":"Geospatial relation captioning for high-spatial-resolution images by using an attention-based neural network","volume":"40","author":"Chen","year":"2019","journal-title":"Int. J. 
Remote Sens."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"4709616","DOI":"10.1109\/TGRS.2022.3224244","article-title":"A Joint-Training Two-Stage Method For Remote Sensing Image Captioning","volume":"60","author":"Ye","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Diao, W., Zhang, W., Yan, M., Gao, X., and Sun, X. (2019). LAM: Remote sensing image captioning with label-attention mechanism. Remote Sens., 11.","DOI":"10.3390\/rs11202349"},{"key":"ref_18","first-page":"5603814","article-title":"High-Resolution Remote Sensing Image Captioning Based on Structured Attention","volume":"60","author":"Zhao","year":"2021","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Sarto, S., Cornia, M., Baraldi, L., and Cucchiara, R. (2022, January 14\u201316). Retrieval-Augmented Transformer for Image Captioning. Proceedings of the 19th International Conference on Content-Based Multimedia Indexing, Graz, Austria.","DOI":"10.1145\/3549555.3549585"},{"key":"ref_20","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the ICML, Online."},{"key":"ref_21","unstructured":"Shen, S., Li, L.H., Tan, H., Bansal, M., Rohrbach, A., Chang, K.W., Yao, Z., and Keutzer, K. (2021). How much can clip benefit vision-and-language tasks?. arXiv."},{"key":"ref_22","unstructured":"Jiasen, L., Goswami, V., Rohrbach, M., Parikh, D., and Lee, S. (2020, January 13\u201319). 12-in-1: Multi-task vision and language representation learning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA."},{"key":"ref_23","unstructured":"Mokady, R., Hertz, A., and Bermano, A.H. (2021). ClipCap: CLIP prefix for image captioning. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Qu, B., Li, X., Tao, D., and Lu, X. (2016, January 6\u20138). Deep semantic understanding of high resolution remote sensing image. Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (Cits), Kunming, China.","DOI":"10.1109\/CITS.2016.7546397"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Li, Y., Fang, S., Jiao, L., Liu, R., and Shang, R. (2020). A multi-level attention model for remote sensing image captions. Remote Sens., 12.","DOI":"10.3390\/rs12060939"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1109\/LGRS.2020.2980933","article-title":"Denoising-based multiscale feature fusion for remote sensing image captioning","volume":"18","author":"Huang","year":"2021","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_27","first-page":"1","article-title":"Recurrent attention and semantic gate for remote sensing image captioning","volume":"60","author":"Li","year":"2021","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"5246","DOI":"10.1109\/TGRS.2020.3010106","article-title":"Truncation cross entropy loss for remote sensing image captioning","volume":"59","author":"Li","year":"2020","journal-title":"IEEE Trans. Geosci. 
Remote Sens."},{"key":"ref_29","first-page":"5608816","article-title":"Global visual feature and linguistic state guided attention for remote sensing image captioning","volume":"60","author":"Zhang","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"5404514","DOI":"10.1109\/TGRS.2021.3105004","article-title":"A novel SVM-based decoder for remote sensing image captioning","volume":"60","author":"Hoxha","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"10532","DOI":"10.1109\/TGRS.2020.3044054","article-title":"Word\u2014Sentence framework for remote sensing image captioning","volume":"59","author":"Wang","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"6922","DOI":"10.1109\/TGRS.2020.3031111","article-title":"SD-RSIC: Summarization-driven deep remote sensing image captioning","volume":"59","author":"Sumbul","year":"2021","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"6514905","DOI":"10.1109\/LGRS.2022.3198234","article-title":"Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning","volume":"19","author":"Kandala","year":"2022","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"190","DOI":"10.1016\/j.isprsjprs.2022.02.001","article-title":"Meta captioning: A meta learning based remote sensing image captioning framework","volume":"186","author":"Yang","year":"2022","journal-title":"ISPRS J. Photogram. Remote Sens."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Zhang, X., Li, Y., Wang, X., Liu, F., Wu, Z., Cheng, X., and Jiao, L. (2023). Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning. Remote Sens., 15.","DOI":"10.3390\/rs15030579"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"105920","DOI":"10.1016\/j.knosys.2020.105920","article-title":"Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning","volume":"203","author":"Shen","year":"2020","journal-title":"Knowl.-Based Syst."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"7704","DOI":"10.1109\/JSTARS.2023.3305889","article-title":"From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning","volume":"16","author":"Du","year":"2023","journal-title":"IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"109893","DOI":"10.1016\/j.patcog.2023.109893","article-title":"Learning consensus-aware semantic knowledge for remote sensing image captioning","volume":"145","author":"Li","year":"2024","journal-title":"Pattern Recognit."},{"key":"ref_39","unstructured":"Carion, N., Massa, F., and Synnaeve, G. (2020). European Conference on Computer Vision, Springer."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Yang, Y., and Newsam, S. (2010, January 2\u20135). Bag-of-visual-words and spatial extensions for land-use classification. 
Proceedings of the 18th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (GIS), San Jose, CA, USA.","DOI":"10.1145\/1869790.1869829"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"2175","DOI":"10.1109\/TGRS.2014.2357078","article-title":"Saliency-guided unsupervised feature learning for scene classification","volume":"53","author":"Zhang","year":"2015","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 7\u201312). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_43","unstructured":"Banerjee, S., and Lavie, A. (2005, January 29\u201330). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), Morristown, NJ, USA."},{"key":"ref_44","unstructured":"Lin, C.-Y. (2004). Rouge: A Package for Automatic Evaluation of Summaries, Association for Computational Linguistics."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Zitnick, C., and Parikh, D. (2015, January 7\u201312). CIDEr: Consensus-based image description evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_46","first-page":"382","article-title":"SPICE: Semantic propositional image caption evaluation","volume":"9909","author":"Anderson","year":"2016","journal-title":"Proc. Eur. Conf. Comput. Vis."},{"key":"ref_47","unstructured":"Kingma, D.P., and Ba, J. (2015, January 7\u20139). Adam: A method for stochastic optimization. Proceedings of the ICLR, San Diego, CA, USA."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018, January 18\u201323). Bottom-up and top-down attention for image captioning and visual question answering. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00636"}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/16\/1\/196\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T13:38:37Z","timestamp":1760103517000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/16\/1\/196"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,3]]},"references-count":48,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,1]]}},"alternative-id":["rs16010196"],"URL":"https:\/\/doi.org\/10.3390\/rs16010196","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,3]]}}}
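Note: the abstract above describes a cross-modal retrieval step in which caption sentences are ranked against an image in a shared embedding space and the words of the top matches are kept as "primary semantic information". The sketch below illustrates only that retrieval step, assuming image and sentence embeddings precomputed by a CLIP-style encoder (ref_20); the function name retrieve_semantic_words, the toy corpus, and the parameter k are hypothetical and not from the paper, and the paper's semantic refinement module and Transformer Mapper are not reproduced here.

    import numpy as np

    def retrieve_semantic_words(image_emb, sentence_embs, sentences, k=5):
        """Hypothetical sketch of the retrieval step described in the
        abstract: rank a caption corpus by cosine similarity to an image
        embedding and collect the words of the top-k sentences as
        primary semantic information. Embeddings are assumed to come
        from a CLIP-style encoder mapping images and text to one space."""
        # Normalize so the dot product equals cosine similarity.
        img = image_emb / np.linalg.norm(image_emb)
        txt = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
        scores = txt @ img                    # (num_sentences,) similarity scores
        top_k = np.argsort(scores)[::-1][:k]  # indices of the k best matches
        # Deduplicated word set from the retrieved sentences; per the
        # abstract, a refinement module (not shown) would then filter
        # these tokens against the visual features before decoding.
        words = {w for i in top_k for w in sentences[i].lower().split()}
        return [sentences[i] for i in top_k], sorted(words)

    # Toy usage with random vectors standing in for CLIP outputs.
    rng = np.random.default_rng(0)
    corpus = ["many planes are parked at the airport",
              "a large number of cars are parked in the parking lot"]
    sent_embs = rng.normal(size=(len(corpus), 512))
    img_emb = rng.normal(size=512)
    retrieved, semantic_words = retrieve_semantic_words(img_emb, sent_embs, corpus, k=1)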