{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T22:50:34Z","timestamp":1769640634621,"version":"3.49.0"},"reference-count":46,"publisher":"MDPI AG","issue":"16","license":[{"start":{"date-parts":[[2024,8,12]],"date-time":"2024-08-12T00:00:00Z","timestamp":1723420800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2023YFB3906101"],"award-info":[{"award-number":["2023YFB3906101"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Pioneering remote sensing image captioning (RSIC) works use autoregressive decoding for fluent and coherent sentences but suffer from high latency and high computation costs. In contrast, non-autoregressive approaches improve inference speed by predicting multiple tokens simultaneously, though at the cost of performance due to a lack of sequential dependencies. Recently, diffusion model-based non-autoregressive decoding has shown promise in natural image captioning with iterative refinement, but its effectiveness is limited by the intrinsic characteristics of remote sensing images, which complicate robust input construction and affect the description accuracy. To overcome these challenges, we propose an innovative diffusion model for RSIC, named the Visual Conditional Control Diffusion Network (VCC-DiffNet). Specifically, we propose a Refined Multi-scale Feature Extraction (RMFE) module to extract the discernible visual context features of RSIs as input of the diffusion model-based non-autoregressive decoder to conditionally control a multi-step denoising process. Furthermore, we propose an Interactive Enhanced Decoder (IE-Decoder) utilizing dual image\u2013description interactions to generate descriptions finely aligned with the image content. Experiments conducted on four representative RSIC datasets demonstrate that our non-autoregressive VCC-DiffNet performs comparably to, or even better than, popular autoregressive baselines in classic metrics, achieving around an 8.22\u00d7 speedup in Sydney-Captions, an 11.61\u00d7 speedup in UCM-Captions, a 15.20\u00d7 speedup in RSICD, and an 8.13\u00d7 speedup in NWPU-Captions.<\/jats:p>","DOI":"10.3390\/rs16162961","type":"journal-article","created":{"date-parts":[[2024,8,12]],"date-time":"2024-08-12T10:34:59Z","timestamp":1723458899000},"page":"2961","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["VCC-DiffNet: Visual Conditional Control Diffusion Network for Remote Sensing Image Captioning"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6713-7544","authenticated-orcid":false,"given":"Qimin","family":"Cheng","sequence":"first","affiliation":[{"name":"School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China"}]},{"given":"Yuqi","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China"}]},{"given":"Ziyang","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,8,12]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"154086","DOI":"10.1109\/ACCESS.2021.3128140","article-title":"A Systematic Survey of Remote Sensing Image Captioning","volume":"9","author":"Zhao","year":"2021","journal-title":"IEEE Access"},{"key":"ref_2","unstructured":"Chen, T., Zhang, R., and Hinton, G.E. (2022). Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning. arXiv."},{"key":"ref_3","unstructured":"Fei, Z. (2019). Fast Image Caption Generation with Position Alignment. arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Li, Y., Zhou, K., Zhao, W.X., and Wen, J.-R. (2023). Diffusion Models for Non-autoregressive Text Generation: A Survey. arXiv.","DOI":"10.24963\/ijcai.2023\/750"},{"key":"ref_5","unstructured":"Zhu, Z., Wei, Y., Wang, J., Gan, Z., Zhang, Z., Wang, L., Hua, G., Wang, L., Liu, Z., and Hu, H. (2022). Exploring Discrete Diffusion Models for Image Captioning. arXiv."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Luo, J., Li, Y., Pan, Y., Yao, T., Feng, J., Chao, H., and Mei, T. (2023, January 18\u201322). Semantic-Conditional Diffusion Networks for Image Captioning. Proceedings of the 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.02237"},{"key":"ref_7","unstructured":"Xu, S. (2022). CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Qu, B., Li, X., Tao, D., and Lu, X. (2016, January 6\u20138). Deep semantic understanding of high-resolution remote sensing image. Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China.","DOI":"10.1109\/CITS.2016.7546397"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"3623","DOI":"10.1109\/TGRS.2017.2677464","article-title":"Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image?","volume":"55","author":"Shi","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"1274","DOI":"10.1109\/LGRS.2019.2893772","article-title":"Semantic Descriptions of High-Resolution Remote Sensing Images","volume":"16","author":"Wang","year":"2019","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long Short-Term Memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"2183","DOI":"10.1109\/TGRS.2017.2776321","article-title":"Exploring Models and Data for Remote Sensing Image Caption Generation","volume":"56","author":"Lu","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Zhang, X., Wang, X., Tang, X., Zhou, H., and Li, C. (2019). Description generation for remote sensing images using attribute attention mechanism. Remote Sens., 11.","DOI":"10.3390\/rs11060612"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"137355","DOI":"10.1109\/ACCESS.2019.2942154","article-title":"VAA: Visual Aligning Attention Model for Remote Sensing Image Captioning","volume":"7","author":"Zhang","year":"2019","journal-title":"IEEE Access"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Diao, W., Zhang, W., Yan, M., Gao, X., and Sun, X. (2019). LAM: Remote Sensing Image Captioning with Label-Attention Mechanism. Remote Sens., 11.","DOI":"10.3390\/rs11202349"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Fu, K., Li, Y., Zhang, W., Yu, H., and Sun, X. (2020). Boosting Memory with a Persistent Memory Mechanism for Remote Sensing Image Captioning. Remote Sens., 12.","DOI":"10.3390\/rs12111874"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Li, Y., Zhang, X., Gu, J., Li, C., Wang, X., Tang, X., and Jiao, L. (2022). Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens., 60.","DOI":"10.1109\/TGRS.2021.3102590"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Zhang, W., Yan, M., Gao, X., Fu, K., and Sun, X. (2022). Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens., 60.","DOI":"10.1109\/TGRS.2021.3132095"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Wang, J., Wang, B., Xi, J., Bai, X., Ersoy, O.K., Cong, M., Gao, S., and Zhao, Z. (2024). Remote Sensing Image Captioning With Sequential Attention and Flexible Word Correlation. IEEE Geosci. Remote Sens. Lett., 21.","DOI":"10.1109\/LGRS.2024.3366984"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"26661","DOI":"10.1007\/s11042-020-09294-7","article-title":"Remote sensing image caption generation via transformer and reinforcement learning","volume":"79","author":"Shen","year":"2020","journal-title":"Multimed. Tools Appl."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Liu, C., Zhao, R., and Shi, Z.X. (2022). Remote-Sensing Image Captioning Based on Multilayer Aggregated Transformer. IEEE Geosci. Remote Sens. Lett., 19.","DOI":"10.1109\/LGRS.2022.3150957"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Ren, Z., Gou, S., Guo, Z., Mao, S., and Li, R. (2022). A Mask-Guided Transformer Network with Topic Token for Remote Sensing Image Captioning. Remote Sens., 14.","DOI":"10.3390\/rs14122939"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Zhao, K., and Xiong, W. (2024). Exploring region features in remote sensing image captioning. Int. J. Appl. Earth Obs. Geoinf., 127.","DOI":"10.1016\/j.jag.2024.103672"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Zhao, K., and Xiong, W. (2024). Cooperative Connection Transformer for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens., 62.","DOI":"10.1109\/TGRS.2024.3360089"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Lee, J., Mansimov, E., and Cho, K. (2018). Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. arXiv.","DOI":"10.18653\/v1\/D18-1149"},{"key":"ref_26","unstructured":"Gao, J., Meng, X., Wang, S., Li, X., Wang, S., Ma, S., and Gao, W. (2019). Masked Non-Autoregressive Image Captioning. arXiv."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Guo, L., Liu, J., Zhu, X., He, X., Jiang, J., and Lu, H. (2020, January 11\u201317). Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan.","DOI":"10.24963\/ijcai.2020\/107"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Yu, H., Liu, Y., Qi, B., Hu, Z., and Liu, H. (2023, January 4\u201310). End-to-End Non-Autoregressive Image Captioning. Proceedings of the ICASSP 2023\u20142023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10095338"},{"key":"ref_29","first-page":"1309","article-title":"Partially Non-Autoregressive Image Captioning","volume":"35","author":"Fei","year":"2021","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yan, X., Fei, Z., Li, Z., Wang, S., Huang, Q., and Tian, Q. (2021, January 20\u201324). Semi-autoregressive image captioning. Proceedings of the MM \u201921: Proceedings of the 29th ACM International Conference on Multimedia, Virtual.","DOI":"10.1145\/3474085.3475179"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Zhang, Y., Hu, Z., and Wang, M. (2021, January 11\u201317). Semi-autoregressive transformer for image captioning. Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada.","DOI":"10.1109\/ICCVW54120.2021.00350"},{"key":"ref_32","unstructured":"He, Y., Cai, Z., Gan, X., and Chang, B. (2023). DiffCap: Exploring Continuous Diffusion on Image Captioning. arXiv."},{"key":"ref_33","unstructured":"Liu, G., Li, Y., Fei, Z., Fu, H., Luo, X., and Guo, Y. (2023). Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning. arXiv."},{"key":"ref_34","unstructured":"Austin, J., Johnson, D.D., Ho, J., Tarlow, D., and van den Berg, R. (2021). Structured Denoising Diffusion Models in Discrete State-Spaces. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Doll\u00e1r, P., Girshick, R.B., He, K., Hariharan, B., and Belongie, S.J. (2017, January 21\u201326). Feature Pyramid Networks for Object Detection. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.106"},{"key":"ref_36","unstructured":"Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Cheng, Q., Huang, H., Xu, Y., Zhou, Y., Li, H., and Wang, Z. (2022). NWPU-captions dataset and MLCA-net for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens., 60.","DOI":"10.1109\/TGRS.2022.3201474"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 7\u201312). Bleu: A method for automatic evaluation of machine translation. Proceedings of the ACL \u201902: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_39","unstructured":"Banerjee, S., and Lavie, A. (2005, January 29). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Ann Arbor, MI, USA."},{"key":"ref_40","unstructured":"Lin, C.Y. (2004, January 4\u201310). Rouge: A package for automatic evaluation of summaries. Proceedings of the Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, Barcelona, Spain."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Zitnick, C.L., and Parikh, D. (2015, January 7\u201312). CIDEr: Consensus-based image description evaluation. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_42","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"7704","DOI":"10.1109\/JSTARS.2023.3305889","article-title":"From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning","volume":"16","author":"Du","year":"2023","journal-title":"IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Zia, U., Riaz, M.M., and Ghafoor, A. (2022). Transforming remote sensing images to textual descriptions. Int. J. Appl. Earth Obs. Geoinf., 108.","DOI":"10.1016\/j.jag.2022.102741"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"6910","DOI":"10.1109\/TCYB.2022.3222606","article-title":"GLCM: Global\u2013Local Captioning Model for Remote Sensing Image Captioning","volume":"53","author":"Wang","year":"2023","journal-title":"IEEE Trans. Cybern."},{"key":"ref_46","unstructured":"Mokady, R., and Hertz, A. (2021). ClipCap: CLIP Prefix for Image Captioning. arXiv."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/16\/16\/2961\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T15:35:33Z","timestamp":1760110533000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/16\/16\/2961"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,12]]},"references-count":46,"journal-issue":{"issue":"16","published-online":{"date-parts":[[2024,8]]}},"alternative-id":["rs16162961"],"URL":"https:\/\/doi.org\/10.3390\/rs16162961","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,8,12]]}}}