{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,29]],"date-time":"2026-03-29T06:35:31Z","timestamp":1774766131308,"version":"3.50.1"},"reference-count":20,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2026,1,4]],"date-time":"2026-01-04T00:00:00Z","timestamp":1767484800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Describing land cover changes from multi-temporal remote sensing imagery requires capturing both visual transformations and their semantic meaning in natural language. Existing methods often struggle to balance visual accuracy with descriptive coherence. We propose MVLT-LoRA-CC (Multi-modal Vision Language Transformer with Low-Rank Adaptation for Change Captioning), a framework that integrates a Vision Transformer (ViT), a Large Language Model (LLM), and Low-Rank Adaptation (LoRA) for efficient multi-modal learning. The model processes paired temporal images through patch embeddings and transformer blocks, aligning visual and textual representations via a multi-modal adapter. To improve efficiency and avoid unnecessary parameter growth, LoRA modules are selectively inserted only into the attention projection layers and cross-modal adapter blocks rather than being uniformly applied to all linear layers. This targeted design preserves general linguistic knowledge while enabling effective adaptation to remote sensing change description. To assess performance, we introduce the Complementary Consistency Score (CCS) framework, which evaluates both descriptive fidelity for change instances and classification accuracy for no change cases. 
Experiments on the LEVIR-CC test set demonstrate that MVLT-LoRA-CC generates semantically accurate captions, surpassing prior methods in both descriptive richness and temporal change recognition. The approach establishes a scalable solution for multi-modal land cover change description in remote sensing applications.<\/jats:p>","DOI":"10.3390\/rs18010166","type":"journal-article","created":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T08:40:53Z","timestamp":1767602453000},"page":"166","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Describing Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA"],"prefix":"10.3390","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1463-1837","authenticated-orcid":false,"given":"Javier Lamar","family":"Le\u00f3n","sequence":"first","affiliation":[{"name":"Centro Algoritmi, LASI, University of \u00c9vora, 7005-854 \u00c9vora, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0793-0003","authenticated-orcid":false,"given":"Vitor","family":"Nogueira","sequence":"additional","affiliation":[{"name":"Centro Algoritmi, LASI, University of \u00c9vora, 7005-854 \u00c9vora, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5055-2805","authenticated-orcid":false,"given":"Pedro","family":"Salgueiro","sequence":"additional","affiliation":[{"name":"Centro Algoritmi, LASI, University of \u00c9vora, 7005-854 \u00c9vora, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5086-059X","authenticated-orcid":false,"given":"Paulo","family":"Quaresma","sequence":"additional","affiliation":[{"name":"Centro Algoritmi, LASI, University of \u00c9vora, 7005-854 \u00c9vora, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2026,1,4]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). 
Show and Tell: A Neural Image Caption Generator. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_2","unstructured":"Daudt, R.C., Le Saux, B., and Boulch, A. (2018, January 7\u201310). Fully Convolutional Siamese Networks for Change Detection. Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"6047","DOI":"10.1109\/TIP.2023.3328224","article-title":"Changes to captions: An attentive network for remote sensing change captioning","volume":"32","author":"Chang","year":"2023","journal-title":"IEEE Trans. Image Process."},{"key":"ref_4","first-page":"1","article-title":"Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset","volume":"60","author":"Liu","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_5","unstructured":"Wang, Y., Yu, W., and Ghamisi, P. (2025). Change Captioning in Remote Sensing: Evolution to SAT-Cap\u2013A Single-Stage Transformer Approach. arXiv."},{"key":"ref_6","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning Transferable Visual Models from Natural Language Supervision. Proceedings of the International Conference on Machine Learning (ICML), Virtual."},{"key":"ref_7","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16\u00d716 words: Transformers for image recognition at scale. 
arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"127063","DOI":"10.1016\/j.neucom.2023.127063","article-title":"Roformer: Enhanced transformer with rotary position embedding","volume":"568","author":"Su","year":"2024","journal-title":"Neurocomputing"},{"key":"ref_9","first-page":"23716","article-title":"Flamingo: A visual language model for few-shot learning","volume":"35","author":"Alayrac","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Aghajanyan, A., Zettlemoyer, L., and Gupta, S. (2020). Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv.","DOI":"10.18653\/v1\/2021.acl-long.568"},{"key":"ref_11","unstructured":"Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Cao, Y., Wang, S., and Wang, L. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv."},{"key":"ref_12","unstructured":"Wang, S., Yu, L., and Li, J. (2024). LoRA-GA: Low-Rank Adaptation with Gradient Approximation. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Yang, Y., Liu, T., Pu, Y., Liu, L., Zhao, Q., and Wan, Q. (2024). Remote sensing image change captioning using multi-attentive network with diffusion model. Remote Sens., 16.","DOI":"10.3390\/rs16214083"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"5648916","DOI":"10.1109\/TGRS.2024.3497338","article-title":"Semantic-cc: Boosting remote sensing image change captioning via foundational knowledge and semantic guidance","volume":"62","author":"Zhu","year":"2024","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002, January 6\u201312). BLEU: A method for automatic evaluation of machine translation. 
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_16","unstructured":"Banerjee, S., and Lavie, A. (2005, January 29). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Ann Arbor, MI, USA."},{"key":"ref_17","unstructured":"Lin, C.-Y. (2004, January 25\u201326). ROUGE: A package for automatic evaluation of summaries. Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Zitnick, C.L., and Parikh, D. (2015, January 7\u201312). CIDEr: Consensus-based image description evaluation. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Anderson, P., Fernando, B., Johnson, M., and Gould, S. (2016, January 8\u201316). SPICE: Semantic propositional image caption evaluation. Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"ref_20","unstructured":"Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., and Tang, J. (2025). Qwen2.5-vl technical report. 
arXiv."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/18\/1\/166\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,6]],"date-time":"2026-01-06T08:51:39Z","timestamp":1767689499000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/18\/1\/166"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,4]]},"references-count":20,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,1]]}},"alternative-id":["rs18010166"],"URL":"https:\/\/doi.org\/10.3390\/rs18010166","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,4]]}}}