{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,5]],"date-time":"2025-11-05T06:57:19Z","timestamp":1762325839211,"version":"build-2065373602"},"reference-count":37,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2024,1,23]],"date-time":"2024-01-23T00:00:00Z","timestamp":1705968000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Image captioning is a technique that enables the automatic extraction of natural language descriptions about the contents of an image. On the one hand, information in the form of natural language can enhance accessibility by reducing the expertise required to process, analyze, and exploit remote sensing images, while on the other, it provides a direct and general form of communication. However, image captioning is usually restricted to a single sentence, which barely describes the rich semantic information that typically characterizes remote sensing (RS) images. In this paper, we aim to move one step forward by proposing a captioning system that, mimicking human behavior, adopts dialogue as a tool to explore and dig for information, leading to more detailed and comprehensive descriptions of RS scenes. The system relies on a questions\u2013answers scheme fed by a query image and summarizes the dialogue content with ChatGPT. Experiments carried out on two benchmark remote sensing datasets confirm the potential of such an approach in the context of semantic information mining. Strengths and weaknesses are highlighted and discussed, as well as some possible future developments.<\/jats:p>","DOI":"10.3390\/rs16030441","type":"journal-article","created":{"date-parts":[[2024,1,23]],"date-time":"2024-01-23T07:22:32Z","timestamp":1705994552000},"page":"441","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["Machine-to-Machine Visual Dialoguing with ChatGPT for Enriched Textual Image Description"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2128-7456","authenticated-orcid":false,"given":"Riccardo","family":"Ricci","sequence":"first","affiliation":[{"name":"Department of Information Engineering and Computer Science, University of Trento, 38123 Trento, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9287-0596","authenticated-orcid":false,"given":"Yakoub","family":"Bazi","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh 4545, Saudi Arabia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9745-3732","authenticated-orcid":false,"given":"Farid","family":"Melgani","sequence":"additional","affiliation":[{"name":"Department of Information Engineering and Computer Science, University of Trento, 38123 Trento, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2024,1,23]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Wu, L., Tan, X., He, D., Tian, F., Qin, T., Lai, J., and Liu, T.Y. (2018). Beyond Error Propagation in Neural Machine Translation: Characteristics of Language. arXiv.","DOI":"10.18653\/v1\/D18-1396"},{"key":"ref_2","first-page":"1","article-title":"Change Captioning: A New Paradigm for Multitemporal Remote Sensing Image Analysis","volume":"60","author":"Hoxha","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"3623","DOI":"10.1109\/TGRS.2017.2677464","article-title":"Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image?","volume":"55","author":"Shi","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Qu, B., Li, X., Tao, D., and Lu, X. (2016, January 6\u20138). Deep semantic understanding of high resolution remote sensing image. Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China.","DOI":"10.1109\/CITS.2016.7546397"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_7","first-page":"1","article-title":"A Novel SVM-Based Decoder for Remote Sensing Image Captioning","volume":"60","author":"Hoxha","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"2183","DOI":"10.1109\/TGRS.2017.2776321","article-title":"Exploring Models and Data for Remote Sensing Image Caption Generation","volume":"56","author":"Lu","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1109\/LGRS.2020.2980933","article-title":"Denoising-Based Multiscale Feature Fusion for Remote Sensing Image Captioning","volume":"18","author":"Huang","year":"2021","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Zhang, X., Wang, X., Tang, X., Zhou, H., and Li, C. (2019). Description Generation for Remote Sensing Images Using Attribute Attention Mechanism. Remote Sens., 11.","DOI":"10.3390\/rs11060612"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Diao, W., Zhang, W., Yan, M., Gao, X., and Sun, X. (2019). LAM: Remote Sensing Image Captioning with Label-Attention Mechanism. Remote Sens., 11.","DOI":"10.3390\/rs11202349"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Wang, J., Chen, Z., Ma, A., and Zhong, Y. (2022, January 17\u201322). Capformer: Pure Transformer for Remote Sensing Image Caption. Proceedings of the IGARSS 2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia.","DOI":"10.1109\/IGARSS46834.2022.9883199"},{"key":"ref_13","first-page":"1","article-title":"NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning","volume":"60","author":"Cheng","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., and Parikh, D. (2015, January 7\u201313). VQA: Visual Question Answering. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.279"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"8555","DOI":"10.1109\/TGRS.2020.2988782","article-title":"RSVQA: Visual Question Answering for Remote Sensing Data","volume":"58","author":"Lobry","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TGRS.2022.3225843","article-title":"Mutual Attention Inception Network for Remote Sensing Visual Question Answering","volume":"60","author":"Zheng","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_17","first-page":"1","article-title":"From Easy to Hard: Learning Language-Guided Curriculum for Visual Question Answering on Remote Sensing Data","volume":"60","author":"Yuan","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Chappuis, C., Zermatten, V., Lobry, S., Le Saux, B., and Tuia, D. (2022, January 19\u201320). Prompt\u2013RSVQA: Prompting visual context to a language model for Remote Sensing Visual Question Answering. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA.","DOI":"10.1109\/CVPRW56347.2022.00143"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TGRS.2022.3192460","article-title":"Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery","volume":"60","author":"Bazi","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3383465","article-title":"Visual Question Generation: The State of the Art","volume":"53","author":"Patil","year":"2020","journal-title":"ACM Comput. Surv."},{"key":"ref_21","unstructured":"Ren, M., Kiros, R., and Zemel, R. (2015). Exploring Models and Data for Image Question Answering. arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"3618","DOI":"10.1073\/pnas.1422953112","article-title":"Visual Turing test for computer vision systems","volume":"112","author":"Geman","year":"2015","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_23","unstructured":"Yang, J., Lu, J., Lee, S., Batra, D., and Parikh, D. (2018). Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition. arXiv."},{"key":"ref_24","unstructured":"Vedd, N., Wang, Z., Rei, M., Miao, Y., and Specia, L. (2012). Guiding Visual Question Generation. arXiv."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Jain, U., Zhang, Z., and Schwing, A. (2017). Creativity: Generating Diverse Questions using Variational Autoencoders. arXiv.","DOI":"10.1109\/CVPR.2017.575"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"3279","DOI":"10.1109\/JSTARS.2023.3261361","article-title":"Visual Question Generation From Remote Sensing Images","volume":"16","author":"Bashmal","year":"2023","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_27","unstructured":"Zhu, D., Chen, J., Haydarov, K., Shen, X., Zhang, W., and Elhoseiny, M. (2023). ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions. arXiv."},{"key":"ref_28","unstructured":"Li, J., Li, D., Savarese, S., and Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv."},{"key":"ref_29","unstructured":"Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., and Askell, A. (2020). Language Models are Few-Shot Learners. arXiv."},{"key":"ref_30","unstructured":"Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., and Ray, A. (2022). Training language models to follow instructions with human feedback. arXiv."},{"key":"ref_31","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv."},{"key":"ref_32","first-page":"9","article-title":"Language Models are Unsupervised Multitask Learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI Blog"},{"key":"ref_33","unstructured":"(2022, April 14). Huggingface: Distilgpt2. Available online: https:\/\/huggingface.co\/distilgpt2."},{"key":"ref_34","unstructured":"Yang, Y., Li, Y., Fermuller, C., and Aloimonos, Y. (2015). Neural Self Talk: Image Understanding via Continuous Questioning and Answering. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., and Choi, Y. (2022). CLIPScore: A Reference-free Evaluation Metric for Image Captioning. arXiv.","DOI":"10.18653\/v1\/2021.emnlp-main.595"},{"key":"ref_36","unstructured":"(2022, April 14). Huggingface: CompVis\/Stable-Diffusion-v1-4. Available online: https:\/\/huggingface.co\/CompVis\/stable-diffusion-v1-4."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. arXiv.","DOI":"10.1109\/CVPR.2016.308"}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/16\/3\/441\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T13:47:55Z","timestamp":1760104075000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/16\/3\/441"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,23]]},"references-count":37,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2024,2]]}},"alternative-id":["rs16030441"],"URL":"https:\/\/doi.org\/10.3390\/rs16030441","relation":{},"ISSN":["2072-4292"],"issn-type":[{"type":"electronic","value":"2072-4292"}],"subject":[],"published":{"date-parts":[[2024,1,23]]}}}