{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T05:42:39Z","timestamp":1774590159645,"version":"3.50.1"},"reference-count":53,"publisher":"MDPI AG","issue":"18","license":[{"start":{"date-parts":[[2023,9,21]],"date-time":"2023-09-21T00:00:00Z","timestamp":1695254400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62102423"],"award-info":[{"award-number":["62102423"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>In recent years, there has been a growing interest in remote sensing image\u2013text cross-modal retrieval due to the rapid development of space information technology and the significant increase in the volume of remote sensing image data. Remote sensing images have unique characteristics that make the cross-modal retrieval task challenging. Firstly, the semantics of remote sensing images are fine-grained, meaning they can be divided into multiple basic units of semantic expression. Different combinations of basic units of semantic expression can generate diverse text descriptions. Additionally, these images exhibit variations in resolution, color, and perspective. To address these challenges, this paper proposes a multi-task guided fusion encoder (MTGFE) based on the multimodal fusion encoding method, the progressiveness of which has been proved in the cross-modal retrieval of natural images. By jointly training the model with three tasks: image\u2013text matching (ITM), masked language modeling (MLM), and the newly introduced multi-view joint representations contrast (MVJRC), we enhance its capability to capture fine-grained correlations between remote sensing images and texts. Specifically, the MVJRC task is designed to improve the model\u2019s consistency in joint representation expression and fine-grained correlation, particularly for remote sensing images with significant differences in resolution, color, and angle. Furthermore, to address the computational complexity associated with large-scale fusion models and improve retrieval efficiency, this paper proposes a retrieval filtering method, which achieves higher retrieval efficiency while minimizing accuracy loss. Extensive experiments were conducted on four public datasets to evaluate the proposed method, and the results validate its effectiveness.<\/jats:p>","DOI":"10.3390\/rs15184637","type":"journal-article","created":{"date-parts":[[2023,9,21]],"date-time":"2023-09-21T21:16:49Z","timestamp":1695331009000},"page":"4637","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":14,"title":["A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text\u2013Image Retrieval in Remote Sensing"],"prefix":"10.3390","volume":"15","author":[{"given":"Xiong","family":"Zhang","sequence":"first","affiliation":[{"name":"School of Information and Communication, National University of Defense Technology, Wuhan 430074, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9337-0996","authenticated-orcid":false,"given":"Weipeng","family":"Li","sequence":"additional","affiliation":[{"name":"School of Information and Communication, National University of Defense Technology, Wuhan 430074, China"}]},{"given":"Xu","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Information and Communication, National University of Defense Technology, Wuhan 430074, China"}]},{"given":"Luyao","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Information and Communication, National University of Defense Technology, Wuhan 430074, China"}]},{"given":"Fuzhong","family":"Zheng","sequence":"additional","affiliation":[{"name":"School of Information and Communication, National University of Defense Technology, Wuhan 430074, China"}]},{"given":"Long","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Information and Communication, National University of Defense Technology, Wuhan 430074, China"}]},{"given":"Haisu","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Information and Communication, National University of Defense Technology, Wuhan 430074, China"}]}],"member":"1968","published-online":{"date-parts":[[2023,9,21]]},"reference":[{"key":"ref_1","unstructured":"Simonyan, K., and Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_3","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2021). An Image Is Worth 16\u00d716 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"2222","DOI":"10.1109\/TNNLS.2016.2582924","article-title":"LSTM: A Search Space Odyssey","volume":"28","author":"Greff","year":"2017","journal-title":"IEEE Trans. Neural Networks Learn. Syst."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_6","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"423","DOI":"10.1109\/TPAMI.2018.2798607","article-title":"Multimodal Machine Learning: A Survey and Taxonomy","volume":"41","author":"Baltrusaitis","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_8","first-page":"1","article-title":"Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval","volume":"60","author":"Yuan","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_9","unstructured":"Faghri, F., Fleet, D.J., Kiros, J.R., and Fidler, S. (2018). VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Lee, K.H., Chen, X., Hua, G., Hu, H., and He, X. (2018). Stacked Cross Attention for Image-Text Matching. arXiv.","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Wang, T., Xu, X., Yang, Y., Hanjalic, A., Shen, H.T., and Song, J. (2019, January 21\u201325). Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking. Proceedings of the 27th ACM International Conference on Multimedia, Nice, France.","DOI":"10.1145\/3343031.3350875"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Rahhal, M.M.A., Bazi, Y., Abdullah, T., Mekhalfi, M.L., and Zuair, M. (2020). Deep Unsupervised Embedding for Remote Sensing Image Retrieval Using Textual Cues. Appl. Sci., 10.","DOI":"10.3390\/app10248931"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Abdullah, T., Bazi, Y., Al Rahhal, M.M., Mekhalfi, M.L., Rangarajan, L., and Zuair, M. (2020). TextRS: Deep Bidirectional Triplet Network for Matching Text to Remote Sensing Images. Remote Sens., 12.","DOI":"10.3390\/rs12030405"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"4284","DOI":"10.1109\/JSTARS.2021.3070872","article-title":"A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing","volume":"14","author":"Cheng","year":"2021","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_15","first-page":"1","article-title":"Fusion-Based Correlation Learning Model for Cross-Modal Remote Sensing Image Retrieval","volume":"19","author":"Lv","year":"2022","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_16","first-page":"1","article-title":"Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information","volume":"60","author":"Yuan","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1532","DOI":"10.1109\/JAS.2022.105773","article-title":"Multi-Attention Fusion and Fine-Grained Alignment for Bidirectional Image-Sentence Retrieval in Remote Sensing","volume":"9","author":"Cheng","year":"2022","journal-title":"IEEE\/CAA J. Autom. Sin."},{"key":"ref_18","first-page":"1","article-title":"A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing","volume":"60","author":"Yuan","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Schroff, F., Kalenichenko, D., and Philbin, J. (2015, January 7\u201312). FaceNet: A Unified Embedding for Face Recognition and Clustering. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"ref_20","unstructured":"Van den Oord, A., Li, Y., and Vinyals, O. (2019). Representation Learning with Contrastive Predictive Coding. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"3359","DOI":"10.1080\/01431161.2022.2091964","article-title":"A Fusion-Based Contrastive Learning Model for Cross-Modal Remote Sensing Retrieval","volume":"43","author":"Li","year":"2022","journal-title":"Int. J. Remote Sens."},{"key":"ref_22","unstructured":"Zeng, Y., Zhang, X., Li, H., Wang, J., Zhang, J., and Zhou, W. (2022). X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"2183","DOI":"10.1109\/TGRS.2017.2776321","article-title":"Exploring Models and Data for Remote Sensing Image Caption Generation","volume":"56","author":"Lu","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Huang, Y., Wang, W., and Wang, L. (2017, January 21\u201326). Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.767"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Zheng, F., Li, W., Wang, X., Wang, L., Zhang, X., and Zhang, H. (2022). A Cross-Attention Mechanism Based on Regional-Level Semantic Features of Images for Cross-Modal Text-Image Retrieval in Remote Sensing. Appl. Sci., 12.","DOI":"10.3390\/app122312221"},{"key":"ref_26","unstructured":"Kim, W., Son, B., and Kim, I. (2021). ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. arXiv."},{"key":"ref_27","unstructured":"Li, J., Selvaraju, R.R., Gotmare, A.D., Joty, S., Xiong, C., and Hoi, S. (2021). Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. arXiv."},{"key":"ref_28","unstructured":"Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., and Chang, K.W. (2019). VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv."},{"key":"ref_29","unstructured":"Huang, Z., Zeng, Z., Liu, B., Fu, D., and Fu, J. (2020). Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. arXiv."},{"key":"ref_30","unstructured":"Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"3623","DOI":"10.1109\/TGRS.2017.2677464","article-title":"Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image?","volume":"55","author":"Shi","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Qu, B., Li, X., Tao, D., and Lu, X. (2016, January 6\u20138). Deep Semantic Understanding of High Resolution Remote Sensing Image. Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China.","DOI":"10.1109\/CITS.2016.7546397"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Mikriukov, G., Ravanbakhsh, M., and Demir, B. (2022). Deep Unsupervised Contrastive Hashing for Large-Scale Cross-Modal Text-Image Retrieval in Remote Sensing. arXiv.","DOI":"10.1109\/ICASSP43922.2022.9746251"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Mikriukov, G., Ravanbakhsh, M., and Demir, B. (2022). An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training Image-Text Correspondences in Remote Sensing. arXiv.","DOI":"10.1109\/ICIP46576.2022.9897500"},{"key":"ref_35","unstructured":"Hadsell, R., Chopra, S., and LeCun, Y. (2006, January 17\u201322). Dimensionality Reduction by Learning an Invariant Mapping. Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA."},{"key":"ref_36","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. arXiv."},{"key":"ref_37","unstructured":"Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., and Macherey, K. (2016). Google\u2019s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv."},{"key":"ref_38","unstructured":"Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., and Wu, Y. (2022). CoCa: Contrastive Captioners Are Image-Text Foundation Models. arXiv."},{"key":"ref_39","unstructured":"Lu, J., Batra, D., Parikh, D., and Lee, S. (2019). ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. arXiv."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Tan, H., and Bansal, M. (2019). LXMERT: Learning Cross-Modality Encoder Representations from Transformers. arXiv.","DOI":"10.18653\/v1\/D19-1514"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Tian, Y., Krishnan, D., and Isola, P. (2020). Contrastive Multiview Coding. arXiv.","DOI":"10.1007\/978-3-030-58621-8_45"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. arXiv.","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"ref_43","unstructured":"Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Chen, X., and He, K. (2020). Exploring Simple Siamese Representation Learning. arXiv.","DOI":"10.1109\/CVPR46437.2021.01549"},{"key":"ref_45","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv."},{"key":"ref_46","unstructured":"Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q.V., Sung, Y., Li, Z., and Duerig, T. (2021). Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. arXiv."},{"key":"ref_47","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J\u00e9gou, H. (2021). Training Data-Efficient Image Transformers & Distillation through Attention. arXiv."},{"key":"ref_48","unstructured":"Ba, J.L., Kiros, J.R., and Hinton, G.E. (2016). Layer normalization. arXiv."},{"key":"ref_49","unstructured":"Hendrycks, D., and Gimpel, K. (2020). Gaussian Error Linear Units (GELUs). arXiv."},{"key":"ref_50","unstructured":"Ioffe, S., and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv."},{"key":"ref_51","first-page":"315","article-title":"Deep Sparse Rectifier Neural Networks","volume":"15","author":"Glorot","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Cubuk, E.D., Zoph, B., Shlens, J., and Le, Q.V. (2020, January 14\u201319). Randaugment: Practical Automated Data Augmentation with a Reduced Search Space. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA.","DOI":"10.1109\/CVPRW50498.2020.00359"},{"key":"ref_53","unstructured":"Loshchilov, I., and Hutter, F. (2019). Decoupled Weight Decay Regularization. arXiv."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/15\/18\/4637\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:55:06Z","timestamp":1760129706000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/15\/18\/4637"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,21]]},"references-count":53,"journal-issue":{"issue":"18","published-online":{"date-parts":[[2023,9]]}},"alternative-id":["rs15184637"],"URL":"https:\/\/doi.org\/10.3390\/rs15184637","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,21]]}}}