{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,11]],"date-time":"2026-02-11T18:05:57Z","timestamp":1770833157088,"version":"3.50.1"},"reference-count":39,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2020,6,9]],"date-time":"2020-06-09T00:00:00Z","timestamp":1591660800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["41701508"],"award-info":[{"award-number":["41701508"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>The encoder\u2013decoder framework has been widely used in the remote sensing image captioning task. When we need to extract remote sensing images containing specific characteristics from the described sentences for research, rich sentences can improve the final extraction results. However, the Long Short-Term Memory (LSTM) network used in decoders still loses some information in the picture over time when the generated caption is long. In this paper, we present a new model component named the Persistent Memory Mechanism (PMM), which can expand the information storage capacity of LSTM with an external memory. The external memory is a memory matrix with a predetermined size. It can store all the hidden layer vectors of LSTM before the current time step. Thus, our method can effectively solve the above problem. At each time step, the PMM searches previous information related to the input information at the current time from the external memory. Then the PMM will process the captured long-term information and predict the next word with the current information. In addition, it updates its memory with the input information. This method can pick up the long-term information missed from the LSTM but useful to the caption generation. By applying this method to image captioning, our CIDEr scores on datasets UCM-Captions, Sydney-Captions, and RSICD increased by 3%, 5%, and 7%, respectively.<\/jats:p>","DOI":"10.3390\/rs12111874","type":"journal-article","created":{"date-parts":[[2020,6,10]],"date-time":"2020-06-10T05:11:46Z","timestamp":1591765906000},"page":"1874","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":14,"title":["Boosting Memory with a Persistent Memory Mechanism for Remote Sensing Image Captioning"],"prefix":"10.3390","volume":"12","author":[{"given":"Kun","family":"Fu","sequence":"first","affiliation":[{"name":"Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"School of Microelectronics, University of Chinese Academy of Sciences, Beijing 100190, China"},{"name":"School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China"},{"name":"Key Laboratory of Network Information System Technology, Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"Institute of Electronics, Chinese Academy of Sciences, Suzhou 215000, China"}]},{"given":"Yang","family":"Li","sequence":"additional","affiliation":[{"name":"School of Microelectronics, University of Chinese Academy of Sciences, Beijing 100190, China"},{"name":"School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China"},{"name":"Key Laboratory of Network Information System Technology, Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China"}]},{"given":"Wenkai","family":"Zhang","sequence":"additional","affiliation":[{"name":"Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"Key Laboratory of Network Information System Technology, Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China"}]},{"given":"Hongfeng","family":"Yu","sequence":"additional","affiliation":[{"name":"Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"Key Laboratory of Network Information System Technology, Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China"}]},{"given":"Xian","family":"Sun","sequence":"additional","affiliation":[{"name":"Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China"},{"name":"Key Laboratory of Network Information System Technology, Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China"}]}],"member":"1968","published-online":{"date-parts":[[2020,6,9]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.rse.2017.12.033","article-title":"Vessel detection and classification from spaceborne optical images: A literature survey","volume":"207","author":"Kanjir","year":"2018","journal-title":"Remote Sens. Environ."},{"key":"ref_2","first-page":"425","article-title":"PURIFICATION OF TRAINING SAMPLES BASED ON SPECTRAL FEATURE AND SUPERPIXEL SEGMENTATION","volume":"XLII-3","author":"Guan","year":"2018","journal-title":"ISPRS - Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1274","DOI":"10.1109\/LGRS.2019.2893772","article-title":"Semantic descriptions of high-resolution remote sensing images","volume":"16","author":"Wang","year":"2019","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Liu, D., Zha, Z.J., Zhang, H., Zhang, Y., and Wu, F. (2018). Context-aware visual policy network for sequence-level image captioning. arXiv.","DOI":"10.1145\/3240508.3240632"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Luo, R., Price, B., Cohen, S., and Shakhnarovich, G. (2018, January 18\u201322). Discriminability objective for training descriptive captions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00728"},{"key":"ref_6","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). Show and tell: A neural image caption generator. Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_8","unstructured":"Graves, A., Wayne, G., and Danihelka, I. (2014). Neural turing machines. arXiv."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"471","DOI":"10.1038\/nature20101","article-title":"Hybrid computing using a neural network with dynamic external memory","volume":"538","author":"Graves","year":"2016","journal-title":"Nature"},{"key":"ref_10","unstructured":"Chunseong Park, C., Kim, B., and Kim, G. (2017, January 21\u201326). Attend to you: Personalized image captioning with context sequence memory networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"289","DOI":"10.2307\/1942603","article-title":"Bottom-Up and Top-Down Impacts on Freshwater Pelagic Community Structure","volume":"59","author":"Mcqueen","year":"1989","journal-title":"Ecol. Monogr."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. (2010). Every Picture Tells a Story: Generating Sentences from Images. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-642-15561-1_2"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Sun, C., Gan, C., and Nevatia, R. (2015, January 7\u201313). Automatic concept discovery from parallel text and visual corpora. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.298"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"2891","DOI":"10.1109\/TPAMI.2012.162","article-title":"Babytalk: Understanding and generating simple image descriptions","volume":"35","author":"Kulkarni","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_15","unstructured":"Kuznetsova, P., Ordonez, V., Berg, A.C., Berg, T.L., and Choi, Y. (2012, January 8). Collective generation of natural image descriptions. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, Association for Computational Linguistics, Jeju Island, Korea."},{"key":"ref_16","unstructured":"Mitchell, M., Han, X., Dodge, J., Mensch, A., Goyal, A., Berg, A., Yamaguchi, K., Berg, T., Stratos, K., and Daum\u00e9, H. (2012, January 23). Midge: Generating image descriptions from computer vision detections. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Avignon, France."},{"key":"ref_17","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1007\/s10590-017-9197-z","article-title":"Zero-resource Machine Translation by Multimodal Encoder-decoder Network with Multimedia Pivot","volume":"31","author":"Nakayama","year":"2016","journal-title":"Mach. Transl."},{"key":"ref_19","first-page":"3104","article-title":"Sequence to sequence learning with neural networks","volume":"2","author":"Sutskever","year":"2014","journal-title":"Adv. NIPS"},{"key":"ref_20","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (July, January 26). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Madison, WI, USA."},{"key":"ref_21","unstructured":"Kiros, R., Salakhutdinov, R., and Zemel, R. (2014, January 10). Multimodal neural language models. Proceedings of the International Conference on Machine Learning, Beijing, China."},{"key":"ref_22","unstructured":"Mao, J., Xu, W., Yang, Y., Wang, J., and Yuille, A.L. (2014). Explain images with multimodal recurrent neural networks. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Chen, X., and Zitnick, C.L. (2014). Learning a recurrent visual representation for image caption generation. arXiv.","DOI":"10.1109\/CVPR.2015.7298856"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Fang, H., Gupta, S., Iandola, F., Srivastava, R.K., Deng, L., Doll\u00e1r, P., Gao, J., He, X., Mitchell, M., and Platt, J.C. (2015, January 7\u201312). From captions to visual concepts and back. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298754"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Lu, J., Xiong, C., Parikh, D., and Socher, R. (2017, January 21\u201326). Knowing when to look: Adaptive attention via a visual sentinel for image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.345"},{"key":"ref_26","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015, January 6\u201311). Show, attend and tell: Neural image caption generation with visual attention. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Qu, B., Li, X., Tao, D., and Lu, X. (2016, January 6\u20138). Deep semantic understanding of high resolution remote sensing image. Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems, Kunming, China.","DOI":"10.1109\/CITS.2016.7546397"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"2183","DOI":"10.1109\/TGRS.2017.2776321","article-title":"Exploring models and data for remote sensing image caption generation","volume":"56","author":"Lu","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Diao, W., Zhang, W., Yan, M., Gao, X., and Sun, X. (2019). LAM: Remote Sensing Image Captioning with Label-Attention Mechanism. Remote Sens., 11.","DOI":"10.3390\/rs11202349"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yang, Y., and Newsam, S. (2010, January 2\u20135). Bag-of-visual-words and spatial extensions for land-use classification. Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA.","DOI":"10.1145\/1869790.1869829"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"2175","DOI":"10.1109\/TGRS.2014.2357078","article-title":"Saliency-guided unsupervised feature learning for scene classification","volume":"53","author":"Zhang","year":"2014","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 7\u201312). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_33","unstructured":"Lavie, A., and Agarwal, A. (2005, January 11). METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. Proceedings of the Second Workshop on Statistical Machine Translation. Association for Computational Linguistics, Ann Arbor, MI, USA."},{"key":"ref_34","unstructured":"Lin, C.Y. (2004, January 25\u201326). Rouge: A package for automatic evaluation of summaries. Proceedings of the Text Summarization Branches out, Barcelona, Spain."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015, January 7\u201312). Cider: Consensus-based image description evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_36","first-page":"382","article-title":"SPICE: Semantic Propositional Image Caption Evaluation","volume":"11","author":"Anderson","year":"2016","journal-title":"Adapt. Behav."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. (2017, January 20). Self-critical sequence training for image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.131"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Zhang, X., Wang, X., Tang, X., Zhou, H., and Li, C. (2019). Description generation for remote sensing images using attribute attention mechanism. Remote Sens., 11.","DOI":"10.3390\/rs11060612"}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/12\/11\/1874\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T09:37:07Z","timestamp":1760175427000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/12\/11\/1874"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,6,9]]},"references-count":39,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2020,6]]}},"alternative-id":["rs12111874"],"URL":"https:\/\/doi.org\/10.3390\/rs12111874","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,6,9]]}}}