{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T00:56:17Z","timestamp":1760057777194,"version":"build-2065373602"},"reference-count":40,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2025,2,25]],"date-time":"2025-02-25T00:00:00Z","timestamp":1740441600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key R and D Program of China","award":["2021YFF0501101","62106074","52272347","2024JJ7132"],"award-info":[{"award-number":["2021YFF0501101","62106074","52272347","2024JJ7132"]}]},{"name":"National Natural Science Foundation of China Youth Fund Project","award":["2021YFF0501101","62106074","52272347","2024JJ7132"],"award-info":[{"award-number":["2021YFF0501101","62106074","52272347","2024JJ7132"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["2021YFF0501101","62106074","52272347","2024JJ7132"],"award-info":[{"award-number":["2021YFF0501101","62106074","52272347","2024JJ7132"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Science Fund of Hunan","award":["2021YFF0501101","62106074","52272347","2024JJ7132"],"award-info":[{"award-number":["2021YFF0501101","62106074","52272347","2024JJ7132"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>An image\u2013text retrieval method that integrates intramodal fine-grained local semantic information and intermodal global semantic information is proposed to address the weak fine-grained discrimination capabilities for the semantic features located between image and text modalities in cross-modal retrieval tasks. 
First, the original features of images and texts are extracted, and a graph attention network is employed for region relationship reasoning to obtain relation-enhanced local features. Then, an attention mechanism is applied to semantically interacting samples within the same modality, enabling comprehensive intramodal relationship learning and producing semantically enhanced image and text embeddings. Finally, a triplet loss function, enhanced with an angular constraint, is used to train the entire model. Through extensive comparative experiments on the Flickr30K and MS-COCO benchmark datasets, the effectiveness and superiority of the proposed method were verified: it outperformed the current method by a relative 6.4% for image retrieval and a relative 1.3% for caption retrieval on MS-COCO (Recall@1 on the 1K test set).<\/jats:p>","DOI":"10.3390\/bdcc9030053","type":"journal-article","created":{"date-parts":[[2025,2,25]],"date-time":"2025-02-25T10:55:48Z","timestamp":1740480948000},"page":"53","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Fine-Grained Local and Global Semantic Fusion for Multimodal Image\u2013Text Retrieval"],"prefix":"10.3390","volume":"9","author":[{"given":"Shenao","family":"Peng","sequence":"first","affiliation":[{"name":"College of Railway Transportation, Hunan University of Technology, Zhuzhou 412007, China"}]},{"given":"Zhongmei","family":"Wang","sequence":"additional","affiliation":[{"name":"College of Railway Transportation, Hunan University of Technology, Zhuzhou 412007, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1694-0975","authenticated-orcid":false,"given":"Jianhua","family":"Liu","sequence":"additional","affiliation":[{"name":"College of Railway Transportation, Hunan University of Technology, Zhuzhou 412007, China"}]},{"given":"Changfan","family":"Zhang","sequence":"additional","affiliation":[{"name":"College of Railway 
Transportation, Hunan University of Technology, Zhuzhou 412007, China"}]},{"given":"Lin","family":"Jia","sequence":"additional","affiliation":[{"name":"College of Railway Transportation, Hunan University of Technology, Zhuzhou 412007, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,2,25]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Pan, Z., Wu, F., and Zhang, B. (2023, January 18\u201322). Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01847"},{"key":"ref_2","unstructured":"Li, K., Zhang, Y., Li, K., Li, Y., and Fu, Y. (November, January 27). Visual Semantic Reasoning for Image-Text Matching. Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Fu, Z., Mao, Z., Song, Y., and Zhang, Y. (2023, January 18\u201322). Learning Semantic Relationship Among Instances for Image-Text Matching. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01455"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Wei, X., Zhang, T., Li, Y., Zhang, Y., and Wu, F. (2020, January 14\u201319). Multi-Modality Cross Attention Network for Image and Sentence Matching. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01095"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Albalawi, B.M., Jamal, A.T., Al Khuzayem, L.A., and Alsaedi, O.A. (2024). An End-to-End Scene Text Recognition for Bilingual Text. Big Data Cogn. 
Comput., 8.","DOI":"10.3390\/bdcc8090117"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"2392847","DOI":"10.1080\/17538947.2024.2392847","article-title":"Incorporating object counts into remote sensing image captioning","volume":"17","author":"Ren","year":"2024","journal-title":"Int. J. Digit. Earth"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Rao, J., Wang, F., Ding, L., Qi, S., Zhan, Y., Liu, W., and Tao, D. (2022, January 11\u201315). Where Does the Performance Improvement Come From?\u2014A Reproducibility Concern about Image-Text Retrieval. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval\u2014SIGIR\u201922, New York, NY, USA.","DOI":"10.1145\/3477495.3531715"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Wu, Y., Wang, S., Song, G., and Huang, Q. (2019, January 21\u201325). Learning Fragment Self-Attention Embeddings for Image-Text Matching. Proceedings of the 27th ACM International Conference on Multimedia\u2014MM\u201919, New York, NY, USA.","DOI":"10.1145\/3343031.3350940"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Qu, L., Liu, M., Cao, D., Nie, L., and Tian, Q. (2020, January 12\u201316). Context-Aware Multi-View Summarization Network for Image-Text Matching. Proceedings of the 28th ACM International Conference on Multimedia\u2014MM\u201920, Seattle, WA, USA.","DOI":"10.1145\/3394171.3413961"},{"key":"ref_10","first-page":"10","article-title":"Graph attention networks","volume":"1050","author":"Velickovic","year":"2017","journal-title":"Stat"},{"key":"ref_11","first-page":"I","article-title":"Attention is all you need","volume":"30","author":"Ashish","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Chen, J., Hu, H., Wu, H., Jiang, Y., and Wang, C. (2021, January 19\u201325). Learning the Best Pooling Strategy for Visual Semantic Embedding. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01553"},{"key":"ref_13","unstructured":"Faghri, F., Fleet, D.J., Kiros, J.R., and Fidler, S. (2018). VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. arXiv."},{"key":"ref_14","first-page":"2370","article-title":"A Survey on Deep Learning Based Image-Text Matching","volume":"46","author":"Liu","year":"2023","journal-title":"Chin. J. Comput."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Wang, J., Zhou, F., Wen, S., Liu, X., and Lin, Y. (2017, January 22\u201329). Deep Metric Learning with Angular Loss. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.283"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Wang, L., Li, Y., and Lazebnik, S. (2016, January 27\u201330). Learning Deep Structure-Preserving Image-Text Embeddings. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.541"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"394","DOI":"10.1109\/TPAMI.2018.2797921","article-title":"Learning Two-Branch Neural Networks for Image-Text Matching Tasks","volume":"41","author":"Wang","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_18","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Biten, A.F., Mafla, A., G\u00f3mez, L., and Karatzas, D. (2022, January 3\u20138). Is an Image Worth Five Sentences? A New Look Into Semantics for Image-Text Matching. 
Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA.","DOI":"10.1109\/WACV51458.2022.00254"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Lee, K.H., Chen, X., Hua, G., Hu, H., and He, X. (2018, January 8\u201314). Stacked Cross Attention for Image-Text Matching. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Chen, H., Ding, G., Liu, X., Lin, Z., Liu, J., and Han, J. (2020, January 14\u201319). IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01267"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3499027","article-title":"Cross-modal Graph Matching Network for Image-text Retrieval","volume":"18","author":"Cheng","year":"2022","journal-title":"ACM Trans. Multimed. Comput. Commun. Appl."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Zhang, K., Mao, Z., Wang, Q., and Zhang, Y. (2022, January 18\u201324). Negative-Aware Attention Framework for Image-Text Matching. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01521"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Wang, J.H., Norouzi, M., and Tsai, S.M. (2024). Augmenting Multimodal Content Representation with Transformers for Misinformation Detection. Big Data Cogn. Comput., 8.","DOI":"10.3390\/bdcc8100134"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Nam, H., Ha, J.W., and Kim, J. (2017, January 21\u201326). Dual Attention Networks for Multimodal Reasoning and Matching. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.232"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018, January 18\u201322). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"ref_29","unstructured":"Meila, M., and Zhang, T. (2021, January 18\u201324). Learning Intra-Batch Connections for Deep Metric Learning. Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual Event. Proceedings of Machine Learning Research."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"KAYA, M., and B\u0130LGE, H.\u015e. (2019). Deep Metric Learning: A Survey. Symmetry, 11.","DOI":"10.3390\/sym11091066"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T. (2014). Microsoft COCO: Common Objects in Context. 
Proceedings of the Computer Vision\u2013ECCV 2014, Zurich, Switzerland, 6\u201312 September 2014, Springer Nature.","DOI":"10.1007\/978-3-319-10605-2"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., and Lazebnik, S. (2015, January 7\u201313). Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.303"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3451390","article-title":"Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders","volume":"17","author":"Messina","year":"2021","journal-title":"ACM Trans. Multimed. Comput. Commun. Appl."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"641","DOI":"10.1109\/TPAMI.2022.3148470","article-title":"Image-Text Embedding Learning via Visual and Textual Semantic Reasoning","volume":"45","author":"Li","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Zhang, Q., Lei, Z., Zhang, Z., and Li, S.Z. (2020, January 14\u201319). Context-Aware Attention Network for Image-Text Retrieval. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00359"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Wei, J., Xu, X., Yang, Y., Ji, Y., Wang, Z., and Shen, H.T. (2020, January 14\u201319). Universal Weighting Metric Learning for Cross-Modal Matching. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01302"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3563390","article-title":"Scene Graph Semantic Inference for Image and Text Matching","volume":"22","author":"Pei","year":"2023","journal-title":"ACM Trans. Asian Low-Resour. Lang. Inf. Process."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Zhou, H., Geng, Y., Zhao, J., and Ma, X. (2024, January 8\u201310). Semantic-Enhanced Attention Network for Image-Text Matching. Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Tianjin, China.","DOI":"10.1109\/CSCWD61410.2024.10580166"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Liu, C., Mao, Z., Liu, A.-A., Zhang, T., Wang, B., and Zhang, Y. (2019, January 21\u201325). Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching. Proceedings of the 27th ACM International Conference on Multimedia (MM \u201919), Nice, France.","DOI":"10.1145\/3343031.3350869"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Wang, Z., Liu, X., Li, H., Sheng, L., Yan, J., Wang, X., and Shao, J. (November, January 27). CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval. Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea. 
Available online: https:\/\/openaccess.thecvf.com\/content_ICCV_2019\/html\/Wang_CAMP_Cross-Modal_Adaptive_Message_Passing_for_Text-Image_Retrieval_ICCV_2019_paper.html.","DOI":"10.1109\/ICCV.2019.00586"}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/3\/53\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:42:17Z","timestamp":1760028137000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/3\/53"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,25]]},"references-count":40,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2025,3]]}},"alternative-id":["bdcc9030053"],"URL":"https:\/\/doi.org\/10.3390\/bdcc9030053","relation":{},"ISSN":["2504-2289"],"issn-type":[{"type":"electronic","value":"2504-2289"}],"subject":[],"published":{"date-parts":[[2025,2,25]]}}}