{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T19:19:27Z","timestamp":1757618367134,"version":"3.44.0"},"reference-count":50,"publisher":"Springer Science and Business Media LLC","issue":"8","license":[{"start":{"date-parts":[[2025,6,11]],"date-time":"2025-06-11T00:00:00Z","timestamp":1749600000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,6,11]],"date-time":"2025-06-11T00:00:00Z","timestamp":1749600000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62072465","61772561"],"award-info":[{"award-number":["62072465","61772561"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2025,8]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Text-image cross-modal matching is a core challenge in multimodal machine learning, aiming to enable efficient retrieval of images and texts across different modalities. The difficulty in this task stems from the inherent gap between text and image representations, which can lead to suboptimal retrieval performance. Traditional approaches attempt to learn a shared representation space where image and text can be directly compared. However, they often fail to account for the varying levels of semantic information captured in different layers of the encoders, resulting in inadequate alignment between the modalities. 
To address these limitations, we propose a novel approach called <jats:bold>P<\/jats:bold>rogressive <jats:bold>M<\/jats:bold>ulti-<jats:bold>S<\/jats:bold>ubspace <jats:bold>F<\/jats:bold>usion, dubbed <jats:bold>PMSF<\/jats:bold>, for text-image matching. Our model reduces the modality gap through a progressive learning process, starting with shallow representations and moving to deeper layers. We use a dual-tower structure to encode multi-level features for both image and text, which are then mapped to corresponding auxiliary subspaces. These subspaces are fused through an adaptive GPO pooling strategy, enabling joint learning of a shared representation space. Experimental results on benchmark datasets, including Flickr30K and MSCOCO, show that PMSF significantly improves retrieval performance, achieving Rsum scores of 516.9 and 510.7 and outperforming 23 state-of-the-art methods.<\/jats:p>","DOI":"10.1007\/s40747-025-01946-1","type":"journal-article","created":{"date-parts":[[2025,6,11]],"date-time":"2025-06-11T05:39:22Z","timestamp":1749620362000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Progressive multi-subspace fusion for text-image matching"],"prefix":"10.1007","volume":"11","author":[{"given":"Haoming","family":"Wang","sequence":"first","affiliation":[]},{"given":"Li","family":"Zhu","sequence":"additional","affiliation":[]},{"given":"Wentao","family":"Ma","sequence":"additional","affiliation":[]},{"given":"Qian\u2019ge","family":"Guo","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,6,11]]},"reference":[{"key":"1946_CR1","doi-asserted-by":"crossref","unstructured":"Wang, S., Wang, R., Yao, Z., Shan, S., Chen, X.: Cross-modal scene graph matching for relationship-aware image-text retrieval. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 
1508\u20131517 (2020)","DOI":"10.1109\/WACV45572.2020.9093614"},{"key":"1946_CR2","doi-asserted-by":"publisher","first-page":"374","DOI":"10.1109\/LSP.2021.3135825","volume":"29","author":"H Lan","year":"2021","unstructured":"Lan H, Zhang P (2021) Learning and integrating multi-level matching features for image-text retrieval. IEEE Signal Process Lett 29:374\u2013378","journal-title":"IEEE Signal Process Lett"},{"key":"1946_CR3","doi-asserted-by":"crossref","unstructured":"Pei, J., Zhong, K., Yu, Z., Wang, L., Lakshmanna, K.: Scene graph semantic inference for image and text matching. Transactions on Asian and Low-Resource Language Information Processing (2022)","DOI":"10.1145\/3563390"},{"key":"1946_CR4","doi-asserted-by":"crossref","unstructured":"Zeng, P., Gao, L., Lyu, X., Jing, S., Song, J.: Conceptual and syntactical cross-modal alignment with cross-level consistency for image-text matching. In: Proceedings of the ACM International Conference on Multimedia, pp. 2205\u20132213 (2021)","DOI":"10.1145\/3474085.3475380"},{"issue":"4","key":"1946_CR5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3499027","volume":"18","author":"Y Cheng","year":"2022","unstructured":"Cheng Y, Zhu X, Qian J, Wen F, Liu P (2022) Cross-modal graph matching network for image-text retrieval. ACM Trans Multimed Comput Commun Appl 18(4):1\u201323","journal-title":"ACM Trans Multimed Comput Commun Appl"},{"key":"1946_CR6","doi-asserted-by":"crossref","unstructured":"Li, Z., Guo, C., Feng, Z., Hwang, J.-N., Xue, X.: Multi-view visual semantic embedding. In: Proceedings of the International Joint Conference on Artificial Intelligence (2022)","DOI":"10.24963\/ijcai.2022\/158"},{"key":"1946_CR7","doi-asserted-by":"crossref","unstructured":"Long, S., Han, S.C., Wan, X., Poon, J.: Gradual: Graph-based dual-modal representation for image-text matching. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 
3459\u20133468 (2022)","DOI":"10.1109\/WACV51458.2022.00252"},{"key":"1946_CR8","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2024.108005","volume":"133","author":"T Yao","year":"2024","unstructured":"Yao T, Peng S, Sun Y, Sheng G, Fu H, Kong X (2024) Cross-modal semantic interference suppression for image-text matching. Eng Appl Artif Intell 133:108005","journal-title":"Eng Appl Artif Intell"},{"key":"1946_CR9","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2023.105923","volume":"120","author":"X Qin","year":"2023","unstructured":"Qin X, Li L, Hao F, Pang G, Wang Z (2023) Cross-modal information balance-aware reasoning network for image-text retrieval. Eng Appl Artif Intell 120:105923","journal-title":"Eng Appl Artif Intell"},{"key":"1946_CR10","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2023.106439","volume":"123","author":"Z Li","year":"2023","unstructured":"Li Z, Lu H, Fu H, Wang Z, Gu G (2023) Adaptive adversarial learning based cross-modal retrieval. Eng Appl Artif Intell 123:106439","journal-title":"Eng Appl Artif Intell"},{"key":"1946_CR11","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2022.117508","volume":"203","author":"W Ma","year":"2022","unstructured":"Ma W, Zhou T, Qin J, Xiang X, Tan Y, Cai Z (2022) A privacy-preserving content-based image retrieval method based on deep learning in cloud computing. Expert Syst Appl 203:117508","journal-title":"Expert Syst Appl"},{"key":"1946_CR12","doi-asserted-by":"publisher","first-page":"2825","DOI":"10.1109\/TMM.2022.3152090","volume":"25","author":"F Li","year":"2023","unstructured":"Li F, Wu Y, Bai H, Lin W, Cong R, Zhao Y (2023) Learning detail-structure alternative optimization for blind super-resolution. IEEE Trans Multimedia 25:2825\u20132838","journal-title":"IEEE Trans Multimedia"},{"key":"1946_CR13","unstructured":"Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: Vse++: Improving visual-semantic embeddings with hard negatives. 
arXiv preprint arXiv:1707.05612 (2017)"},{"issue":"2","key":"1946_CR14","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3383184","volume":"16","author":"Z Zheng","year":"2020","unstructured":"Zheng Z, Zheng L, Garrett M, Yang Y, Xu M, Shen Y-D (2020) Dual-path convolutional image-text embeddings with instance loss. ACM Trans Multimed Comput Commun Appl 16(2):1\u201323","journal-title":"ACM Trans Multimed Comput Commun Appl"},{"key":"1946_CR15","doi-asserted-by":"crossref","unstructured":"Chen, J., Hu, H., Wu, H., Jiang, Y., Wang, C.: Learning the best pooling strategy for visual semantic embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 15789\u201315798 (2021)","DOI":"10.1109\/CVPR46437.2021.01553"},{"issue":"4","key":"1946_CR16","first-page":"1","volume":"17","author":"X Xu","year":"2021","unstructured":"Xu X, Wang Y, He Y, Yang Y, Hanjalic A, Shen HT (2021) Cross-modal hybrid feature fusion for image-sentence matching. ACM Trans Multimed Comput Commun Appl 17(4):1\u201323","journal-title":"ACM Trans Multimed Comput Commun Appl"},{"key":"1946_CR17","doi-asserted-by":"publisher","first-page":"5065","DOI":"10.1109\/TMM.2023.3330091","volume":"26","author":"W Ma","year":"2024","unstructured":"Ma W, Wu X, Zhao S, Zhou T, Guo D, Gu L, Cai Z, Wang M (2024) Fedsh: Towards privacy-preserving text-based person re-identification. IEEE Trans Multimedia 26:5065\u20135077","journal-title":"IEEE Trans Multimedia"},{"key":"1946_CR18","doi-asserted-by":"publisher","first-page":"21847","DOI":"10.1109\/ACCESS.2020.2969808","volume":"8","author":"Z Li","year":"2020","unstructured":"Li Z, Ling F, Zhang C, Ma H (2020) Combining global and local similarity for cross-media retrieval. 
IEEE Access 8:21847\u201321856","journal-title":"IEEE Access"},{"issue":"1","key":"1946_CR19","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2020.102432","volume":"58","author":"W-H Li","year":"2021","unstructured":"Li W-H, Yang S, Wang Y, Song D, Li X-Y (2021) Multi-level similarity learning for image-text retrieval. Information Processing & Management 58(1):102432","journal-title":"Information Processing & Management"},{"key":"1946_CR20","unstructured":"Liu, X., He, Y., Cheung, Y.-M., Xu, X., Wang, N.: Learning relationship-enhanced semantic graph for fine-grained image\u2013text matching. IEEE Transactions on Cybernetics (2022)"},{"issue":"1","key":"1946_CR21","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2022.103119","volume":"60","author":"W Ma","year":"2023","unstructured":"Ma W, Zhou T, Qin J, Xiang X, Tan Y, Cai Z (2023) Adaptive multi-feature fusion via cross-entropy normalization for effective image retrieval. Information Processing & Management 60(1):103119","journal-title":"Information Processing & Management"},{"key":"1946_CR22","doi-asserted-by":"crossref","unstructured":"Zeng, S., Liu, C., Zhou, J., Chen, Y., Jiang, A., Li, H.: Learning hierarchical semantic correspondences for cross-modal image-text retrieval. In: Proceedings of the International Conference on Multimedia Retrieval, pp. 239\u2013248 (2022)","DOI":"10.1145\/3512527.3531358"},{"key":"1946_CR23","doi-asserted-by":"crossref","unstructured":"Chen, S., Zhao, Y., Jin, Q., Wu, Q.: Fine-grained video-text retrieval with hierarchical graph reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
10638\u201310647 (2020)","DOI":"10.1109\/CVPR42600.2020.01065"},{"issue":"5","key":"1946_CR24","doi-asserted-by":"publisher","first-page":"7150","DOI":"10.1109\/TNNLS.2022.3214208","volume":"35","author":"W Ma","year":"2024","unstructured":"Ma W, Chen Q, Liu F, Zhou T, Cai Z (2024) Query-adaptive late fusion for hierarchical fine-grained video-text retrieval. IEEE Transactions on Neural Networks and Learning Systems 35(5):7150\u20137161","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"1946_CR25","unstructured":"Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Proceedings of the Advances in Neural Information Processing Systems 28 (2015)"},{"key":"1946_CR26","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2022.108213","volume":"241","author":"W Ma","year":"2022","unstructured":"Ma W, Zhou T, Qin J, Zhou Q, Cai Z (2022) Joint-attention feature fusion network and dual-adaptive nms for object detection. Knowl-Based Syst 241:108213","journal-title":"Knowl-Based Syst"},{"key":"1946_CR27","doi-asserted-by":"crossref","unstructured":"Wang, Z., Liu, X., Li, H., Sheng, L., Yan, J., Wang, X., Shao, J.: Camp: Cross-modal adaptive message passing for text-image retrieval. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5764\u20135773 (2019)","DOI":"10.1109\/ICCV.2019.00586"},{"key":"1946_CR28","doi-asserted-by":"crossref","unstructured":"Zhang, Q., Lei, Z., Zhang, Z., Li, S.Z.: Context-aware attention network for image-text retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3536\u20133545 (2020)","DOI":"10.1109\/CVPR42600.2020.00359"},{"key":"1946_CR29","doi-asserted-by":"crossref","unstructured":"Wang, H., Zhang, Y., Ji, Z., Pang, Y., Ma, L.: Consensus-aware visual-semantic embedding for image-text matching. In: Proceedings of the European Conference on Computer Vision, pp. 
18\u201334 (2020). Springer","DOI":"10.1007\/978-3-030-58586-0_2"},{"key":"1946_CR30","doi-asserted-by":"crossref","unstructured":"Wei, J., Xu, X., Wang, Z., Wang, G.: Meta self-paced learning for cross-modal matching. In: Proceedings of the ACM International Conference on Multimedia, pp. 3835\u20133843 (2021)","DOI":"10.1145\/3474085.3475451"},{"issue":"12","key":"1946_CR31","doi-asserted-by":"publisher","first-page":"5412","DOI":"10.1109\/TNNLS.2020.2967597","volume":"31","author":"X Xu","year":"2020","unstructured":"Xu X, Wang T, Yang Y, Zuo L, Shen F, Shen HT (2020) Cross-modal attention with semantic consistence for image-text matching. IEEE Transactions on Neural Networks and Learning Systems 31(12):5412\u20135425","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"1946_CR32","doi-asserted-by":"publisher","first-page":"1332","DOI":"10.1109\/LSP.2022.3178899","volume":"29","author":"Y Liu","year":"2022","unstructured":"Liu Y, Liu H, Wang H, Liu M (2022) Regularizing visual semantic embedding with contrastive learning for image-text matching. IEEE Signal Process Lett 29:1332\u20131336","journal-title":"IEEE Signal Process Lett"},{"key":"1946_CR33","doi-asserted-by":"crossref","unstructured":"Lee, K.-H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: Proceedings of the European Conference on Computer Vision, pp. 201\u2013216 (2018)","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"1946_CR34","unstructured":"Huo, Y., Zhang, M., Liu, G., Lu, H., Gao, Y., Yang, G., Wen, J., Zhang, H., Xu, B., Zheng, W., et al.: Wenlan: Bridging vision and language by large-scale multi-modal pre-training. arXiv preprint arXiv:2103.06561 (2021)"},{"key":"1946_CR35","unstructured":"Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. 
In: Proceedings of the International Conference on Machine Learning, pp. 4904\u20134916 (2021). PMLR"},{"key":"1946_CR36","unstructured":"Yuan, L., Chen, D., Chen, Y.-L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., et al.: Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432 (2021)"},{"key":"1946_CR37","doi-asserted-by":"crossref","unstructured":"Ma, Y., Xu, G., Sun, X., Yan, M., Zhang, J., Ji, R.: X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In: Proceedings of the ACM International Conference on Multimedia, pp. 638\u2013647 (2022)","DOI":"10.1145\/3503161.3547910"},{"key":"1946_CR38","doi-asserted-by":"publisher","first-page":"293","DOI":"10.1016\/j.neucom.2022.07.028","volume":"508","author":"H Luo","year":"2022","unstructured":"Luo H, Ji L, Zhong M, Chen Y, Lei W, Duan N, Li T (2022) Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing 508:293\u2013304","journal-title":"Neurocomputing"},{"key":"1946_CR39","doi-asserted-by":"crossref","unstructured":"Gorti, S.K., Vouitsis, N., Ma, J., Golestan, K., Volkovs, M., Garg, A., Yu, G.: X-pool: Cross-modal language-video attention for text-video retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5006\u20135015 (2022)","DOI":"10.1109\/CVPR52688.2022.00495"},{"key":"1946_CR40","doi-asserted-by":"crossref","unstructured":"Yan, S., Dong, N., Zhang, L., Tang, J.: Clip-driven fine-grained text-image person re-identification. arXiv preprint arXiv:2210.10276 (2022)","DOI":"10.1109\/TIP.2023.3327924"},{"key":"1946_CR41","unstructured":"Cao, M., Bai, Y., Zeng, Z., Ye, M., Zhang, M.: An empirical study of clip for text-based person search. arXiv preprint arXiv:2308.10045 (2023)"},{"key":"1946_CR42","unstructured":"Kenton, J.D.M.-W.C., Toutanova, L.K.: Bert: Pre-training of deep bidirectional transformers for language understanding. 
In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171\u20134186 (2019)"},{"key":"1946_CR43","doi-asserted-by":"crossref","unstructured":"Liu, H., Luo, R., Shang, F., Niu, M., Liu, Y.: Progressive semantic matching for video-text retrieval. Proceedings of the ACM International Conference on Multimedia (2021)","DOI":"10.1145\/3474085.3475621"},{"key":"1946_CR44","doi-asserted-by":"crossref","unstructured":"Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2:67\u201378","DOI":"10.1162\/tacl_a_00166"},{"key":"1946_CR45","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision, pp. 740\u2013755 (2014). Springer","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"1946_CR46","unstructured":"Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)"},{"key":"1946_CR47","unstructured":"Ge, R., Kakade, S.M., Kidambi, R., Netrapalli, P.: The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares. Proceedings of the Advances in Neural Information Processing Systems 32 (2019)"},{"key":"1946_CR48","doi-asserted-by":"crossref","unstructured":"Li, K., Zhang, Y., Li, K., Li, Y., Fu, Y.: Visual semantic reasoning for image-text matching. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 
4654\u20134662 (2019)","DOI":"10.1109\/ICCV.2019.00475"},{"key":"1946_CR49","doi-asserted-by":"crossref","unstructured":"Liu, C., Mao, Z., Zhang, T., Xie, H., Wang, B., Zhang, Y.: Graph structured network for image-text matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10921\u201310930 (2020)","DOI":"10.1109\/CVPR42600.2020.01093"},{"key":"1946_CR50","unstructured":"Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of Machine Learning Research 9(11) (2008)"}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-025-01946-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-025-01946-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-025-01946-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,6]],"date-time":"2025-09-06T19:13:14Z","timestamp":1757185994000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-025-01946-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,11]]},"references-count":50,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2025,8]]}},"alternative-id":["1946"],"URL":"https:\/\/doi.org\/10.1007\/s40747-025-01946-1","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"type":"print","value":"2199-4536"},{"type":"electronic","value":"2198-6053"}],"subject":[],"published":{"date-parts":[[2025,6,11]]},"assertion":[{"value":"29 March 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 May 
2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 June 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"332"}}