{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T06:14:17Z","timestamp":1768284857792,"version":"3.49.0"},"reference-count":38,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,4,29]],"date-time":"2025-04-29T00:00:00Z","timestamp":1745884800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,4,29]],"date-time":"2025-04-29T00:00:00Z","timestamp":1745884800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Vis. Intell."],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Text-to-image person retrieval, a fine-grained cross-modal retrieval problem, aims to search for person images from an image library that match a given textual caption. Existing text-to-image person retrieval methods usually use fixed-point embedding to express the semantics of the two modalities and perform multi-granularity alignment between modalities in the embedding space. However, owing to the inherent mutual one-to-many correspondence between images and texts, it is often difficult for fixed-point embedding methods to adequately capture this relationship, leading to erroneous retrieval results. To address this problem, we propose a novel uncertainty-aware coarse-to-fine alignment method, which first maps fixed-point embedding to probability distributions and then aligns two modalities in terms of distributions and sampling points at a coarse-to-fine granularity, for accurate text-to-image person retrieval. Specifically, we first introduce two contrastive learning tasks of distribution contrast learning and point contrast learning, to achieve coarse-grained inter-modal alignment with uncertainty-aware. The distribution contrast learning task ensures that distributions with the same identity are as similar as possible across modalities through distribution-based contrastive learning. The point contrast learning task performs the contrastive learning of inter-modal and intra-modal sampling points, which not only models rich and diverse cross-modal associations, but also optimizes the learning of distributions. For the fine-grained association requirements of text-to-image person retrieval, we design the task of uncertainty-aware attribute masking language reconstruction, which achieves fine-grained alignment by randomly masking attribute words in the text and reconstructing them via inter-modal sample point interactions. Extensive experiments on two public datasets demonstrate the superior performance of our method.<\/jats:p>","DOI":"10.1007\/s44267-025-00078-x","type":"journal-article","created":{"date-parts":[[2025,4,29]],"date-time":"2025-04-29T04:21:23Z","timestamp":1745900483000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Uncertainty-aware coarse-to-fine alignment for text-image person retrieval"],"prefix":"10.1007","volume":"3","author":[{"given":"Yifei","family":"Deng","sequence":"first","affiliation":[]},{"given":"Zhengyu","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Chenglong","family":"Li","sequence":"additional","affiliation":[]},{"given":"Jin","family":"Tang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,4,29]]},"reference":[{"key":"78_CR1","unstructured":"Lei, J., Chen, X., Zhang, N., Wang, M., Bansal, M., Berg, T. L., & Yu, L. (2022). LoopITR: combining dual and cross encoder architectures for image-text retrieval. arXiv preprint. arXiv:2203.05465."},{"issue":"2","key":"78_CR2","doi-asserted-by":"publisher","first-page":"579","DOI":"10.1109\/TCSVT.2021.3067997","volume":"32","author":"Y. Zhu","year":"2022","unstructured":"Zhu, Y., Li, C., Tang, J., Luo, B., & Wang, L. (2022). RGBT tracking by trident fusion network. IEEE Transactions on Circuits and Systems for Video Technology, 32(2), 579\u2013592.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"78_CR3","doi-asserted-by":"publisher","first-page":"465","DOI":"10.1145\/3343031.3350928","volume-title":"Proceedings of the 27th ACM international conference on multimedia","author":"Y. Zhu","year":"2019","unstructured":"Zhu, Y., Li, C., Luo, B., Tang, J., & Wang, X. (2019). Dense feature aggregation and pruning for RGBT tracking. In Proceedings of the 27th ACM international conference on multimedia (pp. 465\u2013472). New York: ACM."},{"key":"78_CR4","first-page":"1305","volume-title":"Proceedings of the IEEE international conference on computer vision","author":"J. Garc\u00eda","year":"2015","unstructured":"Garc\u00eda, J., Martinel, N., Micheloni, C., & Vicente, A. G. (2015). Person re-identification ranking optimisation by discriminant context information analysis. In Proceedings of the IEEE international conference on computer vision (pp. 1305\u20131313). Piscataway: IEEE"},{"key":"78_CR5","first-page":"3641","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"J. Guo","year":"2019","unstructured":"Guo, J., Yuan, Y., Huang, L., Zhang, C., Yao, J., & Han, K. (2019). Beyond human parts: dual part-aligned representations for person re-identification. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 3641\u20133650). Piscataway: IEEE."},{"key":"78_CR6","first-page":"1908","volume-title":"Proceedings of the IEEE international conference on computer vision","author":"S. Li","year":"2017","unstructured":"Li, S., Xiao, T., Li, H., Yang, W., & Wang, X. (2017). Identity-aware textual-visual matching with latent co-attention. In Proceedings of the IEEE international conference on computer vision (pp. 1908\u20131917). Piscataway: IEEE Comput. Soc."},{"key":"78_CR7","first-page":"707","volume-title":"Proceedings of the 15th European conference on computer vision","author":"Y. Zhang","year":"2018","unstructured":"Zhang, Y., & Lu, H. (2018). Deep cross-modal projection learning for image-text matching. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Proceedings of the 15th European conference on computer vision (pp. 707\u2013723). Cham: Springer."},{"key":"78_CR8","first-page":"402","volume-title":"Proceedings of the 16th European conference on computer vision","author":"Z. Wang","year":"2020","unstructured":"Wang, Z., Fang, Z., Wang, J., & Yang, Y. (2020). ViTAA: visual-textual attributes alignment in person search by natural language. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Proceedings of the 16th European conference on computer vision (pp. 402\u2013420). Cham: Springer."},{"issue":"12","key":"78_CR9","doi-asserted-by":"publisher","first-page":"17973","DOI":"10.1109\/TNNLS.2023.3310118","volume":"35","author":"S. Yan","year":"2024","unstructured":"Yan, S., Tang, H., Zhang, L., & Tang, J. (2024). Image-specific information suppression and implicit local alignment for text-based person search. IEEE Transactions on Neural Networks and Learning Systems, 35(12), 17973\u201317986.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"78_CR10","doi-asserted-by":"crossref","unstructured":"Loper, E., & Bird, S. (2002). NLTK: the natural language toolkit. arXiv preprint. arXiv:cs\/0205028.","DOI":"10.3115\/1118108.1118117"},{"key":"78_CR11","doi-asserted-by":"crossref","unstructured":"Jing, Y., Si, C., Wang, J., Wang, W., Wang, L., & Tan, T. Pose-guided multi-granularity attention network for text-based person search. In Proceedings of the 34th AAAI conference on artificial intelligence (pp.\u00a011189\u201311196). Palo Alto: AAAI Press.","DOI":"10.1609\/aaai.v34i07.6777"},{"key":"78_CR12","first-page":"2787","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"D. Jiang","year":"2023","unstructured":"Jiang, D., & Ye, M. (2023). Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 2787\u20132797). Piscataway: IEEE."},{"key":"78_CR13","unstructured":"Ding, Z., Ding, C., Shao, Z., & Tao, D. (2021). Semantically self-aligned network for text-to-image part-aware person re-identification. arXiv preprint. arXiv:2107.12666."},{"key":"78_CR14","doi-asserted-by":"publisher","first-page":"209","DOI":"10.1145\/3474085.3475369","volume-title":"Proceedings of the 29th ACM international conference on multimedia","author":"A. Zhu","year":"2021","unstructured":"Zhu, A., Wang, Z., Li, Y., Wan, X., Jin, J., Wang, T., Hu, F., & Hua, G. (2021). DSSL: deep surroundings-person separation learning for text-based person retrieval. In H. T. Shen, Y. Zhuang, J. R. Smith, Y. Yang, P. Cesar, F. Metze, & B. Prabhakaran (Eds.), Proceedings of the 29th ACM international conference on multimedia (pp. 209\u2013217). New York: ACM."},{"issue":"22","key":"78_CR15","doi-asserted-by":"publisher","first-page":"5657","DOI":"10.3390\/rs14225657","volume":"14","author":"Y. Deng","year":"2022","unstructured":"Deng, Y., Li, C., Lu, A., Li, W., & Luo, B. (2022). Factory extraction from satellite images: benchmark and baseline. Remote Sensing, 14(22), 5657.","journal-title":"Remote Sensing"},{"key":"78_CR16","first-page":"5187","volume-title":"Proceedings of the conference on computer vision and pattern recognition","author":"S. Li","year":"2017","unstructured":"Li, S., Xiao, T., Li, H., Zhou, B., Yue, D., & Wang, X. (2017). Person search with natural language description. In Proceedings of the conference on computer vision and pattern recognition (pp. 5187\u20135196). Piscataway: IEEE"},{"key":"78_CR17","first-page":"1879","volume-title":"Proceedings of the IEEE winter conference on applications of computer vision","author":"T. Chen","year":"2018","unstructured":"Chen, T., Xu, C., & Luo, J. (2018). Improving text-based person search by spatial matching and adaptive threshold. In Proceedings of the IEEE winter conference on applications of computer vision (pp. 1879\u20131887). Piscataway: IEEE"},{"key":"78_CR18","doi-asserted-by":"publisher","first-page":"4057","DOI":"10.1109\/TIP.2021.3068825","volume":"30","author":"Y. Chen","year":"2021","unstructured":"Chen, Y., Huang, R., Chang, H., Tan, C., Xue, T., & Ma, B. (2021). Cross-modal knowledge adaptation for language-based person search. IEEE Transactions on Image Processing, 30, 4057\u20134069.","journal-title":"IEEE Transactions on Image Processing"},{"key":"78_CR19","doi-asserted-by":"publisher","first-page":"6032","DOI":"10.1109\/TIP.2023.3327924","volume":"32","author":"S. Yan","year":"2023","unstructured":"Yan, S., Dong, N., Zhang, L., & Tang, J. (2023). CLIP-driven fine-grained text-image person re-identification. IEEE Transactions on Image Processing, 32, 6032\u20136046.","journal-title":"IEEE Transactions on Image Processing"},{"issue":"2","key":"78_CR20","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3383184","volume":"16","author":"Z. Zheng","year":"2020","unstructured":"Zheng, Z., Zheng, L., Garrett, M., Yang, Y., Xu, M., & Shen, Y. (2020). Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing Communications and Applications, 16(2), 1\u201323.","journal-title":"ACM Transactions on Multimedia Computing Communications and Applications"},{"key":"78_CR21","first-page":"93","volume-title":"Person re-identification, advances in computer vision and pattern recognition","author":"R. Layne","year":"2014","unstructured":"Layne, R., Hospedales, T. M., & Gong, S. (2014). Attributes-based re-identification. In Person re-identification, advances in computer vision and pattern recognition (pp. 93\u2013117). Berlin: Springer."},{"key":"78_CR22","first-page":"1","volume-title":"Proceedings of the 3rd international conference on learning representations","author":"L. Vilnis","year":"2015","unstructured":"Vilnis, L., & McCallum, A. (2015). Word representations via Gaussian embedding. In Y. Bengio & Y. LeCun (Eds.), Proceedings of the 3rd international conference on learning representations (pp. 1\u201312). Retrieved March 10, 2025, from http:\/\/arxiv.org\/abs\/1412.6623."},{"key":"78_CR23","first-page":"5709","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"J. Chang","year":"2020","unstructured":"Chang, J., Lan, Z., Cheng, C., & Wei, Y. (2020). Data uncertainty learning in face recognition. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 5709\u20135718). Piscataway: IEEE."},{"key":"78_CR24","first-page":"9826","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"A. Miech","year":"2021","unstructured":"Miech, A., Alayrac, J., Laptev, I., Sivic, J., & Zisserman, A. (2021). Thinking fast and slow: efficient text-to-visual retrieval with transformers. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 9826\u20139836). Piscataway: IEEE."},{"key":"78_CR25","first-page":"8415","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"S. Chun","year":"2021","unstructured":"Chun, S., Oh, S. J., Rezende, R. S., Kalantidis, Y., & Larlus, D. (2021). Probabilistic embeddings for cross-modal retrieval. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 8415\u20138424). Piscataway: IEEE."},{"key":"78_CR26","first-page":"8748","volume-title":"Proceedings of the 38th international conference on machine learning","author":"A. Radford","year":"2021","unstructured":"Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In Proceedings of the 38th international conference on machine learning (pp. 8748\u20138763). PMLR."},{"key":"78_CR27","first-page":"1","volume-title":"Proceedings of the 9th international conference on learning representations","author":"A. Dosovitskiy","year":"2021","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2021). An image is worth 16x16 words: transformers for image recognition at scale. In Proceedings of the 9th international conference on learning representations (pp. 1\u201322). Retrieved March 10, 2025, from https:\/\/openreview.net\/forum?id=YicbFdNTTy."},{"key":"78_CR28","first-page":"1","volume-title":"Proceedings of the 2nd international conference on learning representations","author":"D. P. Kingma","year":"2014","unstructured":"Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In Y. Bengio & Y. LeCun (Eds.), Proceedings of the 2nd international conference on learning representations (pp. 1\u201314). Retrieved March 10, 2025, from http:\/\/arxiv.org\/abs\/1312.6114."},{"key":"78_CR29","first-page":"23262","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. Ji","year":"2023","unstructured":"Ji, Y., Wang, J., Gong, Y., Zhang, L., Zhu, Y., Wang, H., Zhang, J., Sakai, T., & Yang, Y. (2023). MAP: multimodal uncertainty-aware vision-language pre-training model. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 23262\u201323271). Piscataway: IEEE."},{"key":"78_CR30","first-page":"79","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"L. Wei","year":"2018","unstructured":"Wei, L., Zhang, S., Gao, W., & Tian, Q. (2018). Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 79\u201388). Piscataway: IEEE."},{"key":"78_CR31","first-page":"624","volume-title":"Proceedings of the 17th European conference on computer vision","author":"X. Shu","year":"2022","unstructured":"Shu, X., Wen, W., Wu, H., Chen, K., Song, Y., Qiao, R., Ren, B., & Wang, X. (2022). See finer, see more: implicit modality alignment for text-based person retrieval. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 624\u2013641). Cham: Springer."},{"key":"78_CR32","first-page":"726","volume-title":"Proceedings of the 17th European conference on computer vision","author":"W. Suo","year":"2022","unstructured":"Suo, W., Sun, M., Niu, K., Gao, Y., Wang, P., Zhang, Y., & Wu, Q. (2022). A simple and robust correlation filtering method for text-based person search. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 726\u2013742). Cham: Springer."},{"key":"78_CR33","doi-asserted-by":"publisher","first-page":"5566","DOI":"10.1145\/3503161.3548028","volume-title":"Proceedings of the 30th ACM international conference on multimedia","author":"Z. Shao","year":"2022","unstructured":"Shao, Z., Zhang, X., Fang, M., Lin, Z., Wang, J., & Ding, C. (2022). Learning granularity-unified representations for text-to-image person re-identification. In Proceedings of the 30th ACM international conference on multimedia (pp. 5566\u20135574). New York: ACM."},{"key":"78_CR34","doi-asserted-by":"publisher","first-page":"4157","DOI":"10.1145\/3581783.3611768","volume-title":"Proceedings of the 31st ACM international conference on multimedia","author":"Y. Ma","year":"2023","unstructured":"Ma, Y., Sun, X., Ji, J., Jiang, G., Zhuang, W., & Ji, R. (2023). Beat: bi-directional one-to-many embedding alignment for text-based person retrieval. In A. El-Saddik, T. Mei, R. Cucchiara, M. Bertini, D. P. Tobon Vallejo, P. K. Atrey, & M. S. Hossain (Eds.), Proceedings of the 31st ACM international conference on multimedia (pp. 4157\u20134168). New York: ACM."},{"key":"78_CR35","doi-asserted-by":"publisher","first-page":"6202","DOI":"10.1145\/3581783.3611832","volume-title":"Proceedings of the 31st ACM international conference on multimedia","author":"S. Yan","year":"2023","unstructured":"Yan, S., Dong, N., Liu, J., Zhang, L., & Tang, J. (2023). Learning comprehensive representations with richer self for text-to-image person re-identification. In A. El-Saddik, T. Mei, R. Cucchiara, M. Bertini, D. P. Tobon Vallejo, P. K. Atrey, & M. S. Hossain (Eds.), Proceedings of the 31st ACM international conference on multimedia (pp. 6202\u20136211). New York: ACM."},{"key":"78_CR36","first-page":"11140","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Z. Shao","year":"2023","unstructured":"Shao, Z., Zhang, X., Ding, C., Wang, J., & Wang, J. (2023). Unified pre-training with pseudo texts for text-to-image person re-identification. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 11140\u201311150). Piscataway: IEEE."},{"key":"78_CR37","first-page":"7935","volume-title":"Proceedings of the IEEE international conference on acoustics, speech, and signal processing","author":"Y. Liu","year":"2024","unstructured":"Liu, Y., Li, Y., Liu, Z., Yang, W., Wang, Y., & Liao, Q. (2024). Clip-based synergistic knowledge transfer for text-based person retrieval. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing (pp. 7935\u20137939). Piscataway: IEEE."},{"key":"78_CR38","doi-asserted-by":"publisher","first-page":"1990","DOI":"10.1109\/TIP.2024.3372832","volume":"33","author":"K. Niu","year":"2024","unstructured":"Niu, K., Huang, L., Long, Y., Huang, Y., Wang, L., & Zhang, Y. (2024). Comprehensive attribute prediction learning for person search by language. IEEE Transactions on Image Processing, 33, 1990\u20132003.","journal-title":"IEEE Transactions on Image Processing"}],"container-title":["Visual Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-025-00078-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44267-025-00078-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-025-00078-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,4,29]],"date-time":"2025-04-29T04:21:35Z","timestamp":1745900495000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44267-025-00078-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,29]]},"references-count":38,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["78"],"URL":"https:\/\/doi.org\/10.1007\/s44267-025-00078-x","relation":{},"ISSN":["2097-3330","2731-9008"],"issn-type":[{"value":"2097-3330","type":"print"},{"value":"2731-9008","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,4,29]]},"assertion":[{"value":"5 September 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"15 April 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 April 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 April 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"6"}}