{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T02:19:16Z","timestamp":1772158756542,"version":"3.50.1"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"12","funder":[{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62372203 and 62302186"],"award-info":[{"award-number":["62372203 and 62302186"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Major Scientific and Technological Project of Shenzhen","award":["202316021"],"award-info":[{"award-number":["202316021"]}]},{"name":"National Key Research and Development Program of China","award":["2022YFB2601802"],"award-info":[{"award-number":["2022YFB2601802"]}]},{"name":"Major Scientific and Technological Project of Hubei Province","award":["2022BAA046, 2022BAA042"],"award-info":[{"award-number":["2022BAA046, 2022BAA042"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,12,31]]},"abstract":"<jats:p>\n                    Text-based person search aims to retrieve specific individuals from an extensive image gallery using textual queries. Recent approaches have delved into aligning global and part features in both text and image modalities, yielding substantial improvements. However, these methods often overlook intra-modality instance relations and uncertainties inherent in text-based person search. In response to these challenges, we propose the Cross-Modality Relation and Uncertainty Exploration (CRUE) method to model the relations and uncertainties in the matching procedure. To alleviate the strict alignment issues arising from hard labels in the original contrastive loss, an Intra-Modality Relation Exploration (IRE) module is introduced. 
This module is specifically designed to smooth hard-matching relations by modeling intra-modality similarity. Additionally, to address uncertain matching problems stemming from many-to-many relations, we propose a novel Uncertainty-Guided Modeling (UGM) module. This module is specifically designed to handle weak and noise-matched image\u2013text pairs by modeling features as distributions, thereby alleviating instability and noise. Both the IRE and UGM modules effectively consider genuine intra-modality similarities and reduce the negative impact of uncertainties. Experimental results demonstrate significant improvements across three widely used person search datasets, thereby validating the efficacy of the CRUE method in enhancing text-based person search. Our code will be available on GitHub at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/ShijuanHuang\/CRUE\">https:\/\/github.com\/ShijuanHuang\/CRUE<\/jats:ext-link>.<\/jats:p>","DOI":"10.1145\/3747185","type":"journal-article","created":{"date-parts":[[2025,7,3]],"date-time":"2025-07-03T11:45:45Z","timestamp":1751543145000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Cross-Modality Relation and Uncertainty Exploration for Text-Based Person Search"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-2177-5110","authenticated-orcid":false,"given":"Shijuan","family":"Huang","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4459-4977","authenticated-orcid":false,"given":"Zongyi","family":"Li","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Huazhong University of Science and
Technology, Wuhan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6797-7412","authenticated-orcid":false,"given":"Hefei","family":"Ling","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-2737-2685","authenticated-orcid":false,"given":"Jianbo","family":"Li","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, China"}]}],"member":"320","published-online":{"date-parts":[[2025,11,21]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3321504"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i1.27801"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00575"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2022.04.081"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00831"},{"key":"e_1_3_1_7_2","unstructured":"Zefeng Ding Changxing Ding Zhiyin Shao and Dacheng Tao. 2021. Semantically self-aligned network for text-to-image part-aware person re-identification. arXiv:2107.12666. Retrieved from https:\/\/arxiv.org\/abs\/2107.12666"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01262"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i4.20370"},{"key":"e_1_3_1_10_2","unstructured":"Chenyang Gao Guanyu Cai Xinyang Jiang Feng Zheng Jun Zhang Yifei Gong Pai Peng Xiaowei Guo and Xing Sun. 2021. Contextual non-local alignment over full-scale representation for text-based person search. arXiv:2101.03036. 
Retrieved from https:\/\/arxiv.org\/abs\/2101.03036"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i3.27955"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2023.3337653"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01474"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00273"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6777"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2024.110481"},{"key":"e_1_3_1_17_2","unstructured":"Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv:1312.6114. Retrieved from https:\/\/arxiv.org\/abs\/1312.6114"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i1.25225"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.551"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i2.25226"},{"key":"e_1_3_1_21_2","unstructured":"Zheng Li Lijia Si Caili Guo Yang Yang and Qiushi Cao. 2024. Data augmentation for text-based person retrieval using large language models. arXiv:2405.11971. 
Retrieved from https:\/\/arxiv.org\/abs\/2405.11971"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01267-0_42"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611768"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00970"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.2984883"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00521-024-09691-1"},{"key":"e_1_3_1_27_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00591"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548028"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612009"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2024.112893"},{"key":"e_1_3_1_32_2","first-page":"624","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Shu Xiujun","year":"2022","unstructured":"Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, and Xiao Wang. 2022. See finer, see more: Implicit modality alignment for text-based person retrieval. In Proceedings of the European Conference on Computer Vision. 
Springer, 624\u2013641."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58558-7_4"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-024-02094-8"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01225-0_30"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240552"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58610-2_24"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548057"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548166"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00165"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2024.124071"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVT.2024.3388249"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611832"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2023.3327924"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2023.3310118"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR.2014.16"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00064"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2024.111247"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_42"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2024.105309"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3383184"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475369"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2023.3301856"}],"container-title":["ACM Transactions on 
Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3747185","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,22]],"date-time":"2025-11-22T07:00:08Z","timestamp":1763794808000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3747185"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,21]]},"references-count":53,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,12,31]]}},"alternative-id":["10.1145\/3747185"],"URL":"https:\/\/doi.org\/10.1145\/3747185","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,21]]},"assertion":[{"value":"2025-02-18","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-29","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}