{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T05:06:11Z","timestamp":1750309571690,"version":"3.41.0"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2025,3,12]],"date-time":"2025-03-12T00:00:00Z","timestamp":1741737600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62272461, 62172417, 62276266, and 62277046"],"award-info":[{"award-number":["62272461, 62172417, 62276266, and 62277046"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"\u201cDouble First-Class\u201d Project of China University of Mining and Technology for Independent Innovation and Social Service","award":["2022ZZCX06"],"award-info":[{"award-number":["2022ZZCX06"]}]},{"name":"Six Talent Peaks Project in Jiangsu","award":["2015-DZXX-010 and 2018-XYDXX-044"],"award-info":[{"award-number":["2015-DZXX-010 and 2018-XYDXX-044"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,3,31]]},"abstract":"<jats:p>Traditional text-based person re-identification relies on identity labels. However, it is impossible to annotate large datasets, since identity annotation is expensive and time-consuming. Weakly supervised text-based person re-identification, where only text\u2013image pairs are available without annotation of identities, is very practical in real life. 
In weakly supervised person re-identification, two issues must be addressed: the alignment difficulty caused by the difference between modalities, and the cross-modal matching ambiguity caused by the lack of identity labels. In this article, we propose a similarity regulation and calibration alignment (SRCA) framework, which consists of two unimodal encoders for images and text, respectively, and a multi-modal encoder for the masked language modeling task. First, a similarity regulation (SR) strategy is proposed to relax the strict one-to-one constraints for the local similarities between different pairs by introducing a novel soft objective. The soft objective can adjust hard objectives to achieve soft cross-modal alignment by establishing a many-to-many relationship between two modalities. Second, the calibration alignment (CA) module is proposed to improve intra-class compactness by modeling pseudo-label assignment as optimal transport. The ambiguity of cross-modal matching can be reduced by aligning features and pseudo-labels of different modalities and gradually calibrating the distribution of pseudo-labels. 
Experimental results show that our method has achieved obvious advantages compared with existing methods and also demonstrated competitive performance compared with fully supervised methods.<\/jats:p>","DOI":"10.1145\/3711861","type":"journal-article","created":{"date-parts":[[2025,1,25]],"date-time":"2025-01-25T11:57:38Z","timestamp":1737806258000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Similarity Regulation and Calibration Alignment for Weakly Supervised Text-Based Person Re-Identification"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-7187-6136","authenticated-orcid":false,"given":"Ao","family":"Fu","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China and Mine Digitization Engineering Research Center of the Ministry of Education, China University of Mining and Technology, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3564-5090","authenticated-orcid":false,"given":"Jiaqi","family":"Zhao","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China and Mine Digitization Engineering Research Center of the Ministry of Education, China University of Mining and Technology, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6207-0299","authenticated-orcid":false,"given":"Yong","family":"Zhou","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China and Mine Digitization Engineering Research Center of the Ministry of Education, China University of Mining and Technology, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9234-0912","authenticated-orcid":false,"given":"Wenliang","family":"Du","sequence":"additional","affiliation":[{"name":"School of 
Computer Science and Technology, China University of Mining and Technology, Xuzhou, China and Mine Digitization Engineering Research Center of the Ministry of Education, China University of Mining and Technology, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2734-915X","authenticated-orcid":false,"given":"Rui","family":"Yao","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China and Mine Digitization Engineering Research Center of the Ministry of Education, China University of Mining and Technology, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7690-8547","authenticated-orcid":false,"given":"Abdulmotaleb","family":"El Saddik","sequence":"additional","affiliation":[{"name":"EECS, University of Ottawa, Ottawa, Ontario, Canada and Computer Vision, Mohamed Bin Zayed University for Humanities, Abu Dhabi, United Arab Emirates"}]}],"member":"320","published-online":{"date-parts":[[2025,3,12]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Yuki Markus Asano Christian Rupprecht and Andrea Vedaldi. 2019. Self-labelling via simultaneous clustering and representation learning. arXiv:1911.05371. Retrieved from http:\/\/arxiv.org\/abs\/1911.05371"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2022.04.081"},{"key":"e_1_3_1_4_2","first-page":"2292","article-title":"Sinkhorn distances: Lightspeed computation of optimal transport","volume":"26","author":"Cuturi Marco","year":"2013","unstructured":"Marco Cuturi. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, Vol. 
26, 2292\u20132300.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_5_2","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 4171\u20134186.","journal-title":"North American Chapter of the Association for Computational Linguistics"},{"key":"e_1_3_1_6_2","unstructured":"Zefeng Ding Changxing Ding Zhiyin Shao and Dacheng Tao. 2021. Semantically self-aligned network for text-to-image part-aware person re-identification. arXiv:2107.12666. Retrieved from http:\/\/arxiv.org\/abs\/2107.12666"},{"key":"e_1_3_1_7_2","first-page":"1","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, 1\u201321."},{"key":"e_1_3_1_8_2","first-page":"4477","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"36","author":"Farooq Ammarah","year":"2022","unstructured":"Ammarah Farooq, Muhammad Awais, Josef Kittler, and Syed Safwan Khalid. 2022. AXM-Net: Implicit cross-modal feature alignment for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 
36, 4477\u20134485."},{"key":"e_1_3_1_9_2","unstructured":"Chen Gao Guanyu Cai Xinyang Jiang Feng Zheng Jinchao Zhang Yifei Gong Pai Peng Xiao-Wei Guo and Xing Sun. 2021. Contextual non-local alignment over full-scale representation for text-based person search. arXiv:2101.03036. Retrieved from http:\/\/arxiv.org\/abs\/2101.03036"},{"key":"e_1_3_1_10_2","unstructured":"Yixiao Ge Dapeng Chen and Hongsheng Li. 2020. Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. arXiv:2001.01526. Retrieved from http:\/\/arxiv.org\/abs\/2001.01526"},{"key":"e_1_3_1_11_2","first-page":"11309","article-title":"Self-paced contrastive learning with hybrid memory for domain adaptive object re-id","volume":"33","author":"Ge Yixiao","year":"2020","unstructured":"Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, et al. 2020. Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. In Advances in Neural Information Processing Systems, Vol. 33, 11309\u201311321.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_12_2","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1016\/B978-0-12-817358-9.00015-9","volume-title":"Multimodal Scene Understanding","author":"Gomez Raul","year":"2019","unstructured":"Raul Gomez, Lluis Gomez, Jaume Gibert, and Dimosthenis Karatzas. 2019. Self-supervised learning from web data for multimodal retrieval. In Multimodal Scene Understanding. Elsevier, 279\u2013306."},{"key":"e_1_3_1_13_2","unstructured":"Xiao Han Sen He Li Zhang and Tao Xiang. 2021. Text-based person search with limited data. arXiv:2110.10807. Retrieved from http:\/\/arxiv.org\/abs\/2110.10807"},{"key":"e_1_3_1_14_2","doi-asserted-by":"crossref","unstructured":"Zhong Ji Junhua Hu Deyin Liu Yuan Wu and Ye Zhao. 2022. Asymmetric cross-scale alignment for text-based person search. arXiv:2212.11958. 
Retrieved from https:\/\/doi.org\/10.1109\/tmm.2022.3225754","DOI":"10.1109\/TMM.2022.3225754"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00273"},{"key":"e_1_3_1_16_2","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from http:\/\/arxiv.org\/abs\/1412.6980"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/icassp43922.2022.9746846"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.551"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3323873.3325035"},{"key":"e_1_3_1_20_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_1_21_2","first-page":"4512","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing","author":"Reimers Nils","year":"2020","unstructured":"Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 4512\u20134525."},{"key":"e_1_3_1_22_2","first-page":"5814","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Sarafianos Nikolaos","year":"2019","unstructured":"Nikolaos Sarafianos, Xiang Xu, and IoannisA Kakadiaris. 2019. Adversarial representation learning for text-to-image matching. 
In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 5814\u20135824."},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548028"},{"key":"e_1_3_1_24_2","first-page":"624","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Shu Xiujun","year":"2022","unstructured":"Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, and Xiao Wang. 2022. See finer, see more: Implicit modality alignment for text-based person retrieval. In Proceedings of the European Conference on Computer Vision. Springer, 624\u2013641."},{"key":"e_1_3_1_25_2","unstructured":"Guanshuo Wang Fufu Yu Junjie Li Qiong Jia and Shouhong Ding. 2023. Exploiting the textual potential from vision-language pre-training for text-based person search. arXiv:2303.04497. Retrieved from http:\/\/arxiv.org\/abs\/2303.04497"},{"key":"e_1_3_1_26_2","article-title":"RES-STS: Referring expression speaker via self-training with scorer for goal-oriented vision-language navigation","author":"Wang Liuyi","year":"2023","unstructured":"Liuyi Wang, Zongtao He, Ronghao Dang, Huiyi Chen, Chengju Liu, and Qijun Chen. 2023. RES-STS: Referring expression speaker via self-training with scorer for goal-oriented vision-language navigation. IEEE Transactions on Circuits and Systems for Video Technology (2023).","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_27_2","first-page":"402","volume-title":"Proceedings of the Conference on Computer Vision (ECCV \u201920)","author":"Wang Zhe","year":"2020","unstructured":"Zhe Wang, Zhiyuan Fang, Jun Wang, and Yezhou Yang. 2020. ViTAA: Visual-textual attributes alignment in person search by natural language. 
In Proceedings of the Conference on Computer Vision (ECCV \u201920), 402\u2013420."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548057"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548166"},{"key":"e_1_3_1_30_2","unstructured":"Donglai Wei Sipeng Zhang Tong Yang and Jing Liu. 2023. Calibrating cross-modal feature for text-based person searching. arXiv:2304.02278. Retrieved from http:\/\/arxiv.org\/abs\/2304.02278"},{"key":"e_1_3_1_31_2","first-page":"1624","volume-title":"Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV \u201821)","author":"Wu Yushuang","year":"2021","unstructured":"Yushuang Wu, Zizheng Yan, Xiaoguang Han, Guanbin Li, Changqing Zou, and Shuguang Cui. 2021. LapsCore: Language-guided person search via color reasoning. In Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV \u201821), 1624\u20131633."},{"key":"e_1_3_1_32_2","unstructured":"Wenhao Xu Zhiyin Shao and Changxing Ding. 2023. Mining false positive examples for text-based person re-identification. arXiv:2303.08466. Retrieved from https:\/\/doi.org\/10.48550\/arXiv.2303.08466"},{"key":"e_1_3_1_33_2","unstructured":"Shuanglin Yan Neng Dong Liyan Zhang and Jinhui Tang. 2022. CLIP-Driven fine-grained text-image person re-identification. arXiv:2210.10276. Retrieved from https:\/\/doi.org\/10.48550\/arXiv.2210.10276"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2023.3310118"},{"key":"e_1_3_1_35_2","article-title":"Self-training vision language BERTs with a unified conditional model","author":"Yang Xiaofeng","year":"2023","unstructured":"Xiaofeng Yang, Fengmao Lv, Fayao Liu, and Guosheng Lin. 2023. Self-training vision language BERTs with a unified conditional model. 
IEEE Transactions on Circuits and Systems for Video Technology (2023).","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_42"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01120"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3383184"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475369"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475369"},{"key":"e_1_3_1_41_2","unstructured":"Jialong Zuo Changqian Yu Nong Sang and Changxin Gao. 2023. PLIP: Language-image pre-training for person representation learning. arXiv:2305.08386. Retrieved from http:\/\/arxiv.org\/abs\/2305.08386"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3711861","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3711861","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:19:15Z","timestamp":1750295955000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3711861"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,12]]},"references-count":40,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,3,31]]}},"alternative-id":["10.1145\/3711861"],"URL":"https:\/\/doi.org\/10.1145\/3711861","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2025,3,12]]},"assertion":[{"value":"2024-01-09","order":0,"name":"received","label":"Received","group":{"name":"
publication_history","label":"Publication History"}},{"value":"2024-12-12","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}