{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,28]],"date-time":"2026-03-28T17:04:14Z","timestamp":1774717454041,"version":"3.50.1"},"reference-count":74,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,12,20]],"date-time":"2024-12-20T00:00:00Z","timestamp":1734652800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Science and Technology Program of Guangdong Province","award":["2022B0701180001 and 2021B1101270007"],"award-info":[{"award-number":["2022B0701180001 and 2021B1101270007"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,1,31]]},"abstract":"<jats:p>Text-based person re-identification aims to find the target person from a large pedestrian gallery with the given natural language description. Previous works mainly focus on embedding salient textual and visual representations in a common latent space by utilizing the dual-path structure or parameter-shared network. However, they still lack the ability to effectively extract fine-grained unimodal features as well as fuse the cross-modal data, leading to the increase of misaligned cases. To settle these issues, we propose a text-and-image implicit learning Transformer (TILT) to eliminate textual anisotropy and enhance the cross-modal alignment from both domains based on the bi-direction multi-modal encoders. Specifically, we apply the pre-trained multi-modal embedding module to overcome the unimodal anisotropy problem with contrastive learning, and map fine-grained features with dual encoder in bi-directional masking. 
Then, we design the cross-modal interaction encoder to comprehensively mine implicit cross-modal relations by reconstructing masked tokens, and fuse rich multi-modal knowledge in a common space. In addition, the cross-modal similarity matching module is proposed to optimize the intra-domain classification and decrease the inter-domain divergence. Extensive experiments are conducted on three public benchmarks CUHK-PEDES, ICFG-PEDES, and RSTPReid to verify the effectiveness of our proposed framework. Results prove that our model outperforms state-of-the-art methods on all metrics.<\/jats:p>","DOI":"10.1145\/3686160","type":"journal-article","created":{"date-parts":[[2024,10,15]],"date-time":"2024-10-15T14:40:27Z","timestamp":1729003227000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Text-and-Image Learning Transformer for Cross-Modal Person Re-Identification"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-0144-6169","authenticated-orcid":false,"given":"Tinghui","family":"Wu","sequence":"first","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-Sen University, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4502-0000","authenticated-orcid":false,"given":"Shuhe","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-Sen University, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5432-8149","authenticated-orcid":false,"given":"Dihu","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Integrated Circuits, Sun Yat-Sen University, Shenzhen, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4884-323X","authenticated-orcid":false,"given":"Haifeng","family":"Hu","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-Sen University, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,12,20]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"2617","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV \u201920)","author":"Aggarwal Surbhi","year":"2020","unstructured":"Surbhi Aggarwal, Venkatesh Babu Radhakrishnan, and Anirban Chakraborty. 2020. Text-based person search via attribute-aided matching. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV \u201920), 2617\u20132625."},{"key":"e_1_3_1_3_2","first-page":"555","volume-title":"Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence Main Track","author":"Bai Yang","year":"2023","unstructured":"Yang Bai, Min Cao, Daming Gao, Ziqiang Cao, Chen Chen, Zhenfeng Fan, Liqiang Nie, and Min Zhang. 2023. RaSa: Relation and sensitivity aware representation learning for text-based person search. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence Main Track, 555\u2013563. DOI: 10.24963\/ijcai.2023\/62"},{"key":"e_1_3_1_4_2","doi-asserted-by":"crossref","first-page":"69","DOI":"10.3115\/1225403.1225421","volume-title":"Proceedings of the COLING\/ACL 2006 Interactive Presentation Sessions","author":"Bird Steven","year":"2006","unstructured":"Steven Bird. 2006. NLTK: The natural language toolkit. 
In Proceedings of the COLING\/ACL 2006 Interactive Presentation Sessions, 69\u201372."},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.3029594"},{"key":"e_1_3_1_6_2","doi-asserted-by":"crossref","first-page":"312","DOI":"10.1007\/978-981-19-2266-4_24","volume-title":"Digital TV and Wireless Multimedia Communications","author":"Chen Qingshan","year":"2022","unstructured":"Qingshan Chen, Zhenzhen Quan, Kun Zhao, Yifan Zheng, Zhi Liu, and Yujun Li. 2022. A cross-modality sketch person re-identification model based on cross-spectrum image generation. In Digital TV and Wireless Multimedia Communications. Guangtao Zhai, Jun Zhou, Hua Yang, Ping An, and Xiaokang Yang (Eds.). Springer, 312\u2013324."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3595183"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2018.09.001"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2022.04.081"},{"key":"e_1_3_1_10_2","first-page":"409","volume-title":"Proceedings of the International Conference on Computer Engineering and Artificial Intelligence (ICCEAI \u201922)","author":"Chenyang Zhang","year":"2022","unstructured":"Zhang Chenyang, Feng Jun, and Wang Jiaqing. 2022. Text-to-image person search based on SSAN model and re-rank post-processing. In Proceedings of the International Conference on Computer Engineering and Artificial Intelligence (ICCEAI \u201922), 409\u2013413. DOI: 10.1109\/ICCEAI55464.2022.00091"},{"key":"e_1_3_1_11_2","doi-asserted-by":"crossref","first-page":"248","DOI":"10.1109\/CVPR.2009.5206848","volume-title":"Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition","author":"Deng Jia","year":"2009","unstructured":"Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. 
In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248\u2013255. DOI: 10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_1_12_2","unstructured":"Zefeng Ding Changxing Ding Zhiyin Shao and Dacheng Tao. 2021. Semantically self-aligned network for text-to-image part-aware person re-identification. arXiv:2107.12666. Retrieved from https:\/\/doi.org\/10.48550\/arXiv.2107.12666"},{"key":"e_1_3_1_13_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_1_14_2","first-page":"4477","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"36","author":"Farooq Ammarah","year":"2022","unstructured":"Ammarah Farooq, Muhammad Awais, Josef Kittler, and Syed Safwan Khalid. 2022. AXM-Net: Implicit cross-modal feature alignment for person re-identification. Proceedings of the AAAI Conference on Artificial Intelligence 36, 4 (Jun. 2022), 4477\u20134485. DOI: 10.1609\/aaai.v36i4.20370"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/MIPR51284.2021.00077"},{"key":"e_1_3_1_16_2","first-page":"2786","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops (ICCV \u201923)","author":"Fujii Takuro","year":"2023","unstructured":"Takuro Fujii and Shuhei Tarashima. 2023. BiLMa: Bidirectional local-matching for text-based person re-identification. In Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops (ICCV \u201923), 2786\u20132790."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3205216"},{"key":"e_1_3_1_18_2","unstructured":"Tianyu Gao Xingcheng Yao and Danqi Chen. 2021. 
SimCSE: Simple contrastive learning of sentence embeddings. arXiv:2104.08821. Retrieved from https:\/\/doi.org\/10.48550\/arXiv.2104.08821"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3344354"},{"key":"e_1_3_1_20_2","first-page":"193","volume-title":"Proceedings of the Computer Vision (ECCV \u201920)","author":"Han Ke","year":"2020","unstructured":"Ke Han, Yan Huang, Zerui Chen, Liang Wang, and Tieniu Tan. 2020. Prediction and recovery for adaptive low-resolution person re-identification. In Proceedings of the Computer Vision (ECCV \u201920). Springer International Publishing, Cham, 193\u2013209."},{"key":"e_1_3_1_21_2","unstructured":"Xiao Han Sen He Li Zhang and Tao Xiang. 2021. Text-based person search with limited data. arXiv:2110.10807. Retrieved from https:\/\/doi.org\/10.48550\/arXiv.2110.10807"},{"key":"e_1_3_1_22_2","first-page":"16000","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922)","author":"He Kaiming","year":"2022","unstructured":"Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll\u00e1r, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922), 16000\u201316009."},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2023.3337653"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3225754"},{"key":"e_1_3_1_25_2","first-page":"2787","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201923)","author":"Jiang Ding","year":"2023","unstructured":"Ding Jiang and Mang Ye. 2023. Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201923), 2787\u20132797."},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3088446"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6777"},{"key":"e_1_3_1_28_2","first-page":"32","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision Workshops (WACV \u201923)","author":"Josi Arthur","year":"2023","unstructured":"Arthur Josi, Mahdi Alehdaghi, Rafael M. O. Cruz, and Eric Granger. 2023. Multimodal data augmentation for visual-infrared person ReID with corrupted data. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision Workshops (WACV \u201923), 32\u201341."},{"key":"e_1_3_1_29_2","first-page":"2","volume-title":"Proceedings of the NAACL-HLT","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT, Vol. 1, 2."},{"key":"e_1_3_1_30_2","first-page":"1","volume-title":"Proceedings of the International Conference on Electronics, Information, and Communication (ICEIC \u201923)","author":"Kim Kunho","year":"2023","unstructured":"Kunho Kim, Min-Jae Kim, Hyungtae Kim, Seokmok Park, and Joonki Paik. 2023. Person re-identification method using text description through CLIP. In Proceedings of the International Conference on Electronics, Information, and Communication (ICEIC \u201923), 1\u20134. DOI: 10.1109\/ICEIC57457.2023.10049924"},{"key":"e_1_3_1_31_2","first-page":"201","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV \u201918)","author":"Lee Kuang-Huei","year":"2018","unstructured":"Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. 
In Proceedings of the European Conference on Computer Vision (ECCV \u201918), 201\u2013216."},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10489-020-01880-4"},{"key":"e_1_3_1_33_2","first-page":"2724","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP \u201922)","author":"Li Shiping","year":"2022","unstructured":"Shiping Li, Min Cao, and Min Zhang. 2022. Learning semantic-aligned feature representation for text-based person search. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP \u201922), 2724\u20132728. DOI: 10.1109\/ICASSP43922.2022.9746846"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2022.3217682"},{"key":"e_1_3_1_35_2","first-page":"1405","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"37","author":"Li Siyuan","year":"2023","unstructured":"Siyuan Li, Li Sun, and Qingli Li. 2023. CLIP-ReID: Exploiting vision-language model for image re-identification without concrete text labels. Proceedings of the AAAI Conference on Artificial Intelligence 37, 1 (Jun. 2023), 1405\u20131413. DOI: 10.1609\/aaai.v37i1.25225"},{"key":"e_1_3_1_36_2","first-page":"1970","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201917)","author":"Li Shuang","year":"2017","unstructured":"Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, and Xiaogang Wang. 2017. Person search with natural language description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201917), 1970\u20131979."},{"key":"e_1_3_1_37_2","first-page":"152","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201914)","author":"Li Wei","year":"2014","unstructured":"Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. 2014. 
DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201914), 152\u2013159."},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2021.103172"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3092578"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2021.104168"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3330091"},{"key":"e_1_3_1_42_2","first-page":"2351","volume-title":"Proceedings of the IEEE International Conference on Image Processing (ICIP \u201920)","author":"Munir Asad","year":"2020","unstructured":"Asad Munir, Niki Martinel, and Christian Micheloni. 2020. Multi branch Siamese network for person re-identification. In Proceedings of the IEEE International Conference on Image Processing (ICIP \u201920), 2351\u20132355. DOI: 10.1109\/ICIP40778.2020.9191115"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.2984883"},{"key":"e_1_3_1_44_2","doi-asserted-by":"crossref","first-page":"609","DOI":"10.1145\/3240508.3240606","volume-title":"Proceedings of the 26th ACM International Conference on Multimedia (MM \u201918)","author":"Pang Lu","year":"2018","unstructured":"Lu Pang, Yaowei Wang, Yi-Zhe Song, Tiejun Huang, and Yonghong Tian. 2018. Cross-domain adversarial feature learning for sketch re-identification. In Proceedings of the 26th ACM International Conference on Multimedia (MM \u201918). ACM, New York, NY, 609\u2013617. DOI: 10.1145\/3240508.3240606"},{"key":"e_1_3_1_45_2","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1007\/978-981-99-8073-4_34","volume-title":"Proceedings of the Neural Information Processing","author":"Pang Yonghua","year":"2024","unstructured":"Yonghua Pang, Canlong Zhang, Zhixin Li, and Liaojie Hu. 2024. Text-based person re-ID by saliency mask and dynamic label smoothing. 
In Proceedings of the Neural Information Processing. Springer Nature, Singapore, 443\u2013454."},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.3390\/jimaging7010006"},{"key":"e_1_3_1_47_2","first-page":"8748","volume-title":"Proceedings of the 38th International Conference on Machine Learning","volume":"139","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Vol. 139. Marina Meila and Tong Zhang (Eds.), PMLR, 8748\u20138763. Retrieved from https:\/\/proceedings.mlr.press\/v139\/radford21a.html"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-48881-3_2"},{"key":"e_1_3_1_49_2","first-page":"5814","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV \u201919)","author":"Sarafianos Nikolaos","year":"2019","unstructured":"Nikolaos Sarafianos, Xiang Xu, and Ioannis A. Kakadiaris. 2019. Adversarial representation learning for text-to-image matching. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV \u201919), 5814\u20135824."},{"key":"e_1_3_1_50_2","doi-asserted-by":"crossref","unstructured":"Rico Sennrich Barry Haddow and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv:1508.07909. Retrieved from https:\/\/doi.org\/10.48550\/arXiv.1508.07909","DOI":"10.18653\/v1\/P16-1162"},{"key":"e_1_3_1_51_2","first-page":"11174","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV \u201923)","author":"Shao Zhiyin","year":"2023","unstructured":"Zhiyin Shao, Xinyu Zhang, Changxing Ding, Jian Wang, and Jingdong Wang. 2023. 
Unified pre-training with pseudo texts for text-to-image person re-identification. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV \u201923), 11174\u201311184."},{"key":"e_1_3_1_52_2","doi-asserted-by":"crossref","first-page":"5566","DOI":"10.1145\/3503161.3548028","volume-title":"Proceedings of the 30th ACM International Conference on Multimedia (MM \u201922)","author":"Shao Zhiyin","year":"2022","unstructured":"Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, and Changxing Ding. 2022. Learning granularity-unified representations for text-to-image person re-identification. In Proceedings of the 30th ACM International Conference on Multimedia (MM \u201922). ACM, New York, NY, 5566\u20135574. DOI: 10.1145\/3503161.3548028"},{"key":"e_1_3_1_53_2","first-page":"624","volume-title":"Proceedings of the Computer Vision Workshops (ECCV \u201922)","author":"Shu Xiujun","year":"2023","unstructured":"Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, and Xiao Wang. 2023. See finer, see more: Implicit modality alignment for text-based person retrieval. In Proceedings of the Computer Vision Workshops (ECCV \u201922). Springer Nature, 624\u2013641."},{"key":"e_1_3_1_54_2","first-page":"480","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV \u201918)","author":"Sun Yifan","year":"2018","unstructured":"Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV \u201918), 480\u2013496."},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3231103"},{"key":"e_1_3_1_56_2","first-page":"402","volume-title":"Proceedings of the 16th European Conference on Computer Vision (ECCV \u201920)","author":"Wang Zhe","year":"2020","unstructured":"Zhe Wang, Zhiyuan Fang, Jun Wang, and Yezhou Yang. 
2020. ViTAA: Visual-textual attributes alignment in person search by natural language. In Proceedings of the 16th European Conference on Computer Vision (ECCV \u201920). Springer, 402\u2013420."},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3340262"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2022.105419"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-88007-1_38"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548057"},{"key":"e_1_3_1_61_2","doi-asserted-by":"crossref","first-page":"1984","DOI":"10.1145\/3503161.3548166","volume-title":"Proceedings of the 30th ACM International Conference on Multimedia (MM \u201922)","author":"Wang Zijie","year":"2022","unstructured":"Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. 2022. Look before you leap: Improving text-based person retrieval by learning a consistent cross-modal common manifold. In Proceedings of the 30th ACM International Conference on Multimedia (MM \u201922). ACM, New York, NY, 1984\u20131992. DOI: 10.1145\/3503161.3548166"},{"issue":"4","key":"e_1_3_1_62_2","first-page":"043028","article-title":"IMG-Net: Inner-cross-modal attentional multigranular network for description-based person re-identification","volume":"29","author":"Wang Zijie","year":"2020","unstructured":"Zijie Wang, Aichun Zhu, Zhe Zheng, Jing Jin, Zhouxin Xue, and Gang Hua. 2020. IMG-Net: Inner-cross-modal attentional multigranular network for description-based person re-identification. 
Journal of Electronic Imaging 29, 4 (2020), 043028\u2013043028.","journal-title":"Journal of Electronic Imaging"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447715"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/3554739"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2023.3327924"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2023.3310118"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3054775"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1145\/3473341"},{"key":"e_1_3_1_69_2","first-page":"686","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV \u201918)","author":"Zhang Ying","year":"2018","unstructured":"Ying Zhang and Huchuan Lu. 2018. Deep cross-modal projection learning for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV \u201918), 686\u2013701."},{"key":"e_1_3_1_70_2","first-page":"1116","volume-title":"Proceedings of the IEEE International Conference on Computer Vision (ICCV \u201915)","author":"Zheng Liang","year":"2015","unstructured":"Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision (ICCV \u201915), 1116\u20131124."},{"issue":"2","key":"e_1_3_1_71_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3383184","article-title":"Dual-path convolutional image-text embeddings with instance loss","volume":"16","author":"Zheng Zhedong","year":"2020","unstructured":"Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, and Yi-Dong Shen. 2020. Dual-path convolutional image-text embeddings with instance loss. 
ACM Transactions on Multimedia Computing, Communications, and Applications 16, 2 (2020), 1\u201323.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.1145\/3159171"},{"key":"e_1_3_1_73_2","first-page":"209","volume-title":"Proceedings of the 29th ACM International Conference on Multimedia (MM \u201921)","author":"Zhu Aichun","year":"2021","unstructured":"Aichun Zhu, Zijie Wang, Yifeng Li, Xili Wan, Jing Jin, Tian Wang, Fangqiang Hu, and Gang Hua. 2021. DSSL: Deep surroundings-person separation learning for text-based person retrieval. In Proceedings of the 29th ACM International Conference on Multimedia (MM \u201921). ACM, New York, NY, 209\u2013217. DOI: 10.1145\/3474085.3475369"},{"key":"e_1_3_1_74_2","first-page":"2223","volume-title":"Proceedings of the IEEE International Conference on Computer Vision (ICCV \u201917)","author":"Zhu Jun-Yan","year":"2017","unstructured":"Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV \u201917), 2223\u20132232."},{"key":"e_1_3_1_75_2","unstructured":"Jialong Zuo Changqian Yu Nong Sang and Changxin Gao. 2023. PLIP: Language-image pre-training for person representation learning. arXiv:2305.08386. 
Retrieved from https:\/\/doi.org\/10.48550\/arXiv.2305.08386"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3686160","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3686160","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:17:50Z","timestamp":1750295870000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3686160"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,20]]},"references-count":74,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,1,31]]}},"alternative-id":["10.1145\/3686160"],"URL":"https:\/\/doi.org\/10.1145\/3686160","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,12,20]]},"assertion":[{"value":"2024-02-06","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-07-25","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-12-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}