{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T01:50:09Z","timestamp":1777427409686,"version":"3.51.4"},"reference-count":60,"publisher":"Association for Computing Machinery (ACM)","issue":"10","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62301315"],"award-info":[{"award-number":["62301315"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Startup Fund for Young Faculty at SJTU","award":["23X010501967"],"award-info":[{"award-number":["23X010501967"]}]},{"name":"Special Fund for \u201cAI + Education\u201d at SJTU","award":["CTLD24A0059"],"award-info":[{"award-number":["CTLD24A0059"]}]},{"name":"Shanghai Municipal Health Commission Health Industry Clinical Research Special","award":["202340010"],"award-info":[{"award-number":["202340010"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,10,31]]},"abstract":"<jats:p>\n                    Generalizable person Re-Identification (Re-ID) is an active research topic in machine learning and computer vision, with significant practical applications in public security and video surveillance. However, previous methods mainly focus on visual representation learning while neglecting the potential of semantic features during training, which easily leads to poor generalization when adapting to a new domain. In this article, we present a unified perspective called\n                    <jats:italic toggle=\"yes\">MMET<\/jats:italic>\n                    for more robust visual-semantic embedding learning on generalizable Re-ID. 
To further enhance robust feature learning in the context of transformers, a dynamic masking strategy called\n                    <jats:bold>Masked Multimodal Modeling (MMM)<\/jats:bold>\n                    is introduced to mask both image patches and text tokens; it can jointly operate on multimodal or unimodal data and significantly boosts the performance of generalizable person Re-ID. Extensive experiments on benchmark datasets demonstrate the competitive performance of our method over previous approaches. We hope this method can advance research toward visual-semantic representation learning. Our source code is publicly available at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/JeremyXSC\/MMET\">https:\/\/github.com\/JeremyXSC\/MMET<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1145\/3726528","type":"journal-article","created":{"date-parts":[[2025,3,28]],"date-time":"2025-03-28T16:19:31Z","timestamp":1743178771000},"page":"1-17","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Learning Visual-Semantic Embedding for Generalizable Person Re-Identification: A Unified Perspective"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9141-6460","authenticated-orcid":false,"given":"Suncheng","family":"Xiang","sequence":"first","affiliation":[{"name":"School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6271-0903","authenticated-orcid":false,"given":"Jingsheng","family":"Gao","sequence":"additional","affiliation":[{"name":"School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9826-9806","authenticated-orcid":false,"given":"Mingye","family":"Xie","sequence":"additional","affiliation":[{"name":"School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7327-0608","authenticated-orcid":false,"given":"Mengyuan","family":"Guan","sequence":"additional","affiliation":[{"name":"School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5469-507X","authenticated-orcid":false,"given":"Jiacheng","family":"Ruan","sequence":"additional","affiliation":[{"name":"School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5516-3016","authenticated-orcid":false,"given":"Yuzhuo","family":"Fu","sequence":"additional","affiliation":[{"name":"School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2025,10,14]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Hangbo Bao Li Dong Songhao Piao and Furu Wei. 2021. Beit: Bert pre-training of image transformers. arXiv:2106.08254. Retrieved from https:\/\/arxiv.org\/abs\/2106.08254"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01452"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2022.03.020"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00110"},{"key":"e_1_3_2_6_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. 
Retrieved from https:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2015.04.005"},{"key":"e_1_3_2_8_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3243316"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2010.5539926"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i3.27950"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00113"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICMEW56448.2022.9859330"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01474"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01659"},{"key":"e_1_3_2_16_2","unstructured":"Alexander Hermans Lucas Beyer and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv:1703.07737. 
Retrieved from https:\/\/arxiv.org\/abs\/1703.07737"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33718-5_12"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW54120.2021.00461"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11596"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01653"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.551"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.27"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298832"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58621-8_27"},{"key":"e_1_3_2_25_2","first-page":"1992","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Liao Shengcai","year":"2021","unstructured":"Shengcai Liao and Ling Shao. 2021. Transmatcher: Deep image matching through transformers for generalizable person re-identification. In Proceedings of the Advances in Neural Information Processing Systems, 1992\u20132003."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00721"},{"key":"e_1_3_2_27_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"32","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 
32."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2019.00190"},{"key":"e_1_3_2_29_2","first-page":"10","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Muandet Krikamol","year":"2013","unstructured":"Krikamol Muandet, David Balduzzi, and Bernhard Sch\u00f6lkopf. 2013. Domain generalization via invariant feature representation. In Proceedings of the International Conference on Machine Learning. PMLR, 10\u201318."},{"key":"e_1_3_2_30_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 32."},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00781"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548234"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3297058"},{"key":"e_1_3_2_34_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. 
PMLR, 8748\u20138763."},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.74"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01519"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.427"},{"key":"e_1_3_2_38_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 30."},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/1390156.1390294"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00715"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00242"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413815"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00016"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-023-06352-7"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-020-09569-z"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-022-06184-x"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME46284.2020.9102822"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9413757"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3628452"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3588441"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3186177"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3532866"},{"key":"e_1_3_2_53_2","unstructured":"Jiahui Yu Zirui Wang Vijay Vasudevan Legg Yeung Mojtaba 
Seyedhosseini and Yonghui Wu. 2022. Coca: Contrastive captioners are image-text foundation models. arXiv:2205.01917. Retrieved from https:\/\/arxiv.org\/abs\/2205.01917"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i4.16448"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.26"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3591721"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.5555\/2919332.2919877"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3159171"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3072171"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.389"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01261-8_11"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3726528","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,20]],"date-time":"2026-03-20T15:56:13Z","timestamp":1774022173000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3726528"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,14]]},"references-count":60,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2025,10,31]]}},"alternative-id":["10.1145\/3726528"],"URL":"https:\/\/doi.org\/10.1145\/3726528","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,14]]},"assertion":[{"value":"2024-04-24","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2025-03-21","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-14","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}