{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T04:48:15Z","timestamp":1773550095601,"version":"3.50.1"},"reference-count":51,"publisher":"Association for Computing Machinery (ACM)","issue":"3","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62202061 and 62171043"],"award-info":[{"award-number":["62202061 and 62171043"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Beijing Natural Science Foundation","award":["4232025 and 4254096"],"award-info":[{"award-number":["4232025 and 4254096"]}]},{"name":"Research Program of Beijing Municipal Education Commission","award":["KM202311232002"],"award-info":[{"award-number":["KM202311232002"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,3,31]]},"abstract":"<jats:p>With the increasing adoption of UAV platforms in areas such as public safety and smart cities, Aerial-Ground Person Re-Identification (AGPReID) has emerged as a crucial yet highly challenging task, garnering growing interest from the research community. While existing approaches have leveraged identity attributes and viewpoint disentanglement strategies to improve cross-view matching, their heavy reliance on prior knowledge often compromises model generalization. Furthermore, some methods that explicitly separate viewpoints may unintentionally discard identity-related, view-invariant features, leading to incomplete identity representations. To address these limitations, we propose a CLIP-based View-Consistent Alignment Framework (CVAF) with two training stages. In the first stage, learnable text tokens are employed to represent identity-aware textual descriptions. To promote consistent alignment across varying viewpoints, we introduce a Text Consistency Loss (TCL) that regularizes the stability of text-token interactions with multi-view images. In the second stage, we present a Semantic Filtering Module (SFM) that jointly modulates image patch tokens along spatial and channel dimensions. A text-guided cross-attention mechanism generates spatial attention maps to explicitly emphasize identity-relevant regions, while semantic matching between textual features and visual tokens enables adaptive reweighting of image representations, effectively suppressing background clutter and view-specific noise. Extensive experiments on multiple AGPReID datasets demonstrate that our CVAF outperforms the state-of-the-art methods.<\/jats:p>","DOI":"10.1145\/3785482","type":"journal-article","created":{"date-parts":[[2025,12,23]],"date-time":"2025-12-23T13:03:17Z","timestamp":1766494997000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["CVAF: A CLIP-Based View-Consistent Alignment Framework for Aerial-Ground Person Re-Identification"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-6134-1620","authenticated-orcid":false,"given":"Dongxu","family":"Mao","sequence":"first","affiliation":[{"name":"School of Computer Science, Beijing Information Science and Technology University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7098-9932","authenticated-orcid":false,"given":"Shangzhi","family":"Teng","sequence":"additional","affiliation":[{"name":"School of Computer Science, Beijing Information Science and Technology University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1422-0560","authenticated-orcid":false,"given":"Xueqiang","family":"Lyu","sequence":"additional","affiliation":[{"name":"School of Computer Science, Beijing Information Science and Technology University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2026,2,27]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-96077-7_23"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547799"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00336"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2023.102201"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3339167"},{"key":"e_1_3_1_7_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16\u2009\u00d7\u200916 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01474"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01659"},{"key":"e_1_3_1_10_2","unstructured":"Alexander Hermans Lucas Beyer and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv:1703.07737. Retrieved from https:\/\/arxiv.org\/abs\/1703.07737"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i1.25225"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01600"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.27"},{"key":"e_1_3_1_14_2","doi-asserted-by":"crossref","unstructured":"Wen Li Cheng Zou Meng Wang Furong Xu Jianan Zhao Ruobing Zheng Yuan Cheng and Wei Chu. 2023. Dc-former: Diverse and compact transformer for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence Vol. 37 1415\u20131423.","DOI":"10.1609\/aaai.v37i2.25226"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00292"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2019.00190"},{"key":"e_1_3_1_17_2","unstructured":"Hao Luo Pichao Wang Yi Xu Feng Ding Yanxin Zhou Fan Wang Hao Li and Rong Jin. 2021. Self-supervised pre-training for transformer-based person re-identification. arXiv:2111.12084. Retrieved from https:\/\/arxiv.org\/abs\/2111.12084"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.00124"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME55011.2023.00440"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2024.3353078"},{"key":"e_1_3_1_21_2","unstructured":"Kien Nguyen Clinton Fookes Sridha Sridharan Yingli Tian Feng Liu Xiaoming Liu and Arun Ross. 2022. The state of aerial surveillance: A survey. arXiv:2201.03080. Retrieved from https:\/\/arxiv.org\/abs\/2201.03080"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME57554.2024.10687753"},{"key":"e_1_3_1_23_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-48881-3_2"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01225-0_30"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240552"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME57554.2024.10687588"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.02060"},{"key":"e_1_3_1_29_2","doi-asserted-by":"crossref","unstructured":"Tao Wang Hong Liu Pinhao Song Tianyu Guo and Wei Shi. 2022. Pose-guided feature disentangling for occluded person re-identification based on transformer. In Proceedings of the AAAI Conference on Artificial Intelligence Vol. 36 2540\u20132549.","DOI":"10.1609\/aaai.v36i3.20155"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00016"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3726528"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2023.3327924"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.01794"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01642"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3054775"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00674"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2025.3535353"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.02273"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-024-02277-3"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01620"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02077"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.2977528"},{"key":"e_1_3_1_43_2","unstructured":"Xuan Zhang Hao Luo Xing Fan Weilai Xiang Yixiao Sun Qiqi Xiao Wei Jiang Chi Zhang and Jian Sun. 2017. AlignedReID: Surpassing human-level performance in person re-identification. arXiv:1711.08184. Retrieved from https:\/\/arxiv.org\/abs\/1711.08184"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2025.112157"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00325"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.5555\/2919332.2919877"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-022-01653-1"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00380"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00465"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58580-8_21"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2023.3301856"},{"key":"e_1_3_1_52_2","unstructured":"Pengfei Zhu Longyin Wen Xiao Bian Haibin Ling and Qinghua Hu. 2018. Vision meets drones: A challenge. arXiv:1804.07437. Retrieved from https:\/\/arxiv.org\/abs\/1804.07437"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3785482","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T03:48:28Z","timestamp":1773546508000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3785482"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,27]]},"references-count":51,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,3,31]]}},"alternative-id":["10.1145\/3785482"],"URL":"https:\/\/doi.org\/10.1145\/3785482","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,27]]},"assertion":[{"value":"2025-07-18","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-13","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}