{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,21]],"date-time":"2026-02-21T20:29:22Z","timestamp":1771705762018,"version":"3.50.1"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2024,3,8]],"date-time":"2024-03-08T00:00:00Z","timestamp":1709856000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Talent Introduction Program for Youth Innovation Teams of Shandong Province"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,6,30]]},"abstract":"<jats:p>Pedestrian attribute recognition (PAR) aims at predicting the visual attributes of a pedestrian image. PAR has been used as soft biometrics for visual surveillance and IoT security. Most of the current PAR methods are developed based on discrete images. However, it is challenging for image-based methods to handle occlusion and action-related attributes in real-world applications. Recently, video-based PAR has attracted much attention in order to exploit the temporal cues in the video sequences for better PAR. Unfortunately, existing methods usually ignore the correlations among different attributes and the relations between attributes and spatial regions. To address this problem, we propose a novel method for video-based PAR by exploring the relationships among different attributes in both the spatial and temporal domains. More specifically, a spatio-temporal saliency module (STSM) is introduced to capture the key visual patterns from the video sequences, and a module for spatio-temporal attribute relationship learning (STARL) is proposed to mine the correlations among these patterns. 
Meanwhile, a large-scale benchmark for video-based PAR, RAP-Video, is built by extending the image-based dataset RAP-2; it contains 83,216 tracklets captured in 25 scenes. To the best of our knowledge, this is the largest dataset for video-based PAR. Extensive experiments are performed on the proposed benchmark as well as on MARS Attribute and DukeMTMC-Video Attribute. The superior performance demonstrates the effectiveness of the proposed method.<\/jats:p>","DOI":"10.1145\/3632624","type":"journal-article","created":{"date-parts":[[2023,11,13]],"date-time":"2023-11-13T11:47:31Z","timestamp":1699876051000},"page":"1-15","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Pedestrian Attribute Recognition via Spatio-temporal Relationship Learning for Visual Surveillance"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-2141-4647","authenticated-orcid":false,"given":"Zhenyu","family":"Liu","sequence":"first","affiliation":[{"name":"Shandong University of Science and Technology, Qingdao, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6822-3989","authenticated-orcid":false,"given":"Da","family":"Li","sequence":"additional","affiliation":[{"name":"Center for Research on Intelligent Perception and Computing (CRIPAC), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-8189-524X","authenticated-orcid":false,"given":"Xinyu","family":"Zhang","sequence":"additional","affiliation":[{"name":"Shandong University of Science and Technology, Qingdao, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9425-3065","authenticated-orcid":false,"given":"Zhang","family":"Zhang","sequence":"additional","affiliation":[{"name":"Center for Research on Intelligent Perception and Computing (CRIPAC), State Key Laboratory of Multimodal Artificial 
Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China and School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6794-7352","authenticated-orcid":false,"given":"Peng","family":"Zhang","sequence":"additional","affiliation":[{"name":"Shandong University of Science and Technology, Qingdao, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2131-1671","authenticated-orcid":false,"given":"Caifeng","family":"Shan","sequence":"additional","affiliation":[{"name":"Shandong University of Science and Technology, Qingdao, China, and Nanjing University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4361-956X","authenticated-orcid":false,"given":"Jungong","family":"Han","sequence":"additional","affiliation":[{"name":"The University of Sheffield, Sheffield, UK"}]}],"member":"320","published-online":{"date-parts":[[2024,3,8]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"crossref","unstructured":"Tianrui Chai Zhiyuan Chen Annan Li Jiaxin Chen Xinyu Mei and Yunhong Wang. 2022. Video person re-identification using attribute-enhanced features. IEEE Transactions on Circuits and Systems for Video Technology 32 11 (2022) 7951\u20137966.","DOI":"10.1109\/TCSVT.2022.3189027"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01160"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-31723-2_18"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3178144"},{"key":"e_1_3_1_6_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2020. An image is worth \\(16\\times 16\\) words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58536-5_14"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.5555\/3367243.3367381"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58595-2_24"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_1_12_2","unstructured":"Max Jaderberg Karen Simonyan Andrew Zisserman and Koray Kavukcuoglu. 2015. Spatial transformer networks. Advances in Neural Information Processing Systems 28 (2015) 2017\u20132025."},{"key":"e_1_3_1_13_2","doi-asserted-by":"crossref","unstructured":"Jian Jia Naiyu Gao Fei He Xiaotang Chen and Kaiqi Huang. 2022. Learning disentangled attribute representations for robust pedestrian attribute recognition. Proceedings of the AAAI Conference on Artificial Intelligence 36 1 (2022) 1069\u20131077.","DOI":"10.1609\/aaai.v36i1.19991"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3547144"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01621"},{"key":"e_1_3_1_16_2","doi-asserted-by":"crossref","unstructured":"Ryan Layne Timothy M. Hospedales and Shaogang Gong. 2012. Person re-identification by attributes. In British Machine Vision Conference (BMVC) 2 3 (2012) 8.","DOI":"10.5244\/C.26.24"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACPR.2015.7486476"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2878349"},{"key":"e_1_3_1_19_2","first-page":"833","volume-title":"IJCAI","author":"Li Qiaozhe","year":"2019","unstructured":"Qiaozhe Li, Xin Zhao, Ran He, and Kaiqi Huang. 2019. Pedestrian attribute recognition by joint visual-semantic reasoning and knowledge distillation. In IJCAI. 
833\u2013839."},{"issue":"7","key":"e_1_3_1_20_2","first-page":"2167","article-title":"Recurrent prediction with spatio-temporal attention for crowd attribute recognition","volume":"30","author":"Li Qiaozhe","year":"2019","unstructured":"Qiaozhe Li, Xin Zhao, Ran He, and Kaiqi Huang. 2019. Recurrent prediction with spatio-temporal attention for crowd attribute recognition. IEEE Transactions on Circuits and Systems for Video Technology 30, 7 (2019), 2167\u20132177.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.106"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2019.06.006"},{"key":"e_1_3_1_23_2","article-title":"Localization guided learning for pedestrian attribute recognition","author":"Liu Pengze","year":"2018","unstructured":"Pengze Liu, Xihui Liu, Junjie Yan, and Jing Shao. 2018. Localization guided learning for pedestrian attribute recognition. arXiv preprint arXiv:1808.09102 (2018).","journal-title":"arXiv preprint arXiv:1808.09102"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01167"},{"key":"e_1_3_1_25_2","volume-title":"7th International Conference on Learning Representations","author":"Loshchilov Ilya","year":"2019","unstructured":"Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations."},{"key":"e_1_3_1_26_2","unstructured":"Alec Radford Jong Wook Kim Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark Gretchen Krueger and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. 
Proceedings of Machine Learning Research (PMLR) 8748\u20138763."},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-48881-3_2"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01252-6_42"},{"key":"e_1_3_1_29_2","volume-title":"British Machine Vision Conference 2017 (BMVC\u201917)","author":"Sarfraz M. Saquib","year":"2017","unstructured":"M. Saquib Sarfraz, Arne Schumann, Yan Wang, and Rainer Stiefelhagen. 2017. Deep view-sensitive pedestrian attribute inference in an end-to-end model. In British Machine Vision Conference 2017 (BMVC\u201917)."},{"issue":"3","key":"e_1_3_1_30_2","first-page":"1","article-title":"Shuffle-invariant network for action recognition in videos","volume":"18","author":"Shi Qinghongya","year":"2022","unstructured":"Qinghongya Shi, Hong-Bo Zhang, Zhe Li, Ji-Xiang Du, Qing Lei, and Jing-Hua Liu. 2022. Shuffle-invariant network for action recognition in videos. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 18, 3 (2022), 1\u201318.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)"},{"key":"e_1_3_1_31_2","doi-asserted-by":"crossref","unstructured":"Thomhert Suprapto Siadari Mikyong Han and Hyunjin Yoon. 2019. GSR-MAR: Global super-resolution for person multi-attribute recognition. In International Conference on Computer Vision (ICCV) Workshops. 1098\u20131103.","DOI":"10.1109\/ICCVW.2019.00140"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6883"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2919199"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00510"},{"key":"e_1_3_1_35_2","unstructured":"Ziyi Tang Ruimao Zhang Zhanglin Peng Jinrui Chen and Liang Lin. 2022. Multi-stage spatio-temporal aggregation transformer for video person re-identification. 
IEEE Transactions on Multimedia Early Access 1\u201315."},{"key":"e_1_3_1_36_2","doi-asserted-by":"crossref","unstructured":"Xiao Wang Shaofei Zheng Rui Yang Aihua Zheng Zhe Chen Jin Tang and Bin Luo. 2021. Pedestrian attribute recognition: A survey. Pattern Recognition 121 (2021) 108220.","DOI":"10.1016\/j.patcog.2021.108220"},{"key":"e_1_3_1_37_2","doi-asserted-by":"crossref","unstructured":"Suncheng Xiang Dahong Qian Mengyuan Guan Binjie Yan Ting Liu Yuzhuo Fu and Guanjie You. 2023. Less is more: Learning from synthetic data with fine-grained attributes for person re-identification. ACM Transactions on Multimedia Computing Communications and Applications 19 5s (2023) 1\u201320.","DOI":"10.1145\/3588441"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3559107"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3538749"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.3390\/s21186163"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-021-01499-z"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46466-4_52"},{"key":"e_1_3_1_43_2","article-title":"Learning clip guided visual-text fusion transformer for video-based pedestrian attribute recognition","author":"Zhu Jun","year":"2023","unstructured":"Jun Zhu, Jiandong Jin, Zihan Yang, Xiaohao Wu, and Xiao Wang. 2023. Learning clip guided visual-text fusion transformer for video-based pedestrian attribute recognition. 
arXiv preprint arXiv:2304.10091 (2023).","journal-title":"arXiv preprint arXiv:2304.10091"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2013.51"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3632624","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3632624","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:57:54Z","timestamp":1750294674000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3632624"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,8]]},"references-count":43,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,6,30]]}},"alternative-id":["10.1145\/3632624"],"URL":"https:\/\/doi.org\/10.1145\/3632624","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,8]]},"assertion":[{"value":"2023-06-11","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-11-08","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}