{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T10:17:40Z","timestamp":1770718660664,"version":"3.49.0"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62072211"],"award-info":[{"award-number":["62072211"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>\n                    Aerial skiing is a challenging human-centric sport characterized by rapid motion, large-scale variations, and frequent occlusions. Its extensive spatial range is typically captured by cameras or drones from multiple perspectives, resulting in frequent and complex viewpoint shifts. These challenges encompass nearly all difficulties inherent in human-centric tracking tasks. In this article, we introduce\n                    <jats:italic toggle=\"yes\">SkiTrack<\/jats:italic>\n                    , the first dataset explicitly designed for tracking in aerial skiing. SkiTrack enhances the performance of existing tracking algorithms across a range of human-centric scenarios by providing precise annotations. We observe distinct characteristics in the tracked components, with the skis being rigid and low in visibility and the athlete\u2019s body highly deformable but more visible. To leverage these differences, we propose a components decoupled loss that applies separate constraints to the tracking of the athlete and skis, thereby improving tracking accuracy in skiing scenes. 
Our experimental results validate the effectiveness of both the SkiTrack dataset and the proposed decoupled loss function, demonstrating consistent improvements in the performance of established models on human-centric tracking tasks. Data are available at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/xiaozhangfangyang\/FineSkiing\">https:\/\/github.com\/xiaozhangfangyang\/FineSkiing<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1145\/3778174","type":"journal-article","created":{"date-parts":[[2025,11,27]],"date-time":"2025-11-27T09:19:54Z","timestamp":1764235194000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["SkiTrack: An Aerial Skiing Benchmark for Human-Centric Object Tracking"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9025-3375","authenticated-orcid":false,"given":"Yu","family":"Jiang","sequence":"first","affiliation":[{"name":"College of Computer Science and Technology, Jilin University, Changchun, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8576-5285","authenticated-orcid":false,"given":"Yongji","family":"Zhang","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Jilin University, Changchun, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9720-826X","authenticated-orcid":false,"given":"Siqi","family":"Li","sequence":"additional","affiliation":[{"name":"BNRist, THUIBCS, BLBCI, School of Software, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6793-6506","authenticated-orcid":false,"given":"Yuehang","family":"Wang","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Jilin University, Changchun, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4971-590X","authenticated-orcid":false,"given":"Yue","family":"Gao","sequence":"additional","affiliation":[{"name":"BNRist, THUIBCS, BLBCI, School of Software, Tsinghua University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2026,2,9]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00542"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.471"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.02138"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_27"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01822"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00879"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20047-2_22"},{"key":"e_1_3_1_9_2","first-page":"6475","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Chen","year":"2024","unstructured":"J. Chen and H. Jiang. 2024. SportsSloMo: A new benchmark and baselines for human-centric video frame interpolation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 6475\u20136486."},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3597612"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01400"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00803"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01324"},{"key":"e_1_3_1_14_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Dosovitskiy A.","year":"2021","unstructured":"A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. 
Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2024.103978"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV57701.2024.00832"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00552"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2022\/127"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01356"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2025.3546312"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00478"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3164253"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i1.25155"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2957464"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2013.248"},{"key":"e_1_3_1_26_2","first-page":"4194","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"39","author":"Kang B.","year":"2025","unstructured":"B. Kang, X. Chen, S. Lai, Y. Liu, Y. Liu, and D. Wang. 2025. Exploring enhanced contextual information for video-level object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, 4194\u20134202."},{"key":"e_1_3_1_27_2","unstructured":"W. Kay J. Carreira K. Simonyan B. Zhang C. Hillier S. Vijayanarasimhan F. Viola T. Green T. Back P. Natsev et al. 2017. The kinetics human action video dataset. arXiv:1705.06950. 
Retrieved from https:\/\/arxiv.org\/abs\/1705.06950"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.128"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-022-01594-9"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3746284"},{"key":"e_1_3_1_31_2","first-page":"300","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Lin L.","year":"2024","unstructured":"L. Lin, H. Fan, Z. Zhang, Y. Wang, Y. Xu, and H. Ling. 2024. Tracking meets LoRA: Faster training, larger model, stronger performance. In Proceedings of the European Conference on Computer Vision. Springer, 300\u2013318."},{"key":"e_1_3_1_32_2","first-page":"16743","article-title":"SwinTrack: A simple and strong baseline for transformer tracking","volume":"35","author":"Lin L.","year":"2022","unstructured":"L. Lin, H. Fan, Z. Zhang, Y. Xu, and H. Ling. 2022. SwinTrack: A simple and strong baseline for transformer tracking. In Advances in Neural Information Processing Systems, Vol. 35, 16743\u201316754.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_19"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3206668"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3162599"},{"key":"e_1_3_1_36_2","first-page":"130797","article-title":"VastTrack: Vast category visual object tracking","volume":"37","author":"Peng L.","year":"2024","unstructured":"L. Peng, J. Gao, X. Liu, W. Li, S. Dong, Z. Zhang, H. Fan, and L. Zhang. 2024. VastTrack: Vast category visual object tracking. In Advances in Neural Information Processing Systems, Vol. 
37, 130797\u2013130818.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00083"},{"key":"e_1_3_1_38_2","unstructured":"K. Soomro. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402. Retrieved from https:\/\/arxiv.org\/abs\/1212.0402"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01219-9_41"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3533253"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00807"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00162"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3291140"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00935"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3497746"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00200"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.312"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2014.2388226"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01028"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2021.3074239"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3323134"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20047-2_20"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00971"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.3037518"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3486678"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00618"},{"key":"e_1_3_1_57_2","first-page":"1","arti
cle-title":"P2FTrack: Multi-object tracking with motion prior and feature posterior","volume":"21","author":"Zhang H.","year":"2024","unstructured":"H. Zhang, J. Wan, J. Zhang, D. Yuan, X. Li, and Y. Yang. 2024. P2FTrack: Multi-object tracking with motion prior and feature posterior. ACM Transactions on Multimedia Computing, Communications, and Applications 21 (2024), 1\u201322.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3651308"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3306490"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3301933"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v39i10.33155"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i7.28591"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3212987"},{"key":"e_1_3_1_64_2","unstructured":"Y. Zhu X. Li C. Liu M. Zolfaghari Y. Xiong C. Wu Z. Zhang J. Tighe R. Manmatha and M. Li. 2020. A comprehensive study of deep video action recognition. arXiv:2012.06567. 
Retrieved from https:\/\/arxiv.org\/abs\/2012.06567"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3778174","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T14:57:49Z","timestamp":1770649069000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3778174"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,9]]},"references-count":63,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3778174"],"URL":"https:\/\/doi.org\/10.1145\/3778174","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,9]]},"assertion":[{"value":"2025-05-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-12","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}