{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,1]],"date-time":"2026-02-01T09:52:10Z","timestamp":1769939530485,"version":"3.49.0"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"12","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61932020"],"award-info":[{"award-number":["61932020"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Taishan Scholar Program of Shandong Province","award":["tstp20221128"],"award-info":[{"award-number":["tstp20221128"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,12,31]]},"abstract":"<jats:p>\n                    Recent advances in multi-object tracking (MOT) have demonstrated significant success in short-term association within the separated tracking-by-detection online paradigm. However, long-term tracking remains challenging. While graph-based approaches address this by modeling trajectories as global graphs, these methods are unsuitable for real-time applications due to their non-online nature. In this article, we review the concept of trajectory graphs and propose a novel perspective by representing them as directed acyclic graphs. This representation can be described using frame-ordered object sequences and binary adjacency matrices. We observe that this structure naturally aligns with Transformer attention mechanisms, enabling us to model the association problem using a classic Transformer architecture. Based on this insight, we introduce a concise pure transformer (PuTR) to validate the effectiveness of Transformer in unifying short- and long-term tracking for separated online MOT. Extensive experiments on four diverse datasets (SportsMOT, DanceTrack, MOT17, and MOT20) demonstrate that PuTR effectively establishes a solid baseline compared to existing foundational online methods while exhibiting superior domain adaptation capabilities. Furthermore, the separated nature enables efficient training and inference, making it suitable for practical applications. Implementation code and trained models are available at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/chongweiliu\/PuTR\">https:\/\/github.com\/chongweiliu\/PuTR<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1145\/3749105","type":"journal-article","created":{"date-parts":[[2025,8,1]],"date-time":"2025-08-01T15:18:45Z","timestamp":1754061525000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Is a Pure Transformer Effective for Separated and Online Multi-Object Tracking?"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4594-5262","authenticated-orcid":false,"given":"Chongwei","family":"Liu","sequence":"first","affiliation":[{"name":"Dalian University of Technology, Dalian, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3882-2205","authenticated-orcid":false,"given":"Haojie","family":"Li","sequence":"additional","affiliation":[{"name":"Dalian University of Technology, Dalian, China and Shandong University of Science and Technology, Qingdao, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5011-9726","authenticated-orcid":false,"given":"Zhihui","family":"Wang","sequence":"additional","affiliation":[{"name":"Dalian University of Technology, Dalian, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0516-3629","authenticated-orcid":false,"given":"Rui","family":"Xu","sequence":"additional","affiliation":[{"name":"Dalian University of Technology, Dalian, China"}]}],"member":"320","published-online":{"date-parts":[[2025,11,21]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Nir Aharon Roy Orfaig and Ben-Zion Bobrovsky. 2022. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv:2206.14651. Retrieved from https:\/\/arxiv.org\/abs\/2206.14651"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00103"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1155\/2008\/246309"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2016.7533003"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00628"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00792"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00934"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02191"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00910"},{"key":"e_1_3_2_12_2","unstructured":"Patrick Dendorfer Hamid Rezatofighi Anton Milan Javen Shi Daniel Cremers Ian Reid Stefan Roth Konrad Schindler and Laura Leal-Taix\u00e9. 2020. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv:2003.09003. Retrieved from https:\/\/arxiv.org\/abs\/2003.09003"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3240881"},{"issue":"2","key":"e_1_3_2_14_2","doi-asserted-by":"crossref","first-page":"611","DOI":"10.1007\/s00034-019-01234-7","article-title":"Multi-object detection and tracking (MODT) machine learning model for real-time video surveillance systems","volume":"39","author":"Mohamed Elhoseny.","year":"2020","unstructured":"Mohamed Elhoseny. 2020. Multi-object detection and tracking (MODT) machine learning model for real-time video surveillance systems. Circuits, Systems, and Signal Processing 39, 2 (2020), 611\u2013630.","journal-title":"Circuits, Systems, and Signal Processing"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1177\/0278364910365417"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00908"},{"key":"e_1_3_2_17_2","unstructured":"Ruopeng Gao Yijun Zhang and Limin Wang. 2024. Multiple object tracking as ID prediction. arXiv:2403.16848. Retrieved from https:\/\/arxiv.org\/abs\/2403.16848"},{"key":"e_1_3_2_18_2","unstructured":"Zheng Ge Songtao Liu Feng Wang Zeming Li and Jian Sun. 2021. YOLOX: Exceeding YOLO series in 2021. arXiv:2107.08430. Retrieved from https:\/\/arxiv.org\/abs\/2107.08430"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3565266"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TGRS.2024.3422669"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSMCC.2004.829274"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01825"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2007.383180"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1115\/1.3662552"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1002\/nav.3800020109"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-023-01933-4"},{"key":"e_1_3_2_27_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Liu Shilong","year":"2022","unstructured":"Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. 2022. DAB-DETR: Dynamic anchor boxes are better queries for DETR. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=oMI9PjOb9Jl"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00914"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-020-01375-2"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP49359.2023.10222576"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00864"},{"key":"e_1_3_2_32_2","unstructured":"Anton Milan Laura Leal-Taix\u00e9 Ian Reid Stefan Roth and Konrad Schindler. 2016. MOT16: A benchmark for multi-object tracking. arXiv:1603.00831. Retrieved from https:\/\/arxiv.org\/abs\/1603.00831"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00023"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2011.5995604"},{"key":"e_1_3_2_35_2","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28 (2015).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-48881-3_2"},{"key":"e_1_3_2_37_2","first-page":"343","volume-title":"Proceedings of the 12th European Conference on Computer Vision (ECCV \u201912), Part II 12","author":"Zamir Amir Roshan","year":"2012","unstructured":"Amir Roshan Zamir, Afshin Dehghan, and Mubarak Shah. 2012. GMCP-Tracker: Global multi-object tracking using generalized minimum clique graphs. In Proceedings of the 12th European Conference on Computer Vision (ECCV \u201912), Part II 12. Springer, 343\u2013356."},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1038\/323533a0"},{"issue":"1","key":"e_1_3_2_39_2","doi-asserted-by":"crossref","first-page":"61","DOI":"10.1109\/TNN.2008.2005605","article-title":"The graph neural network model","volume":"20","author":"Scarselli Franco","year":"2008","unstructured":"Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2008), 61\u201380.","journal-title":"IEEE Transactions on Neural Networks"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00891"},{"key":"e_1_3_2_41_2","unstructured":"Shuai Shao Zijian Zhao Boxun Li Tete Xiao Gang Yu Xiangyu Zhang and Jian Sun. 2018. CrowdHuman: A benchmark for detecting human in a crowd. arXiv:1805.00123. Retrieved from https:\/\/arxiv.org\/abs\/1805.00123"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.02032"},{"issue":"1","key":"e_1_3_2_43_2","first-page":"104","article-title":"Deep affinity network for multiple object tracking","volume":"43","author":"Sun ShiJie","year":"2019","unstructured":"ShiJie Sun, Naveed Akhtar, HuanSheng Song, Ajmal Mian, and Mubarak Shah. 2019. Deep affinity network for multiple object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 1 (2019), 104\u2013119.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.394"},{"key":"e_1_3_2_45_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. Retrieved from https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_3_2_46_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"1","key":"e_1_3_2_47_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3533253","article-title":"JDAN: Joint detection and association network for real-time online multi-object tracking","volume":"19","author":"Wang Haidong","year":"2023","unstructured":"Haidong Wang, Xuan He, Zhiyong Li, Jin Yuan, and Shutao Li. 2023. JDAN: Joint detection and association network for real-time online multi-object tracking. ACM Transactions on Multimedia Computing, Communications and Applications 19, 1s (2023), 1\u201317.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58621-8_7"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2017.8296962"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2024.112742"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00682"},{"key":"e_1_3_2_52_2","unstructured":"Feng Yan Weixin Luo Yujie Zhong Yiyang Gan and Lin Ma. 2023. Bridging the gap between end-to-end and non-end-to-end multi-object tracking. arXiv:2305.12724. Retrieved from https:\/\/arxiv.org\/abs\/2305.12724"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i7.28471"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611868"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i7.28493"},{"issue":"5","key":"e_1_3_2_56_2","first-page":"1","article-title":"Multi-object tracking with spatial-temporal tracklet association","volume":"20","author":"You Sisi","year":"2024","unstructured":"Sisi You, Hantao Yao, Bing-Kun Bao, and Changsheng Xu. 2024. Multi-object tracking with spatial-temporal tracklet association. ACM Transactions on Multimedia Computing, Communications and Applications 20, 5 (2024), 1\u201321.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_2_57_2","unstructured":"En Yu Tiancai Wang Zhuoling Li Yuang Zhang Xiangyu Zhang and Wenbing Tao. 2023. MOTRv3: Release-fetch supervision for end-to-end multi-object tracking. arXiv:2305.14298. Retrieved from https:\/\/arxiv.org\/abs\/2305.14298"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19812-0_38"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20047-2_1"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-021-01513-4"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02112"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58548-8_28"},{"key":"e_1_3_2_63_2","unstructured":"Xingyi Zhou Dequan Wang and Philipp Kr\u00e4henb\u00fchl. 2019. Objects as points. arXiv:1904.07850. Retrieved from https:\/\/arxiv.org\/abs\/1904.07850"},{"key":"e_1_3_2_64_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Zhu Xizhou","year":"2021","unstructured":"Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2021. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=gZ9hCDWe6ke"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3749105","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,22]],"date-time":"2025-11-22T07:00:01Z","timestamp":1763794801000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3749105"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,21]]},"references-count":63,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,12,31]]}},"alternative-id":["10.1145\/3749105"],"URL":"https:\/\/doi.org\/10.1145\/3749105","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,21]]},"assertion":[{"value":"2025-01-19","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-13","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}