{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T20:04:21Z","timestamp":1776888261669,"version":"3.51.2"},"reference-count":53,"publisher":"MDPI AG","issue":"16","license":[{"start":{"date-parts":[[2021,8,20]],"date-time":"2021-08-20T00:00:00Z","timestamp":1629417600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>This paper presents a new model for multi-object tracking (MOT) with a transformer. MOT is a spatiotemporal correlation task among interest objects and one of the crucial technologies of multi-unmanned aerial vehicles (Multi-UAV). The transformer is a self-attentional codec architecture that has been successfully used in natural language processing and is emerging in computer vision. This study proposes the Vision Transformer Tracker (ViTT), which uses a transformer encoder as the backbone and takes images directly as input. Compared with convolution networks, it can model global context at every encoder layer from the beginning, which addresses the challenges of occlusion and complex scenarios. The model simultaneously outputs object locations and corresponding appearance embeddings in a shared network through multi-task learning. Our work demonstrates the superiority and effectiveness of transformer-based networks in complex computer vision tasks and paves the way for applying the pure transformer in MOT. We evaluated the proposed model on the MOT16 dataset, achieving 65.7% MOTA, and obtained a competitive result compared with other typical multi-object trackers.<\/jats:p>","DOI":"10.3390\/s21165608","type":"journal-article","created":{"date-parts":[[2021,8,22]],"date-time":"2021-08-22T22:59:27Z","timestamp":1629673167000},"page":"5608","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":20,"title":["ViTT: Vision Transformer Tracker"],"prefix":"10.3390","volume":"21","author":[{"given":"Xiaoning","family":"Zhu","sequence":"first","affiliation":[{"name":"School of Electronic Information Engineering, Beihang University, Beijing 100191, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yannan","family":"Jia","sequence":"additional","affiliation":[{"name":"School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sun","family":"Jian","sequence":"additional","affiliation":[{"name":"School of Electronic Information Engineering, Beihang University, Beijing 100191, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lize","family":"Gu","sequence":"additional","affiliation":[{"name":"School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhang","family":"Pu","sequence":"additional","affiliation":[{"name":"School of Electronic Information Engineering, Beihang University, Beijing 100191, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,8,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"504","DOI":"10.1126\/science.1127647","article-title":"Reducing the dimensionality of data with neural networks","volume":"313","author":"Hinton","year":"2006","journal-title":"Science"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1145\/3065386","article-title":"Imagenet classification with deep convolutional neural networks","volume":"60","author":"Krizhevsky","year":"2017","journal-title":"Commun. ACM"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Long, J., Evan, S., and Trevor, D. (2015, January 8\u201310). Fully Convolutional Networks for Semantic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1274","DOI":"10.1109\/TMC.2019.2908171","article-title":"Distributed energy-efficient multi-UAV navigation for long-term communication coverage by deep reinforcement learning","volume":"19","author":"Liu","year":"2019","journal-title":"IEEE Trans. Mob. Comput."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"2243","DOI":"10.1109\/LCOMM.2019.2940191","article-title":"Multi-UAV Dynamic Wireless Networking with Deep Reinforcement Learning","volume":"23","author":"Wang","year":"2019","journal-title":"IEEE Commun. Lett."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1109\/TCCN.2020.3027695","article-title":"Multi-agent deep reinforcement learning-based trajectory planning for multi-UAV assisted mobile edge computing","volume":"7","author":"Wang","year":"2020","journal-title":"IEEE Trans. Cogn. Commun. Netw."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Bewley, A., Ge, Z., Ott, L., Ramos, F., and Upcroft, B. (2016, January 25\u201328). Simple Online and Realtime Tracking. Proceedings of the 23rd IEEE International Conference on Image Processing, Phoenix, AZ, USA.","DOI":"10.1109\/ICIP.2016.7533003"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Wojke, N., Alex, B., and Dietrich, P. (2017, January 17\u201320). Simple Online and Realtime Tracking with A Deep Association Metric. Proceedings of the 24th IEEE International Conference on Image Processing, Beijing, China.","DOI":"10.1109\/ICIP.2017.8296962"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Fang, K., Xiang, Y., Li, X., and Savarese, S. (2018, January 12\u201315). Recurrent Autoregressive Networks for Online Multi-Object Tracking. Proceedings of the 18th IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, NV, USA.","DOI":"10.1109\/WACV.2018.00057"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"7077","DOI":"10.1007\/s11042-018-6467-6","article-title":"Multi-target tracking using CNN-based features: CNNMTT","volume":"78","author":"Mahmoudi","year":"2019","journal-title":"Multimed. Tools Appl."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You Only Look Once: Unified, Real-Time Object Detection. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., and Berg, A.C. (2016, January 8\u201316). Ssd: Single Shot Multibox Detector. Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian, Q. (2015, January 11\u201318). Scalable Person Re-Identification: A Benchmark. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.133"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Li, W., Xiatian, Z., and Shaogang, G. (2017). Person re-identification by deep joint learning of multi-loss classification. arXiv.","DOI":"10.24963\/ijcai.2017\/305"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Wang, Z., Zheng, L., Liu, Y., and Wang, S. (2019). Towards real-time multi-object tracking. arXiv.","DOI":"10.1007\/978-3-030-58621-8_7"},{"key":"ref_17","unstructured":"Zhang, Y., Wang, C., Wang, X., Zeng, W., and Liu, W. (2020). FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking. arXiv."},{"key":"ref_18","unstructured":"Bahdanau, D., Kyunghyun, C., and Yoshua, B. (2014). Neural machine translation by jointly learning to align and translate. arXiv."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020, January 23\u201328). End-to-End Object Detection with Transformers. Proceedings of the European Conference on Computer Vision, Edinburgh, UK.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"ref_20","unstructured":"Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. (2020). Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv."},{"key":"ref_21","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_22","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., and Polosukhin, I. (2017). Attention is all you need. arXiv."},{"key":"ref_23","unstructured":"Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., and Luo, P. (2020). TransTrack: Multiple-Object Tracking with Transformer. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Meinhardt, T., Kirillov, A., Leal-Taixe, L., and Feichtenhofer, C. (2021). TrackFormer: Multi-Object Tracking with Transformers. arXiv.","DOI":"10.1109\/CVPR52688.2022.00864"},{"key":"ref_25","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"L\u00fcscher, C., Beck, E., Irie, K., Kitza, M., Michel, W., Zeyer, A., Schl\u00fcter, R., and Ney, H. (2019). RWTH ASR Systems for LibriSpeech: Hybrid vs Attention--w\/o Data Augmentation. arXiv.","DOI":"10.21437\/Interspeech.2019-1780"},{"key":"ref_27","unstructured":"Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D. (2018, January 13\u201319). Image Transformer. Proceedings of the International Conference on Machine Learning, Stockholm, Sweden."},{"key":"ref_28","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI Blog"},{"key":"ref_29","unstructured":"Synnaeve, G., Xu, Q., Kahn, J., Likhomanenko, T., Grave, E., Pratap, V., Sriram, A., Liptchinsky, V., and Collobert, R. (2019). End-to-end asr: From supervised to semi-supervised learning with modern architectures. arXiv."},{"key":"ref_30","unstructured":"Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. (2020, January 12\u201318). Generative Pretraining from Pixels. Proceedings of the 37th International Conference on Machine Learning, Available online: http:\/\/proceedings.mlr.press\/v119\/chen20s.html."},{"key":"ref_31","unstructured":"Cordonnier, J., Andreas, L., and Martin, J. (2019). On the relationship between self-attention and convolutional layers. arXiv."},{"key":"ref_32","unstructured":"Hermann, K.L., Ting, C., and Simon, K. (2019). The origins and prevalence of texture bias in convolutional neural networks. arXiv."},{"key":"ref_33","unstructured":"Zhang, R. (2019, January 10\u201315). Making Convolutional Networks Shift-Invariant Again. Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA."},{"key":"ref_34","unstructured":"Welch, G., and Gary, B. (1995). An Introduction to the Kalman Filter, University of North Carolina at Chapel Hill."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1002\/nav.3800020109","article-title":"The Hungarian method for the assignment problem","volume":"2","author":"Kuhn","year":"1955","journal-title":"Naval Res. Logist. Q."},{"key":"ref_36","unstructured":"Bergmann, P., Tim, M., and Laura, L.T. (November, January 27). Tracking without Bells and Whistles. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_37","unstructured":"Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Zhou, X., Vladlen, K., and Philipp, K. (2020, January 23\u201328). Tracking Objects as Points. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58548-8_28"},{"key":"ref_39","unstructured":"Zhou, X., Dequan, W., and Philipp, K. (2019). Objects as points. arXiv."},{"key":"ref_40","unstructured":"Kendall, A., Yarin, G., and Roberto, C. (2018, January 18\u201323). Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. Proceedings of the 31st IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Schroff, F., Dmitry, K., and James, P. (2015, January 7\u201312). Facenet: A Unified Embedding for Face Recognition and Clustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"ref_42","unstructured":"Milan, A., Leal-Taix\u00e9, L., Reid, I., Roth, S., and Schindler, K. (2016). MOT16: A benchmark for multi-object tracking. arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Zheng, L., Zhang, H., Sun, S., Chandraker, M., Yang, Y., and Tian, Q. (2017, January 21\u201326). Person Re-Identification in the Wild. Proceedings of the 30th IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.357"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Xiao, T., Li, S., Wang, B., Lin, L., and Wang, X. (2017, January 21\u201326). Joint Detection and Identification Feature Learning for Person Search. Proceedings of the 30th IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.360"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Doll\u00e1r, P., Wojek, C., Schiele, B., and Perona, P. (2009, January 20\u201325). Pedestrian detection: A benchmark. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Fontainebleau Resort, Miami Beach, FL, USA.","DOI":"10.1109\/CVPRW.2009.5206631"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Ess, A., Leibe, B., Schindler, K., and Van Gool, L. (2008, January 23\u201328). A Mobile Vision System for Robust Multi-Person Tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA.","DOI":"10.1109\/CVPR.2008.4587581"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Zhang, S., Rodrigo, B., and Bernt, S. (2017, January 21\u201326). Citypersons: A Diverse Dataset for Pedestrian Detection. Proceedings of the 30th IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.474"},{"key":"ref_48","unstructured":"Da, K. (2014). A method for stochastic optimization. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Zhou, Z., Xing, J., Zhang, M., and Hu, W. (2018, January 20\u201324). Online Multi-Target Tracking with Tensor-Based High-Order Graph Matching. Proceedings of the 24th International Conference on Pattern Recognition, Beijing, China.","DOI":"10.1109\/ICPR.2018.8545450"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Yu, F., Li, W., and Li, Q. (2016, January 8\u201316). Poi: Multiple Object Tracking with High Performance Detection and Appearance Feature. Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-48881-3_3"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. arXiv.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). Imagenet: A Large-Scale Hierarchical Image Database. Proceedings of the IEEE-Computer-Society Conference on Computer Vision and Pattern Recognition Workshops, Miami Beach, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Doll\u00e1r, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017, January 21\u201326). Feature Pyramid Networks for Object Detection. Proceedings of the 30th IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.106"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/16\/5608\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:47:49Z","timestamp":1760165269000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/16\/5608"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,20]]},"references-count":53,"journal-issue":{"issue":"16","published-online":{"date-parts":[[2021,8]]}},"alternative-id":["s21165608"],"URL":"https:\/\/doi.org\/10.3390\/s21165608","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,8,20]]}}}