{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T15:37:47Z","timestamp":1775057867281,"version":"3.50.1"},"reference-count":28,"publisher":"MDPI AG","issue":"13","license":[{"start":{"date-parts":[[2021,6,30]],"date-time":"2021-06-30T00:00:00Z","timestamp":1625011200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003391","name":"Fonds Unique Interminist\u00e9riel","doi-asserted-by":"publisher","award":["FUI STAR: DOS0075476 00"],"award-info":[{"award-number":["FUI STAR: DOS0075476 00"]}],"id":[{"id":"10.13039\/501100003391","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001665","name":"Agence Nationale de la Recherche","doi-asserted-by":"publisher","award":["ANR-17-CE22-0001-01"],"award-info":[{"award-number":["ANR-17-CE22-0001-01"]}],"id":[{"id":"10.13039\/501100001665","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Existing methods for video instance segmentation (VIS) mostly rely on two strategies: (1) building a sophisticated post-processing to associate frame level segmentation results and (2) modeling a video clip as a 3D spatial-temporal volume with a limit of resolution and length due to memory constraints. In this work, we propose a frame-to-frame method built upon transformers. We use a set of queries, called instance sequence queries (ISQs), to drive the transformer decoder and produce results at each frame. Each query represents one instance in a video clip. By extending the bipartite matching loss to two frames, our training procedure enables the decoder to adjust the ISQs during inference. The consistency of instances is preserved by the corresponding order between query slots and network outputs. 
As a result, there is no need for complex data association. On TITAN Xp GPU, our method achieves a competitive 34.4% mAP at 33.5 FPS with ResNet-50 and 35.5% mAP at 26.6 FPS with ResNet-101 on the Youtube-VIS dataset.<\/jats:p>","DOI":"10.3390\/s21134507","type":"journal-article","created":{"date-parts":[[2021,7,1]],"date-time":"2021-07-01T02:44:39Z","timestamp":1625107479000},"page":"4507","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Instance Sequence Queries for Video Instance Segmentation with Transformers"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6867-0401","authenticated-orcid":false,"given":"Zhujun","family":"Xu","sequence":"first","affiliation":[{"name":"Institut Sup\u00e9rieur de l\u2019A\u00e9ronautique et de l\u2019Espace (ISAE-SUPAERO), University of Toulouse, 31400 Toulouse, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1909-5591","authenticated-orcid":false,"given":"Damien","family":"Vivet","sequence":"additional","affiliation":[{"name":"Institut Sup\u00e9rieur de l\u2019A\u00e9ronautique et de l\u2019Espace (ISAE-SUPAERO), University of Toulouse, 31400 Toulouse, France"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,6,30]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Bolya, D., Zhou, C., Xiao, F., and Lee, Y. (November, January 27). YOLACT: Real-Time Instance Segmentation. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00925"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Liang, J., Homayounfar, N., Ma, W., Xiong, Y., Hu, R., and Urtasun, R. (2020, January 14\u201319). PolyTransform: Deep Polygon Transformer for Instance Segmentation. 
Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00915"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Cao, J., Anwer, R.M., Cholakkal, H., Khan, F., Pang, Y., and Shao, L. (2020, January 23\u201328). SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation. Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK.","DOI":"10.1007\/978-3-030-58568-6_1"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"386","DOI":"10.1109\/TPAMI.2018.2844175","article-title":"Mask R-CNN","volume":"42","author":"He","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Lee, Y., and Park, J. (2020, January 14\u201319). CenterMask: Real-Time Anchor-Free Instance Segmentation. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01392"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., and Ouyang, W. (2019, January 16\u201320). Hybrid Task Cascade for Instance Segmentation. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00511"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Chen, X., Girshick, R.B., He, K., and Doll\u00e1r, P. (November, January 27). TensorMask: A Foundation for Dense Object Segmentation. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00215"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. (2018, January 18\u201322). Path Aggregation Network for Instance Segmentation. 
Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00913"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Cao, Y., Xu, J., Lin, S., Wei, F., and Hu, H. (November, January 27). GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea.","DOI":"10.1109\/ICCVW.2019.00246"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Yang, L., Fan, Y., and Xu, N. (November, January 27). Video Instance Segmentation. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00529"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Bertasius, G., and Torresani, L. (2020, January 14\u201319). Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00976"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Feng, Q., Yang, Z., Li, P., Wei, Y., and Yang, Y. (November, January 27). Dual Embedding Learning for Video Instance Segmentation. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea.","DOI":"10.1109\/ICCVW.2019.00090"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Athar, A., Mahadevan, S., Osep, A., Leal-Taix\u00e9, L., and Leibe, B. (2020, January 23\u201328). STEm-Seg: Spatio-Temporal Embeddings for Instance Segmentation in Videos. Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK.","DOI":"10.1007\/978-3-030-58621-8_10"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Wang, Y., Xu, Z., Wang, X., Shen, C., Cheng, B., Shen, H., and Xia, H. (2020). 
End-to-End Video Instance Segmentation with Transformers. arXiv.","DOI":"10.1109\/CVPR46437.2021.00863"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. arXiv.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"ref_16","unstructured":"Hwang, S., Heo, M., Oh, S.W., and Kim, S.J. (2021). Video Instance Segmentation using Inter-Frame Communication Transformers. arXiv."},{"key":"ref_17","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017). Attention is All you Need. arXiv."},{"key":"ref_18","unstructured":"Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. (2020). Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv."},{"key":"ref_19","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_20","unstructured":"Sun, P., Jiang, Y., Zhang, R., Xie, E., Cao, J., Hu, X., Kong, T., Yuan, Z., Wang, C., and Luo, P. (2020). TransTrack: Multiple-Object Tracking with Transformer. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Meinhardt, T., Kirillov, A., Leal-Taix\u00e9, L., and Feichtenhofer, C. (2021). TrackFormer: Multi-Object Tracking with Transformers. arXiv.","DOI":"10.1109\/CVPR52688.2022.00864"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. 
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1002\/nav.3800020109","article-title":"The Hungarian method for the assignment problem","volume":"2","author":"Kuhn","year":"1955","journal-title":"Nav. Res. Logist. Q."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Rezatofighi, S.H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S. (2019, January 16\u201320). Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00075"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Milletari, F., Navab, N., and Ahmadi, S.A. (2016, January 25\u201328). V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA.","DOI":"10.1109\/3DV.2016.79"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"318","DOI":"10.1109\/TPAMI.2018.2858826","article-title":"Focal Loss for Dense Object Detection","volume":"42","author":"Lin","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_27","unstructured":"Loshchilov, I., and Hutter, F. (2019, January 6\u20139). Decoupled Weight Decay Regularization. Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014, January 6\u201312). Microsoft COCO: Common Objects in Context. 
Proceedings of the 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10602-1_48"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/13\/4507\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:24:24Z","timestamp":1760163864000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/13\/4507"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,6,30]]},"references-count":28,"journal-issue":{"issue":"13","published-online":{"date-parts":[[2021,7]]}},"alternative-id":["s21134507"],"URL":"https:\/\/doi.org\/10.3390\/s21134507","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,6,30]]}}}