{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T03:13:05Z","timestamp":1774926785446,"version":"3.50.1"},"reference-count":44,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2024,2,3]],"date-time":"2024-02-03T00:00:00Z","timestamp":1706918400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000780","name":"Artificial Intelligence National Laboratory","doi-asserted-by":"publisher","award":["RRF-2.3.1-21-2022-00004"],"award-info":[{"award-number":["RRF-2.3.1-21-2022-00004"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000780","name":"Artificial Intelligence National Laboratory","doi-asserted-by":"publisher","award":["2020-1.1.2-PIACI-KFI-2020-00115"],"award-info":[{"award-number":["2020-1.1.2-PIACI-KFI-2020-00115"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000780","name":"Artificial Intelligence National Laboratory","doi-asserted-by":"publisher","award":["TKP2021-NVA-29"],"award-info":[{"award-number":["TKP2021-NVA-29"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Research, Development, and Innovation Fund of Hungary","award":["RRF-2.3.1-21-2022-00004"],"award-info":[{"award-number":["RRF-2.3.1-21-2022-00004"]}]},{"name":"National Research, Development, and Innovation Fund of Hungary","award":["2020-1.1.2-PIACI-KFI-2020-00115"],"award-info":[{"award-number":["2020-1.1.2-PIACI-KFI-2020-00115"]}]},{"name":"National Research, Development, and Innovation Fund of Hungary","award":["TKP2021-NVA-29"],"award-info":[{"award-number":["TKP2021-NVA-29"]}]},{"DOI":"10.13039\/501100012550","name":"Ministry of Culture and Innovation of Hungary from the National Research, Development, and Innovation Fund","doi-asserted-by":"publisher","award":["RRF-2.3.1-21-2022-00004"],"award-info":[{"award-number":["RRF-2.3.1-21-2022-00004"]}],"id":[{"id":"10.13039\/501100012550","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012550","name":"Ministry of Culture and Innovation of Hungary from the National Research, Development, and Innovation Fund","doi-asserted-by":"publisher","award":["2020-1.1.2-PIACI-KFI-2020-00115"],"award-info":[{"award-number":["2020-1.1.2-PIACI-KFI-2020-00115"]}],"id":[{"id":"10.13039\/501100012550","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012550","name":"Ministry of Culture and Innovation of Hungary from the National Research, Development, and Innovation Fund","doi-asserted-by":"publisher","award":["TKP2021-NVA-29"],"award-info":[{"award-number":["TKP2021-NVA-29"]}],"id":[{"id":"10.13039\/501100012550","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Governmental Agency for IT Development (KIF\u00dc) in Hungary","award":["RRF-2.3.1-21-2022-00004"],"award-info":[{"award-number":["RRF-2.3.1-21-2022-00004"]}]},{"name":"Governmental Agency for IT Development (KIF\u00dc) in Hungary","award":["2020-1.1.2-PIACI-KFI-2020-00115"],"award-info":[{"award-number":["2020-1.1.2-PIACI-KFI-2020-00115"]}]},{"name":"Governmental Agency for IT Development (KIF\u00dc) in Hungary","award":["TKP2021-NVA-29"],"award-info":[{"award-number":["TKP2021-NVA-29"]}]},{"name":"Robert Bosch, 
Ltd.","award":["RRF-2.3.1-21-2022-00004"],"award-info":[{"award-number":["RRF-2.3.1-21-2022-00004"]}]},{"name":"Robert Bosch, Ltd.","award":["2020-1.1.2-PIACI-KFI-2020-00115"],"award-info":[{"award-number":["2020-1.1.2-PIACI-KFI-2020-00115"]}]},{"name":"Robert Bosch, Ltd.","award":["TKP2021-NVA-29"],"award-info":[{"award-number":["TKP2021-NVA-29"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>A novel approach for video instance segmentation is presented using semisupervised learning. Our Cluster2Former model leverages scribble-based annotations for training, significantly reducing the need for comprehensive pixel-level masks. We augment a video instance segmenter, for example, the Mask2Former architecture, with similarity-based constraint loss to handle partial annotations efficiently. We demonstrate that despite using lightweight annotations (using only 0.5% of the annotated pixels), Cluster2Former achieves competitive performance on standard benchmarks. The approach offers a cost-effective and computationally efficient solution for video instance segmentation, especially in scenarios with limited annotation resources.<\/jats:p>","DOI":"10.3390\/s24030997","type":"journal-article","created":{"date-parts":[[2024,2,6]],"date-time":"2024-02-06T12:38:24Z","timestamp":1707223104000},"page":"997","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Cluster2Former: Semisupervised Clustering Transformers for Video Instance Segmentation"],"prefix":"10.3390","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1662-7583","authenticated-orcid":false,"given":"\u00c1ron","family":"F\u00f3thi","sequence":"first","affiliation":[{"name":"Department of Artificial Intelligence, ELTE E\u00f6tv\u00f6s Lor\u00e1nd University, 1053 Budapest, Hungary"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-5315-7896","authenticated-orcid":false,"given":"Adri\u00e1n","family":"Szlatincs\u00e1n","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence, ELTE E\u00f6tv\u00f6s Lor\u00e1nd University, 1053 Budapest, Hungary"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2218-8855","authenticated-orcid":false,"given":"Ell\u00e1k","family":"Somfai","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence, ELTE E\u00f6tv\u00f6s Lor\u00e1nd University, 1053 Budapest, Hungary"},{"name":"HUN-REN Wigner Research Centre for Physics, 1121 Budapest, Hungary"}]}],"member":"1968","published-online":{"date-parts":[[2024,2,3]]},"reference":[{"key":"ref_1","unstructured":"Yang, L., Fan, Y., and Xu, N. (November, January 27). Video instance segmentation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., and Darrell, T. (2020, January 13\u201319). Bdd100k: A diverse driving dataset for heterogeneous multitask learning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00271"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"2022","DOI":"10.1007\/s11263-022-01629-1","article-title":"Occluded video instance segmentation: A benchmark","volume":"130","author":"Qi","year":"2022","journal-title":"Int. J. Comput. 
Vis."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Cheng, B., Parkhi, O., and Kirillov, A. (2022, January 18\u201324). Pointly-supervised instance segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00264"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Ke, L., Danelljan, M., Ding, H., Tai, Y.W., Tang, C.K., and Yu, F. (2023, January 17\u201324). Mask-free video instance segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.02189"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Wu, J., Jiang, Y., Zhang, W., Bai, X., and Bai, S. (2021). Seqformer: A frustratingly simple model for video instance segmentation. arXiv.","DOI":"10.1007\/978-3-031-19815-1_32"},{"key":"ref_7","unstructured":"Cheng, B., Choudhuri, A., Misra, I., Kirillov, A., Girdhar, R., and Schwing, A.G. (2021). Mask2former for video instance segmentation. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Ke, L., Ding, H., Danelljan, M., Tai, Y.W., Tang, C.K., and Yu, F. (2022, January 23\u201327). Video mask transfiner for high-quality video instance segmentation. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1109\/CVPR52688.2022.00437"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., and Girdhar, R. (2022, January 18\u201324). Masked-attention mask transformer for universal image segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00135"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"9284","DOI":"10.1109\/TPAMI.2023.3246102","article-title":"A survey on label-efficient deep image segmentation: Bridging the gap between weak supervision and dense prediction","volume":"45","author":"Shen","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020, January 23\u201328). End-to-end object detection with transformers. Proceedings of the Computer Vision\u2014ECCV 2020: 16th European Conference, Glasgow, UK.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"ref_12","unstructured":"Hsu, Y.C., and Kira, Z. (2015). Neural network-based clustering using pairwise constraints. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Hsu, Y.C., Xu, Z., Kira, Z., and Huang, J. (2018, January 8\u201313). Learning to cluster for proposal-free instance segmentation. Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil.","DOI":"10.1109\/IJCNN.2018.8489379"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"310","DOI":"10.1016\/j.neunet.2023.04.016","article-title":"Semi-supervised deep embedded clustering with pairwise constraints and subset allocation","volume":"164","author":"Wang","year":"2023","journal-title":"Neural Netw."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"F\u00f3thi, \u00c1., Farag\u00f3, K.B., Kop\u00e1csi, L., Milacski, Z.\u00c1., Varga, V., and L\u0151rincz, A. (2020, January 6\u201312). Multi Object Tracking for Similar Instances: A Hybrid Architecture. 
Proceedings of the International Conference on Neural Information Processing, Vancouver, BC, Canada.","DOI":"10.1007\/978-3-030-63830-6_37"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Yu, Q., Wang, H., Kim, D., Qiao, S., Collins, M., Zhu, Y., Adam, H., Yuille, A., and Chen, L.C. (2022, January 18\u201324). Cmt-deeplab: Clustering mask transformers for panoptic segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00259"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1305","DOI":"10.1049\/ipr2.12410","article-title":"Focal learning on stranger for imbalanced image segmentation","volume":"16","author":"Zhao","year":"2022","journal-title":"IET Image Process."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Chen, X., Lian, Y., Jiao, L., Wang, H., Gao, Y., and Lingling, S. (2020, January 23\u201328). Supervised edge attention network for accurate image instance segmentation. Proceedings of the Computer Vision\u2014ECCV 2020: 16th European Conference, Glasgow, UK.","DOI":"10.1007\/978-3-030-58583-9_37"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Bertasius, G., and Torresani, L. (2020, January 13\u201319). Classifying, segmenting, and tracking object instances in video with mask propagation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00976"},{"key":"ref_20","first-page":"1192","article-title":"Prototypical cross-attention networks for multiple object tracking and segmentation","volume":"34","author":"Ke","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015, January 7\u201313). Fast r-cnn. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.169"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., and Girshick, R. (2017, January 22\u201329). Mask r-cnn. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.322"},{"key":"ref_23","unstructured":"Bolya, D., Zhou, C., Xiao, F., and Lee, Y.J. (November, January 27). Yolact: Real-time instance segmentation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Chen, H., Sun, K., Tian, Z., Shen, C., Huang, Y., and Yan, Y. (2020, January 13\u201319). Blendmask: Top-down meets bottom-up for instance segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00860"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Wang, Q., Zhang, L., Bertinetto, L., Hu, W., and Torr, P.H. (2019, January 15\u201320). Fast online object tracking and segmentation: A unifying approach. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00142"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Kop\u00e1csi, L., Dobolyi, \u00c1., F\u00f3thi, \u00c1., Keller, D., Varga, V., and L\u0151rincz, A. (2021, January 14\u201317). RATS: Robust Automated Tracking and Segmentation of Similar Instances. 
Proceedings of the International Conference on Artificial Neural Networks, Bratislava, Slovakia.","DOI":"10.1007\/978-3-030-86365-4_41"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Wang, Y., Xu, Z., Wang, X., Shen, C., Cheng, B., Shen, H., and Xia, H. (2021, January 20\u201325). End-to-end video instance segmentation with transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00863"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Li, J., Yu, B., Rao, Y., Zhou, J., and Lu, J. (2023, January 17\u201324). TCOVIS: Temporally Consistent Online Video Instance Segmentation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Vancouver, BC, Canada.","DOI":"10.1109\/ICCV51070.2023.00107"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Wang, H., Jiang, X., Ren, H., Hu, Y., and Bai, S. (2021, January 20\u201325). Swiftnet: Real-time video object segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00135"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yang, S., Fang, Y., Wang, X., Li, Y., Fang, C., Shan, Y., Feng, B., and Liu, W. (2021, January 20\u201325). Crossover learning for fast online video instance segmentation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Nashville, TN, USA.","DOI":"10.1109\/ICCV48922.2021.00794"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Wu, J., Liu, Q., Jiang, Y., Bai, S., Yuille, A., and Bai, X. (2022, January 23\u201327). In defense of online models for video instance segmentation. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-19815-1_34"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B.B.G., Geiger, A., and Leibe, B. (2019, January 15\u201320). Mots: Multi-object tracking and segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00813"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Athar, A., Mahadevan, S., Osep, A., Leal-Taix\u00e9, L., and Leibe, B. (2020, January 23\u201328). Stem-seg: Spatio-temporal embeddings for instance segmentation in videos. Proceedings of the Computer Vision\u2014ECCV 2020: 16th European Conference, Glasgow, UK.","DOI":"10.1007\/978-3-030-58621-8_10"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Meinhardt, T., Kirillov, A., Leal-Taixe, L., and Feichtenhofer, C. (2022, January 18\u201324). Trackformer: Multi-object tracking with transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00864"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., and Yu, F. (2021, January 20\u201325). Quasi-dense similarity learning for multiple object tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00023"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., and Yuan, J. (2021, January 20\u201325). Track to detect and segment: An online multi-object tracker. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01217"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Yan, B., Jiang, Y., Sun, P., Wang, D., Yuan, Z., Luo, P., and Lu, H. (2022, January 23\u201327). Towards grand unification of object tracking. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-19803-8_43"},{"key":"ref_38","first-page":"13352","article-title":"Video instance segmentation using inter-frame communication transformers","volume":"34","author":"Hwang","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_39","first-page":"23109","article-title":"Vita: Video instance segmentation via object token association","volume":"35","author":"Heo","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Pathak, D., Girshick, R., Doll\u00e1r, P., Darrell, T., and Hariharan, B. (2017, January 21\u201326). Learning features by watching objects move. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.638"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Fu, Y., Liu, S., Iqbal, U., De Mello, S., Shi, H., and Kautz, J. (2021, January 20\u201325). Learning to track instances without video annotations. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00857"},{"key":"ref_42","first-page":"31265","article-title":"Minvis: A minimal video instance segmentation framework without video-based training","volume":"35","author":"Huang","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_43","unstructured":"Pont-Tuset, J., Perazzi, F., Caelles, S., Arbel\u00e1ez, P., Sorkine-Hornung, A., and Van Gool, L. (2017). The 2017 DAVIS Challenge on Video Object Segmentation. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Caelles, S., Montes, A., Maninis, K.K., Chen, Y., Van Gool, L., Perazzi, F., and Pont-Tuset, J. (2018). The 2018 DAVIS Challenge on Video Object Segmentation. arXiv.","DOI":"10.1109\/CVPR.2017.565"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/3\/997\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T13:54:31Z","timestamp":1760104471000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/3\/997"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,3]]},"references-count":44,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2024,2]]}},"alternative-id":["s24030997"],"URL":"https:\/\/doi.org\/10.3390\/s24030997","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,3]]}}}
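
The abstract in the record above describes augmenting a video instance segmenter such as Mask2Former with a similarity-based constraint loss computed on scribble-annotated pixels, so that only a small fraction of pixels needs labels. As a rough illustration of how such a loss can be formulated, the sketch below follows the pairwise-constraint clustering idea cited in the record (ref_12, Hsu and Kira): pixel pairs from the same scribble instance ("must-link") are pulled toward identical soft cluster assignments, pairs from different instances ("cannot-link") are pushed apart. All function names, tensor shapes, and the exact loss form are assumptions made for illustration; this is not the authors' Cluster2Former implementation.

    # Illustrative sketch only: a pairwise similarity constraint loss on scribble pixels,
    # in the spirit of the pairwise-constraint clustering loss of ref_12 (Hsu & Kira 2015).
    # Names and shapes are hypothetical; NOT the Cluster2Former authors' code.
    import torch
    import torch.nn.functional as F

    def pairwise_constraint_loss(logits, scribble_ids, eps=1e-6):
        # logits       : (P, K) per-pixel logits over K instance queries / clusters,
        #                for the P scribble-annotated pixels sampled from a frame.
        # scribble_ids : (P,) instance id of each scribble pixel; equal ids = must-link,
        #                different ids = cannot-link.
        probs = F.softmax(logits, dim=-1)          # (P, K) soft cluster assignments
        sim = probs @ probs.t()                    # (P, P) pairwise assignment similarity in [0, 1]
        same = (scribble_ids[:, None] == scribble_ids[None, :]).float()

        # Binary cross-entropy on the similarity: must-link pairs toward 1, cannot-link toward 0.
        loss = -(same * torch.log(sim + eps) + (1.0 - same) * torch.log(1.0 - sim + eps))

        # Exclude the trivial diagonal (each pixel paired with itself) and average.
        off_diag = 1.0 - torch.eye(len(scribble_ids), device=logits.device)
        return (loss * off_diag).sum() / off_diag.sum().clamp(min=1.0)

    if __name__ == "__main__":
        # Toy usage: 6 scribble pixels from 2 instances, 4 instance queries.
        logits = torch.randn(6, 4, requires_grad=True)
        scribble_ids = torch.tensor([0, 0, 0, 1, 1, 1])
        loss = pairwise_constraint_loss(logits, scribble_ids)
        loss.backward()
        print(float(loss))

In this reading, the loss supervises only the sampled scribble pixels, which is consistent with the abstract's claim of training from roughly 0.5% of the annotated pixels; how these pairs are sampled across frames and combined with the segmenter's other losses is not specified by the record and is left out of the sketch.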