{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T17:11:56Z","timestamp":1777655516322,"version":"3.51.4"},"reference-count":73,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2025,1,15]],"date-time":"2025-01-15T00:00:00Z","timestamp":1736899200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,1,15]],"date-time":"2025-01-15T00:00:00Z","timestamp":1736899200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100008530","name":"European Regional Development Fund","doi-asserted-by":"publisher","award":["CZ.02.1.010.00.015 0030000468"],"award-info":[{"award-number":["CZ.02.1.010.00.015 0030000468"]}],"id":[{"id":"10.13039\/501100008530","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Horizon Europe Research and Innovation programme","award":["101092944"],"award-info":[{"award-number":["101092944"]}]},{"DOI":"10.13039\/501100001823","name":"Ministerstvo \u0160kolstv\u00ed, Ml\u00e1de\u017ee a T\u011blov\u00fdchovy","doi-asserted-by":"publisher","award":["90140"],"award-info":[{"award-number":["90140"]}],"id":[{"id":"10.13039\/501100001823","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100007655","name":"\u010cesk\u00e9 Vysok\u00e9 U\u010den\u00ed Technick\u00e9 v Praze","doi-asserted-by":"publisher","award":["GS21184OHK33T37"],"award-info":[{"award-number":["GS21184OHK33T37"]}],"id":[{"id":"10.13039\/100007655","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2025,6]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Semantic image segmentation models typically require 
extensive pixel-wise annotations, which are costly to obtain and prone to biases. Our work investigates learning semantic segmentation in urban scenes without any manual annotation. We propose a novel method for learning pixel-wise semantic segmentation using raw, uncurated data from vehicle-mounted cameras and LiDAR sensors, thus eliminating the need for manual labeling. Our contributions are as follows. First, we develop a novel approach for cross-modal unsupervised learning of semantic segmentation by leveraging synchronized LiDAR and image data. A crucial element of our method is the integration of an object proposal module that examines the LiDAR point cloud to generate proposals for spatially consistent objects. Second, we demonstrate that these 3D object proposals can be aligned with corresponding images and effectively grouped into semantically meaningful pseudo-classes. Third, we introduce a cross-modal distillation technique that utilizes image data partially annotated with the learnt pseudo-classes to train a transformer-based model for semantic image segmentation. Fourth, we demonstrate further significant improvements of our approach by extending the proposed model using a teacher-student distillation with an exponential moving average and incorporating soft targets from the teacher. We show the generalization capabilities of our method by testing on four different testing datasets (Cityscapes, Dark Zurich, Nighttime Driving, and ACDC) without any fine-tuning. 
We present an in-depth experimental analysis of the proposed model including results when using another pre-training dataset, per-class and pixel accuracy results, confusion matrices, PCA visualization, k-NN evaluation, ablations of the number of clusters and LiDAR\u2019s density, supervised finetuning as well as additional qualitative results and their analysis.<\/jats:p>","DOI":"10.1007\/s11263-024-02320-3","type":"journal-article","created":{"date-parts":[[2025,1,15]],"date-time":"2025-01-15T22:07:50Z","timestamp":1736978870000},"page":"3519-3541","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation"],"prefix":"10.1007","volume":"133","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8946-2057","authenticated-orcid":false,"given":"Antonin","family":"Vobecky","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"David","family":"Hurych","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Oriane","family":"Sim\u00e9oni","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Spyros","family":"Gidaris","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andrei","family":"Bursuc","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Patrick","family":"P\u00e9rez","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Josef","family":"Sivic","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2025,1,15]]},"reference":[{"key":"2320_CR1","doi-asserted-by":"crossref","unstructured":"Afouras, T., Asano, Y. 
M., Fagan, F., Vedaldi, A., & Metze, F. (2022). Self-supervised object detection from audio-visual correspondence. Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 10575\u201310586.","DOI":"10.1109\/CVPR52688.2022.01032"},{"key":"2320_CR2","first-page":"25","volume":"33","author":"J-B Alayrac","year":"2020","unstructured":"Alayrac, J.-B., Recasens, A., Schneider, R., Arandjelovic, R., Ramapuram, J., De Fauw, J., Smaira, L., Dieleman, S., & Zisserman, A. (2020). Self-supervised multimodal versatile networks. Advances in Neural Information Processing Systems, 33, 25\u201337.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2320_CR3","first-page":"9758","volume":"33","author":"H Alwassel","year":"2020","unstructured":"Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., & Tran, D. (2020). Self-supervised learning by cross-modal audio-video clustering. Advances in Neural Information Processing Systems, 33, 9758\u20139770.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2320_CR4","doi-asserted-by":"crossref","unstructured":"Arandjelovic, R., & Zisserman, A. (2017). Look, listen and learn. in Proceedings of the IEEE international conference on computer vision , pp. 609\u2013617.","DOI":"10.1109\/ICCV.2017.73"},{"key":"2320_CR5","doi-asserted-by":"publisher","first-page":"103601","DOI":"10.1016\/j.cviu.2022.103601","volume":"227","author":"F Bartoccioni","year":"2023","unstructured":"Bartoccioni, F., Zablocki, \u00c9., P\u00e9rez, P., Cord, M., & Alahari, K. (2023). Lidartouch: Monocular metric depth estimation with a few-beam lidar. Computer Vision and Image Understanding, 227, 103601.","journal-title":"Computer Vision and Image Understanding"},{"key":"2320_CR6","unstructured":"Bielski, A., & Favaro, P. (2019). Emergence of object segmentation in perturbed generative models. 
Advances in Neural Information Processing Systems, 32."},{"key":"2320_CR7","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1007\/s41064-016-0003-y","volume":"85","author":"I Bogoslavskyi","year":"2017","unstructured":"Bogoslavskyi, I., & Stachniss, C. (2017). Efficient online segmentation for sparse 3d laser scans. PFG-Journal of Photogrammetry, Remote Sensing and Geoinformation Science, 85, 41\u201352.","journal-title":"PFG-Journal of Photogrammetry, Remote Sensing and Geoinformation Science"},{"key":"2320_CR8","doi-asserted-by":"crossref","unstructured":"Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2020). Nuscenes: A multimodal dataset for autonomous driving. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 11621\u201311631.","DOI":"10.1109\/CVPR42600.2020.01164"},{"key":"2320_CR9","doi-asserted-by":"crossref","unstructured":"Caron, M., Bojanowski, P., Joulin, A., & Douze, M. (2018). Deep clustering for unsupervised learning of visual features. in Proceedings of the European conference on computer vision, pp. 132\u2013149.","DOI":"10.1007\/978-3-030-01264-9_9"},{"key":"2320_CR10","doi-asserted-by":"crossref","unstructured":"Caron, M., Touvron, H., Misra, I., Jegou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. in Proceedings of the IEEE\/CVF international conference on computer vision, pp. 9650\u20139660.","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"2320_CR11","doi-asserted-by":"crossref","unstructured":"Chen, H., Xie, W., Afouras, T., Nagrani, A., Vedaldi, A., & Zisserman, A. (2021). Localizing visual sounds the hard way. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 
16867\u201316876.","DOI":"10.1109\/CVPR46437.2021.01659"},{"key":"2320_CR12","doi-asserted-by":"crossref","unstructured":"Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. in Proceedings of the European conference on computer vision (ECCV), pp. 801\u2013818.","DOI":"10.1007\/978-3-030-01234-2_49"},{"key":"2320_CR13","unstructured":"Chen, M., Arti\u00e8res, T., & Denoyer, L. (2019). Unsupervised object segmentation by redrawing. Advances in Neural Information Processing Systems, 32."},{"key":"2320_CR14","doi-asserted-by":"crossref","unstructured":"Chen, X., Yuan, Y., Zeng, G., & Wang, J. (2021). Semi-supervised semantic segmentation with cross pseudo supervision. Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 2613\u20132622.","DOI":"10.1109\/CVPR46437.2021.00264"},{"key":"2320_CR15","doi-asserted-by":"crossref","unstructured":"Cheng, B., Collins, M. D., Zhu, Y., Liu, T., Huang, T. S., Adam, H., & Chen, L.-C. (2020). Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 12475\u201312485.","DOI":"10.1109\/CVPR42600.2020.01249"},{"key":"2320_CR16","unstructured":"Cho, J. H., Mall, U., Bala, K., & Hariharan, B. (2021). PiCIE: Unsupervised semantic segmentation using invariance and equivariance in clustering. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 16794\u201316804."},{"key":"2320_CR17","doi-asserted-by":"crossref","unstructured":"Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 
3213\u20133223.","DOI":"10.1109\/CVPR.2016.350"},{"key":"2320_CR18","doi-asserted-by":"crossref","unstructured":"Dai, D., & Van Gool, L. (2018). 2018 21st international conference on intelligent transportation systems (ITSC), pp. 3819\u20133824.","DOI":"10.1109\/ITSC.2018.8569387"},{"key":"2320_CR19","unstructured":"Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2023). Vision transformers need registers."},{"key":"2320_CR20","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. in 2009 IEEE conference on computer vision and pattern recognition, pp. 248\u2013255.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"2320_CR21","doi-asserted-by":"crossref","unstructured":"Deng, Z., & Luo, Y. (2023). Learning neural eigenfunctions for unsupervised semantic segmentation. Proceedings of the IEEE\/CVF international conference on computer vision, pp. 551\u2013561.","DOI":"10.1109\/ICCV51070.2023.00057"},{"key":"2320_CR22","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16 x 16 words: Transformers for image recognition at scale. in International conference on learning representations."},{"key":"2320_CR23","doi-asserted-by":"publisher","unstructured":"Dosovitskiy, A., Fischer, P., Ilg, E., H\u00e4usser, P., Hazirbas, C., Golkov, V., Smagt, P. V. D., Cremers, D., & Brox, T. (2015). Flownet: Learning optical flow with convolutional networks. in Proceedings of the IEEE international conference on computer vision, pp. 2758\u20132766. https:\/\/doi.org\/10.1109\/ICCV.2015.316","DOI":"10.1109\/ICCV.2015.316"},{"key":"2320_CR24","doi-asserted-by":"publisher","first-page":"167","DOI":"10.1023\/B:VISI.0000022288.19776.77","volume":"59","author":"PF Felzenszwalb","year":"2004","unstructured":"Felzenszwalb, P. 
F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59, 167\u2013181.","journal-title":"International Journal of Computer Vision"},{"key":"2320_CR25","unstructured":"French, G., Laine, S., Aila, T., Mackiewicz, M., & Finlayson, G. (2020). Semi-supervised semantic segmentation needs strong, varied perturbations. in The British machine vision conference."},{"key":"2320_CR26","doi-asserted-by":"crossref","unstructured":"Gidaris, S., Bursuc, A., Puy, G., Komodakis, N., Cord, M., & P\u00e9rez, P. (2021). Obow: Online bag-of-visual-words generation for self-supervised learning. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 6830\u20136840.","DOI":"10.1109\/CVPR46437.2021.00676"},{"key":"2320_CR27","unstructured":"Gidaris, S., Singh, P., & Komodakis, N. (2018). Unsupervised representation learning by predicting image rotations. in The international conference on learning representations."},{"key":"2320_CR28","first-page":"21271","volume":"33","author":"J Grill","year":"2020","unstructured":"Grill, J., Strub, F., Altch\u00e9, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. \u00c1., Guo, Z., Azar, M. G., Piot, B., Kavukcuoglu, K., Munos, R., & Valko, M. (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 21271\u201321284.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2320_CR29","unstructured":"Hamilton, M., et al., (2022). Unsupervised semantic segmentation by distilling feature correspondences. in The international conference on learning representations."},{"key":"2320_CR30","doi-asserted-by":"crossref","unstructured":"He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. B. (2020). Momentum contrast for unsupervised visual representation learning. 
in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 9729\u20139738.","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"2320_CR31","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 770\u2013778.","DOI":"10.1109\/CVPR.2016.90"},{"key":"2320_CR32","doi-asserted-by":"crossref","unstructured":"H\u00e9naff, O. J., Koppula, S., Alayrac, J.-B., Oord, A. V. D., Vinyals, O., & Carreira, J. (2021). Efficient visual pretraining with contrastive detection. in Proceedings of the IEEE\/CVF international conference on computer vision, pp. 10086\u201310096.","DOI":"10.1109\/ICCV48922.2021.00993"},{"key":"2320_CR33","unstructured":"Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. NeurIPS Deep Learning and Representation Learning Workshop. http:\/\/arxiv.org\/abs\/1503.02531"},{"key":"2320_CR34","doi-asserted-by":"crossref","unstructured":"Hwang, J.-J., Yu, S. X., Shi, J., Collins, M. D., Yang, T.-J., Zhang, X., & Chen, L.-C. (2019). Segsort: Segmentation by discriminative sorting of segments. in Proceedings of the IEEE\/CVF international conference on computer vision, pp. 7334\u20137344.","DOI":"10.1109\/ICCV.2019.00743"},{"key":"2320_CR35","doi-asserted-by":"crossref","unstructured":"Jaritz, M., Vu, T.-H., Charette, R. D., Wirbel, E., & P\u00e9rez, P. (2020). Xmuda: Cross-modal unsupervised domain adaptation for 3d semantic segmentation. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 12605\u201312614.","DOI":"10.1109\/CVPR42600.2020.01262"},{"key":"2320_CR36","doi-asserted-by":"crossref","unstructured":"Ji, X., Henriques, J. F., & Vedaldi, A. (2019). Invariant information clustering for unsupervised image classification and segmentation. 
in Proceedings of the IEEE\/CVF international conference on computer vision, pp. 9865\u20139874.","DOI":"10.1109\/ICCV.2019.00996"},{"key":"2320_CR37","doi-asserted-by":"crossref","unstructured":"Kanezaki, A. (2018). Unsupervised image segmentation by backpropagation. in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pp. 1543\u20131547.","DOI":"10.1109\/ICASSP.2018.8462533"},{"issue":"1\u20132","key":"2320_CR38","doi-asserted-by":"publisher","first-page":"83","DOI":"10.1002\/nav.3800020109","volume":"2","author":"HW Kuhn","year":"1955","unstructured":"Kuhn, H. W., & Yaw, B. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1\u20132), 83\u201397.","journal-title":"Naval Research Logistics Quarterly"},{"key":"2320_CR39","first-page":"11353","volume":"36","author":"M Lan","year":"2024","unstructured":"Lan, M., Wang, X., Ke, Y., Xu, J., Feng, L., & Zhang, W. (2024). Smooseg: Smoothness prior for unsupervised semantic segmentation. Advances in Neural Information Processing Systems, 36, 11353\u201311373.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2320_CR40","doi-asserted-by":"crossref","unstructured":"Li, D., Yang, J., Kreis, K., Torralba, A., & Fidler, S. (2021). Semantic segmentation with generative models: Semi-supervised learning and strong out-of-domain generalization. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 8300\u20138311.","DOI":"10.1109\/CVPR46437.2021.00820"},{"key":"2320_CR41","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Doll\u00e1r, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117\u20132125.","DOI":"10.1109\/CVPR.2017.106"},{"key":"2320_CR42","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., & Darrell, T. 
(2015). Fully convolutional networks for semantic segmentation. in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431\u20133440.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"2320_CR43","unstructured":"Mao, J., Niu, M., Jiang, C., Liang, H., Liang, X., Li, Y., Ye, C., Zhang, W., Li, Z., Yu, J., et al. (2021). One million scenes for autonomous driving: Once dataset. Thirty-fifth conference on neural information processing systems datasets and benchmarks track (Round 1)."},{"key":"2320_CR44","doi-asserted-by":"crossref","unstructured":"Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J., & Zisserman, A. (2020). End-to-end learning of visual representations from uncurated instructional videos. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 9879\u20139889.","DOI":"10.1109\/CVPR42600.2020.00990"},{"key":"2320_CR45","doi-asserted-by":"crossref","unstructured":"Neuhold, G., Ollmann, T., Rota Bulo, S., & Kontschieder, P. (2017). The mapillary vistas dataset for semantic understanding of street scenes. in Proceedings of the IEEE international conference on computer vision, pp. 4990\u20134999.","DOI":"10.1109\/ICCV.2017.534"},{"key":"2320_CR46","unstructured":"Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.-Y., Xu, H., Sharma, V., Li, S.-W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Bojanowski, P. (2023). Dinov2: Learning robust visual features without supervision."},{"key":"2320_CR47","doi-asserted-by":"crossref","unstructured":"Ouali, Y., Hudelot, C., & Tami, M. (2020). Autoregressive unsupervised image segmentation. in Computer vision\u2013ECCV 2020: 16th European conference, pp. 142\u2013158.","DOI":"10.1007\/978-3-030-58571-6_9"},{"key":"2320_CR48","doi-asserted-by":"crossref","unstructured":"Owens, A., & Efros, A. A. (2018). 
Audio-visual scene analysis with self-supervised multisensory features. in Proceedings of the European conference on computer vision (ECCV), pp. 631\u2013648.","DOI":"10.1007\/978-3-030-01231-1_39"},{"key":"2320_CR49","doi-asserted-by":"crossref","unstructured":"Recasens, A., Luc, P., Alayrac, J.-B., Wang, L., Strub, F., Tallec, C., Malinowski, M., P\u0103tr\u0103ucean, V., Altch\u00e9, F., Valko, M., Grill, J.-B., van den Oord, A., & Zisserman, A. (2021). Broaden your views for self-supervised video learning. in Proceedings of the IEEE\/CVF international conference on computer vision , pp. 1255\u20131265.","DOI":"10.1109\/ICCV48922.2021.00129"},{"key":"2320_CR50","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. in Medical image computing and computer-assisted intervention\u2013MICCAI 2015: 18th international conference, pp. 234\u2013241.","DOI":"10.1007\/978-3-319-24574-4_28"},{"issue":"6","key":"2320_CR51","doi-asserted-by":"publisher","first-page":"3139","DOI":"10.1109\/TPAMI.2020.3045882","volume":"44","author":"C Sakaridis","year":"2020","unstructured":"Sakaridis, C., Dai, D., & Van Gool, L. (2020). Map-guided curriculum domain adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6), 3139\u20133153.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"2320_CR52","doi-asserted-by":"crossref","unstructured":"Sakaridis, C., Dai, D., & Van Gool, L. (2021). ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. in Proceedings of the IEEE\/CVF international conference on computer vision, pp. 10765\u201310775.","DOI":"10.1109\/ICCV48922.2021.01059"},{"key":"2320_CR53","doi-asserted-by":"crossref","unstructured":"Seong, H. S., Moon, W., Lee, S., & Heo, J.-P. (2023). 
Leveraging hidden positives for unsupervised semantic segmentation. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 19540\u201319549.","DOI":"10.1109\/CVPR52729.2023.01872"},{"key":"2320_CR54","doi-asserted-by":"crossref","unstructured":"Strudel, R., Garcia, R., Laptev, I., & Schmid, C. (2021). Segmenter: Transformer for semantic segmentation. in Proceedings of the IEEE\/CVF international conference on computer vision, pp. 7262\u20137272.","DOI":"10.1109\/ICCV48922.2021.00717"},{"key":"2320_CR55","doi-asserted-by":"crossref","unstructured":"Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al. (2020). Scalability in perception for autonomous driving: Waymo open dataset. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 2446\u20132454.","DOI":"10.1109\/CVPR42600.2020.00252"},{"key":"2320_CR56","volume-title":"Advances in neural information processing systems","author":"A Tarvainen","year":"2017","unstructured":"Tarvainen, A., & Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems.  (Vol. 30). Curran Associates, Inc."},{"key":"2320_CR57","doi-asserted-by":"crossref","unstructured":"Tian, H., Chen, Y., Dai, J., Zhang, Z., & Zhu, X. (2021). Unsupervised object detection with lidar clues. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 5962\u20135972.","DOI":"10.1109\/CVPR46437.2021.00590"},{"key":"2320_CR58","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & J\u00e9gou, H. (2021). Training data-efficient image transformers & distillation through attention. in International conference on machine learning, pp. 
10347\u201310357."},{"key":"2320_CR59","doi-asserted-by":"crossref","unstructured":"Van Gansbeke, W., Vandenhende, S., Georgoulis, S., & Van Gool, L. (2021). Unsupervised semantic segmentation by contrasting object mask proposals. in Proceedings of the IEEE\/CVF international conference on computer vision, pp. 10052\u201310062.","DOI":"10.1109\/ICCV48922.2021.00990"},{"key":"2320_CR60","doi-asserted-by":"crossref","unstructured":"Varma, G., Subramanian, A., Namboodiri, A., Chandraker, M., & Jawahar, C. (2019). IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. in 2019 IEEE winter conference on applications of computer vision, pp. 1743\u20131751.","DOI":"10.1109\/WACV.2019.00190"},{"key":"2320_CR61","doi-asserted-by":"crossref","unstructured":"Vobecky, A., Hurych, D., Sim\u00e9oni, O., Gidaris, S., Bursuc, A., P\u00e9rez, P., & Sivic, J. (2022). Drive&Segment: Unsupervised semantic segmentation of urban scenes via cross-modal distillation. in European conference on computer vision, pp. 478\u2013495.","DOI":"10.1007\/978-3-031-19839-7_28"},{"issue":"3","key":"2320_CR62","first-page":"2692","volume":"35","author":"A Vobecky","year":"2021","unstructured":"Vobecky, A., Hurych, D., U\u0159i\u010d\u00e1\u0159, M., P\u00e9rez, P., & Sivic, J. (2021). Artificial dummies for urban dataset augmentation. The Association for the Advancement of Artificial Intelligence, 35(3), 2692\u20132700.","journal-title":"The Association for the Advancement of Artificial Intelligence"},{"issue":"10","key":"2320_CR63","doi-asserted-by":"publisher","first-page":"3349","DOI":"10.1109\/TPAMI.2020.2983686","volume":"43","author":"J Wang","year":"2020","unstructured":"Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. (2020). Deep high-resolution representation learning for visual recognition. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10), 3349\u20133364.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"2320_CR64","doi-asserted-by":"crossref","unstructured":"Weston, R., Cen, S., Newman, P., & Posner, I. (2019). Probably unknown: Deep inverse sensor modelling radar. in 2019 international conference on robotics and automation, pp. 5446\u20135452.","DOI":"10.1109\/ICRA.2019.8793263"},{"key":"2320_CR65","unstructured":"Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., & Luo, P. (2021). Segformer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203."},{"key":"2320_CR66","doi-asserted-by":"crossref","unstructured":"Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., & Hu, H. (2021). Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 16684\u201316693.","DOI":"10.1109\/CVPR46437.2021.01641"},{"key":"2320_CR67","doi-asserted-by":"crossref","unstructured":"Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., & Darrell, T. (2020). Bdd100k: A diverse driving dataset for heterogeneous multitask learning. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 2636\u20132645.","DOI":"10.1109\/CVPR42600.2020.00271"},{"key":"2320_CR68","doi-asserted-by":"crossref","unstructured":"Yuan, Y., Chen, X., & Wang, J. (2020). Object-contextual representations for semantic segmentation. in Computer vision\u2013ECCV 2020: 16th European conference, pp. 173\u2013190.","DOI":"10.1007\/978-3-030-58539-6_11"},{"key":"2320_CR69","doi-asserted-by":"crossref","unstructured":"Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., & Savarese, S. (2018). Taskonomy: Disentangling task transfer learning. 
in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3712\u20133722.","DOI":"10.1109\/CVPR.2018.00391"},{"key":"2320_CR70","first-page":"16579","volume":"33","author":"X Zhang","year":"2020","unstructured":"Zhang, X., & Maire, M. (2020). Self-supervised visual representation learning from hierarchical grouping. Advances in Neural Information Processing Systems, 33, 16579\u201316590.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2320_CR71","doi-asserted-by":"crossref","unstructured":"Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (2018). The sound of pixels. in Proceedings of the European conference on computer vision, pp. 570\u2013586.","DOI":"10.1007\/978-3-030-01246-5_35"},{"key":"2320_CR72","doi-asserted-by":"crossref","unstructured":"Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881\u20132890.","DOI":"10.1109\/CVPR.2017.660"},{"key":"2320_CR73","doi-asserted-by":"crossref","unstructured":"Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P. H., & Zhang, L. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. in Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 
6881\u20136890.","DOI":"10.1109\/CVPR46437.2021.00681"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-024-02320-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-024-02320-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-024-02320-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,10]],"date-time":"2025-05-10T06:55:53Z","timestamp":1746860153000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-024-02320-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,15]]},"references-count":73,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2025,6]]}},"alternative-id":["2320"],"URL":"https:\/\/doi.org\/10.1007\/s11263-024-02320-3","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,15]]},"assertion":[{"value":"29 September 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 November 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"15 January 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}