{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T14:17:49Z","timestamp":1774534669284,"version":"3.50.1"},"reference-count":32,"publisher":"Sociedade Brasileira de Computacao - SB","issue":"1","license":[{"start":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T00:00:00Z","timestamp":1773532800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["JBCS"],"abstract":"<jats:p>Video object segmentation (VOS) involves consistently identifying and classifying object pixels in video sequences, a task that traditionally depends on extensive, manually annotated datasets. In this work, we present SHLS (Superfeatures in a Highly Compressed Latent Space), a self-supervised VOS method that reduces reliance on both annotations and large training datasets. SHLS employs a metric learning framework combining superpixels and deep learning features, enabling effective training with just 10,000 unlabeled still images. Utilizing an efficient memory clustering mechanism, SHLS generates ultra-compact representations called superfeatures, which efficiently store and classify object information across video sequences. Experiments on the DAVIS dataset demonstrate SHLS's strong performance in multi-object scenarios, underscoring its potential as a robust and efficient alternative in self-supervised VOS.<\/jats:p>","DOI":"10.5753\/jbcs.2026.5904","type":"journal-article","created":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T13:39:51Z","timestamp":1774532391000},"page":"443-452","source":"Crossref","is-referenced-by-count":0,"title":["Memorizing Features Efficiently for Self-supervised Video Object Segmentation"],"prefix":"10.5753","volume":"32","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0404-2158","authenticated-orcid":false,"given":"Marcelo","family":"Mendon\u00e7a","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7183-8853","authenticated-orcid":false,"given":"Luciano","family":"Oliveira","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"3742","published-online":{"date-parts":[[2026,3,15]]},"reference":[{"key":"1","doi-asserted-by":"crossref","unstructured":"Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and S\u00fcsstrunk, S. (2012). Slic superpixels compared to state-of-the-art superpixel methods. <i>IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/TPAMI.2012.120\">10.1109\/TPAMI.2012.120<\/a>.","DOI":"10.1109\/TPAMI.2012.120"},{"key":"2","unstructured":"Araslanov, N., Schaub-Meyer, S., and Roth, S. (2021). Dense unsupervised learning for video segmentation. In <i>Advances in Neural Information Processing Systems<\/i>."},{"key":"3","unstructured":"Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In <i>Proceedings of the 37th International Conference on Machine Learning<\/i>."},{"key":"4","doi-asserted-by":"crossref","unstructured":"Cheng, H. K. and Schwing, A. G. (2022). Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In <i>Computer Vision - ECCV 2022<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1007\/978-3-031-19815-1_37\">10.1007\/978-3-031-19815-1_37<\/a>.","DOI":"10.1007\/978-3-031-19815-1_37"},{"key":"5","doi-asserted-by":"crossref","unstructured":"Cheng, M.-M., Mitra, N. J., Huang, X., Torr, P. H. S., and Hu, S.-M. (2015). Global contrast based salient region detection. <i>IEEE TPAMI<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/TPAMI.2014.2345401\">10.1109\/TPAMI.2014.2345401<\/a>.","DOI":"10.1109\/TPAMI.2014.2345401"},{"key":"6","doi-asserted-by":"crossref","unstructured":"Guo, P., Zhang, W., Li, X., Fan, J., and Zhang, W. (2025). Self-supervised video object segmentation via pseudo label rectification. <i>Pattern Recogn.<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1016\/j.patcog.2025.111428\">10.1016\/j.patcog.2025.111428<\/a>.","DOI":"10.1016\/j.patcog.2025.111428"},{"key":"7","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Dollar, P., and Girshick, R. (2017). Mask r-cnn. In <i>Proceedings of the IEEE International Conference on Computer Vision (ICCV)<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/ICCV.2017.322\">10.1109\/ICCV.2017.322<\/a>.","DOI":"10.1109\/ICCV.2017.322"},{"key":"8","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. In <i>Computer Vision - ECCV 2016<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1007\/978-3-319-46493-0_38\">10.1007\/978-3-319-46493-0_38<\/a>.","DOI":"10.1007\/978-3-319-46493-0_38"},{"key":"9","doi-asserted-by":"crossref","unstructured":"Hou, R., Chen, C., and Shah, M. (2017). An end-to-end 3d convolutional neural network for action detection and segmentation in videos. 10.48550\/ARXIV.1712.01111.","DOI":"10.1109\/ICCV.2017.620"},{"key":"10","unstructured":"Jabri, A., Owens, A., and Efros, A. A. (2020). Space-time correspondence as a contrastive random walk. <i>Advances in Neural Information Processing Systems<\/i>."},{"key":"11","doi-asserted-by":"crossref","unstructured":"Kim, Y., Choi, S., Lee, H., Kim, T., and Kim, C. (2020). Rpm-net: Robust pixel-level matching networks for self-supervised video object segmentation. In <i>2020 IEEE Winter Conference on Applications of Computer Vision (WACV)<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/WACV45572.2020.9093294\">10.1109\/WACV45572.2020.9093294<\/a>.","DOI":"10.1109\/WACV45572.2020.9093294"},{"key":"12","doi-asserted-by":"crossref","unstructured":"Lai, Z., Lu, E., and Xie, W. (2020). MAST: A memory-augmented self-supervised tracker. In <i>IEEE Conference on Computer Vision and Pattern Recognition<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/CVPR42600.2020.00651\">10.1109\/CVPR42600.2020.00651<\/a>.","DOI":"10.1109\/CVPR42600.2020.00651"},{"key":"13","unstructured":"Lai, Z. and Xie, W. (2019). Self-supervised learning for video correspondence flow. In <i>BMVC<\/i>."},{"key":"14","doi-asserted-by":"crossref","unstructured":"Li, M., Hu, L., Xiong, Z., Zhang, B., Pan, P., and Liu, D. (2022). Recurrent dynamic embedding for video object segmentation. In <i>Conference on Computer Vision and Pattern Recognition<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/CVPR52688.2022.00139\">10.1109\/CVPR52688.2022.00139<\/a>.","DOI":"10.1109\/CVPR52688.2022.00139"},{"key":"15","doi-asserted-by":"crossref","unstructured":"Li, R. and Liu, D. (2023). Spatial-then-temporal self-supervised learning for video correspondence. In <i>Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)<\/i>, pages 2279-2288. DOI: <a href=\"https:\/\/doi.org\/10.1109\/CVPR52729.2023.00226\">10.1109\/CVPR52729.2023.00226<\/a>.","DOI":"10.1109\/CVPR52729.2023.00226"},{"key":"16","unstructured":"Li, X., Liu, S., De Mello, S., Wang, X., Kautz, J., and Yang, M.-H. (2019). Joint-task self-supervised learning for temporal correspondence. In <i>Advances in Neural Information Processing Systems<\/i>."},{"key":"17","doi-asserted-by":"crossref","unstructured":"Lu, X., Wang, W., Shen, J., Tai, Y., Crandall, D. J., and Hoi, S. H. (2020). Learning video object segmentation from unlabeled videos. In <i>Conference on Computer Vision and Pattern Recognition (CVPR)<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/CVPR42600.2020.00898\">10.1109\/CVPR42600.2020.00898<\/a>.","DOI":"10.1109\/CVPR42600.2020.00898"},{"key":"18","unstructured":"Mendon\u00e7a, M., Fontinele, J., and Oliveira, L. (2023). SHLS: Superfeatures learned from still images for self-supervised vos. In <i>34th British Machine Vision Conference BMVC, Aberdeen, UK<\/i>."},{"key":"19","doi-asserted-by":"crossref","unstructured":"Mendon\u00e7a, M. and Oliveira, L. (2018). ISEC: Iterative over-segmentation via edge clustering. <i>Image and Vision Computing<\/i>, 80:45-57. DOI: <a href=\"https:\/\/doi.org\/10.1016\/j.imavis.2018.09.015\">10.1016\/j.imavis.2018.09.015<\/a>.","DOI":"10.1016\/j.imavis.2018.09.015"},{"key":"20","doi-asserted-by":"crossref","unstructured":"Miao, B., Bennamoun, M., Gao, Y., and Mian, A. (2022). Self-supervised video object segmentation by motion-aware mask propagation. In <i>International Conference on Multimedia and Expo (ICME)<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/ICME52920.2022.9859966\">10.1109\/ICME52920.2022.9859966<\/a>.","DOI":"10.1109\/ICME52920.2022.9859966"},{"key":"21","unstructured":"Nguyen, D. T., Dax, M., Mummadi, C. K., Ngo, T. P. N., Nguyen, T. H. P., Lou, Z., and Brox, T. (2019). <i>DeepUSPS: Deep Robust Unsupervised Saliency Prediction with Self-Supervision<\/i>. Curran Associates Inc."},{"key":"22","doi-asserted-by":"crossref","unstructured":"Oh, S. W., Lee, J.-Y., Sunkavalli, K., and Kim, S. J. (2018). Fast video object segmentation by reference-guided mask propagation. In <i>Conference on Computer Vision and Pattern Recognition<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/CVPR.2018.00770\">10.1109\/CVPR.2018.00770<\/a>.","DOI":"10.1109\/CVPR.2018.00770"},{"key":"23","doi-asserted-by":"crossref","unstructured":"Oh, S. W., Lee, J.-Y., Xu, N., and Kim, S. J. (2019). Video object segmentation using space-time memory networks. In <i>Proceedings of the International Conference on Computer Vision (ICCV)<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/ICCV.2019.00932\">10.1109\/ICCV.2019.00932<\/a>.","DOI":"10.1109\/ICCV.2019.00932"},{"key":"24","unstructured":"Pont-Tuset, J., Perazzi, F., Caelles, S., Arbel\u00e1ez, P., Sorkine-Hornung, A., and Van Gool, L. (2017). The 2017 davis challenge on video object segmentation. <i>arXiv:1704.00675<\/i>. 10.48550\/arXiv.1704.00675."},{"key":"25","doi-asserted-by":"crossref","unstructured":"Tokmakov, P., Alahari, K., and Schmid, C. (2017). Learning video object segmentation with visual memory. In <i>International Conference on Computer Vision (ICCV)<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/ICCV.2017.480\">10.1109\/ICCV.2017.480<\/a>.","DOI":"10.1109\/ICCV.2017.480"},{"key":"26","doi-asserted-by":"crossref","unstructured":"Ventura, C., Bellver, M., Girbau, A., Salvador, A., Marques, F., and Giro-i Nieto, X. (2019). Rvos: End-to-end recurrent network for video object segmentation. In <i>The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/CVPR.2019.00542\">10.1109\/CVPR.2019.00542<\/a>.","DOI":"10.1109\/CVPR.2019.00542"},{"key":"27","doi-asserted-by":"crossref","unstructured":"Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., and Murphy, K. (2018). Tracking emerges by colorizing videos. In <i>Computer Vision \u2013 ECCV 2018: 15th European Conference<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1007\/978-3-030-01261-8_24\">10.1007\/978-3-030-01261-8_24<\/a>.","DOI":"10.1007\/978-3-030-01261-8_24"},{"key":"28","doi-asserted-by":"crossref","unstructured":"Wang, X., Jabri, A., and Efros, A. A. (2019). Learning correspondence from the cycle-consistency of time. In <i>CVPR<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1109\/CVPR.2019.00267\">10.1109\/CVPR.2019.00267<\/a>.","DOI":"10.1109\/CVPR.2019.00267"},{"key":"29","doi-asserted-by":"crossref","unstructured":"Xu, N., Yang, L., Fan, Y., Yang, J., Yue, D., Liang, Y., Price, B., Cohen, S., and Huang, T. (2018). Youtube-vos: Sequence-to-sequence video object segmentation. In <i>Computer Vision \u2013 ECCV 2018 - 15th European Conference, 2018, Proceedings<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1007\/978-3-030-01228-1_36\">10.1007\/978-3-030-01228-1_36<\/a>.","DOI":"10.1007\/978-3-030-01228-1_36"},{"key":"30","doi-asserted-by":"crossref","unstructured":"Xu, X., Wang, J., Li, X., and Lu, Y. (2022). Reliable propagation-correction modulation for video object segmentation. <i>Proceedings of the AAAI Conference on Artificial Intelligence<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1609\/aaai.v36i3.20200\">10.1609\/aaai.v36i3.20200<\/a>.","DOI":"10.1609\/aaai.v36i3.20200"},{"key":"31","doi-asserted-by":"crossref","unstructured":"Yang, Z., Wei, Y., and Yang, Y. (2020). Collaborative video object segmentation by foreground-background integration. In <i>Computer Vision \u2013 ECCV 2020: 16th European Conference<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1007\/978-3-030-58558-7_20\">10.1007\/978-3-030-58558-7_20<\/a>.","DOI":"10.1007\/978-3-030-58558-7_20"},{"key":"32","doi-asserted-by":"crossref","unstructured":"Zhu, W., Meng, J., and Xu, L. (2021). Self-supervised video object segmentation using integration-augmented attention. <i>Neurocomput.<\/i>. DOI: <a href=\"https:\/\/doi.org\/10.1016\/j.neucom.2021.04.090\">10.1016\/j.neucom.2021.04.090<\/a>.","DOI":"10.1016\/j.neucom.2021.04.090"}],"container-title":["Journal of the Brazilian Computer Society"],"original-title":[],"link":[{"URL":"https:\/\/journals-sol.sbc.org.br\/index.php\/jbcs\/article\/download\/5904\/3855","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals-sol.sbc.org.br\/index.php\/jbcs\/article\/download\/5904\/3855","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T13:40:03Z","timestamp":1774532403000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals-sol.sbc.org.br\/index.php\/jbcs\/article\/view\/5904"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,15]]},"references-count":32,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,1,20]]}},"URL":"https:\/\/doi.org\/10.5753\/jbcs.2026.5904","relation":{},"ISSN":["1678-4804"],"issn-type":[{"value":"1678-4804","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,15]]}}}