{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,23]],"date-time":"2026-05-23T09:08:27Z","timestamp":1779527307951,"version":"3.53.1"},"reference-count":75,"publisher":"Springer Science and Business Media LLC","issue":"5","license":[{"start":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T00:00:00Z","timestamp":1775520000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T00:00:00Z","timestamp":1775520000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2026,5]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Video Object Segmentation (VOS) is a key component in computer vision applications, including surveillance, autonomous driving, and robotics. However, existing VOS models often struggle with generalization to new videos with complex, topologically transforming deformable objects (e.g.\u00a0cooking, assembling, state change), degraded environments and long video sequences, resulting in tracking drift, low recall and memory saturation. 
We developed\n                    <jats:bold>Mu<\/jats:bold>\n                    ltiple object VOS and tracking\n                    <jats:bold>S<\/jats:bold>\n                    mart\n                    <jats:bold>Mem<\/jats:bold>\n                    ory architecture (MuSMem), a generalizable approach that incorporates three key innovations: (i) fusing SAM with High-Quality masks alongside appearance-based candidate selection to refine coarse segmentation masks, resulting in improved object boundaries; (ii) dynamic smart memory that manages a history of key frames based on a novel\n                    <jats:italic>information preserving gain<\/jats:italic>\n                    , combined with relevance and freshness spatio-temporal criteria; and (iii) exploratory use of monocular depth maps for occlusion robustness. MuSMem significantly reduces memory usage, reduces drift, tracks complex object topological changes and improves long-term prediction performance. MuSMem can be integrated with Vision-Language Models (VLMs) for zero-shot generalization to unseen visual domains. 
Experiments using VOS benchmark datasets show that MuSMem ranks first on VOTSt-2024, Long Video Dataset and LVOS, and second on VOTS-2024, demonstrating the best generalizability and state-of-the-art performance across single-, multi-, and complex VOS tasks.\n                  <\/jats:p>","DOI":"10.1007\/s11263-026-02742-1","type":"journal-article","created":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T03:13:51Z","timestamp":1775531631000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Domain Generalization for Multiple Video Object Segmentation and Tracking Using Transformers and Smart Memory"],"prefix":"10.1007","volume":"134","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-0861-0552","authenticated-orcid":false,"given":"Elham Soltani","family":"Kazemi","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Imad Eddine","family":"Toubal","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Gani","family":"Rahmon","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Juan","family":"Mogollon","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kannappan","family":"Palaniappan","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2026,4,7]]},"reference":[{"key":"2742_CR1","doi-asserted-by":"crossref","unstructured":"Al-Shakarji, N., Gao, K., Bunyak, F., Aliakbarpour, H., Blasch, E., Narayaran, P., Seetharaman, G., & Palaniappan, K. (2021). Impact of Georegistration Accuracy on Wide Area Motion Imagery Object Detection and Tracking. In: 24th IEEE Int. Conf. 
on Information Fusion, 1\u20138.","DOI":"10.23919\/FUSION49465.2021.9626982"},{"issue":"8","key":"2742_CR2","doi-asserted-by":"publisher","first-page":"4618","DOI":"10.1109\/TGRS.2017.2695172","volume":"55","author":"H Aliakbarpour","year":"2017","unstructured":"Aliakbarpour, H., Palaniappan, K., & Seetharaman, G. (2017). Parallax-tolerant aerial image georegistration and efficient camera pose refinement - Without piecewise homographies. IEEE Transactions on Geoscience and Remote Sensing,55(8), 4618\u20134637.","journal-title":"IEEE Transactions on Geoscience and Remote Sensing"},{"key":"2742_CR3","doi-asserted-by":"crossref","unstructured":"Athar, A., Luiten, J., Hermans, A., Ramanan, D., & Leibe, B. (2022). Hodor: High-level object descriptors for object re-segmentation in video learned from static images. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 3022\u20133031.","DOI":"10.1109\/CVPR52688.2022.00303"},{"key":"2742_CR4","doi-asserted-by":"crossref","unstructured":"Bekuzarov, M., Bermudez, A., Lee, J.-Y., & Li, H. (2023). XMem++: Production-level Video Segmentation From Few Annotated Frames. In: IEEE Int. Conf. Computer Vision, 635\u2013644","DOI":"10.1109\/ICCV51070.2023.00065"},{"issue":"4","key":"2742_CR5","doi-asserted-by":"publisher","first-page":"20","DOI":"10.4304\/jmm.2.4.20-33","volume":"2","author":"F Bunyak","year":"2007","unstructured":"Bunyak, F., Palaniappan, K., Nath, S., et al. (2007). Flux tensor constrained geodesic active contours with sensor fusion for persistent object tracking. Journal of Multimedia,2(4), 20.","journal-title":"Journal of Multimedia"},{"key":"2742_CR6","doi-asserted-by":"crossref","unstructured":"Bunyak, F., Palaniappan, K., Nath, S.K., & Seetharaman, G. (2007b). Geodesic active contour based fusion of visible and infrared video for persistent object tracking. 
In: IEEE Workshop on Applications of Computer Vision (WACV), 35\u201335.","DOI":"10.1109\/WACV.2007.26"},{"key":"2742_CR7","doi-asserted-by":"crossref","unstructured":"Caron, M., Touvron, H., Misra, I., Jegou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In: IEEE Int. Conf. Computer Vision, 9650\u20139660.","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"2742_CR8","first-page":"640","volume-title":"European Conf","author":"HK Cheng","year":"2022","unstructured":"Cheng, H. K., & Schwing, A. G. (2022). XMem: Long-term video object segmentation with an Atkinson-Shiffrin memory model. European Conf (pp. 640\u2013658). Computer Vision: Springer."},{"key":"2742_CR9","doi-asserted-by":"crossref","unstructured":"Cheng, H.K., Tai, Y.W., & Tang, C.K. (2021a). Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion. In: IEEE Conf. Computer Vision and Pattern Recognition, pp 5559\u20135568","DOI":"10.1109\/CVPR46437.2021.00551"},{"key":"2742_CR10","first-page":"11781","volume":"34","author":"HK Cheng","year":"2021","unstructured":"Cheng, H. K., Tai, Y. W., & Tang, C. K. (2021). Rethinking space-time networks with improved memory coverage for efficient video object segmentation. Advances in Neural Information Processing Systems,34, 11781\u201311794.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2742_CR11","doi-asserted-by":"crossref","unstructured":"Cheng, H.K., Oh, S.W., Price, B., Schwing, A., & Lee, J.-Y. (2023). Tracking anything with decoupled video segmentation. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, 1316\u20131326.","DOI":"10.1109\/ICCV51070.2023.00127"},{"key":"2742_CR12","doi-asserted-by":"crossref","unstructured":"Cheng, H.K., Oh, S.W., Price, B., Lee, J.-Y., & Schwing, A. (2024). Putting the object back into video object segmentation. In: IEEE Conf. 
Computer Vision and Pattern Recognition, 3151\u20133161.","DOI":"10.1109\/CVPR52733.2024.00304"},{"key":"2742_CR13","doi-asserted-by":"crossref","unstructured":"Ding, H., Liu, C., He, S., Jiang, X., Torr, P.H.S., & Bai, S. (2023). MOSE: A new dataset for video object segmentation in complex scenes. In: Proceedings of the IEEE\/CVF international conference on computer vision, 20224\u201320234.","DOI":"10.1109\/ICCV51070.2023.01850"},{"key":"2742_CR14","doi-asserted-by":"crossref","unstructured":"Ding, S., Qian, R., Dong, X., Zhang, P., Zang, Y., Cao, Y., Guo, Y., Lin, D., & Wang, J. (2024). Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree. arXiv:2410.16268","DOI":"10.1109\/ICCV51701.2025.01264"},{"key":"2742_CR15","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: International Conference on Learning Representations, p 21pp, https:\/\/openreview.net\/forum?id=YicbFdNTTy."},{"issue":"8","key":"2742_CR16","doi-asserted-by":"publisher","first-page":"1441","DOI":"10.1109\/LGRS.2020.3000762","volume":"18","author":"K Gao","year":"2020","unstructured":"Gao, K., Aliakbarpour, H., Seetharaman, G., et al. (2020). DCT-based local descriptor for robust matching and feature tracking in wide area motion imagery. IEEE Geoscience and Remote Sensing Letters,18(8), 1441\u20131445.","journal-title":"IEEE Geoscience and Remote Sensing Letters"},{"key":"2742_CR17","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., & Girshick, R. (2017). Mask r-cnn. In: Proc. Int. Conf. on Computer Vision, pp 2961 \u2013 2969.","DOI":"10.1109\/ICCV.2017.322"},{"key":"2742_CR18","doi-asserted-by":"crossref","unstructured":"Hong, L., Chen, W., Liu, Z., Zhang, W., Guo, P., Chen, Z., & Zhang, W. (2023). 
Lvos: A benchmark for long-term video object segmentation. In: IEEE Conf. Computer Vision and Pattern Recognition, 13480 \u2013 13492.","DOI":"10.1109\/ICCV51070.2023.01240"},{"issue":"1\u20133","key":"2742_CR19","first-page":"185","volume":"17","author":"BK Horn","year":"1981","unstructured":"Horn, B. K., & Schunck, B. G. (1981). Determining optical flow. Artificial intelligence,17(1\u20133), 185\u2013203.","journal-title":"Determining optical flow. Artificial intelligence"},{"key":"2742_CR20","first-page":"19545","volume":"33","author":"A Jabri","year":"2020","unstructured":"Jabri, A., Owens, A., & Efros, A. (2020). Space-time correspondence as a contrastive random walk. Advances in neural information processing systems,33, 19545\u201319560.","journal-title":"Advances in neural information processing systems"},{"key":"2742_CR21","unstructured":"Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q.V., Sung, Y., Li, Z., & Duerig, T. (2021). Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In: Int. Conf. Machine Learning (PMLR), arxiv:2102.05918."},{"key":"2742_CR22","unstructured":"Jordi, P., Federico, P., Sergi, C., Pablo, A., Alexander, S., & Luc, V.G. (2017). The 2017 DAVIS Challenge on Video Object Segmentation. In: IEEE Proc. Conf. Computer Vision and Pattern Recognition (CVPR) Workshop."},{"key":"2742_CR23","first-page":"29914","volume":"36","author":"L Ke","year":"2023","unstructured":"Ke, L., Ye, M., Danelljan, M., et al. (2023). Segment anything in high quality. Advances in Neural Information Processing Systems,36, 29914\u201329934.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2742_CR24","doi-asserted-by":"crossref","unstructured":"Kirillov, A., Mintun, E., Ravi, N., et\u00a0al. (2023). Segment anything. In: Proc. Int. Conf. 
on Computer Vision, 4015\u20134026.","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"2742_CR25","doi-asserted-by":"crossref","unstructured":"Kristan, M., Matas, J., Danelljan, M., et\u00a0al. (2023). The first visual object tracking segmentation vots2023 challenge results. In: IEEE Proc. Int. Conf. on Computer Vision (ICCV), pp 1796\u20131818.","DOI":"10.1109\/ICCVW60793.2023.00195"},{"key":"2742_CR26","unstructured":"Lawrence, N., & Hyvarinen, A. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research 6(11)."},{"key":"2742_CR27","doi-asserted-by":"crossref","unstructured":"Li, M., Hu, L., Xiong, Z., Zhang, B., Pan, P., & Liu, D. (2022). Recurrent dynamic embedding for video object segmentation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 1332\u20131341.","DOI":"10.1109\/CVPR52688.2022.00139"},{"key":"2742_CR28","first-page":"3430","volume":"33","author":"Y Liang","year":"2020","unstructured":"Liang, Y., Li, X., Jafari, N., et al. (2020). Video object segmentation with adaptive feature bank and uncertain-region refinement. Advances in Neural Information Processing Systems,33, 3430\u20133441.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2742_CR29","doi-asserted-by":"crossref","unstructured":"Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., & Zhang, L. (2024a). Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In: Proc. European Conf. on Computer Vision (ECCV), arxiv:2303.05499","DOI":"10.1007\/978-3-031-72970-6_3"},{"key":"2742_CR30","doi-asserted-by":"crossref","unstructured":"Liu, S., Zeng, Z., Ren, T., et\u00a0al. (2024b). Grounding dino: Marrying dino with grounded pre-training for open-set object detection. 
In: European conference on computer vision, Springer, 38\u201355.","DOI":"10.1007\/978-3-031-72970-6_3"},{"issue":"4","key":"2742_CR31","doi-asserted-by":"publisher","first-page":"3283","DOI":"10.1109\/LRA.2024.3366013","volume":"9","author":"A Maalouf","year":"2024","unstructured":"Maalouf, A., Jadhav, N., Jatavallabhula, K. M., et al. (2024). Follow Anything: Open-set detection, tracking, and following in real-time. IEEE Robotics and Automation Letters,9(4), 3283\u20133290.","journal-title":"IEEE Robotics and Automation Letters"},{"key":"2742_CR32","first-page":"1","volume-title":"IEEE Conf","author":"V Mahadevan","year":"2008","unstructured":"Mahadevan, V., & Vasconcelos, N. (2008). Background subtraction in highly dynamic scenes. IEEE Conf (pp. 1\u20136). Computer Vision and Pattern Recognition: IEEE."},{"key":"2742_CR33","unstructured":"Maninis, K.-K., Chen, K., Ghosh, S., Karpur, A., Chen, K., Xia, Y., Cao, B., Salz, D., Han, G., Dlabal, J., Gnanapragasam, D., Seyedhosseini, M., Zhou, H., & Araujo, A. (2025). TIPS: Text-Image Pretraining with Spatial awareness. In: International Conference on Learning Representations (ICLR), arxiv:2410.16512."},{"issue":"7","key":"2742_CR34","doi-asserted-by":"publisher","first-page":"1010","DOI":"10.1038\/s41592-023-01879-y","volume":"20","author":"M Martin","year":"2023","unstructured":"Martin, M., Ulman, V., Pablo, D., et al. (2023). The cell tracking challenge: 10 years of objective benchmarking. Nature Methods,20(7), 1010\u20131020.","journal-title":"Nature Methods"},{"issue":"2","key":"2742_CR35","doi-asserted-by":"publisher","first-page":"228","DOI":"10.1109\/34.908974","volume":"23","author":"AM Martinez","year":"2001","unstructured":"Martinez, A. M., & Kak, A. C. (2001). PCA versus LDA. IEEE Trans on Pattern Analysis and Machine Intelligence,23(2), 228\u2013233.","journal-title":"IEEE Trans on Pattern Analysis and Machine Intelligence"},{"key":"2742_CR36","unstructured":"Matej, K., Jiri, M., Pavel, T., et\u00a0al. 
(2024). The second visual object tracking segmentation VOTS2024 challenge results. In: European Conf. Computer Vision (ECCV)."},{"key":"2742_CR37","doi-asserted-by":"crossref","unstructured":"Oh, S.W., Lee, J.-Y., Xu, N., & Kim, S.J. (2019). Video object segmentation using space-time memory networks. In: IEEE Proc. Int. Conf. on Computer Vision (ICCV), 9226\u20139235.","DOI":"10.1109\/ICCV.2019.00932"},{"key":"2742_CR38","doi-asserted-by":"crossref","unstructured":"Palaniappan, K., Rao, R.M., & Seetharaman, G. (2011). Wide-area persistent airborne video: Architecture and challenges. Distributed Video Sensor Networks 349\u2013371.","DOI":"10.1007\/978-0-85729-127-1_24"},{"key":"2742_CR39","unstructured":"Pelapur, R., Candemir, S., Bunyak, F., Poostchi, M., Seetharaman, G., & Palaniappan, K. (2012). Persistent target tracking using likelihood fusion in wide-area and full motion video sequences. In: 15th International Conference on Information Fusion, pp 2420\u20132427"},{"issue":"5","key":"2742_CR40","doi-asserted-by":"publisher","first-page":"3868","DOI":"10.1109\/TAES.2020.2982340","volume":"56","author":"J Peng","year":"2020","unstructured":"Peng, J., & Aved, A. J. (2020). Regularized information preserving projections. IEEE Transactions on Aerospace and Electronic Systems,56(5), 3868\u20133877.","journal-title":"IEEE Transactions on Aerospace and Electronic Systems"},{"key":"2742_CR41","doi-asserted-by":"crossref","unstructured":"Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., & Sorkine-Hornung, A. (2016). A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation. In: IEEE Proc. Computer Vision and Pattern Recognition.","DOI":"10.1109\/CVPR.2016.85"},{"key":"2742_CR42","doi-asserted-by":"crossref","unstructured":"Poostchi, M., Aliakbarpour, H., Viguier, R., Bunyak, F., Palaniappan, K., & Seetharaman, G. (2016). Semantic Depth Map Fusion for Moving Vehicle Detection in Aerial Video. In: IEEE Conf. 
on Computer Vision and Pattern Recognition Workshops, pp 1575\u20131583.","DOI":"10.1109\/CVPRW.2016.196"},{"key":"2742_CR43","first-page":"8748","volume-title":"Int","author":"A Radford","year":"2021","unstructured":"Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. Int (pp. 8748\u20138763). PMLR: Conf. on Machine Learning."},{"issue":"3","key":"2742_CR44","doi-asserted-by":"publisher","first-page":"776","DOI":"10.1007\/s11263-023-01910-x","volume":"132","author":"G Rahmon","year":"2024","unstructured":"Rahmon, G., Palaniappan, K., Toubal, I. E., et al. (2024). DeepFTSG: Multi-stream asymmetric use-net trellis encoders with shared decoder feature fusion architecture for video motion segmentation. International Journal of Computer Vision,132(3), 776\u2013804.","journal-title":"International Journal of Computer Vision"},{"key":"2742_CR45","unstructured":"Ravi, N., Gabeur, V., Hu, Y.T., et\u00a0al. (2024a). Sam 2: Segment anything in images and videos. arXiv:2408.00714"},{"key":"2742_CR46","unstructured":"Ravi, N., Gabeur, V., Hu, Y.T., et\u00a0al. (2024b). Sam 2: Segment anything in images and videos. arXiv:2408.00714"},{"key":"2742_CR47","unstructured":"Ren, T., Liu, S., Zeng, A., et\u00a0al. (2024). Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159"},{"key":"2742_CR48","doi-asserted-by":"crossref","unstructured":"Robinson, A., Lawin, F.J., Danelljan, M., Khan, F.S., & Felsberg, M. (2020). Learning fast and robust target models for video object segmentation. In: IEEE Conf. Computer Vision and Pattern Recognition, 7406\u20137415.","DOI":"10.1109\/CVPR42600.2020.00743"},{"key":"2742_CR49","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. 
In: Medical image computing and computer-assisted intervention\u2013MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, Springer, pp 234 \u2013 241","DOI":"10.1007\/978-3-319-24574-4_28"},{"issue":"1","key":"2742_CR50","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1016\/j.patrec.2004.07.013","volume":"26","author":"B Shoushtarian","year":"2005","unstructured":"Shoushtarian, B., & Bez, H. E. (2005). A practical adaptive approach for dynamic background subtraction using an invariant colour model and object tracking. Pattern Recognition Letters,26(1), 5\u201326.","journal-title":"Pattern Recognition Letters"},{"key":"2742_CR51","doi-asserted-by":"crossref","unstructured":"Siam, M., Kendall, A., & Jagersand, M. (2021) .Video class agnostic segmentation benchmark for autonomous driving. In: IEEE Conf. Computer Vision and Pattern Recognition, pp 2825 \u2013 2834.","DOI":"10.1109\/CVPRW53098.2021.00317"},{"key":"2742_CR52","doi-asserted-by":"crossref","unstructured":"Tokmakov, P., Li, J., Gaidon, A. (2023). Breaking the Object in Video Object Segmentation. In: IEEE Conf. Computer Vision and Pattern Recognition, pp 22836\u201322845.","DOI":"10.1109\/CVPR52729.2023.02187"},{"key":"2742_CR53","doi-asserted-by":"crossref","unstructured":"Toubal, I.E., Al-Shakarji, N., Cornelison, D.D.W., & Palaniappan, K (2023). Ensemble deep learning object detection fusion for cell tracking, mitosis, and lineage. IEEE Open Journal of Engineering in Medicine and Biology.","DOI":"10.1109\/OJEMB.2023.3288470"},{"key":"2742_CR54","doi-asserted-by":"crossref","unstructured":"Toubal, I.E., Avinash, A., Alldrin, N.G., et\u00a0al. (2024). Modeling collaborator: Enabling subjective vision classification with minimal human effort via llm tool-use. In: IEEE Conf. 
Computer Vision and Pattern Recognition, pp 17553 \u2013 17563.","DOI":"10.1109\/CVPR52733.2024.01662"},{"key":"2742_CR55","doi-asserted-by":"crossref","unstructured":"Tumanyan, N., Singer, A., Bagon, S., & Dekel, T. (2025). Dino-tracker: Taming dino for self-supervised point tracking in a single video. In: European Conference on Computer Vision, Springer, pp 367\u2013385.","DOI":"10.1007\/978-3-031-73347-5_21"},{"key":"2742_CR56","unstructured":"Vaswani, A. (2017). Attention is all you need. Advances in Neural Information Processing Systems."},{"key":"2742_CR57","doi-asserted-by":"crossref","unstructured":"Wang, R., Bunyak, F., Seetharaman, G., & Palaniappan, K. (2014). Static and moving object detection using flux tensor with split gaussian models. IEEE Conf Computer Vision and Pattern Recognition Workshops pp 420\u2013424.","DOI":"10.1109\/CVPRW.2014.68"},{"key":"2742_CR58","doi-asserted-by":"crossref","unstructured":"Wang, Y., Ahsan, U., Li, H., & Hagen, M. (2022). A Comprehensive Review of Modern Object Segmentation Approaches.","DOI":"10.1561\/9781638280712"},{"key":"2742_CR59","volume-title":"Gaussian processes for machine learning","author":"CK Williams","year":"2006","unstructured":"Williams, C. K., & Rasmussen, C. E. (2006). Gaussian processes for machine learning (Vol. 2). MA: MIT press Cambridge."},{"issue":"2","key":"2742_CR60","doi-asserted-by":"publisher","first-page":"270","DOI":"10.1162\/neco.1989.1.2.270","volume":"1","author":"RJ Williams","year":"1989","unstructured":"Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural computation,1(2), 270\u2013280.","journal-title":"Neural computation"},{"key":"2742_CR61","doi-asserted-by":"crossref","unstructured":"Wu, Q., Yang, T., Wu, W., & Chan, A.B. (2023). Scalable Video Object Segmentation with Simplified Framework. In: IEEE Proc. Int. Conf. 
on Computer Vision (ICCV), 13879\u201313889.","DOI":"10.1109\/ICCV51070.2023.01276"},{"key":"2742_CR62","doi-asserted-by":"crossref","unstructured":"Xu, N., Yang, L., Fan, Y., Yang, J., Yue, D., Liang, Y., Price, B., Cohen, S., & Huang, T. (2018a). YouTube-VOS: Sequence-to-Sequence Video Object Segmentation. In: Computer Vision \u2013 ECCV, pp 603\u2013619.","DOI":"10.1007\/978-3-030-01228-1_36"},{"key":"2742_CR63","unstructured":"Xu, N., Yang, L., Fan, Y., Yue, D., Liang, Y., Yang, J., & Huang, T.S. (2018b). YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark. CoRR abs\/1809.03327. http:\/\/arxiv.org\/abs\/1809.03327"},{"key":"2742_CR64","doi-asserted-by":"crossref","unstructured":"Yan, B., Peng, H., Fu, J., Wang, D., & Lu, H. (2021). Learning spatio-temporal transformer for visual tracking. In: IEEE Proc. Int. Conf. on Computer Vision (ICCV), 10448\u201310457.","DOI":"10.1109\/ICCV48922.2021.01028"},{"key":"2742_CR65","doi-asserted-by":"crossref","unstructured":"Yang, L., Wang, Y., Xiong, X., Yang, J., & Katsaggelos, A.K. (2018). Efficient video object segmentation via network modulation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6499\u20136507.","DOI":"10.1109\/CVPR.2018.00680"},{"key":"2742_CR66","doi-asserted-by":"crossref","unstructured":"Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., & Zhao, H. (2024a). Depth Anything V2. In: Globerson A, Mackey L, Belgrave D, et\u00a0al (eds) Advances in Neural Information Processing Systems, vol\u00a037. Curran Associates, Inc., pp 21875\u201321911.","DOI":"10.52202\/079017-0688"},{"key":"2742_CR67","doi-asserted-by":"crossref","unstructured":"Yang, Z., & Yang, Y. (2022). Decoupling Features in Hierarchical Propagation for Video Object Segmentation. In: Advances in Neural Information Processing Systems (NeurIPS).","DOI":"10.52202\/068431-2632"},{"key":"2742_CR68","doi-asserted-by":"crossref","unstructured":"Yang, Z., Wei, Y., & Yang, Y. (2020). 
Collaborative video object segmentation by foreground-background integration. In: European Conference on Computer Vision, Springer, 332\u2013348.","DOI":"10.1007\/978-3-030-58558-7_20"},{"key":"2742_CR69","unstructured":"Yang, Z., Wei, Y., & Yang, Y. (2021). Associating objects with transformers for video object segmentation. Advances in Neural Information Processing Systems (NeurIPS),34, 2491\u20132502."},{"issue":"9","key":"2742_CR70","first-page":"4701","volume":"44","author":"Z Yang","year":"2021","unstructured":"Yang, Z., Wei, Y., & Yang, Y. (2021). Collaborative video object segmentation by multi-scale foreground-background integration. IEEE Transactions on Pattern Analysis and Machine Intelligence,44(9), 4701\u20134712.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"2742_CR71","doi-asserted-by":"crossref","unstructured":"Yang, Z., Miao, J., Wei, Y., Wang, W., Wang, X., Yang, Y. (2024b). Scalable video object segmentation with identification mechanism. IEEE Transactions on Pattern Analysis and Machine Intelligence.","DOI":"10.1109\/TPAMI.2024.3383592"},{"key":"2742_CR72","doi-asserted-by":"crossref","unstructured":"Ye, B., Chang, H., Ma, B., Shan, S., & Chen, X. (2022). Joint feature learning and relation modeling for tracking: A one-stream framework. In: European Conference on Computer Vision, Springer, 341\u2013357.","DOI":"10.1007\/978-3-031-20047-2_20"},{"key":"2742_CR73","doi-asserted-by":"crossref","unstructured":"Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. arXiv:1605.07146","DOI":"10.5244\/C.30.87"},{"key":"2742_CR74","doi-asserted-by":"crossref","unstructured":"Zhou, J., Pang, Z., & Wang, Y.X. (2024). RMem: Restricted Memory Banks Improve Video Object Segmentation. In: IEEE Conf. 
Computer Vision and Pattern Recognition (CVPR), 18602\u201318611.","DOI":"10.1109\/CVPR52733.2024.01760"},{"key":"2742_CR75","unstructured":"Zhu, J., Chen, Z., Hao, Z., Chang, S., Zhang, L., Wang, D., Lu, H., Luo, B., He, J.-Y., Lan, J.-P., Chen, H. & Li, C. (2023). Tracking Anything in High Quality. arXiv:2307.13974"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-026-02742-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-026-02742-1","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-026-02742-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,23]],"date-time":"2026-05-23T08:42:28Z","timestamp":1779525748000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-026-02742-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,4,7]]},"references-count":75,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2026,5]]}},"alternative-id":["2742"],"URL":"https:\/\/doi.org\/10.1007\/s11263-026-02742-1","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,4,7]]},"assertion":[{"value":"12 April 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 January 2026","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 April 2026","order":3,"name":"first_online","label":"First 
Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"206"}}