{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T17:42:58Z","timestamp":1775324578605,"version":"3.50.1"},"reference-count":171,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,4,8]],"date-time":"2022-04-08T00:00:00Z","timestamp":1649376000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,4,8]],"date-time":"2022-04-08T00:00:00Z","timestamp":1649376000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Artif Intell Rev"],"published-print":{"date-parts":[[2023,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>As one of the fundamental problems in the field of video understanding, video object segmentation aims at segmenting objects of interest throughout the given video sequence. Recently, with the advancements of deep learning techniques, deep neural networks have shown outstanding performance improvements in many computer vision applications, with video object segmentation being one of the most advocated and intensively investigated. In this paper, we present a systematic review of the deep learning-based video segmentation literature, highlighting the pros and cons of each category of approaches. Concretely, we start by introducing the definition, background concepts and basic ideas of algorithms in this field. Subsequently, we summarise the datasets for training and testing a video object segmentation algorithm, as well as common challenges and evaluation metrics. Next, previous works are grouped and reviewed based on how they extract and use spatial and temporal features, where their architectures, contributions and the differences among each other are elaborated. 
Finally, the quantitative and qualitative results of several representative methods on a dataset with many remaining challenges are provided and analysed, followed by further discussion of future research directions. This article is expected to serve as a tutorial and source of reference for learners who intend to quickly grasp the current progress in this research area and for practitioners interested in applying video object segmentation methods to their problems. A public website has been built to collect and track the related works in this field: <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/gaomingqi\/VOS-Review\">https:\/\/github.com\/gaomingqi\/VOS-Review<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s10462-022-10176-7","type":"journal-article","created":{"date-parts":[[2022,4,8]],"date-time":"2022-04-08T06:10:56Z","timestamp":1649398256000},"page":"457-531","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":85,"title":["Deep learning for video object segmentation: a review"],"prefix":"10.1007","volume":"56","author":[{"given":"Mingqi","family":"Gao","sequence":"first","affiliation":[]},{"given":"Feng","family":"Zheng","sequence":"additional","affiliation":[]},{"given":"James J. 
Q.","family":"Yu","sequence":"additional","affiliation":[]},{"given":"Caifeng","family":"Shan","sequence":"additional","affiliation":[]},{"given":"Guiguang","family":"Ding","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4361-956X","authenticated-orcid":false,"given":"Jungong","family":"Han","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,4,8]]},"reference":[{"issue":"12","key":"10176_CR1","doi-asserted-by":"publisher","first-page":"2481","DOI":"10.1109\/TPAMI.2016.2644615","volume":"39","author":"V Badrinarayanan","year":"2017","unstructured":"Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: a deep convolutional encoder\u2013decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481\u20132495","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"10176_CR2","unstructured":"Ballas N, Yao L, Pal C, Courville AC (2016) Delving deeper into convolutional networks for learning video representations. In: Proceedings of the International Conference on Learning Representations"},{"key":"10176_CR3","doi-asserted-by":"crossref","unstructured":"Bao L, Wu B, Liu W (2018) Cnn in mrf: Video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5977\u20135986","DOI":"10.1109\/CVPR.2018.00626"},{"key":"10176_CR4","doi-asserted-by":"crossref","unstructured":"Bertinetto L, Valmadre J, Henriques JF, Vedaldi A, Torr PH (2016) Fully-convolutional siamese networks for object tracking. In: Proceedings of the European Conference on Computer Vision, Springer, pp 850\u2013865","DOI":"10.1007\/978-3-319-48881-3_56"},{"key":"10176_CR5","doi-asserted-by":"crossref","unstructured":"Bhat G, Lawin FJ, Danelljan M, Robinson A, Felsberg M, Van\u00a0Gool L, Timofte R (2020) Learning what to learn for video object segmentation. 
In: Proceedings of the European Conference on Computer Vision, Springer, pp 777\u2013794","DOI":"10.1007\/978-3-030-58536-5_46"},{"issue":"3","key":"10176_CR6","doi-asserted-by":"publisher","first-page":"500","DOI":"10.1109\/TPAMI.2010.143","volume":"33","author":"T Brox","year":"2010","unstructured":"Brox T, Malik J (2010) Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans Pattern Anal Mach Intell 33(3):500\u2013513","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"10176_CR7","doi-asserted-by":"crossref","unstructured":"Brox T, Malik J (2010b) Object segmentation by long term analysis of point trajectories. In: Proceedings of the European Conference on Computer Vision, Springer, pp 282\u2013295","DOI":"10.1007\/978-3-642-15555-0_21"},{"key":"10176_CR8","doi-asserted-by":"crossref","unstructured":"Caelles S, Maninis KK, Pont-Tuset J, Leal-Taix\u00e9 L, Cremers D, Van\u00a0Gool L (2017) One-shot video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 221\u2013230","DOI":"10.1109\/CVPR.2017.565"},{"key":"10176_CR9","unstructured":"Caelles S, Pont-Tuset J, Perazzi F, Montes A, Maninis KK, Van\u00a0Gool L (2019) The 2019 davis challenge on vos: Unsupervised multi-object segmentation. arXiv preprint arXiv:190500737"},{"key":"10176_CR10","doi-asserted-by":"crossref","unstructured":"Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: Proceedings of the European Conference on Computer Vision, Springer, pp 213\u2013229","DOI":"10.1007\/978-3-030-58452-8_13"},{"issue":"2","key":"10176_CR11","doi-asserted-by":"publisher","first-page":"266","DOI":"10.1109\/83.902291","volume":"10","author":"TF Chan","year":"2001","unstructured":"Chan TF, Vese LA (2001) Active contours without edges. 
IEEE Trans Image Process 10(2):266\u2013277","journal-title":"IEEE Trans Image Process"},{"key":"10176_CR12","unstructured":"Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:170605587"},{"key":"10176_CR13","doi-asserted-by":"crossref","unstructured":"Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018a) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision, pp 801\u2013818","DOI":"10.1007\/978-3-030-01234-2_49"},{"issue":"12","key":"10176_CR14","doi-asserted-by":"publisher","first-page":"2225","DOI":"10.1109\/TMM.2015.2481711","volume":"17","author":"L Chen","year":"2015","unstructured":"Chen L, Shen J, Wang W, Ni B (2015) Video object segmentation via dense trajectories. IEEE Trans Multimedia 17(12):2225\u20132234","journal-title":"IEEE Trans Multimedia"},{"issue":"4","key":"10176_CR15","doi-asserted-by":"publisher","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","volume":"40","author":"LC Chen","year":"2017","unstructured":"Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834\u2013848","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"10176_CR16","doi-asserted-by":"crossref","unstructured":"Cheng HK, Chung J, Tai YW, Tang CK (2020) Cascadepsp: toward class-agnostic and very high-resolution segmentation via global and local refinement. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 8890\u20138899","DOI":"10.1109\/CVPR42600.2020.00891"},{"key":"10176_CR17","unstructured":"Cheng HK, Tai YW, Tang CK (2021) Rethinking space-time networks with improved memory coverage for efficient video object segmentation. 
In: Proceedings of the Advances in Neural Information Processing Systems"},{"issue":"3","key":"10176_CR18","doi-asserted-by":"publisher","first-page":"569","DOI":"10.1109\/TPAMI.2014.2345401","volume":"37","author":"MM Cheng","year":"2014","unstructured":"Cheng MM, Mitra NJ, Huang X, Torr PH, Hu SM (2014) Global contrast based salient region detection. IEEE Trans Pattern Anal Mach Intell 37(3):569\u2013582","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"10176_CR19","doi-asserted-by":"crossref","unstructured":"Cheng J, Tsai YH, Hung WC, Wang S, Yang MH (2018) Fast and accurate online video object segmentation via tracking parts. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7415\u20137424","DOI":"10.1109\/CVPR.2018.00774"},{"key":"10176_CR20","doi-asserted-by":"crossref","unstructured":"Cheng J, Tsai YH, Wang S, Yang MH (2017) Segflow: Joint learning for video object segmentation and optical flow. In: Proceedings of the IEEE International Conference on Computer Vision, pp 686\u2013695","DOI":"10.1109\/ICCV.2017.81"},{"key":"10176_CR21","doi-asserted-by":"crossref","unstructured":"Chen X, Li Z, Yuan Y, Yu G, Shen J, Qi D (2020) State-aware tracker for real-time video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9384\u20139393","DOI":"10.1109\/CVPR42600.2020.00940"},{"key":"10176_CR22","unstructured":"Chen L, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2015a) Semantic image segmentation with deep convolutional nets and fully connected crfs. In: Proceedings of the International Conference on Learning Representations"},{"key":"10176_CR23","doi-asserted-by":"crossref","unstructured":"Chen Y, Pont-Tuset J, Montes A, Van\u00a0Gool L (2018b) Blazingly fast video object segmentation with pixel-wise metric learning. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1189\u20131198","DOI":"10.1109\/CVPR.2018.00130"},{"issue":"7","key":"10176_CR24","doi-asserted-by":"publisher","first-page":"577","DOI":"10.1109\/TCSVT.2002.800516","volume":"12","author":"SY Chien","year":"2002","unstructured":"Chien SY, Ma SY, Chen LG (2002) Efficient moving object segmentation algorithm using background registration technique. IEEE Trans Circuits Syst Video Technol 12(7):577\u2013586","journal-title":"IEEE Trans Circuits Syst Video Technol"},{"key":"10176_CR25","doi-asserted-by":"crossref","unstructured":"Chockalingam P, Pradeep N, Birchfield S (2009) Adaptive fragments-based tracking of non-rigid objects using level sets. In: Proceedings of the IEEE International Conference on Computer Vision, IEEE, pp 1530\u20131537","DOI":"10.1109\/ICCV.2009.5459276"},{"key":"10176_CR26","doi-asserted-by":"crossref","unstructured":"Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1251\u20131258","DOI":"10.1109\/CVPR.2017.195"},{"key":"10176_CR27","doi-asserted-by":"crossref","unstructured":"Ci H, Wang C, Wang Y (2018) Video object segmentation by learning location-sensitive embeddings. In: Proceedings of the European Conference on Computer Vision, pp 501\u2013516","DOI":"10.1007\/978-3-030-01252-6_31"},{"issue":"10","key":"10176_CR28","doi-asserted-by":"publisher","first-page":"1337","DOI":"10.1109\/TPAMI.2003.1233909","volume":"25","author":"R Cucchiara","year":"2003","unstructured":"Cucchiara R, Grana C, Piccardi M, Prati A (2003) Detecting moving objects, ghosts, and shadows in video streams. 
IEEE Trans Pattern Anal Mach Intell 25(10):1337\u20131342","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"6","key":"10176_CR29","doi-asserted-by":"publisher","first-page":"1614","DOI":"10.1109\/TNN.2007.896861","volume":"18","author":"D Culibrk","year":"2007","unstructured":"Culibrk D, Marques O, Socek D, Kalva H, Furht B (2007) Neural network approach to background modeling for video object segmentation. IEEE Trans Neural Netw 18(6):1614\u20131627","journal-title":"IEEE Trans Neural Netw"},{"key":"10176_CR30","unstructured":"De\u00a0Vries H, Strub F, Mary J, Larochelle H, Pietquin O, Courville AC (2017) Modulating early visual processing by language. In: Proceedings of the Advances in Neural Information Processing Systems, pp 6594\u20136604"},{"key":"10176_CR31","doi-asserted-by":"crossref","unstructured":"Duarte K, Rawat YS, Shah M (2019) Capsulevos: Semi-supervised video object segmentation using capsule routing. In: Proceedings of the IEEE International Conference on Computer Vision, pp 8480\u20138489","DOI":"10.1109\/ICCV.2019.00857"},{"key":"10176_CR32","doi-asserted-by":"crossref","unstructured":"Duke B, Ahmed A, Wolf C, Aarabi P, Taylor GW (2021) Sstvos: Sparse spatiotemporal transformers for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5912\u20135921","DOI":"10.1109\/CVPR46437.2021.00585"},{"key":"10176_CR33","doi-asserted-by":"crossref","unstructured":"Endres I, Hoiem D (2010) Category independent object proposals. In: Proceedings of the European Conference on Computer Vision, Springer, pp 575\u2013588","DOI":"10.1007\/978-3-642-15555-0_42"},{"issue":"2","key":"10176_CR34","doi-asserted-by":"publisher","first-page":"303","DOI":"10.1007\/s11263-009-0275-4","volume":"88","author":"M Everingham","year":"2010","unstructured":"Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The pascal visual object classes (voc) challenge. 
Int J Comput Vis 88(2):303\u2013338","journal-title":"Int J Comput Vis"},{"issue":"1","key":"10176_CR35","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1007\/s11263-014-0733-5","volume":"111","author":"M Everingham","year":"2015","unstructured":"Everingham M, Eslami SA, Van Gool L, Williams CK, Winn J, Zisserman A (2015) The pascal visual object classes challenge: a retrospective. Int J Comput Vis 111(1):98\u2013136","journal-title":"Int J Comput Vis"},{"key":"10176_CR36","unstructured":"Everingham M, Van\u00a0Gool L, Williams C, Winn J, Zisserman A (2012) The pascal visual object classes challenge 2012 (voc2012) results (2012). In: URL http:\/\/www.pascal-network.org\/challenges\/VOC\/voc2011\/workshop\/index.html"},{"key":"10176_CR37","doi-asserted-by":"crossref","unstructured":"Faktor A, Irani M (2014) Video segmentation by non-local consensus voting. In: Proceedings of the British Machine Vision Conference, vol\u00a02, p\u00a08","DOI":"10.5244\/C.28.21"},{"key":"10176_CR38","doi-asserted-by":"crossref","unstructured":"Fan DP, Cheng MM, Liu JJ, Gao SH, Hou Q, Borji A (2018) Salient objects in clutter: Bringing salient object detection to the foreground. In: Proceedings of the European Conference on Computer Vision, pp 186\u2013202","DOI":"10.1007\/978-3-030-01267-0_12"},{"issue":"6","key":"10176_CR39","doi-asserted-by":"publisher","first-page":"195","DOI":"10.1145\/2816795.2818105","volume":"34","author":"Q Fan","year":"2015","unstructured":"Fan Q, Zhong F, Lischinski D, Cohen-Or D, Chen B (2015) Jumpcut: non-successive mask transfer and interpolation for video cutout. ACM Trans Graph 34(6):195","journal-title":"ACM Trans Graph"},{"key":"10176_CR40","doi-asserted-by":"crossref","unstructured":"Fragkiadaki K, Zhang G, Shi J (2012) Video segmentation by tracing discontinuities in a trajectory embedding. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 1846\u20131853","DOI":"10.1109\/CVPR.2012.6247883"},{"key":"10176_CR41","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1016\/j.asoc.2018.05.018","volume":"70","author":"A Garcia-Garcia","year":"2018","unstructured":"Garcia-Garcia A, Orts-Escolano S, Oprea S, Villena-Martinez V, Martinez-Gonzalez P, Garcia-Rodriguez J (2018) A survey on deep learning techniques for image and video semantic segmentation. Appl Soft Comput 70:41\u201365","journal-title":"Appl Soft Comput"},{"issue":"4","key":"10176_CR42","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3329784","volume":"52","author":"S Ghosh","year":"2019","unstructured":"Ghosh S, Das N, Das I, Maulik U (2019) Understanding deep learning techniques for image segmentation. ACM Comput Surv 52(4):1\u201335","journal-title":"ACM Comput Surv"},{"key":"10176_CR43","doi-asserted-by":"crossref","unstructured":"Girshick R (2015) Fast r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1440\u20131448","DOI":"10.1109\/ICCV.2015.169"},{"key":"10176_CR44","doi-asserted-by":"crossref","unstructured":"Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 580\u2013587","DOI":"10.1109\/CVPR.2014.81"},{"key":"10176_CR45","unstructured":"Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Proceedings of the Advances in Neural Information Processing Systems, pp 2672\u20132680"},{"key":"10176_CR46","doi-asserted-by":"crossref","unstructured":"Griffin BA, Corso JJ (2019) Bubblenets: Learning to select the guidance frame in video object segmentation by deep sorting frames. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 8914\u20138923","DOI":"10.1109\/CVPR.2019.00912"},{"key":"10176_CR47","doi-asserted-by":"crossref","unstructured":"Han J, Yang L, Zhang D, Chang X, Liang X (2018) Reinforcement cutting-agent learning for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9080\u20139089","DOI":"10.1109\/CVPR.2018.00946"},{"key":"10176_CR48","doi-asserted-by":"crossref","unstructured":"Hariharan B, Arbel\u00e1ez P, Bourdev L, Maji S, Malik J (2011) Semantic contours from inverse detectors. In: Proceedings of the IEEE International Conference on Computer Vision, IEEE, pp 991\u2013998","DOI":"10.1109\/ICCV.2011.6126343"},{"issue":"9","key":"10176_CR49","doi-asserted-by":"publisher","first-page":"1904","DOI":"10.1109\/TPAMI.2015.2389824","volume":"37","author":"K He","year":"2015","unstructured":"He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904\u20131916","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"10176_CR50","doi-asserted-by":"crossref","unstructured":"He K, Gkioxari G, Doll\u00e1r P, Girshick R (2017) Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2961\u20132969","DOI":"10.1109\/ICCV.2017.322"},{"key":"10176_CR51","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770\u2013778","DOI":"10.1109\/CVPR.2016.90"},{"key":"10176_CR52","unstructured":"Hinton GE, Sabour S, Frosst N (2018) Matrix capsules with EM routing. 
In: Proceedings of the International Conference on Learning Representations"},{"key":"10176_CR53","doi-asserted-by":"crossref","unstructured":"Hu YT, Chen HS, Hui K, Huang JB, Schwing AG (2019) Sail-vos: Semantic amodal instance level video object segmentation-a synthetic dataset and baselines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3105\u20133115","DOI":"10.1109\/CVPR.2019.00322"},{"key":"10176_CR54","unstructured":"Hu YT, Huang JB, Schwing A (2017) Maskrnn: Instance level video object segmentation. In: Proceedings of the Advances in Neural Information Processing Systems, pp 325\u2013334"},{"key":"10176_CR55","doi-asserted-by":"crossref","unstructured":"Hu YT, Huang JB, Schwing AG (2018b) Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation. In: Proceedings of the European Conference on Computer Vision, pp 786\u2013802","DOI":"10.1007\/978-3-030-01246-5_48"},{"key":"10176_CR56","doi-asserted-by":"crossref","unstructured":"Hu YT, Huang JB, Schwing AG (2018c) Videomatch: Matching based video object segmentation. In: Proceedings of the European Conference on Computer Vision, pp 54\u201370","DOI":"10.1007\/978-3-030-01237-3_4"},{"key":"10176_CR57","doi-asserted-by":"crossref","unstructured":"Huang G, Liu Z, Van Der\u00a0Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4700\u20134708","DOI":"10.1109\/CVPR.2017.243"},{"key":"10176_CR58","doi-asserted-by":"crossref","unstructured":"Hu P, Wang G, Kong X, Kuen J, Tan YP (2018a) Motion-guided cascaded refinement network for video object segmentation. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1400\u20131409","DOI":"10.1109\/CVPR.2018.00152"},{"key":"10176_CR59","doi-asserted-by":"crossref","unstructured":"Hu L, Zhang P, Zhang B, Pan P, Xu Y, Jin R (2021) Learning position and target consistency for memory-based video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4144\u20134154","DOI":"10.1109\/CVPR46437.2021.00413"},{"key":"10176_CR60","doi-asserted-by":"crossref","unstructured":"Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A, Brox T (2017) Flownet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2462\u20132470","DOI":"10.1109\/CVPR.2017.179"},{"key":"10176_CR61","doi-asserted-by":"crossref","unstructured":"Jain SD, Grauman K (2014) Supervoxel-consistent foreground propagation in video. In: Proceedings of the European Conference on Computer Vision, Springer, pp 656\u2013671","DOI":"10.1007\/978-3-319-10593-2_43"},{"key":"10176_CR62","doi-asserted-by":"crossref","unstructured":"Jain SD, Xiong B, Grauman K (2017) Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 2117\u20132126","DOI":"10.1109\/CVPR.2017.228"},{"key":"10176_CR63","doi-asserted-by":"crossref","unstructured":"Jampani V, Gadde R, Gehler PV (2017) Video propagation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 451\u2013461","DOI":"10.1109\/CVPR.2017.336"},{"key":"10176_CR64","doi-asserted-by":"crossref","unstructured":"Jampani V, Kiefel M, Gehler PV (2016) Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4452\u20134461","DOI":"10.1109\/CVPR.2016.482"},{"key":"10176_CR65","doi-asserted-by":"crossref","unstructured":"Jang WD, Kim CS (2017) Online video object segmentation via convolutional trident network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5849\u20135858","DOI":"10.1109\/CVPR.2017.790"},{"key":"10176_CR66","doi-asserted-by":"crossref","unstructured":"Johnander J, Danelljan M, Brissman E, Khan FS, Felsberg M (2019) A generative appearance model for end-to-end video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 8953\u20138962","DOI":"10.1109\/CVPR.2019.00916"},{"issue":"9","key":"10176_CR67","doi-asserted-by":"publisher","first-page":"1175","DOI":"10.1007\/s11263-019-01164-6","volume":"127","author":"A Khoreva","year":"2019","unstructured":"Khoreva A, Benenson R, Ilg E, Brox T, Schiele B (2019) Lucid data dreaming for video object segmentation. Int J Comput Vis 127(9):1175\u20131197","journal-title":"Int J Comput Vis"},{"issue":"2","key":"10176_CR68","doi-asserted-by":"publisher","first-page":"122","DOI":"10.1109\/76.988659","volume":"12","author":"C Kim","year":"2002","unstructured":"Kim C, Hwang JN (2002) Fast and automatic video object segmentation and tracking for content-based applications. IEEE Trans Circuits Syst Video Technol 12(2):122\u2013129","journal-title":"IEEE Trans Circuits Syst Video Technol"},{"key":"10176_CR69","doi-asserted-by":"crossref","unstructured":"Koh YJ, Lee YY, Kim CS (2018) Sequential clique optimization for video object segmentation. In: Proceedings of the European Conference on Computer Vision, Springer, pp 537\u2013556","DOI":"10.1007\/978-3-030-01264-9_32"},{"key":"10176_CR70","unstructured":"Kr\u00e4henb\u00fchl P, Koltun V (2011) Efficient inference in fully connected crfs with gaussian edge potentials. 
In: Proceedings of the Advances in Neural Information Processing Systems, pp 109\u2013117"},{"key":"10176_CR71","unstructured":"Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp 1097\u20131105"},{"key":"10176_CR72","unstructured":"LaLonde R, Bagci U (2018) Capsules for object segmentation. arXiv preprint arXiv:180404241"},{"key":"10176_CR73","doi-asserted-by":"crossref","unstructured":"Lee YJ, Kim J, Grauman K (2011) Key-segments for video object segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, IEEE, pp 1995\u20132002","DOI":"10.1109\/ICCV.2011.6126471"},{"key":"10176_CR74","unstructured":"Liang Y, Li X, Jafari N, Chen J (2020) Video object segmentation with adaptive feature bank and uncertain-region refinement. In: Proceedings of the Advances in Neural Information Processing Systems 33"},{"key":"10176_CR75","doi-asserted-by":"crossref","unstructured":"Li X, Change\u00a0Loy C (2018) Video object segmentation with joint re-identification and attention-aware mask propagation. In: Proceedings of the European Conference on Computer Vision, pp 90\u2013105","DOI":"10.1007\/978-3-030-01219-9_6"},{"key":"10176_CR76","doi-asserted-by":"crossref","unstructured":"Li F, Kim T, Humayun A, Tsai D, Rehg JM (2013) Video segmentation by tracking many figure-ground segments. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2192\u20132199","DOI":"10.1109\/ICCV.2013.273"},{"key":"10176_CR77","doi-asserted-by":"crossref","unstructured":"Lin TY, Doll\u00e1r P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2117\u20132125","DOI":"10.1109\/CVPR.2017.106"},{"key":"10176_CR78","doi-asserted-by":"crossref","unstructured":"Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Doll\u00e1r P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision, Springer, pp 740\u2013755","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"10176_CR79","doi-asserted-by":"crossref","unstructured":"Lin H, Qi X, Jia J (2019) Agss-vos: Attention guided single-shot video object segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 3949\u20133957","DOI":"10.1109\/ICCV.2019.00405"},{"key":"10176_CR80","doi-asserted-by":"crossref","unstructured":"Li Y, Qi H, Dai J, Ji X, Wei Y (2017c) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2359\u20132367","DOI":"10.1109\/CVPR.2017.472"},{"key":"10176_CR81","doi-asserted-by":"crossref","unstructured":"Li S, Seybold B, Vorobyov A, Fathi A, Huang Q, Jay\u00a0Kuo CC (2018b) Instance embedding transfer to unsupervised video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6526\u20136535","DOI":"10.1109\/CVPR.2018.00683"},{"key":"10176_CR82","doi-asserted-by":"crossref","unstructured":"Li S, Seybold B, Vorobyov A, Lei X, Jay\u00a0Kuo CC (2018c) Unsupervised video object segmentation with motion-based bilateral networks. In: Proceedings of the European Conference on Computer Vision, pp 207\u2013223","DOI":"10.1007\/978-3-030-01219-9_13"},{"key":"10176_CR83","doi-asserted-by":"crossref","unstructured":"Liu Y, Zhang Q, Zhang D, Han J (2019) Employing deep part-object relationships for salient object detection. 
In: Proceedings of the IEEE International Conference on Computer Vision, pp 1232\u20131241","DOI":"10.1109\/ICCV.2019.00132"},{"key":"10176_CR84","doi-asserted-by":"crossref","unstructured":"Li X, Wei T, Chen YP, Tai YW, Tang CK (2020) Fss-1000: A 1000-class dataset for few-shot segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2869\u20132878","DOI":"10.1109\/CVPR42600.2020.00294"},{"key":"10176_CR85","doi-asserted-by":"crossref","unstructured":"Li G, Xie Y, Lin L, Yu Y (2017a) Instance-level salient object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2386\u20132395","DOI":"10.1109\/CVPR.2017.34"},{"key":"10176_CR86","doi-asserted-by":"crossref","unstructured":"Li B, Yan J, Wu W, Zhu Z, Hu X (2018a) High performance visual tracking with siamese region proposal network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 8971\u20138980","DOI":"10.1109\/CVPR.2018.00935"},{"key":"10176_CR87","unstructured":"Li G, Yu Y (2015) Visual saliency based on multiscale deep features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5455\u20135463"},{"key":"10176_CR88","doi-asserted-by":"crossref","unstructured":"Li J, Zheng A, Chen X, Zhou B (2017b) Primary video object segmentation via complementary cnns and neighborhood reversible flow. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1417\u20131425","DOI":"10.1109\/ICCV.2017.158"},{"key":"10176_CR89","doi-asserted-by":"crossref","unstructured":"Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3431\u20133440","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"10176_CR90","doi-asserted-by":"crossref","unstructured":"Luiten J, Voigtlaender P, Leibe B (2018) Premvos: Proposal-generation, refinement and merging for video object segmentation. In: Proceedings of the Asian Conference on Computer Vision, pp 565\u2013580","DOI":"10.1007\/978-3-030-20870-7_35"},{"key":"10176_CR91","doi-asserted-by":"crossref","unstructured":"Luiten J, Zulfikar IE, Leibe B (2020) Unovost: Unsupervised offline video object segmentation and tracking. In: Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, pp 2000\u20132009","DOI":"10.1109\/WACV45572.2020.9093285"},{"key":"10176_CR92","doi-asserted-by":"crossref","unstructured":"Lu X, Wang W, Danelljan M, Zhou T, Shen J, Van\u00a0Gool L (2020a) Video object segmentation with episodic graph memory networks. In: Proceedings of the European Conference on Computer Vision, Springer, pp 661\u2013679","DOI":"10.1007\/978-3-030-58580-8_39"},{"key":"10176_CR93","doi-asserted-by":"crossref","unstructured":"Lu X, Wang W, Ma C, Shen J, Shao L, Porikli F (2019) See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3623\u20133632","DOI":"10.1109\/CVPR.2019.00374"},{"key":"10176_CR94","doi-asserted-by":"crossref","unstructured":"Lu X, Wang W, Shen J, Crandall D, Luo J (2020b) Zero-shot video object segmentation with co-attention siamese networks. IEEE Trans Pattern Anal Mach Intell","DOI":"10.1109\/TPAMI.2020.3040258"},{"key":"10176_CR95","unstructured":"Ma T, Latecki LJ (2012) Maximum weight cliques with mutex constraints for video object segmentation. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 670\u2013677"},{"issue":"6","key":"10176_CR96","doi-asserted-by":"publisher","first-page":"1515","DOI":"10.1109\/TPAMI.2018.2838670","volume":"41","author":"KK Maninis","year":"2018","unstructured":"Maninis KK, Caelles S, Chen Y, Pont-Tuset J, Leal-Taix\u00e9 L, Cremers D, Van Gool L (2018) Video object segmentation without temporal information. IEEE Trans Pattern Anal Mach Intell 41(6):1515\u20131530","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"5","key":"10176_CR97","doi-asserted-by":"publisher","first-page":"530","DOI":"10.1109\/TPAMI.2004.1273918","volume":"26","author":"DR Martin","year":"2004","unstructured":"Martin DR, Fowlkes CC, Malik J (2004) Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans Pattern Anal Mach Intell 26(5):530\u2013549","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"10176_CR98","doi-asserted-by":"crossref","unstructured":"Neuhold G, Ollmann T, Rota\u00a0Bulo S, Kontschieder P (2017) The mapillary vistas dataset for semantic understanding of street scenes. In: Proceedings of the IEEE International Conference on Computer Vision, pp 4990\u20134999","DOI":"10.1109\/ICCV.2017.534"},{"key":"10176_CR99","doi-asserted-by":"crossref","unstructured":"Noh H, Hong S, Han B (2015) Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1520\u20131528","DOI":"10.1109\/ICCV.2015.178"},{"issue":"6","key":"10176_CR100","doi-asserted-by":"publisher","first-page":"1187","DOI":"10.1109\/TPAMI.2013.242","volume":"36","author":"P Ochs","year":"2013","unstructured":"Ochs P, Malik J, Brox T (2013) Segmentation of moving objects by long term video analysis. 
IEEE Trans Pattern Anal Mach Intell 36(6):1187\u20131200","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"10176_CR101","doi-asserted-by":"crossref","unstructured":"Ochs P, Brox T (2012) Higher order motion models and spectral clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 614\u2013621","DOI":"10.1109\/CVPR.2012.6247728"},{"key":"10176_CR102","doi-asserted-by":"crossref","unstructured":"Oh SW, Lee JY, Sunkavalli K, Joo\u00a0Kim S (2018) Fast video object segmentation by reference-guided mask propagation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7376\u20137385","DOI":"10.1109\/CVPR.2018.00770"},{"key":"10176_CR103","doi-asserted-by":"crossref","unstructured":"Oh SW, Lee JY, Xu N, Kim SJ (2019) Video object segmentation using space-time memory networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp 9226\u20139235","DOI":"10.1109\/ICCV.2019.00932"},{"key":"10176_CR104","doi-asserted-by":"crossref","unstructured":"Papazoglou A, Ferrari V (2013) Fast object segmentation in unconstrained video. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1777\u20131784","DOI":"10.1109\/ICCV.2013.223"},{"key":"10176_CR105","unstructured":"Parmar N, Vaswani A, Uszkoreit J, Kaiser L, Shazeer N, Ku A, Tran D (2018) Image transformer. In: Proceedings of the International Conference on Machine Learning, PMLR, pp 4055\u20134064"},{"key":"10176_CR106","doi-asserted-by":"crossref","unstructured":"Perazzi F, Khoreva A, Benenson R, Schiele B, Sorkine-Hornung A (2017) Learning video object segmentation from static images. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2663\u20132672","DOI":"10.1109\/CVPR.2017.372"},{"key":"10176_CR107","doi-asserted-by":"crossref","unstructured":"Perazzi F, Pont-Tuset J, McWilliams B, Van\u00a0Gool L, Gross M, Sorkine-Hornung A (2016a) A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 724\u2013732","DOI":"10.1109\/CVPR.2016.85"},{"key":"10176_CR108","doi-asserted-by":"crossref","unstructured":"Perazzi F, Pont-Tuset J, McWilliams B, Van\u00a0Gool L, Gross M, Sorkine-Hornung A (2016b) A benchmark dataset and evaluation methodology for video object segmentation: Supplemental material. In: URL https:\/\/davischallenge.org\/files\/davis_supplementary.pdf","DOI":"10.1109\/CVPR.2016.85"},{"key":"10176_CR109","unstructured":"Pont-Tuset J, Perazzi F, Caelles S, Arbel\u00e1ez P, Sorkine-Hornung A, Van\u00a0Gool L (2017) The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675"},{"key":"10176_CR110","doi-asserted-by":"crossref","unstructured":"Prest A, Leistner C, Civera J, Schmid C, Ferrari V (2012) Learning object class detectors from weakly annotated video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 3282\u20133289","DOI":"10.1109\/CVPR.2012.6248065"},{"key":"10176_CR111","unstructured":"Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. In: Proceedings of the Advances in Neural Information Processing Systems, pp 91\u201399"},{"key":"10176_CR112","doi-asserted-by":"crossref","unstructured":"Robinson A, Lawin FJ, Danelljan M, Khan FS, Felsberg M (2020) Learning fast and robust target models for video object segmentation. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7406\u20137415","DOI":"10.1109\/CVPR42600.2020.00743"},{"key":"10176_CR113","doi-asserted-by":"crossref","unstructured":"Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234\u2013241","DOI":"10.1007\/978-3-319-24574-4_28"},{"issue":"3","key":"10176_CR114","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","volume":"115","author":"O Russakovsky","year":"2015","unstructured":"Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211\u2013252","journal-title":"Int J Comput Vis"},{"key":"10176_CR115","doi-asserted-by":"crossref","unstructured":"Seong H, Hyun J, Kim E (2020) Kernelized memory network for video object segmentation. In: Proceedings of the European Conference on Computer Vision, Springer, pp 629\u2013645","DOI":"10.1007\/978-3-030-58542-6_38"},{"key":"10176_CR116","doi-asserted-by":"crossref","unstructured":"Seong H, Oh SW, Lee JY, Lee S, Lee S, Kim E (2021) Hierarchical Memory Matching Network for Video Object Segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 12889\u201312898","DOI":"10.1109\/ICCV48922.2021.01265"},{"issue":"4","key":"10176_CR117","doi-asserted-by":"publisher","first-page":"717","DOI":"10.1109\/TPAMI.2015.2465960","volume":"38","author":"J Shi","year":"2015","unstructured":"Shi J, Yan Q, Xu L, Jia J (2015) Hierarchical image saliency detection on extended cssd. 
IEEE Trans Pattern Anal Mach Intell 38(4):717\u2013729","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"10176_CR118","unstructured":"Shi X, Chen Z, Wang H, Yeung DY, Wong WK, Woo Wc (2015b) Convolutional lstm network: A machine learning approach for precipitation nowcasting. In: Proceedings of the Advances in Neural Information Processing Systems, pp 802\u2013810"},{"issue":"1","key":"10176_CR119","doi-asserted-by":"publisher","first-page":"19","DOI":"10.1109\/76.554415","volume":"7","author":"T Sikora","year":"1997","unstructured":"Sikora T (1997) The mpeg-4 video standard verification model. IEEE Trans Circuits Syst Video Technol 7(1):19\u201331","journal-title":"IEEE Trans Circuits Syst Video Technol"},{"key":"10176_CR120","unstructured":"Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations"},{"key":"10176_CR121","doi-asserted-by":"crossref","unstructured":"Song H, Wang W, Zhao S, Shen J, Lam KM (2018) Pyramid dilated deeper convlstm for video salient object detection. In: Proceedings of the European Conference on Computer Vision, pp 715\u2013731","DOI":"10.1007\/978-3-030-01252-6_44"},{"issue":"8","key":"10176_CR122","doi-asserted-by":"publisher","first-page":"1797","DOI":"10.1109\/TPAMI.2018.2884990","volume":"41","author":"H Tjaden","year":"2018","unstructured":"Tjaden H, Schwanecke U, Sch\u00f6mer E, Cremers D (2018) A region-based gauss-newton approach to real-time monocular multiple object tracking. IEEE Trans Pattern Anal Mach Intell 41(8):1797\u20131812","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"10176_CR123","doi-asserted-by":"crossref","unstructured":"Tokmakov P, Alahari K, Schmid C (2017a) Learning motion patterns in videos. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3386\u20133394","DOI":"10.1109\/CVPR.2017.64"},{"key":"10176_CR124","doi-asserted-by":"crossref","unstructured":"Tokmakov P, Alahari K, Schmid C (2017b) Learning video object segmentation with visual memory. In: Proceedings of the IEEE International Conference on Computer Vision, pp 4481\u20134490","DOI":"10.1109\/ICCV.2017.480"},{"key":"10176_CR125","doi-asserted-by":"crossref","unstructured":"Tron R, Vidal R (2007) A benchmark for the comparison of 3-d motion segmentation algorithms. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 1\u20138","DOI":"10.1109\/CVPR.2007.382974"},{"key":"10176_CR126","doi-asserted-by":"crossref","unstructured":"Tsai YH, Yang MH, Black MJ (2016) Video segmentation via object flow. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3899\u20133908","DOI":"10.1109\/CVPR.2016.423"},{"issue":"2","key":"10176_CR127","doi-asserted-by":"publisher","first-page":"190","DOI":"10.1007\/s11263-011-0512-5","volume":"100","author":"D Tsai","year":"2012","unstructured":"Tsai D, Flagg M, Nakazawa A, Rehg JM (2012) Motion coherent tracking using multi-label mrf optimization. Int J Comput Vis 100(2):190\u2013202","journal-title":"Int J Comput Vis"},{"key":"10176_CR128","doi-asserted-by":"crossref","unstructured":"Ventura C, Bellver M, Girbau A, Salvador A, Marques F, Giro-i Nieto X (2019) Rvos: End-to-end recurrent network for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5277\u20135286","DOI":"10.1109\/CVPR.2019.00542"},{"key":"10176_CR129","doi-asserted-by":"crossref","unstructured":"Voigtlaender P, Chai Y, Schroff F, Adam H, Leibe B, Chen LC (2019) Feelvos: Fast end-to-end embedding learning for video object segmentation. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9481\u20139490","DOI":"10.1109\/CVPR.2019.00971"},{"key":"10176_CR130","doi-asserted-by":"crossref","unstructured":"Voigtlaender P, Leibe B (2017) Online adaptation of convolutional neural networks for video object segmentation. In: Proceedings of the British Machine Vision Conference","DOI":"10.5244\/C.31.116"},{"issue":"12","key":"10176_CR131","doi-asserted-by":"publisher","first-page":"5645","DOI":"10.1109\/TIP.2017.2745098","volume":"26","author":"W Wang","year":"2017","unstructured":"Wang W, Shen J, Porikli F (2017) Selective video object cutout. IEEE Trans Image Process 26(12):5645\u20135655","journal-title":"IEEE Trans Image Process"},{"issue":"4","key":"10176_CR132","doi-asserted-by":"publisher","first-page":"985","DOI":"10.1109\/TPAMI.2018.2819173","volume":"41","author":"W Wang","year":"2018","unstructured":"Wang W, Shen J, Porikli F, Yang R (2018) Semi-supervised video object segmentation with super-trajectories. IEEE Trans Pattern Anal Mach Intell 41(4):985\u2013998","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"10176_CR133","doi-asserted-by":"crossref","unstructured":"Wang H, Jiang X, Ren H, Hu Y, Bai S (2021a) Swiftnet: Real-time video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1296\u20131305","DOI":"10.1109\/CVPR46437.2021.00135"},{"key":"10176_CR134","doi-asserted-by":"crossref","unstructured":"Wang W, Lu X, Shen J, Crandall DJ, Shao L (2019b) Zero-shot video object segmentation via attentive graph neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp 9236\u20139245","DOI":"10.1109\/ICCV.2019.00933"},{"key":"10176_CR135","doi-asserted-by":"crossref","unstructured":"Wang L, Lu H, Wang Y, Feng M, Wang D, Yin B, Ruan X (2017a) Learning to detect salient objects with image-level supervision. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 136\u2013145","DOI":"10.1109\/CVPR.2017.404"},{"key":"10176_CR136","doi-asserted-by":"crossref","unstructured":"Wang W, Shen J, Lu X, Hoi SC, Ling H (2020) Paying attention to video object pattern understanding. In: Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence","DOI":"10.1109\/TPAMI.2020.2966453"},{"key":"10176_CR137","doi-asserted-by":"crossref","unstructured":"Wang W, Shen J, Porikli F (2015) Saliency-aware geodesic video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3395\u20133402","DOI":"10.1109\/CVPR.2015.7298961"},{"key":"10176_CR138","doi-asserted-by":"crossref","unstructured":"Wang W, Song H, Zhao S, Shen J, Zhao S, Hoi SC, Ling H (2019c) Learning unsupervised video object segmentation through visual attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3064\u20133074","DOI":"10.1109\/CVPR.2019.00318"},{"key":"10176_CR139","doi-asserted-by":"crossref","unstructured":"Wang Z, Xu J, Liu L, Zhu F, Shao L (2019d) Ranet: Ranking attention network for fast video object segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 3978\u20133987","DOI":"10.1109\/ICCV.2019.00408"},{"key":"10176_CR140","doi-asserted-by":"crossref","unstructured":"Wang Y, Xu Z, Wang X, Shen C, Cheng B, Shen H, Xia H (2021c) End-to-end video instance segmentation with transformers. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 8741\u20138750","DOI":"10.1109\/CVPR46437.2021.00863"},{"key":"10176_CR141","doi-asserted-by":"crossref","unstructured":"Wang Q, Zhang L, Bertinetto L, Hu W, Torr PH (2019a) Fast online object tracking and segmentation: A unifying approach. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1328\u20131338","DOI":"10.1109\/CVPR.2019.00142"},{"key":"10176_CR142","doi-asserted-by":"crossref","unstructured":"Wang W, Zhou T, Porikli F, Crandall D, Van\u00a0Gool L (2021b) A survey on deep learning technique for video segmentation. arXiv preprint arXiv:2107.01153","DOI":"10.1109\/TPAMI.2022.3225573"},{"issue":"10","key":"10176_CR143","doi-asserted-by":"publisher","first-page":"1550","DOI":"10.1109\/5.58337","volume":"78","author":"PJ Werbos","year":"1990","unstructured":"Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550\u20131560","journal-title":"Proc IEEE"},{"key":"10176_CR144","doi-asserted-by":"publisher","first-page":"119","DOI":"10.1016\/j.patcog.2019.01.006","volume":"90","author":"Z Wu","year":"2019","unstructured":"Wu Z, Shen C, Van Den Hengel A (2019) Wider or deeper: revisiting the resnet model for visual recognition. Pattern Recogn 90:119\u2013133","journal-title":"Pattern Recogn"},{"key":"10176_CR145","doi-asserted-by":"crossref","unstructured":"Xiao H, Feng J, Lin G, Liu Y, Zhang M (2018) Monet: Deep motion exploitation for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1140\u20131148","DOI":"10.1109\/CVPR.2018.00125"},{"key":"10176_CR146","doi-asserted-by":"crossref","unstructured":"Xie H, Yao H, Zhou S, Zhang S, Sun W (2021) Efficient regional memory network for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1286\u20131295","DOI":"10.1109\/CVPR46437.2021.00134"},{"key":"10176_CR147","first-page":"12549","volume":"34","author":"Y Xu","year":"2020","unstructured":"Xu Y, Wang Z, Li Z, Yuan Y, Yu G (2020) SiamFC++: towards robust and accurate visual tracking with target estimation guidelines. 
Proc AAAI Conf Artif Intell 34:12549\u201312556","journal-title":"Proc AAAI Conf Artif Intell"},{"key":"10176_CR148","doi-asserted-by":"crossref","unstructured":"Xu S, Liu D, Bao L, Liu W, Zhou P (2019c) Mhp-vos: Multiple hypotheses propagation for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 314\u2013323","DOI":"10.1109\/CVPR.2019.00040"},{"key":"10176_CR149","doi-asserted-by":"crossref","unstructured":"Xu K, Wen L, Li G, Bo L, Huang Q (2019a) Spatiotemporal cnn for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1379\u20131388","DOI":"10.1109\/CVPR.2019.00147"},{"key":"10176_CR150","unstructured":"Xu N, Yang L, Fan Y, Huang TS, Yang J, Shi H (2019b) The 2nd large-scale video object segmentation challenge - track 1: Video object segmentation. In: URL https:\/\/competitions.codalab.org\/competitions\/20127#participate-get-data"},{"key":"10176_CR151","doi-asserted-by":"crossref","unstructured":"Xu N, Yang L, Fan Y, Yang J, Yue D, Liang Y, Price B, Cohen S, Huang T (2018a) Youtube-vos: Sequence-to-sequence video object segmentation. In: Proceedings of the European Conference on Computer Vision, pp 585\u2013601","DOI":"10.1007\/978-3-030-01228-1_36"},{"key":"10176_CR152","doi-asserted-by":"crossref","unstructured":"Xu N, Yang L, Fan Y, Yue D, Liang Y, Yang J, Huang T (2018b) Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327","DOI":"10.1007\/978-3-030-01228-1_36"},{"key":"10176_CR153","doi-asserted-by":"crossref","unstructured":"Yang L, Fan Y, Xu N (2019a) Video instance segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 5188\u20135197","DOI":"10.1109\/ICCV.2019.00529"},{"key":"10176_CR154","doi-asserted-by":"crossref","unstructured":"Yang Z, Wang Q, Bertinetto L, Hu W, Bai S, Torr PH (2019b) Anchor diffusion for unsupervised video object segmentation. 
In: Proceedings of the IEEE International Conference on Computer Vision, pp 931\u2013940","DOI":"10.1109\/ICCV.2019.00102"},{"key":"10176_CR155","doi-asserted-by":"crossref","unstructured":"Yang L, Wang Y, Xiong X, Yang J, Katsaggelos AK (2018) Efficient video object segmentation via network modulation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6499\u20136507","DOI":"10.1109\/CVPR.2018.00680"},{"key":"10176_CR156","doi-asserted-by":"crossref","unstructured":"Yang Z, Wei Y, Yang Y (2020) Collaborative video object segmentation by foreground-background integration. In: Proceedings of the European Conference on Computer Vision, Springer, pp 332\u2013348","DOI":"10.1007\/978-3-030-58558-7_20"},{"key":"10176_CR157","unstructured":"Yang Z, Wei Y, Yang Y (2021a) Associating objects with transformers for video object segmentation. In: Proceedings of the Advances in Neural Information Processing Systems"},{"key":"10176_CR158","doi-asserted-by":"crossref","unstructured":"Yang Z, Wei Y, Yang Y (2021b) Collaborative video object segmentation by multi-scale foreground-background integration. In: Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence","DOI":"10.1109\/TPAMI.2021.3081597"},{"issue":"4","key":"10176_CR159","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3391743","volume":"11","author":"R Yao","year":"2020","unstructured":"Yao R, Lin G, Xia S, Zhao J, Zhou Y (2020) Video object segmentation and tracking: a survey. ACM Trans Intell Syst Technol 11(4):1\u201347","journal-title":"ACM Trans Intell Syst Technol"},{"issue":"4","key":"10176_CR160","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1145\/1177352.1177355","volume":"38","author":"A Yilmaz","year":"2006","unstructured":"Yilmaz A, Javed O, Shah M (2006) Object tracking: a survey. 
ACM Comput Surv 38(4):13","journal-title":"ACM Comput Surv"},{"key":"10176_CR161","doi-asserted-by":"crossref","unstructured":"Yoon JS, Rameau F, Kim J, Lee S, Shin S, So\u00a0Kweon I (2017) Pixel-level matching for video object segmentation using convolutional neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2167\u20132176","DOI":"10.1109\/ICCV.2017.238"},{"key":"10176_CR162","unstructured":"Yu F, Koltun V (2016) Multi-scale context aggregation by dilated convolutions. In: Proceedings of the International Conference on Learning Representations"},{"key":"10176_CR163","doi-asserted-by":"crossref","unstructured":"Zeng X, Liao R, Gu L, Xiong Y, Fidler S, Urtasun R (2019a) Dmm-net: Differentiable mask-matching network for video object segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 3929\u20133938","DOI":"10.1109\/ICCV.2019.00403"},{"key":"10176_CR164","doi-asserted-by":"crossref","unstructured":"Zeng Y, Zhang P, Zhang J, Lin Z, Lu H (2019b) Towards high-resolution salient object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp 7234\u20137243","DOI":"10.1109\/ICCV.2019.00733"},{"key":"10176_CR165","doi-asserted-by":"crossref","unstructured":"Zhang D, Javed O, Shah M (2013) Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 628\u2013635","DOI":"10.1109\/CVPR.2013.87"},{"key":"10176_CR166","doi-asserted-by":"crossref","unstructured":"Zhang L, Lin Z, Zhang J, Lu H, He Y (2019) Fast video object segmentation via dynamic targeting network. 
In: Proceedings of the IEEE International Conference on Computer Vision, pp 5582\u20135591","DOI":"10.1109\/ICCV.2019.00568"},{"key":"10176_CR167","doi-asserted-by":"crossref","unstructured":"Zhang Y, Wu Z, Peng H, Lin S (2020) A transductive approach for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6949\u20136958","DOI":"10.1109\/CVPR42600.2020.00698"},{"issue":"8","key":"10176_CR168","doi-asserted-by":"publisher","first-page":"1259","DOI":"10.1109\/76.809160","volume":"9","author":"D Zhong","year":"1999","unstructured":"Zhong D, Chang SF (1999) An integrated approach for content-based video object segmentation and retrieval. IEEE Trans Circuits Syst Video Technol 9(8):1259\u20131268","journal-title":"IEEE Trans Circuits Syst Video Technol"},{"key":"10176_CR169","unstructured":"Zhou D, Bousquet O, Lal TN, Weston J, Sch\u00f6lkopf B (2004) Learning with local and global consistency. In: Advances in Neural Information Processing Systems, pp 321\u2013328"},{"key":"10176_CR170","doi-asserted-by":"crossref","unstructured":"Zhou T, Li J, Li X, Shao L (2021) Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6985\u20136994","DOI":"10.1109\/CVPR46437.2021.00691"},{"issue":"7","key":"10176_CR171","doi-asserted-by":"publisher","first-page":"773","DOI":"10.1016\/j.patrec.2005.11.005","volume":"27","author":"Z Zivkovic","year":"2006","unstructured":"Zivkovic Z, Van Der Heijden F (2006) Efficient adaptive density estimation per image pixel for the task of background subtraction. 
Pattern Recogn Lett 27(7):773\u2013780","journal-title":"Pattern Recogn Lett"}],"container-title":["Artificial Intelligence Review"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10462-022-10176-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10462-022-10176-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10462-022-10176-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,4]],"date-time":"2023-01-04T09:17:35Z","timestamp":1672823855000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10462-022-10176-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,8]]},"references-count":171,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,1]]}},"alternative-id":["10176"],"URL":"https:\/\/doi.org\/10.1007\/s10462-022-10176-7","relation":{},"ISSN":["0269-2821","1573-7462"],"issn-type":[{"value":"0269-2821","type":"print"},{"value":"1573-7462","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,4,8]]},"assertion":[{"value":"8 April 2022","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}