{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,10]],"date-time":"2026-05-10T06:00:36Z","timestamp":1778392836114,"version":"3.51.4"},"reference-count":95,"publisher":"Springer Science and Business Media LLC","issue":"8","license":[{"start":{"date-parts":[[2022,6,18]],"date-time":"2022-06-18T00:00:00Z","timestamp":1655510400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,6,18]],"date-time":"2022-06-18T00:00:00Z","timestamp":1655510400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2022,8]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Can our video understanding systems perceive objects when a heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories, where object occlusions usually occur. While our human vision systems can understand those occluded instances by contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 16.3, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario. We also present a simple plug-and-play module that performs temporal feature calibration to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, we obtain a remarkable AP improvement on the OVIS dataset. 
The OVIS dataset and project code are available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"http:\/\/songbai.site\/ovis\">http:\/\/songbai.site\/ovis<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s11263-022-01629-1","type":"journal-article","created":{"date-parts":[[2022,6,18]],"date-time":"2022-06-18T08:02:52Z","timestamp":1655539372000},"page":"2022-2039","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":135,"title":["Occluded Video Instance Segmentation: A Benchmark"],"prefix":"10.1007","volume":"130","author":[{"given":"Jiyang","family":"Qi","sequence":"first","affiliation":[]},{"given":"Yan","family":"Gao","sequence":"additional","affiliation":[]},{"given":"Yao","family":"Hu","sequence":"additional","affiliation":[]},{"given":"Xinggang","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Xiaoyu","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Xiang","family":"Bai","sequence":"additional","affiliation":[]},{"given":"Serge","family":"Belongie","sequence":"additional","affiliation":[]},{"given":"Alan","family":"Yuille","sequence":"additional","affiliation":[]},{"given":"Philip H. S.","family":"Torr","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2570-9118","authenticated-orcid":false,"given":"Song","family":"Bai","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,6,18]]},"reference":[{"key":"1629_CR1","unstructured":"Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675"},{"key":"1629_CR2","doi-asserted-by":"crossref","unstructured":"Athar, A., Mahadevan, S., O\u0161ep, A., Leal-Taix\u00e9, L., & Leibe, B. (2020). 
Stem-seg: Spatio-temporal embeddings for instance segmentation in videos. In ECCV.","DOI":"10.1007\/978-3-030-58621-8_10"},{"key":"1629_CR3","doi-asserted-by":"crossref","unstructured":"Bertasius, G., & Torresani, L. (2020). Classifying, segmenting, and tracking object instances in video with mask propagation. In CVPR","DOI":"10.1109\/CVPR42600.2020.00976"},{"key":"1629_CR4","doi-asserted-by":"crossref","unstructured":"Bertasius, G., Torresani, L., & Shi, J. (2018). Object detection in video with spatiotemporal sampling networks. In ECCV (pp. 331\u2013346).","DOI":"10.1007\/978-3-030-01258-8_21"},{"key":"1629_CR5","doi-asserted-by":"crossref","unstructured":"Bolya, D., Foley, S., Hays, J., & Hoffman, J. (2020). Tide: A general toolbox for identifying object detection errors. In ECCV","DOI":"10.1007\/978-3-030-58580-8_33"},{"issue":"2","key":"1629_CR6","doi-asserted-by":"publisher","first-page":"88","DOI":"10.1016\/j.patrec.2008.04.005","volume":"30","author":"GJ Brostow","year":"2009","unstructured":"Brostow, G. J., Fauqueur, J., & Cipolla, R. (2009). Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2), 88\u201397.","journal-title":"Pattern Recognition Letters"},{"key":"1629_CR7","unstructured":"Caelles, S., Pont-Tuset, J., Perazzi, F., Montes, A., Maninis, K. K., & Van\u00a0Gool, L. (2019). The 2019 davis challenge on vos: Unsupervised multi-object segmentation. arXiv"},{"key":"1629_CR8","doi-asserted-by":"crossref","unstructured":"Cao, J., Anwer, R. M., Cholakkal, H., Khan, F. S., Pang, Y., & Shao, L. (2020). Sipmask: Spatial information preservation for fast image and video instance segmentation. In ECCV.","DOI":"10.1007\/978-3-030-58568-6_1"},{"key":"1629_CR9","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In: ECCV (pp. 213\u2013229). 
Springer.","DOI":"10.1007\/978-3-030-58452-8_13"},{"issue":"4","key":"1629_CR10","doi-asserted-by":"publisher","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","volume":"40","author":"LC Chen","year":"2017","unstructured":"Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI, 40(4), 834\u2013848.","journal-title":"IEEE TPAMI"},{"key":"1629_CR11","doi-asserted-by":"crossref","unstructured":"Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV.","DOI":"10.1007\/978-3-030-01234-2_49"},{"key":"1629_CR12","doi-asserted-by":"crossref","unstructured":"Chu, Q., Ouyang, W., Li, H., Wang, X., Liu, B., & Yu, N. (2017). Online multi-object tracking using cnn-based single object tracker with spatial-temporal attention mechanism. In ICCV (pp. 4836\u20134845).","DOI":"10.1109\/ICCV.2017.518"},{"key":"1629_CR13","doi-asserted-by":"crossref","unstructured":"Chu, X., Zheng, A., Zhang, X., & Sun, J. (2020). Detection in crowded scenes: One proposal, multiple predictions. In CVPR.","DOI":"10.1109\/CVPR42600.2020.01223"},{"key":"1629_CR14","doi-asserted-by":"crossref","unstructured":"Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition.","DOI":"10.1109\/CVPR.2016.350"},{"key":"1629_CR15","doi-asserted-by":"crossref","unstructured":"Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional networks. In ICCV.","DOI":"10.1109\/ICCV.2017.89"},{"key":"1629_CR16","doi-asserted-by":"crossref","unstructured":"Devaranjan, J., Kar, A., & Fidler, S. (2020). 
Meta-sim2: Unsupervised learning of scene structure for synthetic data generation. In ECCV (pp. 715\u2013733). Springer.","DOI":"10.1007\/978-3-030-58520-4_42"},{"key":"1629_CR17","unstructured":"DeVries, T., & Taylor, G. W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552"},{"key":"1629_CR18","doi-asserted-by":"crossref","unstructured":"Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der\u00a0Smagt, P., Cremers, D., & Brox, T. (2015). Flownet: Learning optical flow with convolutional networks. In ICCV.","DOI":"10.1109\/ICCV.2015.316"},{"key":"1629_CR19","doi-asserted-by":"crossref","unstructured":"Dwibedi, D., Misra, I., & Hebert, M. (2017). Cut, paste and learn: Surprisingly easy synthesis for instance detection. In ICCV (pp. 1301\u20131310)","DOI":"10.1109\/ICCV.2017.146"},{"key":"1629_CR20","doi-asserted-by":"crossref","unstructured":"Fang, Y., Yang, S., Wang, X., Li, Y., Fang, C., Shan, Y., Feng, B., & Liu, W. (2021). Instances as queries. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00683"},{"key":"1629_CR21","unstructured":"Fayyaz, M., Saffar, M.H., Sabokrou, M., Fathy, M., Klette, R., & Huang, F. (2016). STFCN: spatio-temporal fcn for semantic video segmentation. In ACCV."},{"key":"1629_CR22","doi-asserted-by":"crossref","unstructured":"Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR.","DOI":"10.1109\/CVPR.2012.6248074"},{"key":"1629_CR23","doi-asserted-by":"crossref","unstructured":"Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T. Y., Cubuk, E. D., Le, Q. V., & Zoph, B. (2021). Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR (pp. 2918\u20132928).","DOI":"10.1109\/CVPR46437.2021.00294"},{"key":"1629_CR24","doi-asserted-by":"crossref","unstructured":"Gupta, A., Dollar, P., & Girshick, R. (2019). 
LVIS: A dataset for large vocabulary instance segmentation. In CVPR.","DOI":"10.1109\/CVPR.2019.00550"},{"key":"1629_CR25","unstructured":"He, K., Chen, X., Xie, S., Li, Y., Doll\u00e1r, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377"},{"key":"1629_CR26","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., & Girshick, R. (2017). Mask R-CNN. In ICCV.","DOI":"10.1109\/ICCV.2017.322"},{"key":"1629_CR27","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.","DOI":"10.1109\/CVPR.2016.90"},{"issue":"4","key":"1629_CR28","doi-asserted-by":"publisher","first-page":"16","DOI":"10.1167\/8.4.16","volume":"8","author":"J Hegd\u00e9","year":"2008","unstructured":"Hegd\u00e9, J., Fang, F., Murray, S. O., & Kersten, D. (2008). Preferential responses to occluded objects in the human visual cortex. JOV, 8(4), 16\u201316.","journal-title":"JOV"},{"key":"1629_CR29","doi-asserted-by":"crossref","unstructured":"Hosang, J., Benenson, R., & Schiele, B. (2017). Learning non-maximum suppression. In CVPR (pp. 4507\u20134515).","DOI":"10.1109\/CVPR.2017.685"},{"key":"1629_CR30","doi-asserted-by":"crossref","unstructured":"Hu, Y. T., Huang, J. B., & Schwing, A. G. (2018). Videomatch: Matching based video object segmentation. In ECCV.","DOI":"10.1007\/978-3-030-01237-3_4"},{"key":"1629_CR31","doi-asserted-by":"crossref","unstructured":"Huang, Z., Huang, L., Gong, Y., Huang, C., & Wang, X. (2019). Mask scoring R-CNN. In CVPR.","DOI":"10.1109\/CVPR.2019.00657"},{"key":"1629_CR32","unstructured":"Hwang, S., Heo, M., Oh, S. W., & Kim, S. J. (2021). Video instance segmentation using inter-frame communication transformers. arXiv preprint arXiv:2106.03299"},{"key":"1629_CR33","doi-asserted-by":"crossref","unstructured":"Johnander, J., Danelljan, M., Brissman, E., Khan, F. S., & Felsberg, M. (2019). 
A generative appearance model for end-to-end video object segmentation. In CVPR.","DOI":"10.1109\/CVPR.2019.00916"},{"key":"1629_CR34","doi-asserted-by":"crossref","unstructured":"Kar, A., Prakash, A., Liu, M. Y., Cameracci, E., Yuan, J., Rusiniak, M., Acuna, D., Torralba, A., & Fidler, S. (2019). Meta-sim: Learning to generate synthetic datasets. In ICCV (pp. 4551\u20134560).","DOI":"10.1109\/ICCV.2019.00465"},{"key":"1629_CR35","doi-asserted-by":"crossref","unstructured":"Ke, L., Tai, Y. W., & Tang, C. K. (2021). Deep occlusion-aware instance segmentation with overlapping bilayers. In CVPR.","DOI":"10.1109\/CVPR46437.2021.00401"},{"key":"1629_CR36","doi-asserted-by":"crossref","unstructured":"Khoreva, A., Perazzi, F., Benenson, R., Schiele, B., & Sorkine-Hornung, A. (2017). Learning video object segmentation from static images. In CVPR.","DOI":"10.1109\/CVPR.2017.372"},{"key":"1629_CR37","doi-asserted-by":"crossref","unstructured":"Kim, D., Woo, S., Lee, J. Y., & Kweon, I. S. (2020). Video panoptic segmentation. In CVPR.","DOI":"10.1109\/CVPR42600.2020.00988"},{"key":"1629_CR38","doi-asserted-by":"crossref","unstructured":"Kirillov, A., He, K., Girshick, R., Rother, C., & Doll\u00e1r, P. (2019). Panoptic segmentation. In CVPR.","DOI":"10.1109\/CVPR.2019.00963"},{"key":"1629_CR39","doi-asserted-by":"crossref","unstructured":"Kirillov, A., Wu, Y., He, K., & Girshick, R. (2020). Pointrend: Image segmentation as rendering. In CVPR.","DOI":"10.1109\/CVPR42600.2020.00982"},{"key":"1629_CR40","doi-asserted-by":"crossref","unstructured":"Kortylewski, A., He, J., Liu, Q., & Yuille, A. L. (2020a). Compositional convolutional neural networks: A deep architecture with innate robustness to partial occlusion. In CVPR (pp. 
8940\u20138949).","DOI":"10.1109\/CVPR42600.2020.00896"},{"issue":"3","key":"1629_CR41","doi-asserted-by":"publisher","first-page":"736","DOI":"10.1007\/s11263-020-01401-3","volume":"129","author":"A Kortylewski","year":"2021","unstructured":"Kortylewski, A., Liu, Q., Wang, A., Sun, Y., & Yuille, A. (2021). Compositional convolutional neural networks: A robust and interpretable model for object recognition under occlusion. IJCV, 129(3), 736\u2013760.","journal-title":"IJCV"},{"key":"1629_CR42","doi-asserted-by":"crossref","unstructured":"Kortylewski, A., Liu, Q., Wang, H., Zhang, Z., & Yuille, A. (2020b). Combining compositional models and deep networks for robust object classification under occlusion. In WACV (pp. 1333\u20131341).","DOI":"10.1109\/WACV45572.2020.9093560"},{"key":"1629_CR43","doi-asserted-by":"crossref","unstructured":"Lazarow, J., Lee, K., Shi, K., & Tu, Z. (2020). Learning instance occlusion for panoptic segmentation. In CVPR (pp. 10720\u201310729).","DOI":"10.1109\/CVPR42600.2020.01073"},{"key":"1629_CR44","doi-asserted-by":"crossref","unstructured":"Li, M., Li, S., Li, L., & Zhang, L. (2021). Spatial feature calibration and temporal fusion for effective one-stage video instance segmentation. In CVPR.","DOI":"10.1109\/CVPR46437.2021.01106"},{"key":"1629_CR45","doi-asserted-by":"crossref","unstructured":"Li, Q., Qi, X., & Torr, P. H. (2020). Unifying training and inference for panoptic segmentation. In CVPR.","DOI":"10.1109\/CVPR42600.2020.01333"},{"key":"1629_CR46","doi-asserted-by":"crossref","unstructured":"Li, S., Seybold, B., Vorobyov, A., Fathi, A., & Kuo, C. C. J. (2018). Instance embedding transfer to unsupervised video object segmentation. In CVPR.","DOI":"10.1109\/CVPR.2018.00683"},{"key":"1629_CR47","doi-asserted-by":"crossref","unstructured":"Li, X., & Loy, C. C. (2018). Video object segmentation with joint re-identification and attention-aware mask propagation. 
In ECCV.","DOI":"10.1007\/978-3-030-01219-9_6"},{"key":"1629_CR48","unstructured":"Li, Y., Xu, N., Peng, J., See, J., & Lin, W. (2020). Delving into the cyclic mechanism in semi-supervised video object segmentation. NeurIPS, 33."},{"key":"1629_CR49","doi-asserted-by":"crossref","unstructured":"Lin, C. C., Hung, Y., Feris, R., & He, L. (2020). Video instance segmentation tracking with a modified vae architecture. In CVPR.","DOI":"10.1109\/CVPR42600.2020.01316"},{"key":"1629_CR50","doi-asserted-by":"crossref","unstructured":"Lin, T. Y., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In ECCV.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"1629_CR51","doi-asserted-by":"crossref","unstructured":"Liu, D., Cui, Y., Tan, W., & Chen, Y. (2021). SG-Net: Spatial granularity network for one-stage video instance segmentation. In CVPR.","DOI":"10.1109\/CVPR46437.2021.00969"},{"key":"1629_CR52","doi-asserted-by":"crossref","unstructured":"Liu, Q., Chu, Q., Liu, B., & Yu, N. (2020). GSM: Graph similarity model for multi-object tracking. In IJCAI (pp. 530\u2013536).","DOI":"10.24963\/ijcai.2020\/74"},{"key":"1629_CR53","doi-asserted-by":"crossref","unstructured":"Liu, S., Huang, D., & Wang, Y. (2019). Adaptive NMS: Refining pedestrian detection in a crowd. In CVPR.","DOI":"10.1109\/CVPR.2019.00662"},{"key":"1629_CR54","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV (pp. 10012\u201310022).","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"1629_CR55","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. 
In CVPR.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"1629_CR56","unstructured":"Milan, A., Leal-Taix\u00e9, L., Reid, I., Roth, S., & Schindler, K. (2016). Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831"},{"issue":"1","key":"1629_CR57","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1068\/p180055","volume":"18","author":"K Nakayama","year":"1989","unstructured":"Nakayama, K., Shimojo, S., & Silverman, G. H. (1989). Stereoscopic depth: its relation to image segmentation, grouping, and the recognition of occluded objects. Perception, 18(1), 55\u201368.","journal-title":"Perception"},{"key":"1629_CR58","unstructured":"Nikolenko, S. I. (2019). Synthetic data for deep learning. arXiv"},{"key":"1629_CR59","doi-asserted-by":"crossref","unstructured":"Nilsson, D., & Sminchisescu, C. (2018). Semantic video segmentation by gated recurrent flow propagation. In CVPR.","DOI":"10.1109\/CVPR.2018.00713"},{"key":"1629_CR60","doi-asserted-by":"crossref","unstructured":"Oh, S. W., Lee, J. Y., Sunkavalli, K., & Kim, S. J. (2018). Fast video object segmentation by reference-guided mask propagation. In CVPR.","DOI":"10.1109\/CVPR.2018.00770"},{"key":"1629_CR61","doi-asserted-by":"crossref","unstructured":"Oh, S. W., Lee, J. Y., Xu, N., & Kim, S. J. (2019). Video object segmentation using space-time memory networks. In ICCV.","DOI":"10.1109\/ICCV.2019.00932"},{"key":"1629_CR62","doi-asserted-by":"crossref","unstructured":"Perazzi, F., Pont-Tuset, J., McWilliams, B., Van\u00a0Gool, L., Gross, M., & Sorkine-Hornung, A. (2016). A benchmark dataset and evaluation methodology for video object segmentation. In CVPR.","DOI":"10.1109\/CVPR.2016.85"},{"key":"1629_CR63","doi-asserted-by":"crossref","unstructured":"Qi, J., Gao, Y., Hu, Y., Wang, X., Liu, X., Bai, X., Belongie, S., Yuille, A., Torr, P., & Bai, S. (2021). Occluded video instance segmentation: Dataset and ICCV 2021 challenge. 
In Thirty-fifth conference on neural information processing systems datasets and benchmarks track.","DOI":"10.1007\/s11263-022-01629-1"},{"issue":"3","key":"1629_CR64","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","volume":"115","author":"O Russakovsky","year":"2015","unstructured":"Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). Imagenet large scale visual recognition challenge. IJCV, 115(3), 211\u2013252.","journal-title":"IJCV"},{"key":"1629_CR65","unstructured":"Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., & Sun, J. (2018). Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123"},{"issue":"7","key":"1629_CR66","first-page":"1442","volume":"36","author":"AW Smeulders","year":"2013","unstructured":"Smeulders, A. W., Chu, D. M., Cucchiara, R., Calderara, S., Dehghan, A., & Shah, M. (2013). Visual tracking: An experimental survey. IEEE TPAMI, 36(7), 1442\u20131468.","journal-title":"IEEE TPAMI"},{"key":"1629_CR67","doi-asserted-by":"crossref","unstructured":"Tian, Z., Shen, C., Chen, H., & He, T. (2019). FCOS: Fully convolutional one-stage object detection. In ICCV (pp. 9627\u20139636).","DOI":"10.1109\/ICCV.2019.00972"},{"key":"1629_CR68","doi-asserted-by":"crossref","unstructured":"Tokmakov, P., Alahari, K., & Schmid, C. (2017). Learning motion patterns in videos. In CVPR.","DOI":"10.1109\/CVPR.2017.64"},{"key":"1629_CR69","doi-asserted-by":"crossref","unstructured":"Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., & Chen, L. C. (2019). FEELVOS: Fast end-to-end embedding learning for video object segmentation. In CVPR.","DOI":"10.1109\/CVPR.2019.00971"},{"key":"1629_CR70","doi-asserted-by":"crossref","unstructured":"Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B. B. G., Geiger, A., & Leibe, B. (2019). MOTS: Multi-object tracking and segmentation. 
In CVPR.","DOI":"10.1109\/CVPR.2019.00813"},{"key":"1629_CR71","doi-asserted-by":"crossref","unstructured":"Voigtlaender, P., & Leibe, B. (2017). Online adaptation of convolutional neural networks for video object segmentation. In BMVC.","DOI":"10.5244\/C.31.116"},{"key":"1629_CR72","doi-asserted-by":"crossref","unstructured":"Wang, H., Jiang, X., Ren, H., Hu, Y., & Bai, S. (2021). Swiftnet: Real-time video object segmentation. In CVPR.","DOI":"10.1109\/CVPR46437.2021.00135"},{"key":"1629_CR73","doi-asserted-by":"crossref","unstructured":"Wang, W., Feiszli, M., Wang, H., & Tran, D. (2021). Unidentified video objects: A benchmark for dense, open-world segmentation. arXiv preprint arXiv:2104.04691","DOI":"10.1109\/ICCV48922.2021.01060"},{"key":"1629_CR74","doi-asserted-by":"crossref","unstructured":"Wang, W., Song, H., Zhao, S., Shen, J., & Ling, H. (2019). Learning unsupervised video object segmentation through visual attention. In CVPR","DOI":"10.1109\/CVPR.2019.00318"},{"key":"1629_CR75","doi-asserted-by":"crossref","unstructured":"Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In CVPR (pp. 7794\u20137803).","DOI":"10.1109\/CVPR.2018.00813"},{"key":"1629_CR76","doi-asserted-by":"crossref","unstructured":"Wang, X., Xiao, T., Jiang, Y., Shao, S., Sun, J., & Shen, C. (2018). Repulsion loss: Detecting pedestrians in a crowd. In CVPR.","DOI":"10.1109\/CVPR.2018.00811"},{"key":"1629_CR77","doi-asserted-by":"crossref","unstructured":"Wang, Y., Xu, Z., Wang, X., Shen, C., Cheng, B., Shen, H., & Xia, H. (2020). End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503","DOI":"10.1109\/CVPR46437.2021.00863"},{"key":"1629_CR78","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2020.102907","volume":"193","author":"L Wen","year":"2020","unstructured":"Wen, L., Du, D., Cai, Z., Lei, Z., Chang, M. C., Qi, H., Lim, J., Yang, M. H., & Lyu, S. (2020). 
UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking. Computer Vision and Image Understanding, 193, 102907.","journal-title":"Computer Vision and Image Understanding"},{"key":"1629_CR79","doi-asserted-by":"crossref","unstructured":"Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., & Yuan, J. (2021). Track to detect and segment: An online multi-object tracker. In CVPR.","DOI":"10.1109\/CVPR46437.2021.01217"},{"key":"1629_CR80","doi-asserted-by":"crossref","unstructured":"Wu, J., Song, L., Wang, T., Zhang, Q., & Yuan, J. (2020). Forest R-CNN: Large-vocabulary long-tailed object detection and instance segmentation. In ACM Multimedia.","DOI":"10.1145\/3394171.3413970"},{"key":"1629_CR81","doi-asserted-by":"crossref","unstructured":"Wu, J., Zhou, C., Yang, M., Zhang, Q., Li, Y., & Yuan, J. (2020). Temporal-context enhanced detection of heavily occluded pedestrians. In CVPR.","DOI":"10.1109\/CVPR42600.2020.01344"},{"key":"1629_CR82","doi-asserted-by":"crossref","unstructured":"Xie, S., Girshick, R., Doll\u00e1r, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In CVPR.","DOI":"10.1109\/CVPR.2017.634"},{"key":"1629_CR83","doi-asserted-by":"crossref","unstructured":"Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., & Urtasun, R. (2019). Upsnet: A unified panoptic segmentation network. In CVPR.","DOI":"10.1109\/CVPR.2019.00902"},{"key":"1629_CR84","doi-asserted-by":"crossref","unstructured":"Xu, J., Cao, Y., Zhang, Z., & Hu, H. (2019). Spatial-temporal relation networks for multi-object tracking. In ICCV (pp. 3988\u20133998).","DOI":"10.1109\/ICCV.2019.00409"},{"key":"1629_CR85","doi-asserted-by":"crossref","unstructured":"Xu, N., Yang, L., Fan, Y., Yang, J., Yue, D., Liang, Y., Price, B., Cohen, S., & Huang, T. (2018). Youtube-vos: Sequence-to-sequence video object segmentation. 
In ECCV.","DOI":"10.1007\/978-3-030-01228-1_36"},{"key":"1629_CR86","doi-asserted-by":"crossref","unstructured":"Xu, Z., Zhang, W., Tan, X., Yang, W., Huang, H., Wen, S., Ding, E., & Huang, L. (2020). Segment as points for efficient online multi-object tracking and segmentation. In ECCV.","DOI":"10.1007\/978-3-030-58452-8_16"},{"key":"1629_CR87","doi-asserted-by":"crossref","unstructured":"Yang, L., Fan, Y., & Xu, N. (2019). Video instance segmentation. In ICCV.","DOI":"10.1109\/ICCV.2019.00529"},{"key":"1629_CR88","doi-asserted-by":"crossref","unstructured":"Yang, S., Fang, Y., Wang, X., Li, Y., Fang, C., Shan, Y., Feng, B., & Liu, W. (2021). Crossover learning for fast online video instance segmentation. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00794"},{"key":"1629_CR89","doi-asserted-by":"crossref","unstructured":"Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV (pp. 6023\u20136032).","DOI":"10.1109\/ICCV.2019.00612"},{"key":"1629_CR90","doi-asserted-by":"crossref","unstructured":"Zhan, X., Pan, X., Dai, B., Liu, Z., Lin, D., & Loy, C. C. (2020). Self-supervised scene de-occlusion. In CVPR.","DOI":"10.1109\/CVPR42600.2020.00384"},{"key":"1629_CR91","doi-asserted-by":"crossref","unstructured":"Zhang, S., Benenson, R., & Schiele, B. (2017). Citypersons: A diverse dataset for pedestrian detection. In CVPR (pp. 3213\u20133221).","DOI":"10.1109\/CVPR.2017.474"},{"key":"1629_CR92","doi-asserted-by":"crossref","unstructured":"Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. Z. (2018). Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In ECCV.","DOI":"10.1007\/978-3-030-01219-9_39"},{"key":"1629_CR93","doi-asserted-by":"crossref","unstructured":"Zhou, C., & Yuan, J. (2018). Bi-box regression for pedestrian detection and occlusion estimation. In ECCV (pp. 
135\u2013151).","DOI":"10.1007\/978-3-030-01246-5_9"},{"key":"1629_CR94","doi-asserted-by":"crossref","unstructured":"Zhu, J., Yang, H., Liu, N., Kim, M., Zhang, W., & Yang, M. H. (2018). Online multi-object tracking with dual matching attention networks. In ECCV (pp. 366\u2013382)","DOI":"10.1007\/978-3-030-01228-1_23"},{"key":"1629_CR95","doi-asserted-by":"crossref","unstructured":"Zhu, X., Xiong, Y., Dai, J., Yuan, L., & Wei, Y. (2017). Deep feature flow for video recognition. In CVPR.","DOI":"10.1109\/CVPR.2017.441"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-022-01629-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-022-01629-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-022-01629-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,7,14]],"date-time":"2022-07-14T09:20:41Z","timestamp":1657790441000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-022-01629-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,18]]},"references-count":95,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2022,8]]}},"alternative-id":["1629"],"URL":"https:\/\/doi.org\/10.1007\/s11263-022-01629-1","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,6,18]]},"assertion":[{"value":"8 November 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 May 
2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 June 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}