{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,30]],"date-time":"2025-07-30T16:05:22Z","timestamp":1753891522153,"version":"3.41.2"},"reference-count":47,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2022,10,13]],"date-time":"2022-10-13T00:00:00Z","timestamp":1665619200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Neurorobot."],
"abstract":"<jats:p>We study video inpainting, which aims to recover realistic textures in damaged frames. Recent progress has been made by taking other frames as references so that relevant textures can be transferred to damaged frames. However, existing video inpainting approaches neglect the model's ability to extract information and reconstruct content, so the textures that should be transferred cannot be reconstructed accurately. In this paper, we propose a novel and effective spatial-temporal texture transformer network (STTTN) for video inpainting. STTTN consists of six closely related modules optimized for video inpainting: a feature similarity measure for more accurate frame pre-repair, an encoder with strong information extraction ability, an embedding module for finding correlations, coarse low-frequency feature transfer, refined high-frequency feature transfer, and a decoder with accurate content reconstruction ability. Such a design encourages joint feature learning across the input and reference frames. To demonstrate the effectiveness and superiority of the proposed model, we conduct comprehensive ablation studies and qualitative and quantitative experiments on multiple datasets using standard stationary masks and more realistic moving-object masks. The excellent experimental results demonstrate the authenticity and reliability of STTTN.<\/jats:p>",
"DOI":"10.3389\/fnbot.2022.1002453","type":"journal-article","created":{"date-parts":[[2022,10,13]],"date-time":"2022-10-13T05:05:30Z","timestamp":1665637530000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Learning a spatial-temporal texture transformer network for video inpainting"],"prefix":"10.3389","volume":"16","author":[{"given":"Pengsen","family":"Ma","sequence":"first","affiliation":[]},{"given":"Tao","family":"Xue","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2022,10,13]]},"reference":[
{"key":"B1","first-page":"6836","article-title":"\u201cVivit: a video vision transformer,\u201d","author":"Arnab","year":"2021","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},
{"key":"B2","doi-asserted-by":"publisher","first-page":"1200","DOI":"10.1109\/83.935036","article-title":"Filling-in by joint interpolation of vector fields and gray levels","volume":"10","author":"Ballester","year":"2001","journal-title":"IEEE Trans. Image Process"},
{"key":"B3","first-page":"25","article-title":"\u201cSuper-resolution enhancement of video,\u201d","author":"Bishop","year":"2003","journal-title":"International Workshop on Artificial Intelligence and Statistics"},
{"key":"B4","doi-asserted-by":"publisher","first-page":"27","DOI":"10.3389\/fams.2019.00027","article-title":"Modeling variational inpainting methods with splines","volume":"5","author":"Bo\u00dfmann","year":"2019","journal-title":"Front. Appl. Math. Stat"},
{"key":"B5","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.565","article-title":"The 2018 davis challenge on video object segmentation","author":"Caelles","year":"2018","journal-title":"arXiv preprint arXiv:1803.00557"},
{"key":"B6","first-page":"9066","article-title":"\u201cFree-form video inpainting with 3d gated convolution and temporal patchgan,\u201d","author":"Chang","year":"","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},
{"key":"B7","article-title":"Learnable gated temporal shift module for deep video inpainting","author":"Chang","year":"","journal-title":"arXiv preprint arXiv:1907.01131"},
{"key":"B8","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1907.01131","article-title":"Learnable gated temporal shift module for deep video inpainting","author":"Chang","year":"","journal-title":"arXiv preprint arXiv:1907.01131"},
{"key":"B9","article-title":"\u201cVornet: spatio-temporally consistent video inpainting for object removal,\u201d","author":"Chang","year":"","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops"},
{"key":"B10","first-page":"589","article-title":"\u201cVisformer: The vision-friendly transformer,\u201d","author":"Chen","year":"2021","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},
{"key":"B11","doi-asserted-by":"crossref","first-page":"248","DOI":"10.1109\/CVPR.2009.5206848","article-title":"\u201cImagenet: a large-scale hierarchical image database,\u201d","author":"Deng","year":"2009","journal-title":"2009 IEEE Conference on Computer Vision and Pattern Recognition"},
{"key":"B12","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2010.11929","article-title":"An image is worth 16x16 words: transformers for image recognition at scale","author":"Dosovitskiy","year":"2020","journal-title":"arXiv preprint arXiv:2010.11929"},
{"key":"B13","first-page":"713","article-title":"\u201cFlow-edge guided video completion,\u201d","author":"Gao","year":"2020","journal-title":"European Conference on Computer Vision"},
{"key":"B14","doi-asserted-by":"publisher","first-page":"923456","DOI":"10.3389\/fmed.2022.923456","article-title":"Chest l-transformer: local features with position attention for weakly supervised chest radiograph segmentation and classification","volume":"9","author":"Gu","year":"2022","journal-title":"Front. Med"},
{"key":"B15","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2012.12556","article-title":"A survey on visual transformer","author":"Han","year":"2020","journal-title":"arXiv e-prints, arXiv: 2012.12556"},
{"key":"B16","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3072959.3073659","article-title":"Globally and locally consistent image completion","volume":"36","author":"Iizuka","year":"2017","journal-title":"ACM Trans. Graphics"},
{"key":"B17","first-page":"694","article-title":"\u201cPerceptual losses for real-time style transfer and super-resolution,\u201d","author":"Johnson","year":"2016","journal-title":"European Conference on Computer Vision"},
{"key":"B18","first-page":"5792","article-title":"\u201cDeep video inpainting,\u201d","author":"Kim","year":"2019","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},
{"key":"B19","first-page":"106","article-title":"\u201cSpatio-temporal transformer network for video restoration,\u201d","author":"Kim","year":"2018","journal-title":"Proceedings of the European Conference on Computer Vision (ECCV)"},
{"key":"B20","doi-asserted-by":"crossref","DOI":"10.21236\/ADA459805","author":"Kobla","year":"1996","journal-title":"Feature normalization for video indexing and retrieval"},
{"key":"B21","doi-asserted-by":"publisher","first-page":"2599","DOI":"10.1109\/TPAMI.2018.2865304","article-title":"Fast and accurate image super-resolution with deep laplacian pyramid networks","volume":"41","author":"Lai","year":"","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell"},
{"key":"B22","first-page":"170","article-title":"\u201cLearning blind video temporal consistency,\u201d","author":"Lai","year":"","journal-title":"Proceedings of the European Conference on Computer Vision (ECCV)"},
{"key":"B23","first-page":"14599","article-title":"\u201cFlow-guided video inpainting with scene templates,\u201d","author":"Lao","year":"2021","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},
{"key":"B24","article-title":"Vitgan: Training gans with vision transformers","author":"Lee","year":"2021","journal-title":"arXiv preprint arXiv"},
{"key":"B25","first-page":"4413","article-title":"\u201cCopy-and-paste networks for deep video inpainting,\u201d","author":"Lee","year":"2019","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},
{"key":"B26","doi-asserted-by":"crossref","first-page":"305","DOI":"10.1109\/ICCV.2003.1238360","article-title":"\u201cLearning how to inpaint from global image statistics,\u201d","author":"Levin","year":"2003","journal-title":"Proceedings Ninth IEEE International Conference on Computer Vision, Vol. 1"},
{"key":"B27","doi-asserted-by":"publisher","first-page":"933660","DOI":"10.3389\/fnins.2022.933660","article-title":"Convolutional recurrent neural network for dynamic functional mri analysis and brain disease identification","volume":"16","author":"Lin","year":"2022","journal-title":"Front. Neurosci"},
{"key":"B28","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2104.06637","article-title":"Decoupled spatial-temporal transformer for video inpainting","author":"Liu","year":"","journal-title":"arXiv preprint arXiv:2104.06637"},
{"key":"B29","first-page":"14040","article-title":"\u201cFuseformer: fusing fine-grained information in transformers for video inpainting,\u201d","author":"Liu","year":"","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},
{"key":"B30","first-page":"10012","article-title":"\u201cSwin transformer: hierarchical vision transformer using shifted windows,\u201d","author":"Liu","year":"","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},
{"key":"B31","first-page":"4403","article-title":"\u201cOnion-peel networks for deep video completion,\u201d","author":"Oh","year":"2019","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},
{"key":"B32","first-page":"724","article-title":"\u201cA benchmark dataset and evaluation methodology for video object segmentation,\u201d","author":"Perazzi","year":"2016","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},
{"key":"B33","doi-asserted-by":"publisher","first-page":"3300","DOI":"10.1109\/TPAMI.2021.3050918","article-title":"Spatiotemporal co-attention recurrent neural networks for human-skeleton motion prediction","volume":"44","author":"Shu","year":"2021","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell"},
{"key":"B34","first-page":"2481","article-title":"\u201cDistillation-guided image inpainting,\u201d","author":"Suin","year":"2021","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},
{"key":"B35","doi-asserted-by":"publisher","first-page":"636","DOI":"10.1109\/TPAMI.2019.2928540","article-title":"Coherence constrained graph lstm for group activity recognition","volume":"44","author":"Tang","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell"},
{"key":"B36","doi-asserted-by":"publisher","first-page":"23","DOI":"10.1080\/10867651.2004.10487596","article-title":"An image inpainting technique based on the fast marching method","volume":"9","author":"Telea","year":"2004","journal-title":"J. Graph. Tools"},
{"key":"B37","doi-asserted-by":"publisher","first-page":"5232","DOI":"10.1609\/aaai.v33i01.33015232","article-title":"Video inpainting by jointly learning temporal structure and spatial details","volume":"33","author":"Wang","year":"2019","journal-title":"Proc. AAAI Conf. Artif. Intell"},
{"key":"B38","doi-asserted-by":"publisher","first-page":"824592","DOI":"10.3389\/fnbot.2021.824592","article-title":"Progressive multi-scale vision transformer for facial action unit detection","volume":"15","author":"Wang","year":"2022","journal-title":"Front. Neurorob"},
{"key":"B39","doi-asserted-by":"publisher","first-page":"107448","DOI":"10.1016\/j.patcog.2020.107448","article-title":"Multistage attention network for image inpainting","volume":"106","author":"Wang","year":"2020","journal-title":"Pattern Recogn"},
{"key":"B40","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1808.06601","article-title":"Video-to-video synthesis","author":"Wang","year":"2018","journal-title":"arXiv preprint arXiv:1808.06601"},
{"key":"B41","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01228-1_36","article-title":"Youtube-vos: a large-scale video object segmentation benchmark","author":"Xu","year":"2018","journal-title":"arXiv preprint arXiv"},
{"key":"B42","first-page":"3723","article-title":"\u201cDeep flow-guided video inpainting,\u201d","author":"Xu","year":"2019","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},
{"key":"B43","first-page":"5791","article-title":"\u201cLearning texture transformer network for image super-resolution,\u201d","author":"Yang","year":"2020","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},
{"key":"B44","first-page":"4471","article-title":"\u201cFree-form image inpainting with gated convolution,\u201d","author":"Yu","year":"2019","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},
{"key":"B45","doi-asserted-by":"publisher","first-page":"12733","DOI":"10.1609\/aaai.v34i07.6967","article-title":"Region normalization for image inpainting","volume":"34","author":"Yu","year":"2020","journal-title":"Proc. AAAI Conf. Artif. Intell"},
{"key":"B46","first-page":"528","article-title":"\u201cLearning joint spatial-temporal transformations for video inpainting,\u201d","author":"Zeng","year":"2020","journal-title":"European Conference on Computer Vision"},
{"key":"B47","first-page":"16448","article-title":"\u201cProgressive temporal feature alignment network for video inpainting,\u201d","author":"Zou","year":"2021","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"}],
"container-title":["Frontiers in Neurorobotics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2022.1002453\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,10,13]],"date-time":"2022-10-13T05:06:07Z","timestamp":1665637567000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2022.1002453\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,13]]},"references-count":47,"alternative-id":["10.3389\/fnbot.2022.1002453"],"URL":"https:\/\/doi.org\/10.3389\/fnbot.2022.1002453","relation":{},"ISSN":["1662-5218"],"issn-type":[{"type":"electronic","value":"1662-5218"}],"subject":[],"published":{"date-parts":[[2022,10,13]]},"article-number":"1002453"}}