{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,8]],"date-time":"2025-11-08T13:25:18Z","timestamp":1762608318895,"version":"build-2065373602"},"reference-count":40,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2021,5,29]],"date-time":"2021-05-29T00:00:00Z","timestamp":1622246400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61907025, 61702278"],"award-info":[{"award-number":["61907025, 61702278"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Natural Science Foundation of Jiangsu Higher Education Institutions of China","award":["19KJB520048"],"award-info":[{"award-number":["19KJB520048"]}]},{"name":"NUPTSF","award":["NY219069"],"award-info":[{"award-number":["NY219069"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Subtitles are crucial for video content understanding. However, a large amount of videos have only burned-in, hardcoded subtitles that prevent video re-editing, translation, etc. In this paper, we construct a deep-learning-based system for the inverse conversion of a burned-in subtitle video to a subtitle file and an inpainted video, by coupling three deep neural networks (CTPN, CRNN, and EdgeConnect). We evaluated the performance of the proposed method and found that the deep learning method achieved high-precision separation of the subtitles and video frames and significantly improved the video inpainting results compared to the existing methods. 
This research fills a gap in the application of deep learning to burned-in subtitle video reconstruction and is expected to be widely applied in the reconstruction and re-editing of videos with subtitles, advertisements, logos, and other occlusions.<\/jats:p>","DOI":"10.3390\/info12060233","type":"journal-article","created":{"date-parts":[[2021,5,31]],"date-time":"2021-05-31T00:22:15Z","timestamp":1622420535000},"page":"233","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Joint Subtitle Extraction and Frame Inpainting for Videos with Burned-In Subtitles"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9245-7611","authenticated-orcid":false,"given":"Haoran","family":"Xu","sequence":"first","affiliation":[{"name":"School of Electronic and Optical Engineering & School of Microelectronics, Nanjing University of Posts and Telecommunications, Nanjing 210049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yanbai","family":"He","sequence":"additional","affiliation":[{"name":"School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xinya","family":"Li","sequence":"additional","affiliation":[{"name":"School of Educational Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaoying","family":"Hu","sequence":"additional","affiliation":[{"name":"School of Computer Engineering, Tongda College of Nanjing University of Posts and Telecommunications, Yangzhou 225127, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3887-5438","authenticated-orcid":false,"given":"Chuanyan","family":"Hao","sequence":"additional","affiliation":[{"name":"School of Educational Science and 
Technology, Nanjing University of Posts and Telecommunications, Nanjing 210049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6451-7565","authenticated-orcid":false,"given":"Bo","family":"Jiang","sequence":"additional","affiliation":[{"name":"School of Educational Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,5,29]]},"reference":[{"key":"ref_1","first-page":"461","article-title":"Automatic Text Detection and Removal in Video Images","volume":"13","author":"Liqin","year":"2008","journal-title":"Chin. J. Image Graph."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"131","DOI":"10.1016\/j.image.2017.09.013","article-title":"End-to-end subtitle detection and recognition for videos in East Asian languages via CNN ensemble","volume":"60","author":"Xu","year":"2018","journal-title":"Signal Process. Image Commun."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"368","DOI":"10.1016\/j.patrec.2020.01.019","article-title":"End-to-end video subtitle recognition via a deep Residual Neural Network","volume":"131","author":"Yan","year":"2020","journal-title":"Pattern Recognit. Lett."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Favorskaya, M.N., Zotin, A.G., and Damov, M.V. (2010, January 18\u201320). Intelligent inpainting system for texture reconstruction in videos with text removal. Proceedings of the International Congress on Ultra Modern Telecommunications and Control Systems, Moscow, Russia.","DOI":"10.1109\/ICUMT.2010.5676476"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Khodadadi, M., and Behrad, A. (2012, January 15\u201317). Text localization, extraction and inpainting in color images. 
Proceedings of the 20th Iranian Conference on Electrical Engineering (ICEE2012), Tehran, Iran.","DOI":"10.1109\/IranianCEE.2012.6292505"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1907","DOI":"10.1016\/j.sigpro.2008.02.002","article-title":"A new approach for text segmentation using a stroke filter","volume":"88","author":"Jung","year":"2008","journal-title":"Signal Process."},{"key":"ref_7","unstructured":"Zhang, D.Q., and Chang, S.F. (July, January 27). Learning to detect scene text using a higher-order MRF with belief propagation. Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop, Washington, DC, USA."},{"key":"ref_8","unstructured":"Wolf, C., Jolion, J.M., and Chassaing, F. (2002, January 11\u201315). Text localization, enhancement and binarization in multimedia documents. Proceedings of the Object Recognition Supported by User Interaction for Service Robots, Quebec City, QC, Canada."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Tian, Z., Huang, W., He, T., He, P., and Qiao, Y. (2016). Detecting text in natural image with connectionist text proposal network. Proceedings of the European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46484-8_4"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"353","DOI":"10.1017\/S0956792502004904","article-title":"Digital inpainting based on the Mumford\u2013Shah\u2013Euler image model","volume":"13","author":"Esedoglu","year":"2002","journal-title":"Eur. J. Appl. Math."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1273","DOI":"10.1109\/TCSVT.2007.903663","article-title":"Image compression with edge-based inpainting","volume":"17","author":"Liu","year":"2007","journal-title":"IEEE Trans. Circuits Syst. 
Video Technol."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1200","DOI":"10.1109\/83.935036","article-title":"Filling-in by joint interpolation of vector fields and gray levels","volume":"10","author":"Ballester","year":"2001","journal-title":"IEEE Trans. Image Process."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2185520.2185578","article-title":"Image melding: Combining inconsistent images using patch-based synthesis","volume":"31","author":"Darabi","year":"2012","journal-title":"Acm Trans. Graph. (TOG)"},{"key":"ref_14","first-page":"1","article-title":"Image completion using planar structure guidance","volume":"33","author":"Huang","year":"2014","journal-title":"Acm Trans. Graph. (TOG)"},{"key":"ref_15","unstructured":"Nazeri, K., Ng, E., Joseph, T., Qureshi, F.Z., and Ebrahimi, M. (2019). Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"2298","DOI":"10.1109\/TPAMI.2016.2646371","article-title":"An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition","volume":"39","author":"Shi","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_17","first-page":"223","article-title":"Intelligent method of texture reconstruction in video sequences based on neural networks","volume":"5","author":"Favorskaya","year":"2013","journal-title":"Int. J. Reason. Based Intell. Syst."},{"key":"ref_18","unstructured":"Vuong, T.L., Le, D.M., Le, T.T., and Le, T.H. (2016, January 27\u201330). Pre-rendered subtitles removal in video sequences using text detection and inpainting. Proceedings of the International Conference on Electronics, Information and Communication, Danang, Vietnam."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Jaderberg, M., Vedaldi, A., and Zisserman, A. (2014). Deep features for text spotting. 
Proceedings of the European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-10593-2_34"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Busta, M., Neumann, L., and Matas, J. (2015, January 7\u201313). Fastext: Efficient unconstrained scene text detector. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.143"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Huang, W., Qiao, Y., and Tang, X. (2014). Robust scene text detection with convolution neural network induced mser trees. Proceedings of the European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-10593-2_33"},{"key":"ref_22","first-page":"970","article-title":"Robust text detection in natural scene images","volume":"36","author":"Yin","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_23","unstructured":"Wang, T., Wu, D.J., Coates, A., and Ng, A.Y. (2012, January 11\u201315). End-to-end text recognition with convolutional neural networks. Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba Science City, Japan."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Bissacco, A., Cummins, M., Netzer, Y., and Neven, H. (2013, January 1\u20138). Photoocr: Reading text in uncontrolled conditions. Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia.","DOI":"10.1109\/ICCV.2013.102"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/s11263-015-0823-z","article-title":"Reading text in the wild with convolutional neural networks","volume":"116","author":"Jaderberg","year":"2016","journal-title":"Int. J. Comput. Vis."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich feature hierarchies for accurate object detection and semantic segmentation. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_27","first-page":"1097","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"855","DOI":"10.1109\/TPAMI.2008.137","article-title":"A novel connectionist system for unconstrained handwriting recognition","volume":"31","author":"Graves","year":"2008","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., and Huang, T.S. (2018, January 18\u201323). Generative image inpainting with contextual attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00577"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yeh, R.A., Chen, C., Yian Lim, T., Schwing, A.G., Hasegawa-Johnson, M., and Do, M.N. (2017, January 21\u201326). Semantic image inpainting with deep generative models. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.728"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"107:1","DOI":"10.1145\/3072959.3073659","article-title":"Globally and Locally Consistent Image Completion","volume":"36","author":"Iizuka","year":"2017","journal-title":"Acm Trans. Graph."},{"key":"ref_32","unstructured":"Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition. arXiv."},{"key":"ref_33","unstructured":"Zeiler, M.D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv."},{"key":"ref_34","unstructured":"Kingma, D.P., and Ba, J. (2017). Adam: A Method for Stochastic Optimization. 
arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"1452","DOI":"10.1109\/TPAMI.2017.2723009","article-title":"Places: A 10 Million Image Database for Scene Recognition","volume":"40","author":"Zhou","year":"2018","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Yu, J., Jiang, Y., Wang, Z., Cao, Z., and Huang, T. (2016, January 15\u201319). Unitbox: An advanced object detection network. Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands.","DOI":"10.1145\/2964284.2967274"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1080\/10867651.2004.10487596","article-title":"An image inpainting technique based on the fast marching method","volume":"9","author":"Telea","year":"2004","journal-title":"J. Graph. Tools"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"8","DOI":"10.4236\/jcc.2019.73002","article-title":"Image quality assessment through FSIM, SSIM, MSE and PSNR\u2014A comparative study","volume":"7","author":"Sara","year":"2019","journal-title":"J. Comput. Commun."},{"key":"ref_39","unstructured":"Wang, Z., Simoncelli, E.P., and Bovik, A.C. (2003, January 9\u201312). Multiscale structural similarity for image quality assessment. Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA."},{"key":"ref_40","unstructured":"Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. 
arXiv."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/12\/6\/233\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:08:54Z","timestamp":1760162934000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/12\/6\/233"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,29]]},"references-count":40,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2021,6]]}},"alternative-id":["info12060233"],"URL":"https:\/\/doi.org\/10.3390\/info12060233","relation":{},"ISSN":["2078-2489"],"issn-type":[{"type":"electronic","value":"2078-2489"}],"subject":[],"published":{"date-parts":[[2021,5,29]]}}}