{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,30]],"date-time":"2025-07-30T14:13:36Z","timestamp":1753884816559,"version":"3.41.2"},"reference-count":55,"publisher":"World Scientific Pub Co Pte Ltd","issue":"09","funder":[{"name":"the National Key Research and Development Program of China","award":["2022YFF0902003"],"award-info":[{"award-number":["2022YFF0902003"]}]},{"name":"the Ningbo Key Research and Development Program","award":["2022Z097"],"award-info":[{"award-number":["2022Z097"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Int. J. Patt. Recogn. Artif. Intell."],"published-print":{"date-parts":[[2025,7]]},"abstract":"<jats:p> In recent times, audio-driven lip-synching generation for digital humans has attracted considerable attention. However, the prevailing methodologies frequently encounter challenges pertaining to elevated computational complexity and deficient real-time performance. Although the MuseTalk framework has achieved notable progress in inference efficiency through its end-to-end, latent-space-based single-step generation algorithm, it still suffers from noticeable lip jitter and insufficient synchronization between audio and lip movements. To address these limitations, we propose an enhanced multi-frame inpainting framework that integrates Variational Autoencoders (VAE) and a multi-scale U-Net architecture. Specifically, our approach directly synthesizes the occluded lip region by leveraging multi-frame visual references combined with corresponding audio embeddings, thereby effectively improving lip synchronization and maintaining identity consistency. Furthermore, we introduce a landmark-guided multi-frame sampling strategy designed to enhance model attention towards lip dynamics. To facilitate deeper feature extraction and fusion, we propose a hierarchical latent-space feature fusion network (FusionNet), incorporating global and local residual connections and an enhanced Convolutional Block Attention Module. Additionally, a frame interpolation technique is employed during inference to further smooth lip movements and significantly mitigate lip jitter. The model has been trained on a large-scale Chinese dataset and comprehensively evaluated using both Chinese and English datasets. The experimental results demonstrate that the proposed framework achieves high visual accuracy, consistent lip synchronization, and efficient real-time inference, highlighting its strong cross-lingual generalization capability. <\/jats:p>","DOI":"10.1142\/s021800142557006x","type":"journal-article","created":{"date-parts":[[2025,4,26]],"date-time":"2025-04-26T04:22:43Z","timestamp":1745641363000},"source":"Crossref","is-referenced-by-count":0,"title":["One-Step Multi-Frame Inpainting Framework for Real-Time Lip-Sync Digital Human Generation"],"prefix":"10.1142","volume":"39","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5720-6374","authenticated-orcid":false,"given":"Yijun","family":"Bei","sequence":"first","affiliation":[{"name":"School of Software Technology Zhejiang University, Ningbo 310048, P.\u00a0R.\u00a0China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-4680-2842","authenticated-orcid":false,"given":"Yunze","family":"Qi","sequence":"additional","affiliation":[{"name":"School of Software Technology Zhejiang University, Ningbo 310048, P.\u00a0R.\u00a0China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-5765-4094","authenticated-orcid":false,"given":"Hengrui","family":"Lou","sequence":"additional","affiliation":[{"name":"School of Software Technology Zhejiang University, Ningbo 310048, P.\u00a0R.\u00a0China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-6502-4454","authenticated-orcid":false,"given":"Erteng","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Software Technology Zhejiang University, Ningbo 310048, P.\u00a0R.\u00a0China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-0282-2376","authenticated-orcid":false,"given":"Ke","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Software Technology Zhejiang University, Ningbo 310048, P.\u00a0R.\u00a0China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-0407-2497","authenticated-orcid":false,"given":"Hongchang","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Software Technology Zhejiang University, Ningbo 310048, P.\u00a0R.\u00a0China"}]}],"member":"219","published-online":{"date-parts":[[2025,5,27]]},"reference":[{"key":"S021800142557006XBIB001","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2016.7477553"},{"journal-title":"J. Software","author":"Bei Y.","key":"S021800142557006XBIB002"},{"key":"S021800142557006XBIB004","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i10.28987"},{"key":"S021800142557006XBIB005","doi-asserted-by":"publisher","DOI":"10.1353\/lan.0.0054"},{"key":"S021800142557006XBIB008","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00802"},{"key":"S021800142557006XBIB010","doi-asserted-by":"publisher","DOI":"10.1109\/VR.2017.7892322"},{"first-page":"408","volume-title":"16th European Conf. Computer Vision","author":"Das D.","key":"S021800142557006XBIB012"},{"key":"S021800142557006XBIB013","doi-asserted-by":"publisher","DOI":"10.1142\/S0218001424540132"},{"key":"S021800142557006XBIB014","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6717"},{"key":"S021800142557006XBIB015","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00151"},{"key":"S021800142557006XBIB016","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00573"},{"key":"S021800142557006XBIB018","doi-asserted-by":"publisher","DOI":"10.1109\/WACV56688.2023.00518"},{"key":"S021800142557006XBIB019","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"S021800142557006XBIB021","first-page":"6626","volume":"30","author":"Heusel M.","year":"2017","journal-title":"Adv. Neural Inform. Process. Syst."},{"key":"S021800142557006XBIB023","first-page":"6840","volume":"33","author":"Ho J.","year":"2020","journal-title":"Adv. Neural Inform. Process. Syst."},{"first-page":"694","volume-title":"14th European Conf. Computer Vision","author":"Johnson J.","key":"S021800142557006XBIB024"},{"key":"S021800142557006XBIB027","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00338"},{"key":"S021800142557006XBIB028","doi-asserted-by":"publisher","DOI":"10.1109\/ICME52920.2022.9859720"},{"key":"S021800142557006XBIB029","doi-asserted-by":"publisher","DOI":"10.1142\/S0218001424520190"},{"key":"S021800142557006XBIB030","first-page":"28092","volume":"34","author":"Liu Z.","year":"2021","journal-title":"Adv. Neural Inform. Process. Syst."},{"key":"S021800142557006XBIB031","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2858232"},{"key":"S021800142557006XBIB032","doi-asserted-by":"publisher","DOI":"10.1109\/JAS.2022.105686"},{"key":"S021800142557006XBIB033","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.304"},{"key":"S021800142557006XBIB035","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553469"},{"key":"S021800142557006XBIB036","doi-asserted-by":"publisher","DOI":"10.1109\/WACV57701.2024.00521"},{"key":"S021800142557006XBIB037","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2024.3351601"},{"key":"S021800142557006XBIB039","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00070"},{"key":"S021800142557006XBIB040","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413532"},{"key":"S021800142557006XBIB041","first-page":"1428","volume-title":"Proc. 27th ACM Int. Conf. Multimedia (ACM MM, 2019)","author":"Prajwal K. R.","year":"2019"},{"key":"S021800142557006XBIB042","first-page":"28492","volume-title":"Int. Conf. Machine Learning (ICML, 2023)","author":"Radford A.","year":"2023"},{"key":"S021800142557006XBIB043","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01350"},{"key":"S021800142557006XBIB045","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"first-page":"234","volume-title":"18th Int. Conf. Medical Image Computing and Computer-Assisted Intervention","author":"Ronneberger O.","key":"S021800142557006XBIB046"},{"key":"S021800142557006XBIB047","doi-asserted-by":"publisher","DOI":"10.3758\/BF03211902"},{"key":"S021800142557006XBIB048","doi-asserted-by":"publisher","DOI":"10.1145\/3399715.3400873"},{"key":"S021800142557006XBIB049","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19775-8_39"},{"key":"S021800142557006XBIB050","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00197"},{"key":"S021800142557006XBIB052","first-page":"92","volume-title":"European Conf. Computer Vision (ECCV, 2024)","author":"Sun Y.","year":"2024"},{"key":"S021800142557006XBIB053","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72658-3_23"},{"key":"S021800142557006XBIB054","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2024.3409380"},{"key":"S021800142557006XBIB055","doi-asserted-by":"publisher","DOI":"10.1142\/S0218001424520037"},{"key":"S021800142557006XBIB056","doi-asserted-by":"publisher","DOI":"10.1007\/s11063-019-10116-7"},{"key":"S021800142557006XBIB057","first-page":"6306","volume":"30","author":"Van Den Oord A.","year":"2017","journal-title":"Adv. Neural Inform. Process. Syst."},{"key":"S021800142557006XBIB058","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01408"},{"key":"S021800142557006XBIB059","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58589-1_42"},{"key":"S021800142557006XBIB060","first-page":"7686","volume":"31","author":"Wang N.","year":"2018","journal-title":"Adv. Neural Inform. Process. Syst."},{"key":"S021800142557006XBIB062","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"S021800142557006XBIB063","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-73223-2_23"},{"key":"S021800142557006XBIB064","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00129"},{"key":"S021800142557006XBIB070","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2016.2603342"},{"key":"S021800142557006XBIB072","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i3.25464"},{"key":"S021800142557006XBIB073","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00366"},{"key":"S021800142557006XBIB074","doi-asserted-by":"publisher","DOI":"10.1142\/S0218001424540119"},{"key":"S021800142557006XBIB075","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00416"},{"issue":"6","key":"S021800142557006XBIB077","first-page":"1","volume":"39","author":"Zhou Y.","year":"2020","journal-title":"ACM Trans. Graph."}],"container-title":["International Journal of Pattern Recognition and Artificial Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.worldscientific.com\/doi\/pdf\/10.1142\/S021800142557006X","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,20]],"date-time":"2025-06-20T08:07:59Z","timestamp":1750406879000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.worldscientific.com\/doi\/10.1142\/S021800142557006X"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,27]]},"references-count":55,"journal-issue":{"issue":"09","published-print":{"date-parts":[[2025,7]]}},"alternative-id":["10.1142\/S021800142557006X"],"URL":"https:\/\/doi.org\/10.1142\/s021800142557006x","relation":{},"ISSN":["0218-0014","1793-6381"],"issn-type":[{"type":"print","value":"0218-0014"},{"type":"electronic","value":"1793-6381"}],"subject":[],"published":{"date-parts":[[2025,5,27]]},"article-number":"2557006"}}