{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,30]],"date-time":"2025-07-30T14:21:32Z","timestamp":1753885292562,"version":"3.41.2"},"reference-count":32,"publisher":"World Scientific Pub Co Pte Ltd","issue":"02n03","funder":[{"name":"National Science Project","award":["KC4.0\/19-25"],"award-info":[{"award-number":["KC4.0\/19-25"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Int. J. As. Lang. Proc."],"published-print":{"date-parts":[[2022,9]]},"abstract":"<jats:p> Training a multi-speaker Text-to-Speech (TTS) model requires multiple speakers\u2019 voices to generate an average speech model. However, the average speech synthesis model will be distorted or averaged, resulting in low quality if the new speaker\u2019s voice has too little data to train. The existing methods require fine-tuning the model; otherwise, the model will achieve low adaptive quality. However, for synthesis voice to achieve high adaptive quality, at least thousands of fine-tuning steps are required. To solve these issues, in this paper, we propose a Vietnamese multi-speaker TTS adaptive-based technique that synthesizes high-quality speech and effectively adapts to new speakers, with two main improvements: (1) propose an Extracting Mel-Vector (EMV) architecture with three components, the Encoder\u2013Decoder\u2013Embedding Features, which enables complete learning of speaker features with Mel-spectrograms as input for few-shot training and (2) a continuous-learning technique called \u201cdata-distributing\u201d preserves the new speaker\u2019s characteristics after many training epochs. Our proposed model outperformed the baseline multi-speaker synthesis model and achieved a MOS score of 3.8\/4.6 and SIM of 2.6\/4 with only 1 min of the target speaker\u2019s voice. <\/jats:p>","DOI":"10.1142\/s2717554523500042","type":"journal-article","created":{"date-parts":[[2023,5,17]],"date-time":"2023-05-17T07:18:47Z","timestamp":1684307927000},"source":"Crossref","is-referenced-by-count":0,"title":["Improving Few-Shot Multi-Speaker Text-to-Speech Adaptive-Based with Extracting Mel-Vector (EMV) for Vietnamese"],"prefix":"10.1142","volume":"32","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7741-3570","authenticated-orcid":false,"given":"Phuong Pham","family":"Ngoc","sequence":"first","affiliation":[{"name":"Thai Nguyen University, Thai Nguyen, Vietnam"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chung Tran","family":"Quang","sequence":"additional","affiliation":[{"name":"AIMed Vietnam Artificial Intelligence Solutions, Ha Noi, Vietnam"},{"name":"Japan Advanced Institute of Science and Technology, (JAIST) Nomi, Ishikawa 923-1292, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mai Luong","family":"Chi","sequence":"additional","affiliation":[{"name":"Institute of Information Technology, Vietnam Academy of Science and Technology, Ha Noi, Vietnam"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"219","published-online":{"date-parts":[[2023,6,29]]},"reference":[{"key":"S2717554523500042BIB001","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"S2717554523500042BIB003","first-page":"5530","volume-title":"Int. Conf. Machine Learning","author":"Kim J.","year":"2021"},{"key":"S2717554523500042BIB004","first-page":"214","volume-title":"Proc. ICLR","author":"Ping W.","year":"2018"},{"key":"S2717554523500042BIB005","series-title":"Proceedings of Machine Learning Research","first-page":"5180","volume-title":"Proc. 35th Int. Conf. Machine Learning","volume":"80","author":"Wang Y.","year":"2018"},{"key":"S2717554523500042BIB006","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-74200-5_3"},{"key":"S2717554523500042BIB007","volume-title":"Advances in Neural Information Processing Systems","volume":"31","author":"Jia Y.","year":"2018"},{"key":"S2717554523500042BIB008","first-page":"35","volume-title":"Proc. 7th Int. Workshop on Vietnamese Language and Speech Processing","author":"Nguyen T. T. T.","year":"2020"},{"key":"S2717554523500042BIB009","doi-asserted-by":"publisher","DOI":"10.1109\/ICSDA.2009.5278366"},{"key":"S2717554523500042BIB010","first-page":"52","volume-title":"Proc. SAI Intelligent Systems Conf.","author":"Tits N.","year":"2019"},{"key":"S2717554523500042BIB012","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9054301"},{"key":"S2717554523500042BIB013","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9413880"},{"key":"S2717554523500042BIB014","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683501"},{"key":"S2717554523500042BIB015","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053520"},{"key":"S2717554523500042BIB016","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053436"},{"key":"S2717554523500042BIB017","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9054535"},{"key":"S2717554523500042BIB018","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.587"},{"key":"S2717554523500042BIB019","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.1611835114"},{"key":"S2717554523500042BIB020","doi-asserted-by":"publisher","DOI":"10.1109\/O-COCOSDA202152914.2021.9660445"},{"key":"S2717554523500042BIB021","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461375"},{"key":"S2717554523500042BIB022","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683120"},{"key":"S2717554523500042BIB023","first-page":"10019","volume":"31","author":"Arik S.","year":"2018","journal-title":"Advances in Neural Information Processing Systems"},{"key":"S2717554523500042BIB024","first-page":"7748","volume-title":"Int. Conf. Machine Learning","author":"Min D.","year":"2021"},{"key":"S2717554523500042BIB025","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01418-6_48"},{"key":"S2717554523500042BIB027","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2773081"},{"key":"S2717554523500042BIB028","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01225-0_5"},{"key":"S2717554523500042BIB030","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2022.3167258"},{"key":"S2717554523500042BIB031","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9414001"},{"key":"S2717554523500042BIB032","doi-asserted-by":"publisher","DOI":"10.1186\/s13636-019-0166-8"},{"key":"S2717554523500042BIB033","doi-asserted-by":"publisher","DOI":"10.1109\/SLT.2016.7846260"},{"key":"S2717554523500042BIB035","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-1386"},{"key":"S2717554523500042BIB037","first-page":"63","volume-title":"Spoken Languages Technologies for Under-Resourced Languages","author":"Kominek J.","year":"2008"},{"issue":"11","key":"S2717554523500042BIB038","first-page":"2579","volume":"9","author":"Van der Maaten L.","year":"2008","journal-title":"J. Mach. Learn. Res."}],"container-title":["International Journal of Asian Language Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.worldscientific.com\/doi\/pdf\/10.1142\/S2717554523500042","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,7,24]],"date-time":"2023-07-24T03:41:23Z","timestamp":1690170083000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.worldscientific.com\/doi\/10.1142\/S2717554523500042"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9]]},"references-count":32,"journal-issue":{"issue":"02n03","published-print":{"date-parts":[[2022,9]]}},"alternative-id":["10.1142\/S2717554523500042"],"URL":"https:\/\/doi.org\/10.1142\/s2717554523500042","relation":{},"ISSN":["2717-5545","2424-791X"],"issn-type":[{"type":"print","value":"2717-5545"},{"type":"electronic","value":"2424-791X"}],"subject":[],"published":{"date-parts":[[2022,9]]},"article-number":"2350004"}}