{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,2]],"date-time":"2026-06-02T05:54:05Z","timestamp":1780379645672,"version":"3.54.1"},"reference-count":31,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2022,5,28]],"date-time":"2022-05-28T00:00:00Z","timestamp":1653696000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100006595","name":"Romanian Ministry of Education and Research","doi-asserted-by":"publisher","award":["PN-III-P1-1.1-PD-2019-0918"],"award-info":[{"award-number":["PN-III-P1-1.1-PD-2019-0918"]}],"id":[{"id":"10.13039\/501100006595","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>The task of converting text input into video content is becoming an important topic for synthetic media generation. Several methods have been proposed with some of them reaching close-to-natural performances in constrained tasks. In this paper, we tackle a subissue of the text-to-video generation problem, by converting the text into lip landmarks. However, we do this using a modular, controllable system architecture and evaluate each of its individual components. Our system, entitled FlexLip, is split into two separate modules: text-to-speech and speech-to-lip, both having underlying controllable deep neural network architectures. This modularity enables the easy replacement of each of its components, while also ensuring the fast adaptation to new speaker identities by disentangling or projecting the input features. We show that by using as little as 20 min of data for the audio generation component, and as little as 5 min for the speech-to-lip component, the objective measures of the generated lip landmarks are comparable with those obtained when using a larger set of training samples. We also introduce a series of objective evaluation measures over the complete flow of our system by taking into consideration several aspects of the data and system configuration. These aspects pertain to the quality and amount of training data, the use of pretrained models, and the data contained therein, as well as the identity of the target speaker; with regard to the latter, we show that we can perform zero-shot lip adaptation to an unseen identity by simply updating the shape of the lips in our model.<\/jats:p>","DOI":"10.3390\/s22114104","type":"journal-article","created":{"date-parts":[[2022,5,31]],"date-time":"2022-05-31T02:30:06Z","timestamp":1653964206000},"page":"4104","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["FlexLip: A Controllable Text-to-Lip System"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4354-4393","authenticated-orcid":false,"given":"Dan","family":"Onea\u021b\u0103","sequence":"first","affiliation":[{"name":"Speech and Dialogue Research Lab, University \u201cPolitehnica\u201d of Bucharest, 060042 Bucharest, Romania"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7728-5863","authenticated-orcid":false,"given":"Be\u00e1ta","family":"L\u0151rincz","sequence":"additional","affiliation":[{"name":"Faculty of Mathematics and Computer Science, \u201cBabe\u0219-Bolyai\u201d University, 400347 Cluj-Napoca, Romania"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2894-5770","authenticated-orcid":false,"given":"Adriana","family":"Stan","sequence":"additional","affiliation":[{"name":"Department of Communications, Technical University of Cluj-Napoca, 400114 Cluj-Napoca, Romania"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8711-3854","authenticated-orcid":false,"given":"Horia","family":"Cucu","sequence":"additional","affiliation":[{"name":"Speech and Dialogue Research Lab, University \u201cPolitehnica\u201d of Bucharest, 060042 Bucharest, Romania"},{"name":"Zevo Technology, 077042 Ro\u0219u, Chiajna, Romania"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2022,5,28]]},"reference":[{"key":"ref_1","unstructured":"Chung, J.S., Jamaludin, A., and Zisserman, A. (2017). You said that?. arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Thies, J., Elgharib, M., Tewari, A., Theobalt, C., and Nie\u00dfner, M. (2020, January 23\u201328). Neural voice puppetry: Audio-driven facial reenactment. Proceedings of the European Conference on Computer Vision, Virtual.","DOI":"10.1007\/978-3-030-58517-4_42"},{"key":"ref_3","unstructured":"Kumar, R., Sotelo, J., Kumar, K., de Br\u00e9bisson, A., and Bengio, Y. (2017). ObamaNet: Photo-realistic lip-sync from text. arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Zhang, S., Yuan, J., Liao, M., and Zhang, L. (2021). Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary. arXiv.","DOI":"10.1109\/ICASSP43922.2022.9747380"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3072959.3073640","article-title":"Synthesizing Obama: Learning lip sync from audio","volume":"36","author":"Suwajanakorn","year":"2017","journal-title":"Acm Trans. Graph."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Eskimez, S.E., Maddox, R.K., Xu, C., and Duan, Z. (2018, January 2\u20136). Generating talking face landmarks from speech. Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, Guildford, UK.","DOI":"10.1007\/978-3-319-93764-9_35"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Greenwood, D., Matthews, I., and Laycock, S. (2018, January 2\u20136). Joint learning of facial expression and head pose from speech. Proceedings of the Interspeech, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-2587"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Bregler, C., Covell, M., and Slaney, M. (1997, January 3\u20138). Video Rewrite: Driving Visual Speech with Audio. Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA.","DOI":"10.1145\/258734.258880"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"45","DOI":"10.1023\/A:1008166717597","article-title":"Visual Speech Synthesis by Morphing Visemes","volume":"38","author":"Ezzat","year":"2000","journal-title":"Int. J. Comput. Vision"},{"key":"ref_10","unstructured":"Taylor, S.L., Mahler, M., Theobald, B.J., and Matthews, I. (2012, January 29\u201331). Dynamic Units of Visual Speech. Proceedings of the ACM SIGGRAPH\/Eurographics Symposium on Computer Animation, Lausanne, Switzerland."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3306346.3323028","article-title":"Text-Based Editing of Talking-Head Video","volume":"38","author":"Fried","year":"2019","journal-title":"ACM Trans. Graph."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Kim, Y., Nam, S., Cho, I., and Kim, S.J. (2019). Unsupervised keypoint learning for guiding class-conditional video prediction. arXiv.","DOI":"10.1186\/s13640-019-0478-8"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Chen, L., Wu, Z., Ling, J., Li, R., Tan, X., and Zhao, S. (2021). Transformer-S2A: Robust and Efficient Speech-to-Animation. arXiv.","DOI":"10.1109\/ICASSP43922.2022.9747495"},{"key":"ref_14","unstructured":"Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., and Lee, H. (2017, January 6\u201311). Learning to generate long-term future via hierarchical prediction. Proceedings of the International Conference on Machine Learning, Sydney, Australia."},{"key":"ref_15","unstructured":"Aldeneh, Z., Fedzechkina, M., Seto, S., Metcalf, K., Sarabia, M., Apostoloff, N., and Theobald, B.J. (2022). Towards a Perceptual Model for Estimating the Quality of Visual Speech. arXiv."},{"key":"ref_16","unstructured":"van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio. arXiv."},{"key":"ref_17","unstructured":"Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.Y. (2021, January 3\u20137). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. Proceedings of the International Conference on Learning Representations, Virtual."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"\u0141a\u0144cucki, A. (2021, January 6\u201311). Fastpitch: Parallel text-to-speech with pitch prediction. Proceedings of the ICASSP 2021\u20132021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9413889"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Beliaev, S., and Ginsburg, B. (2021). TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction. arXiv.","DOI":"10.21437\/Interspeech.2021-1770"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Prenger, R., Valle, R., and Catanzaro, B. (2019, January 12\u201317). Waveglow: A flow-based generative network for speech synthesis. Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683143"},{"key":"ref_21","unstructured":"Kingma, D.P., and Dhariwal, P. (2018, January 3\u20138). Glow: Generative Flow with Invertible 1 \u00d7 1 Convolutions. Proceedings of the Advances in Neural Information Processing Systems, Montr\u00e9al, QC, Canada."},{"key":"ref_22","unstructured":"Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., and Zhong, J. (2021). SpeechBrain: A General-Purpose Speech Toolkit. arXiv."},{"key":"ref_23","unstructured":"Taylor, P., Black, A.W., and Caley, R. (1998, January 26\u201329). The architecture of the Festival speech synthesis system. Proceedings of the Third ESCA\/COCOSDA Workshop (ETRW) on Speech Synthesis, Blue Mountains, Australia."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015, January 19\u201324). Librispeech: An ASR corpus based on public domain audio books. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia.","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Zen, H., Clark, R., Weiss, R.J., Dang, V., Jia, Y., Wu, Y., Zhang, Y., and Chen, Z. (2019). LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. arXiv.","DOI":"10.21437\/Interspeech.2019-2441"},{"key":"ref_26","unstructured":"Ito, K., and Johnson, L. (2022, May 15). The LJ Speech Dataset. Available online: https:\/\/keithito.com\/LJ-Speech-Dataset\/."},{"key":"ref_27","first-page":"1755","article-title":"Dlib-ml: A Machine Learning Toolkit","volume":"10","author":"King","year":"2009","journal-title":"J. Mach. Learn. Res."},{"key":"ref_28","unstructured":"Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A.G. (2018, January 6\u201310). Averaging weights leads to wider optima and better generalization. Proceedings of the Uncertainty in Artificial Intelligence, Monterey, CA, USA."},{"key":"ref_29","unstructured":"Taylor, J., and Richmond, K. (September, January 30). Confidence Intervals for ASR-based TTS Evaluation. Proceedings of the Interspeech, Brno, Czech Republic."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"788","DOI":"10.1109\/TASL.2010.2064307","article-title":"Front-end factor analysis for speaker verification","volume":"19","author":"Dehak","year":"2010","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Desplanques, B., Thienpondt, J., and Demuynck, K. (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. arXiv.","DOI":"10.21437\/Interspeech.2020-2650"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/11\/4104\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:20:29Z","timestamp":1760138429000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/11\/4104"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,28]]},"references-count":31,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2022,6]]}},"alternative-id":["s22114104"],"URL":"https:\/\/doi.org\/10.3390\/s22114104","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,5,28]]}}}