{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T02:21:14Z","timestamp":1760149274979,"version":"build-2065373602"},"reference-count":30,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2023,7,28]],"date-time":"2023-07-28T00:00:00Z","timestamp":1690502400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Natural Science Foundation of China","award":["42075134"],"award-info":[{"award-number":["42075134"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>Knowing the correct positioning of the tongue and mouth for pronunciation is crucial for learning English pronunciation correctly. Articulatory animation is an effective way to address the above task and helpful to English learners. However, articulatory animations are all traditionally hand-drawn. Different situations require varying animation styles, so a comprehensive redraw of all the articulatory animations is necessary. To address this issue, we developed a method for the automatic generation of articulatory animations using a deep learning system. Our method leverages an automatic keypoint-based detection network, a motion transfer network, and a style transfer network to generate a series of articulatory animations that adhere to the desired style. By inputting a target-style articulation image, our system is capable of producing animations with the desired characteristics. We created a dataset of articulation images and animations from public sources, including the International Phonetic Association (IPA), to establish our articulation image animation dataset. We performed preprocessing on the articulation images by segmenting them into distinct areas each corresponding to a specific articulatory part, such as the tongue, upper jaw, lower jaw, soft palate, and vocal cords. We trained a deep neural network model capable of automatically detecting the keypoints in typical articulation images. Also, we trained a generative adversarial network (GAN) model that can generate end-to-end animation of different styles automatically from the characteristics of keypoints and the learned image style. To train a relatively robust model, we used four different style videos: one magnetic resonance imaging (MRI) articulatory video and three hand-drawn videos. For further applications, we combined the consonant and vowel animations together to generate a syllable animation and the animation of a word consisting of many syllables. 
Experiments show that this system can auto-generate articulatory animations according to input phonetic symbols and should be helpful to people for English articulation correction.<\/jats:p>","DOI":"10.3390\/computers12080150","type":"journal-article","created":{"date-parts":[[2023,7,28]],"date-time":"2023-07-28T07:35:24Z","timestamp":1690529724000},"page":"150","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["The Generation of Articulatory Animations Based on Keypoint Detection and Motion Transfer Combined with Image Style Transfer"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5217-3614","authenticated-orcid":false,"given":"Xufeng","family":"Ling","sequence":"first","affiliation":[{"name":"AI School, Tianhua College, Shanghai Normal University, No. 1661 North Sheng Xin Road, Shanghai 201815, China"}]},{"given":"Yu","family":"Zhu","sequence":"additional","affiliation":[{"name":"Shanghai Library, Institute of Scientific and Technical Information of Shanghai, 1555 West Huaihai Road, Shanghai 200030, China"}]},{"given":"Wei","family":"Liu","sequence":"additional","affiliation":[{"name":"AI School, Tianhua College, Shanghai Normal University, No. 1661 North Sheng Xin Road, Shanghai 201815, China"}]},{"given":"Jingxin","family":"Liang","sequence":"additional","affiliation":[{"name":"AI School, Tianhua College, Shanghai Normal University, No. 1661 North Sheng Xin Road, Shanghai 201815, China"}]},{"given":"Jie","family":"Yang","sequence":"additional","affiliation":[{"name":"Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, 800 Dongchuan Road, Shanghai 200240, China"}]}],"member":"1968","published-online":{"date-parts":[[2023,7,28]]},"reference":[{"key":"ref_1","first-page":"142","article-title":"Review of Speech Driven Facial Animation","volume":"22","author":"Li","year":"2017","journal-title":"Comput. Eng. Appl."},{"key":"ref_2","first-page":"142","article-title":"3D Visualization Method for Tongue Movements in Pronunciation","volume":"5","author":"Li","year":"2016","journal-title":"PR AI"},{"key":"ref_3","first-page":"142","article-title":"Chinese Speech Synchronized 3D Lip Animation","volume":"4","author":"Mi","year":"2015","journal-title":"Appl. Res. Comput."},{"key":"ref_4","first-page":"142","article-title":"Visual Speech Synthesis Based on Articulatory Trajectory","volume":"6","author":"Zheng","year":"2013","journal-title":"Comput. Appl. Softw."},{"key":"ref_5","first-page":"142","article-title":"Phonetic Training Based on Visualized Articulatory Model","volume":"1","author":"Zhi","year":"2020","journal-title":"J. Foreign Lang."},{"key":"ref_6","first-page":"142","article-title":"Speech-driven Articulator Motion Synthesis with Deep Neural Networks","volume":"6","author":"Tang","year":"2016","journal-title":"Acta Autom. Sin."},{"key":"ref_7","first-page":"142","article-title":"Physiology Based Tongue Modeling and Simulation","volume":"12","author":"Jiang","year":"2015","journal-title":"J. Comput.-Aided Des. Comput. Graph."},{"key":"ref_8","first-page":"142","article-title":"Visualization Study of Virtual Human Tongue in Speech Production","volume":"10","author":"Chen","year":"2013","journal-title":"Chin. J. Rehabil. Theory Pract."},{"key":"ref_9","unstructured":"Radford, A., Metz, L., and Chintala, S. (2016). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. 
arXiv."},{"key":"ref_10","unstructured":"Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A.C. (2017). Improved Training of Wasserstein GANs. arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020). Analyzing and Improving the Image Quality of StyleGAN. arXiv.","DOI":"10.1109\/CVPR42600.2020.00813"},{"key":"ref_12","unstructured":"Nibali, A., He, Z., Morgan, S., and Prendergast, L. (2018). Numerical Coordinate Regression with Convolutional Neural Networks. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Newell, A., Yang, K., and Deng, J. (2016). Stacked Hourglass Networks for Human Pose Estimation. arXiv.","DOI":"10.1007\/978-3-319-46484-8_29"},{"key":"ref_14","unstructured":"Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., and Sheikh, Y. (2019). OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. arXiv."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Sun, K., Xiao, B., Liu, D., and Wang, J. (2019). Deep High-Resolution Representation Learning for Human Pose EstimationKe. arXiv.","DOI":"10.1109\/CVPR.2019.00584"},{"key":"ref_16","unstructured":"Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O.K., Singhal, S., and Som, S. (2022). Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasksar. arXiv."},{"key":"ref_17","unstructured":"Cheng, B., Schwing, A., and Kirillov, A. (2021). Per-Pixel Classification is Not All You Need for Semantic Segmentation. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Karras, T., Laine, S., and Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv.","DOI":"10.1109\/CVPR.2019.00453"},{"key":"ref_20","unstructured":"(2012, January 01). University of Glasgow Homepage. Available online: https:\/\/www.gla.ac.uk\/."},{"key":"ref_21","unstructured":"(2018, January 01). University of British Columbia Homepage. Available online: https:\/\/enunciate.arts.ubc.ca\/."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Guo, Y., Jin, Y., Luo, Y., He, Z., and Lee, H. (2018). Unsupervised Discovery of Object Landmarks as Structural Representations. arXiv.","DOI":"10.1109\/CVPR.2018.00285"},{"key":"ref_23","unstructured":"Jakab, T., Gupta, A., Bilen, H., and Vedaldi, A. (2018). Conditional Image Generation for Learning the Structure of Visual Objects. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Siarohin, A., Lathuili\u00e8re, S., Tulyakov, S., Ricci, E., and Sebe, N. (2019). Animating Arbitrary Objects via Deep Motion Transfer. arXiv.","DOI":"10.1109\/CVPR.2019.00248"},{"key":"ref_25","unstructured":"Siarohin, A., Lathuili\u00e8re, S., Tulyakov, S., Ricci, E., and Sebe, N. (2020). First Order Motion Model for Image Animation. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Siarohin, A., Woodford, O.J., Ren, J., Chai, M., and Tulyakov, S. (2021). Motion Representations for Articulated Animation. arXiv.","DOI":"10.1109\/CVPR46437.2021.01344"},{"key":"ref_27","unstructured":"Ho, J., Jain, A., and Abbeel, P. (2020). 
Denoising Diffusion Probabilistic Models. arXiv."},{"key":"ref_28","unstructured":"Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. (2022). GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv."},{"key":"ref_29","unstructured":"Song, J., Meng, C., and Ermon, S. (2022). Denoising Diffusion Implicit Models. arXiv."},{"key":"ref_30","unstructured":"Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., and Yang, M.H. (2023). Diffusion Models: A Comprehensive Survey of Methods and Applications. arXiv."}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/12\/8\/150\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:21:22Z","timestamp":1760127682000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/12\/8\/150"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,28]]},"references-count":30,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2023,8]]}},"alternative-id":["computers12080150"],"URL":"https:\/\/doi.org\/10.3390\/computers12080150","relation":{},"ISSN":["2073-431X"],"issn-type":[{"type":"electronic","value":"2073-431X"}],"subject":[],"published":{"date-parts":[[2023,7,28]]}}}
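
The record above is a Crossref REST API work object. As a minimal sketch of retrieving and reading it programmatically, the Python below fetches the same record by DOI from the public Crossref endpoint (https://api.crossref.org/works/{doi}); it assumes network access and the third-party requests package, neither of which is part of the record itself.

# Fetch the Crossref work record shown above and print a few of its fields.
# Minimal sketch: assumes network access and `pip install requests`.
import requests

DOI = "10.3390/computers12080150"  # DOI from the record above

resp = requests.get(f"https://api.crossref.org/works/{DOI}", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]  # same envelope as the record above

print(work["title"][0])                    # article title
print(work["container-title"][0])          # journal: Computers
print(work["published"]["date-parts"][0])  # [2023, 7, 28]
print(len(work.get("reference", [])))      # 30 cited references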