{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,5]],"date-time":"2026-02-05T23:48:51Z","timestamp":1770335331753,"version":"3.49.0"},"reference-count":54,"publisher":"MDPI AG","issue":"23","license":[{"start":{"date-parts":[[2021,11,26]],"date-time":"2021-11-26T00:00:00Z","timestamp":1637884800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>As an alternative approach, viseme-based lipreading systems have demonstrated promising performance results in decoding videos of people uttering entire sentences. However, the overall performance of such systems has been significantly affected by the efficiency of the conversion of visemes to words during the lipreading process. As shown in the literature, the issue has become a bottleneck of such systems where the system\u2019s performance can decrease dramatically from a high classification accuracy of visemes (e.g., over 90%) to a comparatively very low classification accuracy of words (e.g., only just over 60%). The underlying cause of this phenomenon is that roughly half of the words in the English language are homophemes, i.e., a set of visemes can map to multiple words, e.g., \u201ctime\u201d and \u201csome\u201d. In this paper, aiming to tackle this issue, a deep learning network model with an Attention based Gated Recurrent Unit is proposed for efficient viseme-to-word conversion and compared against three other approaches. The proposed approach features strong robustness, high efficiency, and short execution time. The approach has been verified with analysis and practical experiments of predicting sentences from benchmark LRS2 and LRS3 datasets. 
The main contributions of the paper are as follows: (1) a model is developed that is effective in converting visemes to words and discriminating between homopheme words, and is robust to incorrectly classified visemes; (2) the proposed model uses few parameters and therefore requires little overhead and time to train and execute; and (3) an improved performance in predicting spoken sentences from the LRS2 dataset is attained, with a word accuracy rate of 79.6%, an improvement of 15.0% over the state-of-the-art approaches.<\/jats:p>","DOI":"10.3390\/s21237890","type":"journal-article","created":{"date-parts":[[2021,12,1]],"date-time":"2021-12-01T01:45:02Z","timestamp":1638323102000},"page":"7890","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["An Effective Conversion of Visemes to Words for High-Performance Automatic Lipreading"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6725-0405","authenticated-orcid":false,"given":"Souheil","family":"Fenghour","sequence":"first","affiliation":[{"name":"School of Engineering, London South Bank University, London SE1 0AA, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0030-1199","authenticated-orcid":false,"given":"Daqing","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Engineering, London South Bank University, London SE1 0AA, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1436-1742","authenticated-orcid":false,"given":"Kun","family":"Guo","sequence":"additional","affiliation":[{"name":"Xi\u2019an VANXUM Electronics Technology Co., Ltd., Xi\u2019an 710129, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1415-4444","authenticated-orcid":false,"given":"Bo","family":"Li","sequence":"additional","affiliation":[{"name":"School of Electronics and Information, Northwestern Polytechnical University, Xi\u2019an 710129, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9036-3061","authenticated-orcid":false,"given":"Perry","family":"Xiao","sequence":"additional","affiliation":[{"name":"School of Engineering, London South Bank University, London SE1 0AA, UK"}]}],"member":"1968","published-online":{"date-parts":[[2021,11,26]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"215516","DOI":"10.1109\/ACCESS.2020.3040906","article-title":"Lip Reading Sentences Using Deep Learning with Only Visual Cues","volume":"8","author":"Fenghour","year":"2020","journal-title":"IEEE Access"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Howell, D., Cox, S., and Theobald, B. (2016). Visual Units and Confusion Modelling for Automatic Lip-reading. Image Vis. Comput., 51.","DOI":"10.1016\/j.imavis.2016.03.003"},{"key":"ref_3","unstructured":"Thangthai, K., Bear, H.L., and Harvey, R. (2017, September 4\u20137). Comparing phonemes and visemes with DNN-based lipreading. Proceedings of the 28th British Machine Vision Conference, London, UK."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Bear, H.L., and Harvey, R. (2016, March 20\u201325). Decoding visemes: Improving machine lip-reading. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472029"},{"key":"ref_5","unstructured":"Lan, Y., Harvey, R., and Theobald, B.J. (2012, March 25\u201330). Insights into machine lip reading. 
Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Almajai, I., Cox, S., Harvey, R., and Lan, Y. (2016, March 20\u201325). Improved speaker independent lip reading using speaker adaptive training and deep neural networks. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472172"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Ma, P., Petridis, S., and Pantic, M. (2021, June 6\u201311). End-to-End Audio-visual Speech Recognition with Conformers. Proceedings of the ICASSP, Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9414567"},{"key":"ref_8","unstructured":"Botev, A., Zheng, B., and Barber, D. (2017, April 20\u201322). Complementary sum sampling for likelihood approximation in large scale classification. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA."},{"key":"ref_9","unstructured":"Firth, J.R. (1957). A Synopsis of Linguistic Theory. Studies in Linguistic Analysis, Blackwell. Special Volume of the Philological Society."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"69","DOI":"10.1006\/csla.2001.0184","article-title":"Weighted finite-state transducers in speech recognition","volume":"16","author":"Mohri","year":"2002","journal-title":"Comput. Speech Lang."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Goldschen, A.J., Garcia, O.N., and Petajan, E.D. (1997). Continuous automatic speech recognition by lipreading. Motion-Based Recognition, Springer.","DOI":"10.1007\/978-94-015-8935-2_14"},{"key":"ref_12","unstructured":"Kun, J., Xu, J., and He, B. (2019). A Survey on Neural Network Language Models. arXiv."},{"key":"ref_13","first-page":"1137","article-title":"A neural probabilistic language model","volume":"3","author":"Bengio","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_14","unstructured":"Lipton, Z. (2015). A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv."},{"key":"ref_15","unstructured":"Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modelling. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"603","DOI":"10.1109\/TMM.2015.2407694","article-title":"TCD-TIMIT: An audio-visual corpus of continuous speech","volume":"17","author":"Harte","year":"2015","journal-title":"IEEE Trans. Multimed."},{"key":"ref_17","unstructured":"Lan, Y., Theobald, B.-J., Harvey, R., Ong, E.-J., and Bowden, R. (2017, August 25\u201326). Improving visual features for lip-reading. Proceedings of the International Conference on Auditory-Visual Speech Processing, Stockholm, Sweden."},{"key":"ref_18","unstructured":"Howell, D.L. (2015). Confusion Modelling for Lip-Reading. [Ph.D. Thesis, University of East Anglia]."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Chung, J.S., Zisserman, A., Senior, A., and Vinyals, O. (2017, July 21\u201326). Lip Reading Sentences in the Wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.367"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Fenghour, S., Chen, D., and Xiao, P. (2019, April 9\u201312). Decoder-Encoder LSTM for Lip Reading. 
Proceedings of the 8th International Conference on Software and Information Engineering (ICSIE), Cairo, Egypt.","DOI":"10.1145\/3328833.3328845"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Shillingford, B., Assael, Y., Hoffman, M., Paine, T., Hughes, C., Prabhu, U., Liao, H., Sak, H., Rao, K., and Bennett, L. (2018). Large-Scale Visual Speech Recognition. arXiv.","DOI":"10.21437\/Interspeech.2019-1669"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Thangthai, K., and Harvey, R. (2017, August 20\u201324). Improving Computer Lipreading via DNN Sequence Discriminative Training Techniques. Proceedings of the Interspeech, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-106"},{"key":"ref_23","unstructured":"Thangthai, K., Harvey, R., Cox, S., and Theobald, B.J. (2015, September 11\u201313). Improving Lip-reading Performance for Robust Audiovisual Speech Recognition using DNNs. Proceedings of the 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing, Vienna, Austria."},{"key":"ref_24","unstructured":"Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2021, November 01). Improving Language Understanding by Generative Pre-Training. Available online: https:\/\/www.cs.ubc.ca\/~amuham01\/LING530\/papers\/radford2018improving.pdf."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Sterpu, G., and Harte, N. (2018). Towards Lipreading Sentences with Active Appearance Models. arXiv.","DOI":"10.21437\/AVSP.2017-14"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Peymanfard, J., Mohammadi, M.R., Zeinali, H., and Mozayani, N. (2021). Lip reading using external viseme decoding. arXiv.","DOI":"10.1109\/MVIP53647.2022.9738749"},{"key":"ref_27","unstructured":"Lamel, L., Kassel, R.H., and Seneff, S. (1989, January 21\u201323). Speech database development: Design and analysis of the acoustic-phonetic corpus. Proceedings of the DARPA Speech Recognition Workshop, Philadelphia, PA, USA."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"4688","DOI":"10.1109\/TNNLS.2019.2957276","article-title":"A GRU-Gated Attention Model for Neural Machine Translation","volume":"31","author":"Zhang","year":"2017","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"492","DOI":"10.1016\/j.csl.2006.09.003","article-title":"Continuous space language models","volume":"21","author":"Schwenk","year":"2007","journal-title":"Comput. Speech Lang."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Kuncoro, A., Dyer, C., Hale, J., Yogatama, D., Clark, S., and Blunsom, P. (2018, July 15\u201320). LSTMs Can Learn Syntax-Sensitive Dependencies Well, But Modeling Structure Makes Them Better. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia.","DOI":"10.18653\/v1\/P18-1132"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"521","DOI":"10.1162\/tacl_a_00115","article-title":"Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies","volume":"4","author":"Linzen","year":"2016","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Handler, A., Denny, M., Wallach, H., and O\u2019Connor, B. (2016, November 5). Bag of What? Simple Noun Phrase Extraction for Text Analysis. 
Proceedings of the First Workshop on NLP and Computational Social Science, Austin, TX, USA.","DOI":"10.18653\/v1\/W16-5615"},{"key":"ref_33","unstructured":"Kondrak, G. (2000, April 29\u2013May 4). A new algorithm for the alignment of phonetic sequences. Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, Seattle, WA, USA."},{"key":"ref_34","unstructured":"Afouras, T., Chung, J.S., and Zisserman, A. (2018). LRS3-TED: A large-scale dataset for visual speech recognition. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"448","DOI":"10.1016\/S0749-596X(02)00010-4","article-title":"Context sensitivity in the spelling of English vowels","volume":"47","author":"Treiman","year":"2001","journal-title":"J. Mem. Lang."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Lee, S., and Yook, D. (2002, August 18\u201322). Audio-to-Visual Conversion Using Hidden Markov Models. Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence, Tokyo, Japan.","DOI":"10.1007\/3-540-45683-X_60"},{"key":"ref_37","unstructured":"Jeffers, J., and Barley, M. (1971). Speechreading (Lipreading), Charles C Thomas Publisher Limited."},{"key":"ref_38","unstructured":"Neti, C., Potamianos, G., Luettin, J., Matthews, I., Glotin, H., Vergyri, D., Sison, J., and Mashari, A. (2000). Audio Visual Speech Recognition, Technical Report IDIAP."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Hazen, T.J., Saenko, K., La, C., and Glass, J.R. (2004, January 14\u201315). A segment based audio-visual speech recognizer: Data collection, development, and initial experiments. Proceedings of the 6th International Conference on Multimodal Interfaces, New York, NY, USA.","DOI":"10.1145\/1027933.1027972"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Bozkurt, E., Erdem, C.E., Erzin, E., Erdem, T., and Ozkan, M. (2007, May 7\u20139). Comparison of phoneme and viseme based acoustic units for speech driven realistic lip animation. Proceedings of the 3DTV Conference, Kos, Greece.","DOI":"10.1109\/3DTV.2007.4379417"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"796","DOI":"10.1044\/jshr.1104.796","article-title":"Confusions among visually perceived consonants","volume":"11","author":"Fisher","year":"1968","journal-title":"J. Speech Lang. Hear. Res."},{"key":"ref_42","unstructured":"DeLand, F. (1931). The Story of Lip-Reading, Its Genesis and Development, The Volta Bureau."},{"key":"ref_43","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, December 4\u20139). Attention Is All You Need. Proceedings of the NIPS, Long Beach, CA, USA."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Vogel, S., Ney, H., and Tillmann, C. (1996, August 5\u20139). HMM-Based Word Alignment in Statistical Translation. Proceedings of the COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics, Copenhagen, Denmark.","DOI":"10.3115\/993268.993313"},{"key":"ref_45","unstructured":"Kingma, D.P., and Ba, J. (2015, May 7\u20139). Adam: A method for stochastic optimization. Proceedings of the ICLR, San Diego, CA, USA."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1002\/j.1538-7305.1948.tb01338.x","article-title":"A Mathematical Theory of Communication","volume":"27","author":"Shannon","year":"1948","journal-title":"Bell Syst. Tech. J."
},{"key":"ref_47","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2015, May 7\u20139). Neural machine translation by jointly learning to align and translate. Proceedings of the ICLR, San Diego, CA, USA."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Luong, T., Pham, H., and Manning, C.D. (2015, September 17\u201321). Effective Approaches to Attention-based Neural Machine Translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal.","DOI":"10.18653\/v1\/D15-1166"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Afouras, T., Chung, J.S., and Zisserman, A. (2018). Deep lip reading: A comparison of models and an online application. arXiv.","DOI":"10.21437\/Interspeech.2018-1943"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Wei, J., and Zou, K. (2019, November 3\u20137). EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.","DOI":"10.18653\/v1\/D19-1670"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Ataman, D., Firat, O., Gangi, M., Federico, M., and Birch, A. (2019, November 4). On the Importance of Word Boundaries in Character-level Neural Machine Translation. Proceedings of the 3rd Workshop on Neural Generation and Translation (WNGT), Hong Kong, China.","DOI":"10.18653\/v1\/D19-5619"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009, June 14\u201318). Curriculum learning. Proceedings of the ICML, Montreal, QC, Canada.","DOI":"10.1145\/1553374.1553380"},{"key":"ref_53","unstructured":"Spitkovsky, V.I., Alshawi, H., and Jurafsky, D. (2010, June 1\u20136). From baby steps to leapfrog: How less is more in unsupervised dependency parsing. Proceedings of the NAACL, Los Angeles, CA, USA."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Tsvetkov, Y., Faruqui, M., Ling, W., Macwhinney, B., and Dyer, C. (2016). Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning. arXiv.","DOI":"10.18653\/v1\/P16-1013"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/23\/7890\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:36:22Z","timestamp":1760168182000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/23\/7890"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,11,26]]},"references-count":54,"journal-issue":{"issue":"23","published-online":{"date-parts":[[2021,12]]}},"alternative-id":["s21237890"],"URL":"https:\/\/doi.org\/10.3390\/s21237890","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,11,26]]}}}
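Illustrative note: the abstract in this record describes a viseme-to-word decoder built around an Attention-based Gated Recurrent Unit. As a rough sketch of how such a module can be wired together, the PyTorch fragment below pairs a GRU encoder over viseme tokens with a GRU decoder over word tokens, linked by Luong-style dot-product attention. The class name, vocabulary sizes, layer widths, and the specific attention form are all assumptions made for illustration, not the authors' actual implementation.

# Hedged sketch: attention-based GRU viseme-to-word conversion.
# All names and sizes here are illustrative assumptions only.
import torch
import torch.nn as nn

VISEME_VOCAB = 16   # assumed: small viseme alphabet plus special tokens
WORD_VOCAB = 5000   # assumed word-level output vocabulary
EMB, HID = 64, 256  # assumed embedding and hidden sizes

class VisemeToWord(nn.Module):
    def __init__(self):
        super().__init__()
        self.v_emb = nn.Embedding(VISEME_VOCAB, EMB)
        self.w_emb = nn.Embedding(WORD_VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.fuse = nn.Linear(2 * HID, HID)  # combines context with decoder state
        self.proj = nn.Linear(HID, WORD_VOCAB)

    def forward(self, visemes, words):
        # Encode the viseme sequence; its final hidden state seeds the decoder.
        enc_out, h = self.encoder(self.v_emb(visemes))        # (B, Tv, H)
        dec_out, _ = self.decoder(self.w_emb(words), h)       # (B, Tw, H)
        # Dot-product attention of each decoder step over all encoder states.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))  # (B, Tw, Tv)
        context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)
        fused = torch.tanh(self.fuse(torch.cat([context, dec_out], dim=-1)))
        return self.proj(fused)                               # word logits

# Toy usage: two viseme sequences of length 12 with teacher-forced word targets.
model = VisemeToWord()
logits = model(torch.randint(0, VISEME_VOCAB, (2, 12)),
               torch.randint(0, WORD_VOCAB, (2, 6)))          # (2, 6, WORD_VOCAB)

Attending over the entire viseme sequence is what lets a decoder of this shape use sentence-level context to choose between homopheme candidates that share the same viseme string, which is the bottleneck the abstract identifies.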