{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,18]],"date-time":"2026-02-18T23:46:52Z","timestamp":1771458412867,"version":"3.50.1"},"reference-count":47,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2021,12,23]],"date-time":"2021-12-23T00:00:00Z","timestamp":1640217600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT)","award":["NRF-2018X1A3A1069795"],"award-info":[{"award-number":["NRF-2018X1A3A1069795"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as \u201ca\u201d, \u201can\u201d, \u201ceight\u201d, and \u201cbin\u201d because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. 
The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications.<\/jats:p>","DOI":"10.3390\/s22010072","type":"journal-article","created":{"date-parts":[[2021,12,23]],"date-time":"2021-12-23T21:40:21Z","timestamp":1640295621000},"page":"72","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":29,"title":["Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4705-1254","authenticated-orcid":false,"given":"Sanghun","family":"Jeon","sequence":"first","affiliation":[{"name":"Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1522-5064","authenticated-orcid":false,"given":"Ahmed","family":"Elsharkawy","sequence":"additional","affiliation":[{"name":"Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mun Sang","family":"Kim","sequence":"additional","affiliation":[{"name":"Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, 
Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,12,23]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"746","DOI":"10.1038\/264746a0","article-title":"Hearing lips and seeing voices","volume":"264","author":"McGurk","year":"1976","journal-title":"Nature"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Ramakrishnan, S. (2012). Automatic visual speech recognition. Speech Enhancement, Modeling, Recognition\u2014Algorithms, and Applications, Intechopen.","DOI":"10.5772\/2391"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"796","DOI":"10.1044\/jshr.1104.796","article-title":"Confusions among visually perceived consonants","volume":"11","author":"Fisher","year":"1968","journal-title":"J. Speech Hear. Res."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"562","DOI":"10.3758\/BF03204211","article-title":"Perceptual dominance during lipreading","volume":"32","author":"Easton","year":"1982","journal-title":"Atten. Percept. Psychophys."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Chung, J.S., Senior, A., Vinyals, O., and Zisserman, A. (2017, January 21\u201326). Lip reading sentences in the wild. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.367"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Kastaniotis, D., Tsourounis, D., and Fotopoulos, S. (2020). Lip Reading Modeling with Temporal Convolutional Networks for Medical Support applications. 
2020 13th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), IEEE.","DOI":"10.1109\/CISP-BMEI51763.2020.9263634"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"012146","DOI":"10.1088\/1742-6596\/1871\/1\/012146","article-title":"Lip-Corrector: Application of BERT-based Model in Sentence-level Lipreading","volume":"1871","author":"Zhao","year":"2021","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1016\/j.imavis.2018.07.002","article-title":"Survey on automatic lip-reading in the era of deep learning","volume":"78","author":"Sukno","year":"2018","journal-title":"Image Vis. Comput."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"204518","DOI":"10.1109\/ACCESS.2020.3036865","article-title":"A survey of research on lipreading technology","volume":"8","author":"Hao","year":"2020","journal-title":"IEEE Access"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"981","DOI":"10.1007\/s11760-019-01630-1","article-title":"Lipreading with DenseNet and resBi-LSTM","volume":"14","author":"Chen","year":"2020","journal-title":"Signal Image Video Process."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Tsourounis, D., Kastaniotis, D., and Fotopoulos, S. (2021). Lip Reading by Alternating between Spatiotemporal and Spatial Convolutions. J. Imaging, 7.","DOI":"10.3390\/jimaging7050091"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"215516","DOI":"10.1109\/ACCESS.2020.3040906","article-title":"Lip Reading Sentences Using Deep Learning with Only Visual Cues","volume":"8","author":"Fenghour","year":"2020","journal-title":"IEEE Access"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Ma, S., Wang, S., and Lin, X. (2020). A Transformer-based Model for Sentence-Level Chinese Mandarin Lipreading. 
2020 IEEE Fifth International Conference on Data Science in Cyberspace (DSC), IEEE.","DOI":"10.1109\/DSC50466.2020.00020"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"590","DOI":"10.1016\/j.imavis.2014.06.004","article-title":"A review of recent advances in visual speech decoding","volume":"32","author":"Zhou","year":"2014","journal-title":"Image Vis. Comput."},{"key":"ref_15","unstructured":"Xiao, J. (2018). 3D feature pyramid attention module for robust visual speech recognition. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"2421","DOI":"10.1121\/1.2229005","article-title":"An audio-visual corpus for speech perception and automatic speech recognition","volume":"120","author":"Cooke","year":"2006","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. arXiv.","DOI":"10.5244\/C.28.6"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2017). Densely connected convolutional networks. arXiv.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_21","unstructured":"Assael, Y.M., Shillingford, B., Whiteson, S., and De Freitas, N. (2016). Lipnet: End-to-end sentence-level lipreading. 
arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Chu, S.M., and Huang, T.S. (2000, January 16\u201320). Bimodal speech recognition using coupled hidden Markov models. Proceedings of the Sixth International Conference on Spoken Language Processing (ICSLP 2000), Beijing, China.","DOI":"10.21437\/ICSLP.2000-377"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Wand, M., Koutn\u00edk, J., and Schmidhuber, J. (2016, January 20\u201325). Lipreading with long short-term memory. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472852"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Petridis, S., and Pantic, M. (2016, January 20\u201325). Deep complementary bottleneck features for visual speech recognition. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472088"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1109\/TPAMI.2012.59","article-title":"3D convolutional neural networks for human action recognition","volume":"35","author":"Ji","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Goldschen, A.J., Garcia, O.N., and Petajan, E.D. (1997). Continuous automatic speech recognition by lipreading. Motion-Based Recognition, Springer.","DOI":"10.1007\/978-94-015-8935-2_14"},{"key":"ref_27","unstructured":"Potamianos, G., Graf, H.P., and Cosatto, E. (1998, January 7). An image transform approach for HMM based automatic lipreading. Proceedings of the 1998 International Conference on Image Processing. ICIP98 (Cat. No. 98CB36269), Chicago, IL, USA."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H.G., and Ogata, T. (2014, January 14\u201318). 
Lipreading using convolutional neural network. Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association, Singapore.","DOI":"10.21437\/Interspeech.2014-293"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1016\/j.cviu.2018.02.001","article-title":"Learning to lip read words by watching videos","volume":"173","author":"Chung","year":"2018","journal-title":"Comput. Vis. Image Under."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Zhang, P., Wang, D., Lu, H., Wang, H., and Ruan, X. (2017). Amulet: Aggregating multi-level convolutional features for salient object detection. arXiv.","DOI":"10.1109\/ICCV.2017.31"},{"key":"ref_31","unstructured":"Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Tompson, J., Goroshin, R., Jain, A., LeCun, Y., and Bregler, C. (2015, January 7\u201312). Efficient object localization using convolutional networks. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298664"},{"key":"ref_33","unstructured":"Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Graves, A., Fern\u00e1ndez, S., Gomez, F., and Schmidhuber, J. (2006, January 25\u201329). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA.","DOI":"10.1145\/1143844.1143891"},{"key":"ref_35","first-page":"1755","article-title":"Dlib-ml: A machine learning toolkit","volume":"10","author":"King","year":"2009","journal-title":"J. Mach. Learn. 
Res."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., and Pantic, M. (2013, January 2\u20138). 300 faces in-the-wild challenge: The first facial landmark localization challenge. Proceedings of the 2013 IEEE International Conference on Computer Vision Workshops, Sydney, Australia.","DOI":"10.1109\/ICCVW.2013.59"},{"key":"ref_37","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_38","unstructured":"Neti, C., Potamianos, G., Luettin, J., Matthews, I., Glotin, H., Vergyri, D., Sison, J., Mashari, A., and Zhou, J. (2000). Audio-Visual Speech Recognition, Center for Language and Speech Processing, The Johns Hopkins University. Final Workshop 2000 Report."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Xu, K., Li, D., Cassimatis, N., and Wang, X. (2018, January 15\u201319). LCANet: End-to-end lipreading with Cascaded Attention-CTC. Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition, Xi\u2019an, China.","DOI":"10.1109\/FG.2018.00088"},{"key":"ref_40","unstructured":"Rastogi, A., Agarwal, R., Gupta, V., Dhar, J., and Bhattacharya, M. (2019, January 27\u201328). LRNeuNet: An attention based deep architecture for lipreading from multitudinous sized videos. Proceedings of the 2019 International Conference on Computing, Power and Communication, New Delhi, India."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Qu, L., Weber, C., and Wermter, S. (2019, January 15\u201319). LipSound: Neural mel-spectrogram reconstruction for lip reading. Proceedings of the INTERSPEECH 2019, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-1393"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Luo, M., Yang, S., Shan, S., and Chen, X.J. (2020, January 16\u201320). Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading. 
Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina.","DOI":"10.1109\/FG47880.2020.00010"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Liu, J., Ren, Y., Zhao, Z., Zhang, C., Huai, B., and Yuan, J. (2020, January 12\u201316). FastLR. Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA.","DOI":"10.1145\/3394171.3413740"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Hlav\u00e1\u010d, M., Gruber, I., \u017delezn\u00fd, M., and Karpov, A. (2020, January 7\u20139). Lipreading with LipsID. Proceedings of the International Conference on Speech and Computer, St. Petersburg, Russia.","DOI":"10.1007\/978-3-030-60276-5_18"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Yang, C., Wang, S., Zhang, X., and Zhu, Y. (2020, January 25\u201328). Speaker-independent lipreading with limited data. Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates.","DOI":"10.1109\/ICIP40778.2020.9190780"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Chen, W., Tan, X., Xia, Y., Qin, T., Wang, Y., and Liu, T.-Y. (2020, January 12\u201316). DualLip: A system for joint lip reading and generation. Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA.","DOI":"10.1145\/3394171.3413623"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Zhang, T., He, L., Li, X., and Feng, G. (2021). Efficient end-to-end sentence level lipreading with temporal convolutional network. Appl. 
Sci., 11.","DOI":"10.3390\/app11156975"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/1\/72\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:51:52Z","timestamp":1760169112000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/1\/72"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,12,23]]},"references-count":47,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2022,1]]}},"alternative-id":["s22010072"],"URL":"https:\/\/doi.org\/10.3390\/s22010072","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,12,23]]}}}