{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T16:27:25Z","timestamp":1776443245803,"version":"3.51.2"},"reference-count":43,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2023,5,27]],"date-time":"2023-05-27T00:00:00Z","timestamp":1685145600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"JSPS KAKENHI","award":["19KT0029"],"award-info":[{"award-number":["19KT0029"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>This paper studies various deep learning models for word-level lip-reading technology, one of the tasks in the supervised learning of video classification. Several public datasets have been published in the lip-reading research field. However, few studies have investigated lip-reading techniques using multiple datasets. This paper evaluates deep learning models using four publicly available datasets, namely Lip Reading in the Wild (LRW), OuluVS, CUAVE, and Speech Scene by Smart Device (SSSD), which are representative datasets in this field. LRW is one of the large-scale public datasets and targets 500 English words released in 2016. Initially, the recognition accuracy of LRW was 66.1%, but many research groups have been working on it. The current the state of the art (SOTA) has achieved 94.1% by 3D-Conv + ResNet18 + {DC-TCN, MS-TCN, BGRU} + knowledge distillation + word boundary. Regarding the SOTA model, in this paper, we combine existing models such as ResNet, WideResNet, WideResNet, EfficientNet, MS-TCN, Transformer, ViT, and ViViT, and investigate the effective models for word lip-reading tasks using six deep learning models with modified feature extractors and classifiers. Through recognition experiments, we show that similar model structures of 3D-Conv + ResNet18 for feature extraction and MS-TCN model for inference are valid for four datasets with different scales.<\/jats:p>","DOI":"10.3390\/a16060269","type":"journal-article","created":{"date-parts":[[2023,5,27]],"date-time":"2023-05-27T16:17:33Z","timestamp":1685204253000},"page":"269","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":20,"title":["Efficient DNN Model for Word Lip-Reading"],"prefix":"10.3390","volume":"16","author":[{"given":"Taiki","family":"Arakane","sequence":"first","affiliation":[{"name":"Department of Artificial Intelligence, Kyushu Institute of Technology, Fukuoka 820-8502, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8844-9707","authenticated-orcid":false,"given":"Takeshi","family":"Saitoh","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence, Kyushu Institute of Technology, Fukuoka 820-8502, Japan"}]}],"member":"1968","published-online":{"date-parts":[[2023,5,27]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Saitoh, T., and Konishi, R. (2010, January 23\u201326). Profile Lip Reading for Vowel and Word Recognition. Proceedings of the 20th International Conference on Pattern Recognition (ICPR2010), Istanbul, Turkey.","DOI":"10.1109\/ICPR.2010.335"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Nakamura, Y., Saitoh, T., and Itoh, K. (2022, January 18\u201320). 3D CNN-based mouth shape recognition for patient with intractable neurological diseases. 
Proceedings of the 13th International Conference on Graphics and Image Processing (ICGIP 2021), Kunming, China.","DOI":"10.1117\/12.2623642"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"298","DOI":"10.3389\/frai.2022.1070964","article-title":"Isolated single sound lip-reading using a frame-based camera and event-based camera","volume":"5","author":"Kanamaru","year":"2023","journal-title":"Front. Artif. Intell."},{"key":"ref_4","unstructured":"Chung, J.S., and Zisserman, A. (2016, November 20\u201324). Lip Reading in the Wild. Proceedings of the Asian Conference on Computer Vision (ACCV), Taipei, Taiwan."},{"key":"ref_5","first-page":"9","article-title":"Lip Reading using Facial Expression Features","volume":"10","author":"Shirakata","year":"2020","journal-title":"Int. J. Comput. Vis. Signal Process."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Martinez, B., Ma, P., Petridis, S., and Pantic, M. (2020, May 4\u20138). Lipreading using temporal convolutional networks. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053841"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Kodama, M., and Saitoh, T. (2022, January 18\u201320). Replacing speaker-independent recognition task with speaker-dependent task for lip-reading using First Order Motion Model. Proceedings of the 13th International Conference on Graphics and Image Processing (ICGIP 2021), Kunming, China.","DOI":"10.1117\/12.2623640"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Ma, P., Martinez, B., Petridis, S., and Pantic, M. (2021, June 6\u201311). Towards Practical Lipreading with Distilled and Efficient Models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9415063"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Fu, Y., Lu, Y., and Ni, R. (2023). Chinese Lip-Reading Research Based on ShuffleNet and CBAM. Appl. Sci., 13.","DOI":"10.3390\/app13021106"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Chung, J.S., Senior, A., Vinyals, O., and Zisserman, A. (2017, July 21\u201326). Lip Reading Sentences in the Wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.367"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Arakane, T., Saitoh, T., Chiba, R., Morise, M., and Oda, Y. (2022, November 24\u201325). Conformer-Based Lip-Reading for Japanese Sentence. Proceedings of the 37th International Conference on Image and Vision Computing, Auckland, New Zealand.","DOI":"10.1007\/978-3-031-25825-1_34"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Jeon, S., Elsharkawy, A., and Kim, M.S. (2022). Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition. Sensors, 22.","DOI":"10.3390\/s22010072"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"1254","DOI":"10.1109\/TMM.2009.2030637","article-title":"Lipreading with local spatiotemporal descriptors","volume":"11","author":"Zhao","year":"2009","journal-title":"IEEE Trans. Multimed."},{"key":"ref_14","first-page":"1189","article-title":"Moving-talker, speaker-independent feature study, and baseline results using the CUAVE multimodal speech corpus","volume":"2002","author":"Patterson","year":"2002","journal-title":"EURASIP J. 
Appl. Signal Process."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Saitoh, T., and Kubokawa, M. (2018, August 20\u201324). SSSD: Speech Scene Database by Smart Device for Visual Speech Recognition. Proceedings of the 24th International Conference on Pattern Recognition (ICPR2018), Beijing, China.","DOI":"10.1109\/ICPR.2018.8545664"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Yang, S., Zhang, Y., Feng, D., Yang, M., Wang, C., Xiao, J., Long, K., Shan, S., and Chen, X. (2019, May 14\u201318). LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild. Proceedings of the 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG2019), Lille, France.","DOI":"10.1109\/FG.2019.8756582"},{"key":"ref_17","unstructured":"Ivanko, D., Ryumin, D., Axyonov, A., Kashevnik, A., and Karpov, A. (2022, June 21\u201323). Multi-Speaker Audio-Visual Corpus RUSAVIC: Russian Audio-Visual Speech in Cars. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC2022), Marseille, France."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Ma, P., Wang, Y., Petridis, S., Shen, J., and Pantic, M. (2022, May 23\u201327). Training Strategies for Improved Lip-reading. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore.","DOI":"10.1109\/ICASSP43922.2022.9746706"},{"key":"ref_19","unstructured":"Feng, D., Yang, S., Shan, S., and Chen, X. (2020). Learn an Effective Lip Reading Model without Pains. arXiv."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Kim, M., Yeo, J.H., and Ro, Y.M. (2022, February 22\u2013March 1). Distinguishing Homophenes using Multi-head Visual-audio Memory for Lip Reading. Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), Virtual.","DOI":"10.1609\/aaai.v36i1.20003"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Koumparoulis, A., and Potamianos, G. (2022, May 23\u201327). Accurate and Resource-Efficient Lipreading with Efficientnetv2 and Transformers. Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore.","DOI":"10.1109\/ICASSP43922.2022.9747729"},{"key":"ref_22","unstructured":"Ivanko, D., Ryumin, D., Kashevnik, A., Axyonov, A., and Karpov, A. (2022, August 29\u2013September 2). Visual Speech Recognition in a Driver Assistance System. Proceedings of the 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016, June 27\u201330). Rethinking the Inception Architecture for Computer Vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.308"},{"key":"ref_24","unstructured":"Viola, P., and Jones, M. (2001, December 8\u201314). Rapid object detection using a boosted cascade of simple features. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Kauai, HI, USA."},{"key":"ref_25","unstructured":"Dalal, N., and Triggs, B. (2005, June 20\u201325). Histograms of oriented gradients for human detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Deng, J., Guo, J., Ververas, E., Kotsia, I., and Zafeiriou, S. 
(2020, June 14\u201319). RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Virtual.","DOI":"10.1109\/CVPR42600.2020.00525"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Kazemi, V., and Sullivan, J. (2014, June 24\u201327). One millisecond face alignment with an ensemble of regression trees. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.241"},{"key":"ref_28","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, June 26\u2013July 1). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Zagoruyko, S., and Komodakis, N. (2016, September 19\u201322). Wide Residual Networks. Proceedings of the British Machine Vision Conference (BMVC), York, UK.","DOI":"10.5244\/C.30.87"},{"key":"ref_30","unstructured":"Tan, M., and Le, Q. (2019, June 9\u201315). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA."},{"key":"ref_31","unstructured":"Zoph, B., and Le, Q.V. (2017, April 24\u201326). Neural architecture search with reinforcement learning. Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France."},{"key":"ref_32","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, December 4\u20139). Attention Is All You Need. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS2017), Long Beach, CA, USA."},{"key":"ref_33","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2021, May 3\u20137). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Proceedings of the International Conference on Learning Representations (ICLR), Virtual."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lu\u010di\u0107, M., and Schmid, C. (2021, October 11\u201317). ViViT: A Video Vision Transformer. Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), Virtual.","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"ref_35","unstructured":"Bai, S., Kolter, J.Z., and Koltun, V. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Cubuk, E.D., Zoph, B., Shlens, J., and Le, Q. (2020, June 14\u201319). RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Virtual.","DOI":"10.1109\/CVPRW50498.2020.00359"},{"key":"ref_37","unstructured":"Zhang, H., Cisse, M., Dauphin, Y.N., and Lopez-Paz, D. (2018, April 30\u2013May 3). Mixup: Beyond empirical risk minimization. Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019, June 16\u201320). ArcFace: Additive Angular Margin Loss for Deep Face Recognition. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00482"},{"key":"ref_39","unstructured":"Loshchilov, I., and Hutter, F. (2019, May 6\u20139). Decoupled weight decay regularization. Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Stafylakis, T., and Tzimiropoulos, G. (2017, August 20\u201324). Combining Residual Networks with LSTMs for Lipreading. Proceedings of the Conference of the International Speech Communication Association (Interspeech 2017), Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-85"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Petridis, S., Stafylakis, T., Ma, P., Cai, F., Tzimiropoulos, G., and Pantic, M. (2018, April 15\u201320). End-to-end Audiovisual Speech Recognition. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461326"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Tsourounis, D., Kastaniotis, D., and Fotopoulos, S. (2021). Lip Reading by Alternating between Spatiotemporal and Spatial Convolutions. J. Imaging, 7.","DOI":"10.3390\/jimaging7050091"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Iwasaki, M., Kubokawa, M., and Saitoh, T. (2017, May 8\u201312). Two Features Combination with Gated Recurrent Unit for Visual Speech Recognition. Proceedings of the 14th IAPR Conference on Machine Vision Applications (MVA), Nagoya, Japan.","DOI":"10.23919\/MVA.2017.7986867"}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/16\/6\/269\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:43:28Z","timestamp":1760125408000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/16\/6\/269"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,27]]},"references-count":43,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2023,6]]}},"alternative-id":["a16060269"],"URL":"https:\/\/doi.org\/10.3390\/a16060269","relation":{},"ISSN":["1999-4893"],"issn-type":[{"value":"1999-4893","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,5,27]]}}}
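
For readers who want a concrete picture of the pipeline named in the abstract (3D-Conv + ResNet18 for feature extraction, an MS-TCN-style classifier for inference), the following is a minimal illustrative PyTorch sketch. It is not the authors' implementation: the 88x88 grayscale mouth crops, 29-frame clips, layer sizes, and the simplified multi-scale temporal block are all assumptions made for illustration.

# Illustrative sketch only, not the paper's code. Assumes clips of shape
# (batch, 1, frames, 88, 88), e.g. grayscale mouth crops as used on LRW.
import torch
import torch.nn as nn
import torchvision.models as models

class VisualFrontend(nn.Module):
    """3D-Conv stem + per-frame ResNet-18 trunk: (B, 1, T, H, W) -> (B, T, 512)."""
    def __init__(self):
        super().__init__()
        # Spatio-temporal stem with temporal stride 1, so T is preserved.
        self.stem = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Reuse a 2D ResNet-18, dropping its own stem and classification head.
        resnet = models.resnet18(weights=None)
        self.trunk = nn.Sequential(resnet.layer1, resnet.layer2, resnet.layer3,
                                   resnet.layer4, nn.AdaptiveAvgPool2d(1))

    def forward(self, x):                          # x: (B, 1, T, H, W)
        x = self.stem(x)                           # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.trunk(x).flatten(1)               # (B*T, 512)
        return x.reshape(b, t, -1)                 # (B, T, 512)

class MultiScaleTemporalBlock(nn.Module):
    """Parallel dilation-free 1D convolutions with several kernel sizes over
    time, concatenated and added residually (a simplified MS-TCN flavour)."""
    def __init__(self, channels=512, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        branch = channels // len(kernel_sizes)     # 512 / 4 = 128 per branch
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, branch, k, padding=k // 2),
                nn.BatchNorm1d(branch),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        )

    def forward(self, x):                          # x: (B, C, T)
        return x + torch.cat([branch(x) for branch in self.branches], dim=1)

class WordLipReader(nn.Module):
    """Frontend features -> temporal blocks -> average over time -> word logits."""
    def __init__(self, num_classes=500):           # 500 word classes, as in LRW
        super().__init__()
        self.frontend = VisualFrontend()
        self.tcn = nn.Sequential(MultiScaleTemporalBlock(), MultiScaleTemporalBlock())
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, clips):                      # (B, 1, T, H, W)
        feats = self.frontend(clips)               # (B, T, 512)
        feats = self.tcn(feats.transpose(1, 2))    # (B, 512, T)
        return self.classifier(feats.mean(dim=2))  # (B, num_classes)

# Usage: a 29-frame clip is the standard LRW length; shapes are assumptions.
model = WordLipReader(num_classes=500)
clip = torch.randn(2, 1, 29, 88, 88)
logits = model(clip)                               # -> torch.Size([2, 500])

The choice of a width-preserving residual temporal block (branch widths summing back to 512) keeps the sketch short; the actual MS-TCN of Martinez et al. (ref_6) stacks dilated temporal convolutions with dense connections, which this simplification does not reproduce.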