{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T14:32:13Z","timestamp":1775917933522,"version":"3.50.1"},"reference-count":77,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2022,5,9]],"date-time":"2022-05-09T00:00:00Z","timestamp":1652054400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT)","award":["NRF-2018X1A3A1069795"],"award-info":[{"award-number":["NRF-2018X1A3A1069795"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Concomitant with the recent advances in deep learning, automatic speech recognition and visual speech recognition (VSR) have received considerable attention. However, although VSR systems must identify speech from both frontal and profile faces in real-world scenarios, most VSR studies have focused solely on frontal face pictures. To address this issue, we propose an end-to-end sentence-level multi-view VSR architecture for faces captured from four different perspectives (frontal, 30\u00b0, 45\u00b0, and 60\u00b0). The encoder uses multiple convolutional neural networks with a spatial attention module to detect minor changes in the mouth patterns of similarly pronounced words, and the decoder uses cascaded local self-attention connectionist temporal classification to collect the details of local contextual information in the immediate vicinity, which results in a substantial performance boost and speedy convergence. To compare the performance of the proposed model for experiments on the OuluVS2 dataset, the dataset was divided into four different perspectives, and the obtained performance improvement was 3.31% (0\u00b0), 4.79% (30\u00b0), 5.51% (45\u00b0), 6.18% (60\u00b0), and 4.95% (mean), respectively, compared with the existing state-of-the-art performance, and the average performance improved by 9.1% compared with the baseline. Thus, the suggested design enhances the performance of multi-view VSR and boosts its usefulness in real-world applications.<\/jats:p>","DOI":"10.3390\/s22093597","type":"journal-article","created":{"date-parts":[[2022,5,10]],"date-time":"2022-05-10T00:30:28Z","timestamp":1652142628000},"page":"3597","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["End-to-End Sentence-Level Multi-View Lipreading Architecture with Spatial Attention Module Integrated Multiple CNNs and Cascaded Local Self-Attention-CTC"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4705-1254","authenticated-orcid":false,"given":"Sanghun","family":"Jeon","sequence":"first","affiliation":[{"name":"Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6050-6594","authenticated-orcid":false,"given":"Mun Sang","family":"Kim","sequence":"additional","affiliation":[{"name":"Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,5,9]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Antonakos, E., Roussos, A., and Zafeiriou, S. (2015, January 4\u20138). A survey on mouth modeling and analysis for sign language recognition. Proceedings of the 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Ljubljana, Slovenia.","DOI":"10.1109\/FG.2015.7163162"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1155\/2008\/810362","article-title":"Comparison of image transform-based features for visual speech recognition in clean and corrupted videos","volume":"2008","author":"Seymour","year":"2007","journal-title":"EURASIP J. Image Video Process."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"3939","DOI":"10.1121\/1.2936018","article-title":"Audiovisual automatic speech recognition: Progress and challenges","volume":"123","author":"Potamianos","year":"2008","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"590","DOI":"10.1016\/j.imavis.2014.06.004","article-title":"A review of recent advances in visual speech decoding","volume":"32","author":"Zhou","year":"2014","journal-title":"Image Vis. Comput."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1109\/MSP.2015.116","article-title":"Biometric liveness detection: Challenges and research opportunities","volume":"13","author":"Akhtar","year":"2015","journal-title":"IEEE Secur. Priv."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3072959.3073640","article-title":"Synthesizing Obama: Learning lip sync from audio","volume":"36","author":"Suwajanakorn","year":"2017","journal-title":"ACM Trans. Graphics (ToG)"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"2306","DOI":"10.1109\/TPAMI.2019.2911077","article-title":"Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos","volume":"42","author":"Koller","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Zhou, H., Zhou, W., Zhou, Y., and Li, H. (2020, January 7\u201312). Spatial-temporal multi-cue network for continuous sign language recognition. Proceedings of the AAAI Conference on Artificial Intelligence, New York City, NY, USA.","DOI":"10.1609\/aaai.v34i07.7001"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"215516","DOI":"10.1109\/ACCESS.2020.3040906","article-title":"Lip reading sentences using deep learning with only visual cues","volume":"8","author":"Fenghour","year":"2020","journal-title":"IEEE Access"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Yang, C., Wang, S., Zhang, X., and Zhu, Y. (2020, January 25\u201328). Speaker-independent lipreading with limited data. Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates.","DOI":"10.1109\/ICIP40778.2020.9190780"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"2054003","DOI":"10.1142\/S0218001420540038","article-title":"Automatic lip reading using convolution neural network and bidirectional long short-term memory","volume":"34","author":"Lu","year":"2020","journal-title":"Int. J. Pattern. Recognit. Artif. Intell."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"981","DOI":"10.1007\/s11760-019-01630-1","article-title":"Lipreading with DenseNet and resBi-LSTM","volume":"14","author":"Chen","year":"2020","journal-title":"Signal Image Video Process."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Petridis, S., Li, Z., and Pantic, M. (2017, January 5\u20139). End-to-end visual speech recognition with LSTMs. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7952625"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Xu, K., Li, D., Cassimatis, N., and Wang, X. (2018, January 15\u201319). LCANet: End-to-end lipreading with cascaded attention-CTC. Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi\u2019an, China.","DOI":"10.1109\/FG.2018.00088"},{"key":"ref_15","unstructured":"Margam, D.K., Aralikatti, R., Sharma, T., Thanda, A., Roy, S., and Venkatesan, S.M. (2019). LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1044\/1059-0889.0403.67","article-title":"Analysis of view angle used in speechreading training of sentences","volume":"4","author":"Bauman","year":"1995","journal-title":"Am. J. Audiol."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Lan, Y., Theobald, B.-J., and Harvey, R. (2012, January 9\u201313). View independent computer lip-reading. Proceedings of the 2012 IEEE International Conference on Multimedia and Expo, Melbourne, VIC, Australia.","DOI":"10.1109\/ICME.2012.192"},{"key":"ref_18","unstructured":"Assael, Y.M., Shillingford, B., Whiteson, S., and De Freitas, N. (2016). Lipnet: End-to-end sentence-level lipreading. arXiv."},{"key":"ref_19","unstructured":"Santos, T.I., Abel, A., Wilson, N., and Xu, Y. (2021, January 19\u201322). Speaker-independent visual speech recognition with the Inception V3 model. Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Lucey, P., and Potamianos, G. (2006, January 3\u20136). Lipreading using profile versus frontal views. Proceedings of the 2006 IEEE Workshop on Multimedia Signal Processing, Victoria, BC, Canada.","DOI":"10.1109\/MMSP.2006.285261"},{"key":"ref_21","unstructured":"Saitoh, T., Zhou, Z., Zhao, G., and Pietik\u00e4inen, M. (2016). Concatenated frame image based CNN for visual speech recognition. Asian Conference on Computer Vision, Springer."},{"key":"ref_22","unstructured":"Zimmermann, M., Ghazi, M.M., Ekenel, H.K., and Thiran, J.-P. (2016). Visual speech recognition using PCA networks and LSTMs in a tandem GMM-HMM system. Asian Conference on Computer Vision, Springer."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Koumparoulis, A., and Potamianos, G. (2018, January 18\u201321). Deep view2view mapping for view-invariant lipreading. Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece.","DOI":"10.1109\/SLT.2018.8639698"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Petridis, S., Wang, Y., Li, Z., and Pantic, M. (2017). End-to-end multi-view lipreading. arXiv.","DOI":"10.5244\/C.31.161"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Zimmermann, M., Ghazi, M.M., Ekenel, H.K., and Thiran, J.-P. (2017). Combining multiple views for visual speech recognition. arXiv.","DOI":"10.21437\/AVSP.2017-10"},{"key":"ref_26","unstructured":"Sahrawat, D., Kumar, Y., Aggarwal, S., Yin, Y., Shah, R.R., and Zimmermann, R. (2020). \u201cNotic My Speech\u201d\u2014Blending speech patterns with multimedia. arXiv."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Anina, I., Zhou, Z., Zhao, G., and Pietik\u00e4inen, M. (2015, January 4\u20138). OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis. Proceedings of the 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Ljubljana, Slovenia.","DOI":"10.1109\/FG.2015.7163155"},{"key":"ref_28","unstructured":"Estellers, V., and Thiran, J.-P. (September, January 29). Multipose audio-visual speech recognition. Proceedings of the 2011 19th European Signal Processing Conference, Barcelona, Spain."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Isobe, S., Tamura, S., and Hayamizu, S. (2021, January 4\u20136). Speech recognition using deep canonical correlation analysis in noisy environments. Proceedings of the ICPRAM, Online.","DOI":"10.5220\/0010268200630070"},{"key":"ref_30","unstructured":"Komai, Y., Yang, N., Takiguchi, T., and Ariki, Y. (November, January 29). Robust AAM-based audio-visual speech recognition against face direction changes. Proceedings of the 20th ACM international conference on Multimedia, Nara, Japan."},{"key":"ref_31","unstructured":"Lee, D., Lee, J., and Kim, K.-E. (2016, January 20\u201324). Multi-view automatic lip-reading using neural network. Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Jeon, S., Elsharkawy, A., and Kim, M.S. (2022). Lipreading architecture based on multiple convolutional neural networks for sentence-level visual speech recognition. Sensors, 22.","DOI":"10.3390\/s22010072"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Zhang, P., Wang, D., Lu, H., Wang, H., and Ruan, X. (2017, January 22\u201329). Amulet: Aggregating multi-level convolutional features for salient object detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.31"},{"key":"ref_34","first-page":"1243","article-title":"Learning to combine foveal glimpses with a third-order Boltzmann machine","volume":"23","author":"Larochelle","year":"2010","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_35","unstructured":"Mnih, V., Heess, N., and Graves, A. (2014, January 8\u201313). Recurrent models of visual attention. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_36","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201323). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.-Y., and Kweon, I.S. (2018, January 8\u201314). CBAM: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Zhang, T., He, L., Li, X., and Feng, G. (2021). Efficient end-to-end sentence-level lipreading with temporal convolutional networks. Appl. Sci., 11.","DOI":"10.3390\/app11156975"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Hlav\u00e1\u010d, M., Gruber, I., \u017delezn\u00fd, M., and Karpov, A. (2020, January 7\u20139). Lipreading with LipsID. Proceedings of the International Conference on Speech and Computer, St. Petersburg, Russia.","DOI":"10.1007\/978-3-030-60276-5_18"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Luo, M., Yang, S., Shan, S., and Chen, X. (2020, January 16\u201320). Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading. Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina.","DOI":"10.1109\/FG47880.2020.00010"},{"key":"ref_42","unstructured":"Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Tompson, J., Goroshin, R., Jain, A., LeCun, Y., and Bregler, C. (2015, January 7\u201312). Efficient object localization using convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298664"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_45","unstructured":"Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"654","DOI":"10.1016\/j.ejor.2017.11.054","article-title":"Deep learning with long short-term memory networks for financial market predictions","volume":"270","author":"Fischer","year":"2018","journal-title":"Eur. J. Operat. Res."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"607","DOI":"10.5626\/JOK.2017.44.6.607","article-title":"Water level forecasting based on deep learning: A use case of Trinity River-Texas-The United States","volume":"44","author":"Tran","year":"2017","journal-title":"J. KIISE"},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1016\/j.cviu.2018.02.001","article-title":"Learning to lip read words by watching videos","volume":"173","author":"Chung","year":"2018","journal-title":"Comput. Vis. Image Understand."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Graves, A., Fern\u00e1ndez, S., Gomez, F., and Schmidhuber, J. (2006, January 25\u201329). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd international Conference on Machine Learning, Pittsburgh, PA, USA.","DOI":"10.1145\/1143844.1143891"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Chung, J.S., Senior, A., Vinyals, O., and Zisserman, A. (2017, January 26). Lip reading sentences in the wild. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.367"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Cheng, J., Dong, L., and Lapata, M. (2016). Long short-term memory-networks for machine reading. arXiv.","DOI":"10.18653\/v1\/D16-1053"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Shen, T., Zhou, T., Long, G., Jiang, J., Pan, S., and Zhang, C. (2018, January 2\u20137). Disan: Directional self-attention network for RNN\/CNN-free language understanding. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11941"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Chan, W., and Jaitly, N. (2017, January 5\u20139). Very deep convolutional networks for end-to-end speech recognition. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953077"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Kim, S., Hori, T., and Watanabe, S. (2017, January 5\u20139). Joint CTC-attention based end-to-end speech recognition using multi-task learning. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953075"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Chiu, C.-C., Sainath, T.N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss, R.J., Rao, K., and Gonina, E. (2018, January 15\u201320). State-of-the-art speech recognition with sequence-to-sequence models. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462105"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Zeyer, A., Irie, K., Schl\u00fcter, R., and Ney, H. (2018). Improved training of end-to-end attention models for speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2018-1616"},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"184","DOI":"10.1109\/TASLP.2017.2765834","article-title":"Progressive joint modeling in unsupervised single-channel overlapped speech recognition","volume":"26","author":"Chen","year":"2017","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_58","unstructured":"Erdogan, H., Hayashi, T., Hershey, J.R., Hori, T., Hori, C., Hsu, W.N., Kim, S., Le Roux, J., Meng, Z., and Watanabe, S. (2016, January 13). Multi-channel speech recognition: Lstms all the way through. Proceedings of the CHiME-4 Workshop, San Francisco, CA, USA."},{"key":"ref_59","unstructured":"Liu, P.J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., and Shazeer, N. (2018). Generating Wikipedia by summarizing long sequences. arXiv."},{"key":"ref_60","unstructured":"Huang, C.-Z.A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A.M., Hoffman, M.D., Dinculescu, M., and Eck, D. (2018). Music transformer. arXiv."},{"key":"ref_61","first-page":"1755","article-title":"Dlib-ml: A machine learning toolkit","volume":"10","author":"King","year":"2009","journal-title":"J. Mach. Learn. Res."},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., and Pantic, M. (2013, January 2\u20138). 300 faces in-the-wild challenge: The first facial landmark localization challenge. Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, NSW, Australia.","DOI":"10.1109\/ICCVW.2013.59"},{"key":"ref_63","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Bottou, L. (2012). Stochastic gradient descent tricks. Neural Networks: Tricks of the Trade, Springer.","DOI":"10.1007\/978-3-642-35289-8_25"},{"key":"ref_65","first-page":"26","article-title":"Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude","volume":"4","author":"Tieleman","year":"2012","journal-title":"COURSERA Neural Netw. Mach. Learn."},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2017, January 21\u201326). Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_67","unstructured":"Schaul, T., Antonoglou, I., and Silver, D. (2013). Unit tests for stochastic optimization. arXiv."},{"key":"ref_68","unstructured":"Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013, January 16\u201321). On the importance of initialization and momentum in deep learning. Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA. PMLR."},{"key":"ref_69","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TPAMI.2013.173","article-title":"A compact representation of visual speech data using latent variables","volume":"36","author":"Zhou","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_70","unstructured":"Chung, J.S., and Zisserman, A. (2016, January 20\u201324). Out of time: Automated lip sync in the wild. Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan."},{"key":"ref_71","unstructured":"Chung, J.S., and Zisserman, A. (2017, January 4\u20137). Lip reading in profile. Proceedings of the British Machine Vision Conference (BMVC), Imperial College London, London, UK."},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Petridis, S., Wang, Y., Li, Z., and Pantic, M. (2017). End-to-end audiovisual fusion with LSTMs. arXiv.","DOI":"10.21437\/AVSP.2017-8"},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Han, H., Kang, S., and Yoo, C.D. (2017, January 17\u201320). Multi-view visual speech recognition based on multi task learning. Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China.","DOI":"10.1109\/ICIP.2017.8297030"},{"key":"ref_74","doi-asserted-by":"crossref","unstructured":"Fung, I., and Mak, B. (2018, January 15\u201320). End-to-end low-resource lip-reading with maxout CNN and LSTM. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462280"},{"key":"ref_75","doi-asserted-by":"crossref","unstructured":"Fernandez-Lopez, A., and Sukno, F.M. (2019, January 2\u20136). Lip-reading with limited-data network. Proceedings of the 2019 27th European Signal Processing Conference (EUSIPCO), A Coru\u00f1a, Spain.","DOI":"10.23919\/EUSIPCO.2019.8902572"},{"key":"ref_76","doi-asserted-by":"crossref","unstructured":"Isobe, S., Tamura, S., Hayamizu, S., Gotoh, Y., and Nose, M. (2021, January 16\u201318). Multi-angle lipreading using angle classification and angle-specific feature integration. Proceedings of the 2020 International Conference on Communications, Signal Processing, and Their Applications (ICCSPA), Sharjah, United Arab Emirates.","DOI":"10.1109\/ICCSPA49915.2021.9385743"},{"key":"ref_77","doi-asserted-by":"crossref","unstructured":"Isobe, S., Tamura, S., Hayamizu, S., Gotoh, Y., and Nose, M. (2021). Multi-angle lipreading with angle classification-based feature extraction and its application to audio-visual speech recognition. Future Internet, 13.","DOI":"10.3390\/fi13070182"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/9\/3597\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:08:16Z","timestamp":1760137696000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/9\/3597"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,9]]},"references-count":77,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2022,5]]}},"alternative-id":["s22093597"],"URL":"https:\/\/doi.org\/10.3390\/s22093597","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,5,9]]}}}