{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,12]],"date-time":"2026-07-12T04:30:33Z","timestamp":1783830633732,"version":"3.55.0"},"reference-count":27,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2023,2,7]],"date-time":"2023-02-07T00:00:00Z","timestamp":1675728000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>This paper investigates multimodal sensor architectures with deep learning for audio-visual speech recognition, focusing on in-the-wild scenarios. The term \u201cin the wild\u201d is used to describe AVSR for unconstrained natural-language audio streams and video-stream modalities. Audio-visual speech recognition (AVSR) is a speech-recognition task that leverages both an audio input of a human voice and an aligned visual input of lip motions. However, since in-the-wild scenarios can include more noise, AVSR\u2019s performance is affected. Here, we propose new improvements for AVSR models by incorporating data-augmentation techniques to generate more data samples for building the classification models. For the data-augmentation techniques, we utilized a combination of conventional approaches (e.g., flips and rotations), as well as newer approaches, such as generative adversarial networks (GANs). To validate the approaches, we used augmented data from well-known datasets (LRS2\u2014Lip Reading Sentences 2 and LRS3) in the training process and testing was performed using the original data. The study and experimental results indicated that the proposed AVSR model and framework, combined with the augmentation approach, enhanced the performance of the AVSR framework in the wild for noisy datasets. Furthermore, in this study, we discuss the domains of automatic speech recognition (ASR) architectures and audio-visual speech recognition (AVSR) architectures and give a concise summary of the AVSR models that have been proposed.<\/jats:p>","DOI":"10.3390\/s23041834","type":"journal-article","created":{"date-parts":[[2023,2,7]],"date-time":"2023-02-07T02:28:17Z","timestamp":1675736897000},"page":"1834","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":19,"title":["Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild"],"prefix":"10.3390","volume":"23","author":[{"given":"Yibo","family":"He","sequence":"first","affiliation":[{"name":"School of AI and Advanced Computing, Xian Jiaotong Liverpool University, Suzhou 215123, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kah Phooi","family":"Seng","sequence":"additional","affiliation":[{"name":"School of AI and Advanced Computing, Xian Jiaotong Liverpool University, Suzhou 215123, China"},{"name":"School of Computer Science, Queensland University of Technology, Brisbane, QLD 4000, Australia"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Li Minn","family":"Ang","sequence":"additional","affiliation":[{"name":"School of Science, Technology and Engineering, University of Sunshine Coast, Sippy Downs, QLD 4502, Australia"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2023,2,7]]},"reference":[{"key":"ref_1","unstructured":"Haton, J.-P. (1999). Speech Processing, Recognition and Artificial Neural Networks: Proceedings of the 3rd International School on Neural Nets \u201cEduardo R. Caianiello\u201d, Springer."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"185","DOI":"10.1109\/WESCAN.1995.493968","article-title":"Speech recognition techniques using RBF networks","volume":"Volume 1","author":"Phillips","year":"1995","journal-title":"IEEE WESCANEX 95. Communications, Power, and Computing. Conference Proceedings"},{"key":"ref_3","first-page":"305","article-title":"Lyapunov theory-based multilayered neural network","volume":"56","author":"Lim","year":"2009","journal-title":"IEEE Trans. Circuits Syst. II Express Briefs"},{"key":"ref_4","unstructured":"Kinjo, T., and Funaki, K. (2006). IECON 2006-32nd Annual Conference on IEEE Industrial Electronics, IEEE."},{"key":"ref_5","unstructured":"Zweig, G., and Nguyen, P. (2009). 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, IEEE."},{"key":"ref_6","first-page":"1","article-title":"Robust speech recognition in additive and channel noise environments using GMM and EM algorithm","volume":"Volume 1","author":"Fujimoto","year":"2004","journal-title":"2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, QC, Canada, 17\u201321 May 2004"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"e1","DOI":"10.1017\/ATSIP.2015.22","article-title":"Deep learning: From speech recognition to language and multimodal processing","volume":"5","author":"Deng","year":"2016","journal-title":"APSIPA Trans. Signal Inf. Process."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"19143","DOI":"10.1109\/ACCESS.2019.2896880","article-title":"Speech Recognition Using Deep Neural Networks: A Systematic Review. IEEE Access 2019, 7, 19143\u201319165","volume":"7","author":"Nassif","year":"2019","journal-title":"IEEE Access"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1109\/6046.865479","article-title":"Audio-visual speech modeling for continuous speech recognition","volume":"2","author":"Dupont","year":"2000","journal-title":"IEEE Trans. Multimedia"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"2025","DOI":"10.1109\/JPROC.2006.886017","article-title":"Audio-Visual Biometrics","volume":"94","author":"Aleksic","year":"2006","journal-title":"Proc. IEEE"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"266","DOI":"10.1109\/THMS.2017.2695613","article-title":"Video Analytics for Customer Emotion and Satisfaction at Contact Centers","volume":"48","author":"Seng","year":"2017","journal-title":"IEEE Trans. Human-Machine Syst."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Tian, Y., Shi, J., Li, B., Duan, Z., and Xu, C. (2018, January 8\u201314). Audio-visual event localization in unconstrained videos. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01216-8_16"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Son Chung, J., Senior, A., Vinyals, O., and Zisserman, A. (2017, January 21\u201326). Lip reading sentences in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.367"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Yu, J., Zhang, S.X., Wu, J., Ghorbani, S., Wu, B., Kang, S., and Yu, D. (2020, January 4\u20138). Audio-visual recognition of overlapped speech for the lrs2 dataset. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9054127"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"8717","DOI":"10.1109\/TPAMI.2018.2889052","article-title":"Deep audio-visual speech recognition","volume":"44","author":"Afouras","year":"2018","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Shoumy, N.J., Ang, L.-M., Rahaman, D.M.M., Zia, T., Seng, K.P., and Khatun, S. (2021, January 26\u201329). Augmented audio data in improving speech emotion classification tasks. Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Federal Territory of Kuala Lumpur, Malaysia.","DOI":"10.1007\/978-3-030-79463-7_30"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Stafylakis, T., and Tzimiropoulos, G. (2018, January 8\u201314). Zero-shot keyword spotting for visual speech recognition in-the-wild. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01225-0_32"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Han, K.J., Hahm, S., Kim, B.-H., Kim, J., and Lane, I. (2017, January 20\u201324). Deep Learning-Based Telephony Speech Recognition in the Wild. Proceedings of the Interspeech 2017, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-1695"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Ali, A., Vogel, S., and Renals, S. (2017, January 16\u201320). Speech recognition challenge in the wild: Arabic MGB-3. Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan.","DOI":"10.1109\/ASRU.2017.8268952"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Li, S., Zheng, W., Zong, Y., Lu, C., Tang, C., Jiang, X., and Xia, W. (2019, January 14\u201318). Bi-modality fusion for emotion recognition in the wild. Proceedings of the 2019 International Conference on Multimodal Interaction, Suzhou, China.","DOI":"10.1145\/3340555.3355719"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Lu, C., Zheng, W., Li, C., Tang, C., Liu, S., Yan, S., and Zong, Y. (2018, January 16\u201320). Multiple spatio-temporal feature learning for video-based emotion recognition in the wild. Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, USA.","DOI":"10.1145\/3242969.3264992"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"8669","DOI":"10.1007\/s00521-020-05616-w","article-title":"HEU Emotion: A large-scale database for multimodal emotion recognition in the wild","volume":"33","author":"Chen","year":"2021","journal-title":"Neural Comput. Appl."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Hajavi, A., and Etemad, A. (2021, January 6\u201311). Siamese capsule network for end-to-end speaker recognition in the wild. Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9414722"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Nguyen, H., Maclagan, S.J., Nguyen, T.D., Nguyen, T., Flemons, P., Andrews, K., Ritchie, E.G., and Phung, D. (2017, January 19\u201321). Animal Recognition and Identification with Deep Convolutional Neural Networks for Automated Wildlife Monitoring. Proceedings of the 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Tokyo, Japan.","DOI":"10.1109\/DSAA.2017.31"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Prajwal, K.R., Mukhopadhyay, R., Namboodiri, V.P., and Jawahar, C. (2020, January 12\u201316). A lip sync expert is all you need for speech to lip generation in the wild. Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA.","DOI":"10.1145\/3394171.3413532"},{"key":"ref_26","unstructured":"(2022, November 20). The Oxford-BBC Lip Reading Sentences 2 (LRS2) Dataset. Available online: https:\/\/www.robots.ox.ac.uk\/~vgg\/data\/lip_reading\/lrs2.html."},{"key":"ref_27","unstructured":"(2022, November 10). Lip Reading Sentences 3 (LRS3) Dataset. Available online: https:\/\/www.robots.ox.ac.uk\/~vgg\/data\/lip_reading\/lrs3.html."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/4\/1834\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T18:26:16Z","timestamp":1760120776000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/4\/1834"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,7]]},"references-count":27,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2023,2]]}},"alternative-id":["s23041834"],"URL":"https:\/\/doi.org\/10.3390\/s23041834","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,7]]}}}