{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,21]],"date-time":"2026-04-21T12:14:42Z","timestamp":1776773682808,"version":"3.51.2"},"reference-count":45,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2023,10,19]],"date-time":"2023-10-19T00:00:00Z","timestamp":1697673600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>This paper proposes a novel multimodal generative adversarial network AVSR (multimodal AVSR GAN) architecture, to improve both the energy efficiency and the AVSR classification accuracy of artificial intelligence Internet of things (IoT) applications. The audio-visual speech recognition (AVSR) modality is a classical multimodal modality, which is commonly used in IoT and embedded systems. Examples of suitable IoT applications include in-cabin speech recognition systems for driving systems, AVSR in augmented reality environments, and interactive applications such as virtual aquariums. The application of multimodal sensor data for IoT applications requires efficient information processing, to meet the hardware constraints of IoT devices. The proposed multimodal AVSR GAN architecture is composed of a discriminator and a generator, each of which is a two-stream network, corresponding to the audio stream information and the visual stream information, respectively. To validate this approach, we used augmented data from well-known datasets (LRS2-Lip Reading Sentences 2 and LRS3) in the training process, and testing was performed using the original data. The research and experimental results showed that the proposed multimodal AVSR GAN architecture improved the AVSR classification accuracy. 
Furthermore, in this study, we discuss the domain of GANs and provide a concise summary of the proposed GANs.<\/jats:p>","DOI":"10.3390\/info14100575","type":"journal-article","created":{"date-parts":[[2023,10,19]],"date-time":"2023-10-19T11:46:26Z","timestamp":1697715986000},"page":"575","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":21,"title":["Generative Adversarial Networks (GANs) for Audio-Visual Speech Recognition in Artificial Intelligence IoT"],"prefix":"10.3390","volume":"14","author":[{"given":"Yibo","family":"He","sequence":"first","affiliation":[{"name":"School of AI and Advanced Computing, Xi\u2019an Jiaotong Liverpool University, Suzhou 215000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kah Phooi","family":"Seng","sequence":"additional","affiliation":[{"name":"School of AI and Advanced Computing, Xi\u2019an Jiaotong Liverpool University, Suzhou 215000, China"},{"name":"School of Computer Science, Queensland University of Technology, Brisbane City, QLD 4000, Australia"},{"name":"School of Science Technology and Engineering, University of the Sunshine Coast, Sippy Downs, QLD 4556, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Li Minn","family":"Ang","sequence":"additional","affiliation":[{"name":"School of Science Technology and Engineering, University of the Sunshine Coast, Sippy Downs, QLD 4556, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2023,10,19]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"2787","DOI":"10.1016\/j.comnet.2010.05.010","article-title":"The internet of things: A survey","volume":"54","author":"Atzori","year":"2010","journal-title":"Comput. 
Netw."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"173","DOI":"10.1016\/j.sigpro.2015.05.021","article-title":"Interference alignment and game-theoretic power allocation in MIMO heterogeneous sensor networks communications","volume":"126","author":"Zhao","year":"2016","journal-title":"Signal Process."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1016\/j.cose.2005.12.003","article-title":"Radio frequency identification (RFID)","volume":"25","author":"Roberts","year":"2006","journal-title":"Comput. Secur."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"e1930","DOI":"10.1002\/nem.1930","article-title":"Recent advances delivered by mobile cloud computing and internet of things for big data applications: A survey","volume":"27","author":"Stergiou","year":"2017","journal-title":"Int. J. Netw. Manag."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"725","DOI":"10.3389\/fpsyg.2014.00725","article-title":"What is the McGurk effect?","volume":"5","author":"Tiippana","year":"2014","journal-title":"Front. Psychol."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Kinjo, T., and Funaki, K. (2006, January 6\u201310). On HMM speech recognition based on complex speech analysis. Proceedings of the IECON 2006\u201432nd Annual Conference on IEEE Industrial Electronics, Paris, France.","DOI":"10.1109\/IECON.2006.347837"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1109\/6046.865479","article-title":"Audio-visual speech modeling for continuous speech recognition","volume":"2","author":"Dupont","year":"2000","journal-title":"IEEE Trans. Multimed."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"8717","DOI":"10.1109\/TPAMI.2018.2889052","article-title":"Deep audio-visual speech recognition","volume":"44","author":"Afouras","year":"2018","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_9","unstructured":"Zhang, X., Cheng, F., and Wang, S. 
(November, January 27). Spatio-temporal fusion based convolutional sequence learning for lip reading. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Li, W., Wang, S., Lei, M., Siniscalchi, S.M., and Lee, C.H. (2019, January 12\u201317). Improving audio-visual speech recognition performance with cross-modal student-teacher training. Proceedings of the ICASSP 2019\u20142019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8682868"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Mehrabani, M., Bangalore, S., and Stern, B. (2015, January 14\u201316). Personalized speech recognition for Internet of Things. Proceedings of the 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT), Milan, Italy.","DOI":"10.1109\/WF-IoT.2015.7389082"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Dabran, I., Avny, T., Singher, E., and Danan, H.B. (2017, January 3\u201315). Augmented reality speech recognition for the hearing impaired. Proceedings of the 2017 IEEE International Conference on Microwaves, Antennas, Communications and Electronic Systems (COMCAS), Tel-Aviv, Israel.","DOI":"10.1109\/COMCAS.2017.8244731"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"8406","DOI":"10.1109\/JIOT.2019.2917933","article-title":"Privacy-preserving outsourced speech recognition for smart IoT devices","volume":"6","author":"Ma","year":"2019","journal-title":"IEEE Internet Things J."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"B\u00e4ckstr\u00f6m, T. (2018, January 28\u201331). Speech coding, speech interfaces and IoT-opportunities and challenges. 
Proceedings of the 2018 52nd Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA.","DOI":"10.1109\/ACSSC.2018.8645502"},{"key":"ref_15","unstructured":"Brock, A., Donahue, J., and Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv preprint."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Park, T., Liu, M., Wang, T., and Zhu, J. (2019, January 15\u201320). Semantic image synthesis with spatially-adaptive normalization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00244"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., and Loy, C.C. (2018, January 8\u201314). Esrgan: Enhanced super-resolution generative adversarial networks. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-11021-5_5"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., and Huang, T.S. (2018, January 18\u201322). Generative image inpainting with contextual attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00577"},{"key":"ref_19","unstructured":"Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Advances in Neural Information Processing Systems (NeurIPS), The MIT Press."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"383","DOI":"10.1007\/s12021-018-9377-x","article-title":"Segan: Adversarial network with multi-scale L1 loss for medical image segmentation","volume":"16","author":"Xue","year":"2018","journal-title":"Neuroinformatics"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Karras, T., Laine, S., and Aila, T. (2019, January 15\u201320). 
A style-based generator architecture for generative adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00453"},{"key":"ref_22","unstructured":"Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2017). Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv preprint."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020, January 14\u201319). Analyzing and improving the image quality of stylegan. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00813"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Wang, X., Li, Y., Zhang, H., and Shan, Y. (2021, January 20\u201325). Towards real-world blind face restoration with generative facial prior. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00905"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1109\/MSP.2012.2211477","article-title":"The mnist database of handwritten digit images for machine learning research [best of the web]","volume":"29","author":"Deng","year":"2012","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_26","unstructured":"Krizhevsky, A., and Hinton, G. (2009). Learning Multiple Layers of Features from Tiny Images, NYU."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Zeiler, M.D., and Fergus, R. (2014, January 6\u201312). Visualizing and understanding convolutional networks. Proceedings of the Computer Vision\u2014ECCV 2014: 13th European Conference, Zurich, Switzerland. Proceedings, Part I 13.","DOI":"10.1007\/978-3-319-10590-1_53"},{"key":"ref_28","unstructured":"Radford, A., Metz, L., and Chintala, S. (2015). 
Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint."},{"key":"ref_29","first-page":"90","article-title":"Deep convolutional neural network for image deconvolution","volume":"27","author":"Xu","year":"2014","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_30","unstructured":"Springenberg, J.T., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint."},{"key":"ref_31","unstructured":"Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018, January 3\u20138). How does batch normalization help optimization?. Proceedings of the 2018 Conference on Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_32","unstructured":"Li, Y., and Yuan, Y. (2017, January 4\u20139). Convergence analysis of two-layer neural networks with ReLU activation. Proceedings of the 2017 Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Lai, W.S., Huang, J.B., Ahuja, N., and Yang, M.H. (2017). Deep Laplacian pyramid networks for fast and accurate super-resolution. arXiv preprint.","DOI":"10.1109\/CVPR.2017.618"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"ImageNet large scale visual recognition challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis."},{"key":"ref_35","unstructured":"Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. (2019). Self-attention generative adversarial networks. arXiv preprint."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Zhu, J.Y., Park, T., Isola, P., and Efros, A.A. (2017, January 22\u201329). Unpaired image-to-image translation using cycle-consistent adversarial networks. 
Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.244"},{"key":"ref_37","unstructured":"Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., and Shechtman, E. (2017). Toward multimodal image-to-image translation. arXiv preprint."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., and Choo, J. (2018). Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. arXiv preprint.","DOI":"10.1109\/CVPR.2018.00916"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Ledig, C., Theis, L., Husz\u00e1r, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., and Wang, Z. (2017). Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint.","DOI":"10.1109\/CVPR.2017.19"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1016\/j.image.2019.02.008","article-title":"Diverse adversarial network for image super-resolution","volume":"74","author":"Zareapoor","year":"2019","journal-title":"Signal Process. Image Commun."},{"key":"ref_41","unstructured":"Chung, J., and Zisserman, A. (2017, January 4\u20137). Lip reading in profile. Proceedings of the British Machine Vision Conference, London, UK."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Prajwal, K.R., Mukhopadhyay, R., Namboodiri, V.P., and Jawahar, C.V. (2020, January 12\u201316). A lip sync expert is all you need for speech to lip generation in the wild. Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA.","DOI":"10.1145\/3394171.3413532"},{"key":"ref_43","unstructured":"(2023, March 08). The Oxford-BBC Lip Reading Sentences 2 (LRS2) Dataset. Available online: https:\/\/www.robots.ox.ac.uk\/~vgg\/data\/lip_reading\/lrs2.html."},{"key":"ref_44","unstructured":"(2023, March 08). 
Lip Reading Sentences 3 (LRS3) Dataset. Available online: https:\/\/www.robots.ox.ac.uk\/~vgg\/data\/lip_reading\/lrs3.html."},{"key":"ref_45","unstructured":"(2023, July 08). The Oxford-BBC Lip Reading in the Wild (LRW) Dataset. Available online: https:\/\/www.robots.ox.ac.uk\/~vgg\/data\/lip_reading\/lrw1.html."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/14\/10\/575\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T21:10:02Z","timestamp":1760130602000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/14\/10\/575"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,19]]},"references-count":45,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2023,10]]}},"alternative-id":["info14100575"],"URL":"https:\/\/doi.org\/10.3390\/info14100575","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,10,19]]}}}