{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,25]],"date-time":"2026-02-25T03:04:22Z","timestamp":1771988662674,"version":"3.50.1"},"reference-count":63,"publisher":"MDPI AG","issue":"20","license":[{"start":{"date-parts":[[2022,10,12]],"date-time":"2022-10-12T00:00:00Z","timestamp":1665532800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT)","award":["NRF-2018X1A3A1069795"],"award-info":[{"award-number":["NRF-2018X1A3A1069795"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Speech is a commonly used interaction-recognition technique in edutainment-based systems and is a key technology for smooth educational learning and user\u2013system interaction. However, its application to real environments is limited owing to the various noise disruptions in real environments. In this study, an audio and visual information-based multimode interaction system is proposed that enables virtual aquarium systems that use speech to interact to be robust to ambient noise. For audio-based speech recognition, a list of words recognized by a speech API is expressed as word vectors using a pretrained model. Meanwhile, vision-based speech recognition uses a composite end-to-end deep neural network. Subsequently, the vectors derived from the API and vision are classified after concatenation. The signal-to-noise ratio of the proposed system was determined based on data from four types of noise environments. Furthermore, it was tested for accuracy and efficiency against existing single-mode strategies for extracting visual features and audio speech recognition. Its average recognition rate was 91.42% when only speech was used, and improved by 6.7% to 98.12% when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly utilized, such as caf\u00e9s, museums, music halls, and kiosks.<\/jats:p>","DOI":"10.3390\/s22207738","type":"journal-article","created":{"date-parts":[[2022,10,12]],"date-time":"2022-10-12T22:45:29Z","timestamp":1665614729000},"page":"7738","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4705-1254","authenticated-orcid":false,"given":"Sanghun","family":"Jeon","sequence":"first","affiliation":[{"name":"Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6050-6594","authenticated-orcid":false,"given":"Mun Sang","family":"Kim","sequence":"additional","affiliation":[{"name":"Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,10,12]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"22","DOI":"10.4018\/IJMHCI.2020010102","article-title":"Framing the design space of multimodal mid-air gesture and speech-based interaction with mobile devices for older people","volume":"12","author":"Mich","year":"2020","journal-title":"Int. J. Mob. Hum. Comput. Interact."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Kaburagi, R., Ishimaru, Y., Chin, W.H., Yorita, A., Kubota, N., and Egerton, S. (2021, January 8\u201310). Lifelong robot edutainment based on self-efficacy. Proceedings of the 2021 5th IEEE International Conference on Cybernetics (CYBCONF), Sendai, Japan.","DOI":"10.1109\/CYBCONF51991.2021.9464143"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Soo, V.-W., Huang, C.-F., Su, Y.-H., and Su, M.-J. (2018, January 27\u201330). AI applications on music technology for edutainment. Proceedings of the International Conference on Innovative Technologies and Learning, Portoroz, Slovenia.","DOI":"10.1007\/978-3-319-99737-7_63"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Tsai, T.-H., Chi, P.-T., and Cheng, K.-H. (2019, January 24\u201326). A sketch classifier technique with deep learning models realized in an embedded system. Proceedings of the 2019 IEEE 22nd International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), Cluj-Napoca, Romania.","DOI":"10.1109\/DDECS.2019.8724656"},{"key":"ref_5","first-page":"82","article-title":"Educational values in factual nature pictures","volume":"33","author":"Disney","year":"1954","journal-title":"Educ. Horiz."},{"key":"ref_6","unstructured":"Rapeepisarn, K., Wong, K.W., Fung, C.C., and Depickere, A. (2006, January 4\u20136). Similarities and differences between \u201clearn through play\u201d and \u201cedutainment\u201d. Proceedings of the 3rd Australasian Conference on Interactive Entertainment, Perth, Australia."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"136864","DOI":"10.1155\/2013\/136864","article-title":"Assessment in and of serious games: An overview","volume":"2013","author":"Bellotti","year":"2013","journal-title":"Adv. Hum.-Comput. Interact."},{"key":"ref_8","unstructured":"Zin, H.M., and Zain, N.Z.M. (2010, January 1). The effects of edutainment towards students\u2019 achievements. Proceedings of the Regional Conference on Knowledge Integration in ICT, Putrajaya, Malaysia."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1007\/s10956-007-9077-z","article-title":"Comparing the impacts of tutorial and edutainment software programs on students\u2019 achievements, misconceptions, and attitudes towards biology","volume":"17","author":"Kara","year":"2008","journal-title":"J. Sci. Educ. Technol."},{"key":"ref_10","unstructured":"Efthymiou, N., Filntisis, P., Potamianos, G., and Maragos, P. (July, January 29). A robotic edutainment framework for designing child-robot interaction scenarios. Proceedings of the 14th Pervasive Technologies Related to Assistive Environments Conference, Corfu, Greece."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Matul\u00edk, M., Vavre\u010dka, M., and Vidovi\u0107ov\u00e1, L. (2020, January 17\u201319). Edutainment software for the Pepper robot. Proceedings of the 2020 4th International Symposium on Computer Science and Intelligent Control, Newcastle Upon Tyne, UK.","DOI":"10.1145\/3440084.3441194"},{"key":"ref_12","first-page":"9753979","article-title":"User satisfaction for an augmented reality application to support productive vocabulary using speech recognition","volume":"2018","author":"Arshad","year":"2018","journal-title":"Adv. Multimed."},{"key":"ref_13","first-page":"207","article-title":"Istanbul Aquarium edutainment project","volume":"10","author":"Yum","year":"2022","journal-title":"Online J. Art Des."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1016\/j.cag.2019.06.003","article-title":"2D, 3D or speech? A case study on which user interface is preferable for what kind of object interaction in immersive virtual reality","volume":"82","author":"Hepperle","year":"2019","journal-title":"Comput. Graph."},{"key":"ref_15","unstructured":"Janowski, K., Kistler, F., and Andr\u00e9, E. (2013, January 19\u201321). Gestures or speech? Comparing modality selection for different interaction tasks in a virtual environment. Proceedings of the Tilburg Gesture Research Meeting, Tilburg, The Netherlands."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"565","DOI":"10.1016\/B978-0-12-397025-1.00047-6","article-title":"Multisensory integration and audiovisual speech perception","volume":"2","author":"Venezia","year":"2015","journal-title":"Brain Mapp. Encycl. Ref."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1001","DOI":"10.1098\/rstb.2007.2155","article-title":"The processing of audio-visual speech: Empirical and neural bases","volume":"363","author":"Campbell","year":"2008","journal-title":"Philos. Trans. R. Soc. B Biol. Sci."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"212","DOI":"10.1121\/1.1907309","article-title":"Visual contribution to speech intelligibility in noise","volume":"26","author":"Sumby","year":"1954","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1068\/p060031","article-title":"The role of vision in the perception of speech","volume":"6","author":"Dodd","year":"1977","journal-title":"Perception"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1129","DOI":"10.1097\/00001756-200306110-00006","article-title":"Brain activity during audiovisual speech perception: An fMRI study of the McGurk effect","volume":"14","author":"Jones","year":"2003","journal-title":"Neuroreport"},{"key":"ref_21","first-page":"153","article-title":"The importance of prosodic speech elements for the lipreader","volume":"4","author":"Risberg","year":"1974","journal-title":"Scand. Audiol."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"671","DOI":"10.1121\/1.392335","article-title":"The contribution of fundamental frequency, amplitude envelope, and voicing duration cues to speechreading in normal-hearing subjects","volume":"77","author":"Grant","year":"1985","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"397","DOI":"10.1121\/1.397690","article-title":"Single-channel vibrotactile supplements to visual perception of intonation and stress","volume":"85","author":"Bernstein","year":"1989","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"746","DOI":"10.1038\/264746a0","article-title":"Hearing lips and seeing voices","volume":"264","author":"McGurk","year":"1976","journal-title":"Nature"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2017, January 21\u201326). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Graves, A., Fern\u00e1ndez, S., Gomez, F., and Schmidhuber, J. (2006, January 25\u201329). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA.","DOI":"10.1145\/1143844.1143891"},{"key":"ref_27","unstructured":"(2022, July 07). Google Cloud Speech to Text. Available online: https:\/\/cloud.google.com\/speech-to-text."},{"key":"ref_28","unstructured":"(2022, July 07). Watson Speech to Text. Available online: https:\/\/www.ibm.com\/kr-ko\/cloud\/watson-speech-to-text."},{"key":"ref_29","unstructured":"(2022, July 07). Microsoft Azure Cognitive Services. Available online: https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., and Stolcke, A. (2018, January 15\u201320). The Microsoft 2017 conversational speech recognition system. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461870"},{"key":"ref_31","unstructured":"(2022, July 07). Amazon Alexa. Available online: https:\/\/developer.amazon.com\/en-US\/alexa."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1109\/TPAMI.2012.59","article-title":"3D convolutional neural networks for human action recognition","volume":"35","author":"Ji","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Petridis, S., and Pantic, M. (2016, January 20\u201325). Deep complementary bottleneck features for visual speech recognition. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472088"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Wand, M., Koutn\u00edk, J., and Schmidhuber, J. (2016, January 20\u201325). Lipreading with long short-term memory. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472852"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"2421","DOI":"10.1121\/1.2229005","article-title":"An audio-visual corpus for speech perception and automatic speech recognition","volume":"120","author":"Cooke","year":"2006","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H.G., and Ogata, T. (2014, January 14\u201318). Lipreading using convolutional neural network. Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association, Singapore.","DOI":"10.21437\/Interspeech.2014-293"},{"key":"ref_37","unstructured":"Assael, Y.M., Shillingford, B., Whiteson, S., and De Freitas, N. (2016). Lipnet: End-to-end sentence-level lipreading. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Fenghour, S., Chen, D., Guo, K., Li, B., and Xiao, P. (2021). An effective conversion of visemes to words for high-performance automatic lipreading. Sensors, 21.","DOI":"10.3390\/s21237890"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Li, H., Yadikar, N., Zhu, Y., Mamut, M., and Ubul, K. (2022). Learning the relative dynamic features for word-level lipreading. Sensors, 22.","DOI":"10.3390\/s22103732"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Xu, K., Li, D., Cassimatis, N., and Wang, X. (2018, January 15\u201319). LCANet: End-to-end lipreading with cascaded attention-CTC. Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi\u2019an, China.","DOI":"10.1109\/FG.2018.00088"},{"key":"ref_41","first-page":"20","article-title":"Comparing speech recognition systems (Microsoft API, Google API and CMU Sphinx)","volume":"7","author":"Bohouta","year":"2017","journal-title":"Int. J. Eng. Res. Appl."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"10","DOI":"10.2991\/ijndc.k.201218.005","article-title":"The performance evaluation of continuous speech recognition based on Korean phonological rules of cloud-based speech recognition open API","volume":"9","author":"Yoo","year":"2021","journal-title":"Int. J. Netw. Distrib. Comput."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Alibegovi\u0107, B., Prlja\u010da, N., Kimmel, M., and Schultalbers, M. (2020, January 13\u201315). Speech recognition system for a service robot\u2014A performance evaluation. Proceedings of the 2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV), Shenzhen, China.","DOI":"10.1109\/ICARCV50220.2020.9305342"},{"key":"ref_44","first-page":"245","article-title":"Using voice recognition software to improve communicative writing and social participation in an individual with severe acquired dysgraphia: An experimental single-case therapy study","volume":"30","author":"Caute","year":"2016","journal-title":"Aphasiology"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Jeon, S., and Kim, M.S. (2022). End-to-end lip-reading open cloud-based speech architecture. Sensors, 22.","DOI":"10.3390\/s22082938"},{"key":"ref_46","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013, January 5\u201310). Distributed Representations of Words and Phrases and Their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA. Ser. NIPS\u201913."},{"key":"ref_47","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv."},{"key":"ref_48","unstructured":"Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Tompson, J., Goroshin, R., Jain, A., LeCun, Y., and Bregler, C. (2015, January 7\u201312). Efficient object localization using convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298664"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"34195","DOI":"10.1007\/s11042-020-09054-7","article-title":"Revisiting spatial dropout for regularizing convolutional neural networks","volume":"79","author":"Lee","year":"2020","journal-title":"Multimed. Tools Appl."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.-Y., and Kweon, I.S. (2018, January 8\u201314). Cbam: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"ref_52","unstructured":"Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv."},{"key":"ref_53","unstructured":"Warden, P. (2018). Speech commands: A dataset for limited-vocabulary speech recognition. arXiv."},{"key":"ref_54","first-page":"1755","article-title":"Dlib-ml: A machine learning toolkit","volume":"10","author":"King","year":"2009","journal-title":"J. Mach. Learn. Res."},{"key":"ref_55","first-page":"26","article-title":"Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude","volume":"4","author":"Tieleman","year":"2012","journal-title":"COURSERA Neural Netw. Mach. Learn."},{"key":"ref_56","unstructured":"Zeiler, M.D. (2012). Adadelta: An adaptive learning rate method. arXiv."},{"key":"ref_57","first-page":"2121","article-title":"Adaptive subgradient methods for online learning and stochastic optimization","volume":"12","author":"Duchi","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_58","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Bottou, L. (2012). Stochastic gradient descent tricks. Neural Networks: Tricks of the Trade, Springer.","DOI":"10.1007\/978-3-642-35289-8_25"},{"key":"ref_60","unstructured":"Masters, D., and Luschi, C. (2018). Revisiting small batch training for deep neural networks. arXiv."},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"312","DOI":"10.1016\/j.icte.2020.04.010","article-title":"The effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset","volume":"6","author":"Kandel","year":"2020","journal-title":"ICT Express"},{"key":"ref_62","unstructured":"You, Y., Gitman, I., and Ginsburg, B. (2017). Large batch training of convolutional networks. arXiv."},{"key":"ref_63","unstructured":"Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P.T.P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/20\/7738\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T00:52:35Z","timestamp":1760143955000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/20\/7738"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,12]]},"references-count":63,"journal-issue":{"issue":"20","published-online":{"date-parts":[[2022,10]]}},"alternative-id":["s22207738"],"URL":"https:\/\/doi.org\/10.3390\/s22207738","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,10,12]]}}}