{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,20]],"date-time":"2026-04-20T10:21:45Z","timestamp":1776680505292,"version":"3.51.2"},"reference-count":46,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2022,4,12]],"date-time":"2022-04-12T00:00:00Z","timestamp":1649721600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003725","name":"National Research Foundation of Korea","doi-asserted-by":"publisher","award":["NRF-2018X1A3A1069795"],"award-info":[{"award-number":["NRF-2018X1A3A1069795"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Deep learning technology has encouraged research on noise-robust automatic speech recognition (ASR). The combination of cloud computing technologies and artificial intelligence has significantly improved the performance of open cloud-based speech recognition application programming interfaces (OCSR APIs). Noise-robust ASRs for application in different environments are being developed. This study proposes noise-robust OCSR APIs based on an end-to-end lip-reading architecture for practical applications in various environments. Several OCSR APIs, including Google, Microsoft, Amazon, and Naver, were evaluated using the Google Voice Command Dataset v2 to obtain the optimum performance. Based on performance, the Microsoft API was integrated with Google\u2019s trained word2vec model to enhance the keywords with more complete semantic information. The extracted word vector was integrated with the proposed lip-reading architecture for audio-visual speech recognition. Three forms of convolutional neural networks (3D CNN, 3D dense connection CNN, and multilayer 3D CNN) were used in the proposed lip-reading architecture. 
Vectors extracted from API and vision were classified after concatenation. The proposed architecture enhanced the OCSR API average accuracy rate by 14.42% using standard ASR evaluation measures along with the signal-to-noise ratio. The proposed model exhibits improved performance in various noise settings, increasing the dependability of OCSR APIs for practical applications.<\/jats:p>","DOI":"10.3390\/s22082938","type":"journal-article","created":{"date-parts":[[2022,4,12]],"date-time":"2022-04-12T22:48:45Z","timestamp":1649803725000},"page":"2938","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["End-to-End Lip-Reading Open Cloud-Based Speech Architecture"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4705-1254","authenticated-orcid":false,"given":"Sanghun","family":"Jeon","sequence":"first","affiliation":[{"name":"Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mun Sang","family":"Kim","sequence":"additional","affiliation":[{"name":"Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,4,12]]},"reference":[{"key":"ref_1","unstructured":"Huang, X., Acero, A., and Hon, H. (2001). Spoken Language Processing, Prentice Hall."},{"key":"ref_2","unstructured":"Deng, L., and O\u2019Shaughnessy, D. (2003). 
Speech Processing: A Dynamic and Optimization-Oriented Approach, CRC Press."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1116","DOI":"10.1109\/JPROC.2012.2236631","article-title":"Speech-Centric Information Processing: An Optimization-Oriented Approach","volume":"101","author":"He","year":"2013","journal-title":"Proc. IEEE"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"565","DOI":"10.1016\/B978-0-12-397025-1.00047-6","article-title":"Multisensory Integration and Audiovisual Speech Perception","volume":"2","author":"Venezia","year":"2015","journal-title":"Brain Mapp. Encycl. Ref."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1001","DOI":"10.1098\/rstb.2007.2155","article-title":"The Processing of Audio-Visual Speech: Empirical and Neural Bases","volume":"363","author":"Campbell","year":"2008","journal-title":"Philos. Trans. R. Soc. Lond. B Biol. Sci."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Calvert, G., Spence, C., and Stein, B.E. (2004). The Handbook of Multisensory Processes, MIT Press.","DOI":"10.7551\/mitpress\/3422.001.0001"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"212","DOI":"10.1121\/1.1907309","article-title":"Visual Contribution to Speech Intelligibility in Noise","volume":"26","author":"Sumby","year":"1954","journal-title":"J. Acoust. Soc. 
Am."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1068\/p060031","article-title":"The Role of Vision in the Perception of Speech","volume":"6","author":"Dodd","year":"1977","journal-title":"Perception"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1129","DOI":"10.1097\/00001756-200306110-00006","article-title":"Brain Activity During Audiovisual Speech Perception: An fMRI Study of the McGurk Effect","volume":"14","author":"Jones","year":"2003","journal-title":"Neuroreport"},{"key":"ref_10","first-page":"153","article-title":"The Importance of Prosodic Speech Elements for the Lipreader","volume":"4","author":"Risberg","year":"1974","journal-title":"Scand. Audiol."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"671","DOI":"10.1121\/1.392335","article-title":"The Contribution of Fundamental Frequency, Amplitude Envelope, and Voicing Duration Cues to Speechreading in Normal-Hearing Subjects","volume":"77","author":"Grant","year":"1985","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"397","DOI":"10.1121\/1.397690","article-title":"Single-Channel Vibrotactile Supplements to Visual Perception of Intonation and Stress","volume":"85","author":"Bernstein","year":"1989","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"746","DOI":"10.1038\/264746a0","article-title":"Hearing Lips and Seeing Voices","volume":"264","author":"McGurk","year":"1976","journal-title":"Nature"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., and Stolcke, A. (2018, January 15\u201320). The Microsoft 2017 Conversational Speech Recognition System. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461870"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Kim, J.-B., and Kweon, H.-J. 
(2020). The Analysis on Commercial and Open Source Software Speech Recognition Technology. International Conference on Computational Science\/Intelligence and Applied Informatics, Springer. Studies in Computational Intelligence.","DOI":"10.1007\/978-3-030-25225-0_1"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1109\/TPAMI.2012.59","article-title":"3D Convolutional Neural Networks for Human Action Recognition","volume":"35","author":"Ji","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Petridis, S., and Pantic, M. (2016, January 20\u201325). Deep Complementary Bottleneck Features for Visual Speech Recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472088"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Wand, M., Koutn\u00edk, J., and Schmidhuber, J. (2016, January 20\u201325). Lipreading with Long Short-Term Memory. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472852"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"2421","DOI":"10.1121\/1.2229005","article-title":"An Audio-Visual Corpus for Speech Perception and Automatic Speech Recognition","volume":"120","author":"Cooke","year":"2006","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H.G., and Ogata, T. (2014, January 14\u201318). Lipreading Using Convolutional Neural Network. 
Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association, Singapore.","DOI":"10.21437\/Interspeech.2014-293"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"590","DOI":"10.1016\/j.imavis.2014.06.004","article-title":"A Review of Recent Advances in Visual Speech Decoding","volume":"32","author":"Zhou","year":"2014","journal-title":"Image Vis. Comput."},{"key":"ref_22","unstructured":"Assael, Y.M., Shillingford, B., Whiteson, S., and De Freitas, N. (2016). Lipnet: End-to-End Sentence-Level Lipreading. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Zhang, P., Wang, D., Lu, H., Wang, H., and Ruan, X. (2017, January 22\u201329). Amulet: Aggregating Multi-Level Convolutional Features for Salient Object Detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.31"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Graves, A., Fern\u00e1ndez, S., Gomez, F., and Schmidhuber, J. (2006, January 25\u201329). Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Proceedings of the 23rd International Conference Machine Learning, Pittsburgh, PA, USA.","DOI":"10.1145\/1143844.1143891"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1016\/j.cviu.2018.02.001","article-title":"Learning to Lip Read Words by Watching Videos","volume":"173","author":"Chung","year":"2018","journal-title":"Comput. Vis. Image Understand"},{"key":"ref_26","first-page":"20","article-title":"Comparing Speech Recognition Systems (Microsoft API, Google API and CMU Sphinx)","volume":"7","author":"Bohouta","year":"2017","journal-title":"Int. J. Eng. Res. 
Appl."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"10","DOI":"10.2991\/ijndc.k.201218.005","article-title":"The Performance Evaluation of Continuous Speech Recognition Based on Korean Phonological Rules of Cloud-Based Speech Recognition Open API","volume":"9","author":"Yoo","year":"2021","journal-title":"Int. J. Network Distr Comput."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Alibegovi\u0107, B., Prlja\u010da, N., Kimmel, M., and Schultalbers, M. (2020, January 13\u201315). Speech Recognition System for a Service Robot-A Performance Evaluation. Proceedings of the International Conference on Control, Automation, Robotics and Vision, Shenzhen, China.","DOI":"10.1109\/ICARCV50220.2020.9305342"},{"key":"ref_29","first-page":"245","article-title":"Using Voice Recognition Software to Improve Communicative Writing and Social Participation in an Individual with Severe Acquired Dysgraphia: An Experimental Single-Case Therapy Study","volume":"30","author":"Caute","year":"2016","journal-title":"Aphasiology"},{"key":"ref_30","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013, January 5\u201310). Distributed Representations of Words and Phrases and Their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA. Ser. NIPS\u201913."},{"key":"ref_31","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv."},{"key":"ref_32","unstructured":"Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv."},{"key":"ref_33","first-page":"1755","article-title":"Dlib-ml: A Machine Learning Toolkit","volume":"10","author":"King","year":"2009","journal-title":"J. Mach. Learn. Res."},{"key":"ref_34","unstructured":"Tivive, F.H.C., and Bouzerdoum, A. (2022, February 17). 
An Eye Feature Detector Based on Convolutional Neural Network. Available online: https:\/\/ro.uow.edu.au\/infopapers\/2860\/."},{"key":"ref_35","first-page":"3","article-title":"Autoencoders, Minimum Description Length, and Helmholtz Free Energy","volume":"6","author":"Hinton","year":"1994","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Nix, R., and Zhang, J. (2017, January 14\u201319). Classification of Android Apps and Malware Using Deep Neural Networks. Proceedings of the 2017 International Joint Conference on Neural Networks, Anchorage, AK, USA.","DOI":"10.1109\/IJCNN.2017.7966078"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 30). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference Computer Vision Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_38","unstructured":"Warden, P. (2018). Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Majumdar, S., and Ginsburg, B. (2020). MatchboxNet: 1-d Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition. arXiv.","DOI":"10.21437\/Interspeech.2020-1058"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Vygon, R., and Mikhaylovskiy, N. (2021). Learning Efficient Representations for Keyword Spotting with Triplet Loss. arXiv.","DOI":"10.1007\/978-3-030-87802-3_69"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Mo, T., and Liu, B. (2021). Encoder-Decoder Neural Architecture Optimization for Keyword Spotting. arXiv.","DOI":"10.21437\/Interspeech.2020-3132"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., and Pantic, M. (2013, January 1\u20138). 
300 Faces in-the-Wild Challenge: The First Facial Landmark Localization Challenge. Proceedings of the IEEE International Conference on Computer Vision Workshops 2013, Sydney, Australia.","DOI":"10.1109\/ICCVW.2013.59"},{"key":"ref_43","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv."},{"key":"ref_44","unstructured":"Neti, C., Potamianos, G., Luettin, J., Matthews, I., Glotin, H., Vergyri, D., Sison, J., and Mashari, A. (2000). Audio Visual Speech Recognition (No, R.E.P. Work), IDIAP."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"035081","DOI":"10.1121\/1.4799597","article-title":"DEMAND: Diverse Environments Multichannel Acoustic Noise Database","volume":"19","author":"Thiemann","year":"2013","journal-title":"Proc. Mtgs. Acoust."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Jeon, S., Elsharkawy, A., and Kim, M.S. (2021). Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition. Sensors, 22.","DOI":"10.3390\/s22010072"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/8\/2938\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:52:29Z","timestamp":1760136749000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/8\/2938"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,12]]},"references-count":46,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2022,4]]}},"alternative-id":["s22082938"],"URL":"https:\/\/doi.org\/10.3390\/s22082938","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,4,12]]}}}