{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,5]],"date-time":"2026-05-05T15:36:31Z","timestamp":1777995391876,"version":"3.51.4"},"reference-count":53,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2020,4,19]],"date-time":"2020-04-19T00:00:00Z","timestamp":1587254400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>The advent of new devices, technology, machine learning techniques, and the availability of free large speech corpora results in rapid and accurate speech recognition. In the last two decades, extensive research has been initiated by researchers and different organizations to experiment with new techniques and their applications in speech processing systems. There are several speech command based applications in the area of robotics, IoT, ubiquitous computing, and different human-computer interfaces. Various researchers have worked on enhancing the efficiency of speech command based systems and used the speech command dataset. However, none of them catered to noise in the same. Noise is one of the major challenges in any speech recognition system, as real-time noise is a very versatile and unavoidable factor that affects the performance of speech recognition systems, particularly those that have not learned the noise efficiently. We thoroughly analyse the latest trends in speech recognition and evaluate the speech command dataset on different machine learning based and deep learning based techniques. A novel technique is proposed for noise robustness by augmenting noise in training data. 
Our proposed technique is tested on clean and noisy data along with locally generated data and achieves much better results than existing state-of-the-art techniques, thus setting a new benchmark.<\/jats:p>","DOI":"10.3390\/s20082326","type":"journal-article","created":{"date-parts":[[2020,4,21]],"date-time":"2020-04-21T04:49:38Z","timestamp":1587444578000},"page":"2326","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":46,"title":["Incorporating Noise Robustness in Speech Command Recognition by Noise Augmentation of Training Data"],"prefix":"10.3390","volume":"20","author":[{"given":"Ayesha","family":"Pervaiz","sequence":"first","affiliation":[{"name":"Department of Computer Engineering, University of Engineering and Technology, Taxila 47050, Pakistan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7819-5990","authenticated-orcid":false,"given":"Fawad","family":"Hussain","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, University of Engineering and Technology, Taxila 47050, Pakistan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6009-9151","authenticated-orcid":false,"given":"Huma","family":"Israr","sequence":"additional","affiliation":[{"name":"School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad H-12, Pakistan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Muhammad Ali","family":"Tahir","sequence":"additional","affiliation":[{"name":"School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad H-12, Pakistan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fawad Riasat","family":"Raja","sequence":"additional","affiliation":[{"name":"Machine Intelligence and Pattern Analysis Laboratory, Griffith 
University, Nathan, QLD 4111, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Naveed Khan","family":"Baloch","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, University of Engineering and Technology, Taxila 47050, Pakistan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Farruh","family":"Ishmanov","sequence":"additional","affiliation":[{"name":"Department of Electronics and Communication Engineering, Kwangwoon University, Seoul 447-1, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6570-5306","authenticated-orcid":false,"given":"Yousaf Bin","family":"Zikria","sequence":"additional","affiliation":[{"name":"Department of Information and Communication Engineering, Yeungnam University, Gyeongsan 38541, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2020,4,19]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Purington, A., Taft, J.G., Sannon, S., Bazarova, N.N., and Taylor, S.H. (2017, January 6\u201311). \u201cAlexa is my new BFF\u201d Social Roles, User Satisfaction, and Personification of the Amazon Echo. Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, Denver, CO, USA.","DOI":"10.1145\/3027063.3053246"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"L\u00f3pez, G., Quesada, L., and Guerrero, L.A. (2017, January 17). Alexa vs. Siri vs. Cortana vs. Google Assistant: A comparison of speech based natural user interfaces. Proceedings of the International Conference on Applied Human Factors and Ergonomics, Cham, Switzerland.","DOI":"10.1007\/978-3-319-60366-7_23"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"28","DOI":"10.1109\/MSP.2008.918411","article-title":"An introduction to voice search","volume":"25","author":"Wang","year":"2008","journal-title":"IEEE Signal Process. 
Mag."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Zweig, G., and Chang, S. (2011, January 27\u201331). Personalizing model m for voice-search. Proceedings of the 12th Annual Conference of the International Speech Communication Association, Florence, Italy.","DOI":"10.21437\/Interspeech.2011-243"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Gao, Y., Gu, L., Zhou, B., Sarikaya, R., Afify, M., Kuo, H.K., Zhu, W.z., Deng, Y., Prosser, C., and Zhang, W. (2006). IBM MASTOR SYSTEM: Multilingual Automatic Speech-to-Speech Translator, IBM Thomas J Watson Research Center Yorktown Heights. Technical Report.","DOI":"10.3115\/1706257.1706268"},{"key":"ref_6","unstructured":"Ehsani, F., Master, D., and Zuber, E.D. (2013). Mobile Speech-to-Speech Interpretation System. (8,478,578), U.S. Patent."},{"key":"ref_7","unstructured":"Lee, K.A., Larcher, A., Thai, H., Ma, B., and Li, H. (2011, January 27\u201331). Joint application of speech and speaker recognition for automation and security in smart home. Proceedings of the 12th Annual Conference of the International Speech Communication Association, Florence, Italy."},{"key":"ref_8","unstructured":"Howard, J., and Junqua, J.C. (2003). Automatic Control of Household Activity Using Speech Recognition and Natural Language. (6,513,006), U.S. Patent."},{"key":"ref_9","unstructured":"Warden, P. (2018, April 13). Available online: https:\/\/ai.googleblog.com\/2017\/08\/launching-speech-commands435dataset.html."},{"key":"ref_10","unstructured":"Warden, P. (2018). Speech commands: A dataset for limited-vocabulary speech recognition. arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Zhang, S.X., Liu, C., Yao, K., and Gong, Y. (2015, January 19\u201324). Deep neural support vector machines for speech recognition. 
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, Australia.","DOI":"10.1109\/ICASSP.2015.7178777"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Zhang, S.X., Zhao, R., Liu, C., Li, J., and Gong, Y. (2016, January 20\u201325). Recurrent support vector machines for speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472806"},{"key":"ref_13","unstructured":"Song, W., and Cai, J. (2015). End-to-end deep neural network for automatic speech recognition. Stanford CS224D Reports, Stanford University."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1533","DOI":"10.1109\/TASLP.2014.2339736","article-title":"Convolutional neural networks for speech recognition","volume":"22","author":"Mohamed","year":"2014","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Huang, J.T., Li, J., and Gong, Y. (2015, January 19\u201324). An analysis of convolutional neural networks for speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, Australia.","DOI":"10.1109\/ICASSP.2015.7178920"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1016\/j.neunet.2014.08.005","article-title":"Deep convolutional neural networks for large-scale speech tasks","volume":"64","author":"Sainath","year":"2015","journal-title":"Neural Netw."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Peddinti, V., Povey, D., and Khudanpur, S. (2015, January 6\u201310). A time delay neural network architecture for efficient modelling of long temporal contexts. 
Proceedings of the 16th Annual Conference of the International Speech Communication Association, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-647"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Sercu, T., and Goel, V. (2016). Advances in very deep convolutional neural networks for LVCSR. arXiv.","DOI":"10.21437\/Interspeech.2016-1033"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Yu, D., Xiong, W., Droppo, J., Stolcke, A., Ye, G., Li, J., and Zweig, G. (2016, January 8\u201312). Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention. Proceedings of the Interspeech 2016, San Francisco, CA, USA.","DOI":"10.21437\/Interspeech.2016-251"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Qian, Y., and Woodland, P.C. (2016, January 13\u201316). Very deep convolutional neural networks for robust speech recognition. Proceedings of the IEEE Spoken Language Technology Workshop (SLT), San Juan, Puerto Rico.","DOI":"10.1109\/SLT.2016.7846307"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"2263","DOI":"10.1109\/TASLP.2016.2602884","article-title":"Very deep convolutional neural networks for noise robust speech recognition","volume":"24","author":"Qian","year":"2016","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Sainath, T.N., and Parada, C. (2015, January 6\u201310). Convolutional neural networks for small-footprint keyword spotting. Proceedings of the 16th Annual Conference of the International Speech Communication Association, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-352"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"McMahan, B., and Rao, D. (2018, January 2\u20137). Listening to the world improves speech command recognition. 
Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11284"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1109\/LSP.2017.2657381","article-title":"Deep convolutional neural networks and data augmentation for environmental sound classification","volume":"24","author":"Salamon","year":"2017","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2017, January 21\u201326). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_27","unstructured":"Jansson, P. (2018). Single-Word Speech Recognition with Convolutional Neural Networks on Raw Waveforms, Arcada University."},{"key":"ref_28","unstructured":"de Andrade, D.C., Leo, S., Viana, M.L.D.S., and Bernkopf, C. (2018). A neural attention model for speech command recognition. arXiv."},{"key":"ref_29","unstructured":"Zhang, Y., Suda, N., Lai, L., and Chandra, V. (2017). Hello edge: Keyword spotting on microcontrollers. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Chen, G., Parada, C., and Heigold, G. (2014, January 4\u20139). Small-footprint keyword spotting using deep neural networks. 
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy.","DOI":"10.1109\/ICASSP.2014.6854370"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Arik, S.O., Kliegl, M., Child, R., Hestness, J., Gibiansky, A., Fougner, C., Prenger, R., and Coates, A. (2017). Convolutional recurrent neural networks for small-footprint keyword spotting. arXiv.","DOI":"10.21437\/Interspeech.2017-1737"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Sun, M., Raju, A., Tucker, G., Panchapagesan, S., Fu, G., Mandal, A., Matsoukas, S., Strom, N., and Vitaladevuni, S. (2016, January 13\u201316). Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting. Proceedings of the IEEE Spoken Language Technology Workshop (SLT), San Juan, Puerto Rico.","DOI":"10.1109\/SLT.2016.7846306"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Chollet, F. (2017, January 21\u201326). Xception: Deep learning with depthwise separable convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.195"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Segal, Y., Fuchs, T.S., and Keshet, J. (2019). SpeechYOLO: Detection and Localization of Speech Objects. arXiv.","DOI":"10.21437\/Interspeech.2019-1749"},{"key":"ref_35","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (July, January 26). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Tang, R., and Lin, J. (2018, January 15\u201320). Deep residual learning for small-footprint keyword spotting. 
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462688"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Zhang, C., and Koishida, K. (2017, January 20\u201324). End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances. Proceedings of the Interspeech, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-1608"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., and Stolcke, A. (2018, January 15\u201320). The Microsoft 2017 conversational speech recognition system. Proceedings of the 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461870"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Bae, J., and Kim, D.S. (2018, January 2\u20136). End-to-End Speech Command Recognition with Capsule Network. Proceedings of the Interspeech, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1888"},{"key":"ref_40","unstructured":"Sabour, S., Frosst, N., and Hinton, G.E. (2017). Dynamic routing between capsules. Advances in Neural Information Processing Systems, The MIT Press."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Soni, M., Sheikh, I., and Kopparapu, S.K. (2019, January 11\u201313). Label-Driven Time-Frequency Masking for Robust Speech Command Recognition. Proceedings of the International Conference on Text, Speech, and Dialogue, Ljubljana, Slovenia.","DOI":"10.1007\/978-3-030-27947-9_29"},{"key":"ref_42","unstructured":"Daniel, P., Arnab, G., Gilles, B., Lukas, B., Ondrej, G., Nagendra, G., Mirko, H., Petr, M., Yanmin, Q., and Petr, S. (2011, January 1\u201315). The Kaldi speech recognition toolkit. Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Big Island, HI, USA. 
Number EPFL-CONF-192584."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1109\/TASSP.1980.1163420","article-title":"Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences","volume":"28","author":"Davis","year":"1980","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_44","unstructured":"Muda, L., Begam, M., and Elamvazuthi, I. (2010). Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"133","DOI":"10.1016\/S0167-6393(98)00033-8","article-title":"Cepstral domain segmental feature vector normalization for noise robust speech recognition","volume":"25","author":"Viikki","year":"1998","journal-title":"Speech Commun."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Haeb-Umbach, R., and Ney, H. (1992, January 23\u201326). Linear discriminant analysis for improved large vocabulary continuous speech recognition. Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, CA, USA.","DOI":"10.1109\/ICASSP.1992.225984"},{"key":"ref_47","unstructured":"Kumar, N. (1998). Investigation of Silicon Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition, The Johns Hopkins University."},{"key":"ref_48","unstructured":"Gopinath, R.A. (1998, January 12\u201315). Maximum likelihood modelling with Gaussian distributions for classification. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP\u201998 (Cat. No. 98CH36181), Seattle, WA, USA."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Prasad, N.V., and Umesh, S. (2013, January 8\u201312). Improved cepstral mean and variance normalization using Bayesian framework. 
Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic.","DOI":"10.1109\/ASRU.2013.6707722"},{"key":"ref_50","unstructured":"(2019, October 09). Online Voice Recorder. Available online: https:\/\/online-voice-recorder.com\/."},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1186\/s40537-019-0197-0","article-title":"A survey on image data augmentation for deep learning","volume":"6","author":"Shorten","year":"2019","journal-title":"J. Big Data"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Wiesler, S., Richard, A., Schl\u00fcter, R., and Ney, H. (2014, January 4\u20139). Mean-normalized stochastic gradient for large-scale deep learning. Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy.","DOI":"10.1109\/ICASSP.2014.6853582"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Yu, D., and Deng, L. (2016). Automatic Speech Recognition-A Deep Learning Approach, Springer.","DOI":"10.1007\/978-1-4471-5779-3"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/8\/2326\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,13]],"date-time":"2025-10-13T13:45:08Z","timestamp":1760363108000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/8\/2326"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,4,19]]},"references-count":53,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2020,4]]}},"alternative-id":["s20082326"],"URL":"https:\/\/doi.org\/10.3390\/s20082326","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,4,19]]}}}