{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T11:36:43Z","timestamp":1769600203854,"version":"3.49.0"},"reference-count":75,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,7,5]],"date-time":"2021-07-05T00:00:00Z","timestamp":1625443200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,7,5]],"date-time":"2021-07-05T00:00:00Z","timestamp":1625443200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100005713","name":"Technische Universit\u00e4t M\u00fcnchen","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100005713","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Lately, the self-attention mechanism has marked a new milestone in the field of automatic speech recognition (ASR). Nevertheless, its performance is susceptible to environmental intrusions as the system predicts the next output symbol depending on the full input sequence and the previous predictions. A popular solution for this problem is adding an independent speech enhancement module as the front-end. Nonetheless, due to being trained separately from the ASR module, the independent enhancement front-end falls into the sub-optimum easily. Besides, the handcrafted loss function of the enhancement module tends to introduce unseen distortions, which even degrade the ASR performance. Inspired by the extensive applications of the generative adversarial networks (GANs) in speech enhancement and ASR tasks, we propose an adversarial joint training framework with the self-attention mechanism to boost the noise robustness of the ASR system. Generally, it consists of a self-attention speech enhancement GAN and a self-attention end-to-end ASR model. There are two advantages which are worth noting in this proposed framework. One is that it benefits from the advancement of both self-attention mechanism and GANs, while the other is that the discriminator of GAN plays the role of the global discriminant network in the stage of the adversarial joint training, which guides the enhancement front-end to capture more compatible structures for the subsequent ASR module and thereby offsets the limitation of the separate training and handcrafted loss functions. With the adversarial joint optimization, the proposed framework is expected to learn more robust representations suitable for the ASR task. We execute systematic experiments on the corpus AISHELL-1, and the experimental results show that on the artificial noisy test set, the proposed framework achieves the relative improvements of 66% compared to the ASR model trained by clean data solely, 35.1% compared to the speech enhancement and ASR scheme without joint training, and 5.3% compared to multi-condition training.<\/jats:p>","DOI":"10.1186\/s13636-021-00215-6","type":"journal-article","created":{"date-parts":[[2021,7,5]],"date-time":"2021-07-05T10:04:33Z","timestamp":1625479473000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":17,"title":["Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition"],"prefix":"10.1186","volume":"2021","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0641-3178","authenticated-orcid":false,"given":"Lujun","family":"Li","sequence":"first","affiliation":[]},{"given":"Yikai","family":"Kang","sequence":"additional","affiliation":[]},{"given":"Yuchen","family":"Shi","sequence":"additional","affiliation":[]},{"given":"Ludwig","family":"K\u00fcrzinger","sequence":"additional","affiliation":[]},{"given":"Tobias","family":"Watzel","sequence":"additional","affiliation":[]},{"given":"Gerhard","family":"Rigoll","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,7,5]]},"reference":[{"key":"215_CR1","unstructured":"W. Chan, N. Jaitly, Q. V. Le, O. Vinyals, Listen, attend and spell (2015). arXiv preprint arXiv:1508.01211."},{"key":"215_CR2","first-page":"577","volume":"28","author":"J. K. Chorowski","year":"2015","unstructured":"J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-based models for speech recognition. Adv. Neural Inf. Process Syst.28:, 577\u2013585 (2015).","journal-title":"Adv. Neural Inf. Process Syst."},{"issue":"6","key":"215_CR3","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1109\/MSP.2012.2205597","volume":"29","author":"G. Hinton","year":"2012","unstructured":"G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. -r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Sig. Process Mag.29(6), 82\u201397 (2012).","journal-title":"IEEE Sig. Process Mag."},{"key":"215_CR4","doi-asserted-by":"publisher","unstructured":"C. -C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al, in Proc. ICASSP. State-of-the-art speech recognition with sequence-to-sequence models (IEEE, 2018), pp. 4774\u20134778. https:\/\/doi.org\/10.1109\/icassp.2018.8462105.","DOI":"10.1109\/icassp.2018.8462105"},{"key":"215_CR5","doi-asserted-by":"publisher","unstructured":"D. Povey, H. Hadian, P. Ghahremani, K. Li, S. Khudanpur, in Proc. ICASSP. A time-restricted self-attention layer for ASR (IEEE, 2018), pp. 5874\u20135878. https:\/\/doi.org\/10.1109\/icassp.2018.8462497.","DOI":"10.1109\/icassp.2018.8462497"},{"key":"215_CR6","doi-asserted-by":"crossref","unstructured":"Z. Tian, J. Yi, J. Tao, Y. Bai, Z. Wen, Self-attention transducers for end-to-end speech recognition (2019). arXiv preprint arXiv:1909.13037.","DOI":"10.21437\/Interspeech.2019-2203"},{"key":"215_CR7","doi-asserted-by":"publisher","unstructured":"J. Salazar, K. Kirchhoff, Z. Huang, in Proc. ICASSP. Self-attention networks for connectionist temporal classification in speech recognition (IEEE, 2019), pp. 7115\u20137119. https:\/\/doi.org\/10.1109\/icassp.2019.8682539.","DOI":"10.1109\/icassp.2019.8682539"},{"key":"215_CR8","doi-asserted-by":"publisher","unstructured":"K. J. Han, J. Huang, Y. Tang, X. He, B. Zhou, in Proc. INTERSPEECH. Multi-stride self-attention for speech recognition, (2019), pp. 2788\u20132792. https:\/\/doi.org\/10.21437\/interspeech.2019-1973.","DOI":"10.21437\/interspeech.2019-1973"},{"key":"215_CR9","doi-asserted-by":"publisher","unstructured":"K. J. Han, R. Prieto, T. Ma, in Proc. ASRU. State-of-the-art speech recognition using multi-stream self-attention with dilated 1D convolutions (IEEE, 2019), pp. 54\u201361. https:\/\/doi.org\/10.1109\/asru46091.2019.9003730.","DOI":"10.1109\/asru46091.2019.9003730"},{"key":"215_CR10","doi-asserted-by":"crossref","unstructured":"N. -Q. Pham, T. -S. Nguyen, J. Niehues, M. M\u00fcller, S. St\u00fcker, A. Waibel, Very deep self-attention networks for end-to-end speech recognition (2019). arXiv preprint arXiv:1904.13377.","DOI":"10.21437\/Interspeech.2019-2702"},{"key":"215_CR11","unstructured":"C. -F. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, D. Le, M. Jain, K. Schubert, C. Fuegen, M. L. Seltzer, Transformer-transducer: End-to-end speech recognition with self-attention (2019). arXiv preprint arXiv:1910.12977."},{"key":"215_CR12","doi-asserted-by":"crossref","unstructured":"H. Luo, S. Zhang, M. Lei, L. Xie, Simplified self-attention for transformer-based end-to-end speech recognition (2020). arXiv preprint arXiv:2005.10463.","DOI":"10.1109\/SLT48900.2021.9383581"},{"issue":"3","key":"215_CR13","doi-asserted-by":"publisher","first-page":"197","DOI":"10.1109\/TASSP.1978.1163086","volume":"26","author":"J. Lim","year":"1978","unstructured":"J. Lim, A. Oppenheim, All-pole modeling of degraded speech. IEEE Trans. Acoust. Speech Sig. Process. 26(3), 197\u2013210 (1978).","journal-title":"IEEE Trans. Acoust. Speech Sig. Process"},{"key":"215_CR14","doi-asserted-by":"publisher","unstructured":"A. Narayanan, D. Wang, in Proc. ICASSP. Ideal ratio mask estimation using deep neural networks for robust speech recognition (IEEE, 2013), pp. 7092\u20137096. https:\/\/doi.org\/10.1109\/icassp.2013.6639038.","DOI":"10.1109\/icassp.2013.6639038"},{"issue":"12","key":"215_CR15","doi-asserted-by":"publisher","first-page":"1849","DOI":"10.1109\/TASLP.2014.2352935","volume":"22","author":"Y. Wang","year":"2014","unstructured":"Y. Wang, A. Narayanan, D. Wang, On training targets for supervised speech separation. IEEE\/ACM Trans. Audio Speech Lang. Process. 22(12), 1849\u20131858 (2014).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process"},{"key":"215_CR16","first-page":"1503","volume-title":"Proc. INTERSPEECH","author":"S. Nie","year":"2015","unstructured":"S. Nie, S. Liang, W. Xue, X. Zhang, W. Liu, et al., in Proc. INTERSPEECH. Two-stage multi-target joint learning for monaural speech separation (International Speech Communication Association (ISCA)Dresden, 2015), pp. 1503\u20131507."},{"key":"215_CR17","doi-asserted-by":"publisher","unstructured":"F. Weninger, J. R. Hershey, J. Le Roux, B. Schuller, in Proc. GlobalSIP. Discriminatively trained recurrent neural networks for single-channel speech separation (IEEE, 2014), pp. 577\u2013581. https:\/\/doi.org\/10.1109\/globalsip.2014.7032183.","DOI":"10.1109\/globalsip.2014.7032183"},{"key":"215_CR18","doi-asserted-by":"publisher","unstructured":"H. Erdogan, J. R. Hershey, S. Watanabe, J. Le Roux, in Proc. ICASSP. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks (IEEE, 2015), pp. 708\u2013712. https:\/\/doi.org\/10.1109\/icassp.2015.7178061.","DOI":"10.1109\/icassp.2015.7178061"},{"issue":"1","key":"215_CR19","doi-asserted-by":"publisher","first-page":"7","DOI":"10.1109\/TASLP.2014.2364452","volume":"23","author":"Y. Xu","year":"2014","unstructured":"Y. Xu, J. Du, L. -R. Dai, C. -H. Lee, A regression approach to speech enhancement based on deep neural networks. IEEE\/ACM Trans. Audio Speech Lang. Process. 23(1), 7\u201319 (2014).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process"},{"issue":"11","key":"215_CR20","doi-asserted-by":"publisher","first-page":"2043","DOI":"10.1109\/TASLP.2018.2851151","volume":"26","author":"S. Nie","year":"2018","unstructured":"S. Nie, S. Liang, W. Liu, X. Zhang, J. Tao, Deep learning based speech separation via NMF-style reconstructions. IEEE\/ACM Trans. Audio Speech Lang. Process. 26(11), 2043\u20132055 (2018).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process"},{"key":"215_CR21","doi-asserted-by":"publisher","unstructured":"Y. Ephraim, in Proc. ICASSP. A minimum mean square error approach for speech enhancement (IEEE, 1990), pp. 829\u2013832. https:\/\/doi.org\/10.1109\/icassp.1990.115960.","DOI":"10.1109\/icassp.1990.115960"},{"issue":"1","key":"215_CR22","doi-asserted-by":"publisher","first-page":"229","DOI":"10.1109\/TASL.2007.911054","volume":"16","author":"Y. Hu","year":"2007","unstructured":"Y. Hu, P. C. Loizou, Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 16(1), 229\u2013238 (2007).","journal-title":"IEEE Trans. Audio Speech Lang. Process"},{"key":"215_CR23","unstructured":"S. R. Quackenbush, T. P. Barnwell, M. A. Clements, Objective measures of speech quality, Ellis Horwood Series in Artificial Intelligence (Prentice Hall, 1988)."},{"key":"215_CR24","doi-asserted-by":"publisher","unstructured":"M. L. Seltzer, in Hands-Free Speech Communication and Microphone Arrays. Bridging the gap: towards a unified framework for hands-free speech recognition using microphone arrays (IEEE, 2008), pp. 104\u2013107. https:\/\/doi.org\/10.1109\/hscma.2008.4538698.","DOI":"10.1109\/hscma.2008.4538698"},{"issue":"4","key":"215_CR25","doi-asserted-by":"publisher","first-page":"796","DOI":"10.1109\/TASLP.2016.2528171","volume":"24","author":"Z. -Q. Wang","year":"2016","unstructured":"Z. -Q. Wang, D. Wang, A joint training framework for robust automatic speech recognition. IEEE\/ACM Trans. Audio Speech Lang. Process. 24(4), 796\u2013806 (2016).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process"},{"key":"215_CR26","volume-title":"Proc. INTERSPEECH","author":"Z. -q. Wang","year":"2015","unstructured":"Z. -q. Wang, D. Wang, in Proc. INTERSPEECH. Joint training of speech separation, filterbank and acoustic model for robust automatic speech recognition (International Speech Communication Association (ISCA)Dresden, 2015)."},{"key":"215_CR27","doi-asserted-by":"crossref","unstructured":"T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, Multichannel end-to-end speech recognition (2017). arXiv preprint arXiv:1703.04783.","DOI":"10.1109\/ICASSP.2018.8462161"},{"key":"215_CR28","doi-asserted-by":"publisher","unstructured":"B. Liu, S. Nie, S. Liang, W. Liu, M. Yu, L. Chen, S. Peng, C. Li, in Proc. INTERSPEECH. Jointly adversarial enhancement training for robust end-to-end speech recognition (IEEE, 2019), pp. 491\u2013495. https:\/\/doi.org\/10.21437\/interspeech.2019-1242.","DOI":"10.21437\/interspeech.2019-1242"},{"key":"215_CR29","doi-asserted-by":"crossref","unstructured":"S. Pascual, A. Bonafonte, J. Serra, SEGAN: speech enhancement generative adversarial network (2017). arXiv preprint arXiv:1703.09452.","DOI":"10.21437\/Interspeech.2017-1428"},{"key":"215_CR30","doi-asserted-by":"publisher","unstructured":"M. H. Soni, N. Shah, H. A. Patil, in Proc. ICASSP. Time-frequency masking-based speech enhancement using generative adversarial network (IEEE, 2018), pp. 5039\u20135043. https:\/\/doi.org\/10.1109\/icassp.2018.8462068.","DOI":"10.1109\/icassp.2018.8462068"},{"key":"215_CR31","doi-asserted-by":"crossref","unstructured":"D. Michelsanti, Z. -H. Tan, Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification (2017). arXiv preprint arXiv:1709.01703.","DOI":"10.21437\/Interspeech.2017-1620"},{"key":"215_CR32","doi-asserted-by":"publisher","unstructured":"P. Shen, X. Lu, S. Li, H. Kawai, in Proc. INTERSPEECH. Conditional generative adversarial nets classifier for spoken language identification (IEEE, 2017), pp. 2814\u20132818. https:\/\/doi.org\/10.21437\/interspeech.2017-553.","DOI":"10.21437\/interspeech.2017-553"},{"key":"215_CR33","doi-asserted-by":"crossref","unstructured":"S. Sahu, R. Gupta, C. Espy-Wilson, On enhancing speech emotion recognition using generative adversarial networks (2018). arXiv preprint arXiv:1806.06626.","DOI":"10.21437\/Interspeech.2018-1883"},{"key":"215_CR34","doi-asserted-by":"publisher","unstructured":"H. Hu, T. Tan, Y. Qian, in Proc. ICASSP. Generative adversarial networks based data augmentation for noise robust speech recognition (IEEE, 2018), pp. 5044\u20135048. https:\/\/doi.org\/10.1109\/icassp.2018.8462624.","DOI":"10.1109\/icassp.2018.8462624"},{"key":"215_CR35","doi-asserted-by":"publisher","unstructured":"C. Donahue, B. Li, R. Prabhavalkar, in Proc. ICASSP. Exploring speech enhancement with generative adversarial networks for robust speech recognition (IEEE, 2018), pp. 5024\u20135028. https:\/\/doi.org\/10.1109\/icassp.2018.8462581.","DOI":"10.1109\/icassp.2018.8462581"},{"key":"215_CR36","doi-asserted-by":"publisher","unstructured":"L. Dong, S. Xu, B. Xu, in Proc. ICASSP. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition (IEEE, 2018), pp. 5884\u20135888. https:\/\/doi.org\/10.1109\/icassp.2018.8462506.","DOI":"10.1109\/icassp.2018.8462506"},{"key":"215_CR37","doi-asserted-by":"crossref","unstructured":"A. Gulati, J. Qin, C. -C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al, Conformer: convolution-augmented transformer for speech recognition (2020). arXiv preprint arXiv:2005.08100.","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"215_CR38","doi-asserted-by":"publisher","first-page":"1700","DOI":"10.1109\/LSP.2020.3025020","volume":"27","author":"H. Phan","year":"2020","unstructured":"H. Phan, I. V. McLoughlin, L. Pham, O. Y. Ch\u00e9n, P. Koch, M. De Vos, A. Mertins, Improving GANs for speech enhancement. IEEE Sig. Process Lett.27:, 1700\u20131704 (2020).","journal-title":"IEEE Sig. Process Lett."},{"key":"215_CR39","unstructured":"D. Baby, isegan: improved speech enhancement generative adversarial networks (2020). arXiv preprint arXiv:2002.08796."},{"key":"215_CR40","doi-asserted-by":"crossref","unstructured":"H. Phan, H. L. Nguyen, O. Y. Ch\u00e9n, P. Koch, N. Q. Duong, I. McLoughlin, A. Mertins, Self-attention generative adversarial network for speech enhancement (2020). arXiv preprint arXiv:2010.09132.","DOI":"10.1109\/ICASSP39728.2021.9414265"},{"key":"215_CR41","doi-asserted-by":"publisher","unstructured":"Y. Koizumi, K. Yatabe, M. Delcroix, Y. Masuyama, D. Takeuchi, in Proc. ICASSP. Speech enhancement using self-adaptation and multi-head self-attention (IEEE, 2020), pp. 181\u2013185. https:\/\/doi.org\/10.1109\/icassp40776.2020.9053214.","DOI":"10.1109\/icassp40776.2020.9053214"},{"key":"215_CR42","doi-asserted-by":"publisher","unstructured":"A. Sriram, H. Jun, Y. Gaur, S. Satheesh, in Proc. ICASSP. Robust speech recognition using generative adversarial networks (IEEE, 2018), pp. 5639\u20135643. https:\/\/doi.org\/10.1109\/icassp.2018.8462456.","DOI":"10.1109\/icassp.2018.8462456"},{"key":"215_CR43","doi-asserted-by":"publisher","unstructured":"K. Wang, J. Zhang, S. Sun, Y. Wang, F. Xiang, L. Xie, in Proc. INTERSPEECH. Investigating generative adversarial networks based speech dereverberation for robust speech recognition (IEEE, 2018), pp. 1581\u20131585. https:\/\/doi.org\/10.21437\/interspeech.2018-1780.","DOI":"10.21437\/interspeech.2018-1780"},{"key":"215_CR44","doi-asserted-by":"publisher","unstructured":"B. Liu, S. Nie, Y. Zhang, D. Ke, S. Liang, W. Liu, in Proc. ICASSP. Boosting noise robustness of acoustic model via deep adversarial training (IEEE, 2018), pp. 5034\u20135038. https:\/\/doi.org\/10.1109\/icassp.2018.8462093.","DOI":"10.1109\/icassp.2018.8462093"},{"key":"215_CR45","doi-asserted-by":"publisher","unstructured":"J. Droppo, A. Acero, in Proc. INTERSPEECH, 1. Joint discriminative front end and back end training for improved speech recognition accuracy (IEEE, 2006), pp. 281\u2013284. https:\/\/doi.org\/10.1109\/icassp.2006.1660012.","DOI":"10.1109\/icassp.2006.1660012"},{"key":"215_CR46","doi-asserted-by":"publisher","unstructured":"T. Gao, J. Du, L. Dai, C. Lee, in Proc. ICASSP. Joint training of front-end and back-end deep neural networks for robust speech recognition (IEEE, 2015), pp. 4375\u20134379. https:\/\/doi.org\/10.1109\/icassp.2015.7178797.","DOI":"10.1109\/icassp.2015.7178797"},{"key":"215_CR47","doi-asserted-by":"publisher","unstructured":"M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, in Proc. SLT. Batch-normalized joint training for DNN-based distant speech recognition (IEEE, 2016), pp. 28\u201334. https:\/\/doi.org\/10.1109\/slt.2016.7846241.","DOI":"10.1109\/slt.2016.7846241"},{"issue":"12","key":"215_CR48","doi-asserted-by":"publisher","first-page":"2231","DOI":"10.1109\/TASLP.2016.2598308","volume":"24","author":"Y. Qian","year":"2016","unstructured":"Y. Qian, T. Tan, D. Yu, Neural network based multi-factor aware joint training for robust speech recognition. IEEE\/ACM Trans. Audio Speech Lang. Process. 24(12), 2231\u20132240 (2016).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process"},{"key":"215_CR49","first-page":"5998","volume":"30","author":"A. Vaswani","year":"2017","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \u0141. Kaiser, I. Polosukhin, Attention is all you need. Adv. Neural Inf. Process Syst.30:, 5998\u20136008 (2017).","journal-title":"Adv. Neural Inf. Process Syst."},{"key":"215_CR50","first-page":"2672","volume":"27","author":"I. Goodfellow","year":"2014","unstructured":"I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets. Adv. Neural Inf. Process Syst.27:, 2672\u20132680 (2014).","journal-title":"Adv. Neural Inf. Process Syst."},{"key":"215_CR51","first-page":"7794","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"X. Wang","year":"2018","unstructured":"X. Wang, R. Girshick, A. Gupta, K. He, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Non-local neural networks (IEEE Computer SocietySalt Lake City, 2018), pp. 7794\u20137803."},{"key":"215_CR52","first-page":"7354","volume-title":"International Conference on Machine Learning","author":"H. Zhang","year":"2019","unstructured":"H. Zhang, I. Goodfellow, D. Metaxas, A. Odena, in International Conference on Machine Learning. Self-attention generative adversarial networks (Association for Computing MachineryNew York, 2019), pp. 7354\u20137363."},{"key":"215_CR53","unstructured":"A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks (2015). arXiv preprint arXiv:1511.06434."},{"key":"215_CR54","doi-asserted-by":"publisher","unstructured":"K. He, X. Zhang, S. Ren, J. Sun, in Proceedings of the IEEE International Conference on Computer Vision. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, (2015), pp. 1026\u20131034. https:\/\/doi.org\/10.1109\/iccv.2015.123.","DOI":"10.1109\/iccv.2015.123"},{"key":"215_CR55","unstructured":"T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs (2016). arXiv preprint arXiv:1606.03498."},{"key":"215_CR56","first-page":"3","volume-title":"Proc. Icml","author":"A. L. Maas","year":"2013","unstructured":"A. L. Maas, A. Y. Hannun, A. Y. Ng, in Proc. Icml, 30. Rectifier nonlinearities improve neural network acoustic models (JMLR.orgAtlanta, 2013), p. 3."},{"key":"215_CR57","unstructured":"Y. Lu, Z. Li, D. He, Z. Sun, B. Dong, T. Qin, L. Wang, T. -Y. Liu, Understanding and improving transformer from a multi-particle dynamic system point of view (2019). arXiv preprint arXiv:1906.02762."},{"key":"215_CR58","doi-asserted-by":"publisher","unstructured":"A. Zeyer, P. Bahar, K. Irie, R. Schl\u00fcter, H. Ney, in Proc. ASRU. A comparison of transformer and LSTM encoder decoder models for ASR (IEEE, 2019), pp. 8\u201315. https:\/\/doi.org\/10.1109\/asru46091.2019.9004025.","DOI":"10.1109\/asru46091.2019.9004025"},{"key":"215_CR59","doi-asserted-by":"crossref","unstructured":"Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, L. S. Chao, Learning deep transformer models for machine translation (2019). arXiv preprint arXiv:1906.01787.","DOI":"10.18653\/v1\/P19-1176"},{"key":"215_CR60","first-page":"933","volume-title":"International Conference on Machine Learning","author":"Y. N. Dauphin","year":"2017","unstructured":"Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, in International Conference on Machine Learning. Language modeling with gated convolutional networks (International Conference on Machine Learning (IOML)Sydney, 2017), pp. 933\u2013941."},{"key":"215_CR61","doi-asserted-by":"publisher","unstructured":"P. Isola, J. -Y. Zhu, T. Zhou, A. A. Efros, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Image-to-image translation with conditional adversarial networks, (2017), pp. 1125\u20131134. https:\/\/doi.org\/10.1109\/cvpr.2017.632.","DOI":"10.1109\/cvpr.2017.632"},{"key":"215_CR62","doi-asserted-by":"publisher","unstructured":"D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A. A. Efros, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Context encoders: feature learning by inpainting, (2016), pp. 2536\u20132544. https:\/\/doi.org\/10.1109\/cvpr.2016.278.","DOI":"10.1109\/cvpr.2016.278"},{"key":"215_CR63","doi-asserted-by":"publisher","unstructured":"H. Bu, J. Du, X. Na, B. Wu, H. Zheng, in 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I\/O Systems and Assessment (O-COCOSDA). Aishell-1: an open-source mandarin speech corpus and a speech recognition baseline, (2017), pp. 1\u20135. https:\/\/doi.org\/10.1109\/icsda.2017.8384449.","DOI":"10.1109\/icsda.2017.8384449"},{"issue":"3","key":"215_CR64","doi-asserted-by":"publisher","first-page":"247","DOI":"10.1016\/0167-6393(93)90095-3","volume":"12","author":"A. Varga","year":"1992","unstructured":"A. Varga, H. Steeneken, D. Jones, The noisex-92 study on the effect of additive noise on automatic speech recognition system. Speech Commun.12(3), 247\u2013251 (1992).","journal-title":"Speech Commun."},{"key":"215_CR65","doi-asserted-by":"publisher","unstructured":"S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, T. Ochiai, in Proc. INTERSPEECH. ESPnet: end-to-end speech processing toolkit (IEEE, 2018), pp. 2207\u20132211. https:\/\/doi.org\/10.21437\/interspeech.2018-1456.","DOI":"10.21437\/interspeech.2018-1456"},{"key":"215_CR66","doi-asserted-by":"publisher","unstructured":"S. Kim, T. Hori, S. Watanabe, in Proc. ICASSP. Joint CTC-attention based end-to-end speech recognition using multi-task learning (IEEE, 2017), pp. 4835\u20134839. https:\/\/doi.org\/10.1109\/icassp.2017.7953075.","DOI":"10.1109\/icassp.2017.7953075"},{"key":"215_CR67","volume-title":"ICML","author":"V. Nair","year":"2010","unstructured":"V. Nair, G. E. Hinton, in ICML. Rectified linear units improve restricted boltzmann machines (OmnipressMadison, 2010)."},{"issue":"2","key":"215_CR68","first-page":"26","volume":"4","author":"T. Tieleman","year":"2012","unstructured":"T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn.4(2), 26\u201331 (2012).","journal-title":"COURSERA Neural Netw. Mach. Learn."},{"key":"215_CR69","unstructured":"D. P. Kingma, J. Ba, Adam: a method for stochastic optimization (2014). arXiv preprint arXiv:1412.6980."},{"key":"215_CR70","unstructured":"Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al, Google\u2019s neural machine translation system: bridging the gap between human and machine translation (2016). arXiv preprint arXiv:1609.08144."},{"key":"215_CR71","volume-title":"Sixteenth Annual Conference of the International Speech Communication Association","author":"T. Ko","year":"2015","unstructured":"T. Ko, V. Peddinti, D. Povey, S. Khudanpur, in Sixteenth Annual Conference of the International Speech Communication Association. Audio augmentation for speech recognition (International Speech Communication Association (ISCA)Dresden, 2015)."},{"key":"215_CR72","doi-asserted-by":"crossref","unstructured":"D. S. Park, W. Chan, Y. Zhang, C. -C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le, Specaugment: a simple data augmentation method for automatic speech recognition (2019). arXiv preprint arXiv:1904.08779.","DOI":"10.21437\/Interspeech.2019-2680"},{"key":"215_CR73","doi-asserted-by":"publisher","unstructured":"A. Narayanan, D. Wang, in Proc. ICASSP. Joint noise adaptive training for robust automatic speech recognition (IEEE, 2014), pp. 2504\u20132508. https:\/\/doi.org\/10.1109\/icassp.2014.6854051.","DOI":"10.1109\/icassp.2014.6854051"},{"key":"215_CR74","unstructured":"ITU, Recommendation ITU-T P, 862.2: Wideband Extension to Recommendation P. 862 for the Assessment of Wideband Telephone Networks and Speech Codecs. ITU-Telecommunication Standardization Sector, 2007."},{"key":"215_CR75","doi-asserted-by":"publisher","DOI":"10.1201\/b14529","volume-title":"Speech enhancement: theory and practice","author":"P. C. Loizou","year":"2013","unstructured":"P. C. Loizou, Speech enhancement: theory and practice (CRC press, Boca Raton, 2013)."}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-021-00215-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13636-021-00215-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-021-00215-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,7,5]],"date-time":"2021-07-05T10:15:41Z","timestamp":1625480141000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-021-00215-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,5]]},"references-count":75,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["215"],"URL":"https:\/\/doi.org\/10.1186\/s13636-021-00215-6","relation":{},"ISSN":["1687-4722"],"issn-type":[{"value":"1687-4722","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,5]]},"assertion":[{"value":"4 February 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 June 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 July 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"26"}}