{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,5]],"date-time":"2026-02-05T07:21:14Z","timestamp":1770276074855,"version":"3.49.0"},"reference-count":86,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,7,21]],"date-time":"2021-07-21T00:00:00Z","timestamp":1626825600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,7,21]],"date-time":"2021-07-21T00:00:00Z","timestamp":1626825600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Romanian Ministry of Research and Education","award":["PN-III-P1-1.2-PCCDI-2017-0818 \/ 73PCCDI,within PNCDI III"],"award-info":[{"award-number":["PN-III-P1-1.2-PCCDI-2017-0818 \/ 73PCCDI,within PNCDI III"]}]},{"DOI":"10.13039\/100014121","name":"Xilinx","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100014121","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems, that modeled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems, that translate the raw waveform directly into words using one deep neural network (DNN). The transcription accuracy greatly increased, leading to ASR technology being integrated into many commercial applications. 
However, few of the existing ASR technologies are suitable for integration in embedded applications, due to their hard constraints related to computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, to further serve decision makers in choosing the system which best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems.<\/jats:p>","DOI":"10.1186\/s13636-021-00217-4","type":"journal-article","created":{"date-parts":[[2021,7,21]],"date-time":"2021-07-21T09:03:07Z","timestamp":1626858187000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":24,"title":["Performance vs. hardware requirements in state-of-the-art automatic speech recognition"],"prefix":"10.1186","volume":"2021","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2122-4997","authenticated-orcid":false,"given":"Alexandru-Lucian","family":"Georgescu","sequence":"first","affiliation":[]},{"given":"Alessandro","family":"Pappalardo","sequence":"additional","affiliation":[]},{"given":"Horia","family":"Cucu","sequence":"additional","affiliation":[]},{"given":"Michaela","family":"Blott","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,7,21]]},"reference":[{"issue":"1","key":"217_CR1","doi-asserted-by":"publisher","first-page":"66","DOI":"10.1109\/JSSC.2017.2752838","volume":"53","author":"M. Price","year":"2017","unstructured":"M. Price, J. Glass, A. P. Chandrakasan, A low-power speech recognizer and voice activity detector using deep neural networks. IEEE J. Solid-State Circ.53(1), 66\u201375 (2017).","journal-title":"IEEE J. 
Solid-State Circ."},{"key":"217_CR2","doi-asserted-by":"publisher","first-page":"230","DOI":"10.1109\/SiPS.2016.48","volume-title":"2016 IEEE International Workshop on Signal Processing Systems (SiPS)","author":"M. Lee","year":"2016","unstructured":"M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, W. Sung, in 2016 IEEE International Workshop on Signal Processing Systems (SiPS). Fpga-based low-power speech recognition with recurrent neural networks (IEEEDallas, 2016), pp. 230\u2013235. https:\/\/doi.org\/10.1109\/SiPS.2016.48."},{"key":"217_CR3","doi-asserted-by":"publisher","unstructured":"S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, et al., in Proceedings of the 2017 ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays. Ese: Efficient speech recognition engine with sparse lstm on fpga, (2017), pp. 75\u201384. https:\/\/doi.org\/10.1145\/3020078.3021745.","DOI":"10.1145\/3020078.3021745"},{"key":"217_CR4","doi-asserted-by":"publisher","first-page":"52227","DOI":"10.1109\/ACCESS.2018.2870273","volume":"6","author":"B. Liu","year":"2018","unstructured":"B. Liu, H. Qin, Y. Gong, W. Ge, M. Xia, L. Shi, Eera-asr: An energy-efficient reconfigurable architecture for automatic speech recognition with hybrid dnn and approximate computing. IEEE Access. 6:, 52227\u201352237 (2018).","journal-title":"IEEE Access"},{"key":"217_CR5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/MICRO.2016.7783750","volume-title":"49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO)","author":"R. Yazdani","year":"2016","unstructured":"R. Yazdani, A. Segura, J. -M. Arnau, A. Gonzalez, in 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO). An ultra low-power hardware accelerator for automatic speech recognition (IEEETaipei, 2016), pp. 1\u201312. https:\/\/doi.org\/10.1109\/MICRO.2016.7783750."},{"key":"217_CR6","unstructured":"S. Migacz, in GPU Technology Conference, vol. 2. 
8-bit inference with tensorrt, (2017), p. 5. https:\/\/on-demand.gputechconf.com\/gtc\/2017\/presentation\/s7310-8-bit-inference-withtensorrt.pdf."},{"issue":"5","key":"217_CR7","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3178115","volume":"9","author":"Z. Zhang","year":"2018","unstructured":"Z. Zhang, J. Geiger, J. Pohjalainen, A. E. -D. Mousa, W. Jin, B. Schuller, Deep learning for environmentally robust speech recognition: An overview of recent developments. ACM Trans. Intell. Syst. Technol. (TIST). 9(5), 1\u201328 (2018).","journal-title":"ACM Trans. Intell. Syst. Technol. (TIST)"},{"key":"217_CR8","unstructured":"J. Park, Y. Boo, I. Choi, S. Shin, W. Sung, in Advances in Neural Information Processing Systems. Fully neural network based speech recognition on mobile and embedded devices, (2018), pp. 10620\u201310630. https:\/\/dl.acm.org\/doi\/10.5555\/3327546.3327722."},{"issue":"2","key":"217_CR9","doi-asserted-by":"publisher","first-page":"206","DOI":"10.1109\/JSTSP.2019.2908700","volume":"13","author":"H. Purwins","year":"2019","unstructured":"H. Purwins, B. Li, T. Virtanen, J. Schl\u00fcter, S. -Y. Chang, T. Sainath, Deep learning for audio signal processing. IEEE J. Sel. Top. Signal Process.13(2), 206\u2013219 (2019).","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"217_CR10","doi-asserted-by":"crossref","unstructured":"C. Kim, D. Gowda, D. Lee, J. Kim, A. Kumar, S. Kim, A. Garg, C. Han, A review of on-device fully neural end-to-end automatic speech recognition algorithms. arXiv preprint arXiv:2012.07974 (2020).","DOI":"10.1109\/IEEECONF51394.2020.9443456"},{"issue":"8","key":"217_CR11","doi-asserted-by":"publisher","first-page":"1018","DOI":"10.3390\/sym11081018","volume":"11","author":"D. Wang","year":"2019","unstructured":"D. Wang, X. Wang, S. Lv, An overview of end-to-end automatic speech recognition. Symmetry. 
11(8), 1018 (2019).","journal-title":"Symmetry"},{"key":"217_CR12","doi-asserted-by":"publisher","unstructured":"C. Shan, J. Zhang, Y. Wang, L. Xie, in ICASSP. Attention-based end-to-end speech recognition on voice search (IEEE, 2018), pp. 4764\u20134768. https:\/\/doi.org\/10.1109\/ICASSP.2018.8462492.","DOI":"10.1109\/ICASSP.2018.8462492"},{"key":"217_CR13","unstructured":"R. Collobert, C. Puhrsch, G. Synnaeve, Wav2letter: an end-to-end convnet-based speech recognition system. arXiv preprint arXiv:1609.03193 (2016)."},{"key":"217_CR14","doi-asserted-by":"publisher","first-page":"302","DOI":"10.1016\/j.neucom.2020.07.053","volume":"417","author":"M. Alam","year":"2020","unstructured":"M. Alam, M. D. Samad, L. Vidyaratne, A. Glandon, K. M. Iftekharuddin, Survey on deep neural networks in speech and vision systems. Neurocomputing. 417:, 302\u2013321 (2020).","journal-title":"Neurocomputing"},{"key":"217_CR15","doi-asserted-by":"publisher","unstructured":"T. N. Sainath, R. Prabhavalkar, S. Kumar, S. Lee, A. Kannan, D. Rybach, V. Schogol, P. Nguyen, B. Li, Y. Wu, et al., in ICASSP. No need for a lexicon? evaluating the value of the pronunciation lexica in end-to-end models (IEEE, 2018), pp. 5859\u20135863. https:\/\/doi.org\/10.1109\/icassp.2018.8462380.","DOI":"10.1109\/icassp.2018.8462380"},{"key":"217_CR16","doi-asserted-by":"publisher","unstructured":"C. -C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al., in ICASSP. State-of-the-art speech recognition with sequence-to-sequence models (IEEE, 2018), pp. 4774\u20134778. https:\/\/doi.org\/10.1109\/ICASSP.2018.8462105.","DOI":"10.1109\/ICASSP.2018.8462105"},{"key":"217_CR17","unstructured":"R. Collobert, A. Hannun, G. Synnaeve, Word-level speech recognition with a dynamic lexicon. arXiv preprint arXiv:1906.04323 (2019)."},{"key":"217_CR18","doi-asserted-by":"publisher","unstructured":"A. Graves, A. -r. Mohamed, G. 
Hinton, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Speech recognition with deep recurrent neural networks (IEEE, 2013), pp. 6645\u20136649. https:\/\/doi.org\/10.1109\/ICASSP.2013.6638947.","DOI":"10.1109\/ICASSP.2013.6638947"},{"key":"217_CR19","doi-asserted-by":"publisher","unstructured":"H. Hadian, H. Sameti, D. Povey, S. Khudanpur, in Interspeech. End-to-end speech recognition using lattice-free mmi, (2018), pp. 12\u201316. https:\/\/doi.org\/10.21437\/Interspeech.2018-1423.","DOI":"10.21437\/Interspeech.2018-1423"},{"key":"217_CR20","unstructured":"R. W. Hamming, Digital Filters (Courier Corporation, 1998)."},{"key":"217_CR21","unstructured":"A. V. Oppenheim, Discrete-time Signal Processing (Pearson Education India, 1999)."},{"issue":"3","key":"217_CR22","doi-asserted-by":"publisher","first-page":"185","DOI":"10.1121\/1.1915893","volume":"8","author":"S. S. Stevens","year":"1937","unstructured":"S. S. Stevens, J. Volkmann, E. B. Newman, A scale for the measurement of the psychological magnitude pitch. J. Acoust. Soc. Am.8(3), 185\u2013190 (1937).","journal-title":"J. Acoust. Soc. Am."},{"issue":"4","key":"217_CR23","doi-asserted-by":"publisher","first-page":"357","DOI":"10.1109\/TASSP.1980.1163420","volume":"28","author":"S. Davis","year":"1980","unstructured":"S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process.28(4), 357\u2013366 (1980).","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"issue":"4","key":"217_CR24","doi-asserted-by":"publisher","first-page":"788","DOI":"10.1109\/TASL.2010.2064307","volume":"19","author":"N. Dehak","year":"2010","unstructured":"N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process.19(4), 788\u2013798 (2010).","journal-title":"IEEE Trans. Audio Speech Lang. 
Process."},{"issue":"4","key":"217_CR25","doi-asserted-by":"publisher","first-page":"1435","DOI":"10.1109\/TASL.2006.881693","volume":"15","author":"P. Kenny","year":"2007","unstructured":"P. Kenny, G. Boulianne, P. Ouellet, P. Dumouchel, Joint factor analysis versus eigenchannels in speaker recognition. IEEE Trans. Audio Speech Lang. Process. 15(4), 1435\u20131447 (2007).","journal-title":"IEEE Trans. Audio Speech Lang. Process"},{"key":"217_CR26","doi-asserted-by":"publisher","unstructured":"R. D. Lopez-Cozar, M. Araki, Spoken, Multilingual and Multimodal Dialogue Systems: Development and Assessment (Wiley, 2005). https:\/\/doi.org\/10.1002\/0470021578.","DOI":"10.1002\/0470021578"},{"key":"217_CR27","doi-asserted-by":"publisher","unstructured":"Y. Zhang, M. Alder, R. Togneri, in ICASSP, vol. 1. Using gaussian mixture modeling in speech recognition (IEEE, 1994), p. 613. https:\/\/doi.org\/10.1109\/ICASSP.1994.389219.","DOI":"10.1109\/ICASSP.1994.389219"},{"key":"217_CR28","doi-asserted-by":"publisher","unstructured":"S. J. Young, J. J. Odell, P. C. Woodland, in Proceedings of the Workshop on Human Language Technology. Tree-based state tying for high accuracy acoustic modelling (Association for Computational Linguistics, 1994), pp. 307\u2013312. https:\/\/doi.org\/10.3115\/1075812.1075885.","DOI":"10.3115\/1075812.1075885"},{"issue":"1","key":"217_CR29","doi-asserted-by":"publisher","first-page":"164","DOI":"10.1214\/aoms\/1177697196","volume":"41","author":"L. E. Baum","year":"1970","unstructured":"L. E. Baum, T. Petrie, G. Soules, N. Weiss, A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. Ann. Math. Stat.41(1), 164\u2013171 (1970).","journal-title":"Ann. Math. Stat."},{"issue":"2","key":"217_CR30","doi-asserted-by":"publisher","first-page":"260","DOI":"10.1109\/TIT.1967.1054010","volume":"13","author":"A. Viterbi","year":"1967","unstructured":"A. 
Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory. 13(2), 260\u2013269 (1967).","journal-title":"IEEE Trans. Inf. Theory"},{"issue":"6","key":"217_CR31","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1109\/MSP.2012.2205597","volume":"29","author":"G Hinton","year":"2012","unstructured":"G Hinton, et al., Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Proc. Mag.29(6), 82\u201397 (2012).","journal-title":"IEEE Signal Proc. Mag."},{"issue":"3","key":"217_CR32","doi-asserted-by":"publisher","first-page":"328","DOI":"10.1109\/29.21701","volume":"37","author":"A. Waibel","year":"1989","unstructured":"A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K. J. Lang, Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process.37(3), 328\u2013339 (1989).","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"217_CR33","doi-asserted-by":"crossref","unstructured":"V. Peddinti, D. Povey, S. Khudanpur, in Interspeech. A time delay neural network architecture for efficient modeling of long temporal contexts, (2015), pp. 3214\u20133218. https:\/\/academic.microsoft.com\/paper\/2402146185\/reference.","DOI":"10.21437\/Interspeech.2015-647"},{"key":"217_CR34","doi-asserted-by":"publisher","unstructured":"D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, S. Khudanpur, in Interspeech. Semi-orthogonal low-rank matrix factorization for deep neural networks, (2018), pp. 3743\u20133747. https:\/\/doi.org\/10.21437\/Interspeech.2018-1417.","DOI":"10.21437\/Interspeech.2018-1417"},{"issue":"10","key":"217_CR35","doi-asserted-by":"publisher","first-page":"1533","DOI":"10.1109\/TASLP.2014.2339736","volume":"22","author":"O. Abdel-Hamid","year":"2014","unstructured":"O. Abdel-Hamid, A. -r. Mohamed, H. Jiang, L. Deng, G. Penn, D. Yu, Convolutional neural networks for speech recognition. IEEE\/ACM Trans. Audio Speech Lang. 
Process.22(10), 1533\u20131545 (2014).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"217_CR36","unstructured":"Kaldi Help Google Group: CNN-TDNN vs. TDNN (2020). https:\/\/groups.google.com\/d\/msg\/kaldi-help\/jsg1Oo4bNGQ\/uwvFw5PtBwAJ. Accessed 23 Mar 2020."},{"key":"217_CR37","doi-asserted-by":"publisher","unstructured":"F. L. Kreyssig, C. Zhang, P. C. Woodland, in ICASSP. Improved tdnns using deep kernels and frequency dependent grid-rnns (IEEE, 2018), pp. 4864\u20134868. https:\/\/doi.org\/10.1109\/ICASSP.2018.8462523.","DOI":"10.1109\/ICASSP.2018.8462523"},{"key":"217_CR38","doi-asserted-by":"publisher","unstructured":"A. Biswas, E. Y\u0131lmaz, F. de Wet, E. van der Westhuizen, T. Niesler, in Interspeech. Semi-Supervised Acoustic Model Training for Five-Lingual Code-Switched ASR, (2019), pp. 3745\u20133749. https:\/\/doi.org\/10.21437\/interspeech.2019-1325.","DOI":"10.21437\/interspeech.2019-1325"},{"key":"217_CR39","doi-asserted-by":"publisher","unstructured":"C. Zoril\u0103, C. Boeddeker, R. Doddipatla, R. Haeb-Umbach, in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). An investigation into the effectiveness of enhancement in asr training and test for chime-5 dinner party transcription (IEEE, 2019), pp. 47\u201353. https:\/\/doi.org\/10.1109\/ASRU46091.2019.9003785.","DOI":"10.1109\/ASRU46091.2019.9003785"},{"key":"217_CR40","unstructured":"N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, R. Collobert, Fully convolutional speech recognition. arXiv preprint arXiv:1812.06864 (2018)."},{"key":"217_CR41","doi-asserted-by":"publisher","unstructured":"G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Densely connected convolutional networks, (2017), pp. 4700\u20134708. https:\/\/doi.org\/10.1109\/cvpr.2017.243.","DOI":"10.1109\/cvpr.2017.243"},{"key":"217_CR42","doi-asserted-by":"publisher","unstructured":"J. 
Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, R. T. Gadde, in Interspeech. Jasper: An End-to-End Convolutional Neural Acoustic Model, (2019), pp. 71\u201375. https:\/\/doi.org\/10.21437\/interspeech.2019-1819.","DOI":"10.21437\/interspeech.2019-1819"},{"key":"217_CR43","doi-asserted-by":"crossref","unstructured":"H. Sak, A. Senior, F. Beaufays, in Interspeech. Long short-term memory recurrent neural network architectures for large scale acoustic modeling, (2014), pp. 338\u2013342. https:\/\/research.google\/pubs\/pub43905.pdf.","DOI":"10.21437\/Interspeech.2014-80"},{"issue":"Aug","key":"217_CR44","first-page":"115","volume":"3","author":"F. A. Gers","year":"2002","unstructured":"F. A. Gers, N. N. Schraudolph, J. Schmidhuber, Learning precise timing with lstm recurrent networks. J. Mach. Learn. Res.3(Aug), 115\u2013143 (2002).","journal-title":"J. Mach. Learn. Res."},{"key":"217_CR45","unstructured":"D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., in 2016 International Conference on Machine Learning. Deep speech 2: End-to-end speech recognition in english and mandarin, (2016), pp. 173\u2013182. https:\/\/academic.microsoft.com\/paper\/2193413348\/reference."},{"key":"217_CR46","unstructured":"D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)."},{"key":"217_CR47","doi-asserted-by":"publisher","unstructured":"W. Chan, N. Jaitly, Q. Le, O. Vinyals, in ICASSP. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition (IEEE, 2016), pp. 4960\u20134964. https:\/\/doi.org\/10.1109\/ICASSP.2016.7472621.","DOI":"10.1109\/ICASSP.2016.7472621"},{"key":"217_CR48","unstructured":"I. Sutskever, et al., Q. Le, Sequence to Sequence Learning with Neural Networks.Adv. Neural Inf. Process. 
Syst.27:, 3104\u20133112 (2014)."},{"key":"217_CR49","doi-asserted-by":"publisher","unstructured":"K. Cho, B. van Merri\u00ebnboer, D. Bahdanau, Y. Bengio, in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. On the properties of neural machine translation: Encoder\u2013decoder approaches, (2014), pp. 103\u2013111. https:\/\/doi.org\/10.3115\/v1\/w14-4012.","DOI":"10.3115\/v1\/w14-4012"},{"key":"217_CR50","doi-asserted-by":"publisher","unstructured":"A. Hannun, A. Lee, Q. Xu, R. Collobert, in Interspeech. Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions, (2019), pp. 3785\u20133789. https:\/\/doi.org\/10.21437\/interspeech.2019-2460.","DOI":"10.21437\/interspeech.2019-2460"},{"key":"217_CR51","doi-asserted-by":"publisher","unstructured":"A. Graves, S. Fern\u00e1ndez, F. Gomez, J. Schmidhuber, in Proceedings of the 23rd International Conference on Machine Learning. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks (ACM, 2006), pp. 369\u2013376. https:\/\/doi.org\/10.1145\/1143844.1143891.","DOI":"10.1145\/1143844.1143891"},{"key":"217_CR52","doi-asserted-by":"publisher","unstructured":"T. Hori, S. Watanabe, Y. Zhang, W. Chan, in Interspeech. Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm, (2017). https:\/\/doi.org\/10.21437\/INTERSPEECH.2017-1296.","DOI":"10.21437\/INTERSPEECH.2017-1296"},{"key":"217_CR53","doi-asserted-by":"publisher","unstructured":"D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, S. Khudanpur, in Interspeech. Purely sequence-trained neural networks for asr based on lattice-free mmi, (2016), pp. 2751\u20132755. https:\/\/doi.org\/10.21437\/Interspeech.2016-595.","DOI":"10.21437\/Interspeech.2016-595"},{"key":"217_CR54","doi-asserted-by":"crossref","unstructured":"T. Mikolov, M. Karafi\u00e1t, L. Burget, J. \u010cernocky\u0300, S. 
Khudanpur, in 2010 Conference of the International Speech Communication Association. Recurrent neural network based language model, (2010). https:\/\/academic.microsoft.com\/paper\/179875071\/reference.","DOI":"10.1109\/ICASSP.2011.5947611"},{"key":"217_CR55","unstructured":"M. Sundermeyer, R. Schl\u00fcter, H. Ney, in 2012 Conference of the International Speech Communication Association. Lstm neural networks for language modeling, (2012). https:\/\/academic.microsoft.com\/paper\/2402268235\/reference."},{"key":"217_CR56","unstructured":"Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, in Proceedings of the 34th International Conference on Machine Learning-Volume 70. Language modeling with gated convolutional networks (JMLR. org, 2017), pp. 933\u2013941. https:\/\/academic.microsoft.com\/paper\/2963970792\/reference."},{"key":"217_CR57","doi-asserted-by":"publisher","first-page":"2978","DOI":"10.18653\/v1\/P19-1285","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Z. Dai","year":"2019","unstructured":"Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, R. Salakhutdinov, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Transformer-XL: Attentive language models beyond a fixed-length context (Association for Computational LinguisticsFlorence, 2019), pp. 2978\u20132988."},{"key":"217_CR58","unstructured":"C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. -C. Lin, F. Bougares, H. Schwenk, Y. Bengio, On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535 (2015)."},{"key":"217_CR59","doi-asserted-by":"publisher","unstructured":"A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, R. Prabhavalkar, in ICASSP. An analysis of incorporating an external language model into a sequence-to-sequence model (IEEE, 2018), pp. 1\u20135828. 
https:\/\/doi.org\/10.1109\/icassp.2018.8462682.","DOI":"10.1109\/icassp.2018.8462682"},{"key":"217_CR60","doi-asserted-by":"publisher","unstructured":"A. Sriram, H. Jun, S. Satheesh, A. Coates, in Interspeech. Cold fusion: Training seq2seq models together with language models, (2018), pp. 387\u2013391. https:\/\/doi.org\/10.21437\/interspeech.2018-1392.","DOI":"10.21437\/interspeech.2018-1392"},{"key":"217_CR61","doi-asserted-by":"publisher","unstructured":"S. Toshniwal, A. Kannan, C. -C. Chiu, Y. Wu, T. N. Sainath, K. Livescu, in 2018 IEEE Spoken Language Technology Workshop (SLT). A comparison of techniques for language model integration in encoder-decoder speech recognition (IEEE, 2018), pp. 369\u2013375. https:\/\/doi.org\/10.1109\/SLT.2018.8639038.","DOI":"10.1109\/SLT.2018.8639038"},{"key":"217_CR62","unstructured":"T. Mikolov, S. Kombrink, A. Deoras, L. Burget, J. Cernocky, in Proc. of the 2011 ASRU Workshop. Rnnlm-recurrent neural network language modeling toolkit, (2011), pp. 196\u2013201. https:\/\/academic.microsoft.com\/paper\/2474824677\/reference."},{"key":"217_CR63","doi-asserted-by":"publisher","unstructured":"H. Xu, et al., in ICASSP. A pruned rnnlm lattice-rescoring algorithm for automatic speech recognition (IEEE, 2018), pp. 5929\u20135933. https:\/\/doi.org\/10.1109\/ICASSP.2018.8461974.","DOI":"10.1109\/ICASSP.2018.8461974"},{"key":"217_CR64","unstructured":"Kaldi TDNN LibriSpeech implementation (2020). https:\/\/github.com\/kaldi-asr\/kaldi\/blob\/master\/egs\/librispeech\/s5\/local\/chain\/tuning\/run_tdnn_1d.sh. Accessed 23 Mar 2020."},{"key":"217_CR65","unstructured":"Kaldi CNN-TDNN LibriSpeech implementation (2020). https:\/\/github.com\/kaldi-asr\/kaldi\/blob\/master\/egs\/librispeech\/s5\/local\/chain\/tuning\/run_cnn_tdnn_1a.sh. Accessed 23 Mar 2020."},{"key":"217_CR66","unstructured":"PaddlePaddle DeepSpeech2 LibriSpeech implementation (2020). https:\/\/github.com\/PaddlePaddle\/DeepSpeech\/blob\/develop\/model_utils\/network.py. 
Accessed 23 Mar 2020."},{"key":"217_CR67","unstructured":"RWTH Returnn LibriSpeech implementation (2020). https:\/\/github.com\/rwth-i6\/returnn-experiments\/blob\/master\/2018-asr-attention\/librispeech\/full-setup-attention\/returnn.config. Accessed 23 Mar 2020."},{"key":"217_CR68","unstructured":"Wav2Letter CNN-GLU fully convolutional LibriSpeech implementation (2020). https:\/\/github.com\/facebookresearch\/wav2letter\/blob\/master\/recipes\/models\/conv_glu\/librispeech\/network.arch. Accessed 23 Mar 2020."},{"key":"217_CR69","unstructured":"Wav2Letter time-domain separable LibriSpeech implementation (2020). https:\/\/github.com\/facebookresearch\/wav2letter\/blob\/master\/recipes\/models\/seq2seq_tds\/librispeech\/network.arch. Accessed 23 Mar 2020."},{"key":"217_CR70","unstructured":"Nvidia OpenSeq2Seq Jasper LibriSpeech implementation (2020). https:\/\/github.com\/NVIDIA\/OpenSeq2Seq\/blob\/master\/example_configs\/speech2text\/jasper10x5_LibriSpeech_nvgrad.py. Accessed 23 Mar 2020."},{"key":"217_CR71","unstructured":"Nvidia QuartzNet implementation (2020). https:\/\/github.com\/NVIDIA\/NeMo\/blob\/master\/examples\/asr\/configs\/quartznet15x5.yaml. Accessed 23 Mar 2020."},{"key":"217_CR72","doi-asserted-by":"publisher","unstructured":"V. Panayotov, G. Chen, D. Povey, S. Khudanpur, in ICASSP. Librispeech: an asr corpus based on public domain audio books (IEEE, 2015), pp. 5206\u20135210. https:\/\/doi.org\/10.1109\/ICASSP.2015.7178964.","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"217_CR73","doi-asserted-by":"publisher","unstructured":"D. B. Paul, J. M. Baker, in Proceedings of the Workshop on Speech and Natural Language. The design for the wall street journal-based csr corpus (Association for Computational Linguistics, 1992), pp. 357\u2013362. https:\/\/doi.org\/10.3115\/1075527.1075614.","DOI":"10.3115\/1075527.1075614"},{"key":"217_CR74","unstructured":"A. Rousseau, P. Del\u00e9glise, Y. Esteve, in 2014 Language Resources and Evaluation. 
Enhancing the ted-lium corpus with selected data for language modeling and more ted talks, (2014), pp. 3935\u20133939. https:\/\/academic.microsoft.com\/paper\/2251321385\/reference."},{"key":"217_CR75","doi-asserted-by":"publisher","unstructured":"J. J. Godfrey, E. C. Holliman, J. McDaniel, in ICASSP, vol. 1. Switchboard: Telephone speech corpus for research and development (IEEE, 1992), pp. 517\u2013520. https:\/\/doi.org\/10.1109\/ICASSP.1992.225858.","DOI":"10.1109\/ICASSP.1992.225858"},{"key":"217_CR76","unstructured":"C. Cieri, D. Miller, K. Walker, in 2004 Language Resources and Evaluation, vol. 4. The fisher corpus: a resource for the next generations of speech-to-text, (2004), pp. 69\u201371. https:\/\/academic.microsoft.com\/paper\/97072897\/reference."},{"key":"217_CR77","unstructured":"Kaldi Help Google Group: Multiple output heads in chain network (2020). https:\/\/groups.google.com\/d\/msg\/kaldi-help\/WC8hYgL2o3I\/WccCc0ucAgAJ. Accessed 23 Mar 2020."},{"key":"217_CR78","doi-asserted-by":"publisher","unstructured":"A. Zeyer, K. Irie, R. Schl\u00fcter, H. Ney, in Interspeech. Improved training of end-to-end attention models for speech recognition, (2018), pp. 7\u201311. https:\/\/doi.org\/10.21437\/Interspeech.2018-1616.","DOI":"10.21437\/Interspeech.2018-1616"},{"key":"217_CR79","doi-asserted-by":"publisher","unstructured":"R. Sennrich, B. Haddow, A. Birch, in 2016 Meeting of the Association for Computational Linguistics. Neural machine translation of rare words with subword units, (2016), pp. 1715\u20131725. https:\/\/doi.org\/10.18653\/v1\/p16-1162.","DOI":"10.18653\/v1\/p16-1162"},{"key":"217_CR80","doi-asserted-by":"publisher","unstructured":"N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert, E. Dupoux, in Interspeech. End-to-end speech recognition from the raw waveform, (2018), pp. 781\u2013785. 
https:\/\/doi.org\/10.21437\/Interspeech.2018-2414.","DOI":"10.21437\/Interspeech.2018-2414"},{"key":"217_CR81","doi-asserted-by":"publisher","unstructured":"T. Likhomanenko, G. Synnaeve, R. Collobert, in Interspeech. Who needs words? lexicon-free speech recognition, (2019), pp. 3915\u20133919. https:\/\/doi.org\/10.21437\/Interspeech.2019-3107.","DOI":"10.21437\/Interspeech.2019-3107"},{"key":"217_CR82","unstructured":"Wav2Letter lexicon-free LibriSpeech implementation (2020). https:\/\/github.com\/facebookresearch\/wav2letter\/blob\/master\/recipes\/models\/lexicon_free\/librispeech\/am.arch. Accessed 23 Mar 2020."},{"key":"217_CR83","unstructured":"T. Salimans, D. P. Kingma, in 2016 Neural Information Processing Systems. Weight normalization: A simple reparameterization to accelerate training of deep neural networks, (2016), pp. 901\u2013909. https:\/\/academic.microsoft.com\/paper\/2963685250\/reference."},{"key":"217_CR84","unstructured":"B. Ginsburg, P. Castonguay, O. Hrinchuk, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, H. Nguyen, J. M. Cohen, Stochastic gradient methods with layer-wise adaptive moments for training of deep networks. arXiv preprint arXiv:1905.11286 (2019)."},{"key":"217_CR85","doi-asserted-by":"publisher","unstructured":"S. Kriman, S. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, Y. Zhang, in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions, (2020), pp. 6124\u20136128. https:\/\/doi.org\/10.1109\/icassp40776.2020.9053889.","DOI":"10.1109\/icassp40776.2020.9053889"},{"key":"217_CR86","unstructured":"Open Speech and Language Resources (2020). http:\/\/www.openslr.org\/resources\/11\/4-gram.arpa.gz. 
Accessed 23 Mar 2020."}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-021-00217-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13636-021-00217-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-021-00217-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,5]],"date-time":"2023-01-05T02:38:06Z","timestamp":1672886286000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-021-00217-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,21]]},"references-count":86,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["217"],"URL":"https:\/\/doi.org\/10.1186\/s13636-021-00217-4","relation":{},"ISSN":["1687-4722"],"issn-type":[{"value":"1687-4722","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,21]]},"assertion":[{"value":"11 February 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 June 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 July 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing 
interests"}}],"article-number":"28"}}