{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,4]],"date-time":"2026-05-04T12:07:22Z","timestamp":1777896442899,"version":"3.51.4"},"reference-count":44,"publisher":"Walter de Gruyter GmbH","issue":"1","license":[{"start":{"date-parts":[[2019,3,5]],"date-time":"2019-03-05T00:00:00Z","timestamp":1551744000000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2019,12,18]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Deep neural networks (DNNs) have been playing a significant role in acoustic modeling. Convolutional neural networks (CNNs) are the advanced version of DNNs that achieve 4\u201312% relative gain in the word error rate (WER) over DNNs. Existence of spectral variations and local correlations in speech signal makes CNNs more capable of speech recognition. Recently, it has been demonstrated that bidirectional long short-term memory (BLSTM) produces higher recognition rate in acoustic modeling because they are adequate to reinforce higher-level representations of acoustic data. Spatial and temporal properties of the speech signal are essential for high recognition rate, so the concept of combining two different networks came into mind. In this paper, a hybrid architecture of CNN-BLSTM is proposed to appropriately use these properties and to improve the continuous speech recognition task. Further, we explore different methods like weight sharing, the appropriate number of hidden units, and ideal pooling strategy for CNN to achieve a high recognition rate. Specifically, the focus is also on how many BLSTM layers are effective. This paper also attempts to overcome another shortcoming of CNN, i.e. speaker-adapted features, which are not possible to be directly modeled in CNN. Next, various non-linearities with or without dropout are analyzed for speech tasks. Experiments indicate that proposed hybrid architecture with speaker-adapted features and maxout non-linearity with dropout idea shows 5.8% and 10% relative decrease in WER over the CNN and DNN systems, respectively.<\/jats:p>","DOI":"10.1515\/jisys-2018-0372","type":"journal-article","created":{"date-parts":[[2019,3,5]],"date-time":"2019-03-05T04:10:57Z","timestamp":1551759057000},"page":"1261-1274","source":"Crossref","is-referenced-by-count":65,"title":["A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition"],"prefix":"10.1515","volume":"29","author":[{"given":"Vishal","family":"Passricha","sequence":"first","affiliation":[{"name":"Computer Engineering Department , National Institute of Technology Kurukshetra , Haryana , India"}]},{"given":"Rajesh Kumar","family":"Aggarwal","sequence":"additional","affiliation":[{"name":"Computer Engineering Department , National Institute of Technology Kurukshetra , Haryana , India"}]}],"member":"374","published-online":{"date-parts":[[2019,3,5]]},"reference":[{"key":"2025120523341680935_j_jisys-2018-0372_ref_001","doi-asserted-by":"crossref","unstructured":"O. Abdel-Hamid, A. Mohamed, H. Jiang and G. Penn, Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition, in: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4277\u20134280, 2012.","DOI":"10.1109\/ICASSP.2012.6288864"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_002","doi-asserted-by":"crossref","unstructured":"O. Abdel-Hamid, L. Deng and D. Yu, Exploring convolutional neural network structures and optimization techniques for speech recognition, Interspeech (2013), 3366\u20133370.","DOI":"10.21437\/Interspeech.2013-744"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_003","doi-asserted-by":"crossref","unstructured":"Y. Bengio, P. Simard and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. 5 (1994), 157\u2013166.","DOI":"10.1109\/72.279181"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_004","unstructured":"J. Bruna, A. Szlam and Y. LeCun, Signal recovery from pooling representations, in: 31st International Conference on Machine Learning, ICML 2014, Beijing, China, pp. 1585\u20131598, 2014."},{"key":"2025120523341680935_j_jisys-2018-0372_ref_005","doi-asserted-by":"crossref","unstructured":"M. Cai, Y. Shi and J. Liu, Deep maxout neural networks for speech recognition, in: Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pp. 291\u2013296, 2013.","DOI":"10.1109\/ASRU.2013.6707745"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_006","doi-asserted-by":"crossref","unstructured":"G. E. Dahl, T. N. Sainath and G. E. Hinton, Improving deep neural networks for LVCSR using rectified linear units and dropout, in: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8609\u20138613, 2013.","DOI":"10.1109\/ICASSP.2013.6639346"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_007","doi-asserted-by":"crossref","unstructured":"S. B. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, in: Readings in Speech Recognition, ed: Elsevier, pp. 65\u201374, 1990.","DOI":"10.1016\/B978-0-08-051584-7.50010-3"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_008","unstructured":"J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, A.Y. Ng, Large scale distributed deep networks, in: Advances in Neural Information Processing Systems, pp. 1223\u20131231, 2012."},{"key":"2025120523341680935_j_jisys-2018-0372_ref_009","doi-asserted-by":"crossref","unstructured":"L. Deng, O. Abdel-Hamid and D. Yu, A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6669\u20136673, 2013.","DOI":"10.1109\/ICASSP.2013.6638952"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_010","doi-asserted-by":"crossref","unstructured":"M. J. Gales, Maximum likelihood linear transformations for HMM-based speech recognition, Comput. Speech Lang. 12 (1998), 75\u201398.","DOI":"10.1006\/csla.1998.0043"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_011","unstructured":"X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249\u2013256, 2010."},{"key":"2025120523341680935_j_jisys-2018-0372_ref_012","unstructured":"I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville and Y. Bengio, Maxout networks, in: Proceedings of the 30th International Conference on Machine Learning, Proceedings of Machine Learning Research, pp. 1319\u20131327, 2013."},{"key":"2025120523341680935_j_jisys-2018-0372_ref_013","doi-asserted-by":"crossref","unstructured":"A. Graves and J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18 (2005), 602\u2013610.","DOI":"10.1016\/j.neunet.2005.06.042"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_014","doi-asserted-by":"crossref","unstructured":"A. Graves, N. Jaitly and A.-R. Mohamed, Hybrid speech recognition with deep bidirectional LSTM, in: Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pp. 273\u2013278, 2013.","DOI":"10.1109\/ASRU.2013.6707742"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_015","doi-asserted-by":"crossref","unstructured":"A. Graves, A.-R. Mohamed and G. Hinton, Speech recognition with deep recurrent neural networks, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645\u20136649, 2013.","DOI":"10.1109\/ICASSP.2013.6638947"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_016","doi-asserted-by":"crossref","unstructured":"K. He, X. Zhang, S. Ren and J. Sun, Delving deep into rectifiers: surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE international conference on computer vision, pp. 1026\u20131034, 2015.","DOI":"10.1109\/ICCV.2015.123"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_017","doi-asserted-by":"crossref","unstructured":"G. Heigold, E. McDermott, V. Vanhoucke, A. Senior and M. Bacchiani, Asynchronous stochastic optimization for sequence training of deep neural networks, in: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5587\u20135591, 2014.","DOI":"10.1109\/ICASSP.2014.6854672"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_018","doi-asserted-by":"crossref","unstructured":"G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath and B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups, IEEE Signal Proc. Mag. 29 (2012), 82\u201397.","DOI":"10.1109\/MSP.2012.2205597"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_019","doi-asserted-by":"crossref","unstructured":"K. Jarrett, K. Kavukcuoglu, M. A. Ranzato and Y. LeCun, What is the best multi-stage architecture for object recognition? in: 2009 IEEE 12th International Conference on Computer Vision, pp. 2146\u20132153, 2009.","DOI":"10.1109\/ICCV.2009.5459469"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_020","doi-asserted-by":"crossref","unstructured":"B. Kingsbury, Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling, in: Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pp. 3761\u20133764, Taipei, Taiwan, 2009.","DOI":"10.1109\/ICASSP.2009.4960445"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_021","doi-asserted-by":"crossref","unstructured":"B. Kingsbury, T. N. Sainath and H. Soltau, Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization, Interspeech, pp. 10\u201313, Portland, OR, USA, 2012.","DOI":"10.21437\/Interspeech.2012-3"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_022","unstructured":"A. Krizhevsky, I. Sutskever and G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, pp. 1097\u20131105, 2012."},{"key":"2025120523341680935_j_jisys-2018-0372_ref_023","unstructured":"N. Lambooij, Applying image recognition to automatic speech recognition: determining suitability of spectrograms for training a deep neural network for speech recognition, Bachelor thesis, Utrecht University, 2017."},{"key":"2025120523341680935_j_jisys-2018-0372_ref_024","doi-asserted-by":"crossref","unstructured":"Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition, in: Proceedings of the IEEE, vol. 86, pp. 2278\u20132324, November 1998.","DOI":"10.1109\/5.726791"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_025","doi-asserted-by":"crossref","unstructured":"L. Lee and R. C. Rose, Speaker normalization using efficient frequency warping procedures, in: Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on, pp. 353\u2013356, Atlanta, GA, USA, 1996.","DOI":"10.1109\/ICASSP.1996.541105"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_026","doi-asserted-by":"crossref","unstructured":"A.-R. Mohamed, G. Hinton and G. Penn, Understanding how deep belief networks perform acoustic modelling, in: Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pp. 4273\u20134276, Olomouc, Czech Republic, 2012.","DOI":"10.1109\/ICASSP.2012.6288863"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_027","doi-asserted-by":"crossref","unstructured":"T. Robinson, M. Hochberg and S. Renals, The use of recurrent neural networks in continuous speech recognition, in: C. H. Lee, F. K. Soong, and K. K. Paliwal, (Eds.), Automatic Speech and Speaker Recognition. The Kluwer International Series in Engineering and Computer Science (VLSI, Computer Architecture and Digital Signal Processing), vol. 355, pp. 233\u2013258, Springer, Boston, MA, 1996.","DOI":"10.1007\/978-1-4613-1367-0_10"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_028","doi-asserted-by":"crossref","unstructured":"T. N. Sainath, B. Kingsbury, A.-R. Mohamed, G. E. Dahl, G. Saon, H. Soltau, T. Beran, A. Y. Aravkin and B. Ramabhadran, Improvements to deep convolutional neural networks for LVCSR, in: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 315\u2013320, Olomouc, Czech Republic, 2013.","DOI":"10.1109\/ASRU.2013.6707749"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_029","doi-asserted-by":"crossref","unstructured":"T. N. Sainath, A.-R. Mohamed, B. Kingsbury and B. Ramabhadran, Deep convolutional neural networks for LVCSR, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8614\u20138618, Vancouver, Canada, May 26\u201331, 2013.","DOI":"10.1109\/ICASSP.2013.6639347"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_030","doi-asserted-by":"crossref","unstructured":"T. N. Sainath, B. Kingsbury, G. Saon, H. Soltau, A.-R. Mohamed, G. Dahl, B. Ramabhadran, Deep convolutional neural networks for large-scale speech tasks, Neural Networks 64 (2015), 39\u201348.","DOI":"10.1016\/j.neunet.2014.08.005"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_031","doi-asserted-by":"crossref","unstructured":"H. Sak, A. Senior and F. Beaufays, Long short-term memory recurrent neural network architectures for large scale acoustic modeling, Interspeech, pp. 338\u2013342, Singapore, September 14\u201318, 2014.","DOI":"10.21437\/Interspeech.2014-80"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_032","doi-asserted-by":"crossref","unstructured":"G. Saon, H. Soltau, A. Emami and M. Picheny, Unfolded recurrent neural networks for speech recognition, Interspeech, pp. 343\u2013347, Singapore, September 14\u201318, 2014.","DOI":"10.21437\/Interspeech.2014-81"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_033","doi-asserted-by":"crossref","unstructured":"M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Trans. Signal Proc. 45 (1997), 2673\u20132681.","DOI":"10.1109\/78.650093"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_034","unstructured":"P. Sermanet, S. Chintala and Y. LeCun, Convolutional neural networks applied to house numbers digit classification, in: Pattern Recognition (ICPR), 2012 21st International Conference on, pp. 3288\u20133291, Stockholm, Sweden, 2012."},{"key":"2025120523341680935_j_jisys-2018-0372_ref_035","doi-asserted-by":"crossref","unstructured":"H. Soltau, H.-K. Kuo, L. Mangu, G. Saon and T. Beran, Neural network acoustic models for the DARPA RATS program, Interspeech, pp. 3092\u20133096, Lyon, France, August 25\u201329, 2013.","DOI":"10.21437\/Interspeech.2013-674"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_036","unstructured":"N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014), 1929\u20131958."},{"key":"2025120523341680935_j_jisys-2018-0372_ref_037","doi-asserted-by":"crossref","unstructured":"L. Toth, Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition, in: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 190\u2013194, Florence, Italy, 2014.","DOI":"10.1109\/ICASSP.2014.6853584"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_038","doi-asserted-by":"crossref","unstructured":"L. Toth, Convolutional deep maxout networks for phone recognition, Interspeech, pp. 1078\u20131082, Singapore, September 14\u201318, 2014.","DOI":"10.21437\/Interspeech.2014-278"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_039","doi-asserted-by":"crossref","unstructured":"L. T\u00f3th, Phone recognition with hierarchical convolutional deep maxout networks, J. Audio Speech Music Proc. 2015 (2015), 25. https:\/\/doi.org\/10.1186\/s13636-015-0068-3.","DOI":"10.1186\/s13636-015-0068-3"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_040","doi-asserted-by":"crossref","unstructured":"E. Variani and T. Schaaf, VTLN in the MFCC domain: Band-limited versus local interpolation, in: Twelfth Annual Conference of the International Speech Communication Association, pp. 1273\u20131276, Florence, Italy, August 27\u201331, 2011.","DOI":"10.21437\/Interspeech.2011-104"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_041","doi-asserted-by":"crossref","unstructured":"A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang, Phoneme recognition using time-delay neural networks, in: IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 328\u2013339, Elsevier, Amsterdam, The Netherlands, pp. 393\u2013404, 1989.","DOI":"10.1109\/29.21701"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_042","doi-asserted-by":"crossref","unstructured":"R. G. Wijnhoven and P. de With, Fast training of object detection using stochastic gradient descent, in: Pattern Recognition (ICPR), 2010 20th International Conference on, pp. 424\u2013427, Istanbul, Turkey, August 23\u201326, 2010.","DOI":"10.1109\/ICPR.2010.112"},{"key":"2025120523341680935_j_jisys-2018-0372_ref_043","unstructured":"S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, Odell, D. J. Ollason and D. Povey, The HTK book, Cambridge University Engineering Department, vol. 3, p. 175, Cambridge, UK, 2002."},{"key":"2025120523341680935_j_jisys-2018-0372_ref_044","unstructured":"M. D. Zeiler and R. Fergus, Stochastic pooling for regularization of deep convolutional neural networks, in: International Conference on Learning Representation, Scottsdale, AZ, USA, 2013."}],"container-title":["Journal of Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.degruyter.com\/view\/journals\/jisys\/29\/1\/article-p1261.xml","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.degruyterbrill.com\/document\/doi\/10.1515\/jisys-2018-0372\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.degruyterbrill.com\/document\/doi\/10.1515\/jisys-2018-0372\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,5]],"date-time":"2025-12-05T23:35:30Z","timestamp":1764977730000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.degruyterbrill.com\/document\/doi\/10.1515\/jisys-2018-0372\/html"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,3,5]]},"references-count":44,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2019,9,20]]},"published-print":{"date-parts":[[2019,12,18]]}},"alternative-id":["10.1515\/jisys-2018-0372"],"URL":"https:\/\/doi.org\/10.1515\/jisys-2018-0372","relation":{},"ISSN":["2191-026X","0334-1860"],"issn-type":[{"value":"2191-026X","type":"electronic"},{"value":"0334-1860","type":"print"}],"subject":[],"published":{"date-parts":[[2019,3,5]]}}}