{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,13]],"date-time":"2026-02-13T12:44:15Z","timestamp":1770986655741,"version":"3.50.1"},"reference-count":32,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2021,4,12]],"date-time":"2021-04-12T00:00:00Z","timestamp":1618185600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Recent research shows recurrent neural network-Transducer (RNN-T) architecture has become a mainstream approach for streaming speech recognition. In this work, we investigate the VGG2 network as the input layer to the RNN-T in streaming speech recognition. Specifically, before the input feature is passed to the RNN-T, we introduce a gated-VGG2 block, which uses the first two layers of the VGG16 to extract contextual information in the time domain, and then use a SEnet-style gating mechanism to control what information in the channel domain is to be propagated to RNN-T. The results show that the RNN-T model with the proposed gated-VGG2 block brings significant performance improvement when compared to the existing RNN-T model, and it has a lower latency and character error rate than the Transformer-based model.<\/jats:p>","DOI":"10.3390\/info12040165","type":"journal-article","created":{"date-parts":[[2021,4,12]],"date-time":"2021-04-12T11:05:06Z","timestamp":1618225506000},"page":"165","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["A 2D Convolutional Gating Mechanism for Mandarin Streaming Speech Recognition"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3702-2380","authenticated-orcid":false,"given":"Xintong","family":"Wang","sequence":"first","affiliation":[{"name":"College of Science, Beijing Forestry University, Beijing 100083, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chuangang","family":"Zhao","sequence":"additional","affiliation":[{"name":"School of Information Science &amp; Technology, Beijing Forestry University, Beijing 100083, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,4,12]]},"reference":[{"key":"ref_1","unstructured":"Selfridge, E., Arizmendi, I., Heeman, P.A., and Williams, J.D. (2011, January 17\u201318). Stability and accuracy in incremental speech recognition. Proceedings of the SIGDIAL 2011 Conference, Portland, OR, USA."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Arivazhagan, N., Cherry, C., Te, I., Macherey, W., Baljekar, P., and Foster, G. (2020, January 4\u20138). Re-translation strategies for long form, simultaneous, spoken language translation. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9054585"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Chan, W., Jaitly, N., Le, Q., and Vinyals, O. (2016, January 20\u201325). Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472621"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Kim, S., Hori, T., and Watanabe, S. (2017, January 5\u20139). Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-Task Learning. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953075"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Karita, S., Chen, N., Hayashi, T., Hori, T., Inaguma, H., Jiang, Z., Someki, M., Soplin, N.E.Y., Yamamoto, R., and Zhang, W. (2019, January 14\u201318). A Comparative Study on Transformer vs RNN in Speech Applications. Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore.","DOI":"10.1109\/ASRU46091.2019.9003750"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., and Pang, R. (2020). Conformer: Convolution-Augmented Transformer for Speech Recognition. arXiv.","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Emiru, E.D., Xiong, S., Li, Y., Fesseha, A., and Diallo, M. (2021). Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings. Information, 12.","DOI":"10.3390\/info12020062"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long Short-Term Memory","volume":"9","author":"Sepp","year":"1997","journal-title":"Neural Comput."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Sak, H., Shannon, M., Rao, K., and Beaufays, F. (2017, January 20\u201324). Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping. Proceedings of the Interspeech 2017: Conference of the International Speech Communication Association, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-1705"},{"key":"ref_10","unstructured":"Jaitly, N., Sussillo, D., Le, Q.V., Vinyals, O., Sutskever, I., and Bengio, S. (2016). A Neural Transducer. arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Graves, A. (2012). Sequence Transduction with Recurrent Neural Networks. arXiv.","DOI":"10.1007\/978-3-642-24797-2"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Graves, A., Mohamed, A.R., and Hinton, G. (2013, January 26\u201330). Speech Recognition with Deep Recurrent Neural Networks. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6638947"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Rao, K., Sak, H., and Prabhavalkar, R. (2017, January 16\u201320). Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer. Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan.","DOI":"10.1109\/ASRU.2017.8268935"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"He, Y., Sainath, T.N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., Rybach, D., Kannan, A., Wu, Y., and Gruenstein, A. (2019, January 12\u201317). Streaming End-to-End Speech Recognition for Mobile Devices. Proceedings of the ICASSP 2019\u20142019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8682336"},{"key":"ref_15","unstructured":"Yeh, C.F., Mahadeokar, J., Kalgaonkar, K., Wang, Y., Le, D., Jain, M., Schubert, K., Fuegen, C., and Seltzer, M.L. (2019). Transformer transducer: End-to-end speech recognition with self-attention. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Zhang, Q., Lu, H., Sak, H., Tripathi, A., McDermott, E., Koo, S., and Kumar, S. (2020, January 4\u20138). Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. Proceedings of the ICASSP 2020\u20142020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053896"},{"key":"ref_17","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, January 4\u20139). Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_18","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv."},{"key":"ref_19","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2015, January 7\u20139). Neural Machine Translation by Jointly Learning to Align and Translate. Proceedings of the ICLR 2015: International Conference on Learning Representations 2015, San Diego, CA, USA."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Wang, Y., Li, X., Yang, Y., Anwar, A., and Dong, R. (2021). Hybrid System Combination Framework for Uyghur\u2013Chinese Machine Translation. Information, 12.","DOI":"10.3390\/info12030098"},{"key":"ref_21","unstructured":"Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv."},{"key":"ref_22","unstructured":"Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). The Expressive Power of Neural Networks: A View from the Width. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Bu, H., Du, J., Na, X., Wu, B., and Zheng, H. (2017, January 1\u20133). AISHELL-1: An Open-Source Mandarin Speech Corpus and a Speech Recognition Baseline. Proceedings of the 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I\/O Systems and Assessment (O-COCOSDA), Seoul, Korea.","DOI":"10.1109\/ICSDA.2017.8384449"},{"key":"ref_24","unstructured":"Simonyan, K., and Zisserman, A. (2015). Very Deep Convolutional Networks for Large Scale Image Recognition. arXiv."},{"key":"ref_25","unstructured":"Mohamed, A., Okhonko, D., and Zettlemoyer, L. (2019). Transformers with convolutional context for ASR. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Huang, W., Hu, W., Yeung, Y., and Chen, X. (2020, January 25\u201329). Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition. Proceedings of the Interspeech 2020, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-2361"},{"key":"ref_27","unstructured":"Dauphin, Y.N., Fan, A., Auli, M., and Grangier, D. (2017). Language Modeling with Gated Convolutional Networks. arXiv."},{"key":"ref_28","unstructured":"Lin, M., Chen, Q., and Yan, S. (2014, January 14\u201316). Network In Network. Proceedings of the ICLR 2014: International Conference on Learning Representations (ICLR) 2014, Banff, AB, Canada."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201323). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1145\/3065386","article-title":"ImageNet Classification with Deep Convolutional Neural Networks","volume":"60","author":"Krizhevsky","year":"2017","journal-title":"Commun. ACM"},{"key":"ref_31","unstructured":"Matthew, Z. (2012). ADADELTA: An adaptive learning rate method. arXiv."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N.E.Y., Heymann, J., Wiesner, M., and Chen, N. (2018). ESPnet: End-to-End Speech Processing Toolkit. arXiv.","DOI":"10.21437\/Interspeech.2018-1456"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/12\/4\/165\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,13]],"date-time":"2025-10-13T13:35:50Z","timestamp":1760362550000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/12\/4\/165"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,4,12]]},"references-count":32,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2021,4]]}},"alternative-id":["info12040165"],"URL":"https:\/\/doi.org\/10.3390\/info12040165","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,4,12]]}}}