{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T01:57:14Z","timestamp":1760234234064,"version":"build-2065373602"},"reference-count":54,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2021,4,28]],"date-time":"2021-04-28T00:00:00Z","timestamp":1619568000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"ITMO University","award":["620173"],"award-info":[{"award-number":["620173"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token\u2019s contexts and to regularize their distribution for the model\u2019s recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.<\/jats:p>","DOI":"10.3390\/s21093063","type":"journal-article","created":{"date-parts":[[2021,4,28]],"date-time":"2021-04-28T22:29:07Z","timestamp":1619648947000},"page":"3063","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4690-705X","authenticated-orcid":false,"given":"Aleksandr","family":"Laptev","sequence":"first","affiliation":[{"name":"Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8697-832X","authenticated-orcid":false,"given":"Andrei","family":"Andrusenko","sequence":"additional","affiliation":[{"name":"Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1617-4369","authenticated-orcid":false,"given":"Ivan","family":"Podluzhny","sequence":"additional","affiliation":[{"name":"Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9791-6418","authenticated-orcid":false,"given":"Anton","family":"Mitrofanov","sequence":"additional","affiliation":[{"name":"Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia"},{"name":"STC-Innovations Ltd., 194044 Saint-Petersburg, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5381-3433","authenticated-orcid":false,"given":"Ivan","family":"Medennikov","sequence":"additional","affiliation":[{"name":"Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia"},{"name":"STC-Innovations Ltd., 194044 Saint-Petersburg, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7010-1585","authenticated-orcid":false,"given":"Yuri","family":"Matveev","sequence":"additional","affiliation":[{"name":"Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia"},{"name":"STC-Innovations Ltd., 194044 Saint-Petersburg, Russia"}]}],"member":"1968","published-online":{"date-parts":[[2021,4,28]]},"reference":[{"key":"ref_1","unstructured":"Olson, C., and Kemery, K. (2020, December 26). Voice Report: From Answers to Action: Customer Adoption of Voice Technology and Digital Assistants; Microsoft. Available online: https:\/\/about.ads.microsoft.com\/en-us\/insights\/2019-voice-report."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Pal, D., Arpnikanondt, C., Funilkul, S., and Varadarajan, V. (2019, January 6\u20138). User Experience with Smart Voice Assistants: The Accent Perspective. Proceedings of the IEEE 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kanpur, India.","DOI":"10.1109\/ICCCNT45670.2019.8944754"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Sainath, T., He, Y., Li, B., Narayanan, A., Pang, R., Bruguier, A., Chang, S.Y., Li, W., Alvarez, R., and Chen, Z. (2020, January 4\u20138). A Streaming On-Device End-To-End Model Surpassing Server-Side Conventional Model Quality and Latency. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9054188"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Huang, W., Hu, W., Yeung, Y.T., and Chen, X. (2020, January 25\u201329). Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition. Proceedings of the Interspeech 2020, ISCA, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-2361"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Sigtia, S., Haynes, R., Richards, H., Marchi, E., and Bridle, J. (2018, January 2\u20136). Efficient Voice Trigger Detection for Low Resource Hardware. Proceedings of the Interspeech 2018, ISCA, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-2204"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1109\/MSP.2012.2205597","article-title":"Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups","volume":"29","author":"Hinton","year":"2012","journal-title":"Signal Process. Mag. IEEE"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Graves, A., Fern\u00e1ndez, S., Gomez, F., and Schmidhuber, J. (2006, January 25\u201329). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd international conference on Machine learning\u2014ICML, Pittsburgh, PA, USA.","DOI":"10.1145\/1143844.1143891"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Chan, W., Jaitly, N., Le, Q.V., and Vinyals, O. (2016, January 20\u201325). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472621"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Aleksic, P., Allauzen, C., Elson, D., Kracun, A., Casado, D.M., and Moreno, P.J. (2015, January 19\u201324). Improved recognition of contact names in voice commands. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia.","DOI":"10.1109\/ICASSP.2015.7178957"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Tulsiani, H., Sapru, A., Arsikere, H., Punjabi, S., and Garimella, S. (2020, January 25\u201329). Improved Training Strategies for End-to-End Speech Recognition in Digital Voice Assistants. Proceedings of the Interspeech 2020, ISCA, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-2036"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Khokhlov, Y., Tomashenko, N., Medennikov, I., and Romanenko, A. (2017, January 20\u201324). Fast and Accurate OOV Decoder on High-Level Features. Proceedings of the Interspeech 2017, ISCA, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-1367"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Gandhe, A., Rastrow, A., and Hoffmeister, B. (2018, January 18\u201321). Scalable Language Model Adaptation for Spoken Dialogue Systems. Proceedings of the IEEE 2018 Spoken Language Technology Workshop (SLT), Athens, Greece.","DOI":"10.1109\/SLT.2018.8639663"},{"key":"ref_13","unstructured":"Malkovsky, N., Bataev, V., Sviridkin, D., Kizhaeva, N., Laptev, A., Valiev, I., and Petrov, O. (2020). Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Smit, P., Virpioja, S., and Kurimo, M. (2017, January 20\u201324). Improved Subword Modeling for WFST-Based Speech Recognition. Proceedings of the Interspeech 2017, ISCA, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-103"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Jain, M., Keren, G., Mahadeokar, J., Zweig, G., Metze, F., and Saraf, Y. (2020, January 25\u201329). Contextual RNN-T for Open Domain ASR. Proceedings of the Interspeech 2020, ISCA, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-2986"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Sennrich, R., Haddow, B., and Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics.","DOI":"10.18653\/v1\/P16-1162"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics.","DOI":"10.18653\/v1\/P18-1007"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Creutz, M., and Lagus, K. (2002). Unsupervised Discovery of Morphemes. Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, Association for Computational Linguistics.","DOI":"10.3115\/1118647.1118650"},{"key":"ref_19","unstructured":"Gr\u00f6nroos, S.A., Virpioja, S., and Kurimo, M. (2020). Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning. Proceedings of the 12th Language Resources and Evaluation Conference, ELRA."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Provilkov, I., Emelianenko, D., and Voita, E. (2020). BPE-Dropout: Simple and Effective Subword Regularization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics.","DOI":"10.18653\/v1\/2020.acl-main.170"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Drexler, J., and Glass, J. (2019, January 12\u201317). Subword Regularization and Beam Search Decoding for End-to-end Automatic Speech Recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683531"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Lakomkin, E., Heymann, J., Sklyar, I., and Wiesler, S. (2020, January 25\u201329). Subword Regularization: An Analysis of Scalability and Generalization for End-to-End Automatic Speech Recognition. Proceedings of the Interspeech 2020, ISCA, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-1569"},{"key":"ref_23","unstructured":"Tapo, A.A., Coulibaly, B., Diarra, S., Homan, C., Kreutzer, J., Luger, S., Nagashima, A., Zampieri, M., and Leventhal, M. (2020). Neural Machine Translation for Extremely Low-Resource African Languages: A Case Study on Bambara. Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages (LoResMT 2020), Association for Computational Linguistics."},{"key":"ref_24","unstructured":"Knowles, R., Larkin, S., Stewart, D., and Littell, P. (2020, January 19\u201320). NRC Systems for Low Resource German-Upper Sorbian Machine Translation 2020: Transfer Learning with Lexical Modifications. Proceedings of the Fifth Conference on Machine Translation, Seattle, WA, USA."},{"key":"ref_25","unstructured":"Libovick\u00fd, J., Hangya, V., Schmid, H., and Fraser, A. (2020, January 19\u201320). The LMU Munich System for the WMT20 Very Low Resource Supervised MT Task. Proceedings of the Fifth Conference on Machine Translation, online."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"He, X., Haffari, G., and Norouzi, M. (2020, January 6\u20138). Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, WA, USA.","DOI":"10.18653\/v1\/2020.acl-main.275"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Gr\u00f6nroos, S.A., Virpioja, S., and Kurimo, M. (2021). Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation. Machine Translation, Springer International Publishing.","DOI":"10.1007\/s10590-020-09253-x"},{"key":"ref_28","unstructured":"Andresen, J., Bills, A., Dubinski, E., Fiscus, J., Gillies, B., Mary Harper, T., Jarrett, A., Roomi, B., Ray, J., and Rytting, A. (2016). IARPA Babel Turkish Language Pack, IARPA-babel105bv0.5 LDC2016S10, Linguistic Data Consortium."},{"key":"ref_29","unstructured":"Bills, A., Conners, T., David, A., Dubinski, E., Fiscus, J., Hammond, S., Gann, K., Harper, M., Hefright, B., and Kazi, M. (2016). IARPA Babel Georgian Language Pack, IARPA-babel404b-v1.0a LDC2016S12, Linguistic Data Consortium."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Kriman, S., Beliaev, S., Ginsburg, B., Huang, J., Kuchaiev, O., Lavrukhin, V., Leary, R., Li, J., and Zhang, Y. (2020, January 4\u20138). Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053889"},{"key":"ref_31","unstructured":"Graves, A. (July, January 26). Sequence Transduction with Recurrent Neural Networks. Proceedings of the 29th International Conference on Machine Learning\u2014ICML, Workshop on Representation Learning, Edinburgh, Scotland."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Li, J., Zhao, R., Meng, Z., Liu, Y., Wei, W., Parthasarathy, S., Mazalov, V., Wang, Z., He, L., and Zhao, S. (2020, January 25\u201329). Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability. Proceedings of the Interspeech 2020, ISCA, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-3016"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Andrusenko, A., Laptev, A., and Medennikov, I. (2020, January 25\u201329). Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription. Proceedings of the Interspeech 2020, ISCA, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-1074"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Kim, S., Hori, T., and Watanabe, S. (2017, January 5\u20139). Joint CTC-attention based end-to-end speech recognition using multi-task learning. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953075"},{"key":"ref_35","first-page":"5998","article-title":"Attention is All you Need","volume":"Volume 30","author":"Vaswani","year":"2017","journal-title":"Advances in Neural Information Processing Systems 30 (NIPS 2017)"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Andrusenko, A., Laptev, A., and Medennikov, I. (2020). Exploration of End-to-End ASR for OpenSTT \u2013 Russian Open Speech-to-Text Dataset. Speech and Computer, Springer International Publishing.","DOI":"10.1007\/978-3-030-60276-5_4"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Laptev, A., Korostik, R., Svischev, A., Andrusenko, A., Medennikov, I., and Rybin, S. (2020, January 17\u201319). You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation. Proceedings of the IEEE 2020 13th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Chengdu, China.","DOI":"10.1109\/CISP-BMEI51763.2020.9263564"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., and Wu, Y. (2020, January 25\u201329). Conformer: Convolution-augmented Transformer for Speech Recognition. Proceedings of the Interspeech 2020, ISCA, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Guo, P., Boyer, F., Chang, X., Hayashi, T., Higuchi, Y., Inaguma, H., Kamo, N., Li, C., Garcia-Romero, D., and Shi, J. (2020). Recent Developments on ESPnet Toolkit Boosted by Conformer. arXiv.","DOI":"10.1109\/ICASSP39728.2021.9414858"},{"key":"ref_40","first-page":"337","article-title":"Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century","volume":"97","author":"Scott","year":"2002","journal-title":"Taylor Fr."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Alexeev, A., Kukharev, G., Matveev, Y., and Matveev, A. (2020). A Highly Efficient Neural Network Solution for Automated Detection of Pointer Meters with Different Analog Scales Operating in Different Conditions. Mathematics, 8.","DOI":"10.3390\/math8071104"},{"key":"ref_42","unstructured":"Kingma, D.P., and Ba, J. (2015, January 7\u20139). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA."},{"key":"ref_43","first-page":"369","article-title":"Super-convergence: Very fast training of neural networks using large learning rates","volume":"Volume 11006","author":"Smith","year":"2019","journal-title":"Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Park, D.S., Chan, W., Zhang, Y., Chiu, C.C., Zoph, B., Cubuk, E.D., and Le, Q.V. (2019, January 15\u201319). SpecAugment: A simple data augmentation method for automatic speech recognition. Proceedings of the Interspeech 2019, ISCA, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-2680"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Kudo, T., and Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.","DOI":"10.18653\/v1\/D18-2012"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Enrique Yalta Soplin, N., Heymann, J., Wiesner, M., and Chen, N. (2018, January 2\u20136). ESPnet: End-to-End Speech Processing Toolkit. Proceedings of the Interspeech 2018, ISCA, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1456"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Watanabe, S., Boyer, F., Chang, X., Guo, P., Hayashi, T., Higuchi, Y., Hori, T., Huang, W.C., Inaguma, H., and Kamo, N. (2020). The 2020 ESPnet update: New features, broadened applications, performance improvements, and future plans. arXiv.","DOI":"10.1109\/DSLW51110.2021.9523402"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. (2015, January 6\u201310). Audio augmentation for speech recognition. Proceedings of the Interspeech 2015, ISCA, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-711"},{"key":"ref_49","unstructured":"Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., and Schwarz, P. (2011, January 11\u201315). The kaldi speech recognition toolkit. Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, Waikoloa, HI, USA."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Fisher, W.M., and Fiscus, J.G. (1993, January 27\u201330). Better alignment procedures for speech recognition evaluation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Minneapolis, MN, USA.","DOI":"10.1109\/ICASSP.1993.319229"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Bataev, V., Korenevsky, M., Medennikov, I., and Zatvornitskiy, A. (2018, January 18\u201322). Exploring end-to-end techniques for low-resource speech recognition. Proceedings of the International Conference on Speech and Computer, Leipzig, Germany.","DOI":"10.1007\/978-3-319-99579-3_4"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Cho, J., Baskar, M.K., Li, R., Wiesner, M., Mallidi, S.H., Yalta, N., Karafi\u00e1t, M., Watanabe, S., and Hori, T. (2018, January 18\u201321). Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling. Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece.","DOI":"10.1109\/SLT.2018.8639655"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Conneau, A., Baevski, A., Collobert, R., Mohamed, A., and Auli, M. (2020). Unsupervised cross-lingual representation learning for speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2021-329"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Alum\u00e4e, T., Karakos, D., Hartmann, W., Hsiao, R., Zhang, L., Nguyen, L., Tsakalidis, S., and Schwartz, R. (2017, January 5\u20139). The 2016 BBN Georgian telephone speech keyword spotting system. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953259"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/9\/3063\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:53:39Z","timestamp":1760162019000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/9\/3063"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,4,28]]},"references-count":54,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2021,5]]}},"alternative-id":["s21093063"],"URL":"https:\/\/doi.org\/10.3390\/s21093063","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2021,4,28]]}}}