{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T15:25:45Z","timestamp":1775229945404,"version":"3.50.1"},"reference-count":34,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2023,4,24]],"date-time":"2023-04-24T00:00:00Z","timestamp":1682294400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Altice Labs"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"<jats:p>Automatic speech recognition (ASR), commonly known as speech-to-text, is the process of transcribing audio recordings into text, i.e., transforming speech into the respective sequence of words. This paper presents a deep learning ASR system optimization and evaluation for the European Portuguese language. We present a pipeline composed of several stages for data acquisition, analysis, pre-processing, model creation, and evaluation. A transfer learning approach is proposed considering an English language-optimized model as starting point; a target composed of European Portuguese; and the contribution to the transfer process by a source from a different domain consisting of a multiple-variant Portuguese language dataset, essentially composed of Brazilian Portuguese. A domain adaptation was investigated between European Portuguese and mixed (mostly Brazilian) Portuguese. The proposed optimization evaluation used the NVIDIA NeMo framework implementing the QuartzNet15\u00d75 architecture based on 1D time-channel separable convolutions. Following this transfer learning data-centric approach, the model was optimized, achieving a state-of-the-art word error rate (WER) of 0.0503.<\/jats:p>","DOI":"10.3390\/fi15050159","type":"journal-article","created":{"date-parts":[[2023,4,24]],"date-time":"2023-04-24T03:38:07Z","timestamp":1682307487000},"page":"159","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Domain Adaptation Speech-to-Text for Low-Resource European Portuguese Using Deep Learning"],"prefix":"10.3390","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7224-6830","authenticated-orcid":false,"given":"Eduardo","family":"Medeiros","sequence":"first","affiliation":[{"name":"Escola de Ci\u00eancias e Tecnologia, Universidade de \u00c9vora, 7000-671 \u00c9vora, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3231-5520","authenticated-orcid":false,"given":"Leonel","family":"Corado","sequence":"additional","affiliation":[{"name":"Escola de Ci\u00eancias e Tecnologia, Universidade de \u00c9vora, 7000-671 \u00c9vora, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4492-7548","authenticated-orcid":false,"given":"Lu\u00eds","family":"Rato","sequence":"additional","affiliation":[{"name":"Escola de Ci\u00eancias e Tecnologia, Universidade de \u00c9vora, 7000-671 \u00c9vora, Portugal"},{"name":"Centro ALGORITMI, Vista Lab, Universidade de \u00c9vora, 7000-671 \u00c9vora, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5086-059X","authenticated-orcid":false,"given":"Paulo","family":"Quaresma","sequence":"additional","affiliation":[{"name":"Escola de Ci\u00eancias e Tecnologia, Universidade de \u00c9vora, 7000-671 \u00c9vora, Portugal"},{"name":"Centro ALGORITMI, Vista Lab, Universidade de \u00c9vora, 7000-671 \u00c9vora, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7614-2951","authenticated-orcid":false,"given":"Pedro","family":"Salgueiro","sequence":"additional","affiliation":[{"name":"Escola de Ci\u00eancias e Tecnologia, Universidade de \u00c9vora, 7000-671 \u00c9vora, Portugal"},{"name":"Centro ALGORITMI, Vista Lab, Universidade de \u00c9vora, 7000-671 \u00c9vora, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2023,4,24]]},"reference":[{"key":"ref_1","unstructured":"Ruan, S., Wobbrock, J.O., Liou, K., Ng, A., and Landay, J. (2016). Speech Is 3x Faster than Typing for English and Mandarin Text Entry on Mobile Devices. arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"45","DOI":"10.30744\/brjac.2179-3425.AR-38-2021","article-title":"Data Mining, Machine Learning, Deep Learning, Chemometrics Definitions, Common Points and Trends (Spoiler Alert: VALIDATE your models!)","volume":"8","author":"Amigo","year":"2021","journal-title":"Braz. J. Anal. Chem."},{"key":"ref_3","unstructured":"Eberhard, D.M., Simons, G.F., and Fennig, C.D. (2023). Ethnologue: Languages of the World, SIL International. [26th ed.]."},{"key":"ref_4","unstructured":"Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning, MIT Press."},{"key":"ref_5","unstructured":"Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg, B., Kriman, S., Beliaev, S., Lavrukhin, V., and Cook, J. (2019). NeMo: A toolkit for building AI applications using Neural Modules. arXiv."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Kriman, S., Beliaev, S., Ginsburg, B., Huang, J., Kuchaiev, O., Lavrukhin, V., Leary, R., Li, J., and Zhang, Y. (2020, January 4\u20138). Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions. Proceedings of the ICASSP 2020\u20142020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053889"},{"key":"ref_7","unstructured":"Li, J. (2023, April 16). Recent Advances in End-to-End Automatic Speech Recognition, Available online: http:\/\/xxx.lanl.gov\/abs\/2111.01690."},{"key":"ref_8","unstructured":"Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Bengio, C.L.Y., and Courville, A. (2023, April 16). Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks, Available online: http:\/\/xxx.lanl.gov\/abs\/1701.02720."},{"key":"ref_9","first-page":"1764","article-title":"Towards End-To-End Speech Recognition with Recurrent Neural Networks","volume":"Volume 32","author":"Xing","year":"2014","journal-title":"Proceedings of Machine Learning Research, Proceedings of the 31st International Conference on Machine Learning"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Peddinti, V., Povey, D., and Khudanpur, S. (2015, January 6\u201310). A time delay neural network architecture for efficient modeling of long temporal contexts. Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-647"},{"key":"ref_11","unstructured":"Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023, April 16). Robust Speech Recognition via Large-Scale Weak Supervision, Available online: http:\/\/xxx.lanl.gov\/abs\/2212.04356."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., and Gadde, R.T. (2019). Jasper: An End-to-End Convolutional Neural Acoustic Model. arXiv.","DOI":"10.21437\/Interspeech.2019-1819"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"532","DOI":"10.1016\/j.procs.2021.03.067","article-title":"Applying transfer learning and various ANN architectures to predict transportation mode choice in Amsterdam","volume":"184","author":"Buijs","year":"2021","journal-title":"Procedia Comput. Sci."},{"key":"ref_14","unstructured":"Xavier Sampaio, M., Pires Magalh\u00e3es, R., Linhares Coelho da Silva, T., Almada Cruz, L., Romero de Vasconcelos, D., Ant\u00f4nio Fernandes de Mac\u00eado, J., and Gon\u00e7alves Fontenele Ferreira, M. (2021). Anais do XXXVI Simp\u00f3sio Brasileiro de Bancos de Dados, SBC. Technical Report."},{"key":"ref_15","unstructured":"Dalmia, S., Sanabria, R., Metze, F., and Black, A.W. (2023, April 16). Sequence-based Multi-lingual Low Resource Speech Recognition, Available online: http:\/\/xxx.lanl.gov\/abs\/1802.07420."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Cho, J., Baskar, M.K., Li, R., Wiesner, M., Mallidi, S., Yalta, N., Karafi\u00e1t, M., Watanabe, S., and Hori, T. (2018, January 18\u201321). Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling. Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece.","DOI":"10.1109\/SLT.2018.8639655"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Pellegrini, T., H\u00e4m\u00e4l\u00e4inen, A., de Mare\u00fcil, P.B., Tjalve, M., Trancoso, I., Candeias, S., Dias, M.S., and Braga, D. (2013, January 25\u201329). A corpus-based study of elderly and young speakers of European Portuguese: Acoustic correlates and their impact on speech recognition performance. Proceedings of the Proc. Interspeech 2013, Lyon, France.","DOI":"10.21437\/Interspeech.2013-241"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A.S., and Volpe Nunes, M.d.G. (2014). Computational Processing of the Portuguese Language. PROPOR 2014. Lecture Notes in Computer Science, Springer.","DOI":"10.1007\/978-3-319-09761-9"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Mamede, N.J., Trancoso, I., Baptista, J., and das Gra\u00e7as Volpe Nunes, M. (2003). Proceedings of the Computational Processing of the Portuguese Language, Springer.","DOI":"10.1007\/3-540-45011-4"},{"key":"ref_20","unstructured":"Meinedo, H., Abad, A., Pellegrini, T., Neto, J., and Trancoso, I. (2010, January 10\u201312). The L2F Broadcast News Speech Recognition System. Proceedings of the Fala 2010 Conference, Vigo, Spain."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"101055","DOI":"10.1016\/j.csl.2019.101055","article-title":"A survey on automatic speech recognition systems for Portuguese language and its variations","volume":"62","year":"2020","journal-title":"Comput. Speech Lang."},{"key":"ref_22","unstructured":"Gris, L.R.S., Casanova, E., Oliveira, F.S.d., Soares, A.d.S., and Candido-Junior, A. (2021). Anais do Brazilian e-Science Workshop (BreSci), SBC."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Pinheiro, V., Gamallo, P., Amaro, R., Scarton, C., Batista, F., Silva, D., Magro, C., and Pinto, H. (2022). Proceedings of the Computational Processing of the Portuguese Language, Springer International Publishing.","DOI":"10.1007\/978-3-030-98305-5"},{"key":"ref_24","unstructured":"Macedo Quintanilha, I. (2017). End-to-End Speech Recognition Applied to Brazilian Portuguese Using Deep Learning. [Master\u2019s Thesis, Universidade Federal do Rio de Janeiro]."},{"key":"ref_25","first-page":"230","article-title":"An open-source end-to-end ASR system for Brazilian Portuguese using DNNs built from newly assembled corpora","volume":"35","author":"Quintanilha","year":"2020","journal-title":"J. Commun. Inf. Syst."},{"key":"ref_26","unstructured":"Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A., and Diamos, G. (2015). International Conference on Machine Learning, PMLR."},{"key":"ref_27","unstructured":"Tejedor-Garc\u00eda, C., Escudero-Mancebo, D., Gonz\u00e1lez-Ferreras, C., C\u00e1mara-Arenas, E., and Carde\u00f1oso-Payo, V. (2023, March 13). TipTopTalk! Mobile Application for Speech Training Using Minimal Pairs and Gamification. Available online: https:\/\/uvadoc.uva.es\/handle\/10324\/27857."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Saon, G., Kurata, G., Sercu, T., Audhkhasi, K., Thomas, S., Dimitriadis, D., Cui, X., Ramabhadran, B., Picheny, M., and Lim, L.L. (2017). English Conversational Telephone Speech Recognition by Humans and Machines. arXiv.","DOI":"10.21437\/Interspeech.2017-405"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015, January 19\u201324). Librispeech: An ASR corpus based on public domain audio books. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia.","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Pratap, V., Xu, Q., Sriram, A., Synnaeve, G., and Collobert, R. (2020, January 25\u201329). MLS: A Large-Scale Multilingual Dataset for Speech Research. Proceedings of the Interspeech 2020, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-2826"},{"key":"ref_31","unstructured":"(1988). Pulse Code Modulation (PCM) of Voice Frequencies (Standard No. G.711 Standard)."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Mordido, G., Van Keirsbilck, M., and Keller, A. (2021). Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices. arXiv.","DOI":"10.21437\/Interspeech.2021-141"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Guo, Y., Li, Y., Feris, R., Wang, L., and Rosing, T. (2019). Depthwise Convolution is All You Need for Learning Multiple Visual Domains. arXiv.","DOI":"10.1609\/aaai.v33i01.33018368"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Feitelson, D., Rudolph, L., and Schwiegelshohn, U. (2003). Proceedings of the Job Scheduling Strategies for Parallel Processing, Springer.","DOI":"10.1007\/10968987"}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/15\/5\/159\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:22:14Z","timestamp":1760124134000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/15\/5\/159"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,4,24]]},"references-count":34,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2023,5]]}},"alternative-id":["fi15050159"],"URL":"https:\/\/doi.org\/10.3390\/fi15050159","relation":{},"ISSN":["1999-5903"],"issn-type":[{"value":"1999-5903","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,4,24]]}}}