{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T02:09:47Z","timestamp":1772762987965,"version":"3.50.1"},"reference-count":41,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2019,8,28]],"date-time":"2019-08-28T00:00:00Z","timestamp":1566950400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>When the National Centre for Human Language Technology (NCHLT) Speech corpus was released, it created various opportunities for speech technology development in the 11 official, but critically under-resourced, languages of South Africa. Since then, the substantial improvements in acoustic modeling that deep architectures achieved for well-resourced languages ushered in a new data requirement: their development requires hundreds of hours of speech. A suitable strategy for the enlargement of speech resources for the South African languages is therefore required. One possibility is to use data that has already been collected but not yet included in an existing corpus. Additional data was collected during the NCHLT project but not included in the official corpus, which contains only a curated but limited subset of the data. In this paper, we first analyze the additional resources that could be harvested from the auxiliary NCHLT data. We also measure the effect of this data on acoustic modeling. The analysis incorporates recent factorized time-delay neural networks (TDNN-F). These models significantly reduce phone error rates for all languages. In addition, data augmentation and cross-corpus validation experiments for a number of the datasets illustrate the utility of the auxiliary NCHLT data.<\/jats:p>","DOI":"10.3390\/info10090268","type":"journal-article","created":{"date-parts":[[2019,8,28]],"date-time":"2019-08-28T11:23:18Z","timestamp":1566991398000},"page":"268","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["The Usefulness of Imperfect Speech Data for ASR Development in Low-Resource Languages"],"prefix":"10.3390","volume":"10","author":[{"given":"Jaco","family":"Badenhorst","sequence":"first","affiliation":[{"name":"Human Technologies Research Group, CSIR Next Generation Enterprises and Institutions Cluster, P.O. Box 395, Pretoria 0001, South Africa"}]},{"given":"Febe","family":"de Wet","sequence":"additional","affiliation":[{"name":"Human Technologies Research Group, CSIR Next Generation Enterprises and Institutions Cluster, P.O. Box 395, Pretoria 0001, South Africa"},{"name":"Department of Electrical &amp; Electronic Engineering, Stellenbosch University, Private Bag X1, Stellenbosch 7602, South Africa"}]}],"member":"1968","published-online":{"date-parts":[[2019,8,28]]},"reference":[{"key":"ref_1","unstructured":"Roux, J.C., Louw, P.H., and Niesler, T. (2004, January 1). The African Speech Technology Project: An Assessment. Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC\u201904), Lisbon, Portugal."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1007\/s10579-011-9152-1","article-title":"Collecting and evaluating speech recognition corpora for 11 South African languages","volume":"3","author":"Badenhorst","year":"2011","journal-title":"Lang. Resour. Eval."},{"key":"ref_3","unstructured":"Calteaux, K., de Wet, F., Moors, C., van Niekerk, D., McAlister, B., Sharma-Grover, A., Reid, T., Davel, M., Barnard, E., and van Heerden, C. (2013). Lwazi II Final Report: Increasing the Impact of Speech Technologies in South Africa, CSIR. Technical Report."},{"key":"ref_4","unstructured":"Barnard, E., Davel, M.H., van Heerden, C., de Wet, F., and Badenhorst, J. (2014, January 14\u201316). The NCHLT speech corpus of the South African languages. Proceedings of the 4th Workshop on Spoken Language Technologies for Under-Resourced Languages, St. Petersburg, Russia."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"45","DOI":"10.1016\/j.procs.2016.04.028","article-title":"Developing speech resources from parliamentary data for South African English","volume":"81","author":"Badenhorst","year":"2016","journal-title":"Procedia Comput. Sci."},{"key":"ref_6","unstructured":"Eiselen, R., and Puttkammer, M.J. (2014, January 28). Developing Text Resources for Ten South African Languages. Proceedings of the Language Resource and Evaluation, Reykjavik, Iceland."},{"key":"ref_7","unstructured":"Camelin, N., Damnati, G., Bouchekif, A., Landeau, A., Charlet, D., and Est\u00e8ve, Y. (2018, January 22). FrNewsLink: A corpus linking TV Broadcast News Segments and Press Articles. Proceedings of the Language Resource and Evaluation, Miyazaki, Japan."},{"key":"ref_8","unstructured":"Takamichi, S., and Saruwatari, H. (2018, January 7\u201312). CPJD corpus: Crowdsourced parallel speech corpus of Japanese dialects. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan."},{"key":"ref_9","unstructured":"Salimbajevs, A. (2018, January 7\u201312). Creating Lithuanian and Latvian speech corpora from inaccurately annotated web data. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Baumann, T., K\u00f6hn, A., and Hennig, F. (2018). The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening. Lang. Resour. Eval., 1\u201327.","DOI":"10.1007\/s10579-017-9410-y"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1016\/j.specom.2013.07.001","article-title":"A smartphone-based ASR data collection tool for under-resourced languages","volume":"56","author":"Davel","year":"2014","journal-title":"Speech Commun."},{"key":"ref_12","unstructured":"Jones, K.S., Strassel, S., Walker, K., Graff, D., and Wright, J. (2016, January 23\u201328). Multi-language speech collection for NIST LRE. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portoro\u017e, Slovenia."},{"key":"ref_13","unstructured":"Ide, N., Reppen, R., and Suderman, K. (2002, January 29\u201331). The American National Corpus: More Than the Web Can Provide. Proceedings of the Third International Conference on Language Resources and Evaluation (LREC\u201902), Las Palmas, Spain."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Schalkwyk, J., Beeferman, D., Beaufays, F., Byrne, B., Chelba, C., Cohen, M., Kamvar, M., and Strope, B. (2010). \u201cYour Word is my Command\u201d: Google search by voice: A case study. Advances in Speech Recognition, Springer.","DOI":"10.1007\/978-1-4419-5951-5_4"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Cieri, C., Miller, D., and Walker, K. (2002, January 24\u201327). Research Methodologies, Observations and Outcomes in (Conversational) Speech Data Collection. Proceedings of the Second International Conference on Human Language Technology Research, San Diego, CA, USA.","DOI":"10.3115\/1289189.1289198"},{"key":"ref_16","unstructured":"De Wet, F., Louw, P., and Niesler, T. (December, January 29). The design, collection and annotation of speech databases in South Africa. Proceedings of the Pattern Recognition Association of South Africa (PRASA 2006), Bloemfontein, South Africa."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Br\u00fcmmer, N., and Garcia-Romero, D. (2014). Generative modeling for unsupervised score calibration. arXiv.","DOI":"10.1109\/ICASSP.2014.6853884"},{"key":"ref_18","unstructured":"Davel, M.H., van Heerden, C., and Barnard, E. (2012, January 7\u20139). Validating Smartphone-Collected Speech Corpora. Proceedings of the Third Workshop on Spoken Language Technologies for Under-resourced Languages, Cape Town, South Africa."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Badenhorst, J., Martinus, L., and De Wet, F. (2019, January 28\u201330). BLSTM harvesting of auxiliary NCHLT speech data. Proceedings of the 2019 Southern African Universities Power Engineering Conference\/Robotics and Mechatronics\/Pattern Recognition Association of South Africa (SAUPEC\/RobMech\/PRASA), Bloemfontein, South Africa.","DOI":"10.1109\/RoboMech.2019.8704835"},{"key":"ref_20","unstructured":"Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., and Povey, D. (2019, June 27). The HTK Book. Revised for HTK Version 3.4. Available online: http:\/\/htk.eng.cam.ac.uk\/\/."},{"key":"ref_21","unstructured":"Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., and Schwarz, P. (2011, January 11\u201315). The Kaldi speech recognition toolkit. Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Hilton Waikoloa Village, Big Island, HI, USA."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Badenhorst, J., and de Wet, F. (December, January 30). The limitations of data perturbation for ASR of learner data in under-resourced languages. Proceedings of the 2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics (PRASA-RobMech), Bloemfontein, South Africa.","DOI":"10.1109\/RoboMech.2017.8261121"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"van Heerden, C., Kleynhans, N., and Davel, M. (2016, January 8\u201312). Improving the Lwazi ASR baseline. Proceedings of the Interspeech 2016, San Francisco, CA, USA.","DOI":"10.21437\/Interspeech.2016-1412"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Peddinti, V., Povey, D., and Khudanpur, S. (2015, January 6\u201310). A time delay neural network architecture for efficient modeling of long temporal contexts. Proceedings of the INTERSPEECH 2015 16th Annual Conference of the International Speech Communication Association, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-647"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"328","DOI":"10.1109\/29.21701","article-title":"Phoneme recognition using time-delay neural networks","volume":"37","author":"Waibel","year":"1989","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Sak, H., Senior, A., and Beaufays, F. (2014). Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2014-80"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Yu, Z., Ramanarayanan, V., Suendermann-Oeft, D., Wang, X., Zechner, K., Chen, L., Tao, J., Ivanou, A., and Qian, Y. (2015, January 13\u201317). Using bidirectional LSTM recurrent neural networks to learn high-level abstractions of sequential features for automated scoring of non-native spontaneous speech. Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA.","DOI":"10.1109\/ASRU.2015.7404814"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"373","DOI":"10.1109\/LSP.2017.2723507","article-title":"Low latency acoustic modeling using temporal convolution and LSTMs","volume":"25","author":"Peddinti","year":"2018","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Karafiat, M., Baskar, M.K., Vesely, K., Grezl, F., Burget, L., and \u010cernock\u00fd, J.C. (2018, January 15\u201320). Analysis of multilingual BLSTM acoustic model on low and high resource languages. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462083"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"012076","DOI":"10.1088\/1742-6596\/1229\/1\/012076","article-title":"Deeper Time Delay Neural Networks for Effective Acoustic Modeling","volume":"1229","author":"Huang","year":"2019","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"012077","DOI":"10.1088\/1742-6596\/1229\/1\/012077","article-title":"Gated Time Delay Neural Network for Speech Recognition","volume":"1229","author":"Chen","year":"2019","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Povey, D., Cheng, G., Wang, Y., Li, K., Xu, H., Yarmohammadi, M., and Khudanpur, S. (2018, January 2\u20136). Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks. Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1417"},{"key":"ref_33","unstructured":"van der Westhuizen, E., and Niesler, T.R. (2015). Technical Report SU-EE-1501 An Analysis of the NCHLT Speech Corpora, Stellenbosch University, Department of Electrical and Electronic Engineering. Technical Report."},{"key":"ref_34","unstructured":"Loots, L., Davel, M., Barnard, E., and Niesler, T. (December, January 30). Comparing manually-developed and data-driven rules for P2P learning. Proceedings of the 20th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Stellenbosch, South Africa."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"de Wet, F., de Waal, A., and van Huyssteen, G.B. (2011, January 27\u201331). Developing a broadband automatic speech recognition system for Afrikaans. Proceedings of the INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy.","DOI":"10.21437\/Interspeech.2011-797"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Peddinti, V., Chen, G., Povey, D., and Khudanpur, S. (2015, January 6\u201310). Reverberation robust acoustic modeling using i-vectors with time delay neural networks. Proceedings of the INTERSPEECH 2015 16th Annual Conference of the International Speech Communication Association, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-527"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. (2015, January 6\u201310). Audio augmentation for speech recognition. Proceedings of the INTERSPEECH 2015 16th Annual Conference of the International Speech Communication Association, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-711"},{"key":"ref_38","unstructured":"Povey, D. (2019, June 27). Kaldi Librispeech TDNN-F 1c Chain Model Example Recipe. Available online: https:\/\/github.com\/kaldi-asr\/kaldi\/blob\/master\/egs\/librispeech\/s5\/local\/chain\/tuning\/run_tdnn_1c.sh."},{"key":"ref_39","unstructured":"Povey, D. (2019, June 27). Kaldi Librispeech TDNN-F 1d Chain Model Example Recipe. Available online: https:\/\/github.com\/kaldi-asr\/kaldi\/blob\/master\/egs\/librispeech\/s5\/local\/chain\/tuning\/run_tdnn_1d.sh."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Cheng, G., Peddinti, V., Povey, D., Manohar, V., Khudanpur, S., and Yan, Y. (2017, January 20\u201324). An exploration of dropout with LSTMs. Proceedings of the Interspeech 2017, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-129"},{"key":"ref_41","unstructured":"Jurafsky, D., and Martin, J. (2000). Speech and Language Processing, Prentice Hall."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/10\/9\/268\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T13:14:38Z","timestamp":1760188478000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/10\/9\/268"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,8,28]]},"references-count":41,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2019,9]]}},"alternative-id":["info10090268"],"URL":"https:\/\/doi.org\/10.3390\/info10090268","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,8,28]]}}}