{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T14:06:51Z","timestamp":1773929211512,"version":"3.50.1"},"reference-count":48,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2019,7,25]],"date-time":"2019-07-25T00:00:00Z","timestamp":1564012800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100015846","name":"Welsh Government","doi-asserted-by":"publisher","award":["Welsh-language Technology and Digital Media Grant"],"award-info":[{"award-number":["Welsh-language Technology and Digital Media Grant"]}],"id":[{"id":"10.13039\/100015846","id-type":"DOI","asserted-by":"publisher"}]},{"name":"S4C","award":["GALLU Project"],"award-info":[{"award-number":["GALLU Project"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Collecting speech data for a low-resource language is challenging when funding and resources are limited. This paper describes the process of designing, creating and using the Paldaruo Speech Corpus for developing speech technology for Welsh. Specifically, this paper focuses on the crowdsourcing of data using an app on smartphones and mobile devices, allowing speakers from across Wales to contribute. We discuss the development of reading prompts: isolated words and full sentences, as well as the metadata collected from contributors. We also provide background on the design of the Paldaruo App as well as the main uses for the corpus and its availability and licensing. The corpus was designed for the development of speech recognition for Welsh and has been used to create a number of other resources. 
These methods can be extended to other languages, and suggestions for other low-resource languages are discussed.<\/jats:p>","DOI":"10.3390\/info10080247","type":"journal-article","created":{"date-parts":[[2019,7,26]],"date-time":"2019-07-26T08:45:39Z","timestamp":1564130739000},"page":"247","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2105-8338","authenticated-orcid":false,"given":"Sarah","family":"Cooper","sequence":"first","affiliation":[{"name":"School of Languages, Literatures and Linguistics, Bangor University, Bangor, Gwynedd LL57 2DG, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1263-6332","authenticated-orcid":false,"given":"Dewi Bryn","family":"Jones","sequence":"additional","affiliation":[{"name":"Language Technologies Unit, Bangor University, Bangor, Gwynedd LL57 2DG, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Delyth","family":"Prys","sequence":"additional","affiliation":[{"name":"Language Technologies Unit, Bangor University, Bangor, Gwynedd LL57 2DG, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2019,7,25]]},"reference":[{"key":"ref_1","unstructured":"(2013). Language in England and Wales: 2011."},{"key":"ref_2","unstructured":"Aitchison, J.W., and Carter, H. (1994). 
A Geography of the Welsh Language, 1961\u20131991, University of Wales Press."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1016\/j.specom.2013.07.008","article-title":"Automatic speech recognition for under-resourced languages: A survey","volume":"56","author":"Besacier","year":"2014","journal-title":"Speech Commun."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"961","DOI":"10.1007\/s10579-016-9336-9","article-title":"Modeling under-resourced languages for speech recognition","volume":"51","author":"Kurimo","year":"2017","journal-title":"Lang. Resour. Eval."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Crystal, D. (2014). Language Death, Cambridge University Press.","DOI":"10.1017\/CBO9781139923477"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Cormack, M., and Hourigan, N. (2007). The media and language maintenance. Minority Language Media: Concepts, Critiques and Case Studies, Multilingual Matters.","DOI":"10.21832\/9781853599651"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"891","DOI":"10.1007\/s10579-017-9405-8","article-title":"Introduction to the special issue","volume":"51","author":"Pretorius","year":"2017","journal-title":"Lang. Resour. Eval."},{"key":"ref_8","unstructured":"Ceberio Berger, K., Gurrutxaga Hernaiz, A., Baroni, P., Hicks, D., Kruse, E., Quochi, V., Russo, I., Salonen, T., Sarhimaa, A., and Soria, C. (2019, July 24). Available online: http:\/\/wp.dldp.eu\/wp-content\/uploads\/2018\/09\/Digital-Language-Survival-Kit.pdf."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1177\/0165551512437638","article-title":"Towards an integrated crowdsourcing definition","volume":"38","year":"2012","journal-title":"J. Inf. Sci."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Eskenazi, M., Levow, G., Meng, H., Parent, G., and Suendermann, D. (2013). The basics. 
Crowdsourcing for Speech Processing: Applications to Data Collection, Transcription and Assessment, Wiley.","DOI":"10.1002\/9781118541241"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Eskenazi, M., Levow, G., Meng, H., Parent, G., and Suendermann, D. (2013). Collecting Speech from Crowds. Crowdsourcing for Speech Processing: Applications to Data Collection, Transcription and Assessment, Wiley.","DOI":"10.1002\/9781118541241"},{"key":"ref_12","unstructured":"Ball, M.J., and Jones, G. (1984). The distinctive vowels and consonants of Welsh. Welsh Phonology: Selected Readings, University of Wales Press."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1017\/S0025100310000290","article-title":"A cross-dialectal acoustic study of the monophthongs and diphthongs of Welsh","volume":"41","author":"Mayr","year":"2011","journal-title":"J. Int. Phon. Assoc."},{"key":"ref_14","unstructured":"Ball, M.J., and Williams, B.J. (2001). Welsh Phonetics, Edwin Mellen Press."},{"key":"ref_15","unstructured":"Ball, M.J., and Jones, G. (1984). Phonotactic constraints in Welsh. Welsh Phonology: Selected Readings, University of Wales Press."},{"key":"ref_16","first-page":"149","article-title":"Phonological Variation in Mid-Wales","volume":"45","author":"Rees","year":"2015","journal-title":"Stud. Celt."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1177\/1367006915614921","article-title":"Disentangling the effects of long-term language contact and individual bilingualism: The case of monophthongs in Welsh and English","volume":"21","author":"Mayr","year":"2017","journal-title":"Int. J. Biling."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Durham, M., and Morris, J. (2017). 
Sociolinguistics in Wales, Palgrave Macmillan.","DOI":"10.1057\/978-1-137-52897-1"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"183","DOI":"10.1111\/josl.12231","article-title":"Sociophonetic variation in a long-term language contact situation: \/l\/-darkening in Welsh-English bilingual speech","volume":"21","author":"Morris","year":"2017","journal-title":"J. Socioling."},{"key":"ref_20","unstructured":"Prys, M. (2016). Style in the vernacular and on the radio: code-switching and mutations as stylistic and social markers in Welsh. [Ph.D. Thesis, Prifysgol Bangor University]."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"224","DOI":"10.1016\/j.lingua.2014.02.007","article-title":"Auxiliary deletion in the informal speech of Welsh\u2013English bilinguals: A change in progress","volume":"143","author":"Davies","year":"2014","journal-title":"Lingua"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Borsley, R.D., Tallerman, M., and Willis, D. (2007). The Syntax of Welsh, Cambridge University Press. Cambridge Syntax Guides.","DOI":"10.1017\/CBO9780511486227"},{"key":"ref_23","unstructured":"Welsh Government (2017). Cymraeg 2050: A Million Welsh Speakers."},{"key":"ref_24","unstructured":"Welsh Government (2018). Welsh Language Technology Action Plan."},{"key":"ref_25","unstructured":"Welsh Government (2013). Welsh-Language Technology and Digital Media Action Plan."},{"key":"ref_26","unstructured":"Prys, D., Williams, B., Hicks, B., Jones, D.B., N\u00ed Chasaide, A., Gobl, C., Carson-Berndsen, J., Cummins, F., N\u00ed Chios\u00e1in, M., and McKenna, J. (2004, January 24). WISPR: Speech Processing Resources for Welsh and Irish. Proceedings of the SALTMIL Workshop at LREC 2004: First Steps for Language Documentation of Minority Languages: Computational Linguistic Tools for Morphology, Lexicon and Corpus Compilation, Lisbon, Portugal."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Williams, B. 
(1994, January 18\u201322). Diphone synthesis for the Welsh language. Proceedings of the 1994 International Conference on Spoken Language Processing, Yokohama, Japan.","DOI":"10.21437\/ICSLP.1994-187"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Williams, B. (1995, January 18\u201321). Text-to-speech synthesis for Welsh and Welsh English. Proceedings of the Eurospeech 1995, Madrid, Spain.","DOI":"10.21437\/Eurospeech.1995-279"},{"key":"ref_29","unstructured":"Williams, B. (1999, January 5\u20139). A Welsh speech database: Preliminary results. Proceedings of the Eurospeech 1999, Budapest, Hungary."},{"key":"ref_30","unstructured":"Language Technologies Unit (2019). Paldaruo, Bangor University."},{"key":"ref_31","unstructured":"Jones, D.B., and Cooper, S. (2016, January 23\u201328). Building Intelligent Digital Assistants for speakers of a Lesser-Resourced Language. Proceedings of the LREC 2016 Workshop \u201cCCURL 2016\u2014Towards an Alliance for Digital Language Diversity\u201d, Portoro\u017e, Slovenia."},{"key":"ref_32","unstructured":"Prys, D., and Jones, D.B. (2018, January 12). Gathering Data for Speech Technology in the Welsh Language: A Case Study. Proceedings of the LREC 2018 Workshop \u201cCCURL 2018\u2014Sustaining Knowledge Diversity in the Digital Age\u201d, Miyazaki, Japan."},{"key":"ref_33","unstructured":"BBC (2014). Lansio adnodd Adnabod Lleferydd Cymraeg Newydd (Launching a New Welsh Speech Recognition Resource), BBC. BBC Cymru Fyw Website."},{"key":"ref_34","unstructured":"BBC (2014). Speakers for Welsh Voice Recognition App Sought, BBC. BBC News Website."},{"key":"ref_35","unstructured":"S4C (2014). Ap\u00eal am Leisiau i Helpu Adeiladu Adnodd Adnabod Lleferydd Cymraeg (Appeal for Voices to Help Create Welsh Speech Recognition Resource), S4C. S4C News."},{"key":"ref_36","unstructured":"Language Technologies Unit (2019). Welsh National Language Technologies Portal, Bangor University."},{"key":"ref_37","unstructured":"Williams, I. 
(2017). Challenges for Developing Speech Technology for Welsh, Plas Gregynog."},{"key":"ref_38","unstructured":"Williams, I. (2017). Modelau Cyfrifiadurol ar Gyfer y Gymraeg (Computational Models for Welsh), Bangor University."},{"key":"ref_39","unstructured":"Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. (2002). The HTK Book, Cambridge University Engineering Department. Version 3.2."},{"key":"ref_40","unstructured":"Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., and Schwarz, P. (2011, January 11\u201315). The Kaldi Speech Recognition Toolkit. Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Big Island, HI, USA."},{"key":"ref_41","unstructured":"(2019). DeepSpeech: A TensorFlow Implementation of Baidu\u2019s DeepSpeech Architecture, Mozilla."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Cooper, S., Jones, D.B., and Prys, D. (2014, January 23\u201329). Developing further speech recognition resources for Welsh. Proceedings of the First Celtic Language Technology Workshop at the 25th International Conference on Computational Linguistics, Dublin, Ireland.","DOI":"10.3115\/v1\/W14-4608"},{"key":"ref_43","unstructured":"Lee, A., and Kawahara, T. (2019, July 24). Available online: https:\/\/zenodo.org\/record\/2530396#.XTgW4Y8RXIU."},{"key":"ref_44","first-page":"192","article-title":"Prosodylab-Aligner: A Tool for Forced Alignment of Laboratory Speech","volume":"39","author":"Gorman","year":"2011","journal-title":"Can. Acoust."},{"key":"ref_45","unstructured":"Boersma, P., and Weenink, D. (2019, July 24). Available online: http:\/\/www.fon.hum.uva.nl\/praat\/."},{"key":"ref_46","unstructured":"Cooper, S. (2015). A Resource for Exploring Socio-Phonetic Variation in Welsh: The Paldaruo Corpus, University of Glasgow."},{"key":"ref_47","unstructured":"Iosad, P. (2017). 
Bridging the Gap: Length and Tenseness in Brythonic Vowels, Institi\u00faid Ard-L\u00e9inn Bhaile \u00c1tha Cliath."},{"key":"ref_48","unstructured":"Language Technologies Unit (2019). Paldaruo Source Code, Bangor University. Available online: https:\/\/github.com\/techiaith\/Paldaruo."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/10\/8\/247\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T13:09:41Z","timestamp":1760188181000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/10\/8\/247"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,7,25]]},"references-count":48,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2019,8]]}},"alternative-id":["info10080247"],"URL":"https:\/\/doi.org\/10.3390\/info10080247","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,7,25]]}}}