{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,1]],"date-time":"2025-10-01T15:23:27Z","timestamp":1759332207479,"version":"3.37.3"},"reference-count":51,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2020,7,29]],"date-time":"2020-07-29T00:00:00Z","timestamp":1595980800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,7,29]],"date-time":"2020-07-29T00:00:00Z","timestamp":1595980800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100000269","name":"Economic and Social Research Council","doi-asserted-by":"publisher","award":["ES\/M011348\/1"],"award-info":[{"award-number":["ES\/M011348\/1"]}],"id":[{"id":"10.13039\/501100000269","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Lang Resources &amp; Evaluation"],"published-print":{"date-parts":[[2021,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes\u2014National Corpus of Contemporary Welsh) is the first comprehensive corpus of Welsh designed to be reflective of language use across communication types, genres, speakers, language varieties (regional and social) and contexts. This article focuses on the computational infrastructure that we have designed to support data collection for CorCenCC, and the subsequent uses of the corpus which include lexicography, pedagogical research and corpus analysis. A grass-roots approach to design has been adopted, that has adapted and extended previous corpus-building and introduced new features as required for this specific context and language. The key pillars of the infrastructure include a framework that supports metadata collection, an innovative mobile application designed to collect spoken data (utilising a crowdsourcing approach), a backend database that stores curated data and a web-based interface that allows users to query the data online. A usability study was conducted to evaluate the user facing tools and to suggest directions for future improvements. Though the infrastructure was developed for Welsh language collection, its design can be re-used to support corpus development in other minority or major language contexts, broadening the potential utility and impact of this work.<\/jats:p>","DOI":"10.1007\/s10579-020-09501-9","type":"journal-article","created":{"date-parts":[[2020,7,29]],"date-time":"2020-07-29T14:02:31Z","timestamp":1596031351000},"page":"789-816","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh"],"prefix":"10.1007","volume":"55","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4745-6502","authenticated-orcid":false,"given":"Dawn","family":"Knight","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fernando","family":"Loizides","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Steven","family":"Neale","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Laurence","family":"Anthony","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Irena","family":"Spasi\u0107","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2020,7,29]]},"reference":[{"key":"9501_CR1","doi-asserted-by":"crossref","unstructured":"Adolphs, S., Knight, D., Smith, C., & Price, D. (2020). Crowdsourcing formulaic phrases: towards a new type of spoken corpus. Corpora, 15(1), in press.","DOI":"10.3366\/cor.2020.0192"},{"key":"9501_CR2","unstructured":"Anthony, L. (2014). AntConc (Version 3.4.3). Waseda University. https:\/\/www.laurenceanthony.net\/software\/antconc\/. Accessed 27 July 2020."},{"key":"9501_CR3","unstructured":"Areta, N., Gurrutxaga, A., Leturia, I., Alegria, I., Artola, X., Diaz de Ilarraza, A., et al. (2007). ZT corpus: Annotation and tools for Basque corpora. Paper presented at the Corpus Linguistics Conference, Birmingham."},{"key":"9501_CR4","volume-title":"The BNC handbook: exploring the British National Corpus with SARA","author":"G Aston","year":"1998","unstructured":"Aston, G., & Burnard, L. (1998). The BNC handbook: exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press."},{"issue":"1","key":"9501_CR5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/llc\/7.1.1","volume":"7","author":"S Atkins","year":"1992","unstructured":"Atkins, S., Clear, J., & Ostler, N. (1992). Corpus design criteria. Literary and Linguistic Computing, 7(1), 1\u201316.","journal-title":"Literary and Linguistic Computing"},{"issue":"4","key":"9501_CR6","doi-asserted-by":"publisher","first-page":"243","DOI":"10.1093\/llc\/8.4.243","volume":"8","author":"D Biber","year":"1993","unstructured":"Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243\u2013257.","journal-title":"Literary and Linguistic Computing"},{"key":"9501_CR7","doi-asserted-by":"crossref","unstructured":"Boleda, G., Bott, S., Villanueva Meza, R. M., Castillo, C., Badia, T., & L\u00f3pez, V. (2006). CUCWeb: A Catalan corpus built from the web. Paper presented at the the 2nd International Workshop on Web as Corpus, Trento.","DOI":"10.3115\/1628297.1628301"},{"issue":"1","key":"9501_CR8","doi-asserted-by":"publisher","first-page":"75","DOI":"10.1177\/1354856507084420","volume":"14","author":"DC Brabham","year":"2008","unstructured":"Brabham, D. C. (2008). Crowdsourcing as a model for problem solving: an introduction and cases. Convergence, 14(1), 75\u201390. https:\/\/doi.org\/10.1177\/1354856507084420.","journal-title":"Convergence"},{"key":"9501_CR9","doi-asserted-by":"publisher","unstructured":"Braun, V., & Clarke, V. (2012). Thematic analysis. In H. Cooper, P. Camic, D. Long, A. Panter, D. Rindskopf, & K. Sher (Eds.), APA handbook of research methods in psychology, Vol. 2. Research designs: Quantitative, qualitative, neuropsychological, and biological (pp. 57\u201371). American Psychological Association. https:\/\/doi.org\/10.1037\/13620-004.","DOI":"10.1037\/13620-004"},{"key":"9501_CR10","volume-title":"Exploring Spoken English","author":"R Carter","year":"1997","unstructured":"Carter, R., & McCarthy, M. (1997). Exploring Spoken English. Cambridge: Cambridge University Press."},{"key":"9501_CR11","doi-asserted-by":"crossref","unstructured":"Carter, R., & McCarthy, M. (2004). Talking, creating: interactional language, creativity, and context. Applied Linguistics, 25(1), 62\u201388. https:\/\/academic.oup.com\/applij\/article\/25\/1\/62\/149094.","DOI":"10.1093\/applin\/25.1.62"},{"key":"9501_CR13","doi-asserted-by":"crossref","unstructured":"Davies, M. (2010). The corpus of contemporary American English as the first reliable monitor corpus of English. Literary and Linguistic Computing, 25(4), 447\u2013464. https:\/\/academic.oup.com\/dsh\/article\/25\/4\/447\/997323.","DOI":"10.1093\/llc\/fqq018"},{"issue":"1","key":"9501_CR14","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1075\/eww.36.1.01dav","volume":"36","author":"M Davies","year":"2015","unstructured":"Davies, M., & Fuchs, R. (2015). Expanding horizons in the study of World Englishes with the 1.9 billion word Global Web-based English Corpus (GloWbE). English World-Wide, 36(1), 1\u201328. https:\/\/doi.org\/10.1075\/eww.36.1.01dav.","journal-title":"English World-Wide"},{"key":"9501_CR15","doi-asserted-by":"publisher","first-page":"93","DOI":"10.21832\/9781783091713-008","volume-title":"Advances in the Study of Bilingualism","author":"M Deuchar","year":"2014","unstructured":"Deuchar, M., Davies, P., Herring, J., Parafita Couto, M., & Carter, D. (2014). Building bilingual corpora. In E. M. Thomas & I. Mennen (Eds.), Advances in the Study of Bilingualism (pp. 93\u2013111). Bristol: Multilingual Matters."},{"issue":"4","key":"9501_CR16","doi-asserted-by":"publisher","first-page":"1082","DOI":"10.1045\/april2002-weibel","volume":"8","author":"E Duval","year":"2002","unstructured":"Duval, E., Hodgins, W., Sutton, S., & Weibel, S. L. (2002). Metadata principles and practicalities. D-lib Magazine, 8(4), 1082\u20139873.","journal-title":"D-lib Magazine"},{"key":"9501_CR17","unstructured":"Ellis, N. C., O\u2019Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh. https:\/\/www.bangor.ac.uk\/canolfanbedwyr\/ceg.php.en. Accessed 27 July 2020."},{"key":"9501_CR18","unstructured":"ESRC Centre for Research on Bilingualism (2020). Bangor Siarad. http:\/\/bangortalk.org.uk\/. Accessed 27 July 2020."},{"issue":"2","key":"9501_CR19","doi-asserted-by":"publisher","first-page":"189","DOI":"10.1177\/0165551512437638","volume":"38","author":"E Estell\u00e9s-Arolas","year":"2012","unstructured":"Estell\u00e9s-Arolas, E., & Gonz\u00e1lez-Ladr\u00f3n-De-Guevara, F. (2012). Towards an integrated crowdsourcing definition. Journal of Information Science, 38(2), 189\u2013200. https:\/\/doi.org\/10.1177\/0165551512437638.","journal-title":"Journal of Information Science"},{"key":"9501_CR20","unstructured":"Expert Advisory Group on Language Engineering Standards (1996). EAGLES guidelines. http:\/\/www.ilc.cnr.it\/EAGLES\/browse.html. Accessed 27 July 2020."},{"key":"9501_CR21","doi-asserted-by":"crossref","unstructured":"Hardie, A. (2012). CQPweb\u2014combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380\u2013409. https:\/\/www.ingentaconnect.com\/content\/jbp\/ijcl\/2012\/00000017\/00000003\/art00004.","DOI":"10.1075\/ijcl.17.3.04har"},{"key":"9501_CR22","doi-asserted-by":"crossref","unstructured":"J\u00e4\u00e4skel\u00e4inen, R. (2010). Think-aloud protocol. In Y. Gambier, & L. van Doorslaer (Eds.), Benjamins Handbook of Translation Studies, Volume 1 (pp. 371-373). Amsterdam\/Philadelphia John Benjamins.","DOI":"10.1075\/hts.1.thi1"},{"key":"9501_CR23","first-page":"125","volume-title":"\u2018The TenTen Corpus Family\u2019 the 7th International Corpus Linguistics Conference","author":"M Jakub\u00ed\u010dek","year":"2013","unstructured":"Jakub\u00ed\u010dek, M., Kilgarriff, A., Kov\u00e1\u0159, V., Rychl\u00fd, P., & Suchomel, V. (2013). \u2018The TenTen Corpus Family\u2019 the 7th International Corpus Linguistics Conference (pp. 125\u2013127). UK: Lancaster."},{"key":"9501_CR24","doi-asserted-by":"crossref","unstructured":"Karlsson, F. (1990). Constraint grammar as a framework for parsing running text. Paper presented at the the 13th International Conference on Computational Linguistics (COLING), Helsinki.","DOI":"10.3115\/991146.991176"},{"key":"9501_CR25","doi-asserted-by":"publisher","DOI":"10.1515\/9783110882629","volume-title":"Constraint grammar: a language-independent framework for parsing unrestricted text","author":"F Karlsson","year":"1995","unstructured":"Karlsson, F., Voutilainen, A., Heikkil\u00e4, J., & Anttila, A. (1995). Constraint grammar: a language-independent framework for parsing unrestricted text. Berlin\/New York: Mouton de Gruyter."},{"key":"9501_CR26","doi-asserted-by":"publisher","first-page":"7","DOI":"10.1007\/s40607-014-0009-9","volume":"1","author":"A Kilgarriff","year":"2014","unstructured":"Kilgarriff, A., Baisa, V., Bu\u0161ta, J., Jakub\u00ed\u010dek, M., Kov\u00e1\u0159, V., Michelfeit, J., et al. (2014). The Sketch Engine: ten years on. Lexicography, 1, 7\u201336. https:\/\/doi.org\/10.1007\/s40607-014-0009-9.","journal-title":"Lexicography"},{"issue":"2","key":"9501_CR27","first-page":"391","volume":"11","author":"D Knight","year":"2011","unstructured":"Knight, D. (2011). The future of corpus linguistics. Brazilian Journal of Applied Linguistics, 11(2), 391\u2013416.","journal-title":"Brazilian Journal of Applied Linguistics"},{"key":"9501_CR28","doi-asserted-by":"crossref","unstructured":"Knight, D., Adolphs, S., & Carter, R. (2013). Formality in digital discourse: a study of hedging in CANELC. In J. Romero-Trillo (Ed.), Yearbook of corpus linguistics and pragmatics (pp. 131\u2013152). Netherlands: Springer. http:\/\/orca.cf.ac.uk\/78844\/.","DOI":"10.1007\/978-94-007-6250-3_7"},{"key":"9501_CR29","unstructured":"Knight, D., Fitzpatrick, T., & Morris, S. (2017). CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes\u2014The National Corpus of Contemporary Welsh): An overview. (Paper presented at the the Annual British Association for Applied Linguistics (BAAL) Conference, Leeds, UK)."},{"key":"9501_CR12","unstructured":"Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasi\u0107, I., Thomas, E.M., Lovell, A., Morris, J., Evas, J., Stonelake, M., Arman, L., Davies, J., Ezeani, I., Neale, S., Needs, J., Piao, S., Rees, M., Watkins, G., Williams, L., Muralidaran, V., Tovey, B., Anthony, L., Cobb, T., Deuchar, M., Donnelly, K., McCarthy, M., & Scannell, K. (2020). CorCenCC: (Corpws Cenedlaethol Cymraeg Cyfoes \u2013 The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction. https:\/\/urldefense.proofpoint.com\/v2\/url?u=http-3A__www.corcencc.org_explore&d=DwIGaQ&c=vh6FgFnduejNhPPD0fl_yRaSfZy8CWbWnIf4XJhSqx8&r=r2aSgYn6PHMQXXmeBiKsnvfFG9T9U5fmdQ67xEVmgo0&m=vENIMNwG0whbhOk5BSn93DDjUNbEHZkw7lOkhXD_WU&s=XozS5TJ9oHkKopOV4ytseYQ1EtUmmW8_QjYuqwg1hcg&e=. Accessed 27 July 2020."},{"issue":"2","key":"9501_CR30","doi-asserted-by":"publisher","first-page":"245","DOI":"10.1093\/llc\/17.2.245","volume":"17","author":"K Kucera","year":"2002","unstructured":"Kucera, K. (2002). The Czech National Corpus: principles, design, and results. Literary and Linguistic Computing, 17(2), 245\u2013257.","journal-title":"Literary and Linguistic Computing"},{"key":"9501_CR31","unstructured":"Kupietz, M., L\u00fcngen, H., Kamocki, P., & Witt, A. (2018) \u2018The German reference corpus DeReKo: New developments\u2014new opportunities\u2019 the Eleventh International Conference on Language Resources and Evaluation (pp. 4354\u20134360). Miyazaki, Japan. https:\/\/ids-pub.bsz-bw.de\/frontdoor\/index\/index\/docId\/7491."},{"key":"9501_CR32","unstructured":"Leech, G. (2014). The state of the art in corpus linguistics. In K. Aijmer, & B. Altenberg (Eds.), English Corpus Linguistics (pp. 20\u201341). Routledge."},{"key":"9501_CR33","doi-asserted-by":"publisher","DOI":"10.4324\/9780429429811","volume-title":"Overcoming challenges in corpus construction: The spoken British National Corpus 2014","author":"R Love","year":"2020","unstructured":"Love, R. (2020). Overcoming challenges in corpus construction: The spoken British National Corpus 2014. Abingdon: Routledge."},{"key":"9501_CR34","volume-title":"The CHILDES Project: Tools for analyzing talk","author":"B MacWhinney","year":"2000","unstructured":"MacWhinney, B. (2000). The CHILDES Project: Tools for analyzing talk (3rd ed.). Mahwah: Lawrence Erlbaum Associates.","edition":"3"},{"key":"9501_CR35","doi-asserted-by":"crossref","unstructured":"McEnery, T., Love, R., & Brezina, V. (2017). Compiling and analysing the Spoken British National Corpus 2014. International Journal of Corpus Linguistics, 22(3), 311\u2013318. https:\/\/benjamins.com\/catalog\/ijcl.22.3.01mce.","DOI":"10.1075\/ijcl.22.3.01mce"},{"key":"9501_CR36","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511606311","volume-title":"English corpus linguistics: An introduction","author":"CF Meyer","year":"2002","unstructured":"Meyer, C. F. (2002). English corpus linguistics: An introduction. Cambridge: Cambridge University Press."},{"key":"9501_CR37","unstructured":"Mozilla (2020). Common Voice. https:\/\/voice.mozilla.org\/. Accessed 27 July 2020."},{"key":"9501_CR38","unstructured":"Neale, S., Donnelly, K., Watkins, G., & Knight, D. (2018). Leveraging lexical resources and constraint grammar for rule-based part-of-speech tagging in Welsh. Paper presented at the the Eleventh International Conference on Language Resources and Evaluation (LREC), Miyazaki."},{"key":"9501_CR39","unstructured":"Neale, S., Spasi\u0107, I., Needs, J., Watkins, G., Morris, S., Fitzpatrick, T., et al. (2017). The CorCenCC crowdsourcing app: A bespoke tool for the user-driven creation of the national corpus of contemporary Welsh. Paper presented at the Corpus Linguistics Conference, Birmingham."},{"key":"9501_CR40","unstructured":"Office for National Statistics (2011). UK census. https:\/\/www.ons.gov.uk\/census\/2011census. Accessed."},{"key":"9501_CR41","unstructured":"Piao, S., Rayson, P., Knight, D., & Watkins, G. (2018). Towards a Welsh semantic annotation system. Paper presented at the the Eleventh International Conference on Language Resources and Evaluation (LREC), Miyazaki."},{"key":"9501_CR42","unstructured":"Prys, D., & Jones, D. B. (2018). Gathering data for speech technology in the welsh language: A case study. (aper presented at the the LREC Workshop on Collaboration and Computing for Under-Resourced Languages Sustaining knowledge diversity in the digital age, Miyazaki."},{"key":"9501_CR43","unstructured":"Rayson, P., Archer, D., Piao, S., & McEneryb, T. (2004). The UCREL semantic analysis system. Paper presented at the the Workshop on Beyond Named Entity Recognition Semantic Labelling for NLP tasks at the 4th International Conference on Language Resources and Evaluation (LREC), Lisbon."},{"key":"9501_CR44","unstructured":"Rees, M., Watkins, G., Needs, J., Morris, S., & Knight, D. (2017). Creating a bespoke corpus sampling frame for a minoritised language: CorCenCC, the National Corpus of Contemporary Welsh. Paper presented at the the Corpus Linguistics Conference, Birmingham."},{"key":"9501_CR45","unstructured":"Scannell, K. P. (2007). The Cr\u00fabad\u00e1n Project: Corpus building for under-resourced languages. In C. Fairon, H. Naets, A. Kilgariff, & G.-M. De Schryver (Eds.), Building and Exploring Web Corpora, Proceedings of the 3rd Web as Corpus Workshop (p. 182). Presses universitaires de Louvain. http:\/\/crubadan.org\/languages\/cy."},{"key":"9501_CR46","unstructured":"Schmidt, T. (2014). The database for spoken German\u2014DGD2. Paper presented at the the 9th International Conference Language Resources and Evaluation, Reykjavik, Iceland."},{"key":"9501_CR47","doi-asserted-by":"crossref","unstructured":"Simpson-Vlach, R. C., & Leicher, S. (2006). The MICASE handbook: A resource for users of the Michigan corpus of academic spoken English. University of Michigan Press.","DOI":"10.3998\/mpub.101203"},{"key":"9501_CR48","first-page":"1","volume-title":"Developing linguistic corpora: A guide to good practice","author":"J Sinclair","year":"2005","unstructured":"Sinclair, J. (2005). Corpus and text\u2014basic principles. In M. Wynne (Ed.), Developing linguistic corpora: A guide to good practice (pp. 1\u201316). Oxford: Oxbow Books."},{"key":"9501_CR49","unstructured":"Tadi\u0107, M. (2002). Building the Croatioan National Corpus. Paper presented at the The third International Conference on Language Resources and Evaluation (LREC), Las palmas."},{"key":"9501_CR50","unstructured":"Weinberger, S. H. (2020). The speech accent archive. http:\/\/accent.gmu.edu\/. Accessed 27 July 2020."},{"key":"9501_CR51","unstructured":"Williams, B. (1999). A Welsh speech database: Preliminary results. Paper presented at the the 6th European Conference on Speech Communication and Technology (EUROSPEECH), Budapest."}],"container-title":["Language Resources and Evaluation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10579-020-09501-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10579-020-09501-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10579-020-09501-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,7,31]],"date-time":"2021-07-31T17:12:40Z","timestamp":1627751560000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10579-020-09501-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,29]]},"references-count":51,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,9]]}},"alternative-id":["9501"],"URL":"https:\/\/doi.org\/10.1007\/s10579-020-09501-9","relation":{},"ISSN":["1574-020X","1574-0218"],"issn-type":[{"type":"print","value":"1574-020X"},{"type":"electronic","value":"1574-0218"}],"subject":[],"published":{"date-parts":[[2020,7,29]]},"assertion":[{"value":"29 July 2020","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Compliance with ethical standards"}},{"value":"All authors declares that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflicts of interest"}},{"value":"Ethical approval was gained from Andrew Edgar, Ethics Officer in the School of English, Communication and Philosophy, Cardiff University.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval"}}]}}