{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,19]],"date-time":"2026-01-19T15:57:51Z","timestamp":1768838271299,"version":"3.49.0"},"reference-count":62,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2021,4,30]],"date-time":"2021-04-30T00:00:00Z","timestamp":1619740800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Up until today research in various educational and linguistic domains such as learner corpus research, writing research, or second language acquisition has produced a substantial amount of research data in the form of L1 and L2 learner corpora. However, the multitude of individual solutions combined with domain-inherent obstacles in data sharing have so far hampered comparability, reusability and reproducibility of data and research results. In this article, we present work in creating a digital infrastructure for L1 and L2 learner corpora and populating it with data collected in the past. We embed our infrastructure efforts in the broader field of infrastructures for scientific research, drawing from technical solutions and frameworks from research data management, among which the FAIR guiding principles for data stewardship. We share our experiences from integrating some L1 and L2 learner corpora from concluded projects into the infrastructure while trying to ensure compliance with the FAIR principles and the standards we established for reproducibility, discussing how far research data that has been collected in the past can be made comparable, reusable and reproducible. 
Our results show that some basic needs for providing comparable and reusable data are covered by existing general infrastructure solutions and can be exploited for domain-specific infrastructures such as the one presented in this article. Other aspects need genuinely domain-driven approaches. The solutions found for the corpora in the presented infrastructure can only be a preliminary attempt, and further community involvement would be needed to provide templates and models acknowledged and promoted by the community. Furthermore, forward-looking data management would be needed starting from the beginning of new corpus creation projects to ensure that all requirements for FAIR data can be met.<\/jats:p>","DOI":"10.3390\/info12050199","type":"journal-article","created":{"date-parts":[[2021,4,30]],"date-time":"2021-04-30T10:53:29Z","timestamp":1619780009000},"page":"199","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Exploring Reusability and Reproducibility for a Research Infrastructure for L1 and L2 Learner Corpora"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8540-2396","authenticated-orcid":false,"given":"Alexander","family":"K\u00f6nig","sequence":"first","affiliation":[{"name":"CLARIN ERIC, 3512 BS Utrecht, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7008-6394","authenticated-orcid":false,"given":"Jennifer-Carmen","family":"Frey","sequence":"additional","affiliation":[{"name":"Institute for Applied Linguistics, Eurac Research, 39100 Bolzano, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7655-5526","authenticated-orcid":false,"given":"Egon W.","family":"Stemle","sequence":"additional","affiliation":[{"name":"Institute for Applied Linguistics, Eurac Research, 39100 Bolzano, 
Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,4,30]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Thorne, S., and May, S. (2017). Learner corpora in foreign language education. Language and Technology. Encyclopedia of Language and Education, Springer.","DOI":"10.1007\/978-3-319-02237-6"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Granger, S., Hung, J., and Petch-Tyson, S. (2002). Computer Learner Corpora, Second Language Acquisition, and Foreign Language Teaching, John Benjamins Publishing.","DOI":"10.1075\/lllt.6"},{"key":"ref_3","unstructured":"Declerck, T., Choukri, K., and Calzolari, N. (2012). EXMARaLDA and the FOLK tools\u2014Two toolsets for transcribing and annotating spoken language. Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC\u201912), European Language Resources Association (ELRA)."},{"key":"ref_4","first-page":"61","article-title":"Corpora and Language Learning with the Sketch Engine and SKELL","volume":"XX","author":"Kilgarriff","year":"2015","journal-title":"Rev. Fr. Linguist. Appl."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"118","DOI":"10.1093\/llc\/fqu057","article-title":"ANNIS3: A new architecture for generic corpus query and visualization","volume":"Volume 31","author":"Krause","year":"2016","journal-title":"Digital Scholarship in the Humanities"},{"key":"ref_6","unstructured":"Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., and Odijk, J. (2016). TEITOK: Text-faithful annotated corpora. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA)."},{"key":"ref_7","unstructured":"Calzolari, N., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., and Mazo, H. 
(2018). Transc&Anno: A graphical tool for the transcription and on-the-fly annotation of handwritten documents. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA)."},{"key":"ref_8","unstructured":"Volodina, E. (2021, April 22). Korp Searches in Second Language Data\u2014Spr\u00e5kbanksbloggen. Available online: https:\/\/spraakbanken.gu.se\/blogg\/index.php\/2020\/06\/17\/korp-searches-in-second-language-data\/."},{"key":"ref_9","unstructured":"Centre for English Corpus Linguistics (2021, April 22). Learner Corpora around the World. Available online: https:\/\/uclouvain.be\/en\/research-institutes\/ilc\/cecl\/learner-corpora-around-the-world.html."},{"key":"ref_10","unstructured":"Calzolari, N., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., and Mazo, H. (2018). CLARIN\u2019s key resource families. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA)."},{"key":"ref_11","unstructured":"Abel, A., Glaznieks, A., Nicolas, L., and Stemle, E.W. (2014, January 26\u201331). KoKo: An L1 learner corpus for german. 
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u201914), Reykjavik, Iceland."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"125","DOI":"10.1075\/scl.12.11nes","article-title":"Learner corpora and their potential for language teaching","volume":"Volume 12","author":"Nesselhauf","year":"2004","journal-title":"How to Use Corpora in Language Teaching"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1017\/CBO9781139649414.002","article-title":"From design to collection of learner corpora","volume":"Volume 3","author":"Gilquin","year":"2015","journal-title":"The Cambridge Handbook of Learner Corpus Research"},{"key":"ref_14","first-page":"154","article-title":"Corpus compilation collection strategies and design decisions","volume":"Volume 2","author":"Hunston","year":"2009","journal-title":"Corpus Linguistics: An International Handbook"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1017\/CBO9781139013734.015","article-title":"Creating and using corpora","volume":"2","author":"Gries","year":"2014","journal-title":"Res. Methods Linguist."},{"key":"ref_16","unstructured":"Lenardi\u010d, J., Tiedemann, T.L., and Fi\u0161er, D. (2018). Overview of L2 Corpora and Resources 2.0, CLARIN. Technical Report."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"276","DOI":"10.1075\/jsls.00005.gri","article-title":"On over- and underuse in learner corpus research and multifactoriality in corpus linguistics more generally","volume":"1","author":"Gries","year":"2018","journal-title":"J. Second Lang. Stud."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"61","DOI":"10.1075\/ijlcr.3.1.03paq","article-title":"Quantitative research methods and study quality in learner corpus research","volume":"3","author":"Paquot","year":"2017","journal-title":"Int. J. Learn. 
Corpus Res."},{"key":"ref_19","unstructured":"Volodina, E., Tenfjord, K., Mikelic Preradovic, N., Janssen, M., Lindstr\u00f6m Tiedemann, T., and Ragnhildstveit, S. (2021, April 22). Workshop on Interoperability of L2 Resources and Tools | Sweclarin.se. Available online: https:\/\/sweclarin.se\/swe\/workshop-interoperability-l2-resources-and-tools,2017."},{"key":"ref_20","unstructured":"Stemle, E.W., Boyd, A., Janssen, M., Lindstr\u00f6m Tiedemann, T., Mikeli\u0107 Preradovi\u0107, N., Rosen, A., Ros\u00e9n, D., and Volodina, E. (2017, January 5\u20137). Working together towards an ideal infrastructure for language learner corpora. Proceedings of the Widening the Scope of Learner Corpus Research Selected Papers from the Fourth Learner Corpus Research Conference 2017, Bolzano\/Bozen, Italy."},{"key":"ref_21","unstructured":"Volodina, E., Megyesi, B., Wir\u00e9n, M., Granstedt, L., Prentice, J., Reichenberg, M., and Sundberg, G. (2016, January 17\u201318). A friend in need?: Research agenda for electronic Second Language infrastructure. Proceedings of the Sixth Swedish Language Technology Conference (SLTC), Ume\u00e5, Sweden."},{"key":"ref_22","first-page":"5","article-title":"Establishing a Standardised Procedure for Building Learner Corpora","volume":"8","author":"Glaznieks","year":"2014","journal-title":"Apples J. Appl. Lang. Stud."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1087\/20140503","article-title":"The Research Data Alliance: Globally co-ordinated action against barriers to data publishing and sharing","volume":"27","author":"Treloar","year":"2014","journal-title":"Learn. Publ."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Moskovko, M. (2020). Intensified role of the European Union? European Research Infrastructure Consortium as a legal framework for contemporary multinational research collaboration. 
Big Science and Research Infrastructures in Europe, Edward Elgar Publishing.","DOI":"10.4337\/9781839100017.00012"},{"key":"ref_25","unstructured":"Ayris, P., Berthou, J.Y., Bruce, R., Lindstaedt, S., Monreale, A., Mons, B., Murayama, Y., S\u00f6derg\u00e5rd, C., Tochtermann, K., and Wilkinson, R. (2021, April 22). Realising the European Open Science Cloud. First Report and Recommendations of the Commission High Level Expert Group on the European Open Science Cloud. Available online: file:\/\/\/C:\/Users\/MDPI\/AppData\/Local\/Temp\/RealisingtheOpenScienceCloud-2.pdf."},{"key":"ref_26","first-page":"383","article-title":"Social sciences, humanities and their interoperability with the European Open Science Cloud: What is SSHOC?","volume":"72","author":"Ausserhofer","year":"2019","journal-title":"Mitt. Ver. \u00d6sterreichischer Bibl. Bibl."},{"key":"ref_27","unstructured":"European Language Resources Association (ELRA) (2020). Social Sciences and Humanities Pathway Towards the European Open Science Cloud, European Language Resources Association (ELRA)."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR Guiding Principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Sci. Data"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"04001","DOI":"10.1051\/itmconf\/20203304001","article-title":"Agile development of the SSH open marketplace: User workshop","volume":"Volume 33","author":"Barbot","year":"2020","journal-title":"ITM Web of Conferences"},{"key":"ref_30","unstructured":"de Jong, F.M.G., Maegaard, B., De Smedt, K., Fi\u0161er, D., and Van Uytvanck, D. (2018, January 7\u201312). CLARIN: Towards FAIR and responsible data science using language resources. 
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan."},{"key":"ref_31","unstructured":"Abel, A., Vettori, C., and Wisniewski, K. (2012). KOLIPSI: Gli Studenti Altoatesini e la Seconda Lingua; Indagine Linguistica e Psicosociale= KOLIPSI: Die S\u00fcdtiroler Sch\u00fclerInnen und die Zweitsprache; Eine Linguistische und Sozialpsychologische Untersuchung, Eurac Research. Available online: https:\/\/www.researchgate.net\/publication\/259453091_Gli_studenti_altoatesini_e_la_seconda_lingua_indagine_linguistica_e_psicosociale_Die_Sudtiroler_SchulerInnen_und_die_Zweitsprache_eine_linguistische_und_sozialpsychologische_Untersuchung_Volume_1_-_Ba."},{"key":"ref_32","unstructured":"Vettori, C., and Abel, A. (2017). KOLIPSI II Gli studenti altoatesini e la seconda lingua: Indagine linguistica e psicosociale. Die S\u00fcdtiroler Sch\u00fclerInnen und die Zweitsprache: Eine Linguistische und Sozialpsychologische Untersuchung, Eurac Research. Available online: https:\/\/bia.unibz.it\/discovery\/delivery?vid=39UBZ_INST:ResearchRepository&repId=12235320180001241#13235268510001241."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Abel, A., Glaznieks, A., Nicolas, L., and Stemle, E. (2016, January 5\u20136). An extended version of the KoKo German L1 Learner corpus. Proceedings of the Third Italian Conference on Computational Linguistics, Napoli, Italy.","DOI":"10.4000\/books.aaccademia.1743"},{"key":"ref_34","unstructured":"Boyd, A., Hana, J., Nicolas, L., Meurers, D., Wisniewski, K., Abel, A., Sch\u00f6ne, K., \u0160tindlov\u00e1, B., and Vettori, C. (2014, January 26\u201331). The MERLIN corpus: Learner language and the CEFR. Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 14), Reykjavik, Iceland."},{"key":"ref_35","unstructured":"Zanasi, L., and Stopfner, M. (2018). Rilevare, osservare, consultare. 
Metodi e strumenti per l\u2019analisi del plurilinguismo nella scuola secondaria di primo grado. La Didattica Delle Lingue nel Nuovo Millennio, 135\u2013148. Available online: https:\/\/edizionicafoscari.unive.it\/media\/pdf\/books\/978-88-6969-228-4\/978-88-6969-228-4-ch-01_ALK6Jr7.pdf."},{"key":"ref_36","unstructured":"Granger, S., Dagneaux, E., Meunier, F., and Paquot, M. (2009). International Corpus of Learner English, Presses Universitaires de Louvain."},{"key":"ref_37","unstructured":"Tenfjord, K., Meurer, P., and Hofland, K. (2006, January 22\u201328). The ASK corpus-a language learner corpus of norwegian as a second language. Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy."},{"key":"ref_38","unstructured":"Rosen, A., Hana, J., Vidov\u00e1 Hladk\u00e1, B., Jel\u00ednek, T., \u0160kodov\u00e1, S., and \u0160tindlov\u00e1, B. (2020). Compiling and Annotating a Learner Corpus for a Morphologically Rich Language: {CzeSL}, a Corpus of Non-Native {Czech}, Nakladatelstv\u00ed Karolinum. Available online: http:\/\/hdl.handle.net\/20.500.11956\/123103."},{"key":"ref_39","first-page":"15","article-title":"TOEFL11: A corpus of non-native English","volume":"2","author":"Blanchard","year":"2013","journal-title":"ETS Res. Rep. Ser."},{"key":"ref_40","first-page":"49","article-title":"Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud","volume":"37","author":"Mons","year":"2017","journal-title":"Inf. Serv. Use"},{"key":"ref_41","unstructured":"Lindstr\u00f6m, T., Lenardi\u010d, J., and Fi\u0161er, D. (2018, January 8\u201310). L2 learner corpus survey\u2013Towards improved verifiability, reproducibility and inspiration in learner corpus research. Proceedings of the CLARIN Annual Conference 2018, Pisa, Italy."},{"key":"ref_42","unstructured":"Van Uytvanck, D., Stehouwer, H., and Lampen, L. (2012). 
Semantic metadata mapping in practice: The Virtual Language Observatory. LREC 2012: 8th International Conference on Language Resources and Evaluation, European Language Resources Association (ELRA)."},{"key":"ref_43","unstructured":"Megyesi, B., Granstedt, L., Johansson, S., Prentice, J., Ros\u00e9n, D., Schenstr\u00f6m, C.J., Sundberg, G., Wir\u00e9n, M., and Volodina, E. (2018, January 7). Learner corpus anonymization in the age of gdpr: Insights from the creation of a learner corpus of swedish. Proceedings of the 7th Workshop on NLP for Computer Assisted Language Learning (NLP4CALL 2018) at SLTC, Stockholm, Sweden."},{"key":"ref_44","unstructured":"Volodina, E., Janssen, M., Tiedemann, T.L., Preradovi\u0107, N.M., Ragnhildstveit, S., Tenfjord, K., and de Smedt, K. (2018, January 8\u201310). Interoperability of Second Language Resources and Tools. Proceedings of the CLARIN Annual Conference 2018, Pisa, Italy."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Chiarcos, C., Nordhoff, S., and Hellmann, S. (2012). Linked Data in Linguistics, Springer.","DOI":"10.1007\/978-3-642-28249-2"},{"key":"ref_46","unstructured":"Granger, S., and Paquot, M. (2017, January 6\u20138). Towards standardization of metadata for L2 corpora. Proceedings of the workshop on Interoperability of Second Language Resources and Tools, Gothenburg, Sweden."},{"key":"ref_47","unstructured":"Wittenburg, P., Van Uytvanck, D., Zastrow, T., Strak, P., Broeder, D., Schiel, F., Boehlke, V., Reichel, U., and Offersgaard, L. (2018). CLARIN B Centre Checklist, Clarin Eric. Technical Report CE-2013-0095."},{"key":"ref_48","unstructured":"Eskevich, M., de Jong, F., K\u00f6nig, A., Fi\u0161er, D., Van Uytvanck, D., Aalto, T., Borin, L., Gerassimenko, O., Hajic, J., and van den Heuvel, H. (2020). CLARIN: Distributed language resources and technology in a European infrastructure. 
Proceedings of the 1st International Workshop on Language Technology Platforms, European Language Resources Association (ELRA)."},{"key":"ref_49","first-page":"337","article-title":"A Generic Data Workflow for Building Annotated Text Corpora","volume":"190","author":"Nicolas","year":"2015","journal-title":"Stud. Learn. Corpus Linguist. Res. Appl. Foreign Lang. Teach. Assess."},{"key":"ref_50","unstructured":"Abel, A., and Zanin, R. (2011). Korpus s\u00fcdtirol\u2014Variet\u00e4tenlinguistische untersuchungen. Korpora in Lehre und Forschung, Bozen-Bolzano University Press."},{"key":"ref_51","unstructured":"Schmid, H. (1995, January 30). Improvements in part-of-speech tagging with an application to German. Proceedings of the ACL SIGDAT-Workshop, Dublin, Ireland."},{"key":"ref_52","unstructured":"Evert, S., and Hardie, A. (2011, January 20\u201322). Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. Proceedings of the Corpus Linguistics 2011, Birmingham, UK."},{"key":"ref_53","unstructured":"Rychl\u00fd, P. (2007). Manatee\/Bonito\u2014A modular corpus manager. First Workshop on Recent Advances in Slavonic Natural Language Processing (RASLAN 2007), Masaryk University."},{"key":"ref_54","first-page":"66","article-title":"Technical solutions for reproducible research","volume":"Volume 172","author":"Simov","year":"2020","journal-title":"Selected Papers from the CLARIN Annual Conference 2019"},{"key":"ref_55","unstructured":"Branco, A., Calzolari, N., Vossen, P., Van Noord, G., Van Uytvanck, D., Silva, J., Gomes, L., Moreira, A., and Elbers, W. (2020). A Shared Task of a New, Collaborative Type to foster Reproducibility: A first exercise in the area of language science and technology with REPROLANG2020. Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association."},{"key":"ref_56","unstructured":"Krauwer, S., and Hinrichs, E. (2014). 
The CLARIN research infrastructure: Resources and tools for e-humanities scholars. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), European Language Resources Association."},{"key":"ref_57","unstructured":"Druskat, S., Gast, V., Krause, T., and Zipser, F. (2016). Corpus-tools.org: An interoperable generic software tool set for multi-layer linguistic corpora. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC\u201916), European Language Resources Association."},{"key":"ref_58","unstructured":"Broeder, D., Windhouwer, M., Van Uytvanck, D., Goosen, T., and Trippel, T. (2012, January 22). CMDI: A component metadata infrastructure. Proceedings of the Workshop on Describing LRs with Metadata: Towards Flexibility and Interoperability in the Documentation of LR, Istanbul, Turkey."},{"key":"ref_59","unstructured":"Granger, S., and Paquot, M. (2017, January 6\u20138). Core metadata for learner corpora: Draft 1.0. Proceedings of the workshop on Interoperability of Second Language Resources and Tools, Gothenburg, Sweden."},{"key":"ref_60","unstructured":"Piperidis, S. (2012, January 21\u201327). The META-SHARE language resources sharing infrastructure: Principles, challenges, solutions. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey."},{"key":"ref_61","unstructured":"Alfter, D., Borin, L., Pil\u00e1n, I., Tiedemann, T.L., and Volodina, E. (2018, January 8\u201310). L\u00e4rka: From language learning platform to infrastructure for research on language learning. Proceedings of the CLARIN Annual Conference 2018, Pisa, Italy."},{"key":"ref_62","unstructured":"Dar\u0123is, R., Auzi\u0146a, I., Lev\u0101ne-Petrova, K., and Kaija, I. (2020, January 11\u201316). Quality focused approach to a learner corpus development. 
Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/12\/5\/199\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:56:05Z","timestamp":1760162165000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/12\/5\/199"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,4,30]]},"references-count":62,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2021,5]]}},"alternative-id":["info12050199"],"URL":"https:\/\/doi.org\/10.3390\/info12050199","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,4,30]]}}}