{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T08:58:24Z","timestamp":1765357104506},"reference-count":58,"publisher":"MIT Press - Journals","license":[{"start":{"date-parts":[[2022,4,11]],"date-time":"2022-04-11T00:00:00Z","timestamp":1649635200000},"content-version":"vor","delay-in-days":100,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,4,11]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim to contribute to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers, to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline to future research. Finally, we meta-evaluate common GEC metrics against human judgments on our data. We make the new Czech GEC corpus publicly available under the CC BY-SA 4.0 license at http:\/\/hdl.handle.net\/11234\/1-4639.<\/jats:p>","DOI":"10.1162\/tacl_a_00470","type":"journal-article","created":{"date-parts":[[2022,4,11]],"date-time":"2022-04-11T19:40:39Z","timestamp":1649706039000},"page":"452-467","update-policy":"http:\/\/dx.doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":11,"title":["Czech Grammar Error Correction with a Large and Diverse Corpus"],"prefix":"10.1162","volume":"10","author":[{"given":"Jakub","family":"N\u00e1plava","sequence":"first","affiliation":[{"name":"Charles University, Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics, Czech Republic. naplava@ufal.mff.cuni.cz"}]},{"given":"Milan","family":"Straka","sequence":"additional","affiliation":[{"name":"Charles University, Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics, Czech Republic. straka@ufal.mff.cuni.cz"}]},{"given":"Jana","family":"Strakov\u00e1","sequence":"additional","affiliation":[{"name":"Charles University, Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics, Czech Republic. strakova@ufal.mff.cuni.cz"}]},{"given":"Alexandr","family":"Rosen","sequence":"additional","affiliation":[{"name":"Charles University, Faculty of Arts Institute of Theoretical and Computational Linguistics, Czech Republic. alexandr.rosen@ff.cuni.cz"}]}],"member":"281","published-online":{"date-parts":[[2022,4,11]]},"reference":[{"issue":"6","key":"2022041119403312600_bib1","doi-asserted-by":"publisher","first-page":"1554","DOI":"10.1214\/aoms\/1177699147","article-title":"Statistical inference for probabilistic functions of finite state Markov chains","volume":"37","author":"Baum","year":"1966","journal-title":"The Annals of Mathematical Statistics"},{"key":"2022041119403312600_bib2","doi-asserted-by":"publisher","first-page":"131","DOI":"10.18653\/v1\/W16-2301","article-title":"Findings of the 2016 conference on machine translation","volume-title":"Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers","author":"Bojar","year":"2016"},{"key":"2022041119403312600_bib3","volume-title":"Romsk\u00fd etnolekt \u010de\u0161tiny","author":"Bo\u0159kovcov\u00e1","year":"2007"},{"key":"2022041119403312600_bib4","article-title":"Romsk\u00fd etnolekt \u010de\u0161tiny","volume-title":"Nov\u00fd encyklopedick\u00fd slovn\u00edk \u010de\u0161tiny","author":"Bo\u0159kovcov\u00e1","year":"2017"},{"key":"2022041119403312600_bib5","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-6111","article-title":"Using Wikipedia edits in low resource grammatical error correction","volume-title":"Proceedings of the 4th Workshop on Noisy User-generated Text","author":"Boyd","year":"2018"},{"key":"2022041119403312600_bib6","first-page":"1281","article-title":"The MERLIN corpus: Learner language and the CEFR","volume-title":"Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u2019 14)","author":"Boyd","year":"2014"},{"key":"2022041119403312600_bib7","doi-asserted-by":"publisher","first-page":"52","DOI":"10.18653\/v1\/W19-4406","article-title":"The BEA- 2019 Shared task on grammatical error correction","volume-title":"Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications","author":"Bryant","year":"2019"},{"key":"2022041119403312600_bib8","doi-asserted-by":"publisher","first-page":"793","DOI":"10.18653\/v1\/P17-1074","article-title":"Automatic annotation and evaluation of error types for grammatical error correction","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Bryant","year":"2017"},{"key":"2022041119403312600_bib9","doi-asserted-by":"publisher","first-page":"697","DOI":"10.3115\/v1\/P15-1068","article-title":"How far are we from fully automatic high quality grammatical error correction?","volume-title":"Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Bryant","year":"2015"},{"key":"2022041119403312600_bib10","doi-asserted-by":"publisher","first-page":"213","DOI":"10.18653\/v1\/W19-4423","article-title":"A neural grammatical error correction system built On better pre- training and sequential transfer learning","volume-title":"Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications","author":"Yo","year":"2019"},{"key":"2022041119403312600_bib11","doi-asserted-by":"publisher","first-page":"435","DOI":"10.18653\/v1\/P19-1042","article-title":"Cross-sentence grammatical error correction","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Chollampatt","year":"2019"},{"key":"2022041119403312600_bib12","doi-asserted-by":"crossref","first-page":"625","DOI":"10.1109\/ICTAI50040.2020.00101","article-title":"Neural grammatical error correction for romanian","volume-title":"2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI)","author":"Cotet","year":"2020"},{"key":"2022041119403312600_bib13","first-page":"568","article-title":"Better evaluation for grammatical error correction","volume-title":"Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Dahlmeier","year":"2012"},{"key":"2022041119403312600_bib14","first-page":"22","article-title":"Building a large annotated corpus of learner English: The NUS corpus of learner English","volume-title":"Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications","author":"Dahlmeier","year":"2013"},{"key":"2022041119403312600_bib15","doi-asserted-by":"publisher","first-page":"53","DOI":"10.18653\/v1\/W16-0506","article-title":"A report on the automatic evaluation of scientific writing shared task","volume-title":"Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications","author":"Daudaravicius","year":"2016"},{"key":"2022041119403312600_bib16","first-page":"7238","article-title":"Developing NLP tools with a new corpus of learner Spanish","volume-title":"Proceedings of the 12th Language Resources and Evaluation Conference","author":"Davidson","year":"2020"},{"issue":"1","key":"2022041119403312600_bib17","doi-asserted-by":"publisher","first-page":"63","DOI":"10.1198\/000313002753631385","article-title":"Designing Monte Carlo implementations of permutation or bootstrap hypothesis tests","volume":"56","author":"Fay","year":"2002","journal-title":"The American Statistician"},{"key":"2022041119403312600_bib18","doi-asserted-by":"publisher","first-page":"578","DOI":"10.3115\/v1\/N15-1060","article-title":"Towards a standard evaluation method for grammatical error detection and correction","volume-title":"Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Felice","year":"2015"},{"key":"2022041119403312600_bib19","doi-asserted-by":"publisher","first-page":"8467","DOI":"10.18653\/v1\/2020.emnlp-main.680","article-title":"Grammatical error correction in low error density domains: A new benchmark and analyses","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Flachs","year":"2020"},{"issue":"3","key":"2022041119403312600_bib20","doi-asserted-by":"crossref","first-page":"268","DOI":"10.1109\/PROC.1973.9030","article-title":"The viterbi algorithm","volume":"61","author":"David Forney","year":"1973","journal-title":"Proceedings of the IEEE"},{"issue":"488","key":"2022041119403312600_bib21","doi-asserted-by":"publisher","first-page":"1504","DOI":"10.1198\/jasa.2009.tm08368","article-title":"Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk","volume":"104","author":"Gandy","year":"2009","journal-title":"Journal of the American Statistical Association"},{"key":"2022041119403312600_bib22","article-title":"The computer learner corpus: A versatile new source of data for SLA research","volume-title":"Learner English on Computer","author":"Granger","year":"1998"},{"key":"2022041119403312600_bib23","doi-asserted-by":"publisher","first-page":"357","DOI":"10.18653\/v1\/D19-5546","article-title":"Minimally-augmented grammatical error correction","volume-title":"Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)","author":"Grundkiewicz","year":"2019"},{"key":"2022041119403312600_bib24","doi-asserted-by":"publisher","first-page":"461","DOI":"10.18653\/v1\/D15-1052","article-title":"Human evaluation of grammatical error correction systems","volume-title":"Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing","author":"Grundkiewicz","year":"2015"},{"key":"2022041119403312600_bib25","article-title":"Facebook data for sentiment analysis","author":"Habernal","year":"2013"},{"key":"2022041119403312600_bib26","first-page":"65","article-title":"Sentiment analysis in Czech social media using supervised machine learning","volume-title":"Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis","author":"Habernal","year":"2013"},{"key":"2022041119403312600_bib27","article-title":"MorfFlex CZ 2.0","author":"Haji\u010d","year":"2020"},{"key":"2022041119403312600_bib28","first-page":"4037","article-title":"TEITOK: Text-faithful annotated corpora","volume-title":"Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)","author":"Janssen","year":"2016"},{"key":"2022041119403312600_bib29","article-title":"Announcing CzEng 2.0 parallel corpus with over 2 gigawords","author":"Kocmi","year":"2020","journal-title":"CoRR"},{"key":"2022041119403312600_bib30","doi-asserted-by":"publisher","first-page":"634","DOI":"10.1162\/tacl_a_00336","article-title":"Data weighted training strategies for grammatical error correction","volume":"8","author":"Lichtarge","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2022041119403312600_bib31","doi-asserted-by":"publisher","first-page":"3291","DOI":"10.18653\/v1\/N19-1333","article-title":"Corpora generation for grammatical error correction","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Lichtarge","year":"2019"},{"key":"2022041119403312600_bib32","first-page":"45","article-title":"Results of the WMT13 metrics shared task","volume-title":"Proceedings of the Eighth Workshop on Statistical Machine Translation","author":"Mach\u00e1\u010dek","year":"2013"},{"key":"2022041119403312600_bib33","doi-asserted-by":"publisher","first-page":"340","DOI":"10.18653\/v1\/2021.wnut-1.38","article-title":"Understanding model robustness to user-generated noisy texts","volume-title":"Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)","author":"N\u00e1plava","year":"2021"},{"key":"2022041119403312600_bib34","doi-asserted-by":"publisher","first-page":"346","DOI":"10.18653\/v1\/D19-5545","article-title":"Grammatical error correction in low-resource scenarios","volume-title":"Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)","author":"N\u00e1plava","year":"2019"},{"key":"2022041119403312600_bib35","doi-asserted-by":"publisher","first-page":"551","DOI":"10.1162\/tacl_a_00282","article-title":"Enabling robust grammatical error correction in new domains: Data sets, metrics, and analyses","volume":"7","author":"Napoles","year":"2019","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2022041119403312600_bib36","doi-asserted-by":"publisher","first-page":"588","DOI":"10.3115\/v1\/P15-2097","article-title":"Ground truth for grammatical error correction metrics","volume-title":"Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)","author":"Napoles","year":"2015"},{"key":"2022041119403312600_bib37","doi-asserted-by":"publisher","first-page":"229","DOI":"10.18653\/v1\/E17-2037","article-title":"JFLEG: A fluency corpus and benchmark for grammatical error correction","volume-title":"Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers","author":"Napoles","year":"2017"},{"key":"2022041119403312600_bib38","doi-asserted-by":"crossref","first-page":"1","DOI":"10.3115\/v1\/W14-1701","article-title":"The CoNLL-2014 shared task on grammatical error correction","volume-title":"Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task","author":"Ng","year":"2014"},{"key":"2022041119403312600_bib39","first-page":"1","article-title":"The CoNLL-2013 shared task on grammatical error correction","volume-title":"Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task","author":"Ng","year":"2013"},{"key":"2022041119403312600_bib40","first-page":"4034","article-title":"Universal dependencies v2: An evergrowing multilingual treebank collection","volume-title":"Proceedings of the 12th Language Resources and Evaluation Conference","author":"Nivre","year":"2020"},{"key":"2022041119403312600_bib41","first-page":"72","article-title":"RankME: Reliable human ratings for natural language generation","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)","author":"Novikova","year":"2018"},{"key":"2022041119403312600_bib42","doi-asserted-by":"publisher","first-page":"163","DOI":"10.18653\/v1\/2020.bea-1.16","article-title":"GECToR \u2013 grammatical error correction: Tag, not rewrite","volume-title":"Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications","author":"Omelianchuk","year":"2020"},{"key":"2022041119403312600_bib43","first-page":"1019","article-title":"Korektor \u2013 A system for contextual spell-checking and diacritics completion","volume-title":"Proceedings of COLING 2012: Posters","author":"Richter","year":"2012"},{"key":"2022041119403312600_bib44","volume-title":"Compiling and annotating a learner corpus for a morphologically rich language \u2013 CzeSL, a corpus of non-native Czech","author":"Rosen","year":"2020"},{"key":"2022041119403312600_bib45","first-page":"702","article-title":"A simple recipe for multilingual grammatical error correction","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)","author":"Rothe","year":"2021"},{"key":"2022041119403312600_bib46","first-page":"28","article-title":"Annotating ESL errors: Challenges and rewards","volume-title":"Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications","author":"Rozovskaya","year":"2010"},{"key":"2022041119403312600_bib47","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1162\/tacl_a_00251","article-title":"Grammar error correction in morphologically rich languages: The case of Russian","volume":"7","author":"Rozovskaya","year":"2019","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2022041119403312600_bib48","doi-asserted-by":"publisher","first-page":"169","DOI":"10.1162\/tacl_a_00091","article-title":"Reassessing the goals of grammatical error correction: Fluency instead of grammaticality","volume":"4","author":"Sakaguchi","year":"2016","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2022041119403312600_bib49","doi-asserted-by":"publisher","first-page":"208","DOI":"10.18653\/v1\/P18-1020","article-title":"Efficient online scalar annotation with bounded support","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Sakaguchi","year":"2018"},{"key":"2022041119403312600_bib50","first-page":"11","article-title":"Korpusy \u010de\u0161tiny a osvojov\u00e1n\u00ed jazyka","volume":"1","author":"\u0160ebesta","year":"2010","journal-title":"Studie z aplikovan\u00e9 lingvistiky\/Studies in Applied Linguistics"},{"key":"2022041119403312600_bib51","first-page":"4290","article-title":"UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, pos tagging and parsing","volume-title":"Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC\u201916)","author":"Straka","year":"2016"},{"key":"2022041119403312600_bib52","article-title":"UA-GEC: Grammatical error correction and fluency corpus for the ukrainian language","author":"Syvokon","year":"2021","journal-title":"CoRR"},{"key":"2022041119403312600_bib53","first-page":"198","article-title":"Tense and aspect error correction for ESL learners using global context","volume-title":"Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Tajiri","year":"2012"},{"key":"2022041119403312600_bib54","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"2022041119403312600_bib55","first-page":"149","article-title":"Erroneous data generation for grammatical error correction","volume-title":"Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications","author":"Shuyao","year":"2019"},{"key":"2022041119403312600_bib56","doi-asserted-by":"publisher","DOI":"10.1080\/08957347.2018.1464447","article-title":"Developing an automated writing placement system for ESL learners","volume":"31","author":"Yannakoudakis","year":"2018","journal-title":"Applied Measurement in Education"},{"key":"2022041119403312600_bib57","first-page":"180","article-title":"A new dataset and method for automatically grading ESOL texts","volume-title":"Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies","author":"Yannakoudakis","year":"2011"},{"key":"2022041119403312600_bib58","first-page":"75","article-title":"Document-level grammatical error correction","volume-title":"Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications","author":"Yuan","year":"2021"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00470\/2008050\/tacl_a_00470.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00470\/2008050\/tacl_a_00470.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,4,11]],"date-time":"2022-04-11T19:41:17Z","timestamp":1649706077000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00470\/110536\/Czech-Grammar-Error-Correction-with-a-Large-and"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022]]},"references-count":58,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00470","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022]]},"published":{"date-parts":[[2022]]}}}