{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T16:34:55Z","timestamp":1773246895187,"version":"3.50.1"},"reference-count":130,"publisher":"MIT Press","issue":"3","license":[{"start":{"date-parts":[[2024,3,25]],"date-time":"2024-03-25T00:00:00Z","timestamp":1711324800000},"content-version":"vor","delay-in-days":84,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,9,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models as well as for their correct evaluation. Recent work, however, has shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts. While practices and guidelines regarding dataset creation projects exist, to our knowledge, large-scale analysis has yet to be performed on how quality management is conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions for applying them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication, or data validation. Using these annotations, we then analyze how quality management is conducted in practice. A majority of the annotated publications apply good or excellent quality management. However, we deem the effort of 30% of the studies as only subpar. 
Our analysis also shows common errors, especially when using inter-annotator agreement and computing annotation error rates.<\/jats:p>","DOI":"10.1162\/coli_a_00516","type":"journal-article","created":{"date-parts":[[2024,3,25]],"date-time":"2024-03-25T15:54:12Z","timestamp":1711382052000},"page":"817-866","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":22,"title":["Analyzing Dataset Annotation Quality Management in the Wild"],"prefix":"10.1162","volume":"50","author":[{"given":"Jan-Christoph","family":"Klie","sequence":"first","affiliation":[{"name":"Ubiquitous Knowledge Processing Lab, Department of Computer Science and Hessian Center for AI (hessian.AI). www.ukp.tu-darmstadt.de"}]},{"given":"Richard Eckart de","family":"Castilho","sequence":"additional","affiliation":[{"name":"Ubiquitous Knowledge Processing Lab"}]},{"given":"Iryna","family":"Gurevych","sequence":"additional","affiliation":[{"name":"Ubiquitous Knowledge Processing Lab"}]}],"member":"281","published-online":{"date-parts":[[2024,9,1]]},"reference":[{"key":"2024092014245840300_bib1","first-page":"29","article-title":"Agile corpus annotation in practice: An overview of manual and automatic annotation of CVs","volume-title":"Proceedings of the Fourth Linguistic Annotation Workshop","author":"Alex","year":"2010"},{"issue":"2","key":"2024092014245840300_bib2","doi-asserted-by":"publisher","first-page":"415","DOI":"10.1080\/03610919908813557","article-title":"Sample size requirements for interval estimation of the intraclass kappa statistic","volume":"28","author":"Allan","year":"1999","journal-title":"Communications in Statistics - Simulation and Computation"},{"key":"2024092014245840300_bib3","doi-asserted-by":"publisher","first-page":"1558","DOI":"10.18653\/v1\/2020.acl-main.142","article-title":"TACRED Revisited: A thorough evaluation of the TACRED relation extraction task","volume-title":"Proceedings of the 58th Annual 
Meeting of the Association for Computational Linguistics","author":"Alt","year":"2020"},{"key":"2024092014245840300_bib4","doi-asserted-by":"publisher","first-page":"344","DOI":"10.18653\/v1\/W19-8642","article-title":"Agreement is overrated: A plea for correlation to assess human evaluation reliability","volume-title":"Proceedings of the 12th International Conference on Natural Language Generation","author":"Amidei","year":"2019"},{"issue":"1","key":"2024092014245840300_bib5","doi-asserted-by":"publisher","first-page":"15","DOI":"10.1609\/aimag.v36i1.2564","article-title":"Truth is a lie: Crowd truth and the seven myths of human annotation","volume":"36","author":"Aroyo","year":"2015","journal-title":"AI Magazine"},{"issue":"4","key":"2024092014245840300_bib6","doi-asserted-by":"publisher","first-page":"555","DOI":"10.1162\/coli.07-034-R2","article-title":"Inter-coder agreement for computational linguistics","volume":"34","author":"Artstein","year":"2008","journal-title":"Computational Linguistics"},{"issue":"4","key":"2024092014245840300_bib7","doi-asserted-by":"publisher","first-page":"357","DOI":"10.1037\/1082-989X.2.4.357","article-title":"Detecting sequential patterns and determining their reliability with fallible observers","volume":"2","author":"Bakeman","year":"1997","journal-title":"Psychological Methods"},{"issue":"1","key":"2024092014245840300_bib8","doi-asserted-by":"publisher","first-page":"3","DOI":"10.2307\/3315487","article-title":"Beyond kappa: A review of interrater agreement measures","volume":"27","author":"Banerjee","year":"1999","journal-title":"Canadian Journal of Statistics"},{"key":"2024092014245840300_bib9","doi-asserted-by":"publisher","first-page":"909","DOI":"10.1162\/tacl_a_00404","article-title":"Neural modeling for named entities and morphology (NEMO2)","volume":"9","author":"Bareket","year":"2021","journal-title":"Transactions of the Association for Computational 
Linguistics"},{"key":"2024092014245840300_bib10","doi-asserted-by":"publisher","first-page":"604","DOI":"10.18653\/v1\/2020.coling-main.52","article-title":"Author\u2019s sentiment prediction","volume-title":"Proceedings of the 28th International Conference on Computational Linguistics","author":"Bastan","year":"2020"},{"issue":"4","key":"2024092014245840300_bib11","doi-asserted-by":"publisher","first-page":"699","DOI":"10.1162\/COLI_a_00074","article-title":"What determines inter-coder agreement in manual annotations? A meta-analytic investigation","volume":"37","author":"Bayerl","year":"2011","journal-title":"Computational Linguistics"},{"key":"2024092014245840300_bib12","doi-asserted-by":"publisher","DOI":"10.1075\/tilar.6","volume-title":"Corpora in Language Acquisition Research: History, Methods, Perspectives","author":"Behrens","year":"2008"},{"key":"2024092014245840300_bib13","doi-asserted-by":"publisher","first-page":"587","DOI":"10.1162\/tacl_a_00041","article-title":"Data statements for natural language processing: Toward mitigating system bias and enabling better science","volume":"6","author":"Bender","year":"2018","journal-title":"Transactions of the Association for Computational Linguistics"},{"issue":"8476","key":"2024092014245840300_bib14","doi-asserted-by":"publisher","first-page":"307","DOI":"10.1016\/S0140-6736(86)90837-8","article-title":"Statistical methods for assessing agreement between two methods of clinical measurement","volume":"1","author":"Bland","year":"1986","journal-title":"Lancet"},{"key":"2024092014245840300_bib15","doi-asserted-by":"publisher","first-page":"632","DOI":"10.18653\/v1\/D15-1075","article-title":"A large annotated corpus for learning natural language inference","volume-title":"Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing","author":"Bowman","year":"2015"},{"key":"2024092014245840300_bib16","first-page":"4573","article-title":"Creating a dataset for named entity recognition in 
the archaeology domain","volume-title":"Proceedings of the Twelfth Language Resources and Evaluation Conference","author":"Brandsen","year":"2020"},{"issue":"5","key":"2024092014245840300_bib17","doi-asserted-by":"publisher","first-page":"365","DOI":"10.1038\/nrn3475","article-title":"Power failure: Why small sample size undermines the reliability of neuroscience","volume":"14","author":"Button","year":"2013","journal-title":"Nature Reviews Neuroscience"},{"key":"2024092014245840300_bib18","first-page":"1","article-title":"Creating speech and language data with Amazon\u2019s Mechanical Turk","volume-title":"Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon\u2019s Mechanical Turk","author":"Callison-Burch","year":"2010"},{"issue":"2","key":"2024092014245840300_bib19","first-page":"249","article-title":"Assessing agreement on classification tasks: The kappa statistic","volume":"22","author":"Carletta","year":"1996","journal-title":"Computational Linguistics"},{"key":"2024092014245840300_bib20","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/S17-2001","article-title":"SemEval-2017 Task 1: Semantic textual similarity multilingual and crosslingual focused evaluation","volume-title":"Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)","author":"Cer","year":"2017"},{"key":"2024092014245840300_bib21","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1609\/hcomp.v5i1.13306","article-title":"Let\u2019s agree to disagree: Fixing agreement measures for crowdsourcing","volume-title":"Proceedings of the AAAI Conference on Human Computation and Crowdsourcing","author":"Checco","year":"2017"},{"key":"2024092014245840300_bib22","doi-asserted-by":"publisher","first-page":"3697","DOI":"10.18653\/v1\/2021.emnlp-main.300","article-title":"FinQA: A dataset of numerical reasoning over financial data","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural 
Language Processing","author":"Chen","year":"2021"},{"issue":"1","key":"2024092014245840300_bib23","doi-asserted-by":"publisher","first-page":"37","DOI":"10.1177\/001316446002000104","article-title":"A Coefficient of Agreement for Nominal Scales","volume":"20","author":"Cohen","year":"1960","journal-title":"Educational and Psychological Measurement"},{"issue":"1","key":"2024092014245840300_bib24","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3148148","article-title":"Quality control in crowdsourcing: A survey of quality attributes, assessment techniques, and assurance actions","volume":"51","author":"Daniel","year":"2019","journal-title":"ACM Computing Surveys"},{"issue":"1","key":"2024092014245840300_bib25","doi-asserted-by":"publisher","first-page":"20","DOI":"10.2307\/2346806","article-title":"Maximum likelihood estimation of observer error-rates using the EM algorithm","volume":"28","author":"Dawid","year":"1979","journal-title":"Applied Statistics"},{"key":"2024092014245840300_bib26","doi-asserted-by":"publisher","first-page":"4040","DOI":"10.18653\/v1\/2020.acl-main.372","article-title":"GoEmotions: A dataset of fine-grained emotions","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Demszky","year":"2020"},{"key":"2024092014245840300_bib27","first-page":"1","article-title":"Detecting inconsistencies in treebanks","volume-title":"Proceedings of the Second Workshop on Treebanks and Linguistic Theories","author":"Dickinson","year":"2003"},{"issue":"11","key":"2024092014245840300_bib28","doi-asserted-by":"publisher","first-page":"1511","DOI":"10.1002\/sim.4780111109","article-title":"A goodness-of-fit approach to inference procedures for the kappa statistic: Confidence interval construction, significance-testing and sample size estimation","volume":"11","author":"Donner","year":"1992","journal-title":"Statistics in 
Medicine"},{"key":"2024092014245840300_bib29","doi-asserted-by":"publisher","first-page":"1383","DOI":"10.18653\/v1\/P18-1128","article-title":"The hitchhiker\u2019s guide to testing statistical significance in natural language processing","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Dror","year":"2018"},{"issue":"4","key":"2024092014245840300_bib30","doi-asserted-by":"publisher","first-page":"407","DOI":"10.1007\/BF02288803","article-title":"Estimation of the reliability of ratings","volume":"16","author":"Ebel","year":"1951","journal-title":"Psychometrika"},{"issue":"4","key":"2024092014245840300_bib31","doi-asserted-by":"publisher","first-page":"185","DOI":"10.1002\/sono.12276","article-title":"Correlation does not imply agreement: A cautionary tale for researchers and reviewers","volume":"8","author":"Edwards","year":"2021","journal-title":"Sonography"},{"issue":"1","key":"2024092014245840300_bib32","doi-asserted-by":"publisher","first-page":"54","DOI":"10.1214\/ss\/1177013815","article-title":"Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy","volume":"1","author":"Efron","year":"1986","journal-title":"Statistical Science"},{"key":"2024092014245840300_bib33","doi-asserted-by":"publisher","first-page":"1626","DOI":"10.18653\/v1\/2021.naacl-main.129","article-title":"Did they answer? 
Subjective acts and intents in conversational discourse","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Ferracane","year":"2021"},{"key":"2024092014245840300_bib34","volume-title":"Statistical Methods for Research Workers","author":"Fisher","year":"1925"},{"issue":"5","key":"2024092014245840300_bib35","doi-asserted-by":"publisher","first-page":"378","DOI":"10.1037\/h0031619","article-title":"Measuring nominal scale agreement among many raters","volume":"76","author":"Fleiss","year":"1971","journal-title":"Psychological Bulletin"},{"key":"2024092014245840300_bib36","doi-asserted-by":"publisher","DOI":"10.1002\/0471445428","volume-title":"Statistical Methods for Rates and Proportions","author":"Fleiss","year":"2003","edition":"1st"},{"issue":"12","key":"2024092014245840300_bib37","doi-asserted-by":"publisher","first-page":"86","DOI":"10.1145\/3458723","article-title":"Datasheets for datasets","volume":"64","author":"Gebru","year":"2021","journal-title":"Communications of the ACM"},{"key":"2024092014245840300_bib38","doi-asserted-by":"publisher","first-page":"1161","DOI":"10.18653\/v1\/D19-1107","article-title":"Are we modeling the task or the annotator? 
An investigation of annotator bias in natural language understanding datasets","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Geva","year":"2019"},{"key":"2024092014245840300_bib39","doi-asserted-by":"publisher","first-page":"5010","DOI":"10.18653\/v1\/2022.acl-long.344","article-title":"CICERO: A dataset for contextualized commonsense inference in dialogues","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Ghosal","year":"2022"},{"key":"2024092014245840300_bib40","doi-asserted-by":"publisher","first-page":"23","DOI":"10.18653\/v1\/W18-2504","article-title":"The ACL anthology: Current state and future directions","volume-title":"Proceedings of Workshop for NLP Open Source Software (NLP-OSS)","author":"Gildea","year":"2018"},{"key":"2024092014245840300_bib41","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-349-17295-5_4","volume-title":"Problems of Monetary Management: The UK Experience","author":"Goodhart","year":"1984"},{"key":"2024092014245840300_bib42","doi-asserted-by":"publisher","first-page":"5295","DOI":"10.18653\/v1\/2020.emnlp-main.427","article-title":"Help! 
Need advice on identifying advice","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Govindarajan","year":"2020"},{"key":"2024092014245840300_bib43","doi-asserted-by":"publisher","first-page":"8342","DOI":"10.18653\/v1\/2020.acl-main.740","article-title":"Don\u2019t stop pretraining: Adapt language models to domains and tasks","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Gururangan","year":"2020"},{"key":"2024092014245840300_bib44","volume-title":"Patterns, Predictions, and Actions: Foundations of Machine Learning","author":"Hardt","year":"2022"},{"key":"2024092014245840300_bib45","first-page":"15","article-title":"You\u2019re hired! An examination of crowdsourcing incentive models in human resource tasks","volume-title":"Proceedings of the Workshop on Crowdsourcing for Search and Data Mining (CSDM) at the Fourth ACM International Conference on Web Search and Data Mining (WSDM)","author":"Harris","year":"2011"},{"key":"2024092014245840300_bib46","first-page":"1113","article-title":"Approximating theoretical linguistics classification in real data: The case of German \u201cnach\u201d particle verbs","volume-title":"Proceedings of COLING 2012","author":"Haselbach","year":"2012"},{"issue":"1","key":"2024092014245840300_bib47","doi-asserted-by":"publisher","first-page":"77","DOI":"10.1080\/19312450709336664","article-title":"Answering the call for a standard reliability measure for coding data","volume":"1","author":"Hayes","year":"2007","journal-title":"Communication Methods and Measures"},{"key":"2024092014245840300_bib48","doi-asserted-by":"publisher","first-page":"419","DOI":"10.1145\/2736277.2741102","article-title":"Incentivizing high quality crowdwork","volume-title":"Proceedings of the 24th International Conference on World Wide 
Web","author":"Ho","year":"2015"},{"issue":"03677","key":"2024092014245840300_bib49","first-page":"1","article-title":"The dataset nutrition label: A framework to drive higher data quality standards","volume":"1805","author":"Holland","year":"2018","journal-title":"arXiv"},{"key":"2024092014245840300_bib50","first-page":"45","article-title":"The influence of spelling errors on content scoring performance","volume-title":"Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)","author":"Horbach","year":"2017"},{"key":"2024092014245840300_bib51","doi-asserted-by":"publisher","first-page":"1120","DOI":"10.3115\/v1\/P14-2062","article-title":"Learning whom to trust with MACE","volume-title":"Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Hovy","year":"2013"},{"key":"2024092014245840300_bib52","doi-asserted-by":"crossref","first-page":"377","DOI":"10.3115\/v1\/P14-2062","article-title":"Experiments with crowdsourced re-annotation of a POS tagging data set","volume-title":"Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Hovy","year":"2014"},{"key":"2024092014245840300_bib53","first-page":"13","article-title":"Towards a \u2018science\u2019 of corpus annotation: A new methodological challenge for corpus linguistics","volume":"22","author":"Hovy","year":"2010","journal-title":"International Journal of Translation Studies"},{"key":"2024092014245840300_bib54","doi-asserted-by":"publisher","first-page":"27","DOI":"10.3115\/1564131.1564137","article-title":"Data quality from crowdsourcing: A study of annotation selection criteria","volume-title":"Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language 
Processing","author":"Hsueh","year":"2009"},{"key":"2024092014245840300_bib55","doi-asserted-by":"publisher","first-page":"560","DOI":"10.1145\/3442188.3445918","article-title":"Towards accountability for machine learning datasets: Practices from software engineering and infrastructure","volume-title":"Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency","author":"Hutchinson","year":"2021"},{"key":"2024092014245840300_bib56","doi-asserted-by":"publisher","DOI":"10.1007\/978-94-024-0881-2","volume-title":"Handbook of Linguistic Annotation","author":"Ide","year":"2017"},{"key":"2024092014245840300_bib57","doi-asserted-by":"publisher","first-page":"291","DOI":"10.18653\/v1\/D15-1035","article-title":"Noise or additional information? Leveraging crowdsource annotation item agreement for natural language tasks","volume-title":"Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing","author":"Jamison","year":"2015"},{"key":"2024092014245840300_bib58","first-page":"234","article-title":"Context-sensitive spelling correction of clinical text via conditional independence","volume-title":"Proceedings of the Conference on Health, Inference, and Learning","author":"Kim","year":"2022"},{"key":"2024092014245840300_bib59","doi-asserted-by":"publisher","first-page":"1352","DOI":"10.18653\/v1\/2022.naacl-main.97","article-title":"HATEMOJI: A test suite and adversarially-generated dataset for benchmarking and detecting emoji-based hate","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Kirk","year":"2022"},{"key":"2024092014245840300_bib60","first-page":"5","article-title":"The INCEpTION Platform: Machine-assisted and knowledge-oriented interactive annotation","volume-title":"Proceedings of the 27th International Conference on Computational Linguistics: System 
Demonstrations","author":"Klie","year":"2018"},{"issue":"1","key":"2024092014245840300_bib61","doi-asserted-by":"publisher","first-page":"157","DOI":"10.1162\/coli_a_00464","article-title":"Annotation error detection: Analyzing the past and present for a more coherent future","volume":"49","author":"Klie","year":"2023","journal-title":"Computational Linguistics"},{"issue":"1","key":"2024092014245840300_bib62","doi-asserted-by":"publisher","first-page":"96","DOI":"10.1016\/j.jclinepi.2010.03.002","article-title":"Guidelines for reporting reliability and agreement studies (GRRAS) were proposed","volume":"64","author":"Kottner","year":"2011","journal-title":"Journal of Clinical Epidemiology"},{"key":"2024092014245840300_bib63","volume-title":"Content Analysis: An Introduction to Its Methodology","author":"Krippendorff","year":"1980"},{"key":"2024092014245840300_bib64","doi-asserted-by":"publisher","first-page":"47","DOI":"10.2307\/271061","article-title":"On the reliability of unitizing continuous data","volume":"25","author":"Krippendorff","year":"1995","journal-title":"Sociological Methodology"},{"issue":"3","key":"2024092014245840300_bib65","doi-asserted-by":"publisher","first-page":"411","DOI":"10.1111\/j.1468-2958.2004.tb00738.x","article-title":"Reliability in content analysis: Some common misconceptions and recommendations","volume":"30","author":"Krippendorff","year":"2004","journal-title":"Human Communication Research"},{"issue":"2","key":"2024092014245840300_bib66","doi-asserted-by":"publisher","first-page":"93","DOI":"10.1080\/19312458.2011.568376","article-title":"Agreement and information in the reliability of coding","volume":"5","author":"Krippendorff","year":"2011","journal-title":"Communication Methods and Measures"},{"issue":"6","key":"2024092014245840300_bib67","doi-asserted-by":"publisher","first-page":"2347","DOI":"10.1007\/s11135-015-0266-1","article-title":"On the reliability of unitizing textual continua: Further 
developments","volume":"50","author":"Krippendorff","year":"2016","journal-title":"Quality & Quantity"},{"key":"2024092014245840300_bib68","doi-asserted-by":"publisher","first-page":"343","DOI":"10.18653\/v1\/2021.acl-short.44","article-title":"Quantifying and avoiding unfair qualification labour in crowdsourcing","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)","author":"Kummerfeld","year":"2021"},{"key":"2024092014245840300_bib69","doi-asserted-by":"publisher","first-page":"3846","DOI":"10.18653\/v1\/P19-1374","article-title":"A large-scale corpus for conversation disentanglement","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Kummerfeld","year":"2019"},{"key":"2024092014245840300_bib70","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3115\/1072228.1072249","article-title":"(Semi-)automatic detection of errors in PoS-tagged corpora","volume-title":"COLING 2002: The 19th International Conference on Computational Linguistics","author":"Kve\u0306to\u0148","year":"2002"},{"issue":"1","key":"2024092014245840300_bib71","doi-asserted-by":"publisher","first-page":"159","DOI":"10.2307\/2529310","article-title":"The measurement of observer agreement for categorical data","volume":"33","author":"Landis","year":"1977","journal-title":"Biometrics"},{"key":"2024092014245840300_bib72","first-page":"97","article-title":"On quality control and machine learning in crowdsourcing","volume-title":"Proceedings of the 11th AAAI Conference on Human Computation","author":"Lease","year":"2011"},{"key":"2024092014245840300_bib73","doi-asserted-by":"publisher","first-page":"177","DOI":"10.18653\/v1\/W19-4520","article-title":"Towards assessing argumentation annotation - a first step","volume-title":"Proceedings of the 6th Workshop on Argument 
Mining","author":"Lindahl","year":"2019"},{"issue":"4","key":"2024092014245840300_bib74","doi-asserted-by":"publisher","first-page":"587","DOI":"10.1111\/j.1468-2958.2002.tb00826.x","article-title":"Content analysis in mass communication: Assessment and reporting of intercoder reliability","volume":"28","author":"Lombard","year":"2002","journal-title":"Human Communication Research"},{"key":"2024092014245840300_bib75","doi-asserted-by":"publisher","first-page":"3428","DOI":"10.18653\/v1\/P19-1334","article-title":"Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"McCoy","year":"2019"},{"key":"2024092014245840300_bib76","first-page":"105","article-title":"DKPro agreement: An open-source Java library for measuring inter-rater agreement","volume-title":"Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations","author":"Meyer","year":"2014"},{"key":"2024092014245840300_bib77","doi-asserted-by":"publisher","first-page":"2381","DOI":"10.18653\/v1\/D18-1260","article-title":"Can a suit of armor conduct electricity? 
A new dataset for open book question answering","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Mihaylov","year":"2018"},{"key":"2024092014245840300_bib78","doi-asserted-by":"publisher","first-page":"1003","DOI":"10.3115\/1690219.1690287","article-title":"Distant supervision for relation extraction without labeled data","volume-title":"Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP","author":"Mintz","year":"2009"},{"key":"2024092014245840300_bib79","volume-title":"Human-in-the-Loop Machine Learning: Active Learning and Annotation for Human-Centered AI","author":"Monarch","year":"2021"},{"key":"2024092014245840300_bib80","doi-asserted-by":"publisher","first-page":"4569","DOI":"10.18653\/v1\/2020.emnlp-main.370","article-title":"GLUCOSE: Generalized and contextualized story explanations","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Mostafazadeh","year":"2020"},{"key":"2024092014245840300_bib81","doi-asserted-by":"publisher","DOI":"10.4135\/9781071802878","volume-title":"The Content Analysis Guidebook","author":"Neuendorf","year":"2016"},{"key":"2024092014245840300_bib82","doi-asserted-by":"publisher","first-page":"1373","DOI":"10.1613\/jair.1.12125","article-title":"Confident learning: Estimating uncertainty in dataset labels","volume":"70","author":"Northcutt","year":"2021","journal-title":"Journal of Artificial Intelligence Research"},{"key":"2024092014245840300_bib83","first-page":"1","article-title":"Pervasive label errors in test sets destabilize machine learning benchmarks","volume-title":"35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track","author":"Northcutt","year":"2021"},{"key":"2024092014245840300_bib84","first-page":"1","article-title":"Training language models to 
follow instructions with human feedback","volume-title":"Advances in Neural Information Processing Systems","author":"Ouyang","year":"2022"},{"key":"2024092014245840300_bib85","doi-asserted-by":"publisher","first-page":"1779","DOI":"10.18653\/v1\/2023.eacl-main.130","article-title":"Don\u2019t blame the annotator: Bias already starts in the annotation instructions","volume-title":"Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics","author":"Parmar","year":"2023"},{"key":"2024092014245840300_bib86","doi-asserted-by":"publisher","first-page":"4886","DOI":"10.18653\/v1\/2021.findings-emnlp.421","article-title":"Does putting a linguist in the loop improve NLU data collection?","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2021","author":"Parrish","year":"2021"},{"key":"2024092014245840300_bib87","doi-asserted-by":"publisher","first-page":"311","DOI":"10.1162\/tacl_a_00185","article-title":"The benefits of a model of annotation","volume":"2","author":"Passonneau","year":"2014","journal-title":"Transactions of the Association for Computational Linguistics"},{"issue":"0","key":"2024092014245840300_bib88","doi-asserted-by":"publisher","first-page":"571","DOI":"10.1162\/tacl_a_00040","article-title":"Comparing Bayesian models of annotation","volume":"6","author":"Paun","year":"2018","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024092014245840300_bib89","doi-asserted-by":"publisher","first-page":"7","DOI":"10.18653\/v1\/W19-4302","article-title":"To tune or not to tune? 
Adapting pretrained representations to diverse tasks","volume-title":"Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)","author":"Peters","year":"2019"},{"key":"2024092014245840300_bib90","doi-asserted-by":"publisher","first-page":"2343","DOI":"10.18653\/v1\/2023.semeval-1.317","article-title":"SemEval-2023 Task 3: Detecting the category, the framing, and the persuasion techniques in online news in a multi-lingual setup","volume-title":"Proceedings of the 17th International Workshop on Semantic Evaluation","author":"Piskorski","year":"2023"},{"key":"2024092014245840300_bib91","doi-asserted-by":"publisher","first-page":"90","DOI":"10.1007\/978-1-349-19051-5_6","article-title":"On agreement indices for nominal data","volume-title":"Sociometric Research","author":"Popping","year":"1988"},{"issue":"1","key":"2024092014245840300_bib92","first-page":"37","article-title":"Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation","volume":"2","author":"Powers","year":"2011","journal-title":"Journal of Machine Learning Technologies"},{"key":"2024092014245840300_bib93","first-page":"2961","article-title":"The Penn Discourse TreeBank 2.0.","volume-title":"Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC\u201908)","author":"Prasad","year":"2008"},{"key":"2024092014245840300_bib94","doi-asserted-by":"publisher","first-page":"1776","DOI":"10.1145\/3531146.3533231","article-title":"Data cards: Purposeful and transparent dataset documentation for responsible AI","volume-title":"2022 ACM Conference on Fairness, Accountability, and Transparency","author":"Pushkarna","year":"2022"},{"key":"2024092014245840300_bib95","volume-title":"Natural Language Annotation for Machine Learning","author":"Pustejovsky","year":"2013"},{"key":"2024092014245840300_bib96","doi-asserted-by":"publisher","first-page":"326","DOI":"10.18653\/v1\/2021.sigdial-1.35","article-title":"Annotation 
inconsistency and entity bias in MultiWOZ","volume-title":"Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue","author":"Qian","year":"2021"},{"issue":"4","key":"2024092014245840300_bib97","doi-asserted-by":"publisher","first-page":"187","DOI":"10.4103\/picr.PICR_123_17","article-title":"Common pitfalls in statistical analysis: Measures of agreement","volume":"8","author":"Ranganathan","year":"2017","journal-title":"Perspectives in Clinical Research"},{"key":"2024092014245840300_bib98","doi-asserted-by":"publisher","first-page":"249","DOI":"10.1162\/tacl_a_00266","article-title":"CoQA: A conversational question answering challenge","volume":"7","author":"Reddy","year":"2019","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024092014245840300_bib99","doi-asserted-by":"publisher","first-page":"215","DOI":"10.18653\/v1\/2020.conll-1.16","article-title":"Identifying incorrect labels in the CoNLL-2003 corpus","volume-title":"Proceedings of the 24th Conference on Computational Natural Language Learning","author":"Reiss","year":"2020"},{"issue":"4","key":"2024092014245840300_bib100","doi-asserted-by":"publisher","first-page":"1328","DOI":"10.1109\/TKDE.2019.2946162","article-title":"A survey on data collection for machine learning: A big data - AI integration perspective","volume":"33","author":"Roh","year":"2021","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"2024092014245840300_bib101","first-page":"859","article-title":"Corpus annotation through crowdsourcing: Towards best practice guidelines","volume-title":"Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u201914)","author":"Sabou","year":"2014"},{"key":"2024092014245840300_bib102","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3411764.3445518","article-title":"\u201cEveryone wants to do the model work, not the data work\u201d: Data cascades 
in high-stakes AI","volume-title":"SIGCHI","author":"Sambasivan","year":"2021"},{"key":"2024092014245840300_bib103","doi-asserted-by":"publisher","DOI":"10.1002\/9780470999875","volume-title":"A Companion to Digital Humanities","author":"Schreibman","year":"2004"},{"issue":"3","key":"2024092014245840300_bib104","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1086\/266577","article-title":"Reliability of content analysis: The case of nominal scale coding","volume":"19","author":"Scott","year":"1955","journal-title":"The Public Opinion Quarterly"},{"key":"2024092014245840300_bib105","doi-asserted-by":"publisher","first-page":"156","DOI":"10.1609\/hcomp.v1i1.13088","article-title":"SQUARE: A benchmark for research on computing crowd consensus","volume":"1","author":"Sheshadri","year":"2013","journal-title":"Proceedings of the AAAI Conference on Human Computation and Crowdsourcing"},{"key":"2024092014245840300_bib106","doi-asserted-by":"publisher","first-page":"3758","DOI":"10.18653\/v1\/2021.naacl-main.295","article-title":"Beyond fair pay: Ethical implications of NLP crowdsourcing","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Shmueli","year":"2021"},{"issue":"4","key":"2024092014245840300_bib107","doi-asserted-by":"publisher","first-page":"251","DOI":"10.1191\/0962280204sm365ra","article-title":"Sample size requirements for the design of reliability study: Review and new results","volume":"13","author":"Shoukri","year":"2004","journal-title":"Statistical Methods in Medical Research"},{"issue":"2","key":"2024092014245840300_bib108","doi-asserted-by":"publisher","first-page":"420","DOI":"10.1037\/0033-2909.86.2.420","article-title":"Intraclass correlations: Uses in assessing rater reliability.","volume":"86","author":"Shrout","year":"1979","journal-title":"Psychological 
Bulletin"},{"issue":"3","key":"2024092014245840300_bib109","doi-asserted-by":"publisher","first-page":"257","DOI":"10.1093\/ptj\/85.3.257","article-title":"The kappa statistic in reliability studies: Use, interpretation, and sample size requirements","volume":"85","author":"Sim","year":"2005","journal-title":"Physical Therapy"},{"key":"2024092014245840300_bib110","doi-asserted-by":"publisher","first-page":"1093","DOI":"10.18653\/v1\/D19-1101","article-title":"A Bayesian approach for sequence tagging with crowds","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Simpson","year":"2019"},{"key":"2024092014245840300_bib111","doi-asserted-by":"publisher","first-page":"883","DOI":"10.18653\/v1\/2021.findings-acl.78","article-title":"COM2SENSE: A commonsense reasoning benchmark with complementary sentences","volume-title":"Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021","author":"Singh","year":"2021"},{"key":"2024092014245840300_bib112","doi-asserted-by":"publisher","first-page":"254","DOI":"10.3115\/1613715.1613751","article-title":"Cheap and fast \u2013 but is it good? 
Evaluating non-expert annotations for natural language tasks","volume-title":"Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing","author":"Snow","year":"2008"},{"key":"2024092014245840300_bib113","first-page":"1631","article-title":"Recursive deep models for semantic compositionality over a sentiment treebank","volume-title":"Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing","author":"Socher","year":"2013"},{"key":"2024092014245840300_bib114","doi-asserted-by":"publisher","first-page":"46","DOI":"10.3115\/v1\/D14-1006","article-title":"Identifying argumentative discourse structures in persuasive essays","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Stab","year":"2014"},{"key":"2024092014245840300_bib115","doi-asserted-by":"publisher","first-page":"13843","DOI":"10.1609\/aaai.v35i15.17631","article-title":"Re-TACRED: Addressing shortcomings of the TACRED dataset","volume-title":"Proceedings of the 35th AAAI Conference on Artificial Intelligence 2021","author":"Stoica","year":"2021"},{"key":"2024092014245840300_bib116","doi-asserted-by":"publisher","first-page":"843","DOI":"10.1109\/ICCV.2017.97","article-title":"Revisiting unreasonable effectiveness of data in deep learning era","volume-title":"2017 IEEE International Conference on Computer Vision (ICCV)","author":"Sun","year":"2017"},{"key":"2024092014245840300_bib117","doi-asserted-by":"publisher","first-page":"80","DOI":"10.18653\/v1\/W17-1610","article-title":"A short review of ethical challenges in clinical natural language processing","volume-title":"Proceedings of the First ACL Workshop on Ethics in Natural Language Processing","author":"Suster","year":"2017"},{"key":"2024092014245840300_bib118","doi-asserted-by":"publisher","first-page":"142","DOI":"10.3115\/1119176.1119195","article-title":"Introduction to the CoNLL-2003 shared task: Language-independent 
named entity recognition","volume-title":"Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003","author":"Tjong Kim Sang","year":"2003"},{"key":"2024092014245840300_bib119","doi-asserted-by":"publisher","first-page":"1385","DOI":"10.1613\/jair.1.12752","article-title":"Learning from disagreement: A survey","volume":"72","author":"Uma","year":"2021","journal-title":"Journal of Artificial Intelligence Research"},{"key":"2024092014245840300_bib120","first-page":"1251","article-title":"An analysis of the impact of annotation errors on the accuracy of deep learning for cell segmentation","volume-title":"Proceedings of Machine Learning Research","author":"V\u0103dineanu","year":"2022"},{"issue":"3","key":"2024092014245840300_bib121","doi-asserted-by":"publisher","first-page":"162","DOI":"10.1159\/000337798","article-title":"Measuring agreement, more complicated than it seems","volume":"120","author":"van Stralen","year":"2012","journal-title":"Nephron Clinical Practice"},{"key":"2024092014245840300_bib122","first-page":"1","article-title":"When does dough become a bagel? 
Analyzing the remaining mistakes on ImageNet","volume-title":"Proceedings of the 36th Conference on Neural Information Processing Systems","author":"Vasudevan","year":"2022"},{"key":"2024092014245840300_bib123","doi-asserted-by":"publisher","first-page":"5153","DOI":"10.18653\/v1\/D19-1519","article-title":"CrossWeigh: Training named entity tagger from imperfect annotations","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Wang","year":"2019"},{"key":"2024092014245840300_bib124","first-page":"1","article-title":"Finetuned language models are zero-shot learners","volume-title":"International Conference on Learning Representations","author":"Wei","year":"2022"},{"key":"2024092014245840300_bib125","volume-title":"Developing Linguistic Corpora: A Guide to Good Practice","author":"Wynne","year":"2005"},{"key":"2024092014245840300_bib126","doi-asserted-by":"publisher","first-page":"764","DOI":"10.18653\/v1\/P19-1074","article-title":"DocRED: A large-scale document-level relation extraction dataset","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Yao","year":"2019"},{"issue":"1","key":"2024092014245840300_bib127","doi-asserted-by":"publisher","first-page":"93","DOI":"10.1186\/s12874-016-0200-9","article-title":"Measuring inter-rater reliability for nominal data \u2013 which coefficients and confidence intervals are appropriate?","volume":"16","author":"Zapf","year":"2016","journal-title":"BMC Medical Research Methodology"},{"key":"2024092014245840300_bib128","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1155\/2015\/674296","article-title":"Survey of natural language processing techniques in bioinformatics","volume":"2015","author":"Zeng","year":"2015","journal-title":"Computational and Mathematical Methods in 
Medicine"},{"key":"2024092014245840300_bib129","doi-asserted-by":"publisher","first-page":"35","DOI":"10.18653\/v1\/D17-1004","article-title":"Position-aware attention and supervised data improve slot filling","volume-title":"Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing","author":"Zhang","year":"2017"},{"issue":"1","key":"2024092014245840300_bib130","doi-asserted-by":"publisher","first-page":"419","DOI":"10.1080\/23808985.2013.11679142","article-title":"Assumptions behind intercoder reliability indices","volume":"36","author":"Zhao","year":"2013","journal-title":"Annals of the International Communication Association"}],"container-title":["Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/coli\/article-pdf\/50\/3\/817\/2470929\/coli_a_00516.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/coli\/article-pdf\/50\/3\/817\/2470929\/coli_a_00516.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,20]],"date-time":"2024-09-20T14:25:20Z","timestamp":1726842320000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/coli\/article\/50\/3\/817\/120233\/Analyzing-Dataset-Annotation-Quality-Management-in"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":130,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2024,9,1]]},"published-print":{"date-parts":[[2024,9,1]]}},"URL":"https:\/\/doi.org\/10.1162\/coli_a_00516","relation":{},"ISSN":["0891-2017","1530-9312"],"issn-type":[{"value":"0891-2017","type":"print"},{"value":"1530-9312","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]}}}