{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,15]],"date-time":"2026-04-15T16:52:24Z","timestamp":1776271944562,"version":"3.50.1"},"reference-count":111,"publisher":"Cambridge University Press (CUP)","issue":"5","license":[{"start":{"date-parts":[[2023,2,6]],"date-time":"2023-02-06T00:00:00Z","timestamp":1675641600000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2023,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Many research topics in natural language processing (NLP), such as explanation generation, dialog modeling, or machine translation, require evaluation that goes beyond standard metrics like accuracy or F<jats:sub>1<\/jats:sub> score toward a more human-centered approach. Therefore, understanding how to design user studies becomes increasingly important. However, few comprehensive resources exist on planning, conducting, and evaluating user studies for NLP, making it hard to get started for researchers without prior experience in the field of human evaluation. In this paper, we summarize the most important aspects of user studies and their design and evaluation, providing direct links to NLP tasks and NLP-specific challenges where appropriate. We (i) outline general study design, ethical considerations, and factors to consider for crowdsourcing, (ii) discuss the particularities of user studies in NLP, and provide starting points to select questionnaires, experimental designs, and evaluation methods that are tailored to the specific NLP tasks. 
Additionally, we offer examples with accompanying statistical evaluation code, to bridge the gap between theoretical guidelines and practical applications.<\/jats:p>","DOI":"10.1017\/s1351324922000535","type":"journal-article","created":{"date-parts":[[2023,2,6]],"date-time":"2023-02-06T14:45:50Z","timestamp":1675694750000},"page":"1199-1222","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":18,"title":["How to do human evaluation: A brief introduction to user studies in NLP"],"prefix":"10.1017","volume":"29","author":[{"given":"Hendrik","family":"Schuff","sequence":"first","affiliation":[]},{"given":"Lindsey","family":"Vanderlyn","sequence":"additional","affiliation":[]},{"given":"Heike","family":"Adel","sequence":"additional","affiliation":[]},{"given":"Ngoc Thang","family":"Vu","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2023,2,6]]},"reference":[{"key":"S1351324922000535_ref55","doi-asserted-by":"crossref","unstructured":"Howcroft, D.M. and Rieser, V. (2021). What happens if you treat ordinal ratings as interval data? human evaluations in NLP are even more under-powered than you think. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics, pp. 8932\u20138939. doi: 10.18653\/v1\/2021.emnlp-main.703. Available at https:\/\/aclanthology.org\/2021.emnlp-main.703.","DOI":"10.18653\/v1\/2021.emnlp-main.703"},{"key":"S1351324922000535_ref9","unstructured":"Bojar, O. , Federmann, C. , Haddow, B. , Koehn, P. , Post, M. and Specia, L. (2016). Ten years of wmt evaluation campaigns: Lessons learnt. In Proceedings of the LREC 2016 Workshop \u201cTranslation Evaluation\u2013From Fragmented Tools and Data Sets to an Integrated Ecosystem, pp. 
27\u201334."},{"key":"S1351324922000535_ref70","doi-asserted-by":"publisher","DOI":"10.1002\/0470011815.b2a10021"},{"key":"S1351324922000535_ref60","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300621"},{"key":"S1351324922000535_ref80","first-page":"181","article-title":"The nuremberg code","volume":"10","author":"Nuremberg","year":"1949","journal-title":"Trials of War Criminals Before the Nuremberg Military Tribunals Under Control Council Law"},{"key":"S1351324922000535_ref14","doi-asserted-by":"crossref","unstructured":"Callison-Burch, C. , Fordyce, C. , Koehn, P. , Monz, C. and Schroeder, J. (2007). (meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, June 2007. Association for Computational Linguistics, pp. 136\u2013158. Available at https:\/\/www.aclweb.org\/anthology\/W07-0718.","DOI":"10.3115\/1626355.1626373"},{"key":"S1351324922000535_ref6","first-page":"742","article-title":"Respondent fatigue","volume":"2","author":"Ben-Nun","year":"2008","journal-title":"Encyclopedia of Survey Research Methods"},{"key":"S1351324922000535_ref32","unstructured":"European Commission. (2018). 2018 reform of eu data protection rules. Available at https:\/\/ec.europa.eu\/commission\/sites\/beta-political\/files\/data-protection-factsheet-changes_en.pdf."},{"key":"S1351324922000535_ref42","unstructured":"Gaudio, R. , Burchardt, A. and Branco, A. (2016). Evaluating machine translation in a usage scenario. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC\u201916), Portoro\u017e, Slovenia, May 2016. European Language Resources Association (ELRA), pp. 1\u20138. Available at https:\/\/aclanthology.org\/L16-1001."},{"key":"S1351324922000535_ref12","unstructured":"Brooke, J. (1996) Sus: A \u201cquick and dirty\u2019usability\u201d. In Usability Evaluation in Industry, p. 
189."},{"key":"S1351324922000535_ref63","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2008.26"},{"key":"S1351324922000535_ref21","doi-asserted-by":"crossref","unstructured":"Clark, R. , Sil\u00e9n, H. , Kenter, T. and Leith, R. Evaluating long-form text-to-speech: Comparing the ratings of sentences and paragraphs. CoRR, abs\/1909.03965, 2019. Available at http:\/\/arxiv.org\/abs\/1909.03965.","DOI":"10.21437\/SSW.2019-18"},{"key":"S1351324922000535_ref57","doi-asserted-by":"publisher","DOI":"10.1145\/160688.160758"},{"key":"S1351324922000535_ref75","unstructured":"Narang, S. , Raffel, C. , Lee, K. , Roberts, A. , Fiedel, N. and Malkan, K. (2020). Wt5?! training text-to-text models to explain their predictions. CoRR, abs\/2004.14546. https:\/\/arxiv.org\/abs\/2004.14546."},{"key":"S1351324922000535_ref85","doi-asserted-by":"publisher","DOI":"10.1017\/S1930297500002205"},{"key":"S1351324922000535_ref31","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1128"},{"key":"S1351324922000535_ref43","unstructured":"Graham, Y. , Baldwin, T. , Moffat, A. and Zobel, J. (2013). Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp. 33\u201341."},{"key":"S1351324922000535_ref5","unstructured":"Belz, A. and Reiter, E. (2006). Comparing automatic and human evaluation of nlg systems. 
In 11th Conference of the European Chapter of the Association for Computational Linguistics."},{"key":"S1351324922000535_ref89","doi-asserted-by":"publisher","DOI":"10.1108\/10748120910998425"},{"key":"S1351324922000535_ref7","doi-asserted-by":"publisher","DOI":"10.1016\/S0895-4356(00)00314-0"},{"key":"S1351324922000535_ref104","doi-asserted-by":"publisher","DOI":"10.4300\/JGME-D-12-00156.1"},{"key":"S1351324922000535_ref33","doi-asserted-by":"publisher","DOI":"10.3758\/BRM.41.4.1149"},{"key":"S1351324922000535_ref109","first-page":"77","article-title":"On the ethics of crowdsourced research","volume":"49","author":"Williamson","year":"2016","journal-title":"PS: Political Science and Politics"},{"key":"S1351324922000535_ref50","first-page":"70","article-title":"Doing a pilot study: Why is it essential?","volume":"1","author":"Hassan","year":"2006","journal-title":"Malaysian Family Physician: The Official Journal of the Academy of Family Physicians of Malaysia"},{"key":"S1351324922000535_ref90","doi-asserted-by":"publisher","DOI":"10.1097\/00001648-199001000-00010"},{"key":"S1351324922000535_ref106","doi-asserted-by":"crossref","unstructured":"van der Lee, C. , Gatt, A. , van Miltenburg, E. , Wubben, S. and Krahmer, E. (2019). Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, Tokyo, Japan, October 2019. Association for Computational Linguistics, pp. 355\u2013368. doi: 10.18653\/v1\/W19-8643. 
Available at https:\/\/www.aclweb.org\/anthology\/W19-8643.","DOI":"10.18653\/v1\/W19-8643"},{"key":"S1351324922000535_ref2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijhcs.2015.08.010"},{"key":"S1351324922000535_ref88","doi-asserted-by":"publisher","DOI":"10.1080\/03610918.2014.931971"},{"key":"S1351324922000535_ref26","doi-asserted-by":"publisher","DOI":"10.1007\/b97673"},{"key":"S1351324922000535_ref64","doi-asserted-by":"publisher","DOI":"10.31234\/osf.io\/nfc45"},{"key":"S1351324922000535_ref25","volume-title":"Nonparametric Statistics: A Step-by-Step Approach","author":"Corder","year":"2014"},{"key":"S1351324922000535_ref41","doi-asserted-by":"publisher","DOI":"10.1098\/rsta.2018.0081"},{"key":"S1351324922000535_ref44","doi-asserted-by":"publisher","DOI":"10.1145\/3152832.3152859"},{"key":"S1351324922000535_ref61","doi-asserted-by":"publisher","DOI":"10.1111\/j.1365-2929.2004.02012.x"},{"key":"S1351324922000535_ref22","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.eacl-main.202"},{"key":"S1351324922000535_ref52","volume-title":"Generalized Additive Models","author":"Hastie","year":"1990"},{"key":"S1351324922000535_ref102","volume-title":"Applied Nonparametric Statistical Methods","author":"Sprent","year":"2012"},{"key":"S1351324922000535_ref96","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.blackboxnlp-1.3"},{"key":"S1351324922000535_ref95","doi-asserted-by":"publisher","DOI":"10.1145\/3531146.3533127"},{"key":"S1351324922000535_ref30","doi-asserted-by":"publisher","DOI":"10.1162\/COLI_a_00222"},{"key":"S1351324922000535_ref40","doi-asserted-by":"publisher","DOI":"10.1016\/j.intcom.2010.04.004"},{"key":"S1351324922000535_ref103","doi-asserted-by":"publisher","DOI":"10.1378\/chest.11-0523"},{"key":"S1351324922000535_ref108","doi-asserted-by":"publisher","DOI":"10.3115\/1626355.1626368"},{"key":"S1351324922000535_ref59","unstructured":"Iskender, N. , Polzehl, T. and M\u00f6ller, S. (2021). 
Reliability of human evaluation for text summarization: Lessons learned and challenges ahead. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), Online, April 2021. Association for Computational Linguistics, pp. 86\u201396. Available at https:\/\/aclanthology.org\/2021.humeval-1.10."},{"key":"S1351324922000535_ref66","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1604"},{"key":"S1351324922000535_ref69","doi-asserted-by":"crossref","unstructured":"Mathur, N. , Baldwin, T. and Cohn, T. (2020). Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online. Association for Computational Linguistics, pp. 4984\u20134997. doi: 10.18653\/v1\/2020.acl-main.448. Available at https:\/\/aclanthology.org\/2020.acl-main.448.","DOI":"10.18653\/v1\/2020.acl-main.448"},{"key":"S1351324922000535_ref82","doi-asserted-by":"publisher","DOI":"10.1093\/oxfordjournals.aje.a009501"},{"key":"S1351324922000535_ref45","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1052"},{"key":"S1351324922000535_ref58","doi-asserted-by":"crossref","unstructured":"Iskender, N. , Polzehl, T. and M\u00f6ller, S. (2020). Best practices for crowd-based evaluation of German summarization: Comparing crowd, expert and automatic evaluation. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Online, November 2020. Association for Computational Linguistics, pp. 164\u2013175. doi: 10.18653\/v1\/2020.eval4nlp-1.16. Available at https:\/\/aclanthology.org\/2020.eval4nlp-1.16.","DOI":"10.18653\/v1\/2020.eval4nlp-1.16"},{"key":"S1351324922000535_ref68","doi-asserted-by":"crossref","unstructured":"Liu, C.-W. , Lowe, R. , Serban, I. , Noseworthy, M. , Charlin, L. and Pineau, J. (2016). How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. 
In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, November 2016. Association for Computational Linguistics, pp. 2122\u20132132. doi: 10.18653\/v1\/D16-1230. Available at https:\/\/www.aclweb.org\/anthology\/D16-1230.","DOI":"10.18653\/v1\/D16-1230"},{"key":"S1351324922000535_ref77","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1097"},{"key":"S1351324922000535_ref84","unstructured":"Palmer, J.C. and Strickland, J. (2016). A beginners guide to crowdsourcing\u2014strengths, limitations and best practice for psychological research. Psychological Science Agenda. Available at https:\/\/www.apa.org\/science\/about\/psa\/2016\/06\/changing-mind."},{"key":"S1351324922000535_ref115","first-page":"324","article-title":"Rank analysis of incomplete block designs","volume":"39","author":"Bradley","year":"1952","journal-title":"Biometrika"},{"key":"S1351324922000535_ref29","unstructured":"Divjak, D. and Baayen, H. (2017). Ordinal gamms: A new window on human ratings. In Each Venture, a New Beginning: Studies in Honor of Laura A. Janda, pp. 39\u201356."},{"key":"S1351324922000535_ref62","volume-title":"Linear and Generalized Linear Mixed Models and their Applications","author":"Jiang","year":"2007"},{"key":"S1351324922000535_ref78","doi-asserted-by":"crossref","first-page":"134","DOI":"10.4103\/2231-4040.116779","article-title":"Informed consent: Issues and challenges","volume":"4","author":"Nijhawan","year":"2013","journal-title":"Journal of Advanced Pharmaceutical Technology and Research"},{"key":"S1351324922000535_ref99","unstructured":"Sedoc, J. and Ungar, L. (2020). Item response theory for efficient human evaluation of chatbots. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Online, November 2020. Association for Computational Linguistics, pp. 21\u201333. doi: 10.18653\/v1\/2020.eval4nlp-1.3. 
Available at https:\/\/aclanthology.org\/2020.eval4nlp-1.3."},{"key":"S1351324922000535_ref100","doi-asserted-by":"publisher","DOI":"10.1177\/1468017303003001002"},{"key":"S1351324922000535_ref110","doi-asserted-by":"publisher","DOI":"10.20982\/tqmp.03.2.p043"},{"key":"S1351324922000535_ref38","volume-title":"How to Design and Report Experiments","author":"Field","year":"2002"},{"key":"S1351324922000535_ref87","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00167"},{"key":"S1351324922000535_ref34","volume-title":"Lectures on Biostatistics: An Introduction to Statistics with Applications in Biology and Medicine","author":"Colquhoun","year":"1971"},{"key":"S1351324922000535_ref47","doi-asserted-by":"publisher","DOI":"10.1016\/j.cptl.2015.08.001"},{"key":"S1351324922000535_ref53","author":"Herbrich","year":"2006"},{"key":"S1351324922000535_ref83","unstructured":"Owczarzak, K. , Conroy, J.M. , Dang, H.T. and Nenkova, A. (2012). An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, Montr\u00e9al, Canada, June 2012. Association for Computational Linguistics, pp. 1\u20139. 
Available at https:\/\/www.aclweb.org\/anthology\/W12-2601."},{"key":"S1351324922000535_ref105","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2020.101151"},{"key":"S1351324922000535_ref11","doi-asserted-by":"publisher","DOI":"10.1191\/1478088706qp063oa"},{"key":"S1351324922000535_ref79","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-2012"},{"key":"S1351324922000535_ref118","doi-asserted-by":"publisher","DOI":"10.3758\/BF03193146"},{"key":"S1351324922000535_ref3","author":"Bausell","year":"2002"},{"key":"S1351324922000535_ref16","doi-asserted-by":"publisher","DOI":"10.14705\/rpnet.2015.000318"},{"key":"S1351324922000535_ref93","doi-asserted-by":"publisher","DOI":"10.2298\/PSI1004441S"},{"key":"S1351324922000535_ref73","volume-title":"Design and Analysis of Experiments","author":"Montgomery","year":"2017"},{"key":"S1351324922000535_ref4","doi-asserted-by":"crossref","unstructured":"Belz, A. , Mille, S. and Howcroft, D.M. Disentangling the properties of human evaluation methods: A classification system to support comparability, meta-evaluation and reproducibility testing. In Proceedings of the 13th International Conference on Natural Language Generation, Dublin, Ireland, December 2020. Association for Computational Linguistics, pp. 183\u2013194. Available at https:\/\/aclanthology.org\/2020.inlg-1.24.","DOI":"10.18653\/v1\/2020.inlg-1.24"},{"key":"S1351324922000535_ref56","doi-asserted-by":"publisher","DOI":"10.1177\/1049732305276687"},{"key":"S1351324922000535_ref17","doi-asserted-by":"publisher","DOI":"10.1111\/j.1365-2923.2008.03172.x"},{"key":"S1351324922000535_ref86","unstructured":"Paulus, R. , Xiong, C. and Socher, R. (2018). A deep reinforced model for abstractive summarization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30\u2013May 3, 2018, Conference Track Proceedings. OpenReview.net. 
Available at https:\/\/openreview.net\/forum?id=HkAClQgA-."},{"key":"S1351324922000535_ref101","doi-asserted-by":"publisher","DOI":"10.36965\/OJAKM.2020.8(1)16-31"},{"key":"S1351324922000535_ref65","doi-asserted-by":"publisher","DOI":"10.3389\/fpsyg.2018.02220"},{"key":"S1351324922000535_ref23","first-page":"20","article-title":"The effect size index: d","volume":"2","author":"Cohen","year":"1988","journal-title":"Statistical Power Analysis for the Behavioral Sciences"},{"key":"S1351324922000535_ref37","volume-title":"Discovering Statistics Using IBM SPSS Statistics","author":"Field","year":"2013"},{"key":"S1351324922000535_ref74","doi-asserted-by":"publisher","DOI":"10.1016\/S1047-2797(98)00003-9"},{"key":"S1351324922000535_ref67","doi-asserted-by":"publisher","DOI":"10.1037\/0003-066X.48.12.1181"},{"key":"S1351324922000535_ref1","doi-asserted-by":"publisher","DOI":"10.1037\/0003-066X.57.12.1060"},{"key":"S1351324922000535_ref13","doi-asserted-by":"publisher","DOI":"10.1177\/1745691610393980"},{"key":"S1351324922000535_ref49","unstructured":"Hashimoto, T. , Zhang, H. and Liang, P. (2019). Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, June 2019. Association for Computational Linguistics, pp. 1689\u20131701. doi: 10.18653\/v1\/N19-1169. Available at https:\/\/www.aclweb.org\/anthology\/N19-1169."},{"key":"S1351324922000535_ref46","unstructured":"Han, L. , Jones, G.J.F. and Smeaton, A.F. (2021). Translation quality assessment: A brief survey on manual and automatic methods. CoRR, abs\/2105.03311. 
Available at https:\/\/arxiv.org\/abs\/2105.03311."},{"key":"S1351324922000535_ref72","doi-asserted-by":"publisher","DOI":"10.1037\/a0028085"},{"key":"S1351324922000535_ref71","first-page":"4000","volume-title":"COGCAM: Contact-Free Measurement of Cognitive Stress During Computer Tasks with a Digital Camera","author":"McDuff","year":"2016"},{"key":"S1351324922000535_ref92","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-8610"},{"key":"S1351324922000535_ref27","volume-title":"Scale Development: Theory and Applications","author":"DeVellis","year":"2016"},{"key":"S1351324922000535_ref107","doi-asserted-by":"publisher","DOI":"10.7748\/ns2002.06.16.40.33.c3214"},{"key":"S1351324922000535_ref111","unstructured":"Winter, B. Linear models and linear mixed effects models in R with linguistic applications. CoRR, abs\/1308.5499, 2013. Available at http:\/\/arxiv.org\/abs\/1308.5499."},{"key":"S1351324922000535_ref112","doi-asserted-by":"publisher","DOI":"10.1201\/9781315370279"},{"key":"S1351324922000535_ref94","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.575"},{"key":"S1351324922000535_ref15","unstructured":"Callison-Burch, C. , Osborne, M. and Koehn, P. (2006). Re-evaluating the role of Bleu in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, April 2006. Association for Computational Linguistics. Available at https:\/\/www.aclweb.org\/anthology\/E06-1032."},{"key":"S1351324922000535_ref113","unstructured":"World Medical Association. (2018). Wma declaration of helsinki \u2013 ethical principles for medical research involving human subjects. 
Available at https:\/\/www.wma.net\/policies-post\/wma-declaration-of-helsinki-ethical-principles-for-medical-rese-arch-involving-human-subjects\/."},{"key":"S1351324922000535_ref81","doi-asserted-by":"publisher","DOI":"10.4028\/www.scientific.net\/AMM.611.115"},{"key":"S1351324922000535_ref91","doi-asserted-by":"crossref","unstructured":"Sakaguchi, K. , Post, M. and Van Durme, B. (2014). Efficient elicitation of annotations for human evaluation of machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics, pp. 1\u201311. doi: 10.3115\/v1\/W14-3301. Available at https:\/\/www.aclweb.org\/anthology\/W14-3301.","DOI":"10.3115\/v1\/W14-3301"},{"key":"S1351324922000535_ref76","doi-asserted-by":"publisher","DOI":"10.2307\/2344614"},{"key":"S1351324922000535_ref97","unstructured":"Secar\u0103, A. (2005). Translation evaluation: A state of the art survey. In Proceedings of the eCoLoRe\/MeLLANGE Workshop, Leeds, vol. 39. Citeseer, p. 44."},{"key":"S1351324922000535_ref98","unstructured":"Sedoc, J. , Ippolito, D. , Kirubarajan, A. , Thirani, J. , Ungar, L. and Callison-Burch, C. (2019). ChatEval: A tool for chatbot evaluation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota, June 2019. Association for Computational Linguistics, pp. 60\u201365. doi: 10.18653\/v1\/N19-4011. 
Available at https:\/\/www.aclweb.org\/anthology\/N19-4011."},{"key":"S1351324922000535_ref19","first-page":"55","article-title":"An experiment in evaluating the quality of translations","volume":"9","author":"Carroll","year":"1966","journal-title":"Mechanical Translation and Computational Linguistics"},{"key":"S1351324922000535_ref28","first-page":"11","article-title":"Five-point likert items: t test versus mann-whitney-wilcoxon (addendum added october 2012)","volume":"15","author":"De Winter","year":"2010","journal-title":"Practical Assessment, Research, and Evaluation"},{"key":"S1351324922000535_ref51","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-84858-7"},{"key":"S1351324922000535_ref39","doi-asserted-by":"publisher","DOI":"10.1093\/idpl\/ipz026"},{"key":"S1351324922000535_ref54","unstructured":"Howcroft, D.M. , Belz, A. , Clinciu, M.-A. , Gkatzia, D. , Hasan, S.A. , Mahamood, S. , Mille, S. , van Miltenburg, E. , Santhanam, S. and Rieser, V. (2020). Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions. In Proceedings of the 13th International Conference on Natural Language Generation, Dublin, Ireland, December 2020. Association for Computational Linguistics, pp. 169\u2013182. Available at https:\/\/www.aclweb.org\/anthology\/2020.inlg-1.23."},{"key":"S1351324922000535_ref20","unstructured":"Chen, A. , Stanovsky, G. , Singh, S. and Gardner, M. (2019). Evaluating question answering evaluation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 119\u2013124."},{"key":"S1351324922000535_ref8","doi-asserted-by":"publisher","DOI":"10.3389\/fpubh.2018.00149"},{"key":"S1351324922000535_ref18","unstructured":"Carpinella, C.M. , Wyman, A.B. , Perez, M.A. and Stroessner, S.J. (2017). The Robotic Social Attributes Scale (RoSAS): Development and validation. In Proceedings of the 2017 ACM\/IEEE International Conference on Human-Robot Interaction, Vienna Austria, March 2017. 
ACM, pp. 254\u2013262. ISBN 978-1-4503-4336-7. doi: 10.1145\/2909824.3020208. Available at https:\/\/dl.acm.org\/doi\/10.1145\/2909824.3020208."},{"key":"S1351324922000535_ref48","first-page":"139","article-title":"Development of nasa-tlx (task load index): Results of empirical and theoretical research","author":"Hart","year":"1988","journal-title":"In Advances in Psychology"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324922000535","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,12,6]],"date-time":"2023-12-06T09:14:15Z","timestamp":1701854055000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324922000535\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,6]]},"references-count":111,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2023,9]]}},"alternative-id":["S1351324922000535"],"URL":"https:\/\/doi.org\/10.1017\/s1351324922000535","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,6]]},"assertion":[{"value":"\u00a9 The Author(s), 2023. 
Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https:\/\/creativecommons.org\/licenses\/by\/4.0\/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.","name":"license","label":"License","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This content has been made available to all.","name":"free","label":"Free to read"}]}}