{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T15:37:28Z","timestamp":1772725048331,"version":"3.50.1"},"publisher-location":"Cham","reference-count":35,"publisher":"Springer Nature Switzerland","isbn-type":[{"value":"9783031635359","type":"print"},{"value":"9783031635366","type":"electronic"}],"license":[{"start":{"date-parts":[[2024,1,1]],"date-time":"2024-01-01T00:00:00Z","timestamp":1704067200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,17]],"date-time":"2024-07-17T00:00:00Z","timestamp":1721174400000},"content-version":"vor","delay-in-days":198,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Evaluating the quality of arguments is a crucial aspect of any system leveraging argument mining. However, it is a challenge to obtain reliable and consistent annotations regarding argument quality, as this usually requires domain-specific expertise of the annotators. Even among experts, the assessment of argument quality is often inconsistent due to the inherent subjectivity of this task. In this paper, we study the potential of using state-of-the-art large language models (LLMs) as proxies for argument quality annotators. To assess the capability of LLMs in this regard, we analyze the agreement between model, human expert, and human novice annotators based on an established taxonomy of argument quality dimensions. Our findings highlight that LLMs can produce consistent annotations, with a moderately high agreement with human experts across most of the quality dimensions. 
Moreover, we show that using LLMs as additional annotators can significantly improve the agreement between annotators. These results suggest that LLMs can serve as a valuable tool for automated argument quality assessment, thus streamlining and accelerating the evaluation of large argument datasets.<\/jats:p>","DOI":"10.1007\/978-3-031-63536-6_8","type":"book-chapter","created":{"date-parts":[[2024,7,16]],"date-time":"2024-07-16T05:01:50Z","timestamp":1721106110000},"page":"129-146","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Are Large Language Models Reliable Argument Quality Annotators?"],"prefix":"10.1007","author":[{"given":"Nailia","family":"Mirzakhmedova","sequence":"first","affiliation":[]},{"given":"Marcel","family":"Gohsen","sequence":"additional","affiliation":[]},{"given":"Chia Hao","family":"Chang","sequence":"additional","affiliation":[]},{"given":"Benno","family":"Stein","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,7,17]]},"reference":[{"key":"8_CR1","unstructured":"Anil, R., Dai, A.M., Firat, O., Johnson, M., et\u00a0al.: PaLM 2 Technical Report. CoRR abs\/2305.10403 (2023)"},{"key":"8_CR2","unstructured":"Brown, T.B., Mann, B., Ryder, N., Subbiah, M., et\u00a0al.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 6\u201312 December 2020, virtual (2020)"},{"key":"8_CR3","doi-asserted-by":"crossref","unstructured":"Carlile, W., Gurrapadi, N., Ke, Z., Ng, V.: Give me more feedback: annotating argument persuasiveness and related attributes in student essays. In: Gurevych, I., Miyao, Y. (eds.) 
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, 15\u201320 July 2018, Volume 1: Long Papers, pp. 621\u2013631. Association for Computational Linguistics (2018)","DOI":"10.18653\/v1\/P18-1058"},{"key":"8_CR4","unstructured":"Chen, G., Cheng, L., Tuan, L.A., Bing, L.: Exploring the Potential of Large Language Models in Computational Argumentation. CoRR abs\/2311.09022 (2023)"},{"key":"8_CR5","doi-asserted-by":"crossref","unstructured":"Chiang, D.C., Lee, H.: Can large language models be an alternative to human evaluations? In: Rogers, A., Boyd-Graber, J.L., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, 9\u201314 July 2023, pp. 15607\u201315631. Association for Computational Linguistics (2023)","DOI":"10.18653\/v1\/2023.acl-long.870"},{"key":"8_CR6","unstructured":"Chowdhery, A., Narang, S., Devlin, J., Bosma, M., et\u00a0al.: PaLM: Scaling Language Modeling with Pathways. CoRR abs\/2204.02311 (2022)"},{"key":"8_CR7","unstructured":"Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2\u20137 June 2019, Volume 1 (Long and Short Papers), pp. 4171\u20134186. Association for Computational Linguistics (2019)"},{"key":"8_CR8","unstructured":"Ding, B., et al.: Is GPT-3 a good data annotator? In: Rogers, A., Boyd-Graber, J.L., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, 9\u201314 July 2023, pp. 11173\u201311195. 
Association for Computational Linguistics (2023)"},{"key":"8_CR9","unstructured":"Faggioli, G., et al.: Perspectives on large language models for relevance judgment. In: Yoshioka, M., Kiseleva, J., Aliannejadi, M. (eds.) Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR 2023, Taipei, Taiwan, 23 July 2023, pp. 39\u201350. ACM (2023)"},{"key":"8_CR10","unstructured":"Gao, M., Ruan, J., Sun, R., Yin, X., Yang, S., Wan, X.: Human-like Summarization Evaluation with ChatGPT. CoRR abs\/2304.02554 (2023)"},{"key":"8_CR11","doi-asserted-by":"crossref","unstructured":"Gilardi, F., Alizadeh, M., Kubli, M.: ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks. CoRR abs\/2303.15056 (2023)","DOI":"10.1073\/pnas.2305016120"},{"key":"8_CR12","doi-asserted-by":"crossref","unstructured":"Guo, J., Cheng, L., Zhang, W., Kok, S., Li, X., Bing, L.: AQE: argument quadruplet extraction via a quad-tagging augmented generative approach. In: Rogers, A., Boyd-Graber, J.L., Okazaki, N. (eds.) Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, 9\u201314 July 2023, pp. 932\u2013946. Association for Computational Linguistics (2023)","DOI":"10.18653\/v1\/2023.findings-acl.59"},{"key":"8_CR13","doi-asserted-by":"crossref","unstructured":"Habernal, I., Gurevych, I.: Which argument is more convincing? Analyzing and predicting convincingness of web arguments using bidirectional LSTM. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, 7\u201312 August 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics (2016)","DOI":"10.18653\/v1\/P16-1150"},{"key":"8_CR14","unstructured":"Holtzman, A., Buys, J., Forbes, M., Choi, Y.: The Curious Case of Neural Text Degeneration. 
CoRR abs\/1904.09751 (2019)"},{"key":"8_CR15","doi-asserted-by":"crossref","unstructured":"Huo, S., Arabzadeh, N., Clarke, C.L.A.: Retrieving supporting evidence for generative question answering. In: Ai, Q., Liu, Y., Moffat, A., Huang, X., Sakai, T., Zobel, J. (eds.) Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, SIGIR-AP 2023, Beijing, China, 26\u201328 November 2023, pp. 11\u201320. ACM (2023)","DOI":"10.1145\/3624918.3625336"},{"key":"8_CR16","doi-asserted-by":"crossref","unstructured":"Kamalloo, E., Dziri, N., Clarke, C.L.A., Rafiei, D.: Evaluating open-domain question answering in the era of large language models. In: Rogers, A., Boyd-Graber, J.L., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, 9\u201314 July 2023, pp. 5591\u20135606. Association for Computational Linguistics (2023)","DOI":"10.18653\/v1\/2023.acl-long.307"},{"key":"8_CR17","unstructured":"Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, 28 November\u20139 December 2022 (2022)"},{"key":"8_CR18","unstructured":"Magooda, A., Helyar, A., Jackson, K., Sullivan, D., et\u00a0al.: A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications. CoRR abs\/2310.17750 (2023)"},{"key":"8_CR19","unstructured":"Marro, S., Cabrio, E., Villata, S.: Argumentation quality assessment: an argument mining approach. In: ECA 2022-European Conference on Argumentation (2022)"},{"key":"8_CR20","unstructured":"OpenAI: ChatGPT (2022)"},{"key":"8_CR21","unstructured":"OpenAI: GPT-4 Technical Report. 
CoRR abs\/2303.08774 (2023)"},{"key":"8_CR22","doi-asserted-by":"crossref","unstructured":"Park, J., Cardie, C.: Identifying appropriate support for propositions in online user comments. In: Proceedings of the First Workshop on Argument Mining, hosted by the 52nd Annual Meeting of the Association for Computational Linguistics, ArgMining@ACL 2014, 26 June 2014, Baltimore, Maryland, USA, pp. 29\u201338. The Association for Computer Linguistics (2014)","DOI":"10.3115\/v1\/W14-2105"},{"key":"8_CR23","doi-asserted-by":"crossref","unstructured":"Persing, I., Ng, V.: Modeling argument strength in student essays. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, 26\u201331 July 2015, Beijing, China, Volume 1: Long Papers, pp. 543\u2013552. The Association for Computer Linguistics (2015)","DOI":"10.3115\/v1\/P15-1053"},{"key":"8_CR24","unstructured":"R\u00f8nningstad, E., Velldal, E., \u00d8vrelid, L.: A GPT among annotators: LLM-based entity-level sentiment annotation. In: Proceedings of the 18th Linguistic Annotation Workshop (LAW-XVIII), pp. 133\u2013139 (2024)"},{"issue":"3","key":"8_CR25","doi-asserted-by":"publisher","first-page":"619","DOI":"10.1162\/COLI_a_00295","volume":"43","author":"C Stab","year":"2017","unstructured":"Stab, C., Gurevych, I.: Parsing argumentation structures in persuasive essays. Comput. Linguist. 43(3), 619\u2013659 (2017)","journal-title":"Comput. Linguist."},{"key":"8_CR26","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511806544","volume-title":"Fallacies and Argument Appraisal","author":"CW Tindale","year":"2007","unstructured":"Tindale, C.W.: Fallacies and Argument Appraisal, 1st edn. 
Cambridge University Press, Cambridge (2007)","edition":"1"},{"key":"8_CR27","doi-asserted-by":"crossref","unstructured":"Toledo, A., Gretz, S., Cohen-Karlik, E., Friedman, R., et\u00a0al.: Automatic argument quality assessment - new datasets and methods. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3\u20137 November 2019, pp. 5624\u20135634. Association for Computational Linguistics (2019)","DOI":"10.18653\/v1\/D19-1564"},{"key":"8_CR28","unstructured":"Touvron, H., Martin, L., Stone, K., Albert, P., et\u00a0al.: Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR abs\/2307.09288 (2023)"},{"key":"8_CR29","doi-asserted-by":"crossref","unstructured":"Wachsmuth, H., et al.: Argumentation quality assessment: theory vs. practice. In: Barzilay, R., Kan, M.Y. (eds.) 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pp. 250\u2013255. Association for Computational Linguistics (2017)","DOI":"10.18653\/v1\/P17-2039"},{"key":"8_CR30","doi-asserted-by":"crossref","unstructured":"Wachsmuth, H., et al.: Computational argumentation quality assessment in natural language. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 176\u2013187 (2017)","DOI":"10.18653\/v1\/E17-1017"},{"key":"8_CR31","doi-asserted-by":"crossref","unstructured":"Wadhwa, S., Amir, S., Wallace, B.C.: Revisiting relation extraction in the era of large language models. In: Rogers, A., Boyd-Graber, J.L., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, 9\u201314 July 2023, pp. 15566\u201315589. 
Association for Computational Linguistics (2023)","DOI":"10.18653\/v1\/2023.acl-long.868"},{"key":"8_CR32","doi-asserted-by":"crossref","unstructured":"Wang, Y., Zhang, Z., Wang, R.: Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method. arXiv preprint arXiv:2305.13412 (2023)","DOI":"10.18653\/v1\/2023.acl-long.482"},{"key":"8_CR33","unstructured":"Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, 28 November\u20139 December 2022 (2022)"},{"key":"8_CR34","doi-asserted-by":"crossref","unstructured":"Yin, P., Deng, B., Chen, E., Vasilescu, B., Neubig, G.: Learning to mine aligned code and natural language pairs from stack overflow. In: Zaidman, A., Kamei, Y., Hill, E. (eds.) Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, Gothenburg, Sweden, 28\u201329 May 2018, pp. 476\u2013486. ACM (2018)","DOI":"10.1145\/3196398.3196408"},{"key":"8_CR35","unstructured":"Zhuo, T.Y.: Large Language Models Are State-of-the-Art Evaluators of Code Generation. 
CoRR abs\/2304.14317 (2023)"}],"container-title":["Lecture Notes in Computer Science","Robust Argumentation Machines"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-031-63536-6_8","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,24]],"date-time":"2024-11-24T06:00:25Z","timestamp":1732428025000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-031-63536-6_8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"ISBN":["9783031635359","9783031635366"],"references-count":35,"URL":"https:\/\/doi.org\/10.1007\/978-3-031-63536-6_8","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"value":"0302-9743","type":"print"},{"value":"1611-3349","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024]]},"assertion":[{"value":"17 July 2024","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"RATIO","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Conference on Advances in Robust Argumentation Machines","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Bielefeld","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Germany","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2024","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"5 June 2024","order":7,"name":"conference_start_date","label":"Conference Start 
Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"7 June 2024","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"1","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"ratio2024","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"http:\/\/ratio.sc.cit-ec.uni-bielefeld.de\/de\/home\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}}]}}