{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T06:03:15Z","timestamp":1772863395162,"version":"3.50.1"},"reference-count":54,"publisher":"Wiley","issue":"1","license":[{"start":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T00:00:00Z","timestamp":1764633600000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/onlinelibrary.wiley.com\/termsAndConditions#vor"},{"start":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T00:00:00Z","timestamp":1764633600000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/doi.wiley.com\/10.1002\/tdm_license_1.1"}],"content-domain":{"domain":["onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["Computer Assisted Learning"],"published-print":{"date-parts":[[2026,2]]},"abstract":"<jats:title>ABSTRACT<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Background<\/jats:title>\n                    <jats:p>Sustainability education emphasises critical thinking and interdisciplinary understanding, making the assessment of students' learning outcomes complex. While Large Language Models (LLMs) have shown promise in educational assessment, their reliability in domains requiring contextual reasoning\u2014such as sustainability\u2014remains unclear.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Objectives<\/jats:title>\n                    <jats:p>This study aims to evaluate the agreement between human raters and several LLMs (GPT\u20104o, Gemini 2.0 Flash, DeepSeek V3, LLaMA 3.3) in assessing short\u2010answer responses from a university\u2010level Sustainability course. It also investigates how this agreement varies across cognitive skill levels.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Methods<\/jats:title>\n                    <jats:p>A total of 232 short\u2010answer responses were evaluated using a rubric aligned with Bloom's Revised Taxonomy. Consensus scores from human raters were compared to LLM\u2010generated scores using multiple statistical measures, including Quadratic Weighted Kappa (QWK), Intraclass Correlation Coefficient (ICC), Pearson correlation, and distributional overlap.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>\n                      Moderate agreement was found between LLMs and human raters in total scores (QWK: 0.585\u20130.640;\n                      <jats:italic>r<\/jats:italic>\n                      : 0.660\u20130.668; : 0.681\u20130.803). Inter\u2010rater reliability among humans was good to excellent (ICC: 0.667\u20130.800). Criterion\u2010level agreement declined as cognitive complexity increased, with notably low agreement on evaluating higher\u2010order skills.\n                    <\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Conclusions<\/jats:title>\n                    <jats:p>Overall, LLM\u2013human agreement was moderate on total scores but declined at higher cognitive levels, indicating that LLMs are suitable for basic comprehension checks while human oversight remains necessary for complex reasoning.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1002\/jcal.70160","type":"journal-article","created":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T08:20:48Z","timestamp":1764663648000},"update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Automatic Short\u2010Answer Grading in Sustainability Education:\n                    <scp>AI<\/scp>\n                    \u2013Human Agreement"],"prefix":"10.1002","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3970-4406","authenticated-orcid":false,"given":"Emrah","family":"Emirtekin","sequence":"first","affiliation":[{"name":"Center for Distance Education Application and Research, Ege University  \u0130zmir Turkey"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0831-6985","authenticated-orcid":false,"given":"Yasin","family":"\u00d6zarslan","sequence":"additional","affiliation":[{"name":"Department of Science Culture Yasar University  \u0130zmir Turkey"}]}],"member":"311","published-online":{"date-parts":[[2025,12,2]]},"reference":[{"key":"e_1_2_13_2_1","doi-asserted-by":"publisher","DOI":"10.3390\/SU15108340"},{"key":"e_1_2_13_3_1","doi-asserted-by":"publisher","DOI":"10.3390\/COMPUTERS14030100"},{"key":"e_1_2_13_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/S10639\u2010025\u201013553\u20101"},{"key":"e_1_2_13_5_1","doi-asserted-by":"publisher","DOI":"10.61969\/JAI.1337500"},{"key":"e_1_2_13_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/J.ASW.2023.100745"},{"key":"e_1_2_13_7_1","doi-asserted-by":"publisher","DOI":"10.4324\/9781315852249"},{"key":"e_1_2_13_8_1","doi-asserted-by":"publisher","DOI":"10.1057\/S41599\u2010023\u201002269\u20107"},{"key":"e_1_2_13_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3442188.3445922"},{"key":"e_1_2_13_10_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.463"},{"key":"e_1_2_13_11_1","volume-title":"Taxonomy of Educational Objectives, Handbook I","author":"Bloom B. S.","year":"1956"},{"key":"e_1_2_13_12_1","doi-asserted-by":"publisher","DOI":"10.3389\/FEDUC.2018.00022"},{"key":"e_1_2_13_13_1","first-page":"1877","article-title":"Language Models Are Few\u2010Shot Learners","volume":"33","author":"Brown T. B.","year":"2020","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_13_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/S40593\u2010014\u20100026\u20108"},{"key":"e_1_2_13_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patter.2025.101260"},{"key":"e_1_2_13_16_1","doi-asserted-by":"publisher","DOI":"10.1037\/1040\u20103590.6.4.284"},{"key":"e_1_2_13_17_1","doi-asserted-by":"publisher","DOI":"10.1080\/14703297.2023.2190148"},{"key":"e_1_2_13_18_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_2_13_19_1","doi-asserted-by":"publisher","DOI":"10.3390\/APP15105683"},{"key":"e_1_2_13_20_1","doi-asserted-by":"publisher","DOI":"10.1080\/14703297.2023.2195846"},{"key":"e_1_2_13_21_1","doi-asserted-by":"publisher","DOI":"10.1111\/emip.12537"},{"key":"e_1_2_13_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/S11023\u2010018\u20109482\u20105"},{"key":"e_1_2_13_23_1","doi-asserted-by":"publisher","DOI":"10.1186\/S12909\u2010024\u201006026\u20105"},{"key":"e_1_2_13_24_1","doi-asserted-by":"publisher","DOI":"10.3390\/APP15020581"},{"key":"e_1_2_13_25_1","doi-asserted-by":"publisher","DOI":"10.2196\/52113"},{"key":"e_1_2_13_26_1","doi-asserted-by":"publisher","DOI":"10.1080\/13504622.2015.1011084"},{"key":"e_1_2_13_27_1","doi-asserted-by":"publisher","DOI":"10.30191\/ETS.202304_26(2).0014"},{"key":"e_1_2_13_28_1","doi-asserted-by":"publisher","DOI":"10.1111\/JCAL.70072"},{"key":"e_1_2_13_29_1","doi-asserted-by":"publisher","DOI":"10.1016\/J.EDUREV.2007.05.002"},{"key":"e_1_2_13_30_1","doi-asserted-by":"publisher","DOI":"10.1016\/J.LINDIF.2023.102274"},{"key":"e_1_2_13_31_1","article-title":"Large Language Models Are Zero\u2010Shot Reasoners","volume":"35","author":"Kojima T.","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_13_32_1","doi-asserted-by":"publisher","DOI":"10.1016\/J.JCM.2016.02.012"},{"key":"e_1_2_13_33_1","doi-asserted-by":"publisher","DOI":"10.1207\/S15430421TIP4104_2"},{"key":"e_1_2_13_34_1","doi-asserted-by":"publisher","DOI":"10.2307\/2529310"},{"key":"e_1_2_13_35_1","doi-asserted-by":"publisher","DOI":"10.14742\/AJET.9463"},{"key":"e_1_2_13_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3560815"},{"key":"e_1_2_13_37_1","doi-asserted-by":"publisher","DOI":"10.3389\/FEDUC.2024.1328769"},{"key":"e_1_2_13_38_1","unstructured":"Marcus G.2020.\u201cThe Next Decade in AI: Four Steps Towards Robust Artificial Intelligence.\u201dhttps:\/\/arxiv.org\/pdf\/2002.06177."},{"key":"e_1_2_13_39_1","doi-asserted-by":"publisher","DOI":"10.18574\/nyu\/9781479833641.001.0001"},{"key":"e_1_2_13_40_1","doi-asserted-by":"publisher","DOI":"10.1016\/J.CAEAI.2024.100234"},{"key":"e_1_2_13_41_1","doi-asserted-by":"publisher","DOI":"10.3389\/FPSYG.2019.01089"},{"key":"e_1_2_13_42_1","first-page":"121","volume-title":"International Advances in Writing Research: Cultures, Places, Measures","author":"Perelman L.","year":"2020"},{"key":"e_1_2_13_43_1","unstructured":"Radford A. J.Wu R.Child D.Luan D.Amodei andI.Sutskever.2019.\u201cLanguage Models are Unsupervised Multitask Learners (Technical Report).\u201dhttps:\/\/cdn.openai.com\/better\u2010language\u2010models\/language_models_are_unsupervised_multitask_learners.pdf."},{"key":"e_1_2_13_44_1","doi-asserted-by":"publisher","DOI":"10.1371\/JOURNAL.PONE.0297521"},{"key":"e_1_2_13_45_1","doi-asserted-by":"publisher","DOI":"10.1080\/02602930902862859"},{"key":"e_1_2_13_46_1","doi-asserted-by":"publisher","DOI":"10.1080\/0260293042000264262"},{"key":"e_1_2_13_47_1","unstructured":"Sahoo P. A. K.Singh S.Saha V.Jain S.Mondal andA.Chadha.2024.\u201cA Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications.\u201dhttps:\/\/arxiv.org\/pdf\/2402.07927."},{"key":"e_1_2_13_48_1","doi-asserted-by":"publisher","DOI":"10.30935\/JDET\/14027"},{"key":"e_1_2_13_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/0-306-48515-X_5"},{"key":"e_1_2_13_50_1","doi-asserted-by":"publisher","DOI":"10.3390\/SU16229855"},{"key":"e_1_2_13_51_1","doi-asserted-by":"publisher","DOI":"10.54675\/CGBA9153"},{"key":"e_1_2_13_52_1","first-page":"24824","article-title":"Chain\u2010Of\u2010Thought Prompting Elicits Reasoning in Large Language Models","volume":"35","author":"Wei J.","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_13_53_1","doi-asserted-by":"publisher","DOI":"10.1108\/OTH\u201011\u20102024\u20100079"},{"key":"e_1_2_13_54_1","doi-asserted-by":"crossref","unstructured":"Xie W. J.Niu C. J.Xue andN.Guan.2024.\u201cGrade Like a Human: Rethinking Automated Assessment With Large Language Models.\u201dhttps:\/\/arxiv.org\/pdf\/2405.19694.","DOI":"10.1145\/3769002.3769962"},{"key":"e_1_2_13_55_1","volume-title":"Learning Fair Representations","author":"Zemel R.","year":"2013"}],"container-title":["Journal of Computer Assisted Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/jcal.70160","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/full-xml\/10.1002\/jcal.70160","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/jcal.70160","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,17]],"date-time":"2026-02-17T15:25:20Z","timestamp":1771341920000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/jcal.70160"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,2]]},"references-count":54,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,2]]}},"alternative-id":["10.1002\/jcal.70160"],"URL":"https:\/\/doi.org\/10.1002\/jcal.70160","archive":["Portico"],"relation":{},"ISSN":["0266-4909","1365-2729"],"issn-type":[{"value":"0266-4909","type":"print"},{"value":"1365-2729","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,2]]},"assertion":[{"value":"2025-05-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-20","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-02","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"e70160"}}