{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T22:11:52Z","timestamp":1773871912115,"version":"3.50.1"},"reference-count":90,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2026,3,16]],"date-time":"2026-03-16T00:00:00Z","timestamp":1773619200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>Large language models (LLMs) show promise for grading open-ended assessments but still exhibit inconsistent accuracy, systematic biases, and limited reliability across assignments. To address these concerns, we introduce SURE (Selective Uncertainty-based Re-Evaluation), a human-in-the-loop pipeline that combines repeated LLM prompting, uncertainty-based flagging, and selective human regrading. Three LLMs\u2014gpt-4.1-nano, gpt-5-nano, and the open-source gpt-oss-20b\u2014graded answers of 46 students to 130 open questions and coding exercises across five assignments. Each student answer was scored 20 times to derive majority-voted predictions and self-consistency-based certainty estimates. We simulated human regrading by flagging low-certainty cases and replacing them with scores from four human graders. We used the first assignment as a training set for tuning certainty thresholds and to explore LLM output diversification via sampling parameters, rubric shuffling, varied personas, multilingual prompts, and post hoc ensembles. We then evaluated the effectiveness and efficiency of SURE on the other four assignments using a fixed certainty threshold. Across assignments, fully automated grading with a single prompt resulted in substantial underscoring, and majority-voting based on 20 prompts improved but did not eliminate this bias. 
Low certainty (i.e., high output diversity) was diagnostic of incorrect LLM scores, enabling targeted human regrading that improved grading accuracy while reducing manual grading time by 40\u201390%. Aggregating responses from all three LLMs in an ensemble improved certainty-based flagging and most consistently approached human-level accuracy, with 70\u201390% of the grades students would receive falling inside human-grader ranges. A reanalysis based on outputs from a more diversified LLM ensemble comprised of gpt-5, codestral-25.01, and llama-3.3-70b-instruct replicated these findings but also suggested that large reasoning models such as gpt-5 might eliminate the need for human oversight of LLM grading entirely. These findings demonstrate that self-consistency-based uncertainty estimation and selective human oversight can substantially improve the reliability and efficiency of AI-assisted grading.<\/jats:p>","DOI":"10.3390\/make8030074","type":"journal-article","created":{"date-parts":[[2026,3,16]],"date-time":"2026-03-16T14:30:32Z","timestamp":1773671432000},"page":"74","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Towards Reliable LLM Grading Through Self-Consistency and Selective Human Review: Higher Accuracy, Less Work"],"prefix":"10.3390","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-2098-8679","authenticated-orcid":false,"given":"Luke","family":"Korthals","sequence":"first","affiliation":[{"name":"Faculty of Social and Behavioural Sciences, University of Amsterdam, 1018 WB Amsterdam, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-1726-7465","authenticated-orcid":false,"given":"Emma","family":"Akrong","sequence":"additional","affiliation":[{"name":"Faculty of Social and Behavioural Sciences, University of Amsterdam, 1018 WB Amsterdam, The 
Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-9705-1180","authenticated-orcid":false,"given":"Gali","family":"Geller","sequence":"additional","affiliation":[{"name":"Faculty of Social and Behavioural Sciences, University of Amsterdam, 1018 WB Amsterdam, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4983-3615","authenticated-orcid":false,"given":"Hannes","family":"Rosenbusch","sequence":"additional","affiliation":[{"name":"Faculty of Social and Behavioural Sciences, University of Amsterdam, 1018 WB Amsterdam, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7458-1272","authenticated-orcid":false,"given":"Raoul","family":"Grasman","sequence":"additional","affiliation":[{"name":"Faculty of Social and Behavioural Sciences, University of Amsterdam, 1018 WB Amsterdam, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3855-2778","authenticated-orcid":false,"given":"Ingmar","family":"Visser","sequence":"additional","affiliation":[{"name":"Faculty of Social and Behavioural Sciences, University of Amsterdam, 1018 WB Amsterdam, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2026,3,16]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Cristea, A.I., Walker, E., Lu, Y., Santos, O.C., and Isotani, S. (2025). Grading University Students with LLMs: Performance and Acceptance of a Canvas-Based Automation. Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium, Blue Sky, and WideAIED, Springer.","DOI":"10.1007\/978-3-031-99261-2"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Gr\u00e9visse, C. (2024). 
LLM-based automatic short answer grading in undergraduate medical education. BMC Med. Educ., 24.","DOI":"10.1186\/s12909-024-06026-5"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1002\/berj.4069","article-title":"Grading exams using large language models: A comparison between human and AI grading of exams in higher education using ChatGPT","volume":"51","year":"2025","journal-title":"Br. Educ. Res. J."},{"key":"ref_4","unstructured":"Ishida, T., Liu, T., Wang, H., and Cheung, W.K. (2024). Large language models as partners in student essay evaluation. arXiv."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"150","DOI":"10.1111\/bjet.13494","article-title":"Utilizing large language models for EFL essay grading: An examination of reliability and validity in rubric-based assessments","volume":"56","author":"Yavuz","year":"2025","journal-title":"Br. J. Educ. Technol."},{"key":"ref_6","first-page":"76","article-title":"Analysis of Multiple-Choice versus Open-Ended Questions in Language Tests According to Different Cognitive Domain Levels","volume":"14","author":"Polat","year":"2020","journal-title":"Novitas-ROYAL (Res. Youth Lang.)"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Schneider, J., Schenk, B., and Niklaus, C. (2024). Towards LLM-based Autograding for Short Textual Answers. arXiv.","DOI":"10.5220\/0012552200003693"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"010136","DOI":"10.1103\/PhysRevPhysEducRes.21.010136","article-title":"Assessing confidence in AI-assisted grading of physics exams through psychometrics: An exploratory study","volume":"21","author":"Kortemeyer","year":"2025","journal-title":"Phys. Rev. Phys. Educ. Res."},{"key":"ref_9","unstructured":"European Parliament and Council of the European Union (2024). 
Regulation (EU) 2024\/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300\/2008, (EU) No 167\/2013, (EU) No 168\/2013, (EU) 2018\/858, (EU) 2018\/1139 and (EU) 2019\/2144 and Directives 2014\/90\/EU, (EU) 2016\/797 and (EU) 2020\/1828 (Artificial Intelligence Act) (Text with EEA relevance). Off. J. Eur. Union, 2024\/1689, 1\u2013144."},{"key":"ref_10","unstructured":"Mills, C., Alexandron, G., Taibi, D., Lo Bosco, G., and Paquette, L. (2025). Can Language Models Grade Algebra Worked Solutions? Evaluating LLM-Based Autograders Against Human Grading. Proceedings of the 18th International Conference on Educational Data Mining, Palermo, Italy, 20\u201323 July 2025, International Educational Data Mining Society."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"010126","DOI":"10.1103\/PhysRevPhysEducRes.21.010126","article-title":"Grading Explanations of Problem-Solving Process and Generating Feedback Using Large Language Models at Human-Level Accuracy","volume":"21","author":"Chen","year":"2025","journal-title":"Phys. Rev. Phys. Educ. Res."},{"key":"ref_12","unstructured":"Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv."},{"key":"ref_13","unstructured":"Ovalle, A., Chang, K.-W., Mehrabi, N., Pruksachatkun, Y., Galystan, A., Dhamala, J., Verma, A., Cao, T., Kumar, A., and Gupta, R. (2023). Strength in Numbers: Estimating Confidence of Large Language Models by Prompt Agreement. Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), Toronto,\nCanada, 14 July 2023, Association for Computational Linguistics."},{"key":"ref_14","unstructured":"Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (2025). 
Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 4\u20139 November 2025, Association for Computational Linguistics."},{"key":"ref_15","unstructured":"Hamidieh, K., Thost, V., Gerych, W., Yurochkin, M., and Ghassemi, M. (2025, January 6). Uncovering Confident Failures: The Complementary Roles of Aleatoric and Epistemic Uncertainty in LLMs. Proceedings of the NeurIPS 2025 Workshop: Reliable ML from Unreliable Data, San Diego, CA, USA. Available online: https:\/\/openreview.net\/forum?id=9Jq7wNrpUI."},{"key":"ref_16","first-page":"100733","article-title":"Conformal validation: A deferral policy using uncertainty quantification with a human-in-the-loop for model validation","volume":"22","author":"Horton","year":"2025","journal-title":"Mach. Learn. Appl."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Strong, J., Men, Q., and Noble, A. (2025). Trustworthy and Practical AI for Healthcare: A Guided Deferral System with Large Language Models. arXiv.","DOI":"10.1609\/aaai.v39i27.35063"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"506","DOI":"10.1038\/s41597-025-04664-y","article-title":"A benchmarking framework and dataset for learning to defer in human-AI decision-making","volume":"12","author":"Alves","year":"2025","journal-title":"Sci. Data"},{"key":"ref_19","unstructured":"OpenAI (2025, November 03). How Should I Set the Temperature Parameter?. Available online: https:\/\/platform.openai.com\/docs\/faq\/how-should-i-set-the-temperature-parameter."},{"key":"ref_20","unstructured":"Peeperkorn, M., Kouwenhoven, T., Brown, D., and Jordanous, A. (2024). Is Temperature the Creativity Parameter of Large Language Models?. arXiv."},{"key":"ref_21","unstructured":"Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (May, January 26). The Curious Case of Neural Text Degeneration. 
Proceedings of the International Conference on Learning Representations, Virtual Conference."},{"key":"ref_22","unstructured":"OpenAI (2025, November 03). Using GPT-5, 2025. Available online: https:\/\/platform.openai.com."},{"key":"#cr-split#-ref_23.1","unstructured":"Muresan, S., Nakov, P., and Villavicencio, A. (2022). Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume"},{"key":"#cr-split#-ref_23.2","unstructured":"1: Long Papers), Dublin, Ireland, 22-27 May 2022, Association for Computational Linguistics."},{"key":"ref_24","unstructured":"Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (2025). Multilingual Prompting for Improving LLM Generation Diversity. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 4\u20139 November 2025, Association for Computational Linguistics."},{"key":"ref_25","unstructured":"Calabrese, A., de Kock, C., Nozza, D., Plaza-del Arco, F.M., Talat, Z., and Vargas, F. (2025). Personas with Attitudes: Controlling LLMs for Diverse Data Annotation. Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH), Vienna, Austria, 1 August 2025, Association for Computational Linguistics. Available online: https:\/\/aclanthology.org\/2025.woah-1.43\/."},{"key":"ref_26","unstructured":"Ku, L.W., Martins, A., and Srikumar, V. (2024). Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand,\n11\u201316 August 2024, Association for Computational Linguistics."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Hastie, T., Tibshirani, R., and Friedman, J. (2009). Ensemble Learning. 
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer.","DOI":"10.1007\/978-0-387-84858-7"},{"key":"ref_28","unstructured":"Al-Onaizan, Y., Bansal, M., and Chen, Y.N. (2024). LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity. Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12\u201316 November 2024, Association for Computational Linguistics."},{"key":"ref_29","unstructured":"Wang, J., Wang, J., Athiwaratkun, B., Zhang, C., and Zou, J. (2024). Mixture-of-Agents Enhances Large Language Model Capabilities. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"e70080","DOI":"10.2196\/70080","article-title":"Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study","volume":"27","author":"Yang","year":"2025","journal-title":"J. Med. Internet Res."},{"key":"ref_31","first-page":"3843","article-title":"Solving Quantitative Reasoning Problems with Language Models","volume":"Volume 35","author":"Koyejo","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"key":"ref_32","unstructured":"R Core Team (2022). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing."},{"key":"ref_33","unstructured":"Van Rossum, G., and Drake, F.L. (1995). Python Reference Manual, Centrum voor Wiskunde en Informatica."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3636515","article-title":"Automated Grading and Feedback Tools for Programming Education: A Systematic Review","volume":"24","author":"Messer","year":"2024","journal-title":"ACM Trans. Comput. Educ."},{"key":"ref_35","unstructured":"OpenAI, Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A.J., Welihinda, A., and Hayes, A. (2024). GPT-4o System Card. 
arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"101522","DOI":"10.1016\/j.tsc.2024.101522","article-title":"The Future of Grading Programming Assignments in Education: The Role of ChatGPT in Automating the Assessment and Feedback Process","volume":"52","author":"Jukiewicz","year":"2024","journal-title":"Think. Ski. Creat."},{"key":"ref_37","unstructured":"Jukiewicz, M. (2025). A Systematic Comparison of Large Language Models for Automated Assignment Assessment in Programming Education: Exploring the Importance of Architecture and Vendor. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3759256","article-title":"How Consistent Are Humans When Grading Programming Assignments?","volume":"25","author":"Messer","year":"2025","journal-title":"ACM Trans. Comput. Educ."},{"key":"ref_39","first-page":"319","article-title":"An Overview of Current Research on Automated Essay Grading","volume":"2","author":"Valenti","year":"2003","journal-title":"J. Inf. Technol. Educ. Res."},{"key":"ref_40","first-page":"1","article-title":"An Overview of Automated Scoring of Essays","volume":"5","author":"Dikli","year":"2006","journal-title":"J. Technol. Learn. Assess."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Zawacki-Richter, O., and Jung, I. (2023). Automated Essay Scoring Systems. Handbook of Open, Distance and Digital Education, Springer Nature.","DOI":"10.1007\/978-981-19-2080-6"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"210","DOI":"10.1007\/BF01419938","article-title":"The Use of the Computer in Analyzing Student Essays","volume":"14","author":"Page","year":"1968","journal-title":"Int. Rev. Educ."},{"key":"ref_43","unstructured":"Lascarides, A., Gardent, C., and Nivre, J. (2009). Text-to-Text Semantic Similarity for Automatic Short Answer Grading. 
Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, Greece, 30 March\u20133 April 2009, Association for Computational Linguistics. Available online: https:\/\/aclanthology.org\/E09-1065\/."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Dong, F., and Zhang, Y. (2016). Automatic Features for Essay Scoring\u2014An Empirical Study. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1\u20135 November 2016, Association for Computational Linguistics.","DOI":"10.18653\/v1\/D16-1115"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Taghipour, K., and Ng, H.T. (2016). A Neural Approach to Automated Essay Scoring. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1\u20135 November 2016, Association for Computational Linguistics.","DOI":"10.18653\/v1\/D16-1193"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"459","DOI":"10.1007\/s41237-021-00142-y","article-title":"A Review of Deep-Neural Automated Essay Scoring Models","volume":"48","author":"Uto","year":"2021","journal-title":"Behaviormetrika"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"389","DOI":"10.1023\/A:1025779619903","article-title":"C-Rater: Automated Scoring of Short-Answer Questions","volume":"37","author":"Leacock","year":"2003","journal-title":"Comput. Humanit."},{"key":"ref_48","first-page":"1","article-title":"Automated Essay Scoring with e-rater\u00ae V.2","volume":"4","author":"Attali","year":"2006","journal-title":"J. Technol. Learn. Assess."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Ihantola, P., Ahoniemi, T., Karavirta, V., and Sepp\u00e4l\u00e4, O. (2010). Review of Recent Systems for Automatic Assessment of Programming Assignments. 
Proceedings of the 10th Koli Calling International Conference on Computing Education Research, Koli, Finland, 28\u201331 October 2010, ACM.","DOI":"10.1145\/1930464.1930480"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Staubitz, T., Klement, H., Renz, J., Teusner, R., and Meinel, C. (2015). Towards Practical Programming Exercises and Automated Assessment in Massive Open Online Courses. Proceedings of the 2015 IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE), Zhuhai, China, 10\u201312 December 2015, IEEE.","DOI":"10.1109\/TALE.2015.7386010"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Novak, M., and Kermek, D. (2024). Assessment Automation of Complex Student Programming Assignments. Educ. Sci., 14.","DOI":"10.3390\/educsci14010054"},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"e70174","DOI":"10.1111\/exsy.70174","article-title":"Grading Open-Ended Questions Using LLMs and RAG","volume":"43","year":"2026","journal-title":"Expert Syst."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Yewon, A., and Sang-Ki, L. (2025). The Impact of Prompt Engineering on GPT-4o\u2019s Scoring Reliability in English Writing Assessment. SSRN.","DOI":"10.2139\/ssrn.5929615"},{"key":"ref_54","unstructured":"Golchin, S., Garuda, N., Impey, C., and Wenger, M. (2024). Large Language Models As MOOCs Graders. arXiv."},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Ferreira Mello, R., Pereira Junior, C., Rodrigues, L., Pereira, F.D., Cabral, L., Costa, N., Ramalho, G., and Gasevic, D. (2025). Automatic Short Answer Grading in the LLM Era: Does GPT-4 with Prompt Engineering Beat Traditional Models?. 
Proceedings of the 15th International Learning Analytics and Knowledge Conference, Dublin, Ireland, 3\u20137 March 2025, ACM.","DOI":"10.1145\/3706468.3706481"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Qiu, H., White, B., Ding, A., Costa, R., Hachem, A., Ding, W., and Chen, P. (2025). SteLLA: A Structured Grading System Using LLMs with RAG. arXiv.","DOI":"10.1109\/BigData62323.2024.10825385"},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"100210","DOI":"10.1016\/j.caeai.2024.100210","article-title":"Fine-Tuning ChatGPT for Automatic Scoring","volume":"6","author":"Latif","year":"2024","journal-title":"Comput. Educ. Artif. Intell."},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Pathak, A., Gandhi, R., Uttam, V., Ramamoorthy, A., Ghosh, P., Jindal, A.R., Verma, S., Mittal, A., Ased, A., and Khatri, C. (2025). Rubric Is All You Need: Improving LLM-Based Code Evaluation with Question-Specific Rubrics. Proceedings of the 2025 ACM Conference on International Computing Education Research, Charlottesville, VA, USA, 3\u20136 August 2025, ACM.","DOI":"10.1145\/3702652.3744220"},{"key":"ref_59","doi-asserted-by":"crossref","first-page":"130","DOI":"10.1016\/j.edurev.2007.05.002","article-title":"The Use of Scoring Rubrics: Reliability, Validity and Educational Consequences","volume":"2","author":"Jonsson","year":"2007","journal-title":"Educ. Res. Rev."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Hossain, S. (2019). Visualization of Bioinformatics Data with Dash Bio. SciPy.","DOI":"10.25080\/Majora-7ddc1dd1-01f"},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"420","DOI":"10.1037\/0033-2909.86.2.420","article-title":"Intraclass Correlations: Uses in Assessing Rater Reliability","volume":"86","author":"Shrout","year":"1979","journal-title":"Psychol. Bull."},{"key":"ref_62","unstructured":"Revelle, W. (2025). 
psych: Procedures for Psychological, Psychometric, and Personality Research, Northwestern University. Available online: https:\/\/CRAN.R-project.org\/package=psych."},{"key":"ref_63","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v067.i01","article-title":"Fitting Linear Mixed-Effects Models Using lme4","volume":"67","author":"Bates","year":"2015","journal-title":"J. Stat. Softw."},{"key":"ref_64","doi-asserted-by":"crossref","first-page":"688","DOI":"10.1038\/163688a0","article-title":"Measurement of Diversity","volume":"163","author":"Simpson","year":"1949","journal-title":"Nature"},{"key":"ref_65","doi-asserted-by":"crossref","first-page":"716","DOI":"10.1109\/TAC.1974.1100705","article-title":"A New Look at the Statistical Model Identification","volume":"19","author":"Akaike","year":"1974","journal-title":"IEEE Trans. Autom. Control"},{"key":"ref_66","doi-asserted-by":"crossref","first-page":"461","DOI":"10.1214\/aos\/1176344136","article-title":"Estimating the Dimension of a Model","volume":"6","author":"Schwarz","year":"1978","journal-title":"Ann. Stat."},{"key":"ref_67","unstructured":"OpenAI (2025, November 03). GPT-4.1 Nano-OpenAI API. Available online: https:\/\/platform.openai.com\/docs\/models\/gpt-4.1-nano."},{"key":"ref_68","unstructured":"OpenAI (2025, November 03). GPT-5 Nano-OpenAI API. Available online: https:\/\/platform.openai.com\/docs\/models\/gpt-5-nano."},{"key":"ref_69","unstructured":"OpenAI (2025, November 03). Gpt-Oss-20b-OpenAI API. Available online: https:\/\/platform.openai.com\/docs\/models\/gpt-oss-20b."},{"key":"ref_70","unstructured":"OpenAI (2025, November 03). Batch API - OpenAI API. Available online: https:\/\/platform.openai.com\/docs\/guides\/batch."},{"key":"ref_71","unstructured":"Microsoft (2025, November 03). Azure Machine Learning\u2014ML as a Service|Microsoft Azure. 
Available online: https:\/\/azure.microsoft.com\/en-us\/products\/machine-learning."},{"key":"ref_72","unstructured":"DeepL (2025, November 03). DeepL Translate: The World\u2019s Most Accurate Translator. Available online: https:\/\/www.deepl.com\/translator."},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Capretto, T., Piho, C., Kumar, R., Westfall, J., Yarkoni, T., and Martin, O.A. (2022). Bambi: A simple interface for fitting Bayesian linear models in Python. arXiv.","DOI":"10.18637\/jss.v103.i15"},{"key":"ref_74","unstructured":"Westfall, J. (2017). Statistical details of the default priors in the Bambi library. arXiv."},{"key":"ref_75","unstructured":"OpenAI (2026, January 31). Introducing GPT-5. Available online: https:\/\/openai.com\/index\/introducing-gpt-5\/."},{"key":"ref_76","unstructured":"Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., and Vaughan, A. (2024). The Llama 3 Herd of Models. arXiv."},{"key":"ref_77","unstructured":"Mistral AI Team (2026, January 31). Codestral 25.01|Mistral AI. Available online: https:\/\/mistral.ai\/news\/codestral-2501?utm_source=chatgpt.com."},{"key":"ref_78","unstructured":"Universiteit van Amsterdam (2026, January 31). UvA AI Chat. Available online: https:\/\/www.uva.nl\/over-de-uva\/over-de-universiteit\/ai\/ai-in-het-onderwijs\/uva-ai-chat\/uva-ai-chat.html."},{"key":"ref_79","unstructured":"Cohn, C., S, A.T., Mohammed, N., and Biswas, G. (2025). CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring. arXiv, arxiv.2504.02323."},{"key":"ref_80","unstructured":"Cristea, A.I., Walker, E., Lu, Y., Santos, O.C., and Isotani, S. (2025). Avalon: A Human-in-the-Loop LLM Grading System with Instructor Calibration and Student Self-assessment. Artificial Intelligence in Education. 
Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium, Blue Sky, and WideAIED, Springer."},{"key":"ref_81","doi-asserted-by":"crossref","first-page":"100428","DOI":"10.1016\/j.caeai.2025.100428","article-title":"Is GPT-4 Fair? An Empirical Analysis in Automatic Short Answer Grading","volume":"8","author":"Rodrigues","year":"2025","journal-title":"Comput. Educ. Artif. Intell."},{"key":"ref_82","doi-asserted-by":"crossref","first-page":"pgaf089","DOI":"10.1093\/pnasnexus\/pgaf089","article-title":"Measuring Gender and Racial Biases in Large Language Models: Intersectional Evidence from Automated Resume Evaluation","volume":"4","author":"An","year":"2025","journal-title":"PNAS Nexus"},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"81","DOI":"10.3102\/003465430298487","article-title":"The Power of Feedback","volume":"77","author":"Hattie","year":"2007","journal-title":"Rev. Educ. Res."},{"key":"ref_84","doi-asserted-by":"crossref","first-page":"e3292","DOI":"10.1002\/rev3.3292","article-title":"Formative Assessment and Feedback for Learning in Higher Education: A Systematic Review","volume":"9","author":"Morris","year":"2021","journal-title":"Rev. Educ."},{"key":"ref_85","doi-asserted-by":"crossref","unstructured":"Dai, W., Lin, J., Jin, H., Li, T., Tsai, Y.S., Ga\u0161evi\u0107, D., and Chen, G. (2023). Can Large Language Models Provide Feedback to Students? A Case Study on ChatGPT. Proceedings of the 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), Orem, UT, USA, 10\u201313 July 2023, IEEE.","DOI":"10.1109\/ICALT58122.2023.00100"},{"key":"ref_86","unstructured":"Paa\u00dfen, B., and Demmans Epp, C. (2024). LLM-generated Feedback in Real Classes and Beyond: Perspectives from Students and Instructors. 
Proceedings of the 17th International Conference on Educational Data Mining, Atlanta, GA, USA,\n14\u201317 July 2024, International Educational Data Mining Society."},{"key":"ref_87","doi-asserted-by":"crossref","first-page":"100199","DOI":"10.1016\/j.caeai.2023.100199","article-title":"Using LLMs to Bring Evidence-Based Feedback into the Classroom: AI-generated Feedback Increases Secondary Students\u2019 Text Revision, Motivation, and Positive Emotions","volume":"6","author":"Meyer","year":"2024","journal-title":"Comput. Educ. Artif. Intell."},{"key":"ref_88","first-page":"127","article-title":"Student Self-Assessment: A Meta-Review of Five Decades of Research","volume":"32","author":"Nieminen","year":"2025","journal-title":"Assess. Educ. Princ. Policy Pract."},{"key":"ref_89","unstructured":"(2025, February 19). Canvas. Available online: https:\/\/www.instructure.com\/canvas."}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/8\/3\/74\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T14:47:21Z","timestamp":1773845241000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/8\/3\/74"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,16]]},"references-count":90,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2026,3]]}},"alternative-id":["make8030074"],"URL":"https:\/\/doi.org\/10.3390\/make8030074","relation":{},"ISSN":["2504-4990"],"issn-type":[{"value":"2504-4990","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,16]]}}}