{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,6]],"date-time":"2026-04-06T14:42:20Z","timestamp":1775486540862,"version":"3.50.1"},"reference-count":57,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2025,10,19]],"date-time":"2025-10-19T00:00:00Z","timestamp":1760832000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,10,19]],"date-time":"2025-10-19T00:00:00Z","timestamp":1760832000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100025273","name":"Constantine the Philosopher University in Nitra","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100025273","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Artif Intell Educ"],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>This study investigates the potential of Generative AI models and sentence embedding models for the automated assessment of open-ended student responses in a higher education computer science course. From 110 university students enrolled in a software engineering course, 1,885 responses to 24 open-ended questions assessing knowledge of software engineering concepts were collected. Using precision, recall, F1-score, false positive and false negative rates, and inter-rater agreement metrics such as Fleiss\u2019 Kappa and Krippendorff\u2019s Alpha, we systematically analyzed the performance of eleven state-of-the-art models, including GPTo1, Claude3, PaLM2, and SBERT, against two human expert graders. 
The findings reveal that GPTo1 achieved the highest agreement with human evaluations, showing almost perfect agreement, low false positive and false negative rates, and strong performance across all grade categories. Models such as Claude3 and PaLM2 demonstrated substantial agreement, excelling in higher-grade assessments but falling short in identifying failing grades. Sentence embedding models, while moderately effective, struggled to capture the context and semantic nuances of diverse student expressions. The study also highlights the limitations of reference-based grading approaches, as shown by the Natural Language Inference analysis, which found that many student responses contradicted reference answers despite being semantically correct. This underscores the importance of context-sensitive models like GPTo1, which accurately evaluate diverse responses and ensure fairer grading. While GPTo1 stands out as a candidate for independent deployment, the financial cost of such high-performing proprietary models raises concerns about scalability.<\/jats:p>","DOI":"10.1007\/s40593-025-00517-2","type":"journal-article","created":{"date-parts":[[2025,10,19]],"date-time":"2025-10-19T16:31:12Z","timestamp":1760891472000},"page":"3813-3846","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Automated Grading of Open-Ended Questions in Higher Education Using GenAI Models"],"prefix":"10.1007","volume":"35","author":[{"given":"Janka","family":"Pecuchova","sequence":"first","affiliation":[]},{"given":"\u013dubom\u00edr","family":"Benko","sequence":"additional","affiliation":[]},{"given":"Martin","family":"Drlik","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,10,19]]},"reference":[{"key":"517_CR1","doi-asserted-by":"publisher","unstructured":"Abdou, I., & Eude, T. (2023). Open-ended questions automated evaluation: proposal of a new generation. 
In Proceedings of the 2023 International Joint Conference on Robotics and Artificial Intelligence (pp. 143\u2013147). New York, NY, USA: ACM. https:\/\/doi.org\/10.1145\/3632971.3632980","DOI":"10.1145\/3632971.3632980"},{"key":"517_CR2","doi-asserted-by":"crossref","unstructured":"Aggarwal, D., Sil, P., Raman, B., & Bhattacharyya, P. (2025). I understand why I got this grade: Automatic Short Answer Grading (ASAG) with Feedback. In Proceedings of the 26th International Conference on Artificial Intelligence in Education (AIED 2025) (pp. 1\u201315). Palermo, Italy: AIED Society.","DOI":"10.1007\/978-3-031-98420-4_22"},{"issue":"4","key":"517_CR3","doi-asserted-by":"publisher","first-page":"1709","DOI":"10.1007\/s11423-023-10239-8","volume":"71","author":"C \u00c1lvarez-\u00c1lvarez","year":"2023","unstructured":"\u00c1lvarez-\u00c1lvarez, C., & Falcon, S. (2023). Students\u2019 preferences with university teaching practices: Analysis of testimonials with artificial intelligence. Educational Technology Research And Development, 71(4), 1709\u20131724. https:\/\/doi.org\/10.1007\/s11423-023-10239-8","journal-title":"Educational Technology Research And Development"},{"key":"517_CR4","unstructured":"Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A. (2023). PaLM 2 Technical Report."},{"key":"517_CR5","unstructured":"Anthropic (2024). The Claude 3 Model Family: Opus, Sonnet, Haiku, 1\u201342."},{"key":"517_CR6","doi-asserted-by":"publisher","unstructured":"Arici, N., Gerevini, A. E., Putelli, L., Serina, I., & Sigalini, L. (2023a). A BERT-based scoring system for workplace safety courses in Italian (pp. 457\u2013471). https:\/\/doi.org\/10.1007\/978-3-031-27181-6_32","DOI":"10.1007\/978-3-031-27181-6_32"},{"issue":"8","key":"517_CR7","doi-asserted-by":"publisher","first-page":"268","DOI":"10.3390\/fi15080268","volume":"15","author":"N Arici","year":"2023","unstructured":"Arici, N., Gerevini, A., Olivato, M., Putelli, L., Sigalini, L., & Serina, I. 
(2023b). Real-world implementation and integration of an automatic scoring system for workplace safety courses in Italian. Future Internet, 15(8), 268. https:\/\/doi.org\/10.3390\/fi15080268","journal-title":"Future Internet"},{"key":"517_CR8","doi-asserted-by":"publisher","unstructured":"Balaha, H. M., & Saafan, M. M. (2021). Automatic Exam Correction Framework (AECF) for the MCQs, Essays, and Equation Matching. IEEE Access, 9, 32368\u201332389. https:\/\/doi.org\/10.1109\/ACCESS.2021.3060940","DOI":"10.1109\/ACCESS.2021.3060940"},{"key":"517_CR9","doi-asserted-by":"publisher","unstructured":"Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 632\u2013642). Stroudsburg, PA, USA: Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/D15-1075","DOI":"10.18653\/v1\/D15-1075"},{"issue":"1","key":"517_CR10","doi-asserted-by":"publisher","first-page":"60","DOI":"10.1007\/s40593-014-0026-8","volume":"25","author":"S Burrows","year":"2015","unstructured":"Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25(1), 60\u2013117. https:\/\/doi.org\/10.1007\/s40593-014-0026-8","journal-title":"International Journal of Artificial Intelligence in Education"},{"key":"517_CR11","unstructured":"Cao, H. (2024). Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark, 1\u201321."},{"key":"517_CR12","doi-asserted-by":"publisher","unstructured":"Cer, D., Yang, Y., Kong, S., Hua, N., Limtiaco, N., St. John, R., et al. (2018). Universal Sentence Encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 169\u2013174). 
Stroudsburg, PA, USA: Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/D18-2029","DOI":"10.18653\/v1\/D18-2029"},{"key":"517_CR13","unstructured":"Condor, A., Litster, M., & Pardos, Z. (2021). Automatic short answer grading with SBERT on out-of-sample questions. In Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) (pp. 345\u2013352). Paris, France: International Educational Data Mining Society."},{"key":"517_CR14","doi-asserted-by":"publisher","unstructured":"Condoravdi, C., Crouch, D., de Paiva, V., Stolle, R., & Bobrow, D. G. (2003). Entailment, intensionality and text understanding. In Proceedings of the HLT-NAACL 2003 workshop on Text meaning - (pp. 38\u201345). Morristown, NJ, USA: Association for Computational Linguistics. https:\/\/doi.org\/10.3115\/1119239.1119245","DOI":"10.3115\/1119239.1119245"},{"key":"517_CR15","doi-asserted-by":"publisher","unstructured":"Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North (pp. 4171\u20134186). Stroudsburg, PA, USA: Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/N19-1423","DOI":"10.18653\/v1\/N19-1423"},{"key":"517_CR16","doi-asserted-by":"publisher","unstructured":"Dinh, T. A., Mullov, C., B\u00e4rmann, L., Li, Z., Liu, D., Rei\u00df, S. (2024). SciEx: benchmarking large language models on scientific exams with human expert grading and automatic grading. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 11592\u201311610). Stroudsburg, PA, USA: Association for Computational Linguistics. 
https:\/\/doi.org\/10.18653\/v1\/2024.emnlp-main.647","DOI":"10.18653\/v1\/2024.emnlp-main.647"},{"key":"517_CR17","doi-asserted-by":"publisher","first-page":"3106","DOI":"10.1109\/ACCESS.2018.2887057","volume":"7","author":"M Drlik","year":"2019","unstructured":"Drlik, M., & Munk, M. (2019). Understanding time-based trends in stakeholders\u2019 choice of learning activity type using predictive models. IEEE Access, 7, 3106\u20133121. https:\/\/doi.org\/10.1109\/ACCESS.2018.2887057","journal-title":"IEEE Access"},{"key":"517_CR18","doi-asserted-by":"publisher","first-page":"23795","DOI":"10.1109\/ACCESS.2021.3056191","volume":"9","author":"M Drlik","year":"2021","unstructured":"Drlik, M., Munk, M., & Skalka, J. (2021). Identification of changes in VLE stakeholders\u2019 behavior over time using frequent patterns mining. IEEE Access\u202f: Practical Innovations, Open Solutions, 9, 23795\u201323813. https:\/\/doi.org\/10.1109\/ACCESS.2021.3056191","journal-title":"IEEE Access : Practical Innovations, Open Solutions"},{"key":"517_CR19","doi-asserted-by":"publisher","unstructured":"Ethayarajh, K. (2019). How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 55\u201365). Stroudsburg, PA, USA: Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/D19-1006","DOI":"10.18653\/v1\/D19-1006"},{"key":"517_CR20","doi-asserted-by":"publisher","unstructured":"Ferreira Mello, R., Pereira Junior, C., Rodrigues, L., Pereira, F. D., Cabral, L., Costa, N. (2025). Automatic Short Answer Grading in the LLM Era: Does GPT-4 with Prompt Engineering beat Traditional Models? In Proceedings of the 15th International Learning Analytics and Knowledge Conference (pp. 93\u2013103). New York, NY, USA: ACM. 
https:\/\/doi.org\/10.1145\/3706468.3706481","DOI":"10.1145\/3706468.3706481"},{"key":"517_CR21","doi-asserted-by":"publisher","DOI":"10.1007\/s40593-024-00442-w","author":"Y Fu","year":"2024","unstructured":"Fu, Y., Weng, Z., & Wang, J. (2024). Examining AI use in educational contexts: A scoping meta-review and bibliometric analysis. International Journal of Artificial Intelligence in Education. https:\/\/doi.org\/10.1007\/s40593-024-00442-w","journal-title":"International Journal of Artificial Intelligence in Education"},{"key":"517_CR22","doi-asserted-by":"publisher","first-page":"100206","DOI":"10.1016\/j.caeai.2024.100206","volume":"6","author":"R Gao","year":"2024","unstructured":"Gao, R., Merzdorf, H. E., Anwar, S., Hipwell, M. C., & Srinivasa, A. R. (2024). Automatic assessment of text-based responses in post-secondary education: A systematic review. Computers And Education: Artificial Intelligence, 6, 100206. https:\/\/doi.org\/10.1016\/j.caeai.2024.100206","journal-title":"Computers And Education: Artificial Intelligence"},{"issue":"1","key":"517_CR23","doi-asserted-by":"publisher","first-page":"1060","DOI":"10.1186\/s12909-024-06026-5","volume":"24","author":"C Gr\u00e9visse","year":"2024","unstructured":"Gr\u00e9visse, C. (2024). LLM-based automatic short answer grading in undergraduate medical education. BMC Medical Education, 24(1), 1060. https:\/\/doi.org\/10.1186\/s12909-024-06026-5","journal-title":"BMC Medical Education"},{"key":"517_CR24","doi-asserted-by":"publisher","DOI":"10.1007\/s40593-024-00431-z","author":"O Henkel","year":"2024","unstructured":"Henkel, O., Hills, L., Roberts, B., & McGrane, J. (2024). Can LLMs grade open response reading comprehension questions? An empirical study using the roars dataset. International Journal of Artificial Intelligence in Education. 
https:\/\/doi.org\/10.1007\/s40593-024-00431-z","journal-title":"International Journal of Artificial Intelligence in Education"},{"key":"517_CR25","doi-asserted-by":"publisher","unstructured":"Huang, Y., Yang, X., Zhuang, F., Zhang, L., & Yu, S. (2018). Automatic Chinese reading comprehension grading by LSTM with knowledge adaptation (pp. 118\u2013129). https:\/\/doi.org\/10.1007\/978-3-319-93034-3_10","DOI":"10.1007\/978-3-319-93034-3_10"},{"key":"517_CR26","doi-asserted-by":"publisher","unstructured":"Hutt, S., DePiro, A., Wang, J., Rhodes, S., Baker, R. S., Hieb, G. (2024). Feedback on Feedback: Comparing Classic Natural Language Processing and Generative AI to Evaluate Peer Feedback. In Proceedings of the 14th Learning Analytics and Knowledge Conference (pp. 55\u201365). New York, NY, USA: ACM. https:\/\/doi.org\/10.1145\/3636555.3636850","DOI":"10.1145\/3636555.3636850"},{"issue":"6","key":"517_CR27","doi-asserted-by":"publisher","DOI":"10.3390\/info16060472","volume":"16","author":"G Ilieva","year":"2025","unstructured":"Ilieva, G., Yankova, T., Ruseva, M., & Kabaivanov, S. (2025). A framework for generative AI-driven assessment in higher education. Information, 16(6), Article 472. https:\/\/doi.org\/10.3390\/info16060472","journal-title":"Information"},{"issue":"04","key":"517_CR28","doi-asserted-by":"publisher","first-page":"3097","DOI":"10.54364\/AAIML.2024.44177","volume":"04","author":"S Jauhiainen","year":"2024","unstructured":"Jauhiainen, S., & Guerra, A. G. (2024). Evaluating students\u2019 open-ended written responses with LLMs: Using the RAG framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large. Advances in Artificial Intelligence and Machine Learning, 04(04), 3097\u20133113. https:\/\/doi.org\/10.54364\/AAIML.2024.44177","journal-title":"Advances in Artificial Intelligence and Machine Learning"},{"key":"517_CR29","doi-asserted-by":"publisher","unstructured":"Jiang, L., & Bosch, N. (2024). Short answer scoring with GPT-4. 
In Proceedings of the Eleventh ACM Conference on Learning @ Scale (pp. 438\u2013442). New York, NY, USA: ACM. https:\/\/doi.org\/10.1145\/3657604.3664685","DOI":"10.1145\/3657604.3664685"},{"key":"517_CR30","unstructured":"Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., et al. (2023). Mistral 7B."},{"issue":"7","key":"517_CR31","doi-asserted-by":"publisher","first-page":"3130","DOI":"10.3390\/app11073130","volume":"11","author":"J Kabathova","year":"2021","unstructured":"Kabathova, J., & Drlik, M. (2021). Towards predicting student\u2019s dropout in university courses using different machine learning techniques. Applied Sciences, 11(7), 3130. https:\/\/doi.org\/10.3390\/app11073130","journal-title":"Applied Sciences"},{"key":"517_CR32","doi-asserted-by":"crossref","unstructured":"Kincaid, P. J., Fishburne Jr., R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (Automated Readability Index, fog count and Flesch reading ease formula) for navy enlisted personnel. Millington.","DOI":"10.21236\/ADA006655"},{"key":"517_CR33","unstructured":"Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach."},{"key":"517_CR34","doi-asserted-by":"crossref","unstructured":"MacCartney, B., & Manning, C. D. (2008). Modeling Semantic Containment and Exclusion in Natural Language Inference. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008) (pp. 521\u2013528). Manchester, UK: Coling 2008 Organizing Committee.","DOI":"10.3115\/1599081.1599147"},{"key":"517_CR35","doi-asserted-by":"crossref","unstructured":"MacCartney, B., Galley, M., & Manning, C. D. (2008). A Phrase-Based Alignment Model for Natural Language Inference. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 802\u2013811). 
Honolulu, Hawaii: Association for Computational Linguistics.","DOI":"10.3115\/1613715.1613817"},{"key":"517_CR36","doi-asserted-by":"publisher","DOI":"10.1080\/14703297.2025.2469089","author":"J Manning","year":"2025","unstructured":"Manning, J., Baldwin, J., & Powell, N. (2025). Human versus machine: The effectiveness of ChatGPT in automated essay scoring. Innovations In Education And Teaching International. https:\/\/doi.org\/10.1080\/14703297.2025.2469089","journal-title":"Innovations In Education And Teaching International"},{"issue":"2","key":"517_CR37","doi-asserted-by":"publisher","DOI":"10.1016\/j.rmal.2023.100050","volume":"2","author":"A Mizumoto","year":"2023","unstructured":"Mizumoto, A., & Eguchi, M. (2023). Exploring the potential of using an AI language model for automated essay scoring. Research Methods in Applied Linguistics, 2(2), Article 100050. https:\/\/doi.org\/10.1016\/j.rmal.2023.100050","journal-title":"Research Methods in Applied Linguistics"},{"key":"517_CR38","unstructured":"OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I. (2024). GPT-4 Technical Report, 1\u2013100."},{"key":"517_CR39","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2023.10.316","author":"J Pecuchova","year":"2023","unstructured":"Pecuchova, J., & Drlik, M. (2023). Predicting students at risk of early dropping out from course using ensemble classification methods. Procedia Computer Science. https:\/\/doi.org\/10.1016\/j.procs.2023.10.316","journal-title":"Procedia Computer Science"},{"key":"517_CR40","doi-asserted-by":"publisher","first-page":"159336","DOI":"10.1109\/ACCESS.2024.3486762","volume":"12","author":"J Pecuchova","year":"2024","unstructured":"Pecuchova, J., & Drlik, M. (2024). Enhancing the early student dropout prediction model through clustering analysis of students\u2019 digital traces. IEEE Access\u202f: Practical Innovations, Open Solutions, 12, 159336\u2013159367. 
https:\/\/doi.org\/10.1109\/ACCESS.2024.3486762","journal-title":"IEEE Access\u202f: Practical Innovations, Open Solutions"},{"key":"517_CR41","doi-asserted-by":"publisher","first-page":"394","DOI":"10.1007\/978-3-031-85652-5_40","volume-title":"International Conference on Interactive Collaborative Learning","author":"J Pecuchova","year":"2025","unstructured":"Pecuchova, J., & Drlik, M. (2025). The Role of Generative Artificial Intelligence in the Assessment of Open-Ended Questions. International Conference on Interactive Collaborative Learning (pp. 394\u2013405). Springer. https:\/\/doi.org\/10.1007\/978-3-031-85652-5_40"},{"key":"517_CR42","doi-asserted-by":"publisher","DOI":"10.3389\/fpubh.2022.1025271","author":"X Qian","year":"2022","unstructured":"Qian, X., Jingying, H., Xian, S., Yuqing, Z., Lili, W., Baorui, C., et al. (2022). The effectiveness of artificial intelligence-based automated grading and training system in education of manual detection of diabetic retinopathy. Frontiers in Public Health. https:\/\/doi.org\/10.3389\/fpubh.2022.1025271","journal-title":"Frontiers in Public Health"},{"issue":"1","key":"517_CR43","first-page":"5485","volume":"21","author":"C Raffel","year":"2020","unstructured":"Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485\u20135551.","journal-title":"The Journal of Machine Learning Research"},{"key":"517_CR44","doi-asserted-by":"publisher","unstructured":"Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3980\u20133990). Stroudsburg, PA, USA: Association for Computational Linguistics. 
https:\/\/doi.org\/10.18653\/v1\/D19-1410","DOI":"10.18653\/v1\/D19-1410"},{"key":"517_CR45","unstructured":"Reimers, N., Choi, E., Kayid, A., Nandula, A., Govindassamy, M., & Elkady, A. (2023, November 2). Introducing Embed v3. Cohere. https:\/\/cohere.com\/blog\/introducing-embed-v3. Accessed 27 Jun 2025."},{"issue":"1","key":"517_CR46","doi-asserted-by":"publisher","first-page":"20512","DOI":"10.1038\/s41598-023-46995-z","volume":"13","author":"M Roso\u0142","year":"2023","unstructured":"Roso\u0142, M., G\u0105sior, J. S., \u0141aba, J., Korzeniewski, K., & M\u0142y\u0144czak, M. (2023). Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish medical final examination. Scientific Reports, 13(1), 20512. https:\/\/doi.org\/10.1038\/s41598-023-46995-z","journal-title":"Scientific Reports"},{"key":"517_CR47","doi-asserted-by":"publisher","first-page":"526","DOI":"10.1016\/j.procs.2021.01.199","volume":"181","author":"C Schr\u00f6er","year":"2021","unstructured":"Schr\u00f6er, C., Kruse, F., & G\u00f3mez, J. M. (2021). A systematic literature review on applying CRISP-DM process model. Procedia Computer Science, 181, 526\u2013534. https:\/\/doi.org\/10.1016\/j.procs.2021.01.199","journal-title":"Procedia Computer Science"},{"issue":"3","key":"517_CR48","doi-asserted-by":"publisher","first-page":"237","DOI":"10.1109\/TLT.2010.4","volume":"3","author":"R Siddiqi","year":"2010","unstructured":"Siddiqi, R., Harrison, C. J., & Siddiqi, R. (2010). Improving teaching and learning through automated short-answer marking. IEEE Transactions on Learning Technologies, 3(3), 237\u2013249. https:\/\/doi.org\/10.1109\/TLT.2010.4","journal-title":"IEEE Transactions on Learning Technologies"},{"issue":"1","key":"517_CR49","doi-asserted-by":"publisher","first-page":"5670","DOI":"10.1038\/s41598-024-55568-7","volume":"14","author":"D Stribling","year":"2024","unstructured":"Stribling, D., Xia, Y., Amer, M. K., Graim, K. S., Mulligan, C. J., & Renne, R. (2024). 
The model student: GPT-4 performance on graduate biomedical science exams. Scientific Reports, 14(1), 5670. https:\/\/doi.org\/10.1038\/s41598-024-55568-7","journal-title":"Scientific Reports"},{"issue":"3","key":"517_CR50","doi-asserted-by":"publisher","first-page":"355","DOI":"10.1177\/20965311231168423","volume":"6","author":"J Su","year":"2023","unstructured":"Su, J., & Yang, W. (2023). Unlocking the power of ChatGPT: A framework for applying generative AI in education. ECNU Review of Education, 6(3), 355\u2013366. https:\/\/doi.org\/10.1177\/20965311231168423","journal-title":"ECNU Review of Education"},{"key":"517_CR51","doi-asserted-by":"publisher","first-page":"264","DOI":"10.1016\/j.cogsys.2019.09.025","volume":"59","author":"O Sychev","year":"2020","unstructured":"Sychev, O., Anikin, A., & Prokudin, A. (2020). Automatic grading and hinting in open-ended text questions. Cognitive Systems Research, 59, 264\u2013272. https:\/\/doi.org\/10.1016\/j.cogsys.2019.09.025","journal-title":"Cognitive Systems Research"},{"key":"517_CR52","doi-asserted-by":"publisher","first-page":"102531","DOI":"10.1016\/j.mex.2023.102531","volume":"12","author":"S Tobler","year":"2024","unstructured":"Tobler, S. (2024). Smart grading: A generative AI-based tool for knowledge-grounded answer evaluation in educational assessments. MethodsX, 12, 102531. https:\/\/doi.org\/10.1016\/j.mex.2023.102531","journal-title":"MethodsX"},{"key":"517_CR53","unstructured":"Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models."},{"key":"517_CR54","doi-asserted-by":"publisher","first-page":"19270","DOI":"10.1109\/ACCESS.2021.3054346","volume":"9","author":"CN Tulu","year":"2021","unstructured":"Tulu, C. N., Ozkaya, O., & Orhan, U. (2021). Automatic short answer grading with SemSpace sense vectors and MaLSTM. IEEE Access\u202f: Practical Innovations, Open Solutions, 9, 19270\u201319280. 
https:\/\/doi.org\/10.1109\/ACCESS.2021.3054346","journal-title":"IEEE Access : Practical Innovations, Open Solutions"},{"key":"517_CR55","doi-asserted-by":"publisher","unstructured":"Wu, S., Peng, Z., Du, X., Zheng, T., Liu, M., Wu, J. (2024). A comparative study on reasoning patterns of openai\u2019s o1 model. arXiv. https:\/\/doi.org\/10.48550\/arXiv.2410.13639","DOI":"10.48550\/arXiv.2410.13639"},{"key":"517_CR56","doi-asserted-by":"publisher","unstructured":"Xiao, C., Ma, W., Song, Q., Xu, S. X., Zhang, K., Wang, Y., & Fu, Q. (2025). Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs. In Proceedings of the 15th International Learning Analytics and Knowledge Conference (pp. 293\u2013305). New York, NY, USA: ACM. https:\/\/doi.org\/10.1145\/3706468.3706507","DOI":"10.1145\/3706468.3706507"},{"issue":"05","key":"517_CR57","doi-asserted-by":"publisher","first-page":"9628","DOI":"10.1609\/aaai.v34i05.6510","volume":"34","author":"Z Zhang","year":"2020","unstructured":"Zhang, Z., Wu, Y., Zhao, H., Li, Z., Zhang, S., Zhou, X., & Zhou, X. (2020). Semantics-aware BERT for language understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 9628\u20139635. 
https:\/\/doi.org\/10.1609\/aaai.v34i05.6510","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"}],"container-title":["International Journal of Artificial Intelligence in Education"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40593-025-00517-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40593-025-00517-2","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40593-025-00517-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T18:12:48Z","timestamp":1772647968000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40593-025-00517-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,19]]},"references-count":57,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["517"],"URL":"https:\/\/doi.org\/10.1007\/s40593-025-00517-2","relation":{},"ISSN":["1560-4292","1560-4306"],"issn-type":[{"value":"1560-4292","type":"print"},{"value":"1560-4306","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,19]]},"assertion":[{"value":"21 January 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 September 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 October 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing 
interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}