{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,25]],"date-time":"2026-04-25T21:33:30Z","timestamp":1777152810232,"version":"3.51.4"},"reference-count":40,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,7,5]],"date-time":"2024-07-05T00:00:00Z","timestamp":1720137600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,5]],"date-time":"2024-07-05T00:00:00Z","timestamp":1720137600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Swiss Federal Institute of Technology Zurich"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Discov Artif Intell"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Automated Short Answer Grading (ASAG) has been an active area of machine-learning research for over a decade. It promises to let educators grade and give feedback on free-form responses in large-enrollment courses in spite of limited availability of human graders. Over the years, carefully trained models have achieved increasingly higher levels of performance. More recently, pre-trained Large Language Models (LLMs) emerged as a commodity, and an intriguing question is how a general-purpose tool without additional training compares to specialized models. We studied the performance of GPT-4 on the standard benchmark 2-way and 3-way datasets SciEntsBank and Beetle, where in addition to the standard task of grading the alignment of the student answer with a reference answer, we also investigated withholding the reference answer. We found that overall, the performance of the pre-trained general-purpose GPT-4 LLM is comparable to hand-engineered models, but worse than pre-trained LLMs that had specialized training.<\/jats:p>","DOI":"10.1007\/s44163-024-00147-y","type":"journal-article","created":{"date-parts":[[2024,7,5]],"date-time":"2024-07-05T17:01:41Z","timestamp":1720198901000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":41,"title":["Performance of the pre-trained large language model GPT-4 on automated short answer grading"],"prefix":"10.1007","volume":"4","author":[{"given":"Gerd","family":"Kortemeyer","sequence":"first","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,7,5]]},"reference":[{"key":"147_CR1","volume-title":"How people learn","author":"JD Bransford","year":"2000","unstructured":"Bransford JD, Brown AL, Cocking RR, et al. How people learn. Washington, DC: National academy press; 2000."},{"issue":"1","key":"147_CR2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s41239-021-00292-9","volume":"18","author":"K Seo","year":"2021","unstructured":"Seo K, Tang J, Roll I, Fels S, Yoon D. The impact of artificial intelligence on learner-instructor interaction in online learning. Int J Educ Technol Higher Educ. 2021;18(1):1\u201323.","journal-title":"Int J Educ Technol Higher Educ"},{"issue":"1","key":"147_CR3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s41239-023-00392-8","volume":"20","author":"H Crompton","year":"2023","unstructured":"Crompton H, Burke D. Artificial intelligence in higher education: the state of the field. Int J Educ Technol Higher Educ. 2023;20(1):1\u201322.","journal-title":"Int J Educ Technol Higher Educ"},{"issue":"1","key":"147_CR4","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1186\/s41239-023-00420-7","volume":"20","author":"C Zhang","year":"2023","unstructured":"Zhang C, Schie\u00dfl J, Pl\u00f6\u00dfl L, Hofmann F, Gl\u00e4ser-Zikuda M. Acceptance of artificial intelligence among pre-service teachers: a multigroup analysis. Int J Educ Technol Higher Educ. 2023;20(1):49.","journal-title":"Int J Educ Technol Higher Educ"},{"key":"147_CR5","doi-asserted-by":"publisher","first-page":"60","DOI":"10.1007\/s40593-014-0026-8","volume":"25","author":"S Burrows","year":"2015","unstructured":"Burrows S, Gurevych I, Stein B. The eras and trends of automatic short answer grading. Int J Artif Intell Educ. 2015;25:60\u2013117.","journal-title":"Int J Artif Intell Educ"},{"key":"147_CR6","unstructured":"Haller S, Aldea A, Seifert C, Strisciuglio N. Survey on automated short answer grading with deep learning: from word embeddings to transformers. arXiv preprint arXiv:2204.03503, 2022."},{"key":"147_CR7","unstructured":"Dzikovska MO, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Dang HT. Semeval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 263\u2013274, 2013."},{"key":"147_CR8","unstructured":"OpenAI. GPT-4. https:\/\/openai.com\/gpt-4."},{"key":"147_CR9","unstructured":"Google. Bard. https:\/\/bard.google.com\/."},{"issue":"2","key":"147_CR10","doi-asserted-by":"publisher","first-page":"305","DOI":"10.1353\/tech.2004.0085","volume":"45","author":"S Petrina","year":"2004","unstructured":"Petrina S. Sidney pressey and the automation of education, 1924\u20131934. Technol Cult. 2004;45(2):305\u201330.","journal-title":"Technol Cult"},{"issue":"4","key":"147_CR11","doi-asserted-by":"publisher","first-page":"303","DOI":"10.5951\/AT.13.4.0303","volume":"13","author":"P Suppes","year":"1966","unstructured":"Suppes P, Jerman M, Groen G. Arithmetic drills and review on a computer-based teletype. Arith Teach. 1966;13(4):303\u20139.","journal-title":"Arith Teach"},{"issue":"8","key":"147_CR12","doi-asserted-by":"publisher","first-page":"987","DOI":"10.1080\/00207390601002906","volume":"38","author":"CJ Sangwin","year":"2007","unstructured":"Sangwin CJ. Assessing elementary algebra with stack. Int J Math Educ Sci Technol. 2007;38(8):987\u20131002.","journal-title":"Int J Math Educ Sci Technol"},{"issue":"4","key":"147_CR13","doi-asserted-by":"publisher","first-page":"438","DOI":"10.1119\/1.2835046","volume":"76","author":"G Kortemeyer","year":"2008","unstructured":"Kortemeyer G, Kashy E, Benenson W, Bauer W. Experiences using the open-source learning content management and assessment system lon-capa in introductory physics courses. Am J Phys. 2008;76(4):438\u201344.","journal-title":"Am J Phys"},{"issue":"1","key":"147_CR14","doi-asserted-by":"publisher","first-page":"61","DOI":"10.2307\/3586852","volume":"24","author":"J Jonz","year":"1990","unstructured":"Jonz J. Another turn in the conversation: what does cloze measure? Tesol Quarterly. 1990;24(1):61\u201383.","journal-title":"Tesol Quarterly"},{"issue":"2","key":"147_CR15","doi-asserted-by":"publisher","first-page":"121","DOI":"10.1177\/026553229000700201","volume":"7","author":"CA Chapelle","year":"1990","unstructured":"Chapelle CA, Abraham RG. Cloze method: what difference does it make. Lang Testing. 1990;7(2):121\u201346.","journal-title":"Lang Testing"},{"key":"147_CR16","unstructured":"R\u00a0Pate. Open versus closed questions: what constitutes a good question. Educational research and innovations, pages 29\u201339, 2012."},{"key":"147_CR17","unstructured":"Lord FM, Novick MR. Statistical theories of mental test scores. Information Age Publishing, 2008."},{"issue":"1","key":"147_CR18","first-page":"1","volume":"7","author":"James Dean Brown","year":"2013","unstructured":"James Dean Brown. My twenty-five years of cloze testing research: so what. Int J Lang Stud. 2013;7(1):1\u201332.","journal-title":"Int J Lang Stud"},{"issue":"1","key":"147_CR19","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevSTPER.10.010118","volume":"10","author":"G Kortemeyer","year":"2014","unstructured":"Kortemeyer G. Extending item response theory to online homework. Phys Rev Special Topics-Phys Educ Res. 2014;10(1): 010118.","journal-title":"Phys Rev Special Topics-Phys Educ Res"},{"issue":"2","key":"147_CR20","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevPhysEducRes.19.020163","volume":"19","author":"G Kortemeyer","year":"2023","unstructured":"Kortemeyer G. Toward ai grading of student problem solutions in introductory physics: a feasibility study. Phys Rev Phys Educ Res. 2023;19(2): 020163.","journal-title":"Phys Rev Phys Educ Res"},{"key":"147_CR21","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2023.120640","volume":"231","author":"F Jamil","year":"2023","unstructured":"Jamil F, Hameed IA. Toward intelligent open-ended questions evaluation based on predictive optimization. Expert Syst Appl. 2023;231: 120640.","journal-title":"Expert Syst Appl"},{"key":"147_CR22","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijinfomgt.2022.102555","volume":"69","author":"Stephen Jackson","year":"2023","unstructured":"Jackson Stephen, Panteli Niki. Trust or mistrust in algorithmic grading? an embedded agency perspective. Int J Inf Manag. 2023;69: 102555.","journal-title":"Int J Inf Manag"},{"issue":"1","key":"147_CR23","doi-asserted-by":"publisher","first-page":"37","DOI":"10.18608\/jla.2023.7801","volume":"10","author":"R Conijn","year":"2023","unstructured":"Conijn R, Kahr P, Snijders CC. The effects of explanations in automated essay scoring systems on student trust and motivation. J Learn Anal. 2023;10(1):37\u201353.","journal-title":"J Learn Anal"},{"issue":"1","key":"147_CR24","doi-asserted-by":"publisher","first-page":"177","DOI":"10.1080\/10494820.2019.1648300","volume":"30","author":"Lishan Zhang","year":"2022","unstructured":"Zhang Lishan, Huang Yuwei, Yang Xi, Shengquan Yu, Zhuang Fuzhen. An automatic short-answer grading model for semi-open-ended questions. Int Learn Environ. 2022;30(1):177\u201390.","journal-title":"Int Learn Environ"},{"key":"147_CR25","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1023\/A:1025779619903","volume":"37","author":"Claudia Leacock","year":"2003","unstructured":"Leacock Claudia, Chodorow Martin. C-rater: automated scoring of short-answer questions. Comput Hum. 2003;37:389\u2013405.","journal-title":"Comput Hum"},{"key":"147_CR26","doi-asserted-by":"crossref","unstructured":"Ahmed A, Joorabchi A, Hayes MJ. On deep learning approaches to automated assessment: strategies for short answer grading. CSEDU (2), pages 85\u201394, 2022.","DOI":"10.5220\/0011082100003182"},{"key":"147_CR27","doi-asserted-by":"crossref","unstructured":"Akila Devi TR, Javubar Sathick K, Abdul Azeez Khan A, Arun Raj L. Novel framework for improving the correctness of reference answers to enhance results of asag systems. SN Computer Science, 2023; 4(4): 415.","DOI":"10.1007\/s42979-023-01682-8"},{"key":"147_CR28","unstructured":"Kerneler, Kaggle: semeval 2013 2 and 3 way. https:\/\/www.kaggle.com\/datasets\/smiles28\/semeval-2013-2-and-3-way."},{"key":"147_CR29","unstructured":"Microsoft. Azure ai services. https:\/\/azure.microsoft.com\/en-us\/products\/ai-services."},{"key":"147_CR30","unstructured":"Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018."},{"key":"147_CR31","unstructured":"Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et\u00a0al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 2019; 32."},{"key":"147_CR32","doi-asserted-by":"crossref","unstructured":"Andrew Poulton and Sebas Eliens. Explaining transformer-based models for automatic short answer grading. In Proceedings of the 5th International Conference on Digital Technology in Education, pages 110\u2013116, 2021.","DOI":"10.1145\/3488466.3488479"},{"key":"147_CR33","doi-asserted-by":"crossref","unstructured":"Sultan MA, Salazar C, Sumner T. Fast and easy short answer grading with high accuracy. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1070\u20131075, 2016.","DOI":"10.18653\/v1\/N16-1123"},{"key":"147_CR34","doi-asserted-by":"crossref","unstructured":"Saha S, Dhamecha TI, Marvaniya S, Sindhgatta R, Sengupta B. Sentence level or token level features for automatic short answer grading?: Use both. In Artificial Intelligence in Education: 19th International Conference, AIED 2018, London, UK, June 27\u201330, 2018, Proceedings, Part I 19, pages 503\u2013517. Springer, 2018.","DOI":"10.1007\/978-3-319-93843-1_37"},{"issue":"3","key":"147_CR35","doi-asserted-by":"publisher","first-page":"1636","DOI":"10.1080\/10494820.2020.1855207","volume":"31","author":"Hongye Tan","year":"2023","unstructured":"Tan Hongye, Wang Chong, Qinglong Duan YuLu, Zhang Hu, Li Ru. Automatic short answer grading by encoding student responses via a graph convolutional network. Int Learn Environ. 2023;31(3):1636\u201350.","journal-title":"Int Learn Environ"},{"key":"147_CR36","doi-asserted-by":"crossref","unstructured":"Li Z, Tomar Y, Passonneau RJ. A semantic feature-wise transformation relation network for automatic short answer grading. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6030\u20136040, 2021.","DOI":"10.18653\/v1\/2021.emnlp-main.487"},{"key":"147_CR37","doi-asserted-by":"crossref","unstructured":"Filighera A, Tschesche J, Steuer T, Tregel T, Wernet L. Towards generating counterfactual examples as automatic short answer feedback. In International Conference on Artificial Intelligence in Education, pages 206\u2013217. Springer, 2022.","DOI":"10.1007\/978-3-031-11644-5_17"},{"issue":"1","key":"147_CR38","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevPhysEducRes.19.010132","volume":"19","author":"Gerd Kortemeyer","year":"2023","unstructured":"Kortemeyer Gerd. Could an artificial-intelligence agent pass an introductory physics course? Phys Rev Phys Educ Res. 2023;19(1): 010132.","journal-title":"Phys Rev Phys Educ Res"},{"issue":"2","key":"147_CR39","doi-asserted-by":"publisher","first-page":"371","DOI":"10.1111\/j.1467-8535.2008.00928.x","volume":"40","author":"Sally Jordan","year":"2009","unstructured":"Jordan Sally, Mitchell Tom. e-assessment for learning? the potential of short-answer free-text questions with tailored feedback. Br J Educ Technol. 2009;40(2):371\u201385.","journal-title":"Br J Educ Technol"},{"key":"147_CR40","unstructured":"Meta. Llama 2. https:\/\/ai.meta.com\/llama\/."}],"container-title":["Discover Artificial Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44163-024-00147-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44163-024-00147-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44163-024-00147-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,5]],"date-time":"2024-07-05T17:35:34Z","timestamp":1720200934000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44163-024-00147-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,5]]},"references-count":40,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["147"],"URL":"https:\/\/doi.org\/10.1007\/s44163-024-00147-y","relation":{},"ISSN":["2731-0809"],"issn-type":[{"value":"2731-0809","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,5]]},"assertion":[{"value":"9 December 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 June 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 July 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"There are no conflicting or Competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"47"}}