{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T14:53:12Z","timestamp":1774968792301,"version":"3.50.1"},"reference-count":66,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2024,7,18]],"date-time":"2024-07-18T00:00:00Z","timestamp":1721260800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,18]],"date-time":"2024-07-18T00:00:00Z","timestamp":1721260800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Artif Intell Educ"],"published-print":{"date-parts":[[2025,6]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Recent developments in the field of artificial intelligence allow for improved performance in the automated assessment of extended response items in mathematics, potentially allowing for the scoring of these items cheaply and at scale. This study details the grand prize-winning approach to developing large language models (LLMs) to automatically score the ten items in the National Assessment of Educational Progress (NAEP) Math Scoring Challenge. The approach uses extensive preprocessing that balanced the class labels for each item. This was done by identifying and filtering over-represented classes using a classifier trained on document-term matrices and data augmentation of under-represented classes using a generative pre-trained large language model (Grammarly\u2019s Coedit-XL; Raheja et al., 2023). We also use input modification schemes that were hand-crafted to each item type and included information from parts of the multi-step math problem students had to solve. 
Finally, we finetune several pre-trained large language models on the modified input for each individual item in the NAEP automated math scoring challenge, with DeBERTa (He et al., 2021a) showing the best performance. This approach achieved human-like agreement (less than QWK 0.05 difference from human\u2013human agreement) on nine out of the ten items in a held-out test set.<\/jats:p>","DOI":"10.1007\/s40593-024-00418-w","type":"journal-article","created":{"date-parts":[[2024,7,18]],"date-time":"2024-07-18T13:01:46Z","timestamp":1721307706000},"page":"559-586","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["Automated Scoring of Constructed Response Items in Math Assessment Using Large Language Models"],"prefix":"10.1007","volume":"35","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6316-6479","authenticated-orcid":false,"given":"Wesley","family":"Morris","sequence":"first","affiliation":[]},{"given":"Langdon","family":"Holmes","sequence":"additional","affiliation":[]},{"given":"Joon Suh","family":"Choi","sequence":"additional","affiliation":[]},{"given":"Scott","family":"Crossley","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,7,18]]},"reference":[{"key":"418_CR1","doi-asserted-by":"publisher","unstructured":"Abdullah, M., Khrais, J., & Swedat, S. (2022). Transformer-based deep learning for sarcasm detection with imbalanced dataset: Resampling techniques with downsampling and augmentation.\u00a0 In\u00a013th International Conference on Information and Communication Systems (ICICS)\u00a0(pp. 294\u2013300). IEEE. https:\/\/doi.org\/10.1109\/ICICS55353.2022.9811196","DOI":"10.1109\/ICICS55353.2022.9811196"},{"key":"418_CR2","doi-asserted-by":"publisher","unstructured":"Abercrombie, G., & Hovy, D. (2016). 
Putting sarcasm detection into context: The effects of class imbalance and manual labelling on supervised machine classification of twitter conversations. Proceedings of the ACL 2016 Student Research Workshop\u00a0(pp. 107\u2013113). Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/P16-3016","DOI":"10.18653\/v1\/P16-3016"},{"key":"418_CR3","doi-asserted-by":"publisher","unstructured":"Baffour, P., Saxberg, T., & Crossley, S. (2023). Analyzing bias in large language model solutions for assisted writing feedback tools: Lessons from the feedback prize competition series. Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)\u00a0(pp. 242\u2013246). Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/2023.bea-1.21","DOI":"10.18653\/v1\/2023.bea-1.21"},{"key":"418_CR4","unstructured":"Baral, S., Botelho, A. F., & Erickson, J. A. (2021). Improving Automated Scoring of Student Open Responses in Mathematics. International Educational Data Mining Society."},{"key":"418_CR5","doi-asserted-by":"publisher","first-page":"391","DOI":"10.1162\/tacl_a_00236","volume":"1","author":"S Basu","year":"2013","unstructured":"Basu, S., Jacobs, C., & Vanderwende, L. (2013). Powergrading: A Clustering Approach to Amplify Human Effort for Short Answer Grading. Transactions of the Association for Computational Linguistics, 1, 391\u2013402. https:\/\/doi.org\/10.1162\/tacl_a_00236","journal-title":"Transactions of the Association for Computational Linguistics"},{"issue":"1","key":"418_CR6","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1007\/s13042-022-01553-3","volume":"14","author":"M Bayer","year":"2023","unstructured":"Bayer, M., Kaufhold, M.-A., Buchhold, B., Keller, M., Dallmeyer, J., & Reuter, C. (2023). Data augmentation in natural language processing: A novel text generation approach for long and short text classifiers. 
International Journal of Machine Learning and Cybernetics, 14(1), 135\u2013150. https:\/\/doi.org\/10.1007\/s13042-022-01553-3","journal-title":"International Journal of Machine Learning and Cybernetics"},{"issue":"3","key":"418_CR7","doi-asserted-by":"publisher","first-page":"823","DOI":"10.1111\/jcal.12793","volume":"39","author":"A Botelho","year":"2023","unstructured":"Botelho, A., Baral, S., Erickson, J. A., Benachamardi, P., & Heffernan, N. T. (2023). Leveraging natural language processing to support automated assessment and feedback for student open responses in mathematics. Journal of Computer Assisted Learning, 39(3), 823\u2013840. https:\/\/doi.org\/10.1111\/jcal.12793","journal-title":"Journal of Computer Assisted Learning"},{"issue":"11","key":"418_CR8","doi-asserted-by":"publisher","first-page":"2309","DOI":"10.1177\/016146811111301101","volume":"113","author":"H Braun","year":"2011","unstructured":"Braun, H., Kirsch, I., & Yamamoto, K. (2011). An Experimental Study of the Effects of Monetary Incentives on Performance on the 12th-Grade NAEP Reading Assessment. Teachers College Record: The Voice of Scholarship in Education, 113(11), 2309\u20132344. https:\/\/doi.org\/10.1177\/016146811111301101","journal-title":"Teachers College Record: The Voice of Scholarship in Education"},{"key":"418_CR9","doi-asserted-by":"publisher","unstructured":"Cochran, K., Cohn, C., Hutchins, N., Biswas, G., & Hastings, P. (2022). Improving Automated Evaluation of Formative Assessments with Text Data Augmentation. In M. M. Rodrigo, N. Matsuda, A. I. Cristea, & V. Dimitrova (Eds.), Artificial Intelligence in Education (Vol. 13355, pp. 390\u2013401). Springer International Publishing. https:\/\/doi.org\/10.1007\/978-3-031-11644-5_32","DOI":"10.1007\/978-3-031-11644-5_32"},{"key":"418_CR10","unstructured":"Crossley, S., Kyle, K., Davenport, J., & Danielle S., M. (2016). Automatic assessment of constructed response data in a chemistry tutor. 
International Educational Data Mining Society. International Conference on Educational Data Mining (EDM), Raleigh, NC. Retrieved July 16, 2024 from https:\/\/eric.ed.gov\/?id=ED592642"},{"issue":"6","key":"418_CR11","doi-asserted-by":"publisher","first-page":"706","DOI":"10.3102\/1076998617705653","volume":"42","author":"SA Culpepper","year":"2017","unstructured":"Culpepper, S. A. (2017). The Prevalence and Implications of Slipping on Low-Stakes, Large-Scale Assessments. Journal of Educational and Behavioral Statistics, 42(6), 706\u2013725. https:\/\/doi.org\/10.3102\/1076998617705653","journal-title":"Journal of Educational and Behavioral Statistics"},{"key":"418_CR12","unstructured":"Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Preprint retrieved from https:\/\/arxiv.org\/abs\/1810.04805"},{"key":"418_CR13","unstructured":"Dossey, J. A., Mullis, I. V. S., & Jones, C. O. (1993). Can students do mathematical problem solving?: Results from constructed-response questions in NAEP\u2019s 1992 mathematics assessment. U.S. Department of Education, Office of Educational Research and Improvement."},{"key":"418_CR14","doi-asserted-by":"publisher","unstructured":"Erickson, J. A., Botelho, A. F., McAteer, S., Varatharaj, A., & Heffernan, N. T. (2020). The automated grading of student open responses in mathematics. Proceedings of the Tenth International Conference on Learning Analytics & Knowledge\u00a0(pp. 615\u2013624). Association for Computing Machinery. https:\/\/doi.org\/10.1145\/3375462.3375523","DOI":"10.1145\/3375462.3375523"},{"key":"418_CR15","doi-asserted-by":"publisher","unstructured":"Fernandez, N., Ghosh, A., Liu, N., Wang, Z., Choffin, B., Baraniuk, R., & Lan, A. (2022). Automated Scoring for Reading Comprehension via In-context BERT Tuning. In M. M. Rodrigo, N. Matsuda, A. I. Cristea, & V. Dimitrova (Eds.), Artificial Intelligence in Education (Vol. 13355, pp. 
691\u2013697). Springer International Publishing. https:\/\/doi.org\/10.1007\/978-3-031-11644-5_69","DOI":"10.1007\/978-3-031-11644-5_69"},{"issue":"2","key":"418_CR16","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1002\/ets2.12067","volume":"2015","author":"B Finn","year":"2015","unstructured":"Finn, B. (2015). Measuring Motivation in Low-Stakes Assessments. ETS Research Report Series, 2015(2), 1\u201317. https:\/\/doi.org\/10.1002\/ets2.12067","journal-title":"ETS Research Report Series"},{"key":"418_CR17","doi-asserted-by":"publisher","first-page":"e712","DOI":"10.7717\/peerj-cs.712","volume":"7","author":"B Gaye","year":"2021","unstructured":"Gaye, B., Zhang, D., & Wulamu, A. (2021). Sentiment classification for employees reviews using regression vector- stochastic gradient descent classifier (RV-SGDC). PeerJ Computer Science, 7, e712. https:\/\/doi.org\/10.7717\/peerj-cs.712","journal-title":"PeerJ Computer Science"},{"issue":"3","key":"418_CR18","doi-asserted-by":"publisher","first-page":"1167","DOI":"10.1007\/s11135-016-0323-4","volume":"51","author":"M Gnaldi","year":"2017","unstructured":"Gnaldi, M. (2017). A multidimensional IRT approach for dimensionality assessment of standardised students\u2019 tests in mathematics. Quality & Quantity, 51(3), 1167\u20131182. https:\/\/doi.org\/10.1007\/s11135-016-0323-4","journal-title":"Quality & Quantity"},{"key":"418_CR19","doi-asserted-by":"publisher","unstructured":"Goswami, M., & Sabata, P. (2021). Evaluation of ML-Based Sentiment Analysis Techniques with Stochastic Gradient Descent and Logistic Regression. In M. Chakraborty, R. Kr. Jha, V. E. Balas, S. N. Sur, & D. Kandar (Eds.), Trends in Wireless Communication and Information Security (Vol. 740, pp. 153\u2013163). Springer Singapore. 
https:\/\/doi.org\/10.1007\/978-981-33-6393-9_17","DOI":"10.1007\/978-981-33-6393-9_17"},{"issue":"6","key":"418_CR20","doi-asserted-by":"publisher","first-page":"496","DOI":"10.5951\/MT.88.6.0496","volume":"88","author":"CL Hancock","year":"1995","unstructured":"Hancock, C. L. (1995). Implementing the Assessment Standards for School Mathematics: Enhancing Mathematics Learning with Open-Ended Questions. The Mathematics Teacher, 88(6), 496\u2013499. https:\/\/doi.org\/10.5951\/MT.88.6.0496","journal-title":"The Mathematics Teacher"},{"key":"418_CR21","unstructured":"He, P., Liu, X., Gao, J., & Chen, W. (2021a). DeBERTa: Decoding-enhanced BERT with disentangled attention. Preprint retrieved from https:\/\/arxiv.org\/abs\/2006.03654"},{"key":"418_CR22","unstructured":"He, P., Liu, X., Gao, J., & Chen, W. (2021b). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. Preprint retrieved from https:\/\/arxiv.org\/abs\/2006.03654"},{"issue":"4","key":"418_CR23","doi-asserted-by":"publisher","first-page":"427","DOI":"10.1080\/08957340701580736","volume":"20","author":"TP Hogan","year":"2007","unstructured":"Hogan, T. P., & Murphy, G. (2007). Recommendations for Preparing and Scoring Constructed-Response Items: What the Experts Say. Applied Measurement in Education, 20(4), 427\u2013441. https:\/\/doi.org\/10.1080\/08957340701580736","journal-title":"Applied Measurement in Education"},{"issue":"6","key":"418_CR24","doi-asserted-by":"publisher","first-page":"584","DOI":"10.3390\/math9060584","volume":"9","author":"G-J Hwang","year":"2021","unstructured":"Hwang, G.-J., & Tu, Y.-F. (2021). Roles and research trends of artificial intelligence in mathematics education: A bibliometric mapping analysis and systematic review. Mathematics, 9(6), 584.","journal-title":"Mathematics"},{"key":"418_CR25","doi-asserted-by":"publisher","unstructured":"\u0130Lhan, M. (2019). An Empirical Study for the Statistical Adjustment of Rater Bias. 
International Journal of Assessment Tools in Education, 6(2), 193\u2013201. https:\/\/doi.org\/10.21449\/ijate.533517","DOI":"10.21449\/ijate.533517"},{"issue":"2","key":"418_CR26","first-page":"10","volume":"20","author":"N Inoue","year":"2011","unstructured":"Inoue, N., & Buczynski, S. (2011). You Asked Open-Ended Questions, Now What? Understanding the Nature of Stumbling Blocks in Teaching Inquiry Lessons. The Mathematics Educator, 20(2), 10\u201323.","journal-title":"The Mathematics Educator"},{"key":"418_CR27","unstructured":"Ji, C. S., Rahman, T., & Yee, D. S. (2021). Mapping state proficiency standards onto the NAEP scales: Results from the 2019 NAEP reading and mathematics assessments (NCES 2021\u2013036). Institute of Educational Sciences, National Center for Education Statistics. Retrieved July 16, 2024 from https:\/\/files.eric.ed.gov\/fulltext\/ED612877.pdf"},{"issue":"4\u20135","key":"418_CR28","doi-asserted-by":"publisher","first-page":"528","DOI":"10.1080\/09541440601056620","volume":"19","author":"SHK Kang","year":"2007","unstructured":"Kang, S. H. K., McDermott, K. B., & Roediger, H. L. (2007). Test format and corrective feedback modify the effect of testing on long-term retention. European Journal of Cognitive Psychology, 19(4\u20135), 528\u2013558. https:\/\/doi.org\/10.1080\/09541440601056620","journal-title":"European Journal of Cognitive Psychology"},{"issue":"6","key":"418_CR29","doi-asserted-by":"publisher","first-page":"1115","DOI":"10.1080\/01443410.2016.1166176","volume":"36","author":"B-C Kuo","year":"2016","unstructured":"Kuo, B.-C., Chen, C.-H., Yang, C.-W., & Mok, M. M. C. (2016). Cognitive diagnostic models for tests with multiple-choice and constructed-response items. Educational Psychology, 36(6), 1115\u20131133. https:\/\/doi.org\/10.1080\/01443410.2016.1166176","journal-title":"Educational Psychology"},{"key":"418_CR30","doi-asserted-by":"publisher","unstructured":"Lagakis, P., & Demetriadis, S. (2021). 
Automated essay scoring: A review of the field. 2021 International Conference on Computer, Information and Telecommunication Systems (CITS)\u00a0(pp. 1\u20136). IEEE. https:\/\/doi.org\/10.1109\/CITS52676.2021.9618476","DOI":"10.1109\/CITS52676.2021.9618476"},{"key":"418_CR31","doi-asserted-by":"publisher","unstructured":"Lan, A. S., Vats, D., Waters, A. E., & Baraniuk, R. G. (2015). Mathematical language processing: Automatic grading and feedback for open response mathematical questions. Proceedings of the Second (2015) ACM Conference on Learning @ Scale\u00a0(pp. 167\u2013176). Association for Computing Machinery. https:\/\/doi.org\/10.1145\/2724660.2724664","DOI":"10.1145\/2724660.2724664"},{"key":"418_CR32","doi-asserted-by":"publisher","unstructured":"Landron-Rivera, B. A., Santiago, N. G., Santiago, A., & Vega-Riveros, J. F. (2018). Text classification of student predicate use for automatic misconception categorization. 2018 IEEE Frontiers in Education Conference (FIE)\u00a0(pp. 1\u20138). IEEE. https:\/\/doi.org\/10.1109\/FIE.2018.8658680","DOI":"10.1109\/FIE.2018.8658680"},{"issue":"4","key":"418_CR33","doi-asserted-by":"publisher","first-page":"897","DOI":"10.3390\/psych3040056","volume":"3","author":"S Ludwig","year":"2021","unstructured":"Ludwig, S., Mayer, C., Hansen, C., Eilers, K., & Brandt, S. (2021). Automated Essay Scoring Using Transformer Models. Psych, 3(4), 897\u2013915. https:\/\/doi.org\/10.3390\/psych3040056","journal-title":"Psych"},{"issue":"1","key":"418_CR34","doi-asserted-by":"publisher","first-page":"54","DOI":"10.1177\/026553229501200104","volume":"12","author":"T Lumley","year":"1995","unstructured":"Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54\u201371. https:\/\/doi.org\/10.1177\/026553229501200104","journal-title":"Language Testing"},{"key":"418_CR35","unstructured":"Ma, E. (2019). NLP Augmentation. 
Retrieved July 16, 2024 from https:\/\/github.com\/makcedward\/nlpaug"},{"issue":"1","key":"418_CR36","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1002\/ets2.12358","volume":"2022","author":"DF McCaffrey","year":"2022","unstructured":"McCaffrey, D. F., Casabianca, J. M., Ricker-Pedley, K. L., Lawless, R. R., & Wendler, C. (2022). Best Practices for Constructed-Response Scoring. ETS Research Report Series, 2022(1), 1\u201358. https:\/\/doi.org\/10.1002\/ets2.12358","journal-title":"ETS Research Report Series"},{"key":"418_CR37","unstructured":"Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Preprint retrieved from https:\/\/arxiv.org\/abs\/1301.3781"},{"key":"418_CR38","doi-asserted-by":"publisher","unstructured":"Morris, W., Crossley, S., Holmes, L., Ou, C., McNamara, D., & Dascalu, M. (2023). Using Large Language Models to Provide Formative Feedback in Intelligent Textbooks. In N. Wang, G. Rebolledo-Mendez, V. Dimitrova, N. Matsuda, & O. C. Santos (Eds.), Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky (Vol. 1831, pp. 484\u2013489). Springer Nature Switzerland. https:\/\/doi.org\/10.1007\/978-3-031-36336-8_75","DOI":"10.1007\/978-3-031-36336-8_75"},{"key":"418_CR39","unstructured":"NAEP. (2021). ED.gov National Assessment of Educational Progress (NAEP) Automated Scoring Challenge. Github. Retrieved July 16, 2024 from https:\/\/github.com\/NAEP-AS-Challenge\/reading-prediction"},{"key":"418_CR40","unstructured":"NAEP. (2023). NAEP Math Automated Scoring Challenge. Github. Retrieved July 16, 2024 from https:\/\/github.com\/NAEP-AS-Challenge\/math-prediction"},{"issue":"3","key":"418_CR41","first-page":"33","volume":"7","author":"P Nesher","year":"1987","unstructured":"Nesher, P. (1987). 
Towards an Instructional Theory: The Role of Student\u2019s Misconceptions. For the Learning of Mathematics, 7(3), 33\u201340.","journal-title":"For the Learning of Mathematics"},{"issue":"3","key":"418_CR42","doi-asserted-by":"publisher","first-page":"185","DOI":"10.1207\/s15326977ea1003_3","volume":"10","author":"HF O\u2019Neil","year":"2005","unstructured":"O\u2019Neil, H. F., Abedi, J., Miyoshi, J., & Mastergeorge, A. (2005). Monetary Incentives for Low-Stakes Tests. Educational Assessment, 10(3), 185\u2013208. https:\/\/doi.org\/10.1207\/s15326977ea1003_3","journal-title":"Educational Assessment"},{"key":"418_CR43","unstructured":"Ormerod, C. M., Malhotra, A., & Jafari, A. (2021). Automated essay scoring using efficient transformer-based language models. Preprint retrieved from https:\/\/arxiv.org\/abs\/2102.13136"},{"key":"418_CR44","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825\u20132830.","journal-title":"Journal of Machine Learning Research"},{"key":"418_CR45","unstructured":"Peng, S., Yuan, K., Gao, L., & Tang, Z. (2021). MathBERT: A Pre-Trained Model for Mathematical Formula Understanding. Preprint retrieved from https:\/\/arxiv.org\/abs\/2105.00377"},{"key":"418_CR46","doi-asserted-by":"publisher","unstructured":"Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)\u00a0(pp. 1532\u20131543). Association for Computational Linguistics. 
https:\/\/doi.org\/10.3115\/v1\/D14-1162","DOI":"10.3115\/v1\/D14-1162"},{"issue":"2","key":"418_CR47","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1080\/0969594X.2010.532769","volume":"19","author":"JC Phelan","year":"2012","unstructured":"Phelan, J. C., Choi, K., Niemi, D., Vendlinski, T. P., Baker, E. L., & Herman, J. (2012). The effects of POWERSOURCE \u00a9 assessments on middle-school students\u2019 math performance. Assessment in Education: Principles, Policy & Practice, 19(2), 211\u2013230. https:\/\/doi.org\/10.1080\/0969594X.2010.532769","journal-title":"Assessment in Education: Principles, Policy & Practice"},{"key":"418_CR48","unstructured":"Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Preprint retrieved from https:\/\/arxiv.org\/abs\/1910.10683"},{"key":"418_CR49","doi-asserted-by":"crossref","unstructured":"Raheja, V., Kumar, D., Koo, R., & Kang, D. (2023). CoEdIT: Text Editing by Task-Specific Instruction Tuning. Preprint retrieved from https:\/\/arxiv.org\/abs\/2305.09857","DOI":"10.18653\/v1\/2023.findings-emnlp.350"},{"key":"418_CR50","doi-asserted-by":"crossref","unstructured":"Raman, M., Maini, P., Kolter, J. Z., Lipton, Z. C., & Pruthi, D. (2023). Model-tuning Via Prompts Makes NLP Models Adversarially Robust. Preprint retrieved from https:\/\/arxiv.org\/abs\/2303.07320","DOI":"10.18653\/v1\/2023.emnlp-main.576"},{"key":"418_CR51","unstructured":"Rampey, B., Dion, G. S., & Donahue, P. L. (2009). NAEP 2008: Trends in Academic Progress. NCES 2009\u2013479. National Center for Educational Statistics."},{"key":"418_CR52","doi-asserted-by":"publisher","unstructured":"Rizos, G., Hemker, K., & Schuller, B. (2019). Augment to prevent: Short-text data augmentation in deep learning for hate-speech classification. 
Proceedings of the 28th ACM International Conference on Information and Knowledge Management\u00a0(pp. 991\u20131000). Association for Computing Machinery. https:\/\/doi.org\/10.1145\/3357384.3358040","DOI":"10.1145\/3357384.3358040"},{"key":"418_CR53","unstructured":"Rodriguez, P. U., Jafari, A., & Ormerod, C. M. (2019). Language models and Automated Essay Scoring. Preprint retrieved from https:\/\/arxiv.org\/abs\/1909.09482"},{"issue":"2","key":"418_CR54","doi-asserted-by":"publisher","first-page":"138","DOI":"10.1080\/08957347.2019.1577249","volume":"32","author":"AD Slepkov","year":"2019","unstructured":"Slepkov, A. D., & Godfrey, A. T. K. (2019). Partial Credit in Answer-Until-Correct Multiple-Choice Tests Deployed in a Classroom Setting. Applied Measurement in Education, 32(2), 138\u2013150. https:\/\/doi.org\/10.1080\/08957347.2019.1577249","journal-title":"Applied Measurement in Education"},{"key":"418_CR55","first-page":"16857","volume":"33","author":"K Song","year":"2020","unstructured":"Song, K., Tan, X., Qin, T., Lu, J., & Liu, T.-Y. (2020). MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33, 16857\u201316867.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"418_CR56","unstructured":"Stedman, L. C. (2008). The NAEP long-term trend assessment: A review of its transformation, use, and findings. Teaching, Learning, and Educational Leadership Faculty Scholarship, 2. Retrieved July 16, 2024 from https:\/\/orb.binghamton.edu\/education_fac\/2\/"},{"key":"418_CR57","unstructured":"Sukkarieh, J., Pulman, S., & Raikes, N. (2003). Automarking: Using computational linguistics to score short free-text responses. Proceedings of 29th International Association for Educational Assessment (IAEA) Annual Conference. Retrieved July 16, 2024 from https:\/\/www.cs.ox.ac.uk\/files\/234\/sukkarieh-pulman-raikes.pdf"},{"key":"418_CR58","unstructured":"Sukkarieh, J. 
Z., & Blackmore, J. (2009). C-rater: Automatic Content Scoring for Short Constructed Responses. Flairs Conference."},{"key":"418_CR59","unstructured":"Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., \u2026 Scialom, T. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. Preprint retrieved from https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"418_CR60","doi-asserted-by":"publisher","unstructured":"Tran, N., Pierce, B., Litman, D., Correnti, R., & Matsumura, L. C. (2023). Utilizing Natural Language Processing for Automated Assessment of Classroom Discussion. In N. Wang, G. Rebolledo-Mendez, V. Dimitrova, N. Matsuda, & O. C. Santos (Eds.), Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky (Vol. 1831, pp. 490\u2013496). Springer Nature Switzerland. https:\/\/doi.org\/10.1007\/978-3-031-36336-8_76","DOI":"10.1007\/978-3-031-36336-8_76"},{"key":"418_CR61","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Preprint retrieved from https:\/\/arxiv.org\/abs\/1706.03762"},{"key":"418_CR62","doi-asserted-by":"publisher","unstructured":"Wang, Y., Zheng, Y., Zhu, J., & Yu, Y. (2022). LoBERTa: A composition named entity recognition method based on longformer and DeBERTa model.\u00a0International Conference on Machine Learning, Cloud Computing and Intelligent Mining (MLCCIM)\u00a0(pp. 266\u2013270). IEEE. https:\/\/doi.org\/10.1109\/MLCCIM55934.2022.00052","DOI":"10.1109\/MLCCIM55934.2022.00052"},{"key":"418_CR63","doi-asserted-by":"crossref","unstructured":"Whitmer, J., Deng, E., Blankenship, C., Beiting-Parrish, M., Zhang, T., & Bailey, P. (2023). 
Results of NAEP Reading Item Automated Scoring Data Challenge (Fall 2021). Preprint retrieved from https:\/\/osf.io\/preprints\/edarxiv\/2hevq","DOI":"10.35542\/osf.io\/2hevq"},{"key":"418_CR64","doi-asserted-by":"crossref","unstructured":"Yoo, K. M., Park, D., Kang, J., Lee, S.-W., & Park, W. (2021). GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation. Preprint retrieved from https:\/\/arxiv.org\/abs\/2104.08826","DOI":"10.18653\/v1\/2021.findings-emnlp.192"},{"key":"418_CR65","unstructured":"Zhang, M., Baral, S., Heffernan, N., & Lan, A. (2022). Automatic Short Math Answer Grading via In-context Meta-learning. Preprint retrieved from https:\/\/arxiv.org\/abs\/2205.15219"},{"key":"418_CR66","doi-asserted-by":"publisher","unstructured":"Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. Twenty-First International Conference on Machine Learning - ICML \u201904\u00a0(pp. 116). Association for Computing Machinery. 
https:\/\/doi.org\/10.1145\/1015330.1015332","DOI":"10.1145\/1015330.1015332"}],"container-title":["International Journal of Artificial Intelligence in Education"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40593-024-00418-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40593-024-00418-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40593-024-00418-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T18:12:51Z","timestamp":1772647971000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40593-024-00418-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,18]]},"references-count":66,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6]]}},"alternative-id":["418"],"URL":"https:\/\/doi.org\/10.1007\/s40593-024-00418-w","relation":{},"ISSN":["1560-4292","1560-4306"],"issn-type":[{"value":"1560-4292","type":"print"},{"value":"1560-4306","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,18]]},"assertion":[{"value":"4 July 2024","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 July 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors affirm that there are no conflicts of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflicts of Interest"}}]}}