{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,14]],"date-time":"2026-03-14T18:12:56Z","timestamp":1773511976383,"version":"3.50.1"},"reference-count":88,"publisher":"Elsevier BV","issue":"5","license":[{"start":{"date-parts":[[2025,5,30]],"date-time":"2025-05-30T00:00:00Z","timestamp":1748563200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,5,30]],"date-time":"2025-05-30T00:00:00Z","timestamp":1748563200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"This study was partially supported by the Israeli Science Foundation","award":["851\/22851\/22"],"award-info":[{"award-number":["851\/22851\/22"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Artif Intell Educ"],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Large Language Models (LLMs) are becoming increasingly popular in assessment systems for analyzing and providing personalized feedback on student responses to open-ended questions. However, the quality of diagnosis provided by such systems depends heavily on the ability of the LLMs to accurately capture the subtle differences between responses that represent the key types of student reasoning, also referred to as Knowledge Profiles (KPs). In this study, we compared expert-defined KPs with data-driven clusters generated from LLM embeddings of student responses in biology. We aimed to determine whether LLM-based clusters align with the theory-driven KPs that classify responses by their level of conceptual accuracy. 
Our findings revealed a \u2018discoverability bias\u2019 where LLM-derived clusters captured reasonably well the high-quality responses, but failed to distinguish between the different ways student responses can be incorrect. We then traced this \u2018discoverability bias\u2019 to the representations of the KPs in the pre-trained LLM embedding space and showed that as student responses become more wrong, they become less similar in the embedding space to other responses that reveal the same type of conceptual error. Furthermore, we found a strong relationship between the quality of the KP responses (correct or various degrees of incorrect) and the shape and density of their embeddings-based representation. Specifically, we found that the lower the quality of the KP, the less similar its responses are to each other in the embedding space. This phenomenon, which we call the \u2018Anna Karenina Principle\u2019 and study in the context of automated short answer scoring, suggests that LLM embeddings may not be sufficiently sensitive out-of-the-box to the nuances that distinguish between key profiles of conceptual understanding. 
This limitation poses challenges for developing fair and effective LLM-based formative assessment systems.<\/jats:p>","DOI":"10.1007\/s40593-025-00485-7","type":"journal-article","created":{"date-parts":[[2025,5,30]],"date-time":"2025-05-30T14:00:58Z","timestamp":1748613658000},"page":"2821-2855","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Uncovering Measurement Biases in LLM Embedding Spaces: The Anna Karenina Principle and Its Implications for Automated Feedback"],"prefix":"10.1016","volume":"35","author":[{"given":"Abigail","family":"Gurin Schleifer","sequence":"first","affiliation":[]},{"given":"Beata","family":"Beigman Klebanov","sequence":"additional","affiliation":[]},{"given":"Giora","family":"Alexandron","sequence":"additional","affiliation":[]}],"member":"78","published-online":{"date-parts":[[2025,5,30]]},"reference":[{"key":"485_CR1","doi-asserted-by":"crossref","unstructured":"Al\u00a0Ghadban, Y., Lu, H. Y., Adavi, U., Sharma, A., Gara, S., Das, N., Kumar, B., John, R., Devarsetty, P., & Hirst, J. E. (2023). Transforming healthcare education: Harnessing large language models for frontline health worker capacity building using retrieval-augmented generation. medRxiv, 2023\u201312.","DOI":"10.21955\/gatesopenres.1117064.1"},{"key":"485_CR2","unstructured":"Ali, M., Panda, S., Shen, Q., Wick, M., & Kobren, A. (2024). Understanding the interplay of scale, data, and bias in language models: A case study with BERT. arXiv preprint arXiv:2407.21058"},{"key":"485_CR3","unstructured":"Allen, L. K., Jacovina, M. E., McNamara, D. S. (2016). Computer-based writing instruction. Grantee Submission."},{"issue":"3","key":"485_CR4","doi-asserted-by":"publisher","first-page":"841","DOI":"10.1111\/jcal.12717","volume":"39","author":"N Andersen","year":"2023","unstructured":"Andersen, N., Zehner, F., & Goldhammer, F. (2023). 
Semi-automatic coding of open-ended text responses in large-scale assessments. Journal of Computer Assisted Learning, 39(3), 841\u2013854.","journal-title":"Journal of Computer Assisted Learning"},{"key":"485_CR5","doi-asserted-by":"crossref","unstructured":"Ariely, M., Nazaretsky, T., & Alexandron, G. (2024). Causal-mechanical explanations in biology: Applying automated assessment for personalized learning in the science classroom. Journal of Research in Science Teaching.","DOI":"10.1002\/tea.21929"},{"key":"485_CR6","unstructured":"Ariely, M., Nazaretsky, T., & Alexandron, G. (2022) Personalized automated formative feedback can support students in generating causal explanations in biology. In Proceedings of the International Conference of the Learning Sciences (ICLS) (pp. 953\u2013956)."},{"issue":"1","key":"485_CR7","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s40593-021-00283-x","volume":"33","author":"M Ariely","year":"2023","unstructured":"Ariely, M., Nazaretsky, T., & Alexandron, G. (2023). Machine learning and Hebrew NLP for automated assessment of open-ended questions in biology. International Journal of Artificial Intelligence in Education, 33(1), 1\u201334.","journal-title":"International Journal of Artificial Intelligence in Education"},{"issue":"4","key":"485_CR8","doi-asserted-by":"publisher","first-page":"992","DOI":"10.1007\/s40593-022-00323-0","volume":"33","author":"X Bai","year":"2023","unstructured":"Bai, X., & Stede, M. (2023). A survey of current machine learning approaches to student free-text evaluation for intelligent tutoring. International Journal of Artificial Intelligence in Education, 33(4), 992\u20131030.","journal-title":"International Journal of Artificial Intelligence in Education"},{"issue":"4","key":"485_CR9","doi-asserted-by":"publisher","first-page":"1052","DOI":"10.1007\/s40593-021-00285-9","volume":"32","author":"RS Baker","year":"2022","unstructured":"Baker, R. S., & Hawn, A. (2022). 
Algorithmic bias in education. International Journal of Artificial Intelligence in Education, 32(4), 1052\u20131092.","journal-title":"International Journal of Artificial Intelligence in Education"},{"key":"485_CR10","doi-asserted-by":"crossref","unstructured":"Barnett, S., Kurniawan, S., Thudumu, S., Brannelly, Z., & Abdelrazek, M. (2024). Seven failure points when engineering a retrieval augmented generation system. In Proceedings of the IEEE\/ACM 3rd International Conference on AI Engineering-Software Engineering for AI (pp. 194\u2013199).","DOI":"10.1145\/3644815.3644945"},{"key":"485_CR11","doi-asserted-by":"publisher","first-page":"108632","DOI":"10.1016\/j.knosys.2022.108632","volume":"245","author":"F Bayram","year":"2022","unstructured":"Bayram, F., Ahmed, B. S., & Kassler, A. (2022). From concept drift to model degradation: An overview on performance-aware drift detectors. Knowledge-Based Systems, 245, 108632.","journal-title":"Knowledge-Based Systems"},{"key":"485_CR12","doi-asserted-by":"crossref","unstructured":"Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (pp. 610\u2013623).","DOI":"10.1145\/3442188.3445922"},{"key":"485_CR13","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s11704-019-9059-3","volume":"15","author":"P Bhattacharjee","year":"2021","unstructured":"Bhattacharjee, P., & Mitra, P. (2021). A survey of density based clustering algorithms. Frontiers of Computer Science, 15, 1\u201327.","journal-title":"Frontiers of Computer Science"},{"key":"485_CR14","unstructured":"Bird, S., Edgar, R., Horn, B., Lutz, R., Milan, V., Sameki, M., Wallach, H., & Walker, K. (2020). Fairlearn: A toolkit for assessing and improving fairness in AI. Microsoft Technical Report (MSR-TR-2020-32)."},{"key":"485_CR15","doi-asserted-by":"crossref","unstructured":"Bornea, A. 
-L., Ayed, F., De\u00a0Domenico, A., Piovesan, N., & Maatouk, A. (2024). Telco-rag: Navigating the challenges of retrieval-augmented language models for telecommunications. arXiv preprint arXiv:2404.15939","DOI":"10.1109\/GLOBECOM52923.2024.10901158"},{"key":"485_CR16","doi-asserted-by":"publisher","first-page":"60","DOI":"10.1007\/s40593-014-0026-8","volume":"25","author":"S Burrows","year":"2015","unstructured":"Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25, 60\u2013117.","journal-title":"International Journal of Artificial Intelligence in Education"},{"issue":"6334","key":"485_CR17","doi-asserted-by":"publisher","first-page":"183","DOI":"10.1126\/science.aal4230","volume":"356","author":"A Caliskan","year":"2017","unstructured":"Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183\u2013186.","journal-title":"Science"},{"key":"485_CR18","doi-asserted-by":"crossref","unstructured":"Cao, Y. T., Pruksachatkun, Y., Chang, K. -W., Gupta, R., Kumar, V., Dhamala, J., & Galstyan, A. (2022). On the intrinsic and extrinsic fairness evaluation metrics for contextualized language representations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 561\u2013570).","DOI":"10.18653\/v1\/2022.acl-short.62"},{"key":"485_CR19","unstructured":"Devlin, J., Chang, M. -W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805"},{"key":"485_CR20","unstructured":"Dong, C. (2023). How to build an AI tutor that can adapt to any course and provide accurate answers using large language model and retrieval-augmented generation. 
arXiv preprint arXiv:2311.17696"},{"key":"485_CR21","unstructured":"Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783"},{"key":"485_CR22","doi-asserted-by":"publisher","first-page":"15991","DOI":"10.1109\/ACCESS.2017.2654247","volume":"5","author":"A Dutt","year":"2017","unstructured":"Dutt, A., Ismail, M. A., & Herawan, T. (2017). A systematic review on educational data mining. Ieee Access, 5, 15991\u201316005.","journal-title":"Ieee Access"},{"key":"485_CR23","doi-asserted-by":"crossref","unstructured":"Eklund, A., & Forsman, M. (2022). Topic modeling by clustering language model embeddings: Human validation on an industry dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track (pp. 635\u2013643).","DOI":"10.18653\/v1\/2022.emnlp-industry.65"},{"issue":"9","key":"485_CR24","doi-asserted-by":"publisher","first-page":"985","DOI":"10.1002\/tea.3660310911","volume":"31","author":"NJ Fellows","year":"1994","unstructured":"Fellows, N. J. (1994). A window into thinking: Using student writing to understand conceptual change in science learning. Journal of Research in Science Teaching, 31(9), 985\u20131001.","journal-title":"Journal of Research in Science Teaching"},{"issue":"1","key":"485_CR25","doi-asserted-by":"publisher","first-page":"71","DOI":"10.1007\/s10649-008-9134-4","volume":"70","author":"F Furinghetti","year":"2009","unstructured":"Furinghetti, F., & Morselli, F. (2009). Every unsuccessful problem solver is unsuccessful in his or her own way: affective and cognitive factors in proving. 
Educational Studies in Mathematics, 70(1), 71\u201390.","journal-title":"Educational Studies in Mathematics"},{"issue":"1","key":"485_CR26","doi-asserted-by":"publisher","first-page":"111","DOI":"10.1007\/s10972-016-9455-6","volume":"27","author":"LF Gerard","year":"2016","unstructured":"Gerard, L. F., & Linn, M. C. (2016). Using automated scores of student essays to support teacher guidance in classroom inquiry. Journal of Science Teacher Education, 27(1), 111\u2013129.","journal-title":"Journal of Science Teacher Education"},{"key":"485_CR27","doi-asserted-by":"crossref","unstructured":"Goldfarb-Tarrant, S., Marchant, R., Mu\u00f1oz\u00a0S\u00e1nchez, R., Pandya, M., & Lopez, A. (2021). Intrinsic bias metrics do not correlate with application bias. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 1926\u20131940).","DOI":"10.18653\/v1\/2021.acl-long.150"},{"key":"485_CR28","doi-asserted-by":"crossref","unstructured":"Goldfarb-Tarrant, S., Ungless, E., Balkir, E., & Blodgett, S. L. (2023). This prompt is measuring $$<$$mask$$>$$: evaluating bias evaluation in language models. In Findings of the Association for Computational Linguistics: ACL (pp. 2209\u20132225).","DOI":"10.18653\/v1\/2023.findings-acl.139"},{"key":"485_CR29","doi-asserted-by":"crossref","unstructured":"Gorgun, G., & Yildirim-Erbasli, S. N. (2024). Algorithmic bias in BERT for response accuracy prediction: A case study for investigating population validity. Journal of Educational Measurement","DOI":"10.1111\/jedm.12420"},{"key":"485_CR30","unstructured":"Gurin\u00a0Schleifer, A., Beigman\u00a0Klebanov, B., Ariely, M., & Alexandron, G. (2024). Anna Karenina strikes again: Pre-trained LLM embeddings may favor high-performing learners. In Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications (BEA) (pp. 
391\u2013402)."},{"key":"485_CR31","doi-asserted-by":"crossref","unstructured":"Gurin\u00a0Schleifer, A., Beigman\u00a0Klebanov, B., Ariely, M., & Alexandron, G. (2023). Transformer-based Hebrew NLP models for short answer scoring in biology. In Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications (BEA) (pp. 550\u2013555).","DOI":"10.18653\/v1\/2023.bea-1.46"},{"key":"485_CR32","unstructured":"Haller, S., Aldea, A., Seifert, C., & Strisciuglio, N. (2022). Survey on automated short answer grading with deep learning: from word embeddings to transformers. arXiv preprint arXiv:2204.03503"},{"key":"485_CR33","doi-asserted-by":"crossref","unstructured":"Harris, N., Butani, A., & Hashmy, S. (2024). Enhancing embedding performance through large language model-based text enrichment and rewriting. arXiv preprint arXiv:2404.12283","DOI":"10.54364\/AAIML.2024.42136"},{"issue":"1","key":"485_CR34","doi-asserted-by":"publisher","first-page":"81","DOI":"10.3102\/003465430298487","volume":"77","author":"J Hattie","year":"2007","unstructured":"Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81\u2013112.","journal-title":"Review of Educational Research"},{"key":"485_CR35","doi-asserted-by":"crossref","unstructured":"He, Q., Liao, D., & Jiao, H. (2019). Clustering behavioral patterns using process data in piaac problem-solving items. Theoretical and practical advances in computer-based educational measurement, 189\u2013212","DOI":"10.1007\/978-3-030-18480-3_10"},{"key":"485_CR36","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1016\/j.lindif.2017.11.001","volume":"66","author":"M Hickendorff","year":"2018","unstructured":"Hickendorff, M., Edelsbrunner, P. A., McMullen, J., Schneider, M., & Trezise, K. (2018). Informative tools for characterizing individual differences in learning: Latent class, latent profile, and latent transition analysis. 
Learning and Individual Differences, 66, 4\u201315.","journal-title":"Learning and Individual Differences"},{"key":"485_CR37","doi-asserted-by":"crossref","unstructured":"Hubert, L., & Arabie, P. (1985) Comparing partitions. Journal of Classification, 193\u2013218","DOI":"10.1007\/BF01908075"},{"key":"485_CR38","doi-asserted-by":"publisher","first-page":"178","DOI":"10.1016\/j.ins.2022.11.139","volume":"622","author":"AM Ikotun","year":"2023","unstructured":"Ikotun, A. M., Ezugwu, A. E., Abualigah, L., Abuhaija, B., & Heming, J. (2023). K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Information Sciences, 622, 178\u2013210.","journal-title":"Information Sciences"},{"issue":"3","key":"485_CR39","doi-asserted-by":"publisher","first-page":"264","DOI":"10.1145\/331499.331504","volume":"31","author":"AK Jain","year":"1999","unstructured":"Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM computing surveys (CSUR), 31(3), 264\u2013323.","journal-title":"ACM computing surveys (CSUR)"},{"key":"485_CR40","unstructured":"Kaneko, M., Bollegala, D., & Okazaki, N. (2022). Debiasing isn\u2019t enough! \u2013 On the effectiveness of debiasing MLMs and their social biases in downstream tasks. In Proceedings of the International Conference on Computational Linguistics (COLING) (pp. 1299\u20131310)."},{"key":"485_CR41","doi-asserted-by":"crossref","unstructured":"Kumar, Y., Aggarwal, S., Mahata, D., Shah, R. R., Kumaraguru, P., & Zimmermann, R. (2019). Get IT scored using AutoSAS \u2014 an automated system for scoring short answers. In Proceedings of the Conference of the American Association for Artificial Intelligence (AAAI) (pp. 
9662\u20139669).","DOI":"10.1609\/aaai.v33i01.33019662"},{"key":"485_CR42","doi-asserted-by":"publisher","first-page":"43","DOI":"10.1007\/978-981-99-0026-8_2","volume-title":"Educational Data Science: Essentials, Approaches, and Tendencies: Proactive Education Based on Empirical Big Data Evidence","author":"T Le Quy","year":"2023","unstructured":"Le Quy, T., Friege, G., & Ntoutsi, E. (2023). A review of clustering models in educational data science toward fairness-aware learning. In A. Pe\u00f1a-Ayala (Ed.), Educational Data Science: Essentials, Approaches, and Tendencies: Proactive Education Based on Empirical Big Data Evidence (pp. 43\u201394). Singapore: Springer."},{"key":"485_CR43","unstructured":"Le\u00a0Scao, T., Fan, A., Akiki, C., Pavlick, E., Ili\u0107, S., Hesslow, D., Castagn\u00e9, R., Luccioni, A. S., Yvon, F., Gall\u00e9, M., et al. (2023). Bloom: A 176b-parameter open-access multilingual language model. hal-03850124f"},{"issue":"1","key":"485_CR44","first-page":"22","volume":"11","author":"ML Ledbetter","year":"2012","unstructured":"Ledbetter, M. L. (2012). Vision and change in undergraduate biology education: a call to action presentation to Faculty for Undergraduate Neuroscience, July 2011. Journal of Undergraduate Neuroscience Education, 11(1), 22.","journal-title":"Journal of Undergraduate Neuroscience Education"},{"key":"485_CR45","first-page":"9459","volume":"33","author":"P Lewis","year":"2020","unstructured":"Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K\u00fcttler, H., Lewis, M., Yih, W.-T., Rockt\u00e4schel, T., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459\u20139474.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"485_CR46","unstructured":"Li, H., Gobert, J., & Dickler, R. (2017). Automated assessment for scientific explanations in on-line science inquiry. 
In Proceedings of the International Conference on Educational Data Mining (EDM) (pp. 214\u2013219)."},{"key":"485_CR47","doi-asserted-by":"crossref","unstructured":"Li, Z., Tomar, Y., & Passonneau, R. J. (2021). A semantic feature-wise transformation relation network for automatic short answer grading. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 6030\u20136040).","DOI":"10.18653\/v1\/2021.emnlp-main.487"},{"issue":"2","key":"485_CR48","doi-asserted-by":"publisher","first-page":"19","DOI":"10.1111\/emip.12028","volume":"33","author":"OL Liu","year":"2014","unstructured":"Liu, O. L., Brew, C., Blackmore, J., Gerard, L., Madhok, J., & Linn, M. C. (2014). Automated scoring of constructed-response science items: Prospects and obstacles. Educational Measurement: Issues and Practice, 33(2), 19\u201328.","journal-title":"Educational Measurement: Issues and Practice"},{"issue":"2","key":"485_CR49","doi-asserted-by":"publisher","first-page":"129","DOI":"10.1109\/TIT.1982.1056489","volume":"28","author":"S Lloyd","year":"1982","unstructured":"Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129\u2013137.","journal-title":"IEEE Transactions on Information Theory"},{"key":"485_CR50","doi-asserted-by":"crossref","unstructured":"Luukkonen, R., Komulainen, V., Luoma, J., Eskelinen, A., Kanerva, J., Kupari, H. -M., Ginter, F., Laippala, V., Muennighoff, N., Piktus, A., Wang, T., Tazi, N., Scao, T., Wolf, T., Suominen, O., Sairanen, S., Merioksa, M., Heinonen, J., Vahtola, A., Antao, S., & Pyysalo, S. (2023). FinGPT: Large generative models for a small language. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 2710\u20132726).","DOI":"10.18653\/v1\/2023.emnlp-main.164"},{"key":"485_CR51","doi-asserted-by":"crossref","unstructured":"Madnani, N., Loukina, A., Von\u00a0Davier, A., Burstein, J., & Cahill, A. (2017). 
Building better open-source tools to support fairness in automated scoring. In Proceedings of the ACL Workshop on Ethics in Natural Language Processing (pp. 41\u201352).","DOI":"10.18653\/v1\/W17-1605"},{"key":"485_CR52","volume-title":"Foundations of Statistical Natural Language Processing","author":"CD Manning","year":"1999","unstructured":"Manning, C. D., & Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press."},{"key":"485_CR53","doi-asserted-by":"crossref","unstructured":"Maoro, F., Vehmeyer, B., & Geierhos, M. (2023). Leveraging semantic search and llms for domain-adaptive information retrieval. In International Conference on Information and Software Technologies (pp. 148\u2013159). Springer","DOI":"10.1007\/978-3-031-48981-5_12"},{"key":"485_CR54","doi-asserted-by":"crossref","unstructured":"Martin, P. P., Kranz, D., Wulff, P., & Graulich, N. (2023). Exploring new depths: Applying machine learning for the analysis of student argumentation in chemistry. Journal of Research in Science Teaching.","DOI":"10.1002\/tea.21903"},{"key":"485_CR55","doi-asserted-by":"crossref","unstructured":"Masala, M., Ruseti, S., Dascalu, M., & Dobre, C. (2021). Extracting and clustering main ideas from student feedback using language models. In Proceedings of the International Conference on AI in Education (AIED) (pp. 282\u2013292).","DOI":"10.1007\/978-3-030-78292-4_23"},{"key":"485_CR56","doi-asserted-by":"crossref","unstructured":"McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. The Journal of Open Source Software, 2(11).","DOI":"10.21105\/joss.00205"},{"key":"485_CR57","doi-asserted-by":"crossref","unstructured":"Miladi, F., Psych\u00e9, V., & Lemire, D. (2024). Leveraging gpt-4 for accuracy in education: A comparative study on retrieval-augmented generation in MOOCs. In Proceedings of the International Conference on AI in Education (AIED) (pp. 
427\u2013434).","DOI":"10.1007\/978-3-031-64315-6_40"},{"key":"485_CR58","doi-asserted-by":"crossref","unstructured":"Mizumoto, T., Ouchi, H., Isobe, Y., Reisert, P., Nagata, R., Sekine, S., & Inui, K. (2019). Analytic score prediction and justification identification in automated short answer scoring. In Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications (BEA) (pp. 316\u2013325).","DOI":"10.18653\/v1\/W19-4433"},{"key":"485_CR59","unstructured":"National Research Council (2012). A framework for k-12 science education: Practices, crosscutting concepts, and core ideas. National Academy of Sciences."},{"key":"485_CR60","doi-asserted-by":"crossref","unstructured":"Nazaretsky, T., Bar, C., Walter, M., & Alexandron, G. (2022). Empowering teachers with AI: Co-designing a learning analytics tool for personalized instruction in the science classroom. In Proceedings of the International Learning Analytics and Knowledge Conference (LAK) (pp. 1\u201312).","DOI":"10.1145\/3506860.3506861"},{"key":"485_CR61","doi-asserted-by":"publisher","first-page":"56","DOI":"10.1007\/s10956-011-9282-7","volume":"21","author":"RH Nehm","year":"2012","unstructured":"Nehm, R. H., & Haertig, H. (2012). Human vs. computer diagnosis of students\u2019 natural selection knowledge: Testing the efficacy of text analytic software. Journal of Science Education and Technology, 21, 56\u201373.","journal-title":"Journal of Science Education and Technology"},{"key":"485_CR62","unstructured":"OpenAI (2024). How to count tokens with Tiktoken. Accessed 25 Oct 2024. https:\/\/cookbook.openai.com\/examples\/how_to_count_tokens_with_tiktoken"},{"key":"485_CR63","unstructured":"OpenAI (2024). OpenAI Embeddings. https:\/\/platform.openai.com\/docs\/guides\/embeddings\/embedding-models. Accessed 07 Oct 2024."},{"key":"485_CR64","unstructured":"Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. 
(2019). Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32"},{"key":"485_CR65","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825\u20132830.","journal-title":"Journal of Machine Learning Research"},{"key":"485_CR66","unstructured":"Pipitone, N., & Alami, G. H. (2024). Legalbench-rag: A benchmark for retrieval-augmented generation in the legal domain. arXiv preprint arXiv:2408.10343"},{"issue":"3","key":"485_CR67","doi-asserted-by":"publisher","first-page":"1042","DOI":"10.3390\/app10031042","volume":"10","author":"JL Rastrollo-Guerrero","year":"2020","unstructured":"Rastrollo-Guerrero, J. L., G\u00f3mez-Pulido, J. A., & Dur\u00e1n-Dom\u00ednguez, A. (2020). Analyzing and predicting students\u2019 performance by means of machine learning: A review. Applied Sciences, 10(3), 1042.","journal-title":"Applied Sciences"},{"key":"485_CR68","doi-asserted-by":"crossref","unstructured":"Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 3982\u20133992).","DOI":"10.18653\/v1\/D19-1410"},{"key":"485_CR69","doi-asserted-by":"crossref","unstructured":"Riordan, B., Bichler, S., Bradford, A., King\u00a0Chen, J., Wiley, K., Gerard, L., &\u00a0Linn, C. M. (2020). An empirical investigation of neural methods for content scoring of science explanations. In Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications (BEA) (pp. 
135\u2013144).","DOI":"10.18653\/v1\/2020.bea-1.13"},{"key":"485_CR70","doi-asserted-by":"crossref","unstructured":"Roscoe, R. D., Varner, L. K., Crossley, S. A., & McNamara, D. S. (2013). Developing pedagogically-guided algorithms for intelligent writing feedback. International Journal of Learning Technology, 8(4), 362\u2013381.","DOI":"10.1504\/IJLT.2013.059131"},{"key":"485_CR71","doi-asserted-by":"crossref","unstructured":"Rust, P., Pfeiffer, J., Vuli\u0107, I., Ruder, S., & Gurevych, I. (2021). How good is your tokenizer? on the monolingual performance of multilingual language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 3118\u20133135).","DOI":"10.18653\/v1\/2021.acl-long.243"},{"issue":"2","key":"485_CR72","doi-asserted-by":"publisher","first-page":"147","DOI":"10.1002\/tea.21128","volume":"51","author":"K Ryoo","year":"2014","unstructured":"Ryoo, K., & Linn, M. C. (2014). Designing guidance for interpreting dynamic visualizations: Generating versus reading explanations. Journal of Research in Science Teaching, 51(2), 147\u2013174.","journal-title":"Journal of Research in Science Teaching"},{"key":"485_CR73","doi-asserted-by":"crossref","unstructured":"Sakaguchi, K., Heilman, M., & Madnani, N. (2015). Effective feature integration for automated short answer scoring. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) (pp. 1049\u20131054).","DOI":"10.3115\/v1\/N15-1111"},{"key":"485_CR74","doi-asserted-by":"crossref","unstructured":"Salmon, W. C. (2006). Four Decades of Scientific Explanation. University of Pittsburgh Press, Pittsburgh, PA 15260","DOI":"10.2307\/j.ctt5vkdm7"},{"key":"485_CR75","doi-asserted-by":"crossref","unstructured":"Seker, A., Bandel, E., Bareket, D., Brusilovsky, I., Greenfeld, R., & Tsarfaty, R. (2022). AlephBERT: Language model pre-training and evaluation from sub-word to sentence level. 
In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 46\u201356).","DOI":"10.18653\/v1\/2022.acl-long.4"},{"key":"485_CR76","doi-asserted-by":"crossref","unstructured":"Sonnleitner, B., Madou, T., Deceuninck, M., Theodosiou, F., & Sagaert, Y. R. (2025). Evaluation of early student performance prediction given concept drift. Computers and Education: Artificial Intelligence, 100369","DOI":"10.1016\/j.caeai.2025.100369"},{"key":"485_CR77","unstructured":"Srivastava, A., Kleyjo, D., & Wu, Z. (2023). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 5."},{"key":"485_CR78","doi-asserted-by":"crossref","unstructured":"Sung, C., Dhamecha, T. I., & Mukhi, N. (2019). Improving short answer grading using transformer-based pre-training. In Proceedings of the International Conference on AI in Education (AIED) (pp. 469\u2013481).","DOI":"10.1007\/978-3-030-23204-7_39"},{"key":"485_CR79","unstructured":"Tang, Y., & Yang, Y. (2024). Pooling and attention: What are effective designs for llm-based embedding models? arXiv preprint arXiv:2409.02727"},{"key":"485_CR80","doi-asserted-by":"publisher","first-page":"729","DOI":"10.1007\/s40593-017-0145-0","volume":"27","author":"C Tansomboon","year":"2017","unstructured":"Tansomboon, C., Gerard, L. F., Vitale, J. M., & Linn, M. C. (2017). Designing automated guidance to promote productive revision of science explanations. International Journal of Artificial Intelligence in Education, 27, 729\u2013757.","journal-title":"International Journal of Artificial Intelligence in Education"},{"issue":"1","key":"485_CR81","doi-asserted-by":"publisher","first-page":"3","DOI":"10.3102\/10769986211010467","volume":"47","author":"E Ulitzsch","year":"2022","unstructured":"Ulitzsch, E., He, Q., & Pohl, S. (2022). Using sequence mining techniques for understanding incorrect behavioral patterns on interactive tasks. 
Journal of Educational and Behavioral Statistics, 47(1), 3\u201335.","journal-title":"Journal of Educational and Behavioral Statistics"},{"key":"485_CR82","doi-asserted-by":"crossref","unstructured":"Vinh, N. X., Epps, J., & Bailey, J. (2009). Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In Proceedings of the International Conference on Machine Learning (ICML) (pp. 1073\u20131080).","DOI":"10.1145\/1553374.1553511"},{"key":"485_CR83","first-page":"24824","volume":"35","author":"J Wei","year":"2022","unstructured":"Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824\u201324837.","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"4","key":"485_CR84","doi-asserted-by":"publisher","first-page":"490","DOI":"10.1007\/s10956-022-09969-w","volume":"31","author":"P Wulff","year":"2022","unstructured":"Wulff, P., Buschh\u00fcter, D., Westphal, A., Mientus, L., Nowak, A., & Borowski, A. (2022). Bridging the gap between qualitative and quantitative assessment in science education research with machine learning\u2014a case for pretrained language models-based clustering. Journal of Science Education and Technology, 31(4), 490\u2013513.","journal-title":"Journal of Science Education and Technology"},{"issue":"14","key":"485_CR85","doi-asserted-by":"publisher","first-page":"3866","DOI":"10.1002\/cpe.3745","volume":"28","author":"C Xiao","year":"2016","unstructured":"Xiao, C., Ye, J., Esteves, R. M., & Rong, C. (2016). Using Spearman\u2019s correlation coefficients for exploratory data analysis on big dataset. 
Concurrency and Computation: Practice and Experience, 28(14), 3866\u20133878.","journal-title":"Concurrency and Computation: Practice and Experience"},{"issue":"1","key":"485_CR86","doi-asserted-by":"publisher","first-page":"111","DOI":"10.1080\/03057267.2020.1735757","volume":"56","author":"X Zhai","year":"2020","unstructured":"Zhai, X., Yin, Y., Pellegrino, J. W., Haudek, K. C., & Shi, L. (2020). Applying machine learning in science assessment: a systematic review. Studies in Science Education, 56(1), 111\u2013151.","journal-title":"Studies in Science Education"},{"key":"485_CR87","unstructured":"Zhou, C., Li, Q., Li, C., Yu, J., Liu, Y., Wang, G., Zhang, K., Ji, C., Yan, Q., He, L., et al. (2024). A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. International Journal of Machine Learning and Cybernetics, 1\u201365"},{"issue":"1","key":"485_CR88","doi-asserted-by":"publisher","first-page":"44","DOI":"10.1093\/nsr\/nwx106","volume":"5","author":"Z-H Zhou","year":"2018","unstructured":"Zhou, Z.-H. (2018). A brief introduction to weakly supervised learning. 
National Science Review, 5(1), 44\u201353.","journal-title":"National Science Review"}],"container-title":["International Journal of Artificial Intelligence in Education"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40593-025-00485-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40593-025-00485-7","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40593-025-00485-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T18:12:43Z","timestamp":1772647963000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40593-025-00485-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,30]]},"references-count":88,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["485"],"URL":"https:\/\/doi.org\/10.1007\/s40593-025-00485-7","relation":{},"ISSN":["1560-4292","1560-4306"],"issn-type":[{"value":"1560-4292","type":"print"},{"value":"1560-4306","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,30]]},"assertion":[{"value":"14 May 2025","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 May 2025","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}