{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T19:30:54Z","timestamp":1772652654700,"version":"3.50.1"},"reference-count":46,"publisher":"Elsevier BV","issue":"4","license":[{"start":{"date-parts":[[2025,7,10]],"date-time":"2025-07-10T00:00:00Z","timestamp":1752105600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,7,10]],"date-time":"2025-07-10T00:00:00Z","timestamp":1752105600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"JST SPRING","award":["JPMJSP2114"],"award-info":[{"award-number":["JPMJSP2114"]}]},{"name":"JST SPRING","award":["JPMJSP2114"],"award-info":[{"award-number":["JPMJSP2114"]}]},{"name":"JSPS KAKENHI","award":["JP19K12112"],"award-info":[{"award-number":["JP19K12112"]}]},{"name":"JSPS KAKENHI","award":["22H00524"],"award-info":[{"award-number":["22H00524"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Artif Intell Educ"],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Automated short answer scoring (SAS) is the task of automatically scoring a given input to a prompt based on rubrics and reference answers. SAS is promising for real-world applications. However, because rubrics and reference answers differ among prompts, there is a need to acquire new data and train a model for each new prompt. This makes SAS expensive, especially in schools and online courses where resources are limited and only a few prompts are used. In this study, we propose a two-phase approach to address this issue. The proposed approach involves training a model on existing rubrics and answers with gold score signals and then finetuning the model on a new prompt. In particular, given that scoring rubrics and reference answers differ for different prompts, we employed key phrases, which are representative expressions that the answer should contain to gain a score, and trained an SAS model to learn the relationship between the key phrases and answers using already annotated prompts (i.e., cross-prompts). We evaluated the proposed approach using bidirectional encoder representations from transformers (BERT) and open-source large language models (LLMs). In addition, we incorporated the proposed approach with zero-shot conditions and in-context learning of LLMs. The results show that the proposed two-phase approach significantly improves scoring accuracy, especially when the training data is limited. Finally, an extensive analysis revealed that it is crucial to design a model that can learn a task\u2019s general properties.<\/jats:p>","DOI":"10.1007\/s40593-025-00474-w","type":"journal-article","created":{"date-parts":[[2025,7,10]],"date-time":"2025-07-10T20:07:49Z","timestamp":1752178069000},"page":"2399-2420","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Cross-prompt Pre-finetuning of Language Models for Short Answer Scoring"],"prefix":"10.1016","volume":"35","author":[{"given":"Hiroaki","family":"Funayama","sequence":"first","affiliation":[]},{"given":"Yuichiroh","family":"Matsubayashi","sequence":"additional","affiliation":[]},{"given":"Yuya","family":"Asazuma","sequence":"additional","affiliation":[]},{"given":"Tomoya","family":"Mizumoto","sequence":"additional","affiliation":[]},{"given":"Kentaro","family":"Inui","sequence":"additional","affiliation":[]}],"member":"78","published-online":{"date-parts":[[2025,7,10]]},"reference":[{"key":"474_CR1","doi-asserted-by":"publisher","unstructured":"Aghajanyan, A., Gupta, A., Shrivastava, A., Chen, X., Zettlemoyer, L., & Gupta, S. (2021). Muppet: Massive multi-task representations with pre-finetuning. In EMNLP (pp. 5799\u20135811). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/2021.emnlp-main.468","DOI":"10.18653\/v1\/2021.emnlp-main.468"},{"key":"474_CR2","unstructured":"AI@Meta (2024). Llama 3 model card. https:\/\/github.com\/meta-llama\/llama3\/blob\/main\/MODEL{_}CARD.md"},{"key":"474_CR3","doi-asserted-by":"crossref","unstructured":"Bailey, S., & Meurers, D. (2008). Diagnosing meaning errors in short answers to reading comprehension questions. In Proceedings of the third workshop on innovative use of NLP for building educational applications (pp. 107\u2013115). Columbus, Ohio: Association for Computational Linguistics. https:\/\/aclanthology.org\/W08-0913","DOI":"10.3115\/1631836.1631849"},{"key":"474_CR4","doi-asserted-by":"publisher","unstructured":"Bexte, M., Horbach, A., & Zesch, T. (2022). Similarity-based content scoring - how to make S-BERT keep up with BERT. In E. Kochmar, J. Burstein, A. Horbach, et\u00a0al. (Eds.), Proceedings of the 17th workshop on innovative use of nlp for building educational applications (BEA 2022) (pp. 118\u2013123). Seattle, Washington: Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/2022.bea-1.16, https:\/\/aclanthology.org\/2022.bea-1.16\/","DOI":"10.18653\/v1\/2022.bea-1.16"},{"key":"474_CR5","doi-asserted-by":"publisher","unstructured":"Bexte, M., Horbach, A., & Zesch, T. (2023). Similarity-based content scoring - a more classroom-suitable alternative to instance-based scoring? In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Findings of the association for computational linguistics: ACL 2023 (pp. 1892\u2013190). Toronto, Canada: Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/2023.findings-acl.119, https:\/\/aclanthology.org\/2023.findings-acl.119\/","DOI":"10.18653\/v1\/2023.findings-acl.119"},{"key":"474_CR6","unstructured":"Brown, T. B., Mann, B., Ryder, N., et\u00a0al. (2020). Language models are few-shot learners. arXiv [csCL] [cs.CL]"},{"issue":"1","key":"474_CR7","doi-asserted-by":"publisher","first-page":"60","DOI":"10.1007\/s40593-014-0026-8","volume":"25","author":"S Burrows","year":"2015","unstructured":"Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25(1), 60\u2013117.","journal-title":"International Journal of Artificial Intelligence in Education"},{"key":"474_CR8","unstructured":"Chamieh, I., Zesch, T., & Giebermann, K. (2024). LLMs in short answer scoring: Limitations and promise of zero-shot and few-shot approaches. Workshop Innov Use NLP Build Educ Appl pp. 309\u2013315."},{"issue":"21","key":"474_CR9","doi-asserted-by":"publisher","first-page":"23173","DOI":"10.1609\/aaai.v38i21.30363","volume":"38","author":"LH Chang","year":"2024","unstructured":"Chang, L. H., & Ginter, F. (2024). Automatic short answer grading for finnish with ChatGPT. Proceedings of the Conference AAAI on Artificial Intelligence, 38(21), 23173\u201323181.","journal-title":"Proceedings of the Conference AAAI on Artificial Intelligence"},{"issue":"4","key":"474_CR10","doi-asserted-by":"publisher","first-page":"213","DOI":"10.1037\/h0026256","volume":"70","author":"J Cohen","year":"1968","unstructured":"Cohen, J. (1968). Weighted Kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213\u2013220.","journal-title":"Psychological Bulletin"},{"key":"474_CR11","unstructured":"Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. arXiv [csLG] [cs.LG]"},{"key":"474_CR12","doi-asserted-by":"publisher","unstructured":"Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (pp 4171\u20134186). https:\/\/doi.org\/10.18653\/v1\/N19-1423","DOI":"10.18653\/v1\/N19-1423"},{"key":"474_CR13","unstructured":"Dzikovska, M., Nielsen, R., Brew, C., Leacock, C., Giampiccolo, D., Bentivogli, L., Clark, P., Dagan, I., & Dang, H. T. (2013). SemEval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge. In S. Manandhar, & D. Yuret (Eds.), Second joint conference on lexical and computational semantics (*SEM), Volume 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 263\u2013274). Atlanta, Georgia, USA: Association for Computational Linguistics. https:\/\/aclanthology.org\/S13-2045\/"},{"key":"474_CR14","unstructured":"Fujii, K., Nakamura, T., Loem, M., Iida, H., Ohi, M., Hattori, K., Shota, H., Mizuki, S., Yokota, R., & Okazaki, N. (2024). Continual pre-training for cross-lingual llm adaptation: Enhancing japanese language capabilities. In Proceedings of the first conference on language modeling, University of Pennsylvania, USA, COLM (p. (to appear))."},{"key":"474_CR15","doi-asserted-by":"crossref","unstructured":"Funayama, H., Sato, T., Matsubayashi, Y., Mizumoto, T., Suzuki, J., & Inui, K. (2022). Balancing cost and quality: An exploration of human-in-the-loop frameworks for automated short answer scoring. AIED (pp. 465\u2013476). Cham: Springer International Publishing.","DOI":"10.1007\/978-3-031-11644-5_38"},{"key":"474_CR16","doi-asserted-by":"crossref","unstructured":"Funayama, H., Asazuma, Y., Matsubayashi, Y., Mizumoto, T., & Inui, K. (2023). Reducing the cost: Cross-Prompt pre-finetuning for short answer scoring. In Artificial intelligence in education (pp. 78\u201389). Springer Nature Switzerland.","DOI":"10.1007\/978-3-031-36272-9_7"},{"key":"474_CR17","unstructured":"Haller, S., Aldea, A., Seifert, C., & Strisciuglio, N. (2022) Survey on automated short answer grading with deep learning: from word embeddings to transformers."},{"key":"474_CR18","unstructured":"Heilman, M., & Madnani, N. (2013). ETS: Domain adaptation and stacking for short answer scoring. In S. Manandhar, & D. Yuret (Eds.), Second joint conference on lexical and computational semantics (*SEM), Volume 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 275\u2013279). Atlanta, Georgia, USA: Association for Computational Linguistics. https:\/\/aclanthology.org\/S13-2046\/"},{"key":"474_CR19","unstructured":"Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv [csCL] [cs.CL]"},{"key":"474_CR20","unstructured":"Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. In Proceedings of the 36th international conference on neural information processing systems (pp. 22199\u201322213). Curran Associates Inc., Red Hook, NY, USA, no. Article 1613 in NIPS \u201922."},{"key":"474_CR21","doi-asserted-by":"publisher","unstructured":"Kumar, Y., Aggarwal, S., Mahata, D., Shah, R. R., Kumaraguru, P., & Zimmermann, R. (2019). Get it scored using autosas - an automated system for scoring short answers. In AAAI\/IAAI\/EAAI. AAAI Press. https:\/\/doi.org\/10.1609\/aaai.v33i01.33019662","DOI":"10.1609\/aaai.v33i01.33019662"},{"key":"474_CR22","unstructured":"Meurers, D., Ziai, R., Ott, N., & Kopp, J. (2011). Evaluating answers to reading comprehension questions in context: Results for German and the role of information structure. In S. Pad\u00f3, & S. Thater (Eds.), Proceedings of the TextInfer 2011 workshop on textual entailment (pp. 1\u20139). Edinburgh, Scottland, UK: Association for Computational Linguistics. https:\/\/aclanthology.org\/W11-2401\/"},{"key":"474_CR23","doi-asserted-by":"crossref","unstructured":"Miller, G. A. (1994). WordNet: A lexical database for English. In Human language technology: Proceedings of a workshop held at plainsboro, New Jersey, March 8-11, 1994. https:\/\/aclanthology.org\/H94-1111\/","DOI":"10.3115\/1075812.1075938"},{"key":"474_CR24","doi-asserted-by":"publisher","unstructured":"Mizumoto, T., Ouchi, H., Isobe, Y., Reisert, P., Nagata, R., Sekine, S., & Inui, K. (2019). Analytic score prediction and justification identification in automated short answer scoring. In BEA (pp 316\u2013325). https:\/\/doi.org\/10.18653\/v1\/W19-4433","DOI":"10.18653\/v1\/W19-4433"},{"key":"474_CR25","doi-asserted-by":"crossref","unstructured":"Mohler, M., & Mihalcea, R. (2009). Text-to-text semantic similarity for automatic short answer grading. In A. Lascarides, C. Gardent, & J. Nivre (Eds.), Proceedings of the 12th conference of the european chapter of the ACL (EACL 2009) (pp. 567\u2013575). Athens, Greece: Association for Computational Linguistics. https:\/\/aclanthology.org\/E09-1065\/","DOI":"10.3115\/1609067.1609130"},{"key":"474_CR26","unstructured":"Mohler, M., Bunescu, R., & Mihalcea, R. (2011a). Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In ACL-HLT (pp. 752\u2013762)."},{"key":"474_CR27","unstructured":"Mohler, M., Bunescu, R., & Mihalcea, R. (2011b). Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In D. Lin, Y. Matsumoto, & R. Mihalcea (Eds.), Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 752\u2013762). Portland, Oregon, USA: Association for Computational Linguistics. https:\/\/aclanthology.org\/P11-1076\/"},{"key":"474_CR28","unstructured":"Nielsen, R., Ward, W., & Martin, J. (2008). Learning to assess low-level conceptual understanding (pp. 427\u2013432)."},{"key":"474_CR29","doi-asserted-by":"crossref","unstructured":"Oka, H., Nguyen, H. T., Nguyen, C. T., Nakagawa, M., & Ishioka, T. (2022). Fully automated short answer scoring of the trial tests for common entrance examinations for japanese university. In AIED (pp. 180\u2013192). Cham: Springer International Publishing.","DOI":"10.1007\/978-3-031-11644-5_15"},{"key":"474_CR30","unstructured":"Okazaki, N., Hattori, K., Shota, H., Iida, H., Ohi, M., Fujii, K., Nakamura, T., Loem, M., Yokota, R., & Mizuki, S. (2024). Building a large japanese web corpus for large language models. In Proceedings of the first conference on language modeling, University of Pennsylvania, USA, COLM (p (to appear))."},{"issue":"1","key":"474_CR31","first-page":"140:1","volume":"21","author":"C Raffel","year":"2019","unstructured":"Raffel, C., Shazeer, N. M., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1), 140:1-140:67.","journal-title":"Journal of Machine Learning Research"},{"key":"474_CR32","doi-asserted-by":"publisher","unstructured":"Riordan, B., Horbach, A., Cahill, A., Zesch, T., & Lee, C. (2017). Investigating neural architectures for short answer scoring. In BEA (pp. 159\u2013168). https:\/\/doi.org\/10.18653\/v1\/W17-5017","DOI":"10.18653\/v1\/W17-5017"},{"key":"474_CR33","unstructured":"Saha, S., Dhamecha, T. I., Marvaniya, S., Foltz, P., Sindhgatta, R., & Sengupta, B. (2019). Joint multi-domain learning for automatic short answer grading. CoRR abs\/1902.09183. arXiv:1902.09183"},{"key":"474_CR34","doi-asserted-by":"publisher","unstructured":"Sakaguchi, K., Heilman, M., & Madnani, N. (2015). Effective feature integration for automated short answer scoring. In NAACL-HLT (pp. 1049\u20131054). Denver, Colorado: Association for Computational Linguistics. https:\/\/doi.org\/10.3115\/v1\/N15-1111","DOI":"10.3115\/v1\/N15-1111"},{"key":"474_CR35","doi-asserted-by":"crossref","unstructured":"Schneider, J., Schenk, B., & Niklaus, C. (2024). Towards LLM-based autograding for short textual answers. In Proceedings of the 16th international conference on computer supported education. SCITEPRESS - Science and Technology Publications.","DOI":"10.5220\/0012552200003693"},{"key":"474_CR36","unstructured":"Song, X., Li, O., Lee, C., Yang, B., Peng, D., Perel, S., & Chen, Y. (2024). OmniPred: Language models as universal regressors. arXiv [csLG] [cs.LG]"},{"key":"474_CR37","doi-asserted-by":"publisher","unstructured":"Sultan, M. A., Salazar, C., & Sumner, T. (2016) Fast and easy short answer grading with high accuracy. In NAACL-HLT (pp. 1070\u20131075). San Diego, California: Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/N16-1123","DOI":"10.18653\/v1\/N16-1123"},{"key":"474_CR38","doi-asserted-by":"publisher","unstructured":"Sung, C., Dhamecha, T., Saha, S., Ma, T., Reddy, V., & Arora, R. (2019). Pre-training BERT on domain resources for short answer grading. In EMNLP-IJCNLP (pp. 6071\u20136075). Hong Kong, China: Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/D19-1628","DOI":"10.18653\/v1\/D19-1628"},{"key":"474_CR39","unstructured":"Swallow, L. L. M. (2024). Llama 3 swallow. https:\/\/swallow-llm.github.io\/llama3-swallow.en.html"},{"key":"474_CR40","unstructured":"Vacareanu, R., Negru, V. A., Suciu, V., & Surdeanu, M. (2024). From words to numbers: Your large language model is secretly a capable regressor when given in-context examples. arXiv [csCL] [cs.CL]"},{"issue":"1","key":"474_CR41","doi-asserted-by":"publisher","first-page":"183","DOI":"10.5715\/jnlp.28.183","volume":"28","author":"T Wang","year":"2021","unstructured":"Wang, T., Funayama, H., Ouchi, H., & Inui, K. (2021). Data augmentation by rubrics for short answer grading. Journal of Natural Language Processing, 28(1), 183\u2013205. https:\/\/doi.org\/10.5715\/jnlp.28.183","journal-title":"Journal of Natural Language Processing"},{"key":"474_CR42","unstructured":"Wang, X., Wei, J., Schuurmans, D., et\u00a0al. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv [csCL] [cs.CL]"},{"key":"474_CR43","unstructured":"Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2021). Finetuned language models are zero-shot learners. In International Conference on Learning Representations."},{"key":"474_CR44","unstructured":"Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. Neural Information Processing Systems abs\/2201.11903:24824\u201324837"},{"key":"474_CR45","unstructured":"Zhang, B., Liu, Z., Cherry, C., & Firat, O. (2024). When scaling meets LLM finetuning: The effect of data, model and finetuning method. International Conference on Learning Representations abs\/2402.17193"},{"key":"474_CR46","unstructured":"Zhao, J., Wang, T., Abid, W., Angus, G., Garg, A., Kinnison, J., Sherstinsky, A., Molino, P., Addair, T., & Rishi, D. (2024). LoRA land: 310 fine-tuned LLMs that rival GPT-4, a technical report. arXiv [csCL] [cs.CL]"}],"container-title":["International Journal of Artificial Intelligence in Education"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40593-025-00474-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40593-025-00474-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40593-025-00474-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T18:12:40Z","timestamp":1772647960000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40593-025-00474-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,10]]},"references-count":46,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["474"],"URL":"https:\/\/doi.org\/10.1007\/s40593-025-00474-w","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-4929687\/v1","asserted-by":"object"}]},"ISSN":["1560-4292","1560-4306"],"issn-type":[{"value":"1560-4292","type":"print"},{"value":"1560-4306","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,10]]},"assertion":[{"value":"31 March 2025","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 July 2025","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}