{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,13]],"date-time":"2026-06-13T13:41:50Z","timestamp":1781358110209,"version":"3.54.1"},"reference-count":45,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,11,9]],"date-time":"2025-11-09T00:00:00Z","timestamp":1762646400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Informatics"],"abstract":"<jats:p>Background and objectives: Automated evaluation of open-ended responses remains a persistent challenge, particularly when consistency, transparency, and reproducibility are required. While large language models (LLMs) have shown promise in rubric-based evaluation, their reliability across multiple evaluators is still uncertain. Variability in scoring, feedback, and rubric adherence raises concerns about interpretability and system robustness. This study introduces GraderAssist, a graph-based, rubric-guided, multi-LLM framework designed to ensure transparent and reproducible automated evaluation. Methods: GraderAssist evaluates a dataset of 220 responses to both technical and argumentative questions, collected from undergraduate computer science courses. Six open-source LLMs and GPT-4 (as expert reference) independently scored each response using two predefined rubrics. All outputs\u2014including scores, feedback, and metadata\u2014were parsed, validated, and stored in a Neo4j graph database, enabling structured querying, traceability, and longitudinal analysis. Results: Cross-model analysis revealed systematic differences in scoring behavior and feedback generation. Some models produced more generous evaluations, while others aligned closely with GPT-4. Semantic analysis using Sentence-BERT embeddings highlighted distinctive feedback styles and variable rubric adherence. 
Inter-model agreement was stronger for technical criteria but diverged substantially for argumentative tasks. Originality: GraderAssist integrates rubric-guided evaluation, multi-model comparison, and graph-based storage into a unified pipeline. By emphasizing reproducibility, transparency, and fine-grained analysis of evaluator behavior, it advances the design of interpretable automated evaluation systems with applications in education and beyond.<\/jats:p>","DOI":"10.3390\/informatics12040123","type":"journal-article","created":{"date-parts":[[2025,11,10]],"date-time":"2025-11-10T08:57:55Z","timestamp":1762765075000},"page":"123","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["GraderAssist: A Graph-Based Multi-LLM Framework for Transparent and Reproducible Automated Evaluation"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1849-3072","authenticated-orcid":false,"given":"Catalin","family":"Anghel","sequence":"first","affiliation":[{"name":"Department of Computer Science and Information Technology, \u201cDun\u0103rea de Jos\u201d University of Galati, \u0218tiin\u021bei St. 
2, 800201 Galati, Romania"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-2537-6713","authenticated-orcid":false,"given":"Andreea Alexandra","family":"Anghel","sequence":"additional","affiliation":[{"name":"Computer Science and Information Technology Program, Faculty of Automation, Computer Science, Electrical and Electronic Engineering, \u201cDun\u0103rea de Jos\u201d University of Galati, 800201 Galati, Romania"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1155-5274","authenticated-orcid":false,"given":"Emilia","family":"Pecheanu","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Information Technology, \u201cDun\u0103rea de Jos\u201d University of Galati, \u0218tiin\u021bei St. 2, 800201 Galati, Romania"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0935-4713","authenticated-orcid":false,"given":"Adina","family":"Cocu","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Information Technology, \u201cDun\u0103rea de Jos\u201d University of Galati, \u0218tiin\u021bei St. 2, 800201 Galati, Romania"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9970-2556","authenticated-orcid":false,"given":"Marian Viorel","family":"Craciun","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Information Technology, \u201cDun\u0103rea de Jos\u201d University of Galati, \u0218tiin\u021bei St. 
2, 800201 Galati, Romania"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-7539-4159","authenticated-orcid":false,"given":"Paul","family":"Iacobescu","sequence":"additional","affiliation":[{"name":"Doctoral School, \u201cDun\u0103rea de Jos\u201d University of Galati, 800201 Galati, Romania"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-5261-2060","authenticated-orcid":false,"given":"Antonio Stefan","family":"Balau","sequence":"additional","affiliation":[{"name":"Computer Science and Information Technology Program, Faculty of Automation, Computer Science, Electrical and Electronic Engineering, \u201cDun\u0103rea de Jos\u201d University of Galati, 800201 Galati, Romania"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5595-1135","authenticated-orcid":false,"given":"Constantin Adrian","family":"Andrei","sequence":"additional","affiliation":[{"name":"\u201cFoisor\u201d Clinical Hospital of Orthopaedics, Traumatology and Osteoarticular TB, 021382 Bucharest, Romania"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,9]]},"reference":[{"key":"ref_1","first-page":"429","article-title":"Formal Assessment in STEM Higher Education","volume":"1","author":"Mistler","year":"2025","journal-title":"J. Tech. Stud."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"893","DOI":"10.1007\/s10734-023-01148-z","article-title":"A review of the benefits and drawbacks of high-stakes final examinations in higher education","volume":"8","author":"French","year":"2024","journal-title":"High. Educ."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1186\/s40594-025-00570-2","article-title":"LUPDA: A comprehensive rubrics-based assessment model","volume":"12","author":"Cheng","year":"2025","journal-title":"Int. J. 
STEM Educ."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Tan, L.Y., Hu, S., Yeo, D.J., and Cheong, K.H. (2025). A Comprehensive Review on Automated Grading Systems in STEM Using AI Techniques. Mathematics, 13.","DOI":"10.3390\/math13172828"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Souza, M., Margalho, \u00c9., Lima, R.M., Mesquita, D., and Costa, M.J. (2022). Rubric\u2019s Development Process for Assessment of Project Management Competences. Educ. Sci., 12.","DOI":"10.3390\/educsci12120902"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Emirtekin, E. (2025). Large Language Model-Powered Automated Assessment: A Systematic Review. Appl. Sci., 15.","DOI":"10.3390\/app15105683"},{"key":"ref_7","unstructured":"Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., and Ray, A. (December, January 28). Training language models to follow instructions with human feedback. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA. Available online: https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2022\/hash\/b1efde53be364a73914f58805a001731-Abstract-Conference.html."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Mendon\u00e7a, P.C., Quintal, F., and Mendon\u00e7a, F. (2025). Evaluating LLMs for Automated Scoring in Formative Assessments. Appl. Sci., 15.","DOI":"10.3390\/app15052787"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"72","DOI":"10.2307\/1412159","article-title":"The Proof and Measurement of Association between Two Things","volume":"15","author":"Spearman","year":"1904","journal-title":"Am. J. Psychol."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Wei, Y., Zhang, R., Zhang, J., Qi, D., and Cui, W. (2025). Research on Intelligent Grading of Physics Problems Based on Large Language Models. Educ. 
Sci., 15.","DOI":"10.3390\/educsci15020116"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Iacobescu, P., Marina, V., Anghel, C., and Anghele, A.-D. (2024). Evaluating Binary Classifiers for Cardiovascular Disease Prediction: Enhancing Early Diagnostic Capabilities. J. Cardiovasc. Dev. Dis., 11.","DOI":"10.3390\/jcdd11120396"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Anghele, A.-D., Marina, V., Dragomir, L., Moscu, C.A., Anghele, M., and Anghel, C. (2024). Predicting Deep Venous Thrombosis Using Artificial Intelligence: A Clinical Data Approach. Bioengineering, 11.","DOI":"10.3390\/bioengineering11111067"},{"key":"ref_13","first-page":"51","article-title":"Predicting periprosthetic joint Infection: Evaluating supervised machine learning models for clinical application","volume":"54","author":"Dragosloveanu","year":"2025","journal-title":"J. Orthop. Transl."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Cisneros-Gonz\u00e1lez, J., Gordo-Herrera, N., Barcia-Santos, I., and S\u00e1nchez-Soriano, J. (2025). JorGPT: Instructor-Aided Grading of Programming Assignments with Large Language Models (LLMs). Future Internet, 17.","DOI":"10.3390\/fi17060265"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Pan, Y., and Nehm, R.H. (2025). Large Language Model and Traditional Machine Learning Scoring of Evolutionary Explanations: Benefits and Drawbacks. Educ. Sci., 15.","DOI":"10.3390\/educsci15060676"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Seo, H., Hwang, T., Jung, J., Namgoong, H., Lee, J., and Jung, S. (2025). Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy. Appl. Sci., 15.","DOI":"10.3390\/app15020671"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Monteiro, J., S\u00e1, F., and Bernardino, J. (2023). Experimental Evaluation of Graph Databases: JanusGraph, Nebula Graph, Neo4j, and TigerGraph. Appl. 
Sci., 13.","DOI":"10.3390\/app13095770"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1016\/j.edurev.2013.01.002","article-title":"The use of scoring rubrics for formative assessment purposes revisited: A review","volume":"9","author":"Panadero","year":"2013","journal-title":"Educ. Res. Rev."},{"key":"ref_19","unstructured":"Chan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J., and Liu, Z. (2024, January 7\u201311). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. Proceedings of the Twelfth International Conference on Learning Representations (ICLR), Vienna, Austria. Available online: https:\/\/openreview.net\/forum?id=FQepisCUWu."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Li, T., Chiang, W.-L., Frick, E., Dunlap, L., Zhu, B., Gonzalez, J.E., and Stoica, I. (2025, September 12). From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline. Available online: https:\/\/lmsys.org\/blog\/2024-04-19-arena-hard\/.","DOI":"10.1038\/s41597-025-05435-5"},{"key":"ref_21","unstructured":"Bhat, V. (2025, October 31). RubricEval: A Scalable Human-LLM Evaluation Framework for Open-Ended Tasks. Available online: https:\/\/web.stanford.edu\/class\/cs224n\/final-reports\/256846781.pdf."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Hashemi, H., Eisner, J., Rosset, C., Van Durme, B., and Kedzie, C. (2024, January 11\u201316). LLM-RUBRIC: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand.","DOI":"10.18653\/v1\/2024.acl-long.745"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Feng, Z., Zhang, Y., Li, H., Wu, B., Liao, J., Liu, W., Lang, J., Feng, Y., Wu, J., and Liu, Z. (May, January 29). TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement. 
Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, NM, USA.","DOI":"10.18653\/v1\/2025.findings-naacl.218"},{"key":"ref_24","unstructured":"Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. (December, January 28). Reflexion: Language Agents with Verbal Reinforcement Learning. Proceedings of the Advances in Neural Information Processing Systems 36\u2014NeurIPS, New Orleans, LA, USA."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Zhao, R., Zhang, W., Chia, Y.K., Zhao, D., and Bing, L. (2025, September 12). Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-Battles and Committee Discussions. Available online: https:\/\/auto-arena.github.io\/blog\/.","DOI":"10.18653\/v1\/2025.acl-long.223"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Cantera, M.A., Arevalo, M.-J., Garc\u00eda-Marina, V., and Alves-Castro, M. (2021). A Rubric to Assess and Improve Technical Writing in Undergraduate Engineering Courses. Educ. Sci., 11.","DOI":"10.3390\/educsci11040146"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Shahriar, S., Lund, B.D., Mannuru, N.R., Arshad, M.A., Hayawi, K., Bevara, R.V.K., Mannuru, A., and Batool, L. (2024). Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency. Appl. Sci., 14.","DOI":"10.20944\/preprints202406.1635.v1"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1038\/s41539-024-00291-1","article-title":"Evaluating large language models for criterion-based grading from agreement to consistency","volume":"9","author":"Zhang","year":"2024","journal-title":"npj Sci. Learn."},{"key":"ref_29","unstructured":"Meta (2025, September 17). Introducing Meta Llama 3. Available online: https:\/\/ai.meta.com\/blog\/meta-llama-3\/."},{"key":"ref_30","unstructured":"Nous Research (2025, September 17). Nous-Hermes 2 Model Card. 
Available online: https:\/\/ollama.com\/library\/nous-hermes2:latest."},{"key":"ref_31","unstructured":"Teknium (2025, September 17). OpenHermes Model Card. Available online: https:\/\/ollama.com\/library\/openhermes."},{"key":"ref_32","unstructured":"Hartford, E. (2025, September 17). Dolphin-Mistral Model Card. Available online: https:\/\/ollama.com\/library\/dolphin-mistral."},{"key":"ref_33","unstructured":"Google (2025, September 17). Gemma:7B-Instruct Model Card. Available online: https:\/\/ollama.com\/library\/gemma:7b-instruct."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Renze, M. (2024, January 12\u201316). The Effect of Sampling Temperature on Problem Solving in Large Language Models. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA.","DOI":"10.18653\/v1\/2024.findings-emnlp.432"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Dong, B., Bai, J., Xu, T., and Zhou, Y. (2024, January 19\u201321). Large Language Models in Education: A Systematic Review. Proceedings of the 2024 6th International Conference on Computer Science and Technologies in Education (CSTE), Xi\u2019an, China.","DOI":"10.1109\/CSTE62025.2024.00031"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Mazein, I., Rougny, A., Mazein, A., Henkel, R., G\u00fctebier, L., Michaelis, L., Ostaszewski, M., Schneider, R., Satagopam, V., and Jensen, L.J. (2024). Graph databases in systems biology: A systematic review. Brief. Bioinform., 25.","DOI":"10.1093\/bib\/bbae561"},{"key":"ref_37","unstructured":"Asplund, E., and Sandell, J. (2025, October 31). Comparison of Graph Databases and Relational Databases Performance. Available online: https:\/\/su.diva-portal.org\/smash\/record.jsf?pid=diva2:1784349."},{"key":"ref_38","unstructured":"Chen, M., Poulsen, S., Alkhabaz, R., and Alawini, A. (July, January 26). A Quantitative Analysis of Student Solutions to Graph Database Problems. 
Proceedings of the 26th ACM Conference on Innovation and Technology in Computer Science Education V. 1, Virtual Event, Germany."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Reimers, N., and Gurevych, I. (2019, January 3\u20137). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.","DOI":"10.18653\/v1\/D19-1410"},{"key":"ref_40","unstructured":"Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., and Artzi, Y. (May, January 26). BERTScore: Evaluating Text Generation with BERT. Proceedings of the International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia (Virtual). Available online: https:\/\/openreview.net\/forum?id=SkeHuCVFDr."},{"key":"ref_41","first-page":"1","article-title":"openTSNE: A Modular Python Library for t-SNE Dimensionality Reduction and Embedding","volume":"109","author":"Zupan","year":"2024","journal-title":"J. Stat. Softw."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"G\u00f3mez, J., and V\u00e1zquez, P.-P. (2022). An Empirical Evaluation of Document Embeddings and Similarity Metrics for Scientific Articles. Appl. Sci., 12.","DOI":"10.3390\/app12115664"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"359","DOI":"10.1006\/enfo.2001.0061","article-title":"Detecting Trends Using Spearman\u2019s Rank Correlation Coefficient","volume":"2","author":"Gauthier","year":"2001","journal-title":"Environ. Forensics"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Vulpe, D.E., Anghel, C., Scheau, C., Dragosloveanu, S., and S\u0103ndulescu, O. (2025). Artificial Intelligence and Its Role in Predicting Periprosthetic Joint Infections. 
Biomedicines, 13.","DOI":"10.3390\/biomedicines13081855"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Lazuka, M., Anghel, A., and Parnell, T.P. (2024, January 17\u201322). LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services. Proceedings of the SC \u201924: International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA.","DOI":"10.1109\/SC41406.2024.00022"}],"container-title":["Informatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2227-9709\/12\/4\/123\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,10]],"date-time":"2025-11-10T09:45:43Z","timestamp":1762767943000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2227-9709\/12\/4\/123"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,9]]},"references-count":45,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["informatics12040123"],"URL":"https:\/\/doi.org\/10.3390\/informatics12040123","relation":{},"ISSN":["2227-9709"],"issn-type":[{"value":"2227-9709","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,9]]}}}