{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,4]],"date-time":"2025-12-04T16:20:24Z","timestamp":1764865224720,"version":"3.46.0"},"reference-count":44,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2025,12,3]],"date-time":"2025-12-03T00:00:00Z","timestamp":1764720000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>Large Language Models (LLMs) are increasingly used for rubric-based assessment, yet reliability is limited by instability, bias, and weak diagnostics. We present EvalCouncil, a committee-and-chief framework for rubric-guided grading with auditable traces and a human adjudication baseline. Our objectives are to (i) characterize domain structure in Human\u2013LLM alignment, (ii) assess robustness to concordance tolerance and panel composition, and (iii) derive a domain-adaptive audit policy grounded in dispersion and chief\u2013panel differences. Authentic student responses from two domains, Computer Networks (CN) and Machine Learning (ML), are graded by multiple heterogeneous LLM evaluators using identical rubric prompts. A designated chief arbitrator operates within a tolerance band and issues the final grade. We quantify within-panel dispersion via MPAD (mean pairwise absolute deviation), measure chief\u2013panel concordance (e.g., absolute error and bias), and compute Human\u2013LLM deviation. Robustness is examined by sweeping the tolerance and performing leave-one-out perturbations of panel composition. All outputs and reasoning traces are stored in a graph database for full provenance. 
Human\u2013LLM alignment exhibits systematic domain dependence: ML shows tighter central tendency and shorter upper tails, whereas CN displays broader dispersion with heavier upper tails and larger extreme spreads. Disagreement increases with item difficulty as captured by MPAD, concentrating misalignment on a relatively small subset of items. These patterns remain stable under tolerance variation and single-grader removals. The signals support a practical triage policy: accept low-dispersion, small-gap items; apply a brief check to borderline cases; and adjudicate high-dispersion or large-gap items with targeted rubric clarification. EvalCouncil instantiates a committee-and-chief, rubric-guided grading workflow with committee arbitration, a human adjudication baseline, and graph-based auditability in a real classroom deployment. By linking domain-aware dispersion (MPAD), a policy tolerance dial, and chief\u2013panel discrepancy, the study shows how these elements can be combined into a replicable, auditable, and capacity-aware approach for organizing LLM-assisted grading and identifying instability and systematic misalignment, while maintaining pedagogical interpretability.<\/jats:p>","DOI":"10.3390\/computers14120530","type":"journal-article","created":{"date-parts":[[2025,12,4]],"date-time":"2025-12-04T16:07:38Z","timestamp":1764864458000},"page":"530","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["EvalCouncil: A Committee-Based LLM Framework for Reliable and Unbiased Automated Grading"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1849-3072","authenticated-orcid":false,"given":"Catalin","family":"Anghel","sequence":"first","affiliation":[{"name":"Department of Computer Science and Information Technology, \u201cDun\u0103rea de Jos\u201d University of Galati, \u0218tiin\u021bei St. 
2, 800146 Galati, Romania"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9970-2556","authenticated-orcid":false,"given":"Marian Viorel","family":"Craciun","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Information Technology, \u201cDun\u0103rea de Jos\u201d University of Galati, \u0218tiin\u021bei St. 2, 800146 Galati, Romania"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-2537-6713","authenticated-orcid":false,"given":"Andreea Alexandra","family":"Anghel","sequence":"additional","affiliation":[{"name":"Faculty of Automation, Computer Science, Electrical and Electronic Engineering, \u201cDun\u0103rea de Jos\u201d University of Galati, 800008 Galati, Romania"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0935-4713","authenticated-orcid":false,"given":"Adina","family":"Cocu","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Information Technology, \u201cDun\u0103rea de Jos\u201d University of Galati, \u0218tiin\u021bei St. 2, 800146 Galati, Romania"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-5261-2060","authenticated-orcid":false,"given":"Antonio Stefan","family":"Balau","sequence":"additional","affiliation":[{"name":"Faculty of Automation, Computer Science, Electrical and Electronic Engineering, \u201cDun\u0103rea de Jos\u201d University of Galati, 800008 Galati, Romania"}]},{"given":"Constantin Adrian","family":"Andrei","sequence":"additional","affiliation":[{"name":"Faculty of Medicine, The \u201cCarol Davila\u201d University of Medicine and Pharmacy, 050474 Bucharest, Romania"},{"name":"Department of Orthopaedics, \u201cFoisor\u201d Clinical Hospital of Orthopaedics, Traumatology and Osteoarticular TB, 021382 Bucharest, Romania"}]},{"given":"Calina","family":"Maier","sequence":"additional","affiliation":[{"name":"Faculty of Medicine, The \u201cCarol Davila\u201d University of Medicine and Pharmacy, 050474 Bucharest, Romania"},{"name":"Panait Sirbu Obstetrics and Gynaecology Hospital Bucharest, 060251 
Bucharest, Romania"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7403-1684","authenticated-orcid":false,"given":"Serban","family":"Dragosloveanu","sequence":"additional","affiliation":[{"name":"Faculty of Medicine, The \u201cCarol Davila\u201d University of Medicine and Pharmacy, 050474 Bucharest, Romania"},{"name":"Department of Orthopaedics, \u201cFoisor\u201d Clinical Hospital of Orthopaedics, Traumatology and Osteoarticular TB, 021382 Bucharest, Romania"}]},{"given":"Dana-Georgiana","family":"Nedelea","sequence":"additional","affiliation":[{"name":"Faculty of Medicine, The \u201cCarol Davila\u201d University of Medicine and Pharmacy, 050474 Bucharest, Romania"},{"name":"Department of Orthopaedics, \u201cFoisor\u201d Clinical Hospital of Orthopaedics, Traumatology and Osteoarticular TB, 021382 Bucharest, Romania"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7676-6393","authenticated-orcid":false,"given":"Cristian","family":"Scheau","sequence":"additional","affiliation":[{"name":"Faculty of Medicine, The \u201cCarol Davila\u201d University of Medicine and Pharmacy, 050474 Bucharest, Romania"},{"name":"Department of Radiology and Medical Imaging, \u201cFoisor\u201d Clinical Hospital of Orthopaedics, Traumatology and Osteoarticular TB, 021382 Bucharest, Romania"}]}],"member":"1968","published-online":{"date-parts":[[2025,12,3]]},"reference":[{"key":"ref_1","unstructured":"OpenAI (2025, September 22). Introducing GPT-5. Available online: https:\/\/openai.com\/index\/introducing-gpt-5\/."},{"key":"ref_2","unstructured":"Meta (2025, July 30). The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation. Available online: https:\/\/ai.meta.com\/blog\/llama-4-multimodal-intelligence."},{"key":"ref_3","unstructured":"Mistral AI (2025, July 30). Announcing Mistral 7B: A High-Performance Open-Weight Language Model. 
Available online: https:\/\/mistral.ai\/news\/announcing-mistral-7b."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002, January 7\u201312). BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA. Available online: https:\/\/aclanthology.org\/P02-1040.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_5","unstructured":"Lin, C.Y. (2004, January 21\u201326). ROUGE: A Package for Automatic Evaluation of Summaries. Proceedings of the Text Summarization Branches Out (ACL Workshop), Barcelona, Spain. Available online: https:\/\/aclanthology.org\/W04-1013."},{"key":"ref_6","unstructured":"Banerjee, S., and Lavie, A. (2005, January 29). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and\/or Summarization, Ann Arbor, MI, USA. Available online: https:\/\/aclanthology.org\/W05-0909."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Gao, M., Liu, Y., Hu, X., Wan, X., Bragg, J., and Cohan, A. (May, January 29). Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference. Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, NM, USA. Available online: https:\/\/aclanthology.org\/2025.findings-naacl.260.","DOI":"10.18653\/v1\/2025.findings-naacl.260"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Jeong, H., Park, C., Hong, J., Lee, H., and Choo, J. (2024). The Comparative Trap: Pairwise Comparisons Amplify Biased Preferences of LLM Evaluators. arXiv.","DOI":"10.18653\/v1\/2025.blackboxnlp-1.5"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Hashemi, H., Eisner, J., Rosset, C., Van Durme, B., and Kedzie, C. (2024, January 11\u201316). 
LLM-RUBRIC: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand. Available online: https:\/\/aclanthology.org\/2024.acl-long.745.","DOI":"10.18653\/v1\/2024.acl-long.745"},{"key":"ref_10","unstructured":"Chaudhary, M., Gupta, H., Bhat, S., and Varma, V. (2024). Towards Understanding the Robustness of LLM-based Evaluations under Perturbations. arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Anghel, C., Anghel, A.A., Pecheanu, E., Cocu, A., and Istrate, A. (2025). Diagnosing Bias and Instability in LLM Evaluation: A Scalable Pairwise Meta-Evaluator. Information, 16.","DOI":"10.3390\/info16080652"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Dong, B., Bai, J., Xu, T., and Zhou, Y. (2024, January 19\u201321). Large Language Models in Education: A Systematic Review. Proceedings of the 2024 6th International Conference on Computer Science and Technologies in Education (CSTE), Xi\u2019an, China. Available online: https:\/\/ieeexplore.ieee.org\/document\/10589960.","DOI":"10.1109\/CSTE62025.2024.00031"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Al-Ahmad, B., Alsobeh, A., Meqdadi, O., and Shaikh, N. (2025). A Student-Centric Evaluation Survey to Explore the Impact of LLMs on UML Modeling. Information, 16.","DOI":"10.20944\/preprints202505.2054.v1"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Alhazeem, E., Alsobeh, A., and Al-Ahmad, B. (2024, January 9\u201311). Enhancing Software Engineering Education through AI: An Empirical Study of Tree-Based Machine Learning for Defect Prediction. Proceedings of the 25th Annual Conference on Information Technology Education, New York, NY, USA.","DOI":"10.1145\/3686852.3686881"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Lopes, P., Silva, E., Braga, C., Oliveira, T., and Rosado, L. (2022). 
XAI Systems Evaluation: A Review of Human and Computer-Centred Methods. Appl. Sci., 12.","DOI":"10.3390\/app12199423"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3583558","article-title":"From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI","volume":"55","author":"Nauta","year":"2023","journal-title":"ACM Comput. Surv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Anghel, C., Anghel, A.A., Pecheanu, E., Susnea, I., Cocu, A., and Istrate, A. (2025). Multi-Model Dialectical Evaluation of LLM Reasoning Chains: A Structured Framework with Dual Scoring Agents. Informatics, 12.","DOI":"10.3390\/informatics12030076"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Anghel, C., Cr\u0103ciun, M.V., Pecheanu, E., Cocu, A., Anghel, A.A., Iacobescu, P., Maier, C., Andrei, C.A., Scheau, C., and Drago\u0219loveanu, \u0218. (2025). CourseEvalAI: Rubric-Guided Framework for Transparent and Consistent Evaluation of Large Language Models. Computers, 14.","DOI":"10.3390\/computers14100431"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Qiu, W., Su, C.L., Jamil, N.B., Thway, M., Ng, S.S.H., Zhang, L., Lim, F.S., and Lai, J.W. (2025). A Systematic Approach to Evaluate the Use of Chatbots in Educational Contexts: Learning Gains, Engagements and Perceptions. Computers, 14.","DOI":"10.35542\/osf.io\/7yga3_v3"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Seo, H., Hwang, T., Jung, J., Namgoong, H., Lee, J., and Jung, S. (2025). Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy. Appl. Sci., 15.","DOI":"10.3390\/app15020671"},{"key":"ref_21","unstructured":"Zheng, L., Chiang, W.L., Sheng, Y., and Zhang, H. (2025, July 31). Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B. 
Available online: https:\/\/lmsys.org\/blog\/2023-06-22-leaderboard."},{"key":"ref_22","unstructured":"Liu, Y., Zhou, H., Guo, Z., Shareghi, E., Vuli\u0107, I., Korhonen, A., and Collier, N. (2024, January 7\u20139). Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators. Proceedings of the COLM 2024\u2013Conference on Language Modeling, Philadelphia, PA, USA."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Hu, Z., Song, L., Zhang, J., Xiao, Z., Wang, T., Chen, Z., Yuan, N.J., Lian, J., Ding, K., and Xiong, H. (2024). Explaining Length Bias in LLM-based Preference Evaluations. arXiv.","DOI":"10.18653\/v1\/2025.findings-emnlp.358"},{"key":"ref_24","unstructured":"Shi, L., Ma, C., Liang, W., Ma, W., and Vosoughi, S. (2024). Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs. arXiv."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Fan, Z., Wang, W., W, X., and Zhang, D. (2024, January 20). SedarEval: Automated Evaluation using Self-Adaptive Rubrics. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA. Available online: https:\/\/aclanthology.org\/2024.findings-emnlp.984.","DOI":"10.18653\/v1\/2024.findings-emnlp.984"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1465","DOI":"10.1007\/s40593-024-00440-y","article-title":"Revealing Rubric Relations: Investigating the Interdependence of a Research Informed and a Machine Learning Based Rubric in Assessing Student Reasoning in Chemistry","volume":"35","author":"Martin","year":"2024","journal-title":"Int. J. Artif. Intell. Educ."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"113","DOI":"10.1007\/s10648-023-09823-4","article-title":"Effects of Rubrics on Academic Performance, Self-Regulated Learning, and self-Efficacy: A Meta-analytic Review","volume":"35","author":"Panadero","year":"2023","journal-title":"Educ. 
Psychol. Rev."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Moradi, M., and Samwald, M. (2021, January 7\u201311). Evaluating the Robustness of Neural Language Models to Input Perturbations. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Punta Cana, Dominican Republic. Available online: https:\/\/aclanthology.org\/2021.emnlp-main.117\/.","DOI":"10.18653\/v1\/2021.emnlp-main.117"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3567550","article-title":"Explainability for Large Language Models: A Survey","volume":"56","author":"Zhao","year":"2023","journal-title":"ACM Comput. Surv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Zhou, J., Gandomi, A.H., Chen, F., and Holzinger, A. (2021). Evaluating the Quality of Machine Learning Explanations: A Survey on Methods and Metrics. Electronics, 10.","DOI":"10.3390\/electronics10050593"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1016\/j.inffus.2019.12.012","article-title":"Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI","volume":"58","author":"Bennetot","year":"2020","journal-title":"Inf. Fusion"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Emirtekin, E. (2025). Large Language Model-Powered Automated Assessment: A Systematic Review. Appl. Sci., 15.","DOI":"10.3390\/app15105683"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Aggarwal, D., Sil, P., Raman, B., and Bhattacharyya, P. (2025, January 22\u201326). \u201cI Understand Why I Got This Grade\u201d: Automatic Short Answer Grading (ASAG) with Feedback. Proceedings of the Artificial Intelligence in Education (AIED 2025), Cham, Switzerland.","DOI":"10.1007\/978-3-031-98420-4_22"},{"key":"ref_34","unstructured":"Neo4j, I. (2025, September 24). Neo4j Graph Database Platform. 
Available online: https:\/\/neo4j.com\/product\/neo4j-graph-database\/."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Monteiro, J., S\u00e1, F., and Bernardino, J. (2023). Experimental Evaluation of Graph Databases: JanusGraph, Nebula Graph, Neo4j, and TigerGraph. Appl. Sci., 13.","DOI":"10.3390\/app13095770"},{"key":"ref_36","unstructured":"Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., and Saulnier, L. (2023). Mistral 7B. arXiv."},{"key":"ref_37","unstructured":"Google (2025, September 17). Gemma:7B-Instruct Model Card. Available online: https:\/\/ollama.com\/library\/gemma:7b-instruct."},{"key":"ref_38","unstructured":"Hugging-Face (2025, September 24). Zephyr-7B-\u03b2. Available online: https:\/\/huggingface.co\/HuggingFaceH4\/zephyr-7b-beta."},{"key":"ref_39","unstructured":"Teknium (2025, September 17). OpenHermes Model Card. Available online: https:\/\/ollama.com\/library\/openhermes."},{"key":"ref_40","unstructured":"Meta (2025, September 17). Introducing Meta Llama 3. Available online: https:\/\/ai.meta.com\/blog\/meta-llama-3\/."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Renze, M. (2024, January 12\u201316). The Effect of Sampling Temperature on Problem Solving in Large Language Models. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA. Available online: https:\/\/aclanthology.org\/2024.findings-emnlp.432\/.","DOI":"10.18653\/v1\/2024.findings-emnlp.432"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Xu, F., Song, Y., Iyyer, M., and Choi, E. (2023, January 4\u20139). A Critical Evaluation of Evaluations for Long-form Question Answering. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, ON, Canada. 
Available online: https:\/\/aclanthology.org\/2023.acl-long.181.","DOI":"10.18653\/v1\/2023.acl-long.181"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1038\/s41539-024-00291-1","article-title":"Evaluating large language models for criterion-based grading from agreement to consistency","volume":"9","author":"Zhang","year":"2024","journal-title":"npj Sci. Learn."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"258","DOI":"10.1038\/s41746-024-01258-7","article-title":"A framework for human evaluation of large language models in healthcare derived from literature review","volume":"7","author":"Tam","year":"2024","journal-title":"npj Digit. Med."}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/12\/530\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,4]],"date-time":"2025-12-04T16:17:22Z","timestamp":1764865042000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/12\/530"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,3]]},"references-count":44,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["computers14120530"],"URL":"https:\/\/doi.org\/10.3390\/computers14120530","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,3]]}}}