{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T14:04:57Z","timestamp":1767621897918,"version":"3.48.0"},"reference-count":21,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2026,1,3]],"date-time":"2026-01-03T00:00:00Z","timestamp":1767398400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"German BMBF project SCINEXT","award":["01lS22070"],"award-info":[{"award-number":["01lS22070"]}]},{"name":"European Research Council for ScienceGRAPH","award":["819536"],"award-info":[{"award-number":["819536"]}]},{"name":"German DFG for NFDI4DataScience","award":["460234259"],"award-info":[{"award-number":["460234259"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>This paper explores the novel application of large language models (LLMs) as evaluators for structured scientific summaries\u2014a task where traditional natural language evaluation metrics may not readily apply. Leveraging the Open Research Knowledge Graph (ORKG) as a repository of human-curated properties, we augment a gold-standard dataset by generating corresponding properties using three distinct LLMs\u2014Llama, Mistral, and Qwen\u2014under three contextual settings: context-lean (research problem only), context-rich (research problem with title and abstract), and context-dense (research problem with multiple similar papers). To assess the quality of these properties, we employ LLM evaluators (Deepseek, Mistral, and Qwen) to rate them on criteria, including similarity, relevance, factuality, informativeness, coherence, and specificity. This study addresses key research questions: How do LLM-as-a-judge rubrics transfer to the evaluation of structured summaries? How do LLM-generated properties compare to human-annotated ones? What are the performance differences among various LLMs? How does the amount of contextual input affect the generation quality? The resulting evaluation framework, KGEval, offers a customizable approach that can be extended to diverse knowledge graphs and application domains. Our experimental findings reveal distinct patterns in evaluator biases, contextual sensitivity, and inter-model performance, thereby highlighting both the promise and the challenges of integrating LLMs into structured science evaluation.<\/jats:p>","DOI":"10.3390\/info17010035","type":"journal-article","created":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T10:53:50Z","timestamp":1767610430000},"page":"35","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["KGEval: Evaluating Scientific Knowledge Graphs with Large Language Models"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0146-1207","authenticated-orcid":false,"given":"Vladyslav","family":"Nechakhin","sequence":"first","affiliation":[{"name":"Leibniz Information Centre for Science and Technology, 30167 Hannover, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6616-9509","authenticated-orcid":false,"given":"Jennifer","family":"D\u2019Souza","sequence":"additional","affiliation":[{"name":"Leibniz Information Centre for Science and Technology, 30167 Hannover, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4663-8336","authenticated-orcid":false,"given":"Steffen","family":"Eger","sequence":"additional","affiliation":[{"name":"Natural Language Learning & Generation (NLLG), University of Technology Nuremberg (UTN), 90461 Nuremberg, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0698-2864","authenticated-orcid":false,"given":"S\u00f6ren","family":"Auer","sequence":"additional","affiliation":[{"name":"Leibniz Information Centre for Science and Technology, 30167 Hannover, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2026,1,3]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"516","DOI":"10.1515\/bfp-2020-2042","article-title":"Improving access to scientific literature with knowledge graphs","volume":"44","author":"Auer","year":"2020","journal-title":"Bibl. Forsch. Und Prax."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., and Bourne, P.E. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data, 3.","DOI":"10.1038\/sdata.2016.18"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Meyer, L.P., Stadler, C., Frey, J., Radtke, N., Junghanns, K., Meissner, R., Dziwis, G., Bulert, K., and Martin, M. (2023). Llm-assisted knowledge graph engineering: Experiments with chatgpt. Proceedings of the Working conference on Artificial Intelligence Development for a Resilient and Sustainable Tomorrow, Springer Fachmedien Wiesbaden.","DOI":"10.1007\/978-3-658-43705-3_8"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 6\u201312). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PN, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_5","unstructured":"Lin, C.Y. (2004, January 25\u201326). Rouge: A package for automatic evaluation of summaries. Proceedings of the Text Summarization Branches Out, Barcelona, Spain."},{"key":"ref_6","unstructured":"Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., and Liu, H. (2025). A Survey on LLM-as-a-Judge. arXiv."},{"key":"ref_7","unstructured":"Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., and Artzi, Y. (2019). Bertscore: Evaluating text generation with bert. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Zhao, W., Peyrard, M., Liu, F., Gao, Y., Meyer, C.M., and Eger, S. (2019). MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. arXiv.","DOI":"10.18653\/v1\/D19-1053"},{"key":"ref_9","first-page":"27263","article-title":"Bartscore: Evaluating generated text as text generation","volume":"34","author":"Yuan","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Thompson, B., and Post, M. (2020). Automatic machine translation evaluation in many languages via zero-shot paraphrasing. arXiv.","DOI":"10.18653\/v1\/2020.emnlp-main.8"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"804","DOI":"10.1162\/tacl_a_00576","article-title":"Menli: Robust evaluation metrics from natural language inference","volume":"11","author":"Chen","year":"2023","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_12","unstructured":"Kocmi, T., and Federmann, C. (2023). Large language models are state-of-the-art evaluators of translation quality. arXiv."},{"key":"ref_13","first-page":"46595","article-title":"Judging llm-as-a-judge with mt-bench and chatbot arena","volume":"36","author":"Zheng","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Wang, J., Liang, Y., Meng, F., Sun, Z., Shi, H., Li, Z., Xu, J., Qu, J., and Zhou, J. (2023). Is chatgpt a good nlg evaluator? A preliminary study. arXiv.","DOI":"10.18653\/v1\/2023.newsum-1.1"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Chiang, C.H., and Lee, H.y. (2023). Can large language models be an alternative to human evaluations?. arXiv.","DOI":"10.18653\/v1\/2023.acl-long.870"},{"key":"ref_16","first-page":"30039","article-title":"Alpacafarm: A simulation framework for methods that learn from human feedback","volume":"36","author":"Dubois","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C. (2023). G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv.","DOI":"10.18653\/v1\/2023.emnlp-main.153"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Fu, J., Ng, S.K., Jiang, Z., and Liu, P. (2023). Gptscore: Evaluate as you desire. arXiv.","DOI":"10.18653\/v1\/2024.naacl-long.365"},{"key":"ref_19","unstructured":"Ye, S., Kim, D., Kim, S., Hwang, H., Kim, S., Jo, Y., Thorne, J., Kim, J., and Seo, M. (2023). Flask: Fine-grained language model evaluation based on alignment skill sets. arXiv."},{"key":"ref_20","unstructured":"Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., and Thorne, J. (2023, January 1\u20135). Prometheus: Inducing fine-grained evaluation capability in language models. Proceedings of the The Twelfth International Conference on Learning Representations, Kigali, Rwanda."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Nechakhin, V., D\u2019Souza, J., and Eger, S. (2024). Evaluating large language models for structured science summarization in the open research knowledge graph. Information, 15.","DOI":"10.20944\/preprints202405.0124.v1"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/1\/35\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T11:05:48Z","timestamp":1767611148000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/1\/35"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,3]]},"references-count":21,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,1]]}},"alternative-id":["info17010035"],"URL":"https:\/\/doi.org\/10.3390\/info17010035","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,3]]}}}