{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,30]],"date-time":"2026-01-30T01:53:35Z","timestamp":1769738015476,"version":"3.49.0"},"publisher-location":"Cham","reference-count":35,"publisher":"Springer Nature Switzerland","isbn-type":[{"value":"9783031908996","type":"print"},{"value":"9783031909009","type":"electronic"}],"license":[{"start":{"date-parts":[[2025,1,1]],"date-time":"2025-01-01T00:00:00Z","timestamp":1735689600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,5,1]],"date-time":"2025-05-01T00:00:00Z","timestamp":1746057600000},"content-version":"vor","delay-in-days":120,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Code review is a vital but demanding aspect of software development, generating significant interest in automating review comments. Traditional evaluation methods for these comments, primarily based on text similarity, face two major challenges: inconsistent reliability of human-authored comments in open-source projects and the weak correlation of text similarity with objectives like enhancing code quality and detecting defects.<\/jats:p>\n          <jats:p>This study empirically analyzes benchmark comments using a novel set of criteria informed by prior research and developer interviews. We then similarly revisit the evaluation of existing methodologies. Our evaluation framework, DeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a comprehensive reassessment of current techniques based on the criteria set. 
In addition, we introduce an innovative and efficient baseline, LLM-Reviewer, which leverages the few-shot learning capabilities of LLMs for a target-oriented comparison.<\/jats:p>\n          <jats:p>Our research highlights the limitations of text similarity metrics, finding that less than 10% of benchmark comments are of high enough quality for automation. In contrast, DeepCRCEval effectively distinguishes between high- and low-quality comments, proving to be a more reliable evaluation mechanism. Incorporating LLM evaluators into DeepCRCEval significantly boosts efficiency, reducing time and cost by 88.78% and 90.32%, respectively. Furthermore, LLM-Reviewer demonstrates significant potential in focusing on the real targets of the task in comment generation.<\/jats:p>","DOI":"10.1007\/978-3-031-90900-9_3","type":"book-chapter","created":{"date-parts":[[2025,4,30]],"date-time":"2025-04-30T04:44:09Z","timestamp":1745988249000},"page":"43-64","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation"],"prefix":"10.1007","author":[{"given":"Junyi","family":"Lu","sequence":"first","affiliation":[]},{"given":"Xiaojia","family":"Li","sequence":"additional","affiliation":[]},{"given":"Zihan","family":"Hua","sequence":"additional","affiliation":[]},{"given":"Lei","family":"Yu","sequence":"additional","affiliation":[]},{"given":"Shiqi","family":"Cheng","sequence":"additional","affiliation":[]},{"given":"Li","family":"Yang","sequence":"additional","affiliation":[]},{"given":"Fengjun","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Chun","family":"Zuo","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,5,1]]},"reference":[{"key":"3_CR1","doi-asserted-by":"publisher","unstructured":"Bacchelli, A., Bird, C.: Expectations, outcomes, and challenges of modern code review. 
In: 2013 35th International Conference on Software Engineering (ICSE). pp. 712\u2013721 (2013). https:\/\/doi.org\/10.1109\/ICSE.2013.6606617","DOI":"10.1109\/ICSE.2013.6606617"},{"key":"3_CR2","doi-asserted-by":"crossref","unstructured":"Bosu, A., Carver, J.C., Bird, C., Orbeck, J., Chockley, C.: Process aspects and social dynamics of contemporary code review: Insights from open source development and industrial practice at microsoft. IEEE Transactions on Software Engineering 43(1), 56\u201375 (2017). https:\/\/doi.org\/10.1109\/TSE.2016.2576451","DOI":"10.1109\/TSE.2016.2576451"},{"key":"3_CR3","doi-asserted-by":"publisher","unstructured":"Bosu, A., Greiler, M., Bird, C.: Characteristics of useful code reviews: An empirical study at microsoft. In: 2015 IEEE\/ACM 12th Working Conference on Mining Software Repositories. pp. 146\u2013156 (2015). https:\/\/doi.org\/10.1109\/MSR.2015.21","DOI":"10.1109\/MSR.2015.21"},{"key":"3_CR4","unstructured":"Creswell, J.W., Creswell, J.D.: Research design: Qualitative, quantitative, and mixed methods approaches. Sage publications (2017)"},{"key":"3_CR5","doi-asserted-by":"crossref","unstructured":"Dalkey, N., Helmer, O.: An experimental application of the delphi method to the use of experts. Management science 9(3), 458\u2013467 (1963)","DOI":"10.1287\/mnsc.9.3.458"},{"key":"3_CR6","unstructured":"Dubois, Y., Li, X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P., Hashimoto, T.B.: Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387 (2023)"},{"key":"3_CR7","doi-asserted-by":"crossref","unstructured":"Fagan, M.: Design and code inspections to reduce errors in program development. In: Software pioneers, pp. 575\u2013607. Springer (2002)","DOI":"10.1007\/978-3-642-59412-0_35"},{"key":"3_CR8","unstructured":"Gupta, A., Sundaresan, N.: Intelligent code reviews using deep learning. 
In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD\u201918) Deep Learning Day (2018)"},{"key":"3_CR9","doi-asserted-by":"publisher","unstructured":"Hasan, M., Iqbal, A., Islam, M.R.U., Rahman, A.I., Bosu, A.: Using a balanced scorecard to identify opportunities to improve code review effectiveness: An industrial experience report. Empirical Softw. Engg. 26(6) (nov 2021). https:\/\/doi.org\/10.1007\/s10664-021-10038-w, https:\/\/doi.org\/10.1007\/s10664-021-10038-w","DOI":"10.1007\/s10664-021-10038-w"},{"key":"3_CR10","doi-asserted-by":"crossref","unstructured":"Hong, Y., Tantithamthavorn, C., Thongtanunam, P., Aleti, A.: Commentfinder: a simpler, faster, more accurate code review comments recommendation. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 507\u2013519 (2022)","DOI":"10.1145\/3540250.3549119"},{"key":"3_CR11","unstructured":"Hua, Z.H., Yang, L., Lu, J.Y., Zuo, C.: Survey on code review automation research. Ruan Jian Xue Bao\/Journal of Software 35(7), 3265\u20133290 (2024) (in Chinese). http:\/\/www.jos.org.cn\/1000-9825\/7112.htm"},{"key":"3_CR12","doi-asserted-by":"publisher","unstructured":"Kononenko, O., Baysal, O., Godfrey, M.W.: Code review quality: How developers see it. In: 2016 IEEE\/ACM 38th International Conference on Software Engineering (ICSE). pp. 1028\u20131038 (2016). https:\/\/doi.org\/10.1145\/2884781.2884840","DOI":"10.1145\/2884781.2884840"},{"key":"3_CR13","unstructured":"Li, D., Jiang, B., Huang, L., Beigi, A., Zhao, C., Tan, Z., Bhattacharjee, A., Jiang, Y., Chen, C., Wu, T., et\u00a0al.: From generation to judgment: Opportunities and challenges of llm-as-a-judge. 
arXiv preprint arXiv:2411.16594 (2024)"},{"key":"3_CR14","doi-asserted-by":"crossref","unstructured":"Li, L., Yang, L., Jiang, H., Yan, J., Luo, T., Hua, Z., Liang, G., Zuo, C.: Auger: automatically generating review comments with pre-training models. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 1009\u20131021 (2022)","DOI":"10.1145\/3540250.3549099"},{"key":"3_CR15","unstructured":"Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., Hashimoto, T.B.: Alpacaeval: An automatic evaluator of instruction-following models. https:\/\/github.com\/tatsu-lab\/alpaca_eval (2023)"},{"key":"3_CR16","doi-asserted-by":"crossref","unstructured":"Li, Z., Lu, S., Guo, D., Duan, N., Jannu, S., Jenks, G., Majumder, D., Green, J., Svyatkovskiy, A., Fu, S., et\u00a0al.: Automating code review activities by large-scale pre-training. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 1035\u20131047 (2022)","DOI":"10.1145\/3540250.3549081"},{"key":"3_CR17","doi-asserted-by":"crossref","unstructured":"Lin, B., Wang, S., Liu, Z., Liu, Y., Xia, X., Mao, X.: Cct5: A code-change-oriented pre-trained model. In: Proceedings of the 38th IEEE\/ACM International Conference on Automated Software Engineering (2023)","DOI":"10.1145\/3611643.3616339"},{"key":"3_CR18","unstructured":"Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74\u201381 (2004)"},{"key":"3_CR19","unstructured":"Lindlof, T.R., Taylor, B.C.: Qualitative communication research methods. Sage publications (2017)"},{"key":"3_CR20","doi-asserted-by":"crossref","unstructured":"Lu, J., Li, Z., Shen, C., Yang, L., Zuo, C.: Exploring the impact of code review factors on the code review comment generation. 
Automated Software Engineering 31(2), 71 (2024)","DOI":"10.1007\/s10515-024-00469-2"},{"key":"3_CR21","doi-asserted-by":"crossref","unstructured":"Lu, J., Yu, L., Li, X., Yang, L., Zuo, C.: Llama-reviewer: Advancing code review automation with large language models through parameter-efficient fine-tuning. In: 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE). pp. 647\u2013658. IEEE (2023)","DOI":"10.1109\/ISSRE59848.2023.00026"},{"key":"3_CR22","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311\u2013318 (2002)","DOI":"10.3115\/1073083.1073135"},{"key":"3_CR23","doi-asserted-by":"publisher","unstructured":"Rahman, M.M., Roy, C.K., Kula, R.G.: Predicting usefulness of code review comments using textual features and developer experience. In: 2017 IEEE\/ACM 14th International Conference on Mining Software Repositories (MSR). pp. 215\u2013226 (2017). https:\/\/doi.org\/10.1109\/MSR.2017.17","DOI":"10.1109\/MSR.2017.17"},{"key":"3_CR24","unstructured":"Ray: Aviary: Study stochastic parrots in the wild. https:\/\/github.com\/ray-project\/aviary (2023)"},{"key":"3_CR25","doi-asserted-by":"crossref","unstructured":"Rigby, P.C., Bird, C.: Convergent contemporary software peer review practices. In: Proceedings of the 2013 9th joint meeting on foundations of software engineering. pp. 202\u2013212 (2013)","DOI":"10.1145\/2491411.2491444"},{"key":"3_CR26","doi-asserted-by":"crossref","unstructured":"Rigby, P.C., German, D.M., Cowen, L., Storey, M.A.: Peer review on open-source software projects: Parameters, statistical models, and theory. 
ACM Transactions on Software Engineering and Methodology (TOSEM) 23(4), 1\u201333 (2014)","DOI":"10.1145\/2594458"},{"key":"3_CR27","doi-asserted-by":"crossref","unstructured":"Rigby, P.C., German, D.M., Storey, M.A.: Open source software peer review practices: a case study of the apache server. In: Proceedings of the 30th international conference on Software engineering. pp. 541\u2013550 (2008)","DOI":"10.1145\/1368088.1368162"},{"key":"3_CR28","doi-asserted-by":"crossref","unstructured":"Sadowski, C., S\u00f6derberg, E., Church, L., Sipko, M., Bacchelli, A.: Modern code review: a case study at google. In: Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice. pp. 181\u2013190 (2018)","DOI":"10.1145\/3183519.3183525"},{"key":"3_CR29","doi-asserted-by":"crossref","unstructured":"Shan, Q., Sukhdeo, D., Huang, Q., Rogers, S., Chen, L., Paradis, E., Rigby, P.C., Nagappan, N.: Using nudges to accelerate code reviews at scale. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 472\u2013482 (2022)","DOI":"10.1145\/3540250.3549104"},{"key":"3_CR30","doi-asserted-by":"crossref","unstructured":"Siow, J.K., Gao, C., Fan, L., Chen, S., Liu, Y.: Core: Automating review recommendation for code changes. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). pp. 284\u2013295. IEEE (2020)","DOI":"10.1109\/SANER48275.2020.9054794"},{"key":"3_CR31","doi-asserted-by":"crossref","unstructured":"Tufano, R., Masiero, S., Mastropaolo, A., Pascarella, L., Poshyvanyk, D., Bavota, G.: Using pre-trained models to boost code review automation. In: Proceedings of the 44th International Conference on Software Engineering. pp. 
2291\u20132302 (2022)","DOI":"10.1145\/3510003.3510621"},{"key":"3_CR32","doi-asserted-by":"crossref","unstructured":"Tufano, R., Pascarella, L., Tufano, M., Poshyvanyk, D., Bavota, G.: Towards automating code review activities. In: 2021 IEEE\/ACM 43rd International Conference on Software Engineering (ICSE). pp. 163\u2013174. IEEE (2021)","DOI":"10.1109\/ICSE43902.2021.00027"},{"key":"3_CR33","doi-asserted-by":"publisher","unstructured":"Yang, L., Xu, J., Zhang, Y., Zhang, H., Bacchelli, A.: Evacrc: Evaluating code review comments. In: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. p. 275-287. ESEC\/FSE 2023, Association for Computing Machinery, New York, NY, USA (2023). https:\/\/doi.org\/10.1145\/3611643.3616245, https:\/\/doi.org\/10.1145\/3611643.3616245","DOI":"10.1145\/3611643.3616245"},{"key":"3_CR34","doi-asserted-by":"crossref","unstructured":"Yang, X., Kula, R.G., Yoshida, N., Iida, H.: Mining the modern code review repositories: A dataset of people, process and product. In: Proceedings of the 13th International Conference on Mining Software Repositories. pp. 460\u2013463 (2016)","DOI":"10.1145\/2901739.2903504"},{"key":"3_CR35","unstructured":"Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et\u00a0al.: Judging llm-as-a-judge with mt-bench and chatbot arena. 
Advances in Neural Information Processing Systems (2023)"}],"container-title":["Lecture Notes in Computer Science","Fundamental Approaches to Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-031-90900-9_3","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,4,30]],"date-time":"2025-04-30T04:44:44Z","timestamp":1745988284000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-031-90900-9_3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"ISBN":["9783031908996","9783031909009"],"references-count":35,"URL":"https:\/\/doi.org\/10.1007\/978-3-031-90900-9_3","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"value":"0302-9743","type":"print"},{"value":"1611-3349","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025]]},"assertion":[{"value":"1 May 2025","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"FASE","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"International Conference on Fundamental Approaches to Software Engineering","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Hamilton, ON","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Canada","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2025","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"3 May 
2025","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"8 May 2025","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"28","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"fase2025","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/etaps.org\/2025\/conferences\/fase\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}}]}}