{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T10:02:05Z","timestamp":1775815325535,"version":"3.50.1"},"reference-count":68,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2025,2,23]],"date-time":"2025-02-23T00:00:00Z","timestamp":1740268800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62192733, 61832009, 62192731, 62192730, 62072007"],"award-info":[{"award-number":["62192733, 61832009, 62192731, 62192730, 62072007"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Key Program of Hubei","award":["JD2023008"],"award-info":[{"award-number":["JD2023008"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Softw. Eng. Methodol."],"published-print":{"date-parts":[[2025,3,31]]},"abstract":"<jats:p>A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from two significant drawbacks. 1. They primarily measure the surface differences between codes without considering their functional equivalence. However, functional equivalence is pivotal in evaluating the effectiveness of code generation, as different codes can perform identical operations. 2. They are predominantly designed for the Ref-only input format. However, code evaluation necessitates versatility in input formats. Aside from Ref-only, there are NL-only and Ref and NL formats, which existing match-based CEMs cannot effectively accommodate. 
In this article, we propose CodeScore, a large language model (LLM)-based CEM, which estimates the functional correctness of generated code on three input types. To acquire CodeScore, we present UniCE, a unified code generation learning framework, for LLMs to learn code execution (i.e., learning PassRatio and Executability of generated code) with unified input. Extensive experimental results on multiple code evaluation datasets demonstrate that CodeScore absolutely improves up to 58.87% correlation with functional correctness compared to other CEMs, achieves state-of-the-art performance, and effectively handles three input formats.<\/jats:p>","DOI":"10.1145\/3695991","type":"journal-article","created":{"date-parts":[[2024,9,13]],"date-time":"2024-09-13T13:56:18Z","timestamp":1726235778000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":44,"title":["CodeScore: Evaluating Code Generation by Learning Code Execution"],"prefix":"10.1145","volume":"34","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6228-4019","authenticated-orcid":false,"given":"Yihong","family":"Dong","sequence":"first","affiliation":[{"name":"Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-2884-7434","authenticated-orcid":false,"given":"Jiazheng","family":"Ding","sequence":"additional","affiliation":[{"name":"Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5317-865X","authenticated-orcid":false,"given":"Xue","family":"Jiang","sequence":"additional","affiliation":[{"name":"Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking 
University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5828-0186","authenticated-orcid":false,"given":"Ge","family":"Li","sequence":"additional","affiliation":[{"name":"Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0198-2304","authenticated-orcid":false,"given":"Zhuo","family":"Li","sequence":"additional","affiliation":[{"name":"Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1087-226X","authenticated-orcid":false,"given":"Zhi","family":"Jin","sequence":"additional","affiliation":[{"name":"Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,2,23]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Shushan Arakelyan Anna Hakhverdyan Miltiadis Allamanis Christophe Hauser Luis Garcia and Xiang Ren. 2022. NS3: Neuro-symbolic semantic code search. arXiv:2205.10674. Retrieved from https:\/\/arxiv.org\/abs\/2205.10674"},{"key":"e_1_3_2_3_2","unstructured":"Jacob Austin Augustus Odena Maxwell I. Nye Maarten Bosma Henryk Michalewski David Dohan Ellen Jiang Carrie J. Cai Michael Terry Quoc V. Le and Charles Sutton. 2021. Program synthesis with large language models. arXiv:2108.07732. Retrieved from https:\/\/arxiv.org\/abs\/2108.07732"},{"key":"e_1_3_2_4_2","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In IEEvaluation@ACL. 
Association for Computational Linguistics 65\u201372."},{"key":"e_1_3_2_5_2","volume-title":"Analyse math\u00e9matique sur les probabilit\u00e9s des erreurs de situation d\u2019un point","author":"Bravais Auguste","year":"1844","unstructured":"Auguste Bravais. 1844. Analyse math\u00e9matique sur les probabilit\u00e9s des erreurs de situation d\u2019un point. Impr. Royale."},{"key":"e_1_3_2_6_2","unstructured":"Mark Chen Jerry Tworek Heewoo Jun Qiming Yuan Henrique Pond\u00e9 de Oliveira Pinto Jared Kaplan Harrison Edwards Yuri Burda Nicholas Joseph Greg Brockman Alex Ray Raul Puri Gretchen Krueger Michael Petrov Heidy Khlaaf Girish Sastry Pamela Mishkin Brooke Chan Scott Gray Nick Ryder Mikhail Pavlov Alethea Power Lukasz Kaiser Mohammad Bavarian Clemens Winter Philippe Tillet Felipe Petroski Such Dave Cummings Matthias Plappert Fotios Chantzis Elizabeth Barnes Ariel Herbert-Voss William Hebgen Guss Alex Nichol Alex Paino Nikolas Tezak Jie Tang Igor Babuschkin Suchir Balaji Shantanu Jain William Saunders Christopher Hesse Andrew N. Carr Jan Leike Joshua Achiam Vedant Misra Evan Morikawa Alec Radford Matthew Knight Miles Brundage Mira Murati Katie Mayer Peter Welinder Bob McGrew Dario Amodei Sam McCandlish Ilya Sutskever and Wojciech Zaremba. 2021. Evaluating large language models trained on code. arXiv:2107.03374. Retrieved from https:\/\/arxiv.org\/abs\/2107.03374"},{"key":"e_1_3_2_7_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1). Association for Computational Linguistics 4171\u20134186."},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","unstructured":"Yihong Dong Xue Jiang Zhi Jin and Ge Li. 2023. Self-collaboration code generation via ChatGPT. ACM Transactions on Software Engineering and Methodology 33 7 Article 189 (2023) 38 pages. 
DOI: 10.1145\/3672459","DOI":"10.1145\/3672459"},{"key":"e_1_3_2_9_2","doi-asserted-by":"crossref","unstructured":"Yihong Dong Xue Jiang Huanyu Liu Zhi Jin and Ge Li. 2024. Generalization or memorization: Data contamination and trustworthy evaluation for large language models. arXiv:2402.15938. Retrieved from https:\/\/arxiv.org\/abs\/2402.15938","DOI":"10.18653\/v1\/2024.findings-acl.716"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.3233\/FAIA230317"},{"key":"e_1_3_2_11_2","doi-asserted-by":"crossref","unstructured":"Yihong Dong Ge Li and Zhi Jin. 2023. CODEP: Grammatical seq2seq model for general-purpose code generation. In ISSTA. ACM 188\u2013198.","DOI":"10.1145\/3597926.3598048"},{"key":"e_1_3_2_12_2","doi-asserted-by":"crossref","unstructured":"Yihong Dong Kangcheng Luo Xue Jiang Zhi Jin and Ge Li. 2024. PACE: Improving prompt with actor-critic editing for large language model. In ACL \u201924. 7304\u20137323.","DOI":"10.18653\/v1\/2024.findings-acl.436"},{"key":"e_1_3_2_13_2","doi-asserted-by":"crossref","unstructured":"Aryaz Eghbali and Michael Pradel. 2022. CrystalBLEU: Precisely and efficiently measuring the similarity of code. In ASE. ACM 28:1\u201328:12.","DOI":"10.1145\/3551349.3556903"},{"key":"e_1_3_2_14_2","doi-asserted-by":"crossref","unstructured":"Zhangyin Feng Daya Guo Duyu Tang Nan Duan Xiaocheng Feng Ming Gong Linjun Shou Bing Qin Ting Liu Daxin Jiang and Ming Zhou. 2020. CodeBERT: A pre-trained model for programming and natural languages. In EMNLP (Findings). Association for Computational Linguistics 1536\u20131547.","DOI":"10.18653\/v1\/2020.findings-emnlp.139"},{"key":"e_1_3_2_15_2","unstructured":"Daniel Fried Armen Aghajanyan Jessy Lin Sida Wang Eric Wallace Freda Shi Ruiqi Zhong Wen-tau Yih Luke Zettlemoyer and Mike Lewis. 2022. Incoder: A generative model for code infilling and synthesis. arXiv:2204.05999. 
Retrieved from https:\/\/arxiv.org\/abs\/2204.05999"},{"key":"e_1_3_2_16_2","doi-asserted-by":"crossref","unstructured":"Daya Guo Shuai Lu Nan Duan Yanlin Wang Ming Zhou and Jian Yin. 2022. UniXcoder: Unified cross-modal pre-training for code representation. In ACL (1). Association for Computational Linguistics 7212\u20137225.","DOI":"10.18653\/v1\/2022.acl-long.499"},{"key":"e_1_3_2_17_2","unstructured":"Daya Guo Alexey Svyatkovskiy Jian Yin Nan Duan Marc Brockschmidt and Miltiadis Allamanis. 2022. Learning to complete code with sketches. In ICLR. OpenReview.net."},{"key":"e_1_3_2_18_2","unstructured":"Yiyang Hao Ge Li Yongqiang Liu Xiaowei Miao He Zong Siyuan Jiang Yang Liu and He Wei. 2022. AixBench: A code generation benchmark dataset. arXiv:2206.13179. Retrieved from https:\/\/arxiv.org\/abs\/2206.13179"},{"key":"e_1_3_2_19_2","unstructured":"Dan Hendrycks Steven Basart Saurav Kadavath Mantas Mazeika Akul Arora Ethan Guo Collin Burns Samir Puranik Horace He Dawn Song and Jacob Steinhardt. 2021. Measuring coding challenge competence with APPS. In NeurIPS Datasets and Benchmarks."},{"key":"e_1_3_2_20_2","unstructured":"Dong Huang Qingwen Bu Jie Zhang Xiaofei Xie Junjie Chen and Heming Cui. 2023. Bias assessment and mitigation in llm-based code generation. arXiv:2309.14345. Retrieved from https:\/\/arxiv.org\/abs\/2309.14345"},{"key":"e_1_3_2_21_2","unstructured":"Jeevana Priya Inala Chenglong Wang Mei Yang Andr\u00e9s Codas Mark Encarnaci\u00f3n Shuvendu K. Lahiri Madanlal Musuvathi and Jianfeng Gao. 2022. Fault-aware neural code rankers. In NeurIPS."},{"key":"e_1_3_2_22_2","unstructured":"Xue Jiang Yihong Dong Zhi Jin and Ge Li. 2024. SEED: Customize large language models with sample-efficient adaptation for code generation. arXiv:2403.00046."},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","unstructured":"Xue Jiang Yihong Dong Lecheng Wang Fang Zheng Qiwei Shang Ge Li Zhi Jin and Wenpin Jiao. 2023. Self-planning code generation with large language models. 
ACM Transactions on Software Engineering and Methodology 33 7 Article 182 (2023) 30 pages. DOI: 10.1145\/3672456","DOI":"10.1145\/3672456"},{"key":"e_1_3_2_24_2","doi-asserted-by":"crossref","unstructured":"Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika 30 1\/2 (1938) 81\u201393.","DOI":"10.1093\/biomet\/30.1-2.81"},{"key":"e_1_3_2_25_2","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR."},{"key":"e_1_3_2_26_2","unstructured":"Sumith Kulal Panupong Pasupat Kartik Chandra Mina Lee Oded Padon Alex Aiken and Percy Liang. 2019. SPoC: Search-based pseudocode to code. In NeurIPS 11883\u201311894."},{"key":"e_1_3_2_27_2","unstructured":"Jia Li Ge Li Xuanming Zhang Yihong Dong and Zhi Jin. 2024a. EvoCodeBench: An evolving code generation benchmark aligned with real-world code repositories. arXiv:2404.00599. Retrieved from https:\/\/arxiv.org\/abs\/2404.00599"},{"key":"e_1_3_2_28_2","doi-asserted-by":"crossref","unstructured":"Jia Li Ge Li Yunfei Zhao Yongmin Li Huanyu Liu Hao Zhu Lecheng Wang Kaibo Liu Zheng Fang Lanshen Wang Jiazheng Ding Xuanming Zhang Yuqi Zhu Yihong Dong Zhi Jin Binhua Li Fei Huang Yongbin Li Bin Gu and Mengfei Yang. 2024b. DevEval: A manually-annotated code generation benchmark aligned with real-world code repositories. 
In ACL \u201924 3603\u20133614.","DOI":"10.18653\/v1\/2024.findings-acl.214"},{"key":"e_1_3_2_29_2","unstructured":"Raymond Li Loubna Ben Allal Yangtian Zi Niklas Muennighoff Denis Kocetkov Chenghao Mou Marc Marone Christopher Akiki Jia Li Jenny Chim Qian Liu Evgenii Zheltonozhskii Terry Yue Zhuo Thomas Wang Olivier Dehaene Mishig Davaadorj Joel Lamy-Poirier Jo\u00e3o Monteiro Oleh Shliazhko Nicolas Gontier Nicholas Meade Armel Zebaze Ming-Ho Yee Logesh Kumar Umapathi Jian Zhu Benjamin Lipkin Muhtasham Oblokulov Zhiruo Wang Rudra Murthy V Jason Stillerman Siva Sankalp Patel Dmitry Abulkhanov Marco Zocca Manan Dey Zhihan Zhang Nour Moustafa-Fahmy Urvashi Bhattacharyya Wenhao Yu Swayam Singh Sasha Luccioni Paulo Villegas Maxim Kunakov Fedor Zhdanov Manuel Romero Tony Lee Nadav Timor Jennifer Ding Claire Schlesinger Hailey Schoelkopf Jan Ebert Tri Dao Mayank Mishra Alex Gu Jennifer Robinson Carolyn Jane Anderson Brendan Dolan-Gavitt Danish Contractor Siva Reddy Daniel Fried Dzmitry Bahdanau Yacine Jernite Carlos Mu\u00f1oz Ferrandis Sean Hughes Thomas Wolf Arjun Guha Leandro von Werra and Harm de Vries. 2023. StarCoder: May the source be with you! arXiv:2305.06161. Retrieved from https:\/\/arxiv.org\/abs\/2305.06161"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.abq1158"},{"key":"e_1_3_2_31_2","unstructured":"Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. Association for Computational Linguistics 74\u201381."},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1057"},{"key":"e_1_3_2_33_2","unstructured":"Shuai Lu Nan Duan Hojae Han Daya Guo Seung-won Hwang and Alexey Svyatkovskiy. 2022. ReACC: A retrieval-augmented code completion framework. In ACL. 
Association for Computational Linguistics 6227\u20136240."},{"key":"e_1_3_2_34_2","unstructured":"Ziyang Luo Can Xu Pu Zhao Qingfeng Sun Xiubo Geng Wenxiang Hu Chongyang Tao Jing Ma Qingwei Lin and Daxin Jiang. 2023. WizardCoder: Empowering code large language models with evol-instruct. arXiv:2306.08568. Retrieved from https:\/\/arxiv.org\/abs\/2306.08568"},{"key":"e_1_3_2_35_2","unstructured":"Alexander McFarlane Mood. 1950. Introduction to the Theory of Statistics. McGraw-Hill"},{"key":"e_1_3_2_36_2","unstructured":"Rohan Mukherjee Yeming Wen Dipak Chaudhari Thomas W. Reps Swarat Chaudhuri and Christopher M. Jermaine. 2021. Neural program generation modulo static analysis. In NeurIPS 18984\u201318996."},{"key":"e_1_3_2_37_2","unstructured":"Erik Nijkamp Bo Pang Hiroaki Hayashi Lifu Tu Huan Wang Yingbo Zhou Silvio Savarese and Caiming Xiong. 2023. CodeGen: An open large language model for code with multi-turn program synthesis. In ICLR. OpenReview.net."},{"key":"e_1_3_2_38_2","unstructured":"OpenAI. 2023. ChatGPT: Optimizing Language Models for Dialogue. Retrieved from https:\/\/openai.com\/blog\/chatgpt\/"},{"key":"e_1_3_2_39_2","unstructured":"OpenAI. 2023. GPT-4 technical report. arXiv:2303.08774. Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_2_40_2","doi-asserted-by":"crossref","unstructured":"Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL. ACL 311\u2013318.","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_3_2_41_2","doi-asserted-by":"crossref","unstructured":"Matthew E. Peters Mark Neumann Mohit Iyyer Matt Gardner Christopher Clark Kenton Lee and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT. Association for Computational Linguistics 2227\u20132237.","DOI":"10.18653\/v1\/N18-1202"},{"key":"e_1_3_2_42_2","doi-asserted-by":"crossref","unstructured":"Maja Popovic. 2015. 
chrF: Character n-gram F-score for automatic MT evaluation. In WMT@EMNLP. The Association for Computer Linguistics 392\u2013395.","DOI":"10.18653\/v1\/W15-3049"},{"key":"e_1_3_2_43_2","doi-asserted-by":"crossref","unstructured":"Tharindu Ranasinghe Constantin Orasan and Ruslan Mitkov. 2020. TransQuest: Translation quality estimation with cross-lingual transformers. In COLING. International Committee on Computational Linguistics 5070\u20135081.","DOI":"10.18653\/v1\/2020.coling-main.445"},{"key":"e_1_3_2_44_2","doi-asserted-by":"crossref","unstructured":"Veselin Raychev Martin T. Vechev and Eran Yahav. 2014. Code completion with statistical language models. In PLDI. ACM 419\u2013428.","DOI":"10.1145\/2666356.2594321"},{"key":"e_1_3_2_45_2","unstructured":"Ricardo Rei Jos\u00e9 GC De Souza Duarte Alves Chrysoula Zerva Ana C Farinha Taisiya Glushkova Alon Lavie Luisa Coheur and Andr\u00e9 FT Martins. 2022. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In WMT@EMNLP. Association for Computational Linguistics 578\u2013585."},{"key":"e_1_3_2_46_2","unstructured":"Ricardo Rei Ana C. Farinha Chrysoula Zerva Daan van Stigt Craig Stewart Pedro G. Ramos Taisiya Glushkova Andr\u00e9 F. T. Martins and Alon Lavie. 2021. Are references really needed? Unbabel-IST 2021 submission for the metrics shared task. In WMT@EMNLP. Association for Computational Linguistics 1030\u20131040."},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.213"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1410"},{"key":"e_1_3_2_49_2","unstructured":"Shuo Ren Daya Guo Shuai Lu Long Zhou Shujie Liu Duyu Tang Neel Sundaresan Ming Zhou Ambrosio Blanco and Shuai Ma. 2020. CodeBLEU: A method for automatic evaluation of code synthesis. arXiv:2009.10297. 
Retrieved from https:\/\/arxiv.org\/abs\/2009.10297"},{"key":"e_1_3_2_50_2","unstructured":"Baptiste Rozi\u00e8re Jonas Gehring Fabian Gloeckle Sten Sootla Itai Gat Xiaoqing Ellen Tan Yossi Adi Jingyu Liu Tal Remez J\u00e9r\u00e9my Rapin Artyom Kozhevnikov Ivan Evtimov Joanna Bitton Manish Bhatt Cristian Canton-Ferrer Aaron Grattafiori Wenhan Xiong Alexandre D\u00e9fossez Jade Copet Faisal Azhar Hugo Touvron Louis Martin Nicolas Usunier Thomas Scialom and Gabriel Synnaeve. 2023. Code Llama: Open foundation models for code. arXiv:2308.12950. Retrieved from https:\/\/arxiv.org\/abs\/2308.12950"},{"key":"e_1_3_2_51_2","unstructured":"Baptiste Rozi\u00e8re Marie-Anne Lachaux Lowik Chanussot and Guillaume Lample. 2020. Unsupervised translation of programming languages. In NeurIPS."},{"key":"e_1_3_2_52_2","doi-asserted-by":"crossref","unstructured":"Thibault Sellam Dipanjan Das and Ankur P. Parikh. 2020. BLEURT: Learning robust metrics for text generation. In ACL. Association for Computational Linguistics 7881\u20137892.","DOI":"10.18653\/v1\/2020.acl-main.704"},{"key":"e_1_3_2_53_2","doi-asserted-by":"crossref","unstructured":"Sijie Shen Xiang Zhu Yihong Dong Qizhi Guo Yankun Zhen and Ge Li. 2022. Incorporating domain knowledge through task augmentation for front-end JavaScript code generation. In ESEC\/SIGSOFT FSE. ACM 1533\u20131543.","DOI":"10.1145\/3540250.3558965"},{"key":"e_1_3_2_54_2","doi-asserted-by":"crossref","unstructured":"Weisong Sun Chunrong Fang Yuchen Chen Guanhong Tao Tingxu Han and Quanjun Zhang. 2022. Code search based on context-aware code translation. In ICSE. ACM 388\u2013400.","DOI":"10.1145\/3510003.3510140"},{"key":"e_1_3_2_55_2","doi-asserted-by":"crossref","unstructured":"Zeyu Sun Qihao Zhu Yingfei Xiong Yican Sun Lili Mou and Lu Zhang. 2020. TreeGen: A tree-based transformer architecture for code generation. In AAAI. 
AAAI Press 8984\u20138991.","DOI":"10.1609\/aaai.v34i05.6430"},{"key":"e_1_3_2_56_2","doi-asserted-by":"crossref","unstructured":"Ian Tenney Dipanjan Das and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In ACL Vol. 1. Association for Computational Linguistics 4593\u20134601.","DOI":"10.18653\/v1\/P19-1452"},{"key":"e_1_3_2_57_2","doi-asserted-by":"crossref","unstructured":"Yu Wan Dayiheng Liu Baosong Yang Haibo Zhang Boxing Chen Derek F. Wong and Lidia S. Chao. 2022. UniTE: Unified translation evaluation. In ACL Vol. 1. Association for Computational Linguistics 8117\u20138127.","DOI":"10.18653\/v1\/2022.acl-long.558"},{"key":"e_1_3_2_58_2","doi-asserted-by":"crossref","unstructured":"Yue Wang Weishi Wang Shafiq R. Joty and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In EMNLP Vol. 1. 8696\u20138708.","DOI":"10.18653\/v1\/2021.emnlp-main.685"},{"key":"e_1_3_2_59_2","unstructured":"Bolin Wei Ge Li Xin Xia Zhiyi Fu and Zhi Jin. 2019. Code generation as a dual task of code summarization. In NeurIPS 6559\u20136569."},{"key":"e_1_3_2_60_2","doi-asserted-by":"crossref","unstructured":"Xiaokai Wei Sujan Kumar Gonugondla Shiqi Wang Wasi Uddin Ahmad Baishakhi Ray Haifeng Qian Xiaopeng Li Varun Kumar Zijian Wang Yuchen Tian Qing Sun Ben Athiwaratkun Mingyue Shang Murali Krishna Ramanathan Parminder Bhatia and Bing Xiang. 2023. Towards greener yet powerful code generation via quantization: An empirical study. In ESEC\/SIGSOFT FSE. ACM 224\u2013236.","DOI":"10.1145\/3611643.3616302"},{"key":"e_1_3_2_61_2","doi-asserted-by":"crossref","unstructured":"Pengcheng Yin and Graham Neubig. 2018. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. In EMNLP 7\u201312.","DOI":"10.18653\/v1\/D18-2002"},{"key":"e_1_3_2_62_2","unstructured":"Weizhe Yuan Graham Neubig and Pengfei Liu. 2021. 
BARTScore: Evaluating generated text as text generation. In NeurIPS 27263\u201327277."},{"key":"e_1_3_2_63_2","unstructured":"Shun Zhang Zhenfang Chen Yikang Shen Mingyu Ding Joshua B. Tenenbaum and Chuang Gan. 2023. Planning with large language models for code generation. In ICLR. OpenReview.net."},{"key":"e_1_3_2_64_2","unstructured":"Tianyi Zhang Varsha Kishore Felix Wu Kilian Q. Weinberger and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In ICLR. OpenReview.net."},{"key":"e_1_3_2_65_2","doi-asserted-by":"crossref","unstructured":"Wei Zhao Maxime Peyrard Fei Liu Yang Gao Christian M. Meyer and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In EMNLP\/IJCNLP Vol. 1. Association for Computational Linguistics 563\u2013578.","DOI":"10.18653\/v1\/D19-1053"},{"key":"e_1_3_2_66_2","first-page":"238","volume-title":"Internetware","author":"Zhao Yunfei","year":"2023","unstructured":"Yunfei Zhao, Yihong Dong, and Ge Li. 2023. Seq2Seq or Seq2Tree: Generating code using both paradigms via mutual learning. In Internetware, 238\u2013248."},{"key":"e_1_3_2_67_2","doi-asserted-by":"crossref","unstructured":"Qinkai Zheng Xiao Xia Xu Zou Yuxiao Dong Shan Wang Yufei Xue Zihan Wang Lei Shen Andi Wang Yang Li Teng Su Zhilin Yang and Jie Tang. 2023. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on Humaneval-X. arXiv:2303.17568. Retrieved from https:\/\/arxiv.org\/abs\/2303.17568","DOI":"10.1145\/3580305.3599790"},{"key":"e_1_3_2_68_2","doi-asserted-by":"crossref","unstructured":"Shuyan Zhou Uri Alon Sumit Agarwal and Graham Neubig. 2023. CodeBERTScore: Evaluating code generation with pretrained models of code. arXiv:2302.05527. 
Retrieved from https:\/\/arxiv.org\/abs\/2302.05527","DOI":"10.18653\/v1\/2023.emnlp-main.859"},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i10.21434"}],"container-title":["ACM Transactions on Software Engineering and Methodology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3695991","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3695991","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:04:30Z","timestamp":1750291470000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3695991"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,23]]},"references-count":68,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,3,31]]}},"alternative-id":["10.1145\/3695991"],"URL":"https:\/\/doi.org\/10.1145\/3695991","relation":{},"ISSN":["1049-331X","1557-7392"],"issn-type":[{"value":"1049-331X","type":"print"},{"value":"1557-7392","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,23]]},"assertion":[{"value":"2023-11-26","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-08-19","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-02-23","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}