{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,13]],"date-time":"2026-02-13T23:15:09Z","timestamp":1771024509570,"version":"3.50.1"},"reference-count":66,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Softw. Eng. Methodol."],"published-print":{"date-parts":[[2026,3,31]]},"abstract":"<jats:p>Evaluating the alignment of large language models (LLMs) with user-defined coding preferences is a challenging endeavor that requires a deep assessment of LLMs\u2019 outputs. Existing methods and benchmarks rely primarily on automated metrics and static analysis tools, which often fail to capture the nuances of user instructions and LLM outputs. To address this gap, we introduce the LLM-as-a-Judge evaluation framework and present CodeUltraFeedback, a comprehensive dataset for assessing and improving LLM alignment with coding preferences. CodeUltraFeedback consists of 10,000 coding instructions, each annotated with four responses generated from a diverse pool of 14 LLMs. These responses are annotated using GPT-3.5 as a judge, with both ranking-based scores and detailed textual feedback across five distinct coding preferences. Our analysis reveals that responses from GPT-3.5 and GPT-4 are consistently rated higher than those from open-weight models, underscoring substantial alignment gaps between closed- and open-weight LLMs. In turn, we explore the usage of CodeUltraFeedback as feedback data to fine-tune and align CodeLlama-7B-Instruct using supervised fine-tuning (SFT) and reinforcement learning from AI feedback (RLAIF) with direct preference optimization (DPO). The resulting aligned model achieves an average alignment improvement of 22.7% and 29.7% when evaluated with GPT-3.5 and GPT-4 judges, respectively. 
Notably, our aligned CodeLlama-7B-Instruct surpasses much larger models, such as CodeLlama-13B and 34B, in alignment with coding preferences. Despite not being explicitly trained for functional correctness, it also achieves a 10.5% and 26.6% relative improvement in Pass@1 and Pass@10 on the HumanEval+ benchmark. Our contributions demonstrate the practical value of preference tuning in code generation and set the stage for further progress in model alignment and RLAIF for automated software engineering.<\/jats:p>","DOI":"10.1145\/3736407","type":"journal-article","created":{"date-parts":[[2025,5,20]],"date-time":"2025-05-20T13:03:21Z","timestamp":1747746201000},"page":"1-36","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences"],"prefix":"10.1145","volume":"35","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5987-850X","authenticated-orcid":false,"given":"Martin","family":"Weyssow","sequence":"first","affiliation":[{"name":"DIRO, Universit\u00e9 de Montr\u00e9al, Montreal, Quebec, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-4747-4858","authenticated-orcid":false,"given":"Aton","family":"Kamanda","sequence":"additional","affiliation":[{"name":"DIRO, Universit\u00e9 de Montr\u00e9al, Montreal, Quebec, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4558-0622","authenticated-orcid":false,"given":"Xin","family":"Zhou","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6304-9926","authenticated-orcid":false,"given":"Houari","family":"Sahraoui","sequence":"additional","affiliation":[{"name":"DIRO, Universit\u00e9 de Montr\u00e9al, Montreal, 
Quebec, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,2,13]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/MSR.2013.6624029"},{"key":"e_1_3_2_3_2","unstructured":"Ben Athiwaratkun Sanjay Krishna Gouda Zijian Wang Xiaopeng Li Yuchen Tian Ming Tan Wasi Uddin Ahmad Shiqi Wang Qing Sun Mingyue Shang et al. 2022. Multi-lingual evaluation of code generation models. arXiv:2210.14868. Retrieved from https:\/\/arxiv.org\/abs\/2210.14868"},{"key":"e_1_3_2_4_2","unstructured":"Jacob Austin Augustus Odena Maxwell Nye Maarten Bosma Henryk Michalewski David Dohan Ellen Jiang Carrie Cai Michael Terry Quoc Le et al. 2021. Program synthesis with large language models. arXiv:2108.07732. Retrieved from https:\/\/arxiv.org\/abs\/2108.07732"},{"key":"e_1_3_2_5_2","unstructured":"Hannah McLean Babe Sydney Nguyen Yangtian Zi Arjun Guha Molly Q. Feldman and Carolyn Jane Anderson. 2023. StudentEval: A benchmark of student-written prompts for large language models of code. arXiv:2306.04556. Retrieved from https:\/\/arxiv.org\/abs\/2306.04556"},{"key":"e_1_3_2_6_2","unstructured":"Yuntao Bai Saurav Kadavath Sandipan Kundu Amanda Askell Jackson Kernion Andy Jones Anna Chen Anna Goldie Azalia Mirhoseini Cameron McKinnon et al. 2022. Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073. Retrieved from https:\/\/arxiv.org\/abs\/2212.08073"},{"key":"e_1_3_2_7_2","unstructured":"Manish Bhatt Sahana Chennabasappa Cyrus Nikolaidis Shengye Wan Ivan Evtimov Dominik Gabi Daniel Song Faizan Ahmad Cornelius Aschermann Lorenzo Fontana et al. 2023. Purple Llama CyberSecEval: A secure coding benchmark for language models. arXiv:2312.04724. 
Retrieved from https:\/\/arxiv.org\/abs\/2312.04724"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2023.3267446"},{"key":"e_1_3_2_9_2","unstructured":"Mark Chen Jerry Tworek Heewoo Jun Qiming Yuan Henrique Ponde de Oliveira Pinto Jared Kaplan Harri Edwards Yuri Burda Nicholas Joseph Greg Brockman et al. 2021. Evaluating large language models trained on code. arXiv:2107.03374. Retrieved from https:\/\/arxiv.org\/abs\/2107.03374"},{"key":"e_1_3_2_10_2","unstructured":"Ganqu Cui Lifan Yuan Ning Ding Guanming Yao Wei Zhu Yuan Ni Guotong Xie Zhiyuan Liu and Maosong Sun. 2023. UltraFeedback: Boosting language models with high-quality feedback. arXiv:2310.01377. Retrieved from https:\/\/arxiv.org\/abs\/2310.01377"},{"key":"e_1_3_2_11_2","unstructured":"Tim Dettmers Artidoro Pagnoni Ari Holtzman and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv:2305.14314. Retrieved from https:\/\/arxiv.org\/abs\/2305.14314"},{"key":"e_1_3_2_12_2","unstructured":"Yangruibo Ding Zijian Wang Wasi Uddin Ahmad Hantian Ding Ming Tan Nihal Jain Murali Krishna Ramanathan Ramesh Nallapati Parminder Bhatia Dan Roth et al. 2023. CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion. arXiv:2310.11248. Retrieved from https:\/\/arxiv.org\/abs\/2310.11248"},{"key":"e_1_3_2_13_2","unstructured":"Xueying Du Mingwei Liu Kaixin Wang Hanlin Wang Junwei Liu Yixuan Chen Jiayi Feng Chaofeng Sha Xin Peng and Yiling Lou. 2023. ClassEval: A manually-crafted benchmark for evaluating LLMs on class-level code generation. arXiv:2308.01861. Retrieved from https:\/\/arxiv.org\/abs\/2308.01861"},{"key":"e_1_3_2_14_2","unstructured":"Daya Guo Qihao Zhu Dejian Yang Zhenda Xie Kai Dong Wentao Zhang Guanting Chen Xiao Bi Y. Wu Y. K. Li et al. 2024. DeepSeek-Coder: When the large language model meets programming\u2014The rise of code intelligence. arXiv:2401.14196. 
Retrieved from https:\/\/arxiv.org\/abs\/2401.14196"},{"key":"e_1_3_2_15_2","article-title":"Measuring coding challenge competence with APPS","author":"Hendrycks Dan","year":"2021","unstructured":"Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with APPS. In Neural Information Processing Systems (NeurIPS).","journal-title":"Neural Information Processing Systems (NeurIPS)"},{"key":"e_1_3_2_16_2","unstructured":"Dong Huang Jie M. Zhang Yuhao Qing and Heming Cui. 2024. EffiBench: Benchmarking the efficiency of automatically generated code. arXiv:2402.02037. Retrieved from https:\/\/arxiv.org\/abs\/2402.02037"},{"key":"e_1_3_2_17_2","unstructured":"Hamel Husain Ho-Hsiang Wu Tiferet Gazit Miltiadis Allamanis and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv:1909.09436. Retrieved from https:\/\/arxiv.org\/abs\/1909.09436"},{"key":"e_1_3_2_18_2","unstructured":"Albert Q. Jiang Alexandre Sablayrolles Arthur Mensch Chris Bamford Devendra Singh Chaplot Diego de las Casas Florian Bressand Gianna Lengyel Guillaume Lample Lucile Saulnier et al. 2023. Mistral 7B. arXiv:2310.06825. Retrieved from https:\/\/arxiv.org\/abs\/2310.06825"},{"key":"e_1_3_2_19_2","doi-asserted-by":"crossref","first-page":"1529","DOI":"10.1109\/ASE56229.2023.00114","volume-title":"2023 38th IEEE\/ACM International Conference on Automated Software Engineering (ASE)","author":"Jiao Mingsheng","year":"2023","unstructured":"Mingsheng Jiao, Tingrui Yu, Xuan Li, Guanjie Qiu, Xiaodong Gu, and Beijun Shen. 2023. On the evaluation of neural code translation: Taxonomy and benchmark. In 2023 38th IEEE\/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1529\u20131541"},{"key":"e_1_3_2_20_2","unstructured":"Mohammad Abdullah Matin Khan M. 
Saiful Bari Xuan Long Do Weishi Wang Md Rizwan Parvez and Shafiq Joty. 2023. xCodeEval: A large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval. arXiv:2303.03004. Retrieved from https:\/\/arxiv.org\/abs\/2303.03004"},{"key":"e_1_3_2_21_2","unstructured":"Seungone Kim Jamin Shin Yejin Cho Joel Jang Shayne Longpre Hwaran Lee Sangdoo Yun Seongjin Shin Sungdong Kim James Thorne et al. 2023. Prometheus: Inducing fine-grained evaluation capability in language models. arXiv:2310.08491. Retrieved from https:\/\/arxiv.org\/abs\/2310.08491"},{"key":"e_1_3_2_22_2","first-page":"18319","volume-title":"International Conference on Machine Learning","author":"Lai Yuhang","year":"2023","unstructured":"Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning. PMLR, 18319\u201318345."},{"key":"e_1_3_2_23_2","unstructured":"Leandro von Werra Younes Belkada Lewis Tunstall Edward Beeching Tristan Thrush Nathan Lambert and Shengyi Huang. 2020. TRL: Transformer Reinforcement Learning. Retrieved from https:\/\/huggingface.co\/docs\/trl\/en\/index"},{"key":"e_1_3_2_24_2","unstructured":"Harrison Lee Samrat Phatale Hassan Mansoor Kellie Lu Thomas Mesnard Colton Bishop Victor Carbune and Abhinav Rastogi. 2023. RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback. arXiv:2309.00267. Retrieved from https:\/\/arxiv.org\/abs\/2309.00267"},{"key":"e_1_3_2_25_2","unstructured":"Xuechen Li Tianyi Zhang Yann Dubois Rohan Taori Ishaan Gulrajani Carlos Guestrin Percy Liang and Tatsunori B. Hashimoto. 2023. AlpacaEval: An automatic evaluator of instruction-following models. 
Retrieved from https:\/\/github.com\/tatsu-lab\/alpaca_eval"},{"key":"e_1_3_2_26_2","unstructured":"Jiawei Liu Chunqiu Steven Xia Yuyao Wang and Lingming Zhang. 2023. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. arXiv:2305.01210. Retrieved from https:\/\/arxiv.org\/abs\/2305.01210"},{"issue":"5","key":"e_1_3_2_27_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3643674","article-title":"Refining ChatGPT-generated code: Characterizing and mitigating code quality issues","volume":"33","author":"Liu Yue","year":"2023","unstructured":"Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. 2023. Refining ChatGPT-generated code: Characterizing and mitigating code quality issues. ACM Transactions on Software Engineering and Methodology 33, 5 (2023), 1\u201326.","journal-title":"ACM Transactions on Software Engineering and Methodology"},{"key":"e_1_3_2_28_2","unstructured":"Shuai Lu Daya Guo Shuo Ren Junjie Huang Alexey Svyatkovskiy Ambrosio Blanco Colin Clement Dawn Drain Daxin Jiang Duyu Tang et al. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv:2102.04664. Retrieved from https:\/\/arxiv.org\/abs\/2102.04664"},{"key":"e_1_3_2_29_2","unstructured":"Ziyang Luo Can Xu Pu Zhao Qingfeng Sun Xiubo Geng Wenxiang Hu Chongyang Tao Jing Ma Qingwei Lin and Daxin Jiang. 2023. WizardCoder: Empowering code large language models with Evol-Instruct. arXiv:2306.08568. Retrieved from https:\/\/arxiv.org\/abs\/2306.08568"},{"key":"e_1_3_2_30_2","unstructured":"Niklas Muennighoff Qian Liu Armel Zebaze Qinkai Zheng Binyuan Hui Terry Yue Zhuo Swayam Singh Xiangru Tang Leandro Von Werra and Shayne Longpre. 2023. OctoPack: Instruction tuning code large language models. arXiv:2308.07124. 
Retrieved from https:\/\/arxiv.org\/abs\/2308.07124"},{"key":"e_1_3_2_31_2","unstructured":"Subhabrata Mukherjee Arindam Mitra Ganesh Jawahar Sahaj Agarwal Hamid Palangi and Ahmed Awadallah. 2023. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv:2306.02707. Retrieved from https:\/\/arxiv.org\/abs\/2306.02707"},{"key":"e_1_3_2_32_2","unstructured":"Changan Niu Chuanyi Li Vincent Ng and Bin Luo. 2023. CrossCodeBench: Benchmarking cross-task generalization of source code models. arXiv:2302.04030. Retrieved from https:\/\/arxiv.org\/abs\/2302.04030"},{"key":"e_1_3_2_33_2","unstructured":"Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright Pamela Mishkin Chong Zhang Sandhini Agarwal Katarina Slama Alex Ray et al. 2022. Training language models to follow instructions with human feedback. arXiv:2203.02155. Retrieved from https:\/\/arxiv.org\/abs\/2203.02155"},{"key":"e_1_3_2_34_2","unstructured":"Rangeet Pan Ali Reza Ibrahimzada Rahul Krishna Divya Sankar Lambert Pouguem Wassi Michele Merler Boris Sobolev Raju Pavuluri Saurabh Sinha and Reyhaneh Jabbarvand. 2023. Understanding the effectiveness of large language models in code translation. arXiv:2308.03109. Retrieved from https:\/\/arxiv.org\/abs\/2308.03109"},{"key":"e_1_3_2_35_2","unstructured":"Ruchir Puri David S. Kung Geert Janssen Wei Zhang Giacomo Domeniconi Vladimir Zolotov Julian Dolby Jie Chen Mihir Choudhury Lindsey Decker et al. 2021. CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks. arXiv:2105.12655. Retrieved from https:\/\/arxiv.org\/abs\/2105.12655"},{"key":"e_1_3_2_36_2","unstructured":"Rafael Rafailov Archit Sharma Eric Mitchell Stefano Ermon Christopher D. Manning and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv:2305.18290. 
Retrieved from https:\/\/arxiv.org\/abs\/2305.18290"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3022671.2984041"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/2914770.2837671"},{"key":"e_1_3_2_39_2","unstructured":"Baptiste Roziere Jonas Gehring Fabian Gloeckle Sten Sootla Itai Gat Xiaoqing Ellen Tan Yossi Adi Jingyu Liu Tal Remez J\u00e9r\u00e9my Rapin et al. 2023. Code Llama: Open foundation models for code. arXiv:2308.12950. Retrieved from https:\/\/arxiv.org\/abs\/2308.12950"},{"key":"e_1_3_2_40_2","unstructured":"Oussama Ben Sghaier Lucas Maes and Houari Sahraoui. 2023. Unity is strength: Cross-task knowledge distillation to improve code review generation. arXiv:2309.03362. Retrieved from https:\/\/arxiv.org\/abs\/2309.03362"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/SANER56733.2023.00049"},{"key":"e_1_3_2_42_2","doi-asserted-by":"crossref","unstructured":"Oussama Ben Sghaier and Houari Sahraoui. 2024. Improving the learning of code review successive tasks with cross-task knowledge distillation. arXiv:2402.02063. Retrieved from https:\/\/arxiv.org\/abs\/2402.02063","DOI":"10.1145\/3643775"},{"key":"e_1_3_2_43_2","unstructured":"Mohammed Latif Siddiq Beatrice Casey and Joanna Santos. 2023. A lightweight framework for high-quality code generation. arXiv:2307.08220. Retrieved from https:\/\/arxiv.org\/abs\/2307.08220"},{"key":"e_1_3_2_44_2","unstructured":"Andr\u00e9 Silva Sen Fang and Martin Monperrus. 2023. RepairLLaMA: Efficient representations and fine-tuned adapters for program repair. arXiv:2312.15698. Retrieved from https:\/\/arxiv.org\/abs\/2312.15698"},{"key":"e_1_3_2_45_2","unstructured":"Manav Singhal Tushar Aggarwal Abhijeet Awasthi Nagarajan Natarajan and Aditya Kanade. 2024. NoFunEval: Funny how code LMs falter on requirements beyond functional correctness. arXiv:2401.15963. 
Retrieved from https:\/\/arxiv.org\/abs\/2401.15963"},{"key":"e_1_3_2_46_2","unstructured":"Zhiqing Sun Yikang Shen Qinhong Zhou Hongxin Zhang Zhenfang Chen David Cox Yiming Yang and Chuang Gan. 2023. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv:2305.03047. Retrieved from https:\/\/arxiv.org\/abs\/2305.03047"},{"issue":"6","key":"e_1_3_2_47_2","first-page":"7","article-title":"Alpaca: A strong, replicable instruction-following model","volume":"3","author":"Taori Rohan","year":"2023","unstructured":"Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models 3, 6 (2023), 7. Retrieved from https:\/\/crfm.stanford.edu\/2023\/03\/13\/alpaca.html","journal-title":"Stanford Center for Research on Foundation Models"},{"key":"e_1_3_2_48_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. Retrieved from https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_3_2_49_2","unstructured":"Lewis Tunstall Edward Beeching Nathan Lambert Nazneen Rajani Kashif Rasul Younes Belkada Shengyi Huang Leandro von Werra Cl\u00e9mentine Fourrier Nathan Habib et al. 2023. Zephyr: Direct distillation of LM alignment. arXiv:2310.16944. Retrieved from https:\/\/arxiv.org\/abs\/2310.16944"},{"key":"e_1_3_2_50_2","unstructured":"Shiqi Wang Zheng Li Haifeng Qian Chenghao Yang Zijian Wang Mingyue Shang Varun Kumar Samson Tan Baishakhi Ray Parminder Bhatia et al. 2022. ReCode: Robustness evaluation of code generation models. arXiv:2212.10264. 
Retrieved from https:\/\/arxiv.org\/abs\/2212.10264"},{"key":"e_1_3_2_51_2","unstructured":"Tianlu Wang Ping Yu Xiaoqing Ellen Tan Sean O\u2019Brien Ramakanth Pasunuru Jane Dwivedi-Yu Olga Golovneva Luke Zettlemoyer Maryam Fazel-Zarandi and Asli Celikyilmaz. 2023. Shepherd: A critic for language model generation. arXiv:2308.04592. Retrieved from https:\/\/arxiv.org\/abs\/2308.04592"},{"key":"e_1_3_2_52_2","unstructured":"Yizhong Wang Yeganeh Kordi Swaroop Mishra Alisa Liu Noah A. Smith Daniel Khashabi and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language model with self generated instructions. arXiv:2212.10560. Retrieved from https:\/\/arxiv.org\/abs\/2212.10560"},{"key":"e_1_3_2_53_2","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, Vol. 35, 24824\u201324837.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_54_2","unstructured":"Yuxiang Wei Zhe Wang Jiawei Liu Yifeng Ding and Lingming Zhang. 2023. Magicoder: Empowering code generation with OSS-instruct. arXiv:2312.02120. Retrieved from https:\/\/arxiv.org\/abs\/2312.02120"},{"key":"e_1_3_2_55_2","unstructured":"Martin Weyssow Xin Zhou Kisub Kim David Lo and Houari Sahraoui. 2023. Exploring parameter-efficient fine-tuning techniques for code generation with large language models. arXiv:2308.10462. Retrieved from https:\/\/arxiv.org\/abs\/2308.10462"},{"key":"e_1_3_2_56_2","unstructured":"Can Xu Qingfeng Sun Kai Zheng Xiubo Geng Pu Zhao Jiazhan Feng Chongyang Tao and Daxin Jiang. 2023. WizardLM: Empowering large language models to follow complex instructions. arXiv:2304.12244. 
Retrieved from https:\/\/arxiv.org\/abs\/2304.12244"},{"key":"e_1_3_2_57_2","unstructured":"Xiaohan Xu Ming Li Chongyang Tao Tao Shen Reynold Cheng Jinyang Li Can Xu Dacheng Tao and Tianyi Zhou. 2024. A survey on knowledge distillation of large language models. arXiv:2402.13116. Retrieved from https:\/\/arxiv.org\/abs\/2402.13116"},{"key":"e_1_3_2_58_2","unstructured":"Zhou Yang Zhensu Sun Terry Yue Zhuo Premkumar Devanbu and David Lo. 2024. Robustness, security, privacy, explainability, efficiency, and usability of large language models for code. arXiv:2403.07506. Retrieved from https:\/\/arxiv.org\/abs\/2403.07506"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jss.2020.110825"},{"key":"e_1_3_2_60_2","unstructured":"Burak Yeti\u015ftiren I\u015f\u0131k \u00d6zsoy Miray Ayerdem and Eray T\u00fcz\u00fcn. 2023. Evaluating the code quality of AI-assisted code generation tools: An empirical study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. arXiv:2304.10778. Retrieved from https:\/\/arxiv.org\/abs\/2304.10778"},{"key":"e_1_3_2_61_2","first-page":"1","volume-title":"46th IEEE\/ACM International Conference on Software Engineering","author":"Yu Hao","year":"2024","unstructured":"Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. CoderEval: A benchmark of pragmatic code generation with generative pre-trained models. In 46th IEEE\/ACM International Conference on Software Engineering, 1\u201312."},{"key":"e_1_3_2_62_2","unstructured":"Lianmin Zheng Wei Lin Chiang Ying Sheng Siyuan Zhuang Zhanghao Wu Yonghao Zhuang Zi Lin Zhuohan Li Dacheng Li Eric Xing et al. 2023. Judging LLM-as-a-judge with MT-bench and Chatbot arena. arXiv:2306.05685. Retrieved from https:\/\/arxiv.org\/abs\/2306.05685"},{"key":"e_1_3_2_63_2","doi-asserted-by":"crossref","unstructured":"Qinkai Zheng Xiao Xia Xu Zou Yuxiao Dong Shan Wang Yufei Xue Zihan Wang Lei Shen Andi Wang Yang Li et al. 2023. 
CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. arXiv:2303.17568. Retrieved from https:\/\/arxiv.org\/abs\/2303.17568","DOI":"10.1145\/3580305.3599790"},{"key":"e_1_3_2_64_2","volume-title":"the 11th International Conference on Learning Representations","author":"Zhou Shuyan","year":"2022","unstructured":"Shuyan Zhou, Uri Alon, Frank F. Xu, Zhengbao Jiang, and Graham Neubig. 2022. DocPrompting: Generating code by retrieving the docs. In the 11th International Conference on Learning Representations."},{"key":"e_1_3_2_65_2","unstructured":"Xin Zhou Kisub Kim Bowen Xu DongGyun Han Junda He and David Lo. 2023. Generation-based code review automation: How far are we? arXiv:2303.07221. Retrieved from https:\/\/arxiv.org\/abs\/2303.07221"},{"key":"e_1_3_2_66_2","unstructured":"Ming Zhu Aneesh Jain Karthik Suresh Roshan Ravindran Sindhu Tipirneni and Chandan K. Reddy. 2022. XLCoST: A benchmark dataset for cross-lingual code intelligence. arXiv:2206.08474. Retrieved from https:\/\/arxiv.org\/abs\/2206.08474"},{"key":"e_1_3_2_67_2","first-page":"2232","volume-title":"Findings of the Association for Computational Linguistics: EACL 2024","author":"Yue Zhuo Terry","year":"2024","unstructured":"Terry Yue Zhuo. 2024. ICE-Score: Instructing large language models to evaluate code. 
In Findings of the Association for Computational Linguistics: EACL 2024, 2232\u20132242."}],"container-title":["ACM Transactions on Software Engineering and Methodology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3736407","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,13]],"date-time":"2026-02-13T14:36:35Z","timestamp":1770993395000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3736407"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,13]]},"references-count":66,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,3,31]]}},"alternative-id":["10.1145\/3736407"],"URL":"https:\/\/doi.org\/10.1145\/3736407","relation":{},"ISSN":["1049-331X","1557-7392"],"issn-type":[{"value":"1049-331X","type":"print"},{"value":"1557-7392","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,13]]},"assertion":[{"value":"2024-08-06","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-14","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}