{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T02:13:49Z","timestamp":1775873629978,"version":"3.50.1"},"reference-count":45,"publisher":"Association for Computing Machinery (ACM)","issue":"ISSTA","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Softw. Eng."],"published-print":{"date-parts":[[2025,6,22]]},"abstract":"<jats:p>Recent advancements in large language models (LLMs) have significantly improved code generation, which generates code snippets automatically based on natural language requirements. Despite achieving state-of-the-art performance, LLMs often struggle to generate accurate and reliable code, requiring developers to spend substantial effort debugging and evaluating the generated output. Researchers have proposed leveraging Consistency to select code that passes more tests (inter-consistency) and demonstrates consistent behavior across more counterparts (intra-consistency). However, since the tests themselves are also generated by LLMs, relying on majority voting based on incorrect tests leads to unreliable results. To address this, we propose a lightweight interaction framework that incorporates user feedback to effectively guide consistency. Our results demonstrate that, with minimal human effort, performance can be significantly improved. In each iteration, we introduce a rank-correct-fix co-evolution process between code and tests. This process iteratively enhances the quality of both, making the consistency voting between code and tests more reliable. We evaluate ConTested through extensive experiments, demonstrating its effectiveness across multiple LLMs, including GPT-3.5 and GPT-4o. Our results show improvements of 32.9% over GPT-3.5 and 16.97% over GPT-4o. Additionally, ConTested achieves an 11.1% improvement over the SOTA post-processing technique, MPSC. 
This improvement is achieved with only a 4-round interaction with users, requiring minimal user effort. A user study further confirms the feasibility and cost-effectiveness of ConTested, highlighting its ability to enhance code generation without introducing substantial overhead.<\/jats:p>","DOI":"10.1145\/3728902","type":"journal-article","created":{"date-parts":[[2025,6,22]],"date-time":"2025-06-22T10:53:21Z","timestamp":1750589601000},"page":"596-617","source":"Crossref","is-referenced-by-count":5,"title":["ConTested: Consistency-Aided Tested Code Generation with LLM"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-0416-6896","authenticated-orcid":false,"given":"Jinhao","family":"Dong","sequence":"first","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3545-1392","authenticated-orcid":false,"given":"Jun","family":"Sun","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2669-1837","authenticated-orcid":false,"given":"Wenjie","family":"Zhang","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6512-8326","authenticated-orcid":false,"given":"Jin Song","family":"Dong","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8295-303X","authenticated-orcid":false,"given":"Dan","family":"Hao","sequence":"additional","affiliation":[{"name":"Peking University, Shenzhen, China"}]}],"member":"320","published-online":{"date-parts":[[2025,6,22]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2024. Replication package. 
https:\/\/github.com\/DJjjjhao\/replication_package"},{"key":"e_1_2_1_2_1","volume-title":"Diogo Almeida, Janko Altenschmidt, Sam Altman, and Shyamal Anadkat.","author":"Achiam Josh","year":"2023","unstructured":"Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, and Shyamal Anadkat. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774."},{"key":"e_1_2_1_3_1","volume-title":"Meet Claude. https:\/\/www.anthropic.com\/claude","unstructured":"2024. Meet Claude. https:\/\/www.anthropic.com\/claude"},{"key":"e_1_2_1_4_1","unstructured":"Jacob Austin Augustus Odena Maxwell Nye Maarten Bosma Henryk Michalewski David Dohan Ellen Jiang Carrie Cai Michael Terry and Quoc Le. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732."},{"key":"e_1_2_1_5_1","volume-title":"CodeT: Code Generation with Generated Tests. In The Eleventh International Conference on Learning Representations, ICLR 2023","author":"Chen Bei","year":"2023","unstructured":"Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2023. CodeT: Code Generation with Generated Tests. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https:\/\/openreview.net\/forum?id=ktrw68Cmu9c"},{"key":"e_1_2_1_6_1","volume-title":"32nd European Conference on Object-Oriented Programming (ECOOP","author":"Chen Junjie","year":"2018","unstructured":"Junjie Chen, Wenxiang Hu, Lingming Zhang, Dan Hao, Sarfraz Khurshid, and Lu Zhang. 2018. Learning to accelerate symbolic execution via code transformation. In 32nd European Conference on Object-Oriented Programming (ECOOP 2018). 
6\u20131."},{"key":"e_1_2_1_7_1","volume-title":"Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, and Greg Brockman.","author":"Chen Mark","year":"2021","unstructured":"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, and Greg Brockman. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374."},{"key":"e_1_2_1_8_1","volume-title":"Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30","author":"Christiano Paul F","year":"2017","unstructured":"Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30 (2017)."},{"key":"e_1_2_1_9_1","unstructured":"Karl Cobbe Vineet Kosaraju Mohammad Bavarian Mark Chen Heewoo Jun Lukasz Kaiser Matthias Plappert Jerry Tworek Jacob Hilton and Reiichiro Nakano. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3672459"},{"key":"e_1_2_1_11_1","volume-title":"Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360.","author":"Du Zhengxiao","year":"2021","unstructured":"Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360."},{"key":"e_1_2_1_12_1","volume-title":"Incoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999.","author":"Fried Daniel","year":"2022","unstructured":"Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. Incoder: A generative model for code infilling and synthesis. 
arXiv preprint arXiv:2204.05999."},{"key":"e_1_2_1_13_1","unstructured":"Daya Guo Qihao Zhu Dejian Yang Zhenda Xie Kai Dong Wentao Zhang Guanting Chen Xiao Bi Yu Wu and YK Li. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming\u2013The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2024.ACL-LONG.78"},{"key":"e_1_2_1_15_1","unstructured":"Binyuan Hui Jian Yang Zeyu Cui Jiaxi Yang Dayiheng Liu Lei Zhang Tianyu Liu Jiajun Zhang Bowen Yu and Kai Dang. 2024. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE48619.2023.00194"},{"key":"e_1_2_1_17_1","unstructured":"Nate Kushman and Regina Barzilay. 2013. Using semantic unification to generate regular expressions from natural language."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3597207"},{"key":"e_1_2_1_19_1","volume-title":"Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, and Jenny Chim.","author":"Li Raymond","year":"2023","unstructured":"Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, and Jenny Chim. 2023. Starcoder: may the source be with you!. arXiv preprint arXiv:2305.06161."},{"key":"e_1_2_1_20_1","volume-title":"Yuyao Wang, and Lingming Zhang.","author":"Liu Jiawei","year":"2024","unstructured":"Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36 (2024)."},{"key":"e_1_2_1_21_1","volume-title":"Wizardcoder: Empowering code large language models with evol-instruct. 
arXiv preprint arXiv:2306.08568.","author":"Luo Ziyang","year":"2023","unstructured":"Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568."},{"key":"e_1_2_1_22_1","volume-title":"Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36","author":"Madaan Aman","year":"2024","unstructured":"Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, and Yiming Yang. 2024. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36 (2024)."},{"key":"e_1_2_1_23_1","volume-title":"International Conference on Machine Learning. 26106\u201326128","author":"Ni Ansong","year":"2023","unstructured":"Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. 2023. Lever: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning. 26106\u201326128."},{"key":"e_1_2_1_24_1","volume-title":"Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474.","author":"Nijkamp Erik","year":"2022","unstructured":"Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474."},{"key":"e_1_2_1_25_1","unstructured":"Don Norman. 2013. The design of everyday things: Revised and expanded edition. Basic books."},{"key":"e_1_2_1_26_1","unstructured":"2023. GPT-3.5 Turbo. https:\/\/platform.openai.com\/docs\/models\/gpt-3-5-turbo"},{"key":"e_1_2_1_27_1","unstructured":"2023. Introducing GPT-4o and more tools to ChatGPT free users. 
https:\/\/openai.com\/index\/gpt-4o-and-more-tools-to-chatgpt-free\/"},{"key":"e_1_2_1_28_1","unstructured":"2024. Introducing OpenAI o1-preview. https:\/\/openai.com\/index\/introducing-openai-o1-preview\/"},{"key":"e_1_2_1_29_1","doi-asserted-by":"crossref","unstructured":"Maxim Rabinovich Mitchell Stern and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. arXiv preprint arXiv:1704.07535.","DOI":"10.18653\/v1\/P17-1105"},{"key":"e_1_2_1_30_1","volume-title":"Yossi Adi, Jingyu Liu, Romain Sauvestre, and Tal Remez.","author":"Roziere Baptiste","year":"2023","unstructured":"Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, and Tal Remez. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950."},{"key":"e_1_2_1_31_1","doi-asserted-by":"crossref","unstructured":"Freda Shi Daniel Fried Marjan Ghazvininejad Luke Zettlemoyer and Sida I Wang. 2022. Natural language to code translation with execution. arXiv preprint arXiv:2204.11454.","DOI":"10.18653\/v1\/2022.emnlp-main.231"},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems. 8634\u20138652","author":"Shinn Noah","year":"2023","unstructured":"Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems. 8634\u20138652."},{"key":"e_1_2_1_33_1","volume-title":"Advancing the rationality debate. Behavioral and brain sciences, 23, 5","author":"Stanovich Keith E","year":"2000","unstructured":"Keith E Stanovich and Richard F West. 2000. Advancing the rationality debate. Behavioral and brain sciences, 23, 5 (2000), 701\u2013717."},{"key":"e_1_2_1_34_1","unstructured":"Zhiqing Sun Xuezhi Wang Yi Tay Yiming Yang and Denny Zhou. 
2022. Recitation-augmented language models. arXiv preprint arXiv:2210.01296."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6430"},{"key":"e_1_2_1_36_1","volume-title":"Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, and Yu Du.","author":"Thoppilan Romal","year":"2022","unstructured":"Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, and Yu Du. 2022. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239."},{"key":"e_1_2_1_37_1","unstructured":"A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems."},{"key":"e_1_2_1_38_1","doi-asserted-by":"crossref","unstructured":"Han Wang Archiki Prasad Elias Stengel-Eskin and Mohit Bansal. 2024. Soft Self-Consistency Improves Language Model Agents. arXiv preprint arXiv:2402.13212.","DOI":"10.18653\/v1\/2024.acl-short.28"},{"key":"e_1_2_1_39_1","volume-title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023","author":"Wang Xuezhi","year":"2023","unstructured":"Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https:\/\/openreview.net\/forum?id=1PL1NIMMrw"},{"key":"e_1_2_1_40_1","first-page":"7572","article-title":"Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate","volume":"2023","author":"Xiong Kai","year":"2023","unstructured":"Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. 2023. 
Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate. In Findings of the Association for Computational Linguistics: EMNLP 2023. 7572\u20137590.","journal-title":"Findings of the Association for Computational Linguistics: EMNLP"},{"key":"e_1_2_1_41_1","unstructured":"Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696."},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 678\u2013687","author":"Zettlemoyer Luke","year":"2007","unstructured":"Luke Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 678\u2013687."},{"key":"e_1_2_1_43_1","unstructured":"Luke S Zettlemoyer and Michael Collins. 2012. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. arXiv preprint arXiv:1207.1420."},{"key":"e_1_2_1_44_1","first-page":"54769","article-title":"Algo: Synthesizing algorithmic programs with generated oracle verifiers","volume":"36","author":"Zhang Kexun","year":"2023","unstructured":"Kexun Zhang, Danqing Wang, Jingtao Xia, William Yang Wang, and Lei Li. 2023. Algo: Synthesizing algorithmic programs with generated oracle verifiers. Advances in Neural Information Processing Systems, 36 (2023), 54769\u201354784.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_45_1","volume-title":"International Conference on Machine Learning. 41832\u201341846","author":"Zhang Tianyi","year":"2023","unstructured":"Tianyi Zhang, Tao Yu, Tatsunori Hashimoto, Mike Lewis, Wen-tau Yih, Daniel Fried, and Sida Wang. 2023. 
Coder reviewer reranking for code generation. In International Conference on Machine Learning. 41832\u201341846."}],"container-title":["Proceedings of the ACM on Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3728902","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,16]],"date-time":"2025-07-16T16:45:34Z","timestamp":1752684334000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3728902"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,22]]},"references-count":45,"journal-issue":{"issue":"ISSTA","published-print":{"date-parts":[[2025,6,22]]}},"alternative-id":["10.1145\/3728902"],"URL":"https:\/\/doi.org\/10.1145\/3728902","relation":{},"ISSN":["2994-970X"],"issn-type":[{"value":"2994-970X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,22]]}}}