{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,17]],"date-time":"2026-03-17T09:16:42Z","timestamp":1773739002827,"version":"3.50.1"},"reference-count":66,"publisher":"Association for Computing Machinery (ACM)","issue":"POPL","license":[{"start":{"date-parts":[[2025,1,7]],"date-time":"2025-01-07T00:00:00Z","timestamp":1736208000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001381","name":"National Research Foundation Singapore","doi-asserted-by":"crossref","award":["AISG3-RP-2022-030"],"award-info":[{"award-number":["AISG3-RP-2022-030"]}],"id":[{"id":"10.13039\/501100001381","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Program. Lang."],"published-print":{"date-parts":[[2025,1,7]]},"abstract":"<jats:p>Recently, the rise of code-centric Large Language Models (LLMs) has reshaped the software engineering world with low-barrier tools like Copilot that can easily generate code. However, there is no correctness guarantee for the code generated by LLMs, which suffer from the hallucination problem, and their output is fraught with risks. Besides, the end-to-end process from specification to code through LLMs is a non-transparent and uncontrolled black box. This opacity makes it difficult for users to understand and trust the generated code. Addressing these challenges is both necessary and critical. In contrast, program refinement transforms high-level specification statements into executable code while preserving correctness. Traditional tools for program refinement are primarily designed for formal methods experts and lack automation and extensibility. We apply program refinement to guide LLM and validate the LLM-generated code while transforming refinement into a more accessible and flexible framework. 
To initiate this vision, we propose Refine4LLM, an approach that aims to: (1) Formally refine the specifications, (2) Automatically prompt and guide the LLM using refinement calculus, (3) Interact with the LLM to generate the code, (4) Verify that the generated code satisfies the constraints, thus guaranteeing its correctness, and (5) Learn and build more advanced refinement laws to extend the refinement calculus. We evaluated Refine4LLM against state-of-the-art baselines on program refinement and LLM benchmarks. The experimental results show that Refine4LLM can efficiently generate more robust code and reduce the time for refinement and verification.<\/jats:p>","DOI":"10.1145\/3704905","type":"journal-article","created":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T05:48:42Z","timestamp":1736401722000},"page":"2057-2089","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["Automated Program Refinement: Guide and Verify Code Large Language Model with Refinement Calculus"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-7579-0824","authenticated-orcid":false,"given":"Yufan","family":"Cai","sequence":"first","affiliation":[{"name":"Ningbo University, Ningbo, China"},{"name":"National University of Singapore, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7164-0580","authenticated-orcid":false,"given":"Zhe","family":"Hou","sequence":"additional","affiliation":[{"name":"Griffith University, Brisbane, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2755-3089","authenticated-orcid":false,"given":"David","family":"Sanan","sequence":"additional","affiliation":[{"name":"Singapore Institute of Technology, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5878-6486","authenticated-orcid":false,"given":"Xiaokun","family":"Luan","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8255-0118","authenticated-orcid":false,"given":"Yun","family":"Lin","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3545-1392","authenticated-orcid":false,"given":"Jun","family":"Sun","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6512-8326","authenticated-orcid":false,"given":"Jin Song","family":"Dong","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]}],"member":"320","published-online":{"date-parts":[[2025,1,9]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.761"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.scico.2017.04.003"},{"key":"e_1_3_2_4_2","unstructured":"Anonymous. 2024. LLM aided Program Refinement. https:\/\/sites.google.com\/view\/refine4llm."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4612-1674-2"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF01888227"},{"issue":"11","key":"e_1_3_2_7_2","first-page":"17","article-title":"The Coq proof assistant reference manual","volume":"6","author":"Barras Bruno","year":"1999","unstructured":"Bruno Barras, Samuel Boutin, Cristina Cornes, Judica\u00ebl Courant, Yann Coscoy, David Delahaye, Daniel de Rauglaudre, Jean-Christophe Filli\u00e2tre, Eduardo Gim\u00e9nez, Hugo Herbelin, et al. 1999. The Coq proof assistant reference manual. INRIA, version 6, 11 (1999), 17\u201321.","journal-title":"INRIA, version"},{"key":"e_1_3_2_8_2","unstructured":"Loubna Ben Allal Niklas Muennighoff Logesh Kumar Umapathi Ben Lipkin and Leandro von Werra. 2022. A framework for the evaluation of code generation models. 
https:\/\/github.com\/bigcode-project\/bigcode-evaluation-harness."},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i16.29720"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","unstructured":"Luca Beurer-Kellner Marc Fischer and Martin Vechev. 2023. Prompting Is Programming: A Query Language for Large Language Models. Proc. ACM Program. Lang. 7 PLDI Article 186 (June 2023) 24 pages. https:\/\/doi.org\/10.1145\/3591300","DOI":"10.1145\/3591300"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3591335.3591343"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3571234"},{"key":"e_1_3_2_13_2","first-page":"427","volume-title":"Verification, Model Checking, and Abstract Interpretation","author":"Bradley Aaron R","year":"2006","unstructured":"Aaron R. Bradley, Zohar Manna, and Henny B. Sipma. 2006. What's Decidable About Arrays?. In Verification, Model Checking, and Abstract Interpretation, E. Allen Emerson and Kedar S. Namjoshi (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 427\u2013442."},{"key":"e_1_3_2_14_2","doi-asserted-by":"crossref","unstructured":"Michael Butler and Thomas L\u00e5ngbacka. 1996. Program derivation using the refinement calculator. In Theorem Proving in Higher Order Logics Gerhard Goos Juris Hartmanis Jan van Leeuwen Joakim von Wright Jim Grundy and John Harrison (Eds.). Springer Berlin Heidelberg Berlin Heidelberg 93\u2013108.","DOI":"10.1007\/BFb0105399"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1007\/s001650050006"},{"key":"e_1_3_2_16_2","unstructured":"Yiannis Charalambous Norbert Tihanyi Ridhi Jain Youcheng Sun Mohamed Amine Ferrag and Lucas C. Cordeiro. 2023. A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal Verification. arXiv:2305.14752 [cs.SE]"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3360567"},{"key":"e_1_3_2_18_2","unstructured":"Mark Chen et al. 
2021. Evaluating Large Language Models Trained on Code. (2021). arXiv:2107.03374 [cs.LG]"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-53291-8_30"},{"key":"e_1_3_2_20_2","unstructured":"Koen Claessen Reiner H\u00e4hnle and Johan M\u00e5rtensson. 2002. Verification of hardware systems with first-order logic. In Proceedings of the CADE-18 Workshop-Problem and Problem Sets for ATP. Citeseer."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10817-018-9458-4"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/2676726.2677006"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/1066100.1066102"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/360933.360975"},{"key":"e_1_3_2_25_2","unstructured":"Edsger W. Dijkstra. 1976. A Discipline of Programming. Prentice-Hall, Englewood Cliffs."},{"key":"e_1_3_2_26_2","doi-asserted-by":"crossref","unstructured":"Hantian Ding Varun Kumar Yuchen Tian Zijian Wang Rob Kwiatkowski Xiaopeng Li Murali Krishna Ramanathan Baishakhi Ray Parminder Bhatia Sudipta Sengupta Dan Roth and Bing Xiang. 2023. A Static Evaluation of Code Completion by Large Language Models. arXiv:2306.03203 [cs.CL]","DOI":"10.18653\/v1\/2023.acl-industry.34"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10009-020-00561-4"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3453483.3454080"},{"key":"e_1_3_2_29_2","unstructured":"Maxim Enis and Mark Hopkins. 2024. From LLM to NMT: Advancing Low-Resource Machine Translation with Claude. arXiv:2404.13813 [cs.CL]"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-94-011-1793-7_4"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-43520-2_11"},{"key":"e_1_3_2_32_2","unstructured":"GitHub. 2023. GitHub Copilot. 
https:\/\/github.com\/features\/copilot"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-49812-6_1"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/363235.363259"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i19.30108"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","unstructured":"Susmit Jha Sumit Kumar Jha Patrick Lincoln Nathaniel D. Bastian Alvaro Velasquez and Sandeep Neema. 2023. Dehallucinating Large Language Models Using Formal Methods Guided Iterative Prompting. In 2023 IEEE International Conference on Assured Autonomy (ICAA). 149\u2013152. https:\/\/doi.org\/10.1109\/ICAA58325.2023.00029","DOI":"10.1109\/ICAA58325.2023.00029"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3613904.3642596"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3634713.3634714"},{"key":"e_1_3_2_39_2","unstructured":"Jiawei Liu Chunqiu Steven Xia Yuyao Wang and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Thirty-seventh Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=1qvx610Cu7"},{"key":"e_1_3_2_40_2","unstructured":"Ariana Martino Michael Iannelli and Coleen Truong. 2023. Knowledge Injection to Counter Large Language Model (LLM) Hallucination. In The Semantic Web: ESWC 2023 Satellite Events Catia Pesquita Hala Skaf-Molli Vasilis Efthymiou Sabrina Kirrane Axel Ngonga Diego Collarana Renato Cerqueira Mehwish Alam Cassia Trojahn and Sven Hertling (Eds.). Springer Nature Switzerland Cham 182\u2013185."},{"key":"e_1_3_2_41_2","unstructured":"Marcus J. Min Yangruibo Ding Luca Buratti Saurabh Pujar Gail Kaiser Suman Jana and Baishakhi Ray. 2024. Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain. 
In The Twelfth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=caW7LdAALh"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.5555\/95423"},{"key":"e_1_3_2_43_2","unstructured":"C. Morgan K. Robinson and P. Gardiner. 1988. On the Refinement Calculus. Technical Monograph. https:\/\/web.comlab.ox.ac.uk\/files\/3391\/PRG70.pdf"},{"key":"e_1_3_2_44_2","unstructured":"OpenAI. [n. d.]. Introducing GPTs. https:\/\/openai.com\/blog\/introducing-gpts."},{"key":"e_1_3_2_45_2","unstructured":"OpenAI and co. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]"},{"key":"e_1_3_2_46_2","unstructured":"Ruchika Pandey Prabhat Singh Raymond Wei and Shaila Shankar. 2024. Transforming Software Development: Evaluating the Efficiency and Challenges of GitHub Copilot in Real-World Projects. arXiv:2406.17910 https:\/\/arxiv.org\/abs\/2406.17910 [cs.SE]"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","unstructured":"Nadia Polikarpova Ivan Kuraj and Armando Solar-Lezama. 2016. Program Synthesis from Polymorphic Refinement Types. SIGPLAN Not. 51 6 (June 2016) 522\u2013538. https:\/\/doi.org\/10.1145\/2980983.2908093","DOI":"10.1145\/2980983.2908093"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2004.840306"},{"key":"e_1_3_2_49_2","unstructured":"Bernhard Reus. 2016. The WHILE-Language. Springer International Publishing Cham 29\u201345. https:\/\/doi.org\/10.1007\/978-3-319-27889-6_3"},{"key":"e_1_3_2_50_2","unstructured":"Bernardino Romera-Paredes Mohammadamin Barekatain Alexander Novikov Matej Balog M Pawan Kumar Emilien Dupont Francisco JR Ruiz Jordan S Ellenberg Pengming Wang Omar Fawzi et al. 2023. Mathematical discoveries from program search with large language models. 
Nature (2023) 1\u20133."},{"key":"e_1_3_2_51_2","unstructured":"Baptiste Rozi\u00e8re Jonas Gehring Fabian Gloeckle Sten Sootla Itai Gat Xiaoqing Ellen Tan Yossi Adi Jingyu Liu Tal Remez J\u00e9r\u00e9my Rapin Artyom Kozhevnikov Ivan Evtimov Joanna Bitton Manish Bhatt Cristian Canton Ferrer Aaron Grattafiori Wenhan Xiong Alexandre D\u00e9fossez Jade Copet Faisal Azhar Hugo Touvron Louis Martin Nicolas Usunier Thomas Scialom and Gabriel Synnaeve. 2023. Code Llama: Open Foundation Models for Code. arXiv:2308.12950 [cs.CL]"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-16722-6_2"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-32409-4_19"},{"key":"e_1_3_2_54_2","unstructured":"Armando Solar-Lezama. 2018. Lecture 10: Introduction to Functional Synthesis. Available at https:\/\/people.csail.mit.edu\/asolar\/SynthesisCourse\/Lecture10.htm."},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","unstructured":"Armando Solar-Lezama Christopher Grant Jones and Rastislav Bod\u00edk. 2008. Sketching concurrent data structures. In Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation Tucson AZ USA June 7-13 2008. 136\u2013148. https:\/\/doi.org\/10.1145\/1375581.1375599","DOI":"10.1145\/1375581.1375599"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","unstructured":"Saurabh Srivastava Sumit Gulwani and Jeffrey S. Foster. 2010. From program verification to program synthesis. In Proceedings of the 37th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages POPL 2010 Madrid Spain January 17-23 2010. 313\u2013326. https:\/\/doi.org\/10.1145\/1706299.1706337","DOI":"10.1145\/1706299.1706337"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-29604-3_3"},{"key":"e_1_3_2_58_2","doi-asserted-by":"crossref","unstructured":"Priyan Vaithilingam Tianyi Zhang and Elena L Glassman. 2022. Expectation vs. 
experience: Evaluating the usability of code generation tools powered by large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1\u20137.","DOI":"10.1145\/3491101.3519665"},{"key":"e_1_3_2_59_2","doi-asserted-by":"crossref","unstructured":"Yue Wang Hung Le Akhilesh Deepak Gotmare Nghi D. Q. Bui Junnan Li and Steven C. H. Hoi. 2023. CodeT5+: Open Code Large Language Models for Code Understanding and Generation. arXiv:2305.07922 [cs.CL]","DOI":"10.18653\/v1\/2023.emnlp-main.68"},{"key":"e_1_3_2_60_2","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824\u201324837.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_61_2","unstructured":"Ziwei Xu Sanjay Jain and Mohan Kankanhalli. 2024. Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv:2401.11817 https:\/\/arxiv.org\/abs\/2401.11817 [cs.CL]"},{"key":"e_1_3_2_62_2","first-page":"29669","article-title":"Program synthesis guided reinforcement learning for partially observed environments","volume":"34","author":"Yang Yichen","year":"2021","unstructured":"Yichen Yang, Jeevana Priya Inala, Osbert Bastani, Yewen Pu, Armando Solar-Lezama, and Martin Rinard. 2021. Program synthesis guided reinforcement learning for partially observed environments. 
Advances in neural information processing systems 34 (2021), 29669\u201329683.","journal-title":"Advances in neural information processing systems"},{"key":"e_1_3_2_63_2","article-title":"Tree of Thoughts: Deliberate Problem Solving with Large Language Models","volume":"36","author":"Yao Shunyu","year":"2023","unstructured":"Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in Neural Information Processing Systems 36 (2023).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_64_2","unstructured":"Shengyu Zhang Linfeng Dong Xiaoya Li Sen Zhang Xiaofei Sun Shuhe Wang Jiwei Li Runyi Hu Tianwei Zhang Fei Wu and Guoyin Wang. 2024. Instruction Tuning for Large Language Models: A Survey. arXiv:2308.10792 https:\/\/arxiv.org\/abs\/2308.10792 [cs.CL]"},{"key":"e_1_3_2_65_2","unstructured":"Yifan Zhang Jingqin Yang Yang Yuan and Andrew Chi-Chih Yao. 2023. Cumulative Reasoning With Large Language Models. arXiv preprint arXiv:2308.04371 (2023)."},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.320"},{"key":"e_1_3_2_67_2","unstructured":"Wayne Xin Zhao Kun Zhou Junyi Li Tianyi Tang Xiaolei Wang Yupeng Hou Yingqian Min Beichen Zhang Junjie Zhang Zican Dong Yifan Du Chen Yang Yushuo Chen Zhipeng Chen Jinhao Jiang Ruiyang Ren Yifan Li Xinyu Tang Zikang Liu Peiyu Liu Jian-Yun Nie and Ji-Rong Wen. 2023. A Survey of Large Language Models. 
arXiv:2303.18223 [cs.CL]"}],"container-title":["Proceedings of the ACM on Programming Languages"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3704905","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3704905","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T10:17:51Z","timestamp":1770200271000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3704905"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,7]]},"references-count":66,"journal-issue":{"issue":"POPL","published-print":{"date-parts":[[2025,1,7]]}},"alternative-id":["10.1145\/3704905"],"URL":"https:\/\/doi.org\/10.1145\/3704905","relation":{},"ISSN":["2475-1421"],"issn-type":[{"value":"2475-1421","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,7]]},"assertion":[{"value":"2024-07-11","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-07","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-01-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}