{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,20]],"date-time":"2025-06-20T04:08:56Z","timestamp":1750392536075,"version":"3.41.0"},"reference-count":49,"publisher":"Association for Computing Machinery (ACM)","issue":"FSE","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Softw. Eng."],"published-print":{"date-parts":[[2025,6,19]]},"abstract":"<jats:p>Large Language Models (LLMs) were shown to struggle with long-term planning, which may be caused by the limited way in which they explore the space of possible solutions. We propose an architecture where a Reinforcement Learning (RL) Agent guides an LLM's space exploration: (1) the Agent has access to domain-specific information, and can therefore make decisions about the quality of candidate solutions based on specific and relevant metrics, which were not explicitly considered by the LLM's training objective; (2) the LLM can focus on generating immediate next steps, without the need for long-term planning. We allow non-linear reasoning by exploring alternative paths and backtracking. We evaluate this architecture on the program equivalence task, and compare it against Chain of Thought (CoT) and Tree of Thoughts (ToT). We assess both the downstream task, denoting the binary classification, and the intermediate reasoning steps. 
Our approach compares positively against CoT and ToT.<\/jats:p>","DOI":"10.1145\/3715761","type":"journal-article","created":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T15:15:34Z","timestamp":1750346134000},"page":"957-977","source":"Crossref","is-referenced-by-count":0,"title":["Integrating Large Language Models and Reinforcement Learning for Non-linear Reasoning"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1432-3057","authenticated-orcid":false,"given":"Yoav","family":"Alon","sequence":"first","affiliation":[{"name":"University of Bristol, Bristol, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9106-934X","authenticated-orcid":false,"given":"Cristina","family":"David","sequence":"additional","affiliation":[{"name":"University of Bristol, Bristol, United Kingdom"}]}],"member":"320","published-online":{"date-parts":[[2025,6,19]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2024. Anthropic Claude 3. https:\/\/www.anthropic.com\/news\/claude-3-family"},{"key":"e_1_2_1_2_1","unstructured":"2024. Gemini. https:\/\/blog.google\/technology\/ai\/google-gemini-ai\/##sundar-note"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3540250.3549095"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2007.70725"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2303.12712"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","unstructured":"Ethan Caballero . OpenAI and Ilya Sutskever. 2016. Description2Code Dataset. https:\/\/doi.org\/10.5281\/zenodo.5665051 10.5281\/zenodo.5665051","DOI":"10.5281\/zenodo.5665051"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2306.11816"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","unstructured":"Rudrajit Choudhuri Dylan Liu Igor Steinmacher Marco Gerosa and Anita Sarma. 2024. How Far Are We? The Triumphs and Trials of Generative AI in Learning Software Engineering. 
https:\/\/doi.org\/10.1145\/3597503.3639201 10.1145\/3597503.3639201","DOI":"10.1145\/3597503.3639201"},{"key":"e_1_2_1_9_1","volume-title":"Training Verifiers to Solve Math Word Problems. CoRR, abs\/2110.14168","author":"Cobbe Karl","year":"2021","unstructured":"Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems. CoRR, abs\/2110.14168 (2021), arXiv:2110.14168. arxiv:2110.14168"},{"key":"e_1_2_1_10_1","unstructured":"Shihan Dou Junjie Shan Haoxiang Jia Wenhao Deng Zhiheng Xi Wei He Yueming Wu Tao Gui Yang Liu and Xuanjing Huang. 2023. Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey. arxiv:2308.01191."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2405.11514"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1002\/STVR.1472"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2209.00840"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2303.18184"},{"key":"e_1_2_1_15_1","unstructured":"Dave Hulbert. 2023. Using Tree-of-Thought Prompting to boost ChatGPT\u2019s reasoning. https:\/\/github.com\/dave1010\/tree-of-thought-prompting"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10936-7_11"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2310.02104"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2205.00445"},{"key":"e_1_2_1_19_1","volume-title":"Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022","author":"Kojima Takeshi","year":"2022","unstructured":"Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. 
In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (Eds.). http:\/\/papers.nips.cc\/paper_files\/paper\/2022\/hash\/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html"},{"key":"e_1_2_1_20_1","volume-title":"Equivalence of Dataflow Graphs via Rewrite Rules Using a Graph-to-Sequence Neural Model. ArXiv, abs\/2002.06799","author":"Kommrusch Steve","year":"2020","unstructured":"Steve Kommrusch, Th\u00e9o Barollet, and L Pouchet. 2020. Equivalence of Dataflow Graphs via Rewrite Rules Using a Graph-to-Sequence Neural Model. ArXiv, abs\/2002.06799 (2020), null. https:\/\/www.semanticscholar.org\/paper\/8bd94d48eb7a2706391c6cbf11fddf8440099cf9"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/tse.2023.3271065"},{"key":"e_1_2_1_22_1","volume-title":"Proving Equivalence Between Complex Expressions Using Graph-to-Sequence Neural Models. ArXiv, abs\/2106.02452","author":"Kommrusch Steven J","year":"2021","unstructured":"Steven J Kommrusch, Th\u00e9o Barollet, and L Pouchet. 2021. Proving Equivalence Between Complex Expressions Using Graph-to-Sequence Neural Models. ArXiv, abs\/2106.02452 (2021), null. https:\/\/www.semanticscholar.org\/paper\/6437aab9075d2a06d277146db10b4f4135432212"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/0164-1212(90)90094-3"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.4230\/DAGREP.8.4.1"},{"key":"e_1_2_1_25_1","volume-title":"Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022","author":"Le Hung","year":"2022","unstructured":"Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu-Hong Hoi. 2022. CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. 
In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (Eds.). http:\/\/papers.nips.cc\/paper_files\/paper\/2022\/hash\/8636419dea1aa9fbd25fc4248e702da4-Abstract-Conference.html"},{"key":"e_1_2_1_26_1","volume-title":"Silvio Savarese, and Steven C. H. Hoi.","author":"Le Hung","year":"2022","unstructured":"Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C. H. Hoi. 2022. CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. arxiv:2207.01780."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1016\/J.JSS.2021.111141"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2304.11477"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2402.09664"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2305.08291"},{"key":"e_1_2_1_31_1","unstructured":"OpenAI : Josh Achiam Steven Adler Sandhini Agarwal Lama Ahmad Ilge Akkaya Florencia Leoni Aleman Diogo Almeida Janko Altenschmidt Sam Altman Shyamal Anadkat Red Avila Igor Babuschkin Suchir Balaji Valerie Balcom Paul Baltescu Haiming Bao Mo Bavarian Jeff Belgum Irwan Bello and Jake Berdine. 2023. GPT-4 Technical Report. arxiv:2303.08774."},{"key":"e_1_2_1_32_1","volume-title":"Rahul Krishna, Divya Sankar, Lambert Pougeum Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand.","author":"Pan Rangeet","year":"2024","unstructured":"Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pougeum Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2024. 
Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45699-6_8"},{"key":"e_1_2_1_36_1","volume-title":"Enhancing SQL Query Generation with Neurosymbolic Reasoning. In The 39th Annual AAAI Conference on Artificial Intelligence. https:\/\/openreview.net\/forum?id=sbHQu2EPsm","author":"Princis Henrijs","year":"2025","unstructured":"Henrijs Princis, Cristina David, and Alan Mycroft. 2025. Enhancing SQL Query Generation with Neurosymbolic Reasoning. In The 39th Annual AAAI Conference on Artificial Intelligence. https:\/\/openreview.net\/forum?id=sbHQu2EPsm"},{"key":"e_1_2_1_37_1","volume-title":"Reinforcement Learning","author":"Richard Sutton Andrew Barto","unstructured":"Andrew Barto Richard Sutton. 2018. Reinforcement Learning (second ed.). MIT Press."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.scico.2009.02.007"},{"key":"e_1_2_1_39_1","volume-title":"Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. In The Eleventh International Conference on Learning Representations, ICLR 2023","author":"Saparov Abulhair","year":"2023","unstructured":"Abulhair Saparov and He He. 2023. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https:\/\/openreview.net\/pdf?id=qFVVBzXxR2V"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2305.05364"},{"key":"e_1_2_1_41_1","volume-title":"Alvine B. Belle, Song Wang, and Timothy C. 
Lethbridge.","author":"Shahandashti Kimya Khakzad","year":"2024","unstructured":"Kimya Khakzad Shahandashti, Mithila Sivakumar, Mohammad Mahdi Mohajer, Alvine B. Belle, Song Wang, and Timothy C. Lethbridge. 2024. Evaluating the Effectiveness of GPT-4 Turbo in Creating Defeaters for Assurance Cases. arxiv:2401.17991."},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the International Conference on Paradigms of Computing, Communication and Data Sciences, Mayank Dave, Ritu Garg, Mohit Dua, and Jemal Hussien (Eds.)","author":"Singh Utkarsh","year":"2021","unstructured":"Utkarsh Singh, Kuldeep Kumar, and Deepak Kumar Gupta. 2021. A Study of Code Clone Detection Techniques in Software Systems. In Proceedings of the International Conference on Paradigms of Computing, Communication and Data Sciences, Mayank Dave, Ritu Garg, Mohit Dua, and Jemal Hussien (Eds.). Springer Singapore, Singapore. 347\u2013359. isbn:978-981-15-7533-4"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSM.2015.7332459"},{"key":"e_1_2_1_44_1","article-title":"Analysis of Sudoku Solving Algorithms","volume":"9","author":"Thenmozhi M.","year":"2017","unstructured":"M. Thenmozhi, Palash Jain, Sai Anand R, and Saketh Ram B. 2017. Analysis of Sudoku Solving Algorithms. International Journal of Engineering and Technology, 9, 3 (2017), https:\/\/www.enggjournals.com\/ijet\/docs\/IJET17-09-03-043.pdf","journal-title":"International Journal of Engineering and Technology"},{"key":"e_1_2_1_45_1","unstructured":"Petar Veli\u010dkovi\u0107 Guillem Cucurull Arantxa Casanova Adriana Romero Pietro Li\u00f2 and Yoshua Bengio. 2018. Graph Attention Networks. arxiv:1710.10903."},{"key":"e_1_2_1_46_1","volume-title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. 
In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (Eds.). http:\/\/papers.nips.cc\/paper_files\/paper\/2022\/hash\/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html"},{"key":"e_1_2_1_47_1","volume-title":"Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023","author":"Yao Shunyu","year":"2023","unstructured":"Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (Eds.). http:\/\/papers.nips.cc\/paper_files\/paper\/2023\/hash\/271db9922b8d1f4dd7aaef84ed5ac703-Abstract-Conference.html"},{"key":"e_1_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Morteza Zakeri-Nasrabadi Saeed Parsa Mohammad Ramezani Chanchal Roy and Masoud Ekhtiarzadeh. 2023. A systematic literature review on source code similarity measurement and clone detection: techniques applications and challenges. 
arxiv:2306.16171.","DOI":"10.1016\/j.jss.2023.111796"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2412.08035"}],"container-title":["Proceedings of the ACM on Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3715761","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T15:29:23Z","timestamp":1750346963000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3715761"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,19]]},"references-count":49,"journal-issue":{"issue":"FSE","published-print":{"date-parts":[[2025,6,19]]}},"alternative-id":["10.1145\/3715761"],"URL":"https:\/\/doi.org\/10.1145\/3715761","relation":{},"ISSN":["2994-970X"],"issn-type":[{"value":"2994-970X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,19]]}}}