{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T18:01:21Z","timestamp":1775325681236,"version":"3.50.1"},"reference-count":288,"publisher":"Association for Computing Machinery (ACM)","issue":"6","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62477010, No.62577022, No.62307028 and No.62477012"],"award-info":[{"award-number":["62477010, No.62577022, No.62307028 and No.62477012"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Natural Science Foundation of Shanghai","award":["23ZR1441800 and No.23ZR1418500"],"award-info":[{"award-number":["23ZR1441800 and No.23ZR1418500"]}]},{"name":"Shanghai Science and Technology Innovation Action Plan","award":["24YF2710100 and No.23YF1426100"],"award-info":[{"award-number":["24YF2710100 and No.23YF1426100"]}]},{"name":"Shanghai Qiji Zhifeng Co., Ltd.","award":["2025-GZL-RGZN-01001"],"award-info":[{"award-number":["2025-GZL-RGZN-01001"]}]},{"name":"State Key Laboratory of Disaster Reduction in Civil Engineering","award":["SLDRCE24-03"],"award-info":[{"award-number":["SLDRCE24-03"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2026,4,30]]},"abstract":"<jats:p>In recent years, there has been remarkable progress in leveraging Language Models (LMs), encompassing Pre-trained Language Models (PLMs) and Large-scale Language Models (LLMs), within the domain of mathematics. This article conducts a comprehensive survey of mathematical LMs, systematically categorizing pivotal research endeavors from two distinct perspectives: tasks and methodologies. 
The landscape reveals a large number of proposed mathematical LLMs, which are further delineated into instruction learning, tool-based methods, fundamental CoT techniques, advanced CoT methodologies, and multi-modal methods. To comprehend the benefits of mathematical LMs more thoroughly, we carry out an in-depth contrast of their characteristics and performance. In addition, our survey entails the compilation of over 60 mathematical datasets, including training datasets, benchmark datasets, and augmented datasets. Addressing the primary challenges and delineating future trajectories within the field of mathematical LMs, this survey is poised to facilitate and inspire future innovation among researchers invested in advancing this domain.<\/jats:p>","DOI":"10.1145\/3773985","type":"journal-article","created":{"date-parts":[[2025,11,1]],"date-time":"2025-11-01T11:34:09Z","timestamp":1761996849000},"page":"1-37","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Mathematical Language Models: A Survey"],"prefix":"10.1145","volume":"58","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-3205-8684","authenticated-orcid":false,"given":"Wentao","family":"Liu","sequence":"first","affiliation":[{"name":"Shanghai Institute of Artificial Intelligence for Education, East China Normal University","place":["Shanghai, China"]},{"name":"Shanghai Innovation Institute","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-2220-4009","authenticated-orcid":false,"given":"Hanglei","family":"Hu","sequence":"additional","affiliation":[{"name":"Shanghai Institute of Artificial Intelligence for Education, East China Normal University","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2589-0164","authenticated-orcid":false,"given":"Jie","family":"Zhou","sequence":"additional","affiliation":[{"name":"East China Normal University","place":["Shanghai, 
China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-9986-4606","authenticated-orcid":false,"given":"Yuyang","family":"Ding","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, East China Normal University","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-4249-4874","authenticated-orcid":false,"given":"Junsong","family":"Li","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, East China Normal University","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-1435-9451","authenticated-orcid":false,"given":"Jiayi","family":"Zeng","sequence":"additional","affiliation":[{"name":"East China Normal University School of Computer Science and Technology","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-6698-0905","authenticated-orcid":false,"given":"Mengliang","family":"He","sequence":"additional","affiliation":[{"name":"East China Normal University School of Computer Science and Technology","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5602-1877","authenticated-orcid":false,"given":"Qin","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, East China Normal University","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7914-1978","authenticated-orcid":false,"given":"Bo","family":"Jiang","sequence":"additional","affiliation":[{"name":"Shanghai Institute of Artificial Intelligence for Education, East China Normal University","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4768-5946","authenticated-orcid":false,"given":"Aimin","family":"Zhou","sequence":"additional","affiliation":[{"name":"Shanghai Institute of Artificial Intelligence for Education, East China Normal University","place":["Shanghai, China"]},{"name":"Shanghai Innovation Institute","place":["Shanghai, 
China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4723-5486","authenticated-orcid":false,"given":"Liang","family":"He","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, East China Normal University","place":["Shanghai, China"]}]}],"member":"320","published-online":{"date-parts":[[2025,12,8]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"crossref","unstructured":"E. Al-Hossami R. Bunescu J. Smith et\u00a0al. 2024. Can language models employ the socratic method? Experiments with code debugging. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education 1 (2024) 53\u201359.","DOI":"10.1145\/3626252.3630799"},{"key":"e_1_3_1_3_2","first-page":"709","volume-title":"BEA","author":"Al-Hossami Erfan","year":"2023","unstructured":"Erfan Al-Hossami, Razvan Bunescu, Ryan Teehan, Laurel Powell, Khyati Mahajan, and Mohsen Dorodchi. 2023. Socratic questioning of novice debuggers: A benchmark dataset and preliminary evaluations. In BEA. 709\u2013726."},{"key":"e_1_3_1_4_2","first-page":"351","volume-title":"LREC","author":"Alghamdi Reem","year":"2022","unstructured":"Reem Alghamdi, Zhenwen Liang, and Xiangliang Zhang. 2022. ArMATH: A dataset for solving Arabic math word problems. In LREC. 351\u2013362."},{"key":"e_1_3_1_5_2","first-page":"2357","volume-title":"NAACL","author":"Amini Aida","year":"2019","unstructured":"Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In NAACL. 2357\u20132367."},{"key":"e_1_3_1_6_2","first-page":"147","volume-title":"EACL","author":"Ang Beng Heng","year":"2023","unstructured":"Beng Heng Ang, Sujatha Das Gollapalli, and See Kiong Ng. 2023. Socratic question generation: A novel dataset, models, and evaluation. In EACL. 
147\u2013165."},{"key":"e_1_3_1_7_2","volume-title":"Submitted to The 13th International Conference on Learning Representations","year":"2024","unstructured":"Anonymous. 2024. Diving into self-evolve training for multimodal reasoning. In Submitted to The 13th International Conference on Learning Representations."},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_1_9_2","unstructured":"J. Austin A. Odena M. Nye et\u00a0al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732."},{"key":"e_1_3_1_10_2","unstructured":"Z. Azerbayev B. Piotrowski H. Schoelkopf et\u00a0al. 2023. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433."},{"key":"e_1_3_1_11_2","first-page":"164B","article-title":"LLEMMA: An open language model for mathematics","volume":"8","author":"Azerbayev Zhangir","unstructured":"Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. [n.d.]. LLEMMA: An open language model for mathematics. Minerva 8 ([n.d.]), 164B.","journal-title":"Minerva"},{"key":"e_1_3_1_12_2","first-page":"454","volume-title":"ICML","author":"Bansal Kshitij","year":"2019","unstructured":"Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, and Stewart Wilcox. 2019. Holist: An environment for machine learning of higher order logic theorem proving. In ICML. 454\u2013463."},{"key":"e_1_3_1_13_2","first-page":"4754","volume-title":"EMNLP","author":"Berg-Kirkpatrick Taylor","year":"2020","unstructured":"Taylor Berg-Kirkpatrick and Daniel Spokoyny. 2020. An empirical investigation of contextualized number prediction. In EMNLP. 4754\u20134764."},{"key":"e_1_3_1_14_2","doi-asserted-by":"crossref","unstructured":"Bernardino Romera-Paredes Mohammadamin Barekatain Alexander Novikov Matej Balog M. Pawan Kumar Emilien Dupont Francisco J. R. Ruiz Jordan S. 
Ellenberg Pengming Wang Omar Fawzi et\u00a0al. 2024. Mathematical discoveries from program search with large language models. Nature 2024 625 (2024) 468\u2013475.","DOI":"10.1038\/s41586-023-06924-6"},{"key":"e_1_3_1_15_2","volume-title":"Interactive Theorem Proving and Program Development: Coq\u2019Art: The Calculus of Inductive Constructions","author":"Bertot Yves","year":"2013","unstructured":"Yves Bertot and Pierre Cast\u00e9ran. 2013. Interactive Theorem Proving and Program Development: Coq\u2019Art: The Calculus of Inductive Constructions."},{"key":"e_1_3_1_16_2","doi-asserted-by":"crossref","unstructured":"M. Besta N. Blach A. Kubicek et\u00a0al. 2024. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence 38 16 (2024) 17682\u201317690.","DOI":"10.1609\/aaai.v38i16.29720"},{"key":"e_1_3_1_17_2","unstructured":"D. G. Bobrow. 1968. Natural language input for a computer problem-solving system. Semantic Information Processing (1968) 146\u2013226."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1207\/s1532690xci0103_1"},{"key":"e_1_3_1_19_2","doi-asserted-by":"crossref","unstructured":"Thomas C. Brickhouse and Nicholas D. Smith. 2009. Socratic teaching and socratic method. (2009).","DOI":"10.1093\/oxfordhb\/9780195312881.003.0011"},{"key":"e_1_3_1_20_2","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et\u00a0al. 2020. Language models are few-shot learners. In NeurIPS 33 (2020), 1877\u20131901.","journal-title":"NeurIPS"},{"key":"e_1_3_1_21_2","first-page":"0351","volume-title":"CCWC","author":"Chang Edward Y.","year":"2023","unstructured":"Edward Y. Chang. 2023. Prompting large language models with the Socratic method. 
In CCWC. 0351\u20130360."},{"key":"e_1_3_1_22_2","first-page":"216","volume-title":"AAAI","author":"Chaslot Guillaume","year":"2008","unstructured":"Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. 2008. Monte-Carlo tree search: A new framework for game AI. In AAAI 4 (2008), 216\u2013217."},{"key":"e_1_3_1_23_2","doi-asserted-by":"crossref","unstructured":"H. Chen G. He L. Yuan et\u00a0al. 2024. Noise contrastive alignment of language models with explicit rewards. Advances in Neural Information Processing Systems 37 (2024) 117784\u2013117812.","DOI":"10.52202\/079017-3741"},{"key":"e_1_3_1_24_2","doi-asserted-by":"crossref","unstructured":"Jiaqi Chen Jianheng Tang Jinghui Qin Xiaodan Liang Lingbo Liu Eric P. Xing and Liang Lin. 2021. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 513\u2013523.","DOI":"10.18653\/v1\/2021.findings-acl.46"},{"key":"e_1_3_1_25_2","doi-asserted-by":"crossref","unstructured":"Nuo Chen Ning Wu Jianhui Chang and Jia Li. 2024. ControlMath: Controllable data generation promotes math generalist models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 12201\u201312217.","DOI":"10.18653\/v1\/2024.emnlp-main.680"},{"key":"e_1_3_1_26_2","doi-asserted-by":"crossref","unstructured":"Qiguang Chen Libo Qin Jin Zhang Zhi Chen Xiao Xu and Wanxiang Che. 2024. M \\(^3\\) CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 8199\u20138221.","DOI":"10.18653\/v1\/2024.acl-long.446"},{"key":"e_1_3_1_27_2","unstructured":"Wenhu Chen Xueguang Ma Xinyi Wang and William W. Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. 
Transactions on Machine Learning Research."},{"key":"e_1_3_1_28_2","unstructured":"Xingyu Chen Jiahao Xu Tian Liang Zhiwei He Jianhui Pang Dian Yu Linfeng Song Qiuzhi Liu Mengfei Zhou Zhuosheng Zhang et\u00a0al. 2024. Do not think that much for 2+ 3=? on the overthinking of o1-like LLMs. arXiv:2412.21187. Retrieved from https:\/\/arxiv.org\/abs\/2412.21187"},{"key":"e_1_3_1_29_2","doi-asserted-by":"crossref","unstructured":"Yanda Chen Ruiqi Zhong Sheng Zha George Karypis and He He. 2022. Meta-learning via language model in-context tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 719\u2013730.","DOI":"10.18653\/v1\/2022.acl-long.53"},{"key":"e_1_3_1_30_2","doi-asserted-by":"crossref","unstructured":"Zhiyu Chen Wenhu Chen Charese Smiley Sameena Shah Iana Borova Dylan Langdon Reema Moussa Matt Beane Ting-Hao Huang Bryan R. Routledge et\u00a0al. 2021. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 3697\u20133711.","DOI":"10.18653\/v1\/2021.emnlp-main.300"},{"key":"e_1_3_1_31_2","doi-asserted-by":"crossref","unstructured":"Zhe Chen Jiannan Wu Wenhai Wang Weijie Su Guo Chen Sen Xing Muyan Zhong Qinglong Zhang Xizhou Zhu Lewei Lu et\u00a0al. 2024. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 24185\u201324198.","DOI":"10.1109\/CVPR52733.2024.02283"},{"key":"e_1_3_1_32_2","unstructured":"Konstantin Chernyshev Vitaliy Polshkov Ekaterina Artemova Alex Myasnikov Vlad Stepanov Alexei Miasnikov and Sergei Tilga. 2024. U-MATH: A university-level benchmark for evaluating mathematical skills in LLMs. arxiv:2412.03205. 
Retrieved from https:\/\/arxiv.org\/abs\/2412.03205"},{"key":"e_1_3_1_33_2","unstructured":"Aakanksha Chowdhery Sharan Narang Jacob Devlin Maarten Bosma Gaurav Mishra Adam Roberts Paul Barham Hyung Won Chung Charles Sutton Sebastian Gehrmann et\u00a0al. 2022. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 2023 24 (240) 1\u2013113."},{"key":"e_1_3_1_34_2","doi-asserted-by":"crossref","unstructured":"Zheng Chu Jingchang Chen Qianglong Chen Weijiang Yu Tao He Haotian Wang Weihua Peng Ming Liu Bing Qin and Ting Liu. 2023. A survey of chain of thought reasoning: Advances frontiers and future. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1173\u20131203.","DOI":"10.18653\/v1\/2024.acl-long.65"},{"key":"e_1_3_1_35_2","unstructured":"Hyung Won Chung Le Hou Shayne Longpre Barret Zoph Yi Tay William Fedus Yunxuan Li Xuezhi Wang Mostafa Dehghani Siddhartha Brahma et\u00a0al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research 25 70 (2024) 1\u201353."},{"key":"e_1_3_1_36_2","doi-asserted-by":"crossref","unstructured":"Peter Clark Oren Etzioni Daniel Khashabi Tushar Khot Bhavana Dalvi Mishra Kyle Richardson Ashish Sabharwal Carissa Schoenick Oyvind Tafjord Niket Tandon et\u00a0al. 2021. From \u2019F\u2019 to \u2019A\u2019 on the N.Y. Regents Science Exams: An overview of the Aristo Project. AI Magazine 41 4 (2021) 39\u201353.","DOI":"10.1609\/aimag.v41i4.5304"},{"key":"e_1_3_1_37_2","unstructured":"Karl Cobbe Vineet Kosaraju Mohammad Bavarian Mark Chen Heewoo Jun Lukasz Kaiser Matthias Plappert Jerry Tworek Jacob Hilton Reiichiro Nakano et\u00a0al. 2021. Training verifiers to solve math word problems. arxiv:2110.14168. Retrieved from https:\/\/arxiv.org\/abs\/2110.14168"},{"key":"e_1_3_1_38_2","volume-title":"Large Language Models and Mathematical Understanding","author":"Couperus Jelle","year":"2023","unstructured":"Jelle Couperus. 
2023. Large Language Models and Mathematical Understanding. Master\u2019s thesis."},{"key":"e_1_3_1_39_2","unstructured":"Gautier Dagan Frank Keller and Alex Lascarides. 2023. Dynamic planning with a LLM. Language Gamification Workshop 2024 at NeurIPS. Neural Information Processing Systems Foundation (NeurIPS). 1\u201314."},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-021-04086-x"},{"key":"e_1_3_1_41_2","unstructured":"P. C. Davis and E. E. Steinglass. 1997. A dialogue about Socratic teaching. NYU Rev. L. & Soc. Change 23 (1997) 249."},{"key":"e_1_3_1_42_2","unstructured":"Yihe Deng and Paul Mineiro. 2024. Flow-DPO: Improving LLM mathematical reasoning through online multi-agent learning. The 4th Workshop on Mathematical Reasoning and AI at NeurIPS\u201924."},{"key":"e_1_3_1_43_2","doi-asserted-by":"crossref","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Volume 1 (Long and Short Papers). 4171\u20134186.","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_3_1_44_2","first-page":"3029","volume-title":"EMNLP","author":"Ding Ning","year":"2023","unstructured":"Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. In EMNLP. 3029\u20133051."},{"key":"e_1_3_1_45_2","first-page":"3730","volume-title":"CIKM","author":"Ding Yuyang","year":"2024","unstructured":"Yuyang Ding, Hanglei Hu, Jie Zhou, Qin Chen, Bo Jiang, and Liang He. 2024. Boosting large language models with Socratic method for conversational mathematics teaching. In CIKM. 
3730\u20133735."},{"issue":"32","key":"e_1_3_1_46_2","doi-asserted-by":"crossref","first-page":"e2123433119","DOI":"10.1073\/pnas.2123433119","article-title":"A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level","volume":"119","author":"Drori Iddo","year":"2022","unstructured":"Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin Liu, Linda Chen, Sunny Tran, Newman Cheng, et\u00a0al. 2022. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. PNAS 119, 32 (2022), e2123433119.","journal-title":"PNAS"},{"key":"e_1_3_1_47_2","first-page":"2368","volume-title":"NAACL","author":"Dua Dheeru","year":"2019","unstructured":"Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL. 2368\u20132378."},{"key":"e_1_3_1_48_2","doi-asserted-by":"crossref","unstructured":"Jinhao Duan Hao Cheng Shiqi Wang Chenan Wang Alex Zavalny Renjing Xu Bhavya Kailkhura and Kaidi Xu. 2023. Shifting attention to relevance: Towards the uncertainty estimation of large language models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5050\u20135063.","DOI":"10.18653\/v1\/2024.acl-long.276"},{"key":"e_1_3_1_49_2","first-page":"3973","volume-title":"ACL","author":"Elazar Yanai","year":"2019","unstructured":"Yanai Elazar, Abhijit Mahabal, Deepak Ramachandran, Tania Bedrax-Weiss, and Dan Roth. 2019. How large are lions? Inducing distributions over quantitative attributes. In ACL. 3973\u20133983."},{"key":"e_1_3_1_50_2","unstructured":"Kawin Ethayarajh Winnie Xu Niklas Muennighoff Dan Jurafsky and Douwe Kiela. 2024. KTO: Model alignment as prospect theoretic optimization. 
arXiv preprint arXiv:2402.01306."},{"key":"e_1_3_1_51_2","unstructured":"E. A. Feigenbaum and J. Feldman. 1963. Computers and Thought 7 (1963)."},{"key":"e_1_3_1_52_2","unstructured":"Yu Feng Jing Zhang Xiaokang Zhang Lemao Liu Cuiping Li and Hong Chen. 2022. Injecting numerical reasoning skills into knowledge base question answering models. arXiv:2112.06109. Retrieved from https:\/\/arxiv.org\/abs\/2112.06109"},{"key":"e_1_3_1_53_2","doi-asserted-by":"crossref","unstructured":"Emily First Markus N. Rabe Talia Ringer and Yuriy Brun. 2023. Baldur: Whole-proof generation and repair with large language models. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1229\u20131241.","DOI":"10.1145\/3611643.3616243"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.3758\/BF03207654"},{"key":"e_1_3_1_55_2","first-page":"266","volume-title":"ACL","author":"Forbes Maxwell","year":"2017","unstructured":"Maxwell Forbes and Yejin Choi. 2017. Verb physics: Relative physical knowledge of actions and objects. In ACL. 266\u2013276."},{"key":"e_1_3_1_56_2","volume-title":"ICLR","author":"Fu Yao","year":"2023","unstructured":"Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2023. Complexity-based prompting for multi-step reasoning. In ICLR."},{"key":"e_1_3_1_57_2","first-page":"10764","volume-title":"ICML","author":"Gao Luyu","year":"2023","unstructured":"Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Pal: Program-aided language models. In ICML. 10764\u201310799."},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10462-023-10562-9"},{"key":"e_1_3_1_59_2","first-page":"143","volume-title":"Western Joint IRE-AIEE-ACM Computer Conference","author":"Gelernter Herbert","year":"1960","unstructured":"Herbert Gelernter, James R. Hansen, and Donald W. Loveland. 1960. 
Empirical explorations of the geometry theorem machine. In Western Joint IRE-AIEE-ACM Computer Conference. 143\u2013149."},{"key":"e_1_3_1_60_2","first-page":"946","volume-title":"ACL","author":"Geva Mor","year":"2020","unstructured":"Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. In ACL, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). 946\u2013958."},{"key":"e_1_3_1_61_2","unstructured":"Team GLM Aohan Zeng Bin Xu Bowen Wang Chenhui Zhang Da Yin Dan Zhang Diego Rojas Guanyu Feng Hanlin Zhao et\u00a0al. 2024. ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools. arXiv preprint arXiv:2406.12793."},{"key":"e_1_3_1_62_2","first-page":"163","volume-title":"ITP","author":"Gonthier Georges","year":"2013","unstructured":"Georges Gonthier, Andrea Asperti, Jeremy Avigad, Yves Bertot, Cyril Cohen, Fran\u00e7ois Garillot, St\u00e9phane Le Roux, Assia Mahboubi, Russell O\u2019Connor, Sidi Ould Biha, et\u00a0al. 2013. A machine-checked proof of the odd order theorem. In ITP. 163\u2013179."},{"key":"e_1_3_1_63_2","unstructured":"Zhibin Gou Zhihong Shao Yeyun Gong Yelong Shen Yujiu Yang Nan Duan and Weizhu Chen. 2023. CRITIC: Large language models can self-correct with tool-interactive critiquing. The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_1_64_2","unstructured":"Z. Gou Z. Shao Y. Gong Y. Shen Y. Yang M. Huang N. Duan and W. Chen. 2023. ToRA: A tool-integrated reasoning agent for mathematical problem solving. The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_1_65_2","doi-asserted-by":"crossref","unstructured":"A. Grabowski A. Korni\u0142owicz and A. Naumowicz. 2015. Four decades of mizar: Foreword. Journal of Automated Reasoning 55 3 (2015) 191\u2013198.","DOI":"10.1007\/s10817-015-9345-1"},{"key":"e_1_3_1_66_2","doi-asserted-by":"crossref","unstructured":"T. Hales M. Adams G. Bauer et\u00a0al. 2017. 
A formal proof of the Kepler conjecture. Forum of Mathematics Pi 5 (2017) e2.","DOI":"10.1017\/fmp.2017.1"},{"key":"e_1_3_1_67_2","volume-title":"ICLR","author":"Han Jesse Michael","year":"2022","unstructured":"Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward Ayers, and Stanislas Polu. 2022. Proof artifact co-training for theorem proving with language models. In ICLR."},{"key":"e_1_3_1_68_2","first-page":"22017","volume-title":"Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024","author":"Han Simeng","year":"2024","unstructured":"Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, et\u00a0al. 2024. FOLIO: Natural language reasoning with first-order logic. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, 22017\u201322031. Retrieved from https:\/\/aclanthology.org\/2024.emnlp-main.1229"},{"key":"e_1_3_1_69_2","doi-asserted-by":"crossref","unstructured":"Shibo Hao Yi Gu Haodi Ma Joshua Jiahua Hong Zhen Wang Daisy Zhe Wang and Zhiting Hu. 2023. Reasoning with language model is planning with world model. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 8154\u20138173.","DOI":"10.18653\/v1\/2023.emnlp-main.507"},{"key":"e_1_3_1_70_2","unstructured":"Shibo Hao Tianyang Liu Zhen Wang and Zhiting Hu. 2023. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. Advances in Neural Information Processing Systems 36 (2023) 45870\u201345894."},{"key":"e_1_3_1_71_2","first-page":"1763","article-title":"PGDP5K: A diagram parsing dataset for plane geometry problems","author":"Hao Yihan","year":"2022","unstructured":"Yihan Hao, Mingliang Zhang, Fei Yin, and Linlin Huang. 2022. 
PGDP5K: A diagram parsing dataset for plane geometry problems. In ICPR. 1763\u20131769.","journal-title":"ICPR"},{"key":"e_1_3_1_72_2","unstructured":"Hangfeng He Hongming Zhang and Dan Roth. 2022. Rethinking with retrieval: Faithful large language model inference. arXiv preprint arXiv:2301.00303."},{"key":"e_1_3_1_73_2","unstructured":"Joy He-Yueya Gabriel Poesia Rose E. Wang and Noah D. Goodman. 2023. Solving math word problems by combining language models with symbolic solvers. The 3rd Workshop on Mathematical Reasoning and AI at NeurIPS\u201923."},{"key":"e_1_3_1_74_2","article-title":"Measuring massive multitask language understanding","author":"Hendrycks Dan","year":"2021","unstructured":"Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In ICLR.","journal-title":"ICLR"},{"key":"e_1_3_1_75_2","volume-title":"NeurIPS Datasets and Benchmarks Track (Round 2)","author":"Hendrycks Dan","year":"2021","unstructured":"Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In NeurIPS Datasets and Benchmarks Track (Round 2)."},{"key":"e_1_3_1_76_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295748"},{"key":"e_1_3_1_77_2","unstructured":"Arian Hosseini Xingdi Yuan Nikolay Malkin Aaron Courville Alessandro Sordoni and Rishabh Agarwal. 2024. V-star: Training verifiers for self-taught reasoners. First Conference on Language Modeling."},{"key":"e_1_3_1_78_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1058"},{"key":"e_1_3_1_79_2","unstructured":"Cheng-Yu Hsieh Si-An Chen Chun-Liang Li Yasuhisa Fujii Alexander Ratner Chen-Yu Lee Ranjay Krishna and Tomas Pfister. 2023. Tool documentation enables zero-shot tool-usage with large language models. arxiv:2308.00675. 
Retrieved from https:\/\/arxiv.org\/abs\/2308.00675"},{"key":"e_1_3_1_80_2","volume-title":"ICLR","author":"Huang Daniel","year":"2019","unstructured":"Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. 2019. GamePad: A learning environment for theorem proving. In ICLR."},{"key":"e_1_3_1_81_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1084"},{"key":"e_1_3_1_82_2","first-page":"13674","volume-title":"EMNLP","author":"Huang Xijie","year":"2024","unstructured":"Xijie Huang, Li Lyna Zhang, Kwang-Ting Cheng, Fan Yang, and Mao Yang. 2024. Fewer is more: Boosting math reasoning with reinforced context pruning. In EMNLP. 13674\u201313695."},{"key":"e_1_3_1_83_2","unstructured":"Aaron Hurst Adam Lerer Adam P. Goucher Adam Perelman Aditya Ramesh Aidan Clark AJ Ostrow Akila Welihinda Alan Hayes Alec Radford et\u00a0al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276."},{"key":"e_1_3_1_84_2","doi-asserted-by":"crossref","unstructured":"Shima Imani Liang Du and Harsh Shrivastava. 2023. Mathprompter: Mathematical reasoning using large language models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track). 37\u201342.","DOI":"10.18653\/v1\/2023.acl-industry.4"},{"key":"e_1_3_1_85_2","article-title":"Deepmath-deep sequence models for premise selection","volume":"29","author":"Irving Geoffrey","year":"2016","unstructured":"Geoffrey Irving, Christian Szegedy, Alexander A. Alemi, Niklas E\u00e9n, Fran\u00e7ois Chollet, and Josef Urban. 2016. Deepmath-deep sequence models for premise selection. In NeurIPS 29 (2016).","journal-title":"NeurIPS"},{"key":"e_1_3_1_86_2","unstructured":"Samy Jelassi St\u00e9phane d\u2019Ascoli Carles Domingo-Enrich Yuhuai Wu Yuanzhi Li and Fran\u00e7ois Charton. 2023. Length generalization in arithmetic transformers. 
arXiv preprint arXiv:2306.15400."},{"key":"e_1_3_1_87_2","doi-asserted-by":"publisher","DOI":"10.1145\/3571730"},{"key":"e_1_3_1_88_2","first-page":"378","volume-title":"AITP","author":"Jiang Albert Qiaochu","year":"2021","unstructured":"Albert Qiaochu Jiang, Wenda Li, Jesse Michael Han, and Yuhuai Wu. 2021. LISA: Language models of ISAbelle proofs. In AITP. 378\u2013392."},{"key":"e_1_3_1_89_2","unstructured":"Albert Q. Jiang Wenda Li Szymon Tworkowski Konrad Czechowski Tomasz Odrzyg\u00f3\u017ad\u017a Piotr Mi\u0142o\u015b Yuhuai Wu and Mateja Jamnik. 2022. Thor: Wielding hammers to integrate language models and automated theorem provers. Advances in Neural Information Processing Systems 35 (2022) 8360\u20138373."},{"key":"e_1_3_1_90_2","unstructured":"Albert Qiaochu Jiang Sean Welleck Jin Peng Zhou Timothee Lacroix Jiacheng Liu Wenda Li Mateja Jamnik Guillaume Lample and Yuhuai Wu. 2023. Draft sketch and prove: Guiding formal theorem provers with informal proofs. The Eleventh International Conference on Learning Representations."},{"key":"e_1_3_1_91_2","doi-asserted-by":"crossref","unstructured":"Dongwei Jiang Marcio Fonseca and Shay B. Cohen. 2024. LeanReasoner: Boosting complex logical reasoning with lean. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 7490\u20137503.","DOI":"10.18653\/v1\/2024.naacl-long.416"},{"key":"e_1_3_1_92_2","doi-asserted-by":"crossref","unstructured":"Song Jiang Zahra Shakeri Aaron Chan Maziar Sanjabi Hamed Firooz Yinglong Xia Bugra Akyildiz Yizhou Sun Jinchao Li Qifan Wang et\u00a0al. 2024. Resprompt: Residual connection prompting advances multi-step reasoning in large language models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 
5784\u20135809.","DOI":"10.18653\/v1\/2024.naacl-long.323"},{"key":"e_1_3_1_93_2","unstructured":"Zhanming Jie Jierui Li and Wei Lu. 2022. Learning to reason deductively: Math word problem solving as complex relation extraction. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5944\u20135955."},{"key":"e_1_3_1_94_2","volume-title":"ICLR","author":"Kaliszyk Cezary","year":"2017","unstructured":"Cezary Kaliszyk, Fran\u00e7ois Chollet, and Christian Szegedy. 2017. HolStep: A machine learning dataset for higher-order logic theorem proving. In ICLR."},{"key":"e_1_3_1_95_2","first-page":"7318","volume-title":"EMNLP","author":"Kalyan Ashwin","year":"2021","unstructured":"Ashwin Kalyan, Abhinav Kumar, Arjun Chandrasekaran, Ashish Sabharwal, and Peter Clark. 2021. How much coffee was consumed during EMNLP 2019? Fermi problems: A new reasoning challenge for AI. In EMNLP. 7318\u20137328."},{"key":"e_1_3_1_96_2","unstructured":"Jared Kaplan Sam McCandlish Tom Henighan Tom B. Brown Benjamin Chess Rewon Child Scott Gray Alec Radford Jeffrey Wu and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361."},{"key":"e_1_3_1_97_2","doi-asserted-by":"crossref","unstructured":"Mehran Kazemi Najoung Kim Deepti Bhatia Xin Xu and Deepak Ramachandran. 2022. Lambada: Backward chaining for automated reasoning in natural language. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6547\u20136568.","DOI":"10.18653\/v1\/2023.acl-long.361"},{"key":"e_1_3_1_98_2","first-page":"4171","volume-title":"NAACL","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL. 
4171\u20134186."},{"key":"e_1_3_1_99_2","unstructured":"Muhammad Khalifa Lajanugen Logeswaran Moontae Lee Honglak Lee and Lu Wang. 2023. Discriminator-guided multi-step reasoning with language models. arXiv preprint arXiv:2305.14934."},{"key":"e_1_3_1_100_2","first-page":"3768","volume-title":"EMNLP","author":"Kim Bugeun","year":"2020","unstructured":"Bugeun Kim, Kyung Seo Ki, Donggeon Lee, and Gahgene Gweon. 2020. Point to the expression: Solving algebraic word problems using the expression-pointer transformer model. In EMNLP. Online, 3768\u20133779."},{"key":"e_1_3_1_101_2","first-page":"4442","volume-title":"ACL","author":"Kim Bugeun","year":"2022","unstructured":"Bugeun Kim, Kyung Seo Ki, Sangkyu Rhim, and Gahgene Gweon. 2022. EPT-X: An expression-pointer transformer model that generates explanations for numbers. In ACL. 4442\u20134458."},{"key":"e_1_3_1_102_2","unstructured":"T. Kojima S. S. Gu M. Reid et\u00a0al. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems 35 (2022) 22199\u201322213."},{"key":"e_1_3_1_103_2","doi-asserted-by":"crossref","unstructured":"R. Koncel-Kedziorski H. Hajishirzi A. Sabharwal et\u00a0al. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics 3 (2015) 585\u2013597.","DOI":"10.1162\/tacl_a_00160"},{"key":"e_1_3_1_104_2","first-page":"1152","volume-title":"NAACL","author":"Koncel-Kedziorski Rik","year":"2016","unstructured":"Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math word problem repository. In NAACL. 1152\u20131157."},{"key":"e_1_3_1_105_2","unstructured":"Aviral Kumar Vincent Zhuang Rishabh Agarwal Yi Su John D. Co-Reyes Avi Singh Kate Baumli Shariq Iqbal Colton Bishop Rebecca Roelofs et\u00a0al. 2024. Training language models to self-correct via reinforcement learning. 
The Thirteenth International Conference on Learning Representations."},{"key":"e_1_3_1_106_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-1026"},{"key":"e_1_3_1_107_2","unstructured":"Xin Lai Zhuotao Tian Yukang Chen Senqiao Yang Xiangru Peng and Jiaya Jia. 2024. Step-DPO: Step-wise preference optimization for long-chain reasoning of LLMs. arXiv preprint arXiv:2406.18629."},{"key":"e_1_3_1_108_2","unstructured":"G. Lample T. Lacroix M. A. Lachaux et\u00a0al. 2022. Hypertree proof search for neural theorem proving. Advances in Neural Information Processing Systems 35 (2022) 26337\u201326349."},{"key":"e_1_3_1_109_2","first-page":"26337","article-title":"Hypertree proof search for neural theorem proving","volume":"35","author":"Lample Guillaume","year":"2022","unstructured":"Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. 2022. Hypertree proof search for neural theorem proving. In NeurIPS 35 (2022), 26337\u201326349.","journal-title":"NeurIPS"},{"key":"e_1_3_1_110_2","unstructured":"B. Lei P. Ling C. Liao and C. Ding. 2023. Boosting logical reasoning in large language models through a new framework: The graph of thought. arXiv preprint arXiv:2308.08614."},{"key":"e_1_3_1_111_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.703"},{"key":"e_1_3_1_112_2","first-page":"3843","article-title":"Solving quantitative reasoning problems with language models","volume":"35","author":"Lewkowycz Aitor","year":"2022","unstructured":"Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et\u00a0al. 2022. Solving quantitative reasoning problems with language models. 
In NeurIPS 35 (2022), 3843\u20133857.","journal-title":"NeurIPS"},{"key":"e_1_3_1_113_2","volume-title":"NeurIPS","author":"Li Guohao","year":"2023","unstructured":"Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative agents for \u201cmind\u201d exploration of large language model society. In NeurIPS."},{"key":"e_1_3_1_114_2","volume-title":"ICLR","author":"Li Wenda","year":"2021","unstructured":"Wenda Li, Lei Yu, Yuhuai Wu, and Lawrence C. Paulson. 2021. IsarStep: A benchmark for high-level mathematical reasoning. In ICLR."},{"key":"e_1_3_1_115_2","first-page":"5315","volume-title":"ACL","author":"Li Yifei","year":"2023","unstructured":"Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2023. Making language models better reasoners with step-aware verifier. In ACL. 5315\u20135333."},{"key":"e_1_3_1_116_2","doi-asserted-by":"crossref","unstructured":"Z. Li W. Zhang C. Yan et\u00a0al. 2022. Seeking patterns not just memorizing procedures: Contrastive learning for solving math word problems. Findings of the Association for Computational Linguistics: ACL 2022. 2486\u20132496.","DOI":"10.18653\/v1\/2022.findings-acl.195"},{"key":"e_1_3_1_117_2","doi-asserted-by":"crossref","unstructured":"Z. Liang J. Zhang L. Wang et\u00a0al. 2022. MWP-BERT: Numeracy-augmented pre-training for math word problem solving. Findings of the Association for Computational Linguistics: NAACL 2022. 997\u20131009.","DOI":"10.18653\/v1\/2022.findings-naacl.74"},{"key":"e_1_3_1_118_2","volume-title":"ICLR","author":"Lightman Hunter","year":"2024","unstructured":"Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2024. Let\u2019s verify step by step. 
In ICLR."},{"key":"e_1_3_1_119_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1015"},{"key":"e_1_3_1_120_2","unstructured":"Zhan Ling Yunhao Fang Xuanlin Li Zhiao Huang Mingu Lee Roland Memisevic and Hao Su. 2023. Deductive verification of chain-of-thought reasoning. Advances in Neural Information Processing Systems 36 (2023) 36407\u201336433."},{"key":"e_1_3_1_121_2","unstructured":"Bo Liu Yuqian Jiang Xiaohan Zhang Qiang Liu Shiqi Zhang Joydeep Biswas and Peter Stone. 2023. LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477."},{"key":"e_1_3_1_122_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.deelio-1.10"},{"key":"e_1_3_1_123_2","unstructured":"Tiedong Liu and Bryan Kian Hsiang Low. 2023. Goat: Fine-tuned LLaMA outperforms GPT-4 on arithmetic tasks. arXiv preprint arXiv:2305.14201."},{"key":"e_1_3_1_124_2","doi-asserted-by":"crossref","unstructured":"Wentao Liu Qianjun Pan Yi Zhang Zhuo Liu Ji Wu Jie Zhou Aimin Zhou Qin Chen Bo Jiang and Liang He. 2024. CMM-Math: A Chinese multimodal math dataset to evaluate and enhance the mathematics reasoning of large multimodal models. Proceedings of the 33rd ACM International Conference on Multimedia. 12585\u201312591.","DOI":"10.1145\/3746027.3758193"},{"key":"e_1_3_1_125_2","unstructured":"Y. Liu M. Ott N. Goyal et\u00a0al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692."},{"key":"e_1_3_1_126_2","unstructured":"Yixin Liu Avi Singh C. Daniel Freeman John D. Co-Reyes and Peter J. Liu. 2023. Improving large language model fine-tuning for solving math problems. arXiv preprint arXiv:2305.08291."},{"key":"e_1_3_1_127_2","article-title":"Large language model guided tree-of-thought","author":"Long Jieyi","year":"2023","unstructured":"Jieyi Long. 2023. Large language model guided tree-of-thought. 
arXiv (2023).","journal-title":"arXiv"},{"key":"e_1_3_1_128_2","unstructured":"Pan Lu Hritik Bansal Tony Xia Jiacheng Liu Chunyuan Li Hannaneh Hajishirzi Hao Cheng Kai-Wei Chang Michel Galley and Jianfeng Gao. 2023. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_1_129_2","first-page":"6774","volume-title":"ACL","author":"Lu Pan","year":"2021","unstructured":"Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-chun Zhu. 2021. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In ACL. 6774\u20136786."},{"key":"e_1_3_1_130_2","unstructured":"Pan Lu Baolin Peng Hao Cheng Michel Galley Kai-Wei Chang Ying Nian Wu Song-Chun Zhu and Jianfeng Gao. 2023. Chameleon: Plug-and-Play compositional reasoning with large language models. Advances in Neural Information Processing Systems 36 (2023) 43447\u201343478."},{"key":"e_1_3_1_131_2","volume-title":"ICLR","author":"Lu Pan","year":"2023","unstructured":"Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. 2023. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In ICLR."},{"key":"e_1_3_1_132_2","volume-title":"NeurIPS Datasets and Benchmarks Track","author":"Lu Pan","year":"2021","unstructured":"Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. 2021. IconQA: A new benchmark for abstract diagram understanding and visual language reasoning. In NeurIPS Datasets and Benchmarks Track."},{"key":"e_1_3_1_133_2","first-page":"14605","volume-title":"ACL","author":"Lu Pan","year":"2023","unstructured":"Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. 2023. A survey of deep learning for mathematical reasoning. In ACL, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). 
14605\u201314631."},{"key":"e_1_3_1_134_2","unstructured":"Haipeng Luo Qingfeng Sun Can Xu Pu Zhao Jianguang Lou Chongyang Tao Xiubo Geng Qingwei Lin Shifeng Chen and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. The Thirteenth International Conference on Learning Representations."},{"key":"e_1_3_1_135_2","unstructured":"Liangchen Luo Yinxiao Liu Rosanne Liu Samrat Phatale Harsh Lara Yunxuan Li Lei Shu Yun Zhu Lei Meng Jiao Sun et\u00a0al. 2024. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592."},{"key":"e_1_3_1_136_2","doi-asserted-by":"crossref","unstructured":"Trung Quoc Luong Xinbo Zhang Zhanming Jie Peng Sun Xiaoran Jin and Hang Li. 2024. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967.","DOI":"10.18653\/v1\/2024.acl-long.410"},{"key":"e_1_3_1_137_2","unstructured":"Qianli Ma Haotian Zhou Tingkai Liu Jianbo Yuan Pengfei Liu Yang You and Hongxia Yang. 2023. Let\u2019s reward step by step: Step-Level reward model as the navigators for reasoning. arXiv preprint arXiv:2310.10080."},{"key":"e_1_3_1_138_2","doi-asserted-by":"crossref","unstructured":"Jakub Macina Nico Daheim Sankalan Pal Chowdhury Tanmay Sinha Manu Kapur Iryna Gurevych and Mrinmaya Sachan. 2023. Mathdial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems. Findings of the Association for Computational Linguistics: EMNLP 2023. 5602\u20135621.","DOI":"10.18653\/v1\/2023.findings-emnlp.372"},{"key":"e_1_3_1_139_2","unstructured":"Aman Madaan Niket Tandon Prakhar Gupta Skyler Hallinan Luyu Gao Sarah Wiegreffe Uri Alon Nouha Dziri Shrimai Prabhumoye Yiming Yang et\u00a0al. 2023. Self-refine: Iterative refinement with self-feedback. 
Advances in Neural Information Processing Systems 36 (2023) 46534\u201346594."},{"issue":"20","key":"e_1_3_1_140_2","doi-asserted-by":"crossref","first-page":"51","DOI":"10.3991\/ijet.v18i20.42979","article-title":"Learning mathematics with large language models: A comparative study with computer algebra systems and other tools","volume":"18","author":"Matzakos Nikolaos","year":"2023","unstructured":"Nikolaos Matzakos, Spyridon Doukakis, and Maria Moundridou. 2023. Learning mathematics with large language models: A comparative study with computer algebra systems and other tools. International Journal of Emerging Technologies in Learning (Online) 18, 20 (2023), 51.","journal-title":"International Journal of Emerging Technologies in Learning (Online)"},{"key":"e_1_3_1_141_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.173"},{"key":"e_1_3_1_142_2","volume-title":"Metamath: A Computer Language for Mathematical Proofs","author":"Megill Norman","year":"2019","unstructured":"Norman Megill and David A. Wheeler. 2019. Metamath: A Computer Language for Mathematical Proofs. Lulu.com."},{"key":"e_1_3_1_143_2","unstructured":"Ning Miao Yee Whye Teh and Tom Rainforth. 2024. SelfCheck: Using LLMs to zero-shot check their own step-by-step reasoning. 12th International Conference on Learning Representations (ICLR 2024)."},{"key":"e_1_3_1_144_2","first-page":"975","volume-title":"ACL","author":"Miao Shen-Yun","year":"2020","unstructured":"Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In ACL. 975\u2013984."},{"issue":"2","key":"e_1_3_1_145_2","article-title":"Recent advances in natural language processing via large pre-trained language models: A survey","volume":"56","author":"Min Bonan","year":"2023","unstructured":"Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth. 2023. 
Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys 56, 2 (2023), 1\u201340.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_1_146_2","first-page":"2791","volume-title":"NAACL","author":"Min Sewon","year":"2022","unstructured":"Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2022. MetaICL: Learning to learn in context. In NAACL. 2791\u20132809."},{"key":"e_1_3_1_147_2","first-page":"5807","volume-title":"EMNLP","author":"Mishra Swaroop","year":"2022","unstructured":"Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, et\u00a0al. 2022. LILA: A unified benchmark for mathematical reasoning. In EMNLP. 5807\u20135832."},{"key":"e_1_3_1_148_2","first-page":"3505","volume-title":"ACL","author":"Mishra Swaroop","year":"2022","unstructured":"Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, and Ashwin Kalyan. 2022. NUMGLUE: A suite of fundamental yet challenging mathematical reasoning tasks. In ACL. 3505\u20133523."},{"key":"e_1_3_1_149_2","first-page":"2144","volume-title":"ACL","author":"Mitra Arindam","year":"2016","unstructured":"Arindam Mitra and Chitta Baral. 2016. Learning to use formulas to solve simple arithmetic problems. In ACL. 2144\u20132153."},{"key":"e_1_3_1_150_2","doi-asserted-by":"crossref","unstructured":"Shentong Mo and Miao Xin. 2024. Tree of uncertain thoughts reasoning for large language models. ICASSP 2024-2024 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). IEEE 12742\u201312746.","DOI":"10.1109\/ICASSP48485.2024.10448355"},{"key":"e_1_3_1_151_2","unstructured":"Matteo Muffo Aldo Cocco and Enrico Bertino. [n.d.]. Evaluating transformer language models on arithmetic operations using number decomposition. 
([n.d.])."},{"key":"e_1_3_1_152_2","article-title":"Diversity of thought improves reasoning abilities of large language models","author":"Naik Ranjita","year":"2023","unstructured":"Ranjita Naik, Varun Chandrasekaran, Mert Yuksekgonul, Hamid Palangi, and Besmira Nushi. 2023. Diversity of thought improves reasoning abilities of large language models. arXiv (2023).","journal-title":"arXiv"},{"key":"e_1_3_1_153_2","doi-asserted-by":"crossref","unstructured":"Ryosuke Nakamoto Brendan Flanagan Taisei Yamauchi Dai Yilling Kyosuke Takami and Hiroaki Ogata. 2023. Enhancing automated scoring of math self-explanation quality using LLM-generated datasets: A semi-supervised approach. (2023).","DOI":"10.20944\/preprints202308.2098.v1"},{"issue":"2","key":"e_1_3_1_154_2","first-page":"34","article-title":"The Socratic method","volume":"2","author":"Nelson Leonard","year":"1980","unstructured":"Leonard Nelson. 1980. The Socratic method. Thinking: The Journal of Philosophy for Children 2, 2 (1980), 34\u201338.","journal-title":"Thinking: The Journal of Philosophy for Children"},{"key":"e_1_3_1_155_2","doi-asserted-by":"crossref","unstructured":"A. Newell J. C. Shaw and H. A. Simon. 1957. Empirical explorations of the logic theory machine: A case study in heuristic. 218\u2013230.","DOI":"10.1145\/1455567.1455605"},{"key":"e_1_3_1_156_2","unstructured":"Rodrigo Nogueira Zhiying Jiang and Jimmy Lin. 2021. Investigating the limitations of transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019."},{"key":"e_1_3_1_157_2","article-title":"Pretrained language models are symbolic mathematics solvers too!","author":"Noorbakhsh Kimia","year":"2021","unstructured":"Kimia Noorbakhsh, Modar Sulaiman, Mahdi Sharifi, Kallol Roy, and Pooyan Jamshidi. 2021. Pretrained language models are symbolic mathematics solvers too! 
arXiv (2021).","journal-title":"arXiv"},{"key":"e_1_3_1_158_2","unstructured":"Maxwell Nye Anders Johan Andreassen Guy Gur-Ari Henryk Michalewski Jacob Austin David Bieber David Dohan Aitor Lewkowycz Maarten Bosma David Luan et\u00a0al. 2021. Show Your Work: Scratchpads for Intermediate Computation with Language Models."},{"key":"e_1_3_1_159_2","doi-asserted-by":"crossref","unstructured":"Theo X. Olausson Alex Gu Benjamin Lipkin Cedegao E. Zhang Armando Solar-Lezama Joshua B. Tenenbaum and Roger Levy. 2023. LINC: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 5153\u20135176.","DOI":"10.18653\/v1\/2023.emnlp-main.313"},{"key":"e_1_3_1_160_2","unstructured":"OpenAI. 2023. GPT-4 Technical Report. arxiv:2303.08774. Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_1_161_2","unstructured":"A. Jaech A. Kalai A. Lerer et\u00a0al. 2024. Openai o1 system card. arXiv preprint arXiv:2412.16720."},{"key":"e_1_3_1_162_2","unstructured":"Bhargavi Paranjape Scott Lundberg Sameer Singh Hannaneh Hajishirzi Luke Zettlemoyer and Marco Tulio Ribeiro. 2023. ART: Automatic multi-step reasoning and tool-use for large language models. arxiv:2303.09014. Retrieved from https:\/\/arxiv.org\/abs\/2303.09014"},{"key":"e_1_3_1_163_2","unstructured":"Aaron Parisi Yao Zhao and Noah Fiedel. 2022. TALM: Tool augmented language models. arxiv:2205.12255. Retrieved from https:\/\/arxiv.org\/abs\/2205.12255"},{"key":"e_1_3_1_164_2","unstructured":"Keiran Paster Marco Dos Santos Zhangir Azerbayev and Jimmy Ba. 2023. OpenWebMath: An open dataset of high-quality mathematical web text. The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_1_165_2","first-page":"2080","volume-title":"NAACL","author":"Patel Arkil","year":"2021","unstructured":"Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. 
Are NLP models really able to solve simple math word problems? In NAACL. 2080\u20132094."},{"key":"e_1_3_1_166_2","unstructured":"Shuai Peng Ke Yuan Liangcai Gao and Zhi Tang. 2021. MathBERT: A Pre-Trained Model for Mathematical Formula Understanding."},{"key":"e_1_3_1_167_2","unstructured":"Silviu Pitis Michael R. Zhang Andrew Wang and Jimmy Ba. 2023. Boosted prompt ensembles for large language models. arXiv preprint arXiv:2304.05970."},{"key":"e_1_3_1_168_2","unstructured":"Stanislas Polu and Ilya Sutskever. 2020. Generative language modeling for automated theorem proving. arxiv:2009.03393. Retrieved from https:\/\/arxiv.org\/abs\/2009.03393"},{"key":"e_1_3_1_169_2","volume-title":"EMNLP","author":"Qi Jingyuan","year":"2023","unstructured":"Jingyuan Qi, Zhiyang Xu, Ying Shen, Minqian Liu, Dingnan Jin, Qifan Wang, and Lifu Huang. 2023. The Art of SOCRATIC QUESTIONING: Recursive thinking with large language models. In EMNLP."},{"key":"e_1_3_1_170_2","doi-asserted-by":"crossref","unstructured":"Cheng Qian Chi Han Yi R. Fung Yujia Qin Zhiyuan Liu and Heng Ji. 2023. Creator: Tool creation for disentangling abstract and concrete reasoning of large language models. Findings of the Association for Computational Linguistics: EMNLP 2023. 6922\u20136939.","DOI":"10.18653\/v1\/2023.findings-emnlp.462"},{"key":"e_1_3_1_171_2","doi-asserted-by":"crossref","unstructured":"Runqi Qiao Qiuna Tan Guanting Dong Minhui Wu Chong Sun Xiaoshuai Song Zhuoma GongQue Shanglin Lei Zhe Wei Miaoxuan Zhang et\u00a0al. 2024. We-Math: Does your large multimodal model achieve human-like mathematical reasoning? Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 
20023\u201320070.","DOI":"10.18653\/v1\/2025.acl-long.983"},{"key":"e_1_3_1_172_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.294"},{"key":"e_1_3_1_173_2","first-page":"3780","volume-title":"EMNLP","author":"Qin Jinghui","year":"2020","unstructured":"Jinghui Qin, Lihui Lin, Xiaodan Liang, Rumin Zhang, and Liang Lin. 2020. Semantically-aligned universal tree-structured solver for math word problems. In EMNLP. 3780\u20133789."},{"key":"e_1_3_1_174_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11431-020-1647-3"},{"key":"e_1_3_1_175_2","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (June 2018), 12."},{"key":"e_1_3_1_176_2","unstructured":"Alec Radford Jeffrey Wu Rewon Child David Luan Dario Amodei and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. (Feb. 2019), 24."},{"key":"e_1_3_1_177_2","article-title":"Direct preference optimization: Your language model is secretly a reward model","volume":"36","author":"Rafailov Rafael","year":"2024","unstructured":"Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS 36 (2024).","journal-title":"NeurIPS"},{"key":"e_1_3_1_178_2","unstructured":"Vipula Rawte Amit Sheth and Amitava Das. 2023. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922."},{"key":"e_1_3_1_179_2","unstructured":"Colin Raffel Noam Shazeer Adam Roberts Katherine Lee Sharan Narang Michael Matena Yanqi Zhou Wei Li and Peter J. Liu. [n.d.]. Exploring the limits of transfer learning with a unified text-to-text transformer. ([n.d.])."},{"key":"e_1_3_1_180_2","first-page":"1743","volume-title":"EMNLP","author":"Roy Subhro","year":"2015","unstructured":"Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In EMNLP. 
1743\u20131752."},{"key":"e_1_3_1_181_2","volume-title":"AAAI","author":"Roy Subhro","year":"2017","unstructured":"Subhro Roy and Dan Roth. 2017. Unit dependency graph and its application to arithmetic word problem solving. In AAAI, Vol. 31."},{"key":"e_1_3_1_182_2","doi-asserted-by":"crossref","unstructured":"S. Roy and D. Roth. 2018. Mapping to declarative knowledge for word problem solving. Transactions of the Association for Computational Linguistics 6 (2018) 159\u2013172.","DOI":"10.1162\/tacl_a_00012"},{"key":"e_1_3_1_183_2","doi-asserted-by":"crossref","unstructured":"S. Roy T. Vieira and D. Roth. 2015. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics 3 (2015) 1\u201313.","DOI":"10.1162\/tacl_a_00118"},{"key":"e_1_3_1_184_2","first-page":"311","volume-title":"Proceedings of the 1992 Workshop on Types for Proofs and Programs","author":"Rudnicki Piotr","year":"1992","unstructured":"Piotr Rudnicki. 1992. An overview of the mizar project. In Proceedings of the 1992 Workshop on Types for Proofs and Programs. 311\u2013330."},{"key":"e_1_3_1_185_2","volume-title":"ICLR","author":"Saxton David","year":"2019","unstructured":"David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. In ICLR."},{"key":"e_1_3_1_186_2","first-page":"1","volume-title":"ICDL","author":"Scharpf Philipp","year":"2022","unstructured":"Philipp Scharpf, Moritz Schubotz, and Bela Gipp. 2022. Mining mathematical documents for question answering via unsupervised formula labeling. In ICDL. 1\u201311."},{"key":"e_1_3_1_187_2","unstructured":"Timo Schick Jane Dwivedi-Yu Roberto Dess\u00ec Roberta Raileanu Maria Lomeli Luke Zettlemoyer Nicola Cancedda and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. 
Advances in Neural Information Processing Systems 36 (2023) 68539\u201368551."},{"key":"e_1_3_1_188_2","unstructured":"John Schulman Filip Wolski Prafulla Dhariwal Alec Radford and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347."},{"key":"e_1_3_1_189_2","unstructured":"Bilgehan Sel Ahmad Al-Tawaha Vanshaj Khattar Lu Wang Ruoxi Jia and Ming Jin. 2023. Algorithm of thoughts: Enhancing exploration of ideas in large language models. Proceedings of the 41st International Conference on Machine Learning. 44136\u201344189."},{"key":"e_1_3_1_190_2","first-page":"1466","volume-title":"EMNLP","author":"Seo Minjoon","year":"2015","unstructured":"Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. 2015. Solving geometry problems: Combining text and diagram interpretation. In EMNLP. 1466\u20131476."},{"key":"e_1_3_1_191_2","doi-asserted-by":"crossref","unstructured":"Jianhao Shen Yichun Yin Lin Li Lifeng Shang Xin Jiang Ming Zhang and Qun Liu. 2021. Generate & Rank: A Multi-task Framework for Math Word Problems.","DOI":"10.18653\/v1\/2021.findings-emnlp.195"},{"key":"e_1_3_1_192_2","volume-title":"ICLR","author":"Shi Freda","year":"2022","unstructured":"Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et\u00a0al. 2022. Language models are multilingual chain-of-thought reasoners. In ICLR."},{"key":"e_1_3_1_193_2","first-page":"1132","volume-title":"EMNLP","author":"Shi Shuming","year":"2015","unstructured":"Shuming Shi, Yuehui Wang, Chin-Yew Lin, Xiaojiang Liu, and Yong Rui. 2015. Automatically solving number word problems by semantic parsing and reasoning. In EMNLP. 1132\u20131142."},{"key":"e_1_3_1_194_2","unstructured":"Wenhao Shi Zhiqiang Hu Yi Bin Junhua Liu Yang Yang See-Kiong Ng Lidong Bing and Roy Ka-Wei Lee. 2024. 
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models."},{"key":"e_1_3_1_195_2","volume-title":"NeurIPS","author":"Shinn Noah","year":"2023","unstructured":"Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS."},{"key":"e_1_3_1_196_2","unstructured":"Kumar Shridhar Harsh Jhamtani Hao Fang Benjamin Van Durme Jason Eisner and Patrick Xia. 2023. SCREWS: A modular framework for reasoning with revisions. arXiv preprint arXiv:2309.13075."},{"key":"e_1_3_1_197_2","doi-asserted-by":"crossref","unstructured":"Kumar Shridhar Jakub Macina Mennatallah El-Assady Tanmay Sinha Manu Kapur and Mrinmaya Sachan. 2022. Automatic generation of socratic subquestions for teaching math word problems. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 4136\u20134149.","DOI":"10.18653\/v1\/2022.emnlp-main.277"},{"key":"e_1_3_1_198_2","first-page":"12113","volume-title":"EMNLP","author":"Shum Kashun","year":"2023","unstructured":"Kashun Shum, Shizhe Diao, and Tong Zhang. 2023. Automatic prompt augmentation and selection with chain-of-thought from labeled data. In EMNLP. 12113\u201312139."},{"key":"e_1_3_1_199_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-71067-7_6"},{"key":"e_1_3_1_200_2","unstructured":"Peiyang Song Kaiyu Yang and Anima Anandkumar. 2024. Towards large language models as copilots for theorem proving in lean. The 3rd Workshop on Mathematical Reasoning and AI at NeurIPS\u201923."},{"key":"e_1_3_1_201_2","first-page":"987","volume-title":"EMNLP","author":"Spithourakis Georgios","year":"2016","unstructured":"Georgios Spithourakis, Isabelle Augenstein, and Sebastian Riedel. 2016. Numerically grounded language models for semantic error correction. In EMNLP. 987\u2013992."},{"key":"e_1_3_1_202_2","first-page":"2104","volume-title":"ACL","author":"Spithourakis G. 
P.","year":"2018","unstructured":"G. P. Spithourakis and S. Riedel. 2018. Numeracy for language models: Evaluating and improving their ability to predict numbers. In ACL, Vol. 56. 2104\u20132115."},{"key":"e_1_3_1_203_2","unstructured":"Yang Sui Yu-Neng Chuang Guanchu Wang Jiamu Zhang Tianyi Zhang Jiayi Yuan Hongyi Liu Andrew Wen Shaochen Zhong Na Zou et\u00a0al. 2025. Stop overthinking: A survey on efficient reasoning for large language models. arXiv:2503.16419. Retrieved from https:\/\/arxiv.org\/abs\/2503.16419"},{"key":"e_1_3_1_204_2","unstructured":"Haotian Sun Yuchen Zhuang Lingkai Kong Bo Dai and Chao Zhang. 2023. AdaPlanner: Adaptive planning from feedback with language models. Advances in Neural Information Processing Systems 36 (2023) 58202\u201358245."},{"key":"e_1_3_1_205_2","unstructured":"Ross Taylor Marcin Kardas Guillem Cucurull Thomas Scialom Anthony Hartshorn Elvis Saravia Andrew Poulton Viktor Kerkez and Robert Stojnic. 2022. Galactica: A Large Language Model for Science."},{"key":"e_1_3_1_206_2","unstructured":"Gemini Team Rohan Anil Sebastian Borgeaud Jean-Baptiste Alayrac Jiahui Yu Radu Soricut Johan Schalkwyk Andrew M. Dai Anja Hauth Katie Millican et\u00a0al. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805."},{"key":"e_1_3_1_207_2","unstructured":"Qwen Team. 2024. QwQ: Reflect deeply on the boundaries of the unknown. Hugging Face."},{"key":"e_1_3_1_208_2","unstructured":"Romal Thoppilan Daniel De Freitas Jamie Hall Noam Shazeer Apoorv Kulshreshtha Heng-Tze Cheng Alicia Jin Taylor Bos Leslie Baker Yu Du et\u00a0al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239."},{"key":"e_1_3_1_209_2","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar et\u00a0al. 2023. LLaMA: Open and efficient foundation language models. 
arXiv preprint arXiv:2302.13971."},{"key":"e_1_3_1_210_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et\u00a0al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288."},{"key":"e_1_3_1_211_2","first-page":"7601","volume-title":"ACL","author":"Trung Luong","year":"2024","unstructured":"Luong Trung, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. 2024. Reft: Reasoning with reinforced fine-tuning. In ACL. 7601\u20137614."},{"key":"e_1_3_1_212_2","volume-title":"Draw: A Challenging and Diverse Algebra Word Problem Set","author":"Upadhyay Shyam","year":"2015","unstructured":"Shyam Upadhyay and Ming-Wei Chang. 2015. Draw: A Challenging and Diverse Algebra Word Problem Set. Technical Report. Citeseer."},{"key":"e_1_3_1_213_2","first-page":"494","volume-title":"EACL","author":"Upadhyay Shyam","year":"2017","unstructured":"Shyam Upadhyay and Ming-Wei Chang. 2017. Annotating derivations: A new evaluation strategy and dataset for algebra word problems. In EACL. 494\u2013504."},{"key":"e_1_3_1_214_2","first-page":"20","volume-title":"BEA","author":"Upadhyay Shriyash","year":"2023","unstructured":"Shriyash Upadhyay, Etan Ginsberg, and Chris Callison-Burch. 2023. Improving mathematics tutoring with a code scratchpad. In BEA. 20\u201328."},{"key":"e_1_3_1_215_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS 30.","journal-title":"NeurIPS"},{"key":"e_1_3_1_216_2","first-page":"5307","volume-title":"EMNLP","author":"Wallace Eric","year":"2019","unstructured":"Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP models know numbers? 
Probing numeracy in embeddings. In EMNLP. 5307\u20135315."},{"key":"e_1_3_1_217_2","unstructured":"Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model."},{"key":"e_1_3_1_218_2","first-page":"758","volume-title":"NLPCC","author":"Wang Cunxiang","year":"2021","unstructured":"Cunxiang Wang, Boyuan Zheng, Yuchen Niu, and Yue Zhang. 2021. Exploring generalization ability of pretrained language models on arithmetic and logical reasoning. In NLPCC. 758\u2013769."},{"key":"e_1_3_1_219_2","doi-asserted-by":"crossref","unstructured":"Ke Wang Junting Pan Weikang Shi Zimu Lu Mingjie Zhan and Hongsheng Li. 2024. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems 37 (2024) 95095\u201395169.","DOI":"10.52202\/079017-3014"},{"key":"e_1_3_1_220_2","unstructured":"Peng Wang Shuai Bai Sinan Tan Shijie Wang Zhihao Fan Jinze Bai Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge et\u00a0al. 2024. Qwen2-VL: Enhancing vision-language model\u2019s perception of the world at any resolution. arXiv preprint arXiv:2409.12191."},{"key":"e_1_3_1_221_2","unstructured":"Peiyi Wang Lei Li Liang Chen Feifan Song Binghuai Lin Yunbo Cao Tianyu Liu and Zhifang Sui. 2023. Making large language models better reasoners with alignment. arXiv preprint arXiv:2309.02144."},{"key":"e_1_3_1_222_2","doi-asserted-by":"crossref","unstructured":"Weihan Wang Qingsong Lv Wenmeng Yu Wenyi Hong Ji Qi Yan Wang Junhui Ji Zhuoyi Yang Lei Zhao Xixuan Song et\u00a0al. 2023. CogVLM: Visual expert for pretrained language models. Advances in Neural Information Processing Systems 37 (2024) 121475\u2013121499.","DOI":"10.52202\/079017-3860"},{"key":"e_1_3_1_223_2","unstructured":"Xingyao Wang Zihan Wang Jiateng Liu Yangyi Chen Lifan Yuan Hao Peng and Heng Ji. 2023. Mint: Evaluating LLMs in multi-turn interaction with tools and language feedback. 
12th International Conference on Learning Representations ICLR 2024."},{"key":"e_1_3_1_224_2","volume-title":"ICLR","author":"Wang Xuezhi","year":"2023","unstructured":"Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In ICLR."},{"key":"e_1_3_1_225_2","unstructured":"Yifan Wang and Yun Fu. 2024. Understanding Abstracting and Checking: Evoking Complicated Multimodal Reasoning in LMMs."},{"key":"e_1_3_1_226_2","unstructured":"Yue Wang Qiuzhi Liu Jiahao Xu Tian Liang Xingyu Chen Zhiwei He Linfeng Song Dian Yu Juntao Li Zhuosheng Zhang et\u00a0al. 2025. Thoughts are all over the place: On the underthinking of o1-like LLMs. arXiv:2501.18585. Retrieved from https:\/\/arxiv.org\/abs\/2501.18585"},{"key":"e_1_3_1_227_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1088"},{"key":"e_1_3_1_228_2","unstructured":"Zengzhi Wang Rui Xia and Pengfei Liu. 2023. Generative AI for math: Part I\u2013MathPile: A billion-token-scale pretraining corpus for math. arXiv preprint arXiv:2312.17120."},{"key":"e_1_3_1_229_2","unstructured":"J. Wei X. Wang D. Schuurmans M. Bosma F. Xia E. Chi Q. Le and D. Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022) 24824\u201324837."},{"key":"e_1_3_1_230_2","volume-title":"NeurIPS Datasets and Benchmarks Track (Round 1)","author":"Welleck Sean","year":"2021","unstructured":"Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, and Kyunghyun Cho. 2021. NaturalProofs: Mathematical theorem proving in natural language. 
In NeurIPS Datasets and Benchmarks Track (Round 1)."},{"key":"e_1_3_1_231_2","first-page":"4913","article-title":"Naturalprover: Grounded mathematical proof generation with language models","volume":"35","author":"Welleck Sean","year":"2022","unstructured":"Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, and Yejin Choi. 2022. Naturalprover: Grounded mathematical proof generation with language models. In NeurIPS 35 (2022), 4913\u20134927.","journal-title":"NeurIPS"},{"key":"e_1_3_1_232_2","volume-title":"NeurIPS","author":"Welleck Sean","year":"2023","unstructured":"Sean Welleck and Rahul Saha. 2023. LLMSTEP: LLM proofstep suggestions in lean. In NeurIPS."},{"key":"e_1_3_1_233_2","first-page":"33","volume-title":"TPHOLs","author":"Wenzel Makarius","year":"2008","unstructured":"Makarius Wenzel, Lawrence C. Paulson, and Tobias Nipkow. 2008. The Isabelle framework. In TPHOLs. 33\u201338."},{"key":"e_1_3_1_234_2","volume-title":"ICLR","author":"Wu Yuhuai","year":"2021","unstructured":"Yuhuai Wu, Albert Jiang, Jimmy Ba, and Roger Baker Grosse. 2021. INT: An inequality benchmark for evaluating generalization in theorem proving. In ICLR."},{"key":"e_1_3_1_235_2","volume-title":"NeurIPS","author":"Wu Yuhuai","year":"2022","unstructured":"Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. 2022. Autoformalization with large language models. In NeurIPS."},{"key":"e_1_3_1_236_2","unstructured":"Yuhuai Wu Markus Rabe and Wenda Li. [n.d.]. LIME: Learning inductive bias for primitives of mathematical reasoning. ([n.d.])."},{"key":"e_1_3_1_237_2","doi-asserted-by":"crossref","unstructured":"Shijie Xia Xuefeng Li Yixin Liu Tongshuang Wu and Pengfei Liu. 2025. Evaluating mathematical reasoning beyond accuracy. 
Proceedings of the AAAI Conference on Artificial Intelligence 39 26 (2025) 27723\u201327730.","DOI":"10.1609\/aaai.v39i26.34987"},{"key":"e_1_3_1_238_2","unstructured":"Kun Xiang Zhili Liu Zihao Jiang Yunshuang Nie Runhui Huang Haoxiang Fan Hanhui Li Weiran Huang Yihan Zeng Jianhua Han et\u00a0al. 2024. AtomThink: A slow thinking framework for multimodal mathematical reasoning. arXiv preprint arXiv:2411.11930."},{"key":"e_1_3_1_239_2","unstructured":"Can Xu Qingfeng Sun Kai Zheng Xiubo Geng Pu Zhao Jiazhan Feng Chongyang Tao and Daxin Jiang. 2023. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244."},{"key":"e_1_3_1_240_2","unstructured":"Guowei Xu Peng Jin Li Hao Yibing Song Lichao Sun and Li Yuan. 2024. LLaVA-o1: Let vision language models reason step-by-step. Findings of the Association for Computational Linguistics: ACL 2025. 24290\u201324315."},{"key":"e_1_3_1_241_2","doi-asserted-by":"crossref","unstructured":"Jundong Xu Hao Fei Liangming Pan Qian Liu Mong-Li Lee and Wynne Hsu. 2024. Faithful logical reasoning via symbolic chain-of-thought. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 13326\u201313365.","DOI":"10.18653\/v1\/2024.acl-long.720"},{"key":"e_1_3_1_242_2","doi-asserted-by":"crossref","unstructured":"Yibo Yan Jiamin Su Jianxiang He Fangteng Fu Xu Zheng Yuanhuiyi Lyu Kun Wang Shen Wang Qingsong Wen and Xuming Hu. 2025. A survey of mathematical reasoning in the era of multimodal large language model: Benchmark method & challenges. Findings of the Association for Computational Linguistics: ACL 2025. 11798\u201311827.","DOI":"10.18653\/v1\/2025.findings-acl.614"},{"key":"e_1_3_1_243_2","unstructured":"An Yang Beichen Zhang Binyuan Hui Bofei Gao Bowen Yu Chengpeng Li Dayiheng Liu Jianhong Tu Jingren Zhou Junyang Lin et\u00a0al. 2024. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. 
arXiv preprint arXiv:2409.12122."},{"key":"e_1_3_1_244_2","first-page":"6984","volume-title":"ICML","author":"Yang Kaiyu","year":"2019","unstructured":"Kaiyu Yang and Jia Deng. 2019. Learning to prove theorems via interacting with proof assistants. In ICML. 6984\u20136994."},{"key":"e_1_3_1_245_2","unstructured":"Zhen Yang Ming Ding Qingsong Lv Zhihuan Jiang Zehai He Yuyi Guo Jinfeng Bai and Jie Tang. 2023. GPT can solve mathematical problems without a calculator. arXiv preprint arXiv:2309.03241."},{"key":"e_1_3_1_246_2","doi-asserted-by":"crossref","unstructured":"Zhicheng Yang Jinghui Qin Jiaqi Chen Liang Lin and Xiaodan Liang. 2022. Logicsolver: Towards interpretable math word problem solving with logical prompt-enhanced learning. Findings of the Association for Computational Linguistics: EMNLP 2022. 1\u201313.","DOI":"10.18653\/v1\/2022.findings-emnlp.1"},{"key":"e_1_3_1_247_2","unstructured":"Shunyu Yao Dian Yu Jeffrey Zhao Izhak Shafran Thomas L. Griffiths Yuan Cao and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36 (2023) 11809\u201311822."},{"key":"e_1_3_1_248_2","doi-asserted-by":"crossref","unstructured":"Yao Yao Zuchao Li and Hai Zhao. 2023. Beyond chain-of-thought effective graph-of-thought reasoning in large language models. arXiv preprint arXiv:2305.16582.","DOI":"10.18653\/v1\/2024.findings-naacl.183"},{"key":"e_1_3_1_249_2","doi-asserted-by":"crossref","unstructured":"Shuo Yin Weihao You Zhilong Ji Guoqiang Zhong and Jinfeng Bai. 2024. MuMath-code: Combining tool-use large language models with multi-perspective data augmentation for mathematical reasoning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 
4770\u20134785.","DOI":"10.18653\/v1\/2024.emnlp-main.274"},{"key":"e_1_3_1_250_2","unstructured":"Huaiyuan Ying Shuo Zhang Linyang Li Zhejian Zhou Yunfan Shao Zhaoye Fei Yichuan Ma Jiawei Hong Kuikun Liu Ziyi Wang et\u00a0al. 2024. InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning."},{"key":"e_1_3_1_251_2","doi-asserted-by":"crossref","unstructured":"Ori Yoran Tomer Wolfson Ben Bogin Uri Katz Daniel Deutch and Jonathan Berant. 2023. Answering questions by meta-reasoning over multiple chains of thought. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 5942\u20135966.","DOI":"10.18653\/v1\/2023.emnlp-main.364"},{"key":"e_1_3_1_252_2","unstructured":"Junchi Yu Ran He and Rex Ying. 2023. Thought propagation: An analogical approach to complex reasoning with large language models. The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_1_253_2","unstructured":"Longhui Yu Weisen Jiang Han Shi Jincheng Yu Zhengying Liu Yu Zhang James T. Kwok Zhenguo Li Adrian Weller and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models. The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_1_254_2","unstructured":"Lifan Yuan Ganqu Cui Hanbin Wang Ning Ding Xingyao Wang Jia Deng Boji Shan Huimin Chen Ruobing Xie Yankai Lin et\u00a0al. 2024. Advancing LLM reasoning generalists with preference trees. The Thirteenth International Conference on Learning Representations."},{"key":"e_1_3_1_255_2","unstructured":"Zheng Yuan Hongyi Yuan Chengpeng Li Guanting Dong Chuanqi Tan and Chang Zhou. 2023. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825."},{"key":"e_1_3_1_256_2","unstructured":"Zheng Yuan Hongyi Yuan Chuanqi Tan Wei Wang and Songfang Huang. 2023. 
How well do large language models perform in arithmetic tasks? arXiv preprint arXiv:2304.02015."},{"key":"e_1_3_1_257_2","doi-asserted-by":"crossref","unstructured":"Xiang Yue Yuansheng Ni Kai Zhang Tianyu Zheng Ruoqi Liu Ge Zhang Samuel Stevens Dongfu Jiang Weiming Ren Yuxuan Sun et\u00a0al. 2024. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 9556\u20139567.","DOI":"10.1109\/CVPR52733.2024.00913"},{"key":"e_1_3_1_258_2","unstructured":"Xiang Yue Xingwei Qu Ge Zhang Yao Fu Wenhao Huang Huan Sun Yu Su and Wenhu Chen. 2023. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653."},{"key":"e_1_3_1_259_2","unstructured":"Eric Zelikman Georges Harik Yijia Shao Varuna Jayasiri Nick Haber and Noah D. Goodman. 2024. Quiet-star: Language models can teach themselves to think before speaking. First Conference on Language Modeling."},{"key":"e_1_3_1_260_2","first-page":"15476","article-title":"Star: Bootstrapping reasoning with reasoning","volume":"35","author":"Zelikman Eric","year":"2022","unstructured":"Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. 2022. Star: Bootstrapping reasoning with reasoning. In NeurIPS 35 (2022), 15476\u201315488.","journal-title":"NeurIPS"},{"key":"e_1_3_1_261_2","unstructured":"Andy Zeng Maria Attarian Brian Ichter Krzysztof Choromanski Adrian Wong Stefan Welker Federico Tombari Aveek Purohit Michael Ryoo Vikas Sindhwani et\u00a0al. 2022. Socratic models: Composing zero-shot multimodal reasoning with language. The Eleventh International Conference on Learning Representations."},{"key":"e_1_3_1_262_2","unstructured":"Di Zhang Xiaoshui Huang Dongzhan Zhou Yuqiang Li and Wanli Ouyang. 2024. Accessing GPT-4 level mathematical olympiad solutions via Monte Carlo tree self-refine with LLaMa-3 8B. 
arXiv preprint arXiv:2406.07394."},{"key":"e_1_3_1_263_2","doi-asserted-by":"crossref","unstructured":"Mengxue Zhang Zichao Wang Zhichao Yang Weiqi Feng and Andrew Lan. 2023. Interpretable math word problem solution generation via step-by-step planning. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6858\u20136877.","DOI":"10.18653\/v1\/2023.acl-long.379"},{"key":"e_1_3_1_264_2","first-page":"1636","volume-title":"IJCAI","author":"Zhang Ming-Liang","year":"2022","unstructured":"Ming-Liang Zhang, Fei Yin, Yi-Han Hao, and Cheng-Lin Liu. 2022. Plane geometry diagram parsing. In IJCAI. 1636\u20131643."},{"key":"e_1_3_1_265_2","first-page":"3374","volume-title":"IJCAI","author":"Zhang Ming-Liang","year":"2023","unstructured":"Ming-Liang Zhang, Fei Yin, and Cheng-Lin Liu. 2023. A multi-modal neural geometric solver with textual clauses parsed from diagram. In IJCAI. 3374\u20133382."},{"key":"e_1_3_1_266_2","first-page":"169","volume-title":"ECCV","author":"Zhang Renrui","year":"2025","unstructured":"Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et\u00a0al. 2025. Mathverse: Does your multi-modal LLM truly see the diagrams in visual math problems? In ECCV. 169\u2013186."},{"key":"e_1_3_1_267_2","unstructured":"Shengyu Zhang Linfeng Dong Xiaoya Li Sen Zhang Xiaofei Sun Shuhe Wang Jiwei Li Runyi Hu Tianwei Zhang Fei Wu et\u00a0al. 2023. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792."},{"key":"e_1_3_1_268_2","first-page":"4889","volume-title":"EMNLP","author":"Zhang Xikun","year":"2020","unstructured":"Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth. 2020. Do language embeddings capture scales? In EMNLP. 
4889\u20134896."},{"key":"e_1_3_1_269_2","doi-asserted-by":"crossref","first-page":"292","DOI":"10.18653\/v1\/2020.blackboxnlp-1.27","volume-title":"BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP","author":"Zhang Xikun","year":"2020","unstructured":"Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth. 2020. Do language embeddings capture scales? In BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. 292\u2013299."},{"key":"e_1_3_1_270_2","unstructured":"Yifan Zhang Jingqin Yang Yang Yuan and Andrew Chi-Chih Yao. 2023. Cumulative reasoning with large language models. arXiv preprint arXiv:2308.04371."},{"key":"e_1_3_1_271_2","doi-asserted-by":"crossref","unstructured":"Zhihan Zhang Tao Ge Zhenwen Liang Wenhao Yu Dian Yu Mengzhao Jia Dong Yu and Meng Jiang. 2024. Learn beyond the answer: Training language models with reflection for mathematical reasoning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 14720\u201314738.","DOI":"10.18653\/v1\/2024.emnlp-main.817"},{"key":"e_1_3_1_272_2","volume-title":"ICLR","author":"Zhang Zhuosheng","year":"2023","unstructured":"Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2023. Automatic chain of thought prompting in large language models. In ICLR."},{"key":"e_1_3_1_273_2","first-page":"16361","volume-title":"EMNLP","author":"Zhao Jun","year":"2024","unstructured":"Jun Zhao, Jingqi Tong, Yurong Mou, Ming Zhang, Qi Zhang, and Xuan-Jing Huang. 2024. Exploring the compositional deficiency of large language models in mathematical reasoning through trap problems. In EMNLP. 16361\u201316376."},{"key":"e_1_3_1_274_2","doi-asserted-by":"crossref","unstructured":"Ruochen Zhao Xingxuan Li Shafiq Joty Chengwei Qin and Lidong Bing. 2023. Verify-and-edit: A knowledge-enhanced chain-of-thought framework. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 
5823\u20135840.","DOI":"10.18653\/v1\/2023.acl-long.320"},{"key":"e_1_3_1_275_2","unstructured":"Wayne Xin Zhao Kun Zhou Junyi Li Tianyi Tang Xiaolei Wang Yupeng Hou Yingqian Min Beichen Zhang Junjie Zhang Zican Dong et\u00a0al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223."},{"key":"e_1_3_1_276_2","first-page":"6588","volume-title":"ACL","author":"Zhao Yilun","year":"2022","unstructured":"Yilun Zhao, Yunxiang Li, Chenying Li, and Rui Zhang. 2022. MultiHiertt: Numerical reasoning over multi hierarchical tabular and textual data. In ACL. 6588\u20136600."},{"key":"e_1_3_1_277_2","unstructured":"Yu Zhao Huifeng Yin Bo Zeng Hao Wang Tianqi Shi Chenyang Lyu Longyue Wang Weihua Luo and Kaifu Zhang. 2024. Marco-o1: Towards open reasoning models for open-ended solutions. CoRR\u201924."},{"key":"e_1_3_1_278_2","unstructured":"Zilong Zhao Yao Rong Dongyang Guo Emek G\u00f6zl\u00fckl\u00fc Emir G\u00fclboy and Enkelejda Kasneci. 2024. Stepwise self-consistent mathematical reasoning with large language models. arXiv preprint arXiv:2402.17786."},{"key":"e_1_3_1_279_2","volume-title":"ICLR","author":"Zheng Kunhao","year":"2023","unstructured":"Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. 2023. miniF2F: A cross-system benchmark for formal Olympiad-level mathematics. In ICLR."},{"key":"e_1_3_1_280_2","doi-asserted-by":"crossref","unstructured":"Wanjun Zhong Ruixiang Cui Yiduo Guo Yaobo Liang Shuai Lu Yanlin Wang Amin Saied Weizhu Chen and Nan Duan. 2023. AGIEval: A human-centric benchmark for evaluating foundation models. Findings of the Association for Computational Linguistics: NAACL 2024. 2299\u20132314.","DOI":"10.18653\/v1\/2024.findings-naacl.149"},{"key":"e_1_3_1_281_2","unstructured":"Aojun Zhou Ke Wang Zimu Lu Weikang Shi Sichun Luo Zipeng Qin Shaoqing Lu Anya Jia Linqi Song Mingjie Zhan et\u00a0al. 2023. Solving challenging math word problems using GPT-4 code interpreter with code-based self-verification. 
The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_1_282_2","unstructured":"Andy Zhou Kai Yan Michal Shlapentokh-Rothman Haohan Wang and Yu-Xiong Wang. 2023. Language agent tree search unifies reasoning acting and planning in language models. Proceedings of the 41st International Conference on Machine Learning. 62138\u201362160."},{"key":"e_1_3_1_283_2","unstructured":"Hattie Zhou Azade Nova Hugo Larochelle Aaron Courville Behnam Neyshabur and Hanie Sedghi. 2022. Teaching Algorithmic Reasoning via In-Context Learning."},{"key":"e_1_3_1_284_2","first-page":"817","volume-title":"EMNLP","author":"Zhou Lipu","year":"2015","unstructured":"Lipu Zhou, Shuaixiang Dai, and Liwei Chen. 2015. Learn to solve algebra word problems using quadratic programming. In EMNLP. 817\u2013822."},{"key":"e_1_3_1_285_2","first-page":"42602","volume-title":"ICML","author":"Zhou Wangchunshu","year":"2023","unstructured":"Wangchunshu Zhou, Yuchen Eleanor Jiang, Ethan Wilcox, Ryan Cotterell, and Mrinmaya Sachan. 2023. Controlled text generation with natural language instructions. In ICML, Vol. 202. 42602\u201342613."},{"key":"e_1_3_1_286_2","doi-asserted-by":"crossref","unstructured":"Zhehua Zhou Jiayang Song Kunpeng Yao Zhan Shu and Lei Ma. 2023. ISR-LLM: Iterative self-refined large language model for long-horizon sequential task planning. 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE 2081\u20132088.","DOI":"10.1109\/ICRA57147.2024.10610065"},{"key":"e_1_3_1_287_2","doi-asserted-by":"crossref","unstructured":"Zihao Zhou Qiufeng Wang Mingyu Jin Jie Yao Jianan Ye Wei Liu Wei Wang Xiaowei Huang and Kaizhu Huang. 2023. MathAttack: Attacking large language models towards math solving ability. 
Proceedings of the AAAI Conference on Artificial Intelligence 38 17 (2023) 19750\u201319758.","DOI":"10.1609\/aaai.v38i17.29949"},{"key":"e_1_3_1_288_2","first-page":"3277","volume-title":"ACL","author":"Zhu Fengbin","year":"2021","unstructured":"Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In ACL. 3277\u20133287."},{"key":"e_1_3_1_289_2","article-title":"Solving math word problems via cooperative reasoning induced language models","author":"Zhu Xinyu","year":"2022","unstructured":"Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. 2022. Solving math word problems via cooperative reasoning induced language models. arXiv (2022).","journal-title":"arXiv"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3773985","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,8]],"date-time":"2025-12-08T14:09:13Z","timestamp":1765202953000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3773985"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,8]]},"references-count":288,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2026,4,30]]}},"alternative-id":["10.1145\/3773985"],"URL":"https:\/\/doi.org\/10.1145\/3773985","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,8]]},"assertion":[{"value":"2023-12-18","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2025-10-17","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}