{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T14:28:11Z","timestamp":1765376891246,"version":"3.46.0"},"reference-count":30,"publisher":"Association for Computing Machinery (ACM)","issue":"12","funder":[{"name":"National Key Research and Development Program of China","award":["2023YFE0116400"],"award-info":[{"award-number":["2023YFE0116400"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2025,12,31]]},"abstract":"<jats:p>\n                    To thoroughly assess the mathematical reasoning abilities of Large Language Models (LLMs), we need to carefully curate evaluation datasets covering diverse mathematical concepts and mathematical problems at different difficulty levels. In pursuit of this objective, we propose FineMath in this article, a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. FineMath is created to cover the major key mathematical concepts taught in elementary school math, which are further divided into 17 categories of math word problems, enabling in-depth analysis of mathematical reasoning abilities of LLMs. All the 17 categories of math word problems are manually annotated with their difficulty levels according to the number of reasoning steps required to solve these problems. We conduct extensive experiments on a wide range of LLMs on FineMath and find that there is still considerable room for improvements in terms of mathematical reasoning capability of Chinese LLMs. We also carry out an in-depth analysis on the evaluation process and methods that have been overlooked previously. These two factors significantly influence the model results and our understanding of their mathematical reasoning capabilities. 
Our data is available at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/tjunlp-lab\/FineMATH\">https:\/\/github.com\/tjunlp-lab\/FineMATH<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1145\/3773083","type":"journal-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T10:11:24Z","timestamp":1761387084000},"page":"1-15","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models"],"prefix":"10.1145","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-2283-1725","authenticated-orcid":false,"given":"Yan","family":"Liu","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology\/ TJUNLP Lab, Tianjin University","place":["Tianjin, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-7452-9883","authenticated-orcid":false,"given":"Renren","family":"Jin","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology\/ TJUNLP Lab, Tianjin University","place":["Tianjin, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-5406-8967","authenticated-orcid":false,"given":"Ling","family":"Shi","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology\/ TJUNLP Lab, Tianjin University","place":["Tianjin, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9007-3976","authenticated-orcid":false,"given":"Zheng","family":"Yao","sequence":"additional","affiliation":[{"name":"The University of Queensland","place":["Brisbane, 
Australia"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2353-5038","authenticated-orcid":false,"given":"Deyi","family":"Xiong","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology\/ TJUNLP Lab, Tianjin University","place":["Tianjin, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,12,10]]},"reference":[{"key":"e_1_3_3_2_2","article-title":"Qwen technical report","author":"Bai Jinze","year":"2023","unstructured":"Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et\u00a0al. 2023. Qwen technical report. arXiv:2309.16609. Retrieved from https:\/\/arxiv.org\/abs\/2309.16609","journal-title":"arXiv:2309.16609"},{"key":"e_1_3_3_3_2","unstructured":"Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et\u00a0al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165. Retrieved from https:\/\/arxiv.org\/abs\/2005.14165"},{"key":"e_1_3_3_4_2","unstructured":"Karl Cobbe Vineet Kosaraju Mohammad Bavarian Mark Chen Heewoo Jun Lukasz Kaiser Matthias Plappert Jerry Tworek Jacob Hilton Reiichiro Nakano et\u00a0al. 2021. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168. Retrieved from https:\/\/arxiv.org\/abs\/2110.14168"},{"key":"e_1_3_3_5_2","unstructured":"Team GLM Aohan Zeng Bin Xu Bowen Wang Chenhui Zhang Da Yin Diego Rojas Guanyu Feng Hanlin Zhao Hanyu Lai et\u00a0al. 2024. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv:2406.12793. Retrieved from https:\/\/arxiv.org\/abs\/2406.12793"},{"key":"e_1_3_3_6_2","unstructured":"Zishan Guo Renren Jin Chuang Liu Yufei Huang Dan Shi Linhao Yu Yan Liu Jiaxuan Li Bojian Xiong and Deyi Xiong. 2023. Evaluating large language models: A comprehensive survey. 
arXiv:2310.19736. Retrieved from https:\/\/arxiv.org\/abs\/2310.19736"},{"volume-title":"Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)","author":"Hendrycks Dan","key":"e_1_3_3_7_2","unstructured":"Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. [n.d.]. Measuring mathematical problem solving with the MATH dataset. In Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)."},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1058"},{"key":"e_1_3_3_9_2","unstructured":"Zijian Hu and Meng Jiang. 2022. Heterogeneous line graph transformer for math word problems. arXiv:2208.05645. Retrieved from https:\/\/arxiv.org\/abs\/2208.05645"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-emnlp.68"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN54540.2023.10191776"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00160"},{"key":"e_1_3_3_13_2","first-page":"11307","volume-title":"Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)","author":"Liang Zhenwen","year":"2024","unstructured":"Zhenwen Liang, Dian Yu, Xiaoman Pan, Wenlin Yao, Qingkai Zeng, Xiangliang Zhang, and Dong Yu. 2024. MinT: Boosting generalization in mathematical reasoning via multi-view fine-tuning. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 
11307\u201311318."},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.findings-naacl.74"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1015"},{"key":"e_1_3_3_16_2","unstructured":"Chuang Liu Renren Jin Yuqi Ren Linhao Yu Tianyu Dong Xiaohan Peng Shuting Zhang Jianxiang Peng Peiyi Zhang Qingqing Lyu et\u00a0al. 2023. M3KE: A massive multi-level multi-subject knowledge evaluation benchmark for chinese large language models. arXiv:2305.10263. Retrieved from https:\/\/arxiv.org\/abs\/2305.10263"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2025.findings-emnlp.542"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.92"},{"key":"e_1_3_3_19_2","unstructured":"OpenAI Josh Achiam Steven Adler Sandhini Agarwal Lama Ahmad Ilge Akkaya Florencia Leoni Aleman Diogo Almeida Janko Altenschmidt Sam Altman et\u00a0al. 2024. GPT-4 Technical Report. arXiv:2303.08774. Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.168"},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1202"},{"key":"e_1_3_3_22_2","unstructured":"Tianhao Shen Renren Jin Yufei Huang Chuang Liu Weilong Dong Zishan Guo Xinwei Wu Yan Liu and Deyi Xiong. 2023. Large Language Model Alignment: A Survey. arXiv:2309.15025. Retrieved from https:\/\/arxiv.org\/abs\/2309.15025"},{"key":"e_1_3_3_23_2","article-title":"MOSS: An Open Conversational Large Language Model","author":"Sun Tianxiang","year":"2024","unstructured":"Tianxiang Sun, Xiaotian Zhang, Zhengfu He, Peng Li, Qinyuan Cheng, Xiangyang Liu, Hang Yan, Yunfan Shao, Qiong Tang, Shiduo Zhang, et\u00a0al. 2024. MOSS: An Open Conversational Large Language Model. Machine Intelligence Research. 
Retrieved from https:\/\/github.com\/OpenMOSS\/MOSS","journal-title":"Machine Intelligence Research"},{"key":"e_1_3_3_24_2","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824\u201324837.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_25_2","unstructured":"Tianwen Wei Jian Luan Wei Liu Shuang Dong and Bin Wang. 2023. CMATH: Can your language model pass chinese elementary school math test? arXiv:2306.16636. Retrieved from https:\/\/arxiv.org\/abs\/2306.16636"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2025.findings-acl.1083"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.455"},{"key":"e_1_3_3_28_2","article-title":"Self-consistent reasoning for solving math word problems","author":"Xiong Jing","year":"2022","unstructured":"Jing Xiong, Zhongwei Wan, Xiping Hu, Min Yang, and Chengming Li. 2022. Self-consistent reasoning for solving math word problems. arXiv:2210.15373. Retrieved from https:\/\/arxiv.org\/abs\/2210.15373","journal-title":"arXiv:2210.15373"},{"key":"e_1_3_3_29_2","unstructured":"Zhen Yang Ming Ding Qingsong Lv Zhihuan Jiang Zehai He Yuyi Guo Jinfeng Bai and Jie Tang. 2023. Gpt can solve mathematical problems without a calculator. arXiv:2309.03241. Retrieved from https:\/\/arxiv.org\/abs\/2309.03241"},{"key":"e_1_3_3_30_2","first-page":"466","volume-title":"Proceedings of the 31st International Conference on Computational Linguistics: Industry Track","author":"Zhang Shaowei","year":"2025","unstructured":"Shaowei Zhang and Deyi Xiong. 2025. 
BackMATH: Towards backward reasoning for solving math problems step by step. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track. 466\u2013482."},{"key":"e_1_3_3_31_2","unstructured":"Wei Zhao Mingyue Shang Yang Liu Liang Wang and Jingming Liu. 2020. Ape210K: A Large-Scale and Template-Rich Dataset of Math Word Problems. arXiv:2009.11506. Retrieved from https:\/\/arxiv.org\/abs\/2009.11506"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3773083","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T14:24:59Z","timestamp":1765376699000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3773083"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,10]]},"references-count":30,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,12,31]]}},"alternative-id":["10.1145\/3773083"],"URL":"https:\/\/doi.org\/10.1145\/3773083","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2025,12,10]]},"assertion":[{"value":"2024-12-29","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-04","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}