{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,20]],"date-time":"2026-06-20T16:52:31Z","timestamp":1781974351954,"version":"3.54.5"},"reference-count":154,"publisher":"Association for Natural Language Processing","issue":"2","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Journal of Natural Language Processing"],"published-print":{"date-parts":[[2025]]},"DOI":"10.5715\/jnlp.32.520","type":"journal-article","created":{"date-parts":[[2025,6,14]],"date-time":"2025-06-14T22:09:37Z","timestamp":1749938977000},"page":"520-571","source":"Crossref","is-referenced-by-count":1,"title":["Toward Enhancing Reasoning Capabilities of LLMs: An Approach via Synthetic Logic Corpus","\u5927\u898f\u6a21\u8a00\u8a9e\u30e2\u30c7\u30eb\u306b\u63a8\u8ad6\u3092\u6559\u3048\u308b\u305f\u3081\u306e\u4eba\u5de5\u8ad6\u7406\u63a8\u8ad6\u30b3\u30fc\u30d1\u30b9\u3092\u7528\u3044\u305f\u30a2\u30d7\u30ed\u30fc\u30c1"],"prefix":"10.5715","volume":"32","author":[{"given":"Terufumi","family":"Morishita","sequence":"first","affiliation":[{"name":"Advanced AI Innovation Center, Hitachi"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Gaku","family":"Morio","sequence":"additional","affiliation":[{"name":"Advanced AI Innovation Center, Hitachi"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Atsuki","family":"Yamaguchi","sequence":"additional","affiliation":[{"name":"The University of Sheffield"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yasuhiro","family":"Sogawa","sequence":"additional","affiliation":[{"name":"Advanced AI Innovation Center, Hitachi"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"3685","reference":[{"key":"1","unstructured":"AI@Meta (2024). 
\u201cLlama 3 Model Card.\u201d https:\/\/github.com\/meta-llama\/llama3\/blob\/main\/MODEL_CARD.md."},{"key":"2","unstructured":"Amini, A., Gabriel, S., Lin, S., Koncel-Kedziorski, R., Choi, Y., and Hajishirzi, H. (2019). \u201cMathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms.\u201d In Burstein, J., Doran, C., and Solorio, T. (Eds.), <i>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)<\/i>, pp. 2357\u20132367, Minneapolis, Minnesota. Association for Computational Linguistics."},{"key":"3","unstructured":"Ando, R., Morishita, T., Abe, H., Mineshima, K., and Okada, M. (2023). \u201cEvaluating Large Language Models with NeuBAROCO: Syllogistic Reasoning Ability and Human-like Biases.\u201d In Chatzikyriakidis, S. and de Paiva, V. (Eds.), <i>Proceedings of the 4th Natural Logic Meets Machine Learning Workshop<\/i>, pp. 1\u201311, Nancy, France. Association for Computational Linguistics."},{"key":"4","unstructured":"Anil, C., Wu, Y., Andreassen, A., Lewkowycz, A., Misra, V., Ramasesh, V., Slone, A., Gur-Ari, G., Dyer, E., and Neyshabur, B. (2022). \u201cExploring Length Generalization in Large Language Models.\u201d In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (Eds.), <i>Advances in Neural Information Processing Systems<\/i>, Vol. 35, pp. 38546\u201338556. Curran Associates, Inc."},{"key":"5","doi-asserted-by":"crossref","unstructured":"Aoki, Y., Kudo, K., Kuribayashi, T., Sone, S., Taniguchi, M., Sakaguchi, K., and Inui, K. (2024). \u201cFirst Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning.\u201d In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (Eds.), <i>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing<\/i>, pp. 14255\u201314271, Miami, Florida, USA. 
Association for Computational Linguistics.","DOI":"10.18653\/v1\/2024.emnlp-main.789"},{"key":"6","unstructured":"Arpit, D., Jastrz\u0229bski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., and Lacoste-Julien, S. (2017). \u201cA Closer Look at Memorization in Deep Networks.\u201d."},{"key":"7","unstructured":"Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. (2021). \u201cProgram Synthesis with Large Language Models.\u201d <i>arXiv preprint arXiv:2108.07732<\/i>."},{"key":"8","unstructured":"Bao, Q., Peng, A. Y., Hartill, T., Tan, N., Deng, Z., Witbrock, M., and Liu, J. (2022). \u201cMulti-Step Deductive Reasoning Over Natural Language: An Empirical Study on Out-of-Distribution Generalisation.\u201d <i>arXiv preprint arXiv:2207.14000<\/i>."},{"key":"9","unstructured":"Bean, A. M., Hellsten, S., Mayne, H., Magomere, J., Chi, E. A., Chi, R., Hale, S. A., and Kirk, H. R. (2024). \u201cLINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages.\u201d <i>arXiv preprint arXiv:2406.06196<\/i>."},{"key":"10","unstructured":"Ben Allal, L., Lozhkov, A., Penedo, G., Wolf, T., and von Werra, L. (2024). \u201cCosmopedia.\u201d https:\/\/huggingface.co\/datasets\/HuggingFaceTB\/cosmopedia."},{"key":"11","unstructured":"Ben Allal, L., Muennighoff, N., Kumar Umapathi, L., Lipkin, B., and von Werra, L. (2022). \u201cA framework for the evaluation of code generation models.\u201d https:\/\/github.com\/bigcode-project\/bigcode-evaluation-harness."},{"key":"12","doi-asserted-by":"crossref","unstructured":"Bertolazzi, L., Gatt, A., and Bernardi, R. (2024). \u201cA Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences.\u201d <i>arXiv preprint arXiv:2406.11341<\/i>.","DOI":"10.18653\/v1\/2024.emnlp-main.769"},{"key":"13","unstructured":"Bertrand, R. (1946). 
<i>A History of Western Philosophy<\/i>."},{"key":"14","unstructured":"Betz, G., Voigt, C., and Richardson, K. (2021). \u201cCritical Thinking for Language Models.\u201d In <i>Proceedings of the 14th International Conference on Computational Semantics (IWCS)<\/i>, pp. 63\u201375, Groningen, The Netherlands (online). Association for Computational Linguistics."},{"key":"15","unstructured":"Bhagavatula, C., Bras, R. L., Malaviya, C., Sakaguchi, K., Holtzman, A., Rashkin, H., Downey, D., Yih, S. W.-t., and Choi, Y. (2019). \u201cAbductive Commonsense Reasoning.\u201d <i>arXiv preprint arXiv:1908.05739<\/i>."},{"key":"16","doi-asserted-by":"crossref","unstructured":"Bhuiya, N., Schlegel, V., and Winkler, S. (2024). \u201cSeemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers?\u201d <i>arXiv preprint arXiv:2409.05197<\/i>.","DOI":"10.18653\/v1\/2024.emnlp-main.147"},{"key":"17","unstructured":"Bond, F. and Foster, R. (2013). \u201cLinking and Extending an Open Multilingual Wordnet.\u201d In <i>Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)<\/i>, pp. 1352\u20131362, Sofia, Bulgaria. Association for Computational Linguistics."},{"key":"18","doi-asserted-by":"crossref","unstructured":"Bostrom, K., Zhao, X., Chaudhuri, S., and Durrett, G. (2021). \u201cFlexible Generation of Natural Language Deductions.\u201d In <i>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing<\/i>, pp. 6266\u20136278, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2021.emnlp-main.506"},{"key":"19","unstructured":"Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. \u201cA Large Annotated Corpus for Learning Natural Language Inference.\u201d. 
<i>arXiv preprint arXiv:1508.05326<\/i>."},{"key":"20","doi-asserted-by":"crossref","unstructured":"Cassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps-Costin, L., Pinckney, D., Yee, M.-H., Zi, Y., Anderson, C. J., Feldman, M. Q., Guha, A., Greenberg, M., and Jangda, A. (2023). \u201cMultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation.\u201d <i>IEEE Transactions on Software Engineering<\/i>, 49 (7), pp. 3675\u20133691.","DOI":"10.1109\/TSE.2023.3267446"},{"key":"21","unstructured":"Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., and Krueger, G. (2021). \u201cEvaluating Large Language Models Trained on Code.\u201d <i>arXiv preprint arXiv:2107.03374<\/i>."},{"key":"22","doi-asserted-by":"crossref","unstructured":"Chen, S., Hou, Y., Cui, Y., Che, W., Liu, T., and Yu, X. (2020). \u201cRecall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting.\u201d In <i>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)<\/i>, pp. 7870\u20137881, Online. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2020.emnlp-main.634"},{"key":"23","unstructured":"Chen, X., Chi, R. A., Wang, X., and Zhou, D. (2024). \u201cPremise Order Matters in Reasoning with Large Language Models.\u201d In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (Eds.), <i>Proceedings of the 41st International Conference on Machine Learning<\/i>, Vol. 235 of <i>Proceedings of Machine Learning Research<\/i>, pp. 6596\u20136620. PMLR."},{"key":"24","doi-asserted-by":"crossref","unstructured":"Cheng, J., Bernstein, M., Danescu-Niculescu-Mizil, C., and Leskovec, J. (2017). \u201cAnyone Can Become a Troll: Causes of Trolling Behavior in Online Discussions.\u201d <i>CSCW: Proceedings of the Conference on Computer-Supported Cooperative Work. 
Conference on Computer-Supported Cooperative Work, 2017<\/i>, pp. 1217\u20131230.","DOI":"10.1145\/2998181.2998213"},{"key":"25","unstructured":"Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. (2018). \u201cThink you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.\u201d <i>arXiv preprint arXiv:1803.05457<\/i>."},{"key":"26","doi-asserted-by":"crossref","unstructured":"Clark, P., Tafjord, O., and Richardson, K. (2021). \u201cTransformers as Soft Reasoners over Language.\u201d In <i>Proceedings of the 29th International Conference on International Joint Conferences on Artificial Intelligence<\/i>, pp. 3882\u20133890.","DOI":"10.24963\/ijcai.2020\/537"},{"key":"27","unstructured":"Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. (2021). \u201cTraining Verifiers to Solve Math Word Problems.\u201d <i>arXiv preprint arXiv:2110.14168<\/i>."},{"key":"28","unstructured":"Colmerauer, A. and Roussel, P. (1973). \u201cThe Birth of Prolog.\u201d <i>The ALP Newsletter<\/i>."},{"key":"29","doi-asserted-by":"crossref","unstructured":"Dagan, I., Glickman, O., and Magnini, B. (2005). \u201cThe PASCAL Recognising Textual Entailment Challenge.\u201d In <i>Machine Learning Challenges Workshop<\/i>, pp. 177\u2013190, Berlin, Heidelberg. Springer Berlin Heidelberg.","DOI":"10.1007\/11736790_9"},{"key":"30","doi-asserted-by":"crossref","unstructured":"Dalvi, B., Jansen, P., Tafjord, O., Xie, Z., Smith, H., Pipatanangkura, L., and Clark, P. (2021). \u201cExplaining Answers with Entailment Trees.\u201d In <i>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing<\/i>, pp. 7358\u20137370, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2021.emnlp-main.585"},{"key":"31","doi-asserted-by":"crossref","unstructured":"Dasgupta, I., Lampinen, A. 
K., Chan, S. C. Y., Sheahan, H. R., Creswell, A., Kumaran, D., McClelland, J. L., and Hill, F. (2023). \u201cLanguage Models Show Human-like Content Effects on Reasoning Tasks.\u201d <i>arXiv preprint arXiv:2207.07051<\/i>.","DOI":"10.1093\/pnasnexus\/pgae233"},{"key":"32","unstructured":"Dougrez-Lewis, J., Akhter, M. E., He, Y., and Liakata, M. (2024). \u201cAssessing the Reasoning Abilities of ChatGPT in the Context of Claim Verification.\u201d <i>arXiv preprint arXiv:2402.10735<\/i>."},{"key":"33","unstructured":"Dziri, N., Lu, X., Sclar, M., Li, X. L., Jiang, L., Lin, B. Y., West, P., Bhagavatula, C., Bras, R. L., Hwang, J. D., Sanyal, S., Welleck, S., Ren, X., Ettinger, A., Harchaoui, Z., and Choi, Y. (2023). \u201cFaith and Fate: Limits of Transformers on Compositionality.\u201d <i>arXiv preprint arXiv:2305.18654<\/i>."},{"key":"34","doi-asserted-by":"crossref","unstructured":"Eisape, T., Tessler, M., Dasgupta, I., Sha, F., Steenkiste, S., and Linzen, T. (2024). \u201cA Systematic Comparison of Syllogistic Reasoning in Humans and Language Models.\u201d In Duh, K., Gomez, H., and Bethard, S. (Eds.), <i>Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)<\/i>, pp. 8425\u20138444, Mexico City, Mexico. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2024.naacl-long.466"},{"key":"35","doi-asserted-by":"crossref","unstructured":"Elkan, C. and Greiner, R. (1993). \u201cBuilding Large Knowledge-based Systems: Representation and Inference in the Cyc Project: DB Lenat and RV Guha.\u201d <i>Artificial Intelligence<\/i>, 61 (1), pp. 41\u201352.","DOI":"10.1016\/0004-3702(93)90092-P"},{"key":"36","unstructured":"Frohberg, J. and Binder, F. (2022). 
\u201cCRASS: A Novel Data Set and Benchmark to Test Counterfactual Reasoning of Large Language Models.\u201d In Calzolari, N., B\u00e9chet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Odijk, J., and Piperidis, S. (Eds.), <i>Proceedings of the 13th Language Resources and Evaluation Conference<\/i>, pp. 2126\u20132140, Marseille, France. European Language Resources Association."},{"key":"37","doi-asserted-by":"crossref","unstructured":"Gambardella, A., Iwasawa, Y., and Matsuo, Y. (2024). \u201cLanguage Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks.\u201d In <i>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)<\/i>, pp. 85\u201391.","DOI":"10.18653\/v1\/2024.acl-short.8"},{"key":"38","unstructured":"Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac\u2019h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. (2023). <i>A Framework for Few-shot Language Model Evaluation<\/i>. Zenodo."},{"key":"39","unstructured":"Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., de Oliveira Santos, E., J\u00e4rviniemi, O., Barnett, M., Sandler, R., Vrzala, M., Sevilla, J., Ren, Q., Pratt, E., Levine, L., Barkley, G., Stewart, N., Grechuk, B., Grechuk, T., Enugandla, S. V., and Wildon, M. (2024). \u201cFrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI.\u201d <i>arXiv preprint arXiv:2411.04872<\/i>."},{"key":"40","unstructured":"Gontier, N., Sinha, K., Reddy, S., and Pal, C. (2020). \u201cMeasuring Systematic Generalization in Neural Proof Generation with Transformers.\u201d <i>Advances in Neural Information Processing Systems<\/i>, 33, pp. 
22231\u201322242."},{"key":"41","unstructured":"Goodman, N. (1954). <i>Fact, Fiction, and Forecast. London: University of London<\/i>. Athlone Press."},{"key":"42","doi-asserted-by":"crossref","unstructured":"Guia\u015fu, R. C. and Tindale, C. W. (2018). \u201cLogical Fallacies and Invasion Biology.\u201d <i>Biology &amp; Philosophy<\/i>, 33 (5\u20136), p. 34.","DOI":"10.1007\/s10539-018-9644-0"},{"key":"43","unstructured":"Gulati, A., Miranda, B., Chen, E., Xia, E., Fronsdal, K., de Moraes Dumont, B., and Koyejo, S. (2024). \u201cPutnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning.\u201d In <i>The 4th Workshop on Mathematical Reasoning and AI at NeurIPS\u201924<\/i>."},{"key":"44","doi-asserted-by":"crossref","unstructured":"Habernal, I., Wachsmuth, H., Gurevych, I., and Stein, B. (2018). \u201cThe Argument Reasoning Comprehension Task: Identification and Reconstruction of Implicit Warrants.\u201d In <i>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)<\/i>, pp. 1930\u20131940, New Orleans, Louisiana. Association for Computational Linguistics.","DOI":"10.18653\/v1\/N18-1175"},{"key":"45","unstructured":"Han, S., Schoelkopf, H., Zhao, Y., Qi, Z., Riddell, M., Benson, L., Sun, L., Zubova, E., Qiao, Y., Burtell, M., et al. (2022). \u201cFOLIO: Natural Language Reasoning with First-Order Logic.\u201d <i>arXiv preprint arXiv:2209.00840<\/i>."},{"key":"46","doi-asserted-by":"crossref","unstructured":"Hansson, S. O. (2004). \u201cFallacies of risk.\u201d <i>Journal of Risk Research<\/i>, 7 (3), pp. 353\u2013360.","DOI":"10.1080\/1366987042000176262"},{"key":"47","unstructured":"Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2021a). 
\u201cMeasuring Massive Multitask Language Understanding.\u201d In <i>Proceedings of the International Conference on Learning Representations (ICLR)<\/i>."},{"key":"48","unstructured":"Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021b). \u201cMeasuring Mathematical Problem Solving With the MATH Dataset.\u201d In <i>NeurIPS<\/i>."},{"key":"49","doi-asserted-by":"crossref","unstructured":"Ho, N., Schmid, L., and Yun, S.-Y. (2023). \u201cLarge Language Models Are Reasoning Teachers.\u201d <i>arXiv preprint arXiv:2212.10071<\/i>.","DOI":"10.18653\/v1\/2023.acl-long.830"},{"key":"50","unstructured":"Hodel, D. and West, J. (2023). \u201cResponse: Emergent Analogical Reasoning in Large Language Models.\u201d <i>arXiv preprint arXiv:2308.16118<\/i>."},{"key":"51","doi-asserted-by":"crossref","unstructured":"Hong, R., Zhang, H., Pang, X., Yu, D., and Zhang, C. (2024). \u201cA Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning.\u201d In Duh, K., Gomez, H., and Bethard, S. (Eds.), <i>Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)<\/i>, pp. 900\u2013925, Mexico City, Mexico. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2024.naacl-long.52"},{"key":"52","unstructured":"Hosseini, A., Sordoni, A., Toyama, D., Courville, A., and Agarwal, R. (2024). \u201cNot All LLM Reasoners Are Created Equal.\u201d <i>arXiv preprint arXiv:2410.01748<\/i>."},{"key":"53","unstructured":"Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., and Zhou, D. (2024). \u201cLarge Language Models Cannot Self-Correct Reasoning Yet.\u201d In <i>The 12th International Conference on Learning Representations<\/i>."},{"key":"54","doi-asserted-by":"crossref","unstructured":"Hume, D. (1748). 
<i>An Enquiry Concerning Human Understanding (section IV)<\/i>.","DOI":"10.1093\/oseo\/instance.00032980"},{"key":"55","doi-asserted-by":"crossref","unstructured":"Jiang, B., Xie, Y., Hao, Z., Wang, X., Mallick, T., Su, W. J., Taylor, C. J., and Roth, D. (2024a). \u201cA Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners.\u201d <i>arXiv preprint arXiv:2406.11050<\/i>.","DOI":"10.18653\/v1\/2024.emnlp-main.272"},{"key":"56","unstructured":"Jiang, J., Yan, Y., Liu, Y., Jin, Y., Peng, S., Zhang, M., Cai, X., Cao, Y., Gao, L., and Tang, Z. (2024b). \u201cLogicPro: Improving Complex Logical Reasoning via Program-Guided Learning.\u201d <i>arXiv preprint arXiv:2409.12929<\/i>."},{"key":"57","unstructured":"Jiang, M., Liu, K. Z., Zhong, M., Schaeffer, R., Ouyang, S., Han, J., and Koyejo, S. (2024c). \u201cInvestigating Data Contamination for Pre-training Language Models.\u201d <i>arXiv preprint arXiv:2401.06059<\/i>."},{"key":"58","unstructured":"Jin, Z., Chen, Y., Leeb, F., Gresele, L., Kamal, O., Lyu, Z., Blin, K., Adauto, F. G., Kleiman-Weiner, M., Sachan, M., and Sch\u00f6lkopf, B. (2024). \u201cCLadder: Assessing Causal Reasoning in Language Models.\u201d <i>arXiv preprint arXiv:2312.04350<\/i>."},{"key":"59","unstructured":"Jitsev, J. (2024). \u201cA Post on X by a Researcher.\u201d <i>post on X<\/i>. https:\/\/x.com\/JJitsev\/status\/1842727657345036788."},{"key":"60","unstructured":"Kahneman, D. (2011). <i>Thinking, fast and slow<\/i>. Macmillan."},{"key":"61","unstructured":"Kudo, K., Aoki, Y., Kuribayashi, T., Sone, S., Taniguchi, M., Brassard, A., Sakaguchi, K., and Inui, K. (2024). \u201cThink-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Reasoning.\u201d <i>arXiv preprint arXiv:2412.01113<\/i>."},{"key":"62","unstructured":"Kudo, T. (2005). 
\u201cMecab: Yet Another Part-of-speech and Morphological Analyzer.\u201d https:\/\/taku910.github.io\/mecab\/."},{"key":"63","unstructured":"Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., Luko\u0161i\u016bt\u0117, K., Nguyen, K., Cheng, N., Joseph, N., Schiefer, N., Rausch, O., Larson, R., McCandlish, S., Kundu, S., Kadavath, S., Yang, S., Henighan, T., Maxwell, T., Telleen-Lawton, T., Hume, T., Hatfield-Dodds, Z., Kaplan, J., Brauner, J., Bowman, S. R., and Perez, E. (2023). \u201cMeasuring Faithfulness in Chain-of-Thought Reasoning.\u201d <i>arXiv preprint arXiv:2307.13702<\/i>."},{"key":"64","doi-asserted-by":"crossref","unstructured":"Li, J., Yu, L., and Ettinger, A. (2023). \u201cCounterfactual Reasoning: Testing Language Models\u2019 Understanding of Hypothetical Scenarios.\u201d In Rogers, A., Boyd-Graber, J., and Okazaki, N. (Eds.), <i>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)<\/i>, pp. 804\u2013815, Toronto, Canada. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2023.acl-short.70"},{"key":"65","unstructured":"Li, L., Luo, Y., and Pan, T. (2024). \u201cOpenAI-o1 AB Testing: Does the o1 Model Really Do Good Reasoning in Math Problem Solving?\u201d <i>arXiv preprint arXiv:2411.06198<\/i>."},{"key":"66","doi-asserted-by":"crossref","unstructured":"Li, L. H., Hessel, J., Yu, Y., Ren, X., Chang, K.-W., and Choi, Y. (2023). \u201cSymbolic Chain-of-Thought Distillation: Small Models Can Also \u201cThink\u201d Step-by-Step.\u201d In Rogers, A., Boyd-Graber, J., and Okazaki, N. (Eds.), <i>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)<\/i>, pp. 2665\u20132679, Toronto, Canada. 
Association for Computational Linguistics.","DOI":"10.18653\/v1\/2023.acl-long.150"},{"key":"67","unstructured":"Li, S., Chen, J., Shen, Y., Chen, Z., Zhang, X., Li, Z., Wang, H., Qian, J., Peng, B., Mao, Y., Chen, W., and Yan, X. (2022). \u201cExplanations from Large Language Models Make Small Reasoners Better.\u201d <i>arXiv preprint arXiv:2210.06726<\/i>."},{"key":"68","doi-asserted-by":"crossref","unstructured":"Liu, H., Liu, J., Cui, L., Teng, Z., Duan, N., Zhou, M., and Zhang, Y. (2023a). \u201cLogiQA 2.0\u2014An Improved Dataset for Logical Reasoning in Natural Language Understanding.\u201d <i>IEEE\/ACM Transactions on Audio, Speech, and Language Processing<\/i>, 31, pp. 2947\u20132962.","DOI":"10.1109\/TASLP.2023.3293046"},{"key":"69","unstructured":"Liu, H., Ning, R., Teng, Z., Liu, J., Zhou, Q., and Zhang, Y. (2023b). \u201cEvaluating the Logical Reasoning Ability of ChatGPT and GPT-4.\u201d <i>arXiv preprint arXiv:2304.03439<\/i>."},{"key":"70","doi-asserted-by":"crossref","unstructured":"Liu, H., Teng, Z., Cui, L., Zhang, C., Zhou, Q., and Zhang, Y. (2023c). \u201cLogiCoT: Logical Chain-of-Thought Instruction Tuning.\u201d In Bouamor, H., Pino, J., and Bali, K. (Eds.), <i>Findings of the Association for Computational Linguistics: EMNLP 2023<\/i>, pp. 2908\u20132921, Singapore. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2023.findings-emnlp.191"},{"key":"71","doi-asserted-by":"crossref","unstructured":"Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., and Zhang, Y. (2020). \u201cLogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning.\u201d In Bessiere, C. (Ed.), <i>Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI-20<\/i>, pp. 3622\u20133628. International Joint Conferences on Artificial Intelligence Organization. Main track.","DOI":"10.24963\/ijcai.2020\/501"},{"key":"72","unstructured":"Liu, J., Xia, C. S., Wang, Y., and Zhang, L. (2023). 
\u201cIs Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation.\u201d In <i>37th Conference on Neural Information Processing Systems<\/i>."},{"key":"73","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lee, I., Du, Y., Sanyal, S., and Zhao, J. (2024). \u201cSelf-Contradictory Reasoning Evaluation and Detection.\u201d <i>arXiv preprint arXiv:2311.09603<\/i>.","DOI":"10.18653\/v1\/2024.findings-emnlp.213"},{"key":"74","doi-asserted-by":"crossref","unstructured":"Lu, Z., Zhou, A., Ren, H., Wang, K., Shi, W., Pan, J., Zhan, M., and Li, H. (2024). \u201cMathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs.\u201d <i>arXiv preprint arXiv:2402.16352<\/i>.","DOI":"10.18653\/v1\/2024.acl-long.151"},{"key":"75","unstructured":"MA, Y., Liu, Y., Yu, Y., Zhang, Y., Jiang, Y., Wang, C., and Li, S. (2024). \u201cAt Which Training Stage Does Code Data Help LLMs Reasoning?\u201d In <i>The 12th International Conference on Learning Representations<\/i>."},{"key":"76","doi-asserted-by":"crossref","unstructured":"Magister, L. C., Mallinson, J., Adamek, J., Malmi, E., and Severyn, A. (2023). \u201cTeaching Small Language Models to Reason.\u201d In Rogers, A., Boyd-Graber, J., and Okazaki, N. (Eds.), <i>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)<\/i>, pp. 1773\u20131781, Toronto, Canada. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2023.acl-short.151"},{"key":"77","unstructured":"McCarthy, J. W. (1959). \u201cPrograms with Common Sense.\u201d In <i>Proceedings of the Teddington Conference on the Mechanization of Thought Processes<\/i>, pp. 75\u201391."},{"key":"78","doi-asserted-by":"crossref","unstructured":"Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. (2018). \u201cCan a Suit of Armor Conduct Electricity? 
A New Dataset for Open Book Question Answering.\u201d In <i>EMNLP<\/i>.","DOI":"10.18653\/v1\/D18-1260"},{"key":"79","doi-asserted-by":"crossref","unstructured":"Miller, G. A. (1995). \u201cWordNet: A Lexical Database for English.\u201d <i>Communications of the ACM<\/i>, 38 (11), pp. 39\u201341.","DOI":"10.1145\/219717.219748"},{"key":"80","unstructured":"Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., and Farajtabar, M. (2024). \u201cGSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.\u201d <i>arXiv preprint arXiv:2410.05229<\/i>."},{"key":"81","unstructured":"Mitchell, M. (2023). \u201cCan Large Language Models Reason?\u201d https:\/\/aiguide.substack.com\/p\/can-large-language-models-reason."},{"key":"82","unstructured":"Mitra, A., Corro, L. D., Mahajan, S., Codas, A., Simoes, C., Agarwal, S., Chen, X., Razdaibiedina, A., Jones, E., Aggarwal, K., Palangi, H., Zheng, G., Rosset, C., Khanpour, H., and Awadallah, A. (2023). \u201cOrca 2: Teaching Small Language Models How to Reason.\u201d <i>arXiv preprint arXiv:2311.11045<\/i>."},{"key":"83","doi-asserted-by":"crossref","unstructured":"Mondorf, P. and Plank, B. (2024). \u201cLiar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models.\u201d <i>arXiv preprint arXiv:2406.12546<\/i>.","DOI":"10.18653\/v1\/2024.emnlp-main.404"},{"key":"84","unstructured":"Morishita, T., Morio, G., Yamaguchi, A., and Sogawa, Y. (2023). \u201cLearning Deductive Reasoning from Synthetic Corpus based on Formal Logic.\u201d In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (Eds.), <i>Proceedings of the 40th International Conference on Machine Learning<\/i>, Vol. 202 of <i>Proceedings of Machine Learning Research<\/i>, pp. 25254\u201325274. PMLR."},{"key":"85","unstructured":"Morishita, T., Morio, G., Yamaguchi, A., and Sogawa, Y. (2024a). 
\u201cEnhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus.\u201d In <i>Annual Conference on Neural Information Processing Systems<\/i>."},{"key":"86","unstructured":"Morishita, T., Yamaguchi, A., Morio, G., Tomonari, H., Imaichi, O., and Sogawa, Y. (2024b). \u201cJFLD: A Japanese Benchmark for Deductive Reasoning Based on Formal Logic.\u201d In Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., and Xue, N. (Eds.), <i>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)<\/i>, pp. 9526\u20139535, Torino, Italia. ELRA and ICCL."},{"key":"87","unstructured":"\u68ee\u4e0b\u7693\u6587\uff0c\u68ee\u5c3e\u5b66\uff0c\u5c71\u53e3\u7be4\u5b63\uff0c\u5341\u6cb3\u6cf0\u5f18 (2023a). \u5f62\u5f0f\u8ad6\u7406\u5b66\u306b\u57fa\u3065\u304f\u6f14\u7e79\u30b3\u30fc\u30d1\u30b9\u306b\u3088\u308b\u8a00\u8a9e\u30e2\u30c7\u30eb\u306b\u5bfe\u3059\u308b\u6f14\u7e79\u63a8\u8ad6\u80fd\u529b\u306e\u4ed8\u4e0e. \u8a00\u8a9e\u51e6\u7406\u5b66\u4f1a\u4e88\u7a3f\u96c6. [T. Morishita, et al. (2023a). Keishikironrigaku ni Motozuku Enekikopasu niyoru Gengomoderu ni Taisuru Enekisuironnoryoku no Fuyo. Annual Meeting of the Association for Natural Language Processing.]."},{"key":"88","unstructured":"\u68ee\u4e0b\u7693\u6587\uff0c\u68ee\u5c3e\u5b66\uff0c\u5c71\u53e3\u7be4\u5b63\uff0c\u5341\u6cb3\u6cf0\u5f18 (2023b). \u4eba\u5de5\u6f14\u7e79\u63a8\u8ad6\u30b3\u30fc\u30d1\u30b9\u306b\u3088\u308b\u5b66\u7fd2\u306f\u8a00\u8a9e\u30e2\u30c7\u30eb\u3092\u3069\u306e\u3088\u3046\u306b\u5f37\u5316\u3059\u308b\u304b? \u4eba\u5de5\u77e5\u80fd\u5b66\u4f1a\u5168\u56fd\u5927\u4f1a\u8ad6\u6587\u96c6\u7b2c 37 \u56de. [T. Morishita, et al. (2023b). How Do Synthetic Deduction Corpora Enhance Language Models? 
The 37th Annual Conference of the Japanese Society for Artificial Intelligence.]."},{"key":"89","unstructured":"\u68ee\u4e0b\u7693\u6587\uff0c\u5c71\u53e3\u7be4\u5b63\uff0c\u68ee\u5c3e\u5b66\uff0c\u89d2\u639b\u6b63\u5f25\uff0c\u53cb\u6210\u5149\uff0c\u4eca\u4e00\u4fee\uff0c\u5341\u6cb3\u6cf0\u5f18 (2024a). \u65e5\u672c\u8a9e\u8ad6\u7406\u63a8\u8ad6\u30d9\u30f3\u30c1\u30de\u30fc\u30af JFLD \u306e\u63d0\u6848. \u8a00\u8a9e\u51e6\u7406\u5b66\u4f1a\u7b2c 30 \u5e74\u6b21\u5927\u4f1a\u767a\u8868\u8ad6\u6587\u96c6, pp. 925\u2013930. [T. Morishita, et al. (2024a). Nihongo Ronrisuiron Benchimaku JFLD no Teian. Proceedings of the 30th Annual Meeting of the Association for Natural Language Processing, pp. 925\u2013930.]."},{"key":"90","unstructured":"\u68ee\u4e0b\u7693\u6587\uff0c\u5c71\u53e3\u7be4\u5b63\uff0c\u68ee\u5c3e\u5b66\uff0c\u4eca\u4e00\u4fee\uff0c\u5341\u6cb3\u6cf0\u5f18 (2024b). \u5e30\u7d0d\u7684\u306b\u591a\u69d8\u306a\u5de8\u5927\u8ad6\u7406\u63a8\u8ad6\u30b3\u30fc\u30d1\u30b9\u306b\u3088\u308a LLM \u306e\u6c4e\u7528\u8ad6\u7406\u63a8\u8ad6\u80fd\u529b\u3092\u5411\u4e0a\u3055\u305b\u308b. \u4eba\u5de5\u77e5\u80fd\u5b66\u4f1a\u5168\u56fd\u5927\u4f1a\u8ad6\u6587\u96c6\u7b2c 38 \u56de. [T. Morishita, et al. (2024b). Acquiring Generalizable Reasoning Ability through a Large-Scale Logical Corpus with Inductively Diverse Examples. The 38th Annual Conference of the Japanese Society for Artificial Intelligence.]."},{"key":"91","unstructured":"Nikankin, Y., Reusch, A., Mueller, A., and Belinkov, Y. (2024). \u201cArithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics.\u201d <i>arXiv preprint arXiv:2410.21272<\/i>."},{"key":"92","unstructured":"\u5ca1\u8c37\u57fa\u5f18 (2023). GPT-4 \u306b\u3088\u308b\u8db3\u3057\u7b97\u5b9f\u9a13\u304b\u3089\u793a\u5506\u3055\u308c\u308b Large Language Models \u306e\u9650\u754c. \u4eba\u5de5\u77e5\u80fd\u5b66\u4f1a\u7b2c\u4e8c\u7a2e\u7814\u7a76\u4f1a\u8cc7\u6599, 2023 (AGI-024), p. 2. [M. Okaya. (2023). 
The Issues of Large Language Models indicated by Addition Experiments on GPT4. JSAI Technical Report, Type 2 SIG, 2023 (AGI-024), p. 2.]."},{"key":"93","unstructured":"OpenAI (2024). \u201cDays of OpenAI: Day 12.\u201d https:\/\/www.youtube.com\/watch?v=SKBG1sqdyIU. Accessed: 2024\/12\/21."},{"key":"94","doi-asserted-by":"crossref","unstructured":"Ozeki, K., Ando, R., Morishita, T., Abe, H., Mineshima, K., and Okada, M. (2024). \u201cExploring Reasoning Biases in Large Language Models Through Syllogism: Insights from the NeuBAROCO Dataset.\u201d In Ku, L.-W., Martins, A., and Srikumar, V. (Eds.), <i>Findings of the Association for Computational Linguistics ACL 2024<\/i>, pp. 16063\u201316077, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2024.findings-acl.950"},{"key":"95","doi-asserted-by":"crossref","unstructured":"Paglieri, F. (2017). \u201cA Plea for Ecological Argument Technologies.\u201d <i>Philosophy &amp; Technology<\/i>, 30 (2), pp. 209\u2013238.","DOI":"10.1007\/s13347-016-0222-6"},{"key":"96","doi-asserted-by":"crossref","unstructured":"Parmar, M., Patel, N., Varshney, N., Nakamura, M., Luo, M., Mashetty, S., Mitra, A., and Baral, C. (2024). \u201cLogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models.\u201d In Ku, L.-W., Martins, A., and Srikumar, V. (Eds.), <i>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)<\/i>, pp. 13679\u201313707, Bangkok, Thailand. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2024.acl-long.739"},{"key":"97","doi-asserted-by":"crossref","unstructured":"Patel, N., Kulkarni, M., Parmar, M., Budhiraja, A., Nakamura, M., Varshney, N., and Baral, C. (2024). 
\u201cMulti-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models.\u201d <i>arXiv preprint arXiv:2406.17169<\/i>.","DOI":"10.18653\/v1\/2024.emnlp-main.1160"},{"key":"98","doi-asserted-by":"crossref","unstructured":"Quine, W. V. O. (1969). \u201cEpistemology Naturalized.\u201d In <i>Ontological Relativity and Other Essays<\/i>. New York: Columbia University Press.","DOI":"10.7312\/quin92204"},{"key":"99","unstructured":"Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). \u201cLanguage Models are Unsupervised Multitask Learners.\u201d OpenAI."},{"key":"100","unstructured":"Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2019). \u201cExploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.\u201d <i>arXiv preprint arXiv:1910.10683<\/i>."},{"key":"101","doi-asserted-by":"crossref","unstructured":"Rajpurkar, P., Jia, R., and Liang, P. (2018). \u201cKnow What You Don\u2019t Know: Unanswerable Questions for SQuAD.\u201d In Gurevych, I. and Miyao, Y. (Eds.), <i>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)<\/i>, pp. 784\u2013789, Melbourne, Australia. Association for Computational Linguistics.","DOI":"10.18653\/v1\/P18-2124"},{"key":"102","doi-asserted-by":"crossref","unstructured":"Razeghi, Y., Logan IV, R. L., Gardner, M., and Singh, S. (2022). \u201cImpact of Pretraining Term Frequencies on Few-shot Numerical Reasoning.\u201d In <i>Findings of the Association for Computational Linguistics: EMNLP 2022<\/i>, pp. 840\u2013854.","DOI":"10.18653\/v1\/2022.findings-emnlp.59"},{"key":"103","unstructured":"Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. (2023). 
\u201cGPQA: A Graduate-Level Google-Proof Q&amp;A Benchmark.\u201d <i>arXiv preprint arXiv:2311.12022<\/i>."},{"key":"104","unstructured":"Ruis, L., Mozes, M., Bae, J., Kamalakara, S. R., Talupuru, D., Locatelli, A., Kirk, R., Rockt\u00e4schel, T., Grefenstette, E., and Bartolo, M. (2024). \u201cProcedural Knowledge in Pretraining Drives Reasoning in Large Language Models.\u201d <i>arXiv preprint arXiv:2411.12580<\/i>."},{"key":"105","doi-asserted-by":"crossref","unstructured":"Saha, S., Ghosh, S., Srivastava, S., and Bansal, M. (2020). \u201cPRover: Proof Generation for Interpretable Reasoning over Rules.\u201d In <i>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)<\/i>, pp. 122\u2013136, Online. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2020.emnlp-main.9"},{"key":"106","doi-asserted-by":"crossref","unstructured":"Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. (2021). \u201cWinogrande: An Adversarial Winograd Schema Challenge at Scale.\u201d <i>Communications of the ACM<\/i>, 64 (9), pp. 99\u2013106.","DOI":"10.1145\/3474381"},{"key":"107","unstructured":"Sanyal, S., Liao, Z., and Ren, X. (2022a). \u201cRobustlr: Evaluating Robustness to Logical Perturbation in Deductive Reasoning.\u201d <i>arXiv preprint arXiv:2205.12598<\/i>."},{"key":"108","doi-asserted-by":"crossref","unstructured":"Sanyal, S., Singh, H., and Ren, X. (2022b). \u201cFaiRR: Faithful and Robust Deductive Reasoning over Natural Language.\u201d In <i>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)<\/i>, pp. 1075\u20131093.","DOI":"10.18653\/v1\/2022.acl-long.77"},{"key":"109","unstructured":"Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E., Sch\u00e4rli, N., and Zhou, D. (2023). 
\u201cLarge Language Models Can Be Easily Distracted by Irrelevant Context.\u201d <i>arXiv preprint arXiv:2302.00093<\/i>."},{"key":"110","doi-asserted-by":"crossref","unstructured":"Shortliffe, E. H. (1976). <i>Computer Based Medical Consultations: MYCIN<\/i>. Elsevier.","DOI":"10.1016\/B978-0-444-00179-5.50009-3"},{"key":"111","doi-asserted-by":"crossref","unstructured":"Shridhar, K., Stolfo, A., and Sachan, M. (2023). \u201cDistilling Reasoning Capabilities into Smaller Language Models.\u201d In Rogers, A., Boyd-Graber, J., and Okazaki, N. (Eds.), <i>Findings of the Association for Computational Linguistics: ACL 2023<\/i>, pp. 7059\u20137073, Toronto, Canada. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2023.findings-acl.441"},{"key":"112","unstructured":"Sprague, Z. R., Ye, X., Bostrom, K., Chaudhuri, S., and Durrett, G. (2024). \u201cMuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning.\u201d In <i>The 12th International Conference on Learning Representations<\/i>."},{"key":"113","unstructured":"Srivastava, S., Annarose, M. B., Anto, P. V., Menon, S., Sukumar, A., Samod, T. A., Philipose, A., Prince, S., and Thomas, S. (2024). \u201cFunctional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap.\u201d <i>arXiv preprint arXiv:2402.19450<\/i>."},{"key":"114","unstructured":"Sunstein, C. R. and Hastie, R. (2015). <i>Wiser: Getting beyond Groupthink to Make Groups Smarter<\/i>. Harvard Business Review Press."},{"key":"115","doi-asserted-by":"crossref","unstructured":"Suzgun, M., Scales, N., Sch\u00e4rli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., and Wei, J. (2022). \u201cChallenging BIG-Bench Tasks and Whether Chain-of-thought Can Solve Them.\u201d <i>arXiv preprint arXiv:2210.09261<\/i>.","DOI":"10.18653\/v1\/2023.findings-acl.824"},{"key":"116","doi-asserted-by":"crossref","unstructured":"Tafjord, O., Dalvi, B., and Clark, P. (2021). 
\u201cProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language.\u201d In <i>Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021<\/i>, pp. 3621\u20133634, Online. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2021.findings-acl.317"},{"key":"117","unstructured":"Talmor, A., Herzig, J., Lourie, N., and Berant, J. (2018). \u201cCommonsenseqa: A Question Answering Challenge Targeting Commonsense Knowledge.\u201d <i>arXiv preprint arXiv:1811.00937<\/i>."},{"key":"118","doi-asserted-by":"crossref","unstructured":"Tian, J., Li, Y., Chen, W., Xiao, L., He, H., and Jin, Y. (2021). \u201cDiagnosing the First-Order Logical Reasoning Ability Through LogicNLI.\u201d In <i>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing<\/i>, pp. 3738\u20133747.","DOI":"10.18653\/v1\/2021.emnlp-main.303"},{"key":"119","unstructured":"Turpin, M., Michael, J., Perez, E., and Bowman, S. R. (2023). \u201cLanguage Models Don\u2019t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.\u201d <i>arXiv preprint arXiv:2305.04388<\/i>."},{"key":"120","doi-asserted-by":"crossref","unstructured":"Uchiyama, F., Kojima, T., Gambardella, A., Cao, Q., Iwasawa, Y., and Matsuo, Y. (2024). \u201cWhich Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance?\u201d <i>arXiv preprint arXiv:2410.06735<\/i>.","DOI":"10.18653\/v1\/2024.emnlp-main.1008"},{"key":"121","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, \u0141., and Polosukhin, I. (2017). \u201cAttention is All you Need.\u201d In <i>Advances in Neural Information Processing Systems<\/i>, Vol. 30, pp. 6000\u20136010."},{"key":"122","unstructured":"Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., and Hobbhahn, M. (2024). \u201cWill We Run Out of Data? 
Limits of LLM Scaling Based on Human-Generated Data.\u201d <i>arXiv preprint arXiv:2211.04325<\/i>."},{"key":"123","doi-asserted-by":"crossref","unstructured":"Wan, Y., Wang, W., Yang, Y., Yuan, Y., tse Huang, J., He, P., Jiao, W., and Lyu, M. R. (2024). \u201cLogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models.\u201d <i>arXiv preprint arXiv:2401.00757<\/i>.","DOI":"10.18653\/v1\/2024.emnlp-main.128"},{"key":"124","unstructured":"Wang, B., Yue, X., Su, Y., and Sun, H. (2024). \u201cGrokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization.\u201d <i>arXiv preprint arXiv:2405.15071<\/i>."},{"key":"125","doi-asserted-by":"crossref","unstructured":"Wang, P., Wang, Z., Li, Z., Gao, Y., Yin, B., and Ren, X. (2023). \u201cSCOTT: Self-Consistent Chain-of-Thought Distillation.\u201d In Rogers, A., Boyd-Graber, J., and Okazaki, N. (Eds.), <i>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)<\/i>, pp. 5546\u20135558, Toronto, Canada. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2023.acl-long.304"},{"key":"126","doi-asserted-by":"crossref","unstructured":"Wang, S., Wei, Z., Choi, Y., and Ren, X. (2024a). \u201cCan LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs.\u201d In Ku, L.-W., Martins, A., and Srikumar, V. (Eds.), <i>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)<\/i>, pp. 7523\u20137543, Bangkok, Thailand. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2024.acl-long.406"},{"key":"127","unstructured":"Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., and Chen, W. (2024b). 
\u201cMMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (Published at NeurIPS 2024 Track Datasets and Benchmarks).\u201d <i>arXiv preprint arXiv:2406.01574<\/i>."},{"key":"128","unstructured":"Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E. H., Le, Q. V., and Zhou, D. (2022). \u201cChain of Thought Prompting Elicits Reasoning in Large Language Models.\u201d In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (Eds.), <i>Advances in Neural Information Processing Systems<\/i>, pp. 24824\u201324837."},{"key":"129","doi-asserted-by":"crossref","unstructured":"Weizenbaum, J. (1966). \u201cELIZA\u2014A Computer Program for the Study of Natural Language Communication between Man and Machine.\u201d <i>Communications of the ACM<\/i>, 9 (1), pp. 36\u201345.","DOI":"10.1145\/365153.365168"},{"key":"130","doi-asserted-by":"crossref","unstructured":"Welbl, J., Liu, N. F., and Gardner, M. (2017). \u201cCrowdsourcing Multiple Choice Science Questions.\u201d In Derczynski, L., Xu, W., Ritter, A., and Baldwin, T. (Eds.), <i>Proceedings of the 3rd Workshop on Noisy User-generated Text<\/i>, pp. 94\u2013106, Copenhagen, Denmark. Association for Computational Linguistics.","DOI":"10.18653\/v1\/W17-4413"},{"key":"131","unstructured":"Weston, J., Bordes, A., Chopra, S., Rush, A. M., Van Merri\u00ebnboer, B., Joulin, A., and Mikolov, T. (2015). \u201cTowards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks.\u201d <i>arXiv preprint arXiv:1502.05698<\/i>."},{"key":"132","doi-asserted-by":"crossref","unstructured":"Williams, A., Nangia, N., and Bowman, S. R. (2018). \u201cA Broad-Coverage Challenge Corpus for Sentence Understanding through Inference.\u201d In <i>Proceedings of NAACL-HLT<\/i>, pp. 1112\u20131122.","DOI":"10.18653\/v1\/N18-1101"},{"key":"133","unstructured":"Winograd, T. (1971). 
<i>Procedures as a Representation for Data in a Computer Program for Understanding Natural Language, MIT AI Technical Report 235<\/i>."},{"key":"134","unstructured":"Wittgenstein, L. (1922). <i>Tractatus Logico Philosophicus: Logical-Philosophical Treatise<\/i>. Really Simple Media."},{"key":"135","doi-asserted-by":"crossref","unstructured":"Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. (2020). \u201cTransformers: State-of-the-Art Natural Language Processing.\u201d In <i>Empirical Methods in Natural Language Processing: System Demonstrations<\/i>, pp. 38\u201345.","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"136","doi-asserted-by":"crossref","unstructured":"Wu, Z., Qiu, L., Ross, A., Aky\u00fcrek, E., Chen, B., Wang, B., Kim, N., Andreas, J., and Kim, Y. (2023). \u201cReasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks.\u201d <i>arXiv preprint arXiv:2307.02477<\/i>.","DOI":"10.18653\/v1\/2024.naacl-long.102"},{"key":"137","unstructured":"Xie, C., Huang, Y., Zhang, C., Yu, D., Chen, X., Lin, B. Y., Li, B., Ghazi, B., and Kumar, R. (2024). \u201cOn Memorization of Large Language Models in Logical Reasoning.\u201d <i>arXiv preprint arXiv:2410.23123<\/i>."},{"key":"138","doi-asserted-by":"crossref","unstructured":"Yanaka, H., Mineshima, K., Bekki, D., Inui, K., Sekine, S., Abzianidze, L., and Bos, J. (2019). \u201cHELP: A Dataset for Identifying Shortcomings of Neural Models in Monotonicity Reasoning.\u201d <i>arXiv preprint arXiv:1904.12166<\/i>.","DOI":"10.18653\/v1\/S19-1027"},{"key":"139","doi-asserted-by":"crossref","unstructured":"Young, N., Bao, Q., Bensemann, J., and Witbrock, M. J. (2022). 
\u201cAbductionRules: Training Transformers to Explain Unexpected Inputs.\u201d In <i>Findings of the Association for Computational Linguistics: ACL 2022<\/i>, pp. 218\u2013227.","DOI":"10.18653\/v1\/2022.findings-acl.19"},{"key":"140","unstructured":"Yu, L., Gao, X.-S., Zhang, L., and Miao, Y. (2024). \u201cGeneralizability of Memorization Neural Networks.\u201d <i>arXiv preprint arXiv:2411.00372<\/i>."},{"key":"141","unstructured":"Yu, W., Jiang, Z., Dong, Y., and Feng, J. (2020). \u201cReClor: A Reading Comprehension Dataset Requiring Logical Reasoning.\u201d In <i>International Conference on Learning Representations (ICLR)<\/i>."},{"key":"142","doi-asserted-by":"crossref","unstructured":"Yu, W., Jiang, M., Clark, P., and Sabharwal, A. (2023). \u201cIfQA: A Dataset for Open-domain Question Answering under Counterfactual Presuppositions.\u201d <i>arXiv preprint arXiv:2305.14010<\/i>.","DOI":"10.18653\/v1\/2023.emnlp-main.515"},{"key":"143","doi-asserted-by":"crossref","unstructured":"Yuan, Z., Hu, S., Vuli\u0107, I., Korhonen, A., and Meng, Z. (2023). \u201cCan Pretrained Language Models (Yet) Reason Deductively?\u201d In <i>Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics<\/i>, pp. 1439\u20131454.","DOI":"10.18653\/v1\/2023.eacl-main.106"},{"key":"144","unstructured":"Ze\u010devi\u0107, M., Willig, M., Dhami, D. S., and Kersting, K. (2023). \u201cCausal Parrots: Large Language Models May Talk Causality But Are Not Causal.\u201d <i>arXiv preprint arXiv:2308.13067<\/i>."},{"key":"145","doi-asserted-by":"crossref","unstructured":"Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. (2019). \u201cHellaSwag: Can a Machine Really Finish Your Sentence?\u201d In <i>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics<\/i>, pp. 4791\u20134800.","DOI":"10.18653\/v1\/P19-1472"},{"key":"146","doi-asserted-by":"crossref","unstructured":"Zhang, H., Li, L. 
H., Meng, T., Chang, K.-W., and den Broeck, G. V. (2022). \u201cOn the Paradox of Learning to Reason from Data.\u201d <i>arXiv preprint arXiv:2205.11502<\/i>.","DOI":"10.24963\/ijcai.2023\/375"},{"key":"147","unstructured":"Zhang, H., Da, J., Lee, D., Robinson, V., Wu, C., Song, W., Zhao, T., Raja, P., Slack, D., Lyu, Q., Hendryx, S., Kaplan, R., Lunati, M., and Yue, S. (2024a). \u201cA Careful Examination of Large Language Model Performance on Grade School Arithmetic.\u201d <i>arXiv preprint arXiv:2405.00332<\/i>."},{"key":"148","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Wang, H., Feng, S., Tan, Z., Han, X., He, T., and Tsvetkov, Y. (2024b). \u201cCan LLM Graph Reasoning Generalize beyond Pattern Memorization?\u201d In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (Eds.), <i>Findings of the Association for Computational Linguistics: EMNLP 2024<\/i>, pp. 2289\u20132305, Miami, Florida, USA. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2024.findings-emnlp.127"},{"key":"149","doi-asserted-by":"crossref","unstructured":"Zhao, J., Tong, J., Mou, Y., Zhang, M., Zhang, Q., and Huang, X. (2024a). \u201cExploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning.\u201d <i>arXiv preprint arXiv:2405.06680<\/i>.","DOI":"10.18653\/v1\/2024.emnlp-main.915"},{"key":"150","doi-asserted-by":"crossref","unstructured":"Zhao, W., Chiu, J., Hwang, J., Brahman, F., Hessel, J., Choudhury, S., Choi, Y., Li, X., and Suhr, A. (2024b). \u201cUNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations.\u201d In Duh, K., Gomez, H., and Bethard, S. (Eds.), <i>Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)<\/i>, pp. 8487\u20138505, Mexico City, Mexico. 
Association for Computational Linguistics.","DOI":"10.18653\/v1\/2024.naacl-long.469"},{"key":"151","doi-asserted-by":"crossref","unstructured":"Zhong, W., Wang, S., Tang, D., Xu, Z., Guo, D., Wang, J., Yin, J., Zhou, M., and Duan, N. (2021). \u201cAr-lsat: Investigating Analytical Reasoning of Text.\u201d <i>arXiv preprint arXiv:2104.06598<\/i>.","DOI":"10.18653\/v1\/2022.findings-naacl.177"},{"key":"152","unstructured":"Zhou, Y., Alon, U., Chen, X., Wang, X., Agarwal, R., and Zhou, D. (2024a). \u201cTransformers Can Achieve Length Generalization But Not Robustly.\u201d <i>arXiv preprint arXiv:2402.09371<\/i>."},{"key":"153","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Zhu, Y., Antognini, D., Kim, Y., and Zhang, Y. (2024b). \u201cParaphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models.\u201d In Duh, K., Gomez, H., and Bethard, S. (Eds.), <i>Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)<\/i>, pp. 2793\u20132804, Mexico City, Mexico. Association for Computational Linguistics.","DOI":"10.18653\/v1\/2024.naacl-long.153"},{"key":"154","unstructured":"Zhu, K., Chen, J., Wang, J., Gong, N. Z., Yang, D., and Xie, X. (2024). 
\u201cDyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks.\u201d In <i>The Twelfth International Conference on Learning Representations<\/i>."}],"container-title":["Journal of Natural Language Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/jnlp\/32\/2\/32_520\/_pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,21]],"date-time":"2025-06-21T04:14:16Z","timestamp":1750479256000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/jnlp\/32\/2\/32_520\/_article\/-char\/ja\/"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":154,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025]]}},"URL":"https:\/\/doi.org\/10.5715\/jnlp.32.520","relation":{},"ISSN":["1340-7619","2185-8314"],"issn-type":[{"value":"1340-7619","type":"print"},{"value":"2185-8314","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025]]}}}