{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,7]],"date-time":"2026-07-07T06:24:33Z","timestamp":1783405473590,"version":"3.54.6"},"reference-count":48,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2024,9,19]],"date-time":"2024-09-19T00:00:00Z","timestamp":1726704000000},"content-version":"vor","delay-in-days":262,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,9,18]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Large language models have been shown to behave inconsistently in response to meaning-preserving paraphrastic inputs. At the same time, researchers evaluate the knowledge and reasoning abilities of these models with test evaluations that do not disaggregate the effect of paraphrastic variability on performance. We propose a metric, PC, for evaluating the paraphrastic consistency of natural language reasoning models based on the probability of a model achieving the same correctness on two paraphrases of the same problem. We mathematically connect this metric to the proportion of a model\u2019s variance in correctness attributable to paraphrasing. To estimate PC, we collect ParaNlu, a dataset of 7,782 human-written and validated paraphrased reasoning problems constructed on top of existing benchmark datasets for defeasible and abductive natural language inference. Using ParaNlu, we measure the paraphrastic consistency of several model classes and show that consistency dramatically increases with pretraining but not fine-tuning. 
All models tested exhibited room for improvement in paraphrastic consistency.<\/jats:p>","DOI":"10.1162\/tacl_a_00692","type":"journal-article","created":{"date-parts":[[2024,9,19]],"date-time":"2024-09-19T19:44:56Z","timestamp":1726775096000},"page":"1143-1162","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":3,"title":["How Often Are Errors in Natural Language Reasoning Due to Paraphrastic Variability?"],"prefix":"10.1162","volume":"12","author":[{"given":"Neha","family":"Srikanth","sequence":"first","affiliation":[{"name":"Computer Science, University of Maryland, USA. nehasrik@umd.edu"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Marine","family":"Carpuat","sequence":"additional","affiliation":[{"name":"Computer Science, University of Maryland, USA. marine@cs.umd.edu"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Rachel","family":"Rudinger","sequence":"additional","affiliation":[{"name":"Computer Science, University of Maryland, USA. 
rudinger@umd.edu"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"281","published-online":{"date-parts":[[2024,9,18]]},"reference":[{"key":"2024091919443818400_bib1","first-page":"432","article-title":"Semantic sensitivities and inconsistent predictions: Measuring the fragility of NLI models","volume-title":"Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Arakelyan","year":"2024"},{"key":"2024091919443818400_bib2","doi-asserted-by":"publisher","DOI":"10.3115\/980451.980860","article-title":"The Berkeley Framenet Project","volume-title":"COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics","author":"Baker","year":"1998"},{"key":"2024091919443818400_bib3","doi-asserted-by":"publisher","first-page":"596","DOI":"10.18653\/v1\/2022.acl-long.45","article-title":"Quality controlled paraphrase generation","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Bandel","year":"2022"},{"issue":"3","key":"2024091919443818400_bib4","doi-asserted-by":"publisher","first-page":"463","DOI":"10.1162\/COLI_a_00166","article-title":"What is a paraphrase?","volume":"39","author":"Bhagat","year":"2013","journal-title":"Computational Linguistics"},{"key":"2024091919443818400_bib5","article-title":"Abductive commonsense reasoning","author":"Bhagavatula","year":"2019","journal-title":"arXiv preprint arXiv:1908.05739"},{"key":"2024091919443818400_bib6","doi-asserted-by":"publisher","first-page":"632","DOI":"10.18653\/v1\/D15-1075","article-title":"A large annotated corpus for learning natural language inference","volume-title":"Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing","author":"Bowman","year":"2015"},{"key":"2024091919443818400_bib7","first-page":"1877","article-title":"Language models are few-shot 
learners","volume-title":"Advances in Neural Information Processing Systems","author":"Brown","year":"2020"},{"key":"2024091919443818400_bib8","article-title":"Scaling instruction-finetuned language models","author":"Chung","year":"2022","journal-title":"arXiv preprint arXiv:2210.11416"},{"key":"2024091919443818400_bib9","doi-asserted-by":"publisher","first-page":"670","DOI":"10.18653\/v1\/D17-1070","article-title":"Supervised learning of universal sentence representations from natural language inference data","volume-title":"Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing","author":"Conneau","year":"2017"},{"key":"2024091919443818400_bib10","doi-asserted-by":"publisher","first-page":"1012","DOI":"10.1162\/tacl_a_00410","article-title":"Measuring and improving consistency in pretrained language models","volume":"9","author":"Elazar","year":"2021","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024091919443818400_bib11","doi-asserted-by":"publisher","first-page":"5533","DOI":"10.18653\/v1\/P19-1554","article-title":"Misleading failures of partial-input baselines","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Feng","year":"2019"},{"issue":"5","key":"2024091919443818400_bib12","doi-asserted-by":"publisher","first-page":"378","DOI":"10.1037\/h0031619","article-title":"Measuring nominal scale agreement among many raters","volume":"76","author":"Fleiss","year":"1971","journal-title":"Psychological Bulletin"},{"key":"2024091919443818400_bib13","doi-asserted-by":"publisher","first-page":"653","DOI":"10.18653\/v1\/2020.emnlp-main.48","article-title":"Social chemistry 101: Learning to reason about social and moral norms","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing 
(EMNLP)","author":"Forbes","year":"2020"},{"key":"2024091919443818400_bib14","doi-asserted-by":"publisher","first-page":"1307","DOI":"10.18653\/v1\/2020.findings-emnlp.117","article-title":"Evaluating models\u2019 local decision boundaries via contrast sets","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Gardner","year":"2020"},{"key":"2024091919443818400_bib15","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3115\/1654536.1654538","article-title":"The third PASCAL recognizing textual entailment challenge","volume-title":"Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing","author":"Giampiccolo","year":"2007"},{"key":"2024091919443818400_bib16","doi-asserted-by":"publisher","first-page":"650","DOI":"10.18653\/v1\/P18-2103","article-title":"Breaking nli systems with sentences that require simple lexical inferences","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Glockner","year":"2018"},{"key":"2024091919443818400_bib17","doi-asserted-by":"publisher","first-page":"107","DOI":"10.18653\/v1\/N18-2017","article-title":"Annotation artifacts in natural language inference data","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)","author":"Gururangan","year":"2018"},{"key":"2024091919443818400_bib18","article-title":"Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing","volume-title":"The Eleventh International Conference on Learning Representations","author":"He","year":"2022"},{"key":"2024091919443818400_bib19","doi-asserted-by":"publisher","first-page":"1020","DOI":"10.18653\/v1\/2021.acl-short.129","article-title":"MedNLI is not immune: Natural language inference artifacts in the clinical 
domain","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)","author":"Herlihy","year":"2021"},{"key":"2024091919443818400_bib20","doi-asserted-by":"publisher","first-page":"839","DOI":"10.18653\/v1\/N19-1090","article-title":"Improved lexically constrained decoding for translation and monolingual rewriting","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Edward Hu","year":"2019"},{"key":"2024091919443818400_bib21","doi-asserted-by":"publisher","first-page":"1875","DOI":"10.18653\/v1\/N18-1170","article-title":"Adversarial example generation with syntactically controlled paraphrase networks","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Iyyer","year":"2018"},{"key":"2024091919443818400_bib22","doi-asserted-by":"publisher","first-page":"962","DOI":"10.1162\/tacl_a_00407","article-title":"How can we know when language models know? 
On the calibration of language models for question answering","volume":"9","author":"Jiang","year":"2021","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024091919443818400_bib23","first-page":"427","article-title":"Bag of tricks for efficient text classification","volume-title":"Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers","author":"Joulin","year":"2017"},{"key":"2024091919443818400_bib24","article-title":"Learning the difference that makes a difference with counterfactually-augmented data","author":"Kaushik","year":"2020"},{"key":"2024091919443818400_bib25","doi-asserted-by":"publisher","first-page":"3428","DOI":"10.18653\/v1\/P19-1334","article-title":"Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"McCoy","year":"2019"},{"key":"2024091919443818400_bib26","first-page":"2340","article-title":"Stress test evaluation for natural language inference","volume-title":"Proceedings of the 27th International Conference on Computational Linguistics","author":"Naik","year":"2018"},{"key":"2024091919443818400_bib27","doi-asserted-by":"publisher","first-page":"4885","DOI":"10.18653\/v1\/2020.acl-main.441","article-title":"Adversarial NLI: A new benchmark for natural language understanding","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Nie","year":"2020"},{"key":"2024091919443818400_bib28","first-page":"1659","article-title":"Universal dependencies v1: A multilingual treebank collection","volume-title":"Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC\u201916)","author":"Nivre","year":"2016"},{"key":"2024091919443818400_bib29","volume-title":"Collected Papers of Charles Sanders 
Peirce","author":"Peirce","year":"1974"},{"key":"2024091919443818400_bib30","doi-asserted-by":"publisher","first-page":"1532","DOI":"10.3115\/v1\/D14-1162","article-title":"GloVe: Global vectors for word representation","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Pennington","year":"2014"},{"key":"2024091919443818400_bib31","doi-asserted-by":"publisher","first-page":"180","DOI":"10.18653\/v1\/S18-2023","article-title":"Hypothesis only baselines in natural language inference","volume-title":"Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics","author":"Poliak","year":"2018"},{"issue":"1","key":"2024091919443818400_bib32","first-page":"5485","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"2024091919443818400_bib33","doi-asserted-by":"publisher","first-page":"2383","DOI":"10.18653\/v1\/D16-1264","article-title":"SQuAD: 100,000+ questions for machine comprehension of text","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing","author":"Rajpurkar","year":"2016"},{"issue":"1-2","key":"2024091919443818400_bib34","doi-asserted-by":"publisher","first-page":"81","DOI":"10.1016\/0004-3702(80)90014-4","article-title":"A logic for default reasoning","volume":"13","author":"Reiter","year":"1980","journal-title":"Artificial Intelligence"},{"key":"2024091919443818400_bib35","doi-asserted-by":"publisher","first-page":"4661","DOI":"10.18653\/v1\/2020.findings-emnlp.418","article-title":"Thinking like a skeptic: Defeasible inference in natural language","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 
2020","author":"Rudinger","year":"2020"},{"issue":"9","key":"2024091919443818400_bib36","doi-asserted-by":"publisher","first-page":"99","DOI":"10.1145\/3474381","article-title":"Winogrande: An adversarial winograd schema challenge at scale","volume":"64","author":"Sakaguchi","year":"2021","journal-title":"Communications of the ACM"},{"key":"2024091919443818400_bib37","doi-asserted-by":"publisher","first-page":"3027","DOI":"10.1609\/aaai.v33i01.33013027","article-title":"Atomic: An atlas of machine commonsense for if-then reasoning","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Sap","year":"2019"},{"key":"2024091919443818400_bib38","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v29i1.9759","article-title":"Semantic representation","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Schubert","year":"2015"},{"key":"2024091919443818400_bib39","doi-asserted-by":"publisher","first-page":"7881","DOI":"10.18653\/v1\/2020.acl-main.704","article-title":"BLEURT: Learning robust metrics for text generation","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Sellam","year":"2020"},{"key":"2024091919443818400_bib40","doi-asserted-by":"publisher","first-page":"4753","DOI":"10.18653\/v1\/2022.naacl-main.350","article-title":"Partial-input baselines show that NLI models can ignore context, but they don\u2019t.","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Srikanth","year":"2022"},{"key":"2024091919443818400_bib41","doi-asserted-by":"publisher","first-page":"545","DOI":"10.18653\/v1\/2023.acl-long.32","article-title":"A causal framework to quantify the robustness of mathematical reasoning with language models","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics 
(Volume 1: Long Papers)","author":"Stolfo","year":"2023"},{"key":"2024091919443818400_bib42","doi-asserted-by":"publisher","first-page":"809","DOI":"10.18653\/v1\/N18-1074","article-title":"FEVER: A large-scale dataset for fact extraction and VERification","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Thorne","year":"2018"},{"key":"2024091919443818400_bib43","first-page":"16196","article-title":"Counterfactual invariance to spurious correlations in text classification","volume":"34","author":"Veitch","year":"2021","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2024091919443818400_bib44","doi-asserted-by":"publisher","first-page":"880","DOI":"10.18653\/v1\/2023.acl-short.76","article-title":"Evaluating paraphrastic robustness in textual entailment models","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Verma","year":"2023"},{"key":"2024091919443818400_bib45","doi-asserted-by":"publisher","first-page":"217","DOI":"10.18653\/v1\/2020.emnlp-main.16","article-title":"Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually)","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Warstadt","year":"2020"},{"key":"2024091919443818400_bib46","volume-title":"A Course in Probability","author":"Weiss","year":"2006"},{"key":"2024091919443818400_bib47","doi-asserted-by":"publisher","first-page":"1112","DOI":"10.18653\/v1\/N18-1101","article-title":"A broad-coverage challenge corpus for sentence understanding through inference","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long 
Papers)","author":"Williams","year":"2018"},{"key":"2024091919443818400_bib48","doi-asserted-by":"publisher","first-page":"93","DOI":"10.18653\/v1\/D18-1009","article-title":"SWAG: A large-scale adversarial dataset for grounded commonsense inference","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Zellers","year":"2018"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00692\/2470818\/tacl_a_00692.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00692\/2470818\/tacl_a_00692.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,19]],"date-time":"2024-09-19T19:45:02Z","timestamp":1726775102000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00692\/124462\/How-Often-Are-Errors-in-Natural-Language-Reasoning"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":48,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00692","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]}}}