{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,5,19]],"date-time":"2025-05-19T18:40:03Z","timestamp":1747680003273,"version":"3.40.5"},"reference-count":74,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2025,4,16]],"date-time":"2025-04-16T00:00:00Z","timestamp":1744761600000},"content-version":"vor","delay-in-days":105,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,4,3]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data, making it challenging to train models for explainable predictions. To address this, we investigate how to use existing explanation datasets for self-rationalization and evaluate models\u2019 out-of-distribution (OOD) performance. We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods. The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization. For the generated explanation evaluation, we conduct a human study on 13 selected models and study its correlation with the Acceptability score (T5-11B) and three other LLM-based reference-free metrics. Human evaluation shows that the Acceptability score correlates most strongly with human judgments, demonstrating its effectiveness in evaluating free-text explanations. 
Our findings reveal: 1) few annotated examples effectively adapt models for OOD explanation generation; 2) compared to sample selection strategies, fine-tuning data source has a larger impact on OOD performance; and 3) models with higher label prediction accuracy tend to produce better explanations, as reflected by higher Acceptability scores.<\/jats:p>","DOI":"10.1162\/tacl_a_00741","type":"journal-article","created":{"date-parts":[[2025,4,16]],"date-time":"2025-04-16T14:22:47Z","timestamp":1744813367000},"page":"314-342","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":0,"title":["Self-Rationalization in the Wild: A Large-scale Out-of-Distribution Evaluation on NLI-related tasks"],"prefix":"10.1162","volume":"13","author":[{"given":"Jing","family":"Yang","sequence":"first","affiliation":[{"name":"Artificial Intelligence Lab., Recod.ai, Institute of Computing, University of Campinas, Brazil. jing.yang@ic.unicamp.br"}]},{"given":"Max","family":"Glockner","sequence":"additional","affiliation":[{"name":"UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany. max.glockner@tu-darmstadt.de"}]},{"given":"Anderson","family":"Rocha","sequence":"additional","affiliation":[{"name":"Artificial Intelligence Lab., Recod.ai, Institute of Computing, University of Campinas, Brazil. anderson.rocha@unicamp.br"}]},{"given":"Iryna","family":"Gurevych","sequence":"additional","affiliation":[{"name":"UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany. 
iryna.gurevych@tu-darmstadt.de"}]}],"member":"281","published-online":{"date-parts":[[2025,4,3]]},"reference":[{"key":"2025051914252781000_bib1","unstructured":"Josh\n              Achiam\n            , StevenAdler, SandhiniAgarwal, LamaAhmad, IlgeAkkaya, Florencia LeoniAleman, DiogoAlmeida, JankoAltenschmidt, SamAltman, ShyamalAnadkat, RedAvila, IgorBabuschkin, SuchirBalaji, ValerieBalcom, PaulBaltescu, HaimingBao, MohammadBavarian, JeffBelgum, IrwanBello, JakeBerdine, GabrielBernadett-Shapiro, ChristopherBerner, LennyBogdonoff, OlegBoiko, MadelaineBoyd, Anna-LuisaBrakman, GregBrockman, TimBrooks, MilesBrundage, KevinButton, TrevorCai, RosieCampbell, AndrewCann, BrittanyCarey, ChelseaCarlson, RoryCarmichael, BrookeChan, CheChang, FotisChantzis, DerekChen, SullyChen, RubyChen, JasonChen, MarkChen, BenChess, ChesterCho, CaseyChu, Hyung WonChung, DaveCummings, JeremiahCurrier, YunxingDai, CoryDecareaux, ThomasDegry, NoahDeutsch, DamienDeville, ArkaDhar, DavidDohan, SteveDowling, SheilaDunning, AdrienEcoffet, AttyEleti, TynaEloundou, DavidFarhi, LiamFedus, NikoFelix, Sim\u00f3n PosadaFishman, JustonForte, IsabellaFulford, LeoGao, ElieGeorges, ChristianGibson, VikGoel, TarunGogineni, GabrielGoh, RaphaGontijo-Lopes, JonathanGordon, MorganGrafstein, ScottGray, RyanGreene, JoshuaGross, Shixiang ShaneGu, YufeiGuo, ChrisHallacy, JesseHan, JeffHarris, YuchenHe, MikeHeaton, JohannesHeidecke, ChrisHesse, AlanHickey, WadeHickey, PeterHoeschele, BrandonHoughton, KennyHsu, ShengliHu, XinHu, JoostHuizinga, ShantanuJain, ShawnJain, \n          2023. GPT-4 Technical Report. 
arXiv preprint arXiv:2303.08774."},{"key":"2025051914252781000_bib2","doi-asserted-by":"publisher","first-page":"3050","DOI":"10.18653\/v1\/2021.acl-long.238","article-title":"Explanations for CommonsenseQA: New dataset and models","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Aggarwal","year":"2021"},{"key":"2025051914252781000_bib3","article-title":"METRO: Efficient denoising pretraining of large scale autoencoding language models with model generated signals","author":"Bajaj","year":"2022","journal-title":"arXiv preprint arXiv:2204.06644"},{"key":"2025051914252781000_bib4","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/978-3-642-48782-8_1","article-title":"Characterization of pareto and lexicographic optimal solutions","volume-title":"Multiple Criteria Decision Making Theory and Application: Proceedings of the Third Conference Hagen\/K\u00f6nigswinter, West Germany, August 20\u201324, 1979","author":"Ben-Tal","year":"1980"},{"key":"2025051914252781000_bib5","doi-asserted-by":"publisher","first-page":"632","DOI":"10.18653\/v1\/D15-1075","article-title":"A large annotated corpus for learning natural language inference","volume-title":"Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing","author":"Bowman","year":"2015"},{"key":"2025051914252781000_bib6","article-title":"On behalf of the stakeholders: Trends in NLP model interpretability in the era of LLMs","author":"Calderon","year":"2024","journal-title":"arXiv preprint arXiv:2407.19200"},{"key":"2025051914252781000_bib7","article-title":"e-SNLI: Natural language inference with natural language explanations","volume":"31","author":"Camburu","year":"2018","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025051914252781000_bib8","article-title":"Automated data curation for 
robust language model Fine-Tuning","author":"Chen","year":"2024","journal-title":"arXiv preprint arXiv:2403.12776"},{"key":"2025051914252781000_bib9","first-page":"2946","article-title":"Learning to generate explanation from e-hospital services for medical suggestion","volume-title":"Proceedings of the 29th International Conference on Computational Linguistics","author":"Chen","year":"2022"},{"key":"2025051914252781000_bib10","doi-asserted-by":"publisher","first-page":"78","DOI":"10.18653\/v1\/2021.starsem-1.7","article-title":"NeuralLog: Natural language inference with joint neural and logical reasoning","volume-title":"Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics","author":"Chen","year":"2021"},{"key":"2025051914252781000_bib11","doi-asserted-by":"publisher","first-page":"4443","DOI":"10.18653\/v1\/2020.acl-main.408","article-title":"ERASER: A benchmark to evaluate rationalized NLP models","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"DeYoung","year":"2020"},{"key":"2025051914252781000_bib12","article-title":"Climate-FEVER: A dataset for verification of real-world climate claims","volume-title":"Tackling Climate Change with Machine Learning workshop at NeurIPS 2020","author":"Diggelmann","year":"2020"},{"key":"2025051914252781000_bib13","doi-asserted-by":"publisher","first-page":"352","DOI":"10.18653\/v1\/2021.naacl-main.32","article-title":"Fool me twice: Entailment from Wikipedia gamification","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Eisenschlos","year":"2021"},{"key":"2025051914252781000_bib14","doi-asserted-by":"publisher","first-page":"15789","DOI":"10.18653\/v1\/2024.acl-long.841","article-title":"OLMo: Accelerating the science of language models","volume-title":"Proceedings of the 62nd Annual Meeting of the 
Association for Computational Linguistics (Volume 1: Long Papers)","author":"Groeneveld","year":"2024"},{"key":"2025051914252781000_bib15","doi-asserted-by":"publisher","first-page":"1090","DOI":"10.18653\/v1\/2024.naacl-long.62","article-title":"Language models hallucinate, but may excel at fact verification","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)","author":"Guan","year":"2024"},{"key":"2025051914252781000_bib16","doi-asserted-by":"publisher","first-page":"493","DOI":"10.18653\/v1\/K19-1046","article-title":"A richly annotated corpus for different tasks in automated fact-checking","volume-title":"Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)","author":"Hanselowski","year":"2019"},{"key":"2025051914252781000_bib17","doi-asserted-by":"publisher","first-page":"161","DOI":"10.18653\/v1\/2022.dialdoc-1.19","article-title":"TRUE: Re-evaluating factual consistency evaluation","volume-title":"Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering","author":"Or","year":"2022"},{"key":"2025051914252781000_bib18","article-title":"LoRA: Low-rank adaptation of large language models","volume-title":"International Conference on Learning Representations","author":"Edward","year":"2022"},{"key":"2025051914252781000_bib19","article-title":"Themis: Towards flexible and interpretable NLG evaluation","author":"Xinyu","year":"2024","journal-title":"arXiv preprint arXiv:2406.18365"},{"key":"2025051914252781000_bib20","article-title":"TIGERScore: Towards building explainable metric for all text generation tasks","author":"Jiang","year":"2023","journal-title":"arXiv preprint arXiv:2310.00752"},{"key":"2025051914252781000_bib21","doi-asserted-by":"publisher","first-page":"7265","DOI":"10.18653\/v1\/2021.acl-long.564","article-title":"Mind your outliers! 
Investigating the negative impact of outliers on active learning for visual question answering","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Karamcheti","year":"2021"},{"key":"2025051914252781000_bib22","doi-asserted-by":"publisher","first-page":"8706","DOI":"10.18653\/v1\/2020.acl-main.769","article-title":"End-to-end bias mitigation by modelling biases in corpora","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Mahabadi","year":"2020"},{"key":"2025051914252781000_bib23","doi-asserted-by":"publisher","first-page":"235","DOI":"10.18653\/v1\/S19-1026","article-title":"Probing what different NLP tasks teach machines about function word comprehension","volume-title":"Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)","author":"Kim","year":"2019"},{"key":"2025051914252781000_bib24","doi-asserted-by":"publisher","first-page":"9332","DOI":"10.18653\/v1\/2020.emnlp-main.750","article-title":"Evaluating the factual consistency of abstractive text summarization","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Kryscinski","year":"2020"},{"key":"2025051914252781000_bib25","doi-asserted-by":"publisher","first-page":"13","DOI":"10.18653\/v1\/2024.hcinlp-1.2","article-title":"Properties and challenges of LLM-generated explanations","volume-title":"Proceedings of the Third Workshop on Bridging Human\u2013Computer Interaction and Natural Language Processing","author":"Kunz","year":"2024"},{"key":"2025051914252781000_bib26","first-page":"100","article-title":"Natural language inference from multiple premises","volume-title":"Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long 
Papers)","author":"Lai","year":"2017"},{"key":"2025051914252781000_bib27","doi-asserted-by":"publisher","first-page":"7871","DOI":"10.18653\/v1\/2020.acl-main.703","article-title":"BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Lewis","year":"2020"},{"key":"2025051914252781000_bib28","article-title":"Generative judge for evaluating alignment","volume-title":"The Twelfth International Conference on Learning Representations","author":"Li","year":"2024"},{"key":"2025051914252781000_bib29","doi-asserted-by":"publisher","first-page":"14255","DOI":"10.18653\/v1\/2024.acl-long.769","article-title":"Superfiltering: Weak-to-strong data filtering for fast instruction-tuning","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Li","year":"2024"},{"key":"2025051914252781000_bib30","article-title":"Evaluating the logical reasoning ability of ChatGPT and GPT-4","author":"Liu","year":"2023","journal-title":"arXiv preprint arXiv:2304.03439"},{"key":"2025051914252781000_bib31","article-title":"An empirical study of catastrophic forgetting in large language models during continual fine-tuning","author":"Luo","year":"2023","journal-title":"arXiv preprint arXiv:2308.08747"},{"key":"2025051914252781000_bib32","doi-asserted-by":"publisher","first-page":"410","DOI":"10.18653\/v1\/2022.findings-naacl.31","article-title":"Few-shot self-rationalization with natural language prompts","volume-title":"Findings of the Association for Computational Linguistics: NAACL 2022","author":"Marasovic","year":"2022"},{"key":"2025051914252781000_bib33","first-page":"216","article-title":"A SICK cure for the evaluation of compositional distributional semantic models","volume-title":"Proceedings of the Ninth International Conference on 
Language Resources and Evaluation (LREC\u201914)","author":"Marelli","year":"2014"},{"key":"2025051914252781000_bib34","doi-asserted-by":"publisher","first-page":"1906","DOI":"10.18653\/v1\/2020.acl-main.173","article-title":"On faithfulness and factuality in abstractive summarization","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Maynez","year":"2020"},{"key":"2025051914252781000_bib35","doi-asserted-by":"publisher","first-page":"3428","DOI":"10.18653\/v1\/P19-1334","article-title":"Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"McCoy","year":"2019"},{"key":"2025051914252781000_bib36","doi-asserted-by":"publisher","first-page":"12284","DOI":"10.18653\/v1\/2023.findings-acl.779","article-title":"Few-shot fine-tuning vs. in-context learning: A fair comparison and evaluation","volume-title":"Findings of the Association for Computational Linguistics: ACL 2023","author":"Mosbach","year":"2023"},{"key":"2025051914252781000_bib37","doi-asserted-by":"publisher","first-page":"2164","DOI":"10.18653\/v1\/P16-1204","article-title":"Most \u201cbabies\u201d are \u201clittle\u201d and most \u201cproblems\u201d are \u201chuge\u201d: Compositional Entailment in Adjective-Nouns","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Pavlick","year":"2016"},{"key":"2025051914252781000_bib38","doi-asserted-by":"crossref","first-page":"327","DOI":"10.18653\/v1\/2022.emnlp-demos.33","article-title":"POTATO: The portable text annotation tool","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System 
Demonstrations","author":"Pei","year":"2022"},{"key":"2025051914252781000_bib39","doi-asserted-by":"publisher","first-page":"67","DOI":"10.18653\/v1\/D18-1007","article-title":"Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Poliak","year":"2018"},{"key":"2025051914252781000_bib40","doi-asserted-by":"publisher","first-page":"180","DOI":"10.18653\/v1\/S18-2023","article-title":"Hypothesis only baselines in natural language inference","volume-title":"Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics","author":"Poliak","year":"2018"},{"issue":"1","key":"2025051914252781000_bib41","first-page":"5485","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"2025051914252781000_bib42","article-title":"Tailoring self-rationalizers with multi-reward distillation","volume-title":"International Conference on Learning Representations (ICLR","author":"Ramnath","year":"2024"},{"key":"2025051914252781000_bib43","doi-asserted-by":"publisher","first-page":"7403","DOI":"10.18653\/v1\/2022.emnlp-main.501","article-title":"Does self-rationalization improve robustness to spurious correlations?","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Ross","year":"2022"},{"key":"2025051914252781000_bib44","doi-asserted-by":"publisher","first-page":"2116","DOI":"10.18653\/v1\/2021.acl-long.165","article-title":"COVID-Fact: Fact extraction and verification of real-world claims on COVID-19 pandemic","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long 
Papers)","author":"Saakyan","year":"2021"},{"key":"2025051914252781000_bib45","doi-asserted-by":"publisher","first-page":"8240","DOI":"10.18653\/v1\/2020.emnlp-main.661","article-title":"ConjNLI: Natural language inference over conjunctive sentences","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Saha","year":"2020"},{"key":"2025051914252781000_bib46","doi-asserted-by":"publisher","first-page":"10776","DOI":"10.18653\/v1\/2023.findings-emnlp.722","article-title":"NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2023","author":"Sainz","year":"2023"},{"key":"2025051914252781000_bib47","doi-asserted-by":"publisher","first-page":"5477","DOI":"10.18653\/v1\/2020.acl-main.486","article-title":"Social bias frames: Reasoning about social and power implications of language","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Sap","year":"2020"},{"key":"2025051914252781000_bib48","article-title":"Diversity over size: On the effect of sample and topic sizes for argument mining datasets","author":"Schiller","year":"2022","journal-title":"arXiv preprint arXiv: 2205.11472"},{"key":"2025051914252781000_bib49","doi-asserted-by":"publisher","first-page":"624","DOI":"10.18653\/v1\/2021.naacl-main.52","article-title":"Get your vitamin C! 
Robust fact verification with contrastive evidence","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Schuster","year":"2021"},{"key":"2025051914252781000_bib50","doi-asserted-by":"publisher","first-page":"15725","DOI":"10.18653\/v1\/2024.acl-long.840","article-title":"Dolma: An open corpus of three trillion tokens for language model pretraining research","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Soldaini","year":"2024"},{"key":"2025051914252781000_bib51","article-title":"e-FEVER: Explanations and summaries for automated fact checking","volume-title":"Truth and Trust Online (TTO)","author":"Stammbach","year":"2020"},{"key":"2025051914252781000_bib52","article-title":"Selective annotation makes language models better few-shot learners","volume-title":"The Eleventh International Conference on Learning Representations","author":"Hongjin","year":"2022"},{"key":"2025051914252781000_bib53","doi-asserted-by":"publisher","first-page":"738","DOI":"10.18653\/v1\/D18-1081","article-title":"BLEU is not suitable for the evaluation of text simplification","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Sulem","year":"2018"},{"key":"2025051914252781000_bib54","doi-asserted-by":"publisher","first-page":"9275","DOI":"10.18653\/v1\/2020.emnlp-main.746","article-title":"Dataset cartography: Mapping and diagnosing datasets with training dynamics","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Swayamdipta","year":"2020"},{"key":"2025051914252781000_bib55","article-title":"UL2: Unifying language learning paradigms","volume-title":"International Conference on Learning 
Representations","author":"Yi","year":"2022"},{"key":"2025051914252781000_bib56","doi-asserted-by":"publisher","first-page":"809","DOI":"10.18653\/v1\/N18-1074","article-title":"FEVER: A large-scale dataset for fact extraction and VERification","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Thorne","year":"2018"},{"key":"2025051914252781000_bib57","article-title":"Llama 2: Open foundation and fine-tuned chat models","author":"Touvron","year":"2023","journal-title":"arXiv preprint arXiv:2307.09288"},{"key":"2025051914252781000_bib58","doi-asserted-by":"publisher","first-page":"4825","DOI":"10.18653\/v1\/2023.findings-acl.297","article-title":"Few shot rationale generation using self-training with dual teachers","volume-title":"Findings of the Association for Computational Linguistics: ACL 2023","author":"Veerubhotla","year":"2023"},{"key":"2025051914252781000_bib59","doi-asserted-by":"publisher","first-page":"7534","DOI":"10.18653\/v1\/2020.emnlp-main.609","article-title":"Fact or fiction: Verifying scientific claims","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Wadden","year":"2020"},{"key":"2025051914252781000_bib60","doi-asserted-by":"publisher","first-page":"5008","DOI":"10.18653\/v1\/2020.acl-main.450","article-title":"Asking and answering questions to evaluate the factual consistency of summaries","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Wang","year":"2020"},{"key":"2025051914252781000_bib61","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-5446","article-title":"GLUE: A multi-task benchmark and analysis platform for natural language understanding","volume-title":"7th International Conference on Learning Representations, ICLR 
2019","author":"Wang","year":"2019"},{"key":"2025051914252781000_bib62","doi-asserted-by":"publisher","first-page":"4020","DOI":"10.18653\/v1\/P19-1393","article-title":"Does it make sense? And why? A pilot study for sense making and explanation","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Wang","year":"2019"},{"key":"2025051914252781000_bib63","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025051914252781000_bib64","doi-asserted-by":"publisher","first-page":"632","DOI":"10.18653\/v1\/2022.naacl-main.47","article-title":"Reframing human-AI collaboration for generating free-text explanations","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Wiegreffe","year":"2022"},{"key":"2025051914252781000_bib65","doi-asserted-by":"publisher","first-page":"10266","DOI":"10.18653\/v1\/2021.emnlp-main.804","article-title":"Measuring association between labels and free-text rationales","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Wiegreffe","year":"2021"},{"key":"2025051914252781000_bib66","doi-asserted-by":"publisher","first-page":"1199","DOI":"10.1145\/3630106.3658966","article-title":"Laboratory-scale ai: Open-weight models are competitive with chatgpt even in low-resource settings","volume-title":"ACM Conference on Fairness, Accountability, and Transparency","author":"Wolfe","year":"2024"},{"key":"2025051914252781000_bib67","doi-asserted-by":"publisher","first-page":"2660","DOI":"10.18653\/v1\/2022.acl-long.190","article-title":"Generating data to mitigate spurious correlations in natural language inference 
datasets","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Yuxiang","year":"2022"},{"key":"2025051914252781000_bib68","article-title":"LESS: Selecting influential data for targeted instruction tuning","volume-title":"International Conference on Machine Learning (ICML)","author":"Xia","year":"2024"},{"key":"2025051914252781000_bib69","doi-asserted-by":"publisher","first-page":"27","DOI":"10.18653\/v1\/2024.knowllm-1.3","article-title":"Reassess summary factual inconsistency detection with large language model","volume-title":"Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)","author":"Yang","year":"2024"},{"key":"2025051914252781000_bib70","doi-asserted-by":"publisher","first-page":"3486","DOI":"10.18653\/v1\/2022.findings-emnlp.255","article-title":"Few-shot out-of-domain transfer learning of natural language explanations in a label-abundant setup","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2022","author":"Yordanov","year":"2022"},{"key":"2025051914252781000_bib71","doi-asserted-by":"publisher","first-page":"252","DOI":"10.18653\/v1\/2024.trustnlp-1.21","article-title":"Tell me why: Explainable public health fact-checking with large language models","volume-title":"Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)","author":"Zarharan","year":"2024"},{"key":"2025051914252781000_bib72","doi-asserted-by":"publisher","first-page":"379","DOI":"10.1162\/tacl_a_00068","article-title":"Ordinal Common-sense Inference","volume":"5","author":"Zhang","year":"2017","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2025051914252781000_bib73","article-title":"Bertscore: Evaluating text generation with bert","volume-title":"International Conference on Learning 
Representations","author":"Zhang","year":"2020"},{"key":"2025051914252781000_bib74","doi-asserted-by":"publisher","first-page":"117","DOI":"10.18653\/v1\/2021.insights-1.17","article-title":"Investigating the effect of natural language explanations on out-of-distribution generalization in few-shot NLI","volume-title":"Proceedings of the Second Workshop on Insights from Negative Results in NLP","author":"Zhou","year":"2021"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00741\/2513631\/tacl_a_00741.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00741\/2513631\/tacl_a_00741.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,19]],"date-time":"2025-05-19T18:25:40Z","timestamp":1747679140000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00741\/128862\/Self-Rationalization-in-the-Wild-A-Large-scale-Out"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":74,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00741","relation":{},"ISSN":["2307-387X"],"issn-type":[{"type":"electronic","value":"2307-387X"}],"subject":[],"published-other":{"date-parts":[[2025]]},"published":{"date-parts":[[2025]]}}}