{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,17]],"date-time":"2026-01-17T19:36:15Z","timestamp":1768678575466,"version":"3.49.0"},"reference-count":49,"publisher":"Cambridge University Press (CUP)","issue":"4","license":[{"start":{"date-parts":[[2023,6,9]],"date-time":"2023-06-09T00:00:00Z","timestamp":1686268800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2024,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The recent progress of deep learning techniques has produced models capable of achieving high scores on traditional Natural Language Inference (NLI) datasets. To understand the generalization limits of these powerful models, an increasing number of adversarial evaluation schemes have appeared. These works use a similar evaluation method: they construct a new NLI test set based on sentences with known logic and semantic properties (the adversarial set), train a model on a benchmark NLI dataset, and evaluate it in the new set. Poor performance on the adversarial set is identified as a model limitation. The problem with this evaluation procedure is that it may only indicate a sampling problem. A machine learning model can perform poorly on a new test set because the text patterns presented in the adversarial set are not well represented in the training sample. To address this problem, we present a new evaluation method, the Invariance under Equivalence test (IE test). The IE test trains a model with sufficient adversarial examples and checks the model\u2019s performance on two equivalent datasets. As a case study, we apply the IE test to the state-of-the-art NLI models using synonym substitution as the form of adversarial examples. The experiment shows that, despite their high predictive power, these models usually produce different inference outputs for equivalent inputs, and, more importantly, this deficiency cannot be solved by adding adversarial observations in the training data.<\/jats:p>","DOI":"10.1017\/s1351324923000268","type":"journal-article","created":{"date-parts":[[2023,6,9]],"date-time":"2023-06-09T10:57:42Z","timestamp":1686308262000},"page":"793-820","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":2,"title":["A resampling-based method to evaluate NLI models"],"prefix":"10.1017","volume":"30","author":[{"given":"Felipe de Souza","family":"Salvatore","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Marcelo","family":"Finger","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"suffix":"Jr.","given":"Roberto","family":"Hirata","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alexandre G.","family":"Patriota","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"56","published-online":{"date-parts":[[2023,6,9]]},"reference":[{"key":"S1351324923000268_ref6","unstructured":"Devlin, J. , Chang, M.-W. , Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171\u20134186."},{"key":"S1351324923000268_ref42","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-5446"},{"key":"S1351324923000268_ref7","article-title":"Fine-tuning pretrained language models: weight initializations, data orders, and early stopping","author":"Dodge","year":"2020","journal-title":"CoRR"},{"key":"S1351324923000268_ref19","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1031"},{"key":"S1351324923000268_ref17","article-title":"DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing","author":"He","year":"2021","journal-title":"CoRR"},{"key":"S1351324923000268_ref29","unstructured":"Naik, A. , Ravichander, A. , Sadeh, N. , Rose, C. and Neubig, G. (2018). Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, NM: Association for Computational Linguistics, pp. 2340\u20132353."},{"key":"S1351324923000268_ref48","first-page":"5753","volume-title":"Advances in Neural Information Processing Systems 32","author":"Yang","year":"2019"},{"key":"S1351324923000268_ref20","first-page":"5065","volume-title":"Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20","author":"Hupkes","year":"2020"},{"key":"S1351324923000268_ref24","unstructured":"Liu, N. F. , Schwartz, R. and Smith, N. A. (2019a). Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers). Association for Computational Linguistics."},{"key":"S1351324923000268_ref27","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1334"},{"key":"S1351324923000268_ref26","article-title":"Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking","author":"Ma","year":"2021","journal-title":"CoRR"},{"key":"S1351324923000268_ref25","unstructured":"Liu, Y. , Ott, M. , Goyal, N. , Du, J. , Joshi, M. , Chen, D. , Levy, O. , Lewis, M. , Zettlemoyer, L. , Stoyanov, V. (2019b). Roberta: A robustly optimized BERT pretraining approach. CoRR, abs\/1907.11692."},{"key":"S1351324923000268_ref36","unstructured":"Salvatore, F. (2020). Looking-for-Equivalences. Available at https:\/\/github.com\/felipessalvatore\/looking-for-equivalences"},{"key":"S1351324923000268_ref9","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/7287.001.0001"},{"key":"S1351324923000268_ref12","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1456"},{"key":"S1351324923000268_ref35","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6397"},{"key":"S1351324923000268_ref5","volume-title":"Proceedings of the 8th Global WordNet Conference (GWC\u201916)","author":"De Paiva","year":"2016"},{"key":"S1351324923000268_ref34","doi-asserted-by":"crossref","unstructured":"Real, L. , Rodrigues, A. , Vieira, A. , Albiero, B. , Thalenberg, B. , Guide, B. , Silva, C. , Lima, G. , C\u00e2mara, I. , Stanojevi\u0107, M. , Souza, R. , De Paiva, V. (2018). SICK-BR: A Portuguese Corpus for Inference: 13th International Conference, PROPOR 2018, Canela, Brazil, September 24-26, 2018, Proceedings, pp. 303\u2013312.","DOI":"10.1007\/978-3-319-99722-3_31"},{"key":"S1351324923000268_ref28","doi-asserted-by":"publisher","DOI":"10.1007\/BF02295996"},{"key":"S1351324923000268_ref18","unstructured":"He, P. , Liu, X. , Gao, J. and Chen, W. (2020). DeBERTa: Decoding-enhanced BERT with disentangled attention. CoRR, abs\/2006.03654."},{"key":"S1351324923000268_ref38","first-page":"179","article-title":"The problem of logical-form equivalence","volume":"19","author":"Shieber","year":"1993","journal-title":"Computational Linguistics"},{"key":"S1351324923000268_ref10","doi-asserted-by":"publisher","DOI":"10.1111\/j.1467-842X.1990.tb01011.x"},{"key":"S1351324923000268_ref39","first-page":"7329","volume-title":"ACL\/IJCNLP (1)","author":"Sinha","year":"2021"},{"key":"S1351324923000268_ref41","article-title":"SuperGLUE: A stickier benchmark for general-purpose language understanding systems","author":"Wang","year":"2019","journal-title":"CoRR"},{"key":"S1351324923000268_ref45","doi-asserted-by":"crossref","unstructured":"Williams, A. , Nangia, N. and Bowman, S. R. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.","DOI":"10.18653\/v1\/N18-1101"},{"key":"S1351324923000268_ref14","doi-asserted-by":"crossref","unstructured":"Glockner, M. , Shwartz, V. and Goldberg, Y. (2018). Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).","DOI":"10.18653\/v1\/P18-2103"},{"key":"S1351324923000268_ref1","first-page":"281","article-title":"Random search for hyper-parameter optimization","volume":"13","author":"Bergstra","year":"2012","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324923000268_ref46","article-title":"Huggingface\u2019s transformers: State-of-the-art natural language processing","author":"Wolf","year":"2019","journal-title":"CoRR"},{"key":"S1351324923000268_ref49","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-2100"},{"key":"S1351324923000268_ref8","unstructured":"Explosion (2020). spaCy: Industrial-strength NLP. Available at https:\/\/github.com\/explosion\/spaCy"},{"key":"S1351324923000268_ref40","first-page":"276","volume-title":"NoDaLiDa","author":"Talman","year":"2021"},{"key":"S1351324923000268_ref23","unstructured":"Lan, Z. , Chen, M. , Goodman, S. , Gimpel, K. , Sharma, P. and Soricut, R. (2020). Albert: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations."},{"key":"S1351324923000268_ref43","unstructured":"Wang, A. , Singh, A. , Michael, J. , Hill, F. , Levy, O. and Bowman, S. R. (2022). GLUE benchmark. Available at https:\/\/gluebenchmark.com\/leaderboard"},{"key":"S1351324923000268_ref16","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-21606-5"},{"key":"S1351324923000268_ref13","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1107"},{"key":"S1351324923000268_ref4","article-title":"Evaluating compositionality in sentence embeddings","author":"Dasgupta","year":"2018","journal-title":"CoRR"},{"key":"S1351324923000268_ref30","article-title":"Analyzing compositionality-sensitivity of NLI models","author":"Nie","year":"2018","journal-title":"CoRR"},{"key":"S1351324923000268_ref3","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/37.3-4.256"},{"key":"S1351324923000268_ref44","volume-title":"All of Statistics: A Concise Course in Statistical Inference","author":"Wasserman","year":"2010"},{"key":"S1351324923000268_ref32","volume-title":"OpenAI Blog","author":"Radford","year":"2019"},{"key":"S1351324923000268_ref47","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-4804"},{"key":"S1351324923000268_ref31","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.441"},{"key":"S1351324923000268_ref33","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324923000268_ref11","first-page":"3","article-title":"Vis\u00e3o geral da avalia\u00e7\u00e3o de similaridade sem\u00e2ntica e infer\u00eancia textual","volume":"8","author":"Fonseca","year":"2016","journal-title":"Linguam\u00e1tica"},{"key":"S1351324923000268_ref37","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-6103"},{"key":"S1351324923000268_ref21","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.324"},{"key":"S1351324923000268_ref15","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-2017"},{"key":"S1351324923000268_ref22","doi-asserted-by":"publisher","DOI":"10.1007\/s11222-012-9370-4"},{"key":"S1351324923000268_ref2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1075"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324923000268","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,20]],"date-time":"2024-09-20T08:32:38Z","timestamp":1726821158000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324923000268\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,9]]},"references-count":49,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,7]]}},"alternative-id":["S1351324923000268"],"URL":"https:\/\/doi.org\/10.1017\/s1351324923000268","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,9]]},"assertion":[{"value":"\u00a9 The Author(s), 2023. Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https:\/\/creativecommons.org\/licenses\/by\/4.0\/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.","name":"license","label":"License","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This content has been made available to all.","name":"free","label":"Free to read"}]}}