{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,5,13]],"date-time":"2025-05-13T22:00:20Z","timestamp":1747173620018,"version":"3.40.5"},"reference-count":194,"publisher":"Cambridge University Press (CUP)","issue":"1","license":[{"start":{"date-parts":[[2022,4,22]],"date-time":"2022-04-22T00:00:00Z","timestamp":1650585600000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2023,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Recent years have seen a growing number of publications that analyse Natural Language Understanding (NLU) datasets for superficial cues, whether they undermine the complexity of the tasks underlying those datasets and how they impact those models that are optimised and evaluated on this data. This structured survey provides an overview of the evolving research area by categorising reported weaknesses in models and datasets and the methods proposed to reveal and alleviate those weaknesses for the English language. We summarise and discuss the findings and conclude with a set of recommendations for possible future research directions. 
We hope that it will be a useful resource for researchers who propose new datasets to assess the suitability and quality of their data to evaluate various phenomena of interest, as well as those who propose novel NLU approaches, to further understand the implications of their improvements with respect to their model\u2019s acquired capabilities.<\/jats:p>","DOI":"10.1017\/s1351324922000171","type":"journal-article","created":{"date-parts":[[2022,4,22]],"date-time":"2022-04-22T09:11:40Z","timestamp":1650618700000},"page":"1-31","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":1,"title":["A survey of methods for revealing and overcoming weaknesses of data-driven Natural Language Understanding"],"prefix":"10.1017","volume":"29","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6391-2950","authenticated-orcid":false,"given":"Viktor","family":"Schlegel","sequence":"first","affiliation":[]},{"given":"Goran","family":"Nenadic","sequence":"additional","affiliation":[]},{"given":"Riza","family":"Batista-Navarro","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2022,4,22]]},"reference":[{"key":"S1351324922000171_ref185","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1009"},{"key":"S1351324922000171_ref16","unstructured":"Chen, M. , D\u2019Arcy, M. , Liu, A. , Fernandez, J. and Downey, D. (2019). CODAH: An Adversarially-Authored Question Answering Dataset for Common Sense."},{"key":"S1351324922000171_ref92","unstructured":"M\u00f6ller, T. , Reina, A. , Jayakumar, R. and Pietsch, M. (2020). COVID-QA: A question answering dataset for COVID-19. 
In ACL 2020 Workshop on Natural Language Processing for COVID-19 (NLP-COVID)."},{"key":"S1351324922000171_ref103","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1258"},{"key":"S1351324922000171_ref64","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.84"},{"key":"S1351324922000171_ref109","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-5818"},{"key":"S1351324922000171_ref120","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6398"},{"key":"S1351324922000171_ref156","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1292"},{"key":"S1351324922000171_ref20","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1418"},{"key":"S1351324922000171_ref61","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-6004"},{"key":"S1351324922000171_ref43","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-6115"},{"key":"S1351324922000171_ref73","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6357"},{"key":"S1351324922000171_ref194","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.773"},{"key":"S1351324922000171_ref79","unstructured":"Magliacane, S. , van Ommen, T. , Claassen, T. , Bongers, S. , Versteeg, P. and Mooij, J.M. (2017). Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems, pp. 10846\u201310856."},{"key":"S1351324922000171_ref119","unstructured":"Roemmele, M. , Bejan, C.A. and Gordon, A.S. (2011). Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series."},{"key":"S1351324922000171_ref35","unstructured":"Geiger, A. , Richardson, K. and Potts, C. (2020). Modular Representation Underlies Systematic Generalization in Neural Natural Language Inference Models. 
arXiv preprint arXiv 2004.14623."},{"key":"S1351324922000171_ref47","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.768"},{"key":"S1351324922000171_ref31","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1554"},{"key":"S1351324922000171_ref48","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1215"},{"key":"S1351324922000171_ref94","unstructured":"Mu, J. and Andreas, J. (2020). Compositional explanations of neurons. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems."},{"key":"S1351324922000171_ref136","unstructured":"Shi, Z. , Zhang, H. , Chang, K.-W. , Huang, M. and Hsieh, C.-J. (2020). Robustness verification for transformers. In 8th International Conference on Learning Representations, ICLR 2020 - Conference Track Proceedings."},{"key":"S1351324922000171_ref21","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00317"},{"key":"S1351324922000171_ref44","unstructured":"Hermann, K.M. , Ko\u00e7isk\u00fd, T. , Grefenstette, E. , Espeholt, L. , Kay, W. , Suleyman, M. and Blunsom, P. (2015). Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1693\u20131701."},{"key":"S1351324922000171_ref106","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.bionlp-1.15"},{"key":"S1351324922000171_ref131","unstructured":"Schlegel, V. , Valentino, M. , Freitas, A.A. , Nenadic, G. and Batista-Navarro, R. (2020). A framework for evaluation of machine reading comprehension gold standards. In Proceedings of The 12th Language Resources and Evaluation Conference, Marseille, France. European Language Resources Association, pp. 5359\u20135369."},{"key":"S1351324922000171_ref8","unstructured":"Bras, R.L. , Swayamdipta, S. , Bhagavatula, C. , Zellers, R. , Peters, M.E. , Sabharwal, A. and Choi, Y. (2020). Adversarial filters of dataset biases. In Hal Daum\u00e9 III and Singh A. 
(eds), Proceedings of the 37th International Conference on Machine Learning. PMLR, pp. 1078\u20131088."},{"key":"S1351324922000171_ref166","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-2091"},{"key":"S1351324922000171_ref104","unstructured":"Panenghat, M.P. , Suntwal, S. , Rafique, F. , Sharp, R. and Surdeanu, M. (2020). Towards the necessity for debiasing natural language inference datasets. In Proceedings of The 12th Language Resources and Evaluation Conference. European Language Resources Association, pp. 6883\u20136888."},{"key":"S1351324922000171_ref133","doi-asserted-by":"crossref","unstructured":"Schuster, T. , Shah, D. , Yeo, Y.J.S. , Roberto Filizzola Ortiz, D. , Santus, E. and Barzilay, R. 2019. Towards debiasing fact verification models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 3417\u20133423.","DOI":"10.18653\/v1\/D19-1341"},{"key":"S1351324922000171_ref147","doi-asserted-by":"crossref","unstructured":"Tafjord, O. , Clark, P. , Gardner, M. , Yih, W.-t. and Sabharwal, A. (2018). QuaRel: A dataset and models for answering questions about qualitative relationships. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 7063\u20137071.","DOI":"10.1609\/aaai.v33i01.33017063"},{"key":"S1351324922000171_ref24","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1144"},{"key":"S1351324922000171_ref84","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1260"},{"key":"S1351324922000171_ref144","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6422"},{"key":"S1351324922000171_ref115","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1079"},{"key":"S1351324922000171_ref150","unstructured":"Tan, S. , Shen, Y. , Huang, C.-w. and Courville, A. (2019). Investigating Biases in Textual Entailment Datasets. 
arXiv preprint arXiv 1906.09635."},{"key":"S1351324922000171_ref172","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2020.106075"},{"key":"S1351324922000171_ref177","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/S19-1027"},{"key":"S1351324922000171_ref108","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1007"},{"key":"S1351324922000171_ref145","doi-asserted-by":"crossref","unstructured":"Sugawara, S. , Yokono, H. and Aizawa, A. (2017b). Prerequisite skills for reading comprehension: Multi-perspective analysis of MCTest datasets and systems. In Thirty-First AAAI Conference on Artificial Intelligence, pp. 3089\u20133096.","DOI":"10.1609\/aaai.v31i1.10957"},{"key":"S1351324922000171_ref66","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/S14-2055"},{"key":"S1351324922000171_ref192","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1332"},{"key":"S1351324922000171_ref181","unstructured":"Yatskar, M. (2019). A qualitative comparison of CoQA, SQuAD 2.0 and QuAC. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 2318\u20132323."},{"key":"S1351324922000171_ref17","doi-asserted-by":"publisher","DOI":"10.1109\/ICSC.2020.00008"},{"key":"S1351324922000171_ref46","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1243"},{"key":"S1351324922000171_ref158","unstructured":"Trivedi, H. , Balasubramanian, N. , Khot, T. and Sabharwal, A. (2020). Measuring and Reducing Non-Multifact Reasoning in Multi-hop Question Answering. 
arXiv preprint arXiv 2005.00789."},{"key":"S1351324922000171_ref2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.66"},{"key":"S1351324922000171_ref113","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1264"},{"key":"S1351324922000171_ref18","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1241"},{"key":"S1351324922000171_ref101","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.88"},{"key":"S1351324922000171_ref68","unstructured":"Lan, Z. , Chen, M. , Goodman, S. , Gimpel, K. , Sharma, P. and Soricut, R. (2020). ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations."},{"key":"S1351324922000171_ref96","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1625"},{"key":"S1351324922000171_ref152","doi-asserted-by":"crossref","unstructured":"Tang, Y. , Ng, H.T. and Tung, A. (2021). Do multi-hop question answering systems know how to answer the single-hop sub-questions? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 3244\u20133249.","DOI":"10.18653\/v1\/2021.eacl-main.283"},{"key":"S1351324922000171_ref121","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K19-1019"},{"key":"S1351324922000171_ref99","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33016867"},{"key":"S1351324922000171_ref173","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1496"},{"key":"S1351324922000171_ref190","unstructured":"Zhang, S. , Liu, X. , Liu, J. , Gao, J. , Duh, K. and Van Durme, B. (2018). ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension. arXiv preprint arXiv:1810.12885."},{"key":"S1351324922000171_ref14","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1405"},{"key":"S1351324922000171_ref74","unstructured":"Liu, N.F. , Schwartz, R. and Smith, N.A. 
(2019a). Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 2171\u20132179."},{"key":"S1351324922000171_ref87","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1416"},{"key":"S1351324922000171_ref70","unstructured":"Liang, Y. , Li, J. and Yin, J. (2019). A new multi-choice reading comprehension dataset for curriculum learning. In Proceedings of Machine Learning Research, vol. 101. International Machine Learning Society (IMLS), pp. 742\u2013757."},{"key":"S1351324922000171_ref82","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1334"},{"key":"S1351324922000171_ref53","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1249"},{"key":"S1351324922000171_ref128","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-6103"},{"key":"S1351324922000171_ref139","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1458"},{"key":"S1351324922000171_ref54","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.conll-1.4"},{"key":"S1351324922000171_ref23","doi-asserted-by":"publisher","DOI":"10.2200\/S00509ED1V01Y201305HLT023"},{"key":"S1351324922000171_ref179","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1237"},{"key":"S1351324922000171_ref80","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.333"},{"key":"S1351324922000171_ref4","unstructured":"Bahdanau, D. , Cho, K.H. and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings. 
International Conference on Learning Representations, ICLR."},{"key":"S1351324922000171_ref10","doi-asserted-by":"publisher","DOI":"10.1016\/j.metip.2020.100022"},{"key":"S1351324922000171_ref107","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00293"},{"key":"S1351324922000171_ref32","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-5801"},{"key":"S1351324922000171_ref140","doi-asserted-by":"crossref","unstructured":"Stacey, J. , Minervini, P. , Dubossarsky, H. , Riedel, S. and Rockt\u00e4schel, T. (2020). There is Strength in Numbers: Avoiding the Hypothesis-Only Bias in Natural Language Inference via Ensemble Adversarial Training. arXiv preprint arXiv 2004.07790.","DOI":"10.18653\/v1\/2020.emnlp-main.665"},{"key":"S1351324922000171_ref97","unstructured":"Naik, A. , Ravichander, A. , Sadeh, N. , Rose, C. and Neubig, G. (2018). Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA. Association for Computational Linguistics, pp. 2340\u20132353."},{"key":"S1351324922000171_ref62","doi-asserted-by":"crossref","unstructured":"Khashabi, D. , Khot, T. and Sabharwal, A. (2020). Natural Perturbation for Robust Question Answering. arXiv preprint arXiv 2004.04849.","DOI":"10.18653\/v1\/2020.emnlp-main.12"},{"key":"S1351324922000171_ref143","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1075"},{"key":"S1351324922000171_ref56","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-5309"},{"key":"S1351324922000171_ref188","unstructured":"Zhang, G. , Bai, B. , Liang, J. , Bai, K. , Zhu, C. and Zhao, T. (2020a). Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark. 
arXiv preprint arXiv 2010.07676."},{"key":"S1351324922000171_ref154","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00145"},{"key":"S1351324922000171_ref60","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1546"},{"key":"S1351324922000171_ref19","doi-asserted-by":"crossref","unstructured":"Clark, C. , Lee, K. , Chang, M.-W. , Kwiatkowski, T. , Collins, M. and Toutanova, K. (2019a). BoolQ: Exploring the surprising difficulty of natural yes\/no questions. In Proceedings of the 2019 Conference of the North, Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 2924\u20132936.","DOI":"10.18653\/v1\/N19-1300"},{"key":"S1351324922000171_ref142","doi-asserted-by":"crossref","unstructured":"Sugawara, S. , Inui, K. , Sekine, S. and Aizawa, A. (2018). What makes reading comprehension questions easier? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 4208\u20134219.","DOI":"10.18653\/v1\/D18-1453"},{"key":"S1351324922000171_ref168","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00021"},{"key":"S1351324922000171_ref175","unstructured":"Yaghoobzadeh, Y. , Tachet, R. , Hazen, T.J. and Sordoni, A. (2019). Robust Natural Language Inference Models with Example Forgetting. arXiv preprint arXiv 1911.03861."},{"key":"S1351324922000171_ref182","unstructured":"Yogatama, D. , D\u2019Autume, C.d.M. , Connor, J. , Kocisky, T. , Chrzanowski, M. , Kong, L. , Lazaridou, A. , Ling, W. , Yu, L. , Dyer, C. and Blunsom, P. (2019). Learning and Evaluating General Linguistic Intelligence. arXiv preprint arXiv:1901.11373."},{"key":"S1351324922000171_ref189","unstructured":"Zhang, G. , Bai, B. , Zhang, J. , Bai, K. , Zhu, C. and Zhao, T. (2019b). Mitigating Annotation Artifacts in Natural Language Inference Datasets to Improve Cross-dataset Generalization Ability. 
arXiv preprint arXiv 1909.04242."},{"key":"S1351324922000171_ref63","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.84"},{"key":"S1351324922000171_ref176","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-4804"},{"key":"S1351324922000171_ref159","unstructured":"Tsuchiya, M. (2018). Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation."},{"key":"S1351324922000171_ref126","unstructured":"Saikh, T. , Ekbal, A. and Bhattacharyya, P. (2020). ScholarlyRead: A new dataset for scientific article reading comprehension. In Proceedings of The 12th Language Resources and Evaluation Conference. European Language Resources Association, pp. 5498\u20135504."},{"key":"S1351324922000171_ref40","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-2017"},{"key":"S1351324922000171_ref25","unstructured":"Demszky, D. , Guu, K. and Liang, P. (2018). Transforming Question Answering Datasets Into Natural Language Inference Datasets. arXiv preprint arXiv:1809.02922."},{"key":"S1351324922000171_ref122","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1609"},{"key":"S1351324922000171_ref22","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00018"},{"key":"S1351324922000171_ref3","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.499"},{"key":"S1351324922000171_ref49","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1423"},{"key":"S1351324922000171_ref71","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-5808"},{"key":"S1351324922000171_ref91","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6371"},{"key":"S1351324922000171_ref5","unstructured":"Bajgar, O. , Kadlec, R. and Kleindienst, J. (2016). Embracing Data Abundance: BookTest Dataset for Reading Comprehension. 
arXiv preprint arXiv 1610.00956."},{"key":"S1351324922000171_ref28","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-5820"},{"key":"S1351324922000171_ref72","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.465"},{"key":"S1351324922000171_ref146","unstructured":"Szegedy, C. , Zaremba, W. , Sutskever, I. , Bruna, J. , Erhan, D. , Goodfellow, I. and Fergus, R. (2014). Intriguing properties of neural networks. In International Conference on Learning Representations."},{"key":"S1351324922000171_ref161","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.770"},{"key":"S1351324922000171_ref169","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/E17-1042"},{"key":"S1351324922000171_ref183","unstructured":"Yu, W. , Jiang, Z. , Dong, Y. and Feng, J. (2020). ReClor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations."},{"key":"S1351324922000171_ref193","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2020.3016132"},{"key":"S1351324922000171_ref155","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1074"},{"key":"S1351324922000171_ref110","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324922000171_ref157","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1335"},{"key":"S1351324922000171_ref42","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.492"},{"key":"S1351324922000171_ref6","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1084"},{"key":"S1351324922000171_ref162","unstructured":"Vaswani, A. , Shazeer, N. , Parmar, N. , Uszkoreit, J. , Jones, L. , Gomez, A.N. , Kaiser, L. and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998\u20136008."},{"key":"S1351324922000171_ref75","unstructured":"Liu, P. 
, Du, C. , Zhao, S. and Zhu, C. (2019b). Emotion Action Detection and Emotion Inference: The Task and Dataset. arXiv preprint arXiv 1903.06901."},{"key":"S1351324922000171_ref36","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-2103"},{"key":"S1351324922000171_ref89","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K18-1007"},{"key":"S1351324922000171_ref164","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1221"},{"key":"S1351324922000171_ref180","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1259"},{"key":"S1351324922000171_ref149","unstructured":"Talmor, A. , Tafjord, O. , Clark, P. , Goldberg, Y. and Berant, J. (2020). Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems."},{"key":"S1351324922000171_ref27","unstructured":"Dodge, J. , Ilharco, G. , Schwartz, R. , Farhadi, A. , Hajishirzi, H. and Smith, N. (2020). Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping. arXiv preprint arXiv 2002.0630."},{"key":"S1351324922000171_ref171","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-5807"},{"key":"S1351324922000171_ref187","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1435"},{"key":"S1351324922000171_ref33","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.117"},{"key":"S1351324922000171_ref112","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.300"},{"key":"S1351324922000171_ref153","doi-asserted-by":"crossref","unstructured":"Teney, D. , Abbasnedjad, E. and van den Hengel, A. (2020a). Learning what makes a difference from counterfactual examples and gradient supervision. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), LNCS, vol. 12355, pp. 
580\u2013599.","DOI":"10.1007\/978-3-030-58607-2_34"},{"key":"S1351324922000171_ref95","doi-asserted-by":"crossref","unstructured":"Mudrakarta, P.K. , Taly, A. , Sundararajan, M. and Dhamdhere, K. (2018). Did the model understand the question? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 1896\u20131906.","DOI":"10.18653\/v1\/P18-1176"},{"key":"S1351324922000171_ref59","unstructured":"Kaushik, D. , Hovy, E. and Lipton, Z.C. (2020). Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations."},{"key":"S1351324922000171_ref151","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1629"},{"key":"S1351324922000171_ref77","unstructured":"Liu, X. , Cheng, H. , He, P. , Chen, W. , Wang, Y. , Poon, H. and Gao, J. (2020c). Adversarial Training for Large Neural Language Models. arXiv preprint arXiv 2004.08994."},{"key":"S1351324922000171_ref38","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.177"},{"key":"S1351324922000171_ref52","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1259"},{"key":"S1351324922000171_ref78","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.101"},{"key":"S1351324922000171_ref105","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1144"},{"key":"S1351324922000171_ref85","unstructured":"Miller, J. , Krauth, K. , Recht, B. and Schmidt, L. (2020). The effect of natural distribution shift on question answering models. In Proceedings of the 37th International Conference on Machine Learning, pp. 
6905\u20136916."},{"key":"S1351324922000171_ref123","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-5436"},{"key":"S1351324922000171_ref184","doi-asserted-by":"publisher","DOI":"10.1109\/ISPA-BDCloud-SustainCom-SocialCom48970.2019.00113"},{"key":"S1351324922000171_ref7","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1075"},{"key":"S1351324922000171_ref83","doi-asserted-by":"publisher","DOI":"10.1145\/3457607"},{"key":"S1351324922000171_ref90","unstructured":"Mishra, S. , Arunkumar, A. , Sachdeva, B. , Bryan, C. and Baral, C. (2020). DQI: Measuring Data Quality in NLP. arXiv preprint arXiv 2005.00816."},{"key":"S1351324922000171_ref57","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1225"},{"key":"S1351324922000171_ref137","unstructured":"Si, C. , Wang, S. , Kan, M.-Y. and Jiang, J. (2019). What does BERT Learn from Multiple-Choice Reading Comprehension Datasets? arXiv preprint arXiv:1910.12391."},{"key":"S1351324922000171_ref50","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1262"},{"key":"S1351324922000171_ref186","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1472"},{"key":"S1351324922000171_ref165","doi-asserted-by":"crossref","unstructured":"Wang, A. , Singh, A.A. , Michael, J. , Hill, F. , Levy, O. and Bowman, S.R. (2018). Glue: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv preprint arXiv:1804.07461.","DOI":"10.18653\/v1\/W18-5446"},{"key":"S1351324922000171_ref88","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1160"},{"key":"S1351324922000171_ref134","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K17-1004"},{"key":"S1351324922000171_ref178","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.90"},{"key":"S1351324922000171_ref124","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-5435"},{"key":"S1351324922000171_ref9","unstructured":"Brown, T.B. , Mann, B. , Ryder, N. , Subbiah, M. , Kaplan, J. , Dhariwal, P. , Neelakantan, A. , Shyam, P. 
, Sastry, G. , Askell, A. , Agarwal, S. , Herbert-Voss, A. , Krueger, G. , Henighan, T. , Child, R. , Ramesh, A. , Ziegler, D.M. , Wu, J. , Winter, C. , Hesse, C. , Chen, M. , Sigler, E. , Litwin, M. , Gray, S. , Chess, B. , Clark, J. , Berner, C. , McCandlish, S. , Radford, A. , Sutskever, I. and Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv preprint arXiv 2005.14165."},{"key":"S1351324922000171_ref132","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1086"},{"key":"S1351324922000171_ref100","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.441"},{"key":"S1351324922000171_ref65","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00276"},{"key":"S1351324922000171_ref102","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1459"},{"key":"S1351324922000171_ref51","doi-asserted-by":"publisher","DOI":"10.3390\/app11146421"},{"key":"S1351324922000171_ref163","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1092"},{"key":"S1351324922000171_ref55","doi-asserted-by":"publisher","DOI":"10.1109\/ICTAI.2016.0128"},{"key":"S1351324922000171_ref12","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.repl4nlp-1.11"},{"key":"S1351324922000171_ref39","first-page":"77","article-title":"Adversarial networks for machine reading","volume":"59","author":"Grail","year":"2018","journal-title":"TAL Traitement Automatique des Langues"},{"key":"S1351324922000171_ref45","unstructured":"Holzenberger, N. , Blair-Stanek, A. and Van Durme, B. (2020). A dataset for statutory reasoning in tax law entailment and question answering. In NLLP KDD."},{"key":"S1351324922000171_ref13","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1223"},{"key":"S1351324922000171_ref167","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.103"},{"key":"S1351324922000171_ref67","unstructured":"Lake, B.M. and Baroni, M. (2017). Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. 
In 35th International Conference on Machine Learning, ICML 2018, vol. 7, pp. 4487\u20134499."},{"key":"S1351324922000171_ref114","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1621"},{"key":"S1351324922000171_ref174","doi-asserted-by":"publisher","DOI":"10.1007\/s11633-019-1211-x"},{"key":"S1351324922000171_ref81","doi-asserted-by":"publisher","DOI":"10.1145\/3281354.3281359"},{"key":"S1351324922000171_ref41","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1175"},{"key":"S1351324922000171_ref58","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.769"},{"key":"S1351324922000171_ref1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-19551-3_31"},{"key":"S1351324922000171_ref29","unstructured":"Dua, D. , Wang, Y. , Dasigi, P. , Stanovsky, G. , Singh, S. and Gardner, M. (2019b). DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2368\u20132378."},{"key":"S1351324922000171_ref191","first-page":"1","article-title":"Adversarial attacks on deep-learning models in natural language processing","volume":"11","author":"Zhang","year":"2020","journal-title":"ACM Transactions on Intelligent Systems and Technology"},{"key":"S1351324922000171_ref93","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-1098"},{"key":"S1351324922000171_ref111","unstructured":"Raghunathan, A. , Steinhardt, J. and Liang, P. (2018). Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, pp. 10900\u201310910."},{"key":"S1351324922000171_ref117","doi-asserted-by":"crossref","unstructured":"Richardson, K. , Hu, H. , Moss, L.S. and Sabharwal, A. (2019). 
Probing natural language inference models through semantic fragments. In Proceedings of the AAAI Conference on Artificial Intelligence.","DOI":"10.1609\/aaai.v34i05.6397"},{"key":"S1351324922000171_ref160","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00335"},{"key":"S1351324922000171_ref34","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1456"},{"key":"S1351324922000171_ref11","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-2097"},{"key":"S1351324922000171_ref98","unstructured":"Nakanishi, M. , Kobayashi, T. and Hayashi, Y. (2018). Answerable or not: Devising a dataset for extending machine reading comprehension. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 973\u2013983."},{"key":"S1351324922000171_ref116","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.442"},{"key":"S1351324922000171_ref37","unstructured":"Goodfellow, I.J. , Pouget-Abadie, J. , Mirza, M. , Xu, B. , Warde-Farley, D. , Ozair, S. , Courville, A. and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672\u20132680."},{"key":"S1351324922000171_ref125","unstructured":"Sagawa, S. , Koh, P.W. , Hashimoto, T.B. and Liang, P. (2020). Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations."},{"key":"S1351324922000171_ref30","unstructured":"Dunn, M. , Sagun, L. , Higgins, M. , Guney, V.U. , Cirik, V. and Cho, K. (2017). SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv preprint arXiv 1704.05179."},{"key":"S1351324922000171_ref69","unstructured":"Li, P. , Li, W. , He, Z. , Wang, X. , Cao, Y. , Zhou, J. and Xu, W. (2016). Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering. 
arXiv preprint arXiv 1607.06275."},{"key":"S1351324922000171_ref138","doi-asserted-by":"crossref","unstructured":"Si, C. , Yang, Z. , Cui, Y. , Ma, W. , Liu, T. and Wang, S. (2020). Benchmarking Robustness of Machine Reading Comprehension Models. arXiv preprint arXiv 2004.14004.","DOI":"10.18653\/v1\/2021.findings-acl.56"},{"key":"S1351324922000171_ref141","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2017.04.009"},{"key":"S1351324922000171_ref135","unstructured":"Seo, M.J. , Kembhavi, A. , Farhadi, A. and Hajishirzi, H. (2017). Bidirectional attention flow for machine comprehension. In International Conference on Learning Representations."},{"key":"S1351324922000171_ref148","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1485"},{"key":"S1351324922000171_ref118","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00331"},{"key":"S1351324922000171_ref26","unstructured":"Devlin, J. , Chang, M.-W. , Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 4171\u20134186."},{"key":"S1351324922000171_ref86","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.212"},{"key":"S1351324922000171_ref129","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1179"},{"key":"S1351324922000171_ref130","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.185"},{"key":"S1351324922000171_ref15","doi-asserted-by":"crossref","unstructured":"Chen, J. and Durrett, G. (2020). Robust Question Answering Through Sub-part Alignment. arXiv preprint arXiv 2004.14648.","DOI":"10.18653\/v1\/2021.naacl-main.98"},{"key":"S1351324922000171_ref76","unstructured":"Liu, T. , Zheng, X. , Chang, B. and Sui, Z. (2020b). 
HypoNLI: Exploring the artificial patterns of hypothesis-only bias in natural language inference. In Proceedings of The 12th Language Resources and Evaluation Conference."},{"key":"S1351324922000171_ref170","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1101"},{"key":"S1351324922000171_ref127","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6399"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324922000171","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,22]],"date-time":"2024-09-22T19:04:12Z","timestamp":1727031852000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324922000171\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,22]]},"references-count":194,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,1]]}},"alternative-id":["S1351324922000171"],"URL":"https:\/\/doi.org\/10.1017\/s1351324922000171","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"type":"print","value":"1351-3249"},{"type":"electronic","value":"1469-8110"}],"subject":[],"published":{"date-parts":[[2022,4,22]]},"assertion":[{"value":"\u00a9 The Author(s), 2022. 
Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https:\/\/creativecommons.org\/licenses\/by\/4.0\/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.","name":"license","label":"License","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This content has been made available to all.","name":"free","label":"Free to read"}]}}