{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T06:53:41Z","timestamp":1767855221850,"version":"3.49.0"},"reference-count":90,"publisher":"Cambridge University Press (CUP)","issue":"3","license":[{"start":{"date-parts":[[2022,6,9]],"date-time":"2022-06-09T00:00:00Z","timestamp":1654732800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2023,5]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Recent research has reported that standard fine-tuning approaches can be <jats:italic>unstable<\/jats:italic> due to being prone to various sources of randomness, including but not limited to weight initialization, training data order, and hardware. Such brittleness can lead to different evaluation results, prediction confidences, and generalization inconsistency of the same models independently fine-tuned under the same experimental setup. Our paper explores this problem in natural language inference, a common task in benchmarking practices, and extends the ongoing research to the multilingual setting. We propose six novel textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. Our key findings are that the mBERT model demonstrates fine-tuning instability for categories that involve lexical semantics, logic, and predicate-argument structure and struggles to learn monotonicity, negation, numeracy, and symmetry. We also observe that using extra training data only in English can enhance the generalization performance and fine-tuning stability, which we attribute to the cross-lingual transfer capabilities. However, the ratio of particular features in the additional training data might rather hurt the performance for model instances. We are publicly releasing the datasets, hoping to foster the diagnostic investigation of language models (LMs) in a cross-lingual scenario, particularly in terms of benchmarking, which might promote a more holistic understanding of multilingualism in LMs and cross-lingual knowledge transfer.<\/jats:p>","DOI":"10.1017\/s1351324922000225","type":"journal-article","created":{"date-parts":[[2022,6,9]],"date-time":"2022-06-09T11:42:58Z","timestamp":1654774978000},"page":"554-583","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":3,"title":["Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task"],"prefix":"10.1017","volume":"29","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4561-5415","authenticated-orcid":false,"given":"Maria","family":"Tikhonova","sequence":"first","affiliation":[]},{"given":"Vladislav","family":"Mikhailov","sequence":"additional","affiliation":[]},{"given":"Dina","family":"Pisarevskaya","sequence":"additional","affiliation":[]},{"given":"Valentin","family":"Malykh","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6976-0185","authenticated-orcid":false,"given":"Tatiana","family":"Shavrina","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2022,6,9]]},"reference":[{"key":"S1351324922000225_ref50","unstructured":"Mosbach, M. , Andriushchenko, M. and Klakow, D. (2020a). On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines. In International Conference on Learning Representations."},{"key":"S1351324922000225_ref23","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.732"},{"key":"S1351324922000225_ref78","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-1004"},{"key":"S1351324922000225_ref16","doi-asserted-by":"publisher","DOI":"10.3115\/1654536.1654538"},{"key":"S1351324922000225_ref77","unstructured":"Warstadt, A. and Bowman, S.R. (2019). Linguistic analysis of pretrained sentence encoders with acceptability judgments. arXiv preprint arXiv:1901.03438."},{"key":"S1351324922000225_ref8","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1198"},{"key":"S1351324922000225_ref30","doi-asserted-by":"crossref","unstructured":"Jawahar, G. , Sagot, B. and Seddah, D. (2019). What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 3651\u20133657.","DOI":"10.18653\/v1\/P19-1356"},{"key":"S1351324922000225_ref84","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-4804"},{"key":"S1351324922000225_ref33","unstructured":"Kingma, D.P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980."},{"key":"S1351324922000225_ref5","unstructured":"Bhojanapalli, S. , Wilber, K. , Veit, A. , Rawat, A.S. , Kim, S. , Menon, A. and Kumar, S. (2021). On the reproducibility of neural network predictions. arXiv preprint arXiv:2102.03349."},{"key":"S1351324922000225_ref89","unstructured":"Zhu, C. , Cheng, Y. , Gan, Z. , Sun, S. , Goldstein, T. and Liu, J. (2019). Freelb: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations."},{"key":"S1351324922000225_ref62","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00349"},{"key":"S1351324922000225_ref70","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.53"},{"key":"S1351324922000225_ref29","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.insights-1.13"},{"key":"S1351324922000225_ref37","unstructured":"Lee, C. , Cho, K. and Kang, W. (2019). Mixout: Effective regularization to finetune large-scale pretrained language models. arXiv preprint arXiv:1909.11299."},{"key":"S1351324922000225_ref36","unstructured":"Le, H. , Vial, L. , Frej, J. , Segonne, V. , Coavoux, M. , Lecouteux, B. , Allauzen, A. , Crabb\u00e9, B. , Besacier, L. and Schwab, D. (2020). FlauBERT: Unsupervised language model pre-training for French. In Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, pp. 2479\u20132490."},{"key":"S1351324922000225_ref79","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1101"},{"key":"S1351324922000225_ref27","unstructured":"Hu, J. , Ruder, S. , Siddhant, A. , Neubig, G. , Firat, O. and Johnson, M. (2020b). Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization."},{"key":"S1351324922000225_ref4","unstructured":"Bentivogli, L. , Clark, P. , Dagan, I. and Giampiccolo, D. (2009). The fifth pascal recognizing textual entailment challenge. In TAC."},{"key":"S1351324922000225_ref51","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.blackboxnlp-1.7"},{"key":"S1351324922000225_ref67","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.blackboxnlp-1.17"},{"key":"S1351324922000225_ref40","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00343"},{"key":"S1351324922000225_ref72","unstructured":"Vaswani, A. , Shazeer, N. , Parmar, N. , Uszkoreit, J. , Jones, L. , Gomez, A.N. , Kaiser, \u0141. and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998\u20136008)."},{"key":"S1351324922000225_ref20","unstructured":"Haim, R.B. , Dagan, I. , Dolan, B. , Ferro, L. , Giampiccolo, D. , Magnini, B. and Szpektor, I. (2006). The second pascal recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment."},{"key":"S1351324922000225_ref53","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.441"},{"key":"S1351324922000225_ref66","unstructured":"Shavrina, T. and Shapovalova, O. (2017). To the methodology of corpus construction for machine learning:\u201ctaiga\u201d. syntax tree corpus and parser. Corpus Linguistics 2017, p. 78."},{"key":"S1351324922000225_ref45","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.blackboxnlp-1.21"},{"key":"S1351324922000225_ref86","unstructured":"Zhang, S. , Liu, X. , Liu, J. , Gao, J. , Duh, K. and Durme, B.V. (2018). Record: Bridging the gap between human and machine commonsense reading comprehension."},{"key":"S1351324922000225_ref1","unstructured":"Al-Shabab, O. (1996). Interpretation and the language of translation: creativity and conventions in translation."},{"key":"S1351324922000225_ref11","doi-asserted-by":"crossref","unstructured":"Dagan, I. , Glickman, O. and Magnini, B. (2005). The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop. Springer, pp. 177\u2013190.","DOI":"10.1007\/11736790_9"},{"key":"S1351324922000225_ref49","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.212"},{"key":"S1351324922000225_ref18","unstructured":"Goldberg, Y. (2019). Assessing BERT\u2019s syntactic abilities."},{"key":"S1351324922000225_ref87","doi-asserted-by":"crossref","unstructured":"Zhang, Y. , Warstadt, A. , Li, X. , and Bowman, S.R. (2021). When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, pp. 1112\u20131125.","DOI":"10.18653\/v1\/2021.acl-long.90"},{"key":"S1351324922000225_ref55","unstructured":"Phang, J. , F\u00e9vry, T. and Bowman, S.R. (2018). Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088."},{"key":"S1351324922000225_ref75","unstructured":"Wang, A. , Pruksachatkun, Y. , Nangia, N. , Singh, A. , Michael, J. , Hill, F. , Levy, O. and Bowman, S. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pp. 3266\u20133280."},{"key":"S1351324922000225_ref42","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K19-1087"},{"key":"S1351324922000225_ref41","unstructured":"Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101."},{"key":"S1351324922000225_ref80","doi-asserted-by":"crossref","unstructured":"Wu, J.M. , Belinkov, Y. , Sajjad, H. , Durrani, N. , Dalvi, F. and Glass, J. (2020). Similarity analysis of contextual word representation models. arXiv preprint arXiv:2005.01172.","DOI":"10.18653\/v1\/2020.acl-main.422"},{"key":"S1351324922000225_ref25","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.314"},{"key":"S1351324922000225_ref32","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1023"},{"key":"S1351324922000225_ref52","unstructured":"Naik, A. , Ravichander, A. , Sadeh, N. , Rose, C. and Neubig, G. (2018). Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, pp. 2340\u20132353."},{"key":"S1351324922000225_ref65","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.381"},{"key":"S1351324922000225_ref15","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00298"},{"key":"S1351324922000225_ref13","unstructured":"Devlin, J. , Chang, M.-W. , Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171\u20134186."},{"key":"S1351324922000225_ref10","unstructured":"Cui, L. , Cheng, S. , Wu, Y. and Zhang, Y. (2020). Does bert solve commonsense task via commonsense knowledge?"},{"key":"S1351324922000225_ref71","unstructured":"Tsuchiya, M. (2018). Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) Miyazaki, Japan: European Language Resources Association (ELRA)."},{"key":"S1351324922000225_ref57","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-demos.15"},{"key":"S1351324922000225_ref58","unstructured":"Raffel, C. , Shazeer, N. , Roberts, A. , Lee, K. , Narang, S. , Matena, M. , Zhou, Y. , Li, W. and Liu, P.J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer."},{"key":"S1351324922000225_ref34","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-26123-2_31"},{"key":"S1351324922000225_ref12","unstructured":"Dehghani, M. , Tay, Y. , Gritsenko, A.A. , Zhao, Z. , Houlsby, N. , Diaz, F. , Metzler, D. and Vinyals, O. (2021). The benchmark lottery."},{"key":"S1351324922000225_ref90","unstructured":"Zhuang, D. , Zhang, X. , Song, S.L. and Hooker, S. (2021). Randomness in neural network training: Characterizing the impact of tooling. arXiv preprint arXiv:2106.11872."},{"key":"S1351324922000225_ref74","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1534"},{"key":"S1351324922000225_ref21","unstructured":"He, P. , Liu, X. , Gao, J. and Chen, W. (2021). Deberta: Decoding-enhanced bert with disentangled attention."},{"key":"S1351324922000225_ref28","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.258"},{"key":"S1351324922000225_ref68","unstructured":"Storks, S. , Gao, Q. and Chai, J.Y. (2019). Recent advances in natural language inference: A survey of benchmarks, resources, and approaches. arXiv preprint arXiv:1904.01172."},{"key":"S1351324922000225_ref56","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.467"},{"key":"S1351324922000225_ref3","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-35289-8_26"},{"key":"S1351324922000225_ref85","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/S19-1027"},{"key":"S1351324922000225_ref54","doi-asserted-by":"crossref","unstructured":"Nisioi, S. , Rabinovich, E. , Dinu, L.P. and Wintner, S. (2016). A corpus of native, non-native and translated texts. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC\u201916). Portoro\u017e, Slovenia: European Language Resources Association (ELRA), pp. 4197\u20134201.","DOI":"10.18653\/v1\/P16-1176"},{"key":"S1351324922000225_ref38","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.484"},{"key":"S1351324922000225_ref76","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-5446"},{"key":"S1351324922000225_ref82","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.41"},{"key":"S1351324922000225_ref48","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.coling-main.65"},{"key":"S1351324922000225_ref2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1084"},{"key":"S1351324922000225_ref64","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1179"},{"key":"S1351324922000225_ref6","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1075"},{"key":"S1351324922000225_ref9","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1269"},{"key":"S1351324922000225_ref7","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.747"},{"key":"S1351324922000225_ref81","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.coling-main.419"},{"key":"S1351324922000225_ref61","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.170"},{"key":"S1351324922000225_ref43","unstructured":"Marelli, M. , Menini, S. , Baroni, M. , Bentivogli, L. , Bernardi, R. and Zamparelli, R. (2014). A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u201914). Reykjavik, Iceland: European Language Resources Association (ELRA), pp. 216\u2013223."},{"key":"S1351324922000225_ref46","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1334"},{"key":"S1351324922000225_ref17","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-2103"},{"key":"S1351324922000225_ref24","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.102"},{"key":"S1351324922000225_ref35","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1445"},{"key":"S1351324922000225_ref59","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6397"},{"key":"S1351324922000225_ref26","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.331"},{"key":"S1351324922000225_ref63","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.111"},{"key":"S1351324922000225_ref44","unstructured":"McCoy, R.T. , Frank, R , and Linzen, T. (2018). Revisiting the poverty of the stimulus: hierarchical generalization without a hierarchical bias in recurrent neural networks."},{"key":"S1351324922000225_ref73","first-page":"104763","article-title":"Distributional formal semantics","author":"Venhuizen","year":"2021","journal-title":"Information and Computation"},{"key":"S1351324922000225_ref31","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.698"},{"key":"S1351324922000225_ref83","doi-asserted-by":"crossref","unstructured":"Yanaka, H. , Mineshima, K. , Bekki, D. and Inui, K. (2020). Do neural models learn systematicity of monotonicity inference in natural language? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 6105\u20136117.","DOI":"10.18653\/v1\/2020.acl-main.543"},{"key":"S1351324922000225_ref88","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.429"},{"key":"S1351324922000225_ref22","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11694"},{"key":"S1351324922000225_ref47","doi-asserted-by":"crossref","unstructured":"Merchant, A. , Rahimtoroghi, E. , Pavlick, E. and Tenney, I. (2020). What happens to BERT embeddings during fine-tuning? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Online: Association for Computational Linguistics, pp. 33\u201344.","DOI":"10.18653\/v1\/2020.blackboxnlp-1.4"},{"key":"S1351324922000225_ref19","doi-asserted-by":"publisher","DOI":"10.1016\/j.compbiolchem.2004.09.006"},{"key":"S1351324922000225_ref39","unstructured":"Li\u0161ka, A. , Kruszewski, G. and Baroni, M. (2018). Memorize or generalize? searching for a compositional rnn in a haystack. arXiv preprint arXiv:1802.06467."},{"key":"S1351324922000225_ref14","unstructured":"Dodge, J. , Ilharco, G. , Schwartz, R. , Farhadi, A. , Hajishirzi, H. and Smith, N. (2020). Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305."},{"key":"S1351324922000225_ref60","unstructured":"Rogers, A. (2019). How the transformers broke nlp leaderboards."},{"key":"S1351324922000225_ref69","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.259"}],"updated-by":[{"DOI":"10.1017\/s1351324923000116","type":"correction","label":"Correction","source":"publisher","updated":{"date-parts":[[2022,6,9]],"date-time":"2022-06-09T00:00:00Z","timestamp":1654732800000}}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324922000225","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,19]],"date-time":"2023-05-19T07:31:32Z","timestamp":1684481492000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324922000225\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,9]]},"references-count":90,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,5]]}},"alternative-id":["S1351324922000225"],"URL":"https:\/\/doi.org\/10.1017\/s1351324922000225","relation":{"correction":[{"id-type":"doi","id":"10.1017\/S1351324923000116","asserted-by":"object"}]},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,6,9]]},"assertion":[{"value":"\u00a9 The Author(s), 2022. Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}}]}}