{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,8]],"date-time":"2025-11-08T23:05:19Z","timestamp":1762643119070,"version":"3.32.0"},"reference-count":74,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2024,12,23]],"date-time":"2024-12-23T00:00:00Z","timestamp":1734912000000},"content-version":"vor","delay-in-days":357,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,12,23]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Robust, faithful, and harm-free pronoun use for individuals is an important goal for language model development as their use increases, but prior work tends to study only one or two of these characteristics at a time. To measure progress towards the combined goal, we introduce the task of pronoun fidelity: Given a context introducing a co-referring entity and pronoun, the task is to reuse the correct pronoun later. We present RUFF, a carefully designed dataset of over 5 million instances to measure robust pronoun fidelity in English, and we evaluate 37 model variants from nine popular families, across architectures (encoder-only, decoder-only, and encoder-decoder) and scales (11M-70B parameters). When an individual is introduced with a pronoun, models can mostly faithfully reuse this pronoun in the next sentence, but they are significantly worse with she\/her\/her, singular they, and neopronouns. Moreover, models are easily distracted by non-adversarial sentences discussing other people; even one sentence with a distractor pronoun causes accuracy to drop on average by 34 percentage points. Our results show that pronoun fidelity is not robust, in a simple, naturalistic setting where humans achieve nearly 100% accuracy. 
We encourage researchers to bridge the gaps we find and to carefully evaluate reasoning in settings where superficial repetition might inflate perceptions of model performance.<\/jats:p>","DOI":"10.1162\/tacl_a_00719","type":"journal-article","created":{"date-parts":[[2024,12,23]],"date-time":"2024-12-23T20:12:25Z","timestamp":1734984745000},"page":"1755-1779","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":1,"title":["Robust Pronoun Fidelity with English LLMs: Are they Reasoning, Repeating, or Just Biased?"],"prefix":"10.1162","volume":"12","author":[{"given":"Vagrant","family":"Gautam","sequence":"first","affiliation":[{"name":"Saarland University, Germany. vgautam@lsv.uni-saarland.de"}]},{"given":"Eileen","family":"Bingert","sequence":"additional","affiliation":[{"name":"Saarland University, Germany"}]},{"given":"Dawei","family":"Zhu","sequence":"additional","affiliation":[{"name":"Saarland University, Germany"}]},{"given":"Anne","family":"Lauscher","sequence":"additional","affiliation":[{"name":"Data Science Group, University of Hamburg, Germany"}]},{"given":"Dietrich","family":"Klakow","sequence":"additional","affiliation":[{"name":"Saarland University, Germany"}]}],"member":"281","published-online":{"date-parts":[[2024,12,23]]},"reference":[{"key":"2024122320121650100_bib1","doi-asserted-by":"publisher","first-page":"7590","DOI":"10.18653\/v1\/2020.acl-main.679","article-title":"The sensitivity of language models and humans to Winograd schema perturbations","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Abdou","year":"2020"},{"key":"2024122320121650100_bib2","doi-asserted-by":"publisher","first-page":"2824","DOI":"10.18653\/v1\/2022.naacl-main.203","article-title":"Using natural sentence prompts for understanding biases in language models","volume-title":"Proceedings of the 2022 Conference of the North American 
Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Alnegheimish","year":"2022"},{"key":"2024122320121650100_bib3","doi-asserted-by":"publisher","first-page":"3426","DOI":"10.18653\/v1\/2022.naacl-main.250","article-title":"Recognition of they\/them as singular personal pronouns in coreference resolution","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Baumler","year":"2022"},{"key":"2024122320121650100_bib4","doi-asserted-by":"publisher","first-page":"610","DOI":"10.1145\/3442188.3445922","article-title":"On the dangers of stochastic parrots: Can language models be too big?","volume-title":"Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency","author":"Bender","year":"2021"},{"key":"2024122320121650100_bib5","doi-asserted-by":"publisher","first-page":"5185","DOI":"10.18653\/v1\/2020.acl-main.463","article-title":"Climbing towards NLU: On meaning, form, and understanding in the age of data","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Bender","year":"2020"},{"key":"2024122320121650100_bib6","first-page":"2397","article-title":"Pythia: A suite for analyzing large language models across training and scaling","volume-title":"International Conference on Machine Learning, ICML 2023, 23\u201329 July 2023, Honolulu, Hawaii, USA","author":"Biderman","year":"2023"},{"key":"2024122320121650100_bib7","doi-asserted-by":"publisher","first-page":"5454","DOI":"10.18653\/v1\/2020.acl-main.485","article-title":"Language (technology) is power: A critical survey of \u201cbias\u201d in NLP","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Blodgett","year":"2020"},{"key":"2024122320121650100_bib8","article-title":"Man is to computer programmer as woman 
is to homemaker? Debiasing word embeddings","volume-title":"Advances in Neural Information Processing Systems","author":"Bolukbasi","year":"2016"},{"issue":"6334","key":"2024122320121650100_bib9","doi-asserted-by":"publisher","first-page":"183","DOI":"10.1126\/science.aal4230","article-title":"Semantics derived automatically from language corpora contain human-like biases","volume":"356","author":"Caliskan","year":"2017","journal-title":"Science"},{"key":"2024122320121650100_bib10","doi-asserted-by":"publisher","first-page":"4568","DOI":"10.18653\/v1\/2020.acl-main.418","article-title":"Toward gender-inclusive coreference resolution","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Cao","year":"2020"},{"issue":"3","key":"2024122320121650100_bib11","doi-asserted-by":"publisher","first-page":"615","DOI":"10.1162\/coli_a_00413","article-title":"Toward gender-inclusive coreference resolution: An analysis of gender and bias throughout the machine learning lifecycle","volume":"47","author":"Cao","year":"2021","journal-title":"Computational Linguistics"},{"issue":"70","key":"2024122320121650100_bib12","first-page":"1","article-title":"Scaling instruction-finetuned language models","volume":"25","author":"Chung","year":"2024","journal-title":"Journal of Machine Learning Research"},{"unstructured":"Kirby Conrod. 2019. Pronouns Raising and Emerging. 
PhD thesis, University of Washington.","key":"2024122320121650100_bib13"},{"key":"2024122320121650100_bib14","doi-asserted-by":"publisher","first-page":"1968","DOI":"10.18653\/v1\/2021.emnlp-main.150","article-title":"Harms of gender exclusivity and challenges in non-binary representation in language technologies","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Dev","year":"2021"},{"key":"2024122320121650100_bib15","doi-asserted-by":"publisher","first-page":"4171","DOI":"10.18653\/v1\/N19-1423","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2024122320121650100_bib16","doi-asserted-by":"publisher","first-page":"10486","DOI":"10.18653\/v1\/2021.emnlp-main.819","article-title":"Back to square one: Artifact detection, training and commonsense disentanglement in the Winograd schema","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Elazar","year":"2021"},{"key":"2024122320121650100_bib17","doi-asserted-by":"publisher","first-page":"8517","DOI":"10.18653\/v1\/2021.emnlp-main.670","article-title":"Wino-X: Multilingual Winograd schemas for commonsense reasoning and coreference resolution","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Emelin","year":"2021"},{"key":"2024122320121650100_bib18","doi-asserted-by":"publisher","first-page":"2650","DOI":"10.18653\/v1\/P19-1254","article-title":"Strategies for structuring story generation","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational 
Linguistics","author":"Fan","year":"2019"},{"key":"2024122320121650100_bib19","doi-asserted-by":"publisher","first-page":"9126","DOI":"10.18653\/v1\/2023.acl-long.507","article-title":"WinoQueer: A community-in-the-loop benchmark for anti-LGBTQ+ bias in large language models","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Felkner","year":"2023"},{"key":"2024122320121650100_bib20","doi-asserted-by":"publisher","first-page":"606","DOI":"10.18653\/v1\/2023.acl-long.36","article-title":"When does translation require context? A data-driven, multilingual exploration","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Fernandes","year":"2023"},{"key":"2024122320121650100_bib21","doi-asserted-by":"publisher","first-page":"1926","DOI":"10.18653\/v1\/2021.acl-long.150","article-title":"Intrinsic bias metrics do not correlate with application bias","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Goldfarb-Tarrant","year":"2021"},{"key":"2024122320121650100_bib22","first-page":"609","article-title":"Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Gonen","year":"2019"},{"issue":"2","key":"2024122320121650100_bib23","doi-asserted-by":"publisher","first-page":"203","DOI":"10.21236\/ADA324949","article-title":"Centering: A framework for modeling the local coherence of discourse","volume":"21","author":"Grosz","year":"1995","journal-title":"Computational 
Linguistics"},{"key":"2024122320121650100_bib24","doi-asserted-by":"publisher","first-page":"4602","DOI":"10.18653\/v1\/2022.acl-long.315","article-title":"Context matters: A pragmatic study of PLMs\u2019 negation understanding","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Gubelmann","year":"2022"},{"key":"2024122320121650100_bib25","doi-asserted-by":"publisher","first-page":"5352","DOI":"10.18653\/v1\/2023.acl-long.293","article-title":"MISGENDERED: Limits of large language models in understanding pronouns","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Hossain","year":"2023"},{"key":"2024122320121650100_bib26","article-title":"Auxiliary task demands mask the capabilities of smaller language models","volume-title":"First Conference on Language Modeling","author":"Hu","year":"2024"},{"key":"2024122320121650100_bib27","doi-asserted-by":"publisher","first-page":"5040","DOI":"10.18653\/v1\/2023.emnlp-main.306","article-title":"Prompting is not a substitute for probability measurements in large language models","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Hu","year":"2023"},{"key":"2024122320121650100_bib28","doi-asserted-by":"publisher","first-page":"5075","DOI":"10.18653\/v1\/2023.emnlp-main.308","article-title":"Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Jacovi","year":"2023"},{"key":"2024122320121650100_bib29","doi-asserted-by":"publisher","first-page":"1830","DOI":"10.18653\/v1\/D17-1195","article-title":"Dynamic entity representations in neural language models","volume-title":"Proceedings of the 2017 Conference 
on Empirical Methods in Natural Language Processing","author":"Ji","year":"2017"},{"key":"2024122320121650100_bib30","article-title":"Comparing plausibility estimates in base and instruction-tuned large language models","volume":"abs\/2403.14859v1","author":"Kauf","year":"2024","journal-title":"CoRR"},{"key":"2024122320121650100_bib31","doi-asserted-by":"publisher","first-page":"925","DOI":"10.18653\/v1\/2023.acl-short.80","article-title":"A better way to do masked language model scoring","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Kauf","year":"2023"},{"key":"2024122320121650100_bib32","first-page":"22199","article-title":"Large language models are zero-shot reasoners","volume-title":"Advances in Neural Information Processing Systems","author":"Kojima","year":"2022"},{"key":"2024122320121650100_bib33","doi-asserted-by":"publisher","first-page":"166","DOI":"10.18653\/v1\/W19-3823","article-title":"Measuring bias in contextualized word representations","volume-title":"Proceedings of the First Workshop on Gender Bias in Natural Language Processing","author":"Kurita","year":"2019"},{"key":"2024122320121650100_bib34","article-title":"ALBERT: A lite BERT for self-supervised learning of language representations","volume-title":"8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26\u201330, 2020","author":"Lan","year":"2020"},{"key":"2024122320121650100_bib35","first-page":"1221","article-title":"Welcome to the modern world of pronouns: Identity-inclusive natural language processing beyond gender","volume-title":"Proceedings of the 29th International Conference on Computational Linguistics","author":"Lauscher","year":"2022"},{"key":"2024122320121650100_bib36","doi-asserted-by":"publisher","first-page":"377","DOI":"10.18653\/v1\/2023.acl-long.23","article-title":"What about \u201cem\u201d? 
How commercial machine translation fails to handle (neo-)pronouns","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Lauscher","year":"2023"},{"key":"2024122320121650100_bib37","article-title":"The Winograd schema challenge","volume-title":"Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International Conference, KR 2012, Rome, Italy, June 10\u201314, 2012","author":"Levesque","year":"2012"},{"key":"2024122320121650100_bib38","doi-asserted-by":"publisher","first-page":"15339","DOI":"10.18653\/v1\/2024.acl-long.818","article-title":"Same task, more tokens: The impact of input length on the reasoning performance of large language models","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Levy","year":"2024"},{"key":"2024122320121650100_bib39","doi-asserted-by":"publisher","first-page":"2470","DOI":"10.18653\/v1\/2021.findings-emnlp.211","article-title":"Collecting a large-scale gender bias dataset for coreference resolution and machine translation","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2021","author":"Levy","year":"2021"},{"key":"2024122320121650100_bib40","doi-asserted-by":"publisher","first-page":"157","DOI":"10.1162\/tacl_a_00638","article-title":"Lost in the middle: How language models use long contexts","volume":"12","author":"Liu","year":"2024","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024122320121650100_bib41","article-title":"RoBERTa: A robustly optimized BERT pretraining approach","author":"Liu","year":"2019","journal-title":"CoRR"},{"volume-title":"Gender Census 2023: Worldwide Report","year":"2023","author":"Lodge","key":"2024122320121650100_bib42"},{"key":"2024122320121650100_bib43","first-page":"22631","article-title":"The flan collection: Designing data 
and methods for effective instruction tuning","volume-title":"International Conference on Machine Learning, ICML 2023, 23\u201329 July 2023, Honolulu, Hawaii, USA","author":"Longpre","year":"2023"},{"key":"2024122320121650100_bib44","doi-asserted-by":"publisher","first-page":"622","DOI":"10.18653\/v1\/N19-1063","article-title":"On measuring social biases in sentence encoders","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"May","year":"2019"},{"issue":"1","key":"2024122320121650100_bib45","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1037\/sah0000070","article-title":"A minority stress perspective on transgender individuals\u2019 experiences with misgendering","volume":"3","author":"McLemore","year":"2018","journal-title":"Stigma and Health"},{"key":"2024122320121650100_bib46","article-title":"minicons: Enabling flexible behavioral and representational analyses of transformer language models","author":"Misra","year":"2022","journal-title":"CoRR"},{"key":"2024122320121650100_bib47","doi-asserted-by":"publisher","first-page":"839","DOI":"10.18653\/v1\/N16-1098","article-title":"A corpus and cloze evaluation for deeper understanding of commonsense stories","volume-title":"Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Mostafazadeh","year":"2016"},{"key":"2024122320121650100_bib48","doi-asserted-by":"publisher","first-page":"61","DOI":"10.18653\/v1\/W18-6307","article-title":"A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation","volume-title":"Proceedings of the Third Conference on Machine Translation: Research 
Papers","author":"M\u00fcller","year":"2018"},{"key":"2024122320121650100_bib49","doi-asserted-by":"publisher","first-page":"1246","DOI":"10.1145\/3593013.3594078","article-title":"\u201cI\u2019m fully who I am\u201d: Towards centering transgender and non-binary voices to measure biases in open language generation","volume-title":"Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency","author":"Ovalle","year":"2023"},{"key":"2024122320121650100_bib50","doi-asserted-by":"publisher","first-page":"1739","DOI":"10.18653\/v1\/2024.findings-naacl.113","article-title":"Tokenization matters: Navigating data-scarce tokenization for gender inclusive language technologies","volume-title":"Findings of the Association for Computational Linguistics: NAACL 2024","author":"Ovalle","year":"2024"},{"key":"2024122320121650100_bib51","article-title":"MosaicBERT: A bidirectional encoder optimized for fast pretraining","volume-title":"Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10\u201316, 2023","author":"Portes","year":"2023"},{"key":"2024122320121650100_bib52","first-page":"143","article-title":"Towards robust linguistic analysis using OntoNotes","volume-title":"Proceedings of the Seventeenth Conference on Computational Natural Language Learning","author":"Pradhan","year":"2013"},{"key":"2024122320121650100_bib53","doi-asserted-by":"publisher","first-page":"8","DOI":"10.18653\/v1\/N18-2002","article-title":"Gender bias in coreference resolution","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)","author":"Rudinger","year":"2018"},{"key":"2024122320121650100_bib54","doi-asserted-by":"publisher","first-page":"2699","DOI":"10.18653\/v1\/2020.acl-main.240","article-title":"Masked language model 
scoring","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Salazar","year":"2020"},{"key":"2024122320121650100_bib55","article-title":"Are emergent abilities of large language models a mirage?","volume-title":"Thirty-seventh Conference on Neural Information Processing Systems","author":"Schaeffer","year":"2023"},{"key":"2024122320121650100_bib56","article-title":"Quantifying language models\u2019 sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting","volume-title":"The Twelfth International Conference on Learning Representations","author":"Sclar","year":"2024"},{"key":"2024122320121650100_bib57","article-title":"Quantifying social biases using templates is unreliable","author":"Seshadri","year":"2022","journal-title":"CoRR"},{"key":"2024122320121650100_bib58","doi-asserted-by":"publisher","first-page":"1968","DOI":"10.18653\/v1\/2022.findings-emnlp.143","article-title":"How sensitive are translation systems to extra contexts? 
Mitigating gender bias in neural machine translation models through relevant contexts","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2022","author":"Sharma","year":"2022"},{"key":"2024122320121650100_bib59","doi-asserted-by":"publisher","first-page":"219","DOI":"10.1016\/B978-0-12-491280-9.50016-9","article-title":"10 - Language and the culture of gender: At the intersection of structure, usage, and ideology","volume-title":"Semiotic Mediation","author":"Silverstein","year":"1985"},{"key":"2024122320121650100_bib60","doi-asserted-by":"publisher","first-page":"6043","DOI":"10.18653\/v1\/2023.acl-long.333","article-title":"Language model acceptability judgements are not always robust to context","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Sinha","year":"2023"},{"key":"2024122320121650100_bib61","doi-asserted-by":"publisher","first-page":"4753","DOI":"10.18653\/v1\/2022.naacl-main.350","article-title":"Partial-input baselines show that NLI models can ignore context, but they don\u2019t","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Srikanth","year":"2022"},{"key":"2024122320121650100_bib62","volume-title":"Transgender History: The Roots of Today\u2019s Revolution","author":"Stryker","year":"2017","edition":"2nd"},{"key":"2024122320121650100_bib63","doi-asserted-by":"publisher","first-page":"112","DOI":"10.18653\/v1\/2022.gebnlp-1.13","article-title":"Fewer errors, but more stereotypes? 
The effect of model size on gender bias","volume-title":"Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)","author":"Tal","year":"2022"},{"key":"2024122320121650100_bib64","doi-asserted-by":"publisher","first-page":"12342","DOI":"10.18653\/v1\/2023.findings-emnlp.825","article-title":"Scaling laws vs model architectures: How does inductive bias influence scaling?","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2023","author":"Tay","year":"2023"},{"key":"2024122320121650100_bib65","article-title":"Llama 2: Open foundation and fine-tuned chat models","author":"Touvron","year":"2023","journal-title":"CoRR"},{"key":"2024122320121650100_bib66","doi-asserted-by":"publisher","first-page":"3382","DOI":"10.18653\/v1\/D19-1335","article-title":"How reasonable are common-sense reasoning tasks: A case-study on the Winograd schema challenge and SWAG","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Trichelair","year":"2019"},{"key":"2024122320121650100_bib67","doi-asserted-by":"publisher","first-page":"2232","DOI":"10.18653\/v1\/2021.eacl-main.190","article-title":"Stereotype and skew: Quantifying gender bias in pre-trained and fine-tuned language models","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume","author":"de Vassimon Manela","year":"2021"},{"key":"2024122320121650100_bib68","doi-asserted-by":"publisher","first-page":"1264","DOI":"10.18653\/v1\/P18-1117","article-title":"Context-aware neural machine translation learns anaphora resolution","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long 
Papers)","author":"Voita","year":"2018"},{"key":"2024122320121650100_bib69","doi-asserted-by":"publisher","first-page":"605","DOI":"10.1162\/tacl_a_00240","article-title":"Mind the GAP: A balanced corpus of gendered ambiguous pronouns","volume":"6","author":"Webster","year":"2018","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024122320121650100_bib70","doi-asserted-by":"publisher","first-page":"38","DOI":"10.18653\/v1\/2020.emnlp-demos.6","article-title":"Transformers: State-of-the-art natural language processing","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations","author":"Wolf","year":"2020"},{"key":"2024122320121650100_bib71","article-title":"OPT: Open pre-trained transformer language models","author":"Zhang","year":"2022","journal-title":"CoRR"},{"key":"2024122320121650100_bib72","doi-asserted-by":"publisher","first-page":"629","DOI":"10.18653\/v1\/N19-1064","article-title":"Gender bias in contextualized word embeddings","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Zhao","year":"2019"},{"key":"2024122320121650100_bib73","doi-asserted-by":"publisher","first-page":"15","DOI":"10.18653\/v1\/N18-2003","article-title":"Gender bias in coreference resolution: Evaluation and debiasing methods","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)","author":"Zhao","year":"2018"},{"key":"2024122320121650100_bib74","article-title":"Large language models are human-level prompt engineers","volume-title":"The Eleventh International Conference on Learning Representations","author":"Zhou","year":"2023"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00719\/2487051\/tacl_a_00719.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00719\/2487051\/tacl_a_00719.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,23]],"date-time":"2024-12-23T20:12:33Z","timestamp":1734984753000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00719\/125951\/Robust-Pronoun-Fidelity-with-English-LLMs-Are-they"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":74,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00719","relation":{},"ISSN":["2307-387X"],"issn-type":[{"type":"electronic","value":"2307-387X"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]}}}