{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T09:36:00Z","timestamp":1774604160383,"version":"3.50.1"},"reference-count":61,"publisher":"MIT Press - Journals","license":[{"start":{"date-parts":[[2022,3,21]],"date-time":"2022-03-21T00:00:00Z","timestamp":1647820800000},"content-version":"vor","delay-in-days":79,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,3,18]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Pretrained contextualized language models such as BERT and T5 have established a new state-of-the-art for ad-hoc search. However, it is not yet well understood why these methods are so effective, what makes some variants more effective than others, and what pitfalls they may have. We present a new comprehensive framework for Analyzing the Behavior of Neural IR ModeLs (ABNIRML), which includes new types of diagnostic probes that allow us to test several characteristics\u2014such as writing styles, factuality, sensitivity to paraphrasing and word order\u2014that are not addressed by previous techniques. To demonstrate the value of the framework, we conduct an extensive empirical study that yields insights into the factors that contribute to the neural model\u2019s gains, and identify potential unintended biases the models exhibit. Some of our results confirm conventional wisdom, for example, that recent neural ranking models rely less on exact term overlap with the query, and instead leverage richer linguistic information, evidenced by their higher sensitivity to word and sentence order. Other results are more surprising, such as that some models (e.g., T5 and ColBERT) are biased towards factually correct (rather than simply relevant) texts. 
Further, some characteristics vary even for the same base language model, and other characteristics can appear due to random variations during model training.1<\/jats:p>","DOI":"10.1162\/tacl_a_00457","type":"journal-article","created":{"date-parts":[[2022,3,21]],"date-time":"2022-03-21T19:09:55Z","timestamp":1647889795000},"page":"224-239","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":23,"title":["ABNIRML: Analyzing the Behavior of Neural IR Models"],"prefix":"10.1162","volume":"10","author":[{"given":"Sean","family":"MacAvaney","sequence":"first","affiliation":[{"name":"IR Lab, Georgetown University, Washington, DC, USA. sean@ir.cs.georgetown.edu"}]},{"given":"Sergey","family":"Feldman","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, Seattle, WA, USA. sergey@allenai.org"}]},{"given":"Nazli","family":"Goharian","sequence":"additional","affiliation":[{"name":"IR Lab, Georgetown University, Washington, DC, USA. nazli@ir.cs.georgetown.edu"}]},{"given":"Doug","family":"Downey","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, Seattle, WA, USA. dougd@allenai.org"}]},{"given":"Arman","family":"Cohan","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, Seattle, WA, USA"},{"name":"Paul G. Allen School of Computer Science, University of Washington, WA, USA. 
armanc@allenai.org"}]}],"member":"281","published-online":{"date-parts":[[2022,3,18]]},"reference":[{"key":"2022033118512296400_bib1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.repl4nlp-1.27","article-title":"Syntactic perturbations reveal representational correlates of hierarchical phrase structure in pretrained language models","author":"Alleman","year":"2021","journal-title":"arXiv"},{"key":"2022033118512296400_bib2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-45439-5_40","article-title":"Diagnosing BERT with retrieval heuristics","volume-title":"ECIR","author":"C\u00e2mara","year":"2020"},{"key":"2022033118512296400_bib3","article-title":"MS MARCO: A human generated machine reading comprehension dataset","author":"Campos","year":"2016","journal-title":"arXiv"},{"key":"2022033118512296400_bib4","article-title":"Overview of the TREC 2019 deep learning track","volume-title":"TREC","author":"Craswell","year":"2019"},{"key":"2022033118512296400_bib5","article-title":"Context-aware sentence\/passage term importance estimation for first stage retrieval","author":"Dai","year":"2019","journal-title":"arXiv"},{"key":"2022033118512296400_bib6","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331303","article-title":"Deeper text understanding for ir with contextual neural language modeling","author":"Dai","year":"2019","journal-title":"SIGIR"},{"key":"2022033118512296400_bib7","article-title":"CAsT 2019: The conversational assistance track overview","volume-title":"TREC","author":"Dalton","year":"2019"},{"key":"2022033118512296400_bib8","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"NAACL-HLT","author":"Devlin","year":"2019"},{"key":"2022033118512296400_bib9","doi-asserted-by":"publisher","DOI":"10.1145\/1008992.1009004","article-title":"A formal study of information retrieval 
heuristics","volume-title":"SIGIR","author":"Fang","year":"2004"},{"key":"2022033118512296400_bib10","doi-asserted-by":"publisher","first-page":"7:1","DOI":"10.1145\/1961209.1961210","article-title":"Diagnostic evaluation of information retrieval models","volume":"29","author":"Fang","year":"2011","journal-title":"ACM Transactions on Management Information Systems"},{"key":"2022033118512296400_bib11","doi-asserted-by":"publisher","DOI":"10.1145\/1148170.1148193","article-title":"Semantic term matching in axiomatic approaches to information retrieval","volume-title":"SIGIR \u201906","author":"Fang","year":"2006"},{"key":"2022033118512296400_bib12","article-title":"Building a better search engine for semantic scholar","author":"Feldman","year":"2020"},{"key":"2022033118512296400_bib13","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-45442-5_21","article-title":"ANTIQUE: A non-factoid question answering benchmark","volume-title":"ECIR","author":"Hashemi","year":"2020"},{"key":"2022033118512296400_bib14","article-title":"Scalable modified Kneser-Ney language model estimation","volume-title":"ACL","author":"Heafield","year":"2013"},{"key":"2022033118512296400_bib15","article-title":"Interpretable and time-budget-constrained contextualization for re-ranking","volume-title":"ECAI","author":"Hofst\u00e4tter","year":"2020"},{"key":"2022033118512296400_bib16","article-title":"spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing","author":"Honnibal","year":"2017"},{"key":"2022033118512296400_bib17","article-title":"LightGBM: A highly efficient gradient boosting decision tree","volume-title":"NIPS","author":"Ke","year":"2017"},{"key":"2022033118512296400_bib18","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401075","article-title":"ColBERT: Efficient and effective passage search via contextualized late interaction over 
BERT","volume-title":"SIGIR","author":"Khattab","year":"2020"},{"key":"2022033118512296400_bib19","article-title":"From word embeddings to document distances","volume-title":"ICML","author":"Kusner","year":"2015"},{"key":"2022033118512296400_bib20","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00276","article-title":"Natural Questions: a benchmark for question answering research","author":"Kwiatkowski","year":"2019","journal-title":"TACL"},{"key":"2022033118512296400_bib21","article-title":"Parade: Passage representation aggregation for document reranking","author":"Li","year":"2020","journal-title":"arXiv"},{"key":"2022033118512296400_bib22","article-title":"Pretrained transformers for text ranking: BERT and beyond","author":"Lin","year":"2020","journal-title":"arXiv"},{"key":"2022033118512296400_bib23","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-72113-8_23","article-title":"Evaluating multilingual text encoders for unsupervised cross-lingual retrieval","volume-title":"ECIR","author":"Litschko","year":"2021"},{"key":"2022033118512296400_bib24","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/N19-1112","article-title":"Linguistic knowledge and transferability of contextual representations","volume-title":"NAACL-HLT","author":"Liu","year":"2019"},{"key":"2022033118512296400_bib25","article-title":"RoBERTa: A robustly optimized BERT pretraining approach","author":"Liu","year":"2019","journal-title":"arXiv"},{"key":"2022033118512296400_bib26","article-title":"Language models and word sense disambiguation: An overview and analysis","author":"Loureiro","year":"2020","journal-title":"arXiv"},{"key":"2022033118512296400_bib27","doi-asserted-by":"crossref","DOI":"10.1145\/3336191.3371864","article-title":"OpenNIR: A complete neural ad-hoc ranking pipeline","volume-title":"WSDM","author":"MacAvaney","year":"2020"},{"key":"2022033118512296400_bib28","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401262","article-title":"Expansion via prediction of 
importance with contextualization","volume-title":"SIGIR","author":"MacAvaney","year":"2020"},{"key":"2022033118512296400_bib29","doi-asserted-by":"crossref","DOI":"10.1145\/3331184.3331317","article-title":"CEDR: Contextualized embeddings for document ranking","volume-title":"SIGIR","author":"MacAvaney","year":"2019"},{"key":"2022033118512296400_bib30","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3463254","article-title":"Simplified data wrangling with ir_datasets","volume-title":"SIGIR","author":"MacAvaney","year":"2021"},{"key":"2022033118512296400_bib31","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331316","article-title":"Content-based weak supervision for ad-hoc re-ranking","volume-title":"SIGIR","author":"MacAvaney","year":"2019"},{"key":"2022033118512296400_bib32","doi-asserted-by":"crossref","DOI":"10.1145\/3459637.3482013","article-title":"PyTerrier: Declarative experimentation in python from BM25 to dense retrieval","volume-title":"CIKM","author":"Macdonald","year":"2021"},{"key":"2022033118512296400_bib33","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1334","article-title":"Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference","volume-title":"ACL","author":"Thomas McCoy","year":"2019"},{"key":"2022033118512296400_bib34","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/E17-2037","article-title":"JFLEG: A fluency corpus and benchmark for grammatical error correction","volume-title":"EACL","author":"Napoles","year":"2017"},{"key":"2022033118512296400_bib35","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1206","article-title":"Don\u2019t give me the details, just the summary! 
Topic-aware convolutional neural networks for extreme summarization","volume-title":"EMNLP","author":"Narayan","year":"2018"},{"key":"2022033118512296400_bib36","article-title":"Passage re-ranking with BERT","author":"Nogueira","year":"2019","journal-title":"arXiv"},{"key":"2022033118512296400_bib37","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.63","article-title":"Document ranking with a pretrained sequence-to-sequence model","author":"Nogueira","year":"2020","journal-title":"arXiv"},{"key":"2022033118512296400_bib38","unstructured":"Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery. Self-published."},{"key":"2022033118512296400_bib39","article-title":"Document expansion by query prediction","author":"Nogueira","year":"2019","journal-title":"arXiv"},{"key":"2022033118512296400_bib40","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-31865-1_37","article-title":"Terrier: A high performance and scalable information retrieval platform","volume-title":"Proceedings of ACM SIGIR\u201906 Workshop on Open Source Information Retrieval (OSIR 2006)","author":"Ounis","year":"2006"},{"key":"2022033118512296400_bib41","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162","article-title":"GloVe: Global vectors for word representation","volume-title":"EMNLP","author":"Pennington","year":"2014"},{"key":"2022033118512296400_bib42","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1202","article-title":"Deep contextualized word representations","volume-title":"NAACL-HLT","author":"Peters","year":"2018"},{"key":"2022033118512296400_bib43","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i01.5385","article-title":"Automatically neutralizing subjective bias in text","volume-title":"AAAI","author":"Pryzant","year":"2020"},{"issue":"140","key":"2022033118512296400_bib44","article-title":"Exploring the limits of transfer learning with a unified text-to-text 
transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"2022033118512296400_bib45","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1012","article-title":"Dear sir or madam, may I introduce the YAFC corpus: Corpus, benchmarks and metrics for formality style transfer","volume-title":"NAACL-HLT","author":"Rao","year":"2018"},{"issue":"2","key":"2022033118512296400_bib46","article-title":"Gensim\u2013Python framework for vector space modelling","volume":"3","author":"Rehurek","year":"2011","journal-title":"NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic"},{"key":"2022033118512296400_bib47","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1410","article-title":"Sentence-BERT: Sentence embeddings using Siamese BERT-networks","volume-title":"EMNLP","author":"Reimers","year":"2019"},{"key":"2022033118512296400_bib48","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-15712-8_32","article-title":"An axiomatic approach to diagnosing neural IR models","volume-title":"ECIR","author":"Rennings","year":"2019"},{"key":"2022033118512296400_bib49","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2021\/659","article-title":"Beyond accuracy: Behavioral testing of NLP models with checklist","volume-title":"ACL","author":"Ribeiro","year":"2020"},{"key":"2022033118512296400_bib50","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00349","article-title":"A primer in BERTology: What we know about how BERT works","author":"Rogers","year":"2020","journal-title":"TACL"},{"key":"2022033118512296400_bib51","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1099","article-title":"Get to the point: Summarization with pointer-generator networks","volume-title":"ACL","author":"See","year":"2017"},{"key":"2022033118512296400_bib52","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1282","article-title":"Is attention 
interpretable?","volume-title":"ACL","author":"Serrano","year":"2019"},{"key":"2022033118512296400_bib53","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.230","article-title":"Masked language modeling and the distributional hypothesis: Order word matters pre-training for little","author":"Sinha","year":"2021","journal-title":"arXiv"},{"key":"2022033118512296400_bib54","doi-asserted-by":"publisher","DOI":"10.1145\/1277741.1277794","article-title":"An exploration of proximity measures in information retrieval","volume-title":"SIGIR","author":"Tao","year":"2007"},{"key":"2022033118512296400_bib55","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1452","article-title":"BERT rediscovers the classical NLP pipeline","volume-title":"ACL","author":"Tenney","year":"2019"},{"key":"2022033118512296400_bib56","doi-asserted-by":"publisher","DOI":"10.1145\/3471158.3472256","article-title":"Towards axiomatic explanations for neural ranking models","author":"V\u00f6lske","year":"2021","journal-title":"arXiv"},{"key":"2022033118512296400_bib57","article-title":"HuggingFace\u2019s Transformers: State-of-the-art natural language processing","author":"Wolf","year":"2019","journal-title":"arXiv"},{"key":"2022033118512296400_bib58","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2020.emnlp-main.608","article-title":"Which *BERT? 
a survey organizing contextualized encoders","volume-title":"EMNLP","author":"Xia","year":"2020"},{"key":"2022033118512296400_bib59","article-title":"Approximate nearest neighbor negative contrastive learning for dense text retrieval","author":"Xiong","year":"2021","journal-title":"arXiv"},{"key":"2022033118512296400_bib60","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00107","article-title":"Optimizing statistical machine translation for text simplification","volume":"4","author":"Xu","year":"2016","journal-title":"TACL"},{"key":"2022033118512296400_bib61","doi-asserted-by":"publisher","DOI":"10.1145\/3239571","article-title":"Anserini: Reproducible ranking baselines using Lucene","volume":"10","author":"Yang","year":"2018","journal-title":"J. Data and Information Quality"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00457\/2002698\/tacl_a_00457.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00457\/2002698\/tacl_a_00457.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,3,31]],"date-time":"2022-03-31T23:52:43Z","timestamp":1648770763000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00457\/110013\/ABNIRML-Analyzing-the-Behavior-of-Neural-IR-Models"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022]]},"references-count":61,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00457","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022]]},"published":{"date-parts":[[2022]]}}}