{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T00:23:34Z","timestamp":1777508614020,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":64,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,7,13]]},"DOI":"10.1145\/3726302.3730348","type":"proceedings-article","created":{"date-parts":[[2025,7,14]],"date-time":"2025-07-14T01:38:52Z","timestamp":1752457132000},"page":"3865-3875","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2762-721X","authenticated-orcid":false,"given":"Krisztian","family":"Balog","sequence":"first","affiliation":[{"name":"Google DeepMind, Stavanger, Norway"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4276-6269","authenticated-orcid":false,"given":"Don","family":"Metzler","sequence":"additional","affiliation":[{"name":"Google DeepMind, Mountain View, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6739-134X","authenticated-orcid":false,"given":"Zhen","family":"Qin","sequence":"additional","affiliation":[{"name":"Google DeepMind, Mountain View, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,7,13]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Zahra Abbasiantaeb Chuan Meng Leif Azzopardi and Mohammad Aliannejadi. 2024. Can We Use Large Language Models to Fill Relevance Judgment Holes?. In Joint Proceedings of the 1st Workshop on Evaluation Methodologies Testbeds and Community for Information Access Research (EMTCIR 2024) and the 1st Workshop on User Modelling in Conversational Information Retrieval (UM-CIR 2024) co-located with the 2nd International ACM SIGIR Conference on Information Retrieval in the Asia Pacific (SIGIR-AP 2024) Tokyo Japan December 12 2024 (CEUR Workshop Proceedings Vol. 3854)."},{"key":"e_1_3_2_1_2_1","first-page":"32","volume-title":"Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP '24)","author":"Alaofi Marwah","year":"2024","unstructured":"Marwah Alaofi, Paul Thomas, Falk Scholer, and Mark Sanderson. 2024. LLMs can be Fooled into Labelling a Document as Relevant. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP '24). 32-41."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3477495.3531863"},{"key":"e_1_3_2_1_4_1","volume-title":"ODIN: Disentangled Reward Mitigates Hacking in RLHF. In Forty-first International Conference on Machine Learning (ICML '24)","author":"Chen Lichang","year":"2024","unstructured":"Lichang Chen, Chen Zhu, Jiuhai Chen, Davit Soselia, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, and Bryan Catanzaro. 2024b. ODIN: Disentangled Reward Mitigates Hacking in RLHF. In Forty-first International Conference on Machine Learning (ICML '24)."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3673791.3698420"},{"key":"e_1_3_2_1_6_1","volume-title":"Clarke and Laura Dietz","author":"Charles L.","year":"2024","unstructured":"Charles L. A. Clarke and Laura Dietz. 2024. LLM-based relevance assessment still can't replace human relevance assessment. arxiv:2412.17156 [cs.IR]"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.6028\/NIST.SP.1266.deep-overview"},{"key":"e_1_3_2_1_8_1","volume-title":"Overview of the TREC 2019 Deep Learning track. In Proceedings of the Twenty-Eighth Text REtrieval Conference (TREC '19)","author":"Craswell Nick","unstructured":"Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2019. Overview of the TREC 2019 Deep Learning track. In Proceedings of the Twenty-Eighth Text REtrieval Conference (TREC '19)."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3637528.3671458"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3637528.3671882"},{"key":"e_1_3_2_1_11_1","volume-title":"The Eleventh International Conference on Learning Representations (ICLR '23)","author":"Dai Zhuyun","year":"2023","unstructured":"Zhuyun Dai, Vincent Y Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith Hall, and Ming-Wei Chang. 2023. Promptagator: Few-shot Dense Retrieval From 8 Examples. In The Eleventh International Conference on Learning Representations (ICLR '23)."},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-emnlp.950"},{"key":"e_1_3_2_1_13_1","volume-title":"Length-Controlled AlpacaEval: A Simple Debiasing of Automatic Evaluators. In First Conference on Language Modeling (COLM '24)","author":"Dubois Yann","year":"2024","unstructured":"Yann Dubois, Percy Liang, and Tatsunori Hashimoto. 2024. Length-Controlled AlpacaEval: A Simple Debiasing of Automatic Evaluators. In First Conference on Language Modeling (COLM '24)."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3578337.3605136"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1162\/coli_a_00524"},{"key":"e_1_3_2_1_16_1","volume-title":"Gemini: A family of highly capable multimodal models. arxiv:2312.11805 [cs.CL]","author":"Google Gemini Team","year":"2023","unstructured":"Gemini Team Google. 2023. Gemini: A family of highly capable multimodal models. arxiv:2312.11805 [cs.CL]"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.2305016120"},{"key":"e_1_3_2_1_18_1","unstructured":"Jiawei Gu Xuhui Jiang Zhichao Shi Hexiang Tan Xuehao Zhai Chengjin Xu Wei Li Yinghan Shen Shengjie Ma Honghao Liu Saizhuo Wang Kun Zhang Yuanzhuo Wang Wen Gao Lionel Ni and Jian Guo. 2024. A Survey on LLM-as-a-Judge. arxiv:2411.15594 [cs.CL]"},{"key":"e_1_3_2_1_19_1","volume-title":"James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, and Dipanjan Das.","author":"Jacovi Alon","year":"2025","unstructured":"Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, and Dipanjan Das. 2025. The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input. arxiv:2501.03200 [cs.CL]"},{"key":"e_1_3_2_1_20_1","unstructured":"Percy Liang Rishi Bommasani Tony Lee Dimitris Tsipras Dilara Soylu Michihiro Yasunaga Yian Zhang Deepak Narayanan Yuhuai Wu Ananya Kumar Benjamin Newman Binhang Yuan Bobby Yan Ce Zhang Christian Alexander Cosgrove Christopher D Manning Christopher Re Diana Acosta-Navas Drew Arad Hudson Eric Zelikman Esin Durmus Faisal Ladhak Frieda Rong Hongyu Ren Huaxiu Yao Jue WANG Keshav Santhanam Laurel Orr Lucia Zheng Mert Yuksekgonul Mirac Suzgun Nathan Kim Neel Guha Niladri S. Chatterji Omar Khattab Peter Henderson Qian Huang Ryan Andrew Chi Sang Michael Xie Shibani Santurkar Surya Ganguli Tatsunori Hashimoto Thomas Icard Tianyi Zhang Vishrav Chaudhary William Wang Xuechen Li Yifan Mai Yuhui Zhang and Yuta Koreeda. 2023. Holistic Evaluation of Language Models. Transactions on Machine Learning Research (2023)."},{"key":"e_1_3_2_1_21_1","first-page":"6565","volume-title":"International Conference on Machine Learning (ICML '21)","author":"Liang Paul Pu","year":"2021","unstructured":"Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2021. Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning (ICML '21). 6565-6576."},{"key":"e_1_3_2_1_22_1","unstructured":"Weixin Liang Yaohui Zhang Mihai Codreanu Jiayu Wang Hancheng Cao and James Zou. 2025. The Widespread Adoption of Large Language Model-Assisted Writing Across Society. arxiv:2502.09747 [cs.CL]"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3463238"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00638"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.153"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1387"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-acl.753"},{"key":"e_1_3_2_1_28_1","volume-title":"Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators. In First Conference on Language Modeling (COLM '24)","author":"Liu Yinhong","year":"2024","unstructured":"Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vuli\u0107, Anna Korhonen, and Nigel Collier. 2024c. Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators. In First Conference on Language Modeling (COLM '24)."},{"key":"e_1_3_2_1_29_1","unstructured":"Xueguang Ma Xinyu Zhang Ronak Pradeep and Jimmy Lin. 2023. Zero-Shot Listwise Document Reranking with a Large Language Model. arxiv:2305.02156 [cs.IR]"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3592032"},{"key":"e_1_3_2_1_31_1","unstructured":"J. S. McCarley Rishav Chakravarti and Avirup Sil. 2019. Structured Pruning of a BERT-based Question Answering Model. arxiv:1910.06360 [cs.CL]"},{"key":"e_1_3_2_1_32_1","unstructured":"Navid Mehrdad Hrushikesh Mohapatra Mossaab Bagdouri Prijith Chandran Alessandro Magnani Xunfan Cai Ajit Puthenputhussery Sachin Yadav Tony Lee ChengXiang Zhai and Ciya Liao. 2024. Large Language Models for Relevance Judgment in Product Search. arxiv:2406.00247 [cs.IR]"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3597307"},{"key":"e_1_3_2_1_34_1","unstructured":"Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arxiv:1901.04085 [cs.IR]"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.63"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3637528.3671883"},{"key":"e_1_3_2_1_37_1","volume-title":"LLM Evaluators Recognize and Favor Their Own Generations. In The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS '24)","author":"Panickssery Arjun","year":"2024","unstructured":"Arjun Panickssery, Samuel R. Bowman, and Shi Feng. 2024. LLM Evaluators Recognize and Favor Their Own Generations. In The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS '24)."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-56060-6_19"},{"key":"e_1_3_2_1_39_1","unstructured":"Ronak Pradeep Rodrigo Nogueira and Jimmy Lin. 2021. The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. arxiv:2101.05667 [cs.IR]"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-naacl.97"},{"key":"e_1_3_2_1_41_1","first-page":"1","article-title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, Vol. 21, 140 (2020), 1-67.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_1_42_1","volume-title":"Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024. arxiv:2408","author":"Rahmani Hossein A.","year":"2024","unstructured":"Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, and Emine Yilmaz. 2024. Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024. arxiv:2408.05388 [cs.IR]"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"crossref","unstructured":"Hossein A. Rahmani Xi Wang Emine Yilmaz Nick Craswell Bhaskar Mitra and Paul Thomas. 2025. SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval. arxiv:2408.16312 [cs.IR]","DOI":"10.1145\/3701716.3715311"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.emnlp-main.249"},{"key":"e_1_3_2_1_45_1","unstructured":"Jayant Sachdev Sean D. Rosario Abhijeet Phatak He Wen Swati Kirti and Chittaranjan Tripathy. 2025. Automated Query-Product Relevance Labeling using Large Language Models for E-commerce Search. arxiv:2502.15990 [cs.IR]"},{"key":"e_1_3_2_1_46_1","volume-title":"Proceedings of the second conference on conceptions of library and information science (CoLIS 2). 201-218","author":"Saracevic Tefko","year":"1996","unstructured":"Tefko Saracevic. 1996. Relevance reconsidered. In Proceedings of the second conference on conceptions of library and information science (CoLIS 2). 201-218."},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-024-07566-y"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.54195\/irrj.19625"},{"key":"e_1_3_2_1_49_1","unstructured":"Heydar Soudani Roxana Petcu Evangelos Kanoulas and Faegheh Hasibi. 2024. A Survey on Recent Advances in Conversational Data Generation. arxiv:2405.13003 [cs.CL]"},{"key":"e_1_3_2_1_50_1","unstructured":"Rickard Stureborg Dimitris Alikaniotis and Yoshi Suhara. 2024. Large Language Models are Inconsistent and Biased Evaluators. arxiv:2405.01724 [cs.CL]"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.923"},{"key":"e_1_3_2_1_52_1","unstructured":"Manveer Singh Tamber and Jimmy Lin. 2025. Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers Rerankers and LLM Judges. arxiv:2501.18536 [cs.IR]"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.5555\/3600270.3601857"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657707"},{"key":"e_1_3_2_1_55_1","volume-title":"Hoa Trang Dang, and Jimmy Lin","author":"Upadhyay Shivani","year":"2024","unstructured":"Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. 2024a. A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look. arxiv:2411.08275 [cs.IR]"},{"key":"e_1_3_2_1_56_1","volume-title":"UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. arxiv:2406.06519 [cs.IR]","author":"Upadhyay Shivani","year":"2024","unstructured":"Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Nick Craswell, and Jimmy Lin. 2024b. UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. arxiv:2406.06519 [cs.IR]"},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.emnlp-main.949"},{"key":"e_1_3_2_1_58_1","first-page":"74764","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS '23)","author":"Wang Yizhong","year":"2023","unstructured":"Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS '23). 74764-74786."},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.826"},{"key":"e_1_3_2_1_60_1","volume-title":"BERTScore: Evaluating Text Generation with BERT. In 8th International Conference on Learning Representations (ICLR '20)","author":"Zhang Tianyi","year":"2020","unstructured":"Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In 8th International Conference on Learning Representations (ICLR '20)."},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1053"},{"key":"e_1_3_2_1_62_1","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS '23)","author":"Zheng Lianmin","year":"2023","unstructured":"Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS '23)."},{"key":"e_1_3_2_1_63_1","volume-title":"International Conference on Learning Representations (ICLR '20)","author":"Zhu Jinhua","year":"2020","unstructured":"Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. 2020. Incorporating BERT into Neural Machine Translation. In International Conference on Learning Representations (ICLR '20)."},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3592047"}],"event":{"name":"SIGIR '25: The 48th International ACM SIGIR Conference on Research and Development in Information Retrieval","location":"Padua Italy","acronym":"SIGIR '25","sponsor":["SIGIR ACM Special Interest Group on Information Retrieval"]},"container-title":["Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3726302.3730348","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T12:07:50Z","timestamp":1755864470000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3726302.3730348"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,13]]},"references-count":64,"alternative-id":["10.1145\/3726302.3730348","10.1145\/3726302"],"URL":"https:\/\/doi.org\/10.1145\/3726302.3730348","relation":{},"subject":[],"published":{"date-parts":[[2025,7,13]]},"assertion":[{"value":"2025-07-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}