{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T09:30:53Z","timestamp":1775899853367,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":99,"publisher":"ACM","funder":[{"DOI":"10.13039\/501100006374","name":"National Science Foundation","doi-asserted-by":"publisher","award":["1846017"],"award-info":[{"award-number":["1846017"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]},{"name":"ARC Centre of Excellence for Automated Decision-Making and Society","award":["CE200100005"],"award-info":[{"award-number":["CE200100005"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,7,18]]},"DOI":"10.1145\/3731120.3744588","type":"proceedings-article","created":{"date-parts":[[2025,7,18]],"date-time":"2025-07-18T13:34:06Z","timestamp":1752845646000},"page":"218-229","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":15,"title":["Principles and Guidelines for the Use of LLM Judges"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1624-3907","authenticated-orcid":false,"given":"Laura","family":"Dietz","sequence":"first","affiliation":[{"name":"University of New Hampshire, Portsmouth, NH, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1535-0989","authenticated-orcid":false,"given":"Oleg","family":"Zendel","sequence":"additional","affiliation":[{"name":"RMIT University, Melbourne, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-9456-1865","authenticated-orcid":false,"given":"Peter","family":"Bailey","sequence":"additional","affiliation":[{"name":"Canva, Sydney, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8178-9194","authenticated-orcid":false,"given":"Charles L. 
A.","family":"Clarke","sequence":"additional","affiliation":[{"name":"University of Waterloo, Waterloo, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4760-9748","authenticated-orcid":false,"given":"Ellese","family":"Cotterill","sequence":"additional","affiliation":[{"name":"Canva, Sydney, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2422-8651","authenticated-orcid":false,"given":"Jeff","family":"Dalton","sequence":"additional","affiliation":[{"name":"University of Edinburgh, Edinburgh, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9986-482X","authenticated-orcid":false,"given":"Faegheh","family":"Hasibi","sequence":"additional","affiliation":[{"name":"Radboud University, Nijmegen, Netherlands"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0487-9609","authenticated-orcid":false,"given":"Mark","family":"Sanderson","sequence":"additional","affiliation":[{"name":"RMIT University, Melbourne, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9351-8137","authenticated-orcid":false,"given":"Nick","family":"Craswell","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, WA, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,7,18]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"EMTCIR '24: The First Workshop on Evaluation Methodologies, Testbeds and Community for Information Access Research.","author":"Abbasiantaeb Zahra","year":"2024","unstructured":"Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, and Mohammad Aliannejadi. 2024. Can We Use Large Language Models to Fill Relevance Judgment Holes?. In EMTCIR '24: The First Workshop on Evaluation Methodologies, Testbeds and Community for Information Access Research."},{"key":"e_1_3_2_1_2_1","volume-title":"Charles LA Clarke, and Mark Sanderson","author":"Alaofi Marwah","year":"2024","unstructured":"Marwah Alaofi, Negar Arabzadeh, Charles LA Clarke, and Mark Sanderson. 2024a. Generative information retrieval evaluation. 
In Information Access in the Era of Generative AI. Springer, 135-159."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3673791.3698431"},{"key":"e_1_3_2_1_4_1","volume-title":"Jo\u00e3o Guilherme Alves Santos, Hugo Abonizio, and Rodrigo Nogueira.","author":"Almeida Thales Sales","year":"2025","unstructured":"Thales Sales Almeida, Giovana Kerche Bon\u00e1s, Jo\u00e3o Guilherme Alves Santos, Hugo Abonizio, and Rodrigo Nogueira. 2025. TiEBe: A Benchmark for Assessing the Current Knowledge of Large Language Models. arXiv preprint arXiv:2501.07482 (2025)."},{"key":"e_1_3_2_1_5_1","first-page":"307","article-title":"Measurement in Medicine: The Analysis of Method Comparison Studies","volume":"32","author":"Altman D. G.","year":"1983","unstructured":"D. G. Altman and J. M. Bland. 1983. Measurement in Medicine: The Analysis of Method Comparison Studies. Journal of the Royal Statistical Society. Series D (The Statistician), Vol. 32, 3 (1983), 307-317. http:\/\/www.jstor.org\/stable\/2987937","journal-title":"Journal of the Royal Statistical Society. Series D (The Statistician)"},{"key":"e_1_3_2_1_6_1","volume-title":"A Human-AI Comparative Analysis of Prompt Sensitivity in LLM-Based Relevance Judgment. arXiv preprint arXiv:2504.12408","author":"Arabzadeh Negar","year":"2025","unstructured":"Negar Arabzadeh and Charles LA Clarke. 2025. A Human-AI Comparative Analysis of Prompt Sensitivity in LLM-Based Relevance Judgment. arXiv preprint arXiv:2504.12408 (2025)."},{"key":"e_1_3_2_1_7_1","volume-title":"Effects of group pressure on the modification and distortion. Readings in social psychology","author":"Asch Solomon","year":"1958","unstructured":"Solomon Asch. 1958. Effects of group pressure on the modification and distortion. Readings in social psychology. New York: Holt, Rinehart and Winston (1958)."},{"key":"e_1_3_2_1_8_1","volume-title":"The Importance of Distrust in AI. arXiv preprint arXiv:2307.13601","year":"2023","unstructured":"Authors. 2023. 
The Importance of Distrust in AI. arXiv preprint arXiv:2307.13601 (2023). https:\/\/arxiv.org\/abs\/2307.13601"},{"key":"e_1_3_2_1_9_1","volume-title":"Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation. arXiv preprint arXiv:2503.19092","author":"Balog Krisztian","year":"2025","unstructured":"Krisztian Balog, Donald Metzler, and Zhen Qin. 2025. Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation. arXiv preprint arXiv:2503.19092 (2025)."},{"key":"e_1_3_2_1_10_1","volume-title":"Prompt-Based Document Modifications in Ranking Competitions. arXiv preprint arXiv:2502.07315","author":"Bardas Niv","year":"2025","unstructured":"Niv Bardas, Tommy Mordo, Oren Kurland, Moshe Tennenholtz, and Gal Zur. 2025. Prompt-Based Document Modifications in Ranking Competitions. arXiv preprint arXiv:2502.07315 (2025)."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2808194.2809456"},{"key":"e_1_3_2_1_12_1","first-page":"12345","article-title":"Extrinsic Evaluation of Cultural Competence in Large Language Models","volume":"2024","author":"Bhatt Shaily","year":"2024","unstructured":"Shaily Bhatt and Fernando Diaz. 2024. Extrinsic Evaluation of Cultural Competence in Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024. 12345-12356.","journal-title":"Findings of the Association for Computational Linguistics: EMNLP"},{"key":"e_1_3_2_1_13_1","volume-title":"Proceedings of the First Conference on Language Modeling (COLM","author":"Bordt Sebastian","year":"2024","unstructured":"Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, and Rich Caruana. 2024. Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models. In Proceedings of the First Conference on Language Modeling (COLM 2024). Philadelphia, USA. 
https:\/\/arxiv.org\/abs\/2404.06209 Camera-ready version; supersedes arXiv:2404.06209."},{"key":"e_1_3_2_1_14_1","volume-title":"How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009","author":"Chen Lingjiao","year":"2023","unstructured":"Lingjiao Chen, Matei Zaharia, and James Zou. 2023. How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009 (2023)."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.669"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.20736\/0002002105"},{"key":"e_1_3_2_1_17_1","volume-title":"The effects of event rate on a cognitive vigilance task. Human factors","author":"Claypoole Victoria L","year":"2019","unstructured":"Victoria L Claypoole, Daryn A Dever, Kody L Denues, and James L Szalma. 2019. The effects of event rate on a cognitive vigilance task. Human factors, Vol. 61, 3 (2019), 440-450."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1108\/eb050097"},{"key":"e_1_3_2_1_19_1","volume-title":"Autonomy and reliability of continuous active learning for technology-assisted review. arXiv preprint arXiv:1504.06868","author":"Cormack Gordon V","year":"2015","unstructured":"Gordon V Cormack and Maura R Grossman. 2015. Autonomy and reliability of continuous active learning for technology-assisted review. arXiv preprint arXiv:1504.06868 (2015)."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2983323.2983776"},{"key":"e_1_3_2_1_21_1","unstructured":"Ellese Cotterill. 2024. How to improve search without looking at queries or results. 
https:\/\/www.canva.dev\/blog\/engineering\/how-to-improve-search-without-looking-at-queries-or-results\/"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3637528.3671882"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.naacl-long.482"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3726302.3730178"},{"key":"e_1_3_2_1_25_1","volume-title":"Proceedings of the 11th ACM SIGIR \/ The 15th International Conference on Innovative Concepts and Theories in Information Retrieval.","author":"Dietz Laura","year":"2025","unstructured":"Laura Dietz and Naghmeh Farzi. 2025. Criteria-Based LLM Relevance Judgments. In Proceedings of the 11th ACM SIGIR \/ The 15th International Conference on Innovative Concepts and Theories in Information Retrieval."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.emnlp-main.911"},{"key":"e_1_3_2_1_27_1","volume-title":"Advances in Neural Information Processing Systems 37 (NeurIPS","author":"Dohmatob Elvis","year":"2024","unstructured":"Elvis Dohmatob, Yunzhen Feng, and Julia Kempe. 2024. Model Collapse Demystified: The Case of Regression. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024). Vancouver, Canada. Main-conference track; replaces arXiv:2402.07712."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-emnlp.592"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3578337.3605136"},{"key":"e_1_3_2_1_30_1","volume-title":"Proceedings of LLM4Eval: The First Workshop on Large Language Models for Evaluation in Information Retrieval.","author":"Farzi Naghmeh","year":"2024","unstructured":"Naghmeh Farzi and Laura Dietz. 2024. Exam: Llm-based answerability metrics for ir evaluation. In Proceedings of LLM4Eval: The First Workshop on Large Language Models for Evaluation in Information Retrieval."},{"key":"e_1_3_2_1_31_1","volume-title":"Truthful or Fabricated? 
Using Causal Attribution to Mitigate Reward Hacking in Explanations. arXiv preprint arXiv:2504.05294","author":"Ferreira Pedro","year":"2025","unstructured":"Pedro Ferreira, Wilker Aziz, and Ivan Titov. 2025. Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations. arXiv preprint arXiv:2504.05294 (2025)."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1002\/aaai.12182"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-28244-7_20"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3591888"},{"key":"e_1_3_2_1_35_1","volume-title":"LLM-enhanced Reranking in Recommender Systems. arXiv preprint arXiv:2406.12433","author":"Gao Jingtong","year":"2024","unstructured":"Jingtong Gao, Bo Chen, Xiangyu Zhao, Weiwen Liu, Xiangyang Li, Yichao Wang, Zijian Zhang, Wanyu Wang, Yuyang Ye, Shanru Lin, Huifeng Guo, and Ruiming Tang. 2024. LLM-enhanced Reranking in Recommender Systems. arXiv preprint arXiv:2406.12433 (2024)."},{"key":"e_1_3_2_1_36_1","volume-title":"Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data. arXiv preprint arXiv:2404.01413","author":"Gerstgrasser Matthias","year":"2024","unstructured":"Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, and Sanmi Koyejo. 2024. Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data. arXiv preprint arXiv:2404.01413 (2024)."},{"key":"e_1_3_2_1_37_1","volume-title":"Monetary Economics","volume":"1","author":"Goodhart Charles","year":"1975","unstructured":"Charles Goodhart. 1975. Problems of monetary management: the UK experience in papers in monetary economics. Monetary Economics, Vol. 
1 (1975)."},{"key":"e_1_3_2_1_38_1","volume-title":"Susceptibility to influence of large language models. arXiv preprint arXiv:2303.06074","author":"Griffin Lewis D","year":"2023","unstructured":"Lewis D Griffin, Bennett Kleinberg, Maximilian Mozes, Kimberly T Mai, Maria Vau, Matthew Caldwell, and Augustine Marvor-Parker. 2023. Susceptibility to influence of large language models. arXiv preprint arXiv:2303.06074 (2023)."},{"key":"e_1_3_2_1_39_1","volume-title":"Detecting Machine-Generated Texts: Not Just ''AI vs Humans'' and Explainability is Complicated. arXiv preprint arXiv:2406.18259","author":"Ji Jiazhou","year":"2024","unstructured":"Jiazhou Ji, Ruizhe Li, Shujun Li, Jie Guo, Weidong Qiu, Zheng Huang, Chiyu Chen, Xiaoyu Jiang, and Xinru Lu. 2024. Detecting Machine-Generated Texts: Not Just ''AI vs Humans'' and Explainability is Complicated. arXiv preprint arXiv:2406.18259 (2024)."},{"key":"e_1_3_2_1_40_1","volume-title":"Zombies in the Loop? Humans Trust Untrustworthy AI-Advisors for Ethical Decisions. arXiv preprint arXiv:2106.16122","author":"Kr\u00fcgel Sebastian","year":"2021","unstructured":"Sebastian Kr\u00fcgel, Andreas Ostermaier, and Matthias Uhl. 2021. Zombies in the Loop? Humans Trust Untrustworthy AI-Advisors for Ethical Decisions. arXiv preprint arXiv:2106.16122 (2021)."},{"key":"e_1_3_2_1_41_1","first-page":"2838","volume-title":"Competitive Search. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)","author":"Kurland Oren","year":"2022","unstructured":"Oren Kurland and Moshe Tennenholtz. 2022. Competitive Search. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22). 2838-2849."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3731120.3744581"},{"key":"e_1_3_2_1_43_1","volume-title":"2025 a. LLM Generated Persona is a Promise with a Catch. 
arXiv preprint arXiv:2503.16527","author":"Li Ang","year":"2025","unstructured":"Ang Li, Haozhe Chen, Hongseok Namkoong, and Tianyi Peng. 2025 a. LLM Generated Persona is a Promise with a Catch. arXiv preprint arXiv:2503.16527 (2025)."},{"key":"e_1_3_2_1_44_1","volume-title":"2025 b. Preference Leakage: A Contamination Problem in LLM-as-a-judge. arXiv preprint arXiv:2502.01534","author":"Li Dawei","year":"2025","unstructured":"Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, and Huan Liu. 2025 b. Preference Leakage: A Contamination Problem in LLM-as-a-judge. arXiv preprint arXiv:2502.01534 (2025)."},{"key":"e_1_3_2_1_45_1","volume-title":"AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap. arXiv preprint arXiv:2306.01941","author":"Vera Liao Q.","year":"2023","unstructured":"Q. Vera Liao and Jennifer Wortman Vaughan. 2023. AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap. arXiv preprint arXiv:2306.01941 (2023)."},{"key":"e_1_3_2_1_46_1","volume-title":"Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. 561-568","author":"Lin Jimmy","year":"2007","unstructured":"Jimmy Lin and Dina Demner-Fushman. 2007. Different structures for evaluating answers to complex questions: Pyramids won't topple, and neither will human assessors. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. 561-568."},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.598"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.153"},{"key":"e_1_3_2_1_49_1","volume-title":"Nafise Sadat Moosavi, and Chenghua Lin","author":"Liu Yiqi","year":"2024","unstructured":"Yiqi Liu, Nafise Sadat Moosavi, and Chenghua Lin. 2024. LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores. In Findings of the Association for Computational Linguistics ACL 2024. 
12688-12701."},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3077136.3080793"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657846"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.64"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/2806416.2806606"},{"key":"e_1_3_2_1_54_1","article-title":"Rank-Biased Precision for Measurement of Retrieval Effectiveness. ACM","volume":"27","author":"Moffat Alistair","year":"2008","unstructured":"Alistair Moffat and Justin Zobel. 2008. Rank-Biased Precision for Measurement of Retrieval Effectiveness. ACM Trans. Inf. Syst., Vol. 27 (2008).","journal-title":"Trans. Inf. Syst."},{"key":"e_1_3_2_1_55_1","volume-title":"Adversarial Search Engine Optimization for Large Language Models. In The Thirteenth International Conference on Learning Representations, ICLR 2025","author":"Nestaas Fredrik","year":"2025","unstructured":"Fredrik Nestaas, Edoardo Debenedetti, and Florian Tram\u00e8r. 2025. Adversarial Search Engine Optimization for Large Language Models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https:\/\/openreview.net\/forum?id=hkdqxN3c7t"},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3637528.3671883"},{"key":"e_1_3_2_1_57_1","volume-title":"Does Writing with Language Models Reduce Content Diversity? arXiv preprint arXiv:2309.05196","author":"Padmakumar Vishakh","year":"2023","unstructured":"Vishakh Padmakumar and He He. 2023. Does Writing with Language Models Reduce Content Diversity? arXiv preprint arXiv:2309.05196 (2023)."},{"key":"e_1_3_2_1_58_1","first-page":"68772","article-title":"Llm evaluators recognize and favor their own generations","volume":"37","author":"Panickssery Arjun","year":"2024","unstructured":"Arjun Panickssery, Samuel Bowman, and Shi Feng. 2024. 
Llm evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems, Vol. 37 (2024), 68772-68802.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_59_1","volume-title":"Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models. In Advances in Information Retrieval. 46th European Conference on IR Research (ECIR 2024)","author":"Parry Andrew","year":"2024","unstructured":"Andrew Parry, Maik Fr\u00f6be, Sean MacAvaney, Martin Potthast, and Matthias Hagen. 2024. Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models. In Advances in Information Retrieval. 46th European Conference on IR Research (ECIR 2024) (Lecture Notes in Computer Science). Springer, Berlin Heidelberg New York."},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657942"},{"key":"e_1_3_2_1_61_1","unstructured":"Hossein A. Rahmani Varsha Ramineni Nick Craswell Bhaskar Mitra and Emine Yilmaz. 2025. Towards Understanding Bias in Synthetic Data for Evaluation. arxiv:2506.10301 [cs.IR] https:\/\/arxiv.org\/abs\/2506.10301"},{"key":"e_1_3_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.emnlp-main.427"},{"key":"e_1_3_2_1_63_1","volume-title":"Damiano Spina, and Oleg Zendel.","author":"Ran Kun","year":"2025","unstructured":"Kun Ran, Shuoqi Sun, Khoi Nguyen Dinh Anh, Damiano Spina, and Oleg Zendel. 2025. RMIT-ADMS at the SIGIR 2025 LiveRAG Challenge - GRAG: Generation-Retrieval-Augmented Generation. In LiveRAG Challenge at SIGIR 2025. 9 pages."},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1093\/jamia\/ocaa091"},{"key":"e_1_3_2_1_65_1","volume-title":"Data Contamination Through the Lens of Time. arXiv preprint arXiv:2310.10628","author":"Roberts Manley","year":"2023","unstructured":"Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, and Samuel Dooley. 2023. Data Contamination Through the Lens of Time. 
arXiv preprint arXiv:2310.10628 (2023)."},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/3485447.3511960"},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-emnlp.722"},{"key":"e_1_3_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3463236"},{"key":"e_1_3_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3658644.3690291"},{"key":"e_1_3_2_1_70_1","volume-title":"The Curse of Recursion: Training on Generated Data Makes Models Forget. arXiv preprint arXiv:2305.17493","author":"Shumailov Ilia","year":"2023","unstructured":"Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. 2023. The Curse of Recursion: Training on Generated Data Makes Models Forget. arXiv preprint arXiv:2305.17493 (2023). https:\/\/arxiv.org\/abs\/2305.17493"},{"key":"e_1_3_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-024-07566-y"},{"key":"e_1_3_2_1_72_1","first-page":"6789","article-title":"Evaluating Large Language Model Biases in Persona-Steered Generation","volume":"2024","author":"Chenglei Si","year":"2024","unstructured":"Chenglei Si et al. 2024. Evaluating Large Language Model Biases in Persona-Steered Generation. In Findings of the Association for Computational Linguistics: ACL 2024. 6789-6800.","journal-title":"Findings of the Association for Computational Linguistics: ACL"},{"key":"e_1_3_2_1_73_1","volume-title":"Andrew Poulton, David Esiobu, Maria Lomeli, and Gergely Szilvasy.","author":"Singh Aaditya K","year":"2024","unstructured":"Aaditya K Singh, Muhammed Yusuf Kocyigit, Andrew Poulton, David Esiobu, Maria Lomeli, and Gergely Szilvasy. 2024. Evaluation Data Contamination in LLMs: How Do We Measure It and (When) Does It Matter? arXiv preprint arXiv:2411.03923 (2024)."},{"key":"e_1_3_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.560"},{"key":"e_1_3_2_1_75_1","volume-title":"Don't use LLMs to make relevance judgments. 
Information retrieval research journal","author":"Soboroff Ian","year":"2025","unstructured":"Ian Soboroff. 2025. Don't use LLMs to make relevance judgments. Information retrieval research journal, Vol. 1, 1 (2025), 10-54195."},{"key":"e_1_3_2_1_76_1","volume-title":"Report on the need for and provision of an `ideal' information retrieval test collection. Computer Laboratory","author":"Jones Karen Sp\u00e4rck","year":"1975","unstructured":"Karen Sp\u00e4rck Jones and C. J. van Rijsbergen. 1975. Report on the need for and provision of an `ideal' information retrieval test collection. Computer Laboratory (1975)."},{"key":"e_1_3_2_1_77_1","volume-title":"What large language models know and what people think they know. Nature Machine Intelligence","author":"Steyvers Mark","year":"2025","unstructured":"Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W Mayer, and Padhraic Smyth. 2025. What large language models know and what people think they know. Nature Machine Intelligence (2025), 1-11."},{"key":"e_1_3_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657845"},{"key":"e_1_3_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657707"},{"key":"e_1_3_2_1_80_1","first-page":"1","article-title":"Report from the Fourth Strategic Workshop on Information Retrieval in Lorne (SWIRL 2025)","volume":"59","author":"Trippas Johanne R.","year":"2025","unstructured":"Johanne R. Trippas and J. Shane Culpepper. 2025. Report from the Fourth Strategic Workshop on Information Retrieval in Lorne (SWIRL 2025). ACM SIGIR Forum, Vol. 
59, 1 (June 2025), 68 pages.","journal-title":"ACM SIGIR Forum"},{"key":"e_1_3_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-emnlp.969"},{"key":"e_1_3_2_1_82_1","volume-title":"Hoa Trang Dang, and Jimmy Lin","author":"Upadhyay Shivani","year":"2024","unstructured":"Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. 2024a. A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look. arXiv preprint arXiv:2411.08275 (2024)."},{"key":"e_1_3_2_1_83_1","volume-title":"Hoa Trang Dang, and Jimmy Lin","author":"Upadhyay Shivani","year":"2024","unstructured":"Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. 2024b. A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look. arXiv preprint arXiv:2411.08275 (2024)."},{"key":"e_1_3_2_1_84_1","volume-title":"UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. arXiv preprint arXiv:2406.06519","author":"Upadhyay Shivani","year":"2024","unstructured":"Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Nick Craswell, and Jimmy Lin. 2024c. UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. arXiv preprint arXiv:2406.06519 (2024)."},{"key":"e_1_3_2_1_85_1","volume-title":"Information Retrieval Evaluation in a Changing World, Nicola Ferro and Carol Peters (Eds.).","author":"Voorhees Ellen M.","unstructured":"Ellen M. Voorhees. 2019. The Evolution of Cranfield. In Information Retrieval Evaluation in a Changing World, Nicola Ferro and Carol Peters (Eds.). Vol. 41. Springer International Publishing, 45-69."},{"key":"e_1_3_2_1_86_1","volume-title":"Can Old TREC Collections Reliably Evaluate Modern Neural Retrieval Models? 
arXiv preprint arXiv:2201.11086","author":"Voorhees Ellen M","year":"2022","unstructured":"Ellen M Voorhees, Ian Soboroff, and Jimmy Lin. 2022. Can Old TREC Collections Reliably Evaluate Modern Neural Retrieval Models? arXiv preprint arXiv:2201.11086 (2022)."},{"key":"e_1_3_2_1_87_1","volume-title":"Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv preprint arXiv:2212.03533","author":"Wang Liang","year":"2024","unstructured":"Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv preprint arXiv:2212.03533 (2024)."},{"key":"e_1_3_2_1_88_1","volume-title":"Rethinking generative large language model evaluation for semantic comprehension. arXiv preprint arXiv:2403.07872","author":"Wei Fangyun","year":"2024","unstructured":"Fangyun Wei, Xi Chen, and Lin Luo. 2024. Rethinking generative large language model evaluation for semantic comprehension. arXiv preprint arXiv:2403.07872 (2024)."},{"key":"e_1_3_2_1_89_1","volume-title":"Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective. CoRR","author":"Wen Yuchen","year":"2024","unstructured":"Yuchen Wen, Keping Bi, Wei Chen, Jiafeng Guo, and Xueqi Cheng. 2024. Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective. CoRR (2024)."},{"key":"e_1_3_2_1_90_1","unstructured":"Cheng Xu Shuhao Guan Derek Greene M Kechadi et al. 2024a. Benchmark data contamination of large language models: A survey. arXiv preprint arXiv:2406.04244 (2024)."},{"key":"e_1_3_2_1_91_1","unstructured":"Ruijie Xu Zengzhi Wang Run-Ze Fan and Pengfei Liu. 2024b. Benchmarking Benchmark Leakage in Large Language Models. 
arxiv:2404.18824 [cs.CL] https:\/\/arxiv.org\/abs\/2404.18824"},{"key":"e_1_3_2_1_92_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.826"},{"key":"e_1_3_2_1_93_1","first-page":"12345","article-title":"Self-training Large Language Models through Knowledge Detection","volume":"2024","author":"Yeo Wei Jie","year":"2024","unstructured":"Wei Jie Yeo, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, and Erik Cambria. 2024. Self-training Large Language Models through Knowledge Detection. In Findings of the Association for Computational Linguistics: EMNLP 2024. 12345-12356.","journal-title":"Findings of the Association for Computational Linguistics: EMNLP"},{"key":"e_1_3_2_1_94_1","volume-title":"Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1439-1442","author":"Fan","unstructured":"Fan Zhang et al. 2020a. Towards a Better Understanding of Evaluation Metrics. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1439-1442."},{"key":"e_1_3_2_1_95_1","unstructured":"Fan Zhang et al. 2023. Constructing and Meta-Evaluating State-Aware Evaluation Metrics for Information Retrieval. Information Retrieval Journal (2023)."},{"key":"e_1_3_2_1_96_1","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401162"},{"key":"e_1_3_2_1_97_1","volume-title":"Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han.","author":"Zhou Kun","year":"2023","unstructured":"Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han. 2023. Don't Make Your LLM an Evaluation Benchmark Cheater. arXiv preprint arXiv:2311.01964 (2023)."},{"key":"e_1_3_2_1_98_1","volume-title":"Peifeng Wang, Caiming Xiong, and Shafiq Joty.","author":"Zhou Yilun","year":"2025","unstructured":"Yilun Zhou, Austin Xu, Peifeng Wang, Caiming Xiong, and Shafiq Joty. 2025. 
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators. arXiv preprint arXiv:2504.15253 (2025)."},{"key":"e_1_3_2_1_99_1","volume-title":"SIGIR Forum","volume":"56","author":"Zobel Justin","year":"2023","unstructured":"Justin Zobel. 2023. When Measurement Misleads: The Limits of Batch Assessment of Retrieval Systems. SIGIR Forum, Vol. 56 (Jan. 2023), 20 pages."}],"event":{"name":"ICTIR '25: International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval","location":"Padua Italy","acronym":"ICTIR '25","sponsor":["SIGIR ACM Special Interest Group on Information Retrieval"]},"container-title":["Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR)"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3731120.3744588","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T13:17:14Z","timestamp":1755868634000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3731120.3744588"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,18]]},"references-count":99,"alternative-id":["10.1145\/3731120.3744588","10.1145\/3731120"],"URL":"https:\/\/doi.org\/10.1145\/3731120.3744588","relation":{},"subject":[],"published":{"date-parts":[[2025,7,18]]},"assertion":[{"value":"2025-07-18","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}