{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T12:40:01Z","timestamp":1755866401307,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":56,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,7,13]]},"DOI":"10.1145\/3726302.3730229","type":"proceedings-article","created":{"date-parts":[[2025,7,14]],"date-time":"2025-07-14T01:38:52Z","timestamp":1752457132000},"page":"2875-2879","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Measuring Hypothesis Testing Errors in the Evaluation of Retrieval Systems"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1509-9248","authenticated-orcid":false,"given":"Jack","family":"McKechnie","sequence":"first","affiliation":[{"name":"University of Glasgow, Glasgow, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1266-5996","authenticated-orcid":false,"given":"Graham","family":"McDonald","sequence":"additional","affiliation":[{"name":"University of Glasgow, Glasgow, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3143-279X","authenticated-orcid":false,"given":"Craig","family":"Macdonald","sequence":"additional","affiliation":[{"name":"University of Glasgow, Glasgow, United Kingdom"}]}],"member":"320","published-online":{"date-parts":[[2025,7,13]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Can We Use Large Language Models to Fill Relevance Judgment Holes? arXiv preprint arXiv:2405.05600","author":"Abbasiantaeb Zahra","year":"2024","unstructured":"Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, and Mohammad Aliannejadi. 2024. Can We Use Large Language Models to Fill Relevance Judgment Holes? 
arXiv preprint arXiv:2405.05600 (2024)."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/860435.860501"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2484028.2484034"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR.2010.764"},{"volume-title":"Proc. of RecSys.","author":"Roc\u00edo","key":"e_1_3_2_1_5_1","unstructured":"Roc\u00edo Ca\u00f1amares and Pablo Castells. 2020. On target item sampling in offline recommender system evaluation. In Proc. of RecSys."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1108\/eb050097"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1177\/001316446002000104"},{"key":"e_1_3_2_1_8_1","volume-title":"A definition of relevance for information retrieval. Information storage and retrieval","author":"Cooper William S","year":"1971","unstructured":"William S Cooper. 1971. A definition of relevance for information retrieval. Information storage and retrieval, Vol. 7, 1 (1971)."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/290941.291009"},{"key":"e_1_3_2_1_10_1","volume-title":"Overview of the TREC 2020 deep learning track. arXiv preprint arXiv:2102","author":"Craswell Nick","year":"2020","unstructured":"Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2020a. Overview of the TREC 2020 deep learning track. arXiv preprint arXiv:2102.07662 (2020)."},{"key":"e_1_3_2_1_11_1","volume-title":"Overview of the TREC 2019 deep learning track. arXiv preprint arXiv:2003","author":"Craswell Nick","year":"2020","unstructured":"Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. 2020b. Overview of the TREC 2019 deep learning track. arXiv preprint arXiv:2003.07820 (2020)."},{"key":"e_1_3_2_1_12_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The llama 3 herd of models. 
arXiv preprint arXiv:2407.21783 (2024)."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3578337.3605136"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-72240-1_3"},{"key":"e_1_3_2_1_15_1","volume-title":"TOIS","volume":"42","author":"Ji Yitong","year":"2024","unstructured":"Yu-chen Fan, Yitong Ji, Jie Zhang, and Aixin Sun. 2024. Our model achieves excellent performance on movielens: what does it mean? TOIS, Vol. 42, 6 (2024)."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3310364"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3077136.3080674"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3338062"},{"key":"e_1_3_2_1_19_1","volume-title":"Proc. of WSDM.","author":"Ferro Nicola","year":"2022","unstructured":"Nicola Ferro and Mark Sanderson. 2022. How do you test a test? A multifaceted examination of significance tests. In Proc. of WSDM."},{"key":"e_1_3_2_1_20_1","volume-title":"JAALAS","volume":"50","author":"Fitts Douglas A","year":"2011","unstructured":"Douglas A Fitts. 2011. Ethics and animal numbers: informal analyses, uncertain sample sizes, inefficient replications, and type I errors. JAALAS, Vol. 50, 4 (2011)."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3190580.3190586"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3336191.3371857"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-12275-0_16"},{"volume-title":"Multiple comparisons: theory and methods","author":"Hsu Jason","key":"e_1_3_2_1_24_1","unstructured":"Jason Hsu. 1996. Multiple comparisons: theory and methods. CRC Press."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11403-019-00266-1"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/30.1-2.81"},{"key":"e_1_3_2_1_27_1","volume-title":"Proc. 
of SAC.","author":"Losada David E","year":"2016","unstructured":"David E Losada, Javier Parapar, and \u00c1lvaro Barreiro. 2016. Feeling lucky? Multi-armed bandits for ordering judgements in pooling-based evaluation. In Proc. of SAC."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3592032"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1016\/0005-2795(75)90109-9"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-88708-6_19"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3383313.3418479"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3583780.3614916"},{"key":"e_1_3_2_1_33_1","volume-title":"LQ","volume":"63","author":"Park Taemin Kim","year":"1993","unstructured":"Taemin Kim Park. 1993. The nature of relevance in information retrieval: An empirical study. LQ, Vol. 63, 3 (1993)."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/1835449.1835560"},{"key":"e_1_3_2_1_35_1","volume-title":"Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024. arXiv preprint arXiv:2408","author":"Rahmani Hossein A","year":"2024","unstructured":"Hossein A Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles LA Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, and Emine Yilmaz. 2024. Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024. arXiv preprint arXiv:2408.05388 (2024)."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2645710.2645746"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2911492"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2914684"},{"key":"e_1_3_2_1_39_1","volume-title":"Effect Sizes, and Statistical Power","author":"Sakai Tetsuya","year":"2018","unstructured":"Tetsuya Sakai. 2018. 
Multiple comparison procedures. Laboratory Experiments in Information Retrieval: Sample Sizes, Effect Sizes, and Statistical Power (2018)."},{"key":"e_1_3_2_1_40_1","volume-title":"ACM SIGIR Forum","volume":"54","author":"Sakai Tetsuya","year":"2021","unstructured":"Tetsuya Sakai. 2021. On Fuhr's guideline for IR evaluation. In ACM SIGIR Forum, Vol. 54."},{"key":"e_1_3_2_1_41_1","unstructured":"Tetsuya Sakai Noriko Kando Chuan-Jie Lin Teruko Mitamura Hideki Shima Donghong Ji Kuang-Hua Chen and Eric Nyberg. 2008. Overview of the NTCIR-7 ACLIA IR4QA Task. In NTCIR."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/383952.383961"},{"key":"e_1_3_2_1_43_1","article-title":"Information retrieval test collections","volume":"32","author":"Jones Karen Sparck","year":"1976","unstructured":"Karen Sparck Jones and Cornelis Joost Van Rijsbergen. 1976. Information retrieval test collections. J. Doc, Vol. 32, 1 (1976).","journal-title":"J. Doc"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.2307\/1422689"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3591931"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657707"},{"key":"e_1_3_2_1_47_1","volume-title":"LLMs Can Patch Up Missing Relevance Judgments in Evaluation. arXiv preprint arXiv:2405.04727","author":"Upadhyay Shivani","year":"2024","unstructured":"Shivani Upadhyay, Ehsan Kamalloo, and Jimmy Lin. 2024a. LLMs Can Patch Up Missing Relevance Judgments in Evaluation. arXiv preprint arXiv:2405.04727 (2024)."},{"key":"e_1_3_2_1_48_1","volume-title":"UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. arXiv preprint arXiv:2406.06519","author":"Upadhyay Shivani","year":"2024","unstructured":"Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Nick Craswell, and Jimmy Lin. 2024b. UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. 
arXiv preprint arXiv:2406.06519 (2024)."},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/2484028.2484163"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2484028.2484038"},{"key":"e_1_3_2_1_51_1","volume-title":"Proc. of the joint IBM\/University of Newcastle upon Tyne seminar on data base systems","volume":"79","author":"Rijsbergen C Van","year":"1979","unstructured":"C Van Rijsbergen. 1979. Information retrieval: Theory and Practice. In Proc. of the joint IBM\/University of Newcastle upon Tyne seminar on data base systems, Vol. 79."},{"volume-title":"Overview of the Fifth Text REtrieval Conference (TREC-5). In Proc. of TREC-5, Ellen M. Voorhees and Donna K. Harman (Eds.).","author":"Ellen","key":"e_1_3_2_1_52_1","unstructured":"Ellen M. Voorhees and Donna Harman. 1997. Overview of the Fifth Text REtrieval Conference (TREC-5). In Proc. of TREC-5, Ellen M. Voorhees and Donna K. Harman (Eds.)."},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/1458082.1458158"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1152\/ajpheart.1997.273.1.H487"},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/952532.952693"},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/290941.291014"}],"event":{"name":"SIGIR '25: The 48th International ACM SIGIR Conference on Research and Development in Information Retrieval","sponsor":["SIGIR ACM Special Interest Group on Information Retrieval"],"location":"Padua Italy","acronym":"SIGIR '25"},"container-title":["Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information 
Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3726302.3730229","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T12:06:47Z","timestamp":1755864407000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3726302.3730229"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,13]]},"references-count":56,"alternative-id":["10.1145\/3726302.3730229","10.1145\/3726302"],"URL":"https:\/\/doi.org\/10.1145\/3726302.3730229","relation":{},"subject":[],"published":{"date-parts":[[2025,7,13]]},"assertion":[{"value":"2025-07-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}