{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,20]],"date-time":"2026-01-20T02:17:10Z","timestamp":1768875430288,"version":"3.49.0"},"reference-count":66,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,8,18]],"date-time":"2023-08-18T00:00:00Z","timestamp":1692316800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100000923","name":"Australian Research Council\u2019s","doi-asserted-by":"crossref","award":["DP190101113"],"award-info":[{"award-number":["DP190101113"]}],"id":[{"id":"10.13039\/501100000923","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Inf. Syst."],"published-print":{"date-parts":[[2024,1,31]]},"abstract":"<jats:p>Measurement of the effectiveness of search engines is often based on use of relevance judgments. It is well known that judgments can be inconsistent between judges, leading to discrepancies that potentially affect not only scores but also system relativities and confidence in the experimental outcomes. We take the perspective that the relevance judgments are an amalgam of perfect relevance assessments plus errors; making use of a model of systematic errors in binary relevance judgments that can be tuned to reflect the kind of judge that is being used, we explore the behavior of measures of effectiveness as error is introduced. Using a novel methodology in which we examine the distribution of \u201ctrue\u201d effectiveness measurements that could be underlying measurements based on sets of judgments that include error, we find that even moderate amounts of error can lead to conclusions such as orderings of systems that statistical tests report as significant but are nonetheless incorrect. Further, in these results the widely used recall-based measures AP and NDCG are notably more fragile in the presence of judgment error than is the utility-based measure\u00a0RBP, but all the measures failed under even moderate error rates. We conclude that knowledge of likely error rates in judgments is critical to interpretation of experimental outcomes.<\/jats:p>","DOI":"10.1145\/3596511","type":"journal-article","created":{"date-parts":[[2023,5,20]],"date-time":"2023-05-20T08:59:21Z","timestamp":1684573161000},"page":"1-31","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["The Impact of Judgment Variability on the Consistency of Offline Effectiveness Measures"],"prefix":"10.1145","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6189-3274","authenticated-orcid":false,"given":"Lida","family":"Rashidi","sequence":"first","affiliation":[{"name":"The University of Melbourne, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6622-032X","authenticated-orcid":false,"given":"Justin","family":"Zobel","sequence":"additional","affiliation":[{"name":"The University of Melbourne, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6638-0232","authenticated-orcid":false,"given":"Alistair","family":"Moffat","sequence":"additional","affiliation":[{"name":"The University of Melbourne, Australia"}]}],"member":"320","published-online":{"date-parts":[[2023,8,18]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-022-09411-0"},{"key":"e_1_3_2_3_2","first-page":"571","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Aslam J. A.","year":"2005","unstructured":"J. A. Aslam, V. Pavlu, and E. Yilmaz. 2005. Measure-based metasearch. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). ACM, 571\u2013572."},{"key":"e_1_3_2_4_2","first-page":"541","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Aslam J. A.","year":"2006","unstructured":"J. A. Aslam, V. Pavlu, and E. Yilmaz. 2006. A statistical method for system evaluation using incomplete judgments. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 541\u2013548."},{"key":"e_1_3_2_5_2","first-page":"667","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Bailey P.","year":"2008","unstructured":"P. Bailey, N. Craswell, I. Soboroff, P. Thomas, A. P. de Vries, and E. Yilmaz. 2008. Relevance assessment: Are judges exchangeable and does it matter. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 667\u2013674."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.4225\/49\/5726E597B8376"},{"key":"e_1_3_2_7_2","first-page":"395","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Bailey P.","year":"2017","unstructured":"P. Bailey, A. Moffat, F. Scholer, and P. Thomas. 2017. Retrieval consistency in the presence of query variations. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 395\u2013404."},{"key":"e_1_3_2_8_2","first-page":"25","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Buckley C.","year":"2004","unstructured":"C. Buckley and E. M. Voorhees. 2004. Retrieval evaluation with incomplete information. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 25\u201332."},{"key":"e_1_3_2_9_2","volume-title":"TREC: Experiment and Evaluation in Information Retrieval","author":"Buckley C.","year":"2005","unstructured":"C. Buckley and E. M. Voorhees. 2005. Retrieval system evaluation. In TREC: Experiment and Evaluation in Information Retrieval, E. M. Voorhees and D. K. Harman (Eds.). The MIT Press."},{"key":"e_1_3_2_10_2","volume-title":"Proceedings of the Text Retrieval Conference (TREC)","author":"Buckley C.","year":"1999","unstructured":"C. Buckley and J. Walz. 1999. The TREC-8 query track. In Proceedings of the Text Retrieval Conference (TREC)."},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1016\/0306-4573(92)90031-T"},{"key":"e_1_3_2_12_2","first-page":"63","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"B\u00fcttcher S.","year":"2007","unstructured":"S. B\u00fcttcher, C. L. A. Clarke, P. C. K. Yeung, and I. Soboroff. 2007. Reliable information retrieval evaluation with incomplete and biased judgements. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 63\u201370."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/1835449.1835540"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/2094072.2094076"},{"issue":"3","key":"e_1_3_2_15_2","first-page":"33:1\u201333:21","article-title":"Assessing top-k preferences","volume":"39","author":"Clarke C. L. A.","year":"2021","unstructured":"C. L. A. Clarke, A. Vtyurina, and M. D. Smucker. 2021. Assessing top-k preferences. ACM Trans. Inf. Syst. 39, 3 (2021), 33:1\u201333:21.","journal-title":"ACM Trans. Inf. Syst."},{"key":"e_1_3_2_16_2","volume-title":"The Effect of Variations in Relevance Assessments in Comparative Experimental Tests of Index Languages","author":"Cleverdon C. W.","year":"1970","unstructured":"C. W. Cleverdon. 1970. The Effect of Variations in Relevance Assessments in Comparative Experimental Tests of Index Languages. Technical Report. Cranfield University."},{"key":"e_1_3_2_17_2","first-page":"282","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Cormack G. V.","year":"1998","unstructured":"G. V. Cormack, C. R. Palmer, and C. L. A. Clarke. 1998. Efficient construction of large test collections. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 282\u2013289."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3110217"},{"key":"e_1_3_2_19_2","first-page":"197","volume-title":"Proceedings of the European Conference on Information Retrieval (ECIR)","author":"Ferrante M.","year":"2018","unstructured":"M. Ferrante, N. Ferro, and S. Pontarollo. 2018. Modelling randomness in relevance judgments and evaluation measures. In Proceedings of the European Conference on Information Retrieval (ECIR). 197\u2013209."},{"key":"e_1_3_2_20_2","first-page":"901","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Ferro N.","year":"2017","unstructured":"N. Ferro and M. Sanderson. 2017. Sub-corpora impact on system effectiveness. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 901\u2013904."},{"key":"e_1_3_2_21_2","first-page":"280","volume-title":"Proceedings of the Conference on Web Search and Data Mining (WSDM)","author":"Ferro N.","year":"2022","unstructured":"N. Ferro and M. Sanderson. 2022. How do you test a test? A multifaceted examination of significance tests. In Proceedings of the Conference on Web Search and Data Mining (WSDM). 280\u2013288."},{"issue":"3","key":"e_1_3_2_22_2","doi-asserted-by":"crossref","first-page":"30:1\u201330:40","DOI":"10.1145\/3310364","article-title":"Using collection shards to study retrieval performance effect sizes","volume":"37","author":"Ferro N.","year":"2019","unstructured":"N. Ferro, Y. Kim, and M. Sanderson. 2019. Using collection shards to study retrieval performance effect sizes. ACM Trans. Inf. Syst. 37, 3 (2019), 30:1\u201330:40.","journal-title":"ACM Trans. Inf. Syst."},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3336191.3371857"},{"issue":"4","key":"e_1_3_2_24_2","doi-asserted-by":"crossref","first-page":"422","DOI":"10.1145\/582415.582418","article-title":"Cumulated gain-based evaluation of IR techniques","volume":"20","author":"J\u00e4rvelin K.","year":"2002","unstructured":"K. J\u00e4rvelin and J. Kek\u00e4l\u00e4inen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4 (2002), 422\u2013446.","journal-title":"ACM Trans. Inf. Syst."},{"key":"e_1_3_2_25_2","first-page":"105","volume-title":"Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM)","author":"Kazai G.","year":"2012","unstructured":"G. Kazai, N. Craswell, E. Yilmaz, and S. M. M. Tahaghoghi. 2012. An analysis of systematic judging errors in information retrieval. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM). 105\u2013114."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-012-9205-0"},{"key":"e_1_3_2_27_2","first-page":"591","volume-title":"Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM)","author":"Kinney K. A.","year":"2008","unstructured":"K. A. Kinney, S. B. Huffman, and J. Zhai. 2008. How evaluator domain expertise affects search result relevance judgments. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM). 591\u2013598."},{"key":"e_1_3_2_28_2","first-page":"805","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Kutlu M.","year":"2018","unstructured":"M. Kutlu, T. McDonnell, Y. Barkallah, T. Elsayed, and M. Lease. 2018. Crowd vs expert: What can relevance judgment rationales teach us about assessor disagreement? In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 805\u2013814."},{"issue":"3","key":"e_1_3_2_29_2","doi-asserted-by":"crossref","first-page":"166","DOI":"10.1002\/(SICI)1097-4571(199104)42:3<166::AID-ASI2>3.0.CO;2-A","article-title":"A study of probabilistic information retrieval systems in the case of inconsistent expert judgments","volume":"42","author":"Lee J. J.","year":"1991","unstructured":"J. J. Lee and P. B. Kantor. 1991. A study of probabilistic information retrieval systems in the case of inconsistent expert judgments. J. Amer. Societ. Inf. Sci. 42, 3 (1991), 166\u2013172.","journal-title":"J. Amer. Societ. Inf. Sci."},{"issue":"4","key":"e_1_3_2_30_2","doi-asserted-by":"crossref","first-page":"343","DOI":"10.1016\/0020-0271(68)90029-6","article-title":"Relevance assessments and retrieval system evaluation","volume":"4","author":"Lesk M. E.","year":"1968","unstructured":"M. E. Lesk and G. Salton. 1968. Relevance assessments and retrieval system evaluation. Inf. Stor. Retr. 4, 4 (1968), 343\u2013359.","journal-title":"Inf. Stor. Retr."},{"key":"e_1_3_2_31_2","first-page":"148","volume-title":"Proceedings of the European Conference on Information Retrieval (ECIR)","author":"Li L.","year":"2014","unstructured":"L. Li and M. D. Smucker. 2014. Tolerance of effectiveness measures to relevance judging errors. In Proceedings of the European Conference on Information Retrieval (ECIR). Springer, 148\u2013159."},{"issue":"4","key":"e_1_3_2_32_2","doi-asserted-by":"crossref","first-page":"1503","DOI":"10.1109\/TKDE.2019.2947049","article-title":"Fixed-cost pooling strategies","volume":"33","author":"Lipani A.","year":"2021","unstructured":"A. Lipani, D. E. Losada, G. Zuccon, and M. Lupu. 2021. Fixed-cost pooling strategies. IEEE Trans. Knowl. Data Eng. 33, 4 (2021), 1503\u20131522.","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"e_1_3_2_33_2","doi-asserted-by":"crossref","first-page":"1027","DOI":"10.1145\/2851613.2851692","volume-title":"Proceedings of the ACM Symposium on Applied Computing","author":"Losada D. E.","year":"2016","unstructured":"D. E. Losada, J. Parapar, and A. Barreiro. 2016. Feeling lucky? Multi-armed bandits for ordering judgements in pooling-based evaluation. In Proceedings of the ACM Symposium on Applied Computing. 1027\u20131034."},{"key":"e_1_3_2_34_2","first-page":"3077","volume-title":"Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM)","author":"Mackenzie J.","year":"2020","unstructured":"J. Mackenzie, R. Benham, M. Petri, J. R. Trippas, J. S. Culpepper, and A. Moffat. 2020. CC-News-En: A large English news corpus. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM). 3077\u20133084."},{"key":"e_1_3_2_35_2","unstructured":"J. Mackenzie M. Petri and A. Moffat. 2021. A Sensitivity Analysis of the MSMARCO Passage Collection. (Dec.2021). arXiv:2112.03396."},{"key":"e_1_3_2_36_2","first-page":"129","volume-title":"Proceedings of the AAAI Conference on Human Computation and Crowdsourcing","author":"Maddalena E.","year":"2016","unstructured":"E. Maddalena, M. Basaldella, D. De Nart, D. Degl\u2019Innocenti, S. Mizzaro, and G. Demartini. 2016. Crowdsourcing relevance assessments: The unexpected benefits of limiting the time to judge. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing. 129\u2013138."},{"key":"e_1_3_2_37_2","doi-asserted-by":"crossref","first-page":"65","DOI":"10.1145\/3015022.3015025","volume-title":"Proceedings of the Australasian Document Computing Symposium (ADCS)","author":"Moffat A.","year":"2016","unstructured":"A. Moffat. 2016. Judgment pool effects caused by query variations. In Proceedings of the Australasian Document Computing Symposium (ADCS). 65\u201368."},{"issue":"1","key":"e_1_3_2_38_2","doi-asserted-by":"crossref","first-page":"2.1\u20132.27","DOI":"10.1145\/1416950.1416952","article-title":"Rank-biased precision for measurement of retrieval effectiveness","volume":"27","author":"Moffat A.","year":"2008","unstructured":"A. Moffat and J. Zobel. 2008. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. 27, 1 (2008), 2.1\u20132.27.","journal-title":"ACM Trans. Inf. Syst."},{"key":"e_1_3_2_39_2","first-page":"375","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Moffat A.","year":"2007","unstructured":"A. Moffat, W. Webber, and J. Zobel. 2007. Strategic system comparisons via targeted relevance judgments. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 375\u2013382."},{"key":"e_1_3_2_40_2","first-page":"1759","volume-title":"Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM)","author":"Moffat A.","year":"2015","unstructured":"A. Moffat, F. Scholer, P. Thomas, and P. Bailey. 2015. Pooled evaluation over query variations: Users are as diverse as systems. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM). 1759\u20131762."},{"issue":"3","key":"e_1_3_2_41_2","doi-asserted-by":"crossref","first-page":"24:1\u201324:38","DOI":"10.1145\/3052768","article-title":"Incorporating user expectations and behavior into the measurement of search effectiveness","volume":"35","author":"Moffat A.","year":"2017","unstructured":"A. Moffat, P. Bailey, F. Scholer, and P. Thomas. 2017. Incorporating user expectations and behavior into the measurement of search effectiveness. ACM Trans. Inf. Syst. 35, 3 (2017), 24:1\u201324:38.","journal-title":"ACM Trans. Inf. Syst."},{"issue":"4","key":"e_1_3_2_42_2","first-page":"10.1\u201310.22","article-title":"Estimating measurement uncertainty for information retrieval effectiveness metrics","volume":"10","author":"Moffat A.","year":"2018","unstructured":"A. Moffat, F. Scholer, and Z. Yang. 2018. Estimating measurement uncertainty for information retrieval effectiveness metrics. ACM J. Data Inf. Qual. 10, 4 (Oct.2018), 10.1\u201310.22.","journal-title":"ACM J. Data Inf. Qual."},{"key":"e_1_3_2_43_2","first-page":"1667","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Rashidi L.","year":"2021","unstructured":"L. Rashidi, J. Zobel, and A. Moffat. 2021. Evaluating the predictivity of IR experiments. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 1667\u20131671."},{"key":"e_1_3_2_44_2","first-page":"525","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Sakai T.","year":"2006","unstructured":"T. Sakai. 2006. Evaluating evaluation metrics based on the bootstrap. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 525\u2013532."},{"key":"e_1_3_2_45_2","first-page":"71","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Sakai T.","year":"2007","unstructured":"T. Sakai. 2007. Alternatives to bpref. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 71\u201378."},{"key":"e_1_3_2_46_2","first-page":"5","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Sakai T.","year":"2016","unstructured":"T. Sakai. 2016. Statistical significance, power, and sample sizes: A systematic review of SIGIR and TOIS, 2006\u20132015. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). ACM, 5\u201314."},{"issue":"4","key":"e_1_3_2_47_2","doi-asserted-by":"crossref","first-page":"247","DOI":"10.1561\/1500000009","article-title":"Test collection based evaluation of information retrieval systems","volume":"4","author":"Sanderson M.","year":"2010","unstructured":"M. Sanderson. 2010. Test collection based evaluation of information retrieval systems. Found. Trends Inf. Retr. 4, 4 (2010), 247\u2013375.","journal-title":"Found. Trends Inf. Retr."},{"key":"e_1_3_2_48_2","first-page":"162","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Sanderson M.","year":"2005","unstructured":"M. Sanderson and J. Zobel. 2005. Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). ACM, 162\u2013169."},{"key":"e_1_3_2_49_2","first-page":"1965","volume-title":"Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM)","author":"Sanderson M.","year":"2012","unstructured":"M. Sanderson, A. Turpin, Y. Zhang, and F. Scholer. 2012. Differences in effectiveness across sub-collections. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM). ACM, 1965\u20131969."},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1353\/lib.0.0000"},{"key":"e_1_3_2_51_2","first-page":"1063","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Scholer F.","year":"2011","unstructured":"F. Scholer, A. Turpin, and M. Sanderson. 2011. Quantifying test collection quality based on the consistency of relevance judgements. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). ACM, 1063\u20131072."},{"key":"e_1_3_2_52_2","volume-title":"Proceedings of the Workshop on Evaluating Information Access (EVIA)","author":"Scholer F.","year":"2014","unstructured":"F. Scholer, E. Maddalena, S. Mizzaro, and A. Turpin. 2014. Magnitudes of relevance: Relevance judgements, magnitude estimation, and crowdsourcing. In Proceedings of the Workshop on Evaluating Information Access (EVIA)."},{"key":"e_1_3_2_53_2","first-page":"1231","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Smucker M. D.","year":"2011","unstructured":"M. D. Smucker and C. P. Jethani. 2011. Measuring assessor accuracy: A comparison of NIST assessors and user study participants. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 1231\u20131232."},{"key":"e_1_3_2_54_2","first-page":"9","volume-title":"Proceedings of the SIGIR Workshp. Crowdsourcing for Information Retrieval","author":"Smucker M. D.","year":"2011","unstructured":"M. D. Smucker and C. P. Jethani. 2011. The crowd vs the lab: A comparison of crowd-sourced and university laboratory participant behavior. In Proceedings of the SIGIR Workshp. Crowdsourcing for Information Retrieval. 9\u201314."},{"key":"e_1_3_2_55_2","first-page":"623","volume-title":"Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM)","author":"Smucker M. D.","year":"2007","unstructured":"M. D. Smucker, J. Allan, and B. Carterette. 2007. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM). 623\u2013632."},{"key":"e_1_3_2_56_2","first-page":"66","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Soboroff I.","year":"2001","unstructured":"I. Soboroff, C. Nicholas, and P. Cahan. 2001. Ranking retrieval systems without relevance judgments. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). ACM, 66\u201373."},{"key":"e_1_3_2_57_2","first-page":"565","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Turpin A.","year":"2015","unstructured":"A. Turpin, F. Scholer, S. Mizzaro, and E. Maddalena. 2015. The benefits of magnitude estimation relevance assessments for information retrieval evaluation. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 565\u2013574."},{"issue":"3","key":"e_1_3_2_58_2","doi-asserted-by":"crossref","first-page":"313","DOI":"10.1007\/s10791-015-9274-y","article-title":"Test collection reliability: A study of bias and robustness to statistical assumptions via stochastic simulation","volume":"19","author":"Urbano J.","year":"2016","unstructured":"J. Urbano. 2016. Test collection reliability: A study of bias and robustness to statistical assumptions via stochastic simulation. Inf. Retr. 19, 3 (2016), 313\u2013350.","journal-title":"Inf. Retr."},{"key":"e_1_3_2_59_2","first-page":"505","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Urbano J.","year":"2019","unstructured":"J. Urbano, H. Lima, and A. Hanjalic. 2019. Statistical significance testing in information retrieval: An empirical analysis of type I, type II and type III errors. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 505\u2013514."},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0306-4573(00)00010-8"},{"issue":"2","key":"e_1_3_2_61_2","first-page":"12:1\u201312:21","article-title":"Using replicates in information retrieval evaluation","volume":"36","author":"Voorhees E. M.","year":"2017","unstructured":"E. M. Voorhees, D. Samarov, and I. Soboroff. 2017. Using replicates in information retrieval evaluation. ACM Trans. Inf. Syst. 36, 2 (2017), 12:1\u201312:21.","journal-title":"ACM Trans. Inf. Syst."},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/MIC.2012.71"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/1852102.1852106"},{"key":"e_1_3_2_64_2","first-page":"125","volume-title":"Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM)","author":"Webber W.","year":"2012","unstructured":"W. Webber, P. Chandar, and B. Carterette. 2012. Alternative assessor disagreement and retrieval depth. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM). ACM, 125\u2013134."},{"key":"e_1_3_2_65_2","first-page":"307","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)","author":"Zobel J.","year":"1998","unstructured":"J. Zobel. 1998. How reliable are the results of large-scale information retrieval experiments? In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR). 307\u2013314."},{"key":"e_1_3_2_66_2","first-page":"1933","volume-title":"Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM)","author":"Zobel J.","year":"2020","unstructured":"J. Zobel and L. Rashidi. 2020. Corpus bootstrapping for assessment of the properties of effectiveness measures. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM). ACM, 1933\u20131952."},{"key":"e_1_3_2_67_2","first-page":"691","volume-title":"Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM)","author":"Zuccon G.","year":"2016","unstructured":"G. Zuccon, J. Palotti, and A. Hanbury. 2016. Query variations and their effect on comparing information retrieval systems. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM). 691\u2013700."}],"container-title":["ACM Transactions on Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3596511","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3596511","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:48:00Z","timestamp":1750178880000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3596511"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,18]]},"references-count":66,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,1,31]]}},"alternative-id":["10.1145\/3596511"],"URL":"https:\/\/doi.org\/10.1145\/3596511","relation":{},"ISSN":["1046-8188","1558-2868"],"issn-type":[{"value":"1046-8188","type":"print"},{"value":"1558-2868","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,8,18]]},"assertion":[{"value":"2022-07-22","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-04-22","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-18","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}