{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,5]],"date-time":"2025-10-05T19:56:22Z","timestamp":1759694182015,"version":"3.41.0"},"reference-count":33,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2018,4,30]],"date-time":"2018-04-30T00:00:00Z","timestamp":1525046400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Inf. Syst."],"published-print":{"date-parts":[[2018,10,31]]},"abstract":"<jats:p>The effectiveness of a search engine is typically evaluated using hand-labeled datasets, where the labels indicate the relevance of documents to queries. Often the number of labels needed is too large to be created by the best annotators, and so less expensive labels (e.g., from crowdsourcing) are used. This introduces errors in the labels, and thus errors in standard effectiveness metrics (such as P@k and DCG). These errors must be taken into consideration when using the metrics. Previous work has approached assessor error by taking aggregates over multiple inexpensive assessors. We take a different approach and introduce equations and algorithms that can adjust the metrics to the values they would have had if there were no annotation errors.<\/jats:p>\n          <jats:p>This is especially important when two search engines are compared on their metrics. We give examples where one engine appeared to be statistically significantly better than the other, but the effect disappeared after the metrics were corrected for annotation error. 
In other words, the evidence supporting a statistical difference was illusory and caused by a failure to account for annotation error.<\/jats:p>","DOI":"10.1145\/3186195","type":"journal-article","created":{"date-parts":[[2018,4,30]],"date-time":"2018-04-30T11:58:18Z","timestamp":1525089498000},"page":"1-31","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Further Insights on Drawing Sound Conclusions from Noisy Judgments"],"prefix":"10.1145","volume":"36","author":[{"given":"David","family":"Goldberg","sequence":"first","affiliation":[{"name":"eBay, California, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1253-7123","authenticated-orcid":false,"given":"Andrew","family":"Trotman","sequence":"additional","affiliation":[{"name":"University of Otago, Dunedin, New Zealand"}]},{"given":"Xiao","family":"Wang","sequence":"additional","affiliation":[{"name":"eBay, California, USA"}]},{"given":"Wei","family":"Min","sequence":"additional","affiliation":[{"name":"CreditX, Shanghai, China"}]},{"given":"Zongru","family":"Wan","sequence":"additional","affiliation":[{"name":"Evolution Labs, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2018,4,30]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2911514"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1480506.1480508"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390334.1390447"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1835449.1835540"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1148170.1148262"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.2307\/2346806"},
{"key":"e_1_2_1_7_1","volume-title":"COLT 2009 Proceedings of the 22nd Annual Conference on Learning Theory","author":"Dekel Ofer","year":"2009","unstructured":"Ofer Dekel and Ohad Shamir. 2009. Vox populi: Collecting high-quality labels from a crowd. COLT 2009 Proceedings of the 22nd Annual Conference on Learning Theory (2009). http:\/\/eprints.pascal-network.org\/archive\/00005406\/"},
{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3038912.3052570"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1008992.1009079"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/582415.582418"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2487575.2487595"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2396761.2396779"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1185877.1185883"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2835776.2835835"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.2307\/2529310"},
{"key":"e_1_2_1_16_1","volume-title":"ECIR","author":"Li Le","year":"2014","unstructured":"Le Li and Mark D. Smucker. 2014. Tolerance of effectiveness measures to relevance judging errors. In ECIR 2014. 148--159."},
{"key":"e_1_2_1_17_1","volume-title":"Detection Theory: A User\u2019s Guide","author":"Macmillan N. A.","year":"2005","unstructured":"N. A. Macmillan and C. D. Creelman. 2005. Detection Theory: A User\u2019s Guide. Lawrence Erlbaum Associates. https:\/\/books.google.com\/books?id=EQLUGpgN0q8C"},
{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1416950.1416952"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00185"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1416950.1416951"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2911492"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-015-9273-z"},
{"key":"e_1_2_1_23_1","first-page":"0","volume-title":"ADCS","author":"Sanderson M.","year":"2010","unstructured":"M. Sanderson, F. Scholer, and A. Turpin. 2010. Relatively relevant: Assessor shift in document judgements. In ADCS 2010. 60--67. http:\/\/www.scopus.com\/inward\/record.url?eid=2-s2.0-84872873938"},
{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2009916.2010057"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1321440.1321528"},
{"volume-title":"Conference on Empirical Methods in Natural Language Processing. 254--263","author":"Snow R.","key":"e_1_2_1_26_1","unstructured":"R. Snow, B. O\u2019Connor, D. Jurafsky, and A. Y. Ng. 2008. Cheap and fast\u2014but is it good?: Evaluating non-expert annotations for natural language tasks. In Conference on Empirical Methods in Natural Language Processing. 254--263."},
{"key":"e_1_2_1_27_1","volume-title":"SIGIR 2011 Workshop on Crowdsourcing for Information Retrieval. 36--41","author":"Tang W.","year":"2011","unstructured":"W. Tang and Matthew Lease. 2011. Semi-supervised consensus labeling for crowdsourcing. In SIGIR 2011 Workshop on Crowdsourcing for Information Retrieval. 36--41. https:\/\/www.ischool.utexas.edu\/."},
{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0306-4573(00)00010-8"},
{"key":"e_1_2_1_29_1","volume-title":"ACM SIGIR Workshop on Crowdsourcing for Information Retrieval. 21--26","author":"Vuurens Jeroen","year":"2011","unstructured":"Jeroen Vuurens, Arjen de Vries, and Carsten Eickhoff. 2011. How much spam can you take? An analysis of crowdsourcing results to increase accuracy. In ACM SIGIR Workshop on Crowdsourcing for Information Retrieval. 21--26."},
{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the 26th Annual Conference on Learning Theory. 1--30","author":"Wang Yining","year":"2013","unstructured":"Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Wei Chen, and Tie-Yan Liu. 2013. A theoretical analysis of NDCG ranking measures. In Proceedings of the 26th Annual Conference on Learning Theory. 1--30. arxiv:1304.6480"},
{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1852102.1852106"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14778\/2168651.2168656"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/290941.291014"}],"container-title":["ACM Transactions on Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3186195","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3186195","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T02:11:27Z","timestamp":1750212687000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3186195"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,4,30]]},"references-count":33,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2018,10,31]]}},"alternative-id":["10.1145\/3186195"],"URL":"https:\/\/doi.org\/10.1145\/3186195","relation":{},"ISSN":["1046-8188","1558-2868"],"issn-type":[{"type":"print","value":"1046-8188"},{"type":"electronic","value":"1558-2868"}],"subject":[],"published":{"date-parts":[[2018,4,30]]},"assertion":[{"value":"2017-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-04-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}