{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T22:57:46Z","timestamp":1767913066350,"version":"3.49.0"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2021,9,8]],"date-time":"2021-09-08T00:00:00Z","timestamp":1631059200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100000266","name":"EPSRC","doi-asserted-by":"crossref","award":["EP\/R018634\/1"],"award-info":[{"award-number":["EP\/R018634\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Inf. Syst."],"published-print":{"date-parts":[[2022,1,31]]},"abstract":"<jats:p>Recommendation systems are often evaluated based on user\u2019s interactions that were collected from an existing, already deployed recommendation system. In this situation, users only provide feedback on the exposed items and they may not leave feedback on other items since they have not been exposed to them by the deployed system. As a result, the collected feedback dataset that is used to evaluate a new model is influenced by the deployed system, as a form of closed loop feedback. In this article, we show that the typical offline evaluation of recommender systems suffers from the so-called Simpson\u2019s paradox. Simpson\u2019s paradox is the name given to a phenomenon observed when a significant trend appears in several different sub-populations of observational data but disappears or is even reversed when these sub-populations are combined together. Our in-depth experiments based on stratified sampling reveal that a very small minority of items that are frequently exposed by the deployed system plays a confounding factor in the offline evaluation of recommendation systems. In addition, we propose a novel evaluation methodology that takes into account the confounder, i.e., the deployed system\u2019s characteristics. Using the relative comparison of many recommendation models as in the typical offline evaluation of recommender systems, and based on the Kendall rank correlation coefficient, we show that our proposed evaluation methodology exhibits statistically significant improvements of 14% and 40% on the examined open loop datasets (Yahoo! and Coat), respectively, in reflecting the true ranking of systems with an open loop (randomised) evaluation in comparison to the standard evaluation.<\/jats:p>","DOI":"10.1145\/3458509","type":"journal-article","created":{"date-parts":[[2021,9,8]],"date-time":"2021-09-08T15:31:23Z","timestamp":1631115083000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":26,"title":["The Simpson\u2019s Paradox in the Offline Evaluation of Recommendation Systems"],"prefix":"10.1145","volume":"40","author":[{"given":"Amir H.","family":"Jadidinejad","sequence":"first","affiliation":[{"name":"University of Glasgow, Glasgow, UK"}]},{"given":"Craig","family":"Macdonald","sequence":"additional","affiliation":[{"name":"University of Glasgow, Glasgow, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4701-3223","authenticated-orcid":false,"given":"Iadh","family":"Ounis","sequence":"additional","affiliation":[{"name":"University of Glasgow, Glasgow, UK"}]}],"member":"320","published-online":{"date-parts":[[2021,9,8]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the 41th International ACM SIGIR Conference on Research and Development in Information Retrieval.","author":"Ai Qingyao","unstructured":"Qingyao Ai , Keping Bi , Cheng Luo , Jiafeng Guo , and W. Bruce Croft . 2018. Unbiased learning to rank with unbiased propensity estimation . In Proceedings of the 41th International ACM SIGIR Conference on Research and Development in Information Retrieval. Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W. Bruce Croft. 2018. Unbiased learning to rank with unbiased propensity estimation. In Proceedings of the 41th International ACM SIGIR Conference on Research and Development in Information Retrieval."},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the 27th ACM International Conference on Information and Knowledge Management.","author":"Ai Qingyao","unstructured":"Qingyao Ai , Jiaxin Mao , Yiqun Liu , and W. Bruce Croft . 2018. Unbiased learning to rank: Theory and practice . In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. Qingyao Ai, Jiaxin Mao, Yiqun Liu, and W. Bruce Croft. 2018. Unbiased learning to rank: Theory and practice. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0085777"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3366423.3380281"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1080\/00273171.2011.568786"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-020-09371-3"},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the 12th ACM Conference on Recommender Systems.","author":"Chaney Allison J. B.","unstructured":"Allison J. B. Chaney , Brandon M. Stewart , and Barbara E. Engelhardt . 2018. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility . In Proceedings of the 12th ACM Conference on Recommender Systems. Allison J. B. Chaney, Brandon M. Stewart, and Barbara E. Engelhardt. 2018. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. In Proceedings of the 12th ACM Conference on Recommender Systems."},{"key":"e_1_2_1_8_1","doi-asserted-by":"crossref","unstructured":"C. R. Charig D. R. Webb S. R. Payne and J. E. Wickham. 1986. Comparison of treatment of renal calculi by open surgery percutaneous nephrolithotomy and extracorporeal shockwave lithotripsy.BMJ 292 6524 (1986) 879\u2013882.  C. R. Charig D. R. Webb S. R. Payne and J. E. Wickham. 1986. Comparison of treatment of renal calculi by open surgery percutaneous nephrolithotomy and extracorporeal shockwave lithotripsy.BMJ 292 6524 (1986) 879\u2013882.","DOI":"10.1136\/bmj.292.6524.879"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1864708.1864721"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3340531.3411962"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3289600.3291027"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2827872"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3038912.3052569"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/963770.963772"},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the 8th IEEE International Conference on Data Mining.","author":"Hu Y.","unstructured":"Y. Hu , Y. Koren , and C. Volinsky . 2008. Collaborative filtering for implicit feedback datasets . In Proceedings of the 8th IEEE International Conference on Data Mining. Y. Hu, Y. Koren, and C. Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th IEEE International Conference on Data Mining."},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the Workshop on Offline Evaluation for Recommender Systems.","author":"Jadidinejad Amir H.","year":"2019","unstructured":"Amir H. Jadidinejad , Craig Macdonald , and Iadh Ounis . 2019 . How sensitive is recommendation systems\u2019 offline evaluation to popularity? . In Proceedings of the Workshop on Offline Evaluation for Recommender Systems. Amir H. Jadidinejad, Craig Macdonald, and Iadh Ounis. 2019. How sensitive is recommendation systems\u2019 offline evaluation to popularity?. In Proceedings of the Workshop on Offline Evaluation for Recommender Systems."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401230"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3306618.3314288"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2914803"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018661.3018699"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.3389\/fpsyg.2013.00513"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1401890.1401944"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2009.263"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the 13th International Conference on Neural Information Processing Systems.","author":"Lee Daniel D.","unstructured":"Daniel D. Lee and H. Sebastian Seung . 2000. Algorithms for non-negative matrix factorization . In Proceedings of the 13th International Conference on Neural Information Processing Systems. Daniel D. Lee and H. Sebastian Seung. 2000. Algorithms for non-negative matrix factorization. In Proceedings of the 13th International Conference on Neural Information Processing Systems."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-012-9209-9"},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the 3rd ACM Conference on Recommender Systems.","author":"Benjamin","unstructured":"Benjamin M. Marlin and Richard S. Zemel. 2009. Collaborative prediction and ranking with non-random missing data . In Proceedings of the 3rd ACM Conference on Recommender Systems. Benjamin M. Marlin and Richard S. Zemel. 2009. Collaborative prediction and ranking with non-random missing data. In Proceedings of the 3rd ACM Conference on Recommender Systems."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401102"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3366423.3380255"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1080\/00031305.2014.876829"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1458082.1458092"},{"key":"e_1_2_1_31_1","doi-asserted-by":"crossref","unstructured":"Filip Radlinski Madhu Kurup and Thorsten Joachims. 2011. Evaluating Search Engine Relevance with Click-Based Metrics.  Filip Radlinski Madhu Kurup and Thorsten Joachims. 2011. Evaluating Search Engine Relevance with Click-Based Metrics.","DOI":"10.1007\/978-3-642-14125-6_16"},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.","author":"Rendle Steffen","year":"2009","unstructured":"Steffen Rendle , Christoph Freudenthaler , Zeno Gantner , and Lars Schmidt-Thieme . 2009 . BPR: Bayesian personalized ranking from implicit feedback . In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence. Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2959100.2959176"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.5555\/2981562.2981720"},{"key":"e_1_2_1_35_1","doi-asserted-by":"crossref","unstructured":"Claude Sammut and Geoffrey I. Webb (Eds.). 2010. Holdout Evaluation Encyclopedia of Machine Learning. 506\u2013507.   Claude Sammut and Geoffrey I. Webb (Eds.). 2010. Holdout Evaluation Encyclopedia of Machine Learning. 506\u2013507.","DOI":"10.1007\/978-0-387-30164-8_369"},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the 33rd International Conference on International Conference on Machine Learning.","author":"Schnabel Tobias","year":"2016","unstructured":"Tobias Schnabel , Adith Swaminathan , Ashudeep Singh , Navin Chandak , and Thorsten Joachims . 2016 . Recommendations as treatments: Debiasing learning and evaluation . In Proceedings of the 33rd International Conference on International Conference on Machine Learning. Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as treatments: Debiasing learning and evaluation. In Proceedings of the 33rd International Conference on International Conference on Machine Learning."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1835804.1835895"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2043932.2043957"},{"key":"e_1_2_1_39_1","volume-title":"Tests for comparing elements of a correlation matrix.Psychological Bulletin 87, 2","author":"Steiger James H","year":"1980","unstructured":"James H Steiger . 1980. Tests for comparing elements of a correlation matrix.Psychological Bulletin 87, 2 ( 1980 ), 245\u2013251. James H Steiger. 1980. Tests for comparing elements of a correlation matrix.Psychological Bulletin 87, 2 (1980), 245\u2013251."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308560.3317303"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.5555\/2789272.2886805"},{"key":"e_1_2_1_42_1","first-page":"139","article-title":"Stratified Sampling. John Wiley and Sons, Ltd","volume":"11","author":"Thompson Steven K.","year":"2012","unstructured":"Steven K. Thompson . 2012 . Stratified Sampling. John Wiley and Sons, Ltd , Chapter 11 , 139 \u2013 156 . Steven K. Thompson. 2012. Stratified Sampling. John Wiley and Sons, Ltd, Chapter 11, 139\u2013156.","journal-title":"Chapter"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240323.3240347"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2911537"},{"key":"e_1_2_1_45_1","volume-title":"Blei","author":"Wang Yixin","year":"2018","unstructured":"Yixin Wang , Dawen Liang , Laurent Charlin , and David M . Blei . 2018 . The deconfounded recommender: A causal inference approach to recommendation. arXiv:1808.06581. Retrieved from https:\/\/arxiv.org\/abs\/1808.06581. Yixin Wang, Dawen Liang, Laurent Charlin, and David M. Blei. 2018. The deconfounded recommender: A causal inference approach to recommendation. arXiv:1808.06581. Retrieved from https:\/\/arxiv.org\/abs\/1808.06581."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-008-5073-7"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240323.3240355"}],"container-title":["ACM Transactions on Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3458509","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3458509","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:17:16Z","timestamp":1750191436000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3458509"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,8]]},"references-count":47,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,1,31]]}},"alternative-id":["10.1145\/3458509"],"URL":"https:\/\/doi.org\/10.1145\/3458509","relation":{},"ISSN":["1046-8188","1558-2868"],"issn-type":[{"value":"1046-8188","type":"print"},{"value":"1558-2868","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,9,8]]},"assertion":[{"value":"2020-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-03-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-09-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}