{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T17:16:30Z","timestamp":1775582190295,"version":"3.50.1"},"reference-count":73,"publisher":"Association for Computing Machinery (ACM)","issue":"4","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Comput. Healthcare"],"published-print":{"date-parts":[[2025,10,31]]},"abstract":"<jats:p>While machine learning algorithms hold promise for personalised medicine, their clinical adoption remains limited, partly due to biases that can compromise the reliability of predictions. In this article, we focus on sample selection bias (SSB), a specific type of bias where the study population is less representative of the target population, leading to biased and potentially harmful decisions. Despite being well-known in the literature, SSB remains scarcely studied in machine learning for healthcare. Moreover, the existing machine learning techniques try to correct the bias mostly by balancing distributions between the study and the target populations, which may result in a loss of predictive performance. To address these problems, our study illustrates the potential risks associated with SSB by examining SSB\u2019s impact on the performance of machine learning algorithms. Most importantly, we propose a new research direction for addressing SSB, based on the target population identification rather than the bias correction. Specifically, we propose two independent networks (T-Net) and a multitasking network (MT-Net) for addressing SSB, where one network\/task identifies the target subpopulation which is representative of the study population and the second makes predictions for the identified subpopulation. 
Our empirical results with synthetic and semi-synthetic datasets highlight that SSB can lead to a large drop in the performance of an algorithm for the target population as compared with the study population, as well as a substantial difference in the performance for the target subpopulations that are representative of the selected and the non-selected patients from the study population. Furthermore, our proposed techniques demonstrate robustness across various settings, including different dataset sizes, event rates and selection rates, outperforming the existing bias correction techniques.<\/jats:p>","DOI":"10.1145\/3761822","type":"journal-article","created":{"date-parts":[[2025,8,18]],"date-time":"2025-08-18T15:58:36Z","timestamp":1755532716000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Sample Selection Bias in Machine Learning for Healthcare"],"prefix":"10.1145","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8195-548X","authenticated-orcid":false,"given":"Vinod Kumar","family":"Chauhan","sequence":"first","affiliation":[{"name":"Department of Engineering Science, University of Oxford, Oxford, United Kingdom of Great Britain and Northern Ireland and Department of Computer and Information Sciences, University of Strathclyde, Glasgow, United Kingdom of Great Britain and Northern Ireland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5595-8468","authenticated-orcid":false,"given":"Lei","family":"Clifton","sequence":"additional","affiliation":[{"name":"Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, United Kingdom of Great Britain and Northern Ireland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5455-9115","authenticated-orcid":false,"given":"Achille","family":"Sala\u00fcn","sequence":"additional","affiliation":[{"name":"Department of Engineering Science, University of Oxford, Oxford, United Kingdom of Great Britain and 
Northern Ireland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6140-3394","authenticated-orcid":false,"given":"Huiqi Yvonne","family":"Lu","sequence":"additional","affiliation":[{"name":"Department of Engineering Science, University of Oxford, Oxford, United Kingdom of Great Britain and Northern Ireland"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-5699-6369","authenticated-orcid":false,"given":"Kim","family":"Branson","sequence":"additional","affiliation":[{"name":"GSK PLC, London, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2868-7794","authenticated-orcid":false,"given":"Patrick","family":"Schwab","sequence":"additional","affiliation":[{"name":"GSK PLC, London, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4699-2263","authenticated-orcid":false,"given":"Gaurav","family":"Nigam","sequence":"additional","affiliation":[{"name":"Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom of Great Britain and Northern Ireland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9848-8555","authenticated-orcid":false,"given":"David A.","family":"Clifton","sequence":"additional","affiliation":[{"name":"Department of Engineering Science, University of Oxford, Oxford, United Kingdom of Great Britain and Northern Ireland and Oxford-Suzhou Institute of Advanced Research (OSCAR), Suzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2025,10,13]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10260-022-00643-4"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1111\/jgs.16022"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.2307\/2095230"},{"key":"e_1_3_3_5_2","first-page":"1","article-title":"Double machine learning for sample selection models","author":"Bia Michela","year":"2023","unstructured":"Michela Bia, Martin Huber, and Luk\u00e1\u0161 Laff\u00e9rs. 2023. Double machine learning for sample selection models. 
Journal of Business & Economic Statistics 42 (2023), 1\u201312.","journal-title":"Journal of Business & Economic Statistics"},{"key":"e_1_3_3_6_2","first-page":"223","article-title":"Neural networks for pattern recognition","volume":"2","author":"Bishop C. M.","year":"1995","unstructured":"C. M. Bishop. 1995. Neural networks for pattern recognition. Clarendon Press 2 (1995), 223\u2013228.","journal-title":"Clarendon Press"},{"key":"e_1_3_3_7_2","volume-title":"Basic Epidemiology","author":"Bonita Ruth","year":"2006","unstructured":"Ruth Bonita, Robert Beaglehole, and Tord Kjellstr\u00f6m. 2006. Basic Epidemiology. World Health Organization."},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.2478\/jos-2021-0033"},{"key":"e_1_3_3_9_2","first-page":"2022","article-title":"Addressing selection bias in the UK biobank neurological imaging cohort","volume":"2022","author":"Bradley Valerie","year":"2022","unstructured":"Valerie Bradley and Thomas E. Nichols. 2022. Addressing selection bias in the UK biobank neurological imaging cohort. MedRxiv 2022 (2022), 2022\u20132001.","journal-title":"MedRxiv"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.15195\/v2.a17"},{"key":"e_1_3_3_11_2","volume-title":"Addressing Sample Selection Bias for Machine Learning Methods","author":"Brewer Dylan","year":"2021","unstructured":"Dylan Brewer and Alyssa Carlson. 2021. Addressing Sample Selection Bias for Machine Learning Methods. Technical Report. Department of Economics, University of Missouri."},{"key":"e_1_3_3_12_2","first-page":"837","volume-title":"Proceedings of the 26th International Conference on Artificial Intelligence and Statistics","volume":"206","author":"Chauhan Vinod Kumar","year":"2023","unstructured":"Vinod Kumar Chauhan, Soheila Molaei, Marzia Hoque Tania, Anshul Thakur, Tingting Zhu, and David A. Clifton. 2023. Adversarial de-confounding in individualised treatment effects estimation. 
In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, Vol. 206. PMLR, 837\u2013849."},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1186\/s12911-024-02514-2"},{"key":"e_1_3_3_14_2","first-page":"3529","volume-title":"Proceedings of the 27th International Conference on Artificial Intelligence and Statistics","volume":"238","author":"Chauhan Vinod Kumar","year":"2024","unstructured":"Vinod Kumar Chauhan, Jiandong Zhou, Ghadeer Ghosheh, Soheila Molaei, and David A. Clifton. 2024. Dynamic inter-treatment information sharing for individualized treatment effects estimation. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics, Vol. 238. Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li (Eds.). PMLR, 3529\u20133537."},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10462-024-10862-8"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.3389\/fspor.2023.1236870"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-45468-4"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-87987-9_8"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1111\/nep.13913"},{"key":"e_1_3_3_20_2","unstructured":"Antoine de Mathelin Fran\u00e7ois Deheeger Guillaume Richard Mathilde Mougeot and Nicolas Vayatis. 2021. ADAPT: Awesome domain adaptation python toolbox. arXiv:2107.03049. Retrieved from https:\/\/arxiv.org\/abs\/2107.03049"},{"key":"e_1_3_3_21_2","volume-title":"Selection Bias Identification and Mitigation with No Ground Truth Information","author":"Dost Katharina","year":"2022","unstructured":"Katharina Dost. 2022. Selection Bias Identification and Mitigation with No Ground Truth Information. Ph.D. Dissertation. 
ResearchSpace@Auckland."},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/BigData55660.2022.10021107"},{"key":"e_1_3_3_23_2","first-page":"973","volume-title":"International Joint Conference on Artificial Intelligence","author":"Elkan Charles","year":"2001","unstructured":"Charles Elkan. 2001. The foundations of cost-sensitive learning. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, Vol. 1. Lawrence Erlbaum Associates Ltd, 973\u2013978."},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2024.104631"},{"key":"e_1_3_3_25_2","doi-asserted-by":"publisher","DOI":"10.5555\/2946645.2946704"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41746-024-01127-3"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1097\/00001648-199901000-00008"},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.1086\/260267"},{"key":"e_1_3_3_29_2","first-page":"1321","volume-title":"International Conference on Machine Learning","author":"Guo Chuan","year":"2017","unstructured":"Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 1321\u20131330."},{"issue":"2","key":"e_1_3_3_30_2","first-page":"313","article-title":"Varieties of selection bias","volume":"80","author":"Heckman James","year":"1990","unstructured":"James Heckman. 1990. Varieties of selection bias. The American Economic Review 80, 2 (1990), 313\u2013318.","journal-title":"The American Economic Review"},{"key":"e_1_3_3_31_2","first-page":"475","article-title":"The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models","volume":"5","author":"Heckman James J.","year":"1976","unstructured":"James J. Heckman. 1976. 
The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. In Annals of Economic and Social Measurement, Vol. 5(4). Sanford V. Berg (Ed.). NBER, 475\u2013492.","journal-title":"Annals of Economic and Social Measurement"},{"key":"e_1_3_3_32_2","doi-asserted-by":"publisher","DOI":"10.2307\/1912352"},{"key":"e_1_3_3_33_2","first-page":"1","article-title":"Machine learning with a reject option: A survey","author":"Hendrickx Kilian","year":"2024","unstructured":"Kilian Hendrickx, Lorenzo Perini, Dries Van der Plas, Wannes Meert, and Jesse Davis. 2024. Machine learning with a reject option: A survey. Machine Learning 113, 5 (2024), 1\u201338.","journal-title":"Machine Learning"},{"key":"e_1_3_3_34_2","volume-title":"Causal Inference: What If","author":"Hernan M. A.","year":"2023","unstructured":"M. A. Hernan and J. M. Robins. 2023. Causal Inference: What If. Taylor & Francis. Retrieved from https:\/\/books.google.co.uk\/books?id=FPkN0AEACAAJ"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1097\/01.ede.0000135174.63482.43"},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2023.104532"},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41568-018-0016-5"},{"key":"e_1_3_3_38_2","article-title":"Correcting sample selection bias by unlabeled data","volume":"19","author":"Huang Jiayuan","year":"2006","unstructured":"Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Sch\u00f6lkopf, and Alex Smola. 2006. Correcting sample selection bias by unlabeled data. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 
19.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_39_2","doi-asserted-by":"publisher","DOI":"10.1057\/palgrave.jors.2601578"},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-021-03819-2"},{"key":"e_1_3_3_41_2","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https:\/\/arxiv.org\/abs\/1412.6980"},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41746-020-00367-3"},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2945942"},{"key":"e_1_3_3_44_2","unstructured":"Ritoban Kundu Xu Shi Jean Morrison and Bhramar Mukherjee. 2023. A framework for understanding selection bias in real-world healthcare data. arXiv:2304.04652. Retrieved from https:\/\/arxiv.org\/abs\/2304.04652"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-023-36214-0"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1002\/sam.11559"},{"key":"e_1_3_3_47_2","first-page":"258","volume-title":"Proceedings of the Health and Wellbeing e-Networks for All (MEDINFO \u201919)","author":"Mei Jing","year":"2019","unstructured":"Jing Mei and Eryu Xia. 2019. Knowledge learning symbiosis for developing risk prediction models from regional EHR repositories. In Proceedings of the Health and Wellbeing e-Networks for All (MEDINFO \u201919). IOS Press, 258\u2013262."},{"key":"e_1_3_3_48_2","first-page":"15682","article-title":"Revisiting the calibration of modern neural networks","volume":"34","author":"Minderer Matthias","year":"2021","unstructured":"Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. 2021. Revisiting the calibration of modern neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 
34, 15682\u201315694.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1136\/bmj.n2281"},{"issue":"22","key":"e_1_3_3_50_2","first-page":"1690","article-title":"Independent external validation of the QRISK3 cardiovascular disease risk prediction model using UK biobank","volume":"109","author":"Parsons Ruth E.","year":"2023","unstructured":"Ruth E. Parsons, Xiaonan Liu, Jennifer A. Collister, David A. Clifton, Benjamin J. Cairns, and Lei Clifton. 2023. Independent external validation of the QRISK3 cardiovascular disease risk prediction model using UK biobank. Heart (British Cardiac Society) 109, 22 (2023), 1690\u20131697.","journal-title":"Heart (British Cardiac Society)"},{"key":"e_1_3_3_51_2","article-title":"Pytorch: An imperative style, high-performance deep learning library","volume":"32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 32.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_52_2","volume-title":"The Book of Why: The New Science of Cause and Effect","author":"Pearl Judea","year":"2018","unstructured":"Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. 
Basic books."},{"key":"e_1_3_3_53_2","doi-asserted-by":"publisher","DOI":"10.1093\/acref\/9780199976720.001.0001"},{"key":"e_1_3_3_54_2","doi-asserted-by":"publisher","DOI":"10.1056\/NEJMra1814259"},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.1007\/s40471-023-00325-z"},{"key":"e_1_3_3_56_2","first-page":"2023","article-title":"Interpretable machine learning in kidney offering: Multiple outcome prediction for accepted offers","author":"Salaun Achille","year":"2023","unstructured":"Achille Salaun, Simon Knight, Laura Ruth Wingfield, and Tingting Zhu. 2023. Interpretable machine learning in kidney offering: Multiple outcome prediction for accepted offers. MedRxiv (2023), 2023\u20132008. Retrieved from https:\/\/www.medrxiv.org\/content\/10.1101\/2023.08.24.23294535v2","journal-title":"MedRxiv"},{"key":"e_1_3_3_57_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0378-3758(00)00115-4"},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1007\/s40471-020-00241-6"},{"issue":"1","key":"e_1_3_3_59_2","first-page":"112","article-title":"MissForest \u2013 Non-parametric missing value imputation for mixed-type data","volume":"28","author":"Stekhoven Daniel J.","year":"2012","unstructured":"Daniel J. Stekhoven and Peter B\u00fchlmann. 2012. MissForest \u2013 Non-parametric missing value imputation for mixed-type data. Bioinformatics (Oxford, England) 28, 1 (2012), 112\u2013118.","journal-title":"Bioinformatics (Oxford, England)"},{"key":"e_1_3_3_60_2","article-title":"Direct importance estimation with model selection and its application to covariate shift adaptation","volume":"20","author":"Sugiyama Masashi","year":"2007","unstructured":"Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul Buenau, and Motoaki Kawanabe. 2007. Direct importance estimation with model selection and its application to covariate shift adaptation. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 
20.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_61_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0140-6736(12)61179-9"},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0188983"},{"key":"e_1_3_3_63_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41591-018-0300-7"},{"key":"e_1_3_3_64_2","doi-asserted-by":"publisher","DOI":"10.1111\/bioe.13281"},{"key":"e_1_3_3_65_2","doi-asserted-by":"publisher","DOI":"10.2307\/146317"},{"key":"e_1_3_3_66_2","unstructured":"Robin Vogel Mastane Achab St\u00e9phan Cl\u00e9men\u00e7on and Charles Tillier. 2020. Weighted empirical risk minimization: Sample selection bias correction based on importance sampling. arXiv:2002.05145. Retrieved from https:\/\/arxiv.org\/abs\/2002.05145"},{"key":"e_1_3_3_67_2","doi-asserted-by":"publisher","DOI":"10.1038\/s43856-021-00028-w"},{"key":"e_1_3_3_68_2","doi-asserted-by":"publisher","DOI":"10.1111\/risa.13575"},{"key":"e_1_3_3_69_2","doi-asserted-by":"publisher","DOI":"10.1097\/EDE.0b013e318230e861"},{"key":"e_1_3_3_70_2","doi-asserted-by":"publisher","DOI":"10.7326\/M18-1376"},{"key":"e_1_3_3_71_2","doi-asserted-by":"publisher","DOI":"10.1148\/rg.2020200040"},{"key":"e_1_3_3_72_2","doi-asserted-by":"publisher","DOI":"10.1145\/1015330.1015425"},{"key":"e_1_3_3_73_2","doi-asserted-by":"publisher","DOI":"10.1111\/rssb.12136"},{"key":"e_1_3_3_74_2","doi-asserted-by":"publisher","DOI":"10.36227\/techrxiv.175554759.96327720\/v1"}],"container-title":["ACM Transactions on Computing for 
Healthcare"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3761822","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,18]],"date-time":"2025-10-18T02:25:09Z","timestamp":1760754309000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3761822"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,13]]},"references-count":73,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,10,31]]}},"alternative-id":["10.1145\/3761822"],"URL":"https:\/\/doi.org\/10.1145\/3761822","relation":{},"ISSN":["2691-1957","2637-8051"],"issn-type":[{"value":"2691-1957","type":"print"},{"value":"2637-8051","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,13]]},"assertion":[{"value":"2024-05-13","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-23","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}