{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,15]],"date-time":"2026-04-15T22:45:35Z","timestamp":1776293135395,"version":"3.50.1"},"reference-count":85,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,5,29]],"date-time":"2025-05-29T00:00:00Z","timestamp":1748476800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,5,29]],"date-time":"2025-05-29T00:00:00Z","timestamp":1748476800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Forschungszentrum J\u00fclich GmbH"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Machine learning (ML) provides powerful tools for predictive modeling. ML\u2019s popularity stems from the promise of sample-level prediction with applications across a variety of fields from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage typically resulting in overoptimistic performance estimates and failure to generalize to new data. This can have severe negative financial and societal implications. Our aim is to expand understanding associated with causes leading to leakage when designing, implementing, and evaluating ML pipelines. 
Illustrated by concrete examples, we provide a comprehensive overview and discussion of various types of leakage that may arise in ML pipelines.<\/jats:p>","DOI":"10.1186\/s40537-025-01193-8","type":"journal-article","created":{"date-parts":[[2025,5,29]],"date-time":"2025-05-29T13:23:34Z","timestamp":1748525014000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":30,"title":["Overview of leakage scenarios in supervised machine learning"],"prefix":"10.1186","volume":"12","author":[{"given":"L.","family":"Sasse","sequence":"first","affiliation":[]},{"given":"E.","family":"Nicolaisen-Sobesky","sequence":"additional","affiliation":[]},{"given":"J.","family":"Dukart","sequence":"additional","affiliation":[]},{"given":"S. B.","family":"Eickhoff","sequence":"additional","affiliation":[]},{"given":"M.","family":"G\u00f6tz","sequence":"additional","affiliation":[]},{"given":"S.","family":"Hamdan","sequence":"additional","affiliation":[]},{"given":"V.","family":"Komeyer","sequence":"additional","affiliation":[]},{"given":"A.","family":"Kulkarni","sequence":"additional","affiliation":[]},{"given":"J. M.","family":"Lahnakoski","sequence":"additional","affiliation":[]},{"given":"B. C.","family":"Love","sequence":"additional","affiliation":[]},{"given":"F.","family":"Raimondo","sequence":"additional","affiliation":[]},{"given":"Kaustubh R.","family":"Patil","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,5,29]]},"reference":[{"issue":"5","key":"1193_CR1","doi-asserted-by":"publisher","first-page":"675","DOI":"10.1016\/j.beth.2020.05.002","volume":"51","author":"T Jiang","year":"2020","unstructured":"Jiang T, Gradus JL, Rosellini AJ. Supervised machine learning: A brief primer. Behav Ther. 2020;51(5):675\u201387.","journal-title":"Behav Ther"},{"key":"1193_CR2","unstructured":"Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S et al. 
Generative Adversarial Nets Adv Neural Inform Process Syst. 2014."},{"key":"1193_CR3","unstructured":"Sutton RS, Barto AG. Reinforcement Learning, second edition: An Introduction (Adaptive Computation and Machine Learning series). second edition. Cambridge, Massachusetts: Bradford Books; 2018."},{"issue":"10","key":"1193_CR4","doi-asserted-by":"publisher","first-page":"1104","DOI":"10.1016\/j.compbiomed.2005.09.002","volume":"36","author":"H Bhaskar","year":"2006","unstructured":"Bhaskar H, Hoyle DC, Singh S. Machine learning in bioinformatics: a brief survey and recommendations for practitioners. Comput Biol Med. 2006;36(10):1104\u201325.","journal-title":"Comput Biol Med"},{"issue":"7","key":"1193_CR5","doi-asserted-by":"publisher","first-page":"073001","DOI":"10.1088\/1748-9326\/ab1b7d","volume":"14","author":"AY Sun","year":"2019","unstructured":"Sun AY, Scanlon BR. How can big data and machine learning benefit environment and water management: a survey of methods, applications, and future directions. Environ Res Lett. 2019;14(7):073001.","journal-title":"Environ Res Lett"},{"issue":"6","key":"1193_CR6","doi-asserted-by":"publisher","first-page":"3981","DOI":"10.1007\/s11831-022-09733-8","volume":"29","author":"S Swain","year":"2022","unstructured":"Swain S, Bhushan B, Dhiman G, Viriyasitavat W. Appositeness of optimized and reliable machine learning for healthcare: A survey. Arch Comput Methods Eng. 2022;29(6):3981\u20134003.","journal-title":"Arch Comput Methods Eng"},{"issue":"1","key":"1193_CR7","doi-asserted-by":"publisher","first-page":"48","DOI":"10.1038\/s41746-022-00592-y","volume":"5","author":"G Varoquaux","year":"2022","unstructured":"Varoquaux G, Cheplygina V. Machine learning for medical imaging: methodological failures and recommendations for the future. Npj Digit Med. 
2022;5(1):48.","journal-title":"Npj Digit Med"},{"issue":"3","key":"1193_CR8","doi-asserted-by":"publisher","first-page":"145","DOI":"10.1038\/s42254-022-00431-9","volume":"4","author":"MR Douglas","year":"2022","unstructured":"Douglas MR. Machine learning as a tool in theoretical science. Nat Rev Phys. 2022;4(3):145\u20136.","journal-title":"Nat Rev Phys"},{"issue":"1","key":"1193_CR9","doi-asserted-by":"publisher","first-page":"86","DOI":"10.1093\/bib\/bbk007","volume":"7","author":"P Larra\u00f1aga","year":"2006","unstructured":"Larra\u00f1aga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, et al. Machine learning in bioinformatics. Brief Bioinf. 2006;7(1):86\u2013112.","journal-title":"Brief Bioinf"},{"issue":"12","key":"1193_CR10","doi-asserted-by":"publisher","first-page":"e677","DOI":"10.1016\/S2589-7500(20)30200-4","volume":"2","author":"J Wilkinson","year":"2020","unstructured":"Wilkinson J, Arnold KF, Murray EJ, van Smeden M, Carr K, Sippy R, et al. Time to reality check the promises of machine learning-powered precision medicine. Lancet Digit Health. 2020;2(12):e677\u201380.","journal-title":"Lancet Digit Health"},{"issue":"1","key":"1193_CR11","doi-asserted-by":"publisher","first-page":"18","DOI":"10.1016\/j.biopsych.2022.07.025","volume":"93","author":"J Chen","year":"2023","unstructured":"Chen J, Patil KR, Yeo BTT, Eickhoff SB. Leveraging machine learning for gaining Neurobiological and nosological insights in psychiatric research. Biol Psychiatry. 2023;93(1):18\u201328.","journal-title":"Biol Psychiatry"},{"key":"1193_CR12","doi-asserted-by":"crossref","unstructured":"Qiu J, Wu Q, Ding G, Xu Y, Feng S. A survey of machine learning for big data processing. EURASIP J Adv Signal Process. 2016;2016(1).","DOI":"10.1186\/s13634-016-0355-x"},{"key":"1193_CR13","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. 
Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825\u201330.","journal-title":"J Mach Learn Res"},{"key":"1193_CR14","unstructured":"Kuhn M, Wickham H. Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles [Internet]. 2020 [cited 2023 Aug 25]. Available from: https:\/\/www.tidymodels.org"},{"key":"1193_CR15","doi-asserted-by":"publisher","first-page":"102311","DOI":"10.1016\/j.media.2021.102311","volume":"76","author":"S Chen","year":"2022","unstructured":"Chen S, Sedghi Gamechi Z, Dubost F, van Tulder G, de Bruijne M. An end-to-end approach to segmentation in medical images with CNN and posterior-CRF. Med Image Anal. 2022;76:102311.","journal-title":"Med Image Anal"},{"issue":"9","key":"1193_CR16","doi-asserted-by":"publisher","first-page":"100804","DOI":"10.1016\/j.patter.2023.100804","volume":"4","author":"S Kapoor","year":"2023","unstructured":"Kapoor S, Narayanan A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns (N Y). 2023;4(9):100804.","journal-title":"Patterns (N Y)"},{"issue":"2","key":"1193_CR17","doi-asserted-by":"publisher","first-page":"129","DOI":"10.1080\/00031305.2016.1154108","volume":"70","author":"RL Wasserstein","year":"2016","unstructured":"Wasserstein RL, Lazar NA. The ASA statement on p -Values: context, process, and purpose. Am Stat. 2016;70(2):129\u201333.","journal-title":"Am Stat"},{"issue":"8","key":"1193_CR18","doi-asserted-by":"publisher","first-page":"e124","DOI":"10.1371\/journal.pmed.0020124","volume":"2","author":"JPA Ioannidis","year":"2005","unstructured":"Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2(8):e124.","journal-title":"PLoS Med"},{"key":"1193_CR19","doi-asserted-by":"crossref","unstructured":"Gundersen OE, Kjensmo S. State of the art: reproducibility in artificial intelligence. AAAI. 
2018;32(1).","DOI":"10.1609\/aaai.v32i1.11503"},{"key":"1193_CR20","doi-asserted-by":"crossref","unstructured":"Verstynen T, Kording KP. Overfitting to \u2018predict\u2019 suicidal ideation. Nat Hum Behav. 2023.","DOI":"10.1038\/s41562-023-01560-6"},{"issue":"7767","key":"1193_CR21","doi-asserted-by":"publisher","first-page":"27","DOI":"10.1038\/d41586-019-02307-y","volume":"572","author":"P Riley","year":"2019","unstructured":"Riley P. Three pitfalls to avoid in machine learning. Nature. 2019;572(7767):27\u20139.","journal-title":"Nature"},{"issue":"3","key":"1193_CR22","doi-asserted-by":"publisher","first-page":"169","DOI":"10.1038\/s41576-021-00434-9","volume":"23","author":"S Whalen","year":"2022","unstructured":"Whalen S, Schreiber J, Noble WS, Pollard KS. Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet. 2022;23(3):169\u201381.","journal-title":"Nat Rev Genet"},{"issue":"11","key":"1193_CR23","doi-asserted-by":"publisher","first-page":"e0224365","DOI":"10.1371\/journal.pone.0224365","volume":"14","author":"A Vabalas","year":"2019","unstructured":"Vabalas A, Gowen E, Poliakoff E, Casson AJ. Machine learning algorithm validation with a limited sample size. PLoS ONE. 2019;14(11):e0224365.","journal-title":"PLoS ONE"},{"key":"1193_CR24","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1186\/1471-2105-7-91","volume":"7","author":"S Varma","year":"2006","unstructured":"Varma S, Simon R. Bias in error Estimation when using cross-validation for model selection. BMC Bioinformatics. 2006;7:91.","journal-title":"BMC Bioinformatics"},{"issue":"4","key":"1193_CR25","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1109\/MCI.2018.2866730","volume":"13","author":"MS Santos","year":"2018","unstructured":"Santos MS, Soares JP, Abreu PH, Araujo H, Santos J. Cross-Validation for imbalanced datasets: avoiding overoptimistic and overfitting approaches [Research Frontier]. IEEE Comput Intell Mag. 
2018;13(4):59\u201376.","journal-title":"IEEE Comput Intell Mag"},{"issue":"1","key":"1193_CR26","doi-asserted-by":"publisher","first-page":"153","DOI":"10.1038\/s41746-021-00521-5","volume":"4","author":"V Berisha","year":"2021","unstructured":"Berisha V, Krantsevich C, Hahn PR, Hahn S, Dasarathy G, Turaga P, et al. Digital medicine and the curse of dimensionality. Npj Digit Med. 2021;4(1):153.","journal-title":"Npj Digit Med"},{"key":"1193_CR27","unstructured":"Lones MA. How to avoid machine learning pitfalls: a guide for academic researchers. ArXiv. 2021."},{"issue":"4","key":"1193_CR28","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2382577.2382579","volume":"6","author":"S Kaufman","year":"2012","unstructured":"Kaufman S, Rosset S, Perlich C, Stitelman O. Leakage in data mining: formulation, detection, and avoidance. ACM Trans Knowl Discov Data. 2012;6(4):1\u201321.","journal-title":"ACM Trans Knowl Discov Data"},{"issue":"1","key":"1193_CR29","doi-asserted-by":"publisher","first-page":"1829","DOI":"10.1038\/s41467-024-46150-w","volume":"15","author":"M Rosenblatt","year":"2024","unstructured":"Rosenblatt M, Tejavibulya L, Jiang R, Noble S, Scheinost D. Data leakage inflates prediction performance in connectome-based machine learning models. Nat Commun. 2024;15(1):1829.","journal-title":"Nat Commun"},{"issue":"8","key":"1193_CR30","doi-asserted-by":"publisher","first-page":"1444","DOI":"10.1038\/s41592-024-02362-y","volume":"21","author":"J Bernett","year":"2024","unstructured":"Bernett J, Blumenthal DB, Grimm DG, Haselbeck F, Joeres R, Kalinina OV, et al. Guiding questions to avoid data leakage in biological machine learning applications. Nat Methods. 2024;21(8):1444\u201353.","journal-title":"Nat Methods"},{"key":"1193_CR31","doi-asserted-by":"publisher","first-page":"911","DOI":"10.1038\/s41562-017-0234-y","volume":"1","author":"MA Just","year":"2017","unstructured":"Just MA, Pan L, Cherkassky VL, McMakin DL, Cha C, Nock MK, et al. 
Machine learning of neural representations of suicide and emotion concepts identifies suicidal youth. Nat Hum Behav. 2017;1:911\u20139.","journal-title":"Nat Hum Behav"},{"issue":"4","key":"1193_CR32","doi-asserted-by":"publisher","first-page":"431","DOI":"10.1038\/s41562-021-01085-w","volume":"5","author":"J Dukart","year":"2021","unstructured":"Dukart J, Weis S, Genon S, Eickhoff SB. Towards increasing the clinical applicability of machine learning biomarkers in psychiatry. Nat Hum Behav. 2021;5(4):431\u20132.","journal-title":"Nat Hum Behav"},{"key":"1193_CR33","unstructured":"Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence. 1995."},{"issue":"0","key":"1193_CR34","first-page":"40","volume":"4","author":"S Arlot","year":"2010","unstructured":"Arlot S, Celisse A. A survey of cross-validation procedures for model selection. Stat Surv. 2010;4(0):40\u201379.","journal-title":"Stat Surv"},{"issue":"350","key":"1193_CR35","doi-asserted-by":"publisher","first-page":"320","DOI":"10.1080\/01621459.1975.10479865","volume":"70","author":"S Geisser","year":"1975","unstructured":"Geisser S. The predictive sample reuse method with applications. J Am Stat Assoc. 1975;70(350):320\u20138.","journal-title":"J Am Stat Assoc"},{"issue":"546","key":"1193_CR36","doi-asserted-by":"publisher","first-page":"1434","DOI":"10.1080\/01621459.2023.2197686","volume":"119","author":"S Bates","year":"2023","unstructured":"Bates S, Trevor H, Tibshirani R. Cross-Validation: what does it estimate and how well does it do it? J Am Stat Assoc. 2023;119(546):1434\u201345.","journal-title":"J Am Stat Assoc"},{"issue":"1","key":"1193_CR37","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1186\/1758-2946-6-10","volume":"6","author":"D Krstajic","year":"2014","unstructured":"Krstajic D, Buturovic LJ, Leahy DE, Thomas S. 
Cross-validation pitfalls when selecting and assessing regression and classification models. J Cheminform. 2014;6(1):10.","journal-title":"J Cheminform"},{"key":"1193_CR38","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-84858-7","volume-title":"The elements of statistical learning","author":"T Hastie","year":"2009","unstructured":"Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. 2nd ed. New York, NY: Springer New York; 2009.","edition":"2"},{"key":"1193_CR39","doi-asserted-by":"crossref","unstructured":"Yang L, Shami A. On hyperparameter optimization of machine learning algorithms: theory and practice. Neurocomputing. 2020.","DOI":"10.1016\/j.neucom.2020.07.061"},{"key":"1193_CR40","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4614-7138-7","volume-title":"An introduction to statistical learning","author":"G James","year":"2013","unstructured":"James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning. New York, NY: Springer New York; 2013."},{"key":"1193_CR41","volume-title":"Pattern recognition and machine learning","author":"CM Bishop","year":"2006","unstructured":"Bishop CM. Pattern recognition and machine learning. New York: Springer; 2006."},{"issue":"Pt B","key":"1193_CR42","doi-asserted-by":"publisher","first-page":"166","DOI":"10.1016\/j.neuroimage.2016.10.038","volume":"145","author":"G Varoquaux","year":"2017","unstructured":"Varoquaux G, Raamana PR, Engemann DA, Hoyos-Idrobo A, Schwartz Y, Thirion B. Assessing and tuning brain decoders: Cross-validation, caveats, and guidelines. NeuroImage. 2017;145(Pt B):166\u201379.","journal-title":"NeuroImage"},{"key":"1193_CR43","doi-asserted-by":"crossref","unstructured":"Martinez-Plumed F, Contreras-Ochando L, Ferri C, Hernandez-Orallo J, Kull M, Lachiche N, et al. IEEE Trans Knowl Data Eng. 2021;33(8):3048\u201361. 
CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories.","DOI":"10.1109\/TKDE.2019.2962680"},{"key":"1193_CR44","unstructured":"Wirth R. CRISP-DM: Towards a standard process model for data mining. Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining; 2000."},{"key":"1193_CR45","doi-asserted-by":"crossref","unstructured":"Chakraborty J, Majumder S, Menzies T. Bias in machine learning software: why? how? what to do? Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York, NY, USA: ACM; 2021. pp. 429\u201340.","DOI":"10.1145\/3468264.3468537"},{"key":"1193_CR46","doi-asserted-by":"crossref","unstructured":"Liang W, Tadesse GA, Ho D, Li F-F, Zaharia M, Zhang C et al. Advances, challenges and opportunities in creating data for trustworthy AI. Nat Mach Intell. 2022.","DOI":"10.1038\/s42256-022-00516-1"},{"issue":"3","key":"1193_CR47","doi-asserted-by":"publisher","first-page":"e1008671","DOI":"10.1371\/journal.pcbi.1008671","volume":"17","author":"J Dem\u0161ar","year":"2021","unstructured":"Dem\u0161ar J, Zupan B. Hands-on training about overfitting. PLoS Comput Biol. 2021;17(3):e1008671.","journal-title":"PLoS Comput Biol"},{"key":"1193_CR48","doi-asserted-by":"publisher","first-page":"62","DOI":"10.1016\/j.neuroimage.2013.05.041","volume":"80","author":"DC Van Essen","year":"2013","unstructured":"Van Essen DC, Smith SM, Barch DM, Behrens TEJ, Yacoub E, Ugurbil K, et al. The WU-Minn human connectome project: an overview. NeuroImage. 2013;80:62\u201379.","journal-title":"NeuroImage"},{"issue":"11","key":"1193_CR49","doi-asserted-by":"publisher","first-page":"1664","DOI":"10.1038\/nn.4135","volume":"18","author":"ES Finn","year":"2015","unstructured":"Finn ES, Shen X, Scheinost D, Rosenberg MD, Huang J, Chun MM, et al. 
Functional connectome fingerprinting: identifying individuals using patterns of brain connectivity. Nat Neurosci. 2015;18(11):1664\u201371.","journal-title":"Nat Neurosci"},{"key":"1193_CR50","doi-asserted-by":"publisher","first-page":"116276","DOI":"10.1016\/j.neuroimage.2019.116276","volume":"206","author":"T He","year":"2020","unstructured":"He T, Kong R, Holmes AJ, Nguyen M, Sabuncu MR, Eickhoff SB, et al. Deep neural networks and kernel regression achieve comparable accuracies for functional connectivity prediction of behavior and demographics. NeuroImage. 2020;206:116276.","journal-title":"NeuroImage"},{"issue":"1","key":"1193_CR51","doi-asserted-by":"publisher","first-page":"100801","DOI":"10.1016\/j.isci.2019.100801","volume":"23","author":"DV Demeter","year":"2020","unstructured":"Demeter DV, Engelhardt LE, Mallett R, Gordon EM, Nugiel T, Harden KP, et al. Functional connectivity fingerprints at rest are similar across youths and adults and vary with genetic similarity. iScience. 2020;23(1):100801.","journal-title":"iScience"},{"issue":"1","key":"1193_CR52","doi-asserted-by":"publisher","first-page":"282","DOI":"10.1186\/s13059-020-02177-y","volume":"21","author":"J Schreiber","year":"2020","unstructured":"Schreiber J, Singh R, Bilmes J, Noble WS. A pitfall for machine learning methods aiming to predict across cell types. Genome Biol. 2020;21(1):282.","journal-title":"Genome Biol"},{"issue":"1","key":"1193_CR53","doi-asserted-by":"publisher","first-page":"22544","DOI":"10.1038\/s41598-021-01681-w","volume":"11","author":"E Yagis","year":"2021","unstructured":"Yagis E, Atnafu SW, Garc\u00eda Seco de Herrera A, Marzi C, Scheda R, Giannelli M, et al. Effect of data leakage in brain MRI classification using 2D convolutional neural networks. Sci Rep. 
2021;11(1):22544.","journal-title":"Sci Rep"},{"issue":"11","key":"1193_CR54","doi-asserted-by":"publisher","first-page":"1997","DOI":"10.1007\/s10994-020-05910-7","volume":"109","author":"V Cerqueira","year":"2020","unstructured":"Cerqueira V, Torgo L, Mozeti\u010d I. Evaluating time series forecasting models: an empirical study on performance Estimation methods. Mach Learn. 2020;109(11):1997\u20132028.","journal-title":"Mach Learn"},{"issue":"3","key":"1193_CR55","doi-asserted-by":"publisher","first-page":"443","DOI":"10.1016\/j.ijforecast.2006.01.001","volume":"22","author":"JG De Gooijer","year":"2006","unstructured":"De Gooijer JG, Hyndman RJ. 25 Years of time series forecasting. Int J Forecast. 2006;22(3):443\u201373.","journal-title":"Int J Forecast"},{"issue":"6","key":"1193_CR56","doi-asserted-by":"publisher","first-page":"2827","DOI":"10.1002\/mp.14678","volume":"48","author":"RK Samala","year":"2021","unstructured":"Samala RK, Chan H-P, Hadjiiski L, Helvie MA. Risks of feature leakage and sample size dependencies in deep feature extraction for breast mass classification. Med Phys. 2021;48(6):2827\u201337.","journal-title":"Med Phys"},{"issue":"1","key":"1193_CR57","doi-asserted-by":"publisher","first-page":"140","DOI":"10.1186\/s40537-021-00516-9","volume":"8","author":"T Emmanuel","year":"2021","unstructured":"Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B, Tabona O. A survey on missing data in machine learning. J Big Data. 2021;8(1):140.","journal-title":"J Big Data"},{"issue":"5","key":"1193_CR58","doi-asserted-by":"publisher","first-page":"534","DOI":"10.1001\/jamapsychiatry.2019.3671","volume":"77","author":"RA Poldrack","year":"2020","unstructured":"Poldrack RA, Huckins G, Varoquaux G. Establishment of best practices for evidence for prediction: A review. JAMA Psychiatry. 2020;77(5):534\u201340.","journal-title":"JAMA Psychiatry"},{"key":"1193_CR59","unstructured":"Vanwinckelen G, Blockeel H. 
Look before you leap: some insights into learner evaluation with cross-validation. Statistically Sound Data Min. 2015;3\u201320."},{"key":"1193_CR60","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1038\/s41746-019-0105-1","volume":"2","author":"MA Badgeley","year":"2019","unstructured":"Badgeley MA, Zech JR, Oakden-Rayner L, Glicksberg BS, Liu M, Gale W, et al. Deep learning predicts hip fracture using confounding patient and healthcare variables. Npj Digit Med. 2019;2:31.","journal-title":"Npj Digit Med"},{"key":"1193_CR61","doi-asserted-by":"crossref","unstructured":"Dagaev N, Roads BD, Luo X, Barry DN, Patil KR, Love BC. A Too-Good-to-be-True prior to reduce shortcut reliance. Pattern Recognit Lett. 2022.","DOI":"10.1016\/j.patrec.2022.12.010"},{"key":"1193_CR62","doi-asserted-by":"publisher","first-page":"120125","DOI":"10.1016\/j.neuroimage.2023.120125","volume":"274","author":"F Hu","year":"2023","unstructured":"Hu F, Chen AA, Horng H, Bashyam V, Davatzikos C, Alexander-Bloch A, et al. Image harmonization: A review of statistical and deep learning methods for removing batch effects and evaluation metrics for effective harmonization. NeuroImage. 2023;274:120125.","journal-title":"NeuroImage"},{"issue":"4","key":"1193_CR63","doi-asserted-by":"publisher","first-page":"343","DOI":"10.1016\/S0895-4356(00)00314-0","volume":"54","author":"R Bender","year":"2001","unstructured":"Bender R, Lange S. Adjusting for multiple testing\u2013when and how? J Clin Epidemiol. 2001;54(4):343\u20139.","journal-title":"J Clin Epidemiol"},{"key":"1193_CR64","doi-asserted-by":"publisher","first-page":"100120","DOI":"10.1016\/j.metip.2023.100120","volume":"8","author":"MA Garc\u00eda-P\u00e9rez","year":"2023","unstructured":"Garc\u00eda-P\u00e9rez MA. Use and misuse of corrections for multiple testing. Methods Psychol. 
2023;8:100120.","journal-title":"Methods Psychol"},{"key":"1193_CR65","doi-asserted-by":"crossref","unstructured":"Thompson WH, Wright J, Bissett PG, Poldrack RA. Dataset decay and the problem of sequential analyses on open datasets. eLife. 2020;9.","DOI":"10.7554\/eLife.53498"},{"issue":"6248","key":"1193_CR66","doi-asserted-by":"publisher","first-page":"636","DOI":"10.1126\/science.aaa9375","volume":"349","author":"C Dwork","year":"2015","unstructured":"Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A. The reusable holdout: preserving validity in adaptive data analysis. Science. 2015;349(6248):636\u20138.","journal-title":"Science"},{"key":"1193_CR67","doi-asserted-by":"crossref","unstructured":"Hardt M, Ullman J. Preventing false discovery in interactive data analysis is hard. 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. IEEE; 2014. pp. 454\u201363.","DOI":"10.1109\/FOCS.2014.55"},{"key":"1193_CR68","doi-asserted-by":"publisher","first-page":"1282","DOI":"10.3389\/fnins.2019.01282","volume":"13","author":"C Chen","year":"2019","unstructured":"Chen C, Cao X, Tian L. Partial least squares regression performs well in MRI-Based individualized estimations. Front Neurosci. 2019;13:1282.","journal-title":"Front Neurosci"},{"key":"1193_CR69","unstructured":"Komeyer V, Eickhoff SB, Grefkes C, Patil KR, Raimondo F. A framework for confounder considerations in AI-driven precision medicine. medRxiv. 2024."},{"key":"1193_CR70","doi-asserted-by":"publisher","first-page":"119947","DOI":"10.1016\/j.neuroimage.2023.119947","volume":"270","author":"S More","year":"2023","unstructured":"More S, Antonopoulos G, Hoffstaedter F, Caspers J, Eickhoff SB, Patil KR, et al. Brain-age prediction: A systematic comparison of machine learning workflows. NeuroImage. 
2023;270:119947.","journal-title":"NeuroImage"},{"issue":"2","key":"1193_CR71","first-page":"79","volume":"5","author":"MA Pourhoseingholi","year":"2012","unstructured":"Pourhoseingholi MA, Baghestani AR, Vahedi M. How to control confounding effects by statistical analysis. Gastroenterol Hepatol Bed Bench. 2012;5(2):79\u201383.","journal-title":"Gastroenterol Hepatol Bed Bench"},{"key":"1193_CR72","doi-asserted-by":"publisher","first-page":"741","DOI":"10.1016\/j.neuroimage.2018.09.074","volume":"184","author":"L Snoek","year":"2019","unstructured":"Snoek L, Mileti\u0107 S, Scholte HS. How to control for confounds in decoding analyses of neuroimaging data. NeuroImage. 2019;184:741\u201360.","journal-title":"NeuroImage"},{"key":"1193_CR73","doi-asserted-by":"crossref","unstructured":"More S, Eickhoff SB, Caspers J, Patil KR. Confound removal and normalization in practice: A neuroimaging based sex prediction case study. In: Dong Y, Ifrim G, Mladeni\u0107 D, Saunders C, Van Hoecke S, editors. Machine learning and knowledge discovery in databases applied data science and demo track: european conference, ECML PKDD 2020, ghent, belgium, september 14\u201318, 2020, proceedings, part V. Cham: Springer International Publishing; 2021. pp. 3\u201318.","DOI":"10.1007\/978-3-030-67670-4_1"},{"key":"1193_CR74","doi-asserted-by":"crossref","unstructured":"Hamdan S, Love BC, von Polier GG, Weis S, Schwender H, Eickhoff SB et al. Confound-leakage: Confound Removal in Machine Learning Leads to Leakage. arXiv. 2022.","DOI":"10.1093\/gigascience\/giad071"},{"key":"1193_CR75","doi-asserted-by":"crossref","unstructured":"Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B et al. Model cards for model reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency - FAT* \u201919. New York, New York, USA: ACM Press; 2019. pp. 
220\u20139.","DOI":"10.1145\/3287560.3287596"},{"issue":"10","key":"1193_CR76","doi-asserted-by":"publisher","first-page":"1122","DOI":"10.1038\/s41592-021-01205-4","volume":"18","author":"I Walsh","year":"2021","unstructured":"Walsh I, Fishman D, Garcia-Gasulla D, Titma T, Pollastri G, ELIXIR Machine Learning Focus Group. DOME: recommendations for supervised machine learning validation in biology. Nat Methods. 2021;18(10):1122\u20137.","journal-title":"Nat Methods"},{"issue":"12","key":"1193_CR77","doi-asserted-by":"publisher","first-page":"e323","DOI":"10.2196\/jmir.5870","volume":"18","author":"W Luo","year":"2016","unstructured":"Luo W, Phung D, Tran T, Gupta S, Rana S, Karmakar C, et al. Guidelines for developing and reporting machine learning predictive models in biomedical research: A multidisciplinary view. J Med Internet Res. 2016;18(12):e323.","journal-title":"J Med Internet Res"},{"issue":"9","key":"1193_CR78","doi-asserted-by":"publisher","first-page":"1320","DOI":"10.1038\/s41591-020-1041-y","volume":"26","author":"B Norgeot","year":"2020","unstructured":"Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020;26(9):1320\u20134.","journal-title":"Nat Med"},{"issue":"12","key":"1193_CR79","doi-asserted-by":"publisher","first-page":"86","DOI":"10.1145\/3458723","volume":"64","author":"T Gebru","year":"2021","unstructured":"Gebru T, Morgenstern J, Vecchione B, Vaughan JW, Wallach H, Iii HD, et al. Datasheets for datasets. Commun ACM. 2021;64(12):86\u201392.","journal-title":"Commun ACM"},{"key":"1193_CR80","unstructured":"Holland S, Hosny A, Newman S, Joseph J, Chmielinski K. The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards. arXiv. 
2018."},{"issue":"3","key":"1193_CR81","doi-asserted-by":"publisher","first-page":"6","DOI":"10.1111\/1740-9713.01522","volume":"18","author":"S Schwab","year":"2021","unstructured":"Schwab S, Held L. Statistical programming: small mistakes, big impacts. Significance. 2021;18(3):6\u20137.","journal-title":"Significance"},{"issue":"7317","key":"1193_CR82","doi-asserted-by":"publisher","first-page":"753","DOI":"10.1038\/467753a","volume":"467","author":"N Barnes","year":"2010","unstructured":"Barnes N. Publish your computer code: it is good enough. Nature. 2010;467(7317):753.","journal-title":"Nature"},{"key":"1193_CR83","doi-asserted-by":"crossref","unstructured":"Soares C. Is the UCI repository useful for data mining? Portuguese Conference on Artificial Intelligence. 2003;209\u201323.","DOI":"10.1007\/978-3-540-24580-3_28"},{"key":"1193_CR84","doi-asserted-by":"publisher","first-page":"93","DOI":"10.1016\/j.jbusres.2022.01.076","volume":"144","author":"B van Giffen","year":"2022","unstructured":"van Giffen B, Herhausen D, Fahse T. Overcoming the pitfalls and perils of algorithms: A classification of machine learning biases and mitigation methods. J Bus Res. 2022;144:93\u2013106.","journal-title":"J Bus Res"},{"key":"1193_CR85","doi-asserted-by":"crossref","unstructured":"Paleyes A, Urma R-G, Lawrence ND. Challenges in deploying machine learning: a survey of case studies. ACM Comput Surv. 
2022.","DOI":"10.1145\/3533378"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01193-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-025-01193-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01193-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,29]],"date-time":"2025-05-29T14:02:52Z","timestamp":1748527372000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-025-01193-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,29]]},"references-count":85,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1193"],"URL":"https:\/\/doi.org\/10.1186\/s40537-025-01193-8","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,29]]},"assertion":[{"value":"16 August 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 May 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 May 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The ethics protocols for analyses of these data were approved by the Heinrich Heine University D\u00fcsseldorf ethics committee (No. 
4039).","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"The authors declare no competing interests.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"135"}}