{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T23:26:40Z","timestamp":1776900400713,"version":"3.51.2"},"reference-count":225,"publisher":"Public Library of Science (PLoS)","issue":"12","license":[{"start":{"date-parts":[[2022,12,15]],"date-time":"2022-12-15T00:00:00Z","timestamp":1671062400000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["www.ploscompbiol.org"],"crossmark-restriction":false},"short-container-title":["PLoS Comput Biol"],"abstract":"<jats:p>Applying computational statistics or machine learning methods to data is a key component of many scientific studies, in any field, but alone might not be sufficient to generate robust and reliable outcomes and results. Before applying any discovery method, preprocessing steps are necessary to prepare the data to the computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis and that should be adequately designed and performed since the first phases of the project. We call \u201cfeature\u201d a variable describing a particular trait of a person or an observation, recorded usually as a column in a dataset. Even if pivotal, these data cleaning and feature engineering steps sometimes are done poorly or inefficiently, especially by beginners and unexperienced researchers. For this reason, we propose here our quick tips for data cleaning and feature engineering on how to carry out these important preprocessing steps correctly avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can more in general be applied to any scientific area. We therefore target these guidelines to any researcher or practitioners wanting to perform data cleaning or feature engineering. 
We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.<\/jats:p>","DOI":"10.1371\/journal.pcbi.1010718","type":"journal-article","created":{"date-parts":[[2022,12,15]],"date-time":"2022-12-15T18:30:48Z","timestamp":1671129048000},"page":"e1010718","update-policy":"https:\/\/doi.org\/10.1371\/journal.pcbi.corrections_policy","source":"Crossref","is-referenced-by-count":58,"title":["Eleven quick tips for data cleaning and feature engineering"],"prefix":"10.1371","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9655-7142","authenticated-orcid":true,"given":"Davide","family":"Chicco","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8445-395X","authenticated-orcid":true,"given":"Luca","family":"Oneto","sequence":"additional","affiliation":[]},{"given":"Erica","family":"Tavazzi","sequence":"additional","affiliation":[]}],"member":"340","published-online":{"date-parts":[[2022,12,15]]},"reference":[{"issue":"10","key":"pcbi.1010718.ref001","doi-asserted-by":"crossref","first-page":"78","DOI":"10.1145\/2347736.2347755","article-title":"A few useful things to know about machine learning","volume":"55","author":"P. Domingos","year":"2012","journal-title":"Commun ACM."},{"key":"pcbi.1010718.ref002","volume-title":"An introduction to data cleaning with R","author":"E De Jonge","year":"2013"},{"issue":"10","key":"pcbi.1010718.ref003","doi-asserted-by":"crossref","first-page":"e267","DOI":"10.1371\/journal.pmed.0020267","article-title":"Data cleaning: detecting, diagnosing, and editing data abnormalities.","volume":"2","author":"J Van den Broeck","year":"2005","journal-title":"PLoS Med"},{"key":"pcbi.1010718.ref004","volume-title":"Some essentials of data cleaning: hints and tips","author":"F. 
Clemens","year":"2005"},{"key":"pcbi.1010718.ref005","article-title":"Best practices in data cleaning: a complete guide to everything you need to do before and after collecting your data.","author":"JW Osborne","year":"2013","journal-title":"Sage"},{"key":"pcbi.1010718.ref006","volume-title":"Feature engineering for machine learning: principles and techniques for data scientists","author":"A Zheng","year":"2018"},{"key":"pcbi.1010718.ref007","doi-asserted-by":"crossref","DOI":"10.1017\/9781108671682","volume-title":"The art of feature engineering: essentials for machine learning","author":"P. Duboue","year":"2020"},{"issue":"2","key":"pcbi.1010718.ref008","doi-asserted-by":"crossref","first-page":"e1009819","DOI":"10.1371\/journal.pcbi.1009819","article-title":"Ten simple rules for initial data analysis.","volume":"18","author":"M Baillie","year":"2022","journal-title":"PLoS Comput Biol"},{"issue":"12","key":"pcbi.1010718.ref009","doi-asserted-by":"crossref","first-page":"e1007434","DOI":"10.1371\/journal.pcbi.1007434","article-title":"Nine quick tips for analyzing network data.","volume":"15","author":"V Miele","year":"2019","journal-title":"PLoS Comput Biol"},{"issue":"5","key":"pcbi.1010718.ref010","doi-asserted-by":"crossref","first-page":"e1006906","DOI":"10.1371\/journal.pcbi.1006906","article-title":"Ten quick tips for biocuration.","volume":"15","author":"YA Tang","year":"2019","journal-title":"PLoS Comput Biol"},{"issue":"3","key":"pcbi.1010718.ref011","first-page":"241","article-title":"Occam\u2019s razor: A principle of intellectual elegance.","volume":"16","author":"D. Walsh","year":"1979","journal-title":"Am Philos Q"},{"issue":"4","key":"pcbi.1010718.ref012","doi-asserted-by":"crossref","first-page":"409","DOI":"10.1023\/A:1009868929893","article-title":"The role of Occam\u2019s razor in knowledge discovery.","volume":"3","author":"P. 
Domingos","year":"1999","journal-title":"Data Min Knowl Discov."},{"key":"pcbi.1010718.ref013","doi-asserted-by":"crossref","DOI":"10.1007\/978-1-4471-0123-9_3","article-title":"The supervised learning no-free-lunch theorems.","author":"DH Wolpert","year":"2002","journal-title":"Soft Computing and Industry."},{"key":"pcbi.1010718.ref014","volume-title":"The master algorithm: How the quest for the ultimate learning machine will remake our world","author":"P. Domingos","year":"2015"},{"key":"pcbi.1010718.ref015","doi-asserted-by":"crossref","unstructured":"D\u2019Amato V, Oneto L, Camurri A, Anguita D. Keep it simple: handcrafting feature and tuning Random Forests and XGBoost to face the Affective Movement Recognition Challenge 2021. In: International Conference on Affective Computing and Intelligent Interaction Workshops and Demos; 2021.","DOI":"10.1109\/ACIIW52867.2021.9666428"},{"key":"pcbi.1010718.ref016","article-title":"Do we really need deep learning models for time series forecasting?","author":"S Elsayed","year":"2021","journal-title":"arXiv"},{"issue":"1","key":"pcbi.1010718.ref017","first-page":"3133","article-title":"Do we need hundreds of classifiers to solve real world classification problems?","volume":"15","author":"M Fern\u00e1ndez-Delgado","year":"2014","journal-title":"J Mach Learn Res"},{"key":"pcbi.1010718.ref018","unstructured":"Molnar C. Interpretable Machine Learning. 
Available from: leanpub.com; 2020."},{"key":"pcbi.1010718.ref019","volume-title":"Deep learning.","author":"I Goodfellow","year":"2016"},{"issue":"10","key":"pcbi.1010718.ref020","doi-asserted-by":"crossref","first-page":"2585","DOI":"10.1007\/s10115-021-01605-0","article-title":"Model complexity of deep learning: A survey.","volume":"63","author":"X Hu","year":"2021","journal-title":"Knowl Inf Syst"},{"key":"pcbi.1010718.ref021","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1016\/j.ejca.2019.06.012","article-title":"Deep learning outperformed 11 pathologists in the classification of histopathological melanoma images","volume":"118","author":"A Hekler","year":"2019","journal-title":"Eur J Cancer"},{"issue":"7676","key":"pcbi.1010718.ref022","doi-asserted-by":"crossref","first-page":"354","DOI":"10.1038\/nature24270","article-title":"Mastering the game of Go without human knowledge","volume":"550","author":"D Silver","year":"2017","journal-title":"Nature"},{"issue":"7873","key":"pcbi.1010718.ref023","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1038\/s41586-021-03819-2","article-title":"Highly accurate protein structure prediction with AlphaFold","volume":"596","author":"J Jumper","year":"2021","journal-title":"Nature"},{"key":"pcbi.1010718.ref024","first-page":"26831","article-title":"Are transformers more robust than CNNs?","volume":"34","author":"Y Bai","year":"2021","journal-title":"Adv Neural Inf Process Syst"},{"key":"pcbi.1010718.ref025","doi-asserted-by":"crossref","unstructured":"Tay Y, Dehghani M, Gupta JP, Aribandi V, Bahri D, Qin Z, et al. Are pretrained convolutions better than pretrained transformers? 
In: Annual Meeting of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing; 2021.","DOI":"10.18653\/v1\/2021.acl-long.335"},{"key":"pcbi.1010718.ref026","article-title":"Occam\u2019s razor.","volume":"13","author":"C Rasmussen","year":"2000","journal-title":"Adv Neural Inf Process Syst"},{"issue":"32","key":"pcbi.1010718.ref027","doi-asserted-by":"crossref","first-page":"15849","DOI":"10.1073\/pnas.1903070116","article-title":"Reconciling modern machine-learning practice and the classical bias-variance trade-off","volume":"116","author":"M Belkin","year":"2019","journal-title":"Proc Natl Acad Sci U S A"},{"key":"pcbi.1010718.ref028","article-title":"Geometric Regularization from overparameterization explains double descent and other findings.","author":"NJ Teague","year":"2022","journal-title":"arXiv preprint"},{"key":"pcbi.1010718.ref029","volume-title":"Data quality","author":"RY Wang","year":"2006"},{"issue":"3","key":"pcbi.1010718.ref030","first-page":"103","article-title":"Data quality: \u201cGarbage in-garbage out\u201d.","volume":"47","author":"MF Kilkenny","year":"2018","journal-title":"Health Inf Manag J"},{"key":"pcbi.1010718.ref031","doi-asserted-by":"crossref","first-page":"142","DOI":"10.1016\/j.spl.2018.02.031","article-title":"When small data beats big data","volume":"136","author":"JJ Faraway","year":"2018","journal-title":"Stat Probab Lett"},{"issue":"2","key":"pcbi.1010718.ref032","doi-asserted-by":"crossref","first-page":"8","DOI":"10.1109\/MIS.2009.36","article-title":"The unreasonable effectiveness of data","volume":"24","author":"A Halevy","year":"2009","journal-title":"IEEE Intell Syst"},{"key":"pcbi.1010718.ref033","doi-asserted-by":"crossref","unstructured":"Sun C, Shrivastava A, Singh S, Gupta A. Revisiting unreasonable effectiveness of data in deep learning era. 
In: IEEE International Conference on Computer Vision; 2017.","DOI":"10.1109\/ICCV.2017.97"},{"key":"pcbi.1010718.ref034","doi-asserted-by":"crossref","DOI":"10.1145\/3310205","volume-title":"Data cleaning.","author":"IF Ilyas","year":"2019"},{"key":"pcbi.1010718.ref035","doi-asserted-by":"crossref","DOI":"10.1201\/9781315108230","volume-title":"Feature engineering and selection: a practical approach for predictive models","author":"M Kuhn","year":"2019"},{"key":"pcbi.1010718.ref036","volume-title":"Feature engineering for machine learning and data analytics","author":"G Dong","year":"2018"},{"issue":"5","key":"pcbi.1010718.ref037","doi-asserted-by":"crossref","first-page":"1097","DOI":"10.1111\/1468-0262.00152","article-title":"A reality check for data snooping.","volume":"68","author":"H. White","year":"2000","journal-title":"Econometrica"},{"issue":"8","key":"pcbi.1010718.ref038","doi-asserted-by":"crossref","first-page":"e124","DOI":"10.1371\/journal.pmed.0020124","article-title":"Why most published research findings are false.","volume":"2","author":"JPA Ioannidis","year":"2005","journal-title":"PLoS Med."},{"key":"pcbi.1010718.ref039","article-title":"How (not) to generate a highly predictive biomarker panel using machine learning.","author":"H Desaire","year":"2022","journal-title":"J Proteome Res"},{"key":"pcbi.1010718.ref040","doi-asserted-by":"crossref","DOI":"10.1109\/FOCS.2014.55","volume-title":"Preventing false discovery in interactive data analysis is hard.","author":"M Hardt","year":"2014"},{"issue":"3","key":"pcbi.1010718.ref041","doi-asserted-by":"crossref","first-page":"140216","DOI":"10.1098\/rsos.140216","article-title":"An investigation of the false discovery rate and the misinterpretation of p-values.","volume":"1","author":"D. 
Colquhoun","year":"2014","journal-title":"R Soc Open Sci"},{"issue":"1","key":"pcbi.1010718.ref042","first-page":"3837","article-title":"Are random forests truly the best classifiers?","volume":"17","author":"M Wainberg","year":"2016","journal-title":"J Mach Learn Res"},{"key":"pcbi.1010718.ref043","unstructured":"Errica F, Podda M, Bacciu D, Micheli A. A fair comparison of graph neural networks for graph classification. In: International Conference on Learning Representations; 2019."},{"issue":"8","key":"pcbi.1010718.ref044","doi-asserted-by":"crossref","first-page":"1207","DOI":"10.1016\/j.cjca.2021.02.020","article-title":"Machine learning compared with conventional statistical models for predicting myocardial infarction readmission and mortality: a systematic review","volume":"37","author":"SM Cho","year":"2021","journal-title":"Can J Cardiol"},{"issue":"10","key":"pcbi.1010718.ref045","doi-asserted-by":"crossref","first-page":"733","DOI":"10.1038\/nrg2825","article-title":"Tackling the widespread and critical impact of batch effects in high-throughput data","volume":"11","author":"JT Leek","year":"2010","journal-title":"Nat Rev Genet"},{"issue":"6","key":"pcbi.1010718.ref046","first-page":"1","article-title":"Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality","volume":"23","author":"M Sprang","year":"2022","journal-title":"BMC Bioinformatics"},{"issue":"4","key":"pcbi.1010718.ref047","doi-asserted-by":"crossref","first-page":"469","DOI":"10.1093\/bib\/bbs037","article-title":"Batch effect removal methods for microarray gene expression data integration: a survey","volume":"14","author":"C Lazar","year":"2013","journal-title":"Brief Bioinform"},{"issue":"4","key":"pcbi.1010718.ref048","doi-asserted-by":"crossref","first-page":"278","DOI":"10.1038\/tpj.2010.57","article-title":"A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene 
expression data","volume":"10","author":"J Luo","year":"2010","journal-title":"Pharmacogenomics J"},{"issue":"2","key":"pcbi.1010718.ref049","doi-asserted-by":"crossref","first-page":"e17238","DOI":"10.1371\/journal.pone.0017238","article-title":"Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods.","volume":"6","author":"C Chen","year":"2011","journal-title":"PLoS ONE"},{"issue":"4","key":"pcbi.1010718.ref050","doi-asserted-by":"crossref","first-page":"e0231446","DOI":"10.1371\/journal.pone.0231446","article-title":"Blind estimation and correction of microarray batch effect.","volume":"15","author":"S. Varma","year":"2020","journal-title":"PLoS ONE"},{"key":"pcbi.1010718.ref051","doi-asserted-by":"crossref","first-page":"83","DOI":"10.3389\/fgene.2018.00083","article-title":"Adjusting for batch effects in DNA methylation microarray data, a lesson learned.","volume":"9","author":"EM Price","year":"2018","journal-title":"Front Genet."},{"issue":"2","key":"pcbi.1010718.ref052","first-page":"86","article-title":"ECG noise sources and various noise removal techniques: a survey.","volume":"5","author":"H Limaye","year":"2016","journal-title":"Int J Appl Innov Eng Manag"},{"key":"pcbi.1010718.ref053","doi-asserted-by":"crossref","first-page":"102036","DOI":"10.1016\/j.bspc.2020.102036","article-title":"A review on medical image denoising algorithms.","volume":"61","author":"SVM Sagheer","year":"2020","journal-title":"Biomed Signal Process and Control"},{"issue":"5","key":"pcbi.1010718.ref054","doi-asserted-by":"crossref","first-page":"675","DOI":"10.2174\/1573405613666170428154156","article-title":"A review of denoising medical images using machine learning approaches.","volume":"14","author":"P Kaur","year":"2018","journal-title":"Curr Med Imaging Rev"},{"key":"pcbi.1010718.ref055","doi-asserted-by":"crossref","DOI":"10.1109\/ICIPTM52218.2021.9388367","volume-title":"Review on Medical Image Denoising 
Techniques","author":"S Kaur","year":"2021"},{"key":"pcbi.1010718.ref056","volume-title":"Exploratory data analysis","author":"V. Cox","year":"2017"},{"key":"pcbi.1010718.ref057","doi-asserted-by":"crossref","DOI":"10.1145\/3318464.3383126","volume-title":"Automating exploratory data analysis via machine learning: An overview.","author":"T Milo","year":"2020"},{"key":"pcbi.1010718.ref058","first-page":"3","author":"MB Brewer","year":"2000","journal-title":"Research design and issues of validity"},{"issue":"1","key":"pcbi.1010718.ref059","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41598-020-73558-3","article-title":"Survival prediction of patients with sepsis from age, sex, and septic episode number alone.","volume":"10","author":"D Chicco","year":"2020","journal-title":"Sci Rep"},{"key":"pcbi.1010718.ref060","doi-asserted-by":"crossref","DOI":"10.1201\/9781315382111","volume-title":"Exploratory data analysis using R","author":"RK Pearson","year":"2018"},{"key":"pcbi.1010718.ref061","volume-title":"Hands-On Exploratory Data Analysis with Python: Perform EDA techniques to understand, summarize, and investigate your data.","author":"SK Mukhiya","year":"2020"},{"key":"pcbi.1010718.ref062","first-page":"ggplot2","author":"H. Wickham","year":"2016","journal-title":"Programming with ggplot2"},{"issue":"03","key":"pcbi.1010718.ref063","doi-asserted-by":"crossref","first-page":"90","DOI":"10.1109\/MCSE.2007.55","article-title":"Matplotlib: a 2D graphics environment.","volume":"9","author":"JD Hunter","year":"2007","journal-title":"Comput Sci Eng"},{"key":"pcbi.1010718.ref064","doi-asserted-by":"crossref","DOI":"10.1201\/9780429447273","volume-title":"Interactive web-based data visualization with R, plotly, and shiny.","author":"C. 
Sievert","year":"2020"},{"key":"pcbi.1010718.ref065","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1016\/j.jbiotec.2017.07.028","article-title":"KNIME for reproducible cross-domain analysis of life science data","volume":"261","author":"A Fillbrunn","year":"2017","journal-title":"J Biotechnol"},{"key":"pcbi.1010718.ref066","volume-title":"Tableau your data!: fast and easy visual analysis with Tableau software.","author":"DG Murray","year":"2013"},{"key":"pcbi.1010718.ref067","article-title":"Data clustering: theory, algorithms, and applications.","author":"G Gan","year":"2020","journal-title":"SIAM"},{"key":"pcbi.1010718.ref068","doi-asserted-by":"crossref","first-page":"664","DOI":"10.1016\/j.neucom.2017.06.053","article-title":"A review of clustering techniques and developments","volume":"267","author":"A Saxena","year":"2017","journal-title":"Neurocomputing"},{"key":"pcbi.1010718.ref069","unstructured":"scikit learn. Clustering. 2022 [cited 2022 Aug 18]. Available from: https:\/\/scikit-learn.org\/stable\/modules\/clustering.html."},{"key":"pcbi.1010718.ref070","doi-asserted-by":"crossref","DOI":"10.1007\/978-0-387-39351-3","volume-title":"Nonlinear dimensionality reduction.","author":"JA Lee","year":"2007"},{"issue":"66\u201371","key":"pcbi.1010718.ref071","first-page":"13","article-title":"Dimensionality reduction: a comparative.","volume":"10","author":"L Van der Maaten","year":"2009","journal-title":"J Mach Learn Res"},{"key":"pcbi.1010718.ref072","doi-asserted-by":"crossref","DOI":"10.1007\/11494669_93","volume-title":"The curse of dimensionality in data mining and time series prediction.","author":"M Verleysen","year":"2005"},{"key":"pcbi.1010718.ref073","doi-asserted-by":"crossref","first-page":"44","DOI":"10.1016\/j.inffus.2020.01.005","article-title":"Overview and comparative study of dimensionality reduction techniques for high dimensional data.","volume":"59","author":"S Ayesha","year":"2020","journal-title":"Inf 
Fusion."},{"key":"pcbi.1010718.ref074","unstructured":"scikit learn. Decomposing signals in components (matrix factorization problems). 2022 [cited 2022 Aug 18]. Available from: https:\/\/scikit-learn.org\/stable\/modules\/decomposition.html."},{"key":"pcbi.1010718.ref075","unstructured":"scikit learn. Manifold learning. 2022 [cited 2022 Aug 18]. Available from: https:\/\/scikit-learn.org\/stable\/modules\/manifold.html."},{"key":"pcbi.1010718.ref076","volume-title":"Graph databases: new opportunities for connected data","author":"I Robinson","year":"2015"},{"key":"pcbi.1010718.ref077","doi-asserted-by":"crossref","DOI":"10.1007\/978-1-4842-4354-1","volume-title":"Text analytics with Python: a practitioner\u2019s guide to natural language processing","author":"D. Sarkar","year":"2019"},{"key":"pcbi.1010718.ref078","doi-asserted-by":"crossref","DOI":"10.2307\/j.ctv14jx6sm","volume-title":"Time series analysis","author":"JD Hamilton","year":"2020"},{"issue":"6","key":"pcbi.1010718.ref079","doi-asserted-by":"crossref","first-page":"632","DOI":"10.1002\/jmv.25743","article-title":"Analyzing the epidemiological outbreak of COVID-19: a visual exploratory data analysis approach","volume":"92","author":"SK Dey","year":"2020","journal-title":"J Med Virol"},{"issue":"1","key":"pcbi.1010718.ref080","doi-asserted-by":"crossref","first-page":"549","DOI":"10.1146\/annurev.psych.58.110405.085530","article-title":"Missing data analysis: making it work in the real world.","volume":"60","author":"JW Graham","year":"2009","journal-title":"Annu Rev Psychol"},{"issue":"10","key":"pcbi.1010718.ref081","doi-asserted-by":"crossref","first-page":"1087","DOI":"10.1016\/j.jclinepi.2006.01.014","article-title":"A gentle introduction to imputation of missing values.","volume":"59","author":"ART Donders","year":"2006","journal-title":"J Clin Epidemiol."},{"key":"pcbi.1010718.ref082","volume-title":"Statistical analysis with missing data","author":"RJA 
Little","year":"2019"},{"issue":"22","key":"pcbi.1010718.ref083","doi-asserted-by":"crossref","first-page":"547","DOI":"10.21105\/joss.00547","article-title":"Missingno: a missing data visualization suite.","volume":"3","author":"A. Bilogur","year":"2018","journal-title":"J Open Source Softw"},{"key":"pcbi.1010718.ref084","article-title":"Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations.","author":"NJ Tierney","year":"2018","journal-title":"arXiv preprint arXiv:180902264"},{"key":"pcbi.1010718.ref085","article-title":"Advances in missing data methods and implications for educational research.","volume":"3178","author":"CYJ Peng","year":"2006","journal-title":"Real Data. Analysis"},{"key":"pcbi.1010718.ref086","first-page":"42","volume-title":"Predicting ICU Mortality Risk by Grouping Temporal Trends from a Multivariate Panel of Physiologic Measurements.","author":"Y Luo","year":"2016"},{"issue":"3","key":"pcbi.1010718.ref087","first-page":"1","article-title":"mice: Multivariate Imputation by Chained Equations in R.","volume":"45","author":"S Van Buuren","year":"2011","journal-title":"J Stat Softw."},{"issue":"1","key":"pcbi.1010718.ref088","first-page":"85","article-title":"A multivariate technique for multiply imputing missing values using a sequence of regression models.","volume":"27","author":"TE Raghunathan","year":"2001","journal-title":"Survey. 
Methodology"},{"issue":"5","key":"pcbi.1010718.ref089","doi-asserted-by":"crossref","first-page":"1477","DOI":"10.1109\/TBME.2018.2874712","article-title":"Estimating missing data in temporal data streams using multi-directional recurrent neural networks","volume":"66","author":"J Yoon","year":"2019","journal-title":"IEEE Trans Biomed Eng"},{"key":"pcbi.1010718.ref090","doi-asserted-by":"crossref","first-page":"104933","DOI":"10.1109\/ACCESS.2020.2997255","article-title":"Multi-modal stacked denoising autoencoder for handling missing data in healthcare big data.","volume":"8","author":"JC Kim","year":"2020","journal-title":"IEEE Access"},{"issue":"3","key":"pcbi.1010718.ref091","doi-asserted-by":"crossref","first-page":"74","DOI":"10.1186\/s12911-016-0318-z","article-title":"Nearest neighbor imputation algorithms: a critical evaluation.","volume":"16","author":"L Beretta","year":"2016","journal-title":"BMC Med Inform Decis Mak"},{"issue":"5","key":"pcbi.1010718.ref092","first-page":"1","article-title":"Exploiting mutual information for the imputation of static and dynamic mixed-type clinical data with an adaptive k-nearest neighbours approach.","volume":"20","author":"E Tavazzi","year":"2020","journal-title":"BMC Med Inform Decis Mak"},{"issue":"2","key":"pcbi.1010718.ref093","doi-asserted-by":"crossref","first-page":"143","DOI":"10.18196\/jrc.v3i2.13133","article-title":"Systematic review on missing data imputation techniques with machine learning algorithms for healthcare.","volume":"3","author":"AR Ismail","year":"2022","journal-title":"J Robot Control"},{"key":"pcbi.1010718.ref094","doi-asserted-by":"crossref","DOI":"10.1201\/b17622","volume-title":"Handbook of missing data methodology","author":"G Molenberghs","year":"2014"},{"issue":"6","key":"pcbi.1010718.ref095","doi-asserted-by":"crossref","first-page":"645","DOI":"10.1093\/jamia\/ocx133","article-title":"3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal 
clinical data","volume":"25","author":"Y Luo","year":"2017","journal-title":"J Am Med Inform Assoc"},{"key":"pcbi.1010718.ref096","first-page":"550","article-title":"Interpolation and k-nearest neighbours combined imputation for longitudinal ICU laboratory data","author":"S Daberdaku","year":"2019","journal-title":"In: The Seventh IEEE International Conference on Healthcare Informatics"},{"key":"pcbi.1010718.ref097","first-page":"1","article-title":"A combined interpolation and weighted k-nearest neighbours approach for the imputation of longitudinal ICU laboratory data.","author":"S Daberdaku","year":"2020","journal-title":"J Healthc Inform Res."},{"key":"pcbi.1010718.ref098","volume-title":"BRITS: bidirectional recurrent imputation for time series.","author":"W Cao","year":"2018"},{"key":"pcbi.1010718.ref099","volume-title":"Multiple imputation for nonresponse in surveys","author":"DB Rubin","year":"2004"},{"issue":"1","key":"pcbi.1010718.ref100","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12874-017-0442-1","article-title":"When and how should multiple imputation be used for handling missing data in randomised clinical trials\u2013a practical guide with flowcharts.","volume":"17","author":"JC Jakobsen","year":"2017","journal-title":"BMC Med Res Methodol"},{"key":"pcbi.1010718.ref101","article-title":"Detecting outliers in the monthly retail trade survey using the Hidiroglou-Berthelot method.","author":"JW Hunt","year":"1999","journal-title":"In: Proceedings of the Section on Survey Research Methods."},{"issue":"2","key":"pcbi.1010718.ref102","first-page":"221","article-title":"On the detection of many outliers.","volume":"17","author":"B. 
Rosner","year":"1975","journal-title":"Dent Tech."},{"issue":"3","key":"pcbi.1010718.ref103","doi-asserted-by":"crossref","first-page":"2005","DOI":"10.1016\/j.jksus.2020.02.003","article-title":"On detecting outliers in complex data using Dixon\u2019s test under neutrosophic statistics.","volume":"32","author":"M. Aslam","year":"2020","journal-title":"J King Saud Univ Sci"},{"key":"pcbi.1010718.ref104","unstructured":"scikit learn. Novelty and outlier detection. 2007 [cited 2022 Aug 18]. Available from: https:\/\/scikit-learn.org\/stable\/modules\/outlier_detection.html."},{"issue":"1","key":"pcbi.1010718.ref105","first-page":"1","article-title":"Unsupervised outlier detection in multidimensional data.","volume":"8","author":"SB Belhaouari","year":"2021","journal-title":"J Big Data"},{"key":"pcbi.1010718.ref106","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-319-14142-8","volume-title":"Data mining: the textbook","author":"CC Aggarwal","year":"2015"},{"key":"pcbi.1010718.ref107","doi-asserted-by":"crossref","DOI":"10.1017\/9781108564175","volume-title":"Data mining and machine learning: Fundamental concepts and algorithms","author":"MJ Zaki","year":"2020"},{"key":"pcbi.1010718.ref108","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1017\/S0962492921000039","article-title":"Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation","volume":"30","author":"M. 
Belkin","year":"2021","journal-title":"Acta Numerica"},{"key":"pcbi.1010718.ref109","doi-asserted-by":"crossref","DOI":"10.1109\/CVPR.2019.00446","article-title":"A general and adaptive robust loss function","author":"JT Barron","year":"2019"},{"key":"pcbi.1010718.ref110","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9781107298019","volume-title":"Understanding machine learning: From theory to algorithms","author":"S Shalev-Shwartz","year":"2014"},{"key":"pcbi.1010718.ref111","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1016\/j.imavis.2018.04.004","article-title":"Beyond one-hot encoding: Lower dimensional target embedding.","volume":"75","author":"P Rodr\u00edguez","year":"2018","journal-title":"Image Vis Comput"},{"key":"pcbi.1010718.ref112","doi-asserted-by":"crossref","first-page":"114381","DOI":"10.1109\/ACCESS.2021.3104357","article-title":"A deep-learned embedding technique for categorical features encoding.","volume":"9","author":"MK Dahouda","year":"2021","journal-title":"IEEE Access."},{"key":"pcbi.1010718.ref113","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1007\/978-1-0716-0826-5_3","article-title":"Siamese neural networks: an overview.","author":"D. Chicco","year":"2021","journal-title":"Artificial. Neural Netw"},{"issue":"4","key":"pcbi.1010718.ref114","doi-asserted-by":"crossref","first-page":"496","DOI":"10.1038\/ng1032","article-title":"Microarray data normalization and transformation","volume":"32","author":"J. 
Quackenbush","year":"2002","journal-title":"Nat Genet"},{"key":"pcbi.1010718.ref115","doi-asserted-by":"crossref","first-page":"105524","DOI":"10.1016\/j.asoc.2019.105524","article-title":"Investigating the impact of data normalization on classification performance.","volume":"97","author":"D Singh","year":"2020","journal-title":"Appl Soft Comput"},{"issue":"Mar","key":"pcbi.1010718.ref116","first-page":"1157","article-title":"An introduction to variable and feature selection.","volume":"3","author":"I Guyon","year":"2003","journal-title":"J Mach Learn Res"},{"issue":"6","key":"pcbi.1010718.ref117","first-page":"94","article-title":"Feature selection: A data perspective.","volume":"50","author":"J Li","year":"2018","journal-title":"ACM Comp Surv"},{"key":"pcbi.1010718.ref118","doi-asserted-by":"crossref","first-page":"70","DOI":"10.1016\/j.neucom.2017.11.077","article-title":"Feature selection in machine learning: A new perspective.","volume":"300","author":"J Cai","year":"2018","journal-title":"Neurocomputing"},{"issue":"4","key":"pcbi.1010718.ref119","doi-asserted-by":"crossref","first-page":"837","DOI":"10.1109\/TCBB.2014.2382127","article-title":"Software suite for gene and protein annotation prediction and similarity search","volume":"12","author":"D Chicco","year":"2014","journal-title":"IEEE\/ACM Trans Comput Biol Bioinform"},{"key":"pcbi.1010718.ref120","article-title":"Comparing methods addressing multi-collinearity when developing prediction models.","author":"AM Leeuwenberg","year":"2021","journal-title":"arXiv preprint arXiv:210101603"},{"issue":"1","key":"pcbi.1010718.ref121","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13040-017-0142-8","article-title":"EFS: an ensemble feature selection tool implemented as R-package and web-application.","volume":"10","author":"U Neumann","year":"2017","journal-title":"BioData 
Mining"},{"issue":"7\u20139","key":"pcbi.1010718.ref122","doi-asserted-by":"crossref","first-page":"1379","DOI":"10.1016\/j.neucom.2008.12.024","article-title":"Nearly homogeneous multi-partitioning with a deterministic generator.","volume":"72","author":"M. Aupetit","year":"2009","journal-title":"Neurocomputing"},{"issue":"2","key":"pcbi.1010718.ref123","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1023\/A:1022497517599","article-title":"Sampling and subsampling for cluster analysis in data mining: With applications to sky survey data.","volume":"7","author":"DM Rocke","year":"2003","journal-title":"Data Min Knowl Discov"},{"key":"pcbi.1010718.ref124","article-title":"Su-Sampling Based Active Learning For Large-Scale Histopathology Image","author":"Y Shen","year":"2021","journal-title":"In: IEEE International Conference on Image Processing"},{"issue":"3","key":"pcbi.1010718.ref125","doi-asserted-by":"crossref","first-page":"785","DOI":"10.1177\/0962280216643116","article-title":"Class-imbalanced subsampling lasso algorithm for discovering adverse drug reactions.","volume":"27","author":"I Ahmed","year":"2018","journal-title":"Stat Methods Med Res"},{"issue":"21","key":"pcbi.1010718.ref126","doi-asserted-by":"crossref","first-page":"3429","DOI":"10.1093\/bioinformatics\/btv345","article-title":"ProFET: Feature engineering captures high-level protein functions","volume":"31","author":"D Ofer","year":"2015","journal-title":"Bioinformatics"},{"key":"pcbi.1010718.ref127","article-title":"Extracting chemical-protein interactions from literature using sentence structure analysis and feature engineering","volume":"2019","author":"PY Lung","year":"2019","journal-title":"Database"},{"issue":"1","key":"pcbi.1010718.ref128","doi-asserted-by":"crossref","first-page":"lqaa109","DOI":"10.1093\/nargab\/lqaa109","article-title":"Rapid discovery of novel prophages using biological feature engineering and machine learning.","volume":"3","author":"K 
Sir\u00e9n","year":"2021","journal-title":"NAR Genom Bioinform."},{"key":"pcbi.1010718.ref129","doi-asserted-by":"crossref","first-page":"109386","DOI":"10.1016\/j.mehy.2019.109386","article-title":"Medical knowledge integration and \u201csystems medicine\u201d: needs, ambitions, limitations and options.","volume":"133","author":"F Tretter","year":"2019","journal-title":"Med Hypotheses"},{"issue":"7","key":"pcbi.1010718.ref130","doi-asserted-by":"crossref","first-page":"574","DOI":"10.2174\/138920212803251445","article-title":"On the limitations of biological knowledge","volume":"13","author":"ER Dougherty","year":"2012","journal-title":"Curr Genomics"},{"key":"pcbi.1010718.ref131","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511809682","volume-title":"Kernel methods for pattern analysis","author":"J Shawe-Taylor","year":"2004"},{"issue":"3","key":"pcbi.1010718.ref132","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1561\/2200000036","article-title":"Kernels for vector-valued functions: A review.","volume":"4","author":"MA Alvarez","year":"2012","journal-title":"Found Trends Mach Learn"},{"issue":"6","key":"pcbi.1010718.ref133","doi-asserted-by":"crossref","first-page":"399","DOI":"10.1038\/s41592-018-0019-x","article-title":"The curse(s) of dimensionality.","volume":"15","author":"N Altman","year":"2018","journal-title":"Nat Methods"},{"issue":"8","key":"pcbi.1010718.ref134","doi-asserted-by":"crossref","first-page":"1798","DOI":"10.1109\/TPAMI.2013.50","article-title":"Representation learning: A review and new perspectives","volume":"35","author":"Y Bengio","year":"2013","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"pcbi.1010718.ref135","article-title":"The curse of highly variable functions for local kernel machines.","author":"Y Bengio","year":"2005","journal-title":"Neural Inform Process 
Syst."},{"issue":"3","key":"pcbi.1010718.ref136","doi-asserted-by":"crossref","first-page":"252","DOI":"10.1109\/34.75512","article-title":"Small sample size effects in statistical pattern recognition: Recommendations for practitioners","volume":"13","author":"SJ Raudys","year":"1991","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"pcbi.1010718.ref137","article-title":"Deep learning: A critical appraisal.","author":"G. Marcus","year":"2018","journal-title":"arXiv preprint arXiv:180100631"},{"key":"pcbi.1010718.ref138","unstructured":"Schmidhuber J. Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21 (v3), IDSIA, Lugano, Switzerland, 2021\u20132022; 2022."},{"issue":"2","key":"pcbi.1010718.ref139","doi-asserted-by":"crossref","first-page":"lqab039","DOI":"10.1093\/nargab\/lqab039","article-title":"A large-scale comparative study on peptide encodings for biomedical classification.","volume":"3","author":"S Sp\u00e4nig","year":"2021","journal-title":"NAR Genom Bioinform."},{"issue":"1","key":"pcbi.1010718.ref140","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1109\/MCI.2009.935308","article-title":"Mining structured data","volume":"5","author":"MG Da San","year":"2010","journal-title":"IEEE Comput Intell Mag"},{"issue":"4","key":"pcbi.1010718.ref141","doi-asserted-by":"crossref","first-page":"339","DOI":"10.1007\/s13218-015-0372-1","article-title":"Autonomous learning of representations.","volume":"29","author":"O Walter","year":"2015","journal-title":"KI-K\u00fcnstliche Intelligenz"},{"key":"pcbi.1010718.ref142","doi-asserted-by":"crossref","first-page":"217","DOI":"10.1016\/j.neucom.2022.04.072","article-title":"Towards learning trustworthily, automatically, and with guarantees on graphs: an overview.","volume":"493","author":"L 
Oneto","year":"2022","journal-title":"Neurocomputing."},{"key":"pcbi.1010718.ref143","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1146\/annurev-bioeng-071516-044442","article-title":"Deep learning in medical image analysis.","volume":"19","author":"D Shen","year":"2017","journal-title":"Annu Rev Biomed Eng."},{"key":"pcbi.1010718.ref144","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1016\/j.media.2017.07.005","article-title":"A survey on deep learning in medical image analysis","volume":"42","author":"G Litjens","year":"2017","journal-title":"Med Image Anal"},{"key":"pcbi.1010718.ref145","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1016\/j.neunet.2020.06.006","article-title":"A gentle introduction to deep learning for graphs.","volume":"129","author":"D Bacciu","year":"2020","journal-title":"Neural Netw"},{"issue":"1","key":"pcbi.1010718.ref146","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1145\/959242.959248","article-title":"A survey of kernels for structured data.","volume":"5","author":"T. 
G\u00e4rtner","year":"2003","journal-title":"ACM SIGKDD Explor Newsletter"},{"key":"pcbi.1010718.ref147","volume-title":"Kernels for semi-structured data.","author":"H Kashima","year":"2002"},{"issue":"10","key":"pcbi.1010718.ref148","doi-asserted-by":"crossref","first-page":"4932","DOI":"10.1109\/TNNLS.2017.2785292","article-title":"Generative kernels for tree-structured data","volume":"29","author":"D Bacciu","year":"2018","journal-title":"IEEE transactions on neural networks and learning systems"},{"key":"pcbi.1010718.ref149","doi-asserted-by":"crossref","first-page":"174","DOI":"10.1016\/j.ins.2018.12.052","article-title":"Deep reservoir neural networks for trees","volume":"480","author":"C Gallicchio","year":"2019","journal-title":"Inform Sci"},{"issue":"4","key":"pcbi.1010718.ref150","doi-asserted-by":"crossref","first-page":"296","DOI":"10.1002\/widm.36","article-title":"Similarity measures for sequential data.","volume":"1","author":"K. Rieck","year":"2011","journal-title":"Wiley Interdiscip Rev Data Min Knowl Discov"},{"key":"pcbi.1010718.ref151","article-title":"A critical review of recurrent neural networks for sequence learning.","author":"ZC Lipton","year":"2015","journal-title":"arXiv preprint arXiv:150600019"},{"key":"pcbi.1010718.ref152","article-title":"Kernels for sequentially ordered data.","volume":"20","author":"FJ Kir\u00e1ly","year":"2019","journal-title":"J Mach Learn Res."},{"issue":"1","key":"pcbi.1010718.ref153","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2379776.2379788","article-title":"Time-series data mining","volume":"45","author":"P Esling","year":"2012","journal-title":"ACM Comput Surv"},{"key":"pcbi.1010718.ref154","volume-title":"Foundations of statistical natural language processing.","author":"C Manning","year":"1999"},{"issue":"1","key":"pcbi.1010718.ref155","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/978-3-031-02165-7","article-title":"Neural network methods for natural language 
processing.","volume":"10","author":"Y. Goldberg","year":"2017","journal-title":"Synth Lect Hum Lang Technol"},{"issue":"3","key":"pcbi.1010718.ref156","doi-asserted-by":"crossref","first-page":"288","DOI":"10.18178\/ijmlc.2019.9.3.800","article-title":"A review of image denoising and segmentation methods based on medical images.","volume":"9","author":"S Kollem","year":"2019","journal-title":"Int J Mach Learn Comput"},{"issue":"6","key":"pcbi.1010718.ref157","doi-asserted-by":"crossref","first-page":"1045","DOI":"10.1007\/s10278-013-9622-7","article-title":"The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository.","volume":"26","author":"K Clark","year":"2013","journal-title":"J Digit Imaging."},{"issue":"Supplement 1","key":"pcbi.1010718.ref158","doi-asserted-by":"crossref","first-page":"i418","DOI":"10.1093\/bioinformatics\/btab271","article-title":"KG4SL: knowledge graph neural network for synthetic lethality prediction in human cancers","volume":"37","author":"S Wang","year":"2021","journal-title":"Bioinformatics"},{"key":"pcbi.1010718.ref159","article-title":"An automated feature engineering for digital rectal examination documentation using natural language processing","author":"S Bozkurt","year":"2018","journal-title":"In: AMIA Annual Symposium Proceedings"},{"key":"pcbi.1010718.ref160","unstructured":"Koh JY. Model Zoo. 2022 [cited 2022 Aug 18]. 
Available from: https:\/\/modelzoo.co\/."},{"key":"pcbi.1010718.ref161","article-title":"Huggingface\u2019s transformers: State-of-the-art natural language processing.","author":"T Wolf","year":"2019","journal-title":"arXiv preprint arXiv:191003771"},{"key":"pcbi.1010718.ref162","volume-title":"Strategies For Pre-training Graph Neural Networks","author":"W Hu","year":"2020"},{"issue":"2","key":"pcbi.1010718.ref163","doi-asserted-by":"crossref","first-page":"248","DOI":"10.1109\/TCBB.2015.2459694","article-title":"Ontology-based prediction and prioritization of gene functional annotations","volume":"13","author":"D Chicco","year":"2015","journal-title":"IEEE\/ACM Trans Comput Biol Bioinform"},{"key":"pcbi.1010718.ref164","first-page":"1","volume-title":"Enhanced probabilistic latent semantic analysis with weighting schemes to predict genomic annotations.","author":"P Pinoli","year":"2013"},{"issue":"1","key":"pcbi.1010718.ref165","first-page":"1997","article-title":"Neural architecture search: A survey.","volume":"20","author":"T Elsken","year":"2019","journal-title":"J Mach Learn Res"},{"key":"pcbi.1010718.ref166","doi-asserted-by":"crossref","first-page":"106622","DOI":"10.1016\/j.knosys.2020.106622","article-title":"AutoML: A survey of the state-of-the-art.","volume":"212","author":"X He","year":"2021","journal-title":"Knowl Based Syst"},{"key":"pcbi.1010718.ref167","article-title":"Meta-learning in neural networks: A survey","author":"TM Hospedales","year":"2021","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"pcbi.1010718.ref168","article-title":"On the Commoditization of Artificial Intelligence.","volume":"3934","author":"AA Abonamah","year":"2021","journal-title":"Front Psychol"},{"key":"pcbi.1010718.ref169","first-page":"100031","article-title":"The commoditization of AI for molecule design","volume":"2","author":"F Urbina","year":"2022","journal-title":"Artif Intell Life Sci"},{"key":"pcbi.1010718.ref170","volume-title":"Commoditization of 
Data is the Problem, Not the Solution-Why Placing a Price Tag on Personal Information May Harm Rather Than Protect Consumer Privacy.","author":"L Moerel","year":"2020"},{"key":"pcbi.1010718.ref171","doi-asserted-by":"crossref","first-page":"50","DOI":"10.1016\/j.jnca.2017.06.003","article-title":"Trustworthy data: A survey, taxonomy and future trends of secure provenance schemes.","volume":"94","author":"F Zafar","year":"2017","journal-title":"J Netw Comput Appl"},{"issue":"2","key":"pcbi.1010718.ref172","doi-asserted-by":"crossref","first-page":"117","DOI":"10.1038\/s41567-019-0780-5","article-title":"Trustworthy data underpin reproducible research.","volume":"16","author":"MJT Milton","year":"2020","journal-title":"Nat Phys."},{"issue":"10","key":"pcbi.1010718.ref173","doi-asserted-by":"crossref","first-page":"64","DOI":"10.1145\/3448248","article-title":"Trustworthy AI.","volume":"64","author":"JM Wing","year":"2021","journal-title":"Commun ACM"},{"key":"pcbi.1010718.ref174","unstructured":"European Commission. Data Act. 2022 [cited 2022 Aug 18]. Available from: https:\/\/eur-lex.europa.eu\/legal-content\/EN\/TXT\/?uri=CELEX%3A52020PC0767."},{"key":"pcbi.1010718.ref175","unstructured":"European Commission. Artificial Intelligence Act. 2022 [cited 2022 Aug 18]. 
Available from: https:\/\/eur-lex.europa.eu\/legal-content\/EN\/TXT\/?uri=CELEX%3A52021PC0206."},{"issue":"5","key":"pcbi.1010718.ref176","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1145\/3376898","article-title":"A snapshot of the frontiers of fairness in machine learning.","volume":"63","author":"A Chouldechova","year":"2020","journal-title":"Commun ACM"},{"key":"pcbi.1010718.ref177","article-title":"Fairness in machine learning","author":"L Oneto","year":"2020","journal-title":"In: Recent Trends in Learning From Data"},{"issue":"6","key":"pcbi.1010718.ref178","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3457607","article-title":"A survey on bias and fairness in machine learning","volume":"54","author":"N Mehrabi","year":"2021","journal-title":"ACM Comput Surv"},{"key":"pcbi.1010718.ref179","volume-title":"Exploiting mmd and sinkhorn divergences for fair and transferable representation learning.","author":"L Oneto","year":"2020"},{"key":"pcbi.1010718.ref180","article-title":"Sinkhorn distances: Lightspeed computation of optimal transport.","author":"M. 
Cuturi","year":"2013","journal-title":"Neural Inf Process Syst."},{"key":"pcbi.1010718.ref181","doi-asserted-by":"crossref","first-page":"317","DOI":"10.1016\/j.patcog.2018.07.023","article-title":"Wild patterns: Ten years after the rise of adversarial machine learning.","volume":"84","author":"B Biggio","year":"2018","journal-title":"Pattern Recognit"},{"issue":"1","key":"pcbi.1010718.ref182","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1109\/MSP.2017.2765202","article-title":"Generative adversarial networks: An overview","volume":"35","author":"A Creswell","year":"2018","journal-title":"IEEE Signal Process Mag"},{"key":"pcbi.1010718.ref183","article-title":"A review on generative adversarial networks: Algorithms, theory, and applications","author":"J Gui","year":"2021","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"pcbi.1010718.ref184","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1146\/annurev-statistics-040720-031848","article-title":"Synthetic data.","volume":"8","author":"TE Raghunathan","year":"2021","journal-title":"Annu Rev Stat Appl"},{"key":"pcbi.1010718.ref185","volume-title":"ML confidential: Machine learning on encrypted data","author":"T Graepel","year":"2012"},{"issue":"3","key":"pcbi.1010718.ref186","doi-asserted-by":"crossref","first-page":"50","DOI":"10.1109\/MSP.2020.2975749","article-title":"Federated learning: Challenges, methods, and future directions","volume":"37","author":"T Li","year":"2020","journal-title":"IEEE Signal Process Mag"},{"issue":"3\u20134","key":"pcbi.1010718.ref187","first-page":"211","article-title":"The algorithmic foundations of differential privacy","volume":"9","author":"C Dwork","year":"2014","journal-title":"Found Trends Theor Comput Sci"},{"issue":"2","key":"pcbi.1010718.ref188","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1109\/MSEC.2018.2888775","article-title":"Privacy-preserving machine learning: Threats and solutions.","volume":"17","author":"M 
Al-Rubaie","year":"2019","journal-title":"IEEE Secur Priv"},{"issue":"2","key":"pcbi.1010718.ref189","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3436755","article-title":"When machine learning meets privacy: A survey and outlook","volume":"54","author":"B Liu","year":"2021","journal-title":"ACM Comput Surv"},{"issue":"4","key":"pcbi.1010718.ref190","first-page":"139","article-title":"SoK: privacy-preserving computation techniques for deep learning","volume":"2021","author":"J Cabrero-Holgueras","year":"2021","journal-title":"Proc Priv Enh Technol"},{"key":"pcbi.1010718.ref191","doi-asserted-by":"crossref","DOI":"10.1007\/978-0-387-70992-5","volume-title":"A general survey of privacy-preserving data mining models and algorithms","author":"CC Aggarwal","year":"2008"},{"issue":"5","key":"pcbi.1010718.ref192","doi-asserted-by":"crossref","first-page":"e1424","DOI":"10.1002\/widm.1424","article-title":"Explainable artificial intelligence: an analytical review","volume":"11","author":"PP Angelov","year":"2021","journal-title":"Wiley Interdiscip Rev Data Min Knowl Discov"},{"issue":"5","key":"pcbi.1010718.ref193","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3236009","article-title":"A survey of methods for explaining black box models","volume":"51","author":"R Guidotti","year":"2018","journal-title":"ACM Comput Surv"},{"key":"pcbi.1010718.ref194","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1016\/j.inffus.2019.12.012","article-title":"Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI.","volume":"58","author":"AB Arrieta","year":"2020","journal-title":"Inf Fusion."},{"key":"pcbi.1010718.ref195","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.artint.2018.07.007","article-title":"Explanation in artificial intelligence: insights from the social sciences","volume":"267","author":"T. 
Miller","year":"2019","journal-title":"Artif Intell"},{"key":"pcbi.1010718.ref196","article-title":"Examples are not enough, learn to criticize! criticism for interpretability.","author":"B Kim","year":"2016","journal-title":"Neural Inf Process Syst."},{"key":"pcbi.1010718.ref197","article-title":"Towards Automatic Concept-based Explanations.","author":"A Ghorbani","year":"2019","journal-title":"Neural Inf Process Syst."},{"key":"pcbi.1010718.ref198","article-title":"On completeness-aware concept-based explanations in deep neural networks.","author":"CK Yeh","year":"2020","journal-title":"Neural Inf Process Syst"},{"key":"pcbi.1010718.ref199","article-title":"Towards a rigorous science of interpretable machine learning.","author":"F Doshi-Velez","year":"2017","journal-title":"arXiv preprint arXiv:170208608"},{"key":"pcbi.1010718.ref200","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-319-90403-0_9","volume-title":"Perturbation-based explanations of prediction models","author":"M Robnik-\u0160ikonja","year":"2018"},{"key":"pcbi.1010718.ref201","doi-asserted-by":"crossref","first-page":"106043","DOI":"10.1016\/j.compbiomed.2022.106043","article-title":"Explainable, trustworthy, and ethical machine learning for healthcare: a survey","author":"K Rasheed","year":"2022","journal-title":"Comput Biol Med"},{"key":"pcbi.1010718.ref202","first-page":"47","volume-title":"Trustworthy machine learning for health care: scalable data valuation with the Shapley value","author":"KD Pandl","year":"2021"},{"issue":"1","key":"pcbi.1010718.ref203","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41598-020-66481-0","article-title":"Using human in vitro transcriptome analysis to build trustworthy machine learning models for prediction of animal drug toxicity.","volume":"10","author":"LJ Gardiner","year":"2020","journal-title":"Sci 
Rep"},{"key":"pcbi.1010718.ref204","doi-asserted-by":"crossref","first-page":"263","DOI":"10.1016\/j.inffus.2021.10.007","article-title":"Information fusion as an integrative cross-cutting enabler to achieve robust, explainable, and trustworthy medical artificial intelligence","volume":"79","author":"A Holzinger","year":"2022","journal-title":"Inf Fusion"},{"key":"pcbi.1010718.ref205","unstructured":"Kaggle. Kaggle datasets\u2014Explore, analyze, and share quality data. 2022 [cited 2022 Jun 24]. Available from: https:\/\/www.kaggle.com\/datasets."},{"key":"pcbi.1010718.ref206","unstructured":"University of California Irvine. Machine Learning Repository. 1987 [cited 2022 Jun 24]. Available from: https:\/\/archive.ics.uci.edu\/ml."},{"key":"pcbi.1010718.ref207","unstructured":"Zenodo. Zenodo. 2013 [cited 2022 Jul 25]. Available from: https:\/\/www.zenodo.org."},{"key":"pcbi.1010718.ref208","unstructured":"FigShare. Store, share, discover research. 2011 [cited 2022 Jul 25]. Available from: https:\/\/www.figshare.com."},{"issue":"1","key":"pcbi.1010718.ref209","doi-asserted-by":"crossref","first-page":"1460458220984205","DOI":"10.1177\/1460458220984205","article-title":"Computational intelligence identifies alkaline phosphatase (ALP), alpha-fetoprotein (AFP), and hemoglobin levels as most predictive survival factors for hepatocellular carcinoma.","volume":"27","author":"D Chicco","year":"2021","journal-title":"Health Informatics J"},{"issue":"1","key":"pcbi.1010718.ref210","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR guiding principles for scientific data management and stewardship.","volume":"3","author":"MD Wilkinson","year":"2016","journal-title":"Sci Data."},{"issue":"1","key":"pcbi.1010718.ref211","doi-asserted-by":"crossref","DOI":"10.5334\/dsj-2022-017","article-title":"A survey on publicly available open datasets derived from electronic health records (EHRs) of patients with 
neuroblastoma.","volume":"21","author":"D Chicco","year":"2022","journal-title":"Data Sci J."},{"issue":"1","key":"pcbi.1010718.ref212","doi-asserted-by":"crossref","first-page":"37","DOI":"10.3233\/DS-190026","article-title":"Towards FAIR principles for research software.","volume":"3","author":"AL Lamprecht","year":"2020","journal-title":"Data Sci."},{"issue":"6","key":"pcbi.1010718.ref213","doi-asserted-by":"crossref","first-page":"e1010193","DOI":"10.1371\/journal.pcbi.1010193","article-title":"Advancing code sharing in the computational biology community","volume":"18","author":"L Cadwallader","year":"2022","journal-title":"PLoS Comput Biol"},{"issue":"7","key":"pcbi.1010718.ref214","doi-asserted-by":"crossref","first-page":"e01887","DOI":"10.1002\/ecs2.1887","article-title":"Open access increases citations of papers in ecology","volume":"8","author":"M Tang","year":"2017","journal-title":"Ecosphere"},{"key":"pcbi.1010718.ref215","unstructured":"Scimago Journal Ranking. Molecular biology open access journals. 2022 [cited 2022 Jun 26]. Available from: https:\/\/www.scimagojr.com\/journalrank.php?category=1312&openaccess=true&type=j."},{"key":"pcbi.1010718.ref216","unstructured":"Scimago Journal Ranking. Health informatics open access journals. 2022 [cited 2022 Jun 26]. 
Available from: https:\/\/www.scimagojr.com\/journalrank.php?openaccess=true&type=j&category=2718."},{"issue":"1","key":"pcbi.1010718.ref217","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13040-017-0155-3","article-title":"Ten quick tips for machine learning in computational biology","volume":"10","author":"D Chicco","year":"2017","journal-title":"BioData Mining"},{"issue":"8","key":"pcbi.1010718.ref218","doi-asserted-by":"crossref","first-page":"e1010348","DOI":"10.1371\/journal.pcbi.1010348","article-title":"Nine quick tips for pathway enrichment analysis.","volume":"18","author":"D Chicco","year":"2022","journal-title":"PLoS Comput Biol"},{"issue":"4","key":"pcbi.1010718.ref219","doi-asserted-by":"crossref","first-page":"693","DOI":"10.1093\/bib\/bbw134","article-title":"Top considerations for creating bioinformatics software documentation","volume":"19","author":"M Karimzadeh","year":"2018","journal-title":"Brief Bioinform"},{"issue":"7","key":"pcbi.1010718.ref220","doi-asserted-by":"crossref","first-page":"e1000424","DOI":"10.1371\/journal.pcbi.1000424","article-title":"A quick guide to organizing computational biology projects","volume":"5","author":"WS Noble","year":"2009","journal-title":"PLoS Comput Biol"},{"issue":"9","key":"pcbi.1010718.ref221","doi-asserted-by":"crossref","first-page":"e1004385","DOI":"10.1371\/journal.pcbi.1004385","article-title":"Ten simple rules for a computational biologist\u2019s laboratory notebook","volume":"11","author":"S Schnell","year":"2015","journal-title":"PLoS Comput Biol"},{"issue":"10","key":"pcbi.1010718.ref222","doi-asserted-by":"crossref","first-page":"e1003285","DOI":"10.1371\/journal.pcbi.1003285","article-title":"Ten simple rules for reproducible computational research.","volume":"9","author":"GK Sandve","year":"2013","journal-title":"PLoS Comput 
Biol"},{"issue":"4","key":"pcbi.1010718.ref223","doi-asserted-by":"crossref","first-page":"e1005412","DOI":"10.1371\/journal.pcbi.1005412","article-title":"Ten simple rules for making research software more robust.","volume":"13","author":"M Taschuk","year":"2017","journal-title":"PLoS Comput Biol"},{"issue":"1070","key":"pcbi.1010718.ref224","article-title":"RNA-seq workflow: gene-level exploratory analysis and differential expression.","volume":"4","author":"MI Love","year":"2015","journal-title":"F1000Res."},{"key":"pcbi.1010718.ref225","author":"MI Love","year":"2019","journal-title":"RNA-seq workflow: gene-level exploratory analysis and differential expression"}],"container-title":["PLOS Computational Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1010718","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,3,18]],"date-time":"2023-03-18T03:50:17Z","timestamp":1679111417000},"score":1,"resource":{"primary":{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1010718"}},"subtitle":[],"editor":[{"given":"Francis","family":"Ouellette","sequence":"first","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2022,12,15]]},"references-count":225,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2022,12,15]]}},"URL":"https:\/\/doi.org\/10.1371\/journal.pcbi.1010718","relation":{},"ISSN":["1553-7358"],"issn-type":[{"value":"1553-7358","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,12,15]]}}}