{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,14]],"date-time":"2025-12-14T15:57:56Z","timestamp":1765727876985,"version":"3.37.3"},"reference-count":47,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2019,11,21]],"date-time":"2019-11-21T00:00:00Z","timestamp":1574294400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/501100001659","name":"German Research Foundation","doi-asserted-by":"publisher","award":["BO3139\/4-2","SA580\/8-2"],"award-info":[{"award-number":["BO3139\/4-2","SA580\/8-2"]}],"id":[{"id":"10.13039\/501100001659","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,12,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Data integration, i.e. the use of different sources of information for data analysis, is becoming one of the most important topics in modern statistics. Especially in, but not limited to, biomedical applications, a relevant issue is the combination of low-dimensional (e.g. clinical data) and high-dimensional (e.g. molecular data such as gene expressions) data sources in a prediction model. Not only the different characteristics of the data, but also the complex correlation structure within and between the two data sources, pose challenging issues. In this paper, we investigate these issues via simulations, providing some useful insight into strategies to combine low- and high-dimensional data in a regression prediction model. 
In particular, we focus on the effect of the correlation structure on the results, while accounting for the influence of our specific choices in the design of the simulation study.<\/jats:p>","DOI":"10.1093\/bib\/bbz136","type":"journal-article","created":{"date-parts":[[2019,10,8]],"date-time":"2019-10-08T19:26:32Z","timestamp":1570562792000},"page":"1904-1919","source":"Crossref","is-referenced-by-count":11,"title":["Combining clinical and molecular data in regression prediction models: insights from a simulation study"],"prefix":"10.1093","volume":"21","author":[{"given":"Riccardo","family":"De Bin","sequence":"first","affiliation":[{"name":"Department of Mathematics, University of Oslo, Norway"}]},{"given":"Anne-Laure","family":"Boulesteix","sequence":"additional","affiliation":[{"name":"Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Germany"}]},{"given":"Axel","family":"Benner","sequence":"additional","affiliation":[{"name":"Division of Biostatistics, German Cancer Research Centre of Heidelberg, Germany"}]},{"given":"Natalia","family":"Becker","sequence":"additional","affiliation":[{"name":"Division of Biostatistics, German Cancer Research Centre of Heidelberg, Germany"}]},{"given":"Willi","family":"Sauerbrei","sequence":"additional","affiliation":[{"name":"Institute of Medical Biometry and Statistics, University of Freiburg, Germany"}]}],"member":"286","published-online":{"date-parts":[[2019,11,21]]},"reference":[{"key":"2020120619044503800_ref1","doi-asserted-by":"crossref","first-page":"51","DOI":"10.1186\/1741-7015-10-51","article-title":"Reporting recommendations for tumor marker prognostic studies (REMARK): explanation and elaboration","volume":"10","author":"Altman","year":"2012","journal-title":"BMC Med"},{"volume-title":"GAMBoost: Generalized Linear And Additive Models by Likelihood Based 
Boosting","year":"2013","author":"Binder","key":"2020120619044503800_ref2"},{"key":"2020120619044503800_ref3","doi-asserted-by":"crossref","first-page":"14","DOI":"10.1186\/1471-2105-9-14","article-title":"Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models","volume":"9","author":"Binder","year":"2008","journal-title":"BMC Bioinformatics"},{"key":"2020120619044503800_ref4","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1093\/bib\/bbq085","article-title":"Added predictive value of high-throughput molecular data to clinical data and its validation","volume":"12","author":"Boulesteix","year":"2011","journal-title":"Brief Bioinform"},{"issue":"138","key":"2020120619044503800_ref5","article-title":"Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies","volume":"17","author":"Boulesteix","year":"2017","journal-title":"BMC Med Res Methodol"},{"key":"2020120619044503800_ref6","doi-asserted-by":"crossref","first-page":"216","DOI":"10.1002\/bimj.201700129","article-title":"On the necessity and design of studies comparing statistical methods","volume":"60","author":"Boulesteix","year":"2018","journal-title":"Biom J"},{"key":"2020120619044503800_ref7","doi-asserted-by":"crossref","first-page":"413","DOI":"10.1186\/1471-2105-10-413","article-title":"Survival prediction from clinico-genomic models\u2014a comparative study","volume":"10","author":"B\u00f8velstad","year":"2009","journal-title":"BMC Bioinformatics"},{"key":"2020120619044503800_ref8","doi-asserted-by":"crossref","first-page":"232","DOI":"10.1214\/10-AOAS388","article-title":"Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection","volume":"5","author":"Breheny","year":"2011","journal-title":"Ann Appl 
Stat"},{"key":"2020120619044503800_ref9","doi-asserted-by":"crossref","first-page":"477","DOI":"10.1214\/07-STS242","article-title":"Boosting algorithms: regularization, prediction and model fitting","volume":"22","author":"B\u00fchlmann","year":"2007","journal-title":"Stat Sci"},{"key":"2020120619044503800_ref10","doi-asserted-by":"crossref","first-page":"324","DOI":"10.1198\/016214503000125","article-title":"Boosting with the L$_2$ loss: regression and classification","volume":"98","author":"B\u00fchlmann","year":"2003","journal-title":"J Am Stat Assoc"},{"key":"2020120619044503800_ref11","doi-asserted-by":"crossref","first-page":"4279","DOI":"10.1002\/sim.2673","article-title":"The design of simulation studies in medical statistics","volume":"25","author":"Burton","year":"2006","journal-title":"Stat Med"},{"key":"2020120619044503800_ref12","doi-asserted-by":"crossref","first-page":"280","DOI":"10.1093\/bib\/bbu006","article-title":"Translational research platforms integrating clinical and omics data: a review of publicly available solutions","volume":"16","author":"Canuel","year":"2014","journal-title":"Brief Bioinform"},{"key":"2020120619044503800_ref13","doi-asserted-by":"crossref","first-page":"e59962","DOI":"10.1371\/journal.pone.0059962","article-title":"Expression levels of obesity-related genes are associated with weight change in kidney transplant recipients","volume":"8","author":"Cashion","year":"2013","journal-title":"PLoS ONE"},{"key":"2020120619044503800_ref14","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1007\/s00180-015-0642-2","article-title":"Boosting in Cox regression: a comparison between the likelihood-based and the model-based approaches with focus on the R-packages CoxBoost and mboost","volume":"31","author":"De Bin","year":"2016","journal-title":"Comput Stat"},{"key":"2020120619044503800_ref15","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1186\/1471-2105-12-49","article-title":"A novel approach to the clustering of 
microarray data via nonparametric density estimation","volume":"12","author":"De Bin","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2020120619044503800_ref16","doi-asserted-by":"crossref","first-page":"5310","DOI":"10.1002\/sim.6246","article-title":"Investigating the prediction ability of survival models based on both clinical and omics data: two case studies","volume":"33","author":"De Bin","year":"2014","journal-title":"Stat Med"},{"key":"2020120619044503800_ref17","first-page":"68","article-title":"Polychoric and polyserial correlations","volume-title":"The Encyclopedia of Statistical Science","author":"Drasgow","year":"1986"},{"key":"2020120619044503800_ref18","doi-asserted-by":"crossref","first-page":"1348","DOI":"10.1198\/016214501753382273","article-title":"Variable selection via nonconcave penalized likelihood and its oracle properties","volume":"96","author":"Fan","year":"2001","journal-title":"J Am Stat Assoc"},{"key":"2020120619044503800_ref19","doi-asserted-by":"crossref","first-page":"849","DOI":"10.1111\/j.1467-9868.2008.00674.x","article-title":"Sure independence screening for ultrahigh dimensional feature space","volume":"70","author":"Fan","year":"2008","journal-title":"J Royal Stat Soc B"},{"key":"2020120619044503800_ref20","doi-asserted-by":"crossref","first-page":"531","DOI":"10.1111\/rssb.12001","article-title":"Tuning parameter selection in high dimensional penalized likelihood","volume":"75","author":"Fan","year":"2013","journal-title":"J Royal Stat Soc B"},{"key":"2020120619044503800_ref21","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v033.i01","article-title":"Regularization paths for generalized linear models via coordinate descent","volume":"33","author":"Friedman","year":"2010","journal-title":"J Stat Softw"},{"year":"2014","author":"Goeman","journal-title":"Penalized: L1 (Lasso and Fused Lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox 
model","key":"2020120619044503800_ref22"},{"year":"2013","author":"G\u2019Sell","journal-title":"False variable selection rates in regression","key":"2020120619044503800_ref23"},{"key":"2020120619044503800_ref24","doi-asserted-by":"crossref","first-page":"1290","DOI":"10.1002\/sim.7576","article-title":"Fridge: focused fine-tuning of ridge regression for personalized predictions","volume":"37","author":"Hellton","year":"2018","journal-title":"Stat Med"},{"key":"2020120619044503800_ref25","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1016\/0024-3795(88)90223-6","article-title":"Computing a nearest symmetric positive semidefinite matrix","volume":"103","author":"Higham","year":"1988","journal-title":"Linear Algebra Appl"},{"key":"2020120619044503800_ref26","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1080\/00401706.1970.10488634","article-title":"Ridge regression: biased estimation for nonorthogonal problems","volume":"12","author":"Hoerl","year":"1970","journal-title":"Technometrics"},{"key":"2020120619044503800_ref27","doi-asserted-by":"crossref","first-page":"2828","DOI":"10.1093\/bioinformatics\/btl462","article-title":"Model-based boosting in high dimensions","volume":"22","author":"Hothorn","year":"2006","journal-title":"Bioinformatics"},{"volume-title":"mboost: Model-Based Boosting","year":"2014","author":"Hothorn, Buehlmann","key":"2020120619044503800_ref28"},{"key":"2020120619044503800_ref29","first-page":"362","article-title":"Parameter tuning is a key part of dimensionality reduction via deep variational autoencoders for single cell RNA transcriptomics","author":"Hu","year":"2019","journal-title":"Pac Symp Biocomput"},{"key":"2020120619044503800_ref30","first-page":"178","article-title":"The importance of knowing when to stop. A sequential stopping rule for component-wise gradient boosting","volume":"51","author":"Mayr","year":"2012","journal-title":"Methods Inf Med"},{"key":"2020120619044503800_ref31","doi-asserted-by":"crossref","first-page":"488","DOI":"10.1016\/S0140-6736(05)17866-0","article-title":"Prediction of cancer outcome with microarrays: a multiple random validation strategy","volume":"365","author":"Michiels","year":"2005","journal-title":"Lancet"},{"key":"2020120619044503800_ref32","article-title":"R: A Language and Environment for Statistical Computing","volume-title":"R Foundation for Statistical Computing, Vienna, Austria","author":"R Core Team","year":"2018"},{"key":"2020120619044503800_ref33","doi-asserted-by":"crossref","first-page":"49","DOI":"10.2307\/1268382","article-title":"Inflation of r$^2$ in best subset regression","volume":"22","author":"Rencher","year":"1980","journal-title":"Technometrics"},{"key":"2020120619044503800_ref34","doi-asserted-by":"crossref","first-page":"1090","DOI":"10.1038\/s41467-018-03424-4","article-title":"A comprehensive evaluation of module detection methods for gene expression data","volume":"9","author":"Saelens","year":"2018","journal-title":"Nat Commun"},{"key":"2020120619044503800_ref35","doi-asserted-by":"crossref","first-page":"1195","DOI":"10.1007\/s00180-017-0773-8","article-title":"On the choice and influence of the number of boosting steps for high-dimensional linear Cox-models","volume":"33","author":"Seibold","year":"2018","journal-title":"Comput Stat"},{"key":"2020120619044503800_ref36","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1093\/bib\/bbr001","article-title":"Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data","volume":"12","author":"Simon","year":"2011","journal-title":"Brief Bioinform"},{"key":"2020120619044503800_ref37","doi-asserted-by":"crossref","first-page":"1896","DOI":"10.1177\/0962280215592269","article-title":"Performance of methods for meta-analysis of diagnostic test accuracy with few studies or sparse 
data","volume":"26","author":"Takwoingi","year":"2017","journal-title":"Stat Methods Med Res"},{"key":"2020120619044503800_ref38","doi-asserted-by":"crossref","first-page":"673","DOI":"10.1007\/s11222-017-9754-6","article-title":"Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates","volume":"28","author":"Thomas","year":"2018","journal-title":"Stat Comput"},{"key":"2020120619044503800_ref39","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1111\/j.2517-6161.1996.tb02080.x","article-title":"Regression shrinkage and selection via the lasso","volume":"58","author":"Tibshirani","year":"1996","journal-title":"J Royal Stat Soc B"},{"key":"2020120619044503800_ref40","doi-asserted-by":"crossref","first-page":"385","DOI":"10.1186\/s12859-014-0385-z","article-title":"Comparison of classification methods that combine clinical data and high-dimensional mass spectrometry data","volume":"15","author":"Truntzer","year":"2014","journal-title":"BMC Bioinformatics"},{"key":"2020120619044503800_ref41","doi-asserted-by":"crossref","first-page":"961","DOI":"10.1111\/j.1541-0420.2006.00578.x","article-title":"Generalized additive modeling with implicit variable selection by likelihood-based boosting","volume":"62","author":"Tutz","year":"2006","journal-title":"Biometrics"},{"key":"2020120619044503800_ref42","doi-asserted-by":"crossref","first-page":"571","DOI":"10.1007\/s10545-017-0128-1","article-title":"The role of the clinician in the multi-omics era: are you ready","volume":"41","author":"van Karnebeek","year":"2018","journal-title":"J Inherit Metab Dis"},{"key":"2020120619044503800_ref43","doi-asserted-by":"crossref","first-page":"125","DOI":"10.1186\/s13059-019-1738-8","article-title":"Essential guidelines for computational method benchmarking","volume":"20","author":"Weber","year":"2019","journal-title":"Genome Biol"},{"key":"2020120619044503800_ref44","first-page":"121","article-title":"UMPIRE: Ultimate 
microarray prediction, inference, and reality engine","volume-title":"BIOTECHNO 2011, The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies","author":"Zhang","year":"2011"},{"key":"2020120619044503800_ref45","first-page":"44","article-title":"Simulating gene expression data to estimate sample size for class and biomarker discovery","volume":"4","author":"Zhang","year":"2012","journal-title":"Int J Adv Life Sci"},{"key":"2020120619044503800_ref46","doi-asserted-by":"crossref","first-page":"16954","DOI":"10.1038\/s41598-017-17031-8","article-title":"Integrating clinical and multiple omics data for prognostic assessment across human cancers","volume":"7","author":"Zhu","year":"2017","journal-title":"Sci Rep"},{"key":"2020120619044503800_ref47","doi-asserted-by":"crossref","first-page":"301","DOI":"10.1111\/j.1467-9868.2005.00503.x","article-title":"Regularization and variable selection via the elastic net","volume":"67","author":"Zou","year":"2005","journal-title":"J Royal Stat Soc B"}],"container-title":["Briefings in 
Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bib\/article-pdf\/21\/6\/1904\/34672071\/bbz136.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/bib\/article-pdf\/21\/6\/1904\/34672071\/bbz136.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,24]],"date-time":"2024-07-24T19:03:07Z","timestamp":1721847787000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/21\/6\/1904\/5627730"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,11,21]]},"references-count":47,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2019,11,21]]},"published-print":{"date-parts":[[2020,12,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbz136","relation":{},"ISSN":["1467-5463","1477-4054"],"issn-type":[{"type":"print","value":"1467-5463"},{"type":"electronic","value":"1477-4054"}],"subject":[],"published-other":{"date-parts":[[2020,11]]},"published":{"date-parts":[[2019,11,21]]}}}