{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,24]],"date-time":"2025-10-24T21:08:26Z","timestamp":1761340106298,"version":"3.37.3"},"reference-count":33,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2020,7,6]],"date-time":"2020-07-06T00:00:00Z","timestamp":1593993600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,7,6]],"date-time":"2020-07-06T00:00:00Z","timestamp":1593993600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001659","name":"Deutsche Forschungsgemeinschaft","doi-asserted-by":"publisher","award":["BO3139\/4-3"],"award-info":[{"award-number":["BO3139\/4-3"]}],"id":[{"id":"10.13039\/501100001659","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002347","name":"Bundesministerium f\u00fcr Bildung und Forschung","doi-asserted-by":"publisher","award":["01IS18036A"],"award-info":[{"award-number":["01IS18036A"]}],"id":[{"id":"10.13039\/501100002347","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Classif"],"published-print":{"date-parts":[[2021,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In many application areas, prediction rules trained based on high-dimensional data are subsequently applied to make predictions for observations from other sources, but they do not always perform well in this setting. This is because data sets from different sources can feature (slightly) differing distributions, even if they come from similar populations. In the context of high-dimensional data and beyond, most prediction methods involve one or several tuning parameters. 
Their values are commonly chosen by maximizing the cross-validated prediction performance on the training data. This procedure, however, implicitly presumes that the data to which the prediction rule will ultimately be applied follow the same distribution as the training data. If this is not the case, less complex prediction rules that slightly underfit the training data may be preferable. Indeed, a tuning parameter not only controls the degree of adjustment of a prediction rule to the training data, but also, more generally, the degree of adjustment to the <jats:italic>distribution of<\/jats:italic> the training data. On the basis of this idea, in this paper we compare various approaches, including new procedures, for choosing tuning parameter values that lead to better generalizing prediction rules than those obtained based on cross-validation. Most of these approaches use an external validation data set. In our extensive comparison study based on a large collection of 15 transcriptomic data sets, tuning on external data and robust tuning with a tuned robustness parameter are the two approaches leading to better generalizing prediction rules.<\/jats:p>","DOI":"10.1007\/s00357-020-09368-z","type":"journal-article","created":{"date-parts":[[2020,7,6]],"date-time":"2020-07-06T10:02:52Z","timestamp":1594029772000},"page":"212-231","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Improved Outcome Prediction Across Data Sources Through Robust Parameter 
Tuning"],"prefix":"10.1007","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8298-0409","authenticated-orcid":false,"given":"Nicole","family":"Ellenbach","sequence":"first","affiliation":[]},{"given":"Anne-Laure","family":"Boulesteix","sequence":"additional","affiliation":[]},{"given":"Bernd","family":"Bischl","sequence":"additional","affiliation":[]},{"given":"Kristian","family":"Unger","sequence":"additional","affiliation":[]},{"given":"Roman","family":"Hornung","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,7,6]]},"reference":[{"issue":"12","key":"9368_CR1","doi-asserted-by":"publisher","first-page":"i105","DOI":"10.1093\/bioinformatics\/btu279","volume":"30","author":"C Bernau","year":"2014","unstructured":"Bernau, C., Riester, M., Boulesteix, A. L., Parmigiani, G., Huttenhower, C., Waldron, L., & Trippa, L. (2014). Cross-study validation for the assessment of prediction algorithms. Bioinformatics, 30(12), i105\u2013i112.","journal-title":"Bioinformatics"},{"issue":"170","key":"9368_CR2","first-page":"1","volume":"17","author":"B Bischl","year":"2016","unstructured":"Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Casalicchio, G., & Jones, Z.M. (2016). mlr: machine learning in R. Journal of Machine Learning Research, 17(170), 1\u20135.","journal-title":"Journal of Machine Learning Research"},{"key":"9368_CR3","unstructured":"Bischl, B., Richter, J., Bossek, J., Horn, D., Thomas, J., & Lang, M. (2017). mlrMBO: a modular framework for model-based optimization of expensive black-box functions, arXiv:1703.03373."},{"key":"9368_CR4","doi-asserted-by":"publisher","first-page":"826","DOI":"10.1016\/S0895-4356(03)00207-5","volume":"56","author":"SE Bleeker","year":"2003","unstructured":"Bleeker, S. E., Moll, H. A., Steyerberg, E. W., Donders, A. R. T., Derksen-Lubsen, G., Grobbee, D. E., & Moons, K. G. M. (2003). 
External validation is necessary in prediction research: a clinical example. Journal of Clinical Epidemiology, 56, 826\u2013832.","journal-title":"Journal of Clinical Epidemiology"},{"key":"9368_CR5","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","volume":"45","author":"L Breiman","year":"2001","unstructured":"Breiman, L. (2001). Random forests. Machine Learning, 45, 5\u201332.","journal-title":"Machine Learning"},{"key":"9368_CR6","doi-asserted-by":"publisher","first-page":"324","DOI":"10.1198\/016214503000125","volume":"98","author":"P Buehlmann","year":"2003","unstructured":"Buehlmann, P., & Yu, B. (2003). Boosting with the l2 loss: regression and classification. Journal of the American Statistical Association, 98, 324\u2013339.","journal-title":"Journal of the American Statistical Association"},{"key":"9368_CR7","doi-asserted-by":"publisher","first-page":"131","DOI":"10.1023\/A:1012450327387","volume":"46","author":"O Chapelle","year":"2002","unstructured":"Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131\u2013159.","journal-title":"Machine Learning"},{"key":"9368_CR8","unstructured":"Claesen, M., & De Moor, B. (2015). Hyperparameter search in machine learning, arXiv:1502.02127."},{"key":"9368_CR9","doi-asserted-by":"publisher","first-page":"40","DOI":"10.1186\/1471-2288-14-40","volume":"14","author":"GS Collins","year":"2014","unstructured":"Collins, G. S., de Groot, J. A., Dutton, S., Omar, O., Shanyinde, M., Tajar, A., Voysey, M., Wharton, R., Yu, L. M., Moons, K. G., & Altman, D. G. (2014). External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Medical Research Methodology, 14, 40.","journal-title":"BMC Medical Research Methodology"},{"key":"9368_CR10","first-page":"273","volume":"20","author":"C Cortes","year":"1995","unstructured":"Cortes, C., & Vapnik, V. 
(1995). Support-vector networks. Machine Learning, 20, 273\u2013297.","journal-title":"Machine Learning"},{"key":"9368_CR11","doi-asserted-by":"publisher","first-page":"219","DOI":"10.1093\/biostatistics\/kxy035","volume":"21","author":"F Dondelinger","year":"2020","unstructured":"Dondelinger, F., Mukherjee, S., & The Alzheimer\u2019s Disease Neuroimaging Initiative. (2020). The joint lasso: high-dimensional regression for group structured data. Biostatistics, 21, 219\u2013235.","journal-title":"Biostatistics"},{"issue":"1","key":"9368_CR12","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18637\/jss.v033.i01","volume":"33","author":"J Friedman","year":"2010","unstructured":"Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1\u201322.","journal-title":"Journal of Statistical Software"},{"key":"9368_CR13","doi-asserted-by":"publisher","first-page":"498","DOI":"10.1016\/j.tibtech.2017.02.012","volume":"35","author":"WWB Goh","year":"2017","unstructured":"Goh, W. W. B., Wang, W., & Wong, L. (2017). Why batch effects matter in omics data, and how to avoid them. Trends in Biotechnology, 35, 498\u2013507.","journal-title":"Trends in Biotechnology"},{"key":"9368_CR14","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1080\/00401706.1970.10488634","volume":"12","author":"AE Hoerl","year":"1970","unstructured":"Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12, 55\u201367.","journal-title":"Technometrics"},{"key":"9368_CR15","doi-asserted-by":"publisher","first-page":"95","DOI":"10.1186\/s12874-015-0088-9","volume":"15","author":"R Hornung","year":"2015","unstructured":"Hornung, R., Bernau, C., Truntzer, C., Wilson, R., Stadler, T., & Boulesteix, A. L. (2015). 
A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization. BMC Medical Research Methodology, 15, 95.","journal-title":"BMC Medical Research Methodology"},{"key":"9368_CR16","unstructured":"Hornung, R. (2016). Preparation of high-dimensional biomedical data with a focus on prediction and error estimation. Dissertation: University of Munich."},{"key":"9368_CR17","doi-asserted-by":"crossref","first-page":"397","DOI":"10.1093\/bioinformatics\/btw650","volume":"33","author":"R Hornung","year":"2017","unstructured":"Hornung, R., Causeur, D., Bernau, C., & Boulesteix, A. L. (2017). Improving cross-study prediction through addon batch effect adjustment or addon normalization. Bioinformatics, 33, 397\u2013404.","journal-title":"Bioinformatics"},{"key":"9368_CR18","unstructured":"Hothorn, T., Buehlmann, P., Kneib, T., Schmid, M., & Hofner, B. (2018). mboost: model-based boosting, R package version 2.9-1."},{"key":"9368_CR19","doi-asserted-by":"publisher","first-page":"345","DOI":"10.1038\/nmeth756","volume":"2","author":"RA Irizarry","year":"2005","unstructured":"Irizarry, R. A., Warren, D., Spencer, F., Kim, I. F., Biswal, S., Frank, B. C., Gabrielson, E., Garcia, J. G., Geoghegan, J., Germino, G., Griffin, C., Hilmer, S. C., Hoffman, E., Jedlicka, A. E., Kawasaki, E., Martinez-Murillo, F., Morsberger, L., Lee, H., Petersen, D., Quackenbush, J., Scott, A., Wilson, M., Yang, Y., Ye, S. Q., & Yu, W. (2005). Multiple-laboratory comparison of microarray platforms. Nature Methods, 2, 345\u2013350.","journal-title":"Nature Methods"},{"key":"9368_CR20","doi-asserted-by":"publisher","first-page":"733","DOI":"10.1038\/nrg2825","volume":"11","author":"JT Leek","year":"2010","unstructured":"Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K., & Irizarry, R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. 
Nature Reviews Genetics, 11, 733\u2013739.","journal-title":"Nature Reviews Genetics"},{"key":"9368_CR21","doi-asserted-by":"publisher","first-page":"1817","DOI":"10.1016\/j.eswa.2007.08.088","volume":"35","author":"SW Lin","year":"2008","unstructured":"Lin, S. W., Ying, K. C., Chen, S. C., & Lee, Z. J. (2008). Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications, 35, 1817\u20131824.","journal-title":"Expert Systems with Applications"},{"key":"9368_CR22","doi-asserted-by":"publisher","first-page":"550","DOI":"10.1186\/s13059-014-0550-8","volume":"15","author":"MI Love","year":"2014","unstructured":"Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.","journal-title":"Genome Biology"},{"key":"9368_CR23","doi-asserted-by":"publisher","first-page":"1415","DOI":"10.1016\/j.protcy.2016.05.165","volume":"24","author":"A Mathews","year":"2016","unstructured":"Mathews, A., Simi, I., & Kizhakkethottam, J. J. (2016). Efficient diagnosis of cancer from histopathological images by eliminating batch effects. Procedia Technology, 24, 1415\u20131422.","journal-title":"Procedia Technology"},{"key":"9368_CR24","unstructured":"Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., & Leisch, F. (2019). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, r package version 1.7-0.1."},{"key":"9368_CR25","doi-asserted-by":"publisher","first-page":"128","DOI":"10.1186\/s12859-017-1553-8","volume":"18","author":"F Rohart","year":"2017","unstructured":"Rohart, F., Eslami, A., Matigian, N., Bougeard, S., & L\u00ea Cao, K. A. (2017). MINT: A multivariate integrative method to identify reproducible molecular signatures across independent experiments and platforms. 
BMC Bioinformatics, 18, 128.","journal-title":"BMC Bioinformatics"},{"key":"9368_CR26","doi-asserted-by":"crossref","unstructured":"Scherer, A. (Ed.). (2009). Batch effects and noise in microarray experiments: sources and solutions. Wiley Series in Probability and Statistics. Hoboken: Wiley.","DOI":"10.1002\/9780470685983"},{"key":"9368_CR27","doi-asserted-by":"publisher","first-page":"25","DOI":"10.1016\/j.jclinepi.2014.09.007","volume":"68","author":"GCM Siontis","year":"2015","unstructured":"Siontis, G. C. M., Tzoulaki, I., Castaldi, P. J., & Ioannidis, J. P. A. (2015). External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. Journal of Clinical Epidemiology, 68, 25\u201334.","journal-title":"Journal of Clinical Epidemiology"},{"key":"9368_CR28","unstructured":"Snoek, J., Larochelle, H., & Adams, R.P. (2012). Practical Bayesian optimization of machine learning algorithms. In Pereira, F., Burges, C.J.C., Bottou, L., & Weinberger, K.Q. (Eds.) Advances in Neural Information Processing Systems, (Vol. 25 pp. 2951\u20132959): Curran Associates, Inc."},{"key":"9368_CR29","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1111\/j.2517-6161.1996.tb02080.x","volume":"58","author":"R Tibshirani","year":"1996","unstructured":"Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267\u2013288.","journal-title":"Journal of the Royal Statistical Society, Series B"},{"key":"9368_CR30","doi-asserted-by":"publisher","first-page":"351","DOI":"10.1186\/s12859-017-1756-z","volume":"18","author":"JA Tom","year":"2017","unstructured":"Tom, J. A., Reeder, J., Forrest, W. F., Graham, R. R., Hunkapiller, J., Behrens, T. W., & Bhangale, T. R. (2017). Identifying and mitigating batch effects in whole genome sequencing data. 
BMC Bioinformatics, 18, 351.","journal-title":"BMC Bioinformatics"},{"key":"9368_CR31","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1186\/1471-2105-7-91","volume":"7","author":"S Varma","year":"2006","unstructured":"Varma, S., & Simon, R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7, 91.","journal-title":"BMC Bioinformatics"},{"issue":"1","key":"9368_CR32","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18637\/jss.v077.i01","volume":"77","author":"MN Wright","year":"2017","unstructured":"Wright, M.N., & Ziegler, A. (2017). ranger: a fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1\u201317.","journal-title":"Journal of Statistical Software"},{"key":"9368_CR33","doi-asserted-by":"publisher","first-page":"253","DOI":"10.1093\/biostatistics\/kxy044","volume":"21","author":"Y Zhang","year":"2020","unstructured":"Zhang, Y., Bernau, C., Parmigiani, G., & Waldron, L. (2020). The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models. 
Biostatistics, 21, 253\u2013268.","journal-title":"Biostatistics"}],"container-title":["Journal of Classification"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00357-020-09368-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00357-020-09368-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00357-020-09368-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,9]],"date-time":"2024-08-09T09:22:35Z","timestamp":1723195355000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00357-020-09368-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,6]]},"references-count":33,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2021,7]]}},"alternative-id":["9368"],"URL":"https:\/\/doi.org\/10.1007\/s00357-020-09368-z","relation":{},"ISSN":["0176-4268","1432-1343"],"issn-type":[{"type":"print","value":"0176-4268"},{"type":"electronic","value":"1432-1343"}],"subject":[],"published":{"date-parts":[[2020,7,6]]},"assertion":[{"value":"6 July 2020","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Compliance with Ethical Standards"}},{"value":"The authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of Interest"}}]}}