{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,13]],"date-time":"2026-05-13T04:03:58Z","timestamp":1778645038017,"version":"3.51.4"},"reference-count":42,"publisher":"Oxford University Press (OUP)","issue":"11","license":[{"start":{"date-parts":[[2020,11,27]],"date-time":"2020-11-27T00:00:00Z","timestamp":1606435200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"Division of Mathematical Sciences, National Science Foundation","award":["1810829"],"award-info":[{"award-number":["1810829"]}]},{"DOI":"10.13039\/100000054","name":"The National Cancer Institute","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100000054","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"crossref","award":["4P30CA006516-51"],"award-info":[{"award-number":["4P30CA006516-51"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"crossref","award":["5R01GM127430-02"],"award-info":[{"award-number":["5R01GM127430-02"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000002","name":"NIH","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,7,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Genomic data are often produced in batches due to practical restrictions, which may lead to unwanted variation in data caused by discrepancies across batches. 
Such \u2018batch effects\u2019 often have a negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software, which merges the data from different batches, then estimates batch effects and removes them from the data. Here, we focus on classification and prediction problems, and propose a different strategy based on ensemble learning. We first develop prediction models within each batch, then integrate them through ensemble weighting methods.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We provide a systematic comparison between these two strategies using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treat as simulated batches. We then use the two methods to develop a genomic classifier for the binary indicator of disease status. We evaluate the accuracy of prediction in another independent study targeting a different population cohort. We observed that in independent validation, while merging followed by batch adjustment provides better discrimination at low levels of heterogeneity, our ensemble learning strategy achieves more robust performance, especially at high severity of batch effects. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>The data underlying this article are available in the article and in its online supplementary material. 
Processed data is available in the Github repository with implementation code, at https:\/\/github.com\/zhangyuqing\/bea_ensemble.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa986","type":"journal-article","created":{"date-parts":[[2020,11,13]],"date-time":"2020-11-13T15:13:22Z","timestamp":1605280402000},"page":"1521-1527","source":"Crossref","is-referenced-by-count":20,"title":["Robustifying genomic classifiers to batch effects via ensemble learning"],"prefix":"10.1093","volume":"37","author":[{"given":"Yuqing","family":"Zhang","sequence":"first","affiliation":[{"name":"Clinical Bioinformatics, Gilead Sciences, Inc. , Foster City, CA 94404, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Prasad","family":"Patil","sequence":"additional","affiliation":[{"name":"Department of Biostatistics, Boston University School of Public Health , Boston, MA 02118, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6247-6595","authenticated-orcid":false,"given":"W. Evan","family":"Johnson","sequence":"additional","affiliation":[{"name":"Department of Biostatistics, Boston University School of Public Health , Boston, MA 02118, USA"},{"name":"Division of Computational Biomedicine, Boston University School of Medicine , Boston, MA 02118, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Giovanni","family":"Parmigiani","sequence":"additional","affiliation":[{"name":"Department of Data Sciences, Dana-Farber Cancer Institute , Boston, MA 02215, USA"},{"name":"Department of Biostatistics, Harvard T.H. 
Chan School of Public Health , Boston, MA 02115, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2020,11,27]]},"reference":[{"key":"2023051709443350200_btaa986-B1","doi-asserted-by":"crossref","first-page":"1617","DOI":"10.1084\/jem.20052302","article-title":"Tuberculosis in children and adults: two distinct genetic diseases","volume":"202","author":"Alca\u00efs","year":"2005","journal-title":"J. Exp. Med"},{"key":"2023051709443350200_btaa986-B2","doi-asserted-by":"crossref","first-page":"1712","DOI":"10.1056\/NEJMoa1303657","article-title":"Diagnosis of childhood tuberculosis and host RNA expression in Africa","volume":"370","author":"Anderson","year":"2014","journal-title":"N. Engl. J. Med"},{"key":"2023051709443350200_btaa986-B3","doi-asserted-by":"crossref","first-page":"419","DOI":"10.1111\/bju.12789","article-title":"Effect of a genomic classifier test on clinical practice decisions for patients with high-risk prostate cancer after surgery","volume":"115","author":"Badani","year":"2015","journal-title":"BJU Int"},{"key":"2023051709443350200_btaa986-B4","doi-asserted-by":"crossref","first-page":"105","DOI":"10.1093\/bioinformatics\/btg385","article-title":"Adjustment of systematic microarray data biases","volume":"20","author":"Benito","year":"2004","journal-title":"Bioinformatics"},{"key":"2023051709443350200_btaa986-B5","doi-asserted-by":"crossref","first-page":"i105","DOI":"10.1093\/bioinformatics\/btu279","article-title":"Cross-study validation for the assessment of prediction algorithms","volume":"30","author":"Bernau","year":"2014","journal-title":"Bioinformatics"},{"key":"2023051709443350200_btaa986-B6","doi-asserted-by":"crossref","first-page":"264","DOI":"10.1016\/j.asoc.2018.10.005","article-title":"Comparison of common machine learning models for classification of tuberculosis using transcriptional biomarkers from integrated 
datasets","volume":"74","author":"Bobak","year":"2019","journal-title":"Appl. Soft Comput"},{"key":"2023051709443350200_btaa986-B7","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1007\/BF00117832","article-title":"Stacked regressions","volume":"24","author":"Breiman","year":"1996","journal-title":"Mach. Learn"},{"key":"2023051709443350200_btaa986-B8","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. Learn"},{"key":"2023051709443350200_btaa986-B9","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1038\/nbt.4096","article-title":"Integrating single-cell transcriptomic data across different conditions, technologies, and species","volume":"36","author":"Butler","year":"2018","journal-title":"Nat. Biotechnol"},{"key":"2023051709443350200_btaa986-B10","doi-asserted-by":"crossref","first-page":"1239","DOI":"10.1080\/01621459.2014.1002926","article-title":"Tracking cross-validated estimates of prediction error as studies accumulate","volume":"110","author":"Chang","year":"2015","journal-title":"J. Am. Stat. Assoc"},{"key":"2023051709443350200_btaa986-B11","doi-asserted-by":"crossref","first-page":"273","DOI":"10.1007\/BF00994018","article-title":"Support-vector networks","volume":"20","author":"Cortes","year":"1995","journal-title":"Mach. Learn"},{"key":"2023051709443350200_btaa986-B12","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1504\/IJAIP.2016.074775","article-title":"Handling batch effects on cross-platform classification of microarray data","volume":"8","author":"Engchuan","year":"2016","journal-title":"Int. J. Adv. Intell. 
Paradigms"},{"key":"2023051709443350200_btaa986-B13","doi-asserted-by":"crossref","first-page":"539","DOI":"10.1093\/biostatistics\/kxr034","article-title":"Using control genes to correct for unwanted variation in microarray data","volume":"13","author":"Gagnon-Bartsch","year":"2012","journal-title":"Biostatistics"},{"key":"2023051709443350200_btaa986-B14","first-page":"1","author":"Gagnon-Bartsch","year":"2013"},{"key":"2023051709443350200_btaa986-B15","doi-asserted-by":"crossref","first-page":"359","DOI":"10.1198\/016214506000001437","article-title":"Strictly proper scoring rules, prediction, and estimation","volume":"102","author":"Gneiting","year":"2007","journal-title":"J. Am. Stat. Assoc"},{"key":"2023051709443350200_btaa986-B16","doi-asserted-by":"crossref","first-page":"531","DOI":"10.1126\/science.286.5439.531","article-title":"Molecular classification of cancer: class discovery and class prediction by gene expression monitoring","volume":"286","author":"Golub","year":"1999","journal-title":"Science"},{"key":"2023051709443350200_btaa986-B17","article-title":"Merging versus ensembling in multi-study machine learning: theoretical insight from random effects","author":"Guan","year":"2019","journal-title":"arXiv preprint arXiv : 1905.07382"},{"key":"2023051709443350200_btaa986-B18","doi-asserted-by":"crossref","first-page":"118","DOI":"10.1093\/biostatistics\/kxj037","article-title":"Adjusting batch effects in microarray expression data using empirical bayes methods","volume":"8","author":"Johnson","year":"2007","journal-title":"Biostatistics"},{"key":"2023051709443350200_btaa986-B19","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1186\/1755-8794-5-23","article-title":"Batch correction of microarray data substantially improves the identification of genes differentially expressed in rheumatoid arthritis and osteoarthritis","volume":"5","author":"Kupfer","year":"2012","journal-title":"BMC Med. 
Genomics"},{"key":"2023051709443350200_btaa986-B20","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1155\/2014\/651751","article-title":"Microarray-based rna profiling of breast cancer: batch effect removal improves cross-platform consistency","volume":"2014","author":"Larsen","year":"2014","journal-title":"BioMed. Res. Int"},{"key":"2023051709443350200_btaa986-B21","doi-asserted-by":"crossref","first-page":"469","DOI":"10.1093\/bib\/bbs037","article-title":"Batch effect removal methods for microarray gene expression data integration: a survey","volume":"14","author":"Lazar","year":"2013","journal-title":"Brief. Bioinf"},{"key":"2023051709443350200_btaa986-B22","doi-asserted-by":"crossref","first-page":"e161","DOI":"10.1093\/nar\/gku864","article-title":"Svaseq: removing batch effects and other unwanted noise from sequencing data","volume":"42","author":"Leek","year":"2014","journal-title":"Nucleic Acids Res"},{"key":"2023051709443350200_btaa986-B23","doi-asserted-by":"crossref","first-page":"e161","DOI":"10.1371\/journal.pgen.0030161","article-title":"Capturing heterogeneity in gene expression studies by surrogate variable analysis","volume":"3","author":"Leek","year":"2007","journal-title":"PLoS Genet"},{"key":"2023051709443350200_btaa986-B24","doi-asserted-by":"crossref","first-page":"733","DOI":"10.1038\/nrg2825","article-title":"Tackling the widespread and critical impact of batch effects in high-throughput data","volume":"11","author":"Leek","year":"2010","journal-title":"Nat. Rev. 
Genet"},{"key":"2023051709443350200_btaa986-B25","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1016\/j.tube.2018.01.002","article-title":"Existing blood transcriptional classifiers accurately discriminate active tuberculosis from latent infection in individuals from south india","volume":"109","author":"Leong","year":"2018","journal-title":"Tuberculosis"},{"key":"2023051709443350200_btaa986-B26","doi-asserted-by":"crossref","first-page":"278","DOI":"10.1038\/tpj.2010.57","article-title":"A comparison of batch effect removal methods for enhancement of prediction performance using maqc-ii microarray gene expression data","volume":"10","author":"Luo","year":"2010","journal-title":"The Pharmacogenomics Journal"},{"key":"2023051709443350200_btaa986-B27","doi-asserted-by":"crossref","first-page":"e110840","DOI":"10.1371\/journal.pone.0110840","article-title":"Measuring the effect of inter-study variability on estimating prediction error","volume":"9","author":"Ma","year":"2014","journal-title":"PLoS One"},{"key":"2023051709443350200_btaa986-B28","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1155\/2013\/828939","article-title":"Risk factors for tuberculosis","volume":"2013","author":"Narasimhan","year":"2013","journal-title":"Pulmonary Med"},{"key":"2023051709443350200_btaa986-B29","doi-asserted-by":"crossref","first-page":"2209","DOI":"10.1056\/NEJMoa1516192","article-title":"Genomic classification and prognosis in acute myeloid leukemia","volume":"374","author":"Papaemmanuil","year":"2016","journal-title":"N. Engl. J. Med"},{"key":"2023051709443350200_btaa986-B30","doi-asserted-by":"crossref","first-page":"2578","DOI":"10.1073\/pnas.1708283115","article-title":"Training replicable predictors in multiple studies","volume":"115","author":"Patil","year":"2018","journal-title":"Proc. Natl. Acad. Sci. 
USA"},{"key":"2023051709443350200_btaa986-B31","first-page":"451","article-title":"Tree-weighting for multi-study ensemble learners","volume":"33","author":"Ramchandran","year":"2019","journal-title":"bioRxiv"},{"key":"2023051709443350200_btaa986-B32","doi-asserted-by":"crossref","first-page":"dju048","DOI":"10.1093\/jnci\/dju048","article-title":"Risk prediction for late-stage ovarian cancer by meta-analysis of 1525 patient samples","volume":"106","author":"Riester","year":"2014","journal-title":"JNCI J. Natl. Cancer Inst"},{"key":"2023051709443350200_btaa986-B33","doi-asserted-by":"publisher","first-page":"896","DOI":"10.1038\/nbt.2931","article-title":"Normalization of RNA-seq data using factor analysis of control genes or samples","volume":"32","author":"Risso","year":"2014","journal-title":"Nature Biotechnology"},{"key":"2023051709443350200_btaa986-B34","doi-asserted-by":"crossref","first-page":"e1000612","DOI":"10.1371\/journal.pgen.1000612","article-title":"The key role of genomics in modern vaccine and drug design for emerging infectious diseases","volume":"5","author":"Seib","year":"2009","journal-title":"PLoS Genet"},{"key":"2023051709443350200_btaa986-B35","doi-asserted-by":"crossref","first-page":"243","DOI":"10.1056\/NEJMoa1504601","article-title":"A bronchial genomic classifier for the diagnostic evaluation of lung cancer","volume":"373","author":"Silvestri","year":"2015","journal-title":"N. Engl. J. Med"},{"key":"2023051709443350200_btaa986-B36","doi-asserted-by":"crossref","first-page":"14","DOI":"10.1093\/jnci\/95.1.14","article-title":"Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification","volume":"95","author":"Simon","year":"2003","journal-title":"J. Natl. 
Cancer Inst"},{"key":"2023051709443350200_btaa986-B37","doi-asserted-by":"crossref","first-page":"397","DOI":"10.1007\/0-387-29362-0_23","volume-title":"Bioinformatics and Computational Biology Solutions Using R and Bioconductor","author":"Smyth","year":"2005"},{"key":"2023051709443350200_btaa986-B38","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1111\/j.2517-6161.1996.tb02080.x","article-title":"Regression shrinkage and selection via the lasso","volume":"58","author":"Tibshirani","year":"1996","journal-title":"J. R. Stat. Soc. Ser. B (Methodological)"},{"key":"2023051709443350200_btaa986-B39","doi-asserted-by":"crossref","first-page":"2312","DOI":"10.1016\/S0140-6736(15)01316-1","article-title":"A blood RNA signature for tuberculosis disease risk: a prospective cohort study","volume":"387","author":"Zak","year":"2016","journal-title":"Lancet"},{"key":"2023051709443350200_btaa986-B40","doi-asserted-by":"crossref","first-page":"262","DOI":"10.1186\/s12859-018-2263-6","article-title":"Alternative empirical bayes models for adjusting for batch effects in genomic studies","volume":"19","author":"Zhang","year":"2018","journal-title":"BMC Bioinformatics"},{"key":"2023051709443350200_btaa986-B41","doi-asserted-by":"crossref","first-page":"253","DOI":"10.1093\/biostatistics\/kxy044","article-title":"The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models","volume":"21","author":"Zhang","year":"2018","journal-title":"Biostatistics (Oxford, England)"},{"key":"2023051709443350200_btaa986-B42","doi-asserted-by":"publisher","DOI":"10.1093\/nargab\/lqaa078","article-title":"Combat-seq: batch effect adjustment for rna-seq count 
data","author":"Zhang","year":"2020"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaa986\/34875671\/btaa986.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/11\/1521\/50360836\/btaa986.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/11\/1521\/50360836\/btaa986.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,17]],"date-time":"2024-08-17T07:26:45Z","timestamp":1723879605000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/11\/1521\/6007261"}},"subtitle":[],"editor":[{"given":"Kelso","family":"Janet","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2020,11,27]]},"references-count":42,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2021,7,12]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa986","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/703587","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,6,1]]},"published":{"date-parts":[[2020,11,27]]}}}