{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,2]],"date-time":"2026-06-02T23:47:02Z","timestamp":1780444022044,"version":"3.54.1"},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"S6","license":[{"start":{"date-parts":[[2022,7,1]],"date-time":"2022-07-01T00:00:00Z","timestamp":1656633600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,7,14]],"date-time":"2022-07-14T00:00:00Z","timestamp":1657756800000},"content-version":"vor","delay-in-days":13,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100004033","name":"Johannes Gutenberg-Universit\u00e4t Mainz","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100004033","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2022,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>The constant evolution and development of next-generation sequencing techniques leads to high-throughput data composed of datasets that include a large number of biological samples. Although a large number of samples are usually experimentally processed in batches, scientific publications are often elusive about this information, which can greatly impact the quality of the samples and confound further statistical analyses. 
Because dedicated bioinformatics methods developed to detect unwanted sources of variance in the data can wrongly detect real biological signals, such methods could benefit from using a quality-aware approach.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>We recently developed statistical guidelines and a machine learning tool to automatically evaluate the quality of a next-generation-sequencing sample. We leveraged this quality assessment to detect and correct batch effects in 12 publicly available RNA-seq datasets with available batch information. We were able to distinguish batches by our quality score and used it to correct for some batch effects in sample clustering. Overall, the correction was evaluated as comparable to or better than the reference method that uses a priori knowledge of the batches (in 10 and 1 datasets of 12, respectively; total\u2009=\u200992%). When coupled to outlier removal, the correction was more often evaluated as better than the reference (comparable or better in 5 and 6 datasets of 12, respectively; total\u2009=\u200992%).<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusions<\/jats:title>\n                <jats:p>In this work, we show the capabilities of our software to detect batches in public RNA-seq datasets from differences in the predicted quality of their samples. We also use these insights to correct the batch effect and observe the relation of sample quality and batch effect. 
These observations reinforce our expectation that while batch effects do correlate with differences in quality, batch effects also arise from other artifacts and are more suitably corrected statistically in well-designed experiments.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s12859-022-04775-y","type":"journal-article","created":{"date-parts":[[2022,7,14]],"date-time":"2022-07-14T14:06:15Z","timestamp":1657807575000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":44,"title":["Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality"],"prefix":"10.1186","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8438-4747","authenticated-orcid":false,"given":"Maximilian","family":"Sprang","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6650-1711","authenticated-orcid":false,"given":"Miguel A.","family":"Andrade-Navarro","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1101-4091","authenticated-orcid":false,"given":"Jean-Fred","family":"Fontaine","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2022,7,14]]},"reference":[{"key":"4775_CR1","volume-title":"Batch effects and noise in microarray experiments: sources and solutions","author":"N Altman","year":"2009","unstructured":"Altman N. Batches and blocks, sample pools and subsamples in the design and analysis of gene expression studies. In: Scherer A, editor. Batch effects and noise in microarray experiments: sources and solutions. 
Chichester: Wiley; 2009."},{"key":"4775_CR2","volume-title":"Batch effects and noise in microarray experiments: sources and solutions","author":"P Grass","year":"2009","unstructured":"Grass P. Experimental design. In: Scherer A, editor. Batch effects and noise in microarray experiments: sources and solutions. Chichester: Wiley; 2009."},{"issue":"2","key":"4775_CR3","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0017238","volume":"6","author":"C Chen","year":"2011","unstructured":"Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, et al. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS ONE. 2011;6(2): e17238.","journal-title":"PLoS ONE"},{"issue":"10","key":"4775_CR4","doi-asserted-by":"publisher","first-page":"733","DOI":"10.1038\/nrg2825","volume":"11","author":"JT Leek","year":"2010","unstructured":"Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010;11(10):733\u20139.","journal-title":"Nat Rev Genet"},{"key":"4775_CR5","doi-asserted-by":"publisher","unstructured":"Li T, Zhang Y, Patil P, Johnson WE. Overcoming the impacts of two-step batch effect correction on gene expression estimation and inference. Biostatistics. 2021.\nhttps:\/\/doi.org\/10.1093\/biostatistics\/kxab039.","DOI":"10.1093\/biostatistics\/kxab039"},{"issue":"1","key":"4775_CR6","doi-asserted-by":"publisher","first-page":"10849","DOI":"10.1038\/s41598-017-11110-6","volume":"7","author":"G Nyamundanda","year":"2017","unstructured":"Nyamundanda G, Poudel P, Patil Y, Sadanandam A. A novel statistical method to diagnose, quantify and correct batch effects in genomic studies. Sci Rep. 
2017;7(1):10849.","journal-title":"Sci Rep"},{"issue":"8","key":"4775_CR7","doi-asserted-by":"publisher","first-page":"892","DOI":"10.7150\/ijbs.24548","volume":"14","author":"H Cai","year":"2018","unstructured":"Cai H, Li X, Li J, Liang Q, Zheng W, Guan Q, et al. Identifying differentially expressed genes from cross-site integrated data based on relative expression orderings. Int J Biol Sci. 2018;14(8):892\u2013900.","journal-title":"Int J Biol Sci"},{"issue":"6","key":"4775_CR8","doi-asserted-by":"publisher","first-page":"498","DOI":"10.1016\/j.tibtech.2017.02.012","volume":"35","author":"WWB Goh","year":"2017","unstructured":"Goh WWB, Wang W, Wong L. Why batch effects matter in omics data, and how to avoid them. Trends Biotechnol. 2017;35(6):498\u2013507.","journal-title":"Trends Biotechnol"},{"key":"4775_CR9","unstructured":"JT L, WE J, HS P, EJ F, AE J, Y Z, et al. sva: Surrogate Variable Analysis. 3.42.0 ed2021. p. R package."},{"issue":"3","key":"4775_CR10","doi-asserted-by":"publisher","first-page":"lqaa078","DOI":"10.1093\/nargab\/lqaa078","volume":"2","author":"Y Zhang","year":"2020","unstructured":"Zhang Y, Parmigiani G, Johnson WE. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom Bioinform. 2020;2(3):lqaa078.","journal-title":"NAR Genom Bioinform"},{"issue":"1","key":"4775_CR11","doi-asserted-by":"publisher","first-page":"75","DOI":"10.1186\/s13059-021-02294-2","volume":"22","author":"S Albrecht","year":"2021","unstructured":"Albrecht S, Sprang M, Andrade-Navarro MA, Fontaine JF. seqQscorer: automated quality control of next-generation sequencing data using machine learning. Genome Biol. 2021;22(1):75.","journal-title":"Genome Biol"},{"key":"4775_CR12","doi-asserted-by":"publisher","DOI":"10.26508\/lsa.202101113","author":"M Sprang","year":"2021","unstructured":"Sprang M, Kruger M, Andrade-Navarro MA, Fontaine JF. Statistical guidelines for quality control of next-generation sequencing techniques. Life Sci Alliance. 2021. 
https:\/\/doi.org\/10.26508\/lsa.202101113.","journal-title":"Life Sci Alliance"},{"issue":"1","key":"4775_CR13","doi-asserted-by":"publisher","first-page":"249","DOI":"10.1186\/s12864-020-6673-2","volume":"21","author":"AN Scholes","year":"2020","unstructured":"Scholes AN, Lewis JA. Comparison of RNA isolation methods on RNA-Seq: implications for differential expression and meta-analyses. BMC Genomics. 2020;21(1):249.","journal-title":"BMC Genomics"},{"issue":"9","key":"4775_CR14","doi-asserted-by":"publisher","first-page":"896","DOI":"10.1038\/nbt.2931","volume":"32","author":"D Risso","year":"2014","unstructured":"Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol. 2014;32(9):896\u2013902.","journal-title":"Nat Biotechnol"},{"key":"4775_CR15","unstructured":"Andrews S, et al. FastQC: a quality control tool for high throughput sequence data. 2010. https:\/\/www.bioinformatics.babraham.ac.uk\/projects\/fastqc\/. Accessed 20 Nov 2020."},{"issue":"4","key":"4775_CR16","doi-asserted-by":"publisher","first-page":"357","DOI":"10.1038\/nmeth.1923","volume":"9","author":"B Langmead","year":"2012","unstructured":"Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357.","journal-title":"Nat Methods"},{"issue":"14","key":"4775_CR17","doi-asserted-by":"publisher","first-page":"2382","DOI":"10.1093\/bioinformatics\/btv145","volume":"31","author":"G Yu","year":"2015","unstructured":"Yu G, Wang L-G, He Q-Y. ChIPseeker: an R\/Bioconductor package for ChIP peak annotation, comparison and visualization. BMC Bioinform. 2015;31(14):2382\u20133.","journal-title":"BMC Bioinform"},{"issue":"1","key":"4775_CR18","doi-asserted-by":"publisher","first-page":"237","DOI":"10.1186\/1471-2105-11-237","volume":"11","author":"LJ Zhu","year":"2010","unstructured":"Zhu LJ, Gazin C, Lawson ND, Pag\u00e8s H, Lin SM, Lapointe DS, et al. 
ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip data. BMC Bioinform. 2010;11(1):237.","journal-title":"BMC Bioinform"},{"issue":"4","key":"4775_CR19","doi-asserted-by":"publisher","first-page":"417","DOI":"10.1038\/nmeth.4197","volume":"14","author":"R Patro","year":"2017","unstructured":"Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14(4):417\u20139.","journal-title":"Nat Methods"},{"issue":"12","key":"4775_CR20","doi-asserted-by":"publisher","first-page":"550","DOI":"10.1186\/s13059-014-0550-8","volume":"15","author":"MI Love","year":"2014","unstructured":"Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550.","journal-title":"Genome Biol"},{"issue":"51","key":"4775_CR21","doi-asserted-by":"publisher","first-page":"14662","DOI":"10.1073\/pnas.1617317113","volume":"113","author":"Z Lin","year":"2016","unstructured":"Lin Z, Yang C, Zhu Y, Duchi J, Fu Y, Wang Y, et al. Simultaneous dimension reduction and adjustment for confounding variation. Proc Natl Acad Sci USA. 2016;113(51):14662\u20137.","journal-title":"Proc Natl Acad Sci USA"},{"key":"4775_CR22","unstructured":"Henning C. fpc: flexible procedures for clustering. 2.2-9 ed2020. p. R package."},{"issue":"2","key":"4775_CR23","doi-asserted-by":"publisher","first-page":"107","DOI":"10.1023\/A:1012801612483","volume":"17","author":"M Halkidi","year":"2001","unstructured":"Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. J Intell Inf Syst. 2001;17(2):107\u201345.","journal-title":"J Intell Inf Syst"},{"issue":"7","key":"4775_CR24","doi-asserted-by":"publisher","first-page":"1796","DOI":"10.1016\/j.cell.2018.11.014","volume":"175","author":"V Lo Sardo","year":"2018","unstructured":"Lo Sardo V, Chubukov P, Ferguson W, Kumar A, Teng EL, Duran M, et al. 
Unveiling the role of the most impactful cardiovascular risk locus through haplotype editing. Cell. 2018;175(7):1796-810.e20.","journal-title":"Cell"},{"issue":"42","key":"4775_CR25","doi-asserted-by":"publisher","first-page":"E4468","DOI":"10.1073\/pnas.1405266111","volume":"111","author":"A Sugathan","year":"2014","unstructured":"Sugathan A, Biagioli M, Golzio C, Erdin S, Blumenthal I, Manavalan P, et al. CHD8 regulates neurodevelopmental pathways associated with autism spectrum disorder in neural progenitors. Proc Natl Acad Sci USA. 2014;111(42):E4468\u201377.","journal-title":"Proc Natl Acad Sci USA"},{"issue":"14","key":"4775_CR26","doi-asserted-by":"publisher","first-page":"2030","DOI":"10.1038\/onc.2016.340","volume":"36","author":"NA Wijetunga","year":"2017","unstructured":"Wijetunga NA, Pascual M, Tozour J, Delahaye F, Alani M, Adeyeye M, et al. A pre-neoplastic epigenetic field defect in HCV-infected liver at transcription factor binding sites and polycomb targets. Oncogene. 2017;36(14):2030\u201344.","journal-title":"Oncogene"},{"issue":"4","key":"4775_CR27","doi-asserted-by":"publisher","first-page":"588","DOI":"10.1016\/j.ccell.2019.02.009","volume":"35","author":"L Cassetta","year":"2019","unstructured":"Cassetta L, Fragkogianni S, Sims AH, Swierczak A, Forrester LM, Zhang H, et al. Human tumor-associated macrophage and monocyte transcriptional landscapes reveal cancer-specific reprogramming, biomarkers, and therapeutic targets. Cancer Cell. 2019;35(4):588-602.e10.","journal-title":"Cancer Cell"},{"issue":"9","key":"4775_CR28","doi-asserted-by":"publisher","first-page":"1152","DOI":"10.1111\/jcpe.13504","volume":"48","author":"H Kim","year":"2021","unstructured":"Kim H, Momen-Heravi F, Chen S, Hoffmann P, Kebschull M, Papapanou PN. Differential DNA methylation and mRNA transcription in gingival tissues in periodontal health and disease. J Clin Periodontol. 
2021;48(9):1152\u201364.","journal-title":"J Clin Periodontol"},{"issue":"4","key":"4775_CR29","doi-asserted-by":"publisher","first-page":"e0009321","DOI":"10.1371\/journal.pntd.0009321","volume":"15","author":"C Farias-Amorim","year":"2021","unstructured":"Farias-Amorim C, Novais FO, Nguyen BT, Nascimento MT, Lago J, Lago AS, et al. Localized skin inflammation during cutaneous leishmaniasis drives a chronic, systemic IFN-gamma signature. PLoS Negl Trop Dis. 2021;15(4):e0009321.","journal-title":"PLoS Negl Trop Dis"},{"issue":"17","key":"4775_CR30","doi-asserted-by":"publisher","first-page":"4547","DOI":"10.1016\/j.cell.2021.07.003","volume":"184","author":"KR Bowles","year":"2021","unstructured":"Bowles KR, Silva MC, Whitney K, Bertucci T, Berlind JE, Lai JD, et al. ELAVL4, splicing, and glutamatergic dysfunction precede neuron loss in MAPT mutation cerebral organoids. Cell. 2021;184(17):4547-63.e17.","journal-title":"Cell"},{"issue":"1","key":"4775_CR31","doi-asserted-by":"publisher","first-page":"5450","DOI":"10.1038\/s41467-021-25704-2","volume":"12","author":"J Alvarez-Benayas","year":"2021","unstructured":"Alvarez-Benayas J, Trasanidis N, Katsarou A, Ponnusamy K, Chaidos A, May PC, et al. Chromatin-based, in cis and in trans regulatory rewiring underpins distinct oncogenic transcriptomes in multiple myeloma. Nat Commun. 2021;12(1):5450.","journal-title":"Nat Commun"},{"issue":"2","key":"4775_CR32","doi-asserted-by":"publisher","first-page":"678","DOI":"10.3390\/ijms22020678","volume":"22","author":"T Procida","year":"2021","unstructured":"Procida T, Friedrich T, Jack APM, Peritore M, Bonisch C, Eberl HC, et al. JAZF1, a novel p400\/TIP60\/NuA4 complex member, regulates H2A.Z acetylation at regulatory regions. Int J Mol Sci. 
2021;22(2):678.","journal-title":"Int J Mol Sci"},{"issue":"1","key":"4775_CR33","doi-asserted-by":"publisher","first-page":"504","DOI":"10.1038\/s41398-021-01635-w","volume":"11","author":"Y Lim","year":"2021","unstructured":"Lim Y, Beane-Ebel JE, Tanaka Y, Ning B, Husted CR, Henderson DC, et al. Exploration of alcohol use disorder-associated brain miRNA-mRNA regulatory networks. Transl Psychiatry. 2021;11(1):504.","journal-title":"Transl Psychiatry"},{"issue":"11","key":"4775_CR34","doi-asserted-by":"publisher","first-page":"103238","DOI":"10.1016\/j.isci.2021.103238","volume":"24","author":"VA Moser","year":"2021","unstructured":"Moser VA, Workman MJ, Hurwitz SJ, Lipman RM, Pike CJ, Svendsen CN. Microglial transcription profiles in mouse and human are driven by APOE4 and sex. iScience. 2021;24(11):103238.","journal-title":"iScience"},{"key":"4775_CR35","doi-asserted-by":"publisher","DOI":"10.7554\/eLife.58178","author":"JG Roth","year":"2020","unstructured":"Roth JG, Muench KL, Asokan A, Mallett VM, Gai H, Verma Y, et al. 16p11.2 microdeletion imparts transcriptional alterations in human iPSC-derived models of early neural development. Elife. 2020. 
https:\/\/doi.org\/10.7554\/eLife.58178.","journal-title":"Elife"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-022-04775-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-022-04775-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-022-04775-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,7,14]],"date-time":"2022-07-14T14:07:48Z","timestamp":1657807668000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-022-04775-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,7]]},"references-count":35,"journal-issue":{"issue":"S6","published-print":{"date-parts":[[2022,7]]}},"alternative-id":["4775"],"URL":"https:\/\/doi.org\/10.1186\/s12859-022-04775-y","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,7]]},"assertion":[{"value":"1 June 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 June 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 July 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not 
applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"279"}}