{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T19:41:42Z","timestamp":1775245302873,"version":"3.50.1"},"reference-count":43,"publisher":"Oxford University Press (OUP)","issue":"9","license":[{"start":{"date-parts":[[2022,2,25]],"date-time":"2022-02-25T00:00:00Z","timestamp":1645747200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"National Science Foundation through Research Training Groups","award":["1745640"],"award-info":[{"award-number":["1745640"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,4,28]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>Microbiome datasets provide rich information about microbial communities. However, vast library size variations across samples present great challenges for proper statistical comparisons. To deal with these challenges, rarefaction is often used in practice as a normalization technique, although there has been debate whether rarefaction should ever be used. Conventional wisdom and previous work suggested that rarefaction should never be used in practice, arguing that rarefying microbiome data is statistically inadmissible. These discussions, however, have been confined to particular parametric models and simulation studies.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>We develop a semiparametric graphical model framework for grouped microbiome data and analyze in the context of differential abundance testing the statistical trade-offs of the rarefaction procedure, accounting for latent variations and measurement errors. 
Under the framework, it can be shown rarefaction guarantees that subsequent permutation tests properly control the Type I error. In addition, the loss in sensitivity from rarefaction is solely due to increased measurement error; if the underlying variation in microbial composition is large among samples, rarefaction might not hurt subsequent statistical inference much. We develop the rarefaction efficiency index (REI) as an indicator for efficiency loss and illustrate it with a dataset on the effect of storage conditions for microbiome data. Simulation studies based on real data demonstrate that the impact of rarefaction on sensitivity is negligible when overdispersion is prominent, while low REI corresponds to scenarios in which rarefying might substantially lower the statistical power. Whether to rarefy or not ultimately depends on assumptions of the data generating process and characteristics of the data.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>Source codes are publicly available at https:\/\/github.com\/jcyhong\/rarefaction.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac127","type":"journal-article","created":{"date-parts":[[2022,2,24]],"date-time":"2022-02-24T12:11:05Z","timestamp":1645704665000},"page":"2389-2396","source":"Crossref","is-referenced-by-count":41,"title":["To rarefy or not to rarefy: robustness and efficiency trade-offs of rarefying microbiome data"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6476-2982","authenticated-orcid":false,"given":"Johnny","family":"Hong","sequence":"first","affiliation":[{"name":"Department of Statistics, UC Berkeley , Berkeley, CA 94720, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8238-6757","authenticated-orcid":false,"given":"Ulas","family":"Karaoz","sequence":"additional","affiliation":[{"name":"Climate and Ecosystem Sciences Division, Lawrence Berkeley National Laboratory , Berkeley, CA 94720, USA"}]},{"given":"Perry","family":"de Valpine","sequence":"additional","affiliation":[{"name":"Department of Environmental Science, Policy, and Management, UC Berkeley , Berkeley, CA 94720, USA"}]},{"given":"William","family":"Fithian","sequence":"additional","affiliation":[{"name":"Department of Statistics, UC Berkeley , Berkeley, CA 94720, USA"}]}],"member":"286","published-online":{"date-parts":[[2022,2,25]]},"reference":[{"key":"2023041402552270200_","first-page":"32","article-title":"A new method for non parametric multivariate analysis of variance","volume":"26","author":"Anderson","year":"2001","journal-title":"Austral Ecol"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1186\/s13742-016-0111-z","article-title":"Species-level resolution of 16S rRNA gene amplicons sequenced through the MinION\u2122 portable nanopore sequencer","volume":"5","author":"Ben\u00edtez-P\u00e1ez","year":"2016","journal-title":"Gigascience"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"325","DOI":"10.2307\/1942268","article-title":"An ordination of upland forest communities of Southern Wisconsin","volume":"27","author":"Bray","year":"1957","journal-title":"Ecol. 
Monogr"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"2639","DOI":"10.1038\/ismej.2017.119","article-title":"Exact sequence variants should replace operational taxonomic units in marker-gene data analysis","volume":"11","author":"Callahan","year":"2017","journal-title":"ISME J"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"418","DOI":"10.1214\/12-AOAS592","article-title":"Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis","volume":"7","author":"Chen","year":"2013","journal-title":"Ann. Appl. Stat"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1111\/tmi.12650","article-title":"Gut microbiota in Malawian infants in a nutritional supplementation trial","volume":"21","author":"Cheung","year":"2016","journal-title":"Trop. Med. Int. Health"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1186\/s12864-015-2194-9","article-title":"A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling","volume":"17","author":"D'Amore","year":"2016","journal-title":"BMC Genomics"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"996","DOI":"10.1038\/nmeth.2604","article-title":"UPARSE: highly accurate OTU sequences from microbial amplicon reads","volume":"10","author":"Edgar","year":"2013","journal-title":"Nat. Methods"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"105","DOI":"10.1016\/j.jare.2019.03.006","article-title":"What is new and relevant for sequencing-based microbiome research? A mini-review","volume":"19","author":"Fricker","year":"2019","journal-title":"J. Adv. 
Res"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"17004","DOI":"10.1038\/nmicrobiol.2017.4","article-title":"Dynamics of the human gut microbiome in inflammatory bowel disease","volume":"2","author":"Halfvarson","year":"2017","journal-title":"Nat. Microbiol"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"e0224909","DOI":"10.1371\/journal.pone.0224909","article-title":"Sequence count data are poorly fit by the negative binomial distribution","volume":"15","author":"Hawinkel","year":"2020","journal-title":"PLoS One"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"e30126","DOI":"10.1371\/journal.pone.0030126","article-title":"Dirichlet multinomial mixtures: generative models for microbial metagenomics","volume":"7","author":"Holmes","year":"2012","journal-title":"PLoS One"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"750","DOI":"10.1038\/nature03073","article-title":"A taxa-area relationship for bacteria","volume":"432","author":"Horner-Devine","year":"2004","journal-title":"Nature"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1038\/nature11234","article-title":"Structure, function and diversity of the healthy human microbiome","volume":"486","author":"Huttenhower","year":"2012","journal-title":"Nature"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1111\/j.1469-8137.1912.tb05611.x","article-title":"The distribution of the flora in the alpine zone","volume":"11","author":"Jaccard","year":"1912","journal-title":"New Phytol"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"12015","DOI":"10.1038\/ncomms12015","article-title":"Alterations of the human gut microbiome in multiple sclerosis","volume":"7","author":"Jangi","year":"2016","journal-title":"Nat. 
Commun"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"11279","DOI":"10.1073\/pnas.95.19.11279","article-title":"Diversity components of impending primate extinctions","volume":"95","author":"Jernvall","year":"1998","journal-title":"Proc. Natl. Acad. Sci. U S A"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"5029","DOI":"10.1038\/s41467-019-13036-1","article-title":"Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis","volume":"10","author":"Johnson","year":"2019","journal-title":"Nat. Commun"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"297","DOI":"10.3389\/fmicb.2018.00297","article-title":"Linking associations of rare low-abundance species to their environments by association networks","volume":"9","author":"Karpinets","year":"2018","journal-title":"Front. Microbiol"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"e52078","DOI":"10.1371\/journal.pone.0052078","article-title":"Hypothesis testing and power calculations for taxonomic-based human microbiome data","volume":"7","author":"La Rosa","year":"2012","journal-title":"PLoS One"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"550","DOI":"10.1186\/s13059-014-0550-8","article-title":"Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2","volume":"15","author":"Love","year":"2014","journal-title":"Genome Biol"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"8228","DOI":"10.1128\/AEM.71.12.8228-8235.2005","article-title":"UniFrac: a new phylogenetic method for comparing microbial communities","volume":"71","author":"Lozupone","year":"2005","journal-title":"Appl. Environ. 
Microbiol"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"169","DOI":"10.1038\/ismej.2010.133","article-title":"UniFrac: an effective distance metric for microbial community comparison","volume":"5","author":"Lozupone","year":"2011","journal-title":"ISME J"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"e1003531","DOI":"10.1371\/journal.pcbi.1003531","article-title":"Waste not, want not: why rarefying microbiome data is inadmissible","volume":"10","author":"McMurdie","year":"2014","journal-title":"PLoS Comput. Biol"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"946","DOI":"10.1214\/16-AOAS920","article-title":"Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression","volume":"10","author":"Phipson","year":"2016","journal-title":"Ann. Appl. Stat"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"e47","DOI":"10.1093\/nar\/gkv007","article-title":"limma powers differential expression analyses for RNA-sequencing and microarray studies","volume":"43","author":"Ritchie","year":"2015","journal-title":"Nucleic Acids Res"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"24067","DOI":"10.1038\/srep24067","article-title":"Comparison of DNA quantification methods for next generation sequencing","volume":"6","author":"Robin","year":"2016","journal-title":"Sci. 
Rep"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"139","DOI":"10.1093\/bioinformatics\/btp616","article-title":"edgeR: a bioconductor package for differential expression analysis of digital gene expression data","volume":"26","author":"Robinson","year":"2010","journal-title":"Bioinformatics"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"7583","DOI":"10.1128\/AEM.02206-14","article-title":"Performance comparison of Illumina and ion torrent next-generation sequencing platforms for 16S rRNA-based bacterial community profiling","volume":"80","author":"Salipante","year":"2014","journal-title":"Appl. Environ. Microbiol"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"243","DOI":"10.1086\/282541","article-title":"Marine benthic diversity: a comparative study","volume":"102","author":"Sanders","year":"1968","journal-title":"Am. Nat"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"e1869","DOI":"10.7717\/peerj.1869","article-title":"Sequencing 16S rRNA gene fragments using the PacBio SMRT DNA sequencing system","volume":"4","author":"Schloss","year":"2016","journal-title":"PeerJ"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"e00021-16","DOI":"10.1128\/mSystems.00021-16","article-title":"Preservation methods differ in fecal microbiome stability, affecting suitability for field studies","volume":"1","author":"Song","year":"2016","journal-title":"mSystems"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1038\/nature12222","article-title":"Comprehensive molecular characterization of clear cell renal cell carcinoma","volume":"499","year":"2013","journal-title":"Nature"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511802256","volume-title":"Asymptotic Statistics","author":"van der 
Vaart","year":"1998"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"507","DOI":"10.1038\/nature24460","article-title":"Quantitative microbiome profiling links gut community variation to microbial load","volume":"551","author":"Vandeputte","year":"2017","journal-title":"Nature"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"13537","DOI":"10.1038\/s41598-017-13601-y","article-title":"Gut microbiome alterations in Alzheimer\u2019s disease","volume":"7","author":"Vogt","year":"2017","journal-title":"Sci. Rep"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1186\/s40168-017-0237-y","article-title":"Normalization and microbial differential abundance strategies depend upon data characteristics","volume":"5","author":"Weiss","year":"2017","journal-title":"Microbiome"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"e1000352","DOI":"10.1371\/journal.pcbi.1000352","article-title":"Statistical methods for detecting differentially abundant features in clinical metagenomic samples","volume":"5","author":"White","year":"2009","journal-title":"PLoS Comput. Biol"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"2407","DOI":"10.3389\/fmicb.2019.02407","article-title":"Rarefaction, alpha diversity, and statistics","volume":"10","author":"Willis","year":"2019","journal-title":"Front. 
Microbiol"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"2435","DOI":"10.1038\/ismej.2016.37","article-title":"Cigarette smoking and the oral microbiome in a large study of American adults","volume":"10","author":"Wu","year":"2016","journal-title":"ISME J"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"326","DOI":"10.1186\/s12864-018-4677-y","article-title":"Robust sub-nanomolar library preparation for high throughput next generation sequencing","volume":"19","author":"Wu","year":"2018","journal-title":"BMC Genomics"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"138","DOI":"10.1016\/j.gendis.2017.06.001","article-title":"Hypothesis testing and statistical analysis of microbiome","volume":"4","author":"Xia","year":"2017","journal-title":"Genes Dis"},{"key":"2023041402552270200_","doi-asserted-by":"crossref","first-page":"4894","DOI":"10.1038\/s41467-018-07343-2","article-title":"The structure and function of the global citrus rhizosphere microbiome","volume":"9","author":"Xu","year":"2018","journal-title":"Nat. 
Commun"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btac127\/42828749\/btac127.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/9\/2389\/49874496\/btac127.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/9\/2389\/49874496\/btac127.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,18]],"date-time":"2023-11-18T01:14:42Z","timestamp":1700270082000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/9\/2389\/6536959"}},"subtitle":[],"editor":[{"given":"Janet","family":"Kelso","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2022,2,25]]},"references-count":43,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2022,4,28]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac127","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,5,1]]},"published":{"date-parts":[[2022,2,25]]}}}