{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,28]],"date-time":"2026-03-28T06:40:23Z","timestamp":1774680023131,"version":"3.50.1"},"reference-count":30,"publisher":"Oxford University Press (OUP)","issue":"22","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,11,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: Massively parallel sequencing allows for rapid sequencing of large numbers of sequences in just a single run. Thus, 16S ribosomal RNA (rRNA) amplicon sequencing of complex microbial communities has become possible. The sequenced 16S rRNA fragments (reads) are clustered into operational taxonomic units and taxonomic categories are assigned. Recent reports suggest that data pre-processing should be performed before clustering. We assessed combinations of data pre-processing steps and clustering algorithms on cluster accuracy for oral microbial sequence data.<\/jats:p>\n               <jats:p>Results: The number of clusters varied up to two orders of magnitude depending on pre-processing. Pre-processing using both denoising and chimera checking resulted in a number of clusters that was closest to the number of species in the mock dataset (25 versus 15). Based on run time, purity and normalized mutual information, we could not identify a single best clustering algorithm. The differences in clustering accuracy among the algorithms after the same pre-processing were minor compared with the differences in accuracy among different pre-processing steps.<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <jats:p>Contact: \u00a0bonder.m.j@gmail.com or b.brandt@acta.nl<\/jats:p>","DOI":"10.1093\/bioinformatics\/bts552","type":"journal-article","created":{"date-parts":[[2012,9,9]],"date-time":"2012-09-09T00:29:08Z","timestamp":1347150548000},"page":"2891-2897","source":"Crossref","is-referenced-by-count":75,"title":["Comparing clustering and pre-processing in taxonomy analysis"],"prefix":"10.1093","volume":"28","author":[{"given":"Marc J.","family":"Bonder","sequence":"first","affiliation":[{"name":"1 Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam (ACTA), University of Amsterdam and VU University Amsterdam and 2Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands"},{"name":"1 Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam (ACTA), University of Amsterdam and VU University Amsterdam and 2Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands"},{"name":"1 Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam (ACTA), University of Amsterdam and VU University Amsterdam and 2Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sanne","family":"Abeln","sequence":"additional","affiliation":[{"name":"1 Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam (ACTA), University of Amsterdam and VU University Amsterdam and 2Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Egija","family":"Zaura","sequence":"additional","affiliation":[{"name":"1 Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam (ACTA), University of Amsterdam and VU University Amsterdam and 2Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bernd W.","family":"Brandt","sequence":"additional","affiliation":[{"name":"1 Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam (ACTA), University of Amsterdam and VU University Amsterdam and 2Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands"},{"name":"1 Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam (ACTA), University of Amsterdam and VU University Amsterdam and 2Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2012,9,8]]},"reference":[{"key":"2023012513220419100_bts552-B1","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res"},{"key":"2023012513220419100_bts552-B2","doi-asserted-by":"crossref","first-page":"W82","DOI":"10.1093\/nar\/gks418","article-title":"TaxMan: a server to trim rRNA reference databases and inspect taxonomic coverage","volume":"40","author":"Brandt","year":"2012","journal-title":"Nucleic Acids Res."},{"key":"2023012513220419100_bts552-B3","doi-asserted-by":"crossref","first-page":"e95","DOI":"10.1093\/nar\/gkr349","article-title":"ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time","volume":"39","author":"Cai","year":"2011","journal-title":"Nucleic Acids Res."},{"key":"2023012513220419100_bts552-B4","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1038\/nmeth.f.303","article-title":"QIIME allows analysis of high-throughput community sequencing data","volume":"7","author":"Caporaso","year":"2010","journal-title":"Nat. Methods"},{"key":"2023012513220419100_bts552-B5","doi-asserted-by":"crossref","first-page":"D294","DOI":"10.1093\/nar\/gki038","article-title":"The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis","volume":"33","author":"Cole","year":"2005","journal-title":"Nucleic Acids Res."},{"key":"2023012513220419100_bts552-B6","doi-asserted-by":"crossref","first-page":"5002","DOI":"10.1128\/JB.00542-10","article-title":"The human oral microbiome","volume":"192","author":"Dewhirst","year":"2010","journal-title":"J. Bacteriol."},{"key":"2023012513220419100_bts552-B7","doi-asserted-by":"crossref","first-page":"2460","DOI":"10.1093\/bioinformatics\/btq461","article-title":"Search and clustering orders of magnitude faster than BLAST","volume":"26","author":"Edgar","year":"2010","journal-title":"Bioinformatics"},{"key":"2023012513220419100_bts552-B8","doi-asserted-by":"crossref","first-page":"2194","DOI":"10.1093\/bioinformatics\/btr381","article-title":"UCHIME improves sensitivity and speed of chimera detection","volume":"27","author":"Edgar","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012513220419100_bts552-B9","doi-asserted-by":"crossref","first-page":"494","DOI":"10.1101\/gr.112730.110","article-title":"Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons","volume":"21","author":"Haas","year":"2011","journal-title":"Genome Res."},{"key":"2023012513220419100_bts552-B10","doi-asserted-by":"crossref","first-page":"1889","DOI":"10.1111\/j.1462-2920.2010.02193.x","article-title":"Ironing out the wrinkles in the rare biosphere through improved OTU clustering","volume":"12","author":"Huse","year":"2010","journal-title":"Environ. Microbiol."},{"key":"2023012513220419100_bts552-B11","doi-asserted-by":"crossref","first-page":"e30230","DOI":"10.1371\/journal.pone.0030230","article-title":"Two-stage clustering (TSC): a pipeline for selecting operational taxonomic units for the high-throughput sequencing of PCR amplicons","volume":"7","author":"Jiang","year":"2012","journal-title":"PLoS ONE"},{"key":"2023012513220419100_bts552-B12","doi-asserted-by":"crossref","first-page":"1016","DOI":"10.1177\/154405910808701104","article-title":"Pyrosequencing analysis of the oral microflora of healthy adults","volume":"87","author":"Keijser","year":"2008","journal-title":"J. Dent. Res."},{"key":"2023012513220419100_bts552-B30","doi-asserted-by":"crossref","first-page":"e42770","DOI":"10.1371\/journal.pone.0042770","article-title":"The relation between oral Candida load and bacterial microbiome profiles in Dutch older adults","volume":"7","author":"Kraneveld","year":"2012","journal-title":"PLoS ONE"},{"key":"2023012513220419100_bts552-B13","doi-asserted-by":"crossref","first-page":"118","DOI":"10.1111\/j.1462-2920.2009.02051.x","article-title":"Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates","volume":"12","author":"Kunin","year":"2010","journal-title":"Environ. Microbiol."},{"key":"2023012513220419100_bts552-B14","doi-asserted-by":"crossref","first-page":"1658","DOI":"10.1093\/bioinformatics\/btl158","article-title":"Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences","volume":"22","author":"Li","year":"2006","journal-title":"Bioinformatics"},{"key":"2023012513220419100_bts552-B15","doi-asserted-by":"crossref","first-page":"376","DOI":"10.1038\/nature03959","article-title":"Genome sequencing in microfabricated high-density picolitre reactors","volume":"437","author":"Margulies","year":"2005","journal-title":"Nature"},{"key":"2023012513220419100_bts552-B16","doi-asserted-by":"crossref","first-page":"530","DOI":"10.1111\/j.1365-2591.2011.02006.x","article-title":"Ecology of the microbiome of the infected root canal system: a comparison between apical and coronal root segments","volume":"45","author":"\u00d6zok","year":"2012","journal-title":"Int. Endod. J."},{"key":"2023012513220419100_bts552-B17","doi-asserted-by":"crossref","first-page":"3770","DOI":"10.1128\/JB.183.12.3770-3783.2001","article-title":"Bacterial diversity in human subgingival plaque","volume":"183","author":"Paster","year":"2001","journal-title":"J. Bacteriol."},{"key":"2023012513220419100_bts552-B18","volume-title":"Numerical Recipes: The Art of Scientific Computing","author":"Press","year":"2007"},{"key":"2023012513220419100_bts552-B19","doi-asserted-by":"crossref","first-page":"7188","DOI":"10.1093\/nar\/gkm864","article-title":"SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB","volume":"35","author":"Pruesse","year":"2007","journal-title":"Nucleic Acids Res."},{"key":"2023012513220419100_bts552-B20","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1186\/1471-2105-12-38","article-title":"Removing noise from pyrosequenced amplicons","volume":"12","author":"Quince","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2023012513220419100_bts552-B21","doi-asserted-by":"crossref","first-page":"668","DOI":"10.1038\/nmeth0910-668b","article-title":"Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions","volume":"7","author":"Reeder","year":"2010","journal-title":"Nat. Methods"},{"key":"2023012513220419100_bts552-B22","doi-asserted-by":"crossref","first-page":"e27310","DOI":"10.1371\/journal.pone.0027310","article-title":"Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies","volume":"6","author":"Schloss","year":"2011","journal-title":"PLoS ONE"},{"key":"2023012513220419100_bts552-B23","doi-asserted-by":"crossref","first-page":"3219","DOI":"10.1128\/AEM.02810-10","article-title":"Assessing and improving methods used in operational taxonomic unit-based approaches for 16S rRNA gene sequence analysis","volume":"77","author":"Schloss","year":"2011","journal-title":"Appl. Environ. Microbiol."},{"key":"2023012513220419100_bts552-B24","doi-asserted-by":"crossref","first-page":"7537","DOI":"10.1128\/AEM.01541-09","article-title":"Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities","volume":"75","author":"Schloss","year":"2009","journal-title":"Appl. Environ. Microbiol."},{"key":"2023012513220419100_bts552-B25","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1038\/nmeth1156","article-title":"Next-generation sequencing transforms today's biology","volume":"5","author":"Schuster","year":"2008","journal-title":"Nat. Methods"},{"key":"2023012513220419100_bts552-B26","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1093\/bib\/bbr009","article-title":"A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis","volume":"13","author":"Sun","year":"2012","journal-title":"Brief. Bioinform."},{"key":"2023012513220419100_bts552-B27","doi-asserted-by":"crossref","first-page":"1277","DOI":"10.1038\/ismej.2011.187","article-title":"Secondary structure information does not improve OTU assignment for partial 16s rRNA sequences","volume":"6","author":"Wang","year":"2012","journal-title":"ISME J."},{"key":"2023012513220419100_bts552-B28","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1038\/ismej.2011.82","article-title":"Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys","volume":"6","author":"Werner","year":"2012","journal-title":"ISME J."},{"key":"2023012513220419100_bts552-B29","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/ismej.2011.71","article-title":"Saliva microbiomes distinguish caries-active from healthy human populations","volume":"6","author":"Yang","year":"2012","journal-title":"ISME J."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/22\/2891\/48871909\/bioinformatics_28_22_2891.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/22\/2891\/48871909\/bioinformatics_28_22_2891.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,25]],"date-time":"2023-01-25T19:19:59Z","timestamp":1674674399000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/28\/22\/2891\/241231"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,9,8]]},"references-count":30,"journal-issue":{"issue":"22","published-print":{"date-parts":[[2012,11,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bts552","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2012,11,15]]},"published":{"date-parts":[[2012,9,8]]}}}