{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,29]],"date-time":"2025-11-29T08:02:51Z","timestamp":1764403371311,"version":"3.41.2"},"reference-count":47,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2024,6,13]],"date-time":"2024-06-13T00:00:00Z","timestamp":1718236800000},"content-version":"vor","delay-in-days":12,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Institutes of Health","award":["R35GM142725"],"award-info":[{"award-number":["R35GM142725"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,6,3]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Phylogenetic placement of a query sequence on a backbone tree is increasingly used across biomedical sciences to identify the content of a sample from its DNA content. The accuracy of such analyses depends on the density of the backbone tree, making it crucial that placement methods scale to very large trees. Moreover, a new paradigm has been recently proposed to place sequences on the species tree using single-gene data. The goal is to better characterize the samples and to enable combined analyses of marker-gene (e.g., 16S rRNA gene amplicon) and genome-wide data. The recent method DEPP enables performing such analyses using metric learning. However, metric learning is hampered by a need to compute and save a quadratically growing matrix of pairwise distances during training. 
Thus, the training phase of DEPP does not scale to more than roughly 10\u2009000 backbone species, a problem that we faced when trying to use our recently released Greengenes2 (GG2) reference tree containing 331\u2009270 species.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>This paper explores divide-and-conquer for training ensembles of DEPP models, culminating in a method called C-DEPP. While divide-and-conquer has been extensively used in phylogenetics, applying divide-and-conquer to data-hungry machine-learning methods needs nuance. C-DEPP uses carefully crafted techniques to enable quasi-linear scaling while maintaining accuracy. C-DEPP enables placing 20 million 16S fragments on the GG2 reference tree in 41\u2009h of computation.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>The dataset and C-DEPP software are freely available at https:\/\/github.com\/yueyujiang\/dataset_cdepp\/.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae361","type":"journal-article","created":{"date-parts":[[2024,6,13]],"date-time":"2024-06-13T00:42:47Z","timestamp":1718239367000},"source":"Crossref","is-referenced-by-count":2,"title":["Scaling DEPP phylogenetic placement to ultra-large reference trees: a tree-aware ensemble approach"],"prefix":"10.1093","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8425-7556","authenticated-orcid":false,"given":"Yueyu","family":"Jiang","sequence":"first","affiliation":[{"name":"Electrical and Computer Engineering Department, University of California San Diego , 9500 Gilman Dr , La Jolla, CA, 92093,","place":["United States"]}]},{"given":"Daniel","family":"McDonald","sequence":"additional","affiliation":[{"name":"Pediatrics Department, University of California San Diego , 9500 Gilman 
Dr , La Jolla, CA, 92093,","place":["United States"]}]},{"given":"Daniela","family":"Perry","sequence":"additional","affiliation":[{"name":"Pediatrics Department, University of California San Diego , 9500 Gilman Dr , La Jolla, CA, 92093,","place":["United States"]}]},{"given":"Rob","family":"Knight","sequence":"additional","affiliation":[{"name":"Pediatrics Department, University of California San Diego , 9500 Gilman Dr , La Jolla, CA, 92093,","place":["United States"]},{"name":"Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego , 9500 Gilman Dr , La Jolla, CA, 92093,","place":["United States"]}]},{"given":"Siavash","family":"Mirarab","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering Department, University of California San Diego , 9500 Gilman Dr , La Jolla, CA, 92093,","place":["United States"]},{"name":"Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego , 9500 Gilman Dr , La Jolla, CA, 92093,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2024,6,13]]},"reference":[{"key":"2024110718472151400_btae361-B1","doi-asserted-by":"crossref","first-page":"e00191\u201316","DOI":"10.1128\/mSystems.00191-16","article-title":"Deblur rapidly resolves single-nucleotide community sequence patterns","volume":"2","author":"Amir","year":"2017","journal-title":"mSystems"},{"key":"2024110718472151400_btae361-B2","doi-asserted-by":"crossref","first-page":"2500","DOI":"10.1038\/s41467-020-16366-7","article-title":"Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0","volume":"11","author":"Asnicar","year":"2020","journal-title":"Nat Commun"},{"key":"2024110718472151400_btae361-B3","doi-asserted-by":"crossref","first-page":"e0221068","DOI":"10.1371\/journal.pone.0221068","article-title":"TreeCluster: clustering biological sequences using phylogenetic 
trees","volume":"14","author":"Balaban","year":"2019","journal-title":"PLoS One"},{"key":"2024110718472151400_btae361-B4","doi-asserted-by":"crossref","first-page":"566","DOI":"10.1093\/sysbio\/syz063","article-title":"APPLES: scalable distance-based phylogenetic placement with or without alignments","volume":"69","author":"Balaban","year":"2020","journal-title":"Syst Biol"},{"key":"2024110718472151400_btae361-B5","doi-asserted-by":"crossref","first-page":"1213","DOI":"10.1111\/1755-0998.13527","article-title":"Fast and accurate distance\u2013based phylogenetic placement using divide and conquer","volume":"22","author":"Balaban","year":"2022","journal-title":"Mol Ecol Resour"},{"key":"2024110718472151400_btae361-B6","doi-asserted-by":"crossref","first-page":"768","DOI":"10.1038\/s41587-023-01868-8","article-title":"Generation of accurate, expandable phylogenomic trees with udance","volume":"42","author":"Balaban","year":"2023","journal-title":"Nat Biotechnol"},{"key":"2024110718472151400_btae361-B7","doi-asserted-by":"crossref","first-page":"365","DOI":"10.1093\/sysbio\/syy054","article-title":"EPA-ng: massively parallel evolutionary placement of genetic sequences","volume":"68","author":"Barbera","year":"2019","journal-title":"Syst Biol"},{"key":"2024110718472151400_btae361-B8","doi-asserted-by":"crossref","first-page":"e243","DOI":"10.7717\/peerj.243","article-title":"Phylosift: phylogenetic analysis of genomes and metagenomes","volume":"2","author":"Darling","year":"2014","journal-title":"PeerJ"},{"key":"2024110718472151400_btae361-B9","doi-asserted-by":"crossref","first-page":"5069","DOI":"10.1128\/AEM.03006-05","article-title":"Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB","volume":"72","author":"DeSantis","year":"2006","journal-title":"Appl Environ Microbiol"},{"key":"2024110718472151400_btae361-B10","doi-asserted-by":"crossref","first-page":"796","DOI":"10.1038\/s41592-018-0141-9","article-title":"Qiita: rapid, 
web-enabled microbiome meta-analysis","volume":"15","author":"Gonzalez","year":"2018","journal-title":"Nat Methods"},{"year":"2022","author":"Hasan","first-page":"1212","key":"2024110718472151400_btae361-B11"},{"key":"2024110718472151400_btae361-B12","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1162\/neco.1991.3.1.79","article-title":"Adaptive mixtures of local experts","volume":"3","author":"Jacobs","year":"1991","journal-title":"Neural Comput"},{"key":"2024110718472151400_btae361-B13","doi-asserted-by":"crossref","first-page":"e00021\u201318","DOI":"10.1128\/mSystems.00021-18","article-title":"Phylogenetic placement of exact amplicon sequences improves associations with clinical information","volume":"3","author":"Janssen","year":"2018","journal-title":"mSystems"},{"key":"2024110718472151400_btae361-B14","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1093\/sysbio\/syac031","article-title":"DEPP: deep learning enables extending species trees using single genes","volume":"72","author":"Jiang","year":"2022","journal-title":"Syst Biol"},{"key":"2024110718472151400_btae361-B15","first-page":"1256","article-title":"Learning hyperbolic embedding for phylogenetic tree placement and updates","volume":"11","author":"Jiang","year":"2022","journal-title":"Biology (Basel)"},{"key":"2024110718472151400_btae361-B16","doi-asserted-by":"crossref","first-page":"4453","DOI":"10.1093\/bioinformatics\/btz305","article-title":"Raxml-ng: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference","volume":"35","author":"Kozlov","year":"2019","journal-title":"Bioinformatics"},{"key":"2024110718472151400_btae361-B17","doi-asserted-by":"crossref","first-page":"2798","DOI":"10.1093\/molbev\/msv150","article-title":"FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program","volume":"32","author":"Lefort","year":"2015","journal-title":"Mol Biol 
Evol"},{"year":"2019","author":"Liao","doi-asserted-by":"publisher","key":"2024110718472151400_btae361-B18","DOI":"10.48550\/arXiv.1901.10668"},{"key":"2024110718472151400_btae361-B19","doi-asserted-by":"crossref","first-page":"3303","DOI":"10.1093\/bioinformatics\/btz068","article-title":"Rapid alignment-free phylogenetic identification of metagenomic sequences","volume":"35","author":"Linard","year":"2019","journal-title":"Bioinformatics"},{"key":"2024110718472151400_btae361-B20","doi-asserted-by":"crossref","first-page":"90","DOI":"10.1093\/sysbio\/syr095","article-title":"SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees","volume":"61","author":"Liu","year":"2011","journal-title":"Syst Biol"},{"key":"2024110718472151400_btae361-B21","doi-asserted-by":"crossref","first-page":"5970","DOI":"10.1073\/pnas.1521291113","article-title":"Scaling laws predict global microbial diversity","volume":"113","author":"Locey","year":"2016","journal-title":"Proc Natl Acad Sci"},{"key":"2024110718472151400_btae361-B22","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1038\/s41559-017-0091","article-title":"Parasites dominate hyperdiverse soil protist communities in neotropical rainforests","volume":"1","author":"Mah\u00e9","year":"2017","journal-title":"Nat Ecol Evol"},{"key":"2024110718472151400_btae361-B23","doi-asserted-by":"crossref","first-page":"1532","DOI":"10.1093\/bioinformatics\/btab875","article-title":"Completing gene trees without species trees in sub-quadratic time","volume":"38","author":"Mai","year":"2022","journal-title":"Bioinformatics"},{"key":"2024110718472151400_btae361-B24","doi-asserted-by":"crossref","first-page":"334","DOI":"10.1093\/sysbio\/syv082","article-title":"SimPhy: phylogenomic simulation of gene, locus, and species trees","volume":"65","author":"Mallo","year":"2016","journal-title":"Syst 
Biol"},{"key":"2024110718472151400_btae361-B25","doi-asserted-by":"crossref","first-page":"538","DOI":"10.1186\/1471-2105-11-538","article-title":"Pplacer: linear time maximum-likelihood and bayesian phylogenetic placement of sequences onto a fixed reference tree","volume":"11","author":"Matsen","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2024110718472151400_btae361-B26","doi-asserted-by":"publisher","DOI":"10.1101\/2022.12.19.520774","article-title":"Greengenes2 enables a shared data universe for microbiome studies","author":"McDonald","year":"2023","journal-title":"Nature Biotechnology"},{"key":"2024110718472151400_btae361-B27","first-page":"247","article-title":"SEPP: SAT\u00e9-enabled phylogenetic placement","author":"Mirarab","year":"2012","journal-title":"Pac Symp Biocomput"},{"key":"2024110718472151400_btae361-B28","doi-asserted-by":"crossref","first-page":"i541","DOI":"10.1093\/bioinformatics\/btu462","article-title":"ASTRAL: genome-scale coalescent-based species tree estimation","volume":"30","author":"Mirarab","year":"2014","journal-title":"Bioinformatics"},{"key":"2024110718472151400_btae361-B29","doi-asserted-by":"crossref","first-page":"165","DOI":"10.1186\/s13059-018-1554-6","article-title":"RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification","volume":"19","author":"Nasko","year":"2018","journal-title":"Genome Biol"},{"key":"2024110718472151400_btae361-B30","doi-asserted-by":"crossref","first-page":"i274","DOI":"10.1093\/bioinformatics\/bts218","article-title":"DACTAL: divide-and-conquer trees (almost) without alignments","volume":"28","author":"Nelesen","year":"2012","journal-title":"Bioinformatics"},{"key":"2024110718472151400_btae361-B31","doi-asserted-by":"crossref","first-page":"124","DOI":"10.1186\/s13059-015-0688-z","article-title":"Ultra-large alignments using phylogeny-aware profiles","volume":"16","author":"Nguyen","year":"2015","journal-title":"Genome 
Biol"},{"key":"2024110718472151400_btae361-B32","doi-asserted-by":"crossref","first-page":"996","DOI":"10.1038\/nbt.4229","article-title":"A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life","volume":"36","author":"Parks","year":"2018","journal-title":"Nat Biotechnol"},{"key":"2024110718472151400_btae361-B33","doi-asserted-by":"crossref","first-page":"190","DOI":"10.3390\/e21020190","article-title":"Mixture of experts with entropic regularization for data classification","volume":"21","author":"Peralta","year":"2019","journal-title":"Entropy (Basel)"},{"key":"2024110718472151400_btae361-B34","doi-asserted-by":"crossref","first-page":"D590","DOI":"10.1093\/nar\/gks1219","article-title":"The SILVA ribosomal RNA gene database project: improved data processing and web-based tools","volume":"41","author":"Quast","year":"2012","journal-title":"Nucleic Acids Res"},{"key":"2024110718472151400_btae361-B35","doi-asserted-by":"crossref","first-page":"817","DOI":"10.1016\/j.cels.2022.06.007","article-title":"Quantifying the uncertainty of assembly-free genome-wide distance estimates and phylogenetic relationships using subsampling","volume":"13","author":"Rachtman","year":"2022","journal-title":"Cell Syst"},{"key":"2024110718472151400_btae361-B36","doi-asserted-by":"crossref","first-page":"131","DOI":"10.1016\/0025-5564(81)90043-2","article-title":"Comparison of phylogenetic trees","volume":"53","author":"Robinson","year":"1981","journal-title":"Mathematical Biosciences"},{"key":"2024110718472151400_btae361-B37","doi-asserted-by":"crossref","first-page":"D637","DOI":"10.1093\/nar\/gky1008","article-title":"gcMeta: a global catalogue of metagenomics platform to support the archiving, standardization and analysis of microbiome data","volume":"47","author":"Shi","year":"2019","journal-title":"Nucleic Acids 
Res"},{"key":"2024110718472151400_btae361-B38","doi-asserted-by":"crossref","first-page":"1312","DOI":"10.1093\/bioinformatics\/btu033","article-title":"Raxml version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies","volume":"30","author":"Stamatakis","year":"2014","journal-title":"Bioinformatics"},{"key":"2024110718472151400_btae361-B39","doi-asserted-by":"crossref","first-page":"809","DOI":"10.1038\/s41588-021-00862-7","article-title":"Ultrafast sample placement on existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic","volume":"53","author":"Turakhia","year":"2021","journal-title":"Nat Genet"},{"key":"2024110718472151400_btae361-B40","doi-asserted-by":"crossref","first-page":"e3000494","DOI":"10.1371\/journal.pbio.3000494","article-title":"Inferring the mammal tree: species-level sets of phylogenies for questions in ecology, evolution, and conservation","volume":"17","author":"Upham","year":"2019","journal-title":"PLoS Biol"},{"year":"2021","author":"Wedell","first-page":"94","key":"2024110718472151400_btae361-B41"},{"key":"2024110718472151400_btae361-B42","doi-asserted-by":"crossref","first-page":"1417","DOI":"10.1109\/TCBB.2022.3170386","article-title":"Scampp: scaling alignment-based phylogenetic placement to large trees","volume":"20","author":"Wedell","year":"2023","journal-title":"IEEE\/ACM Trans Comput Biol Bioinform"},{"year":"2022","author":"Wedell","key":"2024110718472151400_btae361-B43"},{"key":"2024110718472151400_btae361-B44","doi-asserted-by":"crossref","first-page":"msac215","DOI":"10.1093\/molbev\/msac215","article-title":"Weighting by gene tree uncertainty improves accuracy of quartet-based species trees","volume":"39","author":"Zhang","year":"2022","journal-title":"Mol Biol Evol"},{"key":"2024110718472151400_btae361-B45","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1186\/s13059-018-1450-0","article-title":"HmmUFOtu: an HMM and phylogenetic placement based ultra-fast taxonomic 
assignment and OTU picking tool for microbiome amplicon sequencing studies","volume":"19","author":"Zheng","year":"2018","journal-title":"Genome Biol"},{"key":"2024110718472151400_btae361-B46","doi-asserted-by":"crossref","first-page":"5477","DOI":"10.1038\/s41467-019-13443-4","article-title":"Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains bacteria and archaea","volume":"10","author":"Zhu","year":"2019","journal-title":"Nat Commun"},{"key":"2024110718472151400_btae361-B47","doi-asserted-by":"crossref","first-page":"588","DOI":"10.1080\/10635150290102339","article-title":"Increased taxon sampling greatly reduces phylogenetic error","volume":"51","author":"Zwickl","year":"2002","journal-title":"Syst Biol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae361\/58238097\/btae361.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/6\/btae361\/60482992\/btae361.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/6\/btae361\/60482992\/btae361.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,7]],"date-time":"2024-11-07T18:53:03Z","timestamp":1731005583000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae361\/7693069"}},"subtitle":[],"editor":[{"given":"Russell","family":"Schwartz","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,6]]},"references-count":47,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,6,3]]}},"URL":"https:\/\/doi.org\/10.1093\/bi
oinformatics\/btae361","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2024,6]]},"published":{"date-parts":[[2024,6]]},"article-number":"btae361"}}