{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T00:37:44Z","timestamp":1775867864706,"version":"3.50.1"},"reference-count":46,"publisher":"Oxford University Press (OUP)","issue":"20","license":[{"start":{"date-parts":[[2017,6,19]],"date-time":"2017-06-19T00:00:00Z","timestamp":1497830400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/about_us\/legal\/notices"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["U19 Al109673"],"award-info":[{"award-number":["U19 Al109673"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2017,10,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Summary<\/jats:title>\n                  <jats:p>Nonribosomally synthesized peptides (NRPs) are natural products with widespread applications in medicine and biotechnology. Many algorithms have been developed to predict the substrate specificities of nonribosomal peptide synthetase adenylation (A) domains from DNA sequences, which enables prioritization and dereplication, and integration with other data types in discovery efforts. However, insufficient training data and a lack of clarity regarding prediction quality have impeded optimal use. Here, we introduce prediCAT, a new phylogenetics-inspired algorithm, which quantitatively estimates the degree of predictability of each A-domain. We then systematically benchmarked all algorithms on a newly gathered, independent test set of 434 A-domain sequences, showing that active-site-motif-based algorithms outperform whole-domain-based methods. Subsequently, we developed SANDPUMA, a powerful ensemble algorithm, based on newly trained versions of all high-performing algorithms, which significantly outperforms individual methods. Finally, we deployed SANDPUMA in a systematic investigation of 7635 Actinobacteria genomes, suggesting that NRP chemical diversity is much higher than previously estimated. SANDPUMA has been integrated into the widely used antiSMASH biosynthetic gene cluster analysis pipeline and is also available as an open-source, standalone tool.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>SANDPUMA is freely available at https:\/\/bitbucket.org\/chevrm\/sandpuma and as a docker image at https:\/\/hub.docker.com\/r\/chevrm\/sandpuma\/ under the GNU Public License 3 (GPL3).<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btx400","type":"journal-article","created":{"date-parts":[[2017,6,16]],"date-time":"2017-06-16T11:09:54Z","timestamp":1497611394000},"page":"3202-3210","source":"Crossref","is-referenced-by-count":104,"title":["SANDPUMA: ensemble predictions of nonribosomal peptide chemistry reveal biosynthetic diversity across <i>Actinobacteria<\/i>"],"prefix":"10.1093","volume":"33","author":[{"given":"Marc G","family":"Chevrette","sequence":"first","affiliation":[{"name":"Department of Genetics, University of Wisconsin-Madison, Madison, WI, USA"},{"name":"Department of Bacteriology and J. F. Crow Institute for the Study of Evolution, University of Wisconsin-Madison, Madison, WI, USA"}]},{"given":"Fabian","family":"Aicheler","sequence":"additional","affiliation":[{"name":"Applied Bioinformatics, Department of Computer Science, Quantitative Biology Center and Center for Bioinformatics, University of T\u00fcbingen, T\u00fcbingen, Germany"}]},{"given":"Oliver","family":"Kohlbacher","sequence":"additional","affiliation":[{"name":"Applied Bioinformatics, Department of Computer Science, Quantitative Biology Center and Center for Bioinformatics, University of T\u00fcbingen, T\u00fcbingen, Germany"},{"name":"Biomolecular Interactions, Max Planck Institute for Developmental Biology, T\u00fcbingen, Germany"}]},{"given":"Cameron R","family":"Currie","sequence":"additional","affiliation":[{"name":"Department of Bacteriology and J. F. Crow Institute for the Study of Evolution, University of Wisconsin-Madison, Madison, WI, USA"}]},{"given":"Marnix H","family":"Medema","sequence":"additional","affiliation":[{"name":"Bioinformatics Group, Wageningen University, Wageningen, The Netherlands"}]}],"member":"286","published-online":{"date-parts":[[2017,6,19]]},"reference":[{"key":"2023020207522182700_btx400-B2","first-page":"181","author":"Bachmann","year":"2009"},{"key":"2023020207522182700_btx400-B3","doi-asserted-by":"crossref","first-page":"461","DOI":"10.1007\/s10295-013-1322-2","article-title":"Predicting substrate specificity of adenylation domains of nonribosomal peptide synthetases and other protein properties by latent semantic indexing","volume":"41","author":"Barana\u0161i\u0107","year":"2014","journal-title":"J. Ind. Microbiol. Biotechnol"},{"key":"2023020207522182700_btx400-B4","first-page":"1019","article-title":"antiSMASH 4.0\u2013\u2013improvements in chemistry prediction and gene cluster boundary identification","volume":"1854","author":"Blin","year":"2017","journal-title":"Nucleic Acids Res"},{"key":"2023020207522182700_btx400-B5","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1038\/nmeth.3176","article-title":"Fast and sensitive protein alignment using DIAMOND","volume":"12","author":"Buchfink","year":"2015","journal-title":"Nat. Methods"},{"key":"2023020207522182700_btx400-B6","doi-asserted-by":"crossref","first-page":"5143","DOI":"10.1128\/JB.00315-10","article-title":"Diversity of monomers in nonribosomal peptides: towards the prediction of origin and biological activity","volume":"192","author":"Caboche","year":"2010","journal-title":"J. Bacteriol"},{"key":"2023020207522182700_btx400-B7","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1016\/S1074-5521(00)00091-0","article-title":"Predictive, structure-based model of amino acid recognition by nonribosomal peptide synthetase adenylation domains","volume":"7","author":"Challis","year":"2000","journal-title":"Chem. Biol"},{"key":"2023020207522182700_btx400-B8","doi-asserted-by":"crossref","first-page":"412","DOI":"10.1016\/j.cell.2014.06.034","article-title":"Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters","volume":"158","author":"Cimermancic","year":"2014","journal-title":"Cell"},{"key":"2023020207522182700_btx400-B9","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1093\/jpe\/rtr044","article-title":"Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages","volume":"5","author":"Colwell","year":"2012","journal-title":"J. Plant Ecol"},{"key":"2023020207522182700_btx400-B10","doi-asserted-by":"crossref","first-page":"1041","DOI":"10.1039\/C2SC21722H","article-title":"Evolution-guided engineering of nonribosomal peptide synthetase adenylation domains","volume":"4","author":"Cr\u00fcsemann","year":"2013","journal-title":"Chem. Sci"},{"key":"2023020207522182700_btx400-B11","doi-asserted-by":"crossref","first-page":"1906","DOI":"10.1093\/gbe\/evw125","article-title":"Phylogenomic analysis of natural products biosynthetic gene clusters allows discovery of arseno-organic metabolites in model streptomycetes","volume":"8","author":"Cruz-Morales","year":"2016","journal-title":"Genome Biol. Evol"},{"key":"2023020207522182700_btx400-B12","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s10295-013-1337-8","article-title":"Evolutionary concepts in natural products discovery: what actinomycetes have taught us","volume":"41","author":"Diminic","year":"2014","journal-title":"J. Ind. Microbiol. Biotechnol"},{"key":"2023020207522182700_btx400-B13","doi-asserted-by":"crossref","first-page":"1402","DOI":"10.1016\/j.cell.2014.08.032","article-title":"A systematic analysis of biosynthetic gene clusters in the human microbiome reveals a common family of antibiotics","volume":"158","author":"Donia","year":"2014","journal-title":"Cell"},{"key":"2023020207522182700_btx400-B14","doi-asserted-by":"crossref","first-page":"963","DOI":"10.1038\/nchembio.1659","article-title":"A roadmap for natural product discovery based on large-scale genomics and metabolomics","volume":"10","author":"Doroghazi","year":"2014","journal-title":"Nat. Chem. Biol"},{"key":"2023020207522182700_btx400-B15","doi-asserted-by":"crossref","first-page":"e1002195","DOI":"10.1371\/journal.pcbi.1002195","article-title":"Accelerated profile HMM searches","volume":"7","author":"Eddy","year":"2011","journal-title":"PLoS Comput. Biol"},{"key":"2023020207522182700_btx400-B16","doi-asserted-by":"crossref","first-page":"3468","DOI":"10.1021\/cr0503097","article-title":"Assembly-line enzymology for polyketide and nonribosomal peptide antibiotics: logic, machinery, and mechanisms","volume":"5","author":"Fischbach","year":"2006","journal-title":"Chem. Rev"},{"key":"2023020207522182700_btx400-B19","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1186\/1471-2105-11-119","article-title":"Prodigal: prokaryotic gene recognition and translation initiation site identification","volume":"11","author":"Hyatt","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023020207522182700_btx400-B48","doi-asserted-by":"crossref","first-page":"19196","DOI":"10.1073\/pnas.1206376109","article-title":"Dereplicating nonribosomal peptides using an informatic search algorithm for natural products (iSNAP) discovery","volume":"109","author":"Ibrahim","year":"2012","journal-title":"Proc Natl Acad Sci USA"},{"key":"2023020207522182700_btx400-B20","doi-asserted-by":"crossref","first-page":"772","DOI":"10.1093\/molbev\/mst010","article-title":"MAFFT Multiple Sequence Alignment Software Version 7: improvements in performance and usability","volume":"30","author":"Katoh","year":"2013","journal-title":"Mol. Biol. Evol"},{"key":"2023020207522182700_btx400-B21","doi-asserted-by":"crossref","first-page":"e62136","DOI":"10.1371\/journal.pone.0062136","article-title":"Classification of the adenylation and acyl-transferase activity of NRPS and PKS systems using ensembles of substrate specific hidden Markov models","volume":"8","author":"Khayatt","year":"2013","journal-title":"PloS One"},{"key":"2023020207522182700_btx400-B22","first-page":"btv600","article-title":"Computational discovery of specificity-conferring sites in non-ribosomal peptide synthetases","author":"Knudsen","year":"2015","journal-title":"Bioinformatics"},{"key":"2023020207522182700_btx400-B23","doi-asserted-by":"crossref","first-page":"2947","DOI":"10.1093\/bioinformatics\/btm404","article-title":"Clustal W and Clustal X version 2.0","volume":"23","author":"Larkin","year":"2007","journal-title":"Bioinformatics"},{"key":"2023020207522182700_btx400-B24","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1146\/annurev-micro-102215-095748","article-title":"Evolution and ecology of actinobacteria and their bioenergy applications","volume":"70","author":"Lewin","year":"2016","journal-title":"Annu. Rev. Microbiol"},{"key":"2023020207522182700_btx400-B49","doi-asserted-by":"crossref","first-page":"185","DOI":"10.1186\/1471-2105-10-185","article-title":"Automated genome mining for natural products","volume":"10","author":"Li","year":"2009","journal-title":"BMC Bioinformatics"},{"key":"2023020207522182700_btx400-B25","doi-asserted-by":"crossref","first-page":"2081","DOI":"10.1093\/bioinformatics\/btl366","article-title":"An initial strategy for comparing proteins at the domain architecture level","volume":"22","author":"Lin","year":"2006","journal-title":"Bioinformatics"},{"key":"2023020207522182700_btx400-B26","doi-asserted-by":"crossref","first-page":"285","DOI":"10.1093\/cid\/cir034","article-title":"Clinical practice guidelines by the Infectious Diseases Society of America for the treatment of methicillin-resistant Staphylococcus aureus infections in adults and children: executive summary","volume":"52","author":"Liu","year":"2011","journal-title":"Clin. Infect. Dis"},{"issue":"Suppl. 2","key":"2023020207522182700_btx400-B50","doi-asserted-by":"crossref","first-page":"339","DOI":"10.1093\/nar\/gkr466","article-title":"AntiSMASH: Rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences","volume":"39","author":"Medema","year":"2011","journal-title":"Nucleic Acids Res"},{"key":"2023020207522182700_btx400-B27","doi-asserted-by":"crossref","first-page":"e1004016","DOI":"10.1371\/journal.pcbi.1004016","article-title":"A systematic computational analysis of biosynthetic gene cluster evolution: lessons for engineering biosynthesis","volume":"10","author":"Medema","year":"2014","journal-title":"PLoS Comput. Biol"},{"key":"2023020207522182700_btx400-B28","doi-asserted-by":"crossref","first-page":"e1003822","DOI":"10.1371\/journal.pcbi.1003822","article-title":"Pep2Path: automated mass spectrometry-guided genome mining of peptidic natural products","volume":"10","author":"Medema","year":"2014","journal-title":"PLoS Comput. Biol"},{"key":"2023020207522182700_btx400-B29","doi-asserted-by":"crossref","first-page":"625","DOI":"10.1038\/nchembio.1890","article-title":"Minimum information about a biosynthetic gene cluster","volume":"11","author":"Medema","year":"2015","journal-title":"Nat. Chem. Biol"},{"key":"2023020207522182700_btx400-B30","doi-asserted-by":"crossref","first-page":"1500","DOI":"10.1016\/j.jmb.2007.02.099","article-title":"Comprehensive analysis of distinctive polyketide and nonribosomal peptide structural motifs encoded in microbial genomes","volume":"368","author":"Minowa","year":"2007","journal-title":"J. Mol. Biol"},{"key":"2023020207522182700_btx400-B31","doi-asserted-by":"crossref","first-page":"1902","DOI":"10.1021\/np500370c","article-title":"NRPquest: coupling mass spectrometry and genome mining for nonribosomal peptide discovery","volume":"77","author":"Mohimani","year":"2014","journal-title":"J. Nat. Prod"},{"key":"2023020207522182700_btx400-B32","doi-asserted-by":"crossref","first-page":"16197","DOI":"10.1038\/nmicrobiol.2016.197","article-title":"Indexing the Pseudomonas specialized metabolome enabled the discovery of poaeamide B and the bananamides","volume":"2","author":"Nguyen","year":"2016","journal-title":"Nat. Microbiol"},{"key":"2023020207522182700_btx400-B33","author":"O\u2019Neill","year":"2016"},{"key":"2023020207522182700_btx400-B34","doi-asserted-by":"crossref","first-page":"391","DOI":"10.1038\/nchembio.159","article-title":"Dentigerumycin: a bacterial mediator of an ant-fungus symbiosis","volume":"5","author":"Oh","year":"2009","journal-title":"Nat. Chem. Biol"},{"key":"2023020207522182700_btx400-B35","first-page":"2825","article-title":"Scikit-learn: Machine Learning in Python","volume":"12","author":"Pedregosa","year":"2012","journal-title":"J. Mach. Learn. Res"},{"key":"2023020207522182700_btx400-B36","doi-asserted-by":"crossref","first-page":"e9490","DOI":"10.1371\/journal.pone.0009490","article-title":"FastTree 2 \u2013 approximately maximum-likelihood trees for large alignments","volume":"5","author":"Price","year":"2010","journal-title":"PLoS ONE"},{"key":"2023020207522182700_btx400-B37","doi-asserted-by":"crossref","first-page":"426","DOI":"10.1093\/bioinformatics\/btr659","article-title":"NRPSSP: Non-ribosomal peptide synthase substrate predictor","volume":"28","author":"Prieto","year":"2012","journal-title":"Bioinformatics"},{"key":"2023020207522182700_btx400-B38","doi-asserted-by":"crossref","first-page":"5799","DOI":"10.1093\/nar\/gki885","article-title":"Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs)","volume":"33","author":"Rausch","year":"2005","journal-title":"Nucleic Acids Res"},{"key":"2023020207522182700_btx400-B40","doi-asserted-by":"crossref","first-page":"W362","DOI":"10.1093\/nar\/gkr323","article-title":"NRPSpredictor2 \u2013 a web server for predicting NRPS adenylation domain specificity","volume":"39","author":"R\u00f6ttig","year":"2011","journal-title":"Nucleic Acids Res"},{"key":"2023020207522182700_btx400-B41","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1186\/1471-2180-8-141","article-title":"Recombination and selectional forces in cyanopeptolin NRPS operons from highly similar, but geographically remote Planktothrix strains","volume":"8","author":"Rounge","year":"2008","journal-title":"BMC Microbiol"},{"key":"2023020207522182700_btx400-B42","doi-asserted-by":"crossref","first-page":"770","DOI":"10.1038\/nchembio.2144","article-title":"A hybrid polyketide\u2013nonribosomal peptide in nematodes that promotes larval survival","volume":"12","author":"Shou","year":"2016","journal-title":"Nat. Chem. Biol"},{"key":"2023020207522182700_btx400-B43","doi-asserted-by":"crossref","first-page":"gkv1012","DOI":"10.1093\/nar\/gkv1012","article-title":"Genomes to natural products PRediction Informatics for Secondary Metabolomes (PRISM)","volume":"9140","author":"Skinnider","year":"2015","journal-title":"Nucleic Acids Res"},{"key":"2023020207522182700_btx400-B44","doi-asserted-by":"crossref","first-page":"493","DOI":"10.1016\/S1074-5521(99)80082-9","article-title":"The specificity-conferring code of adenylation domains in nonribosomal peptide synthetases","volume":"6","author":"Stachelhaus","year":"1999","journal-title":"Chem. Biol"},{"key":"2023020207522182700_btx400-B45","doi-asserted-by":"crossref","first-page":"1312","DOI":"10.1093\/bioinformatics\/btu033","article-title":"RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies","volume":"30","author":"Stamatakis","year":"2014","journal-title":"Bioinformatics"},{"key":"2023020207522182700_btx400-B46","first-page":"1","article-title":"Insights into the chemical logic and enzymatic machinery of NRPS assembly lines","volume":"00","author":"Walsh","year":"2015","journal-title":"Nat. Prod. Rep"},{"key":"2023020207522182700_btx400-B47","doi-asserted-by":"crossref","first-page":"828","DOI":"10.1038\/nbt.3597","article-title":"Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking","volume":"34","author":"Wang","year":"2016","journal-title":"Nat. Biotechnol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/33\/20\/3202\/49043009\/bioinformatics_33_20_3202.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/33\/20\/3202\/49043009\/bioinformatics_33_20_3202.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,2]],"date-time":"2023-02-02T07:58:56Z","timestamp":1675324736000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/33\/20\/3202\/3870463"}},"subtitle":[],"editor":[{"given":"Inanc","family":"Birol","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2017,6,19]]},"references-count":46,"journal-issue":{"issue":"20","published-print":{"date-parts":[[2017,10,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btx400","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2017,10,15]]},"published":{"date-parts":[[2017,6,19]]}}}