{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,4]],"date-time":"2026-05-04T03:00:00Z","timestamp":1777863600124,"version":"3.51.4"},"update-to":[{"DOI":"10.1371\/journal.pcbi.1010462","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2023,2,22]],"date-time":"2023-02-22T00:00:00Z","timestamp":1677024000000}}],"reference-count":50,"publisher":"Public Library of Science (PLoS)","issue":"2","license":[{"start":{"date-parts":[[2023,2,9]],"date-time":"2023-02-09T00:00:00Z","timestamp":1675900800000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100013407","name":"Netherlands eScience Center","doi-asserted-by":"publisher","award":["ASDI.2017.030"],"award-info":[{"award-number":["ASDI.2017.030"]}],"id":[{"id":"10.13039\/100013407","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100013407","name":"Netherlands eScience Center","doi-asserted-by":"publisher","award":["NLESC.OEC.2021.002"],"award-info":[{"award-number":["NLESC.OEC.2021.002"]}],"id":[{"id":"10.13039\/100013407","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["www.ploscompbiol.org"],"crossmark-restriction":false},"short-container-title":["PLoS Comput Biol"],"abstract":"<jats:p>Microbial specialised metabolism is full of valuable natural products that are applied clinically, agriculturally, and industrially. The genes that encode their biosynthesis are often physically clustered on the genome in biosynthetic gene clusters (BGCs). Many BGCs consist of multiple groups of co-evolving genes called sub-clusters that are responsible for the biosynthesis of a specific chemical moiety in a natural product. Sub-clusters therefore provide an important link between the structures of a natural product and its BGC, which can be leveraged for predicting natural product structures from sequence, as well as for linking chemical structures and metabolomics-derived mass features to BGCs. While some initial computational methodologies have been devised for sub-cluster detection, current approaches are not scalable, have only been run on small and outdated datasets, or produce an impractically large number of possible sub-clusters to mine through. Here, we constructed a scalable method for unsupervised sub-cluster detection, called iPRESTO, based on topic modelling and statistical analysis of co-occurrence patterns of enzyme-coding protein families. iPRESTO was used to mine sub-clusters across 150,000 prokaryotic BGCs from antiSMASH-DB. After annotating a fraction of the resulting sub-cluster families, we could predict a substructure for 16% of the antiSMASH-DB BGCs. Additionally, our method was able to confirm 83% of the experimentally characterised sub-clusters in MIBiG reference BGCs. Based on iPRESTO-detected sub-clusters, we could correctly identify the BGCs for xenorhabdin and salbostatin biosynthesis (which had not yet been annotated in BGC databases), as well as propose a candidate BGC for akashin biosynthesis. Additionally, we show for a collection of 145 actinobacteria how substructures can aid in linking BGCs to molecules by correlating iPRESTO-detected sub-clusters to MS\/MS-derived Mass2Motifs substructure patterns. This work paves the way for deeper functional and structural annotation of microbial BGCs by improved linking of orphan molecules to their cognate gene clusters, thus facilitating accelerated natural product discovery.<\/jats:p>","DOI":"10.1371\/journal.pcbi.1010462","type":"journal-article","created":{"date-parts":[[2023,2,9]],"date-time":"2023-02-09T13:54:50Z","timestamp":1675950890000},"page":"e1010462","update-policy":"https:\/\/doi.org\/10.1371\/journal.pcbi.corrections_policy","source":"Crossref","is-referenced-by-count":12,"title":["iPRESTO: Automated discovery of biosynthetic sub-clusters linked to specific natural product substructures"],"prefix":"10.1371","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4887-9109","authenticated-orcid":true,"given":"Joris J. R.","family":"Louwen","sequence":"first","affiliation":[]},{"given":"Satria A.","family":"Kautsar","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1250-6968","authenticated-orcid":true,"given":"Sven","family":"van der Burg","sequence":"additional","affiliation":[]},{"given":"Marnix H.","family":"Medema","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9340-5511","authenticated-orcid":true,"given":"Justin J. J.","family":"van der Hooft","sequence":"additional","affiliation":[]}],"member":"340","published-online":{"date-parts":[[2023,2,9]]},"reference":[{"issue":"12","key":"pcbi.1010462.ref001","doi-asserted-by":"crossref","first-page":"4022","DOI":"10.1016\/j.bmc.2009.01.046","article-title":"Natural products in crop protection","volume":"17","author":"FE Dayan","year":"2009","journal-title":"Bioorganic & medicinal chemistry"},{"issue":"5937","key":"pcbi.1010462.ref002","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1126\/science.1168243","article-title":"Drug Discovery and Natural Products: End of an Era or an Endless Frontier?","volume":"325","author":"JWH Li","year":"2009","journal-title":"Science"},{"issue":"22","key":"pcbi.1010462.ref003","doi-asserted-by":"crossref","first-page":"5601","DOI":"10.1073\/pnas.1614680114","article-title":"Retrospective analysis of natural products provides insights for future discovery trends","volume":"114","author":"CR Pye","year":"2017","journal-title":"Proc Natl Acad Sci U S A"},{"issue":"12","key":"pcbi.1010462.ref004","doi-asserted-by":"crossref","first-page":"e1004016","DOI":"10.1371\/journal.pcbi.1004016","article-title":"A systematic computational analysis of biosynthetic gene cluster evolution: lessons for engineering biosynthesis","volume":"10","author":"MH Medema","year":"2014","journal-title":"PLoS Comput Biol"},{"key":"pcbi.1010462.ref005","article-title":"Emerging evolutionary paradigms in antibiotic discovery","author":"MG Chevrette","year":"2018","journal-title":"J Ind Microbiol Biotechnol"},{"issue":"2","key":"pcbi.1010462.ref006","doi-asserted-by":"crossref","first-page":"412","DOI":"10.1016\/j.cell.2014.06.034","article-title":"Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters","volume":"158","author":"P Cimermancic","year":"2014","journal-title":"Cell"},{"issue":"W1","key":"pcbi.1010462.ref007","doi-asserted-by":"crossref","first-page":"W29","DOI":"10.1093\/nar\/gkab335","article-title":"antiSMASH 6.0: improving cluster detection and comparison capabilities","volume":"49","author":"K Blin","year":"2021","journal-title":"Nucleic Acids Research"},{"issue":"1","key":"pcbi.1010462.ref008","doi-asserted-by":"crossref","first-page":"6058","DOI":"10.1038\/s41467-020-19986-1","article-title":"Comprehensive prediction of secondary metabolite structure and biological activity from microbial genome sequences","volume":"11","author":"MA Skinnider","year":"2020","journal-title":"Nature Communications"},{"issue":"12","key":"pcbi.1010462.ref009","doi-asserted-by":"crossref","first-page":"4601","DOI":"10.1073\/pnas.0709132105","article-title":"The evolution of gene collectives: How natural selection drives chemical innovation","volume":"105","author":"MA Fischbach","year":"2008","journal-title":"Proceedings of the National Academy of Sciences"},{"issue":"1","key":"pcbi.1010462.ref010","first-page":"2","article-title":"Computational identification of co-evolving multi-gene modules in microbial biosynthetic gene clusters","author":"F Del Carratore","year":"2019","journal-title":"Communications Biology"},{"issue":"D1","key":"pcbi.1010462.ref011","doi-asserted-by":"crossref","first-page":"D639","DOI":"10.1093\/nar\/gkaa978","article-title":"The antiSMASH database version 3: increased taxonomic coverage and new query features for modular enzymes","volume":"49","author":"K Blin","year":"2020","journal-title":"Nucleic Acids Research"},{"issue":"4","key":"pcbi.1010462.ref012","first-page":"e00726","article-title":"Comprehensive large-scale integrative analysis of omics data to accelerate specialized metabolite discovery","volume":"6","author":"JJR Louwen","year":"2021","journal-title":"Msystems"},{"issue":"11","key":"pcbi.1010462.ref013","doi-asserted-by":"crossref","first-page":"3297","DOI":"10.1039\/D0CS00162G","article-title":"Linking genomics and metabolomics to chart specialized metabolic diversity","volume":"49","author":"JJJ van der Hooft","year":"2020","journal-title":"Chemical Society Reviews"},{"issue":"48","key":"pcbi.1010462.ref014","doi-asserted-by":"crossref","first-page":"13738","DOI":"10.1073\/pnas.1608041113","article-title":"Topic modeling for untargeted substructure exploration in metabolomics","volume":"113","author":"JJJ van der Hooft","year":"2016","journal-title":"Proc Natl Acad Sci U S A"},{"issue":"11","key":"pcbi.1010462.ref015","doi-asserted-by":"crossref","first-page":"963","DOI":"10.1038\/nchembio.1659","article-title":"A roadmap for natural product discovery based on large-scale genomics and metabolomics","volume":"10","author":"JR Doroghazi","year":"2014","journal-title":"Nat Chem Biol"},{"issue":"D1","key":"pcbi.1010462.ref016","first-page":"D454","article-title":"MIBiG 2.0: a repository for biosynthetic gene clusters of known function","volume":"48","author":"SA Kautsar","year":"2019","journal-title":"Nucleic Acids Research"},{"issue":"1","key":"pcbi.1010462.ref017","doi-asserted-by":"crossref","DOI":"10.1093\/gigascience\/giaa154","article-title":"BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters","volume":"10","author":"SA Kautsar","year":"2021","journal-title":"GigaScience"},{"key":"pcbi.1010462.ref018","doi-asserted-by":"crossref","unstructured":"Chen X, Hu X, Shen X, Rosen G, editors. Probabilistic topic modeling for genomic data interpretation. 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2010: IEEE.","DOI":"10.1109\/BIBM.2010.5706554"},{"key":"pcbi.1010462.ref019","doi-asserted-by":"crossref","first-page":"W204","DOI":"10.1093\/nar\/gkt449","article-title":"antiSMASH 2.0\u2014a versatile platform for genome mining of secondary metabolite producers","volume":"41","author":"K Blin","year":"2013","journal-title":"Nucleic Acids Res"},{"issue":"18","key":"pcbi.1010462.ref020","doi-asserted-by":"crossref","first-page":"5494","DOI":"10.1021\/jm8006068","article-title":"Optimizing Natural Products by Biosynthetic Engineering: Discovery of Nonquinone Hsp90 Inhibitors","volume":"51","author":"M-Q Zhang","year":"2008","journal-title":"Journal of Medicinal Chemistry"},{"issue":"11","key":"pcbi.1010462.ref021","doi-asserted-by":"crossref","first-page":"1824","DOI":"10.1021\/acscentsci.9b00806","article-title":"The Natural Products Atlas: An Open Access Knowledge Base for Microbial Natural Products Discovery","volume":"5","author":"JA van Santen","year":"2019","journal-title":"ACS Central Science"},{"issue":"46","key":"pcbi.1010462.ref022","doi-asserted-by":"crossref","first-page":"19731","DOI":"10.1073\/pnas.1014140107","article-title":"Identification of the gene cluster for the dithiolopyrrolone antibiotic holomycin in Streptomyces clavuligerus","volume":"107","author":"B Li","year":"2010","journal-title":"Proceedings of the National Academy of Sciences"},{"issue":"3","key":"pcbi.1010462.ref023","doi-asserted-by":"crossref","first-page":"e18031","DOI":"10.1371\/journal.pone.0018031","article-title":"A Natural Plasmid Uniquely Encodes Two Biosynthetic Pathways Creating a Potent Anti-MRSA Antibiotic","volume":"6","author":"D Fukuda","year":"2011","journal-title":"PLOS ONE"},{"issue":"3","key":"pcbi.1010462.ref024","first-page":"277","article-title":"Identification and characterization of the biosynthetic gene cluster of thiolutin, a tumor angiogenesis inhibitor, in Saccharothrix algeriensis NRRL B-24137","volume":"15","author":"S Huang","year":"2015","journal-title":"Anti-Cancer Agents in Medicinal Chemistry (Formerly Current Medicinal Chemistry-Anti-Cancer Agents)"},{"issue":"3","key":"pcbi.1010462.ref025","doi-asserted-by":"crossref","first-page":"774","DOI":"10.1021\/np50075a005","article-title":"Biologically Active Metabolites from Xenorhabdus Spp., Part 1. Dithiolopyrrolone Derivatives with Antibiotic Activity","volume":"54","author":"BV McInerney","year":"1991","journal-title":"Journal of Natural Products"},{"issue":"7","key":"pcbi.1010462.ref026","doi-asserted-by":"crossref","first-page":"1115","DOI":"10.1002\/cbic.201500094","article-title":"Simple \u201cOn-Demand\u201d Production of Bioactive Natural Products","volume":"16","author":"E Bode","year":"2015","journal-title":"ChemBioChem"},{"issue":"4","key":"pcbi.1010462.ref027","doi-asserted-by":"crossref","first-page":"387","DOI":"10.1016\/j.chembiol.2006.02.002","article-title":"Functional analysis of the validamycin biosynthetic gene cluster and engineered production of validoxylamine A","volume":"13","author":"L Bai","year":"2006","journal-title":"Chemistry & biology"},{"issue":"5","key":"pcbi.1010462.ref028","doi-asserted-by":"crossref","first-page":"939","DOI":"10.1021\/np400159a","article-title":"Genetic Insights into Pyralomicin Biosynthesis in Nonomuraea spiralis IMC A-0156","volume":"76","author":"PM Flatt","year":"2013","journal-title":"Journal of Natural Products"},{"issue":"18","key":"pcbi.1010462.ref029","doi-asserted-by":"crossref","first-page":"1844","DOI":"10.1002\/anie.199418441","article-title":"The Trehalase Inhibitor Salbostatin, a Novel Metabolite from Streptomyces albus, ATCC21838","volume":"33","author":"L V\u00e9rtesy","year":"1994","journal-title":"Angewandte Chemie International Edition in English"},{"issue":"4","key":"pcbi.1010462.ref030","doi-asserted-by":"crossref","first-page":"637","DOI":"10.1007\/s00253-008-1591-2","article-title":"Genetic organization of the putative salbostatin biosynthetic gene cluster including the 2-epi-5-epi-valiolone synthase gene in Streptomyces albus ATCC 21838","volume":"80","author":"WS Choi","year":"2008","journal-title":"Applied Microbiology and Biotechnology"},{"issue":"1","key":"pcbi.1010462.ref031","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1038\/s41589-019-0400-9","article-title":"A computational framework to explore large-scale biosynthetic diversity","volume":"16","author":"JC Navarro-Mu\u00f1oz","year":"2020","journal-title":"Nature Chemical Biology"},{"issue":"19","key":"pcbi.1010462.ref032","doi-asserted-by":"crossref","first-page":"e00165","DOI":"10.1128\/MRA.00165-19","article-title":"Genome Sequence of Marine-Derived Streptomyces sp. Strain F001, a Producer of Akashin A and Diazaquinomycins","volume":"8","author":"J Braesel","year":"2019","journal-title":"Microbiology Resource Announcements"},{"issue":"1","key":"pcbi.1010462.ref033","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1016\/j.bbapap.2017.08.002","article-title":"In vitro characterization of CYP102G4 from Streptomyces cattleya: A self-sufficient P450 naturally producing indigo","volume":"1866","author":"J Kim","year":"2018","journal-title":"Biochimica et Biophysica Acta (BBA)\u2014Proteins and Proteomics"},{"issue":"7","key":"pcbi.1010462.ref034","doi-asserted-by":"crossref","first-page":"144","DOI":"10.3390\/metabo9070144","article-title":"MolNetEnhancer: Enhanced Molecular Networks by Integrating Metabolome Mining and Annotation Tools","volume":"9","author":"M Ernst","year":"2019","journal-title":"Metabolites"},{"issue":"5","key":"pcbi.1010462.ref035","doi-asserted-by":"crossref","first-page":"e1008920","DOI":"10.1371\/journal.pcbi.1008920","article-title":"Ranking microbial metabolomic and genomic links in the NPLinker framework using complementary scoring functions","volume":"17","author":"G Hj\u00f6rleifsson Eldj\u00e1rn","year":"2021","journal-title":"PLOS Computational Biology"},{"issue":"13","key":"pcbi.1010462.ref036","article-title":"Enhanced correlation-based linking of biosynthetic gene clusters to their metabolic products through chemical class matching","volume":"11","author":"JJR Louwen","year":"2023","journal-title":"Microbiome"},{"issue":"0","key":"pcbi.1010462.ref037","doi-asserted-by":"crossref","first-page":"284","DOI":"10.1039\/C8FD00235E","article-title":"Deciphering complex metabolite mixtures by unsupervised and supervised substructure discovery and semi-automated annotation from MS\/MS spectra","volume":"218","author":"S Rogers","year":"2019","journal-title":"Faraday Discussions"},{"issue":"3","key":"pcbi.1010462.ref038","doi-asserted-by":"crossref","first-page":"588","DOI":"10.1021\/acs.jnatprod.6b00722","article-title":"Prioritizing Natural Product Diversity in a Collection of 146 Bacterial Strains Based on Growth and Extraction Protocols","volume":"80","author":"M Cr\u00fcsemann","year":"2017","journal-title":"J Nat Prod"},{"issue":"D1","key":"pcbi.1010462.ref039","first-page":"D427","article-title":"The Pfam protein families database in 2019","volume":"47","author":"A Bateman","year":"2018","journal-title":"Nucleic Acids Research"},{"issue":"12","key":"pcbi.1010462.ref040","doi-asserted-by":"crossref","first-page":"e121","DOI":"10.1093\/nar\/gkt263","article-title":"Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions","volume":"41","author":"J Mistry","year":"2013","journal-title":"Nucleic acids research"},{"issue":"9","key":"pcbi.1010462.ref041","doi-asserted-by":"crossref","first-page":"575","DOI":"10.1145\/362342.362367","article-title":"Algorithm 457: finding all cliques of an undirected graph","volume":"16","author":"C Bron","year":"1973","journal-title":"Commun ACM"},{"issue":"1","key":"pcbi.1010462.ref042","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1186\/s12859-017-1519-x","article-title":"ECDomainMiner: discovering hidden associations between enzyme commission numbers and Pfam domains","volume":"18","author":"SZ Alborzi","year":"2017","journal-title":"BMC Bioinformatics"},{"issue":"4","key":"pcbi.1010462.ref043","doi-asserted-by":"crossref","first-page":"1165","DOI":"10.1214\/aos\/1013699998","article-title":"The control of the false discovery rate in multiple testing under dependency","volume":"29","author":"Y Benjamini","year":"2001","journal-title":"The annals of statistics"},{"key":"pcbi.1010462.ref044","unstructured":"Arthur D, Vassilvitskii S, editors. k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms; 2007: Society for Industrial and Applied Mathematics."},{"key":"pcbi.1010462.ref045","first-page":"2825","article-title":"Scikit-learn: Machine learning in Python","volume":"12","author":"F Pedregosa","year":"2011","journal-title":"the Journal of machine Learning research"},{"issue":"Jan","key":"pcbi.1010462.ref046","first-page":"993","article-title":"Latent dirichlet allocation","volume":"3","author":"DM Blei","year":"2003","journal-title":"Journal of machine Learning research"},{"key":"pcbi.1010462.ref047","unstructured":"Rehurek R, Sojka P, editors. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks; 2010: Citeseer."},{"key":"pcbi.1010462.ref048","article-title":"Online learning for latent dirichlet allocation","author":"M Hoffman","year":"2010","journal-title":"advances in neural information processing systems"},{"key":"pcbi.1010462.ref049","doi-asserted-by":"crossref","unstructured":"R\u00f6der M, Both A, Hinneburg A, editors. Exploring the space of topic coherence measures. Proceedings of the eighth ACM international conference on Web search and data mining; 2015.","DOI":"10.1145\/2684822.2685324"},{"issue":"14","key":"pcbi.1010462.ref050","doi-asserted-by":"crossref","first-page":"7569","DOI":"10.1021\/acs.analchem.7b01391","article-title":"Unsupervised Discovery and Comparison of Structural Families Across Multiple Samples in Untargeted Metabolomics","volume":"89","author":"JJJ van der Hooft","year":"2017","journal-title":"Anal Chem"}],"updated-by":[{"DOI":"10.1371\/journal.pcbi.1010462","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2023,2,22]],"date-time":"2023-02-22T00:00:00Z","timestamp":1677024000000}}],"container-title":["PLOS Computational Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1010462","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,22]],"date-time":"2023-02-22T13:42:00Z","timestamp":1677073320000},"score":1,"resource":{"primary":{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1010462"}},"subtitle":[],"editor":[{"given":"Jaime","family":"Huerta Cepas","sequence":"first","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2023,2,9]]},"references-count":50,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2023,2,9]]}},"URL":"https:\/\/doi.org\/10.1371\/journal.pcbi.1010462","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2022.08.05.502908","asserted-by":"object"}]},"ISSN":["1553-7358"],"issn-type":[{"value":"1553-7358","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,9]]}}}