{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,21]],"date-time":"2026-05-21T04:57:48Z","timestamp":1779339468288,"version":"3.51.4"},"update-to":[{"DOI":"10.1371\/journal.pcbi.1011162","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2023,6,5]],"date-time":"2023-06-05T00:00:00Z","timestamp":1685923200000}}],"reference-count":42,"publisher":"Public Library of Science (PLoS)","issue":"5","license":[{"start":{"date-parts":[[2023,5,23]],"date-time":"2023-05-23T00:00:00Z","timestamp":1684800000000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["www.ploscompbiol.org"],"crossmark-restriction":false},"short-container-title":["PLoS Comput Biol"],"abstract":"<jats:p>Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an increase of complete microbial isolate genomes and metagenomes, from which a vast number of BGCs are undiscovered. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We also demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification.<\/jats:p>","DOI":"10.1371\/journal.pcbi.1011162","type":"journal-article","created":{"date-parts":[[2023,5,23]],"date-time":"2023-05-23T18:30:17Z","timestamp":1684866617000},"page":"e1011162","update-policy":"https:\/\/doi.org\/10.1371\/journal.pcbi.corrections_policy","source":"Crossref","is-referenced-by-count":40,"title":["Deep self-supervised learning for biosynthetic gene cluster detection and product classification"],"prefix":"10.1371","volume":"19","author":[{"given":"Carolina","family":"Rios-Martinez","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nicholas","family":"Bhattacharya","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8601-6040","authenticated-orcid":true,"given":"Ava P.","family":"Amini","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0178-8242","authenticated-orcid":true,"given":"Lorin","family":"Crawford","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9045-6826","authenticated-orcid":true,"given":"Kevin K.","family":"Yang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"340","published-online":{"date-parts":[[2023,5,23]]},"reference":[{"issue":"3","key":"pcbi.1011162.ref001","doi-asserted-by":"crossref","first-page":"629","DOI":"10.1021\/acs.jnatprod.5b01055","article-title":"Natural products as sources of new drugs from 1981 to 2014","volume":"79","author":"DJ Newman","year":"2016","journal-title":"Journal of natural products"},{"key":"pcbi.1011162.ref002","unstructured":"Walsh CT, Tang Y. Natural product biosynthesis. Royal Society of Chemistry; 2017."},{"issue":"8","key":"pcbi.1011162.ref003","doi-asserted-by":"crossref","first-page":"988","DOI":"10.1039\/C6NP00025H","article-title":"The evolution of genome mining in microbes\u2013a review","volume":"33","author":"N Ziemert","year":"2016","journal-title":"Natural product reports"},{"issue":"26","key":"pcbi.1011162.ref004","doi-asserted-by":"crossref","first-page":"e2100751118","DOI":"10.1073\/pnas.2100751118","article-title":"GRINS: Genetic elements that recode assembly-line polyketide synthases and accelerate their diversification","volume":"118","author":"A Nivina","year":"2021","journal-title":"Proceedings of the National Academy of Sciences"},{"issue":"1","key":"pcbi.1011162.ref005","doi-asserted-by":"crossref","first-page":"32","DOI":"10.3390\/medicines6010032","article-title":"New approaches to detect biosynthetic gene clusters in the environment","volume":"6","author":"R Chen","year":"2019","journal-title":"Medicines"},{"issue":"22","key":"pcbi.1011162.ref006","doi-asserted-by":"crossref","first-page":"5601","DOI":"10.1073\/pnas.1614680114","article-title":"Retrospective analysis of natural products provides insights for future discovery trends","volume":"114","author":"CR Pye","year":"2017","journal-title":"Proceedings of the National Academy of Sciences"},{"issue":"8","key":"pcbi.1011162.ref007","doi-asserted-by":"crossref","first-page":"1902","DOI":"10.1021\/np500370c","article-title":"NRPquest: coupling mass spectrometry and genome mining for nonribosomal peptide discovery","volume":"77","author":"H Mohimani","year":"2014","journal-title":"Journal of natural products"},{"issue":"W1","key":"pcbi.1011162.ref008","doi-asserted-by":"crossref","first-page":"W204","DOI":"10.1093\/nar\/gkt449","article-title":"antiSMASH 2.0\u2014a versatile platform for genome mining of secondary metabolite producers","volume":"41","author":"K Blin","year":"2013","journal-title":"Nucleic acids research"},{"issue":"D1","key":"pcbi.1011162.ref009","doi-asserted-by":"crossref","first-page":"D625","DOI":"10.1093\/nar\/gky1060","article-title":"The antiSMASH database version 2: a comprehensive resource on secondary metabolite biosynthetic gene clusters","volume":"47","author":"K Blin","year":"2019","journal-title":"Nucleic acids research"},{"issue":"2","key":"pcbi.1011162.ref010","doi-asserted-by":"crossref","first-page":"412","DOI":"10.1016\/j.cell.2014.06.034","article-title":"Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters","volume":"158","author":"P Cimermancic","year":"2014","journal-title":"Cell"},{"issue":"2","key":"pcbi.1011162.ref011","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1016\/S1672-0229(04)02014-5","article-title":"Recent applications of hidden Markov models in computational biology","volume":"2","author":"KH Choo","year":"2004","journal-title":"Genomics, proteomics & bioinformatics"},{"issue":"18","key":"pcbi.1011162.ref012","doi-asserted-by":"crossref","first-page":"e110","DOI":"10.1093\/nar\/gkz654","article-title":"A deep learning genome-mining strategy for biosynthetic gene cluster prediction","volume":"47","author":"GD Hannigan","year":"2019","journal-title":"Nucleic acids research"},{"issue":"14","key":"pcbi.1011162.ref013","doi-asserted-by":"crossref","first-page":"1728","DOI":"10.1093\/bioinformatics\/btm247","article-title":"Fast model-based protein homology detection without alignment","volume":"23","author":"S Hochreiter","year":"2007","journal-title":"Bioinformatics"},{"issue":"15","key":"pcbi.1011162.ref014","doi-asserted-by":"crossref","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"A Rives","year":"2021","journal-title":"Proceedings of the National Academy of Sciences"},{"key":"pcbi.1011162.ref015","doi-asserted-by":"crossref","unstructured":"Madani A, McCann B, Naik N, Keskar NS, Anand N, Eguchi RR, et al. ProGen: Language Modeling for Protein Generation. arXiv. 2020;.","DOI":"10.1101\/2020.03.07.982272"},{"key":"pcbi.1011162.ref016","doi-asserted-by":"crossref","unstructured":"Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, et al. ProtTrans: Towards Cracking the Language of Life\u2019s Code Through Self-Supervised Learning; 2021.","DOI":"10.1101\/2020.07.12.199554"},{"key":"pcbi.1011162.ref017","article-title":"Deep neural language modeling enables functional protein generation across families","author":"A Madani","year":"2021","journal-title":"bioRxiv"},{"key":"pcbi.1011162.ref018","article-title":"ProteinBERT: A universal deep-learning model of protein sequence and function","author":"N Brandes","year":"2021","journal-title":"bioRxiv"},{"key":"pcbi.1011162.ref019","article-title":"A deep unsupervised language model for protein design","author":"N Ferruz","year":"2022","journal-title":"bioRxiv"},{"key":"pcbi.1011162.ref020","unstructured":"Hesslow D, ed Zanichelli N, Notin P, Poli I, Marks DS. RITA: a Study on Scaling Up Generative Protein Sequence Models; 2022."},{"key":"pcbi.1011162.ref021","doi-asserted-by":"crossref","unstructured":"Nijkamp E, Ruffolo J, Weinstein EN, Naik N, Madani A. ProGen2: Exploring the Boundaries of Protein Language Models. arXiv preprint arXiv:220613517. 2022;.","DOI":"10.1016\/j.cels.2023.10.002"},{"issue":"15","key":"pcbi.1011162.ref022","doi-asserted-by":"crossref","first-page":"2112","DOI":"10.1093\/bioinformatics\/btab083","article-title":"DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome","volume":"37","author":"Y Ji","year":"2021","journal-title":"Bioinformatics"},{"issue":"1","key":"pcbi.1011162.ref023","doi-asserted-by":"crossref","first-page":"lqac012","DOI":"10.1093\/nargab\/lqac012","article-title":"Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning","volume":"4","author":"M Akiyama","year":"2022","journal-title":"NAR genomics and bioinformatics"},{"key":"pcbi.1011162.ref024","doi-asserted-by":"crossref","unstructured":"Chen J, Hu Z, Sun S, Tan Q, Wang Y, Yu Q, et al. Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions. arXiv preprint arXiv:220400300. 2022;.","DOI":"10.1101\/2022.08.06.503062"},{"key":"pcbi.1011162.ref025","article-title":"Using natural language processing to learn the grammar of glycans","author":"D Bojar","year":"2020","journal-title":"bioRxiv"},{"issue":"11","key":"pcbi.1011162.ref026","doi-asserted-by":"crossref","first-page":"109251","DOI":"10.1016\/j.celrep.2021.109251","article-title":"Using graph convolutional neural networks to learn a representation for glycans","volume":"35","author":"R Burkholz","year":"2021","journal-title":"Cell Reports"},{"issue":"D1","key":"pcbi.1011162.ref027","doi-asserted-by":"crossref","first-page":"d158","DOI":"10.1093\/nar\/gkw1099","article-title":"UniProt: the universal protein knowledgebase","volume":"45","author":"U Consortium","year":"2017","journal-title":"Nucleic Acids Research"},{"key":"pcbi.1011162.ref028","doi-asserted-by":"crossref","unstructured":"Meier J, Rao R, Verkuil R, Liu J, Sercu T, Rives A. Language models enable zero-shot prediction of the effects of mutations on protein function. In: Ranzato M, Beygelzimer A, Nguyen K, Liang PS, Vaughan JW, Dauphin Y, editors. Advances in Neural Information Processing Systems 34; 2021.","DOI":"10.1101\/2021.07.09.450648"},{"key":"pcbi.1011162.ref029","article-title":"Transformer protein language models are unsupervised structure learners","author":"R Rao","year":"2020","journal-title":"Biorxiv"},{"key":"pcbi.1011162.ref030","doi-asserted-by":"crossref","unstructured":"Rao R, Bhattacharya N, Thomas N, Duan Y, Chen P, Canny J, et al. Evaluating protein transfer learning with TAPE. In: Advances in Neural Information Processing Systems; 2019. p. 9686\u20139698.","DOI":"10.1101\/676825"},{"key":"pcbi.1011162.ref031","doi-asserted-by":"crossref","unstructured":"Dallago C, Mou J, Johnston KE, Wittmann B, Bhattacharya N, Goldman S, et al. FLIP: Benchmark tasks in fitness landscape inference for proteins. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2); 2021.","DOI":"10.1101\/2021.11.09.467890"},{"key":"pcbi.1011162.ref032","article-title":"Convolutions are competitive with transformers for protein sequence pretraining","author":"KK Yang","year":"2022","journal-title":"bioRxiv"},{"key":"pcbi.1011162.ref033","unstructured":"Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805. 2018;."},{"key":"pcbi.1011162.ref034","unstructured":"Kalchbrenner N, Espeholt L, Simonyan K, Oord Avd, Graves A, Kavukcuoglu K. Neural machine translation in linear time. arXiv preprint arXiv:161010099. 2016;."},{"issue":"D1","key":"pcbi.1011162.ref035","doi-asserted-by":"crossref","first-page":"D222","DOI":"10.1093\/nar\/gkt1223","article-title":"Pfam: the protein families database","volume":"42","author":"RD Finn","year":"2014","journal-title":"Nucleic acids research"},{"issue":"D1","key":"pcbi.1011162.ref036","first-page":"D454","article-title":"MIBiG 2.0: a repository for biosynthetic gene clusters of known function","volume":"48","author":"SA Kautsar","year":"2020","journal-title":"Nucleic acids research"},{"issue":"W1","key":"pcbi.1011162.ref037","doi-asserted-by":"crossref","first-page":"W29","DOI":"10.1093\/nar\/gkab335","article-title":"antiSMASH 6.0: improving cluster detection and comparison capabilities","volume":"49","author":"K Blin","year":"2021","journal-title":"Nucleic acids research"},{"key":"pcbi.1011162.ref038","article-title":"Constructing benchmark test sets for biological sequence analysis using independent set algorithms","author":"S Petti","year":"2021","journal-title":"bioRxiv"},{"key":"pcbi.1011162.ref039","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1186\/1471-2105-11-119","article-title":"Prodigal: prokaryotic gene recognition and translation initiation site identification","volume":"11","author":"D Hyatt","year":"2010","journal-title":"BMC Bioinformatics"},{"issue":"9","key":"pcbi.1011162.ref040","first-page":"755","article-title":"Profile hidden Markov models","volume":"14","author":"SR Eddy","year":"1998","journal-title":"Bioinformatics (Oxford, England)"},{"issue":"10","key":"pcbi.1011162.ref041","doi-asserted-by":"crossref","first-page":"1766","DOI":"10.1093\/bioinformatics\/bty863","article-title":"cath-resolve-hits: a new tool that resolves domain matches suspiciously quickly","volume":"35","author":"TE Lewis","year":"2019","journal-title":"Bioinformatics"},{"key":"pcbi.1011162.ref042","unstructured":"Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In: Wallach H, Larochelle H, Beygelzimer A, d\u2019Alch\u00e9-Buc F, Fox E, Garnett R, editors. Advances in Neural Information Processing Systems 32. Curran Associates, Inc.; 2019. p. 8024\u20138035. Available from: http:\/\/papers.neurips.cc\/paper\/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf."}],"updated-by":[{"DOI":"10.1371\/journal.pcbi.1011162","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2023,6,5]],"date-time":"2023-06-05T00:00:00Z","timestamp":1685923200000}}],"container-title":["PLOS Computational Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1011162","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,12,13]],"date-time":"2023-12-13T20:48:32Z","timestamp":1702500512000},"score":1,"resource":{"primary":{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1011162"}},"subtitle":[],"editor":[{"given":"Shihua","family":"Zhang","sequence":"first","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2023,5,23]]},"references-count":42,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2023,5,23]]}},"URL":"https:\/\/doi.org\/10.1371\/journal.pcbi.1011162","relation":{"new_version":[{"id-type":"doi","id":"10.1371\/journal.pcbi.1011162","asserted-by":"object"}]},"ISSN":["1553-7358"],"issn-type":[{"value":"1553-7358","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,5,23]]}}}