{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T07:30:10Z","timestamp":1775028610180,"version":"3.50.1"},"reference-count":48,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T00:00:00Z","timestamp":1736380800000},"content-version":"vor","delay-in-days":48,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100023650","name":"NCCR Catalysis","doi-asserted-by":"publisher","award":["180544"],"award-info":[{"award-number":["180544"]}],"id":[{"id":"10.13039\/501100023650","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100023650","name":"National Centre of Competence in Research","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100023650","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001711","name":"Swiss National Science Foundation","doi-asserted-by":"publisher","award":["101077879"],"award-info":[{"award-number":["101077879"]}],"id":[{"id":"10.13039\/501100001711","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,11,22]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Enzymes are molecular machines optimized by nature to allow otherwise impossible chemical processes to occur. Their design is a challenging task due to the complexity of the protein space and the intricate relationships between sequence, structure, and function. Recently, large language models (LLMs) have emerged as powerful tools for modeling and analyzing biological sequences, but their application to protein design is limited by the high cardinality of the protein space. This study introduces a framework that combines LLMs with genetic algorithms (GAs) to optimize enzymes. 
LLMs are trained on a large dataset of protein sequences to learn relationships between amino acid residues linked to structure and function. This knowledge is then leveraged by GAs to efficiently search for sequences with improved catalytic performance. We focused on two optimization tasks: improving the feasibility of biochemical reactions and increasing their turnover rate. Systematic evaluations on 105 biocatalytic reactions demonstrated that the LLM\u2013GA framework generated mutants outperforming the wild-type enzymes in terms of feasibility in 90% of the instances. Further in-depth evaluation of seven reactions reveals the power of this methodology to make \u201cthe best of both worlds\u201d and create mutants with structural features and flexibility comparable with the wild types. Our approach advances the state-of-the-art computational design of biocatalysts, ultimately opening opportunities for more sustainable chemical processes.<\/jats:p>","DOI":"10.1093\/bib\/bbae675","type":"journal-article","created":{"date-parts":[[2024,12,13]],"date-time":"2024-12-13T07:28:52Z","timestamp":1734074932000},"source":"Crossref","is-referenced-by-count":9,"title":["Integrating genetic algorithms and language models for enhanced enzyme design"],"prefix":"10.1093","volume":"26","author":[{"given":"Yves Gaetan","family":"Nana Teukam","sequence":"first","affiliation":[{"name":"IBM Research Europe , S\u00e4umerstrasse 4, CH-8803 R\u00fcschlikon,","place":["Switzerland"]},{"name":"Institute for Complex Molecular Systems and Department of Biomedical Engineering , Eindhoven University of Technology, 5612 AZ Eindhoven, the","place":["Netherlands"]}]},{"given":"Federico","family":"Zipoli","sequence":"additional","affiliation":[{"name":"IBM Research Europe , S\u00e4umerstrasse 4, CH-8803 R\u00fcschlikon,","place":["Switzerland"]},{"name":"National Center for Competence in Research-Catalysis (NCCR-Catalysis) 
,","place":["Switzerland"]}]},{"given":"Teodoro","family":"Laino","sequence":"additional","affiliation":[{"name":"IBM Research Europe , S\u00e4umerstrasse 4, CH-8803 R\u00fcschlikon,","place":["Switzerland"]},{"name":"National Center for Competence in Research-Catalysis (NCCR-Catalysis) ,","place":["Switzerland"]}]},{"given":"Emanuele","family":"Criscuolo","sequence":"additional","affiliation":[{"name":"Institute for Complex Molecular Systems and Department of Biomedical Engineering , Eindhoven University of Technology, 5612 AZ Eindhoven, the","place":["Netherlands"]}]},{"given":"Francesca","family":"Grisoni","sequence":"additional","affiliation":[{"name":"Institute for Complex Molecular Systems and Department of Biomedical Engineering , Eindhoven University of Technology, 5612 AZ Eindhoven, the","place":["Netherlands"]},{"name":"Centre for Living Technologies , Alliance TU\/e, WUR, UU, UMC Utrecht, Utrecht, the","place":["Netherlands"]}]},{"given":"Matteo","family":"Manica","sequence":"additional","affiliation":[{"name":"IBM Research Europe , S\u00e4umerstrasse 4, CH-8803 R\u00fcschlikon,","place":["Switzerland"]}]}],"member":"286","published-online":{"date-parts":[[2025,1,8]]},"reference":[{"key":"2025010905201205400_ref1","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1038\/s41588-018-0295-5","article-title":"A primer on deep learning in genomics","volume":"51","author":"Zou","year":"2018","journal-title":"Nat Genet"},{"key":"2025010905201205400_ref2","doi-asserted-by":"publisher","first-page":"595","DOI":"10.1089\/omi.2013.0017","article-title":"Application of machine learning to proteomics data: Classification and biomarker identification in postgenomics biology","volume":"17","author":"Swan","year":"2013","journal-title":"Omics: a journal of integrative biology"},{"key":"2025010905201205400_ref3","doi-asserted-by":"publisher","first-page":"520","DOI":"10.1042\/BST0330520","article-title":"Metabolomics, machine learning and modelling: Towards an 
understanding of the language of cells","volume":"33","author":"Kell","year":"2005","journal-title":"Biochem Soc Trans"},{"key":"2025010905201205400_ref4","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"2025010905201205400_ref5","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"CoRR","author":"Devlin","year":"2018"},{"key":"2025010905201205400_ref6","doi-asserted-by":"crossref","first-page":"01","DOI":"10.1093\/bioinformatics\/btac020","article-title":"Proteinbert: A universal deep-learning model of protein sequence and function","volume":"38","author":"Brandes","year":"2022","journal-title":"Bioinformatics"},{"key":"2025010905201205400_ref7","doi-asserted-by":"publisher","first-page":"4348","DOI":"10.1038\/s41467-022-32007-7","article-title":"ProtGPT2 is a deep unsupervised language model for protein design","volume":"13","author":"Ferruz","year":"2022","journal-title":"Nat Commun"},{"key":"2025010905201205400_ref8","doi-asserted-by":"publisher","first-page":"3142","DOI":"10.1021\/acs.jcim.2c00026","article-title":"Ai-based protein structure prediction in drug discovery: Impacts and challenges","volume":"62","author":"Schauperl","year":"2022","journal-title":"J Chem Inf Model"},{"key":"2025010905201205400_ref9","doi-asserted-by":"publisher","first-page":"e2300011","DOI":"10.1002\/pmic.202300011","article-title":"Leveraging transformers-based language models in proteome bioinformatics","volume":"23","author":"Le","year":"2023","journal-title":"Proteomics"},{"key":"2025010905201205400_ref10","doi-asserted-by":"publisher","first-page":"e2100232","DOI":"10.1002\/pmic.202100232","article-title":"Potential of deep representative learning features to interpret the sequence information in 
proteomics","volume":"22","author":"Le","year":"2022","journal-title":"Proteomics"},{"key":"2025010905201205400_ref11","doi-asserted-by":"publisher","first-page":"1099","DOI":"10.1038\/s41587-022-01618-2","article-title":"Large language models generate functional protein sequences across diverse families","volume":"41","author":"Madani","year":"2023","journal-title":"Nat Biotechnol"},{"key":"2025010905201205400_ref12","doi-asserted-by":"crossref","DOI":"10.1101\/2023.01.16.524265","article-title":"Ankh: Optimized protein language model unlocks general-purpose modelling","author":"Elnaggar","year":"2023"},{"key":"2025010905201205400_ref13","doi-asserted-by":"publisher","first-page":"953","DOI":"10.1098\/rsif.2008.0085","article-title":"How much of protein sequence space has been explored by life on earth?","volume":"5","author":"Dryden","year":"2008","journal-title":"J R Soc, Interface"},{"key":"2025010905201205400_ref14","doi-asserted-by":"publisher","first-page":"97","DOI":"10.1007\/0-387-28356-0_4","article-title":"Genetic algorithms","volume-title":"Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques","author":"Sastry","year":"2005"},{"key":"2025010905201205400_ref15","doi-asserted-by":"publisher","first-page":"589","DOI":"10.1093\/protein\/gzg077","article-title":"Optimizing the search algorithm for protein engineering by directed evolution","volume":"16","author":"Fox","year":"2003","journal-title":"Protein Eng"},{"key":"2025010905201205400_ref16","doi-asserted-by":"publisher","first-page":"04","DOI":"10.1186\/s13321-020-00429-4","article-title":"Autogrow4: An open-source genetic algorithm for de novo drug design and lead optimization","volume":"12","author":"Spiegel","year":"2020","journal-title":"J Chem"},{"key":"2025010905201205400_ref17","article-title":"Reinforced genetic algorithm for structure-based drug design","volume-title":"Advances in Neural Information Processing 
Systems","author":"Tianfan"},{"key":"2025010905201205400_ref18","doi-asserted-by":"publisher","first-page":"814","DOI":"10.1002\/jcc.27043","article-title":"Gamaterial \u2014A genetic-algorithm software for material design and discovery","volume":"44","author":"Louren\u00e7o","year":"2022","journal-title":"J Comput Chem"},{"key":"2025010905201205400_ref19","doi-asserted-by":"publisher","first-page":"103","DOI":"10.3166\/remn.17.103-126","article-title":"A comparative evaluation of genetic and gradient-based algorithms applied to aerodynamic optimization","volume":"17","author":"Zingg","year":"2008","journal-title":"Eur J Comput Mech"},{"key":"2025010905201205400_ref20","doi-asserted-by":"publisher","first-page":"06","DOI":"10.1186\/s13321-023-00719-7","article-title":"Adaptive language model training for molecular design","volume":"15","author":"Blanchard","year":"2023","journal-title":"J Chem"},{"key":"2025010905201205400_ref21","doi-asserted-by":"publisher","first-page":"D498","DOI":"10.1093\/nar\/gkaa1025","article-title":"Brenda, the elixir core data resource in 2021: New developments and updates","volume":"49","author":"J\u00e4de","year":"2020","journal-title":"Nucleic Acids Res"},{"key":"2025010905201205400_ref22","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1007\/978-1-59745-535-0_4","article-title":"Uniprotkb\/swiss-prot: The manually annotated section of the uniprot knowledgebase","volume-title":"Plant Bioinformatics: Methods and Protocols","author":"Boutet","year":"2007"},{"key":"2025010905201205400_ref23","doi-asserted-by":"publisher","first-page":"D1373","DOI":"10.1093\/nar\/gkac956","article-title":"Pubchem 2023 update","volume":"51","author":"Kim","year":"2022","journal-title":"Nucleic Acids Res"},{"key":"2025010905201205400_ref24","doi-asserted-by":"publisher","first-page":"235","DOI":"10.1093\/nar\/28.1.235","article-title":"The protein data bank","volume":"28","author":"Berman","year":"2000","journal-title":"Nucleic Acids 
Res"},{"key":"2025010905201205400_ref25","doi-asserted-by":"publisher","first-page":"964","DOI":"10.1038\/s41467-022-28536-w","article-title":"Biocatalysed synthesis planning using data-driven learning","volume":"13","author":"Probst","year":"2022","journal-title":"Nat Commun"},{"key":"2025010905201205400_ref26","article-title":"Language models of protein sequences at the scale of evolution enable accurate structure prediction","author":"Lin","year":"2022"},{"key":"2025010905201205400_ref27","doi-asserted-by":"publisher","first-page":"e2016239118","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"Rives","year":"2021","journal-title":"Proc Natl Acad Sci"},{"key":"2025010905201205400_ref28","doi-asserted-by":"publisher","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped blast and psi-blast: A new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res"},{"key":"2025010905201205400_ref29","doi-asserted-by":"publisher","first-page":"117","DOI":"10.1007\/BF00532240","article-title":"The wasserstein distance and approximation theorems","volume":"70","author":"R\u00fcschendorf","year":"1985","journal-title":"Probab Theory Relat Fields"},{"key":"2025010905201205400_ref30","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach Learn"},{"key":"2025010905201205400_ref31","article-title":"Chemberta: Large-scale self-supervised pretraining for molecular property prediction","author":"Chithrananda","year":"2020"},{"key":"2025010905201205400_ref32","article-title":"Xgboost: A scalable tree boosting 
system","author":"Chen","year":"2016","journal-title":"CoRR"},{"key":"2025010905201205400_ref33","doi-asserted-by":"publisher","first-page":"4139","DOI":"10.1038\/s41467-023-39840-4","article-title":"Turnover number predictions for kinetically uncharacterized enzymes using machine and deep learning","volume":"14","author":"Kroll","year":"2023","journal-title":"Nature Communications"},{"key":"2025010905201205400_ref34","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1002\/9780470479216.corpsy0524","article-title":"Mann-Whitney u test","author":"McKnight","year":"2010","journal-title":"The Corsini encyclopedia of psychology"},{"key":"2025010905201205400_ref35","first-page":"82","article-title":"Pymol: An open-source molecular graphics tool","volume":"40","author":"DeLano","year":"2002","journal-title":"CCP4 Newsl Protein Crystallogr"},{"key":"2025010905201205400_ref36","doi-asserted-by":"publisher","first-page":"19","DOI":"10.1016\/j.softx.2015.06.001","article-title":"Gromacs: High performance molecular simulations through multi-level parallelism from laptops to supercomputers","volume":"1-2","author":"Abraham","year":"2015","journal-title":"SoftwareX"},{"key":"2025010905201205400_ref37","doi-asserted-by":"publisher","first-page":"1459","DOI":"10.1021\/ct200908r","article-title":"Optimization of the opls-aa force field for long hydrocarbons","volume":"8","author":"Siu","year":"2012","journal-title":"J Chem Theory Comput"},{"key":"2025010905201205400_ref38","doi-asserted-by":"publisher","DOI":"10.1136\/bmj.e4483","article-title":"Pearson\u2019s correlation coefficient","volume":"345","author":"Sedgwick","year":"2012","journal-title":"BMJ"},{"key":"2025010905201205400_ref39","doi-asserted-by":"publisher","first-page":"185","DOI":"10.1016\/0014-5793(90)80839-B","article-title":"The influence of proline residues on alpha-helical structure","volume":"277","author":"Woolfson","year":"1990","journal-title":"FEBS 
Lett"},{"key":"2025010905201205400_ref40","doi-asserted-by":"publisher","first-page":"1765","DOI":"10.1002\/pro.2558","article-title":"Distinct circular dichroism spectroscopic signatures of polyproline ii and unordered secondary structures: Applications in secondary structure analyses","volume":"23","author":"Lopes","year":"2014","journal-title":"Protein Sci"},{"key":"2025010905201205400_ref41","doi-asserted-by":"publisher","first-page":"368","DOI":"10.1016\/j.sbi.2006.04.004","article-title":"Multiple sequence alignment","volume":"16","author":"Edgar","year":"2006","journal-title":"Curr Opin Struct Biol"},{"key":"2025010905201205400_ref42","doi-asserted-by":"publisher","DOI":"10.1002\/0470013192.bsa712","article-title":"Wilcoxon\u2013Mann\u2013Whitney Test","volume-title":"Encyclopedia of Statistics in Behavioral Science","author":""},{"key":"2025010905201205400_ref43","doi-asserted-by":"publisher","first-page":"07","DOI":"10.1038\/s41467-023-39840-4","article-title":"Turnover number predictions for kinetically uncharacterized enzymes using machine and deep learning","volume":"14","author":"Kroll","year":"2023","journal-title":"Nat Commun"},{"key":"2025010905201205400_ref44","doi-asserted-by":"publisher","first-page":"4359","DOI":"10.1021\/acs.jpcb.1c01253","article-title":"Improving the thermostability of xylanase a from bacillus subtilis by combining bioinformatics and electrostatic interactions optimization","volume":"125","author":"Ngo","year":"2021","journal-title":"J Phys Chem B"},{"key":"2025010905201205400_ref45","doi-asserted-by":"publisher","first-page":"1280","DOI":"10.1002\/prot.21617","article-title":"Thermostable variants of the recombinant xylanase a from bacillus subtilis produced by directed evolution show reduced heat capacity 
changes","volume":"70","author":"Ruller","year":"2008","journal-title":"Proteins"},{"key":"2025010905201205400_ref46","doi-asserted-by":"publisher","first-page":"3270","DOI":"10.1021\/acs.jctc.6b00399","article-title":"Ntl9 folding at constant ph: The importance of electrostatic interaction and ph dependence","volume":"12","author":"Contessoto","year":"2016","journal-title":"J Chem Theory Comput"},{"key":"2025010905201205400_ref47","doi-asserted-by":"publisher","first-page":"9026","DOI":"10.1039\/C8CS00014J","article-title":"Engineering more stable proteins","volume":"47","author":"Kazlauskas","year":"2018","journal-title":"Chem Soc Rev"},{"key":"2025010905201205400_ref48","doi-asserted-by":"publisher","DOI":"10.1038\/s41524-023-01028-1","article-title":"Accelerating material design with the generative toolkit for scientific discovery","volume":"9","author":"Manica","year":"2023","journal-title":"npj Comput Mater"}],"container-title":["Briefings in Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/26\/1\/bbae675\/61381160\/bbae675.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/26\/1\/bbae675\/61381160\/bbae675.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T00:20:29Z","timestamp":1736382029000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/doi\/10.1093\/bib\/bbae675\/7945613"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,22]]},"references-count":48,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,11,22]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbae675","relation":{"has-preprint":[{"id-type":"doi","id":"10.26434\/chemrxiv-2024-j7ntq","asserted-by":"object"}]},"ISSN":["1467-5463","1477-4054"],"issn-type":
[{"value":"1467-5463","type":"print"},{"value":"1477-4054","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,1]]},"published":{"date-parts":[[2024,11,22]]},"article-number":"bbae675"}}