{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T02:04:50Z","timestamp":1774922690194,"version":"3.50.1"},"reference-count":30,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,4,22]],"date-time":"2024-04-22T00:00:00Z","timestamp":1713744000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,4,22]],"date-time":"2024-04-22T00:00:00Z","timestamp":1713744000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Comput Aided Mol Des"],"published-print":{"date-parts":[[2024,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In recent years, generative machine learning algorithms have been successful in designing innovative drug-like molecules. SMILES is a sequence-like language used in most effective drug design models. Due to data\u2019s sequential structure, models such as recurrent neural networks and transformers can design pharmacological compounds with optimized efficacy. Large language models have advanced recently, but their implications on drug design have not yet been explored. Although one study successfully pre-trained a <jats:italic>large chemistry model<\/jats:italic> (LCM), its application to specific tasks in drug discovery is unknown. In this study, the drug design task is modeled as a causal language modeling problem. Thus, the procedure of reward modeling, supervised fine-tuning, and proximal policy optimization was used to transfer the LCM to drug design, similar to Open AI\u2019s ChatGPT and InstructGPT procedures. By combining the SMILES sequence with chemical descriptors, the novel efficacy evaluation model exceeded its performance compared to previous studies. After proximal policy optimization, the drug design model generated molecules with 99.2% having efficacy pIC<jats:sub>50<\/jats:sub>\u2009&gt;\u20097 towards the amyloid precursor protein, with 100% of the generated molecules being valid and novel. This demonstrated the applicability of LCMs in drug discovery, with benefits including less data consumption while fine-tuning. The applicability of LCMs to drug discovery opens the door for larger studies involving reinforcement-learning with human feedback, where chemists provide feedback to LCMs and generate higher-quality molecules. LCMs\u2019 ability to design similar molecules from datasets paves the way for more accessible, non-patented alternatives to drug molecules.<\/jats:p>","DOI":"10.1007\/s10822-024-00559-z","type":"journal-article","created":{"date-parts":[[2024,4,22]],"date-time":"2024-04-22T15:01:51Z","timestamp":1713798111000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":23,"title":["De novo drug design as GPT language modeling: large chemistry models with supervised and reinforcement learning"],"prefix":"10.1007","volume":"38","author":[{"given":"Gavin","family":"Ye","sequence":"first","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,4,22]]},"reference":[{"key":"559_CR1","doi-asserted-by":"publisher","first-page":"20","DOI":"10.1016\/j.jhealeco.2016.01.012","volume":"47","author":"JA DiMasi","year":"2016","unstructured":"DiMasi JA, Grabowski HG, Hansen RW (2016) Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 47:20\u201333. https:\/\/doi.org\/10.1016\/j.jhealeco.2016.01.012","journal-title":"J Health Econ"},{"key":"559_CR2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ailsci.2022.100045","volume":"2","author":"S Tripathi","year":"2022","unstructured":"Tripathi S et al (2022) Recent advances and application of generative adversarial networks in drug discovery, development, and targeting. Artif Intell Life Sci 2:100045. https:\/\/doi.org\/10.1016\/j.ailsci.2022.100045","journal-title":"Artif Intell Life Sci"},{"issue":"1","key":"559_CR3","doi-asserted-by":"publisher","first-page":"40","DOI":"10.1186\/s13321-022-00623-6","volume":"14","author":"M Abbasi","year":"2022","unstructured":"Abbasi M et al (2022) Designing optimized drug candidates with generative adversarial network. J Cheminformatics 14(1):40. https:\/\/doi.org\/10.1186\/s13321-022-00623-6","journal-title":"J Cheminformatics"},{"issue":"1","key":"559_CR4","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/ci00057a005","volume":"28","author":"D Weininger","year":"1988","unstructured":"Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31\u201336. https:\/\/doi.org\/10.1021\/ci00057a005","journal-title":"J Chem Inf Comput Sci"},{"issue":"4","key":"559_CR5","doi-asserted-by":"publisher","DOI":"10.1088\/2632-2153\/aba947","volume":"1","author":"M Krenn","year":"2020","unstructured":"Krenn M, H\u00e4se F, Nigam A, Friederich P, Aspuru-Guzik A (2020) Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach Learn Sci Technol 1(4):045024. https:\/\/doi.org\/10.1088\/2632-2153\/aba947","journal-title":"Mach Learn Sci Technol"},{"issue":"1","key":"559_CR6","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1186\/s13321-020-00419-6","volume":"12","author":"J Yasonik","year":"2020","unstructured":"Yasonik J (2020) Multiobjective de novo drug design with recurrent neural networks and nondominated sorting. J Cheminformatics 12(1):14. https:\/\/doi.org\/10.1186\/s13321-020-00419-6","journal-title":"J Cheminformatics"},{"issue":"12","key":"559_CR7","doi-asserted-by":"publisher","first-page":"5682","DOI":"10.1021\/acs.jcim.0c00599","volume":"60","author":"K Gao","year":"2020","unstructured":"Gao K, Nguyen DD, Tu M, Wei G-W (2020) Generative network complex for the automated generation of drug-like molecules. J Chem Inf Model 60(12):5682\u20135698. https:\/\/doi.org\/10.1021\/acs.jcim.0c00599","journal-title":"J Chem Inf Model"},{"key":"559_CR8","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-023-00639-z","author":"J Born","year":"2023","unstructured":"Born J, Manica M (2023) Regression transformer enables concurrent sequence regression and generation for molecular language modelling. Nat Mach Intell. https:\/\/doi.org\/10.1038\/s42256-023-00639-z","journal-title":"Nat Mach Intell"},{"key":"559_CR9","doi-asserted-by":"publisher","DOI":"10.26434\/chemrxiv-2022-3s512","author":"N Frey","year":"2022","unstructured":"Frey N et al (2022) Neural scaling of deep chemical models. ChemRxiv. https:\/\/doi.org\/10.26434\/chemrxiv-2022-3s512","journal-title":"ChemRxiv"},{"issue":"11","key":"559_CR10","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0048540","volume":"7","author":"S Yang","year":"2012","unstructured":"Yang S et al (2012) A peptide binding to the \u03b2-site of APP improves spatial memory and attenuates A\u03b2 burden in Alzheimer\u2019s disease transgenic mice. PLoS ONE 7(11):e48540. https:\/\/doi.org\/10.1371\/journal.pone.0048540","journal-title":"PLoS ONE"},{"key":"559_CR11","doi-asserted-by":"publisher","first-page":"137","DOI":"10.3389\/fnmol.2020.00137","volume":"13","author":"J Zhao","year":"2020","unstructured":"Zhao J, Liu X, Xia W, Zhang Y, Wang C (2020) Targeting amyloidogenic processing of APP in Alzheimer\u2019s disease. Front Mol Neurosci 13:137. https:\/\/doi.org\/10.3389\/fnmol.2020.00137","journal-title":"Front Mol Neurosci"},{"key":"559_CR12","doi-asserted-by":"publisher","unstructured":"Brown T B et al. (2020) Language models are few-shot learners. ArXiv, 2020. doi: https:\/\/doi.org\/10.48550\/arXiv.2005.14165.","DOI":"10.48550\/arXiv.2005.14165"},{"key":"559_CR13","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkl999","author":"T Liu","year":"2007","unstructured":"Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK (2007) BindingDB: a web-accessible database of experimentally determined protein\u2013ligand binding affinities. Nucleic Acids Res. https:\/\/doi.org\/10.1093\/nar\/gkl999","journal-title":"Nucleic Acids Res"},{"issue":"1","key":"559_CR14","doi-asserted-by":"publisher","first-page":"27","DOI":"10.1021\/acs.jcim.7b00616","volume":"58","author":"S Jaeger","year":"2018","unstructured":"Jaeger S, Fulle S, Turk S (2018) Mol2vec: unsupervised machine learning approach with chemical intuition. J Chem Inf Model 58(1):27\u201335. https:\/\/doi.org\/10.1021\/acs.jcim.7b00616","journal-title":"J Chem Inf Model"},{"issue":"D1","key":"559_CR15","doi-asserted-by":"publisher","first-page":"D1373","DOI":"10.1093\/nar\/gkac956","volume":"51","author":"S Kim","year":"2023","unstructured":"Kim S et al (2023) PubChem 2023 update. Nucleic Acids Res 51(D1):D1373\u2013D1380. https:\/\/doi.org\/10.1093\/nar\/gkac956","journal-title":"Nucleic Acids Res"},{"key":"559_CR16","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.8053810","author":"G Landrum","year":"2023","unstructured":"Landrum G et al (2023) rdkit\/rdkit: 2023_03_2 (Q1 2023) Release. Zenodo. https:\/\/doi.org\/10.5281\/zenodo.8053810"},{"issue":"24","key":"559_CR17","doi-asserted-by":"publisher","first-page":"21781","DOI":"10.1021\/acsomega.3c01332","volume":"8","author":"H Kaneko","year":"2023","unstructured":"Kaneko H (2023) Molecular descriptors, structure generation, and inverse QSAR\/QSPR based on SELFIES. ACS Omega 8(24):21781\u201321786. https:\/\/doi.org\/10.1021\/acsomega.3c01332","journal-title":"ACS Omega"},{"key":"559_CR18","doi-asserted-by":"publisher","DOI":"10.26434\/chemrxiv-2022-v5p6m-v3","author":"HA Gandhi","year":"2022","unstructured":"Gandhi HA, White AD (2022) Explaining molecular properties with natural language. Chemistry. https:\/\/doi.org\/10.26434\/chemrxiv-2022-v5p6m-v3","journal-title":"Chemistry"},{"key":"559_CR19","doi-asserted-by":"publisher","unstructured":"Touvron H et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv, 2023. doi: https:\/\/doi.org\/10.48550\/arXiv.2307.09288.","DOI":"10.48550\/arXiv.2307.09288"},{"key":"559_CR20","doi-asserted-by":"publisher","unstructured":"Almazrouei E et al. (2023) The Falcon Series of Open Language Models. arXiv, 2023. doi: https:\/\/doi.org\/10.48550\/arXiv.2311.16867.","DOI":"10.48550\/arXiv.2311.16867"},{"key":"559_CR21","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.5297715","author":"S Black","year":"2021","unstructured":"Black S, Leo G, Wang P, Leahy C, Biderman S (2021) GPT-neo: large scale autoregressive language modeling with mesh-tensorflow. Zenodo. https:\/\/doi.org\/10.5281\/zenodo.5297715"},{"key":"559_CR22","doi-asserted-by":"publisher","unstructured":"Akiba T, Sano S, Yanase T, Ohta T and Koyama M (2019) Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, in KDD \u201819. New York, Association for Computing Machinery, pp. 2623\u20132631. doi: https:\/\/doi.org\/10.1145\/3292500.3330701.","DOI":"10.1145\/3292500.3330701"},{"issue":"10","key":"559_CR23","doi-asserted-by":"publisher","first-page":"1376","DOI":"10.1093\/bioinformatics\/btaa982","volume":"37","author":"N S\u00e1nchez-Cruz","year":"2021","unstructured":"S\u00e1nchez-Cruz N, Medina-Franco JL, Mestres J, Barril X (2021) Extended connectivity interaction features: improving binding affinity prediction through chemical description. Bioinforma Oxf Engl 37(10):1376\u20131382. https:\/\/doi.org\/10.1093\/bioinformatics\/btaa982","journal-title":"Bioinforma Oxf Engl"},{"issue":"21\u201322","key":"559_CR24","doi-asserted-by":"publisher","first-page":"1011","DOI":"10.1016\/j.drudis.2009.07.014","volume":"14","author":"TJ Ritchie","year":"2009","unstructured":"Ritchie TJ, Macdonald SJF (2009) The impact of aromatic ring count on compound developability\u2013are too many aromatic rings a liability in drug design? Drug Discov Today 14(21\u201322):1011\u20131020. https:\/\/doi.org\/10.1016\/j.drudis.2009.07.014","journal-title":"Drug Discov Today"},{"key":"559_CR25","doi-asserted-by":"publisher","unstructured":"Khoi ND, Van CP, Tran HV, Truong CD (2020) Multi-objective exploration for proximal policy optimization. In: 2020 Applying New Technology in Green Buildings (ATiGB). doi: https:\/\/doi.org\/10.1109\/ATiGB50996.2021.9423319.","DOI":"10.1109\/ATiGB50996.2021.9423319"},{"key":"559_CR26","doi-asserted-by":"publisher","unstructured":"Koeberle Y, Sabatini S, Tsishkou D and Sabourin C (2022) Exploring the trade off between human driving imitation and safety for traffic simulation. In: 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC). pp. 779\u2013786. doi: https:\/\/doi.org\/10.1109\/ITSC55140.2022.9922347.","DOI":"10.1109\/ITSC55140.2022.9922347"},{"issue":"12","key":"559_CR27","doi-asserted-by":"publisher","first-page":"791","DOI":"10.1007\/s10822-023-00539-9","volume":"37","author":"TO Pereira","year":"2023","unstructured":"Pereira TO, Abbasi M, Oliveira RI, Guedes RA, Salvador JAR, Arrais JP (2023) Artificial intelligence for prediction of biological activities and generation of molecular hits using stereochemical information. J Comput Aided Mol Des 37(12):791\u2013806. https:\/\/doi.org\/10.1007\/s10822-023-00539-9","journal-title":"J Comput Aided Mol Des"},{"issue":"7","key":"559_CR28","doi-asserted-by":"publisher","first-page":"eaap7885","DOI":"10.1126\/sciadv.aap7885","volume":"4","author":"M Popova","year":"2018","unstructured":"Popova M, Isayev O, Tropsha A (2018) Deep reinforcement learning for de-novo drug design. Sci Adv 4(7):eaap7885. https:\/\/doi.org\/10.1126\/sciadv.aap7885","journal-title":"Sci Adv"},{"key":"559_CR29","doi-asserted-by":"publisher","unstructured":"Christiano P, Leike J, Brown TB, Martic M, Legg S and Amodei D (2023) Deep reinforcement learning from human preferences. arXiv. doi: https:\/\/doi.org\/10.48550\/arXiv.1706.03741.","DOI":"10.48550\/arXiv.1706.03741"},{"key":"559_CR30","doi-asserted-by":"publisher","unstructured":"Ouyang L et al. (2022) Training language models to follow instructions with human feedback. arXiv. doi: https:\/\/doi.org\/10.48550\/arXiv.2203.02155.","DOI":"10.48550\/arXiv.2203.02155"}],"container-title":["Journal of Computer-Aided Molecular Design"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10822-024-00559-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10822-024-00559-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10822-024-00559-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,26]],"date-time":"2024-11-26T12:16:14Z","timestamp":1732623374000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10822-024-00559-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,22]]},"references-count":30,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,12]]}},"alternative-id":["559"],"URL":"https:\/\/doi.org\/10.1007\/s10822-024-00559-z","relation":{"references":[{"id-type":"uri","id":"","asserted-by":"subject"},{"id-type":"uri","id":"","asserted-by":"subject"}]},"ISSN":["0920-654X","1573-4951"],"issn-type":[{"value":"0920-654X","type":"print"},{"value":"1573-4951","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,4,22]]},"assertion":[{"value":"8 December 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 March 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 April 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"This study declares no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"20"}}