{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T04:44:47Z","timestamp":1771562687086,"version":"3.50.1"},"reference-count":48,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,9,25]],"date-time":"2023-09-25T00:00:00Z","timestamp":1695600000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,9,25]],"date-time":"2023-09-25T00:00:00Z","timestamp":1695600000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Big Data Health Science Center (BDHSC) of the University of South Carolina"},{"name":"National Science Foundation","award":["1940099"],"award-info":[{"award-number":["1940099"]}]},{"name":"National Science Foundation","award":["1905775"],"award-info":[{"award-number":["1905775"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Self-supervised neural language models have recently found wide applications in the generative design of organic molecules and protein sequences as well as representation learning for downstream structure classification and functional prediction. However, most of the existing deep learning models for molecule design usually require a big dataset and have a black-box architecture, which makes it difficult to interpret their design logic. Here we propose the Generative Molecular Transformer (GMTransformer), a probabilistic neural network model for generative design of molecules. Our model is built on the blank filling language model originally developed for text processing, which has demonstrated unique advantages in learning the \u201cmolecules grammars\u201d with high-quality generation, interpretability, and data efficiency. Benchmarked on the MOSES datasets, our models achieve high novelty and Scaf compared to other baselines. The probabilistic generation steps have the potential in tinkering with molecule design due to their capability of recommending how to modify existing molecules with explanation, guided by the learned implicit molecule chemistry. The source code and datasets can be accessed freely at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/usccolumbia\/GMTransformer\">https:\/\/github.com\/usccolumbia\/GMTransformer<\/jats:ext-link><\/jats:p>","DOI":"10.1186\/s13321-023-00759-z","type":"journal-article","created":{"date-parts":[[2023,9,25]],"date-time":"2023-09-25T04:01:52Z","timestamp":1695614512000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["Probabilistic generative transformer language models for generative design of molecules"],"prefix":"10.1186","volume":"15","author":[{"given":"Lai","family":"Wei","sequence":"first","affiliation":[]},{"given":"Nihang","family":"Fu","sequence":"additional","affiliation":[]},{"given":"Yuqi","family":"Song","sequence":"additional","affiliation":[]},{"given":"Qian","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Jianjun","family":"Hu","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,9,25]]},"reference":[{"issue":"11","key":"759_CR1","doi-asserted-by":"publisher","first-page":"2707","DOI":"10.1016\/j.drudis.2021.05.019","volume":"26","author":"Joshua Meyers","year":"2021","unstructured":"Meyers Joshua, Fabian Benedek, Brown Nathan (2021) De novo molecular design and generative models. Drug Discov Today 26(11):2707\u20132715","journal-title":"Drug Discov Today"},{"issue":"5","key":"759_CR2","doi-asserted-by":"publisher","first-page":"3031","DOI":"10.1021\/acs.chemrev.0c00608","volume":"121","author":"Zunger Alex","year":"2021","unstructured":"Alex Zunger, Malyi Oleksandr I (2021) Understanding doping of quantum materials. Chem Rev 121(5):3031\u20133060","journal-title":"Chem Rev"},{"key":"759_CR3","unstructured":"Du Y, Fu T, Sun J, Liu S (2022) Molgensurvey: a systematic survey in machine learning models for molecule design. arXiv preprint. arXiv:2203.14500"},{"issue":"4","key":"759_CR4","doi-asserted-by":"publisher","first-page":"1983","DOI":"10.1021\/acs.jcim.9b01120","volume":"60","author":"I Fergus","year":"2020","unstructured":"Fergus Imrie, Bradley Anthony R, Mihaela Schaar, van der, Deane Charlotte M, (2020) Deep generative models for 3d linker design. J Chem Inform Model 60(4):1983\u20131995","journal-title":"J Chem Inform Model"},{"issue":"7","key":"759_CR5","doi-asserted-by":"publisher","DOI":"10.1115\/1.4053859","volume":"144","author":"Regenwetter Lyle","year":"2022","unstructured":"Lyle Regenwetter, Heyrani Nobari Amin, Faez Ahmed (2022) Deep generative models in engineering design: a review. J Mech Des 144(7):071704","journal-title":"J Mech Des"},{"key":"759_CR6","unstructured":"Guimaraes GL, Sanchez-Lengeling B, Outeiral C, Farias PLC, Aspuru-Guzik A (2017) Objective-reinforced generative adversarial networks (organ) for sequence generation models. arXiv preprint. arXiv:1705.10843"},{"key":"759_CR7","unstructured":"Dai H, Tian Y, Dai B, Skiena S, Song L (2018) Syntax-directed variational autoencoder for structured data. arXiv preprint. arXiv:1802.08786"},{"key":"759_CR8","doi-asserted-by":"crossref","unstructured":"Zang C, Wang F (2020) Moflow: an invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 617\u2013626","DOI":"10.1145\/3394486.3403104"},{"key":"759_CR9","doi-asserted-by":"publisher","DOI":"10.1038\/s43588-022-00391-1","author":"Westermayr Julia","year":"2023","unstructured":"Julia Westermayr, Joe Gilkes, Rhyan Barrett, Maurer Reinhard J (2023) High-throughput property-driven generative design of functional organic molecules. Nat Comput Sci. https:\/\/doi.org\/10.1038\/s43588-022-00391-1","journal-title":"Nat Comput Sci"},{"key":"759_CR10","doi-asserted-by":"publisher","first-page":"102566","DOI":"10.1016\/j.sbi.2023.102566","volume":"80","author":"Benoit Baillif","year":"2023","unstructured":"Baillif Benoit, Cole Jason, McCabe Patrick, Bender Andreas (2023) Deep generative models for 3d molecular structure. Curr Opin Struct Biol 80:102566","journal-title":"Curr Opin Struct Biol"},{"key":"759_CR11","unstructured":"Xu M, Yu L, Song Y, Shi C, Ermon S, Tang J (2022) Geodiff: a geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations"},{"issue":"1","key":"759_CR12","doi-asserted-by":"publisher","first-page":"3293","DOI":"10.1038\/s41467-022-30839-x","volume":"13","author":"Daniel Flam-Shepherd","year":"2022","unstructured":"Flam-Shepherd Daniel, Zhu Kevin, Aspuru-Guzik Al\u00e1n (2022) Language models can learn complex molecular distributions. Nat Commun 13(1):3293","journal-title":"Nat Commun"},{"key":"759_CR13","unstructured":"Kusner MJ, Paige B, Hern\u00e1ndez-Lobato JM (2017) Grammar variational autoencoder. In International conference on machine learning, 1945\u20131954. PMLR"},{"issue":"1","key":"759_CR14","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41598-021-96812-8","volume":"11","author":"Youngchun Kwon","year":"2021","unstructured":"Kwon Youngchun, Kang Seokho, Choi Youn-Suk, Kim Inkoo (2021) Evolutionary design of molecules based on deep learning and a genetic algorithm. Sci Rep 11(1):1\u201311","journal-title":"Sci Rep"},{"issue":"12","key":"759_CR15","doi-asserted-by":"publisher","first-page":"5918","DOI":"10.1021\/acs.jcim.0c00915","volume":"60","author":"Blaschke Thomas","year":"2020","unstructured":"Thomas Blaschke, Josep Ar\u00fas-Pous, Hongming Chen, Christian Margreitter, Christian Tyrchan, Ola Engkvist, Kostas Papadopoulos, Atanas Patronov (2020) Reinvent 2.0: an ai tool for de novo drug design. J Chem Inform Model 60(12):5918\u20135922","journal-title":"J Chem Inform Model"},{"issue":"34","key":"759_CR16","doi-asserted-by":"publisher","first-page":"8016","DOI":"10.1039\/C9SC01928F","volume":"10","author":"Robin Winter","year":"2019","unstructured":"Winter Robin, Montanari Floriane, Steffen Andreas, Briem Hans, No\u00e9 Frank, Clevert Djork-Arn\u00e9 (2019) Efficient multi-objective molecular optimization in a continuous latent space. Chem Sci 10(34):8016\u20138024","journal-title":"Chem Sci"},{"issue":"1","key":"759_CR17","doi-asserted-by":"publisher","first-page":"972","DOI":"10.1080\/14686996.2017.1401424","volume":"18","author":"Xiufeng Yang","year":"2017","unstructured":"Yang Xiufeng, Zhang Jinzhe, Yoshizoe Kazuki, Terayama Kei, Tsuda Koji (2017) Chemts: an efficient python library for de novo molecular generation. Sci Technol Adv Mater 18(1):972\u2013976","journal-title":"Sci Technol Adv Mater"},{"key":"759_CR18","unstructured":"Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, Coley CW, Xiao C, Sun J, Zitnik M (2021) Therapeutics data commons: machine learning datasets and tasks for therapeutics. arXiv e-prints, pages arXiv\u20132102"},{"issue":"3","key":"759_CR19","doi-asserted-by":"publisher","first-page":"1096","DOI":"10.1021\/acs.jcim.8b00839","volume":"59","author":"Brown Nathan","year":"2019","unstructured":"Nathan Brown, Marco Fiscato, Segler Marwin HS, Vaucher Alain C (2019) Guacamol: benchmarking models for de novo molecular design. J Chem Inform Model 59(3):1096\u20131108","journal-title":"J Chem Inform Model"},{"key":"759_CR20","unstructured":"Yang X, Aasawat TK, Yoshizoe K (2020) Practical massively parallel monte-carlo tree search applied to molecular design. arXiv preprint arXiv:2006.10504"},{"key":"759_CR21","unstructured":"Jin W, Barzilay R, Jaakkola T (2018) Junction tree variational autoencoder for molecular graph generation. In International conference on machine learning, 2323\u20132332. PMLR"},{"issue":"1","key":"759_CR22","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41598-018-37186-2","volume":"9","author":"Zhou Zhenpeng","year":"2019","unstructured":"Zhenpeng Zhou, Steven Kearnes, Li Li, Zare Richard N, Patrick Riley (2019) Optimization of molecules via deep reinforcement learning. Sci Rep 9(1):1\u201310","journal-title":"Sci Rep"},{"key":"759_CR23","unstructured":"Krenn M, H\u00e4se F, Nigam A, Friederich P, Aspuru-Guzik A (2019) Selfies: a robust representation of semantically constrained graphs with an example application in chemistry. arXiv preprint arXiv:1905.13741"},{"key":"759_CR24","doi-asserted-by":"crossref","unstructured":"O\u2019Boyle N, Dalke A (2018) Deepsmiles: an adaptation of smiles for use in machine-learning of chemical structures","DOI":"10.26434\/chemrxiv.7097960"},{"key":"759_CR25","doi-asserted-by":"crossref","unstructured":"Shen T, Quach V, Barzilay R, Jaakkola T (2020) Blank language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 5186\u20135198","DOI":"10.18653\/v1\/2020.emnlp-main.420"},{"key":"759_CR26","unstructured":"Wei L, Li Q, Song Y, Stefanov S, Siriwardane E, Chen F, Hu J (2022) Crystal transformer: Self-learning neural language model for generative and tinkering design of materials. arXiv preprint arXiv:2204.11953"},{"key":"759_CR27","unstructured":"Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805"},{"key":"759_CR28","unstructured":"Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV (2019) Xlnet: generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32"},{"key":"759_CR29","first-page":"21342","volume":"35","author":"Wenhao Gao","year":"2022","unstructured":"Gao Wenhao, Tianfan Fu, Sun Jimeng, Coley Connor (2022) Sample efficiency matters: a benchmark for practical molecular optimization. Adv Neural Inform Process Syst 35:21342\u201321357","journal-title":"Adv Neural Inform Process Syst"},{"issue":"12","key":"759_CR30","doi-asserted-by":"publisher","first-page":"1256","DOI":"10.1038\/s42256-022-00580-7","volume":"4","author":"Jerret Ross","year":"2022","unstructured":"Ross Jerret, Belgodere Brian, Chenthamarakshan Vijil, Padhi Inkit, Mroueh Youssef, Das Payel (2022) Large-scale chemical language representations capture molecular structure and properties. Nat Mach Intell 4(12):1256\u20131264","journal-title":"Nat Mach Intell"},{"issue":"4","key":"759_CR31","doi-asserted-by":"publisher","first-page":"1560","DOI":"10.1021\/acs.jcim.0c01127","volume":"61","author":"Xinhao Li","year":"2021","unstructured":"Li Xinhao, Fourches Denis (2021) Smiles pair encoding: a data-driven substructure tokenization algorithm for deep learning. J Chem Inform Model 61(4):1560\u20131569","journal-title":"J Chem Inform Model"},{"key":"759_CR32","doi-asserted-by":"publisher","first-page":"1931","DOI":"10.3389\/fphar.2020.565644","volume":"11","author":"Daniil Polykovskiy","year":"2020","unstructured":"Polykovskiy Daniil, Zhebrak Alexander, Sanchez-Lengeling Benjamin, Golovanov Sergey, Tatanov Oktai, Belyaev Stanislav, Kurbanov Rauf, Artamonov Aleksey, Aladinskiy Vladimir, Veselov Mark et al (2020) Molecular sets (moses): a benchmarking platform for molecular generation models. Front Pharmacol 11:1931","journal-title":"Front Pharmacol"},{"key":"759_CR33","doi-asserted-by":"publisher","DOI":"10.1101\/292177","author":"Benhenda Mostapha","year":"2018","unstructured":"Mostapha Benhenda (2018) Can ai reproduce observed chemical diversity? bioRxiv. https:\/\/doi.org\/10.1101\/292177","journal-title":"bioRxiv"},{"key":"759_CR34","doi-asserted-by":"crossref","unstructured":"Preuer K, Renz P, Unterthiner T, Hochreiter S, Klambauer G (2018) Fr\u00e9chet chemblnet distance: A metric for generative models for molecules. arXiv preprint arXiv:1803.09518","DOI":"10.1021\/acs.jcim.8b00234"},{"issue":"5","key":"759_CR35","doi-asserted-by":"publisher","first-page":"742","DOI":"10.1021\/ci100050t","volume":"50","author":"David Rogers","year":"2010","unstructured":"Rogers David, Hahn Mathew (2010) Extended-connectivity fingerprints. J Chem Inform Model 50(5):742\u2013754","journal-title":"J Chem Inform Model"},{"key":"759_CR36","unstructured":"Tanimoto, Taffee T (1958) Elementary mathematical theory of classification and prediction, International Business Machines Corp."},{"issue":"10","key":"759_CR37","first-page":"1503","volume":"3","author":"Degen J\u00f6rg","year":"2008","unstructured":"J\u00f6rg Degen, Christof Wegscheid-Gerlach, Andrea Zaliani, Matthias Rarey (2008) On the art of compiling and using\u2019drug-like\u2019chemical fragment spaces. ChemMedChem Chem Enabling Drug Discov 3(10):1503\u20131507","journal-title":"ChemMedChem Chem Enabling Drug Discov"},{"issue":"15","key":"759_CR38","doi-asserted-by":"publisher","first-page":"2887","DOI":"10.1021\/jm9602928","volume":"39","author":"W Bemis Guy","year":"1996","unstructured":"Bemis Guy W, Murcko Mark A (1996) The properties of known drugs. 1. molecular frameworks. J Med Chem 39(15):2887\u20132893","journal-title":"J Med Chem"},{"issue":"2","key":"759_CR39","doi-asserted-by":"publisher","first-page":"90","DOI":"10.1038\/nchem.1243","volume":"4","author":"Bickerton G Richard","year":"2012","unstructured":"Richard Bickerton G, Paolini Gaia V, J\u00e9r\u00e9my Besnard, Sorel Muresan, Hopkins Andrew L (2012) Quantifying the chemical beauty of drugs. Nat Chem 4(2):90\u201398","journal-title":"Nat Chem"},{"key":"759_CR40","unstructured":"Landrum Greg (2019) Rdkit: Open-source cheminformatics, v. 2019. GitHub (https:\/\/github.com\/rdkit\/rdkit). Accessed 15 Aug 2022"},{"key":"759_CR41","unstructured":"Gnaneshwar D, Ramsundar B, Gandhi D, Kurchin R, Viswanathan V (2022) Score-based generative models for molecule generation. arXiv preprint arXiv:2203.04698"},{"key":"759_CR42","doi-asserted-by":"crossref","unstructured":"Wang W, Wang Y, Zhao H, Sciabola S (2022) A pre-trained conditional transformer for target-specific de novo molecular generation. arXiv preprint arXiv:2210.08749","DOI":"10.3390\/molecules28114430"},{"issue":"1","key":"759_CR43","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/ci00057a005","volume":"28","author":"Weininger David","year":"1988","unstructured":"David Weininger (1988) Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. J Chem Inform Comput Sci 28(1):31\u201336","journal-title":"J Chem Inform Comput Sci"},{"issue":"4","key":"759_CR44","doi-asserted-by":"publisher","first-page":"045024","DOI":"10.1088\/2632-2153\/aba947","volume":"1","author":"Mario Krenn","year":"2020","unstructured":"Krenn Mario, H\u00e4se Florian, Nigam AkshatKumar, Friederich Pascal, Aspuru-Guzik Alan (2020) Self-referencing embedded strings (selfies): a 100% robust molecular string representation. Mach Learn Sci Technol 1(4):045024","journal-title":"Mach Learn Sci Technol"},{"issue":"12","key":"759_CR45","doi-asserted-by":"publisher","first-page":"3093","DOI":"10.1021\/ci200379p","volume":"51","author":"Markus Hartenfeller","year":"2011","unstructured":"Hartenfeller Markus, Eberle Martin, Meier Peter, Nieto-Oberhuber Cristina, Altmann Karl-Heinz, Schneider Gisbert, Jacoby Edgar, Renner Steffen (2011) A collection of robust organic synthesis reactions for in silico molecule design. J Chem Inform Model 51(12):3093\u20133098","journal-title":"J Chem Inform Model"},{"key":"759_CR46","doi-asserted-by":"publisher","DOI":"10.1021\/acs.jcim.2c00671","author":"Wu Yang Yuwei","year":"2022","unstructured":"Yang Yuwei Wu, Zhenxing Yao Xiaojun, Kang Yu, Tingjun Hou, Chang-Yu Hsieh, Huanxiang Liu (2022) Exploring low-toxicity chemical space with deep learning for molecular generation. J Chem Inform Model. https:\/\/doi.org\/10.1021\/acs.jcim.2c00671","journal-title":"J Chem Inform Model"},{"issue":"11","key":"759_CR47","doi-asserted-by":"publisher","DOI":"10.1063\/1.2894544","volume":"128","author":"DJ Mowbray","year":"2008","unstructured":"Mowbray DJ, Glenn Jones, Sommer Thygesen Kristian (2008) Influence of functional groups on charge transport in molecular junctions. J Chem Phys 128(11):111103","journal-title":"J Chem Phys"},{"issue":"11","key":"759_CR48","doi-asserted-by":"publisher","first-page":"1366","DOI":"10.3390\/ph15111366","volume":"15","author":"Kirsten McAulay","year":"2022","unstructured":"McAulay Kirsten, Bilsland Alan, Bon Marta (2022) Reactivity of covalent fragments and their role in fragment based drug discovery. Pharmaceuticals 15(11):1366","journal-title":"Pharmaceuticals"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00759-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-023-00759-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00759-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,19]],"date-time":"2023-11-19T09:27:54Z","timestamp":1700386074000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-023-00759-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,25]]},"references-count":48,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["759"],"URL":"https:\/\/doi.org\/10.1186\/s13321-023-00759-z","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,25]]},"assertion":[{"value":"20 September 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 September 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 September 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"88"}}