{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T05:21:51Z","timestamp":1778044911818,"version":"3.51.4"},"reference-count":88,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2024,2,6]],"date-time":"2024-02-06T00:00:00Z","timestamp":1707177600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,2,6]],"date-time":"2024-02-06T00:00:00Z","timestamp":1707177600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Nat Mach Intell"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Machine learning has transformed many fields and has recently found applications in chemistry and materials science. The small datasets commonly found in chemistry sparked the development of sophisticated machine learning approaches that incorporate chemical knowledge for each application and, therefore, require specialized expertise to develop. Here we show that GPT-3, a large language model trained on vast amounts of text extracted from the Internet, can easily be adapted to solve various tasks in chemistry and materials science by fine-tuning it to answer chemical questions in natural language with the correct answer. We compared this approach with dedicated machine learning models for many applications spanning the properties of molecules and materials to the yield of chemical reactions. Surprisingly, our fine-tuned version of GPT-3 can perform comparably to or even outperform conventional machine learning techniques, in particular in the low-data limit. In addition, we can perform inverse design by simply inverting the questions. 
The ease of use and high performance, especially for small datasets, can impact the fundamental approach to using machine learning in the chemical and material sciences. In addition to a literature search, querying a pre-trained large language model might become a routine way to bootstrap a project by leveraging the collective knowledge encoded in these foundation models, or to provide a baseline for predictive tasks.<\/jats:p>","DOI":"10.1038\/s42256-023-00788-1","type":"journal-article","created":{"date-parts":[[2024,2,6]],"date-time":"2024-02-06T12:39:28Z","timestamp":1707223168000},"page":"161-169","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":309,"title":["Leveraging large language models for predictive chemistry"],"prefix":"10.1038","volume":"6","author":[{"given":"Kevin Maik","family":"Jablonka","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3046-6576","authenticated-orcid":false,"given":"Philippe","family":"Schwaller","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0065-0623","authenticated-orcid":false,"given":"Andres","family":"Ortega-Guerrero","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4653-8562","authenticated-orcid":false,"given":"Berend","family":"Smit","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,2,6]]},"reference":[{"key":"788_CR1","unstructured":"Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at https:\/\/arxiv.org\/abs\/2108.07258 (2021)."},{"key":"788_CR2","unstructured":"Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. https:\/\/proceedings.neurips.cc\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (2017)."},{"key":"788_CR3","unstructured":"Chowdhery, A. et al. PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 
24, 1\u2013113 (2023)."},{"key":"788_CR4","unstructured":"Hoffmann, J. et al. An empirical analysis of compute-optimal large language model training. Adv. Neural Inf. Process. Syst. 35, 30016\u201330030 (2022)."},{"key":"788_CR5","unstructured":"Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877\u20131901 (2020)."},{"key":"788_CR6","doi-asserted-by":"crossref","unstructured":"Edwards, C. N., Lai, T., Ros, K., Honke, G. & Ji, H. Translation between molecules and natural language. in Conference On Empirical Methods In Natural Language Processing (eds Goldberg, Y. et al.) 375\u2013413 (Association for Computational Linguistics, 2022).","DOI":"10.18653\/v1\/2022.emnlp-main.26"},{"key":"788_CR7","doi-asserted-by":"publisher","first-page":"79","DOI":"10.1039\/D1DD00009H","volume":"1","author":"GM Hocky","year":"2022","unstructured":"Hocky, G. M. & White, A. D. Natural language processing models that automate programming will transform chemistry research and teaching. Digit. Discov. 1, 79\u201383 (2022).","journal-title":"Digit. Discov."},{"key":"788_CR8","doi-asserted-by":"crossref","unstructured":"White, A. D. et al. Assessment of chemistry knowledge in large language models that generate. Digit. Discov. 2, 368\u2013376 (2023).","DOI":"10.1039\/D2DD00087C"},{"key":"788_CR9","unstructured":"Taylor, R. et al. Galactica: a large language model for science. Preprint at https:\/\/arxiv.org\/abs\/2211.09085 (2022)."},{"key":"788_CR10","unstructured":"Dunn, A. et al. Structured information extraction from complex scientific text with fine-tuned large language models. Adv. Neural Inf. Process. Syst. 35, 11763\u201311784 (2022)."},{"key":"788_CR11","doi-asserted-by":"publisher","first-page":"17545","DOI":"10.1021\/acs.jpcc.3c03106","volume":"127","author":"K Choudhary","year":"2023","unstructured":"Choudhary, K. & Kelley, M. L. ChemNLP: a natural language-processing-based library for materials chemistry text data. J. Phys. Chem. 
C 127, 17545\u201317555 (2023).","journal-title":"J. Phys. Chem. C"},{"key":"788_CR12","doi-asserted-by":"crossref","unstructured":"Jablonka, K. M. et al. 14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon. Digit. Discov. 2, 1233\u20131250 (2023).","DOI":"10.1039\/D3DD00113J"},{"key":"788_CR13","unstructured":"Dinh, T. et al. LIFT: language-interfaced fine-tuning for non-language machine learning tasks. Adv. Neural Inf. Process. Syst. 35, 11763\u201311784 (2022)."},{"key":"788_CR14","doi-asserted-by":"crossref","unstructured":"Karpov, P., Godin, G. & Tetko, I. V. Transformer-CNN: Swiss knife for QSAR modeling and interpretation. J. Cheminform. 12, 17 (2020).","DOI":"10.1186\/s13321-020-00423-w"},{"key":"788_CR15","doi-asserted-by":"publisher","first-page":"95","DOI":"10.1038\/s41586-019-1335-8","volume":"571","author":"V Tshitoyan","year":"2019","unstructured":"Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95\u201398 (2019).","journal-title":"Nature"},{"key":"788_CR16","doi-asserted-by":"publisher","first-page":"432","DOI":"10.1038\/s42256-023-00639-z","volume":"5","author":"J Born","year":"2023","unstructured":"Born, J. & Manica, M. Regression transformer enables concurrent sequence regression and generation for molecular language modelling. Nat. Mach. Intell. 5, 432\u2013444 (2023).","journal-title":"Nat. Mach. Intell."},{"key":"788_CR17","doi-asserted-by":"publisher","first-page":"025035","DOI":"10.1088\/2632-2153\/acdb30","volume":"4","author":"A Y\u00fcksel","year":"2023","unstructured":"Y\u00fcksel, A., Ulusoy, E., \u00dcnl\u00fc, A. & Do\u011fan, T. SELFormer: molecular representation learning via SELFIES language models. Mach. Learn. Sci. Technol. 4, 025035 (2023).","journal-title":"Mach. Learn. Sci. Technol."},{"key":"788_CR18","doi-asserted-by":"crossref","unstructured":"van Deursen, R., Ertl, P., Tetko, I. V. 
& Godin, G. GEN: highly efficient SMILES explorer using autodidactic generative examination networks. J. Cheminform. 12, 22 (2020).","DOI":"10.1186\/s13321-020-00425-8"},{"key":"788_CR19","doi-asserted-by":"crossref","unstructured":"Flam-Shepherd, D., Zhu, K. & Aspuru-Guzik, A. Language models can learn complex molecular distributions. Nat. Commun. 13, 3293 (2022).","DOI":"10.1038\/s41467-022-30839-x"},{"key":"788_CR20","doi-asserted-by":"publisher","first-page":"102527","DOI":"10.1016\/j.sbi.2023.102527","volume":"79","author":"F Grisoni","year":"2023","unstructured":"Grisoni, F. Chemical language models for de novo drug design: challenges and opportunities. Curr. Opin. Struct. Biol. 79, 102527 (2023).","journal-title":"Curr. Opin. Struct. Biol."},{"key":"788_CR21","unstructured":"Ramos, M. C., Michtavy, S. S., Porosoff, M. D. & White, A. D. Bayesian optimization of catalysts with in-context learning. Preprint at https:\/\/arxiv.org\/abs\/2304.05341 (2023)."},{"key":"788_CR22","unstructured":"Guo, T. et al. What indeed can GPT models do in chemistry? A comprehensive benchmark on eight tasks. Preprint at https:\/\/arxiv.org\/abs\/2305.18365 (2023)."},{"key":"788_CR23","doi-asserted-by":"crossref","unstructured":"Howard, J. & Ruder, S. Universal language model fine-tuning for text classification. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 328\u2013339 (Association for Computational Linguistics, 2018); https:\/\/aclanthology.org\/P18-1031","DOI":"10.18653\/v1\/P18-1031"},{"key":"788_CR24","doi-asserted-by":"publisher","unstructured":"Pei, Z., Yin, J., Hawk, J. A., Alman, D. E. & Gao, M. C. Machine-learning informed prediction of high-entropy solid solution formation: beyond the Hume\u2013Rothery rules. npj Comput. Mater. 
https:\/\/doi.org\/10.1038\/s41524-020-0308-7 (2020).","DOI":"10.1038\/s41524-020-0308-7"},{"key":"788_CR25","doi-asserted-by":"publisher","unstructured":"Dunn, A., Wang, Q., Ganose, A., Dopp, D. & Jain, A. Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm. npj Comput. Mater. https:\/\/doi.org\/10.1038\/s41524-020-00406-3 (2020).","DOI":"10.1038\/s41524-020-00406-3"},{"key":"788_CR26","unstructured":"Goldblum, M., Finzi, M., Rowan, K. & Wilson, A. The no free lunch theorem, Kolmogorov complexity, and the role of inductive biases in machine learning. ICLR 2024 Conference, OpenReview https:\/\/openreview.net\/forum?id=X7nz6ljg9Y (2023)."},{"key":"788_CR27","doi-asserted-by":"publisher","first-page":"1572","DOI":"10.1021\/acscentsci.9b00576","volume":"5","author":"P Schwaller","year":"2019","unstructured":"Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572\u20131583 (2019).","journal-title":"ACS Cent. Sci."},{"key":"788_CR28","doi-asserted-by":"publisher","first-page":"859","DOI":"10.1039\/D2DD00058J","volume":"1","author":"B Winter","year":"2022","unstructured":"Winter, B., Winter, C., Schilling, J. & Bardow, A. A smile is all you need: predicting limiting activity coefficients from SMILES with natural language processing. Digit. Discov. 1, 859\u2013869 (2022).","journal-title":"Digit. Discov."},{"key":"788_CR29","doi-asserted-by":"crossref","unstructured":"Dai, D. et al. Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers. Preprint at https:\/\/arxiv.org\/abs\/2212.10559 (2022).","DOI":"10.18653\/v1\/2023.findings-acl.247"},{"key":"788_CR30","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/ci00057a005","volume":"28","author":"D Weininger","year":"1988","unstructured":"Weininger, D. SMILES, a chemical language and information system. 1. 
Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31\u201336 (1988).","journal-title":"J. Chem. Inf. Comput. Sci."},{"key":"788_CR31","doi-asserted-by":"publisher","first-page":"045024","DOI":"10.1088\/2632-2153\/aba947","volume":"1","author":"M Krenn","year":"2020","unstructured":"Krenn, M., H\u00e4se, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. Sci. Technol. 1, 045024 (2020).","journal-title":"Mach. Learn. Sci. Technol."},{"key":"788_CR32","doi-asserted-by":"publisher","first-page":"100588","DOI":"10.1016\/j.patter.2022.100588","volume":"3","author":"M Krenn","year":"2022","unstructured":"Krenn, M. et al. SELFIES and the future of molecular string representations. Patterns 3, 100588 (2022).","journal-title":"Patterns"},{"key":"788_CR33","doi-asserted-by":"publisher","first-page":"360","DOI":"10.1126\/science.aat2663","volume":"361","author":"B Sanchez-Lengeling","year":"2018","unstructured":"Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: generative models for matter engineering. Science 361, 360\u2013365 (2018).","journal-title":"Science"},{"key":"788_CR34","doi-asserted-by":"publisher","first-page":"76","DOI":"10.1038\/s42256-020-00271-1","volume":"3","author":"Z Yao","year":"2021","unstructured":"Yao, Z. et al. Inverse design of nanoporous crystalline reticular materials with deep generative models. Nat. Mach. Intell. 3, 76\u201386 (2021).","journal-title":"Nat. Mach. Intell."},{"key":"788_CR35","doi-asserted-by":"publisher","first-page":"268","DOI":"10.1021\/acscentsci.7b00572","volume":"4","author":"R G\u00f3mez-Bombarelli","year":"2018","unstructured":"G\u00f3mez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268\u2013276 (2018).","journal-title":"ACS Cent. 
Sci."},{"key":"788_CR36","doi-asserted-by":"publisher","first-page":"eaax9324","DOI":"10.1126\/sciadv.aax9324","volume":"6","author":"B Kim","year":"2020","unstructured":"Kim, B., Lee, S. & Kim, J. Inverse design of porous materials using artificial neural networks. Sci. Adv. 6, eaax9324 (2020).","journal-title":"Sci. Adv."},{"key":"788_CR37","doi-asserted-by":"publisher","first-page":"2709","DOI":"10.1039\/C8TA12208C","volume":"7","author":"S Lee","year":"2019","unstructured":"Lee, S., Kim, B. & Kim, J. Predicting performance limits of methane gas storage in zeolites with an artificial neural network. J. Mater. Chem. A 7, 2709\u20132716 (2019).","journal-title":"J. Mater. Chem. A"},{"key":"788_CR38","unstructured":"Nigam, A., Friederich, P., Krenn, M. & Aspuru-Guzik, A. Augmenting genetic algorithms with deep neural networks for exploring the chemical space. In ICLR (2019)."},{"key":"788_CR39","unstructured":"Jablonka, K. M., Mcilwaine, F., Garcia, S., Smit, B. & Yoo, B. A reproducibility study of \u2018augmenting genetic algorithms with deep neural networks for exploring the chemical space\u2019. Preprint at https:\/\/arxiv.org\/abs\/2102.00700 (2021)."},{"key":"788_CR40","doi-asserted-by":"publisher","first-page":"e1600909","DOI":"10.1126\/sciadv.1600909","volume":"2","author":"YG Chung","year":"2016","unstructured":"Chung, Y. G. et al. In silico discovery of metal-organic frameworks for precombustion CO2 capture using a genetic algorithm. Sci. Adv. 2, e1600909 (2016).","journal-title":"Sci. Adv."},{"key":"788_CR41","doi-asserted-by":"publisher","first-page":"23647","DOI":"10.1021\/acsami.1c02471","volume":"13","author":"S Lee","year":"2021","unstructured":"Lee, S. et al. Computational screening of trillions of metal\u2013organic frameworks for high-performance methane storage. ACS Appl. Mater. Interfaces 13, 23647\u201323654 (2021).","journal-title":"ACS Appl. Mater. Interfaces"},{"key":"788_CR42","doi-asserted-by":"publisher","unstructured":"Collins, S. 
P., Daff, T. D., Piotrkowski, S. S. & Woo, T. K. Materials design by evolutionary optimization of functional groups in metal\u2013organic frameworks. Sci. Adv. https:\/\/doi.org\/10.1126\/sciadv.1600954 (2016).","DOI":"10.1126\/sciadv.1600954"},{"key":"788_CR43","doi-asserted-by":"publisher","first-page":"13541","DOI":"10.1039\/D2SC04306H","volume":"13","author":"R-R Griffiths","year":"2022","unstructured":"Griffiths, R.-R. et al. Data-driven discovery of molecular photoswitches with multioutput Gaussian processes. Chem. Sci. 13, 13541\u201313551 (2022).","journal-title":"Chem. Sci."},{"key":"788_CR44","doi-asserted-by":"publisher","first-page":"8","DOI":"10.1186\/1758-2946-1-8","volume":"1","author":"P Ertl","year":"2009","unstructured":"Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).","journal-title":"J. Cheminform."},{"key":"788_CR45","doi-asserted-by":"publisher","unstructured":"Jablonka, K. M., Jothiappan, G. M., Wang, S., Smit, B. & Yoo, B. Bias free multiobjective active learning for materials design and discovery. Nat. Commun. https:\/\/doi.org\/10.1038\/s41467-021-22437-0 (2021).","DOI":"10.1038\/s41467-021-22437-0"},{"key":"788_CR46","doi-asserted-by":"publisher","first-page":"1652","DOI":"10.1021\/acs.jctc.8b01176","volume":"15","author":"C Bannwarth","year":"2019","unstructured":"Bannwarth, C., Ehlert, S. & Grimme, S. GFN2-xTB\u2014an accurate and broadly parametrized self-consistent tight-binding quantum chemical method with multipole electrostatics and density-dependent dispersion contributions. J. Chem. Theory Comput. 15, 1652\u20131671 (2019).","journal-title":"J. Chem. Theory Comput."},{"key":"788_CR47","doi-asserted-by":"publisher","unstructured":"Isert, C., Atz, K., Jim\u00e9nez-Luna, J. & Schneider, G. 
QMugs: quantum mechanical properties of drug-like molecules https:\/\/doi.org\/10.3929\/ethz-b-000482129 (2021).","DOI":"10.3929\/ethz-b-000482129"},{"key":"788_CR48","doi-asserted-by":"crossref","unstructured":"Isert, C., Atz, K., Jim\u00e9nez-Luna, J. & Schneider, G. QMugs, quantum mechanical properties of drug-like molecules. Sci. Data 9, 273 (2022).","DOI":"10.1038\/s41597-022-01390-7"},{"key":"788_CR49","doi-asserted-by":"crossref","unstructured":"Westermayr, J., Gilkes, J., Barrett, R. & Maurer, R. J. High-throughput property-driven generative design of functional organic molecules. Nat. Comput. Sci. 3, 139\u2013148 (2023).","DOI":"10.1038\/s43588-022-00391-1"},{"key":"788_CR50","doi-asserted-by":"publisher","first-page":"365","DOI":"10.1038\/s41557-022-00910-7","volume":"14","author":"KM Jablonka","year":"2022","unstructured":"Jablonka, K. M., Patiny, L. & Smit, B. Making the collective knowledge of chemistry open and machine actionable. Nat. Chem. 14, 365\u2013376 (2022).","journal-title":"Nat. Chem."},{"key":"788_CR51","doi-asserted-by":"publisher","first-page":"1096","DOI":"10.1021\/acs.jcim.8b00839","volume":"59","author":"N Brown","year":"2019","unstructured":"Brown, N., Fiscato, M., Segler, M. H. & Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096\u20131108 (2019).","journal-title":"J. Chem. Inf. Model."},{"key":"788_CR52","unstructured":"Wang, B. Mesh-Transformer-JAX: model-parallel implementation of transformer language model with JAX. GitHub https:\/\/github.com\/kingoflolz\/mesh-transformer-jax (2021)."},{"key":"788_CR53","unstructured":"Wang, B. & Komatsuzaki, A. GPT-J-6B: a 6 billion parameter autoregressive language model. GitHub https:\/\/github.com\/kingoflolz\/mesh-transformer-jax (2021)."},{"key":"788_CR54","unstructured":"Gao, L. et al. The Pile: an 800GB dataset of diverse text for language modeling. 
Preprint at https:\/\/arxiv.org\/abs\/2101.00027 (2020)."},{"key":"788_CR55","unstructured":"Dettmers, T., Lewis, M., Belkada, Y. & Zettlemoyer, L. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Adv. Neural Inf. Process. Syst. 35, 30318\u201330332 (2022)."},{"key":"788_CR56","unstructured":"Dettmers, T., Lewis, M., Shleifer, S. & Zettlemoyer, L. 8-bit optimizers via block-wise quantization. in The Tenth International Conference on Learning Representations (2022)."},{"key":"788_CR57","unstructured":"Hu, E. J. et al. LoRA: low-rank adaptation of large language models. in International Conference On Learning Representations (2021)."},{"key":"788_CR58","doi-asserted-by":"publisher","unstructured":"Jablonka, K. M. kjappelbaum\/gptchem: initial release. Zenodo https:\/\/doi.org\/10.5281\/zenodo.7806672 (2023).","DOI":"10.5281\/zenodo.7806672"},{"key":"788_CR59","doi-asserted-by":"publisher","unstructured":"Jablonka, K. M. chemlift. Zenodo https:\/\/doi.org\/10.5281\/zenodo.10233422 (2023).","DOI":"10.5281\/zenodo.10233422"},{"key":"788_CR60","doi-asserted-by":"publisher","first-page":"653","DOI":"10.1080\/08927022.2018.1426855","volume":"44","author":"D Dubbeldam","year":"2018","unstructured":"Dubbeldam, D., Calero, S. & Vlugt, T. J. iRASPA: GPU-accelerated visualization software for materials scientists. Mol. Simul. 44, 653\u2013676 (2018).","journal-title":"Mol. Simul."},{"key":"788_CR61","doi-asserted-by":"publisher","first-page":"250","DOI":"10.1093\/bioinformatics\/btz470","volume":"36","author":"TT Le","year":"2020","unstructured":"Le, T. T., Fu, W. & Moore, J. H. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 36, 250\u2013256 (2020).","journal-title":"Bioinformatics"},{"key":"788_CR62","doi-asserted-by":"publisher","DOI":"10.1038\/s41524-021-00545-1","volume":"7","author":"AY-T Wang","year":"2021","unstructured":"Wang, A. Y.-T., Kauwe, S. K., Murdock, R. J. & Sparks, T. D. 
Compositionally restricted attention-based network for materials property predictions. npj Comput. Mater. 7, 77 (2021).","journal-title":"npj Comput. Mater."},{"key":"788_CR63","unstructured":"RDKit contributors. RDKit: Open-source Cheminformatics; (2023) http:\/\/www.rdkit.org"},{"key":"788_CR64","doi-asserted-by":"publisher","first-page":"1736","DOI":"10.1021\/acs.jcim.8b00234","volume":"58","author":"K Preuer","year":"2018","unstructured":"Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S. & Klambauer, G. Fr\u00e9chet ChemNet distance: a metric for generative models for molecules in drug discovery. J. Chem. Inf. Model. 58, 1736\u20131741 (2018).","journal-title":"J. Chem. Inf. Model."},{"key":"788_CR65","doi-asserted-by":"crossref","unstructured":"Probst, D. & Reymond, J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminform. 12, 12 (2020).","DOI":"10.1186\/s13321-020-0416-x"},{"key":"788_CR66","doi-asserted-by":"crossref","unstructured":"Probst, D. & Reymond, J.-L. A probabilistic molecular fingerprint for big data settings. J. Cheminform. 10, 66 (2018).","DOI":"10.1186\/s13321-018-0321-8"},{"key":"788_CR67","doi-asserted-by":"crossref","unstructured":"Ertl, P. & Rohde, B. The Molecule Cloud\u2014compact visualization of large collections of molecules. J. Cheminform. 4, 12 (2012).","DOI":"10.1186\/1758-2946-4-12"},{"key":"788_CR68","doi-asserted-by":"publisher","first-page":"279","DOI":"10.1038\/s42256-022-00447-x","volume":"4","author":"Y Wang","year":"2022","unstructured":"Wang, Y., Wang, J., Cao, Z. & Farimani, A. B. Molecular contrastive learning of representations via graph neural networks. Nat. Mach. Intell. 4, 279\u2013287 (2022).","journal-title":"Nat. Mach. Intell."},{"key":"788_CR69","doi-asserted-by":"publisher","first-page":"404002","DOI":"10.1088\/1361-648X\/ac1280","volume":"33","author":"P-PD Breuck","year":"2021","unstructured":"Breuck, P.-P. D., Evans, M. L. & Rignanese, G.-M. 
Robust model benchmarking and bias-imbalance in data-driven materials science: a case study on MODNet. J. Phys. Condens. Matter 33, 404002 (2021).","journal-title":"J. Phys. Condens. Matter"},{"key":"788_CR70","unstructured":"Hollmann, N., M\u00fcller, S., Eggensperger, K. & Hutter, F. TabPFN: a transformer that solves small tabular classification problems in a second. Preprint at https:\/\/arxiv.org\/abs\/2207.01848 (2022)."},{"key":"788_CR71","unstructured":"Griffiths, R.-R. et al. Gauche: a library for Gaussian processes in chemistry. in ICML 2022 2nd AI for Science Workshop https:\/\/openreview.net\/forum?id=i9MKI7zrWal (2022)"},{"key":"788_CR72","doi-asserted-by":"crossref","unstructured":"Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. in Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785\u2013794 (ACM, 2016).","DOI":"10.1145\/2939672.2939785"},{"key":"788_CR73","doi-asserted-by":"crossref","unstructured":"Moosavi, S. M. et al. Understanding the diversity of the metal-organic framework ecosystem. Nat. Commun. 11, 4068 (2020).","DOI":"10.1038\/s41467-020-17755-8"},{"key":"788_CR74","doi-asserted-by":"crossref","unstructured":"Moosavi, S. M. et al. A data-science approach to predict the heat capacity of nanoporous materials. Nat. Mater. 21, 1419\u20131425 (2022).","DOI":"10.1038\/s41563-022-01374-3"},{"key":"788_CR75","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1039\/D1DD00006C","volume":"1","author":"D Probst","year":"2022","unstructured":"Probst, D., Schwaller, P. & Reymond, J.-L. Reaction classification and yield prediction using the differential reaction fingerprint DRFP. Digit. Discov. 1, 91\u201397 (2022).","journal-title":"Digit. Discov."},{"key":"788_CR76","first-page":"5485","volume":"21","author":"C Raffel","year":"2020","unstructured":"Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 
21, 5485\u20135551 (2020).","journal-title":"J. Mach. Learn. Res."},{"key":"788_CR77","unstructured":"Radford, A. et al. Language models are unsupervised multitask learners. OpenAI blog 1, 9 (2019)."},{"key":"788_CR78","doi-asserted-by":"publisher","first-page":"711","DOI":"10.1007\/s10822-014-9747-x","volume":"28","author":"DL Mobley","year":"2014","unstructured":"Mobley, D. L. & Guthrie, J. P. FreeSolv: a database of experimental and calculated hydration free energies, with input files. J. Comput. Aided Mol. Des. 28, 711\u2013720 (2014).","journal-title":"J. Comput. Aided Mol. Des."},{"key":"788_CR79","doi-asserted-by":"publisher","first-page":"1000","DOI":"10.1021\/ci034243x","volume":"44","author":"JS Delaney","year":"2004","unstructured":"Delaney, J. S. ESOL: estimating aqueous solubility directly from molecular structure. J. Chem. Inf. Comput. Sci. 44, 1000\u20131005 (2004).","journal-title":"J. Chem. Inf. Comput. Sci."},{"key":"788_CR80","unstructured":"Mitchell, J. B. O. DLS-100 solubility dataset. University of St Andrews https:\/\/risweb.st-andrews.ac.uk:443\/portal\/en\/datasets\/dls100-solubility-dataset(3a3a5abc-8458-4924-8e6c-b804347605e8).html (2017)."},{"key":"788_CR81","unstructured":"Walters, P. Predicting aqueous solubility\u2014it\u2019s harder than it looks. Practical Cheminformatics https:\/\/practicalcheminformatics.blogspot.com\/2018\/09\/predicting-aqueous-solubility-its.html (2018)."},{"key":"788_CR82","doi-asserted-by":"publisher","first-page":"D1083","DOI":"10.1093\/nar\/gkt1031","volume":"42","author":"AP Bento","year":"2014","unstructured":"Bento, A. P. et al. The ChEMBL bioactivity database: an update. Nucleic Acids Res. 42, D1083\u2013D1090 (2014).","journal-title":"Nucleic Acids Res."},{"key":"788_CR83","doi-asserted-by":"publisher","first-page":"D1100","DOI":"10.1093\/nar\/gkr777","volume":"40","author":"A Gaulton","year":"2012","unstructured":"Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. 
Nucleic Acids Res. 40, D1100\u2013D1107 (2012).","journal-title":"Nucleic Acids Res."},{"key":"788_CR84","doi-asserted-by":"publisher","first-page":"2639","DOI":"10.1021\/acs.jpclett.8b00635","volume":"9","author":"S Nagasawa","year":"2018","unstructured":"Nagasawa, S., Al-Naamani, E. & Saeki, A. Computer-aided screening of conjugated polymers for organic solar cell: classification by random forest. J. Phys. Chem. Lett. 9, 2639\u20132646 (2018).","journal-title":"J. Phys. Chem. Lett."},{"key":"788_CR85","unstructured":"Kawazoe, Y., Yu, J.-Z., Tsai, A.-P. & Masumoto, T. (eds) Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys Landolt-B\u00f6rnstein: Numerical Data and Functional Relationships in Science and Technology\u2014New Series (Springer, 2006)."},{"key":"788_CR86","doi-asserted-by":"publisher","first-page":"1668","DOI":"10.1021\/acs.jpclett.8b00124","volume":"9","author":"Y Zhuo","year":"2018","unstructured":"Zhuo, Y., Tehrani, A. M. & Brgoch, J. Predicting the band gaps of inorganic solids by machine learning. J. Phys. Chem. Lett. 9, 1668\u20131673 (2018).","journal-title":"J. Phys. Chem. Lett."},{"key":"788_CR87","doi-asserted-by":"publisher","first-page":"186","DOI":"10.1126\/science.aar5169","volume":"360","author":"DT Ahneman","year":"2018","unstructured":"Ahneman, D. T., Estrada, J. G., Lin, S., Dreher, S. D. & Doyle, A. G. Predicting reaction performance in C\u2013N cross-coupling using machine learning. Science 360, 186\u2013190 (2018).","journal-title":"Science"},{"key":"788_CR88","doi-asserted-by":"publisher","first-page":"429","DOI":"10.1126\/science.aap9112","volume":"359","author":"D Perera","year":"2018","unstructured":"Perera, D. et al. A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow. 
Science 359, 429\u2013434 (2018).","journal-title":"Science"}],"container-title":["Nature Machine Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s42256-023-00788-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s42256-023-00788-1","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s42256-023-00788-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,3,7]],"date-time":"2024-03-07T12:19:50Z","timestamp":1709813990000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s42256-023-00788-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,6]]},"references-count":88,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2024,2]]}},"alternative-id":["788"],"URL":"https:\/\/doi.org\/10.1038\/s42256-023-00788-1","relation":{"has-preprint":[{"id-type":"doi","id":"10.26434\/chemrxiv-2023-fw8n4-v3","asserted-by":"object"}]},"ISSN":["2522-5839"],"issn-type":[{"value":"2522-5839","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,6]]},"assertion":[{"value":"16 May 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 December 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 February 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}