{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T22:43:24Z","timestamp":1777502604325,"version":"3.51.4"},"reference-count":67,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,6,10]],"date-time":"2025-06-10T00:00:00Z","timestamp":1749513600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,6,10]],"date-time":"2025-06-10T00:00:00Z","timestamp":1749513600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001320","name":"Wolfson Foundation","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001320","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100000288","name":"Royal Society","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000288","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Virtual Screening (VS) of large compound libraries using Artificial Intelligence (AI) models is a highly effective approach for early drug discovery. Data splitting is crucial for benchmarking the performance of such AI models. Traditional random data splits often result in structurally similar molecules in both training and test sets, which conflict with the reality of VS libraries that typically contain structurally diverse compounds. To tackle this challenge, scaffold split, which groups molecules by shared core structure, and Butina clustering, which clusters molecules by chemotypes, have long been used. However, we show that these methods still introduce high similarities between training and test sets, leading to overestimated model performance. Our study examined four representative AI models across 60 NCI-60 datasets, each comprising approximately 33,000\u201354,000 molecules tested on different cancer cell lines. Each dataset was split in four ways: random, scaffold, Butina clustering and the more realistic Uniform Manifold Approximation and Projection (UMAP) clustering. Using Linear Regression, Random Forest, Transformer-CNN, and GEM, we trained a total of 8400 models and evaluated under four splitting methods. These comprehensive results show that UMAP split provides more challenging and realistic benchmarks for model evaluation, followed by Butina splits, then scaffold splits and closely after random splits. Consequently, we recommend using UMAP splits instead of overly optimistic Butina splits and especially scaffold splits for molecular property prediction, including VS. Lastly, we illustrate how misaligned ROC AUC is with VS goals, despite its common use. The code and datasets for reproducibility are available at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/Rong830\/UMAP_split_for_VS\" ext-link-type=\"uri\">https:\/\/github.com\/Rong830\/UMAP_split_for_VS<\/jats:ext-link>\n                    and archived in\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/zenodo.org\/records\/14736486\" ext-link-type=\"uri\">https:\/\/zenodo.org\/records\/14736486<\/jats:ext-link>\n                    .\n                  <\/jats:p>\n                  <jats:p>\n                    <jats:bold>Scientific contribution<\/jats:bold>\n                    This work advances the field by introducing UMAP clustering as a robust splitting method for molecular datasets, improving over traditional methods like Butina clustering and especially scaffold splits. It offers a new evaluation framework to benchmark AI models under more realistic conditions, fostering progress in molecular property prediction. The findings also show how inappropriate the use of ROC AUC for virtual screening (VS) continues to be, despite its popularity, emphasizing the need for context-specific evaluation metrics.\n                  <\/jats:p>","DOI":"10.1186\/s13321-025-01039-8","type":"journal-article","created":{"date-parts":[[2025,6,10]],"date-time":"2025-06-10T13:22:16Z","timestamp":1749561736000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["UMAP-based clustering split for rigorous evaluation of AI models for virtual screening on cancer cell lines*"],"prefix":"10.1186","volume":"17","author":[{"given":"Qianrong","family":"Guo","sequence":"first","affiliation":[]},{"given":"Saiveth","family":"Hernandez-Hernandez","sequence":"additional","affiliation":[]},{"given":"Pedro J.","family":"Ballester","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,6,10]]},"reference":[{"issue":"7958","key":"1039_CR1","doi-asserted-by":"publisher","first-page":"673","DOI":"10.1038\/s41586-023-05905-z","volume":"616","author":"AV Sadybekov","year":"2023","unstructured":"Sadybekov AV, Katritch V (2023) Computational approaches streamlining drug discovery. Nature 616(7958):673\u2013685. https:\/\/doi.org\/10.1038\/s41586-023-05905-z","journal-title":"Nature"},{"key":"1039_CR2","unstructured":"Austin D, Hayford T (2021) Research and Development in the Pharmaceutical Industry. [cited 2024 Mar 4]. Available from: www.cbo.gov\/publication\/57025"},{"key":"1039_CR3","doi-asserted-by":"crossref","unstructured":"Ballester PJ (2023) The AI revolution in chemistry is not that far away. Vol. 624, Nature. pp 252","DOI":"10.1038\/d41586-023-03948-w"},{"issue":"24","key":"1039_CR4","doi-asserted-by":"publisher","first-page":"5441","DOI":"10.1039\/C8SC00148K","volume":"9","author":"A Mayr","year":"2018","unstructured":"Mayr A, Klambauer G, Unterthiner T, Steijaert M, Wegner JK, Ceulemans H et al (2018) Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem Sci 9(24):5441\u20135451. https:\/\/doi.org\/10.1039\/C8SC00148K","journal-title":"Chem Sci"},{"issue":"3","key":"1039_CR5","doi-asserted-by":"publisher","first-page":"672","DOI":"10.1038\/s41596-021-00659-2","volume":"17","author":"F Gentile","year":"2022","unstructured":"Gentile F, Yaacoub JC, Gleave J, Fernandez M, Ton AT, Ban F et al (2022) Artificial intelligence\u2013enabled virtual screening of ultra-large chemical libraries with deep docking. Nat Protoc 17(3):672\u2013697. https:\/\/doi.org\/10.1038\/s41596-021-00659-2","journal-title":"Nat Protoc"},{"issue":"4","key":"1039_CR6","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0061318","volume":"8","author":"MP Menden","year":"2013","unstructured":"Menden MP, Iorio F, Garnett M, McDermott U, Benes CH, Ballester PJ et al (2013) Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS ONE 8(4):e61318","journal-title":"PLoS ONE"},{"key":"1039_CR7","doi-asserted-by":"publisher","DOI":"10.1039\/D5CS00146C","author":"AM Mroz","year":"2025","unstructured":"Mroz AM, Basford AR, Hastedt F, Jayasekera IS, Mosquera-Lois I, Sedgwick R et al (2025) Cross-disciplinary perspectives on the potential for artificial intelligence across chemistry. Chem Soc Rev. https:\/\/doi.org\/10.1039\/D5CS00146C","journal-title":"Chem Soc Rev"},{"issue":"15","key":"1039_CR8","doi-asserted-by":"publisher","first-page":"10241","DOI":"10.1021\/acs.jmedchem.3c00128","volume":"66","author":"A Gryniukova","year":"2023","unstructured":"Gryniukova A, Kaiser F, Myziuk I, Alieksieieva D, Leberecht C, Heym PP et al (2023) AI-powered virtual screening of large compound libraries leads to the discovery of novel inhibitors of sirtuin-1. J Med Chem 66(15):10241\u201310251. https:\/\/doi.org\/10.1021\/acs.jmedchem.3c00128","journal-title":"J Med Chem"},{"issue":"1","key":"1039_CR9","doi-asserted-by":"publisher","first-page":"48","DOI":"10.1186\/s13321-022-00630-7","volume":"14","author":"N Kumar","year":"2022","unstructured":"Kumar N, Acharya V (2022) Machine intelligence-driven framework for optimized hit selection in virtual screening. J Cheminform 14(1):48. https:\/\/doi.org\/10.1186\/s13321-022-00630-7","journal-title":"J Cheminform"},{"key":"1039_CR10","doi-asserted-by":"publisher","first-page":"9816939","DOI":"10.34133\/2022\/9816939","volume":"2022","author":"Y Luo","year":"2024","unstructured":"Luo Y, Peng J, Ma J (2024) Next decade\u2019s AI-based drug development features tight integration of data and computation. Heal Data Sci. 2022:9816939. https:\/\/doi.org\/10.34133\/2022\/9816939","journal-title":"Heal Data Sci."},{"issue":"3","key":"1039_CR11","doi-asserted-by":"publisher","first-page":"bbaa095","DOI":"10.1093\/bib\/bbaa095","volume":"22","author":"L Fresnais","year":"2021","unstructured":"Fresnais L, Ballester PJ (2021) The impact of compound library size on the performance of scoring functions for structure-based virtual screening. Brief Bioinform 22(3):bbaa095. https:\/\/doi.org\/10.1093\/bib\/bbaa095","journal-title":"Brief Bioinform"},{"issue":"47","key":"1039_CR12","doi-asserted-by":"publisher","first-page":"83142","DOI":"10.18632\/oncotarget.20915","volume":"8","author":"L Zhang","year":"2017","unstructured":"Zhang L, Ai HX, Li SM, Qi MY, Zhao J, Zhao Q et al (2017) Virtual screening approach to identifying influenza virus neuraminidase inhibitors using molecular docking combined with machine-learning-based scoring function. Oncotarget 8(47):83142\u201383154","journal-title":"Oncotarget"},{"key":"1039_CR13","unstructured":"Hernandez-Hernandez S, Guo Q, Ballester PJ (2024) Conformal prediction of molecule-induced cancer cell growth inhibition challenged by strong distribution shifts. bioRxiv. 2024.03.15.585269. Available from: http:\/\/biorxiv.org\/content\/early\/2024\/03\/17\/2024.03.15.585269.abstract"},{"key":"1039_CR14","unstructured":"Hern\u00e1ndez-Hern\u00e1ndez S, Vishwakarma S, Ballester PJ (2022) Conformal prediction of small-molecule drug resistance in cancer cell lines. In: Proceedings of machine learning research. conformal and probabilistic prediction and applications. pp 92\u2013108"},{"issue":"2","key":"1039_CR15","doi-asserted-by":"publisher","first-page":"513","DOI":"10.1039\/C7SC02664A","volume":"9","author":"Z Wu","year":"2018","unstructured":"Wu Z, Ramsundar B, Feinberg ENNN, Gomes J, Geniesse C, Pappu AS et al (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9(2):513\u2013530. https:\/\/doi.org\/10.1039\/C7SC02664A","journal-title":"Chem Sci"},{"key":"1039_CR16","doi-asserted-by":"publisher","DOI":"10.1038\/s41589-023-01349-8","author":"G Liu","year":"2023","unstructured":"Liu G, Catacutan DB, Rathod K, Swanson K, Jin W, Mohammed JC et al (2023) Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii. Nat Chem Biol. https:\/\/doi.org\/10.1038\/s41589-023-01349-8","journal-title":"Nat Chem Biol"},{"issue":"7997","key":"1039_CR17","doi-asserted-by":"publisher","first-page":"177","DOI":"10.1038\/s41586-023-06887-8","volume":"626","author":"F Wong","year":"2024","unstructured":"Wong F, Zheng EJ, Valeri JA, Donghia NM, Anahtar MN, Omori S et al (2024) Discovery of a structural class of antibiotics with explainable deep learning. Nature 626(7997):177\u2013185. https:\/\/doi.org\/10.1038\/s41586-023-06887-8","journal-title":"Nature"},{"issue":"6","key":"1039_CR18","doi-asserted-by":"publisher","first-page":"104009","DOI":"10.1016\/j.drudis.2024.104009","volume":"29","author":"MKP Jayatunga","year":"2024","unstructured":"Jayatunga MKP, Ayers M, Bruens L, Jayanth D, Meier C (2024) How successful are AI-discovered drugs in clinical trials? A first analysis and emerging lessons. Drug Discov Today 29(6):104009","journal-title":"Drug Discov Today"},{"issue":"2","key":"1039_CR19","doi-asserted-by":"publisher","first-page":"575","DOI":"10.1109\/TCBB.2019.2919581","volume":"18","author":"M Li","year":"2021","unstructured":"Li M, Wang Y, Zheng R, Shi X, Li Y, Wu FX et al (2021) DeepDSC: a deep learning method to predict drug sensitivity of cancer cell lines. IEEE\/ACM Trans Comput Biol Bioinforma 18(2):575\u2013582","journal-title":"IEEE\/ACM Trans Comput Biol Bioinforma"},{"key":"1039_CR20","doi-asserted-by":"publisher","DOI":"10.1093\/bib\/bbab356","author":"F Xia","year":"2021","unstructured":"Xia F, Allen J, Balaprakash P, Brettin T, Garcia-Cardona C, Clyde A et al (2021) A cross-study analysis of drug response prediction in cancer cell lines. Brief Bioinform. https:\/\/doi.org\/10.1093\/bib\/bbab356","journal-title":"Brief Bioinform"},{"issue":"1","key":"1039_CR21","doi-asserted-by":"publisher","first-page":"85","DOI":"10.1093\/bioinformatics\/btv529","volume":"32","author":"I Cort\u00e9s-Ciriano","year":"2016","unstructured":"Cort\u00e9s-Ciriano I, van Westen GJP, Bouvier G, Nilges M, Overington JP, Bender A et al (2016) Improved large-scale prediction of growth inhibition patterns using the NCI60 cancer cell line panel. Bioinformatics 32(1):85\u201395","journal-title":"Bioinformatics"},{"issue":"8","key":"1039_CR22","doi-asserted-by":"publisher","first-page":"2347","DOI":"10.1021\/ci500152b","volume":"54","author":"M Ammad-ud-din","year":"2014","unstructured":"Ammad-ud-din M, Georgii E, G\u00f6nen M, Laitinen T, Kallioniemi O, Wennerberg K et al (2014) Integrative and personalized QSAR analysis in cancer by kernelized Bayesian matrix factorization. J Chem Inf Model 54(8):2347\u20132359","journal-title":"J Chem Inf Model"},{"issue":"20","key":"1039_CR23","doi-asserted-by":"publisher","first-page":"3989","DOI":"10.1093\/bioinformatics\/btz183","volume":"35","author":"H Li","year":"2019","unstructured":"Li H, Peng J, Sidorov P, Leung Y, Leung KS, Wong MH et al (2019) Classical scoring functions for docking are unable to exploit large volumes of structural and interaction data. Bioinformatics 35(20):3989\u20133995","journal-title":"Bioinformatics"},{"issue":"3","key":"1039_CR24","doi-asserted-by":"publisher","first-page":"498","DOI":"10.3390\/biom13030498","volume":"13","author":"S Hern\u00e1ndez-Hern\u00e1ndez","year":"2023","unstructured":"Hern\u00e1ndez-Hern\u00e1ndez S, Ballester PJ (2023) On the best way to cluster NCI-60 molecules. Biomolecules 13(3):498","journal-title":"Biomolecules"},{"key":"1039_CR25","doi-asserted-by":"publisher","DOI":"10.1016\/j.ejmech.2023.115773","volume":"260","author":"XM Zhou","year":"2023","unstructured":"Zhou XM, Li QY, Lu X, Bheemanaboina RRY, Fang B, Cai GX et al (2023) Identification of unique indolylcyanoethylenyl sulfonylanilines as novel structural scaffolds of potential antibacterial agents. Eur J Med Chem 260:115773","journal-title":"Eur J Med Chem"},{"issue":"15","key":"1039_CR26","doi-asserted-by":"publisher","first-page":"2887","DOI":"10.1021\/jm9602928","volume":"39","author":"GW Bemis","year":"1996","unstructured":"Bemis GW, Murcko MA (1996) The properties of known drugs. 1. Molecular frameworks. J Med Chem 39(15):2887\u20132893. https:\/\/doi.org\/10.1021\/jm9602928","journal-title":"J Med Chem"},{"issue":"4","key":"1039_CR27","doi-asserted-by":"publisher","first-page":"747","DOI":"10.1021\/ci9803381","volume":"39","author":"D Butina","year":"1999","unstructured":"Butina D (1999) Unsupervised data base clustering based on daylight\u2019s fingerprint and tanimoto similarity: a fast and automated way to cluster small and large data sets. J Chem Inf Comput Sci 39(4):747\u2013750. https:\/\/doi.org\/10.1021\/ci9803381","journal-title":"J Chem Inf Comput Sci"},{"key":"1039_CR28","doi-asserted-by":"crossref","unstructured":"Guo Q, Hernandez-Hernandez S, Ballester PJ (2024) Scaffold Splits Overestimate Virtual Screening Performance BT-Artificial Neural Networks and Machine Learning\u2013ICANN 2024. In: Wand M, Malinovsk\u00e1 K, Schmidhuber J, Tetko IV (eeds.) Cham: Springer Nature Switzerland. pp 58\u201372","DOI":"10.1007\/978-3-031-72359-9_5"},{"issue":"3","key":"1039_CR29","doi-asserted-by":"publisher","first-page":"279","DOI":"10.1038\/s42256-022-00447-x","volume":"4","author":"Y Wang","year":"2022","unstructured":"Wang Y, Wang J, Cao Z, Barati FA (2022) Molecular contrastive learning of representations via graph neural networks. Nat Mach Intell 4(3):279\u2013287. https:\/\/doi.org\/10.1038\/s42256-022-00447-x","journal-title":"Nat Mach Intell"},{"issue":"12","key":"1039_CR30","doi-asserted-by":"publisher","first-page":"6065","DOI":"10.1021\/acs.jcim.0c00675","volume":"60","author":"JJ Irwin","year":"2020","unstructured":"Irwin JJ, Tang KG, Young J, Dandarchuluun C, Wong BR, Khurelbaatar M et al (2020) ZINC20\u2014a free ultralarge-scale chemical database for ligand discovery. J Chem Inf Model 60(12):6065\u20136073","journal-title":"J Chem Inf Model"},{"key":"1039_CR31","unstructured":"Steshin S (2023) Lo-Hi: Practical ML Drug Discovery Benchmark. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc.; pp 64526\u201354. Available from: https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2023\/file\/cb82f1f97ad0ca1d92df852a44a3bd73-Paper-Datasets_and_Benchmarks.pdf"},{"issue":"3","key":"1039_CR32","doi-asserted-by":"publisher","first-page":"697","DOI":"10.1021\/acs.jcim.3c01774","volume":"64","author":"P Tossou","year":"2024","unstructured":"Tossou P, Wognum C, Craig M, Mary H, Noutahi E (2024) Real-world molecular out-of-distribution: specification and Investigation. J Chem Inf Model 64(3):697\u2013711. https:\/\/doi.org\/10.1021\/acs.jcim.3c01774","journal-title":"J Chem Inf Model"},{"key":"1039_CR33","doi-asserted-by":"crossref","unstructured":"Shoemaker RH (2006) The NCI60 human tumour cell line anticancer drug screen. Nat Rev Cancer. Oct [cited 2023 Nov 8];6(10):813\u201323. Available from: https:\/\/www.nature.com\/articles\/nrc1951","DOI":"10.1038\/nrc1951"},{"issue":"6","key":"1039_CR34","doi-asserted-by":"publisher","first-page":"463","DOI":"10.1038\/s41573-019-0024-5","volume":"18","author":"J Vamathevan","year":"2019","unstructured":"Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G et al (2019) Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 18(6):463\u2013477. https:\/\/doi.org\/10.1038\/s41573-019-0024-5","journal-title":"Nat Rev Drug Discov"},{"key":"1039_CR35","doi-asserted-by":"crossref","unstructured":"Ferreira LT, Borba JVB, Moreira-Filho JT, Rimoldi A, Andrade CH, Costa FT (2021) QSAR-Based Virtual Screening of Natural Products Database for Identification of Potent Antimalarial Hits. Vol. 11, Biomolecules","DOI":"10.3390\/biom11030459"},{"issue":"suppl_2","key":"1039_CR36","doi-asserted-by":"publisher","first-page":"W486","DOI":"10.1093\/nar\/gkr320","volume":"39","author":"TWH Backman","year":"2011","unstructured":"Backman TWH, Cao Y, Girke T (2011) ChemMine tools: an online service for analyzing and clustering small molecules. Nucleic Acids Res 39(suppl_2):W486\u2013W891","journal-title":"Nucleic Acids Res"},{"key":"1039_CR37","unstructured":"Surendran A, Zsigmond K, L\u00f3pez-P\u00e9rez K, Miranda-Quintana RA (2025) Is Tanimoto a metric? bioRxiv. 2025.02.18.638904. Available from: http:\/\/biorxiv.org\/content\/early\/2025\/02\/23\/2025.02.18.638904.abstract"},{"key":"1039_CR38","unstructured":"Landrum G (2016) RDKit: Open-Source Cheminformatics Software. Available from: https:\/\/github.com\/rdkit\/rdkit\/releases\/tag\/Release_2016_09_4"},{"key":"1039_CR39","unstructured":"Landrum G, Tosco P, Kelley B, Ric, Cosgrove D, Sriniker, et al (2024) rdkit\/rdkit: Release_2023.09.5. Zenodo. 10.5281\/zenodo.10633624"},{"issue":"2","key":"1039_CR40","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s42256-021-00438-4","volume":"4","author":"X Fang","year":"2022","unstructured":"Fang X, Liu L, Lei J, He D, Zhang S, Zhou J et al (2022) Geometry-enhanced molecular representation learning for property prediction. Nat Mach Intell 4(2):1\u20138. https:\/\/doi.org\/10.1038\/s42256-021-00438-4","journal-title":"Nat Mach Intell"},{"key":"1039_CR41","doi-asserted-by":"crossref","unstructured":"Karpov P, Godin G, Tetko IV (2019) Transformer-CNN: fast and reliable tool for QSAR. arXiv Prepr arXiv191106603","DOI":"10.26434\/chemrxiv.9961787.v1"},{"key":"1039_CR42","first-page":"246","volume":"15","author":"F Galton","year":"1886","unstructured":"Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Gt Britain Irel 15:246\u2013263","journal-title":"J Anthropol Inst Gt Britain Irel"},{"key":"1039_CR43","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2021","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O et al (2021) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825\u20132830","journal-title":"J Mach Learn Res"},{"key":"1039_CR44","unstructured":"Ho TK (1995) Random decision forests. In: Proceedings of 3rd international conference on document analysis and recognition. Canada. pp 278\u201382. Available from: https:\/\/ieeexplore.ieee.org\/stamp\/stamp.jsp?tp=&arnumber=598994&isnumber=13183"},{"issue":"11","key":"1039_CR45","doi-asserted-by":"publisher","first-page":"2324","DOI":"10.1021\/acs.jcim.5b00559","volume":"55","author":"T Sterling","year":"2015","unstructured":"Sterling T, Irwin JJ (2015) ZINC 15-ligand discovery for everyone. J Chem Inf Model 55(11):2324\u20132337. https:\/\/doi.org\/10.1021\/acs.jcim.5b00559","journal-title":"J Chem Inf Model"},{"key":"1039_CR46","unstructured":"Chanchabi C, Paul C, Coulon V, Cicalini I, Pierogostino D, Vishwakarma S, et al. (2025) Machine learning-based drug response prediction identifies novel therapeutic candidates for colorectal cancer cell line KM-12. bioRxiv. 2025.02.24.639035. Available from: http:\/\/biorxiv.org\/content\/early\/2025\/02\/26\/2025.02.24.639035.abstract"},{"key":"1039_CR47","doi-asserted-by":"crossref","unstructured":"Eytcheson SA, Tetko IV (2025) Which modern AI methods provide accurate predictions of toxicological endpoints? Analysis of Tox24 challenge results","DOI":"10.26434\/chemrxiv-2025-7k7x3-v2"},{"issue":"1","key":"1039_CR48","doi-asserted-by":"publisher","first-page":"139","DOI":"10.1186\/s13321-024-00934-w","volume":"16","author":"IV Tetko","year":"2024","unstructured":"Tetko IV, van Deursen R, Godin G (2024) Be aware of overfitting by hyperparameter optimization! J Cheminform. 16(1):139. https:\/\/doi.org\/10.1186\/s13321-024-00934-w","journal-title":"J Cheminform."},{"issue":"1","key":"1039_CR49","doi-asserted-by":"publisher","first-page":"5575","DOI":"10.1038\/s41467-020-19266-y","volume":"11","author":"IV Tetko","year":"2020","unstructured":"Tetko IV, Karpov P, Van Deursen R, Godin G (2020) State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nat Commun 11(1):5575. https:\/\/doi.org\/10.1038\/s41467-020-19266-y","journal-title":"Nat Commun"},{"key":"1039_CR50","unstructured":"Kimber TB, Engelke S, Tetko I V, Bruno E, Godin G (2018) Synergy Effect between Convolutional Neural Networks and the Multiplicity of SMILES for Improvement of Molecular Prediction. Available from: https:\/\/arxiv.org\/abs\/1812.04439"},{"issue":"8","key":"1039_CR51","doi-asserted-by":"publisher","first-page":"1289","DOI":"10.1021\/acs.chemrestox.2c00196","volume":"35","author":"IV Tetko","year":"2022","unstructured":"Tetko IV, Klambauer G, Clevert DA, Shah I, Benfenati E (2022) Artificial intelligence meets toxicology. Chem Res Toxicol 35(8):1289\u20131290. https:\/\/doi.org\/10.1021\/acs.chemrestox.2c00196","journal-title":"Chem Res Toxicol"},{"issue":"2","key":"1039_CR52","doi-asserted-by":"publisher","DOI":"10.1016\/j.slasd.2024.01.005","volume":"29","author":"A Hunklinger","year":"2024","unstructured":"Hunklinger A, Hartog P, \u0160\u00edcho M, Godin G, Tetko IV (2024) The openOCHEM consensus model is the best-performing open-source predictive model in the First EUOS\/SLAS joint compound solubility challenge. SLAS Discov 29(2):100144","journal-title":"SLAS Discov"},{"issue":"11","key":"1039_CR53","doi-asserted-by":"publisher","first-page":"3460","DOI":"10.1038\/s41596-023-00885-w","volume":"18","author":"VK Tran-Nguyen","year":"2023","unstructured":"Tran-Nguyen VK, Junaid M, Simeon S, Ballester PJ (2023) A practical guide to machine-learning scoring for structure-based virtual screening. Nat Protoc 18(11):3460\u20133511. https:\/\/doi.org\/10.1038\/s41596-023-00885-w","journal-title":"Nat Protoc"},{"issue":"2","key":"1039_CR54","doi-asserted-by":"publisher","first-page":"100077","DOI":"10.1016\/j.aichem.2024.100077","volume":"2","author":"K L\u00f3pez-P\u00e9rez","year":"2024","unstructured":"L\u00f3pez-P\u00e9rez K, Avellaneda-Tamayo JF, Chen L, L\u00f3pez-L\u00f3pez E, Ju\u00e1rez-Mercado KE, Medina-Franco JL et al (2024) Molecular similarity: theory, applications, and perspectives. Artif Intell Chem. 2(2):100077","journal-title":"Artif Intell Chem."},{"issue":"D1","key":"1039_CR55","doi-asserted-by":"publisher","first-page":"D1180","DOI":"10.1093\/nar\/gkad1004","volume":"52","author":"B Zdrazil","year":"2024","unstructured":"Zdrazil B, Felix E, Hunter F, Manners EJ, Blackshaw J, Corbett S et al (2024) The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res 52(D1):D1180\u2013D1192. https:\/\/doi.org\/10.1093\/nar\/gkad1004","journal-title":"Nucleic Acids Res"},{"key":"1039_CR56","unstructured":"https:\/\/www.fda.gov\/drugs\/drug-approvals-and-databases\/drugsfda-data-files. 2025. Drugs@FDA Data Files"},{"issue":"4","key":"1039_CR57","doi-asserted-by":"publisher","first-page":"1166","DOI":"10.1021\/acs.jcim.2c01253","volume":"63","author":"BI Tingle","year":"2023","unstructured":"Tingle BI, Tang KG, Castanon M, Gutierrez JJ, Khurelbaatar M, Dandarchuluun C et al (2023) ZINC-22\u2500a free multi-billion-scale database of tangible compounds for ligand discovery. J Chem Inf Model 63(4):1166\u20131176. https:\/\/doi.org\/10.1021\/acs.jcim.2c01253","journal-title":"J Chem Inf Model"},{"issue":"2","key":"1039_CR58","doi-asserted-by":"publisher","first-page":"488","DOI":"10.1021\/ci600426e","volume":"47","author":"JF Truchon","year":"2007","unstructured":"Truchon JF, Bayly CI (2007) Evaluating virtual screening methods: good and bad metrics for the \u201cearly recognition\u201d problem. J Chem Inf Model 47(2):488\u2013508. https:\/\/doi.org\/10.1021\/ci600426e","journal-title":"J Chem Inf Model"},{"issue":"6","key":"1039_CR59","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3390\/biom10060963","volume":"10","author":"S Naulaerts","year":"2020","unstructured":"Naulaerts S, Menden MP, Ballester PJ (2020) Concise polygenic models for cancer-specific identification of drug-sensitive tumors from their multi-omics profiles. Biomolecules 10(6):1\u201321","journal-title":"Biomolecules"},{"issue":"2","key":"1039_CR60","doi-asserted-by":"publisher","first-page":"381","DOI":"10.1002\/1878-0261.12849","volume":"15","author":"J Krushkal","year":"2021","unstructured":"Krushkal J, Negi S, Yee LM, Evans JR, Grkovic T, Palmisano A et al (2021) Molecular genomic features associated with in vitro response of the NCI-60 cancer cell line panel to natural products. Mol Oncol 15(2):381\u2013406. https:\/\/doi.org\/10.1002\/1878-0261.12849","journal-title":"Mol Oncol"},{"issue":"1","key":"1039_CR61","doi-asserted-by":"publisher","first-page":"77","DOI":"10.1186\/s12885-016-2082-y","volume":"16","author":"H Singh","year":"2016","unstructured":"Singh H, Kumar R, Singh S, Chaudhary K, Gautam A, Raghava GPS (2016) Prediction of anticancer molecules using hybrid model developed on molecules screened against NCI-60 cancer cell lines. BMC Cancer 16(1):77. https:\/\/doi.org\/10.1186\/s12885-016-2082-y","journal-title":"BMC Cancer"},{"key":"1039_CR62","doi-asserted-by":"crossref","unstructured":"Martorana A, La Monica G, Bono A, Mannino S, Buscemi S, Palumbo Piccionello A, et al. (2022) Antiproliferative activity predictor: a new reliable in silico tool for drug response prediction against NCI60 panel. Int J Mol Sci. [cited 2024 Feb 28];23(22):14374. Available from: https:\/\/www.mdpi.com\/1422-0067\/23\/22\/14374","DOI":"10.3390\/ijms232214374"},{"issue":"9","key":"1039_CR63","doi-asserted-by":"publisher","first-page":"1075","DOI":"10.1039\/D0MD00110D","volume":"11","author":"T Nakano","year":"2020","unstructured":"Nakano T, Takeda S, Brown JB (2020) Active learning effectively identifies a minimal set of maximally informative and asymptotically performant cytotoxic structure\u2013activity patterns in NCI-60 cell lines. RSC Med Chem 11(9):1075\u20131087. https:\/\/doi.org\/10.1039\/D0MD00110D","journal-title":"RSC Med Chem"},{"issue":"9","key":"1039_CR64","doi-asserted-by":"publisher","first-page":"S17","DOI":"10.1186\/1471-2105-10-S9-S17","volume":"10","author":"P Shivakumar","year":"2009","unstructured":"Shivakumar P, Krauthammer M (2009) Structural similarity assessment for drug sensitivity prediction in cancer. BMC Bioinformatics 10(9):S17. https:\/\/doi.org\/10.1186\/1471-2105-10-S9-S17","journal-title":"BMC Bioinformatics"},{"issue":"1","key":"1039_CR65","doi-asserted-by":"publisher","first-page":"bpase065","DOI":"10.1093\/biomethods\/bpae065","volume":"9","author":"S Vishwakarma","year":"2024","unstructured":"Vishwakarma S, Hernandez-Hernandez S, Ballester PJ (2024) Graph neural networks are promising for phenotypic virtual screening on cancer cell lines. Biol Methods Protoc. 9(1):bpase065. https:\/\/doi.org\/10.1093\/biomethods\/bpae065","journal-title":"Biol Methods Protoc."},{"issue":"5","key":"1039_CR66","doi-asserted-by":"publisher","first-page":"1148","DOI":"10.1016\/j.drudis.2019.02.013","volume":"24","author":"T Hoffmann","year":"2019","unstructured":"Hoffmann T, Gastreich M (2019) The next level in chemical space navigation: going far beyond enumerable compound libraries. Drug Discov Today 24(5):1148\u20131156","journal-title":"Drug Discov Today"},{"key":"1039_CR67","doi-asserted-by":"publisher","DOI":"10.1080\/17460441.2024.2403639","author":"G Ghislat","year":"2024","unstructured":"Ghislat G, Hernandez-Hernandez S, Piyawajanusorn C, Ballester PJ (2024) Data-centric challenges with the application and adoption of artificial intelligence for drug discovery. Expert Opin Drug Discov. https:\/\/doi.org\/10.1080\/17460441.2024.2403639","journal-title":"Expert Opin Drug Discov"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-025-01039-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-025-01039-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-025-01039-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,10]],"date-time":"2025-06-10T13:22:22Z","timestamp":1749561742000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-025-01039-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,10]]},"references-count":67,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1039"],"URL":"https:\/\/doi.org\/10.1186\/s13321-025-01039-8","relation":{"has-preprint":[{"id-type":"doi","id":"10.26434\/chemrxiv-2024-f1v2v","asserted-by":"object"},{"id-type":"doi","id":"10.26434\/chemrxiv-2024-f1v2v-v2","asserted-by":"object"}]},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,10]]},"assertion":[{"value":"28 December 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 May 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 June 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"94"}}