{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T21:30:05Z","timestamp":1777066205856,"version":"3.51.4"},"reference-count":94,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,7,21]],"date-time":"2025-07-21T00:00:00Z","timestamp":1753056000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,7,21]],"date-time":"2025-07-21T00:00:00Z","timestamp":1753056000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100012940","name":"Universit\u00e4tsklinikum T\u00fcbingen","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100012940","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Advancements in cheminformatics have led to numerous methods for encoding molecules numerically. The choice of molecular representation impacts the accuracy and generalizability of learning algorithms applied to chemical datasets. Designing and selecting the appropriate representation often lacks a systematic approach and follows computationally exhaustive empirical testing. Moreover, research has shown that deep learning models do not substantially outperform traditional approaches across many tasks with no clear explanation for this shortfall. In this work, we present TopoLearn, a model that predicts the effectiveness of representations on datasets based on the topological characteristics of the corresponding feature space. 
Using interpretability techniques, we find that persistent homology descriptors are linked with the error metrics of trained machine learning models, offering a new method to better understand and select molecular representations.<\/jats:p>\n          <jats:p>\n            <jats:bold>Scientific contribution<\/jats:bold> Our research is the first to establish an empirical connection between the topology of feature spaces and the machine learning performance of molecular representations. In addition, we facilitate future research endeavors by providing open access to our developed model.<\/jats:p>","DOI":"10.1186\/s13321-025-01045-w","type":"journal-article","created":{"date-parts":[[2025,7,21]],"date-time":"2025-07-21T19:38:00Z","timestamp":1753126680000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["The topology of molecular representations and its influence on machine learning performance"],"prefix":"10.1186","volume":"17","author":[{"given":"Florian","family":"Rottach","sequence":"first","affiliation":[]},{"given":"Sebastian","family":"Schieferdecker","sequence":"additional","affiliation":[]},{"given":"Carsten","family":"Eickhoff","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,7,21]]},"reference":[{"issue":"9","key":"1045_CR1","doi-asserted-by":"publisher","first-page":"844","DOI":"10.1001\/jama.2020.1166","volume":"323","author":"OJ Wouters","year":"2020","unstructured":"Wouters OJ, McKee M, Luyten J (2020) Estimated research and development investment needed to bring a new medicine to market, 2009\u20132018. 
JAMA 323(9):844\u2013853","journal-title":"JAMA"},{"issue":"11","key":"1045_CR2","doi-asserted-by":"publisher","first-page":"1243","DOI":"10.1007\/s40273-021-01065-y","volume":"39","author":"M Schlander","year":"2021","unstructured":"Schlander M, Hernandez-Villafuerte K, Cheng C-Y, Mestre-Ferrandiz J, Baumann M (2021) How much does it cost to research and develop a new drug? A systematic review and assessment. Pharmacoeconomics 39(11):1243\u20131269. https:\/\/doi.org\/10.1007\/s40273-021-01065-y","journal-title":"Pharmacoeconomics"},{"issue":"7","key":"1045_CR3","doi-asserted-by":"publisher","first-page":"495","DOI":"10.1038\/d41573-019-00074-z","volume":"18","author":"H Dowden","year":"2019","unstructured":"Dowden H, Munro J (2019) Trends in clinical success rates and therapeutic focus. Nat Rev Drug Discov 18(7):495\u2013496","journal-title":"Nat Rev Drug Discov"},{"issue":"7","key":"1045_CR4","doi-asserted-by":"publisher","first-page":"3049","DOI":"10.1016\/j.apsb.2022.02.002","volume":"12","author":"D Sun","year":"2022","unstructured":"Sun D, Gao W, Hu H, Zhou S (2022) Why 90% of clinical drug development fails and how to improve it? Acta Pharm Sin B 12(7):3049\u20133062","journal-title":"Acta Pharm Sin B"},{"issue":"12","key":"1045_CR5","doi-asserted-by":"publisher","first-page":"817","DOI":"10.1038\/nrd.2016.184","volume":"15","author":"RK Harrison","year":"2016","unstructured":"Harrison RK (2016) Phase II and phase III failures: 2013\u20132015. Nat Rev Drug Discov 15(12):817\u2013818","journal-title":"Nat Rev Drug Discov"},{"issue":"7958","key":"1045_CR6","doi-asserted-by":"publisher","first-page":"673","DOI":"10.1038\/s41586-023-05905-z","volume":"616","author":"AV Sadybekov","year":"2023","unstructured":"Sadybekov AV, Katritch V (2023) Computational approaches streamlining drug discovery. Nature 616(7958):673\u2013685. 
https:\/\/doi.org\/10.1038\/s41586-023-05905-z","journal-title":"Nature"},{"issue":"6","key":"1045_CR7","doi-asserted-by":"publisher","first-page":"463","DOI":"10.1038\/s41573-019-0024-5","volume":"18","author":"J Vamathevan","year":"2019","unstructured":"Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G, Li B, Madabhushi A, Shah P, Spitzer M, Zhao S (2019) Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 18(6):463\u2013477. https:\/\/doi.org\/10.1038\/s41573-019-0024-5","journal-title":"Nat Rev Drug Discov"},{"key":"1045_CR8","doi-asserted-by":"publisher","first-page":"4538","DOI":"10.1016\/j.csbj.2021.08.011","volume":"19","author":"P Carracedo-Reboredo","year":"2021","unstructured":"Carracedo-Reboredo P, Li\u00f1ares-Blanco J, Rodr\u00edguez-Fern\u00e1ndez N, Cedr\u00f3n F, Novoa FJ, Carballal A, Maojo V, Pazos A, Fernandez-Lozano C (2021) A review on machine learning approaches and trends in drug discovery. Comput Struct Biotechnol J 19:4538\u20134558. https:\/\/doi.org\/10.1016\/j.csbj.2021.08.011","journal-title":"Comput Struct Biotechnol J"},{"key":"1045_CR9","doi-asserted-by":"publisher","first-page":"241","DOI":"10.1016\/j.csbj.2019.12.006","volume":"18","author":"C R\u00e9da","year":"2020","unstructured":"R\u00e9da C, Kaufmann E, Delahaye-Duriez A (2020) Machine learning applications in drug development. Comput Struct Biotechnol J 18:241\u2013252. https:\/\/doi.org\/10.1016\/j.csbj.2019.12.006","journal-title":"Comput Struct Biotechnol J"},{"issue":"1","key":"1045_CR10","doi-asserted-by":"publisher","first-page":"430","DOI":"10.1093\/bib\/bbab430","volume":"23","author":"J Deng","year":"2022","unstructured":"Deng J, Yang Z, Ojima I, Samaras D, Wang F (2022) Artificial intelligence in drug discovery: applications and techniques. 
Brief Bioinform 23(1):430","journal-title":"Brief Bioinform"},{"issue":"22","key":"1045_CR11","doi-asserted-by":"publisher","first-page":"5317","DOI":"10.1021\/acs.jcim.2c01422","volume":"62","author":"TA Soares","year":"2022","unstructured":"Soares TA, Nunes-Alves A, Mazzolari A, Ruggiu F, Wei G-W, Merz K (2022) The (re)-evolution of quantitative structure-activity relationship (qsar) studies propelled by the surge of machine learning methods. J Chem Inf Model 62(22):5317\u20135320. https:\/\/doi.org\/10.1021\/acs.jcim.2c01422","journal-title":"J Chem Inf Model"},{"issue":"8","key":"1045_CR12","doi-asserted-by":"publisher","first-page":"960","DOI":"10.1038\/s41589-024-01679-1","volume":"20","author":"DB Catacutan","year":"2024","unstructured":"Catacutan DB, Alexander J, Arnold A, Stokes JM (2024) Machine learning in preclinical drug discovery. Nat Chem Biol 20(8):960\u2013973. https:\/\/doi.org\/10.1038\/s41589-024-01679-1","journal-title":"Nat Chem Biol"},{"key":"1045_CR13","doi-asserted-by":"publisher","first-page":"3525","DOI":"10.1039\/D0CS00098A","volume":"49","author":"EN Muratov","year":"2020","unstructured":"Muratov EN, Bajorath J, Sheridan RP, Tetko IV, Filimonov D, Poroikov V, Oprea TI, Baskin II, Varnek A, Roitberg A, Isayev O, Curtalolo S, Fourches D, Cohen Y, Aspuru-Guzik A, Winkler DA, Agrafiotis D, Cherkasov AA (2020) Qsar without borders. Chem Soc Rev. 49:3525\u20133564. https:\/\/doi.org\/10.1039\/D0CS00098A","journal-title":"Chem Soc Rev."},{"key":"1045_CR14","doi-asserted-by":"publisher","first-page":"101","DOI":"10.1007\/978-1-60761-839-3_3","volume":"672","author":"R Guha","year":"2011","unstructured":"Guha R (2011) The ups and downs of structure-activity landscapes. 
Methods Mol Biol 672:101\u2013117","journal-title":"Methods Mol Biol"},{"issue":"11","key":"1045_CR15","doi-asserted-by":"publisher","first-page":"14360","DOI":"10.1021\/acsomega.9b02221","volume":"4","author":"D Stumpfe","year":"2019","unstructured":"Stumpfe D, Hu H, Bajorath J (2019) Evolving concept of activity cliffs. ACS Omega 4(11):14360\u201314368","journal-title":"ACS Omega"},{"issue":"1","key":"1045_CR16","doi-asserted-by":"publisher","first-page":"6395","DOI":"10.1038\/s41467-023-41948-6","volume":"14","author":"J Deng","year":"2023","unstructured":"Deng J, Yang Z, Wang H, Ojima I, Samaras D, Wang F (2023) A systematic study of key elements underlying molecular property prediction. Nat Commun 14(1):6395","journal-title":"Nat Commun"},{"issue":"1","key":"1045_CR17","doi-asserted-by":"publisher","first-page":"47","DOI":"10.1186\/s13321-023-00708-w","volume":"15","author":"M Dablander","year":"2023","unstructured":"Dablander M, Hanser T, Lambiotte R, Morris GM (2023) Exploring qsar models for activity-cliff prediction. J Cheminform 15(1):47. https:\/\/doi.org\/10.1186\/s13321-023-00708-w","journal-title":"J Cheminform"},{"issue":"23","key":"1045_CR18","doi-asserted-by":"publisher","first-page":"5938","DOI":"10.1021\/acs.jcim.2c01073","volume":"62","author":"D Tilborg","year":"2022","unstructured":"Tilborg D, Alenicheva A, Grisoni F (2022) Exposing the limitations of molecular machine learning with activity cliffs. J Chem Inf Model 62(23):5938\u20135951. https:\/\/doi.org\/10.1021\/acs.jcim.2c01073","journal-title":"J Chem Inf Model"},{"issue":"4","key":"1045_CR19","doi-asserted-by":"publisher","first-page":"1969","DOI":"10.1021\/acs.jcim.9b01067","volume":"60","author":"RP Sheridan","year":"2020","unstructured":"Sheridan RP, Karnachi P, Tudor M, Xu Y, Liaw A, Shah F, Cheng AC, Joshi E, Glick M, Alvarez J (2020) Experimental error, kurtosis, activity cliffs, and methodology: What limits the predictivity of quantitative structure-activity relationship models? 
J Chem Inf Model 60(4):1969\u20131982. https:\/\/doi.org\/10.1021\/acs.jcim.9b01067","journal-title":"J Chem Inf Model"},{"issue":"8","key":"1045_CR20","doi-asserted-by":"publisher","first-page":"3370","DOI":"10.1021\/acs.jcim.9b00237","volume":"59","author":"K Yang","year":"2019","unstructured":"Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M, Palmer A, Settels V, Jaakkola T, Jensen K, Barzilay R (2019) Analyzing learned molecular representations for property prediction. J Chem Inf Model 59(8):3370\u20133388. https:\/\/doi.org\/10.1021\/acs.jcim.9b00237","journal-title":"J Chem Inf Model"},{"issue":"1","key":"1045_CR21","doi-asserted-by":"publisher","first-page":"56","DOI":"10.1186\/s13321-020-00460-5","volume":"12","author":"L David","year":"2020","unstructured":"David L, Thakkar A, Mercado R, Engkvist O (2020) Molecular representations in ai-driven drug discovery: a review and practical guide. J Cheminform 12(1):56. https:\/\/doi.org\/10.1186\/s13321-020-00460-5","journal-title":"J Cheminform"},{"issue":"5","key":"1045_CR22","doi-asserted-by":"publisher","first-page":"1603","DOI":"10.1002\/wcms.1603","volume":"12","author":"DS Wigh","year":"2022","unstructured":"Wigh DS, Goodman JM, Lapkin AA (2022) A review of molecular representation in the age of machine learning. Wiley Interdiscipl Rev Comput Mol Sci 12(5):1603","journal-title":"Wiley Interdiscipl Rev Comput Mol Sci"},{"issue":"16","key":"1045_CR23","doi-asserted-by":"publisher","first-page":"8705","DOI":"10.1021\/acs.jmedchem.0c00385","volume":"63","author":"KV Chuang","year":"2020","unstructured":"Chuang KV, Gunsalus LM, Keiser MJ (2020) Learning molecular representations for medicinal chemistry. J Med Chem 63(16):8705\u20138722. 
https:\/\/doi.org\/10.1021\/acs.jmedchem.0c00385","journal-title":"J Med Chem"},{"issue":"5","key":"1045_CR24","doi-asserted-by":"publisher","first-page":"742","DOI":"10.1021\/ci100050t","volume":"50","author":"D Rogers","year":"2010","unstructured":"Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742\u2013754. https:\/\/doi.org\/10.1021\/ci100050t","journal-title":"J Chem Inf Model"},{"issue":"6","key":"1045_CR25","doi-asserted-by":"publisher","first-page":"1273","DOI":"10.1021\/ci010132r","volume":"42","author":"JL Durant","year":"2002","unstructured":"Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of mdl keys for use in drug discovery. J Chem Inf Comput Sci 42(6):1273\u20131280. https:\/\/doi.org\/10.1021\/ci010132r","journal-title":"J Chem Inf Comput Sci"},{"issue":"4","key":"1045_CR26","doi-asserted-by":"publisher","first-page":"1468","DOI":"10.1002\/wcms.1468","volume":"10","author":"D Schaller","year":"2020","unstructured":"Schaller D, \u0160ribar D, Noonan T, Deng L, Nguyen TN, Pach S, Machalz D, Bermudez M, Wolber G (2020) Next generation 3d pharmacophore modeling. Wiley Interdiscipl Rev Comput Mol Sci 10(4):1468","journal-title":"Wiley Interdiscipl Rev Comput Mol Sci"},{"issue":"6","key":"1045_CR27","doi-asserted-by":"publisher","first-page":"2545","DOI":"10.1021\/acs.jcim.9b00266","volume":"59","author":"AC Mater","year":"2019","unstructured":"Mater AC, Coote ML (2019) Deep learning in chemistry. J Chem Inf Model 59(6):2545\u20132559. https:\/\/doi.org\/10.1021\/acs.jcim.9b00266","journal-title":"J Chem Inf Model"},{"issue":"12","key":"1045_CR28","doi-asserted-by":"publisher","first-page":"1023","DOI":"10.1038\/s42256-021-00418-8","volume":"3","author":"K Atz","year":"2021","unstructured":"Atz K, Grisoni F, Schneider G (2021) Geometric deep learning on molecular representations. Nat Mach Intell 3(12):1023\u20131032. 
https:\/\/doi.org\/10.1038\/s42256-021-00418-8","journal-title":"Nat Mach Intell"},{"key":"1045_CR29","unstructured":"Kipf TN, Welling M (2017) Semi-Supervised Classification with Graph Convolutional Networks. https:\/\/arxiv.org\/abs\/1609.02907"},{"issue":"1","key":"1045_CR30","doi-asserted-by":"publisher","first-page":"93","DOI":"10.1038\/s43246-022-00315-6","volume":"3","author":"P Reiser","year":"2022","unstructured":"Reiser P, Neubert M, Eberhard A, Torresi L, Zhou C, Shao C, Metni H, Hoesel C, Schopmans H, Sommer T, Friederich P (2022) Graph neural networks for materials science and chemistry. Commun Mater 3(1):93. https:\/\/doi.org\/10.1038\/s43246-022-00315-6","journal-title":"Commun Mater"},{"key":"1045_CR31","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2023) Attention Is All You Need. https:\/\/arxiv.org\/abs\/1706.03762"},{"key":"1045_CR32","doi-asserted-by":"publisher","DOI":"10.1016\/j.sbi.2023.102527","volume":"79","author":"F Grisoni","year":"2023","unstructured":"Grisoni F (2023) Chemical language models for de novo drug design: challenges and opportunities. Curr Opin Struct Biol 79:102527","journal-title":"Curr Opin Struct Biol"},{"issue":"1","key":"1045_CR33","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/ci00057a005","volume":"28","author":"D Weininger","year":"1988","unstructured":"Weininger D (1988) Smiles, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31\u201336. https:\/\/doi.org\/10.1021\/ci00057a005","journal-title":"J Chem Inf Comput Sci"},{"issue":"4","key":"1045_CR34","doi-asserted-by":"publisher","DOI":"10.1088\/2632-2153\/aba947","volume":"1","author":"M Krenn","year":"2020","unstructured":"Krenn M, H\u00e4se F, Nigam A, Friederich P, Aspuru-Guzik A (2020) Self-referencing embedded strings (selfies): a 100% robust molecular string representation. Mach Learn Sci Technol 1(4):045024. 
https:\/\/doi.org\/10.1088\/2632-2153\/aba947","journal-title":"Mach Learn Sci Technol"},{"issue":"1","key":"1045_CR35","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1186\/s13321-024-00830-3","volume":"16","author":"D Boldini","year":"2024","unstructured":"Boldini D, Ballabio D, Consonni V, Todeschini R, Grisoni F, Sieber SA (2024) Effectiveness of molecular fingerprints for exploring the chemical space of natural products. J Cheminform 16(1):35. https:\/\/doi.org\/10.1186\/s13321-024-00830-3","journal-title":"J Cheminform"},{"key":"1045_CR36","doi-asserted-by":"crossref","unstructured":"Graff DE, Pyzer-Knapp EO, Jordan KE, Shakhnovich EI, Coley CW (2023) Evaluating the roughness of structure-property relationships using pretrained molecular representations. https:\/\/arxiv.org\/abs\/2305.08238","DOI":"10.1039\/D3DD00088E"},{"key":"1045_CR37","doi-asserted-by":"publisher","first-page":"674","DOI":"10.1039\/D2DD00099G","volume":"2","author":"J Born","year":"2023","unstructured":"Born J, Markert G, Janakarajan N, Kimber TB, Volkamer A, Mart\u00ednez MR, Manica M (2023) Chemical representation learning for toxicity prediction. Digit Discov 2:674\u2013691. https:\/\/doi.org\/10.1039\/D2DD00099G","journal-title":"Digit Discov"},{"issue":"19","key":"1045_CR38","doi-asserted-by":"publisher","first-page":"4660","DOI":"10.1021\/acs.jcim.2c00903","volume":"62","author":"M Aldeghi","year":"2022","unstructured":"Aldeghi M, Graff DE, Frey N, Morrone JA, Pyzer-Knapp EO, Jordan KE, Coley CW (2022) Roughness of molecular property landscapes and its impact on modellability. J Chem Inf Model 62(19):4660\u20134671. 
https:\/\/doi.org\/10.1021\/acs.jcim.2c00903","journal-title":"J Chem Inf Model"},{"issue":"3","key":"1045_CR39","doi-asserted-by":"publisher","first-page":"20220006","DOI":"10.1515\/jib-2022-0006","volume":"19","author":"D Baptista","year":"2022","unstructured":"Baptista D, Correia J, Pereira B, Rocha M (2022) Evaluating molecular representations in machine learning models for drug response prediction and interpretability. J Integr Bioinform 19(3):20220006. https:\/\/doi.org\/10.1515\/jib-2022-0006","journal-title":"J Integr Bioinform"},{"issue":"1","key":"1045_CR40","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1186\/s13321-020-00479-8","volume":"13","author":"D Jiang","year":"2021","unstructured":"Jiang D, Wu Z, Hsieh C-Y, Chen G, Liao B, Wang Z, Shen C, Cao D, Wu J, Hou T (2021) Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J Cheminform 13(1):12. https:\/\/doi.org\/10.1186\/s13321-020-00479-8","journal-title":"J Cheminform"},{"issue":"6","key":"1045_CR41","doi-asserted-by":"publisher","first-page":"291","DOI":"10.1093\/bib\/bbab291","volume":"22","author":"B Zagidullin","year":"2021","unstructured":"Zagidullin B, Wang Z, Guan Y, Pitk\u00e4nen E, Tang J (2021) Comparative analysis of molecular fingerprints in prediction of drug combination effects. Brief Bioinform 22(6):291. https:\/\/doi.org\/10.1093\/bib\/bbab291","journal-title":"Brief Bioinform"},{"issue":"16","key":"1045_CR42","doi-asserted-by":"publisher","first-page":"8373","DOI":"10.1039\/D0CP00305K","volume":"22","author":"K Gao","year":"2020","unstructured":"Gao K, Nguyen DD, Sresht V, Mathiowetz AM, Tu M, Wei G-W (2020) Are 2D fingerprints still valuable for drug discovery? 
Phys Chem Chem Phys 22(16):8373\u20138390","journal-title":"Phys Chem Chem Phys"},{"key":"1045_CR43","doi-asserted-by":"crossref","unstructured":"Xia J, Zhang L, Zhu X, Li SZ (2023) Why Deep Models Often cannot Beat Non-deep Counterparts on Molecular Property Prediction? https:\/\/arxiv.org\/abs\/2306.17702","DOI":"10.26434\/chemrxiv-2023-xl49v-v2"},{"key":"1045_CR44","unstructured":"Sun R, Dai H, Yu AW (2022) Does GNN Pretraining Help Molecular Representation? https:\/\/arxiv.org\/abs\/2207.06010"},{"issue":"3","key":"1045_CR45","doi-asserted-by":"publisher","first-page":"646","DOI":"10.1021\/ci7004093","volume":"48","author":"R Guha","year":"2008","unstructured":"Guha R, Van Drie JH (2008) Structure-activity landscape index: identifying and quantifying activity cliffs. J Chem Inf Model 48(3):646\u2013658. https:\/\/doi.org\/10.1021\/ci7004093","journal-title":"J Chem Inf Model"},{"issue":"1","key":"1045_CR46","doi-asserted-by":"publisher","first-page":"20","DOI":"10.1186\/s13321-015-0069-3","volume":"7","author":"D Bajusz","year":"2015","unstructured":"Bajusz D, R\u00e1cz A, H\u00e9berger K (2015) Why is tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform 7(1):20. https:\/\/doi.org\/10.1186\/s13321-015-0069-3","journal-title":"J Cheminform"},{"issue":"23","key":"1045_CR47","doi-asserted-by":"publisher","first-page":"5571","DOI":"10.1021\/jm0705713","volume":"50","author":"L Peltason","year":"2007","unstructured":"Peltason L, Bajorath J (2007) Sar index: quantifying the nature of structure-activity relationships. J Med Chem 50(23):5571\u20135578. 
https:\/\/doi.org\/10.1021\/jm0705713","journal-title":"J Med Chem"},{"issue":"10","key":"1045_CR48","doi-asserted-by":"publisher","first-page":"2069","DOI":"10.1021\/acs.jcim.8b00313","volume":"58","author":"I Luque Ruiz","year":"2018","unstructured":"Luque Ruiz I, G\u00f3mez-Nieto M\u00c1 (2018) Regression modelability index: a new index for prediction of the modelability of data sets in the development of qsar regression models. J Chem Inf Model 58(10):2069\u20132084. https:\/\/doi.org\/10.1021\/acs.jcim.8b00313","journal-title":"J Chem Inf Model"},{"key":"1045_CR49","unstructured":"Goodfellow I (2016) Deep learning. MIT press"},{"key":"1045_CR50","doi-asserted-by":"crossref","unstructured":"Chazal F, Michel B (2021) An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists. https:\/\/arxiv.org\/abs\/1710.04019","DOI":"10.3389\/frai.2021.667963"},{"issue":"1","key":"1045_CR51","first-page":"2202331","volume":"8","author":"D Leykam","year":"2023","unstructured":"Leykam D, Angelakis DG (2023) Topological data analysis and machine learning. Adv Phys X 8(1):2202331","journal-title":"Adv Phys X"},{"key":"1045_CR52","unstructured":"Hatcher A (2005) Algebraic Topology"},{"issue":"2","key":"1045_CR53","doi-asserted-by":"publisher","first-page":"255","DOI":"10.1090\/S0273-0979-09-01249-X","volume":"46","author":"G Carlsson","year":"2009","unstructured":"Carlsson G (2009) Topology and data. Bull Am Math Soc 46(2):255\u2013308","journal-title":"Bull Am Math Soc"},{"issue":"3","key":"1045_CR54","doi-asserted-by":"publisher","first-page":"263","DOI":"10.1016\/j.cag.2010.03.007","volume":"34","author":"A Zomorodian","year":"2010","unstructured":"Zomorodian A (2010) Fast construction of the vietoris-rips complex. Comput Graphics 34(3):263\u2013271","journal-title":"Comput Graphics"},{"key":"1045_CR55","unstructured":"Jiang Y, Neyshabur B, Mobahi H, Krishnan D, Bengio S (2019) Fantastic generalization measures and where to find them. 
https:\/\/arxiv.org\/abs\/1912.02178"},{"issue":"5","key":"1045_CR56","doi-asserted-by":"publisher","first-page":"851","DOI":"10.1162\/neco.1994.6.5.851","volume":"6","author":"V Vapnik","year":"1994","unstructured":"Vapnik V, Levin E, Le Cun Y (1994) Measuring the vc-dimension of a learning machine. Neural Comput 6(5):851\u2013876","journal-title":"Neural Comput"},{"key":"1045_CR57","unstructured":"Keskar NS, Mudigere D, Nocedal J, Smelyanskiy M, Tang PTP (2016) On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836"},{"key":"1045_CR58","doi-asserted-by":"publisher","unstructured":"Rieck, Bastian Alexander, Togninalli, Matteo, Bock, Christian, Moor, Michael, Horn, Max, Gumbsch, Thomas, Borgwardt, Karsten: Neural persistence: A complexity measure for deep neural networks using algebraic topology (2023) https:\/\/doi.org\/10.3929\/ETHZ-B-000327207","DOI":"10.3929\/ETHZ-B-000327207"},{"key":"1045_CR59","unstructured":"Magai G, Ayzenberg A (2022) Topology and geometry of data manifold in deep learning. arXiv preprint arXiv:2204.08624"},{"key":"1045_CR60","doi-asserted-by":"crossref","unstructured":"MacPherson R, Schweinhart B (2012) Measuring shape with topology. J Math Phys 53(7)","DOI":"10.1063\/1.4737391"},{"key":"1045_CR61","first-page":"6776","volume":"34","author":"T Birdal","year":"2021","unstructured":"Birdal T, Lou A, Guibas LJ, Simsekli U (2021) Intrinsic dimension, persistent homology and generalization in neural networks. Adv Neural Inf Process Syst 34:6776\u20136789","journal-title":"Adv Neural Inf Process Syst"},{"key":"1045_CR62","doi-asserted-by":"publisher","DOI":"10.1016\/j.cnsns.2019.105163","volume":"84","author":"J Jaquette","year":"2020","unstructured":"Jaquette J, Schweinhart B (2020) Fractal dimension estimation with persistent homology: a comparative study. 
Commun Nonlinear Sci Numer Simul 84:105163","journal-title":"Commun Nonlinear Sci Numer Simul"},{"key":"1045_CR63","unstructured":"Ansuini A, Laio A, Macke JH, Zoccolan D (2019) Intrinsic dimension of data representations in deep neural networks. https:\/\/arxiv.org\/abs\/1905.12784"},{"issue":"1","key":"1045_CR64","doi-asserted-by":"publisher","first-page":"12140","DOI":"10.1038\/s41598-017-11873-y","volume":"7","author":"E Facco","year":"2017","unstructured":"Facco E, d\u2019Errico M, Rodriguez A, Laio A (2017) Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Sci Rep 7(1):12140. https:\/\/doi.org\/10.1038\/s41598-017-11873-y","journal-title":"Sci Rep"},{"issue":"2","key":"1045_CR65","doi-asserted-by":"publisher","first-page":"176","DOI":"10.1109\/T-C.1971.223208","volume":"20","author":"K Fukunaga","year":"1971","unstructured":"Fukunaga K, Olsen DR (1971) An algorithm for finding intrinsic dimensionality of data. IEEE Trans Comput 20(2):176\u2013183. https:\/\/doi.org\/10.1109\/T-C.1971.223208","journal-title":"IEEE Trans Comput"},{"issue":"11","key":"1045_CR66","doi-asserted-by":"publisher","first-page":"3263","DOI":"10.1021\/acs.jcim.3c00160","volume":"63","author":"C Fang","year":"2023","unstructured":"Fang C, Wang Y, Grater R, Kapadnis S, Black C, Trapa P, Sciabola S (2023) Prospective validation of machine learning algorithms for absorption, distribution, metabolism, and excretion prediction: An industrial perspective. J Chem Inf Model 63(11):3263\u20133274. https:\/\/doi.org\/10.1021\/acs.jcim.3c00160","journal-title":"J Chem Inf Model"},{"key":"1045_CR67","doi-asserted-by":"publisher","first-page":"198","DOI":"10.1093\/nar\/gkl999","volume":"35","author":"T Liu","year":"2006","unstructured":"Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK (2006) BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. 
Nucleic Acids Res 35:198\u2013201","journal-title":"Nucleic Acids Res"},{"key":"1045_CR68","unstructured":"Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, Coley CW, Xiao C, Sun J, Zitnik M (2021) Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development. https:\/\/arxiv.org\/abs\/2102.09548"},{"key":"1045_CR69","unstructured":"Landrum G (2013) Rdkit documentation. Release 1(1\u201379):4"},{"key":"1045_CR70","unstructured":"Chithrananda S, Grand G, Ramsundar B (2020) ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. https:\/\/arxiv.org\/abs\/2010.09885"},{"issue":"11","key":"1045_CR71","doi-asserted-by":"publisher","first-page":"1297","DOI":"10.1038\/s42256-023-00740-3","volume":"5","author":"NC Frey","year":"2023","unstructured":"Frey NC, Soklaski R, Axelrod S, Samsi S, G\u00f3mez-Bombarelli R, Coley CW, Gadepally V (2023) Neural scaling of deep chemical models. Nat Mach Intell 5(11):1297\u20131305. https:\/\/doi.org\/10.1038\/s42256-023-00740-3","journal-title":"Nat Mach Intell"},{"key":"1045_CR72","unstructured":"Noutahi E, Gabellini C, Craig M, Lim JSC, Tossou P (2023) Gotta be SAFE: a New Framework for Molecular Design. https:\/\/arxiv.org\/abs\/2310.10773"},{"issue":"12","key":"1045_CR73","doi-asserted-by":"publisher","first-page":"1256","DOI":"10.1038\/s42256-022-00580-7","volume":"4","author":"J Ross","year":"2022","unstructured":"Ross J, Belgodere B, Chenthamarakshan V, Padhi I, Mroueh Y, Das P (2022) Large-scale chemical language representations capture molecular structure and properties. Nat Mach Intell 4(12):1256\u20131264. https:\/\/doi.org\/10.1038\/s42256-022-00580-7","journal-title":"Nat Mach Intell"},{"key":"1045_CR74","unstructured":"Ying C, Cai T, Luo S, Zheng S, Ke G, He D, Shen Y, Liu T-Y (2021) Do transformers really perform bad for graph representation? 
https:\/\/arxiv.org\/abs\/2106.05234"},{"key":"1045_CR75","unstructured":"Hu W, Liu B, Gomes J, Zitnik M, Liang P, Pande V, Leskovec J (2020) Strategies for Pre-training Graph Neural Networks. https:\/\/arxiv.org\/abs\/1905.12265"},{"issue":"1","key":"1045_CR76","doi-asserted-by":"publisher","first-page":"27","DOI":"10.1021\/acs.jcim.7b00616","volume":"58","author":"S Jaeger","year":"2018","unstructured":"Jaeger S, Fulle S, Turk S (2018) Mol2vec: unsupervised machine learning approach with chemical intuition. J Chem Inf Model 58(1):27\u201335. https:\/\/doi.org\/10.1021\/acs.jcim.7b00616","journal-title":"J Chem Inf Model"},{"key":"1045_CR77","doi-asserted-by":"publisher","unstructured":"Noutahi E, Wognum C, Mary H, Hounwanou H, Kovary KM, Gilmour D, thibaultvarin-r, Burns J, St-Laurent J, DomInvivo Maheshkar S, rbyrne-momatx: Datamol-io\/molfeat: 0.9.4. https:\/\/doi.org\/10.5281\/zenodo.7775253","DOI":"10.5281\/zenodo.7775253"},{"key":"1045_CR78","doi-asserted-by":"crossref","unstructured":"Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, et al. (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38\u201345","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"1045_CR79","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-020-00445-4","volume":"12","author":"A Capecchi","year":"2020","unstructured":"Capecchi A, Probst D, Reymond J-L (2020) One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J Cheminform 12:1\u201315","journal-title":"J Cheminform"},{"issue":"5","key":"1045_CR80","doi-asserted-by":"publisher","first-page":"1924","DOI":"10.1021\/ci050413p","volume":"46","author":"P Gedeck","year":"2006","unstructured":"Gedeck P, Rohde B, Bartels C (2006) Qsar\u2014how good is it in practice? 
J Chem Inf Model 46(5):1924\u20131936. https:\/\/doi.org\/10.1021\/ci050413p","journal-title":"J Chem Inf Model"},{"issue":"21","key":"1045_CR81","doi-asserted-by":"publisher","first-page":"2518","DOI":"10.1093\/bioinformatics\/btn479","volume":"24","author":"J Klekota","year":"2008","unstructured":"Klekota J, Roth FP (2008) Chemical substructures that enrich for biological activity. Bioinformatics 24(21):2518\u20132525","journal-title":"Bioinformatics"},{"issue":"D1","key":"1045_CR82","doi-asserted-by":"publisher","first-page":"1202","DOI":"10.1093\/nar\/gkv951","volume":"44","author":"S Kim","year":"2016","unstructured":"Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA et al (2016) Pubchem substance and compound databases. Nucleic Acids Res 44(D1):1202\u20131213","journal-title":"Nucleic Acids Res"},{"issue":"2","key":"1045_CR83","doi-asserted-by":"publisher","first-page":"133","DOI":"10.1002\/minf.201200141","volume":"32","author":"M Reutlinger","year":"2013","unstructured":"Reutlinger M, Koch CP, Reker D, Todoroff N, Schneider P, Rodrigues T, Schneider G (2013) Chemically advanced template search (cats) for scaffold-hopping and prospective target prediction for \u2018orphan\u2019 molecules. Mol Inf 32(2):133","journal-title":"Mol Inf"},{"issue":"1","key":"1045_CR84","doi-asserted-by":"publisher","first-page":"47","DOI":"10.1002\/(SICI)1097-0290(199824)61:1<47::AID-BIT9>3.0.CO;2-Z","volume":"61","author":"A Gobbi","year":"1998","unstructured":"Gobbi A, Poppinger D (1998) Genetic optimization of combinatorial libraries. Biotechnol Bioeng 61(1):47\u201354","journal-title":"Biotechnol Bioeng"},{"issue":"7","key":"1045_CR85","doi-asserted-by":"publisher","first-page":"1466","DOI":"10.1002\/jcc.21707","volume":"32","author":"CW Yap","year":"2011","unstructured":"Yap CW (2011) Padel-descriptor: an open source software to calculate molecular descriptors and fingerprints. 
J Comput Chem 32(7):1466\u20131474","journal-title":"J Comput Chem"},{"key":"1045_CR86","doi-asserted-by":"publisher","unstructured":"Mauri A (2020) In: Roy, K. (ed.) alvaDesc: a tool to calculate and analyze molecular descriptors and fingerprints, pp. 801\u2013820. Springer, New York, NY. https:\/\/doi.org\/10.1007\/978-1-0716-0150-1_32","DOI":"10.1007\/978-1-0716-0150-1_32"},{"key":"1045_CR87","first-page":"507","volume":"35","author":"L Grinsztajn","year":"2022","unstructured":"Grinsztajn L, Oyallon E, Varoquaux G (2022) Why do tree-based models still outperform deep learning on typical tabular data? Adv Neural Inf Process Syst 35:507\u2013520","journal-title":"Adv Neural Inf Process Syst"},{"key":"1045_CR88","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825\u20132830","journal-title":"J Mach Learn Res"},{"issue":"15","key":"1045_CR89","doi-asserted-by":"publisher","first-page":"2887","DOI":"10.1021\/jm9602928","volume":"39","author":"GW Bemis","year":"1996","unstructured":"Bemis GW, Murcko MA (1996) The properties of known drugs. 1. Molecular frameworks. J Med Chem 39(15):2887\u20132893. https:\/\/doi.org\/10.1021\/jm9602928","journal-title":"J Med Chem"},{"issue":"29","key":"1045_CR90","doi-asserted-by":"publisher","first-page":"925","DOI":"10.21105\/joss.00925","volume":"3","author":"C Tralie","year":"2018","unstructured":"Tralie C, Saul N, Bar-On R (2018) Ripser.py: a lean persistent homology library for python. 
J Open Source Softw 3(29):925","journal-title":"J Open Source Softw"},{"issue":"39","key":"1045_CR91","first-page":"1","volume":"22","author":"G Tauzin","year":"2021","unstructured":"Tauzin G, Lupo U, Tunstall L, P\u00e9rez JB, Caorsi M, Medina-Mardones AM, Dassatti A, Hess K (2021) giotto-tda: a topological data analysis toolkit for machine learning and data exploration. J Mach Learn Res 22(39):1\u20136","journal-title":"J Mach Learn Res"},{"issue":"2","key":"1045_CR92","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.100.022314","volume":"100","author":"A Myers","year":"2019","unstructured":"Myers A, Munch E, Khasawneh FA (2019) Persistent homology of complex networks for dynamic state detection. Phys Rev E 100(2):022314","journal-title":"Phys Rev E"},{"issue":"10","key":"1045_CR93","doi-asserted-by":"publisher","first-page":"1368","DOI":"10.3390\/e23101368","volume":"23","author":"J Bac","year":"2021","unstructured":"Bac J, Mirkes EM, Gorban AN, Tyukin I, Zinovyev A (2021) Scikit-dimension: a python package for intrinsic dimension estimation. Entropy 23(10):1368","journal-title":"Entropy"},{"key":"1045_CR94","unstructured":"Lundberg SM, Lee S-I (2017) A unified approach to interpreting model predictions. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 4765\u20134774. Curran Associates, Inc., Red Hook, NY, USA. 
http:\/\/papers.nips.cc\/paper\/7062-a-unified-approach-to-interpreting-model-predictions.pdf"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-025-01045-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-025-01045-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-025-01045-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,7]],"date-time":"2025-09-07T17:03:11Z","timestamp":1757264591000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-025-01045-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,21]]},"references-count":94,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1045"],"URL":"https:\/\/doi.org\/10.1186\/s13321-025-01045-w","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,21]]},"assertion":[{"value":"5 March 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"15 June 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 July 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing 
interests"}}],"article-number":"109"}}