{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T04:30:06Z","timestamp":1772166606344,"version":"3.50.1"},"reference-count":42,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,9,19]],"date-time":"2023-09-19T00:00:00Z","timestamp":1695081600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,9,19]],"date-time":"2023-09-19T00:00:00Z","timestamp":1695081600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","award":["EP\/L016044\/1"],"award-info":[{"award-number":["EP\/L016044\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000268","name":"Biotechnology and Biological Sciences Research Council","doi-asserted-by":"publisher","award":["BB\/S507611\/1"],"award-info":[{"award-number":["BB\/S507611\/1"]}],"id":[{"id":"10.13039\/501100000268","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Many recently proposed structure-based virtual screening models appear to be able to accurately distinguish high affinity binders from non-binders. However, several recent studies have shown that they often do so by exploiting ligand-specific biases in the dataset, rather than identifying favourable intermolecular interactions in the input protein-ligand complex. In this work we propose a novel approach for assessing the extent to which machine learning-based virtual screening models are able to identify the functional groups responsible for binding. To sidestep the difficulty in establishing the ground truth importance of each atom of a large scale set of protein-ligand complexes, we propose a protocol for generating synthetic data. Each ligand in the dataset is surrounded by a randomly sampled point cloud of pharmacophores, and the label assigned to the synthetic protein-ligand complex is determined by a 3-dimensional deterministic binding rule. This allows us to precisely quantify the ground truth importance of each atom and compare it to the model generated attributions. Using our generated datasets, we demonstrate that a recently proposed deep learning-based virtual screening model, PointVS, identified the most important functional groups with 39% more efficiency than a fingerprint-based random forest, suggesting that it would generalise more effectively to new examples. In addition, we found that ligand-specific biases, such as those present in widely used virtual screening datasets, substantially impaired the ability of all ML models to identify the most important functional groups. We have made our synthetic data generation framework available to facilitate the benchmarking of new virtual screening models. Code is available at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/tomhadfield95\/synthVS\">https:\/\/github.com\/tomhadfield95\/synthVS<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1186\/s13321-023-00755-3","type":"journal-article","created":{"date-parts":[[2023,9,19]],"date-time":"2023-09-19T08:02:01Z","timestamp":1695110521000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Exploring the ability of machine learning-based virtual screening models to identify the functional groups responsible for binding"],"prefix":"10.1186","volume":"15","author":[{"given":"Thomas E.","family":"Hadfield","sequence":"first","affiliation":[]},{"given":"Jack","family":"Scantlebury","sequence":"additional","affiliation":[]},{"given":"Charlotte M.","family":"Deane","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,9,19]]},"reference":[{"issue":"9","key":"755_CR1","doi-asserted-by":"publisher","first-page":"844","DOI":"10.1001\/jama.2020.1166","volume":"323","author":"OJ Wouters","year":"2020","unstructured":"Wouters OJ, McKee M, Luyten J (2020) Estimated Research And Development Investment Needed To Bring A New Medicine To Market, 2009\u20132018. J Am Med Assoc 323(9):844\u2013853","journal-title":"J Am Med Assoc"},{"issue":"7676","key":"755_CR2","doi-asserted-by":"publisher","first-page":"354","DOI":"10.1038\/nature24270","volume":"550","author":"D Silver","year":"2017","unstructured":"Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A et al (2017) Mastering the game of go without human knowledge. Nature 550(7676):354\u2013359","journal-title":"Nature"},{"key":"755_CR3","first-page":"1877","volume":"33","author":"T Brown","year":"2020","unstructured":"Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A et al (2020) Language models are few-shot learners. Proc Adv Neural Inf Process Syst 33:1877\u20131901","journal-title":"Proc Adv Neural Inf Process Syst"},{"issue":"7873","key":"755_CR4","doi-asserted-by":"publisher","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","volume":"596","author":"J Jumper","year":"2021","unstructured":"Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, \u017d\u00eddek A, Potapenko A et al (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596(7873):583\u2013589","journal-title":"Nature"},{"issue":"6557","key":"755_CR5","doi-asserted-by":"publisher","first-page":"871","DOI":"10.1126\/science.abj8754","volume":"373","author":"M Baek","year":"2021","unstructured":"Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Schaeffer RD et al (2021) Accurate prediction of protein structures and interactions using a three-track neural network. Science 373(6557):871\u2013876","journal-title":"Science"},{"issue":"12","key":"755_CR6","doi-asserted-by":"publisher","first-page":"5634","DOI":"10.1038\/s41596-021-00628-9","volume":"16","author":"Z Du","year":"2021","unstructured":"Du Z, Su H, Wang W, Ye L, Wei H, Peng Z, Anishchenko I, Baker D, Yang J (2021) The trrosetta server for fast and accurate protein structure prediction. Nat Prot 16(12):5634\u20135651","journal-title":"Nat Prot"},{"issue":"10","key":"755_CR7","doi-asserted-by":"publisher","first-page":"4282","DOI":"10.1021\/acs.molpharmaceut.9b00634","volume":"16","author":"M Skalic","year":"2019","unstructured":"Skalic M, Sabbadin D, Sattarov B, Sciabola S, De Fabritiis G (2019) From target to drug: generative modeling for the multimodal structure-based ligand design. Mol Pharm 16(10):4282\u20134291","journal-title":"Mol Pharm"},{"issue":"9","key":"755_CR8","doi-asserted-by":"publisher","first-page":"2701","DOI":"10.1039\/D1SC05976A","volume":"13","author":"M Ragoza","year":"2022","unstructured":"Ragoza M, Masuda T, Koes DR (2022) Generating 3d molecules conditional on receptor binding sites with deep generative models. Chem Sci 13(9):2701\u20132713","journal-title":"Chem Sci"},{"issue":"1","key":"755_CR9","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-017-0235-x","volume":"9","author":"M Olivecrona","year":"2017","unstructured":"Olivecrona M, Blaschke T, Engkvist O, Chen H (2017) Molecular de-novo design through deep reinforcement learning. J Cheminform 9(1):1\u201314","journal-title":"J Cheminform"},{"issue":"4","key":"755_CR10","doi-asserted-by":"publisher","first-page":"1983","DOI":"10.1021\/acs.jcim.9b01120","volume":"60","author":"F Imrie","year":"2020","unstructured":"Imrie F, Bradley AR, van der Schaar M, Deane CM (2020) Deep generative models for 3d linker design. J Chem Inf Model 60(4):1983\u20131995","journal-title":"J Chem Inf Model"},{"issue":"10","key":"755_CR11","doi-asserted-by":"publisher","first-page":"2280","DOI":"10.1021\/acs.jcim.1c01311","volume":"62","author":"TE Hadfield","year":"2022","unstructured":"Hadfield TE, Imrie F, Merritt A, Birchall K, Deane CM (2022) Incorporating target-specific pharmacophoric information into deep generative models for fragment elaboration. J Chem Inf Model 62(10):2280\u20132292","journal-title":"J Chem Inf Model"},{"issue":"1","key":"755_CR12","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-020-00472-1","volume":"12","author":"S Genheden","year":"2020","unstructured":"Genheden S, Thakkar A, Chadimov\u00e1 V, Reymond J-L, Engkvist O, Bjerrum E (2020) AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J Cheminform 12(1):1\u20139","journal-title":"J Cheminform"},{"issue":"6","key":"755_CR13","doi-asserted-by":"publisher","first-page":"1357","DOI":"10.1021\/acs.jcim.1c01074","volume":"62","author":"S Ishida","year":"2022","unstructured":"Ishida S, Terayama K, Kojima R, Takasu K, Okuno Y (2022) Ai-driven synthetic route design incorporated with retrosynthesis knowledge. J Chem Inf Model 62(6):1357\u20131367","journal-title":"J Chem Inf Model"},{"key":"755_CR14","unstructured":"Dai H, Li C, Coley C, Dai B, Song L (2019) Retrosynthesis prediction with conditional graph logic network. Proc Adv Neural Inf Process Syst. Vol 32"},{"issue":"2","key":"755_CR15","doi-asserted-by":"publisher","first-page":"455","DOI":"10.1002\/jcc.21334","volume":"31","author":"O Trott","year":"2010","unstructured":"Trott O, Olson AJ (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading. J Comp Chem 31(2):455\u2013461","journal-title":"J Comp Chem"},{"issue":"4","key":"755_CR16","doi-asserted-by":"publisher","first-page":"609","DOI":"10.1002\/prot.10465","volume":"52","author":"ML Verdonk","year":"2003","unstructured":"Verdonk ML, Cole JC, Hartshorn MJ, Murray CW, Taylor RD (2003) Improved protein-ligand docking using GOLD. Proteins 52(4):609\u2013623","journal-title":"Proteins"},{"issue":"9","key":"755_CR17","doi-asserted-by":"publisher","first-page":"1169","DOI":"10.1093\/bioinformatics\/btq112","volume":"26","author":"PJ Ballester","year":"2010","unstructured":"Ballester PJ, Mitchell JB (2010) A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics 26(9):1169\u20131175","journal-title":"Bioinformatics"},{"issue":"11","key":"755_CR18","doi-asserted-by":"publisher","first-page":"2897","DOI":"10.1021\/ci2003889","volume":"51","author":"JD Durrant","year":"2011","unstructured":"Durrant JD, McCammon JA (2011) NNScore 2.0: a neural-network receptor-ligand scoring function. J Chem Inf Model 51(11):2897\u20132903","journal-title":"J Chem Inf Model"},{"issue":"12","key":"755_CR19","doi-asserted-by":"publisher","first-page":"2495","DOI":"10.1021\/acs.jcim.6b00355","volume":"56","author":"JC Pereira","year":"2016","unstructured":"Pereira JC, Caffarena ER, Dos Santos CN (2016) Boosting docking-based virtual screening with deep learning. J Chem Inf Model 56(12):2495\u20132506","journal-title":"J Chem Inf Model"},{"issue":"4","key":"755_CR20","doi-asserted-by":"publisher","first-page":"942","DOI":"10.1021\/acs.jcim.6b00740","volume":"57","author":"M Ragoza","year":"2017","unstructured":"Ragoza M, Hochuli J, Idrobo E, Sunseri J, Koes DR (2017) Protein-ligand scoring with convolutional neural networks. J Chem Inf Model 57(4):942\u2013957","journal-title":"J Chem Inf Model"},{"issue":"11","key":"755_CR21","doi-asserted-by":"publisher","first-page":"2319","DOI":"10.1021\/acs.jcim.8b00350","volume":"58","author":"F Imrie","year":"2018","unstructured":"Imrie F, Bradley AR, van der Schaar M, Deane CM (2018) Protein family-specific models using deep neural networks and transfer learning improve virtual screening and highlight the need for more data. J Chem Inf Model 58(11):2319\u20132330","journal-title":"J Chem Inf Model"},{"issue":"8","key":"755_CR22","doi-asserted-by":"publisher","first-page":"0220113","DOI":"10.1371\/journal.pone.0220113","volume":"14","author":"L Chen","year":"2019","unstructured":"Chen L, Cruz A, Ramsey S, Dickson CJ, Duca JS, Hornak V, Koes DR, Kurtzman T (2019) Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening. PLoS ONE 14(8):0220113","journal-title":"PLoS ONE"},{"issue":"11","key":"755_CR23","doi-asserted-by":"publisher","first-page":"7946","DOI":"10.1021\/acs.jmedchem.2c00487","volume":"65","author":"M Volkov","year":"2022","unstructured":"Volkov M, Turk J-A, Drizard N, Martin N, Hoffmann B, Gaston-Math\u00e9 Y, Rognan D (2022) On the frustration to predict binding affinities from protein-ligand structures with deep neural networks. J Med Chem 65(11):7946\u20137958","journal-title":"J Med Chem"},{"issue":"8","key":"755_CR24","doi-asserted-by":"publisher","first-page":"3722","DOI":"10.1021\/acs.jcim.0c00263","volume":"60","author":"J Scantlebury","year":"2020","unstructured":"Scantlebury J, Brown N, Von Delft F, Deane CM (2020) Data set augmentation allows deep learning-based virtual screening to better generalize to unseen target classes and highlight important binding interactions. J Chem Inf Model 60(8):3722\u20133730","journal-title":"J Chem Inf Model"},{"issue":"24","key":"755_CR25","doi-asserted-by":"publisher","first-page":"11624","DOI":"10.1073\/pnas.1820657116","volume":"116","author":"K McCloskey","year":"2019","unstructured":"McCloskey K, Taly A, Monti F, Brenner MP, Colwell LJ (2019) Using attribution to decode binding mechanism in neural network models for chemistry. Proc Natl Acad Sci 116(24):11624\u201311629","journal-title":"Proc Natl Acad Sci"},{"key":"755_CR26","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2007.01436","author":"V Sundar","year":"2020","unstructured":"Sundar V, Colwell L (2020) Attribution methods reveal flaws in fingerprint-based virtual screening. arXiv. https:\/\/doi.org\/10.48550\/arXiv.2007.01436","journal-title":"arXiv"},{"issue":"1","key":"755_CR27","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-021-00519-x","volume":"13","author":"M Matveieva","year":"2021","unstructured":"Matveieva M, Polishchuk P (2021) Benchmarks For interpretation Of QSAR models. J Cheminform 13(1):1\u201320","journal-title":"J Cheminform"},{"key":"755_CR28","first-page":"3319","volume-title":"Proceedings of 34th international conference on machine learning","author":"M Sundararajan","year":"2017","unstructured":"Sundararajan M, Taly A, Yan Q (2017) Axiomatic attribution for deep networks. In: Sundararajan M (ed) Proceedings of 34th international conference on machine learning. Proceedings of Machine Learning Research, Pittsburgh, pp 3319\u20133328"},{"key":"755_CR29","doi-asserted-by":"publisher","first-page":"96","DOI":"10.1016\/j.jmgm.2018.06.005","volume":"84","author":"J Hochuli","year":"2018","unstructured":"Hochuli J, Helbling A, Skaist T, Ragoza M, Koes DR (2018) Visualizing convolutional neural network protein-ligand scoring. J Mol Graph Model 84:96\u2013108","journal-title":"J Mol Graph Model"},{"key":"755_CR30","doi-asserted-by":"publisher","DOI":"10.1021\/acs.jcim.3c00322","author":"J Scantlebury","year":"2023","unstructured":"Scantlebury J, Vost L, Carbery A, Hadfield TE, Turnbull OM, Brown N, Chenthamarakshan V, Das P, Grosjean H, von Delft F et al (2023) A small step toward generalizability: training a machine learning scoring function for structure-based virtual screening. J Chem Inform Model. https:\/\/doi.org\/10.1021\/acs.jcim.3c00322","journal-title":"J Chem Inform Model"},{"key":"755_CR31","unstructured":"Landrum G (2006) RDKit: Open-Source Cheminformatics"},{"issue":"5","key":"755_CR32","doi-asserted-by":"publisher","first-page":"742","DOI":"10.1021\/ci100050t","volume":"50","author":"D Rogers","year":"2010","unstructured":"Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inform Model 50(5):742\u2013754","journal-title":"J Chem Inform Model"},{"issue":"8","key":"755_CR33","doi-asserted-by":"publisher","first-page":"1334","DOI":"10.1093\/bioinformatics\/bty757","volume":"35","author":"M W\u00f3jcikowski","year":"2019","unstructured":"W\u00f3jcikowski M, Kukie\u0142ka M, Stepniewska-Dziubinska MM, Siedlecki P (2019) Development of a protein-ligand extended connectivity (PLEC) fingerprint and its application for binding affinity predictions. Bioinformatics 35(8):1334\u20131341","journal-title":"Bioinformatics"},{"issue":"1","key":"755_CR34","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-015-0078-2","volume":"7","author":"M W\u00f3jcikowski","year":"2015","unstructured":"W\u00f3jcikowski M, Zielenkiewicz P, Siedlecki P (2015) Open drug discovery toolkit (ODDT): a new open-source player in the drug discovery field. J Cheminform 7(1):1\u20136","journal-title":"J Cheminform"},{"key":"755_CR35","first-page":"9323","volume-title":"Proceedings of the 38th international conference on machine learning","author":"VG Satorras","year":"2021","unstructured":"Satorras VG, Hoogeboom E, Welling M (2021) E (n) equivariant graph neural networks. In: Satorras VG (ed) Proceedings of the 38th international conference on machine learning. Proceedings Machine Learning Research, Pittsburgh, pp 9323\u20139332"},{"issue":"11","key":"755_CR36","doi-asserted-by":"publisher","first-page":"2324","DOI":"10.1021\/acs.jcim.5b00559","volume":"55","author":"T Sterling","year":"2015","unstructured":"Sterling T, Irwin JJ (2015) ZINC 15-ligand discovery for everyone. J Chem Inform Model 55(11):2324\u20132337","journal-title":"J Chem Inform Model"},{"issue":"14","key":"755_CR37","doi-asserted-by":"publisher","first-page":"6582","DOI":"10.1021\/jm300687e","volume":"55","author":"MM Mysinger","year":"2012","unstructured":"Mysinger MM, Carchia M, Irwin JJ, Shoichet BK (2012) Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J Med Chem 55(14):6582\u20136594","journal-title":"J Med Chem"},{"issue":"3","key":"755_CR38","doi-asserted-by":"publisher","first-page":"947","DOI":"10.1021\/acs.jcim.8b00712","volume":"59","author":"J Sieg","year":"2019","unstructured":"Sieg J, Flachsenberg F, Rarey M (2019) In need of bias control: evaluating chemical data for machine learning in structure-based virtual screening. J Chem Inform Model 59(3):947\u2013961","journal-title":"J Chem Inform Model"},{"issue":"9","key":"755_CR39","doi-asserted-by":"publisher","first-page":"4263","DOI":"10.1021\/acs.jcim.0c00155","volume":"60","author":"V-K Tran-Nguyen","year":"2020","unstructured":"Tran-Nguyen V-K, Jacquemard C, Rognan D (2020) LIT-PCBA: an unbiased data set for machine learning and virtual screening. J Chem Inform Model 60(9):4263\u20134273","journal-title":"J Chem Inform Model"},{"issue":"2","key":"755_CR40","doi-asserted-by":"publisher","first-page":"302","DOI":"10.1021\/acs.accounts.6b00491","volume":"50","author":"Z Liu","year":"2017","unstructured":"Liu Z, Su M, Han L, Liu J, Yang Q, Li Y, Wang R (2017) Forging the basis for developing protein-ligand interaction scoring functions. Acc Chem Res 50(2):302\u2013309","journal-title":"Acc Chem Res"},{"issue":"1","key":"755_CR41","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/srep46710","volume":"7","author":"M W\u00f3jcikowski","year":"2017","unstructured":"W\u00f3jcikowski M, Ballester PJ, Siedlecki P (2017) Performance of machine-learning scoring functions in structure-based virtual screening. Sci Rep 7(1):1\u201310","journal-title":"Sci Rep"},{"key":"755_CR42","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2204.06348","author":"C Poelking","year":"2022","unstructured":"Poelking C, Chessari G, Murray CW, Hall RJ, Colwell L, Verdonk M (2022) Meaningful machine learning models and machine-learned pharmacophores from fragment screening campaigns. arXiv. https:\/\/doi.org\/10.48550\/arXiv.2204.06348","journal-title":"arXiv"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00755-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-023-00755-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00755-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,18]],"date-time":"2023-11-18T11:47:52Z","timestamp":1700308072000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-023-00755-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,19]]},"references-count":42,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["755"],"URL":"https:\/\/doi.org\/10.1186\/s13321-023-00755-3","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2023.04.29.538820","asserted-by":"object"}]},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,19]]},"assertion":[{"value":"2 May 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 August 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 September 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"T.E.H. is an employee of AstraZeneca PLC; all work was done whilst a doctoral student at the University of Oxford. C.M.D. is an employee of Exscientia PLC.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"84"}}