{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,31]],"date-time":"2025-08-31T10:22:42Z","timestamp":1756635762100,"version":"3.37.3"},"reference-count":24,"publisher":"Oxford University Press (OUP)","issue":"3","license":[{"start":{"date-parts":[[2020,8,10]],"date-time":"2020-08-10T00:00:00Z","timestamp":1597017600000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"European Union\u2019s Framework Programme For Research and Innovation Horizon 2020"},{"name":"Marie Sklodowska-Curie","award":["703543"],"award-info":[{"award-number":["703543"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,4,20]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Quantitative structure\u2013activity relationship (QSAR) methods are increasingly used in assisting the process of preclinical, small molecule drug discovery. Regression models are trained on data consisting of a finite-dimensional representation of molecular structures and their corresponding target-specific activities. 
These supervised learning models can then be used to predict the activity of previously unmeasured novel compounds.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>This work provides methods that solve three problems in QSAR modelling: (i) a method for comparing the information content between finite-dimensional representations of molecular structures (fingerprints) with respect to the target of interest, (ii) a method that quantifies how the accuracy of the model prediction degrades as a function of the distance between the testing and training data and (iii) a method to adjust for screening dependent selection bias inherent in many training datasets. For example, in the most extreme cases, only compounds which pass an activity-dependent screening threshold are reported. A semi-supervised learning framework combines (ii) and (iii) and can make predictions, which take into account the similarity of the testing compounds to those in the training data and adjust for the reporting selection bias. 
We illustrate the three methods using publicly available structure\u2013activity data for a large set of compounds reported by GlaxoSmithKline (the Tres Cantos AntiMalarial Set, TCAMS) to inhibit asexual in vitro Plasmodium falciparum growth.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>https:\/\/github.com\/owatson\/PenalizedPrediction.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa711","type":"journal-article","created":{"date-parts":[[2020,8,3]],"date-time":"2020-08-03T12:00:04Z","timestamp":1596456004000},"page":"342-350","source":"Crossref","is-referenced-by-count":6,"title":["A semi-supervised learning framework for quantitative structure\u2013activity regression modelling"],"prefix":"10.1093","volume":"37","author":[{"given":"Oliver","family":"Watson","sequence":"first","affiliation":[{"name":"Evariste Technologies Ltd, Goring on Thames RG8 9AL, UK"}]},{"given":"Isidro","family":"Cortes-Ciriano","sequence":"additional","affiliation":[{"name":"Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5524-0325","authenticated-orcid":false,"given":"James A","family":"Watson","sequence":"additional","affiliation":[{"name":"Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford OX1 2JD, UK"},{"name":"Mahidol-Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, Bangkok 10400, 
Thailand"}]}],"member":"286","published-online":{"date-parts":[[2020,8,10]]},"reference":[{"key":"2023051604084799500_btaa711-B1","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1186\/s13321-015-0069-3","article-title":"Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?","volume":"7","author":"Bajusz","year":"2015","journal-title":"J. Cheminf"},{"key":"2023051604084799500_btaa711-B2","doi-asserted-by":"crossref","first-page":"4977","DOI":"10.1021\/jm4004285","article-title":"QSAR modeling: where have you been? Where are you going to?","volume":"57","author":"Cherkasov","year":"2014","journal-title":"J. Med. Chem"},{"key":"2023051604084799500_btaa711-B3","doi-asserted-by":"crossref","first-page":"2000","DOI":"10.1021\/acs.jcim.8b00376","article-title":"Discovering highly potent molecules from an initial set of inactives using iterative screening","volume":"58","author":"Cortes-Ciriano","year":"2018","journal-title":"J. Chem. Inf. Model"},{"key":"2023051604084799500_btaa711-B4","doi-asserted-by":"crossref","first-page":"1189","DOI":"10.1021\/ci100176x","article-title":"Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research","volume":"50","author":"Fourches","year":"2010","journal-title":"J. Chem. Inf. 
Model"},{"key":"2023051604084799500_btaa711-B5","doi-asserted-by":"crossref","first-page":"305","DOI":"10.1038\/nature09107","article-title":"Thousands of chemical starting points for antimalarial lead identification","volume":"465","author":"Gamo","year":"2010","journal-title":"Nature"},{"key":"2023051604084799500_btaa711-B6","first-page":"208","volume-title":"ACS Chemical Biology","author":"Huggins","year":"2011"},{"key":"2023051604084799500_btaa711-B7","doi-asserted-by":"crossref","first-page":"923","DOI":"10.1038\/nmeth1113","article-title":"Semi-supervised learning for peptide identification from shotgun proteomics datasets","volume":"4","author":"K\u00e4ll","year":"2007","journal-title":"Nat. Methods"},{"key":"2023051604084799500_btaa711-B8","doi-asserted-by":"crossref","first-page":"230","DOI":"10.1021\/ci400469u","article-title":"How diverse are diversity assessment methods? A comparative analysis and benchmarking of molecular descriptor space","volume":"54","author":"Koutsoukas","year":"2014","journal-title":"J. Chem. Inf. Model"},{"year":"2017","author":"Landrum","key":"2023051604084799500_btaa711-B9"},{"key":"2023051604084799500_btaa711-B10","first-page":"2","article-title":"High-throughput screening: the hits and leads of drug discovery \u2013 an overview","volume":"01","author":"Martis","year":"2011","journal-title":"J. Appl. Pharm. Sci"},{"key":"2023051604084799500_btaa711-B11","doi-asserted-by":"crossref","first-page":"453","DOI":"10.2174\/1386207013330896","article-title":"Computational approaches towards the rational design of drug-like compound libraries","volume":"4","author":"Matter","year":"2012","journal-title":"Comb. Chem. 
High Throughput Screen"},{"key":"2023051604084799500_btaa711-B12","doi-asserted-by":"crossref","first-page":"941","DOI":"10.1021\/ci7004498","article-title":"Application of belief theory to similarity data fusion for use in analog searching and lead hopping","volume":"48","author":"Muchmore","year":"2008","journal-title":"J. Chem. Inf. Model"},{"key":"2023051604084799500_btaa711-B13","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1177\/026119290503300209","article-title":"Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships: the report and recommendations of ECVAM workshop 52","volume":"33","author":"Netzeva","year":"2005","journal-title":"Alternatives Lab. Anim"},{"key":"2023051604084799500_btaa711-B14","doi-asserted-by":"crossref","first-page":"256","DOI":"10.1016\/j.jmgm.2017.01.008","article-title":"Binary classification of imbalanced datasets using conformal prediction","volume":"72","author":"Norinder","year":"2017","journal-title":"J. Mol. Graph. Model"},{"key":"2023051604084799500_btaa711-B15","first-page":"2825","article-title":"Scikit-learn: machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res"},{"key":"2023051604084799500_btaa711-B16","doi-asserted-by":"crossref","first-page":"947","DOI":"10.1517\/17460440903190961","article-title":"High-throughput and in silico screenings in drug discovery","volume":"4","author":"Phatak","year":"2009","journal-title":"Exp. Opin. Drug Disc"},{"key":"2023051604084799500_btaa711-B17","doi-asserted-by":"crossref","first-page":"742","DOI":"10.1021\/ci100050t","article-title":"Extended-connectivity fingerprints","volume":"50","author":"Rogers","year":"2010","journal-title":"J. Chem. Inf. 
Model"},{"key":"2023051604084799500_btaa711-B18","doi-asserted-by":"crossref","first-page":"1098","DOI":"10.1021\/acs.jcim.5b00110","article-title":"The relative importance of domain applicability metrics for estimating prediction errors in QSAR varies with training set diversity","volume":"55","author":"Sheridan","year":"2015","journal-title":"J. Chem. Inf. Model"},{"key":"2023051604084799500_btaa711-B19","doi-asserted-by":"crossref","first-page":"3017","DOI":"10.1093\/bioinformatics\/btr502","article-title":"Semi-supervised learning improves gene expression-based prediction of cancer recurrence","volume":"27","author":"Shi","year":"2011","journal-title":"Bioinformatics"},{"key":"2023051604084799500_btaa711-B20","doi-asserted-by":"crossref","first-page":"1591","DOI":"10.1021\/acs.jcim.7b00159","article-title":"Applying mondrian cross-conformal prediction to estimate prediction confidence on large imbalanced bioactivity data sets","volume":"57","author":"Sun","year":"2017","journal-title":"J. Chem. Inf. Model"},{"key":"2023051604084799500_btaa711-B21","doi-asserted-by":"crossref","DOI":"10.1016\/S1359-6446(00)01517-8","volume-title":"Diversity Screening versus Focussed Screening in Drug Discovery","author":"Valler","year":"2000"},{"key":"2023051604084799500_btaa711-B22","doi-asserted-by":"crossref","first-page":"916","DOI":"10.1021\/acs.jcim.7b00403","article-title":"Most ligand-based classification benchmarks reward memorization rather than generalization","volume":"58","author":"Wallach","year":"2018","journal-title":"J. Chem. Inf. Model"},{"key":"2023051604084799500_btaa711-B23","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1016\/S0169-409X(02)00003-0","article-title":"Prediction of \u2018drug-likeness\u2019","volume":"54","author":"Walters","year":"2002","journal-title":"Adv. Drug Deliv. 
Rev"},{"key":"2023051604084799500_btaa711-B24","doi-asserted-by":"crossref","first-page":"4656","DOI":"10.1093\/bioinformatics\/btz293","article-title":"A decision-theoretic approach to the evaluation of machine learning algorithms in computational drug discovery","volume":"35","author":"Watson","year":"2019","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaa711\/34051646\/btaa711.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/3\/342\/50325990\/btaa711.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/3\/342\/50325990\/btaa711.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,16]],"date-time":"2023-05-16T04:10:03Z","timestamp":1684210203000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/3\/342\/5890674"}},"subtitle":[],"editor":[{"given":"Jinbo","family":"Xu","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2020,8,10]]},"references-count":24,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,4,20]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa711","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2021,2,1]]},"published":{"date-parts":[[2020,8,10]]}}}