{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T19:35:21Z","timestamp":1772825721410,"version":"3.50.1"},"reference-count":26,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2014,6,11]],"date-time":"2014-06-11T00:00:00Z","timestamp":1402444800000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/2.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"published-print":{"date-parts":[[2014,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>The impact of this rather neglected aspect of machine learning methods application was examined for sets containing a fixed number of positive and a varying number of negative examples randomly selected from the ZINC database. An increase in the ratio of positive to negative training instances was found to greatly influence most of the investigated evaluating parameters of ML methods in simulated virtual screening experiments. In a majority of cases, substantial increases in precision and MCC were observed in conjunction with some decreases in hit recall. The analysis of dynamics of those variations let us recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Na\u00efve Bayes, Ibk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). 
The most effective classification was provided by the combination of CDK FP with SMO or Random Forest algorithms. The Na\u00efve Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusions<\/jats:title>\n            <jats:p>In conclusion, the ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of particular classifier. What is more, the optimization of negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1758-2946-6-32","type":"journal-article","created":{"date-parts":[[2014,6,11]],"date-time":"2014-06-11T01:02:57Z","timestamp":1402448577000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":68,"title":["The influence of negative training set size on machine learning-based virtual screening"],"prefix":"10.1186","volume":"6","author":[{"given":"Rafa\u0142","family":"Kurczab","sequence":"first","affiliation":[]},{"given":"Sabina","family":"Smusz","sequence":"additional","affiliation":[]},{"given":"Andrzej J","family":"Bojarski","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2014,6,11]]},"reference":[{"key":"607_CR1","doi-asserted-by":"publisher","first-page":"332","DOI":"10.2174\/138620709788167980","volume":"12","author":"JL Melville","year":"2009","unstructured":"Melville JL, Burke EK, Hirst JD: Machine learning in virtual screening. Comb Chem High Throughput Screen. 2009, 12: 332-343. 
10.2174\/138620709788167980.","journal-title":"Comb Chem High Throughput Screen"},{"key":"607_CR2","doi-asserted-by":"publisher","first-page":"1227","DOI":"10.1021\/ci800022e","volume":"48","author":"XH Ma","year":"2008","unstructured":"Ma XH, Wang R, Yang SY, Li ZR, Xue Y, Wei YC, Low BC, Chen YZ: Evaluation of virtual screening performance of support vector machines trained by sparsely distributed active compounds. J Chem Inf Model. 2008, 48: 1227-1237. 10.1021\/ci800022e.","journal-title":"J Chem Inf Model"},{"key":"607_CR3","doi-asserted-by":"publisher","first-page":"1098","DOI":"10.1021\/ci050519k","volume":"46","author":"D Plewczynski","year":"2006","unstructured":"Plewczynski D, Spieser SH, Koch U: Assessing different classification methods for virtual screening. J Chem Inf Model. 2006, 46: 1098-1106. 10.1021\/ci050519k.","journal-title":"J Chem Inf Model"},{"key":"607_CR4","doi-asserted-by":"publisher","first-page":"219","DOI":"10.1021\/ci600332j","volume":"47","author":"CL Bruce","year":"2007","unstructured":"Bruce CL, Melville JL, Pickett SD, Hirst JD: Contemporary QSAR classifiers compared. J Chem Inf Model. 2007, 47: 219-227. 10.1021\/ci600332j.","journal-title":"J Chem Inf Model"},{"key":"607_CR5","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1016\/j.chemolab.2013.08.003","volume":"128","author":"S Smusz","year":"2013","unstructured":"Smusz S, Kurczab R, Bojarski AJ: A multidimensional analysis of machine learning methods performance in the classification of bioactive compounds. Chemom Intell Lab Syst. 2013, 128: 89-100.","journal-title":"Chemom Intell Lab Syst"},{"key":"607_CR6","doi-asserted-by":"publisher","first-page":"17","DOI":"10.1186\/1758-2946-5-17","volume":"5","author":"S Smusz","year":"2013","unstructured":"Smusz S, Kurczab R, Bojarski AJ: The influence of the inactives subset generation on the performance of machine learning methods. J Cheminf. 2013, 5: 17-25. 
10.1186\/1758-2946-5-17.","journal-title":"J Cheminf"},{"key":"607_CR7","doi-asserted-by":"publisher","first-page":"1757","DOI":"10.1021\/ci3001277","volume":"52","author":"JJ Irwin","year":"2012","unstructured":"Irwin JJ, Sterling T, Mysinger MM, Bolstad ES, Coleman RG: ZINC: A free tool to discover chemistry for biology. J Chem Inf Model. 2012, 52: 1757-1768. 10.1021\/ci3001277.","journal-title":"J Chem Inf Model"},{"key":"607_CR8","unstructured":"USA: MDDR licensed by Accelrys, Inc, [http:\/\/www.accelrys.com]"},{"key":"607_CR9","doi-asserted-by":"publisher","first-page":"6789","DOI":"10.1021\/jm0608356","volume":"49","author":"N Huang","year":"2006","unstructured":"Huang N, Shoichet BK, Irwin JJ: Benchmarking sets for molecular docking. J Med Chem. 2006, 49: 6789-6801. 10.1021\/jm0608356.","journal-title":"J Med Chem"},{"key":"607_CR10","doi-asserted-by":"publisher","first-page":"1595","DOI":"10.1021\/ci4002712","volume":"53","author":"K Heikamp","year":"2013","unstructured":"Heikamp K, Bajorath J: Comparison of confirmed inactive and randomly selected compounds as negative training examples in support vector machine-based virtual screening. J Chem Inf Model. 2013, 53: 1595-1601. 10.1021\/ci4002712.","journal-title":"J Chem Inf Model"},{"key":"607_CR11","doi-asserted-by":"publisher","first-page":"D400","DOI":"10.1093\/nar\/gkr1132","volume":"40","author":"Y Wang","year":"2012","unstructured":"Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Zhou Z, Han L, Karapetyan K, Dracheva S, Shoemaker BA, Bolton E, Gindulyte A, Bryant SH: PubChem\u2019s BioAssay Database. Nucleic Acids Res. 2012, 40: D400-D412. 
10.1093\/nar\/gkr1132.","journal-title":"Nucleic Acids Res"},{"key":"607_CR12","doi-asserted-by":"publisher","first-page":"D1100","DOI":"10.1093\/nar\/gkr777","volume":"40","author":"A Gaulton","year":"2011","unstructured":"Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, Mcglinchey S, Michalovich D, Al-Lazikani B, Overington JP: ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2011, 40: D1100-D1107.","journal-title":"Nucleic Acids Res"},{"key":"607_CR13","first-page":"233","volume-title":"Proceedings of the 23rd international conference on Machine Learning","author":"J Davis","year":"2006","unstructured":"Davis J, Goadrich M: The relationship between precision-recall and ROC curves. Proceedings of the 23rd international conference on Machine Learning. 2006, 233-240."},{"key":"607_CR14","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1007\/s10822-006-9096-5","volume":"21","author":"B Chen","year":"2007","unstructured":"Chen B, Harrison RF, Papadatos G, Willett P, Wood DJ, Lewell XQ, Greenidge P, Stiefl N: Evaluation of machine-learning methods for ligand-based virtual screening. J Comput Aid Mol Des. 2007, 21: 53-62. 10.1007\/s10822-006-9096-5.","journal-title":"J Comput Aid Mol Des"},{"key":"607_CR15","doi-asserted-by":"publisher","first-page":"344","DOI":"10.2174\/138620709788167944","volume":"12","author":"XH Ma","year":"2009","unstructured":"Ma XH, Jia J, Zhu F, Xue Y, Li ZR, Chen YZ: Comparative analysis of machine learning methods in ligand-based virtual screening of large compound libraries. Comb Chem High Throughput Screen. 2009, 12: 344-357. 
10.2174\/138620709788167944.","journal-title":"Comb Chem High Throughput Screen"},{"key":"607_CR16","doi-asserted-by":"publisher","first-page":"269","DOI":"10.1007\/s10822-007-9113-3","volume":"21","author":"EO Cannon","year":"2007","unstructured":"Cannon EO, Amini A, Bender A, Sternberg MJE, Muggleton SH, Glen RC, Mitchell JBO: Support vector inductive logic programming outperforms the naive Bayes classifier and inductive logic programming for the classification of bioactive chemical compounds. J Comput Aid Mol Des. 2007, 21: 269-280. 10.1007\/s10822-007-9113-3.","journal-title":"J Comput Aid Mol Des"},{"key":"607_CR17","first-page":"1","volume-title":"Technical Report MSR-TR-98-14","author":"Platt JC Sequential Minimal Optimization","year":"1998","unstructured":"Platt JC Sequential Minimal Optimization: A fast algorithm for training Support Vector Machines. Technical Report MSR-TR-98-14. 1998, 1-21."},{"key":"607_CR18","volume-title":"Machine Learning","author":"TM Mitchell","year":"1997","unstructured":"Mitchell TM: Machine Learning. 1997, New York: McGraw-Hill"},{"key":"607_CR19","first-page":"37","volume":"6","author":"DW Aha","year":"1991","unstructured":"Aha DW, Kibler D, Albert MK: Instance-based learning algorithms. Mach Learn. 1991, 6: 37-66.","journal-title":"Mach Learn"},{"key":"607_CR20","doi-asserted-by":"publisher","first-page":"153","DOI":"10.1023\/A:1014043630878","volume":"6","author":"H Brighton","year":"2002","unstructured":"Brighton H, Mellish C: Advances in instance selection for instance-based learning algorithms. Data Min Knowl Disc. 2002, 6: 153-172. 10.1023\/A:1014043630878.","journal-title":"Data Min Knowl Disc"},{"key":"607_CR21","first-page":"81","volume":"1","author":"JR Quinlan","year":"1986","unstructured":"Quinlan JR: Induction of decision trees. Mach Learn. 
1986, 1: 81-106.","journal-title":"Mach Learn"},{"key":"607_CR22","doi-asserted-by":"publisher","first-page":"1947","DOI":"10.1021\/ci034160g","volume":"43","author":"V Svetnik","year":"2003","unstructured":"Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP: Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci. 2003, 43: 1947-1958. 10.1021\/ci034160g.","journal-title":"J Chem Inf Comput Sci"},{"key":"607_CR23","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","volume":"45","author":"L Breiman","year":"2001","unstructured":"Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. 10.1023\/A:1010933404324.","journal-title":"Mach Learn"},{"key":"607_CR24","unstructured":"San Diego, CA, USA: MACCS Structural keys, Accelrys, [http:\/\/www.accelrys.com]"},{"key":"607_CR25","doi-asserted-by":"publisher","first-page":"493","DOI":"10.1021\/ci025584y","volume":"43","author":"C Steinbeck","year":"2003","unstructured":"Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics. J Chem Inf Comput Sci. 2003, 43: 493-500. 10.1021\/ci025584y.","journal-title":"J Chem Inf Comput Sci"},{"key":"607_CR26","doi-asserted-by":"publisher","first-page":"1466","DOI":"10.1002\/jcc.21707","volume":"32","author":"CW Yap","year":"2011","unstructured":"Yap CW: PaDEL-Descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem. 2011, 32: 1466-1474. 
10.1002\/jcc.21707.","journal-title":"J Comput Chem"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/article\/10.1186\/1758-2946-6-32\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/1758-2946-6-32.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1758-2946-6-32.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,2]],"date-time":"2021-09-02T04:07:28Z","timestamp":1630555648000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/1758-2946-6-32"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,6,11]]},"references-count":26,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2014,12]]}},"alternative-id":["607"],"URL":"https:\/\/doi.org\/10.1186\/1758-2946-6-32","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2014,6,11]]},"assertion":[{"value":"12 October 2013","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 June 2014","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 June 2014","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"32"}}