{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T16:26:55Z","timestamp":1770222415277,"version":"3.49.0"},"reference-count":70,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,10,27]],"date-time":"2020-10-27T00:00:00Z","timestamp":1603756800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,10,27]],"date-time":"2020-10-27T00:00:00Z","timestamp":1603756800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100006229","name":"Oak Ridge Institute for Science and Education","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100006229","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The specificity of toxicant-target biomolecule interactions lends to the very imbalanced nature of many toxicity datasets, causing poor performance in Structure\u2013Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such an imbalance challenge. However, removing inactive chemical compound instances from the majority class using an undersampling technique can result in information loss, whereas increasing active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlapping and a higher false prediction rate. 
In this study, in order to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples, followed by cleaning the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consisted of 12 in vitro bioassays for\u2009&gt;\u200910,000 chemicals that were distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods, i.e., RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). The performance of the four learning methods was compared using nine evaluation metrics, among which F<jats:sub>1<\/jats:sub> score, Matthews correlation coefficient and Brier score provided a more consistent assessment of the overall performance across the 12 datasets. The Friedman\u2019s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods. We also found that a strong negative correlation existed between the prediction accuracy and the imbalance ratio (IR), which is defined as the number of inactive compounds divided by the number of active compounds. SMN became less effective when IR exceeded a certain threshold (e.g.,\u2009&gt;\u200928). The ability to separate the few active compounds from the vast amounts of inactive ones is of great importance in computational toxicology. 
This work demonstrates that the performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through the use of data rebalancing.<\/jats:p>","DOI":"10.1186\/s13321-020-00468-x","type":"journal-article","created":{"date-parts":[[2020,10,27]],"date-time":"2020-10-27T13:04:02Z","timestamp":1603803842000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":67,"title":["Structure\u2013activity relationship-based chemical classification of highly imbalanced Tox21 datasets"],"prefix":"10.1186","volume":"12","author":[{"given":"Gabriel","family":"Idakwo","sequence":"first","affiliation":[]},{"given":"Sundar","family":"Thangapandian","sequence":"additional","affiliation":[]},{"given":"Joseph","family":"Luttrell","sequence":"additional","affiliation":[]},{"given":"Yan","family":"Li","sequence":"additional","affiliation":[]},{"given":"Nan","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Zhaoxian","family":"Zhou","sequence":"additional","affiliation":[]},{"given":"Huixiao","family":"Hong","sequence":"additional","affiliation":[]},{"given":"Bei","family":"Yang","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6873-1780","authenticated-orcid":false,"given":"Chaoyang","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Ping","family":"Gong","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,10,27]]},"reference":[{"key":"468_CR1","doi-asserted-by":"publisher","first-page":"192","DOI":"10.1109\/Trustcom.2015.581","volume-title":"2015 IEEE Trustcom\/BigDataSE\/ISPA","author":"WM Czarnecki","year":"2015","unstructured":"Czarnecki WM, Rataj K (2015) Compounds activity prediction in large imbalanced datasets with substructural relations fingerprint and EEM. 2015 IEEE Trustcom\/BigDataSE\/ISPA. 
IEEE, Helsinki, pp 192\u2013192"},{"key":"468_CR2","doi-asserted-by":"publisher","first-page":"1757","DOI":"10.1021\/ci3001277","volume":"52","author":"JJ Irwin","year":"2012","unstructured":"Irwin JJ, Sterling T, Mysinger MM et al (2012) ZINC: a free tool to discover chemistry for biology. J Chem Inf Model 52:1757\u20131768. https:\/\/doi.org\/10.1021\/ci3001277","journal-title":"J Chem Inf Model"},{"key":"468_CR3","unstructured":"Dahl GE, Jaitly N, Salakhutdinov R (2014) Multi-task neural networks for QSAR predictions. https:\/\/arxiv.org\/abs\/1406.1231. Accessed 6 Oct 2017"},{"key":"468_CR4","doi-asserted-by":"publisher","first-page":"1590","DOI":"10.1016\/j.ejmech.2010.01.002","volume":"45","author":"R Darnag","year":"2010","unstructured":"Darnag R, Mostapha Mazouz EL, Schmitzer A et al (2010) Support vector machines: development of QSAR models for predicting anti-HIV-1 activity of TIBO derivatives. Eur J Med Chem 45:1590\u20131597. https:\/\/doi.org\/10.1016\/j.ejmech.2010.01.002","journal-title":"Eur J Med Chem"},{"key":"468_CR5","doi-asserted-by":"publisher","first-page":"2481","DOI":"10.1021\/ci900203n","volume":"49","author":"PG Polishchuk","year":"2009","unstructured":"Polishchuk PG, Muratov EN, Artemenko AG et al (2009) Application of random forest approach to QSAR prediction of aquatic toxicity. J Chem Inf Model 49:2481\u20132488. https:\/\/doi.org\/10.1021\/ci900203n","journal-title":"J Chem Inf Model"},{"key":"468_CR6","doi-asserted-by":"publisher","first-page":"463","DOI":"10.1109\/TSMCC.2011.2161285","volume":"42","author":"M Galar","year":"2012","unstructured":"Galar M, Fern\u00e1ndez A, Barrenechea E et al (2012) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern Part C 42:463\u2013484. 
https:\/\/doi.org\/10.1109\/TSMCC.2011.2161285","journal-title":"IEEE Trans Syst Man Cybern Part C"},{"key":"468_CR7","doi-asserted-by":"publisher","first-page":"221","DOI":"10.1007\/s13748-016-0094-0","volume":"5","author":"B Krawczyk","year":"2016","unstructured":"Krawczyk B, Krawczyk BB (2016) Learning from imbalanced data: open challenges and future directions. Prog Artif Intell 5:221\u2013232. https:\/\/doi.org\/10.1007\/s13748-016-0094-0","journal-title":"Prog Artif Intell"},{"key":"468_CR8","doi-asserted-by":"publisher","first-page":"412","DOI":"10.1002\/sam.10061","volume":"2","author":"S Hido","year":"2009","unstructured":"Hido S, Kashima H, Takahashi Y (2009) Roughly balanced bagging for imbalanced data. Stat Anal Data Min 2:412\u2013426. https:\/\/doi.org\/10.1002\/sam.10061","journal-title":"Stat Anal Data Min"},{"key":"468_CR9","doi-asserted-by":"publisher","first-page":"853","DOI":"10.1007\/0-387-25465-X_40","volume-title":"Data Mining and Knowledge Discovery Handbook","author":"NV Chawla","year":"2005","unstructured":"Chawla NV (2005) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) Data Mining and Knowledge Discovery Handbook. Springer-Verlag, New York, pp 853\u2013867"},{"key":"468_CR10","doi-asserted-by":"publisher","DOI":"10.1002\/9781118646106","volume-title":"Imbalanced learning: foundations, algorithms, and applications","author":"H He","year":"2013","unstructured":"He H, Ma Y (2013) Imbalanced learning: foundations, algorithms, and applications. John Wiley & Sons Inc, New York"},{"key":"468_CR11","unstructured":"Branco P, Torgo L, Ribeiro R (2015) A survey of predictive modelling under imbalanced distributions. https:\/\/arxiv.org\/abs\/1505.01658. 
Accessed 8 Aug 2017"},{"key":"468_CR12","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1613\/jair.953","volume":"16","author":"NV Chawla","year":"2002","unstructured":"Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: Synthetic Minority Over-sampling Technique. J Artif Intell Res 16:321\u2013357. https:\/\/doi.org\/10.1613\/jair.953","journal-title":"J Artif Intell Res"},{"key":"468_CR13","doi-asserted-by":"publisher","first-page":"362","DOI":"10.3389\/fchem.2018.00362","volume":"6","author":"P Banerjee","year":"2018","unstructured":"Banerjee P, Dehnbostel FO, Preissner R (2018) Prediction is a balancing act: importance of sampling methods to balance sensitivity and specificity of predictive models based on imbalanced chemical data sets. Front Chem 6:362. https:\/\/doi.org\/10.3389\/fchem.2018.00362","journal-title":"Front Chem"},{"key":"468_CR14","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1007\/978-3-319-18781-5_17","volume-title":"Challenges in computational statistics and data mining","author":"J Stefanowski","year":"2016","unstructured":"Stefanowski J (2016) Dealing with Data Difficulty Factors While Learning from Imbalanced Data. Challenges in computational statistics and data mining. Springer, Cham, Switzerland, pp 333\u2013363"},{"key":"468_CR15","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1186\/s13321-018-0325-4","volume":"11","author":"N Bosc","year":"2019","unstructured":"Bosc N, Atkinson F, Felix E et al (2019) Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery. J Cheminform 11:4. 
https:\/\/doi.org\/10.1186\/s13321-018-0325-4","journal-title":"J Cheminform"},{"key":"468_CR16","doi-asserted-by":"publisher","first-page":"1003","DOI":"10.1021\/acs.chemrestox.6b00037","volume":"29","author":"U Norinder","year":"2016","unstructured":"Norinder U, Boyer S (2016) Conformal Prediction Classification of a Large Data Set of Environmental Chemicals from ToxCast and Tox21 Estrogen Receptor Assays. Chem Res Toxicol 29:1003\u20131010. https:\/\/doi.org\/10.1021\/acs.chemrestox.6b00037","journal-title":"Chem Res Toxicol"},{"key":"468_CR17","doi-asserted-by":"publisher","first-page":"1591","DOI":"10.1021\/acs.jcim.7b00159","volume":"57","author":"J Sun","year":"2017","unstructured":"Sun J, Carlsson L, Ahlberg E et al (2017) Applying mondrian cross-conformal prediction to estimate prediction confidence on large imbalanced bioactivity data sets. J Chem Inf Model 57:1591\u20131598. https:\/\/doi.org\/10.1021\/acs.jcim.7b00159","journal-title":"J Chem Inf Model"},{"key":"468_CR18","doi-asserted-by":"crossref","unstructured":"Cort\u00e9s-Ciriano I, Bender A (2019) Concepts and applications of conformal prediction in computational drug discovery","DOI":"10.1039\/9781788016841-00063"},{"key":"468_CR19","doi-asserted-by":"publisher","first-page":"256","DOI":"10.1016\/j.jmgm.2017.01.008","volume":"72","author":"U Norinder","year":"2017","unstructured":"Norinder U, Boyer S (2017) Binary classification of imbalanced datasets using conformal prediction. J Mol Graph Model 72:256\u2013265. https:\/\/doi.org\/10.1016\/j.jmgm.2017.01.008","journal-title":"J Mol Graph Model"},{"key":"468_CR20","doi-asserted-by":"publisher","first-page":"1263","DOI":"10.1109\/TKDE.2008.239","volume":"21","author":"H He","year":"2009","unstructured":"He H, Garcia EA (2009) Learning from Imbalanced Data. IEEE Trans Knowl Data Eng 21:1263\u20131284. 
https:\/\/doi.org\/10.1109\/TKDE.2008.239","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"468_CR21","doi-asserted-by":"crossref","unstructured":"Davis J, Goadrich M (2006) The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. ACM, Pittsburgh, pp 233\u2013240","DOI":"10.1145\/1143844.1143874"},{"key":"468_CR22","unstructured":"Provost F, Fawcett T, Kohavi R (1998) The case against accuracy estimation for comparing induction algorithms. In: Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc, San Francisco, pp 445\u2013453"},{"key":"468_CR23","doi-asserted-by":"publisher","first-page":"3","DOI":"10.3389\/fenvs.2016.00003","volume":"4","author":"SJ Capuzzi","year":"2016","unstructured":"Capuzzi SJ, Politi R, Isayev O et al (2016) QSAR modeling of Tox21 challenge stress response and nuclear receptor signaling toxicity assays. Front Environ Sci 4:3. https:\/\/doi.org\/10.3389\/fenvs.2016.00003","journal-title":"Front Environ Sci"},{"key":"468_CR24","doi-asserted-by":"publisher","first-page":"12","DOI":"10.3389\/fenvs.2016.00012","volume":"4","author":"K Ribay","year":"2016","unstructured":"Ribay K, Kim MT, Wang W et al (2016) Predictive modeling of estrogen receptor binding agents using advanced cheminformatics tools and massive public data. Front Environ Sci 4:12. https:\/\/doi.org\/10.3389\/fenvs.2016.00012","journal-title":"Front Environ Sci"},{"key":"468_CR25","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3389\/fenvs.2015.00080","volume":"3","author":"A Mayr","year":"2016","unstructured":"Mayr A, Klambauer G, Unterthiner T et al (2016) DeepTox: toxicity prediction using deep learning. Front Environ Sci 3:1\u201315. 
https:\/\/doi.org\/10.3389\/fenvs.2015.00080","journal-title":"Front Environ Sci"},{"key":"468_CR26","doi-asserted-by":"publisher","first-page":"54","DOI":"10.3389\/fenvs.2015.00054","volume":"3","author":"MN Drwal","year":"2015","unstructured":"Drwal MN, Siramshetty VB, Banerjee P et al (2015) Molecular similarity-based predictions of the Tox21 screening outcome. Front Environ Sci 3:54. https:\/\/doi.org\/10.3389\/fenvs.2015.00054","journal-title":"Front Environ Sci"},{"key":"468_CR27","doi-asserted-by":"publisher","first-page":"e0118432","DOI":"10.1371\/journal.pone.0118432","volume":"10","author":"T Saito","year":"2015","unstructured":"Saito T, Rehmsmeier M, Hood L et al (2015) The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10:e0118432. https:\/\/doi.org\/10.1371\/journal.pone.0118432","journal-title":"PLoS ONE"},{"key":"468_CR28","doi-asserted-by":"publisher","first-page":"21","DOI":"10.1016\/J.JMGM.2012.01.002","volume":"35","author":"J Chen","year":"2012","unstructured":"Chen J, Tang YY, Fang B, Guo C (2012) In silico prediction of toxic action mechanisms of phenols for imbalanced data with Random Forest learner. J Mol Graph Model 35:21\u201327. https:\/\/doi.org\/10.1016\/J.JMGM.2012.01.002","journal-title":"J Mol Graph Model"},{"key":"468_CR29","doi-asserted-by":"publisher","first-page":"93","DOI":"10.1007\/s11030-015-9649-4","volume":"20","author":"H Pham-The","year":"2016","unstructured":"Pham-The H, Casa\u00f1ola-Martin G, Garrigues T et al (2016) Exploring different strategies for imbalanced ADME data problem: case study on Caco-2 permeability modeling. Mol Divers 20:93\u2013109. 
https:\/\/doi.org\/10.1007\/s11030-015-9649-4","journal-title":"Mol Divers"},{"key":"468_CR30","doi-asserted-by":"publisher","first-page":"3935","DOI":"10.1021\/acs.molpharmaceut.7b00631","volume":"14","author":"T Lei","year":"2017","unstructured":"Lei T, Sun H, Kang Y et al (2017) ADMET evaluation in drug discovery. 18. Reliable prediction of chemical-induced urinary tract toxicity by boosting machine learning approaches. Mol Pharm 14:3935\u20133953. https:\/\/doi.org\/10.1021\/acs.molpharmaceut.7b00631","journal-title":"Mol Pharm"},{"key":"468_CR31","doi-asserted-by":"publisher","first-page":"383","DOI":"10.1007\/s10044-015-0497-8","volume":"20","author":"WM Czarnecki","year":"2017","unstructured":"Czarnecki WM, Tabor J (2017) Extreme entropy machines: robust information theoretic classification. Pattern Anal Appl 20:383\u2013400. https:\/\/doi.org\/10.1007\/s10044-015-0497-8","journal-title":"Pattern Anal Appl"},{"key":"468_CR32","doi-asserted-by":"publisher","first-page":"20","DOI":"10.1145\/1007730.1007735","volume":"6","author":"GEAPA Batista","year":"2004","unstructured":"Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 6:20\u201329. https:\/\/doi.org\/10.1145\/1007730.1007735","journal-title":"ACM SIGKDD Explor Newsl"},{"key":"468_CR33","unstructured":"NCATS Toxicology in the 21st Century (Tox21). https:\/\/ncats.nih.gov\/tox21. Accessed 11 May 2017"},{"key":"468_CR34","doi-asserted-by":"publisher","first-page":"3","DOI":"10.3389\/fenvs.2017.00003","volume":"5","author":"R Huang","year":"2016","unstructured":"Huang R, Xia M, Nguyen D-T et al (2016) Editorial: Tox21 challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental toxicants and drugs. Front Environ Sci 5:3. 
https:\/\/doi.org\/10.3389\/fenvs.2017.00003","journal-title":"Front Environ Sci"},{"key":"468_CR35","doi-asserted-by":"publisher","DOI":"10.3389\/978-2-88945-197-5","volume-title":"Tox21 Challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs","author":"R Huang","year":"2017","unstructured":"Huang R, Xia M, Nguyen D-T et al (2017) Tox21 Challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Frontiers Media, Lausanne"},{"key":"468_CR36","unstructured":"MolVS: Molecule Validation and Standardization\u2014MolVS 0.0.9 documentation. https:\/\/molvs.readthedocs.io\/en\/latest\/. Accessed 6 Feb 2018"},{"key":"468_CR37","unstructured":"Greg L RDKit: Open-source cheminformatics Software"},{"key":"468_CR38","doi-asserted-by":"publisher","first-page":"69","DOI":"10.1002\/qsar.200390007","volume":"22","author":"A Tropsha","year":"2003","unstructured":"Tropsha A, Gramatica P, Gombar V (2003) The importance of being Earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci 22:69\u201377. https:\/\/doi.org\/10.1002\/qsar.200390007","journal-title":"QSAR Comb Sci"},{"key":"468_CR39","doi-asserted-by":"publisher","first-page":"77","DOI":"10.3389\/fenvs.2015.00077","volume":"3","author":"F Stefaniak","year":"2015","unstructured":"Stefaniak F (2015) Prediction of compounds activity in nuclear receptor signaling and stress pathway assays using machine learning algorithms and low-dimensional molecular descriptors. Front Environ Sci 3:77. https:\/\/doi.org\/10.3389\/fenvs.2015.00077","journal-title":"Front Environ Sci"},{"key":"468_CR40","doi-asserted-by":"publisher","first-page":"742","DOI":"10.1021\/ci100050t","volume":"50","author":"D Rogers","year":"2010","unstructured":"Rogers D, Hahn M (2010) Extended-connectivity fingerprints. 
J Chem Inf Model 50:742\u2013754. https:\/\/doi.org\/10.1021\/ci100050t","journal-title":"J Chem Inf Model"},{"key":"468_CR41","doi-asserted-by":"publisher","first-page":"185","DOI":"10.1109\/TSMCA.2009.2029559","volume":"40","author":"C Seiffert","year":"2010","unstructured":"Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2010) RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern Part A Syst Humans 40:185\u2013197. https:\/\/doi.org\/10.1109\/TSMCA.2009.2029559","journal-title":"IEEE Trans Syst Man Cybern Part A Syst Humans"},{"key":"468_CR42","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2011.06.013","author":"V Garc\u00eda","year":"2012","unstructured":"Garc\u00eda V, S\u00e1nchez JS, Mollineda RA (2012) On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl Based Syst. https:\/\/doi.org\/10.1016\/j.knosys.2011.06.013","journal-title":"Knowl Based Syst"},{"key":"468_CR43","doi-asserted-by":"publisher","first-page":"3460","DOI":"10.1016\/J.PATCOG.2013.05.006","volume":"46","author":"M Galar","year":"2013","unstructured":"Galar M, Fern\u00e1ndez A, Barrenechea E, Herrera F (2013) EUSBoost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognit 46:3460\u20133471. https:\/\/doi.org\/10.1016\/J.PATCOG.2013.05.006","journal-title":"Pattern Recognit"},{"key":"468_CR44","doi-asserted-by":"publisher","unstructured":"Wilson DL (1972) Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Trans Syst Man Cybern 3:408\u2013421. https:\/\/doi.org\/10.1109\/TSMC.1972.4309137","DOI":"10.1109\/TSMC.1972.4309137"},{"key":"468_CR45","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","volume":"45","author":"L Breiman","year":"2001","unstructured":"Breiman L (2001) Random forests. Mach Learn 45:5\u201332. 
https:\/\/doi.org\/10.1023\/A:1010933404324","journal-title":"Mach Learn"},{"key":"468_CR46","volume-title":"Data mining : concepts and techniques","author":"J Han","year":"2011","unstructured":"Han J, Kamber M, Pei J (2011) Data mining\u202f: concepts and techniques, 3rd edn. Elsevier Science, Amsterdam","edition":"3"},{"key":"468_CR47","doi-asserted-by":"publisher","first-page":"933","DOI":"10.1038\/nmeth.4438","volume":"14","author":"N Altman","year":"2017","unstructured":"Altman N, Krzywinski M (2017) Ensemble methods: bagging and random forests. Nat Methods 14:933\u2013934. https:\/\/doi.org\/10.1038\/nmeth.4438","journal-title":"Nat Methods"},{"key":"468_CR48","doi-asserted-by":"publisher","first-page":"552","DOI":"10.1109\/TSMCA.2010.2084081","volume":"41","author":"TM Khoshgoftaar","year":"2011","unstructured":"Khoshgoftaar TM, Van Hulse J, Napolitano A (2011) Comparing boosting and bagging techniques with noisy and imbalanced data. IEEE Trans Syst Man Cybern Part A Syst Humans 41:552\u2013568. https:\/\/doi.org\/10.1109\/TSMCA.2010.2084081","journal-title":"IEEE Trans Syst Man Cybern Part A Syst Humans"},{"key":"468_CR49","doi-asserted-by":"crossref","unstructured":"Laszczyski J, Stefanowski J, Idkowiak L (2013) Extending bagging for imbalanced data. In: Burduk R., Jackowski K., Kurzynski M., Wozniak M., Zolnierek A. (eds) Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013. Advances in Intelligent Systems and Computing. Springer, Heidelberg, pp 269\u2013278","DOI":"10.1007\/978-3-319-00969-8_26"},{"key":"468_CR50","first-page":"107","volume-title":"SMOTEBoost: improving prediction of the minority class in boosting","author":"NV Chawla","year":"2003","unstructured":"Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: improving prediction of the minority class in boosting. 
Springer, Berlin, Heidelberg, pp 107\u2013119"},{"key":"468_CR51","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825\u20132830","journal-title":"J Mach Learn Res"},{"key":"468_CR52","first-page":"1","volume":"18","author":"G Lema\u00eetre","year":"2017","unstructured":"Lema\u00eetre G, Nogueira F, Aridas CK (2017) Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J Mach Learn Res 18:1\u20135","journal-title":"J Mach Learn Res"},{"key":"468_CR53","doi-asserted-by":"publisher","first-page":"e0177678","DOI":"10.1371\/journal.pone.0177678","volume":"12","author":"S Boughorbel","year":"2017","unstructured":"Boughorbel S, Jarray F, El-Anbari M (2017) Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 12:e0177678. https:\/\/doi.org\/10.1371\/journal.pone.0177678","journal-title":"PLoS ONE"},{"key":"468_CR54","first-page":"100","volume-title":"Improvements of general multiple test procedures for redundant systems of hypotheses","author":"B Bergmann","year":"1988","unstructured":"Bergmann B, Hommel G (1988) Improvements of general multiple test procedures for redundant systems of hypotheses. Springer, Berlin, Heidelberg, pp 100\u2013115"},{"key":"468_CR55","doi-asserted-by":"publisher","first-page":"2044","DOI":"10.1016\/J.INS.2009.12.010","volume":"180","author":"S Garc\u00eda","year":"2010","unstructured":"Garc\u00eda S, Fern\u00e1ndez A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci (Ny) 180:2044\u20132064. 
https:\/\/doi.org\/10.1016\/J.INS.2009.12.010","journal-title":"Inf Sci (Ny)"},{"key":"468_CR56","doi-asserted-by":"publisher","first-page":"248","DOI":"10.32614\/rj-2016-017","volume":"8","author":"B Calvo","year":"2016","unstructured":"Calvo B, Santaf\u00e9 G (2016) scmamp: Statistical comparison of multiple algorithms in multiple problems. R J 8:248\u2013256. https:\/\/doi.org\/10.32614\/rj-2016-017","journal-title":"R J"},{"key":"468_CR57","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1007\/978-1-4614-1412-4_35","volume-title":"Selected works of E L. Lehmann","author":"JL Hodges","year":"2012","unstructured":"Hodges JL, Lehmann EL (2012) Rank methods for combination of independent experiments in analysis of variance. In: Rojo J (ed) Selected works of E L. Lehmann. Springer US, Boston, MA, pp 403\u2013418"},{"key":"468_CR58","doi-asserted-by":"publisher","DOI":"10.3389\/fenvs.2016.00052","author":"G Barta","year":"2016","unstructured":"Barta G (2016) Identifying biological pathway interrupting toxins using multi-tree ensembles. Front Environ Sci. https:\/\/doi.org\/10.3389\/fenvs.2016.00052","journal-title":"Front Environ Sci"},{"key":"468_CR59","doi-asserted-by":"publisher","first-page":"9","DOI":"10.3389\/fenvs.2016.00009","volume":"4","author":"Y Uesawa","year":"2016","unstructured":"Uesawa Y (2016) Rigorous selection of random forest models for identifying compounds that activate toxicity-related pathways. Front Environ Sci 4:9. https:\/\/doi.org\/10.3389\/fenvs.2016.00009","journal-title":"Front Environ Sci"},{"key":"468_CR60","doi-asserted-by":"publisher","first-page":"181","DOI":"10.1023\/A:1022859003006","volume":"51","author":"LI Kuncheva","year":"2003","unstructured":"Kuncheva LI, Whitaker CJ (2003) Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. 
Mach Learn 51:181\u2013207","journal-title":"Mach Learn"},{"key":"468_CR61","doi-asserted-by":"publisher","first-page":"27","DOI":"10.1016\/J.PATREC.2008.08.010","volume":"30","author":"C Ferri","year":"2009","unstructured":"Ferri C, Hern\u00e1ndez-Orallo J, Modroiu R (2009) An experimental comparison of performance measures for classification. Pattern Recognit Lett 30:27\u201338. https:\/\/doi.org\/10.1016\/J.PATREC.2008.08.010","journal-title":"Pattern Recognit Lett"},{"key":"468_CR62","doi-asserted-by":"crossref","unstructured":"Jeni LA, Cohn JF, De La Torre F (2013) Facing imbalanced data\u2014recommendations for the use of performance metrics. In: 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE, New York, pp 245\u2013251","DOI":"10.1109\/ACII.2013.47"},{"key":"468_CR63","doi-asserted-by":"publisher","first-page":"525","DOI":"10.1021\/ci020058s","volume":"43","author":"W Tong","year":"2003","unstructured":"Tong W, Hong H, Fang H et al (2003) Decision forest: combining the predictions of multiple independent decision tree models. J Chem Inf Comput Sci 43:525\u2013531. https:\/\/doi.org\/10.1021\/ci020058s","journal-title":"J Chem Inf Comput Sci"},{"key":"468_CR64","doi-asserted-by":"publisher","first-page":"92989","DOI":"10.18632\/oncotarget.21723","volume":"8","author":"S Sakkiah","year":"2017","unstructured":"Sakkiah S, Selvaraj C, Gong P et al (2017) Development of estrogen receptor beta binding prediction model using large sets of chemicals. Oncotarget 8:92989\u201393000. https:\/\/doi.org\/10.18632\/oncotarget.21723","journal-title":"Oncotarget"},{"key":"468_CR65","doi-asserted-by":"publisher","first-page":"1069","DOI":"10.1016\/j.drudis.2014.02.003","volume":"19","author":"M Cruz-Monteagudo","year":"2014","unstructured":"Cruz-Monteagudo M, Medina-Franco JL, P\u00e9rez-Castillo Y et al (2014) Activity cliffs in drug discovery: Dr Jekyll or Mr Hyde? Drug Discov Today 19:1069\u20131080. 
https:\/\/doi.org\/10.1016\/j.drudis.2014.02.003","journal-title":"Drug Discov Today"},{"key":"468_CR66","doi-asserted-by":"publisher","first-page":"14360","DOI":"10.1021\/acsomega.9b02221","volume":"4","author":"D Stumpfe","year":"2019","unstructured":"Stumpfe D, Hu H, Bajorath J (2019) Evolving concept of activity cliffs. ACS Omega 4:14360","journal-title":"ACS Omega"},{"key":"468_CR67","doi-asserted-by":"publisher","DOI":"10.12785\/amis\/071L50","volume-title":"Classification for imbalanced and overlapping classes using outlier detection and sampling techniques","author":"Z Yang","year":"2013","unstructured":"Yang Z, Gao D (2013) Classification for imbalanced and overlapping classes using outlier detection and sampling techniques. NSP Natural Sciences Publishing, New York"},{"key":"468_CR68","doi-asserted-by":"publisher","first-page":"2","DOI":"10.3389\/fenvs.2016.00002","volume":"4","author":"A Abdelaziz","year":"2016","unstructured":"Abdelaziz A, Spahn-Langguth H, Schramm K-W, Tetko IV (2016) Consensus modeling for HTS assays using in silico descriptors calculates the best balanced accuracy in Tox21 challenge. Front Environ Sci 4:2. https:\/\/doi.org\/10.3389\/fenvs.2016.00002","journal-title":"Front Environ Sci"},{"key":"468_CR69","doi-asserted-by":"publisher","first-page":"3244","DOI":"10.1021\/ci400527b","volume":"53","author":"Q Zang","year":"2013","unstructured":"Zang Q, Rotroff DM, Judson RS (2013) Binary classification of a large collection of environmental chemicals from estrogen receptor assays by quantitative structure-activity relationship and machine learning methods. J Chem Inf Model 53:3244\u20133261. 
https:\/\/doi.org\/10.1021\/ci400527b","journal-title":"J Chem Inf Model"},{"key":"468_CR70","doi-asserted-by":"publisher","first-page":"1044","DOI":"10.3389\/fphys.2019.01044","volume":"10","author":"G Idakwo","year":"2019","unstructured":"Idakwo G, Thangapandian S, Luttrell J et al (2019) Deep learning-based structure-activity relationship modeling for multi-category toxicity classification: a case study of 10KTox21 chemicals with high-throughput cell-based androgen receptor bioassay data. Front Physiol 10:1044. https:\/\/doi.org\/10.3389\/fphys.2019.01044","journal-title":"Front Physiol"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-020-00468-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-020-00468-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-020-00468-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,10,26]],"date-time":"2021-10-26T23:31:03Z","timestamp":1635291063000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-020-00468-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,27]]},"references-count":70,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["468"],"URL":"https:\/\/doi.org\/10.1186\/s13321-020-00468-x","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,10,27]]},"assertion":[{"value":"13 December 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article 
History"}},{"value":"13 October 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 October 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential competing interest.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"66"}}