{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,28]],"date-time":"2026-04-28T04:34:10Z","timestamp":1777350850901,"version":"3.51.4"},"reference-count":69,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,5,19]],"date-time":"2023-05-19T00:00:00Z","timestamp":1684454400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,5,19]],"date-time":"2023-05-19T00:00:00Z","timestamp":1684454400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"The University of Auckland Doctoral Scholarship"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Background<\/jats:title>\n                    <jats:p>Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers\u2019 experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. 
Therefore, consistent usage of these predictive models shapes the dataset and causes a continuous specialization that shrinks the applicability domain of all models trained on this dataset in the future and increasingly harms model-based exploration of the space.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Proposed solution<\/jats:title>\n                    <jats:p>\n                      In this paper, we propose\n                      <jats:sc>cancels<\/jats:sc>\n                      (\n                      <jats:bold>C<\/jats:bold>\n                      ounter\n                      <jats:bold>A<\/jats:bold>\n                      cti\n                      <jats:bold>N<\/jats:bold>\n                      g\n                      <jats:bold>C<\/jats:bold>\n                      ompound sp\n                      <jats:bold>E<\/jats:bold>\n                      cia\n                      <jats:bold>L<\/jats:bold>\n                      ization bia\n                      <jats:bold>S<\/jats:bold>\n                      ), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. 
Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data.\n                      <jats:sc>cancels<\/jats:sc>\n                      does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.\n                    <\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>\n                      An extensive set of experiments on the use-case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that\n                      <jats:sc>cancels<\/jats:sc>\n                      produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial, as it not only intervenes in the continuous specialization process but also significantly improves a predictor\u2019s performance while reducing the number of required experiments. Overall, we believe that\n                      <jats:sc>cancels<\/jats:sc>\n                      can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. 
All code is available under\n                      <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/KatDost\/Cancels\">github.com\/KatDost\/Cancels<\/jats:ext-link>\n                      .\n                    <\/jats:p>\n                  <\/jats:sec>","DOI":"10.1186\/s13321-023-00716-w","type":"journal-article","created":{"date-parts":[[2023,5,19]],"date-time":"2023-05-19T04:02:43Z","timestamp":1684468963000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Combatting over-specialization bias in growing chemical databases"],"prefix":"10.1186","volume":"15","author":[{"given":"Katharina","family":"Dost","sequence":"first","affiliation":[]},{"given":"Zac","family":"Pullar-Strecker","sequence":"additional","affiliation":[]},{"given":"Liam","family":"Brydon","sequence":"additional","affiliation":[]},{"given":"Kunyang","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Jasmin","family":"Hafner","sequence":"additional","affiliation":[]},{"given":"Patricia J.","family":"Riddle","sequence":"additional","affiliation":[]},{"given":"J\u00f6rg S.","family":"Wicker","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,5,19]]},"reference":[{"issue":"6334","key":"716_CR1","doi-asserted-by":"publisher","first-page":"183","DOI":"10.1126\/science.aal4230","volume":"356","author":"A Caliskan","year":"2017","unstructured":"Caliskan A, Bryson JJ, Narayanan A (2017) Semantics derived automatically from language corpora contain human-like biases. Science 356(6334):183\u2013186. 
https:\/\/doi.org\/10.1126\/science.aal4230","journal-title":"Science"},{"issue":"3","key":"716_CR2","doi-asserted-by":"publisher","first-page":"947","DOI":"10.1021\/acs.jcim.8b00712","volume":"59","author":"J Sieg","year":"2019","unstructured":"Sieg J, Flachsenberg F, Rarey M (2019) In need of bias control: evaluating chemical data for machine learning in structure-based virtual screening. J Chem Inf Model 59(3):947\u2013961. https:\/\/doi.org\/10.1021\/acs.jcim.8b00712","journal-title":"J Chem Inf Model"},{"issue":"7","key":"716_CR3","doi-asserted-by":"publisher","first-page":"479","DOI":"10.1038\/nchembio.180","volume":"5","author":"J Hert","year":"2009","unstructured":"Hert J, Irwin JJ, Laggner C, Keiser MJ, Shoichet BK (2009) Quantifying biogenic bias in screening libraries. Nat Chem Biol 5(7):479\u2013483. https:\/\/doi.org\/10.1038\/nchembio.180","journal-title":"Nat Chem Biol"},{"issue":"1","key":"716_CR4","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-022-00582-y","volume":"14","author":"A Kerstjens","year":"2022","unstructured":"Kerstjens A, De Winter H (2022) LEADD: lamarckian evolutionary algorithm for de novo drug design. J Cheminform 14(1):1\u201320. https:\/\/doi.org\/10.1186\/s13321-022-00582-y","journal-title":"J Cheminform"},{"issue":"3","key":"716_CR5","doi-asserted-by":"publisher","first-page":"359","DOI":"10.1016\/j.cbpa.2008.03.015","volume":"12","author":"E Gregori-Puigjan\u00e9","year":"2008","unstructured":"Gregori-Puigjan\u00e9 E, Mestres J (2008) Coverage and bias in chemical library design. Curr Opin Chem Biol 12(3):359\u2013365. 
https:\/\/doi.org\/10.1016\/j.cbpa.2008.03.015","journal-title":"Curr Opin Chem Biol"},{"issue":"1","key":"716_CR6","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-016-0182-y","volume":"8","author":"N Aniceto","year":"2016","unstructured":"Aniceto N, Freitas AA, Bender A, Ghafourian T (2016) A novel applicability domain technique for mapping predictive reliability across the chemical space of a QSAR: reliability-density neighbourhood. J Cheminform 8(1):1\u201320. https:\/\/doi.org\/10.1186\/s13321-016-0182-y","journal-title":"J Cheminform"},{"key":"716_CR7","doi-asserted-by":"publisher","first-page":"27","DOI":"10.1186\/1758-2946-5-27","volume":"5","author":"F Sahigara","year":"2013","unstructured":"Sahigara F, Ballabio D, Todeschini R, Consonni V (2013) Defining a novel k-nearest neighbours approach to assess the applicability domain of a QSAR model for reliable predictions. J Cheminform 5:27. https:\/\/doi.org\/10.1186\/1758-2946-5-27","journal-title":"J Cheminform"},{"issue":"3\u20134","key":"716_CR8","doi-asserted-by":"publisher","first-page":"147","DOI":"10.1007\/s10822-007-9150-y","volume":"22","author":"AE Cleves","year":"2008","unstructured":"Cleves AE, Jain AN (2008) Effects of inductive bias on computational evaluations of ligand-based modeling and on drug discovery. J Comput Aided Mol Des 22(3\u20134):147\u2013159. https:\/\/doi.org\/10.1007\/s10822-007-9150-y","journal-title":"J Comput Aided Mol Des"},{"issue":"7773","key":"716_CR9","doi-asserted-by":"publisher","first-page":"251","DOI":"10.1038\/s41586-019-1540-5","volume":"573","author":"X Jia","year":"2019","unstructured":"Jia X, Lynch A, Huang Y, Danielson M, Lang\u2019at I, Milder A, Ruby AE, Wang H, Friedler SA, Norquist AJ, Schrier J (2019) Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis. Nature 573(7773):251\u2013255. 
https:\/\/doi.org\/10.1038\/s41586-019-1540-5","journal-title":"Nature"},{"issue":"1","key":"716_CR10","doi-asserted-by":"publisher","first-page":"1","DOI":"10.2200\/S00429ED1V01Y201207AIM018","volume":"6","author":"B Settles","year":"2012","unstructured":"Settles B (2012) Active learning. Synth Lect Artif Intell Mach Learn 6(1):1\u2013114. https:\/\/doi.org\/10.2200\/S00429ED1V01Y201207AIM018","journal-title":"Synth Lect Artif Intell Mach Learn"},{"key":"716_CR11","volume-title":"Can You Trust Your Model\u2019s Uncertainty?","author":"Y Ovadia","year":"2019","unstructured":"Ovadia Y, Fertig E, Ren J, Nado Z, Sculley D, Nowozin S, Dillon JV, Lakshminarayanan B, Snoek J (2019) Can You Trust Your Model\u2019s Uncertainty? Evaluating Predictive Uncertainty under Dataset Shift. Curran Associates Inc., Red Hook, NY, USA"},{"key":"716_CR12","doi-asserted-by":"publisher","unstructured":"Dost K, Taskova K, Riddle P, Wicker J (2020) Your best guess when you know nothing: identification and mitigation of selection bias. In: 20th IEEE International Conference on Data Mining, ICDM 2020, Sorrento, Italy, November 17-20, 2020, IEEE, New York, pp 996\u20131001. https:\/\/doi.org\/10.1109\/ICDM50108.2020.00115","DOI":"10.1109\/ICDM50108.2020.00115"},{"key":"716_CR13","doi-asserted-by":"publisher","unstructured":"Dost K, Duncanson H, Ziogas I, Riddle P, Wicker J (2022) Divide and imitate: Multi-cluster identification and mitigation of selection bias. In: Advances in Knowledge Discovery and Data Mining\u201426th Pacific-Asia Conference, PAKDD 2022. Lecture Notes in Computer Science, vol 13281, Springer, Cham, pp 149\u2013160. 
https:\/\/doi.org\/10.1007\/978-3-031-05936-0_12","DOI":"10.1007\/978-3-031-05936-0_12"},{"issue":"4","key":"716_CR14","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3390\/ijms22041676","volume":"22","author":"VD Mouchlis","year":"2021","unstructured":"Mouchlis VD, Afantitis A, Serra A, Fratello M, Papadiamantis AG, Aidinis V, Lynch I, Greco D, Melagraki G (2021) Advances in de novo drug design: from conventional to machine learning methods. Int J Mol Sci 22(4):1\u201322. https:\/\/doi.org\/10.3390\/ijms22041676","journal-title":"Int J Mol Sci"},{"issue":"32","key":"716_CR15","doi-asserted-by":"publisher","first-page":"10792","DOI":"10.1002\/anie.201814681","volume":"58","author":"G Schneider","year":"2019","unstructured":"Schneider G, Clark DE (2019) Automated de novo drug design: are we nearly there yet? Angew Chem Int Ed 58(32):10792\u201310803. https:\/\/doi.org\/10.1002\/anie.201814681","journal-title":"Angew Chem Int Ed"},{"issue":"1","key":"716_CR16","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-021-00501-7","volume":"13","author":"Y Kwon","year":"2021","unstructured":"Kwon Y, Lee J (2021) MolFinder: an evolutionary algorithm for the global optimization of molecular properties and the extensive exploration of chemical space using SMILES. J Cheminform 13(1):1\u201314. https:\/\/doi.org\/10.1186\/s13321-021-00501-7","journal-title":"J Cheminform"},{"issue":"9","key":"716_CR17","doi-asserted-by":"publisher","first-page":"4077","DOI":"10.1021\/acs.jmedchem.5b01849","volume":"59","author":"P Schneider","year":"2016","unstructured":"Schneider P, Schneider G (2016) De novo design at the edge of chaos. J Med Chem 59(9):4077\u20134086. https:\/\/doi.org\/10.1021\/acs.jmedchem.5b01849. 
(PMID: 26881908)","journal-title":"J Med Chem"},{"issue":"1","key":"716_CR18","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-019-0341-z","volume":"11","author":"J Ar\u00fas-Pous","year":"2019","unstructured":"Ar\u00fas-Pous J, Blaschke T, Ulander S, Reymond JL, Chen H, Engkvist O (2019) Exploring the GDB-13 chemical space using deep generative models. J Cheminform 11(1):1\u201314. https:\/\/doi.org\/10.1186\/s13321-019-0341-z","journal-title":"J Cheminform"},{"key":"716_CR19","doi-asserted-by":"publisher","unstructured":"Kang SG, Morrone JA, Weber JK, Cornell WD (2022) Analysis of training and seed bias in small molecules generated with a conditional graph-based variational autoencoder\u2014insights for practical AI-driven molecule generation. J Chem Inf Model 62(4):801\u2013816. https:\/\/doi.org\/10.1021\/acs.jcim.1c01545","DOI":"10.1021\/acs.jcim.1c01545"},{"issue":"1","key":"716_CR20","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-021-00498-z","volume":"13","author":"T Pereira","year":"2021","unstructured":"Pereira T, Abbasi M, Ribeiro B, Arrais JP (2021) Diversity oriented deep reinforcement learning for targeted molecule generation. J Cheminform 13(1):1\u201317. https:\/\/doi.org\/10.1186\/s13321-021-00498-z","journal-title":"J Cheminform"},{"issue":"1","key":"716_CR21","first-page":"9074","volume":"28","author":"E Bareinboim","year":"2014","unstructured":"Bareinboim E, Tian J, Pearl J (2014) Recovering from selection bias in causal and statistical inference. Proc AAAI Conf Artif Intell. 28(1):9074","journal-title":"Proc AAAI Conf Artif Intell."},{"issue":"3","key":"716_CR22","doi-asserted-by":"publisher","first-page":"621","DOI":"10.1093\/bjps\/axs046","volume":"65","author":"A Lyon","year":"2014","unstructured":"Lyon A (2014) Why are normal distributions normal? Br J Philos Sci 65(3):621\u2013649. 
https:\/\/doi.org\/10.1093\/bjps\/axs046","journal-title":"Br J Philos Sci"},{"issue":"3","key":"716_CR23","doi-asserted-by":"publisher","first-page":"773","DOI":"10.1215\/S0012-7094-48-01568-3","volume":"15","author":"W Hoeffding","year":"1948","unstructured":"Hoeffding W, Robbins H (1948) The central limit theorem for dependent random variables. Duke Math J 15(3):773\u2013780. https:\/\/doi.org\/10.1215\/S0012-7094-48-01568-3","journal-title":"Duke Math J"},{"issue":"4","key":"716_CR24","doi-asserted-by":"publisher","first-page":"411","DOI":"10.1016\/S0893-6080(00)00026-5","volume":"13","author":"A Hyv\u00e4rinen","year":"2000","unstructured":"Hyv\u00e4rinen A, Oja E (2000) Independent component analysis: algorithms and applications. Neural Netw 13(4):411\u2013430. https:\/\/doi.org\/10.1016\/S0893-6080(00)00026-5","journal-title":"Neural Netw"},{"issue":"10","key":"716_CR25","doi-asserted-by":"publisher","first-page":"781","DOI":"10.1007\/978-981-15-5971-6_83","volume":"194","author":"S Panigrahi","year":"2021","unstructured":"Panigrahi S, Nanda A, Swarnkar T (2021) A survey on transfer learning. Smart Innov Syst Technol 194(10):781\u2013789. https:\/\/doi.org\/10.1007\/978-981-15-5971-6_83","journal-title":"Smart Innov Syst Technol"},{"key":"716_CR26","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/9780262170055.003.0008","author":"A Gretton","year":"2013","unstructured":"Gretton A, Smola A, Huang J, Schmittfull M, Borgwardt K, Sch\u00f6lkopf B (2013) Covariate shift by Kernel mean matching. Dataset Shift Mach Learn. https:\/\/doi.org\/10.7551\/mitpress\/9780262170055.003.0008","journal-title":"Dataset Shift Mach Learn"},{"key":"716_CR27","doi-asserted-by":"publisher","unstructured":"McGaughey G, Walters W, Goldman B (2016) Understanding covariate shift in model performance. F1000Research. 
https:\/\/doi.org\/10.12688\/f1000research.8317.1","DOI":"10.12688\/f1000research.8317.1"},{"key":"716_CR28","doi-asserted-by":"publisher","unstructured":"Bickel S, Br\u00fcckner M, Scheffer T (2007) Discriminative learning for differing training and test distributions. In: Proceedings of the 24th International Conference on Machine Learning. ICML \u201907, Association for Computing Machinery, New York, NY, USA, pp 81\u201388. https:\/\/doi.org\/10.1145\/1273496.1273507","DOI":"10.1145\/1273496.1273507"},{"key":"716_CR29","doi-asserted-by":"publisher","unstructured":"Cortes C, Mohri M, Riley M, Rostamizadeh A (2008) Sample selection bias correction theory. In: Proceedings of the 19th International Conference on Algorithmic Learning Theory. ALT \u201908, Springer, Berlin, Heidelberg, pp 38\u201353. https:\/\/doi.org\/10.1007\/978-3-540-87987-9_8","DOI":"10.1007\/978-3-540-87987-9_8"},{"key":"716_CR30","doi-asserted-by":"publisher","unstructured":"Zadrozny B (2004) Learning and evaluating classifiers under sample selection bias. In: Proceedings of the Twenty-First International Conference on Machine Learning. ICML \u201904, Association for Computing Machinery, New York, NY, USA, p 114. https:\/\/doi.org\/10.1145\/1015330.1015425","DOI":"10.1145\/1015330.1015425"},{"key":"716_CR31","doi-asserted-by":"publisher","unstructured":"Huang J, Smola AJ, Gretton A, Borgwardt KM, Sch\u00f6lkopf B (2007) Correcting sample selection bias by unlabeled data. In: Advances in Neural Information Processing Systems, pp 601\u2013608. https:\/\/doi.org\/10.7551\/mitpress\/7503.003.0080","DOI":"10.7551\/mitpress\/7503.003.0080"},{"issue":"1\u20133","key":"716_CR32","doi-asserted-by":"publisher","first-page":"191","DOI":"10.1023\/A:1012406528296","volume":"46","author":"Y Lin","year":"2002","unstructured":"Lin Y, Lee Y, Wahba G (2002) Support vector machines for classification in nonstandard situations. Mach Learn 46(1\u20133):191\u2013202. 
https:\/\/doi.org\/10.1023\/A:1012406528296","journal-title":"Mach Learn"},{"key":"716_CR33","doi-asserted-by":"publisher","unstructured":"Sugiyama M, M\u00fcller K-R (2005) Input-dependent estimation of generalization error under covariate shift 23(4):249\u2013279. https:\/\/doi.org\/10.1524\/stnd.2005.23.4.249","DOI":"10.1524\/stnd.2005.23.4.249"},{"key":"716_CR34","unstructured":"Baum EB, Lang K (1992) Query learning can work poorly when a human oracle is used. In: International Joint Conference on Neural Networks, vol 8, p 8"},{"issue":"24","key":"716_CR35","doi-asserted-by":"publisher","DOI":"10.1063\/1.5023802","volume":"148","author":"JS Smith","year":"2018","unstructured":"Smith JS, Nebgen B, Lubbers N, Isayev O, Roitberg AE (2018) Less is more: sampling chemical space with active learning. J Chem Phys 148(24):241733. https:\/\/doi.org\/10.1063\/1.5023802","journal-title":"J Chem Phys"},{"issue":"4","key":"716_CR36","doi-asserted-by":"publisher","first-page":"458","DOI":"10.1016\/j.drudis.2014.12.004","volume":"20","author":"D Reker","year":"2015","unstructured":"Reker D, Schneider G (2015) Active-learning strategies in computer-assisted drug discovery. Drug Discov Today 20(4):458\u2013465. https:\/\/doi.org\/10.1016\/j.drudis.2014.12.004","journal-title":"Drug Discov Today"},{"key":"716_CR37","doi-asserted-by":"publisher","DOI":"10.1016\/j.comtox.2020.100129","volume":"15","author":"A Habib Polash","year":"2020","unstructured":"Habib Polash A, Nakano T, Rakers C, Takeda S, Brown JB (2020) Active learning efficiently converges on rational limits of toxicity prediction and identifies patterns for molecule design. Comput Toxicol 15:100129. 
https:\/\/doi.org\/10.1016\/j.comtox.2020.100129","journal-title":"Comput Toxicol"},{"issue":"4","key":"716_CR38","doi-asserted-by":"publisher","first-page":"381","DOI":"10.4155\/fmc-2016-0197","volume":"9","author":"D Reker","year":"2017","unstructured":"Reker D, Schneider P, Schneider G, Brown J (2017) Active learning for computational chemogenomics. Future Med Chem 9(4):381\u2013402. https:\/\/doi.org\/10.4155\/fmc-2016-0197","journal-title":"Future Med Chem"},{"issue":"7","key":"716_CR39","doi-asserted-by":"publisher","first-page":"1211","DOI":"10.1021\/acsestengg.1c00434","volume":"2","author":"S Zhong","year":"2022","unstructured":"Zhong S, Lambeth DR, Igou TK, Chen Y (2022) Enlarging applicability domain of quantitative structure-activity relationship models through uncertainty-based active learning. ACS ES &T Eng 2(7):1211\u20131220. https:\/\/doi.org\/10.1021\/acsestengg.1c00434","journal-title":"ACS ES &T Eng"},{"issue":"9","key":"716_CR40","doi-asserted-by":"publisher","first-page":"1278","DOI":"10.1016\/j.neunet.2008.06.004","volume":"21","author":"M Sugiyama","year":"2008","unstructured":"Sugiyama M, Rubens N (2008) A batch ensemble approach to active learning with model selection. Neural Netw 21(9):1278\u20131286.","journal-title":"Neural Netw"},{"issue":"1","key":"716_CR41","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1002\/(SICI)1098-1128(199601)16:1<3::AID-MED1>3.0.CO;2-6","volume":"16","author":"RS Bohacek","year":"1996","unstructured":"Bohacek RS, McMartin C, Guida WC (1996) The art and practice of structure-based drug design: a molecular modeling perspective. 
Med Res Rev 16(1):3\u201350","journal-title":"Med Res Rev"},{"issue":"1","key":"716_CR42","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-020-00468-x","volume":"12","author":"G Idakwo","year":"2020","unstructured":"Idakwo G, Thangapandian S, Luttrell J, Li Y, Wang N, Zhou Z, Hong H, Yang B, Zhang C, Gong P (2020) Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets. J Cheminform 12(1):1\u201319. https:\/\/doi.org\/10.1186\/s13321-020-00468-x","journal-title":"J Cheminform"},{"key":"716_CR43","doi-asserted-by":"publisher","DOI":"10.1016\/j.compbiomed.2020.104197","volume":"130","author":"T Stepis\u030cnik","year":"2021","unstructured":"Stepis\u030cnik T, S\u030ckrlj B, Wicker J, Kocev D, (2021) A comprehensive comparison of molecular feature representations for use in predictive modeling. Comput Biol Med 130:104197. https:\/\/doi.org\/10.1016\/j.compbiomed.2020.104197","journal-title":"Comput Biol Med"},{"issue":"D1","key":"716_CR44","doi-asserted-by":"publisher","first-page":"1388","DOI":"10.1093\/nar\/gkaa971","volume":"49","author":"S Kim","year":"2020","unstructured":"Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, Zaslavsky L, Zhang J, Bolton EE (2020) PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res 49(D1):1388\u20131395. https:\/\/doi.org\/10.1093\/nar\/gkaa971","journal-title":"Nucleic Acids Res"},{"issue":"1","key":"716_CR45","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-021-00506-2","volume":"13","author":"H Kuwahara","year":"2021","unstructured":"Kuwahara H, Gao X (2021) Analysis of the effects of related fingerprints on molecular similarity using an eigenvalue entropy approach. J Cheminform 13(1):1\u201312. 
https:\/\/doi.org\/10.1186\/s13321-021-00506-2","journal-title":"J Cheminform"},{"key":"716_CR46","doi-asserted-by":"publisher","first-page":"387","DOI":"10.1007\/s10822-014-9819-y","volume":"29","author":"E Martin","year":"2015","unstructured":"Martin E, Cao E (2015) Euclidean chemical spaces from molecular fingerprints: Hamming distance and Hempel\u2019s ravens. J Comput Aided Mol Des 29:387\u2013395. https:\/\/doi.org\/10.1007\/s10822-014-9819-y","journal-title":"J Comput Aided Mol Des"},{"issue":"1","key":"716_CR47","first-page":"27","volume":"41","author":"A Mead","year":"1992","unstructured":"Mead A (1992) Review of the development of multidimensional scaling methods. J R Stat Soc Series D 41(1):27\u201339","journal-title":"J R Stat Soc Series D"},{"key":"716_CR48","doi-asserted-by":"publisher","unstructured":"Granichin O, Volkovich Z, Toledano-Kitai D (2015) Randomized algorithms in automatic control and data mining vol 67. https:\/\/doi.org\/10.1007\/978-3-642-54786-7","DOI":"10.1007\/978-3-642-54786-7"},{"key":"716_CR49","unstructured":"Dost K (2022) CANCELS experiments and implementation. https:\/\/github.com\/KatDost\/Cancels. Accessed 21 Sep 2022"},{"key":"716_CR50","doi-asserted-by":"publisher","DOI":"10.1039\/C6EM00697C","author":"D Latino","year":"2017","unstructured":"Latino D, Wicker J, G\u00fctlein M, Schmid E, Kramer S, Fenner K (2017) Eawag-soil in envipath: a new resource for exploring regulatory pesticide soil biodegradation pathways and half-life data. Enviro Sci Process Impact. https:\/\/doi.org\/10.1039\/C6EM00697C","journal-title":"Enviro Sci Process Impact"},{"issue":"6","key":"716_CR51","doi-asserted-by":"publisher","first-page":"814","DOI":"10.1093\/bioinformatics\/btq024","volume":"26","author":"J Wicker","year":"2010","unstructured":"Wicker J, Fenner K, Ellis L, Wackett L, Kramer S (2010) Predicting biodegradation products and pathways: a hybrid knowledge- and machine learning-based approach. Bioinformatics 26(6):814\u2013821. 
https:\/\/doi.org\/10.1093\/bioinformatics\/btq024","journal-title":"Bioinformatics"},{"key":"716_CR52","doi-asserted-by":"publisher","first-page":"75","DOI":"10.1007\/978-3-319-31858-5_5","volume-title":"Comput Sustain","author":"J Wicker","year":"2016","unstructured":"Wicker J, Fenner K, Kramer S (2016) A hybrid machine learning and knowledge based approach to limit combinatorial explosion in biodegradation prediction. In: L\u00e4ssig J, Kersting K, Morik K (eds) Comput Sustain. Springer, Cham, pp 75\u201397"},{"issue":"D1","key":"716_CR53","doi-asserted-by":"publisher","first-page":"502","DOI":"10.1093\/nar\/gkv1229","volume":"44","author":"J Wicker","year":"2016","unstructured":"Wicker J, Lorsbach T, G\u00fctlein M, Schmid E, Latino D, Kramer S, Fenner K (2016) Envipath - the environmental contaminant biotransformation pathway resource. Nucleic Acid Res 44(D1):502\u2013508. https:\/\/doi.org\/10.1093\/nar\/gkv1229","journal-title":"Nucleic Acid Res"},{"issue":"1","key":"716_CR54","doi-asserted-by":"publisher","first-page":"63","DOI":"10.1186\/s13321-021-00543-x","volume":"13","author":"J Tam","year":"2021","unstructured":"Tam J, Lorsbach T, Schmidt S, Wicker J (2021) Holistic evaluation of biodegradation pathway prediction: assessing multi-step reactions and intermediate products. J Cheminform 13(1):63. https:\/\/doi.org\/10.1186\/s13321-021-00543-x","journal-title":"J Cheminform"},{"key":"716_CR55","doi-asserted-by":"publisher","DOI":"10.3389\/fenvs.2015.00080","author":"A Mayr","year":"2016","unstructured":"Mayr A, Klambauer G, Unterthiner T, Hochreiter S (2016) Deeptox: toxicity prediction using deep learning. Front Environ Sci. 
https:\/\/doi.org\/10.3389\/fenvs.2015.00080","journal-title":"Front Environ Sci"},{"key":"716_CR56","doi-asserted-by":"publisher","DOI":"10.3389\/fenvs.2015.00085","author":"R Huang","year":"2016","unstructured":"Huang R, Xia M, Nguyen D-T, Zhao T, Sakamuru S, Zhao J, Shahane SA, Rossoshek A, Simeonov A (2016) Tox21challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Front Environ Sci. https:\/\/doi.org\/10.3389\/fenvs.2015.00085","journal-title":"Front Environ Sci"},{"issue":"3","key":"716_CR57","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1007\/s10994-011-5256-5","volume":"85","author":"J Read","year":"2011","unstructured":"Read J, Pfahringer B, Holmes G, Frank E (2011) Classifier chains for multi-label classification. Mach Learn 85(3):333\u2013359","journal-title":"Mach Learn"},{"key":"716_CR58","first-page":"1","volume":"7","author":"J Dem\u0161ar","year":"2006","unstructured":"Dem\u0161ar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1\u201330","journal-title":"J Mach Learn Res"},{"key":"716_CR59","doi-asserted-by":"publisher","unstructured":"Herbold, S (2020) Autorank: A python package for automated ranking of classifiers. J Open Source Softw 5(48), 2173. https:\/\/doi.org\/10.21105\/joss.02173","DOI":"10.21105\/joss.02173"},{"key":"716_CR60","doi-asserted-by":"publisher","first-page":"1692","DOI":"10.1039\/C8SC04175J","volume":"10","author":"R Winter","year":"2019","unstructured":"Winter R, Montanari F, No\u00e9 F, Clevert D-A (2019) Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci 10:1692\u20131701. 
https:\/\/doi.org\/10.1039\/C8SC04175J","journal-title":"Chem Sci"},{"issue":"7","key":"716_CR61","doi-asserted-by":"publisher","first-page":"1466","DOI":"10.1002\/jcc.21707","volume":"32","author":"CW Yap","year":"2011","unstructured":"Yap CW (2011) Padel-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32(7):1466\u20131474. https:\/\/doi.org\/10.1002\/jcc.21707","journal-title":"J Comput Chem"},{"key":"716_CR62","doi-asserted-by":"publisher","unstructured":"Gladysz R, Dos\u00a0Santos F.M, Langenaeker W, Thijs G, Augustyns K, De\u00a0Winter H (2018) Spectrophores as one-dimensional descriptors calculated from three-dimensional atomic properties: applications ranging from scaffold hopping to multi-target virtual screening. Journal of Cheminformatics 10(1). https:\/\/doi.org\/10.1186\/s13321-018-0268-9","DOI":"10.1186\/s13321-018-0268-9"},{"issue":"1","key":"716_CR63","doi-asserted-by":"publisher","first-page":"27","DOI":"10.1021\/acs.jcim.7b00616","volume":"58","author":"S Jaeger","year":"2018","unstructured":"Jaeger S, Fulle S, Turk S (2018) Mol2vec: Unsupervised machine learning approach with chemical intuition. J Chem Inf Model 58(1):27\u201335. https:\/\/doi.org\/10.1021\/acs.jcim.7b00616","journal-title":"J Chem Inf Model"},{"key":"716_CR64","unstructured":"enviPath UG\u00a0 & Co.\u00a0KG: SOIL dataset. https:\/\/envipath.org\/package\/5882df9c-dae1-4d80-a40e-db4724271456. Accessed 21 Sep 2022"},{"key":"716_CR65","unstructured":"enviPath UG\u00a0 & Co.\u00a0KG: BBD dataset. https:\/\/envipath.org\/package\/32de3cf4-e3e6-4168-956e-32fa5ddb0ce1. Accessed 21 Sep 2022"},{"key":"716_CR66","unstructured":"enviPath UG\u00a0 & Co.\u00a0KG: enviPath. https:\/\/envipath.org. Accessed 21 Sep 2022"},{"key":"716_CR67","unstructured":"National Center for Biotechnology Information: PubChem. https:\/\/pubchem.ncbi.nlm.nih.gov. 
Accessed 21 Sep 2022"},{"key":"716_CR68","unstructured":"National Center for Advancing Translational Sciences: Tox21 Data Challenge. https:\/\/tripod.nih.gov\/tox21\/challenge. Accessed 21 Sep 2022"},{"key":"716_CR69","unstructured":"Dost K, Brydon L (2022) PyPI Package \u201cimitatebias\u201d. https:\/\/pypi.org\/project\/imitatebias Accessed 21 Sep 2022"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00716-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-023-00716-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00716-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,19]],"date-time":"2023-05-19T04:08:34Z","timestamp":1684469314000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-023-00716-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,19]]},"references-count":69,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["716"],"URL":"https:\/\/doi.org\/10.1186\/s13321-023-00716-w","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-2133331\/v1","asserted-by":"object"}]},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,5,19]]},"assertion":[{"value":"4 October 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 March 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article 
History"}},{"value":"19 May 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"J\u00f6rg Wicker (co-founder, CTO) and Katharina Dost are employees of enviPath UG & Co. KG, a scientific software development company that develops and maintains the enviPath system. The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"53"}}