{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T01:44:15Z","timestamp":1772502255116,"version":"3.50.1"},"reference-count":30,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,12,1]],"date-time":"2021-12-01T00:00:00Z","timestamp":1638316800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,12,7]],"date-time":"2021-12-07T00:00:00Z","timestamp":1638835200000},"content-version":"vor","delay-in-days":6,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100010767","name":"innovative medicines initiative","doi-asserted-by":"publisher","award":["831472"],"award-info":[{"award-number":["831472"]}],"id":[{"id":"10.13039\/501100010767","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>With the increase in applications of machine learning methods in drug design and related fields, the challenge of designing sound test sets becomes more and more prominent. The goal of this challenge is to have a realistic split of chemical structures (compounds) between training, validation and test set such that the performance on the test set is meaningful to infer the performance in a prospective application. This challenge is by its own very interesting and relevant, but is even more complex in a federated machine learning approach where multiple partners jointly train a model under privacy-preserving conditions where chemical structures must not be shared between the different participating parties. 
In this work we discuss three methods which provide a splitting of a data set and are applicable in a federated privacy-preserving setting, namely: a. locality-sensitive hashing (LSH), b. sphere exclusion clustering, c. scaffold-based binning (scaffold network). For evaluation of these splitting methods we consider the following quality criteria (compared to random splitting): bias in prediction performance, classification label and data imbalance, similarity distance between the test and training set compounds. The main findings of the paper are a. both sphere exclusion clustering and scaffold-based binning result in high quality splitting of the data sets, b. in terms of compute costs sphere exclusion clustering is very expensive in the case of federated privacy-preserving setting.<\/jats:p>","DOI":"10.1186\/s13321-021-00576-2","type":"journal-article","created":{"date-parts":[[2021,12,7]],"date-time":"2021-12-07T03:02:52Z","timestamp":1638846172000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":50,"title":["Splitting chemical structure data sets for federated privacy-preserving machine 
learning"],"prefix":"10.1186","volume":"13","author":[{"given":"Jaak","family":"Simm","sequence":"first","affiliation":[]},{"given":"Lina","family":"Humbeck","sequence":"additional","affiliation":[]},{"given":"Adam","family":"Zalewski","sequence":"additional","affiliation":[]},{"given":"Noe","family":"Sturm","sequence":"additional","affiliation":[]},{"given":"Wouter","family":"Heyndrickx","sequence":"additional","affiliation":[]},{"given":"Yves","family":"Moreau","sequence":"additional","affiliation":[]},{"given":"Bernd","family":"Beck","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6385-0414","authenticated-orcid":false,"given":"Ansgar","family":"Schuffenhauer","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,12,7]]},"reference":[{"key":"576_CR1","volume-title":"Pattern recognition and machine learning","author":"CM Bishop","year":"2006","unstructured":"Bishop CM (2006) Pattern recognition and machine learning. Springer, New York"},{"issue":"4","key":"576_CR2","first-page":"348","volume":"21","author":"H Kubinyi","year":"2002","unstructured":"Kubinyi H (2002) From narcosis to hyperspace: the history of QSAR. QSAR 21(4):348\u2013356","journal-title":"QSAR"},{"key":"576_CR3","first-page":"437","volume-title":"Computational medicinal chemistry and drug discovery","author":"JH Van Drie","year":"2003","unstructured":"Van Drie JH (2003) Pharmacophore discovery: a critical review. In: Bultinck P, De Winter H, Langenaeker W (eds) Computational medicinal chemistry and drug discovery, 2nd edn. Dekker, New York, pp 437\u2013460","edition":"2"},{"issue":"5","key":"576_CR4","doi-asserted-by":"publisher","first-page":"1242","DOI":"10.1021\/jm030408h","volume":"47","author":"F Lombardo","year":"2004","unstructured":"Lombardo F, Obach RS, Shalaeva MY, Gao F (2004) Prediction of human volume of distribution values for neutral and basic drugs. 2. Extended data set and leave-class-out statistics. 
J Med Chem 47(5):1242\u20131250. https:\/\/doi.org\/10.1021\/jm030408h","journal-title":"J Med Chem"},{"issue":"8","key":"576_CR5","doi-asserted-by":"publisher","first-page":"2077","DOI":"10.1021\/acs.jcim.7b00166","volume":"57","author":"EJ Martin","year":"2017","unstructured":"Martin EJ, Polyakov VR, Tian L, Perez RC (2017) Profile-QSAR 2.0: kinase virtual screening accuracy comparable to four-concentration IC50s for Realistically novel compounds. J Chem Inf Model 57(8):2077\u20132088. https:\/\/doi.org\/10.1021\/acs.jcim.7b00166","journal-title":"J Chem Inf Model"},{"issue":"4","key":"576_CR6","doi-asserted-by":"publisher","first-page":"783","DOI":"10.1021\/ci400084k","volume":"53","author":"RP Sheridan","year":"2013","unstructured":"Sheridan RP (2013) Time-split cross-validation as a method for estimating the goodness of prospective prediction. J Chem Inf Model 53(4):783\u2013790. https:\/\/doi.org\/10.1021\/ci400084k","journal-title":"J Chem Inf Model"},{"key":"576_CR7","doi-asserted-by":"publisher","unstructured":"S\u00f8gaard A, Ebert S, Bastings J, Filippova K  (2021) We need to talk about random splits. In: Proceedings of the 16th conference of the European chapter of the association for computational linguistics: main volume, pp 1823\u20131832. Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/2021.eacl-main.156","DOI":"10.18653\/v1\/2021.eacl-main.156"},{"key":"576_CR8","doi-asserted-by":"publisher","DOI":"10.1145\/3298981","author":"Q Yang","year":"2019","unstructured":"Yang Q, Liu Y, Chen T, Tong Y (2019) Federated machine learning: concept and applications. ACM Trans Intell Syst Technol. https:\/\/doi.org\/10.1145\/3298981","journal-title":"ACM Trans Intell Syst Technol."},{"key":"576_CR9","unstructured":"MELLODDY: machine learning ledger orchestration for drug discovery. https:\/\/www.melloddy.eu\/. 
Accessed 29 Nov 2021"},{"issue":"6","key":"576_CR10","doi-asserted-by":"publisher","first-page":"2651","DOI":"10.1021\/ci600219n","volume":"46","author":"MF Engels","year":"2006","unstructured":"Engels MF (2006) A cluster-based strategy for assessing the overlap between large chemical libraries and its application to a recent acquisition. J Chem Inf Model 46(6):2651\u20132660. https:\/\/doi.org\/10.1021\/ci600219n","journal-title":"J Chem Inf Model"},{"issue":"13\u201314","key":"576_CR11","doi-asserted-by":"publisher","first-page":"636","DOI":"10.1016\/j.drudis.2011.04.005","volume":"16","author":"J Schamberger","year":"2011","unstructured":"Schamberger J, Grimm M, Steinmeyer A, Hillisch A (2011) Rendezvous in chemical space? Comparing the small molecule compound libraries of Bayer and Schering. Drug Discov Today 16(13\u201314):636\u2013641. https:\/\/doi.org\/10.1016\/j.drudis.2011.04.005","journal-title":"Drug Discov Today"},{"key":"576_CR12","unstructured":"Galtier MN, Marini C (2019) Substra: a framework for privacy-preserving, traceable and collaborative machine learning.  arXiv:1910.11567"},{"key":"576_CR13","doi-asserted-by":"publisher","unstructured":"Bonawitz K, Ivanov V, Kreuter B, Marcedone A, McMahan HB, Patel S, Ramage D, Segal A, Seth K (2017) Practical secure aggregation for privacy preserving machine learning. Cryptology ePrint Archive, Report 2017\/281. https:\/\/doi.org\/10.1145\/3133956.3133982","DOI":"10.1145\/3133956.3133982"},{"issue":"D1","key":"576_CR14","doi-asserted-by":"publisher","first-page":"945","DOI":"10.1093\/nar\/gkw1074","volume":"45","author":"A Gaulton","year":"2017","unstructured":"Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibri\u00e1n-Uhalte E, Davies M, Dedman N, Karlsson A, Magari\u00f1os MP, Overington JP, Papadatos G, Smit I, Leach AR (2017) The ChEMBL database in 2017. Nucleic Acids Res 45(D1):945\u2013954. 
https:\/\/doi.org\/10.1093\/nar\/gkw1074","journal-title":"Nucleic Acids Res"},{"issue":"5","key":"576_CR15","doi-asserted-by":"publisher","first-page":"742","DOI":"10.1021\/ci100050t","volume":"50","author":"D Rogers","year":"2010","unstructured":"Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742\u2013754. https:\/\/doi.org\/10.1021\/ci100050t (PMID: 20426451)","journal-title":"J Chem Inf Model"},{"key":"576_CR16","doi-asserted-by":"publisher","unstructured":"Simm J, Friedrich L. MELLODDY TUNER release V1 public data. https:\/\/doi.org\/10.5281\/zenodo.4778424","DOI":"10.5281\/zenodo.4778424"},{"key":"576_CR17","doi-asserted-by":"publisher","unstructured":"National Institute of Standards and Technology (NIST) (2015)  Federal Information Processing Standards Publication 180-4: Secure Hash Standard (SHS). https:\/\/doi.org\/10.6028\/NIST.FIPS.180-4","DOI":"10.6028\/NIST.FIPS.180-4"},{"issue":"1","key":"576_CR18","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1021\/ci00023a009","volume":"35","author":"R Taylor","year":"1995","unstructured":"Taylor R (1995) Simulation analysis of experimental design strategies for screening random compounds as potential new drugs and agrochemicals. J Chem Inform Comput Sci 35(1):59\u201367. https:\/\/doi.org\/10.1021\/ci00023a009","journal-title":"J Chem Inform Comput Sci"},{"issue":"4","key":"576_CR19","doi-asserted-by":"publisher","first-page":"747","DOI":"10.1021\/ci9803381","volume":"39","author":"D Butina","year":"1999","unstructured":"Butina D (1999) Unsupervised data base clustering based on daylight\u2019s fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets. J Chem Inform Comput Sci 39(4):747\u2013750. https:\/\/doi.org\/10.1021\/ci9803381","journal-title":"J Chem Inform Comput Sci"},{"key":"576_CR20","unstructured":"Parthasarathy D, Shah D, Zaman T  (2010) Leaders, followers, and community detection. 
arXiv:1011.0774"},{"key":"576_CR21","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781107337756","volume-title":"Secure multiparty computation and secret sharing","author":"R Cramer","year":"2015","unstructured":"Cramer R, Damg\u00e5rd IB, Nielsen JB (2015) Secure multiparty computation and secret sharing. Cambridge University Press, Cambridge. https:\/\/doi.org\/10.1017\/CBO9781107337756"},{"key":"576_CR22","unstructured":"Damgard I, Pastro V, Smart NP, Zakarias S (2012) Multiparty computation from somewhat homomorphic encryption. Cryptology ePrint Archive, Report 2011\/535. https:\/\/ia.cr\/2011\/535. Accessed 29 Nov 2021"},{"key":"576_CR23","unstructured":"Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases. VLDB \u201999, pp 518\u2013529. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA"},{"issue":"15","key":"576_CR24","doi-asserted-by":"publisher","first-page":"2887","DOI":"10.1021\/jm9602928","volume":"39","author":"GW Bemis","year":"1996","unstructured":"Bemis GW, Murcko MA (1996) The properties of known drugs. 1. Molecular frameworks. J Med Chem 39(15):2887\u20132893. https:\/\/doi.org\/10.1021\/jm9602928","journal-title":"J Med Chem"},{"issue":"8","key":"576_CR25","doi-asserted-by":"publisher","first-page":"3370","DOI":"10.1021\/acs.jcim.9b00237","volume":"59","author":"K Yang","year":"2019","unstructured":"Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M, Palmer A, Settels V, Jaakkola T, Jensen K, Barzilay R (2019) Analyzing learned molecular representations for property prediction. J Chem Inf Model 59(8):3370\u20133388. https:\/\/doi.org\/10.1021\/acs.jcim.9b00237. 
(1904.01561)","journal-title":"J Chem Inf Model"},{"issue":"1","key":"576_CR26","doi-asserted-by":"publisher","first-page":"47","DOI":"10.1021\/ci600338x","volume":"47","author":"A Schuffenhauer","year":"2007","unstructured":"Schuffenhauer A, Ertl P, Roggo S, Wetzel S, Koch MA, Waldmann H (2007) The scaffold tree - visualization of the scaffold universe by hierarchical scaffold classification. J Chem Inf Model 47(1):47\u201358. https:\/\/doi.org\/10.1021\/ci600338x","journal-title":"J Chem Inf Model"},{"key":"576_CR27","doi-asserted-by":"publisher","DOI":"10.1186\/s13321-017-0213-3","author":"T Sch\u00e4fer","year":"2017","unstructured":"Sch\u00e4fer T, Kriege N, Humbeck L, Klein K, Koch O, Mutzel P (2017) Scaffold Hunter: a comprehensive visual analytics framework for drug discovery. J Cheminformatics. https:\/\/doi.org\/10.1186\/s13321-017-0213-3","journal-title":"J Cheminformatics"},{"issue":"7","key":"576_CR28","doi-asserted-by":"publisher","first-page":"1528","DOI":"10.1021\/ci2000924","volume":"51","author":"T Varin","year":"2011","unstructured":"Varin T, Schuffenhauer A, Ertl P, Renner S (2011) Mining for bioactive scaffolds with scaffold networks: improved compound set enrichment from primary screening data. J Chem Inf Model 51(7):1528\u20131538. https:\/\/doi.org\/10.1021\/ci2000924","journal-title":"J Chem Inf Model"},{"key":"576_CR29","doi-asserted-by":"publisher","first-page":"3331","DOI":"10.1021\/acs.jcim.0c00296","volume":"60","author":"F Kruger","year":"2020","unstructured":"Kruger F, Stiefl N, Landrum GA (2020) Rdscaffoldnetwork: the scaffold network implementation in RDKIT. 
J Chem Inf Model 60:3331\u20133335","journal-title":"J Chem Inf Model"},{"issue":"23","key":"576_CR30","doi-asserted-by":"publisher","first-page":"14425","DOI":"10.1021\/acs.jmedchem.0c01332","volume":"63","author":"A Schuffenhauer","year":"2020","unstructured":"Schuffenhauer A, Schneider N, Hintermann S, Auld D, Blank J, Cotesta S, Engeloch C, Fechner N, Gaul C, Giovannoni J, Jansen J, Joslin J, Krastel P, Lounkine E, Manchester J, Monovich LG, Pelliccioli AP, Schwarze M, Shultz MD, Stiefl N, Baeschlin DK (2020) Evolution of Novartis\u2019 small molecule screening deck design. J Med Chem 63(23):14425\u201314447. https:\/\/doi.org\/10.1021\/acs.jmedchem.0c01332","journal-title":"J Med Chem"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-021-00576-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-021-00576-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-021-00576-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,12,7]],"date-time":"2021-12-07T03:10:13Z","timestamp":1638846613000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-021-00576-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,12]]},"references-count":30,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["576"],"URL":"https:\/\/doi.org\/10.1186\/s13321-021-00576-2","relation":{"has-preprint":[{"id-type":"doi","id":"10.26434\/chemrxiv-2021-xd440","asserted-by":"object"},{"id-type":"doi","id":"10.26434\/chemrxiv-2021-xd440-v2","asserted-by":"object"},{"id
-type":"doi","id":"10.26434\/chemrxiv-2021-xd440-v3","asserted-by":"object"}]},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,12]]},"assertion":[{"value":"28 July 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 November 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 December 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests. The authors AZ, NS, AS, WH, BB and LH did the work as employee of Amgen, Novartis, Janssen and Boehringer Ingelheim, respectively.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"96"}}