{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:37Z","timestamp":1772138077578,"version":"3.50.1"},"reference-count":45,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2016,10,2]],"date-time":"2016-10-02T00:00:00Z","timestamp":1475366400000},"content-version":"vor","delay-in-days":2413,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/2.0\/uk\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2010,4,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Motivation: Similarity searching and clustering of chemical compounds by structural similarities are important computational approaches for identifying drug-like small molecules. Most algorithms available for these tasks are limited by their speed and scalability, and cannot handle today's large compound databases with several million entries.<\/jats:p>\n                  <jats:p>Results: In this article, we introduce a new algorithm for accelerated similarity searching and clustering of very large compound sets using embedding and indexing (EI) techniques. First, we present EI-Search as a general purpose similarity search method for finding objects with similar features in large databases and apply it here to searching and clustering of large compound sets. The method embeds the compounds in a high-dimensional Euclidean space and searches this space using an efficient index-aware nearest neighbor search method based on locality sensitive hashing (LSH). Second, to cluster large compound sets, we introduce the EI-Clustering algorithm that combines the EI-Search method with Jarvis\u2013Patrick clustering. Both methods were tested on three large datasets with sizes ranging from about 260 000 to over 19 million compounds. In comparison to sequential search methods, the EI-Search method was 40\u2013200 times faster, while maintaining comparable recall rates. The EI-Clustering method allowed us to significantly reduce the CPU time required to cluster these large compound libraries from several months to only a few days.<\/jats:p>\n                  <jats:p>Availability: Software implementations and online services have been developed based on the methods introduced in this study. The online services provide access to the generated clustering results and ultra-fast similarity searching of the PubChem Compound database with subsecond response time.<\/jats:p>\n                  <jats:p>Contact: \u00a0thomas.girke@ucr.edu<\/jats:p>\n                  <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btq067","type":"journal-article","created":{"date-parts":[[2010,2,23]],"date-time":"2010-02-23T20:26:45Z","timestamp":1266956805000},"page":"953-959","source":"Crossref","is-referenced-by-count":29,"title":["Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing"],"prefix":"10.1093","volume":"26","author":[{"given":"Yiqun","family":"Cao","sequence":"first","affiliation":[{"name":"1 Department of Computer Science & Engineering and 2 Department of Botany & Plant Sciences, University of California, Riverside, CA 92521, USA"}]},{"given":"Tao","family":"Jiang","sequence":"additional","affiliation":[{"name":"1 Department of Computer Science & Engineering and 2 Department of Botany & Plant Sciences, University of California, Riverside, CA 92521, USA"}]},{"given":"Thomas","family":"Girke","sequence":"additional","affiliation":[{"name":"1 Department of Computer Science & Engineering and 2 Department of Botany & Plant Sciences, University of California, Riverside, CA 92521, USA"}]}],"member":"286","published-online":{"date-parts":[[2010,2,23]]},"reference":[{"key":"2023012508034128800_B1","doi-asserted-by":"crossref","first-page":"1215","DOI":"10.1002\/jcc.10234","article-title":"Stochastic proximity embedding","volume":"24","author":"Agrafiotis","year":"2003","journal-title":"J. Comput. Chem."},{"key":"2023012508034128800_B2","doi-asserted-by":"crossref","first-page":"51","DOI":"10.1021\/ci980100c","article-title":"An efficient implementation of distance-based diversity measures based on kd trees","volume":"39","author":"Agrafiotis","year":"1999","journal-title":"J. Chem. Inf. Comput. Sci."},{"key":"2023012508034128800_B3","doi-asserted-by":"crossref","first-page":"488","DOI":"10.1002\/1096-987X(20010415)22:5%3C488::AID-JCC1020%3E3.0.CO;2-4","article-title":"Multidimensional scaling and visualization of large molecular similarity tables","volume":"22","author":"Agrafiotis","year":"2001","journal-title":"J. Comput. Chem."},{"key":"2023012508034128800_B4","doi-asserted-by":"crossref","first-page":"15869","DOI":"10.1073\/pnas.242424399","article-title":"A self-organizing principle for learning nonlinear manifolds","volume":"99","author":"Agrafiotis","year":"2002","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012508034128800_B5","doi-asserted-by":"crossref","first-page":"1138","DOI":"10.1126\/science.1105511","article-title":"NIH molecular libraries initiative","volume":"306","author":"Austin","year":"2004","journal-title":"Science"},{"key":"2023012508034128800_B6","doi-asserted-by":"crossref","first-page":"1367","DOI":"10.1021\/ci800076s","article-title":"Speeding up chemical database searches using a proximity filter based on the logical exclusive OR","volume":"48","author":"Baldi","year":"2008","journal-title":"J. Chem. Inf. Model."},{"key":"2023012508034128800_B7","doi-asserted-by":"crossref","first-page":"509","DOI":"10.1145\/361002.361007","article-title":"Multidimensional binary search trees used for associative searching","volume":"18","author":"Bentley","year":"1975","journal-title":"Comm. ACM"},{"key":"2023012508034128800_B8","doi-asserted-by":"crossref","first-page":"322","DOI":"10.1145\/502807.502809","article-title":"Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases","volume":"33","author":"Bohm","year":"2001","journal-title":"ACM Comput. Surv."},{"key":"2023012508034128800_B9","first-page":"237","article-title":"Efficient processing of spatial joins using R-trees","volume-title":"Proceedings of ACM SIGMOD Conference on Management of Data.","author":"Brinkhoff","year":"1993"},{"key":"2023012508034128800_B10","doi-asserted-by":"crossref","first-page":"i366","DOI":"10.1093\/bioinformatics\/btn186","article-title":"A maximum common substructure-based algorithm for searching and predicting drug-like compounds","volume":"24","author":"Cao","year":"2008","journal-title":"Bioinformatics"},{"key":"2023012508034128800_B11","first-page":"197200","article-title":"A heuristic relaxation method for nonlinear mapping in cluster analysis","volume":"3","author":"Chang","year":"1973","journal-title":"IEEE Trans. Syst. Man Cybernet."},{"key":"2023012508034128800_B12","doi-asserted-by":"crossref","first-page":"2348","DOI":"10.1093\/bioinformatics\/btm341","article-title":"ChemDB update\u2014full-text search and virtual chemical space","volume":"23","author":"Chen","year":"2007","journal-title":"Bioinformatics"},{"key":"2023012508034128800_B13","doi-asserted-by":"crossref","first-page":"1407","DOI":"10.1021\/ci025531g","article-title":"Performance of similarity measures in 2D fragment-based similarity searching: comparison of structural descriptors and similarity coefficients","volume":"42","author":"Chen","year":"2002","journal-title":"J. Chem. Inf. Comput. Sci."},{"key":"2023012508034128800_B14","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1038\/nbt1273","article-title":"Structure-based maximal affinity model predicts small-molecule druggability","volume":"25","author":"Cheng","year":"2007","journal-title":"Nat. Biotechnol."},{"key":"2023012508034128800_B15","doi-asserted-by":"crossref","first-page":"253","DOI":"10.1145\/997817.997857","article-title":"Locality-sensitive hashing scheme based on p-stable distributions","volume-title":"Proceedings of the Twentieth Annual Symposium on Computational Geometry.","author":"Datar","year":"2004"},{"key":"2023012508034128800_B16","first-page":"1","article-title":"Clustering methods and their uses in computational chemistry","volume":"18","author":"Downs","year":"2002","journal-title":"Rev. Comput. Chem."},{"key":"2023012508034128800_B17","first-page":"163","article-title":"FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets","volume-title":"Proceedings of the ACM SIGMOD Conference on Management of Data.","author":"Faloutsos","year":"1995"},{"key":"2023012508034128800_B18","doi-asserted-by":"crossref","first-page":"154","DOI":"10.1007\/PL00010672","article-title":"Dynamic vp-tree indexing for n-nearest neighbor search given pair-wise distances","volume":"9","author":"Fu","year":"2000","journal-title":"VLDB J."},{"key":"2023012508034128800_B19","first-page":"518","article-title":"Similarity search in high dimensions via hashing","volume-title":"Proceedings of the International Conference on Very Large Data Bases.","author":"Gionis","year":"1999"},{"key":"2023012508034128800_B20","doi-asserted-by":"crossref","first-page":"573","DOI":"10.1104\/pp.105.062687","article-title":"ChemMine. A compound mining database for chemical genomics","volume":"138","author":"Girke","year":"2005","journal-title":"Plant Physiol."},{"key":"2023012508034128800_B21","doi-asserted-by":"crossref","first-page":"296","DOI":"10.1016\/j.cbpa.2005.04.006","article-title":"The principle of complementarity: chemical versus biological space","volume":"9","author":"Haggarty","year":"2005","journal-title":"Curr. Opin. Chem. Biol."},{"key":"2023012508034128800_B22","doi-asserted-by":"crossref","first-page":"46","DOI":"10.1021\/ci010056s","article-title":"Enhanced CACTVS browser of the open NCI database","volume":"42","author":"Ihlenfeldt","year":"2002","journal-title":"J. Chem. Inf. Comput. Sci."},{"key":"2023012508034128800_B23","doi-asserted-by":"crossref","first-page":"177","DOI":"10.1021\/ci049714+","article-title":"ZINC\u2014a free database of commercially available compounds for virtual screening","volume":"45","author":"Irwin","year":"2005","journal-title":"J. Chem. Inf. Model."},{"key":"2023012508034128800_B24","first-page":"369","article-title":"The SR-tree: an index structure for high-dimensional nearest neighbor queries","volume-title":"Proceedings of the ACM SIGMOD Conference on Management of Data.","author":"Katayama","year":"1997"},{"key":"2023012508034128800_B25","doi-asserted-by":"crossref","DOI":"10.4135\/9781412985130","volume-title":"Multidimensional Scaling.","author":"Kruskal","year":"1978"},{"key":"2023012508034128800_B26","first-page":"950","article-title":"Multi-probe LSH: efficient indexing for high-dimensional similarity search","volume-title":"Proceedings of the International Conference on Very Large Data Bases.","author":"Lv","year":"2007"},{"key":"2023012508034128800_B27","author":"NIH Chemical Genomics Center","year":"2009","journal-title":"PubChem Fingerprint for JChem."},{"key":"2023012508034128800_B28","doi-asserted-by":"crossref","first-page":"384","DOI":"10.1016\/S1367-5931(02)00329-0","article-title":"Chemical space navigation in lead discovery","volume":"6","author":"Oprea","year":"2002","journal-title":"Curr. Opin. Chem. Biol."},{"key":"2023012508034128800_B29","doi-asserted-by":"crossref","first-page":"447","DOI":"10.1038\/nchembio0807-447","article-title":"Systems chemical biology","volume":"3","author":"Oprea","year":"2007","journal-title":"Nat. Chem. Biol."},{"key":"2023012508034128800_B30","doi-asserted-by":"crossref","first-page":"421","DOI":"10.1016\/S1093-3263(02)00188-2","article-title":"Comparison of chemical clustering methods using graph- and fingerprint-based similarity measures","volume":"21","author":"Raymond","year":"2003","journal-title":"J. Mol. Graph Model."},{"key":"2023012508034128800_B31","doi-asserted-by":"crossref","first-page":"2323","DOI":"10.1126\/science.290.5500.2323","article-title":"Nonlinear dimensionality reduction by locally linear embedding","volume":"290","author":"Roweis","year":"2000","journal-title":"Science"},{"key":"2023012508034128800_B32","doi-asserted-by":"crossref","first-page":"412","DOI":"10.1016\/j.cbpa.2004.06.003","article-title":"Exploring the chemogenomic knowledge space with annotated chemical libraries","volume":"8","author":"Savchuk","year":"2004","journal-title":"Curr. Opin. Chem. Biol."},{"issue":"Database issue","key":"2023012508034128800_B33","first-page":"351","article-title":"ChemBank: a small-molecule screening and cheminformatics resource database","volume":"36","author":"Seiler","year":"2008","journal-title":"Nucleic Acids. Res."},{"key":"2023012508034128800_B34","doi-asserted-by":"crossref","first-page":"903","DOI":"10.1016\/S1359-6446(02)02411-X","article-title":"Why do we need so many chemical similarity search methods?","volume":"7","author":"Sheridan","year":"2002","journal-title":"Drug Discov. Today"},{"key":"2023012508034128800_B35","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1021\/ci050404g","article-title":"Visualization and interpretation of high content screening data","volume":"46","author":"Smellie","year":"2006","journal-title":"J. Chem. Inf. Model."},{"key":"2023012508034128800_B36","doi-asserted-by":"crossref","first-page":"294","DOI":"10.1126\/science.1083395","article-title":"From knowing to controlling: a path from genomics to drugs using small molecule probes","volume":"300","author":"Strausberg","year":"2003","journal-title":"Science"},{"key":"2023012508034128800_B37","doi-asserted-by":"crossref","first-page":"302","DOI":"10.1021\/ci600358f","article-title":"Bounds and algorithms for fast exact searches of chemical fingerprints in linear and sub-linear time","volume":"47","author":"Swamidass","year":"2007","journal-title":"J. Chem. Inf. Model."},{"key":"2023012508034128800_B38","doi-asserted-by":"crossref","first-page":"2319","DOI":"10.1126\/science.290.5500.2319","article-title":"A global geometric framework for nonlinear dimensionality reduction","volume":"290","author":"Tenenbaum","year":"2000","journal-title":"Science"},{"key":"2023012508034128800_B39","doi-asserted-by":"crossref","first-page":"101","DOI":"10.1007\/BF02187718","article-title":"An O (n log n) algorithm for the all-nearest-neighbors problem","volume":"4","author":"Vaidya","year":"1989","journal-title":"Discrete Comput. Geom."},{"key":"2023012508034128800_B40","first-page":"194","article-title":"A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces","volume-title":"Proceedings of the International Conference on Very Large Data Bases.","author":"Weber","year":"1998"},{"key":"2023012508034128800_B41","volume-title":"Similarity and Clustering in Chemical Information Systems.","author":"Willett","year":"1987"},{"key":"2023012508034128800_B42","doi-asserted-by":"crossref","first-page":"4183","DOI":"10.1021\/jm0582165","article-title":"Searching techniques for databases of two- and three-dimensional chemical structures","volume":"48","author":"Willett","year":"2005","journal-title":"J. Med. Chem."},{"key":"2023012508034128800_B43","doi-asserted-by":"crossref","first-page":"983","DOI":"10.1021\/ci9800211","article-title":"Chemical similarity searching","volume":"38","author":"Willett","year":"1998","journal-title":"J. Chem. Inf. Comput. Sci."},{"key":"2023012508034128800_B44","doi-asserted-by":"crossref","first-page":"1933","DOI":"10.1021\/ci034150f","article-title":"Nearest neighbor search in general metric spaces using a tree data structure with a simple heuristic","volume":"43","author":"Xu","year":"2003","journal-title":"J. Chem. Inf. Comput. Sci."},{"key":"2023012508034128800_B45","doi-asserted-by":"crossref","first-page":"550","DOI":"10.1145\/279232.279236","article-title":"L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization","volume":"23","author":"Zhu","year":"1997","journal-title":"ACM Trans. Math. Softw."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/26\/7\/953\/48855784\/bioinformatics_26_7_953.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/26\/7\/953\/48855784\/bioinformatics_26_7_953.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,25]],"date-time":"2023-01-25T03:07:08Z","timestamp":1674616028000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/26\/7\/953\/213574"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2010,2,23]]},"references-count":45,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2010,4,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btq067","relation":{"has-review":[{"id-type":"doi","id":"10.3410\/f.3292956.2979054","asserted-by":"object"}]},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2010,4,1]]},"published":{"date-parts":[[2010,2,23]]}}}