{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T03:38:19Z","timestamp":1752550699424,"version":"3.37.3"},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,10,12]],"date-time":"2023-10-12T00:00:00Z","timestamp":1697068800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,10,12]],"date-time":"2023-10-12T00:00:00Z","timestamp":1697068800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000015","name":"U.S. Department of Energy","doi-asserted-by":"publisher","award":["DE-SC0023354"],"award-info":[{"award-number":["DE-SC0023354"]}],"id":[{"id":"10.13039\/100000015","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Ultra-large chemical libraries are reaching 10s to 100s of billions of molecules. A challenge for these libraries is to efficiently check if a proposed molecule is present. Here we propose and study Bloom filters for testing if a molecule is present in a set using either string or fingerprint representations. Bloom filters are small enough to hold billions of molecules in just a few GB of memory and check membership in sub milliseconds. We found string representations can have a false positive rate below 1% and require significantly less storage than using fingerprints. Canonical SMILES with Bloom filters with the simple FNV\u00a0(Fowler-Noll-Vo) hashing function provide fast and accurate membership tests with small memory requirements. 
We provide a general implementation and specific filters for detecting if a molecule is purchasable, patented, or a natural product according to existing databases at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/whitead\/molbloom\">https:\/\/github.com\/whitead\/molbloom<\/jats:ext-link>.<\/jats:p>","DOI":"10.1186\/s13321-023-00765-1","type":"journal-article","created":{"date-parts":[[2023,10,12]],"date-time":"2023-10-12T20:41:13Z","timestamp":1697143273000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Bloom filters for molecules"],"prefix":"10.1186","volume":"15","author":[{"given":"Jorge","family":"Medina","sequence":"first","affiliation":[]},{"given":"Andrew D.","family":"White","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,10,12]]},"reference":[{"issue":"4","key":"765_CR1","first-page":"559","volume":"11","author":"Ulrich Rester","year":"2008","unstructured":"Rester Ulrich (2008) From virtuality to reality - virtual screening in lead discovery and lead optimization: a medicinal chemistry perspective. Curr Opinion Drug Disc Devel 11(4):559\u2013568","journal-title":"Curr Opinion Drug Disc Devel"},{"issue":"12","key":"765_CR2","doi-asserted-by":"publisher","first-page":"6065","DOI":"10.1021\/acs.jcim.0c00675","volume":"60","author":"J Irwin John","year":"2020","unstructured":"Irwin John J, Tang Khanh G, Jennifer Young, Chinzorig Dandarchuluun, Wong Benjamin R, Munkhzul Khurelbaatar, Moroz Yurii S, John Mayfield, Sayle RA (2020) Zinc20-a free ultralarge-scale chemical database for ligand discovery. 
J Chem Inform Model 60(12):6065\u20136073","journal-title":"J Chem Inform Model"},{"issue":"7","key":"765_CR3","doi-asserted-by":"publisher","first-page":"422","DOI":"10.1145\/362686.362692","volume":"13","author":"Burton H Bloom","year":"1970","unstructured":"Bloom Burton H (1970) Space\/time trade-offs in hash coding with allowable errors. Commun ACM 13(7):422\u2013426","journal-title":"Commun ACM"},{"issue":"1","key":"765_CR4","doi-asserted-by":"publisher","first-page":"131","DOI":"10.1109\/SURV.2011.031611.00024","volume":"14","author":"Sasu Tarkoma","year":"2012","unstructured":"Tarkoma Sasu, Rothenberg Christian Esteve, Lagerspetz Eemil (2012) Theory and practice of bloom filters for distributed systems. IEEE Commun Surv Tutor 14(1):131\u2013155","journal-title":"IEEE Commun Surv Tutor"},{"issue":"4","key":"765_CR5","doi-asserted-by":"publisher","first-page":"485","DOI":"10.1080\/15427951.2004.10129096","volume":"1","author":"Andrei Broder","year":"2004","unstructured":"Broder Andrei, Mitzenmacher Michael (2004) Network applications of bloom filters: a survey. Internet Mathemat 1(4):485\u2013509","journal-title":"Internet Mathemat"},{"issue":"1","key":"765_CR6","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1109\/TCOM.1982.1095395","volume":"30","author":"M McIlroy","year":"1982","unstructured":"McIlroy M (1982) Development of a spelling list. IEEE Trans Commun 30(1):91\u201399","journal-title":"IEEE Trans Commun"},{"key":"765_CR7","unstructured":"Yakunin Alex (2010) Nice bloom filter application"},{"issue":"51","key":"765_CR8","doi-asserted-by":"publisher","first-page":"13093","DOI":"10.1073\/pnas.1814448115","volume":"115","author":"Sanjoy Dasgupta","year":"2018","unstructured":"Dasgupta Sanjoy, Sheehan Timothy C, Stevens Charles F, Navlakha Saket (2018) A neural data structure for novelty detection. 
Proc Natl Acad Sci 115(51):13093\u201313098","journal-title":"Proc Natl Acad Sci"},{"key":"765_CR9","unstructured":"Talbot Jamie (July 2015) What are Bloom filters?"},{"key":"765_CR10","doi-asserted-by":"crossref","unstructured":"Goodwin Bob, Hopcroft Michael, Luu Dan, Clemmer Alex, Curmei Mihaela, Elnikety Sameh, He Yuxiong (August 2017) BitFunnel: Revisiting Signatures for Search. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 605\u2013614, Shinjuku Tokyo Japan, ACM","DOI":"10.1145\/3077136.3080789"},{"key":"765_CR11","unstructured":"Bran Andres\u00a0M, Cox Sam, White Andrew\u00a0D, Schwaller Philippe (2023) ChemCrow: Augmenting large-language models with chemistry tools"},{"issue":"2","key":"765_CR12","doi-asserted-by":"publisher","first-page":"137","DOI":"10.1517\/17460441.2016.1117070","volume":"11","author":"Ingo Muegge","year":"2016","unstructured":"Muegge Ingo, Mukherjee Prasenjit (2016) An overview of molecular fingerprint similarity search in virtual screening. Expert Opin Drug Disc 11(2):137\u2013148","journal-title":"Expert Opin Drug Disc"},{"issue":"1","key":"765_CR13","doi-asserted-by":"publisher","first-page":"2","DOI":"10.1186\/s13321-020-00478-9","volume":"13","author":"Maria Sorokina","year":"2021","unstructured":"Sorokina Maria, Merseburger Peter, Rajan Kohulan, Yirik MehmetAziz, Steinbeck Christoph (2021) COCONUT online: collection of open natural products database. J Cheminform 13(1):2","journal-title":"J Cheminform"},{"key":"765_CR14","doi-asserted-by":"crossref","unstructured":"Fan Bin, Andersen Dave\u00a0G., Kaminsky Michael, Mitzenmacher Michael\u00a0D. (2014) Cuckoo filter: Practically better than bloom. In: Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, CoNEXT \u201914, page 75\u201388, New York, NY, USA. 
Association for Computing Machinery","DOI":"10.1145\/2674005.2674994"},{"key":"765_CR15","doi-asserted-by":"crossref","unstructured":"Bender Michael\u00a0A, Farach-Colton Martin, Johnson Rob, Kuszmaul Bradley\u00a0C, Medjedovic Dzejla, Montes Pablo, Shetty Pradeep, Spillane Richard\u00a0P, Zadok Erez (2011) Don\u2019t thrash: how to cache your hash on flash. In: 3rd Workshop on Hot Topics in Storage and File Systems (HotStorage 11)","DOI":"10.14778\/2350229.2350275"},{"key":"765_CR16","doi-asserted-by":"crossref","unstructured":"Cormode Graham (2009) Count-min sketch","DOI":"10.1007\/978-0-387-39940-9_87"},{"key":"765_CR17","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781139058452","volume-title":"Mining of massive datasets","author":"Anand Rajaraman","year":"2011","unstructured":"Rajaraman Anand, Ullman Jeffrey David (2011) Mining of massive datasets. Cambridge University Press; Cambridge"},{"issue":"D1","key":"765_CR18","doi-asserted-by":"publisher","first-page":"D1100","DOI":"10.1093\/nar\/gkr777","volume":"40","author":"A Gaulton","year":"2012","unstructured":"Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucl Acids Res 40(D1):D1100\u2013D1107","journal-title":"Nucl Acids Res"},{"issue":"Database","key":"765_CR19","doi-asserted-by":"publisher","first-page":"D198","DOI":"10.1093\/nar\/gkl999","volume":"35","author":"T Liu","year":"2007","unstructured":"Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK (2007) BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. 
Nucl Acids Res 35(Database):D198\u2013D201","journal-title":"Nucl Acids Res"},{"issue":"D1","key":"765_CR20","doi-asserted-by":"publisher","first-page":"D1373","DOI":"10.1093\/nar\/gkac956","volume":"51","author":"Sunghwan Kim","year":"2023","unstructured":"Kim Sunghwan, Chen Jie, Cheng Tiejun, Gindulyte Asta, He Jia, He Siqian, Li Qingliang, Shoemaker Benjamin A, Thiessen Paul A, Bo Yu, Zaslavsky Leonid, Zhang Jian, Bolton Evan E (2023) PubChem 2023 update. Nucl Acids Res 51(D1):D1373\u2013D1380","journal-title":"Nucl Acids Res"},{"issue":"D1","key":"765_CR21","doi-asserted-by":"publisher","first-page":"D1220","DOI":"10.1093\/nar\/gkv1253","volume":"44","author":"George Papadatos","year":"2016","unstructured":"Papadatos George, Davies Mark, Dedman Nathan, Chambers Jon, Gaulton Anna, Siddle James, Koks Richard, Irvine Sean A, Pettersson Joe, Goncharoff Nicko, Hersey Anne, Overington John P (2016) SureChEMBL: a large-scale, chemically annotated patent document database. Nucl Acids Res 44(D1):D1220\u2013D1228","journal-title":"Nucl Acids Res"},{"issue":"11","key":"765_CR22","doi-asserted-by":"publisher","first-page":"1123","DOI":"10.1021\/ed100697w","volume":"87","author":"Harry E Pence","year":"2010","unstructured":"Pence Harry E, Williams Antony (2010) ChemSpider: an online chemical information resource. J Chem Educ 87(11):1123\u20131124","journal-title":"J Chem Educ"},{"key":"765_CR23","doi-asserted-by":"crossref","unstructured":"St Denis Tom, Johnson Simon (2007) Chapter 5 - hash functions. In: St Denis Tom, Johnson Simon (eds) Cryptography for Developers, pages 203\u2013250. Syngress, Burlington","DOI":"10.1016\/B978-159749104-4\/50008-X"},{"key":"765_CR24","unstructured":"Wikipedia contributors (2023) Bloom filter, 2"},{"key":"765_CR25","doi-asserted-by":"crossref","unstructured":"Dillinger Peter C, Manolios Panagiotis (2004) Bloom filters in probabilistic verification. 
International Conference on Formal Methods in Computer-Aided Design","DOI":"10.1007\/978-3-540-30494-4_26"},{"key":"765_CR26","unstructured":"White Andrew\u00a0D (2022) molbloom: quick assessment of compound purchasability with bloom filters. https:\/\/github.com\/whitead\/molbloom, Dec 2022"},{"key":"765_CR27","unstructured":"Fowler Glenn, Noll Landon\u00a0Curt, Vo Kiem-Phong, Eastlake Donald E 3rd, Hansen Tony (2023) The FNV Non-Cryptographic Hash Algorithm. Internet-Draft draft-eastlake-fnv-19, Internet Engineering Task Force, January 2023. Work in Progress"},{"key":"765_CR28","doi-asserted-by":"crossref","unstructured":"Rivest Ronald\u00a0L (April 1992) The MD4 Message-Digest Algorithm. RFC 1320","DOI":"10.17487\/rfc1320"},{"key":"765_CR29","doi-asserted-by":"crossref","unstructured":"Rivest Ronald\u00a0L (April 1992) The MD5 Message-Digest Algorithm. RFC 1321","DOI":"10.17487\/rfc1321"},{"issue":"6","key":"765_CR30","doi-asserted-by":"publisher","first-page":"1273","DOI":"10.1021\/ci010132r","volume":"42","author":"Joseph L Durant","year":"2002","unstructured":"Durant Joseph L, Leland Burton A, Henry Douglas R, Nourse James G (2002) Reoptimization of mdl keys for use in drug discovery. J Chem Inform Comp Sci 42(6):1273\u20131280 (PMID: 12444722)","journal-title":"J Chem Inform Comp Sci"},{"issue":"2","key":"765_CR31","doi-asserted-by":"publisher","first-page":"107","DOI":"10.1021\/c160017a018","volume":"5","author":"HL Morgan","year":"1965","unstructured":"Morgan HL (1965) The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. 
J Chem Document 5(2):107\u2013113","journal-title":"J Chem Document"},{"issue":"1","key":"765_CR32","doi-asserted-by":"publisher","first-page":"43","DOI":"10.1186\/s13321-020-00445-4","volume":"12","author":"Alice Capecchi","year":"2020","unstructured":"Capecchi Alice, Probst Daniel, Reymond Jean-Louis (2020) One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J Cheminform 12(1):43","journal-title":"J Cheminform"},{"key":"765_CR33","doi-asserted-by":"crossref","unstructured":"Bosselaers Antoon (2005) Md4-Md5, pages 378\u2013379. Springer US, Boston, MA","DOI":"10.1007\/0-387-23483-7_249"},{"issue":"D1","key":"765_CR34","doi-asserted-by":"publisher","first-page":"D1220","DOI":"10.1093\/nar\/gkv1253","volume":"44","author":"George Papadatos","year":"2016","unstructured":"Papadatos George, Davies Mark, Dedman Nathan, Chambers Jon, Gaulton Anna, Siddle James, Koks Richard, Irvine Sean A, Pettersson Joe, Goncharoff Nicko et al (2016) Surechembl: a large-scale, chemically annotated patent document database. 
Nucl Acids Res 44(D1):D1220\u2013D1228","journal-title":"Nucl Acids Res"},{"key":"765_CR35","unstructured":"Medina Jorge (March 2023) molbloom: quick assessment of compound purchasability with bloom filters. https:\/\/github.com\/Jgmedina95\/molbloom-paper"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00765-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-023-00765-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00765-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,20]],"date-time":"2023-11-20T22:20:30Z","timestamp":1700518830000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-023-00765-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,12]]},"references-count":35,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["765"],"URL":"https:\/\/doi.org\/10.1186\/s13321-023-00765-1","relation":{},"ISSN":["1758-2946"],"issn-type":[{"type":"electronic","value":"1758-2946"}],"subject":[],"published":{"date-parts":[[2023,10,12]]},"assertion":[{"value":"14 June 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 September 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 October 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not Applicable","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not Applicable","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors have no competing interests to declare.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"95"}}