{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,5]],"date-time":"2025-10-05T16:58:13Z","timestamp":1759683493191},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2008,8]]},"abstract":"<jats:p>We study selectivity estimation techniques for set similarity queries. A wide variety of similarity measures for sets have been proposed in the past. In this work we concentrate on the class of weighted similarity measures (e.g., TF\/IDF and BM25 cosine similarity and variants) and design selectivity estimators based on a priori constructed samples. First, we study the pitfalls associated with straightforward applications of random sampling, and argue that care needs to be taken in how the samples are constructed; uniform random sampling yields very low accuracy, while query sensitive realtime sampling is more expensive than exact solutions (both in CPU and I\/O cost). We show how to build robust samples a priori, based on existing synopses for distinct value estimation. We prove the accuracy of our technique theoretically, and verify its performance experimentally. Our algorithm is orders of magnitude faster than exact solutions and has very small space overhead.<\/jats:p>","DOI":"10.14778\/1453856.1453883","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"201-212","source":"Crossref","is-referenced-by-count":41,"title":["Hashed samples"],"prefix":"10.14778","volume":"1","author":[{"given":"Marios","family":"Hadjieleftheriou","sequence":"first","affiliation":[{"name":"AT&amp;T Labs-Research, Florham Park NJ"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaohui","family":"Yu","sequence":"additional","affiliation":[{"name":"York University, Toronto ON, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nick","family":"Koudas","sequence":"additional","affiliation":[{"name":"University of Toronto, Toronto ON, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Divesh","family":"Srivastava","sequence":"additional","affiliation":[{"name":"AT&amp;T Labs-Research, Florham Park NJ"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2008,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/324133.324179"},{"key":"e_1_2_1_2_1","first-page":"918","volume-title":"Proc. of Very Large Data Bases (VLDB)","author":"Arasu A.","year":"2006","unstructured":"A. Arasu , V. Ganti , and R. Kaushik . Efficient exact set-similarity joins . In Proc. of Very Large Data Bases (VLDB) , pages 918 -- 929 , 2006 . A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In Proc. of Very Large Data Bases (VLDB), pages 918--929, 2006."},{"key":"e_1_2_1_3_1","unstructured":"AT&T Inc. Business listings from YellowPages.com. proprietary.  AT&T Inc. Business listings from YellowPages.com. proprietary."},{"key":"e_1_2_1_4_1","volume-title":"Modern Information Retrieval","author":"Baeza-Yates R.","year":"1999","unstructured":"R. Baeza-Yates and B. Ribeiro-Neto . Modern Information Retrieval . Addison Wesley , May 1999 . R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, May 1999."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/PL00009253"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1247480.1247504"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/76359.76371"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/0022-0000(79)90044-8"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1247480.1247521"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/335168.335230"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007602"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.5555\/977401.978116"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.9"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/0022-0000(85)90041-8"},{"key":"e_1_2_1_15_1","first-page":"491","volume-title":"Proc. of Very Large Data Bases (VLDB)","author":"Gravano L.","year":"2001","unstructured":"L. Gravano , P. G. Ipeirotis , H. V. Jagadish , N. Koudas , S. Muthukrishnan , and D. Srivastava . Approximate string joins in a database (almost) for free . In Proc. of Very Large Data Bases (VLDB) , pages 491 -- 500 , 2001 . L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In Proc. of Very Large Data Bases (VLDB), pages 491--500, 2001."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.128"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007601"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2008.4497435"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/10515.10522"},{"key":"e_1_2_1_20_1","unstructured":"IMDB. IMDB database. http:\/\/ www.imdb.com\/interfaces.  IMDB. IMDB database. http:\/\/ www.imdb.com\/interfaces."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/s007780000029"},{"key":"e_1_2_1_22_1","first-page":"397","volume-title":"Proc. of Very Large Data Bases (VLDB)","author":"Jin L.","year":"2005","unstructured":"L. Jin and C. Li . Selectivity estimation for fuzzy string predicates in large data sets . In Proc. of Very Large Data Bases (VLDB) , pages 397 -- 408 , 2005 . L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In Proc. of Very Large Data Bases (VLDB), pages 397--408, 2005."},{"key":"e_1_2_1_23_1","first-page":"195","volume-title":"Proc. of Very Large Data Bases (VLDB)","author":"Lee H.","year":"2007","unstructured":"H. Lee , R. Ng , and K. Shim . Extending q-grams to estimate selectivity of string matching with low edit distance . In Proc. of Very Large Data Bases (VLDB) , pages 195 -- 205 , 2007 . H. Lee, R. Ng, and K. Shim. Extending q-grams to estimate selectivity of string matching with low edit distance. In Proc. of Very Large Data Bases (VLDB), pages 195--205, 2007."},{"key":"e_1_2_1_24_1","unstructured":"M. Ley. DBLP database. http:\/\/dblp.uni-trier.de\/xml.  M. Ley. DBLP database. http:\/\/dblp.uni-trier.de\/xml."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242524.1242529"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/233269.233342"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2003.1260787"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/775047.775087"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007652"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1328854.1328855"},{"key":"e_1_2_1_32_1","volume-title":"On the uniform convergence of relative frequencies of events to their probabilities. SIAM: Theory of Probability and its Applications (TPA), 16(2):264--280","author":"Vapnik V. N.","year":"1971","unstructured":"V. N. Vapnik and A. Y. Chrervonenkis . On the uniform convergence of relative frequencies of events to their probabilities. SIAM: Theory of Probability and its Applications (TPA), 16(2):264--280 , 1971 . V. N. Vapnik and A. Y. Chrervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. SIAM: Theory of Probability and its Applications (TPA), 16(2):264--280, 1971."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3147.3165"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1453856.1453883","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T11:12:25Z","timestamp":1672225945000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1453856.1453883"}},"subtitle":["selectivity estimators for set similarity selection queries"],"short-title":[],"issued":{"date-parts":[[2008,8]]},"references-count":32,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2008,8]]}},"alternative-id":["10.14778\/1453856.1453883"],"URL":"https:\/\/doi.org\/10.14778\/1453856.1453883","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2008,8]]}}}