{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,1,12]],"date-time":"2024-01-12T21:05:46Z","timestamp":1705093546512},"reference-count":22,"publisher":"Association for Computing Machinery (ACM)","issue":"13","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2022,9]]},"abstract":"<jats:p>Stochastic sample-based estimators are among the most fundamental and universally applied tools in statistics. Such estimators are particularly important when processing huge amounts of data, where we need to be able to answer a wide range of statistical queries reliably, yet cannot afford to store the data in its full length.<\/jats:p>\n          <jats:p>\n            In many applications we need the sampling to be coordinated which is typically attained using hashing. In previous work, a common strategy to obtain reliable sample-based estimators that work within certain error bounds with high probability has been to design one that works with constant probability, and then boost the probability by taking the median over\n            <jats:italic>r<\/jats:italic>\n            independent repetitions. Aamand et al. (STOC'20) recently proposed a fast and practical hashing scheme with\n            <jats:italic>strong concentration bounds<\/jats:italic>\n            , Tabulation-1Permutation, the first of its kind. In this paper, we demonstrate that using such a hash family for the sampling, we achieve the same high probability bounds without any need for repetitions. Using the same space, this saves a factor\n            <jats:italic>r<\/jats:italic>\n            in time, and simplifies the overall algorithms.\n          <\/jats:p>\n          <jats:p>We validate our approach experimentally on both real and synthetic data. We compare Tabulation-1Permutation with other hash functions such as strongly universal hash functions and various other hash functions such as MurmurHash3 and BLAKE3, both with and without resorting to repetitions. We see that if we want reliability in terms of small error probabilities, then Tabulation-1Permutation is significantly faster.<\/jats:p>","DOI":"10.14778\/3565838.3565851","type":"journal-article","created":{"date-parts":[[2023,1,20]],"date-time":"2023-01-20T23:09:56Z","timestamp":1674256196000},"page":"3989-4001","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["No Repetition"],"prefix":"10.14778","volume":"15","author":[{"given":"Anders","family":"Aamand","sequence":"first","affiliation":[{"name":"MIT"}]},{"given":"Debarati","family":"Das","sequence":"additional","affiliation":[{"name":"Penn State University"}]},{"given":"Evangelos","family":"Kipouridis","sequence":"additional","affiliation":[{"name":"University of Copenhagen, Copenhagen, Denmark"}]},{"given":"Jakob B. T.","family":"Knudsen","sequence":"additional","affiliation":[{"name":"University of Copenhagen, Copenhagen, Denmark"}]},{"given":"Peter M. R.","family":"Rasmussen","sequence":"additional","affiliation":[{"name":"University of Copenhagen, Copenhagen, Denmark"}]},{"given":"Mikkel","family":"Thorup","sequence":"additional","affiliation":[{"name":"University of Copenhagen, Copenhagen, Denmark"}]}],"member":"320","published-online":{"date-parts":[[2023,1,20]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020. ACM, 1265--1278","author":"Aamand Anders","year":"2020","unstructured":"Anders Aamand , Jakob B\u00e6k Tejs Knudsen , Mathias B\u00e6k Tejs Knudsen , Peter Michael Reichstein Rasmussen , and Mikkel Thorup . 2020 . Fast hashing with strong concentration bounds . In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020. ACM, 1265--1278 . Anders Aamand, Jakob B\u00e6k Tejs Knudsen, Mathias B\u00e6k Tejs Knudsen, Peter Michael Reichstein Rasmussen, and Mikkel Thorup. 2020. Fast hashing with strong concentration bounds. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020. ACM, 1265--1278."},{"key":"e_1_2_1_2_1","unstructured":"Austin Appleby. 2016. Murmurhash3. Available at https:\/\/github.com\/aappleby\/smhasher\/wiki\/MurmurHash3 Last accessed 16\/9\/2022.  Austin Appleby. 2016. Murmurhash3. Available at https:\/\/github.com\/aappleby\/smhasher\/wiki\/MurmurHash3 Last accessed 16\/9\/2022."},{"key":"e_1_2_1_3_1","volume-title":"International Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM). 1--10","author":"Bar-Yossef Ziv","year":"2002","unstructured":"Ziv Bar-Yossef , T. S. Jayram , Ravi Kumar , D. Sivakumar , and Luca Trevisan . 2002 . Counting Distinct Elements in a Data Stream . In International Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM). 1--10 . Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. 2002. Counting Distinct Elements in a Data Stream. In International Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM). 1--10."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1247480.1247504"},{"key":"e_1_2_1_5_1","unstructured":"Martin Bo\u00dflet. 2012. Breaking murmur: Hash-flooding dos reloaded. Available at https:\/\/emboss.github.io\/blog\/2012\/12\/14\/breaking-murmur-hash-flooding-dos-reloaded\/ Last accessed 16\/9\/2022.  Martin Bo\u00dflet. 2012. Breaking murmur: Hash-flooding dos reloaded. Available at https:\/\/emboss.github.io\/blog\/2012\/12\/14\/breaking-murmur-hash-flooding-dos-reloaded\/ Last accessed 16\/9\/2022."},{"key":"e_1_2_1_6_1","volume-title":"Proc. Compression and Complexity of Sequences (SEQUENCES). 21--29","author":"Broder Andrei Z.","year":"1997","unstructured":"Andrei Z. Broder . 1997 . On the resemblance and containment of documents . In Proc. Compression and Complexity of Sequences (SEQUENCES). 21--29 . Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Proc. Compression and Complexity of Sequences (SEQUENCES). 21--29."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/0022-0000(79)90044-8"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453884"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/FOCS.2015.83"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1006\/jagm.1997.0873"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-322-95233-2_7"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/347059.347555"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1314690.1314696"},{"key":"e_1_2_1_14_1","volume-title":"In AOFA '07: Proceedings of the 2007 International Conference on Analysis of Algorithms. 127--146","author":"Flajolet Philippe","year":"2007","unstructured":"Philippe Flajolet , \u00c9ric Fusy , Olivier Gandouet , and Frederic Meunier . 2007 . Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm . In In AOFA '07: Proceedings of the 2007 International Conference on Analysis of Algorithms. 127--146 . Philippe Flajolet, \u00c9ric Fusy, Olivier Gandouet, and Frederic Meunier. 2007. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In In AOFA '07: Proceedings of the 2007 International Conference on Analysis of Algorithms. 127--146."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3164135.3164145"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.4230\/LIPIcs.ICALP.2022.74"},{"key":"e_1_2_1_17_1","volume-title":"Jean-Philippe Aumasson and Zooko Wilcox-O'Hearn","author":"Jack O'Connor Samuel Neves","year":"2020","unstructured":"Samuel Neves Jack O'Connor , Jean-Philippe Aumasson and Zooko Wilcox-O'Hearn . 2020 . BLAKE 3. Available at https:\/\/github.com\/BLAKE3-team\/BLAKE3, Last accessed 16\/9\/2022. Samuel Neves Jack O'Connor, Jean-Philippe Aumasson and Zooko Wilcox-O'Hearn. 2020. BLAKE3. Available at https:\/\/github.com\/BLAKE3-team\/BLAKE3, Last accessed 16\/9\/2022."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/948205.948236"},{"key":"e_1_2_1_19_1","volume-title":"Randomized Algorithms","author":"Motwani Rajeev","unstructured":"Rajeev Motwani and Prabhakar Raghavan . 1995. Randomized Algorithms . Cambridge University Press . Rajeev Motwani and Prabhakar Raghavan. 1995. Randomized Algorithms. Cambridge University Press."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2488608.2488655"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1137\/100800774"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1016\/0022-0000(81)90033-7"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3565838.3565851","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,20]],"date-time":"2023-01-20T23:15:36Z","timestamp":1674256536000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3565838.3565851"}},"subtitle":["Fast and Reliable Sampling with Highly Concentrated Hashing"],"short-title":[],"issued":{"date-parts":[[2022,9]]},"references-count":22,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2022,9]]}},"alternative-id":["10.14778\/3565838.3565851"],"URL":"https:\/\/doi.org\/10.14778\/3565838.3565851","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2022,9]]},"assertion":[{"value":"2023-01-20","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}