{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T07:15:44Z","timestamp":1779174944329,"version":"3.51.4"},"reference-count":31,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,5,29]],"date-time":"2024-05-29T00:00:00Z","timestamp":1716940800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"HKRGC","award":["16205420,16205422,16204223"],"award-info":[{"award-number":["16205420,16205422,16204223"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,5,29]]},"abstract":"<jats:p>Sampling over joins is a fundamental task in large-scale data analytics. Instead of computing the full join results, which could be massive, a uniform sample of the join results would suffice for many purposes, such as answering analytical queries or training machine learning models. In this paper, we study the problem of how to maintain a random sample over joins while the tuples are streaming in. Without the join, this problem can be solved by some simple and classical reservoir sampling algorithms. However, the join operator makes the problem significantly harder, as the join size can be polynomially larger than the input. We present a new algorithm for this problem that achieves a near-linear complexity. The key technical components are a generalized reservoir sampling algorithm that supports a predicate, and a dynamic index for sampling over joins. We also conduct extensive experiments on both graph and relational data over various join queries, and the experimental results demonstrate significant performance improvement over the state of the art.<\/jats:p>","DOI":"10.1145\/3654921","type":"journal-article","created":{"date-parts":[[2024,5,30]],"date-time":"2024-05-30T09:44:53Z","timestamp":1717062293000},"page":"1-26","source":"Crossref","is-referenced-by-count":8,"title":["Reservoir Sampling over Joins"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-5438-7288","authenticated-orcid":false,"given":"Binyang","family":"Dai","sequence":"first","affiliation":[{"name":"Hong Kong University of Science and Technology, Hong Kong, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7890-665X","authenticated-orcid":false,"given":"Xiao","family":"Hu","sequence":"additional","affiliation":[{"name":"University of Waterloo, Waterloo, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2178-3716","authenticated-orcid":false,"given":"Ke","family":"Yi","sequence":"additional","affiliation":[{"name":"Hong Kong University of Science and Technology, Hong Kong, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,5,30]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Code. https:\/\/github.com\/hkustDB\/Reservoir-Sampling-over-Joins"},{"key":"e_1_2_1_2_1","unstructured":"Reservoir Sampling over Joins. https:\/\/arxiv.org\/pdf\/2404.03194.pdf"},{"key":"e_1_2_1_3_1","unstructured":"Symmetric hash join. https:\/\/en.wikipedia.org\/wiki\/Symmetric_hash_join"},{"key":"e_1_2_1_4_1","unstructured":"TPC-DS dataset. https:\/\/www.tpc.org\/tpcds\/"},{"key":"e_1_2_1_5_1","volume-title":"Foundations of databases","author":"Abiteboul Serge","unstructured":"Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of databases. Vol. 8. Addison-Wesley Reading."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/304182.304207"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/FOCS.2008.43"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/2392389.2392412"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2402.322389"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3034786.3034789"},{"key":"e_1_2_1_11_1","volume-title":"RANDOM 2021","author":"Biswas AS","year":"2021","unstructured":"AS Biswas, T Eden, and R Rubinfeld. 2021. Towards a Decomposition-Optimal Algorithm for Counting and Sampling Arbitrary Motifs in Sublinear Time. RANDOM 2021 (2021)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3375395.3387662"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/304181.304206"},{"key":"e_1_2_1_14_1","volume-title":"Random Sampling and Size Estimation Over Cyclic Joins. In 23rd International Conference on Database Theory.","author":"Chen Yu","year":"2020","unstructured":"Yu Chen and Ke Yi. 2020. Random Sampling and Size Estimation Over Cyclic Joins. In 23rd International Conference on Database Theory."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3584372.3588666"},{"key":"e_1_2_1_16_1","volume-title":"On Sampling Edges Almost Uniformly. In 1st Symposium on Simplicity in Algorithms (SOSA","author":"Eden Talya","year":"2018","unstructured":"Talya Eden and Will Rosenbaum. 2018. On Sampling Edges Almost Uniformly. In 1st Symposium on Simplicity in Algorithms (SOSA 2018)."},{"key":"e_1_2_1_17_1","volume-title":"Tractability: Practical Approaches to Hard Problems 1","author":"Gottlob Georg","year":"2014","unstructured":"Georg Gottlob, Gianluigi Greco, Francesco Scarcello, et al. 2014. Treewidth and hypertree width. Tractability: Practical Approaches to Hard Problems 1 (2014), 20."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3064027"},{"key":"e_1_2_1_19_1","volume-title":"22nd International Conference on Database Theory. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.","author":"Kara Ahmet","year":"2019","unstructured":"Ahmet Kara, Hung Q Ngo, Milos Nikolic, Dan Olteanu, and Haozhe Zhang. 2019. Counting Triangles under Updates in Worst-Case Optimal Time. In 22nd International Conference on Database Theory. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3375395.3387646"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3584372.3588676"},{"key":"e_1_2_1_22_1","volume-title":"The art of computer programming","author":"Knuth Donald Ervin","unstructured":"Donald Ervin Knuth. 1997. The art of computer programming. Vol. 3. Pearson Education."},{"key":"e_1_2_1_23_1","unstructured":"Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http:\/\/snap.stanford. edu\/data."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/198429.198435"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/2556549.2556569"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574270"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3147.3165"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/3579075.3579080"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380586"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183739"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389717"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3654921","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3654921","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T14:42:26Z","timestamp":1755787346000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3654921"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,29]]},"references-count":31,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,5,29]]}},"alternative-id":["10.1145\/3654921"],"URL":"https:\/\/doi.org\/10.1145\/3654921","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,29]]}}}