{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T15:29:00Z","timestamp":1773588540344,"version":"3.50.1"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"name":"HKRGC","award":["16205420, 16205422, 16204223"],"award-info":[{"award-number":["16205420, 16205422, 16204223"]}]},{"name":"Natural Sciences and Engineering Research Council of Canada \u2013 Discovery Grant"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Database Syst."],"published-print":{"date-parts":[[2026,6,30]]},"abstract":"<jats:p>Sampling over joins is a fundamental task in large-scale data analytics. Instead of computing the full join results, which could be massive, a uniform sample of the join results would suffice for many purposes, such as answering analytical queries or training machine learning models. In this article, we study the problem of how to maintain a random sample over joins while the tuples are streaming in. Without the join, this problem can be solved by some simple and classical reservoir sampling algorithms. However, the join operator makes the problem significantly harder, as the join size can be polynomially larger than the input. We present a new algorithm for this problem that achieves a near-linear complexity. The key technical components are a generalized reservoir sampling algorithm that supports a predicate, and a dynamic index for sampling over joins. We also conduct extensive experiments on both graph and relational data over various join queries, and the experimental results demonstrate significant performance improvement over the state of the art.<\/jats:p>","DOI":"10.1145\/3787855","type":"journal-article","created":{"date-parts":[[2026,1,7]],"date-time":"2026-01-07T20:45:16Z","timestamp":1767818716000},"page":"1-35","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Reservoir Sampling over Joins"],"prefix":"10.1145","volume":"51","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-5438-7288","authenticated-orcid":false,"given":"Binyang","family":"Dai","sequence":"first","affiliation":[{"name":"The Hong Kong University of Science and Technology School of Engineering","place":["Hong Kong, Hong Kong"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7890-665X","authenticated-orcid":false,"given":"Xiao","family":"Hu","sequence":"additional","affiliation":[{"name":"Computer Science, University of Waterloo","place":["Waterloo, Canada"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2178-3716","authenticated-orcid":false,"given":"Ke","family":"Yi","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology School of Engineering","place":["Hong Kong, Hong Kong"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,3,10]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"crossref","unstructured":"G\u00e1bor Sz\u00e1rnyas Jack Waudby Benjamin A. Steer D\u00e1vid Szak\u00e1llas Altan Birler Mingxi Wu Yuchen Zhang and Peter Boncz. 2022. The LDBC social network benchmark: Business intelligence workload. In Proceedings of the VLDB Endow 16 4 (December 2022) 877\u2013890.","DOI":"10.14778\/3574245.3574270"},{"key":"e_1_3_2_3_2","doi-asserted-by":"crossref","unstructured":"Junyi Xie and Jun Yang. 2007. A survey of join processing in data streams. In Data Streams: Models and Algorithms Charu C. Aggarwal (Ed.). Springer US Boston MA 209\u2013236.","DOI":"10.1007\/978-0-387-47534-9_10"},{"key":"e_1_3_2_4_2","unstructured":"Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford large network dataset collection. Retrieved from http:\/\/snap.stanford.edu\/data"},{"key":"e_1_3_2_5_2","doi-asserted-by":"crossref","unstructured":"Meikel Poess Bryan Smith Lubor Kollar and Paul Larson. 2002. TPC-DS taking decision support benchmarking to the next level. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD\u201902). Association for Computing Machinery Madison Wisconsin 582\u2013587.","DOI":"10.1145\/564691.564759"},{"key":"e_1_3_2_6_2","doi-asserted-by":"crossref","unstructured":"Mark Raasveldt and Hannes M\u00fchleisen. 2019. DuckDB: An embeddable analytical database. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD\u201919). Association for Computing Machinery Amsterdam Netherlands 1981\u20131984.","DOI":"10.1145\/3299869.3320212"},{"key":"e_1_3_2_7_2","volume-title":"Foundations of Databases: The Logical Level (1st ed.)","author":"Abiteboul Serge","year":"1995","unstructured":"Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases: The Logical Level (1st ed.). Addison-Wesley Longman Publishing Co., Inc., USA."},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/304182.304207"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/FOCS.2008.43"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-74915-8_18"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45726-7_1"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3034786.3034789"},{"key":"e_1_3_2_13_2","series-title":"Leibniz International Proceedings in Informatics (LIPIcs)","first-page":"55:1\u201355:19","volume-title":"Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX\/RANDOM 2021)","volume":"207","author":"Biswas Amartya Shankha","year":"2021","unstructured":"Amartya Shankha Biswas, Talya Eden, and Ronitt Rubinfeld. 2021. Towards a decomposition-optimal algorithm for counting and sampling arbitrary motifs in sublinear time. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX\/RANDOM 2021)(Leibniz International Proceedings in Informatics (LIPIcs), Vol. 207), Mary Wootters and Laura Sanit\u00e0 (Eds.). Schloss Dagstuhl \u2013 Leibniz-Zentrum f\u00fcr Informatik, Dagstuhl, Germany, 55:1\u201355:19."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3375395.3387662"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/304182.304206"},{"key":"e_1_3_2_16_2","series-title":"Leibniz International Proceedings in Informatics (LIPIcs)","first-page":"7:1\u20137:18","volume-title":"23rd International Conference on Database Theory (ICDT 2020)","volume":"155","author":"Chen Yu","year":"2020","unstructured":"Yu Chen and Ke Yi. 2020. Random sampling and size estimation over cyclic joins. In 23rd International Conference on Database Theory (ICDT 2020)(Leibniz International Proceedings in Informatics (LIPIcs), Vol. 155), Carsten Lutz and Jean Christoph Jung (Eds.). Schloss Dagstuhl \u2013 Leibniz-Zentrum f\u00fcr Informatik, Dagstuhl, Germany, 7:1\u20137:18."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1017\/9781108769938"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3654921"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3584372.3588666"},{"key":"e_1_3_2_20_2","series-title":"Open Access Series in Informatics (OASIcs)","first-page":"7:1\u20137:9","volume-title":"1st Symposium on Simplicity in Algorithms (SOSA 2018)","author":"Eden Talya","year":"2018","unstructured":"Talya Eden and Will Rosenbaum. 2018. On sampling edges almost uniformly. In 1st Symposium on Simplicity in Algorithms (SOSA 2018)(Open Access Series in Informatics (OASIcs), Vol. 61), Raimund Seidel (Ed.). Schloss Dagstuhl \u2013 Leibniz-Zentrum f\u00fcr Informatik, Dagstuhl, Germany, 7:1\u20137:9."},{"key":"e_1_3_2_21_2","first-page":"10","article-title":"HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm","volume":"2007","author":"Flajolet Philippe","year":"2007","unstructured":"Philippe Flajolet, \u00c9ric Fusy, Olivier Gandouet, and Fr\u00e9d\u00e9ric Meunier. 2007. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics & Theoretical Computer Science DMTCS Proceedings vol. AH, 2007 Conference on Analysis of Algorithms (AofA 07), Article 10 (Jan2007).","journal-title":"Discrete Mathematics & Theoretical Computer Science"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/304182.304208"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3064027"},{"key":"e_1_3_2_24_2","series-title":"Leibniz International Proceedings in Informatics (LIPIcs)","first-page":"4:1\u20134:18","volume-title":"22nd International Conference on Database Theory (ICDT 2019)","volume":"127","author":"Kara Ahmet","year":"2019","unstructured":"Ahmet Kara, Hung Q. Ngo, Milos Nikolic, Dan Olteanu, and Haozhe Zhang. 2019. Counting triangles under updates in worst-case optimal time. In 22nd International Conference on Database Theory (ICDT 2019)(Leibniz International Proceedings in Informatics (LIPIcs), Vol. 127), Pablo Barcelo and Marco Calautti (Eds.). Schloss Dagstuhl \u2013 Leibniz-Zentrum f\u00fcr Informatik, Dagstuhl, Germany, 4:1\u20134:18."},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3375395.3387646"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3584372.3588676"},{"key":"e_1_3_2_27_2","volume-title":"The Art of Computer Programming, Volume 2 (3rd ed.): Seminumerical Algorithms","author":"Knuth Donald E.","year":"1997","unstructured":"Donald E. Knuth. 1997. The Art of Computer Programming, Volume 2 (3rd ed.): Seminumerical Algorithms. Addison-Wesley Longman Publishing Co., Inc., USA."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/198429.198435"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/359619.359627"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183758"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.14778\/2556549.2556569"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3147.3165"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.14778\/3579075.3579080"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380586"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.5555\/1286831.1286840"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183739"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389717"}],"container-title":["ACM Transactions on Database Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3787855","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T13:49:26Z","timestamp":1773582566000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3787855"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,10]]},"references-count":36,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,6,30]]}},"alternative-id":["10.1145\/3787855"],"URL":"https:\/\/doi.org\/10.1145\/3787855","relation":{},"ISSN":["0362-5915","1557-4644"],"issn-type":[{"value":"0362-5915","type":"print"},{"value":"1557-4644","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,10]]},"assertion":[{"value":"2025-04-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-16","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-03-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}