{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T08:54:33Z","timestamp":1775638473603,"version":"3.50.1"},"reference-count":28,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2008,8,1]],"date-time":"2008-08-01T00:00:00Z","timestamp":1217548800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000145","name":"Division of Information and Intelligent Systems","doi-asserted-by":"publisher","award":["NSF-CAREER-IIS-0448264"],"award-info":[{"award-number":["NSF-CAREER-IIS-0448264"]}],"id":[{"id":"10.13039\/100000145","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Database Syst."],"published-print":{"date-parts":[[2008,8]]},"abstract":"<jats:p>Sketching techniques provide approximate answers to aggregate queries both for data-streaming and distributed computation. Small space summaries that have linearity properties are required for both types of applications. The prevalent method for analyzing sketches uses moment analysis and distribution-independent bounds based on moments. This method produces clean, easy to interpret, theoretical bounds that are especially useful for deriving asymptotic results. However, the theoretical bounds obscure fine details of the behavior of various sketches and they are mostly not indicative of which type of sketches should be used in practice. Moreover, no significant empirical comparison between various sketching techniques has been published, which makes the choice even harder. In this article we take a close look at the sketching techniques proposed in the literature from a statistical point of view with the goal of determining properties that indicate the actual behavior and producing tighter confidence bounds. Interestingly, the statistical analysis reveals that two of the techniques, Fast-AGMS and Count-Min, provide results that are in some cases orders of magnitude better than the corresponding theoretical predictions. We conduct an extensive empirical study that compares the different sketching techniques in order to corroborate the statistical analysis with the conclusions we draw from it. The study indicates the expected performance of various sketches, which is crucial if the techniques are to be used by practitioners. The overall conclusion of the study is that Fast-AGMS sketches are, for the full spectrum of problems, either the best, or close to the best, sketching technique. We apply the insights obtained from the statistical study and the experimental results to design effective algorithms for sketching interval data. We show how the two basic methods for sketching interval data, DMAP and fast range-summation, can be improved significantly with respect to the update time without a significant loss in accuracy. The gain in update time can be as large as two orders of magnitude, thus making the improved methods practical. The empirical study suggests that DMAP is preferable when update time is the critical requirement and fast range-summation is desirable for better accuracy.<\/jats:p>","DOI":"10.1145\/1386118.1386121","type":"journal-article","created":{"date-parts":[[2008,9,4]],"date-time":"2008-09-04T12:51:35Z","timestamp":1220532695000},"page":"1-46","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":50,"title":["Sketches for size of join estimation"],"prefix":"10.1145","volume":"33","author":[{"given":"Florin","family":"Rusu","sequence":"first","affiliation":[{"name":"University of Florida, Gainesville"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alin","family":"Dobra","sequence":"additional","affiliation":[{"name":"University of Florida, Gainesville"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2008,9,3]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2005.118"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1006\/jcss.2001.1813"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/237814.237823"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the 17th IEEE ICDE International Conference on Data Engineering. IEEE Computer Society, 368--375","author":"An N.","unstructured":"An , N. , Yang , Z.-Y. , and Sivasubramaniam , A . 2001. Selectivity estimation for spatial joins . In Proceedings of the 17th IEEE ICDE International Conference on Data Engineering. IEEE Computer Society, 368--375 . An, N., Yang, Z.-Y., and Sivasubramaniam, A. 2001. Selectivity estimation for spatial joins. In Proceedings of the 17th IEEE ICDE International Conference on Data Engineering. IEEE Computer Society, 368--375."},{"key":"e_1_2_1_5_1","doi-asserted-by":"crossref","first-page":"111","DOI":"10.1080\/00031305.1988.10475539","article-title":"Kurtosis: A critical review","volume":"42","author":"Balanda K. P.","year":"1988","unstructured":"Balanda , K. P. and MacGillivray , H. L. 1988 . Kurtosis: A critical review . J. Amer. Statist. 42 , 2, 111 -- 119 . Balanda, K. P. and MacGillivray, H. L. 1988. Kurtosis: A critical review. J. Amer. Statist. 42, 2, 111--119.","journal-title":"J. Amer. Statist."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the 29th International Colloquium on Automata, Languages and Programming","author":"Charikar M.","unstructured":"Charikar , M. , Chen , K. , and Farach-Colton , M. 2002. Finding frequent items in data streams . In Proceedings of the 29th International Colloquium on Automata, Languages and Programming . Springer , 693--703. Charikar, M., Chen, K., and Farach-Colton, M. 2002. Finding frequent items in data streams. In Proceedings of the 29th International Colloquium on Automata, Languages and Programming. Springer, 693--703."},{"key":"e_1_2_1_7_1","volume-title":"An Introduction to Statistical Modeling of Extreme Values","author":"Coles S.","unstructured":"Coles , S. 2001. An Introduction to Statistical Modeling of Extreme Values . Springer , Berlin, Germany . Coles, S. 2001. An Introduction to Statistical Modeling of Extreme Values. Springer, Berlin, Germany."},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 13--24","author":"Cormode G.","unstructured":"Cormode , G. and Garofalakis , M . 2005. Sketching streams through the net: distributed approximate query tracking . In Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 13--24 . Cormode, G. and Garofalakis, M. 2005. Sketching streams through the net: distributed approximate query tracking. In Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 13--24."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jalgor.2003.12.001"},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of SIAM Conference on Data Mining. SIAM","author":"Cormode G.","unstructured":"Cormode , G. and Muthukrishnan , S . 2005b. Summarizing and mining skewed data streams . In Proceedings of SIAM Conference on Data Mining. SIAM , Philadelphia, PA. Cormode, G. and Muthukrishnan, S. 2005b. Summarizing and mining skewed data streams. In Proceedings of SIAM Conference on Data Mining. SIAM, Philadelphia, PA."},{"key":"e_1_2_1_11_1","unstructured":"CPS. 2006. http:\/\/www.census.gov\/cps. CPS. 2006. http:\/\/www.census.gov\/cps."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007646"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/564691.564699"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.61"},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the 9th International Conference on Extending Database Technology. Springer, 569--586","author":"Ganguly S.","unstructured":"Ganguly , S. , Garofalakis , M. , and Rastogi , R . 2004. Processing data-stream join aggregates using skimmed sketches . In Proceedings of the 9th International Conference on Extending Database Technology. Springer, 569--586 . Ganguly, S., Garofalakis, M., and Rastogi, R. 2004. Processing data-stream join aggregates using skimmed sketches. In Proceedings of the 9th International Conference on Extending Database Technology. Springer, 569--586."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/11590156_24"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2005.108"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/304182.304208"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the 44th IEEE Symposium on Foundations of Computer Science. IEEE Computer Society, 482--491","author":"Kempe D.","unstructured":"Kempe , D. , Dobra , A. , and Gehrke , J . 2003. Gossip-based computation of aggregate information . In Proceedings of the 44th IEEE Symposium on Foundations of Computer Science. IEEE Computer Society, 482--491 . Kempe, D., Dobra, A., and Gehrke, J. 2003. Gossip-based computation of aggregate information. In Proceedings of the 44th IEEE Symposium on Foundations of Computer Science. IEEE Computer Society, 482--491."},{"key":"e_1_2_1_20_1","unstructured":"MassDAL. 2006. http:\/\/www.cs.rutgers.edu\/_muthu\/massdal.html. MassDAL. 2006. http:\/\/www.cs.rutgers.edu\/_muthu\/massdal.html."},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"Motwani R. and Raghavan P. 1995. Randomized Algorithms. Cambridge University Press Cambridge UK. Motwani R. and Raghavan P. 1995. Randomized Algorithms. Cambridge University Press Cambridge UK.","DOI":"10.1017\/CBO9780511814075"},{"key":"e_1_2_1_22_1","unstructured":"Olive D. J. 2005. A simple confidence interval for the median. Manuscript. www.math.siu.edu\/olive\/ppmedci.pdf. Olive D. J. 2005. A simple confidence interval for the median. Manuscript. www.math.siu.edu\/olive\/ppmedci.pdf."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1088\/0026-1394\/43\/3\/004"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1080\/00949650108812071"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242524.1242528"},{"key":"e_1_2_1_26_1","volume-title":"Applied Statistics\u2014A Handbook of Techniques","author":"Sachs L.","unstructured":"Sachs , L. 1984. Applied Statistics\u2014A Handbook of Techniques . Springer , Berlin, Germany . Sachs, L. 1984. Applied Statistics\u2014A Handbook of Techniques. Springer, Berlin, Germany."},{"key":"e_1_2_1_27_1","volume-title":"Mathematical Statistics","author":"Shao J.","unstructured":"Shao , J. 1999. Mathematical Statistics . Springer , Berlin, Germany . Shao, J. 1999. Mathematical Statistics. Springer, Berlin, Germany."},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the 15th ACM-SIAM Symposium on Discrete Algorithms. 615--624","author":"Thorup M.","unstructured":"Thorup , M. and Zhang , Y . 2004. Tabulation based 4-universal hashing with applications to second moment estimation . In Proceedings of the 15th ACM-SIAM Symposium on Discrete Algorithms. 615--624 . Thorup, M. and Zhang, Y. 2004. Tabulation based 4-universal hashing with applications to second moment estimation. In Proceedings of the 15th ACM-SIAM Symposium on Discrete Algorithms. 615--624."}],"container-title":["ACM Transactions on Database Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1386118.1386121","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1386118.1386121","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T13:57:47Z","timestamp":1750255067000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1386118.1386121"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2008,8]]},"references-count":28,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2008,8]]}},"alternative-id":["10.1145\/1386118.1386121"],"URL":"https:\/\/doi.org\/10.1145\/1386118.1386121","relation":{},"ISSN":["0362-5915","1557-4644"],"issn-type":[{"value":"0362-5915","type":"print"},{"value":"1557-4644","type":"electronic"}],"subject":[],"published":{"date-parts":[[2008,8]]},"assertion":[{"value":"2007-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2008-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2008-09-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}