{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T11:48:02Z","timestamp":1763466482045,"version":"3.41.0"},"reference-count":30,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2008,8,1]],"date-time":"2008-08-01T00:00:00Z","timestamp":1217548800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Database Syst."],"published-print":{"date-parts":[[2008,8]]},"abstract":"<jats:p>Sampling is now a very important data management tool, to such an extent that an interface for database sampling is included in the latest SQL standard. In this article we reconsider in depth what at first may seem like a very simple problem\u2014computing the error of a sampling-based guess for the answer to a GROUP BY query over a multitable join. The difficulty when sampling for the answer to such a query is that the same sample will be used to guess the result of the query for each group, which induces correlations among the estimates. Thus, from a statistical point-of-view it is very problematic and even dangerous to use traditional methods such as confidence intervals for communicating estimate accuracy to the user. We explore ways to address this problem, and pay particular attention to the computational aspects of computing \u201csafe\u201d confidence intervals.<\/jats:p>","DOI":"10.1145\/1386118.1386122","type":"journal-article","created":{"date-parts":[[2008,9,4]],"date-time":"2008-09-04T12:51:35Z","timestamp":1220532695000},"page":"1-44","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Confidence bounds for sampling-based group by estimates"],"prefix":"10.1145","volume":"33","author":[{"given":"Fei","family":"Xu","sequence":"first","affiliation":[{"name":"University of Florida, Gainesville, Gainesville, FL"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Christopher","family":"Jermaine","sequence":"additional","affiliation":[{"name":"University of Florida, Gainesville, Gainesville, FL"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alin","family":"Dobra","sequence":"additional","affiliation":[{"name":"University of Florida, Gainesville, Gainesville, FL"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2008,9,3]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/304182.304207"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/304182.304581"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/342009.335450"},{"key":"e_1_2_1_4_1","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1111\/j.2517-6161.1995.tb02031.x","article-title":"Controlling the false discovery rate: A practical and powerful approach to multiple testing","volume":"57","author":"Benjamini Y.","year":"1995","unstructured":"Benjamini , Y. and Hochberg , Y. 1995 . Controlling the false discovery rate: A practical and powerful approach to multiple testing . J. Royal Statisti. Soc. 57 , 289 -- 300 . Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Royal Statisti. Soc. 57, 289--300.","journal-title":"J. Royal Statisti. Soc."},{"key":"e_1_2_1_5_1","unstructured":"Casella G. and Berger R. L. 2002. Statistical Inference. 2nd Ed. Duxbury. CAS g2 02:1 1.Ex. Casella G. and Berger R. L. 2002. Statistical Inference. 2nd Ed. Duxbury. CAS g 2 02:1 1.Ex."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/335168.335230"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/375663.375694"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/564691.564699"},{"volume-title":"Data Analysis Tools for DNA Microarrays","author":"Dragici S.","key":"e_1_2_1_9_1","unstructured":"Dragici , S. 2003. Data Analysis Tools for DNA Microarrays . Chapman and Hall, CRC Press . Dragici, S. 2003. Data Analysis Tools for DNA Microarrays. Chapman and Hall, CRC Press."},{"volume-title":"Proceedings of the 26th International Conference on Very Large Data Bases (VLDB'00)","author":"Ganti V.","key":"e_1_2_1_10_1","unstructured":"Ganti , V. , Lee , M.-L. , and Ramakrishnan , R . 2000. Icicles: Self-tuning samples for approximate query answering . In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB'00) . Morgan Kaufmann, 176--187. Ganti, V., Lee, M.-L., and Ramakrishnan, R. 2000. Icicles: Self-tuning samples for approximate query answering. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB'00). Morgan Kaufmann, 176--187."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/276304.276334"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/304182.304208"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.781635"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/253260.253291"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/75.4.800"},{"key":"e_1_2_1_16_1","doi-asserted-by":"crossref","unstructured":"Hochberg Y. and Tamhane A. C. 1987. Multiple Comparison Procedures. Wiley New York. Hochberg Y. and Tamhane A. C. 1987. Multiple Comparison Procedures. Wiley New York.","DOI":"10.1002\/9780470316672"},{"key":"e_1_2_1_17_1","first-page":"65","article-title":"A simple sequentially rejective multiple test procedure","volume":"6","author":"Holm S.","year":"1979","unstructured":"Holm , S. 1979 . A simple sequentially rejective multiple test procedure . Scand. J. Stat 6 , 65 -- 70 . Holm, S. 1979. A simple sequentially rejective multiple test procedure. Scand. J. Stat 6, 65--70.","journal-title":"Scand. J. Stat"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/308386.308455"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/67544.66933"},{"key":"e_1_2_1_20_1","volume-title":"Multiple Comparisons: Theory and Methods","author":"Hsu J.","year":"1996","unstructured":"Hsu , J. 1996 . Multiple Comparisons: Theory and Methods . Chapman and Hall, CRC Press . Hsu, J. 1996. Multiple Comparisons: Theory and Methods. Chapman and Hall, CRC Press."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1066157.1066222"},{"key":"e_1_2_1_22_1","unstructured":"Johnson N. L. Kotz S. and Balakrishnan N. 1995. Continuous Univariate Distributions Vol. 2 Wiley New York. Johnson N. L. Kotz S. and Balakrishnan N. 1995. Continuous Univariate Distributions Vol. 2 Wiley New York."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/93597.93611"},{"key":"e_1_2_1_24_1","volume-title":"Simultaneous Statistical Inference","author":"Miller R. G.","unstructured":"Miller , R. G. 1981. Simultaneous Statistical Inference , 2 nd ed. Springer , Berlin, Germany . Miller, R. G. 1981. Simultaneous Statistical Inference, 2nd ed. Springer, Berlin, Germany.","edition":"2"},{"volume-title":"Proceedings of the Conference on Very Large Data Bases (VLDB'89)","author":"Olken F.","key":"e_1_2_1_25_1","unstructured":"Olken , F. and Rotem , D . 1989. Random sampling from b+ trees . In Proceedings of the Conference on Very Large Data Bases (VLDB'89) . 269--277. Olken, F. and Rotem, D. 1989. Random sampling from b+ trees. In Proceedings of the Conference on Very Large Data Bases (VLDB'89). 269--277."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/93597.98746"},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Robert C. P. and Casella G. 2005. Monte Carlo Statistical Methods. Springer New York. Robert C. P. and Casella G. 2005. Monte Carlo Statistical Methods. Springer New York.","DOI":"10.1007\/978-1-4757-4145-2"},{"key":"e_1_2_1_28_1","doi-asserted-by":"crossref","unstructured":"Sarndal C. Swensson B. and Wretman J. 1992. Model Assisted Survey Sampling. Springer Berlin Germany. Sarndal C. Swensson B. and Wretman J. 1992. Model Assisted Survey Sampling. Springer Berlin Germany.","DOI":"10.1007\/978-1-4612-4378-6"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1111\/1467-9868.00346"},{"key":"e_1_2_1_30_1","unstructured":"Westfall P. and Young S. 1993. Resampling-Based Multiple Testing. Wiley New York. Westfall P. and Young S. 1993. Resampling-Based Multiple Testing. Wiley New York."}],"container-title":["ACM Transactions on Database Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1386118.1386122","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1386118.1386122","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T13:57:47Z","timestamp":1750255067000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1386118.1386122"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2008,8]]},"references-count":30,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2008,8]]}},"alternative-id":["10.1145\/1386118.1386122"],"URL":"https:\/\/doi.org\/10.1145\/1386118.1386122","relation":{},"ISSN":["0362-5915","1557-4644"],"issn-type":[{"type":"print","value":"0362-5915"},{"type":"electronic","value":"1557-4644"}],"subject":[],"published":{"date-parts":[[2008,8]]},"assertion":[{"value":"2006-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2008-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2008-09-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}