{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,11]],"date-time":"2026-01-11T05:39:36Z","timestamp":1768109976610,"version":"3.49.0"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2021,7]]},"abstract":"<jats:p>Researchers and industry analysts are increasingly interested in computing aggregation queries over large, unstructured datasets with selective predicates that are computed using expensive deep neural networks (DNNs). As these DNNs are expensive and because many applications can tolerate approximate answers, analysts are interested in accelerating these queries via approximations. Unfortunately, standard approximate query processing techniques to accelerate such queries are not applicable because they assume the result of the predicates are available ahead of time. Furthermore, recent work using cheap approximations (i.e., proxies) do not support aggregation queries with predicates.<\/jats:p>\n          <jats:p>To accelerate aggregation queries with expensive predicates, we develop and analyze a query processing algorithm that leverages proxies (ABAE). ABAE must account for the key challenge that it may sample records that do not satisfy the predicate. To address this challenge, we first use the proxy to group records into strata so that records satisfying the predicate are ideally grouped into few strata. Given these strata, ABAE uses pilot sampling and plugin estimates to sample according to the optimal allocation. We show that ABAE converges at an optimal rate in a novel analysis of stratified sampling with draws that may not satisfy the predicate. We further show that ABAE outperforms on baselines on six real-world datasets, reducing labeling costs by up to 2.3X.<\/jats:p>","DOI":"10.14778\/3476249.3476285","type":"journal-article","created":{"date-parts":[[2021,10,27]],"date-time":"2021-10-27T16:46:23Z","timestamp":1635353183000},"page":"2341-2354","source":"Crossref","is-referenced-by-count":17,"title":["Accelerating approximate aggregation queries with expensive predicates"],"prefix":"10.14778","volume":"14","author":[{"given":"Daniel","family":"Kang","sequence":"first","affiliation":[{"name":"Stanford University"}]},{"given":"John","family":"Guibas","sequence":"additional","affiliation":[{"name":"Stanford University"}]},{"given":"Peter","family":"Bailis","sequence":"additional","affiliation":[{"name":"Stanford University"}]},{"given":"Tatsunori","family":"Hashimoto","sequence":"additional","affiliation":[{"name":"Stanford University"}]},{"given":"Yi","family":"Sun","sequence":"additional","affiliation":[{"name":"University of Chicago"}]},{"given":"Matei","family":"Zaharia","sequence":"additional","affiliation":[{"name":"Stanford University"}]}],"member":"320","published-online":{"date-parts":[[2021,10,27]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.5555\/645925.671347"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2465351.2465355"},{"key":"e_1_2_1_3_1","volume-title":"Contextual String Embeddings for Sequence Labeling. In COLING 2018, 27th International Conference on Computational Linguistics. 1638--1649","author":"Akbik Alan","year":"2018"},{"key":"e_1_2_1_4_1","volume-title":"Predicate Optimization for a Visual Analytics Database. ICDE","author":"Anderson Michael R","year":"2019"},{"key":"e_1_2_1_5_1","doi-asserted-by":"crossref","unstructured":"Bouhari Arouna. 2004. Adaptative Monte Carlo method a variance reduction technique. (2004).  Bouhari Arouna. 2004. Adaptative Monte Carlo method a variance reduction technique. (2004).","DOI":"10.1515\/156939604323091180"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5555\/993483"},{"key":"e_1_2_1_7_1","volume-title":"Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques","author":"Braverman Vladimir"},{"key":"e_1_2_1_8_1","volume-title":"Scaling Video Analytics on Constrained Edge Nodes. SysML","author":"Canel Christopher","year":"2019"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/2789272.2886821"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242524.1242526"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/PL00012580"},{"key":"e_1_2_1_12_1","volume-title":"Sampling techniques","author":"Cochran William G"},{"key":"e_1_2_1_13_1","volume-title":"TREC 2005 Spam Track Overview.. In TREC. 500--274","author":"Cormack Gordon V","year":"2005"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687687"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1097-0258(19990715)18:13<1575::AID-SIM153>3.0.CO;2-Z"},{"key":"e_1_2_1_16_1","volume-title":"An introduction to the bootstrap","author":"Efron Bradley"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0150205"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1086\/268245"},{"key":"e_1_2_1_19_1","volume-title":"Rekall: Specifying video events using compositions of spatiotemporal labels. arXiv preprint arXiv:1910.02993","author":"Fu Daniel Y","year":"2019"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407817"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/564691.564794"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1081870.1081884"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/3305381.3305518"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1946.10501894"},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","volume-title":"Statistical meta-analysis with applications","author":"Hartung Joachim","DOI":"10.1002\/9780470386347"},{"key":"e_1_2_1_26_1","volume-title":"Mask r-cnn","author":"He Kaiming"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/253262.253291"},{"key":"e_1_2_1_28_1","doi-asserted-by":"crossref","unstructured":"James Hong Will Crichton Haotian Zhang Daniel Y Fu Jacob Ritchie Jeremy Barenholtz Ben Hannel Xinwei Yao Michaela Murray Geraldine Moriba etal 2020. Analyzing Who and What Appears in a Decade of US Cable TV News. arXiv preprint arXiv:2008.06007 (2020).  James Hong Will Crichton Haotian Zhang Daniel Y Fu Jacob Ritchie Jeremy Barenholtz Ben Hannel Xinwei Yao Michaela Murray Geraldine Moriba et al. 2020. Analyzing Who and What Appears in a Decade of US Cable TV News. arXiv preprint arXiv:2008.06007 (2020).","DOI":"10.1145\/3447548.3467134"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5555\/3291168.3291188"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1609\/icwsm.v8i1.14550"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3230543.3230574"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14778\/3372716.3372725"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137664"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407804"},{"key":"e_1_2_1_35_1","volume-title":"Task-agnostic Indexes for Deep Learning-based Queries over Unstructured Data. arXiv preprint arXiv:2009.04540","author":"Kang Daniel","year":"2020"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476285"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.14778\/3425879.3425881"},{"key":"e_1_2_1_38_1","first-page":"42","article-title":"Optimum allocation in multivariate stratified sampling in presence of non-response","volume":"62","author":"Khan Mohammad GM","year":"2008","journal-title":"Journal of the Indian Society of Agricultural Statistics"},{"key":"e_1_2_1_39_1","unstructured":"Leslie Kish. 1965. Survey sampling. Number 04; HN29 K5.  Leslie Kish. 1965. Survey sampling. Number 04; HN29 K5."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-018-0074-4"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3384414"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.425"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183751"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687738"},{"key":"e_1_2_1_45_1","unstructured":"Dankit K Nassiuma. 2001. Survey sampling: Theory and methods.  Dankit K Nassiuma. 2001. Survey sampling: Theory and methods."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1018"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/971697.602294"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3197517.3201394"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/235968.233342"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00474"},{"key":"e_1_2_1_51_1","volume-title":"R Lyman Ott, and Kenneth G Gerow.","author":"Scheaffer Richard L","year":"2011"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASYU50717.2020.9259802"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1137\/0719026"},{"key":"e_1_2_1_54_1","unstructured":"Robert Tarjan. 2009. 15-359: Probability and Computing Lecture 10: More Chernoff Bounds Sampling and the Chernoff + Union Bound method. http:\/\/aiweb.techfak.uni-bielefeld.de\/content\/bworld-robot-control-software\/. [Online; accessed 11-Feb-2021].  Robert Tarjan. 2009. 15-359: Probability and Computing Lecture 10: More Chernoff Bounds Sampling and the Chernoff + Union Bound method. http:\/\/aiweb.techfak.uni-bielefeld.de\/content\/bworld-robot-control-software\/. [Online; accessed 11-Feb-2021]."},{"key":"e_1_2_1_55_1","volume-title":"A tutorial on pilot studies: the what, why and how. BMC medical research methodology 10, 1","author":"Thabane Lehana","year":"2010"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00117"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11590-011-0329-8"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.14778\/3368289.3368302"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3302424.3303971"},{"key":"e_1_2_1_60_1","volume-title":"Face detection using improved faster rcnn. arXiv preprint arXiv:1802.02142","author":"Zhang Changzheng","year":"2018"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.5555\/3154630.3154661"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2016.2603342"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.14778\/3372716.3372721"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3476249.3476285","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:04:10Z","timestamp":1672221850000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3476249.3476285"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7]]},"references-count":63,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2021,7]]}},"alternative-id":["10.14778\/3476249.3476285"],"URL":"https:\/\/doi.org\/10.14778\/3476249.3476285","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2021,7]]}}}