{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,3,6]],"date-time":"2024-03-06T00:41:29Z","timestamp":1709685689396},"reference-count":35,"publisher":"Association for Computing Machinery (ACM)","issue":"4","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2023,12]]},"abstract":"<jats:p>Top-k aggregation queries are widely used in data analytics for summarizing and identifying important groups from large amounts of data. These queries are usually processed by first computing exact aggregates for all groups and then selecting the groups with the top-k aggregate values. However, such an approach can be inefficient for high-cardinality large datasets where intermediate results may not fit within the local cache of multi-core processors leading to excessive data movement. To address this problem, we have developed Zippy, a new cache-conscious aggregation framework that leverages the skew in the data distribution to minimize data movements. This is achieved by designing cache-resident data structures and an adaptive multi-pass algorithm that quickly identifies candidate groups during processing, and performs exact aggregations for these groups. The non-candidate groups are pruned cheaply using efficient hashing and partitioning techniques without performing exact aggregations. We develop techniques to improve robustness over adversarial data distributions and have optimized the framework to reuse computations incrementally for rolling (or paginated) top-k aggregate queries. Our extensive evaluation using both real-world and synthetic datasets demonstrate that Zippy can achieve a median speed-up of more than 3\u00d7 for monotonic aggregation functions across typical ranges of k values (e.g., 1 to 100) and 1.4\u00d7 for non-monotonic functions when compared with state-of-the-art cache-conscious aggregation techniques.<\/jats:p>","DOI":"10.14778\/3636218.3636222","type":"journal-article","created":{"date-parts":[[2024,3,5]],"date-time":"2024-03-05T17:04:07Z","timestamp":1709658247000},"page":"644-656","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Cache-Efficient Top-k Aggregation over High Cardinality Large Datasets"],"prefix":"10.14778","volume":"17","author":[{"given":"Tarique","family":"Siddiqui","sequence":"first","affiliation":[{"name":"Microsoft Research, Redmond, Washington, USA"}]},{"given":"Vivek","family":"Narasayya","sequence":"additional","affiliation":[{"name":"Microsoft Research, Redmond, Washington, USA"}]},{"given":"Marius","family":"Dumitru","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, Washington, USA"}]},{"given":"Surajit","family":"Chaudhuri","sequence":"additional","affiliation":[{"name":"Microsoft Research, Redmond, Washington, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,3,5]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"https:\/\/godatadriven.com\/blog\/optimizing-topk-queries-in-datafusion\/ [Online","author":"Apache","year":"2023","unstructured":"2023. Apache Datafusion). https:\/\/godatadriven.com\/blog\/optimizing-topk-queries-in-datafusion\/ [Online; accessed 3-May-2023]."},{"key":"e_1_2_1_2_1","volume-title":"https:\/\/powerbi.microsoft.com\/en-us\/ [Online","author":"BI","year":"2023","unstructured":"2023. PowerBI (https:\/\/powerbi.microsoft.com\/en-us\/). https:\/\/powerbi.microsoft.com\/en-us\/ [Online; accessed 3-May-2023]."},{"key":"e_1_2_1_3_1","volume-title":"www.tableaupublic.com\/ [Online","author":"Tableau Public","year":"2023","unstructured":"2023. Tableau Public (www.tableaupublic.com\/). www.tableaupublic.com\/ [Online; accessed 3-May-2023]."},{"key":"e_1_2_1_4_1","volume-title":"Massively parallel sort-merge joins in main memory multi-core database systems. arXiv preprint arXiv:1207.0145","author":"Albutiu Martina-Cezara","year":"2012","unstructured":"Martina-Cezara Albutiu, Alfons Kemper, and Thomas Neumann. 2012. Massively parallel sort-merge joins in main memory multi-core database systems. arXiv preprint arXiv:1207.0145 (2012)."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732219.2732227"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732219.2732227"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.14778\/2735496.2735499"},{"key":"e_1_2_1_8_1","first-page":"225","article-title":"MonetDB\/X100: Hyper-Pipelining Query Execution","volume":"5","author":"Boncz Peter A","year":"2005","unstructured":"Peter A Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB\/X100: Hyper-Pipelining Query Execution.. In Cidr, Vol. 5. 225--237.","journal-title":"Cidr"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 731--742","author":"Chandramouli Badrish","year":"2014","unstructured":"Badrish Chandramouli and Jonathan Goldstein. 2014. Patience is a virtue: Revisiting merge and sort on modern processors. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 731--742."},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2423--2437","author":"Chronis Yannis","year":"2020","unstructured":"Yannis Chronis, Thanh Do, Goetz Graefe, and Keith Peters. 2020. External merge sort for Top-K queries: Eager input filtering guided by histograms. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2423--2437."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.5555\/1325851.1325893"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jalgor.2003.12.001"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/602259.602261"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.14778\/2735479.2735481"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/0022-0000(85)90041-8"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/191839.191886"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/253260.253291"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1391729.1391730"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/762471.762473"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/2735479.2735485"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data. 61--72","author":"Li Chengkai","year":"2006","unstructured":"Chengkai Li, Kevin Chen-Chuan Chang, and Ihab F Ilyas. 2006. Supporting ad-hoc ranking aggregates. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data. 61--72."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2002.1019210"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/B978-155860869-6\/50038-X"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1166074.1166084"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2747644"},{"key":"e_1_2_1_26_1","volume-title":"Pareto distributions and Zipf's law. Contemporary physics 46, 5","author":"Newman Mark EJ","year":"2005","unstructured":"Mark EJ Newman. 2005. Power laws, Pareto distributions and Zipf's law. Contemporary physics 46, 5 (2005), 323--351."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485278.2485284"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2610522"},{"key":"e_1_2_1_29_1","unstructured":"Vijayshankar Raman Gopi K Attaluri Ronald Barber Naresh Chainani David Kalmuk Vincent KulandaiSamy Jens Leenstra Sam Lightstone Shaorong Liu Guy M Lohman et al. 2013. Ren\u00e9 M\u00fc ller Ippokratis Pandis Berni Schiefer David Sharpe Richard Sidle Adam J. Storm and Liping Zhang (2013)."},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the 18th acm sigkdd international conference on knowledge discovery and data mining. 1451--1459","author":"Roy Pratanu","year":"2012","unstructured":"Pratanu Roy, Jens Teubner, and Gustavo Alonso. 2012. Efficient frequent item counting in multi-core hardware. In Proceedings of the 18th acm sigkdd international conference on knowledge discovery and data mining. 1451--1459."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183735"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/568271.223801"},{"key":"e_1_2_1_33_1","doi-asserted-by":"crossref","first-page":"1378","DOI":"10.14778\/1687553.1687564","article-title":"Sedlar E Nguyen AD Satish N Chhugani J Di Blas A Dubey P Sort vs. hash revisited: fast join implementation on modern multi-core CPUs","volume":"2","author":"Kim C","year":"2009","unstructured":"Kim C Kaldewey T Lee VW. 2009. Sedlar E Nguyen AD Satish N Chhugani J Di Blas A Dubey P Sort vs. hash revisited: fast join implementation on modern multi-core CPUs. Proc. VLDB Endow 2, 2 (2009), 1378.","journal-title":"Proc. VLDB Endow"},{"key":"e_1_2_1_34_1","volume-title":"Euro-Par 2011 Parallel Processing: 17th International Conference, Euro-Par 2011, Bordeaux, France, August 29-September 2, 2011, Proceedings, Part II 17","author":"Wassenberg Jan","year":"2011","unstructured":"Jan Wassenberg and Peter Sanders. 2011. Engineering a multi-core radix sort. In Euro-Par 2011 Parallel Processing: 17th International Conference, Euro-Par 2011, Bordeaux, France, August 29-September 2, 2011, Proceedings, Part II 17. Springer, 160--169."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1995441.1995442"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3636218.3636222","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,3,5]],"date-time":"2024-03-05T17:08:20Z","timestamp":1709658500000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3636218.3636222"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12]]},"references-count":35,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,12]]}},"alternative-id":["10.14778\/3636218.3636222"],"URL":"https:\/\/doi.org\/10.14778\/3636218.3636222","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2023,12]]},"assertion":[{"value":"2024-03-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}