{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T06:30:36Z","timestamp":1772519436602,"version":"3.50.1"},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"6","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,2]]},"abstract":"<jats:p>Exponential growth in data collection is creating significant challenges for data storage and analytics latency. Approximate Query Processing (AQP) has long been touted as a solution for accelerating analytics on large datasets, however, there is still room for improvement across all key performance criteria. In this paper, we propose a novel histogram-based data synopsis called PairwiseHist that uses recursive hypothesis testing to ensure accurate histograms and can be built on top of data compressed using Generalized Deduplication (GD). We thus show that GD data compression can contribute to AQP. 
Compared to state-of-the-art AQP approaches, PairwiseHist achieves better performance across all key metrics, including 2.6\u00d7 higher accuracy, 3.5\u00d7 lower latency, 24\u00d7 smaller synopses and 1.5--4\u00d7 faster construction time.<\/jats:p>","DOI":"10.14778\/3648160.3648181","type":"journal-article","created":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T21:52:53Z","timestamp":1714773173000},"page":"1432-1445","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["PairwiseHist: Fast, Accurate and Space-Efficient Approximate Query Processing with Data Compression"],"prefix":"10.14778","volume":"17","author":[{"given":"Aaron","family":"Hurst","sequence":"first","affiliation":[{"name":"Aarhus University, Denmark"}]},{"given":"Daniel E.","family":"Lucani","sequence":"additional","affiliation":[{"name":"Aarhus University, Denmark"}]},{"given":"Qi","family":"Zhang","sequence":"additional","affiliation":[{"name":"Aarhus University, Denmark"}]}],"member":"320","published-online":{"date-parts":[[2024,5,3]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2745754.2745772"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2593667"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2465351.2465355"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-019-0185-4"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","unstructured":"Pritom Saha Akash Wei-Cheng Lai and Po-Wen Lin. 2022. Online Aggregation based Approximate Query Processing: A Literature Survey. 
10.48550\/ARXIV.2204.07125","DOI":"10.48550\/ARXIV.2204.07125"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/872757.872822"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/375663.375686"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3056097"},{"key":"e_1_2_1_9_1","volume-title":"https:\/\/data.cityofchicago.org\/Transportation\/Taxi-Trips-2020\/r2u4-wwk3 Accessed","author":"Chicago City","year":"2024","unstructured":"City of Chicago. 2022. Taxi Trips - 2020. https:\/\/data.cityofchicago.org\/Transportation\/Taxi-Trips-2020\/r2u4-wwk3 Accessed: 18 Feb, 2024."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1561\/1900000004"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the 31st Conference On Learning Theory","volume":"75","author":"Diakonikolas Ilias","year":"2018","unstructured":"Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt. 2018. Fast and Sample Near-Optimal Algorithms for Learning Multidimensional Histograms. In Proceedings of the 31st Conference On Learning Theory, Vol. 75. 819--842."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380574"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","unstructured":"Mohammadali Fallahian Mohsen Dorodchi and Kyle Kreth. 2022. GAN-based Tabular Data Generator for Constructing Synopsis in Approximate Query Processing: Challenges and Solutions. 10.48550\/ARXIV.2212.09015","DOI":"10.48550\/ARXIV.2212.09015"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","unstructured":"Marcell Feh\u00e9r Daniel E. Lucani and Ioannis Chatzigeorgiou. 2022. An Adaptive Column Compression Family for Self-Driving Databases. arXiv:2209.02334v1. 
10.48550\/arXiv.2209.02334","DOI":"10.48550\/arXiv.2209.02334"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i6.25929"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3401025.3404098"},{"key":"e_1_2_1_17_1","volume-title":"Individual household electric power consumption Data Set. https:\/\/archive.ics.uci.edu\/ml\/datasets\/Individual+household+electric+power+consumption Accessed","author":"Hebrail Georges","year":"2024","unstructured":"Georges Hebrail and Alice Berard. 2012. Individual household electric power consumption Data Set. https:\/\/archive.ics.uci.edu\/ml\/datasets\/Individual+household+electric+power+consumption Accessed: 18 Feb, 2024."},{"key":"e_1_2_1_18_1","volume-title":"deepdb-public. https:\/\/github.com\/DataManagementLab\/deepdb-public GitHub repository. Accessed","author":"Hilprecht Benjamin","year":"2023","unstructured":"Benjamin Hilprecht. 2020. deepdb-public. https:\/\/github.com\/DataManagementLab\/deepdb-public GitHub repository. Accessed: 1 Apr, 2023.."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/3384345.3384349"},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the 20th International Conference on Artificial Intelligence and Statistics","volume":"54","author":"Hong Dezhi","year":"2017","unstructured":"Dezhi Hong, Quanquan Gu, and Kamin Whitehouse. 2017. High-dimensional Time Series Clustering via Cross-Predictability. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Vol. 54. 642--651. 
https:\/\/www.kaggle.com\/datasets\/ranakrc\/smart-building-system Accessed: 18 Feb, 2024."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.chemolab.2016.07.004"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/JIOT.2022.3166455"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/tii.2024.3353913"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/GLOBECOM46510.2021.9685589"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/568271.223841"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3555308"},{"key":"e_1_2_1_27_1","volume-title":"Towards Modelbased Approximate Query Processing. In 1st International Workshop on Applied AI for Database Systems and Applications.","author":"Kulessa Moritz","year":"2019","unstructured":"Moritz Kulessa, Benjamin Hilprecht, Alejandro Molina, Knowledge Engineering, Group Data, Management Lab, Machine Learning, and Lab. 2019. Towards Modelbased Approximate Query Processing. In 1st International Workshop on Applied AI for Database Systems and Applications."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","unstructured":"Moritz Kulessa Alejandro Molina Carsten Binnig Benjamin Hilprecht and Kristian Kersting. 2018. Model-based Approximate Query Processing. 10.48550\/ARXIV.1811.06224","DOI":"10.48550\/ARXIV.1811.06224"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/bigdata55660.2022.10020252"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-018-0074-4"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/tkde.2018.2877362"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457277"},{"key":"e_1_2_1_33_1","volume-title":"https:\/\/github.com\/qingzma\/DBEstClient GitHub repository. Accessed","author":"EstClient Qingzhi Ma.","year":"2023","unstructured":"Qingzhi Ma. 2022. DBEstClient. 
https:\/\/github.com\/qingzma\/DBEstClient GitHub repository. Accessed: 1 Apr, 2023.."},{"key":"e_1_2_1_34_1","volume-title":"Accurate and Fast. In Conference on Innovative Data Systems Research.","author":"Ma Qingzhi","year":"2021","unstructured":"Qingzhi Ma, Ali M. Shanghooshabad, Mehrdad Almasi, Meghdad Kurmanji, and Peter Triantafillou. 2021. Learned Approximate Query Processing: Make it Light, Accurate and Fast. In Conference on Innovative Data Systems Research."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324958"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","unstructured":"Stephen Makonin. 2016. AMPds2: The Almanac of Minutely Power dataset (Version 2). 10.7910\/DVN\/FIE0S4","DOI":"10.7910\/DVN\/FIE0S4"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1038\/sdata.2016.37"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.14778\/3213880.3213882"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.dcan.2022.10.016"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/gcwkshps45667.2019.9024368"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196905"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183747"},{"key":"e_1_2_1_43_1","volume-title":"Temperature IoT on GCP. https:\/\/www.kaggle.com\/datasets\/mattpo\/temperature-iot-on-gcp Accessed","author":"Porter Matthew","year":"2024","unstructured":"Matthew Porter. 2021. Temperature IoT on GCP. https:\/\/www.kaggle.com\/datasets\/mattpo\/temperature-iot-on-gcp Accessed: 18 Feb, 2024."},{"key":"e_1_2_1_44_1","volume-title":"ML prediction for Light Detection Sensor IoT. https:\/\/www.kaggle.com\/datasets\/aashnaprasad\/ml-prediction-for-lightdetection-sensor-iot Accessed","author":"Prasad Aashna","year":"2024","unstructured":"Aashna Prasad. 2020. ML prediction for Light Detection Sensor IoT. 
https:\/\/www.kaggle.com\/datasets\/aashnaprasad\/ml-prediction-for-lightdetection-sensor-iot Accessed: 18 Feb, 2024."},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the ACM SIGMOD International Conference on Management of Data.","author":"Sanca Viktor","year":"2023","unstructured":"Viktor Sanca, Periklis Chrysogelos, and Anastasia Ailamaki. 2023. LAQy: Efficient and Reusable Query Approximations via Lazy Sampling. In Proceedings of the ACM SIGMOD International Conference on Management of Data."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1002\/wics.35"},{"key":"e_1_2_1_47_1","volume-title":"Christian M\u00f8rup, Elena Pagnin, and Daniel E. Lucani.","author":"Sehat Hadi","year":"2022","unstructured":"Hadi Sehat, Anders Lindskov Kloborg, Christian M\u00f8rup, Elena Pagnin, and Daniel E. Lucani. 2022. Bonsai: A Generalized Look at Dual Deduplication."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137658"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i8.20800"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3548732.3548746"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/2505515.2505756"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","unstructured":"Udanor Collins Blessing Ogbuokiri and Nweke Onyinye. 2022. Sensor Based Aquaponics Fish Pond Datasets. 10.34740\/kaggle\/dsv\/3748790","DOI":"10.34740\/kaggle\/dsv\/3748790"},{"key":"e_1_2_1_53_1","volume-title":"2015 Flight Delays and Cancellations. https:\/\/www.kaggle.com\/datasets\/usdot\/flight-delays Accessed","author":"USA Department of Transportation. 2016.","year":"2024","unstructured":"USA Department of Transportation. 2016. 2015 Flight Delays and Cancellations. 
https:\/\/www.kaggle.com\/datasets\/usdot\/flight-delays Accessed: 18 Feb, 2024."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/infocom41043.2020.9155450"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/globecom38437.2019.9014012"},{"key":"e_1_2_1_56_1","volume-title":"Lucani","author":"Vestergaard Rasmus","year":"2019","unstructured":"Rasmus Vestergaard, Qi Zhang, and Daniel E. Lucani. 2019. Generalized Deduplication: Lossless Compression for Large Amounts of Small IoT Data. In European Wireless."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/jiot.2021.3081868"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588954"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2588579"},{"key":"e_1_2_1_60_1","volume-title":"Impression Store: Compressive Sensing-based Storage for Big Data Analytics. In 6th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 14)","author":"Zhang Jiaxing","year":"2014","unstructured":"Jiaxing Zhang, Ying Yan, Liang Jeff Chen, Minjie Wang, Thomas Moscibroda, and Zheng Zhang. 2014. Impression Store: Compressive Sensing-based Storage for Big Data Analytics. In 6th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 14). 
USENIX Association."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2020.09.070"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-021-01547-7"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3648160.3648181","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T21:59:58Z","timestamp":1714773598000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3648160.3648181"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2]]},"references-count":62,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,2]]}},"alternative-id":["10.14778\/3648160.3648181"],"URL":"https:\/\/doi.org\/10.14778\/3648160.3648181","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,2]]},"assertion":[{"value":"2024-05-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}