{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T19:02:17Z","timestamp":1774983737858,"version":"3.50.1"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2025,2,10]],"date-time":"2025-02-10T00:00:00Z","timestamp":1739145600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100006374","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62232005, 62021002, 92267203, 62072265"],"award-info":[{"award-number":["62232005, 62021002, 92267203, 62072265"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100006374","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2021YFB3300500"],"award-info":[{"award-number":["2021YFB3300500"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100006374","name":"Beijing National Research Center For Information Science And Technology","doi-asserted-by":"publisher","award":["BNR2025RC01011"],"award-info":[{"award-number":["BNR2025RC01011"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,2,10]]},"abstract":"<jats:p>Quantiles are costly to compute exactly but can be efficiently estimated by quantile sketches. Extensive works on summarizing streaming data, such as KLL sketch, focus on minimizing the cost in memory to provide certain error guarantees. For the problem of quantile estimation of values in LSM-tree based stores, streaming methods have an expensive I\/O cost linear to data size N. Since disk components (chunks and SSTables) in the LSM-tree are immutable once flushed, quantile sketches can be pre-computed as a type of statistics to reduce I\/O cost and accelerate queries. Unfortunately, to provide deterministic additive \u03b5N error guarantees on queried data, all pre-computed deterministic sketches of queried chunks each with size N_c should provide \u03b5N_c error guarantee, resulting in no improvement in the linear I\/O cost. In this study, we propose pre-computing randomized sketches which provide randomized additive error guarantees. Our major technical contributions include (1) randomized sketches for data chunks constructed in flush events, which are proved to be optimal and achieve an I\/O cost proportional to \u221a(N), (2) hierarchical randomized sketches for SSTables constructed in compaction events, that further improve the asymptotic I\/O cost, (3) the KLL sketch summarizing proposed pre-computed sketches is proved to be more accurate than that summarizing streaming data, and proved to achieve sublinear I\/O cost while achieving the same memory complexity as in the streaming settings. Extensive experiments on synthetic and real datasets demonstrate the superiority of the proposed techniques. The approach is deployed in an LSM-tree based time-series database Apache IoTDB.<\/jats:p>","DOI":"10.1145\/3709717","type":"journal-article","created":{"date-parts":[[2025,2,11]],"date-time":"2025-02-11T15:45:06Z","timestamp":1739288706000},"page":"1-26","source":"Crossref","is-referenced-by-count":1,"title":["Randomized Sketches for Quantile in LSM-tree based Store"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-3197-2722","authenticated-orcid":false,"given":"Ziling","family":"Chen","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9503-2755","authenticated-orcid":false,"given":"Shaoxu","family":"Song","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,2,11]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183761"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2500128"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/362686.362692"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476266"},{"key":"e_1_2_1_5_1","unstructured":"Experiment Code and Data. 2024. https:\/\/github.com\/thssdb\/LSM-quantile\/."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3516431.3516433"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3452021.3458323"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467152"},{"key":"e_1_2_1_9_1","unstructured":"Bitcoin Dataset. 2018a. https:\/\/www.kaggle.com\/shiheyingzhe\/bitcoin-transaction-data-from-2009-to-2018."},{"key":"e_1_2_1_10_1","unstructured":"Taxi Dataset. 2018b. https:\/\/www.kaggle.com\/datasets\/sunnets\/taxipredictiondata8m."},{"key":"e_1_2_1_11_1","unstructured":"Thruster Dataset. 2022. https:\/\/www.kaggle.com\/datasets\/patrickfleith\/spacecraft-thruster-firing-tests-dataset."},{"key":"e_1_2_1_12_1","unstructured":"Apache Datasketches. 2024. https:\/\/datasketches.apache.org\/."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3064054"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196927"},{"key":"e_1_2_1_15_1","unstructured":"Lognormal Distribution. 2024. https:\/\/commons.apache.org\/proper\/commons-math\/javadocs\/api-3.6.1\/org\/apache\/commons\/math3\/distribution\/LogNormalDistribution.html."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.simpa.2020.100049"},{"key":"e_1_2_1_17_1","unstructured":"RocksDB File Format. 2024. https:\/\/github.com\/facebook\/rocksdb\/wiki\/Rocksdb-BlockBasedTable-Format."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/375663.375670"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/2367502.2367510"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1093\/comjnl\/5.1.10"},{"key":"e_1_2_1_21_1","unstructured":"Document in Apache IoTDB. 2024a. https:\/\/iotdb.apache.org\/UserGuide\/Master\/Operators-Functions\/Data-Profiling.html#quantile."},{"key":"e_1_2_1_22_1","unstructured":"Implementation in Apache IoTDB. 2024b. https:\/\/github.com\/apache\/iotdb\/tree\/research\/LSM-quantile."},{"key":"e_1_2_1_23_1","unstructured":"Quantile Query in Flux. 2024. https:\/\/docs.influxdata.com\/influxdb\/v2\/query-data\/flux\/percentile-quantile\/#estimate_tdigest."},{"key":"e_1_2_1_24_1","unstructured":"Apache IoTDB. 2024. https:\/\/iotdb.apache.org\/."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-005-0171--7"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00315"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/FOCS.2016.17"},{"key":"e_1_2_1_28_1","unstructured":"Google LevelDB. 2024. https:\/\/github.com\/google\/leveldb."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/S11390-024--3538--1"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-019-00555-y"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/304182.304204"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352135"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1016\/0304--3975(80)90061--4"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/s002360050048"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/3025111.3025124"},{"key":"e_1_2_1_36_1","unstructured":"Supplementary. 2024. urlhttps:\/\/github.com\/thssdb\/LSM-quantile\/blob\/main\/full.pdf."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589775"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.5220\/0006235400360046"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3450980.3450990"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/S41019-023-00235--6"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3709717","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3709717","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T18:22:01Z","timestamp":1774981321000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3709717"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,10]]},"references-count":40,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,2,10]]}},"alternative-id":["10.1145\/3709717"],"URL":"https:\/\/doi.org\/10.1145\/3709717","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,10]]}}}