{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,19]],"date-time":"2025-09-19T09:24:06Z","timestamp":1758273846935},"reference-count":66,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,7]]},"abstract":"<jats:p>Data Lakes deployed in the cloud are a go-to solution for enterprise data storage. While the pay-as-you-go cost model allows flexible resource allocation and billing, it mandates an efficient use of resources like CPU hours, network traffic, and used storage. The distributed nature of cloud environments necessitates partitioning the data and processing these partitions separately. In this work, we put forward a practical solution to improve the efficiency of compression algorithms on Dremel-encoded data by clustering similarly structured nested data at ingestion time, such that compressible partitions can be created. We propose a clustering approach inspired by decision trees that outpaces even the naive partition-then-sort approach by up to factor 17.44 while also boosting the compression by up to factor 2. We further show that when sorting the individual buckets, a compression boost that is competitive with the well-established increasing-cardinality heuristic can be achieved, but at a lower ingestion time.<\/jats:p>","DOI":"10.14778\/3681954.3682013","type":"journal-article","created":{"date-parts":[[2024,8,30]],"date-time":"2024-08-30T16:23:36Z","timestamp":1725035016000},"page":"3456-3469","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion Pipelines"],"prefix":"10.14778","volume":"17","author":[{"given":"Patrick","family":"Hansert","sequence":"first","affiliation":[{"name":"RPTU Kaiserslautern-Landau, Kaiserslautern, Germany"}]},{"given":"Sebastian","family":"Michel","sequence":"additional","affiliation":[{"name":"RPTU Kaiserslautern-Landau, Kaiserslautern, Germany"}]}],"member":"320","published-online":{"date-parts":[[2024,8,30]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1142473.1142548"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517834"},{"key":"e_1_2_1_3_1","volume-title":"https:\/\/parquet.apache.org\/ (Last accessed","author":"Foundation Apache Software","year":"2024","unstructured":"Apache Software Foundation. 2013. Apache Parquet. https:\/\/parquet.apache.org\/ (Last accessed: July 9, 2024)"},{"key":"e_1_2_1_4_1","volume-title":"https:\/\/spark.apache.org\/ (Last accessed","author":"Foundation Apache Software","year":"2024","unstructured":"Apache Software Foundation. 2014. Apache Spark. https:\/\/spark.apache.org\/ (Last accessed: July 9, 2024)"},{"key":"e_1_2_1_5_1","volume-title":"https:\/\/iceberg.apache.org\/ (Last accessed","author":"Foundation Apache Software","year":"2024","unstructured":"Apache Software Foundation. 2017. Apache Iceberg. https:\/\/iceberg.apache.org\/ (Last accessed: July 9, 2024)"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.4108\/ICST.INFOSCALE2008.3554"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415560"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5441\/002\/edbt.2017.21"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3122831.3122837"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2018.00200"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3035930"},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, January 9--11","author":"Buchsbaum Adam L.","year":"2000","unstructured":"Adam L. Buchsbaum, Donald F. Caldwell, Kenneth Ward Church, Glenn S. Fowler, and S. Muthukrishnan. 2000. Engineering the compression of massive tables: an experimental approach. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, January 9--11, 2000, San Francisco, CA, USA, David B. Shmoys (Ed.). ACM\/SIAM, 175--184. http:\/\/dl.acm.org\/citation.cfm?id=338219.338249"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/950620.950622"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/tip.2022.3186532"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626246.3653379"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457270"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452809"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-74553-2_1"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/IRI.2018.00060"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1921011"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-011-0242-x"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2018.02.007"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/645925.671516"},{"key":"e_1_2_1_24_1","volume-title":"Github developer webhooks. https:\/\/developer.github.com\/webhooks\/ (Last accessed","author":"GitHub Inc. 2024.","year":"2024","unstructured":"GitHub Inc. 2024. Github developer webhooks. https:\/\/developer.github.com\/webhooks\/ (Last accessed: July 9, 2024)"},{"key":"e_1_2_1_25_1","volume-title":"Retrieved","author":"Google Inc.","year":"2011","unstructured":"Google Inc. 2011. Snappy Compressed Format Description. Retrieved February 29, 2023 from https:\/\/github.com\/google\/snappy\/blob\/main\/format_description.txt"},{"key":"e_1_2_1_26_1","unstructured":"Ilya Grigorik. 2023. GH Archive. https:\/\/www.gharchive.org\/"},{"key":"e_1_2_1_27_1","first-page":"5","article-title":"Managing Google's data lake: an overview of the Goods system","volume":"39","author":"Halevy Alon Y.","year":"2016","unstructured":"Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. 2016. Managing Google's data lake: an overview of the Goods system. IEEE Data Eng. Bull. 39, 3 (2016), 5--14. http:\/\/sites.computer.org\/debull\/A16sept\/p5.pdf","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3530050.3532923"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3579142.3594286"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2452376.2452456"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/bf01199431"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14778\/2556549.2556559"},{"key":"e_1_2_1_33_1","volume-title":"Database Cracking. In Third Biennial Conference on Innovative Data Systems Research, CIDR 2007, Asilomar, CA, USA, January 7--10, 2007, Online Proceedings. www.cidrdb.org, 68--78","author":"Idreos Stratos","year":"2007","unstructured":"Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Database Cracking. In Third Biennial Conference on Innovative Data Systems Research, CIDR 2007, Asilomar, CA, USA, January 7--10, 2007, Online Proceedings. www.cidrdb.org, 68--78. http:\/\/cidrdb.org\/cidr2007\/papers\/cidr07p07.pdf"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/93597.98742"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00500-018-3081-5"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457547"},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the First International Workshop on Data Ecosystems co-located with 48th International Conference on Very Large Databases (VLDB","author":"Klessinger Stefan","year":"2022","unstructured":"Stefan Klessinger, Meike Klettke, Uta St\u00f6rl, and Stefanie Scherzinger. 2022. Extracting JSON Schemas with tagged unions. In Proceedings of the First International Workshop on Data Ecosystems co-located with 48th International Conference on Very Large Databases (VLDB 2022), Sydney, Australia, September 5, 2022 (CEUR Workshop Proceedings), Cinzia Cappiello, Sandra Geisler, and Maria-Esther Vidal (Eds.), Vol. 3306. CEUR-WS.org, 27--40. https:\/\/ceur-ws.org\/Vol-3306\/paper4.pdf"},{"key":"e_1_2_1_38_1","unstructured":"Meike Klettke Uta St\u00f6rl and Stefanie Scherzinger. 2015. Schema Extraction and Structural Outlier Detection for JSON-based NoSQL Data Stores. In Datenbanksysteme f\u00fcr Business Technologie und Web (BTW) 16. Fachtagung des GI-Fachbereichs \"Datenbanken und Informationssysteme\" (DBIS) 4.-6.3.2015 in Hamburg Germany. Proceedings (LNI) Thomas Seidl Norbert Ritter Harald Sch\u00f6ning Kai-Uwe Sattler Theo H\u00e4rder Steffen Friedrich and Wolfram Wingerath (Eds.) Vol. P-241. GI 425--444. https:\/\/dl.gi.de\/20.500.12116\/2420"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2011.02.002"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2338626.2338633"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.compmedimag.2007.11.002"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1016\/B978-012088469-8.50111-X"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920886"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415568"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2307.03113"},{"key":"e_1_2_1_46_1","volume-title":"A computer oriented geodetic data base and a new technique in file sequencing. resreport. https:\/\/dominoweb.draco.res.ibm.com\/0dabf9473b9c86d48525779800566a39.html (Last accessed","author":"Morton Guy M","year":"2024","unstructured":"Guy M Morton. 1966. A computer oriented geodetic data base and a new technique in file sequencing. resreport. https:\/\/dominoweb.draco.res.ibm.com\/0dabf9473b9c86d48525779800566a39.html (Last accessed: July 9, 2024)"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/icde55515.2023.00223"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1009744630224"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.5555\/1182635.1164217"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/6012.15407"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2005.35"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-31235-9_{3}{1"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/proc.1967.5493"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-25264-3_35"},{"volume-title":"Foundations of multidimensional and metric data structures","author":"Samet Hanan","key":"e_1_2_1_55_1","unstructured":"Hanan Samet. 2006. Foundations of multidimensional and metric data structures. Academic Press."},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00224"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3127479.3131613"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3384413"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452801"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.14778\/3598581.3598601"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.14778\/3025111.3025123"},{"key":"e_1_2_1_62_1","volume-title":"Introduction to Data Mining","author":"Tan Pang-Ning","unstructured":"Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar. 2018. Introduction to Data Mining (2nd Edition) (2nd ed.). Pearson.","edition":"2"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2007.07.016"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3035956"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3567484"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCC.2023.3339208"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3681954.3682013","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,4]],"date-time":"2024-09-04T18:32:29Z","timestamp":1725474749000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3681954.3682013"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7]]},"references-count":66,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2024,7]]}},"alternative-id":["10.14778\/3681954.3682013"],"URL":"https:\/\/doi.org\/10.14778\/3681954.3682013","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2024,7]]},"assertion":[{"value":"2024-08-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}