{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,14]],"date-time":"2025-11-14T17:44:19Z","timestamp":1763142259389,"version":"3.44.0"},"reference-count":78,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,3,12]],"date-time":"2024-03-12T00:00:00Z","timestamp":1710201600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,3,12]]},"abstract":"<jats:p>Data processing engines increasingly leverage distributed file systems for scalable, cost-effective storage. While the Apache Parquet columnar format has become a popular choice for data storage and retrieval, the immutability of Parquet files renders it impractical to meet the demands of frequent updates in contemporary analytical workloads. Log-Structured Tables (LSTs), such as Delta Lake, Apache Iceberg, and Apache Hudi, offer an alternative for scenarios requiring data mutability, providing a balance between efficient updates and the benefits of columnar storage. They provide features like transactions, time-travel, and schema evolution, enhancing usability and enabling access from multiple engines. Moreover, engines like Apache Spark and Trino can be configured to leverage the optimizations and controls offered by LSTs to meet specific business needs. Conventional benchmarks and tools are inadequate for evaluating the transformative changes in the storage layer resulting from these advancements, as they do not allow us to measure the impact of design and optimization choices in this new setting.<\/jats:p>\n          <jats:p>In this paper, we propose a novel benchmarking approach and metrics that build upon existing benchmarks, aiming to systematically assess LSTs. 
We develop a framework, LST-Bench, which facilitates effective exploration and evaluation of the collaborative functioning of LSTs and data processing engines through tailored benchmark packages. A package is a mix of use patterns reflecting a target workload; LST-Bench makes it easy to define a wide range of use patterns and combine them into a package, and we include a baseline package for completeness. Our assessment demonstrates the effectiveness of our framework and benchmark packages in extracting valuable insights across diverse environments. The code for LST-Bench is open source and is available at https:\/\/github.com\/microsoft\/lst-bench\/.<\/jats:p>","DOI":"10.1145\/3639314","type":"journal-article","created":{"date-parts":[[2024,3,26]],"date-time":"2024-03-26T18:51:32Z","timestamp":1711479092000},"page":"1-26","source":"Crossref","is-referenced-by-count":2,"title":["LST-Bench: Benchmarking Log-Structured Tables in the Cloud"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-9151-6024","authenticated-orcid":false,"given":"Jes\u00fas","family":"Camacho-Rodr\u00edguez","sequence":"first","affiliation":[{"name":"Microsoft, Mountain View, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-7862-0995","authenticated-orcid":false,"given":"Ashvin","family":"Agrawal","sequence":"additional","affiliation":[{"name":"Microsoft, Mountain View, California, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-2547-8610","authenticated-orcid":false,"given":"Anja","family":"Gruenheid","sequence":"additional","affiliation":[{"name":"Microsoft, Zurich, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-7939-6692","authenticated-orcid":false,"given":"Ashit","family":"Gosalia","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, Washington, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9007-4733","authenticated-orcid":false,"given":"Cristian","family":"Petculescu","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, Washington, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-4641-0236","authenticated-orcid":false,"given":"Josep","family":"Aguilar-Saborit","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, Washington, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-5760-8657","authenticated-orcid":false,"given":"Avrilia","family":"Floratou","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, Washington, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3712-7358","authenticated-orcid":false,"given":"Carlo","family":"Curino","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, Washington, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-5086-7664","authenticated-orcid":false,"given":"Raghu","family":"Ramakrishnan","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, Washington, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,3,26]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"Foundations and Trends\u00ae in Databases","volume":"5","author":"Abadi Daniel","year":"2013","unstructured":"Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. 2013. The Design and Implementation of Modern Column-Oriented Database Systems. Foundations and Trends\u00ae in Databases, Vol. 5, 3 (2013), 197--280."},{"key":"e_1_2_2_2_1","unstructured":"Amazon. 2023 a. Redshift - Cloud Data Warehouse. https:\/\/aws.amazon.com\/redshift\/."},{"key":"e_1_2_2_3_1","unstructured":"Amazon. 2023 b. S3 - Cloud Object Storage. https:\/\/aws.amazon.com\/s3\/."},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415560"},{"key":"e_1_2_2_5_1","volume-title":"Making Data Engineering Declarative. 
In Conference on Innovative Data Systems Research (CIDR).","author":"Armbrust Michael","year":"2023","unstructured":"Michael Armbrust, Ali Ghodsi, Reynold Xin, Vuk Ercegovac, Sourav Chatterji, Eun-Gyu Kim, Paul Lappas, Yannis Papakonstantinou, Yingyi Bu, Yuhong Chen, Yijia Cui, Rahul Govind, Aakash Japi, Kiavash Kianfar, Xi Liang, Jon Mio, Mukul Murthy, Supun Nakandala, Andreas Neumann, Nitin Sharma, Yannis Sismanis, Justin Tang, Joseph Torres, Min Yang, Li Zhang, and Bilal Aslam. 2023. Making Data Engineering Declarative. In Conference on Innovative Data Systems Research (CIDR)."},{"key":"e_1_2_2_6_1","volume-title":"Conference on Innovative Data Systems Research (CIDR).","author":"Armbrust Michael","year":"2021","unstructured":"Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. 2021. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. In Conference on Innovative Data Systems Research (CIDR)."},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.14569\/IJACSA.2021.0120864"},{"key":"e_1_2_2_8_1","unstructured":"BenchBase. 2023. Multi-DBMS SQL Benchmarking Framework via JDBC. https:\/\/github.com\/cmu-db\/benchbase."},{"key":"e_1_2_2_9_1","unstructured":"Ryan Blue. 2023 a. CDC Data Gremlins. https:\/\/tabular.io\/blog\/cdc-data-gremlins\/"},{"key":"e_1_2_2_10_1","unstructured":"Ryan Blue. 2023 b. Tutorial: Using Trino and Iceberg for data warehousing. https:\/\/tabular.io\/tutorials\/using-trino-and-iceberg\/"},{"key":"e_1_2_2_11_1","volume-title":"PEEL: A Framework for Benchmarking Distributed Systems and Algorithms. In TPC Technology Conference (TPCTC).","author":"Boden Christoph","year":"2017","unstructured":"Christoph Boden, Alexander Alexandrov, Andreas Kunft, Tilmann Rabl, and Volker Markl. 2017. PEEL: A Framework for Benchmarking Distributed Systems and Algorithms. In TPC Technology Conference (TPCTC)."},{"key":"e_1_2_2_12_1","unstructured":"Brooklyn Data Co. 2023. 
Setting the Table: Benchmarking Open Table Formats. https:\/\/brooklyndata.co\/blog\/benchmarking-open-table-formats"},{"key":"e_1_2_2_13_1","unstructured":"Tim Brown. 2023. Announcing Onetable. https:\/\/www.onehouse.ai\/blog\/onetable-hudi-delta-iceberg"},{"key":"e_1_2_2_14_1","volume-title":"ACM International Conference on Management of Data (SIGMOD). 1773--1786","author":"Jes\u00fa","year":"2019","unstructured":"Jes\u00fas Camacho-Rodr\u00edguez, Ashutosh Chauhan, Alan Gates, Eugene Koifman, Owen O'Malley, Vineet Garg, Zoltan Haindrich, Sergey Shelukhin, Prasanth Jayachandran, Siddharth Seth, Deepak Jaiswal, Slim Bouguerra, Nishant Bangarwa, Sankar Hariappan, Anishek Agarwal, Jason Dere, Daniel Dai, Thejas Nair, Nita Dembla, Gopal Vijayaraghavan, and G\u00fcnther Hagleitner. 2019. Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing. In ACM International Conference on Management of Data (SIGMOD). 1773--1786."},{"key":"e_1_2_2_15_1","volume-title":"Benchmarking Cloud Serving Systems with YCSB. In ACM Symposium on Cloud Computing (SoCC). 143--154","author":"Cooper Brian F.","year":"2010","unstructured":"Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In ACM Symposium on Cloud Computing (SoCC). 143--154."},{"key":"e_1_2_2_16_1","unstructured":"DataBeans. 2022. Delta vs Iceberg vs Hudi : Reassessing Performance. https:\/\/databeans-blogs.medium.com\/delta-vs-iceberg-vs-hudi-reassessing-performance-cb8157005eb0"},{"key":"e_1_2_2_17_1","unstructured":"Databricks. 2023 a. https:\/\/www.databricks.com\/."},{"key":"e_1_2_2_18_1","unstructured":"Databricks. 2023 b. CREATE TABLE CLONE. https:\/\/docs.databricks.com\/sql\/language-manual\/delta-clone.html."},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415551"},{"key":"e_1_2_2_20_1","unstructured":"Delta Lake. 2023 a. 
https:\/\/delta.io\/."},{"key":"e_1_2_2_21_1","unstructured":"Delta Lake. 2023 b. Optimistic concurrency control. https:\/\/docs.delta.io\/2.2.0\/concurrency-control.html#optimistic-concurrency-control."},{"key":"e_1_2_2_22_1","unstructured":"Delta Lake. 2023 c. Optimizations. https:\/\/docs.delta.io\/latest\/optimizations-oss.html."},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2463710"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732240.2732246"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/3484224.3484234"},{"key":"e_1_2_2_26_1","unstructured":"Google Cloud. 2023. Cloud Storage. https:\/\/cloud.google.com\/storage."},{"key":"e_1_2_2_27_1","unstructured":"Apache Hudi. 2022. RFC - 15: Metadata Table. https:\/\/cwiki.apache.org\/confluence\/pages\/viewpage.action?pageId=147427331."},{"key":"e_1_2_2_28_1","unstructured":"Apache Hudi. 2023 a. https:\/\/hudi.apache.org\/."},{"key":"e_1_2_2_29_1","unstructured":"Apache Hudi. 2023 b. Automatic async compaction. https:\/\/hudi.apache.org\/docs\/compaction."},{"key":"e_1_2_2_30_1","unstructured":"Apache Hudi. 2023 c. Deployment considerations. https:\/\/hudi.apache.org\/docs\/0.12.2\/metadata\/#deployment-model-c-multi-writer."},{"key":"e_1_2_2_31_1","unstructured":"Apache Hudi. 2023 d. File Auto Sizing. https:\/\/hudi.apache.org\/docs\/file_sizing."},{"key":"e_1_2_2_32_1","unstructured":"Apache Hudi. 2023 e. Optimization Procedures. https:\/\/hudi.apache.org\/docs\/procedures#optimization-table."},{"key":"e_1_2_2_33_1","unstructured":"Apache Hudi. 2023 f. Schema Evolution. https:\/\/hudi.apache.org\/docs\/0.12.2\/schema_evolution\/."},{"key":"e_1_2_2_34_1","unstructured":"Apache Hudi. 2023 g. Supported Concurrency Controls. https:\/\/hudi.apache.org\/docs\/0.12.2\/concurrency_control\/#supported-concurrency-controls."},{"key":"e_1_2_2_35_1","unstructured":"Apache Hudi. 2023 h. Time Travel Query. 
https:\/\/hudi.apache.org\/docs\/0.12.2\/quick-start-guide#time-travel-query."},{"key":"e_1_2_2_36_1","unstructured":"Apache Iceberg. 2022. Schema evolution. https:\/\/iceberg.apache.org\/docs\/1.1.0\/evolution\/#schema-evolution."},{"key":"e_1_2_2_37_1","unstructured":"Apache Iceberg. 2023 a. https:\/\/iceberg.apache.org\/."},{"key":"e_1_2_2_38_1","unstructured":"Apache Iceberg. 2023 b. Concurrent write operations. https:\/\/iceberg.apache.org\/docs\/1.1.0\/reliability\/#concurrent-write-operations."},{"key":"e_1_2_2_39_1","unstructured":"Apache Iceberg. 2023 c. MAX_CONCURRENT_FILE_GROUP_REWRITES. https:\/\/iceberg.apache.org\/javadoc\/1.1.0\/org\/apache\/iceberg\/actions\/RewriteDataFiles.html#MAX_CONCURRENT_FILE_GROUP_REWRITES."},{"key":"e_1_2_2_40_1","unstructured":"Apache Iceberg. 2023 d. Partition evolution. https:\/\/iceberg.apache.org\/docs\/1.1.0\/evolution\/#partition-evolution."},{"key":"e_1_2_2_41_1","unstructured":"Apache Iceberg. 2023 e. Spark Procedures. https:\/\/iceberg.apache.org\/docs\/latest\/spark-procedures\/."},{"key":"e_1_2_2_42_1","unstructured":"Apache Iceberg. 2023 f. Specification. https:\/\/iceberg.apache.org\/spec\/."},{"key":"e_1_2_2_43_1","unstructured":"Apache Iceberg. 2023 g. Time travel. https:\/\/iceberg.apache.org\/docs\/1.1.0\/spark-queries\/#time-travel."},{"key":"e_1_2_2_44_1","volume-title":"Analyzing and Comparing Lakehouse Storage Systems. In Conference on Innovative Data Systems Research (CIDR).","author":"Jain Paras","year":"2023","unstructured":"Paras Jain, Peter Kraft, Conor Power, Tathagata Das, Ion Stoica, and Matei Zaharia. 2023. Analyzing and Comparing Lakehouse Storage Systems. In Conference on Innovative Data Systems Research (CIDR)."},{"key":"e_1_2_2_45_1","volume-title":"The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling","author":"Kimball Ralph","unstructured":"Ralph Kimball and Margy Ross. 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (3rd ed.). 
Wiley Publishing.","edition":"3"},{"key":"e_1_2_2_46_1","unstructured":"Alexey Kudinkin. 2022. Apache Hudi vs Delta Lake - Transparent TPC-DS Lakehouse Performance Benchmarks. https:\/\/www.onehouse.ai\/blog\/apache-hudi-vs-delta-lake-transparent-tpc-ds-lakehouse-performance-benchmarks"},{"key":"e_1_2_2_47_1","unstructured":"Delta Lake. 2023. Delta OSS TPC-DS Benchmark. https:\/\/github.com\/delta-io\/delta\/tree\/master\/benchmarks."},{"key":"e_1_2_2_48_1","volume-title":"Adaptive Code Learning for Spark Configuration Tuning. In IEEE International Conference on Data Engineering (ICDE). 1995--2007","author":"Lin Chen","year":"2022","unstructured":"Chen Lin, Junqing Zhuang, Jiadong Feng, Hui Li, Xuanhe Zhou, and Guoliang Li. 2022. Adaptive Code Learning for Spark Configuration Tuning. In IEEE International Conference on Data Engineering (ICDE). 1995--2007."},{"key":"e_1_2_2_49_1","volume-title":"Starburst: Trino: The origins and development of fault-tolerant execution. https:\/\/www.starburst.io\/blog\/trino-development-fault-tolerant-execution\/","author":"Lullo Emma","year":"2023","unstructured":"Emma Lullo. 2023. Starburst: Trino: The origins and development of fault-tolerant execution. https:\/\/www.starburst.io\/blog\/trino-development-fault-tolerant-execution\/"},{"key":"e_1_2_2_50_1","unstructured":"Dipankar Mazumdar. 2023. Dremio: Write performance in Data Lakes with Apache Iceberg & Spark. https:\/\/www.linkedin.com\/posts\/dipankar-mazumdar_apacheiceberg-dataengineering-softwareengineering-activity-7085019704540418048-J4hG"},{"key":"e_1_2_2_51_1","unstructured":"Alex Merced. 2022. Comparison of Data Lake Table Formats (Apache Iceberg Apache Hudi and Delta Lake). https:\/\/www.dremio.com\/blog\/comparison-of-data-lake-table-formats-apache-iceberg-apache-hudi-and-delta-lake\/"},{"key":"e_1_2_2_52_1","unstructured":"Microsoft. 2023 a. Azure Data Lake Storage. 
https:\/\/azure.microsoft.com\/products\/storage\/data-lake-storage."},{"key":"e_1_2_2_53_1","unstructured":"Microsoft. 2023 b. Azure Monitor. https:\/\/azure.microsoft.com\/products\/monitor."},{"key":"e_1_2_2_54_1","unstructured":"Microsoft. 2023 c. Azure Virtual Machine Scale Sets. https:\/\/azure.microsoft.com\/products\/virtual-machine-scale-sets\/."},{"key":"e_1_2_2_55_1","unstructured":"Microsoft. 2023 d. Log Analytics in Azure Monitor. https:\/\/learn.microsoft.com\/azure\/azure-monitor\/logs\/log-analytics-overview."},{"key":"e_1_2_2_56_1","unstructured":"Microsoft. 2023 e. Universal Format (UniForm). https:\/\/learn.microsoft.com\/en-us\/azure\/databricks\/delta\/uniform."},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.5555\/1182635.1164217"},{"key":"e_1_2_2_58_1","volume-title":"Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems. In ACM International Conference on Management of Data (SIGMOD). 677--689","author":"Neumann Thomas","year":"2015","unstructured":"Thomas Neumann, Tobias M\u00fchlbauer, and Alfons Kemper. 2015. Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems. In ACM International Conference on Management of Data (SIGMOD). 677--689."},{"key":"e_1_2_2_59_1","unstructured":"OneHouse. 2023. https:\/\/www.onehouse.ai\/."},{"key":"e_1_2_2_60_1","unstructured":"Oracle Exadata. 2023. http:\/\/www.oracle.com\/exadata."},{"key":"e_1_2_2_61_1","unstructured":"Apache ORC. 2023. https:\/\/orc.apache.org\/."},{"key":"e_1_2_2_62_1","unstructured":"Apache Ozone. 2023. https:\/\/ozone.apache.org\/."},{"key":"e_1_2_2_63_1","unstructured":"Apache Parquet. 2023. https:\/\/parquet.apache.org\/."},{"key":"e_1_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3127479.3128603"},{"key":"e_1_2_2_65_1","unstructured":"Matthew Powers. 2023 a. Delta Lake Schema Evolution. 
https:\/\/delta.io\/blog\/2023-02-08-delta-lake-schema-evolution\/"},{"key":"e_1_2_2_66_1","unstructured":"Matthew Powers. 2023 b. Delta Lake Time Travel. https:\/\/delta.io\/blog\/2023-02-01-delta-lake-time-travel\/"},{"key":"e_1_2_2_67_1","unstructured":"Apache Spark. 2023 a. https:\/\/spark.apache.org\/."},{"key":"e_1_2_2_68_1","unstructured":"Apache Spark. 2023 b. SET. https:\/\/spark.apache.org\/docs\/3.4.1\/sql-ref-syntax-aux-conf-mgmt-set.html."},{"volume-title":"The Design of POSTGRES. In ACM International Conference on Management of Data (SIGMOD). 340--355","author":"Stonebraker Michael","key":"e_1_2_2_69_1","unstructured":"Michael Stonebraker and Lawrence A. Rowe. 1986. The Design of POSTGRES. In ACM International Conference on Management of Data (SIGMOD). 340--355."},{"key":"e_1_2_2_70_1","unstructured":"Tabular. 2023. https:\/\/tabular.io\/."},{"key":"e_1_2_2_71_1","unstructured":"Teradata. 2023. https:\/\/www.teradata.com\/."},{"key":"e_1_2_2_72_1","unstructured":"TPC. 2021. TPC-DS Specification Version 3.2.0. https:\/\/www.tpc.org\/tpc_documents_current_versions\/pdf\/tpc-ds_v3.2.0.pdf."},{"key":"e_1_2_2_73_1","unstructured":"TPC. 2023. TPC Benchmarks Overview. https:\/\/www.tpc.org\/information\/benchmarks5.asp."},{"key":"e_1_2_2_74_1","unstructured":"Trino. 2023. https:\/\/trino.io\/."},{"key":"e_1_2_2_75_1","unstructured":"Kyle Weller. 2023. Apache Hudi vs Delta Lake vs Apache Iceberg - Lakehouse Feature Comparison. https:\/\/www.onehouse.ai\/blog\/apache-hudi-vs-delta-lake-vs-apache-iceberg-lakehouse-feature-comparison"},{"key":"e_1_2_2_76_1","doi-asserted-by":"crossref","unstructured":"Paul Westerman. 2001. Data warehousing: using the Wal-Mart model. 
Morgan Kaufmann.","DOI":"10.1016\/B978-155860684-5\/50001-6"},{"key":"e_1_2_2_77_1","doi-asserted-by":"publisher","DOI":"10.14778\/3067421.3067427"},{"key":"e_1_2_2_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457569"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3639314","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3639314","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T15:15:13Z","timestamp":1755789313000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3639314"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,12]]},"references-count":78,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,3,12]]}},"alternative-id":["10.1145\/3639314"],"URL":"https:\/\/doi.org\/10.1145\/3639314","relation":{},"ISSN":["2836-6573"],"issn-type":[{"type":"electronic","value":"2836-6573"}],"subject":[],"published":{"date-parts":[[2024,3,12]]}}}