{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T09:02:58Z","timestamp":1775638978221,"version":"3.50.1"},"reference-count":118,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2023,10]]},"abstract":"<jats:p>Columnar storage is a core component of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support to open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed.<\/jats:p>\n          <jats:p>In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. We also point out the inefficiencies in the format designs when handling common machine learning workloads and using GPUs for decoding. 
Our analysis identified important considerations that may guide future formats to better fit modern technology trends.<\/jats:p>","DOI":"10.14778\/3626292.3626298","type":"journal-article","created":{"date-parts":[[2023,12,11]],"date-time":"2023-12-11T23:24:55Z","timestamp":1702337095000},"page":"148-161","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":34,"title":["An Empirical Evaluation of Columnar Storage Formats"],"prefix":"10.14778","volume":"17","author":[{"given":"Xinyu","family":"Zeng","sequence":"first","affiliation":[{"name":"Tsinghua University"}]},{"given":"Yulong","family":"Hui","sequence":"additional","affiliation":[{"name":"Tsinghua University"}]},{"given":"Jiahong","family":"Shen","sequence":"additional","affiliation":[{"name":"Tsinghua University"}]},{"given":"Andrew","family":"Pavlo","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University"}]},{"given":"Wes","family":"McKinney","sequence":"additional","affiliation":[{"name":"Voltron Data"}]},{"given":"Huanchen","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tsinghua University"}]}],"member":"320","published-online":{"date-parts":[[2023,10]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2016. File Format Benchmark - Avro JSON ORC & Parquet. https:\/\/www.slideshare.net\/HadoopSummit\/file-format-benchmark-avro-json-orc-parquet."},{"key":"e_1_2_1_2_1","unstructured":"2016. Format Wars: From VHS and Beta to Avro and Parquet. http:\/\/www.svds.com\/dataformats\/."},{"key":"e_1_2_1_3_1","unstructured":"2016. Inside Capacitor BigQuery's next-generation columnar storage format. 
https:\/\/cloud.google.com\/blog\/products\/bigquery\/inside-capacitor-bigquerys-next-generation-columnar-storage-format."},{"key":"e_1_2_1_4_1","unstructured":"2017. Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation? http:\/\/dbmsmusings.blogspot.com\/2017\/10\/apache-arrow-vs-parquet-and-orc-do-we.html."},{"key":"e_1_2_1_5_1","unstructured":"2017. Some comments to Daniel Abadi's blog about Apache Arrow. https:\/\/wesmckinney.com\/blog\/arrow-columnar-abadi\/."},{"key":"e_1_2_1_6_1","unstructured":"2022. UCI Machine Learning Repository. https:\/\/archive.ics.uci.edu\/ml\/datasets.php. Accessed: 2022-09-22."},{"key":"e_1_2_1_7_1","unstructured":"2023. Amazon S3. https:\/\/aws.amazon.com\/s3\/."},{"key":"e_1_2_1_8_1","unstructured":"2023. Apache Arrow. https:\/\/arrow.apache.org\/."},{"key":"e_1_2_1_9_1","unstructured":"2023. Apache Arrow Dataset API. https:\/\/arrow.apache.org\/docs\/python\/generated\/pyarrow.parquet.ParquetDataset.html."},{"key":"e_1_2_1_10_1","unstructured":"2023. Apache Avro. 
https:\/\/avro.apache.org\/."},{"key":"e_1_2_1_11_1","unstructured":"2023. Apache Carbondata. https:\/\/carbondata.apache.org\/."},{"key":"e_1_2_1_12_1","unstructured":"2023. Apache Hadoop. https:\/\/hadoop.apache.org\/."},{"key":"e_1_2_1_13_1","unstructured":"2023. Apache Hive. https:\/\/hive.apache.org\/."},{"key":"e_1_2_1_14_1","unstructured":"2023. Apache Hudi. https:\/\/hudi.apache.org\/."},{"key":"e_1_2_1_15_1","unstructured":"2023. Apache Iceberg. https:\/\/iceberg.apache.org\/."},{"key":"e_1_2_1_16_1","unstructured":"2023. Apache Impala. https:\/\/impala.apache.org\/."},{"key":"e_1_2_1_17_1","unstructured":"2023. Apache ORC. https:\/\/orc.apache.org\/."},{"key":"e_1_2_1_18_1","unstructured":"2023. Apache Parquet. https:\/\/parquet.apache.org\/."},{"key":"e_1_2_1_19_1","unstructured":"2023. Apache Presto. https:\/\/prestodb.io\/."},{"key":"e_1_2_1_20_1","unstructured":"2023. Apache Spark. https:\/\/spark.apache.org\/."},{"key":"e_1_2_1_21_1","unstructured":"2023. Arrow C++ and Parquet C++. https:\/\/github.com\/apache\/arrow\/tree\/main\/cpp."},{"key":"e_1_2_1_22_1","unstructured":"2023. AutoFaiss. https:\/\/github.com\/criteo\/autofaiss."},{"key":"e_1_2_1_23_1","unstructured":"2023. AutoFAISS build index API. 
https:\/\/criteo.github.io\/autofaiss\/API\/_autosummary\/autofaiss.external.quantize.build_index.html. Accessed: 2023-07-17."},{"key":"e_1_2_1_24_1","unstructured":"2023. Azure Blob Storage. https:\/\/azure.microsoft.com\/en-us\/services\/storage\/blobs\/."},{"key":"e_1_2_1_25_1","unstructured":"2023. BP5. https:\/\/adios2.readthedocs.io\/en\/latest\/engines\/engines.html#bp5."},{"key":"e_1_2_1_26_1","unstructured":"2023. Chroma. https:\/\/github.com\/chroma-core\/chroma\/."},{"key":"e_1_2_1_27_1","unstructured":"2023. ClickHouse. https:\/\/clickhouse.com\/."},{"key":"e_1_2_1_28_1","unstructured":"2023. ClickHouse Example Datasets. https:\/\/clickhouse.com\/docs\/en\/getting-started\/example-datasets."},{"key":"e_1_2_1_29_1","unstructured":"2023. Dremio. https:\/\/www.dremio.com\/\/."},{"key":"e_1_2_1_30_1","unstructured":"2023. EDGAR Log File Data Sets. https:\/\/www.sec.gov\/about\/data\/edgar-log-file-data-sets.html."},{"key":"e_1_2_1_31_1","unstructured":"2023. GeoNames Dataset. http:\/\/www.geonames.org\/."},{"key":"e_1_2_1_32_1","unstructured":"2023. Google BigQuery. https:\/\/cloud.google.com\/bigquery."},{"key":"e_1_2_1_33_1","unstructured":"2023. 
Google Cloud Storage. https:\/\/cloud.google.com\/storage."},{"key":"e_1_2_1_34_1","unstructured":"2023. Google snappy. http:\/\/google.github.io\/snappy\/."},{"key":"e_1_2_1_35_1","unstructured":"2023. Hugging Face Datasets Server. https:\/\/huggingface.co\/docs\/datasets-server\/quick_start#access-parquet-files. Accessed: 2023-07-09."},{"key":"e_1_2_1_36_1","unstructured":"2023. image-parquet. https:\/\/discuss.huggingface.co\/t\/image-dataset-best-practices\/13974."},{"key":"e_1_2_1_37_1","unstructured":"2023. IMDb Datasets. https:\/\/www.imdb.com\/interfaces\/."},{"key":"e_1_2_1_38_1","unstructured":"2023. InfluxData. https:\/\/www.influxdata.com\/."},{"key":"e_1_2_1_39_1","unstructured":"2023. NetCDF. https:\/\/www.unidata.ucar.edu\/software\/netcdf\/."},{"key":"e_1_2_1_40_1","unstructured":"2023. NVIDIA Nsight Compute. https:\/\/developer.nvidia.com\/nsight-compute."},{"key":"e_1_2_1_41_1","unstructured":"2023. ORC C++. https:\/\/github.com\/apache\/orc\/tree\/main\/c%2B%2B."},{"key":"e_1_2_1_42_1","unstructured":"2023. Parquet Bloom Filter Jira Discussion. https:\/\/issues.apache.org\/jira\/browse\/PARQUET-41."},{"key":"e_1_2_1_43_1","unstructured":"2023. Pinecone. 
https:\/\/www.pinecone.io\/."},{"key":"e_1_2_1_44_1","unstructured":"2023. Protocol Buffers. https:\/\/developers.google.com\/protocol-buffers\/."},{"key":"e_1_2_1_45_1","unstructured":"2023. Public BI benchmark. https:\/\/github.com\/cwida\/public_bi_benchmark."},{"key":"e_1_2_1_46_1","unstructured":"2023. Querying Parquet with Millisecond Latency. https:\/\/www.influxdata.com\/blog\/querying-parquet-millisecond-latency\/."},{"key":"e_1_2_1_47_1","unstructured":"2023. RAPIDS. https:\/\/rapids.ai\/."},{"key":"e_1_2_1_48_1","unstructured":"2023. Samsung 980 PRO 4.0 NVMe SSD. https:\/\/www.samsung.com\/us\/computing\/memory-storage\/solid-state-drives\/980-pro-pcie-4-0-nvme-ssd-1tb-mz-v8p1t0b-am\/. Accessed: 2023-02-21."},{"key":"e_1_2_1_49_1","unstructured":"2023. SequenceFile. https:\/\/cwiki.apache.org\/confluence\/display\/HADOOP2\/SequenceFile."},{"key":"e_1_2_1_50_1","unstructured":"2023. The DWRF Format. https:\/\/github.com\/facebookarchive\/hive-dwrf."},{"key":"e_1_2_1_51_1","unstructured":"2023. Vector Data Lakes. https:\/\/www.databricks.com\/dataaisummit\/session\/vector-data-lakes\/. 
Accessed: 2023-07-28."},{"key":"e_1_2_1_52_1","unstructured":"2023. Yelp Open Dataset. https:\/\/www.yelp.com\/dataset\/."},{"key":"e_1_2_1_53_1","unstructured":"2023. Zarr. https:\/\/zarr.dev\/."},{"key":"e_1_2_1_54_1","unstructured":"2023. Zstandard. https:\/\/github.com\/facebook\/zstd."},{"key":"e_1_2_1_55_1","doi-asserted-by":"crossref","unstructured":"Daniel Abadi Peter Boncz Stavros Harizopoulos Stratos Idreos Samuel Madden et al. 2013. The design and implementation of modern column-oriented database systems. Foundations and Trends\u00ae in Databases 5 3 (2013) 197--280.","DOI":"10.1561\/1900000024"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/1142473.1142548"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.14778\/3598581.3598587"},{"key":"e_1_2_1_58_1","volume-title":"Proceedings of the VLDB Endowment (PVLDB) 14 (12)","author":"Agiwal Ankur","year":"2021","unstructured":"Ankur Agiwal and Kevin Lai et al. 2021. Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google. Proceedings of the VLDB Endowment (PVLDB) 14 (12) (2021), 2986--2998."},{"key":"e_1_2_1_59_1","first-page":"169","article-title":"Weaving Relations for Cache Performance","volume":"1","author":"Ailamaki Anastassia","year":"2001","unstructured":"Anastassia Ailamaki, David J DeWitt, Mark D Hill, and Marios Skounakis. 2001. Weaving Relations for Cache Performance. 
In VLDB, Vol. 1. 169--180.","journal-title":"VLDB"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.14778\/3547305.3547314"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415560"},{"key":"e_1_2_1_62_1","volume-title":"Proceedings of CIDR. 8.","author":"Armbrust Michael","year":"2021","unstructured":"Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. 2021. Lake-house: a new generation of open platforms that unify data warehousing and advanced analytics. In Proceedings of CIDR. 8."},{"key":"e_1_2_1_63_1","volume-title":"Pixels: An Efficient Column Store for Cloud Data Lakes. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 3078--3090","author":"Bian Haoqiong","year":"2022","unstructured":"Haoqiong Bian and Anastasia Ailamaki. 2022. Pixels: An Efficient Column Store for Cloud Data Lakes. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 3078--3090."},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3035930"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407851"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352121"},{"key":"e_1_2_1_67_1","doi-asserted-by":"crossref","unstructured":"
Brian F. Cooper Adam Silberstein Erwin Tam Raghu Ramakrishnan and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In SoCC. 143--154.","DOI":"10.1145\/1807128.1807152"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/971699.318923"},{"key":"e_1_2_1_69_1","unstructured":"Dario Curreri Olivier Cur\u00e9 and Marinella Sciortino. [n.d.]. RDF DATA AND COLUMNAR FORMATS. Master's thesis."},{"key":"e_1_2_1_70_1","doi-asserted-by":"crossref","unstructured":"Benoit Dageville Thierry Cruanes Marcin Zukowski Vadim Antonov Artin Avanes Jon Bock Jonathan Claybaugh Daniel Engovatov Martin Hentschel Jiansheng Huang et al. 2016. The Snowflake Elastic Data Warehouse. In SIGMOD.","DOI":"10.1145\/2882903.2903741"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.14778\/3484224.3484234"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732977.2733002"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1145\/1966895.1966900"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.1998.655800"},{"key":"e_1_2_1_75_1","doi-asserted-by":"crossref","unstructured":"Anurag Gupta Deepak Agarwal Derek Tan Jakub Kulesza Rahul Pathak Stefano Stefani and Vidhya Srinivasan. 2015. Amazon Redshift and the Case for Simpler Data Warehouses. 
In SIGMOD.","DOI":"10.1145\/2723372.2742795"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767933"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196911"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2595630"},{"key":"e_1_2_1_79_1","volume-title":"Monetdb: Two decades of research in column-oriented database","author":"Idreos S","year":"2012","unstructured":"S Idreos, F Groffen, N Nes, S Manegold, S Mullender, and M Kersten. 2012. Monetdb: Two decades of research in column-oriented database. IEEE Data Engineering Bulletin (2012)."},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.5523"},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457283"},{"key":"e_1_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2019.2921572"},{"key":"e_1_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589263"},{"key":"e_1_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1002\/spe.2203"},{"key":"e_1_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589323"},{"key":"e_1_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465322"},{"key":"e_1_2_1_87_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551852"},{"key":"e_1_2_1_88_1","volume-title":"LeCo: Lightweight Compression via Learning Serial Correlations. arXiv preprint arXiv:2306.15374","author":"Liu Yihao","year":"2023","unstructured":"Yihao Liu, Xinyu Zeng, and Huanchen Zhang. 2023. LeCo: Lightweight Compression via Learning Serial Correlations. 
arXiv preprint arXiv:2306.15374 (2023)."},{"key":"e_1_2_1_89_1","volume-title":"Self-Organizing Data Containers. In The Conference on Innovative Data Systems Research, CIDR.","author":"Madden Samuel","year":"2022","unstructured":"Samuel Madden, Jialin Ding, Tim Kraska, Sivaprasad Sudhir, David Cohen, Timothy Mattson, and Nesime Tatbul. 2022. Self-Organizing Data Containers. In The Conference on Innovative Data Systems Research, CIDR."},{"key":"e_1_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1985.5009382"},{"key":"e_1_2_1_91_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920886"},{"key":"e_1_2_1_92_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415568"},{"key":"e_1_2_1_93_1","first-page":"50","article-title":"The star schema benchmark (SSB)","volume":"200","author":"O'Neil Patrick E","year":"2007","unstructured":"Patrick E O'Neil, Elizabeth J O'Neil, and Xuedong Chen. 2007. The star schema benchmark (SSB). Pat 200, 0 (2007), 50.","journal-title":"Pat"},{"key":"e_1_2_1_94_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824078"},{"key":"e_1_2_1_95_1","doi-asserted-by":"publisher","DOI":"10.1109\/BigData.2017.8258260"},{"key":"e_1_2_1_96_1","article-title":"Cache-, Hash-, and Space-Efficient Bloom Filters","author":"Putze Felix","year":"2010","unstructured":"Felix Putze, Peter Sanders, and Johannes Singler. 2010. Cache-, Hash-, and Space-Efficient Bloom Filters. ACM J. Exp. 
Algorithmics 14, Article 4 (Jan 2010), 18 pages.","journal-title":"ACM J. Exp. Algorithmics 14, Article 4"},{"key":"e_1_2_1_97_1","unstructured":"Christoph Schuhmann Romain Beaumont Richard Vencu Cade Gordon Ross Wightman Mehdi Cherti Theo Coombes Aarush Katta Clayton Mullis Mitchell Wortsman Patrick Schramowski Srivatsa Kundurthy Katherine Crowson Ludwig Schmidt Robert Kaczmarczyk and Jenia Jitsev. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS."},{"key":"e_1_2_1_98_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00196"},{"key":"e_1_2_1_99_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380595"},{"key":"e_1_2_1_100_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3526132"},{"key":"e_1_2_1_101_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465306"},{"key":"e_1_2_1_102_1","volume-title":"Proceedings of the 31st International Conference on Very Large Data Bases","author":"Stonebraker Michael","year":"2005","unstructured":"Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O'Neil, Patrick E. 
O'Neil, Alex Rasin, Nga Tran, and Stanley B. Zdonik. 2005. C-Store: A Column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005. ACM, 553--564."},{"key":"e_1_2_1_103_1","unstructured":"The Transaction Processing Council. 2021. TPC-DS Benchmark (Revision 3.2.0)."},{"key":"e_1_2_1_104_1","unstructured":"The Transaction Processing Council. 2022. TPC-H Benchmark (Revision 3.0.1)."},{"key":"e_1_2_1_105_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687553.1687609"},{"key":"e_1_2_1_106_1","volume-title":"2018 USENIX Annual Technical Conference (USENIX ATC 18)","author":"Trivedi Animesh","year":"2018","unstructured":"Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schuepbach, and Bernard Metzler. 2018. Albis: High-Performance File Format for Big Data Systems. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 615--630."},{"key":"e_1_2_1_107_1","doi-asserted-by":"publisher","DOI":"10.14778\/3529337.3529347"},{"key":"e_1_2_1_108_1","volume-title":"The Conference on Innovative Data Systems Research, CIDR.","author":"Vakharia Suketu","year":"2023","unstructured":"Suketu Vakharia, Peng Li, Weiran Liu, and Sundaram Narayanan. 2023. Shared Foundations: Modernizing Meta's Data Lakehouse. 
In The Conference on Innovative Data Systems Research, CIDR."},{"key":"e_1_2_1_109_1","doi-asserted-by":"publisher","DOI":"10.1145\/3209950.3209952"},{"key":"e_1_2_1_110_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457550"},{"key":"e_1_2_1_111_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835958"},{"key":"e_1_2_1_112_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551809"},{"key":"e_1_2_1_113_1","volume-title":"9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)","author":"Zaharia Matei","year":"2012","unstructured":"Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A Fault-Tolerant abstraction for In-Memory cluster computing. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). 15--28."},{"key":"e_1_2_1_114_1","volume-title":"An Empirical Evaluation of Columnar Storage Formats. https:\/\/arxiv.org\/pdf\/2304.05028.pdf\/. arXiv preprint arXiv:2304.05028","author":"Zeng Xinyu","year":"2023","unstructured":"Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and Huanchen Zhang. 2023. An Empirical Evaluation of Columnar Storage Formats. https:\/\/arxiv.org\/pdf\/2304.05028.pdf\/. 
arXiv preprint arXiv:2304.05028 (2023)."},{"key":"e_1_2_1_115_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196931"},{"key":"e_1_2_1_116_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380583"},{"key":"e_1_2_1_117_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.150"},{"key":"e_1_2_1_118_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2012.148"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3626292.3626298","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,1,8]],"date-time":"2024-01-08T23:10:05Z","timestamp":1704755405000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3626292.3626298"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10]]},"references-count":118,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,10]]}},"alternative-id":["10.14778\/3626292.3626298"],"URL":"https:\/\/doi.org\/10.14778\/3626292.3626298","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2023,10]]},"assertion":[{"value":"2023-10-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}