{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T03:15:41Z","timestamp":1767842141783,"version":"3.49.0"},"reference-count":87,"publisher":"Association for Computing Machinery (ACM)","issue":"4","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,9,22]]},"abstract":"<jats:p>Columnar storage formats are the foundation for modern data analytics systems. The proliferation of open-source file formats (i.e., Parquet, ORC) allows seamless data sharing across disparate platforms. However, these formats were created over a decade ago for hardware and workload environments that are much different from today. Although these formats have incorporated some updates to their specification to adapt to these changes, not all deployments support those modifications, and too often systems cannot overcome the formats' deficiencies and limitations without a rewrite.<\/jats:p>\n          <jats:p>\n            In this paper, we present the\n            <jats:bold>F<\/jats:bold>\n            uture-proof\n            <jats:bold>File<\/jats:bold>\n            <jats:bold>Format<\/jats:bold>\n            (F3) project. It is a next-generation open-source file format with interoperability, extensibility, and efficiency as its core design principles. F3 obviates the need to create a new format every time a shift occurs in data processing and computing by providing a data organization structure and a general-purpose API to allow developers to add new encoding schemes easily. Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable. 
To evaluate F3, we compared it against legacy and state-of-the-art open-source file formats. Our evaluations demonstrate the efficacy of F3's storage layout and the benefits of Wasm-driven decoding.\n          <\/jats:p>","DOI":"10.1145\/3749163","type":"journal-article","created":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T17:17:03Z","timestamp":1758647823000},"page":"1-27","source":"Crossref","is-referenced-by-count":1,"title":["F3: The Open-Source Data File Format for the Future"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-6858-1457","authenticated-orcid":false,"given":"Xinyu","family":"Zeng","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2311-4476","authenticated-orcid":false,"given":"Ruijun","family":"Meng","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-4348-236X","authenticated-orcid":false,"given":"Martin","family":"Prammer","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4028-1639","authenticated-orcid":false,"given":"Wes","family":"McKinney","sequence":"additional","affiliation":[{"name":"Posit PBC, Nashville, TN, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3653-2538","authenticated-orcid":false,"given":"Jignesh M.","family":"Patel","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6040-6991","authenticated-orcid":false,"given":"Andrew","family":"Pavlo","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-4821-1558","authenticated-orcid":false,"given":"Huanchen","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China and Shanghai Qi Zhi 
Institute, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2025,9,23]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2024. Aligning Velox and Apache Arrow: Towards composable data management. https:\/\/engineering.fb.com\/2024\/02\/20\/developer-tools\/velox-apache-arrow-15-composable-data-management\/."},{"key":"e_1_2_1_2_1","unstructured":"2024. Apache Arrow. https:\/\/arrow.apache.org\/."},{"key":"e_1_2_1_3_1","unstructured":"2024. DuckDB Format. https:\/\/duckdb.org\/docs\/guides\/performance\/file_formats.html. Accessed: 2024-12-04."},{"key":"e_1_2_1_4_1","unstructured":"2024. DuckDB's Parquet Implementation. https:\/\/github.com\/duckdb\/duckdb\/tree\/main\/extension\/parquet."},{"key":"e_1_2_1_5_1","unstructured":"2024. Impala's Parquet Implementation. https:\/\/github.com\/apache\/impala\/tree\/master\/be\/src\/exec\/parquet."},{"key":"e_1_2_1_6_1","unstructured":"2024. InfluxDB video: \"Parquet is a standard like SQL is a standard\". https:\/\/lists.apache.org\/thread\/tnxbykozo5owq2y36nw7lomr91hrdxhz."},{"key":"e_1_2_1_7_1","unstructured":"2024. LZ4 Flex. https:\/\/github.com\/PSeitz\/lz4_flex."},{"key":"e_1_2_1_8_1","unstructured":"2024. Parquet C++ Implementation. https:\/\/github.com\/apache\/arrow\/tree\/main\/cpp\/src\/parquet."},{"key":"e_1_2_1_9_1","unstructured":"2024. Parquet Go Implementation. https:\/\/github.com\/apache\/arrow-go\/tree\/main\/parquet."},{"key":"e_1_2_1_10_1","unstructured":"2024. Parquet Java Implementation. https:\/\/github.com\/apache\/parquet-java\/."},{"key":"e_1_2_1_11_1","unstructured":"2024. Parquet Rust Implementation. https:\/\/github.com\/apache\/arrow-rs\/tree\/main\/parquet."},{"key":"e_1_2_1_12_1","unstructured":"2024. Pcodec. https:\/\/github.com\/mwlon\/pcodec."},{"key":"e_1_2_1_13_1","unstructured":"2024. Projects Powered By Apache Arrow. https:\/\/arrow.apache.org\/powered_by\/."},{"key":"e_1_2_1_14_1","unstructured":"2024. Trino's Parquet Implementation. 
https:\/\/github.com\/trinodb\/trino\/tree\/master\/lib\/trino-parquet."},{"key":"e_1_2_1_15_1","unstructured":"2024. Wasm Feature Extensions. https:\/\/webassembly.org\/features\/. Accessed: 2024-11-29."},{"key":"e_1_2_1_16_1","unstructured":"2024. Wasmtime. https:\/\/wasmtime.dev\/."},{"key":"e_1_2_1_17_1","unstructured":"2024. WebAssembly. https:\/\/webassembly.org\/."},{"key":"e_1_2_1_18_1","unstructured":"2025. Apache Carbondata. https:\/\/carbondata.apache.org\/."},{"key":"e_1_2_1_19_1","unstructured":"2025. Apache Hadoop. https:\/\/hadoop.apache.org\/."},{"key":"e_1_2_1_20_1","unstructured":"2025. Apache Hive. https:\/\/hive.apache.org\/."},{"key":"e_1_2_1_21_1","unstructured":"2025. Apache Hudi. https:\/\/hudi.apache.org\/."},{"key":"e_1_2_1_22_1","unstructured":"2025. Apache Iceberg. https:\/\/iceberg.apache.org\/."},{"key":"e_1_2_1_23_1","unstructured":"2025. Apache Impala. https:\/\/impala.apache.org\/."},{"key":"e_1_2_1_24_1","unstructured":"2025. Apache mailing list-Coordinating \/ scheduling C++ Parquet-Arrow nested data work. https:\/\/lists.apache.org\/thread\/wyr53b94fjwfxgynn60bbprpxztqzdym."},{"key":"e_1_2_1_25_1","unstructured":"2025. Apache ORC. https:\/\/orc.apache.org\/."},{"key":"e_1_2_1_26_1","unstructured":"2025. Apache Parquet. https:\/\/parquet.apache.org\/."},{"key":"e_1_2_1_27_1","unstructured":"2025. Apache Presto. https:\/\/prestodb.io\/."},{"key":"e_1_2_1_28_1","unstructured":"2025. Apache Spark. https:\/\/spark.apache.org\/."},{"key":"e_1_2_1_29_1","unstructured":"2025. Arrow IPC format. https:\/\/arrow.apache.org\/docs\/format\/Columnar.html#serialization-and-interprocess-communication-ipc."},{"key":"e_1_2_1_30_1","unstructured":"2025. Dremio. https:\/\/www.dremio.com\/."},{"key":"e_1_2_1_31_1","unstructured":"2025. DuckDB Blog: Query Engines: Gatekeepers of the Parquet File Format. https:\/\/duckdb.org\/2025\/01\/22\/parquet-encodings.html."},{"key":"e_1_2_1_32_1","unstructured":"2025. Flatbuffers Verifier. 
https:\/\/github.com\/google\/flatbuffers\/blob\/master\/rust\/flatbuffers\/src\/verifier.rs."},{"key":"e_1_2_1_33_1","unstructured":"2025. Future File Format (F3). https:\/\/github.com\/future-file-format."},{"key":"e_1_2_1_34_1","unstructured":"2025. Google snappy. http:\/\/google.github.io\/snappy\/."},{"key":"e_1_2_1_35_1","unstructured":"2025. InfluxData. https:\/\/www.influxdata.com\/."},{"key":"e_1_2_1_36_1","unstructured":"2025. Jira-Read and write nested Parquet data with a mix of struct and list nesting levels. https:\/\/issues.apache.org\/jira\/browse\/ARROW-1644."},{"key":"e_1_2_1_37_1","unstructured":"2025. Lance. https:\/\/github.com\/eto-ai\/lance."},{"key":"e_1_2_1_38_1","unstructured":"2025. Lance read all metadata code. https:\/\/github.com\/lancedb\/lance\/blob\/039c6c65c92f5e606fe1431060212a9b1f7becc5\/rust\/lance-file\/src\/v2\/reader.rs#L447."},{"key":"e_1_2_1_39_1","unstructured":"2025. Nimble. https:\/\/github.com\/facebookincubator\/nimble\/."},{"key":"e_1_2_1_40_1","unstructured":"2025. Parquet Implementation Status Page. https:\/\/parquet.apache.org\/docs\/file-format\/implementationstatus\/."},{"key":"e_1_2_1_41_1","unstructured":"2025. Personal Discussion with Wasm maintainers."},{"key":"e_1_2_1_42_1","unstructured":"2025. Semantic Versioning. https:\/\/semver.org\/."},{"key":"e_1_2_1_43_1","unstructured":"2025. Snowflake Data Unloading. https:\/\/docs.snowflake.com\/en\/user-guide\/data-unload-overview."},{"key":"e_1_2_1_44_1","unstructured":"2025. Vortex. https:\/\/github.com\/spiraldb\/vortex\/."},{"key":"e_1_2_1_45_1","volume-title":"Vortex Writer Flush Code. https:\/\/github.com\/spiraldb\/vortex\/blob\/6187ebb4ee130be56404342e21cddadcb69c7215\/vortex-file\/src\/writer.rs#L90","unstructured":"2025. Vortex Writer Flush Code. https:\/\/github.com\/spiraldb\/vortex\/blob\/6187ebb4ee130be56404342e21cddadcb69c7215\/vortex-file\/src\/writer.rs#L90."},{"key":"e_1_2_1_46_1","unstructured":"2025. Wasm3. 
https:\/\/github.com\/wasm3\/wasm3."},{"key":"e_1_2_1_47_1","unstructured":"2025. Zstandard. https:\/\/github.com\/facebook\/zstd."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3598581.3598587"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3662010.3663450"},{"key":"e_1_2_1_50_1","first-page":"169","article-title":"Weaving Relations for Cache Performance","volume":"1","author":"Ailamaki Anastassia","year":"2001","unstructured":"Anastassia Ailamaki, David J DeWitt, Mark D Hill, and Marios Skounakis. 2001. Weaving Relations for Cache Performance. In VLDB, Vol. 1. 169-180.","journal-title":"VLDB"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415560"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407851"},{"key":"e_1_2_1_53_1","unstructured":"Peter A. Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB\/X100: Hyper-Pipelining Query Execution. In CIDR."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/2391229.2391247"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611486"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3062341.3062363"},{"key":"e_1_2_1_57_1","volume-title":"Fast compilation and execution of SQL queries with webassembly. arXiv preprint arXiv:2104.15098","author":"Haffner Immanuel","year":"2021","unstructured":"Immanuel Haffner and Jens Dittrich. 2021. Fast compilation and execution of SQL queries with webassembly. arXiv preprint arXiv:2104.15098 (2021)."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767933"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196911"},{"key":"e_1_2_1_60_1","volume-title":"An evaluation of webassembly and ebpf as offloading mechanisms in the context of computational storage. 
arXiv preprint arXiv:2111.01947","author":"Huang Wenjun","year":"2021","unstructured":"Wenjun Huang and Marcus Paradies. 2021. An evaluation of webassembly and ebpf as offloading mechanisms in the context of computational storage. arXiv preprint arXiv:2111.01947 (2021)."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.14778\/3554821.3554847"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589263"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551852"},{"key":"e_1_2_1_64_1","volume-title":"Bullion: A Column Store for Machine Learning. arXiv preprint arXiv:2404.08901","author":"Liao Gang","year":"2024","unstructured":"Gang Liao, Ye Liu, Jianjun Chen, and Daniel J Abadi. 2024. Bullion: A Column Store for Machine Learning. arXiv preprint arXiv:2404.08901 (2024)."},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611507"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/3639320"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920886"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415568"},{"key":"e_1_2_1_69_1","unstructured":"Ziya Mukhtarov. 2024. Nested Data-Type Encodings in FastLanes. Master's thesis. Technical University of Munich."},{"key":"e_1_2_1_70_1","first-page":"29","article-title":"Umbra: A Disk-Based System with In-Memory Performance","volume":"20","author":"Neumann Thomas","year":"2020","unstructured":"Thomas Neumann and Michael J Freitag. 2020. Umbra: A Disk-Based System with In-Memory Performance. In CIDR, Vol. 20. 29.","journal-title":"CIDR"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.14778\/3685800.3685847"},{"key":"e_1_2_1_72_1","volume-title":"Towards Functional Decomposition of Storage Formats. 
In Conference on Innovative Data Systems Research (CIDR).","author":"Prammer Martin","unstructured":"Martin Prammer, Xinyu Zeng, Ruijun Meng, Wes McKinney, Huanchen Zhang, Andrew Pavlo, and Jignesh M. Patel. 2025. Towards Functional Decomposition of Storage Formats. In Conference on Innovative Data Systems Research (CIDR)."},{"key":"e_1_2_1_73_1","unstructured":"Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS."},{"key":"e_1_2_1_74_1","unstructured":"Moritz-Felipe Sichert. 2024. Efficient and Safe Integration of User-Defined Operators into Modern Database Systems. Ph.D. Dissertation. Technische Universit\u00e4t M\u00fcnchen."},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465306"},{"key":"e_1_2_1_76_1","unstructured":"Utku Sirin, Victoria Kauffman, Aadit Saluja, Florian Klein, Jeremy Hsu, and Stratos Idreos. 2025. Frequency-Store: Scaling Image AI by A Column-Store for Images. (2025)."},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3676288.3676305"},{"key":"e_1_2_1_78_1","volume-title":"Proceedings of the Conference on Innovative Data Systems Research. https:\/\/www.cidrdb.org\/cidr2023\/papers\/p66-wolde.pdf.","author":"Wolde Daniel","year":"2023","unstructured":"Daniel ten Wolde, Tavneet Singh, G\u00e1bor Sz\u00e1rnyas, and Peter Boncz. 2023. DuckPGQ: Efficient property graph queries in an analytical RDBMS. In Proceedings of the Conference on Innovative Data Systems Research. https:\/\/www.cidrdb.org\/cidr2023\/papers\/p66-wolde.pdf."},
{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687553.1687609"},{"key":"e_1_2_1_80_1","volume-title":"The Conference on Innovative Data Systems Research, CIDR.","author":"Vakharia Suketu","year":"2023","unstructured":"Suketu Vakharia, Peng Li, Weiran Liu, and Sundaram Narayanan. 2023. Shared Foundations: Modernizing Meta's Data Lakehouse. In The Conference on Innovative Data Systems Research, CIDR."},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1145\/3209950.3209952"},{"key":"e_1_2_1_82_1","first-page":"15","volume-title":"9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)","author":"Zaharia Matei","year":"2012","unstructured":"Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A Fault-Tolerant abstraction for In-Memory cluster computing. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). 15-28."},{"key":"e_1_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.14778\/3626292.3626298"},{"key":"e_1_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1145\/3662010.3663452"},{"key":"e_1_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196931"},{"key":"e_1_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380583"},{"key":"e_1_2_1_87_1","first-page":"4064","article-title":"Apache TsFile: An IoT-native Time Series File Format","volume":"17","author":"Zhao Xin","year":"2024","unstructured":"Xin Zhao, Jialin Qiao, Xiangdong Huang, Chen Wang, Shaoxu Song, and Jianmin Wang. 2024. Apache TsFile: An IoT-native Time Series File Format. Proc. VLDB Endow., Vol. 17, 12 (2024), 4064-4076. https:\/\/www.vldb.org\/pvldb\/vol17\/p4064-song.pdf","journal-title":"Proc. VLDB Endow."}],
"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3749163","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,26]],"date-time":"2025-09-26T16:22:25Z","timestamp":1758903745000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3749163"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,22]]},"references-count":87,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,9,22]]}},"alternative-id":["10.1145\/3749163"],"URL":"https:\/\/doi.org\/10.1145\/3749163","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,22]]}}}