{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,7]],"date-time":"2024-08-07T00:42:33Z","timestamp":1722991353875},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"9","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,5]]},"abstract":"<jats:p>\n            Dataframe is a popular construct in data analysis libraries that offers a tabular view of the data. However, data within a dataframe often has redundancy, which can lead to high memory utilization of data analysis libraries. Inspired by the process of normalization in relational database systems, we propose a technique called\n            <jats:italic>splitting<\/jats:italic>\n            that can be applied to tabular data to reduce redundancy. Splitting involves performing lossless join decomposition by explicitly adding joining keys, and unlike normalization, splitting can be applied to tabular data without the need to perform functional dependency discovery. A\n            <jats:italic>split<\/jats:italic>\n            dataframe provides the same unified tabular view to the data, while internally operating on split data to improve memory efficiency. We develop SplitDF, an implementation of split dataframes in Ibis for DuckDB backend, which enables data analysis on split data with minimal changes to the Ibis API. Generation of split tabular data is automated using an algorithm SplitGen implemented in Velox. 
In our analysis involving ten handwritten notebooks running on SplitDF, we observe a reduction in memory usage of 19--61% when operating on split data as compared to operating on original data.\n          <\/jats:p>","DOI":"10.14778\/3665844.3665849","type":"journal-article","created":{"date-parts":[[2024,8,6]],"date-time":"2024-08-06T22:19:07Z","timestamp":1722982747000},"page":"2175-2184","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["SplitDF: Splitting Dataframes for Memory-Efficient Data Analysis"],"prefix":"10.14778","volume":"17","author":[{"given":"Aarati","family":"Kakaraparthy","sequence":"first","affiliation":[{"name":"University of Wisconsin, Madison"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jignesh M.","family":"Patel","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,8,6]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2010. How to add an auto-incrementing primary key to an existing table in PostgreSQL? https:\/\/stackoverflow.com\/questions\/2944499\/how-to-add-an-auto-incrementing-primary-key-to-an-existing-table-in-postgresql."},{"key":"e_1_2_1_2_1","unstructured":"2020. The Definitive Data Scientist Environment Setup. https:\/\/whiteboxml.com\/blog\/the-definitive-data-scientist-environment-setup."},{"key":"e_1_2_1_3_1","unstructured":"2022. Dask. https:\/\/www.dask.org\/."},{"key":"e_1_2_1_4_1","unstructured":"2022. Lightweight Compression in DuckDB. https:\/\/duckdb.org\/2022\/10\/28\/lightweight-compression.html."},{"key":"e_1_2_1_5_1","unstructured":"2023. 515k Hotel Reviews Data in Europe. https:\/\/www.kaggle.com\/datasets\/jiashenliu\/515k-hotel-reviews-data-in-europe."},{"key":"e_1_2_1_6_1","unstructured":"2023. Apache Arrow. https:\/\/arrow.apache.org\/."},{"key":"e_1_2_1_7_1","unstructured":"2023. 
Apache Arrow Python Bindings. https:\/\/arrow.apache.org\/docs\/python\/index.html."},{"key":"e_1_2_1_8_1","unstructured":"2023. Bitcoin Historical Data. https:\/\/www.kaggle.com\/datasets\/mczielinski\/bitcoin-historical-data."},{"key":"e_1_2_1_9_1","unstructured":"2023. Brazilian E-commerce Public Dataset by Olist. https:\/\/www.kaggle.com\/datasets\/olistbr\/brazilian-ecommerce."},{"key":"e_1_2_1_10_1","unstructured":"2023. COVID 19 Dataset. https:\/\/www.kaggle.com\/datasets\/imdevskp\/corona-virus-report."},{"key":"e_1_2_1_11_1","unstructured":"2023. Create Sequence in DuckDB. https:\/\/duckdb.org\/docs\/sql\/statements\/create_sequence.html."},{"key":"e_1_2_1_12_1","unstructured":"2023. cuDF - GPU DataFrames. https:\/\/github.com\/rapidsai\/cudf."},{"key":"e_1_2_1_13_1","unstructured":"2023. Dask Memory limits reached in simple ETL-like data transformations. https:\/\/dask.discourse.group\/t\/memory-limits-reached-in-simple-etl-like-data-transformations\/1687."},{"key":"e_1_2_1_14_1","unstructured":"2023. Data Science for Good - Kiva Crowdfunding. https:\/\/www.kaggle.com\/datasets\/kiva\/data-science-for-good-kiva-crowdfunding."},{"key":"e_1_2_1_15_1","unstructured":"2023. Emergency - 911 Calls. https:\/\/www.kaggle.com\/datasets\/mchirico\/montcoalert."},{"key":"e_1_2_1_16_1","unstructured":"2023. FIFA 20 complete player dataset. https:\/\/www.kaggle.com\/datasets\/stefanoleone992\/fifa-20-complete-player-dataset."},{"key":"e_1_2_1_17_1","unstructured":"2023. Fitbit Fitness Tracker Data. https:\/\/www.kaggle.com\/datasets\/arashnic\/fitbit."},{"key":"e_1_2_1_18_1","unstructured":"2023. Flight Status Prediction. https:\/\/www.kaggle.com\/datasets\/robikscube\/flight-delay-dataset-20182022."},{"key":"e_1_2_1_19_1","unstructured":"2023. Football Events. https:\/\/www.kaggle.com\/datasets\/secareanualin\/football-events."},{"key":"e_1_2_1_20_1","unstructured":"2023. The Ibis Project. https:\/\/ibis-project.org\/."},{"key":"e_1_2_1_21_1","unstructured":"2023. 
Importing Data in DuckDB. https:\/\/duckdb.org\/docs\/data\/overview.html."},{"key":"e_1_2_1_22_1","unstructured":"2023. Kaggle: Your Machine Learning and Data Science Community. https:\/\/www.kaggle.com\/."},{"key":"e_1_2_1_23_1","unstructured":"2023. Koalas. https:\/\/github.com\/databricks\/koalas."},{"key":"e_1_2_1_24_1","unstructured":"2023. Lossless Join Decomposition. https:\/\/en.wikipedia.org\/wiki\/Lossless_join_decomposition."},{"key":"e_1_2_1_25_1","unstructured":"2023. NYC Parking Tickets. https:\/\/www.kaggle.com\/datasets\/new-york-city\/nyc-parking-tickets\/."},{"key":"e_1_2_1_26_1","unstructured":"2023. pandas. https:\/\/pandas.pydata.org\/."},{"key":"e_1_2_1_27_1","unstructured":"2023. PySpark Documentation. https:\/\/spark.apache.org\/docs\/latest\/api\/python."},{"key":"e_1_2_1_28_1","unstructured":"2023. Splitting. https:\/\/github.com\/UWQuickstep\/splitting."},{"key":"e_1_2_1_29_1","unstructured":"2023. The SQLGlot library. https:\/\/sqlglot.com\/sqlglot.html."},{"key":"e_1_2_1_30_1","unstructured":"2023. Tackling Excessive Memory Usage with Dask Dataframes from Parquet Files. https:\/\/saturncloud.io\/blog\/tackling-excessive-memory-usage-with-dask-dataframes-from-parquet-files\/."},{"key":"e_1_2_1_31_1","unstructured":"2023. time(1) - Linux Manual Page. https:\/\/man7.org\/linux\/man-pages\/man1\/time.1.html."},{"key":"e_1_2_1_32_1","unstructured":"2023. US Accidents 2019. https:\/\/www.kaggle.com\/datasets\/sobhanmoosavi\/us-accidents."},{"key":"e_1_2_1_33_1","unstructured":"2023. Vaex.io: An ML Ready Fast Dataframe for Python. https:\/\/vaex.io\/."},{"key":"e_1_2_1_34_1","unstructured":"2023. What is data partitioning and how to do it right. 
https:\/\/www.cockroachlabs.com\/blog\/what-is-data-partitioning-and-how-to-do-it-right\/."},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management","author":"Abedjan Ziawasch","year":"2014","unstructured":"Ziawasch Abedjan, Patrick Schulze, and Felix Naumann. 2014. DFD: Efficient Functional Dependency Discovery. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (Shanghai, China) (CIKM '14). Association for Computing Machinery, New York, NY, USA, 949--958. 10.1145\/2661829.2661884"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/362384.362685"},{"key":"e_1_2_1_37_1","volume-title":"California RJ909","author":"Codd E. F.","year":"1971","unstructured":"E. F. Codd. 1971. Further Normalization of the Data Base Relational Model. Research Report \/ RJ \/ IBM \/ San Jose, California RJ909 (1971). https:\/\/api.semanticscholar.org\/CorpusID:45071523"},{"key":"e_1_2_1_38_1","unstructured":"E. F. Codd. 1974. Recent Investigations in Relational Data Base Systems. In ACM Pacific. https:\/\/api.semanticscholar.org\/CorpusID:47325247"},{"key":"e_1_2_1_39_1","volume-title":"Proceedings of the 2016 International Conference on Management of Data","author":"Dageville Benoit","year":"2016","unstructured":"Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. 2016. The Snowflake Elastic Data Warehouse. In Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD '16). Association for Computing Machinery, New York, NY, USA, 215--226. 
10.1145\/2882903.2903741"},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the 15th International Conference on Database Theory","author":"Darwen Hugh","year":"2012","unstructured":"Hugh Darwen, C. J. Date, and Ronald Fagin. 2012. A Normal Form for Preventing Redundant Tuples in Relational Databases. In Proceedings of the 15th International Conference on Database Theory (Berlin, Germany) (ICDT '12). Association for Computing Machinery, New York, NY, USA, 114--126. 10.1145\/2274576.2274589"},{"key":"e_1_2_1_41_1","volume-title":"Discrete Mathematics Theoretical Computer Science DMTCS Proceedings","volume":"03","author":"Flajolet Philippe","year":"2012","unstructured":"Philippe Flajolet, Eric Fusy, Olivier Gandouet, and Fr\u00e9d\u00e9ric Meunier. 2012. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics Theoretical Computer Science DMTCS Proceedings vol. AH,... (03 2012). 10.46298\/dmtcs.3545"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1016\/0022-0000(85)90041-8"},{"key":"e_1_2_1_43_1","volume-title":"Conference on Innovative Data Systems Research. https:\/\/api.semanticscholar.org\/CorpusID:213180298","author":"Hagedorn Stefan","year":"2020","unstructured":"Stefan Hagedorn. 2020. When sweet and cute isn't enough anymore: Solving scalability issues in Python Pandas with Grizzly. In Conference on Innovative Data Systems Research. https:\/\/api.semanticscholar.org\/CorpusID:213180298"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1093\/comjnl\/42.2.100"},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data","author":"Ilyas Ihab F.","year":"2004","unstructured":"Ihab F. Ilyas, Volker Markl, Peter Haas, Paul Brown, and Ashraf Aboulnaga. 2004. CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies. 
In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (Paris, France) (SIGMOD '04). Association for Computing Machinery, New York, NY, USA, 647--658. 10.1145\/1007568.1007641"},{"key":"e_1_2_1_46_1","volume-title":"Conference on Innovative Data Systems Research. https:\/\/api.semanticscholar.org\/CorpusID:231782138","author":"Jindal Alekh","year":"2021","unstructured":"Alekh Jindal, K. Venkatesh Emani, Maureen Daum, Olga Poppe, Brandon Haynes, Anna Pavlenko, Ayushi Gupta, Karthik Ramachandra, Carlo Curino, Andreas Mueller, Wentao Wu, and Hiren Patel. 2021. Magpie: Python at Speed and Scale using Cloud Backends. In Conference on Innovative Data Systems Research. https:\/\/api.semanticscholar.org\/CorpusID:231782138"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192968"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2010.197"},{"key":"e_1_2_1_49_1","volume-title":"Ray: A Distributed Framework for Emerging AI Applications. CoRR abs\/1712.05889","author":"Moritz Philipp","year":"2017","unstructured":"Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, William Paul, Michael I. Jordan, and Ion Stoica. 2017. Ray: A Distributed Framework for Emerging AI Applications. CoRR abs\/1712.05889 (2017). arXiv:1712.05889 http:\/\/arxiv.org\/abs\/1712.05889"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3003665.3003667"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824086"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.14778\/2794367.2794377"},{"key":"e_1_2_1_53_1","volume-title":"Proceedings of the 2016 International Conference on Management of Data","author":"Papenbrock Thorsten","year":"2016","unstructured":"Thorsten Papenbrock and Felix Naumann. 2016. A Hybrid Approach to Functional Dependency Discovery. 
In Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD '16). Association for Computing Machinery, New York, NY, USA, 821--833. 10.1145\/2882903.2915203"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.14778\/3554821.3554829"},{"key":"e_1_2_1_55_1","volume-title":"Stephen Macke, Doris Xin, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, and Aditya G. Parameswaran.","author":"Petersohn Devin","year":"2020","unstructured":"Devin Petersohn, William W. Ma, Doris Jung Lin Lee, Stephen Macke, Doris Xin, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, and Aditya G. Parameswaran. 2020. Towards Scalable Dataframe Systems. CoRR abs\/2001.00888 (2020). arXiv:2001.00888 http:\/\/arxiv.org\/abs\/2001.00888"},{"key":"e_1_2_1_56_1","volume-title":"Proceedings of the 2019 International Conference on Management of Data","author":"Raasveldt Mark","year":"2019","unstructured":"Mark Raasveldt and Hannes M\u00fchleisen. 2019. DuckDB: An Embeddable Analytical Database. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 1981--1984. 10.1145\/3299869.3320212"},{"volume-title":"Database management systems (3. ed.)","author":"Ramakrishnan Raghu","key":"e_1_2_1_57_1","unstructured":"Raghu Ramakrishnan and Johannes Gehrke. 2003. Database management systems (3. ed.). McGraw-Hill."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476281"},{"key":"e_1_2_1_59_1","volume-title":"Very Large Data Bases Conference. https:\/\/api.semanticscholar.org\/CorpusID:16124888","author":"Sismanis Yannis","year":"2006","unstructured":"Yannis Sismanis, Paul G. Brown, Peter J. Haas, and Berthold Reinwald. 2006. GORDIAN: efficient and scalable discovery of composite keys. In Very Large Data Bases Conference. 
https:\/\/api.semanticscholar.org\/CorpusID:16124888"},{"key":"e_1_2_1_60_1","unstructured":"Wes McKinney. 2017. Apache Arrow and the \"10 Things I Hate About pandas\". https:\/\/wesmckinney.com\/blog\/apache-arrow-pandas-internals\/."},{"key":"e_1_2_1_61_1","volume-title":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","author":"Zhang Yunjia","year":"2020","unstructured":"Yunjia Zhang, Zhihan Guo, and Theodoros Rekatsinas. 2020. A Statistical Perspective on Discovering Functional Dependencies in Noisy Data. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 861--876. 10.1145\/3318464.3389749"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1978.1055934"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3665844.3665849","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,6]],"date-time":"2024-08-06T22:29:17Z","timestamp":1722983357000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3665844.3665849"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5]]},"references-count":62,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2024,5]]}},"alternative-id":["10.14778\/3665844.3665849"],"URL":"https:\/\/doi.org\/10.14778\/3665844.3665849","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2024,5]]},"assertion":[{"value":"2024-08-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}