{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T16:51:51Z","timestamp":1771951911658,"version":"3.50.1"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,3,12]],"date-time":"2024-03-12T00:00:00Z","timestamp":1710201600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,3,12]]},"abstract":"<jats:p>In recent years, dataframe libraries, such as pandas have exploded in popularity. Due to their flexibility, they are increasingly used in ad-hoc exploratory data analysis (EDA) workloads. These workloads are diverse, including custom functions which can span libraries or be written in pure Python. The majority of systems available to accelerate EDA workloads focus on bulk-parallel workloads, which contain vastly different computational patterns, typically within a single library. As a result, they can introduce excessive overheads for ad-hoc EDA workloads due to their expensive optimization techniques. Instead, we identify source-to-source, external program rewriting as a lightweight technique which can optimize across representations, and offer substantial speedups while also avoiding slowdowns. We implemented Dias, which rewrites notebook cells to be more efficient for ad-hoc EDA workloads. We develop techniques for efficient rewrites in Dias, including checking the preconditions under which rewrites are correct, dynamically, at fine-grained program points. We show that Dias can rewrite individual cells to be 57\u00d7 faster compared to pandas and 1909\u00d7 faster compared to optimized systems such as modin. Furthermore, Dias can accelerate whole notebooks by up to 3.6\u00d7 compared to pandas and 27.1\u00d7 compared to modin.<\/jats:p>","DOI":"10.1145\/3639313","type":"journal-article","created":{"date-parts":[[2024,3,26]],"date-time":"2024-03-26T18:51:32Z","timestamp":1711479092000},"page":"1-27","source":"Crossref","is-referenced-by-count":4,"title":["Dias: Dynamic Rewriting of Pandas Code"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-4061-7094","authenticated-orcid":false,"given":"Stefanos","family":"Baziotis","sequence":"first","affiliation":[{"name":"University of Illinois (UIUC), Urbana, IL, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9860-9938","authenticated-orcid":false,"given":"Daniel","family":"Kang","sequence":"additional","affiliation":[{"name":"University of Illinois (UIUC), Urbana, IL, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8140-2321","authenticated-orcid":false,"given":"Charith","family":"Mendis","sequence":"additional","affiliation":[{"name":"University of Illinois (UIUC), Urbana, IL, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,3,26]]},"reference":[{"key":"e_1_2_2_1_1","unstructured":"AIEducation. 2022. What course are you going to take? https:\/\/www.kaggle.com\/code\/aieducation\/what-course-are-you-going-to-take\/. Accessed: 2022--12-09."},{"key":"e_1_2_2_2_1","unstructured":"Python ast module. 2022. https:\/\/docs.python.org\/3\/library\/ast.html. Accessed: 2022--12-09."},{"key":"e_1_2_2_3_1","volume-title":"Constant","author":"Python","year":"2022","unstructured":"Python ast module: Constant. 2022. https:\/\/docs.python.org\/3\/library\/ast.html#ast.Constant. Accessed: 2022--12-09."},{"key":"e_1_2_2_4_1","unstructured":"Ponder | Pandas at Scale. 2022. https:\/\/ponder.io\/. Accessed: 2022--12-09."},{"key":"e_1_2_2_5_1","unstructured":"Rounak Banik. 2017. Movie Recommender Systems. https:\/\/www.kaggle.com\/code\/rounakbanik\/movie-recommender-systems. Accessed: 2022--12-09."},{"key":"e_1_2_2_6_1","unstructured":"Erik Bruin. 2022. NLP on Student Writing: EDA. https:\/\/www.kaggle.com\/code\/erikbruin\/nlp-on-student-writing-eda. Accessed: 2022--12-09."},{"key":"e_1_2_2_7_1","unstructured":"Nathan Cheever. 2019. 1000x faster data manipulation: vectorizing with Pandas and Numpy. https:\/\/www.youtube.com\/watch?v=nxWginnBklU. Accessed: 2022--12-09."},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446692"},{"key":"e_1_2_2_9_1","unstructured":"Atanu Dan. 2020. Pandas DataFrame: Performance Optimization. https:\/\/medium.com\/@atanudan\/pandas-dataframe-performance-optimization-8b87db24c2c4."},{"key":"e_1_2_2_10_1","unstructured":"PySpark Documentation. 2022. https:\/\/spark.apache.org\/docs\/latest\/api\/python\/. Accessed: 2022--12-09."},{"key":"e_1_2_2_11_1","unstructured":"Pandas Documentation. 2023. Enhancing performance. https:\/\/pandas.pydata.org\/docs\/user_guide\/enhancingperf.html."},{"key":"e_1_2_2_12_1","unstructured":"Javascript V8 Engine. 2022. https:\/\/v8.dev\/. Accessed: 2022--12-09."},{"key":"e_1_2_2_13_1","volume-title":"Lightning fast DataFrame library for Rust and Python","year":"2022","unstructured":"PolaRS: Lightning fast DataFrame library for Rust and Python. 2022. https:\/\/www.pola.rs\/. Accessed: 2022--12-09."},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1542476.1542528"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2093157.2093176"},{"key":"e_1_2_2_16_1","volume-title":"PyCon","author":"Heisler Sofia","year":"2017","unstructured":"Sofia Heisler. 2017. No More Sad Pandas Optimizing Pandas Code for Speed and Efficiency, PyCon 2017. https:\/\/www.youtube.com\/watch?v=HN5d490_KKk."},{"key":"e_1_2_2_17_1","unstructured":"NYC Taxi Dataset Used in Kaggle Competition. 2017. https:\/\/www.kaggle.com\/c\/nyc-taxi-trip-duration. Accessed: 2022--12-09."},{"key":"e_1_2_2_18_1","volume-title":"Custom input transformation","year":"2022","unstructured":"IPython: Custom input transformation. 2022. https:\/\/ipython.readthedocs.io\/en\/stable\/config\/inputtransforms.html#string-based-transformations. Accessed: 2023-05--30."},{"key":"e_1_2_2_19_1","unstructured":"Python Specializing Adaptive Interpreter. 2021. https:\/\/peps.python.org\/pep-0659\/. Accessed: 2022--12-09."},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359630"},{"key":"e_1_2_2_21_1","volume-title":"Andreas Mueller, et al.","author":"Jindal Alekh","year":"2021","unstructured":"Alekh Jindal, K Venkatesh Emani, Maureen Daum, Olga Poppe, Brandon Haynes, Anna Pavlenko, Ayushi Gupta, Karthik Ramachandra, Carlo Curino, Andreas Mueller, et al. 2021. Magpie: Python at Speed and Scale using Cloud Backends.. In CIDR."},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2004.1281665"},{"key":"e_1_2_2_23_1","volume-title":"River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko.","author":"Lattner Chris","year":"2021","unstructured":"Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Arnaud Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In CGO 2021."},{"key":"e_1_2_2_24_1","volume-title":"Parameswaran","author":"Lin Lee Doris Jung","year":"2021","unstructured":"Doris Jung Lin Lee, Dixin Tang, Kunal Agarwal, Thyne Boonmark, Caitlyn Chen, Jake Kang, Ujjaini Mukhopadhyay, Jerry Song, Micah Yong, Marti A. Hearst, and Aditya G. Parameswaran. 2021. Lux: Always-on Visualization Recommendations for Exploratory Data Science. CoRR, Vol. abs\/2105.00121 (2021). showeprint[arXiv]2105.00121 https:\/\/arxiv.org\/abs\/2105.00121"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3473579"},{"key":"e_1_2_2_26_1","unstructured":"LLVM. 2022a. InstCombine. https:\/\/llvm.org\/doxygen\/InstructionCombining_8cpp_source.html. Accessed: 2022--12-09."},{"key":"e_1_2_2_27_1","unstructured":"LLVM. 2022b. VectorCombine. https:\/\/llvm.org\/doxygen\/VectorCombine_8cpp_source.html. Accessed: 2022--12-09."},{"key":"e_1_2_2_28_1","volume-title":"Provably Correct Peephole Optimizations with Alive. In PLDI'15","author":"Lopes Nuno","year":"2015","unstructured":"Nuno Lopes, David Menendez, Santosh Nagarakatte, and John Regehr. 2015. Provably Correct Peephole Optimizations with Alive. In PLDI'15, Portland, OR, USA. ACM. https:\/\/www.microsoft.com\/en-us\/research\/publication\/provably-correct-peephole-optimizations-alive\/"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3453483.3454030"},{"key":"e_1_2_2_30_1","volume-title":"_map_infer_mask()","author":"Pandas","year":"2022","unstructured":"Pandas 1.5.1: _map_infer_mask(). 2022. https:\/\/github.com\/pandas-dev\/pandas\/blob\/91111fd99898d9dcaa6bf6bedb662db4108da6e6\/pandas\/_libs\/lib.pyx#L2863. Accessed: 2022--12-09."},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.25080\/Majora-7b98e3ed-013"},{"key":"e_1_2_2_32_1","unstructured":"Fahad Mehfooz. 2021. ClubHouse EDA. https:\/\/www.kaggle.com\/code\/fahadmehfoooz\/clubhouse-eda. Accessed: 2022--12-09."},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3428234"},{"key":"e_1_2_2_34_1","unstructured":"Jupyter Notebooks. 2022. https:\/\/jupyter-notebook.readthedocs.io\/en\/latest\/notebook.html. Accessed: 2022--12-09."},{"key":"e_1_2_2_35_1","volume-title":"pandas API on Apache Spark","author":"Koalas","year":"2022","unstructured":"Koalas: pandas API on Apache Spark. 2022. https:\/\/koalas.readthedocs.io\/en\/latest\/. Accessed: 2022--12-09."},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/3494124.3494152"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSR52588.2021.00072"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462176"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3173606"},{"key":"e_1_2_2_40_1","unstructured":"Python for Social Scientists San Diego State University Linguistics\/BDA 572. 2022. https:\/\/gawron.sdsu.edu\/python_for_ss. Accessed: 2022--12-09."},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.1711.04422"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3475726.3475729"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476281"},{"key":"e_1_2_2_44_1","unstructured":"Sunny Solanki. 2021. How to Speed up Code involving Pandas DataFrame using Numba? https:\/\/coderzcolumn.com\/tutorials\/python\/guide-to-speed-up-code-involving-pandas-dataframe-using-numba."},{"key":"e_1_2_2_45_1","volume-title":"Taxi and Limousine Commission","author":"New York","year":"2015","unstructured":"New York (N.Y.). Taxi and Limousine Commission. 2015. TLC Trip Record Data. https:\/\/dask-data.s3.amazonaws.com\/nyc-taxi\/2015\/yellow_tripdata_2015-01.csv. Accessed: 2022--12-09."},{"key":"e_1_2_2_46_1","unstructured":"TensorFlow. 2023. TensorFlow graph optimization with Grappler. https:\/\/www.tensorflow.org\/guide\/graph_optimization."},{"key":"e_1_2_2_47_1","unstructured":"Eyal Trabelsi. 2021. Practical Optimisation for Pandas. https:\/\/www.youtube.com\/watch?v=zdubYLjXHb0."},{"key":"e_1_2_2_48_1","unstructured":"Prakritidev Verma. 2017. Notebook673580193d. https:\/\/www.kaggle.com\/code\/prakritidevverma\/notebook673580193d. Accessed: 2022--12-09."},{"key":"e_1_2_2_49_1","volume-title":"PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections. In 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2021","author":"Wang Haojie","year":"2021","unstructured":"Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, and Zhihao Jia. 2021. PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections. In 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2021, July 14--16, 2021, Angela Demke Brown and Jay R. Lorch (Eds.). USENIX Association, 37--54. https:\/\/www.usenix.org\/conference\/osdi21\/presentation\/wang"},{"key":"e_1_2_2_50_1","unstructured":"IPython Website. 2022. https:\/\/ipython.org\/. Accessed: 2022--12-09."},{"key":"e_1_2_2_51_1","unstructured":"Solving Real-World Business Questions with Python Pandas. 2020. https:\/\/medium.com\/li-ting-liao-tiffany\/solving-real-world-business-questions-with-pandas-70ef8ef02675. Accessed: 2022--12-09."},{"key":"e_1_2_2_52_1","volume-title":"Parameswaran","author":"Xin Doris","year":"2021","unstructured":"Doris Xin, Devin Petersohn, Dixin Tang, Yifan Wu, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, and Aditya G. Parameswaran. 2021. Enhancing the Interactivity of Dataframe Queries by Leveraging Think Time. CoRR, Vol. abs\/2103.02145 (2021). showeprint[arXiv]2103.02145 https:\/\/arxiv.org\/abs\/2103.02145"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.4230\/LIPIcs.ECOOP.2021.15"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3639313","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3639313","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T15:14:40Z","timestamp":1755789280000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3639313"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,12]]},"references-count":53,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,3,12]]}},"alternative-id":["10.1145\/3639313"],"URL":"https:\/\/doi.org\/10.1145\/3639313","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,12]]}}}