{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T16:51:50Z","timestamp":1771951910136,"version":"3.50.1"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2021,11]]},"abstract":"<jats:p>Dataframes have become universally popular as a means to represent data in various stages of structure, and manipulate it using a rich set of operators---thereby becoming an essential tool in the data scientists' toolbox. However, dataframe systems, such as pandas, scale poorly---and are non-interactive on moderate to large datasets. We discuss our experiences developing Modin, our first cut at a parallel dataframe system, which already has users across several industries and over 1M downloads. Modin translates pandas functions into a core set of operators that are individually parallelized via columnar, row-wise, or cell-wise decomposition rules that we formalize in this paper. We also introduce metadata independence to allow metadata---such as order and type---to be decoupled from the physical representation and maintained lazily. 
Using rule-based decomposition and metadata independence, along with careful engineering, Modin is able to support pandas operations across both rows and columns on very large dataframes---unlike Koalas and Dask DataFrames that either break down or are unable to support such operations, while also being much faster than pandas.<\/jats:p>","DOI":"10.14778\/3494124.3494152","type":"journal-article","created":{"date-parts":[[2022,2,5]],"date-time":"2022-02-05T00:31:46Z","timestamp":1644021106000},"page":"739-751","source":"Crossref","is-referenced-by-count":12,"title":["Flexible rule-based decomposition and metadata independence in modin"],"prefix":"10.14778","volume":"15","author":[{"given":"Devin","family":"Petersohn","sequence":"first","affiliation":[{"name":"UC Berkeley"}]},{"given":"Dixin","family":"Tang","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]},{"given":"Rehan","family":"Durrani","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]},{"given":"Areg","family":"Melik-Adamyan","sequence":"additional","affiliation":[{"name":"Intel"}]},{"given":"Joseph E.","family":"Gonzalez","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]},{"given":"Anthony D.","family":"Joseph","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]},{"given":"Aditya G.","family":"Parameswaran","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]}],"member":"320","published-online":{"date-parts":[[2022,2,4]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"https:\/\/qz.com\/1126615\/the-story-of-the-most-important-tool-in-data-science\/","author":"Meet","year":"2017","unstructured":"Meet the man behind the most important tool in data science. 
https:\/\/qz.com\/1126615\/the-story-of-the-most-important-tool-in-data-science\/, 2017."},{"key":"e_1_2_1_2_1","volume-title":"https:\/\/wesmckinney.com\/blog\/apache-arrowpandas-internals\/","author":"Hate About Pandas Ten Things I","year":"2017","unstructured":"Ten Things I Hate About Pandas. https:\/\/wesmckinney.com\/blog\/apache-arrowpandas-internals\/, 2017. Date accessed: 2019-12-27."},{"key":"e_1_2_1_3_1","unstructured":"What's the future of the pandas library? https:\/\/www.dataschool.io\/future-of-pandas 2018."},{"key":"e_1_2_1_4_1","volume-title":"pandas api on apache spark. https:\/\/koalas.readthedocs.io\/en\/latest\/","author":"Koalas","year":"2019","unstructured":"Koalas: pandas api on apache spark. https:\/\/koalas.readthedocs.io\/en\/latest\/, 2019."},{"key":"e_1_2_1_5_1","volume-title":"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/index.html","author":"Pandas API","year":"2019","unstructured":"Pandas API reference. https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/index.html, 2019. Date accessed: 2019-12-27."},{"key":"e_1_2_1_6_1","volume-title":"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/user_guide\/scale.html","author":"Large Datasets Scaling","year":"2019","unstructured":"Scaling to Large Datasets, Pandas Documentation. https:\/\/pandas.pydata.org\/pandas-docs\/stable\/user_guide\/scale.html, 2019. 
Date accessed: 2019-12-27."},{"key":"e_1_2_1_7_1","volume-title":"R packages for data science. https:\/\/www.tidyverse.org\/","author":"Tidyverse","year":"2019","unstructured":"Tidyverse: R packages for data science. https:\/\/www.tidyverse.org\/, 2019. Date accessed: 2019-12-27."},{"key":"e_1_2_1_8_1","volume-title":"Out-of-core dataframes for python. https:\/\/github.com\/vaexio\/vaex","author":"Vaex","year":"2019","unstructured":"Vaex: Out-of-core dataframes for python. https:\/\/github.com\/vaexio\/vaex, 2019. Date accessed: 2019-12-27."},{"key":"e_1_2_1_9_1","volume-title":"https:\/\/www.kaggle.com\/rsrinivasaraghavan\/lending-club-risk-analysis-and-metrics","author":"Kaggle","year":"2020","unstructured":"Kaggle notebook. https:\/\/www.kaggle.com\/rsrinivasaraghavan\/lending-club-risk-analysis-and-metrics, 2020. Date accessed: 2020-4-12."},{"key":"e_1_2_1_10_1","volume-title":"https:\/\/www.kaggle.com\/ethon0426\/lending-club-20072020q1","author":"Lending","year":"2020","unstructured":"Lending club data. https:\/\/www.kaggle.com\/ethon0426\/lending-club-20072020q1, 2020. Date accessed: 2020-4-12."},{"key":"e_1_2_1_11_1","volume-title":"https:\/\/docs.dask.org\/en\/latest\/dataframe-api.html","author":"Dask","year":"2021","unstructured":"Dask dataframe api reference. https:\/\/docs.dask.org\/en\/latest\/dataframe-api.html, 2021."},{"key":"e_1_2_1_12_1","unstructured":"
Ibis documentation 2021."},{"key":"e_1_2_1_13_1","volume-title":"https:\/\/kaggle.com","year":"2021","unstructured":"Kaggle. https:\/\/kaggle.com, 2021. Date accessed: 2021-6-25."},{"key":"e_1_2_1_14_1","volume-title":"https:\/\/spark.apache.org\/docs\/1.6.3\/api\/java\/org\/apache\/spark\/sql\/DataFrame.html","author":"Spark","year":"2021","unstructured":"Spark dataframe api reference. https:\/\/spark.apache.org\/docs\/1.6.3\/api\/java\/org\/apache\/spark\/sql\/DataFrame.html, 2021."},{"key":"e_1_2_1_15_1","unstructured":"The stanford open policing project. https:\/\/openpolicing.stanford.edu\/ 2021. Date accessed: 2021-7-6."},{"key":"e_1_2_1_16_1","unstructured":"Teradata | data analytics for a hybrid multi-cloud world 2021."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687731"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415545"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2742797"},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the 7th biennial conference on innovative data systems research","author":"Bittorf M.","year":"2015","unstructured":"M. Bittorf, T. Bobrovytsky, C. Erickson, M. G. D. Hecht, M. Kuff, D. K. A. Leblang, N. Robinson, D. R. S. Rus, J. Wanderman, and M. M. Yoder. 
Impala: A modern, open-source sql engine for hadoop. In Proceedings of the 7th biennial conference on innovative data systems research, 2015."},{"key":"e_1_2_1_21_1","first-page":"225","volume-title":"Second Biennial Conference on Innovative Data Systems Research, CIDR 2005, Asilomar, CA, USA, January 4--7, 2005, Online Proceedings","author":"Boncz P. A.","year":"2005","unstructured":"P. A. Boncz, M. Zukowski, and N. Nes. Monetdb\/x100: Hyper-pipelining query execution. In Second Biennial Conference on Innovative Data Systems Research, CIDR 2005, Asilomar, CA, USA, January 4--7, 2005, Online Proceedings, pages 225--237. www.cidrdb.org, 2005."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807271"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.780863"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1365815.1365816"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/2544030"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2903741"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/3236187.3236194"},{"key":"e_1_2_1_28_1","doi-asserted-by":"crossref","DOI":"10.56021\/9781421407944","volume-title":"Matrix computations","author":"Golub G. H.","year":"2013","unstructured":"G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. 
JHU press, 2013."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/69.273032"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2742795"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/222610.222615"},{"key":"e_1_2_1_32_1","unstructured":"A. Jindal K. V. Emani M. Daum O. Poppe B. Haynes A. Pavlenko A. Gupta K. Ramachandra C. Curino A. Mueller et al. Magpie: Python at speed and scale using cloud backends."},{"key":"e_1_2_1_33_1","first-page":"195","volume-title":"Technologie und Web (BTW 2021) 13.-17.","author":"Kl\u00e4be S.","year":"2021","unstructured":"S. Kl\u00e4be and S. Hagedorn. When bears get machine support: Applying machine learning models to scalable dataframes with grizzly. Datenbanksysteme f\u00fcr Business, Technologie und Web (BTW 2021) 13.-17. September 2021 in Dresden, Deutschland, page 195."},{"key":"e_1_2_1_34_1","first-page":"2021","article-title":"Applying machine learning models to scalable dataframes with grizzly","author":"Kl\u00e4be S.","year":"2021","unstructured":"S. Kl\u00e4be and S. Hagedorn. Applying machine learning models to scalable dataframes with grizzly. 
BTW 2021, 2021.","journal-title":"BTW"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920886"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415568"},{"key":"e_1_2_1_37_1","volume-title":"Ray: A distributed framework for emerging AI applications. CoRR, abs\/1712.05889","author":"Moritz P.","year":"2017","unstructured":"P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, W. Paul, M. I. Jordan, and I. Stoica. Ray: A distributed framework for emerging AI applications. CoRR, abs\/1712.05889, 2017."},{"key":"e_1_2_1_38_1","volume-title":"Taxi And Limousine Commission. New york city taxi trip data","author":"New York","year":"2009","unstructured":"New York (N.Y.). Taxi And Limousine Commission. New york city taxi trip data, 2009--2018, 2019."},{"key":"e_1_2_1_39_1","volume-title":"Low-rank plus sparse matrix decomposition for accelerated dynamic mri with separation of background and dynamic components. Magnetic resonance in medicine, 73(3):1125--1136","author":"Otazo R.","year":"2015","unstructured":"R. Otazo, E. Candes, and D. K. Sodickson. Low-rank plus sparse matrix decomposition for accelerated dynamic mri with separation of background and dynamic components. 
Magnetic resonance in medicine, 73(3):1125--1136, 2015."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/234313.234368"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3025111.3025117"},{"key":"e_1_2_1_42_1","volume-title":"Towards scalable dataframe systems. arXiv preprint arXiv:2001.00888","author":"Petersohn D.","year":"2020","unstructured":"D. Petersohn, W. Ma, D. Lee, S. Macke, D. Xin, X. Mo, J. E. Gonzalez, J. M. Hellerstein, A. D. Joseph, and A. Parameswaran. Towards scalable dataframe systems. arXiv preprint arXiv:2001.00888, 2020."},{"key":"e_1_2_1_43_1","volume-title":"R: A Language and Environment for Statistical Computing","author":"Team R Core","year":"2017","unstructured":"R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2017."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.25080\/Majora-7b98e3ed-013"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/BigData47090.2019.9006303"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989351"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3226595.3226638"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2013.19"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCSNT.2011.6182030"},{"key":"e_1_2_1_50_1","first-page":"4","article-title":"Tidy data","volume":"59","author":"Wickham H.","unstructured":"H. Wickham. Tidy data. 
The Journal of Statistical Software, 59, 2014.","journal-title":"The Journal of Statistical Software"},{"key":"e_1_2_1_51_1","volume-title":"Bulletin of the Technical Committee on Data Engineering","author":"Xin D.","year":"2021","unstructured":"D. Xin, D. Petersohn, D. Tang, Y. Wu, J. E. Gonzalez, J. M. Hellerstein, A. D. Joseph, and A. G. Parameswaran. Enhancing the interactivity of dataframe queries by leveraging think time. In Bulletin of the Technical Committee on Data Engineering, volume 4. IEEE, 2021."},{"key":"e_1_2_1_52_1","volume-title":"Riot: I\/o-efficient numerical computing without sql. arXiv preprint arXiv:0909.1766","author":"Zhang Y.","year":"2009","unstructured":"Y. Zhang, H. Herodotou, and J. Yang. Riot: I\/o-efficient numerical computing without sql. 
arXiv preprint arXiv:0909.1766, 2009."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.5555\/3104482.3104487"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3494124.3494152","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T11:32:46Z","timestamp":1672227166000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3494124.3494152"}},"subtitle":["a parallel dataframe system"],"short-title":[],"issued":{"date-parts":[[2021,11]]},"references-count":53,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,11]]}},"alternative-id":["10.14778\/3494124.3494152"],"URL":"https:\/\/doi.org\/10.14778\/3494124.3494152","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2021,11]]}}}