{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,8]],"date-time":"2026-05-08T22:37:54Z","timestamp":1778279874279,"version":"3.51.4"},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"9","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,5]]},"abstract":"<jats:p>Data science pipelines consist of data preprocessing and transformation, and a typical pipeline comprises a series of operators, such as DataFrame filtering and groupby. As practitioners seek tools to handle larger-scale data while maintaining APIs compatible with popular single-machine libraries (e.g., pandas), scaling such a pipeline requires efficient distribution of decomposed tasks across the cluster and fine-grained, key-level intermediate storage management, two challenges that existing systems have not effectively addressed. Motivated by the requirements of scaling diverse data science applications, we present the design and implementation of Xorbits, a native scalable data science engine built on our decentralized actor model, Xoscar. Our actor model can eliminate dependency on a global scheduler and enable fast actor task scheduling. We also provide reference-based distributed storage with unified access across heterogeneous memory resources. Our evaluation demonstrates that Xorbits achieves up to 3.22X speedup on 3 machine learning pipelines and 22 data analysis workloads compared to state-of-the-art solutions. 
Xorbits is available on PyPI with nearly 1k daily downloads and has been successfully deployed in production environments.<\/jats:p>","DOI":"10.14778\/3746405.3746420","type":"journal-article","created":{"date-parts":[[2025,9,3]],"date-time":"2025-09-03T17:06:20Z","timestamp":1756919180000},"page":"2955-2963","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Decentralized Actor Scheduling and Reference-Based Storage in Xorbits: A Native Scalable Data Science Engine"],"prefix":"10.14778","volume":"18","author":[{"given":"Weizheng","family":"Lu","sequence":"first","affiliation":[{"name":"Renmin University of China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chao","family":"Hui","sequence":"additional","affiliation":[{"name":"Shandong University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yunhai","family":"Wang","sequence":"additional","affiliation":[{"name":"Renmin University of China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Feng","family":"Zhang","sequence":"additional","affiliation":[{"name":"Renmin University of China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yueguo","family":"Chen","sequence":"additional","affiliation":[{"name":"Renmin University of China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bao","family":"Liu","sequence":"additional","affiliation":[{"name":"Xorbits Inc."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chengjie","family":"Li","sequence":"additional","affiliation":[{"name":"Xorbits Inc."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhaoxin","family":"Wu","sequence":"additional","affiliation":[{"name":"Xorbits Inc."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xuye","family":"Qin","sequence":"additional","affiliation":[{"name":"Xorbits 
Inc."}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,9,3]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2014. 2014 Yellow Taxi Trip Data. https:\/\/catalog.data.gov\/dataset\/2014-yellow-taxi-trip-data Accessed: 2024-11-22."},{"key":"e_1_2_1_2_1","unstructured":"2022. PySpark vs Scala Spark vs Spark SQL - Which one is performance efficient? Are UDFs still bad? https:\/\/community.databricks.com\/t5\/data-engineering\/pyspark-udf-is-taking-long-to-process\/td-p\/7794 Accessed: 2025-03-15."},{"key":"e_1_2_1_3_1","unstructured":"2024. Actors. https:\/\/distributed.dask.org\/en\/stable\/actors.html Accessed: 2024-10-22."},{"key":"e_1_2_1_4_1","unstructured":"2024. Flight Status Prediction. https:\/\/www.kaggle.com\/datasets\/robikscube\/flight-delay-dataset-20182022\/data Accessed: 2024-11-22."},{"key":"e_1_2_1_5_1","unstructured":"2024. mmap \u2014 Memory-mapped file support. https:\/\/docs.python.org\/3.12\/library\/mmap.html Accessed: 2024-10-26."},{"key":"e_1_2_1_6_1","unstructured":"2024. modin with ray engine hang. https:\/\/github.com\/modin-project\/modin\/issues\/7349 Accessed: 2025-01-20."},{"key":"e_1_2_1_7_1","unstructured":"2024. RAPIDS Accelerator For Apache Spark. https:\/\/github.com\/NVIDIA\/spark-rapids Accessed: 2024-11-22."},{"key":"e_1_2_1_8_1","unstructured":"2024. Ray v2 Architecture. https:\/\/docs.google.com\/document\/d\/1tBw9A4j62ruI5omIJbMxly-la5w4q_TjyJgJL_jN2fI Accessed: 2024-11-22."},{"key":"e_1_2_1_9_1","unstructured":"2024. RMM: RAPIDS Memory Manager. https:\/\/github.com\/rapidsai\/rmm Accessed: 2024-10-28."},{"key":"e_1_2_1_10_1","unstructured":"2024. suggestions on handling out of memory matrix operation on large dataset. 
https:\/\/github.com\/modin-project\/modin\/issues\/6677 Accessed: 2024-11-02."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/1086.001.0001"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3307681.3325400"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2012.71"},{"key":"e_1_2_1_14_1","volume-title":"Parallel Scientific Computation: A Structured Approach Using BSP and MPI","author":"Bisseling Rob H.","unstructured":"Rob H. Bisseling. 2004. Parallel Scientific Computation: A Structured Approach Using BSP and MPI. Oxford University Press, Inc."},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Peter Boncz, Thomas Neumann, and Orri Erling. 2013. TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark. In Performance Characterization and Benchmarking. 61\u201376.","DOI":"10.1007\/978-3-319-04936-6_5"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611540.3611554"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/PAW-ATM56565.2022.00009"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/2777598.2777604"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.5555\/1624775.1624804"},{"key":"e_1_2_1_20_1","first-page":"103","article-title":"GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism","volume":"32","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. In Advances in Neural Information Processing Systems, Vol. 32. 
103\u2013112.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_21_1","volume-title":"11th Conference on Innovative Data Systems Research.","author":"Jindal Alekh","year":"2021","unstructured":"Alekh Jindal, K. Venkatesh Emani, Maureen Daum, Olga Poppe, Brandon Haynes, Anna Pavlenko, Ayushi Gupta, Karthik Ramachandra, Carlo Curino, Andreas M\u00fcller, Wentao Wu, and Hiren Patel. 2021. Magpie: Python at Speed and Scale Using Cloud Backends. In 11th Conference on Innovative Data Systems Research."},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the Eighth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications. 91\u2013108","author":"Laxmikant","unstructured":"Laxmikant V. Kale and Sanjeev Krishnan. 1993. CHARM++: A Portable Concurrent Object Oriented System Based on C++. In Proceedings of the Eighth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications. 91\u2013108."},{"key":"e_1_2_1_23_1","unstructured":"Lakshay Goel. 2023. PySpark UDF is taking long to process. https:\/\/community.databricks.com\/t5\/data-engineering\/pyspark-udf-is-taking-long-to-process\/td-p\/7794 Accessed: 2025-03-15."},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the 35th International Conference on Machine Learning","volume":"80","author":"Liang Eric","year":"2018","unstructured":"Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. 2018. RLlib: Abstractions for Distributed Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80. 3053\u20133062."},{"key":"e_1_2_1_25_1","volume-title":"Xorbits: Automating Operator Tiling for Distributed Data Science. In 2024 IEEE 40th International Conference on Data Engineering. 
5211\u20135223","author":"Lu Weizheng","year":"2024","unstructured":"Weizheng Lu, Kaisheng He, Xuye Qin, Chengjie Li, Zhong Wang, Tao Yuan, Xia Liao, Feng Zhang, Yueguo Chen, and Xiaoyong Du. 2024. Xorbits: Automating Operator Tiling for Distributed Data Science. In 2024 IEEE 40th International Conference on Data Engineering. 5211\u20135223."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11704-024-40763-6"},{"key":"e_1_2_1_27_1","volume-title":"Pandas: A Foundational Python Library for Data Analysis and Statistics. Python for high performance and scientific computing 14, 9 (2011) 1\u20139.","author":"McKinney Wes","year":"2011","unstructured":"Wes McKinney. 2011. Pandas: A Foundational Python Library for Data Analysis and Statistics. Python for high performance and scientific computing 14, 9 (2011) 1\u20139."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503284"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3605573.3605642"},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation. 561\u2013577","author":"Moritz Philipp","year":"2018","unstructured":"Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI Applications. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation. 561\u2013577."},{"key":"e_1_2_1_31_1","volume-title":"Proceedings 28th International Conference on Extending Database Technology. 337\u2013349","author":"Mozzillo Angelo","year":"2025","unstructured":"Angelo Mozzillo, Luca Zecchini, Luca Gagliardelli, Adeel Aslam, Sonia Bergamaschi, and Giovanni Simonini. 2025. Evaluation of Dataframe Libraries for Data Preparation on a Single Machine. 
In Proceedings 28th International Conference on Extending Database Technology. 337\u2013349."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.3389\/fhpcp.2024.1384619"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407807"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.14778\/3494124.3494152"},{"key":"e_1_2_1_35_1","volume-title":"Dask: Parallel Computation with Blocked Algorithms and Task Scheduling. In Python in Science Conference. 126\u2013132","author":"Rocklin Matthew","year":"2015","unstructured":"Matthew Rocklin. 2015. Dask: Parallel Computation with Blocked Algorithms and Task Scheduling. In Python in Science Conference. 126\u2013132."},{"key":"e_1_2_1_36_1","volume-title":"UCX: An Open Source Framework for HPC Network APIs and Beyond. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects. 40\u201343","author":"Shamis Pavel","year":"2015","unstructured":"Pavel Shamis, Manjunath Gorentla Venkata, M. Graham Lopez, Matthew B. Baker, Oscar Hernandez, Yossi Itigin, Mike Dubman, Gilad Shainer, Richard L. Graham, Liran Liss, Yiftah Shahar, Sreeram Potluri, Davide Rossetti, Donald Becker, Duncan Poole, Christopher Lamb, Sameer Kumar, Craig Stunkel, George Bosilca, and Aurelien Bouteiller. 2015. UCX: An Open Source Framework for HPC Network APIs and Beyond. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects. 40\u201343."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476281"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/79173.79181"},{"key":"e_1_2_1_39_1","first-page":"56","article-title":"MPI: A Standard Message Passing Interface","volume":"12","author":"Walker David W","year":"1996","unstructured":"David W Walker and Jack J Dongarra. 1996. MPI: A Standard Message Passing Interface. 
Supercomputer 12 (1996), 56\u201368.","journal-title":"Supercomputer"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3617"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465288"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/3685800.3685818"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389738"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2934664"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCC.2025.3559346"},{"key":"e_1_2_1_46_1","volume-title":"Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning. In 2019 IEEE International Conference on Data Mining. 1504\u20131509","author":"Zhao Xing","year":"2019","unstructured":"Xing Zhao, Manos Papagelis, Aijun An, Bao Xin Chen, Junfeng Liu, and Yonggang Hu. 2019. Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning. In 2019 IEEE International Conference on Data Mining. 
1504\u20131509."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3746405.3746420","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T19:50:32Z","timestamp":1757015432000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3746405.3746420"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5]]},"references-count":46,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,5]]}},"alternative-id":["10.14778\/3746405.3746420"],"URL":"https:\/\/doi.org\/10.14778\/3746405.3746420","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2025,5]]},"assertion":[{"value":"2025-09-03","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}