{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,20]],"date-time":"2025-09-20T18:45:35Z","timestamp":1758393935527,"version":"3.41.0"},"reference-count":68,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,6,13]],"date-time":"2023-06-13T00:00:00Z","timestamp":1686614400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2023,6,13]]},"abstract":"<jats:p>Modern data analytics and AI jobs become increasingly complex and involve multiple tasks performed on specialized systems. Sharing of intermediate data between different systems is often a significant bottleneck in such jobs. When the intermediate data is large, it is mostly exchanged through files in standard formats (e.g., CSV and ORC), causing high I\/O and (de)serialization overheads. To solve these problems, we develop Vineyard, a high-performance, extensible, and cloud-native object store, trying to provide an intuitive experience for users to share data across systems in complex real-life workflows. Since different systems usually work on data structures (e.g., dataframes, graphs, hashmaps) with similar interfaces, and their computation logic is often loosely-coupled with how such interfaces are implemented over specific memory layouts, it enables Vineyard to conduct data sharing efficiently at a high level via memory mapping and method sharing. Vineyard provides an IDL named VCDL to facilitate users to register their own intermediate data types into Vineyard such that objects of the registered types can then be efficiently shared across systems in a polyglot workflow. As a cloud-native system, Vineyard is designed to work closely with Kubernetes, as well as achieve fault-tolerance and high performance in production environments. Evaluations on real-life datasets and data analytics jobs show that the above optimizations of Vineyard can significantly improve the end-to-end performance of data analytics jobs, by reducing their data-sharing time up to 68.4x.<\/jats:p>","DOI":"10.1145\/3589780","type":"journal-article","created":{"date-parts":[[2023,6,20]],"date-time":"2023-06-20T20:26:45Z","timestamp":1687292805000},"page":"1-27","source":"Crossref","is-referenced-by-count":5,"title":["Vineyard: Optimizing Data Sharing in Data-Intensive Analytics"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-5641-2452","authenticated-orcid":false,"given":"Wenyuan","family":"Yu","sequence":"first","affiliation":[{"name":"Alibaba Group, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-7687-7342","authenticated-orcid":false,"given":"Tao","family":"He","sequence":"additional","affiliation":[{"name":"Alibaba Group, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-7535-452X","authenticated-orcid":false,"given":"Lei","family":"Wang","sequence":"additional","affiliation":[{"name":"Alibaba Group, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3629-7892","authenticated-orcid":false,"given":"Ke","family":"Meng","sequence":"additional","affiliation":[{"name":"Alibaba Group, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-7273-0575","authenticated-orcid":false,"given":"Ye","family":"Cao","sequence":"additional","affiliation":[{"name":"Alibaba Group, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-7175-0784","authenticated-orcid":false,"given":"Diwen","family":"Zhu","sequence":"additional","affiliation":[{"name":"Alibaba Group, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-2869-7944","authenticated-orcid":false,"given":"Sanhong","family":"Li","sequence":"additional","affiliation":[{"name":"Alibaba Group, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6851-1366","authenticated-orcid":false,"given":"Jingren","family":"Zhou","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2023,6,20]]},"reference":[{"key":"e_1_2_2_1_1","unstructured":"2019. Google Analytics Customer Revenue Prediction. https:\/\/www.kaggle.com\/c\/ga-customer-revenue-prediction."},{"key":"e_1_2_2_2_1","unstructured":"2023. ioctl(2) - Linux manual page. https:\/\/man7.org\/linux\/man-pages\/man2\/ioctl.2.html."},{"key":"e_1_2_2_3_1","unstructured":"2023. LD_PRELOAD - Linux manual page. https:\/\/man7.org\/linux\/man-pages\/man8\/ld.so.8.html."},{"key":"e_1_2_2_4_1","unstructured":"2023. Data-intensive computing. https:\/\/en.wikipedia.org\/wiki\/Data-intensive_computing."},{"key":"e_1_2_2_5_1","unstructured":"2023. Kubernets Scheduling Framework. https:\/\/kubernetes.io\/docs\/concepts\/scheduling-eviction\/scheduling-framework."},{"key":"e_1_2_2_6_1","unstructured":"2023. Node Property Prediction. https:\/\/ogb.stanford.edu\/docs\/nodeprop\/."},{"key":"e_1_2_2_7_1","unstructured":"2023. Production-Grade Container Orchestration. https:\/\/kubernetes.io."},{"key":"e_1_2_2_8_1","unstructured":"Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Ian Goodfellow Andrew Harp Geoffrey Irving Michael Isard Yangqing Jia Rafal Jozefowicz Lukasz Kaiser Manjunath Kudlur Josh Levenberg Dandelion Man\u00e9 Rajat Monga Sherry Moore Derek Murray Chris Olah Mike Schuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul Tucker Vincent Vanhoucke Vijay Vasudevan Fernanda Vi\u00e9gas Oriol Vinyals Pete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https:\/\/www.tensorflow.org\/ Software available from tensorflow.org."},{"key":"e_1_2_2_9_1","unstructured":"Sajid Alam Nok Lam Chan Gabriel Comym Yetunde Dada Ivan Danov Deepyaman Datta Tynan DeBold Jannic Holzer Rashida Kanchwala Ankita Katiyar Amanda Koh Andrew Mackay Ahdra Merali Antony Milne Huong Nguyen Nero Okwa Juan Luis Cano Rodr\u00edguez Joel Schwarzmann Jo Stichbury and Merel Theisen. 2023. Kedro. https:\/\/github.com\/kedro-org\/kedro"},{"key":"e_1_2_2_10_1","volume-title":"Amazon Web Service","author":"Inc.","year":"2022","unstructured":"Inc. Amazon Web Service. 2022. Amazon Simple Storage Service: Object Storage built to retrieve any amount of data from anywhere. https:\/\/aws.amazon.com\/s3\/."},{"key":"e_1_2_2_11_1","volume-title":"9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)","author":"Ananthanarayanan Ganesh","year":"2012","unstructured":"Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Warfield, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, and Ion Stoica. 2012. Pacman: Coordinated memory caching for parallel jobs. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). 267--280."},{"key":"e_1_2_2_12_1","unstructured":"Fluid Authors. 2021. Fluid: elastic data abstraction and acceleration for BigData\/AI applications in cloud. https:\/\/fluid-cloudnative.github.io."},{"key":"e_1_2_2_13_1","unstructured":"Kubernetes Authors. 2022. Kubernets Custom Resources. https:\/\/kubernetes.io\/docs\/concepts\/extend-kubernetes\/api-extension\/custom-resources."},{"key":"e_1_2_2_14_1","unstructured":"Kubernetes Authors. 2022. Kubernets Operator Pattern. https:\/\/kubernetes.io\/docs\/concepts\/extend-kubernetes\/operator\/."},{"key":"e_1_2_2_15_1","unstructured":"NumPy Authors. 2022. NumPy: The fundamental package for scientific computing with Python. https:\/\/www.numpy.org\/."},{"key":"e_1_2_2_16_1","volume-title":"Pandas: Python Data Analysis Library. https:\/\/pandas.pydata.org\/.","author":"Pandas","year":"2022","unstructured":"Pandas authors. 2022. Pandas: Python Data Analysis Library. https:\/\/pandas.pydata.org\/."},{"key":"e_1_2_2_17_1","volume-title":"Polars: Fast multi-threaded, hybrid-streaming DataFrame library. https:\/\/www.pola.rs.","author":"Authors Polars","year":"2022","unstructured":"Polars Authors. 2022. Polars: Fast multi-threaded, hybrid-streaming DataFrame library. https:\/\/www.pola.rs."},{"key":"e_1_2_2_18_1","volume-title":"SWIG: Simplified Wrapper and Interface Generator. https:\/\/github.com\/swig\/swig.","author":"Authors SWIG","year":"2019","unstructured":"SWIG Authors. 2019. SWIG: Simplified Wrapper and Interface Generator. https:\/\/github.com\/swig\/swig."},{"key":"e_1_2_2_19_1","unstructured":"Inc. ClickHouse. 2022. ClickHouse: Fast Open-Source OLAP DBMS. https:\/\/clickhouse.com\/."},{"key":"e_1_2_2_20_1","unstructured":"Dormando. 2022. memcached: a distributed memory object caching system. https:\/\/memcached.org\/."},{"key":"e_1_2_2_21_1","volume-title":"Dagster: An orchestration platform for the development, production, and observation of data assets. https:\/\/github.com\/dagster-io\/dagster.","author":"Inc. Elementl.","year":"2023","unstructured":"Inc. Elementl. 2023. Dagster: An orchestration platform for the development, production, and observation of data assets. https:\/\/github.com\/dagster-io\/dagster."},{"key":"e_1_2_2_22_1","unstructured":"etcd Authors. 2022. etcd: A distributed reliable key-value store for the most critical data of a distributed system. https:\/\/etcd.io\/."},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476369"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3282488"},{"key":"e_1_2_2_25_1","volume-title":"Scaling Large Production Clusters with Partitioned Synchronization. In 2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Feng Yihui","year":"2021","unstructured":"Yihui Feng, Zhi Liu, Yunjian Zhao, Tatiana Jin, Yidi Wu, Yang Zhang, James Cheng, Chao Li, and Tao Guan. 2021. Scaling Large Production Clusters with Partitioned Synchronization. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 81--97."},{"key":"e_1_2_2_26_1","unstructured":"Linux Foundation. 2015. Data Plane Development Kit (DPDK). http:\/\/www.dpdk.org."},{"key":"e_1_2_2_27_1","volume-title":"Apache Airflow: A platform to programmatically author, schedule, and monitor workflows. https:\/\/airflow.apache.org\/.","author":"Software Foundation The Apache","year":"2022","unstructured":"The Apache Software Foundation. 2022. Apache Airflow: A platform to programmatically author, schedule, and monitor workflows. https:\/\/airflow.apache.org\/."},{"key":"e_1_2_2_28_1","unstructured":"The Apache Software Foundation. 2022. Apache Data Fusion SQL Query Engine. https:\/\/arrow.apache.org\/datafusion\/."},{"key":"e_1_2_2_29_1","volume-title":"Apache Doris: An easy-to-use, high-performance and unified analytical database. https:\/\/doris.apache.org\/.","author":"Software Foundation The Apache","year":"2022","unstructured":"The Apache Software Foundation. 2022. Apache Doris: An easy-to-use, high-performance and unified analytical database. https:\/\/doris.apache.org\/."},{"key":"e_1_2_2_30_1","volume-title":"Apache Dremio: The Easy and Open Data Lakehouse. https:\/\/www.dremio.com\/.","author":"Software Foundation The Apache","year":"2022","unstructured":"The Apache Software Foundation. 2022. Apache Dremio: The Easy and Open Data Lakehouse. https:\/\/www.dremio.com\/."},{"key":"e_1_2_2_31_1","volume-title":"Arrow: A cross-language development platform for in-memory analytics. https:\/\/github.com\/apache\/arrow.","author":"Software Foundation The Apache","year":"2022","unstructured":"The Apache Software Foundation. 2022. Arrow: A cross-language development platform for in-memory analytics. https:\/\/github.com\/apache\/arrow."},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/945445.945450"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2741948.2741968"},{"key":"e_1_2_2_34_1","volume-title":"10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12)","author":"Gonzalez Joseph E","year":"2012","unstructured":"Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. Powergraph: Distributed graph-parallel computation on natural graphs. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 17--30."},{"key":"e_1_2_2_35_1","volume-title":"Protocol Buffers: A language-neutral, platform-neutral extensible mechanism for serializing structured data. https:\/\/developers.google.com\/protocol-buffers.","author":"Inc. Google.","year":"2022","unstructured":"Inc. Google. 2022. Protocol Buffers: A language-neutral, platform-neutral extensible mechanism for serializing structured data. https:\/\/developers.google.com\/protocol-buffers."},{"key":"e_1_2_2_36_1","volume-title":"Whiz: Data-Driven Analytics Execution. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21)","author":"Grandl Robert","year":"2021","unstructured":"Robert Grandl, Arjun Singhvi, Raajay Viswanathan, and Aditya Akella. 2021. Whiz: Data-Driven Analytics Execution. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21)."},{"key":"e_1_2_2_37_1","unstructured":"gRPC Authors. 2022. gRPC: A high performance open source universal RPC framework. https:\/\/grpc.io."},{"key":"e_1_2_2_38_1","volume-title":"Ogb-lsc: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430","author":"Hu Weihua","year":"2021","unstructured":"Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. 2021. Ogb-lsc: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430 (2021)."},{"key":"e_1_2_2_39_1","unstructured":"Inc. Juicedata. 2022. JuiceFS: A POSIX HDFS and S3 compatible distributed file system for cloud. https:\/\/juicefs.com\/en\/."},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2670979.2670985"},{"key":"e_1_2_2_41_1","unstructured":"Jingdong Li Zhao Li Jiaming Huang Ji Zhang Xiaoling Wang Xingjian Lu and Jingren Zhou. 2021. Large-scale Fake Click Detection for E-commerce Recommendation Systems. In ICDE."},{"key":"e_1_2_2_42_1","unstructured":"libclang Authors. 2022. libclang: C interface to Clang. https:\/\/clang.llvm.org\/doxygen\/group__CINDEX.html."},{"key":"e_1_2_2_43_1","unstructured":"libfuse authors. 2022. libfuse: The reference implementation of the Linux FUSE (Filesystem in Userspace) interface. https:\/\/github.com\/libfuse\/libfuse."},{"key":"e_1_2_2_44_1","volume-title":"Redis: The open source, in-memory data store. https:\/\/redis.io\/.","author":"Ltd Redis","year":"2022","unstructured":"Redis Ltd. 2022. Redis: The open source, in-memory data store. https:\/\/redis.io\/."},{"key":"e_1_2_2_45_1","unstructured":"The Alibaba Group Holding Ltd. 2022. Mars: a tensor-based unified framework for large-scale data computation. https:\/\/github.com\/mars-project\/mars."},{"key":"e_1_2_2_46_1","unstructured":"Ruotian Luo. 2017. An Image Captioning codebase in PyTorch. https:\/\/github.com\/ruotianluo\/ImageCaptioning.pytorch."},{"key":"e_1_2_2_47_1","volume-title":"15th Workshop on Hot Topics in Operating Systems (HotOS XV ).","author":"McSherry Frank","year":"2015","unstructured":"Frank McSherry, Michael Isard, and Derek G Murray. 2015. Scalability! But at what COST?. In 15th Workshop on Hot Topics in Operating Systems (HotOS XV )."},{"volume-title":"Handbook of cloud computing","author":"Middleton Anthony M","key":"e_1_2_2_48_1","unstructured":"Anthony M Middleton. 2010. Data-intensive technologies for cloud computing. In Handbook of cloud computing. Springer, 83--136."},{"key":"e_1_2_2_49_1","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Moritz Philipp","year":"2018","unstructured":"Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al . 2018. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 561--577."},{"volume-title":"Cocomo ii forum","author":"Nguyen Vu","key":"e_1_2_2_50_1","unstructured":"Vu Nguyen, Sophia Deeds-Rubin, Thomas Tan, and Barry Boehm. 2007. A SLOC counting standard. In Cocomo ii forum, Vol. 2007. Citeseer, 1--16."},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359652"},{"volume-title":"PyTorch: An Imperative Style","author":"Paszke Adam","key":"e_1_2_2_52_1","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 8024--8035. http:\/\/papers.neurips.cc\/paper\/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf"},{"key":"e_1_2_2_53_1","volume-title":"18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21)","author":"Qian Zhengping","year":"2021","unstructured":"Zhengping Qian, Chenqiang Min, Longbin Lai, Yong Fang, Gaofeng Li, Youyang Yao, Bingqing Lyu, Xiaoli Zhou, Zhimin Chen, and Jingren Zhou. 2021. GAIA: A System for Interactive Analysis on Distributed Graphs Using a High-Level Language. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21)."},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3012408.3012416"},{"key":"e_1_2_2_55_1","unstructured":"scikit-learn Authors. 2022. scikit-learn: Machine-Learning in Python. https:\/\/scikit-learn.org\/."},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00196"},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSST.2010.5496972"},{"key":"e_1_2_2_58_1","volume-title":"Thrift: Scalable cross-language services implementation. Facebook white paper 5, 8","author":"Slee Mark","year":"2007","unstructured":"Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. 2007. Thrift: Scalable cross-language services implementation. Facebook white paper 5, 8 (2007), 127."},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/1064979.1064997"},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2010.5447738"},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313411"},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/2509578.2509581"},{"key":"e_1_2_2_63_1","volume-title":"9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)","author":"Zaharia Matei","year":"2012","unstructured":"Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). 15--28."},{"key":"e_1_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/2934664"},{"key":"e_1_2_2_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/2934664"},{"key":"e_1_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352127"},{"key":"e_1_2_2_67_1","volume-title":"12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)","author":"Zhu Xiaowei","year":"2016","unstructured":"Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. 2016. Gemini: A computation-centric distributed graph processing system. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 301--316."},{"key":"e_1_2_2_68_1","doi-asserted-by":"publisher","DOI":"10.14778\/3384345.3384351"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3589780","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3589780","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:22Z","timestamp":1750182562000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3589780"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,13]]},"references-count":68,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,6,13]]}},"alternative-id":["10.1145\/3589780"],"URL":"https:\/\/doi.org\/10.1145\/3589780","relation":{},"ISSN":["2836-6573"],"issn-type":[{"type":"electronic","value":"2836-6573"}],"subject":[],"published":{"date-parts":[[2023,6,13]]}}}