{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,15]],"date-time":"2026-07-15T07:13:56Z","timestamp":1784099636429,"version":"3.55.0"},"reference-count":61,"publisher":"Association for Computing Machinery (ACM)","issue":"13","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2023,9]]},"abstract":"<jats:p>Data pipelines (i.e., converting raw data to features) are critical for machine learning (ML) models, yet their development and management is time-consuming. Feature stores have recently emerged as a new \"DBMS-for-ML\" with the premise of enabling data scientists and engineers to define and manage their data pipelines. While current feature stores fulfill their promise from a functionality perspective, they are resource-hungry---with ample opportunities for implementing database-style optimizations to enhance their performance. In this paper, we propose a novel set of optimizations specifically targeted for point-in-time join, which is a critical operation in data pipelines. We implement these optimizations on top of Feathr: a widely-used feature store, and evaluate them on use cases from both the TPCx-AI benchmark and real-world online retail scenarios. Our thorough experimental analysis shows that our optimizations can accelerate data pipelines by up to 3\u00d7 over state-of-the-art baselines.<\/jats:p>","DOI":"10.14778\/3625054.3625060","type":"journal-article","created":{"date-parts":[[2023,12,4]],"date-time":"2023-12-04T17:09:42Z","timestamp":1701709782000},"page":"4230-4239","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Optimizing Data Pipelines for Machine Learning in Feature Stores"],"prefix":"10.14778","volume":"16","author":[{"given":"Rui","family":"Liu","sequence":"first","affiliation":[{"name":"University of Chicago"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kwanghyun","family":"Park","sequence":"additional","affiliation":[{"name":"Yonsei University"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Fotis","family":"Psallidas","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xiaoyong","family":"Zhu","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jinghui","family":"Mo","sequence":"additional","affiliation":[{"name":"LinkedIn"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Rathijit","family":"Sen","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Matteo","family":"Interlandi","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Konstantinos","family":"Karanasos","sequence":"additional","affiliation":[{"name":"Meta"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yuanyuan","family":"Tian","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jes\u00fas","family":"Camacho-Rodr\u00edguez","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2023,12,4]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2019. Delta Lake. https:\/\/delta.io\/. Accessed: 2023-02-23.  2019. Delta Lake. https:\/\/delta.io\/. Accessed: 2023-02-23."},{"key":"e_1_2_1_2_1","unstructured":"2022. Amazon Redshift - Automated materialized views. https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/materialized-view-auto-mv.html. Accessed: 2022-10-02.  2022. Amazon Redshift - Automated materialized views. https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/materialized-view-auto-mv.html. Accessed: 2022-10-02."},{"key":"e_1_2_1_3_1","unstructured":"2022. Apache Spark. https:\/\/spark.apache.org\/. Accessed: 2022-10-02.  2022. Apache Spark. https:\/\/spark.apache.org\/. Accessed: 2022-10-02."},{"key":"e_1_2_1_4_1","unstructured":"2022. Apache Spark in Azure Synapse Analytics. https:\/\/learn.microsoft.com\/azure\/synapse-analytics\/spark\/apache-spark-overview. Accessed: 2022-10-02.  2022. Apache Spark in Azure Synapse Analytics. https:\/\/learn.microsoft.com\/azure\/synapse-analytics\/spark\/apache-spark-overview. Accessed: 2022-10-02."},{"key":"e_1_2_1_5_1","unstructured":"2022. Azure Blob Storage. https:\/\/azure.microsoft.com\/en-us\/products\/storage\/blobs. Accessed: 2022-10-02.  2022. Azure Blob Storage. https:\/\/azure.microsoft.com\/en-us\/products\/storage\/blobs. Accessed: 2022-10-02."},{"key":"e_1_2_1_6_1","unstructured":"2022. Azure Synapse Analytics. https:\/\/azure.microsoft.com\/en-us\/products\/synapse-analytics. Accessed: 2022-10-02.  2022. Azure Synapse Analytics. https:\/\/azure.microsoft.com\/en-us\/products\/synapse-analytics. Accessed: 2022-10-02."},{"key":"e_1_2_1_7_1","unstructured":"2022. Corporaci\u00f3n Favorita Grocery Sales Forecasting. https:\/\/www.kaggle.com\/c\/favorita-grocery-sales-forecasting. Accessed: 2022-12-20.  2022. Corporaci\u00f3n Favorita Grocery Sales Forecasting. https:\/\/www.kaggle.com\/c\/favorita-grocery-sales-forecasting. Accessed: 2022-12-20."},{"key":"e_1_2_1_8_1","unstructured":"2022. Databricks - Create run and manage Databricks Jobs. https:\/\/docs.databricks.com\/workflows\/jobs\/jobs.html. Accessed: 2022-10-02.  2022. Databricks - Create run and manage Databricks Jobs. https:\/\/docs.databricks.com\/workflows\/jobs\/jobs.html. Accessed: 2022-10-02."},{"key":"e_1_2_1_9_1","unstructured":"2022. Databricks Feature Store - Use time series feature tables with point-in-time support. https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html. Accessed: 2022-10-12.  2022. Databricks Feature Store - Use time series feature tables with point-in-time support. https:\/\/docs.databricks.com\/machine-learning\/feature-store\/time-series.html. Accessed: 2022-10-12."},{"key":"e_1_2_1_10_1","unstructured":"2022. eCommerce behavior data from multi category store. https:\/\/www.kaggle.com\/datasets\/mkechinov\/ecommerce-behavior-data-from-multi-category-store. Accessed: 2022-12-20.  2022. eCommerce behavior data from multi category store. https:\/\/www.kaggle.com\/datasets\/mkechinov\/ecommerce-behavior-data-from-multi-category-store. Accessed: 2022-12-20."},{"key":"e_1_2_1_11_1","unstructured":"2022. Feathr - Point-in-time Correctness and Point-in-time Join. https:\/\/github.com\/feathr-ai\/feathr\/blob\/main\/docs\/concepts\/point-in-time-join.md. Accessed: 2022-10-02.  2022. Feathr - Point-in-time Correctness and Point-in-time Join. https:\/\/github.com\/feathr-ai\/feathr\/blob\/main\/docs\/concepts\/point-in-time-join.md. Accessed: 2022-10-02."},{"key":"e_1_2_1_12_1","unstructured":"2022. Feathr - Point-in-Time Join Implementation. https:\/\/github.com\/feathr-ai\/feathr\/blob\/main\/feathr-impl\/src\/main\/scala\/com\/linkedin\/feathr\/offline\/join\/DataFrameFeatureJoiner.scala. Accessed: 2022-10-02.  2022. Feathr - Point-in-Time Join Implementation. https:\/\/github.com\/feathr-ai\/feathr\/blob\/main\/feathr-impl\/src\/main\/scala\/com\/linkedin\/feathr\/offline\/join\/DataFrameFeatureJoiner.scala. Accessed: 2022-10-02."},{"key":"e_1_2_1_13_1","unstructured":"2022. Feathr: An Enterprise-Grade High-Performance Feature Store. https:\/\/github.com\/feathr-ai\/feathr. Accessed: 2022-10-02.  2022. Feathr: An Enterprise-Grade High-Performance Feature Store. https:\/\/github.com\/feathr-ai\/feathr. Accessed: 2022-10-02."},{"key":"e_1_2_1_14_1","unstructured":"2022. Flint: A Time Series Library for Apache Spark. https:\/\/github.com\/twosigma\/flint. Accessed: 2022-10-02.  2022. Flint: A Time Series Library for Apache Spark. https:\/\/github.com\/twosigma\/flint. Accessed: 2022-10-02."},{"key":"e_1_2_1_15_1","unstructured":"2022. Google OR-Tools. https:\/\/developers.google.com\/optimization\/. Accessed: 2022-10-02.  2022. Google OR-Tools. https:\/\/developers.google.com\/optimization\/. Accessed: 2022-10-02."},{"key":"e_1_2_1_16_1","unstructured":"2022. Gurobi Optimization. https:\/\/www.gurobi.com\/. Accessed: 2022-10-02.  2022. Gurobi Optimization. https:\/\/www.gurobi.com\/. Accessed: 2022-10-02."},{"key":"e_1_2_1_17_1","unstructured":"2022. kiwisolver. https:\/\/pypi.org\/project\/kiwisolver\/. Accessed: 2022-10-02.  2022. kiwisolver. https:\/\/pypi.org\/project\/kiwisolver\/. Accessed: 2022-10-02."},{"key":"e_1_2_1_18_1","unstructured":"2022. OpenMLDB. https:\/\/github.com\/4paradigm\/OpenMLDB. Accessed: 2022-10-02.  2022. OpenMLDB. https:\/\/github.com\/4paradigm\/OpenMLDB. Accessed: 2022-10-02."},{"key":"e_1_2_1_19_1","unstructured":"2022. Point-in-time Joins in Feast. https:\/\/docs.feast.dev\/getting-started\/concepts\/point-in-time-joins. Accessed: 2022-10-10.  2022. Point-in-time Joins in Feast. https:\/\/docs.feast.dev\/getting-started\/concepts\/point-in-time-joins. Accessed: 2022-10-10."},{"key":"e_1_2_1_20_1","unstructured":"2022. Point-in-time Joins in Feathr. https:\/\/feathr-ai.github.io\/feathr\/concepts\/point-in-time-join.html. Accessed: 2022-10-12.  2022. Point-in-time Joins in Feathr. https:\/\/feathr-ai.github.io\/feathr\/concepts\/point-in-time-join.html. Accessed: 2022-10-12."},{"key":"e_1_2_1_21_1","unstructured":"2022. Point-in-time Joins in Feathr. https:\/\/feathr-ai.github.io\/feathr\/concepts\/feature-definition.html. Accessed: 2022-10-12.  2022. Point-in-time Joins in Feathr. https:\/\/feathr-ai.github.io\/feathr\/concepts\/feature-definition.html. Accessed: 2022-10-12."},{"key":"e_1_2_1_22_1","unstructured":"2022. Point-in-time Joins in Hopsworks. https:\/\/www.hopsworks.ai\/post\/a-spark-join-operator-for-point-in-time-correct-joins. Accessed: 2022-10-11.  2022. Point-in-time Joins in Hopsworks. https:\/\/www.hopsworks.ai\/post\/a-spark-join-operator-for-point-in-time-correct-joins. Accessed: 2022-10-11."},{"key":"e_1_2_1_23_1","unstructured":"2022. Spark on Databricks. https:\/\/www.databricks.com\/product\/spark. Accessed: 2022-10-02.  2022. Spark on Databricks. https:\/\/www.databricks.com\/product\/spark. Accessed: 2022-10-02."},{"key":"e_1_2_1_24_1","unstructured":"2022. Spark PIT: Utility library for Point-in-Time joins in Apache Spark. https:\/\/github.com\/Ackuq\/spark-pit. Accessed: 2022-10-02.  2022. Spark PIT: Utility library for Point-in-Time joins in Apache Spark. https:\/\/github.com\/Ackuq\/spark-pit. Accessed: 2022-10-02."},{"key":"e_1_2_1_25_1","unstructured":"2022. Synapse Analytics - Integrate with pipelines. https:\/\/learn.microsoft.com\/azure\/synapse-analytics\/get-started-pipelines. Accessed: 2022-10-02.  2022. Synapse Analytics - Integrate with pipelines. https:\/\/learn.microsoft.com\/azure\/synapse-analytics\/get-started-pipelines. Accessed: 2022-10-02."},{"key":"e_1_2_1_26_1","unstructured":"2022. tempo: Time Series Utilities for Data Teams Using Databricks. https:\/\/github.com\/databrickslabs\/tempo. Accessed: 2022-10-02.  2022. tempo: Time Series Utilities for Data Teams Using Databricks. https:\/\/github.com\/databrickslabs\/tempo. Accessed: 2022-10-02."},{"key":"e_1_2_1_27_1","volume-title":"Conference on Innovative Data Systems Research (CIDR).","author":"Agrawal Ashvin","year":"2020","unstructured":"Ashvin Agrawal , Rony Chatterjee , Carlo Curino , Avrilia Floratou , Neha Godwal , Matteo Interlandi , Alekh Jindal , Konstantinos Karanasos , Subru Krishnan , Brian Kroth , Jyoti Leeka , Kwanghyun Park , Hiren Patel , Olga Poppe , Fotis Psallidas , Raghu Ramakrishnan , Abhishek Roy , Karla Saur , Rathijit Sen , Markus Weimer , Travis Wright , and Yiwen Zhu . 2020 . Cloudy with high chance of DBMS: a 10-year prediction for Enterprise-Grade ML . In Conference on Innovative Data Systems Research (CIDR). Ashvin Agrawal, Rony Chatterjee, Carlo Curino, Avrilia Floratou, Neha Godwal, Matteo Interlandi, Alekh Jindal, Konstantinos Karanasos, Subru Krishnan, Brian Kroth, Jyoti Leeka, Kwanghyun Park, Hiren Patel, Olga Poppe, Fotis Psallidas, Raghu Ramakrishnan, Abhishek Roy, Karla Saur, Rathijit Sen, Markus Weimer, Travis Wright, and Yiwen Zhu. 2020. Cloudy with high chance of DBMS: a 10-year prediction for Enterprise-Grade ML. In Conference on Innovative Data Systems Research (CIDR)."},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00)","author":"Agrawal Sanjay","unstructured":"Sanjay Agrawal , Surajit Chaudhuri , and Vivek R. Narasayya . 2000. Automated Selection of Materialized Views and Indexes in SQL Databases . In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00) . 496--505. Sanjay Agrawal, Surajit Chaudhuri, and Vivek R. Narasayya. 2000. Automated Selection of Materialized Views and Indexes in SQL Databases. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00). 496--505."},{"key":"e_1_2_1_29_1","first-page":"12","article-title":"TPCx-AI - An Industry Standard Benchmark for Artificial Intelligence and Machine Learning Systems","volume":"16","author":"Br\u00fccke Christoph","year":"2023","unstructured":"Christoph Br\u00fccke , Philipp H\u00e4rtling , Rodrigo D Escobar Palacios , Hamesh Patel , and Tilmann Rabl . 2023 . TPCx-AI - An Industry Standard Benchmark for Artificial Intelligence and Machine Learning Systems . Proc. VLDB Endow. 16 , 12 (sep 2023), 3649--3661. Christoph Br\u00fccke, Philipp H\u00e4rtling, Rodrigo D Escobar Palacios, Hamesh Patel, and Tilmann Rabl. 2023. TPCx-AI - An Industry Standard Benchmark for Artificial Intelligence and Machine Learning Systems. Proc. VLDB Endow. 16, 12 (sep 2023), 3649--3661.","journal-title":"Proc. VLDB Endow."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2983323.2983669"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2004.75"},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07)","author":"Chaudhuri Surajit","year":"2007","unstructured":"Surajit Chaudhuri and Vivek Narasayya . 2007 . Self-Tuning Database Systems: A Decade of Progress . In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07) . 3--14. Surajit Chaudhuri and Vivek Narasayya. 2007. Self-Tuning Database Systems: A Decade of Progress. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07). 3--14."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/3446095.3446102"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.5555\/2480856"},{"key":"e_1_2_1_35_1","volume-title":"Relative Error Streaming Quantiles. In ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems PODS. 96--108","author":"Cormode Graham","year":"2021","unstructured":"Graham Cormode , Zohar S. Karnin , Edo Liberty , Justin Thaler , and Pavel Vesel\u00fd . 2021 . Relative Error Streaming Quantiles. In ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems PODS. 96--108 . Graham Cormode, Zohar S. Karnin, Edo Liberty, Justin Thaler, and Pavel Vesel\u00fd. 2021. Relative Error Streaming Quantiles. In ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems PODS. 96--108."},{"key":"e_1_2_1_36_1","volume-title":"Materialization and Reuse Optimizations for Production Data Science Pipelines. In ACM International Conference on Management of Data (SIGMOD). 1962--1976","author":"Derakhshan Behrouz","year":"2022","unstructured":"Behrouz Derakhshan , Alireza Rezaei Mahdiraji , Zoi Kaoudi , Tilmann Rabl , and Volker Markl . 2022 . Materialization and Reuse Optimizations for Production Data Science Pipelines. In ACM International Conference on Management of Data (SIGMOD). 1962--1976 . Behrouz Derakhshan, Alireza Rezaei Mahdiraji, Zoi Kaoudi, Tilmann Rabl, and Volker Markl. 2022. Materialization and Reuse Optimizations for Production Data Science Pipelines. In ACM International Conference on Management of Data (SIGMOD). 1962--1976."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.5555\/645925.756643"},{"key":"e_1_2_1_38_1","volume-title":"ACM International Conference on Management of Data (SIGMOD). 331--342","author":"Goldstein Jonathan","year":"2001","unstructured":"Jonathan Goldstein and Per-\u00c5ke Larson . 2001 . Optimizing Queries Using Materialized Views: A practical, scalable solution . In ACM International Conference on Management of Data (SIGMOD). 331--342 . Jonathan Goldstein and Per-\u00c5ke Larson. 2001. Optimizing Queries Using Materialized Views: A practical, scalable solution. In ACM International Conference on Management of Data (SIGMOD). 331--342."},{"key":"e_1_2_1_39_1","volume-title":"Deep Learning","author":"Goodfellow Ian","unstructured":"Ian Goodfellow , Yoshua Bengio , and Aaron Courville . 2016. Deep Learning . MIT Press . http:\/\/www.deeplearningbook.org. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http:\/\/www.deeplearningbook.org."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/568271.223849"},{"key":"e_1_2_1_41_1","volume-title":"Maintaining Views Incrementally. In ACM International Conference on Management of Data (SIGMOD). 157--166","author":"Gupta Ashish","unstructured":"Ashish Gupta , Inderpal Singh Mumick , and V. S. Subrahmanian . 1993 . Maintaining Views Incrementally. In ACM International Conference on Management of Data (SIGMOD). 157--166 . Ashish Gupta, Inderpal Singh Mumick, and V. S. Subrahmanian. 1993. Maintaining Views Incrementally. In ACM International Conference on Management of Data (SIGMOD). 157--166."},{"key":"e_1_2_1_42_1","unstructured":"Paul Hargis Jason MacKay Raphey Holmes and Mark Roy. 2021. Build accurate ML training datasets using point-in-time queries with Amazon SageMaker Feature Store and Apache Spark. https:\/\/aws.amazon.com\/blogs\/machine-learning\/build-accurate-ml-training-datasets-using-point-in-time-queries-with-amazon-sagemaker-feature-store-and-apache-spark\/  Paul Hargis Jason MacKay Raphey Holmes and Mark Roy. 2021. Build accurate ML training datasets using point-in-time queries with Amazon SageMaker Feature Store and Apache Spark. https:\/\/aws.amazon.com\/blogs\/machine-learning\/build-accurate-ml-training-datasets-using-point-in-time-queries-with-amazon-sagemaker-feature-store-and-apache-spark\/"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192971"},{"key":"e_1_2_1_44_1","volume-title":"ACM International Conference on Management of Data (SIGMOD). 191--203","author":"Jindal Alekh","year":"2018","unstructured":"Alekh Jindal , Shi Qiao , Hiren Patel , Zhicheng Yin , Jieming Di , Malay Bag , Marc T. Friedman , Yifung Lin , Konstantinos Karanasos , and Sriram Rao . 2018 . Computation Reuse in Analytics Job Service at Microsoft . In ACM International Conference on Management of Data (SIGMOD). 191--203 . Alekh Jindal, Shi Qiao, Hiren Patel, Zhicheng Yin, Jieming Di, Malay Bag, Marc T. Friedman, Yifung Lin, Konstantinos Karanasos, and Sriram Rao. 2018. Computation Reuse in Analytics Job Service at Microsoft. In ACM International Conference on Management of Data (SIGMOD). 191--203."},{"key":"e_1_2_1_45_1","first-page":"4","article-title":"Delta: Scalable Data Dissemination under Capacity Constraints","volume":"7","author":"Karanasos Konstantinos","year":"2013","unstructured":"Konstantinos Karanasos , Asterios Katsifodimos , and Ioana Manolescu . 2013 . Delta: Scalable Data Dissemination under Capacity Constraints . Proc. VLDB Endow. 7 , 4 (dec 2013), 217--228. Konstantinos Karanasos, Asterios Katsifodimos, and Ioana Manolescu. 2013. Delta: Scalable Data Dissemination under Capacity Constraints. Proc. VLDB Endow. 7, 4 (dec 2013), 217--228.","journal-title":"Proc. VLDB Endow."},{"key":"e_1_2_1_46_1","volume-title":"Optimal Quantile Approximation in Streams. In IEEE Symposium on Foundations of Computer Science (FOCS). 71--78","author":"Karnin Zohar S.","year":"2016","unstructured":"Zohar S. Karnin , Kevin J. Lang , and Edo Liberty . 2016 . Optimal Quantile Approximation in Streams. In IEEE Symposium on Foundations of Computer Science (FOCS). 71--78 . Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. 2016. Optimal Quantile Approximation in Streams. In IEEE Symposium on Foundations of Computer Science (FOCS). 71--78."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/2382577.2382579"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476402"},{"key":"e_1_2_1_49_1","unstructured":"Axel Pettersson. 2022. Resource-efficient and fast Point-in-Time joins for Apache Spark: Optimization of time travel operations for the creation of machine learning training datasets. Master's thesis. KTH School of Electrical Engineering and Computer Science (EECS).  Axel Pettersson. 2022. Resource-efficient and fast Point-in-Time joins for Apache Spark: Optimization of time travel operations for the creation of machine learning training datasets. Master's thesis. KTH School of Electrical Engineering and Computer Science (EECS)."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551842"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.5555\/767141.767146"},{"key":"e_1_2_1_52_1","volume-title":"Hidden Technical Debt in Machine Learning Systems. In Conference on Neural Information Processing Systems (NeurIPS). 2503--2511","author":"Sculley D.","year":"2015","unstructured":"D. Sculley , Gary Holt , Daniel Golovin , Eugene Davydov , Todd Phillips , Dietmar Ebner , Vinay Chaudhary , Michael Young , Jean-Fran\u00e7ois Crespo , and Dan Dennison . 2015 . Hidden Technical Debt in Machine Learning Systems. In Conference on Neural Information Processing Systems (NeurIPS). 2503--2511 . D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Fran\u00e7ois Crespo, and Dan Dennison. 2015. Hidden Technical Debt in Machine Learning Systems. In Conference on Neural Information Processing Systems (NeurIPS). 2503--2511."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.14778\/3510397.3510402"},{"key":"e_1_2_1_54_1","volume-title":"International Conference on Data Engineering (ICDE). 575--584","author":"Stocker Konrad","year":"2001","unstructured":"Konrad Stocker , Donald Kossmann , Reinhard Braumandl , and Alfons Kemper . 2001 . Integrating Semi-Join-Reducers into State of the Art Query Processors . In International Conference on Data Engineering (ICDE). 575--584 . Konrad Stocker, Donald Kossmann, Reinhard Braumandl, and Alfons Kemper. 2001. Integrating Semi-Join-Reducers into State of the Art Query Processors. In International Conference on Data Engineering (ICDE). 575--584."},{"key":"e_1_2_1_55_1","volume-title":"Proceedings of the 26th International Conference on Data Engineering (ICDE","author":"Thusoo Ashish","year":"2010","unstructured":"Ashish Thusoo , Joydeep Sen Sarma , Namit Jain , Zheng Shao , Prasad Chakka , Ning Zhang , Suresh Anthony , Hao Liu , and Raghotham Murthy . 2010 . Hive - a petabyte scale data warehouse using Hadoop . In Proceedings of the 26th International Conference on Data Engineering (ICDE 2010). 996--1005. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, and Raghotham Murthy. 2010. Hive - a petabyte scale data warehouse using Hadoop. In Proceedings of the 26th International Conference on Data Engineering (ICDE 2010). 996--1005."},{"issue":"0","key":"e_1_2_1_56_1","first-page":"2","article-title":"TPCx-AI Specification","volume":"1","author":"TPC.","year":"2022","unstructured":"TPC. 2022 . TPCx-AI Specification , Version 1 . 0 . 2 . https:\/\/www.tpc.org\/tpc_documents_current_versions\/pdf\/tpcx-ai_v1.0.2.pdf. Accessed: 2022-10-02. TPC. 2022. TPCx-AI Specification, Version 1.0.2. https:\/\/www.tpc.org\/tpc_documents_current_versions\/pdf\/tpcx-ai_v1.0.2.pdf. Accessed: 2022-10-02.","journal-title":"Version"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.5555\/645923.673657"},{"key":"e_1_2_1_58_1","volume-title":"Automatic View Generation with Deep Learning and Reinforcement Learning. In IEEE International Conference on Data Engineering (ICDE). 1501--1512","author":"Yuan Haitao","year":"2020","unstructured":"Haitao Yuan , Guoliang Li , Ling Feng , Ji Sun , and Yue Han . 2020 . Automatic View Generation with Deep Learning and Reinforcement Learning. In IEEE International Conference on Data Engineering (ICDE). 1501--1512 . Haitao Yuan, Guoliang Li, Ling Feng, Ji Sun, and Yue Han. 2020. Automatic View Generation with Deep Learning and Reinforcement Learning. In IEEE International Conference on Data Engineering (ICDE). 1501--1512."},{"key":"e_1_2_1_59_1","volume-title":"Answering Complex SQL Queries Using Automatic Summary Tables. In ACM International Conference on Management of Data (SIGMOD). 105--116","author":"Zaharioudakis Markos","year":"2000","unstructured":"Markos Zaharioudakis , Roberta Cochrane , George Lapis , Hamid Pirahesh , and Monica Urata . 2000 . Answering Complex SQL Queries Using Automatic Summary Tables. In ACM International Conference on Management of Data (SIGMOD). 105--116 . Markos Zaharioudakis, Roberta Cochrane, George Lapis, Hamid Pirahesh, and Monica Urata. 2000. Answering Complex SQL Queries Using Automatic Summary Tables. In ACM International Conference on Management of Data (SIGMOD). 105--116."},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/2877204"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.14778\/3450980.3450990"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3625054.3625060","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,12,4]],"date-time":"2023-12-04T17:10:22Z","timestamp":1701709822000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3625054.3625060"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9]]},"references-count":61,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2023,9]]}},"alternative-id":["10.14778\/3625054.3625060"],"URL":"https:\/\/doi.org\/10.14778\/3625054.3625060","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2023,9]]},"assertion":[{"value":"2023-12-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}