{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T01:22:11Z","timestamp":1776993731192,"version":"3.51.4"},"reference-count":42,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,7]]},"abstract":"<jats:p>\n            Cloud service providers commonly use standard benchmarks like TPC-H and TPC-DS to evaluate and optimize cloud data analytics systems. However, these benchmarks rely on fixed query patterns and fail to capture real execution statistics of production cloud workloads. Although some cloud database vendors have recently released real workload traces, these traces alone do not qualify as benchmarks, as they typically lack essential components (i.e., queries and databases). To overcome this limitation, this paper studies a new problem of\n            <jats:italic toggle=\"yes\">workload synthesis with real statistics<\/jats:italic>\n            , which generates\n            <jats:italic toggle=\"yes\">synthetic workloads<\/jats:italic>\n            that closely approximate real execution statistics, including key performance metrics and operator distributions. To address this problem, we propose PBench, a novel workload synthesizer that constructs synthetic workloads by (1) selecting and combining workload components from existing benchmarks and (2) augmenting new workload components. This paper studies the key challenges in PBench. First, we address the challenge of balancing performance metrics and operator distributions by introducing a multi-objective optimization-based component selection method. Second, to capture the temporal dynamics of real workloads, we design a timestamp assignment method that progressively refines workload timestamps. 
Third, to handle the disparity between the original workload and the candidate workload, we propose a component augmentation approach that leverages large language models (LLMs) to generate additional workload components while maintaining statistical fidelity. Experimental results show that PBench reduces approximation error by up to 6X compared to state-of-the-art methods.\n          <\/jats:p>","DOI":"10.14778\/3749646.3749661","type":"journal-article","created":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T17:55:06Z","timestamp":1757008506000},"page":"3883-3895","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["PBench: Workload Synthesizer with Real Statistics for Cloud Analytics Benchmarking"],"prefix":"10.14778","volume":"18","author":[{"given":"Yan","family":"Zhou","sequence":"first","affiliation":[{"name":"Renmin University, China"}]},{"given":"Chunwei","family":"Liu","sequence":"additional","affiliation":[{"name":"MIT CSAIL"}]},{"given":"Bhuvan","family":"Urgaonkar","sequence":"additional","affiliation":[{"name":"Penn State and AWS"}]},{"given":"Zhengle","family":"Wang","sequence":"additional","affiliation":[{"name":"Renmin University, China"}]},{"given":"Magnus","family":"Mueller","sequence":"additional","affiliation":[{"name":"Amazon Web Services"}]},{"given":"Chao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Renmin University, China"}]},{"given":"Songyue","family":"Zhang","sequence":"additional","affiliation":[{"name":"Renmin University, China"}]},{"given":"Pascal","family":"Pfeil","sequence":"additional","affiliation":[{"name":"Amazon Web Services"}]},{"given":"Dominik","family":"Horn","sequence":"additional","affiliation":[{"name":"Amazon Web Services"}]},{"given":"Zhengchun","family":"Liu","sequence":"additional","affiliation":[{"name":"Amazon Web Services"}]},{"given":"Davide","family":"Pagano","sequence":"additional","affiliation":[{"name":"Amazon Web 
Services"}]},{"given":"Tim","family":"Kraska","sequence":"additional","affiliation":[{"name":"MIT and AWS"}]},{"given":"Samuel","family":"Madden","sequence":"additional","affiliation":[{"name":"MIT CSAIL"}]},{"given":"Ju","family":"Fan","sequence":"additional","affiliation":[{"name":"Renmin University, China"}]}],"member":"320","published-online":{"date-parts":[[2025,9,4]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Seltenreich Andreas Tang Bo and Mullender Sjoerd. 2022. SQLsmith. https:\/\/github.com\/anse1\/sqlsmith. 2025\/04\/10."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3526045"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3654991"},{"key":"e_1_2_1_4_1","volume-title":"Performance Evaluation and Benchmarking for the Analytics Era","author":"Boncz Peter","unstructured":"Peter Boncz, Angelos-Christos Anatiotis, and Steffen Kl\u00e4be. 2018. JCC-H: Adding Join Crossing Correlations with Skew to TPC-H. In Performance Evaluation and Benchmarking for the Analytics Era, Raghunath Nambiar and Meikel Poess (Eds.). Springer International Publishing, Cham, 103\u2013119."},{"key":"e_1_2_1_5_1","volume-title":"Technology Conference on Performance Evaluation and Benchmarking. Springer, 61\u201376","author":"Boncz Peter","year":"2013","unstructured":"Peter Boncz, Thomas Neumann, and Orri Erling. 2013. TPC-H analyzed: Hidden messages and lessons learned from an influential benchmark. In Technology Conference on Performance Evaluation and Benchmarking. Springer, 61\u201376."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807128.1807152"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2903741"},{"key":"e_1_2_1_8_1","unstructured":"Databend. 2023. TPC-H Benchmark: Databend Cloud vs. Snowflake. Technical Report. Databend. 
https:\/\/docs.databend.com\/guides\/benchmark\/tpch Accessed: 2023-11-20."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3471485.3471492"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDEW.2014.6818330"},{"key":"e_1_2_1_11_1","volume-title":"Cloud-Native Databases: A Survey","author":"Dong Haowen","year":"2024","unstructured":"Haowen Dong, Chao Zhang, Guoliang Li, and Huanchen Zhang. 2024. Cloud-Native Databases: A Survey. IEEE Transactions on Knowledge and Data Engineering (2024)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/3389133.3389138"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407802"},{"key":"e_1_2_1_14_1","unstructured":"John Forrest and Robin Lougee-Heimer. [n.d.]. CBC User Guide. https:\/\/www.coin-or.org\/Cbc\/. 2024\/12\/13."},{"key":"e_1_2_1_15_1","volume-title":"A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811","author":"Frazier Peter I","year":"2018","unstructured":"Peter I Frazier. 2018. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811 (2018)."},{"key":"e_1_2_1_16_1","volume-title":"ICML","volume":"2014","author":"Gardner Jacob R","year":"2014","unstructured":"Jacob R Gardner, Matt J Kusner, Zhixiang Eddie Xu, Kilian Q Weinberger, and John P Cunningham. 2014. Bayesian optimization with inequality constraints.. In ICML, Vol. 2014. 937\u2013945."},{"key":"e_1_2_1_17_1","unstructured":"IMDb. 2024. IMDb Non-Commercial Datasets. https:\/\/developer.imdb.com\/non-commercial-datasets\/. 
2024\/12\/13."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/2850583.2850594"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/3461535.3461549"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/3554821.3554893"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3486001.3486248"},{"key":"e_1_2_1_22_1","unstructured":"Thibaut Lust and Jacques Teghem. 2010. The multiobjective multidimensional knapsack problem: a survey and a new approach. arXiv:1007.4063 [cs.DM] https:\/\/arxiv.org\/abs\/1007.4063"},{"key":"e_1_2_1_23_1","unstructured":"Microsoft. 2024. DSB benchmark. https:\/\/github.com\/microsoft\/dsb. 2024\/12\/13."},{"key":"e_1_2_1_24_1","unstructured":"Magnus M\u00fcller Lucas Woltmann and Wolfgang Lehner. 2023. Enhanced Featurization of Queries with Mixed Combinations of Predicates for ML-based Cardinality Estimation.. In EDBT. 273\u2013284."},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","unstructured":"Vikram Nathan Vikramank Singh Zhengchun Liu Mohammad Rahman Andreas Kipf Dominik Horn Davide Pagano Gaurav Saxena Balakrishnan (Murali) Narayanaswamy and Tim Kraska. 2024. Intelligent scaling in Amazon Redshift. (2024). https:\/\/www.amazon.science\/publications\/intelligent-scaling-in-amazon-redshift","DOI":"10.1145\/3626246.3653394"},{"key":"e_1_2_1_26_1","unstructured":"Fernando Nogueira. 2014\u2013. Bayesian Optimization: Open source constrained global optimization tool for Python. https:\/\/github.com\/bayesian-optimization\/BayesianOptimization"},{"key":"e_1_2_1_27_1","first-page":"1138","article-title":"Why You Should Run TPC-DS: A Workload Analysis","volume":"7","author":"Poess Meikel","year":"2007","unstructured":"Meikel Poess, Raghunath Othayoth Nambiar, and David Walrath. 2007. Why You Should Run TPC-DS: A Workload Analysis.. In VLDB, Vol. 7. 1138\u20131149.","journal-title":"VLDB"},{"key":"e_1_2_1_28_1","unstructured":"Prometheus. 2024. Prometheus Github Page. 
https:\/\/github.com\/prometheus. 2024\/12\/13."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3555041.3589677"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3064029"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/3681954.3682031"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14778\/3583140.3583156"},{"key":"e_1_2_1_33_1","volume-title":"17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20)","author":"Vuppalapati Midhul","year":"2020","unstructured":"Midhul Vuppalapati, Justin Miron, Rachit Agarwal, Dan Truong, Ashish Motivala, and Thierry Cruanes. 2020. Building an elastic query engine on disaggregated storage. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). 449\u2013462."},{"key":"e_1_2_1_34_1","unstructured":"Richard J. Wagner. 2024. Python module for simulated annealing. https:\/\/github.com\/perrygeo\/simanneal. 2024\/12\/13."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.48786\/EDBT.2023.33"},{"key":"e_1_2_1_36_1","volume-title":"Stage: Query Execution Time Prediction in Amazon Redshift. arXiv:2403.02286 [cs.DB] https:\/\/arxiv.org\/abs\/2403.02286","author":"Wu Ziniu","year":"2024","unstructured":"Ziniu Wu, Ryan Marcus, Zhengchun Liu, Parimarjan Negi, Vikram Nathan, Pascal Pfeil, Gaurav Saxena, Mohammad Rahman, Balakrishnan Narayanaswamy, and Tim Kraska. 2024. Stage: Query Execution Time Prediction in Amazon Redshift. arXiv:2403.02286 [cs.DB] https:\/\/arxiv.org\/abs\/2403.02286"},{"key":"e_1_2_1_37_1","volume-title":"Balakrishnan Narayanaswamy, Tim Kraska, and Samuel Madden.","author":"Wu Ziniu","year":"2025","unstructured":"Ziniu Wu, Markos Markakis, Chunwei Liu, Peter Baile Chen, Balakrishnan Narayanaswamy, Tim Kraska, and Samuel Madden. 2025. Improving DBMS Scheduling Decisions with Fine-grained Performance Prediction on Concurrent Queries-Extended. 
arXiv preprint arXiv:2501.16256 (2025)."},{"key":"e_1_2_1_38_1","volume-title":"CloudyBench: A Testbed for A Comprehensive Evaluation of Cloud-Native Databases. In 2025 IEEE 41st International Conference on Data Engineering (ICDE). IEEE Computer Society, 2535\u20132547","author":"Zhang Chao","year":"2025","unstructured":"Chao Zhang, Guoliang Li, Leyao Liu, Tao Lv, and Ju Fan. 2025. CloudyBench: A Testbed for A Comprehensive Evaluation of Cloud-Native Databases. In 2025 IEEE 41st International Conference on Data Engineering (ICDE). IEEE Computer Society, 2535\u20132547."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3641204.3641206"},{"key":"e_1_2_1_40_1","volume-title":"HTAP Databases: A Survey","author":"Zhang Chao","year":"2024","unstructured":"Chao Zhang, Guoliang Li, Jintao Zhang, Xinning Zhang, and Jianhua Feng. 2024. HTAP Databases: A Survey. IEEE Transactions on Knowledge and Data Engineering (2024)."},{"key":"e_1_2_1_41_1","volume-title":"Cost-Intelligent Data Analytics in the Cloud. arXiv preprint arXiv:2308.09569","author":"Zhang Huanchen","year":"2023","unstructured":"Huanchen Zhang, Yihao Liu, and Jiaqi Yan. 2023. Cost-Intelligent Data Analytics in the Cloud. 
arXiv preprint arXiv:2308.09569 (2023)."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3526155"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3749646.3749661","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,5]],"date-time":"2025-09-05T03:04:59Z","timestamp":1757041499000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3749646.3749661"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7]]},"references-count":42,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2025,7]]}},"alternative-id":["10.14778\/3749646.3749661"],"URL":"https:\/\/doi.org\/10.14778\/3749646.3749661","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2025,7]]},"assertion":[{"value":"2025-09-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}