{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T17:55:23Z","timestamp":1757613323883,"version":"3.44.0"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"9","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,5]]},"abstract":"<jats:p>\n            Decision making under uncertainty often requires choosing\n            <jats:italic toggle=\"yes\">packages<\/jats:italic>\n            , or bags of tuples, that collectively optimize expected outcomes while limiting risks. Processing\n            <jats:italic toggle=\"yes\">Stochastic Package Queries<\/jats:italic>\n            (SPQs) involves solving very large optimization problems on uncertain data. Monte Carlo methods create numerous\n            <jats:italic toggle=\"yes\">scenarios<\/jats:italic>\n            , or sample realizations of the stochastic attributes of\n            <jats:italic toggle=\"yes\">all<\/jats:italic>\n            the tuples, and generate packages with optimal objective values across these scenarios. The number of scenarios needed for accurate approximation\u2014and hence the size of the optimization problem when using prior methods\u2014increases with variance in the data, and the search space of the optimization problem increases exponentially with the number of tuples in the relation. Existing solvers take hours to process SPQs on large relations containing stochastic attributes with high variance. Besides enriching the SPaQL language to capture a broader class of risk specifications, we make two fundamental contributions toward scalable SPQ processing. First, we propose\n            <jats:italic toggle=\"yes\">risk-constraint linearization<\/jats:italic>\n            (RCL), which converts SPQs into Integer Linear Programs (ILPs) whose size is independent of the number of scenarios used. Solving these ILPs gives us feasible and near-optimal packages. Second, we propose Stochastic SketchRefine, a divide and conquer framework that breaks down a large stochastic optimization problem into subproblems involving smaller subsets of tuples. Our experiments show that, together, RCL and Stochastic SketchRefine produce high-quality packages in orders of magnitude lower runtime than the state of the art.\n          <\/jats:p>","DOI":"10.14778\/3746405.3746431","type":"journal-article","created":{"date-parts":[[2025,9,3]],"date-time":"2025-09-03T17:06:20Z","timestamp":1756919180000},"page":"3106-3118","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["<i>Stochastic SketchRefine<\/i>\n            : Scaling In-Database Decision-Making under Uncertainty to Millions of Tuples"],"prefix":"10.14778","volume":"18","author":[{"given":"Riddho R.","family":"Haque","sequence":"first","affiliation":[{"name":"University of Massachusetts Amherst"}]},{"given":"Anh L.","family":"Mai","sequence":"additional","affiliation":[{"name":"NYU Abu Dhabi"}]},{"given":"Matteo","family":"Brucato","sequence":"additional","affiliation":[{"name":"Microsoft Research"}]},{"given":"Azza","family":"Abouzied","sequence":"additional","affiliation":[{"name":"NYU Abu Dhabi"}]},{"given":"Peter J.","family":"Haas","sequence":"additional","affiliation":[{"name":"University of Massachusetts Amherst"}]},{"given":"Alexandra","family":"Meliou","sequence":"additional","affiliation":[{"name":"University of Massachusetts Amherst"}]}],"member":"320","published-online":{"date-parts":[[2025,9,3]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Principal component analysis","author":"Abdi Herv\u00e9","year":"2010","unstructured":"Herv\u00e9 Abdi and Lynne J Williams. 2010. Principal component analysis. Wiley interdisciplinary reviews: computational statistics 2, 4 (2010), 433\u2013459."},{"key":"e_1_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Shabbir Ahmed and Alexander Shapiro. 2008. Solving chance-constrained stochastic programs via sampling and integer programming. In State-of-the-Art Decision-making Tools in the Information-intensive Age. (INFORMS) 261\u2013269.","DOI":"10.1287\/educ.1080.0048"},{"key":"e_1_2_1_3_1","volume-title":"A comparison of VaR and CVaR constraints on portfolio selection with the mean-variance model. Management science 50, 9","author":"Alexander Gordon J","year":"2004","unstructured":"Gordon J Alexander and Alexandre M Baptista. 2004. A comparison of VaR and CVaR constraints on portfolio selection with the mean-variance model. Management science 50, 9 (2004), 1261\u20131273."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-017-0483-4"},{"key":"e_1_2_1_5_1","volume-title":"Azza Abouzied, and Alexandra Meliou.","author":"Brucato Matteo","year":"2015","unstructured":"Matteo Brucato, Juan Felipe Beltran, Azza Abouzied, and Alexandra Meliou. 2015. Scalable package queries in relational database systems. arXiv preprint arXiv:1512.03564 (2015)."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389765"},{"key":"e_1_2_1_7_1","volume-title":"Scalable package queries in relational database systems: Extended version. arXiv preprint arXiv:2103.06784","author":"Brucato Matteo","year":"2021","unstructured":"Matteo Brucato, Nishant Yadav, Azza Abouzied, Peter J. Haas, and Alexandra Meliou. 2021. Scalable package queries in relational database systems: Extended version. arXiv preprint arXiv:2103.06784 (2021)."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1561\/9781601988614"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3526138"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10957-010-9754-6"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.arcontrol.2009.07.001"},{"key":"e_1_2_1_12_1","volume-title":"Near-optimal density estimation in near-linear time using variable-width histograms. Advances in neural information processing systems 27","author":"Chan Siu On","year":"2014","unstructured":"Siu On Chan, Ilias Diakonikolas, Rocco A Servedio, and Xiaorui Sun. 2014. Near-optimal density estimation in near-linear time using variable-width histograms. Advances in neural information processing systems 27 (2014)."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/11731139_24"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-006-0004-3"},{"volume-title":"Computers and intractability","author":"Garey Michael R","key":"e_1_2_1_15_1","unstructured":"Michael R Garey and David S Johnson. 1979. Computers and intractability. Vol. 174. freeman San Francisco."},{"volume-title":"Computational Statistics","author":"Gentle James E","key":"e_1_2_1_16_1","unstructured":"James E Gentle. 2009. Computational Statistics. Springer."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/937332.937336"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-87993-0_19"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1002\/sam.11170"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2008.115"},{"key":"e_1_2_1_21_1","volume-title":"An information-theoretic approach to hierarchical clustering of uncertain data. Information sciences 402","author":"Gullo Francesco","year":"2017","unstructured":"Francesco Gullo, Giovanni Ponti, Andrea Tagarelli, and Sergio Greco. 2017. An information-theoretic approach to hierarchical clustering of uncertain data. Information sciences 402 (2017), 199\u2013215."},{"key":"e_1_2_1_22_1","volume-title":"Uncertain centroid based partitional clustering of uncertain data. arXiv preprint arXiv:1203.6401","author":"Gullo Francesco","year":"2012","unstructured":"Francesco Gullo and Andrea Tagarelli. 2012. Uncertain centroid based partitional clustering of uncertain data. arXiv preprint arXiv:1203.6401 (2012)."},{"key":"e_1_2_1_23_1","unstructured":"Riddho R. Haque Anh L. Mai Matteo Brucato Azza Abouzied Peter J. Haas and Alexandra Meliou. 2024. Stochastic SketchRefine: Scaling In-Database Decision-Making under Uncertainty to Millions of Tuples. arXiv:2411.17915 [cs.DB] https:\/\/arxiv.org\/abs\/2411.17915"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.sorms.2014.05.001"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2021.107996"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000824.2000828"},{"volume-title":"Handbook of Simulation Optimization","author":"Kim Sujin","key":"e_1_2_1_27_1","unstructured":"Sujin Kim, Raghu Pasupathy, and Shane G Henderson. 2015. A guide to sample average approximation. In Handbook of Simulation Optimization. Springer, 207\u2013243."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/3641204.3641222"},{"volume-title":"Quantitative Risk Management: Concepts, Techniques and Tools","author":"McNeil Alexander J.","key":"e_1_2_1_29_1","unstructured":"Alexander J. McNeil, R\u00fcdiger Frey, and Paul Embrechts. 2015. Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press."},{"key":"e_1_2_1_30_1","volume-title":"Accessed","author":"Mooney Paul","year":"2024","unstructured":"Paul Mooney. [n.d.]. Stock Market Data (NASDAQ, NYSE, S&P500). https:\/\/www.kaggle.com\/datasets\/paultimothymooney\/stock-market-data\/data. Accessed: March 10, 2024."},{"key":"e_1_2_1_31_1","volume-title":"Statistical aspects of Wasserstein distances. Annual review of statistics and its application 6, 1","author":"Panaretos Victor M","year":"2019","unstructured":"Victor M Panaretos and Yoav Zemel. 2019. Statistical aspects of Wasserstein distances. Annual review of statistics and its application 6, 1 (2019), 405\u2013431."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/369275.369291"},{"key":"e_1_2_1_33_1","first-page":"23","article-title":"Simulating stock prices using geometric Brownian motion: Evidence from Australian companies","volume":"10","author":"Reddy Krishna","year":"2016","unstructured":"Krishna Reddy and Vaughan Clinton. 2016. Simulating stock prices using geometric Brownian motion: Evidence from Australian companies. Australasian Accounting, Business and Finance Journal 10, 3 (2016), 23\u201347.","journal-title":"Australasian Accounting, Business and Finance Journal"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1287\/educ.1080.0052"},{"key":"e_1_2_1_35_1","volume-title":"Notes on Kullback-Leibler divergence and likelihood. arXiv preprint arXiv:1404.2000","author":"Shlens Jonathon","year":"2014","unstructured":"Jonathon Shlens. 2014. Notes on Kullback-Leibler divergence and likelihood. arXiv preprint arXiv:1404.2000 (2014)."},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management. 1\u201312","author":"\u0160ik\u0161nys Laurynas","year":"2016","unstructured":"Laurynas \u0160ik\u0161nys and Torben Bach Pedersen. 2016. SolveDB: Integrating optimization problem solvers into SQL databases. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management. 1\u201312."},{"key":"e_1_2_1_37_1","volume-title":"Advances in Database Technology-24th International Conference on Extending Database Technology, EDBT","author":"Siksnys Laurynas","year":"2021","unstructured":"Laurynas Siksnys, Torben Bach Pedersen, Thomas Dyhre Nielsen, and Davide Frazzetto. 2021. SolveDB+: Sql-based prescriptive analytics. In Advances in Database Technology-24th International Conference on Extending Database Technology, EDBT 2021. OpenProceedings.org, 133\u2013144."},{"key":"e_1_2_1_38_1","doi-asserted-by":"crossref","unstructured":"Eric K Tokuda Cesar H Comin and Luciano da F Costa. 2022. Revisiting agglomerative clustering. Physica A: Statistical mechanics and its applications 585 (2022) 126433.","DOI":"10.1016\/j.physa.2021.126433"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.infsof.2003.07.003"},{"key":"e_1_2_1_40_1","volume-title":"Novel density-based and hierarchical density-based clustering algorithms for uncertain data. Neural networks 93","author":"Zhang Xianchao","year":"2017","unstructured":"Xianchao Zhang, Han Liu, and Xiaotong Zhang. 2017. Novel density-based and hierarchical density-based clustering algorithms for uncertain data. Neural networks 93 (2017), 240\u2013255."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3746405.3746431","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T19:53:10Z","timestamp":1757015590000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3746405.3746431"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5]]},"references-count":40,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,5]]}},"alternative-id":["10.14778\/3746405.3746431"],"URL":"https:\/\/doi.org\/10.14778\/3746405.3746431","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2025,5]]},"assertion":[{"value":"2025-09-03","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}