{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,14]],"date-time":"2026-03-14T09:51:22Z","timestamp":1773481882913,"version":"3.50.1"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,5,29]],"date-time":"2024-05-29T00:00:00Z","timestamp":1716940800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Hong Kong ITC ITF","award":["MRP\/071\/20X"],"award-info":[{"award-number":["MRP\/071\/20X"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61932001"],"award-info":[{"award-number":["61932001"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["U2241212"],"award-info":[{"award-number":["U2241212"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Beijing Natural Science Foundation","award":["4222028"],"award-info":[{"award-number":["4222028"]}]},{"name":"Hong Kong RGC GRF","award":["14217322"],"award-info":[{"award-number":["14217322"]}]},{"name":"Beijing Outstanding Young Scientist Program","award":["BJJWZYJH012019100020098"],"award-info":[{"award-number":["BJJWZYJH012019100020098"]}]},{"name":"Alibaba Innovative Research Program"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,5,29]]},"abstract":"<jats:p>The problem of estimating data properties using sampling frequency histograms has attracted extensive interest in the area of databases. The properties include the number of distinct values (NDV), entropy, and so on. In the field of databases, property estimation is fundamental to complex applications. For example, NDV estimation is the foundation of query optimization, and entropy estimation is the foundation of data compression. Among them, methods originating from statistics exhibit desirable theoretical guarantees but rely on specific assumptions about the distribution of data, resulting in poor performance in real-world applications. Learning-based methods, which use information from training data, are adaptable in the real world but often lack theoretical guarantees or explainability. In addition, a unified framework for estimating these frequency-based estimators with machine learning is lacking. Given the aforementioned challenges, it is natural to wonder if a unified framework with theoretical guarantees can be established for property estimation. The recent literature has presented theoretical studies that propose estimation frameworks based on polynomials. These studies also prove estimation errors with respect to the sample size. Motivated by the above polynomial estimation framework, we propose a learning-based estimation framework with polynomial approximation, which aims to learn the coefficients of the polynomial, providing theoretical guarantees to the learning framework. Through comprehensive experiments on both synthetic and real-world datasets for estimating various data properties like NDV, entropy, and power sum, our results show the superiority of our algorithms over previous estimators.<\/jats:p>","DOI":"10.1145\/3654994","type":"journal-article","created":{"date-parts":[[2024,5,30]],"date-time":"2024-05-30T09:44:53Z","timestamp":1717062293000},"page":"1-27","source":"Crossref","is-referenced-by-count":2,"title":["Learning-based Property Estimation with Polynomials"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-4149-5165","authenticated-orcid":false,"given":"Jiajun","family":"Li","sequence":"first","affiliation":[{"name":"Renmin University of China, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-2700-3840","authenticated-orcid":false,"given":"Runlin","family":"Lei","sequence":"additional","affiliation":[{"name":"Renmin University of China, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1892-6971","authenticated-orcid":false,"given":"Sibo","family":"Wang","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, HongKong, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3620-5086","authenticated-orcid":false,"given":"Zhewei","family":"Wei","sequence":"additional","affiliation":[{"name":"Renmin University of China, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1535-9692","authenticated-orcid":false,"given":"Bolin","family":"Ding","sequence":"additional","affiliation":[{"name":"Alibaba Group, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,5,30]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2020. Airlines Departure Delay. https:\/\/www.openml.org\/d\/42728."},{"key":"e_1_2_1_2_1","unstructured":"2020. Voter Registration Statistics. https:\/\/www.ncsbe.gov\/results-data\/voterregistration-data."},{"key":"e_1_2_1_3_1","unstructured":"2021. Synthetic data generator. https:\/\/github.com\/wurenzhi\/learned_ndv_estimator.git."},{"key":"e_1_2_1_4_1","unstructured":"2022. Github. https:\/\/github.com\/wurenzhi\/learned_ndv_estimator.git."},{"key":"e_1_2_1_5_1","unstructured":"2023. Source Code of Spark. https:\/\/github.com\/apache\/spark\/blob\/master\/sql\/catalyst\/src\/main\/scala\/org\/apache\/spark\/sql\/catalyst\/plans\/logical\/statsEstimation\/JoinEstimation.scala."},{"key":"e_1_2_1_6_1","unstructured":"2024. Source Code of PostgreSQL. https:\/\/github.com\/postgres\/postgres\/blob\/master\/src\/backend\/optimizer\/plan\/analyzejoins.c."},{"key":"e_1_2_1_7_1","volume-title":"International Conference on Machine Learning. PMLR, 11--21","author":"Acharya Jayadev","year":"2017","unstructured":"Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. 2017. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning. PMLR, 11--21."},{"key":"e_1_2_1_8_1","volume-title":"International Conference on Machine Learning. PMLR, 11--21","author":"Acharya Jayadev","year":"2017","unstructured":"Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. 2017. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning. PMLR, 11--21."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/BigData.2015.7363736"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1002\/rsa.10019"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-short.20"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.5555\/646978.711822"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/509907.510005"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536222.2536235"},{"key":"e_1_2_1_15_1","volume-title":"Algorithms for minimization without derivatives","author":"Brent Richard P","unstructured":"Richard P Brent. 2013. Algorithms for minimization without derivatives. Courier Corporation."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1993.10594330"},{"key":"e_1_2_1_17_1","volume-title":"Nonparametric estimation of the number of classes in a population. Scandinavian Journal of statistics","author":"Chao Anne","year":"1984","unstructured":"Anne Chao. 1984. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of statistics (1984), 265--270."},{"key":"e_1_2_1_18_1","volume-title":"User's guide for program SPADE (Species prediction and diversity estimation)","author":"Chao Anne","year":"2010","unstructured":"Anne Chao and TJ Shen. 2010. User's guide for program SPADE (Species prediction and diversity estimation). Taiwan: National Tsing Hua University (2010)."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/335168.335230"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007602"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457255"},{"key":"e_1_2_1_22_1","volume-title":"Support Estimation with Sampling Artifacts and Errors. In 2021 IEEE International Symposium on Information Theory (ISIT). IEEE, 244--249","author":"Chien Eli","year":"2021","unstructured":"Eli Chien, Olgica Milenkovic, and Angelia Nedich. 2021. Support Estimation with Sampling Artifacts and Errors. In 2021 IEEE International Symposium on Information Theory (ISIT). IEEE, 244--249."},{"key":"e_1_2_1_23_1","unstructured":"I Chien. 2019. Regularized weighted Chebyshev approximations for support estimation. (2019)."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNET.2019.2940705"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407820"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.14778\/3329772.3329780"},{"key":"e_1_2_1_27_1","volume-title":"Learning-based Support Estimation in Sublinear Time. In International Conference on Learning Representations.","author":"Eden Talya","year":"2021","unstructured":"Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, and Tal Wagner. 2021. Learning-based Support Estimation in Sublinear Time. In International Conference on Learning Representations."},{"key":"e_1_2_1_28_1","volume-title":"The jackknife estimate of variance. The Annals of Statistics","author":"Efron Bradley","year":"1981","unstructured":"Bradley Efron and Charles Stein. 1981. The jackknife estimate of variance. The Annals of Statistics (1981), 586--596."},{"key":"e_1_2_1_29_1","first-page":"435","article-title":"Estimating the number of unseen species: How many words did Shakespeare know","volume":"63","author":"Efron Bradley","year":"1976","unstructured":"Bradley Efron and Ronald Thisted. 1976. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika 63, 3 (1976), 435--447.","journal-title":"Biometrika"},{"key":"e_1_2_1_30_1","volume-title":"The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology","author":"Fisher Ronald A","year":"1943","unstructured":"Ronald A Fisher, A Steven Corbet, and Carrington B Williams. 1943. The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology (1943), 42--58."},{"key":"e_1_2_1_31_1","doi-asserted-by":"crossref","unstructured":"Philippe Flajolet \u00c9ric Fusy Olivier Gandouet and Fr\u00e9d\u00e9ric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science. Discrete Mathematics and Theoretical Computer Science 137--156.","DOI":"10.46298\/dmtcs.3545"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/40.3-4.237"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/43.1-2.45"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.5555\/1109557.1109637"},{"key":"e_1_2_1_35_1","volume-title":"Conference On Learning Theory. PMLR, 3189--3221","author":"Han Yanjun","year":"2018","unstructured":"Yanjun Han, Jiantao Jiao, and Tsachy Weissman. 2018. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under wasserstein distance. In Conference On Learning Theory. PMLR, 3189--3221."},{"key":"e_1_2_1_36_1","volume-title":"Unified sample-optimal property estimation in near-linear time. Advances in Neural Information Processing Systems 32","author":"Hao Yi","year":"2019","unstructured":"Yi Hao and Alon Orlitsky. 2019. Unified sample-optimal property estimation in near-linear time. Advances in Neural Information Processing Systems 32 (2019)."},{"key":"e_1_2_1_37_1","volume-title":"Deepdb: Learn from data, not from queries! arXiv preprint arXiv:1909.00607","author":"Hilprecht Benjamin","year":"2019","unstructured":"Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2019. Deepdb: Learn from data, not from queries! arXiv preprint arXiv:1909.00607 (2019)."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/308386.308455"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.2018.2846245"},{"key":"e_1_2_1_40_1","volume-title":"Rob R De Ruyter Van Steveninck, and William Bialek","author":"Koberle Roland","year":"1998","unstructured":"Roland Koberle, Rob R De Ruyter Van Steveninck, and William Bialek. 1998. Entropy and information in neural spike trains. Physical review letters 80, 1 (1998), 197."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539390"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/3626292.3626302"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.5555\/2886444.2886453"},{"key":"e_1_2_1_44_1","volume-title":"Note on the bias of information estimates. Information theory in psychology: Problems and methods","author":"Miller George","year":"1955","unstructured":"George Miller. 1955. Note on the bias of information estimates. Information theory in psychology: Problems and methods (1955)."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.5555\/1036843.1036895"},{"key":"e_1_2_1_46_1","volume-title":"SUM, and AVERAGE relational algebra queries. In Database and Expert Systems Applications","author":"Ozsoyoglu Gultekin","unstructured":"Gultekin Ozsoyoglu, Kaizheng Du, A Tjahjana, W-C Hou, and DY Rowland. 1991. On estimating COUNT, SUM, and AVERAGE relational algebra queries. In Database and Expert Systems Applications. Springer, 406--412."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-10424-4_17"},{"key":"e_1_2_1_48_1","first-page":"97","article-title":"On estimation of the size of the dictionary of a long text on the basis of a sample","volume":"19","author":"Shlosser A","year":"1981","unstructured":"A Shlosser. 1981. On estimation of the size of the dictionary of a long text on the basis of a sample. Engineering Cybernetics 19, 1 (1981), 97--102.","journal-title":"Engineering Cybernetics"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3077136.3080713"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/1993636.1993727"},{"key":"e_1_2_1_51_1","unstructured":"Paul Valiant and Gregory Valiant. 2013. Estimating the Unseen: Improved Estimators for Entropy and other Properties.. In NIPS. 2157--2165."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.14778\/3489496.3489508"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.2016.2548468"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1214\/17-AOS1665"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.14778\/3461535.3461539"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3654994","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3654994","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T14:40:44Z","timestamp":1755787244000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3654994"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,29]]},"references-count":55,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,5,29]]}},"alternative-id":["10.1145\/3654994"],"URL":"https:\/\/doi.org\/10.1145\/3654994","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,29]]}}}