{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T15:05:43Z","timestamp":1758121543060,"version":"3.41.0"},"reference-count":61,"publisher":"Association for Computing Machinery (ACM)","issue":"4","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,12]]},"abstract":"<jats:p>Estimating the Number of Distinct Values (NDV) is fundamental for numerous data management tasks, especially within database applications. However, most existing works primarily focus on introducing new statistical or learned estimators, while identifying the most suitable estimator for a given scenario remains largely unexplored. Therefore, we propose AdaNDV, a learned method designed to adaptively select and fuse existing estimators to address this issue. Specifically, (1) we propose to use learned models to distinguish between overestimated and underestimated estimators and then select appropriate estimators from each category. This strategy provides a complementary perspective by integrating overestimations and underestimations for error correction, thereby improving the accuracy of NDV estimation. (2) To further integrate the estimation results, we introduce a novel fusion approach that employs a learned model to predict the weights of the selected estimators and then applies a weighted sum to merge them. By combining these strategies, the proposed AdaNDV fundamentally distinguishes itself from previous works that directly estimate NDV. Moreover, extensive experiments conducted on real-world datasets, with the number of individual columns being several orders of magnitude larger than in previous studies, demonstrate the superior performance of our method.<\/jats:p>","DOI":"10.14778\/3717755.3717769","type":"journal-article","created":{"date-parts":[[2025,5,20]],"date-time":"2025-05-20T15:51:49Z","timestamp":1747756309000},"page":"1104-1117","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators"],"prefix":"10.14778","volume":"18","author":[{"given":"Xianghong","family":"Xu","sequence":"first","affiliation":[{"name":"ByteDance, Beijing, China"}]},{"given":"Tieying","family":"Zhang","sequence":"additional","affiliation":[{"name":"ByteDance, San Jose, USA"}]},{"given":"Xiao","family":"He","sequence":"additional","affiliation":[{"name":"ByteDance, Hangzhou, China"}]},{"given":"Haoyang","family":"Li","sequence":"additional","affiliation":[{"name":"ByteDance, Beijing, China"}]},{"given":"Rong","family":"Kang","sequence":"additional","affiliation":[{"name":"ByteDance, Beijing, China"}]},{"given":"Shuai","family":"Wang","sequence":"additional","affiliation":[{"name":"ByteDance, Beijing, China"}]},{"given":"Linhui","family":"Xu","sequence":"additional","affiliation":[{"name":"ByteDance, Beijing, China"}]},{"given":"Zhimin","family":"Liang","sequence":"additional","affiliation":[{"name":"ByteDance, Beijing, China"}]},{"given":"Shangyu","family":"Luo","sequence":"additional","affiliation":[{"name":"ByteDance, San Jose, USA"}]},{"given":"Lei","family":"Zhang","sequence":"additional","affiliation":[{"name":"ByteDance, Shenzhen, China"}]},{"given":"Jianjun","family":"Chen","sequence":"additional","affiliation":[{"name":"ByteDance, San Jose, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,5,20]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2020. Campaign finance data. https:\/\/www.fec.gov\/data\/"},{"key":"e_1_2_1_2_1","unstructured":"2020. Voter Registration Statistics. https:\/\/www.ncsbe.gov\/results-data\/voterregistration-data"},{"key":"e_1_2_1_3_1","unstructured":"2024. Box plot. https:\/\/en.wikipedia.org\/wiki\/Box_plot"},{"key":"e_1_2_1_4_1","unstructured":"2024. Chi-squared test. https:\/\/en.wikipedia.org\/wiki\/Chi-squared_test"},{"key":"e_1_2_1_5_1","unstructured":"2024. Commoncrawl. https:\/\/commoncrawl.org\/"},{"key":"e_1_2_1_6_1","unstructured":"2024. GitHub. https:\/\/github.com\/"},{"key":"e_1_2_1_7_1","unstructured":"2024. Pydistinct - Population Distinct Value Estimators. https:\/\/pydistinct.readthedocs.io\/"},{"key":"e_1_2_1_8_1","unstructured":"2024. Source Code of MySQL. https:\/\/github.com\/mysql\/mysql-server\/blob\/824e2b4064053f7daf17d7f3f84b7a3ed92e5fb4\/sql\/join_optimizer\/cost_model.cc"},{"key":"e_1_2_1_9_1","unstructured":"2024. Source Code of PostgreSQL. https:\/\/github.com\/postgres\/postgres\/blob\/master\/src\/backend\/optimizer\/plan\/analyzejoins.c"},{"key":"e_1_2_1_10_1","unstructured":"2024. Source Code of Spark. https:\/\/github.com\/apache\/spark\/blob\/master\/sql\/catalyst\/src\/main\/scala\/org\/apache\/spark\/sql\/catalyst\/plans\/logical\/statsEstimation\/JoinEstimation.scala"},{"key":"e_1_2_1_11_1","unstructured":"2024. tablib-v1-sample dataset. https:\/\/huggingface.co\/datasets\/approximatelabs\/tablib-v1-sample"},{"key":"e_1_2_1_12_1","unstructured":"2024. Violin plot. https:\/\/en.wikipedia.org\/wiki\/Violin_plot"},{"key":"e_1_2_1_13_1","volume-title":"International Conference on Machine Learning. PMLR, 11\u201321","author":"Acharya Jayadev","year":"2017","unstructured":"Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. 2017. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning. PMLR, 11\u201321."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331347"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1993.10594330"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/65.3.625"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.2307\/1936861"},{"key":"e_1_2_1_18_1","volume-title":"Nonparametric estimation of the number of classes in a population. Scandinavian Journal of statistics","author":"Chao Anne","year":"1984","unstructured":"Anne Chao. 1984. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of statistics (1984), 265\u2013270."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1992.10475194"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/335168.335230"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/276305.276343"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISIT45174.2021.9517892"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNET.2019.2940705"},{"key":"e_1_2_1_24_1","unstructured":"Gus Eggert Kevin Huo Mike Biven and Justin Waugh. 2023. TabLib: A Dataset of 627M Tables with Context. arXiv:2310.07875 [cs.CL]"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/3654621.3654632"},{"key":"e_1_2_1_26_1","volume-title":"Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Discrete mathematics & theoretical computer science Proceedings","author":"Flajolet Philippe","year":"2007","unstructured":"Philippe Flajolet, \u00c9ric Fusy, Olivier Gandouet, and Fr\u00e9d\u00e9ric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Discrete mathematics & theoretical computer science Proceedings (2007)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177729949"},{"key":"e_1_2_1_28_1","first-page":"311","article-title":"Sampling-based estimation of the number of distinct values of an attribute","volume":"95","author":"Haas Peter J","year":"1995","unstructured":"Peter J Haas, Jeffrey F Naughton, S Seshadri, and Lynne Stokes. 1995. Sampling-based estimation of the number of distinct values of an attribute. In VLDB, Vol. 95. 311\u2013322.","journal-title":"VLDB"},{"key":"e_1_2_1_29_1","volume-title":"ByteCard: Enhancing Data Warehousing with Learned Cardinality Estimation. arXiv preprint arXiv:2403.16110","author":"Han Yuxing","year":"2024","unstructured":"Yuxing Han, Haoyu Wang, Lixiang Chen, Yifeng Dong, Xing Chen, Benquan Yu, Chengcheng Yang, and Weining Qian. 2024. ByteCard: Enhancing Data Warehousing with Learned Cardinality Estimation. arXiv preprint arXiv:2403.16110 (2024)."},{"key":"e_1_2_1_30_1","volume-title":"Unified sample-optimal property estimation in near-linear time. Advances in Neural Information Processing Systems 32","author":"Hao Yi","year":"2019","unstructured":"Yi Hao and Alon Orlitsky. 2019. Unified sample-optimal property estimation in near-linear time. Advances in Neural Information Processing Systems 32 (2019)."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3186728.3164145"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588710"},{"key":"e_1_2_1_33_1","volume-title":"Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.","author":"Diederik","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3654994"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539390"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/3626292.3626302"},{"key":"e_1_2_1_37_1","doi-asserted-by":"crossref","unstructured":"Tie-Yan Liu et al. 2009. Learning to rank for information retrieval. Foundations and Trends\u00ae in Information Retrieval 3 3 (2009) 225\u2013331.","DOI":"10.1561\/1500000016"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687738"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611972962.7"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/1340771.1340773"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.5555\/1036843.1036895"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-7091-7555-2_68"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-10424-4_17"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330677"},{"key":"e_1_2_1_45_1","first-page":"1","article-title":"Approximate profile maximum likelihood","volume":"20","author":"Pavlichin Dmitri S","year":"2019","unstructured":"Dmitri S Pavlichin, Jiantao Jiao, and Tsachy Weissman. 2019. Approximate profile maximum likelihood. Journal of Machine Learning Research 20, 122 (2019), 1\u201355.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517900"},{"key":"e_1_2_1_47_1","series-title":"Springer Series in Statistics (1992)","volume-title":"Model Assisted Survey Sampling","author":"S\u00e4rndal Carl-Erik","unstructured":"Carl-Erik S\u00e4rndal, Bengt Swensson, and Jan Wretman. 1992. Model Assisted Survey Sampling. Springer Series in Statistics (1992)."},{"key":"e_1_2_1_48_1","first-page":"97","article-title":"On estimation of the size of the dictionary of a long text on the basis of a sample","volume":"19","author":"Shlosser A","year":"1981","unstructured":"A Shlosser. 1981. On estimation of the size of the dictionary of a long text on the basis of a sample. Engineering Cybernetics 19, 1 (1981), 97\u2013102.","journal-title":"Engineering Cybernetics"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1080\/03610928608829161"},{"key":"e_1_2_1_50_1","first-page":"45","article-title":"Word frequency distributions and type-token characteristics","volume":"11","author":"Sichel Herbert S","year":"1986","unstructured":"Herbert S Sichel. 1986. Word frequency distributions and type-token characteristics. Math. Scientist 11 (1986), 45\u201372.","journal-title":"Math. Scientist"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1016\/0306-4573(92)90088-H"},{"key":"e_1_2_1_52_1","volume-title":"Nonparametric estimation of species richness. Biometrics","author":"Smith Eric P","year":"1984","unstructured":"Eric P Smith and Gerald van Belle. 1984. Nonparametric estimation of species richness. Biometrics (1984), 119\u2013129."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3125643"},{"key":"e_1_2_1_54_1","volume-title":"Estimating the Unseen: Improved Estimators for Entropy and other Properties. Advances in Neural Information Processing Systems 26","author":"Valiant Paul","year":"2013","unstructured":"Paul Valiant and Gregory Valiant. 2013. Estimating the Unseen: Improved Estimators for Entropy and other Properties. Advances in Neural Information Processing Systems 26 (2013)."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3269206.3271784"},{"key":"e_1_2_1_56_1","volume-title":"Conference on learning theory. PMLR, 25\u201354","author":"Wang Yining","year":"2013","unstructured":"Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. 2013. A theoretical analysis of NDCG type ranking measures. In Conference on learning theory. PMLR, 25\u201354."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.14778\/3489496.3489508"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1214\/17-AOS1665"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390156.1390306"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00201"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.14778\/3632093.3632114"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3717755.3717769","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,20]],"date-time":"2025-05-20T16:16:24Z","timestamp":1747757784000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3717755.3717769"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12]]},"references-count":61,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,12]]}},"alternative-id":["10.14778\/3717755.3717769"],"URL":"https:\/\/doi.org\/10.14778\/3717755.3717769","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,12]]},"assertion":[{"value":"2025-05-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}