{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T16:25:28Z","timestamp":1776443128796,"version":"3.51.2"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2017,10,4]],"date-time":"2017-10-04T00:00:00Z","timestamp":1507075200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["IIS-1562657"],"award-info":[{"award-number":["IIS-1562657"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"name":"NSF CAREER Award","award":["CCF-1351108"],"award-info":[{"award-number":["CCF-1351108"]}]},{"name":"Sloan Research Fellowship"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. ACM"],"published-print":{"date-parts":[[2017,12,31]]},"abstract":"<jats:p>\n            We show that a class of statistical properties of distributions, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a\n            <jats:italic>sublinear<\/jats:italic>\n            sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most\n            <jats:italic>k<\/jats:italic>\n            distinct elements, these properties can be estimated accurately using a sample of size\n            <jats:italic>O<\/jats:italic>\n            (\n            <jats:italic>k<\/jats:italic>\n            log\n            <jats:italic>k<\/jats:italic>\n            ). For these estimation tasks, this performance is\n            <jats:italic>optimal<\/jats:italic>\n            , to constant factors. Complementing these theoretical results, we also demonstrate that our estimators perform exceptionally well, in practice, for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. The key step in our approach is to first use the sample to characterize the \u201cunseen\u201d portion of the distribution\u2014effectively reconstructing this portion of the distribution as accurately as if one had a logarithmic factor larger sample. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: We seek to estimate the\n            <jats:italic>shape<\/jats:italic>\n            of the unobserved portion of the distribution. This work can be seen as introducing a robust, general, and theoretically principled framework that, for many practical applications, essentially amplifies the sample size by a logarithmic factor; we expect that it may be fruitfully used as a component within larger machine learning and statistical analysis systems.\n          <\/jats:p>","DOI":"10.1145\/3125643","type":"journal-article","created":{"date-parts":[[2017,10,4]],"date-time":"2017-10-04T12:17:29Z","timestamp":1507119449000},"page":"1-41","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":40,"title":["Estimating the Unseen"],"prefix":"10.1145","volume":"64","author":[{"given":"Gregory","family":"Valiant","sequence":"first","affiliation":[{"name":"Stanford University, Stanford, CA"}]},{"given":"Paul","family":"Valiant","sequence":"additional","affiliation":[{"name":"Brown University, Providence, RI"}]}],"member":"320","published-online":{"date-parts":[[2017,10,4]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"Proceedings of the Conference on Learning Theory (COLT\u201911)","author":"Acharya J."},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISIT.2009.5206016"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1002\/rsa.10019"},{"key":"e_1_2_2_4_1","volume-title":"Proceedings of the Symposium on Theory of Computing (STOC\u201901)","author":"Bar-Yossef Z."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1137\/1104033"},{"key":"e_1_2_2_6_1","unstructured":"T. Batu. 2001. Testing Properties of Distributions. Ph.D. thesis Cornell University.  T. Batu. 2001. Testing Properties of Distributions. Ph.D. thesis Cornell University."},{"key":"e_1_2_2_7_1","volume-title":"Proceedings of the Symposium on Theory of Computing (STOC\u201902)","author":"Batu T."},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1137\/S0097539702403645"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/SFCS.2000.892113"},{"key":"e_1_2_2_10_1","volume-title":"Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA07)","author":"Brautbar M."},{"key":"e_1_2_2_11_1","unstructured":"J. Bunge. Bibliography of references on the problem of estimating support size. Retrieved from http:\/\/www.stat.cornell.edu\/&sim;bunge\/bibliography.html.  J. Bunge. Bibliography of references on the problem of estimating support size. Retrieved from http:\/\/www.stat.cornell.edu\/&sim;bunge\/bibliography.html."},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1993.10594330"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1026096204727"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/335168.335230"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1214\/aos\/1176345462"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/63.3.435"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.2307\/1411"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/43.1-2.45"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/40.3-4.237"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1109557.1109637"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1006\/jcss.1996.0041"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1952.10483446"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.2015.2412945"},{"key":"e_1_2_2_24_1","first-page":"1058","article-title":"On a functional space and certain extremal problems","volume":"115","author":"Kantorovich L.","year":"1957","journal-title":"Dokl. Akad. Nauk. SSSR (Russ.)"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1126\/science.1217283"},{"key":"e_1_2_2_26_1","volume-title":"Proceedings of the Conference on Learning Theory (COLT\u201900)","author":"McAllester D. A."},{"key":"e_1_2_2_27_1","unstructured":"G. Miller. 1955. Note on the bias of information estimates. Inf. Theory Psychol.: Probl. Methods (1955) 95--100.  G. Miller. 1955. Note on the bias of information estimates. Inf. Theory Psychol.: Probl. Methods (1955) 95--100."},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511813603"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1126\/science.1217876"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-52342-1_23"},{"key":"e_1_2_2_31_1","volume-title":"Proceedings of the Conference on Uncertainity in Artificial Intelligence (UAI\u201904)","author":"Orlitsky A."},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1126\/science.1088284"},{"key":"e_1_2_2_33_1","volume-title":"Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS\u201903)","author":"Orlitsky A."},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.1607774113"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1162\/089976603321780272"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1088\/0954-898X\/7\/1\/006"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177698526"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1126\/science.1219240"},{"key":"e_1_2_2_39_1","unstructured":"G. Valiant and P. Valiant. 2010. A CLT and tight lower bounds for estimating entropy. (2010). Retrieved from http:\/\/www.eccc.uni-trier.de\/report\/2010\/179\/.  G. Valiant and P. Valiant. 2010. A CLT and tight lower bounds for estimating entropy. (2010). Retrieved from http:\/\/www.eccc.uni-trier.de\/report\/2010\/179\/."},{"key":"e_1_2_2_40_1","volume-title":"Proceedings of the ACM Symposium on Theory of Computing (STOC\u201911)","author":"Valiant G."},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/FOCS.2011.81"},{"key":"e_1_2_2_42_1","volume-title":"Proceedings of the Conference on Neural Information Processing Systems (NIPS\u201913)","author":"Valiant G."},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897518.2897641"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/1374376.1374432"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1002\/sim.2942"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISIT.2006.262066"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISIT.2007.4557571"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.2016.2548468"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.2307\/1936227"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1038\/ncomms13293"}],"container-title":["Journal of the ACM"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3125643","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3125643","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3125643","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:30:06Z","timestamp":1750217406000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3125643"}},"subtitle":["Improved Estimators for Entropy and Other Properties"],"short-title":[],"issued":{"date-parts":[[2017,10,4]]},"references-count":50,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2017,12,31]]}},"alternative-id":["10.1145\/3125643"],"URL":"https:\/\/doi.org\/10.1145\/3125643","relation":{},"ISSN":["0004-5411","1557-735X"],"issn-type":[{"value":"0004-5411","type":"print"},{"value":"1557-735X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,10,4]]},"assertion":[{"value":"2015-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-10-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}