{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:35:55Z","timestamp":1750221355005,"version":"3.41.0"},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2018,3,6]],"date-time":"2018-03-06T00:00:00Z","timestamp":1520294400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Database Syst."],"published-print":{"date-parts":[[2018,3,31]]},"abstract":"<jats:p>It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) Is the integrated data set complete? and (2) What is the impact of any unknown (i.e., unobserved) data on query results?<\/jats:p>\n          <jats:p>\n            In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a.,\n            <jats:italic>unknown unknowns<\/jats:italic>\n            ) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution; we also propose a parametric model that can be used instead when the data sources are imbalanced. 
Through a series of experiments, we show that estimating the impact of\n            <jats:italic>unknown unknowns<\/jats:italic>\n            is invaluable to better assess the results of aggregate queries over integrated data sources.\n          <\/jats:p>","DOI":"10.1145\/3167970","type":"journal-article","created":{"date-parts":[[2018,3,7]],"date-time":"2018-03-07T19:00:36Z","timestamp":1520449236000},"page":"1-37","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Estimating the Impact of Unknown Unknowns on Aggregate Query Results"],"prefix":"10.1145","volume":"43","author":[{"given":"Yeounoh","family":"Chung","sequence":"first","affiliation":[{"name":"Brown University, Providence, RI"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael Lind","family":"Mortensen","sequence":"additional","affiliation":[{"name":"Aarhus University, Aarhus C, Denmark"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Carsten","family":"Binnig","sequence":"additional","affiliation":[{"name":"Brown University, Providence, RI"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tim","family":"Kraska","sequence":"additional","affiliation":[{"name":"Brown University, Providence, RI"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2018,3,6]]},"reference":[{"volume-title":"Proceedings of the SAS Global Forum. 1--21","year":"2012","author":"Allison Paul D.","key":"e_1_2_2_1_1"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807341"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1993.10594330"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/65.3.625"},{"key":"e_1_2_2_5_1","unstructured":"Anne Chao. 1984. 
Nonparametric estimation of the number of classes in a population. Scand. J. Stat. (1984) 265--270."},{"key":"e_1_2_2_6_1","article-title":"Nonparametric estimation of the number of classes in a population","volume":"11","author":"Chao Anne","year":"1984","journal-title":"Scand. J. Stat."},{"volume-title":"Estimating the population size for capture-recapture data with unequal catchability. Biometrics","year":"1987","author":"Chao Anne","key":"e_1_2_2_7_1"},{"edition":"2","volume-title":"Encyclopedia of Statistical Sciences","author":"Chao Anne","key":"e_1_2_2_8_1"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1992.10475194"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1026096204727"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/80.1.193"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/335168.335230"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2882909"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.2000.10474263"},{"volume-title":"Rubin","year":"1977","author":"Dempster Arthur P.","key":"e_1_2_2_15_1"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1924421.1924442"},{"volume-title":"Proceedings of the Conference on Very Large Data Bases (VLDB\u201997)","author":"Florescu Daniela","key":"e_1_2_2_17_1"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989331"},{"key":"e_1_2_2_19_1","unstructured":"Michael Fu et al. 2015. Handbook of Simulation Optimization. Vol. 216. Springer."},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807286"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/40.3-4.237"},{"key":"e_1_2_2_22_1","unstructured":"Google. 2015. 
Freebase. Retrieved from https:\/\/www.freebase.com."},{"key":"e_1_2_2_23_1","first-page":"39","article-title":"Estimating species richness","volume":"12","author":"Gotelli Nicholas J.","year":"2011","journal-title":"Biol. Divers.: Front. Measure. Assess."},{"volume-title":"Proceedings of the 1st AAAI Conference on Human Computation and Crowdsourcing.","year":"2013","author":"Haas Daniel","key":"e_1_2_2_24_1"},{"key":"e_1_2_2_25_1","unstructured":"Peter J. Haas. 1996. Hoeffding Inequalities for Join Selectivity Estimation and Online Aggregation. IBM."},{"volume-title":"Proceedings of the Conference on Very Large Databases (VLDB\u201995)","year":"1995","author":"Haas Peter J.","key":"e_1_2_2_26_1"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/57.1.97"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1952.10483446"},{"key":"e_1_2_2_29_1","unstructured":"Leslie Kish. 1965. Survey Sampling. John Wiley and Sons."},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2612176"},{"key":"e_1_2_2_31_1","doi-asserted-by":"crossref","unstructured":"Ulf Leser and Felix Naumann. 2001. Query planning with information quality bounds. Flex. Query Answer. Syst. (2001) 85--94.","DOI":"10.1007\/978-3-7908-1834-5_8"},{"key":"e_1_2_2_32_1","unstructured":"Michael Lexa. 2004. Useful Facts about the Kullback-Leibler discrimination distance. 
Retrieved from https:\/\/scholarship.rice.edu\/bitstream\/handle\/1911\/20061\/Lex2004Dec8UsefulFact.PDF?sequence&equals;1&isAllowed&equals;&equals;&equals;y."},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/2535568.2448943"},{"key":"e_1_2_2_34_1","unstructured":"Jie Liang. 2008. Estimation Methods for the Size of Deep Web Textural Data Source: A Survey. Retrieved from http:\/\/cs.uwindsor.ca\/richard\/cs510\/survey_jie_liang.pdf."},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-009-9107-y"},{"volume-title":"Sample size, the margin of error and the coefficient of variation. InterStat","year":"2010","author":"Lynch Robert","key":"e_1_2_2_36_1"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1805286.1805291"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989486"},{"volume-title":"Proceedings of the Conference on Innovative Data Systems Research (CIDR\u201911)","author":"Marcus Adam","key":"e_1_2_2_39_1"},{"volume-title":"Proceedings of the Conference on Learning Theory (COLT\u201900)","author":"David","key":"e_1_2_2_40_1"},{"key":"e_1_2_2_41_1","unstructured":"James T. McClave P. George Benson and Terry Sincich. 2014. Statistics for Business and Economics. Pearson Essex."},{"volume-title":"Proceedings of International Conference on Data Engineering (ICDE\u201999)","author":"Meng Weiyi","key":"e_1_2_2_42_1"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2003.12.005"},{"volume-title":"Proceedings of European Conference on Information Systems. 
69","year":"2000","author":"Neiling Mat Tis","key":"e_1_2_2_44_1"},{"key":"e_1_2_2_45_1","volume-title":"Proceedings of the Conference on Very Large data Bases (VLDB\u201986)","volume":"86","author":"Olken Frank","year":"1986"},{"volume-title":"Best Practices in Data Cleaning: A Complete Guide to Everything You Need to do Before and After Collecting Your Data","author":"Osborne Jason W.","key":"e_1_2_2_46_1"},{"volume-title":"Proceedings of the Conference on Innovative Data Systems Research (CIDR\u201911)","year":"2011","author":"Parameswaran Aditya","key":"e_1_2_2_47_1"},{"key":"e_1_2_2_48_1","unstructured":"Pew Research Center. 2014. How U.S. tech-sector jobs have grown changed in 15 years. Retrieved from http:\/\/pewrsr.ch\/PtqZDA."},{"key":"e_1_2_2_49_1","first-page":"3","article-title":"Data cleaning: Problems and current approaches","volume":"23","author":"Rahm Erhard","year":"2000","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1002\/ece3.2463"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2750544"},{"key":"e_1_2_2_52_1","unstructured":"John Rice. 2006. Mathematical Statistics and Data Analysis. 
Cengage Learning."},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/63.3.581"},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2014.6816764"},{"volume-title":"Survey Research","author":"Sapsford Roger","key":"e_1_2_2_55_1","doi-asserted-by":"crossref","DOI":"10.4135\/9780857024664"},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544865"},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1046\/j.1365-2656.2003.00748.x"},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/1993636.1993727"},{"key":"e_1_2_2_59_1","unstructured":"Wikipedia. 2015. List of U.S. states by GDP. Retrieved from https:\/\/en.wikipedia.org\/wiki\/List_of_U.S._states_by_GDP."},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/1814433.1814443"},{"volume-title":"Multiple imputation for missing data: Concepts and new development (Version 9.0)","year":"2010","author":"Yuan Yang C.","key":"e_1_2_2_61_1"},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.2307\/1936227"}],"container-title":["ACM Transactions on Database 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3167970","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3167970","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T02:26:07Z","timestamp":1750213567000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3167970"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,3,6]]},"references-count":62,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2018,3,31]]}},"alternative-id":["10.1145\/3167970"],"URL":"https:\/\/doi.org\/10.1145\/3167970","relation":{},"ISSN":["0362-5915","1557-4644"],"issn-type":[{"type":"print","value":"0362-5915"},{"type":"electronic","value":"1557-4644"}],"subject":[],"published":{"date-parts":[[2018,3,6]]},"assertion":[{"value":"2016-12-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-03-06","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}