{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,13]],"date-time":"2026-03-13T13:02:31Z","timestamp":1773406951476,"version":"3.50.1"},"reference-count":30,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2008,8]]},"abstract":"<jats:p>The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google's general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own \"schema\" of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude.<\/jats:p>\n          <jats:p>We describe the WEBTABLES system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power can be derived by analyzing such a huge corpus?<\/jats:p>\n          <jats:p>\n            First, we develop new techniques for keyword search over a corpus of tables, and show that they can achieve substantially higher relevance than solutions based on a traditional search engine. Second, we introduce a new object derived from the database corpus: the\n            <jats:italic>attribute correlation statistics database<\/jats:italic>\n            (AcsDB) that records corpus-wide statistics on co-occurrences of schema elements. 
In addition to improving search relevance, the AcsDB makes possible several novel applications:\n            <jats:italic>schema auto-complete<\/jats:italic>\n            , which helps a database designer to choose schema elements;\n            <jats:italic>attribute synonym finding<\/jats:italic>\n            , which automatically computes attribute synonym pairs for schema matching; and\n            <jats:italic>join-graph traversal<\/jats:italic>\n            , which allows a user to navigate between extracted schemas using automatically-generated join links.\n          <\/jats:p>","DOI":"10.14778\/1453856.1453916","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"538-549","source":"Crossref","is-referenced-by-count":388,"title":["WebTables"],"prefix":"10.14778","volume":"1","author":[{"given":"Michael J.","family":"Cafarella","sequence":"first","affiliation":[{"name":"University of Washington, Seattle, WA"}]},{"given":"Alon","family":"Halevy","sequence":"additional","affiliation":[{"name":"Google, Inc., Mountain View, CA"}]},{"given":"Daisy Zhe","family":"Wang","sequence":"additional","affiliation":[{"name":"UC Berkeley, Berkeley, CA"}]},{"given":"Eugene","family":"Wu","sequence":"additional","affiliation":[{"name":"MIT, Cambridge, MA"}]},{"given":"Yang","family":"Zhang","sequence":"additional","affiliation":[{"name":"MIT, Cambridge, MA"}]}],"member":"320","published-online":{"date-parts":[[2008,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/375663.375774"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/564691.564782"},{"key":"e_1_2_1_3_1","volume-title":"European Conference on Machine Learning","author":"Bell S.","year":"1995","unstructured":"S. Bell and P. Brockhausen . Discovery of data dependencies in relational databases . In European Conference on Machine Learning , 1995 . S. Bell and P. Brockhausen. 
Discovery of data dependencies in relational databases. In European Conference on Machine Learning, 1995."},{"key":"e_1_2_1_4_1","first-page":"858","volume-title":"Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Language Learning","author":"Brants T.","year":"2007","unstructured":"T. Brants , A. C. Popat , P. Xu , F. J. Och , and J. Dean . Large language models in machine translation . In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Language Learning , pages 858 -- 867 , 2007 . T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Language Learning, pages 858--867, 2007."},{"key":"e_1_2_1_5_1","volume-title":"Uncovering the relational web. In under review","author":"Cafarella M.","year":"2008","unstructured":"M. Cafarella , A. Halevy , Z. Wang , E. Wu , and Y. Zhang . Uncovering the relational web. In under review , 2008 . M. Cafarella, A. Halevy, Z. Wang, E. Wu, and Y. Zhang. Uncovering the relational web. In under review, 2008."},{"key":"e_1_2_1_6_1","volume-title":"Web DB","author":"Cafarella M. J.","year":"2007","unstructured":"M. J. Cafarella , D. Suciu , and O. Etzioni . Navigating extracted data with schema discovery . In Web DB , 2007 . M. J. Cafarella, D. Suciu, and O. Etzioni. Navigating extracted data with schema discovery. 
In Web DB, 2007."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.3115\/990820.990845"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.3115\/981623.981633"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007612"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/375663.375731"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/988672.988687"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242583"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007691"},{"key":"e_1_2_1_14_1","volume-title":"VLDB","author":"Hristidis V.","year":"2002","unstructured":"V. Hristidis and Y. Papakonstantinou . Discover: Keyword search in relational databases . In VLDB , 2002 . V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases. In VLDB, 2002."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/502512.502559"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2005.39"},{"key":"e_1_2_1_17_1","volume-title":"VLDB","author":"Madhavan J.","year":"2001","unstructured":"J. Madhavan , P. A. Bernstein , and E. Rahm . Generic schema matching with cupid . In VLDB , 2001 . J. Madhavan, P. A. Bernstein, and E. Rahm. Generic schema matching with cupid. In VLDB, 2001."},{"issue":"4","key":"e_1_2_1_18_1","first-page":"19","article-title":"Structured data meets the web: A few observations","volume":"29","author":"Madhavan J.","year":"2006","unstructured":"J. Madhavan , A. Y. Halevy , S. Cohen , X. L. Dong , S. R. Jeffery , D. Ko , and C. Yu . Structured data meets the web: A few observations . IEEE Data Eng. Bull. , 29 ( 4 ): 19 -- 26 , 2006 . J. Madhavan, A. Y. Halevy, S. Cohen, X. L. Dong, S. R. Jeffery, D. Ko, and C. Yu. Structured data meets the web: A few observations. IEEE Data Eng. Bull., 29(4): 19--26, 2006.","journal-title":"IEEE Data Eng. 
Bull."},{"key":"e_1_2_1_19_1","volume-title":"Foundations of Statistical Natural Language Processing","author":"Manning C.","year":"1999","unstructured":"C. Manning and H. Sch\u00fctze . Foundations of Statistical Natural Language Processing . MIT Press , 1999 . C. Manning and H. Sch\u00fctze. Foundations of Statistical Natural Language Processing. MIT Press, 1999."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.83"},{"issue":"3","key":"e_1_2_1_21_1","first-page":"40","article-title":"Schema discovery","volume":"26","author":"Miller R.","year":"2003","unstructured":"R. Miller and P. Andritsos . Schema discovery . IEEE Data Eng. Bull. , 26 ( 3 ): 40 -- 45 , 2003 . R. Miller and P. Andritsos. Schema discovery. IEEE Data Eng. Bull., 26(3):40--45, 2003.","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1247480.1247640"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/876867.877593"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/s007780100057"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/645328.650004"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/511446.511478"},{"key":"e_1_2_1_27_1","volume-title":"Data Mining: Practical machine learning tools and techniques. Morgan Kaufman","author":"Witten I.","year":"2005","unstructured":"I. Witten and E. Frank . Data Mining: Practical machine learning tools and techniques. Morgan Kaufman , San Francisco , 2 nd edition edition, 2005 . I. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. 
Morgan Kaufman, San Francisco, 2nd edition, 2005.","edition":"2"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1097-4571(19980415)49:5%3C455::AID-ASI7%3E3.3.CO;2-D"},{"key":"e_1_2_1_29_1","first-page":"31","volume-title":"Proceedings of the 1st International Workshop on Web Document Analysis","author":"Yoshida M.","year":"2001","unstructured":"M. Yoshida and K. Torisawa . A method to integrate tables of the world wide web . In Proceedings of the 1st International Workshop on Web Document Analysis , pages 31 -- 34 , 2001 . M. Yoshida and K. Torisawa. A method to integrate tables of the world wide web. In Proceedings of the 1st International Workshop on Web Document Analysis, pages 31--34, 2001."},{"key":"e_1_2_1_30_1","volume-title":"A survey of table recognition: Models, observations, transformations, and inferences","author":"Zanibbi R.","year":"2003","unstructured":"R. Zanibbi , D. Blostein , and J. Cordy . A survey of table recognition: Models, observations, transformations, and inferences , 2003 . R. Zanibbi, D. Blostein, and J. Cordy. 
A survey of table recognition: Models, observations, transformations, and inferences, 2003."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1453856.1453916","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T11:05:30Z","timestamp":1672225530000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1453856.1453916"}},"subtitle":["exploring the power of tables on the web"],"short-title":[],"issued":{"date-parts":[[2008,8]]},"references-count":30,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2008,8]]}},"alternative-id":["10.14778\/1453856.1453916"],"URL":"https:\/\/doi.org\/10.14778\/1453856.1453916","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2008,8]]}}}