{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T20:09:13Z","timestamp":1776888553665,"version":"3.51.2"},"reference-count":28,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2009,3,20]],"date-time":"2009-03-20T00:00:00Z","timestamp":1237507200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGMOD Rec."],"published-print":{"date-parts":[[2009,3,20]]},"abstract":"<jats:p>A long-standing goal of Web research has been to construct a unified Web knowledge base. Information extraction techniques have shown good results on Web inputs, but even most domain-independent ones are not appropriate for Web-scale operation. In this paper we describe three recent extraction systems that can be operated on the entire Web (two of which come from Google Research). The TextRunner system focuses on raw natural language text, the WebTables system focuses on HTML-embedded tables, and the deep-web surfacing system focuses on \"hidden\" databases. The domain, expressiveness, and accuracy of extracted data can depend strongly on its source extractor; we describe differences in the characteristics of data produced by the three extractors. 
Finally, we discuss a series of unique data applications (some of which have already been prototyped) that are enabled by aggregating extracted Web information.<\/jats:p>","DOI":"10.1145\/1519103.1519112","type":"journal-article","created":{"date-parts":[[2009,4,6]],"date-time":"2009-04-06T16:34:22Z","timestamp":1239035662000},"page":"55-61","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":41,"title":["Web-scale extraction of structured data"],"prefix":"10.1145","volume":"37","author":[{"given":"Michael J.","family":"Cafarella","sequence":"first","affiliation":[{"name":"University of Washington"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jayant","family":"Madhavan","sequence":"additional","affiliation":[{"name":"Google Inc."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alon","family":"Halevy","sequence":"additional","affiliation":[{"name":"Google Inc."}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2009,3,20]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/375663.375774"},{"key":"e_1_2_1_2_1","volume-title":"Personal Communication","author":"Banko M.","year":"2008","unstructured":"M. Banko . Personal Communication , 2008 . M. Banko. Personal Communication, 2008."},{"key":"e_1_2_1_3_1","volume-title":"IJCAI","author":"Banko M.","year":"2007","unstructured":"M. Banko , M. J. Cafarella , S. Soderland , M. Broadhead , and O. Etzioni . Open Information Extraction from the Web . In IJCAI , 2007 . M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open Information Extraction from the Web. In IJCAI, 2007."},{"key":"e_1_2_1_4_1","volume-title":"ACL","author":"Banko M.","year":"2008","unstructured":"M. Banko and O. Etzioni . The Tradeoffs Between Open and Traditional Relational Extraction . In ACL , 2008 . M. Banko and O. Etzioni. 
The Tradeoffs Between Open and Traditional Relational Extraction. In ACL, 2008."},{"key":"e_1_2_1_5_1","volume-title":"SBBD","author":"Barbosa L.","year":"2004","unstructured":"L. Barbosa and J. Freire . Siphoning hidden-web data through keyword-based interfaces . In SBBD , 2004 . L. Barbosa and J. Freire. Siphoning hidden-web data through keyword-based interfaces. In SBBD, 2004."},{"key":"e_1_2_1_6_1","volume-title":"The Deep Web: Surfacing Hidden Value","author":"Bergman M. K.","year":"2001","unstructured":"M. K. Bergman . The Deep Web: Surfacing Hidden Value . Journal of Electronic Publishing , 2001 . M. K. Bergman. The Deep Web: Surfacing Hidden Value. Journal of Electronic Publishing, 2001."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.5555\/646543.696220"},{"key":"e_1_2_1_8_1","volume-title":"WebDB","author":"Cafarella M. J.","year":"2008","unstructured":"M. J. Cafarella , A. Halevy , Y. Zhang , D. Z. Wang , and E. Wu . Uncovering the Relational Web . In WebDB , 2008 . M. J. Cafarella, A. Halevy, Y. Zhang, D. Z. Wang, and E. Wu. Uncovering the Relational Web. In WebDB, 2008."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453916"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/382979.383040"},{"key":"e_1_2_1_11_1","unstructured":"Cars.com FAQ. http:\/\/siy.cars.com\/siy\/qsg\/faqGeneralInfo.jsp#howmanyads.  Cars.com FAQ. http:\/\/siy.cars.com\/siy\/qsg\/faqGeneralInfo.jsp#howmanyads."},{"key":"e_1_2_1_12_1","unstructured":"Cazoodle Apartment Search. http:\/\/apartments.cazoodle.com\/.  Cazoodle Apartment Search. http:\/\/apartments.cazoodle.com\/."},{"key":"e_1_2_1_13_1","volume-title":"VLDB-IIWeb","author":"Chang K. C.-C.","year":"2004","unstructured":"K. C.-C. Chang , B. He , and Z. Zhang . MetaQuerier over the Deep Web: Shallow Integration across Holistic Sources . In VLDB-IIWeb , 2004 . K. C.-C. Chang, B. He, and Z. Zhang. 
MetaQuerier over the Deep Web: Shallow Integration across Holistic Sources. In VLDB-IIWeb, 2004."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/988672.988687"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/872757.872784"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1230819.1241670"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.3115\/992133.992154"},{"key":"e_1_2_1_18_1","volume-title":"VLDB","author":"Ipeirotis P. G.","year":"2002","unstructured":"P. G. Ipeirotis and L. Gravano . Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection . In VLDB , 2002 . P. G. Ipeirotis and L. Gravano. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection. In VLDB, 2002."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2005.39"},{"key":"e_1_2_1_20_1","volume-title":"CIDR","author":"Madhavan J.","year":"2007","unstructured":"J. Madhavan , S. Jeffery , S. Cohen , X. Dong , D. Ko , C. Yu , and A. Halevy . Web-scale Data Integration: You can only afford to Pay As You Go . In CIDR , 2007 . J. Madhavan, S. Jeffery, S. Cohen, X. Dong, D. Ko, C. Yu, and A. Halevy. Web-scale Data Integration: You can only afford to Pay As You Go. In CIDR, 2007."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.14778\/1454159.1454163"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1065385.1065407"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376702"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242667"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/1613715.1613787"},{"key":"e_1_2_1_26_1","unstructured":"Trulia. http:\/\/www.trulia.com\/.  Trulia. 
http:\/\/www.trulia.com\/."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1321440.1321449"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007582"}],"container-title":["ACM SIGMOD Record"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1519103.1519112","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1519103.1519112","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T13:29:53Z","timestamp":1750253393000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1519103.1519112"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,3,20]]},"references-count":28,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2009,3,20]]}},"alternative-id":["10.1145\/1519103.1519112"],"URL":"https:\/\/doi.org\/10.1145\/1519103.1519112","relation":{},"ISSN":["0163-5808"],"issn-type":[{"value":"0163-5808","type":"print"}],"subject":[],"published":{"date-parts":[[2009,3,20]]},"assertion":[{"value":"2009-03-20","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}