{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,1]],"date-time":"2026-06-01T20:34:38Z","timestamp":1780346078205,"version":"3.54.1"},"reference-count":21,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2009,8]]},"abstract":"<jats:p>\n            The Web contains a vast amount of structured information such as HTML tables, HTML lists and deep-web databases; there is enormous potential in combining and re-purposing this data in creative ways. However, integrating data from this\n            <jats:italic>relational web<\/jats:italic>\n            raises several challenges that are not addressed by current data integration systems or mash-up tools. First, the structured data is usually not published cleanly and must be extracted (say, from an HTML list) before it can be used. Second, due to the vastness of the corpus, a user can never know all of the potentially-relevant databases ahead of time (much less write a wrapper or mapping for each one); the source databases must be discovered during the integration process. Third, some of the important information regarding the data is only present in its enclosing web page and needs to be extracted appropriately.\n          <\/jats:p>\n          <jats:p>This paper describes Octopus, a system that combines search, extraction, data cleaning and integration, and enables users to create new data sets from those found on the Web. The key idea underlying Octopus is to offer the user a set of best-effort operators that automate the most labor-intensive tasks. 
For example, the Search operator takes a search-style keyword query and returns a set of relevance-ranked and similarity-clustered structured data sources on the Web; the Context operator helps the user specify the semantics of the sources by inferring attribute values that may not appear in the source itself, and the Extend operator helps the user find related sources that can be joined to add new attributes to a table. Octopus executes some of these operators automatically, but always allows the user to provide feedback and correct errors. We describe the algorithms underlying each of these operators and experiments that demonstrate their efficacy.<\/jats:p>","DOI":"10.14778\/1687627.1687750","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"1090-1101","source":"Crossref","is-referenced-by-count":161,"title":["Data integration for the relational web"],"prefix":"10.14778","volume":"2","author":[{"given":"Michael J.","family":"Cafarella","sequence":"first","affiliation":[{"name":"University of Washington, Seattle, WA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Alon","family":"Halevy","sequence":"additional","affiliation":[{"name":"Google, Inc., Mountain View, CA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Nodira","family":"Khoussainova","sequence":"additional","affiliation":[{"name":"University of Washington, Seattle, WA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2009,8]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"CIDR","author":"Bernstein P. A.","year":"2003","unstructured":"P. A. Bernstein. Applying Model Management to Classical Meta Data Problems. 
In CIDR, 2003."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/362686.362692"},{"key":"e_1_2_1_3_1","volume-title":"WebDB","author":"Cafarella M. J.","year":"2008","unstructured":"M. J. Cafarella, A. Y. Halevy, Y. Zhang, D. Z. Wang, and E. Wu. Uncovering the Relational Web. In WebDB, 2008."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453916"},{"key":"e_1_2_1_5_1","volume-title":"Sixth Meeting on Mathematics of Language","author":"da Silva J. F.","year":"1999","unstructured":"J. F. da Silva and G. P. Lopes. A Local Maxima Method and a Fair Dispersion Normalization for Extracting Multi-Word Units from Corpora. Sixth Meeting on Mathematics of Language, 1999."},{"key":"e_1_2_1_6_1","first-page":"399","volume-title":"VLDB","author":"DeRose P.","year":"2007","unstructured":"P. DeRose, W. Shen, F. Chen, A. Doan, and R. Ramakrishnan. Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach. In VLDB, pages 399--410, 2007."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/375663.375731"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1066157.1066168"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687749"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/988672.988687"},{"key":"e_1_2_1_11_1","first-page":"67","volume-title":"AAAI\/IAAI","author":"Friedman M.","year":"1999","unstructured":"M. 
Friedman, A. Y. Levy, and T. D. Millstein. Navigational Plans for Data Integration. In AAAI\/IAAI, pages 67--73, 1999."},{"key":"e_1_2_1_12_1","first-page":"371","volume-title":"VLDB","author":"Galhardas H.","year":"2001","unstructured":"H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative Data Cleaning: Language, Model, and Algorithms. In VLDB, pages 371--380, 2001."},{"key":"e_1_2_1_13_1","first-page":"624","volume-title":"ECML\/PKDD (1)","author":"Kok S.","year":"2008","unstructured":"S. Kok and P. Domingos. Extracting Semantic Networks from Text Via Relational Clustering. In ECML\/PKDD (1), pages 624--639, 2008."},{"key":"e_1_2_1_14_1","unstructured":"Microsoft Popfly. http:\/\/www.popfly.com\/."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/s007780100057"},{"key":"e_1_2_1_16_1","first-page":"381","volume-title":"VLDB","author":"Raman V.","year":"2001","unstructured":"V. Raman and J. M. Hellerstein. Potter's Wheel: An Interactive Data Cleaning System. In VLDB, pages 381--390, 2001."},{"key":"e_1_2_1_17_1","volume-title":"NIPS","author":"Sarawagi S.","year":"2004","unstructured":"S. Sarawagi and W. W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction. 
In NIPS, 2004."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1216295.1216328"},{"key":"e_1_2_1_19_1","volume-title":"CoRR","author":"Turney P. D.","year":"2002","unstructured":"P. D. Turney. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. CoRR, 2002."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1240624.1240842"},{"key":"e_1_2_1_21_1","unstructured":"Yahoo Pipes. http:\/\/pipes.yahoo.com\/pipes\/."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1687627.1687750","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T11:35:17Z","timestamp":1672227317000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1687627.1687750"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,8]]},"references-count":21,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2009,8]]}},"alternative-id":["10.14778\/1687627.1687750"],"URL":"https:\/\/doi.org\/10.14778\/1687627.1687750","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2009,8]]}}}