{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,11]],"date-time":"2025-12-11T20:43:22Z","timestamp":1765485802100},"reference-count":25,"publisher":"Association for Computing Machinery (ACM)","issue":"1-2","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2010,9]]},"abstract":"<jats:p>\n            We propose a novel extraction approach that exploits\n            <jats:italic>content redundancy<\/jats:italic>\n            on the web to extract structured data from\n            <jats:italic>template-based<\/jats:italic>\n            web sites. We start by populating a seed database with records extracted from a few initial sites. We then identify values within the pages of each new site that match attribute values contained in the seed set of records. To match attribute values with diverse representations across sites, we define a new similarity metric that leverages the templatized structure of attribute content. Specifically, our metric discovers the matching pattern between attribute values from two sites, and uses this to ignore extraneous portions of attribute values when computing similarity scores. Further, to filter out noisy attribute value matches, we exploit the fact that attribute values occur at fixed positions within template-based sites. We develop an efficient Apriori-style algorithm to systematically enumerate attribute position configurations with sufficient matching values across pages. 
Finally, we conduct an extensive experimental study with real-life web data to demonstrate the effectiveness of our extraction approach.\n          <\/jats:p>","DOI":"10.14778\/1920841.1920915","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"578-587","source":"Crossref","is-referenced-by-count":11,"title":["Exploiting content redundancy for web information extraction"],"prefix":"10.14778","volume":"3","author":[{"given":"Pankaj","family":"Gulhane","sequence":"first","affiliation":[{"name":"Yahoo! Labs, Bangalore"}]},{"given":"Rajeev","family":"Rastogi","sequence":"additional","affiliation":[{"name":"Yahoo! Labs, Bangalore"}]},{"given":"Srinivasan H.","family":"Sengamedu","sequence":"additional","affiliation":[{"name":"Yahoo! Labs, Bangalore"}]},{"given":"Ashwin","family":"Tengli","sequence":"additional","affiliation":[{"name":"Microsoft IDC, Bangalore"}]}],"member":"320","published-online":{"date-parts":[[2010,9]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1014052.1014058"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/336597.336644"},{"key":"e_1_2_1_3_1","volume-title":"SIGMOD","author":"Agrawal R.","year":"1994","unstructured":"R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In SIGMOD, 1994."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/956750.956759"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/375663.375682"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5555\/646543.696220"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.9"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/276304.276323"},{"key":"e_1_2_1_9_1","volume-title":"VLDB","author":"Crescenzi V.","year":"2001","unstructured":"V. 
Crescenzi, G. Mecca, and P. Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In VLDB, 2001."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2007.9"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1526709.1526841"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1062745.1062763"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1526709.1526735"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1150402.1150457"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1142473.1142599"},{"key":"e_1_2_1_16_1","volume-title":"IJCAI","author":"Kushmerick N.","year":"1997","unstructured":"N. Kushmerick, D. S. Weld, and R. Doorenbos. Wrapper induction for information extraction. In IJCAI, 1997."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/775152.775166"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0304-3975(96)00268-X"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.5555\/1394399"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1010022931168"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-006-5833-1"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/872757.872796"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/775047.775087"},{"key":"e_1_2_1_24_1","volume-title":"Some biological sequence metrics. Advances in Math., 20(4)","author":"Waterman M.","year":"1976","unstructured":"M. Waterman, T. Smith, and W. Beyer. 
Some biological sequence metrics. Advances in Math., 20(4), 1976."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1060745.1060761"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1920841.1920915","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T11:49:17Z","timestamp":1672228157000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1920841.1920915"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2010,9]]},"references-count":25,"journal-issue":{"issue":"1-2","published-print":{"date-parts":[[2010,9]]}},"alternative-id":["10.14778\/1920841.1920915"],"URL":"https:\/\/doi.org\/10.14778\/1920841.1920915","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2010,9]]}}}