{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,17]],"date-time":"2026-02-17T03:39:46Z","timestamp":1771299586667,"version":"3.50.1"},"reference-count":21,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2009,8]]},"abstract":"<jats:p>Online market intelligence (OMI), in particular competitive intelligence for product pricing, is a very important application area for Web data extraction. However, OMI presents non-trivial challenges to data extraction technology. Sophisticated and highly parameterized navigation and extraction tasks are required. On-the-fly data cleansing is necessary in order to identify identical products from different suppliers. It must be possible to smoothly define data flow scenarios that merge and filter streams of extracted data stemming from several Web sites and store the resulting data into a data warehouse, where the data is subjected to market intelligence analytics. Finally, the system must be highly scalable, in order to be able to extract and process massive amounts of data in a short time. Lixto (www.lixto.com), a company offering data extraction tools and services, has been providing OMI solutions for several customers. In this paper we show how Lixto has tackled each of the above challenges by improving and extending its original data extraction software. Most importantly, we show how high scalability is achieved through cloud computing. 
This paper also features a case study from the computers and electronics market.<\/jats:p>","DOI":"10.14778\/1687553.1687580","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"1512-1523","source":"Crossref","is-referenced-by-count":27,"title":["Scalable web data extraction for online market intelligence"],"prefix":"10.14778","volume":"2","author":[{"given":"Robert","family":"Baumgartner","sequence":"first","affiliation":[{"name":"Lixto Software GmbH, Vienna, Austria"}]},{"given":"Georg","family":"Gottlob","sequence":"additional","affiliation":[{"name":"Oxford University, Oxford, UK"}]},{"given":"Marcus","family":"Herzog","sequence":"additional","affiliation":[{"name":"Lixto Software GmbH, Vienna, Austria"}]}],"member":"320","published-online":{"date-parts":[[2009,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1016\/S1389-1286(00)00073-6"},{"key":"e_1_2_1_2_1","volume-title":"Proc. of IAWTIC","author":"Baumgartner R.","year":"2005","unstructured":"R. Baumgartner , M. Ceresna , and G. Lederm\u00fcller . Deep web navigation in web data extraction . In Proc. of IAWTIC , 2005 . R. Baumgartner, M. Ceresna, and G. Lederm\u00fcller. Deep web navigation in web data extraction. In Proc. of IAWTIC, 2005."},{"key":"e_1_2_1_3_1","volume-title":"Web Crawling and Recursive Wrapping with Lixto. In Proc. of LPNMR","author":"Baumgartner R.","year":"2001","unstructured":"R. Baumgartner , S. Flesca , and G. Gottlob . Declarative Information Extraction , Web Crawling and Recursive Wrapping with Lixto. In Proc. of LPNMR , 2001 . R. Baumgartner, S. Flesca, and G. Gottlob. Declarative Information Extraction, Web Crawling and Recursive Wrapping with Lixto. In Proc. of LPNMR, 2001."},{"key":"e_1_2_1_4_1","volume-title":"Proc. of VLDB","author":"Baumgartner R.","year":"2001","unstructured":"R. Baumgartner , S. Flesca , and G. Gottlob . 
Visual Web Information Extraction with Lixto . In Proc. of VLDB , 2001 . R. Baumgartner, S. Flesca, and G. Gottlob. Visual Web Information Extraction with Lixto. In Proc. of VLDB, 2001."},{"key":"e_1_2_1_5_1","volume-title":"Encyclopedia of Database Systems (to appear)","author":"Baumgartner R.","year":"2009","unstructured":"R. Baumgartner , W. Gatterbauer , and G. Gottlob . Web Data Extraction System . In Encyclopedia of Database Systems (to appear) . Springer-Verlag New York, Inc. , 2009 . R. Baumgartner, W. Gatterbauer, and G. Gottlob. Web Data Extraction System. In Encyclopedia of Database Systems (to appear). Springer-Verlag New York, Inc., 2009."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification","author":"Baxter R.","year":"2003","unstructured":"R. Baxter , P. Christen , and T. Churches . A comparison of fast blocking methods for record linkage . In Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification , 2003 . R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, 2003."},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the IADIS International Conference WWW\/Internet 2006","author":"Carme J.","year":"2006","unstructured":"J. Carme , M. Ceresna , and M. Goebel . Web wrapper specification using compound filter learning . In Proceedings of the IADIS International Conference WWW\/Internet 2006 , 2006 . J. Carme, M. Ceresna, and M. Goebel. Web wrapper specification using compound filter learning. In Proceedings of the IADIS International Conference WWW\/Internet 2006, 2006."},{"issue":"5","key":"e_1_2_1_8_1","volume":"9","author":"Cunningham H.","year":"2005","unstructured":"H. Cunningham , K. Bontcheva , and Y. Li . Knowledge Management and Human Language: Crossing the Chasm. J. 
of Knowledge Management , 9 ( 5 ), 2005 . H. Cunningham, K. Bontcheva, and Y. Li. Knowledge Management and Human Language: Crossing the Chasm. J. of Knowledge Management, 9(5), 2005.","journal-title":"Knowledge Management and Human Language: Crossing the Chasm. J. of Knowledge Management"},{"key":"e_1_2_1_9_1","volume-title":"Web wrapper induction: a brief survey. AI Communications","author":"Flesca S.","year":"2004","unstructured":"S. Flesca , G. Manco , E. Masciari , E. Rende , and A. Tagarelli . Web wrapper induction: a brief survey. AI Communications Vol. 17\/2 , 2004 . S. Flesca, G. Manco, E. Masciari, E. Rende, and A. Tagarelli. Web wrapper induction: a brief survey. AI Communications Vol. 17\/2, 2004."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242583"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/543613.543617"},{"key":"e_1_2_1_12_1","volume-title":"CSIRO Mathematical and Information Sciences","author":"Gu L.","year":"2003","unstructured":"L. Gu , R. Baxter , D. Vickers , and C. Rainsford . Record linkage: Current practice and future directions. Technical report , CSIRO Mathematical and Information Sciences , 2003 . L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. Technical report, CSIRO Mathematical and Information Sciences, 2003."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.5555\/1304596.1304833"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/647212.720217"},{"key":"e_1_2_1_15_1","volume-title":"Net.ObjectDays","author":"Kuhlins S.","year":"2002","unstructured":"S. Kuhlins and R. Tredwell . Toolkits for generating wrappers . In Net.ObjectDays , 2002 . S. Kuhlins and R. Tredwell. Toolkits for generating wrappers. In Net.ObjectDays, 2002."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/565117.565137"},{"key":"e_1_2_1_17_1","volume-title":"Proc. 
of WWW","author":"Liu B.","year":"2005","unstructured":"B. Liu . Web Content Mining . In Proc. of WWW , Tutorial , 2005 . B. Liu. Web Content Mining. In Proc. of WWW, Tutorial, 2005."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/SERVICES-1.2008.36"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/319950.319962"},{"key":"e_1_2_1_20_1","volume-title":"Wrapper Generating Tools","author":"Tredwell R.","year":"2003","unstructured":"R. Tredwell and S. Kuhlins . Wrapper Generating Tools , 2003 . http:\/\/www.wifo.unimannheim.de\/kuhlins\/wrappertools\/. R. Tredwell and S. Kuhlins. Wrapper Generating Tools, 2003. http:\/\/www.wifo.unimannheim.de\/kuhlins\/wrappertools\/."},{"key":"e_1_2_1_21_1","volume-title":"As of","year":"2009","unstructured":"Wikipedia. Entry : Market Intelligence, 2009 . As of April 15, 2009 . Wikipedia. Entry: Market Intelligence, 2009. As of April 15, 2009."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1687553.1687580","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:59:04Z","timestamp":1672225144000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1687553.1687580"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,8]]},"references-count":21,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2009,8]]}},"alternative-id":["10.14778\/1687553.1687580"],"URL":"https:\/\/doi.org\/10.14778\/1687553.1687580","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2009,8]]}}}