{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T15:46:16Z","timestamp":1770219976063,"version":"3.49.0"},"reference-count":23,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2009,8]]},"abstract":"<jats:p>Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather sharp captures of entire Web sites, but the politeness etiquette and completeness requirement mandate very slow, long-duration crawling while Web sites undergo changes.<\/jats:p>\n          <jats:p>This paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies towards better quality with given resources. We define quality measures, characterize their properties, and derive a suite of quality-conscious scheduling strategies for archive crawling. It is assumed that change rates of Web pages can be statistically predicted based on page types, directory depths, and URL names. We develop a stochastically optimal crawl algorithm for the offline case where all change rates are known. We generalize the approach into an online algorithm that detects information on a Web site while it is crawled. For dating a site capture and for assessing its quality, we propose several strategies that revisit pages after their initial downloads in a judiciously chosen order. 
All strategies are fully implemented in a testbed, and shown to be effective by experiments with both synthetically generated sites and a daily crawl series for a medium-sized site.<\/jats:p>","DOI":"10.14778\/1687627.1687694","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"586-597","source":"Crossref","is-referenced-by-count":19,"title":["SHARC"],"prefix":"10.14778","volume":"2","author":[{"given":"Dimitar","family":"Denev","sequence":"first","affiliation":[{"name":"Max Planck Institute for Informatics Campus, Saarbr\u00fccken, Germany"}]},{"given":"Arturas","family":"Mazeika","sequence":"additional","affiliation":[{"name":"Max Planck Institute for Informatics Campus, Saarbr\u00fccken, Germany"}]},{"given":"Marc","family":"Spaniol","sequence":"additional","affiliation":[{"name":"Max Planck Institute for Informatics Campus, Saarbr\u00fccken, Germany"}]},{"given":"Gerhard","family":"Weikum","sequence":"additional","affiliation":[{"name":"Max Planck Institute for Informatics Campus, Saarbr\u00fccken, Germany"}]}],"member":"320","published-online":{"date-parts":[[2009,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1498759.1498837"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/38714.38760"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1409220.1409223"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.841784"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/1032657.1034011"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/335191.335391"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/857166.857170"},{"key":"e_1_2_1_8_1","first-page":"161","volume-title":"WWW","author":"Cho J.","year":"1998","unstructured":"J. Cho, H. 
Garcia-Molina, L. Page. Efficient crawling through url ordering. In WWW, pp. 161--172, 1998."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/1287369.1287414"},{"key":"e_1_2_1_10_1","first-page":"375","volume-title":"VLDB","author":"Cho J.","year":"2007","unstructured":"J. Cho, U. Schonfeld. Rankmass crawler: a crawler with high personalized pagerank coverage guarantee. In VLDB, pp. 375--386. 2007."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/253262.253353"},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-540-46332-0","volume-title":"Web Archiving","author":"Masan\u00e8s J.","year":"2006","unstructured":"J. Masan\u00e8s, editor. Web Archiving, Springer, 2006."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-006-0035-9"},{"key":"e_1_2_1_14_1","volume-title":"SNA-KDD","author":"Klamma R.","year":"2008","unstructured":"R. Klamma, C. Haasler. Wikis as social networks: Evolution and dynamics. In SNA-KDD, 2008."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1367497.1367556"},{"key":"e_1_2_1_16_1","volume-title":"Size, Topology and Use","author":"Levene M.","year":"2004","unstructured":"M. Levene, A. Poulovassilis, eds. Web Dynamics - Adapting to Change in Content, Size, Topology and Use. Springer, 2004."},{"key":"e_1_2_1_17_1","volume-title":"IWAW","author":"Mohr G.","year":"2004","unstructured":"G. Mohr, M. 
Kimpton, M. Stack, I. Ranitovic. Introduction to heritrix, an archival quality web crawler. In IWAW, 2004."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/371920.371965"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/988672.988674"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1367497.1367557"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/564691.564701"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242615"},{"key":"e_1_2_1_23_1","unstructured":"Debunking the Wayback Machine http:\/\/practice.com\/2008\/12\/29\/debunking-the-wayback-machine"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1687627.1687694","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T11:29:57Z","timestamp":1672226997000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1687627.1687694"}},"subtitle":["framework for quality-conscious web archiving"],"short-title":[],"issued":{"date-parts":[[2009,8]]},"references-count":23,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2009,8]]}},"alternative-id":["10.14778\/1687627.1687694"],"URL":"https:\/\/doi.org\/10.14778\/1687627.1687694","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2009,8]]}}}