{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2022,12,29]],"date-time":"2022-12-29T05:20:51Z","timestamp":1672291251962},"reference-count":27,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2009,8]]},"abstract":"<jats:p>\n            Web archives preserve the history of autonomous Web sites and are potential gold mines for all kinds of media and business analysts. The most common Web archiving technique uses crawlers to automate the process of collecting Web pages. However, (re)downloading the entire collection of pages periodically from a large Web site is unfeasible. In this paper, we take a step towards addressing this problem. We devise a data mining-driven policy for\n            <jats:italic>selectively<\/jats:italic>\n            (re)downloading Web pages that are located in hierarchical directory structures which are believed to have\n            <jats:italic>changed significantly<\/jats:italic>\n            (e.g., a substantial percentage of pages are inserted into\/removed from the directory). Consequently, there is no need to download and maintain pages that have not changed since the last crawl as they can be easily retrieved from the archive.\n          <\/jats:p>\n          <jats:p>\n            In our approach, we propose an off-line data mining algorithm called NEAR-\n            <jats:italic>Miner<\/jats:italic>\n            that analyzes the evolution history of Web directory structures of the original Web site stored in the archive and mines\n            <jats:italic>negatively correlated association rules<\/jats:italic>\n            (NEAR) between\n            <jats:italic>ancestor-descendant<\/jats:italic>\n            Web directories. These rules indicate the evolution correlations between Web directories. 
Using the discovered rules, we propose an efficient Web archive maintenance algorithm called WARM that optimally skips the subdirectories (during the next crawl) which are\n            <jats:italic>negatively correlated<\/jats:italic>\n            with it in undergoing\n            <jats:italic>significant<\/jats:italic>\n            changes. Our experimental results with real data show that our approach improves the efficiency of the archive maintenance process significantly while sacrificing only slightly on the \"freshness\" of the archives. Furthermore, our experiments demonstrate that it is not necessary to discover NEARs frequently as the mined rules can be utilized effectively for archive maintenance over multiple versions.\n          <\/jats:p>","DOI":"10.14778\/1687627.1687757","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"1150-1161","source":"Crossref","is-referenced-by-count":0,"title":["NEAR-Miner"],"prefix":"10.14778","volume":"2","author":[{"given":"Ling","family":"Chen","sequence":"first","affiliation":[{"name":"L3S\/University of Hannover, Hannover, Germany"}]},{"given":"Sourav S.","family":"Bhowmick","sequence":"additional","affiliation":[{"name":"Nanyang Technological University, Singapore"}]},{"given":"Wolfgang","family":"Nejdl","sequence":"additional","affiliation":[{"name":"L3S\/University of Hannover, Hannover, 
Germany"}]}],"member":"320","published-online":{"date-parts":[[2009,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.5555\/1053072.1053078"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-74469-6_28"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/253260.253327"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/253260.253266"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-00887-0_62"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.datak.2005.09.002"},{"key":"e_1_2_1_7_1","unstructured":"L. Chen, S. S. Bhowmick, and W. Nejdl. Autonomous Web Archive Maintenance Based on Evolution Associations of Web Site Directories. Technical report available at http:\/\/www.cais.ntu.edu.sg\/~assourav\/TechReports\/NEAR-TR.pdf"},{"key":"e_1_2_1_8_1","volume-title":"VLDB","author":"Cho J.","year":"2000","unstructured":"J. Cho and H. Garcia-Molina. The Evolution of the Web and Implications for an Incremental Crawler. In VLDB, 2000."},{"key":"e_1_2_1_9_1","volume-title":"WWW","author":"Cho J.","year":"1998","unstructured":"J. Cho, H. Garcia-Molina, and L. Page. Efficient Crawling Through URL Ordering. 
In WWW, 1998."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/958942.958945"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/857166.857170"},{"key":"e_1_2_1_12_1","volume-title":"Statistical Power Analysis for the Behavioral Sciences","author":"Cohen J.","year":"1988","unstructured":"J. Cohen. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, 1988."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/371920.371960"},{"key":"e_1_2_1_14_1","volume-title":"WebKDD","author":"Fu Y.","year":"1999","unstructured":"Y. Fu, K. Sandhu, and M. Shih. A Generalization-based Approach to Clustering Web Usage Sessions. In WebKDD, 1999."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1148170.1148222"},{"key":"e_1_2_1_16_1","doi-asserted-by":"crossref","unstructured":"J. Masan\u00e8s. Web Archiving. Springer New York Inc. Secaucus N.J. 
2006.","DOI":"10.1007\/978-3-540-46332-0"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/383952.383995"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2007.190667"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/988672.988674"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1367497.1367557"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-662-10874-1_7"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1060745.1060805"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1526993.1526999"},{"key":"e_1_2_1_24_1","volume-title":"Introduction to Data Mining","author":"Tan P.-N.","year":"2006","unstructured":"P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2006."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/581804"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2003.1260818"},{"key":"e_1_2_1_27_1","volume-title":"ICML","author":"Wu X.","year":"2002","unstructured":"X. Wu, C. Zhang, and S. Zhang. Mining both Positive and Negative Association Rules. 
In ICML, 2002."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1687627.1687757","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T11:35:41Z","timestamp":1672227341000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1687627.1687757"}},"subtitle":["mining evolution associations of web site directories for efficient maintenance of web archives"],"short-title":[],"issued":{"date-parts":[[2009,8]]},"references-count":27,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2009,8]]}},"alternative-id":["10.14778\/1687627.1687757"],"URL":"https:\/\/doi.org\/10.14778\/1687627.1687757","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2009,8]]}}}