{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T06:11:14Z","timestamp":1775283074658,"version":"3.50.1"},"reference-count":28,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2009,1,1]],"date-time":"2009-01-01T00:00:00Z","timestamp":1230768000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Web"],"published-print":{"date-parts":[[2009,1]]},"abstract":"<jats:p>We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in Web sites, as Web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, <jats:italic>DustBuster<\/jats:italic>, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or Web server logs, <jats:italic>without<\/jats:italic> examining page contents. Verifying these rules via sampling requires fetching few actual Web pages. 
Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.<\/jats:p>","DOI":"10.1145\/1462148.1462151","type":"journal-article","created":{"date-parts":[[2009,1,20]],"date-time":"2009-01-20T14:41:13Z","timestamp":1232462473000},"page":"1-31","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":22,"title":["Do not crawl in the DUST"],"prefix":"10.1145","volume":"3","author":[{"given":"Ziv","family":"Bar-Yossef","sequence":"first","affiliation":[{"name":"Technion Israel Institute of Technology, Haifa, Israel"}]},{"given":"Idit","family":"Keidar","sequence":"additional","affiliation":[{"name":"Technion Israel Institute of Technology, Haifa, Israel"}]},{"given":"Uri","family":"Schonfeld","sequence":"additional","affiliation":[{"name":"University of California Los Angeles, CA"}]}],"member":"320","published-online":{"date-parts":[[2009,1,17]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Apache 2008. Apache. http server version 2.2 configuration files. http:\/\/httpd.apache.org\/docs\/2.2\/configuring.html."},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), 487--499","author":"Agrawal R.","unstructured":"Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), 487--499."},{"key":"e_1_2_1_3_1","unstructured":"Analog. 2008. Analog homepage. 
http:\/\/www.analog.cx\/."},{"key":"e_1_2_1_4_1","unstructured":"Berners-Lee, T., Fielding, R., and Masinter, L. Uniform resource identifiers (URI): Generic syntax. http:\/\/www.ietf.org\/rfc\/rfc2396.txt."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1016\/S1389-1286(99)00021-3"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1002\/1097-4571(2000)9999:9999%3C::AID-ASI1025%3E3.0.CO;2-0"},{"key":"e_1_2_1_7_1","unstructured":"Bognar, M. 1995. A survey on abstract rewriting. www.di.ubi.pt\/~desousa\/1998-1999\/logica\/mb.ps."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/223784.223855"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the 6th International World Wide Web Conference (WWW), 1157--1166","author":"Broder A. Z.","unstructured":"Broder, A. Z., Glassman, S. C., and Manasse, M. S. 1997. Syntactic clustering of the Web. In Proceedings of the 6th International World Wide Web Conference (WWW), 1157--1166."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/342009.335429"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the 11th International World Wide Web Conference (WWW).","author":"Di Iorio E.","unstructured":"Di Iorio, E., Diligenti, M., Gori, M., Maggini, M., and Pucci, A. 2003. Detecting near-replicas on the Web by content and hyperlink analysis. 
In Proceedings of the 11th International World Wide Web Conference (WWW)."},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the 1st USENIX Symposium on Internet Technologies and Systems (USITS).","author":"Douglis F.","unstructured":"Douglis, F., Feldman, A., Krishnamurthy, B., and Mogul, J. 1997. Rate of change and other metrics: A live study of the World Wide Web. In Proceedings of the 1st USENIX Symposium on Internet Technologies and Systems (USITS)."},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 25th Australasian Computer Science Conference (ACSC), 59--64","author":"Finkel R. A.","unstructured":"Finkel, R. A., Zaslavsky, A. B., Monostori, K., and Schmidt, H. W. 2002. Signature extraction for overlap detection in documents. In Proceedings of the 25th Australasian Computer Science Conference (ACSC), 59--64."},{"key":"e_1_2_1_14_1","volume-title":"Proceedings of the 4th International Conference on Parallel and Distributed Information Systems (PDIS), 68--79","author":"Garcia-Molina H.","unstructured":"Garcia-Molina, H., Gravano, L., and Shivakumar, N. 1996. Dscam: Finding document copies across multiple databases. 
In Proceedings of the 4th International Conference on Parallel and Distributed Information Systems (PDIS), 68--79."},{"key":"e_1_2_1_15_1","unstructured":"Garey, M. R. and Johnson, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman."},{"key":"e_1_2_1_16_1","unstructured":"Google Inc. 2008. Google sitemaps. http:\/\/sitemaps.google.com."},{"key":"e_1_2_1_17_1","volume-title":"Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology","author":"Gusfield D.","unstructured":"Gusfield, D. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1002\/asi.10170"},{"key":"e_1_2_1_19_1","unstructured":"Jaccard, P. 1908. Nouvelles recherches sur la distribution florale. 44 223--270."},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the 7th International Workshop on the Web and Databases (WebDB), 25--30","author":"Jain N.","unstructured":"Jain, N., Dahlin, M., and Tewari, R. 2005. Using bloom filters to refine Web search results. 
In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), 25--30."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/511446.511484"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/11751649_67"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1149941.1149972"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the 10th International Conference on Complex Systems (ICCS), 51--60","author":"Monostori K.","unstructured":"Monostori, K., Finkel, R. A., Zaslavsky, A. B., Hod\u00e1sz, G., and Pataki, M. 2002. Comparison of overlap detection techniques. In Proceedings of the 10th International Conference on Complex Systems (ICCS), 51--60."},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the 1st International Workshop on the Web and Databases (WebDB), 204--212","author":"Shivakumar N.","unstructured":"Shivakumar, N. and Garcia-Molina, H. 1998. Finding near-replicas of documents and servers on the Web. In Proceedings of the 1st International Workshop on the Web and Databases (WebDB), 204--212."},{"key":"e_1_2_1_27_1","unstructured":"StatCounter. 1998. Counter homepage. http:\/\/www.statcounter.com\/."},{"key":"e_1_2_1_28_1","unstructured":"WebLog Expert. 2008. WebLog expert homepage. 
http:\/\/www.weblogexpert.com\/."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/281250.281256"}],"container-title":["ACM Transactions on the Web"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1462148.1462151","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1462148.1462151","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T13:30:14Z","timestamp":1750253414000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1462148.1462151"}},"subtitle":["Different URLs with similar text"],"short-title":[],"issued":{"date-parts":[[2009,1]]},"references-count":28,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2009,1]]}},"alternative-id":["10.1145\/1462148.1462151"],"URL":"https:\/\/doi.org\/10.1145\/1462148.1462151","relation":{},"ISSN":["1559-1131","1559-114X"],"issn-type":[{"value":"1559-1131","type":"print"},{"value":"1559-114X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2009,1]]},"assertion":[{"value":"2007-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2008-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2009-01-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}