{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:50:12Z","timestamp":1750308612390,"version":"3.41.0"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2019,5,21]],"date-time":"2019-05-21T00:00:00Z","timestamp":1558396800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGCOMM Comput. Commun. Rev."],"published-print":{"date-parts":[[2019,5,21]]},"abstract":"<jats:p>With vast amount of content online, it is not surprising that unscrupulous entities \"borrow\" from the web to provide content for advertisements, link farms, and spam. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms, one to automatically *discover* previously unknown duplicate content in the web, and the second to *precisely detect* copies of discovered or manually identified content. We show that *bad neighborhoods*, clusters of pages where copied content is frequent, help identify copying in the web. We verify our algorithm and its choices with controlled experiments over three web datasets: Common Crawl (2009\/10), GeoCities (1990s\u20132000s), and a phishing corpus (2014). We show that our use of cryptographic hashing is much more precise than alternatives such as locality-sensitive hashing, avoiding the thousands of false-positives that would otherwise occur. We apply our approach in three systems: discovering and detecting duplicated content in the web, searching explicitly for copies of Wikipedia in the web, and detecting phishing sites in a web browser. 
We show that general copying in the web is often benign (for example, templates), but 6\u201311% are commercial or possibly commercial. Most copies of Wikipedia (86%) are commercialized (link farming or advertisements). For phishing, we focus on PayPal, detecting 59% of PayPal-phish even without taking on intentional cloaking.<\/jats:p>","DOI":"10.1145\/3336937.3336940","type":"journal-article","created":{"date-parts":[[2019,5,23]],"date-time":"2019-05-23T18:01:38Z","timestamp":1558634498000},"page":"9-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Precise Detection of Content Reuse in the Web"],"prefix":"10.1145","volume":"49","author":[{"given":"Calvin","family":"Ardi","sequence":"first","affiliation":[{"name":"USC\/Information Sciences Institute"}]},{"given":"John","family":"Heidemann","sequence":"additional","affiliation":[{"name":"USC\/Information Sciences Institute"}]}],"member":"320","published-online":{"date-parts":[[2019,5,21]]},"reference":[{"key":"e_1_2_1_1_2","doi-asserted-by":"crossref","unstructured":"Steven Abney. 1991. Parsing by Chunks. Principle\u2013based Parsing. (1991).","DOI":"10.1007\/978-94-011-3474-3_10"},{"key":"e_1_2_1_2_2","unstructured":"Apache. {n. d.}a. Hadoop. http:\/\/hadoop.apache.org. ({n. d.})."},{"key":"e_1_2_1_3_2","unstructured":"Apache. {n. d.}b. Pig. http:\/\/pig.apache.org. ({n. d.})."},{"key":"e_1_2_1_4_2","unstructured":"ArchiveTeam. 2009. GeoCities. http:\/\/archiveteam.org\/index.php\/GeoCities. 
(2009)."},{"key":"e_1_2_1_6_2","doi-asserted-by":"publisher","DOI":"10.14722\/usec.2016.23012"},{"key":"e_1_2_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/357830.357849"},{"key":"e_1_2_1_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/362686.362692"},{"key":"e_1_2_1_9_2","doi-asserted-by":"publisher","DOI":"10.5555\/297805.297827"},{"key":"e_1_2_1_10_2","doi-asserted-by":"crossref","unstructured":"Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the Web. In Selected papers from the sixth international conference on World Wide Web. 1157\u20131166. http:\/\/dl.acm.org\/citation.cfm?id=283554.283370","DOI":"10.1016\/S0169-7552(97)00031-7"},{"key":"e_1_2_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/509907.509965"},{"key":"e_1_2_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/1840784.1840829"},{"key":"e_1_2_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/342009.335429"},{"key":"e_1_2_1_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/506309.506311"},{"key":"e_1_2_1_15_2","volume-title":"An Open Digest-based Technique for Spam Detection. ISCA PDCS 2004","author":"Damiani Ernesto","year":"2004","unstructured":"Ernesto Damiani, Sabrina De Capitani di Vimercati, Stefano Paraboschi, and Pierangela Samarati. 2004. An Open Digest-based Technique for Spam Detection. ISCA PDCS 2004 (2004), 559\u2013564. 
http:\/\/citeseerx.ist.psu.edu\/viewdoc\/summary?doi=10.1.1.61.6185"},{"key":"e_1_2_1_16_2","volume-title":"Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation -","volume":"6","author":"Dean Jeffrey","year":"2004","unstructured":"Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI'04). USENIX Association, Berkeley, CA, USA, 10\u201310. http:\/\/dl.acm.org\/citation.cfm?id=1251254.1251264"},{"key":"e_1_2_1_17_2","doi-asserted-by":"publisher","DOI":"10.28945\/974"},{"key":"e_1_2_1_18_2","doi-asserted-by":"publisher","DOI":"10.17487\/RFC3174"},{"key":"e_1_2_1_19_2","doi-asserted-by":"publisher","DOI":"10.1007\/11735106_66"},{"key":"e_1_2_1_20_2","unstructured":"Common Crawl Foundation. {n. d.}. Common Crawl. http:\/\/commoncrawl.org. ({n. d.})."},{"key":"e_1_2_1_21_2","volume-title":"Static HTML Dump of Wikipedia. (June","author":"Foundation Wikimedia","year":"2008","unstructured":"Wikimedia Foundation. 2008. Static HTML Dump of Wikipedia. (June 2008). http:\/\/dumps.wikimedia.org\/other\/static_html_dumps\/2008-06\/en\/"},{"key":"e_1_2_1_22_2","volume-title":"Wikimedia Statistics. (Feb","author":"Foundation Wikimedia","year":"2019","unstructured":"Wikimedia Foundation. 2019. Wikimedia Statistics. (Feb. 2019). 
https:\/\/stats.wikimedia.org\/v2\/#\/en.wikipedia.org {Online; accessed 2019-Feb-27}."},{"key":"e_1_2_1_23_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10586-013-0320-5"},{"key":"e_1_2_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/1148170.1148222"},{"key":"e_1_2_1_25_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1019213109274"},{"key":"e_1_2_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/1526709.1526721"},{"key":"e_1_2_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/MIC.2006.23"},{"key":"e_1_2_1_28_2","volume-title":"Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment","author":"Lv Qin","year":"2007","unstructured":"Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. 2007. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, Vienna, Austria, 950\u2013961. http:\/\/dl.acm.org\/citation.cfm?id=1325958"},{"key":"e_1_2_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/1557019.1557153"},{"volume-title":"2015 International Conference on Pervasive Computing (ICPC). 1\u20135.","author":"Malhotra J.","key":"e_1_2_1_30_2","unstructured":"J. Malhotra and J. Bakal. 2015. A survey and comparative study of data deduplication techniques. 
In 2015 International Conference on Pervasive Computing (ICPC). 1\u20135."},{"key":"e_1_2_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242592"},{"key":"e_1_2_1_32_2","first-page":"180","article-title":"Secure Hash Standard (SHS)","author":"National Institute of Standards and Technology.","year":"2008","unstructured":"National Institute of Standards and Technology. 2008. Secure Hash Standard (SHS). Federal Information Processing Standard (FIPS) 180-3. National Institute of Standards and Technology. http:\/\/csrc.nist.gov\/publications\/fips\/fips180-3\/fips180-3_final.pdf","journal-title":"Federal Information Processing Standard (FIPS)"},{"key":"e_1_2_1_33_2","unstructured":"OpenDNS. {n. d.}. PhishTank. ({n. d.}). http:\/\/www.phishtank.com"},{"key":"e_1_2_1_34_2","doi-asserted-by":"publisher","DOI":"10.5555\/1973430.1973432"},{"key":"e_1_2_1_35_2","doi-asserted-by":"publisher","DOI":"10.5555\/1083323.1083333"},{"key":"e_1_2_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/988672.988732"},{"key":"e_1_2_1_37_2","volume-title":"ACL Third Workshop on Very Large Corpora cmp-lg\/9505040","author":"Lance","year":"1995","unstructured":"Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text Chunking using Transformation-Based Learning. ACL Third Workshop on Very Large Corpora cmp-lg\/9505040 (1995), 82\u201394. 
http:\/\/arxiv.org\/abs\/cmp-lg\/9505040"},{"key":"e_1_2_1_38_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijar.2008.11.006"},{"key":"e_1_2_1_39_2","volume-title":"Why Data Mining Won't Stop Terror. Wired Magazine (9","author":"Schneier Bruce","year":"2005","unstructured":"Bruce Schneier. 2005. Why Data Mining Won't Stop Terror. Wired Magazine (9 March 2005). https:\/\/schneier.com\/essays\/archives\/2005\/03\/why_data_mining_wont.html"},{"key":"e_1_2_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/331697.335176"},{"key":"e_1_2_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/347059.347408"},{"key":"e_1_2_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/1328964.1328976"},{"key":"e_1_2_1_43_2","first-page":"1","article-title":"GNU Parallel - The Command-Line Power Tool. ;login","volume":"36","author":"Tange O.","year":"2011","unstructured":"O. Tange. 2011. GNU Parallel - The Command-Line Power Tool. ;login: The USENIX Magazine 36, 1 (Feb 2011), 42\u201347. http:\/\/www.gnu.org\/s\/parallel","journal-title":"The USENIX Magazine"},{"key":"e_1_2_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/1390334.1390431"},{"volume-title":"Reusing Wikipedia Content. (31","year":"2017","key":"e_1_2_1_45_2","unstructured":"Wikipedia. 2017. Reusing Wikipedia Content. (31 Oct. 2017). 
https:\/\/en.wikipedia.org\/wiki\/Wikipedia:Reusing_Wikipedia_content {Online; accessed 23-Feb-2019}."},{"key":"e_1_2_1_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/1148170.1148243"},{"key":"e_1_2_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2005.47"},{"key":"e_1_2_1_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/1835449.1835562"}],"container-title":["ACM SIGCOMM Computer Communication Review"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3336937.3336940","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3336937.3336940","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T19:07:25Z","timestamp":1750273645000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3336937.3336940"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,5,21]]},"references-count":47,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2019,5,21]]}},"alternative-id":["10.1145\/3336937.3336940"],"URL":"https:\/\/doi.org\/10.1145\/3336937.3336940","relation":{},"ISSN":["0146-4833"],"issn-type":[{"type":"print","value":"0146-4833"}],"subject":[],"published":{"date-parts":[[2019,5,21]]},"assertion":[{"value":"2019-05-21","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}