{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T06:11:13Z","timestamp":1775283073722,"version":"3.50.1"},"reference-count":31,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2008,2,1]],"date-time":"2008-02-01T00:00:00Z","timestamp":1201824000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Web"],"published-print":{"date-parts":[[2008,2]]},"abstract":"<jats:p>Automatically generated content is ubiquitous in the web: dynamic sites built using the three-tier paradigm are good examples (e.g., commercial sites, blogs and other sites edited using web authoring software), as well as less legitimate spamdexing attempts (e.g., link farms, faked directories).<\/jats:p>\n          <jats:p>Those pages built using the same generating method (template or script) share a common \u201clook and feel\u201d that is not easily detected by common text classification methods, but is more related to stylometry.<\/jats:p>\n          <jats:p>In this work we study and compare several HTML style similarity measures based on both textual and extra-textual features in HTML source code. We also propose a flexible algorithm to cluster a large collection of documents according to these measures. Since the proposed algorithm is based on locality sensitive hashing (LSH), we first review this technique.<\/jats:p>\n          <jats:p>We then describe how to use the HTML style similarity clusters to pinpoint dubious pages and enhance the quality of spam classifiers. 
We present an evaluation of our algorithm on the WEBSPAM-UK2006 dataset.<\/jats:p>","DOI":"10.1145\/1326561.1326564","type":"journal-article","created":{"date-parts":[[2008,3,12]],"date-time":"2008-03-12T22:35:44Z","timestamp":1205361344000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":53,"title":["Tracking Web spam with HTML style similarities"],"prefix":"10.1145","volume":"2","author":[{"given":"Tanguy","family":"Urvoy","sequence":"first","affiliation":[{"name":"Orange Labs (France Telecom R&D), Lannion cedex, France"}]},{"given":"Emmanuel","family":"Chauveau","sequence":"additional","affiliation":[{"name":"Orange Labs (France Telecom R&D), Lannion cedex, France"}]},{"given":"Pascal","family":"Filoche","sequence":"additional","affiliation":[{"name":"Orange Labs (France Telecom R&D), Lannion cedex, France"}]},{"given":"Thomas","family":"Lavergne","sequence":"additional","affiliation":[{"name":"Orange Labs and ENST Paris, Lannion cedex, France"}]}],"member":"320","published-online":{"date-parts":[[2008,3,3]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/511446.511522"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1060745.1060840"},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'06)","author":"Bencz\u00far A.","unstructured":"Bencz\u00far, A., Csalog\u00e1ny, K., and Sarl\u00f3s, T. 2006. Link-based similarity search to fight web spam. In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'06). 
Seattle, WA."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-006-8364-x"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the Compression and Complexity of Sequences (SEQUENCES'97)","author":"Broder A.","year":"1997","unstructured":"Broder, A. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences (SEQUENCES'97). IEEE Computer Society. 21."},{"key":"e_1_2_1_6_1","doi-asserted-by":"crossref","unstructured":"Broder A. Z. Glassman S. C. Manasse M. S. and Zweig G. 1997. Syntactic clustering of the web. In Selected Papers from the 6th International Conference on World Wide Web. Elsevier Science Publishers 1157--1166.","DOI":"10.1016\/S0169-7552(97)00031-7"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1189702.1189703"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1277741.1277814"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242582"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/509907.509965"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1141277.1141534"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1017074.1017077"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1076034.1076066"},{"key":"e_1_2_1_14_1","unstructured":"Filoche P. Urvoy T. Emmanuel C. and Lavergne T. 2007. France Telecom R&D entry. 
Web Spam Challenge 2007 (Track I)."},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the 3rd Biannual Conference of International Association of Forensic Linguists (IAFL'97)","author":"Gray A.","year":"1997","unstructured":"Gray, A., Sallis, P., and MacDonell, S. 1997. Software forensics: Extending authorship analysis techniques to computer programs. In Proceedings of the 3rd Biannual Conference of International Association of Forensic Linguists (IAFL'97). 1--8."},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'05)","author":"Gy\u00f6ngyi Z.","unstructured":"Gy\u00f6ngyi, Z. and Garcia-Molina, H. 2005. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'05). Chiba, Japan."},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the 30th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann, 576--587","author":"Gy\u00f6ngyi Z.","unstructured":"Gy\u00f6ngyi, Z., Garcia-Molina, H., and Pedersen, J. 2004. Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB). 
Morgan Kaufmann, 576--587."},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the USENIX Workshop on Electronic Commerce.","author":"Heintze N.","year":"1996","unstructured":"Heintze, N. 1996. Scalable document fingerprinting. In Proceedings of the USENIX Workshop on Electronic Commerce."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1148170.1148222"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/276698.276876"},{"key":"e_1_2_1_21_1","unstructured":"Jenkins B. 1997. A hash function for hash table lookup. Dr Dobbs Journal."},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of Young Scientists' Conference on Information Retrieval (RJCRI'06)","author":"Lavergne T.","year":"2006","unstructured":"Lavergne, T. 2006. Unnatural language detection. In Proceedings of Young Scientists' Conference on Information Retrieval (RJCRI'06)."},{"key":"e_1_2_1_23_1","unstructured":"Manber U. 1994. Finding similar files in a large file system. In USENIX Winter. 1--10."},{"key":"e_1_2_1_24_1","unstructured":"McEnery T. and Oakes M. 2000. Authorship identification and computational stylometry. In Handbook of Natural Language Processing. Marcel Dekker Inc."},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of 27th German Conference on Artificial Intelligence (KI-04)","volume":"3238","author":"Meyer Zu Eissen S.","unstructured":"Meyer Zu Eissen, S. and Stein, B. 2004. 
Genre classification of web pages. In Proceedings of 27th German Conference on Artificial Intelligence (KI-04), S. Biundo, T. Fr\u00fchwirth, and G. Palm, Eds. Lecture Notes in Computer Science, vol. 3238."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/1135777.1135794"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/872757.872770"},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'06)","author":"Urvoy T.","unstructured":"Urvoy, T., Lavergne, T., and Filoche, P. 2006. Tracking web spam with hidden style similarity. In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'06)."},{"key":"e_1_2_1_29_1","volume-title":"Information Retrieval","author":"Van Rijsbergen C. J.","unstructured":"Van Rijsbergen, C. J. 1979. Information Retrieval 2nd ed. University of Glasgow, Glasgow, Scotland, UK.","edition":"2"},{"key":"e_1_2_1_30_1","unstructured":"Westbrook A. and Greene R. 2002. Using semantic analysis to classify search engine spam. Tech. rep. 
Stanford University."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/281250.281256"}],"container-title":["ACM Transactions on the Web"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1326561.1326564","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1326561.1326564","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T13:56:25Z","timestamp":1750254985000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1326561.1326564"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2008,2]]},"references-count":31,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2008,2]]}},"alternative-id":["10.1145\/1326561.1326564"],"URL":"https:\/\/doi.org\/10.1145\/1326561.1326564","relation":{},"ISSN":["1559-1131","1559-114X"],"issn-type":[{"value":"1559-1131","type":"print"},{"value":"1559-114X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2008,2]]},"assertion":[{"value":"2007-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2007-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2008-03-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}