{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,4]],"date-time":"2026-05-04T10:20:32Z","timestamp":1777890032928,"version":"3.51.4"},"reference-count":55,"publisher":"SAGE Publications","issue":"4","license":[{"start":{"date-parts":[[2017,11,20]],"date-time":"2017-11-20T00:00:00Z","timestamp":1511136000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Web Intelligence"],"published-print":{"date-parts":[[2017,11,20]]},"abstract":"<jats:p>The World Wide Web has two main forms of architecture: the first is that which is explicitly encoded into web pages, and the second is that which is implied by the web content, particularly pertaining to look and feel. The latter is exemplified by the concept of a website, a concept that is only loosely defined, although users intuitively understand it. The Website Boundary Detection (WBD) problem is concerned with the task of identifying the complete collection of web pages\/resources that are contained within a single website. Whatever the case, the concept of a website is used with respect to a number of application domains, including website archiving, spam detection, and WWW analysis. In the context of such applications, it is beneficial if a website can be automatically identified. This is usually done by identifying a website of interest in terms of its boundary, the so-called WBD problem. In this paper, seven WBD techniques are proposed and compared: four statistical techniques, where the web data to be used is obtained a priori, and three dynamic techniques, where the data to be used is obtained as the process progresses. 
All seven techniques are presented in detail and evaluated.<\/jats:p>","DOI":"10.3233\/web-170365","type":"journal-article","created":{"date-parts":[[2017,11,21]],"date-time":"2017-11-21T14:37:06Z","timestamp":1511275026000},"page":"269-290","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":0,"title":["Mining the information architecture of the WWW using automated website boundary detection"],"prefix":"10.1177","volume":"15","author":[{"given":"Ayesh","family":"Alshukri","sequence":"first","affiliation":[{"name":"Department of Computer Science, University of Liverpool, Ashton Building, Ashton Street, L69 3BX, Liverpool, UK."}]},{"given":"Frans","family":"Coenen","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Liverpool, Ashton Building, Ashton Street, L69 3BX, Liverpool, UK."}]}],"member":"179","published-online":{"date-parts":[[2017,11,20]]},"reference":[{"key":"ref001","doi-asserted-by":"crossref","unstructured":"S.\u00a0Abiteboul, G.\u00a0Cobena, J.\u00a0Masan\u00e8s and G.\u00a0Sedrati, A first experience in archiving the French web, in: Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries, Lecture Notes in Computer Science, Vol.\u00a02458, Springer-Verlag, London, UK, 2002, pp.\u00a01\u201315. 
doi:10.1007\/3-540-45747-X_1.","DOI":"10.1007\/3-540-45747-X_1"},{"key":"ref002","unstructured":"R.K.\u00a0Ahuja, T.L.\u00a0Magnanti and J.B.\u00a0Orlin, Network Flows: Theory, Algorithms, and Applications, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993."},{"key":"ref003","unstructured":"A.\u00a0Alshukri, Website boundary detection via machine learning, Thesis, University of Liverpool, 2012."},{"key":"ref004","doi-asserted-by":"crossref","unstructured":"A.\u00a0Alshukri, F.\u00a0Coenen and M.\u00a0Zito, Web-site boundary detection, in: Proceedings of the 10th Industrial Conference on Data Mining, Springer, Berlin, Germany, 2010, pp.\u00a0529\u2013543.","DOI":"10.1007\/978-3-642-14400-4_41"},{"key":"ref005","doi-asserted-by":"crossref","unstructured":"A.\u00a0Alshukri, F.\u00a0Coenen and M.\u00a0Zito, Incremental web-site boundary detection using random walks, in: Proceedings of the 7th International Conference on Machine Learning and Data Mining, Springer, New York, USA, 2011, pp.\u00a0414\u2013427.","DOI":"10.1007\/978-3-642-23199-5_31"},{"key":"ref006","doi-asserted-by":"crossref","unstructured":"A.\u00a0Alshukri, F.\u00a0Coenen and M.\u00a0Zito, Web-site boundary detection using incremental random walk clustering, in: Proceedings of the 31st SGAI International Conference, Springer, Cambridge, UK, 2011, pp.\u00a0255\u2013268.","DOI":"10.1007\/978-1-4471-2318-7_20"},{"key":"ref007","doi-asserted-by":"crossref","unstructured":"Y.\u00a0Asano, H.\u00a0Imai, M.\u00a0Toyoda and M.\u00a0Kitsuregawa, Applying the site information to the information retrieval from the web, in: Proceedings of the Third International Conference on Web Information Systems Engineering, 2002. 
WISE 2002, IEEE Computer Society, 2002, pp.\u00a083\u201392.","DOI":"10.1109\/WISE.2002.1181646"},{"key":"ref008","unstructured":"L.\u00a0Becchetti, C.\u00a0Castillo, D.\u00a0Donato, S.\u00a0Leonardi and R.\u00a0Baeza-Yates, Link-based characterization and detection of web spam, in: Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web, 2006, pp.\u00a01\u20138."},{"key":"ref009","unstructured":"A.\u00a0Bencz\u00far, K.\u00a0Csalog\u00e1ny and T.\u00a0Sarl\u00f3s, Link-based similarity search to fight web spam, in: Adversarial Information Retrieval on the Web, Seattle, Washington, USA, 2006, pp.\u00a01\u20138."},{"key":"ref010","doi-asserted-by":"crossref","unstructured":"K.\u00a0Bharat, B.W.\u00a0Chang, M.\u00a0Henzinger and M.\u00a0Ruhl, Who links to whom: Mining linkage between web sites, in: Proceedings 2001 IEEE International Conference on Data Mining, IEEE Computer Society, Washington, DC, USA, 2001, pp.\u00a051\u201358. doi:10.1109\/ICDM.2001.989500.","DOI":"10.1109\/ICDM.2001.989500"},{"key":"ref011","doi-asserted-by":"publisher","DOI":"10.1016\/S1389-1286(00)00083-9"},{"key":"ref012","doi-asserted-by":"crossref","unstructured":"A.\u00a0Brown, Archiving Websites: A Practical Guide for Information Management Professionals, Facet Publishing, London, England, 2006.","DOI":"10.29085\/9781856049009"},{"key":"ref013","unstructured":"K.\u00a0Chellapilla and D.M.\u00a0Chickering, Improving cloaking detection using search query popularity and monetizability, in: Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web, Seattle, WA, 2006, pp.\u00a017\u201324."},{"key":"ref014","unstructured":"K.W.\u00a0Cheung and Y.\u00a0Sun, Mining web site\u2019s clusters from link topology and site hierarchy, in: Proceedings of the 2003 IEEE\/WIC International Conference on Web Intelligence, IEEE Computer Society, Washington, DC, USA, 2003, 
p.\u00a0271."},{"key":"ref015","first-page":"343","volume":"5","author":"Cheung K.-W.","year":"2007","journal-title":"Web Intelligence and Agent Systems: An International Journal"},{"key":"ref016","unstructured":"T.H.\u00a0Cormen, C.E.\u00a0Leiserson, R.L.\u00a0Rivest and C.\u00a0Stein, Introduction to Algorithms, 3rd edn, The MIT Press, Cambridge, Massachusetts, 2009."},{"key":"ref017","unstructured":"M.\u00a0Deegan and S.\u00a0Tanner, Digital Preservation, Digital Futures Series, 2006."},{"key":"ref018","doi-asserted-by":"crossref","unstructured":"P.\u00a0Dmitriev, As we may perceive: Finding the boundaries of compound documents on the web, in: Proceeding of the 17th International Conference on World Wide Web, ACM, Beijing, China, 2008, pp.\u00a01029\u20131030. doi:10.1145\/1367497.1367640.","DOI":"10.1145\/1367497.1367640"},{"key":"ref019","doi-asserted-by":"crossref","unstructured":"P.\u00a0Dmitriev and C.\u00a0Lagoze, Automatically constructing descriptive site maps, in: Frontiers of WWW Research and Development, APWeb 2006, Springer, Berlin, Heidelberg, 2006, pp.\u00a0201\u2013212. doi:10.1007\/11610113_19.","DOI":"10.1007\/11610113_19"},{"key":"ref020","unstructured":"M.H.\u00a0Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall PTR, Upper Saddle River, NJ, USA, 2002."},{"key":"ref021","doi-asserted-by":"crossref","unstructured":"N.\u00a0Eiron and K.S.\u00a0McCurley, Untangling compound documents on the web, in: Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia, ACM Press, New York, USA, 2003, pp.\u00a085\u201394. 
doi:10.1145\/900051.900070.","DOI":"10.1145\/900051.900070"},{"key":"ref022","doi-asserted-by":"crossref","unstructured":"M.\u00a0Ester, H.P.\u00a0Kriegel and M.\u00a0Schubert, Web site mining: A new way to spot competitors, customers and suppliers in the world wide web, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York City, USA, 2002, pp.\u00a0249\u2013258. doi:10.1145\/775047.775084.","DOI":"10.1145\/775047.775084"},{"key":"ref023","unstructured":"M.\u00a0Ester and H.P.\u00a0Kriegel, A density-based algorithm for discovering clusters in large spatial databases with noise, in: 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp.\u00a0226\u2013231."},{"key":"ref024","doi-asserted-by":"publisher","DOI":"10.4153\/CJM-1956-045-5"},{"key":"ref025","unstructured":"L.R.\u00a0Ford and D.R.\u00a0Fulkerson, Flows in Networks, Princeton University Press, Princeton, NJ, 1962."},{"key":"ref026","unstructured":"A.\u00a0Gibbons, Algorithmic Graph Theory, Cambridge University Press, 1985."},{"key":"ref027","doi-asserted-by":"crossref","unstructured":"K.\u00a0Golub and A.\u00a0Ard\u00f6, Importance of HTML structural elements and metadata in automated subject classification, in: Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries, LNCS, 2005, pp.\u00a0368\u2013378. 
doi:10.1007\/11551362_33.","DOI":"10.1007\/11551362_33"},{"key":"ref028","doi-asserted-by":"crossref","unstructured":"M.\u00a0Henzinger, Finding near-duplicate web pages: A large-scale evaluation of algorithms, in: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006, pp.\u00a0284\u2013291.","DOI":"10.1145\/1148170.1148222"},{"key":"ref029","doi-asserted-by":"crossref","unstructured":"M.\u00a0Keller and H.\u00a0Hartenstein, Mining taxonomies from web menus: Rule-based concepts and algorithms, in: Web Engineering, 2013.","DOI":"10.1007\/978-3-642-39200-9_23"},{"key":"ref030","doi-asserted-by":"crossref","unstructured":"M.\u00a0Keller and M.\u00a0Nussbaumer, MenuMiner: Revealing the information architecture of large web sites by analyzing maximal cliques, in: Proceedings of the 21st International Conference Companion on World Wide Web, ACM Press, New York, USA, 2012, p.\u00a01025.","DOI":"10.1145\/2187980.2188237"},{"key":"ref031","doi-asserted-by":"publisher","DOI":"10.1016\/S0306-4573(02)00022-5"},{"key":"ref032","unstructured":"T.\u00a0Lavergne, T.\u00a0Urvoy and F.\u00a0Yvon, Detecting fake content with relative entropy scoring, in: International Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, CEUR Workshop Proceedings, Vol.\u00a0377, CEUR-WS.org, 2008."},{"key":"ref033","doi-asserted-by":"crossref","unstructured":"B.\u00a0Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications), 2nd edn, Springer, 2011.","DOI":"10.1007\/978-3-642-19460-3"},{"key":"ref034","doi-asserted-by":"crossref","unstructured":"N.\u00a0Liu and C.\u00a0Yang, Extracting a website\u2019s content structure from its link structure, in: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, ACM, New York, NY, USA, 2005, 
pp.\u00a0345\u2013346.","DOI":"10.1145\/1099554.1099660"},{"key":"ref035","doi-asserted-by":"crossref","unstructured":"N.\u00a0Liu and C.\u00a0Yang, Mining web site\u2019s topic hierarchy, in: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, 2005, p.\u00a0980. doi:10.1145\/1062745.1062828.","DOI":"10.1145\/1062745.1062828"},{"key":"ref036","doi-asserted-by":"crossref","unstructured":"Y.\u00a0Liu, Y.\u00a0Ouyang, H.\u00a0Sheng and Z.\u00a0Xiong, An incremental algorithm for clustering search results, in: Proceedings of the 2008 IEEE International Conference on Signal Image Technology and Internet Based Systems, IEEE Computer Society, Washington, DC, USA, 2008, pp.\u00a0112\u2013117. doi:10.1109\/SITIS.2008.53.","DOI":"10.1109\/SITIS.2008.53"},{"key":"ref037","unstructured":"J.\u00a0MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium, 1967."},{"key":"ref038","unstructured":"G.\u00a0Mishne, D.\u00a0Carmel and R.\u00a0Lempel, Blocking blog spam with language model disagreement, in: Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan, 2005."},{"key":"ref039","doi-asserted-by":"crossref","unstructured":"M.\u00a0Newman, Detecting community structure in networks, The European Physical Journal\u00a0B\u00a0\u2013 Condensed Matter \u2026 (2004).","DOI":"10.1140\/epjb\/e2004-00124-y"},{"key":"ref040","doi-asserted-by":"crossref","unstructured":"M.\u00a0Newman, Fast algorithm for detecting community structure in networks, Physical Review\u00a0E 69(6) (2004), 5.","DOI":"10.1103\/PhysRevE.69.066133"},{"key":"ref041","doi-asserted-by":"crossref","unstructured":"M.\u00a0Newman and M.\u00a0Girvan, Finding and evaluating community structure in networks, Physical Review\u00a0E 69(2) 
(2004).","DOI":"10.1103\/PhysRevE.69.026113"},{"key":"ref042","doi-asserted-by":"publisher","DOI":"10.1243\/0954406041319509"},{"key":"ref043","doi-asserted-by":"crossref","unstructured":"S.V.\u00a0Ramnath and P.\u00a0Halkarnikar, Web site mining using entropy estimation, in: International Conference on Data Storage and Data Engineering, IEEE, 2010, pp.\u00a0225\u2013229.","DOI":"10.1109\/DSDE.2010.19"},{"key":"ref044","doi-asserted-by":"crossref","unstructured":"E.M.\u00a0Rodrigues, N.\u00a0Milic-Frayling and B.\u00a0Fortuna, Detection of web subsites: Concepts, algorithms, and evaluation issues, in: Web Intelligence, IEEE Computer Society, 2007, pp.\u00a066\u201373.","DOI":"10.1109\/WI.2007.107"},{"key":"ref045","unstructured":"E.M.\u00a0Rodrigues, N.\u00a0Milic-Frayling, M.\u00a0Hicks and G.\u00a0Smyth, Link structure graphs for representing and analyzing web sites, 2006."},{"key":"ref046","doi-asserted-by":"publisher","DOI":"10.1145\/361219.361220"},{"key":"ref047","doi-asserted-by":"publisher","DOI":"10.1016\/j.cosrev.2007.05.001"},{"key":"ref048","doi-asserted-by":"crossref","unstructured":"D.\u00a0Sculley, Web-scale k-means clustering, in: Proceedings of the 19th International Conference on World Wide Web, ACM Press, New York, New York, USA, 2010, p.\u00a01177.","DOI":"10.1145\/1772690.1772862"},{"key":"ref049","unstructured":"P.\u00a0Senellart, Website identification, Masters thesis, Universit\u00e9 Paris\u00a0XI, Orsay, France, 2003."},{"key":"ref050","doi-asserted-by":"crossref","unstructured":"P.\u00a0Senellart, Identifying websites with flow simulation, Technical report, Gemo, INRIA Futurs, Orsay, France, 2005.","DOI":"10.1007\/11531371_18"},{"key":"ref051","unstructured":"Y.\u00a0Tian, A web site mining algorithm using the multiscale tree representation model, in: Proceedings of the 5th Webmining as a Premise to Effective and Intelligent Web Applications, 
2003."},{"key":"ref052","unstructured":"T.\u00a0Urvoy, E.\u00a0Chauveau, P.\u00a0Filoche and T.\u00a0Lavergne, Web spam challenge 2007: France Telecom R&D submission, 2007."},{"key":"ref053","doi-asserted-by":"publisher","DOI":"10.1145\/1326561.1326564"},{"key":"ref054","unstructured":"T.\u00a0Urvoy, T.\u00a0Lavergne and P.\u00a0Filoche, Tracking web spam with hidden style similarity, in: 2nd International Workshop on Adversarial Information Retrieval on the Web, Seattle, Washington, USA, 2006, pp.\u00a025\u201331."},{"key":"ref055","unstructured":"Y.\u00a0Zhao and G.\u00a0Karypis, Clustering in life sciences, in: Functional Genomics: Methods and Protocols, M.\u00a0Brownstein, A.\u00a0Khodursky and D.\u00a0Conniffe, eds, 2003."}],"container-title":["Web Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.3233\/WEB-170365","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.3233\/WEB-170365","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.3233\/WEB-170365","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T05:27:02Z","timestamp":1777613222000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/WEB-170365"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,11,20]]},"references-count":55,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2017,11,20]]}},"alternative-id":["10.3233\/WEB-170365"],"URL":"https:\/\/doi.org\/10.3233\/web-170365","relation":{},"ISSN":["2405-6456","2405-6464"],"issn-type":[{"value":"2405-6456","type":"print"},{"value":"2405-6464","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,11,20]]}}}