{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,13]],"date-time":"2026-02-13T17:24:16Z","timestamp":1771003456106,"version":"3.50.1"},"reference-count":25,"publisher":"SAGE Publications","issue":"2","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["JCM"],"published-print":{"date-parts":[[2021,5,6]]},"abstract":"<jats:p>Focused crawlers, as fundamental components of vertical search engines, focus on crawling the web pages related to a specific topic. Existing focused crawlers commonly suffer from the problems of low efficiency of crawling pages and subject migration. In this paper, we propose a learning-based focused crawler using a URL knowledge base. To improve the accuracy of similarity, the similarity of the topic is measured with the parent page content, anchor information, and URL content. The URL content is also learned and updated iteratively and continuously. Within the crawler, we implement a crawling mechanism based on a combination of content analysis and simple link analysis crawler strategy, which decreases computational complexity and avoids the locality problem of crawling. 
Experimental results show that our proposed algorithm achieves a better precision than traditional methods including the shark-search and best-first search algorithms, and avoids the local optimum problem of crawling.<\/jats:p>","DOI":"10.3233\/jcm-204658","type":"journal-article","created":{"date-parts":[[2020,10,13]],"date-time":"2020-10-13T13:26:25Z","timestamp":1602595585000},"page":"461-474","source":"Crossref","is-referenced-by-count":6,"title":["UCrawler: A learning-based web crawler using a URL knowledge base"],"prefix":"10.1177","volume":"21","author":[{"given":"Wei","family":"Wang","sequence":"first","affiliation":[{"name":"Computer Lab, Hangzhou Medical College, Hangzhou, Zhejiang, China"}]},{"given":"Lihua","family":"Yu","sequence":"additional","affiliation":[{"name":"Netease Hangzhou Network Ltd., Hangzhou, Zhejiang, China"}]}],"member":"179","reference":[{"issue":"9","key":"10.3233\/JCM-204658_ref1","first-page":"1965","article-title":"Survey on the research of focused crawling technique","volume":"25","author":"Zhou","year":"2005","journal-title":"Computer Applications"},{"key":"10.3233\/JCM-204658_ref2","doi-asserted-by":"crossref","unstructured":"R. Meusel, P. Mika and R. Blanco, Focused crawling for structured data, ACM International Conference on Information and Knowledge Management, 2014, pp.\u00a01039\u20131048.","DOI":"10.1145\/2661829.2661902"},{"key":"10.3233\/JCM-204658_ref3","doi-asserted-by":"crossref","unstructured":"J. Wu, P. Teregowda, J.P. Fern\u00e1ndez Ram\u00edrez et al., The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists, in Proc. 
3rd Annual ACM Web Science Conference, Evanston, IL, USA, June 2012, pp.\u00a0340\u2013343.","DOI":"10.1145\/2380718.2380762"},{"issue":"4","key":"10.3233\/JCM-204658_ref4","doi-asserted-by":"crossref","first-page":"608","DOI":"10.1109\/TSC.2015.2414931","article-title":"SmartCrawler: A Two-stage crawler for efficiently harvesting deep-web interfaces","volume":"9","author":"Zhao","year":"2016","journal-title":"IEEE Trans Services Computing"},{"key":"10.3233\/JCM-204658_ref5","unstructured":"L. Page, S. Brin and R. Motwani, The PageRank Citation Ranking: Bringing Order to the Web, Stanford, CA: Stanford University, 1998."},{"key":"10.3233\/JCM-204658_ref6","doi-asserted-by":"crossref","first-page":"604","DOI":"10.1145\/324133.324140","article-title":"Authoritative sources in a hyperlinked environment","author":"Kleinberg","year":"1999","journal-title":"Journal of the ACM"},{"key":"10.3233\/JCM-204658_ref7","doi-asserted-by":"crossref","first-page":"524","DOI":"10.1007\/11538059_55","article-title":"Improvement of HITS for topic-specific web crawler","volume":"3644","author":"Zong","year":"2005","journal-title":"Lecture Notes in Computer Science"},{"key":"10.3233\/JCM-204658_ref8","doi-asserted-by":"crossref","unstructured":"S. Chakrabarti, M. van den Berg and B. Dom, Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, in Proc. 8th International WWW Conf., Toronto, 1999, pp.\u00a01623\u20131640.","DOI":"10.1016\/S1389-1286(99)00052-3"},{"key":"10.3233\/JCM-204658_ref9","doi-asserted-by":"crossref","unstructured":"W. Wang, X. Chen, Y. Zou et al., A Focused Crawler Based on Na\u00efve Bayes Classifier, International Symposium on Intelligent Information Technology and Security Informatics, 2010, pp.\u00a0517\u2013521.","DOI":"10.1109\/IITSI.2010.30"},{"key":"10.3233\/JCM-204658_ref10","doi-asserted-by":"crossref","unstructured":"D. Taylan, M. Poyraz, S. 
Akyokus et al., Intelligent focused crawler: Learning which links to crawl, Innovations in Intelligent Systems and Applications (INISTA), 2011, pp. 504\u2013508.","DOI":"10.1109\/INISTA.2011.5946150"},{"key":"10.3233\/JCM-204658_ref11","first-page":"61","article-title":"An Effectively Focused Crawling System","volume":"376","author":"Uemura","year":"2012","journal-title":"Studies in Computational Intelligence"},{"key":"10.3233\/JCM-204658_ref12","doi-asserted-by":"crossref","unstructured":"X. Chen and X. Zhang, HAWK: A Focused Crawler with Content and Link Analysis, in Proc. the 2008 IEEE ICEBE, 2008, pp. 677\u2013680.","DOI":"10.1109\/ICEBE.2008.46"},{"issue":"11","key":"10.3233\/JCM-204658_ref13","first-page":"76","article-title":"Research on topical crawler of shark-search algorithm and hits algorithm","volume":"20","author":"Luo","year":"2010","journal-title":"Computer Technology and Development"},{"issue":"23","key":"10.3233\/JCM-204658_ref14","doi-asserted-by":"crossref","first-page":"4512","DOI":"10.1016\/j.ins.2008.07.030","article-title":"An ontology-based approach to learnable focused crawling","volume":"178","author":"Zheng","year":"2008","journal-title":"Information Sciences"},{"issue":"10","key":"10.3233\/JCM-204658_ref15","doi-asserted-by":"crossref","first-page":"1001","DOI":"10.1016\/j.datak.2009.04.002","article-title":"Improving the performance of focused web crawlers","volume":"68","author":"Batsakis","year":"2009","journal-title":"Data & Knowledge Engineering"},{"key":"10.3233\/JCM-204658_ref17","doi-asserted-by":"crossref","unstructured":"X. Zheng, T. Zhou, Z. Yu et al., URL Rule Based Focused Crawler, IEEE International Conference on e-Business Engineering (2008), 147\u2013154.","DOI":"10.1109\/ICEBE.2008.61"},{"key":"10.3233\/JCM-204658_ref18","unstructured":"J. 
Priyanka, Efficient crawler for gathering and exploring relevant sites from hidden web, International Journal of Emerging Technology and Computer Science 2(2) (2017)."},{"key":"10.3233\/JCM-204658_ref19","first-page":"111","article-title":"The web of knowledge mining: Theory, Methods and Applications","author":"Zheng","year":"2010","journal-title":"Science Press"},{"issue":"2","key":"10.3233\/JCM-204658_ref20","doi-asserted-by":"crossref","first-page":"1616","DOI":"10.1109\/TII.2012.2234472","article-title":"Self-adaptive semantic focused crawler for mining services information discovery","volume":"10","author":"Dong","year":"2014","journal-title":"IEEE Transactions on Industrial Informatics"},{"key":"10.3233\/JCM-204658_ref21","first-page":"329","article-title":"The Architecture and Implementation of an Extensible Web Crawler","author":"Hsieh","year":"2010","journal-title":"NSDI"},{"key":"10.3233\/JCM-204658_ref22","first-page":"445","article-title":"Adaptive geospatially focused crawling","author":"Ahlers","year":"2009","journal-title":"CIKM"},{"issue":"2","key":"10.3233\/JCM-204658_ref23","first-page":"19","article-title":"A focused crawling algorithm for improving Shark-Search","volume":"33","author":"Qiu","year":"2017","journal-title":"Micro-computer Applications"},{"key":"10.3233\/JCM-204658_ref24","unstructured":"G. Cai, Design and implementation of topic-oriented multi-threaded web crawler, Lanzhou: Northwest University for Nationalities, 2017."},{"issue":"2","key":"10.3233\/JCM-204658_ref25","first-page":"195","article-title":"The theme reptile algorithm based on fusion link structure","volume":"38","author":"Liu","year":"2017","journal-title":"Journal of Huaqiao University: Natural Science"},{"key":"10.3233\/JCM-204658_ref26","doi-asserted-by":"crossref","unstructured":"H. Jiang, B. Han, Y. 
Lin et al., Design and implementation of university focused crawler based on BP network classifier, 2009 Second International Workshop on K, 2009.","DOI":"10.1109\/WKDD.2009.77"}],"container-title":["Journal of Computational Methods in Sciences and Engineering"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/JCM-204658","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,13]],"date-time":"2026-02-13T16:32:07Z","timestamp":1771000327000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/JCM-204658"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,6]]},"references-count":25,"journal-issue":{"issue":"2"},"URL":"https:\/\/doi.org\/10.3233\/jcm-204658","relation":{},"ISSN":["1472-7978","1875-8983"],"issn-type":[{"value":"1472-7978","type":"print"},{"value":"1875-8983","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,5,6]]}}}