{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T04:29:17Z","timestamp":1777696157781,"version":"3.51.4"},"reference-count":48,"publisher":"SAGE Publications","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IDA"],"published-print":{"date-parts":[[2022,4,18]]},"abstract":"<jats:p>Access to the web, one of the richest data sources in the world, is not possible without cost. Often, this cost is not taken into account in data acquisition processes. In this paper, we introduce the Learning Agents (LA) method for automatic topical data acquisition from the web with minimum bandwidth usage and the lowest cost. The proposed LA method uses online learning topical crawlers. The online learning capability lets the LA adapt dynamically to the properties of web pages while crawling the target topic, and learn an effective combination of a set of link scoring criteria for that topic. In this way, the LA resolves a challenge in earlier approaches, namely how to combine the outputs of different criteria when computing the value of following a link, and increases the efficiency of the crawlers. An implemented version of the LA method uses a collection of topical content analyzers to score the links. The learning ability of the implemented LA addresses the problem that the appropriate size of link contexts for pages of different topics is unclear. 
Empirical evaluation using standard metrics indicates that, where non-learning methods are inefficient, the learning capability of LA significantly increases the efficiency of topical crawling and achieves state-of-the-art results.<\/jats:p>","DOI":"10.3233\/ida-205107","type":"journal-article","created":{"date-parts":[[2022,4,26]],"date-time":"2022-04-26T13:24:16Z","timestamp":1650979456000},"page":"695-722","source":"Crossref","is-referenced-by-count":2,"title":["Online learning agents for cost-sensitive topical data acquisition from the web"],"prefix":"10.1177","volume":"26","author":[{"given":"Mahdi","family":"Naghibi","sequence":"first","affiliation":[{"name":"Faculty of Electrical and Computer Engineering, Malek-Ashtar University of Technology, Iran"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Reza","family":"Anvari","sequence":"additional","affiliation":[{"name":"Faculty of Electrical and Computer Engineering, Malek-Ashtar University of Technology, Iran"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ali","family":"Forghani","sequence":"additional","affiliation":[{"name":"Faculty of Electrical and Computer Engineering, Malek-Ashtar University of Technology, Iran"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Behrouz","family":"Minaei","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, Iran University of Science and Technology, Iran"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","reference":[{"issue":"2","key":"10.3233\/IDA-205107_ref1","doi-asserted-by":"crossref","first-page":"253","DOI":"10.1007\/s10618-007-0082-x","article-title":"Maximizing classifier utility when there are data acquisition and modeling costs","volume":"17","author":"Weiss","year":"2008","journal-title":"Data Min Knowl 
Discov"},{"issue":"2","key":"10.3233\/IDA-205107_ref3","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1145\/1233321.1233325","article-title":"Maximizing classifier utility when training data is costly","volume":"8","author":"Weiss","year":"2006","journal-title":"ACM SIGKDD Explor Newsl"},{"key":"10.3233\/IDA-205107_ref4","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1145\/3201064.3201085","article-title":"Focused Crawl of Web Archives to Build Event Collections","author":"Klein","year":"2018","journal-title":"Proceedings of the 10th ACM Conference on Web Science \u2013 WebSci \u201918"},{"issue":"1","key":"10.3233\/IDA-205107_ref5","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1007\/s00799-016-0207-1","article-title":"Focused crawler for events","volume":"19","author":"Farag","year":"2018","journal-title":"Int J Digit Libr"},{"key":"10.3233\/IDA-205107_ref6","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/978-3-642-29166-1_1","article-title":"Focused crawling using vision-based page segmentation","volume":"285","author":"Naghibi","year":"2012","journal-title":"Communications in Computer and Information Science"},{"key":"10.3233\/IDA-205107_ref7","doi-asserted-by":"publisher","DOI":"10.1145\/3066911.3066912"},{"key":"10.3233\/IDA-205107_ref8","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-65930-5_3"},{"issue":"1","key":"10.3233\/IDA-205107_ref9","doi-asserted-by":"publisher","first-page":"107","DOI":"10.1109\/TKDE.2006.12","article-title":"Link contexts in classifier-guided topical crawlers","volume":"18","author":"Pant","year":"2006","journal-title":"IEEE Trans Knowl Data Eng"},{"issue":"10","key":"10.3233\/IDA-205107_ref10","doi-asserted-by":"publisher","first-page":"1001","DOI":"10.1016\/j.datak.2009.04.002","article-title":"Improving the performance of focused web crawlers","volume":"68","author":"Batsakis","year":"2009","journal-title":"Data Knowl 
Eng"},{"key":"10.3233\/IDA-205107_ref11","doi-asserted-by":"publisher","DOI":"10.5121\/ijdkp.2019.9304"},{"issue":"4","key":"10.3233\/IDA-205107_ref12","doi-asserted-by":"crossref","first-page":"378","DOI":"10.1145\/1031114.1031117","article-title":"Topical web crawlers: Evaluating adaptive algorithms","volume":"4","author":"Menczer","year":"2004","journal-title":"ACM Trans Internet Technol"},{"key":"10.3233\/IDA-205107_ref14","doi-asserted-by":"publisher","DOI":"10.1145\/371920.371955"},{"key":"10.3233\/IDA-205107_ref15","doi-asserted-by":"publisher","DOI":"10.1145\/511463.511466"},{"issue":"2","key":"10.3233\/IDA-205107_ref16","doi-asserted-by":"publisher","first-page":"95","DOI":"10.1007\/s00354-017-0029-8","article-title":"Efficient Topical Focused Crawling Through Neighborhood Feature","volume":"36","author":"Suebchua","year":"2018","journal-title":"New Gener Comput"},{"key":"10.3233\/IDA-205107_ref17","doi-asserted-by":"publisher","first-page":"80","DOI":"10.1109\/ICOIN.2018.8343090","article-title":"History-enhanced focused website segment crawler","author":"Suebchua","year":"2018","journal-title":"2018 International Conference on Information Networking (ICOIN)"},{"issue":"2","key":"10.3233\/IDA-205107_ref18","doi-asserted-by":"publisher","first-page":"129","DOI":"10.11989\/JEST.1674-862X.70116018","article-title":"A survey about algorithms utilized by focused web crawler","volume":"16","author":"Bin Yu","year":"2018","journal-title":"J Electron Sci Technol"},{"key":"10.3233\/IDA-205107_ref19","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01437-7_4"},{"key":"10.3233\/IDA-205107_ref20","doi-asserted-by":"publisher","first-page":"181","DOI":"10.1016\/J.ASOC.2016.12.028","article-title":"A web page distillation strategy for efficient focused crawling based on optimized Na\u00efve bayes (ONB) classifier","volume":"53","author":"Saleh","year":"2017","journal-title":"Appl Soft 
Comput"},{"issue":"2","key":"10.3233\/IDA-205107_ref21","doi-asserted-by":"crossref","first-page":"183","DOI":"10.1016\/0169-7552(94)90132-5","article-title":"Information retrieval in the World Wide Web: Making client-based searching feasible","volume":"27","author":"De Bra","year":"1994","journal-title":"Comput Networks ISDN Syst"},{"issue":"11","key":"10.3233\/IDA-205107_ref22","doi-asserted-by":"crossref","first-page":"613","DOI":"10.1145\/361219.361220","article-title":"A vector space model for automatic indexing","volume":"18","author":"Salton","year":"1975","journal-title":"Commun ACM"},{"key":"10.3233\/IDA-205107_ref23","doi-asserted-by":"crossref","unstructured":"M. Ehrig and A. Maedche, Ontology-focused crawling of Web documents, in Proceedings of the 2003 ACM symposium on Applied computing, 2003, pp.\u00a01174\u20131178.","DOI":"10.1145\/952532.952761"},{"key":"10.3233\/IDA-205107_ref24","doi-asserted-by":"crossref","first-page":"317","DOI":"10.1016\/S0169-7552(98)00038-5","article-title":"The shark-search algorithm \u2013 An application: Tailored Web site mapping","volume":"30","author":"Hersovici","year":"1998","journal-title":"Comput Networks ISDN Syst"},{"issue":"1","key":"10.3233\/IDA-205107_ref25","doi-asserted-by":"publisher","first-page":"61","DOI":"10.1002\/cpe","article-title":"Tunneling enhanced by web page content block partition for focused crawling","volume":"20","author":"Peng","year":"2008","journal-title":"Concurr Comput Pract Exp"},{"key":"10.3233\/IDA-205107_ref26","unstructured":"M. Diligenti, F.M. Coetzee, S. Lawrence, C.L. Giles and M. Gori, Focused Crawling Using Context Graphs, in Proceedings of 26th VLDB Conference, 2000, pp.\u00a0527\u2013534."},{"key":"10.3233\/IDA-205107_ref27","unstructured":"J. Rennie and A.K. 
McCallum, Using reinforcement learning to spider the web efficiently, in Proceedings of the Sixteenth International Conference on Machine Learning, 1999, pp.\u00a0335\u2013343."},{"issue":"2","key":"10.3233\/IDA-205107_ref28","doi-asserted-by":"crossref","first-page":"270","DOI":"10.1016\/j.datak.2006.01.012","article-title":"Using HMM to learn user browsing patterns for focused web crawling","volume":"59","author":"Liu","year":"2006","journal-title":"Data Knowl Eng"},{"issue":"4\u20135","key":"10.3233\/IDA-205107_ref29","doi-asserted-by":"publisher","first-page":"232","DOI":"10.1016\/j.is.2005.02.007","article-title":"Topic-specific crawling on the Web with the measurements of the relevancy context graph","volume":"31","author":"Hsu","year":"2006","journal-title":"Inf Syst"},{"key":"10.3233\/IDA-205107_ref30","doi-asserted-by":"publisher","first-page":"261","DOI":"10.1007\/978-3-319-91662-0_20","article-title":"Focused crawling through reinforcement learning","author":"Han","year":"2018","journal-title":"ICWE 2018: Web Engineering"},{"key":"10.3233\/IDA-205107_ref31","unstructured":"L. Page, S. Brin, R. Motwani and T. 
Winograd, The PageRank citation ranking: Bringing order to the web, Stanford InfoLab, 1999."},{"issue":"4","key":"10.3233\/IDA-205107_ref32","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1145\/345966.345982","article-title":"Hubs, Authorities, and Communities","volume":"31","author":"Kleinberg","year":"1999","journal-title":"ACM Comput Surv"},{"issue":"11\u201316","key":"10.3233\/IDA-205107_ref33","doi-asserted-by":"publisher","first-page":"1623","DOI":"10.1016\/S1389-1286(99)00052-3","article-title":"Focused crawling: A new approach to topic-specific Web resource discovery","volume":"31","author":"Chakrabarti","year":"1999","journal-title":"Comput Networks"},{"issue":"8","key":"10.3233\/IDA-205107_ref34","doi-asserted-by":"publisher","first-page":"1114","DOI":"10.1631\/jzus.A0820481","article-title":"On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis","volume":"10","author":"Wang","year":"2009","journal-title":"J Zhejiang Univ Sci A"},{"issue":"3","key":"10.3233\/IDA-205107_ref35","doi-asserted-by":"crossref","first-page":"417","DOI":"10.1007\/s10791-005-6993-5","article-title":"A general evaluation framework for topical crawlers","volume":"8","author":"Srinivasan","year":"2005","journal-title":"Inf Retr Boston"},{"issue":"2","key":"10.3233\/IDA-205107_ref36","doi-asserted-by":"publisher","first-page":"195","DOI":"10.1016\/S0167-9236(02)00106-9","article-title":"Complementing search engines with online web mining agents","volume":"35","author":"Menczer","year":"2003","journal-title":"Decis Support Syst"},{"issue":"4","key":"10.3233\/IDA-205107_ref37","doi-asserted-by":"crossref","first-page":"430","DOI":"10.1145\/1095872.1095875","article-title":"Learning to crawl: Comparing classification schemes","volume":"23","author":"Pant","year":"2005","journal-title":"ACM Trans Inf 
Syst"},{"issue":"4","key":"10.3233\/IDA-205107_ref38","doi-asserted-by":"crossref","first-page":"421","DOI":"10.1177\/0165551510368620","article-title":"A parametric methodology for text classification","volume":"36","author":"Karanikolas","year":"2010","journal-title":"J Inf Sci"},{"key":"10.3233\/IDA-205107_ref39","doi-asserted-by":"crossref","unstructured":"S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.","DOI":"10.1017\/CBO9781107298019"},{"key":"10.3233\/IDA-205107_ref41","doi-asserted-by":"crossref","unstructured":"D. Cai, S. Yu, J.R. Wen and W.Y. Ma, Block-based web search, in Proceedings of the 27th ACM SIGIR Conference, 2004, pp.\u00a0456\u2013463.","DOI":"10.1145\/1008992.1009070"},{"key":"10.3233\/IDA-205107_ref42","doi-asserted-by":"crossref","unstructured":"D. Cai, X. He, J.R. Wen and W.Y. Ma, Block-level link analysis, in Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, 2004, pp.\u00a0440\u2013447.","DOI":"10.1145\/1008992.1009068"},{"issue":"2","key":"10.3233\/IDA-205107_ref43","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1145\/1230812.1230816","article-title":"Clustering and searching WWW images using link and page layout analysis","volume":"3","author":"He","year":"2007","journal-title":"ACM Trans Multimed Comput Commun Appl"},{"key":"10.3233\/IDA-205107_ref45","doi-asserted-by":"crossref","unstructured":"M.F. Porter and others, An algorithm for suffix stripping, Program 14(3) (1980), 130\u2013137.","DOI":"10.1108\/eb046814"},{"key":"10.3233\/IDA-205107_ref46","unstructured":"Y. Yang and J.O. 
Pedersen, A Comparative Study on Feature Selection in Text Categorization, in Proceedings of the Fourteenth International Conference on Machine Learning, 1997, pp.\u00a0412\u2013420."},{"issue":"1","key":"10.3233\/IDA-205107_ref47","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1145\/1656274.1656278","article-title":"The WEKA data mining software: an update","volume":"11","author":"Hall","year":"2009","journal-title":"ACM SIGKDD Explor Newsl"},{"key":"10.3233\/IDA-205107_ref48","first-page":"219","article-title":"Comparative analysis of methods for determining number of hidden neurons in artificial neural network","author":"Vujicic","year":"2016","journal-title":"Central european conference on information and intelligent systems"},{"key":"10.3233\/IDA-205107_ref49","unstructured":"S. Xu and L. Chen, A novel approach for determining the optimal number of hidden layer neurons for FNN\u2019s and its application in data mining, in Proceedings of the 5th International Conference on Information Technology and Applications, 2008, pp.\u00a0683\u2013686."},{"key":"10.3233\/IDA-205107_ref50","unstructured":"J. Heaton, Introduction to neural networks with Java, Heaton Research, Inc., 2008."},{"key":"10.3233\/IDA-205107_ref51","unstructured":"R. Baeza-Yates, B. Ribeiro-Neto, and others, Modern information retrieval, vol. 463. 
ACM Press, 1999."},{"key":"10.3233\/IDA-205107_ref52","doi-asserted-by":"crossref","first-page":"153","DOI":"10.1007\/978-3-662-10874-1_7","article-title":"Crawling the web","author":"Pant","year":"2004","journal-title":"Web dynamics"}],"container-title":["Intelligent Data Analysis"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/IDA-205107","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T09:19:25Z","timestamp":1777454365000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/IDA-205107"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,18]]},"references-count":48,"journal-issue":{"issue":"3"},"URL":"https:\/\/doi.org\/10.3233\/ida-205107","relation":{},"ISSN":["1088-467X","1571-4128"],"issn-type":[{"value":"1088-467X","type":"print"},{"value":"1571-4128","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,4,18]]}}}