{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,31]],"date-time":"2025-12-31T01:29:58Z","timestamp":1767144598681,"version":"build-2238731810"},"reference-count":45,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2021,7,24]],"date-time":"2021-07-24T00:00:00Z","timestamp":1627084800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,7,24]],"date-time":"2021-07-24T00:00:00Z","timestamp":1627084800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2023,8]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Due to the massive size of the hidden web, searching, retrieving and mining rich and high-quality data can be a daunting task. Moreover, with the presence of forms, data cannot be accessed easily. Forms are dynamic, heterogeneous and spread over trillions of web pages. Significant efforts have addressed the problem of tapping into the hidden web to integrate and mine rich data. Effective techniques, as well as application in special cases, are required to be explored to achieve an effective harvest rate. One such special area is atmospheric science, where hidden web crawling is least implemented, and crawler is required to crawl through the huge web to narrow down the search to specific data. In this study, an intelligent hidden web crawler for harvesting data in urban domains (IHWC) is implemented to address the relative problems such as classification of domains, prevention of exhaustive searching, and prioritizing the URLs. The crawler also performs well in curating pollution-related data. 
The crawler targets the relevant web pages and discards the irrelevant by implementing rejection rules. To achieve more accurate results for a focused crawl, IHWC crawls the websites on priority for a given topic. The crawler has fulfilled the dual objective of developing an effective hidden web crawler that can focus on diverse domains and to check its integration in searching pollution data in smart cities. One of the objectives of smart cities is to reduce pollution. Resultant crawled data can be used for finding the reason for pollution. The crawler can help the user to search the level of pollution in a specific area. The harvest rate of the crawler is compared with pioneer existing work. With an increase in the size of a dataset, the presented crawler can add significant value to emission accuracy. Our results demonstrate the accuracy and harvest rate of the proposed framework, which efficiently collects hidden web interfaces from large-scale sites and achieves higher rates than other crawlers.<\/jats:p>","DOI":"10.1007\/s40747-021-00471-1","type":"journal-article","created":{"date-parts":[[2021,7,24]],"date-time":"2021-07-24T03:03:00Z","timestamp":1627095780000},"page":"3635-3653","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["IHWC: intelligent hidden web crawler for harvesting data in urban 
domains"],"prefix":"10.1007","volume":"9","author":[{"given":"Sawroop","family":"Kaur","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6571-327X","authenticated-orcid":false,"given":"Aman","family":"Singh","sequence":"additional","affiliation":[]},{"given":"G.","family":"Geetha","sequence":"additional","affiliation":[]},{"given":"Xiaochun","family":"Cheng","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,7,24]]},"reference":[{"issue":"2","key":"471_CR1","doi-asserted-by":"publisher","first-page":"144","DOI":"10.1145\/358923.358934","volume":"32","author":"M Kobayashi","year":"2000","unstructured":"Kobayashi M, Takeda K (2000) Information retrieval on the Web. ACM Comput Surv 32(2):144\u2013173","journal-title":"ACM Comput Surv"},{"key":"471_CR2","doi-asserted-by":"crossref","unstructured":"Wu M, Lee C (2020) A study on natural language processing classified news. pp. 244\u2013247","DOI":"10.1109\/Indo-TaiwanICAN48429.2020.9181355"},{"key":"471_CR3","doi-asserted-by":"crossref","unstructured":"Chakrabarti S (2003) Crawling the web. In: Mining the web, pp 17\u201343","DOI":"10.1016\/B978-155860754-5\/50003-3"},{"key":"471_CR4","unstructured":"Kaur S, Geetha G (2007) Advances in web crawlers, vol. 10, pp. 1\u201322"},{"issue":"1","key":"471_CR5","doi-asserted-by":"publisher","first-page":"159","DOI":"10.1007\/s10844-012-0221-8","volume":"40","author":"Y Li","year":"2013","unstructured":"Li Y, Wang Y, Du J (2013) E-FFC: an enhanced form-focused crawler for domain-specific deep web databases. J Intell Inform Syst 40(1):159\u2013184","journal-title":"J Intell Inform Syst"},{"key":"471_CR6","unstructured":"Wu Z et al (2003) Towards automatic incorporation of search engines into a large-scale metasearch engine. 
In: Proceedings - IEEE\/WIC International Conference on Web Intelligence, WI 2003, pp 658\u2013661"},{"key":"471_CR7","first-page":"342","volume":"7","author":"J Madhavan","year":"2007","unstructured":"Madhavan J et al (2007) Web-scale data integration: you can only afford to pay as you go. Cidr 7:342\u2013350","journal-title":"Cidr"},{"key":"471_CR8","unstructured":"Cope J, Craswell N, Hawking D (2003) Automated discovery of search interfaces on the web. In: Fourteenth Australasian Database Conference (ADC2003), vol. 17, pp. 181\u2013189"},{"key":"471_CR9","first-page":"1","volume":"5","author":"L Barbosa","year":"2005","unstructured":"Barbosa L, Freire J (2005) Searching for hidden-web databases. Proc WebDB 5:1\u20136","journal-title":"Proc WebDB"},{"key":"471_CR10","doi-asserted-by":"crossref","unstructured":"Hicks C, Scheffer M, Ngu AHH, Sheng QZ (2012) Discovery and cataloging of deep web sources. In: Proceedings of the 2012 IEEE 13th International Conference on Information Reuse and Integration, IRI 2012, pp. 224\u2013230","DOI":"10.1109\/IRI.2012.6303014"},{"key":"471_CR11","unstructured":"Raghavan S, Garcia-Molina H (2001) Crawling the hidden web. In: 27th VLDB Conference, Roma, Italy, pp 1\u201310"},{"issue":"2","key":"471_CR12","doi-asserted-by":"publisher","first-page":"133","DOI":"10.1023\/A:1008672508721","volume":"8","author":"M Perkowitz","year":"1997","unstructured":"Perkowitz M, Doorenbos RB, Etzioni O, Weld DS (1997) Learning to understand information on the internet: an example-based approach. J Intell Inform Syst 8(2):133\u2013153","journal-title":"J Intell Inform Syst"},{"key":"471_CR13","doi-asserted-by":"publisher","first-page":"402","DOI":"10.1007\/978-3-540-45275-1_35","volume":"2784","author":"S Liddle","year":"2003","unstructured":"Liddle S, Embley D, Scott D, Yau SH (2003) Extracting data behind web forms. 
Lect Notes Comput Sci 2784:402\u2013413","journal-title":"Lect Notes Comput Sci"},{"key":"471_CR14","doi-asserted-by":"crossref","unstructured":"He Y, Xin D, Ganti V, Rajaraman S, Shah N (2013) Crawling deep web entity pages. Web Search Data Min, 355\u2013364","DOI":"10.1145\/2433396.2433442"},{"issue":"1","key":"471_CR15","doi-asserted-by":"publisher","first-page":"47","DOI":"10.1007\/s00778-012-0286-6","volume":"22","author":"T Furche","year":"2013","unstructured":"Furche T, Gottlob G, Grasso G, Schallhart C, Sellers A (2013) OXPath: a language for scalable data extraction, automation, and crawling on the deep web. VLDB J 22(1):47\u201372","journal-title":"VLDB J"},{"key":"471_CR16","doi-asserted-by":"crossref","unstructured":"\u00c1lvarez M, Raposo J, Pan A, Cacheda F, Bellas F, Carneiro V (2007) Crawling the content hidden behind web forms. In: Lecture notes in computer science, pp. 322\u2013333","DOI":"10.1007\/978-3-540-74477-1_31"},{"key":"471_CR17","unstructured":"Dragut EC. Deep web query interface understanding and integration"},{"issue":"1","key":"471_CR18","first-page":"33","volume":"39","author":"R Khare","year":"2010","unstructured":"Khare R, An Y, Song I-Y (2010) Understanding deep web search interfaces. Survey 39(1):33\u201340","journal-title":"Survey"},{"key":"471_CR19","doi-asserted-by":"crossref","unstructured":"Jamali M, Sayyadi H, Hariri BB, Abolhassani H (2006) A method for focused crawling using combination of link structure and content similarity. In: Proceedings of the 2006 IEEE\/WIC\/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI\u201906, pp. 753\u2013756","DOI":"10.1109\/WI.2006.19"},{"issue":"11","key":"471_CR20","doi-asserted-by":"publisher","first-page":"1623","DOI":"10.1016\/S1389-1286(99)00052-3","volume":"31","author":"S Chakrabarti","year":"1999","unstructured":"Chakrabarti S, Van Den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. 
Comput Netw 31(11):1623\u20131640","journal-title":"Comput Netw"},{"key":"471_CR21","doi-asserted-by":"crossref","unstructured":"Barbosa L, Freire J (2007) An adaptive crawler for locating hidden web entry points. In: Proceedings of the 16th international conference on World Wide Web\u2014WWW \u201907, p. 441","DOI":"10.1145\/1242572.1242632"},{"key":"471_CR22","doi-asserted-by":"crossref","unstructured":"Najork M, Wiener JL (2001) Breadth-first search crawling yields high-quality pages. In: Proceedings of the 10th International Conference on World Wide Web, WWW 2001, pp. 114\u2013118","DOI":"10.1145\/371920.371965"},{"key":"471_CR23","unstructured":"Korf RE, Schultze P (2005) Large-scale parallel breadth-first search. In: Proceedings of the National Conference on Artificial Intelligence, vol. 3, pp. 1380\u20131385"},{"key":"471_CR24","unstructured":"Peshave M, Dezhgosha K (2005) How search engines work and a web crawler application"},{"key":"471_CR25","doi-asserted-by":"crossref","unstructured":"Wang W, Chen X, Zou Y, Wang H, Dai Z (2010) A focused crawler based on naive Bayes classifier. In: 3rd International Symposium on Intelligent Information Technology and Security Informatics, IITSI 2010, pp 517\u2013521","DOI":"10.1109\/IITSI.2010.30"},{"key":"471_CR26","doi-asserted-by":"crossref","unstructured":"Zheng X, Zhou T, Yu Z, Chen D (2008) URL rule based focused crawlers. In: IEEE International Conference on e-Business Engineering, ICEBE\u201908\u2014Workshops: AiR\u201908, EM2I\u201908, SOAIC\u201908, SOKM\u201908, BIMA\u201908, DKEEE\u201908, pp. 147\u2013154","DOI":"10.1109\/ICEBE.2008.61"},{"key":"471_CR27","unstructured":"Barbosa L, Freire J. Searching for hidden-web databases"},{"key":"471_CR28","doi-asserted-by":"crossref","unstructured":"Barbosa L, Freire J (2007) An adaptive crawler for locating hidden web entry points. In: 16th International World Wide Web Conference, WWW2007, pp. 
441\u2013450","DOI":"10.1145\/1242572.1242632"},{"key":"471_CR29","unstructured":"Madhavan J et al (2007) Web-scale data integration: you can only afford to pay as you go. In: Proceedings of CIDR, pp. 342\u2013350"},{"issue":"4","key":"471_CR30","doi-asserted-by":"publisher","first-page":"608","DOI":"10.1109\/TSC.2015.2414931","volume":"9","author":"F Zhao","year":"2016","unstructured":"Zhao F, Zhou J, Nie C, Huang H, Jin H (2016) SmartCrawler: a two-stage crawler for efficiently harvesting deep-web interfaces. IEEE Trans Serv Comput 9(4):608\u2013620","journal-title":"IEEE Trans Serv Comput"},{"issue":"1","key":"471_CR31","doi-asserted-by":"publisher","first-page":"163","DOI":"10.1007\/s10796-018-9863-6","volume":"21","author":"C Jou","year":"2019","unstructured":"Jou C (2019) Schema extraction for deep web query interfaces using heuristics rules. Inf Syst Front 21(1):163\u2013174","journal-title":"Inf Syst Front"},{"issue":"3","key":"471_CR32","doi-asserted-by":"publisher","first-page":"1563","DOI":"10.1007\/s11042-015-2624-3","volume":"75","author":"T Tsikrika","year":"2016","unstructured":"Tsikrika T, Moumtzidou A, Vrochidis S, Kompatsiaris I (2016) Focussed crawling of environmental web resources based on the combination of multimedia evidence. Multimed Tools Appl 75(3):1563\u20131587","journal-title":"Multimed Tools Appl"},{"key":"471_CR33","unstructured":"Helfenstein A, Tammela P (2016) Data and text mining analyzing user-generated online content for drug discovery: development and use of Med-crawler, pp. 0\u20135"},{"issue":"6","key":"471_CR34","doi-asserted-by":"publisher","first-page":"2106","DOI":"10.1109\/TIE.2010.2050754","volume":"58","author":"H Dong","year":"2011","unstructured":"Dong H, Hussain FK (2011) Focused crawling for automatic service discovery, annotation, and classification in industrial digital ecosystems. 
IEEE Trans Ind Electron 58(6):2106\u20132116","journal-title":"IEEE Trans Ind Electron"},{"issue":"7","key":"471_CR35","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1371\/journal.pone.0200650","volume":"13","author":"S Lopez-Aparicio","year":"2018","unstructured":"Lopez-Aparicio S, Grythe H, Vogt M, Pierce M, Vallejo I (2018) Webcrawling and machine learning as a new approach for the spatial distribution of atmospheric emissions. PLoS ONE 13(7):1\u201315","journal-title":"PLoS ONE"},{"issue":"3","key":"471_CR36","doi-asserted-by":"publisher","first-page":"61","DOI":"10.1145\/1031570.1031584","volume":"33","author":"KC-C Chang","year":"2004","unstructured":"Chang KC-C, He B, Li C, Patel M, Zhang Z (2004) Structured databases on the web. ACM SIGMOD Rec 33(3):61","journal-title":"ACM SIGMOD Rec"},{"key":"471_CR37","doi-asserted-by":"publisher","first-page":"117582","DOI":"10.1109\/ACCESS.2020.3004756","volume":"8","author":"S Kaur","year":"2020","unstructured":"Kaur S, Geetha G (2020) SIMHAR\u2014smart distributed web crawler for the hidden web using SIM+Hash and Redis server. IEEE Access 8:117582\u2013117592","journal-title":"IEEE Access"},{"key":"471_CR38","unstructured":"Teixeira PM (2018) Relevance ranking for predicting web search results, vol 1"},{"key":"471_CR39","unstructured":"Valkanas G, Ntoulas A (2011) Rank-aware crawling of hidden web sites. In: WebDB, pp 1\u20136"},{"key":"471_CR40","doi-asserted-by":"crossref","unstructured":"Haveliwala TH, Gionis A, Klein D, Indyk P (2002) Evaluating strategies for similarity search on the web. In: Proceedings of the 11th International Conference on World Wide Web, WWW \u201902, pp 432\u2013442","DOI":"10.1145\/511446.511502"},{"issue":"4","key":"471_CR41","doi-asserted-by":"publisher","first-page":"219","DOI":"10.1023\/A:1019213109274","volume":"2","author":"A Heydon","year":"1999","unstructured":"Heydon A, Najork M (1999) Mercator: a scalable, extensible web crawler. 
World Wide Web 2(4):219\u2013229","journal-title":"World Wide Web"},{"key":"471_CR42","doi-asserted-by":"publisher","first-page":"4535","DOI":"10.1007\/s10586-018-2084-4","volume":"22","author":"AK Sangaiah","year":"2019","unstructured":"Sangaiah AK, Fakhry AE, Abdel-Basset M, El-henawy I (2019) Arabic text clustering using improved clustering algorithms with dimensionality reduction. Clust Comput 22:4535\u20134549","journal-title":"Clust Comput"},{"issue":"3","key":"471_CR43","doi-asserted-by":"publisher","first-page":"144","DOI":"10.35470\/2226-4116-2020-9-3-144-151","volume":"9","author":"NLH Hien","year":"2020","unstructured":"Hien NLH, Tien TQ, Van Hieu N (2020) Web crawler: design and implementation for extracting article-like contents. Cybern Phys 9(3):144\u2013151","journal-title":"Cybern Phys"},{"key":"471_CR44","doi-asserted-by":"publisher","first-page":"104453","DOI":"10.1016\/j.ijmedinf.2021.104453","volume":"150","author":"J Schedlbauer","year":"2021","unstructured":"Schedlbauer J, Raptis G, Ludwig B (2021) Medical informatics labor market analysis using web crawling, web scraping, and text mining. Int J Med Inform 150:104453","journal-title":"Int J Med Inform"},{"issue":"2","key":"471_CR45","doi-asserted-by":"publisher","first-page":"1403","DOI":"10.1016\/j.matpr.2020.06.596","volume":"37","author":"A Sharma","year":"2021","unstructured":"Sharma A, Shrivastava V, Singh H (2021) Experimental performance analysis of web crawlers using single and multi-threaded web crawling and indexing algorithm for the application of smart web contents. 
Mater Today: Proc 37(2):1403\u20131408","journal-title":"Mater Today: Proc"}],"updated-by":[{"DOI":"10.1007\/s40747-022-00839-x","type":"correction","label":"Correction","source":"publisher","updated":{"date-parts":[[2022,8,10]],"date-time":"2022-08-10T00:00:00Z","timestamp":1660089600000}}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-021-00471-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-021-00471-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-021-00471-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,7,27]],"date-time":"2023-07-27T09:07:57Z","timestamp":1690448877000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-021-00471-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,24]]},"references-count":45,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,8]]}},"alternative-id":["471"],"URL":"https:\/\/doi.org\/10.1007\/s40747-021-00471-1","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"value":"2199-4536","type":"print"},{"value":"2198-6053","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,24]]},"assertion":[{"value":"23 October 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 July 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 July 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}},{"value":"10 August 2022","order":4,"name":"change_date","label":"Change Date","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Correction","order":5,"name":"change_type","label":"Change Type","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"A Correction to this paper has been published:","order":6,"name":"change_details","label":"Change Details","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"https:\/\/doi.org\/10.1007\/s40747-022-00839-x","URL":"https:\/\/doi.org\/10.1007\/s40747-022-00839-x","order":7,"name":"change_details","label":"Change Details","group":{"name":"ArticleHistory","label":"Article History"}}]}}