{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,13]],"date-time":"2026-05-13T17:30:39Z","timestamp":1778693439847,"version":"3.51.4"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,10,11]],"date-time":"2023-10-11T00:00:00Z","timestamp":1696982400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Web"],"published-print":{"date-parts":[[2024,2,29]]},"abstract":"<jats:p>Automatically scraping relevant images from web pages is an error-prone and time-consuming task, leading experts to prefer manually preparing extraction patterns for a website. Existing web scraping tools are built on these patterns. However, this manual approach is laborious and requires specialized knowledge. Automatic extraction approaches, while a potential solution, require large training datasets and numerous features, including width, height, pixels, and file size, that can be difficult and time-consuming to obtain. To address these challenges, we propose a semi-automatic approach that does not require an expert, utilizes small training datasets, and has a low error rate while saving time and storage. Our approach involves clustering web pages from a website and suggesting several pages for a non-expert to annotate relevant images. The approach then uses these annotations to construct a learning model based on textual data from the HTML elements. In the experiments, we used a dataset of 635,015 images from 200 news websites, each containing 100 pages, with 22,632 relevant images. When comparing several machine learning methods for both automatic approaches and our proposed approach, the AdaBoost method yields the best performance results. When using automatic extraction approaches, the best f-Measure that can be achieved is 0.805 with a learning model constructed from a large training dataset consisting of 120 websites (12,000 web pages). In contrast, our approach achieved an average f-Measure of 0.958 for 200 websites with only six web pages annotated per website. This means that a non-expert only needs to examine 1,200 web pages to determine the relevant images for 200 websites. Our approach also saves time and storage space by not requiring the download of images and can be easily integrated into currently available web scraping tools, because it is based on textual data.<\/jats:p>","DOI":"10.1145\/3616849","type":"journal-article","created":{"date-parts":[[2023,8,19]],"date-time":"2023-08-19T09:54:22Z","timestamp":1692438862000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Scraping Relevant Images from Web Pages without Download"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4351-2244","authenticated-orcid":false,"given":"Erdin\u00e7","family":"Uzun","sequence":"first","affiliation":[{"name":"Tekirda\u011f Nam\u0131k Kemal University, Turkey"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,10,11]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2023.110030"},{"key":"e_1_3_2_3_2","first-page":"393","volume-title":"Web Information Systems Engineering (WISE\u201918)","author":"Alarte Julian","year":"2018","unstructured":"Julian Alarte, David Insa, Josep Silva, and Salvador Tamarit. 2018. Main content extraction from heterogeneous webpages. In Web Information Systems Engineering (WISE\u201918), Hakim Hacid, Wojciech Cellary, Hua Wang, Hye-Young Paik, and Rui Zhou (Eds.). Springer International Publishing, Cham, 393\u2013407."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/FIT47737.2019.00061"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/511446.511522"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2019.03.002"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.4304\/jetwi.6.2.226-230"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/2857054"},{"key":"e_1_3_2_9_2","first-page":"226","volume-title":"2nd International Conference on Knowledge Discovery and Data Mining (KDD\u201996)","author":"Ester Martin","year":"1996","unstructured":"Martin Ester, Hans-Peter Kriegel, J\u00f6rg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In 2nd International Conference on Knowledge Discovery and Data Mining (KDD\u201996). AAAI Press, 226\u2013231."},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3365376"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.14704\/WEB\/V16I1\/a177"},{"key":"e_1_3_2_12_2","doi-asserted-by":"crossref","unstructured":"Emilio Ferrara and Robert Baumgartner. 2011. Automatic wrapper adaptation by tree edit distance matching. In Combinations of Intelligent Methods and Applications. Springer UK 41\u201354.","DOI":"10.1007\/978-3-642-19618-8_3"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2017.04.007"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.5555\/863312"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.5220\/0005438704110419"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3555349"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2013.09.027"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1117\/12.411880"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0306-4379(98)00027-1"},{"key":"e_1_3_2_20_2","volume-title":"Representative Image Extraction from Web Page","author":"Islam Imranul","year":"2021","unstructured":"Imranul Islam. 2021. Representative Image Extraction from Web Page. Master\u2019s Thesis. University of Eastern Finland, Faculty of Science and Forestry, Joensuu School of Computing."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2021.102683"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCIT.2007.19"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/1718487.1718542"},{"issue":"8","key":"e_1_3_2_24_2","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions and reversals.","volume":"10","author":"Levenshtein Vladimir Iosifovich","year":"1966","unstructured":"Vladimir Iosifovich Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10, 8 (1966), 707\u2013710.","journal-title":"Sov. Phys. Dokl."},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-19460-3"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.5626\/JCSE.2017.11.2.39"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10489-019-01423-6"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1007\/s13198-021-01166-z"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3365574"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/301136.301191"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/988672.988740"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.5555\/645925.671350"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-016-9359-2"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/8291.001.0001"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.softx.2022.100985"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.3906\/elk-2004-67"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2013.02.005"},{"key":"e_1_3_2_38_2","first-page":"275","volume-title":"International Scientific Conference (UNITECH\u201917)","author":"Uzun Erdin\u00e7","year":"2017","unstructured":"Erdin\u00e7 Uzun, Halil Nusret Bulu\u015f, Alpay Doruk, and Erkan \u00d6zhan. 2017. Evaluation of HAP, AngleSharp and HtmlDocument in web content extraction. In International Scientific Conference (UNITECH\u201917). UNITECH, 275\u2013278."},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1002\/spe.2195"},{"key":"e_1_3_2_40_2","first-page":"87","article-title":"Comparison of Python libraries used for web data extraction","volume":"24","author":"Uzun Erdin\u00e7","year":"2018","unstructured":"Erdin\u00e7 Uzun, Tar\u0131k Yerlikaya, and O\u011fuz K\u0131rat. 2018. Comparison of Python libraries used for web data extraction. J. Technic. Univ. - Sofia Plovdiv branch, Bulgar. 24 (2018), 87\u201392.","journal-title":"J. Technic. Univ. - Sofia Plovdiv branch, Bulgar."},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/IDAP.2018.8620774"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.3039044"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1177\/0165551516666446"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cogsys.2019.07.004"},{"key":"e_1_3_2_45_2","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1007\/978-3-319-76941-7_13","volume-title":"Advances in Information Retrieval","author":"Vogels Thijs","year":"2018","unstructured":"Thijs Vogels, Octavian-Eugen Ganea, and Carsten Eickhoff. 2018. Web2Text: Deep structured boilerplate removal. In Advances in Information Retrieval, Gabriella Pasi, Benjamin Piwowarski, Leif Azzopardi, and Allan Hanbury (Eds.). Springer International Publishing, Cham, 167\u2013179."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2019.10.045"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2015.12.025"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2021.102610"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2006.197"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3372117"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2021.102656"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2019.102097"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/IDAP.2018.8620893"}],"container-title":["ACM Transactions on the Web"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3616849","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3616849","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:24Z","timestamp":1750182564000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3616849"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,11]]},"references-count":52,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,2,29]]}},"alternative-id":["10.1145\/3616849"],"URL":"https:\/\/doi.org\/10.1145\/3616849","relation":{},"ISSN":["1559-1131","1559-114X"],"issn-type":[{"value":"1559-1131","type":"print"},{"value":"1559-114X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,10,11]]},"assertion":[{"value":"2022-08-11","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-02","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-10-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}