{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T04:28:48Z","timestamp":1760243328138,"version":"build-2065373602"},"reference-count":52,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2014,8,19]],"date-time":"2014-08-19T00:00:00Z","timestamp":1408406400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/3.0\/"}],"funder":[{"DOI":"10.13039\/501100000780","name":"European Commission","doi-asserted-by":"publisher","award":["270239"],"award-info":[{"award-number":["270239"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"<jats:p>The World Wide Web is the largest information repository available today. However, this information is very volatile and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limitations and to provide flexible, adaptive and intelligent content acquisition, relying on social media to create topical Web archives. In this article, we focus on ARCOMEM\u2019s crawling architecture. We introduce the overall architecture and we describe its modules, such as the online analysis module, which computes a priority for the Web pages to be crawled, and the Application-Aware Helper which takes into account the type of Web sites and applications to extract structure from crawled content. We also describe a large-scale distributed crawler that has been developed, as well as the modifications we have implemented to adapt Heritrix, an open source crawler, to the needs of the project. Our experimental results from real crawls show that ARCOMEM\u2019s crawling architecture is effective in acquiring focused information about a topic and leveraging the information from social media.<\/jats:p>","DOI":"10.3390\/fi6030518","type":"journal-article","created":{"date-parts":[[2014,8,19]],"date-time":"2014-08-19T10:40:39Z","timestamp":1408444839000},"page":"518-541","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["ARCOMEM Crawling Architecture"],"prefix":"10.3390","volume":"6","author":[{"given":"Vassilis","family":"Plachouras","sequence":"first","affiliation":[{"name":"Institute for the Management of Information Systems, Athena Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi 15125, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Florent","family":"Carpentier","sequence":"additional","affiliation":[{"name":"Internet Memory Foundation, 45 ter rue de la R\u00e9volution, 93100 Montreuil, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Muhammad","family":"Faheem","sequence":"additional","affiliation":[{"name":"CNRS LTCI, Institut Mines-T\u00e9l\u00e9com, T\u00e9l\u00e9com ParisTech, 46 rue Barrault, 75634 Paris Cedex 13, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Julien","family":"Masan\u00e8s","sequence":"additional","affiliation":[{"name":"Internet Memory Foundation, 45 ter rue de la R\u00e9volution, 93100 Montreuil, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Thomas","family":"Risse","sequence":"additional","affiliation":[{"name":"Research Center, University of Hannover, Appelstr. 9a, 30167 Hannover, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Pierre","family":"Senellart","sequence":"additional","affiliation":[{"name":"CNRS LTCI, Institut Mines-T\u00e9l\u00e9com, T\u00e9l\u00e9com ParisTech, 46 rue Barrault, 75634 Paris Cedex 13, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Patrick","family":"Siehndel","sequence":"additional","affiliation":[{"name":"Research Center, University of Hannover, Appelstr. 9a, 30167 Hannover, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yannis","family":"Stavrakas","sequence":"additional","affiliation":[{"name":"Institute for the Management of Information Systems, Athena Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi 15125, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2014,8,19]]},"reference":[{"key":"ref_1","first-page":"174","article-title":"A longitudinal study of Web pages continued: A consideration of document persistence","volume":"9","author":"Koehler","year":"2004","journal-title":"Inf. Res."},{"key":"ref_2","unstructured":"Historical Data Not Working. Available online:https:\/\/dev.twitter.com\/discussions\/2483."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Masan\u00e8s, J. (2006). Web Archiving, Springer-Verlag.","DOI":"10.1007\/978-3-540-46332-0"},{"key":"ref_4","unstructured":"Sigur\u00f0sson, K. (2005, January 22\u201323). Incremental crawling with Heritrix. Proceedings of the 5th International Web Archiving Workshop (IWAW\u201905), Vienna, Austria."},{"key":"ref_5","unstructured":"ARCOMEM: Archiving Communities Memories. Available online:http:\/\/www.arcomem.eu\/."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"426","DOI":"10.1007\/978-3-642-33290-6_47","article-title":"Exploiting the Social and Semantic Web for Guided Web Archiving","volume":"Volume 7489","author":"Zaphiris","year":"2012","journal-title":"Theory and Practice of Digital Libraries"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Plachouras, V., Carpentier, F., Masan\u00e9s, J., Risse, T., Senellart, P., Siehndel, P., and Stavrakas, Y. (2013, January 6). An Architecture for Selective Web Harvesting: The Use Case of Heritrix. Proceedings of the 1st International Workshop on Archiving Community Memories, Lisbon, Portugal.","DOI":"10.3390\/fi6030518"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"175","DOI":"10.1561\/1500000017","article-title":"Web Crawling","volume":"4","author":"Olston","year":"2010","journal-title":"Found. Trends Inf. Retr."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Brin, S., and Page, L. (1998, January 14\u201318). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the 7th International Conference on World Wide Web, Brisbane, Australia.","DOI":"10.1016\/S0169-7552(98)00110-X"},{"key":"ref_10","unstructured":"Burner, M. Crawling towards eternity: Building an archive of the World Wide Web. Available online:http:\/\/people.apache.org\/ jim\/NewArchitect\/webtech\/1997\/05\/burner\/index.htmL."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"219","DOI":"10.1023\/A:1019213109274","article-title":"Mercator: A Scalable, Extensible Web Crawler","volume":"2","author":"Heydon","year":"1999","journal-title":"World Wide Web"},{"key":"ref_12","unstructured":"Najork, M., and Heydon, A. (2002). Handbook of Massive Data Sets, Kluwer Academic Publishers."},{"key":"ref_13","unstructured":"Shkapenyuk, V., and Suel, T. (March, January 26). Design and implementation of a high-performance distributed Web crawler. Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA."},{"key":"ref_14","unstructured":"Mohr, G., Kimpton, M., Stack, M., and Ranitovic, I. (2004, January 16). Introduction to heritrix, an archival quality web crawler. Proceedings of the 4th International Web Archiving Workshop (IWAW\u201904), Bath, UK."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"711","DOI":"10.1002\/spe.587","article-title":"UbiCrawler: A Scalable Fully Distributed Web Crawler","volume":"34","author":"Boldi","year":"2004","journal-title":"Softw. Pract. Exp."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"8:1","DOI":"10.1145\/1541822.1541823","article-title":"IRLbot: Scaling to 6 Billion Pages and Beyond","volume":"3","author":"Lee","year":"2009","journal-title":"ACM Trans. Web"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Ntoulas, A., Cho, J., and Olston, C. (2004, January 17\u201322). What\u2019s New on the Web?: The Evolution of the Web from a Search Engine Perspective. Proceedings of the 13th International Conference on World Wide Web (WWW \u201904), New York, NY, USA.","DOI":"10.1145\/988672.988674"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Fetterly, D., Manasse, M., Najork, M., and Wiener, J. (2003, January 20\u201324). A Large-scale Study of the Evolution of Web Pages. Proceedings of the 12th International Conference on World Wide Web (WWW \u201903), Budapest, Hungary.","DOI":"10.1145\/775244.775246"},{"key":"ref_19","unstructured":"Cho, J., and Garcia-Molina, H. (2000, January 10\u201314). The Evolution of the Web and Implications for an Incremental Crawler. Proceedings of the 26th International Conference on Very Large Data Bases (VLDB \u201900), Cairo, Egypt."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Olston, C., and Pandey, S. (2008, January 21\u201325). Recrawl Scheduling Based on Information Longevity. Proceedings of the 17th International Conference on World Wide Web (WWW \u201908), Beijing, China.","DOI":"10.1145\/1367497.1367557"},{"key":"ref_21","unstructured":"Pandey, S., Dhamdhere, K., and Olston, C. (September, January 29). WIC: A General-purpose Algorithm for Monitoring Web Information Sources. Proceedings of the 30th International Conference on Very Large Data Bases, (VLDB \u201904), Toronto, Canada."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Gouriten, G., Maniu, S., and Senellart, P. (2014, January 1\u20134). Scalable, Generic, and Adaptive Systems for Focused Crawling. Proceedings of the 25th ACM Conference on Hypertext and Social Media, Santiago, Chile.","DOI":"10.1145\/2631775.2631795"},{"key":"ref_23","unstructured":"Tang, T.T., Hawking, D., Craswell, N., and Griffiths, K. (November, January 31). Focused Crawling for Both Topical Relevance and Quality of Medical Information. Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM \u201905), Bremen, Germany."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Menczer, F., Pant, G., Srinivasan, P., and Ruiz, M.E. (2001, January 9\u201312). Evaluating Topic-driven Web Crawlers. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR \u201901), New Orleans, LA, USA.","DOI":"10.1145\/383952.383995"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1623","DOI":"10.1016\/S1389-1286(99)00052-3","article-title":"Focused crawling: A new approach to topic-specific Web resource discovery","volume":"31","author":"Chakrabarti","year":"1999","journal-title":"Comput. Netw."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"320","DOI":"10.1007\/s00778-003-0100-6","article-title":"THESUS: Organizing Web document collections based on link semantics","volume":"12","author":"Halkidi","year":"2003","journal-title":"VLDB J."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Ehrig, M., and Maedche, A. (2003, January 9\u201312). Ontology-focused Crawling of Web Documents. Proceedings of the 2003 ACM Symposium on Applied Computing (SAC \u201903), Melbourne, FL, USA.","DOI":"10.1145\/952756.952761"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Ahlers, D., and Boll, S. (2009, January 2\u20136). Adaptive Geospatially Focused Crawling. Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM \u201909), Hong Kong, China.","DOI":"10.1145\/1645953.1646011"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Gao, W., Lee, H.C., and Miao, Y. (2006, January 23\u201326). Geographically Focused Collaborative Crawling. Proceedings of the 15th International Conference on World Wide Web (WWW \u201906), Edinburgh, UK.","DOI":"10.1145\/1135777.1135822"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"De Bra, P.M.E., and Post, R.D.J. (1994, January 25\u201327). Information Retrieval in the World-Wide Web: Making Client-based Searching Feasible. Proceedings of the 1st Conference on World-Wide Web, Geneva, Switzerland.","DOI":"10.1016\/0169-7552(94)90132-5"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"317","DOI":"10.1016\/S0169-7552(98)00038-5","article-title":"The shark-search algorithm. An application: tailored Web site mapping","volume":"30","author":"Hersovici","year":"1998","journal-title":"Comput. Netw. ISDN Syst."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"886","DOI":"10.1016\/j.is.2006.09.004","article-title":"Combining Text and Link Analysis for Focused crawling\u2014An Application for Vertical Search Engines","volume":"32","author":"Almpanidis","year":"2007","journal-title":"Inf. Syst."},{"key":"ref_33","unstructured":"Diligenti, M., Coetzee, F., Lawrence, S., Giles, C.L., and Gori, M. (2000, January 10\u201314). Focused Crawling Using Context Graphs. Proceedings of the 26th International Conference on Very Large Data Bases (VLDB \u201900), Cairo, Egypt."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"270","DOI":"10.1016\/j.datak.2006.01.012","article-title":"Using HMM to Learn User Browsing Patterns for Focused Web Crawling","volume":"59","author":"Liu","year":"2006","journal-title":"Data Knowl. Eng."},{"key":"ref_35","unstructured":"Partalas, I., Paliouras, G., and Vlahavas, I. (2008, January 13\u201317). Reinforcement Learning with Classifier Selection for Focused Crawling. Proceedings of the 2008 Conference on Artificial Intelligence, Chicago, IL, USA."},{"key":"ref_36","unstructured":"Fawcett, T., and Mishra, N. (2003, January 21\u201324). Evolving Strategies for Focused Web Crawling. Proceedings of the 20th International Conference on Machine Learning, Washington, DC, USA."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"4512","DOI":"10.1016\/j.ins.2008.07.030","article-title":"An ontology-based approach to learnable focused crawling","volume":"178","author":"Zheng","year":"2008","journal-title":"Inf. Sci."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Bergman, M.K. (2001). White paper: the Deep Web: surfacing Hidden Value. J. Electron. Publ., 7, (1), Available online:http:\/\/dx.doi.org\/10.3998\/3336451.0007.104.","DOI":"10.3998\/3336451.0007.104"},{"key":"ref_39","unstructured":"Barbosa, L., and Freire, J. (2004, January 3\u20137). Siphoning Hidden-Web Data through Keyword-Based Interfaces. Proceedings of the 19th Brazilian Symposium on Databases, Brasilia, Brazil."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"1241","DOI":"10.14778\/1454159.1454163","article-title":"Google\u2019s Deep Web crawl","volume":"1","author":"Madhavan","year":"2008","journal-title":"Proc. VLDB Endow."},{"key":"ref_41","unstructured":"Arvidson, A., Persson, K., and Mannerheim, J. (2000, January 13\u201318). The Kulturarw3 Project\u2014The Royal Swedish Web Archiw3e\u2014An example of \u201ccomplete\u201d collection of web pages. Proceedings of the 66th IFLA Council and General Conference, Jerusalem, Israel."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Bailey, S., and Thompson, D. (2006). UKWAC: Building the UK\u2019s First Public Web Archive. D-Lib Mag., 12, Available online http:\/\/www.dlib.org\/dlib\/january06\/thompson\/01thompson.html.","DOI":"10.1045\/january2006-thompson"},{"key":"ref_43","unstructured":"Cathro, W., Webb, C., and Whiting, J. Archiving the Web: The PANDORA Archive at the National Library Australia, Available online:http:\/\/www.nla.gov.au\/openpublish\/index.php\/nlasp\/article\/view\/1314\/1600."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Spaniol, M., Denev, D., Mazeika, A., Weikum, G., and Senellart, P. (2009, January 20\u201324). Data Quality in Web Archiving. Proceedings of the 3rd Workshop on Information Credibility on the Web (WICOW \u201909), Madrid, Spain.","DOI":"10.1145\/1526993.1526999"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"586","DOI":"10.14778\/1687627.1687694","article-title":"SHARC: Framework for Quality-conscious Web Archiving","volume":"2","author":"Denev","year":"2009","journal-title":"Proc. VLDB Endow."},{"key":"ref_46","unstructured":"Gomes, D., Miranda, J.A., and Costa, M. (2011, January 26\u201328). A Survey on Web Archiving Initiatives. Proceedings of the 15th International Conference on Theory and Practice of Digital Libraries: Research and Advanced Technology for Digital Libraries (TPDL\u201911), Berlin, Germany."},{"key":"ref_47","unstructured":"Kay, M., and Boitet, C. (2012, January 8\u201315). NEER: An Unsupervised Method for Named Entity Evolution Recognition. Proceedings of the 24th International Conference on Computational Linguistics (COLING\u2019 12), Mumbai, India."},{"key":"ref_48","first-page":"598","article-title":"Assessing the Coverage of Data Collection Campaigns on Twitter: A Case Study","volume":"Volume 8186","author":"Demey","year":"2013","journal-title":"On the Move to Meaningful Internet Systems: OTM 2013 Workshops"},{"key":"ref_49","first-page":"306","article-title":"Intelligent and Adaptive Crawling of Web Applications for Web Archiving","volume":"Volume 7977","author":"Daniel","year":"2013","journal-title":"Proceedings of the 13th International Conference on Web Engineering (ICWE)"},{"key":"ref_50","unstructured":"Faheem, M., and Senellart, P. (November, January 27). Demonstrating intelligent crawling and archiving of web applications. Proceedings of the 22nd ACM International Conference Information and Knowledge Management (CIKM \u201913), Burlingame, CA, USA."},{"key":"ref_51","unstructured":"Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., Gorrell, G., Funk, A., Roberts, A., and Damljanovic, D. (2011). Text Processing with GATE (Version 6), Department of Computer Science, University of Sheffield."},{"key":"ref_52","unstructured":"Apache Lucene Core. Available online:http:\/\/lucene.apache.org\/core\/."}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/6\/3\/518\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T21:14:56Z","timestamp":1760217296000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/6\/3\/518"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,8,19]]},"references-count":52,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2014,9]]}},"alternative-id":["fi6030518"],"URL":"https:\/\/doi.org\/10.3390\/fi6030518","relation":{},"ISSN":["1999-5903"],"issn-type":[{"type":"electronic","value":"1999-5903"}],"subject":[],"published":{"date-parts":[[2014,8,19]]}}}