{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:34:58Z","timestamp":1750307698082,"version":"3.41.0"},"reference-count":10,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2008,11,30]],"date-time":"2008-11-30T00:00:00Z","timestamp":1228003200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGIR Forum"],"published-print":{"date-parts":[[2008,11,30]]},"abstract":"<jats:p>Collections are a fundamental tool for reproducible evaluation of information retrieval techniques. We describe a new method for distributing the document lengths and term counts (a.k.a. within-document frequencies) of a web snapshot in a highly compressed and nonetheless quickly accessible form. Our main application is reproducibility of the behaviour of focused crawlers: by coupling our collection with the corresponding web graph compressed with WebGraph [3] we make it possible to apply text-based machine learning tools to the collection, while keeping the data set footprint small. 
We describe a collection based on a crawl of 100 Mpages of the .uk domain, publicly available in bundle with a Java open-source implementation of our techniques.<\/jats:p>","DOI":"10.1145\/1480506.1480512","type":"journal-article","created":{"date-parts":[[2008,12,30]],"date-time":"2008-12-30T17:45:31Z","timestamp":1230659131000},"page":"39-44","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Compressed collections for simulated crawling"],"prefix":"10.1145","volume":"42","author":[{"given":"Alessio","family":"Orlandi","sequence":"first","affiliation":[{"name":"Universit\u00e0 di Pisa, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sebastiano","family":"Vigna","sequence":"additional","affiliation":[{"name":"Universit\u00e0 degli Studi di Milano, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2008,11,30]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1148170.1148234"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1002\/spe.587"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/988672.988752"},{"key":"e_1_2_1_4_1","doi-asserted-by":"crossref","unstructured":"P. Boldi and S. Vigna. Codes for the world wide web. Internet mathematics 2(4):405--427 2005.","DOI":"10.1080\/15427951.2005.10129113"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1189702.1189703"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/S1389-1286(99)00052-3"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/321812.321820"},{"key":"e_1_2_1_8_1","unstructured":"R. M. Fano. On the number of bits required to implement an associative memory. Memorandum 61 Computer Structures Group Project MAC MIT Cambridge Mass. n.d. 1971."},
{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/1788888.1788900"},{"key":"e_1_2_1_10_1","unstructured":"WARC file format ISO\/DIS 28500 2007."}],"container-title":["ACM SIGIR Forum"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1480506.1480512","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1480506.1480512","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T13:29:59Z","timestamp":1750253399000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1480506.1480512"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2008,11,30]]},"references-count":10,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2008,11,30]]}},"alternative-id":["10.1145\/1480506.1480512"],"URL":"https:\/\/doi.org\/10.1145\/1480506.1480512","relation":{},"ISSN":["0163-5840"],"issn-type":[{"type":"print","value":"0163-5840"}],"subject":[],"published":{"date-parts":[[2008,11,30]]},"assertion":[{"value":"2008-11-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}