{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T06:11:19Z","timestamp":1775283079402,"version":"3.50.1"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2009,6,1]],"date-time":"2009-06-01T00:00:00Z","timestamp":1243814400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Web"],"published-print":{"date-parts":[[2009,6]]},"abstract":"<jats:p>This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and infinite loops created by server-side scripts. We then offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb\/s (1,789 pages\/s). 
Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the Web graph with 41 billion unique nodes.<\/jats:p>","DOI":"10.1145\/1541822.1541823","type":"journal-article","created":{"date-parts":[[2009,6,30]],"date-time":"2009-06-30T13:10:17Z","timestamp":1246367417000},"page":"1-34","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":23,"title":["IRLbot"],"prefix":"10.1145","volume":"3","author":[{"given":"Hsin-Tsang","family":"Lee","sequence":"first","affiliation":[{"name":"Texas A&M University, College Station, TX"}]},{"given":"Derek","family":"Leonard","sequence":"additional","affiliation":[{"name":"Texas A&M University, College Station, TX"}]},{"given":"Xiaoming","family":"Wang","sequence":"additional","affiliation":[{"name":"Texas A&M University, College Station, TX"}]},{"given":"Dmitri","family":"Loguinov","sequence":"additional","affiliation":[{"name":"Texas A&M University, College Station, TX"}]}],"member":"320","published-online":{"date-parts":[[2009,7,3]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/775152.775192"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/383034.383035"},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the World Wide Web Conference (WWW'99)","author":"Bharat K."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1002\/spe.587"},{"key":"e_1_2_1_5_1","volume-title":"Algorithms and Models for the Web-Graph. 
Lecture Notes in Computer Science","volume":"3243","author":"Boldi P."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the World Wide Web Conference (WWW'98)","author":"Brin S."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0169-7552(97)00031-7"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/775152.775247"},{"key":"e_1_2_1_9_1","first-page":"5","article-title":"Crawling towards eternity: Building an archive of the World Wide Web","volume":"2","author":"Burner M.","year":"1997","journal-title":"Web Techn. Mag."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/509907.509965"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/511446.511464"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1149121.1149124"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/371920.371960"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0169-7552(94)90151-1"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1148170.1148187"},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of SuperComputing.","author":"Gleich D."},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the International Conference on Very Large Databases (VLDB'05)","author":"Gy\u00f6ngyi Z."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1026711.1026760"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1148170.1148222"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1019213109274"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the World Wide Web Conference (WWW'00)","author":"Hirai J."},{"key":"e_1_2_1_22_1","unstructured":"Internet Archive. Internet archive homepage. http:\/\/www.archive.org\/.  Internet Archive. Internet archive homepage. http:\/\/www.archive.org\/."},{"key":"e_1_2_1_23_1","unstructured":"IRLbot. 2007. IRLbot project at Texas A&M. http:\/\/irl.cs.tamu.edu\/crawler\/.  
IRLbot. 2007. IRLbot project at Texas A&M. http:\/\/irl.cs.tamu.edu\/crawler\/."},{"key":"e_1_2_1_24_1","unstructured":"Kamvar S. D. Haveliwala T. H. Manning C. D. and Golub G. H. 2003a. Exploiting the block structure of the Web for computing pagerank. Tech. rep. Stanford University."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/775152.775190"},{"key":"e_1_2_1_26_1","volume-title":"International Symposium on Communications and Information Technology.","author":"Koht"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the Latin American Web Congress (LAWEB'03)","author":"Manasse D. F. M."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242592"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/64.577466"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0169-7552(94)90149-X"},{"key":"e_1_2_1_31_1","doi-asserted-by":"crossref","unstructured":"Najork M. and Heydon A. 2001. High-performance Web crawling. Tech. rep. 173 Compaq Systems Research Center.","DOI":"10.1007\/978-1-4615-0005-6_2"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/371920.371965"},{"key":"e_1_2_1_33_1","unstructured":"Official Google Blog. 2008. We knew the Web was big\u2026 http:\/\/googleblog.blogspot.com\/2008\/07\/we-knew-web-was-big.html.  Official Google Blog. 2008. 
We knew the Web was big\u2026 http:\/\/googleblog.blogspot.com\/2008\/07\/we-knew-web-was-big.html."},{"key":"e_1_2_1_34_1","volume-title":"World Wide Web Conference (WWW'94)","author":"Pinkerton B.","year":"1994"},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the IEEE International Conference on Data Engineering (ICDE'02)","author":"Shkapenyuk V."},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the ACM SIGIR Workshop on Distributed Information Retrieval. 126--142","author":"Singh A."},{"key":"e_1_2_1_38_1","volume-title":"Proceedings of the International Workshop on Web and Databases (WebDB'03)","author":"Suel T."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/384192.384193"},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the International Conference on Adaptive Hypermedia, 265--274","author":"Wu J."}],"container-title":["ACM Transactions on the Web"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1541822.1541823","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1541822.1541823","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T12:18:08Z","timestamp":1750249088000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1541822.1541823"}},"subtitle":["Scaling to 6 billion pages and 
beyond"],"short-title":[],"issued":{"date-parts":[[2009,6]]},"references-count":39,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2009,6]]}},"alternative-id":["10.1145\/1541822.1541823"],"URL":"https:\/\/doi.org\/10.1145\/1541822.1541823","relation":{},"ISSN":["1559-1131","1559-114X"],"issn-type":[{"value":"1559-1131","type":"print"},{"value":"1559-114X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2009,6]]},"assertion":[{"value":"2008-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2009-03-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2009-07-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}