{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,6]],"date-time":"2026-04-06T11:50:22Z","timestamp":1775476222483,"version":"3.50.1"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2008,10,1]],"date-time":"2008-10-01T00:00:00Z","timestamp":1222819200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. ACM"],"published-print":{"date-parts":[[2008,10]]},"abstract":"<jats:p>We revisit a problem introduced by Bharat and Broder almost a decade ago: How to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.<\/jats:p>\n          <jats:p>\n            The technique of Bharat and Broder suffers from a well-recorded bias: it favors long documents. In this article we introduce two novel sampling algorithms: a lexicon-based algorithm and a random walk algorithm. Our algorithms produce\n            <jats:italic>biased<\/jats:italic>\n            samples, but each sample is accompanied by a\n            <jats:italic>weight<\/jats:italic>\n            , which represents its bias. The samples, in conjunction with the weights, are then used to\n            <jats:italic>simulate<\/jats:italic>\n            near-uniform samples. To this end, we resort to four well-known Monte Carlo simulation methods:\n            <jats:italic>rejection sampling<\/jats:italic>\n            ,\n            <jats:italic>importance sampling<\/jats:italic>\n            , the\n            <jats:italic>Metropolis--Hastings<\/jats:italic>\n            algorithm, and the\n            <jats:italic>Maximum Degree<\/jats:italic>\n            method.\n          <\/jats:p>\n          <jats:p>\n            The limited access to search engines force our algorithms to use bias weights that are only \u201capproximate\u201d. We characterize analytically the effect of approximate bias weights on Monte Carlo methods and conclude that our algorithms are\n            <jats:italic>guaranteed<\/jats:italic>\n            to produce near-uniform samples from the search engine's corpus. Our study of approximate Monte Carlo methods could be of independent interest.\n          <\/jats:p>\n          <jats:p>Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long documents. We use our algorithms to collect comparative statistics about the corpora of the Google, MSN Search, and Yahoo! search engines.<\/jats:p>","DOI":"10.1145\/1411509.1411514","type":"journal-article","created":{"date-parts":[[2008,11,6]],"date-time":"2008-11-06T13:49:43Z","timestamp":1225979383000},"page":"1-74","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":77,"title":["Random sampling from a search engine's index"],"prefix":"10.1145","volume":"55","author":[{"given":"Ziv","family":"Bar-Yossef","sequence":"first","affiliation":[{"name":"Technion and Google Haifa, Haifa, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Maxim","family":"Gurevich","sequence":"additional","affiliation":[{"name":"Technion, Haifa, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2008,11,5]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1017\/S0269964800000267"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1060745.1060784"},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the 26th International Conference on Very Large Databases (VLDB). ACM","author":"Bar-Yossef Z.","unstructured":"Bar-Yossef , Z. , Berg , A. , Chien , S. , Fakcharoenphol , J. , and Weitz , D . 2000. Approximating aggregate queries about Web pages via random walks . In Proceedings of the 26th International Conference on Very Large Databases (VLDB). ACM , New York, 535--544. Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. 2000. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th International Conference on Very Large Databases (VLDB). ACM, New York, 535--544."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242627"},{"key":"e_1_2_1_5_1","unstructured":"Battelle J. 2005. John Battelle's searchblog. http:\/\/battellemedia.com\/archives\/001889.php.  Battelle J. 2005. John Battelle's searchblog. http:\/\/battellemedia.com\/archives\/001889.php."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the 7th International World Wide Web Conference (WWW7). 379--388","author":"Bharat K.","unstructured":"Bharat , K. , and Broder , A . 1998. A technique for measuring the relative size and overlap of public Web search engines . In Proceedings of the 7th International World Wide Web Conference (WWW7). 379--388 . Bharat, K., and Broder, A. 1998. A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the 7th International World Wide Web Conference (WWW7). 379--388."},{"key":"e_1_2_1_7_1","volume-title":"Search engine indexing limits: Where do the bots stop&quest","author":"Bondar S.","unstructured":"Bondar , S. 2006. Search engine indexing limits: Where do the bots stop&quest ; ( Available at http:\/\/www.sitepoint.com\/article\/indexing-limits-where-bots-stop.) Bondar, S. 2006. Search engine indexing limits: Where do the bots stop&quest; (Available at http:\/\/www.sitepoint.com\/article\/indexing-limits-where-bots-stop.)"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1137\/S0036144503423264"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1287\/mksc.19.1.43.15180"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1183614.1183699"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0306-4573(03)00040-2"},{"key":"e_1_2_1_12_1","unstructured":"Cheney M. and Perry M. 2005. A comparison of the size of the Yahoo&excl; and Google indices. Available at http:\/\/vburton.ncsa.uiuc.edu\/indexsize.html.  Cheney M. and Perry M. 2005. A comparison of the size of the Yahoo&excl; and Google indices. Available at http:\/\/vburton.ncsa.uiuc.edu\/indexsize.html."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1002\/meet.1450410146"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1006\/jcss.1998.1576"},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Dobra A. and Fienberg S. E. 2004. How large is the World Wide Web&quest; Web Dyn. 23--44.  Dobra A. and Fienberg S. E. 2004. How large is the World Wide Web&quest; Web Dyn. 23--44.","DOI":"10.1007\/978-3-662-10874-1_2"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1137\/S0097539794268765"},{"key":"e_1_2_1_17_1","volume-title":"Matrix Computations","author":"Golub G. H.","unstructured":"Golub , G. H. , and van Loan , C. F. 1996. Matrix Computations , 3 rd ed. The Johns Hopkins University Press , Baltimore, MD . Golub, G. H., and van Loan, C. F. 1996. Matrix Computations, 3rd ed. The Johns Hopkins University Press, Baltimore, MD.","edition":"3"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0306-4573(98)00041-7"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1062745.1062789"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/57.1.97"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1011468107287"},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the 8th International World Wide Web Conference. 213--225","author":"Henzinger M. R.","unstructured":"Henzinger , M. R. , Heydon , A. , Mitzenmacher , M. , and Najork , M . 1999. Measuring index quality using random walks on the Web . In Proceedings of the 8th International World Wide Web Conference. 213--225 . Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 1999. Measuring index quality using random walks on the Web. In Proceedings of the 8th International World Wide Web Conference. 213--225."},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the 9th International World Wide Web Conference. 295--308","author":"Henzinger M. R.","unstructured":"Henzinger , M. R. , Heydon , A. , Mitzenmacher , M. , and Najork , M . 2000. On near-uniform URL sampling . In Proceedings of the 9th International World Wide Web Conference. 295--308 . Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 2000. On near-uniform URL sampling. In Proceedings of the 9th International World Wide Web Conference. 295--308."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1017\/S0963548397003209"},{"key":"e_1_2_1_26_1","first-page":"280","article-title":"Searching the World Wide Web","volume":"5360","author":"Lawrence S.","year":"1998","unstructured":"Lawrence , S. , and Giles , C. L. 1998 . Searching the World Wide Web . Science 5360 , 280 , 98. Lawrence, S., and Giles, C. L. 1998. Searching the World Wide Web. Science 5360, 280, 98.","journal-title":"Science"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1038\/21987"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF00162521"},{"key":"e_1_2_1_29_1","volume-title":"Monte Carlo Strategies in Scientific Computing","author":"Liu J. S.","unstructured":"Liu , J. S. 2001. Monte Carlo Strategies in Scientific Computing . Springer-Verlag , New York . Liu, J. S. 2001. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York."},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the Symposium on Monte Carlo Methods, M. Meyer, Ed.","volume":"21","author":"Marshal A.","year":"1956","unstructured":"Marshal , A. 1956 . The use of multi-stage sampling schemes in Monte Carlo computations . In Proceedings of the Symposium on Monte Carlo Methods, M. Meyer, Ed. Vol. 21 . Wiley, New York, 123--140. Marshal, A. 1956. The use of multi-stage sampling schemes in Monte Carlo computations. In Proceedings of the Symposium on Monte Carlo Methods, M. Meyer, Ed. Vol. 21. Wiley, New York, 123--140."},{"key":"e_1_2_1_31_1","unstructured":"Mayer T. 2005. Our blog is growing up - and so has our index. Available at http:\/\/www.ysearchblog.com\/archives\/000172.html.  Mayer T. 2005. Our blog is growing up - and so has our index. Available at http:\/\/www.ysearchblog.com\/archives\/000172.html."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/1255175.1255237"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1063\/1.1699114"},{"key":"e_1_2_1_34_1","unstructured":"Price G. 2005. More on the total database size battle and Googlewhacking with Yahoo. Available at http:\/\/blog.searchenginewatch.com\/blog\/050811-231448.  Price G. 2005. More on the total database size battle and Googlewhacking with Yahoo. Available at http:\/\/blog.searchenginewatch.com\/blog\/050811-231448."},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of AAAI Fall Symposium on Using Uncertainty within Computation.","author":"Rusmevichientong P.","unstructured":"Rusmevichientong , P. , Pennock , D. , Lawrence , S. , and Giles , C. L . 2001. Methods for sampling pages uniformly from the World Wide Web . In Proceedings of AAAI Fall Symposium on Using Uncertainty within Computation. Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of AAAI Fall Symposium on Using Uncertainty within Computation."},{"key":"e_1_2_1_36_1","volume-title":"Sequential Analysis - Tests and Confidence Intervals","author":"Siegmund D.","unstructured":"Siegmund , D. 1985. Sequential Analysis - Tests and Confidence Intervals . Springer-Verlag , New York . Siegmund, D. 1985. Sequential Analysis - Tests and Confidence Intervals. Springer-Verlag, New York."},{"key":"e_1_2_1_37_1","volume-title":"Algorithms for Random Generation and Counting: A Markov Chain Approach","author":"Sinclair A.","unstructured":"Sinclair , A. 1993. Algorithms for Random Generation and Counting: A Markov Chain Approach . Birkhauser Verlag , Basel, Switzerland . Sinclair, A. 1993. Algorithms for Random Generation and Counting: A Markov Chain Approach. Birkhauser Verlag, Basel, Switzerland."},{"key":"e_1_2_1_38_1","volume-title":"-K","author":"Stuart A.","year":"1994","unstructured":"Stuart , A. and Ord , J . -K . 1994 . Kendall's Advanced Theory of Statistics. Vol. 1 , Distribution Theory. Haddon Arnold, London, UK. Stuart, A. and Ord, J.-K. 1994. Kendall's Advanced Theory of Statistics. Vol. 1, Distribution Theory. Haddon Arnold, London, UK."},{"key":"e_1_2_1_39_1","unstructured":"Sullivan D. 2005. Search engine sizes. http:\/\/searchenginewatch.com\/showPage.html?page=2156481.  Sullivan D. 2005. Search engine sizes. http:\/\/searchenginewatch.com\/showPage.html?page=2156481."},{"key":"e_1_2_1_40_1","volume-title":"Yahoo: Missing pages&quest","author":"V\u00e9ronis J.","year":"2005","unstructured":"V\u00e9ronis , J. 2005 . Yahoo: Missing pages&quest ; (2). http:\/\/aixtal.blogspot.com\/2005\/08\/yahoo-missing-pages-2.html. V\u00e9ronis, J. 2005. Yahoo: Missing pages&quest; (2). http:\/\/aixtal.blogspot.com\/2005\/08\/yahoo-missing-pages-2.html."},{"key":"e_1_2_1_41_1","volume-title":"John von Neumann, Collected Works.","author":"von Neumann J.","unstructured":"von Neumann , J. 1963. Various techniques used in connection with random digits . In John von Neumann, Collected Works. Vol. V . Oxford . von Neumann, J. 1963. Various techniques used in connection with random digits. In John von Neumann, Collected Works. Vol. V. Oxford."}],"container-title":["Journal of the ACM"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1411509.1411514","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1411509.1411514","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T13:29:29Z","timestamp":1750253369000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1411509.1411514"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2008,10]]},"references-count":40,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2008,10]]}},"alternative-id":["10.1145\/1411509.1411514"],"URL":"https:\/\/doi.org\/10.1145\/1411509.1411514","relation":{},"ISSN":["0004-5411","1557-735X"],"issn-type":[{"value":"0004-5411","type":"print"},{"value":"1557-735X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2008,10]]},"assertion":[{"value":"2006-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2008-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2008-11-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}