{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,18]],"date-time":"2025-12-18T20:08:25Z","timestamp":1766088505132,"version":"3.45.0"},"publisher-location":"New York, NY, USA","reference-count":49,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,28]]},"DOI":"10.1145\/3730567.3764471","type":"proceedings-article","created":{"date-parts":[[2025,11,21]],"date-time":"2025-11-21T15:22:38Z","timestamp":1763738558000},"page":"541-557","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Scrapers Selectively Respect robots.txt Directives: Evidence From a Large-Scale Empirical Study"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-3502-3331","authenticated-orcid":false,"given":"Taein","family":"Kim","sequence":"first","affiliation":[{"name":"Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-2613-4514","authenticated-orcid":false,"given":"Karstan","family":"Bock","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-1734-8017","authenticated-orcid":false,"given":"Claire","family":"Luo","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-8149-8962","authenticated-orcid":false,"given":"Amanda","family":"Liswood","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-7303-8888","authenticated-orcid":false,"given":"Chloe","family":"Poroslay","sequence":"additional","affiliation":[{"name":"Office of Information Technology, Duke University, Durham, NC, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-3346-8226","authenticated-orcid":false,"given":"Emily","family":"Wenger","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,11,21]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2007. Can a \/robots.txt be used in a court of law? https:\/\/www.robotstxt.org\/faq\/legal.html."},{"key":"e_1_3_2_1_2_1","unstructured":"2022. RFC 9309: Robots Exclusion Protocol. https:\/\/www.rfc-editor.org\/rfc\/rfc9309.html."},{"key":"e_1_3_2_1_3_1","unstructured":"2025. Common Crawl. https:\/\/commoncrawl.org\/."},{"key":"e_1_3_2_1_4_1","unstructured":"2025. Dark Visitors. https:\/\/darkvisitors.com\/."},{"key":"e_1_3_2_1_5_1","volume-title":"Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al .","author":"Abdin Marah","year":"2024","unstructured":"Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al . 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 (2024)."},{"key":"e_1_3_2_1_6_1","unstructured":"Open AI. 2025. Introducing Operator. https:\/\/openai.com\/index\/introducing-operator\/."},{"key":"e_1_3_2_1_7_1","volume-title":"International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 135--159","author":"Azad Babak Amin","year":"2020","unstructured":"Babak Amin Azad, Oleksii Starov, Pierre Laperdrix, and Nick Nikiforakis. 2020. Web runner 2049: Evaluating third-party anti-bot services. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 135--159."},{"key":"e_1_3_2_1_8_1","unstructured":"Anthropic. 2025. Claude can now search the web. https:\/\/www.anthropic.com\/news\/web-search."},{"key":"e_1_3_2_1_9_1","unstructured":"Apple. 2025. About Applebot. https:\/\/support.apple.com\/en-us\/119829."},{"key":"e_1_3_2_1_10_1","volume-title":"AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt. Ars Technica","author":"Belanger Ashley","year":"2025","unstructured":"Ashley Belanger. 2025. AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt. Ars Technica (2025). https:\/\/arstechnica.com\/tech-policy\/2025\/01\/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt\/."},{"key":"e_1_3_2_1_11_1","volume-title":"Language models are few-shot learners. arXiv preprint arXiv:2005.14165","author":"Brown Tom B","year":"2020","unstructured":"Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)."},{"key":"e_1_3_2_1_12_1","volume-title":"8th USENIX Workshop on Offensive Technologies (WOOT 14)","author":"Bursztein Elie","year":"2014","unstructured":"Elie Bursztein, Jonathan Aigrain, Angelika Moscicki, and John C Mitchell. 2014. The end is nigh: Generic solving of text-based {CAPTCHAs}. In 8th USENIX Workshop on Offensive Technologies (WOOT 14)."},{"key":"e_1_3_2_1_13_1","volume-title":"How good are humans at solving CAPTCHAs? A large scale evaluation","author":"Bursztein Elie","unstructured":"Elie Bursztein, Steven Bethard, Celine Fabry, John C Mitchell, and Dan Jurafsky. 2010. How good are humans at solving CAPTCHAs? A large scale evaluation. In IEEE Security & Privacy (SP)."},{"key":"e_1_3_2_1_14_1","first-page":"240","article-title":"2023. Palm: Scaling language modeling with pathways","volume":"24","author":"Chowdhery Aakanksha","year":"2023","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1--113.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_1_15_1","unstructured":"Thomas Claburn. 2024. Automation needed to fight army of AI content harvesters stalking the web. https:\/\/www.theregister.com\/2024\/07\/30\/taming_ai_content_crawlers\/."},{"key":"e_1_3_2_1_16_1","unstructured":"Cloudflare. 2024. https:\/\/www.cloudflare.com\/application-services\/products\/bot-management\/."},{"key":"e_1_3_2_1_17_1","volume-title":"C-Frame: Characterizing and measuring in-the-wild CAPTCHA attacks","author":"Nguyen Hoang Dai","unstructured":"Hoang Dai Nguyen, Karthika Subramani, Bhupendra Acharya, Roberto Perdisci, and Phani Vadrevu. 2024. C-Frame: Characterizing and measuring in-the-wild CAPTCHA attacks. In IEEE Security & Privacy (SP)."},{"key":"e_1_3_2_1_18_1","volume-title":"Companion Proceedings of the ACM Web Conference","author":"Dinzinger Michael","year":"2024","unstructured":"Michael Dinzinger and Michael Granitzer. 2024. A longitudinal study of content control mechanisms. In Companion Proceedings of the ACM Web Conference 2024. 1382--1387."},{"key":"e_1_3_2_1_19_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)."},{"key":"e_1_3_2_1_20_1","unstructured":"Roy Thomas Fielding. 2000. Architectural Styles and the Design of Network-based Software Architectures. https:\/\/ics.uci.edu\/~fielding\/pubs\/dissertation\/top.htm?deviceId=c9100f78--4e17--449f-b247--5f55bf2b13bc."},{"key":"e_1_3_2_1_21_1","unstructured":"Internet Engineering Task Force. 2022. RFC 9309. https:\/\/datatracker.ietf.org\/doc\/html\/rfc9309."},{"key":"e_1_3_2_1_22_1","unstructured":"Google. 2025. Gemini Deep Research. https:\/\/gemini.google\/overview\/deep-research\/?hl=en."},{"key":"e_1_3_2_1_23_1","unstructured":"Google. 2025. How Google interprets the robots.txt specification. https:\/\/developers.google.com\/search\/docs\/crawling-indexing\/robots\/robots_txt."},{"key":"e_1_3_2_1_24_1","unstructured":"Matthew Gray. 1995. Measuring the Growth of the Web. https:\/\/www.mit.edu\/people\/mkgray\/growth\/."},{"key":"e_1_3_2_1_25_1","volume-title":"Gotta captcha em all: A survey of 20 years of the human-or-computer dilemma. ACM Computing Surveys (CSUR)","author":"Guerar Meriem","year":"2021","unstructured":"Meriem Guerar, Luca Verderame, Mauro Migliardi, Francesco Palmieri, and Alessio Merlo. 2021. Gotta captcha em all: A survey of 20 years of the human-or-computer dilemma. ACM Computing Surveys (CSUR) (2021)."},{"key":"e_1_3_2_1_26_1","volume-title":"Proc. of ICML.","author":"Guu Kelvin","year":"2020","unstructured":"Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In Proc. of ICML."},{"key":"e_1_3_2_1_27_1","unstructured":"Xe Iaso. 2025. Amazon's AI Crawler is Making my Git Server Unstable. https:\/\/xeiaso.net\/notes\/2025\/amazon-crawler\/."},{"key":"e_1_3_2_1_28_1","volume-title":"Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282","author":"Izacard Gautier","year":"2020","unstructured":"Gautier Izacard and Edouard Grave. 2020. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282 (2020)."},{"key":"e_1_3_2_1_29_1","volume-title":"Andrea Madotto, and Pascale Fung.","author":"Ji Ziwei","year":"2023","unstructured":"Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM computing surveys 55, 12 (2023)."},{"key":"e_1_3_2_1_30_1","volume-title":"Scaling laws for neural language models. arXiv preprint arXiv:2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)."},{"key":"e_1_3_2_1_31_1","volume-title":"Important: Spiders, Robots and Web Wanderers. https:\/\/web.archive.org\/web\/20131029200350\/http:\/\/inkdroid.org\/tmp\/www-talk\/4113.html.","author":"Koster Martijn","year":"1994","unstructured":"Martijn Koster. 1994. Important: Spiders, Robots and Web Wanderers. https:\/\/web.archive.org\/web\/20131029200350\/http:\/\/inkdroid.org\/tmp\/www-talk\/4113.html."},{"key":"e_1_3_2_1_32_1","volume-title":"Classification of web robots: an empirical study based on over one billion requests. Computers & Security 28, 8","author":"Lee Junsup","year":"2009","unstructured":"Junsup Lee, Sungdeok Cha, Dongkun Lee, and Hyungkyu Lee. 2009. Classification of web robots: an empirical study based on over one billion requests. Computers & Security 28, 8 (2009)."},{"key":"e_1_3_2_1_33_1","unstructured":"Christopher Lehane. 2025. [OpenAI Response] OSTP\/NSF RFI: Notice Request for Information on the Development of an Artificial Intelligence (AI) Action Plan. https:\/\/cdn.openai.com\/global-affairs\/ostp-rfi\/ec680b75-d539--4653-b297--8bcf6e5f7686\/openai-response-ostp-nsf-rfi-notice-request-for-information-on-the-development-of-an-artificial-intelligence-ai-action-plan.pdf."},{"key":"e_1_3_2_1_34_1","volume-title":"Proc. of NeurIPS","author":"Lewis Patrick","year":"2020","unstructured":"Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K\u00fcttler, Mike Lewis, Wen-tau Yih, Tim Rockt\u00e4schel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Proc. of NeurIPS (2020)."},{"key":"e_1_3_2_1_35_1","volume-title":"Somesite I Used To Crawl: Awareness, Agency and Efficacy in Protecting Content Creators From AI Crawlers. arXiv preprint arXiv:2411.15091","author":"Liu Enze","year":"2024","unstructured":"Enze Liu, Elisa Luo, Shawn Shan, Geoffrey M Voelker, Ben Y Zhao, and Stefan Savage. 2024. Somesite I Used To Crawl: Awareness, Agency and Efficacy in Protecting Content Creators From AI Crawlers. arXiv preprint arXiv:2411.15091 (2024)."},{"key":"e_1_3_2_1_36_1","volume-title":"AI crawler wars threaten to make the web more closed for everyone. MIT Tech Review","author":"Longpre Shayne","year":"2025","unstructured":"Shayne Longpre. 2025. AI crawler wars threaten to make the web more closed for everyone. MIT Tech Review (2025). https:\/\/www.technologyreview.com\/2025\/02\/11\/1111518\/ai-crawler-wars-closed-web\/."},{"key":"e_1_3_2_1_37_1","volume-title":"Proc. of NeurIPS","author":"Longpre Shayne","year":"2025","unstructured":"Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, et al. 2025. Consent in crisis: the rapid decline of the AI data commons. Proc. of NeurIPS (2025)."},{"key":"e_1_3_2_1_38_1","volume-title":"Perplexity is a Bullshit Machine. Wired","author":"Mehrotra Dhruv","year":"2024","unstructured":"Dhruv Mehrotra and Tim Marchman. 2024. Perplexity is a Bullshit Machine. Wired (2024). https:\/\/www.wired.com\/story\/perplexity-is-a-bullshit-machine\/."},{"key":"e_1_3_2_1_39_1","volume-title":"Jonathon Fletcher: forgotten father of the search engine. BBC News","author":"Miller Joe","year":"2013","unstructured":"Joe Miller. 2013. Jonathon Fletcher: forgotten father of the search engine. BBC News (2013). https:\/\/www.bbc.com\/news\/technology-23945326."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2914757"},{"key":"e_1_3_2_1_41_1","volume-title":"The curious case of hallucinations in neural machine translation. arXiv preprint arXiv:2104.06683","author":"Raunak Vikas","year":"2021","unstructured":"Vikas Raunak, Arul Menezes, and Marcin Junczys-Dowmunt. 2021. The curious case of hallucinations in neural machine translation. arXiv preprint arXiv:2104.06683 (2021)."},{"key":"e_1_3_2_1_42_1","volume-title":"Proc. of NeurIPS","author":"Schuhmann Christoph","year":"2022","unstructured":"Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al . 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Proc. of NeurIPS (2022)."},{"key":"e_1_3_2_1_43_1","volume-title":"Proceedings of the 16th international conference on World Wide Web. 1123--1124","author":"Sun Yang","year":"2007","unstructured":"Yang Sun, Ziming Zhuang, and C Lee Giles. 2007. A large-scale study of robots. txt. In Proceedings of the 16th international conference on World Wide Web. 1123--1124."},{"key":"e_1_3_2_1_44_1","volume-title":"et al","author":"Team Jamba","year":"2024","unstructured":"Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al . 2024. Jamba-1.5: Hybrid Transformer-Mamba Models at Scale. arXiv preprint arXiv:2408.12570 (2024)."},{"key":"e_1_3_2_1_45_1","volume-title":"Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_3_2_1_46_1","unstructured":"Udger. 2025. YisouSpider Details. https:\/\/udger.com\/resources\/ua-list\/bot-detail?bot=YisouSpider."},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.5555\/1766171.1766196"},{"key":"e_1_3_2_1_48_1","volume-title":"Recaptcha: Human-based character recognition via web security measures. Science 321, 5895","author":"Ahn Luis Von","year":"2008","unstructured":"Luis Von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum. 2008. Recaptcha: Human-based character recognition via web security measures. Science 321, 5895 (2008)."},{"key":"e_1_3_2_1_49_1","volume-title":"ACM Conference on Computer and Communications Security (CCS).","author":"Ye Guixin","year":"2018","unstructured":"Guixin Ye, Zhanyong Tang, Dingyi Fang, Zhanxing Zhu, Yansong Feng, Pengfei Xu, Xiaojiang Chen, and Zheng Wang. 2018. Yet another text captcha solver: A generative adversarial network based approach. In ACM Conference on Computer and Communications Security (CCS)."}],"event":{"name":"IMC '25:ACM Internet Measurement Conference","location":"Madison WI USA","sponsor":["SIGMETRICS ACM Special Interest Group on Measurement and Evaluation","SIGCOMM ACM Special Interest Group on Data Communication"]},"container-title":["Proceedings of the 2025 ACM Internet Measurement Conference"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3730567.3764471","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,21]],"date-time":"2025-11-21T15:28:30Z","timestamp":1763738910000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3730567.3764471"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,28]]},"references-count":49,"alternative-id":["10.1145\/3730567.3764471","10.1145\/3730567"],"URL":"https:\/\/doi.org\/10.1145\/3730567.3764471","relation":{},"subject":[],"published":{"date-parts":[[2025,10,28]]},"assertion":[{"value":"2025-11-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}