How do AI agents respect robots.txt?

Question

Accepted Answer

AI agents should respect robots.txt the same way well-behaved web crawlers do: **fetch `/robots.txt` from each origin before crawling, apply the `User-agent: <yourbot>` group if one exists (falling back to `User-agent: *` otherwise), and skip any path that group blocks with a `Disallow:` directive not overridden by a more specific `Allow:`**.

AgentFetch handles this automatically. Every `fetch_url` call first checks an in-memory robots.txt cache (refreshed every 24 hours per origin) and returns a structured `robots_disallowed` error if the target path is blocked. You can override this with `respect_robots=false` per call, but reserve that for owner-verified scraping of your own properties or for content you are explicitly licensed to fetch.

The 2024-2025 wave of AI crawlers and crawler tokens (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) made robots.txt politically salient: most major publishers now explicitly block AI training crawlers, and many also block on-demand fetchers. AgentFetch identifies as `AgentFetchBot/1.0 (+https://www.agentfetch.dev/bot)` by default, which separates it from training scrapers; many publishers allow on-demand agent fetches while blocking training.

For agents acting on behalf of a logged-in user (e.g., "summarize this article I'm reading"), robots.txt is arguably less binding: the request is user-initiated, and the user has the same right of access as they would through a browser.

The defensible posture: respect robots.txt by default, document the override path for legitimate use cases, and never crawl at scale against an explicit `Disallow`.
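For anyone implementing the check themselves, here is a minimal sketch of the fetch, cache, and check flow described above, using Python's standard `urllib.robotparser`. This is not AgentFetch's actual implementation: `RobotsCache`, `download_robots`, `check`, and the shape of the returned dict are hypothetical names chosen for illustration. Only the 24-hour refresh, the `robots_disallowed` error name, and the `respect_robots` override come from the answer.

```python
import time
from urllib.parse import urlsplit
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "AgentFetchBot"          # product token from the default UA string
CACHE_TTL_SECONDS = 24 * 60 * 60      # refresh each origin's rules after 24 hours


def download_robots(robots_url):
    """Return the robots.txt body as text.

    Assumption for this sketch: any network or HTTP error is treated as
    "no rules". A stricter client would distinguish 4xx from 5xx responses.
    """
    try:
        with urlopen(robots_url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return ""


class RobotsCache:
    """Hypothetical per-origin robots.txt cache (illustrative only)."""

    def __init__(self, fetch_text=download_robots, ttl=CACHE_TTL_SECONDS,
                 clock=time.time):
        self._fetch_text = fetch_text   # injectable so it can be stubbed offline
        self._ttl = ttl
        self._clock = clock
        self._entries = {}              # origin -> (fetched_at, parser)

    def _parser_for(self, origin):
        now = self._clock()
        cached = self._entries.get(origin)
        if cached and now - cached[0] < self._ttl:
            return cached[1]
        parser = RobotFileParser()
        parser.parse(self._fetch_text(origin + "/robots.txt").splitlines())
        self._entries[origin] = (now, parser)
        return parser

    def check(self, url, respect_robots=True):
        """Return a small dict saying whether `url` may be fetched."""
        if not respect_robots:
            return {"allowed": True, "reason": "override"}
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if self._parser_for(origin).can_fetch(USER_AGENT, url):
            return {"allowed": True, "reason": "robots_allowed"}
        return {"allowed": False, "error": "robots_disallowed", "url": url}
```

Calling `RobotsCache().check("https://example.com/some/page")` before each request would gate the fetch on that origin's rules. One caveat: the standard-library parser applies rules within a group in file order rather than by longest match, so a production client that needs strict RFC 9309 semantics may want a dedicated parser.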