AgentFetch

How do AI agents respect robots.txt?

AI agents should respect robots.txt the same way well-behaved web crawlers do: fetch /robots.txt from each origin before crawling, apply the rule group that matches your bot's User-agent token (falling back to User-agent: * when no specific group exists), and skip any path blocked by a matching Disallow: directive. AgentFetch handles this automatically: every fetch_url call first checks an in-memory robots.txt cache (refreshed every 24 hours per origin) and returns a structured robots_disallowed error if the target path is blocked. You can override this per call with respect_robots=false, but the override should be reserved for owner-verified scraping of your own properties or for content you hold an explicit license to access.

The 2024-2025 wave of AI training crawlers and control tokens (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) made robots.txt politically salient: most major publishers now explicitly block AI training crawlers, and many also block on-demand fetchers. AgentFetch identifies itself as AgentFetchBot/1.0 (+https://www.agentfetch.dev/bot) by default, which separates it from training scrapers; many publishers allow on-demand agent fetches while blocking training.

For agents acting on behalf of a logged-in user (e.g., "summarize this article I'm reading"), robots.txt is less restrictive, because the request is user-initiated and the user has the same right of access as they would in a browser. The defensible posture: respect robots.txt by default, document the override path for legitimate use cases, and never crawl at scale against an explicit Disallow.
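The check-before-fetch flow described above can be sketched in Python with the standard library's urllib.robotparser. This is a minimal illustration under stated assumptions, not AgentFetch's actual implementation: the RobotsCache class, the injected load_robots_txt callable, and the shape of the returned dictionary are all hypothetical. Only the 24-hour refresh, the respect_robots flag, the robots_disallowed error name, and the AgentFetchBot token come from the text. A production fetcher would also need to handle HTTP status codes for /robots.txt, which this sketch leaves to the loader.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "AgentFetchBot"       # product token from the default user-agent string
CACHE_TTL_SECONDS = 24 * 60 * 60   # refresh each origin's rules every 24 hours


class RobotsCache:
    """Hypothetical per-origin robots.txt cache (illustrative only)."""

    def __init__(self, load_robots_txt, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self._load = load_robots_txt   # callable: origin -> robots.txt body (str)
        self._ttl = ttl
        self._clock = clock
        self._entries = {}             # origin -> (fetched_at, parser)

    def _parser_for(self, origin):
        entry = self._entries.get(origin)
        now = self._clock()
        if entry is None or now - entry[0] >= self._ttl:
            # Cache miss or stale entry: reload and reparse the rules.
            parser = RobotFileParser()
            parser.parse(self._load(origin).splitlines())
            entry = (now, parser)
            self._entries[origin] = entry
        return entry[1]

    def check(self, url, respect_robots=True):
        """Return {"ok": True} or an assumed robots_disallowed error dict."""
        if not respect_robots:
            return {"ok": True}
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if self._parser_for(origin).can_fetch(USER_AGENT, url):
            return {"ok": True}
        return {"ok": False, "error": "robots_disallowed", "url": url}


if __name__ == "__main__":
    # Stand-in loader so the demo runs without network access.
    def demo_loader(origin):
        return "User-agent: *\nDisallow: /private/\n"

    cache = RobotsCache(demo_loader)
    print(cache.check("https://example.com/blog/post"))
    print(cache.check("https://example.com/private/report"))
```

Injecting the loader and the clock keeps the cache testable offline. Note that urllib.robotparser applies the first matching rule within a group, which can differ from the longest-match rule in RFC 9309 when Allow and Disallow lines overlap.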