How do I extract structured JSON from a webpage for an agent?

Question

Accepted Answer

To extract structured JSON from a webpage for an agent, give the fetch tool a **target schema** and let it (a) parse `<script type="application/ld+json">` blocks, (b) parse OpenGraph and meta tags, or (c) as a fallback, feed a clean markdown version of the page through an extraction LLM. AgentFetch exposes an `extract_json` tool that accepts a JSON Schema and returns matching data. For example, `extract_json(url, schema={"title": "string", "author": "string", "published_date": "string"})` returns those fields when they can be found in JSON-LD, OG tags, or page structure. For e-commerce, news, and recipe sites this works roughly 85% of the time on the first call, because most major sites publish schema.org JSON-LD.

For sites without structured markup, the fallback is a small extraction model (Haiku, GPT-4o-mini) run against the clean markdown; typical cost is $0.0005 to $0.002 per page. Firecrawl offers similar "Extract" functionality at a higher per-page price.

Avoid regex or CSS-selector extraction in agent workflows: page structure drifts, and the agent has no way to self-heal when a selector stops matching. A JSON Schema paired with a structured-output model is the more resilient pattern.
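To make steps (a) and (b) concrete, here is a minimal sketch using only the Python standard library. It is not how AgentFetch or Firecrawl work internally; the class `StructuredDataParser`, the function `extract_fields`, and the mapping from schema.org/OpenGraph keys to `title`, `author`, and `published_date` are all assumptions made for illustration.

```python
import json
from html.parser import HTMLParser


class StructuredDataParser(HTMLParser):
    """Collects JSON-LD objects and OpenGraph/article meta tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.json_ld = []       # parsed JSON-LD objects
        self.open_graph = {}    # e.g. {"og:title": "..."}
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []
        elif tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith(("og:", "article:")) and attrs.get("content"):
                self.open_graph[prop] = attrs["content"]

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # malformed block: skip it rather than fail the whole page
            # Some sites wrap their entities in a top-level "@graph" array.
            if isinstance(parsed, dict) and isinstance(parsed.get("@graph"), list):
                parsed = parsed["@graph"]
            self.json_ld.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_fields(html):
    """Fill title/author/published_date from JSON-LD first, then OpenGraph tags.

    The key mapping below is an assumption for this sketch, not a fixed standard.
    """
    parser = StructuredDataParser()
    parser.feed(html)
    parser.close()

    result = {"title": None, "author": None, "published_date": None}
    for item in parser.json_ld:
        if not isinstance(item, dict):
            continue
        result["title"] = result["title"] or item.get("headline") or item.get("name")
        author = item.get("author")
        if isinstance(author, list) and author:
            author = author[0]          # keep only the first listed author
        if isinstance(author, dict):
            author = author.get("name")
        result["author"] = result["author"] or author
        result["published_date"] = result["published_date"] or item.get("datePublished")

    og = parser.open_graph
    result["title"] = result["title"] or og.get("og:title")
    result["author"] = result["author"] or og.get("article:author")
    result["published_date"] = result["published_date"] or og.get("article:published_time")
    return result
```

Any field that is still `None` after this pass is a candidate for step (c): send the clean markdown and the schema to a structured-output model and merge its answer into the result.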