How do I extract tables from web pages for LLMs?

Question

Accepted Answer

Extract tables from web pages for LLMs by calling AgentFetch's `extract_json` tool with a tabular schema: `extract_json(url, schema={"rows": [{"col1": "string", "col2": "string", "col3": "number"}]})` returns a clean JSON array, free of rowspan/colspan complexity and HTML formatting. For Wikipedia and other sources with semantic table markup, AgentFetch maps `<th>` cells to property names and `<td>` cells to row values automatically.

Raw HTML tables are a worst case for LLM context: a 50-row, 5-column financial table renders as roughly 15-25 KB of `<tr>`/`<td>` markup but carries only about 1-3 KB of actual data. Converting to JSON or to markdown-table format (pipes and hyphens) saves 80-90% of tokens with no information loss. AgentFetch's default markdown conversion produces GitHub-flavored markdown tables, which LLMs parse reliably.

For complex tables with merged cells, multi-level headers, or footnotes, request JSON output explicitly via `extract_json`; markdown tables flatten gracefully but lose nested structure. CSV download links, when present on the page, are even better: AgentFetch detects CSV link patterns and offers to fetch the CSV directly, which is 10-100x more token-efficient than scraped HTML.
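To make the header-to-property mapping concrete, here is a minimal standard-library sketch of the same idea: header cells become keys, data cells become values, and the result can be printed as JSON or as a GitHub-flavored markdown table. It is not AgentFetch's implementation. The names `TableParser`, `table_to_records`, `records_to_markdown`, and the sample table are all hypothetical, and the sketch assumes a simple table with one header row and no merged cells.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of each <tr>; assumes no rowspan/colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell strings per <tr>
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Treat the first row as headers; return one dict per remaining row."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def records_to_markdown(records):
    """Render records as a GitHub-flavored markdown table."""
    if not records:
        return ""
    columns = list(records[0])
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for rec in records:
        lines.append("| " + " | ".join(rec.get(c, "") for c in columns) + " |")
    return "\n".join(lines)


# Hypothetical sample input, standing in for a fetched page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Company</th><th>Ticker</th><th>Revenue</th></tr>
  <tr><td>Acme</td><td>ACM</td><td>120.5</td></tr>
  <tr><td>Globex</td><td>GLX</td><td>98.0</td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)
print(json.dumps(records, indent=2))
print(records_to_markdown(records))
```

All values come back as strings here; casting a column such as `Revenue` to a number is the kind of typing that a schema like the one passed to `extract_json` would express. Cells containing a literal `|` would need escaping before being written into a markdown table.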