How do I extract tables from web pages for LLMs?
To extract tables from web pages for LLMs, call AgentFetch's extract_json tool with a tabular schema: extract_json(url, schema={"rows": [{"col1": "string", "col2": "string", "col3": "number"}]}) returns a clean JSON array with the HTML formatting and rowspan/colspan complexity stripped out. For Wikipedia and other sources with semantic table markup, AgentFetch automatically maps <th> cells to property names and <td> cells to row values. Raw HTML tables are a worst case for LLM context: a 50-row, 5-column financial table renders as roughly 15-25KB of <tr>/<td> markup but carries only about 1-3KB of actual data. Converting to JSON or to markdown-table format (pipes and hyphens) saves 80-90% of tokens, and for simple tables it loses no information. AgentFetch's default markdown conversion produces GitHub-flavored markdown tables, which LLMs parse reliably. For complex tables with merged cells, multi-level headers, or footnotes, request JSON output explicitly via extract_json, because markdown tables flatten gracefully but lose nested structure. CSV download links, when present on the page, are even better: AgentFetch detects <a href="*.csv"> patterns and offers to fetch the CSV directly, which is 10-100x more token-efficient than scraped HTML.
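To make the size difference concrete, here is a minimal sketch of the same idea using only Python's standard library. It is not AgentFetch's implementation; the names TableParser, table_to_records, records_to_markdown, and SAMPLE_HTML are hypothetical and exist only for this illustration. It assumes a single simple table with one header row and no merged cells, and it keeps every cell value as a string.

```python
from html.parser import HTMLParser
import json


class TableParser(HTMLParser):
    """Collect <th>/<td> text from a page that contains one simple table."""

    def __init__(self):
        super().__init__()
        self.headers = []   # text of each <th> cell
        self.rows = []      # one list of <td> texts per data row
        self._cell = None   # tag name of the cell currently open, if any
        self._text = []     # text fragments seen inside the open cell
        self._row = []      # <td> values collected for the current <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = tag
            self._text = []

    def handle_data(self, data):
        # Keep text only while a cell is open; inline tags like <b> are ignored.
        if self._cell is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell == tag:
            value = " ".join("".join(self._text).split())  # collapse whitespace
            (self.headers if tag == "th" else self._row).append(value)
            self._cell = None
        elif tag == "tr" and self._row:
            # Header rows contain no <td> cells, so they are skipped here.
            self.rows.append(self._row)
            self._row = []


def table_to_records(html):
    """Map header names to cell values, one dict per data row."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return [dict(zip(parser.headers, row)) for row in parser.rows]


def records_to_markdown(records):
    """Render the records as a pipe-and-hyphen markdown table."""
    if not records:
        return ""
    headers = list(records[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for rec in records:
        lines.append("| " + " | ".join(rec[h] for h in headers) + " |")
    return "\n".join(lines)


# Made-up sample data, styled like typical page markup.
SAMPLE_HTML = """
<table class="wikitable sortable" style="width:100%">
  <tr><th scope="col">Company</th><th scope="col">Ticker</th><th scope="col">Revenue</th></tr>
  <tr><td style="text-align:left"><b>Acme</b></td><td>ACME</td><td>12.5</td></tr>
  <tr><td style="text-align:left"><b>Globex</b></td><td>GLBX</td><td>8.1</td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)
markdown = records_to_markdown(records)
print(json.dumps(records, indent=2))
print(markdown)
print(len(SAMPLE_HTML), "chars of HTML vs", len(markdown), "chars of markdown")
```

Character counts are only a rough proxy for tokens, and the ratio on a two-row sample will differ from that of a 50-row table. A table with rowspan, colspan, or stacked header rows would need extra handling that this sketch leaves out, which is the case where structured JSON output is the safer choice.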