How do I convert HTML to markdown for LLMs?

Question

Accepted Answer

The fastest way to convert HTML to markdown for LLMs is to use a purpose-built tool such as **AgentFetch**, **Jina Reader**, or **Firecrawl**. Each strips boilerplate (nav, footer, ads, scripts) and returns output roughly 70-85% smaller than the raw HTML, which cuts LLM context costs substantially.

If you're rolling your own, the standard Python stack is `readability-lxml` (extracts the main content) plus `html2text` or `markdownify` (converts it to markdown). In Node, use `@mozilla/readability` (with a DOM implementation such as `jsdom`) plus `turndown`.

Feeding the raw result of `requests.get(url).text` to the model is almost always wrong for LLM input. A typical news article ships as 25-40KB of HTML wrapped around 500-2,000 words of actual content, so you can end up paying for up to ~10x more tokens than necessary. At Claude Sonnet input pricing of $3/M tokens, a 40KB HTML page (roughly 10,000 tokens at about 4 characters per token) costs ~$0.03 to read, versus ~$0.003 for the same content as clean markdown. For AI agents that fetch dozens to thousands of URLs per session, that compounds fast.

AgentFetch handles the readability + markdown pipeline server-side and returns LLM-ready text in one MCP tool call.
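For the roll-your-own Python route, a minimal sketch might look like the following. It assumes `readability-lxml` and `html2text` are installed, uses hypothetical function names (`html_to_markdown`, `estimate_input_cost`), and estimates tokens with a rough 4-characters-per-token rule of thumb rather than a real tokenizer, so treat the dollar figures as ballpark only.

```python
"""Sketch: raw HTML -> main content -> markdown, plus a rough input-cost estimate."""

# Assumptions for illustration: ~4 characters per token (a rule of thumb,
# not a real tokenizer) and the $3 per million input tokens quoted above.
CHARS_PER_TOKEN = 4
USD_PER_MILLION_INPUT_TOKENS = 3.0


def html_to_markdown(html: str) -> str:
    """Extract the main article from raw HTML and convert it to markdown."""
    # Third-party packages: pip install readability-lxml html2text
    # Imported here so the cost helper below works without them installed.
    import html2text
    from readability import Document

    # readability keeps the main content and drops nav, footer, ads, scripts.
    main_html = Document(html).summary()

    converter = html2text.HTML2Text()
    converter.body_width = 0        # do not hard-wrap output lines
    converter.ignore_images = True  # images rarely help a text-only prompt
    return converter.handle(main_html).strip()


def estimate_input_cost(text: str) -> float:
    """Approximate input cost in USD for sending `text` to the model."""
    tokens = len(text) / CHARS_PER_TOKEN
    return tokens / 1_000_000 * USD_PER_MILLION_INPUT_TOKENS


if __name__ == "__main__":
    import sys

    # Usage: python html_to_md.py https://example.com/article
    if len(sys.argv) == 2:
        import requests  # pip install requests

        raw = requests.get(sys.argv[1], timeout=30).text
        markdown = html_to_markdown(raw)
        print(f"raw HTML: {len(raw)} chars, ~${estimate_input_cost(raw):.4f}")
        print(f"markdown: {len(markdown)} chars, ~${estimate_input_cost(markdown):.4f}")
```

Swapping `html2text` for `markdownify` only changes the conversion step; the readability extraction stays the same. Pages that render their content with JavaScript will come back mostly empty from a plain `requests.get`, which is one reason to reach for a hosted tool instead.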