AgentFetch

How do I convert HTML to markdown for LLMs?

The fastest way to convert HTML to markdown for LLMs is to use a purpose-built tool like AgentFetch, Jina Reader, or Firecrawl. Each strips boilerplate (nav, footer, ads, scripts) and returns output roughly 70-85% smaller than the raw HTML, which cuts LLM context costs dramatically. If you're rolling your own, the standard Python stack is readability-lxml (extracts the main content) plus html2text or markdownify (converts it to markdown); in Node, use @mozilla/readability plus turndown.

The naive requests.get(url).text approach is almost always wrong for LLM input. A typical news article ships with 25-40KB of HTML containing 500-2,000 words of actual content, so you can end up paying for ~10x more tokens than necessary. At Claude Sonnet input pricing of $3/M tokens, a 40KB HTML page (about 10,000 tokens at roughly 4 characters per token) costs ~$0.03 to read vs ~$0.003 for the same content as clean markdown. For AI agents that fetch dozens to thousands of URLs per session, that compounds fast.

AgentFetch handles the readability + markdown pipeline server-side and returns LLM-ready text in one MCP tool call.
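For the roll-your-own route, a minimal Python sketch of the readability-lxml + markdownify pipeline might look like the following, together with the cost arithmetic behind the figures above. The function names, the 4-characters-per-token rule of thumb, and the 4KB size used for the cleaned markdown are illustrative assumptions, not measurements or part of any tool's API; a real tokenizer will give different counts.

```python
# Sketch only: helper names and constants below are illustrative assumptions.

CHARS_PER_TOKEN = 4              # rough rule of thumb for English text, not a tokenizer
PRICE_PER_MILLION_TOKENS = 3.00  # USD, the input price quoted in the text


def estimate_input_cost(text_bytes: int) -> float:
    """Approximate USD cost of sending `text_bytes` of text as LLM input."""
    tokens = text_bytes / CHARS_PER_TOKEN
    return tokens * PRICE_PER_MILLION_TOKENS / 1_000_000


def html_to_markdown(html: str) -> str:
    """Extract the main content of an HTML page and convert it to markdown."""
    # Imported lazily so the cost helper above runs without these packages.
    from readability import Document           # pip install readability-lxml
    from markdownify import markdownify as md  # pip install markdownify

    # summary() returns the HTML of the main content with nav, ads, etc. removed;
    # html_partial=True omits the surrounding <html>/<body> wrapper.
    main_html = Document(html).summary(html_partial=True)
    return md(main_html, heading_style="atx").strip()


if __name__ == "__main__":
    raw_cost = estimate_input_cost(40_000)   # 40KB of raw HTML
    clean_cost = estimate_input_cost(4_000)  # assumed ~4KB after cleanup
    print(f"raw HTML: ${raw_cost:.3f}  clean markdown: ${clean_cost:.3f}")
```

In practice you would fetch the page first (for example with requests.get(url).text) and pass the result to html_to_markdown rather than sending the raw HTML to the model. Extraction quality varies by site, so it is worth spot-checking the output on the pages you care about.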