Cloudflare Workers Free Plan: Efficient HTML to Markdown Conversion Using HTMLRewriter for AI Crawlers

AI crawlers from services such as Gemini, GPT, Claude, and Perplexity constantly scan websites, and Markdown suits them far better than HTML: it offers cleaner context and fewer tokens, and therefore cheaper inference. If your content already exists as Markdown (from a CMS, Git, or a database), you can simply negotiate the format via the Accept: text/markdown request header. However, if your content is HTML (perhaps you're proxying a third-party page, mirroring documentation, building a reader-mode endpoint, or feeding an LLM summarizer), you'll need to convert it to Markdown inside a Cloudflare Worker. On the free plan, that means working within strict limits: 10ms of CPU time per request and a 1MB compressed bundle. Which strategy can work under such constraints?
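As a rough illustration of the negotiation step, the sketch below checks whether a request's Accept header lists text/markdown. The function name wantsMarkdown is hypothetical, and the check is deliberately simplified: it honors q=0 as "not acceptable" but does not rank competing media types by preference.

```typescript
// Hypothetical helper: returns true when the Accept header lists
// text/markdown with a non-zero q-value. Simplified on purpose; it does
// not rank media types against each other.
function wantsMarkdown(accept: string | null): boolean {
  if (!accept) return false;
  return accept.split(",").some((part) => {
    // Split "type;param=value;..." into the media type and its parameters.
    const [type, ...params] = part
      .trim()
      .split(";")
      .map((s) => s.trim().toLowerCase());
    if (type !== "text/markdown") return false;
    const q = params.find((p) => p.startsWith("q="));
    // A missing q parameter means full preference (q=1).
    return q === undefined || parseFloat(q.slice(2)) > 0;
  });
}
```

In a Worker, this could be called as wantsMarkdown(request.headers.get("Accept")) to decide whether to serve Markdown or fall back to the original HTML.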

It's crucial to clarify that Cloudflare uses the term "paid" for two distinct products:

  • Workers Paid ($5/month + usage): This is the Worker runtime upgrade, boosting CPU from 10ms to 30s and bundle size from 1MB to 10MB. This plan fundamentally alters the HTML-to-Markdown conversion calculus.
  • Cloudflare Pro ($20/month per domain): This is a domain plan, adding features like WAF, image optimization, and page rules. It does not change any Worker limits.

Throughout this article, "paid" refers specifically to Workers Paid.

The free plan budgets are tight: 10ms CPU time per request and a 1MB compressed Worker bundle. While 10ms is ample for routing or JSON processing, HTML-to-Markdown is different. It involves parsing a DOM, traversing every node, and emitting a transformed string – a CPU-dense operation. Any strategy that ships its own DOM implementation tends to quickly exhaust the bundle budget.

The approach that fits within both limits is HTMLRewriter.

HTMLRewriter is built into workerd – the open-source JavaScript/Wasm runtime that executes your Worker at the edge (and powers wrangler dev locally). It requires zero npm dependencies and is used by Cloudflare itself for response transformation.

Architecturally, HTMLRewriter is streaming and SAX-style: it fires events like <h1> / text / </h1> as bytes arrive, never building an in-memory DOM tree. Libraries like turndown, Readability, or cheerio do the opposite – they buffer the entire document, construct a full DOM with every node and parent pointer allocated, and then traverse it. This construction pass is a significant CPU tax, paid before a single Markdown character is emitted. Because they depend on a DOM, these libraries also have to ship their own DOM implementations, which adds hundreds of KB to the bundle.
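The streaming model can be sketched as follows. This is a minimal illustration, not the article's actual converter: MarkdownCollector and htmlToMarkdown are hypothetical names, only headings and paragraphs are handled, and the declare line stands in for the HTMLRewriter global that workerd provides (it is unnecessary when @cloudflare/workers-types is installed). The handler appends Markdown to a buffer as events fire, and the transformed body is drained so that every event is delivered.

```typescript
// Stand-in for the global provided by workerd (assumption: no Workers types installed).
declare const HTMLRewriter: any;

// Minimal shapes of the objects HTMLRewriter passes to handlers.
interface ElementLike {
  tagName: string;
  onEndTag(handler: () => void): void;
}
interface TextLike {
  text: string;
}
// In a real Worker this would be a fetch Response.
interface ResponseLike {
  arrayBuffer(): Promise<ArrayBuffer>;
}

// Hypothetical collector: appends Markdown to a buffer as SAX-style events fire.
class MarkdownCollector {
  out = "";

  element(el: ElementLike): void {
    // Headings get a "#" prefix matching their level; paragraphs get none.
    const level = /^h([1-6])$/i.exec(el.tagName);
    if (level) this.out += "#".repeat(Number(level[1])) + " ";
    // Close the block with a blank line once the end tag is seen.
    el.onEndTag(() => {
      this.out += "\n\n";
    });
  }

  text(chunk: TextLike): void {
    // Text can arrive in several chunks; collapse whitespace runs in each.
    this.out += chunk.text.replace(/\s+/g, " ");
  }
}

async function htmlToMarkdown(res: ResponseLike): Promise<string> {
  const collector = new MarkdownCollector();
  const rewritten = new HTMLRewriter()
    .on("h1, h2, h3, h4, h5, h6, p", collector)
    .transform(res);
  // Drain the transformed body so the parser runs and the handlers fire.
  await rewritten.arrayBuffer();
  return collector.out.trim();
}
```

No DOM tree is ever built here: memory use is bounded by the output buffer, and the only per-node work is a string append. A fuller converter would add handlers for links, lists, code blocks, and entity decoding.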

On a sample 34KB HTML article, the HTMLRewriter approach produced these measurements:

  • Bundle: 10.52 KiB uncompressed / 3.74 KiB gzipped (0.4% of the 1MB budget)
  • CPU: 2ms median over 50 runs (min 2, max 8) – 20% of the 10ms budget
  • Output: 24.9 KB markdown

This leaves about 5x CPU headroom and more than 250x bundle headroom on the free plan, which no other measured alternative matched.
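The headroom figures follow from dividing each budget by the measured usage. The sketch below reproduces that arithmetic using the limits as stated in this article; FREE_PLAN and headroom are hypothetical names, and treating 1MB as 1024 KiB is an assumption.

```typescript
// Free-plan limits as stated in this article (assumption: 1MB = 1024 KiB).
const FREE_PLAN = { cpuMs: 10, bundleKiB: 1024 };

// Headroom is the budget divided by the measured usage.
function headroom(budget: number, used: number): number {
  return budget / used;
}

// Measured values from the list above: 2 ms median CPU, 3.74 KiB gzipped.
const cpuHeadroom = headroom(FREE_PLAN.cpuMs, 2);
const bundleHeadroom = headroom(FREE_PLAN.bundleKiB, 3.74);

console.log(cpuHeadroom.toFixed(1), bundleHeadroom.toFixed(1));
```

Under these assumptions the CPU headroom is exactly 5x and the bundle headroom comes out near 274x, consistent with the rounded 250x figure.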
