markdown-for-agents
Runtime-agnostic HTML to Markdown converter built for AI agents. One dependency, works everywhere.
Convert any HTML page into clean, token-efficient Markdown — with built-in content extraction to strip away navigation, ads, and boilerplate. Inspired by Cloudflare's Markdown for Agents.
Features
- Runtime-agnostic — Node.js, Bun, Deno, Cloudflare Workers, Vercel Edge, browsers
- Content extraction — strip nav, footer, ads, sidebars, cookie banners automatically
- Framework middleware — drop-in support for Express, Fastify, Hono, Next.js, and any Web Standard server
- Content negotiation — respond with Markdown when clients send Accept: text/markdown
- Token estimation — built-in heuristic token counter for LLM cost planning, with support for custom tokenizers
- Plugin system — override or extend any element conversion with custom rules
- Single dependency — only htmlparser2 (no DOM required)
- ESM only — modern, tree-shakeable, with subpath exports
- Fully typed — written in TypeScript with complete type definitions
Install
npm install markdown-for-agents
Quick Start
import { convert } from 'markdown-for-agents';
const html = `
<h1>Hello World</h1>
<p>This is a <strong>simple</strong> example.</p>
`;
const { markdown, tokenEstimate } = convert(html);
console.log(markdown);
// # Hello World
//
// This is a **simple** example.
console.log(tokenEstimate);
// { tokens: 12, characters: 46, words: 8 }
Content Extraction
Real-world HTML pages are full of navigation, ads, sidebars, and cookie banners. Enable extraction mode to get just the main content:
const { markdown } = convert(html, { extract: true });
This strips <nav>, <header>, <footer>, <aside>, <script>, <style>, ad-related elements, cookie banners, social widgets, and more.
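The idea behind extraction can be illustrated without the library. The following is a minimal sketch, assuming a hypothetical `SketchNode` tree shape and covering only the six tag names listed above; it is not the package's implementation, which parses real HTML and also handles ads, cookie banners, and social widgets.

```typescript
// Hypothetical minimal node shape, used only for this illustration.
interface SketchNode {
  name: string;            // tag name, or "#text" for text nodes
  text?: string;           // only set on text nodes
  children?: SketchNode[];
}

// Tag names the README lists as stripped in extraction mode.
const BOILERPLATE_TAGS = new Set(['nav', 'header', 'footer', 'aside', 'script', 'style']);

// Return a copy of the tree with boilerplate subtrees removed,
// or null when the node itself is boilerplate.
function stripBoilerplate(node: SketchNode): SketchNode | null {
  if (BOILERPLATE_TAGS.has(node.name)) return null;
  const children = (node.children ?? [])
    .map(child => stripBoilerplate(child))
    .filter((child): child is SketchNode => child !== null);
  return { ...node, children };
}

// Collect the remaining text in document order.
function textOf(node: SketchNode): string {
  if (node.name === '#text') return node.text ?? '';
  return (node.children ?? []).map(child => textOf(child)).join('');
}

const page: SketchNode = {
  name: 'body',
  children: [
    { name: 'nav', children: [{ name: '#text', text: 'Home | About' }] },
    { name: 'main', children: [{ name: '#text', text: 'Article body' }] },
    { name: 'footer', children: [{ name: '#text', text: 'Copyright' }] }
  ]
};

const cleaned = stripBoilerplate(page);
console.log(cleaned ? textOf(cleaned) : '');
```

Only the text under `main` survives, because the `nav` and `footer` subtrees are dropped before any conversion would run.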
Middleware
Framework middleware is available as separate packages — they serve Markdown automatically when AI agents request it via Accept: text/markdown:
// Express
import { markdown } from '@markdown-for-agents/express';
app.use(markdown());
// Fastify
import { markdown } from '@markdown-for-agents/fastify';
fastify.register(markdown());
// Hono
import { markdown } from '@markdown-for-agents/hono';
app.use(markdown());
// Next.js (auto-unwraps /_next/image URLs)
import { withMarkdown } from '@markdown-for-agents/nextjs';
export default withMarkdown(handler);
// Any Web Standard server (Cloudflare Workers, Deno, Bun)
import { markdownMiddleware } from '@markdown-for-agents/web';
const mw = markdownMiddleware();
The middleware inspects the Accept header. Normal browser requests pass through untouched. When an AI agent sends Accept: text/markdown, the HTML response is automatically converted.
| Package | Framework |
|---|---|
| @markdown-for-agents/express | Express |
| @markdown-for-agents/fastify | Fastify |
| @markdown-for-agents/hono | Hono |
| @markdown-for-agents/nextjs | Next.js |
| @markdown-for-agents/web | Web Standard (Cloudflare Workers, Deno, Bun) |
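The Accept-header check that the middleware performs can be approximated in a few lines. This is a simplified sketch, not the middleware's actual logic: the function name `wantsMarkdown` is hypothetical, and it treats any `text/markdown` entry with a non-zero q-value as a match rather than ranking it against `text/html`.

```typescript
// Decide whether a request asks for Markdown, given its raw Accept header value.
// Simplification: no ranking between media types, only "is text/markdown acceptable".
function wantsMarkdown(acceptHeader: string | undefined): boolean {
  if (!acceptHeader) return false;
  return acceptHeader.split(',').some(part => {
    const pieces = part.split(';').map(s => s.trim().toLowerCase());
    const type = pieces[0] ?? '';
    if (type !== 'text/markdown') return false;
    // An explicit q=0 means "not acceptable"; a missing q defaults to 1.
    const q = pieces.slice(1).find(p => p.startsWith('q='));
    return q === undefined || Number.parseFloat(q.slice(2)) > 0;
  });
}

console.log(wantsMarkdown('text/markdown'));
console.log(wantsMarkdown('text/html,application/xhtml+xml'));
```

A typical browser Accept header does not list `text/markdown`, so such requests would fall through to the normal HTML response.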
Custom Rules
Override how any element is converted, or add support for custom elements:
import { convert, createRule } from 'markdown-for-agents';
const { markdown } = convert(html, {
  rules: [
    createRule(
      node => node.name === 'div' && node.attribs.class?.includes('callout'),
      ({ convertChildren, node }) => `\n\n> **Note:** ${convertChildren(node).trim()}\n\n`
    )
  ]
});
Custom rules have higher priority than defaults and are applied first.
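"Applied first" can be pictured as a first-match-wins lookup over custom rules followed by defaults. The sketch below uses hypothetical `SketchElement` and `SketchRule` shapes purely for illustration; the library's real rule and node types may differ.

```typescript
// Hypothetical shapes for illustration only.
interface SketchElement { name: string; className?: string; text: string }
interface SketchRule {
  match: (el: SketchElement) => boolean;
  render: (el: SketchElement) => string;
}

// Try custom rules before defaults; the first rule whose predicate matches wins.
function renderElement(el: SketchElement, custom: SketchRule[], defaults: SketchRule[]): string {
  const rule = [...custom, ...defaults].find(r => r.match(el));
  return rule ? rule.render(el) : el.text;
}

const defaultRules: SketchRule[] = [
  { match: el => el.name === 'div', render: el => `${el.text}\n\n` }
];
const customRules: SketchRule[] = [
  {
    match: el => el.name === 'div' && (el.className ?? '').includes('callout'),
    render: el => `> **Note:** ${el.text.trim()}\n\n`
  }
];

console.log(renderElement({ name: 'div', className: 'callout', text: 'Back up first.' }, customRules, defaultRules));
console.log(renderElement({ name: 'div', text: 'Plain block.' }, customRules, defaultRules));
```

The callout `div` is rendered by the custom rule even though a default `div` rule also matches, while the plain `div` falls through to the default.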
Options
All options are optional. Defaults are shown below:
convert(html, {
  // Content extraction
  extract: false, // true | ExtractOptions
  // Custom conversion rules
  rules: [], // Rule[]
  // Base URL for resolving relative links and images
  baseUrl: '', // "https://example.com"
  // Heading style
  headingStyle: 'atx', // "atx" (#) or "setext" (underline)
  // Bullet character for unordered lists
  bulletChar: '-', // "-", "*", or "+"
  // Code block style
  codeBlockStyle: 'fenced', // "fenced" or "indented"
  // Fence character
  fenceChar: '`', // "`" or "~"
  // Strong delimiter
  strongDelimiter: '**', // "**" or "__"
  // Emphasis delimiter
  emDelimiter: '*', // "*" or "_"
  // Link style
  linkStyle: 'inlined', // "inlined" or "referenced"
  // Remove duplicate content blocks
  deduplicate: false, // true | DeduplicateOptions
  // Custom token counter (replaces built-in heuristic)
  tokenCounter: undefined // (text: string) => TokenEstimate
});
Custom Token Counter
By default, token estimation uses a fast heuristic (~4 characters per token). You can replace it with an exact tokenizer:
import { convert } from 'markdown-for-agents';
import { encoding_for_model } from 'tiktoken';
const enc = encoding_for_model('gpt-4o');
const { markdown, tokenEstimate } = convert(html, {
  tokenCounter: text => ({
    tokens: enc.encode(text).length,
    characters: text.length,
    words: text.split(/\s+/).filter(Boolean).length
  })
});
The custom counter receives the final markdown string and must return a TokenEstimate object with tokens, characters, and words fields. It flows through to middleware as well, so the x-markdown-tokens header will reflect your counter's value.
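For comparison, a heuristic counter of the kind described above (roughly 4 characters per token) can be written by hand. This is a sketch under assumptions: the name `heuristicEstimate`, the configurable `charsPerToken` parameter, and rounding up are illustrative choices, not confirmed details of the built-in counter.

```typescript
// Shape described in the README: tokens, characters, and words fields.
interface TokenEstimate { tokens: number; characters: number; words: number }

// Approximate tokens as characters / charsPerToken, rounded up (assumed rounding).
function heuristicEstimate(text: string, charsPerToken = 4): TokenEstimate {
  const characters = text.length;
  return {
    tokens: Math.ceil(characters / charsPerToken),
    characters,
    words: text.split(/\s+/).filter(Boolean).length
  };
}

console.log(heuristicEstimate('# Hello World\n\nThis is a **simple** example.'));
```

A function with this signature takes the final Markdown string and returns the three required fields, so it matches the shape the tokenCounter option expects.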
Deduplication Options
Pass deduplicate: true to use defaults, or pass a DeduplicateOptions object to customize behavior:
const { markdown } = convert(html, {
  deduplicate: { minLength: 5 } // catch short repeated phrases like "Read more"
});
The minLength option (default: 10) controls the minimum block length eligible for deduplication. Blocks shorter than this are always kept. Lower it to catch short repeated phrases, or raise it for more conservative deduplication.
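The effect of minLength can be seen in a small standalone sketch. It assumes blocks are separated by blank lines and compared by exact match after trimming; the function name `dedupeBlocks` and these details are assumptions for illustration, and the library's actual block detection may differ.

```typescript
// Keep the first occurrence of each block that is at least minLength characters long.
// Blocks shorter than minLength are never deduplicated.
function dedupeBlocks(markdown: string, minLength = 10): string {
  const seen = new Set<string>();
  const kept: string[] = [];
  for (const block of markdown.split(/\n{2,}/)) {
    const key = block.trim();
    if (key.length === 0) continue;
    if (key.length >= minLength) {
      if (seen.has(key)) continue; // repeated eligible block: drop it
      seen.add(key);
    }
    kept.push(key);
  }
  return kept.join('\n\n');
}

const repeated = [
  'Read more',
  'A long paragraph of text.',
  'Read more',
  'A long paragraph of text.'
].join('\n\n');

console.log(dedupeBlocks(repeated));    // default minLength of 10
console.log(dedupeBlocks(repeated, 5)); // lowered threshold
```

With the default threshold, the nine-character "Read more" is too short to be eligible and both copies stay; with minLength set to 5, the second copy is removed as well.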
Supported Elements
Block
| HTML | Markdown |
|---|---|
| <h1>...<h6> | # Heading (atx) or underline (setext) |
| <p> | Paragraph with blank lines |
| <blockquote> | > Quoted text |
| <pre><code> | Fenced code block with language |
| <hr> | --- |
| <br> | Trailing double-space line break |
| <ul>, <ol>, <li> | Lists with nesting and indentation |
| <table> | GFM pipe table with separator row |
| <script>, <style>, <noscript>, <template> | Stripped |
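The difference between the two heading styles in the first row is easiest to see side by side. The sketch below is illustrative only: `renderHeading` is a hypothetical helper, and falling back to ATX for levels 3 to 6 is an assumption (Setext syntax defines only two levels), not documented library behavior.

```typescript
type HeadingStyle = 'atx' | 'setext';

// Render a heading in ATX (# prefix) or Setext (underline) style.
function renderHeading(text: string, level: number, style: HeadingStyle = 'atx'): string {
  // Setext only has two levels ("=" and "-"), so deeper headings use ATX here.
  if (style === 'setext' && level <= 2) {
    const underline = (level === 1 ? '=' : '-').repeat(Math.max(text.length, 3));
    return `${text}\n${underline}`;
  }
  return `${'#'.repeat(level)} ${text}`;
}

console.log(renderHeading('Hello World', 1));
console.log(renderHeading('Hello World', 1, 'setext'));
```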
Inline
| HTML | Markdown |
|---|---|
| <strong>, <b> | **bold** |
| <em>, <i> | *italic* |
| <del>, <s>, <strike> | ~~strikethrough~~ |
| <code> | `inline code` |
| <a> | [text](url) with title and baseUrl support |
| <img> |  with title and baseUrl support |
| <sub> | ~subscript~ |
| <sup> | ^superscript^ |
| <abbr>, <mark> | Pass-through (text preserved) |
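"Title and baseUrl support" for links can be sketched with the standard URL constructor. This is a minimal illustration, assuming a hypothetical `renderLink` helper; how the library itself escapes titles or handles unparseable URLs is not stated in this README.

```typescript
// Render an inline Markdown link, resolving href against baseUrl when one is given.
function renderLink(text: string, href: string, baseUrl = '', title?: string): string {
  let url = href;
  if (baseUrl) {
    try {
      url = new URL(href, baseUrl).href; // standard relative-URL resolution
    } catch {
      // Unparseable input: keep the original href unchanged (assumed fallback).
    }
  }
  // Escape double quotes so the title stays a valid quoted string.
  const titlePart = title ? ` "${title.replace(/"/g, '\\"')}"` : '';
  return `[${text}](${url}${titlePart})`;
}

console.log(renderLink('Docs', '/docs', 'https://example.com'));
console.log(renderLink('Intro', 'guide/intro', 'https://example.com/base/', 'Getting started'));
```

With an empty baseUrl, the default shown in the options, the href is emitted unchanged.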
Subpath Exports
The core package provides fine-grained imports for tree-shaking:
import { convert } from 'markdown-for-agents';
import { extractContent } from 'markdown-for-agents/extract';
import { estimateTokens } from 'markdown-for-agents/tokens';
Runtime Compatibility
| Runtime | Version | Status |
|---|---|---|
| Node.js | >= 20 | Tested |
| Bun | >= 1.0 | Tested |
| Deno | >= 1.40 | Tested |
| Cloudflare Workers | - | Compatible |
| Vercel Edge | - | Compatible |
| Browsers | ES2022+ | Compatible |
License
MIT