aeokit
See your website through an AI agent's eyes.
AEOkit is an Agentic Engine Optimization toolkit — it measures how well an AI agent can actually use your website, by running one. It combines a deterministic static audit (Lab) with a real agent session driving a real browser (Field) and emits a Lighthouse-style report with a grade, six scoring dimensions, and an honest trace of what the agent tried, saw, and missed.

Status: alpha (v0.2.1 on npm). The engine is solid; the product packaging around it isn't finished yet. Read Known limits before you pitch this internally.
Why it exists
The web is being re-consumed by AI agents — search summarisers, shopping agents, coding copilots, computer-use models. "Does my site render?" is no longer the question. The questions are:
- Can an agent discover what my site offers without a human guiding it?
- Can it read the page cheaply, or does it burn 80 K tokens on a header carousel?
- Can it complete a task end-to-end, or does a cookie wall / bot challenge / SPA hydration race stop it on step 1?
AEOkit answers those by running the agent.
What you get
Lab — static audit (no API key)
Runs in seconds against any URL. Checks what an agent would see if it used fetch or curl:
- llms.txt, robots.txt, sitemap.xml discovery
- Per-crawler access matrix (GPTBot, Google-Extended, ClaudeBot, PerplexityBot, CCBot…)
- Token-budget measurement on the landing page + key docs (real gpt-tokenizer counts, not len/4 estimates)
- Two-tier fetcher: plain HTTP first, stealth-Chromium fallback if the plain tier is blocked — so we can tell you which tier your site accepts
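To see what a specific crawler gets over the plain tier, you can run the audit with an overridden UA and skip the rendered fallback. A minimal sketch — the URL and the UA value are placeholders, not special-cased names:
# plain-tier-only audit, fetched as a crawler UA of your choosing
aeokit audit https://example.com --user-agent "GPTBot" --no-render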
Field — agent session (needs an API key)
Launches a real headless Chromium, hands it to an LLM (Claude / OpenAI / Gemini) via a 10-tool browser harness (screenshot, click, type, scroll, select_option, press_key, navigate, get_page_info, wait, go_back), and gives it a plain-English task. Emits a full trace — every step, every tool call, every token.
- Smart observation: viewport-scoped, semantically pruned a11y trees with list summarisation (stable ~18–20 K tokens/step instead of ballooning)
- Pre-flight DOM weight analysis — auto-switches the observation mode on heavy / SPA-hydrating pages
- Domain guardrails: agent can't wander off to accounts.google.com
- Auto-dismisses cookie / privacy banners via DuckDuckGo's autoconsent rules (~200 CMPs) before the agent sees the page
- Optional video recording (--record) — watch what the agent actually saw
- Statistical runs (--runs N) with pass-rate and per-dimension aggregates
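For example, to repeat a scenario five times and get the aggregated pass-rate (the file name and run count are placeholders):
# five repeated Field runs of the same task, reporting pass-rate + per-dimension aggregates
aeokit run my-task.yaml --runs 5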
Combined score
aeokit score <url> --scenario task.yaml runs both and produces a single composite grade (Lab × 40% + Field × 60%) with an HTML report that shows them side by side.
Quickstart
Requirements: Node 20+, macOS/Linux/Windows. For Field runs, an API key from at least one of: Anthropic, OpenAI, or Google AI Studio.
# 1. Install
npm install -g aeokit
npx playwright install chromium # one-time — downloads headless browser
# 2. Configure your API key (only needed for Field runs)
# Either set an env var — ANTHROPIC_API_KEY / OPENAI_API_KEY / GOOGLE_API_KEY —
# or drop an aeokit.config.yaml in the directory you run from (see below).

Prefer not to install globally? Every command below works with npx aeokit ….
Verify it's working:
# Lab only — no API key needed
aeokit audit https://modelcontextprotocol.io
# Field run against your own YAML scenario (see "Writing a scenario" below)
aeokit run my-task.yaml --record
# Combined Lab + Field with a composite grade + video recording
aeokit score https://modelcontextprotocol.io --scenario my-task.yaml --record
# Override the model per run (without editing aeokit.config.yaml)
aeokit run my-task.yaml --model claude-opus-4-7

Reports land in ./audit-reports/, ./aeokit-results/, or ./score-reports/ (--output overrides). Open the .html file in any browser.
Prebuilt scenarios live in examples/scenarios/ in the GitHub repo — copy any of them into your project as a starting point.
Configuration
AEOkit reads aeokit.config.yaml from the current working directory. Env vars work as a fallback when a provider block is missing — set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY / GEMINI_API_KEY and you can skip the YAML entirely.
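For a one-off run, exporting the key in your shell is enough — a minimal sketch, where the key value is a placeholder and my-task.yaml is whatever scenario you point it at:
# any one of the documented provider keys works
export ANTHROPIC_API_KEY=sk-ant-...   # or OPENAI_API_KEY / GOOGLE_API_KEY
aeokit run my-task.yaml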
For a per-project config, drop this in aeokit.config.yaml (and git-ignore it — it holds secrets):
providers:
  claude:
    apiKey: sk-ant-...
    # model: claude-sonnet-4-6   # optional override
  openai:
    apiKey: sk-proj-...
    # model: gpt-4o
  gemini:
    apiKey: ...
    # model: gemini-2.5-flash

The full example lives at aeokit.config.example.yaml in the GitHub repo.
How it works
┌────────────────── aeokit score <url> ──────────────────────
│
│  LAB (deterministic, ~10 s, no LLM)
│   ├─ Discovery: llms.txt, sitemap, robots
│   ├─ Access: UA matrix across GPTBot/Claude/Perplexity…
│   └─ Tokens: budget + heatmap via gpt-tokenizer
│
│              ┌──── composite ────┐
│              │ Lab 40 + Field 60 │
│              └───────────────────┘
│
│  FIELD (empirical, ~30 s–2 min, needs LLM)
│   ├─ launch Chromium → attach autoconsent
│   ├─ goto + preflight DOM weight → pick observation mode
│   ├─ agent loop: observe → plan → act → trace → repeat
│   └─ assertions: element_visible · text_contains · tool_called ·
│                  url_matches · llm_judge · custom_eval
│
└─────────────────────────────────────────────────────────────

The same 6-dimension model scores both sides:
| Dimension | Weight | What it measures |
|---|---|---|
| Task Completion | 30% | Assertion pass rate + natural completion |
| Step Efficiency | 15% | Steps per successful action (absolute: <1.5 is excellent) |
| Token Economy | 15% | Tokens per action (absolute: <3 K is excellent) |
| Error Resilience | 15% | Tool-call success rate + recovery detection |
| Navigation Clarity | 15% | Observation-only step ratio — low = page is readable |
| Interaction Directness | 10% | Action vs. observation tool ratio |
Crashed / zero-activity runs don't get vacuous credit: the five non-completion dimensions return 0 when the agent never ran, and the task-completion multiplier drops sharply on fatal errors. A "pre-assertion passed" site that blocked the agent will score F, not C.
Commands
| Command | Purpose |
|---|---|
| aeokit audit <url> | Static audit only. No API key needed. |
| aeokit run <scenarios…> | Empirical agent runs against a task YAML. |
| aeokit score <url> | Audit + (optional) empirical, with a composite grade. |
| aeokit inspect <url> | Probe the page for WebMCP tools (Phase 5 preview) + DOM stats. |
Useful flags:
- --provider claude|openai|gemini — pick the LLM (defaults: claude-sonnet-4-6, gpt-4o, gemini-2.5-flash)
- -m, --model <id> — override the provider's model for a single run (e.g. --model claude-opus-4-7, --model gpt-4o-mini). Persistent defaults live in providers.<name>.model in aeokit.config.yaml.
- --runs N — repeat the scenario N times, report pass-rate + aggregates
- --record — save a .webm of the agent session
- --headed — see the browser as the agent drives it
- --format json,html,sarif,md — pick report formats
- --min-score N — CI exit code if the composite score drops below N
- --diff baseline.json --fail-on-regression — compare audits and fail on drops (CI)
- --no-render — skip the stealth-Chromium fallback in the audit fetcher
- --user-agent "…" — override the UA for audit fetches
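Two illustrative combinations — the URL, scenario file, and score threshold are placeholders:
# watch the browser live while overriding the model for this run only
aeokit run my-task.yaml --headed --model gpt-4o-mini
# composite grade gated for CI: exit non-zero if the score drops below 70
aeokit score https://example.com --scenario my-task.yaml --min-score 70 --format json,html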
Writing a scenario
# examples/scenarios/real-world/hackernews-browse.yaml
name: "Hacker News - Read top stories"
url: "https://news.ycombinator.com"
mode: general
intent: |
  You are on Hacker News. Read the homepage and report the titles of
  the top 3 stories along with their points and comment counts.
assertions:
  - type: url_matches
    pattern: "news.ycombinator.com"
  - type: tool_called
    tool: get_page_info
  - type: llm_judge
    question: "Did the agent report the titles of at least 3 actual HN stories?"
    expectedAnswer: "yes"
config:
  maxSteps: 10
  maxTokens: 40000
  observationMode: a11y
  handleConsent: true   # default — set false if you're testing consent UI

Assertion types:
- element_visible — CSS selector is present and visible
- text_contains — element text contains a string
- tool_called — agent invoked a specific tool (optionally with args)
- url_matches — final URL matches a pattern
- llm_judge — semantic yes/no judged by the LLM from the trace
- custom_eval — arbitrary JS returning a value compared to expected
Prebuilt scenarios live in examples/scenarios/ — Hacker News, Wikipedia, GitHub, Stripe docs, MCP docs, Claude docs, NYT, TodoMVC, Reddit.
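If you've cloned the repo, you can point aeokit run at one of them directly, for example the Hacker News scenario shown above:
# run a prebuilt scenario straight from a checkout, with a recording
aeokit run examples/scenarios/real-world/hackernews-browse.yaml --record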
Reports
Every run produces:
- JSON — full trace, assertions, metrics, scored dimensions, insights. Schema-versioned so you can diff in CI.
- HTML — self-contained, no network. Dimension strip, "What to fix this week" insights, pre-flight analysis, consent outcome, assertion rows, dimension deep-dives, collapsible step-by-step trace.
- SARIF + Markdown — on aeokit audit when --format sarif,md is set. Drop-in for GitHub code-scanning and sticky PR comments (see examples/ci/github-action.yml); a minimal invocation is sketched after this list.
- WebM — if --record was set, a video of the agent session.
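A minimal CI-style sketch that produces the SARIF/Markdown artefacts and fails on regressions — the URL and baseline.json path are placeholders:
# audit in CI: SARIF + Markdown output, compared against a saved baseline
aeokit audit https://example.com --format sarif,md --diff baseline.json --fail-on-regression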
Programmatic API
import { runScenario, createProvider, loadScenario, loadConfig } from "aeokit";
const config = await loadConfig();
const provider = createProvider("claude", config);
const scenario = await loadScenario("./task.yaml");
const result = await runScenario({
  scenario,
  provider,
  browserOptions: { headless: true },
});
console.log(result.totalSteps, result.assertions, result.consent);

Everything the CLI does is exported from the root. The types in dist/index.d.ts are stable within a minor version.
Known limits
In the spirit of not shipping bullshit:
- Three providers wired, none calibrated. Claude / OpenAI / Gemini all run end-to-end, but scoring thresholds were calibrated on Claude traces — don't read a gpt-4o run's 82 as meaning the same thing as a claude-sonnet-4-6 run's 82 until we publish cross-model normalisation data.
- No Playwright auto-install. First run needs npx playwright install chromium — we detect the missing binary and point you at the command, but we don't fetch it for you.
- Sites with aggressive bot defences will fail. Reddit, LinkedIn, and CF-protected banking sites typically return JS-challenge pages. The Field browser runs a stealth preset (STEALTH_INIT_SCRIPT + realistic UA) but some sites still detect headless Chromium and the report scores them an honest F.
- Error reporting is raw. A bot-challenge shows up as page.title: Execution context was destroyed instead of a clean BLOCKED_BY_BOT_CHALLENGE signal. Taxonomy is on the roadmap.
- Scoring thresholds are principled, not calibrated. They come from first-principles reasoning about tokens/step/actions; they haven't been fitted against a labelled benchmark yet. The composite weights (40/60) are reasonable, not sacred.
- WebMCP mode is a stub. mode: webmcp in a scenario throws. Discovery via navigator.modelContext is planned for Phase 5.
Roadmap
| Done | Next | Later |
|---|---|---|
| Lab audit (discovery / access / tokens / parsability / capability) | blockedBy error taxonomy | WebMCP mode (Phase 5) |
| Field agent loop + smart observation | Multi-LLM comparison report | Calibrated cross-model scoring |
| 6-dimension scoring with honest zeros | Landing page + docs site | 100-site public benchmark |
| Combined score command | — | — |
| Claude / OpenAI / Gemini providers | — | — |
| Consent banner auto-dismiss (200 CMPs) | — | — |
| CI primitives: GitHub Action, SARIF, PR comment, --diff | — | — |
| Published on npm (aeokit) via Changesets | — | — |
Contributing
Bug reports, PRs, and new scenario examples are welcome. Dev setup, the test/lint/build loop, release flow, and commit style all live in CONTRIBUTING.md. The internal architecture notes sit in CLAUDE.md.
License
MIT.