aeokit
See your website through an AI agent's eyes.
AEOkit is an Agentic Engine Optimization toolkit — it measures how well an AI agent can actually use your website, by running one. It combines a deterministic static audit (Lab) with a real agent session driving a real browser (Field) and emits a Lighthouse-style report with a grade, six scoring dimensions, and an honest trace of what the agent tried, saw, and missed.

Status: alpha (v0.2.1 on npm). The engine is solid; the product packaging around it isn't finished yet. Read Known limits before you pitch this internally.
Why it exists
The web is being re-consumed by AI agents — search summarisers, shopping agents, coding copilots, computer-use models. "Does my site render?" is no longer the question. The questions are:
- Can an agent discover what my site offers without a human guiding it?
- Can it read the page cheaply, or does it burn 80 K tokens on a header carousel?
- Can it complete a task end-to-end, or does a cookie wall / bot challenge / SPA hydration race stop it on step 1?
AEOkit answers those by running the agent.
What you get
Lab — static audit (no API key)
Runs in seconds against any URL. Checks what an agent would see if it used fetch or curl:
- llms.txt, robots.txt, sitemap.xml discovery
- Per-crawler access matrix (GPTBot, Google-Extended, ClaudeBot, PerplexityBot, CCBot…)
- Token-budget measurement on the landing page + key docs (real gpt-tokenizer counts, not len/4 estimates)
- Two-tier fetcher: plain HTTP first, stealth-Chromium fallback if the plain tier is blocked — so we can tell you which tier your site accepts
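To see what a specific crawler gets over the plain tier, you can run the audit with an overridden UA and skip the rendered fallback. A minimal sketch — the URL and the UA value are placeholders, not special-cased names:
# plain-tier-only audit, fetched as a crawler UA of your choosing
aeokit audit https://example.com --user-agent "GPTBot" --no-render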
Field — agent session (needs an API key)
Launches a real headless Chromium, hands it to an LLM (Claude / OpenAI / Gemini) via a 10-tool browser harness (screenshot, click, type, scroll, select_option, press_key, navigate, get_page_info, wait, go_back), and gives it a plain-English task. Emits a full trace — every step, every tool call, every token.
- Smart observation: viewport-scoped, semantically pruned a11y trees with list summarisation (stable ~18–20 K tokens/step instead of ballooning)
- Pre-flight DOM weight analysis — auto-switches the observation mode on heavy / SPA-hydrating pages
- Domain guardrails: agent can't wander off to accounts.google.com
- Auto-dismisses cookie / privacy banners via DuckDuckGo's autoconsent rules (~200 CMPs) before the agent sees the page
- Optional video recording (--record) — watch what the agent actually saw
- Statistical runs (--runs N) with pass-rate and per-dimension aggregates
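For example, to repeat a scenario five times and get the aggregated pass-rate (the file name and run count are placeholders):
# five repeated Field runs of the same task, reporting pass-rate + per-dimension aggregates
aeokit run my-task.yaml --runs 5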
Combined score
aeokit score <url> --scenario task.yaml runs both and produces a single composite grade (Lab × 40% + Field × 60%) with an HTML report that shows them side by side.
Quickstart
Requirements: Node 20+, macOS/Linux/Windows. For Field runs, an API key from at least one of: Anthropic, OpenAI, or Google AI Studio.
# 1. Install
npm install -g aeokit
npx playwright install chromium # one-time — downloads headless browser
# 2. Configure your API key (only needed for Field runs)
# Either set an env var — ANTHROPIC_API_KEY / OPENAI_API_KEY / GOOGLE_API_KEY —
# or drop an aeokit.config.yaml in the directory you run from (see below).

Prefer not to install globally? Every command below works with npx aeokit ….
Verify it's working:
# Lab only — no API key needed
aeokit audit https://modelcontextprotocol.io
# Field run against your own YAML scenario (see "Writing a scenario" below)
aeokit run my-task.yaml --record
# Combined Lab + Field with a composite grade + video recording
aeokit score https://modelcontextprotocol.io --scenario my-task.yaml --record
# Override the model per run (without editing aeokit.config.yaml)
aeokit run my-task.yaml --model claude-opus-4-7

Reports land in ./audit-reports/, ./aeokit-results/, or ./score-reports/ (--output overrides). Open the .html file in any browser.
Prebuilt scenarios live in examples/scenarios/ in the GitHub repo — copy any of them into your project as a starting point.
Configuration
AEOkit reads aeokit.config.yaml from the current working directory. Env vars work as a fallback when a provider block is missing — set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY / GEMINI_API_KEY and you can skip the YAML entirely.
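For a one-off run, exporting the key in your shell is enough — a minimal sketch, where the key value is a placeholder and my-task.yaml is whatever scenario you point it at:
# any one of the documented provider keys works
export ANTHROPIC_API_KEY=sk-ant-...   # or OPENAI_API_KEY / GOOGLE_API_KEY
aeokit run my-task.yaml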
For a per-project config, drop this in aeokit.config.yaml (and git-ignore it — it holds secrets):
providers:
  claude:
    apiKey: sk-ant-...
    # model: claude-sonnet-4-6   # optional override
  openai:
    apiKey: sk-proj-...
    # model: gpt-4o
  gemini:
    apiKey: ...
    # model: gemini-2.5-flash

The full example lives at aeokit.config.example.yaml in the GitHub repo.
How it works
┌────────────────── aeokit score <url> ──────────────────────
│
│  LAB (deterministic, ~10 s, no LLM)
│   ├─ Discovery: llms.txt, sitemap, robots
│   ├─ Access: UA matrix across GPTBot/Claude/Perplexity…
│   └─ Tokens: budget + heatmap via gpt-tokenizer
│
│              ┌──── composite ────┐
│              │ Lab 40 + Field 60 │
│              └───────────────────┘
│
│  FIELD (empirical, ~30 s–2 min, needs LLM)
│   ├─ launch Chromium → attach autoconsent
│   ├─ goto + preflight DOM weight → pick observation mode
│   ├─ agent loop: observe → plan → act → trace → repeat
│   └─ assertions: element_visible · text_contains · tool_called ·
│                  url_matches · llm_judge · custom_eval
│
└─────────────────────────────────────────────────────────────

The same 6-dimension model scores both sides:
| Dimension | Weight | What it measures |
|---|---|---|
| Task Completion | 30% | Assertion pass rate + natural completion |
| Step Efficiency | 15% | Steps per successful action (absolute: <1.5 is excellent) |
| Token Economy | 15% | Tokens per action (absolute: <3 K is excellent) |
| Error Resilience | 15% | Tool-call success rate + recovery detection |
| Navigation Clarity | 15% | Observation-only step ratio — low = page is readable |
| Interaction Directness | 10% | Action vs. observation tool ratio |
Crashed / zero-activity runs don't get vacuous credit: the five non-completion dimensions return 0 when the agent never ran, and the task-completion multiplier drops sharply on fatal errors. A "pre-assertion passed" site that blocked the agent will score F, not C.
Commands
| Command | Purpose |
|---|---|
| aeokit audit <url> | Static audit only. No API key needed. |
| aeokit run <scenarios…> | Empirical agent runs against a task YAML. |
| aeokit score <url> | Audit + (optional) empirical, with a composite grade. |
| aeokit inspect <url> | Probe the page for WebMCP tools (Phase 5 preview) + DOM stats. |
Useful flags:
- --provider claude|openai|gemini — pick the LLM (defaults: claude-sonnet-4-6, gpt-4o, gemini-2.5-flash)
- -m, --model <id> — override the provider's model for a single run (e.g. --model claude-opus-4-7, --model gpt-4o-mini). Persistent defaults live in providers.<name>.model in aeokit.config.yaml.
- --runs N — repeat the scenario N times, report pass-rate + aggregates
- --record — save a .webm of the agent session
- --headed — see the browser as the agent drives it
- --format json,html,sarif,md — pick report formats
- --min-score N — CI exit code if the composite score drops below N
- --diff baseline.json --fail-on-regression — compare audits and fail on drops (CI)
- --no-render — skip the stealth-Chromium fallback in the audit fetcher
- --user-agent "…" — override the UA for audit fetches
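Two illustrative combinations — the URL, scenario file, and score threshold are placeholders:
# watch the browser live while overriding the model for this run only
aeokit run my-task.yaml --headed --model gpt-4o-mini
# composite grade gated for CI: exit non-zero if the score drops below 70
aeokit score https://example.com --scenario my-task.yaml --min-score 70 --format json,html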
Writing a scenario
# examples/scenarios/real-world/hackernews-browse.yaml
name: "Hacker News - Read top stories"
url: "https://news.ycombinator.com"
mode: general
intent: |
  You are on Hacker News. Read the homepage and report the titles of
  the top 3 stories along with their points and comment counts.
assertions:
  - type: url_matches
    pattern: "news.ycombinator.com"
  - type: tool_called
    tool: get_page_info
  - type: llm_judge
    question: "Did the agent report the titles of at least 3 actual HN stories?"
    expectedAnswer: "yes"
config:
  maxSteps: 10
  maxTokens: 40000
  observationMode: a11y
  handleConsent: true   # default — set false if you're testing consent UI

Assertion types:
- element_visible — CSS selector is present and visible
- text_contains — element text contains a string
- tool_called — agent invoked a specific tool (optionally with args)
- url_matches — final URL matches a pattern
- llm_judge — semantic yes/no judged by the LLM from the trace
- custom_eval — arbitrary JS returning a value compared to expected
Prebuilt scenarios live in examples/scenarios/ — Hacker News, Wikipedia, GitHub, Stripe docs, MCP docs, Claude docs, NYT, TodoMVC, Reddit.
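If you've cloned the repo, you can point aeokit run at one of them directly, for example the Hacker News scenario shown above:
# run a prebuilt scenario straight from a checkout, with a recording
aeokit run examples/scenarios/real-world/hackernews-browse.yaml --record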
Reports
Every run produces:
- JSON — full trace, assertions, metrics, scored dimensions, insights. Schema-versioned so you can diff in CI.
- HTML — self-contained, no network. Dimension strip, "What to fix this week" insights, pre-flight analysis, consent outcome, assertion rows, dimension deep-dives, collapsible step-by-step trace.
- SARIF + Markdown — on aeokit audit when --format sarif,md is set. Drop-in for GitHub code-scanning and sticky PR comments (see examples/ci/github-action.yml); a minimal invocation is sketched after this list.
- WebM — if --record was set, a video of the agent session.
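A minimal CI-style sketch that produces the SARIF/Markdown artefacts and fails on regressions — the URL and baseline.json path are placeholders:
# audit in CI: SARIF + Markdown output, compared against a saved baseline
aeokit audit https://example.com --format sarif,md --diff baseline.json --fail-on-regression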
Programmatic API
import { runScenario, createProvider, loadScenario, loadConfig } from "aeokit";
const config = await loadConfig();
const provider = createProvider("claude", config);
const scenario = await loadScenario("./task.yaml");
const result = await runScenario({
  scenario,
  provider,
  browserOptions: { headless: true },
});
console.log(result.totalSteps, result.assertions, result.consent);

Everything the CLI does is exported from the root. The types in dist/index.d.ts are stable within a minor version.
Known limits
In the spirit of not shipping bullshit:
- Three providers wired, none calibrated. Claude / OpenAI / Gemini all run end-to-end, but scoring thresholds were calibrated on Claude traces — don't read a gpt-4o run's 82 as meaning the same thing as a claude-sonnet-4-6 run's 82 until we publish cross-model normalisation data.
- No Playwright auto-install. First run needs npx playwright install chromium — we detect the missing binary and point you at the command, but we don't fetch it for you.
- Sites with aggressive bot defences will fail. Reddit, LinkedIn, and CF-protected banking sites typically return JS-challenge pages. The Field browser runs a stealth preset (STEALTH_INIT_SCRIPT + realistic UA) but some sites still detect headless Chromium and the report scores them an honest F.
- Error reporting is raw. A bot-challenge shows up as page.title: Execution context was destroyed instead of a clean BLOCKED_BY_BOT_CHALLENGE signal. Taxonomy is on the roadmap.
- Scoring thresholds are principled, not calibrated. They come from first-principles reasoning about tokens/step/actions; they haven't been fitted against a labelled benchmark yet. The composite weights (40/60) are reasonable, not sacred.
- WebMCP mode is a stub. mode: webmcp in a scenario throws. Discovery via navigator.modelContext is planned for Phase 5.
Roadmap
| Done | Next | Later |
|---|---|---|
| Lab audit (discovery / access / tokens / parsability / capability) | blockedBy error taxonomy | WebMCP mode (Phase 5) |
| Field agent loop + smart observation | Multi-LLM comparison report | Calibrated cross-model scoring |
| 6-dimension scoring with honest zeros | Landing page + docs site | 100-site public benchmark |
| Combined score command | — | — |
| Claude / OpenAI / Gemini providers | — | — |
| Consent banner auto-dismiss (200 CMPs) | — | — |
| CI primitives: GitHub Action, SARIF, PR comment, --diff | — | — |
| Published on npm (aeokit) via Changesets | — | — |
Contributing
Bug reports, PRs, and new scenario examples are welcome. Dev setup, the test/lint/build loop, release flow, and commit style all live in CONTRIBUTING.md. The internal architecture notes sit in CLAUDE.md.
License
MIT.