AEOkit

See your website through an AI agent's eyes.

AEOkit is an Agentic Engine Optimization toolkit — it measures how well an AI agent can actually use your website, by running one. It combines a deterministic static audit (Lab) with a real agent session driving a real browser (Field) and emits a Lighthouse-style report with a grade, six scoring dimensions, and an honest trace of what the agent tried, saw, and missed.

AEOkit score report

Status: alpha (v0.2.1 on npm). The engine is solid; the product packaging around it isn't finished yet. Read Known limits before you pitch this internally.


Why it exists

The web is being re-consumed by AI agents — search summarisers, shopping agents, coding copilots, computer-use models. "Does my site render?" is no longer the question. The questions are:

  • Can an agent discover what my site offers without a human guiding it?
  • Can it read the page cheaply, or does it burn 80 K tokens on a header carousel?
  • Can it complete a task end-to-end, or does a cookie wall / bot challenge / SPA hydration race stop it on step 1?

AEOkit answers those by running the agent.


What you get

Lab — static audit (no API key)

Runs in seconds against any URL. Checks what an agent would see if it used fetch or curl:

  • llms.txt, robots.txt, sitemap.xml discovery
  • Per-crawler access matrix (GPTBot, Google-Extended, ClaudeBot, PerplexityBot, CCBot…; example below)
  • Token-budget measurement on the landing page + key docs (real gpt-tokenizer counts, not len/4 estimates)
  • Two-tier fetcher: plain HTTP first, stealth-Chromium fallback if the plain tier is blocked — so we can tell you which tier your site accepts
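
For example, the per-crawler access matrix would report a robots.txt like this (an illustrative file, not taken from any real site) as open to GPTBot but closed to CCBot:

User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /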

Field — agent session (needs an API key)

Launches a real headless Chromium, hands it to an LLM (Claude / OpenAI / Gemini) via a 10-tool browser harness (screenshot, click, type, scroll, select_option, press_key, navigate, get_page_info, wait, go_back), and gives it a plain-English task. Emits a full trace — every step, every tool call, every token.

  • Smart observation: viewport-scoped, semantically pruned a11y trees with list summarisation (stable ~18–20 K tokens/step instead of ballooning)
  • Pre-flight DOM weight analysis — auto-switches the observation mode on heavy / SPA-hydrating pages
  • Domain guardrails: agent can't wander off to accounts.google.com
  • Auto-dismisses cookie / privacy banners via DuckDuckGo's autoconsent rules (~200 CMPs) before the agent sees the page
  • Optional video recording (--record) — watch what the agent actually saw
  • Statistical runs (--runs N) with pass-rate and per-dimension aggregates
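
A typical Field invocation combining these flags (my-task.yaml is a placeholder; all three flags are documented under Commands below):

# 5 repeated runs with video, on the OpenAI provider
aeokit run my-task.yaml --provider openai --runs 5 --record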

Combined score

aeokit score <url> --scenario task.yaml runs both and produces a single composite grade (Lab × 40% + Field × 60%) with an HTML report that shows them side by side.
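
For example, a Lab score of 90 combined with a Field score of 70 comes out at 0.4 × 90 + 0.6 × 70 = 78 (illustrative numbers).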


Quickstart

Requirements: Node 20+, macOS/Linux/Windows. For Field runs, an API key from at least one of: Anthropic, OpenAI, or Google AI Studio.

# 1. Install
npm install -g aeokit
npx playwright install chromium     # one-time — downloads headless browser

# 2. Configure your API key (only needed for Field runs)
#    Either set an env var — ANTHROPIC_API_KEY / OPENAI_API_KEY / GOOGLE_API_KEY —
#    or drop an aeokit.config.yaml in the directory you run from (see below).

Prefer not to install globally? Every command below works with npx aeokit ….

Verify it's working:

# Lab only — no API key needed
aeokit audit https://modelcontextprotocol.io

# Field run against your own YAML scenario (see "Writing a scenario" below)
aeokit run my-task.yaml --record

# Combined Lab + Field with a composite grade + video recording
aeokit score https://modelcontextprotocol.io --scenario my-task.yaml --record

# Override the model per run (without editing aeokit.config.yaml)
aeokit run my-task.yaml --model claude-opus-4-7

Reports land in ./audit-reports/, ./aeokit-results/, or ./score-reports/ (--output overrides). Open the .html file in any browser.

Prebuilt scenarios live in examples/scenarios/ in the GitHub repo — copy any of them into your project as a starting point.


Configuration

AEOkit reads aeokit.config.yaml from the current working directory. Env vars work as a fallback when a provider block is missing — set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY / GEMINI_API_KEY and you can skip the YAML entirely.

For a per-project config, drop this in aeokit.config.yaml (git-ignore it — it holds secrets):

providers:
  claude:
    apiKey: sk-ant-...
    # model: claude-sonnet-4-6   # optional override
  openai:
    apiKey: sk-proj-...
    # model: gpt-4o
  gemini:
    apiKey: ...
    # model: gemini-2.5-flash

The full example lives at aeokit.config.example.yaml in the GitHub repo.
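
If you'd rather keep secrets out of files entirely, the env-var fallback alone is enough; pick the provider per run with --provider:

# no YAML needed: the provider block falls back to the env var
export ANTHROPIC_API_KEY=sk-ant-...
aeokit run my-task.yaml --provider claude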


How it works

┌────────────────── aeokit score <url> ──────────────────────┐
│                                                            │
│   LAB (deterministic, ~10 s, no LLM)                       │
│   ├─ Discovery: llms.txt, sitemap, robots                  │
│   ├─ Access:    UA matrix across GPTBot/Claude/Perplexity… │
│   └─ Tokens:    budget + heatmap via gpt-tokenizer         │
│                                                            │
│                          ┌──── composite ────┐             │
│                          │ Lab 40 + Field 60 │             │
│                          └───────────────────┘             │
│                                                            │
│   FIELD (empirical, ~30 s–2 min, needs LLM)                │
│   ├─ launch Chromium → attach autoconsent                  │
│   ├─ goto + preflight DOM weight → pick observation mode   │
│   ├─ agent loop:  observe → plan → act → trace → repeat    │
│   └─ assertions:  element_visible · text_contains ·        │
│                   tool_called · url_matches · llm_judge ·  │
│                   custom_eval                              │
│                                                            │
└────────────────────────────────────────────────────────────┘

The same 6-dimension model scores both sides:

Dimension                Weight   What it measures
Task Completion          30%      Assertion pass rate + natural completion
Step Efficiency          15%      Steps per successful action (absolute: <1.5 is excellent)
Token Economy            15%      Tokens per action (absolute: <3 K is excellent)
Error Resilience         15%      Tool-call success rate + recovery detection
Navigation Clarity       15%      Observation-only step ratio — low = page is readable
Interaction Directness   10%      Action vs. observation tool ratio

Crashed / zero-activity runs don't get vacuous credit: the five non-completion dimensions return 0 when the agent never ran, and the task-completion multiplier drops sharply on fatal errors. A site that blocks the agent outright scores an F, not a C, even if some of its assertions vacuously passed.
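
As a rough sketch of how the numbers combine, the dimension weights above and the Lab/Field split reduce to plain weighted sums (the field names below are illustrative, and AEOkit's internals may normalise differently):

// Sketch only: weights from the table above; Lab 40 / Field 60 from "Combined score".
// Assumes each dimension is already normalised to 0–100; field names are illustrative.
type Dimensions = {
  taskCompletion: number;        // 30%
  stepEfficiency: number;        // 15%
  tokenEconomy: number;          // 15%
  errorResilience: number;       // 15%
  navigationClarity: number;     // 15%
  interactionDirectness: number; // 10%
};

const dimensionScore = (d: Dimensions): number =>
  0.30 * d.taskCompletion +
  0.15 * d.stepEfficiency +
  0.15 * d.tokenEconomy +
  0.15 * d.errorResilience +
  0.15 * d.navigationClarity +
  0.10 * d.interactionDirectness;

const compositeScore = (lab: number, field: number): number =>
  0.4 * lab + 0.6 * field;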


Commands

Command                   Purpose
aeokit audit <url>        Static audit only. No API key needed.
aeokit run <scenarios…>   Empirical agent runs against a task YAML.
aeokit score <url>        Audit + optional empirical run, with a composite grade.
aeokit inspect <url>      Probe the page for WebMCP tools (Phase 5 preview) + DOM stats.

Useful flags:

  • --provider claude|openai|gemini — pick the LLM (defaults: claude-sonnet-4-6, gpt-4o, gemini-2.5-flash)
  • -m, --model <id> — override the provider's model for a single run (e.g. --model claude-opus-4-7, --model gpt-4o-mini). Persistent defaults live in providers.<name>.model in aeokit.config.yaml.
  • --runs N — repeat the scenario N times, report pass-rate + aggregates
  • --record — save a .webm of the agent session
  • --headed — see the browser as the agent drives it
  • --format json,html,sarif,md — pick report formats
  • --min-score N — CI exit code if the composite score drops below N
  • --diff baseline.json --fail-on-regression — compare audits and fail on drops (CI)
  • --no-render — skip the stealth-Chromium fallback in the audit fetcher
  • --user-agent "…" — override the UA for audit fetches
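
Two common CI combinations of these flags (example.com, checkout.yaml, and baseline.json are placeholders):

# gate a PR: fail the job if the composite score drops below 70
aeokit score https://example.com --scenario checkout.yaml --min-score 70 --format json,html

# compare against a stored audit baseline and fail on regressions
aeokit audit https://example.com --format json,sarif --diff baseline.json --fail-on-regression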

Writing a scenario

# examples/scenarios/real-world/hackernews-browse.yaml
name: "Hacker News - Read top stories"
url: "https://news.ycombinator.com"
mode: general
intent: |
  You are on Hacker News. Read the homepage and report the titles of
  the top 3 stories along with their points and comment counts.
assertions:
  - type: url_matches
    pattern: "news.ycombinator.com"
  - type: tool_called
    tool: get_page_info
  - type: llm_judge
    question: "Did the agent report the titles of at least 3 actual HN stories?"
    expectedAnswer: "yes"
config:
  maxSteps: 10
  maxTokens: 40000
  observationMode: a11y
  handleConsent: true    # default — set false if you're testing consent UI

Assertion types:

  • element_visible — CSS selector is present and visible
  • text_contains — element text contains a string
  • tool_called — agent invoked a specific tool (optionally with args)
  • url_matches — final URL matches a pattern
  • llm_judge — semantic yes/no judged by the LLM from the trace
  • custom_eval — arbitrary JS returning a value compared to expected
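
A hedged sketch of the three types the example above doesn't use (the field names selector, text, script, and expected are guesses extrapolated from the pattern above; check the prebuilt scenarios for the exact schema):

assertions:
  - type: element_visible
    selector: "#search"              # guessed field name
  - type: text_contains
    selector: ".result"              # guessed field names
    text: "Hacker News"
  - type: custom_eval
    script: "document.title.length"  # guessed field name
    expected: 30                     # compared to the returned value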

Prebuilt scenarios live in examples/scenarios/ — Hacker News, Wikipedia, GitHub, Stripe docs, MCP docs, Claude docs, NYT, TodoMVC, Reddit.


Reports

Every run produces:

  • JSON — full trace, assertions, metrics, scored dimensions, insights. Schema-versioned so you can diff in CI.
  • HTML — self-contained, no network. Dimension strip, "What to fix this week" insights, pre-flight analysis, consent outcome, assertion rows, dimension deep-dives, collapsible step-by-step trace.
  • SARIF + Markdown — on aeokit audit when --format sarif,md is set. Drop-in for GitHub code-scanning and sticky PR comments (see examples/ci/github-action.yml).
  • WebM — if --record was set, a video of the agent session.
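
The JSON file is the one to parse in CI. A hedged one-liner with jq (both the .score field and the filename are guesses; the schema is versioned, so inspect a generated report for the real shape):

# filename and .score are assumptions; check your report's schema first
jq '.score' ./score-reports/report.json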

Programmatic API

import { runScenario, createProvider, loadScenario, loadConfig } from "aeokit";

const config   = await loadConfig();
const provider = createProvider("claude", config);
const scenario = await loadScenario("./task.yaml");

const result = await runScenario({
  scenario,
  provider,
  browserOptions: { headless: true },
});

console.log(result.totalSteps, result.assertions, result.consent);

Everything the CLI does is exported from the root. The types in dist/index.d.ts are stable within a minor version.


Known limits

In the spirit of not shipping bullshit:

  • Three providers wired, none calibrated. Claude / OpenAI / Gemini all run end-to-end, but scoring thresholds were calibrated on Claude traces — don't read a gpt-4o run's 82 as meaning the same thing as a claude-sonnet-4-6 run's 82 until we publish cross-model normalisation data.
  • No Playwright auto-install. First run needs npx playwright install chromium — we detect the missing binary and point you at the command, but we don't fetch it for you.
  • Sites with aggressive bot defences will fail. Reddit, LinkedIn, and CF-protected banking sites typically return JS-challenge pages. The Field browser runs a stealth preset (STEALTH_INIT_SCRIPT + realistic UA) but some sites still detect headless Chromium and the report scores them an honest F.
  • Error reporting is raw. A bot-challenge shows up as page.title: Execution context was destroyed instead of a clean BLOCKED_BY_BOT_CHALLENGE signal. Taxonomy is on the roadmap.
  • Scoring thresholds are principled, not calibrated. They come from first-principles reasoning about tokens/step/actions; they haven't been fitted against a labelled benchmark yet. The composite weights (40/60) are reasonable, not sacred.
  • WebMCP mode is a stub. mode: webmcp in a scenario throws. Discovery via navigator.modelContext is planned for Phase 5.

Roadmap

Done
  • Lab audit (discovery / access / tokens / parsability / capability)
  • Field agent loop + smart observation
  • 6-dimension scoring with honest zeros
  • Combined score command
  • Claude / OpenAI / Gemini providers
  • Consent banner auto-dismiss (200 CMPs)
  • CI primitives: GitHub Action, SARIF, PR comment, --diff
  • Published on npm (aeokit) via Changesets

Next
  • blockedBy error taxonomy
  • Multi-LLM comparison report
  • Landing page + docs site

Later
  • WebMCP mode (Phase 5)
  • Calibrated cross-model scoring
  • 100-site public benchmark

Contributing

Bug reports, PRs, and new scenario examples are welcome. Dev setup, the test/lint/build loop, release flow, and commit style all live in CONTRIBUTING.md. The internal architecture notes sit in CLAUDE.md.


License

MIT.