Package Exports
- aethis-mcp
- aethis-mcp/dist/index.js
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (aethis-mcp) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
aethis-mcp
LLMs interpret rules. This compiles them.
An MCP server that compiles legislation, policy, and regulation into deterministic logic — so your agent gets the same correct answer every time.
The problem | Proof | When to use this | Quick start | Author rules | Tools | Workflows | DSL capabilities | Setup | Troubleshooting
The problem
AI agents are making eligibility and compliance decisions using LLM reasoning. Most of the time it works. When it doesn't, nobody notices — the model returns a confident, well-structured wrong answer with no audit trail.
LLMs are good at interpreting rules. They are not reliable at executing them. The failure mode is silent: high confidence, wrong answer, no trace.
Aethis compiles rules into formal logic at authoring time. At decision time, no LLM is involved. Same inputs, same answer, every time — with a full audit trail back to the source clause.
Proof
Numbers below are from the paper (Simpson, Kozak, Doake, v3.8, 2026). Three independent evidence sources.
v3.8 adversarial extension (paper §6.4.1): 20 newly-authored construction-CAR scenarios stratified across 5 complexity dimensions, independent-prose-then-engine methodology. Engine 20/20 (100%) by construction; current frontier models still fail:
| Configuration | N=20 | Failures |
|---|---|---|
| Aethis Engine | 20/20 (100%) | — |
| GPT-5.4 (reasoning_effort=low) | 20/20 (100%) | — |
| GPT-5.4 (default) | 19/20 (95%) | 0 reasoning tokens on every scenario — short-circuits on E4 (DE3/LEG3 carveback gap) |
| Claude Sonnet 4.6 | 19/20 (95%) | E4 |
| Claude Opus 4.7 (current Anthropic strongest) | 18/20 (90%) | E4 + B3 (£499M boundary) |
Three of four frontier configurations fail the same scenario across both Anthropic and OpenAI families.
External validation on LegalBench (paper §6.10): across 9 LegalBench tasks (949 held-out cases authored by Stanford researchers) the engine is significantly more accurate than each of three frontier LLMs by combined paired-binomial McNemar's test: p < 0.001 vs Claude Sonnet 4.6, p = 0.003 vs Claude Opus 4.7, p < 0.001 vs GPT-5.4. The structural advantage is largest on multi-prong rule-application tasks (Δ up to +41 pp) and persists at a smaller cross-task-significant margin on randomly-sampled tasks chosen without fit inspection.
The shifting-ground problem (paper §6.5 Finding 6): between March and April 2026 several v3.7 paper cells closed silently under the same model alias — GPT-5.4 on construction-CAR moved from 96.6% to 100%; Opus 4.6 on spacecraft from 89.7% to 98.5%; the GPT-5.3 alias was deprecated by OpenAI mid-cycle. Frontier-LLM accuracy on a fixed benchmark is a moving target. The Aethis Engine is invariant by construction — same ruleset, same answer, any month, any prompt.
See confidently-wrong-benchmark/legalbench/ for the full harness and per-call replication artefacts.
In regulated workflows (financial services, insurance, immigration, healthcare), decisions must be deterministic (same answer every time), explainable (audit trail to source clause), and reproducible. LLMs fail all three regardless of peak accuracy.
Where LLMs fail
The characteristic failure is a nested exception chain, here from a London market insurance endorsement:
Access damage is excluded (Clause 8). Unless the project is worth >= £100M — enhanced cover reinstates it (Clause 9(1)). Unless the defect is a design defect — enhanced cover doesn't apply (Clause 9(2)). Unless the project is worth >= £500M — the pioneer override reinstates it (Clause 9(3)). Unless the defect was known prior — the pioneer override is blocked (Clause 9A(1)). Unless there's an engineer assessment — the block is lifted (Clause 9A(2)).
GPT-5.4 fails on the pioneer override boundary at £500M (paper §6.4). GPT-4.1-mini fails systematically across the enhanced cover chain, treating the access damage exclusion as absolute.
Full benchmarks, reproducible test runner, and per-scenario breakdown: Aethis-ai/confidently-wrong-benchmark · aethis-examples
The scenario GPT gets wrong
A £600M pioneer infrastructure project. Design defect. Access damage claim.
aethis_decide({
ruleset_id: "aethis/construction-all-risks",
field_values: {
"car.policy.period_valid": true,
"car.property.category": "permanent_works",
"car.loss.is_physical": true,
"car.component.is_defective": true,
"car.defect.origin": "design",
"car.claim.is_rectification": false,
"car.claim.is_access_damage": true,
"car.damage.consequence_of_failure": false,
"car.project.value_millions_gbp": 600,
"car.notification.within_period": true,
"car.contract.jct_compliant": true
},
include_trace: true
})
{
"decision": "eligible",
"ruleset_version": "v3",
"fields_provided": 11,
"fields_evaluated": 11,
"trace": {
"not_rectification": "PASS — claim is not for rectification",
"carveback_qualification": "PASS — Route B: not solely access damage for removal",
"access_exclusion": "TRIGGERED — access damage claimed",
"enhanced_cover": "PASS — project value 600M >= 100M threshold",
"design_defect_check": "TRIGGERED — defect origin is design",
"pioneer_override": "PASS — project value 600M >= 500M, pioneer override applies"
}
}
GPT says: not covered. Aethis says: covered — pioneer override (Clause 9(3)) reinstates coverage even for design defects on projects >= £500M.
Sub-5ms, no LLM at inference, same trace every time. The example trace above is representative — run the full reproducer (all 11 scenarios, every frontier model) yourself: aethis-examples/construction-all-risks.
When to use this
Use Aethis when:
- The decision has regulatory, legal, or financial consequences
- You need an audit trail that traces back to source text
- Rules involve nested exceptions, conditional thresholds, or override chains
- "95% accurate" is not good enough
- You need the same answer every time, not just most of the time
- You're making decisions at scale — the engine evaluates in under 5ms per decision (1000x faster than an LLM call). A batch of 10,000 evaluations completes in seconds, not hours
- Your agent needs to ask the right questions — the engine computes the optimal next question to ask given what it already knows, finding the shortest path to a decision. Two applicants with different facts get different question sequences — the engine adapts in real time
Domains: Loan eligibility, insurance underwriting, immigration compliance, HR policy, benefits qualification, medical device clearance, trade compliance — any domain where rules are written in legislation or policy documents.
You probably don't need this for:
- Content recommendations, search ranking, sentiment analysis
- Decisions where "close enough" is fine
- One-off questions that don't repeat
How it works
The LLM is used once, at authoring time, to compile source text into formal logic. After that, every decision is pure constraint evaluation.
Source text ──→ LLM compiles to rules ──→ Test suite validates ──→ Published ruleset
                 (authoring time only)                                    │
                                                                          ▼
                                                           Eligibility engine evaluates
                                                             (deterministic, no LLM)
                                                                          │
                                                                          ▼
                                                     eligible / not_eligible / undetermined
                                                        + trace back to source clause

Quick start
Two use cases — decide which is yours:
- Evaluate existing rules — a ruleset already exists and you want to evaluate eligibility against it. No API key needed. Start with aethis_decide or aethis_next_question.
- Author new rules — you have a policy document and want to compile it into logic. Requires an Aethis API key and an Anthropic key. Start with aethis_create_ruleset and follow the TDD workflow.
No sign-up needed to evaluate. Decision tools work immediately.
Recommended — one command via aethis-cli:
# Install once:
uv tool install aethis-cli
# Wire up your MCP client (claude-code, cursor, claude-desktop, windsurf — or all):
aethis mcp install --target all
Idempotent, preserves any other MCP servers you have configured, and re-runs cleanly after aethis account generate rotates your key. Restart your editor to pick up the change. Full options: aethis mcp install --help.
Manual install (if you don't use aethis-cli):
claude mcp add aethis -- npx -y aethis-mcp
For Cursor / Claude Desktop / Windsurf manual config, see Setup below.
Try it immediately with the public demo ruleset (Spacecraft Crew Certification Act 2049):
Is a Vogon eligible for crew certification?
{
"decision": "not_eligible",
"fields_provided": 1,
"fields_evaluated": 11,
"trace": {
"species_check": "FAIL — species is 'Vogon' (disqualifying, Section 3)"
}
}
One field provided. Decision reached instantly — the engine knew a Vogon is disqualified regardless of flight hours, medical certs, or anything else. No further questions asked.
Is a 35-year-old human with 600 flight hours, a pilot licence, GAA exam, valid medical cert, on a suborbital mission with conventional propulsion and a towel — eligible?
{
"decision": "eligible",
"fields_provided": 11,
"fields_evaluated": 11
}
Every decision traces back to the exact section and clause in the source legislation. Pass include_trace: true for the full evaluation trail.
[!TIP] Want to create your own rules from a policy document? See Author your own rules below. Rule authoring is invite-only private beta. Decision tools (above) stay public and free. Request access →
Author your own rules
[!NOTE] Rule authoring is invite-only private beta — approval required. Decision tools (above) work publicly with no keys.
What you'll need once approved: an Aethis API key (we provision one for approved tenants) and your own Anthropic key for generation (passed per-request, never stored). Calling authoring tools without approval returns 403 Forbidden. Request access →
Aethis is not just a decision engine — it lets your agent compile legislation into executable logic. Paste a policy document, write test cases, and iterate until the rules pass.
Three-phase authoring workflow
Complex legislation that spans multiple sections needs a structured approach before you write rules. The three phases build on each other: discover the structure, nail the field vocabulary, then generate and test rules.
[!TIP] Simple single-section rules? Skip Phases 1–2. Go straight to aethis_create_ruleset → aethis_discover_fields → write tests → aethis_generate_and_test. The phase structure is for multi-section domains where getting the decomposition right matters.
Phase 1 — Section discovery
Use when you have complex legislation that needs to be split into separate, independently-evaluable sections.
aethis_discover_sections({
domain: "uk_fsm",
sources: [{ name: "fsm_legislation.md", content: "..." }]
})
→ Suggests: child_eligibility, household_qualifying_criteria, universal_infant_fsm
aethis_validate_sections({
domain: "uk_fsm",
expected_sections: ["child_eligibility", "household_qualifying_criteria", "universal_infant_fsm"],
discovered_sections: [... result from above ...]
})
→ all_match: true
If sections don't match your expectation, refine and re-discover:
aethis_refine_sections({
domain: "uk_fsm",
feedback: "Universal Infant Free School Meals must be a separate section with no income test.",
  sources: [...]
})
Phase 2 — Field vocabulary
Use before writing test cases to ensure the field names the engine produces match what you expect. If you skip this, you may write tests with invented field names that silently mismatch.
// Tell the engine what fields you expect (SME-defined spec):
aethis_set_field_spec({
project_id: "proj_abc123",
expected_fields: [
{ key: "child.age", sort: "Int" },
{ key: "child.school_type", sort: "Enum", enum_values: ["state_funded", "independent"] }
]
})
// Discover fields from source text — auto-validates against the spec:
aethis_discover_fields({ project_id: "proj_abc123" })
→ field list + validation_result if spec was set (shows missing/mismatched fields)
// Refine if fields are wrong:
aethis_refine_fields({
project_id: "proj_abc123",
feedback: "child.school_type should include 'home_educated' as a value"
})
// Explicit validation against spec:
aethis_validate_fields({
project_id: "proj_abc123",
expected_fields: [...]
})
Phase 3 — Generate and test
What's documented below as Steps 1–4. Once sections are agreed and fields are validated, create rulesets and run the TDD loop.
Step 1: Create
aethis_create_ruleset({
name: "Consumer Credit Pre-Qualification",
section_id: "consumer-credit",
domain: "consumer_credit", // optional — groups related sections
source_text: "Section 3: Adverse credit history\n(1) An applicant with adverse credit history...",
test_cases: [
{ name: "Adverse credit — decline", field_values: { "credit.has_adverse_history": true }, expected_outcome: "not_eligible" },
{ name: "Good applicant — approve", field_values: { "credit.has_adverse_history": false, "credit.employment_status": "employed", ... }, expected_outcome: "eligible" },
{ name: "High DTI, existing customer — approve", field_values: { "credit.dti_percent": 55, "credit.is_existing_customer": true, ... }, expected_outcome: "eligible" }
]
})
Returns a project_id.
[!TIP] Discover field names before writing tests. Call aethis_discover_fields({ project_id }) after creating a ruleset to get the exact field names the engine will use. Writing tests with invented field names causes silent mismatches. Run discover → write tests → generate.
[!TIP] Use domain to share guidance across sections. If you have multiple related rulesets (e.g. residence, english_language, good_character under uk_citizenship), set the same domain on each. Guidance added with aethis_add_domain_guidance for that domain applies automatically to all projects in it — no need to repeat cross-section principles on every ruleset.
Step 2: Generate and test
aethis_generate_and_test({ project_id: "proj_abc123" })
Generation complete. Test results: 2/3 passing.
PASS Adverse credit — decline
PASS Good applicant — approve
FAIL High DTI, existing customer — approve
Expected: eligible Got: not_eligible
The existing customer exemption (Section 10) is not yet captured.
Step 3: Refine
aethis_refine({
project_id: "proj_abc123",
feedback: "Section 10 says existing customers (24+ months good standing) are exempt from the DTI threshold in Section 6."
})
Generation complete. Test results: 3/3 passing.
PASS Adverse credit — decline
PASS Good applicant — approve
PASS High DTI, existing customer — approve (was: FAIL → now: PASS)
You can also add guidance directly without regenerating, and inspect what's accumulated:
// Add targeted guidance for a specific failing test
aethis_add_guidance({
project_id: "proj_abc123",
guidance_text: "When DTI > 45%, existing customers with 24+ months good standing are exempt (Section 10).",
process_type: "rule_generation" // default; use "field_extraction" for field design principles
})
// Check what guidance is in place before adding more
aethis_list_guidance({ project_id: "proj_abc123" })
For cross-section principles that apply to multiple rulesets in the same domain:
// Add once — applies to all projects in the domain automatically
aethis_add_domain_guidance({
domain: "consumer_credit",
guidance_text: "The system flags, never decides. Discretionary clauses ('we will consider', 'may be waived') must produce 'undetermined', not 'not_eligible'.",
process_type: "rule_generation",
notes: "Core discretion principle — do not remove." // stored for SME context, never sent to LLM
})
aethis_list_domain_guidance({ domain: "consumer_credit" })
Diagnosing a specific failure:
aethis_explain_failure({
// explain-failure currently requires the concrete ruleset_id from the
// /decide envelope (slugs not yet supported on this endpoint)
ruleset_id: "<ruleset_id from your /decide response>",
field_values: { "credit.dti_percent": 55, "credit.is_existing_customer": true },
expected_outcome: "eligible",
test_name: "High DTI, existing customer — approve"
})
// Returns: criterion statuses, which rule failed, and a targeted fix hint
Step 4: Publish
aethis_publish({ project_id: "proj_abc123" })
Returns a ruleset_id — ready to use with aethis_decide.
[!NOTE] Test-driven iteration: Aethis generates rules from your source text and guidance — not from your tests. Tests validate the output and show you what guidance to add next. Better tests = faster convergence on correct rules.
[!IMPORTANT] Anthropic key required for authoring. Rule generation uses Anthropic LLM calls. Pass your key as anthropic_key on aethis_generate_and_test or aethis_refine. The key is used for the request only and never stored. Decision tools do not use Anthropic.
[!IMPORTANT] DATE fields use integer ordinals, not ISO strings. Pass dates as Python date.toordinal() values (days since year 1). Example: 2025-04-13 = 739354, 2026-04-13 = 739719. Passing "2025-04-13" will fail with a type error. Quick conversion: python3 -c "from datetime import date; print(date(2025, 4, 13).toordinal())".
Tools
24 tools in four groups. Most agents use Decision (2 calls). Authors use the full Authoring workflow.
| Group | Tools | What they do |
|---|---|---|
| Decision | aethis_decide, aethis_schema, aethis_next_question, aethis_explain, aethis_explain_failure | Evaluate eligibility, inspect fields, conversational checks, rule explanations, diagnose failures |
| Authoring — section & field phases | aethis_discover_sections, aethis_refine_sections, aethis_validate_sections, aethis_set_field_spec, aethis_discover_fields, aethis_refine_fields, aethis_validate_fields | Decompose legislation into sections (Phase 1); establish and validate field vocabulary (Phase 2) |
| Authoring — rule generation | aethis_create_ruleset, aethis_add_guidance, aethis_list_guidance, aethis_generate_and_test, aethis_refine, aethis_publish, aethis_add_domain_guidance, aethis_list_domain_guidance | Create, iterate, and publish rulesets (TDD workflow); manage project and domain guidance |
| Discovery | aethis_list_projects, aethis_list_rulesets | Find projects, browse ruleset versions |
| Management | aethis_archive_project, aethis_archive_ruleset | Archive projects and rulesets (permanent) |
Prompts
MCP prompts are pre-built workflow guides that compatible clients (Claude Desktop, Cursor, VS Code Copilot) can surface as selectable templates.
| Prompt | Description |
|---|---|
| aethis-author | Step-by-step TDD workflow: gather requirements → create ruleset → generate → refine → publish |
| aethis-decide | Decision workflow: find ruleset → get schema → evaluate (quick or conversational). Accepts optional ruleset_id argument |
Workflows
Evaluate eligibility (2 calls)
aethis_schema(ruleset_id) → learn what fields are needed
aethis_decide(ruleset_id, fields) → eligible / not_eligible / undetermined
- include_trace: true — full evaluation trace with source citations for each criterion
- include_explanation: true — human-readable rule descriptions (useful for surfacing to end users)
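Wired into an agent, the two calls might look like this sketch. `call_tool` is a placeholder for however your MCP client invokes a server tool, and the `schema["fields"]` response shape is an assumption — inspect a real aethis_schema response before relying on it:

```python
def evaluate(call_tool, ruleset_id, field_values):
    """Two-call eligibility check: learn the fields, then decide.

    `call_tool(name, args)` stands in for your MCP client's tool invocation;
    the schema response shape used here is illustrative, not documented.
    """
    schema = call_tool("aethis_schema", {"ruleset_id": ruleset_id})
    missing = [f["key"] for f in schema["fields"] if f["key"] not in field_values]
    if missing:
        raise ValueError(f"fields still needed: {missing}")
    return call_tool("aethis_decide", {
        "ruleset_id": ruleset_id,
        "field_values": field_values,
        "include_trace": True,   # trace with source citations per criterion
    })
```

Failing fast on missing fields keeps you from getting an avoidable "undetermined" back.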
Conversational eligibility — optimal question routing
The engine doesn't just evaluate — it tells your agent what to ask next. Given the facts collected so far, it computes the single most informative question and returns the shortest remaining path to a decision.
aethis_next_question(ruleset_id, {})
→ "What is the applicant's species?" (10 questions remaining)
aethis_next_question(ruleset_id, {species: "Vogon"})
→ Decision: not eligible. No more questions needed.
One fact was enough. A Vogon is disqualified immediately — the engine doesn't ask about flight hours, medical certs, or towel compliance. A different applicant might need 5 questions. Another might need 8. The engine adapts the path based on the answers it receives, always choosing the question that resolves the most uncertainty.
This means your agent can run a guided assessment — asking only the questions that matter, in the order that matters — and reach a provable decision in the fewest possible steps.
The response includes optimal_path — the full ranked list of remaining questions. You don't need to ask all of them: call aethis_next_question again after each answer and the engine recomputes the shortest path from the updated state. Once a decision is reachable, is_eligible is returned and no further questions are needed.
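That recompute-after-each-answer loop is small to write. A sketch in Python — `call_tool` and `ask_user` are placeholders for your MCP client and your agent's question-asking step, and treating `optimal_path` entries as field keys is an assumption about the response shape, so check a real response first:

```python
def guided_assessment(call_tool, ask_user, ruleset_id):
    """Ask only the questions the engine says matter, until a decision is reachable."""
    facts = {}
    while True:
        resp = call_tool("aethis_next_question",
                         {"ruleset_id": ruleset_id, "field_values": facts})
        if "is_eligible" in resp:            # decision reachable — stop asking
            return resp
        # Assumed shape: optimal_path is a ranked list, first entry = next field to ask.
        next_field = resp["optimal_path"][0]
        facts[next_field] = ask_user(next_field)
```

Because the engine recomputes the path each turn, the loop naturally terminates early when one answer (like species = Vogon) settles the outcome.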
Author rules
See Author your own rules for the full TDD workflow.
Setup
Decision tools work with no API key. Add AETHIS_API_KEY when you have authoring access. For most users the aethis-cli one-liner in Quick start is the fastest path; the manual options below are for environments where you don't want to install the Python CLI.
Claude Code
# Decision tools only (no key needed)
claude mcp add aethis -- npx -y aethis-mcp
# With authoring access
claude mcp add aethis -e AETHIS_API_KEY=<your-key> -- npx -y aethis-mcp
Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"aethis": {
"command": "npx",
"args": ["-y", "aethis-mcp"]
}
}
}
To enable authoring, add "env": { "AETHIS_API_KEY": "<your-key>" } to the config above.
Cursor / Windsurf
Add to .cursor/mcp.json or .windsurf/mcp.json (same JSON as above).
Keys & security
- AETHIS_API_KEY (ak_live_...) is your platform key. Mint with aethis login (CLI) or via the dashboard. Set it in the MCP client's config, not your shell profile — the MCP server process doesn't inherit your shell environment.
- ANTHROPIC_API_KEY is forwarded per-request to aethis_generate_and_test. The MCP server never stores it; it accompanies the one request and is discarded server-side.
- Rotate by minting a new key (aethis account generate in the CLI) and revoking the old one (aethis account revoke <key_id>). For multi-machine setups, mint one key per machine so revocation is surgical.
- Both keys live next to each other in your MCP client config; treat that file like any other secrets store (don't commit it to a public repo; sync it via your normal credential pathway).
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| "API key is required" | AETHIS_API_KEY not set (authoring tools only) | Configure in MCP client settings (not shell profile). Decision tools don't need a key |
| "X-Anthropic-Key header is required" | Missing Anthropic key on generation | Pass the anthropic_key parameter on authoring tools |
| "Ruleset not found" (404) | Wrong ID or archived | Use aethis_list_projects → aethis_list_rulesets |
| "Rate limit exceeded" (429) | Daily limit hit | Client retries automatically. Contact eng@aethis.ai for a higher tier |
| "Cannot publish: tests failing" | Tests don't pass | Fix with aethis_refine, or force=true to override |
| Generation timeout (504) | Client timed out waiting (normal for complex rules — generation can take 5–15 min server-side) | The server continues generating after the timeout. Wait 10–15 min, then call aethis_list_rulesets({ project_id }) to check whether a new ruleset appeared. If yes, call aethis_publish. If not, the server may still be running — wait and check again rather than re-triggering generation |
| "Expected an integer for <field>, got str" | DATE field passed as ISO string | Pass as a date.toordinal() integer — e.g. 739354 for 2025-04-13. Quick: python3 -c "from datetime import date; print(date(2025,4,13).toordinal())" |
DSL capabilities
Supported field types
| Type | Description |
|---|---|
| Bool | True / false |
| Int | Integer (includes counts, money as pence, percentages as integers) |
| Enum | Closed set of named values |
| Date | Stored as integer ordinal (days since year 1). Pass via date.toordinal() |
| Duration | Integer number of days |
| String | Free text (use sparingly — prefer Enum for known value sets) |
Supported operators
| Category | Operators |
|---|---|
| Logic | AND, OR, NOT, IMPLIES |
| Comparison | =, ≠, <, ≤, >, ≥ |
| Membership | IN — field IN [v1, v2, ...] |
| Arithmetic | + and − for Int/Date fields; * (multiply) for Int fields |
| Aggregation | min(a, b, ...) and max(a, b, ...) — return the smallest/largest Int |
Helpers
- days_between(date_a, date_b) — returns Int (number of days, date_b − date_a)
- min(a, b, ...) — minimum of 2+ Int values
- max(a, b, ...) — maximum of 2+ Int values
- Constant arithmetic is folded at authoring time: 5 * 365 becomes 1825 in the compiled rule
Not supported
- Division between runtime field values
- Weighted scoring or probabilistic outcomes
- Lists as field values (model as pre-aggregated Int or Bool fields instead)
- More than 3 outcome tiers (eligible / not_eligible / undetermined)
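Pre-aggregating a list into scalar fields is usually a one-liner upstream of the decide call. A sketch — the `claims.*` field names are invented for illustration; use your ruleset's real schema:

```python
# Raw data your system holds — a list, which the engine can't take directly.
prior_claims_gbp = [12_000, 4_500, 80_000]

# Collapse the list into scalar Int/Bool fields before calling aethis_decide.
# These claims.* field names are hypothetical, not from any published ruleset.
field_values = {
    "claims.count": len(prior_claims_gbp),
    "claims.largest_gbp": max(prior_claims_gbp),
    "claims.any_over_50k": any(c > 50_000 for c in prior_claims_gbp),
}
```

The same pattern covers most "list-shaped" inputs: pick the aggregates the rules actually test (count, max, any/all) and pass those.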
Related
aethis-cli — Python CLI for file-based rule authoring with YAML test cases and Rich terminal output.
aethis-examples — Benchmark data, test scenarios, and LLM comparison results for construction insurance, consumer credit, and spacecraft certification.
Development
git clone https://github.com/aethis-ai/aethis-mcp.git
cd aethis-mcp
npm install
npm test # 107 tests
npm run build
License
MIT