Package Exports
- aethis-mcp
- aethis-mcp/dist/index.js
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (aethis-mcp) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
aethis-mcp
LLMs interpret rules. This compiles them.
An MCP server that compiles legislation, policy, and regulation into deterministic logic — so your agent gets the same correct answer every time.
The problem | Proof | When to use this | Quick start | Author rules | Tools | Workflows | DSL capabilities | Setup | Troubleshooting
The problem
AI agents are making eligibility and compliance decisions using LLM reasoning. Most of the time it works. When it doesn't, nobody notices — the model returns a confident, well-structured wrong answer with no audit trail.
LLMs are good at interpreting rules. They are not reliable at executing them. The failure mode is silent: high confidence, wrong answer, no trace.
Aethis compiles rules into formal logic at authoring time. At decision time, no LLM is involved. Same inputs, same answer, every time — with a full audit trail back to the source clause.
Proof
Numbers below are from the paper (Simpson, Kozak, Doake, v3.8, 2026). Three independent evidence sources.
v3.8 adversarial extension (paper §6.4.1): 20 newly-authored construction-CAR scenarios stratified across 5 complexity dimensions, independent-prose-then-engine methodology. Engine 20/20 (100%) by construction; current frontier models still fail:
| Configuration | N=20 | Failures |
|---|---|---|
| Aethis Engine | 20/20 (100%) | — |
| GPT-5.4 (reasoning_effort=low) | 20/20 (100%) | — |
| GPT-5.4 (default) | 19/20 (95%) | 0 reasoning tokens on every scenario — short-circuits on E4 (DE3/LEG3 carveback gap) |
| Claude Sonnet 4.6 | 19/20 (95%) | E4 |
| Claude Opus 4.7 (current Anthropic strongest) | 18/20 (90%) | E4 + B3 (£499M boundary) |
Three of four frontier configurations fail the same scenario across both Anthropic and OpenAI families.
External validation on LegalBench (paper §6.10): across 9 LegalBench tasks (949 held-out cases authored by Stanford researchers) the engine is significantly more accurate than each of three frontier LLMs by combined paired-binomial McNemar's test: p < 0.001 vs Claude Sonnet 4.6, p = 0.003 vs Claude Opus 4.7, p < 0.001 vs GPT-5.4. The structural advantage is largest on multi-prong rule-application tasks (Δ up to +41 pp) and persists at a smaller cross-task-significant margin on randomly-sampled tasks chosen without fit inspection.
The shifting-ground problem (paper §6.5 Finding 6): between March and April 2026 several v3.7 paper cells closed silently under the same model alias — GPT-5.4 on construction-CAR moved from 96.6% to 100%; Opus 4.6 on spacecraft from 89.7% to 98.5%; the GPT-5.3 alias was deprecated by OpenAI mid-cycle. Frontier-LLM accuracy on a fixed benchmark is a moving target. The Aethis Engine is invariant by construction — same ruleset, same answer, any month, any prompt.
See confidently-wrong-benchmark/legalbench/ for the full harness and per-call replication artefacts.
In regulated workflows (financial services, insurance, immigration, healthcare), decisions must be deterministic (same answer every time), explainable (audit trail to source clause), and reproducible. LLMs fail all three regardless of peak accuracy.
Where LLMs fail
The characteristic failure is a nested exception chain, here from a London market insurance endorsement:
Access damage is excluded (Clause 8). Unless the project is worth >= £100M — enhanced cover reinstates it (Clause 9(1)). Unless the defect is a design defect — enhanced cover doesn't apply (Clause 9(2)). Unless the project is worth >= £500M — the pioneer override reinstates it (Clause 9(3)). Unless the defect was known prior — the pioneer override is blocked (Clause 9A(1)). Unless there's an engineer assessment — the block is lifted (Clause 9A(2)).
GPT-5.4 fails on the pioneer override boundary at £500M (paper §6.4). GPT-4.1-mini fails systematically across the enhanced cover chain, treating the access damage exclusion as absolute.
Full benchmarks, reproducible test runner, and per-scenario breakdown: Aethis-ai/confidently-wrong-benchmark · aethis-examples
The scenario GPT gets wrong
A £600M pioneer infrastructure project. Design defect. Access damage claim.
aethis_decide({
ruleset_id: "aethis/construction-all-risks",
field_values: {
"car.policy.period_valid": true,
"car.property.category": "permanent_works",
"car.loss.is_physical": true,
"car.component.is_defective": true,
"car.defect.origin": "design",
"car.claim.is_rectification": false,
"car.claim.is_access_damage": true,
"car.damage.consequence_of_failure": false,
"car.project.value_millions_gbp": 600,
"car.notification.within_period": true,
"car.contract.jct_compliant": true
},
include_trace: true
})
{
"decision": "eligible",
"ruleset_version": "v3",
"fields_provided": 11,
"fields_evaluated": 11,
"trace": {
"not_rectification": "PASS — claim is not for rectification",
"carveback_qualification": "PASS — Route B: not solely access damage for removal",
"access_exclusion": "TRIGGERED — access damage claimed",
"enhanced_cover": "PASS — project value 600M >= 100M threshold",
"design_defect_check": "TRIGGERED — defect origin is design",
"pioneer_override": "PASS — project value 600M >= 500M, pioneer override applies"
}
}
GPT says: not covered. Aethis says: covered — pioneer override (Clause 9(3)) reinstates coverage even for design defects on projects >= £500M.
Sub-5ms, no LLM at inference, same trace every time. The example trace above is representative — run the full reproducer (all 11 scenarios, every frontier model) yourself: aethis-examples/construction-all-risks.
When to use this
Use Aethis when:
- The decision has regulatory, legal, or financial consequences
- You need an audit trail that traces back to source text
- Rules involve nested exceptions, conditional thresholds, or override chains
- "95% accurate" is not good enough
- You need the same answer every time, not just most of the time
- You're making decisions at scale — the engine evaluates in under 5ms per decision (1000x faster than an LLM call). A batch of 10,000 evaluations completes in seconds, not hours
- Your agent needs to ask the right questions — the engine computes the optimal next question to ask given what it already knows, finding the shortest path to a decision. Two applicants with different facts get different question sequences — the engine adapts in real time
Domains: Loan eligibility, insurance underwriting, immigration compliance, HR policy, benefits qualification, medical device clearance, trade compliance — any domain where rules are written in legislation or policy documents.
You probably don't need this for:
- Content recommendations, search ranking, sentiment analysis
- Decisions where "close enough" is fine
- One-off questions that don't repeat
How it works
The LLM is used once, at authoring time, to compile source text into formal logic. After that, every decision is pure constraint evaluation.
Source text ──→ LLM compiles to rules ──→ Test suite validates ──→ Published ruleset
                 (authoring time only)                                    │
                                                                          ▼
                                                           Eligibility engine evaluates
                                                             (deterministic, no LLM)
                                                                          │
                                                                          ▼
                                                     eligible / not_eligible / undetermined
                                                        + trace back to source clause

Quick start
Two use cases — decide which is yours:
- Evaluate existing rules — a ruleset already exists and you want to evaluate eligibility against it. No API key needed. Start with aethis_decide or aethis_next_question.
- Author new rules — you have a policy document and want to compile it into logic. Requires an Aethis API key and an Anthropic key. Start with aethis_create_ruleset and follow the TDD workflow.
No sign-up needed to evaluate. Decision tools work immediately.
Recommended — one command via aethis-cli:
# Install once:
uv tool install aethis-cli
# Wire up your MCP client (claude-code, cursor, claude-desktop, windsurf — or all):
aethis mcp install --target all
Idempotent, preserves any other MCP servers you have configured, and re-runs cleanly after aethis account generate rotates your key. Restart your editor to pick up the change. Full options: aethis mcp install --help.
Manual install (if you don't use aethis-cli):
claude mcp add aethis -- npx -y aethis-mcp
For Cursor / Claude Desktop / Windsurf manual config, see Setup below.
Try it immediately with the public demo ruleset (Spacecraft Crew Certification Act 2049):
Is a Vogon eligible for crew certification?
{
"decision": "not_eligible",
"fields_provided": 1,
"fields_evaluated": 11,
"trace": {
"species_check": "FAIL — species is 'Vogon' (disqualifying, Section 3)"
}
}
One field provided. Decision reached instantly — the engine knew a Vogon is disqualified regardless of flight hours, medical certs, or anything else. No further questions asked.
Is a 35-year-old human with 600 flight hours, a pilot licence, GAA exam, valid medical cert, on a suborbital mission with conventional propulsion and a towel — eligible?
{
"decision": "eligible",
"fields_provided": 11,
"fields_evaluated": 11
}
Every decision traces back to the exact section and clause in the source legislation. Pass include_trace: true for the full evaluation trail.
[!TIP] Want to create your own rules from a policy document? See Author your own rules below. Rule authoring is invite-only private beta. Decision tools (above) stay public and free. Request access →
Author your own rules
[!NOTE] Rule authoring is invite-only private beta — approval required. Decision tools (above) work publicly with no keys.
What you'll need once approved: an Aethis API key (we provision one for approved tenants) and your own Anthropic key for generation (passed per-request, never stored). Calling authoring tools without approval returns 403 Forbidden. Request access →
Aethis is not just a decision engine — it lets your agent compile legislation into executable logic. Paste a policy document, write test cases, and iterate until the rules pass.
Three-phase authoring workflow
Complex legislation that spans multiple sections needs a structured approach before you write rules. The three phases build on each other: discover the structure, nail the field vocabulary, then generate and test rules.
[!TIP] Simple single-section rules? Skip Phases 1–2. Go straight to aethis_create_ruleset → aethis_discover_fields → write tests → aethis_generate_and_test. The phase structure is for multi-section domains where getting the decomposition right matters.
Phase 1 — Section discovery
Use when you have complex legislation that needs to be split into separate, independently-evaluable sections.
aethis_discover_sections({
domain: "uk_fsm",
sources: [{ name: "fsm_legislation.md", content: "..." }]
})
→ Suggests: child_eligibility, household_qualifying_criteria, universal_infant_fsm
aethis_validate_sections({
domain: "uk_fsm",
expected_sections: ["child_eligibility", "household_qualifying_criteria", "universal_infant_fsm"],
discovered_sections: [... result from above ...]
})
→ all_match: true
If sections don't match your expectation, refine and re-discover:
aethis_refine_sections({
domain: "uk_fsm",
feedback: "Universal Infant Free School Meals must be a separate section with no income test.",
  sources: [...]
})
Phase 2 — Field vocabulary
Use before writing test cases to ensure the field names the engine produces match what you expect. If you skip this, you may write tests with invented field names that silently mismatch.
// Tell the engine what fields you expect (SME-defined spec):
aethis_set_field_spec({
project_id: "proj_abc123",
expected_fields: [
{ key: "child.age", sort: "Int" },
{ key: "child.school_type", sort: "Enum", enum_values: ["state_funded", "independent"] }
]
})
// Discover fields from source text — auto-validates against the spec:
aethis_discover_fields({ project_id: "proj_abc123" })
→ field list + validation_result if spec was set (shows missing/mismatched fields)
// Refine if fields are wrong:
aethis_refine_fields({
project_id: "proj_abc123",
feedback: "child.school_type should include 'home_educated' as a value"
})
// Explicit validation against spec:
aethis_validate_fields({
project_id: "proj_abc123",
expected_fields: [...]
})
Phase 3 — Generate and test
What's documented below as Steps 1–4. Once sections are agreed and fields are validated, create rulesets and run the TDD loop.
Step 1: Create
aethis_create_ruleset({
name: "Consumer Credit Pre-Qualification",
section_id: "consumer-credit",
domain: "consumer_credit", // optional — groups related sections
source_text: "Section 3: Adverse credit history\n(1) An applicant with adverse credit history...",
test_cases: [
{ name: "Adverse credit — decline", field_values: { "credit.has_adverse_history": true }, expected_outcome: "not_eligible" },
{ name: "Good applicant — approve", field_values: { "credit.has_adverse_history": false, "credit.employment_status": "employed", ... }, expected_outcome: "eligible" },
{ name: "High DTI, existing customer — approve", field_values: { "credit.dti_percent": 55, "credit.is_existing_customer": true, ... }, expected_outcome: "eligible" }
]
})
Returns a project_id.
[!TIP] Discover field names before writing tests. Call aethis_discover_fields({ project_id }) after creating a ruleset to get the exact field names the engine will use. Writing tests with invented field names causes silent mismatches. Run discover → write tests → generate.
[!TIP] Use domain to share guidance across sections. If you have multiple related rulesets (e.g. residence, english_language, good_character under uk_citizenship), set the same domain on each. Guidance added with aethis_add_domain_guidance for that domain applies automatically to all projects in it — no need to repeat cross-section principles on every ruleset.
Step 2: Generate and test
aethis_generate_and_test({ project_id: "proj_abc123" })
Generation complete. Test results: 2/3 passing.
PASS Adverse credit — decline
PASS Good applicant — approve
FAIL High DTI, existing customer — approve
Expected: eligible Got: not_eligible
The existing customer exemption (Section 10) is not yet captured.
Step 3: Refine
aethis_refine({
project_id: "proj_abc123",
feedback: "Section 10 says existing customers (24+ months good standing) are exempt from the DTI threshold in Section 6."
})
Generation complete. Test results: 3/3 passing.
PASS Adverse credit — decline
PASS Good applicant — approve
PASS High DTI, existing customer — approve (was: FAIL → now: PASS)
You can also add guidance directly without regenerating, and inspect what's accumulated:
// Add targeted guidance for a specific failing test
aethis_add_guidance({
project_id: "proj_abc123",
guidance_text: "When DTI > 45%, existing customers with 24+ months good standing are exempt (Section 10).",
process_type: "rule_generation" // default; use "field_extraction" for field design principles
})
// Check what guidance is in place before adding more
aethis_list_guidance({ project_id: "proj_abc123" })
For cross-section principles that apply to multiple rulesets in the same domain:
// Add once — applies to all projects in the domain automatically
aethis_add_domain_guidance({
domain: "consumer_credit",
guidance_text: "The system flags, never decides. Discretionary clauses ('we will consider', 'may be waived') must produce 'undetermined', not 'not_eligible'.",
process_type: "rule_generation",
notes: "Core discretion principle — do not remove." // stored for SME context, never sent to LLM
})
aethis_list_domain_guidance({ domain: "consumer_credit" })
Diagnosing a specific failure:
aethis_explain_failure({
// explain-failure currently requires the concrete ruleset_id from the
// /decide envelope (slugs not yet supported on this endpoint)
ruleset_id: "<ruleset_id from your /decide response>",
field_values: { "credit.dti_percent": 55, "credit.is_existing_customer": true },
expected_outcome: "eligible",
test_name: "High DTI, existing customer — approve"
})
// Returns: criterion statuses, which rule failed, and a targeted fix hint
Step 4: Publish
aethis_publish({ project_id: "proj_abc123" })
Returns a ruleset_id — ready to use with aethis_decide.
[!NOTE] Test-driven iteration: Aethis generates rules from your source text and guidance — not from your tests. Tests validate the output and show you what guidance to add next. Better tests = faster convergence on correct rules.
[!IMPORTANT] Anthropic key required for authoring. Rule generation uses Anthropic LLM calls. Pass your key as anthropic_key on aethis_generate_and_test or aethis_refine. The key is used for the request only and never stored. Decision tools do not use Anthropic.
[!IMPORTANT] DATE fields use integer ordinals, not ISO strings. Pass dates as Python date.toordinal() values (days since year 1). Example: 2025-04-13 = 739354, 2026-04-13 = 739719. Passing "2025-04-13" will fail with a type error. Quick conversion: python3 -c "from datetime import date; print(date(2025, 4, 13).toordinal())".
Tools
24 tools in four groups. Most agents use Decision (2 calls). Authors use the full Authoring workflow.
| Group | Tools | What they do |
|---|---|---|
| Decision | aethis_decide, aethis_schema, aethis_next_question, aethis_explain, aethis_explain_failure | Evaluate eligibility, inspect fields, conversational checks, rule explanations, diagnose failures |
| Authoring — section & field phases | aethis_discover_sections, aethis_refine_sections, aethis_validate_sections, aethis_set_field_spec, aethis_discover_fields, aethis_refine_fields, aethis_validate_fields | Decompose legislation into sections (Phase 1); establish and validate field vocabulary (Phase 2) |
| Authoring — rule generation | aethis_create_ruleset, aethis_add_guidance, aethis_list_guidance, aethis_generate_and_test, aethis_refine, aethis_publish, aethis_add_domain_guidance, aethis_list_domain_guidance | Create, iterate, and publish rulesets (TDD workflow); manage project and domain guidance |
| Discovery | aethis_list_projects, aethis_list_rulesets | Find projects, browse ruleset versions |
| Management | aethis_archive_project, aethis_archive_ruleset | Archive projects and rulesets (permanent) |
Prompts
MCP prompts are pre-built workflow guides that compatible clients (Claude Desktop, Cursor, VS Code Copilot) can surface as selectable templates.
| Prompt | Description |
|---|---|
| aethis-author | Step-by-step TDD workflow: gather requirements → create ruleset → generate → refine → publish |
| aethis-decide | Decision workflow: find ruleset → get schema → evaluate (quick or conversational). Accepts optional ruleset_id argument |
Workflows
Evaluate eligibility (2 calls)
aethis_schema(ruleset_id) → learn what fields are needed
aethis_decide(ruleset_id, fields) → eligible / not_eligible / undetermined
- include_trace: true — full evaluation trace with source citations for each criterion
- include_explanation: true — human-readable rule descriptions (useful for surfacing to end users)
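Wired into an agent, the two calls might look like this sketch. `call_tool` is a placeholder for however your MCP client invokes a server tool, and the `schema["fields"]` response shape is an assumption — inspect a real aethis_schema response before relying on it:

```python
def evaluate(call_tool, ruleset_id, field_values):
    """Two-call eligibility check: learn the fields, then decide.

    `call_tool(name, args)` stands in for your MCP client's tool invocation;
    the schema response shape used here is illustrative, not documented.
    """
    schema = call_tool("aethis_schema", {"ruleset_id": ruleset_id})
    missing = [f["key"] for f in schema["fields"] if f["key"] not in field_values]
    if missing:
        raise ValueError(f"fields still needed: {missing}")
    return call_tool("aethis_decide", {
        "ruleset_id": ruleset_id,
        "field_values": field_values,
        "include_trace": True,   # trace with source citations per criterion
    })
```

Failing fast on missing fields keeps you from getting an avoidable "undetermined" back.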
Conversational eligibility — optimal question routing
The engine doesn't just evaluate — it tells your agent what to ask next. Given the facts collected so far, it computes the single most informative question and returns the shortest remaining path to a decision.
aethis_next_question(ruleset_id, {})
→ "What is the applicant's species?" (10 questions remaining)
aethis_next_question(ruleset_id, {species: "Vogon"})
→ Decision: not eligible. No more questions needed.
One fact was enough. A Vogon is disqualified immediately — the engine doesn't ask about flight hours, medical certs, or towel compliance. A different applicant might need 5 questions. Another might need 8. The engine adapts the path based on the answers it receives, always choosing the question that resolves the most uncertainty.
This means your agent can run a guided assessment — asking only the questions that matter, in the order that matters — and reach a provable decision in the fewest possible steps.
The response includes optimal_path — the full ranked list of remaining questions. You don't need to ask all of them: call aethis_next_question again after each answer and the engine recomputes the shortest path from the updated state. Once a decision is reachable, is_eligible is returned and no further questions are needed.
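That recompute-after-each-answer loop is small to write. A sketch in Python — `call_tool` and `ask_user` are placeholders for your MCP client and your agent's question-asking step, and treating `optimal_path` entries as field keys is an assumption about the response shape, so check a real response first:

```python
def guided_assessment(call_tool, ask_user, ruleset_id):
    """Ask only the questions the engine says matter, until a decision is reachable."""
    facts = {}
    while True:
        resp = call_tool("aethis_next_question",
                         {"ruleset_id": ruleset_id, "field_values": facts})
        if "is_eligible" in resp:            # decision reachable — stop asking
            return resp
        # Assumed shape: optimal_path is a ranked list, first entry = next field to ask.
        next_field = resp["optimal_path"][0]
        facts[next_field] = ask_user(next_field)
```

Because the engine recomputes the path each turn, the loop naturally terminates early when one answer (like species = Vogon) settles the outcome.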
Author rules
See Author your own rules for the full TDD workflow.
Setup
Decision tools work with no API key. Add AETHIS_API_KEY when you have authoring access. For most users the aethis-cli one-liner in Quick start is the fastest path; the manual options below are for environments where you don't want to install the Python CLI.
Claude Code
# Decision tools only (no key needed)
claude mcp add aethis -- npx -y aethis-mcp
# With authoring access
claude mcp add aethis -e AETHIS_API_KEY=<your-key> -- npx -y aethis-mcp
Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"aethis": {
"command": "npx",
"args": ["-y", "aethis-mcp"]
}
}
}
To enable authoring, add "env": { "AETHIS_API_KEY": "<your-key>" } to the config above.
Cursor / Windsurf
Add to .cursor/mcp.json or .windsurf/mcp.json (same JSON as above).
Keys & security
- AETHIS_API_KEY (ak_live_...) is your platform key. Mint with aethis login (CLI) or via the dashboard. Set it in the MCP client's config, not your shell profile — the MCP server process doesn't inherit your shell environment.
- ANTHROPIC_API_KEY is forwarded per-request to aethis_generate_and_test. The MCP server never stores it; it accompanies the one request and is discarded server-side.
- Rotate by minting a new key (aethis account generate in the CLI) and revoking the old one (aethis account revoke <key_id>). For multi-machine setups, mint one key per machine so revocation is surgical.
- Both keys live next to each other in your MCP client config; treat that file like any other secrets store (don't commit it to a public repo; sync it via your normal credential pathway).
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| "API key is required" | AETHIS_API_KEY not set (authoring tools only) | Configure in MCP client settings (not shell profile). Decision tools don't need a key |
| "X-Anthropic-Key header is required" | Missing Anthropic key on generation | Pass the anthropic_key parameter on authoring tools |
| "Ruleset not found" (404) | Wrong ID or archived | Use aethis_list_projects → aethis_list_rulesets |
| "Rate limit exceeded" (429) | Daily limit hit | Client retries automatically. Contact eng@aethis.ai for a higher tier |
| "Cannot publish: tests failing" | Tests don't pass | Fix with aethis_refine, or force=true to override |
| Generation timeout (504) | Client timed out waiting (normal for complex rules — generation can take 5–15 min server-side) | The server continues generating after the timeout. Wait 10–15 min, then call aethis_list_rulesets({ project_id }) to check whether a new ruleset appeared. If yes, call aethis_publish. If not, the server may still be running — wait and check again rather than re-triggering generation |
| "Expected an integer for <field>, got str" | DATE field passed as ISO string | Pass as a date.toordinal() integer — e.g. 739354 for 2025-04-13. Quick: python3 -c "from datetime import date; print(date(2025,4,13).toordinal())" |
DSL capabilities
Supported field types
| Type | Description |
|---|---|
| Bool | True / false |
| Int | Integer (includes counts, money as pence, percentages as integers) |
| Enum | Closed set of named values |
| Date | Stored as integer ordinal (days since year 1). Pass via date.toordinal() |
| Duration | Integer number of days |
| String | Free text (use sparingly — prefer Enum for known value sets) |
Supported operators
| Category | Operators |
|---|---|
| Logic | AND, OR, NOT, IMPLIES |
| Comparison | =, ≠, <, ≤, >, ≥ |
| Membership | IN — field IN [v1, v2, ...] |
| Arithmetic | + and − for Int/Date fields; * (multiply) for Int fields |
| Aggregation | min(a, b, ...) and max(a, b, ...) — return the smallest/largest Int |
Helpers
- days_between(date_a, date_b) — returns Int (number of days, date_b − date_a)
- min(a, b, ...) — minimum of 2+ Int values
- max(a, b, ...) — maximum of 2+ Int values
- Constant arithmetic is folded at authoring time: 5 * 365 becomes 1825 in the compiled rule
Not supported
- Division between runtime field values
- Weighted scoring or probabilistic outcomes
- Lists as field values (model as pre-aggregated Int or Bool fields instead)
- More than 3 outcome tiers (eligible / not_eligible / undetermined)
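Pre-aggregating a list into scalar fields is usually a one-liner upstream of the decide call. A sketch — the `claims.*` field names are invented for illustration; use your ruleset's real schema:

```python
# Raw data your system holds — a list, which the engine can't take directly.
prior_claims_gbp = [12_000, 4_500, 80_000]

# Collapse the list into scalar Int/Bool fields before calling aethis_decide.
# These claims.* field names are hypothetical, not from any published ruleset.
field_values = {
    "claims.count": len(prior_claims_gbp),
    "claims.largest_gbp": max(prior_claims_gbp),
    "claims.any_over_50k": any(c > 50_000 for c in prior_claims_gbp),
}
```

The same pattern covers most "list-shaped" inputs: pick the aggregates the rules actually test (count, max, any/all) and pass those.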
Related
aethis-cli — Python CLI for file-based rule authoring with YAML test cases and Rich terminal output.
aethis-examples — Benchmark data, test scenarios, and LLM comparison results for construction insurance, consumer credit, and spacecraft certification.
Development
git clone https://github.com/aethis-ai/aethis-mcp.git
cd aethis-mcp
npm install
npm test # 107 tests
npm run build
License
MIT