Package Exports
- @kindlm/cli
Readme
KindLM
Behavioral regression testing for AI agents. Test what your agents do — not just what they say.
Why KindLM?
LLM evals measure text quality. KindLM tests behavior — the tool calls your agent makes, the decisions it takes, and whether it leaks PII or violates compliance rules. It runs in CI so regressions never ship.
Features
- Tool call assertions — verify agents call the right tools with the right arguments, in the right order
- Schema validation — structured output checked against JSON Schema (AJV)
- PII detection — catch leaked SSNs, credit cards, emails, phone numbers, IBANs
- LLM-as-judge — score responses against natural-language criteria (0.0–1.0)
- Drift detection — semantic + field-level comparison against saved baselines
- Keyword guards — require or forbid specific phrases in output
- Latency & cost budgets — fail tests that exceed time or token-cost thresholds
- EU AI Act compliance — generate Annex IV documentation from test results
- CI-native — exit code 0/1, JUnit XML reporter, GitHub Actions ready
Supported Providers
| Provider | Example config |
|---|---|
| OpenAI | openai:gpt-4o |
| Anthropic | anthropic:claude-sonnet-4-5-20250929 |
| Google Gemini | google:gemini-2.0-flash |
| Mistral | mistral:mistral-large-latest |
| Cohere | cohere:command-r-plus |
| Ollama | ollama:llama3 |
Quick Start
Try it instantly:
npx @kindlm/cli initOr install globally:
npm install -g @kindlm/cli
kindlm initEdit the generated kindlm.yaml:
kindlm: 1
project: "my-agent"
suite:
name: "refund-agent"
providers:
openai:
apiKeyEnv: "OPENAI_API_KEY"
models:
- id: "gpt-4o"
provider: "openai"
model: "gpt-4o"
params:
temperature: 0
prompts:
refund:
system: "You are a refund support agent. Use lookup_order(order_id) to find orders."
user: "{{message}}"
tests:
- name: "looks-up-order"
prompt: "refund"
vars:
message: "I want to return order #12345"
tools:
- name: "lookup_order"
responses:
- when: { order_id: "12345" }
then: { order_id: "12345", status: "eligible" }
expect:
toolCalls:
- tool: "lookup_order"
argsMatch: { order_id: "12345" }
guardrails:
pii:
enabled: true
judge:
- criteria: "Response is empathetic and professional"
minScore: 0.8Run your tests:
kindlm testOutput:
refund-agent / looks-up-order
gpt-4o
✓ looks-up-order (1.3s)
✓ tool_called: lookup_order
✓ pii: no PII detected
✓ judge: 0.92 ≥ 0.80
1 passed, 0 failed
Gates: ✓ PASSEDCLI Flags
| Flag | Description | Added |
|---|---|---|
--reporter <type> |
Output format: pretty (default), json, junit |
v1.0.0 |
--output <path> |
Write report to file | v1.0.0 |
--compliance |
Generate EU AI Act Annex IV report | v1.0.0 |
--pdf <path> |
Export compliance report as PDF (requires --compliance) |
v1.0.0 |
-s <suite> |
Run a specific suite by name | v1.0.0 |
--runs <count> |
Override the repeat count from config |
v1.0.0 |
--gate <percent> |
Fail if suite pass rate falls below threshold (0–100) | v1.0.0 |
--isolate |
Run in an isolated git worktree (clean environment) | v2.0.0 |
--concurrency <n> |
Override the concurrency setting from config | v2.1.0 |
--timeout <ms> |
Override the per-test timeout from config | v2.1.0 |
CI Integration
# .github/workflows/test.yml
- run: npm install -g @kindlm/cli
- run: kindlm test --reporter junit --output results.xmlRepository Layout
packages/
core/ @kindlm/core — Business logic, zero I/O dependencies
cli/ @kindlm/cli — CLI entry point
cloud/ @kindlm/cloud — Cloudflare Workers API + D1 database
docs/ Technical specs and documentation
site/ Documentation website (Next.js)Documentation
Full docs: kindlm.dev | Source: docs/
License
MIT (core + CLI) | AGPL (cloud)