@agentskit/eval

Agent evaluation and benchmarking for AgentsKit.

Package Exports

  • @agentskit/eval
  • @agentskit/eval/ci
  • @agentskit/eval/diff
  • @agentskit/eval/replay
  • @agentskit/eval/snapshot

Readme

@agentskit/eval

Measure agent quality with numbers, not vibes — ship with confidence.

Tags: ai · agents · llm · agentskit · ai-agents · eval · evaluation · benchmarking · testing · ci-cd · llm-testing

Why eval

  • Replace "it seemed to work" with real metrics — accuracy, per-case latency, token cost, and pass/fail for every test case in a single result object
  • CI/CD ready — exit codes reflect suite results; gate deployments on accuracy thresholds so regressions never reach production
  • Flexible assertions — exact string matching, includes for LLM verbosity, or full control with a custom (result) => boolean function per case
  • Provider-agnostic — the agent closure can wrap any async boundary: createRuntime, a custom controller, or an HTTP endpoint (see the sketch after this list)
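
For example, here is a minimal sketch of an agent closure over an HTTP endpoint; the URL, request shape, and the answer field on the response are illustrative assumptions, not part of the package:

import { runEval } from '@agentskit/eval'

const result = await runEval({
  agent: async (input) => {
    // Any async boundary works; this posts to a hypothetical deployed agent.
    const res = await fetch('https://agents.example.com/run', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ input }),
    })
    const body = await res.json()
    return body.answer // assumed response field
  },
  suite: { name: 'smoke', cases: [{ input: 'ping', expected: 'pong' }] },
})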

Install

npm install @agentskit/eval

Quick example

import { runEval } from '@agentskit/eval'
import { createRuntime } from '@agentskit/runtime'
import { anthropic } from '@agentskit/adapters'

const runtime = createRuntime({
  adapter: anthropic({ apiKey: process.env.ANTHROPIC_API_KEY, model: 'claude-sonnet-4-6' }),
})

const result = await runEval({
  // The agent under test: any async function from input to output string.
  agent: async (input) => {
    const r = await runtime.run(input)
    return r.content
  },
  suite: {
    name: 'qa-baseline',
    cases: [
      // Exact string matches:
      { input: 'What is 2+2?', expected: '4' },
      { input: 'Capital of France?', expected: 'Paris' },
      // Custom predicate for a verbose LLM answer:
      { input: 'Is TypeScript a superset of JavaScript?', expected: (r) => r.toLowerCase().includes('yes') },
    ],
  },
})

console.log(`Accuracy: ${(result.accuracy * 100).toFixed(1)}%`)
console.log(`Passed: ${result.passed}/${result.totalCases}`)
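
Each entry in result.cases carries the per-case outcome and latency. The field names used below (passed, latencyMs, input, output) are assumptions based on the documented result shape, not confirmed API:

// Field names on each case entry are assumed for illustration.
for (const c of result.cases.filter((c) => !c.passed)) {
  console.log(`FAIL (${c.latencyMs} ms): ${c.input}`)
  console.log(`  got: ${c.output}`)
}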

Features

  • runEval({ agent, suite }) — run a named test suite against any agent function
  • Result: { accuracy, passed, totalCases, cases[] } — per-case latency and outcome
  • Assertion modes: exact match, includes, custom predicate
  • CI exit codes — non-zero on failure for pipeline gating (a minimal gate script is sketched after this list)
  • Pair with @agentskit/observability to trace failed cases
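
The package advertises non-zero exit codes on failure; if you drive evals from your own Node script instead, an explicit gate can use only the documented result fields. The ./eval-setup.mjs module and the 0.9 threshold below are illustrative assumptions:

// ci-gate.mjs: fail the pipeline when accuracy drops below a threshold.
import { runEval } from '@agentskit/eval'
import { agent, suite } from './eval-setup.mjs' // hypothetical local module

const result = await runEval({ agent, suite })
console.log(`Accuracy: ${(result.accuracy * 100).toFixed(1)}% (${result.passed}/${result.totalCases})`)

// A non-zero exit code blocks the deployment step.
process.exitCode = result.accuracy >= 0.9 ? 0 : 1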

Ecosystem

Package                    Role
@agentskit/runtime         Typical agent under test
@agentskit/core            Stable I/O contracts for eval harnesses
@agentskit/observability   Traces for failed cases

Contributors

AgentsKit contributors

License

MIT — see LICENSE.

Docs

Full documentation · GitHub