@verydia/eval

0.1.0
  • License MIT

Evaluation harness for testing and benchmarking Verydia agent flows

Package Exports

  • @verydia/eval
  • @verydia/eval/package.json

Readme

# @verydia/eval

Evaluation and regression harness for Verydia flows.

## Dataset format

Eval datasets are JSON or YAML arrays of cases:

```json
[
  {
    "input": { "text": "hello" },
    "expectedOutput": { "text": "hello", "length": 5 },
    "metadata": { "id": "case-1" }
  }
]
```

Each case:

- `input`: value passed directly to `flow.run(input, deps)`.
- `expectedOutput` (optional): deep JSON equality check against the flow output.
- `expectedBehavior` (optional): behavior assertions, e.g. guard activity.
- `metadata` (optional): arbitrary tags (scenario id, notes, etc.).

YAML datasets are supported through a simple parse-then-JSON transform, so a YAML file describes the same array of cases as its JSON equivalent.
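The case fields listed above can be sketched as a TypeScript type with a small structural check. This is a minimal illustration, not the package's own loader: the `EvalCase` interface and the `parseEvalDataset` helper are hypothetical names, and typing the two `expectedBehavior` flags as booleans is an assumption.

```ts
// Hypothetical shape of one dataset case, inferred from the field list above.
interface EvalCase {
  input: unknown; // passed directly to flow.run(input, deps)
  expectedOutput?: unknown; // compared to the flow output by deep JSON equality
  expectedBehavior?: {
    guardEvaluated?: boolean; // assumption: boolean flag
    noGuardEvaluation?: boolean; // assumption: boolean flag
  };
  metadata?: Record<string, unknown>; // arbitrary tags
}

// Hypothetical helper: parse JSON text and check that it is an array of
// objects that each carry the required "input" field.
function parseEvalDataset(jsonText: string): EvalCase[] {
  const parsed: unknown = JSON.parse(jsonText);
  if (!Array.isArray(parsed)) {
    throw new Error("Eval dataset must be an array of cases");
  }
  return parsed.map((item, index) => {
    if (typeof item !== "object" || item === null || !("input" in item)) {
      throw new Error(`Case ${index} is missing the required "input" field`);
    }
    return item as EvalCase;
  });
}
```

A check like this is only useful before handing a hand-written dataset to the runner; for normal use, `loadEvalDatasetFromFile` is the documented entry point.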

## API

```ts
import { evaluateFlow, loadEvalDatasetFromFile } from "@verydia/eval";
import type { BuiltFlow, FlowRuntimeDeps } from "@verydia/flow-dsl";

const flow: BuiltFlow<any, any> = /* your flow */;
const deps: FlowRuntimeDeps = { /* memoryStore, llmRegistry, etc. */ };

const dataset = await loadEvalDatasetFromFile("./my-dataset.json");
const result = await evaluateFlow({ flow, dataset, deps });

console.log(result.metrics.passRate);
```

## Assertions

`evaluateFlow` supports basic assertions:

- `expectedOutput`: deep JSON equality against the actual output.
- `expectedBehavior.guardEvaluated`: expect at least one `policy.evaluate` event.
- `expectedBehavior.noGuardEvaluation`: expect no `policy.evaluate` events.
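To make these three checks concrete, here is a rough sketch of how they could be implemented. It is not the package's code: `jsonEqual`, `checkBehavior`, and the `TraceEvent` shape (an object with a `type` string) are assumptions made for illustration.

```ts
// Assumed event shape: only the event type matters for these checks.
interface TraceEvent {
  type: string;
}

interface ExpectedBehavior {
  guardEvaluated?: boolean;
  noGuardEvaluation?: boolean;
}

// Deep equality for JSON-like values; object key order is ignored.
function jsonEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aObj = a as Record<string, unknown>;
  const bObj = b as Record<string, unknown>;
  const aKeys = Object.keys(aObj);
  if (aKeys.length !== Object.keys(bObj).length) return false;
  return aKeys.every(
    (key) =>
      Object.prototype.hasOwnProperty.call(bObj, key) &&
      jsonEqual(aObj[key], bObj[key])
  );
}

// Returns one message per failed behavior assertion (empty array = pass).
function checkBehavior(events: TraceEvent[], expected: ExpectedBehavior): string[] {
  const guardEvents = events.filter((e) => e.type === "policy.evaluate").length;
  const failures: string[] = [];
  if (expected.guardEvaluated && guardEvents === 0) {
    failures.push("expected at least one policy.evaluate event");
  }
  if (expected.noGuardEvaluation && guardEvents > 0) {
    failures.push(`expected no policy.evaluate events, saw ${guardEvents}`);
  }
  return failures;
}
```

Returning a list of messages rather than a single boolean keeps the reason for each failure available for a report.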

## Metrics

For each case, the runner measures:

- Latency (ms)
- Number of `llm.invoke` events (LLM calls)
- Number of `mcp.call` events (tool calls)

Aggregated metrics in `EvalResult.metrics`:

- `totalCases`, `passCount`, `failCount`, `passRate`
- `averageLatencyMs`, `totalLatencyMs`
- `totalLlmCalls`, `totalMcpCalls`
- Optional token and cost estimates (`totalTokensIn`, `totalTokensOut`, `costEstimateTotal`) if you provide an estimator.
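The relationship between the per-case measurements and the aggregate fields can be sketched as a simple reduction. The `CaseResult` type and `aggregateMetrics` function are hypothetical. Treating `passRate` as a fraction between 0 and 1 and reporting 0 for an empty dataset are assumptions, since the README does not specify either. The optional token and cost fields are left out.

```ts
// Hypothetical per-case measurements, mirroring the list above.
interface CaseResult {
  passed: boolean;
  latencyMs: number;
  llmCalls: number; // count of llm.invoke events
  mcpCalls: number; // count of mcp.call events
}

// Field names follow EvalResult.metrics as listed in the README.
interface AggregateMetrics {
  totalCases: number;
  passCount: number;
  failCount: number;
  passRate: number; // assumption: fraction in [0, 1]
  averageLatencyMs: number;
  totalLatencyMs: number;
  totalLlmCalls: number;
  totalMcpCalls: number;
}

function aggregateMetrics(results: CaseResult[]): AggregateMetrics {
  const totalCases = results.length;
  const passCount = results.filter((r) => r.passed).length;
  const totalLatencyMs = results.reduce((sum, r) => sum + r.latencyMs, 0);
  return {
    totalCases,
    passCount,
    failCount: totalCases - passCount,
    // Guard against division by zero for an empty dataset.
    passRate: totalCases === 0 ? 0 : passCount / totalCases,
    averageLatencyMs: totalCases === 0 ? 0 : totalLatencyMs / totalCases,
    totalLatencyMs,
    totalLlmCalls: results.reduce((sum, r) => sum + r.llmCalls, 0),
    totalMcpCalls: results.reduce((sum, r) => sum + r.mcpCalls, 0),
  };
}
```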

## CLI: `verydia eval run`

The Verydia CLI exposes a thin wrapper over `@verydia/eval`.

```bash
verydia eval run clinical-triage-dsl --dataset health-eval.json --json eval-report.json
```

This will:

- Load the dataset from `--dataset` (JSON or YAML).
- Run the specified demo flow (`clinical-triage-dsl` or `clinical-triage-graph`).
- Print a summary (cases, pass rate, latency, LLM/MCP calls).
- If `--json` is provided, write the full `EvalResult` to the given file.

Use these reports to compare runs over time and catch regressions in flow behavior, guard activation, or performance.
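One way to compare two reports is to diff a few of their metrics and flag anything that got worse. The sketch below is an illustration only: `findRegressions` and the `maxLatencyRatio` threshold are hypothetical, and it assumes each report exposes `passRate` and `averageLatencyMs` under `metrics` as described above.

```ts
// Subset of EvalResult.metrics used for the comparison.
interface ReportMetrics {
  passRate: number;
  averageLatencyMs: number;
}

// Hypothetical helper: returns one message per detected regression.
// maxLatencyRatio is an assumed tolerance; 1.2 allows a 20% slowdown.
function findRegressions(
  baseline: ReportMetrics,
  current: ReportMetrics,
  maxLatencyRatio = 1.2
): string[] {
  const regressions: string[] = [];
  if (current.passRate < baseline.passRate) {
    regressions.push(
      `passRate dropped from ${baseline.passRate} to ${current.passRate}`
    );
  }
  if (current.averageLatencyMs > baseline.averageLatencyMs * maxLatencyRatio) {
    regressions.push(
      `averageLatencyMs rose from ${baseline.averageLatencyMs} to ${current.averageLatencyMs}`
    );
  }
  return regressions;
}
```

In a CI job, the two inputs could come from the `metrics` field of two files written with `--json`, with the job failing whenever the returned list is non-empty.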