# @verydia/eval

Evaluation and regression harness for Verydia flows.

## Dataset format

Eval datasets are JSON or YAML arrays of cases:

```json
[
  {
    "input": { "text": "hello" },
    "expectedOutput": { "text": "hello", "length": 5 },
    "metadata": { "id": "case-1" }
  }
]
```

Each case:

- `input`: value passed directly to `flow.run(input, deps)`.
- `expectedOutput` (optional): deep JSON equality check against the flow output.
- `expectedBehavior` (optional): behavior assertions, e.g. guard activity.
- `metadata` (optional): arbitrary tags (scenario id, notes, etc.).

YAML is supported via a simple parse-then-JSON transform.
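The case shape described above could be modeled in TypeScript roughly as follows. This is a minimal sketch, not the package's actual typings: the `EvalCase` name and the `assertEvalDataset` helper are hypothetical, and `expectedBehavior` is typed only with the two flags this README documents.

```ts
// Hypothetical type mirroring the dataset fields described above;
// the real exported names in @verydia/eval may differ.
interface EvalCase {
  input: unknown;
  expectedOutput?: unknown;
  expectedBehavior?: { guardEvaluated?: boolean; noGuardEvaluation?: boolean };
  metadata?: Record<string, unknown>;
}

// Hypothetical helper: checks that a parsed JSON/YAML value is an array of
// objects that each carry an `input` field.
function assertEvalDataset(value: unknown): EvalCase[] {
  if (!Array.isArray(value)) {
    throw new Error("dataset must be an array of cases");
  }
  value.forEach((item, index) => {
    if (typeof item !== "object" || item === null || !("input" in item)) {
      throw new Error(`case ${index} is missing "input"`);
    }
  });
  return value as EvalCase[];
}

const parsed: unknown = JSON.parse(
  '[{"input":{"text":"hello"},"expectedOutput":{"text":"hello","length":5},"metadata":{"id":"case-1"}}]'
);
const cases = assertEvalDataset(parsed);
console.log(cases.length);
```

Validating the shape before running the flow keeps a malformed dataset from being reported as a batch of failing cases.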

## API

```ts
import { evaluateFlow, loadEvalDatasetFromFile } from "@verydia/eval";
import type { BuiltFlow, FlowRuntimeDeps } from "@verydia/flow-dsl";

const flow: BuiltFlow<any, any> = /* your flow */;
const deps: FlowRuntimeDeps = { /* memoryStore, llmRegistry, etc. */ };

const dataset = await loadEvalDatasetFromFile("./my-dataset.json");
const result = await evaluateFlow({ flow, dataset, deps });

console.log(result.metrics.passRate);
```

## Assertions

`evaluateFlow` supports basic assertions:

- `expectedOutput`: deep JSON equality against the actual output.
- `expectedBehavior.guardEvaluated`: expect at least one `policy.evaluate` event.
- `expectedBehavior.noGuardEvaluation`: expect no `policy.evaluate` events.
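To make these checks concrete, here is one way they might be implemented. This is a sketch under stated assumptions, not the package's internals: the `TraceEvent` shape, `jsonEqual`, and `checkBehavior` are hypothetical names, and treating object key order as irrelevant is an assumption about what "deep JSON equality" means here.

```ts
// Hypothetical shapes; only the `policy.evaluate` event name and the two
// behavior flags come from this README.
interface TraceEvent { type: string }
interface ExpectedBehavior { guardEvaluated?: boolean; noGuardEvaluation?: boolean }

// Recursively sorts object keys so that key order does not affect comparison.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = canonicalize(source[key]);
    }
    return sorted;
  }
  return value;
}

// Compares two values by their canonical JSON serialization.
function jsonEqual(a: unknown, b: unknown): boolean {
  return JSON.stringify(canonicalize(a)) === JSON.stringify(canonicalize(b));
}

// Returns a list of failure messages; an empty list means the case passed.
function checkBehavior(expected: ExpectedBehavior, events: TraceEvent[]): string[] {
  const guardEvents = events.filter((e) => e.type === "policy.evaluate").length;
  const failures: string[] = [];
  if (expected.guardEvaluated && guardEvents === 0) {
    failures.push("expected at least one policy.evaluate event");
  }
  if (expected.noGuardEvaluation && guardEvents > 0) {
    failures.push("expected no policy.evaluate events");
  }
  return failures;
}
```

Returning messages rather than a boolean makes it easier to report why a case failed.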

## Metrics

For each case, the runner measures:

- Latency (ms)
- Number of `llm.invoke` events (LLM calls)
- Number of `mcp.call` events (tool calls)

Aggregated metrics in `EvalResult.metrics`:

- `totalCases`, `passCount`, `failCount`, `passRate`
- `averageLatencyMs`, `totalLatencyMs`
- `totalLlmCalls`, `totalMcpCalls`
- Optional token and cost estimates (`totalTokensIn`, `totalTokensOut`, `costEstimateTotal`) if you provide an estimator.
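The relationship between per-case measurements and the aggregated fields can be sketched as below. The `CaseMetrics` type and `aggregate` function are hypothetical, and two details are assumptions rather than documented behavior: `passRate` is expressed as a fraction between 0 and 1, and an empty dataset yields zeros instead of `NaN`.

```ts
// Hypothetical per-case record; field names are illustrative.
interface CaseMetrics {
  passed: boolean;
  latencyMs: number;
  llmCalls: number;
  mcpCalls: number;
}

// Subset of the aggregated fields listed above (token and cost estimates omitted).
interface AggregateMetrics {
  totalCases: number;
  passCount: number;
  failCount: number;
  passRate: number;
  averageLatencyMs: number;
  totalLatencyMs: number;
  totalLlmCalls: number;
  totalMcpCalls: number;
}

function aggregate(cases: CaseMetrics[]): AggregateMetrics {
  const totalCases = cases.length;
  const passCount = cases.filter((c) => c.passed).length;
  const totalLatencyMs = cases.reduce((sum, c) => sum + c.latencyMs, 0);
  return {
    totalCases,
    passCount,
    failCount: totalCases - passCount,
    // Assumption: a fraction in [0, 1]; guard against division by zero.
    passRate: totalCases === 0 ? 0 : passCount / totalCases,
    averageLatencyMs: totalCases === 0 ? 0 : totalLatencyMs / totalCases,
    totalLatencyMs,
    totalLlmCalls: cases.reduce((sum, c) => sum + c.llmCalls, 0),
    totalMcpCalls: cases.reduce((sum, c) => sum + c.mcpCalls, 0),
  };
}
```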

## CLI: `verydia eval run`

The Verydia CLI exposes a thin wrapper over `@verydia/eval`.

```bash
verydia eval run clinical-triage-dsl --dataset health-eval.json --json eval-report.json
```

This will:

- Load the dataset from `--dataset` (JSON or YAML).
- Run the specified demo flow (`clinical-triage-dsl` or `clinical-triage-graph`).
- Print a summary (cases, pass rate, latency, LLM/MCP calls).
- If `--json` is provided, write the full `EvalResult` to the given file.

Use these reports to compare runs over time and catch regressions in flow behavior, guard activation, or performance.
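One possible way to compare two reports is to diff a few metrics against a baseline. The helper below is a hypothetical sketch, not part of `@verydia/eval` or the CLI: `findRegressions`, the `MetricsSnapshot` type, and the 20% latency tolerance are all illustrative choices, and only the metric field names come from this README.

```ts
// Hypothetical subset of the metrics fields named in this README.
interface MetricsSnapshot {
  passRate: number;
  averageLatencyMs: number;
  totalLlmCalls: number;
}

// Hypothetical helper: returns a human-readable list of regressions.
// `maxLatencyIncrease` is a relative tolerance (0.2 allows a 20% slowdown).
function findRegressions(
  baseline: MetricsSnapshot,
  current: MetricsSnapshot,
  maxLatencyIncrease = 0.2
): string[] {
  const regressions: string[] = [];
  if (current.passRate < baseline.passRate) {
    regressions.push(
      `passRate dropped from ${baseline.passRate} to ${current.passRate}`
    );
  }
  if (current.averageLatencyMs > baseline.averageLatencyMs * (1 + maxLatencyIncrease)) {
    regressions.push(
      `averageLatencyMs rose from ${baseline.averageLatencyMs} to ${current.averageLatencyMs}`
    );
  }
  if (current.totalLlmCalls > baseline.totalLlmCalls) {
    regressions.push(
      `totalLlmCalls rose from ${baseline.totalLlmCalls} to ${current.totalLlmCalls}`
    );
  }
  return regressions;
}
```

In practice the two snapshots would be read from the `metrics` field of reports written with `--json`; a CI job could fail when the returned list is non-empty.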