# Harness Evolver

Autonomous harness optimization for LLM agents. Point it at any codebase, and Harness Evolver will evolve the scaffolding around your LLM — prompts, retrieval, routing, output parsing — using a multi-agent loop inspired by Meta-Harness (Lee et al., 2026).

The harness is the 80% factor: changing just the scaffolding can produce a 6x performance gap on the same benchmark. This plugin automates that search.
## Install

```shell
npx harness-evolver@latest
```

Works with Claude Code, Cursor, Codex, and Windsurf. Restart your agent after install.
## Quick Start

```shell
cd my-llm-project
claude
/harness-evolver:init    # scans code, creates eval + tasks if missing
/harness-evolver:evolve  # runs the optimization loop
/harness-evolver:status  # check progress anytime
```

Zero-config mode: if your project has no eval script or test cases, the plugin generates them automatically — test cases from code analysis, scoring via LLM-as-judge.
## How It Works

| Component | What it does |
|---|---|
| 5 Adaptive Proposers | Each iteration spawns 5 parallel agents: exploit (targeted fix), explore (bold rewrite), crossover (combine two parents), and 2 failure-focused agents that target the weakest task clusters. Strategies adapt every iteration based on actual per-task scores — no fixed specialists. |
| Trace Insights | Every harness run captures stdout, stderr, timing, and per-task I/O. LangSmith auto-tracing for LangChain/LangGraph agents. Traces are systematically clustered by error pattern, token usage, and response type — proposers receive structured diagnostic data, not raw logs. |
| Quality-Diversity Selection | Not winner-take-all. Tracks per-task champions — a candidate that loses overall but excels at specific tasks is preserved as the next crossover parent. The archive never discards variants. |
| Durable Test Gates | When the loop fixes a failure, regression tasks are automatically generated to lock in the improvement. The test suite grows over iterations — fixed bugs can never silently return. |
| Critic | Auto-triggers when scores jump suspiciously fast. Analyzes eval quality, detects gaming, proposes stricter evaluation. Prevents false convergence. |
| Architect | Auto-triggers on stagnation or regression. Recommends topology changes (single-call → RAG, chain → ReAct, etc.) with concrete migration steps. |
| Judge | LLM-as-judge scoring when no eval exists. Multi-dimensional: accuracy, completeness, relevance, hallucination detection. No expected answers needed. |
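The quality-diversity selection described above can be sketched as follows. This is a hypothetical illustration, not the plugin's actual code; the function name `select` and the candidate/score shapes are assumptions:

```python
def select(archive: dict, candidates: list[dict]) -> dict:
    """Quality-diversity selection sketch: pick an overall winner while
    keeping a per-task champion archive for future crossover parents.

    Each candidate looks like {"id": str, "scores": {task_id: float}}.
    """
    def overall(cand: dict) -> float:
        return sum(cand["scores"].values()) / len(cand["scores"])

    winner = max(candidates, key=overall)
    for cand in candidates:
        for task_id, score in cand["scores"].items():
            champ = archive.get(task_id)
            # A candidate that loses overall but excels at one task is
            # still preserved as that task's champion, never discarded.
            if champ is None or score > champ["scores"][task_id]:
                archive[task_id] = cand
    return winner
```

A crossover proposer would then draw two distinct parents from `archive.values()` rather than only mutating the overall winner.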
## Commands

| Command | What it does |
|---|---|
| `/harness-evolver:init` | Scan project, create harness/eval/tasks, run baseline |
| `/harness-evolver:evolve` | Run the autonomous optimization loop (5 adaptive proposers) |
| `/harness-evolver:status` | Show progress, scores, stagnation detection |
| `/harness-evolver:compare` | Diff two versions with per-task analysis |
| `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
| `/harness-evolver:deploy` | Promote the best harness back to your project |
| `/harness-evolver:architect` | Analyze and recommend optimal agent topology |
| `/harness-evolver:critic` | Evaluate eval quality and detect gaming |
| `/harness-evolver:import-traces` | Pull production LangSmith traces as eval tasks |
## Agents

| Agent | Role | Color |
|---|---|---|
| Proposer | Evolves the harness code based on trace analysis | Green |
| Architect | Recommends multi-agent topology (ReAct, RAG, hierarchical, etc.) | Blue |
| Critic | Evaluates eval quality, detects gaming, proposes stricter scoring | Red |
| Judge | LLM-as-judge scoring — works without expected answers | Yellow |
| TestGen | Generates synthetic test cases from code analysis | Cyan |
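As an illustration of the Judge's multi-dimensional scoring, the per-dimension scores could be folded into one value with hallucination acting as a penalty. This is a hypothetical sketch (`judge_score` and the weights are assumptions, not the plugin's actual rubric):

```python
def judge_score(dims: dict[str, float]) -> float:
    """Fold per-dimension judge scores (each in [0, 1]) into one value.

    Hypothetical weights; `hallucination` is the judged likelihood of
    hallucinated content, so it subtracts from the score.
    """
    weights = {"accuracy": 0.4, "completeness": 0.3, "relevance": 0.3}
    base = sum(weights[d] * dims[d] for d in weights)
    # Penalize hallucination risk and clamp the result to [0, 1].
    return max(0.0, min(1.0, base - 0.5 * dims.get("hallucination", 0.0)))
```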
## Integrations

| Integration | What it does |
|---|---|
| LangSmith | Auto-traces LangChain/LangGraph agents. Proposers read actual LLM prompts/responses via langsmith-cli. Processed into readable format per iteration. |
| Context7 | Proposers consult up-to-date library documentation before writing code. Detects 17 libraries via AST analysis. |
| LangChain Docs | LangChain/LangGraph-specific documentation search via MCP. |

```shell
# Optional — install during npx setup or manually:
uv tool install langsmith-cli && langsmith-cli auth login
claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
```

## The Harness Contract
A harness is any executable:

```shell
python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]
```

Works with any language, any framework, any domain. If your project doesn't have a harness, the init skill creates a wrapper around your entry point.
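A minimal harness satisfying this contract might look like the sketch below. The JSON shapes (`input`/`output` keys) and the `run` function are assumptions; your project's actual task and result schemas will differ:

```python
import argparse
import json


def run(task: dict) -> dict:
    """Hypothetical harness body: swap in your prompt/RAG/agent logic here."""
    prompt = task.get("input", "")
    # ... call your LLM stack here ...
    return {"output": f"echo: {prompt}"}


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--traces-dir", default=None)  # optional per the contract
    parser.add_argument("--config", default=None)      # optional per the contract
    args = parser.parse_args()

    with open(args.input) as f:
        task = json.load(f)
    result = run(task)
    with open(args.output, "w") as f:
        json.dump(result, f)


if __name__ == "__main__":
    main()
```

The evolution loop only sees the files on either side of this interface, which is why any language or framework works.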
## Evolution Loop

```
/harness-evolver:evolve
│
├─ 1.  Get next version
├─ 1.5 Gather LangSmith traces (processed into readable format)
├─ 1.6 Generate Trace Insights (cluster errors, analyze tokens, cross-ref scores)
├─ 1.8 Analyze per-task failures (cluster by category for adaptive briefings)
├─ 2.  Spawn 5 proposers in parallel (exploit / explore / crossover / 2× failure-targeted)
├─ 3.  Validate all candidates
├─ 4.  Evaluate all candidates
├─ 4.5 Judge (if using LLM-as-judge eval)
├─ 5.  Select winner + track per-task champion
├─ 5.5 Test suite growth (generate regression tasks for fixed failures)
├─ 6.  Report results
├─ 6.5 Auto-trigger Critic (if score jumped >0.3 or reached 1.0 too fast)
├─ 7.  Auto-trigger Architect (if regression or stagnation)
└─ 8.  Check stop conditions (target reached, N iterations, stagnation post-architect)
```

## API Keys
Set in your shell before launching Claude Code:

```shell
export GEMINI_API_KEY="AIza..."        # Gemini-based harnesses
export ANTHROPIC_API_KEY="sk-ant-..."  # Claude-based harnesses
export OPENAI_API_KEY="sk-..."         # OpenAI-based harnesses
export OPENROUTER_API_KEY="sk-or-..."  # Multi-model via OpenRouter
export LANGSMITH_API_KEY="lsv2_pt_..." # Auto-enables LangSmith tracing
```

The plugin auto-detects available keys. No key needed for the included example.
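The auto-detection can be approximated with a simple environment check. This is an illustrative bash sketch, not the plugin's actual logic; `detect_keys` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: report which provider keys are set in the environment.
detect_keys() {
  local var
  for var in GEMINI_API_KEY ANTHROPIC_API_KEY OPENAI_API_KEY \
             OPENROUTER_API_KEY LANGSMITH_API_KEY; do
    # ${!var} is bash indirect expansion: the value of the variable named $var.
    if [ -n "${!var}" ]; then
      echo "detected: $var"
    fi
  done
}
```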
## Comparison

| | Meta-Harness | A-Evolve | ECC | Harness Evolver |
|---|---|---|---|---|
| Format | Paper artifact | Framework (Docker) | Plugin (passive) | Plugin (active) |
| Search | Code-space | Code-space | Prompt-space | Code-space |
| Candidates/iter | 1 | 1 | N/A | 5 parallel (adaptive) |
| Selection | Single best | Single best | N/A | Quality-diversity (per-task) |
| Auto-critique | No | No | No | Yes (critic + judge) |
| Architecture | Fixed | Fixed | N/A | Auto-recommended |
| Trace analysis | Manual | No | No | Systematic (clustering + insights) |
| Test growth | No | No | No | Yes (durable regression gates) |
| LangSmith | No | No | No | Yes |
| Context7 | No | No | No | Yes |
| Zero-config | No | No | No | Yes |
## References
- Meta-Harness: End-to-End Optimization of Model Harnesses — Lee et al., 2026
- Darwin Gödel Machine — Sakana AI (parallel evolution architecture)
- AlphaEvolve — DeepMind (population-based code evolution)
- Agent Skills Specification — Open standard for AI agent skills
## License
MIT