# Harness Evolver

Autonomous harness optimization for LLM agents. Point it at any codebase, and Harness Evolver will evolve the scaffolding around your LLM — prompts, retrieval, routing, output parsing — using a multi-agent loop inspired by Meta-Harness (Lee et al., 2026).

The harness is the 80% factor: changing just the scaffolding can produce a 6x performance gap on the same benchmark. This plugin automates that search.
## Install

```shell
npx harness-evolver@latest
```

Works with Claude Code, Cursor, Codex, and Windsurf. Restart your agent after install.
## Quick Start

```shell
cd my-llm-project
claude
/harness-evolver:init    # scans code, creates eval + tasks if missing
/harness-evolver:evolve  # runs the optimization loop
/harness-evolver:status  # check progress anytime
```

Zero-config mode: if your project has no eval script or test cases, the plugin generates them automatically — test cases from code analysis, scoring via LLM-as-judge.
## How It Works

| Component | What it does |
|---|---|
| 5 Adaptive Proposers | Each iteration spawns 5 parallel agents: exploit (targeted fix), explore (bold rewrite), crossover (combine two parents), and 2 failure-focused agents that target the weakest task clusters. Strategies adapt every iteration based on actual per-task scores — no fixed specialists. |
| Trace Insights | Every harness run captures stdout, stderr, timing, and per-task I/O. LangSmith auto-tracing for LangChain/LangGraph agents. Traces are systematically clustered by error pattern, token usage, and response type — proposers receive structured diagnostic data, not raw logs. |
| Quality-Diversity Selection | Not winner-take-all. Tracks per-task champions — a candidate that loses overall but excels at specific tasks is preserved as the next crossover parent. The archive never discards variants. |
| Durable Test Gates | When the loop fixes a failure, regression tasks are automatically generated to lock in the improvement. The test suite grows over iterations — fixed bugs can never silently return. |
| Critic | Auto-triggers when scores jump suspiciously fast. Analyzes eval quality, detects gaming, proposes stricter evaluation. Prevents false convergence. |
| Architect | Auto-triggers on stagnation or regression. Recommends topology changes (single-call → RAG, chain → ReAct, etc.) with concrete migration steps. |
| Judge | LLM-as-judge scoring when no eval exists. Multi-dimensional: accuracy, completeness, relevance, hallucination detection. No expected answers needed. |
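The quality-diversity selection described above can be sketched as follows. This is a hypothetical illustration, not the plugin's actual code; the function name `select` and the candidate/score shapes are assumptions:

```python
def select(archive: dict, candidates: list[dict]) -> dict:
    """Quality-diversity selection sketch: pick an overall winner while
    keeping a per-task champion archive for future crossover parents.

    Each candidate looks like {"id": str, "scores": {task_id: float}}.
    """
    def overall(cand: dict) -> float:
        return sum(cand["scores"].values()) / len(cand["scores"])

    winner = max(candidates, key=overall)
    for cand in candidates:
        for task_id, score in cand["scores"].items():
            champ = archive.get(task_id)
            # A candidate that loses overall but excels at one task is
            # still preserved as that task's champion, never discarded.
            if champ is None or score > champ["scores"][task_id]:
                archive[task_id] = cand
    return winner
```

A crossover proposer would then draw two distinct parents from `archive.values()` rather than only mutating the overall winner.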
## Commands

| Command | What it does |
|---|---|
| `/harness-evolver:init` | Scan project, create harness/eval/tasks, run baseline |
| `/harness-evolver:evolve` | Run the autonomous optimization loop (5 adaptive proposers) |
| `/harness-evolver:status` | Show progress, scores, stagnation detection |
| `/harness-evolver:compare` | Diff two versions with per-task analysis |
| `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
| `/harness-evolver:deploy` | Promote the best harness back to your project |
| `/harness-evolver:architect` | Analyze and recommend optimal agent topology |
| `/harness-evolver:critic` | Evaluate eval quality and detect gaming |
| `/harness-evolver:import-traces` | Pull production LangSmith traces as eval tasks |
## Agents

| Agent | Role | Color |
|---|---|---|
| Proposer | Evolves the harness code based on trace analysis | Green |
| Architect | Recommends multi-agent topology (ReAct, RAG, hierarchical, etc.) | Blue |
| Critic | Evaluates eval quality, detects gaming, proposes stricter scoring | Red |
| Judge | LLM-as-judge scoring — works without expected answers | Yellow |
| TestGen | Generates synthetic test cases from code analysis | Cyan |
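As an illustration of the Judge's multi-dimensional scoring, the per-dimension scores could be folded into one value with hallucination acting as a penalty. This is a hypothetical sketch (`judge_score` and the weights are assumptions, not the plugin's actual rubric):

```python
def judge_score(dims: dict[str, float]) -> float:
    """Fold per-dimension judge scores (each in [0, 1]) into one value.

    Hypothetical weights; `hallucination` is the judged likelihood of
    hallucinated content, so it subtracts from the score.
    """
    weights = {"accuracy": 0.4, "completeness": 0.3, "relevance": 0.3}
    base = sum(weights[d] * dims[d] for d in weights)
    # Penalize hallucination risk and clamp the result to [0, 1].
    return max(0.0, min(1.0, base - 0.5 * dims.get("hallucination", 0.0)))
```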
## Integrations

| Integration | What it does |
|---|---|
| LangSmith | Auto-traces LangChain/LangGraph agents. Proposers read actual LLM prompts/responses via langsmith-cli. Processed into readable format per iteration. |
| Context7 | Proposers consult up-to-date library documentation before writing code. Detects 17 libraries via AST analysis. |
| LangChain Docs | LangChain/LangGraph-specific documentation search via MCP. |

```shell
# Optional — install during npx setup or manually:
uv tool install langsmith-cli && langsmith-cli auth login
claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
```

## The Harness Contract
A harness is any executable:

```shell
python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]
```

Works with any language, any framework, any domain. If your project doesn't have a harness, the init skill creates a wrapper around your entry point.
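A minimal harness satisfying this contract might look like the sketch below. The JSON shapes (`input`/`output` keys) and the `run` function are assumptions; your project's actual task and result schemas will differ:

```python
import argparse
import json


def run(task: dict) -> dict:
    """Hypothetical harness body: swap in your prompt/RAG/agent logic here."""
    prompt = task.get("input", "")
    # ... call your LLM stack here ...
    return {"output": f"echo: {prompt}"}


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--traces-dir", default=None)  # optional per the contract
    parser.add_argument("--config", default=None)      # optional per the contract
    args = parser.parse_args()

    with open(args.input) as f:
        task = json.load(f)
    result = run(task)
    with open(args.output, "w") as f:
        json.dump(result, f)


if __name__ == "__main__":
    main()
```

The evolution loop only sees the files on either side of this interface, which is why any language or framework works.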
## Evolution Loop

```
/harness-evolver:evolve
│
├─ 1.  Get next version
├─ 1.5 Gather LangSmith traces (processed into readable format)
├─ 1.6 Generate Trace Insights (cluster errors, analyze tokens, cross-ref scores)
├─ 1.8 Analyze per-task failures (cluster by category for adaptive briefings)
├─ 2.  Spawn 5 proposers in parallel (exploit / explore / crossover / 2× failure-targeted)
├─ 3.  Validate all candidates
├─ 4.  Evaluate all candidates
├─ 4.5 Judge (if using LLM-as-judge eval)
├─ 5.  Select winner + track per-task champion
├─ 5.5 Test suite growth (generate regression tasks for fixed failures)
├─ 6.  Report results
├─ 6.5 Auto-trigger Critic (if score jumped >0.3 or reached 1.0 too fast)
├─ 7.  Auto-trigger Architect (if regression or stagnation)
└─ 8.  Check stop conditions (target reached, N iterations, stagnation post-architect)
```

## API Keys
Set in your shell before launching Claude Code:

```shell
export GEMINI_API_KEY="AIza..."        # Gemini-based harnesses
export ANTHROPIC_API_KEY="sk-ant-..."  # Claude-based harnesses
export OPENAI_API_KEY="sk-..."         # OpenAI-based harnesses
export OPENROUTER_API_KEY="sk-or-..."  # Multi-model via OpenRouter
export LANGSMITH_API_KEY="lsv2_pt_..." # Auto-enables LangSmith tracing
```

The plugin auto-detects available keys. No key needed for the included example.
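The auto-detection can be approximated with a simple environment check. This is an illustrative bash sketch, not the plugin's actual logic; `detect_keys` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: report which provider keys are set in the environment.
detect_keys() {
  local var
  for var in GEMINI_API_KEY ANTHROPIC_API_KEY OPENAI_API_KEY \
             OPENROUTER_API_KEY LANGSMITH_API_KEY; do
    # ${!var} is bash indirect expansion: the value of the variable named $var.
    if [ -n "${!var}" ]; then
      echo "detected: $var"
    fi
  done
}
```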
## Comparison

| | Meta-Harness | A-Evolve | ECC | Harness Evolver |
|---|---|---|---|---|
| Format | Paper artifact | Framework (Docker) | Plugin (passive) | Plugin (active) |
| Search | Code-space | Code-space | Prompt-space | Code-space |
| Candidates/iter | 1 | 1 | N/A | 5 parallel (adaptive) |
| Selection | Single best | Single best | N/A | Quality-diversity (per-task) |
| Auto-critique | No | No | No | Yes (critic + judge) |
| Architecture | Fixed | Fixed | N/A | Auto-recommended |
| Trace analysis | Manual | No | No | Systematic (clustering + insights) |
| Test growth | No | No | No | Yes (durable regression gates) |
| LangSmith | No | No | No | Yes |
| Context7 | No | No | No | Yes |
| Zero-config | No | No | No | Yes |
## References
- Meta-Harness: End-to-End Optimization of Model Harnesses — Lee et al., 2026
- Darwin Gödel Machine — Sakana AI (parallel evolution architecture)
- AlphaEvolve — DeepMind (population-based code evolution)
- Agent Skills Specification — Open standard for AI agent skills
## License
MIT