Harness Evolver
End-to-end optimization of LLM agent harnesses, inspired by Meta-Harness (Lee et al., 2026).
The harness is the 80% factor. Changing just the scaffolding around a fixed LLM can produce a 6x performance gap on the same benchmark. Harness Evolver automates the search for better harnesses using an autonomous propose-evaluate-iterate loop with full execution traces as feedback.
Why
Manual harness engineering is slow and doesn't scale. Existing optimizers work in prompt-space (OPRO, TextGrad, GEPA) or use compressed summaries. Meta-Harness showed that code-space search with full diagnostic context (10M+ tokens of traces) outperforms all of them by 10+ points.
Harness Evolver brings that approach to any domain as a Claude Code plugin.
Install
# Via npx (recommended)
npx harness-evolver@latest
# Or as a Claude Code plugin
/plugin install harness-evolver
Quick Start
# 1. Copy the example into a working directory
cp -r ~/.harness-evolver/examples/classifier ./my-classifier
cd my-classifier
# 2. Initialize (validates harness, evaluates baseline)
/harness-evolve-init --harness harness.py --eval eval.py --tasks tasks/
# 3. Run the evolution loop
/harness-evolve --iterations 5
# 4. Check progress anytime
/harness-evolve-status
The classifier example runs in mock mode (no API key needed) and demonstrates the full loop in under 2 minutes.
How It Works
           ┌─────────────────────────────┐
           │  /harness-evolve            │
           │  (orchestrator skill)       │
           └──────────┬──────────────────┘
                      │
     ┌────────────────┼────────────────┐
     ▼                ▼                ▼
┌──────────┐    ┌────────────┐    ┌──────────┐
│ PROPOSE  │    │  EVALUATE  │    │  UPDATE  │
│ proposer │    │ evaluate.py│    │ state.py │
│ agent    │    │ + eval.py  │    │          │
└──────────┘    └────────────┘    └──────────┘
     │                │                │
     ▼                ▼                ▼
harnesses/      traces/           summary.json
v{N}/           per-task          STATE.md
harness.py      stdout/stderr     PROPOSER_HISTORY.md
proposal.md     timing.json
scores.json
- Propose: A proposer agent (Claude Code subagent) reads all prior candidates' code, execution traces, and scores. It diagnoses failure modes via counterfactual analysis and writes a new harness.
- Evaluate: The harness runs against every task. Traces are captured per-task (input, output, stdout, stderr, timing). The user's eval script scores the results.
- Update: State files are updated with the new score, parent lineage, and regression detection.
- Repeat: The loop continues until N iterations, stagnation (3 rounds without >1% improvement), or a target score is reached.
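The stopping conditions in the Repeat step can be sketched as a small function. This is an illustration only, not the plugin's actual code: the function name, its parameters, and the reading of ">1%" as an absolute gain of 0.01 on a 0 to 1 score are all assumptions.

```python
def should_stop(scores, max_iterations, target_score=None,
                patience=3, min_gain=0.01):
    """Return a reason string if the loop should stop, else None.

    scores: list of scores with the baseline first, then one entry per
    evaluated version. All names and defaults here are hypothetical.
    """
    iterations_done = len(scores) - 1  # the first entry is the baseline

    # Stop as soon as any version reaches the target score.
    if target_score is not None and scores and max(scores) >= target_score:
        return "target reached"

    # Stop when the iteration budget is used up.
    if iterations_done >= max_iterations:
        return "iteration limit"

    # Stop when the last `patience` rounds did not beat the earlier best
    # by more than `min_gain`.
    if iterations_done >= patience:
        best_before = max(scores[:-patience])
        best_recent = max(scores[-patience:])
        if best_recent - best_before <= min_gain:
            return "stagnation"

    return None
```

An orchestrator could call such a check after each Update step and break out of the loop when it returns a reason.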
The Harness Contract
A harness is any executable that accepts:
python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]
- --input: JSON with {id, input, metadata} (never sees expected answers)
- --output: JSON with {id, output}
- --traces-dir: optional directory for the harness to write rich traces
- --config: optional JSON with evolvable parameters (model, temperature, etc.)
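A minimal harness that satisfies this contract might look like the sketch below. The toy labelling rule, the "threshold" config key, and the trace file name are hypothetical placeholders; only the four flags and the {id, input, metadata} / {id, output} shapes come from the contract above.

```python
#!/usr/bin/env python3
"""Minimal harness sketch following the CLI contract (illustrative only)."""
import argparse
import json
from pathlib import Path


def solve(task_input, config):
    """Toy logic standing in for an LLM call: label text 'long' or 'short'."""
    threshold = config.get("threshold", 20)  # hypothetical evolvable parameter
    return "long" if len(str(task_input)) > threshold else "short"


def run(input_path, output_path, traces_dir=None, config_path=None):
    task = json.loads(Path(input_path).read_text())
    config = json.loads(Path(config_path).read_text()) if config_path else {}

    output = solve(task["input"], config)

    # The contract asks for {id, output} in the output file.
    Path(output_path).write_text(json.dumps({"id": task["id"], "output": output}))

    # Optionally write a richer trace when a directory is provided.
    if traces_dir:
        trace_dir = Path(traces_dir)
        trace_dir.mkdir(parents=True, exist_ok=True)
        trace_file = trace_dir / f"{task['id']}.json"  # file name is an assumption
        trace_file.write_text(json.dumps({"config": config, "output": output}))


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--traces-dir")
    parser.add_argument("--config")
    args = parser.parse_args(argv)
    run(args.input, args.output, args.traces_dir, args.config)


if __name__ == "__main__":
    main()
```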
The eval script is also any executable:
python3 eval.py --results-dir results/ --tasks-dir tasks/ --scores scores.json
This means Harness Evolver works with any language, any framework, any domain.
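As a rough sketch of the eval side, the script below scores results by exact match. The README does not specify the file layouts, so several details are assumptions: that each task file carries an "expected" field, that result files share the task file's name, and that scores.json holds an "accuracy" value plus a "per_task" map.

```python
#!/usr/bin/env python3
"""Exact-match eval sketch (file layouts and field names are assumptions)."""
import argparse
import json
from pathlib import Path


def score(results_dir, tasks_dir, scores_path):
    per_task = {}
    for task_file in sorted(Path(tasks_dir).glob("*.json")):
        task = json.loads(task_file.read_text())
        # Assumes the result file has the same name as the task file.
        result_file = Path(results_dir) / task_file.name
        if result_file.exists():
            result = json.loads(result_file.read_text())
            correct = result.get("output") == task.get("expected")
        else:
            correct = False  # a missing result counts as a failure
        per_task[task["id"]] = 1.0 if correct else 0.0

    accuracy = sum(per_task.values()) / len(per_task) if per_task else 0.0
    scores = {"accuracy": accuracy, "per_task": per_task}
    Path(scores_path).write_text(json.dumps(scores, indent=2))
    return scores


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--results-dir", required=True)
    parser.add_argument("--tasks-dir", required=True)
    parser.add_argument("--scores", required=True)
    args = parser.parse_args(argv)
    score(args.results_dir, args.tasks_dir, args.scores)


if __name__ == "__main__":
    main()
```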
Project Structure
.harness-evolver/                # Created in your project by /harness-evolve-init
├── config.json                  # Project config (harness cmd, eval cmd, evolution params)
├── summary.json                 # Source of truth (versions, scores, parents)
├── STATE.md                     # Human-readable status (generated)
├── PROPOSER_HISTORY.md          # Log of all proposals and outcomes
├── baseline/                    # Original harness (read-only reference)
│   ├── harness.py
│   └── config.json
├── eval/
│   ├── eval.py                  # Scoring script
│   └── tasks/                   # Test cases (JSON files)
└── harnesses/
    └── v001/
        ├── harness.py           # Candidate code
        ├── config.json          # Evolvable parameters
        ├── proposal.md          # Proposer's reasoning
        ├── scores.json          # Evaluation results
        └── traces/              # Full execution traces
            ├── stdout.log
            ├── stderr.log
            ├── timing.json
            └── task_001/
                ├── input.json   # What the harness received
                └── output.json  # What the harness returned
Plugin Architecture
Three-layer design inspired by GSD:
Layer 1: Skills + Agents (markdown) → AI orchestration
Layer 2: Tools (Python stdlib-only) → Deterministic operations
Layer 3: Installer (Node.js) → Distribution via npx
| Component | Files | Purpose |
|---|---|---|
| Skills | skills/harness-evolve-init/, skills/harness-evolve/, skills/harness-evolve-status/ | Slash commands that orchestrate the loop |
| Agent | agents/harness-evolver-proposer.md | The proposer: 4-phase workflow (orient, diagnose, propose, document) with 6 rules |
| Tools | tools/evaluate.py, tools/state.py, tools/init.py, tools/detect_stack.py, tools/trace_logger.py | CLI tools called via subprocess; zero LLM tokens spent on deterministic work |
| Installer | bin/install.js, package.json | Copies skills/agents/tools to the right locations |
| Example | examples/classifier/ | 10-task medical classifier with mock mode |
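To illustrate what "called via subprocess" could mean for a deterministic tool, here is a sketch that runs one harness invocation and saves stdout, stderr, and timing. It is not the code of tools/evaluate.py: the function name, its parameters, and the keys written to timing.json are assumptions; only the log file names mirror the project structure.

```python
import json
import subprocess
import sys
import time
from pathlib import Path


def run_task(harness_cmd, task_path, output_path, trace_dir, timeout=60):
    """Run the harness once and record stdout, stderr, and wall-clock time.

    harness_cmd is a list such as [sys.executable, "harness.py"].
    Raises subprocess.TimeoutExpired if the harness exceeds `timeout` seconds.
    """
    trace_dir = Path(trace_dir)
    trace_dir.mkdir(parents=True, exist_ok=True)
    cmd = [*harness_cmd, "--input", str(task_path), "--output", str(output_path)]

    start = time.perf_counter()
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    elapsed = time.perf_counter() - start

    # Persist everything the proposer might later grep or diff.
    (trace_dir / "stdout.log").write_text(proc.stdout)
    (trace_dir / "stderr.log").write_text(proc.stderr)
    (trace_dir / "timing.json").write_text(
        json.dumps({"seconds": elapsed, "returncode": proc.returncode})
    )
    return proc.returncode
```

Because the work happens in a plain subprocess, the harness can be written in any language as long as it honors the CLI flags.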
Integrations
LangSmith (optional)
If LANGSMITH_API_KEY is set, the plugin automatically:
- Enables LANGCHAIN_TRACING_V2 for auto-tracing of LangChain/LangGraph harnesses
- Detects langsmith-cli for the proposer to query traces directly
# Setup
export LANGSMITH_API_KEY=lsv2_...
uv tool install langsmith-cli && langsmith-cli auth login
# The proposer can then do:
langsmith-cli --json runs list --project harness-evolver-v003 --failed --fields id,name,error
langsmith-cli --json runs stats --project harness-evolver-v003
No custom API client: the proposer uses langsmith-cli like it uses grep and diff.
Context7 (optional)
The plugin detects the harness's technology stack via AST analysis (17 libraries supported) and instructs the proposer to consult current documentation before proposing API changes.
# Setup
claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
# The proposer automatically:
# 1. Reads config.json → stack.detected (e.g., LangChain, ChromaDB)
# 2. Queries Context7 for current docs before writing code
# 3. Annotates proposal.md with "API verified via Context7"
Without Context7, the proposer uses model knowledge and annotates "API not verified against current docs."
LangChain Docs MCP (optional)
claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
Complements Context7 with LangChain/LangGraph/LangSmith-specific documentation search.
The Proposer
The proposer agent is the core of the system. It follows a 4-phase workflow derived from the Meta-Harness paper:
| Phase | Context % | What it does |
|---|---|---|
| Orient | ~6% | Read summary.json and PROPOSER_HISTORY.md. Decide which 2-3 versions to investigate. |
| Diagnose | ~80% | Deep trace analysis on selected versions. grep for errors, diff between good/bad versions, counterfactual diagnosis. |
| Propose | ~10% | Write new harness.py + config.json. Prefer additive changes after regressions. |
| Document | ~4% | Write proposal.md with evidence. Append to PROPOSER_HISTORY.md. |
6 rules:
- Every change motivated by evidence (cite task ID, trace line, or score delta)
- After regression, prefer additive changes
- Don't repeat past mistakes (read PROPOSER_HISTORY.md)
- One hypothesis at a time when possible
- Maintain the CLI interface
- Prefer readable harnesses over defensive ones
Supported Libraries (Stack Detection)
The AST-based stack detector recognizes 17 libraries:
| Category | Libraries |
|---|---|
| AI Frameworks | LangChain, LangGraph, LlamaIndex, OpenAI, Anthropic, DSPy, CrewAI, AutoGen |
| Vector Stores | ChromaDB, Pinecone, Qdrant, Weaviate |
| Web | FastAPI, Flask, Pydantic |
| Data | Pandas, NumPy |
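AST-based detection of this kind can be approximated with Python's standard ast module by collecting top-level import names. The sketch below is not the code of tools/detect_stack.py: the function name and the import-name mapping (a partial, hypothetical subset) are assumptions.

```python
import ast

# Hypothetical subset of an import-name to library mapping; the real
# detector's table is not shown in this README.
KNOWN_LIBRARIES = {
    "langchain": "LangChain",
    "langgraph": "LangGraph",
    "openai": "OpenAI",
    "anthropic": "Anthropic",
    "chromadb": "ChromaDB",
    "pandas": "Pandas",
    "numpy": "NumPy",
}


def detect_stack(source):
    """Return the sorted list of known libraries imported by `source`."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from .utils import x"
            modules = [node.module]
        else:
            continue
        for module in modules:
            top_level = module.split(".")[0]
            if top_level in KNOWN_LIBRARIES:
                found.add(KNOWN_LIBRARIES[top_level])
    return sorted(found)
```

Parsing the AST rather than matching text avoids false positives from library names that appear only in comments or strings.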
Development
# Run all tests (41 tests, stdlib-only, no pip install needed)
python3 -m unittest discover -s tests -v
# Test the example manually
cd examples/classifier
python3 harness.py --input tasks/task_001.json --output /tmp/result.json --config config.json
cat /tmp/result.json
# Run the installer locally
node bin/install.js
Comparison with Related Work
| | Meta-Harness (paper) | A-Evolve | ECC /evolve | Harness Evolver |
|---|---|---|---|---|
| Format | Paper artifact | Framework (Docker) | Plugin (passive) | Plugin (active) |
| Search space | Code-space | Code-space | Prompt-space | Code-space |
| Context/iter | 10M tokens | Variable | N/A | Full filesystem |
| Domain | TerminalBench-2 | Coding benchmarks | Dev workflow | Any domain |
| Install | Manual Python | Docker CLI | /plugin install | npx or /plugin install |
| LangSmith | No | No | No | Yes (langsmith-cli) |
| Context7 | No | No | No | Yes (MCP) |
References
- Meta-Harness: End-to-End Optimization of Model Harnesses — Lee et al., 2026
- GSD (Get Shit Done) — CLI architecture inspiration
- LangSmith CLI — Trace analysis for the proposer
- Context7 — Documentation lookup via MCP
- Design Spec
- LangSmith Integration Spec
- Context7 Integration Spec
License
MIT