Harness Evolver

End-to-end optimization of LLM agent harnesses, inspired by Meta-Harness (Lee et al., 2026).

The harness is the 80% factor. Changing just the scaffolding around a fixed LLM can produce a 6x performance gap on the same benchmark. Harness Evolver automates the search for better harnesses using an autonomous propose-evaluate-iterate loop with full execution traces as feedback.

Install

npx harness-evolver@latest

Select your runtime (Claude Code, Cursor, Codex, Windsurf) and scope (global/local). Then restart your AI coding agent for the skills to appear.

Prerequisites

API Keys (set in your shell before launching Claude Code)

The harness you're evolving may call LLM APIs. Set the keys your harness needs:

# Required: at least one LLM provider
export ANTHROPIC_API_KEY="sk-ant-..."       # For Claude-based harnesses
export OPENAI_API_KEY="sk-..."              # For OpenAI-based harnesses
export GEMINI_API_KEY="AIza..."             # For Gemini-based harnesses
export OPENROUTER_API_KEY="sk-or-..."       # For OpenRouter (multi-model)

# Optional: enhanced tracing
export LANGSMITH_API_KEY="lsv2_pt_..."      # Auto-enables LangSmith tracing

The plugin auto-detects which keys are available during /harness-evolver:init and shows them. The proposer agent knows which APIs are available and uses them accordingly.
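As a rough illustration of what key detection could look like, the sketch below checks the environment variables listed above and reports which ones are set. The function name and the label mapping are hypothetical; this is not the plugin's actual detection code.

```python
import os

# Environment variable names from the list above; the labels are illustrative.
PROVIDER_KEYS = {
    "ANTHROPIC_API_KEY": "Anthropic",
    "OPENAI_API_KEY": "OpenAI",
    "GEMINI_API_KEY": "Gemini",
    "OPENROUTER_API_KEY": "OpenRouter",
    "LANGSMITH_API_KEY": "LangSmith (tracing)",
}


def detect_api_keys(env=None):
    """Return the labels of all providers whose key is set and non-empty."""
    env = os.environ if env is None else env
    return [
        label
        for name, label in PROVIDER_KEYS.items()
        if env.get(name, "").strip()
    ]


if __name__ == "__main__":
    found = detect_api_keys()
    print("Detected:", ", ".join(found) if found else "none")
```

Running it in the shell where you plan to launch your coding agent is a quick way to confirm the keys are exported before you run init.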

The classifier example needs no API key: it uses keyword matching (mock mode) and makes no LLM calls.

Optional: Enhanced Integrations

# LangSmith — rich trace analysis for the proposer
uv tool install langsmith-cli && langsmith-cli auth login

# Context7 — up-to-date library documentation for the proposer
claude mcp add context7 -- npx -y @upstash/context7-mcp@latest

# LangChain Docs — LangChain/LangGraph-specific documentation
claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp

Quick Start

Try the Example (no API key needed)

# 1. Copy the example
cp -r ~/.harness-evolver/examples/classifier ./my-classifier
cd my-classifier

# 2. Open Claude Code
claude

# 3. Initialize — auto-detects harness.py, eval.py, tasks/
/harness-evolver:init

# 4. Run the evolution loop
/harness-evolver:evolve --iterations 3

# 5. Check progress
/harness-evolver:status

Use with Your Own Project

cd my-llm-project
claude

# Init scans your project, identifies the entry point,
# and helps create harness wrapper + eval + tasks if missing
/harness-evolver:init

# Run optimization
/harness-evolver:evolve --iterations 10

The init skill adapts to your project. If your entry point is graph.py instead of harness.py, it creates a thin wrapper around it. If you don't have an eval script, it helps you write one.

Available Commands

| Command | What it does |
|---|---|
| `/harness-evolver:init` | Scan project, create harness/eval/tasks, run baseline |
| `/harness-evolver:evolve` | Run the autonomous optimization loop |
| `/harness-evolver:status` | Show progress (scores, iterations, stagnation) |
| `/harness-evolver:compare` | Diff two versions with per-task analysis |
| `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
| `/harness-evolver:deploy` | Copy the best harness back to your project |

How It Works

              ┌──────────────────────────────┐
              │   /harness-evolver:evolve    │
              │     (orchestrator skill)     │
              └──────────────┬───────────────┘
                             │
           ┌─────────────────┼─────────────────┐
           ▼                 ▼                 ▼
    ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
    │ PROPOSE     │   │ EVALUATE    │   │ UPDATE      │
    │ proposer    │   │ evaluate.py │   │ state.py    │
    │ agent       │   │ + eval.py   │   │             │
    └─────────────┘   └─────────────┘   └─────────────┘
           │                 │                 │
           ▼                 ▼                 ▼
    harnesses/        traces/           summary.json
    v{N}/             per-task          STATE.md
    harness.py        stdout/stderr     PROPOSER_HISTORY.md
    proposal.md       timing.json
    scores.json

Propose: A proposer agent reads all prior candidates' code, execution traces, and scores. It diagnoses failure modes via counterfactual analysis and writes a new harness.
Evaluate: The harness runs against every task. Traces are captured per task (input, output, stdout, stderr, timing), and your eval script scores the results.
Update: State files are updated with the new score, parent lineage, and regression detection results.
Repeat: The loop continues until it reaches N iterations, stagnates (3 rounds without a >1% improvement), or hits the target score.
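The stopping rule in the last step can be sketched as a small function. This is one possible reading, not the plugin's implementation: it assumes non-negative scores, treats the 1% threshold as a relative gain over the best score seen before the last three rounds, and uses hypothetical names (`stop_reason`, `patience`, `min_gain`).

```python
def stop_reason(scores, max_iterations, target=None, patience=3, min_gain=0.01):
    """Decide whether the loop should stop.

    scores[0] is the baseline; each later entry is one iteration's score.
    Returns "target_reached", "max_iterations", "stagnation", or None.
    """
    if not scores:
        raise ValueError("scores must include at least the baseline")
    iterations = len(scores) - 1

    if target is not None and max(scores) >= target:
        return "target_reached"
    if iterations >= max_iterations:
        return "max_iterations"
    if iterations >= patience:
        # Assumption: "improvement" means beating the earlier best by more
        # than min_gain, measured relative to that earlier best.
        best_before = max(scores[:-patience])
        best_recent = max(scores[-patience:])
        if best_recent <= best_before * (1 + min_gain):
            return "stagnation"
    return None
```

For example, `stop_reason([0.5, 0.6, 0.6, 0.6, 0.6], max_iterations=10)` reports stagnation, because none of the last three rounds beat the earlier best of 0.6 by more than 1%.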

The Harness Contract

A harness is any executable that accepts:

python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]

--input: JSON with {id, input, metadata} (never sees expected answers)
--output: JSON with {id, output}
--traces-dir: optional directory for rich traces
--config: optional JSON with evolvable parameters
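A minimal harness that satisfies this contract might look like the sketch below. The flags and the `{id, input}` / `{id, output}` shapes come from the contract above. The `run_task` placeholder, the `default_label` config key, and the `trace.json` file name are assumptions made for illustration.

```python
import argparse
import json
import sys
from pathlib import Path


def run_task(task_input, config):
    """Placeholder logic; a real harness would call an LLM or pipeline here."""
    label = config.get("default_label", "unknown")  # hypothetical config key
    return f"{label}: {task_input}"


def main(argv):
    parser = argparse.ArgumentParser(description="Minimal harness sketch")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--traces-dir")
    parser.add_argument("--config")
    if not argv:
        parser.print_help()
        return
    args = parser.parse_args(argv)

    task = json.loads(Path(args.input).read_text())
    config = json.loads(Path(args.config).read_text()) if args.config else {}
    output = run_task(task["input"], config)

    # The result echoes the task id so the evaluator can match it up.
    Path(args.output).write_text(json.dumps({"id": task["id"], "output": output}))

    if args.traces_dir:
        traces = Path(args.traces_dir)
        traces.mkdir(parents=True, exist_ok=True)
        # Hypothetical trace file; write whatever helps later diagnosis.
        (traces / "trace.json").write_text(
            json.dumps({"id": task["id"], "config": config})
        )


if __name__ == "__main__":
    main(sys.argv[1:])
```

Keeping the task logic in `run_task` and the I/O in `main` means a proposer can rewrite the logic freely while the command-line interface stays stable.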

The eval script is also any executable:

python3 eval.py --results-dir results/ --tasks-dir tasks/ --scores scores.json

Works with any language, any framework, any domain.
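An eval script for this interface could be as simple as the exact-match scorer sketched below. Only the three flags come from the text above. The `expected` field in task files and the `combined_score` / `per_task` layout of scores.json are assumptions, so check what your project's init step actually generates.

```python
import argparse
import json
import sys
from pathlib import Path


def load_by_id(directory):
    """Map each JSON file's "id" field to its parsed contents."""
    records = {}
    for path in sorted(Path(directory).glob("*.json")):
        record = json.loads(path.read_text())
        records[record["id"]] = record
    return records


def score(results_dir, tasks_dir):
    """Exact-match scoring; a missing result counts as a failure."""
    tasks = load_by_id(tasks_dir)
    results = load_by_id(results_dir)
    per_task = {}
    for task_id, task in tasks.items():
        output = results.get(task_id, {}).get("output")
        # Assumption: each task file stores its answer under "expected".
        per_task[task_id] = 1.0 if output == task["expected"] else 0.0
    mean = sum(per_task.values()) / len(per_task) if per_task else 0.0
    return {"combined_score": mean, "per_task": per_task}


def main(argv):
    parser = argparse.ArgumentParser(description="Minimal eval sketch")
    parser.add_argument("--results-dir", required=True)
    parser.add_argument("--tasks-dir", required=True)
    parser.add_argument("--scores", required=True)
    if not argv:
        parser.print_help()
        return
    args = parser.parse_args(argv)
    scores = score(args.results_dir, args.tasks_dir)
    Path(args.scores).write_text(json.dumps(scores, indent=2))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Exact match suits classification-style tasks; for free-form output you would swap the comparison for a domain-specific metric.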

Project Structure (after init)

.harness-evolver/                     # Created by /harness-evolver:init
├── config.json                       # Project config (harness cmd, eval, API keys detected)
├── summary.json                      # Source of truth (versions, scores, parents)
├── STATE.md                          # Human-readable status
├── PROPOSER_HISTORY.md               # Log of all proposals and outcomes
├── baseline/                         # Original harness (read-only)
│   └── harness.py
├── eval/
│   ├── eval.py                       # Your scoring script
│   └── tasks/                        # Test cases
└── harnesses/
    └── v001/
        ├── harness.py                # Evolved candidate
        ├── proposal.md               # Why this version was created
        ├── scores.json               # How it scored
        └── traces/                   # Full execution traces
            ├── stdout.log
            ├── stderr.log
            ├── timing.json
            └── task_001/
                ├── input.json
                └── output.json
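Because every candidate lives in its own directory, the layout is easy to inspect with a short script. The sketch below relies only on the paths shown in the tree above and makes no assumption about what scores.json contains. The function name `collect_versions` is hypothetical.

```python
import json
from pathlib import Path


def collect_versions(root=".harness-evolver"):
    """Return {version_name: parsed scores.json, or None if not present}."""
    harnesses = Path(root) / "harnesses"
    versions = {}
    if not harnesses.is_dir():
        return versions
    for version_dir in sorted(p for p in harnesses.iterdir() if p.is_dir()):
        scores_path = version_dir / "scores.json"
        versions[version_dir.name] = (
            json.loads(scores_path.read_text()) if scores_path.is_file() else None
        )
    return versions


if __name__ == "__main__":
    for name, scores in collect_versions().items():
        print(name, scores if scores is not None else "(no scores.json)")
```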

The Proposer

The proposer is the core of the system. It follows a 4-phase workflow from the Meta-Harness paper:

| Phase | What it does |
|---|---|
| Orient | Read `summary.json` + `PROPOSER_HISTORY.md`. Pick 2-3 versions to investigate. |
| Diagnose | Deep trace analysis. grep for errors, diff versions, counterfactual diagnosis. |
| Propose | Write new harness. Prefer additive changes after regressions. |
| Document | Write `proposal.md` with evidence. Update history. |

The proposer follows 7 rules: make evidence-based changes, be conservative after a regression, don't repeat mistakes, test one hypothesis at a time, maintain the interface, prefer readability, and use the API keys available in the environment.

Integrations

LangSmith (optional, recommended for LangChain/LangGraph harnesses)

export LANGSMITH_API_KEY=lsv2_...
uv tool install langsmith-cli && langsmith-cli auth login

When the key is detected, the plugin sets LANGCHAIN_TRACING_V2=true automatically, so all LLM calls are traced. The proposer can then query traces directly via langsmith-cli:

langsmith-cli --json runs list --project harness-evolver-v003 --failed --fields id,name,error
langsmith-cli --json runs stats --project harness-evolver-v003

Context7 (optional, recommended for any library-heavy harness)

claude mcp add context7 -- npx -y @upstash/context7-mcp@latest

The plugin detects your stack via AST analysis (17 libraries: LangChain, LangGraph, OpenAI, Anthropic, ChromaDB, FastAPI, etc.) and instructs the proposer to consult current docs before proposing API changes.
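To give a feel for AST-based stack detection, here is a minimal sketch using Python's standard `ast` module. It is not the plugin's detector: the function name is hypothetical, and the mapping covers only the six libraries named above rather than the full list of 17.

```python
import ast

# Top-level module name -> display name (only the libraries named above).
KNOWN_LIBRARIES = {
    "langchain": "LangChain",
    "langgraph": "LangGraph",
    "openai": "OpenAI",
    "anthropic": "Anthropic",
    "chromadb": "ChromaDB",
    "fastapi": "FastAPI",
}


def detect_libraries(source):
    """Return the sorted display names of known libraries imported in source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]  # absolute "from x import y" only
        else:
            continue
        for module in modules:
            top_level = module.split(".")[0]
            if top_level in KNOWN_LIBRARIES:
                found.add(KNOWN_LIBRARIES[top_level])
    return sorted(found)
```

Parsing imports rather than searching text avoids false positives from comments and strings that merely mention a library name.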

Development

# Run all tests (41 tests, stdlib-only)
python3 -m unittest discover -s tests -v

# Test example manually
python3 examples/classifier/harness.py --input examples/classifier/tasks/task_001.json --output /tmp/result.json --config examples/classifier/config.json

# Install locally for development
node bin/install.js

Comparison

| | Meta-Harness | A-Evolve | ECC | Harness Evolver |
|---|---|---|---|---|
| Format | Paper artifact | Framework (Docker) | Plugin (passive) | Plugin (active) |
| Search | Code-space | Code-space | Prompt-space | Code-space |
| Domain | TerminalBench-2 | Coding benchmarks | Dev workflow | Any domain |
| Install | Manual Python | Docker CLI | `/plugin install` | `npx` |
| LangSmith | No | No | No | Yes |
| Context7 | No | No | No | Yes |

References

Meta-Harness paper (arXiv 2603.28052), Lee et al., 2026
Design Spec
LangSmith Integration
Context7 Integration

License

MIT