# Harness Evolver
LangSmith-native autonomous agent optimization. Point at any LLM agent codebase, and Harness Evolver will evolve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.
Inspired by Meta-Harness (Lee et al., 2026), which found that the scaffolding around an LLM can produce a 6x performance gap on the same benchmark. This plugin automates the search for better scaffolding.
## Install

### Claude Code Plugin (recommended)

```shell
/plugin marketplace add raphaelchristi/harness-evolver-marketplace
/plugin install harness-evolver
```

Updates are automatic. Python dependencies (`langsmith`, `langsmith-cli`) are installed on first session start via a hook.

### npx (first-time setup or non-Claude Code runtimes)

```shell
npx harness-evolver@latest
```

An interactive installer that configures the LangSmith API key, creates a Python venv, and installs all dependencies. Works with Claude Code, Cursor, Codex, and Windsurf.

Both install paths work together: use npx for initial setup (API key, venv), then let the plugin marketplace handle updates automatically.
## Quick Start

```shell
cd my-llm-project
export LANGSMITH_API_KEY="lsv2_pt_..."
claude

/evolver:setup   # explores project, configures LangSmith
/evolver:evolve  # runs the optimization loop
/evolver:status  # check progress
/evolver:deploy  # tag, push, finalize
```

## How It Works
| Feature | Description |
|---|---|
| LangSmith-Native | No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and an agent-based LLM-as-judge for scoring via langsmith-cli. No external API keys needed. Everything is visible in the LangSmith UI. |
| Real Code Evolution | Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically. |
| Self-Organizing Proposers | Each iteration generates dynamic investigation lenses from failure data, architecture analysis, production traces, and evolution memory. Proposers self-organize their approach — no fixed strategies. They can self-abstain when their contribution would be redundant. Inspired by Dochkina (2026). |
| Agent-Based Evaluation | The evaluator agent reads experiment outputs via langsmith-cli, judges correctness using the same Claude model powering the other agents, and writes scores back. No OpenAI API key or openevals dependency needed. |
| Production Traces | Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization. |
| Active Critic | Auto-triggers when scores jump suspiciously fast. Detects evaluator gaming AND implements stricter evaluators to close loopholes. |
| ULTRAPLAN Architect | Auto-triggers on stagnation. Runs with the Opus model for deep architectural analysis. Recommends topology changes (single-call to RAG, chain to ReAct, etc.). |
| Evolution Memory | Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which approaches win, which failures recur, and promotes insights after 2+ occurrences. |
| Dataset Health | Pre-flight dataset quality check: size adequacy, difficulty distribution, dead example detection, production coverage analysis, train/held-out splits. Auto-corrects issues before evolution starts. |
| Smart Gating | Claude assesses gate conditions directly — score plateau, target reached, diminishing returns. No hardcoded thresholds. State validation ensures config hasn't diverged from LangSmith. |
| Background Mode | Run all iterations in the background while you continue working. Get notified on completion or significant improvements. |
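For intuition about what "score plateau" and "diminishing returns" mean, here is a naive, hardcoded version of those gate conditions. This is illustrative only: the plugin has Claude assess the gates directly, with no fixed thresholds, and the function, window, and numbers below are made up for this sketch.

```python
# Illustrative only: a hardcoded stand-in for the gate conditions that
# Smart Gating assesses without fixed thresholds. Names/numbers are invented.
def should_stop(scores: list[float], target: float = 0.95,
                window: int = 3, min_gain: float = 0.01) -> bool:
    """Stop when the target score is reached or recent gains are negligible."""
    if scores and scores[-1] >= target:        # target reached
        return True
    if len(scores) > window:
        recent_gain = scores[-1] - scores[-1 - window]
        if recent_gain < min_gain:             # plateau / diminishing returns
            return True
    return False

print(should_stop([0.60, 0.72, 0.80, 0.96]))     # target reached -> True
print(should_stop([0.70, 0.705, 0.707, 0.708]))  # plateau -> True
print(should_stop([0.50, 0.60, 0.70]))           # still improving -> False
```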
## Commands

| Command | What it does |
|---|---|
| `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
| `/evolver:health` | Check dataset quality (size, difficulty, coverage, splits), auto-correct issues |
| `/evolver:evolve` | Run the optimization loop (dynamic self-organizing proposers in worktrees) |
| `/evolver:status` | Show progress, scores, history |
| `/evolver:deploy` | Tag, push, clean up temporary files |
## Agents
| Agent | Role | Color |
|---|---|---|
| Proposer | Self-organizing — investigates a data-driven lens, decides own approach, may abstain | Green |
| Evaluator | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
| Architect | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |
| Critic | Active — detects gaming AND implements stricter evaluators | Red |
| Consolidator | Cross-iteration memory consolidation (autoDream-inspired) | Cyan |
| TestGen | Generates test inputs + adversarial injection mode | Cyan |
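The Evaluator follows the standard LLM-as-judge pattern: render a grading prompt, send it to the model, parse a score out of the reply. A minimal generic sketch of that pattern is below; the prompt wording and the `SCORE:` convention are illustrative, not what the plugin's evaluator agent actually emits (it works through langsmith-cli).

```python
# Hedged sketch of the LLM-as-judge pattern. Prompt text and the SCORE line
# format are invented for this example; the real evaluator agent differs.
import re

JUDGE_PROMPT = """You are grading an agent's answer.
Question: {question}
Reference: {reference}
Answer: {answer}
Reply with a line "SCORE: <0 or 1>" and one sentence of reasoning."""

def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    return JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)

def parse_score(judge_reply: str) -> int:
    """Extract the binary correctness score from the judge model's reply."""
    m = re.search(r"SCORE:\s*([01])", judge_reply)
    if m is None:
        raise ValueError("judge reply did not contain a SCORE line")
    return int(m.group(1))

print(parse_score("SCORE: 1\nThe answer matches the reference."))  # -> 1
```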
## Evolution Loop

```text
/evolver:evolve
 |
 +- 0.5  Validate state (skeptical memory — check .evolver.json vs LangSmith)
 +- 0.6  /evolver:health — dataset quality check + auto-correct
 +- 1.   Read state (.evolver.json + LangSmith experiments)
 +- 1.5  Gather trace insights (cluster errors, tokens, latency)
 +- 1.8  Analyze per-task failures (train split only — proposers don't see held-out)
 +- 1.8a Claude generates strategy.md + lenses.json from analysis data
 +- 1.9  Prepare shared proposer context (KV cache-optimized prefix)
 +- 2.   Spawn N self-organizing proposers in parallel (each in a git worktree)
 +- 3.   Run target for each candidate (code-based evaluators)
 +- 3.5  Spawn evaluator agent (LLM-as-judge via langsmith-cli)
 +- 4.   Compare experiments -> select winner + per-task champion
 +- 5.   Merge winning worktree into main branch
 +- 5.5  Regression tracking (auto-add guard examples to dataset)
 +- 6.   Report results
 +- 6.2  Consolidator agent updates evolution memory (runs in background)
 +- 6.5  Auto-trigger Active Critic (detect + fix evaluator gaming)
 +- 7.   Auto-trigger ULTRAPLAN Architect (opus model, deep analysis)
 +- 8.   Claude assesses gate conditions (plateau, target, diminishing returns)
```

## Architecture
```text
Plugin hook (SessionStart)
 └→ Creates venv, installs langsmith + langsmith-cli, exports env vars

Skills (markdown)
 ├── /evolver:setup  → explores project, smart defaults, runs setup.py
 ├── /evolver:health → dataset quality check + auto-correct
 ├── /evolver:evolve → orchestrates the evolution loop
 ├── /evolver:status → reads .evolver.json + LangSmith
 └── /evolver:deploy → tags and pushes

Agents (markdown)
 ├── Proposer (xN) → self-organizing, lens-driven, isolated git worktrees
 ├── Evaluator     → LLM-as-judge via langsmith-cli
 ├── Critic        → detects gaming + implements stricter evaluators
 ├── Architect     → ULTRAPLAN deep analysis (opus model)
 ├── Consolidator  → cross-iteration memory (autoDream-inspired)
 └── TestGen       → generates test inputs + adversarial injection

Tools (Python + langsmith SDK)
 ├── setup.py              → creates datasets, configures evaluators
 ├── run_eval.py           → runs target against dataset
 ├── read_results.py       → compares experiments
 ├── trace_insights.py     → clusters errors from traces
 ├── seed_from_traces.py   → imports production traces
 ├── validate_state.py     → validates config vs LangSmith state
 ├── dataset_health.py     → dataset quality diagnostic (size, difficulty, coverage, splits)
 ├── regression_tracker.py → tracks regressions, adds guard examples
 ├── add_evaluator.py      → programmatically adds evaluators
 └── adversarial_inject.py → detects memorization, injects adversarial tests
```

## Requirements
- LangSmith account + `LANGSMITH_API_KEY`
- Python 3.10+
- Git (for worktree-based isolation)
- Claude Code (or Cursor/Codex/Windsurf)

Dependencies (`langsmith`, `langsmith-cli`) are installed automatically by the plugin hook or the npx installer.
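The worktree-based isolation mentioned above is plain git. A minimal sketch of the pattern (paths, branch names, and file contents here are illustrative, not what the plugin actually creates):

```shell
# Illustrative: one worktree per candidate, winner merged back into main.
repo=$(mktemp -d); wt="$repo-candidate-1"
git -C "$repo" init -q -b main
git -C "$repo" -c user.email=ci@example.com -c user.name=ci \
  commit --allow-empty -m "baseline" -q
# a proposer edits its own checkout without touching main
git -C "$repo" worktree add -b candidate-1 "$wt" >/dev/null 2>&1
echo "tuned prompt" > "$wt/prompt.txt"
git -C "$wt" add prompt.txt
git -C "$wt" -c user.email=ci@example.com -c user.name=ci \
  commit -m "candidate 1" -q
# the winning candidate merges back into main (fast-forward here)
git -C "$repo" merge -q candidate-1
cat "$repo/prompt.txt"   # -> tuned prompt
```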
## Framework Support

LangSmith traces any AI framework, so the evolver works with all of them:

| Framework | LangSmith Tracing |
|---|---|
| LangChain / LangGraph | Auto (env vars only) |
| OpenAI SDK | `wrap_openai()` (2 lines) |
| Anthropic SDK | `wrap_anthropic()` (2 lines) |
| CrewAI / AutoGen | OpenTelemetry (~10 lines) |
| Any Python code | `@traceable` decorator |
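The last row is the catch-all: LangSmith's `@traceable` decorator can wrap any Python function. A minimal sketch (the `answer` function is a placeholder for your agent; the `ImportError` fallback only exists to keep this sketch runnable where `langsmith` is not installed):

```python
# Hedged sketch: tracing arbitrary Python with LangSmith's @traceable.
# Tracing activates when LANGSMITH_TRACING and LANGSMITH_API_KEY are set;
# without them the decorated function still runs normally.
try:
    from langsmith import traceable
except ImportError:
    def traceable(fn):  # no-op stand-in so this sketch runs anywhere
        return fn

@traceable  # each call becomes a run in your LangSmith project when tracing is on
def answer(question: str) -> str:
    # placeholder for your real agent logic
    return f"echo: {question}"

print(answer("hello"))  # -> echo: hello
```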
## References
- Meta-Harness: End-to-End Optimization of Model Harnesses — Lee et al., 2026
- Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures — Dochkina, 2026
- Darwin Godel Machine — Sakana AI
- AlphaEvolve — DeepMind
- LangSmith Evaluation — LangChain
- Harnessing Claude's Intelligence — Martin, Anthropic, 2026
- Traces Start the Agent Improvement Loop — LangChain
## License
MIT