
    Harness Evolver

    npm | License: MIT | Paper | Built by Raphael Valdetaro

    LangSmith-native autonomous agent optimization. Point at any LLM agent codebase, and Harness Evolver will evolve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.

    Inspired by Meta-Harness (Lee et al., 2026). The scaffolding around an LLM can produce a 6x performance gap on the same benchmark; this plugin automates the search for better scaffolding.


    Install

    /plugin marketplace add raphaelchristi/harness-evolver-marketplace
    /plugin install harness-evolver

    Updates are automatic. Python dependencies (langsmith, langsmith-cli) are installed on first session start via hook.

    npx (first-time setup or non-Claude Code runtimes)

    npx harness-evolver@latest

    Interactive installer that configures LangSmith API key, creates Python venv, and installs all dependencies. Works with Claude Code, Cursor, Codex, and Windsurf.

    Both install paths work together. Use npx for initial setup (API key, venv), then the plugin marketplace handles updates automatically.


    Quick Start

    cd my-llm-project
    export LANGSMITH_API_KEY="lsv2_pt_..."
    claude
    
    /evolver:setup      # explores project, configures LangSmith
    /evolver:health     # check dataset quality (auto-corrects issues)
    /evolver:evolve     # runs the optimization loop
    /evolver:status     # check progress (rich ASCII chart)
    /evolver:deploy     # tag, push, finalize

    How It Works

    LangSmith-Native: No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and an agent-based LLM-as-judge for scoring via langsmith-cli. No external API keys needed. Everything is visible in the LangSmith UI.
    Real Code Evolution: Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically. Config files (.evolver.json, .env) are auto-propagated to worktrees.
    Self-Organizing Proposers: Two-wave spawning, where critical lenses run first, then medium/open lenses see wave 1 results (+14% quality). Dynamic investigation lenses from failure data, architecture analysis, production traces, evolution memory, and archive branching (revisit losing candidates). Proposers self-organize, self-abstain, and can fork from any ancestor — not just the current best. Inspired by Dochkina (2026) and the Darwin Gödel Machine.
    Rubric-Based Evaluation: Dataset examples support expected_behavior rubrics — specific criteria the judge evaluates against ("should mention null safety and Android development"), not just generic correctness. Partial scoring (0.5) for partially-met rubrics. Inspired by Hermes Agent Self-Evolution.
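    The partial-scoring rule can be sketched in a few lines. This is a minimal illustration only: the per-criterion verdicts (met / partial / unmet) and the function name are assumptions, not the plugin's actual API.

```python
# Hypothetical sketch of rubric-based partial scoring.
# Verdict labels and function name are illustrative assumptions.

def score_rubric(criteria_results):
    """Map per-criterion judge verdicts to an averaged score in [0, 1]."""
    points = {"met": 1.0, "partial": 0.5, "unmet": 0.0}
    if not criteria_results:
        return 0.0
    return sum(points[v] for v in criteria_results) / len(criteria_results)

# Two criteria fully met, one partially met:
print(score_rubric(["met", "met", "partial"]))
```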
    Constraint Gates: Proposals must pass hard constraints before merge: code growth ≤30%, entry point syntax valid, test suite passes. Candidates that fail are rejected and the next-best is tried. Prevents code bloat and broken merges.
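    The shipped constraint_check.py is the real implementation; the sketch below only illustrates the three hard constraints named above (function name, signature, and how test results are reported are assumptions):

```python
# Illustrative constraint gate: growth cap, syntax check, test suite.
import ast

def passes_gate(baseline_loc, candidate_loc, entry_source, tests_passed,
                max_growth=0.30):
    # Hard constraint 1: code growth must stay within 30% of baseline
    if candidate_loc > baseline_loc * (1 + max_growth):
        return False
    # Hard constraint 2: the entry point must be syntactically valid Python
    try:
        ast.parse(entry_source)
    except SyntaxError:
        return False
    # Hard constraint 3: the test suite must pass
    return tests_passed

print(passes_gate(100, 125, "def main():\n    pass\n", True))   # within limits
print(passes_gate(100, 140, "def main():\n    pass\n", True))   # rejected: >30% growth
```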
    Weighted Evaluators + Pareto: Configure evaluator_weights to prioritize what matters (e.g., correctness 50%, latency 30%). When candidates offer genuinely different tradeoffs, the Pareto front is reported instead of forcing a single winner.
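    Weighted scoring and a Pareto front fit in a short sketch. The weights and candidate scores below are made up for illustration, and the function names are not the plugin's API:

```python
# Weighted aggregate score plus a Pareto front over per-evaluator scores.

def weighted_score(scores, weights):
    return sum(scores[k] * w for k, w in weights.items())

def pareto_front(candidates):
    """Keep candidates that no other candidate dominates on every evaluator."""
    front = []
    for name, s in candidates.items():
        dominated = any(
            all(o[k] >= s[k] for k in s) and any(o[k] > s[k] for k in s)
            for other, o in candidates.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front

weights = {"correctness": 0.5, "latency": 0.3, "style": 0.2}
candidates = {
    "A": {"correctness": 0.9, "latency": 0.4, "style": 0.7},
    "B": {"correctness": 0.7, "latency": 0.9, "style": 0.7},
    "C": {"correctness": 0.6, "latency": 0.3, "style": 0.5},  # dominated by A
}
print(pareto_front(candidates))  # A and B trade off; C drops out
```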
    Agent-Based Evaluation: The evaluator agent reasons through justification BEFORE assigning scores (15-25% reliability improvement). Reads experiment outputs via langsmith-cli, judges correctness using rubrics when available, writes scores back. Judge feedback surfaced to proposers for targeted mutations. Position bias mitigation built-in. Few-shot self-improvement from human corrections via LangSmith annotation feedback. Pairwise head-to-head comparison when top candidates are within 5%.
    Canary Preflight: Before running the full evaluation, 1 example is tested as a canary. If the agent produces no output, evaluation stops immediately — no API quota wasted on broken agents.
    Secret Detection: Detects 15+ secret patterns (API keys, tokens, PEM keys) in production traces and dataset examples. Secrets are filtered from seed_from_traces and flagged as critical issues in dataset health checks.
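    Pattern-based secret detection reduces to a regex scan. The shipped secret_filter.py covers 15+ patterns; the sketch below shows only two illustrative ones, and the exact regexes are assumptions:

```python
# Minimal pattern-based secret scan (two illustrative patterns only).
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                 # OpenAI-style API key
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]

def contains_secret(text):
    return any(p.search(text) for p in SECRET_PATTERNS)

print(contains_secret("export OPENAI_API_KEY=sk-" + "a" * 24))  # flagged
print(contains_secret("no secrets here"))                       # clean
```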
    Evolution Chart: Rich ASCII visualization with ANSI colors: sparkline trend, score progression table (per-evaluator breakdown), what-changed narrative, horizontal bar chart, and code growth tracking with warnings.
    Production Traces: Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization. Can also mine Claude Code session history for eval data.
    Active Critic: Auto-triggers when scores jump suspiciously fast. Detects evaluator gaming AND implements stricter evaluators to close loopholes.
    ULTRAPLAN Architect: Auto-triggers on stagnation. Runs with Opus model for deep architectural analysis. Recommends topology changes (single-call to RAG, chain to ReAct, etc.).
    Evolution Memory: Anchored iterative summarization — promoted insights (rec >= 3) are immutable anchors never re-summarized. New observations use literal text from proposals. Garbage collection removes stale observations. Inspired by Claude Code's autoDream and Context Engineering research.
    Dataset Health: Integrated preflight runs 5 checks in one pass: API key, config schema, LangSmith state, dataset health (size, difficulty, splits, secrets), and entry point canary. Reports all issues at once.
    Smart Gating: Claude assesses gate conditions directly — score plateau, target reached, diminishing returns. Holdout enforcement ensures final comparison uses unseen data. Baseline is re-scored with LLM-judge before the loop to prevent inflated starting scores.
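    A score-plateau condition like the one above can be sketched as a simple window check. The window size, epsilon, and function name here are illustrative assumptions; in the plugin Claude assesses the gate directly rather than running a fixed formula:

```python
# Illustrative plateau check: stop when recent best scores barely improve.

def plateaued(scores, window=3, epsilon=0.01):
    """True when the last `window` iterations gained less than `epsilon`."""
    if len(scores) < window + 1:
        return False
    recent = scores[-(window + 1):]
    return max(recent) - recent[0] < epsilon

print(plateaued([0.60, 0.71, 0.712, 0.713, 0.714]))  # gains have flattened
print(plateaued([0.5, 0.6, 0.7, 0.8, 0.9]))          # still improving
```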
    Background Mode: Run all iterations in background while you continue working. Get notified on completion or significant improvements.

    Commands

    Command           What it does
    /evolver:setup    Explore project, configure LangSmith (dataset, evaluators), run baseline
    /evolver:health   Check dataset quality (size, difficulty, coverage, splits, secrets), auto-correct
    /evolver:evolve   Run the optimization loop (dynamic self-organizing proposers in worktrees)
    /evolver:status   Show progress with rich ASCII evolution chart
    /evolver:deploy   Tag, push, clean up temporary files

    Agents

    Agent         Role                                                                                  Color
    Proposer      Self-organizing — investigates a data-driven lens, decides own approach, may abstain  Green
    Evaluator     LLM-as-judge — rubric-aware scoring via langsmith-cli, textual feedback               Yellow
    Architect     ULTRAPLAN mode — deep topology analysis with Opus model                               Blue
    Critic        Active — detects gaming AND implements stricter evaluators                            Red
    Consolidator  Cross-iteration memory consolidation (autoDream-inspired)                             Cyan
    TestGen       Generates test inputs with rubrics + adversarial injection mode                       Cyan

    Evolution Loop

    /evolver:evolve
      |
      +- 0.5  Validate state (check .evolver.json vs LangSmith)
      +- 0.6  /evolver:health — dataset quality + secret scan + auto-correct
      +- 0.7  Baseline LLM-judge — re-score baseline with correctness if only has_output exists
      +- 1.   Read state (.evolver.json + LangSmith experiments)
      +- 1.5  Gather trace insights + judge feedback (cluster errors, tokens, latency)
      +- 1.8  Analyze per-task failures with judge comments (train split only)
      +- 1.8a Claude generates strategy.md + lenses.json (incl. archive_branch lens)
      +- 1.9  Prepare shared proposer context (KV cache-optimized prefix)
      +- 2.   Wave 1: spawn critical/high proposers in parallel worktrees
      +- 2.5  Wave 2: medium/open proposers see wave 1 results before starting
      +- 3.   Copy config to worktrees, run canary, evaluate candidates
      +- 3.5  Spawn evaluator agent (rubric-aware, few-shot calibrated LLM-as-judge)
      +- 4.   Compare on held-out split -> winner + Pareto front + pairwise if close
      +- 4.5  Constraint gate — reject candidates that break size/tests/entry-point
      +- 5.   Merge winning worktree into main branch
      +- 5.5  Archive ALL candidates (winners + losers) to evolution_archive/
      +- 5.6  Regression tracking + auto-guard failures
      +- 6.   Report results + evolution chart
      +- 6.2  Consolidator agent updates evolution memory (runs in background)
      +- 6.5  Auto-trigger Active Critic (detect + fix evaluator gaming)
      +- 7.   Auto-trigger ULTRAPLAN Architect (opus model, deep analysis)
      +- 8.   Claude assesses gate conditions (plateau, target, diminishing returns)

    Architecture

    Plugin hook (SessionStart)
      └→ Creates venv, installs langsmith + langsmith-cli, exports env vars
    
    Skills (markdown)
      ├── /evolver:setup    → explores project, smart defaults, runs setup.py
      ├── /evolver:health   → dataset quality + secret scan + auto-correct
      ├── /evolver:evolve   → orchestrates the evolution loop
      ├── /evolver:status   → rich ASCII evolution chart + stagnation detection
      └── /evolver:deploy   → tags and pushes
    
    Agents (markdown)
      ├── Proposer (xN)     → self-organizing, lens-driven, isolated git worktrees
      ├── Evaluator          → rubric-aware LLM-as-judge via langsmith-cli
      ├── Critic             → detects gaming + implements stricter evaluators
      ├── Architect          → ULTRAPLAN deep analysis (opus model)
      ├── Consolidator       → cross-iteration memory (autoDream-inspired)
      └── TestGen            → generates test inputs with rubrics + adversarial injection
    
    Tools (Python)
      ├── setup.py              → creates datasets, configures evaluators + weights
      ├── run_eval.py           → runs target against dataset (canary preflight, {input_text})
      ├── read_results.py       → weighted scoring, Pareto front, judge feedback
      ├── trace_insights.py     → clusters errors from traces
      ├── seed_from_traces.py   → imports production traces (secret-filtered)
      ├── evolution_chart.py    → rich ASCII chart (stdlib-only)
      ├── constraint_check.py   → validates proposals (growth, syntax, tests) (stdlib-only)
      ├── secret_filter.py      → detects 15+ secret patterns (stdlib-only)
      ├── mine_sessions.py      → extracts eval data from Claude Code history (stdlib-only)
      ├── dataset_health.py     → dataset quality diagnostic + secret scanning
      ├── validate_state.py     → validates config vs LangSmith state
      ├── regression_tracker.py → tracks regressions, auto-adds failure guards
      ├── archive.py            → persistent candidate history (diffs, proposals, scores)
      ├── add_evaluator.py      → programmatically adds evaluators
      └── adversarial_inject.py → detects memorization, injects adversarial tests

    Entry Point Placeholders

    When configuring your agent's entry point during setup, use the placeholder that matches how your agent takes input:

    Placeholder    Behavior                                Use when
    {input_text}   Extracts plain text, shell-escapes it   Agent takes --query "text" or positional args
    {input}        Passes path to a JSON file              Agent reads structured JSON from file
    {input_json}   Passes raw JSON string inline           Agent parses JSON from command line

    Example:

    # Agent that takes a query as text:
    python agent.py --query {input_text}
    
    # Agent that reads a JSON file:
    python agent.py {input}
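    run_eval.py performs the substitution; the sketch below shows one plausible way the three placeholders could be expanded (the function name and details are assumptions, not the actual implementation):

```python
# Illustrative placeholder expansion for the three entry-point styles.
import json
import shlex
import tempfile

def expand_entry_point(template, example_input):
    if "{input_text}" in template:
        # Plain text, shell-escaped so spaces and quotes survive
        return template.replace("{input_text}", shlex.quote(str(example_input)))
    if "{input}" in template:
        # Write structured input to a temp JSON file and pass its path
        f = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
        json.dump(example_input, f)
        f.close()
        return template.replace("{input}", f.name)
    if "{input_json}" in template:
        # Raw JSON inline, escaped as a single shell argument
        return template.replace("{input_json}",
                                shlex.quote(json.dumps(example_input)))
    return template

print(expand_entry_point("python agent.py --query {input_text}", "fix the bug"))
```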

    Requirements

    • LangSmith account + LANGSMITH_API_KEY
    • Python 3.10+
    • Git (for worktree-based isolation)
    • Claude Code (or Cursor/Codex/Windsurf)

    Dependencies (langsmith, langsmith-cli) are installed automatically by the plugin hook or the npx installer.


    Framework Support

    LangSmith traces any AI framework. The evolver works with all of them:

    Framework              LangSmith Tracing
    LangChain / LangGraph  Auto (env vars only)
    OpenAI SDK             wrap_openai() (2 lines)
    Anthropic SDK          wrap_anthropic() (2 lines)
    CrewAI / AutoGen       OpenTelemetry (~10 lines)
    Any Python code        @traceable decorator

    References


    License

    MIT