LangSmith-native autonomous agent optimization for Claude Code


    Harness Evolver

    npm · License: MIT · Paper · Built by Raphael Valdetaro

    Point at any LLM agent codebase. Harness Evolver will autonomously improve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.


    Install

    /plugin marketplace add raphaelchristi/harness-evolver-marketplace
    /plugin install harness-evolver

    npx (first-time setup or non-Claude Code runtimes)

    npx harness-evolver@latest

    Works with Claude Code, Cursor, Codex, and Windsurf.


    Quick Start

    cd my-llm-project
    export LANGSMITH_API_KEY="lsv2_pt_..."
    claude
    
    /evolver:setup      # explores project, configures LangSmith
    /evolver:health     # check dataset quality (auto-corrects issues)
    /evolver:evolve     # runs the optimization loop
    /evolver:status     # check progress (rich ASCII chart)
    /evolver:deploy     # tag, push, finalize

    How It Works

    LangSmith-Native: No custom scripts. Uses LangSmith Datasets, Experiments, and LLM-as-judge. Everything visible in the LangSmith UI.
    Real Code Evolution: Proposers modify actual code in isolated git worktrees. Winners merge automatically.
    Self-Organizing Proposers: Two-wave spawning, dynamic lenses from failure data, archive branching from losing candidates. Self-abstention when redundant.
    Rubric-Based Evaluation: LLM-as-judge with justification-before-score, rubrics, few-shot calibration, pairwise comparison.
    Smart Gating: Constraint gates, regression guards, Pareto selection, holdout enforcement, stagnation detection.

    Full feature list


    Evolution Loop

    /evolver:evolve
      |
      +- 1. Preflight  (validate state + dataset health + baseline scoring)
      +- 2. Analyze    (trace insights + failure clusters + strategy synthesis)
      +- 3. Propose    (spawn N proposers in git worktrees, two-wave)
      +- 4. Evaluate   (canary → run target → LLM-as-judge → weighted scoring)
      +- 5. Select     (held-out comparison → Pareto front → constraint gate → merge)
      +- 6. Learn      (archive candidates + regression guards + evolution memory)
      +- 7. Gate       (plateau → target check → critic/architect → continue or stop)
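    The plateau check in step 7 (Gate) can be sketched as a comparison of recent best scores against earlier ones. This is a minimal illustration; the window size and improvement threshold are assumptions, not the plugin's actual values.

```python
def is_plateau(score_history, window=3, min_delta=0.01):
    """Detect stagnation: the best score over the last `window`
    iterations has not beaten the earlier best by at least min_delta."""
    if len(score_history) <= window:
        return False  # not enough history to judge
    recent_best = max(score_history[-window:])
    earlier_best = max(score_history[:-window])
    return recent_best - earlier_best < min_delta

scores = [0.60, 0.68, 0.71, 0.712, 0.709, 0.713]
stalled = is_plateau(scores)  # True: only +0.003 over the last 3 iterations
```

    When the check fires, the loop escalates to the critic/architect step rather than continuing to spend proposer budget on marginal gains.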

    Detailed loop with all sub-steps


    Agents

    Agent         Role
    Proposer      Self-organizing — investigates a data-driven lens, decides its own approach, may abstain
    Evaluator     LLM-as-judge — rubric-aware scoring via langsmith-cli, few-shot calibration
    Architect     ULTRAPLAN mode — deep topology analysis with the Opus model
    Critic        Active — detects evaluator gaming, implements stricter evaluators
    Consolidator  Cross-iteration memory — anchored summarization, garbage collection
    TestGen       Generates test inputs with rubrics + adversarial injection
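    The Evaluator's justification-before-score discipline can be enforced at parse time: reject any judge reply that emits a number without reasoning. A minimal sketch, assuming a simple "Justification: ... / Score: ..." reply format (the plugin's actual judge schema is not shown here):

```python
import re

def parse_judge_reply(reply):
    """Extract (justification, score) from a judge reply of the form:

        Justification: <free text>
        Score: <number>

    Raises ValueError if the justification is missing or empty, so a
    bare number can never slip through as a valid evaluation.
    """
    m = re.search(r"Justification:\s*(.+?)\s*Score:\s*(\d+(?:\.\d+)?)",
                  reply, re.DOTALL)
    if not m or not m.group(1).strip():
        raise ValueError("judge must justify before scoring")
    return m.group(1).strip(), float(m.group(2))

reply = ("Justification: Cites the right API but misses the edge case.\n"
         "Score: 7")
justification, score = parse_judge_reply(reply)  # score == 7.0
```

    Forcing the justification to come first also shapes the judge's generation: the score is conditioned on the written reasoning rather than produced as a snap verdict.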

    Requirements

    • LangSmith account + LANGSMITH_API_KEY
    • Python 3.10+ · Git · Claude Code (or Cursor/Codex/Windsurf)

    Dependencies are installed automatically by the plugin hook or the npx installer.

    LangSmith traces any AI framework: LangChain/LangGraph (auto), OpenAI/Anthropic SDK (wrap_*, 2 lines), CrewAI/AutoGen (OpenTelemetry), any Python (@traceable).


    License

    MIT