# Harness Evolver
LangSmith-native autonomous agent optimization. Point it at any LLM agent codebase and Harness Evolver will evolve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.
Inspired by Meta-Harness (Lee et al., 2026), which showed that the scaffolding around an LLM can produce a 6x performance gap on the same benchmark. This plugin automates the search for better scaffolding.
## Install
```shell
npx harness-evolver@latest
```

Works with Claude Code, Cursor, Codex, and Windsurf. Requires a LangSmith account and API key.
## Quick Start
```shell
cd my-llm-project
export LANGSMITH_API_KEY="lsv2_pt_..."
claude
/evolver:setup    # explores project, configures LangSmith
/evolver:evolve   # runs the optimization loop
/evolver:status   # check progress
/evolver:deploy   # tag, push, finalize
```

## How It Works
| Feature | What it means |
|---|---|
| LangSmith-Native | No custom eval scripts or task files. Test inputs live in LangSmith Datasets, results in Experiments, and scoring uses openevals LLM-as-judge Evaluators. Everything is visible in the LangSmith UI. |
| Real Code Evolution | Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree, and winners are merged automatically. |
| 5 Adaptive Proposers | Each iteration spawns 5 parallel agents: exploit, explore, crossover, and 2 failure-targeted. Strategies adapt based on per-task analysis; quality-diversity selection preserves per-task champions. |
| Production Traces | Auto-discovers existing LangSmith production projects, using real user inputs for test generation and real error patterns for targeted optimization. |
| Critic | Auto-triggers when scores jump suspiciously fast and checks whether evaluators are being gamed. |
| Architect | Auto-triggers on stagnation and recommends topology changes (single-call to RAG, chain to ReAct, etc.). |
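The LangSmith-native flow above (Dataset for inputs, openevals judge for scoring, Experiment per candidate) can be sketched roughly as follows. This is an illustrative sketch, not the plugin's actual code: the dataset name, example content, and `run_agent` are placeholders, and the LangSmith calls are guarded so the snippet is inert without a real workspace.

```python
# Sketch of a LangSmith-native evaluation flow. `run_agent`, the dataset
# name, and the example content are hypothetical placeholders.
import os

def run_agent(inputs: dict) -> dict:
    """Stand-in for the candidate agent under test."""
    return {"answer": "4"}

# Guarded so the sketch does nothing without real LangSmith credentials.
if os.environ.get("LANGSMITH_API_KEY"):
    from langsmith import Client
    from openevals.llm import create_llm_as_judge
    from openevals.prompts import CORRECTNESS_PROMPT

    client = Client()

    # 1. Test inputs live in a LangSmith Dataset.
    dataset = client.create_dataset("evolver-demo-dataset")
    client.create_examples(
        dataset_id=dataset.id,
        examples=[{"inputs": {"question": "What is 2 + 2?"},
                   "outputs": {"answer": "4"}}],
    )

    # 2. Scoring is an openevals LLM-as-judge evaluator.
    correctness = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        feedback_key="correctness",
        model="openai:gpt-4o-mini",
    )

    # 3. Each candidate run becomes a LangSmith Experiment.
    client.evaluate(
        run_agent,
        data="evolver-demo-dataset",
        evaluators=[correctness],
        experiment_prefix="evolver-candidate",
    )
```

Every experiment produced this way is comparable in the LangSmith UI, which is what lets the evolver select winners without a custom eval harness.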
## Commands
| Command | What it does |
|---|---|
| `/evolver:setup` | Explore the project, configure LangSmith (dataset, evaluators), run a baseline |
| `/evolver:evolve` | Run the optimization loop (5 parallel proposers in worktrees) |
| `/evolver:status` | Show progress, scores, and history |
| `/evolver:deploy` | Tag, push, clean up temporary files |
## Agents
| Agent | Role | Color |
|---|---|---|
| Proposer | Modifies agent code in isolated worktrees based on trace analysis | Green |
| Architect | Recommends multi-agent topology changes | Blue |
| Critic | Validates evaluator quality, detects gaming | Red |
| TestGen | Generates test inputs for LangSmith datasets | Cyan |
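The worktree isolation that Proposers rely on can be sketched with plain git commands. Directory and branch names below are illustrative, not the plugin's actual layout; the snippet builds a throwaway repo so it is self-contained.

```shell
# Sketch of worktree-based candidate isolation (hypothetical names).
set -e
base=$(mktemp -d)
git init -q "$base/repo"
cd "$base/repo"
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -qm "baseline"

# One isolated checkout per proposer, so candidates cannot clobber each other.
git worktree add -q ../cand-1 -b evolver/cand-1
git worktree add -q ../cand-2 -b evolver/cand-2

# ... proposers edit code and are evaluated inside their own worktrees ...

# Merge the winner into the main branch, then discard the loser.
git merge -q evolver/cand-1
git worktree remove ../cand-2
git branch -q -D evolver/cand-2
git worktree list   # main checkout + the surviving candidate
```

Because each candidate lives on its own branch in its own directory, evaluating five proposals in parallel never touches the main checkout until a winner is merged.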
## Evolution Loop

```text
/evolver:evolve
|
+- 1.   Read state (.evolver.json + LangSmith experiments)
+- 1.5  Gather trace insights (cluster errors, tokens, latency)
+- 1.8  Analyze per-task failures (adaptive briefings)
+- 2.   Spawn 5 proposers in parallel (each in a git worktree)
+- 3.   Evaluate each candidate (client.evaluate() -> LangSmith experiments)
+- 4.   Compare experiments -> select winner + per-task champion
+- 5.   Merge winning worktree into main branch
+- 5.5  Test suite growth (add regression examples to dataset)
+- 6.   Report results
+- 6.5  Auto-trigger Critic (if score jumped > 0.3)
+- 7.   Auto-trigger Architect (if stagnation or regression)
+- 8.   Check stop conditions
```

## Requirements
- LangSmith account + `LANGSMITH_API_KEY`
- Python 3.10+ with the `langsmith` and `openevals` packages
- Git (for worktree-based isolation)
- Claude Code (or Cursor/Codex/Windsurf)
```shell
export LANGSMITH_API_KEY="lsv2_pt_..."
pip install langsmith openevals
```

## Framework Support
LangSmith traces any AI framework. The evolver works with all of them:
| Framework | LangSmith Tracing |
|---|---|
| LangChain / LangGraph | Automatic (env vars only) |
| OpenAI SDK | `wrap_openai()` (2 lines) |
| Anthropic SDK | `wrap_anthropic()` (2 lines) |
| CrewAI / AutoGen | OpenTelemetry (~10 lines) |
| Any Python code | `@traceable` decorator |
## References
- Meta-Harness: End-to-End Optimization of Model Harnesses — Lee et al., 2026
- Darwin Godel Machine — Sakana AI
- AlphaEvolve — DeepMind
- LangSmith Evaluation — LangChain
- Traces Start the Agent Improvement Loop — LangChain
## License
MIT