LangSmith-native autonomous agent optimization for Claude Code

    Harness Evolver

    npm · License: MIT · Paper · Built by Raphael Valdetaro

    Point at any LLM agent codebase. Harness Evolver will autonomously improve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.


    Install

    /plugin marketplace add raphaelchristi/harness-evolver-marketplace
    /plugin install harness-evolver

    npx (first-time setup or non-Claude Code runtimes)

    npx harness-evolver@latest

    Works with Claude Code, Cursor, Codex, and Windsurf.


    Quick Start

    cd my-llm-project
    export LANGSMITH_API_KEY="lsv2_pt_..."
    claude
    
    /harness:setup      # explores the project and configures LangSmith
    /harness:health     # checks dataset quality (auto-corrects issues)
    /harness:evolve     # runs the optimization loop
    /harness:status     # shows progress (rich ASCII chart)
    /harness:deploy     # tags, pushes, and finalizes

    What It Looks Like

    Tested on a RAG agent (Agno framework, Gemini 3.1 Flash Lite, light mode):

    xychart-beta
        title "agno-deepknowledge: 0.575 → 1.000 (+74%)"
        x-axis ["base", "v001", "v002", "v003", "v004", "v005", "v006", "v007"]
        y-axis "Correctness" 0 --> 1
        line [0.575, 0.575, 0.950, 0.950, 0.950, 0.950, 0.950, 1.0]
        bar [0.575, 0.333, 0.950, 0.720, 0.875, 0.680, 0.880, 1.0]

    | Iter | Score | Merged? | What the proposer did |
    |---|---|---|---|
    | baseline | 0.575 | | Original agent: hallucinations, broken tool calls, no retry logic |
    | v001 | 0.333 | Yes | Anti-hallucination prompt (100% correct when API responded, but 60% hit rate limits) |
    | v002 | 0.950 | Yes | Breakthrough: inlined 17-line KB into prompt, eliminated vector search entirely. 5.7x faster, zero rate limits |
    | v003 | 0.720 | No | Attempted hybrid retrieval; regressed, rejected by constraint gate |
    | v004 | 0.875 | No | Response completeness fix; improved one case but regressed others |
    | v005 | 0.680 | No | Reduced tool calls; broke edge cases, rejected |
    | v006 | 0.880 | Yes | Evolution memory insight: combined v001's anti-hallucination with one-shot example from archive |
    | v007 | 1.000 | Yes | One-shot example injection + rubric-aligned responses; perfect on held-out |

    The line shows best score (only goes up — regressions aren't merged). The bars show each candidate's raw score. 4 merged, 3 rejected by gate checks. Not every iteration improves — that's the point.
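The relationship between the two series can be checked directly: the line is the running maximum of the bar values. The following is a minimal sketch that uses only the numbers from the chart above; the variable names are illustrative and not part of Harness Evolver.

```python
from itertools import accumulate

# Raw candidate scores (the bars), copied from the chart: base, v001 ... v007.
raw_scores = [0.575, 0.333, 0.950, 0.720, 0.875, 0.680, 0.880, 1.0]

# The best-so-far score (the line) is the running maximum of the raw scores.
best_so_far = list(accumulate(raw_scores, max))

# Relative improvement from the baseline to the final best score, in percent.
improvement_pct = round((best_so_far[-1] - raw_scores[0]) / raw_scores[0] * 100)

print(best_so_far)
print(f"+{improvement_pct}%")
```

The computed percentage corresponds to the "+74%" in the chart title.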


    How It Works

    • LangSmith-Native: No custom scripts. Uses LangSmith Datasets, Experiments, and LLM-as-judge. Everything visible in the LangSmith UI.
    • Real Code Evolution: Proposers modify actual code in isolated git worktrees. Winners merge automatically.
    • Self-Organizing Proposers: Two-wave spawning, dynamic lenses from failure data, archive branching from losing candidates. Self-abstention when redundant.
    • Rubric-Based Evaluation: LLM-as-judge with justification-before-score, rubrics, few-shot calibration, pairwise comparison.
    • Smart Gating: Constraint gates, efficiency gate (cost/latency pre-merge), regression guards, Pareto selection, holdout enforcement, rate-limit early abort, stagnation detection.
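To illustrate what Pareto selection means in general, here is a minimal sketch of a Pareto front over score, cost, and latency. It is not the Harness Evolver implementation: the `Candidate` fields, the helper names, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical fields; the real tool may track different metrics.
    name: str
    score: float      # higher is better
    cost_usd: float   # lower is better
    latency_s: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = (a.score >= b.score and a.cost_usd <= b.cost_usd
                and a.latency_s <= b.latency_s)
    strictly_better = (a.score > b.score or a.cost_usd < b.cost_usd
                       or a.latency_s < b.latency_s)
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up example data, for illustration only.
candidates = [
    Candidate("a", score=0.90, cost_usd=0.02, latency_s=3.0),
    Candidate("b", score=0.80, cost_usd=0.01, latency_s=2.0),
    Candidate("c", score=0.70, cost_usd=0.03, latency_s=4.0),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

In this example, "c" is dropped because both other candidates beat it on every metric, while "a" and "b" each win on a different axis and so both stay on the front.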


    Evolution Loop

    /harness:evolve
      |
      +- 1. Preflight  (validate state + dataset health + baseline scoring)
      +- 2. Analyze    (trace insights + failure clusters + strategy synthesis)
      +- 3. Propose    (spawn N proposers in git worktrees, two-wave)
      +- 4. Evaluate   (canary → run target → auto-spawn LLM-as-judge → rate-limit abort)
      +- 5. Select     (held-out comparison → Pareto front → efficiency gate → constraint gate → merge)
      +- 6. Learn      (archive candidates + regression guards + evolution memory)
      +- 7. Gate       (plateau → target check → critic/architect → continue or stop)
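The plateau check in step 7 can be pictured with a small sketch. This is an assumption about how such a check might look, not the tool's actual logic: the function name `is_plateau` and the `patience` and `min_delta` parameters are hypothetical. The history used here is the best-score line from the chart in "What It Looks Like".

```python
def is_plateau(best_scores: list[float], patience: int = 3,
               min_delta: float = 0.0) -> bool:
    """Return True if the best score gained at most min_delta over the last
    `patience` iterations. Hypothetical helper, for illustration only."""
    if len(best_scores) <= patience:
        return False  # not enough history to judge
    return best_scores[-1] - best_scores[-1 - patience] <= min_delta


# Best-so-far scores for base, v001 ... v007, taken from the chart.
history = [0.575, 0.575, 0.950, 0.950, 0.950, 0.950, 0.950, 1.0]

print(is_plateau(history[:7]))  # history up to v006
print(is_plateau(history))      # full history including v007
```

With these settings the run would count as plateaued after v006 (no gain since v003) but not after v007, which shows why a loop that stops on the first plateau signal can miss a late improvement.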


    Agents

    | Agent | Role |
    |---|---|
    | Proposer | Self-organizing: investigates a data-driven lens, decides own approach, may abstain |
    | Evaluator | LLM-as-judge: rubric-aware scoring via langsmith-cli, few-shot calibration |
    | Architect | ULTRAPLAN mode: deep topology analysis with Opus model |
    | Critic | Active: detects evaluator gaming, implements stricter evaluators |
    | Consolidator | Cross-iteration memory: anchored summarization, garbage collection |
    | TestGen | Generates test inputs with rubrics + adversarial injection |

    Requirements

    • LangSmith account + LANGSMITH_API_KEY
    • Python 3.10+ · Git · Claude Code (or Cursor/Codex/Windsurf)

    Dependencies installed automatically by the plugin hook or npx installer.
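Before running setup, the requirements above can be verified with a short script. This is a hypothetical helper, not something shipped with the plugin; the function name `missing_requirements` is an assumption made for illustration.

```python
import os
import shutil
import sys
from typing import Mapping, Optional


def missing_requirements(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the names of unmet requirements (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = []
    if sys.version_info < (3, 10):
        missing.append("Python 3.10+")
    if shutil.which("git") is None:  # git must be on PATH
        missing.append("git")
    if not env.get("LANGSMITH_API_KEY"):
        missing.append("LANGSMITH_API_KEY")
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    print("OK" if not problems else "Missing: " + ", ".join(problems))
```

The script checks only the Python version, Git, and the API key; it does not verify that Claude Code or another supported runtime is installed.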

    LangSmith can trace any AI framework: LangChain/LangGraph (automatic), OpenAI/Anthropic SDKs (wrap_* wrappers, two lines of code), CrewAI/AutoGen (via OpenTelemetry), and any Python function (@traceable decorator).
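As a minimal sketch of the "any Python" case, the example below decorates a plain function with `traceable` from the `langsmith` SDK. The `answer` function is a hypothetical stand-in for an agent entry point, and the `ImportError` fallback is added here only so the sketch also runs without the SDK installed (in which case nothing is traced).

```python
try:
    from langsmith import traceable  # decorator provided by the langsmith SDK
except ImportError:
    # Fallback for this sketch only: a no-op decorator when the SDK is absent.
    def traceable(func=None, **_kwargs):
        if func is None:
            return lambda f: f
        return func


@traceable
def answer(question: str) -> str:
    # Hypothetical stand-in for a real agent call.
    return f"echo: {question}"


print(answer("What is in the knowledge base?"))
```

Whether a run is actually sent to LangSmith depends on the environment, typically `LANGSMITH_API_KEY` plus a tracing flag such as `LANGSMITH_TRACING`. For the OpenAI SDK, the analogous step is wrapping the client with `wrap_openai` from `langsmith.wrappers`.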


    Companion: LangSmith Tracing

    For full observability into what each proposer does during evolution (every file read, edit, and commit), install the LangSmith tracing plugin:

    /plugin marketplace add langchain-ai/langsmith-claude-code-plugins
    /plugin install langsmith-tracing@langsmith-claude-code-plugins

    With both plugins installed, the evolution loop traces to LangSmith as a hierarchy: iteration → proposers → tool calls.


    License

    MIT