  • License Apache-2.0

Standalone Node.js CLI port of Harbor's Terminus2 agent — runs entirely inside the task container.


    terminus-2-cli

    Standalone Node.js port of Harbor's Terminus2 agent — runs entirely inside the task container. Same loop semantics as the Python terminus-2, but the agent process and the tmux it drives now live on the same side of the docker boundary, eliminating per-command docker exec round-trips.

    npm install -g @zhuerle/terminus-2-cli

    Motivation

    Harbor's terminus-2 agent (Python) orchestrates from the host: every shell command the model issues reaches the task container through a docker exec round-trip, and every tmux pane capture costs another one. On reasoning-heavy benchmarks that is tens of round-trips per trial, adding ~12-18 seconds of pure infra overhead per step on top of the LLM call itself.

    terminus-2-cli keeps the loop, parsers, prompts, and trajectory format identical, but runs inside the task container as a single npm-installable Node.js binary. Tool execution becomes a direct tmux send-keys against the local pane, with no IPC and no docker socket, which brings pure infra overhead down to ~6.8 seconds per step (~2.8× faster at the architecture layer).

    The trajectory format is ATIF-v1.6 (forward-compatible superset of t2's ATIF-v1.2), so any tooling that consumes a t2 trajectory.json reads the cli's output unchanged.

    Performance comparison vs terminus-2

    Two model families, same tasks, same gateway, aligned thinking config.

    claude-opus-4-7 on terminal-bench-2 (89 tasks, n=20 concurrent, k=1)

    Anthropic Messages API via api-gateway.glm.ai, adaptive thinking + display:summarized + output_config.effort:max, max_new_tokens=32768, max_input_tokens=300000, max_turns=120, --agent-timeout 7200s, 4 cpus / 8 GB per trial. The 85 shared tasks both implementations finished are reported below.

    Metric | t2-cli (this) | t2 (Python) | Δ
    Accuracy | 70/85 = 0.824 | 61/85 = 0.718 | +10.6 pp
    Per-trial wall | 828 s | 1091 s | cli −24 %
    Per-step wall | 25.3 s | 31.0 s | cli −18 %
    LLM-call wall (gateway-bound, identical for both) | 18.0 s | 15.8 s | tied
    Pure infra wall / step (step − llm_mean) | 6.8 s | 18.8 s | cli 2.8× faster
    Pure infra wall / trial | 303 s | 531 s | cli saves 228 s/trial
    Cross-tab cli-only ✓ / t2-only ✓ | 10 | 1 | cli net +9 tasks
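    The Δ column can be re-derived from the raw figures in the table. The sketch below does that arithmetic for the rows whose inputs are all visible; the helper names are illustrative and not part of terminus-2-cli, and relative changes are taken against the t2 (Python) baseline, which is an assumption that matches the published percentages.

```javascript
// Re-derive the Δ column of the claude-opus-4-7 table from its raw figures.
// Helper names are hypothetical; only the input numbers come from the table.

// Percentage-point gap between two pass counts over the same task total.
function accuracyDeltaPp(cliPassed, t2Passed, total) {
  return ((cliPassed - t2Passed) / total) * 100;
}

// Relative change of the cli figure against the t2 baseline, in percent.
function relativeChangePct(cli, t2) {
  return ((cli - t2) / t2) * 100;
}

const accuracyGainPp = accuracyDeltaPp(70, 61, 85);   // accuracy row
const trialWallPct = relativeChangePct(828, 1091);    // per-trial wall row
const stepWallPct = relativeChangePct(25.3, 31.0);    // per-step wall row
const infraSpeedup = 18.8 / 6.8;                      // pure infra wall / step row
const infraSavedPerTrial = 531 - 303;                 // pure infra wall / trial row

console.log({
  accuracyGainPp: accuracyGainPp.toFixed(1),
  trialWallPct: trialWallPct.toFixed(0),
  stepWallPct: stepWallPct.toFixed(0),
  infraSpeedup: infraSpeedup.toFixed(1),
  infraSavedPerTrial,
});
```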

    GLM-4.8 on terminal-bench-2 (89 tasks, n=180, k=4 → 356 trials)

    SGLang OpenAI-compatible endpoint, no thinking. Both implementations hit the same endpoint at the same time.

    Metric | t2-cli v0.0.3 | t2 (Python) | Δ
    Accuracy | 0.500 (178/356) | 0.475 (169/356) | +2.5 pp
    Per-trial wall | 1830 s | 2227 s | cli −18 %
    Per-step wall | 51.3 s | 58.5 s | cli −12 %

    The architectural advantage is larger on opus-4-7 because the adaptive-thinking LLM call is much heavier per turn: every saved docker-exec round-trip translates more directly into wall-clock savings, and the extra max_turns headroom converts directly into accuracy on the "mark_task_complete near the cap" tasks.

    Full report: /workspace/swe-data/zhuerle/perf_compare/REPORT_CLI_VS_T2_中文综合.md.

    What's ported

    • Prompt templates (prompts/terminus-{xml,json}-plain.txt, copied verbatim from the Python repo)
    • Both parsers (XML plain + JSON plain), with the same warnings and auto-fixes
    • tmux session driver (script(1)-allocated PTY → tmux new-session -d), with send-keys chunking under the ~16 KB tmux command-buffer limit
    • Main agent loop with batched send + capture, observation feedback, task_complete confirmation, parser-error re-prompt
    • Per-step trajectory in ATIF-v1.6 shape, written to <logs-dir>/trajectory.json
    • 3-subagent context summarization (writes sibling trajectory.summarization-N-{summary,questions,answers}.json files, pivot-able from the parent trajectory's compaction marker)
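
    The send-keys chunking mentioned in the list above can be sketched as a byte-bounded splitter. This is a minimal illustration, not the package's actual code: chunkByBytes and sendKeysArgv are hypothetical names, the 12 KB default is an assumed safety margin under the approximate 16 KB limit, and the use of tmux's literal flag (-l) is an assumption about how the keys are delivered.

```javascript
// Split text into chunks whose UTF-8 size stays within maxBytes without
// cutting a code point in half. Names and the default limit are assumptions.
function utf8Length(codePoint) {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

function chunkByBytes(text, maxBytes = 12 * 1024) {
  if (maxBytes < 4) throw new RangeError("maxBytes must be at least 4");
  const chunks = [];
  let current = "";
  let currentBytes = 0;
  for (const ch of text) {                       // for...of iterates by code point
    const size = utf8Length(ch.codePointAt(0));
    if (current !== "" && currentBytes + size > maxBytes) {
      chunks.push(current);                      // flush before exceeding the limit
      current = "";
      currentBytes = 0;
    }
    current += ch;
    currentBytes += size;
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// Build the argv for one chunk; "--" stops option parsing so a chunk that
// starts with "-" is still treated as keys.
function sendKeysArgv(target, chunk) {
  return ["send-keys", "-t", target, "-l", "--", chunk];
}

console.log(chunkByBytes("echo héllo", 4).map((c) => sendKeysArgv("demo", c)));
```

    Each argv would then be passed to tmux one at a time, so no single invocation carries more than maxBytes of key data.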

    Anthropic Messages API support

    cli auto-detects Claude models by name (/^claude/i) and switches to POST /v1/messages with:

    • adaptive thinking (thinking:{type:"adaptive", display:"summarized"}), with output_config.effort honored
    • multi-turn signature roundtrip (each thinking block's signature is threaded back into the next request, so multi-turn behaviour matches Anthropic's expectation)
    • ephemeral prompt caching with a single breakpoint at messages[len-3] (openhands pattern, stays under Anthropic's 4-breakpoint cap)
    • anthropic-beta: interleaved-thinking-2025-05-14 header
    • per-call DEBUG dump to <logs-dir>/anthropic_raw_calls.jsonl (request body / SSE event histogram / response usage; toggle with debug_anthropic_raw_calls=false)
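
    As a rough illustration of the request shape described above, the sketch below builds a Messages body with adaptive thinking and a single ephemeral cache breakpoint at messages[len-3]. buildMessagesBody is a hypothetical helper, and the choice to attach cache_control to the last content block of that message and to set stream: true are assumptions; only the thinking, output_config.effort, and breakpoint position come from the list above.

```javascript
// Sketch of an Anthropic Messages request body. Hypothetical helper; the
// input messages are copied, never mutated.
function buildMessagesBody({ model, messages, maxTokens, effort, display = "summarized" }) {
  // Normalise content to block arrays so a block can carry cache_control.
  const normalised = messages.map((m) => ({
    role: m.role,
    content: typeof m.content === "string"
      ? [{ type: "text", text: m.content }]
      : m.content.map((block) => ({ ...block })),
  }));

  // Single breakpoint at messages[len-3]; skipped while the history is shorter.
  const idx = normalised.length - 3;
  if (idx >= 0) {
    const blocks = normalised[idx].content;
    if (blocks.length > 0) {
      const last = blocks[blocks.length - 1];
      blocks[blocks.length - 1] = { ...last, cache_control: { type: "ephemeral" } };
    }
  }

  const body = {
    model,
    max_tokens: maxTokens,
    stream: true,                                  // assumption: SSE streaming
    thinking: { type: "adaptive", display },
    messages: normalised,
  };
  if (effort) body.output_config = { effort };     // e.g. "high" or "max"
  return body;
}

const exampleBody = buildMessagesBody({
  model: "claude-opus-4-7",
  maxTokens: 32768,
  effort: "max",
  messages: [
    { role: "user", content: "task instruction" },
    { role: "assistant", content: "first command batch" },
    { role: "user", content: "terminal output" },
  ],
});
console.log(JSON.stringify(exampleBody.messages[0], null, 2));
```

    With one breakpoint per request, the body stays well under the 4-breakpoint cap regardless of how long the conversation grows.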

    Switch heuristic:

    model | use_anthropic (default)
    claude-* | true (Anthropic Messages API)
    claude-*-max | true, plus auto-switch to enabled thinking with budget_tokens=6144
    anything else | false (OpenAI/SGLang chat-completions)

    Force either way with --agent-kwarg use_anthropic=true|false.
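
    The switch heuristic can be expressed as a small pure function. This is a sketch under assumptions: resolveBackend and its return shape are hypothetical, the -max suffix pattern is inferred from the claude-*-max row, and what happens to the thinking settings when use_anthropic is forced on a non-Claude model is not stated in the README, so the sketch leaves them untouched.

```javascript
// Decide which API path to use from the model name and an optional override.
// Hypothetical helper; only the /^claude/i test and the 6144 budget are sourced.
function resolveBackend(modelName, useAnthropicOverride) {
  const isClaude = /^claude/i.test(modelName);
  const useAnthropic =
    typeof useAnthropicOverride === "boolean" ? useAnthropicOverride : isClaude;

  if (!useAnthropic) {
    return { api: "chat-completions", thinkingOverride: null };
  }
  // claude-*-max: switch to enabled thinking with a fixed budget.
  if (/^claude-.*-max$/i.test(modelName)) {
    return {
      api: "anthropic-messages",
      thinkingOverride: { type: "enabled", budget_tokens: 6144 },
    };
  }
  return { api: "anthropic-messages", thinkingOverride: null };
}

console.log(resolveBackend("claude-opus-4-7"));
console.log(resolveBackend("GLM-4.8"));
console.log(resolveBackend("claude-opus-4-7", false));
```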

    What's not ported

    • Tokenizer-based exact token counting (we trust the API's usage block; the proactive summarize check uses a chars/4 estimate when the model doesn't expose a tokenizer)
    • Asciinema recording, skill discovery, subagent metrics, linear-history splitting, output-length salvage
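
    The chars/4 fallback mentioned above is simple enough to sketch. The functions below are hypothetical, and the exact comparison used by the proactive summarize check is an assumption: here a summarization is triggered once the estimated usage leaves less than a given headroom below the model's input limit.

```javascript
// Rough token estimate when no tokenizer is available: four characters per token.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Assumed proactive check: summarize once the estimate comes within
// headroomTokens of maxInputTokens. Messages are { role, content: string }.
function shouldSummarize(messages, maxInputTokens, headroomTokens = 8000) {
  const used = messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  return used >= maxInputTokens - headroomTokens;
}

const history = [{ role: "user", content: "x".repeat(1_200_000) }];
console.log(estimateTokens(history[0].content));   // estimate for a 1.2M-char message
console.log(shouldSummarize(history, 300000));     // compared against 300000 - 8000
```

    The estimate is deliberately coarse; the usage block returned by the API remains the authoritative count after each call.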

    Requirements

    • Node.js >= 18.17 (uses built-in fetch and node:util.parseArgs)
    • tmux and script available on PATH inside the container
    • An OpenAI-compatible chat-completions endpoint, or an Anthropic Messages-compatible endpoint (api.anthropic.com, api-gateway.glm.ai, any /v1/messages shim)
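
    A container image can be checked against the Node.js requirement before installing. The comparison helper below is a hypothetical sketch, not something shipped by the package; verifying that tmux and script are on PATH would need a separate process-spawning check that is omitted here.

```javascript
// Compare a Node.js version string (with or without a leading "v") against
// a minimum. Missing components are treated as 0.
function meetsMinimumNode(version, minimum = "18.17.0") {
  const parse = (v) =>
    v.replace(/^v/, "").split(".").map((n) => Number.parseInt(n, 10) || 0);
  const actual = parse(version);
  const required = parse(minimum);
  for (let i = 0; i < 3; i++) {
    const a = actual[i] ?? 0;
    const r = required[i] ?? 0;
    if (a !== r) return a > r;
  }
  return true; // all three components equal
}

console.log(meetsMinimumNode(process.version));
```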

    Standalone usage

    cat > /tmp/cfg.json <<EOF
    {
      "model_name": "claude-opus-4-7",
      "parser_name": "xml",
      "max_turns": 120,
      "max_new_tokens": 32768,
      "model_info": {"max_input_tokens": 300000},
      "interleaved_thinking": true,
      "anthropic_output_effort": "max",
      "anthropic_thinking_display": "summarized"
    }
    EOF
    
    echo "Print 'hello world' to the terminal and stop." > /tmp/instruction.txt
    mkdir -p /tmp/agent-logs
    
    ANTHROPIC_BASE_URL=https://api.anthropic.com \
    ANTHROPIC_AUTH_TOKEN=$ANTHROPIC_API_KEY \
    terminus-2-cli run \
      --config /tmp/cfg.json \
      --instruction /tmp/instruction.txt \
      --logs-dir /tmp/agent-logs \
      --session-id demo

    After the run:

    • /tmp/agent-logs/trajectory.json — per-step ATIF-v1.6 trajectory
    • /tmp/agent-logs/anthropic_raw_calls.jsonl — per-call request/response (Claude path only)
    • /tmp/agent-logs/context.json — final token / cost counters
    • /tmp/agent-logs/exception.txt — only present on failure

    Config schema

    The --config JSON accepts both Python-style snake_case and JS-style camelCase keys; CLI flags override config values.

    Key | Type | Notes
    model_name | string | required (or pass --model)
    api_base | string | OpenAI-compatible base URL
    api_key | string | falls back to the OPENAI_API_KEY env var
    parser_name | "xml" | "json" | default "xml"
    max_turns | number | default 1,000,000 (effectively unbounded)
    temperature | number | default 0.7; auto-omitted when Claude thinking is on
    top_p | number | optional
    max_new_tokens | number | optional
    reasoning_effort | string | OpenAI-style
    max_thinking_tokens | number | Anthropic thinking.budget_tokens
    interleaved_thinking | bool | default false; when true, enables adaptive thinking on Claude
    use_anthropic | bool | force Anthropic path on/off (default: auto by model name)
    anthropic_output_effort | string | "high" / "max" (sets output_config.effort)
    anthropic_thinking_display | string | e.g. "summarized" (gateway redaction control)
    anthropic_cache_control | bool | default true; ephemeral cache breakpoint at messages[len-3]
    debug_anthropic_raw_calls | bool | default true on Claude; per-call request/response JSONL
    llm_request_timeout_sec | number | default 600
    llm_extra_body | object | merged into the request body
    llm_headers | object | extra HTTP headers
    tmux_pane_width / tmux_pane_height | number | default 160×40
    model_info.max_input_tokens | number | proactive-summarize threshold
    enable_summarize | bool | default true
    summarization_mode | "truncate" | "summarize" | default "summarize"
    proactive_summarization_threshold | number | default 8000 (tokens of headroom)
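
    The key handling described in this section (snake_case and camelCase both accepted, CLI flags winning over the config file) could look roughly like the sketch below. All function names are hypothetical, only a handful of defaults from the table are included, and treating --model as an alias for model_name is an assumption drawn from the model_name row. Only top-level keys and model_info are normalised so that user-supplied llm_headers and llm_extra_body keys pass through untouched.

```javascript
// Convert a camelCase key to snake_case; snake_case keys pass through unchanged.
function camelToSnake(key) {
  return key.replace(/[A-Z]/g, (c) => "_" + c.toLowerCase());
}

// Normalise top-level keys plus the nested model_info object.
function normaliseConfig(raw) {
  const out = {};
  for (const [key, value] of Object.entries(raw)) out[camelToSnake(key)] = value;
  if (out.model_info && typeof out.model_info === "object") {
    out.model_info = Object.fromEntries(
      Object.entries(out.model_info).map(([k, v]) => [camelToSnake(k), v]),
    );
  }
  return out;
}

// Merge defaults < config file < CLI flags (undefined flags are ignored).
function resolveConfig(fileConfig, cliFlags = {}) {
  const defaults = {
    parser_name: "xml",
    temperature: 0.7,
    llm_request_timeout_sec: 600,
    enable_summarize: true,
    summarization_mode: "summarize",
  };
  const flags = Object.fromEntries(
    Object.entries(normaliseConfig(cliFlags)).filter(([, v]) => v !== undefined),
  );
  if (flags.model !== undefined) {          // assumed alias: --model → model_name
    flags.model_name = flags.model;
    delete flags.model;
  }
  const merged = { ...defaults, ...normaliseConfig(fileConfig), ...flags };
  if (!merged.model_name) {
    throw new Error("model_name is required (config file or --model)");
  }
  return merged;
}

console.log(resolveConfig({ modelName: "claude-opus-4-7", maxTurns: 120 }, { max_turns: 60 }));
```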

    Running under harbor

    The companion host wrapper at src/harbor/agents/installed/terminus_2_cli.py registers this CLI as harbor's terminus-2-cli agent. The host wrapper handles tokenizer staging and, at install time, picks the best install path:

    1. Local tarball (set TERMINUS_2_CLI_LOCAL_DIR=/path/to/source for dev)
    2. Public npm registry (npm install -g @zhuerle/terminus-2-cli@<version>)
    3. git+https://github.com/... fallback (legacy)
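
    The priority order above can be sketched as a single decision function. The real host wrapper is Python and its logic is not shown in this README, so this is only an illustration: pickInstallSource, its return shape, and the npmReachable flag that triggers the git fallback are all assumptions, and the git URL is left out because the README elides it.

```javascript
// Choose an install source following the documented priority order.
// Hypothetical helper; the fallback condition is an assumption.
function pickInstallSource({ env = {}, version, npmReachable = true }) {
  // 1. Local tarball, for development.
  if (env.TERMINUS_2_CLI_LOCAL_DIR) {
    return { kind: "local-tarball", dir: env.TERMINUS_2_CLI_LOCAL_DIR };
  }
  // 2. Public npm registry.
  if (npmReachable) {
    return { kind: "npm", spec: "@zhuerle/terminus-2-cli@" + version };
  }
  // 3. Legacy git fallback (URL intentionally not reconstructed here).
  return { kind: "git" };
}

console.log(pickInstallSource({ env: process.env, version: "0.0.3" }));
```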

    Reference launcher: scripts/run_terminus_2_cli.sh in the harbor repository — env-var-driven and copy-pasteable. All harbor-level kwargs (--agent-kwarg max_turns=120, etc.) are forwarded into the config JSON.

    Tests

    node --test test/

    54+ unit tests cover parsers, the tmux session lifecycle, the agent loop, the three-subagent compaction flow, Anthropic SSE parsing, signature roundtrip, cache_control breakpoint placement, model-name auto-routing, and raw-call DEBUG logging.