terminus-2-cli

Standalone Node.js port of Harbor's Terminus2 agent — runs entirely inside the task container. Same loop semantics as the Python terminus-2, but the agent process and the tmux it drives now live on the same side of the docker boundary, eliminating per-command docker exec round-trips.

npm install -g @zhuerle/terminus-2-cli

Motivation

Harbor's terminus-2 agent (Python) orchestrates from the host: every shell command the model issues is delivered into the task container via a docker exec round-trip, and every tmux pane capture is another one. On reasoning-heavy benchmarks that is tens of round-trips per trial; the cost adds up to ~12-18 seconds per step of pure infra overhead on top of the LLM call itself.

terminus-2-cli keeps the loop, parsers, prompts, and trajectory format identical, but runs inside the task container as a single npm-installable Node.js binary. Tool execution becomes a direct tmux send-keys against the local pane (no IPC, no docker socket), bringing pure infra overhead down to ~6.8 seconds per step (~2.8× faster on the architecture layer).

The trajectory format is ATIF-v1.6 (forward-compatible superset of t2's ATIF-v1.2), so any tooling that consumes a t2 trajectory.json reads the cli's output unchanged.

Performance comparison vs `terminus-2`

Two model families, same tasks, same gateway, aligned thinking config.

`claude-opus-4-7` on terminal-bench-2 (89 tasks, n=20 concurrent, k=1)

Anthropic Messages API via api-gateway.glm.ai, adaptive thinking + display:summarized + output_config.effort:max, max_new_tokens=32768, max_input_tokens=300000, max_turns=120, --agent-timeout 7200s, 4 cpus / 8 GB per trial. The 85 shared tasks both implementations finished are reported below.

	t2-cli (this)	t2 (Python)	Δ
Accuracy	70/85 = 0.824	61/85 = 0.718	+10.6 pp
Per-trial wall	828 s	1091 s	cli −24 %
Per-step wall	25.3 s	31.0 s	cli −18 %
LLM-call wall (gateway-bound, identical for both)	18.0 s	15.8 s	tied
Pure infra wall / step (`step − llm_mean`)	6.8 s	18.8 s	cli 2.8× faster
Pure infra wall / trial	303 s	531 s	cli saves 228 s/trial
Cross-tab cli-only ✓ / t2-only ✓	10 / 1	—	cli net +9 tasks
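
The Δ column can be re-derived from the other two columns. The sketch below recomputes three of the entries from the numbers in the table; the helper names are illustrative and not part of the cli.

```javascript
// Relative change of the cli value against the t2 baseline, as a signed fraction.
function relativeDelta(cli, t2) {
  return (cli - t2) / t2;
}

// Gap between two pass counts over the same task set, in percentage points.
function percentagePointGap(cliPassed, t2Passed, total) {
  return ((cliPassed - t2Passed) / total) * 100;
}

// Inputs copied from the claude-opus-4-7 table above.
const perTrial = relativeDelta(828, 1091);
const perStep = relativeDelta(25.3, 31.0);
const accuracyGap = percentagePointGap(70, 61, 85);

console.log(`per-trial wall: ${(perTrial * 100).toFixed(0)} %`);
console.log(`per-step wall: ${(perStep * 100).toFixed(0)} %`);
console.log(`accuracy gap: +${accuracyGap.toFixed(1)} pp`);
```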

`GLM-4.8` on terminal-bench-2 (89 tasks, n=180, k=4 → 356 trials)

SGLang OpenAI-compatible endpoint, no thinking. Both implementations hit the same endpoint at the same time.

	t2-cli v0.0.3	t2 (Python)	Δ
Accuracy	0.500 (178/356)	0.475 (169/356)	+2.5 pp
Per-trial wall	1830 s	2227 s	cli −18 %
Per-step wall	51.3 s	58.5 s	cli −12 %

The architectural advantage is larger on opus-4-7 because the adaptive-thinking LLM call is much heavier per turn: every saved docker exec round-trip translates more directly into wall-clock savings, and the extra max_turns headroom converts directly into accuracy on tasks that reach mark_task_complete near the cap.

Full report: /workspace/swe-data/zhuerle/perf_compare/REPORT_CLI_VS_T2_中文综合.md.

What's ported

Prompt templates (prompts/terminus-{xml,json}-plain.txt, copied verbatim from the Python repo)
Both parsers (XML plain + JSON plain), with the same warnings and auto-fixes
tmux session driver (script(1)-allocated PTY → tmux new-session -d), send-key chunking under the ~16 KB tmux command-buffer limit
Main agent loop with batched send + capture, observation feedback, task_complete confirmation, parser-error re-prompt
Per-step trajectory in ATIF-v1.6 shape, written to <logs-dir>/trajectory.json
3-subagent context summarization (writes sibling trajectory.summarization-N-{summary,questions,answers}.json files, pivot-able from the parent trajectory's compaction marker)
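
The send-key chunking in the list above can be sketched as follows. This is a minimal illustration, not the cli's actual code: the function names and the 16000-byte budget are assumptions chosen to stay under the ~16 KB limit, and the only tmux facts relied on are the send-keys -t (target) and -l (literal) flags.

```javascript
// Split `text` into pieces whose UTF-8 size stays at or below `maxBytes`,
// never cutting a code point in half.
function chunkByBytes(text, maxBytes) {
  if (maxBytes < 4) throw new RangeError("maxBytes must be at least 4");
  const chunks = [];
  let current = "";
  let currentBytes = 0;
  for (const ch of text) { // for...of walks code points, not UTF-16 units
    const size = Buffer.byteLength(ch, "utf8");
    if (currentBytes + size > maxBytes) {
      chunks.push(current);
      current = "";
      currentBytes = 0;
    }
    current += ch;
    currentBytes += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// Hypothetical helper: one argv per chunk, ready to hand to a process spawner.
function buildSendKeysArgs(target, text, maxBytes = 16000) {
  return chunkByBytes(text, maxBytes).map((chunk) => [
    "send-keys", "-t", target, "-l", chunk,
  ]);
}
```

Counting bytes rather than characters matters here because a command containing non-ASCII text can exceed a byte limit long before its character count does.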

Anthropic Messages API support

The cli auto-detects Claude models by name (/^claude/i) and switches to POST /v1/messages with:

adaptive thinking (thinking:{type:"adaptive", display:"summarized"}), with output_config.effort honored
multi-turn signature roundtrip (each thinking block's signature is threaded back into the next request, so multi-turn behaviour matches Anthropic's expectation)
ephemeral prompt caching with a single breakpoint at messages[len-3] (openhands pattern, stays under Anthropic's 4-breakpoint cap)
anthropic-beta: interleaved-thinking-2025-05-14 header
per-call DEBUG dump to <logs-dir>/anthropic_raw_calls.jsonl (request body / SSE event histogram / response usage; toggle with debug_anthropic_raw_calls=false)
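
The single cache breakpoint at messages[len-3] could look roughly like the sketch below. It relies only on the documented Anthropic convention that cache_control is attached to a content block; the function name, the behaviour for histories shorter than three messages, and the choice of the last block are assumptions, not the cli's confirmed implementation.

```javascript
// Return a copy of `messages` with an ephemeral cache_control marker on the
// last content block of messages[len-3]. Shorter histories are copied as-is.
function withCacheBreakpoint(messages) {
  const index = messages.length - 3;
  return messages.map((message, i) => {
    if (i !== index) return { ...message };
    // cache_control lives on content blocks, so promote a plain string body
    // to a single text block first.
    const blocks = typeof message.content === "string"
      ? [{ type: "text", text: message.content }]
      : message.content.map((block) => ({ ...block }));
    if (blocks.length === 0) return { ...message, content: blocks };
    const last = blocks.length - 1;
    blocks[last] = { ...blocks[last], cache_control: { type: "ephemeral" } };
    return { ...message, content: blocks };
  });
}
```

Because the function copies instead of mutating, the stored conversation history stays free of markers and the breakpoint can be recomputed on every request as the history grows.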

Switch heuristic:

`model`	`use_anthropic` (default)
`claude-*`	`true` (Anthropic Messages API)
`claude-*-max`	`true` + auto-switches to `enabled` thinking + `budget_tokens=6144`
anything else	`false` (OpenAI/SGLang chat-completions)

Force either way with --agent-kwarg use_anthropic=true|false.
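
The routing table above can be read as a small decision function. The sketch below is an illustration under assumptions: the function name and return shape are invented here, and how the -max suffix interacts with a forced use_anthropic value is a guess. Only the /^claude/i test, the `enabled` thinking mode, and budget_tokens=6144 come from this README.

```javascript
// Decide which API path a model name maps to. `override` mirrors the
// use_anthropic kwarg: true/false forces the path, undefined means "auto".
function resolveRoute(modelName, override) {
  const looksLikeClaude = /^claude/i.test(modelName);
  const useAnthropic = typeof override === "boolean" ? override : looksLikeClaude;
  if (!useAnthropic) {
    return { useAnthropic: false, endpoint: "chat-completions" };
  }
  const route = { useAnthropic: true, endpoint: "/v1/messages" };
  // Assumption: the claude-*-max rule is a suffix match on the model name.
  if (/^claude-.+-max$/i.test(modelName)) {
    route.thinking = { type: "enabled", budget_tokens: 6144 };
  }
  return route;
}

console.log(resolveRoute("claude-opus-4-7"));
console.log(resolveRoute("GLM-4.8"));
```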

What's not ported

Tokenizer-based exact token counting (we trust the API's usage block; the proactive summarize check uses a chars/4 estimate when the model doesn't expose a tokenizer)
Asciinema recording, skill discovery, subagent metrics, linear-history splitting, output-length salvage
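
The chars/4 fallback mentioned in the first item is simple enough to sketch. The function names and the exact comparison are assumptions; the README only states that a chars/4 estimate feeds the proactive summarize check, and that the default headroom is 8000 tokens against model_info.max_input_tokens.

```javascript
// Rough token estimate for when no tokenizer is available: ~4 chars per token.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Assumed rule: summarize once the estimated prompt size leaves less than
// `headroomTokens` of room below the model's input limit.
function shouldSummarize(messages, maxInputTokens, headroomTokens = 8000) {
  const used = messages.reduce(
    (sum, message) => sum + estimateTokens(message.content),
    0,
  );
  return used >= maxInputTokens - headroomTokens;
}
```

The estimate is deliberately crude; it only has to trigger summarization early enough, since the API's own usage block remains the source of truth for reported counts.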

Requirements

Node.js >= 18.17 (uses built-in fetch and node:util.parseArgs)
tmux and script available on PATH inside the container
An OpenAI-compatible chat-completions endpoint, or an Anthropic Messages-compatible endpoint (api.anthropic.com, api-gateway.glm.ai, any /v1/messages shim)
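
A preflight check for the first two requirements might look like the sketch below. It is not part of the cli; the function names are hypothetical, and the PATH lookup assumes a POSIX sh is present in the container.

```javascript
const { spawnSync } = require("node:child_process");

// Compare dotted versions numerically, e.g. "18.16.1" < "18.17.0".
function meetsMinimumNode(version, minimum = "18.17.0") {
  const have = version.split(".").map(Number);
  const need = minimum.split(".").map(Number);
  for (let i = 0; i < 3; i += 1) {
    const a = have[i] || 0;
    const b = need[i] || 0;
    if (a !== b) return a > b;
  }
  return true;
}

// `command -v` is the POSIX way to resolve a name against PATH.
function isOnPath(binary) {
  const result = spawnSync("sh", ["-c", 'command -v "$1"', "sh", binary], {
    stdio: "ignore",
  });
  return result.status === 0;
}

// Collect human-readable problems instead of throwing on the first one.
function preflight() {
  const problems = [];
  if (!meetsMinimumNode(process.versions.node)) {
    problems.push(`Node.js ${process.versions.node} is older than 18.17`);
  }
  for (const binary of ["tmux", "script"]) {
    if (!isOnPath(binary)) problems.push(`${binary} not found on PATH`);
  }
  return problems;
}

console.log(preflight());
```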

Standalone usage

cat > /tmp/cfg.json <<EOF
{
  "model_name": "claude-opus-4-7",
  "parser_name": "xml",
  "max_turns": 120,
  "max_new_tokens": 32768,
  "model_info": {"max_input_tokens": 300000},
  "interleaved_thinking": true,
  "anthropic_output_effort": "max",
  "anthropic_thinking_display": "summarized"
}
EOF

echo "Print 'hello world' to the terminal and stop." > /tmp/instruction.txt
mkdir -p /tmp/agent-logs

ANTHROPIC_BASE_URL=https://api.anthropic.com \
ANTHROPIC_AUTH_TOKEN=$ANTHROPIC_API_KEY \
terminus-2-cli run \
  --config /tmp/cfg.json \
  --instruction /tmp/instruction.txt \
  --logs-dir /tmp/agent-logs \
  --session-id demo

After the run:

/tmp/agent-logs/trajectory.json — per-step ATIF-v1.6 trajectory
/tmp/agent-logs/anthropic_raw_calls.jsonl — per-call request/response (Claude path only)
/tmp/agent-logs/context.json — final token / cost counters
/tmp/agent-logs/exception.txt — only present on failure
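
A wrapper script can use these four files to decide whether a run succeeded. The sketch below only checks for their presence and makes no assumption about their contents; the function name and return shape are illustrative.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// File names as listed above.
const ARTIFACTS = [
  "trajectory.json",
  "anthropic_raw_calls.jsonl",
  "context.json",
  "exception.txt",
];

// Report which artifacts exist; exception.txt is only present on failure.
function inspectLogsDir(logsDir) {
  const present = {};
  for (const name of ARTIFACTS) {
    present[name] = fs.existsSync(path.join(logsDir, name));
  }
  return { present, failed: present["exception.txt"] };
}

console.log(inspectLogsDir("/tmp/agent-logs"));
```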

Config schema

The --config JSON accepts both Python-style snake_case and JS-style camelCase keys; CLI flags override config values.

Key	Type	Notes
`model_name`	string	required (or pass `--model`)
`api_base`	string	OpenAI-compatible base URL
`api_key`	string	falls back to `OPENAI_API_KEY` env
`parser_name`	`"xml" | "json"`	default `"xml"`
`max_turns`	number	default 1,000,000 (effectively unbounded)
`temperature`	number	default 0.7; auto-omitted when Claude+thinking is on
`top_p`	number	optional
`max_new_tokens`	number	optional
`reasoning_effort`	string	OpenAI-style
`max_thinking_tokens`	number	Anthropic `thinking.budget_tokens`
`interleaved_thinking`	bool	default false; when true, enables adaptive thinking on Claude
`use_anthropic`	bool	force Anthropic path on/off (default: auto by model name)
`anthropic_output_effort`	string	`"high"` / `"max"` (sets `output_config.effort`)
`anthropic_thinking_display`	string	e.g. `"summarized"` (gateway redaction control)
`anthropic_cache_control`	bool	default true; ephemeral cache breakpoint at msg[len-3]
`debug_anthropic_raw_calls`	bool	default true on Claude; per-call request/response JSONL
`llm_request_timeout_sec`	number	default 600
`llm_extra_body`	object	merged into the request body
`llm_headers`	object	extra HTTP headers
`tmux_pane_width` / `tmux_pane_height`	number	default 160×40
`model_info.max_input_tokens`	number	proactive-summarize threshold
`enable_summarize`	bool	default true
`summarization_mode`	`"truncate" | "summarize"`	default `"summarize"`
`proactive_summarization_threshold`	number	default 8000 (tokens of headroom)
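
The key-handling rule stated before the table (snake_case and camelCase both accepted, CLI flags win) can be sketched as a normalize-then-merge step. This is an illustration under assumptions: the function names are invented, and normalizing only top-level keys is a choice made here so that nested objects such as llm_headers pass through untouched.

```javascript
// modelName -> model_name; keys that are already snake_case are unchanged.
function toSnakeCase(key) {
  return key.replace(/[A-Z]/g, (c) => `_${c.toLowerCase()}`);
}

// Normalize top-level keys only, so header names and extra-body fields
// inside nested objects keep their original spelling.
function normalizeKeys(config) {
  const out = {};
  for (const [key, value] of Object.entries(config)) {
    out[toSnakeCase(key)] = value;
  }
  return out;
}

// CLI flags override file values; flags that were not passed are ignored.
function resolveConfig(fileConfig, cliFlags) {
  const passed = Object.fromEntries(
    Object.entries(cliFlags).filter(([, value]) => value !== undefined),
  );
  return { ...normalizeKeys(fileConfig), ...normalizeKeys(passed) };
}

console.log(resolveConfig({ modelName: "claude-opus-4-7", max_turns: 120 }, { maxTurns: 5 }));
```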

Running under harbor

The companion host wrapper at src/harbor/agents/installed/terminus_2_cli.py registers this CLI as harbor's terminus-2-cli agent. The host wrapper handles tokenizer staging and, at install time, picks the best install path:

Local tarball (set TERMINUS_2_CLI_LOCAL_DIR=/path/to/source for dev)
Public npm registry (npm install -g @zhuerle/terminus-2-cli@<version>)
git+https://github.com/... fallback (legacy)

Reference launcher: scripts/run_terminus_2_cli.sh in the harbor repository — env-var-driven and copy-pasteable. All harbor-level kwargs (--agent-kwarg max_turns=120, etc.) are forwarded into the config JSON.

Tests

node --test test/

54+ unit tests cover the parsers, tmux session lifecycle, agent loop, the three-subagent compaction flow, Anthropic SSE parsing, signature roundtrip, cache_control breakpoint placement, model-name auto-routing, and raw-call DEBUG logging.
