freedom of information · freedom of patterns · creating freely · open-weights
libertad de informacion · crear libremente · creer librement · liberte d'expression
Freiheit der Muster · jiyuu ni souzou suru · jayuroun changjak · svoboda tvorchestva
liberdade de criar · creare liberamente · ozgurce yarat · skapa fritt
vrij creeren · tworz swobodnie · dimiourgia elefthera · khuli soch
hurriyat al-ibdaa · code is poetry · democratize AI · imagine freely

Open Agents

npm i -g open-agents-ai && oa

AI coding agent powered entirely by open-weight models. No API keys. No cloud. Your code never leaves your machine.

An autonomous, multi-turn tool-calling agent that reads your code, makes changes, runs tests, and fixes failures in an iterative loop until the task is complete. On first launch it detects your hardware and configures a matching model with an expanded context window.

Features

47 autonomous tools — file I/O, shell, grep, web search/fetch/crawl, memory (read/write/search), sub-agents, background tasks, image/OCR/PDF, git, diagnostics, vision, desktop automation, structured files, code sandbox, transcription, skills
Moondream vision — see and interact with the desktop via Moondream VLM (caption, query, detect, point-and-click)
Desktop automation — vision-guided clicking: describe a UI element in natural language, the agent finds and clicks it
Auto-install desktop deps — screenshot, mouse, OCR, and image tools auto-install missing system packages (scrot, xdotool, tesseract, imagemagick) on first use
Parallel tool execution — read-only tools run concurrently via Promise.allSettled
Sub-agent delegation — spawn independent agents for parallel workstreams
Ralph Loop — iterative task execution that keeps retrying until completion criteria are met
Dream Mode — creative idle exploration modeled after real sleep architecture (NREM→REM cycles)
Live Listen — bidirectional voice communication with real-time Whisper transcription
Neural TTS — hear what the agent is doing via GLaDOS or Overwatch ONNX voices
Human expert speed ratio — real-time Exp: Nx gauge comparing agent speed to a leading human expert, calibrated across 47 tool baselines
Cost tracking — real-time token cost estimation for 15+ cloud providers
Work evaluation — LLM-as-judge scoring with task-type-specific rubrics
Session metrics — track turns, tool calls, tokens, files modified, tasks completed per session
Structured file generation — create CSV, TSV, JSON, Markdown tables, and Excel-compatible files
Code sandbox — isolated code execution in subprocess or Docker (JS, Python, Bash, TypeScript)
Structured file reading — parse CSV, TSV, JSON, Markdown tables with binary format detection
Multi-provider web search — DuckDuckGo (free), Tavily (structured), Jina AI (markdown) with auto-detection
Web crawling — multi-page web scraping with Crawlee/Playwright for deep documentation extraction
Task templates — specialized system prompts and tool recommendations for code, document, analysis, plan tasks
Auto-expanding context — detects RAM/VRAM and creates an optimized model variant on first run
Mid-task steering — type while the agent works to add context without interrupting
Smart compaction — 6 context compaction strategies (default, aggressive, decisions, errors, summary, structured) with research-backed design
Memex experience archive — large tool outputs archived during compaction with hash-based retrieval
Persistent memory — learned patterns stored in .oa/memory/ across sessions
Session context persistence — auto-saves context on task completion, manual /context save|restore across sessions
Self-learning — auto-fetches docs from the web when encountering unfamiliar APIs
Seamless /update — in-place update and reload with automatic context save/restore
Task control — /pause (gentle halt at turn boundary), /stop (immediate kill), /resume to continue
Model-tier awareness — dynamic tool sets, prompt complexity, and context limits scale with model size (small/medium/large)

How It Works

You: oa "fix the null check in auth.ts"

Agent: [Turn 1] file_read(src/auth.ts)
       [Turn 2] grep_search(pattern="null", path="src/auth.ts")
       [Turn 3] file_edit(old_string="if (user)", new_string="if (user != null)")
       [Turn 4] shell(command="npm test")
       [Turn 5] task_complete(summary="Fixed null check — all tests pass")

The agent uses tools autonomously in a loop — reading errors, fixing code, and re-running validation until the task succeeds or the turn limit is reached.

Ralph Loop — Iteration-First Design

The Ralph Loop is the core execution philosophy: iteration beats perfection. Instead of trying to get everything right on the first attempt, the agent executes in a retry loop where errors become learning data rather than session-ending failures.

/ralph "fix all failing tests" --completion "npm test passes with 0 failures"
/ralph "migrate to TypeScript" --completion "npx tsc --noEmit exits 0" --max-iterations 20
/ralph "reach 80% coverage" --completion "coverage report shows >80%" --timeout 120

Each iteration:

Execute — make changes based on the task + all accumulated learnings
Verify — run the completion command (tests, build, lint, coverage)
Learn — if verification fails, extract what went wrong and why
Iterate — retry with the new knowledge until passing or limits reached

The loop tracks iteration history, generates completion reports saved to .aiwg/ralph/, and supports resume/abort for interrupted sessions. Safety bounds (max iterations, timeout) prevent runaway loops.
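
The four steps above can be sketched as a bounded retry loop. This is a minimal illustration, not the package's implementation: `ralphLoop`, `VerifyResult`, `RalphOptions`, and `RalphReport` are hypothetical names, and the `execute` and `verify` callbacks stand in for the agent's work and the completion command.

```typescript
// Sketch of an execute/verify/learn/iterate loop with safety bounds.
// All names are hypothetical; they are not the package's API.
interface VerifyResult {
  passed: boolean;
  output: string; // e.g. captured output of the completion command
}

interface RalphOptions {
  maxIterations: number;
  timeoutMs: number;
  now?: () => number; // injectable clock, useful in tests
}

interface RalphReport {
  status: "passed" | "max-iterations" | "timeout";
  iterations: number;
  learnings: string[];
}

function ralphLoop(
  execute: (learnings: readonly string[]) => void,
  verify: () => VerifyResult,
  opts: RalphOptions,
): RalphReport {
  const now = opts.now ?? Date.now;
  const start = now();
  const learnings: string[] = [];

  for (let i = 1; i <= opts.maxIterations; i++) {
    // Safety bound: stop before starting an iteration once the timeout has elapsed.
    if (now() - start > opts.timeoutMs) {
      return { status: "timeout", iterations: i - 1, learnings };
    }
    execute(learnings); // Execute: act on the task plus learnings so far
    const result = verify(); // Verify: run the completion check
    if (result.passed) {
      return { status: "passed", iterations: i, learnings };
    }
    learnings.push(`iteration ${i}: ${result.output}`); // Learn: keep the failure for the next pass
  }
  return { status: "max-iterations", iterations: opts.maxIterations, learnings };
}
```

The returned report mirrors the idea of a completion report: it records how the loop ended, how many iterations ran, and what was learned from each failed verification.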

/ralph-status     # Check current/previous loop status
/ralph-resume     # Resume interrupted loop
/ralph-abort      # Cancel running loop

Context Compaction — Research-Backed Memory Management

Long conversations consume context window tokens. Open Agents uses progressive context compaction to compress older messages while preserving critical information — decisions, errors, file states, and task progress.

How It Works

Compaction triggers automatically when estimated token usage reaches the model tier's threshold or 75% of the model's context window, whichever is lower. The system:

Preserves the system prompt and initial user task (head messages)
Summarizes middle messages (tool calls, results, exploration) into a structured digest
Keeps recent messages verbatim (scaled by model tier and context size)
Archives large tool outputs to the Memex experience archive (retrievable by hash ID via memex_retrieve)
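
The head/middle/recent split described above might look like the following sketch. The `Message` shape, the `compact` function, and the assumption that the first two messages are the system prompt and the initial task are illustrative only; the summarizer is passed in as a callback because the text does not specify how the digest is produced.

```typescript
// Hypothetical message shape; the real agent's types may differ.
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Keep the head, replace the middle with one digest message,
// and keep the most recent messages verbatim.
function compact(
  messages: readonly Message[],
  keepRecent: number,
  summarize: (middle: readonly Message[]) => string,
): Message[] {
  const headCount = 2; // assumption: messages[0] is the system prompt, messages[1] the task
  if (messages.length <= headCount + keepRecent) {
    return [...messages]; // nothing worth compacting yet
  }
  const head = messages.slice(0, headCount);
  const middle = messages.slice(headCount, messages.length - keepRecent);
  const recent = messages.slice(messages.length - keepRecent);
  const digest: Message = {
    role: "user",
    content: `[Compacted digest]\n${summarize(middle)}`,
  };
  return [...head, digest, ...recent];
}
```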

Compaction Strategies

Six strategies are available via /compact <strategy>:

Strategy	What It Preserves	Best For
`default`	Progressive summarization — decisions, errors, file changes, task state	General use
`aggressive`	Only key decisions and errors, maximum compression	Very long sessions
`decisions`	Action→outcome pairs only, discards exploration	Decision-heavy workflows
`errors`	Full error context preserved, successes compressed	Debugging sessions
`summary`	High-level paragraph summary, minimal detail	Quick context reset
`structured`	LLM-generated structured summary via a separate inference call	Highest quality summaries

Automatic Compaction

Compaction thresholds scale dynamically with model size:

Model Tier	Threshold	Recent Messages Kept
Large (30B+)	40,000 tokens (or 75% of context)	12 messages
Medium (8-29B)	24,000 tokens (or 75% of context)	8 messages
Small (≤7B)	12,000 tokens (or 75% of context)	4-6 messages

Memex Experience Archive

During compaction, large tool outputs (file reads, grep results, command output) are archived with a short hash ID. The agent can recover any archived result using memex_retrieve:

Agent: memex_retrieve(id="a3f2c1")
       → [Full original content of the archived tool result]

This gives the agent "perfect recall" of any prior tool output despite compaction.
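
A hash-keyed archive of this kind can be sketched in a few lines. The example below is an assumption-laden illustration: `MemexArchive` and `shortHash` are hypothetical names, storage is in memory only, and the ID is derived with a non-cryptographic FNV-1a hash truncated to six hex characters, which is not necessarily how the package computes its IDs.

```typescript
// FNV-1a 32-bit hash, used only to derive a short ID (not cryptographic).
function shortHash(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0").slice(0, 6);
}

class MemexArchive {
  private readonly store = new Map<string, string>();

  // Archive a large output and return the short stub left in the conversation.
  archive(toolName: string, output: string): { id: string; stub: string } {
    const id = shortHash(`${toolName}:${output}`);
    this.store.set(id, output);
    return { id, stub: `[archived ${toolName} output, ${output.length} chars, id=${id}]` };
  }

  // Return the full original output, or undefined for an unknown ID.
  retrieve(id: string): string | undefined {
    return this.store.get(id);
  }
}
```

A six-character ID keeps the stub cheap in tokens but can collide; a real archive would need to detect collisions or use a longer digest.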

Design Rationale

The compaction system draws on several research findings:

RECOMP (arXiv:2310.04408, ICLR 2024) — Demonstrated that retrieved context can be compressed to 6% of original size with minimal quality loss. Our observation masking pre-pass applies this principle to tool outputs.
Tool Documentation Enables Zero-Shot Tool-Usage (arXiv:2308.00675) — Showed that documentation quality matters more than example quantity. Our compaction preserves tool schemas while discarding verbose results.
ToolLLM DFSDT (arXiv:2307.16789) — Validated that backtracking and error preservation improve multi-step task success by +35pp. Our error-preserving strategy directly implements this insight.
Long Context Does Not Solve Planning (NATURAL PLAN, arXiv:2406.04520) — GPT-4 achieves only 31% on trip planning even with full context. This confirms that efficient context use outperforms naive context expansion, motivating aggressive compaction with selective preservation.

Domain-Aware Preservation

Compaction summaries include:

Task state — current phase, goals, progress, blockers
File registry — per-file metadata (last action, line count, purpose) for files touched during the session
Memex index — hash IDs and one-line summaries of archived tool outputs

This ensures the agent can resume coherently after compaction without re-reading files or re-running commands.

Task Control

Pause, Stop, Resume, Destroy

Command	Behavior
`/pause`	Gentle halt — lets the current inference turn finish, then stops before the next turn. No new tool calls or inference will begin until `/resume`.
`/stop`	Immediate kill — aborts the current inference mid-stream, saves task state for later resumption.
`/resume`	Continue — resumes a paused or stopped task from where it left off. Also resumes tasks saved by `/stop` or interrupted by `/update`.
`/destroy`	Nuclear option — aborts any active task, deletes the `.oa/` directory, clears the console, and exits to shell.

Session Context Persistence

Context is automatically saved on every task completion and preserved across /update restarts.

/context save      # Force-save current session context
/context restore   # Load previous session context into next task
/context show      # Show saved context status (entries, last saved)

The system maintains a rolling window of the last 20 session entries in .oa/context/session-context.json. When you run /context restore, the last 10 entries are formatted into a restore prompt and injected into your next task, giving the agent continuity across sessions.
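
As a rough sketch of the rolling window and restore prompt described above: the `SessionEntry` fields and both function names below are hypothetical, and the prompt wording is invented for illustration. Only the two limits (20 stored, 10 restored) come from the text.

```typescript
// Hypothetical entry shape; the real session-context.json schema may differ.
interface SessionEntry {
  task: string;
  summary: string;
  savedAt: string; // ISO timestamp
}

const MAX_ENTRIES = 20; // rolling window size
const RESTORE_ENTRIES = 10; // entries included in the restore prompt

// Append an entry and drop the oldest ones beyond the window.
function appendEntry(entries: readonly SessionEntry[], entry: SessionEntry): SessionEntry[] {
  return [...entries, entry].slice(-MAX_ENTRIES);
}

// Format the most recent entries into text to prepend to the next task.
function buildRestorePrompt(entries: readonly SessionEntry[]): string {
  const lines = entries
    .slice(-RESTORE_ENTRIES)
    .map((e, i) => `${i + 1}. [${e.savedAt}] ${e.task}: ${e.summary}`);
  return ["Context from previous sessions:", ...lines].join("\n");
}
```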

During /update, context is automatically saved before the process restarts and restored when the new version resumes your task.

Auto-Restore on Startup

When you launch oa in a workspace that has saved session context from a previous run, you'll be prompted to restore it:

ℹ Previous session found (5 entries, last active 2h ago)
ℹ Last task: fix the auth bug in src/middleware.ts
ℹ Restore previous context? (y/n)
❯ y
ℹ Context restored from 5 session(s). Will be injected into your next task.

Type y to restore — the previous session context will be prepended to your next task, giving the agent full continuity. Type n (or anything else) to start fresh. The prompt only appears on fresh starts, not on /update resumes (which auto-restore context).

Dream Mode — Creative Idle Exploration

When you're not actively tasking the agent, Dream Mode lets it creatively explore your codebase and generate improvement proposals autonomously. The system models real human sleep architecture with four stages per cycle:

Stage	Name	What Happens
NREM-1	Light Scan	Quick codebase overview, surface observations
NREM-2	Pattern Detection	Identify recurring patterns, technical debt, gaps
NREM-3	Deep Consolidation	Synthesize findings into structured proposals
REM	Creative Expansion	Novel ideas, cross-domain connections, bold plans

Each cycle expands through all four stages then contracts (evaluation, pruning of weak ideas). Three modes control how far the agent can go:

/dream              # Default — read-only exploration, proposals saved to .oa/dreams/
/dream deep         # Multi-cycle deep exploration with expansion/contraction phases
/dream lucid        # Full implementation — saves workspace backup, then implements,
                    #   tests, evaluates, and self-plays each proposal with checkpoints
/dream stop         # Wake up — stop dreaming

Default and Deep modes do not modify your project: the agent can only read your code and write proposals to .oa/dreams/. File writes, edits, and shell commands outside that directory are blocked by sandboxed dream tools.
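
One way such a write guard could work is a path check applied before every write. The sketch below is hypothetical (`normalizeRelative` and `isDreamWriteAllowed` are not the package's functions). It handles only workspace-relative POSIX paths and does not account for symlinks or Windows separators.

```typescript
// Resolve "." and ".." segments in a workspace-relative POSIX path.
// Returns null if the path is absolute or escapes the workspace root.
function normalizeRelative(path: string): string[] | null {
  if (path.startsWith("/")) return null;
  const out: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      if (out.length === 0) return null;
      out.pop();
    } else {
      out.push(segment);
    }
  }
  return out;
}

// Hypothetical guard: writes are allowed only for files under .oa/dreams/.
function isDreamWriteAllowed(path: string): boolean {
  const parts = normalizeRelative(path);
  return parts !== null && parts.length > 2 && parts[0] === ".oa" && parts[1] === "dreams";
}
```

Normalizing before checking the prefix matters: a path such as `.oa/dreams/../../src/auth.ts` starts with the allowed directory as a string but resolves outside it.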

Lucid mode unlocks full write access. Before making changes, it saves a workspace checkpoint so you can roll back. Each cycle goes: dream → implement → test → evaluate → checkpoint → next cycle.

All proposals are indexed in .oa/dreams/PROPOSAL-INDEX.md for easy review.

Listen Mode — Live Bidirectional Audio

Listen mode enables real-time voice communication with the agent. Your microphone audio is captured, streamed through Whisper, and the transcription is injected directly into the input line — creating a hands-free coding workflow.

Two transcription backends ensure broad platform support:

transcribe-cli (faster-whisper / ONNX) — used by default, fastest on x86
openai-whisper (Python venv) — automatic fallback for ARM, linux-arm64, or when ONNX is unavailable. Auto-creates a venv and installs deps on first use.

/listen             # Toggle microphone capture on/off
/listen auto        # Auto-submit after 3 seconds of silence (hands-free)
/listen confirm     # Require Enter to submit transcription (default)
/listen stop        # Stop listening

Model selection — choose the Whisper model size for your hardware:

/listen tiny        # Fastest, least accurate (~39MB)
/listen base        # Good balance (~74MB)
/listen small       # Better accuracy (~244MB)
/listen medium      # High accuracy (~769MB)
/listen large       # Best accuracy, slower (~1.5GB)

When combined with /voice, you get full bidirectional audio — speak your tasks, hear the agent's progress through TTS, and speak corrections mid-task. The status bar shows a blinking red ● REC indicator with a countdown timer during auto-mode recording.

Platform support:

Linux x86: arecord (ALSA) or ffmpeg (PulseAudio) + transcribe-cli
Linux ARM: arecord or ffmpeg + openai-whisper (auto-installed in Python venv)
macOS: sox (CoreAudio) or ffmpeg (AVFoundation)

The transcribe-cli dependency auto-installs in the background on first use. On ARM or when transcribe-cli fails, the system automatically falls back to openai-whisper via a self-managed Python venv (same approach used by Moondream vision).

File transcription: Drag-and-drop audio/video files (.mp3, .wav, .mp4, .mkv, etc.) onto the terminal to transcribe them. Results are saved to .oa/transcripts/.

Vision & Desktop Automation (Moondream)

Open Agents can see your screen, understand UI elements, and interact with desktop applications through natural language — powered by the Moondream vision language model running entirely locally.

Desktop Awareness

The agent can take a screenshot and describe what's on screen:

You: what's on my desktop right now?

Agent: [Turn 1] desktop_describe()
       → "A Linux desktop showing three terminal windows with code editors,
          a file manager in the background, and a taskbar at the bottom
          with Firefox, Files, and Terminal icons."

Ask specific questions about the screen:

Agent: [Turn 1] desktop_describe(question="What application is in focus?")
       → "The focused application is a terminal running vim with a Python file open."

Vision Analysis

Analyze any image with four actions:

Agent: vision(image="screenshot.png", action="caption")
       → "A terminal window displaying code with syntax highlighting"

Agent: vision(image="ui.png", action="query", prompt="How many buttons are visible?")
       → "There are 4 buttons visible: Save, Cancel, Help, and Close"

Agent: vision(image="ui.png", action="detect", prompt="button")
       → Detected 4 "button" in ui.png:
         1. bbox: [0.10, 0.85, 0.25, 0.95]
         2. bbox: [0.30, 0.85, 0.45, 0.95]
         ...

Agent: vision(image="ui.png", action="point", prompt="close button")
       → Found 1 "close button" at (0.95, 0.02) — pixel (1824, 22)
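
The outputs above pair normalized coordinates with pixel positions. A minimal sketch of that conversion follows, assuming a 1920x1080 screen (inferred from 0.95 mapping to pixel 1824) and assuming bounding boxes are ordered [xMin, yMin, xMax, yMax]. Neither assumption is confirmed by the text, and the function names are hypothetical.

```typescript
interface Point {
  x: number;
  y: number;
}

type BBox = [number, number, number, number];

// Convert a normalized (0..1) point to integer pixel coordinates.
function toPixel(p: Point, width: number, height: number): Point {
  return { x: Math.round(p.x * width), y: Math.round(p.y * height) };
}

// Center of a normalized box, assuming [xMin, yMin, xMax, yMax] order.
function bboxCenter([xMin, yMin, xMax, yMax]: BBox): Point {
  return { x: (xMin + xMax) / 2, y: (yMin + yMax) / 2 };
}
```

With these definitions, `toPixel({ x: 0.95, y: 0.02 }, 1920, 1080)` yields `{ x: 1824, y: 22 }`, matching the point example above.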

Point-and-Click

Describe what to click in plain English — the agent screenshots, finds the element with Moondream, and clicks it:

Agent: desktop_click(target="the Save button")
       → Clicked "Save button" at (480, 920)

Agent: desktop_click(target="File menu", button="left")
       → Clicked "File menu" at (45, 12)

Agent: desktop_click(target="terminal icon", click_type="double")
       → Clicked "terminal icon" at (1850, 540)

Supports left/right/middle click, single/double click, multi-match selection by index, dry-run mode for verification, and configurable delay for UI transitions.

Setup

Moondream runs locally — no API keys, no cloud, your screen data never leaves your machine:

# Create a Python venv and install Moondream Station
python3 -m venv .moondream-venv
.moondream-venv/bin/pip install moondream-station pydantic uvicorn fastapi packaging

# Start the vision server (downloads model on first run, ~1.7GB)
.moondream-venv/bin/python packages/execution/scripts/start-moondream.py

The vision tools auto-detect a running Moondream Station on localhost:2020. For cloud inference, set MOONDREAM_API_KEY instead.

System dependencies (auto-installed on first use):

Desktop tools automatically install missing system packages when first needed. No manual setup required — just use the tool and it handles the rest:

Tool	Linux Package	What It Does
`scrot`	`apt install scrot`	Screenshot capture
`xdotool`	`apt install xdotool`	Mouse/keyboard automation
`tesseract`	`apt install tesseract-ocr`	OCR text extraction
`identify`	`apt install imagemagick`	Image dimensions/conversion

Supports apt (Debian/Ubuntu), dnf (Fedora), pacman (Arch), and brew (macOS). You can also pre-install everything at once:

./scripts/setup-desktop.sh          # Install all desktop deps
./scripts/setup-desktop.sh --check-only  # Just check what's missing

Vision backend:

Moondream Station (local) — runs entirely on your machine, no API keys needed
Moondream Cloud API — set MOONDREAM_API_KEY for cloud inference

Interactive TUI

Launch without arguments to enter the interactive REPL:

oa

The TUI features an animated multilingual phrase carousel, live metrics bar with pastel-colored labels (token in/out, context window usage, human expert speed ratio, cost), rotating tips, syntax-highlighted tool output, and dynamic terminal-width cropping.

Slash Commands

Command	Description
Model & Endpoint
`/model <name>`	Switch to a different model
`/models`	List all available models
`/endpoint <url>`	Connect to a remote vLLM or OpenAI-compatible API
`/endpoint <url> --auth <key>`	Set endpoint with Bearer auth
Task Control
`/pause`	Pause after current turn finishes (gentle halt)
`/stop`	Kill current inference immediately, save state
`/resume`	Resume a paused or stopped task
`/destroy`	Remove `.oa/` folder, kill all tasks, clear console, exit
Context & Memory
`/context save`	Force-save session context to `.oa/context/`
`/context restore`	Restore context from previous sessions into next task
`/context show`	Show saved session context status
`/compact`	Force context compaction now (default strategy)
`/compact <strategy>`	Compact with strategy: `aggressive`, `decisions`, `errors`, `summary`, `structured`
Audio & Vision
`/voice [model]`	Toggle TTS voice (GLaDOS, Overwatch)
`/listen [mode]`	Toggle live microphone transcription
`/dream [mode]`	Start dream mode (default, deep, lucid)
Display & Behavior
`/stream`	Toggle streaming token display with pastel syntax highlighting
`/bruteforce`	Toggle brute-force mode (auto re-engage on turn limit)
`/verbose`	Toggle verbose mode
Tools & Skills
`/tools`	List agent-created custom tools
`/skills [keyword]`	List/search available AIWG skills
`/<skill-name> [args]`	Invoke an AIWG skill directly
Metrics & Updates
`/cost`	Show token cost breakdown for the current session
`/evaluate`	Score the last completed task with LLM-as-judge
`/stats`	Show session dashboard (turns, tools, tokens, files, task history)
`/task-type <type>`	Set task type for specialized prompts (code, document, analysis, plan)
`/update`	Check for and install updates (seamless context-preserving reload)
`/update auto|manual`	Set update mode (auto after task completion, or manual only)
General
`/config`	Show current configuration
`/clear`	Clear the screen
`/help`	Show all available commands
`/quit`	Exit

All settings commands accept --local to save to project .oa/settings.json instead of global config.

Mid-Task Steering

While the agent is working (shown by the + prompt), type to add context:

> fix the auth bug
  ⎿  Read: src/auth.ts
+ also check the session handling        ← typed while agent works
  ↪ Context added: also check the session handling
  ⎿  Search: session
  ⎿  Edit: src/auth.ts

Tools (47)

Tool	Description
File Operations
`file_read`	Read file contents with line numbers (offset/limit for large files)
`file_write`	Create or overwrite files with automatic directory creation
`file_edit`	Precise string replacement in files (preferred over rewriting)
`file_patch`	Edit specific line ranges in large files (replace, insert_before/after, delete)
`batch_edit`	Multiple edits across files in one call
`list_directory`	List directory contents with types and sizes
Search & Navigation
`grep_search`	Search file contents with regex (ripgrep with grep fallback)
`find_files`	Find files by glob pattern (excludes node_modules/.git)
`codebase_map`	High-level project structure overview with directory tree and language breakdown
Shell & Execution
`shell`	Execute any shell command (non-interactive, CI=true, sudo support)
`code_sandbox`	Isolated code execution (JS, Python, Bash, TS) in subprocess or Docker
`background_run`	Run shell command in background, returns task ID
`task_status`	Check background task status
`task_output`	Read background task output
`task_stop`	Stop a background task
Web
`web_search`	Search the web (DuckDuckGo, Tavily, Jina AI — auto-detected)
`web_fetch`	Fetch and extract text from web pages (HTML stripping)
`web_crawl`	Multi-page web scraping with Crawlee/Playwright for deep documentation
Structured Data
`structured_file`	Generate CSV, TSV, JSON, Markdown tables, Excel-compatible files
`structured_read`	Parse CSV, TSV, JSON, Markdown tables with binary format detection
Vision & Desktop
`vision`	Moondream VLM — caption, query, detect, point on any image
`desktop_click`	Vision-guided clicking: describe a UI element, agent finds and clicks it
`desktop_describe`	Screenshot + Moondream caption/query for desktop awareness
`image_read`	Read images (base64 + OCR metadata)
`screenshot`	Capture screen/window/active window
`ocr`	Extract text from images (Tesseract with multi-variant preprocessing)
`ocr_image_advanced`	Advanced multi-variant OCR pipeline with preprocessing, multi-PSM, and confidence scoring
`ocr_pdf`	Add searchable text layer to scanned/image PDFs
`pdf_to_text`	Extract text from PDF using pdftotext (Poppler) with OCR fallback
Transcription
`transcribe_file`	Transcribe local audio/video files to text (Whisper)
`transcribe_url`	Download and transcribe audio/video from URLs
Memory & Knowledge
`memory_read`	Read from persistent memory store by topic and key
`memory_write`	Store facts/patterns in persistent memory with provenance tracking
`memory_search`	Semantic search across all memory entries by query
`memex_retrieve`	Recover full tool output archived during context compaction by hash ID
Git & Diagnostics
`diagnostic`	Lint/typecheck/test/build validation pipeline in one call
`git_info`	Structured git status, log, diff, branch, staged/unstaged files
Agents & Delegation
`sub_agent`	Delegate subtasks to independent agent instances (foreground or background)
`explore_tools`	Meta-tool: discover and unlock additional tools on demand (for small models)
`task_complete`	Signal task completion with summary
Custom Tools & Skills
`create_tool`	Create reusable custom tools from workflow patterns at runtime
`manage_tools`	List, inspect, delete custom tools
`skill_list`	Discover available AIWG skills
`skill_execute`	Run an AIWG skill
AIWG SDLC
`aiwg_setup`	Deploy AIWG SDLC framework
`aiwg_health`	Analyze project SDLC health and readiness
`aiwg_workflow`	Execute AIWG commands and workflows

Read-only tools execute concurrently when called in the same turn. Mutating tools run sequentially.
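
A sketch of that scheduling rule, using `Promise.allSettled` as mentioned in the feature list. The `ToolCall` and `ToolResult` shapes and the `runTurn` function are hypothetical, and running all read-only calls before the mutating ones is an assumption made here for simplicity.

```typescript
interface ToolCall {
  name: string;
  readOnly: boolean;
  run: () => Promise<string>;
}

type ToolResult =
  | { name: string; ok: true; output: string }
  | { name: string; ok: false; error: string };

async function runTurn(calls: readonly ToolCall[]): Promise<ToolResult[]> {
  const results = new Map<ToolCall, ToolResult>();

  // Read-only calls start together. allSettled never rejects,
  // so one failing call cannot hide the results of the others.
  const readers = calls.filter((c) => c.readOnly);
  const settled = await Promise.allSettled(readers.map((c) => c.run()));
  settled.forEach((s, i) => {
    const call = readers[i];
    results.set(
      call,
      s.status === "fulfilled"
        ? { name: call.name, ok: true, output: s.value }
        : { name: call.name, ok: false, error: String(s.reason) },
    );
  });

  // Mutating calls run one at a time, in the order requested.
  for (const call of calls.filter((c) => !c.readOnly)) {
    try {
      results.set(call, { name: call.name, ok: true, output: await call.run() });
    } catch (err) {
      results.set(call, { name: call.name, ok: false, error: String(err) });
    }
  }

  // Report results in the original call order.
  return calls.map((c) => results.get(c)!);
}
```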

Auto-Expanding Context Window

On startup and /model switch, Open Agents detects your RAM/VRAM and creates an optimized model variant:

Available Memory	Context Window
200GB+	128K tokens
100GB+	64K tokens
50GB+	32K tokens
20GB+	16K tokens
8GB+	8K tokens
< 8GB	4K tokens
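
The table reads as a first-match lookup from highest to lowest memory. A small sketch follows; the function and constant names are hypothetical, and "K" is taken to mean 1024 tokens, which is an assumption (it agrees with the 131,072 figure shown in the status bar example).

```typescript
// Thresholds copied from the table above, highest first.
const CONTEXT_TIERS: ReadonlyArray<{ minGB: number; contextK: number }> = [
  { minGB: 200, contextK: 128 },
  { minGB: 100, contextK: 64 },
  { minGB: 50, contextK: 32 },
  { minGB: 20, contextK: 16 },
  { minGB: 8, contextK: 8 },
];

// Return the context window in tokens for the detected memory (GB).
function contextWindowFor(availableGB: number): number {
  const tier = CONTEXT_TIERS.find((t) => availableGB >= t.minGB);
  return (tier ? tier.contextK : 4) * 1024; // below 8GB falls back to 4K
}
```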

Model-Tier Awareness

Open Agents classifies models into three tiers and adapts its behavior accordingly:

Tier	Parameters	Base Tools	System Prompt	Compaction
Large (≥30B)	70B, 122B	All 47 tools	Full (344 lines)	40K threshold
Medium (8-29B)	9B, 27B	15 core tools	Condensed (100 lines)	24K threshold
Small (≤7B)	4B, 1.5B	6 base tools + explore_tools	Minimal (15 lines)	12K threshold

Tool Nesting for Small Models

Small models use an explore_tools meta-tool pattern inspired by hierarchical API retrieval research (ToolLLM, arXiv:2307.16789). Instead of presenting all 47 tools (which overwhelms small context windows), only 6 core tools are loaded initially:

file_read, file_write, file_edit, shell, task_complete, explore_tools

The agent can call explore_tools() to see a catalog of additional tools with one-line descriptions, then explore_tools(enable="grep_search") to unlock specific tools as needed. This reduces tool schema tokens by ~80% while preserving access to the full toolset.
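
The meta-tool pattern can be illustrated with a small registry that exposes only enabled tools and unlocks others by name. `ToolRegistry` and `ToolSpec` are hypothetical names and the returned strings are invented for the example; only the `explore_tools()` and `explore_tools(enable=...)` behavior follows the description above.

```typescript
interface ToolSpec {
  name: string;
  description: string;
}

class ToolRegistry {
  private readonly catalog: readonly ToolSpec[];
  private readonly enabled: Set<string>;

  constructor(catalog: readonly ToolSpec[], coreTools: readonly string[]) {
    this.catalog = catalog;
    this.enabled = new Set(coreTools);
  }

  // Schemas sent to the model: only the currently enabled tools.
  activeTools(): ToolSpec[] {
    return this.catalog.filter((t) => this.enabled.has(t.name));
  }

  // With no argument: one-line catalog of tools not yet enabled.
  // With a name: unlock that tool for subsequent turns.
  exploreTools(enable?: string): string {
    if (enable === undefined) {
      return this.catalog
        .filter((t) => !this.enabled.has(t.name))
        .map((t) => `${t.name}: ${t.description}`)
        .join("\n");
    }
    if (!this.catalog.some((t) => t.name === enable)) {
      return `Unknown tool: ${enable}`;
    }
    this.enabled.add(enable);
    return `Enabled ${enable}`;
  }
}
```

The token saving comes from `activeTools()`: full schemas are sent only for enabled tools, while the catalog costs one short line per remaining tool and only when requested.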

This approach is substantiated by:

Gorilla (arXiv:2305.15334) — 7B model with retrieval outperforms GPT-4 on tool-calling hallucination rate
DFSDT (arXiv:2307.16789) — ToolLLaMA-7B with depth-first search scored 66.7%, approaching GPT-4's 70.4%
Octopus v2 (arXiv:2404.01744) — 2B model achieved 99.5% function-calling accuracy with context-efficient tool encoding

Dynamic Context Limits

All context-dependent values scale automatically with the actual context window size:

Setting	How It Scales
Compaction threshold	min(tier default, 75% of context window)
Recent messages kept	1 message per 2-4K of context (tier-dependent)
Max output tokens	25% of context window (min 2048)
Tool output cap	2K-8K chars (scales with context)
File read limits	80-120 line cap for small/medium context windows
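
The first and third rows of the table are precise enough to express directly. In the sketch below the function names are hypothetical. Tier boundaries follow the tier tables, and a model between 7B and 8B is treated as small, which is an assumption since the tables leave that gap undefined.

```typescript
type Tier = "small" | "medium" | "large";

// 30B and up is large, 8B up to 30B is medium, anything smaller is small.
function classifyTier(paramsBillions: number): Tier {
  if (paramsBillions >= 30) return "large";
  if (paramsBillions >= 8) return "medium";
  return "small";
}

// Tier default thresholds in tokens, from the tier table.
const TIER_THRESHOLD: Record<Tier, number> = {
  large: 40_000,
  medium: 24_000,
  small: 12_000,
};

// min(tier default, 75% of context window)
function compactionThreshold(tier: Tier, contextWindow: number): number {
  return Math.min(TIER_THRESHOLD[tier], Math.floor(contextWindow * 0.75));
}

// 25% of context window, but never below 2048
function maxOutputTokens(contextWindow: number): number {
  return Math.max(2048, Math.floor(contextWindow * 0.25));
}
```

For a large-tier model on a 32K window, the 75% rule (24,576 tokens) is lower than the 40,000 tier default, so compaction starts earlier than the tier table alone would suggest.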

Voice Feedback (TTS)

/voice              # Toggle on/off (default: GLaDOS)
/voice glados       # GLaDOS voice
/voice overwatch    # Overwatch voice

Auto-downloads the ONNX voice model (~50MB) on first use. Install espeak-ng for best quality (apt install espeak-ng / brew install espeak-ng).

Human Expert Speed Ratio

The status bar displays a real-time Exp: Nx gauge estimating how fast the agent is working relative to a leading human expert performing equivalent tasks.

In: 12,345 | Out: 4,567 | Ctx: 18,000/131,072 86% | Exp: 4.2x | Cost: $0.34
                                                       ^^^^^^^^
                                                    Agent is 4.2x faster
                                                    than a human expert

How It Works

Each tool call maps to a calibrated expert baseline time — the estimated seconds a top-tier human developer would take to perform the equivalent operation manually:

Operation	Expert Time	Agent Equivalent
Read a file	12s	`file_read`
Write a new file	90s	`file_write`
Make a precise edit	25s	`file_edit`
Grep search + scan results	15s	`grep_search`
Run a shell command	20s	`shell`
Web search + evaluate	60s	`web_search`
Survey codebase structure	180s	`codebase_map`

Additional overhead per action:

+5s context-switch per tool call (expert switching between tools)
+15s planning per reasoning turn (expert thinking about next step)

The ratio accumulates across all tasks in the session:

speedRatio = totalHumanExpertTime / totalAgentWallClockTime

Color coding: green (2x+ faster), yellow (1-2x, comparable), red (<1x, slower than expert).

All 47 tools have calibrated baselines ranging from 3s (task_stop) to 180s (codebase_map). Unknown tools default to 20s.
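
Putting the baselines, overheads, formula, and color bands together gives roughly the following. The function and constant names are hypothetical, only the baselines listed in this section are included, and the band edges (2x and 1x inclusive) are an interpretation of the color coding above.

```typescript
// Baselines in seconds, from the table above; other tools use the 20s default.
const EXPERT_BASELINE_S: Record<string, number> = {
  file_read: 12,
  file_write: 90,
  file_edit: 25,
  grep_search: 15,
  shell: 20,
  web_search: 60,
  codebase_map: 180,
  task_stop: 3,
};
const DEFAULT_BASELINE_S = 20;
const CONTEXT_SWITCH_S = 5; // added per tool call
const PLANNING_S = 15; // added per reasoning turn

// Estimated time a human expert would need for the same work.
function expertSeconds(toolCalls: readonly string[], reasoningTurns: number): number {
  const toolTime = toolCalls.reduce(
    (sum, name) => sum + (EXPERT_BASELINE_S[name] ?? DEFAULT_BASELINE_S) + CONTEXT_SWITCH_S,
    0,
  );
  return toolTime + reasoningTurns * PLANNING_S;
}

function speedRatio(totalExpertSeconds: number, agentWallClockSeconds: number): number {
  return totalExpertSeconds / agentWallClockSeconds;
}

function gaugeColor(ratio: number): "green" | "yellow" | "red" {
  if (ratio >= 2) return "green";
  if (ratio >= 1) return "yellow";
  return "red";
}
```

For example, four turns calling `file_read`, `grep_search`, `file_edit`, and `shell` total 72s of baselines, 20s of context switches, and 60s of planning, or 152s. If the agent took 38s of wall-clock time, the gauge would read 4.0x.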

Cost Tracking & Session Metrics

Real-time token cost estimation for cloud providers. The status bar shows running cost when using a paid endpoint.

/cost              # Show cost breakdown by model/provider
/stats             # Session metrics: turns, tool calls, tokens, files modified
/evaluate          # Score the last completed task (LLM-as-judge, 5 rubric dimensions)

Cost tracking supports 15+ providers including Groq, Together AI, OpenRouter, Fireworks AI, DeepInfra, Mistral, Cerebras, and more. Pricing is per-million tokens with separate input/output rates.
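
Per-million-token pricing with separate input and output rates reduces to a single formula. The sketch below uses hypothetical names, and any rates passed in are placeholders rather than real provider prices.

```typescript
// USD per million tokens, with separate input and output rates.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

function estimateCost(inputTokens: number, outputTokens: number, p: Pricing): number {
  return (
    (inputTokens / 1_000_000) * p.inputPerMTok +
    (outputTokens / 1_000_000) * p.outputPerMTok
  );
}
```

With placeholder rates of $0.50 input and $1.50 output per million tokens, 1,000,000 input tokens and 500,000 output tokens come to $1.25.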

Work evaluation uses five task-type-specific rubrics (code, document, analysis, plan, general) scoring correctness, completeness, efficiency, code quality, and communication on a 1-5 scale.

Code Sandbox

Execute code snippets in isolated environments without affecting your project:

Agent: code_sandbox(language="python", code="import math; print(math.factorial(20))")
       → 2432902008176640000

Agent: code_sandbox(language="javascript", code="console.log([...new Set([1,2,2,3])].length)")
       → 3

Supports JavaScript, TypeScript, Python, and Bash. Two execution modes:

Subprocess (default) — runs in a child process with timeout and output limits
Docker — runs in an isolated container when docker is available

Structured Data Tools

Generate structured files

Create CSV, TSV, JSON, Markdown tables, and Excel-compatible files from data:

Agent: structured_file(format="csv", path="results.csv", columns=["name","score"],
         data=[{"name":"Alice","score":95},{"name":"Bob","score":87}])
       → Created results.csv (2 rows, 2 columns)

Read structured files

Parse existing data files with automatic format detection:

Agent: structured_read(path="data.csv")
       → CSV: 150 rows, 5 columns [showing first 100]

Agent: structured_read(path="report.md")
       → Markdown: 3 table(s) extracted

Detects binary formats (XLSX, PDF, DOCX) and suggests conversion tools.

Multi-Provider Web Search

Web search automatically selects the best available provider:

Provider	Trigger	Features
DuckDuckGo	Default (no key needed)	Free, privacy-focused
Tavily	`TAVILY_API_KEY` set	Structured results + AI-generated answer
Jina AI	`JINA_API_KEY` set	Markdown-formatted results

export TAVILY_API_KEY=tvly-...   # Enable Tavily (optional)
export JINA_API_KEY=jina_...     # Enable Jina AI (optional)
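
The trigger column above suggests selection based on which environment variables are set. The following is a sketch only: `pickSearchProvider` is a hypothetical name, and the precedence when both keys are present (Tavily first) is an assumption the text does not state.

```typescript
type SearchProvider = "tavily" | "jina" | "duckduckgo";

// Pick a provider from environment variables.
// Precedence when both keys are set is assumed (Tavily first).
function pickSearchProvider(env: Readonly<Record<string, string | undefined>>): SearchProvider {
  if (env["TAVILY_API_KEY"]) return "tavily";
  if (env["JINA_API_KEY"]) return "jina";
  return "duckduckgo"; // default, no key needed
}
```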

Task Templates

Set a task type to get specialized system prompts, recommended tools, and output guidance:

/task-type code       # Code generation/fix — emphasizes tests, diffs, file edits
/task-type document   # Documentation — emphasizes clarity, structure, completeness
/task-type analysis   # Analysis tasks — emphasizes data, metrics, evidence
/task-type plan       # Planning — emphasizes steps, dependencies, risks

Configuration

Config priority: CLI flags > env vars > ~/.open-agents/config.json > defaults.

open-agents config set model qwen3.5:122b
open-agents config set backendUrl http://localhost:11434
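
The priority order above can be pictured as layered merging, where each higher-priority source overrides only the keys it actually sets. This is an illustrative sketch: the `Config` shape is limited to the two keys shown in the commands above, and `resolveConfig` and `defined` are hypothetical helpers.

```typescript
interface Config {
  model: string;
  backendUrl: string;
}

// Drop undefined values so an unset key does not override a lower-priority layer.
function defined(layer: Partial<Config>): Partial<Config> {
  const out: Partial<Config> = {};
  if (layer.model !== undefined) out.model = layer.model;
  if (layer.backendUrl !== undefined) out.backendUrl = layer.backendUrl;
  return out;
}

function resolveConfig(
  defaults: Config,
  file: Partial<Config>,
  env: Partial<Config>,
  cli: Partial<Config>,
): Config {
  // Later spreads win: defaults < config file < env vars < CLI flags.
  return { ...defaults, ...defined(file), ...defined(env), ...defined(cli) };
}
```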

Project Context

Create AGENTS.md, OA.md, or .open-agents.md in your project root for agent instructions. Context files merge from parent to child directories.

`.oa/` Project Directory

.oa/
├── config.json        # Project config overrides
├── settings.json      # TUI settings (model, endpoint, voice, stream, etc.)
├── memory/            # Persistent memory store (topics, patterns, facts)
├── dreams/            # Dream mode proposals & checkpoints
├── transcripts/       # Audio/video transcriptions
├── index/             # Cached codebase index
├── context/           # Session context persistence
│   └── session-context.json  # Rolling 20-entry context window
├── session/           # Compaction summaries for crash recovery
├── history/           # Session history
└── pending-task.json  # Saved task state for /stop and /update resume
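
The rolling 20-entry window in session-context.json can be pictured as an append that drops the oldest entries. This is a sketch under assumptions: the `ContextEntry` shape and the `appendEntry` name are hypothetical, and only the limit of 20 comes from the tree above.

```typescript
// Hypothetical entry shape; the real file format is not documented here.
interface ContextEntry {
  timestamp: string;
  summary: string;
}

const MAX_ENTRIES = 20;

// Append an entry and keep only the most recent MAX_ENTRIES.
function appendEntry(entries: ContextEntry[], entry: ContextEntry): ContextEntry[] {
  return [...entries, entry].slice(-MAX_ENTRIES);
}

let windowEntries: ContextEntry[] = [];
for (let i = 0; i < 25; i++) {
  windowEntries = appendEntry(windowEntries, { timestamp: `t${i}`, summary: `entry ${i}` });
}
// windowEntries now holds entries 5 through 24 (20 items).
```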

Model Support

Primary target: Qwen3.5-122B-A10B via Ollama (MoE, 48GB+ VRAM)

Any Ollama or OpenAI-compatible API model with tool calling works:

oa --model qwen2.5-coder:32b "fix the bug"
oa --backend vllm --backend-url http://localhost:8000/v1 "add tests"
oa --backend-url http://10.0.0.5:11434 "refactor auth"

Supported Inference Providers

Open Agents auto-detects your provider from the endpoint URL and configures auth + health checks accordingly. Providers that need an API key use standard Authorization: Bearer <key> authentication.

Provider	Endpoint URL	API Key	Notes
Ollama (local)	`http://localhost:11434`	None	Default. Auto-detects, auto-expands context window
vLLM (local)	`http://localhost:8000`	Optional	Self-hosted OpenAI-compatible server
LM Studio (local)	`http://localhost:1234`	None	Local model server with GUI
Chutes AI	`https://llm.chutes.ai`	`cpk_...`	Bearer auth. Fast cloud inference
Together AI	`https://api.together.xyz`	Required	Large model catalog
Groq	`https://api.groq.com/openai`	`gsk_...`	Ultra-fast LPU inference
OpenRouter	`https://openrouter.ai/api`	`sk-or-...`	Multi-provider routing
Fireworks AI	`https://api.fireworks.ai/inference`	`fw_...`	Fast serverless inference
DeepInfra	`https://api.deepinfra.com`	Required	Cost-effective inference
Mistral AI	`https://api.mistral.ai`	Required	Mistral models
Cerebras	`https://api.cerebras.ai`	`csk-...`	Wafer-scale inference
SambaNova	`https://api.sambanova.ai`	Required	RDU-accelerated inference
NVIDIA NIM	`https://integrate.api.nvidia.com`	`nvapi-...`	NVIDIA cloud inference
Hyperbolic	`https://api.hyperbolic.xyz`	Required	GPU cloud inference
OpenAI	`https://api.openai.com`	`sk-...`	GPT models (tool calling)
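
Detection from the endpoint URL could be a hostname lookup against the table above. The sketch below is not the package's implementation: `detectProvider` is a hypothetical name, and recognizing local servers by their default port is an assumption.

```typescript
// Hostnames taken from the provider table.
const HOSTED_PROVIDERS = new Map<string, string>([
  ["llm.chutes.ai", "Chutes AI"],
  ["api.together.xyz", "Together AI"],
  ["api.groq.com", "Groq"],
  ["openrouter.ai", "OpenRouter"],
  ["api.fireworks.ai", "Fireworks AI"],
  ["api.deepinfra.com", "DeepInfra"],
  ["api.mistral.ai", "Mistral AI"],
  ["api.cerebras.ai", "Cerebras"],
  ["api.sambanova.ai", "SambaNova"],
  ["integrate.api.nvidia.com", "NVIDIA NIM"],
  ["api.hyperbolic.xyz", "Hyperbolic"],
  ["api.openai.com", "OpenAI"],
]);

// Assumption: local servers are recognized by their default ports.
const LOCAL_PORTS = new Map<string, string>([
  ["11434", "Ollama"],
  ["8000", "vLLM"],
  ["1234", "LM Studio"],
]);

// Return the provider name for an endpoint URL, or "unknown".
function detectProvider(endpoint: string): string {
  const match = /^https?:\/\/([^\/:?#]+)(?::(\d+))?/i.exec(endpoint.trim());
  if (match === null) return "unknown";
  const host = (match[1] ?? "").toLowerCase();
  const port = match[2] ?? "";
  return HOSTED_PROVIDERS.get(host) ?? LOCAL_PORTS.get(port) ?? "unknown";
}
```

Checking the hostname before the port lets a LAN address such as `http://10.0.0.5:8000` still resolve to vLLM.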

Connecting to a Provider

Use /endpoint in the TUI, or pass the URL on the command line with --backend-url:

# Chutes AI
/endpoint https://llm.chutes.ai --auth cpk_your_key_here

# Groq
/endpoint https://api.groq.com/openai --auth gsk_your_key_here

# Together AI
/endpoint https://api.together.xyz --auth your_key_here

# Self-hosted vLLM on LAN
/endpoint http://10.0.0.5:8000

The agent auto-detects the provider, normalizes the URL (a pasted full endpoint URL has its /v1/chat/completions suffix stripped), tests connectivity, and saves the configuration.
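
The URL cleanup can be sketched as follows. `normalizeEndpoint` is a hypothetical name, and the sketch strips only the /v1/chat/completions suffix named above plus trailing slashes; whether the real code removes anything else is not stated.

```typescript
// Normalize a pasted endpoint URL to its base form.
function normalizeEndpoint(input: string): string {
  return input
    .trim()
    .replace(/\/+$/, "")                       // drop trailing slashes
    .replace(/\/v1\/chat\/completions$/, "")   // drop a pasted full endpoint path
    .replace(/\/+$/, "");                      // drop slashes exposed by the strip
}

const cleaned = normalizeEndpoint("https://api.groq.com/openai/v1/chat/completions");
// cleaned is "https://api.groq.com/openai"
```

A bare /v1 suffix is left alone in this sketch, since the example above passes `http://localhost:8000/v1` as a backend URL.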

Evaluation Suite

40 evaluation tasks test the agent's autonomous capabilities across coding, web research, SDLC analysis, tool creation, multi-file reasoning, and memory systems:

node eval/run-agentic.mjs                          # Run all tasks
node eval/run-agentic.mjs 04-add-test              # Single task
node eval/run-agentic.mjs --model qwen2.5-coder:32b  # Different model

ID	Task	Category
01	Fix typo in function name	Code Fix
02	Add isPrime function	Code Generation
03	Fix off-by-one bug	Code Fix
04	Write comprehensive tests	Test Generation
05	Extract functions from long method	Refactoring
06	Fix TypeScript type errors	Type Safety
07	Add REST API endpoint	Feature Addition
08	Add pagination across files	Multi-File Edit
09	CSS named color lookup (148 colors)	Web Research
10	HTTP status code lookup (32+ codes)	Web Research
11	MIME type lookup (30+ types)	Web Research
12	SDLC health analyzer	AIWG Analysis
13	SDLC artifact generator	AIWG Generation
14	Batch refactor variable names	Multi-File Refactor
15	Codebase overview from structure	Code Analysis
16	Diagnostic fix loop	Error Recovery
17	Git repository analyzer	Git Integration
18	Create custom tool from spec	Tool Creation
19	Tool from usage pattern	Tool Discovery
20	Tool management operations	Tool Lifecycle
21	Large file patch	Precision Editing
22	Skill discovery	Skill System
23	Skill execution	Skill System
24-30	Additional coding tasks	Various
31	Web extractor bug fixes (3 bugs)	Multi-Bug Fix
32	CSV pipeline across 3 files	Multi-File Tracking
33	FSM bug fixes + factory implementation	State Machine
34	Search pre-populated memories	Memory Search
35	Analyze code, write to memory, cross-reference	Memory Cross-Reference
36	Discover explore_tools, unlock grep_search	Explore Tools
37	Analyze code patterns, store and recall from memory	Memory Store & Recall
38	Read configs, write to multiple memory topics	Memory Multi-Topic
39	Search pre-loaded memories across 3 topics	Memory Pre-Loaded Search
40	Combined explore_tools + memory analysis pipeline	Explore + Memory

Tasks 31-33 are designed for evaluating small models (≤9B) using file_edit patterns. Tasks 34-40 test the memory system (read/write/search) and tool discovery.

Benchmark Results

Qwen3.5-122B: 100% pass rate (37/37 tasks, including memory tasks 34-40)
Qwen3.5-27B:  100% pass rate (30/30 tasks)
Qwen3.5-9B:   100% pass rate (tasks 31-33, file_edit-optimized)
              71% pass rate (5/7 memory tasks 34-40)

The eval runner includes model-tier-aware features: automatic tool set filtering, HTTP 500 recovery with file_edit hints, loop detection with tool banning, and tier-based output truncation.

AIWG Integration

Open Agents integrates with AIWG for AI-augmented software development:

npm i -g aiwg
oa "analyze this project's SDLC health and set up documentation"

Capability	Description
Structured Memory	`.aiwg/` directory persists project knowledge
SDLC Artifacts	Requirements, architecture, test strategy, deployment docs
Health Analysis	Score your project's SDLC maturity
85+ Agents	Specialized AI personas (Test Engineer, Security Auditor, API Designer)
Traceability	@-mention system links requirements to code to tests

Architecture

The core is AgenticRunner — a multi-turn tool-calling loop with context management:

User task → System prompt + tools → LLM → tool_calls → Execute → Feed results → LLM
                                          ↓                                      ↑
                                    Compaction check ─── Memex archive ─── Context restore
                                          (repeat until task_complete or max turns)

Tool-first — the model explores via tools, not pre-stuffed context
Iterative — tests, sees failures, fixes them
Parallel-safe — read-only tools concurrent, mutating tools sequential
Observable — every tool call and result emitted as a real-time event
Bounded — max turns, timeout, output limits prevent runaway loops
Context-aware — dynamic compaction, Memex archiving, session persistence, model-tier scaling
Brute-force — optional auto re-engagement when turn limit is hit (keeps going until task_complete or user abort)
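
The parallel-safe rule can be illustrated by grouping one turn's tool calls into batches. This is a sketch, not AgenticRunner's actual scheduler: `planBatches` and the `readOnly` flag are hypothetical, and which tools count as read-only is assumed for the example.

```typescript
interface ToolCall {
  name: string;
  readOnly: boolean; // assumed flag; the real runner may classify tools differently
}

// Group consecutive read-only calls into one batch that may run concurrently.
// Each mutating call gets a batch of its own, and the original order is preserved.
function planBatches(calls: ToolCall[]): ToolCall[][] {
  const batches: ToolCall[][] = [];
  let pending: ToolCall[] = [];
  for (const call of calls) {
    if (call.readOnly) {
      pending.push(call);
      continue;
    }
    if (pending.length > 0) {
      batches.push(pending);
      pending = [];
    }
    batches.push([call]);
  }
  if (pending.length > 0) batches.push(pending);
  return batches;
}

const batches = planBatches([
  { name: "grep_search", readOnly: true },
  { name: "read_structured_file", readOnly: true },
  { name: "file_edit", readOnly: false },
  { name: "structured_file", readOnly: false },
  { name: "grep_search", readOnly: true },
]);
// Batch sizes: 2, 1, 1, 1
```

A runner could then await each batch in turn, using something like `Promise.all` within a batch, so reads overlap while writes never run at the same time as anything else.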

License

MIT
