JSPM

open-agents-ai

0.36.0
    • Downloads 26260
    • License MIT

    AI coding agent powered by open-source models (Ollama/vLLM) — interactive TUI with agentic tool-calling loop

    Package Exports

    • open-agents-ai
    • open-agents-ai/dist/index.js
    • open-agents-ai/dist/launcher.cjs

    This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (open-agents-ai) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

    Readme

    freedom of information · freedom of patterns · creating freely · open-weights
    libertad de informacion · crear libremente · creer librement · liberte d'expression
    Freiheit der Muster · jiyuu ni souzou suru · jayuroun changjak · svoboda tvorchestva
    liberdade de criar · creare liberamente · ozgurce yarat · skapa fritt
    vrij creeren · tworz swobodnie · dimiourgia elefthera · khuli soch
    hurriyat al-ibdaa · code is poetry · democratize AI · imagine freely


    Open Agents

    npm i -g open-agents-ai && oa

    AI coding agent powered entirely by open-weight models. No API keys. No cloud. Your code never leaves your machine.

    An autonomous multi-turn tool-calling agent that reads your code, makes changes, runs tests, and fixes failures in an iterative loop until the task is complete. On first launch it detects your hardware and configures the optimal model with an expanded context window.

    Features

    • 48 autonomous tools — file I/O, shell, grep, web search/fetch/crawl, memory (read/write/search), sub-agents, background tasks, image/OCR/PDF, git, diagnostics, vision, desktop automation, browser automation, structured files, code sandbox, transcription, skills
    • Moondream vision — see and interact with the desktop via Moondream VLM (caption, query, detect, point-and-click)
    • Desktop automation — vision-guided clicking: describe a UI element in natural language, the agent finds and clicks it
    • Auto-install desktop deps — screenshot, mouse, OCR, and image tools auto-install missing system packages (scrot, xdotool, tesseract, imagemagick) on first use
    • Parallel tool execution — read-only tools run concurrently via Promise.allSettled
    • Sub-agent delegation — spawn independent agents for parallel workstreams
    • Ralph Loop — iterative task execution that keeps retrying until completion criteria are met
    • Dream Mode — creative idle exploration modeled after real sleep architecture (NREM→REM cycles)
    • Live Listen — bidirectional voice communication with real-time Whisper transcription
    • Neural TTS — hear what the agent is doing via GLaDOS or Overwatch ONNX voices, with personality-driven expressiveness
    • Personality Core — SAC framework-based style control (concise/balanced/verbose/pedagogical) that shapes agent response depth, voice expressiveness, and system prompt behavior
    • Human expert speed ratio — real-time Exp: Nx gauge comparing agent speed to a leading human expert, calibrated across 47 tool baselines
    • Cost tracking — real-time token cost estimation for 15+ cloud providers
    • Work evaluation — LLM-as-judge scoring with task-type-specific rubrics
    • Session metrics — track turns, tool calls, tokens, files modified, tasks completed per session
    • Structured file generation — create CSV, TSV, JSON, Markdown tables, and Excel-compatible files
    • Code sandbox — isolated code execution in subprocess or Docker (JS, Python, Bash, TypeScript)
    • Structured file reading — parse CSV, TSV, JSON, Markdown tables with binary format detection
    • Multi-provider web search — DuckDuckGo (free), Tavily (structured), Jina AI (markdown) with auto-detection
    • Browser automation — headless Chrome control via Selenium: navigate, click, type, screenshot, read DOM — auto-starts on first use with self-bootstrapping Python venv
    • Web crawling — multi-page web scraping with Crawlee/Playwright for deep documentation extraction
    • Task templates — specialized system prompts and tool recommendations for code, document, analysis, plan tasks
    • Auto-expanding context — detects RAM/VRAM and creates an optimized model variant on first run
    • Mid-task steering — type while the agent works to add context without interrupting
    • Smart compaction — 6 context compaction strategies (default, aggressive, decisions, errors, summary, structured) with research-backed design
    • Memex experience archive — large tool outputs archived during compaction with hash-based retrieval
    • Persistent memory — learned patterns stored in .oa/memory/ across sessions
    • Session context persistence — auto-saves context on task completion, manual /context save|restore across sessions
    • Self-learning — auto-fetches docs from the web when encountering unfamiliar APIs
    • Seamless /update — in-place update and reload with automatic context save/restore
    • Task control: /pause (gentle halt at turn boundary), /stop (immediate kill), /resume to continue
    • Model-tier awareness — dynamic tool sets, prompt complexity, and context limits scale with model size (small/medium/large)

    How It Works

    You: oa "fix the null check in auth.ts"
    
    Agent: [Turn 1] file_read(src/auth.ts)
           [Turn 2] grep_search(pattern="null", path="src/auth.ts")
           [Turn 3] file_edit(old_string="if (user)", new_string="if (user != null)")
           [Turn 4] shell(command="npm test")
           [Turn 5] task_complete(summary="Fixed null check — all tests pass")

    The agent uses tools autonomously in a loop — reading errors, fixing code, and re-running validation until the task succeeds or the turn limit is reached.
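
    The loop described above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the Model and Tool types, the runAgent name, and the default turn limit of 25 are all assumptions made for the example.

```typescript
// Hypothetical shapes: the real agent's types and model client are not shown in this README.
type ToolCall = { name: string; args: Record<string, string> };
type ModelReply = { toolCall: ToolCall };
type Model = (history: string[]) => Promise<ModelReply>;
type Tool = (args: Record<string, string>) => Promise<string>;

async function runAgent(
  task: string,
  model: Model,
  tools: Record<string, Tool>,
  maxTurns = 25, // assumed default, not documented
): Promise<{ done: boolean; turns: number; summary?: string }> {
  const history: string[] = [`user: ${task}`];
  for (let turn = 1; turn <= maxTurns; turn++) {
    const { toolCall } = await model(history);
    if (toolCall.name === "task_complete") {
      return { done: true, turns: turn, summary: toolCall.args.summary };
    }
    const tool = tools[toolCall.name];
    // Errors are fed back to the model as tool output instead of ending the session.
    const result = tool
      ? await tool(toolCall.args).catch((e: unknown) => `error: ${String(e)}`)
      : `error: unknown tool ${toolCall.name}`;
    history.push(`[Turn ${turn}] ${toolCall.name} -> ${result}`);
  }
  return { done: false, turns: maxTurns }; // turn limit reached
}
```

    The key design point is that a failing tool call becomes part of the history the model sees on the next turn, so the loop only ends on task_complete or when the turn limit is hit.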

    Ralph Loop — Iteration-First Design

    The Ralph Loop is the core execution philosophy: iteration beats perfection. Instead of trying to get everything right on the first attempt, the agent executes in a retry loop where errors become learning data rather than session-ending failures.

    /ralph "fix all failing tests" --completion "npm test passes with 0 failures"
    /ralph "migrate to TypeScript" --completion "npx tsc --noEmit exits 0" --max-iterations 20
    /ralph "reach 80% coverage" --completion "coverage report shows >80%" --timeout 120

    Each iteration:

    1. Execute — make changes based on the task + all accumulated learnings
    2. Verify — run the completion command (tests, build, lint, coverage)
    3. Learn — if verification fails, extract what went wrong and why
    4. Iterate — retry with the new knowledge until passing or limits reached

    The loop tracks iteration history, generates completion reports saved to .aiwg/ralph/, and supports resume/abort for interrupted sessions. Safety bounds (max iterations, timeout) prevent runaway loops.
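
    As an illustration of the four steps and the two safety bounds, here is a small sketch. The execute and verify callbacks, the RalphOptions shape, and the status strings are hypothetical stand-ins; the real loop also writes reports to .aiwg/ralph/, which is omitted here.

```typescript
// Hypothetical callbacks: execute() applies changes, verify() runs the completion command.
type Verification = { passed: boolean; output: string };

interface RalphOptions {
  execute: (task: string, learnings: string[]) => Promise<void>;
  verify: () => Promise<Verification>;
  maxIterations: number;
  timeoutMs: number;
}

async function ralphLoop(task: string, opts: RalphOptions) {
  const learnings: string[] = [];
  const deadline = Date.now() + opts.timeoutMs;
  for (let i = 1; i <= opts.maxIterations; i++) {
    if (Date.now() > deadline) {
      return { status: "timeout" as const, iterations: i - 1, learnings };
    }
    await opts.execute(task, learnings); // 1. Execute with accumulated learnings
    const result = await opts.verify(); // 2. Verify with the completion command
    if (result.passed) {
      return { status: "passed" as const, iterations: i, learnings };
    }
    learnings.push(`iteration ${i}: ${result.output}`); // 3. Learn from the failure
  } // 4. Iterate until passing or a limit is reached
  return { status: "max-iterations" as const, iterations: opts.maxIterations, learnings };
}
```

    Both bounds are checked inside the loop, so a task that never passes verification still terminates.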

    /ralph-status     # Check current/previous loop status
    /ralph-resume     # Resume interrupted loop
    /ralph-abort      # Cancel running loop

    Context Compaction — Research-Backed Memory Management

    Long conversations consume context window tokens. Open Agents uses progressive context compaction to compress older messages while preserving critical information — decisions, errors, file states, and task progress.

    How It Works

    Compaction triggers automatically when estimated token usage reaches 75% of the model's context window. The system:

    1. Preserves the system prompt and initial user task (head messages)
    2. Summarizes middle messages (tool calls, results, exploration) into a structured digest
    3. Keeps recent messages verbatim (scaled by model tier and context size)
    4. Archives large tool outputs to the Memex experience archive (retrievable by hash ID via memex_retrieve)
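
    A rough sketch of steps 1 to 3 follows (archiving to Memex is left out). The Message type, the 4-characters-per-token estimate, the assumption that the first two messages are the head, and the summarize callback are illustrative choices, not the project's actual code.

```typescript
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Rough token estimate (about 4 characters per token); an assumption, not the project's estimator.
const estimateTokens = (msgs: Message[]): number =>
  msgs.reduce((n, m) => n + Math.ceil(m.content.length / 4), 0);

function compact(
  messages: Message[],
  contextWindow: number,
  keepRecent: number,
  summarize: (middle: Message[]) => string, // hypothetical summarizer
): Message[] {
  // Trigger only once estimated usage reaches 75% of the context window.
  if (estimateTokens(messages) < 0.75 * contextWindow) return messages;
  const head = messages.slice(0, 2); // system prompt + initial user task
  const recentStart = Math.max(2, messages.length - keepRecent);
  const middle = messages.slice(2, recentStart);
  if (middle.length === 0) return messages; // nothing to summarize
  const digest: Message = { role: "assistant", content: summarize(middle) };
  return [...head, digest, ...messages.slice(recentStart)]; // recent messages stay verbatim
}
```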

    Compaction Strategies

    Six strategies are available via /compact <strategy>:

    Strategy | What It Preserves | Best For
    default | Progressive summarization — decisions, errors, file changes, task state | General use
    aggressive | Only key decisions and errors, maximum compression | Very long sessions
    decisions | Action→outcome pairs only, discards exploration | Decision-heavy workflows
    errors | Full error context preserved, successes compressed | Debugging sessions
    summary | High-level paragraph summary, minimal detail | Quick context reset
    structured | LLM-generated structured summary via a separate inference call | Highest quality summaries

    Automatic Compaction

    Compaction thresholds scale dynamically with model size:

    Model Tier | Threshold | Recent Messages Kept
    Large (30B+) | 40,000 tokens (or 75% of context) | 12 messages
    Medium (8-29B) | 24,000 tokens (or 75% of context) | 8 messages
    Small (≤7B) | 12,000 tokens (or 75% of context) | 4-6 messages

    Status Bar Context Tracking (Ctx:)

    The status bar displays a live Ctx: gauge showing estimated context window usage:

    In: 12,345 | Out: 4,567 | Ctx: 18,000/131,072 86% | Exp: 4.2x
                               ^^^^^^^^^^^^^^^^^^^^^^^^
                               Estimated tokens used / total context window

    This gauge reflects the post-compaction token count — when compaction fires, the Ctx: value drops to match the actual compressed message history. The compaction warning message shows the before/after:

    ⚠ Context compacted: Compacted 70 messages | ~40,279 → ~22,754 tokens (saved ~17,525)

    After this compaction, Ctx: updates to reflect ~22,754 tokens (not the pre-compaction ~40,279). Both the main inference loop and the brute-force re-engagement path calculate context tokens from the compacted message array, ensuring the status bar always represents the true context state sent to the model.

    The percentage shows context remaining (not used) — green when >50% free, yellow at 25-50%, red below 25%.
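
    The gauge arithmetic can be reproduced with a few lines. This sketch only mirrors the numbers and color bands stated above; the function name and return shape are made up for the example.

```typescript
type GaugeColor = "green" | "yellow" | "red";

function ctxGauge(used: number, total: number): { text: string; color: GaugeColor } {
  // The percentage is context remaining, not context used.
  const remainingPct = Math.round((1 - used / total) * 100);
  const color: GaugeColor =
    remainingPct > 50 ? "green" : remainingPct >= 25 ? "yellow" : "red";
  const fmt = (n: number): string => n.toLocaleString("en-US");
  return { text: `Ctx: ${fmt(used)}/${fmt(total)} ${remainingPct}%`, color };
}
```

    With the status bar example, 18,000 of 131,072 tokens used leaves about 86% free, so ctxGauge(18000, 131072) yields the text "Ctx: 18,000/131,072 86%" in green.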

    Memex Experience Archive

    During compaction, large tool outputs (file reads, grep results, command output) are archived with a short hash ID. The agent can recover any archived result using memex_retrieve:

    Agent: memex_retrieve(id="a3f2c1")
           → [Full original content of the archived tool result]

    This gives the agent "perfect recall" of any prior tool output despite compaction.
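
    One way such an archive could work is sketched below. The hash function (FNV-1a), the 6-character ID length, the in-memory Map, and the placeholder format are all assumptions for illustration; the README does not say how IDs are derived or where archives are stored.

```typescript
// FNV-1a hash truncated to 6 hex characters. The hash choice and ID length are
// assumptions picked to resemble the "a3f2c1" example; short IDs can collide.
function shortHash(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0").slice(0, 6);
}

class Memex {
  private archive = new Map<string, string>();

  // Store the full output and return a short placeholder that stays in the context.
  store(toolOutput: string): { id: string; placeholder: string } {
    const id = shortHash(toolOutput);
    this.archive.set(id, toolOutput);
    return { id, placeholder: `[archived ${toolOutput.length} chars, memex id=${id}]` };
  }

  // Counterpart of memex_retrieve(id=...): returns the original content, if any.
  retrieve(id: string): string | undefined {
    return this.archive.get(id);
  }
}
```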

    Design Rationale

    The compaction system draws on several research findings:

    • RECOMP (arXiv:2310.04408, ICLR 2024) — Demonstrated that retrieved context can be compressed to 6% of original size with minimal quality loss. Our observation masking pre-pass applies this principle to tool outputs.
    • Tool Documentation Enables Zero-Shot Tool-Usage (arXiv:2308.00675) — Showed that documentation quality matters more than example quantity. Our compaction preserves tool schemas while discarding verbose results.
    • ToolLLM DFSDT (arXiv:2307.16789) — Validated that backtracking and error preservation improve multi-step task success by +35pp. Our error-preserving strategy directly implements this insight.
    • Long Context Does Not Solve Planning (NATURAL PLAN, arXiv:2406.04520) — GPT-4 achieves only 31% on trip planning even with full context. This suggests that a larger context window alone does not guarantee better results, which motivates aggressive compaction with selective preservation.

    Domain-Aware Preservation

    Compaction summaries include:

    • Task state — current phase, goals, progress, blockers
    • File registry — per-file metadata (last action, line count, purpose) for files touched during the session
    • Memex index — hash IDs and one-line summaries of archived tool outputs

    This ensures the agent can resume coherently after compaction without re-reading files or re-running commands.

    Task Control

    Pause, Stop, Resume, Destroy

    Command | Behavior
    /pause | Gentle halt — lets the current inference turn finish, then stops before the next turn. No new tool calls or inference will begin until /resume.
    /stop | Immediate kill — aborts the current inference mid-stream, saves task state for later resumption.
    /resume | Continue — resumes a paused or stopped task from where it left off. Also resumes tasks saved by /stop or interrupted by /update.
    /destroy | Nuclear option — aborts any active task, deletes the .oa/ directory, clears the console, and exits to shell.

    Session Context Persistence

    Context is automatically saved on every task completion and preserved across /update restarts.

    /context save      # Force-save current session context
    /context restore   # Load previous session context into next task
    /context show      # Show saved context status (entries, last saved)

    The system maintains a rolling window of the last 20 session entries in .oa/context/session-context.json. When you run /context restore, the last 10 entries are formatted into a restore prompt and injected into your next task, giving the agent continuity across sessions.
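
    The rolling window and restore step might look something like this. The SessionEntry fields and the prompt wording are hypothetical; only the limits of 20 stored and 10 restored entries come from the text above.

```typescript
// Hypothetical entry shape; the real session-context.json schema is not documented here.
interface SessionEntry {
  task: string;
  summary: string;
  savedAt: string;
}

const MAX_ENTRIES = 20; // rolling window size
const RESTORE_ENTRIES = 10; // entries injected by /context restore

function appendEntry(entries: SessionEntry[], entry: SessionEntry): SessionEntry[] {
  // Keep only the newest MAX_ENTRIES entries.
  return [...entries, entry].slice(-MAX_ENTRIES);
}

function buildRestorePrompt(entries: SessionEntry[]): string {
  const recent = entries.slice(-RESTORE_ENTRIES);
  const lines = recent.map((e, i) => `${i + 1}. [${e.savedAt}] ${e.task}: ${e.summary}`);
  return ["Context from previous sessions:", ...lines].join("\n");
}
```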

    During /update, context is automatically saved before the process restarts and restored when the new version resumes your task.

    Auto-Restore on Startup

    When you launch oa in a workspace that has saved session context from a previous run, you'll be prompted to restore it:

    ℹ Previous session found (5 entries, last active 2h ago)
    ℹ Last task: fix the auth bug in src/middleware.ts
    ℹ Restore previous context? (y/n)
    ❯ y
    ℹ Context restored from 5 session(s). Will be injected into your next task.

    Type y to restore — the previous session context will be prepended to your next task, giving the agent full continuity. Type n (or anything else) to start fresh. The prompt only appears on fresh starts, not on /update resumes (which auto-restore context).

    Dream Mode — Creative Idle Exploration

    When you're not actively tasking the agent, Dream Mode lets it creatively explore your codebase and generate improvement proposals autonomously. The system models real human sleep architecture with four stages per cycle:

    Stage | Name | What Happens
    NREM-1 | Light Scan | Quick codebase overview, surface observations
    NREM-2 | Pattern Detection | Identify recurring patterns, technical debt, gaps
    NREM-3 | Deep Consolidation | Synthesize findings into structured proposals
    REM | Creative Expansion | Novel ideas, cross-domain connections, bold plans

    Each cycle expands through all four stages then contracts (evaluation, pruning of weak ideas). Three modes control how far the agent can go:

    /dream              # Default — read-only exploration, proposals saved to .oa/dreams/
    /dream deep         # Multi-cycle deep exploration with expansion/contraction phases
    /dream lucid        # Full implementation — saves workspace backup, then implements,
                        #   tests, evaluates, and self-plays each proposal with checkpoints
    /dream stop         # Wake up — stop dreaming

    Default and Deep modes are completely safe — the agent can only read your code and write proposals to .oa/dreams/. File writes, edits, and shell commands outside that directory are blocked by sandboxed dream tools.
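
    A write guard of this kind could be as simple as a path check. The sketch below is an assumption about how such a sandbox might decide, not the package's actual logic: it works on workspace-relative POSIX paths only and does not account for symlinks.

```typescript
// Resolve "." and ".." segments in a workspace-relative POSIX path.
// Returns null when the path climbs above the workspace root.
function normalizeSegments(path: string): string[] | null {
  const out: string[] = [];
  for (const seg of path.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") {
      if (out.length === 0) return null;
      out.pop();
    } else {
      out.push(seg);
    }
  }
  return out;
}

// Hypothetical guard: allow writes only to files inside .oa/dreams/.
function isDreamWriteAllowed(relativePath: string): boolean {
  if (relativePath.startsWith("/")) return false; // absolute paths rejected in this sketch
  const parts = normalizeSegments(relativePath);
  return parts !== null && parts.length > 2 && parts[0] === ".oa" && parts[1] === "dreams";
}
```

    Normalizing before comparing matters: a path such as .oa/dreams/../memory/notes.md starts with the allowed prefix but resolves outside it.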

    Lucid mode unlocks full write access. Before making changes, it saves a workspace checkpoint so you can roll back. Each cycle goes: dream → implement → test → evaluate → checkpoint → next cycle.

    All proposals are indexed in .oa/dreams/PROPOSAL-INDEX.md for easy review.

    Listen Mode — Live Bidirectional Audio

    Listen mode enables real-time voice communication with the agent. Your microphone audio is captured, streamed through Whisper, and the transcription is injected directly into the input line — creating a hands-free coding workflow.

    Two transcription backends ensure broad platform support:

    • transcribe-cli (faster-whisper / ONNX) — used by default, fastest on x86
    • openai-whisper (Python venv) — automatic fallback for ARM, linux-arm64, or when ONNX is unavailable. Auto-creates a venv and installs deps on first use.

    /listen             # Toggle microphone capture on/off
    /listen auto        # Auto-submit after 3 seconds of silence (hands-free)
    /listen confirm     # Require Enter to submit transcription (default)
    /listen stop        # Stop listening

    Model selection — choose the Whisper model size for your hardware:

    /listen tiny        # Fastest, least accurate (~39MB)
    /listen base        # Good balance (~74MB)
    /listen small       # Better accuracy (~244MB)
    /listen medium      # High accuracy (~769MB)
    /listen large       # Best accuracy, slower (~1.5GB)

    When combined with /voice, you get full bidirectional audio — speak your tasks, hear the agent's progress through TTS, and speak corrections mid-task. The status bar shows a blinking red ● REC indicator with a countdown timer during auto-mode recording.

    Platform support:

    • Linux x86: arecord (ALSA) or ffmpeg (PulseAudio) + transcribe-cli
    • Linux ARM: arecord or ffmpeg + openai-whisper (auto-installed in Python venv)
    • macOS: sox (CoreAudio) or ffmpeg (AVFoundation)

    The transcribe-cli dependency auto-installs in the background on first use. On ARM or when transcribe-cli fails, the system automatically falls back to openai-whisper via a self-managed Python venv (same approach used by Moondream vision).

    File transcription: Drag-and-drop audio/video files (.mp3, .wav, .mp4, .mkv, etc.) onto the terminal to transcribe them. Results are saved to .oa/transcripts/.

    Vision & Desktop Automation (Moondream)

    Open Agents can see your screen, understand UI elements, and interact with desktop applications through natural language — powered by the Moondream vision language model running entirely locally.

    Desktop Awareness

    The agent can take a screenshot and describe what's on screen:

    You: what's on my desktop right now?
    
    Agent: [Turn 1] desktop_describe()
           → "A Linux desktop showing three terminal windows with code editors,
              a file manager in the background, and a taskbar at the bottom
              with Firefox, Files, and Terminal icons."

    Ask specific questions about the screen:

    Agent: [Turn 1] desktop_describe(question="What application is in focus?")
           → "The focused application is a terminal running vim with a Python file open."

    Vision Analysis

    Analyze any image with four actions:

    Agent: vision(image="screenshot.png", action="caption")
           → "A terminal window displaying code with syntax highlighting"
    
    Agent: vision(image="ui.png", action="query", prompt="How many buttons are visible?")
           → "There are 4 buttons visible: Save, Cancel, Help, and Close"
    
    Agent: vision(image="ui.png", action="detect", prompt="button")
           → Detected 4 "button" in ui.png:
             1. bbox: [0.10, 0.85, 0.25, 0.95]
             2. bbox: [0.30, 0.85, 0.45, 0.95]
             ...
    
    Agent: vision(image="ui.png", action="point", prompt="close button")
           → Found 1 "close button" at (0.95, 0.02) — pixel (1824, 22)
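
    The point result above pairs a normalized coordinate with a pixel coordinate. The conversion is a plain scale-and-round, sketched here with hypothetical helper names. The 1920x1080 screen size is inferred from the example (0.95 × 1920 = 1824), and the [x_min, y_min, x_max, y_max] box order is an assumption.

```typescript
interface Point {
  x: number;
  y: number;
}

// Scale a normalized (0..1) point to pixel coordinates for a given screen size.
function toPixels(p: Point, width: number, height: number): Point {
  return { x: Math.round(p.x * width), y: Math.round(p.y * height) };
}

// Center of a normalized bounding box, assuming [x_min, y_min, x_max, y_max] order.
function bboxCenter(box: [number, number, number, number]): Point {
  const [x0, y0, x1, y1] = box;
  return { x: (x0 + x1) / 2, y: (y0 + y1) / 2 };
}
```

    For example, toPixels({ x: 0.95, y: 0.02 }, 1920, 1080) gives (1824, 22), matching the output shown for the close button.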

    Point-and-Click

    Describe what to click in plain English — the agent screenshots, finds the element with Moondream, and clicks it:

    Agent: desktop_click(target="the Save button")
           → Clicked "Save button" at (480, 920)
    
    Agent: desktop_click(target="File menu", button="left")
           → Clicked "File menu" at (45, 12)
    
    Agent: desktop_click(target="terminal icon", click_type="double")
           → Clicked "terminal icon" at (1850, 540)

    Supports left/right/middle click, single/double click, multi-match selection by index, dry-run mode for verification, and configurable delay for UI transitions.

    Browser Automation

    Headless Chrome automation via Selenium — no display server required. The scrape service auto-starts on first use, creates its own Python venv, and installs all dependencies:

    You: go to github.com and screenshot the page
    
    Agent: [Turn 1] browser_action(action="navigate", url="https://github.com")
           → Navigated to https://github.com
           [Turn 2] browser_action(action="screenshot")
           → Screenshot captured (1920x1080)

    Available actions:

    Action | Description
    navigate | Go to a URL
    click | Click element by CSS selector
    click_xy | Click at viewport coordinates
    type | Type text into a form element
    screenshot | Capture the current page
    dom | Read the page DOM (up to 50K chars)
    scroll / scroll_up / scroll_down | Scroll the page
    back / forward | Browser history navigation
    close | End the browser session

    The service runs on localhost:8130 and uses headless Chrome/Chromium. Requires Python 3.9+ and Chrome or Chromium installed on the system.

    Setup

    Moondream runs locally — no API keys, no cloud, your screen data never leaves your machine:

    # Create a Python venv and install Moondream Station
    python3 -m venv .moondream-venv
    .moondream-venv/bin/pip install moondream-station pydantic uvicorn fastapi packaging
    
    # Start the vision server (downloads model on first run, ~1.7GB)
    .moondream-venv/bin/python packages/execution/scripts/start-moondream.py

    The vision tools auto-detect a running Moondream Station on localhost:2020. For cloud inference, set MOONDREAM_API_KEY instead.

    System dependencies (auto-installed on first use):

    Desktop tools automatically install missing system packages when first needed. No manual setup required — just use the tool and it handles the rest:

    Tool | Linux Package | What It Does
    scrot | apt install scrot | Screenshot capture
    xdotool | apt install xdotool | Mouse/keyboard automation
    tesseract | apt install tesseract-ocr | OCR text extraction
    identify | apt install imagemagick | Image dimensions/conversion

    Supports apt (Debian/Ubuntu), dnf (Fedora), pacman (Arch), and brew (macOS). You can also pre-install everything at once:

    ./scripts/setup-desktop.sh          # Install all desktop deps
    ./scripts/setup-desktop.sh --check-only  # Just check what's missing

    Vision backend:

    • Moondream Station (local) — runs entirely on your machine, no API keys needed
    • Moondream Cloud API — set MOONDREAM_API_KEY for cloud inference

    Interactive TUI

    Launch without arguments to enter the interactive REPL:

    oa

    The TUI features an animated multilingual phrase carousel, live metrics bar with pastel-colored labels (token in/out, context window usage, human expert speed ratio, cost), rotating tips, syntax-highlighted tool output, and dynamic terminal-width cropping.

    Slash Commands

    Command | Description
    Model & Endpoint
    /model <name> | Switch to a different model
    /models | List all available models
    /endpoint <url> | Connect to a remote vLLM or OpenAI-compatible API
    /endpoint <url> --auth <key> | Set endpoint with Bearer auth
    Task Control
    /pause | Pause after current turn finishes (gentle halt)
    /stop | Kill current inference immediately, save state
    /resume | Resume a paused or stopped task
    /destroy | Remove .oa/ folder, kill all tasks, clear console, exit
    Context & Memory
    /context save | Force-save session context to .oa/context/
    /context restore | Restore context from previous sessions into next task
    /context show | Show saved session context status
    /compact | Force context compaction now (default strategy)
    /compact <strategy> | Compact with strategy: aggressive, decisions, errors, summary, structured
    Audio & Vision
    /voice [model] | Toggle TTS voice (GLaDOS, Overwatch)
    /listen [mode] | Toggle live microphone transcription
    /dream [mode] | Start dream mode (default, deep, lucid)
    Display & Behavior
    /stream | Toggle streaming token display with pastel syntax highlighting
    /bruteforce | Toggle brute-force mode (auto re-engage on turn limit)
    /verbose | Toggle verbose mode
    /style [preset] | Set personality style: concise, balanced, verbose, pedagogical
    /personality [preset] | Alias for /style
    Tools & Skills
    /tools | List agent-created custom tools
    /skills [keyword] | List/search available AIWG skills
    /<skill-name> [args] | Invoke an AIWG skill directly
    Metrics & Updates
    /cost | Show token cost breakdown for the current session
    /evaluate | Score the last completed task with LLM-as-judge
    /stats | Show session dashboard (turns, tools, tokens, files, task history)
    /task-type <type> | Set task type for specialized prompts (code, document, analysis, plan)
    /update | Check for and install updates (seamless context-preserving reload)
    /update auto|manual | Set update mode (auto after task completion, or manual only)
    General
    /config | Show current configuration
    /clear | Clear the screen
    /help | Show all available commands
    /quit | Exit

    All settings commands accept --local to save to project .oa/settings.json instead of global config.

    Mid-Task Steering

    While the agent is working (shown by the + prompt), type to add context:

    > fix the auth bug
      ⎿  Read: src/auth.ts
    + also check the session handling        ← typed while agent works
      ↪ Context added: also check the session handling
      ⎿  Search: session
      ⎿  Edit: src/auth.ts

    Tools (48)

    Tool | Description
    File Operations
    file_read | Read file contents with line numbers (offset/limit for large files)
    file_write | Create or overwrite files with automatic directory creation
    file_edit | Precise string replacement in files (preferred over rewriting)
    file_patch | Edit specific line ranges in large files (replace, insert_before/after, delete)
    batch_edit | Multiple edits across files in one call
    list_directory | List directory contents with types and sizes
    Search & Navigation
    grep_search | Search file contents with regex (ripgrep with grep fallback)
    find_files | Find files by glob pattern (excludes node_modules/.git)
    codebase_map | High-level project structure overview with directory tree and language breakdown
    Shell & Execution
    shell | Execute any shell command (non-interactive, CI=true, sudo support)
    code_sandbox | Isolated code execution (JS, Python, Bash, TS) in subprocess or Docker
    background_run | Run shell command in background, returns task ID
    task_status | Check background task status
    task_output | Read background task output
    task_stop | Stop a background task
    Web
    web_search | Search the web (DuckDuckGo, Tavily, Jina AI — auto-detected)
    web_fetch | Fetch and extract text from web pages (HTML stripping)
    web_crawl | Multi-page web scraping with Crawlee/Playwright for deep documentation
    browser_action | Headless Chrome automation: navigate, click, type, screenshot, read DOM, scroll, history
    Structured Data
    structured_file | Generate CSV, TSV, JSON, Markdown tables, Excel-compatible files
    structured_read | Parse CSV, TSV, JSON, Markdown tables with binary format detection
    Vision & Desktop
    vision | Moondream VLM — caption, query, detect, point on any image
    desktop_click | Vision-guided clicking: describe a UI element, agent finds and clicks it
    desktop_describe | Screenshot + Moondream caption/query for desktop awareness
    image_read | Read images (base64 + OCR metadata)
    screenshot | Capture screen/window/active window
    ocr | Extract text from images (Tesseract with multi-variant preprocessing)
    ocr_image_advanced | Advanced multi-variant OCR pipeline with preprocessing, multi-PSM, and confidence scoring
    ocr_pdf | Add searchable text layer to scanned/image PDFs
    pdf_to_text | Extract text from PDF using pdftotext (Poppler) with OCR fallback
    Transcription
    transcribe_file | Transcribe local audio/video files to text (Whisper)
    transcribe_url | Download and transcribe audio/video from URLs
    Memory & Knowledge
    memory_read | Read from persistent memory store by topic and key
    memory_write | Store facts/patterns in persistent memory with provenance tracking
    memory_search | Semantic search across all memory entries by query
    memex_retrieve | Recover full tool output archived during context compaction by hash ID
    Git & Diagnostics
    diagnostic | Lint/typecheck/test/build validation pipeline in one call
    git_info | Structured git status, log, diff, branch, staged/unstaged files
    Agents & Delegation
    sub_agent | Delegate subtasks to independent agent instances (foreground or background)
    explore_tools | Meta-tool: discover and unlock additional tools on demand (for small models)
    task_complete | Signal task completion with summary
    Custom Tools & Skills
    create_tool | Create reusable custom tools from workflow patterns at runtime
    manage_tools | List, inspect, delete custom tools
    skill_list | Discover available AIWG skills
    skill_execute | Run an AIWG skill
    AIWG SDLC
    aiwg_setup | Deploy AIWG SDLC framework
    aiwg_health | Analyze project SDLC health and readiness
    aiwg_workflow | Execute AIWG commands and workflows

    Read-only tools execute concurrently when called in the same turn. Mutating tools run sequentially.
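
    The scheduling rule can be illustrated with Promise.allSettled, which the feature list names as the mechanism for read-only tools. Everything else in this sketch is assumed: the ToolCall shape, the executeTurn name, and in particular the READ_ONLY set, since the README does not list which tools count as read-only.

```typescript
type ToolCall = { name: string; run: () => Promise<string> };

// Hypothetical classification; the real list of read-only tools is not given in this README.
const READ_ONLY = new Set(["file_read", "grep_search", "find_files", "list_directory", "git_info"]);

async function executeTurn(calls: ToolCall[]): Promise<Map<string, string>> {
  const results = new Map<string, string>(); // keyed by tool name for brevity
  const readOnly = calls.filter((c) => READ_ONLY.has(c.name));
  const mutating = calls.filter((c) => !READ_ONLY.has(c.name));

  // Read-only calls start together; allSettled keeps one failure from rejecting the batch.
  const settled = await Promise.allSettled(readOnly.map((c) => c.run()));
  settled.forEach((s, i) => {
    const text = s.status === "fulfilled" ? s.value : `error: ${String(s.reason)}`;
    results.set(readOnly[i].name, text);
  });

  // Mutating calls run one at a time, in the order they were requested.
  for (const c of mutating) {
    results.set(c.name, await c.run().catch((e: unknown) => `error: ${String(e)}`));
  }
  return results;
}
```

    Promise.allSettled is preferable to Promise.all here because a single failed read should be reported back to the model alongside the successful ones, not discard them.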

    Auto-Expanding Context Window

    On startup and /model switch, Open Agents detects your RAM/VRAM and creates an optimized model variant:

    Available Memory | Context Window
    200GB+ | 128K tokens
    100GB+ | 64K tokens
    50GB+ | 32K tokens
    20GB+ | 16K tokens
    8GB+ | 8K tokens
    < 8GB | 4K tokens
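
    The table reads naturally as a first-match lookup. In this sketch the function name is made up, and "K" is taken to mean 1,024 tokens, an assumption consistent with the 131,072 total shown in the status bar example.

```typescript
// [minimum memory in GB, context window in tokens], largest tier first.
const CONTEXT_TIERS: ReadonlyArray<readonly [number, number]> = [
  [200, 128 * 1024],
  [100, 64 * 1024],
  [50, 32 * 1024],
  [20, 16 * 1024],
  [8, 8 * 1024],
];

function contextWindowFor(memoryGb: number): number {
  for (const [minGb, tokens] of CONTEXT_TIERS) {
    if (memoryGb >= minGb) return tokens;
  }
  return 4 * 1024; // below 8GB
}
```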

    Model-Tier Awareness

    Open Agents classifies models into three tiers and adapts its behavior accordingly:

    Tier | Parameters | Base Tools | System Prompt | Compaction
    Large (≥30B) | 70B, 122B | All 47 tools | Full (344 lines) | 40K threshold
    Medium (8-29B) | 9B, 27B | 15 core tools | Condensed (100 lines) | 24K threshold
    Small (≤7B) | 4B, 1.5B | 6 base tools + explore_tools | Minimal (15 lines) | 12K threshold

    Tool Nesting for Small Models

    Small models use an explore_tools meta-tool pattern inspired by hierarchical API retrieval research (ToolLLM, arXiv:2307.16789). Instead of presenting all 47 tools (which overwhelms small context windows), only 6 core tools are loaded initially:

    • file_read, file_write, file_edit, shell, task_complete, explore_tools

    The agent can call explore_tools() to see a catalog of additional tools with one-line descriptions, then explore_tools(enable="grep_search") to unlock specific tools as needed. This reduces tool schema tokens by ~80% while preserving access to the full toolset.

    This approach is substantiated by:

    • Gorilla (arXiv:2305.15334) — 7B model with retrieval outperforms GPT-4 on tool-calling hallucination rate
    • DFSDT (arXiv:2307.16789) — ToolLLaMA-7B with depth-first search scored 66.7%, approaching GPT-4's 70.4%
    • Octopus v2 (arXiv:2404.01744) — 2B model achieved 99.5% function-calling accuracy with context-efficient tool encoding
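
    The explore_tools pattern can be sketched as a small registry that starts with the six core tools and unlocks others on request. The ToolRegistry class, the ToolSpec shape, and the returned strings are hypothetical; only the core tool names and the two call forms come from the text above.

```typescript
interface ToolSpec {
  name: string;
  description: string;
}

// The six tools loaded initially for small models.
const CORE_TOOLS = ["file_read", "file_write", "file_edit", "shell", "task_complete", "explore_tools"];

class ToolRegistry {
  private enabled = new Set<string>(CORE_TOOLS);
  private all: Map<string, ToolSpec>;

  constructor(all: Map<string, ToolSpec>) {
    this.all = all;
  }

  // Names of the tools whose schemas would be sent to the model on the next turn.
  activeTools(): string[] {
    return [...this.enabled];
  }

  // explore_tools() lists locked tools; explore_tools(enable="name") unlocks one.
  exploreTools(enable?: string): string {
    if (enable === undefined) {
      return [...this.all.values()]
        .filter((t) => !this.enabled.has(t.name))
        .map((t) => `${t.name}: ${t.description}`)
        .join("\n");
    }
    if (!this.all.has(enable)) return `unknown tool: ${enable}`;
    this.enabled.add(enable);
    return `enabled: ${enable}`;
  }
}
```

    The saving comes from sending one-line descriptions in the catalog and full schemas only for tools the model has explicitly enabled.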

    Dynamic Context Limits

    All context-dependent values scale automatically with the actual context window size:

    Setting | How It Scales
    Compaction threshold | min(tier default, 75% of context window)
    Recent messages kept | 1 message per 2-4K of context (tier-dependent)
    Max output tokens | 25% of context window (min 2048)
    Tool output cap | 2K-8K chars (scales with context)
    File read limits | 80-120 line cap for small/medium context windows
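
    Two of these rules are precise enough to write down directly. The sketch below covers the compaction threshold and max output tokens only. The function names are invented, and classifying models between 7B and 8B as "small" is an assumption, since the tier table leaves that range unspecified.

```typescript
type Tier = "small" | "medium" | "large";

// Tier defaults from the compaction table.
const TIER_THRESHOLD: Record<Tier, number> = { small: 12000, medium: 24000, large: 40000 };

// Boundaries follow the tier table; treating 7-8B as "small" is an assumption.
function tierFor(paramsBillions: number): Tier {
  if (paramsBillions >= 30) return "large";
  if (paramsBillions >= 8) return "medium";
  return "small";
}

function contextLimits(paramsBillions: number, contextWindow: number) {
  const tier = tierFor(paramsBillions);
  return {
    tier,
    // min(tier default, 75% of context window)
    compactionThreshold: Math.min(TIER_THRESHOLD[tier], Math.floor(0.75 * contextWindow)),
    // 25% of context window, never below 2048
    maxOutputTokens: Math.max(2048, Math.floor(0.25 * contextWindow)),
  };
}
```

    For a 70B model with a 131,072-token window the tier default wins (40,000), while a 4B model with a 4,096-token window is capped by the window itself (3,072).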

    Voice Feedback (TTS)

    /voice              # Toggle on/off (default: GLaDOS)
    /voice glados       # GLaDOS voice
    /voice overwatch    # Overwatch voice

    Auto-downloads the ONNX voice model (~50MB) on first use. Install espeak-ng for best quality (apt install espeak-ng / brew install espeak-ng).

    Personality-Aware Voice

    Voice output adapts to the active personality style — the same tool call sounds different depending on the /style preset:

    | Style | Example (file_read) | Example (npm test) |
    |---|---|---|
    | concise | "Reading app.ts" | "Running tests" |
    | balanced | "Let me take a look at app.ts" | "Let's run the tests and see how we're doing" |
    | verbose | "Alright, let's crack open app.ts and see what we're working with" | "Alright, moment of truth, let's see if the tests pass" |

    Task completion, tool failures, and all TTS announcements follow the same personality tier. Set the style with /style verbose and the voice output becomes conversational rather than robotic.

    Personality Core — SAC Framework Style Control

    The personality system controls how the agent communicates, from silent operator to teacher mode. It is based on the SAC framework (arXiv:2506.20993), which models personality along five behavioral intensity dimensions rather than binary trait toggles.

    /style concise       # Silent operator — acts without explaining
    /style balanced      # Default — moderate narration
    /style verbose       # Thorough explainer — narrates reasoning
    /style pedagogical   # Teacher mode — maximum explanation with alternatives

    How It Works

    Each personality preset maps to a PersonalityProfile with five dimensions scored 1-5:

    | Dimension | What It Controls | concise | balanced | verbose | pedagogical |
    |---|---|---|---|---|---|
    | Frequency | How often the agent narrates actions | 1 | 3 | 5 | 5 |
    | Depth | Reasoning detail exposed in output | 1 | 3 | 4 | 5 |
    | Threshold | When to speak vs. act silently | 1 | 3 | 4 | 5 |
    | Effort | Response formatting quality | 2 | 3 | 4 | 5 |
    | Willingness | Proactive suggestions beyond the task | 1 | 3 | 4 | 5 |

    The profile is compiled into a system prompt suffix (max 80 tokens) injected at the end of the base prompt. This follows research showing prompt-level steering dominates activation-level interventions (arXiv:2512.17639) and uses positive framing ("Be concise") over negation ("Don't be verbose") per KAIST findings.
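A minimal sketch of the preset-to-suffix step, assuming a simple rule-based compiler. The preset scores are taken from the dimension table; the `compileSuffix` function, its score cutoffs, and the suffix wording are hypothetical, and word count is used here only as a crude stand-in for the 80-token budget.

```typescript
// Hypothetical sketch: only the preset scores come from the README.
type StylePreset = "concise" | "balanced" | "verbose" | "pedagogical";

interface PersonalityProfile {
  frequency: number;   // 1-5
  depth: number;       // 1-5
  threshold: number;   // 1-5
  effort: number;      // 1-5
  willingness: number; // 1-5
}

const PRESETS: Record<StylePreset, PersonalityProfile> = {
  concise:     { frequency: 1, depth: 1, threshold: 1, effort: 2, willingness: 1 },
  balanced:    { frequency: 3, depth: 3, threshold: 3, effort: 3, willingness: 3 },
  verbose:     { frequency: 5, depth: 4, threshold: 4, effort: 4, willingness: 4 },
  pedagogical: { frequency: 5, depth: 5, threshold: 5, effort: 5, willingness: 5 },
};

// Mid-range scores emit nothing, so "balanced" yields no override.
// All instructions use positive framing, as the text recommends.
function compileSuffix(profile: PersonalityProfile): string {
  const parts: string[] = [];
  if (profile.frequency <= 2) parts.push("Act quietly and report raw results only.");
  else if (profile.frequency >= 4) parts.push("Narrate each action as you take it.");
  if (profile.depth <= 2) parts.push("Keep reasoning brief.");
  else if (profile.depth >= 4) parts.push("Explain your reasoning in detail.");
  if (profile.willingness >= 5) parts.push("Offer alternatives and proactive suggestions.");
  return parts.join(" ");
}

const conciseSuffix = compileSuffix(PRESETS.concise);
// conciseSuffix === "Act quietly and report raw results only. Keep reasoning brief."
```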

    What Changes Per Style

    | Aspect | concise | balanced | verbose | pedagogical |
    |---|---|---|---|---|
    | System prompt | "Act silently, raw results only" | No override | "Explain reasoning, summarize" | "Thorough explanations, alternatives" |
    | Voice TTS | Terse: "Reading file.ts" | Conversational: "Let me take a look" | Chatty: "Alright, let's crack it open" | Chatty + context |
    | Tool calls observed | Same behavior | Same behavior | More exploration, diagnostics | Maximum exploration |
    | Response length | Minimal | Moderate | Detailed | Comprehensive |

    Persistence

    The style is saved to .oa/settings.json (with --local) or ~/.open-agents/config.json (global) and persists across sessions. Change it anytime with /style <preset>; the change takes effect on the next task.

    Research Provenance

    The personality system draws on:

    • SAC Framework (arXiv:2506.20993) — Five behavioral intensity dimensions with adjective-based semantic anchoring for stable trait expression
    • Lost in the Middle (arXiv:2307.03172) — U-shaped attention bias; personality suffix placed at prompt boundaries, not middle
    • Same Task, More Tokens (arXiv:2402.14848) — LLM reasoning degrades at ~3K system prompt tokens; personality suffix stays under 80 tokens
    • Linear Personality Probing (arXiv:2512.17639) — Prompt-level steering completely dominates activation-level interventions
    • The Prompt Report (arXiv:2406.06608) — Positive framing outperforms negated instructions for behavioral control

    Human Expert Speed Ratio

    The status bar displays a real-time Exp: Nx gauge estimating how fast the agent is working relative to a leading human expert performing equivalent tasks.

    In: 12,345 | Out: 4,567 | Ctx: 18,000/131,072 86% | Exp: 4.2x | Cost: $0.34
                                                           ^^^^^^^^
                                                        Agent is 4.2x faster
                                                        than a human expert

    How It Works

    Each tool call maps to a calibrated expert baseline time — the estimated seconds a top-tier human developer would take to perform the equivalent operation manually:

    | Operation | Expert Time | Agent Equivalent |
    |---|---|---|
    | Read a file | 12s | file_read |
    | Write a new file | 90s | file_write |
    | Make a precise edit | 25s | file_edit |
    | Grep search + scan results | 15s | grep_search |
    | Run a shell command | 20s | shell |
    | Web search + evaluate | 60s | web_search |
    | Survey codebase structure | 180s | codebase_map |

    Additional overhead per action:

    • +5s context-switch per tool call (expert switching between tools)
    • +15s planning per reasoning turn (expert thinking about next step)

    The ratio accumulates across all tasks in the session:

    speedRatio = totalHumanExpertTime / totalAgentWallClockTime

    Color coding: green (2x+ faster), yellow (1-2x, comparable), red (<1x, slower than expert).

    All 47 tools have calibrated baselines ranging from 3s (task_stop) to 180s (codebase_map). Unknown tools default to 20s.
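The arithmetic behind the gauge can be sketched as follows. The baseline values, the per-call and per-turn overheads, the 20s default, and the color thresholds are taken from the text above; the function and constant names are hypothetical, and only a subset of the baselines is listed.

```typescript
// Hypothetical sketch: names are illustrative; numbers come from the README.
const EXPERT_BASELINE_SECONDS: Record<string, number> = {
  file_read: 12,
  file_write: 90,
  file_edit: 25,
  grep_search: 15,
  shell: 20,
  web_search: 60,
  codebase_map: 180,
  task_stop: 3,
};
const DEFAULT_BASELINE_SECONDS = 20; // unknown tools
const CONTEXT_SWITCH_SECONDS = 5;    // added per tool call
const PLANNING_SECONDS = 15;         // added per reasoning turn

// Estimated time a human expert would need for the same sequence of work.
function expertSeconds(toolCalls: string[], reasoningTurns: number): number {
  let total = reasoningTurns * PLANNING_SECONDS;
  for (const tool of toolCalls) {
    total += (EXPERT_BASELINE_SECONDS[tool] ?? DEFAULT_BASELINE_SECONDS) + CONTEXT_SWITCH_SECONDS;
  }
  return total;
}

function speedRatio(totalHumanExpertTime: number, totalAgentWallClockTime: number): number {
  return totalHumanExpertTime / totalAgentWallClockTime;
}

function gaugeColor(ratio: number): "green" | "yellow" | "red" {
  if (ratio >= 2) return "green";
  if (ratio >= 1) return "yellow";
  return "red";
}

// Three tool calls over two reasoning turns:
// (12+5) + (25+5) + (20+5) + 2*15 = 102 expert seconds.
const human = expertSeconds(["file_read", "file_edit", "shell"], 2);
const ratio = speedRatio(human, 25.5); // 102 / 25.5 = 4
```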

    Cost Tracking & Session Metrics

    Real-time token cost estimation for cloud providers. The status bar shows running cost when using a paid endpoint.

    /cost              # Show cost breakdown by model/provider
    /stats             # Session metrics: turns, tool calls, tokens, files modified
    /evaluate          # Score the last completed task (LLM-as-judge, 5 rubric dimensions)

    Cost tracking supports 15+ providers including Groq, Together AI, OpenRouter, Fireworks AI, DeepInfra, Mistral, Cerebras, and more. Pricing is per-million tokens with separate input/output rates.
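Per-million-token pricing with separate input and output rates reduces to a short calculation. The sketch below is illustrative: `Pricing` and `estimateCost` are hypothetical names, and the rates in the example are made-up numbers, not any provider's actual prices.

```typescript
// Hypothetical sketch: names and example rates are placeholders.
interface Pricing {
  inputPerMillion: number;  // USD per 1,000,000 input tokens
  outputPerMillion: number; // USD per 1,000,000 output tokens
}

function estimateCost(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  return (
    (inputTokens / 1_000_000) * pricing.inputPerMillion +
    (outputTokens / 1_000_000) * pricing.outputPerMillion
  );
}

// Made-up rates: $0.50 per million input tokens, $1.50 per million output tokens.
const exampleRates: Pricing = { inputPerMillion: 0.5, outputPerMillion: 1.5 };
const cost = estimateCost(2_000_000, 1_000_000, exampleRates);
// cost === 2.5  (2 * 0.50 + 1 * 1.50)
```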

    Work evaluation uses five task-type-specific rubrics (code, document, analysis, plan, general) scoring correctness, completeness, efficiency, code quality, and communication on a 1-5 scale.

    Code Sandbox

    Execute code snippets in isolated environments without affecting your project:

    Agent: code_sandbox(language="python", code="import math; print(math.factorial(20))")
           → 2432902008176640000
    
    Agent: code_sandbox(language="javascript", code="console.log([...new Set([1,2,2,3])].length)")
           → 3

    Supports JavaScript, TypeScript, Python, and Bash. Two execution modes:

    • Subprocess (default) — runs in a child process with timeout and output limits
    • Docker — runs in an isolated container when docker is available

    Structured Data Tools

    Generate structured files

    Create CSV, TSV, JSON, Markdown tables, and Excel-compatible files from data:

    Agent: structured_file(format="csv", path="results.csv", columns=["name","score"],
             data=[{"name":"Alice","score":95},{"name":"Bob","score":87}])
           → Created results.csv (2 rows, 2 columns)

    Read structured files

    Parse existing data files with automatic format detection:

    Agent: read_structured_file(path="data.csv")
           → CSV: 150 rows, 5 columns [showing first 100]
    
    Agent: read_structured_file(path="report.md")
           → Markdown: 3 table(s) extracted

    Detects binary formats (XLSX, PDF, DOCX) and suggests conversion tools.

    Web search automatically selects the best available provider:

    | Provider | Trigger | Features |
    |---|---|---|
    | DuckDuckGo | Default (no key needed) | Free, privacy-focused |
    | Tavily | TAVILY_API_KEY set | Structured results + AI-generated answer |
    | Jina AI | JINA_API_KEY set | Markdown-formatted results |

    export TAVILY_API_KEY=tvly-...   # Enable Tavily (optional)
    export JINA_API_KEY=jina_...     # Enable Jina AI (optional)
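Provider selection driven by environment variables might look like the sketch below. The function name is hypothetical, and the precedence when both keys are set (Tavily before Jina) is an assumption; the README only states which key triggers which provider.

```typescript
// Hypothetical sketch: the Tavily-before-Jina precedence is an assumption.
type SearchProvider = "duckduckgo" | "tavily" | "jina";

function selectSearchProvider(env: Record<string, string | undefined>): SearchProvider {
  if (env["TAVILY_API_KEY"]) return "tavily";
  if (env["JINA_API_KEY"]) return "jina";
  return "duckduckgo"; // default, no key needed
}

const withoutKeys = selectSearchProvider({});
// withoutKeys === "duckduckgo"
```

In a Node process the argument would typically be `process.env`.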

    Task Templates

    Set a task type to get specialized system prompts, recommended tools, and output guidance:

    /task-type code       # Code generation/fix — emphasizes tests, diffs, file edits
    /task-type document   # Documentation — emphasizes clarity, structure, completeness
    /task-type analysis   # Analysis tasks — emphasizes data, metrics, evidence
    /task-type plan       # Planning — emphasizes steps, dependencies, risks

    Configuration

    Config priority: CLI flags > env vars > ~/.open-agents/config.json > defaults.

    open-agents config set model qwen3.5:122b
    open-agents config set backendUrl http://localhost:11434
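The priority order can be read as a layered lookup in which the first layer that defines a key wins. The sketch below is illustrative only: `OaConfig` and `resolveConfig` are hypothetical names, only two keys are modeled, and the values are examples.

```typescript
// Hypothetical sketch: type and function names are illustrative.
interface OaConfig {
  model?: string;
  backendUrl?: string;
}

const CONFIG_KEYS: (keyof OaConfig)[] = ["model", "backendUrl"];

// CLI flags > env vars > config file > defaults
function resolveConfig(cli: OaConfig, env: OaConfig, file: OaConfig, defaults: OaConfig): OaConfig {
  const layers = [cli, env, file, defaults]; // highest priority first
  const resolved: OaConfig = {};
  for (const key of CONFIG_KEYS) {
    for (const layer of layers) {
      const value = layer[key];
      if (value !== undefined) {
        resolved[key] = value;
        break; // first layer that defines the key wins
      }
    }
  }
  return resolved;
}

const effective = resolveConfig(
  { model: "qwen2.5-coder:32b" },                                   // CLI flag
  {},                                                               // env vars
  { model: "qwen3.5:122b" },                                        // config file
  { model: "qwen3.5:122b", backendUrl: "http://localhost:11434" },  // example defaults
);
// effective: { model: "qwen2.5-coder:32b", backendUrl: "http://localhost:11434" }
```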

    Project Context

    Create AGENTS.md, OA.md, or .open-agents.md in your project root for agent instructions. Context files merge from parent to child directories.

    .oa/ Project Directory

    .oa/
    ├── config.json        # Project config overrides
    ├── settings.json      # TUI settings (model, endpoint, voice, stream, etc.)
    ├── memory/            # Persistent memory store (topics, patterns, facts)
    ├── dreams/            # Dream mode proposals & checkpoints
    ├── transcripts/       # Audio/video transcriptions
    ├── index/             # Cached codebase index
    ├── context/           # Session context persistence
    │   └── session-context.json  # Rolling 20-entry context window
    ├── session/           # Compaction summaries for crash recovery
    ├── history/           # Session history
    └── pending-task.json  # Saved task state for /stop and /update resume

    Model Support

    Primary target: Qwen3.5-122B-A10B via Ollama (MoE, 48GB+ VRAM)

    Any Ollama or OpenAI-compatible API model with tool calling works:

    oa --model qwen2.5-coder:32b "fix the bug"
    oa --backend vllm --backend-url http://localhost:8000/v1 "add tests"
    oa --backend-url http://10.0.0.5:11434 "refactor auth"

    Supported Inference Providers

    Open Agents auto-detects your provider from the endpoint URL and configures auth + health checks accordingly. Providers that require a key use standard Authorization: Bearer <key> authentication; local servers such as Ollama and LM Studio need no key.

    | Provider | Endpoint URL | API Key | Notes |
    |---|---|---|---|
    | Ollama (local) | http://localhost:11434 | None | Default. Auto-detects, auto-expands context window |
    | vLLM (local) | http://localhost:8000 | Optional | Self-hosted OpenAI-compatible server |
    | LM Studio (local) | http://localhost:1234 | None | Local model server with GUI |
    | Chutes AI | https://llm.chutes.ai | cpk_... | Bearer auth. Fast cloud inference |
    | Together AI | https://api.together.xyz | Required | Large model catalog |
    | Groq | https://api.groq.com/openai | gsk_... | Ultra-fast LPU inference |
    | OpenRouter | https://openrouter.ai/api | sk-or-... | Multi-provider routing |
    | Fireworks AI | https://api.fireworks.ai/inference | fw_... | Fast serverless inference |
    | DeepInfra | https://api.deepinfra.com | Required | Cost-effective inference |
    | Mistral AI | https://api.mistral.ai | Required | Mistral models |
    | Cerebras | https://api.cerebras.ai | csk-... | Wafer-scale inference |
    | SambaNova | https://api.sambanova.ai | Required | RDU-accelerated inference |
    | NVIDIA NIM | https://integrate.api.nvidia.com | nvapi-... | NVIDIA cloud inference |
    | Hyperbolic | https://api.hyperbolic.xyz | Required | GPU cloud inference |
    | OpenAI | https://api.openai.com | sk-... | GPT models (tool calling) |

    Connecting to a Provider

    Use /endpoint in the TUI or pass via CLI:

    # Chutes AI
    /endpoint https://llm.chutes.ai --auth cpk_your_key_here
    
    # Groq
    /endpoint https://api.groq.com/openai --auth gsk_your_key_here
    
    # Together AI
    /endpoint https://api.together.xyz --auth your_key_here
    
    # Self-hosted vLLM on LAN
    /endpoint http://10.0.0.5:8000

    The agent auto-detects the provider, normalizes the URL (strips /v1/chat/completions if pasted), tests connectivity, and saves the configuration. You can paste full endpoint URLs — they'll be cleaned up automatically.
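URL cleanup and host-based detection could be approximated as below. This is a sketch under assumptions: the function names are hypothetical, stripping trailing slashes and surrounding whitespace is an added assumption beyond the stated `/v1/chat/completions` removal, and only a few hosts from the provider table are included.

```typescript
// Hypothetical sketch: not the package's actual detection logic.
function normalizeEndpoint(raw: string): string {
  return raw
    .trim()
    .replace(/\/v1\/chat\/completions\/?$/, "") // strip a pasted full completions path
    .replace(/\/+$/, "");                       // assumption: also drop trailing slashes
}

const KNOWN_HOSTS: { host: string; provider: string }[] = [
  { host: "llm.chutes.ai", provider: "Chutes AI" },
  { host: "api.groq.com", provider: "Groq" },
  { host: "api.together.xyz", provider: "Together AI" },
  { host: "localhost:11434", provider: "Ollama (local)" },
];

function detectProvider(endpoint: string): string {
  for (const entry of KNOWN_HOSTS) {
    if (endpoint.indexOf(entry.host) !== -1) return entry.provider;
  }
  return "unknown"; // placeholder for hosts outside this short list
}

const cleaned = normalizeEndpoint("https://api.groq.com/openai/v1/chat/completions");
// cleaned === "https://api.groq.com/openai"
const provider = detectProvider(cleaned);
// provider === "Groq"
```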

    Evaluation Suite

    40 evaluation tasks test the agent's autonomous capabilities across coding, web research, SDLC analysis, tool creation, multi-file reasoning, and memory systems:

    node eval/run-agentic.mjs                            # Run all tasks
    node eval/run-agentic.mjs 04-add-test                # Single task
    node eval/run-agentic.mjs --model qwen2.5-coder:32b  # Different model

    | ID | Task | Category |
    |---|---|---|
    | 01 | Fix typo in function name | Code Fix |
    | 02 | Add isPrime function | Code Generation |
    | 03 | Fix off-by-one bug | Code Fix |
    | 04 | Write comprehensive tests | Test Generation |
    | 05 | Extract functions from long method | Refactoring |
    | 06 | Fix TypeScript type errors | Type Safety |
    | 07 | Add REST API endpoint | Feature Addition |
    | 08 | Add pagination across files | Multi-File Edit |
    | 09 | CSS named color lookup (148 colors) | Web Research |
    | 10 | HTTP status code lookup (32+ codes) | Web Research |
    | 11 | MIME type lookup (30+ types) | Web Research |
    | 12 | SDLC health analyzer | AIWG Analysis |
    | 13 | SDLC artifact generator | AIWG Generation |
    | 14 | Batch refactor variable names | Multi-File Refactor |
    | 15 | Codebase overview from structure | Code Analysis |
    | 16 | Diagnostic fix loop | Error Recovery |
    | 17 | Git repository analyzer | Git Integration |
    | 18 | Create custom tool from spec | Tool Creation |
    | 19 | Tool from usage pattern | Tool Discovery |
    | 20 | Tool management operations | Tool Lifecycle |
    | 21 | Large file patch | Precision Editing |
    | 22 | Skill discovery | Skill System |
    | 23 | Skill execution | Skill System |
    | 24-30 | Additional coding tasks | Various |
    | 31 | Web extractor bug fixes (3 bugs) | Multi-Bug Fix |
    | 32 | CSV pipeline across 3 files | Multi-File Tracking |
    | 33 | FSM bug fixes + factory implementation | State Machine |
    | 34 | Search pre-populated memories | Memory Search |
    | 35 | Analyze code, write to memory, cross-reference | Memory Cross-Reference |
    | 36 | Discover explore_tools, unlock grep_search | Explore Tools |
    | 37 | Analyze code patterns, store and recall from memory | Memory Store & Recall |
    | 38 | Read configs, write to multiple memory topics | Memory Multi-Topic |
    | 39 | Search pre-loaded memories across 3 topics | Memory Pre-Loaded Search |
    | 40 | Combined explore_tools + memory analysis pipeline | Explore + Memory |

    Tasks 31-33 are designed for small model (≤9B) evaluation using file_edit patterns. Tasks 34-40 test the memory system (read/write/search) and tool discovery.

    Benchmark Results

    Qwen3.5-122B: 100% pass rate (37/37 tasks, including memory tasks 34-40)
    Qwen3.5-27B:  100% pass rate (30/30 tasks)
    Qwen3.5-9B:   100% pass rate (tasks 31-33, file_edit-optimized)
                  71% pass rate (5/7 memory tasks 34-40)

    The eval runner includes model-tier-aware features: automatic tool set filtering, HTTP 500 recovery with file_edit hints, loop detection with tool banning, and tier-based output truncation.

    AIWG Integration

    Open Agents integrates with AIWG for AI-augmented software development:

    npm i -g aiwg
    oa "analyze this project's SDLC health and set up documentation"

    | Capability | Description |
    |---|---|
    | Structured Memory | .aiwg/ directory persists project knowledge |
    | SDLC Artifacts | Requirements, architecture, test strategy, deployment docs |
    | Health Analysis | Score your project's SDLC maturity |
    | 85+ Agents | Specialized AI personas (Test Engineer, Security Auditor, API Designer) |
    | Traceability | @-mention system links requirements to code to tests |

    Architecture

    The core is AgenticRunner — a multi-turn tool-calling loop with context management:

    User task → System prompt + tools → LLM → tool_calls → Execute → Feed results → LLM
                                              ↓                                      ↑
                                        Compaction check ─── Memex archive ─── Context restore
                                                  (repeat until task_complete or max turns)

    • Tool-first — the model explores via tools, not pre-stuffed context
    • Iterative — tests, sees failures, fixes them
    • Parallel-safe — read-only tools concurrent, mutating tools sequential
    • Observable — every tool call and result emitted as a real-time event
    • Bounded — max turns, timeout, output limits prevent runaway loops
    • Context-aware — dynamic compaction, Memex archiving, session persistence, model-tier scaling
    • Brute-force — optional auto re-engagement when turn limit is hit (keeps going until task_complete or user abort)
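The bounded loop at the center of this design can be illustrated with a heavily simplified, synchronous sketch. It is not the actual AgenticRunner: the types and `runAgent` function are hypothetical, the model is a plain function standing in for an LLM call, and compaction, Memex archiving, parallel execution, and events are omitted.

```typescript
// Hypothetical sketch of a bounded tool-calling loop; not the real AgenticRunner.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

type Model = (transcript: string[]) => ToolCall[];
type ToolExecutor = (call: ToolCall) => string;

interface RunResult {
  completed: boolean; // true if the model called task_complete
  turns: number;      // turns consumed
  transcript: string[];
}

function runAgent(task: string, model: Model, execute: ToolExecutor, maxTurns: number): RunResult {
  const transcript: string[] = [task];
  for (let turn = 1; turn <= maxTurns; turn++) {
    const calls = model(transcript); // ask the model for its next tool calls
    for (const call of calls) {
      if (call.name === "task_complete") {
        return { completed: true, turns: turn, transcript };
      }
      // Execute the tool and feed the result back for the next turn.
      transcript.push(`${call.name} -> ${execute(call)}`);
    }
  }
  // Bounded: stop once the turn limit is reached.
  return { completed: false, turns: maxTurns, transcript };
}

// A scripted stand-in for an LLM: read a file first, then declare completion.
const scriptedModel: Model = (transcript) =>
  transcript.length === 1
    ? [{ name: "file_read", args: { path: "app.ts" } }]
    : [{ name: "task_complete", args: {} }];

const outcome = runAgent("fix the bug", scriptedModel, () => "ok", 5);
// outcome.completed === true, outcome.turns === 2
```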

    License

    MIT