@centralinc/browseragent

Browser automation agent using Computer Use with Playwright. This TypeScript SDK combines Anthropic's Computer Use capabilities with Playwright to provide a clean, type-safe interface for driving browser interactions with Claude.

This fork is purpose-built for high-volume RPA scenarios—think large insurance back-offices, government form-filling portals, and other data-heavy workflows.
It runs seamlessly inside Temporal workflows: the agent's native pause / resume / cancel signals can be surfaced as Temporal signals, letting your orchestration layer coordinate long-running jobs while operators jump in when needed (no tight coupling between the human and Temporal itself).
Our goal is to expose a highly configurable, fine-grained agent—dial it up for raw speed or dial it down for pixel-perfect, human-like precision.

🆕 Additional Features in This Fork

At-a-glance feature matrix

⚙️ Capability	What it does	Why it rocks
Tool Registry	Generic capability system for any tool	Extend agents with Slack, Discord, databases, etc.
Smart Scrolling	90 % viewport scrolls + instant text navigation	Turbo page traversal and zero-waste dropdown control
Typing Modes	Fill, fast-character, human-character	Match CAPTCHA tolerances or burn through inputs
Signal Bus	Pause / Resume / Cancel at any step	Add human QA checkpoints in production
URL Extractor	Find links by visible text	Zero CSS selectors needed
Speed Tweaks	Screenshot + delay optimisations	Cut multi-step flows from minutes to seconds

Below are the flagship improvements shipped in the fork:

🔗 URL Extraction Tool

Extract URLs from any visible element - no CSS selectors needed! This feature is unique to this fork.

How It Works

The agent automatically uses the URL extraction tool when you ask for URLs by visible text:

// Simple URL extraction - just ask naturally!
const url = await agent.execute('Extract the URL from the "Learn More" link');

// Extract from article titles
const articleUrl = await agent.execute(
  'Get the URL from the article titled "Introduction to AI"',
);

// Extract multiple URLs with structured output
const urls = await agent.execute(
  "Extract URLs from the top 3 navigation links",
  z.array(
    z.object({
      linkText: z.string(),
      url: z.string(),
    }),
  ),
);

Advanced Capabilities

Smart Search Strategies (prioritized in order):

Exact text matching - Finds elements containing the exact visible text
Partial text matching - Matches text within larger content blocks
Anchor tag detection - Locates <a> tags containing the text
CSS selector fallback - Direct element selection if text is a valid selector
Clickable element search - Finds interactive elements with the text
URL pattern extraction - Detects URLs directly within text content
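
The priority order above amounts to a fallback chain: try each strategy in turn and stop at the first one that yields a URL. The sketch below illustrates that idea over a simplified in-memory list of elements rather than a live page. `SketchElement`, `extractUrlSketch`, and the strategy functions are hypothetical names used for illustration only, not the tool's actual implementation, and only three of the six strategies are modelled.

```typescript
// Simplified stand-in for a DOM element: visible text plus an optional href.
// (Hypothetical shape; the real tool inspects the live page.)
interface SketchElement {
  text: string;
  href?: string;
}

type SketchStrategy = (elements: SketchElement[], query: string) => string | undefined;

// Exact text matching: the visible text equals the query.
const exactText: SketchStrategy = (elements, query) =>
  elements.find((el) => el.text.trim() === query && el.href)?.href;

// Partial text matching: the query appears inside a larger block of text.
const partialText: SketchStrategy = (elements, query) =>
  elements.find((el) => el.text.includes(query) && el.href)?.href;

// URL pattern extraction: a URL written directly in the text content.
const urlInText: SketchStrategy = (elements, query) => {
  const hit = elements.find((el) => el.text.includes(query));
  return hit?.text.match(/https?:\/\/[^\s"'<>]+/)?.[0];
};

// Strategies run in priority order; the first one that yields a URL wins.
function extractUrlSketch(elements: SketchElement[], query: string): string | undefined {
  for (const strategy of [exactText, partialText, urlInText]) {
    const url = strategy(elements, query);
    if (url !== undefined) return url;
  }
  return undefined;
}
```

Ordering the chain from most to least specific means an exact label such as "Learn More" is preferred over a longer sentence that merely contains it.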

Technical Features:

Computer Use optimized - Works seamlessly with Claude's visual perception
Multiple HTML structures - Handles complex nested elements and dynamic content
Automatic URL normalization - Converts relative to absolute URLs
Smart error handling - Provides helpful feedback when elements aren't found
Logging and debugging - Built-in console logging for troubleshooting
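
Relative-to-absolute normalization can be illustrated with the standard WHATWG `URL` constructor, which resolves an href against a base URL. This is a minimal sketch that assumes the current page URL is used as the base; `normalizeUrlSketch` is a hypothetical helper, not part of the SDK.

```typescript
// Resolve an href the way a browser would, relative to the page it was found on.
// Returns undefined when the href or the base cannot be parsed as a URL.
function normalizeUrlSketch(href: string, pageUrl: string): string | undefined {
  try {
    return new URL(href, pageUrl).href;
  } catch {
    return undefined;
  }
}

// A root-relative link found on the Hacker News front page:
const absolute = normalizeUrlSketch("/item?id=1", "https://news.ycombinator.com/news");
// absolute === "https://news.ycombinator.com/item?id=1"
```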

Best Practices:

Use the exact visible text you can see on the page
For buttons or links, use their label text (e.g., "Download", "Read More", "View Details")
For articles or stories, use their title text
The tool will automatically handle finding the associated URL

🛠️ Tool Registry System

Extend your agents with any external tool using our flexible capability system - not just Playwright!

How It Works

The Tool Registry provides a simple, type-safe way to add capabilities to your agents:

import { z } from "zod";
import { registerPlaywrightCapability } from "@centralinc/browseragent";

// Add a custom Playwright capability
registerPlaywrightCapability({
  method: "check_all",
  displayName: "Check All Checkboxes",
  description: "Check all checkboxes matching a pattern",
  usage: "Check multiple checkboxes at once by pattern",
  schema: z.tuple([z.string()]),
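  handler: async (page, args) => {
    const [pattern] = args;
    // check() targets a single element, so iterate over every match
    for (const box of await page.locator(`input[type="checkbox"]${pattern}`).all()) {
      await box.check();
    }
    return { output: `Checked all checkboxes matching ${pattern}` };
  },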
});

// Use it naturally in prompts
await agent.execute('Check all the "Accept Terms" checkboxes on this form');

Extend Beyond Playwright

The registry supports any tool type. Here's a Slack integration example:

// Create a Slack tool
class SlackTool implements ComputerUseTool {
  name: "slack" = "slack";
  // ... implementation
}

// Use it with the agent
const agent = new ComputerUseAgent({
  apiKey: ANTHROPIC_API_KEY,
  page,
  additionalTools: [new SlackTool(SLACK_TOKEN)],
});

// Natural language Slack operations
await agent.execute(
  "Send a message to #general saying the deployment is complete",
);
await agent.execute(
  "Navigate to the metrics dashboard and share a screenshot in #analytics",
);

Supported Tool Types:

📧 Communication: Slack, Discord, Teams, Email
🗄️ Data: Databases, APIs, File systems
🔧 Utilities: AWS, GitHub, Jira
🤖 Custom: Any tool you can imagine!

Key Features:

Type-safe with Zod schemas
Auto-generated documentation
Natural language prompts
No complex inheritance needed

See the Tool Registry Design Doc for complete examples.
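
To make the registry idea concrete, here is a minimal in-memory sketch of a capability store that can also render a short description for each registered method (one plausible reading of "auto-generated documentation"). Everything in it (`CapabilitySketch`, `ToolRegistrySketch`, the `describe` method) is hypothetical and written for illustration; it is not the SDK's registry and does not use Zod.

```typescript
// Hypothetical capability record, loosely mirroring the fields used above.
interface CapabilitySketch {
  tool: string;
  method: string;
  displayName: string;
  description: string;
  usage: string;
  handler: (args: string[]) => Promise<{ output: string }>;
}

class ToolRegistrySketch {
  private capabilities: CapabilitySketch[] = [];

  // Registering the same tool/method pair again replaces the earlier entry.
  register(capability: CapabilitySketch): void {
    this.capabilities = this.capabilities.filter(
      (c) => !(c.tool === capability.tool && c.method === capability.method),
    );
    this.capabilities.push(capability);
  }

  get(tool: string, method: string): CapabilitySketch | undefined {
    return this.capabilities.find((c) => c.tool === tool && c.method === method);
  }

  // One line per capability, e.g. for inclusion in a system prompt.
  describe(tool: string): string {
    return this.capabilities
      .filter((c) => c.tool === tool)
      .map((c) => `${c.method} (${c.displayName}): ${c.description}`)
      .join("\n");
  }
}
```

Keeping capabilities as plain records is what makes "no complex inheritance" possible: a new tool only needs to supply data and a handler.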

Instant Text Navigation (scroll_to_text)

Jump directly to any text in dropdowns, lists, or scrollable containers - no multiple scroll attempts needed!

How It Works

The agent can use the scroll_to_text playwright method to instantly navigate to specific text:

// The agent sees a state dropdown and needs Wyoming
await agent.execute(`
  Use the playwright scroll_to_text method to find "Wyoming" in the state picker
`);

// Behind the scenes, the agent calls:
// {"name": "playwright", "input": {"method": "scroll_to_text", "args": ["Wyoming"]}}

Smart Features:

Automatically detects scrollable containers in viewport
Searches visible containers first, then whole page
Case-insensitive fallback if exact match not found
Graceful fallback to regular scrolling if text not found
No CSS selectors needed - just the visible text!
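
The exact-then-case-insensitive matching order can be sketched as a pure function over the visible option labels. This is an illustration only: `findTextIndexSketch` is a hypothetical helper, "exact" is assumed to mean whole-label equality, and the real method works against the page rather than an array of strings.

```typescript
// Returns the index of the option to scroll to, or -1 so the caller can fall
// back to regular scrolling. Exact matches win over case-insensitive ones.
function findTextIndexSketch(options: string[], text: string): number {
  const exact = options.findIndex((option) => option.trim() === text);
  if (exact !== -1) return exact;

  const wanted = text.toLowerCase();
  return options.findIndex((option) => option.trim().toLowerCase() === wanted);
}

const states = ["Wisconsin", "wyoming", "Wyoming"];
const exactHit = findTextIndexSketch(states, "Wyoming"); // 2 (exact match preferred)
const looseHit = findTextIndexSketch(states, "WYOMING"); // 1 (first case-insensitive match)
```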

When the agent uses this:

Finding specific options in dropdowns (states, countries, etc.)
Navigating to products in long lists
Jumping to specific items in sidebars
Any scenario where exact text is known

Example: Instead of 10+ small scrolls to find "Wyoming", it's now a single instant jump!

🖱️ Smart Scrolling (90 % Viewport)

Speed through long pages while preserving precise control in small UI elements.

Default behaviour → Scrolls ~90 % of the viewport with ~10 % overlap for maximum throughput.
Fine control → scroll_amount between 5-20 performs tiny scrolls—perfect for dropdowns, lists, side-panels.
Configurable → Accepts any scroll_amount 1-100 and degrades gracefully.

Why it matters: Form-heavy portals (e.g. insurance claim systems) often require rapid page-level scrolling punctuated by pixel-perfect adjustments inside select widgets. This feature automatically handles both cases.
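
As a worked example of the numbers above, the sketch below converts a scroll_amount into pixels. It assumes scroll_amount is a percentage of the viewport height and that out-of-range values are clamped to 1-100; both are assumptions for illustration, and `scrollDistanceSketch` is a hypothetical helper rather than the SDK's code.

```typescript
// Assumption: scrollAmount is a percentage of the viewport height.
// With no scrollAmount, the 90 % default applies, leaving ~10 % overlap.
function scrollDistanceSketch(viewportHeight: number, scrollAmount?: number): number {
  const DEFAULT_PERCENT = 90;
  const percent = Math.min(100, Math.max(1, scrollAmount ?? DEFAULT_PERCENT));
  return Math.round((viewportHeight * percent) / 100);
}

const pageScroll = scrollDistanceSketch(800); // 720 px, so 80 px of context stays visible
const dropdownScroll = scrollDistanceSketch(800, 10); // 80 px, a small nudge inside a widget
```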

⚡ Speed Optimizations

Screenshots now capture ~5× faster and post-action waits are shortened:

Action	Old Delay	New Delay
Screenshot wait	2 s	0.3 s
Post-typing wait	0.5 s	0.1 s
Post-scroll wait	0.5 s	0.1 s
Mouse move pause	0.1 s	0.02 s

These cut 1-2 seconds from each multi-step interaction.
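
To see where the savings come from, the table can be turned into a small calculation. The sketch below restates the delays in milliseconds and sums the difference for a sequence of actions; `TimedAction` and `savedMsSketch` are hypothetical names, and the figures cover only the fixed waits listed in the table, not model or network latency.

```typescript
type TimedAction = "screenshot" | "typing" | "scroll" | "mouseMove";

// Waits from the table above, converted to milliseconds.
const OLD_DELAY_MS: Record<TimedAction, number> = {
  screenshot: 2000,
  typing: 500,
  scroll: 500,
  mouseMove: 100,
};
const NEW_DELAY_MS: Record<TimedAction, number> = {
  screenshot: 300,
  typing: 100,
  scroll: 100,
  mouseMove: 20,
};

// Total fixed wait time removed for a given sequence of actions.
function savedMsSketch(actions: TimedAction[]): number {
  return actions.reduce(
    (total, action) => total + OLD_DELAY_MS[action] - NEW_DELAY_MS[action],
    0,
  );
}

// Move the mouse, type into a field, then take a screenshot:
const savedPerStep = savedMsSketch(["mouseMove", "typing", "screenshot"]); // 2180 ms
```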

⚠️ Heads-up: Some sites rely on human-like pacing for anti-bot checks. If you encounter captchas or missing render states, increase the delays via the new constructor parameters:
const slowerComputer = new ComputerTool(
  page,
  "20250124",
  /* screenshotDelay in seconds, raised from 0.3 */ 0.5,
);
// or adjust post-action waits inside ComputerTool if needed

⏯️ Agent Signals (Pause / Resume / Cancel)

Bring human-in-the-loop control to long-running automation workflows.

Pause an active agent.execute() run to inspect or fix the page
Resume from the exact step where you left off
Cancel gracefully without killing the process
Real-time events: onPause, onResume, onCancel, onError

const agent = new ComputerUseAgent({ apiKey, page });

// Subscribe to events
agent.controller.on("onPause", ({ step }) => console.log("Paused at", step));
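
The pause / resume / cancel behaviour can be pictured as a small state machine that emits an event on each valid transition. The sketch below is a simplified, synchronous illustration under that assumption; `SketchController` and its types are hypothetical and are not the SDK's controller, which also has to suspend and continue the running task.

```typescript
type SketchSignal = "pause" | "resume" | "cancel";
type SketchEvent = "onPause" | "onResume" | "onCancel";
type SketchState = "running" | "paused" | "cancelled";
type SketchListener = (info: { step: number }) => void;

class SketchController {
  state: SketchState = "running";
  step = 0; // the agent loop would update this as it progresses
  private listeners: Record<SketchEvent, SketchListener[]> = {
    onPause: [],
    onResume: [],
    onCancel: [],
  };

  on(event: SketchEvent, listener: SketchListener): void {
    this.listeners[event].push(listener);
  }

  // Signals that do not apply to the current state are ignored.
  signal(kind: SketchSignal): void {
    if (this.state === "cancelled") return;
    if (kind === "pause" && this.state === "running") {
      this.transition("paused", "onPause");
    } else if (kind === "resume" && this.state === "paused") {
      this.transition("running", "onResume");
    } else if (kind === "cancel") {
      this.transition("cancelled", "onCancel");
    }
  }

  private transition(next: SketchState, event: SketchEvent): void {
    this.state = next;
    for (const listener of this.listeners[event]) {
      listener({ step: this.step });
    }
  }
}
```

An agent loop built on this would check `state` between steps: wait while it is "paused" and stop cleanly once it is "cancelled".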

⚙️ Configurable Execution Behavior

This fork includes a powerful configuration system that allows you to customize how the agent executes browser automation tasks. You can control typing speed, screenshot timing, scrolling strategy, mouse behaviour, and other automation settings to optimise for raw speed or human-like interaction.

Available Configuration Options

import type { ExecutionConfig } from "@centralinc/browseragent";

const executionConfig: ExecutionConfig = {
  typing: {
    mode: "fill", // or "character-by-character"
    characterDelay: 12, // milliseconds between characters (character-by-character mode)
    completionDelay: 100, // milliseconds to wait after typing completes
  },
  screenshot: {
    delay: 0.3, // seconds to wait before taking screenshots
    quality: "medium", // "low" | "medium" | "high"
  },
  mouse: {
    moveSpeed: "fast", // "instant" | "fast" | "normal" | "slow"
    clickDelay: 50, // milliseconds to wait after clicks
  },
  scrolling: {
    /**
     * When no scroll_amount is provided the agent will use this mode
     * with ~90 % viewport coverage for page-level scrolling.
     */
    mode: "percentage", // (future-proofed for pixel or element-based modes)
    /** Default percentage of the viewport to scroll. */
    percentage: 90,
    /** Overlap percentage to keep for context during large scrolls. */
    overlap: 10,
  },
};

Typing Mode Configuration

The most impactful setting is the typing behavior. There are two modes, fill and character-by-character; tuning characterDelay in the latter gives the three presets below:

🚀 Fill Mode (Fastest) - Directly fills input fields bypassing keyboard events entirely:

const fastAgent = new ComputerUseAgent({
  apiKey: process.env.ANTHROPIC_API_KEY!,
  page,
  executionConfig: {
    typing: { mode: "fill", completionDelay: 50 },
  },
});

⌨️ Character-by-Character Mode (Human-like) - Types text one character at a time with configurable delays:

const humanLikeAgent = new ComputerUseAgent({
  apiKey: process.env.ANTHROPIC_API_KEY!,
  page,
  executionConfig: {
    typing: {
      mode: "character-by-character",
      characterDelay: 100, // 100ms between each character
      completionDelay: 200,
    },
  },
});

⚡ Fast Character Mode (Balanced) - Best of both worlds - visible typing but very fast:

const balancedAgent = new ComputerUseAgent({
  apiKey: process.env.ANTHROPIC_API_KEY!,
  page,
  executionConfig: {
    typing: {
      mode: "character-by-character",
      characterDelay: 5, // Very fast character typing
      completionDelay: 75,
    },
  },
});

Performance Comparison:

Mode	Speed	Visibility	Use Case
Fill	⚡⚡⚡ Fastest	❌ Instant	Production, speed-critical tasks
Fast Character	⚡⚡ Very Fast	✅ Visible	Development, debugging
Slow Character	⚡ Human-like	✅ Very visible	Demos, human-like automation
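
A rough estimate of the fixed delay each preset adds per input field makes the comparison concrete. The sketch below assumes fill mode costs only completionDelay and character mode costs one characterDelay per character plus completionDelay; it ignores Playwright and browser overhead, and `TypingSketchConfig` and `estimateTypingMsSketch` are hypothetical names for illustration.

```typescript
interface TypingSketchConfig {
  mode: "fill" | "character-by-character";
  characterDelay?: number; // ms per character (character-by-character mode)
  completionDelay?: number; // ms to wait after typing completes
}

// Estimated configured delay, in milliseconds, for typing `text` once.
function estimateTypingMsSketch(text: string, config: TypingSketchConfig): number {
  const completion = config.completionDelay ?? 0;
  if (config.mode === "fill") return completion;
  return text.length * (config.characterDelay ?? 0) + completion;
}

const address = "123 Main Street"; // 15 characters
const fillMs = estimateTypingMsSketch(address, { mode: "fill", completionDelay: 50 }); // 50
const fastMs = estimateTypingMsSketch(address, {
  mode: "character-by-character",
  characterDelay: 5,
  completionDelay: 75,
}); // 150
const humanMs = estimateTypingMsSketch(address, {
  mode: "character-by-character",
  characterDelay: 100,
  completionDelay: 200,
}); // 1700
```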

Try the Example

Run the included example to see the performance differences:

# Run the typing configuration example (set ANTHROPIC_API_KEY first)
npx ts-node examples/example-typing-config.ts

Returning to the agent signals described earlier, the rest of that example subscribes to onResume, triggers a pause from a timer, and then starts a task:

agent.controller.on('onResume', () => console.log('Resumed'));

// Trigger a pause after 5 s
setTimeout(() => agent.controller.signal('pause'), 5_000);

// Start a task (the controller is available immediately)
await agent.execute('Get the titles of the top 10 stories');

Great for debugging, watchdog timeouts, and manual overrides.

Features

🤖 Simple API: Single ComputerUseAgent class for all computer use tasks
🔄 Dual Response Types: Support for both text and structured (JSON) responses
🛡️ Type Safety: Full TypeScript support with Zod schema validation
⚡ Optimized: Clean error handling and robust JSON parsing
🎯 Focused: Clean API surface with sensible defaults

Installation

npm install @centralinc/browseragent playwright @playwright/test
# or
yarn add @centralinc/browseragent playwright @playwright/test
# or
pnpm add @centralinc/browseragent playwright @playwright/test

Quick Start

import { chromium } from "playwright";
import { ComputerUseAgent } from "@centralinc/browseragent";

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

// Navigate to Hacker News manually first
await page.goto("https://news.ycombinator.com/");

const agent = new ComputerUseAgent({
  apiKey: process.env.ANTHROPIC_API_KEY!,
  page,
});

// Simple text response
const answer = await agent.execute("Tell me the title of the top story");
console.log(answer);

await browser.close();

API Reference

`ComputerUseAgent`

The main class for computer use automation.

Constructor

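new ComputerUseAgent(options: {
  apiKey: string;
  page: Page;
  model?: string;
  executionConfig?: ExecutionConfig;
  retryConfig?: RetryConfig;
  additionalTools?: ComputerUseTool[];
})

Parameters:

apiKey (string): Your Anthropic API key. Get one from Anthropic Console
page (Page): Playwright page instance to control
model (string, optional): Anthropic model to use. Defaults to 'claude-sonnet-4-20250514'
executionConfig (ExecutionConfig, optional): Typing, screenshot, mouse, and scrolling behaviour. See Configurable Execution Behavior
retryConfig (RetryConfig, optional): Retry and backoff settings. See Retry Configuration
additionalTools (ComputerUseTool[], optional): Extra tools made available to the agent. See Tool Registry System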

Supported Models: See Anthropic's Computer Use documentation for the latest model compatibility.

`execute()` Method

async execute<T = string>(
  query: string,
  schema?: z.ZodSchema<T>,
  options?: {
    systemPromptSuffix?: string;
    thinkingBudget?: number;
  }
): Promise<T>

Parameters:

query (string): The task description for Claude to execute
schema (ZodSchema, optional): Zod schema for structured responses. When provided, the response will be validated against this schema
options (object, optional):
- systemPromptSuffix (string): Additional instructions appended to the system prompt
- thinkingBudget (number): Token budget for Claude's internal reasoning process. Default: 1024. See Extended Thinking documentation for details

Returns:

Promise<T>: When schema is provided, returns validated data of type T
Promise<string>: When no schema is provided, returns the text response
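
Turning a free-text model reply into validated data usually involves two steps: locating the JSON in the reply, then checking its shape. The sketch below shows one common way to do that and is not taken from the SDK's source; `parseStructuredSketch` and `validateLinksSketch` are hypothetical, and the hand-written validator stands in for a Zod schema's `parse`.

```typescript
interface LinkSketch {
  linkText: string;
  url: string;
}

// Stand-in for a schema: throws unless the value is an array of link records.
function validateLinksSketch(value: unknown): LinkSketch[] {
  if (!Array.isArray(value)) throw new Error("Expected an array");
  return value.map((item) => {
    if (typeof item?.linkText !== "string" || typeof item?.url !== "string") {
      throw new Error("Expected { linkText: string, url: string }");
    }
    return { linkText: item.linkText, url: item.url };
  });
}

// Pull the outermost JSON array or object out of a reply and validate it.
function parseStructuredSketch<T>(reply: string, validate: (value: unknown) => T): T {
  const start = reply.search(/[\[{]/);
  const end = Math.max(reply.lastIndexOf("]"), reply.lastIndexOf("}"));
  if (start === -1 || end < start) {
    throw new Error("No JSON found in reply");
  }
  const parsed: unknown = JSON.parse(reply.slice(start, end + 1));
  return validate(parsed);
}

const reply = 'Here are the links: [{"linkText":"Home","url":"https://example.com/"}] Done.';
const links = parseStructuredSketch(reply, validateLinksSketch);
// links[0].url === "https://example.com/"
```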

Usage Examples

Text Response

import { ComputerUseAgent } from "@centralinc/browseragent";

// Navigate to the target page first
await page.goto("https://news.ycombinator.com/");

const agent = new ComputerUseAgent({
  apiKey: process.env.ANTHROPIC_API_KEY!,
  page,
});

const result = await agent.execute(
  "Tell me the title of the top story on this page",
);
console.log(result); // "Title of the top story"

Structured Response with Zod

import { z } from "zod";
import { ComputerUseAgent } from "@centralinc/browseragent";

const agent = new ComputerUseAgent({
  apiKey: process.env.ANTHROPIC_API_KEY!,
  page,
});

const HackerNewsStory = z.object({
  title: z.string(),
  points: z.number(),
  author: z.string(),
  comments: z.number(),
  url: z.string().optional(),
});

const stories = await agent.execute(
  "Get the top 5 Hacker News stories with their details",
  z.array(HackerNewsStory).max(5),
);

console.log(stories);
// [
//   {
//     title: "Example Story",
//     points: 150,
//     author: "user123",
//     comments: 42,
//     url: "https://example.com"
//   },
//   ...
// ]

Advanced Options

const result = await agent.execute(
  "Complex task requiring more thinking",
  undefined, // No schema for text response
  {
    systemPromptSuffix: "Be extra careful with form submissions.",
    thinkingBudget: 4096, // More thinking tokens for complex tasks
  },
);

Retry Configuration

The SDK includes built-in retry logic for handling connection errors and transient failures:

import { ComputerUseAgent, type RetryConfig } from "@centralinc/browseragent";

const retryConfig: RetryConfig = {
  maxRetries: 5,             // Maximum retry attempts (default: 3)
  initialDelayMs: 2000,      // Initial delay between retries (default: 1000ms)
  maxDelayMs: 60000,         // Maximum delay between retries (default: 30000ms)
  backoffMultiplier: 2.5,    // Exponential backoff multiplier (default: 2)
  preferIPv4: true,          // Prefer IPv4 DNS resolution (helpful with VPNs like Tailscale)
  retryableErrors: [         // Errors that trigger retries
    "Connection error",
    "ECONNREFUSED",
    "ETIMEDOUT",
    "ECONNRESET",
    "socket hang up",
  ],
};

const agent = new ComputerUseAgent({
  apiKey: process.env.ANTHROPIC_API_KEY!,
  page,
  retryConfig, // Custom retry configuration
});

The retry mechanism uses exponential backoff with jitter to avoid thundering herd problems. Connection errors and network timeouts are automatically retried with increasing delays.
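
The delay schedule can be sketched as follows. The exact jitter formula is not documented here, so this example assumes "equal jitter" (half the capped delay is fixed, half is randomised); `backoffDelayMsSketch` and `isRetryableSketch` are hypothetical helpers, and the defaults are the ones quoted in the comments above.

```typescript
interface BackoffSketchConfig {
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}

const BACKOFF_SKETCH_DEFAULTS: BackoffSketchConfig = {
  initialDelayMs: 1000,
  maxDelayMs: 30000,
  backoffMultiplier: 2,
};

// attempt is zero-based: 0 is the wait before the first retry.
// `random` is injectable so the schedule can be checked deterministically.
function backoffDelayMsSketch(
  attempt: number,
  config: BackoffSketchConfig = BACKOFF_SKETCH_DEFAULTS,
  random: () => number = Math.random,
): number {
  const exponential = config.initialDelayMs * Math.pow(config.backoffMultiplier, attempt);
  const capped = Math.min(config.maxDelayMs, exponential);
  // Equal jitter (assumed): keep half the delay, randomise the other half.
  return Math.round(capped / 2 + (capped / 2) * random());
}

// An error is retried when its message contains any configured fragment.
function isRetryableSketch(error: unknown, retryableErrors: string[]): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return retryableErrors.some((fragment) => message.includes(fragment));
}
```

With the defaults, the upper bound of the wait doubles each attempt (1 s, 2 s, 4 s, ...) until it reaches the 30 s cap, while the jitter spreads concurrent clients across the lower half of each window.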

Note for VPN/Tailscale Users: If you're experiencing ENETUNREACH errors with IPv6 addresses, set preferIPv4: true in your retry configuration to resolve DNS to IPv4 addresses only.

Tool Registry API

The SDK exports functions for extending capabilities:

import { z } from "zod";
import {
  registerPlaywrightCapability,
  getToolRegistry,
  defineCapability,
} from "@centralinc/browseragent";

// Register a new Playwright capability
registerPlaywrightCapability({
  method: "custom_action",
  displayName: "Custom Action",
  description: "Performs a custom browser action",
  usage: "Detailed usage instructions",
  schema: z.object({ selector: z.string() }),
  handler: async (page, args) => {
    // Implementation
    return { output: "Success" };
  },
});

// Register capabilities for other tools
const registry = getToolRegistry();
registry.register(
  defineCapability("slack", "send_message", {
    displayName: "Send Message",
    description: "Send a Slack message",
    usage: "Send message to channel",
    schema: z.tuple([z.string(), z.string()]),
  }),
);

Environment Setup

Anthropic API Key: Set your API key as an environment variable:
```
export ANTHROPIC_API_KEY=your_api_key_here
```
Playwright: Install Playwright and browser dependencies:
```
npx playwright install
```

Computer Use Parameters

This SDK leverages Anthropic's Computer Use API with the following key parameters:

Model Selection

Claude 3.5 Sonnet: Best balance of speed and capability for most tasks
Claude 4 Models: Enhanced reasoning with extended thinking capabilities
Claude 3.7 Sonnet: Advanced reasoning with thinking transparency

Thinking Budget

The thinkingBudget parameter controls Claude's internal reasoning process:

1024 tokens (default): Suitable for simple tasks
4096+ tokens: Better for complex reasoning tasks
16k+ tokens: Recommended for highly complex multi-step operations

See Anthropic's Extended Thinking guide for optimization tips.

Error Handling

The SDK includes built-in error handling:

try {
  const result = await agent.execute("Your task here");
  console.log(result);
} catch (error) {
  // In strict TypeScript the caught value is `unknown`, so narrow it first
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes("No response received")) {
    console.log("Agent did not receive a response from Claude");
  } else {
    console.log("Other error:", message);
  }
}

Best Practices

Use specific, clear instructions: "Click the red 'Submit' button" vs "click submit"
For complex tasks, break them down: Use step-by-step instructions in your query
Optimize thinking budget: Start with default (1024) and increase for complex tasks
Handle errors gracefully: Implement proper error handling for production use
Use structured responses: When you need specific data format, use Zod schemas
Test with headless: false: During development, run with a visible browser so you can watch and debug the agent's actions

Security Considerations

⚠️ Important: Computer use can interact with any visible application. Always:

Run in isolated environments (containers/VMs) for production
Avoid providing access to sensitive accounts or data
Review Claude's actions in logs before production deployment
Use allowlisted domains when possible

See Anthropic's Computer Use Security Guide for detailed security recommendations.
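
A domain allowlist can be as simple as a hostname check before any navigation. The sketch below is one way to write that check using the standard `URL` class; `isAllowedUrlSketch` is a hypothetical helper, not an SDK feature, and where to call it (for example before `page.goto`) is left to the caller.

```typescript
// True when the URL's hostname equals an allowed domain or is a subdomain of it.
function isAllowedUrlSketch(url: string, allowedDomains: string[]): boolean {
  let hostname: string;
  try {
    hostname = new URL(url).hostname.toLowerCase();
  } catch {
    return false; // unparseable URLs are rejected
  }
  return allowedDomains.some((domain) => {
    const allowed = domain.toLowerCase();
    return hostname === allowed || hostname.endsWith(`.${allowed}`);
  });
}

const allowlist = ["ycombinator.com"];
const ok = isAllowedUrlSketch("https://news.ycombinator.com/item?id=1", allowlist); // true
const lookalike = isAllowedUrlSketch("https://evil-ycombinator.com/", allowlist); // false
```

Matching on a leading dot rather than a plain suffix is what keeps look-alike hosts such as evil-ycombinator.com out.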

Requirements

Node.js 18+
TypeScript 5+
Playwright 1.52+
Anthropic API key

License

See License