Package Exports

@purepageio/fetch-engines
@purepageio/fetch-engines/dist/index.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (@purepageio/fetch-engines) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

@purepageio/fetch-engines

Fetch websites with confidence. @purepageio/fetch-engines gives teams an HTTP-first workflow that automatically promotes tricky pages to a managed Playwright browser and can even hand structured results back through OpenAI.

Why fetch-engines?
Installation
Quick start
Usage patterns
- Pick an engine
- Structured extraction
Configuration
- Essentials
- Complete reference
Error handling
Tooling and examples
Contributing
License

Why fetch-engines?

One API for multiple strategies – Call fetchHTML for rendered pages or fetchContent for raw responses. The library handles HTTP shortcuts and Playwright fallbacks automatically.
Production-minded defaults – Retries, caching, and consistent telemetry are ready out of the box.
Drop-in AI enrichment – Provide a Zod schema and let OpenAI (or any OpenAI-compatible API) convert full pages into structured data.
Typed and tested – Built in TypeScript with examples that mirror real-world scraping pipelines.

Installation

pnpm add @purepageio/fetch-engines
# install Playwright browsers once if you plan to use the Hybrid or Playwright engines
pnpm exec playwright install

Quick start

import { HybridEngine } from "@purepageio/fetch-engines";

const engine = new HybridEngine();

const page = await engine.fetchHTML("https://example.com");
console.log(page.title);

await engine.cleanup();

Usage patterns

Pick an engine

Engine	When to use it
`HybridEngine`	Default option. Starts with HTTP, then retries via Playwright for tougher pages.
`FetchEngine`	Lightweight HTML/text fetching with zero browser overhead.
`StructuredContentEngine`	Fetch a page and transform it into typed data with OpenAI.

Structured extraction

import { fetchStructuredContent } from "@purepageio/fetch-engines";
import { z } from "zod";

// IMPORTANT: All schema fields must have .describe() calls to guide the AI model
const schema = z.object({
  title: z.string().describe("The title of the article"),
  summary: z.string().describe("A brief summary of the article content"),
});

// model is required - use any model supported by your API provider
const result = await fetchStructuredContent("https://example.com/article", schema, { model: "gpt-4.1-mini" });

console.log(result.data.summary);

Set OPENAI_API_KEY (or OPENROUTER_API_KEY) before running structured helpers, or use apiConfig to connect to OpenAI-compatible APIs like OpenRouter. The engine automatically adds the Authorization header when you provide an API key:

const result = await fetchStructuredContent("https://example.com/article", schema, {
  model: "anthropic/claude-3.5-sonnet",
  apiConfig: {
    apiKey: process.env.OPENROUTER_API_KEY,
    baseURL: "https://openrouter.ai/api/v1",
    headers: {
      "HTTP-Referer": "https://your-app.com",
      "X-Title": "Your App Name",
    },
  },
});

When you supply a custom baseURL, the engine automatically switches to the Vercel AI SDK's createOpenAICompatible provider (instead of createOpenAI) so OpenAI-compatible services like OpenRouter receive the correct API-key auth flow.

Configuration

Essentials

All engines accept familiar fetch options such as custom headers. Additional Hybrid/Playwright options you are likely to tweak:

markdown – return Markdown instead of HTML.
spaMode & spaRenderDelayMs – allow single-page apps to render before extraction.
cacheTTL, maxRetries, and browser pool sizes – control resilience and throughput.

Check the inline TypeScript docs or the /examples directory for end-to-end flows.

Complete reference

Every option from PlaywrightEngineConfig (consumed by HybridEngine) with defaults:

Option	Default	Purpose
`headers`	`{}`	Extra headers merged into every request.
`concurrentPages`	`3`	Maximum Playwright pages processed at once.
`maxRetries`	`3`	Additional retry attempts after the first failure.
`retryDelay`	`5000`	Milliseconds to wait between retries.
`cacheTTL`	`900000`	Cache lifetime in ms (`0` disables caching).
`useHttpFallback`	`true`	Try a fast HTTP GET before spinning up Playwright.
`useHeadedModeFallback`	`false`	Automatically retry a domain in headed mode after repeated failures.
`defaultFastMode`	`true`	Block non-critical assets and skip human simulation unless overridden.
`simulateHumanBehavior`	`true`	When not in fast mode, add delays and scrolling to avoid bot detection.
`maxBrowsers`	`2`	Highest number of Playwright browser instances kept in the pool.
`maxPagesPerContext`	`6`	Pages opened per browser context before recycling it.
`maxBrowserAge`	`1200000`	Milliseconds before a browser instance is torn down (20 minutes).
`healthCheckInterval`	`60000`	Pool health check frequency in ms.
`poolBlockedDomains`	`[]`	Domains blocked across every Playwright request (inherit pool defaults if empty).
`poolBlockedResourceTypes`	`[]`	Resource types (e.g. `"image"`) blocked globally.
`proxy`	`undefined`	Per-browser proxy `{ server, username?, password? }`.
`useHeadedMode`	`false`	Force every browser to launch with a visible window.
`markdown`	`true`	Return Markdown (instead of HTML) when possible. Override per request with `markdown: false`.
`spaMode`	`false`	Enable SPA heuristics and allow additional waits for client rendering.
`spaRenderDelayMs`	`0`	Extra delay after load when `spaMode` is `true`.
`playwrightOnlyPatterns`	`[]`	URLs matching any string/regex go straight to Playwright, skipping HTTP fetches.
`playwrightLaunchOptions`	`undefined`	Options passed to `browserType.launch` (see Playwright docs).

Per-request overrides: fetchHTML accepts fastMode, markdown, spaMode, and headers, while fetchContent supports fastMode and headers.

Error handling

Failures raise a typed FetchError exposing code, statusCode, and the underlying error. Log these fields to diagnose issues quickly and tune your retry policy.

Tooling and examples

Explore the examples directory for scripts you can run end-to-end.
Ready-to-use TypeScript types ship with the package.
pnpm test runs the automated suite when you are ready to contribute.

Contributing

Issues and pull requests are welcome! Please follow the existing linting/test commands before sending a change.

License

Distributed under the MIT license.