Package Exports
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (@purepageio/fetch-engines) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
@purepageio/fetch-engines
Web scraping requires handling static HTML, JavaScript-heavy sites, network errors, retries, caching, and bot detection. Managing browser automation tools like Playwright adds complexity with resource pooling and stealth configurations.
@purepageio/fetch-engines provides engines for retrieving web content through a unified API.
Key Benefits:
- Unified API: Use
fetchHTML(url, options?)for processed content orfetchContent(url, options?)for raw content - Smart Fallback Strategy: Tries fast HTTP first, automatically falls back to full browser for complex sites
- AI-Powered Data Extraction: Extract structured data from web pages using OpenAI and Zod schemas
- Raw Content Support: Retrieve PDFs, images, APIs with the same fallback logic
- Built-in Resilience: Caching, retries, and standardised error handling
- Browser Management: Automatic browser pooling and stealth measures for complex sites
- Content Transformation: Convert HTML to clean Markdown
- TypeScript Ready: Fully typed codebase
Table of Contents
- Installation
- Engines
- Basic Usage
- fetchHTML vs fetchContent
- Structured Content Extraction
- Configuration
- Return Value
- API Reference
- Stealth Features
- Error Handling
- Contributing
Installation
pnpm add @purepageio/fetch-enginesFor HybridEngine (uses Playwright), install browser binaries:
pnpm exec playwright installEngines
HybridEngine (recommended): Attempts fast HTTP fetch first, falls back to Playwright browser on failure or when SPA shell detected. Handles both simple and complex sites automatically.
FetchEngine: Lightweight HTTP-only engine for basic sites without browser fallback.
StructuredContentEngine: AI-powered engine that combines HybridEngine with OpenAI for structured data extraction.
Basic Usage
Quick Start
import { HybridEngine } from "@purepageio/fetch-engines";
const engine = new HybridEngine();
// Simple sites use fast HTTP
const simple = await engine.fetchHTML("https://example.com");
console.log(`Title: ${simple.title}`);
// Complex sites automatically use browser
const complex = await engine.fetchHTML("https://spa-site.com", {
markdown: true,
spaMode: true,
});
await engine.cleanup(); // Important: releases browser resourcesWith Custom Headers
const engine = new HybridEngine({
headers: { "X-Custom-Header": "value" },
});
const result = await engine.fetchHTML("https://example.com", {
headers: { "X-Request-Header": "value" },
});Raw Content (PDFs, Images, APIs)
const engine = new HybridEngine();
// Fetch PDF
const pdf = await engine.fetchContent("https://example.com/doc.pdf");
console.log(`PDF size: ${pdf.content.length} bytes`);
// Fetch JSON API with auth
const api = await engine.fetchContent("https://api.example.com/data", {
headers: { Authorization: "Bearer token" },
});
await engine.cleanup();fetchHTML vs fetchContent
fetchHTML(url, options?)
Use for: Web page content extraction
- Processes HTML and extracts metadata (title, etc.)
- Supports HTML-to-Markdown conversion
- Content-type restrictions (HTML/XML only)
- Returns processed content as
string
fetchContent(url, options?)
Use for: Raw content retrieval (like standard fetch)
- Retrieves any content type (PDFs, images, JSON, XML, etc.)
- No content-type restrictions
- Returns
Buffer(binary) orstring(text) - Preserves original MIME type
Example Comparison
// fetchHTML - processes content
const html = await engine.fetchHTML("https://example.com");
console.log(html.title); // "Example Domain"
console.log(html.contentType); // "html" or "markdown"
// fetchContent - raw content
const raw = await engine.fetchContent("https://example.com");
console.log(raw.contentType); // "text/html"
console.log(typeof raw.content); // "string" (raw HTML)
// Binary content
const pdf = await engine.fetchContent("https://example.com/doc.pdf");
console.log(Buffer.isBuffer(pdf.content)); // trueStructured Content Extraction
Extract structured data from web pages using AI and Zod schemas.
Prerequisites
Set environment variable:
export OPENAI_API_KEY="your-openai-api-key"Basic Usage
import { fetchStructuredContent } from "@purepageio/fetch-engines";
import { z } from "zod";
const articleSchema = z.object({
title: z.string(),
author: z.string().optional(),
publishDate: z.string().optional(),
summary: z.string(),
tags: z.array(z.string()),
});
const result = await fetchStructuredContent("https://example.com/article", articleSchema, {
model: "gpt-4.1-mini",
customPrompt: "Extract main article information",
});
console.log("Extracted:", result.data);
console.log("Token usage:", result.usage);StructuredContentEngine Class
import { StructuredContentEngine } from "@purepageio/fetch-engines";
const productSchema = z.object({
name: z.string(),
price: z.number(),
inStock: z.boolean(),
});
const engine = new StructuredContentEngine({
spaMode: true,
spaRenderDelayMs: 2000,
});
const result = await engine.fetchStructuredContent("https://shop.com/product", productSchema);
console.log(`${result.data.name} costs $${result.data.price}`);
await engine.cleanup();Supported Models
'gpt-5-mini'- Latest model, mini version (default)'gpt-5'- Most capable model'gpt-4.1-mini'- Fast and cost-effective'gpt-4.1'- More capable GPT-4.1 version
Configuration
FetchEngine Options
| Option | Type | Default | Description |
|---|---|---|---|
markdown |
boolean |
false |
Convert HTML to Markdown |
headers |
Record<string, string> |
{} |
Custom HTTP headers |
HybridEngine Configuration
| Option | Type | Default | Description |
|---|---|---|---|
headers |
Record<string, string> |
{} |
Default headers for both engines |
markdown |
boolean |
false |
Default Markdown conversion |
useHttpFallback |
boolean |
true |
Try HTTP before Playwright |
spaMode |
boolean |
false |
Enable SPA mode with patient load conditions |
spaRenderDelayMs |
number |
0 |
Delay after page load in SPA mode |
maxRetries |
number |
3 |
Max retry attempts |
cacheTTL |
number |
900000 |
Cache TTL in ms (15 min default) |
concurrentPages |
number |
3 |
Max concurrent pages |
Browser Pool Options
| Option | Type | Default | Description |
|---|---|---|---|
maxBrowsers |
number |
2 |
Max browser instances |
maxPagesPerContext |
number |
6 |
Pages per context before recycling |
maxBrowserAge |
number |
1200000 |
Browser lifetime (20 min) |
Header Precedence
Headers merge in this order (highest precedence first):
- Request-specific headers in
fetchHTML(url, { headers }) - Engine constructor headers
- Engine default headers
Return Value
HTMLFetchResult (fetchHTML)
content(string): HTML or Markdown contentcontentType('html' | 'markdown'): Content formattitle(string | null): Extracted page titleurl(string): Final URL after redirectsisFromCache(boolean): Cache hit indicatorstatusCode(number | undefined): HTTP status code
ContentFetchResult (fetchContent)
content(Buffer | string): Raw content (binary as Buffer, text as string)contentType(string): Original MIME typetitle(string | null): Title if HTML content, otherwise nullurl(string): Final URL after redirectsisFromCache(boolean): Cache hit indicatorstatusCode(number | undefined): HTTP status code
API Reference
engine.fetchHTML(url, options?)
url(string): Target URLoptions?(FetchOptions):headers?: Record<string, string>: Request headersmarkdown?: boolean: Request Markdown (HybridEngine only)fastMode?: boolean: Override fast mode (HybridEngine only)spaMode?: boolean: Override SPA mode (HybridEngine only)
- Returns:
Promise<HTMLFetchResult>
engine.fetchContent(url, options?)
url(string): Target URLoptions?(ContentFetchOptions):headers?: Record<string, string>: Request headers
- Returns:
Promise<ContentFetchResult>
fetchStructuredContent(url, schema, options?)
url(string): Target URLschema(z.ZodSchema<T>): Zod schema for extractionoptions?(StructuredContentOptions):model?: string: OpenAI model (default: 'gpt-5-mini')customPrompt?: string: Additional AI contextengineConfig?: PlaywrightEngineConfig: HybridEngine config
- Returns:
Promise<StructuredContentResult<T>>
engine.cleanup()
Shuts down browser instances for HybridEngine and StructuredContentEngine. Call when finished to release resources. No-op for FetchEngine.
Stealth Features
When HybridEngine uses Playwright, it automatically applies stealth measures via playwright-extra and stealth plugins to bypass common bot detection. No manual configuration required.
Stealth techniques are not foolproof against sophisticated detection systems.
Error Handling
Errors are thrown as FetchError instances with additional context:
message(string): Error descriptioncode(string | undefined): Specific error codeoriginalError(Error | undefined): Underlying errorstatusCode(number | undefined): HTTP status code
Common error codes:
ERR_HTTP_ERROR: HTTP status >= 400ERR_NON_HTML_CONTENT: Non-HTML content for HTML requestERR_FETCH_FAILED: General fetch operation failureERR_PLAYWRIGHT_OPERATION: Playwright operation failureERR_NAVIGATION: Navigation timeout or failureERR_BROWSER_POOL_EXHAUSTED: No available browser resourcesERR_MAX_RETRIES_REACHED: All retry attempts exhaustedERR_MARKDOWN_CONVERSION_NON_HTML: Markdown conversion on non-HTML content
import { HybridEngine } from "@purepageio/fetch-engines";
const engine = new HybridEngine();
try {
const result = await engine.fetchHTML(url);
} catch (error: any) {
console.error(`Error: ${error.code || "Unknown"} - ${error.message}`);
if (error.statusCode) console.error(`Status: ${error.statusCode}`);
}Contributing
Contributions welcome! Open an issue or submit a pull request on GitHub.
License
MIT