@purepageio/fetch-engines
A collection of configurable engines for fetching HTML content using plain fetch or Playwright.
This package provides robust and customizable ways to retrieve web page content, handling retries, caching, user agents, and optional browser automation via Playwright for complex JavaScript-driven sites.
Installation
pnpm add @purepageio/fetch-engines
# or with npm
npm install @purepageio/fetch-engines
# or with yarn
yarn add @purepageio/fetch-engines
If you plan to use the PlaywrightEngine or HybridEngine, you also need to install Playwright's browser binaries:
pnpm exec playwright install
# or
npx playwright install
Engines
- FetchEngine: Uses the standard fetch API. Suitable for simple HTML pages or APIs returning HTML. Lightweight and fast.
- PlaywrightEngine: Uses Playwright to control a managed pool of headless browsers (Chromium by default via playwright-extra). Handles JavaScript rendering, complex interactions, and provides automatic stealth/anti-bot detection measures. More resource-intensive but necessary for dynamic websites.
- HybridEngine: A smart combination. It first attempts to fetch content using the lightweight FetchEngine. If that fails for any reason (e.g., network error, non-HTML content, HTTP error like 403), it automatically falls back to using the PlaywrightEngine. This provides the speed of FetchEngine for simple sites while retaining the power of PlaywrightEngine for complex ones.
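The fallback behavior described for HybridEngine can be pictured with a small sketch. This is not the package's implementation: `PageFetcher`, `fetchWithFallback`, and the `usedFallback` flag are hypothetical names used only for illustration, assuming each engine exposes an async `fetchHTML(url)`.

```typescript
// Hypothetical minimal shape shared by both fetchers in this sketch.
interface PageFetcher {
  fetchHTML(url: string): Promise<{ html: string; url: string }>;
}

// Try the lightweight fetcher first; if it throws for any reason,
// fall back to the heavier one and record that the fallback was used.
async function fetchWithFallback(
  light: PageFetcher,
  heavy: PageFetcher,
  url: string,
): Promise<{ html: string; url: string; usedFallback: boolean }> {
  try {
    const result = await light.fetchHTML(url);
    return { ...result, usedFallback: false };
  } catch {
    const result = await heavy.fetchHTML(url);
    return { ...result, usedFallback: true };
  }
}
```

The point of the pattern is that the expensive browser path is only paid for when the cheap path fails.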
Basic Usage
FetchEngine
import { FetchEngine } from "@purepageio/fetch-engines";
const engine = new FetchEngine();
async function main() {
  try {
    const url = "https://example.com";
    const result = await engine.fetchHTML(url);
    console.log(`Fetched ${result.url} (Status: ${result.statusCode})`);
    console.log(`Title: ${result.title}`);
    // console.log(`HTML: ${result.html.substring(0, 200)}...`);
  } catch (error) {
    console.error("Fetch failed:", error);
  }
}
main();
PlaywrightEngine
import { PlaywrightEngine } from "@purepageio/fetch-engines";
// Configure engine options (optional)
const engine = new PlaywrightEngine({
  maxRetries: 2, // Number of retry attempts
  useHttpFallback: true, // Try simple HTTP fetch first
  cacheTTL: 5 * 60 * 1000, // Cache results for 5 minutes (in milliseconds)
});
async function main() {
  try {
    const url = "https://quotes.toscrape.com/"; // A site that might benefit from JS rendering
    const result = await engine.fetchHTML(url);
    console.log(`Fetched ${result.url} (Status: ${result.statusCode})`);
    console.log(`Title: ${result.title}`);
    // console.log(`HTML: ${result.html.substring(0, 200)}...`);
  } catch (error) {
    console.error("Playwright fetch failed:", error);
  } finally {
    // Important: Clean up browser resources when done
    await engine.cleanup();
  }
}
main();
HybridEngine
import { HybridEngine } from '@purepageio/fetch-engines';
// Configure the underlying PlaywrightEngine (optional)
const engine = new HybridEngine({
  maxRetries: 2, // PlaywrightEngine retry config
  maxBrowsers: 3, // PlaywrightEngine pool config
  // FetchEngine part has no config
});
async function main() {
  try {
    // Try a simple site (likely uses FetchEngine)
    const url1 = 'https://example.com';
    const result1 = await engine.fetchHTML(url1);
    console.log(`Fetched ${result1.url} (Status: ${result1.statusCode}) - Title: ${result1.title}`);
    // Try a complex site (likely falls back to PlaywrightEngine)
    const url2 = 'https://quotes.toscrape.com/';
    const result2 = await engine.fetchHTML(url2);
    console.log(`Fetched ${result2.url} (Status: ${result2.statusCode}) - Title: ${result2.title}`);
  } catch (error) {
    console.error("Hybrid fetch failed:", error);
  } finally {
    // Important: Clean up browser resources (for the Playwright part) when done
    await engine.cleanup();
  }
}
main();
Configuration
Engines accept an optional configuration object in their constructor to customize behavior.
FetchEngine
The FetchEngine currently has no configurable options via its constructor. It uses standard fetch with default browser/Node.js retry/timeout behavior and a fixed set of browser-like headers.
PlaywrightEngine
The PlaywrightEngine accepts a PlaywrightEngineConfig object. See the detailed options below:
General Options:
- concurrentPages (number, default: 3): Maximum number of Playwright pages to process concurrently across all browser instances.
- maxRetries (number, default: 3): Maximum number of retry attempts for a failed Playwright fetch operation (excluding the initial attempt).
- retryDelay (number, default: 5000): Delay in milliseconds between Playwright retry attempts.
- cacheTTL (number, default: 900000 (15 minutes)): Time-to-live for cached results in milliseconds. Set to 0 to disable the in-memory cache. Affects both HTTP fallback and Playwright results.
- useHttpFallback (boolean, default: true): If true, the engine first attempts a simple, fast HTTP GET request. If this fails or appears to receive a challenge/CAPTCHA page, it then proceeds with a full Playwright browser request.
- useHeadedModeFallback (boolean, default: false): If true and a Playwright request fails (potentially due to bot detection), subsequent Playwright requests to that specific domain will automatically use a headed (visible) browser instance.
- defaultFastMode (boolean, default: true): If true, Playwright requests initially run in "fast mode", blocking non-essential resources and skipping human behavior simulation. Can be overridden per request via fetchHTML options.
- simulateHumanBehavior (boolean, default: true): If true and the Playwright request is not in fastMode, the engine attempts basic human-like interactions. Note: This simulation is currently basic.
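To make the retry options concrete, here is a minimal sketch of what "maxRetries excluding the initial attempt" with a fixed retryDelay means in practice. The helper `withRetries` is a hypothetical name for illustration and is not part of the package; how the engine actually schedules retries internally is not documented here.

```typescript
// Run `operation` once, then retry up to `maxRetries` more times,
// waiting `retryDelay` milliseconds between attempts.
async function withRetries<T>(
  operation: () => Promise<T>,
  maxRetries: number,
  retryDelay: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Only wait if another attempt is still to come.
      if (attempt < maxRetries) {
        await new Promise<void>((resolve) => setTimeout(resolve, retryDelay));
      }
    }
  }
  // All attempts failed: surface the last error seen.
  throw lastError;
}
```

Under this reading, maxRetries: 3 means an operation that always fails is attempted four times in total, with three delays in between.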
Browser Pool Options (Passed to internal PlaywrightBrowserPool):
- maxBrowsers (number, default: 2): Maximum number of concurrent browser instances the pool will manage.
- maxPagesPerContext (number, default: 6): Maximum number of pages per browser context before recycling.
- maxBrowserAge (number, default: 1200000 (20 minutes)): Maximum age in milliseconds a browser instance lives before recycling.
- healthCheckInterval (number, default: 60000 (1 minute)): How often (in milliseconds) the pool checks browser health.
- useHeadedMode (boolean, default: false): Forces the entire browser pool to launch browsers in headed (visible) mode.
- poolBlockedDomains (string[], default: [], which uses the pool's internal defaults): List of domain glob patterns to block browser requests to.
- poolBlockedResourceTypes (string[], default: [], which uses the pool's internal defaults): List of Playwright resource types (e.g., image, font) to block.
- proxy (object | undefined, default: undefined): Proxy configuration for browser instances (server, username?, password?).
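As an illustration of how the pool options above might be combined, the following sketch builds a configuration object. The `PoolOptionsSketch` interface is a local stand-in written from the option list, not the package's own PlaywrightEngineConfig type, and the proxy address is a placeholder.

```typescript
// Local stand-in for a subset of the documented options
// (an assumption for this sketch, not the package's exported type).
interface PoolOptionsSketch {
  maxBrowsers?: number;
  maxBrowserAge?: number;
  poolBlockedResourceTypes?: string[];
  proxy?: { server: string; username?: string; password?: string };
}

const poolOptions: PoolOptionsSketch = {
  maxBrowsers: 1, // keep a single browser instance
  maxBrowserAge: 10 * 60 * 1000, // recycle browsers after 10 minutes
  poolBlockedResourceTypes: ["image", "font"], // skip heavy assets
  proxy: { server: "http://proxy.example:8080" }, // placeholder address
};

// With the package installed, the object could be passed to the constructor:
// const engine = new PlaywrightEngine(poolOptions);
```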
HybridEngine
The HybridEngine constructor accepts a single optional argument: playwrightConfig. This object follows the PlaywrightEngineConfig structure described above.
import { HybridEngine } from '@purepageio/fetch-engines';
const engine = new HybridEngine({
  // These options configure the PlaywrightEngine used for fallbacks
  maxRetries: 1,
  maxBrowsers: 1,
  cacheTTL: 0 // Disable caching in the Playwright part
});
The internal FetchEngine used by HybridEngine is not configurable.
Return Value
Both FetchEngine.fetchHTML() and PlaywrightEngine.fetchHTML() return a Promise that resolves to a FetchResult object with the following properties:
- html (string): The full HTML content of the fetched page.
- title (string | null): The extracted <title> tag content, or null if no title is found.
- url (string): The final URL after any redirects.
- isFromCache (boolean): true if the result was served from the engine's cache, false otherwise.
- statusCode (number | undefined): The HTTP status code of the final response. This is typically available for FetchEngine and the HTTP fallback in PlaywrightEngine, but might be undefined for some Playwright navigation scenarios if the primary response wasn't directly captured.
- error (FetchError | Error | undefined): If an error occurred during the final fetch attempt (after retries), this property will contain the error object. It might be a specific FetchError (see Error Handling) or a generic Error.
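The isFromCache flag reflects the in-memory cache controlled by cacheTTL. The sketch below shows one way a time-to-live cache of that kind can work, including the "0 disables caching" rule. `TtlCache` and its injectable clock are hypothetical and written only for illustration; the engine's real cache may be structured differently.

```typescript
// Minimal in-memory cache with a time-to-live.
// `now` is injectable so expiry can be exercised without waiting.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // drop expired entries lazily
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    if (this.ttlMs <= 0) return; // a TTL of 0 disables caching
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

In this model, a result would count as "from cache" whenever get returns a value for the requested URL before its expiry time.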
API Reference
engine.fetchHTML(url, options?)
- url (string): The URL of the page to fetch.
- options (object, optional): Per-request options to override engine defaults.
  - For PlaywrightEngine, you can override fastMode (boolean) to force or disable fast mode for this specific request.
  - (Other per-request options may be added in the future.)
- Returns: Promise<FetchResult>
Fetches the HTML content for the given URL using the engine's configured strategy (plain fetch or Playwright).
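One possible use of the per-request fastMode override is to try a page in fast mode and repeat the request with fast mode disabled when the result looks incomplete. The sketch below assumes only the documented fetchHTML(url, options?) signature. `HtmlEngine`, `FetchResultLike`, and `fetchPreferFast` are hypothetical names, and using a missing title as the "incomplete" signal is an arbitrary choice for illustration.

```typescript
// Minimal shapes assumed from the documented fetchHTML signature and FetchResult.
interface FetchResultLike {
  html: string;
  title: string | null;
  url: string;
}
interface HtmlEngine {
  fetchHTML(url: string, options?: { fastMode?: boolean }): Promise<FetchResultLike>;
}

// Fetch in fast mode first; if the page came back without a title,
// repeat the request with fast mode disabled.
async function fetchPreferFast(engine: HtmlEngine, url: string): Promise<FetchResultLike> {
  const fast = await engine.fetchHTML(url, { fastMode: true });
  if (fast.title !== null) return fast;
  return engine.fetchHTML(url, { fastMode: false });
}
```

Whether the engine's cache would serve the second request is outside the scope of this sketch.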
engine.cleanup() (PlaywrightEngine and HybridEngine)
- Returns: Promise<void>
Gracefully shuts down all browser instances managed by the PlaywrightEngine's browser pool. Always call await engine.cleanup() when you are finished with a PlaywrightEngine instance, or with a HybridEngine as in the usage example above, to release system resources.
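Because a forgotten cleanup() leaves browsers running, it can help to wrap engine usage in a helper that always cleans up. The following is a minimal sketch. `withEngine` is a hypothetical helper, not something the package provides, and it assumes only that the engine exposes the documented cleanup() method.

```typescript
// Run `work` with the engine and always call cleanup() afterwards,
// whether `work` resolves or throws.
async function withEngine<E extends { cleanup(): Promise<void> }, T>(
  engine: E,
  work: (engine: E) => Promise<T>,
): Promise<T> {
  try {
    return await work(engine);
  } finally {
    await engine.cleanup();
  }
}
```

With the package installed, usage would look like `await withEngine(new PlaywrightEngine(), (e) => e.fetchHTML(url))`.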
Stealth / Anti-Detection (PlaywrightEngine)
The PlaywrightEngine automatically integrates playwright-extra and its powerful stealth plugin (puppeteer-extra-plugin-stealth). This plugin applies various techniques to make the headless browser controlled by Playwright appear more like a regular human-operated browser, helping to bypass many common bot detection systems.
There are no manual configuration options for stealth; it is enabled by default when using PlaywrightEngine. The previous options (useStealthMode, randomizeFingerprint, evasionLevel) have been removed.
While effective, be aware that no stealth technique is foolproof, and sophisticated websites may still detect automated browsing.
Error Handling
Errors during fetching are typically thrown as instances of FetchError (or its subclasses like FetchEngineHttpError), providing more context than standard Error objects.
FetchError properties:
- message (string): Description of the error.
- code (string | undefined): A specific error code (e.g., ERR_NAVIGATION_TIMEOUT, ERR_HTTP_ERROR, ERR_NON_HTML_CONTENT).
- originalError (Error | undefined): The underlying error that caused this fetch error (e.g., a Playwright error object).
Common error scenarios include:
- Network issues (DNS resolution failure, connection refused).
- HTTP errors (4xx client errors, 5xx server errors).
- Non-HTML content type received (for FetchEngine).
- Playwright navigation timeouts.
- Proxy connection errors.
- Page crashes within Playwright.
- Errors thrown by the browser pool (e.g., failure to launch browser).
The FetchResult object may also contain an error property if the final fetch attempt failed after all retries.
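Callers can branch on the code property whether the error was thrown or found on FetchResult.error. The sketch below is written against the documented FetchError properties only. `FetchErrorLike`, `hasErrorCode`, and `classifyFetchError` are hypothetical names, and the mapping from codes to actions is an arbitrary example rather than a recommendation from the package.

```typescript
// Shape assumed from the documented FetchError properties.
interface FetchErrorLike {
  message: string;
  code?: string;
  originalError?: Error;
}

// Type guard: true for any object carrying string `message` and `code` fields.
function hasErrorCode(error: unknown): error is FetchErrorLike & { code: string } {
  return (
    typeof error === "object" &&
    error !== null &&
    typeof (error as { code?: unknown }).code === "string" &&
    typeof (error as { message?: unknown }).message === "string"
  );
}

// Map an error to a coarse action using the example codes from the property list.
function classifyFetchError(error: unknown): "retry-later" | "skip" | "unknown" {
  if (!hasErrorCode(error)) return "unknown";
  switch (error.code) {
    case "ERR_NAVIGATION_TIMEOUT":
      return "retry-later";
    case "ERR_HTTP_ERROR":
    case "ERR_NON_HTML_CONTENT":
      return "skip";
    default:
      return "unknown";
  }
}
```

Checking the code field structurally, rather than with instanceof, keeps the helper usable even when the error class itself is not imported.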
Logging
Contributing
Contributions are welcome! Please open an issue or submit a pull request on the GitHub repository.
License
MIT