Graby-TS Site Config
A dynamic site configuration loader for Graby-TS, based on the FiveFilters site patterns format. This library provides standardized, per-site content extraction rules, so that extraction behaves consistently across a wide range of domains.
The site configuration rules are sourced from FiveFilters ftr-site-config, which contains a comprehensive collection of extraction rules for thousands of websites.
Features
- Dynamically loads site-specific extraction rules
- Well-typed with full TypeScript support
- Memory-efficient with on-demand loading and caching
- Compatible with all JavaScript environments
- Supports wildcard domain patterns
- Based on the established FiveFilters site patterns format
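To illustrate the wildcard domain feature listed above, here is a minimal sketch of how a hostname might be expanded into lookup keys. The leading-dot convention for wildcard entries (`.example.com` covering subdomains), the `www.` stripping, and the names `candidateConfigKeys` and `findConfigKey` are all assumptions made for this example; the library's actual matching logic may differ.

```typescript
// Hypothetical helpers; not part of the documented graby-ts-site-config API.

// Expand a hostname into lookup keys, most specific first.
// Assumes wildcard entries are stored with a leading dot (e.g. ".example.com").
function candidateConfigKeys(hostname: string): string[] {
  const host = hostname.toLowerCase().replace(/^www\./, "");
  const keys: string[] = [host];
  const labels = host.split(".");
  // Walk up the domain tree: a.b.example.com -> .b.example.com -> .example.com
  for (let i = 1; i < labels.length - 1; i++) {
    keys.push("." + labels.slice(i).join("."));
  }
  return keys;
}

// Return the first candidate key present in the set of known configs, or null.
function findConfigKey(hostname: string, available: Set<string>): string | null {
  for (const key of candidateConfigKeys(hostname)) {
    if (available.has(key)) {
      return key;
    }
  }
  return null;
}
```

With a known set containing only `.example.com`, a lookup for `news.example.com` would resolve to the wildcard entry in this sketch, while an unrelated host resolves to `null`.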
Installation
npm install graby-ts-site-config
Usage
import { SiteConfigManager } from 'graby-ts-site-config';

// Create a site config manager instance
const configManager = new SiteConfigManager();

async function extractContent(url) {
  // Get the site configuration for this URL
  const { hostname } = new URL(url);
  const config = await configManager.getConfigForHost(hostname);

  // Now use the configuration with your content extractor
  console.log('Using config:', config);

  // Example: checking if this site has specific extraction rules
  if (config.title && config.title.length > 0) {
    console.log('This site has custom title extraction rules');
  }
}

// Preload configs for frequently used sites
configManager.preloadConfigs(['medium.com', 'wikipedia.org']);
API
SiteConfigManager
getConfigForHost(hostname: string): Promise<SiteConfig>
Asynchronously loads and returns the configuration for the given hostname.
const config = await configManager.getConfigForHost('nytimes.com');
hasConfigForHost(hostname: string): boolean
Checks if a configuration exists for the given hostname.
if (configManager.hasConfigForHost('medium.com')) {
  console.log('Medium has custom extraction rules');
}
preloadConfigs(hostnames: string[]): Promise<void>
Preloads configurations for an array of hostnames to improve performance.
await configManager.preloadConfigs(['medium.com', 'wikipedia.org']);
clearCache(): void
Clears the internal configuration cache.
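The combination of on-demand loading, preloading, and cache clearing described in this API section could be implemented along the following lines. This is only a sketch: `LazyConfigCache` and its injected `load` function are hypothetical names, and the real `SiteConfigManager` internals may work differently.

```typescript
// Hypothetical sketch; not the library's actual implementation.
type Loader<T> = (hostname: string) => Promise<T>;

class LazyConfigCache<T> {
  private readonly entries = new Map<string, Promise<T>>();
  private readonly load: Loader<T>;

  constructor(load: Loader<T>) {
    this.load = load;
  }

  // Load on first request; later calls for the same host reuse the stored promise.
  get(hostname: string): Promise<T> {
    const key = hostname.toLowerCase();
    const cached = this.entries.get(key);
    if (cached !== undefined) {
      return cached;
    }
    const pending = this.load(key);
    this.entries.set(key, pending);
    // Forget failed loads so that a later call can retry.
    pending.catch(() => {
      if (this.entries.get(key) === pending) {
        this.entries.delete(key);
      }
    });
    return pending;
  }

  // Warm the cache for several hosts in parallel.
  async preload(hostnames: string[]): Promise<void> {
    await Promise.all(hostnames.map((hostname) => this.get(hostname)));
  }

  // Drop every cached entry; the next get() loads again.
  clear(): void {
    this.entries.clear();
  }
}
```

Storing the promise rather than the resolved value means concurrent requests for the same host share one load, which is one common way to keep such a cache both lazy and cheap.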
configManager.clearCache();
Configuration Fields
The SiteConfig object contains various fields that control how content is extracted from a website:
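The fields documented in this section can be summarized as a single TypeScript interface. The sketch below is transcribed from the field tables in this README; the name `SiteConfigSketch`, the choice to mark every field optional, and the sample values are assumptions, and the package's exported `SiteConfig` type may differ.

```typescript
// Sketch of the SiteConfig shape, transcribed from this README's field tables.
// Making every field optional is an assumption, not something the README states.
interface SiteConfigSketch {
  // Content selection (XPath expressions)
  title?: string[];
  body?: string[];
  date?: string[];
  author?: string[];
  // Content cleaning
  strip?: string[];
  strip_id_or_class?: string[];
  strip_image_src?: string[];
  native_ad_clue?: string[];
  // Processing options
  prune?: boolean;
  autodetect_on_failure?: boolean;
  insert_detected_image?: boolean;
  skip_json_ld?: boolean;
  // Multi-page handling
  single_page_link?: string[];
  single_page_link_in_feed?: string[];
  next_page_link?: string[];
  if_page_contains?: string[];
  // Content enhancement
  find_string?: string[];
  replace_string?: string[];
  wrap_in?: Record<string, string>;
  src_lazy_load_attr?: string[];
  // HTTP options
  http_header?: Record<string, string>;
}

// Illustrative values only; these rules are not taken from a real site config.
const exampleConfig: SiteConfigSketch = {
  title: ["//h1[@class='headline']"],
  body: ["//div[@id='article-body']"],
  strip_id_or_class: ["sidebar"],
  prune: true,
  http_header: { "user-agent": "ExampleBot/1.0" },
};

// Report which fields a config actually defines, in alphabetical order.
function definedFields(config: SiteConfigSketch): string[] {
  return Object.entries(config)
    .filter(([, value]) => value !== undefined)
    .map(([name]) => name)
    .sort();
}
```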
Content Selection (XPath expressions)
| Field | Type | Description |
|---|---|---|
| title | string[] | XPath expressions to extract the page title |
| body | string[] | XPath expressions to extract the article body content |
| date | string[] | XPath expressions to extract the publication date |
| author | string[] | XPath expressions to extract the author(s) information |
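Because each selection field holds a list of XPath expressions, a consumer typically needs a rule for choosing among them. A minimal sketch, assuming the expressions are tried in order and the first non-empty result wins; the `XPathEvaluator` callback is a hypothetical stand-in for whatever XPath engine the extractor uses.

```typescript
// Hypothetical callback: returns the text of the first node matched by the
// expression, or null when nothing matches. Supplied by the caller.
type XPathEvaluator = (expression: string) => string | null;

// Try each expression in order and return the first non-empty, trimmed result.
function firstMatch(expressions: string[], evaluate: XPathEvaluator): string | null {
  for (const expression of expressions) {
    const value = evaluate(expression);
    if (value !== null && value.trim() !== "") {
      return value.trim();
    }
  }
  return null;
}
```

Injecting the evaluator keeps the selection logic independent of any particular DOM or XPath library.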
Content Cleaning
| Field | Type | Description |
|---|---|---|
| strip | string[] | XPath expressions for elements to remove from the content |
| strip_id_or_class | string[] | Element IDs or classes to remove from the content |
| strip_image_src | string[] | Remove images with matching src attributes |
| native_ad_clue | string[] | XPath expressions to identify native advertisements |
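Unlike `strip`, the `strip_id_or_class` values are plain names rather than XPath. One way an extractor might handle them is to translate each name into an XPath expression and then treat it like a `strip` rule. The sketch below assumes whole-token matching against `id` and `class`; the function name is hypothetical and the actual matching semantics in Graby-TS may differ.

```typescript
// Hypothetical helper: turn a strip_id_or_class value into an XPath expression
// that selects elements whose class list or id contains that exact token.
function stripIdOrClassToXPath(token: string): string {
  // Remove quotes so the token cannot break out of the XPath string literal.
  const safe = token.replace(/['"]/g, "");
  return (
    `//*[contains(concat(' ', normalize-space(@class), ' '), ' ${safe} ')` +
    ` or contains(concat(' ', normalize-space(@id), ' '), ' ${safe} ')]`
  );
}
```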
Processing Options
| Field | Type | Default | Description |
|---|---|---|---|
| prune | boolean | true | Clean content from non-essential elements using Readability algorithm |
| autodetect_on_failure | boolean | true | Fall back to auto-detection if the pattern-based extraction fails |
| insert_detected_image | boolean | true | Insert the main image detected from metadata |
| skip_json_ld | boolean | true | Skip extraction from JSON-LD structured data |
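The defaults in the table above could be applied to a partially specified config as follows. This is a sketch only: `ProcessingOptions`, `DEFAULT_PROCESSING_OPTIONS`, and `withDefaults` are names invented for the example, and the default values are copied from the table as printed.

```typescript
// Hypothetical names; default values are copied from the table in this README.
interface ProcessingOptions {
  prune: boolean;
  autodetect_on_failure: boolean;
  insert_detected_image: boolean;
  skip_json_ld: boolean;
}

const DEFAULT_PROCESSING_OPTIONS: ProcessingOptions = {
  prune: true,
  autodetect_on_failure: true,
  insert_detected_image: true,
  skip_json_ld: true,
};

// Fill in any option the site config leaves unset with its default.
function withDefaults(overrides: Partial<ProcessingOptions>): ProcessingOptions {
  const d = DEFAULT_PROCESSING_OPTIONS;
  return {
    prune: overrides.prune ?? d.prune,
    autodetect_on_failure: overrides.autodetect_on_failure ?? d.autodetect_on_failure,
    insert_detected_image: overrides.insert_detected_image ?? d.insert_detected_image,
    skip_json_ld: overrides.skip_json_ld ?? d.skip_json_ld,
  };
}
```

Using `??` rather than object spread means an explicit `false` is respected while a missing or `undefined` value falls back to the default.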
Multi-page Handling
| Field | Type | Description |
|---|---|---|
| single_page_link | string[] | XPath expressions to find the "view as single page" link |
| single_page_link_in_feed | string[] | XPath for single-page links in feed items |
| next_page_link | string[] | XPath expressions to find links to subsequent pages |
| if_page_contains | string[] | XPath expressions for conditional processing of multi-page content |
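Following `next_page_link` results needs some protection against loops and runaway pagination. The sketch below shows one possible guard; the `NextPageResolver` callback (which would evaluate the `next_page_link` expressions against a fetched page) and the `maxPages` limit are assumptions for the example, and a real extractor would do this asynchronously.

```typescript
// Hypothetical callback: given a page URL, return the URL of the next page,
// or null when the page has no next-page link.
type NextPageResolver = (url: string) => string | null;

// Collect page URLs in reading order, stopping on a repeat or after maxPages.
function collectPageUrls(
  startUrl: string,
  resolveNext: NextPageResolver,
  maxPages: number = 10,
): string[] {
  const visited = new Set<string>();
  const pages: string[] = [];
  let current: string | null = startUrl;
  while (current !== null && !visited.has(current) && pages.length < maxPages) {
    visited.add(current);
    pages.push(current);
    current = resolveNext(current);
  }
  return pages;
}
```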
Content Enhancement
| Field | Type | Description |
|---|---|---|
| find_string | string[] | Strings to find and replace in the content |
| replace_string | string[] | Replacement strings (paired with find_string) |
| wrap_in | Record<string, string> | Wrap matching elements with specified tags |
| src_lazy_load_attr | string[] | Image attribute names for lazy-loaded images |
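Since `find_string` and `replace_string` are paired by position, applying them amounts to walking both arrays together. A minimal sketch, assuming plain (non-regex) replacement of every occurrence, applied in order; `applyReplacements` is a hypothetical helper, not a function exported by this package.

```typescript
// Hypothetical helper: apply find_string/replace_string pairs to raw HTML.
// Each find[i] is replaced by replace[i], in order, as a literal string.
function applyReplacements(html: string, find: string[], replace: string[]): string {
  if (find.length !== replace.length) {
    throw new Error("find_string and replace_string must have the same length");
  }
  let result = html;
  for (let i = 0; i < find.length; i++) {
    const needle = find[i] ?? "";
    const replacement = replace[i] ?? "";
    if (needle === "") {
      continue; // An empty search string would match everywhere; skip it.
    }
    // split/join replaces every literal occurrence without regex escaping.
    result = result.split(needle).join(replacement);
  }
  return result;
}
```

Because pairs are applied sequentially, a later pair can match text produced by an earlier one, so the order of entries matters.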
HTTP Options
| Field | Type | Description |
|---|---|---|
| http_header | Record<string, string> | Additional HTTP headers to send with requests |
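When sending a request, the site's `http_header` entries would typically be layered over the client's own defaults. The sketch below shows one way to do that; `mergeHeaders` is a hypothetical helper, and lowercasing header names so that site values override defaults regardless of case is a design choice made for this example.

```typescript
// Hypothetical helper: combine default request headers with a site's http_header map.
// Names are lowercased so "User-Agent" and "user-agent" refer to the same header.
function mergeHeaders(
  defaults: Record<string, string>,
  siteHeaders: Record<string, string> = {},
): Record<string, string> {
  const merged: Record<string, string> = {};
  for (const [name, value] of Object.entries(defaults)) {
    merged[name.toLowerCase()] = value;
  }
  // Site-specific headers are applied last so they take precedence.
  for (const [name, value] of Object.entries(siteHeaders)) {
    merged[name.toLowerCase()] = value;
  }
  return merged;
}
```

The resulting plain object can be passed as the `headers` option of most HTTP clients.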
License
MIT