JSPM

graby-ts-site-config

1.1.1
    • ESM via JSPM
    • ES Module Entrypoint
    • Export Map
    • Keywords
    • License
    • Repository URL
    • TypeScript Types
    • README
    • Created
    • Published
    • Downloads 3
    • Score
      100M100P100Q37037F
    • License MIT

    Site configuration loader for Graby-TS with dynamic imports

    Package Exports

    • graby-ts-site-config
    • graby-ts-site-config/dist/index.js

    This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (graby-ts-site-config) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

    Readme

    Graby-TS Site Config

    A dynamic site configuration loader for Graby-TS based on FiveFilters site patterns format. This library provides standardized content extraction rules for different websites, allowing for consistent extraction across a wide range of domains.

    The site configuration rules are sourced from FiveFilters ftr-site-config, which contains a comprehensive collection of extraction rules for thousands of websites.

    Features

    • Dynamically loads site-specific extraction rules
    • Well-typed with full TypeScript support
    • Memory-efficient with on-demand loading and caching
    • Compatible with all JavaScript environments
    • Supports wildcard domain patterns
    • Based on the established FiveFilters site patterns format

    Installation

    npm install graby-ts-site-config

    Usage

    import { SiteConfigManager } from 'graby-ts-site-config';
    
    // Create a site config manager instance
    const configManager = new SiteConfigManager();
    
    async function extractContent(url) {
      // Get the site configuration for this URL
      const { hostname } = new URL(url);
      const config = await configManager.getConfigForHost(hostname);
      
      // Now use the configuration with your content extractor
      console.log('Using config:', config);
      
      // Example: checking if this site has specific extraction rules
      if (config.title && config.title.length > 0) {
        console.log('This site has custom title extraction rules');
      }
    }
    
    // Preload configs for frequently used sites
    configManager.preloadConfigs(['medium.com', 'wikipedia.org']);

    API

    SiteConfigManager

    getConfigForHost(hostname: string): Promise<SiteConfig>

    Asynchronously loads and returns the configuration for the given hostname.

    const config = await configManager.getConfigForHost('nytimes.com');

    hasConfigForHost(hostname: string): boolean

    Checks if a configuration exists for the given hostname.

    if (configManager.hasConfigForHost('medium.com')) {
      console.log('Medium has custom extraction rules');
    }

    preloadConfigs(hostnames: string[]): Promise<void>

    Preloads configurations for an array of hostnames to improve performance.

    await configManager.preloadConfigs(['medium.com', 'wikipedia.org']);

    clearCache(): void

    Clears the internal configuration cache.

    configManager.clearCache();

    Configuration Fields

    The SiteConfig object contains various fields that control how content is extracted from a website:

    Content Selection (XPath expressions)

    Field Type Description
    title string[] XPath expressions to extract the page title
    body string[] XPath expressions to extract the article body content
    date string[] XPath expressions to extract the publication date
    author string[] XPath expressions to extract the author(s) information

    Content Cleaning

    Field Type Description
    strip string[] XPath expressions for elements to remove from the content
    strip_id_or_class string[] Element IDs or classes to remove from the content
    strip_image_src string[] Remove images with matching src attributes
    native_ad_clue string[] XPath expressions to identify native advertisements

    Processing Options

    Field Type Default Description
    prune boolean true Clean content from non-essential elements using Readability algorithm
    autodetect_on_failure boolean true Fall back to auto-detection if the pattern-based extraction fails
    insert_detected_image boolean true Insert the main image detected from metadata
    skip_json_ld boolean true Skip extraction from JSON-LD structured data

    Multi-page Handling

    Field Type Description
    single_page_link string[] XPath expressions to find the "view as single page" link
    single_page_link_in_feed string[] XPath for single-page links in feed items
    next_page_link string[] XPath expressions to find links to subsequent pages
    if_page_contains string[] XPath expressions for conditional processing of multi-page content

    Content Enhancement

    Field Type Description
    find_string string[] Strings to find and replace in the content
    replace_string string[] Replacement strings (paired with find_string)
    wrap_in Record<string, string> Wrap matching elements with specified tags
    src_lazy_load_attr string[] Image attribute names for lazy-loaded images

    HTTP Options

    Field Type Description
    http_header Record<string, string> Additional HTTP headers to send with requests

    License

    MIT