🕷️ xscrape
Extract and transform HTML with your own schema, powered by Standard Schema compatibility.
by @johnie
Overview
xscrape is a powerful HTML scraping library that combines the flexibility of query selectors with the safety of schema validation. It works with any validation library that implements the Standard Schema specification, including Zod, Valibot, ArkType, and Effect Schema.
Features
- HTML Parsing: Extract data from HTML using query selectors powered by cheerio
- Universal Schema Support: Works with any Standard Schema compatible library
- Type Safety: Full TypeScript support with inferred types from your schemas
- Flexible Extraction: Support for nested objects, arrays, and custom transformation functions
- Error Handling: Comprehensive error handling with detailed validation feedback
- Custom Transformations: Apply post-processing transformations to validated data
- Default Values: Handle missing data gracefully through schema defaults
Installation
Install xscrape with your preferred package manager:
npm install xscrape
# or
pnpm add xscrape
# or
bun add xscrape
Quick Start
import { defineScraper } from 'xscrape';
import { z } from 'zod';
// Define your schema
const schema = z.object({
  title: z.string(),
  description: z.string(),
  keywords: z.array(z.string()),
  views: z.coerce.number(),
});
// Create a scraper
const scraper = defineScraper({
  schema,
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    keywords: {
      selector: 'meta[name="keywords"]',
      value: (el) => el.attribs['content']?.split(',') || [],
    },
    views: { selector: 'meta[name="views"]', value: 'content' },
  },
});
// Use the scraper
const { data, error } = await scraper(htmlString);
Usage Examples
Basic Extraction
Extract basic metadata from an HTML page:
import { defineScraper } from 'xscrape';
import { z } from 'zod';
const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    description: z.string(),
    author: z.string(),
  }),
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    author: { selector: 'meta[name="author"]', value: 'content' },
  },
});
const html = `
<!DOCTYPE html>
<html>
<head>
  <title>My Blog Post</title>
  <meta name="description" content="An interesting blog post">
  <meta name="author" content="John Doe">
</head>
<body>...</body>
</html>
`;
const { data, error } = await scraper(html);
// data: { title: "My Blog Post", description: "An interesting blog post", author: "John Doe" }Handling Missing Data
Use schema defaults to handle missing data gracefully:
const scraper = defineScraper({
  schema: z.object({
    title: z.string().default('Untitled'),
    description: z.string().default('No description available'),
    publishedAt: z.string().optional(),
    views: z.coerce.number().default(0),
  }),
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    publishedAt: { selector: 'meta[name="published"]', value: 'content' },
    views: { selector: 'meta[name="views"]', value: 'content' },
  },
});
// Even with incomplete HTML, you get sensible defaults
const { data } = await scraper('<html><head><title>Test</title></head></html>');
// data: { title: "Test", description: "No description available", views: 0 }Extracting Arrays
Extract multiple elements as arrays:
const scraper = defineScraper({
  schema: z.object({
    links: z.array(z.string()),
    headings: z.array(z.string()),
  }),
  extract: {
    links: [{ selector: 'a', value: 'href' }],
    headings: [{ selector: 'h1, h2, h3' }],
  },
});
const html = `
<html>
<body>
  <h1>Main Title</h1>
  <h2>Subtitle</h2>
  <a href="/page1">Link 1</a>
  <a href="/page2">Link 2</a>
</body>
</html>
`;
const { data } = await scraper(html);
// data: {
//   links: ["/page1", "/page2"],
//   headings: ["Main Title", "Subtitle"]
// }
Nested Objects
Extract complex nested data structures:
const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    socialMedia: z.object({
      image: z.string().url(),
      width: z.coerce.number(),
      height: z.coerce.number(),
      type: z.string(),
    }),
  }),
  extract: {
    title: { selector: 'title' },
    socialMedia: {
      selector: 'head',
      value: {
        image: { selector: 'meta[property="og:image"]', value: 'content' },
        width: { selector: 'meta[property="og:image:width"]', value: 'content' },
        height: { selector: 'meta[property="og:image:height"]', value: 'content' },
        type: { selector: 'meta[property="og:type"]', value: 'content' },
      },
    },
  },
});
Custom Value Transformations
Apply custom logic to extracted values:
const scraper = defineScraper({
  schema: z.object({
    tags: z.array(z.string()),
    publishedDate: z.date(),
    readingTime: z.number(),
  }),
  extract: {
    tags: {
      selector: 'meta[name="keywords"]',
      value: (el) => el.attribs['content']?.split(',').map(tag => tag.trim()) || [],
    },
    publishedDate: {
      selector: 'meta[name="published"]',
      value: (el) => new Date(el.attribs['content']),
    },
    readingTime: {
      selector: 'article',
      value: (el) => {
        const text = el.text();
        const wordsPerMinute = 200;
        const wordCount = text.split(/\s+/).length;
        return Math.ceil(wordCount / wordsPerMinute);
      },
    },
  },
});
Post-Processing with Transform
Apply transformations to the validated data:
const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    description: z.string(),
    tags: z.array(z.string()),
  }),
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    tags: {
      selector: 'meta[name="keywords"]',
      value: (el) => el.attribs['content']?.split(',') || [],
    },
  },
  transform: (data) => ({
    ...data,
    slug: data.title.toLowerCase().replace(/\s+/g, '-'),
    tagCount: data.tags.length,
    summary: data.description.substring(0, 100) + '...',
  }),
});
Schema Library Examples
Zod
import { z } from 'zod';
const schema = z.object({
  title: z.string(),
  price: z.coerce.number(),
  inStock: z.boolean().default(false),
});
Valibot
import * as v from 'valibot';
const schema = v.object({
  title: v.string(),
  price: v.pipe(v.string(), v.transform(Number)),
  inStock: v.optional(v.boolean(), false),
});
ArkType
import { type } from 'arktype';
const schema = type({
  title: 'string',
  price: 'number',
  inStock: 'boolean = false',
});
Effect Schema
import { Schema } from 'effect';
const schema = Schema.Struct({
  title: Schema.String,
  price: Schema.NumberFromString,
  inStock: Schema.optionalWith(Schema.Boolean, { default: () => false }),
});
API Reference
defineScraper(config)
Creates a scraper function with the specified configuration.
Parameters
- config.schema: A Standard Schema compatible schema object
- config.extract: Extraction configuration object
- config.transform?: Optional post-processing function
Returns
A scraper function that takes an HTML string and returns a Promise<{ data?: T, error?: unknown }>.
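For reference, a minimal end-to-end call tying the parameters and return shape together (transform is optional and omitted in this sketch):
import { defineScraper } from 'xscrape';
import { z } from 'zod';

const scraper = defineScraper({
  schema: z.object({ title: z.string() }),
  extract: { title: { selector: 'title' } },
  // transform is optional and omitted here
});

// data is typed from the schema ({ title: string }) when validation succeeds
const { data, error } = await scraper('<html><head><title>Hi</title></head></html>');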
Extraction Configuration
The extract object defines how to extract data from HTML:
type ExtractConfig = {
  [key: string]: ExtractDescriptor | [ExtractDescriptor];
};
type ExtractDescriptor = {
  selector: string;
  value?: string | ((el: Element) => any) | ExtractConfig;
};
Properties
- selector: CSS selector to find elements
- value: How to extract the value (see the sketch below):
  - string: an attribute name (e.g., 'href', 'content')
  - function: a custom extraction function
  - object: a nested extraction configuration
  - undefined: extract the element's text content
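As an illustrative sketch (the field names and selectors below are made up, not part of the API), a single extract configuration can mix all four value forms:
import { defineScraper } from 'xscrape';
import { z } from 'zod';

const scraper = defineScraper({
  schema: z.object({
    heading: z.string(),
    canonical: z.string(),
    wordCount: z.number(),
    social: z.object({ title: z.string() }),
  }),
  extract: {
    // undefined value: extract the element's text content
    heading: { selector: 'h1' },
    // string value: read an attribute
    canonical: { selector: 'link[rel="canonical"]', value: 'href' },
    // function value: compute a value from the element
    wordCount: { selector: 'article', value: (el) => el.text().split(/\s+/).length },
    // object value: nested extraction configuration
    social: {
      selector: 'head',
      value: {
        title: { selector: 'meta[property="og:title"]', value: 'content' },
      },
    },
  },
});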
Array Extraction
Wrap the descriptor in an array to extract multiple elements:
{
  links: [{ selector: 'a', value: 'href' }]
}
Error Handling
xscrape provides comprehensive error handling:
const { data, error } = await scraper(html);
if (error) {
  // Handle validation errors, extraction errors, or transform errors
  console.error('Scraping failed:', error);
} else {
  // Use the validated data
  console.log('Extracted data:', data);
}
Best Practices
- Use Specific Selectors: Be as specific as possible with CSS selectors to avoid unexpected matches
- Handle Missing Data: Use schema defaults or optional fields for data that might not be present
- Validate URLs: Use URL validation in your schema for href attributes (see the sketch after this list)
- Transform Data Early: Use custom value functions rather than post-processing when possible
- Type Safety: Let TypeScript infer types from your schema for better developer experience
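A minimal sketch combining a few of these guidelines, using a specific selector, a URL check, and a schema default for missing data (the fields shown are illustrative):
import { defineScraper } from 'xscrape';
import { z } from 'zod';

const scraper = defineScraper({
  schema: z.object({
    // URL validation catches malformed or relative href values
    canonicalUrl: z.string().url(),
    // A default covers pages without a description meta tag
    description: z.string().default('No description available'),
  }),
  extract: {
    // Specific selector: only the canonical link, not every <link> tag
    canonicalUrl: { selector: 'link[rel="canonical"]', value: 'href' },
    description: { selector: 'meta[name="description"]', value: 'content' },
  },
});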
Common Use Cases
- Web Scraping: Extract structured data from websites
- Meta Tag Extraction: Get social media and SEO metadata
- Content Migration: Transform HTML content to structured data
- Testing: Validate HTML structure in tests
- RSS/Feed Processing: Extract article data from HTML feeds
Performance Considerations
- xscrape uses cheerio for fast HTML parsing
- Schema validation is performed once after extraction
- Consider using streaming for large HTML documents
- Cache scrapers when processing many similar documents (see the sketch below)
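The caching guideline simply means defining a scraper once and reusing it for every document instead of calling defineScraper per page. The URLs and fetch loop below are illustrative only:
import { defineScraper } from 'xscrape';
import { z } from 'zod';

// Defined once and reused for every document
const titleScraper = defineScraper({
  schema: z.object({ title: z.string() }),
  extract: { title: { selector: 'title' } },
});

// Hypothetical list of pages to process
const pages = ['https://example.com/a', 'https://example.com/b'];

for (const url of pages) {
  const html = await fetch(url).then((res) => res.text());
  const { data } = await titleScraper(html);
  if (data) console.log(url, data.title);
}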
Contributing
We welcome contributions! Please see our Contributing Guide for details.
License
MIT License. See the LICENSE file for details.
Related Projects
- cheerio - jQuery-like server-side HTML parsing
- Standard Schema - Universal schema specification
- Zod - TypeScript-first schema validation
- Valibot - Modular and type-safe schema library
- Effect - Maximum Type-safety (incl. error handling)
- ArkType - TypeScript's 1:1 validator, optimized from editor to runtime