🕷️ xscrape
Extract and transform HTML with your own schema, powered by Standard Schema compatibility.
by @johnie
Overview
xscrape is a powerful HTML scraping library that combines the flexibility of query selectors with the safety of schema validation. It works with any validation library that implements the Standard Schema specification, including Zod, Valibot, ArkType, and Effect Schema.
Features
- HTML Parsing: Extract data from HTML using query selectors powered by cheerio
- Universal Schema Support: Works with any Standard Schema compatible library
- Type Safety: Full TypeScript support with inferred types from your schemas
- Flexible Extraction: Support for nested objects, arrays, and custom transformation functions
- Error Handling: Comprehensive error handling with detailed validation feedback
- Custom Transformations: Apply post-processing transformations to validated data
- Default Values: Handle missing data gracefully through schema defaults
Installation
Install xscrape with your preferred package manager:
npm install xscrape
# or
pnpm add xscrape
# or
bun add xscrape
Quick Start
import { defineScraper } from 'xscrape';
import { z } from 'zod';
// Define your schema
const schema = z.object({
  title: z.string(),
  description: z.string(),
  keywords: z.array(z.string()),
  views: z.coerce.number(),
});
// Create a scraper
const scraper = defineScraper({
  schema,
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    keywords: {
      selector: 'meta[name="keywords"]',
      value: (el) => el.attribs['content']?.split(',') || [],
    },
    views: { selector: 'meta[name="views"]', value: 'content' },
  },
});
// Use the scraper
const { data, error } = await scraper(htmlString);
Usage Examples
Basic Extraction
Extract basic metadata from an HTML page:
import { defineScraper } from 'xscrape';
import { z } from 'zod';
const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    description: z.string(),
    author: z.string(),
  }),
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    author: { selector: 'meta[name="author"]', value: 'content' },
  },
});
const html = `
<!DOCTYPE html>
<html>
<head>
  <title>My Blog Post</title>
  <meta name="description" content="An interesting blog post">
  <meta name="author" content="John Doe">
</head>
<body>...</body>
</html>
`;
const { data, error } = await scraper(html);
// data: { title: "My Blog Post", description: "An interesting blog post", author: "John Doe" }Handling Missing Data
Use schema defaults to handle missing data gracefully:
const scraper = defineScraper({
  schema: z.object({
    title: z.string().default('Untitled'),
    description: z.string().default('No description available'),
    publishedAt: z.string().optional(),
    views: z.coerce.number().default(0),
  }),
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    publishedAt: { selector: 'meta[name="published"]', value: 'content' },
    views: { selector: 'meta[name="views"]', value: 'content' },
  },
});
// Even with incomplete HTML, you get sensible defaults
const { data } = await scraper('<html><head><title>Test</title></head></html>');
// data: { title: "Test", description: "No description available", views: 0 }Extracting Arrays
Extract multiple elements as arrays:
const scraper = defineScraper({
  schema: z.object({
    links: z.array(z.string()),
    headings: z.array(z.string()),
  }),
  extract: {
    links: [{ selector: 'a', value: 'href' }],
    headings: [{ selector: 'h1, h2, h3' }],
  },
});
const html = `
<html>
<body>
  <h1>Main Title</h1>
  <h2>Subtitle</h2>
  <a href="/page1">Link 1</a>
  <a href="/page2">Link 2</a>
</body>
</html>
`;
const { data } = await scraper(html);
// data: {
//   links: ["/page1", "/page2"],
//   headings: ["Main Title", "Subtitle"]
// }
Nested Objects
Extract complex nested data structures:
const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    socialMedia: z.object({
      image: z.string().url(),
      width: z.coerce.number(),
      height: z.coerce.number(),
      type: z.string(),
    }),
  }),
  extract: {
    title: { selector: 'title' },
    socialMedia: {
      selector: 'head',
      value: {
        image: { selector: 'meta[property="og:image"]', value: 'content' },
        width: { selector: 'meta[property="og:image:width"]', value: 'content' },
        height: { selector: 'meta[property="og:image:height"]', value: 'content' },
        type: { selector: 'meta[property="og:type"]', value: 'content' },
      },
    },
  },
});
Custom Value Transformations
Apply custom logic to extracted values:
const scraper = defineScraper({
  schema: z.object({
    tags: z.array(z.string()),
    publishedDate: z.date(),
    readingTime: z.number(),
  }),
  extract: {
    tags: {
      selector: 'meta[name="keywords"]',
      value: (el) => el.attribs['content']?.split(',').map(tag => tag.trim()) || [],
    },
    publishedDate: {
      selector: 'meta[name="published"]',
      value: (el) => new Date(el.attribs['content']),
    },
    readingTime: {
      selector: 'article',
      value: (el) => {
        const text = el.text();
        const wordsPerMinute = 200;
        const wordCount = text.split(/\s+/).length;
        return Math.ceil(wordCount / wordsPerMinute);
      },
    },
  },
});
Post-Processing with Transform
Apply transformations to the validated data:
const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    description: z.string(),
    tags: z.array(z.string()),
  }),
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    tags: {
      selector: 'meta[name="keywords"]',
      value: (el) => el.attribs['content']?.split(',') || [],
    },
  },
  transform: (data) => ({
    ...data,
    slug: data.title.toLowerCase().replace(/\s+/g, '-'),
    tagCount: data.tags.length,
    summary: data.description.substring(0, 100) + '...',
  }),
});
Schema Library Examples
Zod
import { z } from 'zod';
const schema = z.object({
  title: z.string(),
  price: z.coerce.number(),
  inStock: z.boolean().default(false),
});
Valibot
import * as v from 'valibot';
const schema = v.object({
  title: v.string(),
  price: v.pipe(v.string(), v.transform(Number)),
  inStock: v.optional(v.boolean(), false),
});
ArkType
import { type } from 'arktype';
const schema = type({
  title: 'string',
  price: 'number',
  inStock: 'boolean = false',
});
Effect Schema
import { Schema } from 'effect';
const schema = Schema.Struct({
  title: Schema.String,
  price: Schema.NumberFromString,
  inStock: Schema.optionalWith(Schema.Boolean, { default: () => false }),
});
API Reference
defineScraper(config)
Creates a scraper function with the specified configuration.
Parameters
- config.schema: A Standard Schema compatible schema object
- config.extract: Extraction configuration object
- config.transform?: Optional post-processing function
Returns
A scraper function that takes an HTML string and returns a Promise<{ data?: T, error?: unknown }>.
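For reference, a minimal end-to-end call tying the parameters and return shape together (transform is optional and omitted in this sketch):
import { defineScraper } from 'xscrape';
import { z } from 'zod';

const scraper = defineScraper({
  schema: z.object({ title: z.string() }),
  extract: { title: { selector: 'title' } },
  // transform is optional and omitted here
});

// data is typed from the schema ({ title: string }) when validation succeeds
const { data, error } = await scraper('<html><head><title>Hi</title></head></html>');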
Extraction Configuration
The extract object defines how to extract data from HTML:
type ExtractConfig = {
  [key: string]: ExtractDescriptor | [ExtractDescriptor];
};
type ExtractDescriptor = {
  selector: string;
  value?: string | ((el: Element) => any) | ExtractConfig;
};
Properties
- selector: CSS selector to find elements
- value: How to extract the value (see the sketch below):
  - string: an attribute name (e.g., 'href', 'content')
  - function: a custom extraction function
  - object: a nested extraction configuration
  - undefined: extract the element's text content
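As an illustrative sketch (the field names and selectors below are made up, not part of the API), a single extract configuration can mix all four value forms:
import { defineScraper } from 'xscrape';
import { z } from 'zod';

const scraper = defineScraper({
  schema: z.object({
    heading: z.string(),
    canonical: z.string(),
    wordCount: z.number(),
    social: z.object({ title: z.string() }),
  }),
  extract: {
    // undefined value: extract the element's text content
    heading: { selector: 'h1' },
    // string value: read an attribute
    canonical: { selector: 'link[rel="canonical"]', value: 'href' },
    // function value: compute a value from the element
    wordCount: { selector: 'article', value: (el) => el.text().split(/\s+/).length },
    // object value: nested extraction configuration
    social: {
      selector: 'head',
      value: {
        title: { selector: 'meta[property="og:title"]', value: 'content' },
      },
    },
  },
});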
Array Extraction
Wrap the descriptor in an array to extract multiple elements:
{
  links: [{ selector: 'a', value: 'href' }]
}
Error Handling
xscrape provides comprehensive error handling:
const { data, error } = await scraper(html);
if (error) {
  // Handle validation errors, extraction errors, or transform errors
  console.error('Scraping failed:', error);
} else {
  // Use the validated data
  console.log('Extracted data:', data);
}
Best Practices
- Use Specific Selectors: Be as specific as possible with CSS selectors to avoid unexpected matches
- Handle Missing Data: Use schema defaults or optional fields for data that might not be present
- Validate URLs: Use URL validation in your schema for href attributes (see the sketch after this list)
- Transform Data Early: Use custom value functions rather than post-processing when possible
- Type Safety: Let TypeScript infer types from your schema for better developer experience
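A minimal sketch combining a few of these guidelines, using a specific selector, a URL check, and a schema default for missing data (the fields shown are illustrative):
import { defineScraper } from 'xscrape';
import { z } from 'zod';

const scraper = defineScraper({
  schema: z.object({
    // URL validation catches malformed or relative href values
    canonicalUrl: z.string().url(),
    // A default covers pages without a description meta tag
    description: z.string().default('No description available'),
  }),
  extract: {
    // Specific selector: only the canonical link, not every <link> tag
    canonicalUrl: { selector: 'link[rel="canonical"]', value: 'href' },
    description: { selector: 'meta[name="description"]', value: 'content' },
  },
});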
Common Use Cases
- Web Scraping: Extract structured data from websites
- Meta Tag Extraction: Get social media and SEO metadata
- Content Migration: Transform HTML content to structured data
- Testing: Validate HTML structure in tests
- RSS/Feed Processing: Extract article data from HTML feeds
Performance Considerations
- xscrape uses cheerio for fast HTML parsing
- Schema validation is performed once after extraction
- Consider using streaming for large HTML documents
- Cache scrapers when processing many similar documents (see the sketch below)
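The caching guideline simply means defining a scraper once and reusing it for every document instead of calling defineScraper per page. The URLs and fetch loop below are illustrative only:
import { defineScraper } from 'xscrape';
import { z } from 'zod';

// Defined once and reused for every document
const titleScraper = defineScraper({
  schema: z.object({ title: z.string() }),
  extract: { title: { selector: 'title' } },
});

// Hypothetical list of pages to process
const pages = ['https://example.com/a', 'https://example.com/b'];

for (const url of pages) {
  const html = await fetch(url).then((res) => res.text());
  const { data } = await titleScraper(html);
  if (data) console.log(url, data.title);
}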
Contributing
We welcome contributions! Please see our Contributing Guide for details.
License
MIT License. See the LICENSE file for details.
Related Projects
- cheerio - jQuery-like server-side HTML parsing
- Standard Schema - Universal schema specification
- Zod - TypeScript-first schema validation
- Valibot - Modular and type-safe schema library
- Effect - Maximum Type-safety (incl. error handling)
- ArkType - TypeScript's 1:1 validator, optimized from editor to runtime