JSPM

magnitude-extract

0.0.2
    • Downloads 2412
    • License MIT

    TypeScript DOM cleaning and structuring library

    Package Exports

    • magnitude-extract

    Readme

    unstructured-ts

    A TypeScript library for cleaning and structuring DOM content, inspired by Unstructured. Built with Cheerio for fast, server-side HTML processing.

    Features

    • 🧹 DOM Cleaning: Remove scripts, styles, navigation, and other unwanted elements
    • 🏗️ Semantic Structure: Classify elements as titles, paragraphs, lists, tables, etc.
    • 📊 Table Extraction: Extract tables with headers and structured data
    • 🖼️ Image Handling: Extract images with metadata and alt text
    • ⚡ Fast Processing: Built on Cheerio for efficient server-side HTML parsing
    • 🎯 Configurable: Flexible options for different use cases
    • 📝 TypeScript: Full type safety and excellent IDE support

    Installation

    npm install unstructured-ts

    Quick Start

    import { partitionHtml } from 'unstructured-ts';
    
    const html = `
    <html>
      <body>
        <nav>Skip this navigation</nav>
        <h1>Main Title</h1>
        <p>This is a paragraph with some content.</p>
        <ul>
          <li>First item</li>
          <li>Second item</li>
        </ul>
        <table>
          <tr><th>Name</th><th>Age</th></tr>
          <tr><td>John</td><td>30</td></tr>
        </table>
      </body>
    </html>
    `;
    
    const result = partitionHtml(html);
    
    console.log(result.elements);
    // [
    //   { type: 'Title', text: 'Main Title', ... },
    //   { type: 'NarrativeText', text: 'This is a paragraph with some content.', ... },
    //   { type: 'ListItem', text: 'First item', ... },
    //   { type: 'ListItem', text: 'Second item', ... },
    //   { type: 'Table', text: 'Name | Age\n--- | ---\nJohn | 30', rows: [['John', '30']], headers: ['Name', 'Age'], ... }
    // ]
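
    Once the elements are available, a common next step is to flatten them into readable text. The sketch below is not part of the library: `OutlineElement` is a local stand-in that copies only the `type` and `text` fields shown in the output above, and `toOutline` is a hypothetical helper name.

```typescript
// Local stand-in for the element shape shown in the Quick Start output.
// Only `type` and `text` are taken from the README; other fields are omitted.
type OutlineElement = { type: string; text: string };

// Hypothetical helper (not a library export): format elements as a plain-text outline.
function toOutline(elements: OutlineElement[]): string {
  return elements
    .map((el) => {
      switch (el.type) {
        case "Title":
          return `# ${el.text}`; // headings get a leading marker
        case "ListItem":
          return `- ${el.text}`; // list items get a bullet
        default:
          return el.text; // everything else is emitted as-is
      }
    })
    .join("\n");
}

// Sample data copied from the Quick Start output (Table element left out).
const sampleElements: OutlineElement[] = [
  { type: "Title", text: "Main Title" },
  { type: "NarrativeText", text: "This is a paragraph with some content." },
  { type: "ListItem", text: "First item" },
  { type: "ListItem", text: "Second item" },
];

console.log(toOutline(sampleElements));
```

    In real use, `result.elements` from `partitionHtml` would presumably be passed in place of `sampleElements`.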

    Advanced Usage

    Custom Options

    import { DOMPartitioner } from 'unstructured-ts';
    
    const partitioner = new DOMPartitioner({
      skipNavigation: true,      // Remove navigation elements
      skipHeaders: false,        // Keep header elements
      skipFooters: true,         // Remove footer elements
      skipForms: true,           // Remove form elements
      minTextLength: 15,         // Minimum text length to include
      extractTables: true,       // Extract table structure
      extractImages: true,       // Extract image elements
      includeImageAlt: true,     // Include alt text in image elements
      includeOriginalHtml: false // Include original HTML in metadata
    });
    
    const result = partitioner.partition(html);

    Working with Elements

    import { ElementType, partitionHtml } from 'unstructured-ts';
    
    const result = partitionHtml(html);
    
    // Filter by element type
    const titles = result.elements.filter(el => el.type === ElementType.TITLE);
    const paragraphs = result.elements.filter(el => el.type === ElementType.NARRATIVE_TEXT);
    const tables = result.elements.filter(el => el.type === ElementType.TABLE);
    
    // Access table data
    tables.forEach(table => {
      if (table.type === ElementType.TABLE) {
        console.log('Headers:', table.headers);
        console.log('Rows:', table.rows);
      }
    });
    
    // Access metadata
    result.elements.forEach(element => {
      console.log(`${element.type}: ${element.text}`);
      console.log('Metadata:', element.metadata);
    });
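
    Filtering with a plain predicate leaves the array typed as the base element, which is why the example above re-checks `table.type` inside the loop. A type guard lets `filter` do the narrowing instead. The sketch below assumes local stand-in interfaces (`BaseEl`, `TableEl`) that mirror the documented shapes, with string literals in place of the `ElementType` enum so that it runs without the package; `isTableEl` and `tableToRecords` are hypothetical helpers, not library exports.

```typescript
// Local stand-ins mirroring the README's Element and TableElement shapes.
interface BaseEl {
  id: string;
  type: string;
  text: string;
}

interface TableEl extends BaseEl {
  type: "Table";
  rows: string[][];
  headers?: string[];
}

// Hypothetical type guard: narrows a base element to a table element.
function isTableEl(el: BaseEl): el is TableEl {
  return el.type === "Table" && Array.isArray((el as TableEl).rows);
}

// Hypothetical helper: turn each row into an object keyed by header.
// When headers are missing, fall back to col0, col1, ...
function tableToRecords(table: TableEl): Record<string, string>[] {
  const width = table.rows[0]?.length ?? 0;
  const headers =
    table.headers ?? Array.from({ length: width }, (_, i) => `col${i}`);
  return table.rows.map((row) => {
    const record: Record<string, string> = {};
    headers.forEach((header, i) => {
      record[header] = row[i] ?? "";
    });
    return record;
  });
}

const sampleEls: BaseEl[] = [
  { id: "1", type: "Title", text: "Main Title" },
  {
    id: "2",
    type: "Table",
    text: "Name | Age",
    rows: [["John", "30"]],
    headers: ["Name", "Age"],
  } as TableEl,
];

// filter() with a type guard yields TableEl[], so no second check is needed.
const tableRecords = sampleEls.filter(isTableEl).map(tableToRecords);
console.log(JSON.stringify(tableRecords));
```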

    Element Types

    The library classifies DOM elements into semantic types:

    • Title: Headings (h1-h6) and title-like content
    • NarrativeText: Paragraphs and article content
    • ListItem: List items and bullet points
    • Text: Generic text content
    • Table: Structured tabular data
    • Image: Images with metadata
    • Header/Footer: Page headers and footers
    • Navigation: Navigation menus and links
    • Form: Form elements and inputs
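
    To make the mapping above concrete, here is a deliberately simplified, tag-name-only sketch. It is an illustration and not the library's classifier, which may use other signals (the list mentions "title-like content", for example). The label strings follow the names in the list; `TypeLabel`, `TAG_TO_LABEL`, and `labelForTag` are hypothetical names.

```typescript
// Labels taken from the list above; "Header/Footer" is split into two labels here.
type TypeLabel =
  | "Title"
  | "NarrativeText"
  | "ListItem"
  | "Text"
  | "Table"
  | "Image"
  | "Header"
  | "Footer"
  | "Navigation"
  | "Form";

// Assumed tag-to-label mapping, for illustration only.
const TAG_TO_LABEL = new Map<string, TypeLabel>([
  ["h1", "Title"],
  ["h2", "Title"],
  ["h3", "Title"],
  ["h4", "Title"],
  ["h5", "Title"],
  ["h6", "Title"],
  ["p", "NarrativeText"],
  ["li", "ListItem"],
  ["table", "Table"],
  ["img", "Image"],
  ["header", "Header"],
  ["footer", "Footer"],
  ["nav", "Navigation"],
  ["form", "Form"],
]);

// Unknown tags fall back to the generic "Text" label.
function labelForTag(tagName: string): TypeLabel {
  return TAG_TO_LABEL.get(tagName.toLowerCase()) ?? "Text";
}

console.log(labelForTag("H2"), labelForTag("div"));
```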

    API Reference

    partitionHtml(html: string, options?: PartitionOptions): PartitionResult

    Convenience function to partition HTML content.

    DOMPartitioner

    Main class for partitioning DOM content.

    Constructor

    new DOMPartitioner(options?: PartitionOptions)

    Methods

    • partition(html: string): PartitionResult - Partition HTML content

    PartitionOptions

    Configuration options for partitioning:

    interface PartitionOptions {
      skipNavigation?: boolean;     // Default: true
      skipHeaders?: boolean;        // Default: false
      skipFooters?: boolean;        // Default: false
      skipForms?: boolean;          // Default: true
      minTextLength?: number;       // Default: 10
      preserveWhitespace?: boolean; // Default: false
      extractTables?: boolean;      // Default: true
      extractImages?: boolean;      // Default: true
      includeImageAlt?: boolean;    // Default: true
      includeOriginalHtml?: boolean;// Default: false
    }
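
    The documented defaults can be read as a base object that caller-supplied options override. The sketch below shows that merge under the assumption that options are combined by a simple object spread; the library's actual merging logic is not shown in this README. `PartitionOptionsSketch`, `DEFAULT_OPTIONS`, and `resolveOptions` are hypothetical names, and the default values are copied from the comments above.

```typescript
// Local copy of the documented option shape.
interface PartitionOptionsSketch {
  skipNavigation?: boolean;
  skipHeaders?: boolean;
  skipFooters?: boolean;
  skipForms?: boolean;
  minTextLength?: number;
  preserveWhitespace?: boolean;
  extractTables?: boolean;
  extractImages?: boolean;
  includeImageAlt?: boolean;
  includeOriginalHtml?: boolean;
}

// Defaults as listed in the PartitionOptions comments.
const DEFAULT_OPTIONS: Required<PartitionOptionsSketch> = {
  skipNavigation: true,
  skipHeaders: false,
  skipFooters: false,
  skipForms: true,
  minTextLength: 10,
  preserveWhitespace: false,
  extractTables: true,
  extractImages: true,
  includeImageAlt: true,
  includeOriginalHtml: false,
};

// Later spreads win, so any key present in `options` overrides the default
// (including a key explicitly set to undefined).
function resolveOptions(
  options: PartitionOptionsSketch = {}
): Required<PartitionOptionsSketch> {
  return { ...DEFAULT_OPTIONS, ...options };
}

const resolved = resolveOptions({ skipFooters: true, minTextLength: 15 });
console.log(resolved.skipFooters, resolved.minTextLength, resolved.skipForms);
```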

    Element

    Base element interface:

    interface Element {
      id: string;
      type: ElementType;
      text: string;
      metadata: ElementMetadata;
    }

    TableElement

    Extended element for tables:

    interface TableElement extends Element {
      type: ElementType.TABLE;
      rows: string[][];
      headers?: string[];
    }
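
    In the Quick Start output, the table's `text` field looks like a header row, a `---` separator row, and then the data rows, with cells joined by ` | `. The sketch below reproduces that layout from `headers` and `rows`; it is inferred from that single example and may not match the library's formatting in other cases. `renderTableText` is a hypothetical helper name.

```typescript
// Hypothetical helper: build a pipe-separated text view of a table.
// Layout inferred from the Quick Start example, not from the library source.
function renderTableText(rows: string[][], headers?: string[]): string {
  const lines: string[] = [];
  if (headers && headers.length > 0) {
    lines.push(headers.join(" | "));
    // One "---" cell per header column.
    lines.push(headers.map(() => "---").join(" | "));
  }
  for (const row of rows) {
    lines.push(row.join(" | "));
  }
  return lines.join("\n");
}

const renderedTable = renderTableText([["John", "30"]], ["Name", "Age"]);
console.log(renderedTable);
```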

    ImageElement

    Extended element for images:

    interface ImageElement extends Element {
      type: ElementType.IMAGE;
      src?: string;
      alt?: string;
      width?: number;
      height?: number;
    }

    Comparison with Unstructured Python Library

    This library is inspired by the Python Unstructured library but is designed specifically for TypeScript/JavaScript environments:

    Feature                  unstructured-ts      Unstructured Python
    DOM Processing           ✅ Cheerio-based     ✅ BeautifulSoup-based
    Element Classification   ✅ Simplified        ✅ Comprehensive
    Table Extraction         ✅ Basic structure   ✅ Advanced analysis
    Multiple File Formats    ❌ HTML only         ✅ PDF, DOCX, etc.
    OCR Support              ❌                   ✅
    Language                 TypeScript           Python
    Performance              ⚡ Fast              🐌 Slower
    Dependencies             Minimal              Heavy

    Contributing

    Contributions are welcome! Please feel free to submit a Pull Request.

    License

    MIT