
    Web Content Scraper

    A powerful Node.js web scraper that extracts clean, readable content from websites while keeping everything nicely organized. Perfect for creating AI training datasets! 🤖

    ✨ Features

    • 🌐 Smart web crawling of internal links
    • 🔄 Smart retry mechanism with proxy fallback
    • 📝 Clean content extraction using Mozilla's Readability
    • 🧹 Smart content processing and cleaning
    • 🗂️ Maintains original URL structure in saved files
    • 🚫 Excludes unwanted paths from scraping
    • 🚦 Configurable rate limiting and delays
    • 🤖 AI-friendly output formats (JSONL, CSV, clean text)
    • 📊 Rich metadata extraction
    • 📁 Combine results from multiple scrapers into a unified dataset

    🛠️ Prerequisites

    • Node.js (v18 or higher)
    • npm

    📦 Dependencies

    • axios - HTTP requests master
    • jsdom - DOM parsing wizard
    • @mozilla/readability - Content extraction genius
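
    These three libraries do the heavy lifting: axios fetches the pages, jsdom turns the HTML into a DOM, and Readability pulls out the readable article. As a rough, standalone sketch of how they fit together (illustrative only, not this package's internal code):

    const axios = require('axios');
    const { JSDOM } = require('jsdom');
    const { Readability } = require('@mozilla/readability');

    // Sketch: fetch a page, parse the HTML with jsdom,
    // and let Readability extract the clean article text.
    async function extractArticle(url) {
      const { data: html } = await axios.get(url);
      const dom = new JSDOM(html, { url });
      const article = new Readability(dom.window.document).parse();
      return article ? { title: article.title, text: article.textContent } : null;
    }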

    🚀 Installation

    npm i clean-web-scraper
    
    # OR
    
    git clone https://github.com/mlibre/Clean-Web-Scraper
    cd Clean-Web-Scraper
    # Arch Linux: install Xvfb and Chromium (needed when usePuppeteer is enabled)
    sudo pacman -S extra/xorg-server-xvfb chromium
    npm install
    
    # Skip chromium download during npm installation
    # npm i --ignore-scripts

    💻 Usage

    const WebScraper = require('clean-web-scraper');
    
    const scraper = new WebScraper({
      baseURL: 'https://example.com/news',          // Required: The website base URL to scrape
      startURL: 'https://example.com/blog',         // Optional: Custom starting URL
      excludeList: ['/admin', '/private'],          // Optional: Paths to exclude
      exactExcludeList: ['/specific-page',          // Optional: Exact URLs to exclude
        /^https:\/\/host\.com\/\d{4}\/$/],          // Optional: Regex patterns to exclude. This will exclude URLs like https://host.com/2023/
      scrapResultPath: './example.com/website',     // Required: Where to save the content
      jsonlOutputPath: './example.com/train.jsonl', // Optional: Custom JSONL output path
      textOutputPath: "./example.com/texts",        // Optional: Custom text output path
      csvOutputPath: "./example.com/train.csv",     // Optional: Custom CSV output path
      strictBaseURL: true,                          // Optional: Only scrape URLs from same domain
      maxDepth: Infinity,                           // Optional: Maximum crawling depth
      maxArticles: Infinity,                        // Optional: Maximum articles to scrape
      crawlingDelay: 1000,                          // Optional: Delay between requests (ms)
      batchSize: 5,                                 // Optional: Number of URLs to process concurrently
    
      // Network options
      axiosHeaders: {},                             // Optional: Custom HTTP headers
      axiosProxy: {                                 // Optional: HTTP/HTTPS proxy
        host: "localhost",
        port: 2080,
        protocol: "http"
      },
      axiosMaxRetries: 5,                           // Optional: Max retry attempts
      axiosRetryDelay: 40000,                       // Optional: Delay between retries (ms)
      useProxyAsFallback: false,                    // Optional: Fallback to proxy on failure
      
      // Puppeteer options for handling dynamic content
      usePuppeteer: false,                          // Optional: Enable Puppeteer browser
    });
    await scraper.start();
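
    For sites that render their articles with JavaScript, the same configuration style can be combined with the Puppeteer option. A minimal sketch using only the options documented above (URLs and paths are placeholders):

    const WebScraper = require('clean-web-scraper');

    // Sketch: scrape a JavaScript-rendered site through a headless browser.
    const spaScraper = new WebScraper({
      baseURL: 'https://spa.example.com',           // placeholder: site rendered client-side
      scrapResultPath: './spa.example.com/website', // where to save the content
      usePuppeteer: true,                           // render pages with Puppeteer before extraction
      crawlingDelay: 2000,                          // be gentler when driving a browser
      maxArticles: 100                              // cap the crawl for a quick test run
    });

    await spaScraper.start();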

    💻 Advanced Usage: Multi-Site Scraping

    const WebScraper = require('clean-web-scraper');
    
    // Scrape documentation website
    const docsScraper = new WebScraper({
      baseURL: 'https://docs.example.com',
      scrapResultPath: './datasets/docs',
      maxDepth: 3,                               // Optional: Maximum depth for recursive crawling
      includeMetadata: true,                     // Optional: Include metadata in output files
      // Optional: Specify metadata fields to include
      metadataFields: ["author", "articleTitle", "pageTitle", "description", "dateScrapedDate"],
    });
    
    // Scrape blog website
    const blogScraper = new WebScraper({
      baseURL: 'https://blog.example.com',
      scrapResultPath: './datasets/blog',
      maxDepth: 3,                               // Optional: Maximum depth for recursive crawling
      includeMetadata: true,                     // Optional: Include metadata in output files
      // Optional: Specify metadata fields to include
      metadataFields: ["author", "articleTitle", "pageTitle", "description", "dateScrapedDate"],
    });
    
    // Start scraping both sites
    await docsScraper.start();
    await blogScraper.start();
    
    // Combine all scraped content into a single dataset
    await WebScraper.combineResults('./combined', [docsScraper, blogScraper]);

    For large crawls, give Node.js more heap memory, for example 8 GB:

    node --max-old-space-size=8192 example-usage.js

    📤 Output

    Your AI-ready content is saved in a clean, structured format:

    • 📁 Base folder: ./folderPath/example.com/
    • 📑 Files preserve original URL paths
    • 🤖 No HTML, no noise - just clean, structured text (.txt files)
    • 📊 JSONL and CSV outputs, ready for AI consumption, model training and fine-tuning

    example.com/
    ├── website/
    │   ├── page1.txt         # Clean text content
    │   ├── page1.json        # Full metadata
    │   └── blog/
    │       ├── post1.txt
    │       └── post1.json
    ├── texts/                # Numbered text files
    │   ├── 1.txt
    │   └── 2.txt
    ├── texts_with_metadata/  # When includeMetadata is true
    │   ├── 1.txt
    │   └── 2.txt
    ├── train.jsonl           # Combined content
    ├── train_with_metadata.jsonl  # When includeMetadata is true
    ├── train.csv             # Clean text in CSV format
    └── train_with_metadata.csv    # When includeMetadata is true

    combined/
    ├── texts/                # Combined numbered text files
    │   ├── 1.txt
    │   ├── 2.txt
    │   └── n.txt
    ├── texts_with_metadata/  # Combined metadata text files
    │   ├── 1.txt
    │   ├── 2.txt
    │   └── n.txt
    ├── combined.jsonl        # Combined JSONL content
    ├── combined_with_metadata.jsonl
    ├── combined.csv          # Combined CSV content
    └── combined_with_metadata.csv

    📄 Output File Formats

    📝 Text Files (*.txt)

    The actual article content starts here. This is the clean, processed text of the article that was extracted from the webpage.

    📑 Text Files with Metadata (texts_with_metadata/*.txt)

    articleTitle: My Awesome Page
    description: This is a great article about coding
    author: John Doe
    language: en
    dateScraped: 2024-01-20T10:30:00Z
    
    ---
    
    The actual article content starts here. This is the clean, processed text of the article that was extracted from the webpage.

    📊 JSONL Files (train.jsonl)

    {"text": "Clean article content here"}
    {"text": "Another article content here"}

    📈 JSONL with Metadata (train_with_metadata.jsonl)

    {"text": "Article content", "metadata": {"articleTitle": "Page Title", "author": "John Doe"}}
    {"text": "Another article", "metadata": {"articleTitle": "Second Page", "author": "Jane Smith"}}

    🗃️ JSON Files In Website Output (*.json)

    {
      "url": "<https://example.com/page>",
      "title": "Page Title",
      "description": "Page description",
      "dateScraped": "2024-01-20T10:30:00Z"
    }

    📋 CSV Files (train.csv)

    text
    "Clean article content here"
    "Another article content here"

    📊 CSV with Metadata (train_with_metadata.csv)

    text,articleTitle,author,description
    "Article content","Page Title","John Doe","Page description"
    "Another article","Second Page","Jane Smith","Another description"

    Standing with Palestine 🇵🇸

    This project supports Palestinian rights and stands in solidarity with Palestine. We believe in the importance of documenting and preserving Palestinian narratives, history, and struggles for justice and liberation.

    Free Palestine 🇵🇸