  • License ISC

Web Content Scraper

A powerful Node.js web scraper that extracts clean, readable content from websites while keeping everything nicely organized. Perfect for creating AI training datasets! 🤖

✨ Features

  • ๐ŸŒ Smart recursive web crawling of internal links
  • ๐Ÿ“ Clean content extraction using Mozilla's Readability
  • ๐Ÿงน Smart content processing and cleaning
  • ๐Ÿ—‚๏ธ Maintains original URL structure in saved files
  • ๐Ÿšซ Excludes unwanted paths from scraping
  • ๐Ÿ”„ Handles relative and absolute URLs like a pro
  • ๐ŸŽฏ No duplicate page visits
  • ๐Ÿ“Š Generates JSONL output file for ML training
  • ๐Ÿ“Š AI-friendly clean text and csv output (perfect for LLM fine-tuning!)
  • ๐Ÿ“Š Rich metadata extraction
  • ๐Ÿ“ Combine results from multiple scrapers into a unified dataset

๐Ÿ› ๏ธ Prerequisites

  • Node.js (v18 or higher)
  • npm

📦 Dependencies

  • axios - HTTP requests master
  • jsdom - DOM parsing wizard
  • @mozilla/readability - Content extraction genius

🚀 Installation

npm i clean-web-scraper

# OR

git clone https://github.com/mlibre/Clean-Web-Scraper
cd Clean-Web-Scraper
npm install

💻 Usage

const WebScraper = require('clean-web-scraper');

const scraper = new WebScraper({
  baseURL: 'https://example.com/news',          // Required: The website base url to scrape
  startURL: 'https://example.com/blog',         // Optional: Custom starting URL
  excludeList: ['/admin', '/private'],          // Optional: Paths to exclude
  exactExcludeList: ['/specific-page'],         // Optional: Exact URLs to exclude
  scrapResultPath: './example.com/website',     // Required: Where to save the content
  jsonlOutputPath: './example.com/train.jsonl', // Optional: Custom JSONL output path
  textOutputPath: "./example.com/texts",        // Optional: Custom text output path
  csvOutputPath: "./example.com/train.csv",     // Optional: Custom CSV output path
  maxDepth: 3,                                  // Optional: Maximum depth for recursive crawling
  includeTitles: true,                          // Optional: Include page titles in outputs
});
scraper.start();

// Combine results from multiple scrapers.
// scraper1 and scraper2 stand for WebScraper instances created as above.
WebScraper.combineResults('./combined-dataset', [scraper1, scraper2]);

Run the example script:

node example-usage.js

📤 Output

Your AI-ready content is saved in a clean, structured format:

  • ๐Ÿ“ Base folder: ./folderPath/example.com/
  • ๐Ÿ“‘ Files preserve original URL paths
  • ๐Ÿ“ Pure text format, perfect for LLM training and fine-tuning
  • ๐Ÿค– No HTML, no mess - just clean, structured text ready for AI consumption
  • ๐Ÿ“Š JSONL output for ML training
  • ๐Ÿ“ˆ CSV output with clean text content
example.com/
โ”œโ”€โ”€ website/
โ”‚   โ”œโ”€โ”€ page1.txt         # Clean text content
โ”‚   โ”œโ”€โ”€ page1.json        # Full metadata
โ”‚   โ””โ”€โ”€ blog/
โ”‚       โ”œโ”€โ”€ post1.txt
โ”‚       โ””โ”€โ”€ post1.json
โ”‚โ”€โ”€ texts/           # Numbered text files
โ”‚       โ”œโ”€โ”€ 1.txt
โ”‚       โ”œโ”€โ”€ 2.txt
โ”‚โ”€โ”€ train.jsonl      # Combined content
โ””โ”€โ”€ train.csv        # Clean text in CSV format
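
To make "files preserve original URL paths" concrete, here is a minimal sketch of how a page URL could map onto the .txt and .json pair in the tree above. outputPathsFor is a hypothetical helper, and using "index" for the site root is an assumption; the package may name files differently.

```javascript
const path = require('node:path');

// Hypothetical helper: maps a page URL to a text file and a metadata file
// under scrapResultPath, mirroring the URL's path segments as folders.
function outputPathsFor(pageURL, scrapResultPath) {
  const { pathname } = new URL(pageURL);
  // Drop empty segments so "/blog/post1/" and "/blog/post1" map to the same file.
  const segments = pathname.split('/').filter(Boolean);
  // Assumption: the site root needs some file name; "index" is used here.
  if (segments.length === 0) segments.push('index');
  // path.posix keeps forward slashes on every platform for this illustration.
  const base = path.posix.join(scrapResultPath, ...segments);
  return { text: `${base}.txt`, metadata: `${base}.json` };
}

console.log(outputPathsFor('https://example.com/blog/post1', './example.com/website'));
// { text: 'example.com/website/blog/post1.txt',
//   metadata: 'example.com/website/blog/post1.json' }
```

Because the URL parser resolves ".." segments before the path is split, a crafted link cannot climb out of the output folder in this sketch.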

🤖 AI/LLM Training Ready

The output is specifically formatted for AI training purposes:

  • Clean, processed text without HTML markup
  • Multiple formats (JSONL, CSV, text files)
  • Structured content perfect for fine-tuning LLMs
  • Ready to use in your ML pipelines
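
As a rough illustration of the JSONL and CSV formats listed above, the sketch below serializes a couple of records and reads the JSONL back. The record shape { text } and the helper names toJsonlLine, toCsvField, and parseJsonl are assumptions made for this example; check the files the scraper actually writes for the real field names.

```javascript
// Hypothetical record shape: { text: string }. Real field names may differ.

// One JSON object per line; JSON.stringify escapes newlines inside strings,
// so each record stays on a single line.
function toJsonlLine(record) {
  return JSON.stringify(record);
}

// RFC 4180 style quoting: wrap the field in quotes when it contains a quote,
// comma, or line break, and double any embedded quotes.
function toCsvField(value) {
  const s = String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Read JSONL content back into an array of objects, skipping blank lines.
function parseJsonl(content) {
  return content
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

const records = [
  { text: 'First page.\nSecond paragraph.' },
  { text: 'He said "hi", then left.' },
];

const jsonl = records.map(toJsonlLine).join('\n') + '\n';
const csv = ['text', ...records.map((r) => toCsvField(r.text))].join('\n') + '\n';

console.log(parseJsonl(jsonl).length); // 2
```

In a pipeline, parseJsonl could be fed the contents of the JSONL output file (read with fs.readFileSync and the 'utf8' encoding) to get one object per scraped page.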

Standing with Palestine 🇵🇸

This project supports Palestinian rights and stands in solidarity with Palestine. We believe in the importance of documenting and preserving Palestinian narratives, history, and struggles for justice and liberation.

Free Palestine 🇵🇸