Package Exports
- clean-web-scraper
- clean-web-scraper/main.js
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (clean-web-scraper) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
Web Content Scraper
A powerful Node.js web scraper that extracts clean, readable content from websites while keeping everything nicely organized. Perfect for creating AI training datasets!
Features
- Smart recursive web crawling of internal links
- Clean content extraction using Mozilla's Readability
- Smart content processing and cleaning
- Maintains original URL structure in saved files
- Excludes unwanted paths from scraping
- Handles relative and absolute URLs like a pro
- No duplicate page visits
- Generates JSONL output file for ML training
- AI-friendly clean text and CSV output (perfect for LLM fine-tuning!)
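
The link-handling features above (internal links only, excluded paths, relative and absolute URLs, no duplicate visits) can be illustrated with Node's built-in `URL` class. This is a minimal sketch, not the package's actual implementation: `normalizeLink` and `shouldVisit` are hypothetical helpers, and treating `excludeList` entries as path prefixes is an assumption.

```javascript
// Hypothetical helpers for illustration; not part of clean-web-scraper's API.

// Resolve a (possibly relative) href against the page it was found on.
function normalizeLink(href, pageUrl) {
  try {
    const url = new URL(href, pageUrl);
    url.hash = ''; // "#section" fragments point at the same page
    return url.href;
  } catch {
    return null; // malformed href
  }
}

// Decide whether a normalized link should be queued for crawling.
// Assumption: excludeList entries are matched as path prefixes.
function shouldVisit(link, baseURL, excludeList, visited) {
  if (link === null || visited.has(link)) return false;
  const url = new URL(link);
  if (url.origin !== new URL(baseURL).origin) return false; // internal links only
  return !excludeList.some((prefix) => url.pathname.startsWith(prefix));
}

const baseURL = 'https://example.com';
const visited = new Set();
const queue = [];
const hrefs = ['/about', 'about#team', 'https://other.org/x', '/admin/users', '/about'];

for (const href of hrefs) {
  const link = normalizeLink(href, baseURL + '/');
  if (shouldVisit(link, baseURL, ['/admin', '/private'], visited)) {
    visited.add(link);
    queue.push(link);
  }
}

console.log(queue); // [ 'https://example.com/about' ]
```

Only one of the five links survives: the fragment variant and the repeat collapse into the same URL, the external link fails the origin check, and `/admin/users` matches an excluded prefix.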
Prerequisites
- Node.js (v18 or higher)
- npm
Dependencies
- axios - HTTP requests master
- jsdom - DOM parsing wizard
- @mozilla/readability - Content extraction genius
Installation
npm i clean-web-scraper
# OR
git clone https://github.com/mlibre/Clean-Web-Scraper
cd Clean-Web-Scraper
npm install
Usage
const WebScraper = require('clean-web-scraper');

const scraper = new WebScraper({
  baseURL: 'https://example.com',       // Required: The website to scrape
  scrapResultPath: './output',          // Required: Where to save the content
  excludeList: ['/admin', '/private'],  // Optional: Paths to exclude
  exactExcludeList: ['/specific-page'], // Optional: Exact URLs to exclude
  jsonlPath: 'output.jsonl',            // Optional: Custom JSONL output path
  textOutputPath: './dataset/texts',    // Optional: Custom text output path
  csvPath: './dataset/train.csv'        // Optional: Custom CSV output path
});
scraper.start();

Run the example script:
node example-usage.js
Output
Your AI-ready content is saved in a clean, structured format:
- Base folder: ./folderPath/example.com/
- Files preserve original URL paths
- Pure text format, perfect for LLM training and fine-tuning
- No HTML, no mess - just clean, structured text ready for AI consumption
- JSONL output for ML training
- CSV output with clean text content
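
To make "files preserve original URL paths" concrete, here is one way a page URL could map to a file under the base folder. This is a sketch under stated assumptions: `outputPathFor` is a hypothetical helper, and the `.txt` extension and the `index` name for the site root are illustrative guesses, not the package's documented behavior.

```javascript
const path = require('node:path');

// Hypothetical helper: this mapping is an assumption, not the package's actual logic.
function outputPathFor(pageUrl, scrapResultPath) {
  const url = new URL(pageUrl);
  // Drop empty segments so "/docs/intro/" and "/docs/intro" map to the same file.
  const segments = url.pathname.split('/').filter(Boolean);
  if (segments.length === 0) segments.push('index'); // assumed name for the site root
  const fileName = segments.pop() + '.txt';          // assumed extension
  return path.join(scrapResultPath, url.hostname, ...segments, fileName);
}

// On POSIX systems this prints: output/example.com/docs/intro.txt
console.log(outputPathFor('https://example.com/docs/intro', './output'));
```

Keeping the hostname as the first folder level matches the base folder layout listed above, and the remaining path segments mirror the URL one directory per segment.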
AI/LLM Training Ready
The output is specifically formatted for AI training purposes:
- Clean, processed text without HTML markup
- Consistent formatting across all documents
- Structured content perfect for fine-tuning LLMs
- Ready to use in your ML pipelines
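
As a rough illustration of feeding the JSONL output into a pipeline, the sketch below parses JSON Lines text into an array of records using only Node built-ins. `parseJsonl` and `loadJsonl` are hypothetical helpers, and the `text` field in the sample data is a placeholder, since the record schema is not documented here.

```javascript
const fs = require('node:fs');

// Parse JSON Lines text: one JSON value per non-empty line.
function parseJsonl(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Read and parse a JSONL file from disk (for example, the configured jsonlPath).
function loadJsonl(filePath) {
  return parseJsonl(fs.readFileSync(filePath, 'utf8'));
}

// Demo with inline data; the "text" field is a placeholder, not the package's schema.
const sample = '{"text":"First page"}\n{"text":"Second page"}\n';
const records = parseJsonl(sample);
console.log(records.length); // 2
```

After a scrape finishes, `loadJsonl('output.jsonl')` would load the file named in the usage example. Inspect the first record to see which fields are actually present before wiring it into training code.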
Standing with Palestine 🇵🇸
This project supports Palestinian rights and stands in solidarity with Palestine. We believe in the importance of documenting and preserving Palestinian narratives, history, and struggles for justice and liberation.
Free Palestine 🇵🇸