Enterprise AI Recursive Web Scraper
Advanced AI-powered recursive web scraper utilizing Groq LLMs, Puppeteer, and Playwright for intelligent content extraction
Features
- High Performance: Blazing fast multi-threaded scraping with concurrent processing
- AI-Powered: Intelligent content extraction using Groq LLMs
- Multi-Browser: Support for Chromium, Firefox, and WebKit
- Smart Extraction:
  - Structured data extraction without LLMs using CSS selectors
  - Topic-based and semantic chunking strategies
  - Cosine similarity clustering for content deduplication (see the sketch after this list)
- Advanced Capabilities:
  - Recursive domain crawling with boundary respect
  - Session management for complex multi-page flows
  - Custom JavaScript execution support
  - Enhanced screenshot capture with lazy-load detection
  - iframe content extraction
- Enterprise Ready:
  - Proxy support with authentication
  - Custom headers and user-agent configuration
  - Comprehensive error handling
  - Flexible timeout management
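To make the deduplication idea concrete, here is a minimal, self-contained sketch of cosine-similarity clustering over embedded text chunks. It illustrates the technique only and is not the library's internal code; the chunk shape, the embedding source, and the 0.9 threshold are assumptions:

// Conceptual sketch of cosine-similarity deduplication (not the library's internals).
// Assumes each chunk already carries an embedding vector produced elsewhere.
interface Chunk {
  text: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keeps the first chunk of each near-duplicate cluster; 0.9 is an illustrative threshold.
function deduplicate(chunks: Chunk[], threshold = 0.9): Chunk[] {
  const kept: Chunk[] = [];
  for (const chunk of chunks) {
    const isDuplicate = kept.some(
      (k) => cosineSimilarity(k.embedding, chunk.embedding) >= threshold
    );
    if (!isDuplicate) kept.push(chunk);
  }
  return kept;
}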
Quick Start
To install the package, run:
npm install enterprise-ai-recursive-web-scraper
Using the CLI
The enterprise-ai-recursive-web-scraper package includes a command-line interface (CLI) that you can use to perform web scraping tasks directly from the terminal.
Installation
Ensure that the package is installed globally to use the CLI:
npm install -g enterprise-ai-recursive-web-scraper
Running the CLI
Once installed, you can use the web-scraper command to start scraping. Here's a basic example of how to use it:
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output
CLI Options
- -k, --api-key <key>: (Required) Your Google Gemini API key.
- -u, --url <url>: (Required) The URL of the website you want to scrape.
- -o, --output <directory>: The directory where the scraped data will be saved. Default is scraping_output.
- -d, --depth <number>: Maximum crawl depth. Default is 3.
- -c, --concurrency <number>: Concurrent scraping limit. Default is 5.
- -t, --timeout <seconds>: Request timeout in seconds. Default is 30.
- -f, --format <type>: Output format (json, csv, markdown). Default is json.
- --screenshot: Capture screenshots of pages.
- --no-headless: Run the browser in non-headless mode.
- --proxy <url>: Use a proxy server.
- -v, --verbose: Enable verbose logging.
- --config <path>: Path to a configuration file.
Example Command
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output --depth 5 --concurrency 10 --format csv --verbose
This command will scrape the specified URL with a maximum depth of 5, using 10 concurrent requests, and save the output in CSV format in the ./output directory, with verbose logging enabled.
Advanced Usage
Structured Data Extraction
To extract structured data using a JSON schema, you can use the JsonExtractionStrategy:
import { WebScraper, JsonExtractionStrategy } from "enterprise-ai-recursive-web-scraper";
const schema = {
baseSelector: "article",
fields: [
{ name: "title", selector: "h1" },
{ name: "content", selector: ".content" },
{ name: "date", selector: "time", attribute: "datetime" }
]
};
const scraper = new WebScraper({
extractionStrategy: new JsonExtractionStrategy(schema)
});
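With the extraction strategy configured, a run usually just points the scraper instance from the snippet above at a URL and inspects the structured result. The scrapeWebsite method name and result handling below are assumptions for illustration; check the package's TypeScript definitions for the exact API:

// Hypothetical usage sketch — method name and result shape are assumptions, not confirmed API.
async function main() {
  const results = await scraper.scrapeWebsite("https://example.com/articles");
  console.log(JSON.stringify(results, null, 2));
}

main().catch(console.error);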
Custom Browser Session
You can customize the browser session with specific configurations:
import { WebScraper } from "enterprise-ai-recursive-web-scraper";
const scraper = new WebScraper({
browserConfig: {
headless: false,
proxy: "http://proxy.example.com",
userAgent: "Custom User Agent"
}
});
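The enterprise features listed earlier (authenticated proxies, custom headers, timeouts) are typically supplied through the same constructor options. The option names below are illustrative assumptions rather than confirmed configuration keys; verify them against the package's exported configuration types:

import { WebScraper } from "enterprise-ai-recursive-web-scraper";

// Illustrative configuration — field names beyond those shown in the example above
// (e.g. headers, timeout, proxy credentials) are assumptions, not confirmed API.
const enterpriseScraper = new WebScraper({
  browserConfig: {
    headless: true,
    proxy: "http://username:password@proxy.example.com:8080",
    userAgent: "MyCompanyBot/1.0",
    headers: { "Accept-Language": "en-US" },
  },
  timeout: 30_000,
});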
Contributors
Mike Odnis
License
MIT © Mike Odnis
Built with create-typescript-app