Enterprise AI Recursive Web Scraper
Advanced AI-powered recursive web scraper utilizing Groq LLMs, Puppeteer, and Playwright for intelligent content extraction
Features
- High Performance: Blazing fast multi-threaded scraping with concurrent processing
- AI-Powered: Intelligent content extraction using Groq LLMs
- Multi-Browser: Support for Chromium, Firefox, and WebKit
- Smart Extraction:
  - Structured data extraction without LLMs using CSS selectors
  - Topic-based and semantic chunking strategies
  - Cosine similarity clustering for content deduplication (see the sketch after this list)
- Advanced Capabilities:
  - Recursive domain crawling with boundary respect
  - Session management for complex multi-page flows
  - Custom JavaScript execution support
  - Enhanced screenshot capture with lazy-load detection
  - iframe content extraction
- Enterprise Ready:
  - Proxy support with authentication
  - Custom headers and user-agent configuration
  - Comprehensive error handling
  - Flexible timeout management
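To make the deduplication idea concrete, here is a minimal, self-contained sketch of cosine-similarity clustering over embedded text chunks. It illustrates the technique only and is not the library's internal code; the chunk shape, the embedding source, and the 0.9 threshold are assumptions:

// Conceptual sketch of cosine-similarity deduplication (not the library's internals).
// Assumes each chunk already carries an embedding vector produced elsewhere.
interface Chunk {
  text: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keeps the first chunk of each near-duplicate cluster; 0.9 is an illustrative threshold.
function deduplicate(chunks: Chunk[], threshold = 0.9): Chunk[] {
  const kept: Chunk[] = [];
  for (const chunk of chunks) {
    const isDuplicate = kept.some(
      (k) => cosineSimilarity(k.embedding, chunk.embedding) >= threshold
    );
    if (!isDuplicate) kept.push(chunk);
  }
  return kept;
}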
Quick Start
To install the package, run:
npm install enterprise-ai-recursive-web-scraper
Using the CLI
The enterprise-ai-recursive-web-scraper package includes a command-line interface (CLI) that you can use to perform web scraping tasks directly from the terminal.
Installation
Ensure that the package is installed globally to use the CLI:
npm install -g enterprise-ai-recursive-web-scraper
Running the CLI
Once installed, you can use the web-scraper command to start scraping. Here's a basic example of how to use it:
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output
CLI Options
- -k, --api-key <key>: (Required) Your Google Gemini API key.
- -u, --url <url>: (Required) The URL of the website you want to scrape.
- -o, --output <directory>: The directory where the scraped data will be saved. Default is scraping_output.
- -d, --depth <number>: Maximum crawl depth. Default is 3.
- -c, --concurrency <number>: Concurrent scraping limit. Default is 5.
- -t, --timeout <seconds>: Request timeout in seconds. Default is 30.
- -f, --format <type>: Output format (json, csv, markdown). Default is json.
- --screenshot: Capture screenshots of pages.
- --no-headless: Run the browser in non-headless mode.
- --proxy <url>: Use a proxy server.
- -v, --verbose: Enable verbose logging.
- --config <path>: Path to a configuration file.
Example Command
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output --depth 5 --concurrency 10 --format csv --verbose
This command will scrape the specified URL with a maximum depth of 5, using 10 concurrent requests, and save the output in CSV format in the ./output directory, with verbose logging enabled.
Advanced Usage
Structured Data Extraction
To extract structured data using a JSON schema, you can use the JsonExtractionStrategy:
import { WebScraper, JsonExtractionStrategy } from "enterprise-ai-recursive-web-scraper";
const schema = {
baseSelector: "article",
fields: [
{ name: "title", selector: "h1" },
{ name: "content", selector: ".content" },
{ name: "date", selector: "time", attribute: "datetime" }
]
};
const scraper = new WebScraper({
extractionStrategy: new JsonExtractionStrategy(schema)
});
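With the extraction strategy configured, a run usually just points the scraper instance from the snippet above at a URL and inspects the structured result. The scrapeWebsite method name and result handling below are assumptions for illustration; check the package's TypeScript definitions for the exact API:

// Hypothetical usage sketch — method name and result shape are assumptions, not confirmed API.
async function main() {
  const results = await scraper.scrapeWebsite("https://example.com/articles");
  console.log(JSON.stringify(results, null, 2));
}

main().catch(console.error);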
Custom Browser Session
You can customize the browser session with specific configurations:
import { WebScraper } from "enterprise-ai-recursive-web-scraper";
const scraper = new WebScraper({
browserConfig: {
headless: false,
proxy: "http://proxy.example.com",
userAgent: "Custom User Agent"
}
});
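The enterprise features listed earlier (authenticated proxies, custom headers, timeouts) are typically supplied through the same constructor options. The option names below are illustrative assumptions rather than confirmed configuration keys; verify them against the package's exported configuration types:

import { WebScraper } from "enterprise-ai-recursive-web-scraper";

// Illustrative configuration — field names beyond those shown in the example above
// (e.g. headers, timeout, proxy credentials) are assumptions, not confirmed API.
const enterpriseScraper = new WebScraper({
  browserConfig: {
    headless: true,
    proxy: "http://username:password@proxy.example.com:8080",
    userAgent: "MyCompanyBot/1.0",
    headers: { "Accept-Language": "en-US" },
  },
  timeout: 30_000,
});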
Contributors
Mike Odnis
License
MIT © Mike Odnis
Built with create-typescript-app