# @anisirji/web-extractor
Powerful web content extraction SDK with intelligent URL handling, content cleaning, and comprehensive metadata extraction.
## Features
### ✨ Smart URL Handling
- URL validation and normalization
- Subdomain detection
- Duplicate URL filtering
- Pattern-based URL filtering
### 🧹 Content Cleaning
- Automatic markdown/HTML/text extraction
- Whitespace normalization
- Word counting
- Language detection
### Rich Metadata
- Scraping timestamps
- Word counts
- Page descriptions
- Status codes
- Custom metadata support
### Easy to Use
- Simple, intuitive API
- TypeScript support
- Promise-based
- Comprehensive error handling
## Documentation
- Testing Guide - Comprehensive guide on testing the SDK
- Test Results - Latest test results for astratechai.com
- API Documentation - Complete API reference and examples
## Installation
```bash
npm install @anisirji/web-extractor
```

## Quick Start
### Extract a Single Page
```ts
import { WebExtractor } from '@anisirji/web-extractor';

const extractor = new WebExtractor({
  apiKey: 'your-firecrawl-api-key'
});

// Extract a single page
const page = await extractor.extractPage('https://example.com');

console.log(page.title);
console.log(page.content);
console.log(page.metadata.wordCount);
```

### Extract Entire Website
```ts
const result = await extractor.extractWebsite('https://docs.example.com', {
  maxPages: 20,
  includeSubdomains: false,
  titlePrefix: 'Docs',
  maxDepth: 3
});

console.log(`Extracted ${result.pages.length} pages`);
console.log(`Success rate: ${result.stats.successRate}%`);
console.log(`Total words: ${result.stats.totalWords}`);

for (const page of result.pages) {
  console.log(`${page.title} - ${page.url}`);
}
```

## API Reference
### WebExtractor
Main class for web extraction.
#### Constructor
```ts
new WebExtractor(config: WebExtractorConfig)
```

**Config options:**

- `apiKey` (required): Your Firecrawl API key
- `baseUrl` (optional): Custom Firecrawl API URL
- `timeout` (optional): Request timeout in ms (default: 30000)
- `debug` (optional): Enable debug logging (default: false)
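Putting the options together, a minimal sketch of a fully configured instance (the key and endpoint values are placeholders, not SDK defaults):

```ts
const extractor = new WebExtractor({
  apiKey: process.env.FIRECRAWL_API_KEY!, // required
  baseUrl: 'https://api.firecrawl.dev',   // optional: custom API URL (placeholder value)
  timeout: 60000,                         // optional: raise the 30s default to 60s
  debug: true                             // optional: enable debug logging
});
```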
#### Methods
##### `extractPage(url, options?)`
Extract content from a single page.
```ts
await extractor.extractPage('https://example.com', {
  onlyMainContent: true, // Extract only the main content
  format: 'markdown',    // 'markdown' | 'html' | 'text'
  waitFor: 1000          // Wait time before extraction (ms)
});
```

**Returns:** `Promise<ExtractedPage>`
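Since the SDK is promise-based, failures surface as rejections; a hedged sketch of wrapping a call (the exact error shape is not specified in this README):

```ts
try {
  const page = await extractor.extractPage('https://example.com');
  console.log(`Fetched "${page.title}" (${page.metadata.wordCount} words)`);
} catch (err) {
  // Assumption: rejections carry an Error; the SDK only documents that calls can fail.
  console.error('Extraction failed:', err instanceof Error ? err.message : err);
}
```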
##### `extractWebsite(url, options?)`

Extract content from an entire website (crawl).
```ts
await extractor.extractWebsite('https://example.com', {
  maxPages: 10,                  // Maximum pages to scrape
  includeSubdomains: false,      // Include subdomains
  titlePrefix: 'My Site',        // Prefix for all titles
  maxDepth: 3,                   // Maximum crawl depth
  followExternalLinks: false,    // Follow external links
  includePatterns: [/\/docs\//], // URL patterns to include
  excludePatterns: [/\/blog\//], // URL patterns to exclude
  onlyMainContent: true,         // Extract only the main content
  format: 'markdown'             // Output format
});
```

**Returns:** `Promise<ExtractionResult>`
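The returned `ExtractionResult` (see Types below) carries a `failed` array alongside `pages`; a sketch of checking for partial failures, using only the fields documented below:

```ts
const result = await extractor.extractWebsite('https://example.com', { maxPages: 10 });

if (result.failed.length > 0) {
  // FailedExtraction's fields are not documented in this README,
  // so this only reports how many pages failed.
  console.warn(`${result.failed.length} of ${result.totalPages} pages failed`);
}
```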
### URL Utilities
Powerful URL manipulation utilities.
```ts
import {
  normalizeUrl,
  validateUrl,
  deduplicateUrls,
  isSameDomain,
  extractDomain
} from '@anisirji/web-extractor';

// Normalize a URL
const normalized = normalizeUrl('https://Example.com/path/?b=2&a=1#hash', {
  lowercase: true,           // Convert to lowercase
  removeTrailingSlash: true, // Remove trailing slash
  removeFragment: true,      // Remove #hash
  sortQueryParams: true      // Sort query params
});
// => 'https://example.com/path?a=1&b=2'

// Validate a URL
const urlObj = validateUrl('https://example.com'); // Returns a URL object or throws

// Deduplicate URLs
const unique = deduplicateUrls([
  'https://example.com/page',
  'https://example.com/page/',
  'https://EXAMPLE.COM/page'
]);
// => ['https://example.com/page']

// Check same domain
isSameDomain('https://example.com', 'https://example.com/page'); // true
isSameDomain('https://example.com', 'https://other.com');        // false

// Extract the domain
extractDomain('https://blog.example.com/page'); // => 'blog.example.com'
```
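These helpers compose naturally; a hedged sketch (reusing the imports above) that filters discovered links down to a clean, same-domain crawl list — the link array is illustrative:

```ts
const seed = 'https://example.com';
const discovered = [
  'https://Example.com/about/',
  'https://example.com/about#team',
  'https://other.com/page'
];

// Keep valid, same-domain links, then normalize and deduplicate.
const frontier = deduplicateUrls(
  discovered
    .filter(url => {
      try { validateUrl(url); return true; } catch { return false; }
    })
    .filter(url => isSameDomain(seed, url))
    .map(url => normalizeUrl(url, {
      lowercase: true,
      removeTrailingSlash: true,
      removeFragment: true
    }))
);
// => ['https://example.com/about']
```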
### Content Utilities

Content processing utilities.
```ts
import {
  cleanContent,
  countWords,
  generateExcerpt,
  detectLanguage
} from '@anisirji/web-extractor';

// Clean content (trims and collapses extra whitespace)
const cleaned = cleanContent('  text\n\n\n\nmore text  ');
// => 'text\n\nmore text'

// Count words
countWords('Hello world from TermiX'); // => 4

// Generate an excerpt (first N words followed by '...')
generateExcerpt('Very long content here...', 10);

// Detect language
detectLanguage('This is an English text'); // => 'en'
```
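Chained together, these utilities cover a typical post-processing step; `makePreview` below is an illustrative name (reusing the imports above), not an SDK export:

```ts
// Hypothetical helper combining the documented content utilities.
function makePreview(raw: string) {
  const content = cleanContent(raw);
  return {
    content,
    words: countWords(content),
    excerpt: generateExcerpt(content, 25),
    language: detectLanguage(content)
  };
}
```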
## Advanced Examples

### Filter URLs by Pattern
```ts
const result = await extractor.extractWebsite('https://docs.example.com', {
  maxPages: 50,

  // Only include documentation pages
  includePatterns: [
    /\/docs\//,
    /\/api\//,
    /\/guides\//
  ],

  // Exclude blog and changelog
  excludePatterns: [
    /\/blog\//,
    /\/changelog\//
  ]
});
```

### Custom Processing Pipeline
```ts
const result = await extractor.extractWebsite('https://example.com', {
  maxPages: 30
});

// Filter by word count
const substantialPages = result.pages.filter(
  page => page.metadata.wordCount > 500
);

// Group by language
const byLanguage = result.pages.reduce((acc: Record<string, ExtractedPage[]>, page) => {
  const lang = page.metadata.language || 'unknown';
  acc[lang] = acc[lang] || [];
  acc[lang].push(page);
  return acc;
}, {});

// Estimate reading time (assuming ~200 words per minute)
const withReadingTime = result.pages.map(page => ({
  ...page,
  readingTimeMinutes: Math.ceil(page.metadata.wordCount / 200)
}));
```

### Batch Processing with Error Handling
```ts
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

const results = await Promise.allSettled(
  urls.map(url => extractor.extractPage(url))
);

const successful = results
  .filter((r): r is PromiseFulfilledResult<ExtractedPage> => r.status === 'fulfilled')
  .map(r => r.value);

// Pair each result with its URL *before* filtering, so indices
// still line up with the original `urls` array.
const failed = results
  .map((result, i) => ({ url: urls[i], result }))
  .filter(({ result }) => result.status === 'rejected')
  .map(({ url, result }) => ({
    url,
    error: (result as PromiseRejectedResult).reason
  }));

console.log(`Success: ${successful.length}, Failed: ${failed.length}`);
```
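`Promise.allSettled` fires every request at once; for longer URL lists, a hedged sketch of throttling by fixed-size chunks (`extractInChunks` and the chunk size are illustrative, not part of the SDK; `extractor` is the instance from earlier):

```ts
// Hypothetical helper: process URLs in chunks to avoid hammering the API.
async function extractInChunks(urls: string[], chunkSize = 5) {
  const results: PromiseSettledResult<ExtractedPage>[] = [];
  for (let i = 0; i < urls.length; i += chunkSize) {
    const chunk = urls.slice(i, i + chunkSize);
    results.push(...await Promise.allSettled(
      chunk.map(url => extractor.extractPage(url))
    ));
  }
  return results;
}
```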
## Types

### ExtractedPage
```ts
interface ExtractedPage {
  title: string;
  content: string;
  url: string;
  metadata: PageMetadata;
}
```

### PageMetadata
```ts
interface PageMetadata {
  scrapedAt: Date;
  sourceUrl: string;
  description?: string;
  wordCount: number;
  language?: string;
  statusCode?: number;
  [key: string]: any; // Custom metadata
}
```

### ExtractionResult
```ts
interface ExtractionResult {
  pages: ExtractedPage[];
  totalPages: number;
  failed: FailedExtraction[];
  stats: ExtractionStats;
}
```

### ExtractionStats
```ts
interface ExtractionStats {
  duration: number;        // Total time in ms
  successRate: number;     // Success rate (%)
  totalWords: number;      // Total words extracted
  avgWordsPerPage: number; // Average words per page
}
```
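As a usage sketch built only from the interfaces above (the helper name is illustrative, not an SDK export):

```ts
// Hypothetical summary helper using only the documented fields.
function summarize(result: ExtractionResult): string {
  const { duration, successRate, totalWords, avgWordsPerPage } = result.stats;
  return [
    `${result.pages.length}/${result.totalPages} pages extracted`,
    `${successRate}% success in ${(duration / 1000).toFixed(1)}s`,
    `${totalWords} words (avg ${avgWordsPerPage}/page)`
  ].join(' | ');
}
```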
## Use Cases

- Documentation Scraping: Extract and index documentation sites
- 🧠 Knowledge Base Building: Build AI knowledge bases from websites
- Content Analysis: Analyze website content and structure
- SEO Analysis: Extract metadata for SEO analysis
- 🤖 AI Training Data: Collect training data for AI models
- Content Migration: Migrate content from old to new sites
## Requirements
- Node.js >= 16
- Firecrawl API key (get one from firecrawl.dev)
## Testing
Run the comprehensive test suite:
```bash
# Unit tests
npm test

# Integration test with astratechai.com
npm run test:astratechai

# Basic usage example
npm run test:integration
```

See the Testing Guide for detailed instructions on creating tests for your own websites.
## License
MIT
## Repository
- GitHub
- Report Issues
- 📦 NPM Package
- Documentation
- 🧪 Test Results
Built with ❤️ by anisirji