Extract2MD - Enhanced PDF to Markdown Converter
A powerful client-side JavaScript library for converting PDFs to Markdown with multiple extraction methods and optional LLM enhancement. Now with scenario-specific methods for different use cases.
🚀 Quick Start
Extract2MD now offers 5 distinct scenarios for different conversion needs:
import Extract2MDConverter from 'extract2md';
// Scenario 1: Quick conversion only
const markdown1 = await Extract2MDConverter.quickConvertOnly(pdfFile);
// Scenario 2: High accuracy OCR conversion only
const markdown2 = await Extract2MDConverter.highAccuracyConvertOnly(pdfFile);
// Scenario 3: Quick conversion + LLM enhancement
const markdown3 = await Extract2MDConverter.quickConvertWithLLM(pdfFile);
// Scenario 4: High accuracy conversion + LLM enhancement
const markdown4 = await Extract2MDConverter.highAccuracyConvertWithLLM(pdfFile);
// Scenario 5: Combined extraction + LLM enhancement (most comprehensive)
const markdown5 = await Extract2MDConverter.combinedConvertWithLLM(pdfFile);
📋 Scenarios Explained
Scenario 1: Quick Convert Only
- Use case: Fast conversion when PDF has selectable text
- Method: quickConvertOnly(pdfFile, config?)
- Tech: PDF.js text extraction only
- Output: Basic markdown formatting
Scenario 2: High Accuracy Convert Only
- Use case: PDFs with images, scanned documents, complex layouts
- Method: highAccuracyConvertOnly(pdfFile, config?)
- Tech: Tesseract.js OCR
- Output: Markdown from OCR extraction
Scenario 3: Quick Convert + LLM
- Use case: Fast extraction with AI enhancement for better formatting
- Method: quickConvertWithLLM(pdfFile, config?)
- Tech: PDF.js + WebLLM
- Output: AI-enhanced markdown with improved structure and clarity
Scenario 4: High Accuracy + LLM
- Use case: OCR extraction with AI enhancement
- Method: highAccuracyConvertWithLLM(pdfFile, config?)
- Tech: Tesseract.js OCR + WebLLM
- Output: AI-enhanced markdown from OCR
Scenario 5: Combined + LLM (Recommended)
- Use case: Most comprehensive conversion using both extraction methods
- Method: combinedConvertWithLLM(pdfFile, config?)
- Tech: PDF.js + Tesseract.js + WebLLM with specialized prompts
- Output: Best possible markdown leveraging strengths of both extraction methods
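The five scenarios above differ along two questions: does the PDF have selectable text, and should an LLM pass run afterwards. A minimal sketch of that decision follows. `chooseScenario` and its option names are hypothetical helpers for illustration, not part of extract2md; only the returned method names come from the list above.

```javascript
// Hypothetical helper (not part of extract2md): maps a few yes/no answers
// onto the five method names listed in the scenarios above.
function chooseScenario({ hasSelectableText, useLLM, combine = false }) {
  // Scenario 5 runs both extraction methods and always uses the LLM.
  if (useLLM && combine) return "combinedConvertWithLLM";
  if (hasSelectableText) {
    // Scenarios 3 and 1: PDF.js text extraction, with or without the LLM.
    return useLLM ? "quickConvertWithLLM" : "quickConvertOnly";
  }
  // Scenarios 4 and 2: OCR for scanned or image-heavy PDFs.
  return useLLM ? "highAccuracyConvertWithLLM" : "highAccuracyConvertOnly";
}

// Usage sketch: look the static method up by name on the converter.
// const method = chooseScenario({ hasSelectableText: false, useLLM: true });
// const markdown = await Extract2MDConverter[method](pdfFile, config);
```

Keeping the choice in one small function makes it easy to change the default later, for example to prefer the combined scenario whenever the LLM is enabled.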
⚙️ Configuration
Create a configuration object or JSON file to customize behavior:
const config = {
  // PDF.js Worker
  pdfJsWorkerSrc: "../pdf.worker.min.mjs",
  // Tesseract OCR Settings
  tesseract: {
    workerPath: "./tesseract-worker.min.js",
    corePath: "./tesseract-core.wasm.js",
    langPath: "./lang-data/",
    language: "eng",
    options: {}
  },
  // LLM Configuration
  webllm: {
    model: "Qwen3-0.6B-q4f16_1-MLC",
    // Optional: Custom model
    customModel: {
      model: "https://huggingface.co/mlc-ai/your-model/resolve/main/",
      model_id: "YourModel-ID",
      model_lib: "https://example.com/your-model.wasm",
      required_features: ["shader-f16"],
      overrides: { conv_template: "qwen" }
    },
    options: {
      temperature: 0.7,
      maxTokens: 4096
    }
  },
  // System Prompt Customizations
  systemPrompts: {
    singleExtraction: "Focus on preserving code examples exactly.",
    combinedExtraction: "Pay attention to tables and diagrams from OCR."
  },
  // Processing Options
  processing: {
    splitPascalCase: false,
    pdfRenderScale: 2.5,
    postProcessRules: [
      { find: /\bAPI\b/g, replace: "API" }
    ]
  },
  // Progress Tracking
  progressCallback: (progress) => {
    console.log(`${progress.stage}: ${progress.message}`);
    if (progress.currentPage) {
      console.log(`Page ${progress.currentPage}/${progress.totalPages}`);
    }
  }
};
// Use configuration
const markdown = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);
🔧 Advanced Usage
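The `postProcessRules` entries in the configuration above are `find`/`replace` pairs. How the library applies them internally is not shown here, so the sketch below is an assumption: each rule is passed to `String.prototype.replace` in order. `applyPostProcessRules` is a hypothetical name used only for illustration.

```javascript
// Hypothetical helper (not part of extract2md): applies find/replace rules
// in order, assuming each rule maps directly onto String.prototype.replace.
function applyPostProcessRules(text, rules = []) {
  return rules.reduce(
    (out, { find, replace }) => out.replace(find, replace),
    text
  );
}

const rules = [
  // Normalize the casing of an acronym.
  { find: /\bapi\b/gi, replace: "API" },
  // Strip trailing spaces and tabs at the end of each line.
  { find: /[ \t]+$/gm, replace: "" }
];

const cleaned = applyPostProcessRules("The api is stable.  \nUse the Api.", rules);
console.log(cleaned);
```

Because rules run in sequence, a later rule sees the output of the earlier ones, so ordering matters when patterns overlap.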
Using Individual Components
import {
  WebLLMEngine,
  OutputParser,
  SystemPrompts,
  ConfigValidator
} from 'extract2md';
// Validate configuration
const validatedConfig = ConfigValidator.validate(userConfig);
// Initialize WebLLM engine
const engine = new WebLLMEngine(validatedConfig);
await engine.initialize();
// Generate text
const result = await engine.generate("Your prompt here");
// Parse output
const parser = new OutputParser();
const cleanMarkdown = parser.parse(result);
Custom System Prompts
The library uses different system prompts for different scenarios:
// For scenarios 3 & 4 (single extraction)
const singlePrompt = SystemPrompts.getSingleExtractionPrompt(
  "Additional instruction: Preserve all technical terms."
);
// For scenario 5 (combined extraction)
const combinedPrompt = SystemPrompts.getCombinedExtractionPrompt(
  "Focus on creating comprehensive documentation."
);
Configuration from JSON
import Extract2MDConverter, { ConfigValidator } from 'extract2md';
// Load from JSON string
const config = ConfigValidator.fromJSON(configJsonString);
// Use with any scenario
const result = await Extract2MDConverter.quickConvertWithLLM(pdfFile, config);
🎯 Error Handling & Progress Tracking
const config = {
  progressCallback: (progress) => {
    switch (progress.stage) {
      case 'scenario_5_start':
        console.log('Starting combined conversion...');
        break;
      case 'webllm_load_progress':
        console.log(`Loading model: ${progress.progress}%`);
        break;
      case 'ocr_page_process':
        console.log(`OCR: ${progress.currentPage}/${progress.totalPages}`);
        break;
      case 'webllm_generate_start':
        console.log('AI enhancement in progress...');
        break;
      case 'scenario_5_complete':
        console.log('Conversion completed!');
        break;
      default:
        console.log(`${progress.stage}: ${progress.message}`);
    }
    if (progress.error) {
      console.error('Error:', progress.error);
    }
  }
};
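The report fields used in the callback above (`stage`, `message`, `progress`, `currentPage`, `totalPages`, `error`) can also be folded into a single status line for a UI. The sketch below is one possible way to do that. `describeProgress` is a hypothetical helper, and treating `progress.progress` as a 0-100 number is an assumption based on the `%` formatting in the callback.

```javascript
// Hypothetical helper (not part of extract2md): turns a progress report
// into one human-readable status line.
function describeProgress(progress) {
  if (progress.error) {
    // Accept either an Error object or a plain string.
    const detail = progress.error.message ?? progress.error;
    return `Error: ${detail}`;
  }
  if (progress.currentPage && progress.totalPages) {
    const pct = Math.round((progress.currentPage / progress.totalPages) * 100);
    return `${progress.stage}: page ${progress.currentPage}/${progress.totalPages} (${pct}%)`;
  }
  if (typeof progress.progress === "number") {
    // Assumption: progress.progress is a percentage between 0 and 100.
    return `${progress.stage}: ${progress.progress}%`;
  }
  return `${progress.stage}: ${progress.message ?? ""}`.trim();
}

// Usage sketch, with a hypothetical status element:
// const config = { progressCallback: (p) => { statusEl.textContent = describeProgress(p); } };
```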
try {
  const result = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);
  console.log('Success:', result);
} catch (error) {
  console.error('Conversion failed:', error.message);
}
🔄 Migration from Legacy API
If you're using the old API, you can still access it:
import Extract2MDConverter, { LegacyExtract2MDConverter } from 'extract2md';
// Old way
const converter = new LegacyExtract2MDConverter(options);
const legacyQuick = await converter.quickConvert(pdfFile);
const legacyOcr = await converter.highAccuracyConvert(pdfFile);
const legacyEnhanced = await converter.llmRewrite(text);
// New way (recommended)
const quick = await Extract2MDConverter.quickConvertOnly(pdfFile, config);
const ocr = await Extract2MDConverter.highAccuracyConvertOnly(pdfFile, config);
// Extraction and LLM enhancement happen in one call that takes the PDF, not extracted text
const enhanced = await Extract2MDConverter.quickConvertWithLLM(pdfFile, config);
🌟 Features
- 5 Scenario-Specific Methods: Choose the right approach for your use case
- WebLLM Integration: Client-side AI enhancement with Qwen models
- Custom Model Support: Use your own trained models
- Advanced Output Parsing: Automatic removal of thinking tags and formatting
- Comprehensive Configuration: Fine-tune every aspect of the conversion
- Progress Tracking: Real-time updates for UI integration
- TypeScript Support: Full type definitions included
- Backwards Compatible: Legacy API still available
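As an illustration of the "removal of thinking tags" feature listed above, the sketch below strips reasoning blocks from model output. It assumes the tags have the form `<think>...</think>`; the library's own `OutputParser` may handle more cases or work differently. `stripThinkingTags` is a hypothetical name.

```javascript
// Hypothetical helper (not the library's OutputParser): removes
// <think>...</think> blocks and tidies the blank lines left behind.
function stripThinkingTags(text) {
  return text
    // Non-greedy match so several separate blocks are each removed.
    .replace(/<think>[\s\S]*?<\/think>/gi, "")
    // Collapse runs of three or more newlines into one blank line.
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

const raw = "<think>Plan the headings first.</think>\n\n# Title\n\nBody text.";
console.log(stripThinkingTags(raw));
```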
📚 TypeScript Support
Full TypeScript definitions are included:
import Extract2MDConverter, {
  Extract2MDConfig,
  ProgressReport,
  CustomModelConfig
} from 'extract2md';
const config: Extract2MDConfig = {
  webllm: {
    model: "Qwen3-0.6B-q4f16_1-MLC",
    options: {
      temperature: 0.7,
      maxTokens: 4096
    }
  },
  progressCallback: (progress: ProgressReport) => {
    console.log(progress.stage, progress.message);
  }
};
const result: string = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);
🏗️ Installation & Deployment
NPM Installation
npm install extract2md
CDN Usage
<script src="https://unpkg.com/extract2md@2.0.0/dist/assets/extract2md.umd.js"></script>
<script>
  // Available as global Extract2MD
  // Classic scripts do not support top-level await, so wrap the call in an async function
  (async () => {
    const result = await Extract2MD.Extract2MDConverter.quickConvertOnly(pdfFile);
  })();
</script>
Worker Files Configuration
The package requires worker files for PDF.js and Tesseract.js. These are automatically copied during build:
// Default worker paths (adjust for your deployment)
const config = {
  pdfJsWorkerSrc: "/pdf.worker.min.mjs",
  tesseract: {
    workerPath: "/tesseract-worker.min.js",
    corePath: "/tesseract-core.wasm.js"
  }
};
Bundle Size Considerations
- Total Size: ~11 MB (includes OCR and PDF processing)
- PDF.js: ~950 KB
- Tesseract.js: ~4.5 MB
- WebLLM: Variable (model-dependent)
Use lazy loading and code splitting for production deployments.
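One way to apply the lazy-loading advice is a dynamic `import()` that runs only when the user first requests a conversion, so the bundle stays out of the initial page load. The sketch below is a minimal, hedged example: `lazy` and `loadConverter` are hypothetical names, and it assumes your bundler splits dynamic imports into separate chunks.

```javascript
// Hypothetical helper: wraps a loader so it runs at most once and every
// caller shares the same promise.
function lazy(loader) {
  let cached;
  return () => (cached ??= loader());
}

// The import is not evaluated until loadConverter() is first called.
const loadConverter = lazy(() =>
  import("extract2md").then((mod) => mod.default)
);

// Usage sketch, with a hypothetical button and file input:
// button.addEventListener("click", async () => {
//   const Extract2MDConverter = await loadConverter();
//   const markdown = await Extract2MDConverter.quickConvertOnly(fileInput.files[0]);
// });
```

Note that a failed load stays cached in this sketch; a production version might reset the cache on rejection so the user can retry.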
📚 Documentation
- Migration Guide - Upgrade from legacy API
- Deployment Guide - Production deployment instructions
- Examples - Complete usage examples
- TypeScript Definitions - Full type definitions
📄 License
MIT License - see LICENSE file for details.
🤝 Contributing
Contributions welcome! Please read the contributing guidelines before submitting PRs.
🐛 Issues
Report issues on the GitHub Issues page.