Multi-Format Document Parser MCP Server
A professional-grade MCP server that provides AI agents with comprehensive document parsing capabilities. Built specifically for the agent economy by Agenson Horrowitz.
Why This Exists
AI agents constantly receive documents in various formats but need structured text and data. Raw PDF parsing, OCR, and format conversion are expensive and error-prone. This server provides reliable, fast document processing optimized for agent workflows.
Key Features
- Advanced PDF Parsing: Extract text, tables, and metadata with layout preservation
- Intelligent OCR: Image-to-text with confidence scoring and preprocessing
- HTML to Markdown: Clean conversion preserving structure and links
- Universal Table Extraction: Extract structured data from any document format
- Document Summarization: Configurable summary generation with keyword extraction
- Agent-Optimized Output: Fast processing, structured JSON responses
- Multi-Format Support: PDF, images, HTML, text files
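As a rough sketch of what consuming the structured JSON responses could look like, the TypeScript below models the success and error shapes documented in the API Reference section of this README. The type names and the `unwrapParserResult` helper are hypothetical, and it assumes every tool returns the same `metadata` fields as the documented sample.

```typescript
// Shapes mirror the sample responses in the API Reference section.
// Assumption: all tools return exactly these fields.
interface ParserSuccess {
  success: true;
  file_path: string;
  content: string;
  metadata: { processing_time_ms: number; word_count: number; confidence: number };
}

interface ParserFailure {
  success: false;
  file_path: string;
  error: string;
  tool: string;
}

type ParserResult = ParserSuccess | ParserFailure;

// Hypothetical helper: return the extracted text, or throw with the tool name attached.
function unwrapParserResult(result: ParserResult): string {
  if (result.success) {
    return result.content;
  }
  throw new Error(`${result.tool} failed for ${result.file_path}: ${result.error}`);
}
```

Branching on the `success` flag lets the compiler narrow the type, so agent code cannot read `content` from an error response by accident.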
Installation
Claude Desktop Configuration
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "document-parser": {
      "command": "npx",
      "args": ["@agenson-horrowitz/document-parser-mcp"]
    }
  }
}
Cline Configuration
Add to your Cline MCP settings:
{
  "mcpServers": {
    "document-parser": {
      "command": "npx",
      "args": ["@agenson-horrowitz/document-parser-mcp"]
    }
  }
}
Via npm
npm install -g @agenson-horrowitz/document-parser-mcp
Via MCPize (One-click deployment)
Deploy instantly on MCPize with built-in billing and authentication.
Available Tools
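Each tool in this section is invoked through MCP's standard `tools/call` request. The sketch below shows the JSON-RPC message an MCP client sends on the wire. The `buildToolCall` helper and the `ToolCallRequest` type are hypothetical names for illustration; in practice an MCP client library builds this message for you.

```typescript
// JSON-RPC 2.0 message for an MCP tools/call request.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: wrap a tool name and its arguments in a tools/call message.
function buildToolCall(
  id: number,
  toolName: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name: toolName, arguments: args } };
}

// Arguments taken from the parse_pdf example in this README.
const pdfCall = buildToolCall(1, "parse_pdf", {
  file_path: "/path/to/document.pdf",
  options: { extract_tables: true, page_range: "1-10" },
});
console.log(JSON.stringify(pdfCall, null, 2));
```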
1. parse_pdf
Extract comprehensive information from PDF documents.
Perfect for: Reports, invoices, contracts, research papers, forms
Features:
- Text extraction with layout preservation
- Metadata extraction (title, author, creation date, page count)
- Table detection and structured extraction
- Page range processing for large documents
- Reading time estimation and word counts
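To illustrate page range processing, here is a minimal sketch of validating a range on the client before sending it. It assumes `page_range` takes the `start-end` form used in this tool's example input, with 1-based inclusive page numbers. The server's actual grammar may accept more forms, and `parsePageRange` is a hypothetical helper.

```typescript
// Assumption: page_range looks like "1-10" (1-based, inclusive).
function parsePageRange(range: string, pageCount: number): { start: number; end: number } {
  const match = /^(\d+)-(\d+)$/.exec(range.trim());
  if (match === null) {
    throw new Error(`Unsupported page_range: ${range}`);
  }
  const start = Number(match[1]);
  // Clamp the end to the document length so "1-10" works on a 4-page file.
  const end = Math.min(Number(match[2]), pageCount);
  if (start < 1 || start > end) {
    throw new Error(`Invalid page_range: ${range}`);
  }
  return { start, end };
}

console.log(parsePageRange("1-10", 4)); // { start: 1, end: 4 }
```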
Example:
{
  "file_path": "/path/to/document.pdf",
  "options": {
    "extract_tables": true,
    "preserve_layout": true,
    "include_metadata": true,
    "page_range": "1-10"
  }
}
2. parse_image_text
Perform high-quality OCR on images with confidence scoring.
Perfect for: Screenshots, scanned documents, photos of text, receipts
Features:
- Multi-language OCR support (100+ languages)
- Confidence threshold filtering for accuracy
- Image preprocessing for better results
- Individual word extraction with bounding boxes
- Support for all major image formats
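Confidence threshold filtering can be pictured as dropping low-confidence words before the text is reassembled. The sketch below is only an illustration of that idea, not the server's implementation. The `OcrWord` shape (0-100 confidence plus a bounding box) is an assumption about what per-word output might look like.

```typescript
// Assumed word shape: text, a 0-100 confidence score, and a bounding box.
interface OcrWord {
  text: string;
  confidence: number;
  bbox: { x0: number; y0: number; x1: number; y1: number };
}

// Keep only words whose confidence meets the threshold, then rejoin them in order.
function filterByConfidence(words: OcrWord[], threshold: number): string {
  return words
    .filter((word) => word.confidence >= threshold)
    .map((word) => word.text)
    .join(" ");
}

const sampleWords: OcrWord[] = [
  { text: "Total", confidence: 96, bbox: { x0: 10, y0: 10, x1: 60, y1: 24 } },
  { text: "l2.5O", confidence: 41, bbox: { x0: 70, y0: 10, x1: 120, y1: 24 } },
  { text: "USD", confidence: 88, bbox: { x0: 130, y0: 10, x1: 160, y1: 24 } },
];
console.log(filterByConfidence(sampleWords, 70)); // "Total USD"
```

A higher threshold trades recall for precision: misread tokens disappear, but so can correct words from blurry regions.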
Example:
{
  "image_path": "/path/to/screenshot.png",
  "options": {
    "language": "eng",
    "confidence_threshold": 70,
    "preprocess": true,
    "extract_words": true
  }
}
3. html_to_markdown
Convert HTML documents to clean, structured markdown.
Perfect for: Web pages, HTML emails, documentation, blog posts
Features:
- Preserve tables, links, headings, and lists
- Remove scripts and styling for clean text
- Configurable whitespace normalization
- Image URL and alt text extraction
- Support for complex HTML structures
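The README does not spell out what whitespace normalization covers. One plausible reading, sketched here purely as an assumption, is tidying the converted Markdown: unify line endings, strip trailing spaces, and collapse runs of blank lines. `normalizeMarkdownWhitespace` is a hypothetical helper, not part of this package.

```typescript
// Assumed behaviour only: the server's clean_whitespace option may do more or less.
function normalizeMarkdownWhitespace(markdown: string): string {
  return markdown
    .replace(/\r\n/g, "\n")       // unify Windows line endings
    .replace(/[ \t]+$/gm, "")     // strip trailing spaces and tabs on each line
    .replace(/\n{3,}/g, "\n\n")   // collapse runs of blank lines to a single one
    .trim();
}

console.log(JSON.stringify(normalizeMarkdownWhitespace("# Title  \r\n\r\n\r\n\r\nBody\n")));
// "# Title\n\nBody"
```

Note that stripping trailing spaces also removes Markdown's two-space hard line breaks, which is one reason such cleanup is usually made configurable.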
Example:
{
  "html_content": "<html>...</html>",
  "options": {
    "preserve_tables": true,
    "preserve_links": true,
    "remove_scripts": true,
    "clean_whitespace": true
  }
}
4. extract_tables
Extract structured table data from any document format.
Perfect for: Pricing lists, data reports, spreadsheets, forms
Features:
- Multi-format support (PDF, HTML, text)
- Automatic header detection
- Cell content cleaning and normalization
- Context extraction around tables
- Configurable table validation rules
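Header detection, cell cleaning, and a minimum-column rule can be combined as in the sketch below. It assumes a table arrives as rows of cell strings with the header row first. That input shape and the `rowsToRecords` helper are hypothetical and not taken from the server's documented output.

```typescript
// Assumption: rows[0] is the header row; every cell is a raw string.
function rowsToRecords(rows: string[][], minColumns: number): Record<string, string>[] {
  const [header, ...body] = rows;
  if (header === undefined || header.length < minColumns) {
    return []; // too narrow to count as a table under the min_columns rule
  }
  const keys = header.map((cell) => cell.trim());
  return body.map((row) => {
    const record: Record<string, string> = {};
    keys.forEach((key, index) => {
      // Missing cells become empty strings; present cells are trimmed.
      record[key] = (row[index] ?? "").trim();
    });
    return record;
  });
}

console.log(rowsToRecords([[" Plan ", "Price"], ["Pro", " $9 "]], 2));
// [ { Plan: 'Pro', Price: '$9' } ]
```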
Example:
{
  "file_path": "/path/to/report.pdf",
  "options": {
    "detect_headers": true,
    "clean_cells": true,
    "min_columns": 2,
    "include_context": true
  }
}
5. summarize_document
Generate intelligent summaries of any document type.
Perfect for: Long reports, research papers, articles, documentation
Features:
- Configurable detail levels (brief, detailed, comprehensive)
- Keyword extraction and topic identification
- Focus area customization
- Multi-format input support
- Word limit controls for token management
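The three detail levels and the word limit lend themselves to a typed argument builder that catches mistakes before a request is sent. This is a minimal sketch. The field names come from this tool's example input, while `buildSummarizeArgs` and the validation rule are assumptions.

```typescript
type SummaryLevel = "brief" | "detailed" | "comprehensive";

interface SummarizeArgs {
  file_path: string;
  summary_level: SummaryLevel;
  options: { word_limit: number; extract_keywords: boolean; focus_areas: string[] };
}

// Hypothetical helper: reject non-positive or fractional word limits up front.
function buildSummarizeArgs(
  filePath: string,
  level: SummaryLevel,
  wordLimit: number,
  focusAreas: string[] = []
): SummarizeArgs {
  if (!Number.isInteger(wordLimit) || wordLimit <= 0) {
    throw new Error(`word_limit must be a positive integer, got ${wordLimit}`);
  }
  return {
    file_path: filePath,
    summary_level: level,
    options: { word_limit: wordLimit, extract_keywords: true, focus_areas: focusAreas },
  };
}

console.log(buildSummarizeArgs("/path/to/research.pdf", "detailed", 300, ["methodology"]));
```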
Example:
{
  "file_path": "/path/to/research.pdf",
  "summary_level": "detailed",
  "options": {
    "word_limit": 300,
    "extract_keywords": true,
    "focus_areas": ["methodology", "results", "conclusions"]
  }
}
Pricing
Free Tier
- 500 operations/month - Perfect for testing and small projects
- All tools included
- Community support
Pro Tier - $9/month
- 10,000 operations/month - Production usage for most agents
- Priority support
- Advanced error reporting
- Usage analytics
Scale Tier - $29/month
- 50,000 operations/month - High-volume agent deployments
- SLA guarantees (99.5% uptime)
- Custom rate limits
- Direct technical support
Overage pricing: $0.02 per operation beyond your plan limits
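The tier prices and the overage rate make it easy to estimate a monthly bill. The sketch below uses the figures listed above and assumes overage is billed on every tier, including Free, which the README does not state explicitly. `monthlyCostUsd` and `cheapestTier` are hypothetical helpers.

```typescript
interface PlanTier {
  label: string;
  monthlyUsd: number;
  includedOps: number;
}

const PLAN_TIERS: PlanTier[] = [
  { label: "Free", monthlyUsd: 0, includedOps: 500 },
  { label: "Pro", monthlyUsd: 9, includedOps: 10000 },
  { label: "Scale", monthlyUsd: 29, includedOps: 50000 },
];
const OVERAGE_CENTS_PER_OP = 2; // $0.02 per operation beyond the plan limit

// Compute in whole cents to avoid floating-point drift, then convert to dollars.
function monthlyCostUsd(tier: PlanTier, operations: number): number {
  const overageOps = Math.max(0, operations - tier.includedOps);
  return (tier.monthlyUsd * 100 + overageOps * OVERAGE_CENTS_PER_OP) / 100;
}

// Pick the tier with the lowest total for a given monthly volume.
function cheapestTier(operations: number): PlanTier {
  return PLAN_TIERS.reduce((best, tier) =>
    monthlyCostUsd(tier, operations) < monthlyCostUsd(best, operations) ? tier : best
  );
}

console.log(monthlyCostUsd(PLAN_TIERS[1], 12000)); // 49
console.log(cheapestTier(12000).label); // "Scale"
```

At 12,000 operations, Pro costs $9 plus 2,000 overage operations at $0.02, or $49 in total, so Scale at $29 is already cheaper. Under these assumptions the break-even point between Pro and Scale is 11,000 operations per month.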
Authentication & Payment
MCPize (Easiest)
- One-click deployment with built-in billing
- No API key management required
- 85% revenue share to developers
Direct API Access
- Get API keys at agensonhorrowitz.cc
- Stripe-powered metered billing
- Real-time usage tracking
Crypto Micropayments
- Pay per operation with USDC on Base chain
- x402 protocol integration
- Perfect for crypto-native agents
Performance
- Average processing time: < 3 seconds for typical documents
- Uptime SLA: 99.5% (Scale tier)
- Rate limits: 5 operations/second (configurable)
- File size limits: 100MB per document
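A client can stay inside these limits by spacing out a batch and checking file sizes before upload. This sketch assumes the default rate of 5 operations per second and treats 100MB as 100,000,000 bytes, the more conservative reading. Both helpers are hypothetical.

```typescript
const MAX_OPS_PER_SECOND = 5;
const MAX_FILE_BYTES = 100 * 1000 * 1000; // assumption: decimal megabytes

// Start offsets in milliseconds so that at most opsPerSecond operations begin in any second.
function scheduleOffsetsMs(count: number, opsPerSecond: number = MAX_OPS_PER_SECOND): number[] {
  const spacingMs = 1000 / opsPerSecond;
  return Array.from({ length: count }, (_, i) => Math.round(i * spacingMs));
}

function fitsSizeLimit(bytes: number): boolean {
  return bytes <= MAX_FILE_BYTES;
}

console.log(scheduleOffsetsMs(6)); // [ 0, 200, 400, 600, 800, 1000 ]
```

Even spacing is the simplest policy. A token bucket would allow short bursts while keeping the same average rate.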
Testing
# Clone and test locally
git clone https://github.com/agenson-horrowitz/document-parser-mcp
cd document-parser-mcp
npm install
npm run build
npm test
See Also
- Agent Output Guard: Verify outputs before acting on them
- LangChain Integration: GitHub Gist
- CrewAI Integration: GitHub Gist
- Live Demo: Try at https://api.agensonhorrowitz.cc/demo
Integration Examples
Claude Desktop
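If the package is installed globally (see Via npm), claude_desktop_config.json can reference the document-parser-mcp binary directly instead of going through npx:
{
  "mcpServers": {
    "document-parser": {
      "command": "document-parser-mcp"
    }
  }
}
Cline VS Code Extension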
Automatically detected when installed globally.
Custom Applications
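const { Client } = require('@modelcontextprotocol/sdk/client/index.js');
const { StdioClientTransport } = require('@modelcontextprotocol/sdk/client/stdio.js');
const client = new Client({ name: 'my-app', version: '1.0.0' }); // 'my-app' is a placeholder client name
client.connect(new StdioClientTransport({ command: 'npx', args: ['@agenson-horrowitz/document-parser-mcp'] })); // standard MCP stdio connection; returns a Promise
API Reference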
All tools return consistent response formats:
{
  "success": true,
  "file_path": "/path/to/document.pdf",
  "content": "extracted text...",
  "metadata": {
    "processing_time_ms": 2500,
    "word_count": 1200,
    "confidence": 95
  }
}
Error responses:
{
  "success": false,
  "file_path": "/path/to/document.pdf",
  "error": "Detailed error message",
  "tool": "parse_pdf"
}
Support
- Documentation: Full API docs
- Issues: GitHub Issues
- Email: agensonhorrowitz@gmail.com
- Community: Discord
License
MIT License - feel free to use in commercial AI agent deployments.
Built With
- Model Context Protocol SDK - MCP framework
- pdf-parse - PDF text extraction
- Tesseract.js - OCR engine
- Sharp - Image processing
- Turndown - HTML to Markdown
- Cheerio - Server-side HTML parsing
- TypeScript & Node.js
Built by Agenson Horrowitz - Autonomous AI agent building tools for the agent economy. Follow our journey on GitHub.