JSPM

@paradyno/pdf-mcp-server

0.1.1
  • License Apache-2.0

MCP server for PDF processing - text extraction, search, and outline extraction

Package Exports

    This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (@paradyno/pdf-mcp-server) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

    Readme

    📄 PDF MCP Server

    A high-performance MCP server for PDF processing, built in Rust.


    Give your AI agents powerful PDF capabilities: extract text, search, split, merge, encrypt, and more. All dependencies are Apache 2.0 licensed, keeping your project clean and permissive.

    ✨ Features

    Category Tools
    πŸ“– Reading extract_text Β· extract_metadata Β· extract_outline Β· extract_annotations Β· extract_links Β· extract_form_fields
    πŸ” Search & Discovery search Β· list_pdfs Β· get_page_info Β· summarize_structure
    πŸ–ΌοΈ Media Image extraction (via extract_text) Β· convert_page_to_image
    βœ‚οΈ Manipulation split_pdf Β· merge_pdfs Β· compress_pdf Β· fill_form
    πŸ”’ Security protect_pdf Β· unprotect_pdf Β· Password-protected PDF support
    πŸ“¦ Resources Expose PDFs as MCP Resources for direct client access
    ⚑ Performance Batch processing · LRU caching · Operation chaining via cache keys

    🚀 Installation

    npm install -g @paradyno/pdf-mcp-server

    Pre-built Binaries

    Download from GitHub Releases:

    Platform x86_64 ARM64
    🐧 Linux pdf-mcp-server-linux-x64 pdf-mcp-server-linux-arm64
    🍎 macOS pdf-mcp-server-darwin-x64 pdf-mcp-server-darwin-arm64
    πŸͺŸ Windows pdf-mcp-server-windows-x64.exe β€”

    From Source

    cargo install --git https://github.com/paradyno/pdf-mcp-server

    ⚙️ Configuration

    Claude Desktop

    Add to your claude_desktop_config.json:

    • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
    • Windows: %APPDATA%\Claude\claude_desktop_config.json
    {
      "mcpServers": {
        "pdf": {
          "command": "npx",
          "args": ["@paradyno/pdf-mcp-server"]
        }
      }
    }

    Claude Code

    claude mcp add pdf -- npx @paradyno/pdf-mcp-server

    VS Code

    {
      "mcp.servers": {
        "pdf": {
          "command": "npx",
          "args": ["@paradyno/pdf-mcp-server"]
        }
      }
    }

    🛠️ Tools

    Source Types

    All tools accept PDF sources in multiple formats:

    { "path": "/documents/file.pdf" }
    { "base64": "JVBERi0xLjQK..." }
    { "url": "https://example.com/document.pdf" }
    { "cache_key": "abc123" }
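
As a rough illustration, a client could build these source objects with a small helper. The names `make_source` and `source_from_bytes`, and the rule that exactly one key is given per source, are assumptions made for this sketch rather than documented server behavior.

```python
import base64

# The four source keys listed above.
SOURCE_KEYS = ("path", "base64", "url", "cache_key")


def make_source(**kwargs):
    """Return a source object holding exactly one documented key (hypothetical helper)."""
    given = {key: value for key, value in kwargs.items() if value is not None}
    unknown = set(given) - set(SOURCE_KEYS)
    if unknown:
        raise ValueError(f"unknown source key(s): {sorted(unknown)}")
    if len(given) != 1:
        raise ValueError("pass exactly one of: " + ", ".join(SOURCE_KEYS))
    return given


def source_from_bytes(pdf_bytes):
    """Encode raw PDF bytes as a base64 source."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return make_source(base64=encoded)


if __name__ == "__main__":
    print(make_source(path="/documents/file.pdf"))
    print(source_from_bytes(b"%PDF-1.4\n"))
```

The base64 value printed for the bytes `%PDF-1.4\n` is the same prefix shown in the example above, since every PDF starts with that kind of header.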

    📖 extract_text

    Extract text content with LLM-optimized formatting (paragraph detection, multi-column reordering, watermark removal).

    Example & Parameters
    {
      "sources": [{ "path": "/documents/report.pdf" }],
      "pages": "1-10",
      "include_metadata": true
    }
    Parameter Type Required Default Description
    sources array Yes - PDF sources
    pages string No all Page selection (e.g., "1-5,10,15-20")
    include_metadata boolean No true Include PDF metadata
    include_images boolean No false Include extracted images (base64 PNG)
    password string No - PDF password if encrypted
    cache boolean No false Enable caching

    📖 extract_outline

    Extract PDF bookmarks / table of contents.

    Example, Parameters & Response
    {
      "sources": [{ "path": "/documents/book.pdf" }]
    }
    Parameter Type Required Default Description
    sources array Yes - PDF sources
    password string No - PDF password if encrypted
    cache boolean No false Enable caching

    Response:

    {
      "results": [{
        "source": "/documents/book.pdf",
        "outline": [
          {
            "title": "Chapter 1: Introduction",
            "page": 1,
            "children": [
              { "title": "1.1 Background", "page": 3, "children": [] }
            ]
          }
        ]
      }]
    }

    📖 extract_metadata

    Extract PDF metadata (author, title, dates, etc.) without loading full content.

    Example & Parameters
    {
      "sources": [{ "path": "/documents/report.pdf" }]
    }
    Parameter Type Required Default Description
    sources array Yes - PDF sources
    password string No - PDF password if encrypted
    cache boolean No false Enable caching

    📖 extract_annotations

    Extract highlights, comments, underlines, and other annotations.

    Example & Parameters
    {
      "sources": [{ "path": "/documents/report.pdf" }],
      "annotation_types": ["highlight", "text"],
      "pages": "1-5"
    }
    Parameter Type Required Default Description
    sources array Yes - PDF sources
    annotation_types array No all Filter by types (highlight, underline, text, etc.)
    pages string No all Page selection
    password string No - PDF password if encrypted
    cache boolean No false Enable caching

    📖 extract_links

    Extract hyperlinks and internal page navigation links.

    Example, Parameters & Response
    {
      "sources": [{ "path": "/documents/paper.pdf" }],
      "pages": "1-10"
    }
    Parameter Type Required Default Description
    sources array Yes - PDF sources
    pages string No all Page selection
    password string No - PDF password if encrypted
    cache boolean No false Enable caching

    Response:

    {
      "results": [{
        "source": "/documents/paper.pdf",
        "links": [
          { "page": 1, "url": "https://example.com", "text": "Click here" },
          { "page": 3, "dest_page": 10, "text": "See Chapter 5" }
        ],
        "total_count": 2
      }]
    }
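
The sample response contains two kinds of entries: one with a `url` and one with a `dest_page`. A minimal sketch of separating them on the client side follows; the function name `split_links` is hypothetical, and it assumes every link carries exactly one of those two keys, which the sample suggests but does not guarantee.

```python
def split_links(result):
    """Split one result's links into external (url) and internal (dest_page) lists."""
    external = [link for link in result["links"] if "url" in link]
    internal = [link for link in result["links"] if "dest_page" in link]
    return external, internal


# The sample response from above, reduced to a single result.
sample = {
    "source": "/documents/paper.pdf",
    "links": [
        {"page": 1, "url": "https://example.com", "text": "Click here"},
        {"page": 3, "dest_page": 10, "text": "See Chapter 5"},
    ],
    "total_count": 2,
}

if __name__ == "__main__":
    external, internal = split_links(sample)
    print([link["url"] for link in external])
    print([(link["page"], link["dest_page"]) for link in internal])
```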

    📖 extract_form_fields

    Read form field names, types, current values, and properties from PDF forms.

    Example, Parameters & Response
    {
      "sources": [{ "path": "/documents/form.pdf" }],
      "pages": "1"
    }
    Parameter Type Required Default Description
    sources array Yes - PDF sources
    pages string No all Page selection
    password string No - PDF password if encrypted
    cache boolean No false Enable caching

    Response:

    {
      "results": [{
        "source": "/documents/form.pdf",
        "fields": [
          {
            "page": 1,
            "name": "full_name",
            "field_type": "text",
            "value": "John Doe",
            "is_read_only": false,
            "is_required": true,
            "properties": { "is_multiline": false, "is_password": false }
          },
          {
            "page": 1,
            "name": "agree_terms",
            "field_type": "checkbox",
            "is_checked": true,
            "is_read_only": false,
            "is_required": false,
            "properties": {}
          }
        ],
        "total_fields": 2
      }]
    }

    🖼️ convert_page_to_image

    Render PDF pages as PNG images (base64). Enables Vision LLMs to understand visual layouts, charts, and diagrams.

    Example, Parameters & Response
    {
      "sources": [{ "path": "/documents/chart.pdf" }],
      "pages": "1-3",
      "width": 1200
    }
    Parameter Type Required Default Description
    sources array Yes - PDF sources
    pages string No all Page selection
    width integer No 1200 Target width in pixels
    height integer No - Target height in pixels
    scale float No - Scale factor (overrides width/height)
    password string No - PDF password if encrypted
    cache boolean No false Enable caching

    Response:

    {
      "results": [{
        "source": "/documents/chart.pdf",
        "pages": [
          {
            "page": 1,
            "width": 1200,
            "height": 1553,
            "data_base64": "iVBORw0KGgo...",
            "mime_type": "image/png"
          }
        ]
      }]
    }

    🔍 search

    Full-text search within PDFs with surrounding context.

    Example & Parameters
    {
      "sources": [{ "path": "/documents/manual.pdf" }],
      "query": "error handling",
      "context_chars": 100
    }
    Parameter Type Required Default Description
    sources array Yes - PDF sources
    query string Yes - Search query
    case_sensitive boolean No false Case-sensitive search
    max_results integer No 100 Maximum results to return
    context_chars integer No 50 Characters of context around match
    password string No - PDF password if encrypted
    cache boolean No false Enable caching
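
To make the `case_sensitive`, `max_results`, and `context_chars` parameters concrete, here is a plain-Python sketch of a substring search that returns each match with surrounding context. It only illustrates the idea on an already extracted string; the function name, the result keys `offset` and `context`, and the exact windowing are assumptions, not the server's implementation.

```python
def search_text(text, query, context_chars=50, case_sensitive=False, max_results=100):
    """Find occurrences of query in text and return each with surrounding context."""
    if not query:
        raise ValueError("query must not be empty")
    # Lowercasing is assumed to preserve string length (true for ASCII text).
    haystack = text if case_sensitive else text.lower()
    needle = query if case_sensitive else query.lower()
    hits = []
    start = haystack.find(needle)
    while start != -1 and len(hits) < max_results:
        end = start + len(needle)
        window = text[max(0, start - context_chars):end + context_chars]
        hits.append({"offset": start, "context": window})
        start = haystack.find(needle, end)
    return hits


if __name__ == "__main__":
    sample = "Robust error handling matters. Error handling is hard."
    for hit in search_text(sample, "error handling", context_chars=5):
        print(hit)
```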

    🔍 get_page_info

    Get page dimensions, word/char counts, token estimates, and file sizes. Useful for planning LLM context usage.

    Example, Parameters & Response
    {
      "sources": [{ "path": "/documents/report.pdf" }]
    }
    Parameter Type Required Default Description
    sources array Yes - PDF sources
    password string No - PDF password if encrypted
    cache boolean No false Enable caching
    skip_file_sizes boolean No false Skip file size calculation (faster)

    Response:

    {
      "results": [{
        "source": "/documents/report.pdf",
        "pages": [{
          "page": 1,
          "width": 612.0, "height": 792.0,
          "rotation": 0, "orientation": "portrait",
          "char_count": 2500, "word_count": 450,
          "estimated_token_count": 625,
          "file_size": 102400
        }],
        "total_pages": 10,
        "total_chars": 25000,
        "total_words": 4500,
        "total_estimated_token_count": 6250
      }]
    }

    Note: Token counts are model-dependent approximations (~4 chars/token for Latin, ~2 tokens/char for CJK). Use as rough guidance only.
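
The figures in the sample response are consistent with the ~4 characters per token rule for Latin text (2500 / 4 = 625 and 25000 / 4 = 6250). The sketch below reproduces that arithmetic and shows one way to use the per-page estimates for context planning. The rounding mode and the helper `pages_within_budget` are assumptions for illustration; CJK text would need a different ratio.

```python
import math


def estimate_tokens(char_count, chars_per_token=4.0):
    """Rough token estimate for Latin text, assuming ~4 characters per token."""
    return math.ceil(char_count / chars_per_token)


def pages_within_budget(pages, budget):
    """Return leading page numbers whose cumulative token estimate fits the budget."""
    chosen, used = [], 0
    for page in pages:
        cost = page["estimated_token_count"]
        if used + cost > budget:
            break
        chosen.append(page["page"])
        used += cost
    return chosen


if __name__ == "__main__":
    print(estimate_tokens(2500))   # per-page char_count from the sample response
    print(estimate_tokens(25000))  # total_chars from the sample response
```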

    🔍 summarize_structure

    One-call comprehensive overview of a PDF's structure. Helps LLMs decide how to process a document.

    Example, Parameters & Response
    {
      "sources": [{ "path": "/documents/report.pdf" }]
    }
    Parameter Type Required Default Description
    sources array Yes - PDF sources
    password string No - PDF password if encrypted
    cache boolean No false Enable caching

    Response:

    {
      "results": [{
        "source": "/documents/report.pdf",
        "page_count": 25,
        "file_size": 1048576,
        "metadata": { "title": "Annual Report", "author": "Acme Corp" },
        "has_outline": true,
        "outline_items": 12,
        "total_chars": 50000,
        "total_words": 9000,
        "total_estimated_tokens": 12500,
        "pages": [
          { "page": 1, "width": 612.0, "height": 792.0, "char_count": 2000, "word_count": 360, "has_images": true, "has_links": false, "has_annotations": false }
        ],
        "total_images": 5,
        "total_links": 3,
        "total_annotations": 2,
        "has_form": false,
        "form_field_count": 0,
        "form_field_types": {},
        "is_encrypted": false
      }]
    }

    🔍 list_pdfs

    Discover PDF files in a directory with optional filtering.

    Example & Parameters
    {
      "directory": "/documents",
      "recursive": true,
      "pattern": "invoice*.pdf"
    }
    Parameter Type Required Default Description
    directory string Yes - Directory to search
    recursive boolean No false Search subdirectories
    pattern string No - Filename pattern (e.g., "report*.pdf")

    ✂️ split_pdf

    Extract specific pages from a PDF to create a new PDF.

    Example, Parameters & Page Range Syntax
    {
      "source": { "path": "/documents/book.pdf" },
      "pages": "1-10,15,20-z",
      "output_path": "/output/excerpt.pdf"
    }
    Parameter Type Required Default Description
    source object Yes - PDF source
    pages string Yes - Page range (see syntax below)
    output_path string No - Save output to file
    password string No - PDF password if encrypted

    Page Range Syntax:

    Syntax Description
    1-5 Pages 1 through 5
    1,3,5 Specific pages
    z Last page
    r1 Last page (reverse)
    5-z Page 5 to end
    z-1 All pages reversed
    1-z:odd Odd pages only
    1-z:even Even pages only
    1-10,x5 Pages 1–10 except page 5
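
The table can be read as a small grammar. The following sketch expands a range string into a list of page numbers so the rules are easier to check by hand. It is an interpretation of the table, not the server's parser: in particular it assumes that `:odd` and `:even` select by position within the group (which coincides with odd and even page numbers for `1-z`) and that an `x` group removes pages from the group immediately before it. All function names are hypothetical.

```python
def _endpoint(token, last):
    """Resolve one endpoint: N, z (last page), or rN (Nth page from the end)."""
    if token == "z":
        return last
    if token.startswith("r"):
        return last - int(token[1:]) + 1
    return int(token)


def _expand_group(group, last):
    """Expand a single group such as '5', '3-z', 'z-1' or '1-z:odd'."""
    spec, _, modifier = group.partition(":")
    start, sep, end = spec.partition("-")
    first = _endpoint(start, last)
    final = _endpoint(end, last) if sep else first
    step = 1 if final >= first else -1  # a descending range reverses the pages
    pages = list(range(first, final + step, step))
    if any(page < 1 or page > last for page in pages):
        raise ValueError(f"page out of range in {group!r}")
    if modifier == "odd":
        return pages[0::2]  # 1st, 3rd, 5th ... item of the group
    if modifier == "even":
        return pages[1::2]  # 2nd, 4th, 6th ... item of the group
    if modifier:
        raise ValueError(f"unknown modifier in {group!r}")
    return pages


def expand_page_range(spec, page_count):
    """Expand a comma-separated page range into a flat list of page numbers."""
    groups = []
    for raw in spec.split(","):
        group = raw.strip()
        if group.startswith("x"):
            if not groups:
                raise ValueError("exclusion needs a preceding group")
            excluded = set(_expand_group(group[1:], page_count))
            groups[-1] = [page for page in groups[-1] if page not in excluded]
        else:
            groups.append(_expand_group(group, page_count))
    return [page for group in groups for page in group]


if __name__ == "__main__":
    print(expand_page_range("1-10,15,20-z", 22))
    print(expand_page_range("1-10,x5", 12))
```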

    ✂️ merge_pdfs

    Merge multiple PDFs into a single file.

    Example & Parameters
    {
      "sources": [
        { "path": "/documents/chapter1.pdf" },
        { "path": "/documents/chapter2.pdf" }
      ],
      "output_path": "/output/complete-book.pdf"
    }
    Parameter Type Required Default Description
    sources array Yes - PDF sources to merge (in order)
    output_path string No - Save output to file

    ✂️ compress_pdf

    Reduce PDF file size using stream optimization, object deduplication, and compression.

    Example, Parameters & Response
    {
      "source": { "path": "/documents/large-report.pdf" },
      "compression_level": 9,
      "output_path": "/output/compressed.pdf"
    }
    Parameter Type Required Default Description
    source object Yes - PDF source
    object_streams string No "generate" "generate" (best) · "preserve" · "disable"
    compression_level integer No 9 1–9 (higher = better compression)
    output_path string No - Save output to file
    password string No - PDF password if encrypted

    Response:

    {
      "results": [{
        "source": "/documents/large-report.pdf",
        "original_size": 5242880,
        "compressed_size": 2097152,
        "compression_ratio": 0.4,
        "bytes_saved": 3145728
      }]
    }
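
The numbers in the sample response fit the reading that `compression_ratio` is compressed size divided by original size (2097152 / 5242880 = 0.4), so a lower ratio means a smaller output. That definition is inferred from the sample, not stated by the project. A short sketch of the arithmetic, with `percent_saved` added as an extra illustrative field:

```python
def compression_stats(original_size, compressed_size):
    """Derive the size statistics shown in the sample response (sizes in bytes)."""
    bytes_saved = original_size - compressed_size
    return {
        "compression_ratio": compressed_size / original_size,
        "bytes_saved": bytes_saved,
        "percent_saved": 100 * bytes_saved / original_size,  # not a server field
    }


if __name__ == "__main__":
    print(compression_stats(5242880, 2097152))
```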

    ✂️ fill_form

    Write values into existing PDF form fields and produce a new PDF.

    Example, Parameters & Limitations
    {
      "source": { "path": "/documents/form.pdf" },
      "field_values": [
        { "name": "full_name", "value": "Jane Smith" },
        { "name": "agree_terms", "checked": true }
      ],
      "output_path": "/output/filled-form.pdf"
    }
    Parameter Type Required Default Description
    source object Yes - PDF source
    field_values array Yes - Fields to fill (see below)
    output_path string No - Save output to file
    password string No - PDF password if encrypted

    Field value format:

    Field Type Description
    name string Field name (use extract_form_fields to discover names)
    value string Text value (for text fields)
    checked boolean Checked state (for checkbox/radio fields)

    Supported field types: Text fields, checkboxes, radio buttons. ComboBox/ListBox selection is read-only.
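
Because field names come from extract_form_fields, a client can build the `field_values` array from that tool's output. The sketch below does this for text and checkbox fields, using the field shapes from the extract_form_fields sample response. The helper name, the read-only check, and the `"radio"` type string are assumptions; only `"text"` and `"checkbox"` appear in the samples.

```python
def build_field_values(fields, answers):
    """Map {field name: answer} onto fill_form entries using extracted field info."""
    by_name = {field["name"]: field for field in fields}
    values = []
    for name, answer in answers.items():
        field = by_name.get(name)
        if field is None:
            raise KeyError(f"no such form field: {name}")
        if field.get("is_read_only"):
            raise ValueError(f"field is read-only: {name}")
        if field["field_type"] == "text":
            values.append({"name": name, "value": str(answer)})
        elif field["field_type"] in ("checkbox", "radio"):  # "radio" is assumed
            values.append({"name": name, "checked": bool(answer)})
        else:
            raise ValueError(f"unsupported field type: {field['field_type']}")
    return values


# Fields as they appear in the extract_form_fields sample response (trimmed).
extracted = [
    {"name": "full_name", "field_type": "text", "is_read_only": False},
    {"name": "agree_terms", "field_type": "checkbox", "is_read_only": False},
]

if __name__ == "__main__":
    print(build_field_values(extracted, {"full_name": "Jane Smith", "agree_terms": True}))
```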

    🔒 protect_pdf

    Add password protection using 256-bit AES encryption.

    Example & Parameters
    {
      "source": { "path": "/documents/confidential.pdf" },
      "user_password": "secret123",
      "allow_print": "none",
      "allow_copy": false
    }
    Parameter Type Required Default Description
    source object Yes - PDF source
    user_password string Yes - Password to open the PDF
    owner_password string No user_password Password to change permissions
    allow_print string No "full" "full" · "low" · "none"
    allow_copy boolean No true Allow copying text/images
    allow_modify boolean No true Allow modifying the document
    output_path string No - Save output to file
    password string No - Password for source PDF if encrypted

    🔓 unprotect_pdf

    Remove password protection from an encrypted PDF.

    Example & Parameters
    {
      "source": { "path": "/documents/protected.pdf" },
      "password": "secret123",
      "output_path": "/output/unprotected.pdf"
    }
    Parameter Type Required Default Description
    source object Yes - PDF source
    password string Yes - Password for the encrypted PDF
    output_path string No - Save output to file

    📦 MCP Resources

    Expose PDFs from configured directories as MCP Resources for direct client discovery and reading.

    Configuration & Details

    Enabling Resources

    # Command line
    pdf-mcp-server --resource-dir /documents --resource-dir /data/pdfs
    
    # Short form
    pdf-mcp-server -r /documents -r /data/pdfs
    
    # Environment variable (colon-separated)
    PDF_RESOURCE_DIRS=/documents:/data/pdfs pdf-mcp-server

    Claude Desktop with resources:

    {
      "mcpServers": {
        "pdf": {
          "command": "npx",
          "args": ["@paradyno/pdf-mcp-server", "--resource-dir", "/documents"],
          "env": {
            "PDF_RESOURCE_DIRS": "/data/pdfs:/shared/documents"
          }
        }
      }
    }

    Both methods can be combined: directories given on the command line are added to the paths from the environment variable.
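
A sketch of how the two inputs might be merged, based only on the description above. The ordering (environment paths first, then command-line paths) and the removal of duplicates are assumptions, and `resource_dirs` is a hypothetical name rather than a function in the server.

```python
import os


def resource_dirs(cli_dirs, environ=None):
    """Combine PDF_RESOURCE_DIRS (colon-separated) with command-line directories."""
    environ = os.environ if environ is None else environ
    env_value = environ.get("PDF_RESOURCE_DIRS", "")
    env_dirs = [entry for entry in env_value.split(":") if entry]
    combined = []
    for directory in env_dirs + list(cli_dirs):
        if directory not in combined:  # drop duplicates, keep first occurrence
            combined.append(directory)
    return combined


if __name__ == "__main__":
    env = {"PDF_RESOURCE_DIRS": "/data/pdfs:/shared/documents"}
    print(resource_dirs(["/documents"], env))
```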

    Resource URIs

    PDFs are exposed with file:// URIs:

    file:///documents/report.pdf
    file:///documents/2024/invoice.pdf
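
For clients that need to move between local paths and these URIs, a minimal sketch using the Python standard library follows. It assumes POSIX-style absolute paths and percent-encoding of special characters; whether the server encodes names the same way is not stated here.

```python
from urllib.parse import quote, unquote, urlsplit


def path_to_uri(path):
    """Turn an absolute POSIX path into a file:// URI."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")
    return "file://" + quote(path)


def uri_to_path(uri):
    """Turn a file:// URI back into a local path."""
    parts = urlsplit(uri)
    if parts.scheme != "file":
        raise ValueError("expected a file:// URI")
    return unquote(parts.path)


if __name__ == "__main__":
    print(path_to_uri("/documents/2024/invoice.pdf"))
    print(uri_to_path("file:///documents/report.pdf"))
```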

    Operations

    • resources/list: Returns all PDFs with URI, name, MIME type, size, and description
    • resources/read: Returns extracted text content, formatted for LLM consumption

    Resources vs Tools vs Caching

    Feature Purpose Use Case
    Resources Passive file discovery Browse and preview available PDFs
    Tools Active PDF processing Extract, search, manipulate PDFs
    CacheRef Tool chaining Pass output between operations

    🔗 Caching

    When cache: true is specified, the server returns a cache_key for use in subsequent requests:

    // Step 1: Extract with caching
    { "sources": [{ "path": "/documents/large.pdf" }], "cache": true }
    
    // Step 2: Use cache_key from response
    { "sources": [{ "cache_key": "a1b2c3d4" }], "pages": "50-60" }
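
On the wire, each step is an MCP `tools/call` request whose `arguments` are the objects shown above. The sketch below builds both messages. It assumes the tool in this example is extract_text, and it leaves reading the `cache_key` out of the first response to the caller, since the exact position of that field in the response is not shown here.

```python
import json


def tool_call(request_id, tool, arguments):
    """Wrap tool arguments in an MCP tools/call JSON-RPC request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def first_request(path):
    """Step 1: extract with caching enabled."""
    return tool_call(1, "extract_text", {"sources": [{"path": path}], "cache": True})


def follow_up_request(cache_key, pages):
    """Step 2: reuse the cached document through its cache_key."""
    arguments = {"sources": [{"cache_key": cache_key}], "pages": pages}
    return tool_call(2, "extract_text", arguments)


if __name__ == "__main__":
    print(json.dumps(first_request("/documents/large.pdf"), indent=2))
    print(json.dumps(follow_up_request("a1b2c3d4", "50-60"), indent=2))
```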

    🏗️ Architecture

    block-beta
      columns 1
      block:server["MCP Server (rmcp)"]
        columns 3
        extract_text search split_pdf
      end
      block:common["Common Layer"]
        columns 3
        Cache["Cache Manager"] Source["Source Resolver"] Batch["Batch Executor"]
      end
      block:pdf["PDF Processing"]
        columns 2
        PDFium["pdfium-render\n(reading)"] qpdf["qpdf FFI\n(manipulation)"]
      end
    
      server --> common --> pdf

    ⚑ Performance

    Benchmarked with a 14-page technical paper (tracemonkey.pdf, ~1 MB) on Docker (Apple Silicon):

    Operation Time What it means
    Extract text (14 pages) 170 ms Process ~80 documents per minute
    Metadata only 0.26 ms ~4,000 documents per second
    Search 0.01 ms Instant results on extracted text
    100 files batch 4.8 s ~21 documents per second

    Key takeaways

    • Fast enough for interactive use: text extraction completes in under 200 ms
    • Metadata is nearly instant: use extract_metadata or summarize_structure to quickly assess documents before full processing
    • Search is very fast: once text is extracted, searching adds almost no cost
    • Batch processing scales linearly: no significant overhead when processing many files

    Run benchmarks yourself:

    docker compose --profile dev run --rm bench

    🧑‍💻 Development

    Docker (Recommended)

    # Build
    docker compose --profile dev run --rm dev cargo build
    
    # Run tests
    docker compose --profile dev run --rm test
    
    # Run tests with coverage
    docker compose --profile dev run --rm coverage
    
    # Format code
    docker compose --profile dev run --rm dev cargo fmt --all
    
    # Lint
    docker compose --profile dev run --rm clippy
    
    # Performance benchmarks
    docker compose --profile dev run --rm bench
    
    # Build production image (~120MB)
    docker compose --profile prod build production
    
    # Clean up
    docker compose --profile dev down --rmi local

    Native Development

    Requires PDFium installed locally. Download from pdfium-binaries and set PDFIUM_PATH.

    cargo build --release
    cargo test
    cargo bench
    cargo llvm-cov --html

    Project Structure

    src/
    ├── main.rs              # Entry point, CLI args
    ├── lib.rs               # Library root
    ├── server.rs            # MCP server & tool handlers
    ├── error.rs             # Error types
    ├── pdf/
    │   ├── reader.rs        # PDFium wrapper (text, metadata, outline)
    │   ├── annotations.rs   # Annotation extraction
    │   ├── images.rs        # Image extraction
    │   └── qpdf.rs          # qpdf FFI (split, merge, encrypt)
    └── source/
        ├── resolver.rs      # Path/URL/Base64 resolution
        └── cache.rs         # LRU caching layer

    🗺️ Roadmap

    Completed Phases

    Phase 1: Core Reading ✅

    extract_text · extract_outline · search · extract_metadata · extract_annotations · Image extraction · Batch processing · Caching

    Phase 2: PDF Manipulation ✅

    split_pdf · merge_pdfs · protect_pdf · unprotect_pdf · compress_pdf · extract_links · get_page_info

    Phase 2.5: LLM-Optimized Text ✅

    Dynamic thresholds · Paragraph detection · Multi-column layout · Watermark removal

    Phase 2.6: Discovery & Resources ✅

    list_pdfs · MCP Resources · Resource directory configuration

    Phase 2.7: Vision & Forms ✅

    convert_page_to_image · extract_form_fields · fill_form · summarize_structure

    Phase 3: Advanced Features (Planned)

    • rotate_pages: Rotate specific pages
    • extract_tables: Structured table extraction
    • add_watermark: Text/image watermarks
    • linearize_pdf: Web optimization
    • OCR support · PDF/A validation · Digital signature verification

    Waiting for MCP Protocol

    • Large file upload: MCP lacks a standard API for uploading large files (>20MB). Discussed in #1197, #1220, #1659.
    • Chunked file transfer: No standard mechanism exists yet.

    Current workarounds: shared filesystem (path), object storage with pre-signed URLs (url), or base64 encoding.

    Deferred Features

    These provide limited value for LLM use cases:

    • Hyphenation merging: LLMs understand hyphenated words
    • Fixed-pitch mode: Limited use cases
    • Bounding box output: LLMs don't need coordinates
    • Invisible text removal: Not supported by pdfium-render API

    📄 License

    Apache License 2.0

    🙏 Acknowledgments

    • PDFium: PDF rendering engine (Apache 2.0)
    • pdfium-render: Rust PDFium bindings (Apache 2.0)
    • qpdf: PDF transformation library, vendored via FFI (Apache 2.0)
    • rmcp: Rust MCP SDK