JSPM

pdftotext-mcp

1.0.0
  • License MIT

A reliable Model Context Protocol server for PDF text extraction using pdftotext from poppler-utils

Package Exports

  • pdftotext-mcp
  • pdftotext-mcp/src/server.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (pdftotext-mcp) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

PDFtotext MCP Server

A reliable Model Context Protocol (MCP) server for PDF text extraction using the proven pdftotext utility from poppler-utils.


🚀 Why This Server?

Unlike other PDF MCP servers that suffer from logging interference, complex dependencies, and reliability issues, pdftotext-mcp is:

  • ✅ Actually works - Clean JSON-RPC communication without stdout pollution
  • ✅ Reliable - Built on mature pdftotext from poppler-utils (used by millions)
  • ✅ Lightweight - Minimal dependencies, maximum compatibility
  • ✅ Production tested - Successfully tested with Claude Desktop and other MCP clients
  • ✅ Feature complete - Page-specific extraction, layout preservation, encoding options
  • ✅ Error handling - Comprehensive validation and helpful error messages

📋 Features

  • 📄 Extract text from entire PDF documents or specific pages
  • 🎨 Preserve original layout formatting (optional)
  • 🔤 Multiple text encoding support (UTF-8, Latin1, ASCII)
  • 📊 Comprehensive metadata in responses (word count, file info, etc.)
  • 🛡️ File validation and security checks
  • ⚡ Fast processing with configurable timeouts
  • 🔍 Detailed error reporting with troubleshooting hints

🔧 Prerequisites

You must have pdftotext installed on your system:

Ubuntu/Debian

sudo apt update
sudo apt install poppler-utils

macOS

brew install poppler

Windows

# Using Chocolatey
choco install poppler

# Using Scoop
scoop install poppler

Verify Installation

pdftotext -v

📦 Installation

Option 1: Install Globally with npm

npm install -g pdftotext-mcp

Option 2: Use with npx (No Installation)

npx pdftotext-mcp

Option 3: Local Development

git clone https://github.com/jpwebb/pdftotext-mcp.git
cd pdftotext-mcp
npm install
npm start

โš™๏ธ Configuration

Add to your MCP client configuration:

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "pdftotext": {
      "command": "pdftotext-mcp"
    }
  }
}

Or with npx:

{
  "mcpServers": {
    "pdftotext": {
      "command": "npx",
      "args": ["pdftotext-mcp"]
    }
  }
}

Other MCP Clients

The server works with any MCP-compatible client. Use pdftotext-mcp as the command.

🎯 Usage

The server provides a single, powerful tool: read_pdf_text

Basic Usage

Extract entire document

{
  "tool": "read_pdf_text",
  "arguments": {
    "path": "./document.pdf"
  }
}

Extract specific page

{
  "tool": "read_pdf_text",
  "arguments": {
    "path": "./document.pdf",
    "page": 2
  }
}

Preserve layout formatting

{
  "tool": "read_pdf_text",
  "arguments": {
    "path": "./document.pdf",
    "layout": true
  }
}

Custom encoding

{
  "tool": "read_pdf_text",
  "arguments": {
    "path": "./document.pdf",
    "encoding": "Latin1"
  }
}
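
The examples above show the tool name and its arguments in a simplified form. On the wire, MCP clients wrap a tool call in a JSON-RPC 2.0 request with the method tools/call. The following is a minimal sketch of that envelope; buildReadPdfRequest is a hypothetical helper name, not part of this package.

```javascript
// Hypothetical helper (not part of pdftotext-mcp): wraps read_pdf_text
// arguments in the JSON-RPC 2.0 "tools/call" envelope used by MCP.
function buildReadPdfRequest(id, args) {
  if (typeof args.path !== "string" || args.path.length === 0) {
    throw new TypeError("path is required");
  }
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "read_pdf_text", arguments: { ...args } },
  };
}

const request = buildReadPdfRequest(1, { path: "./document.pdf", page: 2 });
console.log(JSON.stringify(request, null, 2));
```

Most MCP clients build this envelope for you, so it mainly matters when debugging the raw traffic.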

Response Format

Success Response

{
  "success": true,
  "file": "document.pdf",
  "path": "/absolute/path/to/document.pdf",
  "extractedText": "Full text content...",
  "pageSpecific": "all",
  "layoutPreserved": false,
  "encoding": "UTF-8",
  "fileSize": 1048576,
  "lastModified": "2024-01-15T10:30:00.000Z",
  "extractedAt": "2024-01-15T10:35:00.000Z",
  "textLength": 5234,
  "wordCount": 892
}
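
The textLength and wordCount fields can be reproduced on the client side for a sanity check. The sketch below assumes that a word is a run of non-whitespace characters and that length is measured in JavaScript string units; the server's actual counting rule is not documented here and may differ. textStats is a hypothetical name.

```javascript
// Hypothetical helper: derives textLength and wordCount from extracted text.
// Assumption: words are runs of non-whitespace characters.
function textStats(extractedText) {
  const trimmed = extractedText.trim();
  return {
    textLength: extractedText.length, // UTF-16 code units, not bytes
    wordCount: trimmed === "" ? 0 : trimmed.split(/\s+/).length,
  };
}

console.log(textStats("Full text content..."));
```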

Error Response

{
  "success": false,
  "error": "File not found: ./nonexistent.pdf",
  "errorType": "FILE_NOT_FOUND",
  "file": "./nonexistent.pdf",
  "timestamp": "2024-01-15T10:35:00.000Z"
}
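
Because both response shapes carry a success flag, a caller can branch on it. This is a minimal sketch that assumes the response has already been parsed into an object with the fields shown above; unwrapPdfResponse is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the extracted text on success and throws an
// Error carrying errorType on failure.
function unwrapPdfResponse(response) {
  if (response.success === true) {
    return response.extractedText;
  }
  const err = new Error(response.error ?? "Unknown error");
  err.errorType = response.errorType ?? "UNKNOWN_ERROR";
  throw err;
}

try {
  unwrapPdfResponse({
    success: false,
    error: "File not found: ./nonexistent.pdf",
    errorType: "FILE_NOT_FOUND",
  });
} catch (err) {
  console.error(`${err.errorType}: ${err.message}`);
}
```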

📚 API Reference

Tool: read_pdf_text

Extracts text content from PDF files using pdftotext.

Parameters

Parameter | Type    | Required | Default   | Description
path      | string  | ✅       | -         | Path to PDF file (relative or absolute)
page      | number  | ❌       | all pages | Specific page to extract (1-based)
layout    | boolean | ❌       | false     | Preserve original text layout
encoding  | string  | ❌       | "UTF-8"   | Output text encoding
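
To illustrate how these parameters could translate into a pdftotext command line, here is a sketch that uses the real pdftotext options -f and -l (first and last page), -layout, and -enc, with "-" as the output file so that text goes to stdout. The function name and the exact mapping are assumptions; the server's own implementation may differ, for example in how it translates encoding names (pdftotext -listenc prints the names poppler accepts).

```javascript
// Hypothetical helper: maps read_pdf_text parameters to pdftotext arguments.
// This is an illustration, not the server's actual code.
function buildPdftotextArgs({ path, page, layout = false, encoding = "UTF-8" }) {
  const args = [];
  if (page !== undefined) {
    if (!Number.isInteger(page) || page < 1) {
      throw new RangeError("page must be a positive integer (1-based)");
    }
    // Extract a single page by setting first and last page to the same value.
    args.push("-f", String(page), "-l", String(page));
  }
  if (layout) args.push("-layout");
  // "-" as the output file tells pdftotext to write to stdout.
  args.push("-enc", encoding, path, "-");
  return args;
}

console.log(buildPdftotextArgs({ path: "./document.pdf", page: 2, layout: true }));
```

Passing the arguments as an array to a process-spawning API, rather than building a shell string, avoids quoting problems with unusual file names.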

Supported Encodings

  • UTF-8 (default)
  • Latin1
  • ASCII

Error Types

  • FILE_NOT_FOUND - PDF file doesn't exist
  • PERMISSION_DENIED - Cannot read the file
  • INVALID_PDF - File is not a valid PDF
  • PDFTOTEXT_ERROR - pdftotext utility error
  • UNKNOWN_ERROR - Unexpected error
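
One plausible way to arrive at these error types is to map Node.js file-system error codes onto them. The mapping below is an assumption for illustration only; how the server actually classifies errors is not documented here, and classifyFileError is a hypothetical name.

```javascript
// Hypothetical mapping from Node.js file-system error codes to the error
// types listed above. Not the server's actual logic.
function classifyFileError(err) {
  switch (err && err.code) {
    case "ENOENT":
      return "FILE_NOT_FOUND";
    case "EACCES":
    case "EPERM":
      return "PERMISSION_DENIED";
    default:
      return "UNKNOWN_ERROR";
  }
}

console.log(classifyFileError({ code: "ENOENT" }));
```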

🔧 Troubleshooting

"pdftotext is not available"

Solution: Install poppler-utils (see Prerequisites)

"File not found"

Solutions:

  • Use absolute paths: /home/user/document.pdf
  • Check file exists: ls -la /path/to/file.pdf
  • Verify MCP server working directory

"Permission denied"

Solutions:

  • Check file permissions with ls -l document.pdf and, if needed, make the file readable: chmod 644 document.pdf
  • Ensure directory is readable: chmod 755 /path/to/directory/

"File is not a valid PDF"

Solutions:

  • Verify file is actually a PDF: file document.pdf
  • Check for file corruption
  • Try with a different PDF file
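
A quick programmatic check is to look at the first bytes of the file, since a PDF normally begins with the ASCII signature %PDF-. The sketch below is stricter than some PDF readers, which tolerate leading junk before the header, and it says nothing about whether the rest of the file is intact. looksLikePdf is a hypothetical helper, not part of this package.

```javascript
// Hypothetical check: does the byte sequence start with "%PDF-"?
function looksLikePdf(bytes) {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

const header = new TextEncoder().encode("%PDF-1.7\n");
console.log(looksLikePdf(header));
```

In practice the bytes would come from reading the start of the file, for example the first five bytes of the buffer returned by a file read.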

MCP Connection Issues

Solutions:

  • Restart your MCP client completely
  • Check configuration syntax in config file
  • Verify pdftotext-mcp is accessible in PATH
  • Check MCP client logs for detailed errors
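
Configuration syntax errors are a common cause of connection problems, so it can help to validate the config text before restarting the client. The following sketch checks only the fields used in the Configuration section; checkMcpConfig and the default server key "pdftotext" are illustrative assumptions.

```javascript
// Hypothetical helper: returns a list of problems found in a Claude Desktop
// style config. An empty list means the checked fields look fine.
function checkMcpConfig(jsonText, serverName = "pdftotext") {
  let config;
  try {
    config = JSON.parse(jsonText);
  } catch (e) {
    return [`Invalid JSON: ${e.message}`];
  }
  const problems = [];
  const entry = config?.mcpServers?.[serverName];
  if (!entry) {
    problems.push(`Missing mcpServers.${serverName}`);
  } else {
    if (typeof entry.command !== "string") {
      problems.push("command must be a string");
    }
    if (entry.args !== undefined && !Array.isArray(entry.args)) {
      problems.push("args must be an array");
    }
  }
  return problems;
}

const sample = '{"mcpServers": {"pdftotext": {"command": "npx", "args": ["pdftotext-mcp"]}}}';
console.log(checkMcpConfig(sample));
```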

🧪 Testing

# Run tests
npm test

# Run tests with watch mode
npm run test:watch

# Run linter
npm run lint

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

git clone https://github.com/jpwebb/pdftotext-mcp.git
cd pdftotext-mcp
npm install

Running Locally

npm start

Code Style

This project uses ESLint. Run npm run lint to check code style.

📄 License

MIT - see LICENSE file for details.

๐Ÿ™ Acknowledgments


Made for the MCP community