pdf-parse

A pure TypeScript/JavaScript, cross-platform module for extracting text, images, and tabular data from PDF files.

Contributing Note: When opening an issue, please attach the relevant PDF file if possible. Providing the file will help us reproduce and resolve your issue more efficiently.

Features

Supports Node.js and browsers
CommonJS and ESM support
Extract page text: getText
Extract embedded images: getImage
Render pages as images: pageToImage
Detect and extract tabular data: getTable
For additional usage examples, check the example and test folders.

Similar Packages

pdf2json — Buggy, memory leaks, uncatchable errors in some PDF files.
j-pdfjson — Fork of pdf2json
pdfreader — Uses pdf2json
pdf-extract — Not cross-platform, depends on xpdf

Installation

npm install pdf-parse
# or
pnpm add pdf-parse
# or
yarn add pdf-parse
# or
bun add pdf-parse

Basic Usage

High-level helper for v1 compatibility: pdf
Full API: PDFParse

CommonJS Example, helper for v1 compatibility

const pdf  = require('pdf-parse');
// or 
// const {pdf,PDFParse}  = require('pdf-parse');
const fs = require('fs');

const data = fs.readFileSync('test.pdf');

pdf(data).then(result=>{
    console.log(result.text);
});

getText — Extract Text

// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile } from 'node:fs/promises';

const buffer = await readFile('test/test-01/test.pdf');

const parser = new PDFParse({ data: buffer });
const textResult = await parser.getText();
console.log(textResult.text);

For a complete list of configuration options, see:

DocumentInitParameters - PDF.js document initialization options
ParseParameters - pdf-parse specific options

Usage Examples

Parse password protected PDF: test/test-06-password
Parse only specific pages: test/test-parse-parameters
Parse embedded hyperlinks: test/test-hyperlinks
Load PDF from URL: test/test-types

pageToImage — Render Page to PNG

// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile, writeFile } from 'node:fs/promises';

const buffer = await readFile('test/test-01/test.pdf');

const parser = new PDFParse({ data: buffer });
const result = await parser.pageToImage();

for (const pageData of result.pages) {
    const imgFileName = `page_${pageData.pageNumber}.png`;
    await writeFile(imgFileName, pageData.data, { flag: 'w' });
}

getImage — Extract Embedded Images

// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile, writeFile } from 'node:fs/promises';

const buffer = await readFile('test/test-01/test.pdf');

const parser = new PDFParse({ data: buffer });
const result = await parser.getImage();

for (const pageData of result.pages) {
    for (const pageImage of pageData.images) {
        const imgFileName = `page_${pageData.pageNumber}-${pageImage.fileName}.png`;
        await writeFile(imgFileName, pageImage.data, { flag: 'w' });
    }
}

getTable — Extract Tabular Data

// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile } from 'node:fs/promises';

const buffer = await readFile('test/test-01/test.pdf');

const parser = new PDFParse({ data: buffer });
const result = await parser.getTable();

for (const pageData of result.pages) {
    for (const table of pageData.tables) {
        console.log(table);
    }
}

Example — Web / Browser

After running npm run build, you will find both regular and minified browser bundles in dist/browser (e.g., pdf-parse.es.js and pdf-parse.es.min.js).
See a minimal browser example in example/browser/pdf-parse.es.html.

You can use the minified versions (.min.js) for production to reduce file size, or the regular versions for development and debugging.

Inline browser usage example:

You can use any of the following browser bundles depending on your module system and requirements:

pdf-parse.es.js or pdf-parse.es.min.js for ES modules

pdf-parse.umd.js or pdf-parse.umd.min.js for UMD/global usage

You can include the browser bundle directly from a CDN. Use the latest version:

Or specify a particular version:

<!-- Import the ES browser bundle built to dist/browser/pdf-parse.es.js -->
<script type="module">
import { pdf } from './dist/browser/pdf-parse.es.js';

const input = document.querySelector('#file');
const btn = document.querySelector('#parse');
const out = document.querySelector('#output');

btn.addEventListener('click', async () => {
  const f = input.files?.[0];
  if (!f) {
    out.textContent = 'Please select a PDF file.';
    return;
  }
  const ab = await f.arrayBuffer();
  const result = await pdf(new Uint8Array(ab));
  out.textContent = result.text?.slice(0, 10000) ?? 'No text found.';
});
</script>

Options

Most options are forwarded to pdfjs (getDocument). Common ParseOptions supported by the public API:

data: ArrayBuffer | Uint8Array | TypedArray | number[]
The binary PDF data. Prefer Uint8Array to reduce main-thread memory usage (typed arrays can be transferred to the worker).
url: string | URL
Remote PDF URL. The helper pdf() accepts a URL instance.
partial: boolean (default: false)
Enable partial parsing of pages. When true, use first and/or last.
first: number
If partial is true, parse the first N pages.
last: number
If partial is true, parse the last N pages.
verbosity: pdfjs.VerbosityLevel
Controls pdf.js logging. The library sets a default (ERRORS), but you can override it.
cMapUrl, cMapPacked, standardFontDataUrl (browser)
Paths to cmap and standard font data when running in the browser build.

Note: Any other options accepted by pdfjs getDocument() may also be provided and will be forwarded.

Examples

Node / ESM (text extraction, partial parsing)

import fs from 'fs/promises';
import { PDFParse, pdf } from 'pdf-parse';

// Using the helper
const result = await pdf(new Uint8Array(await fs.readFile('test/test-01/test.pdf')));
console.log(result.text);

// Full API with options
const data = new Uint8Array(await fs.readFile('test/test-01/test.pdf'));
const parser = new PDFParse({ data, partial: true, first: 2 }); // Only the first 2 pages
const textRes = await parser.getText();
console.log(textRes.pages.length);

Browser (file input, custom cmaps)

<input id="file" type="file" accept="application/pdf">
<button id="parse">Parse</button>
<pre id="out"></pre>

<script type="module">
import { pdf, PDFParse } from './dist/browser/pdf-parse.es.js';

document.querySelector('#parse').addEventListener('click', async () => {
  const f = document.querySelector('#file').files[0];
  if (!f) return;
  const ab = await f.arrayBuffer();
  // Using the helper
  const res = await pdf(new Uint8Array(ab));
  document.querySelector('#out').textContent = res.text.slice(0, 200);

  // Or full API with browser-specific options
  const parser = new PDFParse({
    data: new Uint8Array(ab),
    cMapUrl: '/cmaps/',
    cMapPacked: true
  });
  const pages = await parser.pageToImage();
  console.log(pages);
});
</script>

Worker Note: In browser environments, the package sets pdfjs.GlobalWorkerOptions.workerSrc automatically when imported from the built browser bundle. If you use a custom build or host pdf.worker yourself, configure pdfjs accordingly.