Package Exports
- pdf-parse
- pdf-parse/node
- pdf-parse/worker
pdf-parse
Pure TypeScript, cross-platform module for extracting text, images, and tables from PDFs.
Run 🤗 directly in your browser or in Node!
Getting Started with v2 (Coming from v1)
// v1
const pdf = require('pdf-parse');
pdf(buffer).then(result => console.log(result.text));
// v2
const { PDFParse } = require('pdf-parse');
// or use the bundled build
// const { PDFParse } = require('pdf-parse/node');
const parser = new PDFParse({ data: buffer });
parser.getText().then((result) => {
  console.log(result.text);
}).finally(async () => {
  await parser.destroy();
});
Features
- CommonJS, ESM, Node.js, and browser support.
- Can be integrated with React, Vue, Angular, or any other web framework.
- Security Policy
- Retrieve headers and validate PDF: getHeader
- Extract document info: getInfo
- Extract page text: getText
- Render pages as PNG: getScreenshot
- Extract embedded images: getImage
- Detect and extract tabular data: getTable
- Well-covered with unit tests
- Integration tests to validate end-to-end behavior across environments.
- See DocumentInitParameters and ParseParameters for all available options.
- For usage examples, see the live_demo, example, test, and test/example folders.
Installation
npm install pdf-parse
# or
pnpm add pdf-parse
# or
yarn add pdf-parse
# or
bun add pdf-parse
Usage
getHeader
// Node / ESM
import { PDFParse } from 'pdf-parse';
const parser = new PDFParse({ url: 'https://bitcoin.org/bitcoin.pdf' });
// HEAD request to retrieve HTTP headers and file size without downloading the full file.
// Pass `true` to check PDF magic bytes via range request
const headerResult = await parser.getHeader(true);
console.log(`Status: ${headerResult.status}`);
console.log(`Content-Length: ${headerResult.size}`);
console.log(`Is PDF: ${headerResult.isPdf}`);
console.log(`Headers:`, headerResult.headers);
// The getHeader function can also be used directly,
// without creating a PDFParse instance, by importing it from pdf-parse.
import { getHeader } from 'pdf-parse';
const directResult = await getHeader('https://bitcoin.org/bitcoin.pdf', true);
Usage Examples:
- Optionally validates PDFs by fetching the first 4 bytes (magic bytes).
- Useful for checking file existence, size, and type before full parsing.
- For URL-based PDFs, ensure CORS is configured if used in browsers.
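To make the magic-bytes check concrete, here is a minimal sketch of what validating the first 4 bytes can look like. `hasPdfMagicBytes` is a hypothetical helper written for illustration; it is not part of pdf-parse, and the library's own check may differ.

```javascript
// Hypothetical helper (not a pdf-parse API): checks whether a byte array
// starts with the PDF signature "%PDF" (0x25 0x50 0x44 0x46).
function hasPdfMagicBytes(bytes) {
  const magic = [0x25, 0x50, 0x44, 0x46]; // "%PDF"
  if (bytes.length < magic.length) return false;
  return magic.every((byte, i) => bytes[i] === byte);
}

// Example with in-memory bytes; in practice these would be the first
// bytes returned by a range request.
const pdfLike = new TextEncoder().encode('%PDF-1.7\n');
const htmlLike = new TextEncoder().encode('<html>');
console.log(hasPdfMagicBytes(pdfLike));  // true
console.log(hasPdfMagicBytes(htmlLike)); // false
```

A check like this only confirms the signature; it says nothing about whether the rest of the file is a well-formed PDF.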
getInfo — Extract Metadata and Document Information
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const info = await parser.getInfo();
await parser.destroy();
console.log(`Total pages: ${info.total}`);
console.log(`Title: ${info.info?.Title}`);
console.log(`Author: ${info.info?.Author}`);
console.log(`Creator: ${info.info?.Creator}`);
console.log(`Producer: ${info.info?.Producer}`);
// Access parsed date information
const dates = info.getDateNode();
console.log(`Creation Date: ${dates.CreationDate}`);
console.log(`Modification Date: ${dates.ModDate}`);
// Links, pageLabel, width, height (when `parsePageInfo` is true)
console.log('Per-page information:', info.pages);
Usage Examples:
- Parse hyperlinks from pages: test/test-01-get-info
- To extract hyperlinks, pass { parsePageInfo: true }
getText — Extract Text
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const textResult = await parser.getText();
await parser.destroy();
console.log(textResult.text);
For a complete list of configuration options, see:
- DocumentInitParameters - document initialization options
- ParseParameters - parse options
Usage Examples:
- Parse password protected PDF: password.test.ts
- Parse only specific pages: specific-pages.test.ts
- Parse embedded hyperlinks: hyperlink.test.ts
- Set verbosity level: password.test.ts
- Load PDF from URL: url.test.ts
- Load PDF from base64 data: base64.test.ts
- Loading large files (> 5 MB): large-file.test.ts
getScreenshot — Render Pages as PNG
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile, writeFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const result = await parser.getScreenshot();
await parser.destroy();
for (const pageData of result.pages) {
  const imgFileName = `page_${pageData.pageNumber}.png`;
  await writeFile(imgFileName, pageData.data, { flag: 'w' });
}
Usage Examples:
- Limit output resolution or specific pages using ParseParameters
- getScreenshot({scale:1.5}): increase rendering scale (higher DPI / larger image)
- getScreenshot({desiredWidth:1024}): request a target width in pixels; height scales to keep aspect ratio
- imageDataUrl (default: true): include base64 data URL string in the result.
- imageBuffer (default: true): include a binary buffer for each image.
- Select specific pages with partial (e.g. getScreenshot({ partial: [1,3] }))
- partial overrides first/last.
- Use first to render the first N pages (e.g. getScreenshot({ first: 3 })).
- Use last to render the last N pages (e.g. getScreenshot({ last: 2 })).
- When both first and last are provided they form an inclusive range (first..last).
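The page-selection rules above can be sketched as a small standalone function. `resolvePages` is a hypothetical helper that mirrors the documented behavior for illustration only; it is not pdf-parse's implementation. Two details are assumptions: out-of-range page numbers are dropped, and an empty options object selects every page.

```javascript
// Hypothetical helper (not a pdf-parse API): returns the 1-based page numbers
// that would be selected for a document with `total` pages.
function resolvePages(total, { partial, first, last } = {}) {
  const all = Array.from({ length: total }, (_, i) => i + 1);

  // `partial` overrides `first`/`last`.
  if (Array.isArray(partial) && partial.length > 0) {
    // Assumption: page numbers outside 1..total are ignored.
    return partial.filter((page) => page >= 1 && page <= total);
  }
  // Both given: inclusive range first..last.
  if (first != null && last != null) {
    return all.filter((page) => page >= first && page <= last);
  }
  // Only `first`: the first N pages.
  if (first != null) return all.slice(0, first);
  // Only `last`: the last N pages.
  if (last != null) return all.slice(Math.max(total - last, 0));
  // Assumption: no selection options means all pages.
  return all;
}

console.log(resolvePages(5, { partial: [1, 3], first: 2 })); // [ 1, 3 ]
console.log(resolvePages(5, { first: 3 }));                  // [ 1, 2, 3 ]
console.log(resolvePages(5, { last: 2 }));                   // [ 4, 5 ]
console.log(resolvePages(5, { first: 2, last: 4 }));         // [ 2, 3, 4 ]
```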
getImage — Extract Embedded Images
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile, writeFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const result = await parser.getImage();
await parser.destroy();
for (const pageData of result.pages) {
  for (const pageImage of pageData.images) {
    const imgFileName = `page_${pageData.pageNumber}-${pageImage.name}.png`;
    await writeFile(imgFileName, pageImage.data, { flag: 'w' });
  }
}
Usage Examples:
- Exclude images whose width or height is <= 50 px: getImage({ imageThreshold: 50 })
- Default imageThreshold is 80 (pixels)
- Useful for excluding tiny decorative or tracking images.
- To disable size-based filtering and include all images, set imageThreshold: 0.
- imageDataUrl (default: true): include base64 data URL string in the result.
- imageBuffer (default: true): include a binary buffer for each image.
- Extract images from specific pages: getImage({ partial: [2,4] })
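As a rough illustration of the size filter described above, the sketch below applies the same rule to plain objects. `passesImageThreshold` and the `{ width, height }` shape are hypothetical and exist only for this example; the library's internal filtering may be implemented differently.

```javascript
// Hypothetical helper (not a pdf-parse API): keeps an image only when both
// dimensions exceed the threshold. A threshold of 0 disables filtering.
function passesImageThreshold(image, imageThreshold = 80) {
  if (imageThreshold <= 0) return true;
  return image.width > imageThreshold && image.height > imageThreshold;
}

// Assumed sample data for illustration.
const images = [
  { name: 'logo', width: 400, height: 120 },
  { name: 'tracking-pixel', width: 1, height: 1 },
  { name: 'divider', width: 600, height: 50 },
];

console.log(images.filter((img) => passesImageThreshold(img, 50)).map((img) => img.name));
// [ 'logo' ]
console.log(images.filter((img) => passesImageThreshold(img, 0)).length);
// 3
```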
getTable — Extract Tabular Data
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const result = await parser.getTable();
await parser.destroy();
for (const pageData of result.pages) {
  for (const table of pageData.tables) {
    console.log(table);
  }
}
Worker Configuration (Node.js / Backend)
If you only need the default behavior you can ignore worker configuration — pdf-parse will automatically configure the worker for most environments. If you need advanced or platform-specific instructions, see: README.worker.md
Error Handling
import { PDFParse, VerbosityLevel } from 'pdf-parse';
// `buffer` is assumed to already hold the PDF bytes (for example, from readFile).
const parser = new PDFParse({ data: buffer, verbosity: VerbosityLevel.WARNINGS });
try {
  const result = await parser.getText();
  console.log(result.text);
} catch (error) {
  console.error('PDF parsing failed:', error);
} finally {
  // Always call destroy() to free memory
  await parser.destroy();
}
Web / Browser
- Can be integrated into React, Vue, Angular, or any other web framework.
- Live Demo: https://mehmet-kozan.github.io/pdf-parse/
- Demo Source: reports_site/live_demo
CDN Usage
<!-- ES Module -->
<script type="module">
  import { PDFParse } from 'https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/browser/pdf-parse.es.min.js';
</script>
| Bundle Type | Development | Production (Minified) |
|---|---|---|
| ES Module | pdf-parse.es.js | pdf-parse.es.min.js |
| UMD/Global | pdf-parse.umd.js | pdf-parse.umd.min.js |
CDN Options: https://www.jsdelivr.com/package/npm/pdf-parse
- https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/browser/pdf-parse.es.js
- https://cdn.jsdelivr.net/npm/pdf-parse@2.2.7/dist/browser/pdf-parse.es.min.js
- https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/browser/pdf-parse.umd.js
- https://cdn.jsdelivr.net/npm/pdf-parse@2.2.7/dist/browser/pdf-parse.es.umd.js
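The CDN URLs follow a regular pattern (package version, then dist/browser, then the bundle file name from the table). The sketch below builds such a URL. `buildCdnUrl` is a hypothetical helper for illustration, not something pdf-parse exports, and it only covers the four bundle file names listed in the table.

```javascript
// Hypothetical helper (not a pdf-parse API): builds a jsDelivr URL for one of
// the browser bundles listed in the table above.
function buildCdnUrl({ version = 'latest', format = 'es', minified = true } = {}) {
  if (format !== 'es' && format !== 'umd') {
    throw new RangeError(`Unknown bundle format: ${format}`);
  }
  const file = `pdf-parse.${format}${minified ? '.min' : ''}.js`;
  return `https://cdn.jsdelivr.net/npm/pdf-parse@${version}/dist/browser/${file}`;
}

console.log(buildCdnUrl());
// https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/browser/pdf-parse.es.min.js
console.log(buildCdnUrl({ version: '2.2.7', format: 'umd', minified: false }));
// https://cdn.jsdelivr.net/npm/pdf-parse@2.2.7/dist/browser/pdf-parse.umd.js
```

Pinning an explicit version instead of `latest` keeps a page from picking up a new release unexpectedly.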
Worker Configuration
In browser environments, pdf-parse requires a separate worker file to process PDFs in a background thread. By default, pdf-parse automatically loads the worker from the jsDelivr CDN. However, you can configure a custom worker source if needed.
When to Configure Worker Source:
- Using a custom build of pdf-parse
- Self-hosting worker files for security or offline requirements
- Using a different CDN provider
Available Worker Files:
- https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/browser/pdf.worker.mjs
- https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/browser/pdf.worker.min.mjs
See example/basic.esm.worker.html for a working example of browser usage with worker configuration.
Similar Packages
- pdf2json: Buggy, memory leaks, uncatchable errors in some PDF files.
- pdfdataextract: pdf-parse based
- unpdf: pdf-parse based
- pdf-extract: Non cross-platform, depends on xpdf
- j-pdfjson: Fork of pdf2json
- pdfreader: Uses pdf2json
Benchmark Note: The benchmark currently runs only against pdf2json. I don't know the current state of pdf2json; the original reason for creating pdf-parse was to work around stability issues with pdf2json. I deliberately did not include pdf-parse or other pdf.js-based packages in the benchmark because their dependencies conflict. If you have recommendations for additional packages to include, please open an issue. See benchmark results.
Supported Node.js Versions
- Supported: Node.js 20 (>= 20.16.0), Node.js 22 (>= 22.3.0), Node.js 23 (>= 23.0.0), and Node.js 24 (>= 24.0.0).
- Not supported: Node.js 21.x, and Node.js 19.x and earlier.
Integration tests run on Node.js 20–24, see test_integration.yml.
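The version ranges above can be checked at startup with a small guard. This is a minimal sketch: `isSupportedNode` is a hypothetical helper, not part of pdf-parse, and it treats any major version not listed above (including future ones) as unsupported, which is an assumption.

```javascript
// Hypothetical helper (not a pdf-parse API). Minimum [minor, patch] per
// supported major version, taken from the list above.
const MINIMUMS = { 20: [16, 0], 22: [3, 0], 23: [0, 0], 24: [0, 0] };

function isSupportedNode(version) {
  const [major, minor = 0, patch = 0] = version
    .replace(/^v/, '')
    .split('.')
    .map(Number);
  const minimum = MINIMUMS[major];
  // Assumption: majors not listed (19 and earlier, 21, or newer than 24)
  // are reported as unsupported.
  if (!minimum) return false;
  const [minMinor, minPatch] = minimum;
  return minor > minMinor || (minor === minMinor && patch >= minPatch);
}

console.log(isSupportedNode('20.16.0')); // true
console.log(isSupportedNode('20.15.1')); // false
console.log(isSupportedNode('v21.7.0')); // false
console.log(isSupportedNode('22.3.0'));  // true
```

In a real script the current runtime version is available as `process.versions.node`.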
Contributing
When opening an issue, please attach the relevant PDF file if possible. Providing the file will help us reproduce and resolve your issue more efficiently. For detailed guidelines on how to contribute, report bugs, or submit pull requests, see: contributing to pdf-parse