pdf-parse
A pure JavaScript, cross-platform module to extract text, images, and tables from PDF files.
Similar packages
- pdf2json — buggy, no longer supported, memory leaks, and throws uncaught fatal errors
- j-pdfjson — fork of pdf2json
- pdf-parser — buggy, no tests
- pdfreader — uses pdf2json
- pdf-extract — not cross-platform (depends on xpdf)
Installation
npm install pdf-parse
Basic Usage
API
GetText — text extraction
// Node / ESM
import { PDFParse, pdf } from 'pdf-parse';
import { readFile } from 'node:fs/promises';
const data = await readFile('test/test-01/test.pdf');
const buffer = new Uint8Array(data);
// Using helper function
const result = await pdf(buffer);
// Using the class
const parser = new PDFParse({ data: buffer });
const textResult = await parser.GetText();
console.log(textResult.text);
PageToImage — render page to PNG
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile, writeFile } from 'node:fs/promises';
const data = await readFile('test/test-01/test.pdf');
const buffer = new Uint8Array(data);
// Using the class
const parser = new PDFParse({ data: buffer });
const result = await parser.PageToImage();
for (const pageData of result.pages) {
  const imgFileName = `page_${pageData.pageNumber}.png`;
  await writeFile(imgFileName, pageData.data, {
    flag: 'w',
  });
}
GetImage — extract embedded images
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile, writeFile } from 'node:fs/promises';
const data = await readFile('test/test-01/test.pdf');
const buffer = new Uint8Array(data);
// Using the class
const parser = new PDFParse({ data: buffer });
const result = await parser.GetImage();
for (const pageData of result.pages) {
  for (const pageImage of pageData.images) {
    const imgFileName = `page_${pageData.pageNumber}-${pageImage.fileName}.png`;
    await writeFile(imgFileName, pageImage.data, {
      flag: 'w',
    });
  }
}
GetTable — extract tabular data
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile } from 'node:fs/promises';
const data = await readFile('test/test-01/test.pdf');
const buffer = new Uint8Array(data);
// Using the class
const parser = new PDFParse({ data: buffer });
const result = await parser.GetTable();
for (const pageData of result.pages) {
  for (const table of pageData.tables) {
    console.log(table);
  }
}
Example — Web / Browser
- After running npm run build, use the browser bundle in dist/browser.
- See a minimal browser example in example/browser/pdf-parse.es.html.
Inline browser usage example:
<!-- Elements the script below looks up by id -->
<input id="file" type="file" accept="application/pdf">
<button id="parse">Parse</button>
<pre id="output"></pre>
<!-- Import the ES browser bundle built to dist/browser/pdf-parse.es.js -->
<script type="module">
  import { pdf } from './dist/browser/pdf-parse.es.js';
  const input = document.querySelector('#file');
  const btn = document.querySelector('#parse');
  const out = document.querySelector('#output');
  btn.addEventListener('click', async () => {
    const f = input.files?.[0];
    if (!f) {
      out.textContent = 'Please select a PDF file.';
      return;
    }
    const ab = await f.arrayBuffer();
    const result = await pdf(new Uint8Array(ab));
    // Show at most the first 10,000 characters; fall back when the text is empty or missing
    out.textContent = result.text?.slice(0, 10000) || 'No text found.';
  });
</script>
Options
The library forwards most options to pdfjs (getDocument). Common ParseOptions supported by the public API:
- data: ArrayBuffer | Uint8Array | TypedArray | number[]
  Binary PDF data. Prefer Uint8Array to reduce main-thread memory usage (typed arrays may be transferred to the worker).
- url: string | URL
  Remote PDF URL. The helper pdf() accepts a URL instance.
- partial: boolean (default: false)
  Enable partial parsing of pages. When true, use first and/or last.
- first: number
  If partial is true, parse the first N pages.
- last: number
  If partial is true, parse the last N pages.
- verbosity: pdfjs.VerbosityLevel
  Controls pdf.js logging. The library sets a default (ERRORS) but you can override it.
- cMapUrl, cMapPacked, standardFontDataUrl (browser)
  Paths to cmap and standard font data when running in the browser build.
Note: Any other options accepted by pdfjs getDocument() may also be provided and will be forwarded.
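As a rough illustration of how the options above combine, the sketch below assembles an options object from either binary data or a URL, and turns on partial parsing only when a page limit is given. The buildParseOptions helper is hypothetical and not part of pdf-parse; the byte array and the example.com URL are placeholders. It uses only the option names listed above.

```javascript
// Hypothetical helper (not part of pdf-parse): builds an options object
// using only the option names documented in the list above.
function buildParseOptions(source, range = {}) {
  const options = {};

  // A string or URL instance is treated as a remote PDF location;
  // anything else is assumed to be binary PDF data.
  if (typeof source === 'string' || source instanceof URL) {
    options.url = source;
  } else {
    options.data = source;
  }

  // Copy page limits only when they are positive integers.
  const { first, last } = range;
  if (Number.isInteger(first) && first > 0) options.first = first;
  if (Number.isInteger(last) && last > 0) options.last = last;

  // Enable partial parsing only when a page limit was requested.
  if ('first' in options || 'last' in options) options.partial = true;

  return options;
}

// First two pages of an in-memory PDF (placeholder bytes spelling "%PDF").
const bytes = new Uint8Array([0x25, 0x50, 0x44, 0x46]);
const localOptions = buildParseOptions(bytes, { first: 2 });

// Last page of a remote PDF (placeholder URL).
const remoteOptions = buildParseOptions(new URL('https://example.com/file.pdf'), { last: 1 });

console.log(localOptions, remoteOptions);
```

The resulting object would then be handed to the parser, for example new PDFParse(localOptions), in the same way as the object literals in the examples that follow.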
Examples
Node / ESM (text extraction, partial)
import fs from 'fs/promises';
import { PDFParse, pdf } from 'pdf-parse';
// helper
const result = await pdf(new Uint8Array(await fs.readFile('test/test-01/test.pdf')));
console.log(result.text);
// full API with options
const data = new Uint8Array(await fs.readFile('test/test-01/test.pdf'));
const parser = new PDFParse({ data, partial: true, first: 2 }); // only first 2 pages
const textRes = await parser.GetText();
console.log(textRes.pages.length);
Browser (file input, custom cmaps)
<input id="file" type="file" accept="application/pdf">
<button id="parse">Parse</button>
<pre id="out"></pre>
<script type="module">
  import { pdf, PDFParse } from './dist/browser/pdf-parse.es.js';
  document.querySelector('#parse').addEventListener('click', async () => {
    const f = document.querySelector('#file').files[0];
    if (!f) return;
    const ab = await f.arrayBuffer();
    // helper usage
    const res = await pdf(new Uint8Array(ab));
    document.querySelector('#out').textContent = res.text.slice(0, 200);
    // or full API with browser-specific options
    const parser = new PDFParse({
      data: new Uint8Array(ab),
      cMapUrl: '/cmaps/',
      cMapPacked: true
    });
    const pages = await parser.PageToImage();
    console.log(pages);
  });
</script>
Features
- Extract page text: GetText (via pdf or PDFParse)
- Extract embedded images: GetImage
- Render page to image: PageToImage
- Detect and extract tabular data: GetTable
Notes
- Uses pdfjs-dist for PDF parsing and rendering (see worker setup in src/PDFParse.ts).
- Tests are in test/ and run with Vitest.
Contributing
- Fork and branch
- Make changes and run tests
- Open a pull request
License
- Apache-2.0 (see LICENSE)
Worker note: In browser environments the package sets pdfjs GlobalWorkerOptions.workerSrc automatically when imported from the built browser bundle. If you use a custom build or host pdf.worker yourself, configure pdfjs accordingly.
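For the self-hosted case, a minimal sketch of what "configure pdfjs accordingly" might look like is shown below. It assumes you have access to the same pdf.js module instance that your build uses. The configureWorker helper and the /assets/pdf.worker.min.mjs path are hypothetical, and a stub object stands in for the pdf.js module so the snippet runs on its own. GlobalWorkerOptions.workerSrc is the setting named in the note above.

```javascript
// Hypothetical helper: points a pdf.js module at a self-hosted worker script.
function configureWorker(pdfjs, workerUrl) {
  if (!pdfjs || !pdfjs.GlobalWorkerOptions) {
    throw new TypeError('Expected a pdf.js module exposing GlobalWorkerOptions');
  }
  // pdf.js reads this value when it needs to start its worker.
  pdfjs.GlobalWorkerOptions.workerSrc = String(workerUrl);
  return pdfjs.GlobalWorkerOptions.workerSrc;
}

// Stand-in for the pdf.js module (assumption: in an application this would be
// the pdfjs-dist instance used by your custom build).
const pdfjsStub = { GlobalWorkerOptions: { workerSrc: '' } };

// Placeholder path; use wherever you actually serve the worker file.
const workerSrc = configureWorker(pdfjsStub, '/assets/pdf.worker.min.mjs');
console.log(workerSrc);
```

Set the worker location before creating a parser, so that the value is already in place when the first document is loaded.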