text-extract

Robust, multi-format text extraction from binary buffers in Node.js

Extract readable text from PDFs, Word documents (.doc, .docx), Excel spreadsheets (.xls, .xlsx), plain text files, and legacy Microsoft Office compound files — with graceful error handling and MIME-type detection.

npm install text-extract

Features

- Supports the most common office and document formats:
  - PDF
  - DOC / DOCX (modern and legacy)
  - XLS / XLSX
  - Plain text (.txt)
  - Compound File Binary Format (CFB) containers (old .doc / .xls)
- Automatic file-type detection
- Parallel processing of multiple files
- Clean error handling: one corrupt file doesn't crash the whole batch

Supported Formats

| Format | Extension(s) | MIME Type(s) | Notes |
| --- | --- | --- | --- |
| PDF | `.pdf` | `application/pdf` | Text layer extraction |
| Word (modern) | `.docx` | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | |
| Word (legacy) | `.doc` | `application/msword` | |
| Excel (modern) | `.xlsx` | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` | CSV-style output per sheet |
| Excel (legacy) | `.xls` | `application/vnd.ms-excel`, `application/x-cfb` | CSV-style output per sheet |
| Plain Text | `.txt` | `text/plain` | UTF-8 decoded |
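
To give a feel for how automatic detection of these formats can work, here is a minimal sketch that guesses the container family from a buffer's leading "magic" bytes. `sniffContainer` and `SIGNATURES` are hypothetical names used for illustration only; this is not necessarily how text-extract detects types internally.

```javascript
// Hypothetical helper, not part of text-extract's documented API.
// Well-known leading byte signatures for the container families listed above.
const SIGNATURES = [
  { kind: 'pdf', bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
  { kind: 'zip', bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04": .docx and .xlsx are ZIP archives
  { kind: 'cfb', bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] }, // legacy .doc / .xls
];

function sniffContainer(buffer) {
  for (const { kind, bytes } of SIGNATURES) {
    const longEnough = buffer.length >= bytes.length;
    if (longEnough && bytes.every((b, i) => buffer[i] === b)) {
      return kind;
    }
  }
  return null; // unknown: may be plain text or an unsupported format
}
```

A signature only identifies the container. Telling `.docx` from `.xlsx` (both ZIP) or `.doc` from `.xls` (both CFB) requires looking inside the container.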

Usage

Extract text from a single buffer

import { readFile } from 'node:fs/promises';
import { parseText } from 'text-extract';

const buffer = await readFile('invoice.pdf');

const result = await parseText(buffer);

if (result) {
  console.log(`Format: ${result.ext}`);
  console.log('Text length:', result.text.length);
  console.log(result.text.substring(0, 300)); // first 300 chars
} else {
  console.log('Could not extract text');
}

Batch process multiple files

import { readFile } from 'node:fs/promises';
import { parseTexts } from 'text-extract';

const buffers = [
  await readFile('report.pdf'),
  await readFile('proposal.docx'),
  await readFile('data.xlsx'),
  // ...
];

const results = await parseTexts(buffers, (result) => {
  console.log(`Processed ${result.ext} – ${result.text.length} chars`);
});

console.log(`Successfully extracted text from ${results.length} files`);

API

`parseText(buffer: Buffer): Promise<{ ext: string, text: string } | null>`

Extracts text from a single file buffer.
Returns null if the format is unsupported or extraction fails.
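
Because the result can be null, callers need a guard before touching `.text`. One possible way to centralize that check is a small wrapper like the sketch below. `extractTextOr` is a hypothetical helper, and the parser is passed in as an argument (in real use you would pass `parseText`) so the snippet stays self-contained.

```javascript
// Hypothetical convenience wrapper, not part of text-extract.
// `parse` is any function with parseText's documented shape:
// (buffer) => Promise<{ ext, text } | null>
async function extractTextOr(parse, buffer, fallback = '') {
  const result = await parse(buffer);
  // A null result means the format was unsupported or extraction failed.
  return result ? result.text : fallback;
}
```

Usage would look like `await extractTextOr(parseText, buffer, '(no text)')`.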

`parseTexts(buffers: Buffer[], onComplete?: (result: { ext: string, text: string }) => void): Promise<Array<{ ext: string, text: string }>>`

Processes multiple buffers in parallel.
The optional onComplete callback is called once for each successfully processed file.
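
For readers curious how this kind of fault-tolerant batch processing can be built, here is a rough sketch using `Promise.all` with a per-file `try`/`catch`. It is an illustration under assumptions, not the library's actual implementation: `parseAll` is a hypothetical name, and `parse` stands in for `parseText`.

```javascript
// Hypothetical sketch of a batch runner that isolates per-file failures.
// `parse` is any function shaped like parseText: (buffer) => Promise<{ ext, text } | null>
async function parseAll(parse, buffers, onComplete) {
  const settled = await Promise.all(
    buffers.map(async (buffer) => {
      try {
        const result = await parse(buffer);
        if (result && onComplete) onComplete(result); // report successes only
        return result ?? null;
      } catch (err) {
        // Log and continue so one bad file cannot reject the whole batch.
        console.error('Extraction failed:', err);
        return null;
      }
    })
  );
  // Drop failed or unsupported files; input order is preserved for the rest.
  return settled.filter(Boolean);
}
```

In this sketch an exception thrown by `onComplete` is treated like a failed file. A different design could call the callback outside the `try` block instead.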

Error Handling

The library is designed to be very forgiving:

- A corrupt or unsupported file returns null for that file only
- Exceptions are caught and logged (console.error)
- Invalid UTF-8 or binary garbage is safely handled
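
As an illustration of the last point, one common way to decode possibly malformed UTF-8 without throwing is a non-fatal `TextDecoder`, which substitutes U+FFFD for invalid byte sequences. This is a sketch of the general technique; `decodeUtf8Lenient` is a hypothetical name, and text-extract may handle decoding differently.

```javascript
// Hypothetical helper showing lenient UTF-8 decoding; not text-extract's own code.
function decodeUtf8Lenient(buffer) {
  // With fatal: false (the default), malformed sequences become U+FFFD
  // instead of raising a TypeError.
  return new TextDecoder('utf-8', { fatal: false }).decode(buffer);
}
```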

License

MIT

Made with ❤️ for Node.js developers who hate broken document parsers