text-extract
Robust, multi-format text extraction from binary buffers in Node.js
Extract readable text from PDFs, Word documents (.doc, .docx), Excel spreadsheets (.xls, .xlsx), plain text files, and legacy Microsoft Office compound files — with graceful error handling and MIME-type detection.
```bash
npm install text-extract
```
Features
- Supports the most common office & document formats:
  - PDF
  - DOC / DOCX (modern and legacy)
  - XLS / XLSX
  - Plain text (.txt)
  - Compound File Binary Format (CFB) containers (old .doc / .xls)
- Automatic file-type detection
- Parallel processing of multiple files
- Clean error handling — one corrupt file doesn't crash the whole batch
Supported Formats
| Format | Extension(s) | MIME Type(s) | Notes |
|---|---|---|---|
| PDF | .pdf | application/pdf | Text layer extraction |
| Word (modern) | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | |
| Word (legacy) | .doc | application/msword | |
| Excel (modern) | .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | CSV-style output per sheet |
| Excel (legacy) | .xls | application/vnd.ms-excel, application/x-cfb | CSV-style output per sheet |
| Plain Text | .txt | text/plain | UTF-8 decoded |
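The table above can be turned into a small lookup, for example to set a `Content-Type` when storing the original file next to its extracted text. This is a minimal sketch: `MIME_TYPES_BY_EXT` and `mimeTypesFor` are hypothetical names and are not part of text-extract. It assumes `result.ext` is an extension string such as `pdf` or `.pdf`; the README does not say whether a leading dot is included, so the helper accepts both.

```javascript
// MIME types per extension, copied from the Supported Formats table.
// Hypothetical lookup for illustration; not exported by text-extract.
const MIME_TYPES_BY_EXT = new Map([
  ['pdf', ['application/pdf']],
  ['docx', ['application/vnd.openxmlformats-officedocument.wordprocessingml.document']],
  ['doc', ['application/msword']],
  ['xlsx', ['application/vnd.openxmlformats-officedocument.spreadsheetml.sheet']],
  ['xls', ['application/vnd.ms-excel', 'application/x-cfb']],
  ['txt', ['text/plain']],
]);

// Returns the MIME types listed for an extension, or an empty array
// when the extension is not in the table. Accepts 'pdf', '.pdf', '.PDF'.
function mimeTypesFor(ext) {
  if (typeof ext !== 'string') return [];
  const key = ext.trim().toLowerCase().replace(/^\./, '');
  return MIME_TYPES_BY_EXT.get(key) ?? [];
}
```

With a result from `parseText`, `mimeTypesFor(result.ext)` would then give the candidate MIME types for that file.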
Usage
Extract text from a single buffer
```javascript
import { readFile } from 'node:fs/promises';
import { parseText } from 'text-extract';

const buffer = await readFile('invoice.pdf');
const result = await parseText(buffer);

if (result) {
  console.log(`Format: ${result.ext}`);
  console.log('Text length:', result.text.length);
  console.log(result.text.substring(0, 300)); // first 300 chars
} else {
  console.log('Could not extract text');
}
```
Batch process multiple files
```javascript
import { readFile } from 'node:fs/promises';
import { parseTexts } from 'text-extract';

const buffers = [
  await readFile('report.pdf'),
  await readFile('proposal.docx'),
  await readFile('data.xlsx'),
  // ...
];

// The callback runs once for each successfully processed file.
const results = await parseTexts(buffers, (result) => {
  console.log(`Processed ${result.ext} – ${result.text.length} chars`);
});

console.log(`Successfully extracted text from ${results.length} files`);
```
API
parseText(buffer: Buffer): Promise<{ ext: string, text: string } | null>
Extracts text from a single file buffer.
Returns null if the format is unsupported or extraction fails.
parseTexts(buffers: Buffer[], onComplete?: (result: { ext: string, text: string }) => void): Promise<Array<{ ext: string, text: string }>>
Processes multiple buffers in parallel.
The optional onComplete callback is called once for every successfully processed file.
Error Handling
The library is designed to be very forgiving:
- One corrupt or unsupported file → that file returns null
- Exceptions are caught and logged (console.error)
- Invalid UTF-8 or binary garbage is safely handled
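Because a failed extraction surfaces as `null` rather than as an exception, callers usually want to know which file failed. The sketch below keeps file names attached to their results and splits successes from failures. `extractNamed` is a hypothetical helper, not part of text-extract; it takes the parser as an argument and assumes that parser behaves like `parseText` as documented above (resolving to `{ ext, text }` or `null`).

```javascript
// Hypothetical helper for illustration; not part of text-extract.
// `files` is an array of { name, buffer } objects.
// `parse` is expected to behave like parseText: it resolves to
// { ext, text } on success or to null when extraction fails.
async function extractNamed(files, parse) {
  const settled = await Promise.all(
    files.map(async ({ name, buffer }) => {
      try {
        return { name, result: await parse(buffer) };
      } catch {
        // Defensive only: the README says errors are caught inside the
        // library, but a different parser passed in here might throw.
        return { name, result: null };
      }
    }),
  );

  return {
    // Entries shaped { name, result: { ext, text } }
    extracted: settled.filter((entry) => entry.result != null),
    // Names of the files that produced no text
    failed: settled
      .filter((entry) => entry.result == null)
      .map((entry) => entry.name),
  };
}
```

Called as `await extractNamed([{ name: 'invoice.pdf', buffer }], parseText)`, this would return the extracted entries together with the list of file names to report or retry.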
License
MIT
Made with ❤️ for Node.js developers who hate broken document parsers