JSPM

text-extract

1.0.5

    A robust Node.js utility for extracting text from PDF, DOCX, DOC, XLSX, and TXT buffers.

    Package Exports

    • text-extract
    • text-extract/text-extract.js

    This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (text-extract) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

    Readme

    text-extract

    Robust, multi-format text extraction from binary buffers in Node.js

    Extract readable text from PDFs, Word documents (.doc, .docx), Excel spreadsheets (.xls, .xlsx), plain text files, and legacy Microsoft Office compound files — with graceful error handling and MIME-type detection.

    npm install text-extract

    Features

    • Supports the most common office & document formats:
      • PDF
      • DOC / DOCX (legacy and modern)
      • XLS / XLSX
      • Plain text (.txt)
      • Compound File Binary Format (CFB) containers (old .doc / .xls)
    • Automatic file-type detection
    • Parallel processing of multiple files
    • Clean error handling — one corrupt file doesn't crash the whole batch
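
    Automatic file-type detection is commonly done by inspecting a buffer's leading "magic" bytes. The sketch below illustrates that general technique only; it is not taken from text-extract's source, and the names `SIGNATURES` and `sniffFormat` are hypothetical. The byte signatures themselves are the standard ones for PDF, ZIP (the container used by .docx and .xlsx) and CFB (the container used by legacy .doc and .xls).

```javascript
// Hypothetical helper: guess a container format from a buffer's leading bytes.
// This is an illustration of magic-byte sniffing, not text-extract's own logic.
const SIGNATURES = [
  { kind: 'pdf', bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { kind: 'zip', bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04" (.docx, .xlsx)
  { kind: 'cfb', bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] }, // legacy .doc, .xls
];

function sniffFormat(buffer) {
  for (const { kind, bytes } of SIGNATURES) {
    const longEnough = buffer.length >= bytes.length;
    if (longEnough && bytes.every((b, i) => buffer[i] === b)) {
      return kind;
    }
  }
  return null; // unknown: possibly plain text or an unsupported format
}
```

    A container match is only a first step: telling .docx from .xlsx, or .doc from .xls, requires looking inside the ZIP or CFB container, which this sketch does not attempt.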

    Supported Formats

    Format         | Extension(s) | MIME Type(s)                                                            | Notes
    ---------------|--------------|-------------------------------------------------------------------------|---------------------------
    PDF            | .pdf         | application/pdf                                                         | Text layer extraction
    Word (modern)  | .docx        | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
    Word (legacy)  | .doc         | application/msword                                                      |
    Excel (modern) | .xlsx        | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet       | CSV-style output per sheet
    Excel (legacy) | .xls         | application/vnd.ms-excel, application/x-cfb                             | CSV-style output per sheet
    Plain Text     | .txt         | text/plain                                                              | UTF-8 decoded

    Usage

    Extract text from a single buffer

    import { readFile } from 'node:fs/promises';
    import { parseText } from 'text-extract';
    
    const buffer = await readFile('invoice.pdf');
    
    const result = await parseText(buffer);
    
    if (result) {
      console.log(`Format: ${result.ext}`);
      console.log('Text length:', result.text.length);
      console.log(result.text.substring(0, 300)); // first 300 chars
    } else {
      console.log('Could not extract text');
    }

    Batch process multiple files

    import { readFile } from 'node:fs/promises';
    import { parseTexts } from 'text-extract';
    
    const buffers = [
      await readFile('report.pdf'),
      await readFile('proposal.docx'),
      await readFile('data.xlsx'),
      // ...
    ];
    
    const results = await parseTexts(buffers, (result) => {
      console.log(`Processed ${result.ext}: ${result.text.length} chars`);
    });
    
    console.log(`Successfully extracted text from ${results.length} files`);

    API

    parseText(buffer: Buffer): Promise<{ ext: string, text: string } | null>

    Extracts text from a single file buffer.
    Returns null if the format is unsupported or extraction fails.

    parseTexts(buffers: Buffer[], onComplete?: (result: { ext: string, text: string }) => void): Promise<Array<{ ext: string, text: string }>>

    Processes multiple buffers in parallel.
    The optional onComplete callback is invoked once for each successfully processed file.
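
    The documented result objects carry only `ext` and `text`, so a caller who needs to know which input produced which output may want to track file names separately. The following is a minimal sketch of one way to do that, assuming a single-buffer parser with the `parseText` signature shown above. The helper name `extractNamed` and the `{ name, buffer }` input shape are hypothetical and not part of text-extract; the parser is passed in as an argument so the sketch runs without the package installed.

```javascript
// Hypothetical helper: run a single-buffer parser over named buffers in parallel,
// keeping each file name next to its result.
// `parse` is any async function (buffer) => { ext, text } | null.
async function extractNamed(files, parse) {
  const entries = await Promise.all(
    files.map(async ({ name, buffer }) => ({ name, result: await parse(buffer) }))
  );
  // Drop entries the parser could not handle (null results).
  return entries.filter((entry) => entry.result !== null);
}
```

    With the package installed, this could be called as `extractNamed(files, parseText)`, where `parseText` is imported from 'text-extract'.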

    Error Handling

    The library is designed to be very forgiving:

    • One corrupt or unsupported file → that file returns null
    • Exceptions are caught and logged (console.error)
    • Invalid UTF-8 or binary garbage is safely handled
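
    The "catch, log, return null" behaviour described above can be pictured with a small wrapper. This is a sketch of the general pattern under stated assumptions, not text-extract's actual implementation; `makeSafe` and its `log` parameter are hypothetical names. The last lines show a fact about Node.js itself: decoding a buffer as UTF-8 does not throw on malformed bytes.

```javascript
// Hypothetical wrapper: turn any async parser into one that never rejects.
// Failures are logged and reported as null, so one bad input cannot
// reject a surrounding Promise.all over a whole batch.
function makeSafe(parse, log = console.error) {
  return async function safeParse(buffer) {
    try {
      return await parse(buffer);
    } catch (err) {
      log('Extraction failed:', err.message);
      return null;
    }
  };
}

// Node's Buffer#toString('utf8') replaces invalid byte sequences with U+FFFD
// instead of throwing, so decoding "binary garbage" is safe.
const decoded = Buffer.from([0x68, 0x69, 0xff]).toString('utf8'); // "hi" followed by U+FFFD
```

    Because callers still receive null for failed inputs, checking for it (as in the single-buffer usage example) remains necessary.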

    License

    MIT


    Made with ❤️ for Node.js developers who hate broken document parsers