@pisanvs/mistralocr-cli

1.0.2
  • License MIT

CLI for the Mistral OCR API — converts documents and images to Markdown

    Readme

    mistralocr-cli

    Convert PDFs, images, and documents to Markdown using the Mistral OCR API.

    mistralocr-cli is a command-line tool with built-in caching that extracts text from scanned documents and images. It splits large PDFs into chunks automatically, retries on transient errors, and can embed or save extracted images alongside the Markdown output.


    Features

    • 📄 Multi-format support — PDFs, images (PNG/JPEG/WebP/…), Word, PowerPoint, EPUB, and more
    • 🗂️ Smart caching — SHA-256 based; re-uses previous results so you never pay for the same file twice
    • ✂️ Automatic chunking — splits large PDFs into chunks and reassembles the result
    • 🔁 Retry with back-off — handles rate limits and transient network errors gracefully
    • 🖼️ Image extraction — embed images as base64 data URIs, or save them to a folder
    • 📊 Table extraction — save each detected table to its own Markdown file
    • 🔄 Auto-conversion — converts legacy .doc / .ppt / .xls files via LibreOffice (if installed)
    • 📣 Verbose mode — real-time spinners and detailed progress output

    Requirements

    | Requirement | Notes |
    |---|---|
    | Node.js ≥ 20 | Required |
    | Mistral API key | Get one at console.mistral.ai |
    | pdfinfo (optional) | Enables automatic page-count detection and chunking for PDFs. Install via poppler-utils |
    | LibreOffice (optional) | Required only for converting legacy .doc, .ppt, .xls files |
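To illustrate why pdfinfo enables page-count detection: its text output includes a `Pages:` line that can be parsed. The following is only a sketch of that idea, not the tool's actual code; `parsePageCount` is a hypothetical helper name.

```typescript
// Hypothetical helper: extract the page count from `pdfinfo` text output.
// pdfinfo prints one "Key: value" pair per line, including a "Pages:" line.
function parsePageCount(pdfinfoOutput: string): number | null {
  const match = /^Pages:\s+(\d+)\s*$/m.exec(pdfinfoOutput);
  const digits = match?.[1];
  return digits === undefined ? null : Number.parseInt(digits, 10);
}

// Example with a shortened, made-up pdfinfo output.
const pdfinfoSample = "Title:          Report\nPages:          120\nEncrypted:      no\n";
console.log(parsePageCount(pdfinfoSample)); // 120
```

When the `Pages:` line is missing, the sketch returns `null`, which a caller could treat as "page count unknown, do not chunk".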

    Installation

    Global

    npm install -g @pisanvs/mistralocr-cli

    Local / per-project

    npm install @pisanvs/mistralocr-cli
    npx mistralocr <file> [options]

    API key

    Set your Mistral API key as an environment variable (recommended):

    export MISTRAL_API_KEY="your-api-key-here"

    Or pass it inline with --api-key on every command.
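The precedence between the flag and the environment variable can be pictured as follows. This is a minimal sketch assuming the flag simply wins when present; `resolveApiKey` and the `Env` type are hypothetical names, and the error message is invented for illustration.

```typescript
// Hypothetical sketch of key resolution: an explicit --api-key value wins,
// otherwise fall back to the MISTRAL_API_KEY environment variable.
type Env = Record<string, string | undefined>;

function resolveApiKey(flagValue: string | undefined, env: Env): string {
  const key = flagValue ?? env["MISTRAL_API_KEY"];
  if (!key) {
    // Illustrative message only; the real CLI's wording may differ.
    throw new Error("No API key: set MISTRAL_API_KEY or pass --api-key");
  }
  return key;
}

// In a Node.js program the second argument would typically be process.env.
console.log(resolveApiKey(undefined, { MISTRAL_API_KEY: "from-env" })); // from-env
console.log(resolveApiKey("from-flag", { MISTRAL_API_KEY: "from-env" })); // from-flag
```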


    Quick Start

    # Set your API key
    export MISTRAL_API_KEY="your-api-key-here"
    
    # Convert a PDF and print Markdown to the terminal
    mistralocr report.pdf
    
    # Save to a file instead
    mistralocr report.pdf --output report.md

    Usage

    mistralocr [file] [options]

    CLI Reference

    | Flag | Description | Default |
    |---|---|---|
    | [file] | Path to the PDF or image file to process | (required) |
    | -k, --api-key <key> | Mistral API key (overrides MISTRAL_API_KEY env var) | |
    | -o, --output <file> | Write Markdown output to a file instead of stdout | stdout |
    | -m, --model <model> | OCR model to use | mistral-ocr-latest |
    | -p, --pages <range> | Page range to process (see Page Ranges) | all pages |
    | --include-images | Embed extracted images as base64 data URIs in the Markdown | false |
    | --extract-images <dir> | Save extracted images to <dir>; Markdown will reference them relatively | |
    | --extract-tables <dir> | Save each detected table as a separate .md file in <dir> | |
    | --bbox-annotation <json> | JSON schema for bounding-box annotation format | |
    | --document-annotation <json> | JSON schema for document-level annotation format | |
    | --no-cache | Bypass cache and always call the API | false |
    | --clear-cache | Delete the cache directory and exit | |
    | --cache-dir <dir> | Custom cache directory path | .mistralocr-cache |
    | --chunk-size <n> | Pages per API call; set to 0 to disable chunking | 50 |
    | --max-retries <n> | Maximum retry attempts per API call | 3 |
    | --retry-delay <ms> | Initial retry back-off in milliseconds | 1000 |
    | -v, --verbose | Show detailed progress information | false |
    | -V, --version | Print version number and exit | |

    Examples

    Basic OCR

    Convert a PDF and stream Markdown to your terminal:

    mistralocr scan.pdf

    Convert a JPEG image:

    mistralocr photo.jpg

    Save Output to a File

    mistralocr report.pdf --output report.md

    Process Specific Pages

    Process only pages 1–5:

    mistralocr book.pdf --pages "1-5" --output chapter1.md

    Process pages 1, 3, and 7–10:

    mistralocr book.pdf --pages "1,3,7-10" --output selection.md

    Extract Images to a Folder

    Images are saved to ./images/ and the Markdown contains relative links to them:

    mistralocr report.pdf --extract-images ./images --output report.md

    Files are named <basename>-p<page>-img<n>.<ext> (e.g. report-p1-img1.png).
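The naming pattern can be expressed as a small function. This is a sketch under two assumptions not stated in the README: that the basename is the input file name without its extension, and that page and image numbers start at 1 (as the `report-p1-img1.png` example suggests). `imageFileName` is a hypothetical name.

```typescript
// Hypothetical helper mirroring the documented pattern
// <basename>-p<page>-img<n>.<ext>, e.g. report-p1-img1.png.
function imageFileName(inputPath: string, page: number, index: number, ext: string): string {
  // Take the last path segment, accepting both "/" and "\" separators.
  const file = inputPath.split(/[\\/]/).pop() ?? inputPath;
  // Drop the final extension, if any (a leading dot is not an extension).
  const dot = file.lastIndexOf(".");
  const base = dot > 0 ? file.slice(0, dot) : file;
  return `${base}-p${page}-img${index}.${ext}`;
}

console.log(imageFileName("docs/report.pdf", 1, 1, "png")); // report-p1-img1.png
```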

    Embed Images as Base64

    All images are inlined as data URIs — useful for fully self-contained documents:

    mistralocr invoice.pdf --include-images --output invoice.md
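Conceptually, inlining means swapping each image link in the Markdown for a data URI. The sketch below assumes the OCR Markdown references images as `![alt](image-id)` and that a lookup from image id to a ready-made data URI is available; both the `inlineImages` name and that lookup are assumptions for illustration, not the tool's documented internals.

```typescript
// Assumption: images appear in the Markdown as ![alt](image-id), and
// `dataUris` maps each image id to a complete "data:...;base64,..." URI.
function inlineImages(markdown: string, dataUris: Map<string, string>): string {
  return markdown.replace(
    /!\[([^\]]*)\]\(([^)\s]+)\)/g,
    (whole: string, alt: string, target: string) => {
      const uri = dataUris.get(target);
      // Leave links untouched when no data URI is known for the target.
      return uri === undefined ? whole : `![${alt}](${uri})`;
    },
  );
}

const uris = new Map<string, string>([["img-0.jpeg", "data:image/jpeg;base64,AAAA"]]);
console.log(inlineImages("Intro\n\n![img-0.jpeg](img-0.jpeg)\n", uris));
```

The trade-off is file size: base64 inflates binary data by roughly a third, so `--extract-images` is usually the lighter choice for image-heavy documents.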

    Extract Tables to Separate Files

    Each detected table is saved as its own .md file:

    mistralocr data.pdf --extract-tables ./tables --output data.md

    Table files are named <basename>-p<page>-table<n>.md.

    Skip Cache / Clear Cache

    Force a fresh API call (ignore any cached result):

    mistralocr report.pdf --no-cache

    Delete all cached results:

    mistralocr --clear-cache

    Use a custom cache location:

    mistralocr report.pdf --cache-dir /tmp/my-cache

    Auto-Convert Legacy Office Files

    If LibreOffice is installed, .doc, .ppt, .xls, and similar files are automatically converted before OCR:

    mistralocr legacy-document.doc --output result.md

    Tune Chunking and Retries

    Process 100 pages at a time instead of the default 50:

    mistralocr large-book.pdf --chunk-size 100

    Disable chunking entirely:

    mistralocr small.pdf --chunk-size 0
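To make the effect of `--chunk-size` concrete, here is a minimal sketch of how a page count could be split into per-request ranges. It assumes inclusive, 1-indexed ranges and treats 0 as "one chunk for the whole document"; `chunkPages` is a hypothetical name and the tool's real splitting logic may differ.

```typescript
// Sketch: split a 1-indexed page count into inclusive [start, end] chunks.
// chunkSize 0 means "no chunking": one chunk spanning the whole document.
function chunkPages(totalPages: number, chunkSize: number): Array<[number, number]> {
  if (totalPages <= 0) return [];
  if (chunkSize <= 0) return [[1, totalPages]];
  const chunks: Array<[number, number]> = [];
  for (let start = 1; start <= totalPages; start += chunkSize) {
    chunks.push([start, Math.min(start + chunkSize - 1, totalPages)]);
  }
  return chunks;
}

// A 120-page PDF with the default chunk size yields three chunks:
// pages 1-50, 51-100, and 101-120.
console.log(chunkPages(120, 50));
```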

    Increase retry attempts and set a longer initial delay for slow connections:

    mistralocr report.pdf --max-retries 5 --retry-delay 2000
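The two flags interact as an initial delay plus a retry budget. The sketch below assumes the delay doubles after each failed attempt (plain exponential back-off), which the README does not spell out; `backoffDelay` and `withRetry` are hypothetical names, and a real implementation would likely retry only on rate limits and transient network errors rather than on every failure.

```typescript
// Assumption: the delay doubles after each failed attempt.
// attempt 0 waits initialDelayMs, attempt 1 waits twice that, and so on.
function backoffDelay(attempt: number, initialDelayMs: number): number {
  return initialDelayMs * 2 ** attempt;
}

// Run `call`, retrying up to maxRetries times with growing delays.
// For brevity this retries on any error, not just transient ones.
async function withRetry<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  initialDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retry budget exhausted
      const delay = backoffDelay(attempt, initialDelayMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Under this assumption, `--max-retries 5 --retry-delay 2000` would wait 2 s, 4 s, 8 s, 16 s, and 32 s between attempts before giving up.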

    Verbose Output

    See real-time progress spinners and detailed logs:

    mistralocr report.pdf --verbose --output report.md

    Supported Formats

    Documents

    | Format | Extension(s) |
    |---|---|
    | PDF | .pdf |
    | Word | .docx, .doc (auto-converted via LibreOffice) |
    | PowerPoint | .pptx, .ppt (auto-converted via LibreOffice) |
    | EPUB | .epub |
    | RTF | .rtf |
    | OpenDocument Text | .odt |
    | LaTeX | .tex |
    | Jupyter Notebook | .ipynb |
    | BibTeX | .bib |
    | FictionBook | .fb2 |
    | OPML | .opml |
    | XML (DocBook/JATS) | .xml |
    | Troff/Man | .1, .man |

    Images

    | Format | Extension(s) |
    |---|---|
    | JPEG | .jpg, .jpeg |
    | PNG | .png |
    | WebP | .webp |
    | GIF | .gif |
    | TIFF | .tiff, .tif |
    | BMP | .bmp |
    | AVIF | .avif |
    | HEIC/HEIF | .heic, .heif |

    Caching

    Results are cached locally so repeated runs on the same file (with the same options) return instantly without calling the API.

    • Location: .mistralocr-cache/ in the current directory (override with --cache-dir)
    • Key: SHA-256 hash of the file contents + a hash of the relevant options (model, page range, image settings)
    • Invalidation: The cache is automatically invalidated when the file changes or any option that affects the output changes
    • Bypass: Use --no-cache to force a fresh API call
    • Clear: Use --clear-cache to delete all cached results
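The key scheme described above can be sketched with Node's built-in `crypto` module. The `CacheOptions` shape, the `cacheKey` name, and the way the two hashes are joined are assumptions for illustration; only the "SHA-256 of the file plus a hash of the options" idea comes from the README.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of the options that influence OCR output.
interface CacheOptions {
  model: string;
  pages?: string;
  includeImages: boolean;
}

// Sketch: combine a SHA-256 of the file bytes with a SHA-256 of the options.
function cacheKey(fileBytes: Uint8Array, options: CacheOptions): string {
  const fileHash = createHash("sha256").update(fileBytes).digest("hex");
  // Serialize with sorted keys so property order does not change the hash.
  const canonical = JSON.stringify(options, Object.keys(options).sort());
  const optionsHash = createHash("sha256").update(canonical).digest("hex");
  // Joining format is arbitrary; a shortened options hash keeps names compact.
  return `${fileHash}-${optionsHash.slice(0, 16)}`;
}

const bytes = new Uint8Array([1, 2, 3]);
console.log(cacheKey(bytes, { model: "mistral-ocr-latest", includeImages: false }));
```

Because both the file bytes and the options feed the key, editing the file or changing, say, the page range produces a different key, which is what makes invalidation automatic.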

    Page Ranges

    The --pages option accepts 1-indexed page numbers in several formats:

    | Format | Example | Pages processed |
    |---|---|---|
    | Single page | 5 | 5 |
    | Range | 1-5 | 1, 2, 3, 4, 5 |
    | List | 1,3,5 | 1, 3, 5 |
    | Mixed | 1-3,7,10-12 | 1, 2, 3, 7, 10, 11, 12 |
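The formats in the table can be expanded with a short parser. This is a sketch, not the tool's parser: `parsePageRange` is a hypothetical name, and the choices to de-duplicate, sort, and reject descending ranges such as `5-1` are assumptions.

```typescript
// Sketch: expand a 1-indexed spec such as "1-3,7,10-12" into sorted, unique pages.
function parsePageRange(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    // Each segment is either "N" or "N-M", with optional surrounding spaces.
    const match = /^\s*(\d+)\s*(?:-\s*(\d+))?\s*$/.exec(part);
    if (!match) throw new Error(`Invalid page range segment: "${part}"`);
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) throw new Error(`Invalid page range: "${part}"`);
    for (let page = start; page <= end; page++) pages.add(page);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(parsePageRange("1-3,7,10-12").join(", ")); // 1, 2, 3, 7, 10, 11, 12
```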

    Development

    # Clone the repository
    git clone https://github.com/pisanvs/mistralocr-cli.git
    cd mistralocr-cli
    
    # Install dependencies
    npm install
    
    # Run in development mode (TypeScript, no build step)
    npm run dev -- report.pdf
    
    # Build to dist/
    npm run build
    
    # Type-check without building
    npm run typecheck

    License

    MIT