JSPM

  • ESM via JSPM
  • ES Module Entrypoint
  • Export Map
  • Keywords
  • License
  • Repository URL
  • TypeScript Types
  • README
  • Created
  • Published
  • Downloads 1679
  • Score
    100M100P100Q105551F
  • License MIT

Fast PDF classification and text extraction. Detect text-based vs scanned PDFs, extract text by region with quality checks. Native Rust performance via napi-rs.

Package Exports

  • firecrawl-pdf-inspector
  • firecrawl-pdf-inspector/index.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (firecrawl-pdf-inspector) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

firecrawl-pdf-inspector

Fast PDF classification and region-based text extraction for Node.js/Bun. Native Rust performance via napi-rs.

Built by Firecrawl for hybrid OCR pipelines — extract text from PDF structure where possible, fall back to OCR only when needed.

Install

npm install firecrawl-pdf-inspector
# or
bun add firecrawl-pdf-inspector

Prebuilt binaries included for linux-x64 and macOS ARM64. No Rust toolchain needed.

API

classifyPdf(buffer: Buffer): PdfClassification

Classify a PDF as TextBased, Scanned, Mixed, or ImageBased (~10-50ms). Returns which pages need OCR.

import { classifyPdf } from 'firecrawl-pdf-inspector'
import { readFileSync } from 'fs'

const pdf = readFileSync('document.pdf')
const result = classifyPdf(pdf)

console.log(result.pdfType)        // "TextBased" | "Scanned" | "Mixed" | "ImageBased"
console.log(result.pageCount)      // 42
console.log(result.pagesNeedingOcr) // [5, 12, 15] (0-indexed)
console.log(result.confidence)     // 0.875

extractTextInRegions(buffer: Buffer, pageRegions: PageRegions[]): PageRegionTexts[]

Extract text within bounding-box regions from a PDF. Designed for hybrid OCR pipelines where a layout model detects regions in rendered page images, and this function extracts text from the PDF structure for text-based pages — skipping GPU OCR.

Each region result includes a needsOcr flag that signals unreliable extraction (empty text, GID-encoded fonts, garbage text, encoding issues).

import { extractTextInRegions } from 'firecrawl-pdf-inspector'

const result = extractTextInRegions(pdf, [
  {
    page: 0, // 0-indexed
    regions: [
      [0, 0, 300, 400],    // [x1, y1, x2, y2] in PDF points, top-left origin
      [300, 0, 612, 400],
    ]
  }
])

for (const region of result[0].regions) {
  if (region.needsOcr) {
    // Unreliable text — send this region to OCR instead
  } else {
    console.log(region.text) // Extracted text in reading order
  }
}

Types

interface PdfClassification {
  pdfType: string          // "TextBased" | "Scanned" | "Mixed" | "ImageBased"
  pageCount: number
  pagesNeedingOcr: number[] // 0-indexed page numbers
  confidence: number        // 0.0 - 1.0
}

interface PageRegions {
  page: number              // 0-indexed
  regions: number[][]       // [[x1, y1, x2, y2], ...] in PDF points, top-left origin
}

interface PageRegionTexts {
  page: number
  regions: RegionText[]
}

interface RegionText {
  text: string
  needsOcr: boolean         // true when text is unreliable
}

Platforms

Platform Architecture Supported
Linux x64 Yes
macOS ARM64 Yes

License

MIT