Package Exports

firecrawl-pdf-inspector
firecrawl-pdf-inspector/index.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (firecrawl-pdf-inspector) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

firecrawl-pdf-inspector

Fast PDF classification and region-based text extraction for Node.js/Bun. Native Rust performance via napi-rs.

Built by Firecrawl for hybrid OCR pipelines — extract text from PDF structure where possible, fall back to OCR only when needed.

Install

npm install firecrawl-pdf-inspector
# or
bun add firecrawl-pdf-inspector

Prebuilt binaries included for linux-x64 and macOS ARM64. No Rust toolchain needed.

API

`classifyPdf(buffer: Buffer): PdfClassification`

Classify a PDF as TextBased, Scanned, Mixed, or ImageBased (~10-50ms). Returns which pages need OCR.

import { classifyPdf } from 'firecrawl-pdf-inspector'
import { readFileSync } from 'fs'

const pdf = readFileSync('document.pdf')
const result = classifyPdf(pdf)

console.log(result.pdfType)        // "TextBased" | "Scanned" | "Mixed" | "ImageBased"
console.log(result.pageCount)      // 42
console.log(result.pagesNeedingOcr) // [5, 12, 15] (0-indexed)
console.log(result.confidence)     // 0.875

`extractTextInRegions(buffer: Buffer, pageRegions: PageRegions[]): PageRegionTexts[]`

Extract text within bounding-box regions from a PDF. Designed for hybrid OCR pipelines where a layout model detects regions in rendered page images, and this function extracts text from the PDF structure for text-based pages — skipping GPU OCR.

Each region result includes a needsOcr flag that signals unreliable extraction (empty text, GID-encoded fonts, garbage text, encoding issues).

import { extractTextInRegions } from 'firecrawl-pdf-inspector'

const result = extractTextInRegions(pdf, [
  {
    page: 0, // 0-indexed
    regions: [
      [0, 0, 300, 400],    // [x1, y1, x2, y2] in PDF points, top-left origin
      [300, 0, 612, 400],
    ]
  }
])

for (const region of result[0].regions) {
  if (region.needsOcr) {
    // Unreliable text — send this region to OCR instead
  } else {
    console.log(region.text) // Extracted text in reading order
  }
}

Types

interface PdfClassification {
  pdfType: string          // "TextBased" | "Scanned" | "Mixed" | "ImageBased"
  pageCount: number
  pagesNeedingOcr: number[] // 0-indexed page numbers
  confidence: number        // 0.0 - 1.0
}

interface PageRegions {
  page: number              // 0-indexed
  regions: number[][]       // [[x1, y1, x2, y2], ...] in PDF points, top-left origin
}

interface PageRegionTexts {
  page: number
  regions: RegionText[]
}

interface RegionText {
  text: string
  needsOcr: boolean         // true when text is unreliable
}

Platforms

Platform	Architecture	Supported
Linux	x64	Yes
macOS	ARM64	Yes

License

MIT