firecrawl-pdf-inspector
Fast PDF classification and region-based text extraction for Node.js/Bun. Native Rust performance via napi-rs.
Built by Firecrawl for hybrid OCR pipelines — extract text from PDF structure where possible, fall back to OCR only when needed.
Install
npm install firecrawl-pdf-inspector
# or
bun add firecrawl-pdf-inspector
Prebuilt binaries are included for linux-x64 and macOS ARM64. No Rust toolchain is needed.
API
classifyPdf(buffer: Buffer): PdfClassification
Classify a PDF as TextBased, Scanned, Mixed, or ImageBased (~10-50ms). Returns which pages need OCR.
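As a rough sketch of how the returned page list might drive a pipeline, the helper below splits every page index into an OCR list and a structure-extraction list. `planPages` is a hypothetical name and not part of this package; the `PdfClassification` shape is copied from the Types section.

```typescript
// Shape copied from the Types section of this README.
interface PdfClassification {
  pdfType: string
  pageCount: number
  pagesNeedingOcr: number[] // 0-indexed
  confidence: number
}

// Hypothetical helper (not exported by firecrawl-pdf-inspector):
// split all page indices into pages to OCR and pages to read from PDF structure.
function planPages(c: PdfClassification): { ocrPages: number[]; textPages: number[] } {
  const needsOcr = new Set(c.pagesNeedingOcr)
  const ocrPages: number[] = []
  const textPages: number[] = []
  for (let page = 0; page < c.pageCount; page++) {
    if (needsOcr.has(page)) {
      ocrPages.push(page)
    } else {
      textPages.push(page)
    }
  }
  return { ocrPages, textPages }
}

// Hand-written classification result, for illustration only.
const plan = planPages({ pdfType: 'Mixed', pageCount: 4, pagesNeedingOcr: [1, 3], confidence: 0.9 })
console.log(plan.ocrPages) // [1, 3]
console.log(plan.textPages) // [0, 2]
```

In practice the argument would be the value returned by `classifyPdf`.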
import { classifyPdf } from 'firecrawl-pdf-inspector'
import { readFileSync } from 'fs'
const pdf = readFileSync('document.pdf')
const result = classifyPdf(pdf)
console.log(result.pdfType) // "TextBased" | "Scanned" | "Mixed" | "ImageBased"
console.log(result.pageCount) // 42
console.log(result.pagesNeedingOcr) // [5, 12, 15] (0-indexed)
console.log(result.confidence) // 0.875
extractTextInRegions(buffer: Buffer, pageRegions: PageRegions[]): PageRegionTexts[]
Extract text within bounding-box regions from a PDF. Designed for hybrid OCR pipelines where a layout model detects regions in rendered page images, and this function extracts text from the PDF structure for text-based pages — skipping GPU OCR.
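Layout models usually report boxes in pixels of the rendered page image, while regions here are given in PDF points. The sketch below shows one possible conversion, assuming the page was rendered at a uniform DPI, uncropped and unrotated, with a top-left origin. `pixelBoxToPoints` and `Box` are hypothetical names, not part of this package.

```typescript
type Box = [number, number, number, number] // [x1, y1, x2, y2]

// Hypothetical helper: convert a pixel-space box to PDF points.
// Assumes a uniform render DPI, no cropping or rotation, and a top-left origin.
function pixelBoxToPoints(box: Box, dpi: number): Box {
  if (!(dpi > 0)) throw new RangeError('dpi must be positive')
  const scale = 72 / dpi // 1 PDF point = 1/72 inch
  const [x1, y1, x2, y2] = box
  return [x1 * scale, y1 * scale, x2 * scale, y2 * scale]
}

// Two boxes detected on a US Letter page rendered at 144 DPI (1224 x 1584 px).
const detected: Box[] = [[0, 0, 600, 800], [600, 0, 1224, 800]]
const regions = detected.map((b) => pixelBoxToPoints(b, 144))
console.log(regions) // [[0, 0, 300, 400], [300, 0, 612, 400]]
```

The converted boxes can then be passed as the `regions` array of a `PageRegions` entry.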
Each region result includes a needsOcr flag that signals unreliable extraction (empty text, GID-encoded fonts, garbage text, encoding issues).
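One way to act on the needsOcr flag is to collect the flagged regions into a queue for the OCR stage. This is only a sketch: `collectOcrJobs` and `OcrJob` are hypothetical names, the result shapes are copied from the Types section, and it assumes results come back in the same region order as the request, which this README does not state explicitly.

```typescript
// Shapes copied from the Types section of this README.
interface RegionText { text: string; needsOcr: boolean }
interface PageRegionTexts { page: number; regions: RegionText[] }

// Hypothetical job descriptor for a downstream OCR stage.
interface OcrJob { page: number; regionIndex: number }

// Hypothetical helper: list every (page, regionIndex) pair flagged as unreliable.
// Assumes region order in the result matches the order in the request.
function collectOcrJobs(results: PageRegionTexts[]): OcrJob[] {
  const jobs: OcrJob[] = []
  for (const { page, regions } of results) {
    regions.forEach((region, regionIndex) => {
      if (region.needsOcr) jobs.push({ page, regionIndex })
    })
  }
  return jobs
}

// Hand-written sample result, for illustration only.
const sample: PageRegionTexts[] = [
  { page: 0, regions: [{ text: 'Invoice #123', needsOcr: false }, { text: '', needsOcr: true }] },
]
console.log(collectOcrJobs(sample)) // [{ page: 0, regionIndex: 1 }]
```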
import { extractTextInRegions } from 'firecrawl-pdf-inspector'
const result = extractTextInRegions(pdf, [
  {
    page: 0, // 0-indexed
    regions: [
      [0, 0, 300, 400], // [x1, y1, x2, y2] in PDF points, top-left origin
      [300, 0, 612, 400],
    ]
  }
])
for (const region of result[0].regions) {
  if (region.needsOcr) {
    // Unreliable text: send this region to OCR instead
  } else {
    console.log(region.text) // Extracted text in reading order
  }
}
Types
interface PdfClassification {
  pdfType: string // "TextBased" | "Scanned" | "Mixed" | "ImageBased"
  pageCount: number
  pagesNeedingOcr: number[] // 0-indexed page numbers
  confidence: number // 0.0 - 1.0
}
interface PageRegions {
  page: number // 0-indexed
  regions: number[][] // [[x1, y1, x2, y2], ...] in PDF points, top-left origin
}
interface PageRegionTexts {
  page: number
  regions: RegionText[]
}
interface RegionText {
  text: string
  needsOcr: boolean // true when text is unreliable
}
Platforms
| Platform | Architecture | Supported |
|---|---|---|
| Linux | x64 | Yes |
| macOS | ARM64 | Yes |
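Since only the two targets in the table ship prebuilt binaries, an application may want to check the host before loading the module. The guard below is a hypothetical sketch, not something this package provides; it uses the platform and architecture names that Node.js reports (`darwin` for macOS).

```typescript
// Hypothetical guard; the supported pairs mirror the Platforms table above.
// Names follow Node.js conventions for process.platform and process.arch.
const SUPPORTED: ReadonlyArray<readonly [string, string]> = [
  ['linux', 'x64'],
  ['darwin', 'arm64'],
]

function hasPrebuiltBinary(platform: string, arch: string): boolean {
  return SUPPORTED.some(([p, a]) => p === platform && a === arch)
}

console.log(hasPrebuiltBinary('linux', 'x64')) // true
console.log(hasPrebuiltBinary('win32', 'x64')) // false
```

At runtime this would be called as `hasPrebuiltBinary(process.platform, process.arch)`, falling back to an OCR-only path when it returns false.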
License
MIT