parseflow-core
Core PDF parsing library for ParseFlow - Extract text, metadata, images, and TOC from PDF files.
✨ Features
- 📄 Text Extraction - Extract text from PDF with multiple strategies (raw, formatted, clean)
- 📊 Metadata Extraction - Get title, author, page count, creation date, etc.
- 🔍 Keyword Search - Search for specific content in PDFs with context
- 🖼️ Image Extraction - Extract images from PDFs (requires poppler-utils)
- 📑 Table of Contents - Extract PDF bookmarks and outline structure (requires pdftk/pdfinfo)
📦 Installation
npm install parseflow-core
Or using pnpm:
pnpm add parseflow-core
Or using yarn:
yarn add parseflow-core
🚀 Quick Start
Text Extraction
import { PDFParser } from 'parseflow-core';
const parser = new PDFParser();
// Extract all text
const result = await parser.extractText('path/to/document.pdf');
console.log(result.text);
// Extract specific page
const page2 = await parser.extractText('path/to/document.pdf', { page: 2 });
// Extract page range
const pages = await parser.extractText('path/to/document.pdf', { range: '1-5' });
Metadata Extraction
const metadata = await parser.getMetadata('path/to/document.pdf');
console.log(metadata);
// {
//   title: 'Document Title',
//   author: 'Author Name',
//   pageCount: 10,
//   creationDate: '2025-01-01',
//   ...
// }
Keyword Search
const results = await parser.searchPDF('path/to/document.pdf', 'keyword', {
  caseSensitive: false,
  maxResults: 10
});
results.forEach(result => {
  console.log(`Found on page ${result.page}: ${result.context}`);
});
Image Extraction (requires poppler-utils)
import { ImageExtractorExternal } from 'parseflow-core';
const extractor = new ImageExtractorExternal();
const images = await extractor.extract('path/to/document.pdf', './output', {
  format: 'png'
});
Table of Contents (requires pdftk or pdfinfo)
import { TOCExtractorExternal } from 'parseflow-core';
const tocExtractor = new TOCExtractorExternal();
const toc = await tocExtractor.extract('path/to/document.pdf');
console.log(toc);
📚 API Reference
PDFParser
Main parser class for PDF operations.
Methods
- extractText(path, options?) - Extract text from PDF
- getMetadata(path) - Get PDF metadata
- searchPDF(path, query, options?) - Search for keywords
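The methods above can be combined with a little glue code. The sketch below is not part of parseflow-core: `pageRanges`, `groupByPage`, and the `SearchHit` type are hypothetical helpers. It assumes that `range` takes inclusive, 1-based "start-end" strings as in the Quick Start example, and that each search result exposes `page` and `context`.

```typescript
// Shape of one search hit, limited to the two fields the Quick Start reads.
// Assumption: the real result objects may carry more fields than these.
interface SearchHit {
  page: number;
  context: string;
}

// Split a document of `pageCount` pages into "start-end" range strings,
// e.g. for extracting a large PDF in batches of `chunkSize` pages.
// Assumption: ranges are inclusive and 1-based, like '1-5' in the README.
function pageRanges(pageCount: number, chunkSize: number): string[] {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const ranges: string[] = [];
  for (let start = 1; start <= pageCount; start += chunkSize) {
    const end = Math.min(start + chunkSize - 1, pageCount);
    ranges.push(`${start}-${end}`);
  }
  return ranges;
}

// Collect the context snippets of all hits that fall on the same page,
// keeping pages in the order in which they first appear.
function groupByPage(hits: SearchHit[]): Map<number, string[]> {
  const byPage = new Map<number, string[]>();
  for (const hit of hits) {
    const contexts = byPage.get(hit.page) ?? [];
    contexts.push(hit.context);
    byPage.set(hit.page, contexts);
  }
  return byPage;
}
```

Each string returned by `pageRanges` could be passed as `{ range }` to `extractText`, with `pageCount` taken from `getMetadata`. Whether the library accepts a single-page range such as '11-11' is not stated in this README, so that case is worth checking before relying on it.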
ImageExtractorExternal
Extract images from PDF using external tools.
Methods
- isAvailable() - Check if pdfimages is available
- extract(pdfPath, outputDir, options?) - Extract images
TOCExtractorExternal
Extract table of contents from PDF.
Methods
- isAvailable() - Check if pdftk/pdfinfo is available
- extract(pdfPath, options?) - Extract TOC
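Because both external extractors expose `isAvailable()`, a caller can skip a feature when its tool is missing instead of failing. The following is a minimal sketch, not library code: `runIfAvailable` and the `AvailabilityCheck` type are hypothetical. The README does not say whether `isAvailable()` returns a boolean or a promise, so the sketch awaits the value, which works for either.

```typescript
// Structural type for anything with the documented isAvailable() method.
// Assumption: the return type is undocumented; `await` handles both cases.
interface AvailabilityCheck {
  isAvailable(): boolean | Promise<boolean>;
}

// Run `run(extractor)` only when the extractor reports that its external
// tool is available; otherwise resolve to null so the caller can degrade.
async function runIfAvailable<E extends AvailabilityCheck, T>(
  extractor: E,
  run: (extractor: E) => T | Promise<T>,
): Promise<T | null> {
  if (!(await extractor.isAvailable())) {
    return null;
  }
  return await run(extractor);
}

// Hypothetical usage with the classes from this README:
// const images = await runIfAvailable(new ImageExtractorExternal(), (e) =>
//   e.extract('path/to/document.pdf', './output', { format: 'png' }));
// if (images === null) console.warn('pdfimages not found, skipping images');
```

Returning `null` rather than throwing keeps the optional features optional: text and metadata extraction can still run on machines without Poppler or pdftk.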
🔧 External Tools
Some features require external tools:
Image Extraction
Windows:
- Download Poppler
- Add to system PATH
Linux:
sudo apt-get install poppler-utils
macOS:
brew install poppler
TOC Extraction
Windows:
- Download Poppler (includes pdfinfo)
Linux:
sudo apt-get install poppler-utils pdftk
macOS:
brew install poppler pdftk-java
💻 Requirements
- Node.js: >= 18.0.0
- TypeScript: >= 5.0.0 (for development)
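To confirm that the external tools are installed and reachable through PATH, a quick check from Node.js can help. This sketch is independent of parseflow-core and uses only Node's built-in `child_process` module. `isOnPath` is a hypothetical helper, and the tool names (`pdfimages`, `pdfinfo`, `pdftk`) are the ones mentioned in this README.

```typescript
import { spawnSync } from "node:child_process";

// Returns false only when the command cannot be found (spawn fails with
// ENOENT). The exit code is ignored on purpose, because tools differ in
// what they return for a version flag.
function isOnPath(command: string, args: string[] = ["-v"]): boolean {
  const result = spawnSync(command, args, { stdio: "ignore", timeout: 5000 });
  const code = (result.error as { code?: string } | undefined)?.code;
  return code !== "ENOENT";
}

// Example: report which of the tools named in this README are reachable.
// pdfimages and pdfinfo (Poppler) print their version with -v;
// pdftk uses --version.
function reportTools(): Record<string, boolean> {
  return {
    pdfimages: isOnPath("pdfimages"),
    pdfinfo: isOnPath("pdfinfo"),
    pdftk: isOnPath("pdftk", ["--version"]),
  };
}
```

On Windows, a `false` result after installing Poppler usually means its `bin` directory has not been added to the system PATH yet.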
📖 Documentation
For complete documentation, visit:
🤝 Contributing
Contributions are welcome! Please see CONTRIBUTING.md for details.
📄 License
MIT © Libres-coder
Made with ❤️ by ParseFlow Team