pdf2html
Convert PDF files to HTML, extract text, generate thumbnails, and extract metadata using Apache Tika and PDFBox
🚀 Features
- PDF to HTML conversion - Maintains formatting and structure
- Text extraction - Extract plain text content from PDFs
- Page-by-page processing - Process PDFs page by page
- Metadata extraction - Extract author, title, creation date, and more
- Thumbnail generation - Generate preview images from PDF pages
- Buffer support - Process PDFs from memory buffers or file paths
- TypeScript support - Full type definitions included
- Async/Promise based - Modern async API
- Configurable - Extensive options for customization
📋 Prerequisites
- Node.js >= 14.0.0
- Java Runtime Environment (JRE) >= 8
  - Required for Apache Tika and PDFBox
  - Download Java
📦 Installation
Using npm:
npm install pdf2html
Using yarn:
yarn add pdf2html
Using pnpm:
pnpm add pdf2html
The installation process will automatically download the required Apache Tika and PDFBox JAR files. You'll see a progress indicator during the download.
🔧 Basic Usage
Convert PDF to HTML
const pdf2html = require('pdf2html');
const fs = require('fs');
// Note: run the await calls inside an async function (CommonJS has no top-level await)
// From file path
const htmlFromPath = await pdf2html.html('path/to/document.pdf');
console.log(htmlFromPath);
// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const htmlFromBuffer = await pdf2html.html(pdfBuffer);
console.log(htmlFromBuffer);
// With options
const htmlWithOptions = await pdf2html.html(pdfBuffer, {
  maxBuffer: 1024 * 1024 * 10, // 10MB buffer
});
Extract Text
// From file path
const textFromPath = await pdf2html.text('path/to/document.pdf');
console.log(textFromPath);
// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const textFromBuffer = await pdf2html.text(pdfBuffer);
console.log(textFromBuffer);
Process Pages Individually
// From file path
const pagesFromPath = await pdf2html.pages('path/to/document.pdf');
// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const htmlPages = await pdf2html.pages(pdfBuffer);
htmlPages.forEach((page, index) => {
  console.log(`Page ${index + 1}:`, page);
});
// Get text for each page
const textPages = await pdf2html.pages(pdfBuffer, {
  text: true,
});
Extract Metadata
// From file path or buffer
const metadata = await pdf2html.meta(pdfBuffer);
console.log(metadata);
// Output: {
//   title: 'Document Title',
//   author: 'John Doe',
//   subject: 'Document Subject',
//   keywords: 'pdf, conversion',
//   creator: 'Microsoft Word',
//   producer: 'Adobe PDF Library',
//   creationDate: '2023-01-01T00:00:00Z',
//   modificationDate: '2023-01-02T00:00:00Z',
//   pages: 10
// }
Generate Thumbnails
// From file path
const thumbnailFromPath = await pdf2html.thumbnail('path/to/document.pdf');
// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const thumbnailFromBuffer = await pdf2html.thumbnail(pdfBuffer);
console.log('Thumbnail saved to:', thumbnailFromBuffer);
// Custom thumbnail options
const customThumbnail = await pdf2html.thumbnail(pdfBuffer, {
  page: 1, // Page number (default: 1)
  imageType: 'png', // 'png' or 'jpg' (default: 'png')
  width: 300, // Width in pixels (default: 160)
  height: 400, // Height in pixels (default: 226)
});
⚙️ Advanced Configuration
Buffer Size Configuration
By default, the maximum buffer size is 2MB. For large PDFs, you may need to increase this:
const options = {
  maxBuffer: 1024 * 1024 * 50, // 50MB buffer
};
// Apply to any method
await pdf2html.html('large-file.pdf', options);
await pdf2html.text('large-file.pdf', options);
await pdf2html.pages('large-file.pdf', options);
await pdf2html.meta('large-file.pdf', options);
await pdf2html.thumbnail('large-file.pdf', options);
Error Handling
Always wrap your calls in try-catch blocks for proper error handling:
try {
  const html = await pdf2html.html('document.pdf');
  // Process HTML
} catch (error) {
  if (error.code === 'ENOENT') {
    console.error('PDF file not found');
  } else if (error.message.includes('Java')) {
    console.error('Java is not installed or not in PATH');
  } else {
    console.error('PDF processing failed:', error.message);
  }
}
🏗️ API Reference
pdf2html.html(input, [options])
Converts PDF to HTML format.
- input (string | Buffer) - Path to the PDF file or PDF buffer
- options (object, optional)
  - maxBuffer (number) - Maximum buffer size in bytes (default: 2MB)
- Returns: Promise<string> - HTML content
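Because the method returns a promise, a CommonJS script needs an async function around the call. The following is a minimal sketch, not part of the package: the helper name htmlToFile is hypothetical, and the converter is passed in as a parameter so that the helper can be exercised with a stub when Java is not available.

```javascript
const fs = require('fs/promises');

// Hypothetical helper (not part of pdf2html): converts a PDF and writes the
// resulting HTML to disk. `converter` is expected to expose
// html(input, options), as the pdf2html module does.
async function htmlToFile(converter, input, outputPath, options = {}) {
  const html = await converter.html(input, options);
  await fs.writeFile(outputPath, html, 'utf8');
  return outputPath;
}
```

With the real package this would be called as `htmlToFile(require('pdf2html'), 'path/to/document.pdf', 'document.html')`.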
pdf2html.text(input, [options])
Extracts text from PDF.
- input (string | Buffer) - Path to the PDF file or PDF buffer
- options (object, optional)
  - maxBuffer (number) - Maximum buffer size in bytes
- Returns: Promise<string> - Extracted text
pdf2html.pages(input, [options])
Processes PDF page by page.
- input (string | Buffer) - Path to the PDF file or PDF buffer
- options (object, optional)
  - text (boolean) - Extract text instead of HTML (default: false)
  - maxBuffer (number) - Maximum buffer size in bytes
- Returns: Promise<string[]> - Array of HTML or text strings
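Since the result is a plain array with one string per page, ordinary array methods apply to it. As an illustrative sketch (the helper name wordCountsPerPage is hypothetical, and the converter is injected so the function can be tried with a stub), per-page word counts could be computed like this:

```javascript
// Hypothetical helper (not part of pdf2html): counts whitespace-separated
// words on each page. `converter` is expected to expose pages(input, options).
async function wordCountsPerPage(converter, input, options = {}) {
  // Force text mode so each array entry is plain text rather than HTML.
  const textPages = await converter.pages(input, { ...options, text: true });
  return textPages.map((pageText, index) => ({
    page: index + 1,
    words: pageText.split(/\s+/).filter(Boolean).length,
  }));
}
```

With the real package the call would look like `wordCountsPerPage(require('pdf2html'), 'path/to/document.pdf')`.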
pdf2html.meta(input, [options])
Extracts PDF metadata.
- input (string | Buffer) - Path to the PDF file or PDF buffer
- options (object, optional)
  - maxBuffer (number) - Maximum buffer size in bytes
- Returns: Promise<object> - Metadata object
pdf2html.thumbnail(input, [options])
Generates a thumbnail image from PDF.
- input (string | Buffer) - Path to the PDF file or PDF buffer
- options (object, optional)
  - page (number) - Page to thumbnail (default: 1)
  - imageType (string) - 'png' or 'jpg' (default: 'png')
  - width (number) - Thumbnail width (default: 160)
  - height (number) - Thumbnail height (default: 226)
  - maxBuffer (number) - Maximum buffer size in bytes
- Returns: Promise<string> - Path to generated thumbnail
🔧 Manual Dependency Installation
If automatic download fails (e.g., due to network restrictions), you can manually download the dependencies:
Create the vendor directory:
mkdir -p node_modules/pdf2html/vendor
Download the required JAR files:
cd node_modules/pdf2html/vendor
# Download Apache PDFBox
wget https://archive.apache.org/dist/pdfbox/2.0.33/pdfbox-app-2.0.33.jar
# Download Apache Tika
wget https://archive.apache.org/dist/tika/3.1.0/tika-app-3.1.0.jar
Verify the files are in place:
# Run from the project root; should show both JAR files
ls -la node_modules/pdf2html/vendor/
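The same verification can be scripted from Node. This is a rough sketch under one assumption: that the package expects exactly the file names used in the download commands above. The helper name findMissingJars is hypothetical.

```javascript
const fs = require('fs');
const path = require('path');

// JAR file names taken from the download commands above (assumed to be the
// names the package looks for).
const REQUIRED_JARS = ['pdfbox-app-2.0.33.jar', 'tika-app-3.1.0.jar'];

// Hypothetical helper: returns the names of required JARs that are absent
// from the given vendor directory. An empty array means all are present.
function findMissingJars(vendorDir, requiredJars = REQUIRED_JARS) {
  return requiredJars.filter(
    (name) => !fs.existsSync(path.join(vendorDir, name))
  );
}
```

From the project root, `findMissingJars('node_modules/pdf2html/vendor')` would list whichever of the two files still needs to be downloaded.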
🐛 Troubleshooting
Common Issues
"Java is not installed"
- Install Java JRE 8 or higher
- Ensure java is in your system PATH
- Verify with: java -version
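The java -version check can also be run from Node before calling the library, which gives a clearer failure than a conversion error. A minimal sketch using only Node's built-in child_process module; the helper name isJavaAvailable is hypothetical.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: returns true when `java -version` can be started and
// exits with status 0, false otherwise (for example when java is not on PATH,
// in which case spawnSync reports the failure in result.error).
function isJavaAvailable(command = 'java') {
  const result = spawnSync(command, ['-version'], { stdio: 'ignore' });
  return !result.error && result.status === 0;
}
```

A script might call `isJavaAvailable()` once at startup and exit with an installation hint when it returns false.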
"File not found" errors
- Check that the PDF path is correct
- Use absolute paths for better reliability
- Ensure the file has read permissions
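These three checks can be done up front with Node's built-in modules. The sketch below resolves a path to its absolute form and confirms the file is readable before it is handed to the library; the helper name resolveReadablePdf is hypothetical.

```javascript
const fs = require('fs/promises');
const { constants } = require('fs');
const path = require('path');

// Hypothetical helper: resolves the path against the current working
// directory and rejects if the file is missing or not readable.
async function resolveReadablePdf(inputPath) {
  const absolutePath = path.resolve(inputPath);
  // Rejects with an error such as ENOENT or EACCES when the check fails.
  await fs.access(absolutePath, constants.R_OK);
  return absolutePath;
}
```

The returned absolute path can then be passed to any of the pdf2html methods in place of the original relative path.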
"Buffer size exceeded"
- Increase the maxBuffer option
- Process large PDFs page by page
- Consider splitting very large PDFs
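The first suggestion can be automated by retrying with a larger buffer. The sketch below is illustrative only. It assumes a buffer overflow surfaces either with Node's ERR_CHILD_PROCESS_STDIO_MAXBUFFER error code or with "maxBuffer" in the error message, which may not match what the library actually throws. The helper names are hypothetical, and the starting size mirrors the 2MB default mentioned earlier.

```javascript
// Assumption: a buffer overflow is recognisable by Node's child_process error
// code or by "maxBuffer" in the message. Adjust this to the error you observe.
function looksLikeBufferError(error) {
  return (
    error.code === 'ERR_CHILD_PROCESS_STDIO_MAXBUFFER' ||
    /maxBuffer/i.test(String(error.message))
  );
}

// Hypothetical helper: calls convert(input, { maxBuffer }), doubling maxBuffer
// after each buffer-related failure until doubling would exceed `limit`.
async function convertWithGrowingBuffer(
  convert,
  input,
  { start = 2 * 1024 * 1024, limit = 64 * 1024 * 1024 } = {}
) {
  let maxBuffer = start;
  for (;;) {
    try {
      return await convert(input, { maxBuffer });
    } catch (error) {
      // Unrelated errors, or hitting the limit, end the retry loop.
      if (!looksLikeBufferError(error) || maxBuffer * 2 > limit) throw error;
      maxBuffer *= 2;
    }
  }
}
```

With the real package, `convert` could be `(input, options) => pdf2html.html(input, options)`. Capping the growth with `limit` keeps a malformed or enormous PDF from consuming unbounded memory.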
"Download failed during installation"
- Check internet connection
- Try manual installation (see above)
- Check proxy settings if behind firewall
Debug Mode
Enable debug output for troubleshooting:
DEBUG=pdf2html node your-script.js
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Apache Tika - Content analysis toolkit
- Apache PDFBox - PDF manipulation library
📊 Dependencies
- Production: Apache Tika 3.1.0, Apache PDFBox 2.0.33
- Development: See package.json for development dependencies
Made with ❤️ by the pdf2html community