Package Exports
- pdf-stream
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (pdf-stream) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
pdf-stream
Creates a stream from PDF
Node.js module for streaming PDF text content.
Based on PDF.js library.
Table of Contents
Install
npm i pdf-stream --save
Usage
Basic
Text stream from PDF file
'use strict';
const text = require('pdf-stream').text;
// Load file contents to ArrayBuffer synchronously
let file = './example.pdf';
let pdf = new Uint8Array(fs.readFileSync(file));
// Stream PDF text to stdout
text(pdf)
.pipe(process.stdout);
Text stream from PDF link
You need the XMLHttpRequest
as global variable.
Install the xhr2 library locally:
npm i xhr2 --save
'use strict';
const text = require('pdf-stream').text;
global.XMLHttpRequest = require('xhr2'); // for PDFJS
let pdf = 'https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf';
text(pdf)
.pipe(process.stdout);
Text stream from PDF link with metadata as XML string
If you got error:
ReferenceError: DOMParser is not defined
You need the DOMParser
as global variable, because PDF.js use it for XML metadata parsing. Install the xmldom library locally:
npm i xmldom --save
'use strict';
const text = require('pdf-stream').text;
global.XMLHttpRequest = require('xhr2'); // File download
global.DOMParser = require('xmldom').DOMParser; // XML Metadata parsing
let pdf = 'https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf';
text(pdf)
.pipe(process.stdout);
Advanced
Create transform class for replacing string
'use strict';
const Transform = require('stream').Transform;
const pdf_stream = require('pdf-stream');
const PDFReadable = pdf_stream.PDFReadable;
const PDFStringifyTransform = pdf_stream.PDFStringifyTransform;
let url = 'https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf';
// Transform class for replacing strings
class ReplaceTransform extends Transform {
constructor(options) {
super({
writableObjectMode: true,
readableObjectMode: true
});
this.from = options.from;
this.to = options.to;
}
// For every object
_transform(obj, encoding, cb) {
// Get text content items
if (typeof obj.textContent !== 'undefined'
&& Array.isArray(obj.textContent.items)) {
obj.textContent.items.forEach((item, i) => {
// Working with the PDF.js `textContent` object
// Replace substring to another
obj.textContent.items[i].str = item.str.replace(this.from, this.to);
});
}
this.push(obj);
cb();
}
}
// Pipeline
new PDFReadable(url)
.pipe(new ReplaceTransform({
from: /trace/gi,
to: ':-)'
}))
.pipe(new PDFStringifyTransform()) // Convert stream from object to string
.pipe(process.stdout);
API
All methods are streams, use them with .pipe()
.
text(options)
alternative usage:
text(pdf, whitespace)
Gets text stream from PDF.
Convert PDF to text, optionally can replace whitespaces.
Options:
pdf
— URL or ArrayBuffer;whitespace
— the string that replaces the whitespace␣
. Replacement disabled by default.
In the PDF.js viewer whitespaces is an empty string. For making output comparable with the viewer use:
text(pdf, '')
Return: {stream.Readable}
new PDFReadable(options)
alternative usage:
new PDFReadable(pdf)
Making the Readable stream in object mode from PDF text content.
Options:
pdf
— URL or ArrayBuffer;- inherit from
stream.Readable
options.
Return: {stream.Readable}
new PDFStringifyTransform(options)
alternative usage:
new PDFStringifyTransform(whitespace)
Transform PDF text content object to string.
Options:
whitespace
— the string that replaces the whitespace␣
. Replacement disabled by default;- inherit from
stream.Transform
options.
Return: {stream.Readable}
Contribute
Contributors are welcome. Open an issue or submit pull request.
Small note: If editing the README, please conform to the standard-readme specification.
License
Apache 2.0
© Sergey N