Package Exports
- html-tokenizer
- html-tokenizer/parser
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (html-tokenizer) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
HTML Tokenizer
A small, super-fast, event-driven, fault-tolerant, html tag-soup tokenizer that works in node or browsers via browserify.
You pass it a string which is supposed to contain HTML, and it emits a stream of events telling you what things it finds.
npm install html-tokenizerTokenizer Example
var Tokenizer = require('html-tokenizer')
var tokenizer = new Tokenizer()
tokenizer.on('opening-tag', function(name) { ... })
tokenizer.on('closing-tag', function(name) { ... })
...etc...
tokenizer.tokenize('<p>hello</p>')
tokenizer.tokenize('<foo></bar>')Parser Example
A basic HTML parser is included in the project which you can require separately.
var Parser = require('html-tokenizer/parser')
var parser = new Parser()
parser.on('open', function(name, attributes) { ... })
parser.on('close', function(name) { ... })
...etc...
parser.parse('<p>hello</p>')
parser.parse('<foo></bar>')Tokenizer API
var Tokenizer = require('html-tokenizer')var tokenizer = new Tokenizer(opts)opts.entitiesAll numeric entity codes are supported, but only a few common text ones are. If you wanted exhaustive support, you could add a map here like{ copy: '©' }which will be merged over the ones it already supports.
tokenizer.on(event, fn)- A tokenizer is an EventEmitter. Events are emitted synchronously.tokenizer.tokenize(html)- Make sure you set up all your events before doing this!tokenizer.cancel()- Abort the current parsing operation for whatever reason.- event
"opening-tag" (name)- Found the beginning of a tag - event
"attribute" (name, value)- Found an attribute - event
"text" (text)- Found some text - event
"comment" (commentText)- Found a comment - event
"opening-tag-end" (name, token)- Found the end of an opening tag.tokenwill either be">"or"/>"so between that andnameyou can make informed choices when writing your parser. - event
"closing-tag" (name)- Found a closing tag like</foo> - event
"done" ()- The tokenizer finished - event
"cancel" ()- The tokenizer was canceled before it finished
Parser API
var Parser = require('html-tokenizer/parser')var parser = new Parser()parser.on(event, fn)- A parser is an EventEmitter. Events are emitted synchronously.parser.parse(html)- Make sure you set up all your events before doing this!- event
"open" (name, attributes, immediateClose)- A tag has opened.immediateClosewill be true if the tag self-closes. - event
"close" (name, immediateClose)- A tag has closed, either due to</tag>or<tag/>, or self-closers like<br>.immediateClosewill be true if this close was due to self-closing. - event
"text" (text)- Found some text - event
"comment" (commentText)- Found a comment - event
"done" ()- The parser finished
Tokenizer Caveats
- Does not handle
<![CDATA[]]>(passes through as text) - Does not handle
<!doctype>(passes through as text) - Does not handle
<? processing instructions ?>(passes through as text) - Only converts a few
&entities;by default - Won't handle every corner case identically to HTML5 browsers
- Does not consume or produce Node.js streams
- Mainly intended for client-side processing of small html snippets
- On unrecoverable errors, finishes early rather than throwing
- Performs best on clean markup