JSPM

  • Downloads 2204
  • License MIT

Small, fast, event-driven, fault-tolerant html tokenizer. Works in node or browsers.

Package Exports

  • html-tokenizer
  • html-tokenizer/entity-map
  • html-tokenizer/parser

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (html-tokenizer) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

HTML Tokenizer

A small, super-fast, event-driven, fault-tolerant, html tag-soup tokenizer that works in node or browsers via browserify.

You pass it a string which is supposed to contain HTML, and it emits a stream of events telling you what things it finds.

npm install html-tokenizer

Tokenizer Example

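var Tokenizer = require('html-tokenizer')
// Supply extra entities; these are merged over the defaults.
var tokenizer = new Tokenizer({entities:{copy:'\u00A9'}})
tokenizer.on('opening-tag', function(name) { console.log('opening-tag', name) })
tokenizer.on('closing-tag', function(name) { console.log('closing-tag', name) })
// ...register handlers for any other events you need...
tokenizer.tokenize('<p>Copyright &copy; 1998</p>')
tokenizer.tokenize('<foo></bar>')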

Parser Example

A basic HTML parser is included in the project, which you can require separately. Instead of just telling you what things it finds, it reveals the tag structure of the document. The tokenizer makes no such guarantees.

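var Parser = require('html-tokenizer/parser')
// Options such as entities are passed through to the underlying Tokenizer.
var parser = new Parser({entities:{copy:'\u00A9'}})
parser.on('open', function(name, attributes) { console.log('open', name, attributes) })
parser.on('close', function(name) { console.log('close', name) })
// ...register handlers for any other events you need...
parser.parse('<p>Copyright &copy; 1998</p>')
parser.parse('<foo></bar>')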

Tokenizer API

| Name | Description |
| --- | --- |
| var Tokenizer = require('html-tokenizer') | Module exports a constructor. |
| new Tokenizer(opts) | Constructor takes options (optional). |
| opts.entities | Constructor option. Entity => charcode map, e.g. {copy:'\u00A9'}. Merged over the defaults. By default only numeric codes are supported, plus a small subset of textual ones. |
| Tokenizer.defaultEntityMap | Default set of entities. |
| tokenizer.on(event, fn) | Events are emitted synchronously during tokenize(). |
| tokenizer.tokenize(html) | Can be called arbitrarily many times per instance. |
| tokenizer.cancel() | Abort the current parsing operation for whatever reason. |

Events

Events are emitted during the tokenize() operation.

| Event | Signature | Description |
| --- | --- | --- |
| start | () | Emitted once at beginning. |
| opening-tag | (name) | Beginning of opening tag, like <foo. |
| attribute | (name, value) | Only fires between "opening-tag" and "opening-tag-end" events. |
| opening-tag-end | (name, token) | Closing bracket of opening tag. token will either be ">" or "/>". |
| text | (text) | Text snippet. |
| comment | (commentText) | Comment text. |
| closing-tag | (name) | Closing tag, like </foo>. |
| done | () | All done. |
| cancel | () | Current tokenize() run was canceled before it finished. |
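
One way to get a feel for the event stream is to record every event in arrival order. The sketch below is only an illustration: collectEvents is a hypothetical helper, not part of html-tokenizer, and it runs against Node's built-in EventEmitter as a stand-in so that it works without the package installed. The emitted sequence is a hand-written simulation of what the events listed above might look like for a small snippet, not captured output from the library.

```javascript
// Stand-in for a Tokenizer instance: only .on(event, fn) is assumed.
var EventEmitter = require('events')

// Event names taken from the tokenizer events listed above.
var TOKENIZER_EVENTS = [
  'start', 'opening-tag', 'attribute', 'opening-tag-end',
  'text', 'comment', 'closing-tag', 'done', 'cancel'
]

// Hypothetical helper: records each event as {type, args} in arrival order.
function collectEvents(emitter, eventNames) {
  var log = []
  eventNames.forEach(function(type) {
    emitter.on(type, function() {
      log.push({ type: type, args: Array.prototype.slice.call(arguments) })
    })
  })
  return log
}

// Simulated events for '<p class="x">hi</p>' (hand-written, not library output).
var fake = new EventEmitter()
var log = collectEvents(fake, TOKENIZER_EVENTS)
fake.emit('start')
fake.emit('opening-tag', 'p')
fake.emit('attribute', 'class', 'x')
fake.emit('opening-tag-end', 'p', '>')
fake.emit('text', 'hi')
fake.emit('closing-tag', 'p')
fake.emit('done')

console.log(log.map(function(e) { return e.type }).join(' '))
```

With the real package, you would pass a Tokenizer instance instead of fake and call tokenizer.tokenize(html). Because events are emitted synchronously, the log should be complete by the time tokenize() returns.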

Parser API

| Name | Description |
| --- | --- |
| var Parser = require('html-tokenizer/parser') | Module exports a constructor. |
| var parser = new Parser(opts) | Constructor takes options (optional). Relevant options passed to Tokenizer(). |
| opts.entities | Constructor option. See above. |
| parser.on(event, fn) | Events are emitted synchronously during parse(). |
| parser.parse(html) | Can be called arbitrarily many times per instance. |

Events

Events are emitted during the parse() operation.

| Event | Signature | Description |
| --- | --- | --- |
| start | () | Emitted once at beginning. |
| open | (name, attributes, selfClosing) | Opening tag. selfClosing will be true if this tag self-closes. |
| text | (text) | Text snippet. |
| comment | (commentText) | Comment text snippet. |
| close | (name, selfClosing) | Closing tag. selfClosing will be true if this was a self-closing tag. |
| done | () | All done. |
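
Since the parser events describe tag structure, a small stack is enough to turn them into a tree. The following is a minimal sketch under stated assumptions: buildTree is a hypothetical helper, attributes is assumed to be a plain object, and every open is assumed to be followed eventually by a matching close (including self-closing tags, where both fire with selfClosing set to true). It uses Node's EventEmitter as a stand-in for a Parser instance, with a hand-written event sequence.

```javascript
// Stand-in for a Parser instance: only .on(event, fn) is assumed.
var EventEmitter = require('events')

// Hypothetical helper: builds a nested tree from parser-style events.
function buildTree(emitter) {
  var root = { name: '#root', attributes: {}, children: [] }
  var stack = [root]
  function top() { return stack[stack.length - 1] }

  emitter.on('open', function(name, attributes) {
    var node = { name: name, attributes: attributes || {}, children: [] }
    top().children.push(node) // attach to the current parent
    stack.push(node)          // descend into the new element
  })
  emitter.on('text', function(text) {
    top().children.push(text)
  })
  emitter.on('close', function() {
    if (stack.length > 1) stack.pop() // never pop the synthetic root
  })
  return root
}

// Simulated events for '<p>Hi<br/></p>' (hand-written, not library output).
var fake = new EventEmitter()
var tree = buildTree(fake)
fake.emit('start')
fake.emit('open', 'p', {}, false)
fake.emit('text', 'Hi')
fake.emit('open', 'br', {}, true)
fake.emit('close', 'br', true)
fake.emit('close', 'p', false)
fake.emit('done')

console.log(JSON.stringify(tree, null, 2))
```

With the real package, you would pass a Parser instance to buildTree and then call parser.parse(html). Comments are ignored here for brevity; a comment handler could push a node in the same way the text handler does.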

Entities

Tokenizer() and Parser() take an options.entities object in order to broaden the set of supported HTML character entities. Exhaustive support can be added, but for browser-based apps this pulls in a large-ish file, so the exhaustive entity map must be required separately.

var Parser = require('html-tokenizer/parser')
var entityMap = require('html-tokenizer/entity-map')
new Parser({ entities: entityMap })
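
To make the role of the entity map concrete, here is a rough sketch of how a name-to-character map merged over defaults could be applied to text. This is not the package's implementation: decodeEntities, sampleDefaults, and the merge via Object.assign are all assumptions made for illustration, and sampleDefaults is not the library's actual default set.

```javascript
// Illustrative only: not html-tokenizer's implementation.
// Decodes numeric entities always, and named entities found in entityMap.
function decodeEntities(text, entityMap) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z0-9]+);/gi, function(match, body) {
    if (body.charAt(0) === '#') {
      var isHex = body.charAt(1) === 'x' || body.charAt(1) === 'X'
      var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10)
      // Leave out-of-range code points untouched instead of throwing.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match
    }
    // Unknown names pass through unchanged.
    return Object.prototype.hasOwnProperty.call(entityMap, body) ? entityMap[body] : match
  })
}

// Hypothetical defaults, merged with caller-supplied entities on top.
var sampleDefaults = { amp: '&', lt: '<', gt: '>' }
var custom = { copy: '\u00A9' }
var merged = Object.assign({}, sampleDefaults, custom)

var decoded = decodeEntities('&copy; &#169; &#xA9; &lt; &bogus;', merged)
console.log(decoded)
```

The design point this illustrates is that a small default map keeps the bundle light, while passing a larger map (such as the one from html-tokenizer/entity-map) widens coverage without changing any other code.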

Tokenizer Caveats

  • Does not handle <![CDATA[]]> (passes through as text)
  • Does not handle <!doctype> (passes through as text)
  • Does not handle <? processing instructions ?> (passes through as text)
  • Only converts a few &entities; by default
  • Won't handle every corner case identically to HTML5 browsers
  • Does not consume or produce Node.js streams
  • Mainly intended for client-side processing of small html snippets
  • On unrecoverable errors, finishes early rather than throwing
  • Performs best on clean markup