JSPM

  • ESM via JSPM
  • ES Module Entrypoint
  • Export Map
  • Keywords
  • License
  • Repository URL
  • TypeScript Types
  • README
  • Created
  • Published
  • Downloads 2204
  • Score
    100M100P100Q131553F
  • License MIT

Small, fast, event-driven, fault-tolerant html tokenizer. Works in node or browsers.

Package Exports

  • html-tokenizer
  • html-tokenizer/parser

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (html-tokenizer) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

HTML Tokenizer

A small, super-fast, event-driven, fault-tolerant, html tag-soup tokenizer that works in node or browsers via browserify.

You pass it a string which is supposed to contain HTML, and it emits a stream of events telling you what things it finds.

npm install html-tokenizer

Tokenizer Example

var Tokenizer = require('html-tokenizer')
var tokenizer = new Tokenizer()
tokenizer.on('opening-tag', function(name) { ... })
tokenizer.on('closing-tag', function(name) { ... })
...etc...
tokenizer.tokenize('<p>hello</p>')
tokenizer.tokenize('<foo></bar>')

Parser Example

A basic HTML parser is included in the project which you can require separately.

var Parser = require('html-tokenizer/parser')
var parser = new Parser()
parser.on('open', function(name, attributes) { ... })
parser.on('close', function(name) { ... })
...etc...
parser.parse('<p>hello</p>')
parser.parse('<foo></bar>')

Tokenizer API

  • var Tokenizer = require('html-tokenizer')
  • var tokenizer = new Tokenizer(opts)
    • opts.entities All numeric entity codes are supported, but only a few common text ones are. If you wanted exhaustive support, you could add a map here like { copy: '©' } which will be merged over the ones it already supports.
  • tokenizer.on(event, fn) - A tokenizer is an EventEmitter. Events are emitted synchronously.
  • tokenizer.tokenize(html) - Make sure you set up all your events before doing this!
  • tokenizer.cancel() - Abort the current parsing operation for whatever reason.
  • event "opening-tag" (name) - Found the beginning of a tag
  • event "attribute" (name, value) - Found an attribute
  • event "text" (text) - Found some text
  • event "comment" (commentText) - Found a comment
  • event "opening-tag-end" (name, token) - Found the end of an opening tag. token will either be ">" or "/>" so between that and name you can make informed choices when writing your parser.
  • event "closing-tag" (name) - Found a closing tag like </foo>
  • event "done" () - The tokenizer finished
  • event "cancel" () - The tokenizer was canceled before it finished

Parser API

  • var Parser = require('html-tokenizer/parser')
  • var parser = new Parser()
  • parser.on(event, fn) - A parser is an EventEmitter. Events are emitted synchronously.
  • parser.parse(html) - Make sure you set up all your events before doing this!
  • event "open" (name, attributes, immediateClose) - A tag has opened. immediateClose will be true if the tag self-closes.
  • event "close" (name, immediateClose) - A tag has closed, either due to </tag> or <tag/>, or self-closers like <br>. immediateClose will be true if this close was due to self-closing.
  • event "text" (text) - Found some text
  • event "comment" (commentText) - Found a comment
  • event "done" () - The parser finished

Tokenizer Caveats

  • Does not handle <![CDATA[]]> (passes through as text)
  • Does not handle <!doctype> (passes through as text)
  • Does not handle <? processing instructions ?> (passes through as text)
  • Only converts a few &entities; by default
  • Won't handle every corner case identically to HTML5 browsers
  • Does not consume or produce Node.js streams
  • Mainly intended for client-side processing of small html snippets
  • On unrecoverable errors, finishes early rather than throwing
  • Performs best on clean markup