JSPM

  • Downloads 37761
  • License MIT

String Tokenization Library for JavaScript

Package Exports

  • tokenizr

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (tokenizr) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

Tokenizr

String Tokenization Library for JavaScript

About

Tokenizr is a small JavaScript library providing flexible string tokenization functionality. It is intended to be used as the underlying "lexical scanner" in a Recursive Descent based "syntax parser". Its distinct features are:

  • Efficient Iteration:
    It iterates over the input character string in a read-only and copy-less fashion.

  • Stacked States:
    Its tokenization is based on stacked states for determining rules which can be applied. Each rule can be enabled for one or more particular states only.

  • Regular Expression Matching:
    Its tokenization is based on Regular Expressions for matching the input string.

  • Match Repeating:
    Rule actions can (optionally after changing the state) force the matching process to repeat from scratch at the current input position.

  • Match Rejecting:
    Rule actions can reject their match at the current input position and let subsequent rules still match.

  • Match Ignoring:
    Rule actions can force the matched input to be ignored (without generating a token at all).

  • Match Accepting:
    Rule actions can accept the matched input and provide one or even more resulting tokens.

  • Shared Context Data:
    Rule actions (during tokenization) can optionally store and retrieve arbitrary values to/from their tokenization context to share data between rules.

  • Token Text and Value:
    Tokens provide information about their matched input text and can provide a different corresponding (pre-parsed) value, too.

  • Debug Mode:
    The tokenization process can be debugged through optional detailed logging of the internal processing.

  • Nestable Transactions:
    The tokenization can be split into distinct (and nestable) transactions which can be committed or rolled back. This allows the tokenization to be stepped back incrementally, which supports attempting alternative parses.

  • Token Look-Ahead:
    The forthcoming tokens can be inspected to support alternative decisions from within the parser based on look-ahead tokens.
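
The last two features, nestable transactions and token look-ahead, can be illustrated without the library. The following is only a minimal sketch in plain JavaScript: the TokenCursor class and its method names are hypothetical and are not Tokenizr's API, and plain strings stand in for tokens.

```javascript
// Hypothetical helper (NOT part of Tokenizr): a cursor over pre-computed
// tokens that supports look-ahead and nestable transactions.
class TokenCursor {
    constructor (tokens) {
        this.tokens = tokens
        this.pos    = 0    // index of the next token to consume
        this.saved  = []   // stack of positions, one per open transaction
    }
    // inspect the token `offset` positions ahead without consuming it
    peek (offset = 0) {
        return this.tokens[this.pos + offset]
    }
    // consume and return the next token
    next () {
        return this.tokens[this.pos++]
    }
    // open a (possibly nested) transaction
    begin () {
        this.saved.push(this.pos)
    }
    // keep everything consumed since the matching begin()
    commit () {
        this.saved.pop()
    }
    // un-consume everything since the matching begin()
    rollback () {
        this.pos = this.saved.pop()
    }
}

let cursor = new TokenCursor([ "foo", "{", "baz", "=", "1", "}" ])

cursor.begin()                  // outer transaction
cursor.next()                   // consumes "foo"
cursor.begin()                  // nested transaction
cursor.next()                   // consumes "{"
cursor.next()                   // consumes "baz"
cursor.rollback()               // inner alternative failed: back to "{"
let lookahead = cursor.peek()   // "{"
cursor.commit()                 // outer transaction keeps "foo" consumed
let position = cursor.pos       // 1
```

Because each begin() pushes its own saved position, a rollback of an inner transaction leaves the progress of the outer one untouched, which is what makes trying parsing alternatives cheap.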

Installation

Node environments (with NPM package manager):

$ npm install tokenizr

Browser environments (with Bower package manager):

$ bower install tokenizr

Usage

Suppose we have a configuration file sample.cfg:

foo {
    baz = 1 // sample comment
    bar {
        quux = 42
        hello = "hello \"world\"!"
    }
    quux = 7
}

Then we can write a lexical scanner in ECMAScript 6 (under Node.js) for the tokens like this:

import fs       from "fs"
import Tokenizr from "tokenizr"

let lexer = new Tokenizr()

lexer.rule(/[a-zA-Z_][a-zA-Z0-9_]*/, (ctx, match) => {
    ctx.accept("id")
})
lexer.rule(/[+-]?[0-9]+/, (ctx, match) => {
    ctx.accept("number", parseInt(match[0], 10))
})
lexer.rule(/"((?:\\"|[^\r\n])*)"/, (ctx, match) => {
    ctx.accept("string", match[1].replace(/\\"/g, "\""))
})
lexer.rule(/\/\/[^\r\n]+\r?\n/, (ctx, match) => {
    ctx.ignore()
})
lexer.rule(/[ \t\r\n]+/, (ctx, match) => {
    ctx.ignore()
})
lexer.rule(/./, (ctx, match) => {
    ctx.accept("char")
})

let cfg = fs.readFileSync("sample.cfg", "utf8")

lexer.input(cfg)
lexer.debug(true)
lexer.tokens().forEach((token) => {
    console.log(token.toString())
})

The output of running this sample program is:

<type: id, value: "foo", text: "foo", pos: 5, line: 2, column: 5>
<type: char, value: "{", text: "{", pos: 9, line: 2, column: 9>
<type: id, value: "baz", text: "baz", pos: 19, line: 3, column: 9>
<type: char, value: "=", text: "=", pos: 23, line: 3, column: 13>
<type: number, value: 1, text: "1", pos: 25, line: 3, column: 15>
<type: id, value: "bar", text: "bar", pos: 53, line: 4, column: 9>
<type: char, value: "{", text: "{", pos: 57, line: 4, column: 13>
<type: id, value: "quux", text: "quux", pos: 71, line: 5, column: 13>
<type: char, value: "=", text: "=", pos: 76, line: 5, column: 18>
<type: number, value: 42, text: "42", pos: 78, line: 5, column: 20>
<type: id, value: "hello", text: "hello", pos: 93, line: 6, column: 13>
<type: char, value: "=", text: "=", pos: 99, line: 6, column: 19>
<type: string, value: "hello \"world\"!", text: "hello "world"!"", pos: 101, line: 6, column: 21>
<type: char, value: "}", text: "}", pos: 126, line: 7, column: 9>
<type: id, value: "quux", text: "quux", pos: 136, line: 8, column: 9>
<type: char, value: "=", text: "=", pos: 141, line: 8, column: 14>
<type: number, value: 7, text: "7", pos: 143, line: 8, column: 16>
<type: char, value: "}", text: "}", pos: 149, line: 9, column: 5>
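
A token stream like the one above could then be fed into a Recursive Descent parser. The following is only a sketch under stated assumptions: tokens are modeled as plain objects with type and value fields, a hand-written array (a shortened version of the stream above) stands in for the lexer output, and the helpers expect and parseEntry are hypothetical names, not part of Tokenizr.

```javascript
// Tokens as plain { type, value } objects; in a real program they would
// come from the lexer instead of this hand-written array (assumption).
let tokens = [
    { type: "id",     value: "foo"  },
    { type: "char",   value: "{"    },
    { type: "id",     value: "baz"  },
    { type: "char",   value: "="    },
    { type: "number", value: 1      },
    { type: "id",     value: "bar"  },
    { type: "char",   value: "{"    },
    { type: "id",     value: "quux" },
    { type: "char",   value: "="    },
    { type: "number", value: 42     },
    { type: "char",   value: "}"    },
    { type: "char",   value: "}"    }
]
let pos = 0

// consume the next token, checking its type (and optionally its value)
function expect (type, value) {
    let token = tokens[pos]
    if (!token || token.type !== type || (value !== undefined && token.value !== value))
        throw new Error("unexpected token at index " + pos)
    pos++
    return token
}

// entry ::= id "=" (number | string) | id "{" entry* "}"
function parseEntry (target) {
    let name = expect("id").value
    let next = tokens[pos]
    if (next && next.type === "char" && next.value === "{") {
        // nested section: parse entries until the closing brace
        expect("char", "{")
        let section = {}
        while (tokens[pos] && !(tokens[pos].type === "char" && tokens[pos].value === "}"))
            parseEntry(section)
        expect("char", "}")
        target[name] = section
    }
    else {
        // simple assignment: take the pre-parsed token value
        expect("char", "=")
        let token = tokens[pos]
        if (!token || (token.type !== "number" && token.type !== "string"))
            throw new Error("expected a number or string at index " + pos)
        pos++
        target[name] = token.value
    }
}

let config = {}
while (pos < tokens.length)
    parseEntry(config)
console.log(JSON.stringify(config))   // {"foo":{"baz":1,"bar":{"quux":42}}}
```

The parser never looks at the raw input text: it only inspects token types and uses the pre-parsed values, which is the division of labor between lexical scanner and syntax parser described in the About section.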

Application Programming Interface (API)

  • new Tokenizr(): Tokenizr
    Create a new tokenization instance.

  • Tokenizr#reset(): Tokenizr
    Reset the tokenization instance to a fresh one.

  • Tokenizr#debug(enable: Boolean): Tokenizr
    Enable (or disable) verbose logging for debugging purposes.

  • Tokenizr#input(input: String): Tokenizr
    Set the input string to tokenize. This implicitly performs a reset() operation beforehand.

  • Tokenizr#rule(state?: String, pattern: RegExp, action: (ctx: TokenizerContext, match: Array[String]) => Void): Tokenizr
    Configure a token matching rule which executes its action when the current tokenization state is one of the states listed in the comma-separated state argument (if state is omitted, the rule applies in all states) and the next input characters match pattern. The ctx argument provides a context object for repeating, rejecting, ignoring or accepting the match; the match argument is the result of the underlying RegExp#exec call.

  • Tokenizr#token(): Token
    Get next token.
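
Since the match argument of a rule action is the result of a RegExp#exec call, an action can read the whole matched text from match[0] and the capture groups from match[1] onward. The following plain JavaScript illustration calls exec() directly, without Tokenizr; the pattern and the input string are made up for the example.

```javascript
// a pattern with two capture groups: a name and a decimal number
let pattern = /([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*([0-9]+)/
let match   = pattern.exec("quux = 42")

let text  = match[0]                 // whole matched text: "quux = 42"
let name  = match[1]                 // first capture group: "quux"
let value = parseInt(match[2], 10)   // second capture group, parsed: 42
let index = match.index              // start offset of the match: 0
```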

FIXME: methods still to be documented!

Implementation Notice

Although Tokenizr is written in ECMAScript 6, it is transpiled to ECMAScript 5 and therefore runs in all current (as of 2015) JavaScript environments.

Internally, Tokenizr scans the input string in a read-only fashion by leveraging RegExp's g (global) flag in combination with the lastIndex field, which is the best one can do on an ECMAScript 5 runtime. For ECMAScript 6 runtimes we will switch to RegExp's new y (sticky) flag in the future, as it is even more efficient.
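
The difference between the two flags can be sketched in plain JavaScript. This is not Tokenizr's actual source code, only an illustration of the technique; the helper names matchAtGlobal and matchAtSticky are hypothetical.

```javascript
let input = "foo = 42"

// ECMAScript 5 approach: "g" flag plus lastIndex. exec() starts searching
// at lastIndex but may match further to the right, so the position of the
// match has to be checked explicitly.
function matchAtGlobal (source, pos) {
    let re = new RegExp(source, "g")
    re.lastIndex = pos
    let m = re.exec(input)
    return (m !== null && m.index === pos) ? m[0] : null
}

// ECMAScript 6 approach: "y" (sticky) flag. exec() only matches exactly
// at lastIndex, so no extra position check is needed.
function matchAtSticky (source, pos) {
    let re = new RegExp(source, "y")
    re.lastIndex = pos
    let m = re.exec(input)
    return m !== null ? m[0] : null
}

let a = matchAtGlobal("[0-9]+", 6)   // "42": the digits start exactly at offset 6
let b = matchAtGlobal("[0-9]+", 0)   // null: a match exists, but not at offset 0
let c = matchAtSticky("[0-9]+", 6)   // "42"
let d = matchAtSticky("[0-9]+", 0)   // null
```

In both variants the input string is never sliced or copied; only the lastIndex offset moves. With the g flag a failed attempt can still scan the remaining input before it is discarded, which is the inefficiency the y flag avoids.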

License

Copyright (c) 2015 Ralf S. Engelschall (http://engelschall.com/)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.