Package Exports
- regexp-tree
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (regexp-tree) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
regexp-tree

Regular expressions processor (parser/traversal/generator) in JavaScript
Table of Contents
- Installation
- Development
- Usage as a CLI
- Usage from Node
- Capturing locations
- Using traversal API
- Using generator API
- Creating RegExp objects
- AST nodes specification
Installation
The parser can be installed as an npm module:
npm install -g regexp-tree
regexp-tree --help
Development
- Fork https://github.com/DmitrySoshnikov/regexp-tree repo
- Make your changes
- Make sure
npm test
still passes (add new tests if needed) - Submit a PR
The regexp-tree parser is implemented as an automatic LR parser using Syntax tool. The parser module is generated from the regexp grammar, which is based on the regular expressions grammar used in ECMAScript.
For development from the github repository, run build
command to generate the parser module:
git clone https://github.com/<your-github-account>/regexp-tree.git
cd regexp-tree
npm install
npm run build
./bin/regexp-tree --help
NOTE: You need to run
build
command every time you change the grammar file.
Usage as a CLI
Check the options available from CLI:
regexp-tree --help
Usage: regexp-tree [options]
Options:
-e, --expression A regular expression to be parsed
-l, --loc Whether to capture AST node locations
To parse a regular expression, pass -e
option:
regexp-tree -e '/a|b/i'
Which produces an AST node corresponding to this regular expression:
{
type: 'RegExp',
body: {
type: 'Disjunction',
left: {
type: 'Char',
value: 'a',
kind: 'simple'
},
right: {
type: 'Char',
value: 'b',
kind: 'simple'
}
},
flags: [
'i'
]
}
NOTE: the format of a regexp is
/ Body / OptionalFlags
.
Usage from Node
The parser can also be used as a Node module:
const regexpTree = require('regexp-tree');
const regexpString = (/a|b/i).toString();
console.log(regexpTree.parse(regexpString)); // RegExp AST
Capturing locations
For source code transformation tools it might be useful also to capture locations of the AST nodes. From the command line it's controlled via the -l
option:
regexp-tree -e '/ab/' -l
This attaches loc
object to each AST node:
{
type: 'RegExp',
body: {
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
kind: 'simple',
loc: {
start: 1,
end: 2
}
},
{
type: 'Char',
value: 'b',
kind: 'simple',
loc: {
start: 2,
end: 3
}
}
],
loc: {
start: 1,
end: 3
}
},
flags: [],
loc: {
start: 0,
end: 4
}
}
From Node it's controlled via setOptions
method exposed on the parser:
const regexpTree = require('regexp-tree');
const parsed = regexpTree
.parser
.setOptions({captureLocations: true})
.parse('/a|b/');
Using traversal API
The traverse module allows handling needed AST nodes using visitor pattern. In Node the module is exposed as regexpTree.traverse
method.
Example:
const regexpTree = require('regexp-tree');
// Get AST.
const ast = regexpTree.parse('/[a-z]{1,}/');
// Handle nodes.
regexpTree.traverse(ast, {
// Handle "Quantifier" node type,
// transforming `{1,}` quantifier to `+`.
onQuantifier(node) {
// {1,} -> +
if (
node.kind === 'Range' &&
node.from === 1 &&
!node.to
) {
node.kind = '+';
delete node.from;
}
},
});
// Generate the regexp.
const re = regexpTree.generate(ast);
console.log(re); // '/[a-z]+/'
Using generator API
The generator module generates regular expressions from corresponding AST nodes. In Node the module is exposed as regexpTree.generate
method.
Example:
const regexpTree = require('regexp-tree');
const re = regexpTree.generate({
type: 'RegExp',
body: {
type: 'Char',
value: 'a',
kind: 'simple',
},
flags: ['i'],
});
console.log(re); // '/a/i'
Creating RegExp objects
To create an actual RegExp
JavaScript object, we can use regexpTree.toRegExp
method:
const regexpTree = require('regexp-tree');
const re = regexpTree.toRegExp('/[a-z]/i');
console.log(
re.test('a'), // true
re.test('Z'), // true
);
AST nodes specification
Below are the AST node types for different regular expressions patterns:
Char
A basic building block, single character. Can be escaped, and be of different kinds.
Simple char
Basic non-escaped char in a regexp:
z
Node:
{
type: 'Char',
value: 'z',
kind: 'simple'
}
NOTE: to test this from CLI, the char should be in an actual regexp --
/z/
.
Escaped char
\z
The same value, escaped
flag is added:
{
type: 'Char',
value: 'z',
kind: 'simple',
escaped: true
}
Escaping is mostly used with meta symbols:
// Syntax error
*
\*
OK, node:
{
type: 'Char',
value: '*',
kind: 'simple',
escaped: true
}
Meta char
A meta character should not be confused with an escaped char.
Example:
\n
Node:
{
type: 'Char',
value: '\\n',
kind: 'meta',
}
Among other meta character are: \f
, \r
, \n
, \t
, \v
, \0
, [\b]
(backspace char), \s
, \S
, \w
, \W
, \d
, \D
.
NOTE:
\b
and\B
are parsed asAssertion
node type, notChar
.
Control char
A char preceded with \c
, e.g. \cx
, which stands for CTRL+x
:
\cx
Node:
{
type: 'Char',
value: '\\cx',
kind: 'control',
}
HEX char-code
A char preceded with \x
, followed by a HEX-code, e.g. \x3B
(symbol ;
):
\x3B
Node:
{
type: 'Char',
value: '\\x3B',
kind: 'hex',
}
Decimal char-code
Char-code:
\42
Node:
{
type: 'Char',
value: '\\42',
kind: 'decimal',
}
Octal char-code
Char-code started with \0
, followed by an octal number:
\073
Node:
{
type: 'Char',
value: '\\073',
kind: 'oct',
}
Unicode
Unicode char started with \u
, followed by a hex number:
\u003B
\u{003B}
Node:
{
type: 'Char',
value: '\\u003B',
kind: 'unicode',
}
Character class
Character classes define a set of characters. A set may include as simple characters, as well as character ranges. A class can be positive (any from the characters in the class match), or negative (any but the characters from the class match).
Positive character class
A positive character class is defined between [
and ]
brackets:
[a*]
A node:
{
type: 'CharacterClass',
expressions: [
{
type: 'Char',
value: 'a',
kind: 'simple'
},
{
type: 'Char',
value: '*',
kind: 'simple'
}
]
}
NOTE: some meta symbols are treated as normal characters in a character class. E.g.
*
is not a repetition quantifier, but a simple char.
Negative character class
A negative character class is defined between [^
and ]
brackets:
[^ab]
An AST node is the same, just negative
property is added:
{
type: 'CharacterClass',
negative: true,
expressions: [
{
type: 'Char',
value: 'a',
kind: 'simple'
},
{
type: 'Char',
value: 'b',
kind: 'simple'
}
]
}
Character class ranges
As mentioned, a character class may also contain ranges of symbols:
[a-z]
A node:
{
type: 'CharacterClass',
expressions: [
{
type: 'ClassRange',
from: {
type: 'Char',
value: 'a',
kind: 'simple'
},
to: {
type: 'Char',
value: 'z',
kind: 'simple'
}
}
]
}
NOTE: it is a syntax error if
to
value is less thanfrom
value:/[z-a]/
.
The range value can be the same for from
and to
, and the special range -
character is treated as a simple character when it stands in a char position:
// from: 'a', to: 'a'
[a-a]
// from: '-', to: '-'
[---]
// simple '-' char:
[-]
// 3 ranges:
[a-zA-Z0-9]+
Alternative
An alternative (or concatenation) defines a chain of patterns followed one after another:
abc
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
kind: 'simple'
},
{
type: 'Char',
value: 'b',
kind: 'simple'
},
{
type: 'Char',
value: 'c',
kind: 'simple'
}
]
}
Another examples:
// 'a' with a quantifier, followed by 'b'
a?b
// A group followed by a class:
(ab)[a-z]
Disjunction
The disjunction defines "OR" operation for regexp patterns. It's a binary operation, having left
, and right
nodes.
Matches a
or b
:
a|b
A node:
{
type: 'Disjunction',
left: {
type: 'Char',
value: 'a',
kind: 'simple'
},
right: {
type: 'Char',
value: 'b',
kind: 'simple'
}
}
Groups
The groups play two roles: they define grouping precedence, and allow to capture needed sub-expressions in case of a capturing group.
Capturing group
"Capturing" means the matched string can be referred later by a user, including in the pattern itself -- by using backreferences.
Char a
, and b
are grouped, followed by the c
char:
(ab)c
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Group',
capturing: true,
expression: {
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
kind: 'simple'
},
{
type: 'Char',
value: 'b',
kind: 'simple'
}
]
}
},
{
type: 'Char',
value: 'c',
kind: 'simple'
}
]
}
Another example:
// A grouped disjunction of a symbol, and a character class:
(5|[a-z])
Non-capturing group
Sometimes we don't need to actually capture the matched string from a group. In this case we can use a non-capturing group:
Char a
, and b
are grouped, but not captured, followed by the c
char:
(?:ab)c
The same node, the capturing
flag is false
:
{
type: 'Alternative',
expressions: [
{
type: 'Group',
capturing: false,
expression: {
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
kind: 'simple'
},
{
type: 'Char',
value: 'b',
kind: 'simple'
}
]
}
},
{
type: 'Char',
value: 'c',
kind: 'simple'
}
]
}
Backreferences
A captured group can be referenced in the pattern using notation of an escaped group number.
Matches abab
string:
(ab)\1
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Group',
capturing: true,
expression: {
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
kind: 'simple'
},
{
type: 'Char',
value: 'b',
kind: 'simple'
}
]
}
},
{
type: 'Backreference',
reference: 1,
}
]
}
Quantifiers
Quantifiers specify repetition of a regular expression (or of its part). Below are the quantifiers which wrap a parsed expression into a Repetition
node. The quantifier itself can be of different kinds, and has Quantifier
node type.
? zero-or-one
The ?
quantifier is short for {0,1}
.
a?
Node:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
kind: 'simple'
},
quantifier: {
type: 'Quantifier',
kind: '?',
greedy: true
}
}
* zero-or-more
The *
quantifier is short for {0,}
.
a*
Node:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
kind: 'simple'
},
quantifier: {
type: 'Quantifier',
kind: '*',
greedy: true
}
}
+ one-or-more
The +
quantifier is short for {1,}
.
// Same as `aa*`, or `a{1,}`
a+
Node:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
kind: 'simple'
},
quantifier: {
type: 'Quantifier',
kind: '+',
greedy: true
}
}
Range-based quantifiers
Explicit range-based quantifiers are parsed as follows:
Exact number of matches
a{3}
The type of the quantifier is Range
, and from
, and to
properties have the same value:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
kind: 'simple'
},
quantifier: {
type: 'Quantifier',
kind: 'Range',
from: 3,
to: 3,
greedy: true
}
}
Open range
An open range doesn't have max value (assuming semantic "more", or Infinity value):
a{3,}
An AST node for such range doesn't contain to
property:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
kind: 'simple'
},
quantifier: {
type: 'Quantifier',
kind: 'Range',
from: 3,
greedy: true
}
}
Closed range
A closed range has explicit max value: (which syntactically can be the same as min value):
a{3,5}
// Same as a{3}
a{3,3}
An AST node for a closed range:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
kind: 'simple'
},
quantifier: {
type: 'Quantifier',
kind: 'Range',
from: 3,
to: 5,
greedy: true
}
}
NOTE: it is a syntax error if the max value is less than min value:
/a{3,2}/
Non-greedy
If any quantifier is followed by the ?
, the quantifier becomes non-greedy.
Example:
a+?
Node:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
kind: 'simple'
},
quantifier: {
type: 'Quantifier',
kind: '+',
greedy: false
}
}
Other examples:
a??
a*?
a{1}?
a{1,}?
a{1,3}?
Assertions
Assertions appear as separate AST nodes, however instread of manipulating on the characters themselves, they assert certain conditions of a matching string. Examples: ^
-- beginning of a string (or a line in multiline mode), $
-- end of a string, etc.
^ begin marker
The ^
assertion checks whether a scanner is at the beginning of a string (or a line in multiline mode).
In the example below ^
is not a property of the a
symbol, but a separate AST node for the assertion. The parsed node is actually an Alternative
with two nodes:
^a
The node:
{
type: 'Alternative',
expressions: [
{
type: 'Assertion',
kind: '^'
},
{
type: 'Char',
value: 'a',
kind: 'simple'
}
]
}
Since assertion is a separate node, it may appear anywhere in the matching string. The following regexp is completely valid, and asserts beginning of the string; it'll match an empty string:
^^^^^
$ end marker
The $
assertion is similar to ^
, but asserts the end of a string (or a line in a multiline mode):
a$
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
kind: 'simple'
},
{
type: 'Assertion',
kind: '$'
}
]
}
And again, this is a completely valid regexp, and matches an empty string:
^^^^$$$$$
// valid too:
$^
Boundary assertions
The \b
assertion check for word boundary, i.e. the position between a word and a space.
Matches x
in x y
, but not in xy
:
x\b
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'x',
kind: 'simple'
},
{
type: 'Assertion',
kind: '\\b'
}
]
}
The \B
is vice-versa checks for non-word boundary. The following example matches x
in xy
, but not in x y
:
x\B
A node is the same:
{
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'x',
kind: 'simple'
},
{
type: 'Assertion',
kind: '\\B'
}
]
}
Lookahead assertions
These assertions check whether a pattern is followed (or not followed for the negative assertion) by another pattern.
Positive lookahead assertion
Matches a
only if it's followed by b
:
a(?=b)
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
kind: 'simple'
},
{
type: 'Assertion',
kind: 'Lookahead',
assertion: {
type: 'Char',
value: 'b',
kind: 'simple'
}
}
]
}
Negative lookahead assertion
Matches a
only if it's not followed by b
:
a(?!b)
A node is similar, just negative
flag is added:
{
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
kind: 'simple'
},
{
type: 'Assertion',
kind: 'Lookahead',
negative: true,
assertion: {
type: 'Char',
value: 'b',
kind: 'simple'
}
}
]
}
Lookbehind assertions
Lookbehind assertions are not yet supported (since the grammar is based on ECMAScript regular expressions which don't support lookbehind assertions), however the support can be added later.