Oniguruma-To-ES
A lightweight Oniguruma to JavaScript RegExp transpiler that runs in the browser and on your server. Use it to:
- Take advantage of Oniguruma's extended regex capabilities in JavaScript.
- Run regexes intended for Oniguruma in JavaScript, such as those used in TextMate grammars (used by VS Code, Shiki syntax highlighter, etc.).
- Share regexes across your Ruby and JavaScript code.
Compared to running the actual Oniguruma C library in JavaScript via WASM bindings (e.g. via vscode-oniguruma), this library is much lighter weight and its regexes run much faster since they run as native JavaScript.
[!WARNING] This library is currently in alpha and has known bugs.
Try the demo REPL
Oniguruma-To-ES deeply understands all of the hundreds of large and small differences between Oniguruma and JavaScript regex syntax and behavior across multiple JavaScript version targets. It's obsessive about precisely following Oniguruma syntax rules and ensuring that the emulated features it supports have exactly the same behavior, even in extreme edge cases. A few uncommon features can't be perfectly emulated and allow rare differences, but if you don't want to allow this, you can disable the `allowBestEffort` option to throw for such patterns (see details below).
Install and use
```sh
npm install oniguruma-to-es
```

```js
import {compile} from 'oniguruma-to-es';
```
In browsers:
```html
<script type="module">
  import {compile} from 'https://esm.run/oniguruma-to-es';
  compile(String.raw`…`);
</script>
```
Using a global name (no import)
```html
<script src="https://cdn.jsdelivr.net/npm/oniguruma-to-es/dist/index.min.js"></script>
<script>
  const {compile} = OnigurumaToES;
</script>
```
API
compile
Transpiles an Oniguruma regex pattern and flags to native JavaScript.
```ts
function compile(
  pattern: string,
  flags?: OnigurumaFlags,
  options?: CompileOptions
): {
  pattern: string;
  flags: string;
};
```
The returned `pattern` and `flags` can be provided directly to the JavaScript `RegExp` constructor. Various JavaScript flags might have been added or removed compared to the Oniguruma flags provided, as part of the emulation process.
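As a rough sketch of that hand-off, the snippet below feeds an object of the documented `{pattern, flags}` shape to the `RegExp` constructor. The `compiled` object is a hand-written, hypothetical stand-in so the example runs without the library installed. It is not actual `compile` output, which may look quite different for any given input.

```javascript
// Hypothetical stand-in for a value returned by compile(); the real
// generated pattern and flags for a given input may differ.
const compiled = {
  pattern: String.raw`(?<year>\d{4})-(?<month>\d{2})`,
  flags: 'u',
};

// Both properties can be passed straight to the RegExp constructor.
const re = new RegExp(compiled.pattern, compiled.flags);

const match = re.exec('Released 2024-11');
const year = match.groups.year;   // '2024'
const month = match.groups.month; // '11'
```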
Type OnigurumaFlags
A string with `i`, `m`, and `x` in any order (all optional).
[!IMPORTANT] Oniguruma and JavaScript both have an `m` flag but with different meanings. Oniguruma's `m` is equivalent to JavaScript's `s` (`dotAll`).
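The difference is easy to see with native JavaScript regexes alone. The following sketch uses no library code; it only contrasts what JavaScript's own `s` and `m` flags do.

```javascript
const text = 'a\nb';

// JS flag s (dotAll): the dot also matches newlines.
// This is the behavior that Oniguruma's m flag asks for.
const dotAllMatches = /a.b/s.test(text); // true

// JS flag m (multiline): only changes ^ and $, so the dot still stops at \n.
const jsMultilineDot = /a.b/m.test(text); // false

// JS flag m lets ^ match right after a newline.
const jsMultilineAnchor = /^b/m.test(text); // true
```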
Type CompileOptions
```ts
type CompileOptions = {
  allowBestEffort?: boolean;
  global?: boolean;
  hasIndices?: boolean;
  maxRecursionDepth?: number | null;
  optimize?: boolean;
  target?: 'ES2018' | 'ES2024' | 'ESNext';
};
```
See Options for more details.
toRegExp
Transpiles an Oniguruma regex pattern and flags and returns a native JavaScript `RegExp`.
```ts
function toRegExp(
  pattern: string,
  flags?: OnigurumaFlags,
  options?: CompileOptions
): RegExp;
```
[!TIP] Try it in the demo REPL.
toOnigurumaAst
Generates an Oniguruma AST from an Oniguruma pattern and flags.
```ts
function toOnigurumaAst(
  pattern: string,
  flags?: OnigurumaFlags
): OnigurumaAst;
```
toRegexAst
Generates a `regex` AST from an Oniguruma pattern and flags.
```ts
function toRegexAst(
  pattern: string,
  flags?: OnigurumaFlags
): RegexAst;
```
`regex`'s syntax and behavior are a strict superset of native JavaScript, so the AST is very close to representing native ESNext `RegExp` but with some added features (atomic groups, possessive quantifiers, recursion). The `regex` AST doesn't use some of `regex`'s extended features like flag `x` or subroutines because they follow PCRE behavior and work somewhat differently than in Oniguruma. The AST represents what's needed to precisely reproduce the Oniguruma behavior using `regex`.
Options
These options are shared by functions `compile` and `toRegExp`.
allowBestEffort
Allows results that differ from Oniguruma in rare cases. If `false`, throws if the pattern can't be emulated with identical behavior for the given `target`.
Default: `true`.
More details
Specifically, this option enables the following additional features, depending on `target`:
- All targets (`ESNext` and earlier):
  - Enables use of `\X` using a close approximation of a Unicode extended grapheme cluster.
  - Enables recursion (e.g. via `\g<0>`) using a depth limit specified via option `maxRecursionDepth`.
- `ES2024` and earlier:
  - Enables use of case-insensitive backreferences to case-sensitive groups.
- `ES2018`:
  - Enables use of POSIX classes `[:graph:]` and `[:print:]` using ASCII-based versions rather than the Unicode versions available for `ES2024` and later. Other POSIX classes are always based on Unicode.
global
Include JavaScript flag `g` (`global`) in results.
Default: `false`.
hasIndices
Include JavaScript flag `d` (`hasIndices`) in results.
Default: `false`.
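For reference, this is what the two JavaScript flags named by `global` and `hasIndices` do on a native `RegExp`. The sketch uses hand-written regexes rather than library output, and flag `d` assumes a runtime that supports match indices (ES2022).

```javascript
const sample = 'a1 b2';

// Flag g (global): String.prototype.match returns every match, not just the first.
const globalRe = new RegExp(String.raw`[a-z]\d`, 'g');
const allMatches = sample.match(globalRe); // ['a1', 'b2']

// Flag d (hasIndices): match results gain an `indices` array holding
// [start, end] offsets for the whole match and for each capturing group.
const indicesRe = new RegExp(String.raw`[a-z](\d)`, 'd');
const firstMatch = indicesRe.exec(sample);
const wholeSpan = firstMatch.indices[0]; // [0, 2]
const groupSpan = firstMatch.indices[1]; // [1, 2]
```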
maxRecursionDepth
If `null`, any use of recursion throws. If an integer between `2` and `100` (and `allowBestEffort` is `true`), common recursion forms are supported and recurse up to the specified max depth.
Default: `6`.
More details
Using a high limit is not a problem if needed. Although there can be a performance cost (minor unless it's exacerbating an existing issue with runaway backtracking), there is no effect on regexes that don't use recursion.
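To illustrate the general idea of depth-limited recursion in a flavor without native recursion, the sketch below expands a pattern into itself a fixed number of times. This is only a conceptual illustration and is not necessarily how Oniguruma-To-ES implements it. The `expandToDepth` helper and the `(?R)` marker are invented for this example, and the way depth is counted here may not match the library's.

```javascript
// Sketch only: '(?R)' is a marker invented for this example, not library syntax.
function expandToDepth(template, depth) {
  // Innermost level: (?!) never matches, so the expansion stops here.
  let pattern = '(?!)';
  for (let level = 0; level < depth; level++) {
    // A function replacement avoids special handling of '$' in replacement strings.
    pattern = template.replace('(?R)', () => `(?:${pattern})`);
  }
  return pattern;
}

// Balanced parentheses, with the marker placed where the pattern recurses.
const template = String.raw`\((?:[^()]|(?R))*\)`;
const balancedDepth3 = new RegExp(`^${expandToDepth(template, 3)}$`);

const nestedTwo = balancedDepth3.test('(a(b))');    // true: 2 levels fit within 3
const nestedFour = balancedDepth3.test('(((())))'); // false: 4 levels exceed 3
```

The expanded source grows with each level, so in a scheme like this one a higher limit costs source length and some matching time, and only for patterns that recurse at all.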
optimize
Simplify the generated pattern when it doesn't change the meaning.
Default: `true`.
target
Sets the JavaScript language version for generated patterns and flags. Later targets allow faster processing, simpler generated source, and support for additional features.
Default: `'ES2024'`.
More details
- `ES2018`: Uses JS flag `u`.
  - Emulation restrictions: Character class intersection, nested negated character classes, and Unicode properties added after ES2018 are not allowed.
  - Generated regexes might use ES2018 features that require Node.js 10 or a browser version released during 2018 to 2023 (in Safari's case). Minimum requirement for any regex is Node.js 6 or a 2016-era browser.
- `ES2024`: Uses JS flag `v`.
  - No emulation restrictions.
  - Generated regexes require Node.js 20 or a 2023-era browser (compat table).
- `ESNext`: Uses JS flag `v` and allows use of flag groups and duplicate group names.
  - Benefits: Faster transpilation, simpler generated source, and duplicate group names are preserved across separate alternation paths.
  - Generated regexes might use features that require Node.js 23 or a 2024-era browser (except Safari, which lacks support).
Supported features
Following are the supported features by target.
Targets `ES2024` and `ESNext` have the same emulation capabilities. Resulting regexes might have different source and flags, but they match the same strings.
Notice that nearly every feature below has at least subtle differences from JavaScript. Some features and subfeatures listed as unsupported are not emulatable using native JavaScript regexes, but support for others might be added in future versions of Oniguruma-To-ES. Unsupported features throw an error.
| Feature | | Example | ES2018 | ES2024+ | Subfeatures & JS differences |
|---|---|---|:---:|:---:|---|
| Flags | `i` | `i` | ✅ | ✅ | • Unicode case folding (same as JS with flag `u`, `v`) |
| | `m` | `m` | ✅ | ✅ | • Equivalent to JS flag `s` (`dotAll`) |
| | `x` | `x` | ✅ | ✅ | • Unicode whitespace ignored<br>• Line comments with `#`<br>• Whitespace/comments allowed between a token and its quantifier<br>• Whitespace/comments between a quantifier and the `?`/`+` that makes it lazy/possessive changes it to a chained quantifier<br>• Whitespace/comments separate tokens (ex: `\1 0`)<br>• Whitespace and `#` not ignored in char classes |
| Flag modifiers | Group | `(?im-x:…)` | ✅ | ✅ | • Unicode case folding for `i`<br>• Allows enabling and disabling the same flag (priority: disable)<br>• Allows lone or multiple `-` |
| | Directive | `(?im-x)` | ✅ | ✅ | • Continues until end of pattern or group (spanning alternatives) |
| Characters | Literal | `E`, `!` | ✅ | ✅ | • Code point based matching (same as JS with flag `u`, `v`)<br>• Standalone `]`, `{`, `}` don't require escaping |
| | Identity escape | `\E`, `\!` | ✅ | ✅ | • Different allowed set than JS<br>• Allows multibyte chars |
| | Escaped metachar | `\\`, `\.` | ✅ | ✅ | • Same as JS |
| | Shorthand | `\t` | ✅ | ✅ | • The JS set plus `\a`, `\e` |
| | `\xNN` | `\x7F` | ✅ | ✅ | • Allows 1 hex digit<br>• Error for 2 hex digits > `7F` |
| | `\uNNNN` | `\uFFFF` | ✅ | ✅ | • Same as JS with flag `u`, `v` |
| | `\x{…}` | `\x{A}` | ✅ | ✅ | • Allows leading 0s up to 8 total hex digits |
| | Escaped num | `\20` | ✅ | ✅ | • Can be backref, error, null, octal, identity escape, or any of these combined with literal digits, based on complex rules that differ from JS<br>• Always handles escaped single digit 1-9 outside char class as backref<br>• Allows null with 1-3 0s<br>• Error for octal > `177` |
| | Control | `\cA`, `\C-A` | ✅ | ✅ | • With A-Za-z (JS: only `\c` form) |
| | Other (rare) | | ❌ | ❌ | Not yet supported:<br>• `\cx`, `\C-x` with non-A-Za-z<br>• Meta `\M-x`, `\M-\C-x`<br>• Multibyte octal `\o{…}` |
| Character sets | Digit, word | `\d`, `\w`, etc. | ✅ | ✅ | • Same as JS (ASCII) |
| | Hex digit | `\h`, `\H` | ✅ | ✅ | • ASCII |
| | Whitespace | `\s`, `\S` | ✅ | ✅ | • ASCII (unlike JS) |
| | Dot | `.` | ✅ | ✅ | • Excludes only `\n` (unlike JS) |
| | Any | `\O` | ✅ | ✅ | • Any char (with any flags)<br>• Identity escape in char class |
| | Not newline | `\N` | ✅ | ✅ | • Identity escape in char class |
| | Unicode property | `\p{L}`, `\P{L}` | ✅[1] | ✅ | • Binary properties<br>• Categories<br>• Scripts<br>• Aliases<br>• POSIX properties<br>• Negate with `\p{^…}`, `\P{^…}`<br>• Insignificant spaces, underscores, and casing in names<br>• `\p`, `\P` without `{` is an identity escape<br>• Error for key prefixes<br>• Error for props of strings<br>• Blocks unsupported (wontfix[2]) |
| Variable-length sets | Newline | `\R` | ✅ | ✅ | • Matched atomically |
| | Grapheme | `\X` | ☑️ | ☑️ | • Uses a close approximation<br>• Matched atomically |
| Character classes | Base | `[…]`, `[^…]` | ✅ | ✅ | • Unescaped `-` is literal char in some contexts (different than JS rules in any mode)<br>• Fewer chars require escaping than JS |
| | Empty | `[]`, `[^]` | ✅ | ✅ | • Error |
| | Range | `[a-z]` | ✅ | ✅ | • Same as JS with flag `u`, `v` |
| | POSIX class | `[[:word:]]`, `[[:^word:]]` | ☑️[3] | ✅ | • All use Unicode definitions |
| | Nested class | `[…[…]]` | ☑️[4] | ✅ | • Same as JS with flag `v` |
| | Intersection | `[…&&…]` | ❌ | ✅ | • Doesn't require nested classes for union and ranges |
| Assertions | Line start, end | `^`, `$` | ✅ | ✅ | • Always "multiline" (per JS)<br>• Only `\n` as newline<br>• Allows following quantifier |
| | String start, end | `\A`, `\z` | ✅ | ✅ | • Like JS `^` `$` without JS flag `m` |
| | String end or before terminating newline | `\Z` | ✅ | ✅ | • Only `\n` as newline |
| | Search start | `\G` | ☑️ | ☑️ | • Supported when used at start of pattern (if no top-level alternation) and when at start of all top-level alternatives |
| | Word boundary | `\b`, `\B` | ✅ | ✅ | • Unicode based (unlike JS)<br>• Allows following quantifier |
| | Lookahead | `(?=…)`, `(?!…)` | ✅ | ✅ | • Allows following quantifier (unlike JS with flag `u`, `v`)<br>• Values captured within min-0 quantified lookahead remain referenceable (unlike JS) |
| | Lookbehind | `(?<=…)`, `(?<!…)` | ✅ | ✅ | • Error for variable-length quantifiers within lookbehind (allowed in JS)<br>• Allows variable-length top-level alternatives<br>• Allows following quantifier (unlike JS in any mode)<br>• Values captured within min-0 quantified lookbehind remain referenceable |
| Quantifiers | Greedy, lazy | `*`, `+?`, `{2,}`, etc. | ✅ | ✅ | • Includes all JS forms<br>• Adds form `{,n}` for implicit min 0<br>• Explicit bounds have upper limit of 100,000 (unlimited in JS)<br>• Allowed to follow assertions |
| | Possessive | `?+`, `*+`, `++` | ✅ | ✅ | • `+` suffix doesn't make interval (`{…}`) quantifiers possessive (creates a chained quantifier) |
| | Chained | `**`, `??+*`, `{2,3}+`, etc. | ✅ | ✅ | • Further repeats the preceding repetition |
| Groups | Noncapturing | `(?:…)` | ✅ | ✅ | • Same as JS |
| | Atomic | `(?>…)` | ✅ | ✅ | • Supported |
| | Capturing | `(…)` | ✅ | ✅ | • Is noncapturing if named capture present |
| | Named capturing | `(?<a>…)`, `(?'a'…)` | ✅ | ✅ | • Duplicate names allowed (including within the same alternation path) unless directly referenced by a subroutine<br>• Error for names invalid in Oniguruma or JS |
| Backreferences | Numbered | `\1` | ✅ | ✅ | • Error if named capture used<br>• Refs the most recent of a capture/subroutine set |
| | Enclosed numbered, relative | `\k<1>`, `\k'1'`, `\k<-1>`, `\k'-1'` | ✅ | ✅ | • Error if named capture used<br>• Allows leading 0s<br>• Refs the most recent of a capture/subroutine set |
| | Named | `\k<a>`, `\k'a'` | ✅ | ✅ | • For duplicate group names, rematch any of their matches (multiplex)<br>• Refs the most recent of a capture/subroutine set (no multiplex)<br>• Combination of multiplex and most recent of capture/subroutine set if duplicate name is indirectly created by a subroutine |
| | To nonparticipating groups | | ☑️ | ☑️ | • Error if group to the right[5]<br>• Duplicate names (and subroutines) to the right not included in multiplex<br>• Fail to match (or don't include in multiplex) ancestor groups and groups in preceding alternation paths<br>• Some rare cases are indeterminable at compile time and use the JS behavior of matching an empty string |
| Subroutines | Numbered, relative | `\g<1>`, `\g'1'`, `\g<-1>`, `\g'-1'`, `\g<+1>`, `\g'+1'` | ✅ | ✅ | • Allowed before reffed group<br>• Can be nested (any depth)<br>• Doesn't alter backref nums<br>• Reuses flags from the reffed group (ignores local flags)<br>• Replaces most recent captured values (for backrefs)<br>• Error if named capture used |
| | Named | `\g<a>`, `\g'a'` | ✅ | ✅ | • Same behavior as numbered<br>• Error if reffed group uses duplicate name |
| Recursion | Full pattern | `\g<0>`, `\g'0'` | ☑️ | ☑️ | • Limited support[6] |
| | Numbered, relative | `(…\g<1>?…)`, etc. | ❌ | ❌ | • Not yet supported |
| | Named | `(?<a>…\g<a>?…)`, etc. | ☑️ | ☑️ | • Limited support[6] |
| Other | Comment group | `(?#…)` | ✅ | ✅ | • Allows escaping `\)`, `\\`<br>• Comments allowed between a token and its quantifier<br>• Comments between a quantifier and the `?`/`+` that makes it lazy/possessive changes it to a chained quantifier |
| | Alternation | `…\|…` | ✅ | ✅ | • Same as JS |
| | Keep | `\K` | ☑️ | ☑️ | • Supported if at top level and no top-level alternation is used |
| | Absence operator | `(?~…)` | ❌ | ❌ | • Some forms are supportable |
| | Conditional | `(?(1)…)` | ❌ | ❌ | • Some forms are supportable |
| | Char sequence | `\x{1 2 …N}`, `\o{1 2 …N}` | ❌ | ❌ | • Not yet supported |
| | JS features unknown to Oniguruma are handled using Oniguruma syntax | | ✅ | ✅ | • `\u{…}` is an error<br>• `[\q{…}]` matches `q`, etc.<br>• `[a--b]` includes the invalid reversed range `a` to `-` |
| | Invalid Oniguruma syntax | | ✅ | ✅ | • Error |
The table above doesn't include all aspects that Oniguruma-To-ES emulates (including error handling, most aspects that work the same as in JavaScript, and many aspects of non-JavaScript features that work the same in the other regex flavors that support them).
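To make a few of the differences listed in the table concrete, the sketch below compares native JavaScript behavior with hand-written JavaScript stand-ins for the Oniguruma behavior. These stand-ins are illustrations only, not output generated by Oniguruma-To-ES, and the exact ASCII whitespace set used for `\s` is an assumption.

```javascript
// Dot: JS (without flag s) also excludes \r, \u2028, and \u2029;
// per the table, Oniguruma's dot excludes only \n.
const jsDot = /^.$/u;
const onigStyleDot = /^[^\n]$/u;

// \s: JS's version is Unicode-aware; the table lists Oniguruma's as ASCII.
// The exact ASCII set below is an assumption made for this sketch.
const jsSpace = /^\s$/u;
const onigStyleSpace = /^[\t\n\v\f\r ]$/u;

// \h: not JS syntax; an ASCII hex digit class is the closest equivalent.
const onigStyleHex = /^[0-9A-Fa-f]+$/u;

const results = {
  jsDotMatchesCR: jsDot.test('\r'),                     // false
  onigDotMatchesCR: onigStyleDot.test('\r'),            // true
  jsSpaceMatchesNbsp: jsSpace.test('\u00A0'),           // true
  onigSpaceMatchesNbsp: onigStyleSpace.test('\u00A0'),  // false
  hexMatches: onigStyleHex.test('7fA0'),                // true
  // JS lets a backreference to a nonparticipating group match the empty
  // string, which is the fallback behavior the table mentions.
  jsBackrefToUnsetGroup: /^(?:(a)|b)\1$/u.test('b'),    // true
};
```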
Footnotes
1. Target `ES2018` doesn't allow Unicode property names added in JavaScript specifications after ES2018 to be used.
2. Unicode blocks are easily emulatable but their character data would significantly increase library weight. They're also a deeply flawed and arguably-unuseful feature, given the ability to use Unicode scripts and other properties instead.
3. With target `ES2018`, the specific POSIX classes `[:graph:]` and `[:print:]` are an error if option `allowBestEffort` is `false`, and they use ASCII-based versions rather than the Unicode versions available for target `ES2024` and later.
4. Target `ES2018` doesn't support nested negated character classes.
5. It's not an error for numbered backreferences to come before their referenced group in Oniguruma, but an error is the best path for Oniguruma-To-ES because (1) most placements are mistakes and can never match (based on the Oniguruma behavior for backreferences to nonparticipating groups), (2) erroring matches the behavior of named backreferences, and (3) the edge cases where they're matchable rely on rules for backreference resetting within quantified groups that are different in JS and aren't emulatable. Note that it's not a backreference in the first place if using `\10` or higher and not as many capturing groups are defined to the left (it's an octal or identity escape).
6. Recursion depth is limited, and specified by option `maxRecursionDepth`. Any use of recursion results in an error if `maxRecursionDepth` is `null` or `allowBestEffort` is `false`. Additionally, some forms of recursion are not yet supported, including mixing recursion with backreferences, using multiple recursions in the same pattern, and recursion by group number. Because recursion is bounded, patterns that fail due to infinite recursion in Oniguruma might find a match in Oniguruma-To-ES. Future versions will detect this and throw an error.
Unicode / mixed case-sensitivity
Oniguruma-To-ES fully supports mixed case-sensitivity (and handles the Unicode edge cases) regardless of JavaScript target. It also restricts Unicode properties to those supported by Oniguruma and the target JavaScript version.
Oniguruma-To-ES focuses on being lightweight to make it better for use in browsers. This is partly achieved by not including heavyweight Unicode character data, which imposes a couple of minor/rare restrictions:
- Character class intersection and nested negated character classes are unsupported with target `ES2018`. Use target `ES2024` or later if you need support for these Oniguruma features.
- With targets before `ESNext`, a handful of Unicode properties that target a specific character case (ex: `\p{Lower}`) can't be used case-insensitively in patterns that contain other characters with a specific case that are used case-sensitively.
  - In other words, almost every usage is fine, including `A\p{Lower}`, `(?i:A\p{Lower})`, `(?i:A)\p{Lower}`, `(?i:A(?-i:\p{Lower}))`, and `\w(?i:\p{Lower})`, but not `A(?i:\p{Lower})`.
  - Using these properties case-insensitively is basically never done intentionally, so you're unlikely to encounter this error unless it's catching a mistake.
Similar projects
JsRegex transpiles Onigmo regexes to JavaScript (Onigmo is a fork of Oniguruma with mostly shared syntax and behavior). It's written in Ruby and relies on the Regexp::Parser Ruby gem, which means regexes must be pre-transpiled on the server to use them in JavaScript. Note that JsRegex doesn't always translate edge case behavior differences.
About
Oniguruma-To-ES was created by Steven Levithan.
If you want to support this project, I'd love your help by contributing improvements, sharing it with others, or sponsoring ongoing development.
© 2024–present. MIT License.