This Truffle language represents classic regular expressions. It treats a given regular expression as a "program" that you can execute to obtain a regular expression matcher object, which in turn can be used to perform regex searches via Truffle interop.
The expected syntax is options/regex/flags
, where options
is a comma-separated list of key-value pairs which affect
how the regex is interpreted, and /regex/flags
is equivalent to the popular regular expression literal format found in
e.g. JavaScript or Ruby.
When parsing a regular expression, TRegex will return a Truffle CallTarget, which, when called, will yield one of the following results:
- a (Truffle) null value, indicating that TRegex cannot handle the given regex.
- a "compiled regex" (
RegexObject
) object, which can be used to match the given regex. - a Truffle
PARSE_ERROR
exception may be thrown to indicate a syntax error.
An example of how to parse a regular expression:
Source source = Source.newBuilder("regex", "Flavor=ECMAScript/(a|(b))c/i", "myRegex").mimeType("application/tregex").internal(true).build();
Object regex;
try {
regex = getContext().getEnv().parseInternal(source).call();
} catch (AbstractTruffleException e) {
if (InteropLibrary.getUncached().getExceptionType(e) == ExceptionType.PARSE_ERROR) {
// handle parser error
} else {
// fatal error, this should never happen
}
}
if (InteropLibrary.getUncached().isNull(regex)) {
// regex is not supported by TRegex, fall back to a different regex engine
}
A RegexObject
represents a compiled regular expression that can be used to match against input strings. It exposes the
following three properties:
pattern
: the source string of the compiled regular expression.flags
: an object representing the set of flags passed to the regular expression compiler, depending on the flavor of regular expressions used.groupCount
: the number of capture groups present in the regular expression, including group 0.groups
: a map of all named capture groups to their respective group number, or a null value if the expression does not contain named capture groups.exec
: an executable method that matches the compiled regular expression against a string. The method accepts two parameters:input
: the character sequence to search in. This may either be a Java String, or a Truffle Object that behaves like achar
-array.fromIndex
: the position to start searching from.- The return value is a
RegexResult
object.
A RegexResult
object represents the result of matching a regular expression against a string. It can be obtained as
the result of a RegexObject
's
exec
-method and has the following properties:
boolean isMatch
:true
if a match was found,false
otherwise.int getStart(int groupNumber)
: returns the position where the beginning of the capture group with the given number was found. If the result is no match, the returned value is undefined. Capture group number0
denotes the boundaries of the entire expression. If no match was found for a particular capture group, the returned value is-1
.int getEnd(int groupNumber)
: returns the position where the end of the capture group with the given number was found. If the result is no match, the returned value is undefined. Capture group number0
denotes the boundaries of the entire expression. If no match was found for a particular capture group, the returned value is-1
.
Compiled regex usage example in pseudocode:
regex = <matcher from previous example>
assert(regex.pattern == "(a|(b))c")
assert(regex.flags.ignoreCase == true)
assert(regex.groupCount == 3)
result = regex.exec("xacy", 0)
assert(result.isMatch == true)
assertEquals([result.getStart(0), result.getEnd(0)], [ 1, 3])
assertEquals([result.getStart(1), result.getEnd(1)], [ 1, 2])
assertEquals([result.getStart(2), result.getEnd(2)], [-1, -1])
result2 = regex.exec("xxx", 0)
assert(result2.isMatch == false)
// result2.getStart(...) and result2.getEnd(...) are undefined
These options define how TRegex should interpret a given regular expression:
Flavor
: specifies the regex dialect to use. Possible values:ECMAScript
: ECMAScript/JavaScript syntax (default).Python
: Python 3 syntax.Ruby
: Ruby syntax.
Encoding
: specifies the string encoding to match against. Possible values:UTF-8
UTF-16
(default)UTF-32
LATIN-1
BYTES
(equivalent toLATIN-1
)
Validate
: don't generate a regex matcher object, just check the regex for syntax errors.U180EWhitespace
: treat0x180E MONGOLIAN VOWEL SEPARATOR
as part of\s
. This is a legacy feature for languages using a Unicode standard older than 6.3, such as ECMAScript 6 and older.
UTF16ExplodeAstralSymbols
: generate one DFA states per (16 bit)char
instead of per-codepoint. This may improve performance in certain scenarios, but increases the likelihood of DFA state explosion.AlwaysEager
: do not generate any lazy regex matchers (lazy in the sense that they may lazily compute properties of a {@link RegexResult}).
RegressionTestMode
: exercise all supported regex matcher variants, and check if they produce the same results.DumpAutomata
: dump all generated parser trees, NFA, and DFA to disk. This will generate debugging dumps of most relevant data structures in JSON, GraphViz and LaTex format.StepExecution
: dump tracing information about all DFA matcher runs.
All options except Flavor
and Encoding
are boolean and false
by default.