forked from oracle/graal
TRegex: remove lazyness from RegexLanguage.parse, add more documentation
Showing 5 changed files with 240 additions and 125 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,123 @@ | ||
# TRegex - Truffle Regular Expression Language | ||
|
||
This Truffle language represents classic regular expressions. It treats a given regular expression as a "program" that | ||
you can execute to obtain a regular expression matcher object, which in turn can be used to perform regex searches via | ||
Truffle interop. | ||
|
||
The expected syntax is `options/regex/flags`, where `options` is a comma-separated list of key-value pairs which affect | ||
how the regex is interpreted, and `/regex/flags` is equivalent to the popular regular expression literal format found in | ||
e.g. JavaScript or Ruby. | ||
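
To make the `options/regex/flags` layout concrete, here is a minimal sketch that splits such a source string into its three parts. `RegexSourceSketch` is a hypothetical helper written for illustration only; it is not part of TRegex, and it assumes that option keys and values never contain a `/`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT part of TRegex: splits "options/regex/flags" into its parts.
final class RegexSourceSketch {
    final Map<String, String> options = new LinkedHashMap<>();
    final String pattern;
    final String flags;

    RegexSourceSketch(String source) {
        // Assumption: options never contain '/', so the first slash ends the options
        // and the last slash starts the flags.
        int first = source.indexOf('/');
        int last = source.lastIndexOf('/');
        if (first < 0 || first == last) {
            throw new IllegalArgumentException("expected options/regex/flags: " + source);
        }
        String optionList = source.substring(0, first);
        if (!optionList.isEmpty()) {
            for (String pair : optionList.split(",")) {
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    throw new IllegalArgumentException("expected key=value: " + pair);
                }
                options.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        pattern = source.substring(first + 1, last);
        flags = source.substring(last + 1);
    }

    public static void main(String[] args) {
        RegexSourceSketch s = new RegexSourceSketch("Flavor=ECMAScript/(a|(b))c/i");
        System.out.println(s.options + " " + s.pattern + " " + s.flags);
    }
}
```

Using the last slash as the flags delimiter keeps slashes inside the pattern itself intact, which is why the sketch does not split on every `/`.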
|
||
### Parsing | ||
|
||
When parsing a regular expression, TRegex will return a Truffle CallTarget, which, when called, will yield one of the | ||
following results: | ||
|
||
* a (Truffle) null value, indicating that TRegex cannot handle the given regex. | ||
* a "compiled regex" (`RegexObject`) object, which can be used to match the given regex. | ||
* instead of returning a value, a Truffle `PARSE_ERROR` exception may be thrown to indicate a syntax error. | ||
|
||
An example of how to parse a regular expression: | ||
|
||
``` | ||
Source source = Source.newBuilder("regex", "Flavor=ECMAScript/(a|(b))c/i", "myRegex").mimeType("application/tregex").internal(true).build(); | ||
try { | ||
    Object regex = getContext().getEnv().parseInternal(source).call(); | ||
    if (InteropLibrary.getUncached().isNull(regex)) { | ||
        // regex is not supported by TRegex, fall back to a different regex engine | ||
    } else { | ||
        // regex is a RegexObject and can be used for matching | ||
    } | ||
} catch (AbstractTruffleException e) { | ||
    // note: getExceptionType declares the checked UnsupportedMessageException | ||
    if (InteropLibrary.getUncached().getExceptionType(e) == ExceptionType.PARSE_ERROR) { | ||
        // handle parser error | ||
    } else { | ||
        // fatal error, this should never happen | ||
    } | ||
} | ||
``` | ||
|
||
### The compiled regex object | ||
|
||
A `RegexObject` represents a compiled regular expression that can be used to match against input strings. It exposes the | ||
following five properties: | ||
|
||
* `pattern`: the source string of the compiled regular expression. | ||
* `flags`: an object representing the set of flags passed to the regular expression compiler, depending on the flavor of | ||
regular expressions used. | ||
* `groupCount`: the number of capture groups present in the regular expression, including group 0. | ||
* `groups`: a map of all named capture groups to their respective group number, or a null value if the expression does | ||
not contain named capture groups. | ||
* `exec`: an executable method that matches the compiled regular expression against a string. The method accepts two | ||
parameters: | ||
* `input`: the character sequence to search in. This may either be a Java String, or a Truffle Object that behaves | ||
like a `char`-array. | ||
* `fromIndex`: the position to start searching from. | ||
* The return value is a `RegexResult` object. | ||
|
||
### The result object | ||
|
||
A `RegexResult` object represents the result of matching a regular expression against a string. It can be obtained as | ||
the result of a `RegexObject`'s `exec` method and has the following properties: | ||
|
||
* `boolean isMatch`: `true` if a match was found, `false` otherwise. | ||
* `int getStart(int groupNumber)`: returns the position where the capture group with the given number begins. If the | ||
result is not a match, the returned value is undefined. Capture group number `0` denotes the boundaries of the entire | ||
match. If a particular capture group did not participate in the match, the returned value is `-1`. | ||
* `int getEnd(int groupNumber)`: returns the position where the capture group with the given number ends. If the | ||
result is not a match, the returned value is undefined. Capture group number `0` denotes the boundaries of the entire | ||
match. If a particular capture group did not participate in the match, the returned value is `-1`. | ||
|
||
Compiled regex usage example in pseudocode: | ||
|
||
``` | ||
regex = <matcher from previous example> | ||
assert(regex.pattern == "(a|(b))c") | ||
assert(regex.flags.ignoreCase == true) | ||
assert(regex.groupCount == 3) | ||
result = regex.exec("xacy", 0) | ||
assert(result.isMatch == true) | ||
assertEquals([result.getStart(0), result.getEnd(0)], [ 1, 3]) | ||
assertEquals([result.getStart(1), result.getEnd(1)], [ 1, 2]) | ||
assertEquals([result.getStart(2), result.getEnd(2)], [-1, -1]) | ||
result2 = regex.exec("xxx", 0) | ||
assert(result2.isMatch == false) | ||
// result2.getStart(...) and result2.getEnd(...) are undefined | ||
``` | ||
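
For readers who want to try the group-boundary semantics without a Truffle setup, the following sketch reproduces the same example with `java.util.regex`. It is an analogy only and does not use TRegex or Truffle interop; `GroupBoundsSketch` and its `exec` method are hypothetical names. Note one difference: `Matcher.groupCount()` does not count group 0, whereas the `groupCount` property described above does.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Analogy only: uses java.util.regex, NOT TRegex or Truffle interop.
final class GroupBoundsSketch {
    // Returns {start, end} for groups 0..groupCount, or null if there is no match.
    static int[][] exec(Pattern pattern, String input, int fromIndex) {
        Matcher m = pattern.matcher(input);
        if (!m.find(fromIndex)) {
            return null;
        }
        int[][] bounds = new int[m.groupCount() + 1][];
        for (int g = 0; g <= m.groupCount(); g++) {
            // start/end are -1 for a group that did not participate in the match
            bounds[g] = new int[] {m.start(g), m.end(g)};
        }
        return bounds;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(a|(b))c", Pattern.CASE_INSENSITIVE);
        System.out.println(Arrays.deepToString(exec(p, "xacy", 0))); // [[1, 3], [1, 2], [-1, -1]]
        System.out.println(exec(p, "xxx", 0) == null);               // true
    }
}
```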
|
||
### Available options | ||
|
||
These options define how TRegex should interpret a given regular expression: | ||
|
||
#### User options | ||
* `Flavor`: specifies the regex dialect to use. Possible values: | ||
* `ECMAScript`: ECMAScript/JavaScript syntax (default). | ||
* `PythonStr`: regular Python 3 syntax. | ||
* `PythonBytes`: Python 3 syntax, but for `bytes`-objects. | ||
* `Ruby`: Ruby syntax. | ||
* `Encoding`: specifies the string encoding to match against. Possible values: | ||
* `UTF-8` | ||
* `UTF-16` (default) | ||
* `UTF-32` | ||
* `LATIN-1` | ||
* `BYTES` (equivalent to `LATIN-1`) | ||
* `Validate`: don't generate a regex matcher object, just check the regex for syntax errors. | ||
* `U180EWhitespace`: treat `0x180E MONGOLIAN VOWEL SEPARATOR` as part of `\s`. This is a legacy feature for languages | ||
using a Unicode standard older than 6.3, such as ECMAScript 6 and older. | ||
|
||
#### Performance tuning options | ||
* `UTF16ExplodeAstralSymbols`: generate one DFA state per (16-bit) `char` instead of one per code point. This may | ||
improve performance in certain scenarios, but increases the likelihood of DFA state explosion. | ||
* `AlwaysEager`: do not generate any lazy regex matchers (lazy in the sense that they may lazily compute properties of a | ||
`RegexResult`). | ||
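
The difference between a `char` and a code point that `UTF16ExplodeAstralSymbols` refers to can be seen in plain Java strings. The sketch below only illustrates UTF-16 surrogate pairs; it says nothing about how TRegex builds its DFA, and `AstralSymbolSketch` is a hypothetical name.

```java
// Illustration of UTF-16 surrogate pairs only; this is NOT TRegex code.
final class AstralSymbolSketch {
    // Number of 16-bit chars needed to encode the given code point in UTF-16.
    static int charsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int astral = 0x1F600; // a code point outside the Basic Multilingual Plane
        String s = new String(Character.toChars(astral));
        System.out.println(s.length());                          // 2 chars
        System.out.println(s.codePointCount(0, s.length()));     // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))
                && Character.isLowSurrogate(s.charAt(1)));       // true
        System.out.println(charsFor('a'));                       // 1
    }
}
```

A matcher that steps through the input one `char` at a time has to consume both surrogates separately, while a matcher that works per code point treats the pair as a single symbol.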
|
||
#### Debugging options | ||
* `RegressionTestMode`: exercise all supported regex matcher variants, and check if they produce the same results. | ||
* `DumpAutomata`: dump all generated parser trees, NFA, and DFA to disk. This will generate debugging dumps of most | ||
relevant data structures in JSON, GraphViz and LaTeX format. | ||
* `StepExecution`: dump tracing information about all DFA matcher runs. | ||
|
||
All options except `Flavor` and `Encoding` are boolean and `false` by default. |
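
As a rough sketch of how several options might be combined into one source string, the helper below joins key-value pairs with commas and appends `/regex/flags`. `RegexSourceBuilderSketch` is hypothetical and not part of TRegex, and the spelling `Validate=true` for enabling a boolean option is an assumption based on the "key-value pairs" description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, NOT part of TRegex: builds an "options/regex/flags" string.
final class RegexSourceBuilderSketch {
    static String build(Map<String, String> options, String pattern, String flags) {
        StringJoiner joined = new StringJoiner(",");
        for (Map.Entry<String, String> option : options.entrySet()) {
            joined.add(option.getKey() + "=" + option.getValue());
        }
        return joined + "/" + pattern + "/" + flags;
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("Flavor", "PythonStr");
        options.put("Encoding", "UTF-8");
        options.put("Validate", "true"); // assumed spelling for a boolean option
        System.out.println(build(options, "a+b", ""));
    }
}
```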
regex/src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/GetRegexObjectNode.java: 109 changes (0 additions, 109 deletions).
This file was deleted.