TRegex: remove lazyness from RegexLanguage.parse, add more documentation
djoooooe committed Dec 16, 2020
1 parent 86d1439 commit 90deca7
Showing 5 changed files with 240 additions and 125 deletions.
123 changes: 123 additions & 0 deletions regex/docs/README.md
@@ -0,0 +1,123 @@
# TRegex - Truffle Regular Expression Language

This Truffle language represents classic regular expressions. It treats a given regular expression as a "program" that
you can execute to obtain a regular expression matcher object, which in turn can be used to perform regex searches via
Truffle interop.

The expected syntax is `options/regex/flags`, where `options` is a comma-separated list of key-value pairs which affect
how the regex is interpreted, and `/regex/flags` is equivalent to the popular regular expression literal format found in
e.g. JavaScript or Ruby.
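
The `options/regex/flags` layout can be illustrated with a small, self-contained sketch. This is not TRegex's own parser: the class `RegexSourceSketch` and its fields are hypothetical, and the sketch assumes that option keys and values never contain a `/`. Using the last slash to separate the flags follows the `lastIndexOf('/')` logic visible in `createRegexSource` in this commit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of TRegex.
final class RegexSourceSketch {
    final Map<String, String> options;
    final String pattern;
    final String flags;

    private RegexSourceSketch(Map<String, String> options, String pattern, String flags) {
        this.options = options;
        this.pattern = pattern;
        this.flags = flags;
    }

    static RegexSourceSketch parse(String src) {
        // Assumption: options contain no '/', so the first slash ends the option list.
        int firstSlash = src.indexOf('/');
        // The pattern may itself contain slashes, so the flags start after the LAST slash.
        int lastSlash = src.lastIndexOf('/');
        if (firstSlash < 0 || lastSlash <= firstSlash) {
            throw new IllegalArgumentException("malformed regex source: " + src);
        }
        Map<String, String> options = new LinkedHashMap<>();
        String optionString = src.substring(0, firstSlash);
        if (!optionString.isEmpty()) {
            for (String pair : optionString.split(",")) {
                int eq = pair.indexOf('=');
                if (eq <= 0) {
                    throw new IllegalArgumentException("malformed option: " + pair);
                }
                options.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        String pattern = src.substring(firstSlash + 1, lastSlash);
        String flags = src.substring(lastSlash + 1);
        return new RegexSourceSketch(options, pattern, flags);
    }

    public static void main(String[] args) {
        RegexSourceSketch s = parse("Flavor=ECMAScript/(a|(b))c/i");
        System.out.println(s.options + " | " + s.pattern + " | " + s.flags);
    }
}
```

An empty option list is allowed in this sketch, so a source such as `/a/b/` splits into no options, the pattern `a/b`, and no flags.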

### Parsing

When parsing a regular expression, TRegex will return a Truffle CallTarget, which, when called, will yield one of the
following results:

* a (Truffle) null value, indicating that TRegex cannot handle the given regex.
* a "compiled regex" object (`RegexObject`), which can be used to match the given regex.
* a Truffle exception of type `PARSE_ERROR`, thrown to indicate a syntax error.

An example of how to parse a regular expression:

```java
Source source = Source.newBuilder("regex", "Flavor=ECMAScript/(a|(b))c/i", "myRegex")
        .mimeType("application/tregex")
        .internal(true)
        .build();
// initialized to null so that the variable is definitely assigned after the try block
Object regex = null;
try {
    regex = getContext().getEnv().parseInternal(source).call();
} catch (AbstractTruffleException e) {
    if (InteropLibrary.getUncached().getExceptionType(e) == ExceptionType.PARSE_ERROR) {
        // handle parser error
    } else {
        // fatal error, this should never happen
    }
}
if (regex != null && InteropLibrary.getUncached().isNull(regex)) {
    // regex is not supported by TRegex, fall back to a different regex engine
}
```

### The compiled regex object

A `RegexObject` represents a compiled regular expression that can be used to match against input strings. It exposes the
following properties:

* `pattern`: the source string of the compiled regular expression.
* `flags`: an object representing the set of flags passed to the regular expression compiler, depending on the flavor of
  regular expressions used.
* `groupCount`: the number of capture groups present in the regular expression, including group 0.
* `groups`: a map of all named capture groups to their respective group number, or a null value if the expression does
  not contain named capture groups.
* `exec`: an executable method that matches the compiled regular expression against a string. The method accepts two
  parameters:
  * `input`: the character sequence to search in. This may either be a Java String, or a Truffle Object that behaves
    like a `char`-array.
  * `fromIndex`: the position to start searching from.
  * The return value is a `RegexResult` object.

### The result object

A `RegexResult` object represents the result of matching a regular expression against a string. It can be obtained as
the result of a `RegexObject`'s `exec` method and has the following properties:

* `boolean isMatch`: `true` if a match was found, `false` otherwise.
* `int getStart(int groupNumber)`: returns the position where the beginning of the capture group with the given number
  was found. If the result is not a match, the returned value is undefined. Capture group number `0` denotes the
  boundaries of the entire expression. If no match was found for a particular capture group, the returned value is `-1`.
* `int getEnd(int groupNumber)`: returns the position where the end of the capture group with the given number was
  found. If the result is not a match, the returned value is undefined. Capture group number `0` denotes the boundaries
  of the entire expression. If no match was found for a particular capture group, the returned value is `-1`.

Compiled regex usage example in pseudocode:

```
regex = <matcher from previous example>
assert(regex.pattern == "(a|(b))c")
assert(regex.flags.ignoreCase == true)
assert(regex.groupCount == 3)
result = regex.exec("xacy", 0)
assert(result.isMatch == true)
assertEquals([result.getStart(0), result.getEnd(0)], [ 1, 3])
assertEquals([result.getStart(1), result.getEnd(1)], [ 1, 2])
assertEquals([result.getStart(2), result.getEnd(2)], [-1, -1])
result2 = regex.exec("xxx", 0)
assert(result2.isMatch == false)
// result2.getStart(...) and result2.getEnd(...) are undefined
```
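
The group boundaries in the pseudocode above can be reproduced with `java.util.regex` as a stand-in engine. This is only a sketch for checking the expected numbers, not TRegex itself: the class `GroupBoundsSketch` and its `exec` method are hypothetical names. Note that `Matcher.groupCount()` excludes group 0, whereas the `groupCount` property described above includes it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in using java.util.regex; it mimics the shape of exec/getStart/getEnd only.
final class GroupBoundsSketch {
    // Returns {start, end} for every group including group 0, or null if there is no match.
    static int[][] exec(Pattern pattern, String input, int fromIndex) {
        Matcher m = pattern.matcher(input);
        if (!m.find(fromIndex)) {
            return null;
        }
        int[][] bounds = new int[m.groupCount() + 1][];
        for (int g = 0; g <= m.groupCount(); g++) {
            // start/end are -1 for a group that did not participate in the match
            bounds[g] = new int[]{m.start(g), m.end(g)};
        }
        return bounds;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(a|(b))c", Pattern.CASE_INSENSITIVE);
        int[][] result = exec(p, "xacy", 0);
        for (int g = 0; g < result.length; g++) {
            System.out.println(g + ": [" + result[g][0] + ", " + result[g][1] + "]");
        }
        System.out.println("xxx matches: " + (exec(p, "xxx", 0) != null));
    }
}
```

Returning `null` for a failed match is a choice of this sketch; a `RegexResult` instead reports `isMatch == false` and leaves the boundaries undefined.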

### Available options

These options define how TRegex should interpret a given regular expression:

#### User options
* `Flavor`: specifies the regex dialect to use. Possible values:
  * `ECMAScript`: ECMAScript/JavaScript syntax (default).
  * `PythonStr`: regular Python 3 syntax.
  * `PythonBytes`: Python 3 syntax, but for `bytes`-objects.
  * `Ruby`: Ruby syntax.
* `Encoding`: specifies the string encoding to match against. Possible values:
  * `UTF-8`
  * `UTF-16` (default)
  * `UTF-32`
  * `LATIN-1`
  * `BYTES` (equivalent to `LATIN-1`)
* `Validate`: don't generate a regex matcher object, just check the regex for syntax errors.
* `U180EWhitespace`: treat `0x180E MONGOLIAN VOWEL SEPARATOR` as part of `\s`. This is a legacy feature for languages
using a Unicode standard older than 6.3, such as ECMAScript 6 and older.

#### Performance tuning options
* `UTF16ExplodeAstralSymbols`: generate one DFA state per (16-bit) `char` instead of per code point. This may
  improve performance in certain scenarios, but increases the likelihood of DFA state explosion.
* `AlwaysEager`: do not generate any lazy regex matchers (lazy in the sense that they may lazily compute properties of a
  `RegexResult`).

#### Debugging options
* `RegressionTestMode`: exercise all supported regex matcher variants, and check if they produce the same results.
* `DumpAutomata`: dump all generated parser trees, NFA, and DFA to disk. This will generate debugging dumps of most
  relevant data structures in JSON, GraphViz and LaTeX format.
* `StepExecution`: dump tracing information about all DFA matcher runs.

All options except `Flavor` and `Encoding` are boolean and `false` by default.
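
As a rough illustration of the defaults described above, the following sketch fills in `Flavor` and `Encoding` when they are missing, treats every other option as a boolean that is `false` unless set, and builds a source string in the `options/regex/flags` layout. The class `RegexOptionsSketch` and its methods are hypothetical, and writing boolean options as `Name=true` is an assumption based on the "key-value pairs" description, not something this document spells out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of TRegex.
final class RegexOptionsSketch {
    // Returns a copy of the given options with the documented defaults filled in.
    static Map<String, String> withDefaults(Map<String, String> given) {
        Map<String, String> all = new LinkedHashMap<>();
        all.put("Flavor", "ECMAScript");
        all.put("Encoding", "UTF-16");
        all.putAll(given);
        return all;
    }

    // All options other than Flavor and Encoding are boolean and false by default.
    static boolean isEnabled(Map<String, String> given, String name) {
        return Boolean.parseBoolean(given.getOrDefault(name, "false"));
    }

    // Assumption: options are written as comma-separated Name=value pairs before the first slash.
    static String toSource(Map<String, String> options, String pattern, String flags) {
        StringJoiner joined = new StringJoiner(",");
        for (Map.Entry<String, String> e : options.entrySet()) {
            joined.add(e.getKey() + "=" + e.getValue());
        }
        return joined + "/" + pattern + "/" + flags;
    }

    public static void main(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("Flavor", "PythonStr");
        opts.put("Validate", "true");
        System.out.println(toSource(opts, "a+", ""));
        System.out.println(withDefaults(opts));
        System.out.println(isEnabled(opts, "AlwaysEager"));
    }
}
```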

This file was deleted.

@@ -45,38 +45,68 @@
import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;
import com.oracle.truffle.api.Truffle;
import com.oracle.truffle.api.TruffleLanguage;
import com.oracle.truffle.api.exception.AbstractTruffleException;
import com.oracle.truffle.api.instrumentation.ProvidedTags;
import com.oracle.truffle.api.instrumentation.StandardTags;
import com.oracle.truffle.api.interop.ExceptionType;
import com.oracle.truffle.api.nodes.RootNode;
import com.oracle.truffle.api.source.Source;
import com.oracle.truffle.regex.tregex.TRegexCompiler;
import com.oracle.truffle.regex.tregex.TRegexOptions;
import com.oracle.truffle.regex.tregex.nfa.PureNFAIndex;
import com.oracle.truffle.regex.tregex.parser.RegexParserGlobals;
import com.oracle.truffle.regex.tregex.parser.RegexValidator;
import com.oracle.truffle.regex.tregex.parser.ast.GroupBoundaries;
import com.oracle.truffle.regex.tregex.parser.flavors.RegexFlavor;
import com.oracle.truffle.regex.tregex.parser.flavors.RegexFlavorProcessor;
import com.oracle.truffle.regex.tregex.string.Encodings;
import com.oracle.truffle.regex.util.LRUCache;
import com.oracle.truffle.regex.util.TruffleNull;

import java.util.Collections;
import java.util.Map;

/**
* Truffle Regular Expression Language
* <p>
* This language represents classic regular expressions. By evaluating any source, you get access to
* the {@link RegexEngineBuilder}. By calling this builder, you can build your custom
* {@link RegexEngine} which implements your flavor of regular expressions and uses your fallback
* compiler for expressions not covered. The {@link RegexEngine} accepts regular expression patterns
* and flags and compiles them to {@link RegexObject}s, which you can use to match the regular
* expressions against strings.
* This language represents classic regular expressions. It accepts regular expressions in the
* following format: {@code options/regex/flags}, where {@code options} is a comma-separated list of
* key-value pairs which affect how the regex is interpreted (see {@link RegexOptions}), and
* {@code /regex/flags} is equivalent to the popular regular expression literal format found in e.g.
* JavaScript or Ruby.
* <p>
* When parsing a regular expression, TRegex will return a {@link CallTarget}, which, when called,
* will yield one of the following results:
* <ul>
* <li>a {@link TruffleNull} object, indicating that TRegex cannot handle the given regex</li>
* <li>a {@link RegexObject}, which can be used to match the given regex</li>
* <li>a {@link RegexSyntaxException} may be thrown to indicate a syntax error. This exception is an
* {@link AbstractTruffleException} with exception type {@link ExceptionType#PARSE_ERROR}.</li>
* </ul>
*
* An example of how to parse a regular expression:
*
* <pre>
* Usage example in pseudocode:
* {@code
* engineBuilder = <eval any source in the "regex" language>
* engine = engineBuilder("Flavor=ECMAScript", optionalFallbackCompiler)
* Object regex;
* try {
* regex = getContext().getEnv().parseInternal(Source.newBuilder("regex", "Flavor=ECMAScript/(a|(b))c/i", "myRegex").mimeType("application/tregex").internal(true).build()).call();
* } catch (AbstractTruffleException e) {
* if (InteropLibrary.getUncached().getExceptionType(e) == ExceptionType.PARSE_ERROR) {
* // handle parser error
* } else {
* // fatal error, this should never happen
* }
* }
* if (InteropLibrary.getUncached().isNull(regex)) {
* // regex is not supported by TRegex, fall back to a different regex engine
* }
* </pre>
*
* regex = engine("(a|(b))c", "i")
* Regex matcher usage example in pseudocode:
*
* <pre>
* {@code
* regex = <matcher from previous example>
* assert(regex.pattern == "(a|(b))c")
* assert(regex.flags.ignoreCase == true)
* assert(regex.groupCount == 3)
@@ -92,8 +122,10 @@
* // result2.getStart(...) and result2.getEnd(...) are undefined
* }
* </pre>
*
* @see RegexOptions
* @see RegexObject
*/

@TruffleLanguage.Registration(name = RegexLanguage.NAME, id = RegexLanguage.ID, characterMimeTypes = RegexLanguage.MIME_TYPE, version = "0.1", contextPolicy = TruffleLanguage.ContextPolicy.SHARED, internal = true, interactive = false)
@ProvidedTags(StandardTags.RootTag.class)
public final class RegexLanguage extends TruffleLanguage<RegexLanguage.RegexContext> {
@@ -127,7 +159,7 @@ protected CallTarget parse(ParsingRequest parsingRequest) {
RegexSource regexSource = createRegexSource(source);
CallTarget result = cacheGet(regexSource);
if (result == null) {
result = Truffle.getRuntime().createCallTarget(new GetRegexObjectNode(this, source, regexSource));
result = Truffle.getRuntime().createCallTarget(RootNode.createConstantNode(createRegexObject(regexSource)));
cachePut(regexSource, result);
}
return result;
@@ -155,7 +187,7 @@ private static RegexSource createRegexSource(Source source) {
int firstSlash = optBuilder.parseOptions(srcStr);
int lastSlash = srcStr.lastIndexOf('/');
assert firstSlash >= 0 && firstSlash <= srcStr.length();
if (lastSlash <= firstSlash || lastSlash >= srcStr.length()) {
if (lastSlash <= firstSlash) {
throw CompilerDirectives.shouldNotReachHere("malformed regex");
}
String pattern = srcStr.substring(firstSlash + 1, lastSlash);
@@ -167,6 +199,31 @@ private static RegexSource createRegexSource(Source source) {
return new RegexSource(pattern, flags, optBuilder.build(), source);
}

private Object createRegexObject(RegexSource source) {
RegexFlavor flavor = source.getOptions().getFlavor();
try {
if (flavor != null) {
RegexFlavorProcessor flavorProcessor = flavor.forRegex(source);
flavorProcessor.validate();
if (!source.getOptions().isValidate()) {
return new RegexObject(TRegexCompiler.compile(this, source), source, flavorProcessor.getFlags(), flavorProcessor.getNumberOfCaptureGroups(),
flavorProcessor.getNamedCaptureGroups());
}
} else {
RegexValidator validator = new RegexValidator(source);
validator.validate();
if (!source.getOptions().isValidate()) {
return new RegexObject(TRegexCompiler.compile(this, source), source, RegexFlags.parseFlags(source.getFlags()), validator.getNumberOfCaptureGroups(),
validator.getNamedCaptureGroups());
}
}
} catch (UnsupportedRegexException e) {
return TruffleNull.INSTANCE;
}
// reached only if source.getOptions().isValidate()
return TruffleNull.INSTANCE;
}

@Override
protected RegexContext createContext(Env env) {
return new RegexContext(env, engineBuilder);
@@ -45,6 +45,7 @@
import com.oracle.truffle.api.CallTarget;
import com.oracle.truffle.api.CompilerDirectives;
import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;
import com.oracle.truffle.api.TruffleLanguage;
import com.oracle.truffle.api.dsl.Cached;
import com.oracle.truffle.api.dsl.GenerateUncached;
import com.oracle.truffle.api.dsl.ImportStatic;
@@ -76,8 +77,9 @@

/**
* {@link RegexObject} represents a compiled regular expression that can be used to match against
* input strings. It is the result of executing a {@link RegexEngine}. It exposes the following
* three properties:
* input strings. It is the result of a call to
* {@link RegexLanguage#parse(TruffleLanguage.ParsingRequest)}. It exposes the following three
* properties:
* <ol>
* <li>{@link String} {@code pattern}: the source of the compiled regular expression</li>
* <li>{@link TruffleObject} {@code flags}: the set of flags passed to the regular expression