TRegex: remove laziness from RegexLanguage.parse, add more documentation

The-Alchemist · Dec 16, 2020 · commit 90deca7 · 1 parent 86d1439

Showing 5 changed files with 240 additions and 125 deletions.
diff --git a/regex/docs/README.md b/regex/docs/README.md
@@ -0,0 +1,123 @@
+# TRegex - Truffle Regular Expression Language
+
+This Truffle language represents classic regular expressions. It treats a given regular expression as a "program" that
+you can execute to obtain a regular expression matcher object, which in turn can be used to perform regex searches via
+Truffle interop.
+
+The expected syntax is `options/regex/flags`, where `options` is a comma-separated list of key-value pairs which affect
+how the regex is interpreted, and `/regex/flags` is equivalent to the popular regular expression literal format found in
+e.g. JavaScript or Ruby.
+
+### Parsing
+
+When parsing a regular expression, TRegex will return a Truffle CallTarget, which, when called, will yield one of the
+following results:
+
+* a (Truffle) null value, indicating that TRegex cannot handle the given regex.
+* a "compiled regex" (`RegexObject`) object, which can be used to match the given regex.
+* a Truffle `PARSE_ERROR` exception may be thrown to indicate a syntax error.
+
+An example of how to parse a regular expression:
+
+```
+Source source = Source.newBuilder("regex", "Flavor=ECMAScript/(a|(b))c/i", "myRegex").mimeType("application/tregex").internal(true).build();
+Object regex;
+try {
+    regex = getContext().getEnv().parseInternal(source).call();
+} catch (AbstractTruffleException e) {
+    if (InteropLibrary.getUncached().getExceptionType(e) == ExceptionType.PARSE_ERROR) {
+        // handle parser error
+    } else {
+        // fatal error, this should never happen
+    }
+}
+if (InteropLibrary.getUncached().isNull(regex)) {
+    // regex is not supported by TRegex, fall back to a different regex engine
+}
+```
+
+### The compiled regex object
+
+A `RegexObject` represents a compiled regular expression that can be used to match against input strings. It exposes the
+following five properties:
+
+* `pattern`: the source string of the compiled regular expression.
+* `flags`: an object representing the set of flags passed to the regular expression compiler, depending on the flavor of
+  regular expressions used.
+* `groupCount`: the number of capture groups present in the regular expression, including group 0.
+* `groups`: a map of all named capture groups to their respective group number, or a null value if the expression does
+  not contain named capture groups.
+* `exec`: an executable method that matches the compiled regular expression against a string. The method accepts two
+  parameters:
+    * `input`: the character sequence to search in. This may either be a Java String, or a Truffle Object that behaves
+      like a `char`-array.
+    * `fromIndex`: the position to start searching from.
+    * The return value is a `RegexResult` object.
+
+### The result object
+
+A `RegexResult` object represents the result of matching a regular expression against a string. It can be obtained as
+the result of a `RegexObject`'s `exec` method and has the following properties:
+
+* `boolean isMatch`: `true` if a match was found, `false` otherwise.
+* `int getStart(int groupNumber)`: returns the position where the beginning of the capture group with the given number
+  was found. If the result is not a match, the returned value is undefined. Capture group number `0` denotes the
+  boundaries of the entire expression. If a match was found but a particular capture group did not take part in it, the
+  returned value for that group is `-1`.
+* `int getEnd(int groupNumber)`: returns the position where the end of the capture group with the given number was
+  found. If the result is not a match, the returned value is undefined. Capture group number `0` denotes the boundaries
+  of the entire expression. If a match was found but a particular capture group did not take part in it, the returned
+  value for that group is `-1`.
+
+Compiled regex usage example in pseudocode:
+
+```
+regex = <matcher from previous example>
+assert(regex.pattern == "(a|(b))c")
+assert(regex.flags.ignoreCase == true)
+assert(regex.groupCount == 3)
+
+result = regex.exec("xacy", 0)
+assert(result.isMatch == true)
+assertEquals([result.getStart(0), result.getEnd(0)], [ 1,  3])
+assertEquals([result.getStart(1), result.getEnd(1)], [ 1,  2])
+assertEquals([result.getStart(2), result.getEnd(2)], [-1, -1])
+
+result2 = regex.exec("xxx", 0)
+assert(result2.isMatch == false)
+// result2.getStart(...) and result2.getEnd(...) are undefined
+
+```
+
+### Available options
+
+These options define how TRegex should interpret a given regular expression:
+
+#### User options
+* `Flavor`: specifies the regex dialect to use. Possible values:
+  * `ECMAScript`: ECMAScript/JavaScript syntax (default).
+  * `PythonStr`: regular Python 3 syntax.
+  * `PythonBytes`: Python 3 syntax, but for `bytes`-objects.
+  * `Ruby`: Ruby syntax.
+* `Encoding`: specifies the string encoding to match against. Possible values:
+  * `UTF-8`
+  * `UTF-16` (default)
+  * `UTF-32`
+  * `LATIN-1`
+  * `BYTES` (equivalent to `LATIN-1`)
+* `Validate`: don't generate a regex matcher object, just check the regex for syntax errors.
+* `U180EWhitespace`: treat `0x180E MONGOLIAN VOWEL SEPARATOR` as part of `\s`. This is a legacy feature for languages
+  using a Unicode standard older than 6.3, such as ECMAScript 6 and older.
+
+#### Performance tuning options
+* `UTF16ExplodeAstralSymbols`: generate one DFA state per (16 bit) `char` instead of per-codepoint. This may
+  improve performance in certain scenarios, but increases the likelihood of DFA state explosion.
+* `AlwaysEager`: do not generate any lazy regex matchers (lazy in the sense that they may lazily compute properties of a
+  `RegexResult`).
+
+#### Debugging options
+* `RegressionTestMode`: exercise all supported regex matcher variants, and check if they produce the same results.
+* `DumpAutomata`: dump all generated parse trees, NFA, and DFA to disk. This will generate debugging dumps of most
+  relevant data structures in JSON, GraphViz and LaTeX format.
+* `StepExecution`: dump tracing information about all DFA matcher runs.
+
+All options except `Flavor` and `Encoding` are boolean and `false` by default.
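
Annotation on the README's result semantics (not part of the commit): the `getStart`/`getEnd` conventions described above can be tried out without Truffle. The sketch below is a stand-in that uses `java.util.regex`, not TRegex. The class `GroupBoundsSketch` and the method `groupBounds` are hypothetical names introduced here. It assumes that `java.util.regex` and the ECMAScript flavor agree on this simple pattern, and it adds 1 to `Matcher.groupCount()` because the README counts group 0 while `java.util.regex` does not.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in: java.util.regex instead of TRegex, to illustrate the
// group-boundary conventions described in the README.
final class GroupBoundsSketch {

    /**
     * Returns {start, end} for every capture group (index 0 is the whole match),
     * or null if the regex does not match anywhere at or after fromIndex.
     */
    static int[][] groupBounds(String regex, int flags, String input, int fromIndex) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        if (!m.find(fromIndex)) {
            return null; // corresponds to isMatch == false
        }
        // The README's groupCount includes group 0; Matcher.groupCount() excludes it.
        int groupCount = m.groupCount() + 1;
        int[][] bounds = new int[groupCount][2];
        for (int i = 0; i < groupCount; i++) {
            bounds[i][0] = m.start(i); // -1 if group i did not take part in the match
            bounds[i][1] = m.end(i);   // -1 if group i did not take part in the match
        }
        return bounds;
    }

    public static void main(String[] args) {
        int[][] bounds = groupBounds("(a|(b))c", Pattern.CASE_INSENSITIVE, "xacy", 0);
        System.out.println(Arrays.deepToString(bounds)); // [[1, 3], [1, 2], [-1, -1]]
        System.out.println(groupBounds("(a|(b))c", Pattern.CASE_INSENSITIVE, "xxx", 0) == null); // true
    }
}
```

The output mirrors the pseudocode assertions in the README: group 0 spans `[1, 3]`, group 1 spans `[1, 2]`, and the unmatched group 2 reports `-1` for both positions.
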
diff --git a/regex/src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/GetRegexObjectNode.java b/regex/src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/GetRegexObjectNode.java
diff --git a/regex/src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/RegexLanguage.java b/regex/src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/RegexLanguage.java
@@ -45,38 +45,68 @@
 import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;
 import com.oracle.truffle.api.Truffle;
 import com.oracle.truffle.api.TruffleLanguage;
+import com.oracle.truffle.api.exception.AbstractTruffleException;
 import com.oracle.truffle.api.instrumentation.ProvidedTags;
 import com.oracle.truffle.api.instrumentation.StandardTags;
+import com.oracle.truffle.api.interop.ExceptionType;
 import com.oracle.truffle.api.nodes.RootNode;
 import com.oracle.truffle.api.source.Source;
+import com.oracle.truffle.regex.tregex.TRegexCompiler;
 import com.oracle.truffle.regex.tregex.TRegexOptions;
 import com.oracle.truffle.regex.tregex.nfa.PureNFAIndex;
 import com.oracle.truffle.regex.tregex.parser.RegexParserGlobals;
+import com.oracle.truffle.regex.tregex.parser.RegexValidator;
 import com.oracle.truffle.regex.tregex.parser.ast.GroupBoundaries;
+import com.oracle.truffle.regex.tregex.parser.flavors.RegexFlavor;
+import com.oracle.truffle.regex.tregex.parser.flavors.RegexFlavorProcessor;
 import com.oracle.truffle.regex.tregex.string.Encodings;
 import com.oracle.truffle.regex.util.LRUCache;
+import com.oracle.truffle.regex.util.TruffleNull;
 
 import java.util.Collections;
 import java.util.Map;
 
 /**
  * Truffle Regular Expression Language
  * <p>
- * This language represents classic regular expressions. By evaluating any source, you get access to
- * the {@link RegexEngineBuilder}. By calling this builder, you can build your custom
- * {@link RegexEngine} which implements your flavor of regular expressions and uses your fallback
- * compiler for expressions not covered. The {@link RegexEngine} accepts regular expression patterns
- * and flags and compiles them to {@link RegexObject}s, which you can use to match the regular
- * expressions against strings.
+ * This language represents classic regular expressions. It accepts regular expressions in the
+ * following format: {@code options/regex/flags}, where {@code options} is a comma-separated list of
+ * key-value pairs which affect how the regex is interpreted (see {@link RegexOptions}), and
+ * {@code /regex/flags} is equivalent to the popular regular expression literal format found in e.g.
+ * JavaScript or Ruby.
  * <p>
+ * When parsing a regular expression, TRegex will return a {@link CallTarget}, which, when called,
+ * will yield one of the following results:
+ * <ul>
+ * <li>a {@link TruffleNull} object, indicating that TRegex cannot handle the given regex</li>
+ * <li>a {@link RegexObject}, which can be used to match the given regex</li>
+ * <li>a {@link RegexSyntaxException} may be thrown to indicate a syntax error. This exception is an
+ * {@link AbstractTruffleException} with exception type {@link ExceptionType#PARSE_ERROR}.</li>
+ * </ul>
  *
+ * An example of how to parse a regular expression:
+ * 
  * <pre>
- * Usage example in pseudocode:
- * {@code
- * engineBuilder = <eval any source in the "regex" language>
- * engine = engineBuilder("Flavor=ECMAScript", optionalFallbackCompiler)
+ * Object regex;
+ * try {
+ *     regex = getContext().getEnv().parseInternal(Source.newBuilder("regex", "Flavor=ECMAScript/(a|(b))c/i", "myRegex").mimeType("application/tregex").internal(true).build()).call();
+ * } catch (AbstractTruffleException e) {
+ *     if (InteropLibrary.getUncached().getExceptionType(e) == ExceptionType.PARSE_ERROR) {
+ *         // handle parser error
+ *     } else {
+ *         // fatal error, this should never happen
+ *     }
+ * }
+ * if (InteropLibrary.getUncached().isNull(regex)) {
+ *     // regex is not supported by TRegex, fall back to a different regex engine
+ * }
+ * </pre>
  *
- * regex = engine("(a|(b))c", "i")
+ * Regex matcher usage example in pseudocode:
+ * 
+ * <pre>
+ * {@code
+ * regex = <matcher from previous example>
  * assert(regex.pattern == "(a|(b))c")
  * assert(regex.flags.ignoreCase == true)
  * assert(regex.groupCount == 3)
@@ -92,8 +122,10 @@
  * // result2.getStart(...) and result2.getEnd(...) are undefined
  * }
  * </pre>
+ *
+ * @see RegexOptions
+ * @see RegexObject
  */
-
 @TruffleLanguage.Registration(name = RegexLanguage.NAME, id = RegexLanguage.ID, characterMimeTypes = RegexLanguage.MIME_TYPE, version = "0.1", contextPolicy = TruffleLanguage.ContextPolicy.SHARED, internal = true, interactive = false)
 @ProvidedTags(StandardTags.RootTag.class)
 public final class RegexLanguage extends TruffleLanguage<RegexLanguage.RegexContext> {
@@ -127,7 +159,7 @@ protected CallTarget parse(ParsingRequest parsingRequest) {
             RegexSource regexSource = createRegexSource(source);
             CallTarget result = cacheGet(regexSource);
             if (result == null) {
-                result = Truffle.getRuntime().createCallTarget(new GetRegexObjectNode(this, source, regexSource));
+                result = Truffle.getRuntime().createCallTarget(RootNode.createConstantNode(createRegexObject(regexSource)));
                 cachePut(regexSource, result);
             }
             return result;
@@ -155,7 +187,7 @@ private static RegexSource createRegexSource(Source source) {
         int firstSlash = optBuilder.parseOptions(srcStr);
         int lastSlash = srcStr.lastIndexOf('/');
         assert firstSlash >= 0 && firstSlash <= srcStr.length();
-        if (lastSlash <= firstSlash || lastSlash >= srcStr.length()) {
+        if (lastSlash <= firstSlash) {
             throw CompilerDirectives.shouldNotReachHere("malformed regex");
         }
         String pattern = srcStr.substring(firstSlash + 1, lastSlash);
@@ -167,6 +199,31 @@ private static RegexSource createRegexSource(Source source) {
         return new RegexSource(pattern, flags, optBuilder.build(), source);
     }
 
+    private Object createRegexObject(RegexSource source) {
+        RegexFlavor flavor = source.getOptions().getFlavor();
+        try {
+            if (flavor != null) {
+                RegexFlavorProcessor flavorProcessor = flavor.forRegex(source);
+                flavorProcessor.validate();
+                if (!source.getOptions().isValidate()) {
+                    return new RegexObject(TRegexCompiler.compile(this, source), source, flavorProcessor.getFlags(), flavorProcessor.getNumberOfCaptureGroups(),
+                                    flavorProcessor.getNamedCaptureGroups());
+                }
+            } else {
+                RegexValidator validator = new RegexValidator(source);
+                validator.validate();
+                if (!source.getOptions().isValidate()) {
+                    return new RegexObject(TRegexCompiler.compile(this, source), source, RegexFlags.parseFlags(source.getFlags()), validator.getNumberOfCaptureGroups(),
+                                    validator.getNamedCaptureGroups());
+                }
+            }
+        } catch (UnsupportedRegexException e) {
+            return TruffleNull.INSTANCE;
+        }
+        // reached only if source.getOptions().isValidate()
+        return TruffleNull.INSTANCE;
+    }
+
     @Override
     protected RegexContext createContext(Env env) {
         return new RegexContext(env, engineBuilder);
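
Annotation on `createRegexSource` (not part of the commit): the slash handling in the hunk above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the actual implementation. `RegexSourceSplitSketch`, `Parts`, and `split` are hypothetical names. `indexOf('/')` stands in for `optBuilder.parseOptions(srcStr)`, which assumes the options never contain a `/`. Taking the flags as everything after the last slash is also an assumption, since that line is not shown in the diff.

```java
// Hypothetical, simplified illustration of splitting "options/regex/flags".
final class RegexSourceSplitSketch {

    /** Simple holder for the three parts of a regex source string. */
    static final class Parts {
        final String options;
        final String pattern;
        final String flags;

        Parts(String options, String pattern, String flags) {
            this.options = options;
            this.pattern = pattern;
            this.flags = flags;
        }
    }

    static Parts split(String src) {
        // Assumption: options contain no '/', so the first slash ends the options.
        int firstSlash = src.indexOf('/');
        // The pattern itself may contain slashes, so the flags start after the LAST slash.
        int lastSlash = src.lastIndexOf('/');
        if (firstSlash < 0 || lastSlash <= firstSlash) {
            throw new IllegalArgumentException("malformed regex: " + src);
        }
        return new Parts(
                        src.substring(0, firstSlash),
                        src.substring(firstSlash + 1, lastSlash),
                        src.substring(lastSlash + 1));
    }

    public static void main(String[] args) {
        Parts p = split("Flavor=ECMAScript/(a|(b))c/i");
        System.out.println(p.options + " | " + p.pattern + " | " + p.flags);
    }
}
```

Since `lastIndexOf` can never return an index at or beyond the string length, the sketch only needs the `lastSlash <= firstSlash` comparison, which matches the simplified condition in the hunk.
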

diff --git a/regex/src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/RegexObject.java b/regex/src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/RegexObject.java
@@ -45,6 +45,7 @@
 import com.oracle.truffle.api.CallTarget;
 import com.oracle.truffle.api.CompilerDirectives;
 import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;
+import com.oracle.truffle.api.TruffleLanguage;
 import com.oracle.truffle.api.dsl.Cached;
 import com.oracle.truffle.api.dsl.GenerateUncached;
 import com.oracle.truffle.api.dsl.ImportStatic;
@@ -76,8 +77,9 @@
 
 /**
  * {@link RegexObject} represents a compiled regular expression that can be used to match against
- * input strings. It is the result of executing a {@link RegexEngine}. It exposes the following
- * three properties:
+ * input strings. It is the result of a call to
+ * {@link RegexLanguage#parse(TruffleLanguage.ParsingRequest)}. It exposes the following three
+ * properties:
  * <ol>
  * <li>{@link String} {@code pattern}: the source of the compiled regular expression</li>
  * <li>{@link TruffleObject} {@code flags}: the set of flags passed to the regular expression