Apply Regex starting loop optimization to non-atomic loops as well (dotnet#35936)

* Apply Regex starting loop optimization to non-atomic loops as well

* Remove min iteration restriction

The node.M > 0 restriction isn't necessary, and it prevents this optimization from being used with * loops. Worst case, the loop doesn't match anything, and we pay to overwrite the starting position with itself. Best case, we eliminate a ton of cost.
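
The trade-off described above can be illustrated with a small model of the scan loop. This is only a sketch under stated assumptions, not the runtime's implementation: `StartingLoopSketch`, `Scan`, and the demo input are hypothetical names invented here, the pattern is restricted to the shape "character-class loop followed by a literal" (for example \w+@), and \w is approximated with char.IsLetterOrDigit. When an attempt fails, the optimized variant resumes at the position where the leading loop stopped instead of advancing by one character.

```csharp
using System;

// Hypothetical demo: models the pattern \w+@ over an invented input string.
int plain = StartingLoopSketch.Scan("ab cd e@f", char.IsLetterOrDigit, "@", false, out int plainAttempts);
int skipping = StartingLoopSketch.Scan("ab cd e@f", char.IsLetterOrDigit, "@", true, out int skippingAttempts);
Console.WriteLine($"bump by one:      index {plain}, {plainAttempts} starting positions tried");
Console.WriteLine($"skip to loop end: index {skipping}, {skippingAttempts} starting positions tried");

// Hypothetical model (not the runtime's code) of scanning for <class>+<literal>.
public static class StartingLoopSketch
{
    // Returns the index of the first match, or -1 if there is none.
    // 'attempts' counts how many starting positions were tried.
    public static int Scan(string input, Func<char, bool> inClass, string literal,
                           bool skipToLoopEnd, out int attempts)
    {
        attempts = 0;
        int start = 0;
        while (start < input.Length)
        {
            attempts++;

            // Greedily run the leading loop from 'start'.
            int loopEnd = start;
            while (loopEnd < input.Length && inClass(input[loopEnd]))
            {
                loopEnd++;
            }

            // Backtracking stays within this one starting position: give characters
            // back to the loop (keeping at least one) until the literal matches.
            for (int end = loopEnd; end > start; end--)
            {
                if (string.CompareOrdinal(input, end, literal, 0, literal.Length) == 0)
                {
                    return start;
                }
            }

            // The attempt failed. Every position inside [start, loopEnd) would run the
            // loop to the same end and retry a subset of the same give-backs, so it is
            // safe to resume at loopEnd. If the loop consumed nothing, bump by one.
            start = skipToLoopEnd && loopEnd > start ? loopEnd : start + 1;
        }

        return -1;
    }
}
```

Both calls should report the same match index; the second simply tries fewer starting positions. When the loop consumes nothing, the two variants behave identically, which corresponds to the "worst case" in the commit message.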
stephentoub authored May 9, 2020
1 parent 3cd02df commit 386ee46
Showing 2 changed files with 16 additions and 11 deletions.
@@ -310,18 +310,18 @@ internal RegexNode FinalOptimize()
     // to implementations that don't support backtracking.
     EliminateEndingBacktracking(rootNode.Child(0), DefaultMaxRecursionDepth);

-    // Optimization: unnecessary re-processing of atomic starting groups.
-    // If an expression is guaranteed to begin with a single-character infinite atomic group that isn't part of an alternation (in which case it
+    // Optimization: unnecessary re-processing of starting loops.
+    // If an expression is guaranteed to begin with a single-character unbounded loop that isn't part of an alternation (in which case it
     // wouldn't be guaranteed to be at the beginning) or a capture (in which case a back reference could be influenced by its length), then we
     // can update the tree with a temporary node to indicate that the implementation should use that node's ending position in the input text
     // as the next starting position at which to start the next match. This avoids redoing matches we've already performed, e.g. matching
     // "\w+@dotnetfoundation.org" against "is this a valid address@dotnetfoundation.org", the \w+ will initially match the "is" and then will fail to match the "@".
-    // Rather than bumping the scan loop by 1 and trying again to match at the "s", we can instead start at the " ". We limit ourselves to
-    // one/set atomic loops with a min iteration count of 1 so that we know we'll get something in exchange for the extra overhead of storing
-    // the updated position. For functional correctness we can only consider infinite atomic loops, as to be able to start at the end of the
-    // loop we need the loop to have consumed all possible matches; otherwise, you could end up with a pattern like "a{1,3}b" matching
-    // against "aaaabc", which should match, but if we pre-emptively stop consuming after the first three a's and re-start from that position,
-    // we'll end up failing the match even though it should have succeeded.
+    // Rather than bumping the scan loop by 1 and trying again to match at the "s", we can instead start at the " ". For functional correctness
+    // we can only consider unbounded loops, as to be able to start at the end of the loop we need the loop to have consumed all possible matches;
+    // otherwise, you could end up with a pattern like "a{1,3}b" matching against "aaaabc", which should match, but if we pre-emptively stop consuming
+    // after the first three a's and re-start from that position, we'll end up failing the match even though it should have succeeded. We can also
+    // apply this optimization to non-atomic loops. Even though backtracking could be necessary, such backtracking would be handled within the processing
+    // of a single starting position.
     {
         RegexNode node = rootNode.Child(0); // skip implicit root capture node
         while (true)
@@ -333,9 +333,12 @@ internal RegexNode FinalOptimize()
             node = node.Child(0);
             continue;

-        case Oneloopatomic when node.M > 0 && node.N == int.MaxValue:
-        case Notoneloopatomic when node.M > 0 && node.N == int.MaxValue:
-        case Setloopatomic when node.M > 0 && node.N == int.MaxValue:
+        case Oneloop when node.N == int.MaxValue:
+        case Oneloopatomic when node.N == int.MaxValue:
+        case Notoneloop when node.N == int.MaxValue:
+        case Notoneloopatomic when node.N == int.MaxValue:
+        case Setloop when node.N == int.MaxValue:
+        case Setloopatomic when node.N == int.MaxValue:
             RegexNode? parent = node.Next;
             if (parent != null && parent.Type == Concatenate)
             {
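
The reason the cases above still require node.N == int.MaxValue can be checked with the public Regex API. The snippet below is a minimal sketch that reuses the "a{1,3}b" and "aaaabc" example from the comment; the comparison pattern "a+b" is an addition made here for contrast. With the bounded loop, the leftmost match starts at the second "a", a position that lies inside the span the loop consumed on the first attempt, so resuming the scan from the end of that span would skip it.

```csharp
using System;
using System.Text.RegularExpressions;

// Bounded loop: a{1,3} stops after three a's even though a fourth one follows,
// so the first attempt (at index 0) fails and the match is found one position later.
Match bounded = Regex.Match("aaaabc", "a{1,3}b");
Console.WriteLine($"a{{1,3}}b: success={bounded.Success}, index={bounded.Index}, value={bounded.Value}");

// Unbounded loop: a+ consumes every available 'a', so no later position inside
// the consumed span could start a match that the first attempt missed.
Match unbounded = Regex.Match("aaaabc", "a+b");
Console.WriteLine($"a+b:     success={unbounded.Success}, index={unbounded.Index}, value={unbounded.Value}");
```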
@@ -124,6 +124,8 @@ public static IEnumerable<object[]> Match_Basic_TestData()
     yield return new object[] { @"\w+(?<!a)", "aa", RegexOptions.None, 0, 2, false, string.Empty };
     yield return new object[] { @"(?>\w+)(?<!a)", "a", RegexOptions.None, 0, 1, false, string.Empty };
     yield return new object[] { @"(?>\w+)(?<!a)", "aa", RegexOptions.None, 0, 2, false, string.Empty };
+    yield return new object[] { @".+a", "baa", RegexOptions.None, 0, 3, true, "baa" };
+    yield return new object[] { @"[ab]+a", "cacbaac", RegexOptions.None, 0, 7, true, "baa" };

     // Using beginning/end of string chars \A, \Z: Actual - "\\Aaaa\\w+zzz\\Z"
     yield return new object[] { @"\Aaaa\w+zzz\Z", "aaaasdfajsdlfjzzz", RegexOptions.IgnoreCase, 0, 17, true, "aaaasdfajsdlfjzzz" };
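
The two added rows can be reproduced outside the test harness. The sketch below assumes the columns mean pattern, input, options, beginning, length, expected success, and expected value; that reading is inferred from the row shape and is not stated in the diff. The tuple array `rows` is a hypothetical stand-in for the test data. Both patterns begin with a non-atomic unbounded loop that has to give back a character before the trailing "a" can match, which is the situation this commit newly covers.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the two new Match_Basic_TestData rows.
(string Pattern, string Input, int Beginning, int Length, string Expected)[] rows =
{
    (@".+a", "baa", 0, 3, "baa"),
    (@"[ab]+a", "cacbaac", 0, 7, "baa"),
};

foreach (var row in rows)
{
    // Regex.Match(input, beginning, length) restricts matching to the given substring.
    Match m = new Regex(row.Pattern, RegexOptions.None).Match(row.Input, row.Beginning, row.Length);
    bool ok = m.Success && m.Value == row.Expected;
    Console.WriteLine($"{row.Pattern} on \"{row.Input}\": value=\"{m.Value}\", index={m.Index}, as expected={ok}");
}
```

In the second row, the attempt starting at the first "a" fails because the loop must keep at least one character, and the match is found later at "baa".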