forked from dotnet/runtime
Commit
Improve regex reductions and code gen for some alternations (dotnet#59903)

* Add optimize alternation path to regex source generator

If all branches of an alternation begin with a fixed character, we can emit a switch over just that character and save on potentially lots of failed match attempts, especially if the C# compiler can lower the switch into an IL switch or otherwise optimize the search based on first character.

* Improve alternation extraction of prefixes

Given an expression like "ab|ac|ade", we already reduce this to "a(?:b|c|de)" in order to factor out the starting "a". But given an expression like "ab|acd|ef|egh", we don't currently extract the individual prefixes, e.g. "a(?:b|cd)|e(?:f|gh)", which would be valuable for a few reasons. Primarily, it enables more efficient processing of the alternation, as a failed match in one branch then has to explore fewer branches, and we can potentially make that initial branch selection even faster if all the branches start with unique, fixed characters.

* Improve reduction for atomic alternations

The primary change here is an additional reduction for atomic alternations that enables subsequent optimizations to do more. We previously added an optimization for alternation reduction that enables extraction of a common prefix from a contiguous sequence of branches in an alternation. Such extraction isn't possible if there's a branch in between two that could otherwise be combined, and in the general case, we can't reorder the branches as that breaks the semantics of ordering being visible. However, for atomic alternations, where no backtracking back into the node is possible, if we can prove that the intermediate branch can't match the same things as the other branches, reordering is fine. Thus, for atomic alternations, we can reorder branches that begin with the same character as long as we can prove that the intermediate branches may never match that same character. For now, we stick to fixed characters, though in the future this could be extended to sets/notones as well.

This also adds a minor optimization for atomic alternations that trims away all branches after an empty branch. And it tweaks the pass that finds nodes to mark as atomic, ensuring that a top-level alternation is marked atomic so that the aforementioned optimizations kick in.

* Address PR feedback
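
The reductions described in this commit message can be illustrated with a small sketch. This is not the code from the commit, which operates on the .NET regex node tree in C#. It is a simplified Python model that assumes every branch is a plain literal string, and the function names `factor_first_char` and `reduce_atomic` are hypothetical names chosen for illustration.

```python
import re
from itertools import groupby


def factor_first_char(branches):
    """Factor a shared first character out of each contiguous run of branches.

    Assumption: every branch is a literal string, not a regex fragment.
    Only adjacent branches are combined, so branch order is preserved.
    """
    parts = []
    for first, run in groupby(branches, key=lambda b: b[:1]):
        run = list(run)
        if first == "" or len(run) == 1:
            # Nothing to factor: emit the branches unchanged.
            parts.extend(re.escape(b) for b in run)
        else:
            tails = "|".join(re.escape(b[1:]) for b in run)
            parts.append(f"{re.escape(first)}(?:{tails})")
    return "|".join(parts)


def reduce_atomic(branches):
    """Reorder literal branches of an atomic alternation by first character.

    Assumption: the alternation is atomic, so the first branch that matches
    wins and nothing backtracks into it. A branch starting with a different
    fixed character can never match the same input position, so moving
    same-character branches next to each other does not change the result.
    """
    # An empty branch always matches, so later branches are unreachable.
    if "" in branches:
        branches = branches[: branches.index("") + 1]
    buckets = {}  # dicts keep insertion order, so relative order is stable
    for b in branches:
        buckets.setdefault(b[:1], []).append(b)
    return [b for run in buckets.values() for b in run]


print(factor_first_char(["ab", "ac", "ade"]))        # a(?:b|c|de)
print(factor_first_char(["ab", "acd", "ef", "egh"]))  # a(?:b|cd)|e(?:f|gh)

# Non-contiguous branches only become factorable after the atomic reorder.
reordered = reduce_atomic(["ab", "ef", "acd", "egh"])
print(factor_first_char(reordered))                   # a(?:b|cd)|e(?:f|gh)
```

Once every factored branch starts with a unique fixed character, as in the last result, a generator could dispatch on that one character (a switch in C#) instead of trying each branch in turn.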
1 parent 62d70c4, commit 44f8982
Showing 3 changed files with 466 additions and 228 deletions.