Improve regex optimizer through investigation of regex optimizer passes #125289
danmoseley merged 9 commits into dotnet:main
After FinalOptimize's EliminateEndingBacktracking and FindAndMakeLoopsAtomic create new tree structures (Atomic wrappers, restructured alternations), walk the tree bottom-up and re-call Reduce() on each node. This cleans up patterns like Concat(X, Empty), redundant Atomic(Oneloopatomic), and enables further prefix extraction, improving 221 of 18,931 real-world NuGet patterns (1.2%), all converging in a single extra round.
Also moves UpdateBumpalong after ReReduceTree so it operates on the final tree structure.
Fixes dotnet#66031
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR enhances System.Text.RegularExpressions’ post-parse optimization pipeline by adding a final “re-reduce” cleanup pass after FinalOptimize transformations, ensuring any newly introduced structures are simplified by running Reduce() again.
Changes:
- Add a ReReduceTree() traversal at the end of RegexNode.FinalOptimize() (after EliminateEndingBacktracking and before UpdateBumpalong).
- Move UpdateBumpalong to run after the re-reduction so it sees the final reshaped tree.
- Add focused unit tests covering patterns that only fully optimize with the new re-reduce pass.
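The re-reduce traversal described above can be sketched on a toy tree. This is a minimal illustration, not the code in RegexNode.cs: the ToyNode, ToyKind, and ToyReducer types, the three reduction rules, and the use of RuntimeHelpers.TryEnsureSufficientExecutionStack() as a stand-in for the internal stack guard are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;

// Toy node kinds; illustrative only, not the real RegexNodeKind values.
public enum ToyKind { Empty, Lit, Concat, Atomic }

public sealed class ToyNode
{
    public ToyKind Kind;
    public char Ch;
    public List<ToyNode> Children = new List<ToyNode>();

    public static ToyNode Empty() => new ToyNode { Kind = ToyKind.Empty };
    public static ToyNode Lit(char c) => new ToyNode { Kind = ToyKind.Lit, Ch = c };
    public static ToyNode Concat(params ToyNode[] kids) =>
        new ToyNode { Kind = ToyKind.Concat, Children = kids.ToList() };
    public static ToyNode Atomic(ToyNode child) =>
        new ToyNode { Kind = ToyKind.Atomic, Children = new List<ToyNode> { child } };
}

public static class ToyReducer
{
    // Local, single-node simplification (a stand-in for a per-node Reduce()).
    public static ToyNode Reduce(ToyNode node)
    {
        switch (node.Kind)
        {
            case ToyKind.Concat:
                // Drop Empty children; collapse a 0- or 1-child concatenation.
                var kept = node.Children.Where(c => c.Kind != ToyKind.Empty).ToList();
                if (kept.Count == 0) return ToyNode.Empty();
                if (kept.Count == 1) return kept[0];
                node.Children = kept;
                return node;
            case ToyKind.Atomic:
                // In this toy model only a Concat child keeps its Atomic wrapper;
                // Empty, a single literal, or a nested Atomic makes it redundant.
                var child = node.Children[0];
                return child.Kind == ToyKind.Concat ? node : child;
            default:
                return node;
        }
    }

    // Bottom-up walk: re-reduce every child first, then the node itself.
    public static ToyNode ReReduceTree(ToyNode node)
    {
        // Leaves have nothing to recurse into.
        if (node.Children.Count == 0) return node;
        // Assumed guard behavior: on a very deep tree, leave this subtree as-is.
        if (!RuntimeHelpers.TryEnsureSufficientExecutionStack()) return node;

        for (int i = 0; i < node.Children.Count; i++)
        {
            node.Children[i] = ReReduceTree(node.Children[i]);
        }
        return Reduce(node);
    }

    // Render the toy tree in regex-like notation for inspection.
    public static string Render(ToyNode n) => n.Kind switch
    {
        ToyKind.Empty => "(?:)",
        ToyKind.Lit => n.Ch.ToString(),
        ToyKind.Concat => string.Concat(n.Children.Select(c => Render(c))),
        _ => "(?>" + Render(n.Children[0]) + ")",
    };
}
```

Building ToyNode.Atomic(ToyNode.Concat(ToyNode.Lit('a'), ToyNode.Empty())) and passing it to ToyReducer.ReReduceTree collapses the inner concatenation first and then drops the now-redundant Atomic wrapper, which is why the walk has to be bottom-up.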
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs | Adds ReReduceTree() and updates FinalOptimize() pass ordering so reductions are re-applied after global rewrites. |
| src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexReductionTests.cs | Adds regression cases asserting the re-reduced trees match the expected simplified forms. |
- Row 3: Remove spurious capture group from before-fix equivalent (?>([ab]+c[ab]+|[ab]+)) -> (?>[ab]+c[ab]+|[ab]+)
- Row 5: Add missing 'a' prefix in before-fix equivalent (?>(?>b?)|d) -> (?>a(?>b?)|d)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ReplaceChild already handles Reduce + re-parenting, so delegate to it instead of duplicating that logic. Also avoids a double-Reduce that occurred when the manual code passed the reduced node to ReplaceChild.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Better name that pairs with FinalOptimize which calls it.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
8cda521 to 0281f05
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Thanks. Could you explore variations, in particular for overall parse time, like not doing any reduction (other than maybe removing Group) until the end, and then reducing until it stabilizes?
I explored several variations of deferring reduction to the end. Here's a summary of the experiments and findings.
Setup: 15,817 unique regex patterns from a real-world JSON corpus, Release build,
Experiment 1: Baselines
Measured the cost of
Experiment 2: Phase profiling (this PR)
Instrumented each phase to understand where time is spent:
Experiment 3: Naive deferral --
| Variant | Per-pattern (μs) |
|---|---|
| This PR (reduce during parse + FinalReduce) | 4.08 +/- 0.07 |
| Deferred (ReduceMinimal during parse + 2x FinalReduce) | 4.29 +/- 0.07 |
| Difference | +0.21 μs (5%, t=8.2, p << 0.01) |
Deferred phase breakdown:
| Phase | Per-pattern (μs) | % of total |
|---|---|---|
| Parse + ReduceMinimal | 2.3 | 54% |
| Pre-optimization FinalReduce | 0.8 | 18% |
| FindAndMakeLoopsAtomic | 0.5 | 12% |
| EliminateEndingBacktracking | 0.2 | 4% |
| Post-optimization FinalReduce | 0.5 | 12% |
| Total | 4.2 | |
Deferring saves ~0.4 μs/pattern during parse but adds 0.8 μs for the pre-optimization FinalReduce tree walk.
Conclusions
The deferred approach is both more complex and slower:
- More code: Needs a new AddChildMinimal method, changes to 13 parser call sites, FinalReduce moved outside the RTL/NonBacktracking guard, and two full tree-walk reduction passes instead of one.
- Slower: 4.29 vs 4.08 μs/pattern (5% regression, statistically significant). The parse-time savings from skipping full Reduce are more than offset by the additional tree walk.
- Fundamental constraint: Reduction methods (ReduceAlternation, etc.) create new nodes via AddChild and depend on full reduction happening there. You can't simply remove Reduce from AddChild -- you need a two-track system (minimal for parser, full for reducers).
The current PR approach -- reduce during parse + one FinalReduce at end -- appears to be the simplest correct approach. The integrated parse-time reduction is essentially free (no extra tree walk), and only one post-optimization FinalReduce pass is needed.
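For readers who want a rough feel for per-pattern construction cost, the sketch below times the public Regex constructor. It is only an approximation under stated assumptions: the ConstructionTimer class and its method are hypothetical names, and the public constructor does more work than the parse and optimization phases instrumented above, so its numbers are not comparable to the tables in this comment.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class ConstructionTimer
{
    // Hypothetical helper: mean microseconds spent constructing one Regex,
    // averaged over every pattern and every round.
    public static double MeanMicrosPerPattern(string[] patterns, RegexOptions options, int rounds)
    {
        // Warm-up pass so JIT compilation of the regex library is not measured.
        foreach (string p in patterns)
        {
            _ = new Regex(p, options);
        }

        var sw = Stopwatch.StartNew();
        for (int r = 0; r < rounds; r++)
        {
            foreach (string p in patterns)
            {
                // The constructor parses and optimizes the pattern each time;
                // unlike the static Regex methods, it does not consult the regex cache.
                _ = new Regex(p, options);
            }
        }
        sw.Stop();

        double totalMicros = sw.Elapsed.TotalMilliseconds * 1000.0;
        return totalMicros / ((double)rounds * patterns.Length);
    }
}
```

Calling ConstructionTimer.MeanMicrosPerPattern(patterns, RegexOptions.None, rounds) on a Release build with a large pattern set and many rounds gives a steadier average; small inputs mostly measure noise.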
See benchmark results at https://gist.github.com/MihuBot/0deee742dd4364d8c006182180915f68
Almost all at parity, a few that are likely noise. Let me run again to be certain. We don't expect any of these to change in a measurable way. It's a small proportion of real world patterns that will benefit as discussed above.
@MihuBot benchmark Regex
See benchmark results at https://gist.github.com/MihuBot/0bd6ddbc404527a5af0885570b295f74
Copilot; Summary of MihuBot perf runs for #125289
Two independent MihuBot runs show a consistent pattern:
Conclusion:
There is a consistent signal in setup-related paths:
These align with the PR behavior:
Can we restrict the final reduce to only apply to compiler/source generator? |
FinalReduce is only beneficial for Compiled and source-generated regexes, where the one-time construction cost is amortized over many matches and simpler trees produce better generated code. For interpreted regexes, the construction overhead is not worth it.
The source generator already sets RegexOptions.Compiled when parsing (RegexGenerator.cs), so checking for Compiled covers both cases. NonBacktracking was already excluded by the existing FinalOptimize guard.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
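The gating described in this commit message can be sketched as a small predicate over RegexOptions. This is an illustration rather than the actual check in RegexNode.cs: the FinalReduceGate class and ShouldRunFinalReduce method are hypothetical, and the RightToLeft/NonBacktracking exclusion is an assumed shape for the existing FinalOptimize guard, inferred from the discussion above.

```csharp
using System.Text.RegularExpressions;

public static class FinalReduceGate
{
    // Hypothetical helper: true when the extra reduction pass is worth its
    // one-time construction cost.
    public static bool ShouldRunFinalReduce(RegexOptions options)
    {
        // Assumed shape of the pre-existing FinalOptimize guard.
        const RegexOptions excluded = RegexOptions.RightToLeft | RegexOptions.NonBacktracking;
        if ((options & excluded) != 0)
        {
            return false;
        }

        // The source generator also parses with Compiled set, so this single
        // flag covers both the compiler and the source generator.
        return (options & RegexOptions.Compiled) != 0;
    }
}
```

With this predicate, RegexOptions.Compiled passes the gate while RegexOptions.None (the interpreter) does not, matching the intent that interpreted regexes skip the extra pass.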
@MihuBot benchmark Regex
@MihuBot regexdiff
Early-return when ChildCount is 0 to avoid the stack check overhead on leaf nodes where no recursion will occur.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
214 out of 18857 patterns have generated source code changes.
Examples of GeneratedRegex source diffs
"\\b(prima(\\s+(di|del(l['aoei])?|degli|dei)) ..." (212 uses)
[GeneratedRegex("\\b(prima(\\s+(di|del(l['aoei])?|degli|dei))?|entro\\s*(l['aoe]|il?|gli|i)?|(non\\s+dopo\\s+(il?|l[oae']|gli)|non\\s+più\\s+tardi\\s+(di|del(l['aoei])?|degli|dei)|termina(no)?(\\s+con)?(\\s+(il?|l[oae']|gli))?|precedente\\s+a((l(l['aoe])?)|gli|i)?|fino\\s+a((l(l['aoe])?)|gli|i)?))\\b", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]
/// ○ Match 'i'.<br/>
/// ○ Match a sequence of expressions.<br/>
/// ○ Match the string "non".<br/>
+ /// ○ Match a whitespace character atomically at least once.<br/>
/// ○ Match with 2 alternative expressions.<br/>
/// ○ Match a sequence of expressions.<br/>
- /// ○ Match a whitespace character atomically at least once.<br/>
/// ○ Match the string "dopo".<br/>
/// ○ Match a whitespace character atomically at least once.<br/>
/// ○ Match with 3 alternative expressions.<br/>
/// ○ Match a character in the set ['aeo].<br/>
/// ○ Match the string "gli".<br/>
/// ○ Match a sequence of expressions.<br/>
- /// ○ Match a whitespace character atomically at least once.<br/>
/// ○ Match the string "più".<br/>
/// ○ Match a whitespace character atomically at least once.<br/>
/// ○ Match the string "tardi".<br/>
goto AlternationBranch8;
}
+ // Match a whitespace character atomically at least once.
+ {
+ pos += 3;
+ slice = inputSpan.Slice(pos);
+ int iteration2 = 0;
+ while ((uint)iteration2 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration2]))
+ {
+ iteration2++;
+ }
+
+ if (iteration2 == 0)
+ {
+ goto AlternationBranch8;
+ }
+
+ slice = slice.Slice(iteration2);
+ pos += iteration2;
+ }
+
// Match with 2 alternative expressions.
//{
alternation_starting_pos4 = pos;
// Branch 0
//{
- // Match a whitespace character atomically at least once.
- {
- pos += 3;
- slice = inputSpan.Slice(pos);
- int iteration2 = 0;
- while ((uint)iteration2 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration2]))
- {
- iteration2++;
- }
-
- if (iteration2 == 0)
- {
- goto AlternationBranch9;
- }
-
- slice = slice.Slice(iteration2);
- pos += iteration2;
- }
-
// Match the string "dopo".
if (!slice.StartsWith("dopo"))
{
// Branch 1
//{
+ // Match the string "più".
+ if (!slice.StartsWith("più"))
+ {
+ goto AlternationBranch8;
+ }
+
// Match a whitespace character atomically at least once.
{
pos += 3;
pos += iteration4;
}
- // Match the string "più".
- if (!slice.StartsWith("più"))
+ // Match the string "tardi".
+ if (!slice.StartsWith("tardi"))
{
goto AlternationBranch8;
}
// Match a whitespace character atomically at least once.
{
- pos += 3;
+ pos += 5;
slice = inputSpan.Slice(pos);
int iteration5 = 0;
while ((uint)iteration5 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration5]))
pos += iteration5;
}
- // Match the string "tardi".
- if (!slice.StartsWith("tardi"))
- {
- goto AlternationBranch8;
- }
-
- // Match a whitespace character atomically at least once.
- {
- pos += 5;
- slice = inputSpan.Slice(pos);
- int iteration6 = 0;
- while ((uint)iteration6 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration6]))
- {
- iteration6++;
- }
-
- if (iteration6 == 0)
- {
- goto AlternationBranch8;
- }
-
- slice = slice.Slice(iteration6);
- pos += iteration6;
- }
-
// Match 'd'.
if (slice.IsEmpty || slice[0] != 'd')
{
// Match a whitespace character atomically at least once.
{
- int iteration7 = 0;
- while ((uint)iteration7 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration7]))
+ int iteration6 = 0;
+ while ((uint)iteration6 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration6]))
{
- iteration7++;
+ iteration6++;
}
- if (iteration7 == 0)
+ if (iteration6 == 0)
{
goto LoopIterationNoMatch5;
}
- slice = slice.Slice(iteration7);
- pos += iteration7;
+ slice = slice.Slice(iteration6);
+ pos += iteration6;
}
// Match the string "con".
// Match a whitespace character atomically at least once.
{
- int iteration8 = 0;
- while ((uint)iteration8 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration8]))
+ int iteration7 = 0;
+ while ((uint)iteration7 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration7]))
{
- iteration8++;
+ iteration7++;
}
- if (iteration8 == 0)
+ if (iteration7 == 0)
{
goto LoopIterationNoMatch6;
}
- slice = slice.Slice(iteration8);
- pos += iteration8;
+ slice = slice.Slice(iteration7);
+ pos += iteration7;
}
// Match with 3 alternative expressions.
{
pos += 10;
slice = inputSpan.Slice(pos);
- int iteration9 = 0;
- while ((uint)iteration9 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration9]))
+ int iteration8 = 0;
+ while ((uint)iteration8 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration8]))
{
- iteration9++;
+ iteration8++;
}
- if (iteration9 == 0)
+ if (iteration8 == 0)
{
goto AlternationBranch18;
}
- slice = slice.Slice(iteration9);
- pos += iteration9;
+ slice = slice.Slice(iteration8);
+ pos += iteration8;
}
// Match 'a'.
{
pos += 4;
slice = inputSpan.Slice(pos);
- int iteration10 = 0;
- while ((uint)iteration10 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration10]))
+ int iteration9 = 0;
+ while ((uint)iteration9 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration9]))
{
- iteration10++;
+ iteration9++;
}
- if (iteration10 == 0)
+ if (iteration9 == 0)
{
return false; // The input didn't match.
}
- slice = slice.Slice(iteration10);
- pos += iteration10;
+ slice = slice.Slice(iteration9);
+ pos += iteration9;
}
// Match 'a'.
"\\b(1\\s*:\\s*1)|(one (on )?one|one\\s*-\\s* ..." (182 uses)
[GeneratedRegex("\\b(1\\s*:\\s*1)|(one (on )?one|one\\s*-\\s*one|one\\s*:\\s*one)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)]
/// ○ Match a character in the set [Oo].<br/>
/// ○ Match a character in the set [Nn].<br/>
/// ○ Match a character in the set [Ee].<br/>
- /// ○ Match with 3 alternative expressions.<br/>
+ /// ○ Match with 2 alternative expressions.<br/>
/// ○ Match a sequence of expressions.<br/>
/// ○ Match ' '.<br/>
/// ○ Optional (greedy).<br/>
/// ○ Match a character in the set [Ee].<br/>
/// ○ Match a sequence of expressions.<br/>
/// ○ Match a whitespace character atomically any number of times.<br/>
- /// ○ Match '-'.<br/>
- /// ○ Match a whitespace character atomically any number of times.<br/>
- /// ○ Match a character in the set [Oo].<br/>
- /// ○ Match a character in the set [Nn].<br/>
- /// ○ Match a character in the set [Ee].<br/>
- /// ○ Match a sequence of expressions.<br/>
- /// ○ Match a whitespace character atomically any number of times.<br/>
- /// ○ Match ':'.<br/>
- /// ○ Match a whitespace character atomically any number of times.<br/>
- /// ○ Match a character in the set [Oo].<br/>
- /// ○ Match a character in the set [Nn].<br/>
- /// ○ Match a character in the set [Ee].<br/>
+ /// ○ Match with 2 alternative expressions.<br/>
+ /// ○ Match a sequence of expressions.<br/>
+ /// ○ Match '-'.<br/>
+ /// ○ Match a whitespace character atomically any number of times.<br/>
+ /// ○ Match a character in the set [Oo].<br/>
+ /// ○ Match a character in the set [Nn].<br/>
+ /// ○ Match a character in the set [Ee].<br/>
+ /// ○ Match a sequence of expressions.<br/>
+ /// ○ Match ':'.<br/>
+ /// ○ Match a whitespace character atomically any number of times.<br/>
+ /// ○ Match a character in the set [Oo].<br/>
+ /// ○ Match a character in the set [Nn].<br/>
+ /// ○ Match a character in the set [Ee].<br/>
/// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
return false; // The input didn't match.
}
- // Match with 3 alternative expressions.
+ // Match with 2 alternative expressions.
//{
alternation_starting_pos1 = pos;
alternation_starting_capturepos1 = base.Crawlpos();
pos += iteration2;
}
- // Match '-'.
- if (slice.IsEmpty || slice[0] != '-')
- {
- goto AlternationBranch2;
- }
-
- // Match a whitespace character atomically any number of times.
- {
- int iteration3 = 1;
- while ((uint)iteration3 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration3]))
+ // Match with 2 alternative expressions.
+ //{
+ if (slice.IsEmpty)
{
- iteration3++;
+ UncaptureUntil(0);
+ return false; // The input didn't match.
}
- slice = slice.Slice(iteration3);
- pos += iteration3;
- }
-
- if ((uint)slice.Length < 3 ||
- !slice.StartsWith("one", StringComparison.OrdinalIgnoreCase)) // Match the string "one" (ordinal case-insensitive)
- {
- goto AlternationBranch2;
- }
+ switch (slice[0])
+ {
+ case '-':
+
+ // Match a whitespace character atomically any number of times.
+ {
+ int iteration3 = 1;
+ while ((uint)iteration3 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration3]))
+ {
+ iteration3++;
+ }
+
+ slice = slice.Slice(iteration3);
+ pos += iteration3;
+ }
+
+ if ((uint)slice.Length < 3 ||
+ !slice.StartsWith("one", StringComparison.OrdinalIgnoreCase)) // Match the string "one" (ordinal case-insensitive)
+ {
+ UncaptureUntil(0);
+ return false; // The input didn't match.
+ }
+
+ pos += 3;
+ slice = inputSpan.Slice(pos);
+ break;
+
+ case ':':
+
+ // Match a whitespace character atomically any number of times.
+ {
+ int iteration4 = 1;
+ while ((uint)iteration4 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration4]))
+ {
+ iteration4++;
+ }
+
+ slice = slice.Slice(iteration4);
+ pos += iteration4;
+ }
+
+ if ((uint)slice.Length < 3 ||
+ !slice.StartsWith("one", StringComparison.OrdinalIgnoreCase)) // Match the string "one" (ordinal case-insensitive)
+ {
+ UncaptureUntil(0);
+ return false; // The input didn't match.
+ }
+
+ pos += 3;
+ slice = inputSpan.Slice(pos);
+ break;
+
+ default:
+ UncaptureUntil(0);
+ return false; // The input didn't match.
+ }
+ //}
alternation_branch = 1;
- pos += 3;
- slice = inputSpan.Slice(pos);
- goto AlternationMatch1;
-
- AlternationBranch2:
- pos = alternation_starting_pos1;
- slice = inputSpan.Slice(pos);
- UncaptureUntil(alternation_starting_capturepos1);
- //}
-
- // Branch 2
- //{
- // Match a whitespace character atomically any number of times.
- {
- int iteration4 = 3;
- while ((uint)iteration4 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration4]))
- {
- iteration4++;
- }
-
- slice = slice.Slice(iteration4);
- pos += iteration4;
- }
-
- // Match ':'.
- if (slice.IsEmpty || slice[0] != ':')
- {
- UncaptureUntil(0);
- return false; // The input didn't match.
- }
-
- // Match a whitespace character atomically any number of times.
- {
- int iteration5 = 1;
- while ((uint)iteration5 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration5]))
- {
- iteration5++;
- }
-
- slice = slice.Slice(iteration5);
- pos += iteration5;
- }
-
- if ((uint)slice.Length < 3 ||
- !slice.StartsWith("one", StringComparison.OrdinalIgnoreCase)) // Match the string "one" (ordinal case-insensitive)
- {
- UncaptureUntil(0);
- return false; // The input didn't match.
- }
-
- alternation_branch = 2;
- pos += 3;
- slice = inputSpan.Slice(pos);
goto AlternationMatch1;
//}
case 0:
goto LoopIterationNoMatch;
case 1:
- goto AlternationBranch2;
- case 2:
UncaptureUntil(0);
return false; // The input didn't match.
}
"(?<till>zu|bis\\s*zum|zum|bis|bis\\s*hin(\\s ..." (136 uses)
[GeneratedRegex("(?<till>zu|bis\\s*zum|zum|bis|bis\\s*hin(\\s*zum)?|--|-|—|——)", RegexOptions.IgnoreCase | RegexOptions.Singleline)]
/// ○ Match a character in the set [Bb].<br/>
/// ○ Match a character in the set [Ii].<br/>
/// ○ Match a character in the set [Ss].<br/>
- /// ○ Match an empty string.<br/>
/// ○ Match the string "--".<br/>
/// ○ Match a character in the set [\-\u2014].<br/>
/// ○ Match the string "——".<br/>
goto AlternationBranch3;
}
-
pos += 3;
slice = inputSpan.Slice(pos);
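goto AlternationMatch;
For more diff examples, see https://gist.github.com/MihuBot/08ae4323ca7f212b9eeeaea972deb8fd
Sample source code for further analysis
// Beyond the SDK's implicit usings, the snippet needs these namespaces.
using System.IO.Compression;
using System.Text.Json;
using System.Text.RegularExpressions;
const string JsonPath = "RegexResults-1826.json";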
if (!File.Exists(JsonPath))
{
await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/FJerkLNA");
using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}
using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");
record KnownPattern(string Pattern, RegexOptions Options, int Count);
sealed class RegexEntry
{
public required KnownPattern Regex { get; set; }
public required string MainSource { get; set; }
public required string PrSource { get; set; }
public string? FullDiff { get; set; }
public string? ShortDiff { get; set; }
public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}
See benchmark results at https://gist.github.com/MihuBot/20d2370165c397728eb16ec6231da139
Perf analysis: latest MihuBot run vs. earlier runs
The construction/cache regressions flagged in my earlier analysis are resolved after the last push.
A few noisy outliers exist but aren't concerning:
No remaining regressions of significance.
/ba-g mac issues. sufficient validation
Fixes #66031
Human/copilot collaboration to understand whether the regex optimizer passes can be improved, using the real world patterns corpus as a test bed.
Investigation
We analyzed the issue by running experiments against 18,931 real-world regex patterns extracted from NuGet packages (dotnet/runtime-assets corpus):
Fixed-point convergence: Re-ran Reduce() in a loop until the tree stabilized. 221 of 18,931 patterns (1.2%) benefit from a second round. All converge in exactly 2 rounds -- zero oscillation, zero regressions.
Pass ordering sensitivity: Tried all permutations of the FinalOptimize passes. 0 patterns where ordering matters (beyond the FinalReduce placement).
Minimal fix analysis: Compared re-running just Reduce() vs. re-running the full FinalOptimize passes. Re-reduce alone captures 100% of the improvements. Re-running FindAndMakeLoopsAtomic + EliminateEndingBacktracking adds nothing.
Problem
The regex optimizer runs in two phases: per-node Reduce() calls during parsing, then three global FinalOptimize() passes after parsing (FindAndMakeLoopsAtomic, EliminateEndingBacktracking, UpdateBumpalong). The global passes create new tree structures (Atomic wrappers, restructured alternations) that are themselves eligible for further reduction -- but Reduce() never re-runs on them. This leaves optimization opportunities on the table.
Change
Add a single FinalReduce() call at the end of FinalOptimize(), after EliminateEndingBacktracking and before UpdateBumpalong. It walks the tree bottom-up and re-calls Reduce() on each node, replacing any node that simplifies. This is a 15-line private method with a StackHelper.TryEnsureSufficientExecutionStack() guard.
UpdateBumpalong was also moved to run after FinalReduce, since FinalReduce can restructure alternations into concatenations with a leading loop that UpdateBumpalong needs to see.
Alternatives rejected
Distinct improvement categories in real world corpus
Analysis of all 231 changed pattern trees (221 unique patterns, some with multiple option variants) across 35 distinct structural signatures identified ~4 distinct improvement variants. Each entry gives the pattern, then its before-fix equivalent -> after-fix equivalent:
- a|ab: a(?:) -> a
- \n|\n\r|\r\n: (?>\n(?:)|\r\n) -> (?>\n|\r\n)
- [ab]+c[ab]+|[ab]+: (?>[ab]+c[ab]+|[ab]+) -> (?>(?>[ab]+)(?:c(?>[ab]+))?)
- ab|a|ac: a(?>b?) -> ab?
- ab|a|ac|d: (?>a(?>b?)|d) -> (?>ab?|d)
- a?b|a??b: (?>a?(?>[b])) -> (?>a?(?>b)) ([b] simplified to b)
- [ab]?c|[ab]??c: (?>[ab]?(?>[c])) -> (?>[ab]?(?>c))
These are captured in new reduction tests. All 7 tests are verified to fail without the FinalReduce change and pass with it.
Performance
PatternsReduceIdentically tests continue to pass unchanged.