Improve regex optimizer through investigation of regex optimizer passes by danmoseley · Pull Request #125289 · dotnet/runtime

danmoseley · 2026-03-07T08:30:23Z

Human/copilot collaboration to understand whether the regex optimizer passes can be improved, using the real world patterns corpus as a test bed.

Investigation

We analyzed the issue by running experiments against 18,931 real-world regex patterns extracted from NuGet packages (dotnet/runtime-assets corpus):

Fixed-point convergence: Re-ran Reduce() in a loop until the tree stabilized. 221 of 18,931 patterns (1.2%) benefit from a second round. All converge in exactly 2 rounds -- zero oscillation, zero regressions.
Pass ordering sensitivity: Tried all permutations of the FinalOptimize passes. 0 patterns where ordering matters (beyond the FinalReduce placement).
Minimal fix analysis: Compared re-running just Reduce() vs. re-running the full FinalOptimize passes. Re-reduce alone captures 100% of the improvements. Re-running FindAndMakeLoopsAtomic + EliminateEndingBacktracking adds nothing.

Problem

The regex optimizer runs in two phases: per-node Reduce() calls during parsing, then three global FinalOptimize() passes after parsing (FindAndMakeLoopsAtomic, EliminateEndingBacktracking, UpdateBumpalong). The global passes create new tree structures (Atomic wrappers, restructured alternations) that are themselves eligible for further reduction -- but Reduce() never re-runs on them. This leaves optimization opportunities on the table.

Change

Add a single FinalReduce() call at the end of FinalOptimize(), after EliminateEndingBacktracking and before UpdateBumpalong. It walks the tree bottom-up and re-calls Reduce() on each node, replacing any node that simplifies. This is a 15-line private method with a StackHelper.TryEnsureSufficientExecutionStack() guard.

UpdateBumpalong was also moved to run after FinalReduce, since FinalReduce can restructure alternations into concatenations with a leading loop that UpdateBumpalong needs to see.
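
The bottom-up walk described above can be sketched on a toy tree. This is only an illustration under stated assumptions: `ToyKind`, `Node`, and the two rewrite rules in `Reduce()` are hypothetical stand-ins rather than the real `RegexNode` code, and the public `RuntimeHelpers.TryEnsureSufficientExecutionStack()` stands in for the internal `StackHelper` guard.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Text;

// Toy node kinds for illustration only; not the real RegexNodeKind values.
public enum ToyKind { One, Empty, Concat, Atomic }

public sealed class Node
{
    public ToyKind Kind { get; }
    public char Ch { get; }
    public List<Node> Children { get; } = new List<Node>();

    public Node(ToyKind kind, char ch = '\0', params Node[] children)
    {
        Kind = kind;
        Ch = ch;
        Children.AddRange(children);
    }

    // Local, per-node simplification (hypothetical rules modeled on the
    // Empty-in-Concat and redundant-Atomic cases discussed in this PR).
    public Node Reduce()
    {
        switch (Kind)
        {
            case ToyKind.Concat:
                Children.RemoveAll(c => c.Kind == ToyKind.Empty);
                if (Children.Count == 0) return new Node(ToyKind.Empty);
                return Children.Count == 1 ? Children[0] : this;

            case ToyKind.Atomic:
                // A single character or Empty cannot backtrack, so the wrapper is redundant.
                Node child = Children[0];
                return (child.Kind == ToyKind.One || child.Kind == ToyKind.Empty) ? child : this;

            default:
                return this;
        }
    }

    // Re-reduce children first, then this node, so parents see simplified children.
    public Node FinalReduce()
    {
        if (Children.Count == 0) return this; // leaf: nothing to recurse into
        if (!RuntimeHelpers.TryEnsureSufficientExecutionStack()) return this; // bail out on deep trees

        for (int i = 0; i < Children.Count; i++)
        {
            Children[i] = Children[i].FinalReduce();
        }
        return Reduce();
    }

    public override string ToString()
    {
        switch (Kind)
        {
            case ToyKind.One: return Ch.ToString();
            case ToyKind.Empty: return "(?:)";
            case ToyKind.Atomic: return "(?>" + Children[0] + ")";
            default:
                var sb = new StringBuilder();
                foreach (Node c in Children) sb.Append(c.ToString());
                return sb.ToString();
        }
    }
}

public static class FinalReduceDemo
{
    public static void Main()
    {
        // Shape left behind by a (pretend) global pass: a(?>b(?:))
        var tree = new Node(ToyKind.Concat, '\0',
            new Node(ToyKind.One, 'a'),
            new Node(ToyKind.Atomic, '\0',
                new Node(ToyKind.Concat, '\0', new Node(ToyKind.One, 'b'), new Node(ToyKind.Empty))));

        Console.WriteLine(tree);               // a(?>b(?:))
        Console.WriteLine(tree.FinalReduce()); // ab
    }
}
```

Because children are reduced before their parent, the inner `Concat` collapses to `b` first, which is what then lets the `Atomic` wrapper be recognized as redundant in the same walk.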

Alternatives rejected

Alternative	Why discounted
Full fixed-point loop (re-run all passes until convergence)	Unnecessary -- single re-reduce pass captures 100% of improvements; all patterns converge in 1 extra round
Reorder existing passes instead of adding a new one	Pass ordering has zero effect on the 18,931 patterns
Fix individual Reduce methods to handle FinalOptimize-created structures	Would require changes across multiple Reduce methods; fragile and wouldn't catch future cases
Do nothing	221 patterns get suboptimal trees; the fix is small and safe

Distinct improvement categories in real world corpus

Analysis of all 231 changed pattern trees (221 unique patterns, some with multiple option variants) across 35 distinct structural signatures identified ~4 distinct improvement variants:

#	Test pattern	Reduces to before fix	Reduces to after fix	Mechanism
1	`a|ab`	`a(?:)`	`a`	Empty-in-Concat: prefix extraction leaves trailing Empty; re-reduce strips it
2	`\n|\n\r|\r\n`	`(?>\n(?:)|\r\n)`	`(?>\n|\r\n)`	Empty-in-Alternate: shared prefix creates Empty branch; re-reduce collapses
3	`[ab]+c[ab]+|[ab]+`	`(?>[ab]+c[ab]+|[ab]+)`	`(?>(?>[ab]+)(?:c(?>[ab]+))?)`	Prefix extraction + Alternate-to-Loop: set loop prefix not extracted until re-reduce
4	`ab|a|ac`	`a(?>b?)`	`ab?`	Redundant Atomic removal: Atomic wrapping non-backtracking child is stripped
5	`ab|a|ac|d`	`(?>a(?>b?)|d)`	`(?>ab?|d)`	Same Atomic removal, within a larger Alternate
6	`a?b|a??b`	`(?>a?(?>[b]))`	`(?>a?(?>b))`	Set-to-One: greedy/lazy branches merge after atomic promotion; single-char `[b]` simplified to `b`
7	`[ab]?c|[ab]??c`	`(?>[ab]?(?>[c]))`	`(?>[ab]?(?>c))`	Same Set-to-One with set loop prefix

These are captured in new reduction tests. All 7 tests are verified to fail without the FinalReduce change and pass with it.
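
As a sanity check from outside the engine, the test patterns can be compared against their "after fix" equivalents using only the public `Regex` API. This is a rough sketch, not the PR's test code: it checks observable match results on a handful of inputs chosen here for illustration, and says nothing about the internal tree shape, which is not publicly exposed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EquivalenceCheck
{
    // (test pattern, "after fix" equivalent) pairs taken from rows 1, 2, 4 and 6 of the table.
    public static readonly (string Original, string Simplified)[] Pairs =
    {
        ("a|ab", "a"),
        (@"\n|\n\r|\r\n", @"(?>\n|\r\n)"),
        ("ab|a|ac", "ab?"),
        ("a?b|a??b", "(?>a?(?>b))"),
    };

    // Sample inputs picked for this sketch; they are not from the corpus.
    public static readonly string[] Inputs =
    {
        "", "a", "ab", "ac", "b", "aab", "xabc", "\n", "\n\r", "\r\n", "\r",
    };

    // True when both patterns agree on success, position and matched text for every input.
    public static bool SameMatches(string original, string simplified)
    {
        foreach (string input in Inputs)
        {
            Match m1 = Regex.Match(input, original);
            Match m2 = Regex.Match(input, simplified);
            if (m1.Success != m2.Success || m1.Index != m2.Index || m1.Value != m2.Value)
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        foreach (var (original, simplified) in Pairs)
        {
            Console.WriteLine($"{original} vs {simplified}: {SameMatches(original, simplified)}");
        }
    }
}
```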

Performance

Parse-time cost is negligible: Measured at 0.3% of total parse time, within noise, across 18,931 patterns.
Zero tree regressions: All 18,931 patterns produce trees that are either identical (18,710) or strictly simpler (221). No pattern produces a worse tree.
All existing tests pass: The 424 pre-existing PatternsReduceIdentically tests continue to pass unchanged.
Convergence is immediate: Every pattern stabilizes in at most 1 extra round, so there's no risk of expensive iteration.
The improvements are structurally provable: Fewer nodes (Empty removal, Concat unwrapping), simpler node types (Set to One), and eliminated redundant wrappers (Atomic around non-backtracking children) all reduce work at match time.
Existing microbenchmarks unaffected: None of the 38 patterns from the dotnet/performance regex benchmarks are affected by this change; their patterns are not complex enough.

After FinalOptimize's EliminateEndingBacktracking and FindAndMakeLoopsAtomic create new tree structures (Atomic wrappers, restructured alternations), walk the tree bottom-up and re-call Reduce() on each node. This cleans up patterns like Concat(X, Empty) and redundant Atomic(Oneloopatomic), and enables further prefix extraction, improving 221 of 18,931 real-world NuGet patterns (1.2%), all converging in a single extra round.

Also moves UpdateBumpalong after ReReduceTree so it operates on the final tree structure.

Fixes dotnet#66031

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR enhances System.Text.RegularExpressions’ post-parse optimization pipeline by adding a final “re-reduce” cleanup pass after FinalOptimize transformations, ensuring any newly introduced structures are simplified by running Reduce() again.

Changes:

Add a ReReduceTree() traversal at the end of RegexNode.FinalOptimize() (after EliminateEndingBacktracking and before UpdateBumpalong).
Move UpdateBumpalong to run after the re-reduction so it sees the final reshaped tree.
Add focused unit tests covering patterns that only fully optimize with the new re-reduce pass.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs`	Adds `ReReduceTree()` and updates `FinalOptimize()` pass ordering so reductions are re-applied after global rewrites.
`src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexReductionTests.cs`	Adds regression cases asserting the re-reduced trees match the expected simplified forms.

- Row 3: Remove spurious capture group from before-fix equivalent: (?>([ab]+c[ab]+|[ab]+)) -> (?>[ab]+c[ab]+|[ab]+)
- Row 5: Add missing 'a' prefix in before-fix equivalent: (?>(?>b?)|d) -> (?>a(?>b?)|d)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs

ReplaceChild already handles Reduce + re-parenting, so delegate to it instead of duplicating that logic. Also avoids a double-Reduce that occurred when the manual code passed the reduced node to ReplaceChild.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Rename ReReduceTree to FinalReduce: a better name that pairs with FinalOptimize, which calls it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

stephentoub · 2026-03-07T11:14:48Z

Thanks. Could you explore variations, in particular for overall parse time, like not doing any reduction (other than maybe removing Group) until the end, and then reducing until it stabilizes?

danmoseley · 2026-03-07T16:44:47Z

I explored several variations of deferring reduction to the end. Here's a summary of the experiments and findings.

Setup: 15,817 unique regex patterns from a real-world JSON corpus, Release build, DOTNET_TieredCompilation=0, 15 iterations, 5-round warmup. All timings are average per-pattern cost in microseconds (mean +/- 1 stddev across iterations).
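
A harness along these lines could reproduce the shape of that measurement. It is a sketch under assumptions, not the script used for the numbers below: it times the public `new Regex(pattern)` constructor (which covers more than parsing alone), uses four patterns as a stand-in for the corpus, and expects `DOTNET_TieredCompilation=0` to be set in the environment rather than in code. The type and method names are made up for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

public static class ParseTiming
{
    // Returns mean and sample standard deviation of per-pattern construction cost, in microseconds.
    public static (double Mean, double StdDev) Measure(string[] patterns, int warmup = 5, int iterations = 15)
    {
        var samples = new double[iterations];

        // Negative rounds are warmup and are not recorded.
        for (int round = -warmup; round < iterations; round++)
        {
            long start = Stopwatch.GetTimestamp();
            foreach (string pattern in patterns)
            {
                // Instance construction does not go through the static Regex cache,
                // so every iteration re-parses and re-optimizes the pattern.
                GC.KeepAlive(new Regex(pattern));
            }
            long elapsed = Stopwatch.GetTimestamp() - start;

            if (round >= 0)
            {
                samples[round] = elapsed * 1_000_000.0 / Stopwatch.Frequency / patterns.Length;
            }
        }

        double mean = samples.Average();
        double variance = samples.Sum(s => (s - mean) * (s - mean)) / (iterations - 1);
        return (mean, Math.Sqrt(variance));
    }

    public static void Main()
    {
        // Stand-in patterns; the real experiment used the 15,817-pattern corpus.
        string[] patterns = { "a|ab", @"\n|\n\r|\r\n", "[ab]+c[ab]+|[ab]+", "ab|a|ac" };
        var (mean, stdDev) = Measure(patterns);
        Console.WriteLine($"{mean:F2} +/- {stdDev:F2} us per pattern");
    }
}
```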

Experiment 1: Baselines

Measured the cost of FinalReduce as implemented in this PR (reduce during parse + one FinalReduce pass at end):

Variant	Per-pattern (μs)
Without `FinalReduce` (upstream behavior)	3.51 +/- 0.06
With `FinalReduce` (this PR)	4.08 +/- 0.07
`FinalReduce` overhead	+0.57 μs (16%)

Experiment 2: Phase profiling (this PR)

Instrumented each phase to understand where time is spent:

Phase	Per-pattern (μs)	% of total
Parse + `Reduce` during parse	2.7	69%
`FindAndMakeLoopsAtomic`	0.5	13%
`EliminateEndingBacktracking`	0.2	4%
`FinalReduce`	0.6	14%
Total	3.9

Experiment 3: Naive deferral -- `ReduceMinimal` in `AddChild`

Replaced Reduce() in AddChild/InsertChild with a minimal reducer (Group unwrap + 0/1-child Concatenation/Alternation unwrap + IgnoreCase strip), keeping full Reduce only in ReplaceChild (used by FinalReduce).

Result: 183/431 PatternsReduceIdentically tests fail. The optimization passes (FindAndMakeLoopsAtomic, EliminateEndingBacktracking) produce different results when operating on unreduced trees.

Adding a FinalReduce call before the optimization passes reduced failures to 23. The remaining 23 fail because reduction methods like ReduceAlternation internally create new nodes via AddChild and depend on those nodes being fully reduced. So swapping Reduce for ReduceMinimal in AddChild breaks the reducers themselves.

Experiment 4: Parser-level deferral -- `AddChildMinimal` for parser only

Created a separate AddChildMinimal method (calls ReduceMinimal) and changed all 13 AddChild calls in RegexParser to use it. AddChild itself still calls full Reduce(), so reduction methods work correctly. FinalReduce runs both before and after the optimization passes. Also moved the initial FinalReduce outside the RTL/NonBacktracking guard since those patterns also need reduction.

Code: danmoseley@75612fc (diff vs this PR)

Result: All 1,043 tests pass, but slower:

Variant	Per-pattern (μs)
This PR (reduce during parse + FinalReduce)	4.08 +/- 0.07
Deferred (ReduceMinimal during parse + 2x FinalReduce)	4.29 +/- 0.07
Difference	+0.21 μs (5%, t=8.2, p << 0.01)
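
The reported t value is consistent with a two-sample Welch t-statistic computed from the table, assuming n = 15 samples per variant (the 15 iterations from the setup). The thread does not say which test was actually used, so treat this as a plausibility check only:

```latex
\[
t = \frac{\bar{x}_{\text{deferred}} - \bar{x}_{\text{PR}}}
         {\sqrt{\dfrac{s_{\text{deferred}}^2}{n} + \dfrac{s_{\text{PR}}^2}{n}}}
  = \frac{4.29 - 4.08}{\sqrt{\dfrac{0.07^2}{15} + \dfrac{0.07^2}{15}}}
  \approx \frac{0.21}{0.0256}
  \approx 8.2,
\qquad
\frac{0.21}{4.08} \approx 5\%.
\]
```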

Deferred phase breakdown:

Phase	Per-pattern (μs)	% of total
Parse + `ReduceMinimal`	2.3	54%
Pre-optimization `FinalReduce`	0.8	18%
`FindAndMakeLoopsAtomic`	0.5	12%
`EliminateEndingBacktracking`	0.2	4%
Post-optimization `FinalReduce`	0.5	12%
Total	4.2

Deferring saves ~0.4 μs/pattern during parse but adds 0.8 μs for the pre-optimization FinalReduce tree walk.

Conclusions

The deferred approach is both more complex and slower:

More code: Needs a new AddChildMinimal method, changes to 13 parser call sites, FinalReduce moved outside the RTL/NonBacktracking guard, and two full tree-walk reduction passes instead of one.
Slower: 4.29 vs 4.08 μs/pattern (5% regression, statistically significant). The parse-time savings from skipping full Reduce are more than offset by the additional tree walk.
Fundamental constraint: Reduction methods (ReduceAlternation, etc.) create new nodes via AddChild and depend on full reduction happening there. You can't simply remove Reduce from AddChild -- you need a two-track system (minimal for parser, full for reducers).

The current PR approach -- reduce during parse + one FinalReduce at end -- appears to be the simplest correct approach. The integrated parse-time reduction is essentially free (no extra tree walk), and only one post-optimization FinalReduce pass is needed.

MihuBot · 2026-03-18T02:23:43Z

See benchmark results at https://gist.github.com/MihuBot/0deee742dd4364d8c006182180915f68

danmoseley · 2026-03-18T03:11:46Z

Almost all at parity, a few that are likely noise. Let me run again to be certain. We don't expect any of these to change in a measurable way. It's a small proportion of real world patterns that will benefit as discussed above.

danmoseley · 2026-03-18T03:11:49Z

@MihuBot benchmark Regex

MihuBot · 2026-03-18T04:50:26Z

See benchmark results at https://gist.github.com/MihuBot/0bd6ddbc404527a5af0885570b295f74

danmoseley · 2026-03-18T05:16:33Z

Copilot's summary:

Summary of MihuBot perf runs for #125289

Two independent MihuBot runs show a consistent pattern:

Steady-state matching: effectively unchanged
• Core regex match benchmarks (e.g., RegexRedux variants and industry patterns) are tightly clustered around 1.00x in both runs.
• Any small deviations tend to flip direction between runs, suggesting noise rather than signal.
• No sustained regressions appear in hot-path matching scenarios.

Conclusion:
No meaningful regression in steady-state regex execution.

⸻

Construction and cache-heavy scenarios: small but real regression

There is a consistent signal in setup-related paths:
• Regex construction (Ctor)
  • ~1.16–1.19x slower across both runs → clear, repeatable
• Cache-sensitive IsMatch scenarios
  • cacheSize 0: ~1.03x → 1.10x
  • cacheSize 15: ~1.05x–1.07x
  • cacheSize 800: ~1.04x
  • Direction is consistent across runs → likely real

These align with the PR behavior:
• Additional FinalReduce() work increases per-pattern processing cost
• Impact shows up when patterns are frequently created or evicted from cache

⸻

Interpretation
• The change shifts cost slightly from match-time → construction-time
• For typical usage (compile once, match many), impact is negligible
• For workloads with:
  • high pattern churn
  • small/disabled cache
  • frequent new Regex instances
  there is a low-single-digit to ~10% overhead

⸻

Bottom line
• Match performance: no meaningful change
• Construction/cache paths: modest regression (expected from added reduction pass)
• Overall: perf-neutral for steady-state workloads, small cost for churn-heavy scenarios

stephentoub · 2026-03-18T11:00:44Z

Can we restrict the final reduce to only apply to compiler/source generator?

FinalReduce is only beneficial for Compiled and source-generated regexes, where the one-time construction cost is amortized over many matches and simpler trees produce better generated code. For interpreted regexes, the construction overhead is not worth it. The source generator already sets RegexOptions.Compiled when parsing (RegexGenerator.cs), so checking for Compiled covers both cases. NonBacktracking was already excluded by the existing FinalOptimize guard.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
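
The gating described in that commit might look roughly like the following. `ShouldFinalReduce` is a hypothetical helper written for illustration; the real check sits inside the internal `RegexNode.FinalOptimize`, and the RightToLeft exclusion here is inferred from the earlier mention of an RTL/NonBacktracking guard rather than quoted from the code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FinalReduceGate
{
    // Hypothetical helper: run the extra reduction pass only when the construction
    // cost is likely to be amortized (Compiled, which the source generator also sets).
    public static bool ShouldFinalReduce(RegexOptions options)
    {
        // Assumed to mirror the existing FinalOptimize guard.
        const RegexOptions Excluded = RegexOptions.RightToLeft | RegexOptions.NonBacktracking;

        return (options & RegexOptions.Compiled) != 0 && (options & Excluded) == 0;
    }

    public static void Main()
    {
        Console.WriteLine(ShouldFinalReduce(RegexOptions.None));
        Console.WriteLine(ShouldFinalReduce(RegexOptions.Compiled | RegexOptions.IgnoreCase));
        Console.WriteLine(ShouldFinalReduce(RegexOptions.Compiled | RegexOptions.RightToLeft));
    }
}
```

Keying on `Compiled` keeps the interpreter's construction path unchanged, which is where the Ctor and low-cache-size regressions showed up in the benchmark runs.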

danmoseley · 2026-03-18T14:47:04Z

@MihuBot benchmark Regex

danmoseley · 2026-03-18T14:47:57Z

@MihuBot regexdiff

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs

src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexReductionTests.cs

Early-return when ChildCount is 0 to avoid the stack check overhead on leaf nodes where no recursion will occur.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

MihuBot · 2026-03-18T15:03:16Z

214 out of 18857 patterns have generated source code changes.

Examples of GeneratedRegex source diffs

"\\b(prima(\\s+(di|del(l['aoei])?|degli|dei)) ..." (212 uses)

[GeneratedRegex("\\b(prima(\\s+(di|del(l['aoei])?|degli|dei))?|entro\\s*(l['aoe]|il?|gli|i)?|(non\\s+dopo\\s+(il?|l[oae']|gli)|non\\s+più\\s+tardi\\s+(di|del(l['aoei])?|degli|dei)|termina(no)?(\\s+con)?(\\s+(il?|l[oae']|gli))?|precedente\\s+a((l(l['aoe])?)|gli|i)?|fino\\s+a((l(l['aoe])?)|gli|i)?))\\b", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]

  ///                 ○ Match 'i'.<br/>
  ///     ○ Match a sequence of expressions.<br/>
  ///         ○ Match the string "non".<br/>
+   ///         ○ Match a whitespace character atomically at least once.<br/>
  ///         ○ Match with 2 alternative expressions.<br/>
  ///             ○ Match a sequence of expressions.<br/>
-   ///                 ○ Match a whitespace character atomically at least once.<br/>
  ///                 ○ Match the string "dopo".<br/>
  ///                 ○ Match a whitespace character atomically at least once.<br/>
  ///                 ○ Match with 3 alternative expressions.<br/>
  ///                         ○ Match a character in the set ['aeo].<br/>
  ///                     ○ Match the string "gli".<br/>
  ///             ○ Match a sequence of expressions.<br/>
-   ///                 ○ Match a whitespace character atomically at least once.<br/>
  ///                 ○ Match the string "più".<br/>
  ///                 ○ Match a whitespace character atomically at least once.<br/>
  ///                 ○ Match the string "tardi".<br/>
                              goto AlternationBranch8;
                          }
                          
+                           // Match a whitespace character atomically at least once.
+                           {
+                               pos += 3;
+                               slice = inputSpan.Slice(pos);
+                               int iteration2 = 0;
+                               while ((uint)iteration2 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration2]))
+                               {
+                                   iteration2++;
+                               }
+                               
+                               if (iteration2 == 0)
+                               {
+                                   goto AlternationBranch8;
+                               }
+                               
+                               slice = slice.Slice(iteration2);
+                               pos += iteration2;
+                           }
+                           
                          // Match with 2 alternative expressions.
                          //{
                              alternation_starting_pos4 = pos;
                              
                              // Branch 0
                              //{
-                                   // Match a whitespace character atomically at least once.
-                                   {
-                                       pos += 3;
-                                       slice = inputSpan.Slice(pos);
-                                       int iteration2 = 0;
-                                       while ((uint)iteration2 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration2]))
-                                       {
-                                           iteration2++;
-                                       }
-                                       
-                                       if (iteration2 == 0)
-                                       {
-                                           goto AlternationBranch9;
-                                       }
-                                       
-                                       slice = slice.Slice(iteration2);
-                                       pos += iteration2;
-                                   }
-                                   
                                  // Match the string "dopo".
                                  if (!slice.StartsWith("dopo"))
                                  {
                              
                              // Branch 1
                              //{
+                                   // Match the string "più".
+                                   if (!slice.StartsWith("più"))
+                                   {
+                                       goto AlternationBranch8;
+                                   }
+                                   
                                  // Match a whitespace character atomically at least once.
                                  {
                                      pos += 3;
                                      pos += iteration4;
                                  }
                                  
-                                   // Match the string "più".
-                                   if (!slice.StartsWith("più"))
+                                   // Match the string "tardi".
+                                   if (!slice.StartsWith("tardi"))
                                  {
                                      goto AlternationBranch8;
                                  }
                                  
                                  // Match a whitespace character atomically at least once.
                                  {
-                                       pos += 3;
+                                       pos += 5;
                                      slice = inputSpan.Slice(pos);
                                      int iteration5 = 0;
                                      while ((uint)iteration5 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration5]))
                                      pos += iteration5;
                                  }
                                  
-                                   // Match the string "tardi".
-                                   if (!slice.StartsWith("tardi"))
-                                   {
-                                       goto AlternationBranch8;
-                                   }
-                                   
-                                   // Match a whitespace character atomically at least once.
-                                   {
-                                       pos += 5;
-                                       slice = inputSpan.Slice(pos);
-                                       int iteration6 = 0;
-                                       while ((uint)iteration6 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration6]))
-                                       {
-                                           iteration6++;
-                                       }
-                                       
-                                       if (iteration6 == 0)
-                                       {
-                                           goto AlternationBranch8;
-                                       }
-                                       
-                                       slice = slice.Slice(iteration6);
-                                       pos += iteration6;
-                                   }
-                                   
                                  // Match 'd'.
                                  if (slice.IsEmpty || slice[0] != 'd')
                                  {
                              
                              // Match a whitespace character atomically at least once.
                              {
-                                   int iteration7 = 0;
-                                   while ((uint)iteration7 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration7]))
+                                   int iteration6 = 0;
+                                   while ((uint)iteration6 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration6]))
                                  {
-                                       iteration7++;
+                                       iteration6++;
                                  }
                                  
-                                   if (iteration7 == 0)
+                                   if (iteration6 == 0)
                                  {
                                      goto LoopIterationNoMatch5;
                                  }
                                  
-                                   slice = slice.Slice(iteration7);
-                                   pos += iteration7;
+                                   slice = slice.Slice(iteration6);
+                                   pos += iteration6;
                              }
                              
                              // Match the string "con".
                              
                              // Match a whitespace character atomically at least once.
                              {
-                                   int iteration8 = 0;
-                                   while ((uint)iteration8 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration8]))
+                                   int iteration7 = 0;
+                                   while ((uint)iteration7 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration7]))
                                  {
-                                       iteration8++;
+                                       iteration7++;
                                  }
                                  
-                                   if (iteration8 == 0)
+                                   if (iteration7 == 0)
                                  {
                                      goto LoopIterationNoMatch6;
                                  }
                                  
-                                   slice = slice.Slice(iteration8);
-                                   pos += iteration8;
+                                   slice = slice.Slice(iteration7);
+                                   pos += iteration7;
                              }
                              
                              // Match with 3 alternative expressions.
                          {
                              pos += 10;
                              slice = inputSpan.Slice(pos);
-                               int iteration9 = 0;
-                               while ((uint)iteration9 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration9]))
+                               int iteration8 = 0;
+                               while ((uint)iteration8 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration8]))
                              {
-                                   iteration9++;
+                                   iteration8++;
                              }
                              
-                               if (iteration9 == 0)
+                               if (iteration8 == 0)
                              {
                                  goto AlternationBranch18;
                              }
                              
-                               slice = slice.Slice(iteration9);
-                               pos += iteration9;
+                               slice = slice.Slice(iteration8);
+                               pos += iteration8;
                          }
                          
                          // Match 'a'.
                          {
                              pos += 4;
                              slice = inputSpan.Slice(pos);
-                               int iteration10 = 0;
-                               while ((uint)iteration10 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration10]))
+                               int iteration9 = 0;
+                               while ((uint)iteration9 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration9]))
                              {
-                                   iteration10++;
+                                   iteration9++;
                              }
                              
-                               if (iteration10 == 0)
+                               if (iteration9 == 0)
                              {
                                  return false; // The input didn't match.
                              }
                              
-                               slice = slice.Slice(iteration10);
-                               pos += iteration10;
+                               slice = slice.Slice(iteration9);
+                               pos += iteration9;
                          }
                          
                          // Match 'a'.

"\\b(1\\s*:\\s*1)|(one (on )?one|one\\s*-\\s* ..." (182 uses)

[GeneratedRegex("\\b(1\\s*:\\s*1)|(one (on )?one|one\\s*-\\s*one|one\\s*:\\s*one)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)]

  ///             ○ Match a character in the set [Oo].<br/>
  ///             ○ Match a character in the set [Nn].<br/>
  ///             ○ Match a character in the set [Ee].<br/>
-   ///             ○ Match with 3 alternative expressions.<br/>
+   ///             ○ Match with 2 alternative expressions.<br/>
  ///                 ○ Match a sequence of expressions.<br/>
  ///                     ○ Match ' '.<br/>
  ///                     ○ Optional (greedy).<br/>
  ///                     ○ Match a character in the set [Ee].<br/>
  ///                 ○ Match a sequence of expressions.<br/>
  ///                     ○ Match a whitespace character atomically any number of times.<br/>
-   ///                     ○ Match '-'.<br/>
-   ///                     ○ Match a whitespace character atomically any number of times.<br/>
-   ///                     ○ Match a character in the set [Oo].<br/>
-   ///                     ○ Match a character in the set [Nn].<br/>
-   ///                     ○ Match a character in the set [Ee].<br/>
-   ///                 ○ Match a sequence of expressions.<br/>
-   ///                     ○ Match a whitespace character atomically any number of times.<br/>
-   ///                     ○ Match ':'.<br/>
-   ///                     ○ Match a whitespace character atomically any number of times.<br/>
-   ///                     ○ Match a character in the set [Oo].<br/>
-   ///                     ○ Match a character in the set [Nn].<br/>
-   ///                     ○ Match a character in the set [Ee].<br/>
+   ///                     ○ Match with 2 alternative expressions.<br/>
+   ///                         ○ Match a sequence of expressions.<br/>
+   ///                             ○ Match '-'.<br/>
+   ///                             ○ Match a whitespace character atomically any number of times.<br/>
+   ///                             ○ Match a character in the set [Oo].<br/>
+   ///                             ○ Match a character in the set [Nn].<br/>
+   ///                             ○ Match a character in the set [Ee].<br/>
+   ///                         ○ Match a sequence of expressions.<br/>
+   ///                             ○ Match ':'.<br/>
+   ///                             ○ Match a whitespace character atomically any number of times.<br/>
+   ///                             ○ Match a character in the set [Oo].<br/>
+   ///                             ○ Match a character in the set [Nn].<br/>
+   ///                             ○ Match a character in the set [Ee].<br/>
  ///         ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
                                      return false; // The input didn't match.
                                  }
                                  
-                                   // Match with 3 alternative expressions.
+                                   // Match with 2 alternative expressions.
                                  //{
                                      alternation_starting_pos1 = pos;
                                      alternation_starting_capturepos1 = base.Crawlpos();
                                              pos += iteration2;
                                          }
                                          
-                                           // Match '-'.
-                                           if (slice.IsEmpty || slice[0] != '-')
-                                           {
-                                               goto AlternationBranch2;
-                                           }
-                                           
-                                           // Match a whitespace character atomically any number of times.
-                                           {
-                                               int iteration3 = 1;
-                                               while ((uint)iteration3 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration3]))
+                                           // Match with 2 alternative expressions.
+                                           //{
+                                               if (slice.IsEmpty)
                                              {
-                                                   iteration3++;
+                                                   UncaptureUntil(0);
+                                                   return false; // The input didn't match.
                                              }
                                              
-                                               slice = slice.Slice(iteration3);
-                                               pos += iteration3;
-                                           }
-                                           
-                                           if ((uint)slice.Length < 3 ||
-                                               !slice.StartsWith("one", StringComparison.OrdinalIgnoreCase)) // Match the string "one" (ordinal case-insensitive)
-                                           {
-                                               goto AlternationBranch2;
-                                           }
+                                               switch (slice[0])
+                                               {
+                                                   case '-':
+                                                       
+                                                       // Match a whitespace character atomically any number of times.
+                                                       {
+                                                           int iteration3 = 1;
+                                                           while ((uint)iteration3 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration3]))
+                                                           {
+                                                               iteration3++;
+                                                           }
+                                                           
+                                                           slice = slice.Slice(iteration3);
+                                                           pos += iteration3;
+                                                       }
+                                                       
+                                                       if ((uint)slice.Length < 3 ||
+                                                           !slice.StartsWith("one", StringComparison.OrdinalIgnoreCase)) // Match the string "one" (ordinal case-insensitive)
+                                                       {
+                                                           UncaptureUntil(0);
+                                                           return false; // The input didn't match.
+                                                       }
+                                                       
+                                                       pos += 3;
+                                                       slice = inputSpan.Slice(pos);
+                                                       break;
+                                                       
+                                                   case ':':
+                                                       
+                                                       // Match a whitespace character atomically any number of times.
+                                                       {
+                                                           int iteration4 = 1;
+                                                           while ((uint)iteration4 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration4]))
+                                                           {
+                                                               iteration4++;
+                                                           }
+                                                           
+                                                           slice = slice.Slice(iteration4);
+                                                           pos += iteration4;
+                                                       }
+                                                       
+                                                       if ((uint)slice.Length < 3 ||
+                                                           !slice.StartsWith("one", StringComparison.OrdinalIgnoreCase)) // Match the string "one" (ordinal case-insensitive)
+                                                       {
+                                                           UncaptureUntil(0);
+                                                           return false; // The input didn't match.
+                                                       }
+                                                       
+                                                       pos += 3;
+                                                       slice = inputSpan.Slice(pos);
+                                                       break;
+                                                       
+                                                   default:
+                                                       UncaptureUntil(0);
+                                                       return false; // The input didn't match.
+                                               }
+                                           //}
                                          
                                          alternation_branch = 1;
-                                           pos += 3;
-                                           slice = inputSpan.Slice(pos);
-                                           goto AlternationMatch1;
-                                           
-                                           AlternationBranch2:
-                                           pos = alternation_starting_pos1;
-                                           slice = inputSpan.Slice(pos);
-                                           UncaptureUntil(alternation_starting_capturepos1);
-                                       //}
-                                       
-                                       // Branch 2
-                                       //{
-                                           // Match a whitespace character atomically any number of times.
-                                           {
-                                               int iteration4 = 3;
-                                               while ((uint)iteration4 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration4]))
-                                               {
-                                                   iteration4++;
-                                               }
-                                               
-                                               slice = slice.Slice(iteration4);
-                                               pos += iteration4;
-                                           }
-                                           
-                                           // Match ':'.
-                                           if (slice.IsEmpty || slice[0] != ':')
-                                           {
-                                               UncaptureUntil(0);
-                                               return false; // The input didn't match.
-                                           }
-                                           
-                                           // Match a whitespace character atomically any number of times.
-                                           {
-                                               int iteration5 = 1;
-                                               while ((uint)iteration5 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration5]))
-                                               {
-                                                   iteration5++;
-                                               }
-                                               
-                                               slice = slice.Slice(iteration5);
-                                               pos += iteration5;
-                                           }
-                                           
-                                           if ((uint)slice.Length < 3 ||
-                                               !slice.StartsWith("one", StringComparison.OrdinalIgnoreCase)) // Match the string "one" (ordinal case-insensitive)
-                                           {
-                                               UncaptureUntil(0);
-                                               return false; // The input didn't match.
-                                           }
-                                           
-                                           alternation_branch = 2;
-                                           pos += 3;
-                                           slice = inputSpan.Slice(pos);
                                          goto AlternationMatch1;
                                      //}
                                      
                                          case 0:
                                              goto LoopIterationNoMatch;
                                          case 1:
-                                               goto AlternationBranch2;
-                                           case 2:
                                              UncaptureUntil(0);
                                              return false; // The input didn't match.
                                      }

"(?<till>zu|bis\\s*zum|zum|bis|bis\\s*hin(\\s ..." (136 uses)

[GeneratedRegex("(?<till>zu|bis\\s*zum|zum|bis|bis\\s*hin(\\s*zum)?|--|-|—|——)", RegexOptions.IgnoreCase | RegexOptions.Singleline)]

  ///             ○ Match a character in the set [Bb].<br/>
  ///             ○ Match a character in the set [Ii].<br/>
  ///             ○ Match a character in the set [Ss].<br/>
-   ///             ○ Match an empty string.<br/>
  ///         ○ Match the string "--".<br/>
  ///         ○ Match a character in the set [\-\u2014].<br/>
  ///         ○ Match the string "——".<br/>
                                  goto AlternationBranch3;
                              }
                              
-                               
                              pos += 3;
                              slice = inputSpan.Slice(pos);
                              goto AlternationMatch;
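
As a rough illustration of what this pattern matches, here is a minimal sketch. It builds the same pattern with `new Regex(...)` at run time rather than through the source generator. `TillDemo` and `FirstTill` are hypothetical names introduced only for this example. The snippet does not show the optimizer change itself, only the matching behavior the regenerated code is expected to preserve.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the PR or the corpus tooling.
public static class TillDemo
{
    // Same pattern and options as the [GeneratedRegex] attribute above,
    // constructed at run time so the snippet needs no source generator.
    private static readonly Regex s_till = new Regex(
        "(?<till>zu|bis\\s*zum|zum|bis|bis\\s*hin(\\s*zum)?|--|-|—|——)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the text captured by the "till" group for the first match.
    // Every alternative is non-empty, so an empty string means "no match".
    public static string FirstTill(string input)
    {
        Match m = s_till.Match(input);
        return m.Success ? m.Groups["till"].Value : string.Empty;
    }
}
```

Calling `TillDemo.FirstTill("bis zum Ende")` exercises the `bis\s*zum` alternative, while an input such as `"BIS hin"` is matched by the earlier plain `bis` alternative because alternatives are tried left to right.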

For more diff examples, see https://gist.github.com/MihuBot/08ae4323ca7f212b9eeeaea972deb8fd

Sample source code for further analysis

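using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Text.RegularExpressions;

const string JsonPath = "RegexResults-1826.json";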
if (!File.Exists(JsonPath))
{
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/FJerkLNA");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}

MihuBot · 2026-03-18T16:26:09Z

See benchmark results at https://gist.github.com/MihuBot/20d2370165c397728eb16ec6231da139

danmoseley · 2026-03-18T17:22:41Z

Perf analysis: latest MihuBot run vs. earlier runs

The construction/cache regressions flagged in my earlier analysis are resolved after the last push.

| Area | Before (earlier runs) | Now (latest run) | Status |
| --- | --- | --- | --- |
| Ctor (None) | 1.16–1.19x | 1.04x | ✅ Fixed |
| Ctor (IgnoreCase, Compiled) | ~1.16x | 1.05x | Was noise |
| Cache cacheSize=0 | 1.03–1.10x | 1.00x | ✅ Fixed |
| Cache cacheSize=15 | 1.05–1.07x | 1.02x | ✅ Fixed |
| Cache cacheSize=800 | 1.04x | 1.01x | ✅ Fixed |
| Steady-state matching | 1.00x | 1.00x | No change |

A few noisy outliers exist but aren't concerning:

ReplaceWords IgnoreCase,Compiled shows 1.33x, but SplitWords same config shows 0.82x — bimodal, cancels out.
\p{Ll} NonBacktracking shows 1.11x, but \p{L} NonBacktracking shows 0.91x — same story.
IsMatch_Multithreading 400000/1/15 has a 61x allocation spike, but timing is 0.98x and the single-threaded equivalent has identical allocations (233B) — GC measurement noise.

No remaining regressions of significance.

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs

danmoseley · 2026-03-19T15:12:32Z

/ba-g mac issues. sufficient validation

Copilot AI review requested due to automatic review settings March 7, 2026 08:30

github-actions bot added the area-System.Text.RegularExpressions label Mar 7, 2026

dotnet-policy-service bot assigned danmoseley Mar 7, 2026

Copilot started reviewing on behalf of danmoseley March 7, 2026 08:31

Copilot AI reviewed Mar 7, 2026

danmoseley changed the title ~~Add ReReduceTree pass to regex FinalOptimize~~ Improve regex optimizer through investigation of regex optimizer passes Mar 7, 2026

danmoseley requested a review from Copilot March 7, 2026 09:21

Copilot started reviewing on behalf of danmoseley March 7, 2026 09:23

Copilot AI reviewed Mar 7, 2026

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs Outdated

danmoseley commented Mar 7, 2026

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs Outdated

Fix indentation in ReReduceTree

4f2b287

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 7, 2026 09:45

Copilot started reviewing on behalf of danmoseley March 7, 2026 09:46

Rename ReReduceTree to FinalReduce

0281f05

Better name that pairs with FinalOptimize which calls it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danmoseley force-pushed the regex-rereduce branch from 8cda521 to 0281f05 March 7, 2026 09:49

Copilot AI reviewed Mar 7, 2026

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs Outdated

Apply suggestion from @Copilot

048f5cd

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 7, 2026 09:51

Copilot started reviewing on behalf of danmoseley March 7, 2026 09:52

Copilot AI reviewed Mar 7, 2026

MihuBot mentioned this pull request Mar 18, 2026

[Benchmark X64] [danmoseley] Improve regex optimizer through investigation o ... MihuBot/runtime-utils#1824

Open

danmoseley enabled auto-merge (squash) March 18, 2026 03:12

Copilot AI review requested due to automatic review settings March 18, 2026 14:46

MihuBot mentioned this pull request Mar 18, 2026

[Benchmark X64] [danmoseley] Improve regex optimizer through investigation o ... MihuBot/runtime-utils#1825

Open

MihuBot mentioned this pull request Mar 18, 2026

[RegexDiff X64] [danmoseley] Improve regex optimizer through investigation o ... MihuBot/runtime-utils#1826

Open

Copilot AI reviewed Mar 18, 2026

Copilot started reviewing on behalf of danmoseley March 18, 2026 14:54

Skip TryEnsureSufficientExecutionStack for leaf nodes in FinalReduce

2229edf

Early-return when ChildCount is 0 to avoid the stack check overhead on leaf nodes where no recursion will occur. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
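
The shape of such a pass can be sketched on a toy tree. This is only an approximation under stated assumptions: `ToyNode`, `ToyOptimizer`, and the two simplification rules in `Reduce()` are hypothetical stand-ins, not the real `RegexNode` code, and the public `RuntimeHelpers.TryEnsureSufficientExecutionStack()` is used in place of the library's internal `StackHelper`. Returning leaves unchanged is also a choice of this sketch; the thread does not show how the real method treats them beyond skipping the stack check.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Toy node type for illustration only; NOT the real RegexNode.
public sealed class ToyNode
{
    public string Kind { get; }
    public List<ToyNode> Children { get; } = new List<ToyNode>();
    public int ChildCount => Children.Count;

    public ToyNode(string kind, params ToyNode[] children)
    {
        Kind = kind;
        Children.AddRange(children);
    }

    // Hypothetical local simplifications standing in for Reduce():
    // Atomic(Atomic(x)) collapses to the inner Atomic, and a
    // single-child Concat collapses to its child.
    public ToyNode Reduce()
    {
        if (Kind == "Atomic" && ChildCount == 1 && Children[0].Kind == "Atomic")
        {
            return Children[0];
        }

        if (Kind == "Concat" && ChildCount == 1)
        {
            return Children[0];
        }

        return this;
    }
}

public static class ToyOptimizer
{
    // Bottom-up walk: reduce the children first, then the node itself.
    public static ToyNode FinalReduce(ToyNode node)
    {
        // Leaf: no recursion will occur, so skip the stack check entirely.
        if (node.ChildCount == 0)
        {
            return node;
        }

        // Tree too deep to recurse safely: leave this subtree as it is.
        if (!RuntimeHelpers.TryEnsureSufficientExecutionStack())
        {
            return node;
        }

        for (int i = 0; i < node.ChildCount; i++)
        {
            node.Children[i] = FinalReduce(node.Children[i]);
        }

        return node.Reduce();
    }
}
```

For example, passing `new ToyNode("Concat", new ToyNode("Atomic", new ToyNode("Atomic", new ToyNode("Char"))))` to `ToyOptimizer.FinalReduce` collapses the redundant wrappers in a single walk, because each parent is reduced only after its children have been.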

danmoseley requested a review from stephentoub March 18, 2026 17:23

stephentoub reviewed Mar 18, 2026

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs

stephentoub approved these changes Mar 18, 2026

danmoseley merged commit 9955df2 into dotnet:main Mar 19, 2026
80 of 88 checks passed

danmoseley deleted the regex-rereduce branch March 19, 2026 17:06
