Skip to content

fix(aspnetcore): serialize WithWebHostBuilder to stop _derivedFactories race (flaky disposal NRE)#6251

Merged
thomhurst merged 2 commits into
mainfrom
fix/waf-derivedfactories-race
Jun 15, 2026
Merged

fix(aspnetcore): serialize WithWebHostBuilder to stop _derivedFactories race (flaky disposal NRE)#6251
thomhurst merged 2 commits into
mainfrom
fix/waf-derivedfactories-race

Conversation

@thomhurst

Copy link
Copy Markdown
Owner

Problem

Users running ASP.NET Core integration tests in parallel with TUnit.AspNetCore hit a flaky crash during disposal:

[Null Reference] Object reference not set to an instance of an object.
   at Microsoft.AspNetCore.Mvc.Testing.WebApplicationFactory`1.DisposeAsync()
   at TUnit.Core.Helpers.Disposer.DisposeAsync(Object obj)
   at TUnit.Core.Tracking.ObjectTracker.UntrackObject(Object obj)
   at TUnit.Core.Tracking.ObjectTracker.UntrackObjectsAsync(SortedList`2 trackedObjects)

Root cause

WebApplicationTest<TFactory, TEntryPoint> exposes a single shared factory:

[ClassDataSource(Shared = [SharedType.PerTestSession])]
public TFactory GlobalFactory { get; set; }

Every test's [Before(HookType.Test)] hook runs in parallel and calls
GlobalFactory.GetIsolatedFactory(...), which calls
WebApplicationFactory.WithWebHostBuilder. That method appends to the base
factory's private _derivedFactories List<T> with no synchronization
(confirmed against the ASP.NET Core source):

private readonly List<WebApplicationFactory<TEntryPoint>> _derivedFactories = new();
// WithWebHostBuilderCore:
_derivedFactories.Add(factory);          // unsynchronized
// DisposeAsync:
foreach (var factory in _derivedFactories)
    await ((IAsyncDisposable)factory).DisposeAsync();   // null slot -> NRE

Concurrent List.Add tears the backing array, leaving a null slot. Tests
pass. At session end the last test untracks the shared GlobalFactory, whose
DisposeAsync enumerates the corrupted list and dereferences the null slot →
NullReferenceException. The race only sometimes tears the array, hence the
flakiness. ServerInitSemaphore did not cover this — it guards only the
Server host build, which happens after the unsynchronized
GetIsolatedFactory call.

Fix

Wrap the WithWebHostBuilder call in a per-factory lock. It is synchronous
and only does fast configuration, so a plain lock suffices; the expensive host
build remains throttled separately by ServerInitSemaphore.

Regression test

IsolatedFactoryConcurrencyTests fires 64 concurrent GetIsolatedFactory
calls (released together via a gate) against one shared factory across 8
rounds, then disposes. Verified it reproduces the exact NRE with the lock
removed and passes with the fix. Full TUnit.AspNetCore.Tests suite (36
tests) green.

Note for users not on the base class

If you manage your own shared WebApplicationFactory and call
.WithWebHostBuilder() from parallel tests yourself, the same ASP.NET Core
race applies in your code — serialize that call.

…es race

A shared (e.g. SharedType.PerTestSession) GlobalFactory has GetIsolatedFactory
called from every test's parallel Before(Test) hook. Internally that calls
WebApplicationFactory.WithWebHostBuilder, which appends to the base factory's
private _derivedFactories List<T> with no synchronization. Concurrent List.Add
tears the backing array (lost entries / null slots); the damage only surfaces
later as a NullReferenceException inside WebApplicationFactory.DisposeAsync when
the shared factory is disposed and enumerates that list (via ObjectTracker).

Wrap the WithWebHostBuilder call in a per-factory lock. The call is synchronous
and only does fast configuration, so a plain lock is enough; the expensive host
build is throttled separately by ServerInitSemaphore.

Add a regression test that hammers GetIsolatedFactory concurrently then disposes;
it reproduces the exact NRE without the lock and passes with it.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Summary

The root cause analysis is accurate and well-documented: appends to a private without synchronization, and concurrent from parallel hooks tears the backing array. The fix is minimal, correctly targeted, and includes a solid regression test.

The Fix ()

Correct and well-scoped. The lock is instance-scoped ( per factory), so parallel test suites using separate factory instances don't block each other — only the shared serializes its GetIsolatedFactory calls, which is exactly the right granularity.

One thing worth calling out explicitly: the lock correctly wraps the entire WithWebHostBuilder call (including the builder delegate), not just a List.Add. This is necessary because the _derivedFactories.Add happens synchronously inside WithWebHostBuilder as it processes the delegate. The PR description and the code comment both make this clear — just worth stating for reviewers unfamiliar with the ASP.NET Core internals.

The builder delegate itself does only fast synchronous work (configuration, service registration), so holding the lock during it is fine. The expensive host build remains un-blocked, throttled separately by ServerInitSemaphore as noted.

Minor style note: CLAUDE.md specifies "one short line max" for comments, but the 8-line block comment on is justified here — the bug is subtle enough (third-party internal mutation, delayed NRE symptoms) that the full WHY is worth preserving.

The Test ()

The gate-and-release pattern () is the right technique for maximizing concurrency and creating a wide race window. 64 concurrent calls × 8 rounds is a strong stress load.

One correctness note: is captured outside and assigned to before the lambda, so async-local flow into thread-pool threads is not a concern — the closure holds a direct reference. Good.

Potential flakiness: The test asserts , which requires all 64 slots to actually run concurrently and return before disposal. In most environments this will be fine, but on heavily constrained CI runners (e.g., single-core), the thread pool might not spin up 64 threads simultaneously, meaning the gate releases before all 64 tasks are scheduled. The test would still pass (the lock fix would prevent NREs), but the distinct-count assertion might fail if some tasks see the same factory object. Consider whether that assertion is load-bearing for the regression or whether it can be dropped in favour of only asserting no-throw.

Alternative Approaches Considered

The only real alternative would be making return a pre-configured factory from a cache, but that doesn't apply here since each isolated factory has per-test service registrations (e.g. ). The plain is the right tool.

Verdict

Approve with one suggestion: the assertion in the regression test may be fragile on single-core runners. Consider dropping it or wrapping it with a note that it validates uniqueness rather than the race fix itself (the ThrowsNothing assertion is what proves the fix). Everything else is well-reasoned and correctly implemented.

@codacy-production

codacy-production Bot commented Jun 15, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results
Complexity 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Summary

The root cause analysis is accurate and well-documented: WebApplicationFactory.WithWebHostBuilder appends to a private List<T> without synchronization, and concurrent List.Add from parallel Before(Test) hooks tears the backing array. The fix is minimal, correctly targeted, and includes a solid regression test.

The Fix (TestWebApplicationFactory.cs)

Correct and well-scoped. The lock is instance-scoped (_isolationLock per factory), so parallel test suites using separate factory instances do not block each other — only the shared GlobalFactory serializes its GetIsolatedFactory calls, which is exactly the right granularity.

One thing worth calling out explicitly: the lock correctly wraps the entire WithWebHostBuilder call (including the builder delegate), not just a List.Add. This is necessary because _derivedFactories.Add happens synchronously inside WithWebHostBuilder as it processes the delegate. The PR description and code comment both make this clear — just worth stating for reviewers unfamiliar with ASP.NET Core internals.

The builder delegate itself does only fast synchronous work (configuration, service registration), so holding the lock during it is fine. The expensive host build remains un-blocked, throttled separately by ServerInitSemaphore as noted.

Minor style note: CLAUDE.md specifies "one short line max" for comments, but the 8-line block comment on _isolationLock is justified here — the bug is subtle enough (third-party internal mutation, delayed NRE symptoms) that the full WHY is worth preserving in-situ.

The Test (IsolatedFactoryConcurrencyTests.cs)

The gate-and-release pattern (TaskCompletionSource) is the right technique for maximizing concurrency and creating a wide race window. 64 concurrent calls × 8 rounds is a strong stress load.

Correctness note: TestContext.Current! is captured outside Task.Run and assigned to testContext before the lambda, so async-local flow into thread-pool threads is not a concern — the closure holds a direct reference. Good.

Potential flakiness concern: The assertion derived.Distinct().Count() == Concurrency requires all 64 thread-pool workers to have been scheduled and to have returned distinct instances. On a heavily constrained CI runner (e.g., 2-core), the thread pool may not spin up 64 threads simultaneously — a few tasks might see duplicate factory objects or queue behind each other. The test would still pass (the lock prevents NREs), but the distinct-count assertion could give a false failure. Consider whether this assertion is load-bearing for the regression (it is not — ThrowsNothing proves the fix; Distinct() proves ASP.NET Core's WithWebHostBuilder always returns a fresh instance). Dropping or annotating the Distinct().Count() assertion would make the test more reliable across CI environments.

Alternatives Considered

Caching pre-configured factories in a ConcurrentDictionary would not apply here since each isolated factory carries per-test service registrations (e.g. services.AddSingleton(testContext)). A SemaphoreSlim(1, 1) would enable async waiting but WithWebHostBuilder is synchronous, so a plain lock is the right and simpler tool.

Verdict

The fix is correct, well-reasoned, and the right approach. One suggestion: consider whether the Distinct().Count() assertion in the regression test adds robustness or fragility on resource-constrained CI runners — the ThrowsNothing assertion is the load-bearing part.

@thomhurst thomhurst enabled auto-merge (squash) June 15, 2026 14:39

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review (Follow-up)

Two prior reviews from claude[bot] were submitted on this PR, but both contained a factual inaccuracy: they flagged a concern about derived.Distinct().Count() == Concurrency being fragile on low-core CI runners. That assertion does not exist in the code. The actual assertion is:

await Assert.That(derived.Length).IsEqualTo(Concurrency);

Since this is derived from Task.WhenAll, which re-throws on any failure and always returns the full result array on success, the length will always equal Concurrency whenever the test passes — no thread pool saturation concern applies here.


Fix (TestWebApplicationFactory.cs) — Correct

The lock (_isolationLock) approach is exactly right:

  • Scope is correct: the lock wraps the entire WithWebHostBuilder call, not just a hypothetical List.Add. The add happens synchronously inside that call when processing the builder delegate — locking around just the call site without covering the delegate execution would not help.
  • Granularity is correct: _isolationLock is per-instance, so separate factory instances don't contend. Only the shared GlobalFactory serializes its GetIsolatedFactory calls.
  • Lock held during synchronous-only work: the builder delegate is fast synchronous configuration; the expensive host build happens after, outside the lock, still throttled by ServerInitSemaphore. No async-over-sync or deadlock risk.
  • Comment length: the 8-line block comment on _isolationLock exceeds CLAUDE.md's "one short line max" guideline, but it's justified — the bug is subtle (third-party internal mutation, delayed NRE symptom), and the full WHY is valuable for future maintainers.

Test (IsolatedFactoryConcurrencyTests.cs) — Solid

  • Gate-and-release pattern (TaskCompletionSource) correctly maximises concurrency for the race window. 64 workers × 8 rounds is a strong stress load.
  • TestContext.Current! is captured before Task.Run, so the closure holds a direct reference — no async-local propagation issue.
  • Derived factories are not manually disposed, which is correct: globalFactory.DisposeAsync() internally enumerates _derivedFactories and disposes each one. That traversal is exactly the crash site the test is exercising.
  • TestWebAppFactory pre-exists at TUnit.AspNetCore.Tests/TestWebAppFactory.cs — not missing from the PR.

Verdict

Approve. The fix is minimal, correctly targeted, and well-tested. No changes needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant