fix(aspnetcore): serialize WithWebHostBuilder to stop _derivedFactories race (flaky disposal NRE)#6251
Conversation
…es race A shared (e.g. SharedType.PerTestSession) GlobalFactory has GetIsolatedFactory called from every test's parallel Before(Test) hook. Internally that calls WebApplicationFactory.WithWebHostBuilder, which appends to the base factory's private _derivedFactories List<T> with no synchronization. Concurrent List.Add tears the backing array (lost entries / null slots); the damage only surfaces later as a NullReferenceException inside WebApplicationFactory.DisposeAsync when the shared factory is disposed and enumerates that list (via ObjectTracker). Wrap the WithWebHostBuilder call in a per-factory lock. The call is synchronous and only does fast configuration, so a plain lock is enough; the expensive host build is throttled separately by ServerInitSemaphore. Add a regression test that hammers GetIsolatedFactory concurrently then disposes; it reproduces the exact NRE without the lock and passes with it.
There was a problem hiding this comment.
Code Review
Summary
The root cause analysis is accurate and well-documented: appends to a private without synchronization, and concurrent from parallel hooks tears the backing array. The fix is minimal, correctly targeted, and includes a solid regression test.
The Fix ()
Correct and well-scoped. The lock is instance-scoped ( per factory), so parallel test suites using separate factory instances don't block each other — only the shared serializes its GetIsolatedFactory calls, which is exactly the right granularity.
One thing worth calling out explicitly: the lock correctly wraps the entire WithWebHostBuilder call (including the builder delegate), not just a List.Add. This is necessary because the _derivedFactories.Add happens synchronously inside WithWebHostBuilder as it processes the delegate. The PR description and the code comment both make this clear — just worth stating for reviewers unfamiliar with the ASP.NET Core internals.
The builder delegate itself does only fast synchronous work (configuration, service registration), so holding the lock during it is fine. The expensive host build remains un-blocked, throttled separately by ServerInitSemaphore as noted.
Minor style note: CLAUDE.md specifies "one short line max" for comments, but the 8-line block comment on is justified here — the bug is subtle enough (third-party internal mutation, delayed NRE symptoms) that the full WHY is worth preserving.
The Test ()
The gate-and-release pattern () is the right technique for maximizing concurrency and creating a wide race window. 64 concurrent calls × 8 rounds is a strong stress load.
One correctness note: is captured outside and assigned to before the lambda, so async-local flow into thread-pool threads is not a concern — the closure holds a direct reference. Good.
Potential flakiness: The test asserts , which requires all 64 slots to actually run concurrently and return before disposal. In most environments this will be fine, but on heavily constrained CI runners (e.g., single-core), the thread pool might not spin up 64 threads simultaneously, meaning the gate releases before all 64 tasks are scheduled. The test would still pass (the lock fix would prevent NREs), but the distinct-count assertion might fail if some tasks see the same factory object. Consider whether that assertion is load-bearing for the regression or whether it can be dropped in favour of only asserting no-throw.
Alternative Approaches Considered
The only real alternative would be making return a pre-configured factory from a cache, but that doesn't apply here since each isolated factory has per-test service registrations (e.g. ). The plain is the right tool.
Verdict
Approve with one suggestion: the assertion in the regression test may be fragile on single-core runners. Consider dropping it or wrapping it with a note that it validates uniqueness rather than the race fix itself (the ThrowsNothing assertion is what proves the fix). Everything else is well-reasoned and correctly implemented.
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
Summary
The root cause analysis is accurate and well-documented: WebApplicationFactory.WithWebHostBuilder appends to a private List<T> without synchronization, and concurrent List.Add from parallel Before(Test) hooks tears the backing array. The fix is minimal, correctly targeted, and includes a solid regression test.
The Fix (TestWebApplicationFactory.cs)
Correct and well-scoped. The lock is instance-scoped (_isolationLock per factory), so parallel test suites using separate factory instances do not block each other — only the shared GlobalFactory serializes its GetIsolatedFactory calls, which is exactly the right granularity.
One thing worth calling out explicitly: the lock correctly wraps the entire WithWebHostBuilder call (including the builder delegate), not just a List.Add. This is necessary because _derivedFactories.Add happens synchronously inside WithWebHostBuilder as it processes the delegate. The PR description and code comment both make this clear — just worth stating for reviewers unfamiliar with ASP.NET Core internals.
The builder delegate itself does only fast synchronous work (configuration, service registration), so holding the lock during it is fine. The expensive host build remains un-blocked, throttled separately by ServerInitSemaphore as noted.
Minor style note: CLAUDE.md specifies "one short line max" for comments, but the 8-line block comment on _isolationLock is justified here — the bug is subtle enough (third-party internal mutation, delayed NRE symptoms) that the full WHY is worth preserving in-situ.
The Test (IsolatedFactoryConcurrencyTests.cs)
The gate-and-release pattern (TaskCompletionSource) is the right technique for maximizing concurrency and creating a wide race window. 64 concurrent calls × 8 rounds is a strong stress load.
Correctness note: TestContext.Current! is captured outside Task.Run and assigned to testContext before the lambda, so async-local flow into thread-pool threads is not a concern — the closure holds a direct reference. Good.
Potential flakiness concern: The assertion derived.Distinct().Count() == Concurrency requires all 64 thread-pool workers to have been scheduled and to have returned distinct instances. On a heavily constrained CI runner (e.g., 2-core), the thread pool may not spin up 64 threads simultaneously — a few tasks might see duplicate factory objects or queue behind each other. The test would still pass (the lock prevents NREs), but the distinct-count assertion could give a false failure. Consider whether this assertion is load-bearing for the regression (it is not — ThrowsNothing proves the fix; Distinct() proves ASP.NET Core's WithWebHostBuilder always returns a fresh instance). Dropping or annotating the Distinct().Count() assertion would make the test more reliable across CI environments.
Alternatives Considered
Caching pre-configured factories in a ConcurrentDictionary would not apply here since each isolated factory carries per-test service registrations (e.g. services.AddSingleton(testContext)). A SemaphoreSlim(1, 1) would enable async waiting but WithWebHostBuilder is synchronous, so a plain lock is the right and simpler tool.
Verdict
The fix is correct, well-reasoned, and the right approach. One suggestion: consider whether the Distinct().Count() assertion in the regression test adds robustness or fragility on resource-constrained CI runners — the ThrowsNothing assertion is the load-bearing part.
There was a problem hiding this comment.
Code Review (Follow-up)
Two prior reviews from claude[bot] were submitted on this PR, but both contained a factual inaccuracy: they flagged a concern about derived.Distinct().Count() == Concurrency being fragile on low-core CI runners. That assertion does not exist in the code. The actual assertion is:
await Assert.That(derived.Length).IsEqualTo(Concurrency);Since this is derived from Task.WhenAll, which re-throws on any failure and always returns the full result array on success, the length will always equal Concurrency whenever the test passes — no thread pool saturation concern applies here.
Fix (TestWebApplicationFactory.cs) — Correct
The lock (_isolationLock) approach is exactly right:
- Scope is correct: the lock wraps the entire
WithWebHostBuildercall, not just a hypotheticalList.Add. The add happens synchronously inside that call when processing the builder delegate — locking around just the call site without covering the delegate execution would not help. - Granularity is correct:
_isolationLockis per-instance, so separate factory instances don't contend. Only the sharedGlobalFactoryserializes itsGetIsolatedFactorycalls. - Lock held during synchronous-only work: the builder delegate is fast synchronous configuration; the expensive host build happens after, outside the lock, still throttled by
ServerInitSemaphore. No async-over-sync or deadlock risk. - Comment length: the 8-line block comment on
_isolationLockexceeds CLAUDE.md's "one short line max" guideline, but it's justified — the bug is subtle (third-party internal mutation, delayed NRE symptom), and the full WHY is valuable for future maintainers.
Test (IsolatedFactoryConcurrencyTests.cs) — Solid
- Gate-and-release pattern (
TaskCompletionSource) correctly maximises concurrency for the race window. 64 workers × 8 rounds is a strong stress load. TestContext.Current!is captured beforeTask.Run, so the closure holds a direct reference — no async-local propagation issue.- Derived factories are not manually disposed, which is correct:
globalFactory.DisposeAsync()internally enumerates_derivedFactoriesand disposes each one. That traversal is exactly the crash site the test is exercising. TestWebAppFactorypre-exists atTUnit.AspNetCore.Tests/TestWebAppFactory.cs— not missing from the PR.
Verdict
Approve. The fix is minimal, correctly targeted, and well-tested. No changes needed.
Problem
Users running ASP.NET Core integration tests in parallel with
TUnit.AspNetCorehit a flaky crash during disposal:Root cause
WebApplicationTest<TFactory, TEntryPoint>exposes a single shared factory:Every test's
[Before(HookType.Test)]hook runs in parallel and callsGlobalFactory.GetIsolatedFactory(...), which callsWebApplicationFactory.WithWebHostBuilder. That method appends to the basefactory's private
_derivedFactoriesList<T>with no synchronization(confirmed against the ASP.NET Core source):
Concurrent
List.Addtears the backing array, leaving anullslot. Testspass. At session end the last test untracks the shared
GlobalFactory, whoseDisposeAsyncenumerates the corrupted list and dereferences the null slot →NullReferenceException. The race only sometimes tears the array, hence theflakiness.
ServerInitSemaphoredid not cover this — it guards only theServerhost build, which happens after the unsynchronizedGetIsolatedFactorycall.Fix
Wrap the
WithWebHostBuildercall in a per-factorylock. It is synchronousand only does fast configuration, so a plain lock suffices; the expensive host
build remains throttled separately by
ServerInitSemaphore.Regression test
IsolatedFactoryConcurrencyTestsfires 64 concurrentGetIsolatedFactorycalls (released together via a gate) against one shared factory across 8
rounds, then disposes. Verified it reproduces the exact NRE with the lock
removed and passes with the fix. Full
TUnit.AspNetCore.Testssuite (36tests) green.
Note for users not on the base class
If you manage your own shared
WebApplicationFactoryand call.WithWebHostBuilder()from parallel tests yourself, the same ASP.NET Corerace applies in your code — serialize that call.