Fix #4727: release the message-batch semaphore on the double-checked-lock fast path (composite rebuild deadlock) #4728
Merged
jeremydmiller merged 2 commits on Jun 12, 2026
…checked-lock fast path ProjectionUpdateBatch.CurrentMessageBatch acquired _semaphore and then returned the already-created batch from the inner double-checked `if (_batch != null) return _batch;` which sat OUTSIDE the try/finally, so the second concurrent caller to win the semaphore returned without releasing it. During an optimized composite rebuild whose stage emits side-effect messages, the parallel event slices all call CurrentMessageBatch (via ProjectionBatch.PublishMessageAsync); the leaked semaphore then deadlocks every queued slice, freezing the rebuild forever (idle, no error, no query) - the symptom reported in JasperFx#4727 and originally JasperFx#4721. Move the inner null-check inside the try so finally always releases the semaphore, and drop the random Task.Delay race band-aid that was masking the leak. Adds DaemonTests/Bugs/Bug_4727_message_batch_semaphore_leak.cs: a gated IMessageOutbox holds the semaphore while the first batch is created so N concurrent CurrentMessageBatch callers pile up on it. The test deadlocks (15s timeout) on the previous code and passes in ~1s with this fix.
…ed + tenant-partitioned tenancy (JasperFx#4727) Regression for the full production configuration that exposed the CurrentMessageBatch semaphore deadlock: MultiTenantedWithShardedDatabases + TenancyStyle.Conjoined + UseTenantPartitionedEvents + a multi-stage CompositeProjectionFor whose stage-2 member publishes side-effect messages (RaiseSideEffects -> slice.PublishMessage), driven through an optimized composite rebuild. The optimized rebuild runs in ShardExecutionMode.Continuous, so the stage-2 side effects fire and the parallel event slices contend on ProjectionUpdateBatch.CurrentMessageBatch. On the pre-fix code the rebuild deadlocks and never completes (the test hangs); with the semaphore fix it completes in ~1s and every tenant's documents on the multi-tenant shard materialize.
Fixes #4727 (and the residual deadlock behind the closed #4721).
Problem
`ProjectionUpdateBatch.CurrentMessageBatch` leaks its `SemaphoreSlim`. After `await _semaphore.WaitAsync(...)`, the inner double-checked `if (_batch != null) return _batch;` sits outside the `try`/`finally`, so the second concurrent caller to acquire the semaphore returns without ever calling `_semaphore.Release()`.
During an optimized composite rebuild whose stage emits side-effect messages, the parallel event slices all call `CurrentMessageBatch` via `ProjectionBatch.PublishMessageAsync`. They queue on the semaphore; the first creates the batch and releases; the second hits the inner early return and leaks the semaphore, so every remaining queued slice is parked on `WaitAsync` forever and the rebuild freezes (idle, no error, no DB query). Captured via a managed dump in #4727.
Fix
Move the inner null-check inside the `try` so `finally` always releases the semaphore, and drop the `Task.Delay(Random.Shared.Next(25, 200))` band-aid that was only masking the race.
Test
`DaemonTests/Bugs/Bug_4727_message_batch_semaphore_leak.cs`: a gated `IMessageOutbox` holds the semaphore while the first batch is created so N concurrent `CurrentMessageBatch` callers reliably pile up on it. The test deadlocks (15s timeout) on the previous code and passes in ~1s with this fix; all callers observe the single shared batch.
Found in production on a sharded + tenant-partitioned store (Marten 9.7.3 / JasperFx 2.9.9 / Wolverine 6.8.0): an `invoices` composite rebuild whose stage-2 members publish side-effect messages froze at a batch boundary on every deploy.
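For readers who want to see the leak in isolation, here is a minimal sketch of the pattern described above. It is not the Marten source: `FakeBatch`, `BatchHolder`, `Demo`, and both method names are hypothetical stand-ins, and a `TaskCompletionSource` gate plays the role the gated `IMessageOutbox` plays in the real regression test. Only `SemaphoreSlim`, `Task.WhenAll`, and `Task.WhenAny` are real APIs.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the real message batch type.
public sealed class FakeBatch { }

public sealed class BatchHolder
{
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(1, 1);
    private readonly Func<Task<FakeBatch>> _createBatch;
    private FakeBatch? _batch;

    public BatchHolder(Func<Task<FakeBatch>> createBatch) => _createBatch = createBatch;

    // Pre-fix shape: the inner check returns before entering the try,
    // so the finally block never runs and the semaphore is never released.
    public async Task<FakeBatch> LeakyCurrentBatchAsync(CancellationToken token)
    {
        if (_batch != null) return _batch;

        await _semaphore.WaitAsync(token);
        if (_batch != null) return _batch; // leaks the semaphore

        try
        {
            _batch = await _createBatch();
            return _batch;
        }
        finally
        {
            _semaphore.Release();
        }
    }

    // Post-fix shape: the inner check lives inside the try,
    // so every path that acquired the semaphore also releases it.
    public async Task<FakeBatch> FixedCurrentBatchAsync(CancellationToken token)
    {
        if (_batch != null) return _batch;

        await _semaphore.WaitAsync(token);
        try
        {
            if (_batch != null) return _batch;
            _batch = await _createBatch();
            return _batch;
        }
        finally
        {
            _semaphore.Release();
        }
    }
}

public static class Demo
{
    // Returns true when every caller completes and sees the same batch,
    // false when the callers are still parked after a one-second timeout.
    public static async Task<bool> AllCallersCompleteAsync(bool useFixed, int callers = 8)
    {
        // The gate keeps batch creation pending so all callers queue on the semaphore.
        var gate = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        var holder = new BatchHolder(async () => { await gate.Task; return new FakeBatch(); });
        using var cts = new CancellationTokenSource();

        var tasks = Enumerable.Range(0, callers)
            .Select(_ => useFixed
                ? holder.FixedCurrentBatchAsync(cts.Token)
                : holder.LeakyCurrentBatchAsync(cts.Token))
            .ToArray();

        gate.SetResult(true);

        var all = Task.WhenAll(tasks);
        var finished = await Task.WhenAny(all, Task.Delay(TimeSpan.FromSeconds(1))) == all;
        if (!finished)
        {
            // Unpark the stuck waiters so the sketch does not hang forever.
            cts.Cancel();
            try { await all; } catch (OperationCanceledException) { }
            return false;
        }

        var batches = await all;
        return batches.Distinct().Count() == 1;
    }
}
```

Calling `Demo.AllCallersCompleteAsync(useFixed: false)` should time out, because the second caller to acquire the semaphore returns without releasing it and the remaining callers stay parked on `WaitAsync`. With `useFixed: true` all callers should complete and share one batch. The cancellation token here exists only to clean up the sketch after the timeout; the production symptom was an indefinite freeze.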