RavenDb - Ensure only one durability agent is instantiated#2623
Closed
Bishbulb wants to merge 2 commits into
Closed
RavenDb - Ensure only one durability agent is instantiated#2623Bishbulb wants to merge 2 commits into
Bishbulb wants to merge 2 commits into
Conversation
…ability agents RavenDbMessageStore.StartScheduledJobs eagerly built and started a RavenDbDurabilityAgent at process boot. NodeAgentController separately builds and StartAsync's another agent for the wolverinedb://ravendb/durability URI via the IAgentFamily/MessageStoreCollection path once the node has assumed leadership and EvaluateAssignmentsAsync runs. The two agent instances are registered under different URIs (internal://scheduledjobs vs wolverinedb://ravendb/durability) so the NodeAgentController's Agents dictionary doesn't dedupe them, and they end up running concurrently in the same process. That alone would be wasteful but mostly harmless. The real damage is that both agents share the singleton RavenDbMessageStore instance, and therefore share its _scheduledLock and _lastScheduledLockIndex fields. TryAttainScheduledJobLockAsync branches on _scheduledLock == null: - Agent JasperFx#1 takes the initial-acquire path, PUTs the CE value, sets _scheduledLock and _lastScheduledLockIndex. - Agent JasperFx#2 sees _scheduledLock != null on the shared instance and takes the renewal path, PUT-ing with the current _lastScheduledLockIndex - which succeeds against the server because the index is still valid. - Both agents now believe they hold the lock. They both query the scheduled-jobs index, fetch the same envelopes at the same change vector, and race to mark them Incoming. The loser surfaces Raven.Client.Exceptions.ConcurrencyException; Wolverine requeues the failed message; the timeout fires twice. Drop the eager StartTimers() call. The cluster-managed agent built via MessageStoreCollection.BuildAgentAsync gets StartAsync called by NodeAgentController.StartAgentAsync and is the single owner of the scheduled-jobs poller and recovery loop. The agent returned from StartScheduledJobs is held by WolverineRuntime.DurableScheduledJobs purely for its disposal-time StopAsync; its timers are never started, and StopAsync is null-safe.
Boots the host in Solo mode and counts every RavenDbDurabilityAgent with running timers, sourced from both NodeAgentController.Agents (cluster-managed) and WolverineRuntime.DurableScheduledJobs (boot-path CompositeAgent, reached via reflection so the test project does not need adding to InternalsVisibleTo). Pre-fix: count is 2 (boot agent + cluster agent both polling). Post-fix: count is 1 (cluster only).
2 tasks
Member
|
Closing in favor of #2629, which carries the same production fix with a reflection-free regression test (adds public IsPolling / InnerAgents accessors instead of poking at private fields). Thanks @Bishbulb — kept you as co-author on the new commit. The CosmosDb note you flagged is preserved in #2629's description for follow-up. |
jeremydmiller
added a commit
that referenced
this pull request
Apr 29, 2026
…2623 pattern CosmosDbMessageStore.StartScheduledJobs follows the same eager-StartTimers() pattern as RavenDb did before the #2623 fix, so on first read it looked like the same bug. Investigation showed it does NOT have the bug today: CosmosDbMessageStore.BuildAgentFamily returns null and CosmosDbMessageStore.Uri uses the cosmosdb:// scheme rather than wolverinedb://, so MessageStoreCollection (the IAgentFamily for the wolverinedb scheme) never registers a competing agent through NodeAgentController. Only the agent built and started in StartScheduledJobs polls. This commit: - Leaves CosmosDbMessageStore.StartScheduledJobs unchanged (still calls StartTimers eagerly), with a comment explaining why the equivalent RavenDb fix doesn't apply. - Adds CosmosDbDurabilityAgent.IsPolling for parity with RavenDb's diagnostic. - Adds [InternalsVisibleTo("CosmosDbTests")] so future tests can read WolverineRuntime.DurableScheduledJobs without reflection. - Adds CosmosDbTests/durability_agent_lifecycle.cs as a regression guard: asserts exactly one CosmosDbDurabilityAgent has timers running after host startup. If anyone ever wires a CosmosDb-side IAgentFamily (or otherwise causes a second polling instance), this test will fail loudly. All 65 CosmosDbTests pass locally against the CosmosDb emulator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller
added a commit
that referenced
this pull request
Apr 29, 2026
…ction version) (#2629) * fix(RavenDb): only one durability agent polls after host start (#2623) RavenDbMessageStore.StartScheduledJobs eagerly built and started a RavenDbDurabilityAgent at boot. NodeAgentController also built *another* agent for wolverinedb://ravendb/durability via the IAgentFamily / MessageStoreCollection path. The result was two durability agents registered under different URIs (internal://scheduledjobs vs wolverinedb://ravendb/durability) — the controller didn't dedupe them, so they polled concurrently in the same process. Both agents shared the singleton RavenDbMessageStore, leading to dual lock acquisition and ConcurrencyException when they raced to mark the same envelopes Incoming. Drop the eager StartTimers() call. The cluster-managed agent built via MessageStoreCollection.BuildAgentAsync is the single owner of the scheduled-jobs poller and recovery loop; NodeAgentController calls StartAsync on it. The agent returned from StartScheduledJobs is held by WolverineRuntime.DurableScheduledJobs purely for its disposal-time StopAsync, which is null-safe on the unstarted task fields. Supersedes PR #2623 with a reflection-free regression test: - Adds public RavenDbDurabilityAgent.IsPolling so callers (and tests) can detect the multi-instance "two pollers" condition without poking at private fields. - Adds public CompositeAgent.InnerAgents for the same reason. - Adds [InternalsVisibleTo("RavenDbTests")] so the test can read WolverineRuntime.DurableScheduledJobs without reflection. CosmosDbMessageStore.StartScheduledJobs mirrors this pattern (agent.As<CosmosDbDurabilityAgent>().StartTimers()) and very likely has the same bug — flagging for separate follow-up. Co-Authored-By: Dan Bishop <bishbulb@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(CosmosDb): regression-guard test + IsPolling diagnostic for the #2623 pattern CosmosDbMessageStore.StartScheduledJobs follows the same eager-StartTimers() pattern as RavenDb did before the #2623 fix, so on first read it looked like the same bug. Investigation showed it does NOT have the bug today: CosmosDbMessageStore.BuildAgentFamily returns null and CosmosDbMessageStore.Uri uses the cosmosdb:// scheme rather than wolverinedb://, so MessageStoreCollection (the IAgentFamily for the wolverinedb scheme) never registers a competing agent through NodeAgentController. Only the agent built and started in StartScheduledJobs polls. This commit: - Leaves CosmosDbMessageStore.StartScheduledJobs unchanged (still calls StartTimers eagerly), with a comment explaining why the equivalent RavenDb fix doesn't apply. - Adds CosmosDbDurabilityAgent.IsPolling for parity with RavenDb's diagnostic. - Adds [InternalsVisibleTo("CosmosDbTests")] so future tests can read WolverineRuntime.DurableScheduledJobs without reflection. - Adds CosmosDbTests/durability_agent_lifecycle.cs as a regression guard: asserts exactly one CosmosDbDurabilityAgent has timers running after host startup. If anyone ever wires a CosmosDb-side IAgentFamily (or otherwise causes a second polling instance), this test will fail loudly. All 65 CosmosDbTests pass locally against the CosmosDb emulator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Dan Bishop <bishbulb@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 20, 2026
jeremydmiller
added a commit
that referenced
this pull request
May 20, 2026
… of 'cosmosdb' (#2863) Closes #2845. CosmosDbMessageStore.Uri and CosmosDbDurabilityAgent.Uri were built as cosmosdb://durability. MessageStoreCollection (the wolverinedb agent family) advertises every store's Uri via AllKnownAgentsAsync and distributes them under the wolverinedb scheme, so NodeAgentController tried to start cosmosdb://durability and findAgentAsync threw "Unrecognized agent scheme 'cosmosdb'" every poll cycle. Same class of bug as the RavenDb fix in #2261. Switches both URIs to wolverinedb://cosmosdb/durability so the scheme resolves to MessageStoreCollection, which builds the durability agent via the store's BuildAgent. Because the agent is now cluster-managed (NodeAgentController calls StartAsync on it), CosmosDbMessageStore.StartScheduledJobs must no longer eagerly call StartTimers() — doing so would leave two pollers racing on the same store, the #2623 duplicate-poller bug. The agent built in StartScheduledJobs is retained only for disposal-time StopAsync. The existing durability_agent_lifecycle guard already asserts exactly one polling agent across both sources. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller
added a commit
that referenced
this pull request
May 20, 2026
… of 'cosmosdb' (#2864) Closes #2845. CosmosDbMessageStore.Uri and CosmosDbDurabilityAgent.Uri were built as cosmosdb://durability. MessageStoreCollection (the wolverinedb agent family) advertises every store's Uri via AllKnownAgentsAsync and distributes them under the wolverinedb scheme, so NodeAgentController tried to start cosmosdb://durability and findAgentAsync threw "Unrecognized agent scheme 'cosmosdb'" every poll cycle. Same class of bug as the RavenDb fix in #2261. Switches both URIs to wolverinedb://cosmosdb/durability so the scheme resolves to MessageStoreCollection, which builds the durability agent via the store's BuildAgent. Because the agent is now cluster-managed (NodeAgentController calls StartAsync on it), CosmosDbMessageStore.StartScheduledJobs must no longer eagerly call StartTimers() — doing so would leave two pollers racing on the same store, the #2623 duplicate-poller bug. The agent built in StartScheduledJobs is retained only for disposal-time StopAsync. The existing durability_agent_lifecycle guard already asserts exactly one polling agent across both sources. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
RavenDbMessageStore.StartScheduledJobseagerly built and started aRavenDbDurabilityAgentat boot. However, theNodeAgentControlleralso built another agent forwolverinedb://ravendb/durabilityvia theIAgentFamily/MessageStoreCollectionpath. The result is that we had two durability agents registered under different URIs (internal://scheduledjobsvswolverinedb://ravendb/durability), so the controller didn't dedupe them, and they polled concurrently in the same process.Both agents share the singleton
RavenDbMessageStore, which led to this behavior:_scheduledLockand_lastScheduledLockIndex._scheduledLock != nullon the shared instance and takes the code path to do a renewal, which results in a PUT with the current index (this succeeds against the server because that index is still valid)Incoming. Loser surfacesConcurrencyException, Wolverine requeues the failed message, and the timeout fires twice.I observed this when testing a real workload, and added instrumentation which showed two
RavenDbDurabilityAgentinstance ids both LOCK-ACQUIRED in the same window, fetching the sameIncomingMessagedoc at the same change vector, which resulted in downstream concurrency exceptions.Fix
Drop the eager
StartTimers()call. The cluster-managed agent built viaMessageStoreCollection.BuildAgentAsyncis the single owner of the scheduled-jobs poller and recovery loop;NodeAgentControllercallsStartAsyncon it. The agent returned fromStartScheduledJobsis held byWolverineRuntime.DurableScheduledJobspurely for its disposal-timeStopAsync, which is null-safe on the unstarted task fields.Note
Wolverine.CosmosDb.Internals.CosmosDbMessageStore.StartScheduledJobsmirrors this pattern (agent.As<CosmosDbDurabilityAgent>().StartTimers()) and very likely has the same bug. Flagging in case you want to look at that separately.