RavenDb - Ensure only one durability agent is instantiated by Bishbulb · Pull Request #2623 · JasperFx/wolverine

Bishbulb · 2026-04-29T16:50:25Z

Overview

RavenDbMessageStore.StartScheduledJobs eagerly built and started a RavenDbDurabilityAgent at boot. However, the NodeAgentController also built another agent for wolverinedb://ravendb/durability via the IAgentFamily / MessageStoreCollection path. The result is that we had two durability agents registered under different URIs (internal://scheduledjobs vs wolverinedb://ravendb/durability), so the controller didn't dedupe them, and they polled concurrently in the same process.

Both agents share the singleton RavenDbMessageStore, which led to this behavior:

Agent 1 follows the code path to do an initial acquire and sets _scheduledLock and _lastScheduledLockIndex.
Agent 2 sees _scheduledLock != null on the shared instance and takes the code path to do a renewal, which results in a PUT with the current index (this succeeds against the server because that index is still valid)
Both agents now believe they hold the lock, and both fetch the same envelopes at the same change vector, and race to mark them Incoming. Loser surfaces ConcurrencyException, Wolverine requeues the failed message, and the timeout fires twice.

I observed this when testing a real workload, and added instrumentation which showed two RavenDbDurabilityAgent instance ids both LOCK-ACQUIRED in the same window, fetching the same IncomingMessage doc at the same change vector, which resulted in downstream concurrency exceptions.

Fix

Drop the eager StartTimers() call. The cluster-managed agent built via MessageStoreCollection.BuildAgentAsync is the single owner of the scheduled-jobs poller and recovery loop; NodeAgentController calls StartAsync on it. The agent returned from
StartScheduledJobs is held by WolverineRuntime.DurableScheduledJobs purely for its disposal-time StopAsync, which is null-safe on the unstarted task fields.

Note

Wolverine.CosmosDb.Internals.CosmosDbMessageStore.StartScheduledJobs mirrors this pattern (agent.As<CosmosDbDurabilityAgent>().StartTimers()) and very likely has the same bug. Flagging in case you want to look at that separately.

…ability agents RavenDbMessageStore.StartScheduledJobs eagerly built and started a RavenDbDurabilityAgent at process boot. NodeAgentController separately builds and StartAsync's another agent for the wolverinedb://ravendb/durability URI via the IAgentFamily/MessageStoreCollection path once the node has assumed leadership and EvaluateAssignmentsAsync runs. The two agent instances are registered under different URIs (internal://scheduledjobs vs wolverinedb://ravendb/durability) so the NodeAgentController's Agents dictionary doesn't dedupe them, and they end up running concurrently in the same process. That alone would be wasteful but mostly harmless. The real damage is that both agents share the singleton RavenDbMessageStore instance, and therefore share its _scheduledLock and _lastScheduledLockIndex fields. TryAttainScheduledJobLockAsync branches on _scheduledLock == null: - Agent JasperFx#1 takes the initial-acquire path, PUTs the CE value, sets _scheduledLock and _lastScheduledLockIndex. - Agent JasperFx#2 sees _scheduledLock != null on the shared instance and takes the renewal path, PUT-ing with the current _lastScheduledLockIndex - which succeeds against the server because the index is still valid. - Both agents now believe they hold the lock. They both query the scheduled-jobs index, fetch the same envelopes at the same change vector, and race to mark them Incoming. The loser surfaces Raven.Client.Exceptions.ConcurrencyException; Wolverine requeues the failed message; the timeout fires twice. Drop the eager StartTimers() call. The cluster-managed agent built via MessageStoreCollection.BuildAgentAsync gets StartAsync called by NodeAgentController.StartAgentAsync and is the single owner of the scheduled-jobs poller and recovery loop. The agent returned from StartScheduledJobs is held by WolverineRuntime.DurableScheduledJobs purely for its disposal-time StopAsync; its timers are never started, and StopAsync is null-safe.

Boots the host in Solo mode and counts every RavenDbDurabilityAgent with running timers, sourced from both NodeAgentController.Agents (cluster-managed) and WolverineRuntime.DurableScheduledJobs (boot-path CompositeAgent, reached via reflection so the test project does not need adding to InternalsVisibleTo). Pre-fix: count is 2 (boot agent + cluster agent both polling). Post-fix: count is 1 (cluster only).

jeremydmiller · 2026-04-29T20:57:45Z

Closing in favor of #2629, which carries the same production fix with a reflection-free regression test (adds public IsPolling / InnerAgents accessors instead of poking at private fields). Thanks @Bishbulb — kept you as co-author on the new commit. The CosmosDb note you flagged is preserved in #2629's description for follow-up.

…2623 pattern CosmosDbMessageStore.StartScheduledJobs follows the same eager-StartTimers() pattern as RavenDb did before the #2623 fix, so on first read it looked like the same bug. Investigation showed it does NOT have the bug today: CosmosDbMessageStore.BuildAgentFamily returns null and CosmosDbMessageStore.Uri uses the cosmosdb:// scheme rather than wolverinedb://, so MessageStoreCollection (the IAgentFamily for the wolverinedb scheme) never registers a competing agent through NodeAgentController. Only the agent built and started in StartScheduledJobs polls. This commit: - Leaves CosmosDbMessageStore.StartScheduledJobs unchanged (still calls StartTimers eagerly), with a comment explaining why the equivalent RavenDb fix doesn't apply. - Adds CosmosDbDurabilityAgent.IsPolling for parity with RavenDb's diagnostic. - Adds [InternalsVisibleTo("CosmosDbTests")] so future tests can read WolverineRuntime.DurableScheduledJobs without reflection. - Adds CosmosDbTests/durability_agent_lifecycle.cs as a regression guard: asserts exactly one CosmosDbDurabilityAgent has timers running after host startup. If anyone ever wires a CosmosDb-side IAgentFamily (or otherwise causes a second polling instance), this test will fail loudly. All 65 CosmosDbTests pass locally against the CosmosDb emulator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ction version) (#2629) * fix(RavenDb): only one durability agent polls after host start (#2623) RavenDbMessageStore.StartScheduledJobs eagerly built and started a RavenDbDurabilityAgent at boot. NodeAgentController also built *another* agent for wolverinedb://ravendb/durability via the IAgentFamily / MessageStoreCollection path. The result was two durability agents registered under different URIs (internal://scheduledjobs vs wolverinedb://ravendb/durability) — the controller didn't dedupe them, so they polled concurrently in the same process. Both agents shared the singleton RavenDbMessageStore, leading to dual lock acquisition and ConcurrencyException when they raced to mark the same envelopes Incoming. Drop the eager StartTimers() call. The cluster-managed agent built via MessageStoreCollection.BuildAgentAsync is the single owner of the scheduled-jobs poller and recovery loop; NodeAgentController calls StartAsync on it. The agent returned from StartScheduledJobs is held by WolverineRuntime.DurableScheduledJobs purely for its disposal-time StopAsync, which is null-safe on the unstarted task fields. Supersedes PR #2623 with a reflection-free regression test: - Adds public RavenDbDurabilityAgent.IsPolling so callers (and tests) can detect the multi-instance "two pollers" condition without poking at private fields. - Adds public CompositeAgent.InnerAgents for the same reason. - Adds [InternalsVisibleTo("RavenDbTests")] so the test can read WolverineRuntime.DurableScheduledJobs without reflection. CosmosDbMessageStore.StartScheduledJobs mirrors this pattern (agent.As<CosmosDbDurabilityAgent>().StartTimers()) and very likely has the same bug — flagging for separate follow-up. Co-Authored-By: Dan Bishop <bishbulb@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(CosmosDb): regression-guard test + IsPolling diagnostic for the #2623 pattern CosmosDbMessageStore.StartScheduledJobs follows the same eager-StartTimers() pattern as RavenDb did before the #2623 fix, so on first read it looked like the same bug. Investigation showed it does NOT have the bug today: CosmosDbMessageStore.BuildAgentFamily returns null and CosmosDbMessageStore.Uri uses the cosmosdb:// scheme rather than wolverinedb://, so MessageStoreCollection (the IAgentFamily for the wolverinedb scheme) never registers a competing agent through NodeAgentController. Only the agent built and started in StartScheduledJobs polls. This commit: - Leaves CosmosDbMessageStore.StartScheduledJobs unchanged (still calls StartTimers eagerly), with a comment explaining why the equivalent RavenDb fix doesn't apply. - Adds CosmosDbDurabilityAgent.IsPolling for parity with RavenDb's diagnostic. - Adds [InternalsVisibleTo("CosmosDbTests")] so future tests can read WolverineRuntime.DurableScheduledJobs without reflection. - Adds CosmosDbTests/durability_agent_lifecycle.cs as a regression guard: asserts exactly one CosmosDbDurabilityAgent has timers running after host startup. If anyone ever wires a CosmosDb-side IAgentFamily (or otherwise causes a second polling instance), this test will fail loudly. All 65 CosmosDbTests pass locally against the CosmosDb emulator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Dan Bishop <bishbulb@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… of 'cosmosdb' (#2863) Closes #2845. CosmosDbMessageStore.Uri and CosmosDbDurabilityAgent.Uri were built as cosmosdb://durability. MessageStoreCollection (the wolverinedb agent family) advertises every store's Uri via AllKnownAgentsAsync and distributes them under the wolverinedb scheme, so NodeAgentController tried to start cosmosdb://durability and findAgentAsync threw "Unrecognized agent scheme 'cosmosdb'" every poll cycle. Same class of bug as the RavenDb fix in #2261. Switches both URIs to wolverinedb://cosmosdb/durability so the scheme resolves to MessageStoreCollection, which builds the durability agent via the store's BuildAgent. Because the agent is now cluster-managed (NodeAgentController calls StartAsync on it), CosmosDbMessageStore.StartScheduledJobs must no longer eagerly call StartTimers() — doing so would leave two pollers racing on the same store, the #2623 duplicate-poller bug. The agent built in StartScheduledJobs is retained only for disposal-time StopAsync. The existing durability_agent_lifecycle guard already asserts exactly one polling agent across both sources. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… of 'cosmosdb' (#2864) Closes #2845. CosmosDbMessageStore.Uri and CosmosDbDurabilityAgent.Uri were built as cosmosdb://durability. MessageStoreCollection (the wolverinedb agent family) advertises every store's Uri via AllKnownAgentsAsync and distributes them under the wolverinedb scheme, so NodeAgentController tried to start cosmosdb://durability and findAgentAsync threw "Unrecognized agent scheme 'cosmosdb'" every poll cycle. Same class of bug as the RavenDb fix in #2261. Switches both URIs to wolverinedb://cosmosdb/durability so the scheme resolves to MessageStoreCollection, which builds the durability agent via the store's BuildAgent. Because the agent is now cluster-managed (NodeAgentController calls StartAsync on it), CosmosDbMessageStore.StartScheduledJobs must no longer eagerly call StartTimers() — doing so would leave two pollers racing on the same store, the #2623 duplicate-poller bug. The agent built in StartScheduledJobs is retained only for disposal-time StopAsync. The existing durability_agent_lifecycle guard already asserts exactly one polling agent across both sources. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bishbulb added 2 commits April 29, 2026 10:38

jeremydmiller mentioned this pull request Apr 29, 2026

Fix #2623: only one durability agent polls after host start (no-reflection version) #2629

Merged

2 tasks

jeremydmiller closed this Apr 29, 2026

This was referenced May 20, 2026

Fix CosmosDb durability agent URI scheme ('cosmosdb' → 'wolverinedb') (#2845) #2863

Merged

Fix CosmosDb durability agent URI scheme ('cosmosdb' → 'wolverinedb') (#2845) [5.0] #2864

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

RavenDb - Ensure only one durability agent is instantiated#2623

RavenDb - Ensure only one durability agent is instantiated#2623
Bishbulb wants to merge 2 commits into
JasperFx:mainfrom
Bishbulb:fix/ravendb-double-durability-agent

Bishbulb commented Apr 29, 2026

Uh oh!

jeremydmiller commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Bishbulb commented Apr 29, 2026

Overview

Fix

Note

Uh oh!

jeremydmiller commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants