Skip to content

RavenDb - Ensure only one durability agent is instantiated#2623

Closed
Bishbulb wants to merge 2 commits into
JasperFx:mainfrom
Bishbulb:fix/ravendb-double-durability-agent
Closed

RavenDb - Ensure only one durability agent is instantiated#2623
Bishbulb wants to merge 2 commits into
JasperFx:mainfrom
Bishbulb:fix/ravendb-double-durability-agent

Conversation

@Bishbulb

Copy link
Copy Markdown
Contributor

Overview

RavenDbMessageStore.StartScheduledJobs eagerly built and started a RavenDbDurabilityAgent at boot. However, the NodeAgentController also built another agent for wolverinedb://ravendb/durability via the IAgentFamily / MessageStoreCollection path. The result is that we had two durability agents registered under different URIs (internal://scheduledjobs vs wolverinedb://ravendb/durability), so the controller didn't dedupe them, and they polled concurrently in the same process.

Both agents share the singleton RavenDbMessageStore, which led to this behavior:

  • Agent 1 follows the code path to do an initial acquire and sets _scheduledLock and _lastScheduledLockIndex.
  • Agent 2 sees _scheduledLock != null on the shared instance and takes the code path to do a renewal, which results in a PUT with the current index (this succeeds against the server because that index is still valid)
  • Both agents now believe they hold the lock, and both fetch the same envelopes at the same change vector, and race to mark them Incoming. Loser surfaces ConcurrencyException, Wolverine requeues the failed message, and the timeout fires twice.

I observed this when testing a real workload, and added instrumentation which showed two RavenDbDurabilityAgent instance ids both LOCK-ACQUIRED in the same window, fetching the same IncomingMessage doc at the same change vector, which resulted in downstream concurrency exceptions.

Fix

Drop the eager StartTimers() call. The cluster-managed agent built via MessageStoreCollection.BuildAgentAsync is the single owner of the scheduled-jobs poller and recovery loop; NodeAgentController calls StartAsync on it. The agent returned from
StartScheduledJobs is held by WolverineRuntime.DurableScheduledJobs purely for its disposal-time StopAsync, which is null-safe on the unstarted task fields.

Note

Wolverine.CosmosDb.Internals.CosmosDbMessageStore.StartScheduledJobs mirrors this pattern (agent.As<CosmosDbDurabilityAgent>().StartTimers()) and very likely has the same bug. Flagging in case you want to look at that separately.

…ability agents

RavenDbMessageStore.StartScheduledJobs eagerly built and started a
RavenDbDurabilityAgent at process boot. NodeAgentController separately
builds and StartAsync's another agent for the wolverinedb://ravendb/durability
URI via the IAgentFamily/MessageStoreCollection path once the node has
assumed leadership and EvaluateAssignmentsAsync runs.

The two agent instances are registered under different URIs
(internal://scheduledjobs vs wolverinedb://ravendb/durability) so the
NodeAgentController's Agents dictionary doesn't dedupe them, and they
end up running concurrently in the same process.

That alone would be wasteful but mostly harmless. The real damage is
that both agents share the singleton RavenDbMessageStore instance, and
therefore share its _scheduledLock and _lastScheduledLockIndex fields.
TryAttainScheduledJobLockAsync branches on _scheduledLock == null:

  - Agent JasperFx#1 takes the initial-acquire path, PUTs the CE value, sets
    _scheduledLock and _lastScheduledLockIndex.
  - Agent JasperFx#2 sees _scheduledLock != null on the shared instance and
    takes the renewal path, PUT-ing with the current
    _lastScheduledLockIndex - which succeeds against the server because
    the index is still valid.
  - Both agents now believe they hold the lock. They both query the
    scheduled-jobs index, fetch the same envelopes at the same change
    vector, and race to mark them Incoming. The loser surfaces
    Raven.Client.Exceptions.ConcurrencyException; Wolverine requeues
    the failed message; the timeout fires twice.

Drop the eager StartTimers() call. The cluster-managed agent built via
MessageStoreCollection.BuildAgentAsync gets StartAsync called by
NodeAgentController.StartAgentAsync and is the single owner of the
scheduled-jobs poller and recovery loop. The agent returned from
StartScheduledJobs is held by WolverineRuntime.DurableScheduledJobs
purely for its disposal-time StopAsync; its timers are never started,
and StopAsync is null-safe.
Boots the host in Solo mode and counts every RavenDbDurabilityAgent
with running timers, sourced from both NodeAgentController.Agents
(cluster-managed) and WolverineRuntime.DurableScheduledJobs
(boot-path CompositeAgent, reached via reflection so the test
project does not need adding to InternalsVisibleTo).

Pre-fix: count is 2 (boot agent + cluster agent both polling).
Post-fix: count is 1 (cluster only).
@jeremydmiller

Copy link
Copy Markdown
Member

Closing in favor of #2629, which carries the same production fix with a reflection-free regression test (adds public IsPolling / InnerAgents accessors instead of poking at private fields). Thanks @Bishbulb — kept you as co-author on the new commit. The CosmosDb note you flagged is preserved in #2629's description for follow-up.

jeremydmiller added a commit that referenced this pull request Apr 29, 2026
…2623 pattern

CosmosDbMessageStore.StartScheduledJobs follows the same eager-StartTimers()
pattern as RavenDb did before the #2623 fix, so on first read it looked like
the same bug. Investigation showed it does NOT have the bug today:
CosmosDbMessageStore.BuildAgentFamily returns null and CosmosDbMessageStore.Uri
uses the cosmosdb:// scheme rather than wolverinedb://, so MessageStoreCollection
(the IAgentFamily for the wolverinedb scheme) never registers a competing agent
through NodeAgentController. Only the agent built and started in
StartScheduledJobs polls.

This commit:
- Leaves CosmosDbMessageStore.StartScheduledJobs unchanged (still calls
  StartTimers eagerly), with a comment explaining why the equivalent RavenDb
  fix doesn't apply.
- Adds CosmosDbDurabilityAgent.IsPolling for parity with RavenDb's diagnostic.
- Adds [InternalsVisibleTo("CosmosDbTests")] so future tests can read
  WolverineRuntime.DurableScheduledJobs without reflection.
- Adds CosmosDbTests/durability_agent_lifecycle.cs as a regression guard:
  asserts exactly one CosmosDbDurabilityAgent has timers running after host
  startup. If anyone ever wires a CosmosDb-side IAgentFamily (or otherwise
  causes a second polling instance), this test will fail loudly.

All 65 CosmosDbTests pass locally against the CosmosDb emulator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller added a commit that referenced this pull request Apr 29, 2026
…ction version) (#2629)

* fix(RavenDb): only one durability agent polls after host start (#2623)

RavenDbMessageStore.StartScheduledJobs eagerly built and started a
RavenDbDurabilityAgent at boot. NodeAgentController also built *another* agent
for wolverinedb://ravendb/durability via the IAgentFamily / MessageStoreCollection
path. The result was two durability agents registered under different URIs
(internal://scheduledjobs vs wolverinedb://ravendb/durability) — the controller
didn't dedupe them, so they polled concurrently in the same process. Both agents
shared the singleton RavenDbMessageStore, leading to dual lock acquisition
and ConcurrencyException when they raced to mark the same envelopes Incoming.

Drop the eager StartTimers() call. The cluster-managed agent built via
MessageStoreCollection.BuildAgentAsync is the single owner of the scheduled-jobs
poller and recovery loop; NodeAgentController calls StartAsync on it. The agent
returned from StartScheduledJobs is held by WolverineRuntime.DurableScheduledJobs
purely for its disposal-time StopAsync, which is null-safe on the unstarted
task fields.

Supersedes PR #2623 with a reflection-free regression test:
- Adds public RavenDbDurabilityAgent.IsPolling so callers (and tests) can detect
  the multi-instance "two pollers" condition without poking at private fields.
- Adds public CompositeAgent.InnerAgents for the same reason.
- Adds [InternalsVisibleTo("RavenDbTests")] so the test can read
  WolverineRuntime.DurableScheduledJobs without reflection.

CosmosDbMessageStore.StartScheduledJobs mirrors this pattern
(agent.As<CosmosDbDurabilityAgent>().StartTimers()) and very likely has the same
bug — flagging for separate follow-up.

Co-Authored-By: Dan Bishop <bishbulb@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(CosmosDb): regression-guard test + IsPolling diagnostic for the #2623 pattern

CosmosDbMessageStore.StartScheduledJobs follows the same eager-StartTimers()
pattern as RavenDb did before the #2623 fix, so on first read it looked like
the same bug. Investigation showed it does NOT have the bug today:
CosmosDbMessageStore.BuildAgentFamily returns null and CosmosDbMessageStore.Uri
uses the cosmosdb:// scheme rather than wolverinedb://, so MessageStoreCollection
(the IAgentFamily for the wolverinedb scheme) never registers a competing agent
through NodeAgentController. Only the agent built and started in
StartScheduledJobs polls.

This commit:
- Leaves CosmosDbMessageStore.StartScheduledJobs unchanged (still calls
  StartTimers eagerly), with a comment explaining why the equivalent RavenDb
  fix doesn't apply.
- Adds CosmosDbDurabilityAgent.IsPolling for parity with RavenDb's diagnostic.
- Adds [InternalsVisibleTo("CosmosDbTests")] so future tests can read
  WolverineRuntime.DurableScheduledJobs without reflection.
- Adds CosmosDbTests/durability_agent_lifecycle.cs as a regression guard:
  asserts exactly one CosmosDbDurabilityAgent has timers running after host
  startup. If anyone ever wires a CosmosDb-side IAgentFamily (or otherwise
  causes a second polling instance), this test will fail loudly.

All 65 CosmosDbTests pass locally against the CosmosDb emulator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Dan Bishop <bishbulb@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller added a commit that referenced this pull request May 20, 2026
… of 'cosmosdb' (#2863)

Closes #2845. CosmosDbMessageStore.Uri and CosmosDbDurabilityAgent.Uri were
built as cosmosdb://durability. MessageStoreCollection (the wolverinedb agent
family) advertises every store's Uri via AllKnownAgentsAsync and distributes
them under the wolverinedb scheme, so NodeAgentController tried to start
cosmosdb://durability and findAgentAsync threw "Unrecognized agent scheme
'cosmosdb'" every poll cycle. Same class of bug as the RavenDb fix in #2261.

Switches both URIs to wolverinedb://cosmosdb/durability so the scheme resolves
to MessageStoreCollection, which builds the durability agent via the store's
BuildAgent. Because the agent is now cluster-managed (NodeAgentController calls
StartAsync on it), CosmosDbMessageStore.StartScheduledJobs must no longer
eagerly call StartTimers() — doing so would leave two pollers racing on the same
store, the #2623 duplicate-poller bug. The agent built in StartScheduledJobs is
retained only for disposal-time StopAsync. The existing durability_agent_lifecycle
guard already asserts exactly one polling agent across both sources.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller added a commit that referenced this pull request May 20, 2026
… of 'cosmosdb' (#2864)

Closes #2845. CosmosDbMessageStore.Uri and CosmosDbDurabilityAgent.Uri were
built as cosmosdb://durability. MessageStoreCollection (the wolverinedb agent
family) advertises every store's Uri via AllKnownAgentsAsync and distributes
them under the wolverinedb scheme, so NodeAgentController tried to start
cosmosdb://durability and findAgentAsync threw "Unrecognized agent scheme
'cosmosdb'" every poll cycle. Same class of bug as the RavenDb fix in #2261.

Switches both URIs to wolverinedb://cosmosdb/durability so the scheme resolves
to MessageStoreCollection, which builds the durability agent via the store's
BuildAgent. Because the agent is now cluster-managed (NodeAgentController calls
StartAsync on it), CosmosDbMessageStore.StartScheduledJobs must no longer
eagerly call StartTimers() — doing so would leave two pollers racing on the same
store, the #2623 duplicate-poller bug. The agent built in StartScheduledJobs is
retained only for disposal-time StopAsync. The existing durability_agent_lifecycle
guard already asserts exactly one polling agent across both sources.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants