Fix #2623: only one durability agent polls after host start (no-reflection version) #2629

Merged: jeremydmiller merged 2 commits into main from fix-2623-no-reflection on Apr 29, 2026

Conversation

@jeremydmiller (Member)

Supersedes #2623 by @Bishbulb with the same production fix and a reflection-free regression test.

Summary

  • Production fix (same as RavenDb - Ensure only one durability agent is instantiated #2623): Drop the eager StartTimers() call in RavenDbMessageStore.StartScheduledJobs. The cluster-managed agent built via MessageStoreCollection.BuildAgentAsync is the single owner of the scheduled-jobs poller and recovery loop; NodeAgentController calls StartAsync on it. The agent returned from StartScheduledJobs is held by WolverineRuntime.DurableScheduledJobs purely for its disposal-time StopAsync, which is null-safe on the unstarted task fields.
  • Reflection-free regression test:
    • Adds public RavenDbDurabilityAgent.IsPolling (_recoveryTask is not null || _scheduledJob is not null) so callers (and tests) can observe the multi-instance "two pollers" condition.
    • Adds public CompositeAgent.InnerAgents so callers can enumerate the underlying agents.
    • Adds [InternalsVisibleTo("RavenDbTests")] so the test can read WolverineRuntime.DurableScheduledJobs without reflection.
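
The diagnostic described above can be illustrated with a minimal, self-contained sketch. This is not Wolverine's actual code: `SketchDurabilityAgent` and `LoopAsync` are hypothetical names, and only the `IsPolling` expression and the null-safe `StopAsync` behavior are taken from the summary.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a durability agent; not the real RavenDbDurabilityAgent.
public sealed class SketchDurabilityAgent
{
    private readonly CancellationTokenSource _cancellation = new();
    private Task? _recoveryTask;   // stays null until StartTimers() is called
    private Task? _scheduledJob;   // stays null until StartTimers() is called

    // Same shape as the expression quoted in the summary.
    public bool IsPolling => _recoveryTask is not null || _scheduledJob is not null;

    public void StartTimers()
    {
        _recoveryTask = Task.Run(() => LoopAsync(_cancellation.Token));
        _scheduledJob = Task.Run(() => LoopAsync(_cancellation.Token));
    }

    // Null-safe: an agent that was never started can still be stopped on disposal.
    public async Task StopAsync()
    {
        _cancellation.Cancel();
        if (_recoveryTask is not null) await _recoveryTask;
        if (_scheduledJob is not null) await _scheduledJob;
    }

    // Placeholder polling loop; a real agent would do recovery or scheduled-job work here.
    private static async Task LoopAsync(CancellationToken token)
    {
        try
        {
            while (!token.IsCancellationRequested)
            {
                await Task.Delay(TimeSpan.FromMilliseconds(50), token);
            }
        }
        catch (OperationCanceledException)
        {
            // Expected when StopAsync cancels the token.
        }
    }
}
```

In this sketch, an agent that is built but never started reports `IsPolling == false` and can still be stopped safely, which is the property the production fix relies on for the agent held by WolverineRuntime.DurableScheduledJobs.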

The original bug, from @Bishbulb's investigation: two RavenDbDurabilityAgent instances polled concurrently against the shared singleton store. Both believed they held the scheduled-job lock and raced to mark the same envelopes Incoming, which surfaced as ConcurrencyException and double-fired timeouts. The full root-cause writeup is in #2623's description.

Closes #2623.

Note

Wolverine.CosmosDb.Internals.CosmosDbMessageStore.StartScheduledJobs mirrors this pattern (agent.As<CosmosDbDurabilityAgent>().StartTimers()) and very likely has the same bug. Flagging for separate follow-up.

Test plan

  • All 147 RavenDbTests pass locally (embedded RavenDB driver, single run): Failed: 0, Passed: 147, Skipped: 0, Total: 147, Duration: 46s.
  • New durability_agent_lifecycle.only_one_durability_agent_polls_after_host_start passes and uses no reflection.
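
As a rough illustration of what a reflection-free "exactly one poller" check could look like, here is a sketch that flattens composite agents and counts those reporting `IsPolling`. All types here (`ISketchAgent`, `SketchPollingAgent`, `SketchCompositeAgent`, `PollerCheck`) are hypothetical stand-ins; the actual test in this PR may be structured differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical marker interface standing in for an agent abstraction.
public interface ISketchAgent { }

// Hypothetical leaf agent exposing a public IsPolling flag.
public sealed class SketchPollingAgent : ISketchAgent
{
    public SketchPollingAgent(bool isPolling) => IsPolling = isPolling;
    public bool IsPolling { get; }
}

// Hypothetical composite exposing its children, in the spirit of InnerAgents.
public sealed class SketchCompositeAgent : ISketchAgent
{
    public SketchCompositeAgent(params ISketchAgent[] inner) => InnerAgents = inner;
    public IReadOnlyList<ISketchAgent> InnerAgents { get; }
}

public static class PollerCheck
{
    // Flattens any composites, then counts leaf agents that report IsPolling.
    public static int CountPolling(IEnumerable<ISketchAgent> agents) =>
        agents.SelectMany(a => Flatten(a))
              .OfType<SketchPollingAgent>()
              .Count(a => a.IsPolling);

    private static IEnumerable<ISketchAgent> Flatten(ISketchAgent agent) =>
        agent is SketchCompositeAgent composite
            ? composite.InnerAgents.SelectMany(a => Flatten(a))
            : new[] { agent };
}
```

A regression guard in this style would assert that `PollerCheck.CountPolling(...)` equals 1 over every agent reachable from the runtime. A value of 2 would correspond to the "two pollers" condition described in this PR.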

🤖 Generated with Claude Code

RavenDbMessageStore.StartScheduledJobs eagerly built and started a
RavenDbDurabilityAgent at boot. NodeAgentController also built *another* agent
for wolverinedb://ravendb/durability via the IAgentFamily / MessageStoreCollection
path. The result was two durability agents registered under different URIs
(internal://scheduledjobs vs wolverinedb://ravendb/durability) — the controller
didn't dedupe them, so they polled concurrently in the same process. Both agents
shared the singleton RavenDbMessageStore, leading to dual lock acquisition
and ConcurrencyException when they raced to mark the same envelopes Incoming.

Drop the eager StartTimers() call. The cluster-managed agent built via
MessageStoreCollection.BuildAgentAsync is the single owner of the scheduled-jobs
poller and recovery loop; NodeAgentController calls StartAsync on it. The agent
returned from StartScheduledJobs is held by WolverineRuntime.DurableScheduledJobs
purely for its disposal-time StopAsync, which is null-safe on the unstarted
task fields.

Supersedes PR #2623 with a reflection-free regression test:
- Adds public RavenDbDurabilityAgent.IsPolling so callers (and tests) can detect
  the multi-instance "two pollers" condition without poking at private fields.
- Adds public CompositeAgent.InnerAgents for the same reason.
- Adds [InternalsVisibleTo("RavenDbTests")] so the test can read
  WolverineRuntime.DurableScheduledJobs without reflection.

CosmosDbMessageStore.StartScheduledJobs mirrors this pattern
(agent.As<CosmosDbDurabilityAgent>().StartTimers()) and very likely has the same
bug — flagging for separate follow-up.

Co-Authored-By: Dan Bishop <bishbulb@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…2623 pattern

CosmosDbMessageStore.StartScheduledJobs follows the same eager-StartTimers()
pattern as RavenDb did before the #2623 fix, so on first read it looked like
the same bug. Investigation showed it does NOT have the bug today:
CosmosDbMessageStore.BuildAgentFamily returns null and CosmosDbMessageStore.Uri
uses the cosmosdb:// scheme rather than wolverinedb://, so MessageStoreCollection
(the IAgentFamily for the wolverinedb scheme) never registers a competing agent
through NodeAgentController. Only the agent built and started in
StartScheduledJobs polls.
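
The scheme-based reasoning above can be sketched as a lookup from URI scheme to agent family. This is an illustration only: `SchemeRouting`, `FamilyFor`, and the dictionary are hypothetical, and the sketch does not reproduce how NodeAgentController actually resolves families. The one mapping it contains comes from the description of MessageStoreCollection as the IAgentFamily for the wolverinedb scheme.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class SchemeRouting
{
    // Hypothetical registry: URI scheme -> name of the agent family claiming it.
    private static readonly Dictionary<string, string> Families = new()
    {
        ["wolverinedb"] = "MessageStoreCollection"
    };

    // Returns the family name for the store's URI scheme, or null if no family claims it.
    public static string? FamilyFor(Uri storeUri) =>
        Families.TryGetValue(storeUri.Scheme, out var family) ? family : null;
}
```

Under these assumptions, `SchemeRouting.FamilyFor(new Uri("wolverinedb://ravendb/durability"))` finds a family, while a `cosmosdb://` URI (for example the made-up `cosmosdb://example/durability`) finds none, so no second, cluster-managed agent would be built for it.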

This commit:
- Leaves CosmosDbMessageStore.StartScheduledJobs unchanged (still calls
  StartTimers eagerly), with a comment explaining why the equivalent RavenDb
  fix doesn't apply.
- Adds CosmosDbDurabilityAgent.IsPolling for parity with RavenDb's diagnostic.
- Adds [InternalsVisibleTo("CosmosDbTests")] so future tests can read
  WolverineRuntime.DurableScheduledJobs without reflection.
- Adds CosmosDbTests/durability_agent_lifecycle.cs as a regression guard:
  asserts exactly one CosmosDbDurabilityAgent has timers running after host
  startup. If anyone ever wires a CosmosDb-side IAgentFamily (or otherwise
  causes a second polling instance), this test will fail loudly.

All 65 CosmosDbTests pass locally against the CosmosDb emulator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jeremydmiller merged commit 314ab29 into main on Apr 29, 2026 (21 checks passed)
jeremydmiller deleted the fix-2623-no-reflection branch on April 29, 2026 at 21:49