Fix #2623: only one durability agent polls after host start (no-reflection version) #2629
Merged
RavenDbMessageStore.StartScheduledJobs eagerly built and started a RavenDbDurabilityAgent at boot. NodeAgentController also built *another* agent for wolverinedb://ravendb/durability via the IAgentFamily / MessageStoreCollection path. The result was two durability agents registered under different URIs (internal://scheduledjobs vs wolverinedb://ravendb/durability). The controller didn't dedupe them, so they polled concurrently in the same process. Both agents shared the singleton RavenDbMessageStore, leading to dual lock acquisition and ConcurrencyException when they raced to mark the same envelopes Incoming.

Drop the eager StartTimers() call. The cluster-managed agent built via MessageStoreCollection.BuildAgentAsync is the single owner of the scheduled-jobs poller and recovery loop; NodeAgentController calls StartAsync on it. The agent returned from StartScheduledJobs is held by WolverineRuntime.DurableScheduledJobs purely for its disposal-time StopAsync, which is null-safe on the unstarted task fields.

Supersedes PR #2623 with a reflection-free regression test:

- Adds public RavenDbDurabilityAgent.IsPolling so callers (and tests) can detect the multi-instance "two pollers" condition without poking at private fields.
- Adds public CompositeAgent.InnerAgents for the same reason.
- Adds [InternalsVisibleTo("RavenDbTests")] so the test can read WolverineRuntime.DurableScheduledJobs without reflection.

CosmosDbMessageStore.StartScheduledJobs mirrors this pattern (agent.As<CosmosDbDurabilityAgent>().StartTimers()) and very likely has the same bug; flagging for separate follow-up.

Co-Authored-By: Dan Bishop <bishbulb@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…2623 pattern

CosmosDbMessageStore.StartScheduledJobs follows the same eager-StartTimers() pattern as RavenDb did before the #2623 fix, so on first read it looked like the same bug. Investigation showed it does NOT have the bug today: CosmosDbMessageStore.BuildAgentFamily returns null and CosmosDbMessageStore.Uri uses the cosmosdb:// scheme rather than wolverinedb://, so MessageStoreCollection (the IAgentFamily for the wolverinedb scheme) never registers a competing agent through NodeAgentController. Only the agent built and started in StartScheduledJobs polls.

This commit:

- Leaves CosmosDbMessageStore.StartScheduledJobs unchanged (still calls StartTimers eagerly), with a comment explaining why the equivalent RavenDb fix doesn't apply.
- Adds CosmosDbDurabilityAgent.IsPolling for parity with RavenDb's diagnostic.
- Adds [InternalsVisibleTo("CosmosDbTests")] so future tests can read WolverineRuntime.DurableScheduledJobs without reflection.
- Adds CosmosDbTests/durability_agent_lifecycle.cs as a regression guard: asserts exactly one CosmosDbDurabilityAgent has timers running after host startup. If anyone ever wires a CosmosDb-side IAgentFamily (or otherwise causes a second polling instance), this test will fail loudly.

All 65 CosmosDbTests pass locally against the CosmosDb emulator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
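The scheme argument above can be illustrated with a small sketch. It assumes, as the commit message describes, that the agent family only claims stores whose URI uses the wolverinedb scheme. SchemeRouting, IsClaimedByFamily, and the cosmosdb://example/durability URI are hypothetical names made up for this illustration; the real check inside Wolverine may look quite different.

```csharp
using System;

// Hypothetical helper, not part of Wolverine: models "the family for the
// wolverinedb scheme only claims URIs that use that scheme".
public static class SchemeRouting
{
    public const string FamilyScheme = "wolverinedb";

    // Uri.Scheme is normalized to lower case; the comparison is
    // case-insensitive anyway to keep the intent obvious.
    public static bool IsClaimedByFamily(Uri storeUri) =>
        string.Equals(storeUri.Scheme, FamilyScheme, StringComparison.OrdinalIgnoreCase);
}
```

Under that assumption, wolverinedb://ravendb/durability is claimed by the family (so a second, cluster-managed agent can appear), while a cosmosdb:// URI is not, which matches the conclusion that only the eagerly started CosmosDb agent polls.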
Supersedes #2623 by @Bishbulb with the same production fix and a reflection-free regression test.
Summary
- Drop the eager StartTimers() call in RavenDbMessageStore.StartScheduledJobs. The cluster-managed agent built via MessageStoreCollection.BuildAgentAsync is the single owner of the scheduled-jobs poller and recovery loop; NodeAgentController calls StartAsync on it. The agent returned from StartScheduledJobs is held by WolverineRuntime.DurableScheduledJobs purely for its disposal-time StopAsync, which is null-safe on the unstarted task fields.
- Adds public RavenDbDurabilityAgent.IsPolling (_recoveryTask is not null || _scheduledJob is not null) so callers (and tests) can observe the multi-instance "two pollers" condition.
- Adds public CompositeAgent.InnerAgents so callers can enumerate the underlying agents.
- Adds [InternalsVisibleTo("RavenDbTests")] so the test can read WolverineRuntime.DurableScheduledJobs without reflection.

The original bug from @Bishbulb's investigation: two RavenDbDurabilityAgent instances polled concurrently against the shared singleton store, both believing they held the scheduled-job lock, racing to mark the same envelopes Incoming, surfacing ConcurrencyException and double-firing timeouts. Full root-cause writeup is in #2623's description.

Closes #2623.
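As a rough illustration of the mechanics described in the summary, here is a self-contained sketch. SketchDurabilityAgent and PollerDemo are hypothetical stand-ins, not Wolverine's classes; only the IsPolling expression and the two URIs come from the PR text. It shows why keying agents by URI cannot dedupe the two instances, and why StopAsync on a never-started agent is harmless when the task fields are null-checked.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a durability agent; not Wolverine's actual class.
public sealed class SketchDurabilityAgent
{
    private readonly CancellationTokenSource _cancellation = new();
    private Task? _recoveryTask;  // stays null until StartAsync runs
    private Task? _scheduledJob;  // stays null until StartAsync runs

    public SketchDurabilityAgent(Uri uri) => Uri = uri;

    public Uri Uri { get; }

    // Same expression the PR description gives for IsPolling.
    public bool IsPolling => _recoveryTask is not null || _scheduledJob is not null;

    public Task StartAsync()
    {
        _recoveryTask = PollAsync(_cancellation.Token);
        _scheduledJob = PollAsync(_cancellation.Token);
        return Task.CompletedTask;
    }

    // Safe on an agent that was never started: both fields are null-checked.
    public async Task StopAsync()
    {
        _cancellation.Cancel();
        if (_recoveryTask is not null) await _recoveryTask;
        if (_scheduledJob is not null) await _scheduledJob;
        _recoveryTask = null;
        _scheduledJob = null;
    }

    private static async Task PollAsync(CancellationToken token)
    {
        try
        {
            while (!token.IsCancellationRequested)
            {
                await Task.Delay(TimeSpan.FromMilliseconds(50), token);
            }
        }
        catch (OperationCanceledException)
        {
            // Expected on shutdown.
        }
    }
}

public static class PollerDemo
{
    // eagerStart = true models the pre-fix behavior; false models the fix.
    public static async Task<int> CountPollersAsync(bool eagerStart)
    {
        var held = new SketchDurabilityAgent(new Uri("internal://scheduledjobs"));
        var clusterManaged = new SketchDurabilityAgent(new Uri("wolverinedb://ravendb/durability"));

        // Keying by URI cannot dedupe: the two agents carry different URIs.
        var byUri = new Dictionary<Uri, SketchDurabilityAgent>
        {
            [held.Uri] = held,
            [clusterManaged.Uri] = clusterManaged
        };

        if (eagerStart) await held.StartAsync();  // the eager start the fix removes
        await clusterManaged.StartAsync();         // the controller starts its own agent

        var polling = 0;
        foreach (var agent in byUri.Values)
        {
            if (agent.IsPolling) polling++;
        }

        // Stopping the unstarted agent is a no-op apart from cancelling its token.
        foreach (var agent in byUri.Values) await agent.StopAsync();
        return polling;
    }
}
```

In this sketch, CountPollersAsync(true) reports two polling agents and CountPollersAsync(false) reports one, which is the condition the regression test is meant to observe through IsPolling.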
Note
Wolverine.CosmosDb.Internals.CosmosDbMessageStore.StartScheduledJobs mirrors this pattern (agent.As<CosmosDbDurabilityAgent>().StartTimers()) and very likely has the same bug. Flagging for separate follow-up.
Test plan
- RavenDbTests pass locally (embedded RavenDB driver, single run): Failed: 0, Passed: 147, Skipped: 0, Total: 147, Duration: 46s.
- durability_agent_lifecycle.only_one_durability_agent_polls_after_host_start passes and uses no reflection.
🤖 Generated with Claude Code
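The shape of a reflection-free "exactly one poller" check might look roughly like the sketch below. It is not the test from this PR: IPollingProbe, FakeAgent, and LifecycleGuard are hypothetical names, and the real test presumably gathers its agents from WolverineRuntime.DurableScheduledJobs and CompositeAgent.InnerAgents rather than from fakes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical probe: anything exposing a public IsPolling flag.
public interface IPollingProbe
{
    bool IsPolling { get; }
}

// Fake agent used only for this illustration.
public sealed record FakeAgent(string Name, bool IsPolling) : IPollingProbe;

public static class LifecycleGuard
{
    // Throws unless exactly one of the supplied agents reports that it is polling.
    public static void AssertSinglePoller(IEnumerable<IPollingProbe> agents)
    {
        var polling = agents.Count(a => a.IsPolling);
        if (polling != 1)
        {
            throw new InvalidOperationException(
                $"Expected exactly one polling durability agent, found {polling}.");
        }
    }
}
```

Because the check reads only a public property, it needs no reflection, and it fails as soon as a second polling instance is introduced.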