Fix LocalGrainDirectory membership reconciliation#10086

Merged: ReubenBond merged 4 commits into dotnet:main from ReubenBond:split/pr10085-01-membership-reconcile on May 12, 2026

ReubenBond (Member) commented May 12, 2026

Part 1 of 7 split from #10085.

Problem:
LocalGrainDirectory membership changes were applied as one-off events, so partial failures could leave directory/cache state inconsistent. Directory-owned activation cleanup also lived in Catalog, which coupled Catalog to directory ownership details.

Solution:
Process cluster membership as snapshots inside LocalGrainDirectory, reconcile the local directory/cache from each snapshot, move directory-owned activation cleanup into LocalGrainDirectory, and keep handoff work retrying while the directory is running.
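To illustrate the snapshot idea, here is a minimal Python sketch, not the Orleans implementation. The names `MembershipSnapshot`, `DirectorySketch`, and `apply_snapshot` are hypothetical, and skipping snapshots older than the last applied version is an assumption. The point it shows is that reconciliation derives the desired state from the whole snapshot, so applying the same snapshot again after a partial failure is harmless.

```python
from dataclasses import dataclass, field

ACTIVE = "Active"  # illustrative status label


@dataclass(frozen=True)
class MembershipSnapshot:
    """Hypothetical snapshot: a version plus the status of every known silo."""
    version: int
    members: dict  # silo name -> status string


@dataclass
class DirectorySketch:
    """Hypothetical stand-in for a local directory with a grain-location cache."""
    version: int = -1
    active_silos: set = field(default_factory=set)
    cache: dict = field(default_factory=dict)  # grain id -> silo name

    def apply_snapshot(self, snapshot):
        """Reconcile local state from a full snapshot; return evicted grain ids."""
        # Assumption: snapshots older than the last applied one are ignored.
        if snapshot.version < self.version:
            return []
        active = {s for s, status in snapshot.members.items() if status == ACTIVE}
        # Publish the membership view first, then clean up derived state.
        self.active_silos = active
        self.version = snapshot.version
        # Drop cache entries that point at silos no longer active.
        evicted = sorted(g for g, silo in self.cache.items() if silo not in active)
        for grain in evicted:
            del self.cache[grain]
        return evicted
```

Because the cleanup is computed from the current cache and the snapshot rather than from a single "silo X died" event, a second call with the same snapshot finds nothing left to evict.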

Stack:
Merge this PR first. Next: #10087.

Review focus:
Snapshot application ordering, directory/cache reconciliation, activation deactivation behavior, and handoff retry semantics.

ReubenBond and others added 3 commits May 11, 2026 15:20
Process cluster membership as snapshots in LocalGrainDirectory so directory state can be reconciled and retried after failures. Move silo-removal activation cleanup out of Catalog and keep handoff operations retrying until success, obsolescence, or shutdown.
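The retry rule in this commit message (retry until success, obsolescence, or shutdown) can be sketched as a small loop. This is a hedged Python illustration only: `run_handoff`, `is_obsolete`, `is_running`, and the fixed delay are hypothetical names and choices, not the API of GrainDirectoryHandoffManager.

```python
import time


def run_handoff(operation, is_obsolete, is_running, delay_seconds=0.0, sleep=time.sleep):
    """Retry `operation` until it succeeds, becomes obsolete, or the owner stops.

    Returns "succeeded", "obsolete", or "stopped".
    All parameters are hypothetical callables supplied by the caller.
    """
    while is_running():
        # Re-check obsolescence before every attempt so stale work is dropped.
        if is_obsolete():
            return "obsolete"
        try:
            operation()
            return "succeeded"
        except Exception:
            # Assumption: any failure is retryable; wait briefly and try again.
            sleep(delay_seconds)
    return "stopped"
```

A caller would pass a closure for the handoff itself plus two cheap predicates, for example one that compares the handoff's target against the latest membership and one that reflects whether the directory is still running.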

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Suppress expected shutdown failures while stopping membership processing and disposing the directory cache.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove redundant locking and cancellation from snapshot application, publish the latest directory membership before side effects, and simplify defunct entry cleanup against the current membership.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI (Contributor) left a comment

Pull request overview

This PR refactors LocalGrainDirectory to process cluster membership changes as versioned snapshots (instead of one-off silo status events), reconciling local directory/cache state from each snapshot and moving directory-owned activation cleanup out of Catalog and into LocalGrainDirectory. It also adjusts handoff execution behavior to keep retrying while the directory is running, and updates a unit test to accommodate the new IClusterMembershipService dependency.

Changes:

  • Replace ISiloStatusOracle event-driven membership updates in LocalGrainDirectory with snapshot-driven reconciliation via IClusterMembershipService.
  • Move “directory-owned activation” deactivation logic from Catalog to LocalGrainDirectory when a partition owner transitions to a terminating state.
  • Update handoff manager retry mechanics and add guards to reduce stale/incorrect handoff processing; update a test for the new constructor dependency.
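As a rough illustration of the second change, the Python sketch below selects local activations whose directory partition owner is terminating. It is an assumption-laden sketch: `activations_to_deactivate` and `owner_of` are hypothetical, and the set of terminating statuses is illustrative rather than taken from this PR.

```python
# Illustrative set of statuses treated as "terminating" (an assumption).
TERMINATING = {"ShuttingDown", "Stopping", "Dead"}


def activations_to_deactivate(local_activations, owner_of, statuses):
    """Return local grain ids whose directory owner is in a terminating state.

    local_activations: iterable of grain ids activated on this silo.
    owner_of: hypothetical function mapping a grain id to its directory owner silo.
    statuses: dict of silo name -> status string from the latest snapshot.
    """
    return sorted(
        grain
        for grain in local_activations
        # Owners missing from the snapshot are left alone in this sketch.
        if statuses.get(owner_of(grain)) in TERMINATING
    )
```

Keeping this selection next to the membership snapshot means the component that knows who owns each partition also decides which activations to clean up, which is the decoupling from Catalog that the change describes.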
Summary per file:

| File | Description |
| --- | --- |
| test/Orleans.Core.Tests/Directory/CachedGrainLocatorTests.cs | Updates the LocalGrainDirectory test construction to pass a mocked IClusterMembershipService. |
| src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs | Implements snapshot-based membership processing/reconciliation and relocates activation deactivation logic into the directory. |
| src/Orleans.Runtime/GrainDirectory/GrainDirectoryHandoffManager.cs | Changes handoff execution/retry behavior and adds additional successor checks and filtering logic. |
| src/Orleans.Runtime/Catalog/Catalog.cs | Removes directory-ownership cleanup logic and related dependencies now handled by LocalGrainDirectory. |

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 2

Comment thread src/Orleans.Runtime/GrainDirectory/GrainDirectoryHandoffManager.cs
Comment thread src/Orleans.Runtime/Catalog/Catalog.cs
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ReubenBond added this pull request to the merge queue May 12, 2026
Merged via the queue into dotnet:main with commit 80ef781 May 12, 2026
64 of 65 checks passed
ReubenBond deleted the split/pr10085-01-membership-reconcile branch May 12, 2026 16:26
github-actions bot locked and limited conversation to collaborators Jun 12, 2026

2 participants