Avoid rewiring sub-replicas to failed grand-primaries#3468

Open
sarthakaggarwal97 wants to merge 2 commits into valkey-io:unstable from sarthakaggarwal97:fix-sub-replica-failed-grand-primary

Conversation

@sarthakaggarwal97
Contributor

@sarthakaggarwal97 sarthakaggarwal97 commented Apr 9, 2026

This fixes the case from valkey-fuzzer#85

The short version is:

  • A is the old primary
  • B is a replica of A
  • C is another replica in the same shard

A dies, B wins failover, and C starts following B, which is the right thing.

The problem is that C can then process a stale view where B still looks like a replica of A. At that point C briefly thinks the topology is C -> B -> A.

We already have logic to flatten that kind of sub-replica chain by making C follow the grand-primary directly. Normally that is fine. Here it is not, because the grand-primary is A, and A is already failed.

So instead of staying on healthy B, C gets rewired back to dead A. Once that happens, the reconnect fails, C can start another failover, and we can end up with an extra slotless primary while B still owns the shard slots. That lines up with the fuzzer result.

The fix here is small: when we hit that sub-replica repair path, do not rewrite to the grand-primary if the grand-primary is already marked FAIL. In this case that means C just keeps following B.

Healthy sub-replica repair should still behave the same.
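The guard described above can be sketched as a small, self-contained model. This is a hypothetical simplification, not the actual code in src/cluster_legacy.c: the struct, `node_failed`, and `sub_replica_repair_target` names are invented for illustration, and the real clusterNode carries far more state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the sub-replica repair decision.
 * The real logic lives in src/cluster_legacy.c with different names. */
#define NODE_FLAG_FAIL (1 << 0)

typedef struct Node {
    const char *name;
    int flags;
    struct Node *primary; /* NULL when this node is a primary */
} Node;

static bool node_failed(const Node *n) {
    return (n->flags & NODE_FLAG_FAIL) != 0;
}

/* If `replica` is a sub-replica (its primary is itself a replica),
 * return the grand-primary it should be rewired to, or NULL to keep
 * the current primary. The fix: never rewire onto a grand-primary
 * that is already marked FAIL. */
static Node *sub_replica_repair_target(Node *replica) {
    Node *primary = replica->primary;
    if (primary == NULL || primary->primary == NULL) return NULL; /* not a sub-replica */
    Node *grand = primary->primary;
    if (node_failed(grand)) return NULL; /* guard: stay on the healthy primary */
    return grand;
}
```

With A marked FAIL and the view C -> B -> A, the function returns NULL and C keeps following B; once A is healthy again, the usual flattening onto the grand-primary still happens.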

Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
@sarthakaggarwal97 sarthakaggarwal97 force-pushed the fix-sub-replica-failed-grand-primary branch from 8f71845 to 43956ff on April 9, 2026 06:08
@codecov

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.77%. Comparing base (be52d8b) to head (4bc176a).
⚠️ Report is 2 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 91.66% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3468      +/-   ##
============================================
+ Coverage     76.44%   76.77%   +0.32%     
============================================
  Files           157      157              
  Lines         79035    79054      +19     
============================================
+ Hits          60421    60690     +269     
+ Misses        18614    18364     -250     
Files with missing lines Coverage Δ
src/cluster_legacy.c 88.04% <91.66%> (-0.08%) ⬇️

... and 19 files with indirect coverage changes


Member

@dvkashapov dvkashapov left a comment

Funnily enough, I was thinking about proposing a similar change for this exact scenario, which occurred in this CI run: https://github.com/valkey-io/valkey/actions/runs/23878117959/job/69625509673?pr=3430#step:9:7025

@dvkashapov
Member

Looking at the commit that was in the fuzzer run, it turns out it predates #2811. That PR improved stale packet detection and may have already resolved this scenario. Can you rerun with the same seed on a newer unstable?

@dvkashapov dvkashapov self-requested a review April 9, 2026 10:59
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
@sarthakaggarwal97
Contributor Author

sarthakaggarwal97 commented Apr 9, 2026

@dvkashapov good point. It's still kinda hard to reproduce even with the same seed (before or after the PR). I think #2811 helps with the issue mentioned in the fuzzer. I feel this PR is a good guard rail in scenarios where a node should not collapse a sub-replica chain onto a grand-primary that is already FAIL.

It is possible to form a scenario like this:

  1. C is following B.
  2. A is already marked FAIL in C’s local view.
  3. C receives a packet that makes B look like “I am now a replica of A”.
  4. C updates its local topology to C -> B -> A.
  5. The sub-replica repair logic runs and says “I’m a sub-replica, so I should follow my grand-primary directly”.
  6. Unpatched code rewires C from B to A.
  7. But A is dead, so C leaves a healthy primary and starts handshaking/reconnecting to a failed one.
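The seven steps above can be walked through with a toy three-node model. Everything here is hypothetical (the struct, flag, and `repair_sub_replica` are invented names, not the valkey API); it only demonstrates the decision with and without the proposed FAIL guard.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the scenario above; names are hypothetical. */
enum { FLAG_FAIL = 1 };

struct node {
    int flags;
    struct node *primary; /* NULL for a primary */
};

/* Steps 5-7: the sub-replica repair decision. With guard_enabled,
 * a FAILed grand-primary is never chosen as the new primary. */
static void repair_sub_replica(struct node *self, bool guard_enabled) {
    struct node *p = self->primary;
    if (p == NULL || p->primary == NULL) return;              /* not a sub-replica */
    if (guard_enabled && (p->primary->flags & FLAG_FAIL)) return;
    self->primary = p->primary;                               /* rewire to grand-primary */
}
```

Without the guard, the stale view C -> B -> A gets flattened onto dead A (step 6); with the guard, C keeps following healthy B.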

@sarthakaggarwal97 sarthakaggarwal97 requested a review from hpatro April 9, 2026 21:12
@hpatro
Contributor

hpatro commented Apr 9, 2026

  • Unpatched code rewires C from B to A.

If someone were to patch things, they would have the patch from #2811 first, so that's not a strong argument on its own. I'm still trying to understand this better; will get back soon.
