Avoid rewiring sub-replicas to failed grand-primaries#3468

Open
sarthakaggarwal97 wants to merge 2 commits into valkey-io:unstable from sarthakaggarwal97:fix-sub-replica-failed-grand-primary

Conversation

@sarthakaggarwal97
Contributor

@sarthakaggarwal97 sarthakaggarwal97 commented Apr 9, 2026

This fixes the case from valkey-fuzzer#85

The short version is:

  • A is the old primary
  • B is a replica of A
  • C is another replica in the same shard

A dies, B wins failover, and C starts following B, which is the right thing.

The problem is that C can then process a stale view where B still looks like a replica of A. At that point C briefly thinks the topology is C -> B -> A.

We already have logic to flatten that kind of sub-replica chain by making C follow the grand-primary directly. Normally that is fine. Here it is not, because the grand-primary is A, and A is already failed.

So instead of staying on healthy B, C gets rewired back to dead A. Once that happens, the reconnect fails, C can start another failover, and we can end up with an extra slotless primary while B still owns the shard slots. That lines up with the fuzzer result.

The fix here is small: when we hit that sub-replica repair path, do not rewrite to the grand-primary if the grand-primary is already marked FAIL. In this case that means C just keeps following B.

Healthy sub-replica repair should still behave the same.
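The guard described above can be sketched as a small, self-contained model. This is a hypothetical simplification, not the actual code in src/cluster_legacy.c: the struct, `node_failed`, and `sub_replica_repair_target` names are invented for illustration, and the real clusterNode carries far more state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the sub-replica repair decision.
 * The real logic lives in src/cluster_legacy.c with different names. */
#define NODE_FLAG_FAIL (1 << 0)

typedef struct Node {
    const char *name;
    int flags;
    struct Node *primary; /* NULL when this node is a primary */
} Node;

static bool node_failed(const Node *n) {
    return (n->flags & NODE_FLAG_FAIL) != 0;
}

/* If `replica` is a sub-replica (its primary is itself a replica),
 * return the grand-primary it should be rewired to, or NULL to keep
 * the current primary. The fix: never rewire onto a grand-primary
 * that is already marked FAIL. */
static Node *sub_replica_repair_target(Node *replica) {
    Node *primary = replica->primary;
    if (primary == NULL || primary->primary == NULL) return NULL; /* not a sub-replica */
    Node *grand = primary->primary;
    if (node_failed(grand)) return NULL; /* guard: stay on the healthy primary */
    return grand;
}
```

With A marked FAIL and the view C -> B -> A, the function returns NULL and C keeps following B; once A is healthy again, the usual flattening onto the grand-primary still happens.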

Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
@sarthakaggarwal97 sarthakaggarwal97 force-pushed the fix-sub-replica-failed-grand-primary branch from 8f71845 to 43956ff on April 9, 2026 06:08
@codecov

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.77%. Comparing base (be52d8b) to head (4bc176a).
⚠️ Report is 2 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 91.66% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3468      +/-   ##
============================================
+ Coverage     76.44%   76.77%   +0.32%     
============================================
  Files           157      157              
  Lines         79035    79054      +19     
============================================
+ Hits          60421    60690     +269     
+ Misses        18614    18364     -250     
Files with missing lines Coverage Δ
src/cluster_legacy.c 88.04% <91.66%> (-0.08%) ⬇️

... and 19 files with indirect coverage changes


Member

@dvkashapov dvkashapov left a comment

Funnily enough, I was thinking about proposing a similar change for this exact scenario, which occurred in this CI run: https://github.com/valkey-io/valkey/actions/runs/23878117959/job/69625509673?pr=3430#step:9:7025

@dvkashapov
Member

Looking at the commit that was in the fuzzer run, it turns out it predates #2811. That PR improved stale packet detection and may have already resolved this scenario. Can you rerun with the same seed on a newer unstable?

@dvkashapov dvkashapov self-requested a review April 9, 2026 10:59
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
@sarthakaggarwal97
Contributor Author

sarthakaggarwal97 commented Apr 9, 2026

@dvkashapov good point. It's still kinda hard to reproduce even with the same seed (before or after the PR). I think #2811 helps with the issue mentioned in the fuzzer. I feel this PR is a good guard rail in scenarios where a node should not collapse a sub-replica chain onto a grand-primary that is already FAIL.

It is possible to form a scenario like this:

  1. C is following B.
  2. A is already marked FAIL in C’s local view.
  3. C receives a packet that makes B look like “I am now a replica of A”.
  4. C updates its local topology to C -> B -> A.
  5. The sub-replica repair logic runs and says “I’m a sub-replica, so I should follow my grand-primary directly”.
  6. Unpatched code rewires C from B to A.
  7. But A is dead, so C leaves a healthy primary and starts handshaking/reconnecting to a failed one.
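The seven steps above can be walked through with a toy three-node model. Everything here is hypothetical (the struct, flag, and `repair_sub_replica` are invented names, not the valkey API); it only demonstrates the decision with and without the proposed FAIL guard.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the scenario above; names are hypothetical. */
enum { FLAG_FAIL = 1 };

struct node {
    int flags;
    struct node *primary; /* NULL for a primary */
};

/* Steps 5-7: the sub-replica repair decision. With guard_enabled,
 * a FAILed grand-primary is never chosen as the new primary. */
static void repair_sub_replica(struct node *self, bool guard_enabled) {
    struct node *p = self->primary;
    if (p == NULL || p->primary == NULL) return;              /* not a sub-replica */
    if (guard_enabled && (p->primary->flags & FLAG_FAIL)) return;
    self->primary = p->primary;                               /* rewire to grand-primary */
}
```

Without the guard, the stale view C -> B -> A gets flattened onto dead A (step 6); with the guard, C keeps following healthy B.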

@sarthakaggarwal97 sarthakaggarwal97 requested a review from hpatro April 9, 2026 21:12
@hpatro
Contributor

hpatro commented Apr 9, 2026

  • Unpatched code rewires C from B to A.

If someone were to patch things, they would have the patch from #2811 first, so that's not a strong argument on its own. I'm still trying to understand this better; will get back soon.
