Avoid rewiring sub-replicas to failed grand-primaries #3468
sarthakaggarwal97 wants to merge 2 commits into valkey-io:unstable
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
8f71845 to 43956ff
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## unstable #3468 +/- ##
============================================
+ Coverage 76.44% 76.77% +0.32%
============================================
Files 157 157
Lines 79035 79054 +19
============================================
+ Hits 60421 60690 +269
+ Misses 18614 18364 -250
Funnily enough, I was thinking about proposing a similar change for this exact scenario, which occurred in this CI run: https://github.com/valkey-io/valkey/actions/runs/23878117959/job/69625509673?pr=3430#step:9:7025
|
Looking at the commit that was in the fuzzer run, it turns out it predates #2811. That PR improved stale packet detection and may have already resolved this scenario. Can you rerun with the same seed on a newer unstable?
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
|
@dvkashapov good point. It's still kinda hard to reproduce even with the same seed (before or after the PR). I think #2811 helps with the issue mentioned in the fuzzer. I feel this PR is a good guard rail in scenarios where a node should not collapse a sub-replica chain onto a grand-primary that is already FAIL. It is possible to form a scenario like this -
|
If someone were to patch things, they would have the patch from #2811 first, so that is not a strong argument. I'm still trying to understand this better. Will get back soon.
This fixes the case from valkey-fuzzer#85
The short version is:
- A is the old primary
- B is a replica of A
- C is another replica in the same shard

A dies, B wins failover, and C starts following B, which is the right thing.

The problem is that C can then process a stale view where B still looks like a replica of A. At that point C briefly thinks the topology is C -> B -> A.

We already have logic to flatten that kind of sub-replica chain by making C follow the grand-primary directly. Normally that is fine. Here it is not, because the grand-primary is A, and A is already failed.

So instead of staying on healthy B, C gets rewired back to dead A. Once that happens, the reconnect fails, C can start another failover, and we can end up with an extra slotless primary while B still owns the shard slots. That lines up with the fuzzer result.

The fix here is small: when we hit that sub-replica repair path, do not rewrite to the grand-primary if the grand-primary is already marked FAIL. In this case that means C just keeps following B.

Healthy sub-replica repair should still behave the same.
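
To make the guard concrete, here is a minimal sketch of the decision in isolation. It is a simplified, hypothetical model and not the actual Valkey cluster code: the `node` struct, `NODE_FLAG_FAIL`, `node_failed`, and `pick_replication_target` are names invented for illustration only.

```c
#include <stddef.h>

/* Hypothetical, simplified model of a cluster node as seen by C.
 * These names are illustrative and do not match the real Valkey structs. */
#define NODE_FLAG_FAIL (1 << 0)

typedef struct node {
    const char *name;
    int flags;            /* bitmask, NODE_FLAG_FAIL when marked FAIL */
    struct node *primary; /* NULL when this node is itself a primary */
} node;

static int node_failed(const node *n) {
    return (n->flags & NODE_FLAG_FAIL) != 0;
}

/* Given the node we currently replicate from, return the node we should
 * follow after the sub-replica repair step. */
static node *pick_replication_target(node *current) {
    node *grand = current->primary;

    /* Our primary is a real primary: there is no chain to flatten. */
    if (grand == NULL) return current;

    /* Guard: the grand-primary is already FAIL, so stay where we are
     * instead of rewiring onto a dead node. */
    if (node_failed(grand)) return current;

    /* Healthy case: flatten C -> B -> A into C -> A. */
    return grand;
}
```

With A marked FAIL and a stale view in which B still points at A, `pick_replication_target(&b)` returns B, so C keeps following B. If A is healthy, the same call returns A, which matches the existing flattening behavior.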