perf(merge-sync): cache PosForwardHeaderProvider.GetBlockHeaders#11617
perf(merge-sync): cache PosForwardHeaderProvider.GetBlockHeaders#116170xDevNinja wants to merge 1 commit into
Conversation
Deep reviewVerdict: request changes. The optimization is sound in concept, but the cache slicing has an off-by-one that drops the anchor block on every cache hit in the typical PoS-sync state (the walk-back case). The test suite doesn't catch it because the mock simulates a non-walk-back scenario that doesn't occur in real sync. 1. Critical: off-by-one in
|
| # | Severity | Issue |
|---|---|---|
| 1 | Blocker | Off-by-one in desiredStart drops the anchor on cache hits in the walk-back path (typical PoS sync). Test mock hides it. |
| 2 | Medium | ValidateSeals no longer runs per slice — terminal-transition forced check may be skipped. |
| 3 | Low | skipLastN semantics drift from tail-trim to front-cap. |
| 4 | Medium | Cache invalidation misses reorgs that don't touch ProcessDestination; consider subscribing to BlockTree.NewHeadBlock. |
| 5 | Nit | Const duplication, double allocation on hit, dead branch, nullable-shape cleanup. |
Fixing #1 and adding a walk-back regression test (mock start: 0 so the anchor is included) would unblock; #4 worth revisiting before merge.
30acbd5 to
376311a
Compare
|
@claude review |
|
Claude finished @LukaszRozmej's task in 3m 32s —— View job Code Review:
|
| long currentNumber = processDestination?.Number ?? long.MaxValue; | ||
| if (cachedHash != currentHash || cachedNumber != currentNumber) return null; | ||
|
|
||
| long desiredStart = Math.Min(_blockTree.BestKnownNumber + 1, currentNumber); |
There was a problem hiding this comment.
Critical: off-by-one drops the anchor block on every cache hit
ChainLevelHelper.GetStartingPoint() always returns the last processed (non-beacon) block as the anchor. In the typical active-sync state where BestKnownNumber + 1 is a beacon-info block, the walk-back loop returns (BestKnownNumber, hash_of_BestKnownNumber), so fresh[0].Number == BestKnownNumber. The cache therefore stores the anchor at index 0.
On the first call (cache miss) BuildSlice correctly returns [block@BestKnown, …]. On subsequent calls (cache hit, same BestKnownNumber):
desiredStart = BestKnownNumber + 1 // ← wrong
offset = 1
slice = [block@BestKnown+1, …] // anchor dropped
BlockDownloader relies on headers[0] being the already-known anchor that seeds AssembleSatisfiedEntries / AssembleRequest. Dropping it means the body for BestKnown+1 is never moved to a satisfied entry and lingers in _downloadRequests, and parentHeader becomes a beacon-only header on the next iteration.
Fix:
| long desiredStart = Math.Min(_blockTree.BestKnownNumber + 1, currentNumber); | |
| long desiredStart = Math.Min(_blockTree.BestKnownNumber, currentNumber); |
| public void SetUp() | ||
| { | ||
| _chainLevelHelper = Substitute.For<IChainLevelHelper>(); | ||
| _chainLevelHelper.GetNextHeaders(default, default, default).ReturnsForAnyArgs(_ => BuildSequentialHeaders(start: 1, count: CachedBatchSize)); |
There was a problem hiding this comment.
Test mock hides the off-by-one bug
The mock returns BuildSequentialHeaders(start: 1, count: 64) with BestKnownNumber = 0. This means desiredStart = Math.Min(0+1, 1000) = 1 = cacheStart, so offset = 0 — the exact case that avoids the off-by-one.
In production, ChainLevelHelper.GetStartingPoint() returns the anchor at BestKnownNumber, so fresh[0].Number == BestKnownNumber. The mock should match this:
| _chainLevelHelper.GetNextHeaders(default, default, default).ReturnsForAnyArgs(_ => BuildSequentialHeaders(start: 1, count: CachedBatchSize)); | |
| _chainLevelHelper.GetNextHeaders(default, default, default).ReturnsForAnyArgs(_ => BuildSequentialHeaders(start: 0, count: CachedBatchSize)); |
With this change cacheStart = 0 = BestKnownNumber, so desiredStart = BestKnownNumber + 1 = 1 ≠ 0 and the off-by-one is exposed. The Cached_slice_advances_with_best_known_number assertion second[0]!.Number.Should().Be(21) would also need updating to Be(20) (the anchor at the new BestKnownNumber).
|
|
||
| // Fetch a larger batch than asked so subsequent peer allocations can be served from the cache. | ||
| int fetchSize = Math.Max(maxHeader * CacheBatchMultiplier, MinCachedHeaderBatchSize); | ||
| BlockHeader?[]? fresh = chainLevelHelper.GetNextHeaders(fetchSize, long.MaxValue, skipLastBlockCount: 0); |
There was a problem hiding this comment.
Medium: skipLastN semantics differ from the original
The original code forwarded skipLastN directly to ChainLevelHelper as skipLastBlockCount, which appended skipLastN extra headers to the fetch window and then trimmed them from the tail — effectively ensuring skipLastN blocks at the tip of the available chain are not included.
With skipLastBlockCount: 0 here, the fetch retrieves fetchSize headers right up to the tip, caches them all, and then trims skipLastN at serve time from the end of the cached window. In the end-of-chain regime where ChainLevelHelper returns fewer than fetchSize headers (i.e. close to ProcessDestination), the two semantics diverge: the original would exclude the last skipLastN of the available chain; this version may include them if the slice happens to reach the end of the cache.
At minimum worth a comment; worst case it may serve headers too close to an in-progress finalization boundary.
|
|
||
| // Alternatively we can do this in BeaconHeadersSyncFeed, but this seems easier. | ||
| ValidateSeals(headers!, cancellation); | ||
| ValidateSeals(fresh!, cancellation); |
There was a problem hiding this comment.
Medium: ValidateSeals no longer runs on each served slice
The original code called ValidateSeals on every returned batch. Post-merge seal validation is a no-op, but the method also contains the terminal-block transition check:
bool terminalBlock = !lastBlock && headers[i+1].Difficulty == 0 && headers[i].Difficulty != 0;
bool forceValidation = lastBlock || i == randomNumberForValidation || terminalBlock;Running it once on the large fresh array means: (a) the forced last-element check and the random-index check are fixed at fill time, not per-slice; (b) terminal-block transitions at a slice boundary (not present in the original fetch) are never force-validated on cache hits. This is safe in pure PoS but is a latent risk in the merge-transition region. Consider calling ValidateSeals(slice, cancellation) on cache hits too — it's cheap post-merge and restores the original safety invariant.
Fetches a 4x batch from ChainLevelHelper, slices per-call, and serves from cache while BeaconPivot.ProcessDestination is unchanged and the desired start sits inside the cached range. Closes NethermindEth#8573.
376311a to
d9849ce
Compare
|
@claude re-review |
|
Claude finished @LukaszRozmej's task in 3m 48s —— View job Re-review:
|
| # | Severity | Finding | Status |
|---|---|---|---|
| 1 | Critical | Off-by-one in TryServeFromCache (desiredStart = BKN+1) drops anchor on cache hit in walk-back path |
Still present — line 98 unchanged |
| 2 | Critical | Test mock uses start: 1 / BKN=0 (no-walk-back shape) and hides the bug |
Still present — line 36 unchanged |
| 3 | Medium | ValidateSeals not re-run per slice; terminal-transition forced check missed at slice boundaries |
Still present — line 72 unchanged |
| 4 | Medium | Cache invalidation misses intra-range reorgs that don't change ProcessDestination |
Still present |
| 5 | Low | skipLastN semantics drift: original trimmed from chain tail; PR front-caps the cached window |
Still present — line 64 unchanged |
| 6 | Nit | MinCachedHeaderBatchSize = 32 duplicated from PowForwardHeaderProvider (base class already has it at line 40) |
Still present — line 33 unchanged |
| 7 | Nit | cached.Length == 0 branch on line 91 is unreachable (UpdateCache guards on MinCachedHeaderBatchSize) |
Still present |
| 8 | Nit | Double allocation per cache hit (new BlockHeader[take] + ToPooledList) |
Still present |
Critical Analysis (re-confirmed against ChainLevelHelper.cs)
Reading ChainLevelHelper.GetStartingPoint() directly confirms the blocker. The method has two exit paths:
Path A — no walk-back (block at BKN+1 is already processed, !IsBeaconInfo):
// line 179-182
if (!beaconMainChainBlock.IsBeaconInfo)
return (startingPoint, beaconMainChainBlock.BlockHash); // returns BKN+1Here headers[0].Number == BKN+1. The +1 in line 98 is coincidentally correct.
Path B — walk-back (typical active beacon sync, BKN+1 is IsBeaconInfo):
// ChainLevelHelper.cs:207-208 (explicit comment):
// "Note: the starting point, points to the non-beacon info block.
// MergeBlockDownloader does not download the first header so this is deliberate"
// Decrements startingPoint → returns (BKN, hash_of_BKN)
Here headers[0].Number == BKN. On every cache hit: desiredStart = BKN+1, offset = 1, anchor dropped.
The test only exercises Path A (start: 1, BKN=0 → cacheStart == BKN+1). Path B (the dominant production state) is not covered. Inline comments with exact fix suggestions have been posted on lines 98 and 36.
Verdict
Not mergeable. The critical off-by-one (finding #1) and the concealing test mock (finding #2) are both unaddressed. The medium findings (#3, #4) should also be resolved or explicitly acknowledged before merge.
| long currentNumber = processDestination?.Number ?? long.MaxValue; | ||
| if (cachedHash != currentHash || cachedNumber != currentNumber) return null; | ||
|
|
||
| long desiredStart = Math.Min(_blockTree.BestKnownNumber + 1, currentNumber); |
There was a problem hiding this comment.
[Critical — still present] Off-by-one drops the anchor on every cache hit in typical PoS sync
ChainLevelHelper.GetStartingPoint() has two exit paths:
- No walk-back (block at
BestKnownNumber+1is already processed): returns(BestKnownNumber+1, hash)→headers[0].Number == BestKnownNumber+1. The+1here is correct. - Walk-back (typical active beacon sync — block at
BestKnownNumber+1isIsBeaconInfo): the do/while loop decrementsstartingPointonce toBestKnownNumberand returns(BestKnownNumber, hash)— confirmed byChainLevelHelper.cs:207-208: "the starting point, points to the non-beacon info block. MergeBlockDownloader does not download the first header so this is deliberate". In this caseheaders[0].Number == BestKnownNumberand the+1is wrong.
On a cache hit (same BestKnownNumber, walk-back path):
cacheStart = BestKnownNumber (from first call)
desiredStart = BestKnownNumber + 1 (← off by one)
offset = 1
slice = [block@BKN+1, …] (anchor at BKN dropped)
BlockDownloader.AssembleRequest relies on headers[0] as the parent anchor (parentHeader), iterating headers.Skip(1) for actual downloads. Dropping the anchor means the body for BKN+1 is never moved into a satisfied entry.
| long desiredStart = Math.Min(_blockTree.BestKnownNumber + 1, currentNumber); | |
| long desiredStart = Math.Min(_blockTree.BestKnownNumber, currentNumber); |
| public void SetUp() | ||
| { | ||
| _chainLevelHelper = Substitute.For<IChainLevelHelper>(); | ||
| _chainLevelHelper.GetNextHeaders(default, default, default).ReturnsForAnyArgs(_ => BuildSequentialHeaders(start: 1, count: CachedBatchSize)); |
There was a problem hiding this comment.
[Critical — still present] Test mock covers the wrong code path and hides the off-by-one
BuildSequentialHeaders(start: 1, count: CachedBatchSize) with BestKnownNumber = 0 models the no-walk-back path where headers[0].Number == BestKnownNumber+1 == 1. In this shape desiredStart = BKN+1 = 1 = cacheStart, so offset = 0 and the off-by-one is invisible.
Production uses the walk-back path (BestKnownNumber+1 is IsBeaconInfo) which returns headers[0].Number == BestKnownNumber == 0. The mock should match that:
| _chainLevelHelper.GetNextHeaders(default, default, default).ReturnsForAnyArgs(_ => BuildSequentialHeaders(start: 1, count: CachedBatchSize)); | |
| _chainLevelHelper.GetNextHeaders(default, default, default).ReturnsForAnyArgs(_ => BuildSequentialHeaders(start: 0, count: CachedBatchSize)); |
With start: 0: cacheStart = 0 = BestKnownNumber. The buggy desiredStart = BKN+1 = 1 ≠ 0 → offset = 1 → anchor dropped. The test would then fail, exposing the bug. The Cached_slice_advances_with_best_known_number assertion on line 107 (second[0]!.Number.Should().Be(21)) would also need to change to Be(20) once the code fix is applied.
Resolves #8573
Changes
PosForwardHeaderProvider.GetBlockHeadersnow fetches a4 * maxHeaderbatch fromChainLevelHelperonce, caches it, and serves subsequent calls by slicing the cached array.IBeaconPivot.ProcessDestination(hash or number) changes, whenBestKnownNumberadvances past the cached end, or when the requested slice cannot be satisfied without going below the cached start.base.GetBlockHeaders) is unchanged; theLastResponseBatchmechanism inPowForwardHeaderProvidercontinues to drive that branch.ValidateSealsruns once per cache fill (on the freshly fetched array) rather than on every peer allocation.Types of changes
What types of changes does your code introduce?
Testing
Requires testing
If yes, did you write tests?
Notes on testing
PosForwardHeaderProviderCacheTests(5 cases) covers: cache hit on identical inputs, invalidation whenProcessDestination.Hashchanges, invalidation whenBestKnownNumberadvances past the cached range, slice advances withBestKnownNumberwhile inside the range, andskipLastNhonored on cache hits. ExistingForwardHeaderProviderTests(37/38 pass, 1 pre-existing skip) continue to pass.Documentation
Requires documentation update
Requires explanation in Release Notes
Remarks
Invalidation strategy follows the hint left in the issue: with no explicit signal for when
ChainLevelHelperwould return different headers, identity ofBeaconPivot.ProcessDestination(set byNewPayload/ForkchoiceUpdated) plusBestKnownNumberadvance covers the cases where the underlying chain-level data could have moved. Happy to swap to a more aggressive (e.g.BlockTree.NewHeadBlock) invalidation if reviewers prefer.