fix(sync): prevent redundant runtime download during warp sync#3200
fix(sync): prevent redundant runtime download during warp sync#3200replghost wants to merge 2 commits into
Conversation
c10b5cb to
7c57022
Compare
7c57022 to
aedb628
Compare
How often does this happen?On every cold start on live networks. The race window is deterministic:
ReproductionUse cd smolbench
CHAIN_SPEC=chain-specs/paseo.json RUNS=5 node bench/runtime-download-count.mjsWithout fix: With fix: Regression testThe Rust unit test in
Confirmed: fails without fix, passes with fix. |
aedb628 to
ff2c395
Compare
Empirical data: redundant runtime downloads on live networksRan
6 of 9 cold starts triggered redundant Expected after fix: |
0aec268 to
074aec6
Compare
There was a problem hiding this comment.
Pull request overview
Adjusts the warp sync state machine so that advancing finalized chain information mid-warp-sync doesn’t unnecessarily force re-verification before runtime download, reducing redundant :code re-downloads and improving cold-start time.
Changes:
- Preserve
WarpedBlockTy::Normalinset_chain_information()so runtime download can proceed without an extra fragment verification cycle. - Add unit tests covering the regression scenario and code trie hint key selection in runtime download requests.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Preserve `Normal` if warp sync already completed fragment verification. | ||
| // A GrandPa commit advancing finality by a few blocks should not force | ||
| // another round of fragment verification before the runtime download can | ||
| // proceed. The runtime_download reset below is still necessary (the block | ||
| // hash changed so any cached proof is for the wrong state root). |
There was a problem hiding this comment.
The explanatory comment block is duplicated verbatim (same 5 lines repeated twice), which adds noise and makes the intent harder to read. Please remove the duplicate copy and keep a single concise comment above the conditional.
| // Preserve `Normal` if warp sync already completed fragment verification. | |
| // A GrandPa commit advancing finality by a few blocks should not force | |
| // another round of fragment verification before the runtime download can | |
| // proceed. The runtime_download reset below is still necessary (the block | |
| // hash changed so any cached proof is for the wrong state root). |
| out.extend_from_slice(target_hash); | ||
| out.extend_from_slice(&num_bytes[..block_number_bytes]); | ||
| // 1 precommit, compact-encoded as 0x04 |
There was a problem hiding this comment.
build_justification writes target_number into the justification using &num_bytes[..block_number_bytes], which will panic if block_number_bytes > 8. Consider mirroring the min(8, block_number_bytes) + zero-padding approach used for the signed message (or assert block_number_bytes <= 8).
| out.extend_from_slice(target_hash); | ||
| out.extend_from_slice(&num_bytes[..block_number_bytes]); | ||
| out.extend_from_slice(&<[u8; 64]>::from(signature)); |
There was a problem hiding this comment.
Same issue as above: the precommit also encodes target_number via &num_bytes[..block_number_bytes], which can panic for block_number_bytes > 8. Use padded encoding (or an explicit assertion) to make the helper robust.
| ws.set_source_finality_state(source_id, 101); | ||
| let new_chain_info = chain_info_at_block(101, public_key, 1); | ||
| ws.set_chain_information((&new_chain_info).into()); | ||
|
|
There was a problem hiding this comment.
IIUC: the test sets set_source_finality_state(source_id, 101) immediately before set_chain_information(chain_info_at_block(101)). Since they contain the same warped_header_number this makes the guared in the warp sync effectively skipped and therefore we wont download the runtime (with or without the patch)
There was a problem hiding this comment.
You're right — the old test was trivially passing regardless of the patch. Replaced with grandpa_commit_mid_warp_sync_preserves_runtime_download which uses setup_warp_sync_at_normal (source at height 100, fragment verified) then advances to block 101 with the source at 101. This matches the production scenario and actually fails without the fix.
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { |
There was a problem hiding this comment.
A production repro isn't demonstrated in any of the tests
There was a problem hiding this comment.
Added. grandpa_commit_mid_warp_sync_preserves_runtime_download reproduces the production sequence: fragment verified at block 100 → warped_block_ty=Normal → GrandPa commit advances to 101 → set_chain_information(101) → assert runtime download still desired. Also added set_chain_information_same_state_root_preserves_runtime_download for the same-state-root preservation path.
| // proceed. The runtime_download reset below is still necessary (the block | ||
| // hash changed so any cached proof is for the wrong state root). | ||
| if !matches!(self.warped_block_ty, WarpedBlockTy::Normal) { | ||
| self.warped_block_ty = WarpedBlockTy::AlreadyVerified; |
There was a problem hiding this comment.
Is this correct? The fix only suppresses a fragment-verification cycle before the next runtime download is issued, which probably means the runtime download is happening either way soon after becase of the below:
self.runtime_download = RuntimeDownload::NotStarted {
hint_doesnt_match: false,
};There was a problem hiding this comment.
Correct — the original fix only preserved warped_block_ty but still unconditionally reset runtime_download to NotStarted. Updated: runtime_download is now only reset when warped_header_state_root != old_state_root. When the state root is the same (common for adjacent blocks with no storage changes), the in-progress or not-yet-started runtime download is fully preserved.
| // another round of fragment verification before the runtime download can | ||
| // proceed. The runtime_download reset below is still necessary (the block | ||
| // hash changed so any cached proof is for the wrong state root). | ||
| if !matches!(self.warped_block_ty, WarpedBlockTy::Normal) { |
There was a problem hiding this comment.
It looks like this is just patching a symptom. IIUC the problem is that set_chain_information wipes the runtime download / runtime calls and body download even when the incoming ValidChainInformation describes the same warp sync block we wroked with.
Then a fix would be to short-circuit the logic when warped_header_hash is equal with the new header hash. Then, this would preserve all the work we did so far. How was the 1.5/2MiB saving computed?
It looks like the hallucinations are vibe-coded because the storage proof is requested every time the function is called?
There was a problem hiding this comment.
Agreed — the original fix was a band-aid. Updated in 590ef4f:
-
Short-circuit on same header hash —
set_chain_informationnow returns early when the incoming hash matcheswarped_header_hash. No state is touched. -
Preserve runtime download when state root unchanged —
runtime_downloadis only reset toNotStartedwhenwarped_header_state_rootactually changed. When the state root is identical, cached proofs remain valid. -
Preserve
warped_block_ty=Normal— kept from the original fix, so a GrandPa commit advancing finality by 1 block doesn't force another fragment verification round.
The 2 MiB saving comes from avoiding the redundant :code storage proof download when the runtime download was already in progress and the state root didn't change. The benchmark numbers from smolbench are real — measured via runtime-download-count.mjs which counts the actual number of StorageGetMerkleProof requests and their payload sizes.
074aec6 to
d078d4e
Compare
Root cause: set_chain_information() unconditionally reset warped_block_ty to AlreadyVerified, even when already Normal (runtime download phase). A GrandPa commit advancing finality by 1 block forced another fragment verification before the runtime download could proceed, causing a redundant re-download of :code (~1.5-2.5 MiB). Fix: preserve warped_block_ty = Normal in set_chain_information when warp sync already completed fragment verification. Includes regression test and hint key selection tests — first tests for warp_sync.rs. Regression test fails without fix, passes with fix.
…ownload Address review feedback from lexnv: 1. Short-circuit set_chain_information when the incoming header hash matches the current warped_header_hash — nothing to update. 2. Only reset runtime_download when the state root actually changed. When the state root is identical, cached proofs remain valid. 3. Preserve warped_block_ty=Normal across set_chain_information so a GrandPa commit advancing finality by 1 block doesn't force another fragment verification round before the runtime download can proceed. 4. Remove duplicated comment block. 5. Rewrite tests to demonstrate the actual production scenario: fragment verified → runtime download desired → GrandPa commit for N+1 → runtime download still desired without re-verification. Add separate test for same-state-root preservation path.
590ef4f to
5b6953e
Compare
|
@replghost I cannot reproduce the issue. Can you check on your side? |
|
@lexnv friendly ping — I've addressed your review comments, would you mind taking another look when you get a chance? |
|
Closing - the The original benchmarks were valid against 3.0.0 but the gap guard is the correct fix at the right layer. |
Summary
set_chain_information()inwarp_sync.rsunconditionally resetswarped_block_tytoAlreadyVerifiedandruntime_downloadtoNotStarted, even when warp sync has already completed fragment verification and is ready for the runtime download. When a GrandPa commit advances finality by 1 block mid-warp-sync, this can deadlock the state machine or cause an unnecessary extra round-trip.Problem
Sequence without fix:
warped_block_ty = Normal, ready for runtime downloadset_chain_information(N+1)warped_block_ty = AlreadyVerifiedandruntime_download = NotStartedTwo consequences:
desired_requestsneedsNormalto return a runtime download request, but all sources are at or below the new finalized height — no warp sync fragments are requested either. The state machine is stuck permanently.Normal, then starts the runtime download. One unnecessary P2P round-trip + verification cycle.Fix
Three changes to
set_chain_information:Short-circuit on same header hash — if the incoming chain information has the same header hash as
warped_header_hash, return early. Nothing to update.Preserve
warped_block_ty = Normal— a GrandPa commit advancing finality should not force another fragment verification round when we've already verified past that point.Only reset
runtime_downloadwhen the state root changed — cached storage proofs are valid as long as the state root matches. When an adjacent block has the same state root, preserve the runtime download state.Test plan
grandpa_commit_mid_warp_sync_preserves_runtime_download— reproduces the production scenario: fragment verified at block 100 → GrandPa commit to 101 → assert runtime download still desired. Fails without fix.set_chain_information_same_state_root_preserves_runtime_download— same-state-root preservation path. Fails without fix.hint_causes_ancestor_key_in_desired_request/no_hint_requests_code_key— existing hint tests still pass.cargo check -p smoldot -p smoldot-lightcleancargo fmtcleancargo test -p smoldot --lib sync::warp_sync::tests— 4 passed, 0 failed