fix(sync): prevent redundant runtime download during warp sync #3200

Closed
replghost wants to merge 2 commits into paritytech:main from replghost:fix/warp-sync-redundant-runtime-download

@replghost replghost commented Apr 16, 2026

Contributor

Summary

set_chain_information() in warp_sync.rs unconditionally resets warped_block_ty to AlreadyVerified and runtime_download to NotStarted, even when warp sync has already completed fragment verification and is ready for the runtime download. When a GrandPa commit advances finality by 1 block mid-warp-sync, this can deadlock the state machine or cause an unnecessary extra round-trip.

Problem

Sequence without fix:

  1. Warp sync verifies final fragment → warped_block_ty = Normal, ready for runtime download
  2. GrandPa commit for block N+1 → set_chain_information(N+1)
  3. Unconditionally resets warped_block_ty = AlreadyVerified and runtime_download = NotStarted

Two consequences:

  • Deadlock near the tip: desired_requests needs Normal to return a runtime download request, but all sources are at or below the new finalized height — no warp sync fragments are requested either. The state machine is stuck permanently.
  • Extra round-trip mid-sync: If sources are far ahead, the machine requests another warp sync fragment, verifies it, returns to Normal, then starts the runtime download. One unnecessary P2P round-trip + verification cycle.
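
The two consequences above can be replayed on a toy model. This is a minimal sketch, not smoldot's code: `WarpSyncModel`, `set_chain_information_unconditional`, `best_source_finalized`, and `replay_commit` are hypothetical stand-ins, and the request logic is reduced to the two conditions described in this section.

```rust
// Simplified stand-ins for illustration; these are not smoldot's real types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WarpedBlockTy {
    AlreadyVerified,
    Normal,
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RuntimeDownload {
    NotStarted,
    Downloading,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DesiredRequest {
    WarpSyncFragments,
    RuntimeStorageProof,
}

struct WarpSyncModel {
    warped_header_number: u64,
    warped_block_ty: WarpedBlockTy,
    runtime_download: RuntimeDownload,
    /// Highest finalized block number reported by any source (assumed field).
    best_source_finalized: u64,
}

impl WarpSyncModel {
    /// Pre-fix behavior as described above: both fields are reset unconditionally.
    fn set_chain_information_unconditional(&mut self, new_finalized: u64) {
        self.warped_header_number = new_finalized;
        self.warped_block_ty = WarpedBlockTy::AlreadyVerified;
        self.runtime_download = RuntimeDownload::NotStarted;
    }

    fn desired_requests(&self) -> Vec<DesiredRequest> {
        let mut out = Vec::new();
        // Fragments are only requested from sources strictly ahead of us.
        if self.best_source_finalized > self.warped_header_number {
            out.push(DesiredRequest::WarpSyncFragments);
        }
        // The runtime download is only requested once the block is `Normal`.
        if self.warped_block_ty == WarpedBlockTy::Normal
            && self.runtime_download == RuntimeDownload::NotStarted
        {
            out.push(DesiredRequest::RuntimeStorageProof);
        }
        out
    }
}

/// Starts at block 100 in `Normal`, applies a commit for block 101,
/// and returns what the model asks for afterwards.
fn replay_commit(best_source_finalized: u64) -> Vec<DesiredRequest> {
    let mut ws = WarpSyncModel {
        warped_header_number: 100,
        warped_block_ty: WarpedBlockTy::Normal,
        runtime_download: RuntimeDownload::NotStarted,
        best_source_finalized,
    };
    ws.set_chain_information_unconditional(101);
    ws.desired_requests()
}
```

In this model, `replay_commit(101)` returns an empty list (the deadlock near the tip), while `replay_commit(500)` returns only a fragment request (the extra round-trip).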

Fix

Three changes to set_chain_information:

  1. Short-circuit on same header hash — if the incoming chain information has the same header hash as warped_header_hash, return early. Nothing to update.

  2. Preserve warped_block_ty = Normal — a GrandPa commit advancing finality should not force another fragment verification round when we've already verified past that point.

  3. Only reset runtime_download when the state root changed — cached storage proofs are valid as long as the state root matches. When an adjacent block has the same state root, preserve the runtime download state.
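
A rough sketch of how the three changes fit together, on a simplified model rather than the real `warp_sync.rs`. The field names follow this description; `WarpSyncModel`, `Header`, and `model_at_normal` are hypothetical, and other `WarpedBlockTy` variants and the remaining bookkeeping are omitted.

```rust
// Hypothetical, simplified model; types are stand-ins for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WarpedBlockTy {
    AlreadyVerified,
    Normal,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RuntimeDownload {
    NotStarted,
    Downloading,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Header {
    hash: [u8; 32],
    state_root: [u8; 32],
}

struct WarpSyncModel {
    warped_header_hash: [u8; 32],
    warped_header_state_root: [u8; 32],
    warped_block_ty: WarpedBlockTy,
    runtime_download: RuntimeDownload,
}

impl WarpSyncModel {
    fn set_chain_information(&mut self, new: Header) {
        // 1. Same header hash: nothing to update, keep all progress.
        if new.hash == self.warped_header_hash {
            return;
        }

        let old_state_root = self.warped_header_state_root;
        self.warped_header_hash = new.hash;
        self.warped_header_state_root = new.state_root;

        // 2. Preserve `Normal`; any other value is reset as before.
        if self.warped_block_ty != WarpedBlockTy::Normal {
            self.warped_block_ty = WarpedBlockTy::AlreadyVerified;
        }

        // 3. Download progress is only discarded when the state root changed.
        if self.warped_header_state_root != old_state_root {
            self.runtime_download = RuntimeDownload::NotStarted;
        }
    }
}

/// A model that has verified its last fragment and is mid-download.
fn model_at_normal() -> WarpSyncModel {
    WarpSyncModel {
        warped_header_hash: [1; 32],
        warped_header_state_root: [7; 32],
        warped_block_ty: WarpedBlockTy::Normal,
        runtime_download: RuntimeDownload::Downloading,
    }
}
```

Comparing state roots rather than block numbers keeps the check tied to what the cached proofs actually depend on.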

Test plan

  • grandpa_commit_mid_warp_sync_preserves_runtime_download — reproduces the production scenario: fragment verified at block 100 → GrandPa commit to 101 → assert runtime download still desired. Fails without fix.
  • set_chain_information_same_state_root_preserves_runtime_download — same-state-root preservation path. Fails without fix.
  • hint_causes_ancestor_key_in_desired_request / no_hint_requests_code_key — existing hint tests still pass.
  • cargo check -p smoldot -p smoldot-light clean
  • cargo fmt clean
  • cargo test -p smoldot --lib sync::warp_sync::tests — 4 passed, 0 failed

@replghost replghost force-pushed the fix/warp-sync-redundant-runtime-download branch 4 times, most recently from c10b5cb to 7c57022 Compare April 21, 2026 02:04
@replghost replghost changed the title from "perf(sync): skip redundant runtime download during warp sync" to "fix(sync): prevent redundant runtime download during warp sync" Apr 21, 2026
@replghost replghost force-pushed the fix/warp-sync-redundant-runtime-download branch from 7c57022 to aedb628 Compare April 21, 2026 02:34
@replghost

Contributor Author

How often does this happen?

On every cold start on live networks. The race window is deterministic:

  • GrandPa commits arrive every ~6 seconds (every finalized block)
  • The vulnerable window is between "last warp sync fragment verified" and "runtime download completes" — several seconds for 1.5–2.5 MiB :code over P2P
  • A GrandPa commit landing in that window is near-certain, especially with fresh checkpoints (1–3 fragments), which is the common case

Reproduction

Use runtime-download-count.mjs from smolbench:

cd smolbench
CHAIN_SPEC=chain-specs/paseo.json RUNS=5 node bench/runtime-download-count.mjs

Without fix: storage_proofs=2 (double download) on most runs. The second proof request is the redundant re-download triggered by the GrandPa commit resetting warped_block_ty to AlreadyVerified.

With fix: storage_proofs=1 consistently.

Regression test

The Rust unit test in warp_sync.rs (set_chain_information_preserves_normal_warped_block_ty) reproduces the exact sequence offline:

  1. Verify a warp sync fragment → enters Normal (runtime download phase)
  2. Call set_chain_information (simulating GrandPa commit advancing finality)
  3. Assert desired_requests() still yields StorageGetMerkleProof — no second fragment verification needed

Confirmed: fails without fix, passes with fix.

@replghost replghost force-pushed the fix/warp-sync-redundant-runtime-download branch from aedb628 to ff2c395 Compare April 21, 2026 03:17
@replghost

Contributor Author

Empirical data: redundant runtime downloads on live networks

Ran runtime-download-count.mjs from smolbench against three networks (3 runs each, smoldot 3.0.0 — before this fix):

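| Network  | Run | Time   | Fragments | Storage Proofs | Runtime Builds | Bug? |
|----------|-----|--------|-----------|----------------|----------------|------|
| Polkadot | 1   | 30.4s  | 206       | 2              | 2              |      |
| Polkadot | 2   | 27.3s  | 207       | 3              | 1              | ✗✗   |
| Polkadot | 3   | 28.4s  | 206       | 1              | 1              |      |
| Kusama   | 1   | 119.2s | 571       | 2              | 1              |      |
| Kusama   | 2   | 57.1s  | 571       | 2              | 1              |      |
| Kusama   | 3   | 110.3s | 570       | 1              | 1              |      |
| Westend  | 1   | 12.6s  | 419       | 1              | 1              |      |
| Westend  | 2   | 12.4s  | 419       | 1              | 1              |      |
| Westend  | 3   | 9.2s   | 419       | 2              | 2              |      |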

5 of 9 cold starts issued more than one storage proof request, for 6 redundant :code downloads in total. One Polkadot run downloaded the runtime three times (storage_proofs=3). Each redundant proof is ~1.5–2.5 MiB wasted bandwidth plus the fragment verification round-trip that precedes it.

Expected after fix: storage_proofs=1 consistently on all networks.
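
For reference, the table can be tallied mechanically. The sketch below copies the Storage Proofs column as listed above and applies the 1.5–2.5 MiB per-proof estimate; `RUNS`, `tally`, and `wasted_mib` are names made up for this illustration, and the MiB range is an estimate, not a measurement.

```rust
/// (network, run, storage_proofs) rows copied from the table above.
const RUNS: [(&str, u32, u32); 9] = [
    ("Polkadot", 1, 2),
    ("Polkadot", 2, 3),
    ("Polkadot", 3, 1),
    ("Kusama", 1, 2),
    ("Kusama", 2, 2),
    ("Kusama", 3, 1),
    ("Westend", 1, 1),
    ("Westend", 2, 1),
    ("Westend", 3, 2),
];

/// Returns (runs with more than one proof, total proofs beyond the first).
fn tally(runs: &[(&str, u32, u32)]) -> (usize, u32) {
    let affected = runs.iter().filter(|r| r.2 > 1).count();
    let redundant: u32 = runs.iter().map(|r| r.2.saturating_sub(1)).sum();
    (affected, redundant)
}

/// Estimated wasted bandwidth range in MiB, assuming 1.5 to 2.5 MiB per redundant proof.
fn wasted_mib(redundant: u32) -> (f64, f64) {
    (f64::from(redundant) * 1.5, f64::from(redundant) * 2.5)
}
```

`tally(&RUNS)` returns `(5, 6)` for these rows, and `wasted_mib(6)` returns `(9.0, 15.0)`.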

@replghost replghost force-pushed the fix/warp-sync-redundant-runtime-download branch 2 times, most recently from 0aec268 to 074aec6 Compare April 21, 2026 19:39
@lexnv lexnv requested a review from Copilot April 22, 2026 09:32

Copilot AI left a comment

Pull request overview

Adjusts the warp sync state machine so that advancing finalized chain information mid-warp-sync doesn’t unnecessarily force re-verification before runtime download, reducing redundant :code re-downloads and improving cold-start time.

Changes:

  • Preserve WarpedBlockTy::Normal in set_chain_information() so runtime download can proceed without an extra fragment verification cycle.
  • Add unit tests covering the regression scenario and code trie hint key selection in runtime download requests.

Comment thread lib/src/sync/warp_sync.rs Outdated
Comment on lines +581 to +585
// Preserve `Normal` if warp sync already completed fragment verification.
// A GrandPa commit advancing finality by a few blocks should not force
// another round of fragment verification before the runtime download can
// proceed. The runtime_download reset below is still necessary (the block
// hash changed so any cached proof is for the wrong state root).

Copilot AI Apr 22, 2026

The explanatory comment block is duplicated verbatim (same 5 lines repeated twice), which adds noise and makes the intent harder to read. Please remove the duplicate copy and keep a single concise comment above the conditional.

Suggested change
// Preserve `Normal` if warp sync already completed fragment verification.
// A GrandPa commit advancing finality by a few blocks should not force
// another round of fragment verification before the runtime download can
// proceed. The runtime_download reset below is still necessary (the block
// hash changed so any cached proof is for the wrong state root).

Comment thread lib/src/sync/warp_sync.rs
Comment on lines +2411 to +2413
out.extend_from_slice(target_hash);
out.extend_from_slice(&num_bytes[..block_number_bytes]);
// 1 precommit, compact-encoded as 0x04

Copilot AI Apr 22, 2026

build_justification writes target_number into the justification using &num_bytes[..block_number_bytes], which will panic if block_number_bytes > 8. Consider mirroring the min(8, block_number_bytes) + zero-padding approach used for the signed message (or assert block_number_bytes <= 8).
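
One way the padded encoding could look, as a sketch only. `encode_block_number` is a hypothetical helper, not part of the PR, and it assumes the block number is a `u64` encoded little-endian (which matches taking the leading bytes of `num_bytes`).

```rust
/// Encodes `number` as little-endian into exactly `block_number_bytes` bytes.
/// Widths above 8 are zero-padded instead of slicing out of bounds;
/// widths below 8 drop the high-order bytes, as the original slice did.
fn encode_block_number(number: u64, block_number_bytes: usize) -> Vec<u8> {
    let num_bytes = number.to_le_bytes();
    let copied = core::cmp::min(num_bytes.len(), block_number_bytes);
    let mut out = Vec::with_capacity(block_number_bytes);
    out.extend_from_slice(&num_bytes[..copied]);
    // Pad with zeros up to the requested width (no-op when already full).
    out.resize(block_number_bytes, 0);
    out
}
```

The helper would replace both `&num_bytes[..block_number_bytes]` slices in the test code, for example `out.extend_from_slice(&encode_block_number(target_number, block_number_bytes))`, where `target_number` is the assumed name of the block number being encoded.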

Comment thread lib/src/sync/warp_sync.rs
Comment on lines +2416 to +2418
out.extend_from_slice(target_hash);
out.extend_from_slice(&num_bytes[..block_number_bytes]);
out.extend_from_slice(&<[u8; 64]>::from(signature));

Copilot AI Apr 22, 2026

Same issue as above: the precommit also encodes target_number via &num_bytes[..block_number_bytes], which can panic for block_number_bytes > 8. Use padded encoding (or an explicit assertion) to make the helper robust.

Comment thread lib/src/sync/warp_sync.rs
ws.set_source_finality_state(source_id, 101);
let new_chain_info = chain_info_at_block(101, public_key, 1);
ws.set_chain_information((&new_chain_info).into());

Contributor

IIUC: the test calls set_source_finality_state(source_id, 101) immediately before set_chain_information(chain_info_at_block(101)). Since both carry the same warped_header_number, the guard in the warp sync is effectively skipped, and therefore we won't download the runtime (with or without the patch).

Contributor Author

You're right — the old test was trivially passing regardless of the patch. Replaced with grandpa_commit_mid_warp_sync_preserves_runtime_download which uses setup_warp_sync_at_normal (source at height 100, fragment verified) then advances to block 101 with the source at 101. This matches the production scenario and actually fails without the fix.

Comment thread lib/src/sync/warp_sync.rs
}

#[cfg(test)]
mod tests {

Contributor

A production repro isn't demonstrated in any of the tests

Contributor Author

Added. grandpa_commit_mid_warp_sync_preserves_runtime_download reproduces the production sequence: fragment verified at block 100 → warped_block_ty=Normal → GrandPa commit advances to 101 → set_chain_information(101) → assert runtime download still desired. Also added set_chain_information_same_state_root_preserves_runtime_download for the same-state-root preservation path.

Comment thread lib/src/sync/warp_sync.rs
// proceed. The runtime_download reset below is still necessary (the block
// hash changed so any cached proof is for the wrong state root).
if !matches!(self.warped_block_ty, WarpedBlockTy::Normal) {
self.warped_block_ty = WarpedBlockTy::AlreadyVerified;

Contributor

Is this correct? The fix only suppresses a fragment-verification cycle before the next runtime download is issued, which probably means the runtime download happens soon after either way because of the code below:

            self.runtime_download = RuntimeDownload::NotStarted {
                hint_doesnt_match: false,
            };

Contributor Author

Correct — the original fix only preserved warped_block_ty but still unconditionally reset runtime_download to NotStarted. Updated: runtime_download is now only reset when warped_header_state_root != old_state_root. When the state root is the same (common for adjacent blocks with no storage changes), the in-progress or not-yet-started runtime download is fully preserved.

Comment thread lib/src/sync/warp_sync.rs Outdated
// another round of fragment verification before the runtime download can
// proceed. The runtime_download reset below is still necessary (the block
// hash changed so any cached proof is for the wrong state root).
if !matches!(self.warped_block_ty, WarpedBlockTy::Normal) {

@lexnv lexnv Apr 22, 2026

Contributor

It looks like this is just patching a symptom. IIUC the problem is that set_chain_information wipes the runtime download / runtime calls and body download even when the incoming ValidChainInformation describes the same warp sync block we worked with.

Then a fix would be to short-circuit the logic when warped_header_hash is equal to the new header hash. This would preserve all the work we did so far. How was the 1.5/2 MiB saving computed?

It looks like the hallucinations are vibe-coded because the storage proof is requested every time the function is called?

Contributor Author

Agreed — the original fix was a band-aid. Updated in 590ef4f:

  1. Short-circuit on same header hash: set_chain_information now returns early when the incoming hash matches warped_header_hash. No state is touched.

  2. Preserve runtime download when state root unchanged: runtime_download is only reset to NotStarted when warped_header_state_root actually changed. When the state root is identical, cached proofs remain valid.

  3. Preserve warped_block_ty=Normal — kept from the original fix, so a GrandPa commit advancing finality by 1 block doesn't force another fragment verification round.

The 2 MiB saving comes from avoiding the redundant :code storage proof download when the runtime download was already in progress and the state root didn't change. The benchmark numbers from smolbench are real — measured via runtime-download-count.mjs which counts the actual number of StorageGetMerkleProof requests and their payload sizes.

@replghost replghost force-pushed the fix/warp-sync-redundant-runtime-download branch from 074aec6 to d078d4e Compare April 22, 2026 18:13
Root cause: set_chain_information() unconditionally reset warped_block_ty
to AlreadyVerified, even when already Normal (runtime download phase).
A GrandPa commit advancing finality by 1 block forced another fragment
verification before the runtime download could proceed, causing a
redundant re-download of :code (~1.5-2.5 MiB).

Fix: preserve warped_block_ty = Normal in set_chain_information when
warp sync already completed fragment verification.

Includes regression test and hint key selection tests — first tests
for warp_sync.rs. Regression test fails without fix, passes with fix.
…ownload

Address review feedback from lexnv:

1. Short-circuit set_chain_information when the incoming header hash
   matches the current warped_header_hash — nothing to update.

2. Only reset runtime_download when the state root actually changed.
   When the state root is identical, cached proofs remain valid.

3. Preserve warped_block_ty=Normal across set_chain_information so a
   GrandPa commit advancing finality by 1 block doesn't force another
   fragment verification round before the runtime download can proceed.

4. Remove duplicated comment block.

5. Rewrite tests to demonstrate the actual production scenario:
   fragment verified → runtime download desired → GrandPa commit for
   N+1 → runtime download still desired without re-verification.
   Add separate test for same-state-root preservation path.
@replghost replghost force-pushed the fix/warp-sync-redundant-runtime-download branch from 590ef4f to 5b6953e Compare April 22, 2026 18:27
@lrubasze

Contributor

@replghost I cannot reproduce the issue.
Tried smoldot-v3.1.1

Can you check on your side?

@replghost

Contributor Author

@lexnv friendly ping — I've addressed your review comments, would you mind taking another look when you get a chance?

@replghost

Contributor Author

Closing - the warp_sync_minimum_gap: 32 added in 3.1.0 eliminates this race in practice. GrandPa commits advancing by 1-2 blocks no longer trigger warp sync attempts, so the redundant runtime download doesn't happen. @lrubasze confirmed it's not reproducible on 3.1.1.

The original benchmarks were valid against 3.0.0 but the gap guard is the correct fix at the right layer.
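
To illustrate why a gap guard closes the race, here is a minimal sketch. The function name, signature, and the `>=` comparison are assumptions for illustration; only the `warp_sync_minimum_gap: 32` value comes from this thread, and the actual check in smoldot may differ.

```rust
/// Gap value mentioned in this thread.
const WARP_SYNC_MINIMUM_GAP: u64 = 32;

/// Hypothetical guard: only attempt warp sync when a source's finalized block
/// is at least `minimum_gap` blocks ahead of the local finalized block.
fn should_attempt_warp_sync(local_finalized: u64, source_finalized: u64, minimum_gap: u64) -> bool {
    // `saturating_sub` keeps sources that are behind us from underflowing.
    source_finalized.saturating_sub(local_finalized) >= minimum_gap
}
```

Under this assumption, a commit that moves a source from block 100 to 101 or 102 is below the gap and does not start a new warp sync attempt, while a source 32 or more blocks ahead still does.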

@replghost replghost closed this Apr 28, 2026
4 participants