Skip to content

StateComposition: re-fire a dropped baseline rescan on scan completion#11680

Draft
AnkushinDaniil wants to merge 1 commit into
masterfrom
fix/state-comp-refire-dropped-rescan
Draft

StateComposition: re-fire a dropped baseline rescan on scan completion#11680
AnkushinDaniil wants to merge 1 commit into
masterfrom
fix/state-comp-refire-dropped-rescan

Conversation

@AnkushinDaniil
Copy link
Copy Markdown
Contributor

Changes

  • StateCompositionService.AnalyzeAsync fail-fast-returns Fail("Scan already in progress") when a scan already holds _scanLock, silently dropping the rescan request.
  • With an idle chain, nothing re-triggers the dropped rescan, so a baseline that needs refreshing (e.g. after an incremental-diff MissingTrieNodeException) never converges — a deadlock for any consumer waiting on the baseline to advance.
  • Record the dropped request in _rescanRequested; when the scan that held the lock completes, re-fire exactly one scan against the current head. With a frozen chain this converges in a single extra pass.

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)

Testing

Requires testing

  • Yes

If yes, did you write tests?

  • No

Notes on testing

Verified on a live node: a sensor-paced consumer that had deadlocked waiting for the baseline to advance past a dropped rescan resumed once the running scan completed and re-fired the queued rescan. A unit test for the re-fire path should be added before this leaves draft.

Documentation

Requires documentation update

  • No

Requires explanation in Release Notes

  • No

…letion

AnalyzeAsync fail-fast-dropped a rescan request when a scan already held _scanLock. With a frozen chain (a sensor-paced driver waiting on the baseline) nothing re-triggered it -- a deadlock. Record the dropped request in _rescanRequested and, when the running scan finishes, re-fire one scan against the current head so the baseline converges.
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 19, 2026

Claude finished @AnkushinDaniil's task in 4m 5s —— View job


Review: StateComposition re-fire dropped rescan

  • Gather context and read changed files
  • Read relevant agent rules (coding-style, robustness)
  • Analyze correctness, race conditions, and thread safety
  • Check for missing tests
  • Post detailed review

Summary

The fix is logically sound: recording a missed rescan request in _rescanRequested and acting on it after the holding scan completes is a clean coalescing pattern. The Interlocked.Exchange usage is correct — it serializes flag writes without a lock, and the exchange-to-zero at the re-fire site is atomic read+clear, so concurrent arrivals during the completed scan's re-fire window collapse into at most one extra scan.

Two issues block merge.


High — Missing regression test

AGENTS.md mandates: "When fixing a bug, always add a regression test." The PR author explicitly flagged this as a pre-merge requirement (unchecked checkbox, note in Testing section), but the PR is no longer in draft.

The existing test AnalyzeAsync_ReturnsScanInProgressError_WhenAlreadyRunning covers the fail-fast path but deliberately unblocks the first scan before it completes, so the re-fire logic is never exercised and _rescanRequested stays set permanently in that test without effect.

A regression test should:

  1. Block the first scan inside RunTreeVisitor
  2. Issue a second AnalyzeAsync call → asserts error and sets _rescanRequested = 1
  3. Unblock the first scan (let it complete normally)
  4. Wait for Metrics.StateCompScansCompleted to reach before + 2 (or HasScanBaseline to be set from the re-fired head)

The incremental-recovery test suite (StateCompositionServiceIncrementalRecoveryTests) and its WaitForConditionAsync helper are the right home for this test.

Fix this →%20block%20the%20first%20AnalyzeAsync%20call%20inside%20RunTreeVisitor%2C%202)%20issue%20a%20second%20AnalyzeAsync%20call%20while%20the%20first%20holds%20_scanLock%20(assert%20it%20returns%20an%20error%20containing%20%22already%20in%20progress%22)%2C%203)%20unblock%20the%20first%20scan%2C%204)%20wait%20for%20Metrics.StateCompScansCompleted%20%3D%3D%20before%20%2B%202%20(the%20original%20scan%20%2B%20the%20re-fired%20rescan)%2C%20confirming%20that%20_rescanRequested%20%3D%201%20triggered%20a%20second%20AnalyzeAsync%20call%20after%20the%20first%20completed.&repo=NethermindEth%2Fnethermind)


Medium — _shuttingDown not guarded before re-firing the queued rescan

StopAsync sets _shuttingDown = true, calls CancelScan(), then waits up to 10 s for _scanLock. Because the re-fire runs after _scanLock.Release() and before StopAsync can reacquire it, the following ordering is possible:

Scan completes → _scanLock.Release()
                              ← StopAsync acquires _scanLock, flushes, releases
FireAndForget fires AnalyzeAsync(head, None)
  → acquires _scanLock
  → starts a full trie scan against partially-disposed DB
  → exception logged by FireAndForget

The trie DB and IWorldStateManager may already be mid-teardown when the re-fired scan starts. RunIncrementalDiff guards against this with if (_shuttingDown) return;; the re-fire path should do the same:

// before:
if (Interlocked.Exchange(ref _rescanRequested, 0) == 1)

// after:
if (!_shuttingDown && Interlocked.Exchange(ref _rescanRequested, 0) == 1)

Note: ScheduleBaselineRescan (called from the MissingTrieNodeException handler) has the same gap — it predates this PR and is out of scope here, but worth a follow-up.

Fix this →


Medium — Exception/cancellation path strands _rescanRequested = 1

When the scan ends via exception (including OperationCanceledException from CancelScan()), control never reaches the post-finally re-fire check. Any _rescanRequested = 1 set during that scan is stranded.

Concretely: if scan #1 is cancelled via CancelScan() while _rescanRequested = 1 is set, and the chain is frozen so nothing else triggers a new scan, the deadlock this PR is designed to fix still occurs through this path.

The simplest fix is to consume the flag inside the finally block and store it in a local:

bool shouldRefire = false;
try
{
    // ... scan body ...
    result = Result<StateCompositionStats>.Success(stats);
    // Only commit the re-fire on successful completion:
    shouldRefire = Interlocked.Exchange(ref _rescanRequested, 0) == 1;
}
finally
{
    _currentScanCts = null;
    _scanLock.Release();
}

if (shouldRefire && !_shuttingDown)
{
    BlockHeader? head = _blockTree.Head?.Header;
    if (head?.StateRoot is not null)
        FireAndForget.Run(() => AnalyzeAsync(head, CancellationToken.None), ...);
}

This keeps the flag set if the scan is cancelled, so when a new scan eventually runs (e.g., from the next head block), it will still see _rescanRequested = 1 and re-fire — which is arguably the right behavior. Alternatively, the flag can be cleared unconditionally in finally to avoid the stranded-flag state, accepting that the re-fire only triggers on successful scan completion. Either approach is defensible; the current code is neither.


Low — XML doc comment lists three callers, now has a fourth

// src/Nethermind/Nethermind.StateComposition/Service/StateCompositionService.cs:83-98
/// Fail-fast / queued via <see cref="_scanLock"/> so back-to-back triggers collapse into one scan.

The comment says "exactly three legal callers" and the list omits the new self-re-fire path. The tripwire says "a reviewer adding one must delete this comment" — update the list or relax the tripwire to reflect the new mechanism.


Verdict

Severity Count
High 1 (missing regression test)
Medium 2 (shutdown race, cancellation strands flag)
Low 1 (doc comment)

Not mergeable until at minimum the regression test is added and the _shuttingDown guard is in place.

// just published may already trail the chain head. Re-fire one scan
// against the current head (the lock is free now) so the baseline
// converges; with a frozen chain this needs only one extra pass.
if (Interlocked.Exchange(ref _rescanRequested, 0) == 1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — _shuttingDown not checked before re-firing

StopAsync sets _shuttingDown = true, cancels the scan, then races to acquire _scanLock. Because the re-fire is dispatched after _scanLock.Release() (line 165), FireAndForget.Run can queue a new AnalyzeAsync call that later acquires the lock and runs a full trie scan against a partially-disposed DB.

RunIncrementalDiff guards this exact window with if (_shuttingDown) return; — the re-fire path should match:

Suggested change
if (Interlocked.Exchange(ref _rescanRequested, 0) == 1)
if (!_shuttingDown && Interlocked.Exchange(ref _rescanRequested, 0) == 1)

$"StorageSlots={stats.StorageSlotsTotal}");

return Result<StateCompositionStats>.Success(stats);
result = Result<StateCompositionStats>.Success(stats);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Re-fire only triggers on successful scan completion

If the scan ends via exception or OperationCanceledException (e.g. from CancelScan()), control never reaches the post-finally block. Any _rescanRequested = 1 set during that scan is stranded indefinitely — exactly the deadlock scenario this PR fixes.

Consider moving the exchange into finally so the flag is always consumed, then acting on it after the block:

bool shouldRefire = false;
try
{
    // ...scan body...
    result = Result<StateCompositionStats>.Success(stats);
    shouldRefire = Interlocked.Exchange(ref _rescanRequested, 0) == 1;
}
finally
{
    _currentScanCts = null;
    _scanLock.Release();
}

This keeps the flag set through cancellation (so the next successful scan still sees it), or clears it here unconditionally if re-firing only after a clean completion is acceptable — either is better than the current state where the flag is silently stranded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant