StateComposition: re-fire a dropped baseline rescan on scan completion by AnkushinDaniil · Pull Request #11680 · NethermindEth/nethermind

AnkushinDaniil · 2026-05-19T12:11:36Z

Changes

StateCompositionService.AnalyzeAsync fail-fast-returns Fail("Scan already in progress") when a scan already holds _scanLock, silently dropping the rescan request.
With an idle chain, nothing re-triggers the dropped rescan, so a baseline that needs refreshing (e.g. after an incremental-diff MissingTrieNodeException) never converges — a deadlock for any consumer waiting on the baseline to advance.
Record the dropped request in _rescanRequested; when the scan that held the lock completes, re-fire exactly one scan against the current head. With a frozen chain this converges in a single extra pass.

Types of changes

What types of changes does your code introduce?

Bugfix (a non-breaking change that fixes an issue)

Testing

Requires testing

Yes

If yes, did you write tests?

No

Notes on testing

Verified on a live node: a sensor-paced consumer that had deadlocked waiting for the baseline to advance past a dropped rescan resumed once the running scan completed and re-fired the queued rescan. A unit test for the re-fire path should be added before this leaves draft.

Documentation

Requires documentation update

No

Requires explanation in Release Notes

No

…letion AnalyzeAsync fail-fast-dropped a rescan request when a scan already held _scanLock. With a frozen chain (a sensor-paced driver waiting on the baseline) nothing re-triggered it -- a deadlock. Record the dropped request in _rescanRequested and, when the running scan finishes, re-fire one scan against the current head so the baseline converges.

claude · 2026-05-19T12:13:17Z

Claude finished @AnkushinDaniil's task in 4m 5s —— View job

Review: StateComposition re-fire dropped rescan

Gather context and read changed files
Read relevant agent rules (coding-style, robustness)
Analyze correctness, race conditions, and thread safety
Check for missing tests
Post detailed review

Summary

The fix is logically sound: recording a missed rescan request in _rescanRequested and acting on it after the holding scan completes is a clean coalescing pattern. The Interlocked.Exchange usage is correct — it serializes flag writes without a lock, and the exchange-to-zero at the re-fire site is atomic read+clear, so concurrent arrivals during the completed scan's re-fire window collapse into at most one extra scan.

Two issues block merge.

High — Missing regression test

AGENTS.md mandates: "When fixing a bug, always add a regression test." The PR author explicitly flagged this as a pre-merge requirement (unchecked checkbox, note in Testing section), but the PR is no longer in draft.

The existing test AnalyzeAsync_ReturnsScanInProgressError_WhenAlreadyRunning covers the fail-fast path but deliberately unblocks the first scan before it completes, so the re-fire logic is never exercised and _rescanRequested stays set permanently in that test without effect.

A regression test should:

Block the first scan inside RunTreeVisitor
Issue a second AnalyzeAsync call → asserts error and sets _rescanRequested = 1
Unblock the first scan (let it complete normally)
Wait for Metrics.StateCompScansCompleted to reach before + 2 (or HasScanBaseline to be set from the re-fired head)

The incremental-recovery test suite (StateCompositionServiceIncrementalRecoveryTests) and its WaitForConditionAsync helper are the right home for this test.

Fix this →%20block%20the%20first%20AnalyzeAsync%20call%20inside%20RunTreeVisitor%2C%202)%20issue%20a%20second%20AnalyzeAsync%20call%20while%20the%20first%20holds%20_scanLock%20(assert%20it%20returns%20an%20error%20containing%20%22already%20in%20progress%22)%2C%203)%20unblock%20the%20first%20scan%2C%204)%20wait%20for%20Metrics.StateCompScansCompleted%20%3D%3D%20before%20%2B%202%20(the%20original%20scan%20%2B%20the%20re-fired%20rescan)%2C%20confirming%20that%20_rescanRequested%20%3D%201%20triggered%20a%20second%20AnalyzeAsync%20call%20after%20the%20first%20completed.&repo=NethermindEth%2Fnethermind)

Medium — `_shuttingDown` not guarded before re-firing the queued rescan

StopAsync sets _shuttingDown = true, calls CancelScan(), then waits up to 10 s for _scanLock. Because the re-fire runs after _scanLock.Release() and before StopAsync can reacquire it, the following ordering is possible:

Scan completes → _scanLock.Release()
                              ← StopAsync acquires _scanLock, flushes, releases
FireAndForget fires AnalyzeAsync(head, None)
  → acquires _scanLock
  → starts a full trie scan against partially-disposed DB
  → exception logged by FireAndForget

The trie DB and IWorldStateManager may already be mid-teardown when the re-fired scan starts. RunIncrementalDiff guards against this with if (_shuttingDown) return;; the re-fire path should do the same:

// before:
if (Interlocked.Exchange(ref _rescanRequested, 0) == 1)

// after:
if (!_shuttingDown && Interlocked.Exchange(ref _rescanRequested, 0) == 1)

Note: ScheduleBaselineRescan (called from the MissingTrieNodeException handler) has the same gap — it predates this PR and is out of scope here, but worth a follow-up.

Fix this →

Medium — Exception/cancellation path strands `_rescanRequested = 1`

When the scan ends via exception (including OperationCanceledException from CancelScan()), control never reaches the post-finally re-fire check. Any _rescanRequested = 1 set during that scan is stranded.

Concretely: if scan #1 is cancelled via CancelScan() while _rescanRequested = 1 is set, and the chain is frozen so nothing else triggers a new scan, the deadlock this PR is designed to fix still occurs through this path.

The simplest fix is to consume the flag inside the finally block and store it in a local:

bool shouldRefire = false;
try
{
    // ... scan body ...
    result = Result<StateCompositionStats>.Success(stats);
    // Only commit the re-fire on successful completion:
    shouldRefire = Interlocked.Exchange(ref _rescanRequested, 0) == 1;
}
finally
{
    _currentScanCts = null;
    _scanLock.Release();
}

if (shouldRefire && !_shuttingDown)
{
    BlockHeader? head = _blockTree.Head?.Header;
    if (head?.StateRoot is not null)
        FireAndForget.Run(() => AnalyzeAsync(head, CancellationToken.None), ...);
}

This keeps the flag set if the scan is cancelled, so when a new scan eventually runs (e.g., from the next head block), it will still see _rescanRequested = 1 and re-fire — which is arguably the right behavior. Alternatively, the flag can be cleared unconditionally in finally to avoid the stranded-flag state, accepting that the re-fire only triggers on successful scan completion. Either approach is defensible; the current code is neither.

Low — XML doc comment lists three callers, now has a fourth

// src/Nethermind/Nethermind.StateComposition/Service/StateCompositionService.cs:83-98
/// Fail-fast / queued via <see cref="_scanLock"/> so back-to-back triggers collapse into one scan.

The comment says "exactly three legal callers" and the list omits the new self-re-fire path. The tripwire says "a reviewer adding one must delete this comment" — update the list or relax the tripwire to reflect the new mechanism.

Verdict

Severity	Count
High	1 (missing regression test)
Medium	2 (shutdown race, cancellation strands flag)
Low	1 (doc comment)

Not mergeable until at minimum the regression test is added and the _shuttingDown guard is in place.

claude · 2026-05-19T12:17:06Z

+        // just published may already trail the chain head. Re-fire one scan
+        // against the current head (the lock is free now) so the baseline
+        // converges; with a frozen chain this needs only one extra pass.
+        if (Interlocked.Exchange(ref _rescanRequested, 0) == 1)


Medium — _shuttingDown not checked before re-firing

StopAsync sets _shuttingDown = true, cancels the scan, then races to acquire _scanLock. Because the re-fire is dispatched after _scanLock.Release() (line 165), FireAndForget.Run can queue a new AnalyzeAsync call that later acquires the lock and runs a full trie scan against a partially-disposed DB.

RunIncrementalDiff guards this exact window with if (_shuttingDown) return; — the re-fire path should match:

Suggested change

if (Interlocked.Exchange(ref _rescanRequested, 0) == 1)

if (!_shuttingDown && Interlocked.Exchange(ref _rescanRequested, 0) == 1)

claude · 2026-05-19T12:17:11Z

                             $"StorageSlots={stats.StorageSlotsTotal}");

-            return Result<StateCompositionStats>.Success(stats);
+            result = Result<StateCompositionStats>.Success(stats);


Medium — Re-fire only triggers on successful scan completion

If the scan ends via exception or OperationCanceledException (e.g. from CancelScan()), control never reaches the post-finally block. Any _rescanRequested = 1 set during that scan is stranded indefinitely — exactly the deadlock scenario this PR fixes.

Consider moving the exchange into finally so the flag is always consumed, then acting on it after the block:

bool shouldRefire = false; try { // ...scan body... result = Result<StateCompositionStats>.Success(stats); shouldRefire = Interlocked.Exchange(ref _rescanRequested, 0) == 1; } finally { _currentScanCts = null; _scanLock.Release(); }

This keeps the flag set through cancellation (so the next successful scan still sees it), or clears it here unconditionally if re-firing only after a clean completion is acceptable — either is better than the current state where the flag is silently stranded.

github-actions Bot added bug fix + reliability state+storage labels May 19, 2026

claude Bot reviewed May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StateComposition: re-fire a dropped baseline rescan on scan completion#11680

StateComposition: re-fire a dropped baseline rescan on scan completion#11680
AnkushinDaniil wants to merge 1 commit into
masterfrom
fix/state-comp-refire-dropped-rescan

AnkushinDaniil commented May 19, 2026

Uh oh!

claude Bot commented May 19, 2026 •

edited

Loading

Uh oh!

claude Bot May 19, 2026

Uh oh!

claude Bot May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	if (Interlocked.Exchange(ref _rescanRequested, 0) == 1)
	if (!_shuttingDown && Interlocked.Exchange(ref _rescanRequested, 0) == 1)

Conversation

AnkushinDaniil commented May 19, 2026

Changes

Types of changes

What types of changes does your code introduce?

Testing

Requires testing

If yes, did you write tests?

Notes on testing

Documentation

Requires documentation update

Requires explanation in Release Notes

Uh oh!

claude Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: StateComposition re-fire dropped rescan

Summary

High — Missing regression test

Medium — _shuttingDown not guarded before re-firing the queued rescan

Medium — Exception/cancellation path strands _rescanRequested = 1

Low — XML doc comment lists three callers, now has a fourth

Verdict

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 19, 2026 •

edited

Loading

Medium — `_shuttingDown` not guarded before re-firing the queued rescan

Medium — Exception/cancellation path strands `_rescanRequested = 1`