StateComposition: re-fire a dropped baseline rescan on scan completion #11680
AnkushinDaniil wants to merge 1 commit
…letion AnalyzeAsync fail-fast-dropped a rescan request when a scan already held _scanLock. With a frozen chain (a sensor-paced driver waiting on the baseline) nothing re-triggered it -- a deadlock. Record the dropped request in _rescanRequested and, when the running scan finishes, re-fire one scan against the current head so the baseline converges.
Claude finished @AnkushinDaniil's task in 4m 5s
Review: StateComposition re-fire dropped rescan
Summary
The fix is logically sound: recording a missed rescan request in … Two issues block merge.

High: Missing regression test
AGENTS.md mandates: "When fixing a bug, always add a regression test." The PR author explicitly flagged this as a pre-merge requirement (unchecked checkbox, note in Testing section), but the PR is no longer in draft. The existing test … A regression test should:
1) block the first AnalyzeAsync call inside RunTreeVisitor,
2) issue a second AnalyzeAsync call while the first holds _scanLock (assert it returns an error containing "already in progress"),
3) unblock the first scan,
4) wait for Metrics.StateCompScansCompleted == before + 2 (the original scan + the re-fired rescan), confirming that _rescanRequested = 1 triggered a second AnalyzeAsync call after the first completed.

The incremental-recovery test suite ( …

Medium: …

| Severity | Count |
|---|---|
| High | 1 (missing regression test) |
| Medium | 2 (shutdown race, cancellation strands flag) |
| Low | 1 (doc comment) |
Not mergeable until at minimum the regression test is added and the _shuttingDown guard is in place.
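
The regression test the review asks for (block the first scan, issue a second call, unblock, then wait for two completed scans) could look roughly like the sketch below. It is a minimal illustration, not the project's test: `FakeScanService`, `TreeVisitor`, `ScansCompleted`, and `RefireRegressionSketch` are hypothetical stand-ins for `StateCompositionService`, `RunTreeVisitor`, and `Metrics.StateCompScansCompleted`, and the sketch assumes .NET 5 or later (non-generic `TaskCompletionSource`, target-typed `new`). A real test would drive the actual service and use the repository's test framework.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for StateCompositionService: it reproduces only the
// fail-fast lock, the dropped-request flag, and the re-fire described in the PR.
public sealed class FakeScanService
{
    private readonly SemaphoreSlim _scanLock = new(1, 1);
    private int _rescanRequested;
    private int _scansCompleted; // stands in for Metrics.StateCompScansCompleted

    // Stands in for RunTreeVisitor; the test swaps in a body it can block.
    public Func<Task> TreeVisitor { get; set; } = () => Task.CompletedTask;
    public int ScansCompleted => Volatile.Read(ref _scansCompleted);

    public async Task<(bool Ok, string Error)> AnalyzeAsync()
    {
        if (!await _scanLock.WaitAsync(0))
        {
            // Fail fast, but remember that a rescan was requested.
            Interlocked.Exchange(ref _rescanRequested, 1);
            return (false, "Scan already in progress");
        }

        try
        {
            await TreeVisitor();
            Interlocked.Increment(ref _scansCompleted);
        }
        finally
        {
            _scanLock.Release();
        }

        // The lock is free now: re-fire at most one scan for a dropped request.
        if (Interlocked.Exchange(ref _rescanRequested, 0) == 1)
            _ = Task.Run(() => AnalyzeAsync());

        return (true, string.Empty);
    }
}

public static class RefireRegressionSketch
{
    // Returns true when the dropped request was re-fired after the first scan.
    public static async Task<bool> RunAsync()
    {
        var gate = new TaskCompletionSource();
        var service = new FakeScanService { TreeVisitor = () => gate.Task };
        int before = service.ScansCompleted;

        // 1) The first scan takes the lock and parks inside the visitor.
        Task<(bool Ok, string Error)> first = service.AnalyzeAsync();

        // 2) The second call fails fast while the lock is held.
        (bool ok, string error) = await service.AnalyzeAsync();
        if (ok || !error.Contains("already in progress")) return false;

        // 3) Unblock the first scan and let it finish.
        gate.SetResult();
        await first;

        // 4) Original scan + re-fired rescan. The wait is bounded so that a
        //    regression fails the check instead of hanging the test run.
        return SpinWait.SpinUntil(
            () => service.ScansCompleted == before + 2,
            TimeSpan.FromSeconds(5));
    }
}
```

Blocking the visitor with a `TaskCompletionSource` rather than a timed delay keeps the ordering deterministic: the second call is guaranteed to arrive while the first still holds the lock.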
    // just published may already trail the chain head. Re-fire one scan
    // against the current head (the lock is free now) so the baseline
    // converges; with a frozen chain this needs only one extra pass.
    if (Interlocked.Exchange(ref _rescanRequested, 0) == 1)
Medium — _shuttingDown not checked before re-firing
StopAsync sets _shuttingDown = true, cancels the scan, then races to acquire _scanLock. Because the re-fire is dispatched after _scanLock.Release() (line 165), FireAndForget.Run can queue a new AnalyzeAsync call that later acquires the lock and runs a full trie scan against a partially-disposed DB.
RunIncrementalDiff guards this exact window with if (_shuttingDown) return; — the re-fire path should match:
-   if (Interlocked.Exchange(ref _rescanRequested, 0) == 1)
+   if (!_shuttingDown && Interlocked.Exchange(ref _rescanRequested, 0) == 1)
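
To illustrate the window described in this comment, here is a small self-contained model of the guarded re-fire. It is a sketch under assumptions: `ShutdownAwareScanner`, its injected `scanBody`, and the `ScansStarted` counter are hypothetical, and `StopAsync` is reduced to "set the flag, then wait for the lock" (the real method also cancels the running scan). It assumes .NET 5 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the service; only the shutdown/re-fire
// interaction is modeled, not the actual trie scan.
public sealed class ShutdownAwareScanner
{
    private readonly SemaphoreSlim _scanLock = new(1, 1);
    private readonly Func<Task> _scanBody;
    private int _rescanRequested;
    private int _scansStarted;
    private volatile bool _shuttingDown;

    public ShutdownAwareScanner(Func<Task> scanBody) => _scanBody = scanBody;

    public int ScansStarted => Volatile.Read(ref _scansStarted);

    public async Task<bool> AnalyzeAsync()
    {
        if (!await _scanLock.WaitAsync(0))
        {
            Interlocked.Exchange(ref _rescanRequested, 1);
            return false;
        }

        try
        {
            Interlocked.Increment(ref _scansStarted);
            await _scanBody();
        }
        finally
        {
            _scanLock.Release();
        }

        // Guarded re-fire: once shutdown has begun, no new scan is queued,
        // so nothing can start scanning after StopAsync has returned.
        if (!_shuttingDown && Interlocked.Exchange(ref _rescanRequested, 0) == 1)
            _ = Task.Run(() => AnalyzeAsync());

        return true;
    }

    public async Task StopAsync()
    {
        _shuttingDown = true;
        // Wait for any running scan to release the lock before teardown.
        await _scanLock.WaitAsync();
        _scanLock.Release();
    }
}
```

Without the `!_shuttingDown` check, a request dropped during the last scan would queue one more `AnalyzeAsync` after the lock is released, which is the late scan the comment warns about.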
    $"StorageSlots={stats.StorageSlotsTotal}");

-   return Result<StateCompositionStats>.Success(stats);
+   result = Result<StateCompositionStats>.Success(stats);
Medium — Re-fire only triggers on successful scan completion
If the scan ends via exception or OperationCanceledException (e.g. from CancelScan()), control never reaches the post-finally block. Any _rescanRequested = 1 set during that scan is stranded indefinitely — exactly the deadlock scenario this PR fixes.
Consider moving the exchange into finally so the flag is always consumed, then acting on it after the block:
    bool shouldRefire = false;
    try
    {
        // ...scan body...
        result = Result<StateCompositionStats>.Success(stats);
        // Reached only on a clean completion.
        shouldRefire = Interlocked.Exchange(ref _rescanRequested, 0) == 1;
    }
    finally
    {
        _currentScanCts = null;
        _scanLock.Release();
    }

As written, the exchange runs only after a clean completion, so the flag stays set through cancellation and the next successful scan still sees it. Alternatively, the exchange can move into finally so the flag is cleared unconditionally, if re-firing only after a clean completion is acceptable. Either is better than the current state, where the flag is silently stranded.
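
To make the cancellation case concrete, the sketch below models one possible policy: consume the flag in `finally`, so that a request dropped during a cancelled or failed scan still triggers one follow-up scan. This is an assumption chosen for illustration, not necessarily the policy the PR should adopt, and it leaves out the shutdown guard discussed in the other comment. `CancelAwareScanner`, its injected `scanBody`, `ScansStarted`, and `RescanPending` are hypothetical names; the sketch assumes .NET 5 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the service; it models only how the
// dropped-request flag behaves when the running scan is cancelled.
public sealed class CancelAwareScanner
{
    private readonly SemaphoreSlim _scanLock = new(1, 1);
    private readonly Func<CancellationToken, Task> _scanBody;
    private int _rescanRequested;
    private int _scansStarted;

    public CancelAwareScanner(Func<CancellationToken, Task> scanBody) => _scanBody = scanBody;

    public int ScansStarted => Volatile.Read(ref _scansStarted);
    public bool RescanPending => Volatile.Read(ref _rescanRequested) == 1;

    public async Task<bool> AnalyzeAsync(CancellationToken ct)
    {
        if (!await _scanLock.WaitAsync(0))
        {
            Interlocked.Exchange(ref _rescanRequested, 1);
            return false;
        }

        try
        {
            Interlocked.Increment(ref _scansStarted);
            await _scanBody(ct); // may throw OperationCanceledException
            return true;
        }
        finally
        {
            _scanLock.Release();

            // Runs on success, exception, and cancellation alike, so a
            // request dropped during this scan cannot be stranded.
            // Assumption: the follow-up scan gets a fresh, uncancelled token.
            if (Interlocked.Exchange(ref _rescanRequested, 0) == 1)
                _ = Task.Run(() => AnalyzeAsync(CancellationToken.None));
        }
    }
}
```

Releasing the lock before consuming the flag matters here: a request that arrives between the two steps either acquires the lock itself or sets the flag in time to be seen.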
Changes
- StateCompositionService.AnalyzeAsync fail-fast-returns Fail("Scan already in progress") when a scan already holds _scanLock, silently dropping the rescan request.
- … MissingTrieNodeException) never converges: a deadlock for any consumer waiting on the baseline to advance.
- … _rescanRequested; when the scan that held the lock completes, re-fire exactly one scan against the current head. With a frozen chain this converges in a single extra pass.

Types of changes
What types of changes does your code introduce?
Testing
Requires testing
If yes, did you write tests?
Notes on testing
Verified on a live node: a sensor-paced consumer that had deadlocked waiting for the baseline to advance past a dropped rescan resumed once the running scan completed and re-fired the queued rescan. A unit test for the re-fire path should be added before this leaves draft.
Documentation
Requires documentation update
Requires explanation in Release Notes