Problem
After PR #1200 (issue #1190) made non-capture LlmQueue claims atomic via TryClaimProcessingAsync, there is a recovery gap: if a worker crashes (or its process is killed) after successfully claiming a non-capture LlmRequest (flipping it to Processing) but before completing/failing it, the row is left in Processing forever. Nothing transitions it back to Pending or Failed, so the item is silently abandoned.
This was acknowledged as the residual half of review finding M1 on PR #1200 ("orphaned Processing item"), which was mitigated for the re-fetch-null branch but not for the broader worker-crash-after-claim case.
Why nothing recovers it today
- ProposalHousekeepingWorker only operates on proposals -- it has zero references to LlmRequest / RequestStatus.Processing, so it does not sweep stuck queue items.
- Capture-triage items are read from Processing each poll and re-claimed, so they self-heal; non-capture items are read from Pending only, so once stuck in Processing they are never re-enqueued.
- The LlmQueueToProposalWorker itself has no stuck-item reclamation pass.
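The asymmetry between capture and non-capture items can be illustrated with a small model. This is a simplified, language-neutral sketch in Python, not the project's C# code: the `Request` type, the `is_capture` flag, and the `poll` function are hypothetical stand-ins for the polling behaviour described above.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "Pending"
    PROCESSING = "Processing"


@dataclass
class Request:
    id: int
    is_capture: bool  # hypothetical flag distinguishing capture-triage items
    status: Status


def poll(queue):
    """Return the ids one poll would pick up, per the behaviour described above.

    Assumption: capture items are re-read from Processing on every poll,
    while non-capture items are read from Pending only.
    """
    picked = []
    for req in queue:
        if req.is_capture and req.status is Status.PROCESSING:
            picked.append(req.id)
        elif not req.is_capture and req.status is Status.PENDING:
            picked.append(req.id)
    return picked


queue = [
    Request(1, is_capture=True, status=Status.PROCESSING),   # re-claimed
    Request(2, is_capture=False, status=Status.PROCESSING),  # orphaned by a crash
    Request(3, is_capture=False, status=Status.PENDING),     # normal work
]
picked_ids = poll(queue)
```

In this model request 2 is never selected, no matter how many times `poll` runs, which is the recovery gap this issue describes.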
Existing pattern to mirror
OutboundWebhookDeliveryWorker.RecoverStuckProcessingDeliveriesAsync (backend/src/Taskdeck.Api/Workers/OutboundWebhookDeliveryWorker.cs ~line 284) already implements exactly this shape: query rows that have sat in Processing past a threshold (GetStuckProcessingAsync) and ReturnToPending(...) them. A parallel sweeper for stuck LlmRequest rows (e.g. in ProposalHousekeepingWorker or LlmQueueToProposalWorker) should:
- Query non-capture LlmRequests in Processing whose UpdatedAt is older than a configurable stuck threshold.
- Return them to Pending (respecting retry budget) or mark Failed when retries are exhausted.
- Emit telemetry for recovered items.
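The three steps above could be sketched roughly as follows. This is a language-neutral illustration in Python under stated assumptions, not the proposed C# implementation: the `LlmRequest` fields `is_capture` and `retry_count`, the `max_retries` parameter, and the `recover_stuck` function are hypothetical names, and telemetry is modelled simply as the returned list of recovered items.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class LlmRequest:
    id: int
    is_capture: bool       # hypothetical flag; capture items already self-heal
    status: str            # "Pending", "Processing", or "Failed"
    updated_at: datetime
    retry_count: int = 0   # hypothetical retry-budget counter


def recover_stuck(requests, now, stuck_threshold, max_retries):
    """Sweep non-capture requests stuck in Processing past the threshold.

    Returns (id, new_status) pairs so the caller can emit telemetry.
    """
    cutoff = now - stuck_threshold
    recovered = []
    for req in requests:
        # Skip capture items, non-Processing items, and recently touched items.
        if req.is_capture or req.status != "Processing" or req.updated_at >= cutoff:
            continue
        if req.retry_count < max_retries:
            req.retry_count += 1
            req.status = "Pending"   # budget remains: hand back to the queue
        else:
            req.status = "Failed"    # budget exhausted: stop retrying
        req.updated_at = now
        recovered.append((req.id, req.status))
    return recovered


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
requests = [
    LlmRequest(1, False, "Processing", now - timedelta(minutes=30)),
    LlmRequest(2, False, "Processing", now - timedelta(minutes=30), retry_count=3),
    LlmRequest(3, False, "Processing", now - timedelta(minutes=1)),
    LlmRequest(4, True, "Processing", now - timedelta(minutes=30)),
]
recovered = recover_stuck(requests, now, timedelta(minutes=10), max_retries=3)
```

In a real sweeper the status change would presumably need to be a conditional update (only if the row is still in Processing), so that the sweep cannot race with a slow worker that is still alive; the in-memory sketch does not model that.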
Acceptance criteria
- A background sweeper recovers non-capture LlmRequests stuck in Processing beyond a configurable threshold.
- Retry budget is respected (return to Pending until max retries, then Failed).
- Recovery is covered by tests (worker crash after claim -> item eventually reclaimed).
- Threshold is configurable via WorkerSettings and documented in docs/platform/CONFIGURATION_REFERENCE.md.
References
OutboundWebhookDeliveryWorker.RecoverStuckProcessingDeliveriesAsync