No sweeper recovers non-capture LlmRequests stuck in Processing #1209

@Chris0Jeky

Problem

Since PR #1200 (issue #1190) made non-capture LlmQueue claims atomic via TryClaimProcessingAsync, there is a recovery gap. If a worker crashes (or its process is killed) after successfully claiming a non-capture LlmRequest (flipping it to Processing) but before completing or failing it, the row is left in Processing forever. Nothing transitions it back to Pending or Failed, so the item is silently abandoned.

This was acknowledged as the residual half of review finding M1 on PR #1200 ("orphaned Processing item"), which was mitigated for the re-fetch-null branch but not for the broader worker-crash-after-claim case.

Why nothing recovers it today

  • ProposalHousekeepingWorker only operates on proposals -- it has zero references to LlmRequest / RequestStatus.Processing, so it does not sweep stuck queue items.
  • Capture-triage items are read from Processing each poll and re-claimed, so they self-heal; non-capture items are read from Pending only, so once stuck in Processing they are never re-enqueued.
  • The LlmQueueToProposalWorker itself has no stuck-item reclamation pass.
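
To make the gap concrete, here is a small language-neutral sketch of the polling rule described in the bullets above, written in Python for brevity. Every name in it (RequestStatus, LlmRequest, poll_candidates) is hypothetical and only models the behaviour stated in this issue, not the actual Taskdeck C# types. How capture items are first claimed from Pending is deliberately not modelled.

```python
from dataclasses import dataclass
from enum import Enum


class RequestStatus(Enum):
    PENDING = "Pending"
    PROCESSING = "Processing"
    FAILED = "Failed"


@dataclass
class LlmRequest:
    id: int
    is_capture: bool
    status: RequestStatus


def poll_candidates(rows):
    """Rows one poll would pick up, per the two read rules described above."""
    picked = []
    for row in rows:
        # Non-capture items are read from Pending only.
        if not row.is_capture and row.status is RequestStatus.PENDING:
            picked.append(row.id)
        # Capture-triage items are re-read from Processing, so they self-heal.
        elif row.is_capture and row.status is RequestStatus.PROCESSING:
            picked.append(row.id)
    return picked


rows = [
    LlmRequest(1, is_capture=False, status=RequestStatus.PENDING),
    LlmRequest(2, is_capture=False, status=RequestStatus.PROCESSING),  # claimed, then worker died
    LlmRequest(3, is_capture=True, status=RequestStatus.PROCESSING),
]
candidates = poll_candidates(rows)
```

Under these assumptions request 2 never appears in candidates, no matter how many polls run, which is the abandoned-item behaviour reported here.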

Existing pattern to mirror

OutboundWebhookDeliveryWorker.RecoverStuckProcessingDeliveriesAsync (backend/src/Taskdeck.Api/Workers/OutboundWebhookDeliveryWorker.cs ~line 284) already implements exactly this shape: query rows that have sat in Processing past a threshold (GetStuckProcessingAsync) and ReturnToPending(...) them. A parallel sweeper for stuck LlmRequest rows (e.g. in ProposalHousekeepingWorker or LlmQueueToProposalWorker) should:

  1. Query non-capture LlmRequests in Processing whose UpdatedAt is older than a configurable stuck threshold.
  2. Return them to Pending (respecting retry budget) or mark Failed when retries are exhausted.
  3. Emit telemetry for recovered items.
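
The three steps above could be sketched roughly as follows. This is a language-neutral illustration in Python, not the proposed C# implementation: the LlmRequest fields, recover_stuck_requests, and the max_retries and stuck_threshold parameters are all hypothetical names. It also assumes that each recovery consumes one unit of the retry budget, which the issue does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class RequestStatus(Enum):
    PENDING = "Pending"
    PROCESSING = "Processing"
    FAILED = "Failed"


@dataclass
class LlmRequest:
    id: int
    is_capture: bool
    status: RequestStatus
    updated_at: datetime
    retry_count: int = 0


def recover_stuck_requests(rows, now, stuck_threshold, max_retries):
    """Sweep stuck non-capture rows; return (id, new_status) pairs for telemetry."""
    cutoff = now - stuck_threshold
    recovered = []
    for row in rows:
        # Step 1: only non-capture rows sitting in Processing past the threshold.
        if row.is_capture or row.status is not RequestStatus.PROCESSING:
            continue
        if row.updated_at >= cutoff:
            continue
        # Step 2: respect the retry budget (assumption: a recovery counts as a retry).
        if row.retry_count < max_retries:
            row.retry_count += 1
            row.status = RequestStatus.PENDING
        else:
            row.status = RequestStatus.FAILED
        row.updated_at = now
        # Step 3: record the transition so the caller can emit telemetry.
        recovered.append((row.id, row.status))
    return recovered


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
old = now - timedelta(minutes=30)
rows = [
    LlmRequest(1, False, RequestStatus.PROCESSING, old),                 # stuck, retries left
    LlmRequest(2, False, RequestStatus.PROCESSING, old, retry_count=3),  # stuck, budget exhausted
    LlmRequest(3, False, RequestStatus.PROCESSING, now - timedelta(minutes=1)),  # still fresh
    LlmRequest(4, True, RequestStatus.PROCESSING, old),                  # capture item, left alone
]
events = recover_stuck_requests(rows, now, timedelta(minutes=10), max_retries=3)
```

In this run only requests 1 and 2 are touched: the first returns to Pending and the second is marked Failed. In a real worker the status change would need to be a conditional update in the store, so that a sweeper and a slow but still-alive worker cannot both act on the same row.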

Acceptance criteria

  • A background sweeper recovers non-capture LlmRequests stuck in Processing beyond a configurable threshold.
  • Retry budget is respected (return to Pending until max retries, then Failed).
  • Recovery is covered by tests (worker crash after claim -> item eventually reclaimed).
  • Threshold is configurable via WorkerSettings and documented in docs/platform/CONFIGURATION_REFERENCE.md.

References
