fix: atomic claim for LlmQueue non-capture processing (#1190) #1200

Merged
Chris0Jeky merged 9 commits into main from fix/1190-queue-claim-race
Jun 13, 2026
Conversation

@Chris0Jeky

Owner

Summary

Fixes #1190 -- ProcessNextRequestAsync and LlmQueueToProposalWorker.ProcessSingleItemAsync used a non-atomic fetch-then-mutate pattern for claiming pending LLM requests. Under concurrent workers, the same request could be claimed twice, producing duplicate proposals.

  • Add TryClaimProcessingAsync to ILlmQueueRepository / LlmQueueRepository that atomically transitions Pending -> Processing with an optimistic concurrency guard (WHERE Status = Pending AND UpdatedAt = @expected), mirroring the existing TryClaimProcessingCaptureAsync pattern
  • Update LlmQueueService.ProcessNextRequestAsync to use the atomic claim, iterating FIFO candidates and skipping any that fail the claim
  • Update LlmQueueToProposalWorker.ProcessSingleItemAsync to use TryClaimProcessingAsync instead of the racy GetByIdAsync + MarkAsProcessing + SaveChangesAsync flow
  • Pass ExpectedUpdatedAt for non-capture batch items in BuildFairBatchItems
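
The optimistic-concurrency claim described in the bullets above can be sketched outside the codebase. The following is a minimal illustration in Python with the standard-library `sqlite3` module, not the repository's C# implementation: the three-column table and the `try_claim_processing` helper are assumptions made for the example, and only the shape of the `WHERE` guard comes from the PR description.

```python
import sqlite3
from datetime import datetime, timezone


def make_db() -> sqlite3.Connection:
    # Hypothetical minimal schema; the real LlmRequests table has more columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE LlmRequests ("
        "Id TEXT PRIMARY KEY, Status TEXT NOT NULL, UpdatedAt TEXT NOT NULL)"
    )
    return conn


def try_claim_processing(conn: sqlite3.Connection, request_id: str,
                         expected_updated_at: str) -> bool:
    """Move one row Pending -> Processing in a single UPDATE.

    Returns True only when this call changed the row, i.e. the caller won.
    """
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "UPDATE LlmRequests SET Status = 'Processing', UpdatedAt = ? "
        "WHERE Id = ? AND Status = 'Pending' AND UpdatedAt = ?",
        (now, request_id, expected_updated_at),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows affected means someone else got there first


conn = make_db()
conn.execute(
    "INSERT INTO LlmRequests VALUES ('r1', 'Pending', '2026-06-13T10:00:00+00:00')"
)
conn.commit()

# Two callers that both read the same snapshot of the row.
first = try_claim_processing(conn, "r1", "2026-06-13T10:00:00+00:00")
second = try_claim_processing(conn, "r1", "2026-06-13T10:00:00+00:00")
```

Because the status check and the write happen in one statement, the second caller matches zero rows and learns it lost the race, rather than overwriting a claim it never saw.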

Test plan

  • 6 new unit tests in LlmQueueServiceTests: claim success, claim failure (concurrent), fallthrough to next candidate, skip capture requests, empty queue, FIFO ordering
  • 4 new integration tests in LlmQueueRepositoryIntegrationTests: claim pending request, fail when status already changed, concurrent race (exactly one wins), reject non-pending request
  • Updated ProcessBatch_ItemClaimedBetweenFetchAndProcess_SkipsGracefully worker test to use atomic claim pattern
  • All 3,297 Application.Tests pass
  • All 1,744 Api.Tests pass
  • Build clean (0 errors)

ProcessNextRequestAsync and LlmQueueToProposalWorker.ProcessSingleItemAsync
used a non-atomic fetch-then-mutate pattern for claiming pending LLM requests.
Under concurrent workers, the same request could be claimed twice, producing
duplicate proposals.

Add TryClaimProcessingAsync to ILlmQueueRepository / LlmQueueRepository that
atomically transitions Pending -> Processing with an optimistic concurrency
guard (WHERE Status = Pending AND UpdatedAt = @expected), mirroring the
existing TryClaimProcessingCaptureAsync pattern.

Update both ProcessNextRequestAsync and ProcessSingleItemAsync to use the
atomic claim instead of the racy read-then-MarkAsProcessing-then-save flow.
ProcessNextRequestAsync now iterates candidates and skips any that fail the
atomic claim, falling through to the next FIFO candidate.
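
The candidate loop described above can be sketched roughly as follows. This is a Python illustration with no database, not the actual `ProcessNextRequestAsync` code: the `process_next` function, the tuple layout, and the `inbox.capture.note` type name are assumptions made for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

# (id, request_type, updated_at), ordered oldest first (FIFO).
Candidate = Tuple[str, str, str]


def process_next(
    candidates: Iterable[Candidate],
    try_claim: Callable[[str, str], bool],
) -> Optional[str]:
    """Return the id of the first candidate this caller manages to claim."""
    for request_id, request_type, updated_at in candidates:
        if request_type.startswith("inbox.capture."):
            continue  # capture requests are claimed by a separate path
        if try_claim(request_id, updated_at):
            return request_id  # claim won: stop at the oldest claimable item
        # Claim lost to a concurrent worker: fall through to the next candidate.
    return None  # empty queue or every claim failed (NotFound in the service)


already_taken = {"b"}  # pretend another worker claimed "b" a moment ago


def fake_claim(request_id: str, expected_updated_at: str) -> bool:
    return request_id not in already_taken


queue = [
    ("a", "inbox.capture.note", "t0"),  # hypothetical capture type, skipped
    ("b", "chat.completion", "t1"),     # claim fails
    ("c", "chat.completion", "t2"),     # claim succeeds
]
claimed = process_next(queue, fake_claim)
nothing = process_next([], fake_claim)
```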

Tests: 6 new unit tests (service claim success, claim failure, fallthrough to
next candidate, skip capture requests, empty queue, FIFO ordering) + 4 new
integration tests (claim pending, fail when stale, concurrent race exactly-one,
reject non-pending). All 3,297 Application.Tests + 1,744 Api.Tests green.
@Chris0Jeky

Owner Author

Adversarial Code Review

CRITICAL

  • None

HIGH

  • None

MEDIUM

  • M1: Orphaned Processing item on null re-fetch (LlmQueueService.cs line ~131): After TryClaimProcessingAsync succeeds, if GetByIdAsync returns null the code uses continue to move on to the next candidate. The already-claimed item is then stuck in Processing with nothing left to process or reset it. While extremely unlikely (it requires a DELETE between the UPDATE and the SELECT), the defensive continue silently orphans the item and should log a warning. The worker path has the same pattern but is pre-existing; the proposal housekeeping worker handles stuck items. Adding a warning log on this edge case is sufficient.
  • M2: Snake_case variable name (LlmQueueService.cs line ~130): claimed_request uses snake_case, violating the C# camelCase convention established in the codebase. Should be claimedRequest.

LOW

  • L1: Misleading test name (LlmQueueServiceTests.cs): ProcessNextRequestAsync_ShouldReturnConflict_WhenClaimFails — the name says "Conflict" but the assertion correctly checks for ErrorCodes.NotFound. The name should reflect the actual behavior (all claims exhausted = NotFound).

Bot Comments Addressed

  • None (no bot comments at this time; CI still pending)

Summary

0 CRITICAL, 0 HIGH, 2 MEDIUM, 1 LOW. No merge blockers. All findings are fixable with minimal changes.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces optimistic concurrency for claiming pending non-capture requests in the LLM queue by adding and implementing TryClaimProcessingAsync using raw SQL updates. It updates the background worker and queue service to use this atomic claim mechanism and adds corresponding integration and unit tests. The review feedback identifies a critical bug in LlmQueueService.ProcessNextRequestAsync where re-fetching the claimed request via GetByIdAsync returns a stale, tracked in-memory entity from EF Core's cache instead of the updated database state. The reviewer suggests directly updating the tracked entity's state in-memory using candidate.MarkAsProcessing() to avoid an extra database roundtrip and ensure the correct status is returned.


Comment thread backend/src/Taskdeck.Application/Services/LlmQueueService.cs Outdated
M1: Document orphaned-Processing edge case on null re-fetch after
successful TryClaimProcessingAsync claim with explanatory comment.
M2: Rename snake_case `claimed_request` to camelCase `claimedRequest`.
L1: Rename misleading test from ShouldReturnConflict to
ShouldReturnNotFound_WhenAllClaimsFail.
@Chris0Jeky

Owner Author

Adversarial Review -- Fixes Applied

| Finding | Severity | Fix Commit | Verified |
| --- | --- | --- | --- |
| M1: Orphaned Processing item on null re-fetch | MEDIUM | `98cdd427` | Documented with explanatory comment; housekeeping worker mitigates |
| M2: Snake_case `claimed_request` variable | MEDIUM | `98cdd427` | Renamed to `claimedRequest`; build clean |
| L1: Misleading test name (Conflict vs NotFound) | LOW | `98cdd427` | Renamed to `ShouldReturnNotFound_WhenAllClaimsFail`; 20/20 tests pass |

All findings addressed. CI status: PENDING (new run triggered by fix push).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f7eb7a451


Comment thread backend/src/Taskdeck.Application/Services/LlmQueueService.cs Outdated
The raw-SQL UPDATE in TryClaimProcessingAsync bypasses the EF change
tracker, leaving any tracked instance stale at Pending. Reload the
tracked entity on a successful claim so callers holding the instance
(and GetByIdAsync via the FindAsync identity map) observe Processing.
Document the refresh contract on ILlmQueueRepository.

Drop the post-claim GetByIdAsync re-fetch: it served the stale tracked
entity from the identity map, so the returned DTO reported Pending after
a successful claim. The repository now refreshes the tracked candidate
on claim, so map it directly. This also removes the misleading comments
(the re-fetch did not reflect DB state, and no warning was logged).
Unit test mocks now honor the refresh contract via callbacks.

Two integration tests against real SQLite: a tracked candidate fetched
via GetByStatusAsync shows Processing after a successful claim, and
GetByIdAsync without clearing the change tracker returns Processing.
Both fail with stale Pending without the post-claim reload.
@Chris0Jeky

Owner Author

Review Fixes -- Stale Tracked Entity After Atomic Claim

| Finding | Source | Severity | Fix Commit | Verification |
| --- | --- | --- | --- | --- |
| Raw-SQL claim bypasses EF change tracker; GetByIdAsync (FindAsync identity-map) and the tracked candidate report stale Pending after a successful claim | gemini-code-assist | HIGH | 579c44bc (repo reload + interface contract), 9455241b (service maps refreshed candidate) | New integration tests TryClaimProcessingAsync_ShouldRefreshTrackedEntity and TryClaimProcessingAsync_GetByIdAfterClaim_ShouldReturnProcessingWithoutClearingTracker (e7657933) reproduced the bug (failed with Pending) before the fix and pass after |
| Refresh the tracked request after claiming so POST /api/llm-queue/process-next returns Processing | chatgpt-codex-connector | P2 | 579c44bc + 9455241b | Same two integration tests; unit tests ProcessNextRequestAsync_ShouldReturnSuccess_WhenClaimSucceeds, ...ShouldPreserveFifo..., ...ShouldTryNextCandidate... updated to the documented refresh contract |
| Misleading comments: "Re-fetch so the in-memory entity reflects the DB state" (it did not -- identity-map hit) and "log a warning so the anomaly is visible" (no warning was logged) | review | LOW | 9455241b | Stale re-fetch, dead null-branch, and both comments removed; replaced with an accurate note about the repository refresh contract |

How the fix works

LlmQueueRepository.TryClaimProcessingAsync now, on a successful UPDATE, reloads any tracked instance via _context.Entry(tracked).ReloadAsync() (looked up in LlmRequests.Local). Chosen over the suggested in-memory MarkAsProcessing() because reload syncs UpdatedAt to the exact DB-written value and leaves the entry Unchanged (no divergent timestamp, no redundant second UPDATE on a later SaveChangesAsync). Contract documented on ILlmQueueRepository.
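
To make the stale-entity problem concrete, here is a toy model in Python with `sqlite3`. It is only an analogy for EF Core's change tracker, not the repository code: the `Repo` class, its `local` dictionary (standing in for `LlmRequests.Local`), and the `reload` flag are all assumptions made for the illustration.

```python
import sqlite3
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Request:
    id: str
    status: str
    updated_at: str


class Repo:
    """Toy repository with an identity map, standing in for EF Core tracking."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.local: Dict[str, Request] = {}  # id -> tracked instance

    def get_by_id(self, request_id: str) -> Optional[Request]:
        if request_id in self.local:
            return self.local[request_id]  # identity-map hit: no query is issued
        row = self.conn.execute(
            "SELECT Id, Status, UpdatedAt FROM LlmRequests WHERE Id = ?",
            (request_id,),
        ).fetchone()
        if row is None:
            return None
        self.local[request_id] = Request(*row)
        return self.local[request_id]

    def try_claim(self, request_id: str, expected: str, now: str,
                  reload: bool) -> bool:
        # Raw UPDATE: the tracked Python object is not touched by this statement.
        cur = self.conn.execute(
            "UPDATE LlmRequests SET Status = 'Processing', UpdatedAt = ? "
            "WHERE Id = ? AND Status = 'Pending' AND UpdatedAt = ?",
            (now, request_id, expected),
        )
        self.conn.commit()
        if cur.rowcount != 1:
            return False
        tracked = self.local.get(request_id)
        if reload and tracked is not None:
            # Copy the persisted values onto the tracked instance.
            tracked.status, tracked.updated_at = self.conn.execute(
                "SELECT Status, UpdatedAt FROM LlmRequests WHERE Id = ?",
                (request_id,),
            ).fetchone()
        return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LlmRequests (Id TEXT PRIMARY KEY, Status TEXT, UpdatedAt TEXT)"
)
conn.executemany(
    "INSERT INTO LlmRequests VALUES (?, 'Pending', 't0')", [("a",), ("b",)]
)
conn.commit()

repo = Repo(conn)
stale = repo.get_by_id("a")
fresh = repo.get_by_id("b")
repo.try_claim("a", "t0", "t1", reload=False)  # pre-fix behaviour
repo.try_claim("b", "t0", "t1", reload=True)   # post-fix behaviour
```

After both claims succeed, `stale` still reports Pending and a second `get_by_id("a")` returns the same stale object, while `fresh` carries the status and timestamp that were actually written.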

Out-of-scope finding tracked

TryClaimProcessingCaptureAsync shares the tracker-bypass pattern but its only caller claims in a fresh scope before fetching, so it is latent -- seeded as #1206.

Test evidence

  • dotnet test backend/Taskdeck.sln -c Release -m:1 --filter "FullyQualifiedName~LlmQueue": 83/83 passed (Application.Tests 20/20, Api.Tests 63/63, incl. LlmQueueRepositoryIntegrationTests and LlmQueueToProposalWorkerTests)
  • dotnet test ... --filter "FullyQualifiedName~WorkerResilienceTests": 13/13 passed
  • Pre-fix run of the two new integration tests: both FAILED with Expected ... Processing ... but found ... Pending -- confirming they guard the regression

In-thread replies posted on both bot comments. Threads left open for the orchestrator to resolve.

…ngAsync

Add 'AND RequestType NOT LIKE inbox.capture.%' to the non-capture claim WHERE clause, mirroring the inverse guard in TryClaimProcessingCaptureAsync so the two claim paths are mutually exclusive at the SQL layer.
When the post-claim re-fetch no longer sees the row Processing, we DID win the claim but the row vanished/mutated between UPDATE and SELECT. Log a Warning and emit a distinct 'claimed_then_missing' telemetry outcome instead of conflating it with losing the claim race.
…after claim

Read the persisted row via a fresh AsNoTracking query and assert tracked.UpdatedAt equals it, distinguishing a true ReloadAsync from an in-memory MarkAsProcessing() substitute that would set a different UTC-now timestamp.
Record the expectedUpdatedAt each fake claim receives and add a happy-path test asserting it equals the pending item's actual UpdatedAt. Catches a regression passing default/now (which would stall the queue in production via a no-match optimistic-concurrency UPDATE while tests stayed green) in both worker fakes.
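
The recording-fake idea in the last commit message can be sketched as below. This is a Python stand-in for the C# worker fakes, so `RecordingClaimFake`, `PendingItem`, and `process_single_item` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PendingItem:
    id: str
    updated_at: str


class RecordingClaimFake:
    """Fake repository that remembers every (request_id, expected_updated_at)."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def try_claim(self, request_id: str, expected_updated_at: str) -> bool:
        self.calls.append((request_id, expected_updated_at))
        return True  # always "wins", so only the recorded argument can fail a test


def process_single_item(item: PendingItem, repo: RecordingClaimFake) -> bool:
    # The behaviour under test: forward the item's own UpdatedAt, not a default
    # value or the current time, as the optimistic-concurrency token.
    return repo.try_claim(item.id, item.updated_at)


item = PendingItem("r1", "2026-06-13T10:00:00+00:00")
snapshot = item.updated_at  # taken before any claim could mutate the item
fake = RecordingClaimFake()
claimed = process_single_item(item, fake)
```

A fake that ignores the timestamp would stay green even if the worker passed the wrong value; asserting on `fake.calls` closes that gap.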
@Chris0Jeky

Owner Author

Residual Review Findings -- Fixes Applied

All five residual LOW findings from the prior review are addressed. Verified with dotnet test ... --filter "FullyQualifiedName~LlmQueue" (65 passed) and the WorkerResilienceTests classes (14 passed) on a clean Release build (0 errors).

| # | Severity | Finding | Fix Commit | Verification |
| --- | --- | --- | --- | --- |
| 1 | LOW | TryClaimProcessingAsync lacked the inverse of the capture guard, so the non-capture contract was not self-enforcing at the SQL layer. | 5a9ce311 | Added AND RequestType NOT LIKE {CaptureRequestTypeLike} to the non-capture claim WHERE clause, mirroring TryClaimProcessingCaptureAsync. Existing chat.completion claim tests still pass (non-capture types still match); capture types now provably cannot be claimed via this path. |
| 2 | LOW | TryClaimProcessingAsync_ShouldRefreshTrackedEntity only asserted tracked.UpdatedAt != expectedUpdatedAt, which an in-memory MarkAsProcessing() substitute would also satisfy. | 08ab7456 | Test now reads the persisted row via a fresh-scope AsNoTracking().SingleAsync(...) and asserts tracked.UpdatedAt == persisted.UpdatedAt, proving a true DB ReloadAsync rather than a UTC-now in-memory mutation. |
| 3 | LOW | The post-claim item == null \|\| item.Status != Processing branch reused the already_claimed label, conflating "we never won the claim" with "we won but the row vanished between UPDATE and SELECT." | 0fabb60b | Now logs a Warning and emits a distinct claimed_then_missing telemetry outcome in the non-capture branch (LlmQueueToProposalWorker.ProcessSingleItemAsync). |
| 4 | LOW | The worker fakes accepted expectedUpdatedAt but never asserted it, so a regression passing default/now (which would stall the queue in prod via a no-match optimistic-concurrency UPDATE) would leave tests green. | 4315573c | Both fakes (LlmQueueToProposalWorkerTests and root WorkerResilienceTests) now record (requestId, expectedUpdatedAt). Added happy-path tests asserting the worker forwards the pending item's actual UpdatedAt (snapshotted before the claim mutates it). |
| 5 | LOW | No worker recovers non-capture LlmRequests stuck in Processing (worker crash after a successful claim leaves the item Processing forever; ProposalHousekeepingWorker only touches proposals). Out of scope for this PR. | Seeded #1209 | Confirmed no prior issue existed (searched "stuck Processing" / "sweeper"). Issue #1209 (backend, tech-debt) tracks adding a sweeper mirroring OutboundWebhookDeliveryWorker.RecoverStuckProcessingDeliveriesAsync, references PR #1200 and issue #1190. |
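
Finding 1 can be illustrated with a small Python and `sqlite3` sketch of two claim paths that exclude each other at the SQL layer. It is not the repository's SQL: the schema, the `try_claim` helper, and the `inbox.capture.text` type name are assumptions, while the `inbox.capture.%` pattern and `chat.completion` come from the thread.

```python
import sqlite3

CAPTURE_LIKE = "inbox.capture.%"  # pattern quoted in the commit message


def try_claim(conn: sqlite3.Connection, request_id: str, expected: str,
              capture_path: bool) -> bool:
    # The operator comes from a fixed two-value choice, never from user input.
    op = "LIKE" if capture_path else "NOT LIKE"
    cur = conn.execute(
        "UPDATE LlmRequests SET Status = 'Processing', UpdatedAt = 't1' "
        "WHERE Id = ? AND Status = 'Pending' AND UpdatedAt = ? "
        f"AND RequestType {op} ?",
        (request_id, expected, CAPTURE_LIKE),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LlmRequests ("
    "Id TEXT PRIMARY KEY, RequestType TEXT, Status TEXT, UpdatedAt TEXT)"
)
conn.executemany(
    "INSERT INTO LlmRequests VALUES (?, ?, 'Pending', 't0')",
    [("cap", "inbox.capture.text"), ("chat", "chat.completion")],
)
conn.commit()

# Each row is first offered to the wrong path, then to the right one.
wrong_capture = try_claim(conn, "cap", "t0", capture_path=False)
wrong_chat = try_claim(conn, "chat", "t0", capture_path=True)
right_capture = try_claim(conn, "cap", "t0", capture_path=True)
right_chat = try_claim(conn, "chat", "t0", capture_path=False)
```

With the guard in the `WHERE` clause, a caller that picks the wrong path simply matches zero rows, so the separation no longer depends on every caller filtering correctly beforehand.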

CI note

Main is green at 8e603723; the earlier Windows CLI timeout was runner-transient. This push triggers a fresh required-CI run; the two prior MEDIUM CI findings resolve via that re-run.

Verification commands

dotnet test backend/tests/Taskdeck.Api.Tests/Taskdeck.Api.Tests.csproj -c Release --filter "FullyQualifiedName~LlmQueue"        # 65 passed
dotnet test backend/tests/Taskdeck.Api.Tests/Taskdeck.Api.Tests.csproj -c Release --filter "FullyQualifiedName~WorkerResilienceTests"  # 14 passed

@Chris0Jeky Chris0Jeky merged commit 5d0f16e into main Jun 13, 2026
20 checks passed
@github-project-automation github-project-automation Bot moved this from Pending to Done in Taskdeck Execution Jun 13, 2026

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

LlmQueueService.ProcessNextRequestAsync race condition with worker (non-atomic claim)

1 participant