fix: atomic claim for LlmQueue non-capture processing (#1190) by Chris0Jeky · Pull Request #1200 · Chris0Jeky/Taskdeck

Chris0Jeky · 2026-06-06T10:35:21Z

Summary

Fixes #1190 -- ProcessNextRequestAsync and LlmQueueToProposalWorker.ProcessSingleItemAsync used a non-atomic fetch-then-mutate pattern for claiming pending LLM requests. Under concurrent workers, the same request could be claimed twice, producing duplicate proposals.

- Add TryClaimProcessingAsync to ILlmQueueRepository / LlmQueueRepository that atomically transitions Pending -> Processing with an optimistic concurrency guard (WHERE Status = Pending AND UpdatedAt = @expected), mirroring the existing TryClaimProcessingCaptureAsync pattern
- Update LlmQueueService.ProcessNextRequestAsync to use the atomic claim, iterating FIFO candidates and skipping any that fail the claim
- Update LlmQueueToProposalWorker.ProcessSingleItemAsync to use TryClaimProcessingAsync instead of the racy GetByIdAsync + MarkAsProcessing + SaveChangesAsync flow
- Pass ExpectedUpdatedAt for non-capture batch items in BuildFairBatchItems
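
The claim described in the summary can be illustrated with a minimal sketch. It is written in Python against an in-memory SQLite database rather than the project's C# / EF Core code, so the table layout, the function name `try_claim_processing`, and the timestamp format are assumptions made for illustration only. The point it shows is that a single UPDATE guarded by the expected status and timestamp lets the affected-row count decide who won the claim.

```python
import sqlite3
from datetime import datetime, timezone


def try_claim_processing(conn, request_id, expected_updated_at):
    """Atomically move one row Pending -> Processing.

    Returns True only when this caller's UPDATE matched the row, i.e. the row
    was still Pending and still carried the timestamp the caller last saw.
    """
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "UPDATE LlmRequests SET Status = 'Processing', UpdatedAt = ? "
        "WHERE Id = ? AND Status = 'Pending' AND UpdatedAt = ?",
        (now, request_id, expected_updated_at),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical schema and seed row, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LlmRequests (Id INTEGER PRIMARY KEY, Status TEXT, UpdatedAt TEXT)"
)
seen_at = "2026-06-06T10:00:00+00:00"
conn.execute("INSERT INTO LlmRequests VALUES (1, 'Pending', ?)", (seen_at,))
conn.commit()

first = try_claim_processing(conn, 1, seen_at)   # row still matches the guard
second = try_claim_processing(conn, 1, seen_at)  # guard no longer matches
print(first, second)
```

The second call fails for two reasons at once: the status is no longer Pending and the timestamp has moved, so a caller holding a stale snapshot cannot claim the row.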

Test plan

- 6 new unit tests in LlmQueueServiceTests: claim success, claim failure (concurrent), fallthrough to next candidate, skip capture requests, empty queue, FIFO ordering
- 4 new integration tests in LlmQueueRepositoryIntegrationTests: claim pending request, fail when status already changed, concurrent race (exactly one wins), reject non-pending request
- Updated ProcessBatch_ItemClaimedBetweenFetchAndProcess_SkipsGracefully worker test to use atomic claim pattern
- All 3,297 Application.Tests pass
- All 1,744 Api.Tests pass
- Build clean (0 errors)

ProcessNextRequestAsync and LlmQueueToProposalWorker.ProcessSingleItemAsync used a non-atomic fetch-then-mutate pattern for claiming pending LLM requests. Under concurrent workers, the same request could be claimed twice, producing duplicate proposals.

Add TryClaimProcessingAsync to ILlmQueueRepository / LlmQueueRepository that atomically transitions Pending -> Processing with an optimistic concurrency guard (WHERE Status = Pending AND UpdatedAt = @expected), mirroring the existing TryClaimProcessingCaptureAsync pattern. Update both ProcessNextRequestAsync and ProcessSingleItemAsync to use the atomic claim instead of the racy read-then-MarkAsProcessing-then-save flow. ProcessNextRequestAsync now iterates candidates and skips any that fail the atomic claim, falling through to the next FIFO candidate.

Tests: 6 new unit tests (service claim success, claim failure, fallthrough to next candidate, skip capture requests, empty queue, FIFO ordering) + 4 new integration tests (claim pending, fail when stale, concurrent race exactly-one, reject non-pending). All 3,297 Application.Tests + 1,744 Api.Tests green.
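
The candidate iteration this commit message describes might look roughly like the following Python sketch. The `Candidate` type, the `process_next` function, and the injected `try_claim` callable are hypothetical stand-ins for the service and repository, and the capture check by type prefix is an assumption based on the `inbox.capture.%` pattern mentioned later in this thread.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Candidate:
    id: int
    request_type: str
    updated_at: str


def process_next(
    candidates: Iterable[Candidate],
    try_claim: Callable[[int, str], bool],
) -> Optional[Candidate]:
    """Walk pending candidates in FIFO order and return the first one claimed.

    A failed claim means another worker got there first, so the loop simply
    falls through to the next candidate. None stands in for a NotFound result.
    """
    for candidate in candidates:
        if candidate.request_type.startswith("inbox.capture."):
            continue  # capture requests go through their own claim path
        if try_claim(candidate.id, candidate.updated_at):
            return candidate
    return None


# Hypothetical queue: item 1 was already taken by a concurrent worker.
pending = [
    Candidate(1, "chat.completion", "t0"),
    Candidate(2, "inbox.capture.note", "t0"),
    Candidate(3, "chat.completion", "t0"),
]
already_taken = {1}
claimed = process_next(pending, lambda rid, ts: rid not in already_taken)
nothing = process_next(pending, lambda rid, ts: False)
print(claimed, nothing)
```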

Chris0Jeky · 2026-06-06T10:36:43Z

Adversarial Code Review

CRITICAL

None

HIGH

None

MEDIUM

M1: Orphaned Processing item on null re-fetch (LlmQueueService.cs line ~131): After TryClaimProcessingAsync succeeds, if GetByIdAsync returns null the code hits `continue` and moves on to the next candidate. The already-claimed item is then stuck in Processing with nothing to process or reset it. While extremely unlikely (it requires a DELETE between the UPDATE and the SELECT), the defensive `continue` silently orphans the item, so it should at least log a warning. The worker path has the same pattern, but that is pre-existing, and the proposal housekeeping worker handles stuck items. Adding a warning log for this edge case is sufficient.
M2: Snake_case variable name (LlmQueueService.cs line ~130): claimed_request uses snake_case, violating the C# camelCase convention established in the codebase. Should be claimedRequest.

LOW

L1: Misleading test name (LlmQueueServiceTests.cs): ProcessNextRequestAsync_ShouldReturnConflict_WhenClaimFails. The name says "Conflict" but the assertion correctly checks for ErrorCodes.NotFound. The name should reflect the actual behavior (all claims exhausted = NotFound).

Bot Comments Addressed

None (no bot comments at this time; CI still pending)

Summary

0 CRITICAL, 0 HIGH, 2 MEDIUM, 1 LOW. No merge blockers. All findings are fixable with minimal changes.

gemini-code-assist

Code Review

This pull request introduces optimistic concurrency for claiming pending non-capture requests in the LLM queue by adding and implementing TryClaimProcessingAsync using raw SQL updates. It updates the background worker and queue service to use this atomic claim mechanism and adds corresponding integration and unit tests. The review feedback identifies a critical bug in LlmQueueService.ProcessNextRequestAsync where re-fetching the claimed request via GetByIdAsync returns a stale, tracked in-memory entity from EF Core's cache instead of the updated database state. The reviewer suggests directly updating the tracked entity's state in-memory using candidate.MarkAsProcessing() to avoid an extra database roundtrip and ensure the correct status is returned.

M1: Document orphaned-Processing edge case on null re-fetch after successful TryClaimProcessingAsync claim with explanatory comment. M2: Rename snake_case `claimed_request` to camelCase `claimedRequest`. L1: Rename misleading test from ShouldReturnConflict to ShouldReturnNotFound_WhenAllClaimsFail.

Chris0Jeky · 2026-06-06T10:38:09Z

Adversarial Review -- Fixes Applied

| Finding | Severity | Fix Commit | Verified |
| --- | --- | --- | --- |
| M1: Orphaned Processing item on null re-fetch | MEDIUM | `98cdd427` | Documented with explanatory comment; housekeeping worker mitigates |
| M2: Snake_case `claimed_request` variable | MEDIUM | `98cdd427` | Renamed to `claimedRequest`; build clean |
| L1: Misleading test name (Conflict vs NotFound) | LOW | `98cdd427` | Renamed to `ShouldReturnNotFound_WhenAllClaimsFail`; 20/20 tests pass |

All findings addressed. CI status: PENDING (new run triggered by fix push).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f7eb7a451

The raw-SQL UPDATE in TryClaimProcessingAsync bypasses the EF change tracker, leaving any tracked instance stale at Pending. Reload the tracked entity on a successful claim so callers holding the instance (and GetByIdAsync via the FindAsync identity map) observe Processing. Document the refresh contract on ILlmQueueRepository.

Drop the post-claim GetByIdAsync re-fetch: it served the stale tracked entity from the identity map, so the returned DTO reported Pending after a successful claim. The repository now refreshes the tracked candidate on claim, so map it directly. This also removes the misleading comments (the re-fetch did not reflect DB state, and no warning was logged). Unit test mocks now honor the refresh contract via callbacks.

Two integration tests against real SQLite: a tracked candidate fetched via GetByStatusAsync shows Processing after a successful claim, and GetByIdAsync without clearing the change tracker returns Processing. Both fail with stale Pending without the post-claim reload.

Chris0Jeky · 2026-06-12T23:44:14Z

Review Fixes -- Stale Tracked Entity After Atomic Claim

| Finding | Source | Severity | Fix Commit | Verification |
| --- | --- | --- | --- | --- |
| Raw-SQL claim bypasses EF change tracker; `GetByIdAsync` (FindAsync identity-map) and the tracked candidate report stale `Pending` after a successful claim | gemini-code-assist | HIGH | `579c44bc` (repo reload + interface contract), `9455241b` (service maps refreshed candidate) | New integration tests `TryClaimProcessingAsync_ShouldRefreshTrackedEntity` and `TryClaimProcessingAsync_GetByIdAfterClaim_ShouldReturnProcessingWithoutClearingTracker` (`e7657933`) reproduced the bug (failed with `Pending`) before the fix and pass after |
| Refresh the tracked request after claiming so `POST /api/llm-queue/process-next` returns `Processing` | chatgpt-codex-connector | P2 | `579c44bc` + `9455241b` | Same two integration tests; unit tests `ProcessNextRequestAsync_ShouldReturnSuccess_WhenClaimSucceeds`, `...ShouldPreserveFifo...`, `...ShouldTryNextCandidate...` updated to the documented refresh contract |
| Misleading comments: "Re-fetch so the in-memory entity reflects the DB state" (it did not -- identity-map hit) and "log a warning so the anomaly is visible" (no warning was logged) | review | LOW | `9455241b` | Stale re-fetch, dead null-branch, and both comments removed; replaced with an accurate note about the repository refresh contract |

How the fix works

LlmQueueRepository.TryClaimProcessingAsync now, on a successful UPDATE, reloads any tracked instance via _context.Entry(tracked).ReloadAsync() (looked up in LlmRequests.Local). Chosen over the suggested in-memory MarkAsProcessing() because reload syncs UpdatedAt to the exact DB-written value and leaves the entry Unchanged (no divergent timestamp, no redundant second UPDATE on a later SaveChangesAsync). Contract documented on ILlmQueueRepository.
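
To make the staleness problem and the reload fix concrete, here is a small Python sketch with a hand-rolled identity map standing in for the EF Core change tracker. Everything in it (the `Repo` and `Tracked` classes, the `reload` flag, the `t0`/`t1` timestamps) is hypothetical; it only mimics the behaviour described above, where a raw UPDATE leaves a cached instance unchanged unless it is reloaded from the database.

```python
import sqlite3

CLAIM_SQL = (
    "UPDATE LlmRequests SET Status = 'Processing', UpdatedAt = ? "
    "WHERE Id = ? AND Status = 'Pending' AND UpdatedAt = ?"
)


class Tracked:
    def __init__(self, id, status, updated_at):
        self.id, self.status, self.updated_at = id, status, updated_at


class Repo:
    def __init__(self, conn):
        self.conn = conn
        self.local = {}  # identity map: id -> tracked instance

    def get_by_id(self, request_id):
        if request_id in self.local:
            return self.local[request_id]  # cache hit, no database read
        row = self.conn.execute(
            "SELECT Id, Status, UpdatedAt FROM LlmRequests WHERE Id = ?",
            (request_id,),
        ).fetchone()
        if row is None:
            return None
        self.local[request_id] = Tracked(*row)
        return self.local[request_id]

    def try_claim(self, request_id, expected, db_now, reload):
        cur = self.conn.execute(CLAIM_SQL, (db_now, request_id, expected))
        self.conn.commit()
        if cur.rowcount != 1:
            return False
        tracked = self.local.get(request_id)
        if reload and tracked is not None:
            # Copy the persisted values back onto the tracked instance.
            tracked.status, tracked.updated_at = self.conn.execute(
                "SELECT Status, UpdatedAt FROM LlmRequests WHERE Id = ?",
                (request_id,),
            ).fetchone()
        return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LlmRequests (Id INTEGER PRIMARY KEY, Status TEXT, UpdatedAt TEXT)"
)
conn.executemany("INSERT INTO LlmRequests VALUES (?, 'Pending', 't0')", [(1,), (2,)])
conn.commit()
repo = Repo(conn)

stale = repo.get_by_id(1)
repo.try_claim(1, "t0", "t1", reload=False)  # pre-fix behaviour
fresh = repo.get_by_id(2)
repo.try_claim(2, "t0", "t1", reload=True)   # post-fix behaviour
print(stale.status, fresh.status, fresh.updated_at)
```

Reloading also copies the database-written timestamp, which is the property the PR cites for preferring a reload over an in-memory status change.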

Out-of-scope finding tracked

TryClaimProcessingCaptureAsync shares the tracker-bypass pattern but its only caller claims in a fresh scope before fetching, so it is latent -- seeded as #1206.

Test evidence

- `dotnet test backend/Taskdeck.sln -c Release -m:1 --filter "FullyQualifiedName~LlmQueue"`: 83/83 passed (Application.Tests 20/20, Api.Tests 63/63, incl. LlmQueueRepositoryIntegrationTests and LlmQueueToProposalWorkerTests)
- `dotnet test ... --filter "FullyQualifiedName~WorkerResilienceTests"`: 13/13 passed
- Pre-fix run of the two new integration tests: both FAILED with `Expected ... Processing ... but found ... Pending` -- confirming they guard the regression

In-thread replies posted on both bot comments. Threads left open for the orchestrator to resolve.

…ngAsync

Add 'AND RequestType NOT LIKE inbox.capture.%' to the non-capture claim WHERE clause, mirroring the inverse guard in TryClaimProcessingCaptureAsync so the two claim paths are mutually exclusive at the SQL layer.
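
A minimal sketch of the type guard described in this commit, again in Python with SQLite and an assumed schema. The `inbox.capture.note` request type and the function name are made up for the example; `chat.completion` is the non-capture type mentioned elsewhere in this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LlmRequests ("
    "Id INTEGER PRIMARY KEY, RequestType TEXT, Status TEXT, UpdatedAt TEXT)"
)
conn.executemany(
    "INSERT INTO LlmRequests VALUES (?, ?, 'Pending', 't0')",
    [(1, "chat.completion"), (2, "inbox.capture.note")],  # second type is hypothetical
)
conn.commit()


def try_claim_non_capture(conn, request_id, expected_updated_at):
    """Claim a pending row only if it is NOT a capture request."""
    cur = conn.execute(
        "UPDATE LlmRequests SET Status = 'Processing', UpdatedAt = 't1' "
        "WHERE Id = ? AND Status = 'Pending' AND UpdatedAt = ? "
        "AND RequestType NOT LIKE 'inbox.capture.%'",
        (request_id, expected_updated_at),
    )
    conn.commit()
    return cur.rowcount == 1


chat_claimed = try_claim_non_capture(conn, 1, "t0")
capture_claimed = try_claim_non_capture(conn, 2, "t0")
capture_status = conn.execute(
    "SELECT Status FROM LlmRequests WHERE Id = 2"
).fetchone()[0]
print(chat_claimed, capture_claimed, capture_status)
```

With the guard in the WHERE clause, a caller that passes a capture request to the wrong path gets a failed claim instead of silently taking the row.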

When the post-claim re-fetch no longer sees the row Processing, we DID win the claim but the row vanished/mutated between UPDATE and SELECT. Log a Warning and emit a distinct 'claimed_then_missing' telemetry outcome instead of conflating it with losing the claim race.
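
The distinction this commit draws can be sketched as a small classification function. The labels `already_claimed` and `claimed_then_missing` come from this PR; the `claimed` label, the function name, and the logger name are assumptions for illustration.

```python
import logging
from typing import Optional

logger = logging.getLogger("llm_queue_worker")  # hypothetical logger name


def classify_claim(claim_won: bool, refetched_status: Optional[str]) -> str:
    """Map the claim result and the post-claim re-fetch to a telemetry outcome."""
    if not claim_won:
        # Another worker won the race; nothing unusual happened.
        return "already_claimed"
    if refetched_status != "Processing":
        # We won the claim, but the row vanished or changed before the re-fetch.
        logger.warning(
            "claimed request no longer Processing on re-fetch: %r", refetched_status
        )
        return "claimed_then_missing"
    return "claimed"  # hypothetical label for the normal path


outcomes = [
    classify_claim(False, None),
    classify_claim(True, None),
    classify_claim(True, "Processing"),
]
print(outcomes)
```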

…after claim

Read the persisted row via a fresh AsNoTracking query and assert tracked.UpdatedAt equals it, distinguishing a true ReloadAsync from an in-memory MarkAsProcessing() substitute that would set a different UTC-now timestamp.

Record the expectedUpdatedAt each fake claim receives and add a happy-path test asserting it equals the pending item's actual UpdatedAt. Catches a regression passing default/now (which would stall the queue in production via a no-match optimistic-concurrency UPDATE while tests stayed green) in both worker fakes.
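
The recording-fake idea in this commit could be sketched as follows. `RecordingFakeRepo`, `process_single_item`, and the item shape are hypothetical; the sketch only shows why recording the argument lets a test catch a worker that forwards the wrong timestamp.

```python
class RecordingFakeRepo:
    """Test double that remembers every claim it is asked to make."""

    def __init__(self):
        self.claims = []  # list of (request_id, expected_updated_at)

    def try_claim_processing(self, request_id, expected_updated_at):
        self.claims.append((request_id, expected_updated_at))
        return True


def process_single_item(repo, item):
    # Snapshot the timestamp before the claim can change it, then forward it.
    expected = item["updated_at"]
    return repo.try_claim_processing(item["id"], expected)


fake = RecordingFakeRepo()
pending_item = {"id": 7, "updated_at": "2026-06-13T09:00:00Z"}
claimed = process_single_item(fake, pending_item)
print(fake.claims)
```

A worker that passed a default or current-time value instead would still get `True` from the fake, so only the recorded argument exposes the regression.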

Chris0Jeky · 2026-06-13T09:08:40Z

Residual Review Findings -- Fixes Applied

All five residual LOW findings from the prior review are addressed. Verified with dotnet test ... --filter "FullyQualifiedName~LlmQueue" (65 passed) and the WorkerResilienceTests classes (14 passed) on a clean Release build (0 errors).

| # | Severity | Finding | Fix Commit | Verification |
| --- | --- | --- | --- | --- |
| 1 | LOW | `TryClaimProcessingAsync` lacked the inverse of the capture guard, so the non-capture contract was not self-enforcing at the SQL layer. | `5a9ce311` | Added `AND RequestType NOT LIKE {CaptureRequestTypeLike}` to the non-capture claim WHERE clause, mirroring `TryClaimProcessingCaptureAsync`. Existing `chat.completion` claim tests still pass (non-capture types still match); capture types now provably cannot be claimed via this path. |
| 2 | LOW | `TryClaimProcessingAsync_ShouldRefreshTrackedEntity` only asserted `tracked.UpdatedAt != expectedUpdatedAt`, which an in-memory `MarkAsProcessing()` substitute would also satisfy. | `08ab7456` | Test now reads the persisted row via a fresh-scope `AsNoTracking().SingleAsync(...)` and asserts `tracked.UpdatedAt == persisted.UpdatedAt`, proving a true DB `ReloadAsync` rather than a UTC-now in-memory mutation. |
| 3 | LOW | The post-claim `item == null \|\| item.Status != Processing` branch reused the `already_claimed` label, conflating "we never won the claim" with "we won but the row vanished between UPDATE and SELECT." | `0fabb60b` | Now logs a `Warning` and emits a distinct `claimed_then_missing` telemetry outcome in the non-capture branch (`LlmQueueToProposalWorker.ProcessSingleItemAsync`). |
| 4 | LOW | The worker fakes accepted `expectedUpdatedAt` but never asserted it, so a regression passing `default`/now (which would stall the queue in prod via a no-match optimistic-concurrency UPDATE) would leave tests green. | `4315573c` | Both fakes (`LlmQueueToProposalWorkerTests` and root `WorkerResilienceTests`) now record `(requestId, expectedUpdatedAt)`. Added happy-path tests asserting the worker forwards the pending item's actual `UpdatedAt` (snapshotted before the claim mutates it). |
| 5 | LOW | No worker recovers non-capture `LlmRequest`s stuck in `Processing` (worker crash after a successful claim leaves the item Processing forever; `ProposalHousekeepingWorker` only touches proposals). Out of scope for this PR. | Seeded #1209 | Confirmed no prior issue existed (searched "stuck Processing" / "sweeper"). Issue #1209 (`backend`, `tech-debt`) tracks adding a sweeper mirroring `OutboundWebhookDeliveryWorker.RecoverStuckProcessingDeliveriesAsync`, references PR #1200 and issue #1190. |

CI note

Main is green at 8e603723; the earlier Windows CLI timeout was runner-transient. This push triggers a fresh required-CI run; the two prior MEDIUM CI findings resolve via that re-run.

Verification commands

dotnet test backend/tests/Taskdeck.Api.Tests/Taskdeck.Api.Tests.csproj -c Release --filter "FullyQualifiedName~LlmQueue"        # 65 passed
dotnet test backend/tests/Taskdeck.Api.Tests/Taskdeck.Api.Tests.csproj -c Release --filter "FullyQualifiedName~WorkerResilienceTests"  # 14 passed

github-project-automation Bot added this to Taskdeck Execution Jun 6, 2026

github-project-automation Bot moved this to Pending in Taskdeck Execution Jun 6, 2026

gemini-code-assist Bot reviewed Jun 6, 2026

Comment thread backend/src/Taskdeck.Application/Services/LlmQueueService.cs Outdated

chatgpt-codex-connector Bot reviewed Jun 6, 2026

Comment thread backend/src/Taskdeck.Application/Services/LlmQueueService.cs Outdated

Chris0Jeky added 3 commits June 13, 2026 00:42

Chris0Jeky mentioned this pull request Jun 12, 2026

LlmQueueRepository.TryClaimProcessingCaptureAsync bypasses EF change tracker (latent staleness) #1206

Open

Chris0Jeky added 4 commits June 13, 2026 10:07

Chris0Jeky mentioned this pull request Jun 13, 2026

No sweeper recovers non-capture LlmRequests stuck in Processing #1209

Open

Chris0Jeky merged commit 5d0f16e into main Jun 13, 2026
20 checks passed

github-project-automation Bot moved this from Pending to Done in Taskdeck Execution Jun 13, 2026

Chris0Jeky merged 9 commits into main from fix/1190-queue-claim-race
