FEA-1461: Dead-letter agent-session sync on persistent rate_limited#258
FEA-1461: Dead-letter agent-session sync on persistent rate_limited#258aeyeCEO wants to merge 1 commit into
Conversation
- Add MAX_CONSECUTIVE_RATE_LIMITED = 5 + per-session counter rateLimitedCountById; mirror the existing AckTimeout dead-letter pattern for the previously-unhandled `rate_limited` response. Threshold is intentionally higher than the timeout threshold (3) because rate-limits are more legitimately transient. - Add RATE_LIMIT_BACKOFF_MS = 30_000 + nextRetryAfterMs Map. After a rate_limited rejection, defer re-attempting the same session for 30s. Caps the wasted re-chunking work at one attempt per 30s instead of every 5s sync tick (the original symptom). Other queued sessions continue to flow through a new pickReadyCandidates() helper that iterates the queue rather than slicing the head, so a backed-off session does NOT head-of-line-block siblings behind it. - New RateLimited branch in handleBatchAck replacing the bare `else` fallthrough that previously only debug-logged. Dead-letters at 5 consecutive rejections; otherwise sets backoff and logs one info line per deferral (vs three lines per 5s tick before). - Successful ack clears all three counter maps (timeoutCountById, rateLimitedCountById, nextRetryAfterMs) for the synced session ids so a subsequent rejection starts from a clean slate. - Symmetric cross-path cleanup on dead-letter: both AckTimeout and RateLimited paths now clear the other reason's Map entries for the dead-lettered id, so no orphaned state lingers. - lastIncrementalBatchAttemptedAtMs is now only stamped when at least one candidate was actually selected; before this change, an all-backed-off queue would set the throttle timestamp and delay a newly-added session for a full MIN_INCREMENTAL_SYNC_INTERVAL_MS (30s). - True resume-from-chunk-N is documented as out of scope; would require server-side partial-payload reassembly that does not exist today. - Bump desktop version to 0.15.99 per repo CI rule. Testing: - pnpm -C apps/desktop typecheck: clean - pnpm -C apps/desktop lint: clean - pnpm -C apps/desktop build:agent-monitor: clean (SQLite gate + all hard-gates pass) - New test/agent-session-sync-rate-limit-deadletter.test.ts: 4/4 pass — covers (1) 5-consecutive dead-letter trip, (2) head-of-line resolution via separate-tick session arrival, (3) counter reset on successful ack between rejections, (4) AckTimeout regression guard. - pnpm -C apps/desktop test (full suite): 1922/1922 pass, 0 fail. - Independent code-reviewer agent surfaced one Medium correctness finding (lastIncrementalBatchAttemptedAtMs stamped on empty candidate set) and one Low cleanup finding (asymmetric Map cleanup on dead-letter). Both addressed in this commit. Risks: - Behavior change for legitimate transient rate-limits: the relay's natural throttle-clear window matters. 30s backoff × 5 attempts = 2.5 minutes before dead-letter. If the relay's typical throttle clears faster than 30s, we under-utilize the window. If it lasts longer than 150s, we dead-letter sessions we could have synced. Both tunable via the exported constants in a follow-up if data shows the defaults are wrong. - Dead-lettered state is in-memory only; an app restart re-enters the session into the queue and re-trips the dead-letter after another 5 attempts. Acceptable for v1, documented in FEA-1461 out-of-scope. - The new pickReadyCandidates iterates the queue O(n) per tick. Bounded by INCREMENTAL_SESSION_BATCH_SIZE (10) early-out; performance impact negligible even for thousand-session queues.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e8a13cd7e9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (count >= MAX_CONSECUTIVE_RATE_LIMITED) { | ||
| deadLettered.push(id); | ||
| this.rateLimitedCountById.delete(id); | ||
| this.nextRetryAfterMs.delete(id); | ||
| // FEA-1461: also clear any orphaned timeout state for this | ||
| // session so a dead-lettered id leaves no Map entries behind. | ||
| this.timeoutCountById.delete(id); | ||
| this.deadLetteredIds.add(id); |
There was a problem hiding this comment.
Do not dead-letter relay-level rate limits
When rate_limited comes from relay availability rather than a bad session payload, this counter now drops the session after five attempts. I checked cloud-socket.sendAgentSessions, and it returns DesktopAgentSessionsAckReason.RateLimited whenever !isRelayReady() or !socket?.connected, so a flapping/disconnected relay that races after syncOnce() has selected a batch can make otherwise valid sessions hit this branch and be dequeued permanently. Please distinguish local/transport backpressure from server payload rejection before applying the per-session dead-letter path.
Useful? React with 👍 / 👎.
| // candidate was filtered by backoff would unnecessarily delay | ||
| // a session added to the queue moments later by the full | ||
| // MIN_INCREMENTAL_SYNC_INTERVAL_MS window. | ||
| if (candidateIds.length > 0) { |
There was a problem hiding this comment.
You only stamp lastIncrementalBatchAttemptedAtMs when a candidate was actually picked, but the incremental branch is still chosen before backfill. So when every ready incremental session is in backoff, pickReadyCandidates returns [], candidateIds.length === 0 hits the early return below, and backfill is skipped for that tick. Since you didn't stamp the timestamp, the throttle stays satisfied and the incremental branch keeps getting selected until backoff clears, so backfill stays starved the whole time. Today it's hidden because RATE_LIMIT_BACKOFF_MS == MIN_INCREMENTAL_SYNC_INTERVAL_MS (both 30s) so the overlap is ~one tick. The PR says a telemetry follow-up may tune the backoff. Push it past 30s and you starve backfill for (backoff - 30s) every cycle.
| import { DesktopAgentSessionsAckReason } from "../src/main/cloud-protocol.js"; | ||
|
|
||
| function createTestDb(rootDir: string): DatabaseSync { | ||
| const userDataDir = path.join(rootDir, "user-data"); |
There was a problem hiding this comment.
createTestDb is a verbatim copy of createServiceTestDatabase in agent-session-sync-service.test.ts, and flushAgentSessionSync is byte-for-byte identical. The whole agent-monitor schema now lives in two test files and will drift the next time those tables change. CLAUDE.md flags this exact thing under tests|duplication. Pull these into test/helpers/.
thadeusb
left a comment
There was a problem hiding this comment.
Mirroring the AckTimeout dead-letter is the right shape and the backoff reads clean. The thing that has to get settled before merge is the one Codex already flagged, rate_limited also fires on a relay blip so a flap can permanently drop good sessions. Left a separate note on backfill getting starved when the whole incremental queue is in backoff.
Summary
rate_limitedfor agent-session sync that was log-spamming every 5s on oversized sessions (~40 MiB observed locally).AckTimeoutdead-letter pattern: per-session counter (rateLimitedCountById), threshold (MAX_CONSECUTIVE_RATE_LIMITED = 5), and 30s per-session backoff (RATE_LIMIT_BACKOFF_MS) so other queued sessions are not head-of-line-blocked.Test plan
pnpm -C apps/desktop typecheck— cleanpnpm -C apps/desktop lint— cleanpnpm -C apps/desktop build:agent-monitor— clean (SQLite gate + hard-gates pass)test/agent-session-sync-rate-limit-deadletter.test.ts— 4/4 pass (dead-letter trip, head-of-line resolution, counter reset on success, AckTimeout regression guard)pnpm -C apps/desktop testfull suite — 1922/1922 pass, 0 failRisks
🤖 Generated with Claude Code