Skip to content

FEA-1461: Dead-letter agent-session sync on persistent rate_limited#258

Open
aeyeCEO wants to merge 1 commit into
mainfrom
FEA-1461-sync-rate-limit-deadletter
Open

FEA-1461: Dead-letter agent-session sync on persistent rate_limited#258
aeyeCEO wants to merge 1 commit into
mainfrom
FEA-1461-sync-rate-limit-deadletter

Conversation

@aeyeCEO
Copy link
Copy Markdown
Contributor

@aeyeCEO aeyeCEO commented May 29, 2026

Summary

  • Fixes the infinite-retry-loop on rate_limited for agent-session sync that was log-spamming every 5s on oversized sessions (~40 MiB observed locally).
  • Mirrors the existing AckTimeout dead-letter pattern: per-session counter (rateLimitedCountById), threshold (MAX_CONSECUTIVE_RATE_LIMITED = 5), and 30s per-session backoff (RATE_LIMIT_BACKOFF_MS) so other queued sessions are not head-of-line-blocked.
  • See FEA-1461 and implementation plan PLN-771.

Test plan

  • pnpm -C apps/desktop typecheck — clean
  • pnpm -C apps/desktop lint — clean
  • pnpm -C apps/desktop build:agent-monitor — clean (SQLite gate + hard-gates pass)
  • New test/agent-session-sync-rate-limit-deadletter.test.ts — 4/4 pass (dead-letter trip, head-of-line resolution, counter reset on success, AckTimeout regression guard)
  • pnpm -C apps/desktop test full suite — 1922/1922 pass, 0 fail
  • Independent code review — two findings (Medium: stale throttle timestamp on empty candidate set; Low: asymmetric Map cleanup on dead-letter) both addressed in this PR
  • Manual smoke: trigger a rate_limited from the relay against an oversized local session and confirm logs go quiet after 5 deferrals (next on-machine test cycle)

Risks

  • Tunable defaults (5 attempts × 30s = 2.5 min before dead-letter). If the relay's typical throttle clears faster, we under-utilize the window; if slower, we dead-letter recoverable sessions. Constants are exported so a follow-up can adjust based on telemetry.
  • Dead-letter state is in-memory only — app restart re-enters the session and re-trips after another 5 attempts. Documented as out-of-scope on FEA-1461.

🤖 Generated with Claude Code

- Add MAX_CONSECUTIVE_RATE_LIMITED = 5 + per-session counter
  rateLimitedCountById; mirror the existing AckTimeout dead-letter pattern
  for the previously-unhandled `rate_limited` response. Threshold is
  intentionally higher than the timeout threshold (3) because rate-limits
  are more legitimately transient.
- Add RATE_LIMIT_BACKOFF_MS = 30_000 + nextRetryAfterMs Map. After a
  rate_limited rejection, defer re-attempting the same session for 30s.
  Caps the wasted re-chunking work at one attempt per 30s instead of every
  5s sync tick (the original symptom). Other queued sessions continue to
  flow through a new pickReadyCandidates() helper that iterates the queue
  rather than slicing the head, so a backed-off session does NOT
  head-of-line-block siblings behind it.
- New RateLimited branch in handleBatchAck replacing the bare `else`
  fallthrough that previously only debug-logged. Dead-letters at 5
  consecutive rejections; otherwise sets backoff and logs one info line
  per deferral (vs three lines per 5s tick before).
- Successful ack clears all three counter maps (timeoutCountById,
  rateLimitedCountById, nextRetryAfterMs) for the synced session ids so a
  subsequent rejection starts from a clean slate.
- Symmetric cross-path cleanup on dead-letter: both AckTimeout and
  RateLimited paths now clear the other reason's Map entries for the
  dead-lettered id, so no orphaned state lingers.
- lastIncrementalBatchAttemptedAtMs is now only stamped when at least one
  candidate was actually selected; before this change, an all-backed-off
  queue would set the throttle timestamp and delay a newly-added session
  for a full MIN_INCREMENTAL_SYNC_INTERVAL_MS (30s).
- True resume-from-chunk-N is documented as out of scope; would require
  server-side partial-payload reassembly that does not exist today.
- Bump desktop version to 0.15.99 per repo CI rule.

Testing:
- pnpm -C apps/desktop typecheck: clean
- pnpm -C apps/desktop lint: clean
- pnpm -C apps/desktop build:agent-monitor: clean
  (SQLite gate + all hard-gates pass)
- New test/agent-session-sync-rate-limit-deadletter.test.ts: 4/4 pass —
  covers (1) 5-consecutive dead-letter trip, (2) head-of-line resolution
  via separate-tick session arrival, (3) counter reset on successful ack
  between rejections, (4) AckTimeout regression guard.
- pnpm -C apps/desktop test (full suite): 1922/1922 pass, 0 fail.
- Independent code-reviewer agent surfaced one Medium correctness finding
  (lastIncrementalBatchAttemptedAtMs stamped on empty candidate set) and
  one Low cleanup finding (asymmetric Map cleanup on dead-letter). Both
  addressed in this commit.

Risks:
- Behavior change for legitimate transient rate-limits: the relay's
  natural throttle-clear window matters. 30s backoff × 5 attempts = 2.5
  minutes before dead-letter. If the relay's typical throttle clears
  faster than 30s, we under-utilize the window. If it lasts longer than
  150s, we dead-letter sessions we could have synced. Both tunable via
  the exported constants in a follow-up if data shows the defaults are
  wrong.
- Dead-lettered state is in-memory only; an app restart re-enters the
  session into the queue and re-trips the dead-letter after another 5
  attempts. Acceptable for v1, documented in FEA-1461 out-of-scope.
- The new pickReadyCandidates iterates the queue O(n) per tick. Bounded
  by INCREMENTAL_SESSION_BATCH_SIZE (10) early-out; performance impact
  negligible even for thousand-session queues.
@aeyeCEO aeyeCEO requested a review from a team May 29, 2026 19:59
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8a13cd7e9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +652 to +659
if (count >= MAX_CONSECUTIVE_RATE_LIMITED) {
deadLettered.push(id);
this.rateLimitedCountById.delete(id);
this.nextRetryAfterMs.delete(id);
// FEA-1461: also clear any orphaned timeout state for this
// session so a dead-lettered id leaves no Map entries behind.
this.timeoutCountById.delete(id);
this.deadLetteredIds.add(id);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not dead-letter relay-level rate limits

When rate_limited comes from relay availability rather than a bad session payload, this counter now drops the session after five attempts. I checked cloud-socket.sendAgentSessions, and it returns DesktopAgentSessionsAckReason.RateLimited whenever !isRelayReady() or !socket?.connected, so a flapping/disconnected relay that races after syncOnce() has selected a batch can make otherwise valid sessions hit this branch and be dequeued permanently. Please distinguish local/transport backpressure from server payload rejection before applying the per-session dead-letter path.

Useful? React with 👍 / 👎.

// candidate was filtered by backoff would unnecessarily delay
// a session added to the queue moments later by the full
// MIN_INCREMENTAL_SYNC_INTERVAL_MS window.
if (candidateIds.length > 0) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You only stamp lastIncrementalBatchAttemptedAtMs when a candidate was actually picked, but the incremental branch is still chosen before backfill. So when every ready incremental session is in backoff, pickReadyCandidates returns [], candidateIds.length === 0 hits the early return below, and backfill is skipped for that tick. Since you didn't stamp the timestamp, the throttle stays satisfied and the incremental branch keeps getting selected until backoff clears, so backfill stays starved the whole time. Today it's hidden because RATE_LIMIT_BACKOFF_MS == MIN_INCREMENTAL_SYNC_INTERVAL_MS (both 30s) so the overlap is ~one tick. The PR says a telemetry follow-up may tune the backoff. Push it past 30s and you starve backfill for (backoff - 30s) every cycle.

import { DesktopAgentSessionsAckReason } from "../src/main/cloud-protocol.js";

function createTestDb(rootDir: string): DatabaseSync {
const userDataDir = path.join(rootDir, "user-data");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createTestDb is a verbatim copy of createServiceTestDatabase in agent-session-sync-service.test.ts, and flushAgentSessionSync is byte-for-byte identical. The whole agent-monitor schema now lives in two test files and will drift the next time those tables change. CLAUDE.md flags this exact thing under tests|duplication. Pull these into test/helpers/.

Copy link
Copy Markdown
Contributor

@thadeusb thadeusb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mirroring the AckTimeout dead-letter is the right shape and the backoff reads clean. The thing that has to get settled before merge is the one Codex already flagged, rate_limited also fires on a relay blip so a flap can permanently drop good sessions. Left a separate note on backfill getting starved when the whole incremental queue is in backoff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants