Skip to content

feat(retry): add persistent retry mode for unattended CI/CD environments#3080

Merged
wenshao merged 11 commits into
QwenLM:mainfrom
zhangxy-zju:feat/persistent-retry-mode
Apr 21, 2026
Merged

feat(retry): add persistent retry mode for unattended CI/CD environments#3080
wenshao merged 11 commits into
QwenLM:mainfrom
zhangxy-zju:feat/persistent-retry-mode

Conversation

@zhangxy-zju

@zhangxy-zju zhangxy-zju commented Apr 10, 2026

Copy link
Copy Markdown
Collaborator

Why

When Qwen Code runs as part of a CI/CD pipeline or as a background daemon (e.g., overnight batch refactoring, large-scale security audits), transient API capacity errors — HTTP 429 (Rate Limit) and 529 (Overloaded) — should not terminate long-running tasks. The current implementation uses a fixed maxAttempts: 7 across all modes, which means a multi-hour automated job can be killed by a brief API outage lasting just a couple of minutes. This is the #1 blocker for using Qwen Code as reliable DevOps infrastructure.

Claude Code solves this with a "persistent retry" mode (services/api/withRetry.ts). This PR brings equivalent capability to Qwen Code.

Closes #3003

What changed

Core: packages/core/src/utils/retry.ts

  • isUnattendedMode() — Detects persistent retry opt-in via QWEN_CODE_UNATTENDED_RETRY env var (explicit only — CI=true alone does NOT activate, to avoid silently turning fast-fail CI jobs into infinite-wait jobs)
  • isTransientCapacityError() — Classifies only 429 and 529 as transient capacity errors (HTTP 500 is excluded — it may be a permanent server bug)
  • sleepWithHeartbeat() — Chunked sleep that emits heartbeat callbacks every 30s to keep CI runners alive
  • retryWithBackoff() enhanced — In persistent mode, transient capacity errors bypass maxAttempts and retry indefinitely with:
    • Exponential backoff capped at 5 minutes per retry
    • Absolute single-wait cap at 6 hours
    • Retry-After header respected when present
    • AbortSignal support for graceful cancellation
    • Independent persistentAttempt counter (attempt clamping technique from Claude Code)

Callers: geminiChat.ts, baseLlmClient.ts

  • Pass persistentMode: isUnattendedMode() and a heartbeatFn that writes to stderr

Docs

  • docs/users/configuration/settings.md — Added QWEN_CODE_UNATTENDED_RETRY to environment variables table
  • docs/users/features/headless.md — Added "Persistent Retry Mode" section with activation, examples, and monitoring

Non-breaking

  • persistentMode defaults to false — zero behavior change for interactive users
  • All 26 existing tests pass unchanged; 20 new tests added (46 total)

How to verify

Unit tests

npx vitest run packages/core/src/utils/retry.test.ts

All 46 tests pass, covering:

  • 429/529 persistent retry beyond maxAttempts
  • 500 still fails at maxAttempts (not persistent)
  • CI=true alone does NOT activate persistent mode
  • Backoff cap enforcement
  • Heartbeat callback invocation
  • AbortSignal cancellation
  • isUnattendedMode() env var detection (truthy/falsy)
  • Regression: non-persistent mode unchanged

Manual / integration

  1. QWEN_CODE_UNATTENDED_RETRY=1 qwen-code "refactor all files in src/" — enable persistent retry
  2. Temporarily inject a 429 error in the API layer and observe persistent retry with heartbeat logs on stderr
  3. Verify normal interactive mode is unaffected (no env vars set)

@github-actions

Copy link
Copy Markdown
Contributor

📋 Review Summary

This PR introduces a persistent retry mode for handling transient API capacity errors (429/529) in unattended/CI environments, enabling long-running automated tasks to survive temporary API outages. The implementation is well-structured with comprehensive test coverage, but there are some concerns around the attempt clamping technique and potential infinite loop risks that should be addressed before merging.

🔍 General Feedback

  • Strong test coverage: 20 new tests thoroughly cover persistent mode behavior, edge cases, and regression scenarios
  • Non-breaking design: persistentMode defaults to false, ensuring zero impact on interactive users
  • Good separation of concerns: New helper functions (isTransientCapacityError, isUnattendedMode, sleepWithHeartbeat) are well-isolated
  • Heartbeat mechanism: Excellent addition for keeping CI runners alive during long waits
  • AbortSignal support: Proper graceful cancellation support for persistent retries
  • Concern: The attempt clamping technique (if (attempt >= maxAttempts) { attempt = maxAttempts - 1 }) is a clever but potentially confusing pattern that could be refactored for clarity

🎯 Specific Feedback

🔴 Critical

  • File: packages/core/src/utils/retry.ts:267-271 - The attempt clamping technique creates a potential maintenance hazard. While functional, this pattern of modifying the loop counter to prevent exit is non-standard and could confuse future maintainers:
    // Clamp attempt so the while-loop never exits
    if (attempt >= maxAttempts) {
      attempt = maxAttempts - 1;
    }
    Recommendation: Consider restructuring the loop logic to use separate conditions for persistent vs. normal mode, or extract the persistent retry loop into its own function for clarity.

🟡 High

  • File: packages/core/src/utils/retry.ts:224-227 - The shouldPersist logic has a subtle issue. When shouldPersist is true, the code bypasses the normal retry exhaustion check entirely, but then still uses attempt in the while condition. This creates two independent counters (attempt and persistentAttempt) that track different things:

    const shouldPersist = persistent && isTransient;
    
    if (!shouldPersist) {
      if (attempt >= maxAttempts || !shouldRetryOnError(error as Error)) {
        throw error;
      }
    }

    Recommendation: Add a comment explaining why attempt is still used in the while condition even when in persistent mode, or consider using persistentAttempt for the while condition in persistent mode.

  • File: packages/core/src/utils/retry.ts:15-16 - The constants are defined at module level but the interface RetryOptions allows overriding them. This is fine, but there's no validation that persistentCapMs should be greater than persistentMaxBackoffMs. If someone accidentally swaps them, behavior could be unexpected.
    Recommendation: Add validation or at least a comment about the relationship between these values.

🟢 Medium

  • File: packages/core/src/utils/retry.ts:238-243 - The backoff calculation in persistent mode applies jitter after capping, which could result in delays slightly exceeding the cap (though the comment acknowledges this with + 1 for rounding). The test at line 615 checks 5000 * 1.25 + 1, which is a 25% jitter range, but the code comment says "±25%":

    delayMs += delayMs * 0.25 * (Math.random() * 2 - 1);

    This is correct, but the test comment at line 614 says "cap + max jitter" which could be clearer about the 25% figure.
    Recommendation: Align the comment to explicitly mention 25% jitter for consistency.

  • File: packages/core/src/core/geminiChat.ts:592-597 and packages/core/src/core/baseLlmClient.ts:120-125 - The heartbeat function writes to process.stderr with a fixed format. This is good for visibility, but there's no way to customize or suppress this output if a CI environment has specific logging requirements.
    Recommendation: Consider making the heartbeat callback optional or providing a way to customize the output format via config.

  • File: packages/core/src/utils/retry.ts:78-82 - The isEnvTruthy helper function is defined after isUnattendedMode in the diff, but logically it should be defined before since it's a dependency. The current order works but is slightly harder to follow.
    Recommendation: Move isEnvTruthy above isUnattendedMode for better code flow.

🔵 Low

  • File: packages/core/src/utils/retry.ts:51-54 - The HeartbeatInfo interface is exported but only used internally within the retry module. Consider whether this needs to be part of the public API or if it should remain internal.
    Suggestion: If external customization of heartbeat callbacks is desired, keep it exported. Otherwise, consider making it internal to reduce API surface.

  • File: packages/core/src/utils/retry.ts:135-148 - The sleepWithHeartbeat function checks ctx.signal?.aborted at the start of each chunk, but if the signal is aborted mid-chunk (during the delay(chunk) call), it won't be detected until the chunk completes. For a 30-second heartbeat interval, this could mean up to 30 seconds of delay before abort is recognized.
    Suggestion: Consider using AbortSignal.timeout() or breaking chunks into smaller intervals for more responsive abort handling, though this is a minor concern given the 30-second default.

  • File: packages/core/src/utils/retry.test.ts:558-560 - The test for backoff cap enforcement checks delays with 5000 * 1.25 + 1, which includes the 25% jitter. This is correct, but the test comment says "cap + max jitter + rounding" without specifying the 25% figure explicitly.
    Suggestion: Add a comment noting the 25% jitter constant for future maintainers.

  • File: packages/core/src/utils/retry.ts:219 - The debug log message format changes between persistent and normal mode ([Persistent] prefix vs. no prefix). This is good for differentiation, but consider adding the mode indicator to the normal mode logs as well for consistency.
    Suggestion: Add [Normal] or similar prefix to non-persistent mode logs for easier log filtering.

✅ Highlights

  • Excellent test coverage: The test suite comprehensively covers persistent mode behavior including 429/529 indefinite retry, 500 error handling (correctly fails at maxAttempts), backoff cap enforcement, heartbeat invocation, AbortSignal cancellation, and regression tests for normal mode
  • Clean API design: The persistentMode, heartbeatFn, and signal options are well-integrated into the existing RetryOptions interface without breaking existing callers
  • Smart use of Retry-After header: The implementation respects server-provided retry timing when available, falling back to exponential backoff otherwise
  • Heartbeat mechanism: The chunked sleep with heartbeat callbacks is an excellent solution for keeping CI runners alive during long waits
  • Proper error classification: Excluding HTTP 500 from transient capacity errors is the right call - permanent server bugs shouldn't trigger infinite retries
  • Non-breaking by default: The design ensures existing behavior is completely unchanged for interactive users

Comment thread packages/core/src/core/baseLlmClient.ts
Comment thread packages/core/src/core/geminiChat.ts

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in “persistent retry” mode to make long-running, unattended Qwen Code runs (e.g., CI/CD) resilient to transient API capacity errors without impacting interactive behavior.

Changes:

  • Enhanced retryWithBackoff() with persistent retry semantics for 429/529, heartbeat sleeping, and additional options (persistentMode, caps, heartbeat, signal).
  • Wired persistent retry into core API callers (GeminiChat, BaseLlmClient) and documented the new env var (QWEN_CODE_UNATTENDED_RETRY).
  • Added unit tests for transient classification, unattended mode detection, and persistent retry behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
packages/core/src/utils/retry.ts Implements persistent retry mode, heartbeat sleep, and new retry options.
packages/core/src/utils/retry.test.ts Adds coverage for persistent retry mode + env detection helpers.
packages/core/src/core/geminiChat.ts Enables persistent mode in Gemini caller with stderr heartbeat logging.
packages/core/src/core/baseLlmClient.ts Enables persistent mode in base client with stderr heartbeat logging.
docs/users/features/headless.md Documents “Persistent Retry Mode” usage and monitoring behavior.
docs/users/configuration/settings.md Adds QWEN_CODE_UNATTENDED_RETRY to the environment variables table.
Comments suppressed due to low confidence (1)

packages/core/src/utils/retry.ts:186

  • The new signal?: AbortSignal option is only consulted inside sleepWithHeartbeat (persistent path). In non-persistent mode (and in the shouldRetryOnContent backoff), delays ignore the signal and the loop doesn't check signal.aborted before calling fn(). If the intent is "graceful cancellation" across all retries, consider checking signal?.aborted at the top of the loop and making all waits abortable (including non-persistent delay(...) calls).
  while (attempt < maxAttempts) {
    attempt++;
    try {
      const result = await fn();


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/core/src/utils/retry.ts Outdated
Comment thread packages/core/src/utils/retry.ts Outdated
Comment thread packages/core/src/utils/retry.ts Outdated
Comment thread packages/core/src/utils/retry.ts Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/core/src/utils/retry.ts Outdated
Comment thread packages/core/src/utils/retry.ts
Comment thread packages/core/src/utils/retry.ts
Comment thread packages/core/src/utils/retry.ts

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

packages/core/src/core/client.ts:963 still calls retryWithBackoff() without persistentMode: isUnattendedMode() or the new unattended retry wiring, while this PR adds that behavior in BaseLlmClient.generateJson() and GeminiChat.makeApiCallAndProcessStream(). Callers that still route through Client.generateContent()—including packages/core/src/tools/web-fetch.ts:111, packages/cli/src/ui/commands/btwCommand.ts:66, and packages/cli/src/ui/commands/summaryCommand.ts:92—will still fail after the normal retry budget during 429/529 capacity outages, so unattended retry remains inconsistent across top-level model entry points. Please apply the same unattended retry options here, or centralize the policy in a shared wrapper so all top-level model calls opt in consistently.

— gpt-5.4 via Qwen Code /review

@zhangxy-zju

Copy link
Copy Markdown
Collaborator Author

@wenshao Good catch — client.ts:963 was indeed missing the persistent retry wiring.

Fixed in 389e856: added persistentMode: isUnattendedMode() and signal: abortSignal to the retryWithBackoff() call in GeminiClient.generateContent(), matching the existing wiring in baseLlmClient.ts and geminiChat.ts.

All three retryWithBackoff() call sites now consistently forward persistent mode and abort signal.

@zhangxy-zju zhangxy-zju dismissed wenshao’s stale review April 12, 2026 09:36

Fixed in ec7659c — added persistentMode and signal to client.ts:963 retryWithBackoff() call. All three call sites now have consistent wiring.

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments. — gpt-5.4 via Qwen Code /review

Comment thread packages/core/src/core/client.ts

while (remaining > 0) {
if (ctx.signal?.aborted) {
throw new Error('Retry aborted by signal');

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] sleepWithHeartbeat() says it supports AbortSignal, but each chunk still waits on a plain timer. If the signal is aborted after a chunk starts, cancellation is delayed until that chunk finishes, which can be up to the full heartbeat interval in persistent mode.

— gpt-5.4 via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged but won't fix — the worst-case cancel latency in persistent mode is bounded by heartbeatIntervalMs (default 30s), and a single chunk is already a discrete sleep that ends naturally without any work happening during it. Making each chunk Promise.race([delay, signal])-based would add abort-listener wiring/cleanup for ≤30s of latency improvement.

This is consistent with the previously declined Copilot suggestion about non-persistent delay() not honoring signal — same rationale: the bounded wait makes the responsiveness gap not worth the complexity. Happy to revisit if there's a concrete user-facing scenario where 30s cancel lag is a problem.

When running in CI/CD pipelines or background daemon mode, transient API
capacity errors (429/529) should not terminate long-running tasks after a
fixed number of retries. This adds an environment-aware persistent retry
mode that retries indefinitely for transient errors, with exponential
backoff capped at 5 minutes and heartbeat keepalives every 30 seconds to
prevent CI runner timeouts.
Add environment variable entries (QWEN_CODE_UNATTENDED_RETRY, QWEN_CODE_BG)
to the settings reference, and a new "Persistent Retry Mode" section to the
headless mode docs covering activation, behavior, and CI/CD usage examples.
…NDED_RETRY

Remove QWEN_CODE_BG and CI=true as activation triggers for persistent retry.
Having multiple env vars with identical behavior adds confusion, and silently
activating infinite retry on CI=true is dangerous — a regular CI test hitting
a 429 would hang forever instead of failing fast.
- Forward caller's abortSignal into retryWithBackoff in both
  baseLlmClient.ts and geminiChat.ts so persistent waits remain
  cancellable (wenshao)
- Re-apply maxBackoff and capMs after jitter so delays strictly
  respect stated caps (Copilot)
- Respect shouldRetryOnError in persistent mode so callers can
  force fast-fail even for transient 429/529 errors (Copilot)
- Guard sleepWithHeartbeat against infinite loop when heartbeat
  interval is <= 0 via Math.max(1, ...) (Copilot)
- Normalize isEnvTruthy with trim/toLowerCase for robust env
  var parsing across CI conventions (Copilot)
Server-specified Retry-After values should only be limited by the
absolute cap (capMs/6h), not the exponential backoff cap (maxBackoff/5min).
Jitter is also skipped for Retry-After since the server already specified
the exact wait time.
…ention

Replace custom isEnvTruthy (trim + toLowerCase) with strict matching
(val === 'true' || val === '1') to match parseBooleanEnvFlag used
elsewhere in the codebase. Prevents inconsistent behavior where
'TRUE' or ' 1 ' would activate persistent retry here but not in
telemetry or other env-driven features.
Cover three key behaviors:
- Retry-After is NOT capped at maxBackoff (only at capMs)
- Retry-After IS capped at persistentCapMs absolute limit
- Retry-After delays have no jitter applied
The existing vi.mock for retry.js only exported retryWithBackoff.
After adding isUnattendedMode to the retry module, baseLlmClient.ts
imports it, causing all 10 generateJson tests to fail with
'No "isUnattendedMode" export is defined on the mock'.
Forward persistentMode and abortSignal to retryWithBackoff() in
GeminiClient.generateContent(), matching the existing wiring in
baseLlmClient.ts and geminiChat.ts.
@zhangxy-zju zhangxy-zju force-pushed the feat/persistent-retry-mode branch from ec7659c to 051f367 Compare April 13, 2026 02:39
Comment thread packages/core/src/core/client.ts
Resolve conflict in docs/users/configuration/settings.md by keeping both
QWEN_CODE_UNATTENDED_RETRY (this branch) and QWEN_CODE_PROFILE_STARTUP
(main) rows.

Address PR QwenLM#3080 review (wenshao 2026-04-19): add the missing heartbeatFn
to the retryWithBackoff() call in GeminiClient.generateContent(), matching
the stderr keepalive pattern already used in geminiChat.ts and
baseLlmClient.ts. Without this, unattended retries on the non-streaming
content path stay silent during long 429/529 outages, contradicting the
heartbeat-based monitoring promised in the new docs.
@zhangxy-zju

zhangxy-zju commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator Author

Validation Strategy & Results

After the merge + heartbeat-fix commit (3f6020f0), I ran a 4-layer validation. 86 assertions pass, 0 manual steps remaining. Scripts are self-contained — running all 4 layers on a clean checkout takes < 30s.

Layer 1 — Unit tests (56/56)

vitest run packages/core/src/utils/retry.test.ts — covers all branches of retry.ts: base retry, Qwen OAuth quota handling, isTransientCapacityError, isUnattendedMode strict matching, persistent mode core paths, Retry-After, multi-shape getErrorStatus.

Layer 2 — In-process integration (17/17)

Parent spawns a worker subprocess that imports the built dist/ artifact (not the TS source vitest compiles in-memory) and exercises 6 scenarios under real process.env.QWEN_CODE_UNATTENDED_RETRY. Catches issues that vi.stubEnv cannot.

Key evidence — T3 (persistent 429 breaks past maxAttempts=5):

attempts=12, heartbeatCount=29
heartbeatAttempts=[1, 2, 3,3,3, 4,4,4, 5,5,5, 6,6,6, ..., 11,11,11]

Layer 3 — fake provider e2e (8/8)

Local http.Server returning controlled 429/529 + Retry-After + real OpenAI SDK (maxRetries: 0 to disable SDK-internal retry). Validates the real SDK error shape flowing through retryWithBackoff, which mocked unit tests cannot do.

Key evidence:

  • S1: persistent mode succeeds via exponential backoff (1419ms, heartbeat fires)
  • S2: 529 survives maxAttempts=1, succeeds on 3rd call
  • S3: real RateLimitErrorgetErrorStatus extracts status=429 correctly
  • S4: AbortSignal cancels within 776ms despite server advertising Retry-After: 5

Layer 4 — SIGINT + streaming stderr e2e (5/5)

Previously considered manual-only. Turns out child.kill('SIGINT') is equivalent to terminal Ctrl+C, and stderr realtime-vs-buffered is observable via timestamped child.stderr.on('data'). So all 3 originally-manual checks (Ctrl+C responsiveness, stderr realtime, GHA-equivalent integration) are scripted.

Key evidence:

[ 1207ms] [qwen-code] Waiting for API capacity... attempt 2, retry in 1s   ← realtime stderr
[ 1218ms] SIGINT sent
[ 1218ms] [worker] SIGINT received, aborting...
[ 1258ms] [worker] EXIT error=Retry aborted by signal                      ← 40ms SIGINT latency, exit code 130

SIGINT latency 40ms — far below the design upper bound of heartbeatIntervalMs (400ms in test, 30s in production).


Pre-existing bug surfaced (NOT introduced by this PR, does NOT block merge)

The fake-provider e2e caught a bug in getRetryAfterDelayMs(): it only reads error.response.headers['retry-after'] (axios shape), but OpenAI SDK 5.x exposes headers at error.headers (a Web Headers instance) with no error.response at all. Unit tests missed this because they construct fake errors in the axios shape.

Observed: server sends Retry-After: 5, retry fires after ~100ms (exponential backoff fallback) instead of 5s.

Scope: affects all OpenAI-SDK-based providers (Qwen DashScope OpenAI-compatible mode, etc.). Pre-dates #3080 (introduced in a9d6965be pre-release commit). Since this PR's core promise (long tasks survive 429/529) is fulfilled by the exponential backoff fallback, fixing this is a separate concern — I'll open a follow-up issue.


One-shot reproducer

# Unit + typecheck
(cd packages/core && npx vitest run src/utils/retry.test.ts && npx tsc --noEmit)

# Then the three e2e scripts (happy to share them in a gist if useful):
node run_integration_test.mjs          # 17 assertions
node run_e2e_with_fake_provider.mjs    # 8 assertions (OpenAI SDK + fake server)
node run_signal_and_stream_test.mjs    # 5 assertions (SIGINT + stderr realtime)

@zhangxy-zju

zhangxy-zju commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator Author

Follow-up: both SDK paths verified — Gemini path is worse

Validated the @google/genai path too with 7 new e2e assertions (same fake-provider harness as before, pointed at the Gemini SDK). Result: retry.ts works on both SDK paths, but Retry-After is honored on neither — and the Gemini path is worse than OpenAI.

SDK error shape comparison

SDK .status .headers .response Can we recover retry-after?
openai@5.11.0 (RateLimitError) 429 ✓ Headers instance (.get('retry-after')) undefined Yes — patch getRetryAfterDelayMs to also read error.headers?.get?.('retry-after')
@google/genai@1.30.0 (ApiError) 429 ✓ undefined undefined No — SDK drops all headers at the error boundary. Only [stack, message, name, status] survive.
// @google/genai ApiError probe output:
ERROR class: ApiError
e.status: 429                ← the only recoverable field
e.headers: undefined
e.response: undefined
all property names: [stack, message, name, status]

Impact for this PR

Neither finding blocks #3080 merge:

  • Core promise (429/529 survive in unattended mode) is fulfilled via exponential backoff on both SDK paths — isTransientCapacityError only needs status, which both SDKs expose.
  • Observed behavior: 3 × 429 with Retry-After: 5 from fake server completes in ~260ms (expected ~15s if honored) on both paths — degraded to exponential backoff, but the retry itself works and eventually succeeds.

Follow-up fix requires two things

  1. OpenAI path — local fix in getRetryAfterDelayMs: try error.headers?.get?.('retry-after') and error.headers?.['retry-after'] before the existing error.response.headers[…] axios-shape path. Plus a real RateLimitError regression test case (current unit tests construct fake errors in the axios shape, which is why this wasn't caught).
  2. Gemini path — upstream issue on googleapis/js-genai asking ApiError to preserve response headers (at least retry-after). No local fix possible until then.

Happy to open both as separate issues after this PR lands.

Total script coverage now: 93 assertions (56 unit + 17 in-process + 8 OpenAI e2e + 7 Gemini e2e + 5 SIGINT/stderr e2e), still 0 manual steps remaining.

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found. LGTM! ✅ — gpt-5.4 via Qwen Code /review

@wenshao

wenshao commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator

Ran the full suite on this branch plus a small set of observational tests against the real timer to eyeball the heartbeat / backoff behavior. All green.

Tests

Suite Result
packages/core/src/utils/retry.test.ts 56/56 ✓
packages/core/src/core/baseLlmClient.test.ts 18/18 ✓
packages/core/src/core/geminiChat.test.ts 52/52 ✓
tsc --noEmit (packages/core) clean

Observational scenarios (real timers, MANUAL_RETRY_OBSERVE=1)

# Scenario Verified
1 429 persistent retry bypasses maxAttempts maxAttempts=2, took 4 calls to succeed; 15 heartbeats fired; exponential backoff visible (~300ms → ~480ms → ~1090ms)
2 500 does not trigger persistent mode still throws after maxAttempts=3 — 3 calls total
3 AbortSignal cancels mid-sleep throws Retry aborted by signal
4 heartbeatIntervalMs=0 infinite-loop guard Math.max(1, …) works, returns success
5 shouldRetryOnError=false vetoes persistent mode single call, fast-fail

Non-blocking nits

  1. The three-line heartbeatFn closure writing to process.stderr is duplicated verbatim in geminiChat.ts, client.ts, and baseLlmClient.ts. Worth extracting a defaultHeartbeatToStderr helper in retry.ts so callers can just pass a reference.
  2. editor.ts / editor.test.ts contain unrelated formatting changes (indentation + line-wrap reflow) — probably prettier-on-save. Would be cleaner split into their own commit/PR.
  3. The attempt = maxAttempts - 1 clamp at retry.ts:275-277 works and mirrors the reference implementation, so it's fine — a one-line comment explaining why the loop is kept alive that way would help future readers.

Code looks solid overall, functionality lines up with the PR description. 👍

@wenshao

wenshao commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator

Since the PR body cites Claude Code's services/api/withRetry.ts as the reference, I went through both side-by-side. Posting the diff-of-capabilities here so future readers can tell what was ported faithfully vs. deliberately dropped.

Ported verbatim / structurally equivalent

Element Reference PR #3080
Single backoff cap PERSISTENT_MAX_BACKOFF_MS = 5 min same constant, same value
Absolute wait cap PERSISTENT_RESET_CAP_MS = 6 h PERSISTENT_CAP_MS = 6 h
Heartbeat interval HEARTBEAT_INTERVAL_MS = 30 s same constant, same value
Trigger set 429 + 529 only same
Dual counter (attempt + persistentAttempt) yes yes
Attempt-clamp trick to keep the main loop alive if (attempt >= maxRetries) attempt = maxRetries if (attempt >= maxAttempts) attempt = maxAttempts - 1
Chunked sleep with heartbeat inline while-loop extracted into a sleepWithHeartbeat helper (cleaner)
AbortSignal mid-sleep cancellation yes yes
Retry-After respected, not clamped to maxBackoff yes yes
Explicit env opt-in, no implicit CI=true activation CLAUDE_CODE_UNATTENDED_RETRY QWEN_CODE_UNATTENDED_RETRY

Correctly dropped (not applicable to qwen-code)

Reference feature Why it's not a gap here
anthropic-ratelimit-unified-reset header handling (one-shot wait until window reset) Anthropic-specific header; Qwen/DashScope and OpenAI-compatible endpoints don't emit it
x-should-retry header Anthropic-proprietary non-standard header (the reference's own comment: "Note this is not a standard header."), consumed only by @anthropic-ai/sdk. qwen-code's upstreams don't send it
FOREGROUND_529_RETRY_SOURCES / QuerySource taxonomy qwen-code has no per-call source classification; concept not applicable
Fast-mode cooldown & short-vs-long retry-after branching Reference-product-specific, no analog here
Subscriber-tier gating (isClaudeAISubscriber / isEnterpriseSubscriber) No subscriber model in qwen-code
Consecutive-529 counter + primary→fallback model switch No built-in primary/fallback model pair
Auth-refresh / credential-cache clearing (401, 403 token-revoked, Bedrock, Vertex, ECONNRESET) qwen-code handles auth in each provider layer, not in the retry layer

One notable simplification

Heartbeat transport. The reference yields a structured { type: 'system', subtype: 'api_retry', ... } event via async generator, which a downstream SDK/UI can consume. This PR writes a plain text line to stderr: [qwen-code] Waiting for API capacity... attempt N, retry in Ns. That's sufficient for the stated CI/daemon target — any future SDK consumer would need to convert this to a structured event.

Bottom line

The port is faithful on the core mechanism. Everything that was cut falls into "reference-product/SDK-specific" or "architectural choice that doesn't exist in qwen-code" — I don't see a retry-correctness gap. 👍

@wenshao wenshao merged commit ebe364d into QwenLM:main Apr 21, 2026
13 checks passed
chiga0 pushed a commit that referenced this pull request Apr 24, 2026
…nts (#3080)

* feat(retry): add persistent retry mode for unattended CI/CD environments

When running in CI/CD pipelines or background daemon mode, transient API
capacity errors (429/529) should not terminate long-running tasks after a
fixed number of retries. This adds an environment-aware persistent retry
mode that retries indefinitely for transient errors, with exponential
backoff capped at 5 minutes and heartbeat keepalives every 30 seconds to
prevent CI runner timeouts.

* docs: add persistent retry mode documentation

Add environment variable entries (QWEN_CODE_UNATTENDED_RETRY, QWEN_CODE_BG)
to the settings reference, and a new "Persistent Retry Mode" section to the
headless mode docs covering activation, behavior, and CI/CD usage examples.

* refactor(retry): simplify to single explicit env var QWEN_CODE_UNATTENDED_RETRY

Remove QWEN_CODE_BG and CI=true as activation triggers for persistent retry.
Having multiple env vars with identical behavior adds confusion, and silently
activating infinite retry on CI=true is dangerous — a regular CI test hitting
a 429 would hang forever instead of failing fast.

* fix(retry): address PR review feedback

- Forward caller's abortSignal into retryWithBackoff in both
  baseLlmClient.ts and geminiChat.ts so persistent waits remain
  cancellable (wenshao)
- Re-apply maxBackoff and capMs after jitter so delays strictly
  respect stated caps (Copilot)
- Respect shouldRetryOnError in persistent mode so callers can
  force fast-fail even for transient 429/529 errors (Copilot)
- Guard sleepWithHeartbeat against infinite loop when heartbeat
  interval is <= 0 via Math.max(1, ...) (Copilot)
- Normalize isEnvTruthy with trim/toLowerCase for robust env
  var parsing across CI conventions (Copilot)

* test(retry): add missing UT for shouldRetryOnError override and heartbeat zero-interval guard

* fix(retry): do not cap Retry-After delays at maxBackoff

Server-specified Retry-After values should only be limited by the
absolute cap (capMs/6h), not the exponential backoff cap (maxBackoff/5min).
Jitter is also skipped for Retry-After since the server already specified
the exact wait time.

* refactor(retry): align isUnattendedMode with project env parsing convention

Replace custom isEnvTruthy (trim + toLowerCase) with strict matching
(val === 'true' || val === '1') to match parseBooleanEnvFlag used
elsewhere in the codebase. Prevents inconsistent behavior where
'TRUE' or ' 1 ' would activate persistent retry here but not in
telemetry or other env-driven features.

* test(retry): add Retry-After handling tests for persistent mode

Cover three key behaviors:
- Retry-After is NOT capped at maxBackoff (only at capMs)
- Retry-After IS capped at persistentCapMs absolute limit
- Retry-After delays have no jitter applied

* fix(test): add isUnattendedMode to retry.js mock in baseLlmClient tests

The existing vi.mock for retry.js only exported retryWithBackoff.
After adding isUnattendedMode to the retry module, baseLlmClient.ts
imports it, causing all 10 generateJson tests to fail with
'No "isUnattendedMode" export is defined on the mock'.

* fix(retry): wire persistent retry mode into client.ts generateContent

Forward persistentMode and abortSignal to retryWithBackoff() in
GeminiClient.generateContent(), matching the existing wiring in
baseLlmClient.ts and geminiChat.ts.
xaelistic pushed a commit to xaelistic/qwen-code that referenced this pull request Jun 7, 2026
…nts (QwenLM#3080)

* feat(retry): add persistent retry mode for unattended CI/CD environments

When running in CI/CD pipelines or background daemon mode, transient API
capacity errors (429/529) should not terminate long-running tasks after a
fixed number of retries. This adds an environment-aware persistent retry
mode that retries indefinitely for transient errors, with exponential
backoff capped at 5 minutes and heartbeat keepalives every 30 seconds to
prevent CI runner timeouts.

* docs: add persistent retry mode documentation

Add environment variable entries (QWEN_CODE_UNATTENDED_RETRY, QWEN_CODE_BG)
to the settings reference, and a new "Persistent Retry Mode" section to the
headless mode docs covering activation, behavior, and CI/CD usage examples.

* refactor(retry): simplify to single explicit env var QWEN_CODE_UNATTENDED_RETRY

Remove QWEN_CODE_BG and CI=true as activation triggers for persistent retry.
Having multiple env vars with identical behavior adds confusion, and silently
activating infinite retry on CI=true is dangerous — a regular CI test hitting
a 429 would hang forever instead of failing fast.

* fix(retry): address PR review feedback

- Forward caller's abortSignal into retryWithBackoff in both
  baseLlmClient.ts and geminiChat.ts so persistent waits remain
  cancellable (wenshao)
- Re-apply maxBackoff and capMs after jitter so delays strictly
  respect stated caps (Copilot)
- Respect shouldRetryOnError in persistent mode so callers can
  force fast-fail even for transient 429/529 errors (Copilot)
- Guard sleepWithHeartbeat against infinite loop when heartbeat
  interval is <= 0 via Math.max(1, ...) (Copilot)
- Normalize isEnvTruthy with trim/toLowerCase for robust env
  var parsing across CI conventions (Copilot)

* test(retry): add missing UT for shouldRetryOnError override and heartbeat zero-interval guard

* fix(retry): do not cap Retry-After delays at maxBackoff

Server-specified Retry-After values should only be limited by the
absolute cap (capMs/6h), not the exponential backoff cap (maxBackoff/5min).
Jitter is also skipped for Retry-After since the server already specified
the exact wait time.

* refactor(retry): align isUnattendedMode with project env parsing convention

Replace custom isEnvTruthy (trim + toLowerCase) with strict matching
(val === 'true' || val === '1') to match parseBooleanEnvFlag used
elsewhere in the codebase. Prevents inconsistent behavior where
'TRUE' or ' 1 ' would activate persistent retry here but not in
telemetry or other env-driven features.

* test(retry): add Retry-After handling tests for persistent mode

Cover three key behaviors:
- Retry-After is NOT capped at maxBackoff (only at capMs)
- Retry-After IS capped at persistentCapMs absolute limit
- Retry-After delays have no jitter applied

* fix(test): add isUnattendedMode to retry.js mock in baseLlmClient tests

The existing vi.mock for retry.js only exported retryWithBackoff.
After adding isUnattendedMode to the retry module, baseLlmClient.ts
imports it, causing all 10 generateJson tests to fail with
'No "isUnattendedMode" export is defined on the mock'.

* fix(retry): wire persistent retry mode into client.ts generateContent

Forward persistentMode and abortSignal to retryWithBackoff() in
GeminiClient.generateContent(), matching the existing wiring in
baseLlmClient.ts and geminiChat.ts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type/feature-request New feature or enhancement request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[P1] Persistent Retry Mode / 持久化重试模式

4 participants