feat(retry): add persistent retry mode for unattended CI/CD environments #3080
📋 Review Summary
This PR introduces a persistent retry mode for handling transient API capacity errors (429/529) in unattended/CI environments, enabling long-running automated tasks to survive temporary API outages. The implementation is well-structured with comprehensive test coverage, but there are some concerns around the attempt clamping technique and potential infinite loop risks that should be addressed before merging.
Pull request overview
Adds an opt-in “persistent retry” mode to make long-running, unattended Qwen Code runs (e.g., CI/CD) resilient to transient API capacity errors without impacting interactive behavior.
Changes:
- Enhanced `retryWithBackoff()` with persistent retry semantics for 429/529, heartbeat sleeping, and additional options (`persistentMode`, caps, heartbeat, `signal`).
- Wired persistent retry into core API callers (`GeminiChat`, `BaseLlmClient`) and documented the new env var (`QWEN_CODE_UNATTENDED_RETRY`).
- Added unit tests for transient classification, unattended mode detection, and persistent retry behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| packages/core/src/utils/retry.ts | Implements persistent retry mode, heartbeat sleep, and new retry options. |
| packages/core/src/utils/retry.test.ts | Adds coverage for persistent retry mode + env detection helpers. |
| packages/core/src/core/geminiChat.ts | Enables persistent mode in Gemini caller with stderr heartbeat logging. |
| packages/core/src/core/baseLlmClient.ts | Enables persistent mode in base client with stderr heartbeat logging. |
| docs/users/features/headless.md | Documents “Persistent Retry Mode” usage and monitoring behavior. |
| docs/users/configuration/settings.md | Adds QWEN_CODE_UNATTENDED_RETRY to the environment variables table. |
Comments suppressed due to low confidence (1)
packages/core/src/utils/retry.ts:186
- The new `signal?: AbortSignal` option is only consulted inside `sleepWithHeartbeat` (persistent path). In non-persistent mode (and in the `shouldRetryOnContent` backoff), delays ignore the signal and the loop doesn't check `signal.aborted` before calling `fn()`. If the intent is "graceful cancellation" across all retries, consider checking `signal?.aborted` at the top of the loop and making all waits abortable (including non-persistent `delay(...)` calls).
while (attempt < maxAttempts) {
attempt++;
try {
const result = await fn();
wenshao left a comment
packages/core/src/core/client.ts:963 still calls retryWithBackoff() without persistentMode: isUnattendedMode() or the new unattended retry wiring, while this PR adds that behavior in BaseLlmClient.generateJson() and GeminiChat.makeApiCallAndProcessStream(). Callers that still route through Client.generateContent()—including packages/core/src/tools/web-fetch.ts:111, packages/cli/src/ui/commands/btwCommand.ts:66, and packages/cli/src/ui/commands/summaryCommand.ts:92—will still fail after the normal retry budget during 429/529 capacity outages, so unattended retry remains inconsistent across top-level model entry points. Please apply the same unattended retry options here, or centralize the policy in a shared wrapper so all top-level model calls opt in consistently.
— gpt-5.4 via Qwen Code /review
@wenshao Good catch. Fixed in 389e856: added … All three …
Fixed in ec7659c — added persistentMode and signal to client.ts:963 retryWithBackoff() call. All three call sites now have consistent wiring.
wenshao left a comment
See inline comments. — gpt-5.4 via Qwen Code /review
    while (remaining > 0) {
      if (ctx.signal?.aborted) {
        throw new Error('Retry aborted by signal');
[Suggestion] sleepWithHeartbeat() says it supports AbortSignal, but each chunk still waits on a plain timer. If the signal is aborted after a chunk starts, cancellation is delayed until that chunk finishes, which can be up to the full heartbeat interval in persistent mode.
— gpt-5.4 via Qwen Code /review
Acknowledged but won't fix — the worst-case cancel latency in persistent mode is bounded by heartbeatIntervalMs (default 30s), and a single chunk is already a discrete sleep that ends naturally without any work happening during it. Making each chunk Promise.race([delay, signal])-based would add abort-listener wiring/cleanup for ≤30s of latency improvement.
This is consistent with the previously declined Copilot suggestion about non-persistent delay() not honoring signal — same rationale: the bounded wait makes the responsiveness gap not worth the complexity. Happy to revisit if there's a concrete user-facing scenario where 30s cancel lag is a problem.
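For reference, the declined alternative discussed in this thread could look roughly like the sketch below. It is not code from the PR: `abortableDelay` is a hypothetical name, and the error message is simply borrowed from the snippet quoted in the inline comment.

```typescript
// Hypothetical helper (not part of the PR): a delay that rejects as soon as
// the signal aborts, instead of waiting for the timer to run out.
function abortableDelay(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error('Retry aborted by signal'));
      return;
    }
    const onAbort = () => {
      // Stop the pending timer so it cannot keep the process alive.
      clearTimeout(timer);
      reject(new Error('Retry aborted by signal'));
    };
    const timer = setTimeout(() => {
      // Remove the listener so long-lived signals do not accumulate handlers.
      signal?.removeEventListener('abort', onAbort);
      resolve();
    }, ms);
    signal?.addEventListener('abort', onAbort, { once: true });
  });
}
```

The listener bookkeeping in the middle of the function is the extra wiring the author refers to; whether it is worth it depends on how much the cancel lag of up to one heartbeat interval matters to callers.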
When running in CI/CD pipelines or background daemon mode, transient API capacity errors (429/529) should not terminate long-running tasks after a fixed number of retries. This adds an environment-aware persistent retry mode that retries indefinitely for transient errors, with exponential backoff capped at 5 minutes and heartbeat keepalives every 30 seconds to prevent CI runner timeouts.
Add environment variable entries (QWEN_CODE_UNATTENDED_RETRY, QWEN_CODE_BG) to the settings reference, and a new "Persistent Retry Mode" section to the headless mode docs covering activation, behavior, and CI/CD usage examples.
…NDED_RETRY Remove QWEN_CODE_BG and CI=true as activation triggers for persistent retry. Having multiple env vars with identical behavior adds confusion, and silently activating infinite retry on CI=true is dangerous — a regular CI test hitting a 429 would hang forever instead of failing fast.
- Forward caller's abortSignal into retryWithBackoff in both baseLlmClient.ts and geminiChat.ts so persistent waits remain cancellable (wenshao)
- Re-apply maxBackoff and capMs after jitter so delays strictly respect stated caps (Copilot)
- Respect shouldRetryOnError in persistent mode so callers can force fast-fail even for transient 429/529 errors (Copilot)
- Guard sleepWithHeartbeat against infinite loop when heartbeat interval is <= 0 via Math.max(1, ...) (Copilot)
- Normalize isEnvTruthy with trim/toLowerCase for robust env var parsing across CI conventions (Copilot)
…beat zero-interval guard
Server-specified Retry-After values should only be limited by the absolute cap (capMs/6h), not the exponential backoff cap (maxBackoff/5min). Jitter is also skipped for Retry-After since the server already specified the exact wait time.
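The capping rules in this commit message, together with the earlier "re-apply caps after jitter" fix, can be illustrated with a small sketch. Everything named here (`computeRetryDelayMs`, `DelayOptions`, `initialDelayMs`, the 30% jitter ratio) is an assumption made for illustration; only the two caps and the "no jitter for Retry-After" rule come from the text above.

```typescript
interface DelayOptions {
  initialDelayMs: number; // assumed base delay for the first retry
  maxBackoffMs: number; // exponential backoff cap (5 minutes per the commit text)
  capMs: number; // absolute cap (6 hours per the commit text)
}

// Hypothetical helper: how long to wait before the next attempt.
function computeRetryDelayMs(
  attempt: number, // 1-based retry attempt
  retryAfterMs: number | undefined,
  opts: DelayOptions,
  random: () => number = Math.random,
): number {
  if (retryAfterMs !== undefined) {
    // Server-specified wait: no jitter, limited only by the absolute cap.
    return Math.min(retryAfterMs, opts.capMs);
  }
  const exponential = opts.initialDelayMs * 2 ** (attempt - 1);
  const backoff = Math.min(exponential, opts.maxBackoffMs);
  // Assumed jitter of up to +/-30%; the actual ratio is not stated here.
  const jittered = backoff * (1 + (random() * 2 - 1) * 0.3);
  // Re-apply both caps after jitter so the stated limits are strict.
  return Math.max(0, Math.min(jittered, opts.maxBackoffMs, opts.capMs));
}
```

With a 5-minute `maxBackoffMs` and a 6-hour `capMs`, a 10-minute `Retry-After` is returned unchanged, while a computed backoff can never exceed 5 minutes even when jitter pushes it upward.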
…ention Replace custom isEnvTruthy (trim + toLowerCase) with strict matching (val === 'true' || val === '1') to match parseBooleanEnvFlag used elsewhere in the codebase. Prevents inconsistent behavior where 'TRUE' or ' 1 ' would activate persistent retry here but not in telemetry or other env-driven features.
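A minimal sketch of the strict matching described here follows. The `env` parameter is an assumption added so the example can be exercised without touching the real process environment; the PR's function presumably reads the environment directly.

```typescript
// Sketch of strict flag parsing: only the exact strings 'true' and '1' count.
// Passing the environment in as an argument is an assumption for testability.
function isUnattendedMode(env: Record<string, string | undefined>): boolean {
  const val = env['QWEN_CODE_UNATTENDED_RETRY'];
  return val === 'true' || val === '1';
}
```

Under this rule `'TRUE'` and `' 1 '` do not enable persistent retry, and `CI=true` on its own has no effect.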
Cover three key behaviors:
- Retry-After is NOT capped at maxBackoff (only at capMs)
- Retry-After IS capped at persistentCapMs absolute limit
- Retry-After delays have no jitter applied
The existing vi.mock for retry.js only exported retryWithBackoff. After adding isUnattendedMode to the retry module, baseLlmClient.ts imports it, causing all 10 generateJson tests to fail with 'No "isUnattendedMode" export is defined on the mock'.
Forward persistentMode and abortSignal to retryWithBackoff() in GeminiClient.generateContent(), matching the existing wiring in baseLlmClient.ts and geminiChat.ts.
ec7659c to 051f367
Resolve conflict in docs/users/configuration/settings.md by keeping both QWEN_CODE_UNATTENDED_RETRY (this branch) and QWEN_CODE_PROFILE_STARTUP (main) rows. Address PR QwenLM#3080 review (wenshao 2026-04-19): add the missing heartbeatFn to the retryWithBackoff() call in GeminiClient.generateContent(), matching the stderr keepalive pattern already used in geminiChat.ts and baseLlmClient.ts. Without this, unattended retries on the non-streaming content path stay silent during long 429/529 outages, contradicting the heartbeat-based monitoring promised in the new docs.
Validation Strategy & Results
After the merge + heartbeat-fix commit ( …
Layer 1: Unit tests (56/56)
Layer 2: In-process integration (17/17)
Parent spawns a worker subprocess that imports the built …
Key evidence: T3 (persistent 429 breaks past …
Layer 3: fake provider e2e (8/8)
Local …
Key evidence: …
Layer 4: SIGINT + streaming stderr e2e (5/5)
Previously considered manual-only. Turns out …
Key evidence: SIGINT latency 40ms, far below the design upper bound of …
Pre-existing bug surfaced (NOT introduced by this PR, does NOT block merge)
The fake-provider e2e caught a bug in …
Observed: server sends …
Scope: affects all OpenAI-SDK-based providers (Qwen DashScope OpenAI-compatible mode, etc.). Pre-dates #3080 (introduced in …
One-shot reproducer
# Unit + typecheck
(cd packages/core && npx vitest run src/utils/retry.test.ts && npx tsc --noEmit)
# Then the three e2e scripts (happy to share them in a gist if useful):
node run_integration_test.mjs # 17 assertions
node run_e2e_with_fake_provider.mjs # 8 assertions (OpenAI SDK + fake server)
node run_signal_and_stream_test.mjs # 5 assertions (SIGINT + stderr realtime)
Follow-up: both SDK paths verified; Gemini path is worse
Validated the …
SDK error shape comparison
Impact for this PR
Neither finding blocks #3080 merge:
Follow-up fix requires two things
Happy to open both as separate issues after this PR lands. Total script coverage now: 93 assertions (56 unit + 17 in-process + 8 OpenAI e2e + 7 Gemini e2e + 5 SIGINT/stderr e2e), still 0 manual steps remaining.
wenshao left a comment
No issues found. LGTM! ✅ — gpt-5.4 via Qwen Code /review
Ran the full suite on this branch plus a small set of observational tests against the real timer to eyeball the heartbeat / backoff behavior. All green.
Tests
Observational scenarios (real timers, …
| # | Scenario | Verified |
|---|---|---|
| 1 | 429 persistent retry bypasses `maxAttempts` | `maxAttempts=2`, took 4 calls to succeed; 15 heartbeats fired; exponential backoff visible (~300ms → ~480ms → ~1090ms) |
| 2 | 500 does not trigger persistent mode | still throws after `maxAttempts=3` (3 calls total) |
| 3 | `AbortSignal` cancels mid-sleep | throws `Retry aborted by signal` |
| 4 | `heartbeatIntervalMs=0` infinite-loop guard | `Math.max(1, …)` works, returns success |
| 5 | `shouldRetryOnError=false` vetoes persistent mode | single call, fast-fail |
Non-blocking nits
- The three-line `heartbeatFn` closure writing to `process.stderr` is duplicated verbatim in `geminiChat.ts`, `client.ts`, and `baseLlmClient.ts`. Worth extracting a `defaultHeartbeatToStderr` helper in `retry.ts` so callers can just pass a reference.
- `editor.ts` / `editor.test.ts` contain unrelated formatting changes (indentation + line-wrap reflow), probably prettier-on-save. Would be cleaner split into their own commit/PR.
- The `attempt = maxAttempts - 1` clamp at `retry.ts:275-277` works and mirrors the reference implementation, so it's fine; a one-line comment explaining why the loop is kept alive that way would help future readers.
Code looks solid overall, functionality lines up with the PR description. 👍
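As a rough illustration of the first nit, a shared helper might look like the sketch below. The shape of the heartbeat payload (`HeartbeatInfo`), the message format, and the factory name `makeHeartbeatToStderr` are all assumptions; the PR's actual callback signature is not shown in this thread.

```typescript
// Assumed payload passed to a heartbeat callback; the real signature may differ.
interface HeartbeatInfo {
  attempt: number;
  remainingMs: number;
}

// Hypothetical factory for the shared helper. The sink is injectable for
// tests and defaults to console.error, which writes to stderr in Node.js.
function makeHeartbeatToStderr(
  write: (line: string) => void = (line) => console.error(line),
): (info: HeartbeatInfo) => void {
  return ({ attempt, remainingMs }) => {
    const seconds = Math.ceil(remainingMs / 1000);
    write(`[retry] attempt ${attempt}: waiting ${seconds}s before next try`);
  };
}

// Callers could then pass this single reference instead of repeating a closure.
const defaultHeartbeatToStderr = makeHeartbeatToStderr();
```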
Since the PR body cites Claude Code's …
Ported verbatim / structurally equivalent
Correctly dropped (not applicable to qwen-code)
One notable simplification
Heartbeat transport. The reference yields a structured …
Bottom line
The port is faithful on the core mechanism. Everything that was cut falls into "reference-product/SDK-specific" or "architectural choice that doesn't exist in qwen-code"; I don't see a retry-correctness gap. 👍
…nts (#3080)
* feat(retry): add persistent retry mode for unattended CI/CD environments
  When running in CI/CD pipelines or background daemon mode, transient API capacity errors (429/529) should not terminate long-running tasks after a fixed number of retries. This adds an environment-aware persistent retry mode that retries indefinitely for transient errors, with exponential backoff capped at 5 minutes and heartbeat keepalives every 30 seconds to prevent CI runner timeouts.
* docs: add persistent retry mode documentation
  Add environment variable entries (QWEN_CODE_UNATTENDED_RETRY, QWEN_CODE_BG) to the settings reference, and a new "Persistent Retry Mode" section to the headless mode docs covering activation, behavior, and CI/CD usage examples.
* refactor(retry): simplify to single explicit env var QWEN_CODE_UNATTENDED_RETRY
  Remove QWEN_CODE_BG and CI=true as activation triggers for persistent retry. Having multiple env vars with identical behavior adds confusion, and silently activating infinite retry on CI=true is dangerous: a regular CI test hitting a 429 would hang forever instead of failing fast.
* fix(retry): address PR review feedback
  - Forward caller's abortSignal into retryWithBackoff in both baseLlmClient.ts and geminiChat.ts so persistent waits remain cancellable (wenshao)
  - Re-apply maxBackoff and capMs after jitter so delays strictly respect stated caps (Copilot)
  - Respect shouldRetryOnError in persistent mode so callers can force fast-fail even for transient 429/529 errors (Copilot)
  - Guard sleepWithHeartbeat against infinite loop when heartbeat interval is <= 0 via Math.max(1, ...) (Copilot)
  - Normalize isEnvTruthy with trim/toLowerCase for robust env var parsing across CI conventions (Copilot)
* test(retry): add missing UT for shouldRetryOnError override and heartbeat zero-interval guard
* fix(retry): do not cap Retry-After delays at maxBackoff
  Server-specified Retry-After values should only be limited by the absolute cap (capMs/6h), not the exponential backoff cap (maxBackoff/5min). Jitter is also skipped for Retry-After since the server already specified the exact wait time.
* refactor(retry): align isUnattendedMode with project env parsing convention
  Replace custom isEnvTruthy (trim + toLowerCase) with strict matching (val === 'true' || val === '1') to match parseBooleanEnvFlag used elsewhere in the codebase. Prevents inconsistent behavior where 'TRUE' or ' 1 ' would activate persistent retry here but not in telemetry or other env-driven features.
* test(retry): add Retry-After handling tests for persistent mode
  Cover three key behaviors:
  - Retry-After is NOT capped at maxBackoff (only at capMs)
  - Retry-After IS capped at persistentCapMs absolute limit
  - Retry-After delays have no jitter applied
* fix(test): add isUnattendedMode to retry.js mock in baseLlmClient tests
  The existing vi.mock for retry.js only exported retryWithBackoff. After adding isUnattendedMode to the retry module, baseLlmClient.ts imports it, causing all 10 generateJson tests to fail with 'No "isUnattendedMode" export is defined on the mock'.
* fix(retry): wire persistent retry mode into client.ts generateContent
  Forward persistentMode and abortSignal to retryWithBackoff() in GeminiClient.generateContent(), matching the existing wiring in baseLlmClient.ts and geminiChat.ts.
Why
When Qwen Code runs as part of a CI/CD pipeline or as a background daemon (e.g., overnight batch refactoring, large-scale security audits), transient API capacity errors, namely HTTP 429 (Rate Limit) and 529 (Overloaded), should not terminate long-running tasks. The current implementation uses a fixed `maxAttempts: 7` across all modes, which means a multi-hour automated job can be killed by a brief API outage lasting just a couple of minutes. This is the #1 blocker for using Qwen Code as reliable DevOps infrastructure.
Claude Code solves this with a "persistent retry" mode (`services/api/withRetry.ts`). This PR brings equivalent capability to Qwen Code.
Closes #3003
What changed
Core: `packages/core/src/utils/retry.ts`
- `isUnattendedMode()`: Detects persistent retry opt-in via the `QWEN_CODE_UNATTENDED_RETRY` env var (explicit only; `CI=true` alone does NOT activate, to avoid silently turning fast-fail CI jobs into infinite-wait jobs)
- `isTransientCapacityError()`: Classifies only 429 and 529 as transient capacity errors (HTTP 500 is excluded because it may be a permanent server bug)
- `sleepWithHeartbeat()`: Chunked sleep that emits heartbeat callbacks every 30s to keep CI runners alive
- `retryWithBackoff()` enhanced: In persistent mode, transient capacity errors bypass `maxAttempts` and retry indefinitely with:
  - `Retry-After` header respected when present
  - `AbortSignal` support for graceful cancellation
  - `persistentAttempt` counter (attempt clamping technique from Claude Code)
Callers: `geminiChat.ts`, `baseLlmClient.ts`
- `persistentMode: isUnattendedMode()` and a `heartbeatFn` that writes to stderr
Docs
- `docs/users/configuration/settings.md`: Added `QWEN_CODE_UNATTENDED_RETRY` to environment variables table
- `docs/users/features/headless.md`: Added "Persistent Retry Mode" section with activation, examples, and monitoring
Non-breaking
- `persistentMode` defaults to `false`: zero behavior change for interactive users
How to verify
Unit tests
All 46 tests pass, covering:
- … `maxAttempts`
- … `maxAttempts` (not persistent)
- `CI=true` alone does NOT activate persistent mode
- `AbortSignal` cancellation
- `isUnattendedMode()` env var detection (truthy/falsy)
Manual / integration
- `QWEN_CODE_UNATTENDED_RETRY=1 qwen-code "refactor all files in src/"`: enable persistent retry
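To make the mechanism in "What changed" concrete, here is a simplified sketch of a retry loop with attempt clamping and a chunked heartbeat sleep. It is an illustration under stated assumptions, not the PR's `retryWithBackoff()`: the name `retryPersistently`, the `RetryOptions` fields, and the heartbeat callback signature are hypothetical, and `Retry-After`, jitter, and `shouldRetryOnError` handling are left out.

```typescript
interface RetryOptions {
  maxAttempts: number;
  persistentMode: boolean;
  initialDelayMs: number;
  maxBackoffMs: number;
  heartbeatIntervalMs: number;
  onHeartbeat?: (remainingMs: number) => void; // assumed callback shape
  signal?: AbortSignal;
}

// Only 429 and 529 are treated as transient capacity errors; 500 is excluded.
function isTransientCapacityError(err: unknown): boolean {
  const status = (err as { status?: number } | null | undefined)?.status;
  return status === 429 || status === 529;
}

// Sleeps in chunks; the signal is checked between chunks, not during one.
async function sleepWithHeartbeat(
  totalMs: number,
  opts: Pick<RetryOptions, 'heartbeatIntervalMs' | 'onHeartbeat' | 'signal'>,
): Promise<void> {
  // Guard: a zero or negative interval would otherwise never make progress.
  const chunk = Math.max(1, opts.heartbeatIntervalMs);
  let remaining = totalMs;
  while (remaining > 0) {
    if (opts.signal?.aborted) throw new Error('Retry aborted by signal');
    const step = Math.min(chunk, remaining);
    await new Promise<void>((resolve) => setTimeout(resolve, step));
    remaining -= step;
    if (remaining > 0) opts.onHeartbeat?.(remaining);
  }
}

async function retryPersistently<T>(
  fn: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  let attempt = 0;
  let persistentAttempt = 0; // drives backoff independently of the clamp
  while (attempt < opts.maxAttempts) {
    attempt++;
    try {
      return await fn();
    } catch (err) {
      const persistent = opts.persistentMode && isTransientCapacityError(err);
      if (!persistent && attempt >= opts.maxAttempts) throw err;
      if (persistent) {
        persistentAttempt++;
        // Clamp so the loop condition stays true and the budget never runs out.
        attempt = Math.min(attempt, opts.maxAttempts - 1);
      }
      const n = persistent ? persistentAttempt : attempt;
      const delayMs = Math.min(opts.initialDelayMs * 2 ** (n - 1), opts.maxBackoffMs);
      await sleepWithHeartbeat(delayMs, opts);
    }
  }
  throw new Error('Retry attempts exhausted');
}
```

The separate `persistentAttempt` counter is what lets the backoff keep growing toward `maxBackoffMs` even though `attempt` is repeatedly pulled back below `maxAttempts`.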