Harden awf-reflect startup against api-proxy warmup 503s on /v1/models#34265
Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
| if (!res.ok) { | ||
| logger(`awf-reflect: models fetch returned ${res.status} for ${modelsUrl}`); | ||
| for (let attempt = 1; attempt <= AWF_MODELS_URL_MAX_ATTEMPTS; attempt += 1) { | ||
| const ac = new AbortController(); |
There was a problem hiding this comment.
Good call-out. I checked it, but withRetry currently assumes the github-script core global on retry paths and throws ReferenceError: core is not defined in harness runtime (where awf_reflect.cjs runs). I kept the local bounded 503 retry here to preserve best-effort behavior without introducing that dependency. If you want, I can do a follow-up to make withRetry runtime-agnostic and then switch this over.
There was a problem hiding this comment.
Implemented in 7a1c875. awf_reflect.cjs now requires ./shim.cjs and uses withRetry for the 503 warmup path; tests were updated in awf_reflect.test.cjs and pass.
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
|
❌ Test Quality Sentinel failed during test quality analysis. |
|
|
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #34265 does not have the 'implementation' label and has 0 new lines of code in default business logic directories (threshold: 100). |
|
🧠 Matt Pocock Skills Reviewer failed during the skills-based review. |
Addressed in |
There was a problem hiding this comment.
Pull request overview
This PR hardens awf-reflect startup model discovery against transient 503 Service Unavailable responses from the awf-api-proxy during warmup, improving the likelihood that the initial awf-reflect.json includes complete model lists.
Changes:
- Added bounded, exponential-backoff retry behavior for
503responses infetchModelsFromUrlvia the sharedwithRetryhelper. - Introduced new retry/backoff constants exported from
awf_reflect.cjs. - Added focused Vitest coverage for
503 → 503 → 200recovery and “exhaust retries” behavior.
Show a summary per file
| File | Description |
|---|---|
| actions/setup/js/awf_reflect.cjs | Adds 503-specific retry/backoff around models discovery fetches and exports new tuning constants. |
| actions/setup/js/awf_reflect.test.cjs | Extends unit tests to cover the new retry behavior and constant defaults. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 2/2 changed files
- Comments generated: 2
| it("retries on 503 and eventually succeeds", async () => { | ||
| vi.stubGlobal( | ||
| "fetch", | ||
| vi | ||
| .fn() | ||
| .mockResolvedValueOnce({ ok: false, status: 503 }) | ||
| .mockResolvedValueOnce({ ok: false, status: 503 }) | ||
| .mockResolvedValueOnce({ ok: true, status: 200, json: async () => ({ data: [{ id: "gpt-4o" }] }) }) | ||
| ); | ||
|
|
||
| const logs = []; | ||
| const result = await fetchModelsFromUrl("http://api-proxy:10000/v1/models", 1000, msg => logs.push(msg)); | ||
| expect(result).toEqual(["gpt-4o"]); | ||
| expect(logs.filter(l => l.includes("retrying (attempt")).length).toBe(2); | ||
| expect(logs.some(l => l.includes("fetched 1 model(s)"))).toBe(true); | ||
| }); | ||
|
|
||
| it("stops retrying after max attempts on repeated 503 responses", async () => { | ||
| vi.stubGlobal("fetch", vi.fn().mockResolvedValue({ ok: false, status: 503 })); | ||
|
|
||
| const logs = []; | ||
| const result = await fetchModelsFromUrl("http://api-proxy:10000/v1/models", 1000, msg => logs.push(msg)); | ||
| expect(result).toBeNull(); | ||
| expect(logs.filter(l => l.includes("retrying (attempt")).length).toBe(AWF_MODELS_URL_MAX_ATTEMPTS - 1); | ||
| expect(logs.some(l => l.includes("models fetch returned 503"))).toBe(true); | ||
| }); |
| // Milliseconds to wait for the /reflect endpoint before giving up. | ||
| const AWF_REFLECT_TIMEOUT_MS = 60000; | ||
| // Milliseconds to wait for each models_url fallback fetch (shorter than the main reflect timeout). | ||
| const AWF_MODELS_URL_TIMEOUT_MS = 3000; | ||
| // Maximum attempts for models_url fallback fetches when the proxy is not yet ready. | ||
| const AWF_MODELS_URL_MAX_ATTEMPTS = 5; | ||
| // Base delay between models_url fallback retries. Uses exponential backoff. | ||
| const AWF_MODELS_URL_RETRY_BASE_MS = 250; | ||
| // Cap for exponential backoff delay between retries. | ||
| const AWF_MODELS_URL_RETRY_MAX_MS = 2000; |
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnose and /tdd — no blocking issues, but a few items worth addressing before merge.
📋 Key Themes & Highlights
Key Themes
- Fragile
withRetrycoupling (initialDelayMs / 2): the backoff compensation leakswithRetry's internal multiply-before-delay contract into the caller, making the code silently wrong ifwithRetryever changes. - Misleading logging guard:
attemptCounter < AWF_MODELS_URL_MAX_ATTEMPTSinshouldRetrylooks like it controls retry count but only guards a log message — retry termination is solelywithRetry's job viamaxRetries. - Untested outer-catch 503 path: only reachable if
withRetrythrows after exhaustion; whether that happens is unclear, and no test covers it. - Attempt-count assertion missing: the exhaustion test checks logs and null return but not that
fetchwas called exactlyMAX_ATTEMPTStimes.
Positive Highlights
- ✅ 503-only retry scope is well-bounded — other failures retain best-effort null-return behavior.
- ✅ Regression tests for both recovery (503→503→200) and exhaustion paths are a good addition.
- ✅ Per-attempt
AbortControllerinside thewithRetrycallback ensures each retry gets a fresh timeout. - ✅ The fix is appropriately minimal — the opportunistic nature of pre-flight reflect is preserved.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 1.5M
| let attemptCounter = 0; | ||
| const retryConfig = { | ||
| maxRetries: AWF_MODELS_URL_MAX_ATTEMPTS - 1, | ||
| // withRetry multiplies delay before the next attempt, so divide by 2 here |
There was a problem hiding this comment.
[/diagnose] initialDelayMs / 2 leaks withRetry's internal multiply-before-delay contract into the caller — if withRetry changes when it applies the multiplier, the effective first delay will silently double or halve.
💡 Suggestion
Consider one of:
- Add a unit test in
error_recovery.test.cjsthat pinswithRetry's first-delay behavior so a change there fails loudly before it reaches callers. - Pass the raw
AWF_MODELS_URL_RETRY_BASE_MSand documentwithRetry's contract in its own JSDoc instead of compensating at call-site.
Right now any refactor of withRetry's backoff scheduling breaks this silently.
| jitterMs: 0, | ||
| shouldRetry: error => { | ||
| const original = error?.originalError || error; | ||
| const status = original?.status ?? original?.response?.status ?? null; |
There was a problem hiding this comment.
[/diagnose] The attemptCounter < AWF_MODELS_URL_MAX_ATTEMPTS guard controls only the log message, not actual retry termination — withRetry controls termination via maxRetries. A future reader may mistake this for the retry-stop condition.
💡 Suggestion
Either remove the guard (the log fires on every 503, including the final one), or add a clarifying comment:
// Log on every 503; withRetry itself stops after maxRetries (MAX_ATTEMPTS - 1 retries).
if (shouldRetry) {
logger(`awf-reflect: models fetch returned 503 for ${modelsUrl}; retrying (attempt ${attemptCounter + 1}/${AWF_MODELS_URL_MAX_ATTEMPTS})`);
}This also means the last 503 won't log a "retrying" line even though shouldRetry returns true — the withRetry layer will have already stopped, so no message is emitted. Worth checking if the silence on the final attempt is intentional.
| const e = /** @type {Error} */ err; | ||
| if (e.name === "AbortError") { | ||
| return null; // already logged above | ||
| const original = e?.originalError || e; |
There was a problem hiding this comment.
[/tdd] This outer catch 503 branch has no test coverage — it's only reachable if withRetry itself throws after exhausting retries, but whether withRetry throws or returns null on exhaustion isn't clear from this call-site.
💡 Suggestion
Add a test that verifies what withRetry actually does on exhaustion (throw vs. return), and then either:
- Add a regression test that exercises this outer-catch 503 path if it's reachable.
- Or remove this branch and replace with a comment if
withRetrynever throws after exhaustion.
Dead catch branches are maintenance traps: they'll never fail tests if the behavior changes, but they'll mislead readers into thinking there's a safety net.
| logger(`awf-reflect: fetched ${models.length} model(s) from ${modelsUrl}`); | ||
| } | ||
| return models; | ||
| } catch (err) { |
There was a problem hiding this comment.
[/diagnose] A timeout (AbortError) during a retry returns null, which withRetry treats as a successful result and stops retrying — even though the URL might still become reachable. This differs from the original behavior where a timeout was always final. Worth a comment to confirm it's intentional.
💡 Context
In the original code, AbortError → return null was the only exit path for timeouts and was fine because there was only one attempt.
Now, if attempt #2 times out but the server would have responded on attempt #3, the retry loop stops early. This is probably acceptable (keep pre-flight opportunistic), but the behavior difference is non-obvious and a comment would prevent a future reader from "fixing" it.
// Timeout: treat as a successful null (not an error) so withRetry
// doesn't count this as a retryable failure — pre-flight capture is best-effort.
if (e.name === "AbortError") {
return null;
}| expect(logs.filter(l => l.includes("retrying (attempt")).length).toBe(AWF_MODELS_URL_MAX_ATTEMPTS - 1); | ||
| expect(logs.some(l => l.includes("models fetch returned 503"))).toBe(true); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
[/tdd] The exhaustion test verifies log messages and null return but doesn't assert that fetch was called exactly AWF_MODELS_URL_MAX_ATTEMPTS times — so if withRetry silently stops earlier, the test still passes.
💡 Suggested addition
const mockFetch = vi.fn().mockResolvedValue({ ok: false, status: 503 });
vi.stubGlobal("fetch", mockFetch);
// ... run ...
expect(mockFetch).toHaveBeenCalledTimes(AWF_MODELS_URL_MAX_ATTEMPTS);This pins the contract: exactly 5 attempts (1 initial + 4 retries), no more, no less. Without it, an off-by-one in maxRetries wouldn't be caught.
🧪 Test Quality Sentinel Report✅ Test Quality Score: 90/100 — Excellent
📊 Metrics & Test Classification (3 tests analyzed)
Test Classification Details
Language SupportTests analyzed:
Verdict
📖 Understanding Test ClassificationsDesign Tests (High Value) verify what the system does:
Implementation Tests (Low Value) verify how the system does it:
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators. References:
|
awf-api-proxycould be marked healthy before upstream LLM connectivity was ready, so earlyawf-reflectmodel discovery hit503and wrote incompleteawf-reflect.json. This also consumed Claude startup retry budget during the first minutes of a run.Retry policy for startup 503s
fetchModelsFromUrlfor HTTP503responses.AWF_MODELS_URL_MAX_ATTEMPTS = 5AWF_MODELS_URL_RETRY_BASE_MS = 250AWF_MODELS_URL_RETRY_MAX_MS = 2000503responses and other failures retain existing best-effort behavior (no throw, returnnull).Behavioral intent
awf-reflect.jsonincludes model lists instead of requiring teardown-time recovery.Focused test coverage
503 -> 503 -> 200recovery path503path stopping at max attempts