Harden awf-reflect startup against api-proxy warmup 503s on /v1/models by Copilot · Pull Request #34265 · github/gh-aw

Copilot · 2026-05-23T17:37:35Z

awf-api-proxy could be marked healthy before upstream LLM connectivity was ready, so early awf-reflect model discovery received a 503 and wrote an incomplete awf-reflect.json. The same warmup window also consumed Claude startup retry budget during the first minutes of a run.

Retry policy for startup 503s

- Added bounded retry logic in fetchModelsFromUrl for HTTP 503 responses.
- Uses exponential backoff with explicit limits:
  - AWF_MODELS_URL_MAX_ATTEMPTS = 5
  - AWF_MODELS_URL_RETRY_BASE_MS = 250
  - AWF_MODELS_URL_RETRY_MAX_MS = 2000
- Non-503 responses and other failures retain existing best-effort behavior (no throw, return null).

Behavioral intent

- Keep pre-flight reflect capture opportunistic, but avoid failing fast on transient proxy warmup states.
- Improve chances that initial awf-reflect.json includes model lists instead of requiring teardown-time recovery.

Focused test coverage

- Added tests for:
  - 503 -> 503 -> 200 recovery path
  - repeated 503 path stopping at max attempts
- Updated constant assertions for new retry/backoff settings.

if (res.status === 503 && attempt < AWF_MODELS_URL_MAX_ATTEMPTS) {
  const backoffMs = Math.min(
    AWF_MODELS_URL_RETRY_BASE_MS * 2 ** (attempt - 1),
    AWF_MODELS_URL_RETRY_MAX_MS
  );
  await new Promise(resolve => setTimeout(resolve, backoffMs));
  continue;
}
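
For reference, the delay schedule implied by these limits can be worked out directly. The sketch below reuses the three constants from the description and the same backoff expression as the snippet above; `backoffScheduleMs` is a hypothetical helper name used only for illustration, not a function in awf_reflect.cjs.

```javascript
// Constants as listed in the PR description.
const AWF_MODELS_URL_MAX_ATTEMPTS = 5;
const AWF_MODELS_URL_RETRY_BASE_MS = 250;
const AWF_MODELS_URL_RETRY_MAX_MS = 2000;

// Hypothetical helper: the delay slept after each failed attempt that is
// still allowed a retry, i.e. attempts 1 .. maxAttempts - 1.
function backoffScheduleMs(maxAttempts, baseMs, capMs) {
  const delays = [];
  for (let attempt = 1; attempt < maxAttempts; attempt += 1) {
    // Double the base on every attempt, but never exceed the cap.
    delays.push(Math.min(baseMs * 2 ** (attempt - 1), capMs));
  }
  return delays;
}

const schedule = backoffScheduleMs(
  AWF_MODELS_URL_MAX_ATTEMPTS,
  AWF_MODELS_URL_RETRY_BASE_MS,
  AWF_MODELS_URL_RETRY_MAX_MS
);
const totalMs = schedule.reduce((sum, ms) => sum + ms, 0);

console.log(schedule, totalMs); // [ 250, 500, 1000, 2000 ] 3750
```

With these values the worst case adds 3.75 seconds of waiting across four retries, not counting the time spent in each fetch attempt itself.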

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

pelikhan · 2026-05-23T18:39:37Z

-    if (!res.ok) {
-      logger(`awf-reflect: models fetch returned ${res.status} for ${modelsUrl}`);
+  for (let attempt = 1; attempt <= AWF_MODELS_URL_MAX_ATTEMPTS; attempt += 1) {
+    const ac = new AbortController();


@copilot use withRetry instead?

Good call-out. I checked it, but withRetry currently assumes the github-script core global on retry paths and throws ReferenceError: core is not defined in harness runtime (where awf_reflect.cjs runs). I kept the local bounded 503 retry here to preserve best-effort behavior without introducing that dependency. If you want, I can do a follow-up to make withRetry runtime-agnostic and then switch this over.

@copilot require shim.cjs and use withRetry

Implemented in 7a1c875. awf_reflect.cjs now requires ./shim.cjs and uses withRetry for the 503 warmup path; tests were updated in awf_reflect.test.cjs and pass.
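
The thread does not show what shim.cjs contains. As a rough illustration of the idea only, a shim of this kind could install a console-backed stand-in for the github-script `core` global when none is present, so that helpers logging through `core` do not throw `ReferenceError` in the harness runtime. The name `installCoreShim` and the set of methods below are assumptions, not the repository's code.

```javascript
// Hypothetical sketch; the real shim.cjs may look quite different.
function installCoreShim(target = globalThis) {
  if (typeof target.core !== "undefined") {
    return target.core; // an existing core (real or shimmed) is left alone
  }
  target.core = {
    debug: msg => console.debug(String(msg)),
    info: msg => console.log(String(msg)),
    warning: msg => console.warn(String(msg)),
    error: msg => console.error(String(msg)),
  };
  return target.core;
}

// Demonstrated on a plain object so the real global scope is not touched.
const sandbox = {};
installCoreShim(sandbox);
sandbox.core.info("core shim installed");
```

Installing the stand-in only when `core` is undefined keeps the behavior unchanged inside github-script, where the real object already exists.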

pelikhan

@copilot use shim.cjs and withRetry

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

github-actions · 2026-05-23T19:10:49Z

❌ Test Quality Sentinel failed during test quality analysis.

github-actions · 2026-05-23T19:10:49Z

⚠️ PR Code Quality Reviewer failed during code quality review.

github-actions · 2026-05-23T19:10:51Z

✅ Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #34265 does not have the 'implementation' label and has 0 new lines of code in default business logic directories (threshold: 100).

github-actions · 2026-05-23T19:10:53Z

🧠 Matt Pocock Skills Reviewer failed during the skills-based review.

Copilot · 2026-05-23T19:11:12Z

@copilot use shim.cjs and withRetry

Addressed in 7a1c875: actions/setup/js/awf_reflect.cjs now requires ./shim.cjs and uses withRetry for the 503 models fetch retry path.

Copilot

Pull request overview

This PR hardens awf-reflect startup model discovery against transient 503 Service Unavailable responses from the awf-api-proxy during warmup, improving the likelihood that the initial awf-reflect.json includes complete model lists.

Changes:

Added bounded, exponential-backoff retry behavior for 503 responses in fetchModelsFromUrl via the shared withRetry helper.
Introduced new retry/backoff constants exported from awf_reflect.cjs.
Added focused Vitest coverage for 503 → 503 → 200 recovery and “exhaust retries” behavior.

Show a summary per file

File	Description
actions/setup/js/awf_reflect.cjs	Adds 503-specific retry/backoff around models discovery fetches and exports new tuning constants.
actions/setup/js/awf_reflect.test.cjs	Extends unit tests to cover the new retry behavior and constant defaults.

Copilot's findings

Files reviewed: 2/2 changed files
Comments generated: 2

+    it("retries on 503 and eventually succeeds", async () => {
+      vi.stubGlobal(
+        "fetch",
+        vi
+          .fn()
+          .mockResolvedValueOnce({ ok: false, status: 503 })
+          .mockResolvedValueOnce({ ok: false, status: 503 })
+          .mockResolvedValueOnce({ ok: true, status: 200, json: async () => ({ data: [{ id: "gpt-4o" }] }) })
+      );
+
+      const logs = [];
+      const result = await fetchModelsFromUrl("http://api-proxy:10000/v1/models", 1000, msg => logs.push(msg));
+      expect(result).toEqual(["gpt-4o"]);
+      expect(logs.filter(l => l.includes("retrying (attempt")).length).toBe(2);
+      expect(logs.some(l => l.includes("fetched 1 model(s)"))).toBe(true);
+    });
+
+    it("stops retrying after max attempts on repeated 503 responses", async () => {
+      vi.stubGlobal("fetch", vi.fn().mockResolvedValue({ ok: false, status: 503 }));
+
+      const logs = [];
+      const result = await fetchModelsFromUrl("http://api-proxy:10000/v1/models", 1000, msg => logs.push(msg));
+      expect(result).toBeNull();
+      expect(logs.filter(l => l.includes("retrying (attempt")).length).toBe(AWF_MODELS_URL_MAX_ATTEMPTS - 1);
+      expect(logs.some(l => l.includes("models fetch returned 503"))).toBe(true);
+    });


 // Milliseconds to wait for the /reflect endpoint before giving up.
 const AWF_REFLECT_TIMEOUT_MS = 60000;
 // Milliseconds to wait for each models_url fallback fetch (shorter than the main reflect timeout).
 const AWF_MODELS_URL_TIMEOUT_MS = 3000;
+// Maximum attempts for models_url fallback fetches when the proxy is not yet ready.
+const AWF_MODELS_URL_MAX_ATTEMPTS = 5;
+// Base delay between models_url fallback retries. Uses exponential backoff.
+const AWF_MODELS_URL_RETRY_BASE_MS = 250;
+// Cap for exponential backoff delay between retries.
+const AWF_MODELS_URL_RETRY_MAX_MS = 2000;


github-actions

Skills-Based Review 🧠

Applied /diagnose and /tdd — no blocking issues, but a few items worth addressing before merge.

📋 Key Themes & Highlights

Key Themes

Fragile withRetry coupling (initialDelayMs / 2): the backoff compensation leaks withRetry's internal multiply-before-delay contract into the caller, making the code silently wrong if withRetry ever changes.
Misleading logging guard: attemptCounter < AWF_MODELS_URL_MAX_ATTEMPTS in shouldRetry looks like it controls retry count but only guards a log message — retry termination is solely withRetry's job via maxRetries.
Untested outer-catch 503 path: only reachable if withRetry throws after exhaustion; whether that happens is unclear, and no test covers it.
Attempt-count assertion missing: the exhaustion test checks logs and null return but not that fetch was called exactly MAX_ATTEMPTS times.

Positive Highlights

✅ 503-only retry scope is well-bounded — other failures retain best-effort null-return behavior.
✅ Regression tests for both recovery (503→503→200) and exhaustion paths are a good addition.
✅ Per-attempt AbortController inside the withRetry callback ensures each retry gets a fresh timeout.
✅ The fix is appropriately minimal — the opportunistic nature of pre-flight reflect is preserved.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 1.5M

github-actions · 2026-05-23T19:16:40Z

+  let attemptCounter = 0;
+  const retryConfig = {
+    maxRetries: AWF_MODELS_URL_MAX_ATTEMPTS - 1,
+    // withRetry multiplies delay before the next attempt, so divide by 2 here


[/diagnose] initialDelayMs / 2 leaks withRetry's internal multiply-before-delay contract into the caller — if withRetry changes when it applies the multiplier, the effective first delay will silently double or halve.

💡 Suggestion

Consider one of:

Add a unit test in error_recovery.test.cjs that pins withRetry's first-delay behavior so a change there fails loudly before it reaches callers.

Pass the raw AWF_MODELS_URL_RETRY_BASE_MS and document withRetry's contract in its own JSDoc instead of compensating at call-site.

Right now any refactor of withRetry's backoff scheduling breaks this silently.
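
To make the concern concrete, here is a minimal sketch of a retry helper that multiplies the delay before sleeping, which is the contract the code comment in the diff describes. `withRetrySketch`, its option names, and the injectable `sleep` are assumptions for illustration; the real withRetry is not shown in this thread and may differ.

```javascript
// Hypothetical stand-in for withRetry: it multiplies the delay BEFORE sleeping.
async function withRetrySketch(fn, config) {
  const { maxRetries, initialDelayMs, maxDelayMs, backoffMultiplier = 2, sleep } = config;
  let delay = initialDelayMs;
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retries exhausted
      delay = Math.min(delay * backoffMultiplier, maxDelayMs);
      await sleep(delay);
    }
  }
}

// Record the requested delays instead of actually sleeping.
async function recordDelays(initialDelayMs) {
  const delays = [];
  const alwaysFail = async () => {
    throw new Error("503");
  };
  const config = {
    maxRetries: 4,
    initialDelayMs,
    maxDelayMs: 2000,
    sleep: async ms => {
      delays.push(ms);
    },
  };
  await withRetrySketch(alwaysFail, config).catch(() => {});
  return delays;
}

const compensated = recordDelays(250 / 2); // call-site halves the base
const raw = recordDelays(250); // call-site passes the base unchanged

Promise.all([compensated, raw]).then(([a, b]) => console.log(a, b));
```

Under this assumed contract the halved base produces delays of 250, 500, 1000, and 2000 ms, while the raw base produces 500, 1000, 2000, and 2000 ms. A test that pins the recorded delays in this way would fail loudly if the helper ever moved the multiplication.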

github-actions · 2026-05-23T19:16:40Z

+    jitterMs: 0,
+    shouldRetry: error => {
+      const original = error?.originalError || error;
+      const status = original?.status ?? original?.response?.status ?? null;


[/diagnose] The attemptCounter < AWF_MODELS_URL_MAX_ATTEMPTS guard controls only the log message, not actual retry termination — withRetry controls termination via maxRetries. A future reader may mistake this for the retry-stop condition.

💡 Suggestion

Either remove the guard (the log fires on every 503, including the final one), or add a clarifying comment:

// Log on every 503; withRetry itself stops after maxRetries (MAX_ATTEMPTS - 1 retries).
if (shouldRetry) {
  logger(`awf-reflect: models fetch returned 503 for ${modelsUrl}; retrying (attempt ${attemptCounter + 1}/${AWF_MODELS_URL_MAX_ATTEMPTS})`);
}

With the guard in place, the final 503 logs no "retrying" line even though shouldRetry returns true; withRetry has already reached maxRetries by then, so no message is emitted. It is worth checking whether that silence on the final attempt is intentional.

github-actions · 2026-05-23T19:16:40Z

    const e = /** @type {Error} */ err;
-    if (e.name === "AbortError") {
-      return null; // already logged above
+    const original = e?.originalError || e;


[/tdd] This outer catch 503 branch has no test coverage — it's only reachable if withRetry itself throws after exhausting retries, but whether withRetry throws or returns null on exhaustion isn't clear from this call-site.

💡 Suggestion

Add a test that verifies what withRetry actually does on exhaustion (throw vs. return), and then either:

Add a regression test that exercises this outer-catch 503 path if it's reachable.

Or remove this branch and replace with a comment if withRetry never throws after exhaustion.

Dead catch branches are maintenance traps: they'll never fail tests if the behavior changes, but they'll mislead readers into thinking there's a safety net.

github-actions · 2026-05-23T19:16:40Z

+            logger(`awf-reflect: fetched ${models.length} model(s) from ${modelsUrl}`);
+          }
+          return models;
+        } catch (err) {


[/diagnose] A timeout (AbortError) during a retry returns null, which withRetry treats as a successful result and stops retrying — even though the URL might still become reachable. This differs from the original behavior where a timeout was always final. Worth a comment to confirm it's intentional.

💡 Context

In the original code, AbortError → return null was the only exit path for timeouts and was fine because there was only one attempt.

Now, if attempt #2 times out but the server would have responded on attempt #3, the retry loop stops early. This is probably acceptable (keep pre-flight opportunistic), but the behavior difference is non-obvious and a comment would prevent a future reader from "fixing" it.

// Timeout: treat as a successful null (not an error) so withRetry
// doesn't count this as a retryable failure; pre-flight capture is best-effort.
if (e.name === "AbortError") {
  return null;
}
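
The per-attempt timeout behavior discussed here can be sketched in isolation. The helper below gives one attempt its own AbortController and maps a timeout to null rather than an error; `attemptWithTimeout` and `hangingFetch` are hypothetical names for illustration, not code from awf_reflect.cjs.

```javascript
// Hypothetical helper: run a single fetch attempt with its own AbortController.
async function attemptWithTimeout(fetchFn, url, timeoutMs) {
  const ac = new AbortController();
  const timer = setTimeout(() => ac.abort(), timeoutMs);
  try {
    return await fetchFn(url, { signal: ac.signal });
  } catch (err) {
    if (err && err.name === "AbortError") {
      return null; // timeout: best-effort null, not a retryable failure
    }
    throw err; // anything else is left to the caller's retry policy
  } finally {
    clearTimeout(timer);
  }
}

// Fake fetch that never answers and rejects only once it is aborted.
function hangingFetch(_url, { signal }) {
  return new Promise((_resolve, reject) => {
    signal.addEventListener("abort", () => {
      const err = new Error("The operation was aborted");
      err.name = "AbortError";
      reject(err);
    });
  });
}

const timedOut = attemptWithTimeout(hangingFetch, "http://api-proxy:10000/v1/models", 10);
timedOut.then(value => console.log(value)); // null
```

Because the attempt resolves to null instead of throwing, a retry wrapper around it sees a normal return value and stops, which is the early exit the comment above describes.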

github-actions · 2026-05-23T19:16:40Z

+      expect(logs.filter(l => l.includes("retrying (attempt")).length).toBe(AWF_MODELS_URL_MAX_ATTEMPTS - 1);
+      expect(logs.some(l => l.includes("models fetch returned 503"))).toBe(true);
+    });
  });


[/tdd] The exhaustion test verifies log messages and null return but doesn't assert that fetch was called exactly AWF_MODELS_URL_MAX_ATTEMPTS times — so if withRetry silently stops earlier, the test still passes.

💡 Suggested addition

const mockFetch = vi.fn().mockResolvedValue({ ok: false, status: 503 });
vi.stubGlobal("fetch", mockFetch);
// ... run ...
expect(mockFetch).toHaveBeenCalledTimes(AWF_MODELS_URL_MAX_ATTEMPTS);

This pins the contract: exactly 5 attempts (1 initial + 4 retries), no more, no less. Without it, an off-by-one in maxRetries wouldn't be caught.
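
The off-by-one this assertion guards against can be shown without Vitest, using a hand-rolled counting stub. The retry loop below is a simplified, hypothetical stand-in (`fetchWithRetriesSketch`, with delays omitted), not the actual fetchModelsFromUrl.

```javascript
const MAX_ATTEMPTS = 5; // mirrors AWF_MODELS_URL_MAX_ATTEMPTS

// Hypothetical stand-in for the retrying fetch; backoff delays are omitted.
async function fetchWithRetriesSketch(fetchFn, url, maxRetries) {
  for (let retry = 0; retry <= maxRetries; retry += 1) {
    const res = await fetchFn(url);
    if (res.status !== 503) return res;
  }
  return null; // every attempt returned 503
}

// Minimal replacement for a mock: counts how often it was called.
function countingStub(response) {
  const stub = async () => {
    stub.calls += 1;
    return response;
  };
  stub.calls = 0;
  return stub;
}

async function callsFor(maxRetries) {
  const stub = countingStub({ ok: false, status: 503 });
  await fetchWithRetriesSketch(stub, "http://api-proxy:10000/v1/models", maxRetries);
  return stub.calls;
}

const correct = callsFor(MAX_ATTEMPTS - 1); // resolves to 5
const offByOne = callsFor(MAX_ATTEMPTS); // resolves to 6

Promise.all([correct, offByOne]).then(counts => console.log(counts)); // [ 5, 6 ]
```

Both variants return null and would produce similar log output, so only the call count distinguishes the correct configuration from the off-by-one.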

github-actions · 2026-05-23T19:17:44Z

🧪 Test Quality Sentinel Report

✅ Test Quality Score: 90/100 — Excellent

Analyzed 3 test case(s): 3 design tests, 0 implementation tests, 0 guideline violations. All new tests verify behavioral contracts for 503 retry handling.

📊 Metrics & Test Classification (3 tests analyzed)

Metric	Value
New/modified tests analyzed	3
✅ Design tests (behavioral contracts)	3 (100%)
⚠️ Implementation tests (low value)	0 (0%)
Tests with error/edge cases	2 (67%)
Duplicate test clusters	0
Test inflation detected	No (0.56:1 ratio)
🚨 Coding-guideline violations	0

Test Classification Details

Test	File	Classification	Issues Detected
Constants validation (modified)	`actions/setup/js/awf_reflect.test.cjs:24`	✅ Design	None — verifies exported retry configuration constants
`retries on 503 and eventually succeeds`	`actions/setup/js/awf_reflect.test.cjs:180`	✅ Design	None — verifies system eventually succeeds after transient 503 errors
`stops retrying after max attempts on repeated 503 responses`	`actions/setup/js/awf_reflect.test.cjs:197`	✅ Design	None — verifies retry limit enforcement after exhausting attempts

Language Support

Tests analyzed:

🟨 JavaScript (*.test.cjs): 3 tests (vitest)

Verdict

✅ Check passed. 0% of new tests are implementation tests (threshold: 30%). All tests verify behavioral contracts for the new 503 retry logic.

📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

Assert on observable outputs, return values, or state changes
Cover error paths and boundary conditions
Would catch a behavioral regression if deleted
Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

Assert on internal function calls (mocking internals)
Only test the happy path with typical inputs
Break during legitimate refactoring even when behavior is correct
Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References:

§26341143640

🧪 Test quality analysis by Test Quality Sentinel · ● 1.3M · ◷

github-actions

✅ Test Quality Sentinel: 90/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All tests verify behavioral contracts for 503 retry handling.

Initial plan

ee6d784

Copilot AI assigned Copilot and pelikhan May 23, 2026

Copilot started work on behalf of pelikhan May 23, 2026 17:37 View session

Copilot AI linked an issue May 23, 2026 that may be closed by this pull request

api-proxy healthcheck race: /v1/models returns 503 before upstream LLM connection is ready #34250

Closed

Copilot AI and others added 2 commits May 23, 2026 17:46

fix: retry awf-reflect models fetch on startup 503

1776b0b

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

test: cover awf-reflect 503 retry backoff behavior

aee30d8

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix api-proxy healthcheck to avoid 503 error during startup~~ Harden awf-reflect startup against api-proxy warmup 503s on /v1/models May 23, 2026

Copilot finished work on behalf of pelikhan May 23, 2026 17:48

Copilot AI requested a review from pelikhan May 23, 2026 17:48

pelikhan reviewed May 23, 2026

Copilot started work on behalf of pelikhan May 23, 2026 18:39 View session

Copilot finished work on behalf of pelikhan May 23, 2026 18:46

Copilot AI requested a review from pelikhan May 23, 2026 18:46

Copilot started work on behalf of pelikhan May 23, 2026 18:53 View session

pelikhan reviewed May 23, 2026

Copilot AI and others added 4 commits May 23, 2026 19:00

refactor: use withRetry in awf_reflect models fetch

bf57424

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

test: align awf_reflect retry log assertions with withRetry

225cf6d

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

chore: tighten withRetry integration in awf_reflect

3f2ae34

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

refine: simplify awf_reflect withRetry error metadata

7a1c875

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot finished work on behalf of pelikhan May 23, 2026 19:04

Copilot AI requested a review from pelikhan May 23, 2026 19:04

Copilot started work on behalf of pelikhan May 23, 2026 19:04 View session

pelikhan marked this pull request as ready for review May 23, 2026 19:10

Copilot AI review requested due to automatic review settings May 23, 2026 19:10

Copilot started reviewing on behalf of pelikhan May 23, 2026 19:10 View session

Copilot finished work on behalf of pelikhan May 23, 2026 19:11

pelikhan merged commit d1d7460 into main May 23, 2026
24 of 33 checks passed

pelikhan deleted the copilot/fix-api-proxy-healthcheck branch May 23, 2026 19:12

Copilot AI reviewed May 23, 2026

github-actions Bot mentioned this pull request May 23, 2026

[aw] No-Op Runs #34226

Closed

github-actions Bot reviewed May 23, 2026

github-actions Bot approved these changes May 23, 2026


3 participants
