Harden awf-reflect startup against api-proxy warmup 503s on /v1/models #34265

Merged
pelikhan merged 7 commits into main from copilot/fix-api-proxy-healthcheck
May 23, 2026

Conversation

Contributor

Copilot AI commented May 23, 2026

awf-api-proxy could be marked healthy before upstream LLM connectivity was ready, so early awf-reflect model discovery hit 503 and wrote incomplete awf-reflect.json. This also consumed Claude startup retry budget during the first minutes of a run.

  • Retry policy for startup 503s

    • Added bounded retry logic in fetchModelsFromUrl for HTTP 503 responses.
    • Uses exponential backoff with explicit limits:
      • AWF_MODELS_URL_MAX_ATTEMPTS = 5
      • AWF_MODELS_URL_RETRY_BASE_MS = 250
      • AWF_MODELS_URL_RETRY_MAX_MS = 2000
    • Non-503 responses and other failures retain existing best-effort behavior (no throw, return null).
  • Behavioral intent

    • Keep pre-flight reflect capture opportunistic, but avoid failing fast on transient proxy warmup states.
    • Improve chances that initial awf-reflect.json includes model lists instead of requiring teardown-time recovery.
  • Focused test coverage

    • Added tests for:
      • 503 -> 503 -> 200 recovery path
      • repeated 503 path stopping at max attempts
    • Updated constant assertions for new retry/backoff settings.
if (res.status === 503 && attempt < AWF_MODELS_URL_MAX_ATTEMPTS) {
  const backoffMs = Math.min(
    AWF_MODELS_URL_RETRY_BASE_MS * 2 ** (attempt - 1),
    AWF_MODELS_URL_RETRY_MAX_MS
  );
  await new Promise(resolve => setTimeout(resolve, backoffMs));
  continue;
}
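
The retry policy described above can be sketched as a self-contained loop. This is a minimal illustration and not the merged implementation (which later switched to `withRetry`): the function name `fetchModelsWithBackoff` and the injectable `fetchImpl` and `sleep` parameters are hypothetical, added so the loop can be exercised without a network. The `{ data: [{ id }] }` response shape is taken from the tests in this PR.

```javascript
// Constants as listed in the PR description.
const AWF_MODELS_URL_MAX_ATTEMPTS = 5;
const AWF_MODELS_URL_RETRY_BASE_MS = 250;
const AWF_MODELS_URL_RETRY_MAX_MS = 2000;

const defaultSleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical helper: fetchImpl and sleep are injectable for testing only.
async function fetchModelsWithBackoff(url, { fetchImpl = globalThis.fetch, sleep = defaultSleep } = {}) {
  for (let attempt = 1; attempt <= AWF_MODELS_URL_MAX_ATTEMPTS; attempt += 1) {
    let res;
    try {
      res = await fetchImpl(url);
    } catch {
      return null; // network errors stay best-effort: no throw, no retry
    }
    if (res.status === 503 && attempt < AWF_MODELS_URL_MAX_ATTEMPTS) {
      // Exponential backoff, capped at the configured maximum.
      const backoffMs = Math.min(
        AWF_MODELS_URL_RETRY_BASE_MS * 2 ** (attempt - 1),
        AWF_MODELS_URL_RETRY_MAX_MS
      );
      await sleep(backoffMs);
      continue;
    }
    if (!res.ok) {
      return null; // non-503 failures, or a 503 on the final attempt
    }
    const body = await res.json();
    return Array.isArray(body?.data) ? body.data.map(m => m.id) : null;
  }
  return null;
}
```

With these limits, a proxy that never recovers is polled five times with waits of 250, 500, 1000 and 2000 ms, roughly 3.75 s of backoff in total (excluding the time spent in each request).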

Copilot AI and others added 2 commits May 23, 2026 17:46
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix api-proxy healthcheck to avoid 503 error during startup" to "Harden awf-reflect startup against api-proxy warmup 503s on /v1/models" May 23, 2026
Copilot AI requested a review from pelikhan May 23, 2026 17:48
Comment thread actions/setup/js/awf_reflect.cjs Outdated
if (!res.ok) {
logger(`awf-reflect: models fetch returned ${res.status} for ${modelsUrl}`);
for (let attempt = 1; attempt <= AWF_MODELS_URL_MAX_ATTEMPTS; attempt += 1) {
const ac = new AbortController();
Collaborator

@copilot use withRetry instead?

Contributor Author

Good call-out. I checked it, but withRetry currently assumes the github-script core global on retry paths and throws ReferenceError: core is not defined in harness runtime (where awf_reflect.cjs runs). I kept the local bounded 503 retry here to preserve best-effort behavior without introducing that dependency. If you want, I can do a follow-up to make withRetry runtime-agnostic and then switch this over.

Collaborator

@copilot require shim.cjs and use withRetry

Contributor Author

Implemented in 7a1c875. awf_reflect.cjs now requires ./shim.cjs and uses withRetry for the 503 warmup path; tests were updated in awf_reflect.test.cjs and pass.
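
For context on the problem being worked around: `withRetry` reportedly references a `core` global that github-script provides but the harness runtime does not. The contents of `shim.cjs` are not shown in this thread, so the following is only a guess at the general technique, using a hypothetical `installCoreShim` helper: define a console-backed stand-in for `core` when none exists, and leave a real one untouched.

```javascript
// Hypothetical sketch; the real shim.cjs may look quite different.
// Installs a minimal console-backed `core` on `target` if none is defined.
function installCoreShim(target = globalThis, log = console.log) {
  if (typeof target.core === "undefined") {
    target.core = {
      debug: msg => log(`[debug] ${msg}`),
      info: msg => log(String(msg)),
      warning: msg => log(`[warning] ${msg}`),
      error: msg => log(`[error] ${msg}`),
    };
  }
  return target.core;
}
```

The method names mirror the logging functions of `@actions/core` (`debug`, `info`, `warning`, `error`); which of them `withRetry` actually calls is an assumption here.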

Collaborator

@pelikhan pelikhan left a comment

@copilot use shim.cjs and withRetry

Copilot AI and others added 4 commits May 23, 2026 19:00
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI requested a review from pelikhan May 23, 2026 19:04
@pelikhan pelikhan marked this pull request as ready for review May 23, 2026 19:10
Copilot AI review requested due to automatic review settings May 23, 2026 19:10
@github-actions
Contributor

github-actions Bot commented May 23, 2026

Test Quality Sentinel failed during test quality analysis.

@github-actions
Contributor

github-actions Bot commented May 23, 2026

⚠️ PR Code Quality Reviewer failed during code quality review.

@github-actions
Contributor

github-actions Bot commented May 23, 2026

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #34265 does not have the 'implementation' label and has 0 new lines of code in default business logic directories (threshold: 100).

@github-actions
Contributor

github-actions Bot commented May 23, 2026

🧠 Matt Pocock Skills Reviewer failed during the skills-based review.

Contributor Author

Copilot AI commented May 23, 2026

@copilot use shim.cjs and withRetry

Addressed in 7a1c875: actions/setup/js/awf_reflect.cjs now requires ./shim.cjs and uses withRetry for the 503 models fetch retry path.

@pelikhan pelikhan merged commit d1d7460 into main May 23, 2026
24 of 33 checks passed
@pelikhan pelikhan deleted the copilot/fix-api-proxy-healthcheck branch May 23, 2026 19:12
Contributor

Copilot AI left a comment

Pull request overview

This PR hardens awf-reflect startup model discovery against transient 503 Service Unavailable responses from the awf-api-proxy during warmup, improving the likelihood that the initial awf-reflect.json includes complete model lists.

Changes:

  • Added bounded, exponential-backoff retry behavior for 503 responses in fetchModelsFromUrl via the shared withRetry helper.
  • Introduced new retry/backoff constants exported from awf_reflect.cjs.
  • Added focused Vitest coverage for 503 → 503 → 200 recovery and “exhaust retries” behavior.

Summary per file:

| File | Description |
| --- | --- |
| actions/setup/js/awf_reflect.cjs | Adds 503-specific retry/backoff around models discovery fetches and exports new tuning constants. |
| actions/setup/js/awf_reflect.test.cjs | Extends unit tests to cover the new retry behavior and constant defaults. |

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 2

Comment on lines +181 to +206
it("retries on 503 and eventually succeeds", async () => {
  vi.stubGlobal(
    "fetch",
    vi
      .fn()
      .mockResolvedValueOnce({ ok: false, status: 503 })
      .mockResolvedValueOnce({ ok: false, status: 503 })
      .mockResolvedValueOnce({ ok: true, status: 200, json: async () => ({ data: [{ id: "gpt-4o" }] }) })
  );

  const logs = [];
  const result = await fetchModelsFromUrl("http://api-proxy:10000/v1/models", 1000, msg => logs.push(msg));
  expect(result).toEqual(["gpt-4o"]);
  expect(logs.filter(l => l.includes("retrying (attempt")).length).toBe(2);
  expect(logs.some(l => l.includes("fetched 1 model(s)"))).toBe(true);
});

it("stops retrying after max attempts on repeated 503 responses", async () => {
  vi.stubGlobal("fetch", vi.fn().mockResolvedValue({ ok: false, status: 503 }));

  const logs = [];
  const result = await fetchModelsFromUrl("http://api-proxy:10000/v1/models", 1000, msg => logs.push(msg));
  expect(result).toBeNull();
  expect(logs.filter(l => l.includes("retrying (attempt")).length).toBe(AWF_MODELS_URL_MAX_ATTEMPTS - 1);
  expect(logs.some(l => l.includes("models fetch returned 503"))).toBe(true);
});
Comment on lines 31 to +40
// Milliseconds to wait for the /reflect endpoint before giving up.
const AWF_REFLECT_TIMEOUT_MS = 60000;
// Milliseconds to wait for each models_url fallback fetch (shorter than the main reflect timeout).
const AWF_MODELS_URL_TIMEOUT_MS = 3000;
// Maximum attempts for models_url fallback fetches when the proxy is not yet ready.
const AWF_MODELS_URL_MAX_ATTEMPTS = 5;
// Base delay between models_url fallback retries. Uses exponential backoff.
const AWF_MODELS_URL_RETRY_BASE_MS = 250;
// Cap for exponential backoff delay between retries.
const AWF_MODELS_URL_RETRY_MAX_MS = 2000;
@github-actions github-actions Bot mentioned this pull request May 23, 2026
Contributor

@github-actions github-actions Bot left a comment

Skills-Based Review 🧠

Applied /diagnose and /tdd — no blocking issues, but a few items worth addressing before merge.

📋 Key Themes & Highlights

Key Themes

  • Fragile withRetry coupling (initialDelayMs / 2): the backoff compensation leaks withRetry's internal multiply-before-delay contract into the caller, making the code silently wrong if withRetry ever changes.
  • Misleading logging guard: attemptCounter < AWF_MODELS_URL_MAX_ATTEMPTS in shouldRetry looks like it controls retry count but only guards a log message — retry termination is solely withRetry's job via maxRetries.
  • Untested outer-catch 503 path: only reachable if withRetry throws after exhaustion; whether that happens is unclear, and no test covers it.
  • Attempt-count assertion missing: the exhaustion test checks logs and null return but not that fetch was called exactly MAX_ATTEMPTS times.

Positive Highlights

  • ✅ 503-only retry scope is well-bounded — other failures retain best-effort null-return behavior.
  • ✅ Regression tests for both recovery (503→503→200) and exhaustion paths are a good addition.
  • ✅ Per-attempt AbortController inside the withRetry callback ensures each retry gets a fresh timeout.
  • ✅ The fix is appropriately minimal — the opportunistic nature of pre-flight reflect is preserved.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 1.5M

let attemptCounter = 0;
const retryConfig = {
maxRetries: AWF_MODELS_URL_MAX_ATTEMPTS - 1,
// withRetry multiplies delay before the next attempt, so divide by 2 here
Contributor

[/diagnose] initialDelayMs / 2 leaks withRetry's internal multiply-before-delay contract into the caller — if withRetry changes when it applies the multiplier, the effective first delay will silently double or halve.

💡 Suggestion

Consider one of:

  1. Add a unit test in error_recovery.test.cjs that pins withRetry's first-delay behavior so a change there fails loudly before it reaches callers.
  2. Pass the raw AWF_MODELS_URL_RETRY_BASE_MS and document withRetry's contract in its own JSDoc instead of compensating at call-site.

Right now any refactor of withRetry's backoff scheduling breaks this silently.
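
To make the concern concrete, here is a small model of the delay schedule under the two possible orderings. It is a sketch under stated assumptions: `withRetry`'s real source is not shown in this thread, so `delaySchedule`, the `multiplier` of 2, and the `maxDelayMs` cap are hypothetical stand-ins inferred from the "divide by 2" comment in the diff.

```javascript
// Hypothetical model of a retry helper's backoff schedule.
// multiplyBeforeDelay = true models "multiply, then wait" (the behavior the
// call-site compensates for); false models "wait, then multiply".
function delaySchedule({ initialDelayMs, maxDelayMs, maxRetries, multiplier = 2, multiplyBeforeDelay }) {
  const delays = [];
  let delay = initialDelayMs;
  for (let retry = 0; retry < maxRetries; retry += 1) {
    if (multiplyBeforeDelay) {
      delay = Math.min(delay * multiplier, maxDelayMs);
    }
    delays.push(Math.min(delay, maxDelayMs));
    if (!multiplyBeforeDelay) {
      delay = Math.min(delay * multiplier, maxDelayMs);
    }
  }
  return delays;
}

const base = 250; // AWF_MODELS_URL_RETRY_BASE_MS
const common = { maxDelayMs: 2000, maxRetries: 4 };

// Current call-site: halve the base to offset multiply-before-delay.
const compensated = delaySchedule({ ...common, initialDelayMs: base / 2, multiplyBeforeDelay: true });
// Same call-site if the helper switched to wait-then-multiply.
const afterRefactor = delaySchedule({ ...common, initialDelayMs: base / 2, multiplyBeforeDelay: false });
// Raw base passed to a multiply-before-delay helper without compensation.
const uncompensated = delaySchedule({ ...common, initialDelayMs: base, multiplyBeforeDelay: true });

console.log(compensated, afterRefactor, uncompensated);
```

Under this model the compensated schedule is 250, 500, 1000, 2000 ms; a change in ordering halves it to 125, 250, 500, 1000 ms, while passing the raw base without compensation starts at 500 ms. A test pinning the first delay would turn either drift into a visible failure.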

jitterMs: 0,
shouldRetry: error => {
const original = error?.originalError || error;
const status = original?.status ?? original?.response?.status ?? null;
Contributor

[/diagnose] The attemptCounter < AWF_MODELS_URL_MAX_ATTEMPTS guard controls only the log message, not actual retry termination — withRetry controls termination via maxRetries. A future reader may mistake this for the retry-stop condition.

💡 Suggestion

Either remove the guard (the log fires on every 503, including the final one), or add a clarifying comment:

// Log on every 503; withRetry itself stops after maxRetries (MAX_ATTEMPTS - 1 retries).
if (shouldRetry) {
  logger(`awf-reflect: models fetch returned 503 for ${modelsUrl}; retrying (attempt ${attemptCounter + 1}/${AWF_MODELS_URL_MAX_ATTEMPTS})`);
}

This also means the last 503 won't log a "retrying" line even though shouldRetry returns true — the withRetry layer will have already stopped, so no message is emitted. Worth checking if the silence on the final attempt is intentional.

const e = /** @type {Error} */ err;
if (e.name === "AbortError") {
return null; // already logged above
const original = e?.originalError || e;
Contributor

[/tdd] This outer catch 503 branch has no test coverage — it's only reachable if withRetry itself throws after exhausting retries, but whether withRetry throws or returns null on exhaustion isn't clear from this call-site.

💡 Suggestion

Add a test that verifies what withRetry actually does on exhaustion (throw vs. return), and then either:

  • Add a regression test that exercises this outer-catch 503 path if it's reachable.
  • Or remove this branch and replace with a comment if withRetry never throws after exhaustion.

Dead catch branches are maintenance traps: they'll never fail tests if the behavior changes, but they'll mislead readers into thinking there's a safety net.

logger(`awf-reflect: fetched ${models.length} model(s) from ${modelsUrl}`);
}
return models;
} catch (err) {
Contributor

[/diagnose] A timeout (AbortError) during a retry returns null, which withRetry treats as a successful result and stops retrying — even though the URL might still become reachable. This differs from the original behavior where a timeout was always final. Worth a comment to confirm it's intentional.

💡 Context

In the original code, AbortError → return null was the only exit path for timeouts and was fine because there was only one attempt.

Now, if attempt #2 times out but the server would have responded on attempt #3, the retry loop stops early. This is probably acceptable (keep pre-flight opportunistic), but the behavior difference is non-obvious and a comment would prevent a future reader from "fixing" it.

// Timeout: treat as a successful null (not an error) so withRetry
// doesn't count this as a retryable failure — pre-flight capture is best-effort.
if (e.name === "AbortError") {
  return null;
}
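
The per-attempt timeout pattern behind this comment can be sketched in isolation. This is an illustration only: `fetchOnceWithTimeout` and its injectable `fetchImpl` parameter are hypothetical names, and the real callback in `awf_reflect.cjs` may be structured differently.

```javascript
// Hypothetical single-attempt helper: each call gets a fresh AbortController,
// so a timeout on one attempt cannot leak into the next.
async function fetchOnceWithTimeout(url, timeoutMs, fetchImpl = globalThis.fetch) {
  const ac = new AbortController();
  const timer = setTimeout(() => ac.abort(), timeoutMs);
  try {
    return await fetchImpl(url, { signal: ac.signal });
  } catch (err) {
    if (err && err.name === "AbortError") {
      // Timeout: resolve to null instead of throwing. A retry wrapper sees a
      // normal return value here, so it stops rather than retrying.
      return null;
    }
    throw err; // other failures are left to the caller
  } finally {
    clearTimeout(timer);
  }
}
```

Because the timeout resolves to `null`, any wrapper that retries only on thrown errors treats it as final, which matches the best-effort intent discussed above.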

expect(logs.filter(l => l.includes("retrying (attempt")).length).toBe(AWF_MODELS_URL_MAX_ATTEMPTS - 1);
expect(logs.some(l => l.includes("models fetch returned 503"))).toBe(true);
});
});
Contributor

[/tdd] The exhaustion test verifies log messages and null return but doesn't assert that fetch was called exactly AWF_MODELS_URL_MAX_ATTEMPTS times — so if withRetry silently stops earlier, the test still passes.

💡 Suggested addition
const mockFetch = vi.fn().mockResolvedValue({ ok: false, status: 503 });
vi.stubGlobal("fetch", mockFetch);
// ... run ...
expect(mockFetch).toHaveBeenCalledTimes(AWF_MODELS_URL_MAX_ATTEMPTS);

This pins the contract: exactly 5 attempts (1 initial + 4 retries), no more, no less. Without it, an off-by-one in maxRetries wouldn't be caught.

@github-actions
Contributor

🧪 Test Quality Sentinel Report

Test Quality Score: 90/100 — Excellent

Analyzed 3 test case(s): 3 design tests, 0 implementation tests, 0 guideline violations. All new tests verify behavioral contracts for 503 retry handling.

📊 Metrics & Test Classification (3 tests analyzed)

| Metric | Value |
| --- | --- |
| New/modified tests analyzed | 3 |
| ✅ Design tests (behavioral contracts) | 3 (100%) |
| ⚠️ Implementation tests (low value) | 0 (0%) |
| Tests with error/edge cases | 2 (67%) |
| Duplicate test clusters | 0 |
| Test inflation detected | No (0.56:1 ratio) |
| 🚨 Coding-guideline violations | 0 |

Test Classification Details

| Test | File | Classification | Issues Detected |
| --- | --- | --- | --- |
| Constants validation (modified) | actions/setup/js/awf_reflect.test.cjs:24 | ✅ Design | None; verifies exported retry configuration constants |
| retries on 503 and eventually succeeds | actions/setup/js/awf_reflect.test.cjs:180 | ✅ Design | None; verifies system eventually succeeds after transient 503 errors |
| stops retrying after max attempts on repeated 503 responses | actions/setup/js/awf_reflect.test.cjs:197 | ✅ Design | None; verifies retry limit enforcement after exhausting attempts |

Language Support

Tests analyzed:

  • 🟨 JavaScript (*.test.cjs): 3 tests (vitest)

Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). All tests verify behavioral contracts for the new 503 retry logic.

📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

🧪 Test quality analysis by Test Quality Sentinel · ● 1.3M ·

Contributor

@github-actions github-actions Bot left a comment

✅ Test Quality Sentinel: 90/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All tests verify behavioral contracts for 503 retry handling.

Development

Successfully merging this pull request may close these issues.

api-proxy healthcheck race: /v1/models returns 503 before upstream LLM connection is ready

3 participants