feat: add /ci-report slash command for oncall CI monitoring

Stringy · claude · Stringy · commit c56c9926e39d · 2026-03-31T10:44:15.000+01:00
Adds a Claude Code slash command that generates CI oncall handoff
reports by analyzing GitHub Actions workflow runs on master and
release branches.

Features:
- Natural language time ranges (e.g. /ci-report last 3 days)
- Branch health summary with pass rates and links
- Root cause analysis that digs past symptoms (e.g. tar failures)
  to find actual test failures and infrastructure issues
- Flaky test detection with same-root-cause requirement
- Trends section comparing against previous reports
- Reports saved to docs/oncall/ for team handoff

Invoke with: /ci-report [time range]
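
As a rough illustration of how a time range could map onto the `--created` filter described in the command file, the sketch below clamps a day count to the 7-day cap and prints the filter string. The function name `created_filter` is hypothetical and not part of this commit; the natural-language parsing itself is left to the model, treating "last 24 hours" as one calendar day back is an assumption, and `date -d` is GNU-specific.

```shell
# Hypothetical helper, not part of the commit: turn a day count into the
# value passed to `gh run list --created`.
created_filter() {
  days=${1:-1}                  # assumption: no argument means one day back
  if [ "$days" -gt 7 ]; then    # cap the window at 7 days
    days=7
  fi
  # GNU date; prints a UTC calendar date prefixed with ">=" so that the
  # boundary day itself is included.
  printf '>=%s\n' "$(date -u -d "$days days ago" +%F)"
}
```

The result could then be passed as `--created "$(created_filter 3)"` to the `gh run list` call in Step 2.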

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/.claude/commands/ci-report.md b/.claude/commands/ci-report.md
@@ -0,0 +1,216 @@
+---
+name: ci-report
+description: Generate a CI oncall handoff report analyzing GitHub Actions workflow runs on master and release branches. Shows failures, flaky tests, and action items.
+user_invocable: true
+---
+
+# CI Oncall Report
+
+Generate a concise CI health report for the Collector oncall handler. This report is designed for oncall handoff — lead with action items, keep it tight.
+
+## Arguments
+
+The user may provide a natural language time range as the argument (e.g. "today", "last 3 days", "this week", "since Monday"). Default to "last 24 hours" if no argument is given. Cap at 7 days maximum — if the user requests more, tell them and use 7 days.
+
+Convert the time range to an ISO 8601 date (YYYY-MM-DD) for the `--created` filter. Use `>=` (not `>`) so that "today" includes today's runs.
+
+## Process
+
+Follow these steps in order. Use the Bash tool for all `gh` commands.
+
+### Step 1: Detect the repository
+
+```bash
+gh repo view --json nameWithOwner -q .nameWithOwner
+```
+
+If this fails, fall back to parsing the git remote:
+```bash
+git remote get-url origin | sed 's|.*github.com[:/]||;s|\.git$||'
+```
+
+Store the result as `REPO` for subsequent commands.
+
+### Step 2: Fetch all workflow runs
+
+```bash
+gh run list --repo REPO --created ">=YYYY-MM-DD" --limit 500 --json headBranch,status,conclusion,workflowName,databaseId,createdAt,updatedAt,url,event
+```
+
+If exactly 500 results are returned, warn the user that results may be truncated and suggest narrowing the time window.
+
+Filter the results to only branches matching `^(master|release-\d+\.\d+)$` — this excludes feature branches and sub-branches like `release-3.24/foo`.
+
+### Step 3: Group and summarize
+
+Count **workflow runs** (not individual jobs within runs). Each entry from `gh run list` is one workflow run. For each branch, count:
+- Passed (conclusion == "success")
+- Failed (conclusion == "failure")
+- Cancelled (reported separately, excluded from pass rate)
+
+Do NOT count skipped jobs or individual job statuses in the summary table — this table is about whole workflow runs only. Individual job failures belong in the Failure Details section.
+
+Calculate pass rate as: passed / (passed + failed) * 100.
+
+### Step 4: Fetch failure details
+
+For each failed run:
+
+1. Get the failed jobs:
+```bash
+gh run view RUN_ID --repo REPO --json jobs
+```
+
+2. Get the failed log output and search for the **real root cause**:
+```bash
+gh run view RUN_ID --repo REPO --log-failed 2>&1 | grep -e "FAIL:" -e "fatal:.*FAILED" -e "TASK \[" -e "Configure VSI" -e "Run integration tests" -e "##\[error\]" | grep -v "RETRYING" | head -20
+```
+
+**IMPORTANT: Do NOT use `tail` on the log output.** The end of the log is typically git cleanup, artifact upload, and `Unarchive logs` steps — these are post-test housekeeping, not the root cause. The `Unarchive logs` step failing with `tar: container-logs/*.tar.gz: Cannot open` is a **symptom** (tests didn't produce logs), never the root cause.
+
+Instead, search the full log for the actual failure by looking for:
+- `--- FAIL: TestName` — Go test failures (the tests ran and a specific test failed)
+- `fatal: [hostname]: FAILED!` — Ansible task failures (VM provisioning, image pulls, etc.)
+- `TASK [task-name]` lines immediately before `fatal:` lines — identifies which ansible step failed
+- `##[error]` — GitHub Actions step errors
+- Build/compilation errors
+
+3. If you need more context around a specific error, use:
+```bash
+gh run view RUN_ID --repo REPO --log --job JOB_ID 2>&1 | grep -B5 -A10 "FAIL:\|fatal:.*FAILED" | head -50
+```
+
+4. Classify the root cause:
+- **Test failure**: A `--- FAIL: TestName` line means the tests ran and failed. Report the test name and the assertion/error message. These are real regressions or flaky tests.
+- **VM provisioning failure**: An ansible `fatal:` on a `create-vm` or `Configure VSI` task means the test environment couldn't be set up. This is infrastructure, not a test problem.
+- **Image pull failure**: An ansible `fatal:` on `Pull non-QA images` or `Pull QA images` could be a non-fatal warning if the tests still ran afterwards. Check whether `Run integration tests` appears later in the log — if it does, the pull failure was not the root cause.
+- **Build failure**: A compilation error in the build step. Report the file and error.
+
+5. Summarize the root cause in one line, naming the specific test or ansible task that failed.
+
+If log fetching fails for a specific run, note it and continue with other runs.
+
+### Step 5: Detect flakiness
+
+Compare runs of the same workflow on the same branch. A job is flaky if it fails in some runs but passes in others within the time window **and the failure has the same root cause each time**. Track the failure frequency (e.g. "failed 2/5 runs").
+
+A job that fails in multiple runs with **different** root causes (e.g. one run hits a repo mirror issue, another hits a timeout) is NOT flaky — those are separate infrastructure problems. Only flag as flaky when the same failure pattern repeats intermittently.
+
+### Step 6: Check previous reports for trends
+
+Look for existing report files in `docs/oncall/`:
+```bash
+ls -1 docs/oncall/*-ci-report*.md 2>/dev/null | sort -r | head -5
+```
+
+If previous reports exist, read them and extract:
+- **Pass rate per branch** from their Branch Health Summary tables
+- **Action items** from their Action Items sections
+
+Use this to build two trend views:
+1. **Pass rate trends** — how each branch's pass rate has changed across reports
+2. **Action item tracking** — which items from previous reports are now resolved vs still failing
+
+If no previous reports exist, skip the trends section in the output.
+
+### Step 7: Generate the report
+
+Write the report following this exact structure. Be concise throughout — the report should be readable in under 2 minutes.
+
+**Linking**: Every claim in the report must be independently verifiable. Use the `url` field from the `gh run list` output to link to specific workflow runs. The GitHub Actions filter URL for a branch is `https://github.com/REPO/actions?query=branch%3ABRANCH_NAME`. Include these links so a human reader can click through and verify any data point.
+
+#### Section 1: Action Items
+
+This is the most important section. Put it first. List things needing attention, most urgent first. Each item should include:
+- What needs attention and why
+- Link to the relevant run(s)
+- Classification: regression, flaky, infrastructure, or needs investigation
+
+Example format:
+```
+- **Regression**: integration-tests failing on master since Mar 24 — NetworkConnection test timeout. [Run #1234](url)
+- **Flaky**: k8s-integration-tests on release-3.24 — fails 2/5 runs, ProcessSignal assertion. [Run #1230](url)
+- **Investigate**: Konflux build failures on release-3.23 — image pull error. [Run #1228](url)
+```
+
+If nothing needs attention: "All clear — no action items."
+
+#### Section 2: Branch Health Summary
+
+One line per branch. Count whole workflow runs only (not individual jobs). Cancelled runs shown separately, excluded from pass rate. Do NOT add a "Skipped" column. Link each branch name to its GitHub Actions filter page.
+
+```
+| Branch       | Runs | Passed | Failed | Cancelled | Pass Rate |
+|--------------|------|--------|--------|-----------|-----------|
+| [master](https://github.com/REPO/actions?query=branch%3Amaster) | 12 | 11 | 1 | 0 | 92% |
+| [release-3.24](https://github.com/REPO/actions?query=branch%3Arelease-3.24) | 8 | 6 | 2 | 0 | 75% |
+```
+
+#### Section 3: Flaky Jobs
+
+Only include this section if flakiness was detected. Link to an example failing run for each entry.
+
+```
+| Job               | Branch       | Fail Rate | Pattern            | Example |
+|-------------------|--------------|-----------|---------------------|---------|
+| NetworkConnection | master       | 2/10      | Timeout after 120s  | [Run #1234](url) |
+```
+
+#### Section 4: Failure Details
+
+Group by root cause where possible. Each entry:
+- Branch, workflow, run link
+- Failed job name
+- One-line root cause
+
+Only include log excerpts when the cause is non-obvious. Keep this section short.
+
+#### Section 5: Trends
+
+Only include this section if previous reports were found in `docs/oncall/`.
+
+**Pass rate trends** — show how each branch's health has changed. Use the dates from previous report filenames as column headers.
+
+```
+| Branch       | Mar 23 | Mar 24 | Mar 25 (today) |
+|--------------|--------|--------|----------------|
+| master       | 100%   | 85%    | 92%            |
+| release-3.24 | 90%    | 75%    | 75%            |
+```
+
+**Action item tracking** — compare today's action items against previous reports. For each previous action item, note whether it's resolved, still present, or new.
+
+```
+- **Resolved**: NetworkConnection timeout on master (first seen Mar 23, resolved today)
+- **Ongoing**: Konflux build failures on release-3.23 (first seen Mar 24, still failing)
+- **New**: integration-tests regression on master (first seen today)
+```
+
+Keep this concise — only mention items that changed status or have persisted for multiple reports.
+
+#### Section 6: Stats
+
+Reference information at the bottom:
+- Date range analyzed
+- Total runs across all branches
+- Overall pass rate
+- Report generated timestamp
+
+### Step 8: Save the report
+
+1. Create the output directory if needed:
+```bash
+mkdir -p docs/oncall
+```
+
+2. Save to `docs/oncall/YYYY-MM-DD-ci-report.md` using today's date.
+
+3. If that file already exists, append `-2`, `-3`, etc. before the `.md` extension (e.g. `YYYY-MM-DD-ci-report-2.md`) until a unique filename is found.
+
+4. Display the full report content in the terminal as well.
+
+## Error Handling
+
+- If `gh` is not authenticated, tell the user to run `! gh auth login` (the `!` prefix runs it in the current session).
+- If no runs are found in the time window, report that clearly — don't generate an empty report.
+- If individual log fetches fail, note the failure and continue with other runs.
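
The branch filter from Step 2 and the pass-rate formula from Step 3 of the command file above can be sketched in shell as follows. This is an illustrative sketch only: `filter_branches` and `pass_rate` are hypothetical names that do not appear in the command file, and rounding to the nearest integer is an assumption based on the example tables.

```shell
# Hypothetical helpers; names are illustrative and not part of the command file.

# Keep only master and release-X.Y branch names (one name per line on stdin).
# POSIX extended regex has no \d, so [0-9] is used instead.
filter_branches() {
  grep -E '^(master|release-[0-9]+\.[0-9]+)$'
}

# Pass rate = passed / (passed + failed) * 100, rounded to the nearest integer.
# Cancelled runs are deliberately left out of both arguments.
pass_rate() {
  passed=$1
  failed=$2
  total=$((passed + failed))
  if [ "$total" -eq 0 ]; then
    echo "n/a"                  # no completed runs, so no meaningful rate
    return
  fi
  # Integer rounding: add half the divisor before dividing.
  echo $(( (passed * 200 + total) / (total * 2) ))
}
```

For example, `pass_rate 11 1` prints `92`, which matches the `master` row in the example Branch Health Summary table, and a sub-branch such as `release-3.24/foo` is dropped by `filter_branches`.
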
diff --git a/docs/oncall/2026-03-25-ci-report-2.md b/docs/oncall/2026-03-25-ci-report-2.md
@@ -0,0 +1,86 @@
+# CI Oncall Report - 2026-03-25 (Weekly)
+
+## Action Items
+
+- **Infrastructure**: release-3.22 Main CI — RHEL 8 yum repo mirror unavailable (`rhui-rhel-8-for-aarch64-appstream-source-rhui-rpms`), blocking VM provisioning. [Run #23482719045](https://github.com/stackrox/collector/actions/runs/23482719045)
+- **Infrastructure**: release-3.22 Main CI — VMs provision OK but tests produce no container log artifacts; `Unarchive logs` fails. GCP Read requests quota exceeded during teardown. [Run #23479589114](https://github.com/stackrox/collector/actions/runs/23479589114)
+- **Infrastructure**: release-3.22 Konflux — `wait-for-images` timed out (~90min) waiting for `rhacs-eng/release-collector:3.22.10-2-g781943a68f-fast`. Image never published. [Run #23482719015](https://github.com/stackrox/collector/actions/runs/23482719015)
+- **Flaky**: release-3.24 Konflux integration tests — failed 2/4 runs with missing container log artifacts (`Unarchive logs` / `Create Test VMs` failures). Affected architectures: ppc64le (Mar 23), rhel + rhel-arm64 (Mar 24). [Run #23455478297](https://github.com/stackrox/collector/actions/runs/23455478297), [Run #23497306295](https://github.com/stackrox/collector/actions/runs/23497306295)
+- **Investigate**: master Konflux — `Store artifacts` step failed in ubuntu-os integration tests (Mar 23). Logs unavailable. Passed on rerun (Mar 24). [Run #23455637467](https://github.com/stackrox/collector/actions/runs/23455637467)
+
+## Branch Health Summary
+
+| Branch | Runs | Passed | Failed | Cancelled | Pass Rate |
+|--------|------|--------|--------|-----------|-----------|
+| [master](https://github.com/stackrox/collector/actions?query=branch%3Amaster) | 20 | 18 | 1 | 1 | 95% |
+| [release-3.22](https://github.com/stackrox/collector/actions?query=branch%3Arelease-3.22) | 4 | 1 | 3 | 0 | 25% |
+| [release-3.23](https://github.com/stackrox/collector/actions?query=branch%3Arelease-3.23) | 2 | 2 | 0 | 0 | 100% |
+| [release-3.24](https://github.com/stackrox/collector/actions?query=branch%3Arelease-3.24) | 8 | 5 | 2 | 1 | 71% |
+
+Note: 107 skipped `Retest Konflux Builds` runs on master (triggered by `check_run` events) excluded from run counts.
+
+## Flaky Jobs
+
+| Job | Branch | Fail Rate | Pattern | Example |
+|-----|--------|-----------|---------|---------|
+| Konflux integration tests (various archs) | release-3.24 | 2/4 | Missing container log artifacts, `Unarchive logs` / `Create Test VMs` failure | [Run #23497306295](https://github.com/stackrox/collector/actions/runs/23497306295) |
+| Konflux ubuntu-os integration tests | master | 1/2 | `Store artifacts` step failure (logs unavailable) | [Run #23455637467](https://github.com/stackrox/collector/actions/runs/23455637467) |
+
+## Failure Details
+
+### Missing container log artifacts (release-3.22, release-3.24, master)
+
+This is the most common failure pattern across the week. Tests run but produce no `container-logs/*.tar.gz` artifacts, causing the `Unarchive logs` step to fail with `tar: Cannot open: No such file or directory`.
+
+- **release-3.22** Main CI — [Run #23479589114](https://github.com/stackrox/collector/actions/runs/23479589114) (Mar 24, 08:15 UTC)
+  - Failed jobs: `rhel`, `rhel-sap`
+  - GCP Read requests quota also exceeded during teardown.
+
+- **release-3.24** Konflux — [Run #23455478297](https://github.com/stackrox/collector/actions/runs/23455478297) (Mar 23, 19:12 UTC)
+  - Failed job: `ppc64le-integration-tests` — `Create Test VMs` and `Unarchive logs` steps failed.
+
+- **release-3.24** Konflux — [Run #23497306295](https://github.com/stackrox/collector/actions/runs/23497306295) (Mar 24, 15:20 UTC)
+  - Failed jobs: `rhel`, `rhel-arm64` — `Unarchive logs` step failed.
+
+- **master** Konflux — [Run #23455637467](https://github.com/stackrox/collector/actions/runs/23455637467) (Mar 23, 19:16 UTC)
+  - Failed job: `ubuntu-os` — `Store artifacts` step failed. Logs not available for this run.
+
+### RHEL 8 yum repo mirror failure (release-3.22)
+
+- **Workflow**: Main CI — [Run #23482719045](https://github.com/stackrox/collector/actions/runs/23482719045) (Mar 24, 09:37 UTC)
+- **Failed job**: `amd64-integration-tests (rhel)`
+- **Cause**: `Create Test VMs` failed — ansible provisioning hit `Failed to download metadata for repo 'rhui-rhel-8-for-aarch64-appstream-source-rhui-rpms'`.
+
+### Konflux image not published (release-3.22)
+
+- **Workflow**: Test Konflux builds — [Run #23482719015](https://github.com/stackrox/collector/actions/runs/23482719015) (Mar 24, 09:37 UTC)
+- **Failed job**: `wait-for-images`
+- **Cause**: Timed out (~90min) polling for image `rhacs-eng/release-collector:3.22.10-2-g781943a68f-fast`.
+
+## Trends
+
+### Pass Rate Trends
+
+| Branch | Mar 25 (daily) | Mar 25 (weekly, today) |
+|--------|---------------|------------------------|
+| master | 100% | 95% |
+| release-3.22 | 25% | 25% |
+| release-3.23 | 100% | 100% |
+| release-3.24 | 80% | 71% |
+
+The daily report (covering Mar 24-25) showed master at 100% because the Mar 23 Konflux failure fell outside its window. The weekly view reveals that failure, dropping master to 95%.
+
+release-3.24 drops from 80% to 71% with the additional Mar 23 ppc64le failure now in scope.
+
+### Action Item Tracking
+
+- **Ongoing**: release-3.22 infrastructure failures (RHEL 8 mirror, GCP quota, Konflux image) — all from Mar 24, still present
+- **Ongoing**: release-3.24 Konflux missing container logs — seen in both daily and weekly reports, 2 failures across the week
+- **New**: master Konflux `Store artifacts` failure (Mar 23) — not in the daily report, self-resolved on rerun
+
+## Stats
+
+- **Date range**: 2026-03-18 to 2026-03-25
+- **Total runs (master/release, non-skipped)**: 34
+- **Overall pass rate**: 81% (26/32 non-cancelled)
+- **Report generated**: 2026-03-25
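
The Flaky classification used in the report above relies on the rule from the command file: a job counts as flaky only when it both passes and fails within the window and every failure shares one root cause. A minimal shell sketch of that rule follows, assuming a hypothetical input format of one line per run (`pass`, or `fail` followed by a root-cause label). The function name `is_flaky` and the labels are illustrative, not taken from the command.

```shell
# Hypothetical helper: reads one line per run on stdin.
# Exits 0 only if at least one run passed and all failures carry the same
# root-cause label (exactly one distinct cause among the failed runs).
is_flaky() {
  awk '
    $1 == "pass" { passes++ }
    $1 == "fail" {
      $1 = ""                              # keep only the root-cause label
      if (!($0 in seen)) { seen[$0] = 1; causes++ }
    }
    END { exit !(passes > 0 && causes == 1) }
  '
}
```

With this rule, a job whose runs are `pass`, `fail missing-logs`, `pass`, `fail missing-logs` is flagged as flaky, while one that fails once on a repo mirror and once on a timeout is not.
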
diff --git a/docs/oncall/2026-03-25-ci-report.md b/docs/oncall/2026-03-25-ci-report.md
@@ -0,0 +1,53 @@
+# CI Oncall Report - 2026-03-25
+
+## Action Items
+
+- **Infrastructure**: release-3.22 Main CI — RHEL 8 yum repo mirror unavailable (`rhui-rhel-8-for-aarch64-appstream-source-rhui-rpms`), blocking VM provisioning. [Run #23482719045](https://github.com/stackrox/collector/actions/runs/23482719045)
+- **Infrastructure**: release-3.22 Main CI — VMs provision OK but no container logs produced, `Unarchive logs` fails on missing `container-logs/*.tar.gz`. GCP Read requests quota exceeded during teardown. [Run #23479589114](https://github.com/stackrox/collector/actions/runs/23479589114)
+- **Infrastructure**: release-3.22 Konflux — `wait-for-images` timed out (~90min) waiting for `rhacs-eng/release-collector:3.22.10-2-g781943a68f-fast`. Image never published. [Run #23482719015](https://github.com/stackrox/collector/actions/runs/23482719015)
+- **Flaky**: release-3.24 Konflux integration tests — failed 1/3 runs with missing container log artifacts. [Run #23497306295](https://github.com/stackrox/collector/actions/runs/23497306295)
+
+## Branch Health Summary
+
+| Branch | Runs | Passed | Failed | Cancelled | Pass Rate |
+|--------|------|--------|--------|-----------|-----------|
+| [master](https://github.com/stackrox/collector/actions?query=branch%3Amaster) | 18 | 17 | 0 | 1 | 100% |
+| [release-3.22](https://github.com/stackrox/collector/actions?query=branch%3Arelease-3.22) | 4 | 1 | 3 | 0 | 25% |
+| [release-3.23](https://github.com/stackrox/collector/actions?query=branch%3Arelease-3.23) | 2 | 2 | 0 | 0 | 100% |
+| [release-3.24](https://github.com/stackrox/collector/actions?query=branch%3Arelease-3.24) | 6 | 4 | 1 | 1 | 80% |
+
+Note: 97 skipped `Retest Konflux Builds` runs on master (triggered by `check_run` events) excluded from run counts.
+
+## Flaky Jobs
+
+| Job | Branch | Fail Rate | Pattern | Example |
+|-----|--------|-----------|---------|---------|
+| amd64-integration-tests (rhel) | release-3.24 | 1/3 | No container logs produced, tar unarchive fails | [Run #23497306295](https://github.com/stackrox/collector/actions/runs/23497306295) |
+
+## Failure Details
+
+### RHEL 8 yum repo mirror failure (release-3.22)
+- **Workflow**: Main collector CI — [Run #23482719045](https://github.com/stackrox/collector/actions/runs/23482719045) (Mar 24, 09:37 UTC)
+- **Failed job**: `amd64-integration-tests (rhel) / Testing rhel`
+- **Cause**: `Create Test VMs` step failed — ansible provisioning hit `Failed to download metadata for repo 'rhui-rhel-8-for-aarch64-appstream-source-rhui-rpms'`. `Unarchive logs` also failed (no test artifacts produced).
+
+### Missing test artifacts / GCP quota (release-3.22, release-3.24)
+- **Workflow**: Main collector CI — [Run #23479589114](https://github.com/stackrox/collector/actions/runs/23479589114) (Mar 24, 08:15 UTC)
+- **Failed jobs**: `rhel`, `rhel-sap`
+- **Cause**: VMs provisioned successfully but tests produced no container log artifacts. `Unarchive logs` failed: `tar: container-logs/*.tar.gz: Cannot open: No such file or directory`. GCP `Read requests` quota exceeded during teardown.
+
+- **Workflow**: Test Konflux builds — [Run #23497306295](https://github.com/stackrox/collector/actions/runs/23497306295) (Mar 24, 15:20 UTC)
+- **Failed jobs**: `rhel`, `rhel-arm64`
+- **Cause**: Same missing container-logs pattern.
+
+### Konflux image not published (release-3.22)
+- **Workflow**: Test Konflux builds — [Run #23482719015](https://github.com/stackrox/collector/actions/runs/23482719015) (Mar 24, 09:37 UTC)
+- **Failed job**: `wait-for-images`
+- **Cause**: Timed out (~90min) polling for image `rhacs-eng/release-collector:3.22.10-2-g781943a68f-fast`. Image never appeared in registry.
+
+## Stats
+
+- **Date range**: 2026-03-24 to 2026-03-25
+- **Total runs (master/release only)**: 30 non-skipped (127 including skipped)
+- **Overall pass rate**: 86% (24/28 non-skipped, non-cancelled)
+- **Report generated**: 2026-03-25
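
The Stats figures above follow from the Branch Health Summary table of this report. As a quick check, the sketch below sums the table columns with `awk`. The row data is copied from the table; the script itself is illustrative and not part of the command.

```shell
# Columns: branch runs passed failed cancelled (copied from the summary table).
summary=$(awk '
  { runs += $2; passed += $3; failed += $4 }
  END {
    # Cancelled runs are excluded from the denominator.
    printf "%d runs, %d/%d passed = %.0f%%\n",
      runs, passed, passed + failed, 100 * passed / (passed + failed)
  }
' <<'EOF'
master 18 17 0 1
release-3.22 4 1 3 0
release-3.23 2 2 0 0
release-3.24 6 4 1 1
EOF
)
echo "$summary"   # prints: 30 runs, 24/28 passed = 86%
```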