Commit c56c992

Stringy and claude committed
feat: add /ci-report slash command for oncall CI monitoring

Adds a Claude Code slash command that generates CI oncall handoff reports by analyzing GitHub Actions workflow runs on master and release branches.

Features:
- Natural language time ranges (e.g. /ci-report last 3 days)
- Branch health summary with pass rates and links
- Root cause analysis that digs past symptoms (e.g. tar failures) to find actual test failures and infrastructure issues
- Flaky test detection with same-root-cause requirement
- Trends section comparing against previous reports
- Reports saved to docs/oncall/ for team handoff

Invoke with: /ci-report [time range]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 4897659 commit c56c992

3 files changed

Lines changed: 355 additions & 0 deletions

.claude/commands/ci-report.md

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
---
name: ci-report
description: Generate a CI oncall handoff report analyzing GitHub Actions workflow runs on master and release branches. Shows failures, flaky tests, and action items.
user_invocable: true
---

# CI Oncall Report

Generate a concise CI health report for the Collector oncall handler. This report is designed for oncall handoff — lead with action items, keep it tight.

## Arguments

The user may provide a natural language time range as the argument (e.g. "today", "last 3 days", "this week", "since Monday"). Default to "last 24 hours" if no argument is given. Cap at 7 days maximum — if the user requests more, tell them and use 7 days.

Convert the time range to an ISO 8601 date (YYYY-MM-DD) for the `--created` filter. Use `>=` (not `>`) so that "today" includes today's runs.
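
The window handling described above can be sketched in shell. This is a minimal sketch, assuming the natural-language range has already been reduced to a whole number of days and that GNU `date` (with `-d`) is available. `clamp_days` and `created_filter` are hypothetical helper names and are not part of the command itself.

```bash
# Clamp a requested window (in days) to the 7-day maximum.
# With no argument, fall back to 1 day ("last 24 hours").
clamp_days() {
  local days="${1:-1}"
  if [ "$days" -gt 7 ]; then
    echo "Requested $days days; capping at 7." >&2
    days=7
  fi
  echo "$days"
}

# Build the value passed to `gh run list --created`.
# Assumption: GNU date; BSD/macOS date needs `-v-Nd` instead of `-d`.
created_filter() {
  local days
  days="$(clamp_days "$1")"
  echo ">=$(date -u -d "$days days ago" +%Y-%m-%d)"
}
```

For example, `created_filter 3` would print `>=` followed by the UTC date three days ago, and `created_filter 30` would warn on stderr and fall back to seven days.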

## Process

Follow these steps in order. Use the Bash tool for all `gh` commands.

### Step 1: Detect the repository

```bash
gh repo view --json nameWithOwner -q .nameWithOwner
```

If this fails, fall back to parsing the git remote:
```bash
git remote get-url origin | sed 's|.*github.com[:/]||;s|\.git$||'
```

Store the result as `REPO` for subsequent commands.

### Step 2: Fetch all workflow runs

```bash
gh run list --repo REPO --created ">=YYYY-MM-DD" --limit 500 --json headBranch,status,conclusion,workflowName,databaseId,createdAt,updatedAt,url,event
```

If exactly 500 results are returned, warn the user that results may be truncated and suggest narrowing the time window.

Filter the results to only branches matching `^(master|release-\d+\.\d+)$` — this excludes feature branches and sub-branches like `release-3.24/foo`.
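
The branch filter can be checked locally with a small predicate. The sketch below is illustrative only: `is_report_branch` is a hypothetical name, and because `\d` is a Perl-style class that POSIX `grep -E` does not guarantee, the pattern is written with `[0-9]` instead.

```bash
# Succeed (exit 0) only for master and release-X.Y branches.
is_report_branch() {
  printf '%s\n' "$1" | grep -Eq '^(master|release-[0-9]+\.[0-9]+)$'
}

# Try the predicate on a few sample branch names.
for branch in master release-3.24 release-3.24/foo feature/ci-report; do
  if is_report_branch "$branch"; then
    echo "keep  $branch"
  else
    echo "skip  $branch"
  fi
done
```

The anchors `^` and `$` are what reject sub-branches such as `release-3.24/foo`.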

### Step 3: Group and summarize

Count **workflow runs** (not individual jobs within runs). Each entry from `gh run list` is one workflow run. For each branch, count:
- Passed (conclusion == "success")
- Failed (conclusion == "failure")
- Cancelled (reported separately, excluded from pass rate)

Do NOT count skipped jobs or individual job statuses in the summary table — this table is about whole workflow runs only. Individual job failures belong in the Failure Details section.

Calculate pass rate as: passed / (passed + failed) * 100.
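
As a worked example of the formula, here is a minimal sketch using integer arithmetic. `pass_rate` is a hypothetical helper. Rounding to the nearest whole percent and printing `n/a` when there are no completed runs are both assumptions, since the text above does not specify either.

```bash
# Print the pass rate as a whole percentage, rounded to nearest.
# Cancelled runs are deliberately not an input.
pass_rate() {
  local passed="$1" failed="$2"
  local total=$((passed + failed))
  if [ "$total" -eq 0 ]; then
    echo "n/a"   # assumption: no completed runs means no rate to report
    return
  fi
  echo "$(( (passed * 100 + total / 2) / total ))%"
}

pass_rate 11 1   # prints 92%
pass_rate 6 2    # prints 75%
```

A branch with 18 passed, 1 failed and 1 cancelled run gives 18 / 19, which this sketch reports as 95%.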

### Step 4: Fetch failure details

For each failed run:

1. Get the failed jobs:
```bash
gh run view RUN_ID --repo REPO --json jobs
```

2. Get the failed log output and search for the **real root cause**:
```bash
gh run view RUN_ID --repo REPO --log-failed 2>&1 | grep -e "FAIL:" -e "fatal:.*FAILED" -e "TASK \[" -e "Configure VSI" -e "Run integration tests" -e "##\[error\]" | grep -v "RETRYING" | head -20
```

**IMPORTANT: Do NOT use `tail` on the log output.** The end of the log is typically git cleanup, artifact upload, and `Unarchive logs` steps — these are post-test housekeeping, not the root cause. The `Unarchive logs` step failing with `tar: container-logs/*.tar.gz: Cannot open` is a **symptom** (tests didn't produce logs), never the root cause.

Instead, search the full log for the actual failure by looking for:
- `--- FAIL: TestName` — Go test failures (the tests ran and a specific test failed)
- `fatal: [hostname]: FAILED!` — Ansible task failures (VM provisioning, image pulls, etc.)
- `TASK [task-name]` lines immediately before `fatal:` lines — identifies which Ansible step failed
- `##[error]` — GitHub Actions step errors
- Build/compilation errors

3. If you need more context around a specific error, use:
```bash
gh run view RUN_ID --repo REPO --log --job JOB_ID 2>&1 | grep -B5 -A10 "FAIL:\|fatal:.*FAILED" | head -50
```

4. Classify the root cause:
- **Test failure**: A `--- FAIL: TestName` line means the tests ran and failed. Report the test name and the assertion/error message. These are real regressions or flaky tests.
- **VM provisioning failure**: An Ansible `fatal:` on a `create-vm` or `Configure VSI` task means the test environment couldn't be set up. This is infrastructure, not a test problem.
- **Image pull failure**: An Ansible `fatal:` on `Pull non-QA images` or `Pull QA images` could be a non-fatal warning if the tests still ran afterwards. Check whether `Run integration tests` appears later in the log — if it does, the pull failure was not the root cause.
- **Build failure**: A compilation error in the build step. Report the file and error.

5. Summarize the root cause in one line, naming the specific test or Ansible task that failed.

If log fetching fails for a specific run, note it and continue with other runs.
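
The classification rules in this step can be approximated by a single pass over the log. The following is a rough sketch under stated assumptions, not the command's actual logic: `classify_root_cause` is a hypothetical helper, it reads one job's log on stdin, it does not detect build errors, and it relies only on the markers listed above (`--- FAIL:`, `TASK [...]`, `fatal: ... FAILED!`, `Run integration tests`).

```bash
# Classify one job log read from stdin; prints a single-line root cause.
classify_root_cause() {
  awk '
    # Remember the most recent Ansible task name.
    /TASK \[/ {
      task = $0
      sub(/.*TASK \[/, "", task)
      sub(/\].*/, "", task)
    }
    # A Go test failure is reported as soon as one is seen.
    /--- FAIL: / {
      name = $0
      sub(/.*--- FAIL: /, "", name)
      sub(/ .*/, "", name)
      result = "test failure: " name
      exit
    }
    /fatal: .*FAILED!/ {
      if (task ~ /create-vm|Configure VSI/) {
        result = "VM provisioning failure: " task
        exit
      }
      if (task ~ /Pull (non-)?QA images/) {
        pull = task   # may be non-fatal; decide at end of log
        next
      }
      result = "Ansible failure: " task
      exit
    }
    # Tests starting after a pull failure means the pull was not the root cause.
    /Run integration tests/ { if (pull != "") ran = 1 }
    END {
      if (result != "")            print result
      else if (pull != "" && !ran) print "image pull failure: " pull
      else                         print "unclassified"
    }'
}
```

It could be fed from `gh run view RUN_ID --repo REPO --log-failed 2>&1 | classify_root_cause`. An `unclassified` result means a human, or a wider `grep`, still needs to look at the log.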

### Step 5: Detect flakiness

Compare runs of the same workflow on the same branch. A job is flaky if it fails in some runs but passes in others within the time window **and the failure has the same root cause each time**. Track the failure frequency (e.g. "failed 2/5 runs").

A job that fails in multiple runs with **different** root causes (e.g. one run hits a repo mirror issue, another hits a timeout) is NOT flaky — those are separate infrastructure problems. Only flag as flaky when the same failure pattern repeats intermittently.
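
The same-root-cause rule can be expressed as a small decision function. This is a sketch under assumptions: `classify_job` is a hypothetical helper, and its input format (one `conclusion<TAB>root cause` line per run of a single job on a single branch) is invented for illustration.

```bash
# Read "conclusion<TAB>root_cause" lines for one job on one branch
# and print whether the job counts as flaky.
classify_job() {
  awk -F'\t' '
    $1 == "success" { pass++ }
    $1 == "failure" {
      fail++
      if (!($2 in seen)) { seen[$2] = 1; causes++ }
    }
    END {
      if (fail == 0)
        print "healthy"
      else if (causes > 1)
        print "not flaky: distinct root causes"
      else if (pass > 0)
        printf "flaky (failed %d/%d runs)\n", fail, pass + fail
      else
        print "consistently failing"
    }'
}

# Two failures with the same cause among five runs.
printf 'success\t\nfailure\ttimeout\nsuccess\t\nfailure\ttimeout\nsuccess\t\n' | classify_job
```

A job that never passes within the window is reported as consistently failing rather than flaky, which matches the "fails in some runs but passes in others" condition.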

### Step 6: Check previous reports for trends

Look for existing report files in `docs/oncall/`:
```bash
ls -1 docs/oncall/*-ci-report*.md 2>/dev/null | sort -r | head -5
```

If previous reports exist, read them and extract:
- **Pass rate per branch** from their Branch Health Summary tables
- **Action items** from their Action Items sections

Use this to build two trend views:
1. **Pass rate trends** — how each branch's pass rate has changed across reports
2. **Action item tracking** — which items from previous reports are now resolved vs still failing

If no previous reports exist, skip the trends section in the output.
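
Pulling pass rates out of an earlier report could look like the sketch below. It assumes the Branch Health Summary rows follow the template in Step 7, with a linked branch name in the first column and the pass rate in the sixth. `extract_pass_rates` is a hypothetical helper name.

```bash
# Print "branch<TAB>pass rate" for each summary row whose first cell
# is a Markdown link, reading the named files or stdin.
extract_pass_rates() {
  awk -F'|' '
    /^\| *\[/ {
      branch = $2
      rate = $7
      sub(/^ *\[/, "", branch)   # drop leading "[" and padding
      sub(/\].*/, "", branch)    # drop "](url)" and anything after
      gsub(/ /, "", rate)
      print branch "\t" rate
    }' "$@"
}
```

For example, `extract_pass_rates docs/oncall/2026-03-25-ci-report.md` would list one line per branch. Tables whose first column is not a link, such as the trends table, are skipped by the pattern.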

### Step 7: Generate the report

Write the report following this exact structure. Be concise throughout — the report should be readable in under 2 minutes.

**Linking**: Every claim in the report must be independently verifiable. Use the `url` field from the `gh run list` output to link to specific workflow runs. The GitHub Actions filter URL for a branch is `https://github.com/REPO/actions?query=branch%3ABRANCH_NAME`. Include these links so a human reader can click through and verify any data point.

#### Section 1: Action Items

This is the most important section. Put it first. List things needing attention, most urgent first. Each item should include:
- What needs attention and why
- Link to the relevant run(s)
- Classification: regression, flaky, infrastructure, or needs investigation

Example format:
```
- **Regression**: integration-tests failing on master since Mar 24 — NetworkConnection test timeout. [Run #1234](url)
- **Flaky**: k8s-integration-tests on release-3.24 — fails 2/5 runs, ProcessSignal assertion. [Run #1230](url)
- **Investigate**: Konflux build failures on release-3.23 — image pull error. [Run #1228](url)
```

If nothing needs attention: "All clear — no action items."

#### Section 2: Branch Health Summary

One line per branch. Count whole workflow runs only (not individual jobs). Cancelled runs shown separately, excluded from pass rate. Do NOT add a "Skipped" column. Link each branch name to its GitHub Actions filter page.

```
| Branch | Runs | Passed | Failed | Cancelled | Pass Rate |
|--------------|------|--------|--------|-----------|-----------|
| [master](https://github.com/REPO/actions?query=branch%3Amaster) | 12 | 11 | 1 | 0 | 92% |
| [release-3.24](https://github.com/REPO/actions?query=branch%3Arelease-3.24) | 8 | 6 | 2 | 0 | 75% |
```

#### Section 3: Flaky Jobs

Only include this section if flakiness was detected. Link to an example failing run for each entry.

```
| Job | Branch | Fail Rate | Pattern | Example |
|-------------------|--------------|-----------|---------------------|---------|
| NetworkConnection | master | 2/10 | Timeout after 120s | [Run #1234](url) |
```

#### Section 4: Failure Details

Group by root cause where possible. Each entry:
- Branch, workflow, run link
- Failed job name
- One-line root cause

Only include log excerpts when the cause is non-obvious. Keep this section short.

#### Section 5: Trends

Only include this section if previous reports were found in `docs/oncall/`.

**Pass rate trends** — show how each branch's health has changed. Use the dates from previous report filenames as column headers.

```
| Branch | Mar 23 | Mar 24 | Mar 25 (today) |
|--------------|--------|--------|----------------|
| master | 100% | 85% | 92% |
| release-3.24 | 90% | 75% | 75% |
```

**Action item tracking** — compare today's action items against previous reports. For each previous action item, note whether it's resolved, still present, or new.

```
- **Resolved**: NetworkConnection timeout on master (first seen Mar 23, resolved today)
- **Ongoing**: Konflux build failures on release-3.23 (first seen Mar 24, still failing)
- **New**: integration-tests regression on master (first seen today)
```

Keep this concise — only mention items that changed status or have persisted for multiple reports.

#### Section 6: Stats

Reference information at the bottom:
- Date range analyzed
- Total runs across all branches
- Overall pass rate
- Report generated timestamp

### Step 8: Save the report

1. Create the output directory if needed:
```bash
mkdir -p docs/oncall
```

2. Save to `docs/oncall/YYYY-MM-DD-ci-report.md` using today's date.

3. If that file already exists, try `-2`, `-3`, etc. until a unique filename is found.

4. Display the full report content in the terminal as well.
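
The unique-filename rule in items 2 and 3 can be sketched as a loop. The position of the numeric suffix is not spelled out above, so placing it after `ci-report` (which keeps the file matched by the `*-ci-report*.md` glob from Step 6) is an assumption, and `next_report_path` is a hypothetical helper name.

```bash
# Print the first unused report path for the given directory and date.
next_report_path() {
  local dir="$1" day="$2"
  local path="$dir/$day-ci-report.md" n=2
  while [ -e "$path" ]; do
    path="$dir/$day-ci-report-$n.md"   # assumed suffix placement
    n=$((n + 1))
  done
  echo "$path"
}
```

A call such as `next_report_path docs/oncall "$(date +%Y-%m-%d)"` would return the plain filename on the first run of the day and a `-2`, `-3`, ... variant on later runs.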

## Error Handling

- If `gh` is not authenticated, tell the user to run `! gh auth login` (the `!` prefix runs it in the current session).
- If no runs are found in the time window, report that clearly — don't generate an empty report.
- If individual log fetches fail, note the failure and continue with other runs.
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
# CI Oncall Report - 2026-03-25 (Weekly)

## Action Items

- **Infrastructure**: release-3.22 Main CI — RHEL 8 yum repo mirror unavailable (`rhui-rhel-8-for-aarch64-appstream-source-rhui-rpms`), blocking VM provisioning. [Run #23482719045](https://github.com/stackrox/collector/actions/runs/23482719045)
- **Infrastructure**: release-3.22 Main CI — VMs provision OK but tests produce no container log artifacts; `Unarchive logs` fails. GCP Read requests quota exceeded during teardown. [Run #23479589114](https://github.com/stackrox/collector/actions/runs/23479589114)
- **Infrastructure**: release-3.22 Konflux — `wait-for-images` timed out (~90min) waiting for `rhacs-eng/release-collector:3.22.10-2-g781943a68f-fast`. Image never published. [Run #23482719015](https://github.com/stackrox/collector/actions/runs/23482719015)
- **Flaky**: release-3.24 Konflux integration tests — failed 2/4 runs with missing container log artifacts (`Unarchive logs` / `Create Test VMs` failures). Affected architectures: ppc64le (Mar 23), rhel + rhel-arm64 (Mar 24). [Run #23455478297](https://github.com/stackrox/collector/actions/runs/23455478297), [Run #23497306295](https://github.com/stackrox/collector/actions/runs/23497306295)
- **Investigate**: master Konflux — `Store artifacts` step failed in ubuntu-os integration tests (Mar 23). Logs unavailable. Passed on rerun (Mar 24). [Run #23455637467](https://github.com/stackrox/collector/actions/runs/23455637467)

## Branch Health Summary

| Branch | Runs | Passed | Failed | Cancelled | Pass Rate |
|--------|------|--------|--------|-----------|-----------|
| [master](https://github.com/stackrox/collector/actions?query=branch%3Amaster) | 20 | 18 | 1 | 1 | 95% |
| [release-3.22](https://github.com/stackrox/collector/actions?query=branch%3Arelease-3.22) | 4 | 1 | 3 | 0 | 25% |
| [release-3.23](https://github.com/stackrox/collector/actions?query=branch%3Arelease-3.23) | 2 | 2 | 0 | 0 | 100% |
| [release-3.24](https://github.com/stackrox/collector/actions?query=branch%3Arelease-3.24) | 8 | 5 | 2 | 1 | 71% |

Note: 107 skipped `Retest Konflux Builds` runs on master (triggered by `check_run` events) excluded from run counts.

## Flaky Jobs

| Job | Branch | Fail Rate | Pattern | Example |
|-----|--------|-----------|---------|---------|
| Konflux integration tests (various archs) | release-3.24 | 2/4 | Missing container log artifacts, `Unarchive logs` / `Create Test VMs` failure | [Run #23497306295](https://github.com/stackrox/collector/actions/runs/23497306295) |
| Konflux ubuntu-os integration tests | master | 1/2 | `Store artifacts` step failure (logs unavailable) | [Run #23455637467](https://github.com/stackrox/collector/actions/runs/23455637467) |

## Failure Details

### Missing container log artifacts (release-3.22, release-3.24, master)

This is the most common failure pattern across the week. Tests run but produce no `container-logs/*.tar.gz` artifacts, causing the `Unarchive logs` step to fail with `tar: Cannot open: No such file or directory`.

- **release-3.22** Main CI — [Run #23479589114](https://github.com/stackrox/collector/actions/runs/23479589114) (Mar 24, 08:15 UTC)
  - Failed jobs: `rhel`, `rhel-sap`
  - GCP Read requests quota also exceeded during teardown.

- **release-3.24** Konflux — [Run #23455478297](https://github.com/stackrox/collector/actions/runs/23455478297) (Mar 23, 19:12 UTC)
  - Failed job: `ppc64le-integration-tests`. `Create Test VMs` and `Unarchive logs` steps failed.

- **release-3.24** Konflux — [Run #23497306295](https://github.com/stackrox/collector/actions/runs/23497306295) (Mar 24, 15:20 UTC)
  - Failed jobs: `rhel`, `rhel-arm64`. `Unarchive logs` step failed.

- **master** Konflux — [Run #23455637467](https://github.com/stackrox/collector/actions/runs/23455637467) (Mar 23, 19:16 UTC)
  - Failed job: `ubuntu-os`. `Store artifacts` step failed. Logs not available for this run.

### RHEL 8 yum repo mirror failure (release-3.22)

- **Workflow**: Main CI — [Run #23482719045](https://github.com/stackrox/collector/actions/runs/23482719045) (Mar 24, 09:37 UTC)
- **Failed job**: `amd64-integration-tests (rhel)`
- **Cause**: `Create Test VMs` failed — Ansible provisioning hit `Failed to download metadata for repo 'rhui-rhel-8-for-aarch64-appstream-source-rhui-rpms'`.

### Konflux image not published (release-3.22)

- **Workflow**: Test Konflux builds — [Run #23482719015](https://github.com/stackrox/collector/actions/runs/23482719015) (Mar 24, 09:37 UTC)
- **Failed job**: `wait-for-images`
- **Cause**: Timed out (~90min) polling for image `rhacs-eng/release-collector:3.22.10-2-g781943a68f-fast`.

## Trends

### Pass Rate Trends

| Branch | Mar 25 (daily) | Mar 25 (weekly, today) |
|--------|---------------|------------------------|
| master | 100% | 95% |
| release-3.22 | 25% | 25% |
| release-3.23 | 100% | 100% |
| release-3.24 | 80% | 71% |

The daily report (covering Mar 24-25) showed master at 100% because the Mar 23 Konflux failure fell outside its window. The weekly view reveals that failure, dropping master to 95%.

release-3.24 drops from 80% to 71% with the additional Mar 23 ppc64le failure now in scope.

### Action Item Tracking

- **Ongoing**: release-3.22 infrastructure failures (RHEL 8 mirror, GCP quota, Konflux image) — all from Mar 24, still present
- **Ongoing**: release-3.24 Konflux missing container logs — seen in both daily and weekly reports, 2 failures across the week
- **New**: master Konflux `Store artifacts` failure (Mar 23) — not in the daily report, self-resolved on rerun

## Stats

- **Date range**: 2026-03-18 to 2026-03-25
- **Total runs (master/release, non-skipped)**: 34
- **Overall pass rate**: 81% (26/32 non-cancelled)
- **Report generated**: 2026-03-25
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# CI Oncall Report - 2026-03-25

## Action Items

- **Infrastructure**: release-3.22 Main CI — RHEL 8 yum repo mirror unavailable (`rhui-rhel-8-for-aarch64-appstream-source-rhui-rpms`), blocking VM provisioning. [Run #23482719045](https://github.com/stackrox/collector/actions/runs/23482719045)
- **Infrastructure**: release-3.22 Main CI — VMs provision OK but no container logs produced, `Unarchive logs` fails on missing `container-logs/*.tar.gz`. GCP Read requests quota exceeded during teardown. [Run #23479589114](https://github.com/stackrox/collector/actions/runs/23479589114)
- **Infrastructure**: release-3.22 Konflux — `wait-for-images` timed out (~90min) waiting for `rhacs-eng/release-collector:3.22.10-2-g781943a68f-fast`. Image never published. [Run #23482719015](https://github.com/stackrox/collector/actions/runs/23482719015)
- **Flaky**: release-3.24 Konflux integration tests — failed 1/3 runs with missing container log artifacts. [Run #23497306295](https://github.com/stackrox/collector/actions/runs/23497306295)

## Branch Health Summary

| Branch | Runs | Passed | Failed | Cancelled | Pass Rate |
|--------|------|--------|--------|-----------|-----------|
| [master](https://github.com/stackrox/collector/actions?query=branch%3Amaster) | 18 | 17 | 0 | 1 | 100% |
| [release-3.22](https://github.com/stackrox/collector/actions?query=branch%3Arelease-3.22) | 4 | 1 | 3 | 0 | 25% |
| [release-3.23](https://github.com/stackrox/collector/actions?query=branch%3Arelease-3.23) | 2 | 2 | 0 | 0 | 100% |
| [release-3.24](https://github.com/stackrox/collector/actions?query=branch%3Arelease-3.24) | 6 | 4 | 1 | 1 | 80% |

Note: 97 skipped `Retest Konflux Builds` runs on master (triggered by `check_run` events) excluded from run counts.

## Flaky Jobs

| Job | Branch | Fail Rate | Pattern | Example |
|-----|--------|-----------|---------|---------|
| amd64-integration-tests (rhel) | release-3.24 | 1/3 | No container logs produced, tar unarchive fails | [Run #23497306295](https://github.com/stackrox/collector/actions/runs/23497306295) |

## Failure Details

### RHEL 8 yum repo mirror failure (release-3.22)
- **Workflow**: Main collector CI — [Run #23482719045](https://github.com/stackrox/collector/actions/runs/23482719045) (Mar 24, 09:37 UTC)
- **Failed job**: `amd64-integration-tests (rhel) / Testing rhel`
- **Cause**: `Create Test VMs` step failed — Ansible provisioning hit `Failed to download metadata for repo 'rhui-rhel-8-for-aarch64-appstream-source-rhui-rpms'`. `Unarchive logs` also failed (no test artifacts produced).

### Missing test artifacts / GCP quota (release-3.22, release-3.24)
- **Workflow**: Main collector CI — [Run #23479589114](https://github.com/stackrox/collector/actions/runs/23479589114) (Mar 24, 08:15 UTC)
- **Failed jobs**: `rhel`, `rhel-sap`
- **Cause**: VMs provisioned successfully but tests produced no container log artifacts. `Unarchive logs` failed: `tar: container-logs/*.tar.gz: Cannot open: No such file or directory`. GCP `Read requests` quota exceeded during teardown.

- **Workflow**: Test Konflux builds — [Run #23497306295](https://github.com/stackrox/collector/actions/runs/23497306295) (Mar 24, 15:20 UTC)
- **Failed jobs**: `rhel`, `rhel-arm64`
- **Cause**: Same missing container-logs pattern.

### Konflux image not published (release-3.22)
- **Workflow**: Test Konflux builds — [Run #23482719015](https://github.com/stackrox/collector/actions/runs/23482719015) (Mar 24, 09:37 UTC)
- **Failed job**: `wait-for-images`
- **Cause**: Timed out (~90min) polling for image `rhacs-eng/release-collector:3.22.10-2-g781943a68f-fast`. Image never appeared in registry.

## Stats

- **Date range**: 2026-03-24 to 2026-03-25
- **Total runs (master/release only)**: 30 non-skipped (127 including skipped)
- **Overall pass rate**: 86% (24/28 non-skipped, non-cancelled)
- **Report generated**: 2026-03-25
