| 1 | +--- |
| 2 | +name: ci-report |
| 3 | +description: Generate a CI oncall handoff report analyzing GitHub Actions workflow runs on master and release branches. Shows failures, flaky tests, and action items. |
| 4 | +user_invocable: true |
| 5 | +--- |
| 6 | + |
| 7 | +# CI Oncall Report |
| 8 | + |
| 9 | +Generate a concise CI health report for the Collector oncall handler. This report is designed for oncall handoff — lead with action items, keep it tight. |
| 10 | + |
| 11 | +## Arguments |
| 12 | + |
| 13 | +The user may provide a natural language time range as the argument (e.g. "today", "last 3 days", "this week", "since Monday"). Default to "last 24 hours" if no argument is given. Cap at 7 days maximum — if the user requests more, tell them and use 7 days. |
| 14 | + |
| 15 | +Convert the time range to an ISO 8601 date (YYYY-MM-DD) for the `--created` filter. Use `>=` (not `>`) so that "today" includes today's runs. |
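As a rough illustration of the conversion and the 7-day cap, the sketch below uses a hypothetical `since_date` helper. It assumes the natural-language range has already been reduced to a number of days and that GNU `date` is available; neither assumption comes from this skill.

```shell
# Hypothetical helper: clamp a look-back (in days) to the 7-day cap and print
# the YYYY-MM-DD date to place after ">=" in the --created filter.
# Assumes GNU date (-d); BSD/macOS date would need a different invocation.
since_date() {
  local days="${1:-1}"            # default: last 24 hours
  if [ "$days" -gt 7 ]; then
    echo "Requested ${days} days; capping at 7." >&2
    days=7
  fi
  date -u -d "${days} days ago" +%Y-%m-%d
}

since_date 3     # date for "last 3 days"
since_date 30    # warns on stderr, then prints the date seven days ago
```

The result would be used as `--created ">=$(since_date 3)"`.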
| 16 | + |
| 17 | +## Process |
| 18 | + |
| 19 | +Follow these steps in order. Use the Bash tool for all `gh` commands. |
| 20 | + |
| 21 | +### Step 1: Detect the repository |
| 22 | + |
| 23 | +```bash |
| 24 | +gh repo view --json nameWithOwner -q .nameWithOwner |
| 25 | +``` |
| 26 | + |
| 27 | +If this fails, fall back to parsing the git remote: |
| 28 | +```bash |
| 29 | +git remote get-url origin | sed 's|.*github\.com[:/]||;s|\.git$||' |
| 30 | +``` |
| 31 | + |
| 32 | +Store the result as `REPO` for subsequent commands. |
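The two detection paths can be combined into one fallback chain. This is a minimal sketch; the function names `repo_from_remote_url` and `detect_repo` are hypothetical, and the `OWNER/NAME` URLs are placeholders.

```shell
# Hypothetical helper: reduce a GitHub remote URL (SSH or HTTPS) to OWNER/NAME.
repo_from_remote_url() {
  printf '%s\n' "$1" | sed 's|.*github\.com[:/]||;s|\.git$||'
}

# Hypothetical helper: try gh first, then fall back to the git remote.
detect_repo() {
  gh repo view --json nameWithOwner -q .nameWithOwner 2>/dev/null \
    || repo_from_remote_url "$(git remote get-url origin)"
}

repo_from_remote_url "git@github.com:OWNER/NAME.git"   # OWNER/NAME
repo_from_remote_url "https://github.com/OWNER/NAME"   # OWNER/NAME
```

In a session this would be called as `REPO="$(detect_repo)"`.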
| 33 | + |
| 34 | +### Step 2: Fetch all workflow runs |
| 35 | + |
| 36 | +```bash |
| 37 | +gh run list --repo REPO --created ">=YYYY-MM-DD" --limit 500 --json headBranch,status,conclusion,workflowName,databaseId,createdAt,updatedAt,url,event |
| 38 | +``` |
| 39 | + |
| 40 | +If exactly 500 results are returned, warn the user that results may be truncated and suggest narrowing the time window. |
| 41 | + |
| 42 | +Filter the results to only branches matching `^(master|release-\d+\.\d+)$` — this excludes feature branches and sub-branches like `release-3.24/foo`. |
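One way to apply this filter in the shell is sketched below. The helper name `is_report_branch` is hypothetical. Note that `grep -E` does not understand `\d`, so the sketch spells the digit class as `[0-9]`.

```shell
# Hypothetical helper: succeed only for branch names the report covers.
# POSIX ERE (grep -E) has no \d, so digits are written as [0-9].
is_report_branch() {
  printf '%s\n' "$1" | grep -Eq '^(master|release-[0-9]+\.[0-9]+)$'
}

for b in master release-3.24 release-3.24/foo feature/x; do
  if is_report_branch "$b"; then echo "keep $b"; else echo "skip $b"; fi
done
# keep master
# keep release-3.24
# skip release-3.24/foo
# skip feature/x
```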
| 43 | + |
| 44 | +### Step 3: Group and summarize |
| 45 | + |
| 46 | +Count **workflow runs** (not individual jobs within runs). Each entry from `gh run list` is one workflow run. For each branch, count: |
| 47 | +- Passed (conclusion == "success") |
| 48 | +- Failed (conclusion == "failure") |
| 49 | +- Cancelled (reported separately, excluded from pass rate) |
| 50 | + |
| 51 | +Do NOT count skipped jobs or individual job statuses in the summary table — this table is about whole workflow runs only. Individual job failures belong in the Failure Details section. |
| 52 | + |
| 53 | +Calculate pass rate as: passed / (passed + failed) * 100. |
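The formula divides by zero when a branch has only cancelled runs. A minimal sketch of the calculation follows; the helper name `pass_rate` is hypothetical, and printing `n/a` for that case is an assumption, not something this skill specifies.

```shell
# Hypothetical helper: pass rate as a whole-number percentage.
# Cancelled runs are never passed in, so they cannot affect the result.
pass_rate() {
  local passed="$1" failed="$2"
  local total=$(( passed + failed ))
  if [ "$total" -eq 0 ]; then
    echo "n/a"                    # assumption: no completed runs, no rate
    return
  fi
  # Integer arithmetic, rounded to the nearest percent.
  echo "$(( (passed * 100 + total / 2) / total ))%"
}

pass_rate 11 1   # 92%
pass_rate 6 2    # 75%
pass_rate 0 0    # n/a
```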
| 54 | + |
| 55 | +### Step 4: Fetch failure details |
| 56 | + |
| 57 | +For each failed run: |
| 58 | + |
| 59 | +1. Get the failed jobs: |
| 60 | +```bash |
| 61 | +gh run view RUN_ID --repo REPO --json jobs |
| 62 | +``` |
| 63 | + |
| 64 | +2. Get the failed log output and search for the **real root cause**: |
| 65 | +```bash |
| 66 | +gh run view RUN_ID --repo REPO --log-failed 2>&1 | grep -e "FAIL:" -e "fatal:.*FAILED" -e "TASK \[" -e "Configure VSI" -e "Run integration tests" -e "##\[error\]" | grep -v "RETRYING" | head -20 |
| 67 | +``` |
| 68 | + |
| 69 | +**IMPORTANT: Do NOT use `tail` on the log output.** The end of the log is typically git cleanup, artifact upload, and `Unarchive logs` steps — these are post-test housekeeping, not the root cause. The `Unarchive logs` step failing with `tar: container-logs/*.tar.gz: Cannot open` is a **symptom** (tests didn't produce logs), never the root cause. |
| 70 | + |
| 71 | +Instead, search the full log for the actual failure by looking for: |
| 72 | +- `--- FAIL: TestName` — Go test failures (the tests ran and a specific test failed) |
| 73 | +- `fatal: [hostname]: FAILED!` — Ansible task failures (VM provisioning, image pulls, etc.) |
| 74 | +- `TASK [task-name]` lines immediately before `fatal:` lines — identifies which ansible step failed |
| 75 | +- `##[error]` — GitHub Actions step errors |
| 76 | +- Build/compilation errors |
| 77 | + |
| 78 | +3. If you need more context around a specific error, use: |
| 79 | +```bash |
| 80 | +gh run view RUN_ID --repo REPO --log --job JOB_ID 2>&1 | grep -B5 -A10 "FAIL:\|fatal:.*FAILED" | head -50 |
| 81 | +``` |
| 82 | + |
| 83 | +4. Classify the root cause: |
| 84 | +- **Test failure**: A `--- FAIL: TestName` line means the tests ran and failed. Report the test name and the assertion/error message. These are real regressions or flaky tests. |
| 85 | +- **VM provisioning failure**: An ansible `fatal:` on a `create-vm` or `Configure VSI` task means the test environment couldn't be set up. This is infrastructure, not a test problem. |
| 86 | +- **Image pull failure**: An ansible `fatal:` on `Pull non-QA images` or `Pull QA images` could be a non-fatal warning if the tests still ran afterwards. Check whether `Run integration tests` appears later in the log — if it does, the pull failure was not the root cause. |
| 87 | +- **Build failure**: A compilation error in the build step. Report the file and error. |
| 88 | + |
| 89 | +5. Summarize the root cause in one line, naming the specific test or ansible task that failed. |
| 90 | + |
| 91 | +If log fetching fails for a specific run, note it and continue with other runs. |
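The image pull check from the classification step could be automated roughly as follows. This is a sketch under assumptions: the helper name `tests_ran_after_pull_failure` is hypothetical, and the exact shape of the `TASK [...]` and `fatal:` lines is inferred from the patterns listed above rather than from real logs.

```shell
# Hypothetical helper: read a job log on stdin. Exit status 0 means a pull task
# failed but "Run integration tests" appeared later, so the pull failure was
# not the root cause. Any other case exits 1.
tests_ran_after_pull_failure() {
  awk '
    /TASK \[.*Pull (non-)?QA images/      { in_pull = 1 }  # entering a pull task
    /TASK \[/ && !/Pull (non-)?QA images/ { in_pull = 0 }  # any other task
    in_pull && /fatal:.*FAILED/           { pull_failed = 1 }
    pull_failed && /Run integration tests/ { ran = 1 }
    END { exit (pull_failed && ran) ? 0 : 1 }
  '
}

# Usage (not executed here):
#   gh run view RUN_ID --repo REPO --log-failed 2>&1 | tests_ran_after_pull_failure
```

If the function succeeds, keep searching the log for the `--- FAIL:` or later `fatal:` line that actually ended the run.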
| 92 | + |
| 93 | +### Step 5: Detect flakiness |
| 94 | + |
| 95 | +Compare runs of the same workflow on the same branch. A job is flaky if it fails in some runs but passes in others within the time window **and the failure has the same root cause each time**. Track the failure frequency (e.g. "failed 2/5 runs"). |
| 96 | + |
| 97 | +A job that fails in multiple runs with **different** root causes (e.g. one run hits a repo mirror issue, another hits a timeout) is NOT flaky — those are separate infrastructure problems. Only flag as flaky when the same failure pattern repeats intermittently. |
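The flakiness rule can be expressed as a small classifier. The sketch below is illustrative only: the helper name `classify_flaky` and the tab-separated input format (one line per run, conclusion then one-line root cause) are assumptions made for the example.

```shell
# Hypothetical helper: classify one job on one branch.
# stdin: one line per run, "<conclusion><TAB><root cause>" (cause empty on success).
classify_flaky() {
  awk -F'\t' '
    $1 == "success" { passed++ }
    $1 == "failure" { failed++; if (!($2 in causes)) { causes[$2] = 1; distinct++ } }
    END {
      if (passed > 0 && failed > 0 && distinct == 1)
        printf "flaky: failed %d/%d runs\n", failed, passed + failed
      else if (failed > 0 && distinct > 1)
        print "not flaky: " distinct " distinct root causes"
      else
        print "not flaky"
    }
  '
}

printf 'success\t\nfailure\ttimeout\nsuccess\t\nfailure\ttimeout\nsuccess\t\n' | classify_flaky
# flaky: failed 2/5 runs
printf 'failure\trepo mirror\nfailure\ttimeout\nsuccess\t\n' | classify_flaky
# not flaky: 2 distinct root causes
```

A job that fails every run with the same cause also comes out as "not flaky", which matches treating it as a regression rather than an intermittent failure.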
| 98 | + |
| 99 | +### Step 6: Check previous reports for trends |
| 100 | + |
| 101 | +Look for existing report files in `docs/oncall/`: |
| 102 | +```bash |
| 103 | +ls -1 docs/oncall/*-ci-report*.md 2>/dev/null | sort -r | head -5 |
| 104 | +``` |
| 105 | + |
| 106 | +If previous reports exist, read them and extract: |
| 107 | +- **Pass rate per branch** from their Branch Health Summary tables |
| 108 | +- **Action items** from their Action Items sections |
| 109 | + |
| 110 | +Use this to build two trend views: |
| 111 | +1. **Pass rate trends** — how each branch's pass rate has changed across reports |
| 112 | +2. **Action item tracking** — which items from previous reports are now resolved vs still failing |
| 113 | + |
| 114 | +If no previous reports exist, skip the trends section in the output. |
| 115 | + |
| 116 | +### Step 7: Generate the report |
| 117 | + |
| 118 | +Write the report following this exact structure. Be concise throughout — the report should be readable in under 2 minutes. |
| 119 | + |
| 120 | +**Linking**: Every claim in the report must be independently verifiable. Use the `url` field from the `gh run list` output to link to specific workflow runs. The GitHub Actions filter URL for a branch is `https://github.com/REPO/actions?query=branch%3ABRANCH_NAME`. Include these links so a human reader can click through and verify any data point. |
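Building the branch filter link is a plain string substitution. In the sketch below the helper name `branch_actions_url` is hypothetical and `OWNER/NAME` is a placeholder for the detected repository.

```shell
# Hypothetical helper: Actions filter URL for a branch.
# %% in the printf format emits a literal %, giving the encoded colon %3A.
branch_actions_url() {
  printf 'https://github.com/%s/actions?query=branch%%3A%s\n' "$1" "$2"
}

branch_actions_url OWNER/NAME release-3.24
# https://github.com/OWNER/NAME/actions?query=branch%3Arelease-3.24
```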
| 121 | + |
| 122 | +#### Section 1: Action Items |
| 123 | + |
| 124 | +This is the most important section. Put it first. List things needing attention, most urgent first. Each item should include: |
| 125 | +- What needs attention and why |
| 126 | +- Link to the relevant run(s) |
| 127 | +- Classification: regression, flaky, infrastructure, or needs investigation |
| 128 | + |
| 129 | +Example format: |
| 130 | +``` |
| 131 | +- **Regression**: integration-tests failing on master since Mar 24 — NetworkConnection test timeout. [Run #1234](url) |
| 132 | +- **Flaky**: k8s-integration-tests on release-3.24 — fails 2/5 runs, ProcessSignal assertion. [Run #1230](url) |
| 133 | +- **Investigate**: Konflux build failures on release-3.23 — image pull error. [Run #1228](url) |
| 134 | +``` |
| 135 | + |
| 136 | +If nothing needs attention: "All clear — no action items." |
| 137 | + |
| 138 | +#### Section 2: Branch Health Summary |
| 139 | + |
| 140 | +One line per branch. Count whole workflow runs only (not individual jobs). Cancelled runs shown separately, excluded from pass rate. Do NOT add a "Skipped" column. Link each branch name to its GitHub Actions filter page. |
| 141 | + |
| 142 | +``` |
| 143 | +| Branch | Runs | Passed | Failed | Cancelled | Pass Rate | |
| 144 | +|--------------|------|--------|--------|-----------|-----------| |
| 145 | +| [master](https://github.com/REPO/actions?query=branch%3Amaster) | 12 | 11 | 1 | 0 | 92% | |
| 146 | +| [release-3.24](https://github.com/REPO/actions?query=branch%3Arelease-3.24) | 8 | 6 | 2 | 0 | 75% | |
| 147 | +``` |
| 148 | + |
| 149 | +#### Section 3: Flaky Jobs |
| 150 | + |
| 151 | +Only include this section if flakiness was detected. Link to an example failing run for each entry. |
| 152 | + |
| 153 | +``` |
| 154 | +| Job | Branch | Fail Rate | Pattern | Example | |
| 155 | +|-------------------|--------------|-----------|---------------------|---------| |
| 156 | +| NetworkConnection | master | 2/10 | Timeout after 120s | [Run #1234](url) | |
| 157 | +``` |
| 158 | + |
| 159 | +#### Section 4: Failure Details |
| 160 | + |
| 161 | +Group by root cause where possible. Each entry: |
| 162 | +- Branch, workflow, run link |
| 163 | +- Failed job name |
| 164 | +- One-line root cause |
| 165 | + |
| 166 | +Only include log excerpts when the cause is non-obvious. Keep this section short. |
| 167 | + |
| 168 | +#### Section 5: Trends |
| 169 | + |
| 170 | +Only include this section if previous reports were found in `docs/oncall/`. |
| 171 | + |
| 172 | +**Pass rate trends** — show how each branch's health has changed. Use the dates from previous report filenames as column headers. |
| 173 | + |
| 174 | +``` |
| 175 | +| Branch | Mar 23 | Mar 24 | Mar 25 (today) | |
| 176 | +|--------------|--------|--------|----------------| |
| 177 | +| master | 100% | 85% | 92% | |
| 178 | +| release-3.24 | 90% | 75% | 75% | |
| 179 | +``` |
| 180 | + |
| 181 | +**Action item tracking**: compare today's action items against previous reports. Mark each previous action item as resolved or ongoing, and mark items that appear for the first time today as new. |
| 182 | + |
| 183 | +``` |
| 184 | +- **Resolved**: NetworkConnection timeout on master (first seen Mar 23, resolved today) |
| 185 | +- **Ongoing**: Konflux build failures on release-3.23 (first seen Mar 24, still failing) |
| 186 | +- **New**: integration-tests regression on master (first seen today) |
| 187 | +``` |
| 188 | + |
| 189 | +Keep this concise — only mention items that changed status or have persisted for multiple reports. |
| 190 | + |
| 191 | +#### Section 6: Stats |
| 192 | + |
| 193 | +Reference information at the bottom: |
| 194 | +- Date range analyzed |
| 195 | +- Total runs across all branches |
| 196 | +- Overall pass rate |
| 197 | +- Report generated timestamp |
| 198 | + |
| 199 | +### Step 8: Save the report |
| 200 | + |
| 201 | +1. Create the output directory if needed: |
| 202 | +```bash |
| 203 | +mkdir -p docs/oncall |
| 204 | +``` |
| 205 | + |
| 206 | +2. Save to `docs/oncall/YYYY-MM-DD-ci-report.md` using today's date. |
| 207 | + |
| 208 | +3. If that file already exists, append `-2`, `-3`, etc. before the extension (e.g. `YYYY-MM-DD-ci-report-2.md`) until a unique filename is found. |
| 209 | + |
| 210 | +4. Display the full report content in the terminal as well. |
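The filename search in the save steps can be sketched as a loop. The helper name `next_report_path` is hypothetical, and placing the numeric suffix before `.md` is an assumption chosen so the file still matches the `*-ci-report*.md` glob used when looking for previous reports.

```shell
# Hypothetical helper: print the first unused report path for a given date.
# Assumes the -2, -3, ... suffix goes before the .md extension.
next_report_path() {
  local dir="$1" day="$2"
  local path="${dir}/${day}-ci-report.md"
  local n=2
  while [ -e "$path" ]; do
    path="${dir}/${day}-ci-report-${n}.md"
    n=$(( n + 1 ))
  done
  echo "$path"
}

mkdir -p docs/oncall
next_report_path docs/oncall "$(date +%Y-%m-%d)"
```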
| 211 | + |
| 212 | +## Error Handling |
| 213 | + |
| 214 | +- If `gh` is not authenticated, tell the user to run `! gh auth login` (the `!` prefix runs it in the current session). |
| 215 | +- If no runs are found in the time window, report that clearly — don't generate an empty report. |
| 216 | +- If individual log fetches fail, note the failure and continue with other runs. |