Commit 356b61f
Add CI build caching and improve benchmark workflow (#1148)
* Add CI build caching for GitHub-hosted and self-hosted HPC runners
GitHub-hosted runners: Add actions/cache@v4 to test.yml and coverage.yml,
caching the build/ directory keyed by matrix config and source file hashes.
Partial cache hits via restore-keys enable incremental builds.
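As a rough illustration of a key "keyed by matrix config and source file hashes", the sketch below derives a key in Python. The function name, the key layout, and the choice of SHA-256 are assumptions made for illustration; the workflow itself presumably builds its key with GitHub Actions expressions.

```python
import hashlib
from pathlib import Path


def build_cache_key(matrix_config: dict, source_files: list[Path]) -> str:
    """Return a key that changes when the matrix config or any source file changes.

    Hypothetical helper: the key layout is an assumption, not the workflow's format.
    """
    digest = hashlib.sha256()
    # Sort so the key does not depend on the order files were listed in.
    for path in sorted(source_files):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    config_part = "-".join(f"{k}={matrix_config[k]}" for k in sorted(matrix_config))
    return f"build-{config_part}-{digest.hexdigest()[:16]}"
```

Any edit to a hashed file or to a matrix value yields a different key, which is what forces a fresh build when no matching cache entry exists.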
Self-hosted HPC runners (Phoenix, Frontier, Frontier AMD): Add a persistent
build cache that symlinks build/ to $HOME/scratch/.mfc-ci-cache/<config>/build.
This ensures cached artifacts persist across CI runs regardless of which
runner instance picks up the job. Key details:
- Cross-runner workspace path fixup via sed on CMake files
- flock-based locking prevents concurrent builds from corrupting the cache
- Retry logic uses targeted rm (staging/install only) instead of mfc.sh clean
- Phoenix releases the lock after build, before tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
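The flock-based locking described above can be sketched in Python with `fcntl.flock`. This is only an illustration under assumptions: the CI scripts use the `flock` command in shell, and the function name, timeout, and polling interval here are hypothetical.

```python
import fcntl
import os
import time
from contextlib import contextmanager


@contextmanager
def cache_lock(lock_path: str, timeout_s: float = 5.0, poll_s: float = 0.1):
    """Yield True if the exclusive lock was acquired, False if the wait timed out."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    acquired = False
    deadline = time.monotonic() + timeout_s
    try:
        while True:
            try:
                # Non-blocking attempt so the caller can enforce its own timeout.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                acquired = True
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    break
                time.sleep(poll_s)
        yield acquired
    finally:
        # Closing the descriptor releases the lock; no explicit unlock is needed.
        os.close(fd)
```

A caller that receives `False` should build outside the shared cache instead of writing into it without the lock.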
* Fix race conditions and cleanup in build cache
- Only remove build/staging (not build/install) on retry, so concurrent
test jobs reading installed binaries are not disrupted
- Remove stale symlink in lock-timeout fallback path to prevent writing
into the shared cache without holding the lock
- Remove redundant flock --unlock (closing fd is sufficient)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix stale retry log messages
The echo said "Clearing staging/install" but build/install is
intentionally preserved to avoid disrupting concurrent test jobs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Disable git clean on self-hosted runners to preserve build cache
actions/checkout@v4 defaults to clean: true, which runs git clean -ffdx.
This follows the build/ symlink into the shared cache directory and
deletes all cached artifacts (staging, install, venv), defeating the
purpose of the persistent cache and causing SIGILL errors from partially
destroyed build artifacts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Skip build cache for benchmarks and fix benchmark trigger logic
Benchmarks build PR and master in parallel — sharing a cache key causes
collisions. Skip cache setup when run_bench=="bench" so each benchmark
builds from scratch.
Also make three changes to the benchmark workflow trigger:
- Cross-repo PRs don't populate pull_requests[]; fall back to searching
by head SHA so the PR author is correctly detected.
- Only count approvals from users with write/maintain/admin permission,
filtering out AI bot approvals (Copilot, Qodo).
- Remove wilfonba auto-run; only sbryngelson auto-runs benchmarks.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
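A minimal sketch of the approval filter, assuming reviews and permissions have already been fetched and flattened into plain Python structures. The data shapes, the function name, and the usernames in the usage note are hypothetical; only the write/maintain/admin rule comes from the message above.

```python
TRUSTED_PERMISSIONS = {"write", "maintain", "admin"}


def count_trusted_approvals(reviews, permissions):
    """Count distinct approvers whose repository permission is trusted.

    reviews: iterable of dicts with "user" and "state" keys (assumed shape).
    permissions: mapping of username to permission string (assumed shape).
    """
    approvers = {
        review["user"]
        for review in reviews
        if review["state"] == "APPROVED"
        # Unknown users, including bots, fall back to "none" and are ignored.
        and permissions.get(review["user"], "none") in TRUSTED_PERMISSIONS
    }
    return len(approvers)
```

With this rule, an approval from a bot account that only has read access does not count toward triggering a benchmark run.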
* Fix cross-runner cache by updating install/ config paths
When the cache moves between runner instances (e.g. actions-runner-6 to
actions-runner-1), the sed path replacement only updated staging/ CMake
files. Config files in install/ (.pc, .cmake) still had the old runner
path, causing silo/HDF5 to link against nonexistent paths and h5dump to
fail on all tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Delete install/ on workspace path change to fix stale binaries
Updating .pc and .cmake config files with sed is insufficient — the MFC
executables (simulation, pre_process, post_process) and static libraries
have the old runner workspace path baked in at compile time. When the
cache moves between runner instances, these binaries fail at runtime.
Replace the install/ sed fix with rm -rf install/ so CMake re-links and
re-installs all binaries with correct paths. The staging/ object files
remain valid, so this is a re-link, not a full rebuild.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Simplify build cache to per-runner directories
Replace the shared cache (with flock, sed path fixups, and workspace
tracking) with per-runner caches keyed by RUNNER_NAME. Each runner
always uses the same workspace path, so CMake's absolute paths are
always correct — no cross-runner path issues, no locking needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
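The per-runner layout can be sketched as follows. The directory structure `<cache_root>/<runner_name>/<config>/build` and the function name are assumptions for illustration; the message above only states that caches are keyed by RUNNER_NAME.

```python
from pathlib import Path


def link_build_to_runner_cache(
    workspace: Path, cache_root: Path, config: str, runner_name: str
) -> Path:
    """Point workspace/build at a cache directory used by exactly one runner."""
    # Assumed layout: one subtree per runner, then one per build configuration.
    target = cache_root / runner_name / config / "build"
    target.mkdir(parents=True, exist_ok=True)
    link = workspace / "build"
    if link.is_symlink():
        link.unlink()  # drop a stale link without touching its target
    elif link.exists():
        raise FileExistsError(f"{link} is a real directory; refusing to replace it")
    link.symlink_to(target, target_is_directory=True)
    return target
```

Because each runner only ever sees its own subtree, two runners never write to the same build directory, which is why the locking and path fixups are no longer needed.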
* Remove restore-keys prefix fallback from GH-hosted build cache
The prefix fallback can restore a cache built on a runner with AVX-512
onto a runner without it, causing SIGILL in Chemistry tests. Without
restore-keys, only exact key matches are used — source changes trigger
a full rebuild but binaries are always compatible with the runner.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
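The difference between exact matching and prefix fallback can be modelled with a small lookup function. This is a simplified model under assumptions, not the actions/cache implementation: the store is a plain dict, and insertion order stands in for "most recently created".

```python
def restore_cache(store, key, restore_keys=()):
    """Return (matched_key, payload), or None when nothing matches.

    Simplified model: exact key first, then each prefix in order.
    """
    if key in store:
        return key, store[key]
    for prefix in restore_keys:
        matches = [k for k in store if k.startswith(prefix)]
        if matches:
            # Dicts preserve insertion order, so the last match is the newest entry.
            chosen = matches[-1]
            return chosen, store[chosen]
    return None
```

With an empty `restore_keys`, a changed key returns `None` and the build starts from scratch; with a prefix, the same lookup returns binaries produced under a different key, possibly on a different CPU.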
* Make benchmark pipeline robust to transient GPU failures
Add three layers of defense against transient failures (e.g. ROCm
HSA_STATUS_ERROR_INVALID_ARGUMENT) tanking the entire benchmark:
1. Retry failed cases once (5s delay) before marking as failed
2. Always write partial results YAML before raising on failure
3. CI scripts warn on non-zero exit instead of aborting, and
bench.yml runs diff() via `if: always()` so partial results
are still compared
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
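The first two layers (retry once, then always write partial results before raising) might look roughly like the sketch below. It is not the actual `bench()` code: the function and parameter names are hypothetical, and JSON replaces YAML so the example needs only the standard library.

```python
import json
import time


def run_benchmarks(cases, run_case, results_path, retry_delay_s=5.0, sleep=time.sleep):
    """Run each case, retrying a failure once, and always persist what succeeded."""
    results, failed = {}, []
    for name in cases:
        for attempt in (1, 2):
            try:
                results[name] = run_case(name)
                break
            except Exception:
                if attempt == 1:
                    sleep(retry_delay_s)  # brief pause before the single retry
                else:
                    failed.append(name)
    # Write partial results first so a later comparison step still has data.
    with open(results_path, "w") as fh:
        json.dump(results, fh, indent=2)
    if failed:
        raise RuntimeError(f"benchmark cases failed after retry: {failed}")
    return results
```

Passing `sleep` as a parameter keeps the delay configurable and lets a test skip the wait.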
* Suppress pylint too-many-nested-blocks for bench()
The retry loop adds nesting depth beyond pylint's default limit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix rm -rf following build symlink into shared cache
rm -rf on a symlink follows it and deletes the target's contents,
which fails when another runner is using the shared cache directory.
Use unlink for symlinks, rm -rf only for real directories.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
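The "unlink for symlinks, rm -rf only for real directories" rule translates to Python as in the sketch below. The CI scripts do this in shell; the function name here is hypothetical.

```python
import os
import shutil


def remove_build_dir(path: str) -> None:
    """Remove a build directory without ever deleting a symlink's target."""
    if os.path.islink(path):
        os.unlink(path)  # removes only the link itself
    elif os.path.isdir(path):
        shutil.rmtree(path)  # a real directory: safe to delete recursively
```

Checking `islink` first matters because `os.path.isdir` also returns true for a symlink that points at a directory.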
* Detect stale cached binaries and include install/ in retry cleanup
Phoenix compute nodes may have different CPU architectures, causing
SIGILL when running binaries cached from a different node. After a
successful build, smoke-test syscheck to detect stale installs and
trigger a full rebuild.
Also include build/install in retry cleanup for all clusters. With
per-runner caching there are no concurrent readers sharing the same
cache directory, so clearing install is safe.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
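One way to picture the smoke test is a helper that runs a cached binary and classifies the outcome. This is a sketch under assumptions: the function name and return values are hypothetical, and the real check is a shell step that runs syscheck.

```python
import signal
import subprocess


def smoke_test(cmd, timeout_s=60):
    """Classify a command as "ok", "sigill", or "failed"."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "failed"  # missing binary, permission problem, or hang
    # On POSIX, a negative return code means the child was killed by that signal.
    if proc.returncode == -signal.SIGILL:
        return "sigill"
    return "ok" if proc.returncode == 0 else "failed"
```

Any result other than `"ok"` would be the cue to clear the cached staging and install directories and rebuild from scratch.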
* Fix benchmark PR detection for cross-fork workflow_run events
workflow_run events for cross-fork PRs don't populate pull_requests[]
and may report the base branch SHA instead of the PR head SHA, causing
the SHA-based PR lookup to fail. Add a fallback that searches by
branch name so benchmarks auto-trigger for fork PRs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
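The resulting lookup order (pull_requests[], then head SHA, then branch name) can be sketched as a pure function. The flattened dictionary shapes and the function name are assumptions for illustration; the real workflow queries the GitHub API.

```python
def find_pr_number(run, open_prs):
    """Resolve the PR for a workflow_run payload using three lookups in order.

    run: dict with optional "pull_requests", "head_sha", "head_branch" keys.
    open_prs: list of dicts with "number", "head_sha", "head_branch" (assumed shape).
    """
    # 1. Same-repo PRs: the event payload already lists the PR.
    if run.get("pull_requests"):
        return run["pull_requests"][0]["number"]
    # 2. Cross-repo PRs: match on the head commit SHA.
    for pr in open_prs:
        if pr["head_sha"] == run.get("head_sha"):
            return pr["number"]
    # 3. Last resort: match on the branch name when the SHA is not the PR head.
    for pr in open_prs:
        if pr["head_branch"] == run.get("head_branch"):
            return pr["number"]
    return None
```

Branch names are not unique across forks, so the third lookup is only a best-effort fallback after the more specific matches have failed.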
* Move build cache from scratch to coda1 project storage
Scratch quota was filling up and causing build failures.
coda1 project storage has more space for persistent caches.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ad35e0e commit 356b61f
File tree
10 files changed (+221, -93 lines)
- .github
  - scripts
  - workflows
    - frontier_amd
    - frontier
    - phoenix
- toolchain/mfc