Skip to content

Commit 356b61f

Browse files
sbryngelsonclaude
andauthored
Add CI build caching and improve benchmark workflow (#1148)
* Add CI build caching for GitHub-hosted and self-hosted HPC runners GitHub-hosted runners: Add actions/cache@v4 to test.yml and coverage.yml, caching the build/ directory keyed by matrix config and source file hashes. Partial cache hits via restore-keys enable incremental builds. Self-hosted HPC runners (Phoenix, Frontier, Frontier AMD): Add a persistent build cache that symlinks build/ to $HOME/scratch/.mfc-ci-cache/<config>/build. This ensures cached artifacts persist across CI runs regardless of which runner instance picks up the job. Key details: - Cross-runner workspace path fixup via sed on CMake files - flock-based locking prevents concurrent builds from corrupting the cache - Retry logic uses targeted rm (staging/install only) instead of mfc.sh clean - Phoenix releases the lock after build, before tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix race conditions and cleanup in build cache - Only remove build/staging (not build/install) on retry, so concurrent test jobs reading installed binaries are not disrupted - Remove stale symlink in lock-timeout fallback path to prevent writing into the shared cache without holding the lock - Remove redundant flock --unlock (closing fd is sufficient) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix stale retry log messages The echo said "Clearing staging/install" but build/install is intentionally preserved to avoid disrupting concurrent test jobs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Disable git clean on self-hosted runners to preserve build cache actions/checkout@v4 defaults to clean: true, which runs git clean -ffdx. This follows the build/ symlink into the shared cache directory and deletes all cached artifacts (staging, install, venv), defeating the purpose of the persistent cache and causing SIGILL errors from partially destroyed build artifacts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Skip build cache for benchmarks and fix benchmark trigger logic Benchmarks build PR and master in parallel — sharing a cache key causes collisions. Skip cache setup when run_bench=="bench" so each benchmark builds from scratch. Also fix two issues in the benchmark workflow trigger: - Cross-repo PRs don't populate pull_requests[]; fall back to searching by head SHA so the PR author is correctly detected. - Only count approvals from users with write/maintain/admin permission, filtering out AI bot approvals (Copilot, Qodo). - Remove wilfonba auto-run; only sbryngelson auto-runs benchmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix cross-runner cache by updating install/ config paths When the cache moves between runner instances (e.g. actions-runner-6 to actions-runner-1), the sed path replacement only updated staging/ CMake files. Config files in install/ (.pc, .cmake) still had the old runner path, causing silo/HDF5 to link against nonexistent paths and h5dump to fail on all tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Delete install/ on workspace path change to fix stale binaries Updating .pc and .cmake config files with sed is insufficient — the MFC executables (simulation, pre_process, post_process) and static libraries have the old runner workspace path baked in at compile time. When the cache moves between runner instances, these binaries fail at runtime. Replace the install/ sed fix with rm -rf install/ so CMake re-links and re-installs all binaries with correct paths. The staging/ object files remain valid, so this is a re-link, not a full rebuild. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Simplify build cache to per-runner directories Replace the shared cache (with flock, sed path fixups, and workspace tracking) with per-runner caches keyed by RUNNER_NAME. Each runner always uses the same workspace path, so CMake's absolute paths are always correct — no cross-runner path issues, no locking needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Remove restore-keys prefix fallback from GH-hosted build cache The prefix fallback can restore a cache built on a runner with AVX-512 onto a runner without it, causing SIGILL in Chemistry tests. Without restore-keys, only exact key matches are used — source changes trigger a full rebuild but binaries are always compatible with the runner. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Make benchmark pipeline robust to transient GPU failures Add three layers of defense against transient failures (e.g. ROCm HSA_STATUS_ERROR_INVALID_ARGUMENT) tanking the entire benchmark: 1. Retry failed cases once (5s delay) before marking as failed 2. Always write partial results YAML before raising on failure 3. CI scripts warn on non-zero exit instead of aborting, and bench.yml runs diff() via `if: always()` so partial results are still compared Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Suppress pylint too-many-nested-blocks for bench() The retry loop adds nesting depth beyond pylint's default limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix rm -rf following build symlink into shared cache rm -rf on a symlink follows it and deletes the target's contents, which fails when another runner is using the shared cache directory. Use unlink for symlinks, rm -rf only for real directories. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Detect stale cached binaries and include install/ in retry cleanup Phoenix compute nodes may have different CPU architectures, causing SIGILL when running binaries cached from a different node. After a successful build, smoke-test syscheck to detect stale installs and trigger a full rebuild. Also include build/install in retry cleanup for all clusters. With per-runner caching there are no concurrent readers sharing the same cache directory, so clearing install is safe. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix benchmark PR detection for cross-fork workflow_run events workflow_run events for cross-fork PRs don't populate pull_requests[] and may report the base branch SHA instead of the PR head SHA, causing the SHA-based PR lookup to fail. Add a fallback that searches by branch name so benchmarks auto-trigger for fork PRs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Move build cache from scratch to coda1 project storage Scratch quota was filling up and causing build failures. coda1 project storage has more space for persistent caches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ad35e0e commit 356b61f

File tree

10 files changed

+221
-93
lines changed

10 files changed

+221
-93
lines changed

.github/scripts/run_parallel_benchmarks.sh

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -52,16 +52,16 @@ else
5252
echo "Master job completed successfully"
5353
fi
5454

55-
# Check if either job failed
55+
# Warn if either job failed (partial results may still be usable)
5656
if [ "${pr_exit}" -ne 0 ] || [ "${master_exit}" -ne 0 ]; then
57-
echo "ERROR: One or both benchmark jobs failed: pr_exit=${pr_exit}, master_exit=${master_exit}"
58-
exit 1
57+
echo "WARNING: Benchmark jobs had failures: pr=${pr_exit}, master=${master_exit}"
58+
echo "Checking for partial results..."
59+
else
60+
echo "=========================================="
61+
echo "Both benchmark jobs completed successfully!"
62+
echo "=========================================="
5963
fi
6064

61-
echo "=========================================="
62-
echo "Both benchmark jobs completed successfully!"
63-
echo "=========================================="
64-
6565
# Final verification that output files exist before proceeding
6666
pr_yaml="pr/bench-${device}-${interface}.yaml"
6767
master_yaml="master/bench-${device}-${interface}.yaml"
Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
#!/bin/bash
2+
# Sets up a persistent build cache for self-hosted CI runners.
3+
# Creates a symlink: ./build -> /storage/coda1/d-coc/0/sbryngelson3/.mfc-ci-cache/<key>/build
4+
#
5+
# Each runner gets its own cache keyed by (cluster, device, interface, runner).
6+
# This avoids cross-runner path issues entirely — CMake's absolute paths are
7+
# always correct because the same runner always uses the same workspace path.
8+
#
9+
# Usage: source .github/scripts/setup-build-cache.sh <cluster> <device> <interface>
10+
11+
_cache_cluster="${1:?Usage: setup-build-cache.sh <cluster> <device> <interface>}"
12+
_cache_device="${2:?}"
13+
_cache_interface="${3:-none}"
14+
_cache_runner="${RUNNER_NAME:?RUNNER_NAME not set}"
15+
16+
_cache_key="${_cache_cluster}-${_cache_device}-${_cache_interface}-${_cache_runner}"
17+
_cache_base="/storage/coda1/d-coc/0/sbryngelson3/.mfc-ci-cache/${_cache_key}/build"
18+
19+
mkdir -p "$_cache_base"
20+
_cache_dir="$(cd "$_cache_base" && pwd -P)"
21+
22+
echo "=== Build Cache Setup ==="
23+
echo " Cache key: $_cache_key"
24+
echo " Cache dir: $_cache_dir"
25+
26+
# Replace any existing build/ (real dir or stale symlink) with a symlink
27+
# to our runner-specific cache directory.
28+
# Use unlink for symlinks to avoid rm -rf following the link and deleting
29+
# the shared cache contents (which another runner may be using).
30+
if [ -L "build" ]; then
31+
unlink "build"
32+
elif [ -e "build" ]; then
33+
rm -rf "build"
34+
fi
35+
36+
ln -s "$_cache_dir" "build"
37+
38+
echo " Symlink: build -> $_cache_dir"
39+
echo "========================="

.github/scripts/submit_and_monitor_bench.sh

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -37,9 +37,13 @@ fi
3737
echo "[$dir] Job ID: $job_id, monitoring output file: $output_file"
3838

3939
# Use the monitoring script from PR (where this script lives)
40-
bash "${SCRIPT_DIR}/monitor_slurm_job.sh" "$job_id" "$output_file"
41-
42-
echo "[$dir] Monitoring complete for job $job_id"
40+
monitor_exit=0
41+
bash "${SCRIPT_DIR}/monitor_slurm_job.sh" "$job_id" "$output_file" || monitor_exit=$?
42+
if [ "$monitor_exit" -ne 0 ]; then
43+
echo "[$dir] WARNING: SLURM job exited with code $monitor_exit"
44+
else
45+
echo "[$dir] Monitoring complete for job $job_id"
46+
fi
4347

4448
# Verify the YAML output file was created
4549
yaml_file="${job_slug}.yaml"

.github/workflows/bench.yml

Lines changed: 31 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -46,21 +46,42 @@ jobs:
4646
else
4747
# Get PR number from workflow_run
4848
PR_NUMBER="${{ github.event.workflow_run.pull_requests[0].number }}"
49+
if [ -z "$PR_NUMBER" ]; then
50+
# Cross-repo PRs don't populate pull_requests[]. Search by head SHA.
51+
HEAD_SHA="${{ github.event.workflow_run.head_sha }}"
52+
PR_NUMBER=$(gh api "repos/${{ github.repository }}/pulls?state=open&sort=updated&direction=desc&per_page=30" \
53+
--jq ".[] | select(.head.sha == \"$HEAD_SHA\") | .number" | head -1)
54+
fi
55+
if [ -z "$PR_NUMBER" ]; then
56+
# workflow_run may report the merge/base SHA for forks. Fall back to branch name.
57+
HEAD_BRANCH="${{ github.event.workflow_run.head_branch }}"
58+
if [ -n "$HEAD_BRANCH" ] && [ "$HEAD_BRANCH" != "master" ]; then
59+
PR_NUMBER=$(gh api "repos/${{ github.repository }}/pulls?state=open&sort=updated&direction=desc&per_page=30" \
60+
--jq ".[] | select(.head.ref == \"$HEAD_BRANCH\") | .number" | head -1)
61+
fi
62+
fi
63+
4964
if [ -n "$PR_NUMBER" ]; then
5065
echo "pr_number=$PR_NUMBER" >> $GITHUB_OUTPUT
5166
5267
# Fetch actual PR author from API (workflow_run.actor is the re-runner, not PR author)
5368
PR_AUTHOR=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER --jq '.user.login')
5469
echo "author=$PR_AUTHOR" >> $GITHUB_OUTPUT
5570
56-
# Check if PR is approved
57-
APPROVED=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/reviews \
58-
--jq '[.[] | select(.state == "APPROVED")] | length')
59-
if [ "$APPROVED" -gt 0 ]; then
60-
echo "approved=true" >> $GITHUB_OUTPUT
61-
else
62-
echo "approved=false" >> $GITHUB_OUTPUT
63-
fi
71+
# Check if PR is approved by a maintainer/admin (ignore AI bot approvals)
72+
APPROVERS=$(gh api "repos/${{ github.repository }}/pulls/$PR_NUMBER/reviews" \
73+
--jq '[.[] | select(.state == "APPROVED") | .user.login] | unique | .[]')
74+
APPROVED="false"
75+
for approver in $APPROVERS; do
76+
PERM=$(gh api "repos/${{ github.repository }}/collaborators/$approver/permission" \
77+
--jq '.permission' 2>/dev/null || echo "none")
78+
if [ "$PERM" = "admin" ] || [ "$PERM" = "maintain" ] || [ "$PERM" = "write" ]; then
79+
echo " Approved by $approver (permission: $PERM)"
80+
APPROVED="true"
81+
break
82+
fi
83+
done
84+
echo "approved=$APPROVED" >> $GITHUB_OUTPUT
6485
else
6586
echo "pr_number=" >> $GITHUB_OUTPUT
6687
echo "approved=false" >> $GITHUB_OUTPUT
@@ -76,8 +97,7 @@ jobs:
7697
(
7798
github.event_name == 'workflow_dispatch' ||
7899
needs.file-changes.outputs.pr_approved == 'true' ||
79-
needs.file-changes.outputs.pr_author == 'sbryngelson' ||
80-
needs.file-changes.outputs.pr_author == 'wilfonba'
100+
needs.file-changes.outputs.pr_author == 'sbryngelson'
81101
)
82102
needs: [file-changes]
83103
strategy:
@@ -164,6 +184,7 @@ jobs:
164184
run: bash pr/.github/scripts/run_parallel_benchmarks.sh ${{ matrix.device }} ${{ matrix.interface }} ${{ matrix.cluster }}
165185

166186
- name: Generate & Post Comment
187+
if: always()
167188
run: |
168189
(cd pr && . ./mfc.sh load -c ${{ matrix.flag }} -m g)
169190
(cd pr && ./mfc.sh bench_diff ../master/bench-${{ matrix.device }}-${{ matrix.interface }}.yaml ../pr/bench-${{ matrix.device }}-${{ matrix.interface }}.yaml)

.github/workflows/coverage.yml

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,12 @@ jobs:
3535
- name: Checkouts
3636
uses: actions/checkout@v4
3737

38+
- name: Restore Build Cache
39+
uses: actions/cache@v4
40+
with:
41+
path: build
42+
key: mfc-coverage-${{ hashFiles('CMakeLists.txt', 'toolchain/dependencies/**', 'toolchain/cmake/**', 'src/**/*.fpp', 'src/**/*.f90') }}
43+
3844
- name: Setup Ubuntu
3945
run: |
4046
sudo apt update -y

.github/workflows/frontier/build.sh

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,11 @@ fi
1818

1919
. ./mfc.sh load -c f -m g
2020

21+
# Only set up build cache for test suite, not benchmarks
22+
if [ "$run_bench" != "bench" ]; then
23+
source .github/scripts/setup-build-cache.sh frontier "$job_device" "$job_interface"
24+
fi
25+
2126
max_attempts=3
2227
attempt=1
2328
while [ $attempt -le $max_attempts ]; do
@@ -45,8 +50,8 @@ while [ $attempt -le $max_attempts ]; do
4550
fi
4651

4752
if [ $attempt -lt $max_attempts ]; then
48-
echo "Build failed on attempt $attempt. Cleaning and retrying in 30s..."
49-
./mfc.sh clean
53+
echo "Build failed on attempt $attempt. Clearing cache and retrying in 30s..."
54+
rm -rf build/staging build/install build/lock.yaml
5055
sleep 30
5156
fi
5257
attempt=$((attempt + 1))

.github/workflows/frontier_amd/build.sh

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,11 @@ fi
1818

1919
. ./mfc.sh load -c famd -m g
2020

21+
# Only set up build cache for test suite, not benchmarks
22+
if [ "$run_bench" != "bench" ]; then
23+
source .github/scripts/setup-build-cache.sh frontier_amd "$job_device" "$job_interface"
24+
fi
25+
2126
max_attempts=3
2227
attempt=1
2328
while [ $attempt -le $max_attempts ]; do
@@ -45,8 +50,8 @@ while [ $attempt -le $max_attempts ]; do
4550
fi
4651

4752
if [ $attempt -lt $max_attempts ]; then
48-
echo "Build failed on attempt $attempt. Cleaning and retrying in 30s..."
49-
./mfc.sh clean
53+
echo "Build failed on attempt $attempt. Clearing cache and retrying in 30s..."
54+
rm -rf build/staging build/install build/lock.yaml
5055
sleep 30
5156
fi
5257
attempt=$((attempt + 1))

.github/workflows/phoenix/test.sh

Lines changed: 23 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -10,18 +10,39 @@ if [ "$job_device" = "gpu" ]; then
1010
fi
1111
fi
1212

13+
# Set up persistent build cache
14+
source .github/scripts/setup-build-cache.sh phoenix "$job_device" "$job_interface"
15+
1316
max_attempts=3
1417
attempt=1
1518
while [ $attempt -le $max_attempts ]; do
1619
echo "Build attempt $attempt of $max_attempts..."
1720
if ./mfc.sh test -v --dry-run -j 8 $build_opts; then
1821
echo "Build succeeded on attempt $attempt."
22+
23+
# Smoke-test the cached binaries to catch architecture mismatches
24+
# (SIGILL from binaries compiled on a different compute node).
25+
syscheck_bin=$(find build/install -name syscheck -type f 2>/dev/null | head -1)
26+
if [ -n "$syscheck_bin" ] && ! "$syscheck_bin" > /dev/null 2>&1; then
27+
echo "WARNING: syscheck binary crashed — cached install is stale."
28+
if [ $attempt -lt $max_attempts ]; then
29+
echo "Clearing cache and rebuilding..."
30+
rm -rf build/staging build/install build/lock.yaml
31+
sleep 5
32+
attempt=$((attempt + 1))
33+
continue
34+
else
35+
echo "ERROR: syscheck still failing after $max_attempts attempts."
36+
exit 1
37+
fi
38+
fi
39+
1940
break
2041
fi
2142

2243
if [ $attempt -lt $max_attempts ]; then
23-
echo "Build failed on attempt $attempt. Cleaning and retrying in 30s..."
24-
./mfc.sh clean
44+
echo "Build failed on attempt $attempt. Clearing cache and retrying in 30s..."
45+
rm -rf build/staging build/install build/lock.yaml
2546
sleep 30
2647
else
2748
echo "Build failed after $max_attempts attempts."
@@ -40,4 +61,3 @@ if [ "$job_device" = "gpu" ]; then
4061
fi
4162

4263
./mfc.sh test -v --max-attempts 3 -a -j $n_test_threads $device_opts -- -c phoenix
43-

.github/workflows/test.yml

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -97,6 +97,12 @@ jobs:
9797
- name: Clone
9898
uses: actions/checkout@v4
9999

100+
- name: Restore Build Cache
101+
uses: actions/cache@v4
102+
with:
103+
path: build
104+
key: mfc-build-${{ matrix.os }}-${{ matrix.mpi }}-${{ matrix.debug }}-${{ matrix.precision }}-${{ matrix.intel }}-${{ hashFiles('CMakeLists.txt', 'toolchain/dependencies/**', 'toolchain/cmake/**', 'src/**/*.fpp', 'src/**/*.f90') }}
105+
100106
- name: Setup MacOS
101107
if: matrix.os == 'macos'
102108
run: |
@@ -205,6 +211,8 @@ jobs:
205211
steps:
206212
- name: Clone
207213
uses: actions/checkout@v4
214+
with:
215+
clean: false
208216

209217
- name: Build
210218
if: matrix.cluster != 'phoenix'

0 commit comments

Comments
 (0)