Add CI build caching and improve benchmark workflow (#1148)

sbryngelson · claude · web-flow · commit 356b61feeb00 · 2026-02-20T14:22:02.000-05:00
* Add CI build caching for GitHub-hosted and self-hosted HPC runners

GitHub-hosted runners: Add actions/cache@v4 to test.yml and coverage.yml,
caching the build/ directory keyed by matrix config and source file hashes.
Partial cache hits via restore-keys enable incremental builds.
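
As a rough illustration of keying a cache by source file hashes, the sketch
below derives a key from a prefix plus a digest of file contents. It is not
the algorithm behind GitHub's hashFiles(); the cache_key helper is
hypothetical, and a Linux userland with find, sort -z, xargs, and sha256sum
is assumed.

```bash
# Hypothetical helper (not part of the MFC workflows): build a cache key
# from a prefix plus a digest of every file under the given paths.
cache_key() {
    local prefix="$1"
    shift
    local digest
    # Hash each file in a stable (byte-sorted) path order, then hash the
    # list of per-file hashes so any content change yields a new key.
    digest=$(find "$@" -type f -print0 2>/dev/null \
        | LC_ALL=C sort -z \
        | xargs -0 -r sha256sum \
        | sha256sum \
        | cut -d' ' -f1)
    echo "${prefix}-${digest}"
}
```

For example, `cache_key mfc-build-ubuntu CMakeLists.txt src toolchain/cmake`
prints the prefix followed by a 64-character hex digest that changes whenever
any listed file changes.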

Self-hosted HPC runners (Phoenix, Frontier, Frontier AMD): Add a persistent
build cache that symlinks build/ to $HOME/scratch/.mfc-ci-cache/<config>/build.
This ensures cached artifacts persist across CI runs regardless of which
runner instance picks up the job. Key details:

- Cross-runner workspace path fixup via sed on CMake files
- flock-based locking prevents concurrent builds from corrupting the cache
- Retry logic uses targeted rm (staging/install only) instead of mfc.sh clean
- Phoenix releases the lock after build, before tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix race conditions and cleanup in build cache

- Only remove build/staging (not build/install) on retry, so concurrent
  test jobs reading installed binaries are not disrupted
- Remove stale symlink in lock-timeout fallback path to prevent writing
  into the shared cache without holding the lock
- Remove redundant flock --unlock (closing fd is sufficient)
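
The locking pattern described in these bullets can be sketched as follows.
This is a minimal illustration, not the script from this commit: the
with_cache_lock name, descriptor 9, and the exit code 75 are all assumptions
made for the example. It relies only on `flock -n`, polling once per second
to implement the timeout.

```bash
# Hypothetical helper: run a command while holding an exclusive lock.
# Usage: with_cache_lock <lock-file> <timeout-seconds> <command> [args...]
with_cache_lock() {
    local lock_file="$1" timeout="$2" waited=0 status
    shift 2

    # Open the lock file on descriptor 9 (an arbitrary choice for this sketch).
    exec 9>"$lock_file"

    # Poll with a non-blocking flock until the lock is ours or we time out.
    until flock -n 9; do
        if [ "$waited" -ge "$timeout" ]; then
            exec 9>&-      # give up: close the descriptor without the lock
            return 75      # assumed code meaning "lock timeout"
        fi
        sleep 1
        waited=$((waited + 1))
    done

    "$@"
    status=$?

    # Closing the descriptor releases the lock; no explicit unlock is needed.
    exec 9>&-
    return "$status"
}
```

A caller that receives the timeout code should build outside the shared
cache rather than write into it without the lock.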

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix stale retry log messages

The echo said "Clearing staging/install" but build/install is
intentionally preserved to avoid disrupting concurrent test jobs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Disable git clean on self-hosted runners to preserve build cache

actions/checkout@v4 defaults to clean: true, which runs git clean -ffdx.
This follows the build/ symlink into the shared cache directory and
deletes all cached artifacts (staging, install, venv), defeating the
purpose of the persistent cache and causing SIGILL errors from partially
destroyed build artifacts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Skip build cache for benchmarks and fix benchmark trigger logic

Benchmarks build PR and master in parallel — sharing a cache key causes
collisions. Skip cache setup when run_bench=="bench" so each benchmark
builds from scratch.

Also make three changes to the benchmark workflow trigger:
- Cross-repo PRs don't populate pull_requests[]; fall back to searching
  by head SHA so the PR author is correctly detected.
- Only count approvals from users with write/maintain/admin permission,
  filtering out AI bot approvals (Copilot, Qodo).
- Remove wilfonba auto-run; only sbryngelson auto-runs benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix cross-runner cache by updating install/ config paths

When the cache moves between runner instances (e.g. actions-runner-6 to
actions-runner-1), the sed path replacement only updated staging/ CMake
files. Config files in install/ (.pc, .cmake) still had the old runner
path, causing silo/HDF5 to link against nonexistent paths and h5dump to
fail on all tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Delete install/ on workspace path change to fix stale binaries

Updating .pc and .cmake config files with sed is insufficient — the MFC
executables (simulation, pre_process, post_process) and static libraries
have the old runner workspace path baked in at compile time. When the
cache moves between runner instances, these binaries fail at runtime.

Replace the install/ sed fix with rm -rf install/ so CMake re-links and
re-installs all binaries with correct paths. The staging/ object files
remain valid, so this is a re-link, not a full rebuild.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Simplify build cache to per-runner directories

Replace the shared cache (with flock, sed path fixups, and workspace
tracking) with per-runner caches keyed by RUNNER_NAME. Each runner
always uses the same workspace path, so CMake's absolute paths are
always correct — no cross-runner path issues, no locking needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove restore-keys prefix fallback from GH-hosted build cache

The prefix fallback can restore a cache built on a runner with AVX-512
onto a runner without it, causing SIGILL in Chemistry tests. Without
restore-keys, only exact key matches are used — source changes trigger
a full rebuild but binaries are always compatible with the runner.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Make benchmark pipeline robust to transient GPU failures

Add three layers of defense against transient failures (e.g. ROCm
HSA_STATUS_ERROR_INVALID_ARGUMENT) tanking the entire benchmark:

1. Retry failed cases once (5s delay) before marking as failed
2. Always write partial results YAML before raising on failure
3. CI scripts warn on non-zero exit instead of aborting, and
   bench.yml runs the bench_diff step under `if: always()` so partial
   results are still compared
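
The first two layers can be sketched in shell as below. The real logic lives
in toolchain/mfc/bench.py, which is not shown here, so this is only an
illustration under assumptions: run_cases and the "name: status" result
format are hypothetical, and each case is a command or function name.

```bash
# Hypothetical sketch: run each case, retry a failure once, and record the
# outcome of every case as it finishes so partial results stay on disk.
# Usage: run_cases <results-file> <retry-delay-seconds> <case>...
run_cases() {
    local results="$1" delay="$2"
    shift 2
    local name attempt ok failures=0
    : > "$results"                      # start with an empty results file
    for name in "$@"; do
        ok=false
        for attempt in 1 2; do          # one initial try plus one retry
            if "$name"; then
                ok=true
                break
            fi
            if [ "$attempt" -eq 1 ]; then
                sleep "$delay"          # brief pause before the retry
            fi
        done
        if "$ok"; then
            echo "$name: passed" >> "$results"
        else
            echo "$name: failed" >> "$results"
            failures=$((failures + 1))
        fi
    done
    # Report failure only after every case has been recorded.
    [ "$failures" -eq 0 ]
}
```

Because results are appended per case and the non-zero status is returned
only at the end, a caller can warn on failure and still compare whatever was
written.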

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Suppress pylint too-many-nested-blocks for bench()

The retry loop adds nesting depth beyond pylint's default limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix rm -rf following build symlink into shared cache

rm -rf on a symlink follows it and deletes the target's contents,
which fails when another runner is using the shared cache directory.
Use unlink for symlinks, rm -rf only for real directories.
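
A minimal sketch of that rule, mirroring the check in setup-build-cache.sh.
The remove_build_entry name is hypothetical, and stripping a trailing slash
is an extra precaution added for this example, not something the commit
does.

```bash
# Hypothetical helper: remove a build entry without touching whatever a
# symlink points at.
remove_build_entry() {
    local path="${1%/}"    # drop one trailing slash so a link is seen as a link
    if [ -L "$path" ]; then
        unlink "$path"     # removes only the link; the target is left intact
    elif [ -e "$path" ]; then
        rm -rf "$path"     # a real file or directory: remove it entirely
    fi
}
```

Calling `remove_build_entry build` before `ln -s "$_cache_dir" build` leaves
the cache directory and its contents in place.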

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Detect stale cached binaries and include install/ in retry cleanup

Phoenix compute nodes may have different CPU architectures, causing
SIGILL when running binaries cached from a different node. After a
successful build, smoke-test syscheck to detect stale installs and
trigger a full rebuild.

Also include build/install in retry cleanup for all clusters. With
per-runner caching there are no concurrent readers sharing the same
cache directory, so clearing install is safe.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix benchmark PR detection for cross-fork workflow_run events

workflow_run events for cross-fork PRs don't populate pull_requests[]
and may report the base branch SHA instead of the PR head SHA, causing
the SHA-based PR lookup to fail. Add a fallback that searches by
branch name so benchmarks auto-trigger for fork PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Move build cache from scratch to coda1 project storage

Scratch quota was filling up and causing build failures.
coda1 project storage has more space for persistent caches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/.github/scripts/run_parallel_benchmarks.sh b/.github/scripts/run_parallel_benchmarks.sh
@@ -52,16 +52,16 @@ else
   echo "Master job completed successfully"
 fi
 
-# Check if either job failed
+# Warn if either job failed (partial results may still be usable)
 if [ "${pr_exit}" -ne 0 ] || [ "${master_exit}" -ne 0 ]; then
-  echo "ERROR: One or both benchmark jobs failed: pr_exit=${pr_exit}, master_exit=${master_exit}"
-  exit 1
+  echo "WARNING: Benchmark jobs had failures: pr=${pr_exit}, master=${master_exit}"
+  echo "Checking for partial results..."
+else
+  echo "=========================================="
+  echo "Both benchmark jobs completed successfully!"
+  echo "=========================================="
 fi
 
-echo "=========================================="
-echo "Both benchmark jobs completed successfully!"
-echo "=========================================="
-
 # Final verification that output files exist before proceeding
 pr_yaml="pr/bench-${device}-${interface}.yaml"
 master_yaml="master/bench-${device}-${interface}.yaml"
diff --git a/.github/scripts/setup-build-cache.sh b/.github/scripts/setup-build-cache.sh
@@ -0,0 +1,39 @@
+#!/bin/bash
+# Sets up a persistent build cache for self-hosted CI runners.
+# Creates a symlink: ./build -> /storage/coda1/d-coc/0/sbryngelson3/.mfc-ci-cache/<key>/build
+#
+# Each runner gets its own cache keyed by (cluster, device, interface, runner).
+# This avoids cross-runner path issues entirely — CMake's absolute paths are
+# always correct because the same runner always uses the same workspace path.
+#
+# Usage: source .github/scripts/setup-build-cache.sh <cluster> <device> <interface>
+
+_cache_cluster="${1:?Usage: setup-build-cache.sh <cluster> <device> <interface>}"
+_cache_device="${2:?}"
+_cache_interface="${3:-none}"
+_cache_runner="${RUNNER_NAME:?RUNNER_NAME not set}"
+
+_cache_key="${_cache_cluster}-${_cache_device}-${_cache_interface}-${_cache_runner}"
+_cache_base="/storage/coda1/d-coc/0/sbryngelson3/.mfc-ci-cache/${_cache_key}/build"
+
+mkdir -p "$_cache_base"
+_cache_dir="$(cd "$_cache_base" && pwd -P)"
+
+echo "=== Build Cache Setup ==="
+echo "  Cache key: $_cache_key"
+echo "  Cache dir: $_cache_dir"
+
+# Replace any existing build/ (real dir or stale symlink) with a symlink
+# to our runner-specific cache directory.
+# Use unlink for symlinks to avoid rm -rf following the link and deleting
+# the shared cache contents (which another runner may be using).
+if [ -L "build" ]; then
+    unlink "build"
+elif [ -e "build" ]; then
+    rm -rf "build"
+fi
+
+ln -s "$_cache_dir" "build"
+
+echo "  Symlink: build -> $_cache_dir"
+echo "========================="
diff --git a/.github/scripts/submit_and_monitor_bench.sh b/.github/scripts/submit_and_monitor_bench.sh
@@ -37,9 +37,13 @@ fi
 echo "[$dir] Job ID: $job_id, monitoring output file: $output_file"
 
 # Use the monitoring script from PR (where this script lives)
-bash "${SCRIPT_DIR}/monitor_slurm_job.sh" "$job_id" "$output_file"
-
-echo "[$dir] Monitoring complete for job $job_id"
+monitor_exit=0
+bash "${SCRIPT_DIR}/monitor_slurm_job.sh" "$job_id" "$output_file" || monitor_exit=$?
+if [ "$monitor_exit" -ne 0 ]; then
+  echo "[$dir] WARNING: SLURM job exited with code $monitor_exit"
+else
+  echo "[$dir] Monitoring complete for job $job_id"
+fi
 
 # Verify the YAML output file was created
 yaml_file="${job_slug}.yaml"
diff --git a/.github/workflows/bench.yml b/.github/workflows/bench.yml
@@ -46,21 +46,42 @@ jobs:
           else
             # Get PR number from workflow_run
             PR_NUMBER="${{ github.event.workflow_run.pull_requests[0].number }}"
+            if [ -z "$PR_NUMBER" ]; then
+              # Cross-repo PRs don't populate pull_requests[]. Search by head SHA.
+              HEAD_SHA="${{ github.event.workflow_run.head_sha }}"
+              PR_NUMBER=$(gh api "repos/${{ github.repository }}/pulls?state=open&sort=updated&direction=desc&per_page=30" \
+                  --jq ".[] | select(.head.sha == \"$HEAD_SHA\") | .number" | head -1)
+            fi
+            if [ -z "$PR_NUMBER" ]; then
+              # workflow_run may report the merge/base SHA for forks. Fall back to branch name.
+              HEAD_BRANCH="${{ github.event.workflow_run.head_branch }}"
+              if [ -n "$HEAD_BRANCH" ] && [ "$HEAD_BRANCH" != "master" ]; then
+                PR_NUMBER=$(gh api "repos/${{ github.repository }}/pulls?state=open&sort=updated&direction=desc&per_page=30" \
+                    --jq ".[] | select(.head.ref == \"$HEAD_BRANCH\") | .number" | head -1)
+              fi
+            fi
+
             if [ -n "$PR_NUMBER" ]; then
               echo "pr_number=$PR_NUMBER" >> $GITHUB_OUTPUT
 
               # Fetch actual PR author from API (workflow_run.actor is the re-runner, not PR author)
               PR_AUTHOR=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER --jq '.user.login')
               echo "author=$PR_AUTHOR" >> $GITHUB_OUTPUT
 
-              # Check if PR is approved
-              APPROVED=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/reviews \
-                --jq '[.[] | select(.state == "APPROVED")] | length')
-              if [ "$APPROVED" -gt 0 ]; then
-                echo "approved=true" >> $GITHUB_OUTPUT
-              else
-                echo "approved=false" >> $GITHUB_OUTPUT
-              fi
+              # Check if PR is approved by a maintainer/admin (ignore AI bot approvals)
+              APPROVERS=$(gh api "repos/${{ github.repository }}/pulls/$PR_NUMBER/reviews" \
+                  --jq '[.[] | select(.state == "APPROVED") | .user.login] | unique | .[]')
+              APPROVED="false"
+              for approver in $APPROVERS; do
+                  PERM=$(gh api "repos/${{ github.repository }}/collaborators/$approver/permission" \
+                      --jq '.permission' 2>/dev/null || echo "none")
+                  if [ "$PERM" = "admin" ] || [ "$PERM" = "maintain" ] || [ "$PERM" = "write" ]; then
+                      echo "  Approved by $approver (permission: $PERM)"
+                      APPROVED="true"
+                      break
+                  fi
+              done
+              echo "approved=$APPROVED" >> $GITHUB_OUTPUT
             else
               echo "pr_number=" >> $GITHUB_OUTPUT
               echo "approved=false" >> $GITHUB_OUTPUT
@@ -76,8 +97,7 @@ jobs:
       (
         github.event_name == 'workflow_dispatch' ||
         needs.file-changes.outputs.pr_approved == 'true' ||
-        needs.file-changes.outputs.pr_author == 'sbryngelson' ||
-        needs.file-changes.outputs.pr_author == 'wilfonba'
+        needs.file-changes.outputs.pr_author == 'sbryngelson'
       )
     needs: [file-changes]
     strategy:
@@ -164,6 +184,7 @@ jobs:
         run: bash pr/.github/scripts/run_parallel_benchmarks.sh ${{ matrix.device }} ${{ matrix.interface }} ${{ matrix.cluster }}
 
       - name: Generate & Post Comment
+        if: always()
         run: |
           (cd pr && . ./mfc.sh load -c ${{ matrix.flag }} -m g)
           (cd pr && ./mfc.sh bench_diff ../master/bench-${{ matrix.device }}-${{ matrix.interface }}.yaml ../pr/bench-${{ matrix.device }}-${{ matrix.interface }}.yaml)
diff --git a/.github/workflows/coverage.yml b/.github/workflows/coverage.yml
@@ -35,6 +35,12 @@ jobs:
       - name: Checkouts
         uses: actions/checkout@v4
 
+      - name: Restore Build Cache
+        uses: actions/cache@v4
+        with:
+          path: build
+          key: mfc-coverage-${{ hashFiles('CMakeLists.txt', 'toolchain/dependencies/**', 'toolchain/cmake/**', 'src/**/*.fpp', 'src/**/*.f90') }}
+
       - name: Setup Ubuntu
         run: |
             sudo apt update -y
diff --git a/.github/workflows/frontier/build.sh b/.github/workflows/frontier/build.sh
@@ -18,6 +18,11 @@ fi
 
 . ./mfc.sh load -c f -m g
 
+# Only set up build cache for test suite, not benchmarks
+if [ "$run_bench" != "bench" ]; then
+    source .github/scripts/setup-build-cache.sh frontier "$job_device" "$job_interface"
+fi
+
 max_attempts=3
 attempt=1
 while [ $attempt -le $max_attempts ]; do
@@ -45,8 +50,8 @@ while [ $attempt -le $max_attempts ]; do
     fi
 
     if [ $attempt -lt $max_attempts ]; then
-        echo "Build failed on attempt $attempt. Cleaning and retrying in 30s..."
-        ./mfc.sh clean
+        echo "Build failed on attempt $attempt. Clearing cache and retrying in 30s..."
+        rm -rf build/staging build/install build/lock.yaml
         sleep 30
     fi
     attempt=$((attempt + 1))
diff --git a/.github/workflows/frontier_amd/build.sh b/.github/workflows/frontier_amd/build.sh
@@ -18,6 +18,11 @@ fi
 
 . ./mfc.sh load -c famd -m g
 
+# Only set up build cache for test suite, not benchmarks
+if [ "$run_bench" != "bench" ]; then
+    source .github/scripts/setup-build-cache.sh frontier_amd "$job_device" "$job_interface"
+fi
+
 max_attempts=3
 attempt=1
 while [ $attempt -le $max_attempts ]; do
@@ -45,8 +50,8 @@ while [ $attempt -le $max_attempts ]; do
     fi
 
     if [ $attempt -lt $max_attempts ]; then
-        echo "Build failed on attempt $attempt. Cleaning and retrying in 30s..."
-        ./mfc.sh clean
+        echo "Build failed on attempt $attempt. Clearing cache and retrying in 30s..."
+        rm -rf build/staging build/install build/lock.yaml
         sleep 30
     fi
     attempt=$((attempt + 1))
diff --git a/.github/workflows/phoenix/test.sh b/.github/workflows/phoenix/test.sh
@@ -10,18 +10,39 @@ if [ "$job_device" = "gpu" ]; then
     fi
 fi
 
+# Set up persistent build cache
+source .github/scripts/setup-build-cache.sh phoenix "$job_device" "$job_interface"
+
 max_attempts=3
 attempt=1
 while [ $attempt -le $max_attempts ]; do
     echo "Build attempt $attempt of $max_attempts..."
     if ./mfc.sh test -v --dry-run -j 8 $build_opts; then
         echo "Build succeeded on attempt $attempt."
+
+        # Smoke-test the cached binaries to catch architecture mismatches
+        # (SIGILL from binaries compiled on a different compute node).
+        syscheck_bin=$(find build/install -name syscheck -type f 2>/dev/null | head -1)
+        if [ -n "$syscheck_bin" ] && ! "$syscheck_bin" > /dev/null 2>&1; then
+            echo "WARNING: syscheck binary crashed — cached install is stale."
+            if [ $attempt -lt $max_attempts ]; then
+                echo "Clearing cache and rebuilding..."
+                rm -rf build/staging build/install build/lock.yaml
+                sleep 5
+                attempt=$((attempt + 1))
+                continue
+            else
+                echo "ERROR: syscheck still failing after $max_attempts attempts."
+                exit 1
+            fi
+        fi
+
         break
     fi
 
     if [ $attempt -lt $max_attempts ]; then
-        echo "Build failed on attempt $attempt. Cleaning and retrying in 30s..."
-        ./mfc.sh clean
+        echo "Build failed on attempt $attempt. Clearing cache and retrying in 30s..."
+        rm -rf build/staging build/install build/lock.yaml
         sleep 30
     else
         echo "Build failed after $max_attempts attempts."
@@ -40,4 +61,3 @@ if [ "$job_device" = "gpu" ]; then
 fi
 
 ./mfc.sh test -v --max-attempts 3 -a -j $n_test_threads $device_opts -- -c phoenix
-
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -97,6 +97,12 @@ jobs:
       - name: Clone
         uses: actions/checkout@v4
 
+      - name: Restore Build Cache
+        uses: actions/cache@v4
+        with:
+          path: build
+          key: mfc-build-${{ matrix.os }}-${{ matrix.mpi }}-${{ matrix.debug }}-${{ matrix.precision }}-${{ matrix.intel }}-${{ hashFiles('CMakeLists.txt', 'toolchain/dependencies/**', 'toolchain/cmake/**', 'src/**/*.fpp', 'src/**/*.f90') }}
+
       - name: Setup MacOS
         if:   matrix.os == 'macos'
         run:  |
@@ -205,6 +211,8 @@ jobs:
     steps:
       - name: Clone
         uses: actions/checkout@v4
+        with:
+          clean: false
 
       - name: Build
         if:   matrix.cluster != 'phoenix'
diff --git a/toolchain/mfc/bench.py b/toolchain/mfc/bench.py