MFC CI has zero build caching — every run compiles from scratch. There are two categories of runners:
- GitHub-hosted (Ubuntu, macOS): 7 matrix combos with GCC/Intel. Ephemeral VMs.
- Self-hosted HPC (Phoenix, Frontier, Frontier AMD): 8 matrix combos with NVHPC/Cray CCE/amdflang. Shared runner pools where jobs float between multiple runner instances, each with a different workspace path.
The build system already supports incremental builds — build.py:578 checks is_configured() (looks for CMakeCache.txt) and skips reconfiguration if found. The problem is purely that build artifacts don't persist across runs.
Key constraints discovered during analysis
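- build/ is in .gitignore → git does not track it, but actions/checkout's default clean step (git clean -ffdx) can still delete ignored paths, so it only survives a checkout when cleaning is disabled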
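- mfc.sh clean uses shutil.rmtree, which is unsafe with a symlinked build/: called on the symlink itself it raises an error instead of removing it, and called on any path beneath the symlink it deletes the real cache contents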
- ccache does not work with NVHPC, Cray CCE, or amdflang → useless for HPC runners
- Self-hosted runners have different workspace paths per instance → CMakeCache.txt contains stale absolute paths when a job lands on a different runner
- uv is already used for pip installs → venv setup is already fast (~seconds)
Changes
1. actions/cache for GitHub-hosted runners
Files: .github/workflows/test.yml, .github/workflows/coverage.yml
Add actions/cache@v4 after checkout, caching the build/ directory.
In test.yml, insert after the Clone step in the github job (after line 95):
- name: Restore Build Cache
  uses: actions/cache@v4
  with:
    path: build
    key: mfc-build-${{ matrix.os }}-${{ matrix.mpi }}-${{ matrix.debug }}-${{ matrix.precision }}-${{ matrix.intel }}-${{ hashFiles('CMakeLists.txt', 'toolchain/dependencies/**', 'toolchain/cmake/**', 'src/**/*.fpp', 'src/**/*.f90') }}
    restore-keys: |
      mfc-build-${{ matrix.os }}-${{ matrix.mpi }}-${{ matrix.debug }}-${{ matrix.precision }}-${{ matrix.intel }}-
In coverage.yml, insert after checkout (after line 36):
- name: Restore Build Cache
  uses: actions/cache@v4
  with:
    path: build
    key: mfc-coverage-${{ hashFiles('CMakeLists.txt', 'toolchain/dependencies/**', 'toolchain/cmake/**', 'src/**/*.fpp', 'src/**/*.f90') }}
    restore-keys: |
      mfc-coverage-
How it works:
- Cache miss (first run or source hash change): builds from scratch, cache saved on completion
- Cache hit via restore-keys prefix (source changed but config same): restores old build dir, is_configured() returns True, CMake does incremental build of only changed files
- Exact cache hit: restores build dir, nothing to recompile, just runs tests
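The three outcomes above follow from the order in which actions/cache looks things up: the full key first, then each restore-keys entry as a prefix. The sketch below is a simplified model of that order, not the action's implementation. The function name `resolve_cache` and the saved keys are made up for illustration, and the real action additionally prefers the most recently created match and scopes caches by branch.

```shell
#!/bin/bash
# Simplified model of the lookup order: exact key first, then prefix fallback.
# resolve_cache and the keys below are hypothetical, for illustration only.

resolve_cache() {
    local key="$1" prefix="$2"
    shift 2
    local saved
    # Pass 1: exact key match (the "exact cache hit" case).
    for saved in "$@"; do
        if [ "$saved" = "$key" ]; then
            echo "exact:$saved"
            return 0
        fi
    done
    # Pass 2: any saved entry that starts with the restore-keys prefix.
    for saved in "$@"; do
        case "$saved" in
            "$prefix"*)
                echo "prefix:$saved"
                return 0
                ;;
        esac
    done
    # Nothing usable: the job builds from scratch.
    echo "miss"
}

prefix="mfc-build-ubuntu-mpi-debug-double-false-"
saved_keys=("${prefix}aaa111" "mfc-build-macos-mpi-debug-double-false-ccc333")

# Same config, same source hash.
exact_hit="$(resolve_cache "${prefix}aaa111" "$prefix" "${saved_keys[@]}")"
# Same config, source hash changed.
prefix_hit="$(resolve_cache "${prefix}bbb222" "$prefix" "${saved_keys[@]}")"
# A config that has never been cached.
other_prefix="mfc-build-ubuntu-no-mpi-debug-double-false-"
cold_miss="$(resolve_cache "${other_prefix}ddd444" "$other_prefix" "${saved_keys[@]}")"

echo "$exact_hit"
echo "$prefix_hit"
echo "$cold_miss"
```

Keeping every matrix dimension in the prefix is what stops one configuration from restoring another configuration's build directory.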
2. Persistent build cache for self-hosted HPC runners
Approach: Symlink build/ → $HOME/.mfc-ci-cache/<cluster>-<device>-<interface>/build/ so every run of the same config finds cached artifacts regardless of which runner instance it lands on.
2a. New helper script: .github/scripts/setup-build-cache.sh
#!/bin/bash
# Sets up a persistent build cache for self-hosted CI runners.
# Creates a symlink: ./build -> $HOME/.mfc-ci-cache/<key>/build
#
# Usage: source .github/scripts/setup-build-cache.sh <cluster> <device> <interface>

_cache_cluster="${1:?Usage: setup-build-cache.sh <cluster> <device> <interface>}"
_cache_device="${2:?}"
_cache_interface="${3:-none}"

_cache_key="${_cache_cluster}-${_cache_device}-${_cache_interface}"
_cache_dir="$HOME/.mfc-ci-cache/${_cache_key}/build"

echo "=== Build Cache Setup ==="
echo " Cache key: $_cache_key"
echo " Cache dir: $_cache_dir"

mkdir -p "$_cache_dir"

# If build/ exists (real dir or stale symlink), remove it
if [ -e "build" ] || [ -L "build" ]; then
    rm -rf "build"
fi
ln -s "$_cache_dir" "build"

# Handle cross-runner workspace path changes.
# CMakeCache.txt stores absolute paths from whichever runner instance
# originally configured the build. If we're on a different runner, sed-replace
# the old workspace path with the current one so CMake can do incremental builds.
_workspace_marker="$_cache_dir/.workspace_path"
if [ -f "$_workspace_marker" ]; then
    _old_workspace=$(cat "$_workspace_marker")
    if [ "$_old_workspace" != "$(pwd)" ]; then
        echo " Workspace path changed: $_old_workspace -> $(pwd)"
        echo " Updating cached CMake paths..."
        find "$_cache_dir/staging" -type f \
            \( -name "CMakeCache.txt" -o -name "*.cmake" \
            -o -name "Makefile" -o -name "build.ninja" \) \
            -exec sed -i "s|${_old_workspace}|$(pwd)|g" {} + 2>/dev/null || true
    fi
fi
echo "$(pwd)" > "$_workspace_marker"

echo " Symlink: build -> $_cache_dir"
echo "========================="
Why this works:
- rm -rf "build" on a symlink removes the symlink, not the target, so the cache is safe
- actions/checkout may remove the symlink via git clean, but we recreate it immediately after
- The sed fixup rewrites stale absolute paths in CMake files when the workspace path changes (different runner instance), enabling incremental builds across runners
- $HOME is on shared NFS on both Phoenix and Frontier, so the cache is accessible from login and compute nodes
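The first and third points can be checked in isolation. The following is a throwaway demonstration in a temporary directory, not part of the proposed change. The directory names (`runner-a`, `runner-b`) and the single `CMAKE_HOME_DIRECTORY` line are assumptions standing in for a real CMakeCache.txt, and `sed -i` is used in its GNU form, as in the helper script.

```shell
#!/bin/bash
# Throwaway demo of two properties the helper relies on.
# All paths live under a temp dir; runner-a / runner-b are hypothetical.

demo="$(mktemp -d)"
cache="$demo/cache/build"
old_ws="$demo/runner-a"
new_ws="$demo/runner-b"
mkdir -p "$cache/staging" "$old_ws" "$new_ws"

# Pretend runner A configured the build: the cache records A's workspace path.
printf 'CMAKE_HOME_DIRECTORY:INTERNAL=%s\n' "$old_ws" > "$cache/staging/CMakeCache.txt"
ln -s "$cache" "$old_ws/build"

# 1. Removing the symlink (no trailing slash) leaves the target untouched.
rm -rf "$old_ws/build"
survived=no
[ -f "$cache/staging/CMakeCache.txt" ] && survived=yes

# 2. Runner B links to the same cache and rewrites the stale path (GNU sed).
ln -s "$cache" "$new_ws/build"
sed -i "s|${old_ws}|${new_ws}|g" "$new_ws/build/staging/CMakeCache.txt"
fixed="$(cat "$cache/staging/CMakeCache.txt")"

echo "cache survived symlink removal: $survived"
echo "rewritten entry: $fixed"

rm -rf "$demo"
```

Note that the workspace path is used as a sed pattern, so characters such as `.` in it are treated as regex wildcards; for typical runner paths this still matches the intended text.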
2b. Modify HPC build scripts
.github/workflows/frontier/build.sh — Add cache setup after mfc.sh load, replace ./mfc.sh clean in retry with targeted cleanup:
. ./mfc.sh load -c f -m g
# Set up persistent build cache
source .github/scripts/setup-build-cache.sh frontier "$job_device" "$job_interface"
# In retry logic, replace:
# ./mfc.sh clean
# with:
rm -rf build/staging/* build/install/* build/lock.yaml
The targeted cleanup clears compiled artifacts (forcing full reconfigure on retry) without destroying the symlink or the venv.
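A quick way to convince yourself of that claim is to run the same rm command against a mock cache. This is a sketch under assumptions: the venv is placed at `build/venv` purely for illustration, and the file names inside staging and install are invented.

```shell
#!/bin/bash
# Mock cache in a temp dir; build/venv and the file names are assumptions.

demo="$(mktemp -d)"
mkdir -p "$demo/cache/build/staging/obj" \
         "$demo/cache/build/install/bin" \
         "$demo/cache/build/venv" \
         "$demo/ws"
touch "$demo/cache/build/staging/obj/a.o" \
      "$demo/cache/build/install/bin/tool" \
      "$demo/cache/build/lock.yaml" \
      "$demo/cache/build/venv/marker"
ln -s "$demo/cache/build" "$demo/ws/build"

# The targeted cleanup, run from the workspace as the retry logic would.
(
    cd "$demo/ws" || exit 1
    rm -rf build/staging/* build/install/* build/lock.yaml
)

link_ok=no
[ -L "$demo/ws/build" ] && link_ok=yes
venv_ok=no
[ -f "$demo/ws/build/venv/marker" ] && venv_ok=yes
# Count whatever is left under staging (expected: nothing).
leftover="$(find "$demo/ws/build/staging" -mindepth 1 | wc -l | tr -d ' ')"

echo "symlink kept: $link_ok, venv kept: $venv_ok, staging entries left: $leftover"
```

One caveat of the glob form: `*` does not match dotfiles by default in bash, so hidden entries directly under staging/ or install/ would survive the cleanup.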
.github/workflows/frontier_amd/build.sh — Same pattern, using frontier_amd as the cluster name. Module load is mfc.sh load -c famd -m g.
.github/workflows/phoenix/test.sh — Same pattern, using phoenix as the cluster name. Note: Phoenix builds inside SLURM jobs (not on login nodes), but $HOME is on shared NFS so the cache is accessible.
2c. No changes needed to:
- frontier/test.sh, frontier_amd/test.sh: test-only scripts, no build step
- frontier/submit.sh, phoenix/submit.sh: submit scripts; the build symlink lives in the workspace, which is where $SLURM_SUBMIT_DIR points
- test.yml self-hosted job definitions: cache logic lives entirely in the shell scripts
- Any Python toolchain files (build.py, clean.py, common.py, etc.)
- mfc.sh
File change summary
| File | Action | Description |
| --- | --- | --- |
| .github/workflows/test.yml | Modify | Add actions/cache@v4 step to github job |
| .github/workflows/coverage.yml | Modify | Add actions/cache@v4 step to run job |
| .github/scripts/setup-build-cache.sh | New | Shared helper: symlink build/ → persistent cache, sed-fixup paths |
| .github/workflows/frontier/build.sh | Modify | Add cache setup, replace mfc.sh clean with targeted rm |
| .github/workflows/frontier_amd/build.sh | Modify | Same as frontier |
| .github/workflows/phoenix/test.sh | Modify | Add cache setup, replace mfc.sh clean with targeted rm |
Zero changes to MFC source code or Python toolchain.
How incremental builds work after these changes
- First run (cold cache): No CMakeCache.txt → is_configured() returns False → full configure + build. Artifacts saved in persistent cache.
- Same runner, code changed: CMakeCache.txt exists → is_configured() returns True → skips configure → cmake --build recompiles only changed files.
- Different runner, code changed: Symlink points to same cache → sed fixes workspace paths in CMake files → is_configured() returns True, so MFC skips its explicit configure step → cmake --build is expected to re-run configure on its own because the CMake files were modified → incremental build.
- Build failure: Retry logic clears build/staging/* and build/install/* → next attempt does full configure + build (same as today).
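The first three scenarios reduce to two checks: whether a CMakeCache.txt exists, and whether the recorded workspace path matches the current one. The sketch below models that decision only. `describe_run` is a hypothetical helper, and the flat `<cache>/staging/CMakeCache.txt` layout is an assumption made to keep the example short.

```shell
#!/bin/bash
# Hypothetical helper: classifies a run from the state of the cache directory.
# Assumed layout: <cache>/staging/CMakeCache.txt and <cache>/.workspace_path.

describe_run() {
    local cache_dir="$1" workspace="$2"
    if [ ! -f "$cache_dir/staging/CMakeCache.txt" ]; then
        echo "cold"                  # full configure + build
    elif [ "$(cat "$cache_dir/.workspace_path" 2>/dev/null)" = "$workspace" ]; then
        echo "warm-same-workspace"   # configure skipped, incremental build
    else
        echo "warm-path-fixup"       # rewrite paths, then incremental build
    fi
}

cache="$(mktemp -d)"

# Nothing cached yet.
first="$(describe_run "$cache" "/runner-a/MFC")"

# Runner A has configured the build and recorded its workspace.
mkdir -p "$cache/staging"
touch "$cache/staging/CMakeCache.txt"
echo "/runner-a/MFC" > "$cache/.workspace_path"

second="$(describe_run "$cache" "/runner-a/MFC")"
third="$(describe_run "$cache" "/runner-b/MFC")"

echo "$first / $second / $third"
rm -rf "$cache"
```

Unlike the helper script, this sketch also reports a path fix-up when the marker file is missing; the script itself only rewrites paths when the marker exists.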
Verification
- GH runners: Open a PR against MFlowCode/MFC. First run builds from scratch (cache miss). Push a trivial commit; the second run should show Cache restored in the "Restore Build Cache" step and build faster.
- Self-hosted runners: The first run on each runner creates the cache. Subsequent runs should show "Symlink: build -> ..." in the logs. If the runner changes, the log should also show "Updating cached CMake paths..." and the build should still be incremental.
- Retry path: Intentionally break a build (if possible) to verify that the retry logic clears staging and recovers.
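For the self-hosted check, the two log messages printed by the helper script are enough to tell the cases apart. A minimal sketch, assuming the job log has been saved to a file; `classify_cache_log` and the sample log contents are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: classifies a saved job log by the helper script's messages.

classify_cache_log() {
    local log_file="$1"
    if grep -qF "Updating cached CMake paths" "$log_file"; then
        echo "reused-different-runner"
    elif grep -qF "Symlink: build ->" "$log_file"; then
        # Cannot distinguish a first run from a warm run on the same runner.
        echo "linked"
    else
        echo "no-cache-setup"
    fi
}

log_dir="$(mktemp -d)"

# Sample logs; the cache path shown is a made-up example.
printf '%s\n' "=== Build Cache Setup ===" \
    " Symlink: build -> /home/ci/.mfc-ci-cache/example/build" > "$log_dir/same.log"
printf '%s\n' "=== Build Cache Setup ===" \
    " Updating cached CMake paths..." \
    " Symlink: build -> /home/ci/.mfc-ci-cache/example/build" > "$log_dir/moved.log"
printf '%s\n' "Building MFC" > "$log_dir/none.log"

same="$(classify_cache_log "$log_dir/same.log")"
moved="$(classify_cache_log "$log_dir/moved.log")"
none="$(classify_cache_log "$log_dir/none.log")"

echo "$same / $moved / $none"
rm -rf "$log_dir"
```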
Risks and mitigations
| Risk | Mitigation |
| --- | --- |
| Stale CMake cache causes persistent build failures | Retry logic clears staging on failure; degrades to full rebuild (same as today) |
| sed misses some path references | CMake auto-regeneration handles most mismatches; retry catches the rest |
| Cache grows unbounded on HPC | Only 8 cache dirs total (one per config); each is updated in-place. Can add housekeeping cron later. |
| Concurrent jobs for same config corrupt cache | concurrency group in test.yml cancels in-progress runs; matrix dimensions ensure different configs use different cache dirs |
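If the housekeeping cron mentioned in the table is ever added, one possible starting point is a dry-run listing of caches that have not been touched recently. This is only a sketch: `list_stale_caches` is a hypothetical name, the 30-day threshold is arbitrary, and it relies on the `.workspace_path` marker being rewritten on every run, as the helper script does. It prints candidates and deletes nothing.

```shell
#!/bin/bash
# Hypothetical dry-run: list cache build dirs whose marker is older than N days.
# Assumed layout: <root>/<key>/build/.workspace_path

list_stale_caches() {
    local root="$1" days="$2"
    find "$root" -mindepth 3 -maxdepth 3 -type f -name ".workspace_path" \
        -mtime "+$days" -exec dirname {} \; 2>/dev/null | sort
}

# Mock cache root with one old and one fresh entry (key names are invented).
root="$(mktemp -d)"
mkdir -p "$root/frontier-gpu-acc/build" "$root/phoenix-cpu-none/build"
touch -t 200001010000 "$root/frontier-gpu-acc/build/.workspace_path"
touch "$root/phoenix-cpu-none/build/.workspace_path"

stale="$(list_stale_caches "$root" 30)"
echo "stale caches:"
echo "$stale"
```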