Skip to content

fix: handle stale runner processes with deleted binaries#1322

Merged
sbryngelson merged 1 commit intoMFlowCode:masterfrom
sbryngelson:runners-fix-2
Mar 18, 2026
Merged

fix: handle stale runner processes with deleted binaries#1322
sbryngelson merged 1 commit intoMFlowCode:masterfrom
sbryngelson:runners-fix-2

Conversation

@sbryngelson
Copy link
Copy Markdown
Member

Summary

  • find_pids now strips the (deleted) suffix that Linux appends to /proc/PID/exe when a runner binary is replaced on disk
  • Without this, runners with stale binaries are invisible to find_node and restart-all, appearing offline even though the process is alive
  • Fixes the case where restart-all reports runners as OFFLINE and skips them instead of restarting them

Test plan

  • Verify restart-all detects and restarts runners whose binaries have been updated on disk
  • Verify find_node correctly locates runners with deleted binaries
  • Verify stop_runner correctly kills stale processes before starting fresh ones

🤖 Generated with Claude Code

find_pids strips the ' (deleted)' suffix that Linux appends to
/proc/PID/exe when the binary has been replaced on disk. Without
this, runners running old binaries are invisible to find_node and
restart-all, appearing offline even though the process is alive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 18, 2026

Claude Code Review

Head SHA: 2d0335aa5d66bf1b5fdb1f995c2fd642fdbf2371

Files changed: 1

  • misc/runners/common/runner-lib.sh

Summary:

  • Adds exe=${exe% (deleted)} to find_pids() to strip the Linux kernel suffix appended to /proc/PID/exe when the underlying binary inode is replaced on disk
  • Fixes runner detection failure after binary updates (runner appears OFFLINE to find_node/restart-all despite being alive)
  • Change is minimal, correct, and safe (pure shell parameter expansion, no injection risk)

Findings:

No blocking issues. The analogous sweep_all_nodes function is unaffected: it derives dir via two dirname calls on the exe path, which strip the filename (and any trailing (deleted) suffix) before the value is used or printed. No fix is needed there.

@sbryngelson sbryngelson marked this pull request as ready for review March 18, 2026 02:42
Copilot AI review requested due to automatic review settings March 18, 2026 02:42
@sbryngelson sbryngelson merged commit 6291786 into MFlowCode:master Mar 18, 2026
31 checks passed
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves GitHub Actions runner process discovery on login nodes by handling the Linux /proc/<pid>/exe symlink behavior when the underlying runner binary has been deleted.

Changes:

  • Strip the trailing (deleted) suffix from resolved executable paths in find_pids() before comparing against the expected Runner.Listener path.

@github-actions
Copy link
Copy Markdown

Claude Code Review

Head SHA: 2d0335a

Files changed:

  • 1
  • misc/runners/common/runner-lib.sh

Findings:

sweep_all_nodes() has the same (deleted) problem but was not fixed

The fix strips (deleted) from exe in find_pids(), but sweep_all_nodes() (in the same file, visible in context) has an identical /proc/$p/exe resolution pattern with no corresponding fix:

exe=$(readlink -f /proc/$p/exe 2>/dev/null || true)
[ -z "$exe" ] && continue
dir=$(dirname "$(dirname "$exe")")

If the runner binary has been updated while the process is running, readlink -f /proc/$p/exe returns something like /path/to/bin/Runner.Listener (deleted). The [ -z "$exe" ] guard does not catch this — exe is non-empty — so dir is computed via dirname dirname of a path ending in (deleted). The resulting dir would be wrong (e.g. the (deleted) suffix being treated as part of the innermost path component), causing sweep_all_nodes() to report the wrong runner directory and emit a malformed RUNNER output line.

The same one-line fix should be applied to sweep_all_nodes() immediately after its readlink call:

exe=${exe% (deleted)}

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 18, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b729d8d6-729c-4e07-84ad-c8252a855b44

📥 Commits

Reviewing files that changed from the base of the PR and between 5c74f9f and 2d0335a.

📒 Files selected for processing (1)
  • misc/runners/common/runner-lib.sh

📝 Walkthrough

Walkthrough

The find_pids function in the runner library script was updated to handle executable path comparison more accurately. After resolving the executable path of a running process, the code now removes any trailing " (deleted)" suffix from the path before comparing it against the target Runner.Listener path. This adjustment accounts for scenarios where the original binary has been deleted but the process still maintains a reference to it through the executable symlink. The modification adds a single line and does not alter other logic or error handling mechanisms.

📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can suggest fixes for GitHub Check annotations.

Configure the reviews.tools.github-checks setting to adjust the time to wait for GitHub Checks to complete.

@sbryngelson sbryngelson deleted the runners-fix-2 branch March 20, 2026 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants