Skip to content

FEA-1450: auto-reclaim agent monitor port from orphaned sidecar at boot#257

Open
aeyeCEO wants to merge 4 commits into
mainfrom
FEA-1450-sidecar-port-cleanup
Open

FEA-1450: auto-reclaim agent monitor port from orphaned sidecar at boot#257
aeyeCEO wants to merge 4 commits into
mainfrom
FEA-1450-sidecar-port-cleanup

Conversation

@aeyeCEO
Copy link
Copy Markdown
Contributor

@aeyeCEO aeyeCEO commented May 29, 2026

Summary

Adds a detect-and-reconcile preflight to the Agent Monitor sidecar so an orphaned sidecar holding the fixed port 4820 (after an unclean ClosedLoop exit) is reclaimed at boot, instead of losing the bind, burning 5 backoff retries, and degrading to a blank dashboard.

  • New agent-monitor-port-reconcile.ts — best-effort holder lookup (lsof+ps; macOS/Linux, no-op on win32) and a pure three-guard classifier:
    • foreign (different uid, or no agent-monitor/server/index.js marker) → leave it, log a clear diagnostic (replaces the silent degrade).
    • orphan (ours AND (PPID 1 OR pid matches our PID file)) → silently kill. The ~90% case.
    • live (ours, real parent still running) → consent dialog with PID + path + explicit Kill It; never killed without consent.
  • PID file at userData/agent-monitor/sidecar.pid — written on confirmed-ready, removed on clean shutdown; higher-confidence orphan proof.
  • agent-monitor-process-utils.ts — shared isProcessAlive / signalProcess, extracted from the sidecar's duplicated killGroup/isRunning.
  • Reconcile runs in launch() before spawn(); a declined live-instance stops the supervisor (no retry storm). Never throws into boot.

Implements FEA-1450 (plan PLN-769).

Review hardening (independent review, all Low)

  • TOCTOU/PID-reuse guard: re-verify same-pid + still-ours immediately before signaling.
  • parsePsLine parses only the first data line.
  • Live-instance dialog attaches to the main BrowserWindow (no free-floating window behind the app on macOS).

Test plan

  • typecheck + eslint clean
  • Unit: classifier truth table, lsof/ps parsers, PID-file tolerance (18 cases)
  • Related suites: wiring-static / shutdown / lifecycle / no-bare-console (39 cases)
  • Real-process E2E on a safe high port (not 4820): orphan (PPID 1) → killed + port freed; live → blocked-live + survives decline; foreign → untouched
  • Clean-machine packaged-DMG smoke (reviewer to confirm in CI/manual)

🤖 Generated with Claude Code

Andrew Eye and others added 2 commits May 29, 2026 10:15
A sidecar reparented to PID 1 after an unclean ClosedLoop exit keeps
holding the fixed port 4820, so the next launch loses the bind, burns 5
backoff retries, and degrades to a blank "no monitor" dashboard. Add a
detect-and-reconcile preflight that runs before the bind.

- New agent-monitor-port-reconcile.ts: best-effort holder lookup (lsof+ps,
  macOS/Linux; no-op on win32) plus a pure three-guard classifier —
  foreign (leave + clear diagnostic), orphan (PPID 1 or PID-file match →
  silent kill, the 90% case), live (other running instance → consent
  dialog with PID/path + explicit "Kill It").
- PID file at userData/agent-monitor/sidecar.pid: written on confirmed
  ready, removed on clean shutdown; gives higher-confidence orphan proof.
- Wire reconcile into AgentMonitorSidecar.launch() before spawn; a declined
  live-instance stops the supervisor instead of retry-storming the port.
- Extract shared isProcessAlive/signalProcess into agent-monitor-process-
  utils.ts; sidecar stop() now reuses them (drops duplicated killGroup/
  isRunning).

Testing:
- New test/agent-monitor-port-reconcile.test.ts: classifier truth table
  (foreign/orphan/live), lsof/ps parsers, PID-file round-trip + tolerance
  of missing/garbage files (17 cases, all green).
- typecheck + eslint clean; related wiring/shutdown/lifecycle suites pass
  (39 cases). Manual orphan-repro to follow before PR.

Risks:
- Process-killing code: orphan path is gated on uid + command-marker match
  AND (PPID 1 or PID-file match); live path never kills without explicit
  consent; foreign processes are never touched.
- Reclaim is best-effort and wrapped so it can never block boot; win32 keeps
  the existing degrade behavior (out of scope for v1).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address findings from independent review (all Low severity):

- TOCTOU/PID-reuse guard: re-verify the port is still held by the same pid
  AND still ours (uid + command marker) immediately before signaling, so a
  recycled pid cannot be killed if the orphan exited after classification.
- parsePsLine: parse only the first data line, so a trailing/wrapped newline
  from an exotic `ps` cannot defeat the single-line regex.
- Live-instance dialog: attach to the focused/main BrowserWindow so the
  consent prompt is a sheet on top of the app instead of a free-floating
  window that can hide behind it on macOS.

Testing:
- typecheck + eslint clean; reconcile test suite green (18 cases, adds a
  first-line-wins parse regression).

Risks:
- The TOCTOU re-check adds one extra lsof+ps right before kill (cheap); on
  mismatch it aborts the kill and proceeds to the normal bind/degrade path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@aeyeCEO aeyeCEO requested a review from a team May 29, 2026 17:37
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb2a4e1033

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const ownedByUs =
holder.uid === selfUid && holder.command.includes(SIDECAR_COMMAND_MARKER);
if (!ownedByUs) {
return "foreign";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require dead-parent proof before auto-killing recorded sidecars

When a second ClosedLoop instance starts while an existing instance is still running, both instances share the same userData PID file written by the first sidecar, so recordedPid matches the live port holder. This branch then classifies it as orphan solely because of that match and reconcileAgentMonitorPort kills it without showing the live-instance consent dialog, even though the holder's PPID is still a real Electron parent. The PID-file guard needs to be combined with evidence the parent is gone, or it will silently terminate another live app instance.

Useful? React with 👍 / 👎.

}
const lsofOut = await runCapture(
"lsof",
["-nP", `-iTCP@127.0.0.1:${port}`, "-sTCP:LISTEN", "-t"],
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findPortHolder filters lsof with -iTCP@127.0.0.1:port. Our own sidecar binds loopback so that matches fine, but a foreign holder bound to 0.0.0.0/* (a stray python -m http.server, a leftover dev server) shows up as *:4820 and won't match the @127.0.0.1 filter. So findPortHolder returns null and you fall straight through to spawn, EADDRINUSE, and the same 5-retry blank-dashboard degrade this PR opens by complaining about. The "clear diagnostic replaces the silent degrade" claim only holds for holders bound to 127.0.0.1. Dropping the address and matching on :port alone, then letting classifyHolder decide ours vs foreign, closes that.

// previous unclean exit so we don't burn the restart budget and degrade to
// a blank dashboard. A dialog (live-instance case) can take arbitrary time,
// so re-check our lifecycle flags after it returns.
const outcome = await this.reconcilePort();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconcilePort runs on every launch(), and launch() is also the crash-restart path, not just boot. So if our child dies mid-session and another instance grabbed 4820 during the backoff window, classifyHolder returns live and this fires the Kill It modal while the user is working. Intended? I'd expect the consent prompt only at boot, not as a surprise popup during recovery.

Copy link
Copy Markdown
Contributor

@thadeusb thadeusb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detect-and-reconcile is the right shape and the orphan reclaim is the real win. Two gaps on the foreign side: a 0.0.0.0-bound holder slips past the lsof filter so it still silent-degrades, and the consent dialog can ambush the user on a mid-session restart instead of only at boot. Codex's shared-PID-file point stands too.

P1 (Codex bot): orphan classification no longer accepts a PID-file match
as sufficient proof. Parallel worktrees share one userData dir (hence one
sidecar.pid), so a LIVE instance's holder pid can match the recorded pid;
the old `ppid===1 || matchesPidFile` rule would silently kill that running
instance. Orphan now strictly requires a dead parent (ppid===1). The PID
file is kept only to annotate the reclaim log (our prior instance vs
another worktree's orphan), never to gate the kill.

P2 (thadeusb): findPortHolder now matches `-iTCP:<port>` instead of
`-iTCP@127.0.0.1:<port>`, so a foreign holder bound to 0.0.0.0/* is
detected and surfaced with a clear diagnostic instead of falling through
to the silent EADDRINUSE degrade. classifyHolder still decides ours
(loopback) vs foreign.

P2 (thadeusb): the live-instance consent dialog is now boot-only.
launch(interactive) defaults true for user-initiated starts; the
crash-restart path calls launch(false), so a live holder grabbed during
background recovery no longer pops a "Kill It" modal mid-session — it just
stops the supervisor (blocked-live).

Testing:
- typecheck + eslint clean; reconcile unit suite green (17 cases; the
  PID-file-match case now asserts `live`, locking the P1 fix).
- Real-process E2E (safe port, not 4820), 12/12: orphan (ppid 1) killed +
  port freed; LIVE instance whose pid is in the shared PID file SURVIVES
  as blocked-live; 0.0.0.0-bound foreign holder detected + left untouched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@aeyeCEO
Copy link
Copy Markdown
Contributor Author

aeyeCEO commented May 30, 2026

Thanks for the review — all three addressed in 0002c29.

P1 (Codex, silent-kill of a live instance): Fixed. classifyHolder no longer treats a PID-file match as orphan proof. Parallel worktrees share one userData dir (one sidecar.pid), so a live instance's holder pid can match the recorded pid — the old ppid===1 || matchesPidFile would silently kill it. Orphan now strictly requires a dead parent (ppid===1); the PID file is kept only to annotate the reclaim log, never to gate the kill. Added a regression test asserting a live-parent holder classifies live even when the PID file matches, plus a real-process E2E confirming such an instance survives (blocked-live).

P2 (thadeusb, non-loopback binds): Fixed. findPortHolder now matches -iTCP:<port> instead of -iTCP@127.0.0.1:<port>, so a foreign holder bound to 0.0.0.0/* is detected and surfaced with a clear diagnostic instead of falling through to the silent EADDRINUSE degrade. classifyHolder still decides ours (loopback) vs foreign. E2E now covers a 0.0.0.0-bound holder.

P2 (thadeusb, modal during recovery): Fixed. The consent dialog is boot-only. launch(interactive) defaults true for user-initiated starts; the crash-restart path calls launch(false), so a live holder grabbed during background recovery no longer pops a "Kill It" modal mid-session — it stops the supervisor (blocked-live) instead.

Local gate: typecheck + eslint clean; reconcile unit suite green (17 cases); real-process E2E 12/12.

The prior commit updated classifyHolder to a 2-arg signature (dropped the
recordedPid orphan trigger) but the test's case table still passed a third
arg and asserted the old "pid matches PID file -> orphan" behavior, so the
suite failed (live !== orphan). Update the table to the 2-arg signature and
assert the P1 invariant: a live-parent holder is `live`, never `orphan`.

Testing: reconcile suite green (16 cases, fail 0); typecheck clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants