Skip to content

PLN-745: Reclaim orphaned agent-monitor sidecar via PID file#247

Merged
aponamarev merged 8 commits into
mainfrom
symphony/pln-745
May 29, 2026
Merged

PLN-745: Reclaim orphaned agent-monitor sidecar via PID file#247
aponamarev merged 8 commits into
mainfrom
symphony/pln-745

Conversation

@aponamarev
Copy link
Copy Markdown
Contributor

Summary

Hardens the agent-monitor sidecar lifecycle so a leftover ("orphaned") sidecar process from a previous app run is safely reclaimed before a new one launches, and so a fatal launch failure (most commonly a port conflict) is surfaced to the user instead of silently looping.

Previously, if the desktop app crashed or was force-quit, the detached sidecar child could keep holding AGENT_MONITOR_PORT, causing the next launch to fail with EADDRINUSE and retry indefinitely with no user-visible signal.

What changed

  • PID-file tracking (agent-monitor-sidecar.ts): the spawned sidecar is recorded in userData/agent-monitor/sidecar.pid, written atomically (*.tmp + rename) with the child pid, a per-session token, and a timestamp. The file is deleted on stop(), on process exit, and after a reclaim.
  • Safe orphan reclaim: on each launch(), reclaimOrphan() reads the PID file and SIGKILLs a leftover sidecar only when the pid is a positive integer and a sessionToken is present. A pid that may have been recycled by an unrelated ("foreign") process is never killed — the stale file is just deleted.
  • Port-conflict detection & terminal failure: EADDRINUSE on the child's stderr is detected and, after restart attempts are exhausted, an onTerminalFailure callback fires. app.ts wires this to show a desktop Notification and put the tray into a degraded state with the failure reason.
  • Stale-launch suppression: if a newer launch() has already replaced this.child, the superseded launch suppresses its "did not become healthy" warning and skips flushReady(false) so it cannot overwrite the newer launch's outcome.
  • killGroup() guard: ignores non-positive pids so process.kill(-pid, …) can never accidentally signal the app's own process group (pid=0 → -0 → 0).
  • Version bump: desktop 0.15.90 → 0.15.91.

Testing

  • New apps/desktop/test/agent-monitor-sidecar.test.ts (1205 lines) covering PID-file write/reclaim/delete, the session-token and invalid-pid guards, EADDRINUSE port-conflict detection, and the terminal-failure notification path.
  • Updated agent-monitor-wiring-static.test.ts to match the new AgentMonitorSidecar({ onTerminalFailure }) constructor signature.

Risks

The sidecar.pid file lives under userData and is entirely process-local — it is read and written only by the desktop app itself, not an external contract, so no migration logic is required. The reclaim path is explicitly guarded against killing foreign or recycled pids.


Loop ID: 019e6f5e-5c5f-7762-9b86-5e1172836dd7
Artifact: https://app.closedloop.ai/implementation-plans/PLN-745

- Track the spawned sidecar in a userData/agent-monitor/sidecar.pid file
  written atomically (tmp + rename) with pid, a per-session token, and
  a timestamp; delete it on stop, exit, and after reclaim.
- On launch, reclaimOrphan() reads the PID file and SIGKILLs a leftover
  sidecar only when the pid is a positive integer AND a sessionToken is
  present, so a recycled pid owned by a foreign process is never killed.
- Detect EADDRINUSE on the child's stderr and surface a port-conflict
  terminal failure; add an onTerminalFailure callback wired in app.ts to
  show a Notification and set the tray to a degraded state.
- Suppress the stale "did not become healthy" warning and skip
  flushReady(false) when a newer launch() has already replaced this.child.
- Harden killGroup() to ignore non-positive pids so process.kill(-pid)
  can never signal the app's own process group.
- Bump desktop version to 0.15.91.

Testing: added apps/desktop/test/agent-monitor-sidecar.test.ts covering
PID file write/reclaim/delete, session-token and invalid-pid guards,
port-conflict detection, and terminal-failure notification; updated the
static wiring test to match the new constructor signature.

Risks: PID file lives under userData and is process-local (IPC-internal),
not an external contract; no migration needed. Reclaim path is guarded
against killing foreign pids.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aponamarev aponamarev requested a review from a team May 28, 2026 18:23
@aponamarev aponamarev added the symphony ClosedLoop Symphony automated PRs label May 28, 2026
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

- Remove unused future sidecar test harness helpers

- Drop duplicate orphan-recovery invariant checks

Testing: Full desktop test suite, typecheck, and lint passed

Risks: Low; cleanup only removes unused and redundant test code
Comment thread apps/desktop/src/main/agent-monitor-sidecar.ts Outdated
Comment thread apps/desktop/src/main/app.ts Outdated
Comment thread apps/desktop/src/main/agent-monitor-sidecar.ts
Copy link
Copy Markdown
Contributor

@thadeusb thadeusb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The orphan reclaim is the headline of this PR and it doesn't do what it claims. The sessionToken check is a no-op against pid recycling, so after a crash or reboot the reclaim path can SIGKILL an unrelated process group. That needs a real identity check, not a patch. Will re-review once reclaim actually verifies it's killing our sidecar.

aponamarev and others added 3 commits May 28, 2026 14:16
- reclaimOrphan now gates SIGKILL on PID-file-independent ownership
  signals: the live process's command line must still run our sidecar
  entry AND its OS start-time must match the value recorded at spawn.
  sessionToken presence alone could not witness ownership (written and
  read by the same record), so a recycled/foreign pid on the fixed port
  could be killed. Now an unverified pid fails safe (skip kill).
- writePidFile persists startTime (ps -o lstart=) as part of the
  ownership identity; add getProcessCommand/getProcessStartTime helpers
  (ps -ww, fail-safe to null, macOS+Linux portable).
- Make lastExitWasPortConflict private to match all sibling fields.
- Widen the test source-window for handleExit (1200->2000) and
  reclaimOrphan (1600->3200) so boundary-straddling assertion targets
  are not silently truncated.

Testing:
- just desktop-typecheck, just desktop-lint: clean
- pnpm -C apps/desktop test: 104/104 pass (39 sidecar, incl. new
  AC-008d/AC-008e ownership regression tests and updated AC-006b)

Risks:
- ps invocation per reclaim attempt; failures degrade to skip-kill, so a
  genuine orphan after an app-path change falls through to EADDRINUSE
  backoff rather than being reclaimed (never blocks boot).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ore respawn

Addresses two actionable PR #247 review comments (thadeusb):

- Tray state: the one-shot tray.setState("degraded") in the sidecar
  onTerminalFailure callback was stomped by the next refreshTrayState()
  (cloud heartbeat / gateway recheck), which only branched on
  gatewayHealthy/cloudCommandsPaused/cloudStatus. Add a tracked in-memory
  agentMonitorFailed field (modeled on cloudCommandsPaused); the callback
  latches it and routes through refreshTrayState(), which now consults it
  in a branch ranked just below gateway-down and above the cloud branches
  so the degraded indicator sticks. In-memory only (per-process verdict;
  fresh boot re-attempts the sidecar) — no store schema change.
- Reclaim race: reclaimOrphan SIGKILLed the orphan then returned, and
  launch() respawned immediately; SIGKILL is not synchronous with the OS
  releasing the fixed port, so the first respawn could hit EADDRINUSE.
  Add a bounded isRunning(pid) poll (RECLAIM_WAIT_TIMEOUT_MS, reusing
  delay()/READY_POLL_INTERVAL_MS) after the kill so the port is freed
  before respawn. Bounded so it never stalls the fire-and-forget boot;
  handleExit() backoff remains the fallback.

The third review thread (sessionToken ownership) was already addressed
in b930dee and needs no change.

Testing:
- just desktop-typecheck, just desktop-lint: clean
- pnpm -C apps/desktop test: all pass (added AC-008f bounded-wait
  invariant and a refreshTrayState/agentMonitorFailed wiring test;
  widened reclaimOrphanBody source-window 3200->4000)

Risks:
- Bounded reclaim wait adds up to RECLAIM_WAIT_TIMEOUT_MS (2s) only on
  the path where an orphan was actually killed; skip-kill paths return
  immediately and boot is never blocked.
- agentMonitorFailed latches for the process lifetime (no manual
  re-enable path exists yet); a future re-enable would clear it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Write sidecar PID file immediately after spawn succeeds

- Update invariant coverage for startup-window orphan recovery

Testing: Focused sidecar tests, typecheck, and lint passed

Risks: Low; readiness remains gated by health and stability checks
@aponamarev aponamarev requested a review from thadeusb May 28, 2026 19:36
@aponamarev aponamarev merged commit b21b976 into main May 29, 2026
5 checks passed
@aponamarev aponamarev deleted the symphony/pln-745 branch May 29, 2026 23:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

symphony ClosedLoop Symphony automated PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants