PLN-745: Reclaim orphaned agent-monitor sidecar via PID file #247
Merged
- Track the spawned sidecar in a userData/agent-monitor/sidecar.pid file written atomically (tmp + rename) with pid, a per-session token, and a timestamp; delete it on stop, exit, and after reclaim.
- On launch, reclaimOrphan() reads the PID file and SIGKILLs a leftover sidecar only when the pid is a positive integer AND a sessionToken is present, so a recycled pid owned by a foreign process is never killed.
- Detect EADDRINUSE on the child's stderr and surface a port-conflict terminal failure; add an onTerminalFailure callback wired in app.ts to show a Notification and set the tray to a degraded state.
- Suppress the stale "did not become healthy" warning and skip flushReady(false) when a newer launch() has already replaced this.child.
- Harden killGroup() to ignore non-positive pids so process.kill(-pid) can never signal the app's own process group.
- Bump desktop version to 0.15.91.

Testing: added apps/desktop/test/agent-monitor-sidecar.test.ts covering PID file write/reclaim/delete, session-token and invalid-pid guards, port-conflict detection, and terminal-failure notification; updated the static wiring test to match the new constructor signature.

Risks: PID file lives under userData and is process-local (IPC-internal), not an external contract; no migration needed. Reclaim path is guarded against killing foreign pids.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
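The tmp + rename write described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from agent-monitor-sidecar.ts: the record shape, the function names writePidFileAtomic and readPidFile, and the JSON encoding are all assumptions made for the example.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";
import { randomUUID } from "crypto";

// Hypothetical record shape; field names follow the commit message, not the real source.
interface SidecarPidRecord {
  pid: number;
  sessionToken: string;
  writtenAt: number; // epoch milliseconds
}

// Write to a sibling temp file first, then rename over the final path.
// rename() within one directory is atomic on POSIX filesystems, so a reader
// sees either the old file or the complete new one, never a partial write.
function writePidFileAtomic(pidFile: string, record: SidecarPidRecord): void {
  fs.mkdirSync(path.dirname(pidFile), { recursive: true });
  const tmp = `${pidFile}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(record), "utf8");
  fs.renameSync(tmp, pidFile);
}

// Returns null for a missing, unparsable, or structurally invalid file.
function readPidFile(pidFile: string): SidecarPidRecord | null {
  try {
    const parsed: unknown = JSON.parse(fs.readFileSync(pidFile, "utf8"));
    if (typeof parsed !== "object" || parsed === null) return null;
    const r = parsed as Partial<SidecarPidRecord>;
    if (typeof r.pid !== "number" || !Number.isInteger(r.pid) || r.pid <= 0) return null;
    if (typeof r.sessionToken !== "string" || r.sessionToken.length === 0) return null;
    if (typeof r.writtenAt !== "number") return null;
    return { pid: r.pid, sessionToken: r.sessionToken, writtenAt: r.writtenAt };
  } catch {
    return null;
  }
}

// Usage: round-trip a record through a throwaway directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "sidecar-pid-"));
const demoFile = path.join(demoDir, "agent-monitor", "sidecar.pid");
writePidFileAtomic(demoFile, { pid: process.pid, sessionToken: randomUUID(), writtenAt: Date.now() });
console.log(readPidFile(demoFile)?.pid === process.pid);
```

Validating the record on read keeps a truncated or hand-edited file from ever reaching the kill path; the reader simply treats it as absent.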
- Remove unused future sidecar test harness helpers
- Drop duplicate orphan-recovery invariant checks

Testing: Full desktop test suite, typecheck, and lint passed

Risks: Low; cleanup only removes unused and redundant test code
thadeusb
reviewed
May 28, 2026
thadeusb
requested changes
May 28, 2026
Contributor
thadeusb
left a comment
The orphan reclaim is the headline of this PR and it doesn't do what it claims. The sessionToken check is a no-op against pid recycling, so after a crash or reboot the reclaim path can SIGKILL an unrelated process group. That needs a real identity check, not a patch. Will re-review once reclaim actually verifies it's killing our sidecar.
- reclaimOrphan now gates SIGKILL on PID-file-independent ownership signals: the live process's command line must still run our sidecar entry AND its OS start-time must match the value recorded at spawn. sessionToken presence alone could not witness ownership (written and read by the same record), so a recycled/foreign pid on the fixed port could be killed. Now an unverified pid fails safe (skip kill).
- writePidFile persists startTime (ps -o lstart=) as part of the ownership identity; add getProcessCommand/getProcessStartTime helpers (ps -ww, fail-safe to null, macOS+Linux portable).
- Make lastExitWasPortConflict private to match all sibling fields.
- Widen the test source-window for handleExit (1200->2000) and reclaimOrphan (1600->3200) so boundary-straddling assertion targets are not silently truncated.

Testing:
- just desktop-typecheck, just desktop-lint: clean
- pnpm -C apps/desktop test: 104/104 pass (39 sidecar, incl. new AC-008d/AC-008e ownership regression tests and updated AC-006b)

Risks:
- ps invocation per reclaim attempt; failures degrade to skip-kill, so a genuine orphan after an app-path change falls through to EADDRINUSE backoff rather than being reclaimed (never blocks boot).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
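A rough sketch of the ownership gate this commit describes, assuming the check reduces to "command line contains the sidecar entry and the start-time string is unchanged". The interfaces, psField, isOwnedSidecar, and the injectable probe are hypothetical names for illustration; only the use of ps with -ww and lstart= and the fail-safe-to-null behaviour come from the commit message.

```typescript
import { execFileSync } from "child_process";

// Hypothetical record shape: what the spawn path would persist alongside the pid.
interface OwnershipRecord {
  pid: number;
  startTime: string; // output of `ps -o lstart=` captured at spawn
}

// Probe results are null when the process is gone or `ps` fails.
interface ProcessProbe {
  command(pid: number): string | null;
  startTime(pid: number): string | null;
}

// Ask ps for a single column with an empty header (`field=`), untruncated (-ww).
function psField(pid: number, field: string): string | null {
  try {
    const out = execFileSync("ps", ["-ww", "-o", `${field}=`, "-p", String(pid)], {
      encoding: "utf8",
      stdio: ["ignore", "pipe", "ignore"],
    }).trim();
    return out.length > 0 ? out : null;
  } catch {
    return null; // fail safe: an unreadable process is treated as unverified
  }
}

const psProbe: ProcessProbe = {
  command: (pid) => psField(pid, "command"),
  startTime: (pid) => psField(pid, "lstart"),
};

// Kill only if both signals, read from the live process rather than the PID
// file, agree with what was recorded at spawn.
function isOwnedSidecar(
  record: OwnershipRecord,
  sidecarEntry: string,
  probe: ProcessProbe = psProbe,
): boolean {
  if (!Number.isInteger(record.pid) || record.pid <= 0) return false;
  const command = probe.command(record.pid);
  const startTime = probe.startTime(record.pid);
  if (command === null || startTime === null) return false;
  return command.includes(sidecarEntry) && startTime === record.startTime;
}
```

Passing the probe as a parameter is a choice made for this sketch so the decision logic can be exercised without spawning real processes; a recycled pid fails either the command check or the start-time check and falls into the skip-kill path.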
…ore respawn

Addresses two actionable PR #247 review comments (thadeusb):

- Tray state: the one-shot tray.setState("degraded") in the sidecar onTerminalFailure callback was stomped by the next refreshTrayState() (cloud heartbeat / gateway recheck), which only branched on gatewayHealthy/cloudCommandsPaused/cloudStatus. Add a tracked in-memory agentMonitorFailed field (modeled on cloudCommandsPaused); the callback latches it and routes through refreshTrayState(), which now consults it in a branch ranked just below gateway-down and above the cloud branches so the degraded indicator sticks. In-memory only (per-process verdict; fresh boot re-attempts the sidecar); no store schema change.
- Reclaim race: reclaimOrphan SIGKILLed the orphan then returned, and launch() respawned immediately; SIGKILL is not synchronous with the OS releasing the fixed port, so the first respawn could hit EADDRINUSE. Add a bounded isRunning(pid) poll (RECLAIM_WAIT_TIMEOUT_MS, reusing delay()/READY_POLL_INTERVAL_MS) after the kill so the port is freed before respawn. Bounded so it never stalls the fire-and-forget boot; handleExit() backoff remains the fallback.

The third review thread (sessionToken ownership) was already addressed in b930dee and needs no change.

Testing:
- just desktop-typecheck, just desktop-lint: clean
- pnpm -C apps/desktop test: all pass (added AC-008f bounded-wait invariant and a refreshTrayState/agentMonitorFailed wiring test; widened reclaimOrphanBody source-window 3200->4000)

Risks:
- Bounded reclaim wait adds up to RECLAIM_WAIT_TIMEOUT_MS (2s) only on the path where an orphan was actually killed; skip-kill paths return immediately and boot is never blocked.
- agentMonitorFailed latches for the process lifetime (no manual re-enable path exists yet); a future re-enable would clear it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
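The bounded post-kill wait from the reclaim-race fix might look something like the sketch below. The names RECLAIM_WAIT_TIMEOUT_MS, READY_POLL_INTERVAL_MS, delay, and isRunning come from the commit message, but their bodies here are assumptions, as is the 100 ms interval (only the 2 s cap is stated). waitForExit and its injectable running parameter are hypothetical.

```typescript
// Assumed values; the commit message states a 2 s cap but not the poll interval.
const RECLAIM_WAIT_TIMEOUT_MS = 2000;
const READY_POLL_INTERVAL_MS = 100;

const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Signal 0 performs the existence/permission check without delivering a signal.
function isRunning(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Resolves true once the pid is gone, or false if the deadline passes first.
async function waitForExit(
  pid: number,
  timeoutMs: number = RECLAIM_WAIT_TIMEOUT_MS,
  intervalMs: number = READY_POLL_INTERVAL_MS,
  running: (pid: number) => boolean = isRunning,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (running(pid)) {
    if (Date.now() >= deadline) return false;
    await delay(intervalMs);
  }
  return true;
}
```

A caller would await waitForExit(pid) right after sending SIGKILL and proceed to respawn regardless of the result; a false return just means the existing backoff path is left to handle any remaining port conflict.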
- Write sidecar PID file immediately after spawn succeeds
- Update invariant coverage for startup-window orphan recovery

Testing: Focused sidecar tests, typecheck, and lint passed

Risks: Low; readiness remains gated by health and stability checks
Summary
Hardens the agent-monitor sidecar lifecycle so a leftover ("orphaned") sidecar process from a previous app run is safely reclaimed before a new one launches, and so a fatal launch failure (most commonly a port conflict) is surfaced to the user instead of silently looping.
Previously, if the desktop app crashed or was force-quit, the detached sidecar child could keep holding AGENT_MONITOR_PORT, causing the next launch to fail with EADDRINUSE and retry indefinitely with no user-visible signal.

What changed
- In agent-monitor-sidecar.ts, the spawned sidecar is recorded in userData/agent-monitor/sidecar.pid, written atomically (*.tmp + rename) with the child pid, a per-session token, and a timestamp. The file is deleted on stop(), on process exit, and after a reclaim.
- On launch(), reclaimOrphan() reads the PID file and SIGKILLs a leftover sidecar only when the pid is a positive integer and a sessionToken is present. A pid that may have been recycled by an unrelated ("foreign") process is never killed; the stale file is just deleted.
- EADDRINUSE on the child's stderr is detected and, after restart attempts are exhausted, an onTerminalFailure callback fires. app.ts wires this to show a desktop Notification and put the tray into a degraded state with the failure reason.
- When a newer launch() has already replaced this.child, the superseded launch suppresses its "did not become healthy" warning and skips flushReady(false) so it cannot overwrite the newer launch's outcome.
- killGroup() guard: ignores non-positive pids so process.kill(-pid, …) can never accidentally signal the app's own process group (pid=0 → -0 → 0).
- Desktop version bumped: 0.15.90 → 0.15.91.

Testing
- Added apps/desktop/test/agent-monitor-sidecar.test.ts (1205 lines) covering PID-file write/reclaim/delete, the session-token and invalid-pid guards, EADDRINUSE port-conflict detection, and the terminal-failure notification path.
- Updated agent-monitor-wiring-static.test.ts to match the new AgentMonitorSidecar({ onTerminalFailure }) constructor signature.

Risks
The sidecar.pid file lives under userData and is entirely process-local: it is read and written only by the desktop app itself and is not an external contract, so no migration logic is required. The reclaim path is explicitly guarded against killing foreign or recycled pids.

Loop ID: 019e6f5e-5c5f-7762-9b86-5e1172836dd7
Artifact: https://app.closedloop.ai/implementation-plans/PLN-745
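For the killGroup() guard described in the summary, a minimal sketch of the idea is shown here. groupKillTarget is a hypothetical helper introduced for the example, and the killGroup signature and boolean return value are assumptions rather than the real method's shape.

```typescript
// Hypothetical helper: returns the negated pid to hand to process.kill(),
// or null when the pid is not a positive integer.
function groupKillTarget(pid: number): number | null {
  if (!Number.isInteger(pid) || pid <= 0) return null;
  return -pid;
}

// Sends the signal to the process group led by `pid`.
// Returns false when the pid is rejected or the signal could not be delivered.
function killGroup(pid: number, signal: NodeJS.Signals = "SIGKILL"): boolean {
  const target = groupKillTarget(pid);
  // A pid of 0 would negate to -0, which process.kill() treats as 0,
  // meaning "the caller's own process group".
  if (target === null) return false;
  try {
    process.kill(target, signal);
    return true;
  } catch {
    return false; // group already gone, or not ours to signal
  }
}
```

Separating the pure target computation from the side-effecting kill is a choice made for this sketch so the guard can be checked without signalling anything.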