PLN-757: Boot-recovery PoP heartbeat revival for managed-key loops #256
- Thread provenance context into classifyLoopStatus: a 401 for a DESKTOP_MANAGED loop with PoP available now classifies as `pop_fallback` instead of `terminal`, signaling the caller to attempt managed-key revival before finalizing (backward-compatible: omitting provenance preserves 401-is-always-terminal).
- Add BootRecoveryService.attemptPopHeartbeatRevival: on a rejected runner JWT, post a managed-key PoP heartbeat and adopt the fresh runner token when the server revives the loop (the boot-path analog of PLN-740's live revival).
- Extract shared finalizeAsTerminal/buildProvenanceContext helpers from reconcileCloudLoopStatus; reattachLiveJob no longer re-passes provenance so the PoP fallback is consumed exactly once.
- Clarify the finalizeDeadJobs guard: timed_out is already persisted terminal, every other reconcile outcome (including cloud-reported "active") still finalizes because the local runner PID is dead (AC-007 invariant).
Testing:
- desktop-typecheck: clean
- eslint (boot-recovery.ts, boot-recovery.test.ts): clean
- tsx --test boot-recovery.test.ts loop-status-classifier.test.ts: 53 passed, 0 failed (T-3.1/T-3.2/T-3.3 PoP revival + provenance classification cases)
Risks:
- classifyLoopStatus gains an optional 3rd arg; all existing callers pass two args and keep prior behavior. PoP revival only fires for DESKTOP_MANAGED loops with signing deps present, so USER_CREATED and no-PoP paths are unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
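For readers following the thread, the classifier rule described in this commit can be sketched roughly as below. This is a minimal illustration, not the code in `loop-status-classifier.ts`: the function name `classifyLoopStatusSketch`, the `timedOut` parameter, the `popAvailable` field, and the `transient`/`active` branches are assumptions made for the example. Only the names `ClassifierProvenanceContext`, `DESKTOP_MANAGED`, `USER_CREATED`, `pop_fallback`, and `terminal` come from the PR text.

```typescript
// Hypothetical shapes; the PR does not show the real type definitions.
type Provenance = "DESKTOP_MANAGED" | "USER_CREATED";
type Classification = "active" | "terminal" | "pop_fallback" | "transient";

interface ClassifierProvenanceContext {
  provenance: Provenance;
  popAvailable: boolean; // assumed flag meaning "PoP signing deps are present"
}

function classifyLoopStatusSketch(
  httpStatus: number,
  timedOut: boolean,
  ctx?: ClassifierProvenanceContext, // optional third argument
): Classification {
  if (timedOut) return "terminal";
  if (httpStatus === 401) {
    // Provenance is consulted only inside the 401 branch.
    const canRevive =
      ctx?.provenance === "DESKTOP_MANAGED" && ctx?.popAvailable === true;
    return canRevive ? "pop_fallback" : "terminal";
  }
  if (httpStatus === 404 || httpStatus === 410) return "terminal";
  if (httpStatus >= 500) return "transient"; // assumed handling of 5xx
  return "active";
}

// Two-argument callers keep the old behavior: a 401 is always terminal.
console.log(classifyLoopStatusSketch(401, false)); // terminal
// With managed provenance and PoP available, a 401 asks for revival first.
console.log(
  classifyLoopStatusSketch(401, false, {
    provenance: "DESKTOP_MANAGED",
    popAvailable: true,
  }),
); // pop_fallback
```

Keeping the context optional is what makes the change backward-compatible: no existing call site has to change, and the new result only appears for callers that opt in by passing provenance.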
thadeusb
left a comment
The live revival path reads right and the classifier change is clean. The thing I want eyes on is that reconcile is shared with the dead-PID finalize path, so I think the revival heartbeat is firing for loops you're about to tear down. Flagged that plus the transient-error handling inline.
…t tests
Addresses the two findings from the PR #256 bloat report:
- Extract `persistRevivalToken()` into loop-http.ts as the single source of truth for the revival-token write. The `revived/token` guard and the `expiresAt.getTime()` payload mapping were duplicated between the live heartbeat path (loop-heartbeat.ts runHeartbeatTick) and the boot-recovery PoP path (boot-recovery.ts attemptPopHeartbeatRevival, added in this branch). Both now call the shared helper, so the mapping cannot drift.
- Collapse three "provenance irrelevant" classifier test cases (404/410/timed_out × DESKTOP_MANAGED+PoP) to a single 404 representative. The provenance check lives entirely inside the `httpStatus === 401` branch, so one non-401 case proves provenance is never consulted; 404/410/timed_out already have dedicated no-provenance coverage.
Testing:
- desktop-typecheck: clean
- eslint (loop-http.ts, loop-heartbeat.ts, boot-recovery.ts, classifier test): clean
- tsx --test loop-http loop-heartbeat loop-status-classifier boot-recovery: 127 passed, 0 failed
Risks:
- persistRevivalToken preserves the exact prior guard semantics (optional store, revived===true, token!==undefined); callers keep their own logging and return contracts, so live and boot revival behavior is unchanged.
- No version bump: branch is already at 0.15.99 (one patch above main's 0.15.98), so these desktop changes are covered by the existing bump.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
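The guard semantics listed under Risks in this commit (optional store, `revived === true`, `token !== undefined`, and the `expiresAt.getTime()` mapping) can be sketched as follows. This is an illustration under stated assumptions, not the helper in `loop-http.ts`: the interfaces `RunnerTokenStore`, `RevivalResult`, and `RevivalToken`, the `save` method, the `loopId` parameter, and the boolean return value are all hypothetical.

```typescript
// Hypothetical types; the PR names only persistRevivalToken, jti and expiresAt.
interface RevivalToken {
  token: string;
  jti: string;
  expiresAt: Date;
}

interface RevivalResult {
  revived: boolean;
  token?: RevivalToken;
}

interface StoredRunnerToken {
  token: string;
  jti: string;
  expiresAtMs: number; // epoch milliseconds, from expiresAt.getTime()
}

interface RunnerTokenStore {
  save(loopId: string, payload: StoredRunnerToken): void;
}

// Returns true only when a token was actually written (assumed contract).
function persistRevivalTokenSketch(
  store: RunnerTokenStore | undefined,
  loopId: string,
  result: RevivalResult,
): boolean {
  if (store === undefined) return false; // the store is optional
  if (result.revived !== true || result.token === undefined) return false;
  store.save(loopId, {
    token: result.token.token,
    jti: result.token.jti,
    expiresAtMs: result.token.expiresAt.getTime(),
  });
  return true;
}

// Usage with an in-memory store standing in for the real one.
const saved = new Map<string, StoredRunnerToken>();
const memoryStore: RunnerTokenStore = {
  save: (loopId, payload) => {
    saved.set(loopId, payload);
  },
};
persistRevivalTokenSketch(memoryStore, "loop-1", {
  revived: true,
  token: { token: "jwt", jti: "abc", expiresAt: new Date(1_700_000_000_000) },
});
console.log(saved.get("loop-1")?.expiresAtMs); // 1700000000000
```

With the guard and the mapping in one function, the live heartbeat path and the boot-recovery path cannot diverge on what counts as a persistable revival.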
…ly on transient heartbeat
Addresses two inline review comments on PR #256, both rooted in reconcileCloudLoopStatus being shared by reattachLiveJob (live PID) and finalizeDeadJobs (dead PID) while firing PoP revival unconditionally.
- Thread caller liveness via an internal allowPopRevival flag (default false, the dead-safe value). reattachLiveJob passes true (PID confirmed running); finalizeDeadJobs passes false.
- Comment 1 (dead-PID revival): build provenance context only when revival is allowed, so a dead job's 401 classifies terminal and no revival heartbeat is ever POSTed, eliminating the resurrect-then-finalize zombie-loop race.
- Comment 2 (transient teardown): a non-terminal PoP heartbeat result (5xx/network/timeout) now returns a transient CloudLoopStatus instead of finalizing terminal, so reattachLiveJob re-classifies it as transient and reattaches conservatively, letting the heartbeat scheduler retry. Terminal 410 case unchanged.
- Bump desktop 0.15.99 -> 0.15.100 (CI-required for apps/desktop changes).
Testing:
- just desktop-typecheck, just desktop-lint: clean
- pnpm test: 1942 tests, 0 failures (T-3.1/T-3.2/T-3.3 still green)
- New T-3.4 (table-driven 5xx + network): live loop not finalized, token retained, reattached, schedulers not torn down on transient heartbeat
- New T-3.5: dead-PID DESKTOP_MANAGED loop never POSTs a heartbeat; terminal classification clears the token and never registers the loop active
Risks:
- Internal-only change (private method param); no gateway/IPC/relay/store contract change, so no migration needed. Live-path teardown is now gated on a definitively terminal heartbeat (401/404/410); transient errors defer to the live heartbeat scheduler, matching the existing conservative-reattach policy.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
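The two fixes in this commit, gating revival on caller liveness and treating inconclusive heartbeats as transient, can be sketched together as below. This is an illustrative model, not the code in `boot-recovery.ts`: `resolveRejectedJwtSketch`, `HeartbeatOutcome`, and `ReconcileResult` are hypothetical names, and the heartbeat is injected as a synchronous callback only to keep the example short (the real call is a network POST). The `allowPopRevival` flag, its default of `false`, and the 401/404/410 terminal set are taken from the commit message.

```typescript
// Hypothetical result shapes for a PoP heartbeat attempt.
type HeartbeatOutcome =
  | { kind: "revived"; token: string }
  | { kind: "http_error"; status: number }
  | { kind: "network_error" };

type ReconcileResult = "revived" | "terminal" | "transient";

// Only these statuses are treated as definitively terminal.
function isDefinitivelyTerminal(status: number): boolean {
  return status === 401 || status === 404 || status === 410;
}

function resolveRejectedJwtSketch(
  sendPopHeartbeat: () => HeartbeatOutcome,
  allowPopRevival = false, // dead-safe default
): ReconcileResult {
  // Dead-PID callers never POST, so a loop cannot be revived and then finalized.
  if (!allowPopRevival) return "terminal";

  const outcome = sendPopHeartbeat();
  if (outcome.kind === "revived") return "revived";
  if (outcome.kind === "http_error" && isDefinitivelyTerminal(outcome.status)) {
    return "terminal";
  }
  // 5xx, network failure, timeout: leave teardown to the heartbeat scheduler.
  return "transient";
}

// A live-PID caller hitting a 503 keeps the loop attached.
console.log(
  resolveRejectedJwtSketch(() => ({ kind: "http_error", status: 503 }), true),
); // transient
```

Defaulting the flag to `false` means a caller that forgets to pass it gets the conservative behavior: no heartbeat is sent and the 401 stays terminal.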
Summary
Fixes the bug where boot recovery removed live managed loops: the reconcile/refresh path had no managed-key PoP fallback, so a `DESKTOP_MANAGED` loop whose runner JWT had been rejected (401) was classified terminal and finalized on the next boot, even though it could have been revived.
This is the boot-path analog of PLN-740's live revival: when the runner JWT is stale, boot recovery now posts a managed-key PoP-signed heartbeat and adopts the fresh runner token the server returns, keeping the loop alive instead of tearing it down.
Implementation plan: PLN-757 (parent feature FEA-1430).
What changed
- `loop-status-classifier.ts`: `classifyLoopStatus` takes an optional `ClassifierProvenanceContext`. A 401 for a `DESKTOP_MANAGED` loop with PoP available now returns `pop_fallback` instead of `terminal`. Backward-compatible: omitting the context preserves the existing 401-is-always-terminal behavior, and non-401 terminal codes (404/410/timed_out) are unaffected by provenance.
- `boot-recovery.ts`: new `attemptPopHeartbeatRevival()` posts a managed-key PoP-signed heartbeat when the JWT-only path fails, persisting the fresh runner token (jti/expiresAt) on a successful revival. On a terminal heartbeat (e.g. 410) the loop is finalized terminal; a transient/inconclusive heartbeat finalizes as unauthorized.
- `finalizeAsTerminal()` and `buildProvenanceContext()` pulled out of `reconcileCloudLoopStatus`. `reattachLiveJob` intentionally does not re-pass provenance, so the PoP fallback is consumed exactly once.
- `finalizeDeadJobs` guard: only `timed_out` short-circuits (it is already persisted terminal inside reconcile); every other outcome, including a cloud-reported `active`, still finalizes because the local runner PID is already dead (the AC-007 invariant).
Testing
- `just desktop-typecheck`: clean
- `eslint` on changed files: clean
- `tsx --test boot-recovery.test.ts loop-status-classifier.test.ts`: 53 passed, 0 failed
Risks
- `classifyLoopStatus` gains an optional third argument; all existing callers pass two args and retain prior behavior.
- PoP revival only fires for `DESKTOP_MANAGED` loops with signing deps present; `USER_CREATED` and no-PoP paths are unchanged.
🤖 Generated with Claude Code