Skip to content

PLN-757: Boot-recovery PoP heartbeat revival for managed-key loops#256

Merged
aponamarev merged 5 commits into
mainfrom
symphony/pln-757
May 29, 2026
Merged

PLN-757: Boot-recovery PoP heartbeat revival for managed-key loops#256
aponamarev merged 5 commits into
mainfrom
symphony/pln-757

Conversation

@aponamarev
Copy link
Copy Markdown
Contributor

Summary

Fixes the bug where boot recovery removed live managed loops: the reconcile/refresh path had no managed-key PoP fallback, so a DESKTOP_MANAGED loop whose runner JWT had been rejected (401) was classified terminal and finalized on the next boot — even though it could have been revived.

This is the boot-path analog of PLN-740's live revival: when the runner JWT is stale, boot recovery now posts a managed-key PoP-signed heartbeat and adopts the fresh runner token the server returns, keeping the loop alive instead of tearing it down.

Implementation plan: PLN-757 (parent feature FEA-1430).

What changed

  • Provenance-aware classification (loop-status-classifier.ts)classifyLoopStatus takes an optional ClassifierProvenanceContext. A 401 for a DESKTOP_MANAGED loop with PoP available now returns pop_fallback instead of terminal. Backward-compatible: omitting the context preserves the existing 401-is-always-terminal behavior, and non-401 terminal codes (404/410/timed_out) are unaffected by provenance.
  • PoP heartbeat revival (boot-recovery.ts) — new attemptPopHeartbeatRevival() posts a managed-key PoP-signed heartbeat when the JWT-only path fails, persisting the fresh runner token (jti/expiresAt) on a successful revival. On a terminal heartbeat (e.g. 410) the loop is finalized terminal; a transient/inconclusive heartbeat finalizes as unauthorized.
  • Shared helpers extractedfinalizeAsTerminal() and buildProvenanceContext() pulled out of reconcileCloudLoopStatus. reattachLiveJob intentionally does not re-pass provenance, so the PoP fallback is consumed exactly once.
  • Clarified finalizeDeadJobs guard — only timed_out short-circuits (it is already persisted terminal inside reconcile); every other outcome, including a cloud-reported active, still finalizes because the local runner PID is already dead (the AC-007 invariant).

Testing

  • just desktop-typecheck — clean
  • eslint on changed files — clean
  • tsx --test boot-recovery.test.ts loop-status-classifier.test.ts53 passed, 0 failed
    • T-3.1 successful PoP revival, T-3.2 terminal 410, T-3.3 USER_CREATED no-PoP
    • Provenance-aware classification cases (DESKTOP_MANAGED/USER_CREATED, PoP available/unavailable, backward-compat no-context)

Risks

  • classifyLoopStatus gains an optional third argument; all existing callers pass two args and retain prior behavior.
  • PoP revival fires only for DESKTOP_MANAGED loops with signing deps present — USER_CREATED and no-PoP paths are unchanged.
  • Internal contract change only (main-process modules bundled together); no external gateway/relay/persisted-schema change.

🤖 Generated with Claude Code

- Thread provenance context into classifyLoopStatus: a 401 for a
  DESKTOP_MANAGED loop with PoP available now classifies as `pop_fallback`
  instead of `terminal`, signaling the caller to attempt managed-key revival
  before finalizing (backward-compatible — omitting provenance preserves
  401-is-always-terminal).
- Add BootRecoveryService.attemptPopHeartbeatRevival: on a rejected runner
  JWT, post a managed-key PoP heartbeat and adopt the fresh runner token when
  the server revives the loop — the boot-path analog of PLN-740's live revival.
- Extract shared finalizeAsTerminal/buildProvenanceContext helpers from
  reconcileCloudLoopStatus; reattachLiveJob no longer re-passes provenance so
  the PoP fallback is consumed exactly once.
- Clarify the finalizeDeadJobs guard: timed_out is already persisted terminal,
  every other reconcile outcome (including cloud-reported "active") still
  finalizes because the local runner PID is dead (AC-007 invariant).

Testing:
- desktop-typecheck: clean
- eslint (boot-recovery.ts, boot-recovery.test.ts): clean
- tsx --test boot-recovery.test.ts loop-status-classifier.test.ts: 53 passed,
  0 failed (T-3.1/T-3.2/T-3.3 PoP revival + provenance classification cases)

Risks:
- classifyLoopStatus gains an optional 3rd arg; all existing callers pass two
  args and keep prior behavior. PoP revival only fires for DESKTOP_MANAGED
  loops with signing deps present, so USER_CREATED and no-PoP paths are
  unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aponamarev aponamarev requested a review from a team May 29, 2026 14:45
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Comment thread apps/desktop/src/main/boot-recovery.ts
Comment thread apps/desktop/src/main/boot-recovery.ts Outdated
Copy link
Copy Markdown
Contributor

@thadeusb thadeusb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The live revival path reads right and the classifier change is clean. The thing I want eyes on is that reconcile is shared with the dead-PID finalize path, so I think the revival heartbeat is firing for loops you're about to tear down. Flagged that plus the transient-error handling inline.

…t tests

Addresses the two findings from the PR #256 bloat report:

- Extract `persistRevivalToken()` into loop-http.ts as the single source of
  truth for the revival-token write. The `revived/token` guard and the
  `expiresAt.getTime()` payload mapping were duplicated between the live
  heartbeat path (loop-heartbeat.ts runHeartbeatTick) and the boot-recovery
  PoP path (boot-recovery.ts attemptPopHeartbeatRevival, added in this branch).
  Both now call the shared helper, so the mapping cannot drift.
- Collapse three "provenance irrelevant" classifier test cases (404/410/
  timed_out × DESKTOP_MANAGED+PoP) to a single 404 representative. The
  provenance check lives entirely inside the `httpStatus === 401` branch, so
  one non-401 case proves provenance is never consulted; 404/410/timed_out
  already have dedicated no-provenance coverage.

Testing:
- desktop-typecheck: clean
- eslint (loop-http.ts, loop-heartbeat.ts, boot-recovery.ts, classifier test): clean
- tsx --test loop-http loop-heartbeat loop-status-classifier boot-recovery:
  127 passed, 0 failed

Risks:
- persistRevivalToken preserves the exact prior guard semantics (optional
  store, revived===true, token!==undefined); callers keep their own logging
  and return contracts, so live and boot revival behavior is unchanged.
- No version bump: branch is already at 0.15.99 (one patch above main's
  0.15.98), so these desktop changes are covered by the existing bump.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 29, 2026

Compatibility Smoke Test Results ⚠️

Status: skipped
Electron SHA: 98335116ed2983b1c02e017e5e8f2bebf09ae6d5
Symphony Alpha SHA (last-known-good): 24741ed9be45cc761195d7a7b6613bd30cdfce84
Note: Skipped because the stage GitHub App credentials are not configured for this workflow run.

View Actions run

aponamarev and others added 2 commits May 29, 2026 18:24
…ly on transient heartbeat

Addresses two inline review comments on PR #256, both rooted in
reconcileCloudLoopStatus being shared by reattachLiveJob (live PID) and
finalizeDeadJobs (dead PID) while firing PoP revival unconditionally.

- Thread caller liveness via an internal allowPopRevival flag (default false,
  the dead-safe value). reattachLiveJob passes true (PID confirmed running);
  finalizeDeadJobs passes false.
- Comment 1 (dead-PID revival): build provenance context only when revival is
  allowed, so a dead job's 401 classifies terminal and no revival heartbeat is
  ever POSTed — eliminating the resurrect-then-finalize zombie-loop race.
- Comment 2 (transient teardown): a non-terminal PoP heartbeat result
  (5xx/network/timeout) now returns a transient CloudLoopStatus instead of
  finalizing terminal, so reattachLiveJob re-classifies it as transient and
  reattaches conservatively, letting the heartbeat scheduler retry. Terminal
  410 case unchanged.
- Bump desktop 0.15.99 -> 0.15.100 (CI-required for apps/desktop changes).

Testing:
- just desktop-typecheck, just desktop-lint: clean
- pnpm test: 1942 tests, 0 failures (T-3.1/T-3.2/T-3.3 still green)
- New T-3.4 (table-driven 5xx + network): live loop not finalized, token
  retained, reattached, schedulers not torn down on transient heartbeat
- New T-3.5: dead-PID DESKTOP_MANAGED loop never POSTs a heartbeat; terminal
  classification clears the token and never registers the loop active

Risks:
- Internal-only change (private method param); no gateway/IPC/relay/store
  contract change, so no migration needed. Live-path teardown is now gated on
  a definitively terminal heartbeat (401/404/410); transient errors defer to
  the live heartbeat scheduler, matching the existing conservative-reattach
  policy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aponamarev aponamarev merged commit c1b165b into main May 29, 2026
5 checks passed
@aponamarev aponamarev deleted the symphony/pln-757 branch May 29, 2026 23:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants