fix: historian livelock starves pending drops in active sessions#141
fix: historian livelock starves pending drops in active sessions#141tracycam wants to merge 1 commit into
Conversation
…ent drop starvation In an active session the protected tail's newest message changes every turn, so a boundary snapshot captured at trigger time fails validation on the "last ordinal id changed" check by the time the historian runs — even though the eligible head it would compact is untouched. The runner no-op'd forever while the trigger refired each turn, queued drop ops accumulated (observed: 27 consecutive stale no-ops, 179 pending drops, zero reduction). - compartment-runner-incremental: on a stale_snapshot validation result, re-resolve the boundary once from current session state and adopt the fresh snapshot when it still exposes a runnable head. The refreshed snapshot recomputes protectedTailStart/eligibleEnd from live messages, so the protected-tail guarantee is preserved (the head can never include a message now in the live tail). Historian makes real progress, publishes, and queues drops so the accumulated pending ops drain. - startCompartmentAgent: a synchronous runner no-op cleared compartmentInProgress but left the activeRuns registration alive until a microtask, so the same transform pass deferred queued drop ops for a run that already finished. Signal onHistorianRunStarted at the real-run commit point and drop the registration synchronously when the runner no-ops. - Regression tests: stale-tail snapshot re-derives + publishes instead of no-op'ing; synchronous no-op clears the active-run belief in the same pass.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR addresses a production livelock in the magic-context historian runner by ensuring stale protected-tail snapshots can be refreshed at run time and by preventing synchronous no-op runs from leaving a misleading “active run” registration that starves queued drop operations.
Changes:
- Refresh stale protected-tail boundary snapshots at run time when the eligible head is still runnable.
- Add an
onHistorianRunStartedcallback sostartCompartmentAgentcan distinguish a real run from synchronous no-ops and clearactiveRunsimmediately in the no-op case. - Add regression tests covering stale-tail refresh and synchronous no-op active-run cleanup.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/plugin/src/hooks/magic-context/compartment-runner.ts | Tracks whether a real historian run started and synchronously clears activeRuns for synchronous no-ops. |
| packages/plugin/src/hooks/magic-context/compartment-runner-incremental.ts | Refreshes stale snapshots at run time and signals onHistorianRunStarted before the main awaited work. |
| packages/plugin/src/hooks/magic-context/compartment-runner-types.ts | Extends runner deps with the onHistorianRunStarted callback and documents its intended semantics. |
| packages/plugin/src/hooks/magic-context/compartment-runner.test.ts | Adds regression tests to validate stale-tail refresh and synchronous no-op active-run cleanup. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -143,6 +150,20 @@ export function startCompartmentAgent(deps: CompartmentRunnerDeps): void { | |||
| } | |||
| }); | |||
| activeRuns.set(deps.sessionId, { promise, published: false }); | |||
| // If the runner no-op'd synchronously (stale/empty snapshot, nothing to | |||
| // compact, drain-quota), it returned before signalling onHistorianRunStarted | |||
| // and before any `await`, so `promise` is already settling. It cleared | |||
| // compartmentInProgress in its own finally, but the activeRuns entry above | |||
| // would otherwise survive (cleared only by the microtask-scheduled | |||
| // promise.finally) and make the SAME transform pass treat a non-running | |||
| // historian as in-progress — deferring queued drop ops and starving them | |||
| // turn after turn (the production livelock). Drop the registration | |||
| // synchronously so pending ops can materialize this pass. The promise.finally | |||
| // below still runs for interval/lease cleanup; its `=== promise` guard makes | |||
| // the (now redundant) delete a no-op. | |||
| if (!realRunStarted && activeRuns.get(deps.sessionId)?.promise === promise) { | |||
| activeRuns.delete(deps.sessionId); | |||
| } | |||
| /** | ||
| * Called synchronously the moment the runner commits to a REAL historian | ||
| * pass — after every no-op early-return (stale/empty snapshot, nothing to | ||
| * compact, drain-quota) and immediately before the first `await`. Lets | ||
| * `startCompartmentAgent` distinguish a fire-and-forget run that actually | ||
| * started from one that no-op'd synchronously, so a no-op does not leave | ||
| * the rest of the transform pass believing a historian is in progress | ||
| * (which would defer queued drop ops — the production livelock). | ||
| */ | ||
| onHistorianRunStarted?: () => void; |
| // Past every synchronous no-op early-return and immediately before the | ||
| // first `await` (client.session.get below): we are now committed to a | ||
| // real historian pass. Signal the caller so startCompartmentAgent keeps | ||
| // the active-run registration; a synchronous no-op above never reaches | ||
| // here, so its lingering registration is cleared instead of blocking the | ||
| // same transform pass's pending-op drain. | ||
| deps.onHistorianRunStarted?.(); |
| // Past every synchronous no-op early-return and immediately before the | ||
| // first `await` (client.session.get below): we are now committed to a | ||
| // real historian pass. Signal the caller so startCompartmentAgent keeps | ||
| // the active-run registration; a synchronous no-op above never reaches | ||
| // here, so its lingering registration is cleared instead of blocking the | ||
| // same transform pass's pending-op drain. | ||
| deps.onHistorianRunStarted?.(); |
There was a problem hiding this comment.
"Synchronous no-op" framing understates the callback's actual scope
The comment says "Past every synchronous no-op early-return … immediately before the first await", but two earlier early-return paths — existingValidationError (line ~197) and chunkCoverageError (line ~371) — both reach await notifyHistorianIssue(…) without ever calling onHistorianRunStarted. They are async paths that still leave realRunStarted = false, so startCompartmentAgent will also delete their activeRuns entry synchronously.
The mechanism is correct (both paths are genuine no-ops from a historian perspective), but the comment and the matching one in compartment-runner.ts should say "past every early-return path — including the async error paths above — and immediately before the first real work await" to avoid a future reader placing a new early-return after the callback by mistake.
| if (!validation.ok && validation.reason === "stale_snapshot") { | ||
| const refreshed = resolveOpenCodeProtectedTailBoundary({ | ||
| db, | ||
| sessionId, | ||
| mode: "incremental-runner", | ||
| contextLimit: deps.currentContextLimit ?? boundarySnapshot.contextLimit, | ||
| executeThresholdPercentage: boundarySnapshot.executeThresholdPercentage, | ||
| usage: { | ||
| percentage: boundarySnapshot.usagePercentage, | ||
| inputTokens: boundarySnapshot.usageInputTokens, | ||
| }, | ||
| usageSource: boundarySnapshot.usageSource, | ||
| emergencyTailScale: boundarySnapshot.emergencyTailScale, | ||
| }); | ||
| if (hasRunnableCompartmentWindow(refreshed)) { | ||
| sessionLog( | ||
| sessionId, | ||
| `historian: refreshed stale protected-tail snapshot at run time (was: ${validation.detail ?? "stale"}) — eligible head ${refreshed.offset}-${refreshed.eligibleEndOrdinal - 1}`, | ||
| ); | ||
| boundarySnapshot = refreshed; | ||
| validation = { ok: true }; | ||
| } | ||
| } |
There was a problem hiding this comment.
Stale
usagePercentage passed to the refreshed boundary resolver
The refresh call forwards boundarySnapshot.usagePercentage and boundarySnapshot.usageInputTokens — values captured at trigger time. In the active-session livelock scenario (context growing every turn), the real usage at run time is higher than what the snapshot recorded. A lower-than-actual percentage causes resolveOpenCodeProtectedTailBoundary to compute a slightly smaller protected tail (protectedTailStart shifted toward the end), which could in principle include messages that belong to the current live tail.
In practice this is benign because the runner immediately clamps eligibleEndOrdinal = Math.min(snapshot.eligibleEndOrdinal, protectedTailStart), so nothing outside the resolved tail is ever compacted. Adding a brief inline note here — e.g., "usage is intentionally trigger-time; runner clamps eligibleEnd to protectedTailStart" — would make the safety argument self-documenting for the next reader.
Problem
In an active session, the historian can livelock and pending drop operations starve forever. Observed in production (OpenCode harness): a session accumulated 179 pending drops while active context grew past 280K tokens with zero reduction.
The cycle, repeating every turn:
compartmentInProgressflag set, historian agent startshistorian no-op: stale protected-tail snapshot (last ordinal N id changed)— the boundary snapshot is captured at trigger time, and the ONLY failing validation is the "last message in protected tail" tuple, which changes every turn in an active sessiontransform: deferring pending ops — compartment agent in progress— the no-op returns synchronously, but theactiveRunsregistration is only cleared via a microtaskpromise.finally, so the same pass still believes a run is active and defers pending opsFix (two independent defects)
stale_snapshotvalidation result, the runner re-resolves the boundary once from current session state and adopts it if it still has a runnable head.validateBoundarySnapshotitself is untouched — the refreshed snapshot recomputesprotectedTailStart/eligibleEndfrom live messages, so the compacted head can never include a message that belongs to the live protected tail.startCompartmentAgentnow registers the run at the real-run commit point (after the last synchronous no-op return, before the first await) viaonHistorianRunStarted, and drops the registration synchronously when the runner no-ops — so pending ops can materialize in the same pass.Tests
Two regression tests in
compartment-runner.test.ts:compartmentInProgress=false)bun run typecheck,bun run lint, full plugin suite pass (the two pre-existing flaky timing tests aside).Need help on this PR? Tag
/codesmithwith what you need. Autofix is disabled.Summary by cubic
Fixes a historian livelock that starved pending drop ops in active sessions. The runner now refreshes stale boundaries and clears no-op runs immediately, so compaction proceeds and drops drain.
stale_snapshotat run time and adopts it only if the head is runnable, preserving protected-tail safety.onHistorianRunStarted, soactiveRunsdoesn’t block pending drops in the same pass.Written for commit 055da42. Summary will update on new commits.
Greptile Summary
This PR fixes a production livelock where the historian agent repeatedly no-ops due to a stale boundary snapshot while pending drop operations starve indefinitely. Two independent defects are addressed: (1) on a
stale_snapshotvalidation result the runner now re-resolves the boundary once from live DB state and proceeds if a runnable window still exists, and (2) theactiveRunsregistration is cleared synchronously when the runner determines it will no-op, so the same transform pass can drain pending drops instead of deferring them.compartment-runner-incremental.ts): whenvalidateBoundarySnapshotreturnsreason: \"stale_snapshot\", the runner callsresolveOpenCodeProtectedTailBoundaryagainst the live DB and adopts the refreshed snapshot ifhasRunnableCompartmentWindowis satisfied. The runner's owneligibleEndOrdinal = Math.min(snapshot.eligibleEndOrdinal, protectedTailStart)clamp maintains the protected-tail guarantee regardless of snapshot staleness.activeRunscleanup (compartment-runner.ts):startCompartmentAgentintroduces arealRunStartedboolean set by the newonHistorianRunStartedcallback, called just before the first realawait. AfterrunCompartmentAgentis launched, ifrealRunStartedis still false the registration is deleted synchronously before the microtask-scheduled.finallycan clear it — unblocking pending-op drain in the same pass.Confidence Score: 4/5
Safe to merge — the two independent fixes address a well-documented production livelock with targeted, narrow changes backed by regression tests.
Both changes are tightly scoped and the core safety guarantee (protected tail never compacted) is enforced by the runner's own Math.min(eligibleEndOrdinal, protectedTailStart) clamp independent of the refreshed snapshot. The realRunStarted mechanism is correct for both synchronous and async no-op paths. The two findings are documentation and clarity gaps, not behavioral defects.
The stale-snapshot refresh block in compartment-runner-incremental.ts (lines 244–266) and the matching onHistorianRunStarted comment in both compartment-runner-incremental.ts and compartment-runner.ts would benefit from a one-line clarification about async early-return coverage and stale usage forwarding.
Important Files Changed
Sequence Diagram
sequenceDiagram participant T as Transform Pass participant SCA as startCompartmentAgent participant RCA as runCompartmentAgent participant DB as DB / State T->>SCA: startCompartmentAgent(deps) SCA->>DB: acquireCompartmentLease() SCA->>RCA: runCompartmentAgent(runnerDeps) [fire-and-forget] alt No-op path (stale snapshot) RCA->>DB: validateBoundarySnapshot → stale_snapshot Note over RCA: NEW: re-resolve boundary RCA->>DB: resolveOpenCodeProtectedTailBoundary (live) alt hasRunnableCompartmentWindow RCA->>RCA: "boundarySnapshot = refreshed, validation.ok = true" Note over RCA: continues to real run RCA-->>SCA: onHistorianRunStarted() called Note over SCA: realRunStarted = true RCA->>DB: publish compartments else no runnable window RCA-->>RCA: rollback, return (no await reached) Note over SCA: realRunStarted = false end else No-op path (nothing to compact) RCA-->>RCA: rollback, return synchronously Note over SCA: realRunStarted = false end SCA->>SCA: activeRuns.set(sessionId, promise) Note over SCA: NEW: if (!realRunStarted) activeRuns.delete() synchronously SCA-->>T: returns T->>T: getActiveCompartmentRun() → undefined T->>DB: drain pending drops (no longer blocked) Note over RCA: microtask later: .finally() clears interval + leaseReviews (1): Last reviewed commit: "plugin: refresh protected-tail snapshot ..." | Re-trigger Greptile