fix: historian livelock starves pending drops in active sessions #141

Open
tracycam wants to merge 1 commit into cortexkit:master from tracycam:fix/historian-stale-snapshot-livelock
tracycam commented Jun 12, 2026

Problem

In an active session, the historian can livelock and pending drop operations starve forever. Observed in production (OpenCode harness): a session accumulated 179 pending drops while active context grew past 280K tokens with zero reduction.

The cycle, repeating every turn:

  1. commit-cluster trigger fires (clusters never get compartmentalized, so it refires every turn)
  2. compartmentInProgress flag set, historian agent starts
  3. historian no-op: stale protected-tail snapshot (last ordinal N id changed) — the boundary snapshot is captured at trigger time, and the ONLY failing validation is the "last message in protected tail" tuple, which changes every turn in an active session
  4. transform: deferring pending ops — compartment agent in progress — the no-op returns synchronously, but the activeRuns registration is only cleared via a microtask promise.finally, so the same pass still believes a run is active and defers pending ops
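The ordering problem in step 4 comes down to how promise callbacks are scheduled. The following is a minimal sketch, not the plugin's code: `activeRuns` is simplified to a plain map, and `startRun`, `noOpRunner`, and the session id are hypothetical stand-ins. It shows that a `.finally` cleanup attached to an already-finished async call still runs on a later microtask, so a check made in the same synchronous pass sees the registration as live.

```typescript
// Hypothetical stand-in for the plugin's active-run registry.
const activeRuns = new Map<string, Promise<void>>();

// Registers a run and reports what the SAME synchronous pass observes.
function startRun(sessionId: string, run: () => Promise<void>): boolean {
    const promise: Promise<void> = run().finally(() => {
        // Cleanup is scheduled as a microtask, even if run() already finished.
        if (activeRuns.get(sessionId) === promise) {
            activeRuns.delete(sessionId);
        }
    });
    activeRuns.set(sessionId, promise);
    return activeRuns.has(sessionId);
}

// A runner that no-ops without awaiting anything.
const noOpRunner = async (): Promise<void> => {};

// The no-op run is still registered when the same pass checks for it.
const seenAsActive = startRun("session-1", noOpRunner);
console.log(seenAsActive);
```

Under these assumptions the caller would defer pending ops for a run that has already finished, which matches the deferral described in step 4.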

Fix (two independent defects)

  1. Re-derive the boundary snapshot at run time on staleness. On a stale_snapshot validation result, the runner re-resolves the boundary once from current session state and adopts it if it still has a runnable head. validateBoundarySnapshot itself is untouched — the refreshed snapshot recomputes protectedTailStart/eligibleEnd from live messages, so the compacted head can never include a message that belongs to the live protected tail.
  2. Synchronous no-op clears the active-run registration. startCompartmentAgent now registers the run at the real-run commit point (after the last synchronous no-op return, before the first await) via onHistorianRunStarted, and drops the registration synchronously when the runner no-ops — so pending ops can materialize in the same pass.
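The first fix can be sketched roughly as follows. This is an illustration under stated assumptions, not the runner's implementation: the `Message` and `BoundarySnapshot` shapes, the fixed-size tail rule in `resolveBoundary`, and the helper names `validate`, `hasRunnableWindow`, and `pickBoundary` are all hypothetical simplifications of the real snapshot and resolver.

```typescript
// Hypothetical, simplified shapes; the real snapshot carries more fields.
interface Message {
    ordinal: number;
    id: string;
}

interface BoundarySnapshot {
    offset: number; // first ordinal not yet compacted
    protectedTailStart: number; // first ordinal of the protected tail
    eligibleEndOrdinal: number; // exclusive end of the compactable head
    lastTailId: string; // id of the newest message when captured
}

type Validation = { ok: true } | { ok: false; reason: "stale_snapshot" };

// Assumed resolver: protects the newest `tailSize` messages.
function resolveBoundary(messages: Message[], offset: number, tailSize: number): BoundarySnapshot {
    const last = messages[messages.length - 1];
    const lastOrdinal = last?.ordinal ?? 0;
    const protectedTailStart = Math.max(offset, lastOrdinal - tailSize + 1);
    return {
        offset,
        protectedTailStart,
        eligibleEndOrdinal: protectedTailStart,
        lastTailId: last?.id ?? "",
    };
}

// Fails with stale_snapshot when the newest message differs from the capture.
function validate(snapshot: BoundarySnapshot, messages: Message[]): Validation {
    const last = messages[messages.length - 1];
    return last?.id === snapshot.lastTailId ? { ok: true } : { ok: false, reason: "stale_snapshot" };
}

function hasRunnableWindow(snapshot: BoundarySnapshot): boolean {
    return snapshot.eligibleEndOrdinal > snapshot.offset;
}

// Returns the snapshot to run with, or null when the runner should no-op.
function pickBoundary(snapshot: BoundarySnapshot, messages: Message[], tailSize: number): BoundarySnapshot | null {
    const validation = validate(snapshot, messages);
    if (validation.ok) return snapshot;
    // Stale: re-resolve once from live messages instead of giving up.
    const refreshed = resolveBoundary(messages, snapshot.offset, tailSize);
    return hasRunnableWindow(refreshed) ? refreshed : null;
}

const makeMessages = (count: number): Message[] =>
    Array.from({ length: count }, (_, i) => ({ ordinal: i + 1, id: `m${i + 1}` }));

// Snapshot captured with 10 messages; one more message arrives before the run.
const atTrigger = resolveBoundary(makeMessages(10), 1, 3);
const chosen = pickBoundary(atTrigger, makeMessages(11), 3);
```

In this toy model the refreshed boundary moves the tail start from ordinal 8 to ordinal 9, so the head stays clear of the live tail even though the original snapshot failed validation.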

Tests

Two regression tests in compartment-runner.test.ts:

  • refreshes a stale-tail boundary snapshot at run time instead of no-op'ing forever (asserts publish + queued drops + compartmentInProgress=false)
  • clears the active-run registration synchronously when the runner no-ops

bun run typecheck, bun run lint, and the full plugin suite pass, apart from the two pre-existing flaky timing tests.


Summary by cubic

Fixes a historian livelock that starved pending drop ops in active sessions. The runner now refreshes stale boundaries and clears no-op runs immediately, so compaction proceeds and drops drain.

  • Bug Fixes
    • Re-resolves the protected-tail boundary on stale_snapshot at run time and adopts it only if the head is runnable, preserving protected-tail safety.
    • Clears active-run registration synchronously on no-op via onHistorianRunStarted, so activeRuns doesn’t block pending drops in the same pass.

Written for commit 055da42. Summary will update on new commits.

Greptile Summary

This PR fixes a production livelock where the historian agent repeatedly no-ops due to a stale boundary snapshot while pending drop operations starve indefinitely. Two independent defects are addressed: (1) on a stale_snapshot validation result the runner now re-resolves the boundary once from live DB state and proceeds if a runnable window still exists, and (2) the activeRuns registration is cleared synchronously when the runner determines it will no-op, so the same transform pass can drain pending drops instead of deferring them.

  • Stale-snapshot refresh (compartment-runner-incremental.ts): when validateBoundarySnapshot returns reason: "stale_snapshot", the runner calls resolveOpenCodeProtectedTailBoundary against the live DB and adopts the refreshed snapshot if hasRunnableCompartmentWindow is satisfied. The runner's own eligibleEndOrdinal = Math.min(snapshot.eligibleEndOrdinal, protectedTailStart) clamp maintains the protected-tail guarantee regardless of snapshot staleness.
  • Synchronous activeRuns cleanup (compartment-runner.ts): startCompartmentAgent introduces a realRunStarted boolean set by the new onHistorianRunStarted callback, called just before the first real await. After runCompartmentAgent is launched, if realRunStarted is still false the registration is deleted synchronously before the microtask-scheduled .finally can clear it — unblocking pending-op drain in the same pass.
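The synchronous cleanup described in the second bullet can be sketched as below. The function and variable names mirror the ones quoted in this summary, but the bodies are simplified assumptions: the `shouldNoOp` flag and the `Promise.resolve()` stand-in for real work are hypothetical, and lease and interval handling are left out.

```typescript
interface RunnerDeps {
    sessionId: string;
    shouldNoOp: boolean; // hypothetical switch standing in for the real no-op checks
    onHistorianRunStarted?: () => void;
}

const activeRuns = new Map<string, { promise: Promise<void> }>();

async function runCompartmentAgent(deps: RunnerDeps): Promise<void> {
    if (deps.shouldNoOp) return; // no-op: returns before any await
    deps.onHistorianRunStarted?.(); // commit point for a real run
    await Promise.resolve(); // stand-in for the real awaited work
}

function startCompartmentAgent(deps: { sessionId: string; shouldNoOp: boolean }): void {
    let realRunStarted = false;
    const promise: Promise<void> = runCompartmentAgent({
        ...deps,
        onHistorianRunStarted: () => {
            realRunStarted = true;
        },
    }).finally(() => {
        // Microtask cleanup; the guard makes a repeated delete harmless.
        if (activeRuns.get(deps.sessionId)?.promise === promise) {
            activeRuns.delete(deps.sessionId);
        }
    });
    activeRuns.set(deps.sessionId, { promise });
    // The runner never signalled a real run, so drop the entry in this pass.
    if (!realRunStarted && activeRuns.get(deps.sessionId)?.promise === promise) {
        activeRuns.delete(deps.sessionId);
    }
}

startCompartmentAgent({ sessionId: "noop-session", shouldNoOp: true });
startCompartmentAgent({ sessionId: "real-session", shouldNoOp: false });
console.log(activeRuns.has("noop-session"), activeRuns.has("real-session"));
```

In this sketch only the real run remains registered when `startCompartmentAgent` returns, so a caller checking `activeRuns` in the same pass would not defer work for the no-op session.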

Confidence Score: 4/5

Safe to merge — the two independent fixes address a well-documented production livelock with targeted, narrow changes backed by regression tests.

Both changes are tightly scoped and the core safety guarantee (protected tail never compacted) is enforced by the runner's own Math.min(eligibleEndOrdinal, protectedTailStart) clamp independent of the refreshed snapshot. The realRunStarted mechanism is correct for both synchronous and async no-op paths. The two findings are documentation and clarity gaps, not behavioral defects.

The stale-snapshot refresh block in compartment-runner-incremental.ts (lines 244–266) and the matching onHistorianRunStarted comment in both compartment-runner-incremental.ts and compartment-runner.ts would benefit from a one-line clarification about async early-return coverage and stale usage forwarding.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/plugin/src/hooks/magic-context/compartment-runner.ts | Introduces realRunStarted flag + onHistorianRunStarted callback to synchronously clear activeRuns when the runner no-ops; lease and interval cleanup still run via .finally() microtask. Logic is sound for both synchronous and async no-op paths. |
| packages/plugin/src/hooks/magic-context/compartment-runner-incremental.ts | Adds stale-snapshot refresh: on stale_snapshot validation failure, re-resolves boundary from live DB using trigger-time usage/threshold params and adopts it when hasRunnableCompartmentWindow passes. Protected-tail safety is preserved by the runner's own Math.min(eligibleEndOrdinal, protectedTailStart) clamp downstream. |
| packages/plugin/src/hooks/magic-context/compartment-runner-types.ts | Adds optional onHistorianRunStarted callback to CompartmentRunnerDeps with clear JSDoc; no breaking changes. |
| packages/plugin/src/hooks/magic-context/compartment-runner.test.ts | Adds two targeted regression tests: stale-tail refresh (verifies publish + pending ops + compartmentInProgress=false) and sync-noop activeRuns cleanup (synchronous assertion before any await). Both cover the exact production failure modes described in the PR. |

Sequence Diagram

sequenceDiagram
    participant T as Transform Pass
    participant SCA as startCompartmentAgent
    participant RCA as runCompartmentAgent
    participant DB as DB / State

    T->>SCA: startCompartmentAgent(deps)
    SCA->>DB: acquireCompartmentLease()
    SCA->>RCA: runCompartmentAgent(runnerDeps) [fire-and-forget]

    alt No-op path (stale snapshot)
        RCA->>DB: validateBoundarySnapshot → stale_snapshot
        Note over RCA: NEW: re-resolve boundary
        RCA->>DB: resolveOpenCodeProtectedTailBoundary (live)
        alt hasRunnableCompartmentWindow
            RCA->>RCA: "boundarySnapshot = refreshed, validation.ok = true"
            Note over RCA: continues to real run
            RCA-->>SCA: onHistorianRunStarted() called
            Note over SCA: realRunStarted = true
            RCA->>DB: publish compartments
        else no runnable window
            RCA-->>RCA: rollback, return (no await reached)
            Note over SCA: realRunStarted = false
        end
    else No-op path (nothing to compact)
        RCA-->>RCA: rollback, return synchronously
        Note over SCA: realRunStarted = false
    end

    SCA->>SCA: activeRuns.set(sessionId, promise)
    Note over SCA: NEW: if (!realRunStarted) activeRuns.delete() synchronously
    SCA-->>T: returns

    T->>T: getActiveCompartmentRun() → undefined
    T->>DB: drain pending drops (no longer blocked)

    Note over RCA: microtask later: .finally() clears interval + lease

Reviews (1): Last reviewed commit: "plugin: refresh protected-tail snapshot ..."

Greptile also left 2 inline comments on this PR.

…ent drop starvation

In an active session the protected tail's newest message changes every
turn, so a boundary snapshot captured at trigger time fails validation on
the "last ordinal id changed" check by the time the historian runs — even
though the eligible head it would compact is untouched. The runner no-op'd
forever while the trigger refired each turn, queued drop ops accumulated
(observed: 27 consecutive stale no-ops, 179 pending drops, zero reduction).

- compartment-runner-incremental: on a stale_snapshot validation result,
  re-resolve the boundary once from current session state and adopt the
  fresh snapshot when it still exposes a runnable head. The refreshed
  snapshot recomputes protectedTailStart/eligibleEnd from live messages,
  so the protected-tail guarantee is preserved (the head can never include
  a message now in the live tail). Historian makes real progress, publishes,
  and queues drops so the accumulated pending ops drain.
- startCompartmentAgent: a synchronous runner no-op cleared compartmentInProgress
  but left the activeRuns registration alive until a microtask, so the same
  transform pass deferred queued drop ops for a run that already finished.
  Signal onHistorianRunStarted at the real-run commit point and drop the
  registration synchronously when the runner no-ops.
- Regression tests: stale-tail snapshot re-derives + publishes instead of
  no-op'ing; synchronous no-op clears the active-run belief in the same pass.
Copilot AI review requested due to automatic review settings June 12, 2026 19:46

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 4 files

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR addresses a production livelock in the magic-context historian runner by ensuring stale protected-tail snapshots can be refreshed at run time and by preventing synchronous no-op runs from leaving a misleading “active run” registration that starves queued drop operations.

Changes:

  • Refresh stale protected-tail boundary snapshots at run time when the eligible head is still runnable.
  • Add an onHistorianRunStarted callback so startCompartmentAgent can distinguish a real run from synchronous no-ops and clear activeRuns immediately in the no-op case.
  • Add regression tests covering stale-tail refresh and synchronous no-op active-run cleanup.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| packages/plugin/src/hooks/magic-context/compartment-runner.ts | Tracks whether a real historian run started and synchronously clears activeRuns for synchronous no-ops. |
| packages/plugin/src/hooks/magic-context/compartment-runner-incremental.ts | Refreshes stale snapshots at run time and signals onHistorianRunStarted before the main awaited work. |
| packages/plugin/src/hooks/magic-context/compartment-runner-types.ts | Extends runner deps with the onHistorianRunStarted callback and documents its intended semantics. |
| packages/plugin/src/hooks/magic-context/compartment-runner.test.ts | Adds regression tests to validate stale-tail refresh and synchronous no-op active-run cleanup. |

Comment on lines 127 to +166
@@ -143,6 +150,20 @@ export function startCompartmentAgent(deps: CompartmentRunnerDeps): void {
        }
    });
    activeRuns.set(deps.sessionId, { promise, published: false });
    // If the runner no-op'd synchronously (stale/empty snapshot, nothing to
    // compact, drain-quota), it returned before signalling onHistorianRunStarted
    // and before any `await`, so `promise` is already settling. It cleared
    // compartmentInProgress in its own finally, but the activeRuns entry above
    // would otherwise survive (cleared only by the microtask-scheduled
    // promise.finally) and make the SAME transform pass treat a non-running
    // historian as in-progress — deferring queued drop ops and starving them
    // turn after turn (the production livelock). Drop the registration
    // synchronously so pending ops can materialize this pass. The promise.finally
    // below still runs for interval/lease cleanup; its `=== promise` guard makes
    // the (now redundant) delete a no-op.
    if (!realRunStarted && activeRuns.get(deps.sessionId)?.promise === promise) {
        activeRuns.delete(deps.sessionId);
    }
Comment on lines +120 to +129
    /**
     * Called synchronously the moment the runner commits to a REAL historian
     * pass — after every no-op early-return (stale/empty snapshot, nothing to
     * compact, drain-quota) and immediately before the first `await`. Lets
     * `startCompartmentAgent` distinguish a fire-and-forget run that actually
     * started from one that no-op'd synchronously, so a no-op does not leave
     * the rest of the transform pass believing a historian is in progress
     * (which would defer queued drop ops — the production livelock).
     */
    onHistorianRunStarted?: () => void;
Comment on lines +376 to +382
    // Past every synchronous no-op early-return and immediately before the
    // first `await` (client.session.get below): we are now committed to a
    // real historian pass. Signal the caller so startCompartmentAgent keeps
    // the active-run registration; a synchronous no-op above never reaches
    // here, so its lingering registration is cleared instead of blocking the
    // same transform pass's pending-op drain.
    deps.onHistorianRunStarted?.();

P2 "Synchronous no-op" framing understates the callback's actual scope

The comment says "Past every synchronous no-op early-return … immediately before the first await", but two earlier early-return paths — existingValidationError (line ~197) and chunkCoverageError (line ~371) — both reach await notifyHistorianIssue(…) without ever calling onHistorianRunStarted. They are async paths that still leave realRunStarted = false, so startCompartmentAgent will also delete their activeRuns entry synchronously.

The mechanism is correct (both paths are genuine no-ops from a historian perspective), but the comment and the matching one in compartment-runner.ts should say "past every early-return path — including the async error paths above — and immediately before the first real work await" to avoid a future reader placing a new early-return after the callback by mistake.
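The placement concern rests on a general property of `async` functions: the body runs synchronously up to its first `await`, so only a callback invoked before that point is visible to the caller right away. The sketch below is a standalone illustration of that property; `observeSynchronously` and the three sample bodies are hypothetical and do not correspond to functions in the PR.

```typescript
// Runs an async body and reports whether it signalled before returning control.
function observeSynchronously(body: (signal: () => void) => Promise<void>): boolean {
    let signalled = false;
    void body(() => {
        signalled = true;
    });
    return signalled;
}

// Callback fired before the first await: the caller sees it immediately.
const beforeAwait = observeSynchronously(async (signal) => {
    signal();
    await Promise.resolve();
});

// Async early return (for example after awaiting a notification): never signals.
const asyncEarlyReturn = observeSynchronously(async () => {
    await Promise.resolve();
});

// Callback placed after an await: too late for the caller's synchronous check.
const afterAwait = observeSynchronously(async (signal) => {
    await Promise.resolve();
    signal();
});

console.log(beforeAwait, asyncEarlyReturn, afterAwait);
```

The third case is the hazard the comment warns about: a callback moved behind an `await` would make a real run look like a no-op to a caller that checks the flag synchronously.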

Comment on lines +244 to +266
    if (!validation.ok && validation.reason === "stale_snapshot") {
        const refreshed = resolveOpenCodeProtectedTailBoundary({
            db,
            sessionId,
            mode: "incremental-runner",
            contextLimit: deps.currentContextLimit ?? boundarySnapshot.contextLimit,
            executeThresholdPercentage: boundarySnapshot.executeThresholdPercentage,
            usage: {
                percentage: boundarySnapshot.usagePercentage,
                inputTokens: boundarySnapshot.usageInputTokens,
            },
            usageSource: boundarySnapshot.usageSource,
            emergencyTailScale: boundarySnapshot.emergencyTailScale,
        });
        if (hasRunnableCompartmentWindow(refreshed)) {
            sessionLog(
                sessionId,
                `historian: refreshed stale protected-tail snapshot at run time (was: ${validation.detail ?? "stale"}) — eligible head ${refreshed.offset}-${refreshed.eligibleEndOrdinal - 1}`,
            );
            boundarySnapshot = refreshed;
            validation = { ok: true };
        }
    }

P2 Stale usagePercentage passed to the refreshed boundary resolver

The refresh call forwards boundarySnapshot.usagePercentage and boundarySnapshot.usageInputTokens, both captured at trigger time. In the active-session livelock scenario (context growing every turn), the real usage at run time is higher than what the snapshot recorded. A lower-than-actual percentage causes resolveOpenCodeProtectedTailBoundary to compute a slightly smaller protected tail (protectedTailStart shifted toward the end), so the eligible head could in principle include messages that belong to the current live tail.

In practice this is benign because the runner immediately clamps eligibleEndOrdinal = Math.min(snapshot.eligibleEndOrdinal, protectedTailStart), so nothing at or beyond the resolved protectedTailStart is ever compacted. Adding a brief inline note here (e.g., "usage is intentionally trigger-time; runner clamps eligibleEnd to protectedTailStart") would make the safety argument self-documenting for the next reader.
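The clamp can be illustrated with a small sketch. `compactableOrdinals` is a hypothetical helper, and the half-open range `[offset, eligibleEndOrdinal)` is an assumption inferred from the log line above, which prints the head as `offset` to `eligibleEndOrdinal - 1`. The sketch also assumes `protectedTailStart` reflects the live tail; it does not show where the runner obtains that value.

```typescript
// Hypothetical helper: ordinals the runner would compact, as a half-open range.
function compactableOrdinals(offset: number, snapshotEligibleEnd: number, protectedTailStart: number): number[] {
    // The clamp quoted above: never let the head reach into the tail.
    const eligibleEndOrdinal = Math.min(snapshotEligibleEnd, protectedTailStart);
    const ordinals: number[] = [];
    for (let ordinal = offset; ordinal < eligibleEndOrdinal; ordinal++) {
        ordinals.push(ordinal);
    }
    return ordinals;
}

// A snapshot that overshoots (eligible end 12) against a tail starting at 9.
const head = compactableOrdinals(1, 12, 9);
console.log(head.every((ordinal) => ordinal < 9));
```

With these inputs the head covers ordinals 1 through 8 only, even though the snapshot's own eligible end extended further.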
