fix: historian livelock starves pending drops in active sessions by tracycam · Pull Request #141 · cortexkit/magic-context

tracycam · 2026-06-12T19:46:53Z

Problem

In an active session, the historian can livelock and pending drop operations starve forever. Observed in production (OpenCode harness): a session accumulated 179 pending drops while active context grew past 280K tokens with zero reduction.

The cycle, repeating every turn:

commit-cluster trigger fires (clusters never get compartmentalized, so it refires every turn)
compartmentInProgress flag set, historian agent starts
historian no-op: stale protected-tail snapshot (last ordinal N id changed) — the boundary snapshot is captured at trigger time, and the ONLY failing validation is the "last message in protected tail" tuple, which changes every turn in an active session
transform: deferring pending ops — compartment agent in progress — the no-op returns synchronously, but the activeRuns registration is only cleared via a microtask promise.finally, so the same pass still believes a run is active and defers pending ops

Fix (two independent defects)

Re-derive the boundary snapshot at run time on staleness. On a stale_snapshot validation result, the runner re-resolves the boundary once from current session state and adopts it if it still has a runnable head. validateBoundarySnapshot itself is untouched — the refreshed snapshot recomputes protectedTailStart/eligibleEnd from live messages, so the compacted head can never include a message that belongs to the live protected tail.
Synchronous no-op clears the active-run registration. startCompartmentAgent now registers the run at the real-run commit point (after the last synchronous no-op return, before the first await) via onHistorianRunStarted, and drops the registration synchronously when the runner no-ops — so pending ops can materialize in the same pass.

Tests

Two regression tests in compartment-runner.test.ts:

refreshes a stale-tail boundary snapshot at run time instead of no-op'ing forever (asserts publish + queued drops + compartmentInProgress=false)
clears the active-run registration synchronously when the runner no-ops

bun run typecheck, bun run lint, full plugin suite pass (the two pre-existing flaky timing tests aside).

^{Need help on this PR? Tag /codesmith with what you need. Autofix is disabled.}

Summary by cubic

Fixes a historian livelock that starved pending drop ops in active sessions. The runner now refreshes stale boundaries and clears no-op runs immediately, so compaction proceeds and drops drain.

Bug Fixes
- Re-resolves the protected-tail boundary on stale_snapshot at run time and adopts it only if the head is runnable, preserving protected-tail safety.
- Clears active-run registration synchronously on no-op via onHistorianRunStarted, so activeRuns doesn’t block pending drops in the same pass.

^{Written for commit 055da42. Summary will update on new commits.}

Greptile Summary

This PR fixes a production livelock where the historian agent repeatedly no-ops due to a stale boundary snapshot while pending drop operations starve indefinitely. Two independent defects are addressed: (1) on a stale_snapshot validation result the runner now re-resolves the boundary once from live DB state and proceeds if a runnable window still exists, and (2) the activeRuns registration is cleared synchronously when the runner determines it will no-op, so the same transform pass can drain pending drops instead of deferring them.

Stale-snapshot refresh (compartment-runner-incremental.ts): when validateBoundarySnapshot returns reason: \"stale_snapshot\", the runner calls resolveOpenCodeProtectedTailBoundary against the live DB and adopts the refreshed snapshot if hasRunnableCompartmentWindow is satisfied. The runner's own eligibleEndOrdinal = Math.min(snapshot.eligibleEndOrdinal, protectedTailStart) clamp maintains the protected-tail guarantee regardless of snapshot staleness.
Synchronous activeRuns cleanup (compartment-runner.ts): startCompartmentAgent introduces a realRunStarted boolean set by the new onHistorianRunStarted callback, called just before the first real await. After runCompartmentAgent is launched, if realRunStarted is still false the registration is deleted synchronously before the microtask-scheduled .finally can clear it — unblocking pending-op drain in the same pass.

Confidence Score: 4/5

Safe to merge — the two independent fixes address a well-documented production livelock with targeted, narrow changes backed by regression tests.

Both changes are tightly scoped and the core safety guarantee (protected tail never compacted) is enforced by the runner's own Math.min(eligibleEndOrdinal, protectedTailStart) clamp independent of the refreshed snapshot. The realRunStarted mechanism is correct for both synchronous and async no-op paths. The two findings are documentation and clarity gaps, not behavioral defects.

The stale-snapshot refresh block in compartment-runner-incremental.ts (lines 244–266) and the matching onHistorianRunStarted comment in both compartment-runner-incremental.ts and compartment-runner.ts would benefit from a one-line clarification about async early-return coverage and stale usage forwarding.

Important Files Changed

Filename	Overview
packages/plugin/src/hooks/magic-context/compartment-runner.ts	Introduces realRunStarted flag + onHistorianRunStarted callback to synchronously clear activeRuns when the runner no-ops; lease and interval cleanup still run via .finally() microtask. Logic is sound for both synchronous and async no-op paths.
packages/plugin/src/hooks/magic-context/compartment-runner-incremental.ts	Adds stale-snapshot refresh: on stale_snapshot validation failure, re-resolves boundary from live DB using trigger-time usage/threshold params and adopts it when hasRunnableCompartmentWindow passes. Protected-tail safety is preserved by the runner's own Math.min(eligibleEndOrdinal, protectedTailStart) clamp downstream.
packages/plugin/src/hooks/magic-context/compartment-runner-types.ts	Adds optional onHistorianRunStarted callback to CompartmentRunnerDeps with clear JSDoc; no breaking changes.
packages/plugin/src/hooks/magic-context/compartment-runner.test.ts	Adds two targeted regression tests: stale-tail refresh (verifies publish + pending ops + compartmentInProgress=false) and sync-noop activeRuns cleanup (synchronous assertion before any await). Both cover the exact production failure modes described in the PR.

Sequence Diagram

sequenceDiagram
    participant T as Transform Pass
    participant SCA as startCompartmentAgent
    participant RCA as runCompartmentAgent
    participant DB as DB / State

    T->>SCA: startCompartmentAgent(deps)
    SCA->>DB: acquireCompartmentLease()
    SCA->>RCA: runCompartmentAgent(runnerDeps) [fire-and-forget]

    alt No-op path (stale snapshot)
        RCA->>DB: validateBoundarySnapshot → stale_snapshot
        Note over RCA: NEW: re-resolve boundary
        RCA->>DB: resolveOpenCodeProtectedTailBoundary (live)
        alt hasRunnableCompartmentWindow
            RCA->>RCA: "boundarySnapshot = refreshed, validation.ok = true"
            Note over RCA: continues to real run
            RCA-->>SCA: onHistorianRunStarted() called
            Note over SCA: realRunStarted = true
            RCA->>DB: publish compartments
        else no runnable window
            RCA-->>RCA: rollback, return (no await reached)
            Note over SCA: realRunStarted = false
        end
    else No-op path (nothing to compact)
        RCA-->>RCA: rollback, return synchronously
        Note over SCA: realRunStarted = false
    end

    SCA->>SCA: activeRuns.set(sessionId, promise)
    Note over SCA: NEW: if (!realRunStarted) activeRuns.delete() synchronously
    SCA-->>T: returns

    T->>T: getActiveCompartmentRun() → undefined
    T->>DB: drain pending drops (no longer blocked)

    Note over RCA: microtask later: .finally() clears interval + lease

_{Reviews (1): Last reviewed commit: "plugin: refresh protected-tail snapshot ..." | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

…ent drop starvation In an active session the protected tail's newest message changes every turn, so a boundary snapshot captured at trigger time fails validation on the "last ordinal id changed" check by the time the historian runs — even though the eligible head it would compact is untouched. The runner no-op'd forever while the trigger refired each turn, queued drop ops accumulated (observed: 27 consecutive stale no-ops, 179 pending drops, zero reduction). - compartment-runner-incremental: on a stale_snapshot validation result, re-resolve the boundary once from current session state and adopt the fresh snapshot when it still exposes a runnable head. The refreshed snapshot recomputes protectedTailStart/eligibleEnd from live messages, so the protected-tail guarantee is preserved (the head can never include a message now in the live tail). Historian makes real progress, publishes, and queues drops so the accumulated pending ops drain. - startCompartmentAgent: a synchronous runner no-op cleared compartmentInProgress but left the activeRuns registration alive until a microtask, so the same transform pass deferred queued drop ops for a run that already finished. Signal onHistorianRunStarted at the real-run commit point and drop the registration synchronously when the runner no-ops. - Regression tests: stale-tail snapshot re-derives + publishes instead of no-op'ing; synchronous no-op clears the active-run belief in the same pass.

cubic-dev-ai

No issues found across 4 files

_{Re-trigger cubic}

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR addresses a production livelock in the magic-context historian runner by ensuring stale protected-tail snapshots can be refreshed at run time and by preventing synchronous no-op runs from leaving a misleading “active run” registration that starves queued drop operations.

Changes:

Refresh stale protected-tail boundary snapshots at run time when the eligible head is still runnable.
Add an onHistorianRunStarted callback so startCompartmentAgent can distinguish a real run from synchronous no-ops and clear activeRuns immediately in the no-op case.
Add regression tests covering stale-tail refresh and synchronous no-op active-run cleanup.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
packages/plugin/src/hooks/magic-context/compartment-runner.ts	Tracks whether a real historian run started and synchronously clears `activeRuns` for synchronous no-ops.
packages/plugin/src/hooks/magic-context/compartment-runner-incremental.ts	Refreshes stale snapshots at run time and signals `onHistorianRunStarted` before the main awaited work.
packages/plugin/src/hooks/magic-context/compartment-runner-types.ts	Extends runner deps with the `onHistorianRunStarted` callback and documents its intended semantics.
packages/plugin/src/hooks/magic-context/compartment-runner.test.ts	Adds regression tests to validate stale-tail refresh and synchronous no-op active-run cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -143,6 +150,20 @@ export function startCompartmentAgent(deps: CompartmentRunnerDeps): void {
            }
        });
    activeRuns.set(deps.sessionId, { promise, published: false });
+    // If the runner no-op'd synchronously (stale/empty snapshot, nothing to
+    // compact, drain-quota), it returned before signalling onHistorianRunStarted
+    // and before any `await`, so `promise` is already settling. It cleared
+    // compartmentInProgress in its own finally, but the activeRuns entry above
+    // would otherwise survive (cleared only by the microtask-scheduled
+    // promise.finally) and make the SAME transform pass treat a non-running
+    // historian as in-progress — deferring queued drop ops and starving them
+    // turn after turn (the production livelock). Drop the registration
+    // synchronously so pending ops can materialize this pass. The promise.finally
+    // below still runs for interval/lease cleanup; its `=== promise` guard makes
+    // the (now redundant) delete a no-op.
+    if (!realRunStarted && activeRuns.get(deps.sessionId)?.promise === promise) {
+        activeRuns.delete(deps.sessionId);
+    }


+    /**
+     * Called synchronously the moment the runner commits to a REAL historian
+     * pass — after every no-op early-return (stale/empty snapshot, nothing to
+     * compact, drain-quota) and immediately before the first `await`. Lets
+     * `startCompartmentAgent` distinguish a fire-and-forget run that actually
+     * started from one that no-op'd synchronously, so a no-op does not leave
+     * the rest of the transform pass believing a historian is in progress
+     * (which would defer queued drop ops — the production livelock).
+     */
+    onHistorianRunStarted?: () => void;


+        // Past every synchronous no-op early-return and immediately before the
+        // first `await` (client.session.get below): we are now committed to a
+        // real historian pass. Signal the caller so startCompartmentAgent keeps
+        // the active-run registration; a synchronous no-op above never reaches
+        // here, so its lingering registration is cleared instead of blocking the
+        // same transform pass's pending-op drain.
+        deps.onHistorianRunStarted?.();


greptile-apps · 2026-06-12T19:55:39Z

+        // Past every synchronous no-op early-return and immediately before the
+        // first `await` (client.session.get below): we are now committed to a
+        // real historian pass. Signal the caller so startCompartmentAgent keeps
+        // the active-run registration; a synchronous no-op above never reaches
+        // here, so its lingering registration is cleared instead of blocking the
+        // same transform pass's pending-op drain.
+        deps.onHistorianRunStarted?.();


"Synchronous no-op" framing understates the callback's actual scope

The comment says "Past every synchronous no-op early-return … immediately before the first await", but two earlier early-return paths — existingValidationError (line ~197) and chunkCoverageError (line ~371) — both reach await notifyHistorianIssue(…) without ever calling onHistorianRunStarted. They are async paths that still leave realRunStarted = false, so startCompartmentAgent will also delete their activeRuns entry synchronously.

The mechanism is correct (both paths are genuine no-ops from a historian perspective), but the comment and the matching one in compartment-runner.ts should say "past every early-return path — including the async error paths above — and immediately before the first real work await" to avoid a future reader placing a new early-return after the callback by mistake.

greptile-apps · 2026-06-12T19:55:40Z

+        if (!validation.ok && validation.reason === "stale_snapshot") {
+            const refreshed = resolveOpenCodeProtectedTailBoundary({
+                db,
+                sessionId,
+                mode: "incremental-runner",
+                contextLimit: deps.currentContextLimit ?? boundarySnapshot.contextLimit,
+                executeThresholdPercentage: boundarySnapshot.executeThresholdPercentage,
+                usage: {
+                    percentage: boundarySnapshot.usagePercentage,
+                    inputTokens: boundarySnapshot.usageInputTokens,
+                },
+                usageSource: boundarySnapshot.usageSource,
+                emergencyTailScale: boundarySnapshot.emergencyTailScale,
+            });
+            if (hasRunnableCompartmentWindow(refreshed)) {
+                sessionLog(
+                    sessionId,
+                    `historian: refreshed stale protected-tail snapshot at run time (was: ${validation.detail ?? "stale"}) — eligible head ${refreshed.offset}-${refreshed.eligibleEndOrdinal - 1}`,
+                );
+                boundarySnapshot = refreshed;
+                validation = { ok: true };
+            }
+        }


Stale usagePercentage passed to the refreshed boundary resolver

The refresh call forwards boundarySnapshot.usagePercentage and boundarySnapshot.usageInputTokens — values captured at trigger time. In the active-session livelock scenario (context growing every turn), the real usage at run time is higher than what the snapshot recorded. A lower-than-actual percentage causes resolveOpenCodeProtectedTailBoundary to compute a slightly smaller protected tail (protectedTailStart shifted toward the end), which could in principle include messages that belong to the current live tail.

In practice this is benign because the runner immediately clamps eligibleEndOrdinal = Math.min(snapshot.eligibleEndOrdinal, protectedTailStart), so nothing outside the resolved tail is ever compacted. Adding a brief inline note here — e.g., "usage is intentionally trigger-time; runner clamps eligibleEnd to protectedTailStart" — would make the safety argument self-documenting for the next reader.

Copilot AI review requested due to automatic review settings June 12, 2026 19:46

cubic-dev-ai Bot reviewed Jun 12, 2026

View reviewed changes

Copilot AI reviewed Jun 12, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: historian livelock starves pending drops in active sessions#141

fix: historian livelock starves pending drops in active sessions#141
tracycam wants to merge 1 commit into
cortexkit:masterfrom
tracycam:fix/historian-stale-snapshot-livelock

tracycam commented Jun 12, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

greptile-apps Bot Jun 12, 2026

Uh oh!

greptile-apps Bot Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tracycam commented Jun 12, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix (two independent defects)

Tests

Summary by cubic

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

greptile-apps Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tracycam commented Jun 12, 2026 •

edited by greptile-apps Bot

Loading