
Persist agent conversation across container restarts via AgentDO event reconstruction #1236

@jrf0110

Description

Parent

Part of #204 (Phase 3: Multi-Rig + Scaling)

Problem

When the container restarts (deploy, eviction, crash, sleep/wake), all agent sessions lose their entire conversation history. The Mayor is the worst-hit — a user who has been chatting with the Mayor for an hour suddenly gets a fresh session with zero memory of what was discussed. Polecats are more forgiving (focused tasks), but even they lose context about partial work and prior reasoning.

Currently, sendMayorMessage re-dispatches the Mayor with checkpoint: null (Town.do.ts:1810) — it does not even read the Mayor's checkpoint, let alone restore conversation history. The new session gets only the user's current message as its initial prompt.

What's Preserved vs. Lost Today

Preserved in DOs (survives restarts):

  • Agent metadata, bead data, checkpoints — TownDO SQLite
  • SDK streaming events (message.created, message.completed, message_part.updated, assistant.completed) — AgentDO rig_agent_events table
  • Bead events, mail, review queue, convoys — TownDO SQLite

Lost in container memory:

  • Full conversation history (all user/assistant/tool messages)
  • SDK session state
  • Active tool call state
  • In-progress reasoning

Key Insight: AgentDO Already Has the Data

The AgentDO stores SDK streaming events that contain message content. These events include message.created, message.completed, message_part.updated, and assistant.completed — the raw material to reconstruct conversation turns. The 10,000 event cap is generous (a typical Mayor session produces ~5-20 events per turn, so ~500-2000 events for a 100-turn conversation).

The events are streaming deltas, not clean {role, content} turns, but they can be reassembled.

Solution

Three tiers: a quick fix (context injection from existing data), graceful eviction handling (save work during the SIGTERM window), and the long-term strategy (DO-backed persistence so nothing is ever lost).

Tier 1: Context injection from AgentDO events (quick fix)

The AgentDO already stores SDK streaming events with message content. Reconstruct the conversation from these events and inject it into the new session on re-dispatch.

1a. Conversation reconstruction function

Add a reconstructConversation(agentId) function that:

  • Queries AgentDO.getEvents() for the agent's last session
  • Filters for message-type events (message.created, message.completed, message_part.updated)
  • Groups events by message boundaries (using message.created and message.completed as delimiters)
  • Reassembles streaming deltas into complete {role: 'user'|'assistant', content: string} turns
  • Returns a conversation transcript (array of turns or formatted text)

This does not need to be perfect — the goal is semantic continuity, not byte-level replay.
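A minimal sketch of the reassembly, assuming a simplified event shape — the real AgentDO events may differ in field names and nesting:

```typescript
type AgentEvent =
  | { type: "message.created"; messageId: string; role: "user" | "assistant" }
  | { type: "message_part.updated"; messageId: string; delta: string }
  | { type: "message.completed"; messageId: string };

type Turn = { role: "user" | "assistant"; content: string };

function reconstructConversation(events: AgentEvent[]): Turn[] {
  const open = new Map<string, Turn>(); // messages still streaming
  const turns: Turn[] = [];
  for (const event of events) {
    if (event.type === "message.created") {
      open.set(event.messageId, { role: event.role, content: "" });
    } else if (event.type === "message_part.updated") {
      const turn = open.get(event.messageId);
      if (turn) turn.content += event.delta; // accumulate streaming deltas
    } else if (event.type === "message.completed") {
      const turn = open.get(event.messageId);
      if (turn) {
        turns.push(turn); // message boundary reached: turn is final
        open.delete(event.messageId);
      }
    }
  }
  return turns;
}
```

Unterminated messages (no message.completed before the crash) are simply dropped here, which matches the "semantic continuity, not byte-level replay" goal.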

1b. Context injection on re-dispatch

When sendMayorMessage detects the Mayor needs re-dispatch (container restarted, isAlive = false):

  1. Call reconstructConversation(mayorAgentId) to get the prior transcript
  2. Include the transcript in the initial prompt as prior conversation context
  3. The Mayor continues with awareness of what was discussed
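One way the restored transcript could be folded into the re-dispatch prompt. The helper name and prompt framing are illustrative, not the real sendMayorMessage code:

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Prefix the new session's initial prompt with the reconstructed transcript.
function buildInitialPrompt(priorTurns: Turn[], userMessage: string): string {
  if (priorTurns.length === 0) return userMessage; // fresh session: no prefix
  const transcript = priorTurns
    .map((t) => `${t.role}: ${t.content}`)
    .join("\n");
  return [
    "Prior conversation (restored after a container restart):",
    transcript,
    "",
    "Current user message:",
    userMessage,
  ].join("\n");
}
```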

1c. Fix Mayor checkpoint propagation

sendMayorMessage at Town.do.ts:1810 passes checkpoint: null. Change to read the Mayor's checkpoint:

const checkpoint = agents.readCheckpoint(this.sql, mayorAgent.id);

This is a one-line fix that should be done immediately, independent of the larger work.

Context window management

  • Truncate to last N turns — keep the most recent conversation (e.g., last 50 turns)
  • Summarize older turns — use an LLM call to summarize the first half into a paragraph, keep the recent half verbatim
  • Token budget — set a max token budget for the restored transcript (e.g., 20% of the model's context window) and truncate/summarize to fit
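The token-budget option can be sketched as a newest-first walk, using a rough 4-characters-per-token estimate in place of a real tokenizer:

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Keep the most recent turns that fit inside maxTokens; older turns are
// dropped (or, in a fuller version, handed to a summarization pass).
function fitToBudget(turns: Turn[], maxTokens: number): Turn[] {
  const estimate = (t: Turn) => Math.ceil(t.content.length / 4); // crude heuristic
  const kept: Turn[] = [];
  let used = 0;
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimate(turns[i]);
    if (used + cost > maxTokens) break; // budget spent: stop at this boundary
    kept.unshift(turns[i]); // preserve chronological order
    used += cost;
  }
  return kept;
}
```

A production version would swap the heuristic for the model's tokenizer and summarize the dropped prefix rather than discarding it.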

Tier 1.5: Graceful container eviction — SIGTERM draining with agent save-and-park

Cloudflare Containers send SIGTERM 15 minutes before SIGKILL on host server restarts. The current shutdown handler (control-server.ts:649-654) immediately aborts all sessions and exits. We should use the 15-minute window to let agents save their work gracefully and notify the TownDO so the reconciler pauses dispatch.

From Cloudflare docs:

When a container instance is going to be shut down, it is sent a SIGTERM signal, and then a SIGKILL signal after 15 minutes. You should perform any necessary cleanup to ensure a graceful shutdown in this time. The container instance will be rebooted elsewhere shortly after this.

Current Behavior

// control-server.ts:649-654
const shutdown = async () => {
  stopHeartbeat();
  await stopAll();      // Immediately aborts all sessions
  process.exit(0);
};
process.on("SIGTERM", () => void shutdown());

stopAll() aborts every session via session.abort(), sets all agents to exited, and kills SDK servers. No git push, no checkpoint save, no TownDO notification.

Proposed SIGTERM Handling

Phase 1: Notify TownDO (immediate, on SIGTERM)

Container sends POST /api/towns/:townId/container-eviction to the worker. The TownDO:

  • Inserts a container_eviction event
  • The reconciler processes this event and sets a draining flag
  • Reconciler stops emitting dispatch_agent actions for new work
  • Does NOT interrupt currently running agents

Phase 2: Nudge running agents to save and park (first 2 minutes)

For each running agent, inject a nudge message:

  • Polecats: "The container is shutting down. Please commit and push your current changes immediately, then call gt_done. You have 2 minutes."
  • Refinery: "The container is shutting down. If your review is complete, call gt_done now. Otherwise, your work will be saved and the review will resume after restart."
  • Mayor: No nudge needed — conversation history is already in AgentDO events.

Phase 3: Wait for agents to finish (up to 10 minutes)

Monitor running agents. As each one calls gt_done or finishes, it exits cleanly. Wait until all agents have exited OR 10 minutes have elapsed (leaving 5 min buffer before SIGKILL).

Phase 4: Force save and exit (last 5 minutes)

For any agents still running after 10 minutes:

  • Force git add -A && git commit -m "WIP: container eviction save" && git push
  • Abort the session
  • Report agentCompleted with a reason indicating eviction save

Phase 5: Clean exit — stopAll() as today, then process.exit(0).
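The five phases above can be sketched as a single drainAll with injectable hooks (all hook names are hypothetical), so the ordering can be exercised without a real container:

```typescript
type DrainHooks = {
  notifyTownDO: () => Promise<void>;             // Phase 1: reconciler pauses dispatch
  nudgeAgents: () => Promise<void>;              // Phase 2: ask agents to save and park
  runningAgents: () => string[];                 // agents that have not yet exited
  forceSave: (agentId: string) => Promise<void>; // Phase 4: WIP commit + push
  stopAll: () => Promise<void>;                  // Phase 5: teardown as today
};

async function drainAll(
  hooks: DrainHooks,
  waitMs = 10 * 60_000, // Phase 3 budget, leaving buffer before SIGKILL
  pollMs = 5_000,
): Promise<void> {
  await hooks.notifyTownDO();
  await hooks.nudgeAgents();
  const deadline = Date.now() + waitMs;
  while (hooks.runningAgents().length > 0 && Date.now() < deadline) {
    await new Promise((r) => setTimeout(r, pollMs)); // wait for gt_done exits
  }
  for (const id of hooks.runningAgents()) {
    await hooks.forceSave(id); // stragglers get an eviction WIP save
  }
  await hooks.stopAll(); // caller then calls process.exit(0)
}
```

The SIGTERM handler would call drainAll in place of the current stopAll.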

Timing Budget

T+0:00  SIGTERM received
T+0:01  Notify TownDO (draining), nudge agents to save
T+0:01  Agents begin saving (commit, push, gt_done)
T+2:00  Most agents have finished saving
T+10:00 Force-save remaining agents (git commit + push)
T+10:30 stopAll(), report completions
T+11:00 process.exit(0)
T+15:00 SIGKILL (we should be long gone)

Reconciler Integration

applyEvent('container_eviction') sets a draining flag. reconcileBeads Rule 1 and reconcileReviewQueue Rule 5 check it before emitting dispatch_agent. After the container restarts and sends its first heartbeat, the TownDO clears the draining flag and dispatch resumes.
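A minimal sketch of the flag, assuming a reducer-style applyEvent; the state and event shapes are illustrative, not the real reconciler types:

```typescript
type TownState = { draining: boolean };
type TownEvent = { type: string };

function applyEvent(state: TownState, event: TownEvent): TownState {
  switch (event.type) {
    case "container_eviction":
      return { ...state, draining: true };  // pause dispatch of new work
    case "container_heartbeat":
      return { ...state, draining: false }; // new container is up: resume
    default:
      return state;
  }
}

// Dispatch rules consult the flag before emitting dispatch_agent actions.
function canDispatch(state: TownState): boolean {
  return !state.draining;
}
```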


Tier 2: DO-backed persistence — never write to disk (long-term strategy)

The Tier 1 approach reconstructs from streaming deltas (fragile, lossy). The proper fix is to never lose the state in the first place by persisting all session and control server state to Durable Objects instead of ephemeral container disk/memory.

Design principle: the container never writes to disk

Cloudflare Container disks are ephemeral — anything written is lost on eviction. DOs are durable and globally consistent. Writes within the same Cloudflare colo are sub-millisecond. If all state lives in DOs, container eviction becomes a non-event — the new container reads the same state from the same DOs and picks up where the old one left off.

2a. SDK session storage → AgentDO

The Kilo SDK uses SQLite for session persistence. In the container, this SQLite file lives on ephemeral disk. The SDK's storage layer needs a pluggable backend:

Option A — SDK storage adapter (clean): Implement the SDK's storage interface with HTTP calls to AgentDO. Requires SDK changes but is architecturally clean.

Option B — SQLite proxy (transparent): Intercept the SDK's SQLite reads/writes and route them to AgentDO over HTTP. No SDK changes needed.

Option C — Write-behind cache (pragmatic): SDK writes to local ephemeral disk as today (fast). A background process async-flushes completed conversation turns to AgentDO. On container restart, hydrate local SQLite from AgentDO before starting agents. Combines local write speed with DO durability.

Option C is the most pragmatic starting point.
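Option C can be sketched as a small write-behind queue. Here `persist` stands in for the HTTP call to AgentDO, and all names are hypothetical:

```typescript
type Turn = { role: "user" | "assistant"; content: string };

class WriteBehindTurnLog {
  private pending: Turn[] = []; // written locally, not yet in the DO
  private durable: Turn[] = []; // acknowledged by AgentDO

  // Called after the local SQLite write completes (fast path).
  append(turn: Turn): void {
    this.pending.push(turn);
  }

  // Background flush: ship pending turns; on failure keep them for retry.
  async flush(persist: (turns: Turn[]) => Promise<void>): Promise<void> {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    try {
      await persist(batch);
      this.durable.push(...batch);
    } catch {
      this.pending = batch.concat(this.pending); // retry on next flush
    }
  }

  // On restart: what the DO has, for hydrating local SQLite.
  persisted(): Turn[] {
    return this.durable;
  }
}
```

The window of loss is whatever sits in `pending` at eviction time, which Tier 1.5's drain window is meant to shrink to near zero.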

2b. Control server state → TownContainerDO

Persist the ProcessManager agents Map to TownContainerDO so the new container instance knows which agents were running, their session IDs, and ports. Boot becomes: read registry → hydrate sessions → resume.
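A sketch of the registry round-trip; the entry fields are an assumption about what a restarted container needs to resume, not the real ProcessManager schema:

```typescript
type AgentProcessEntry = { agentId: string; sessionId: string; port: number };

// Serialize the in-memory agents Map for storage in TownContainerDO.
function serializeRegistry(agents: Map<string, AgentProcessEntry>): string {
  return JSON.stringify([...agents.values()]);
}

// On boot, rebuild the Map from the DO-stored JSON before resuming sessions.
function hydrateRegistry(json: string): Map<string, AgentProcessEntry> {
  const entries: AgentProcessEntry[] = JSON.parse(json);
  return new Map(entries.map((e) => [e.agentId, e]));
}
```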

2c. Recovery flow with DO-backed persistence

Container evicted
  → TownDO alarm detects dead container
  → New container starts
  → Control server reads process registry from TownContainerDO
  → For each previously-running agent:
      → Read session state from AgentDO
      → Hydrate local SQLite with persisted conversation
      → Start SDK server with restored session
      → Agent resumes mid-conversation — no context loss
  → Report ready to TownDO

Files

  • container/src/control-server.ts — SIGTERM handler (line 649-657)
  • container/src/process-manager.ts — stopAll() (line 759), new drainAll() function
  • src/dos/town/reconciler.ts — draining flag check in dispatch rules
  • src/dos/town/events.ts — new container_eviction event type
  • src/dos/Town.do.ts — sendMayorMessage checkpoint fix, new /container-eviction endpoint

Acceptance Criteria

Tier 1 (quick fix)

  • sendMayorMessage reads and passes the Mayor's checkpoint
  • Conversation history reconstructed from AgentDO events on re-dispatch
  • Mayor re-dispatch includes prior conversation transcript in context
  • Transcript truncated/summarized to fit context window limits

Tier 1.5 (graceful eviction)

  • SIGTERM triggers a drain sequence instead of immediate abort
  • TownDO notified of eviction, reconciler pauses dispatch
  • Running agents nudged to save and push
  • Force-save after 10 minutes for stragglers
  • Container exits cleanly within 15-minute SIGKILL window

Tier 2 (DO-backed persistence)

  • SDK session state persisted to AgentDO
  • Control server process registry persisted to TownContainerDO
  • Container boot hydrates from DOs — agents resume with full context
  • Zero context loss on container eviction


Labels

  • P1 — Should fix before soft launch
  • enhancement — New feature or request
  • gt:container — Container management, agent processes, SDK, heartbeat
  • kilo-auto-fix — Auto-generated label by Kilo
  • kilo-triaged — Auto-generated label by Kilo
