
Persist agent conversation across container restarts via AgentDO event reconstruction #1236

@jrf0110

Description

Parent

Part of #204 (Phase 3: Multi-Rig + Scaling)

Problem

When the container restarts (deploy, eviction, crash, sleep/wake), all agent sessions lose their entire conversation history. The Mayor is the worst-hit — a user who has been chatting with the Mayor for an hour suddenly gets a fresh session with zero memory of what was discussed. Polecats are more forgiving (focused tasks), but even they lose context about partial work and prior reasoning.

Currently, sendMayorMessage re-dispatches the Mayor with checkpoint: null (Town.do.ts:1810) — it does not even read the Mayor's checkpoint, let alone restore conversation history. The new session gets only the user's current message as its initial prompt.

What's Preserved vs. Lost Today

Preserved in DOs (survives restarts):

  • Agent metadata, bead data, checkpoints — TownDO SQLite
  • SDK streaming events (message.created, message.completed, message_part.updated, assistant.completed) — AgentDO rig_agent_events table
  • Bead events, mail, review queue, convoys — TownDO SQLite

Lost in container memory:

  • Full conversation history (all user/assistant/tool messages)
  • SDK session state
  • Active tool call state
  • In-progress reasoning

Key Insight: AgentDO Already Has the Data

The AgentDO stores SDK streaming events that contain message content. These events include message.created, message.completed, message_part.updated, and assistant.completed — the raw material to reconstruct conversation turns. The 10,000 event cap is generous (a typical Mayor session produces ~5-20 events per turn, so ~500-2000 events for a 100-turn conversation).

The events are streaming deltas, not clean {role, content} turns, but they can be reassembled.

Solution

Three tiers: a quick fix (context injection from existing data), graceful eviction handling (save work during the SIGTERM window), and the long-term strategy (DO-backed persistence so nothing is ever lost).

Tier 1: Context injection from AgentDO events (quick fix)

The AgentDO already stores SDK streaming events with message content. Reconstruct the conversation from these events and inject it into the new session on re-dispatch.

1a. Conversation reconstruction function

Add a reconstructConversation(agentId) function that:

  • Queries AgentDO.getEvents() for the agent's last session
  • Filters for message-type events (message.created, message.completed, message_part.updated)
  • Groups events by message boundaries (using message.created and message.completed as delimiters)
  • Reassembles streaming deltas into complete {role: 'user'|'assistant', content: string} turns
  • Returns a conversation transcript (array of turns or formatted text)

This does not need to be perfect — the goal is semantic continuity, not byte-level replay.
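A minimal sketch of the reassembly, assuming a simplified event shape — the real AgentDO events may differ in field names and nesting:

```typescript
type AgentEvent =
  | { type: "message.created"; messageId: string; role: "user" | "assistant" }
  | { type: "message_part.updated"; messageId: string; delta: string }
  | { type: "message.completed"; messageId: string };

type Turn = { role: "user" | "assistant"; content: string };

function reconstructConversation(events: AgentEvent[]): Turn[] {
  const open = new Map<string, Turn>(); // messages still streaming
  const turns: Turn[] = [];
  for (const event of events) {
    if (event.type === "message.created") {
      open.set(event.messageId, { role: event.role, content: "" });
    } else if (event.type === "message_part.updated") {
      const turn = open.get(event.messageId);
      if (turn) turn.content += event.delta; // accumulate streaming deltas
    } else if (event.type === "message.completed") {
      const turn = open.get(event.messageId);
      if (turn) {
        turns.push(turn); // message boundary reached: turn is final
        open.delete(event.messageId);
      }
    }
  }
  return turns;
}
```

Unterminated messages (no message.completed before the crash) are simply dropped here, which matches the "semantic continuity, not byte-level replay" goal.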

1b. Context injection on re-dispatch

When sendMayorMessage detects the Mayor needs re-dispatch (container restarted, isAlive = false):

  1. Call reconstructConversation(mayorAgentId) to get the prior transcript
  2. Include the transcript in the initial prompt as prior conversation context
  3. The Mayor continues with awareness of what was discussed
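One way the restored transcript could be folded into the re-dispatch prompt. The helper name and prompt framing are illustrative, not the real sendMayorMessage code:

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Prefix the new session's initial prompt with the reconstructed transcript.
function buildInitialPrompt(priorTurns: Turn[], userMessage: string): string {
  if (priorTurns.length === 0) return userMessage; // fresh session: no prefix
  const transcript = priorTurns
    .map((t) => `${t.role}: ${t.content}`)
    .join("\n");
  return [
    "Prior conversation (restored after a container restart):",
    transcript,
    "",
    "Current user message:",
    userMessage,
  ].join("\n");
}
```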

1c. Fix Mayor checkpoint propagation

sendMayorMessage at Town.do.ts:1810 passes checkpoint: null. Change to read the Mayor's checkpoint:

const checkpoint = agents.readCheckpoint(this.sql, mayorAgent.id);

This is a one-line fix that should be done immediately, independent of the larger work.

Context window management

  • Truncate to last N turns — keep the most recent conversation (e.g., last 50 turns)
  • Summarize older turns — use an LLM call to summarize the first half into a paragraph, keep the recent half verbatim
  • Token budget — set a max token budget for the restored transcript (e.g., 20% of the model's context window) and truncate/summarize to fit
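The token-budget option can be sketched as a newest-first walk, using a rough 4-characters-per-token estimate in place of a real tokenizer:

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Keep the most recent turns that fit inside maxTokens; older turns are
// dropped (or, in a fuller version, handed to a summarization pass).
function fitToBudget(turns: Turn[], maxTokens: number): Turn[] {
  const estimate = (t: Turn) => Math.ceil(t.content.length / 4); // crude heuristic
  const kept: Turn[] = [];
  let used = 0;
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimate(turns[i]);
    if (used + cost > maxTokens) break; // budget spent: stop at this boundary
    kept.unshift(turns[i]); // preserve chronological order
    used += cost;
  }
  return kept;
}
```

A production version would swap the heuristic for the model's tokenizer and summarize the dropped prefix rather than discarding it.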

Tier 1.5: Graceful container eviction — SIGTERM draining with agent save-and-park

Cloudflare Containers send SIGTERM 15 minutes before SIGKILL on host server restarts. The current shutdown handler (control-server.ts:649-654) immediately aborts all sessions and exits. We should use the 15-minute window to let agents save their work gracefully and notify the TownDO so the reconciler pauses dispatch.

From Cloudflare docs:

When a container instance is going to be shut down, it is sent a SIGTERM signal, and then a SIGKILL signal after 15 minutes. You should perform any necessary cleanup to ensure a graceful shutdown in this time. The container instance will be rebooted elsewhere shortly after this.

Current Behavior

// control-server.ts:649-654
const shutdown = async () => {
  stopHeartbeat();
  await stopAll();      // Immediately aborts all sessions
  process.exit(0);
};
process.on("SIGTERM", () => void shutdown());

stopAll() aborts every session via session.abort(), sets all agents to exited, and kills SDK servers. No git push, no checkpoint save, no TownDO notification.

Proposed SIGTERM Handling

Phase 1: Notify TownDO (immediate, on SIGTERM)

Container sends POST /api/towns/:townId/container-eviction to the worker. The TownDO:

  • Inserts a container_eviction event
  • The reconciler processes this event and sets a draining flag
  • Reconciler stops emitting dispatch_agent actions for new work
  • Does NOT interrupt currently running agents

Phase 2: Nudge running agents to save and park (first 2 minutes)

For each running agent, inject a nudge message:

  • Polecats: "The container is shutting down. Please commit and push your current changes immediately, then call gt_done. You have 2 minutes."
  • Refinery: "The container is shutting down. If your review is complete, call gt_done now. Otherwise, your work will be saved and the review will resume after restart."
  • Mayor: No nudge needed — conversation history is already in AgentDO events.

Phase 3: Wait for agents to finish (up to 10 minutes)

Monitor running agents. As each one calls gt_done or finishes, it exits cleanly. Wait until all agents have exited OR 10 minutes have elapsed (leaving 5 min buffer before SIGKILL).

Phase 4: Force save and exit (last 5 minutes)

For any agents still running after 10 minutes:

  • Force git add -A && git commit -m "WIP: container eviction save" && git push
  • Abort the session
  • Report agentCompleted with a reason indicating eviction save

Phase 5: Clean exit — stopAll() as today, then process.exit(0).
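The five phases above can be sketched as a single drainAll with injectable hooks (all hook names are hypothetical), so the ordering can be exercised without a real container:

```typescript
type DrainHooks = {
  notifyTownDO: () => Promise<void>;             // Phase 1: reconciler pauses dispatch
  nudgeAgents: () => Promise<void>;              // Phase 2: ask agents to save and park
  runningAgents: () => string[];                 // agents that have not yet exited
  forceSave: (agentId: string) => Promise<void>; // Phase 4: WIP commit + push
  stopAll: () => Promise<void>;                  // Phase 5: teardown as today
};

async function drainAll(
  hooks: DrainHooks,
  waitMs = 10 * 60_000, // Phase 3 budget, leaving buffer before SIGKILL
  pollMs = 5_000,
): Promise<void> {
  await hooks.notifyTownDO();
  await hooks.nudgeAgents();
  const deadline = Date.now() + waitMs;
  while (hooks.runningAgents().length > 0 && Date.now() < deadline) {
    await new Promise((r) => setTimeout(r, pollMs)); // wait for gt_done exits
  }
  for (const id of hooks.runningAgents()) {
    await hooks.forceSave(id); // stragglers get an eviction WIP save
  }
  await hooks.stopAll(); // caller then calls process.exit(0)
}
```

The SIGTERM handler would call drainAll in place of the current stopAll.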

Timing Budget

T+0:00  SIGTERM received
T+0:01  Notify TownDO (draining), nudge agents to save
T+0:01  Agents begin saving (commit, push, gt_done)
T+2:00  Most agents have finished saving
T+10:00 Force-save remaining agents (git commit + push)
T+10:30 stopAll(), report completions
T+11:00 process.exit(0)
T+15:00 SIGKILL (we should be long gone)

Reconciler Integration

applyEvent('container_eviction') sets a draining flag. reconcileBeads Rule 1 and reconcileReviewQueue Rule 5 check it before emitting dispatch_agent. After the container restarts and sends its first heartbeat, the TownDO clears the draining flag and dispatch resumes.
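A minimal sketch of the flag, assuming a reducer-style applyEvent; the state and event shapes are illustrative, not the real reconciler types:

```typescript
type TownState = { draining: boolean };
type TownEvent = { type: string };

function applyEvent(state: TownState, event: TownEvent): TownState {
  switch (event.type) {
    case "container_eviction":
      return { ...state, draining: true };  // pause dispatch of new work
    case "container_heartbeat":
      return { ...state, draining: false }; // new container is up: resume
    default:
      return state;
  }
}

// Dispatch rules consult the flag before emitting dispatch_agent actions.
function canDispatch(state: TownState): boolean {
  return !state.draining;
}
```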


Tier 2: DO-backed persistence — never write to disk (long-term strategy)

The Tier 1 approach reconstructs from streaming deltas (fragile, lossy). The proper fix is to never lose the state in the first place by persisting all session and control server state to Durable Objects instead of ephemeral container disk/memory.

Design principle: the container never writes to disk

Cloudflare Container disks are ephemeral — anything written is lost on eviction. DOs are durable and globally consistent. Writes within the same Cloudflare colo are sub-millisecond. If all state lives in DOs, container eviction becomes a non-event — the new container reads the same state from the same DOs and picks up where the old one left off.

2a. SDK session storage → AgentDO

The Kilo SDK uses SQLite for session persistence. In the container, this SQLite file lives on ephemeral disk. The SDK's storage layer needs a pluggable backend:

Option A — SDK storage adapter (clean): Implement the SDK's storage interface with HTTP calls to AgentDO. Requires SDK changes but is architecturally clean.

Option B — SQLite proxy (transparent): Intercept the SDK's SQLite reads/writes and route them to AgentDO over HTTP. No SDK changes needed.

Option C — Write-behind cache (pragmatic): SDK writes to local ephemeral disk as today (fast). A background process async-flushes completed conversation turns to AgentDO. On container restart, hydrate local SQLite from AgentDO before starting agents. Combines local write speed with DO durability.

Option C is the most pragmatic starting point.
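Option C can be sketched as a small write-behind queue. Here `persist` stands in for the HTTP call to AgentDO, and all names are hypothetical:

```typescript
type Turn = { role: "user" | "assistant"; content: string };

class WriteBehindTurnLog {
  private pending: Turn[] = []; // written locally, not yet in the DO
  private durable: Turn[] = []; // acknowledged by AgentDO

  // Called after the local SQLite write completes (fast path).
  append(turn: Turn): void {
    this.pending.push(turn);
  }

  // Background flush: ship pending turns; on failure keep them for retry.
  async flush(persist: (turns: Turn[]) => Promise<void>): Promise<void> {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    try {
      await persist(batch);
      this.durable.push(...batch);
    } catch {
      this.pending = batch.concat(this.pending); // retry on next flush
    }
  }

  // On restart: what the DO has, for hydrating local SQLite.
  persisted(): Turn[] {
    return this.durable;
  }
}
```

The window of loss is whatever sits in `pending` at eviction time, which Tier 1.5's drain window is meant to shrink to near zero.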

2b. Control server state → TownContainerDO

Persist the ProcessManager agents Map to TownContainerDO so the new container instance knows which agents were running, their session IDs, and ports. Boot becomes: read registry → hydrate sessions → resume.
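A sketch of the registry round-trip; the entry fields are an assumption about what a restarted container needs to resume, not the real ProcessManager schema:

```typescript
type AgentProcessEntry = { agentId: string; sessionId: string; port: number };

// Serialize the in-memory agents Map for storage in TownContainerDO.
function serializeRegistry(agents: Map<string, AgentProcessEntry>): string {
  return JSON.stringify([...agents.values()]);
}

// On boot, rebuild the Map from the DO-stored JSON before resuming sessions.
function hydrateRegistry(json: string): Map<string, AgentProcessEntry> {
  const entries: AgentProcessEntry[] = JSON.parse(json);
  return new Map(entries.map((e) => [e.agentId, e]));
}
```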

2c. Recovery flow with DO-backed persistence

Container evicted
  → TownDO alarm detects dead container
  → New container starts
  → Control server reads process registry from TownContainerDO
  → For each previously-running agent:
      → Read session state from AgentDO
      → Hydrate local SQLite with persisted conversation
      → Start SDK server with restored session
      → Agent resumes mid-conversation — no context loss
  → Report ready to TownDO

Files

  • container/src/control-server.ts — SIGTERM handler (line 649-657)
  • container/src/process-manager.ts — stopAll() (line 759), new drainAll() function
  • src/dos/town/reconciler.ts — draining flag check in dispatch rules
  • src/dos/town/events.ts — new container_eviction event type
  • src/dos/Town.do.ts — sendMayorMessage checkpoint fix, new /container-eviction endpoint

Acceptance Criteria

Tier 1 (quick fix)

  • sendMayorMessage reads and passes the Mayor's checkpoint
  • Conversation history reconstructed from AgentDO events on re-dispatch
  • Mayor re-dispatch includes prior conversation transcript in context
  • Transcript truncated/summarized to fit context window limits

Tier 1.5 (graceful eviction)

  • SIGTERM triggers a drain sequence instead of immediate abort
  • TownDO notified of eviction, reconciler pauses dispatch
  • Running agents nudged to save and push
  • Force-save after 10 minutes for stragglers
  • Container exits cleanly within 15-minute SIGKILL window

Tier 2 (DO-backed persistence)

  • SDK session state persisted to AgentDO
  • Control server process registry persisted to TownContainerDO
  • Container boot hydrates from DOs — agents resume with full context
  • Zero context loss on container eviction


Labels

  • P1 — Should fix before soft launch
  • enhancement — New feature or request
  • gt:container — Container management, agent processes, SDK, heartbeat
  • kilo-auto-fix — Auto-generated label by Kilo
  • kilo-triaged — Auto-generated label by Kilo
