Witness & Deacon: Alarm-driven orchestration with on-demand LLM triage #442
Description
Overview
In local Gastown, the Witness and Deacon are persistent AI agent sessions running continuous patrol loops in tmux. They burn LLM tokens on every cycle, even though ~90% of their behavior is mechanical — threshold checks, protocol message routing, session liveness detection, timer evaluation. Genuine reasoning is only needed for a small set of ambiguous situations.
The cloud should not replicate this. The TownDO alarm IS the patrol loop. It should run all mechanical checks as deterministic code, and only spawn short-lived LLM agent sessions when a check produces an ambiguous result that requires reasoning.
Parent: #204
Current State — What Already Works
The TownDO alarm loop (alarm() at src/dos/Town.do.ts) already runs 7 sub-tasks on a 5s (active) / 1m (idle) interval:
| Sub-task | Status | Notes |
|---|---|---|
| `ensureContainerReady()` | Done | Pings container `/health`, triggers restart if dead |
| `processReviewQueue()` | Done | Pops MR beads, spawns refinery, polls PRs, recovers stuck reviews |
| `processConvoyLandings()` | Done | Detects `ready_to_land` convoys, creates landing MR |
| `schedulePendingWork()` | Done | Dispatches idle+hooked agents, marks beads failed after 5 dispatch attempts |
| `witnessPatrol()` | Partial | Only does zombie detection (container status reconciliation) and basic GUPP mail (30-min stale `last_activity_at` sends a warning mail; no escalation or force-stop) |
| `deliverPendingMail()` | Done | Pushes undelivered mail to working agents |
| `reEscalateStaleEscalations()` | Done | Bumps severity of unacknowledged escalations after 4h thresholds |
What's completely missing:
- `deaconPatrol()` — no function exists; some behaviors are scattered across other alarm sub-tasks
- Stale hook detection (idle agent + hook + no dispatch for extended period)
- Stranded convoy detection (convoy has open beads with no assigned agent)
- Agent GC (dead/completed agents accumulate in the DB indefinitely)
- Per-bead timeout enforcement (no timer gates)
- Triage request queue and LLM triage agent dispatch
- External health watchdog (no Cron Trigger or independent alarm monitor)
- Witness/deacon system prompts (only a one-line stub for witness)
What was intentionally eliminated:
- Protocol mail flow (POLECAT_DONE → MERGE_READY → MERGED) — replaced by direct bead state transitions in the beads-centric refactor (Make Cloud Gastown beads-centric: unify all object types into the beads primitive #441).
  `agentDone()` creates MR beads directly, `completeReviewWithResult()` closes them. No protocol messages needed.
Background: What Local Gastown's Witness & Deacon Actually Do
Three execution layers in local Gastown
- Go Daemon — Pure Go process on a 3-minute heartbeat. All behavior is mechanical. Handles session liveness, crash loop detection, orphan cleanup, GUPP checks, heartbeat freshness.
- Deacon — LLM agent in tmux running the `mol-deacon-patrol` formula. Continuous loop: inbox check → orphan cleanup → spawn triggers → gate evaluation → convoy checks → health scan → zombie scan → plugin run → loop.
- Witness (one per rig) — LLM agent in tmux running the `mol-witness-patrol` formula. Continuous loop: inbox check → process cleanups → check refinery → survey workers → loop.
Mechanical behaviors (deterministic — no LLM needed)
These are implemented as Go handler functions that the LLM agents invoke but don't reason about:
Witness mechanical behaviors:
- Zombie detection: cross-reference agent metadata state with container process state
- Hung session detection (30-min inactivity threshold on `last_activity_at`)
- GUPP violation detection (hook set + no progress for 30 min)
- Orphaned work detection (hook set + no container process)
- Auto-nuke clean agents, flag dirty ones for triage
- Convoy/swarm completion tracking (count closed vs total tracked beads)
- Timer gate evaluation (`created_at + timeout < now`)
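The witness checks above are pure timestamp arithmetic. A minimal sketch, assuming an illustrative `AgentRow` shape (the real schema lives in the codebase's DB tables; function names are mine, not from the source):

```typescript
// Illustrative agent row -- the real shape lives in src/db/tables.
interface AgentRow {
  state: 'working' | 'idle' | 'dead';
  hookedBeadId: string | null;
  lastActivityAt: number; // epoch ms
}

const THIRTY_MIN = 30 * 60 * 1000;

// Hung session: a working agent with no activity for 30 minutes.
function isHungSession(agent: AgentRow, now: number): boolean {
  return agent.state === 'working' && now - agent.lastActivityAt > THIRTY_MIN;
}

// GUPP violation: hook set + no progress for 30 minutes.
function isGuppViolation(agent: AgentRow, now: number): boolean {
  return agent.hookedBeadId !== null && now - agent.lastActivityAt > THIRTY_MIN;
}

// Timer gate: created_at + timeout < now.
function isTimerGateExpired(createdAt: number, timeoutMs: number, now: number): boolean {
  return createdAt + timeoutMs < now;
}
```

Because these are pure functions of rows plus a clock, they can run on every alarm tick at zero LLM cost.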
Deacon mechanical behaviors:
- Heartbeat freshness check (timestamp age thresholds)
- Redispatch failed beads (cooldown timer + attempt counter + max retry threshold)
- Stale hook detection (dead session + hooked bead OR unknown assignee + age > 1 hour)
- Convoy completion checks (all tracked beads closed → land)
- Stranded convoy detection (open beads with no assigned agent)
- Gate evaluation (elapsed time > timeout)
- Crash loop detection (restart count + timing with exponential backoff)
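The crash-loop check is the most stateful of these, but it is still deterministic: count restarts inside a sliding window and back off exponentially. A sketch under assumed parameters (the 10-minute window, threshold of 3, and 30s base delay are illustrative placeholders, not values from the codebase):

```typescript
const CRASH_WINDOW_MS = 10 * 60 * 1000; // assumed: sliding window for counting restarts
const CRASH_THRESHOLD = 3;              // assumed: restarts tolerated inside the window

// Crash loop: too many restarts inside the sliding window.
function isCrashLooping(restartTimestamps: number[], now: number): boolean {
  const recent = restartTimestamps.filter((t) => now - t < CRASH_WINDOW_MS);
  return recent.length >= CRASH_THRESHOLD;
}

// Exponential backoff: delay doubles for each restart beyond the threshold.
function nextRestartDelayMs(recentRestarts: number, baseMs = 30_000): number {
  return baseMs * 2 ** Math.max(0, recentRestarts - CRASH_THRESHOLD);
}
```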
Intelligent behaviors (genuinely need LLM reasoning)
These are the ~10% of behaviors where deterministic code can't make the right call:
| Behavior | Why LLM is needed |
|---|---|
| Dirty polecat triage | Must read git status/git diff output and judge if uncommitted changes are valuable work worth saving, disposable artifacts, or a confused state requiring escalation |
| Refinery queue health assessment | Must reason about queue depth, staleness patterns, time context — no hardcoded thresholds |
| Live agent progress inspection | Must interpret agent conversation/activity output to determine if an agent is stuck, thinking deeply, or making slow but real progress |
| Help request handling | When a polecat sends HELP, must understand the problem domain and craft contextual guidance |
| Escalation assessment | Must understand an escalation's context to decide: handle locally, forward to Mayor, or alert human |
| Zombie scan confirmation | Verify that automated zombie detection results are correct before taking destructive action (nuke) |
| Contextual notification composition | Compose convoy completion summaries, escalation descriptions, handoff notes |
Cloud Architecture: Alarm + On-Demand Triage Agents
TownDO alarm handler — what needs to be added
The alarm already handles the items marked "exists" below. This issue adds the items marked NEW:
```
TownDO.alarm()
├── ensureContainerReady()            -- exists
├── processReviewQueue()              -- exists
├── processConvoyLandings()           -- exists
├── schedulePendingWork()             -- exists
├── witnessPatrol()                   -- expand with:
│   ├── detectZombies()               -- exists (container status reconciliation)
│   ├── detectGUPPViolations()        -- exists but only sends mail; add escalation + force-stop after threshold
│   ├── detectOrphanedWork()          NEW: idle+hooked agents with no dispatch activity
│   ├── agentGC()                     NEW: delete dead/completed agents past retention period
│   ├── checkTimerGates()             NEW: per-bead timeout enforcement
│   └── flagForTriage()               NEW: dirty/ambiguous → create triage request bead
├── deaconPatrol()                    NEW function:
│   ├── detectStaleHooks()            NEW: hooked for unreasonable duration without activity
│   ├── feedStrandedConvoys()         NEW: convoy has open beads with no assignee → auto-sling
│   └── detectCrashLoops()            NEW: same agent failing repeatedly in short window
├── deliverPendingMail()              -- exists
├── reEscalateStaleEscalations()      -- exists
└── maybeDispatchTriageAgent()        NEW: if triage request beads queued, spawn triage agent
```
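One property the tree implies but doesn't state: sub-tasks should be failure-isolated, so a bug in one check cannot starve the rest of the patrol. A minimal sketch of that orchestration pattern (the `SubTask`/`runPatrol` names are illustrative, not from the codebase):

```typescript
// Each alarm sub-task runs behind its own try/catch so one failing
// check cannot block the others; failures are logged and the patrol
// continues with the next sub-task.
type SubTask = { name: string; run: () => Promise<void> };

async function runPatrol(tasks: SubTask[], log: (msg: string) => void): Promise<void> {
  for (const task of tasks) {
    try {
      await task.run();
    } catch (err) {
      // Isolate the failure: record it and keep patrolling.
      log(`alarm sub-task ${task.name} failed: ${String(err)}`);
    }
  }
}
```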
On-demand triage agent (LLM, spawned when needed)
When the alarm's mechanical checks produce results that need reasoning, it queues them as triage request beads (type = 'triage_request') with structured context. When the queue is non-empty, the alarm dispatches a short-lived triage agent in the container.
The triage_request bead type must be added to the BeadType enum in src/types.ts and src/db/tables/beads.table.ts.
```typescript
// In TownDO alarm handler
const triageQueue = await this.listBeads({ type: 'triage_request', status: 'open' });
if (triageQueue.length > 0) {
  await this.dispatchTriageAgent(triageQueue);
}
```

The triage agent gets a focused system prompt:
```
You are a Gastown triage agent. You will be given a list of situations that
require judgment. For each one, assess the situation and take one of the
prescribed actions. Be decisive. When done, call gt_done.

Situations to assess:

1. [DIRTY_POLECAT] Agent "Toast" has uncommitted changes after completion.
   Git status: <output>
   Git diff --stat: <output>
   Options: COMMIT_AND_PUSH | DISCARD | ESCALATE

2. [STUCK_AGENT] Agent "Maple" has not made progress in 45 minutes.
   Last activity: <timestamp>
   Recent conversation tail: <last 20 lines>
   Options: NUDGE | RESTART | ESCALATE

3. [HELP_REQUEST] Agent "Shadow" sent HELP: "Can't resolve merge conflict in auth.ts"
   Context: <bead body>
   Options: PROVIDE_GUIDANCE | ESCALATE_TO_MAYOR
```
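The numbered situations can be rendered mechanically from the queued triage beads; no LLM is needed to build the prompt. A sketch, assuming the metadata field names from the schema in this issue (the `TriageBead` shape and `renderTriagePrompt` name are illustrative):

```typescript
// Render queued triage_request beads into numbered prompt sections.
// Field names follow the metadata schema described in this issue.
interface TriageBead {
  metadata: {
    triage_type: string;              // e.g. "dirty_polecat"
    agent_bead_id: string;            // which agent this concerns
    context: Record<string, string>;  // type-specific context
    options: string[];                // allowed actions
  };
}

function renderTriagePrompt(beads: TriageBead[]): string {
  const items = beads.map((bead, i) => {
    const m = bead.metadata;
    const ctx = Object.entries(m.context)
      .map(([k, v]) => `   ${k}: ${v}`)
      .join('\n');
    return `${i + 1}. [${m.triage_type.toUpperCase()}] agent bead ${m.agent_bead_id}\n${ctx}\n   Options: ${m.options.join(' | ')}`;
  });
  return `Situations to assess:\n${items.join('\n')}`;
}
```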
The triage agent processes each item, takes an action (via tool calls back to the TownDO), and exits. Session lifetime: seconds to minutes, not hours. LLM cost: proportional to actual ambiguity in the system, not to wall-clock uptime.
Triage request bead schema
Triage requests are beads (consistent with #441):
```
-- No separate table needed. Uses the universal beads table.
-- type = 'triage_request'
-- metadata JSON contains the structured context:
{
  "triage_type": "dirty_polecat",  -- or "stuck_agent", "help_request", "queue_health", "zombie_confirm"
  "agent_bead_id": "...",          -- which agent this concerns
  "context": {                     -- type-specific context
    "git_status": "...",
    "git_diff_stat": "..."
  },
  "options": ["COMMIT_AND_PUSH", "DISCARD", "ESCALATE"]
}
```

Triage agent tools
The triage agent needs a narrow tool set (subset of the existing plugin):
| Tool | Purpose |
|---|---|
| `gt_triage_resolve` | Resolve a triage request with a chosen action. TownDO executes the action (nuke, restart, escalate, etc.) |
| `gt_mail_send` | Send contextual guidance to a stuck agent |
| `gt_escalate` | Forward to Mayor or human |
| `gt_nudge` | Send a message to a running agent's session |
| `gt_done` | Signal triage session complete |
When the alarm does NOT spawn a triage agent
Most patrol cycles will have zero ambiguous situations. The alarm runs, all checks pass (or produce clear mechanical outcomes like agent GC), and no triage agent is needed. The LLM is only invoked when the system encounters genuine uncertainty.
Expected frequency: triage agents spawn on <10% of alarm cycles in a healthy town. In a town with many stuck or failing agents, they'll spawn more often — which is correct, because that's when reasoning is most valuable.
What This Replaces
| Local Gastown | Cloud Gastown |
|---|---|
| Go Daemon (3-min heartbeat) | TownDO alarm (5s active / 1m idle) |
| Boot agent (ephemeral AI triage per tick) | Not needed — TownDO alarm is the external observer |
| Deacon (persistent AI patrol loop) | deaconPatrol() in alarm handler (mechanical) + on-demand triage agent (intelligent) |
| Witness (persistent AI patrol loop per rig) | witnessPatrol() in alarm handler (mechanical) + on-demand triage agent (intelligent) |
Why the watchdog chain simplifies
Local Gastown needs Boot→Deacon→Witness because "a hung Deacon can't detect it's hung" — Boot provides an external observer. In the cloud, DO alarms are the external observer. They're durable (re-fire after eviction), managed by the Cloudflare runtime (not by user code that can hang), and independent of the container. If the container dies, the alarm still fires and detects dead agents. The three-tier watchdog chain collapses to: DO alarm (always fires) → mechanical checks → triage agent (when needed).
One risk: a logic bug in the alarm handler could silently break the town. Mitigation: a Cron Trigger that pings each active town's health endpoint independently of the DO alarm, providing an external watchdog analogous to Boot.
Implementation Plan
Step 1: Expand witnessPatrol() with full mechanical checks
Enhance the existing witnessPatrol() to cover: GUPP escalation (not just mail — escalate after a second threshold, force-stop after a third), orphaned work detection (idle+hooked+no dispatch), agent GC (delete dead agents past retention), per-bead timeout enforcement.
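The three-stage GUPP response can be expressed as a simple threshold ladder evaluated against how long an agent has been stale. A sketch: the 30-minute warn threshold is the existing check; the 2h escalation and 4h force-stop thresholds below are illustrative placeholders, not decided values.

```typescript
// Three-stage GUPP ladder: warn -> escalate -> force-stop, driven by
// how long the agent has gone without progress.
type GuppAction = 'none' | 'warn_mail' | 'escalate' | 'force_stop';

const WARN_MS = 30 * 60 * 1000;           // existing 30-min warning threshold
const ESCALATE_MS = 2 * 60 * 60 * 1000;   // assumed second threshold
const FORCE_STOP_MS = 4 * 60 * 60 * 1000; // assumed third threshold

function guppAction(staleForMs: number): GuppAction {
  if (staleForMs >= FORCE_STOP_MS) return 'force_stop';
  if (staleForMs >= ESCALATE_MS) return 'escalate';
  if (staleForMs >= WARN_MS) return 'warn_mail';
  return 'none';
}
```

Keeping the ladder as a pure function makes the escalation policy trivial to unit-test and tune.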
Step 2: Add deaconPatrol()
New alarm sub-task covering: stale hook detection, stranded convoy feeding, crash loop detection.
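Stranded-convoy detection reduces to a predicate over the convoy's tracked beads: open work exists and none of it is assigned. A sketch with an illustrative `TrackedBead` shape (one reading of "no assigned agent"; the real check may differ):

```typescript
// A convoy is stranded when it has open beads and none of them
// has an assignee -- the auto-sling trigger in deaconPatrol().
interface TrackedBead {
  status: 'open' | 'closed';
  assigneeId: string | null;
}

function isStrandedConvoy(beads: TrackedBead[]): boolean {
  const open = beads.filter((b) => b.status === 'open');
  return open.length > 0 && open.every((b) => b.assigneeId === null);
}
```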
Step 3: Triage request queue
Add triage_request to the BeadType enum in src/types.ts and src/db/tables/beads.table.ts. When mechanical checks produce ambiguous results (dirty polecat, stuck agent, help request), create triage request beads with structured context instead of taking immediate action.
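Flagging a dirty polecat then becomes a pure row-construction step rather than an immediate destructive action. A sketch following the metadata schema in this issue (the `BeadInsert` shape and builder name are illustrative, not the real insert API):

```typescript
// Build a triage_request bead for a dirty polecat instead of nuking it.
// Metadata layout follows the schema described in this issue.
interface BeadInsert {
  type: 'triage_request';
  status: 'open';
  metadata: string; // JSON-encoded structured context
}

function buildDirtyPolecatTriage(
  agentBeadId: string,
  gitStatus: string,
  gitDiffStat: string,
): BeadInsert {
  return {
    type: 'triage_request',
    status: 'open',
    metadata: JSON.stringify({
      triage_type: 'dirty_polecat',
      agent_bead_id: agentBeadId,
      context: { git_status: gitStatus, git_diff_stat: gitDiffStat },
      options: ['COMMIT_AND_PUSH', 'DISCARD', 'ESCALATE'],
    }),
  };
}
```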
Step 4: Triage agent dispatch
When triage requests are queued, the alarm dispatches a short-lived triage agent session in the container with a focused prompt and narrow tool set. The agent processes all pending requests and exits. Add a system prompt at src/prompts/triage-system.prompt.ts.
Step 5: Triage agent tools
Add gt_triage_resolve tool to the container plugin. This tool takes a triage request bead ID and a chosen action, and the TownDO executes the action (nuke agent, restart agent, send mail, escalate, etc.).
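On the TownDO side, the resolve handler should validate the chosen action against the request's allowed options before executing anything destructive. A sketch (all names illustrative; the real tool plumbing lives in the container plugin):

```typescript
// TownDO-side handling of gt_triage_resolve: reject actions outside the
// request's allowed options, then delegate execution (nuke, restart,
// escalate, ...) to the TownDO.
interface TriageRequest {
  id: string;
  options: string[]; // allowed actions for this request
}

type Executor = (requestId: string, action: string) => Promise<void>;

async function resolveTriage(req: TriageRequest, action: string, execute: Executor): Promise<void> {
  if (!req.options.includes(action)) {
    throw new Error(`action ${action} not allowed for triage request ${req.id}`);
  }
  await execute(req.id, action);
}
```

Validating against the options list keeps a confused triage agent from invoking an action the mechanical check never offered.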
Step 6: External health watchdog
Add a Cron Trigger (or separate DO with its own alarm) that periodically verifies each active town's alarm is firing and its container is responsive. This replaces Boot's role as the external observer.
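The watchdog's verdict logic can stay separate from the Cron Trigger wiring (a `crons` entry under `[triggers]` in `wrangler.toml` plus a `scheduled()` handler) so it is testable. A sketch: the `HealthReport` shape and the 5-minute staleness threshold are illustrative assumptions.

```typescript
// Pure verdict function for the external watchdog. The Cron Trigger's
// scheduled() handler would fetch each town's health report and act on
// the verdict (alert, force-restart, page a human).
interface HealthReport {
  alarmLastFiredAt: number; // epoch ms of last observed alarm tick
  containerOk: boolean;     // did the container /health check pass?
}

const MAX_ALARM_AGE_MS = 5 * 60 * 1000; // assumed: >5 min without an alarm tick is a fault

function watchdogVerdict(report: HealthReport, now: number): 'ok' | 'alarm_stalled' | 'container_down' {
  if (now - report.alarmLastFiredAt > MAX_ALARM_AGE_MS) return 'alarm_stalled';
  if (!report.containerOk) return 'container_down';
  return 'ok';
}
```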
Acceptance Criteria
- [ ] `witnessPatrol()` expanded: GUPP escalation after threshold, orphaned work detection, agent GC, per-bead timeouts
- [ ] `deaconPatrol()` added: stale hook detection, stranded convoy feeding, crash loop detection
- [ ] `triage_request` bead type added to the `BeadType` enum in `src/types.ts` and `src/db/tables/beads.table.ts`
- [ ] Ambiguous situations produce triage request beads with structured context
- [ ] Triage agent dispatched only when triage requests are queued
- [ ] Triage agent has a focused system prompt (`src/prompts/triage-system.prompt.ts`) and a narrow tool set
- [ ] Triage agent processes all pending requests and exits (no persistent session)
- [ ] `gt_triage_resolve` tool implemented in the container plugin
- [ ] External health watchdog exists independent of the TownDO alarm
- [ ] No persistent Witness or Deacon LLM sessions running continuously
Notes
- No data migration needed — cloud Gastown hasn't deployed to production.
- Protocol mail flow (POLECAT_DONE → MERGE_READY → MERGED) was eliminated by the beads-centric refactor (Make Cloud Gastown beads-centric: unify all object types into the beads primitive #441). State transitions are now direct: `agentDone()` → MR bead, `completeReviewWithResult()` → close. No need to reintroduce protocol messages.
- The existing `PatrolResult` type in `src/types.ts` defines `dead_agents`, `stale_agents`, and `orphaned_beads` arrays but is unused — it can be repurposed or replaced for the expanded witness/deacon patrols.
- The witness role exists as a town-wide singleton in `getOrCreateAgent()` but has only a one-line stub prompt in `container-dispatch.ts`. The triage agent replaces what would have been a persistent witness LLM session.