Witness & Deacon: Alarm-driven orchestration with on-demand LLM triage #442

@jrf0110

Overview

In local Gastown, the Witness and Deacon are persistent AI agent sessions running continuous patrol loops in tmux. They burn LLM tokens on every cycle, even though ~90% of their behavior is mechanical — threshold checks, protocol message routing, session liveness detection, timer evaluation. Genuine reasoning is only needed for a small set of ambiguous situations.

The cloud should not replicate this. The TownDO alarm IS the patrol loop. It should run all mechanical checks as deterministic code, and only spawn short-lived LLM agent sessions when a check produces an ambiguous result that requires reasoning.

Parent: #204

Current State — What Already Works

The TownDO alarm loop (alarm() at src/dos/Town.do.ts) already runs 7 sub-tasks on a 5s (active) / 1m (idle) interval:

| Sub-task | Status | Notes |
| --- | --- | --- |
| ensureContainerReady() | Done | Pings container /health, triggers restart if dead |
| processReviewQueue() | Done | Pops MR beads, spawns refinery, polls PRs, recovers stuck reviews |
| processConvoyLandings() | Done | Detects ready_to_land convoys, creates landing MR |
| schedulePendingWork() | Done | Dispatches idle+hooked agents, marks beads failed after 5 dispatch attempts |
| witnessPatrol() | Partial | Only does zombie detection (container status reconciliation) and basic GUPP mail (30-min stale last_activity_at sends a warning mail, no escalation or force-stop) |
| deliverPendingMail() | Done | Pushes undelivered mail to working agents |
| reEscalateStaleEscalations() | Done | Bumps severity of unacknowledged escalations after 4h thresholds |
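
The 5s (active) / 1m (idle) cadence amounts to picking the next alarm time from whether the town currently has active work. A minimal sketch, assuming a hypothetical `hasActiveWork` flag; the constant and function names are illustrative and do not come from the codebase:

```typescript
// Hypothetical constants mirroring the 5s (active) / 1m (idle) cadence.
const ACTIVE_INTERVAL_MS = 5_000;
const IDLE_INTERVAL_MS = 60_000;

// Returns the absolute time (epoch ms) at which the next alarm should fire.
function nextAlarmTime(nowMs: number, hasActiveWork: boolean): number {
  return nowMs + (hasActiveWork ? ACTIVE_INTERVAL_MS : IDLE_INTERVAL_MS);
}
```

In a Durable Object, a value like this would typically be passed to the storage API's `setAlarm()`, which accepts an epoch-millisecond timestamp.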

What's completely missing:

  • deaconPatrol() — no function exists; some behaviors are scattered across other alarm sub-tasks
  • Stale hook detection (idle agent + hook + no dispatch for extended period)
  • Stranded convoy detection (convoy has open beads with no assigned agent)
  • Agent GC (dead/completed agents accumulate in the DB indefinitely)
  • Per-bead timeout enforcement (no timer gates)
  • Triage request queue and LLM triage agent dispatch
  • External health watchdog (no Cron Trigger or independent alarm monitor)
  • Witness/deacon system prompts (only a one-line stub for witness)

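What was intentionally eliminated:

  • Protocol mail flow (POLECAT_DONE → MERGE_READY → MERGED), replaced by direct state transitions in the beads-centric refactor (#441); see Notes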

Background: What Local Gastown's Witness & Deacon Actually Do

Three execution layers in local Gastown

  1. Go Daemon — Pure Go process on a 3-minute heartbeat. All behavior is mechanical. Handles session liveness, crash loop detection, orphan cleanup, GUPP checks, heartbeat freshness.
  2. Deacon — LLM agent in tmux running mol-deacon-patrol formula. Continuous loop: inbox check → orphan cleanup → spawn triggers → gate evaluation → convoy checks → health scan → zombie scan → plugin run → loop.
  3. Witness (one per rig) — LLM agent in tmux running mol-witness-patrol formula. Continuous loop: inbox check → process cleanups → check refinery → survey workers → loop.

Mechanical behaviors (deterministic — no LLM needed)

These are implemented as Go handler functions that the LLM agents invoke but don't reason about:

Witness mechanical behaviors:

  • Zombie detection: cross-reference agent metadata state with container process state
  • Hung session detection (30-min inactivity threshold on last_activity_at)
  • GUPP violation detection (hook set + no progress for 30 min)
  • Orphaned work detection (hook set + no container process)
  • Auto-nuke clean agents, flag dirty ones for triage
  • Convoy/swarm completion tracking (count closed vs total tracked beads)
  • Timer gate evaluation (created_at + timeout < now)
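
Most of the witness checks above reduce to pure predicates over an agent snapshot, which is why they need no LLM. A sketch under stated assumptions: the `AgentSnapshot` shape, field names, and state values are hypothetical; only the 30-minute threshold and the `created_at + timeout < now` rule come from the list above.

```typescript
// All names and state values here are illustrative, not taken from the Gastown codebase.
type MetadataState = "working" | "idle" | "dead";

interface AgentSnapshot {
  metadataState: MetadataState;
  hookedBeadId: string | null;    // bead the agent is hooked to, if any
  lastActivityAtMs: number;       // epoch ms of last recorded activity
  containerProcessAlive: boolean; // whether a container process exists for the agent
}

const INACTIVITY_THRESHOLD_MS = 30 * 60 * 1000; // 30-minute threshold from the list above

// Zombie: metadata claims the agent is working, but no container process exists.
function isZombie(a: AgentSnapshot): boolean {
  return a.metadataState === "working" && !a.containerProcessAlive;
}

// GUPP violation: a hook is set and there has been no progress for 30 minutes.
function isGuppViolation(a: AgentSnapshot, nowMs: number): boolean {
  return a.hookedBeadId !== null && nowMs - a.lastActivityAtMs >= INACTIVITY_THRESHOLD_MS;
}

// Orphaned work: a hook is set but there is no container process to work on it.
function isOrphanedWork(a: AgentSnapshot): boolean {
  return a.hookedBeadId !== null && !a.containerProcessAlive;
}

// Timer gate: expired once created_at + timeout is in the past.
function isTimerGateExpired(createdAtMs: number, timeoutMs: number, nowMs: number): boolean {
  return createdAtMs + timeoutMs < nowMs;
}
```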

Deacon mechanical behaviors:

  • Heartbeat freshness check (timestamp age thresholds)
  • Redispatch failed beads (cooldown timer + attempt counter + max retry threshold)
  • Stale hook detection (dead session + hooked bead OR unknown assignee + age > 1 hour)
  • Convoy completion checks (all tracked beads closed → land)
  • Stranded convoy detection (open beads with no assigned agent)
  • Gate evaluation (elapsed time > timeout)
  • Crash loop detection (restart count + timing with exponential backoff)
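
The crash-loop and redispatch checks combine a counter, a time window, and a backoff. The following sketch shows one way to express them; every threshold below is an assumption chosen for illustration, since the issue does not specify values.

```typescript
// Thresholds are assumptions for illustration; the issue does not specify them.
const CRASH_WINDOW_MS = 10 * 60 * 1000; // look-back window for counting restarts
const CRASH_LOOP_THRESHOLD = 3;         // restarts within the window that count as a loop
const BASE_BACKOFF_MS = 30 * 1000;
const MAX_BACKOFF_MS = 30 * 60 * 1000;

// Counts restarts that happened within the look-back window.
function countRecentRestarts(restartTimesMs: number[], nowMs: number): number {
  return restartTimesMs.filter((t) => nowMs - t <= CRASH_WINDOW_MS).length;
}

function isCrashLooping(restartTimesMs: number[], nowMs: number): boolean {
  return countRecentRestarts(restartTimesMs, nowMs) >= CRASH_LOOP_THRESHOLD;
}

// Exponential backoff: the delay doubles with each recent restart, capped at MAX_BACKOFF_MS.
function backoffDelayMs(recentRestarts: number): number {
  if (recentRestarts <= 0) return 0;
  return Math.min(BASE_BACKOFF_MS * Math.pow(2, recentRestarts - 1), MAX_BACKOFF_MS);
}

// Redispatch gate: the cooldown has elapsed and the attempt counter is below the max.
function canRedispatch(
  attempts: number,
  lastAttemptMs: number,
  nowMs: number,
  maxAttempts: number,
  cooldownMs: number,
): boolean {
  return attempts < maxAttempts && nowMs - lastAttemptMs >= cooldownMs;
}
```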

Intelligent behaviors (genuinely need LLM reasoning)

These are the ~10% of behaviors where deterministic code can't make the right call:

| Behavior | Why LLM is needed |
| --- | --- |
| Dirty polecat triage | Must read git status/git diff output and judge if uncommitted changes are valuable work worth saving, disposable artifacts, or a confused state requiring escalation |
| Refinery queue health assessment | Must reason about queue depth, staleness patterns, time context; no hardcoded thresholds |
| Live agent progress inspection | Must interpret agent conversation/activity output to determine if an agent is stuck, thinking deeply, or making slow but real progress |
| Help request handling | When a polecat sends HELP, must understand the problem domain and craft contextual guidance |
| Escalation assessment | Must understand an escalation's context to decide: handle locally, forward to Mayor, or alert human |
| Zombie scan confirmation | Verify that automated zombie detection results are correct before taking destructive action (nuke) |
| Contextual notification composition | Compose convoy completion summaries, escalation descriptions, handoff notes |

Cloud Architecture: Alarm + On-Demand Triage Agents

TownDO alarm handler — what needs to be added

The alarm already handles the items marked "exists" below. This issue adds the items marked NEW and expands witnessPatrol() and detectGUPPViolations():

TownDO.alarm()
  ├── ensureContainerReady()              -- exists
  ├── processReviewQueue()                -- exists
  ├── processConvoyLandings()             -- exists
  ├── schedulePendingWork()               -- exists
  ├── witnessPatrol()                     -- expand with:
  │     ├── detectZombies()               -- exists (container status reconciliation)
  │     ├── detectGUPPViolations()        -- exists but only sends mail; add escalation + force-stop after threshold
  │     ├── detectOrphanedWork()          NEW: idle+hooked agents with no dispatch activity
  │     ├── agentGC()                     NEW: delete dead/completed agents past retention period
  │     ├── checkTimerGates()             NEW: per-bead timeout enforcement
  │     └── flagForTriage()               NEW: dirty/ambiguous → create triage request bead
  ├── deaconPatrol()                      NEW function:
  │     ├── detectStaleHooks()            NEW: hooked for unreasonable duration without activity
  │     ├── feedStrandedConvoys()         NEW: convoy has open beads with no assignee → auto-sling
  │     └── detectCrashLoops()            NEW: same agent failing repeatedly in short window
  ├── deliverPendingMail()                -- exists
  ├── reEscalateStaleEscalations()        -- exists
  └── maybeDispatchTriageAgent()          NEW: if triage request beads queued, spawn triage agent
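
One way to read the tree above is as an ordered list of sub-tasks executed once per alarm. The sketch below is an assumption about structure, not the existing alarm() implementation: it runs hypothetical sub-tasks in order and records failures so that one throwing check does not prevent later checks from running in the same cycle.

```typescript
// Hypothetical shape: each sub-task is a named async function.
interface SubTask {
  name: string;
  run: () => Promise<void>;
}

// Runs sub-tasks in order. A throwing sub-task is recorded and skipped so
// later sub-tasks still execute in the same cycle.
async function runAlarmCycle(tasks: SubTask[]): Promise<string[]> {
  const failed: string[] = [];
  for (const task of tasks) {
    try {
      await task.run();
    } catch {
      failed.push(task.name);
    }
  }
  return failed;
}
```

The returned list of failed sub-task names could feed an escalation or a health endpoint; how failures are surfaced is left open here.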

On-demand triage agent (LLM, spawned when needed)

When the alarm's mechanical checks produce results that need reasoning, it queues them as triage request beads (type = 'triage_request') with structured context. When the queue is non-empty, the alarm dispatches a short-lived triage agent in the container.

The triage_request bead type must be added to the BeadType enum in src/types.ts and src/db/tables/beads.table.ts.

// In TownDO alarm handler
const triageQueue = await this.listBeads({ type: 'triage_request', status: 'open' });
if (triageQueue.length > 0) {
  await this.dispatchTriageAgent(triageQueue);
}

The triage agent gets a focused system prompt:

You are a Gastown triage agent. You will be given a list of situations that 
require judgment. For each one, assess the situation and take one of the 
prescribed actions. Be decisive. When done, call gt_done.

Situations to assess:
1. [DIRTY_POLECAT] Agent "Toast" has uncommitted changes after completion.
   Git status: <output>
   Git diff --stat: <output>
   Options: COMMIT_AND_PUSH | DISCARD | ESCALATE

2. [STUCK_AGENT] Agent "Maple" has not made progress in 45 minutes.
   Last activity: <timestamp>
   Recent conversation tail: <last 20 lines>
   Options: NUDGE | RESTART | ESCALATE

3. [HELP_REQUEST] Agent "Shadow" sent HELP: "Can't resolve merge conflict in auth.ts"
   Context: <bead body>
   Options: PROVIDE_GUIDANCE | ESCALATE_TO_MAYOR

The triage agent processes each item, takes an action (via tool calls back to the TownDO), and exits. Session lifetime: seconds to minutes, not hours. LLM cost: proportional to actual ambiguity in the system, not to wall-clock uptime.
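
The "Situations to assess" section of the prompt can be rendered mechanically from the queued requests. A minimal sketch, assuming a hypothetical `TriageSituation` shape that is not part of the codebase:

```typescript
// Hypothetical in-memory form of one queued triage request.
interface TriageSituation {
  tag: string;                      // e.g. "DIRTY_POLECAT"
  summary: string;                  // one-line description of the situation
  details: Array<[string, string]>; // ordered label/value pairs shown under the summary
  options: string[];                // prescribed actions the agent may choose from
}

// Renders the numbered situation list in the layout used by the prompt above.
function renderSituations(situations: TriageSituation[]): string {
  return situations
    .map((s, i) => {
      const lines = [`${i + 1}. [${s.tag}] ${s.summary}`];
      for (const [label, value] of s.details) {
        lines.push(`   ${label}: ${value}`);
      }
      lines.push(`   Options: ${s.options.join(" | ")}`);
      return lines.join("\n");
    })
    .join("\n\n");
}
```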

Triage request bead schema

Triage requests are beads (consistent with #441):

-- No separate table needed. Uses the universal beads table.
-- type = 'triage_request'
-- metadata JSON contains the structured context:
{
  "triage_type": "dirty_polecat",          -- or "stuck_agent", "help_request", "queue_health", "zombie_confirm"
  "agent_bead_id": "...",                  -- which agent this concerns
  "context": {                             -- type-specific context
    "git_status": "...",
    "git_diff_stat": "..."
  },
  "options": ["COMMIT_AND_PUSH", "DISCARD", "ESCALATE"]
}
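
Because the metadata travels as JSON, the alarm and the triage tools would likely want to validate it before acting on it. The following is a sketch of a typed parser for the shape above; the type and function names are hypothetical, and the field names and triage_type values are the ones listed in the schema.

```typescript
type TriageType =
  | "dirty_polecat"
  | "stuck_agent"
  | "help_request"
  | "queue_health"
  | "zombie_confirm";

interface TriageRequestMetadata {
  triage_type: TriageType;
  agent_bead_id: string;
  context: Record<string, unknown>;
  options: string[];
}

const TRIAGE_TYPES: TriageType[] = [
  "dirty_polecat",
  "stuck_agent",
  "help_request",
  "queue_health",
  "zombie_confirm",
];

// Parses a bead's metadata JSON; returns null when it is not a well-formed triage request.
function parseTriageMetadata(raw: string): TriageRequestMetadata | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const m = value as Record<string, unknown>;
  const typeOk =
    typeof m.triage_type === "string" &&
    TRIAGE_TYPES.indexOf(m.triage_type as TriageType) !== -1;
  const agentOk = typeof m.agent_bead_id === "string";
  const contextOk = typeof m.context === "object" && m.context !== null;
  const optionsOk =
    Array.isArray(m.options) &&
    m.options.length > 0 &&
    m.options.every((o) => typeof o === "string");
  if (!typeOk || !agentOk || !contextOk || !optionsOk) return null;
  return m as unknown as TriageRequestMetadata;
}
```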

Triage agent tools

The triage agent needs a narrow tool set (subset of the existing plugin):

| Tool | Purpose |
| --- | --- |
| gt_triage_resolve | Resolve a triage request with a chosen action. TownDO executes the action (nuke, restart, escalate, etc.) |
| gt_mail_send | Send contextual guidance to a stuck agent |
| gt_escalate | Forward to Mayor or human |
| gt_nudge | Send a message to a running agent's session |
| gt_done | Signal triage session complete |
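
On the TownDO side, gt_triage_resolve would plausibly check the chosen action against the options recorded on the request before executing anything. A sketch of that validation step only, with a hypothetical `TriageRequest` shape; executing the action (nuke, restart, escalate) is out of scope here.

```typescript
// Hypothetical in-memory view of a triage request bead.
interface TriageRequest {
  beadId: string;
  status: "open" | "closed";
  options: string[];
  resolution?: string;
}

type ResolveResult =
  | { ok: true; action: string }
  | { ok: false; error: string };

// Validates the chosen action against the request's options, then closes the request.
function resolveTriage(request: TriageRequest, action: string): ResolveResult {
  if (request.status !== "open") {
    return { ok: false, error: `triage request ${request.beadId} is already closed` };
  }
  if (request.options.indexOf(action) === -1) {
    return { ok: false, error: `action ${action} is not one of: ${request.options.join(", ")}` };
  }
  request.status = "closed";
  request.resolution = action;
  return { ok: true, action };
}
```

Rejecting actions outside the prescribed options keeps the LLM's decision space as narrow as the prompt describes.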

When the alarm does NOT spawn a triage agent

Most patrol cycles will have zero ambiguous situations. The alarm runs, all checks pass (or produce clear mechanical outcomes like agent GC), and no triage agent is needed. The LLM is only invoked when the system encounters genuine uncertainty.

Expected frequency: triage agents spawn on <10% of alarm cycles in a healthy town. In a town with many stuck or failing agents, they'll spawn more often — which is correct, because that's when reasoning is most valuable.

What This Replaces

| Local Gastown | Cloud Gastown |
| --- | --- |
| Go Daemon (3-min heartbeat) | TownDO alarm (5s active / 1m idle) |
| Boot agent (ephemeral AI triage per tick) | Not needed: TownDO alarm is the external observer |
| Deacon (persistent AI patrol loop) | deaconPatrol() in alarm handler (mechanical) + on-demand triage agent (intelligent) |
| Witness (persistent AI patrol loop per rig) | witnessPatrol() in alarm handler (mechanical) + on-demand triage agent (intelligent) |

Why the watchdog chain simplifies

Local Gastown needs Boot→Deacon→Witness because "a hung Deacon can't detect it's hung" — Boot provides an external observer. In the cloud, DO alarms are the external observer. They're durable (re-fire after eviction), managed by the Cloudflare runtime (not by user code that can hang), and independent of the container. If the container dies, the alarm still fires and detects dead agents. The three-tier watchdog chain collapses to: DO alarm (always fires) → mechanical checks → triage agent (when needed).

One risk: a logic bug in the alarm handler could silently break the town. Mitigation: a Cron Trigger that pings each active town's health endpoint independently of the DO alarm, providing an external watchdog analogous to Boot.

Implementation Plan

Step 1: Expand witnessPatrol() with full mechanical checks

Enhance the existing witnessPatrol() to cover: GUPP escalation (not just mail — escalate after a second threshold, force-stop after a third), orphaned work detection (idle+hooked+no dispatch), agent GC (delete dead agents past retention), per-bead timeout enforcement.
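
The GUPP escalation ladder and agent GC described in this step can both be written as small pure functions. In this sketch the 30-minute warning threshold matches the existing behavior; the second and third thresholds, the retention period, and all type names are assumptions for illustration.

```typescript
type GuppAction = "none" | "warn_mail" | "escalate" | "force_stop";

const WARN_AFTER_MS = 30 * 60 * 1000;        // existing 30-minute warning threshold
const ESCALATE_AFTER_MS = 60 * 60 * 1000;    // assumed second threshold
const FORCE_STOP_AFTER_MS = 120 * 60 * 1000; // assumed third threshold

// Maps how long a hooked agent has been inactive to the strongest applicable action.
function guppActionFor(inactiveMs: number): GuppAction {
  if (inactiveMs >= FORCE_STOP_AFTER_MS) return "force_stop";
  if (inactiveMs >= ESCALATE_AFTER_MS) return "escalate";
  if (inactiveMs >= WARN_AFTER_MS) return "warn_mail";
  return "none";
}

// Hypothetical agent record; status values are illustrative.
interface AgentRecord {
  id: string;
  status: "working" | "idle" | "dead" | "completed";
  finishedAtMs: number | null; // when the agent died or completed, if it has
}

const AGENT_RETENTION_MS = 24 * 60 * 60 * 1000; // assumed retention period

// Returns ids of dead/completed agents whose retention period has passed.
function agentsToCollect(agents: AgentRecord[], nowMs: number): string[] {
  return agents
    .filter(
      (a) =>
        (a.status === "dead" || a.status === "completed") &&
        a.finishedAtMs !== null &&
        nowMs - a.finishedAtMs > AGENT_RETENTION_MS,
    )
    .map((a) => a.id);
}
```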

Step 2: Add deaconPatrol()

New alarm sub-task covering: stale hook detection, stranded convoy feeding, crash loop detection.
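
For the convoy part of deaconPatrol(), the stranded and ready-to-land conditions follow directly from the definitions given earlier (open beads with no assigned agent; all tracked beads closed). A minimal sketch with hypothetical `TrackedBead` and `Convoy` shapes:

```typescript
// Hypothetical shapes; field names are illustrative.
interface TrackedBead {
  id: string;
  status: "open" | "closed";
  assigneeAgentId: string | null;
}

interface Convoy {
  id: string;
  beads: TrackedBead[];
}

// Stranded beads: still open and nobody is assigned to work on them.
function strandedBeads(convoy: Convoy): TrackedBead[] {
  return convoy.beads.filter((b) => b.status === "open" && b.assigneeAgentId === null);
}

// Ready to land: the convoy tracks at least one bead and all of them are closed.
function isReadyToLand(convoy: Convoy): boolean {
  return convoy.beads.length > 0 && convoy.beads.every((b) => b.status === "closed");
}
```

The beads returned by `strandedBeads` are the candidates for the auto-sling described for feedStrandedConvoys().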

Step 3: Triage request queue

Add triage_request to the BeadType enum in src/types.ts and src/db/tables/beads.table.ts. When mechanical checks produce ambiguous results (dirty polecat, stuck agent, help request), create triage request beads with structured context instead of taking immediate action.

Step 4: Triage agent dispatch

When triage requests are queued, the alarm dispatches a short-lived triage agent session in the container with a focused prompt and narrow tool set. The agent processes all pending requests and exits. Add a system prompt at src/prompts/triage-system.prompt.ts.

Step 5: Triage agent tools

Add gt_triage_resolve tool to the container plugin. This tool takes a triage request bead ID and a chosen action, and the TownDO executes the action (nuke agent, restart agent, send mail, escalate, etc.).

Step 6: External health watchdog

Add a Cron Trigger (or separate DO with its own alarm) that periodically verifies each active town's alarm is firing and its container is responsive. This replaces Boot's role as the external observer.
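
The watchdog's check can stay as mechanical as the patrols it supervises. The sketch below assumes each town exposes a last-alarm timestamp and a container responsiveness flag through its health endpoint; the `TownHealth` shape and the silence threshold are hypothetical.

```typescript
// Hypothetical health report for one town.
interface TownHealth {
  townId: string;
  lastAlarmAtMs: number;       // epoch ms when the TownDO alarm last ran
  containerResponsive: boolean;
}

interface TownIssue {
  townId: string;
  reasons: string[];
}

const MAX_ALARM_SILENCE_MS = 3 * 60 * 1000; // assumed: three missed idle (1m) cycles

// Returns the towns whose alarm appears stalled or whose container is unresponsive.
function findUnhealthyTowns(towns: TownHealth[], nowMs: number): TownIssue[] {
  const issues: TownIssue[] = [];
  for (const t of towns) {
    const reasons: string[] = [];
    if (nowMs - t.lastAlarmAtMs > MAX_ALARM_SILENCE_MS) reasons.push("alarm_stalled");
    if (!t.containerResponsive) reasons.push("container_unresponsive");
    if (reasons.length > 0) issues.push({ townId: t.townId, reasons });
  }
  return issues;
}
```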

Acceptance Criteria

  • witnessPatrol() expanded: GUPP escalation after threshold, orphaned work detection, agent GC, per-bead timeouts
  • deaconPatrol() added: stale hook detection, stranded convoy feeding, crash loop detection
  • triage_request bead type added to BeadType enum in src/types.ts and src/db/tables/beads.table.ts
  • Ambiguous situations produce triage request beads with structured context
  • Triage agent dispatched only when triage requests are queued
  • Triage agent has a focused system prompt (src/prompts/triage-system.prompt.ts) and narrow tool set
  • Triage agent processes all pending requests and exits (no persistent session)
  • gt_triage_resolve tool implemented in container plugin
  • External health watchdog exists independent of the TownDO alarm
  • No persistent Witness or Deacon LLM sessions running continuously

Notes

  • No data migration needed — cloud Gastown hasn't deployed to production
  • Protocol mail flow (POLECAT_DONE → MERGE_READY → MERGED) was eliminated by the beads-centric refactor (#441, "Make Cloud Gastown beads-centric: unify all object types into the beads primitive"). State transitions are now direct: agentDone() → MR bead, completeReviewWithResult() → close. No need to reintroduce protocol messages.
  • The existing PatrolResult type in src/types.ts defines dead_agents, stale_agents, orphaned_beads arrays but is unused — can be repurposed or replaced for the expanded witness/deacon patrols.
  • The witness role exists as a town-wide singleton in getOrCreateAgent() but has only a one-line stub prompt in container-dispatch.ts. The triage agent replaces what would have been a persistent witness LLM session.

Labels: enhancement, kilo-auto-fix, kilo-triaged
