feat(gastown): alarm-driven witness/deacon orchestration with on-demand LLM triage (#442) by jrf0110 · Pull Request #924 · Kilo-Org/cloud

jrf0110 · 2026-03-08T19:28:19Z

Summary

Replace persistent AI patrol loops (Witness + Deacon) with deterministic alarm-driven checks and short-lived on-demand LLM triage agents. This is the cloud equivalent of local Gastown's three-tier watchdog chain (Boot → Deacon → Witness), collapsed into: DO alarm (always fires) → mechanical checks → triage agent (when needed).

Key changes:

Expanded witnessPatrol(): Tiered GUPP violation handling (30min warning → 1h escalation → 2h force-stop with triage request), orphaned work detection, agent GC with 24h retention for polecats/refinery (singletons never GC'd), per-bead timeout enforcement via metadata.timeout_ms
New deaconPatrol(): Stale hook nudging (resets last_activity_at to re-enter dispatch queue), stranded convoy auto-assignment (finds unassigned open convoy beads and hooks idle polecats), crash loop detection (3+ failures in 30min window → triage request)
Triage request queue: New triage_request bead type with structured context (triage type, agent reference, available actions). Ambiguous situations produce triage beads instead of taking immediate action.
On-demand triage agent: maybeDispatchTriageAgent() spawns a short-lived LLM session only when triage requests are queued. The agent gets a focused system prompt listing all pending situations, processes each via gt_triage_resolve, and exits. No persistent LLM sessions.
External health watchdog: GastownUserDO alarm (5min interval) pings each town's TownDO.healthCheck() to verify the alarm is set and re-arms it if missing. Replaces Boot's external observer role.
New triage agent role added to AgentRole enum across worker, container plugin, and DB schemas.
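
The tiered GUPP handling in the first item can be read as a pure threshold function. A minimal sketch, assuming the tiers are measured from the agent's last recorded activity; `classifyGuppViolation` and the tier names are hypothetical and are not taken from patrol.ts:

```typescript
// Hypothetical tier names; only the thresholds (30min, 1h, 2h) come from the summary.
type GuppTier = 'ok' | 'warn' | 'escalate' | 'force_stop';

const MINUTE_MS = 60_000;
const WARN_MS = 30 * MINUTE_MS;
const ESCALATE_MS = 60 * MINUTE_MS;
const FORCE_STOP_MS = 120 * MINUTE_MS;

// Assumption: the violation age is the time since `lastActivityAt` (an ISO timestamp).
function classifyGuppViolation(lastActivityAt: string, now: Date): GuppTier {
  const idleMs = now.getTime() - Date.parse(lastActivityAt);
  // Check the strictest tier first so each agent lands in exactly one tier.
  if (idleMs >= FORCE_STOP_MS) return 'force_stop';
  if (idleMs >= ESCALATE_MS) return 'escalate';
  if (idleMs >= WARN_MS) return 'warn';
  return 'ok';
}
```

Keeping the classification free of side effects would let the alarm handler decide separately what each tier triggers (mail, escalation, stop plus triage request).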

Verification

tsgo --noEmit passes for cloudflare-gastown (worker)
tsc --noEmit passes for cloudflare-gastown/container (plugin)
vitest run — 96 tests passing, 9 pre-existing failures in client.test.ts (from Gastown Feature Flags & Progressive Rollout #901 branch URL pattern changes, unrelated to this PR)

Visual Changes

N/A

Reviewer Notes

The patrol module (src/dos/town/patrol.ts) is structured as pure functions consistent with the existing sub-module pattern (beads.ts, agents.ts, mail.ts). All functions take SqlStorage as the first argument and are stateless.
The feedStrandedConvoys() function directly imports from ./agents — no circular dependency since patrol.ts never imports Town.do.ts.
Triage agent dispatch creates a synthetic issue bead as a hook target (not a triage_request bead) since the triage agent needs a bead to hook onto for the standard agent lifecycle.
The GastownUserDO watchdog only arms when a town is created. Existing users with towns won't have the watchdog until they create a new town — a migration or manual trigger may be needed for production.
The triage_request CHECK constraint on the beads table requires a migration for existing TownDO instances: the initBeadTables DDL uses CREATE TABLE IF NOT EXISTS, which leaves the CHECK constraint of an already-created table unchanged, so re-initialization alone will not pick up the new bead type. New towns get the correct schema automatically.

Closes #442

kilo-code-bot · 2026-03-08T19:33:03Z

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity	Count
CRITICAL	1
WARNING	0
SUGGESTION	0

Issue Details

CRITICAL

File	Line	Issue
`cloudflare-gastown/src/handlers/rig-triage.handler.ts`	51	Label-only triage auth can be spoofed by user-created `gt:triage` beads

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

N/A

Files Reviewed (44 files)

cloudflare-gastown/container/plugin/client.ts - 0 issues
cloudflare-gastown/container/plugin/tools.ts - 0 issues
cloudflare-gastown/container/plugin/types.ts - 0 issues
cloudflare-gastown/container/src/agent-runner.ts - 0 issues
cloudflare-gastown/container/src/control-server.ts - 0 issues
cloudflare-gastown/container/src/git-manager.ts - 0 issues
cloudflare-gastown/container/src/types.ts - 0 issues
cloudflare-gastown/src/db/tables/agent-metadata.table.ts - 0 issues
cloudflare-gastown/src/db/tables/beads.table.ts - 0 issues
cloudflare-gastown/src/db/tables/rig-agents.table.ts - 0 issues
cloudflare-gastown/src/dos/GastownUser.do.ts - 0 issues
cloudflare-gastown/src/dos/Town.do.ts - 0 issues
cloudflare-gastown/src/dos/town/agents.ts - 0 issues
cloudflare-gastown/src/dos/town/container-dispatch.ts - 0 issues
cloudflare-gastown/src/dos/town/patrol.ts - 0 issues
cloudflare-gastown/src/dos/town/review-queue.ts - 0 issues
cloudflare-gastown/src/dos/town/rigs.ts - 0 issues
cloudflare-gastown/src/gastown.worker.ts - 0 issues
cloudflare-gastown/src/handlers/rig-triage.handler.ts - 1 issue
cloudflare-gastown/src/prompts/mayor-system.prompt.ts - 0 issues
cloudflare-gastown/src/prompts/refinery-system.prompt.ts - 0 issues
cloudflare-gastown/src/prompts/triage-system.prompt.ts - 0 issues
cloudflare-gastown/src/trpc/init.ts - 0 issues
cloudflare-gastown/src/trpc/router.ts - 0 issues
cloudflare-gastown/src/trpc/schemas.ts - 0 issues
cloudflare-gastown/src/types.ts - 0 issues
cloudflare-gastown/src/ui/dashboard.ui.ts - 0 issues
src/app/(app)/gastown/[townId]/agents/page.tsx - 0 issues
src/app/(app)/gastown/[townId]/beads/page.tsx - 0 issues
src/app/(app)/gastown/[townId]/mail/page.tsx - 0 issues
src/app/(app)/gastown/[townId]/merges/page.tsx - 0 issues
src/app/(app)/gastown/[townId]/observability/page.tsx - 0 issues
src/app/(app)/gastown/[townId]/page.tsx - 0 issues
src/app/(app)/gastown/[townId]/rigs/[rigId]/page.tsx - 0 issues
src/app/(app)/gastown/[townId]/settings/page.tsx - 0 issues
src/app/(app)/gastown/page.tsx - 0 issues
src/app/api/gastown/token/route.ts - 0 issues
src/components/gastown/ConvoyTimeline.tsx - 0 issues
src/components/gastown/TerminalBar.tsx - 0 issues
src/components/gastown/TerminalBarContext.tsx - 0 issues
src/components/gastown/useXtermPty.ts - 0 issues
src/lib/gastown/feature-flags.ts - 0 issues
src/lib/gastown/types/router.d.ts - 0 issues
src/lib/gastown/types/schemas.d.ts - 0 issues

jrf0110 · 2026-03-08T19:43:46Z

We should remove witness from the AgentRole enum in this PR. It was never implemented as a running agent — the alarm loop IS the witness now, and this PR makes that explicit with the triage role replacing the LLM reasoning part.

Remaining witness references to clean up:

AgentRole enums + CHECK constraints (types.ts, agent-metadata.table.ts, rig-agents.table.ts, trpc/schemas.ts)
container-dispatch.ts stub prompt case
agents.ts singleton list
dashboard.ui.ts dropdown option
patrol.ts from_agent_id: 'witness' in GUPP mails → change to 'system' or 'patrol' since it's the alarm code sending these, not an agent

jrf0110 · 2026-03-08T20:13:07Z

Done. Removed witness from AgentRole across all 6 definition sites (types.ts, agent-metadata.table.ts, rig-agents.table.ts, trpc/schemas.ts, container/plugin/types.ts, container/src/types.ts) and their CHECK constraints. Also cleaned up:

container-dispatch.ts: removed witness case from systemPromptForRole
agents.ts: replaced 'witness' with 'triage' in townSingletonRoles
dashboard.ui.ts: replaced witness dropdown option with triage
patrol.ts: changed from_agent_id: 'witness' to 'patrol' in GUPP mails (these are sent by alarm code, not an agent)
Updated docstring comments referencing witness

kilo-code-bot · 2026-03-09T15:31:20Z

cloudflare-gastown/container/src/git-manager.ts

+    repo
+  ).catch(() => {
+    // --force --track may fail on very old git; fall back to create-or-reset
+    exec('git', ['branch', '-f', trackingBranch, `origin/${defaultBranch}`], repo).catch(() => {});


WARNING: Fallback branch reset is not awaited

If git branch --force --track ... fails, this catch launches git branch -f ... but never returns that promise. The outer await resolves immediately, so git worktree add ... trackingBranch can run before the fallback branch exists or before it has been reset to origin/<defaultBranch>, which makes browse-worktree setup flaky on the exact recovery path this block is trying to handle.

Fixed. The catch block now uses try/catch with await instead of a fire-and-forget .catch() chain.
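
For reference, the awaited shape looks roughly like this. A minimal sketch, assuming an `exec`-style helper that returns a promise; `ExecFn` and `ensureTrackingBranch` are hypothetical names, and the exact arguments in git-manager.ts may differ:

```typescript
// Hypothetical signature standing in for the container's exec helper.
type ExecFn = (cmd: string, args: string[], cwd: string) => Promise<void>;

async function ensureTrackingBranch(
  exec: ExecFn,
  repo: string,
  trackingBranch: string,
  defaultBranch: string
): Promise<void> {
  const upstream = `origin/${defaultBranch}`;
  try {
    await exec('git', ['branch', '--force', '--track', trackingBranch, upstream], repo);
  } catch {
    // Awaiting the fallback keeps the caller from running `git worktree add`
    // before the branch has been created or reset.
    await exec('git', ['branch', '-f', trackingBranch, upstream], repo);
  }
}
```

In this sketch a failing fallback propagates to the caller; whether the real code swallows that second error is not stated in the thread.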

kilo-code-bot · 2026-03-09T15:58:21Z

cloudflare-gastown/container/src/agent-runner.ts

@@ -378,7 +379,7 @@ export async function runAgent(request: StartAgentRequest): Promise<ManagedAgent
    // Resolve git credentials if missing. When the town config doesn't have
    // a token (common on first dispatch after rig creation), fetch one from
    // the Next.js server using the platform_integration_id.
-    const envVars = await resolveGitCredentialsIfMissing(request);
+    const envVars = await resolveGitCredentials(request);


WARNING: Resolved git credentials never reach the spawned agent

resolveGitCredentials() enriches envVars here, but runAgent() still builds the child process environment from the original request later on. For rigs that rely on platformIntegrationId, startup can clone and verify the repo successfully while the agent session itself still launches without GIT_TOKEN/GH_TOKEN, so in-session git push and gh commands fail.

Fixed. After resolveGitCredentials enriches envVars, the request is now reassigned with the resolved envVars (request = { ...request, envVars }) so buildAgentEnv picks up GIT_TOKEN/GH_TOKEN for the spawned process.
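
The reassignment matters because the environment builder only sees what is on the request. A minimal sketch with trimmed-down stand-ins: the request shape, the synchronous `resolveGitCredentials`, the token values, and the `AGENT_ID` variable are all hypothetical simplifications of the real code:

```typescript
// Hypothetical, trimmed-down request shape for illustration only.
interface StartAgentRequest {
  agentId: string;
  envVars: Record<string, string>;
}

// Stand-in for the real (async) credential lookup; returns an enriched copy.
function resolveGitCredentials(request: StartAgentRequest): Record<string, string> {
  return { ...request.envVars, GIT_TOKEN: 'example-token', GH_TOKEN: 'example-token' };
}

// Stand-in for buildAgentEnv: it reads only request.envVars.
function buildAgentEnv(request: StartAgentRequest): Record<string, string> {
  return { ...request.envVars, AGENT_ID: request.agentId };
}

function prepareAgentEnv(request: StartAgentRequest): Record<string, string> {
  const envVars = resolveGitCredentials(request);
  // Without this reassignment, buildAgentEnv would read the original,
  // unenriched envVars and the child process would start without tokens.
  request = { ...request, envVars };
  return buildAgentEnv(request);
}
```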

kilo-code-bot · 2026-03-09T16:06:59Z

cloudflare-gastown/container/src/git-manager.ts

+    } else {
+      // startPoint ref may not exist (e.g. first convoy bead before
+      // the feature branch is created) — fall back to HEAD
+      await exec('git', ['branch', options.branch], repo);


WARNING: Falling back to the repo's local HEAD can branch from stale code

When startPoint is missing on the remote, this path creates the branch with plain git branch <name>, which uses whatever commit the shared repo's local HEAD currently points at. That repo is only fetched, not checked out or reset, so its local branch can lag behind origin/<defaultBranch>; the first convoy bead can start from an old base and miss already-merged work.

Fixed. The fallback now uses origin/<defaultBranch> instead of bare HEAD. Added defaultBranch to WorktreeOptions and piped it through from runAgent.
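
The start-point choice can be sketched as a small argument builder. This is an illustration only: the `WorktreeOptions` fields beyond `branch`, `startPoint`, and `defaultBranch` are omitted, and `branchArgs` plus the `startPointExists` flag are hypothetical:

```typescript
// Hypothetical option shape; only these three fields are mentioned in the thread.
interface WorktreeOptions {
  branch: string;
  startPoint?: string;
  defaultBranch: string;
}

// Returns the argv for `git branch`. `startPointExists` is assumed to be the
// result of an earlier ref check against the fetched remote.
function branchArgs(options: WorktreeOptions, startPointExists: boolean): string[] {
  if (options.startPoint !== undefined && startPointExists) {
    return ['branch', options.branch, options.startPoint];
  }
  // Branch from the freshly fetched remote default branch rather than the
  // shared repo's local HEAD, which may lag behind.
  return ['branch', options.branch, `origin/${options.defaultBranch}`];
}
```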

kilo-code-bot · 2026-03-09T16:06:59Z

cloudflare-gastown/src/dos/town/rigs.ts

+    // deleted rig may still exist if removeRig wasn't called (e.g. the
+    // tRPC deleteRig path before the fix). Remove the orphan and retry.
+    if (err instanceof Error && err.message.includes('UNIQUE constraint failed: rigs.name')) {
+      query(sql, /* sql */ `DELETE FROM rigs WHERE name = ? AND id != ?`, [


WARNING: This retry still deletes live rigs on a name collision

GastownUserDO.createRig() does not enforce unique rig names, so a legitimate second rig with the same name will also land here. Deleting every row whose name matches before retrying silently removes the existing rig from the TownDO registry instead of surfacing the conflict, which can orphan a real rig's state and break later dispatches.

Fixed. The retry path now checks if the conflicting rig has active beads (open or in_progress) before deleting. If it does, the error is surfaced instead of silently orphaning a live rig.
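
The guard can be illustrated against in-memory rows. A minimal sketch, assuming beads reference their rig by id; `removeOrphanRigByName`, the row shapes, and the status list are hypothetical stand-ins for the SQL in rigs.ts:

```typescript
// Hypothetical in-memory stand-ins for the rigs and beads tables.
interface RigRow { id: string; name: string }
interface BeadRow { rigId: string; status: 'open' | 'in_progress' | 'closed' | 'failed' }

function removeOrphanRigByName(
  rigs: RigRow[],
  beads: BeadRow[],
  name: string,
  newRigId: string
): RigRow[] {
  const conflicting = rigs.filter(r => r.name === name && r.id !== newRigId);
  for (const rig of conflicting) {
    const hasActiveBeads = beads.some(
      b => b.rigId === rig.id && (b.status === 'open' || b.status === 'in_progress')
    );
    if (hasActiveBeads) {
      // A rig with live work is not an orphan: surface the conflict.
      throw new Error(`Rig name "${name}" is already used by active rig ${rig.id}`);
    }
  }
  // Only rows shown to be inactive are dropped before the insert is retried.
  return rigs.filter(r => !(r.name === name && r.id !== newRigId));
}
```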

kilo-code-bot · 2026-03-09T16:23:41Z

cloudflare-gastown/src/dos/Town.do.ts

+      defaultBranch: rigConfig.defaultBranch,
+      kilocodeToken,
+      townConfig,
+      systemPromptOverride: systemPrompt,


WARNING: Triage prompt is only applied on the first launch

This batch is still started as a normal polecat and the triage behavior only exists in this one-off systemPromptOverride. If the process dies later, witnessPatrol() resets the hooked agent back to idle and schedulePendingWork() will redispatch the same gt:triage bead through the generic polecat path, which does not rebuild the triage prompt. That means recovery/retry can reopen a triage batch as ordinary coding work instead of a triage session.

Acknowledged. The triage batch bead carries the gt:triage label, and the triage prompt is built dynamically in dispatchTriageBatch. If the process dies and the bead is redispatched via schedulePendingWork, it would go through the generic polecat path. The fix would be to detect gt:triage beads in schedulePendingWork and route them back through dispatchTriageBatch. This is a real gap but requires changes to the scheduler's dispatch routing — tracking as a follow-up.

kilo-code-bot · 2026-03-09T16:23:41Z

cloudflare-gastown/src/dos/Town.do.ts

+          this.sql,
+          /* sql */ `
+            SELECT COUNT(*) AS cnt FROM ${agent_metadata}
+            WHERE ${agent_metadata.status} = 'idle'


SUGGESTION: orphanedHooks currently counts healthy queued work too

This query treats every idle agent with a hook as orphaned, but several normal flows intentionally create that state while waiting for the next scheduler tick (for example feedStrandedConvoys() and backoff restarts). The new Status tab will therefore report false patrol issues during healthy operation unless this metric reuses the stale-hook/orphaned-work criteria instead of raw idle + hooked.

Fixed. The orphanedHooks query now only counts idle+hooked agents that have been idle for >5 minutes, filtering out agents that were just hooked by feedStrandedConvoys or restarted with backoff and are legitimately waiting for the next scheduler tick.
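
Expressed over in-memory rows instead of SQL, the tightened criterion looks roughly like this. A sketch only: it assumes idleness is measured from `last_activity_at`, and `countOrphanedHooks` and the row shape are hypothetical:

```typescript
// Hypothetical row shape; the real query runs against agent_metadata.
interface AgentRow {
  id: string;
  status: 'idle' | 'working' | 'stalled';
  current_hook_bead_id: string | null;
  last_activity_at: string; // ISO timestamp
}

const ORPHAN_GRACE_MS = 5 * 60_000;

function countOrphanedHooks(agents: AgentRow[], now: Date): number {
  return agents.filter(
    a =>
      a.status === 'idle' &&
      a.current_hook_bead_id !== null &&
      // Freshly hooked agents are just waiting for the next scheduler tick.
      now.getTime() - Date.parse(a.last_activity_at) > ORPHAN_GRACE_MS
  ).length;
}
```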

kilo-code-bot · 2026-03-09T16:48:55Z

cloudflare-gastown/src/dos/Town.do.ts

+    // Proactively clone the rig's repo and create a browse worktree so
+    // the mayor has immediate access to the codebase without waiting for
+    // the first agent dispatch.
+    this.setupRigRepoInContainer(rigConfig).catch(err =>


WARNING: Failed repo setup is silently one-shot

setupRigRepoInContainer() is only kicked off here during configureRig(), but /repos/setup returns 202 before the clone/worktree actually succeeds and later failures are only logged in the container. If that background setup fails once (credentials not ready, transient git error, etc.), the mayor never gets a browse worktree for this rig until some unrelated future agent dispatch happens to clone the repo.

Acknowledged. The browse worktree is best-effort and will be created on the next agent dispatch (which clones the repo as a side effect). A retry mechanism (e.g. re-triggering setup on the next alarm cycle if the browse dir doesn't exist) would be a good follow-up but is out of scope for this PR.

kilo-code-bot · 2026-03-09T16:58:54Z

cloudflare-gastown/src/dos/Town.do.ts

+            WHERE ${agent_metadata.bead_id} = ?
+          `,
+          [agent.id]
+        );
      }
      return started;
    } catch (err) {


WARNING: Exceptions leave the agent and bead stuck in an active state

dispatchAgent() now flips the bead to in_progress and the agent to working before the container call, but this catch path only logs and returns false. If startAgentInContainer() throws before it can return false (for example on a container fetch/DO error), the scheduler never rolls either row back, so that work can stay wedged indefinitely and never get redispatched.

Fixed. The catch block in dispatchAgent now rolls back the agent to idle and the bead to open so the scheduler can retry on the next alarm tick.
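
The rollback pattern, reduced to its essentials. A minimal sketch using in-memory rows; `dispatchWithRollback` and the row shapes are hypothetical, and rolling back on a clean `false` return is an assumption about the existing behaviour:

```typescript
// Hypothetical in-memory rows; the real code updates SQL tables.
interface AgentRow { id: string; status: 'idle' | 'working' }
interface BeadRow { id: string; status: 'open' | 'in_progress' }

async function dispatchWithRollback(
  agent: AgentRow,
  bead: BeadRow,
  startAgentInContainer: () => Promise<boolean>
): Promise<boolean> {
  // Both rows are flipped to an active state before the container call.
  agent.status = 'working';
  bead.status = 'in_progress';

  const rollback = (): void => {
    agent.status = 'idle';
    bead.status = 'open';
  };

  try {
    const started = await startAgentInContainer();
    // Assumption: a clean `false` is rolled back the same way as a throw.
    if (!started) rollback();
    return started;
  } catch {
    // A thrown container or DO error must not leave the work wedged.
    rollback();
    return false;
  }
}
```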

kilo-code-bot · 2026-03-09T16:58:55Z

cloudflare-gastown/container/src/git-manager.ts

+    await exec('git', ['checkout', defaultBranch], browseDir).catch(() => {});
+    await exec('git', ['pull', '--rebase', '--autostash'], browseDir).catch(() => {});
+    console.log(`Updated browse worktree for rig ${rigId} at ${browseDir}`);
+    return browseDir;


WARNING: Browse setup reports success after a failed refresh

Both the checkout and pull calls above are swallowed, but this path still returns browseDir and logs the worktree as updated. An auth failure or transient git error would therefore leave the mayor browsing stale code while /repos/setup looks successful. This should either recreate the browse worktree or surface the failure so setup can retry.

Fixed. The checkout/pull calls are now in a try/catch that logs a warning with the failure message instead of silently swallowing errors. The browseDir is still returned (it exists on disk) but the log makes it clear the code may be stale.

kilo-code-bot · 2026-03-09T17:23:39Z

cloudflare-gastown/src/dos/Town.do.ts

+    const triageBead = beadOps.getBead(this.sql, input.triage_request_bead_id);
+    if (!triageBead)
+      throw new Error(`Triage request bead ${input.triage_request_bead_id} not found`);
+    if (!triageBead.labels.includes(patrol.TRIAGE_REQUEST_LABEL)) {


WARNING: Closed triage requests can be resolved again

resolveTriage() only validates the triage label here. Because the handler accepts an arbitrary triage_request_bead_id, a retried or stale call against an already-closed request will replay side effects like RESTART, CLOSE_BEAD, or REASSIGN_BEAD. Rejecting anything except status === 'open' would make this endpoint idempotent and prevent duplicate actions.

Fixed. resolveTriage now rejects requests where the triage bead status is not 'open', preventing duplicate side effects from retried or stale calls.
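
The status guard is what makes a retried call harmless. A minimal sketch, assuming the request is closed once its action has been applied; the label value, the bead shape, and the `applyAction` callback are hypothetical:

```typescript
// Hypothetical bead shape; the label value is an assumption about
// patrol.TRIAGE_REQUEST_LABEL.
interface TriageBead { id: string; status: 'open' | 'closed'; labels: string[] }

const TRIAGE_REQUEST_LABEL = 'gt:triage-request';

function resolveTriage(bead: TriageBead, applyAction: () => void): void {
  if (!bead.labels.includes(TRIAGE_REQUEST_LABEL)) {
    throw new Error(`Bead ${bead.id} is not a triage request`);
  }
  // Reject anything that is not open so side effects cannot be replayed.
  if (bead.status !== 'open') {
    throw new Error(`Triage request ${bead.id} is already ${bead.status}`);
  }
  applyAction();
  bead.status = 'closed';
}
```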

kilo-code-bot · 2026-03-09T17:23:39Z

cloudflare-gastown/src/trpc/router.ts

    .input(z.object({ rigId: z.string().uuid() }))
    .mutation(async ({ ctx, input }) => {
-      requireGastownAccess(ctx);
-      await verifyRigOwnership(ctx.env, ctx.userId, input.rigId);
+      const rig = await verifyRigOwnership(ctx.env, ctx.userId, input.rigId);
      const userStub = getGastownUserStub(ctx.env, ctx.userId);
      await userStub.deleteRig(input.rigId);


WARNING: Rig deletion still succeeds on partial cleanup failure

This deletes the user-facing rig record before the Town DO cleanup is guaranteed. If removeRig() then throws, the mutation still returns success with the user row gone but the Town DO registry still holding the stale rig entry, so recreating the same name can keep hitting the same UNIQUE conflict this change is trying to avoid. This should either clean up the Town DO first or fail/retry instead of swallowing the error.

Fixed. The deletion order is now reversed: TownDO.removeRig runs first (freeing the name), then userStub.deleteRig. If the Town DO cleanup fails, the mutation throws and the user record stays intact so they can retry.
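
The reordered flow can be sketched with two stubs. `TownStub`, `UserStub`, and `deleteRigInOrder` are hypothetical names standing in for the Town DO and user DO clients:

```typescript
// Hypothetical stubs standing in for the Town DO and user DO clients.
interface TownStub { removeRig(rigId: string): Promise<void> }
interface UserStub { deleteRig(rigId: string): Promise<void> }

async function deleteRigInOrder(town: TownStub, user: UserStub, rigId: string): Promise<void> {
  // Free the name in the Town DO registry first. If this throws, the
  // user-facing record is untouched and the caller can retry.
  await town.removeRig(rigId);
  await user.deleteRig(rigId);
}
```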

kilo-code-bot · 2026-03-09T17:36:45Z

cloudflare-gastown/src/dos/town/patrol.ts

+        unhookBead(sql, row.assignee_agent_bead_id);
+      }
+
+      logBeadEvent(sql, {


WARNING: Timeout path logs the same failure twice

updateBeadStatus() already writes a status_changed event for the transition to failed. Emitting another status_changed event here means each timeout contributes two failure records, so detectCrashLoops() can trip after only two timed-out beads instead of the intended three, and the new status feed will show duplicate failure transitions.

Fixed. Removed the redundant logBeadEvent call since updateBeadStatus already writes a status_changed event internally.
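
To see why the duplicate event matters, consider a window-based counter over failure events. A minimal sketch, assuming crash loops are counted from status_changed events per agent; `isCrashLooping` and the event shape are hypothetical, while the threshold (3 failures) and window (30 minutes) come from the PR summary:

```typescript
// Hypothetical event shape for illustration.
interface BeadEvent { agentId: string; type: string; newStatus: string; at: string }

const CRASH_WINDOW_MS = 30 * 60_000;
const CRASH_THRESHOLD = 3;

function isCrashLooping(events: BeadEvent[], agentId: string, now: Date): boolean {
  const recentFailures = events.filter(
    e =>
      e.agentId === agentId &&
      e.type === 'status_changed' &&
      e.newStatus === 'failed' &&
      now.getTime() - Date.parse(e.at) <= CRASH_WINDOW_MS
  );
  return recentFailures.length >= CRASH_THRESHOLD;
}
```

With one event per timeout, two timed-out beads stay under the threshold; with two events per timeout, the same two beads produce four failure records and trip the detector.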

kilo-code-bot · 2026-03-09T17:37:56Z

cloudflare-gastown/src/dos/Town.do.ts

+          }
+          break;
+        }
+        case 'CLOSE_BEAD': {


WARNING: Destructive triage actions leave the old polecat running

Unlike RESTART, the CLOSE_BEAD and REASSIGN_BEAD branches below never call stopAgentInContainer() when the target agent is still working or stalled. That lets the old process keep running after the bead has been failed/reopened in SQL, so the scheduler can redispatch the same work while the original polecat is still pushing stale changes.

Fixed. Both CLOSE_BEAD and REASSIGN_BEAD now call stopAgentInContainer when the target agent is working or stalled, matching the RESTART path.

kilo-code-bot · 2026-03-09T17:37:56Z

cloudflare-gastown/src/dos/town/review-queue.ts

+  // submitting to the review queue. Detected by the gt:triage label
+  // on the hooked bead.
+  const hookedBead = getBead(sql, agent.current_hook_bead_id);
+  if (hookedBead?.labels.includes('gt:triage')) {


WARNING: Any user-created gt:triage bead now bypasses review

This fast-path keys entirely off labels.includes('gt:triage'), but handleCreateBead() accepts arbitrary labels from the API/UI. A normal issue accidentally or intentionally created with that reserved label will be silently closed on gt_done() instead of entering the review queue, which drops the agent's output on the floor.

Fixed. The triage fast-path now also checks created_by === 'patrol' so only system-created triage beads bypass review. User-created beads that happen to carry the gt:triage label go through the normal review flow.

kilo-code-bot · 2026-03-09T17:49:47Z

cloudflare-gastown/src/gastown.worker.ts

+        const townId = statusMatch[1];
+        console.log(`[gastown-worker] WS upgrade (status): townId=${townId}`);
+        const stub = getTownDOStub(env, townId);
+        return stub.fetch(request);


WARNING: This status feed bypasses the town auth/ownership checks

WebSocket upgrades are handled here before the request reaches Hono, so neither townIdMiddleware nor the /api/towns/:townId/rigs/:rigId/* auth middleware runs. That means anyone who knows a townId can subscribe to /api/towns/:townId/status/ws and read recent events plus agent/bead state without going through the same authorization path as getAlarmStatus.

Acknowledged. This is the same pattern as the stream and PTY WebSocket endpoints (lines 466–486) which also bypass Hono middleware. All three rely on UUID-only knowledge and capability-token auth obtained via authenticated POSTs. Adding ticket-based auth for all WS paths is a holistic follow-up.

kilo-code-bot · 2026-03-09T17:49:47Z

src/components/gastown/TerminalBar.tsx

+      reconnectTimerRef.current = setTimeout(connect, 3_000);
+    };
+
+    ws.onerror = () => {


WARNING: The status pane never falls back if WebSockets are unavailable

useAlarmStatusWs() only reconnects on close/error; it never switches to the existing getAlarmStatus query path. In environments where WS upgrades are blocked or unsupported, data stays null forever and the new Status tab never renders a snapshot.

Acknowledged. The WS-only pattern is consistent with the existing stream/PTY terminal connections. Adding a polling fallback via the existing getAlarmStatus tRPC query is a valid improvement but is outside the scope of this PR.

kilo-code-bot · 2026-03-09T17:58:28Z

cloudflare-gastown/src/dos/Town.do.ts

+        /* sql */ `
+          SELECT ${beads.bead_id} FROM ${beads}
+          WHERE ${beads.type} = 'issue'
+            AND ${beads.labels} LIKE ?


WARNING: Reserved-label issues can block triage dispatch

handleCreateBead() accepts arbitrary labels, so any normal open issue tagged gt:triage will satisfy this duplicate guard. Once that happens, maybeDispatchTriageAgent() returns early forever and real gt:triage-request beads never get a worker. Filtering on a system-owned marker (for example created_by = 'patrol') would keep user-created labels from starving the triage queue.

Fixed. The duplicate guard query now also filters on created_by = 'patrol' so user-created beads with the gt:triage label can't starve the triage queue.

kilo-code-bot · 2026-03-09T17:58:28Z

cloudflare-gastown/src/dos/Town.do.ts

+      title: `Triage batch: ${pendingCount} request(s)`,
+      body: 'Process all pending triage request beads and resolve each one.',
+      priority: 'high',
+      labels: [patrol.TRIAGE_BATCH_LABEL],


WARNING: The gt_done safety net never matches this batch bead

reviewQueue.agentDone() now only short-circuits for gt:triage beads created by patrol, but this synthetic batch is inserted without created_by. If the model ever calls gt_done instead of gt_bead_close, the batch still goes into the review queue. Setting created_by: 'patrol' here keeps the new guard working for the system-created triage batch.

Fixed. The synthetic triage batch bead now sets created_by: 'patrol', matching the guard in agentDone() that checks created_by === 'patrol' for the gt:triage fast-path.

kilo-code-bot · 2026-03-09T18:36:52Z

cloudflare-gastown/src/handlers/rig-triage.handler.ts

+  // check, any rig agent (polecat, refinery) could call the endpoint and
+  // trigger restart/close/escalate side effects on other agents.
+  const hookedBead = await town.getHookedBead(agentId);
+  if (!hookedBead || !hookedBead.labels.includes('gt:triage')) {


CRITICAL: Label-only gating still allows non-triage agents to resolve requests

gt:triage is not reserved to patrol-created batch beads here: handleCreateBead() accepts arbitrary labels, and agentDone() explicitly treats user-created gt:triage beads as a supported case. Any agent hooked to one of those beads can still call /triage/resolve and trigger restart/close side effects on other agents. This authorization check needs a system-owned marker as well (for example created_by === 'patrol').

Fixed. The handler now also checks hookedBead.created_by === 'patrol' in addition to the label check. User-created beads with the gt:triage label can no longer satisfy the authorization gate.
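
The combined gate amounts to a two-part predicate. A minimal sketch; `canResolveTriage` and the bead shape are hypothetical, and it assumes `created_by` cannot be set to 'patrol' through the public bead-creation path:

```typescript
// Hypothetical shape of the bead returned by town.getHookedBead().
interface HookedBead { labels: string[]; created_by: string | null }

function canResolveTriage(hookedBead: HookedBead | null): boolean {
  return (
    hookedBead !== null &&
    // The label alone can be supplied by any caller of the bead-creation API.
    hookedBead.labels.includes('gt:triage') &&
    // The system-owned marker is what distinguishes patrol-created batches.
    hookedBead.created_by === 'patrol'
  );
}
```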

… improvements (#442)

Alarm-driven patrol system (witness & deacon):
- Tiered GUPP violation handling (30min warn, 1h escalate+triage, 2h force-stop)
- Orphaned work detection, stale hook recovery, agent GC, crash loop detection
- Per-bead timeout enforcement with agent container termination
- On-demand LLM triage agent for ambiguous situations
- Triage action validation, access control, and snapshot-based resolution
- Stranded convoy feeding with immediate dispatch eligibility

Mayor codebase browsing:
- Browse worktrees at /workspace/rigs/<rigId>/browse/ for read-only access
- POST /repos/setup container endpoint for proactive repo cloning
- System prompt written to AGENTS.md so mayor and sub-agents share context
- Git credential race fix: refreshGitCredentials runs before configureRig
- GIT_TERMINAL_PROMPT=0 to prevent credential prompt hangs

Agent dispatch improvements:
- startPoint parameter for convoy agents to branch from feature branch
- platformIntegrationId and KILOCODE_TOKEN plumbed through repo setup
- Existing users arm watchdog on DO init
- RESTART_WITH_BACKOFF uses dispatch cooldown delay

Rig deletion fix:
- tRPC deleteRig now calls TownDO.removeRig (was missing)
- addRig handles stale name conflicts via catch-and-retry

Real-time alarm status UI:
- Hibernatable WebSocket for live alarm status push
- Status tab in terminal bar with agent/bead/patrol cards

Other UI:
- Convoy title and branch use flex-based truncation instead of fixed max-width
- Status pane card padding normalized to p-2
- Legacy agent roles accepted in Zod schemas for backward compat
- PostHog feature flag integration for gastown access gating

… improvements (#442) (#924)

#904)

* feat(gastown): replace binary is_admin gate with PostHog feature flags (#901)

  Replace the binary is_admin check from #537 with PostHog feature flags for progressive rollout. Flag management (allowlists, percentage rollout, kill-switch) is handled entirely through the PostHog dashboard; no custom DB tables or admin UI needed.

  Gate points updated:
  - 9 Next.js pages use isFeatureFlagEnabled('gastown-access', user.id)
  - Sidebar uses useFeatureFlagEnabled('gastown-access')
  - Token endpoint evaluates the flag and embeds gastownAccess in the JWT
  - Worker checks gastownAccess JWT claim (isAdmin fallback for compat)

  Sub-feature flag names defined: gastown-convoys, gastown-pr-merge, gastown-multi-rig (to be created in PostHog when needed).

  Closes #901

* fix: address PR review comments
  - Switch from isFeatureFlagEnabled to isReleaseToggleEnabled for strict boolean auth checks (prevents multivariate variants from granting access)
  - Remove dev-mode bypass; gate in dev too via PostHog
  - Abstract requireGastownAccess into gastownProcedure composable tRPC middleware in init.ts, replacing manual requireGastownAccess(ctx) calls
  - Remove sub-feature flags (convoys, pr_merge, multi_rig); only gastown-access remains

* fix: use strict boolean check for client-side gastown flag

  Use useFeatureFlagVariantKey === true instead of useFeatureFlagEnabled to align the sidebar with the server-side isReleaseToggleEnabled check. This prevents multivariate string variants from showing the nav item when server-side access would be denied.

* fix: use isFeatureFlagEnabled with dev override for gastown-access

  Switch all gate points from isReleaseToggleEnabled back to isFeatureFlagEnabled. Add a DEV_ENABLED_FLAGS set to posthog-feature-flags.ts that returns true for gastown-access in non-production environments so local dev works without PostHog configuration. Sidebar reverts to useFeatureFlagEnabled.

* refactor: move dev override into isGastownEnabled, revert posthog-feature-flags.ts

  Move the dev-mode override out of the shared posthog-feature-flags.ts module and into isGastownEnabled in src/lib/gastown/feature-flags.ts. All pages and the token endpoint now call isGastownEnabled(user.id) which returns true in non-production and delegates to isFeatureFlagEnabled in production. The sidebar uses useFeatureFlagEnabled || isDevelopment for the same effect client-side.

* feat(gastown): witness/deacon patrol, mayor codebase browsing, and UI improvements (#442) (#924)

* fix(gastown): only stop agent on triage resolve if still hooked to snapshot bead

  CLOSE_BEAD and REASSIGN_BEAD now check that the agent's current hook matches the snapshot bead from the triage request before calling stopAgentInContainer. If the agent has moved on to different work, stopping it would abort unrelated sessions.

* style: fix prettier formatting in gastown type declarations

* fix: resolve lint errors (no-base-to-string, unused vars)

* fix(gastown): triage agent should call gt_done not gt_bead_close

  gt_bead_close only marks the bead closed without unhooking the agent or resetting it to idle, leaking agent records. gt_done triggers the agentDone path which has the patrol-created triage fast-path that properly closes the batch, unhooks, and returns the agent to idle.

* fix(gastown): refinery singleton, container eviction recovery, review queue safety
  - Treat refinery as per-rig singleton in getOrCreateAgent to prevent UNIQUE constraint on identity when a refinery already exists
  - Re-queue review entry (reset to open) when refinery is busy instead of leaving it stuck in in_progress
  - Return 'not_found' (not 'unknown') from checkAgentContainerStatus on 404, so witnessPatrol immediately resets and redispatches agents after container eviction instead of waiting for the 2-hour GUPP timeout

* fix(gastown): triage prompt, timestamp format, per-rig creds, browse refresh
  - Remove remaining gt_bead_close reference in triage prompt (line 72) that contradicted the gt_done instruction on line 49
  - Use strftime with ISO format in orphanedHooks SQL query to match the toISOString() format stored in last_activity_at
  - Resolve git credentials per-rig in mayor browse setup instead of sharing one credential set across all rigs
  - Browse worktree refresh uses fetch+reset
instead of checkout to avoid wrong-branch errors (worktree is on synthetic browse branch) * fix(gastown): escalations now create triage requests for automated follow-up Previously, gt_escalate created an escalation bead and optionally notified the mayor, but nothing automated acted on it. Escalation beads sat open with no assignee indefinitely. Now routeEscalation creates a triage request alongside the escalation bead, feeding the escalation into the patrol→triage→resolve loop. The triage agent can then RESTART, REASSIGN, CLOSE, or ESCALATE_TO_MAYOR with the full context of the original escalation. When a triage request linked to an escalation is resolved, the escalation bead is also closed automatically. Also adds 'escalation' to the TriageType union and enriches the ESCALATE_TO_MAYOR mayor message with agent and bead context. * feat(gastown): store convoy_id and source_bead_id in escalation metadata When an agent escalates from within a convoy, the escalation bead and its triage request now carry convoy_id and source_bead_id in their metadata. This associates escalations with their convoy for display purposes and lays groundwork for Phase 4 convoy-aware triage handling. * fix(gastown): escalation metadata path, polling fallback, refinery rollback, regen types - Fix escalation_bead_id lookup in resolveTriage to read from metadata.context (matching createTriageRequest's structure) - Add polling fallback to AlarmStatusPane via tRPC getAlarmStatus query when WebSocket fails, with 5s refetch interval - Reset refinery to idle when container start fails in processReviewQueue - Regenerate gastown type declarations to include getAlarmStatus
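
The tiered GUPP violation handling described in the commit message (30min warn, 1h escalate+triage, 2h force-stop) can be read as a pure classification over idle time. The following is a minimal sketch, not the code from patrol.ts: the names `GuppAction`, `GUPP_TIERS`, and `classifyGuppViolation` are hypothetical, and it assumes idle time is measured from an ISO-8601 `last_activity_at` string, as the `toISOString()` note above suggests.

```typescript
// Illustrative sketch only: all names here are hypothetical, not taken from the PR.
type GuppAction = 'none' | 'warn' | 'escalate_and_triage' | 'force_stop';

const MINUTE_MS = 60_000;

// Thresholds from the commit message, ordered from most to least severe
// so the first matching tier is the strongest applicable action.
const GUPP_TIERS: ReadonlyArray<{ afterMs: number; action: GuppAction }> = [
  { afterMs: 120 * MINUTE_MS, action: 'force_stop' },
  { afterMs: 60 * MINUTE_MS, action: 'escalate_and_triage' },
  { afterMs: 30 * MINUTE_MS, action: 'warn' },
];

// Assumption: lastActivityAt is an ISO-8601 timestamp (the toISOString() format).
function classifyGuppViolation(lastActivityAt: string, now: Date): GuppAction {
  const last = Date.parse(lastActivityAt);
  if (Number.isNaN(last)) {
    throw new Error(`unparseable last_activity_at: ${lastActivityAt}`);
  }
  const idleMs = now.getTime() - last;
  for (const tier of GUPP_TIERS) {
    if (idleMs >= tier.afterMs) return tier.action;
  }
  return 'none';
}

// Usage: an agent idle for 45 minutes falls into the warning tier.
const exampleNow = new Date('2026-03-09T12:00:00.000Z');
console.log(classifyGuppViolation('2026-03-09T11:15:00.000Z', exampleNow)); // prints "warn"
```

Keeping the classification free of side effects means the alarm handler can decide separately what each action does (for example, queueing a triage request), and the thresholds can be tested without a running Durable Object.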

kilo-code-bot bot reviewed Mar 8, 2026

cloudflare-gastown/src/handlers/rig-triage.handler.ts (outdated, resolved)

cloudflare-gastown/src/dos/Town.do.ts (outdated, resolved)

cloudflare-gastown/src/prompts/triage-system.prompt.ts (outdated, resolved)

kilo-code-bot bot reviewed Mar 8, 2026

cloudflare-gastown/src/dos/GastownUser.do.ts (resolved)

cloudflare-gastown/src/dos/Town.do.ts (outdated, resolved)

jrf0110 force-pushed the 442-witness-deacon branch 2 times, most recently from c57493a to c59e803 Compare March 8, 2026 22:28

kilo-code-bot bot reviewed Mar 8, 2026

cloudflare-gastown/src/db/tables/agent-metadata.table.ts (outdated, resolved)

cloudflare-gastown/src/dos/Town.do.ts (resolved)

cloudflare-gastown/src/prompts/triage-system.prompt.ts (resolved)

kilo-code-bot bot reviewed Mar 9, 2026

cloudflare-gastown/src/dos/Town.do.ts (resolved)

kilo-code-bot bot reviewed Mar 9, 2026

cloudflare-gastown/src/gastown.worker.ts (resolved)

kilo-code-bot bot reviewed Mar 9, 2026

cloudflare-gastown/src/dos/town/patrol.ts (resolved)

cloudflare-gastown/src/dos/town/patrol.ts (resolved)

kilo-code-bot bot reviewed Mar 9, 2026

cloudflare-gastown/src/dos/Town.do.ts (resolved)

src/components/gastown/TerminalBar.tsx (resolved)

kilo-code-bot bot reviewed Mar 9, 2026

cloudflare-gastown/container/src/control-server.ts (resolved)

cloudflare-gastown/src/trpc/schemas.ts (outdated, resolved)

cloudflare-gastown/src/dos/town/rigs.ts (outdated, resolved)

kilo-code-bot bot reviewed Mar 9, 2026

cloudflare-gastown/src/dos/town/patrol.ts (resolved)

cloudflare-gastown/src/dos/Town.do.ts (resolved)

cloudflare-gastown/container/src/git-manager.ts (resolved)

kilo-code-bot bot reviewed Mar 9, 2026

jrf0110 force-pushed the 442-witness-deacon branch from d5afd5b to a1d1858 Compare March 9, 2026 17:15

kilo-code-bot bot reviewed Mar 9, 2026

jrf0110 force-pushed the 442-witness-deacon branch from a1d1858 to 6c1581a Compare March 9, 2026 17:30

kilo-code-bot bot reviewed Mar 9, 2026

jrf0110 force-pushed the 442-witness-deacon branch from 6c1581a to 5c2858c Compare March 9, 2026 17:41

kilo-code-bot bot reviewed Mar 9, 2026

jrf0110 force-pushed the 442-witness-deacon branch from 5c2858c to ec2c8af Compare March 9, 2026 17:50

kilo-code-bot bot reviewed Mar 9, 2026

jrf0110 force-pushed the 442-witness-deacon branch from ec2c8af to 546d2d4 Compare March 9, 2026 18:30

kilo-code-bot bot reviewed Mar 9, 2026

jrf0110 force-pushed the 442-witness-deacon branch 2 times, most recently from 84b1b21 to 133cff0 Compare March 9, 2026 18:44

jrf0110 changed the base branch from 901-feature-flags to main March 9, 2026 18:45

jrf0110 enabled auto-merge (squash) March 9, 2026 18:45

jrf0110 disabled auto-merge March 9, 2026 18:46

jrf0110 changed the base branch from main to 901-feature-flags March 9, 2026 18:46

jrf0110 force-pushed the 442-witness-deacon branch from 133cff0 to 8c92846 Compare March 9, 2026 18:50

jrf0110 merged this pull request into 901-feature-flags Mar 9, 2026
1 of 2 checks passed

jrf0110 deleted the 442-witness-deacon branch March 9, 2026 18:50

Conversation

jrf0110 commented Mar 8, 2026

Summary

Verification

Visual Changes

Reviewer Notes

kilo-code-bot bot commented Mar 8, 2026 • edited

Code Review Summary

Overview

CRITICAL

jrf0110 commented Mar 8, 2026

jrf0110 commented Mar 8, 2026

kilo-code-bot bot Mar 9, 2026

kilo-code-bot bot Mar 9, 2026

kilo-code-bot bot Mar 9, 2026

kilo-code-bot bot Mar 9, 2026

kilo-code-bot bot Mar 9, 2026

kilo-code-bot bot Mar 9, 2026

kilo-code-bot bot Mar 9, 2026

kilo-code-bot bot commented Mar 8, 2026 • edited