
feat(gastown): alarm-driven witness/deacon orchestration with on-demand LLM triage (#442) #924

Merged
jrf0110 merged 1 commit into 901-feature-flags from 442-witness-deacon
Mar 9, 2026

@jrf0110 jrf0110 commented Mar 8, 2026

Summary

Replace persistent AI patrol loops (Witness + Deacon) with deterministic alarm-driven checks and short-lived on-demand LLM triage agents. This is the cloud equivalent of local Gastown's three-tier watchdog chain (Boot → Deacon → Witness), collapsed into: DO alarm (always fires) → mechanical checks → triage agent (when needed).

Key changes:

  • Expanded witnessPatrol(): Tiered GUPP violation handling (30min warning → 1h escalation → 2h force-stop with triage request), orphaned work detection, agent GC with 24h retention for polecats/refinery (singletons never GC'd), per-bead timeout enforcement via metadata.timeout_ms
  • New deaconPatrol(): Stale hook nudging (resets last_activity_at to re-enter dispatch queue), stranded convoy auto-assignment (finds unassigned open convoy beads and hooks idle polecats), crash loop detection (3+ failures in 30min window → triage request)
  • Triage request queue: New triage_request bead type with structured context (triage type, agent reference, available actions). Ambiguous situations produce triage beads instead of taking immediate action.
  • On-demand triage agent: maybeDispatchTriageAgent() spawns a short-lived LLM session only when triage requests are queued. The agent gets a focused system prompt listing all pending situations, processes each via gt_triage_resolve, and exits. No persistent LLM sessions.
  • External health watchdog: GastownUserDO alarm (5min interval) pings each town's TownDO.healthCheck() to verify the alarm is set and re-arms it if missing. Replaces Boot's external observer role.
  • New triage agent role added to AgentRole enum across worker, container plugin, and DB schemas.
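
The tiered GUPP handling listed above can be sketched as a pure classifier. This is a minimal illustration, assuming the tier is derived from how long an agent has gone without activity. The thresholds come from the summary; the names `GuppTier` and `guppTier` and the inclusive comparisons are hypothetical and not taken from patrol.ts.

```typescript
// Hypothetical sketch: thresholds are from the PR summary; names are illustrative.
type GuppTier = 'ok' | 'warn' | 'escalate' | 'force_stop';

const MINUTE_MS = 60_000;

function guppTier(lastActivityAtMs: number, nowMs: number): GuppTier {
  const idleMs = nowMs - lastActivityAtMs;
  // Check the most severe tier first so each agent lands in exactly one tier.
  if (idleMs >= 120 * MINUTE_MS) return 'force_stop'; // 2h
  if (idleMs >= 60 * MINUTE_MS) return 'escalate'; // 1h
  if (idleMs >= 30 * MINUTE_MS) return 'warn'; // 30min
  return 'ok';
}
```

Keeping the classification separate from the side effects (mail, escalation, force-stop) would let each alarm tick recompute the tier from stored timestamps alone.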

Verification

  • tsgo --noEmit passes for cloudflare-gastown (worker)
  • tsc --noEmit passes for cloudflare-gastown/container (plugin)
  • vitest run — 96 tests passing, 9 pre-existing failures in client.test.ts (from Gastown Feature Flags & Progressive Rollout #901 branch URL pattern changes, unrelated to this PR)

Visual Changes

N/A

Reviewer Notes

  • The patrol module (src/dos/town/patrol.ts) is structured as pure functions consistent with the existing sub-module pattern (beads.ts, agents.ts, mail.ts). All functions take SqlStorage as the first argument and are stateless.
  • The feedStrandedConvoys() function directly imports from ./agents — no circular dependency since patrol.ts never imports Town.do.ts.
  • Triage agent dispatch creates a synthetic issue bead as a hook target (not a triage_request bead) since the triage agent needs a bead to hook onto for the standard agent lifecycle.
  • The GastownUserDO watchdog only arms when a town is created. Existing users with towns won't have the watchdog until they create a new town — a migration or manual trigger may be needed for production.
  • The triage_request CHECK constraint on the beads table requires a migration for existing TownDO instances: the initBeadTables DDL will not handle this on next initialization, because it uses CREATE TABLE IF NOT EXISTS, which does not update CHECK constraints on an existing table. New towns get the correct schema automatically.

Closes #442


kilo-code-bot bot commented Mar 8, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 1
WARNING 0
SUGGESTION 0

Issue Details

CRITICAL

File Line Issue
cloudflare-gastown/src/handlers/rig-triage.handler.ts 51 Label-only triage auth can be spoofed by user-created gt:triage beads
Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

N/A

Files Reviewed (44 files)
  • cloudflare-gastown/container/plugin/client.ts - 0 issues
  • cloudflare-gastown/container/plugin/tools.ts - 0 issues
  • cloudflare-gastown/container/plugin/types.ts - 0 issues
  • cloudflare-gastown/container/src/agent-runner.ts - 0 issues
  • cloudflare-gastown/container/src/control-server.ts - 0 issues
  • cloudflare-gastown/container/src/git-manager.ts - 0 issues
  • cloudflare-gastown/container/src/types.ts - 0 issues
  • cloudflare-gastown/src/db/tables/agent-metadata.table.ts - 0 issues
  • cloudflare-gastown/src/db/tables/beads.table.ts - 0 issues
  • cloudflare-gastown/src/db/tables/rig-agents.table.ts - 0 issues
  • cloudflare-gastown/src/dos/GastownUser.do.ts - 0 issues
  • cloudflare-gastown/src/dos/Town.do.ts - 0 issues
  • cloudflare-gastown/src/dos/town/agents.ts - 0 issues
  • cloudflare-gastown/src/dos/town/container-dispatch.ts - 0 issues
  • cloudflare-gastown/src/dos/town/patrol.ts - 0 issues
  • cloudflare-gastown/src/dos/town/review-queue.ts - 0 issues
  • cloudflare-gastown/src/dos/town/rigs.ts - 0 issues
  • cloudflare-gastown/src/gastown.worker.ts - 0 issues
  • cloudflare-gastown/src/handlers/rig-triage.handler.ts - 1 issue
  • cloudflare-gastown/src/prompts/mayor-system.prompt.ts - 0 issues
  • cloudflare-gastown/src/prompts/refinery-system.prompt.ts - 0 issues
  • cloudflare-gastown/src/prompts/triage-system.prompt.ts - 0 issues
  • cloudflare-gastown/src/trpc/init.ts - 0 issues
  • cloudflare-gastown/src/trpc/router.ts - 0 issues
  • cloudflare-gastown/src/trpc/schemas.ts - 0 issues
  • cloudflare-gastown/src/types.ts - 0 issues
  • cloudflare-gastown/src/ui/dashboard.ui.ts - 0 issues
  • src/app/(app)/gastown/[townId]/agents/page.tsx - 0 issues
  • src/app/(app)/gastown/[townId]/beads/page.tsx - 0 issues
  • src/app/(app)/gastown/[townId]/mail/page.tsx - 0 issues
  • src/app/(app)/gastown/[townId]/merges/page.tsx - 0 issues
  • src/app/(app)/gastown/[townId]/observability/page.tsx - 0 issues
  • src/app/(app)/gastown/[townId]/page.tsx - 0 issues
  • src/app/(app)/gastown/[townId]/rigs/[rigId]/page.tsx - 0 issues
  • src/app/(app)/gastown/[townId]/settings/page.tsx - 0 issues
  • src/app/(app)/gastown/page.tsx - 0 issues
  • src/app/api/gastown/token/route.ts - 0 issues
  • src/components/gastown/ConvoyTimeline.tsx - 0 issues
  • src/components/gastown/TerminalBar.tsx - 0 issues
  • src/components/gastown/TerminalBarContext.tsx - 0 issues
  • src/components/gastown/useXtermPty.ts - 0 issues
  • src/lib/gastown/feature-flags.ts - 0 issues
  • src/lib/gastown/types/router.d.ts - 0 issues
  • src/lib/gastown/types/schemas.d.ts - 0 issues


jrf0110 commented Mar 8, 2026

We should remove witness from the AgentRole enum in this PR. It was never implemented as a running agent — the alarm loop IS the witness now, and this PR makes that explicit with the triage role replacing the LLM reasoning part.

Remaining witness references to clean up:

  • AgentRole enums + CHECK constraints (types.ts, agent-metadata.table.ts, rig-agents.table.ts, trpc/schemas.ts)
  • container-dispatch.ts stub prompt case
  • agents.ts singleton list
  • dashboard.ui.ts dropdown option
  • patrol.ts from_agent_id: 'witness' in GUPP mails → change to 'system' or 'patrol' since it's the alarm code sending these, not an agent


jrf0110 commented Mar 8, 2026

Done. Removed witness from AgentRole across all 6 definition sites (types.ts, agent-metadata.table.ts, rig-agents.table.ts, trpc/schemas.ts, container/plugin/types.ts, container/src/types.ts) and their CHECK constraints. Also cleaned up:

  • container-dispatch.ts: removed witness case from systemPromptForRole
  • agents.ts: replaced 'witness' with 'triage' in townSingletonRoles
  • dashboard.ui.ts: replaced witness dropdown option with triage
  • patrol.ts: changed from_agent_id: 'witness' to 'patrol' in GUPP mails (these are sent by alarm code, not an agent)
  • Updated docstring comments referencing witness

jrf0110 force-pushed the 442-witness-deacon branch 2 times, most recently from c57493a to c59e803 on March 8, 2026 22:28
repo
).catch(() => {
// --force --track may fail on very old git; fall back to create-or-reset
exec('git', ['branch', '-f', trackingBranch, `origin/${defaultBranch}`], repo).catch(() => {});

Contributor

WARNING: Fallback branch reset is not awaited

If git branch --force --track ... fails, this catch launches git branch -f ... but never returns that promise. The outer await resolves immediately, so git worktree add ... trackingBranch can run before the fallback branch exists or before it has been reset to origin/<defaultBranch>, which makes browse-worktree setup flaky on the exact recovery path this block is trying to handle.

Contributor Author

Fixed. The catch block now uses try/catch with await instead of a fire-and-forget .catch() chain.
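
The difference between a fire-and-forget `.catch()` chain and an awaited fallback can be sketched as follows. This is a minimal illustration and not the code in git-manager.ts: `withAwaitedFallback` is a hypothetical helper, and letting a failed fallback propagate is an assumption of the sketch.

```typescript
// Hypothetical helper: runs `primary`, and only if it rejects, runs `fallback`.
// Both steps are awaited, so the returned promise settles only after the
// recovery path has finished.
async function withAwaitedFallback(
  primary: () => Promise<unknown>,
  fallback: () => Promise<unknown>
): Promise<void> {
  try {
    await primary();
  } catch {
    // Assumption: a failing fallback propagates to the caller instead of being swallowed.
    await fallback();
  }
}
```

In the branch setup discussed here, `primary` would wrap the `--force --track` call and `fallback` the `git branch -f` call, so `git worktree add` cannot start before one of them has completed.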

@@ -378,7 +379,7 @@ export async function runAgent(request: StartAgentRequest): Promise<ManagedAgent
// Resolve git credentials if missing. When the town config doesn't have
// a token (common on first dispatch after rig creation), fetch one from
// the Next.js server using the platform_integration_id.
- const envVars = await resolveGitCredentialsIfMissing(request);
+ const envVars = await resolveGitCredentials(request);

Contributor

WARNING: Resolved git credentials never reach the spawned agent

resolveGitCredentials() enriches envVars here, but runAgent() still builds the child process environment from the original request later on. For rigs that rely on platformIntegrationId, startup can clone and verify the repo successfully while the agent session itself still launches without GIT_TOKEN/GH_TOKEN, so in-session git push and gh commands fail.

Contributor Author

Fixed. After resolveGitCredentials enriches envVars, the request is now reassigned with the resolved envVars (request = { ...request, envVars }) so buildAgentEnv picks up GIT_TOKEN/GH_TOKEN for the spawned process.
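
A reduced sketch of why the reassignment matters: if the child environment is derived from the request, the resolved variables only take effect once they are folded back into it. The types and function names below are hypothetical stand-ins, not the real `StartAgentRequest` or `buildAgentEnv`.

```typescript
// Hypothetical, reduced request shape; the real StartAgentRequest has more fields.
interface StartAgentRequestSketch {
  agentId: string;
  envVars: Record<string, string>;
}

// Stand-in for buildAgentEnv: derives the child environment from the request only.
function buildAgentEnvSketch(request: StartAgentRequestSketch): Record<string, string> {
  return { ...request.envVars };
}

// Returns a copy of the request carrying the resolved variables,
// mirroring `request = { ...request, envVars }` from the reply above.
function applyResolvedEnv(
  request: StartAgentRequestSketch,
  resolvedEnvVars: Record<string, string>
): StartAgentRequestSketch {
  return { ...request, envVars: resolvedEnvVars };
}
```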

} else {
// startPoint ref may not exist (e.g. first convoy bead before
// the feature branch is created) — fall back to HEAD
await exec('git', ['branch', options.branch], repo);

Contributor

WARNING: Falling back to the repo's local HEAD can branch from stale code

When startPoint is missing on the remote, this path creates the branch with plain git branch <name>, which uses whatever commit the shared repo's local HEAD currently points at. That repo is only fetched, not checked out or reset, so its local branch can lag behind origin/<defaultBranch>; the first convoy bead can start from an old base and miss already-merged work.

Contributor Author

Fixed. The fallback now uses origin/<defaultBranch> instead of bare HEAD. Added defaultBranch to WorktreeOptions and pipe it through from runAgent.

// deleted rig may still exist if removeRig wasn't called (e.g. the
// tRPC deleteRig path before the fix). Remove the orphan and retry.
if (err instanceof Error && err.message.includes('UNIQUE constraint failed: rigs.name')) {
query(sql, /* sql */ `DELETE FROM rigs WHERE name = ? AND id != ?`, [

Contributor

WARNING: This retry still deletes live rigs on a name collision

GastownUserDO.createRig() does not enforce unique rig names, so a legitimate second rig with the same name will also land here. Deleting every row whose name matches before retrying silently removes the existing rig from the TownDO registry instead of surfacing the conflict, which can orphan a real rig's state and break later dispatches.

Contributor Author

Fixed. The retry path now checks if the conflicting rig has active beads (open or in_progress) before deleting. If it does, the error is surfaced instead of silently orphaning a live rig.
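
The guard described in the reply can be sketched as a small predicate. This is an illustration only: the status union and the function name are assumptions, and the real check presumably runs as a SQL query against the beads table.

```typescript
// Hypothetical status set; 'open' and 'in_progress' are the active states named above.
type BeadStatusSketch = 'open' | 'in_progress' | 'closed' | 'failed';

// True only when none of the conflicting rig's beads are still active,
// i.e. the rig row looks like a stale orphan rather than a live rig.
function isSafeToRemoveConflictingRig(beadStatuses: BeadStatusSketch[]): boolean {
  return !beadStatuses.some(status => status === 'open' || status === 'in_progress');
}
```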

defaultBranch: rigConfig.defaultBranch,
kilocodeToken,
townConfig,
systemPromptOverride: systemPrompt,

Contributor

WARNING: Triage prompt is only applied on the first launch

This batch is still started as a normal polecat and the triage behavior only exists in this one-off systemPromptOverride. If the process dies later, witnessPatrol() resets the hooked agent back to idle and schedulePendingWork() will redispatch the same gt:triage bead through the generic polecat path, which does not rebuild the triage prompt. That means recovery/retry can reopen a triage batch as ordinary coding work instead of a triage session.

Contributor Author

Acknowledged. The triage batch bead carries the gt:triage label, and the triage prompt is built dynamically in dispatchTriageBatch. If the process dies and the bead is redispatched via schedulePendingWork, it would go through the generic polecat path. The fix would be to detect gt:triage beads in schedulePendingWork and route them back through dispatchTriageBatch. This is a real gap but requires changes to the scheduler's dispatch routing — tracking as a follow-up.

this.sql,
/* sql */ `
SELECT COUNT(*) AS cnt FROM ${agent_metadata}
WHERE ${agent_metadata.status} = 'idle'

Contributor

SUGGESTION: orphanedHooks currently counts healthy queued work too

This query treats every idle agent with a hook as orphaned, but several normal flows intentionally create that state while waiting for the next scheduler tick (for example feedStrandedConvoys() and backoff restarts). The new Status tab will therefore report false patrol issues during healthy operation unless this metric reuses the stale-hook/orphaned-work criteria instead of raw idle + hooked.

Contributor Author

Fixed. The orphanedHooks query now only counts idle+hooked agents that have been idle for >5 minutes, filtering out agents that were just hooked by feedStrandedConvoys or restarted with backoff and are legitimately waiting for the next scheduler tick.
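
The tightened metric can be sketched in memory as follows. The row shape, the `idleSinceMs` field, and the function name are hypothetical; the actual implementation is a SQL query, and which timestamp column it compares against is not shown in this thread.

```typescript
// Hypothetical row shape for illustration.
interface AgentRowSketch {
  status: 'idle' | 'working' | 'stalled';
  current_hook_bead_id: string | null;
  idleSinceMs: number; // assumed: when the agent last became idle
}

const ORPHAN_GRACE_MS = 5 * 60_000;

// Counts agents that are idle, still hooked, and have been idle longer than the
// grace period. Freshly hooked agents waiting for the next tick are excluded.
function countOrphanedHooks(agents: AgentRowSketch[], nowMs: number): number {
  return agents.filter(
    a =>
      a.status === 'idle' &&
      a.current_hook_bead_id !== null &&
      nowMs - a.idleSinceMs > ORPHAN_GRACE_MS
  ).length;
}
```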

// Proactively clone the rig's repo and create a browse worktree so
// the mayor has immediate access to the codebase without waiting for
// the first agent dispatch.
this.setupRigRepoInContainer(rigConfig).catch(err =>

Contributor

WARNING: Failed repo setup is silently one-shot

setupRigRepoInContainer() is only kicked off here during configureRig(), but /repos/setup returns 202 before the clone/worktree actually succeeds and later failures are only logged in the container. If that background setup fails once (credentials not ready, transient git error, etc.), the mayor never gets a browse worktree for this rig until some unrelated future agent dispatch happens to clone the repo.

Contributor Author

Acknowledged. The browse worktree is best-effort and will be created on the next agent dispatch (which clones the repo as a side effect). A retry mechanism (e.g. re-triggering setup on the next alarm cycle if the browse dir doesn't exist) would be a good follow-up but is out of scope for this PR.

WHERE ${agent_metadata.bead_id} = ?
`,
[agent.id]
);
}
return started;
} catch (err) {

Contributor

WARNING: Exceptions leave the agent and bead stuck in an active state

dispatchAgent() now flips the bead to in_progress and the agent to working before the container call, but this catch path only logs and returns false. If startAgentInContainer() throws before it can return false (for example on a container fetch/DO error), the scheduler never rolls either row back, so that work can stay wedged indefinitely and never get redispatched.

Contributor Author

Fixed. The catch block in dispatchAgent now rolls back the agent to idle and the bead to open so the scheduler can retry on the next alarm tick.
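
The rollback behaviour can be sketched with in-memory state standing in for the agent and bead rows. This is a simplified illustration under stated assumptions: `DispatchStateSketch`, `dispatchWithRollback`, and `rollback` are hypothetical names, and the real code updates SQL rows instead of an object.

```typescript
// Hypothetical in-memory stand-in for the agent and bead rows.
interface DispatchStateSketch {
  agentStatus: 'idle' | 'working';
  beadStatus: 'open' | 'in_progress';
}

function rollback(state: DispatchStateSketch): void {
  state.agentStatus = 'idle';
  state.beadStatus = 'open';
}

async function dispatchWithRollback(
  state: DispatchStateSketch,
  startInContainer: () => Promise<boolean>
): Promise<boolean> {
  // Optimistically mark both rows active before the container call.
  state.agentStatus = 'working';
  state.beadStatus = 'in_progress';
  try {
    const started = await startInContainer();
    if (!started) rollback(state);
    return started;
  } catch {
    // A thrown error (for example a container fetch failure) is rolled back
    // the same way as a `false` result, so the next tick can retry.
    rollback(state);
    return false;
  }
}
```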

await exec('git', ['checkout', defaultBranch], browseDir).catch(() => {});
await exec('git', ['pull', '--rebase', '--autostash'], browseDir).catch(() => {});
console.log(`Updated browse worktree for rig ${rigId} at ${browseDir}`);
return browseDir;

Contributor

WARNING: Browse setup reports success after a failed refresh

Both the checkout and pull calls above are swallowed, but this path still returns browseDir and logs the worktree as updated. An auth failure or transient git error would therefore leave the mayor browsing stale code while /repos/setup looks successful. This should either recreate the browse worktree or surface the failure so setup can retry.

Contributor Author

Fixed. The checkout/pull calls are now in a try/catch that logs a warning with the failure message instead of silently swallowing errors. The browseDir is still returned (it exists on disk) but the log makes it clear the code may be stale.

jrf0110 force-pushed the 442-witness-deacon branch from d5afd5b to a1d1858 on March 9, 2026 17:15
const triageBead = beadOps.getBead(this.sql, input.triage_request_bead_id);
if (!triageBead)
throw new Error(`Triage request bead ${input.triage_request_bead_id} not found`);
if (!triageBead.labels.includes(patrol.TRIAGE_REQUEST_LABEL)) {

Contributor

WARNING: Closed triage requests can be resolved again

resolveTriage() only validates the triage label here. Because the handler accepts an arbitrary triage_request_bead_id, a retried or stale call against an already-closed request will replay side effects like RESTART, CLOSE_BEAD, or REASSIGN_BEAD. Rejecting anything except status === 'open' would make this endpoint idempotent and prevent duplicate actions.

Contributor Author

Fixed. resolveTriage now rejects requests where the triage bead status is not 'open', preventing duplicate side effects from retried or stale calls.
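
The validation order can be sketched as a standalone guard. The bead shape, the function name, and two of the error messages are hypothetical; the label value is passed in as a parameter because the value of `patrol.TRIAGE_REQUEST_LABEL` is not shown in this thread.

```typescript
// Hypothetical, reduced bead shape.
interface TriageBeadSketch {
  bead_id: string;
  status: string;
  labels: string[];
}

function assertResolvable(
  bead: TriageBeadSketch | null,
  beadId: string,
  triageRequestLabel: string
): TriageBeadSketch {
  if (!bead) throw new Error(`Triage request bead ${beadId} not found`);
  if (!bead.labels.includes(triageRequestLabel)) {
    throw new Error(`Bead ${beadId} is not a triage request`);
  }
  // Only open requests may be resolved, so a retried or stale call
  // cannot replay side effects against an already-closed request.
  if (bead.status !== 'open') {
    throw new Error(`Triage request bead ${beadId} is already ${bead.status}`);
  }
  return bead;
}
```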

.input(z.object({ rigId: z.string().uuid() }))
.mutation(async ({ ctx, input }) => {
requireGastownAccess(ctx);
- await verifyRigOwnership(ctx.env, ctx.userId, input.rigId);
+ const rig = await verifyRigOwnership(ctx.env, ctx.userId, input.rigId);
const userStub = getGastownUserStub(ctx.env, ctx.userId);
await userStub.deleteRig(input.rigId);

Contributor

WARNING: Rig deletion still succeeds on partial cleanup failure

This deletes the user-facing rig record before the Town DO cleanup is guaranteed. If removeRig() then throws, the mutation still returns success with the user row gone but the Town DO registry still holding the stale rig entry, so recreating the same name can keep hitting the same UNIQUE conflict this change is trying to avoid. This should either clean up the Town DO first or fail/retry instead of swallowing the error.

Contributor Author

Fixed. The deletion order is now reversed: TownDO.removeRig runs first (freeing the name), then userStub.deleteRig. If the Town DO cleanup fails, the mutation throws and the user record stays intact so they can retry.
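
The reversed ordering can be sketched with two injected steps. This is an illustration only: `deleteRigInOrder` and its parameters are hypothetical, standing in for `TownDO.removeRig` and `userStub.deleteRig`.

```typescript
// Hypothetical sketch: each step is injected so the ordering is the only logic here.
async function deleteRigInOrder(
  removeFromTown: () => Promise<void>,
  deleteUserRecord: () => Promise<void>
): Promise<void> {
  // Town DO cleanup runs first. If it throws, the error propagates and the
  // user record is never touched, so the caller can retry the whole mutation.
  await removeFromTown();
  await deleteUserRecord();
}
```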

jrf0110 force-pushed the 442-witness-deacon branch from a1d1858 to 6c1581a on March 9, 2026 17:30
unhookBead(sql, row.assignee_agent_bead_id);
}

logBeadEvent(sql, {

Contributor

WARNING: Timeout path logs the same failure twice

updateBeadStatus() already writes a status_changed event for the transition to failed. Emitting another status_changed event here means each timeout contributes two failure records, so detectCrashLoops() can trip after only two timed-out beads instead of the intended three, and the new status feed will show duplicate failure transitions.

Contributor Author

Fixed. Removed the redundant logBeadEvent call since updateBeadStatus already writes a status_changed event internally.
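
A small sketch shows why the duplicate event mattered for crash loop detection. It assumes the detector counts failure events inside a 30-minute window against a threshold of 3, as stated in the summary; `isCrashLooping` and the constants are hypothetical names, not the signature of `detectCrashLoops()`.

```typescript
const CRASH_WINDOW_MS = 30 * 60_000;
const CRASH_THRESHOLD = 3;

// Counts failure events that fall inside the window ending at `nowMs`.
function isCrashLooping(failureTimesMs: number[], nowMs: number): boolean {
  const recent = failureTimesMs.filter(t => t <= nowMs && nowMs - t <= CRASH_WINDOW_MS);
  return recent.length >= CRASH_THRESHOLD;
}

const now = 100 * 60_000;
const twoTimeouts = [90 * 60_000, 95 * 60_000];
// Each timeout logged once: two events, below the threshold.
const loggedOnce = isCrashLooping(twoTimeouts, now);
// Each timeout logged twice: four events, the detector trips early.
const loggedTwice = isCrashLooping([...twoTimeouts, ...twoTimeouts], now);
```

Here `loggedOnce` is `false` and `loggedTwice` is `true`, which matches the reviewer's point that two timed-out beads were enough to trip the detector.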

}
break;
}
case 'CLOSE_BEAD': {

Contributor

WARNING: Destructive triage actions leave the old polecat running

Unlike RESTART, the CLOSE_BEAD and REASSIGN_BEAD branches below never call stopAgentInContainer() when the target agent is still working or stalled. That lets the old process keep running after the bead has been failed/reopened in SQL, so the scheduler can redispatch the same work while the original polecat is still pushing stale changes.

Contributor Author

Fixed. Both CLOSE_BEAD and REASSIGN_BEAD now call stopAgentInContainer when the target agent is working or stalled, matching the RESTART path.

// submitting to the review queue. Detected by the gt:triage label
// on the hooked bead.
const hookedBead = getBead(sql, agent.current_hook_bead_id);
if (hookedBead?.labels.includes('gt:triage')) {

Contributor

WARNING: Any user-created gt:triage bead now bypasses review

This fast-path keys entirely off labels.includes('gt:triage'), but handleCreateBead() accepts arbitrary labels from the API/UI. A normal issue accidentally or intentionally created with that reserved label will be silently closed on gt_done() instead of entering the review queue, which drops the agent's output on the floor.

Contributor Author

Fixed. The triage fast-path now also checks created_by === 'patrol' so only system-created triage beads bypass review. User-created beads that happen to carry the gt:triage label go through the normal review flow.

jrf0110 force-pushed the 442-witness-deacon branch from 6c1581a to 5c2858c on March 9, 2026 17:41
const townId = statusMatch[1];
console.log(`[gastown-worker] WS upgrade (status): townId=${townId}`);
const stub = getTownDOStub(env, townId);
return stub.fetch(request);

Contributor

WARNING: This status feed bypasses the town auth/ownership checks

WebSocket upgrades are handled here before the request reaches Hono, so neither townIdMiddleware nor the /api/towns/:townId/rigs/:rigId/* auth middleware runs. That means anyone who knows a townId can subscribe to /api/towns/:townId/status/ws and read recent events plus agent/bead state without going through the same authorization path as getAlarmStatus.

Contributor Author

Acknowledged. This is the same pattern as the stream and PTY WebSocket endpoints (lines 466–486) which also bypass Hono middleware. All three rely on UUID-only knowledge and capability-token auth obtained via authenticated POSTs. Adding ticket-based auth for all WS paths is a holistic follow-up.

reconnectTimerRef.current = setTimeout(connect, 3_000);
};

ws.onerror = () => {

Contributor

WARNING: The status pane never falls back if WebSockets are unavailable

useAlarmStatusWs() only reconnects on close/error; it never switches to the existing getAlarmStatus query path. In environments where WS upgrades are blocked or unsupported, data stays null forever and the new Status tab never renders a snapshot.

Contributor Author

Acknowledged. The WS-only pattern is consistent with the existing stream/PTY terminal connections. Adding a polling fallback via the existing getAlarmStatus tRPC query is a valid improvement but is outside the scope of this PR.

jrf0110 force-pushed the 442-witness-deacon branch from 5c2858c to ec2c8af on March 9, 2026 17:50
/* sql */ `
SELECT ${beads.bead_id} FROM ${beads}
WHERE ${beads.type} = 'issue'
AND ${beads.labels} LIKE ?

Contributor

WARNING: Reserved-label issues can block triage dispatch

handleCreateBead() accepts arbitrary labels, so any normal open issue tagged gt:triage will satisfy this duplicate guard. Once that happens, maybeDispatchTriageAgent() returns early forever and real gt:triage-request beads never get a worker. Filtering on a system-owned marker (for example created_by = 'patrol') would keep user-created labels from starving the triage queue.

Contributor Author

Fixed. The duplicate guard query now also filters on created_by = 'patrol' so user-created beads with the gt:triage label can't starve the triage queue.

title: `Triage batch: ${pendingCount} request(s)`,
body: 'Process all pending triage request beads and resolve each one.',
priority: 'high',
labels: [patrol.TRIAGE_BATCH_LABEL],

Contributor

WARNING: The gt_done safety net never matches this batch bead

reviewQueue.agentDone() now only short-circuits for gt:triage beads created by patrol, but this synthetic batch is inserted without created_by. If the model ever calls gt_done instead of gt_bead_close, the batch still goes into the review queue. Setting created_by: 'patrol' here keeps the new guard working for the system-created triage batch.

Contributor Author

Fixed. The synthetic triage batch bead now sets created_by: 'patrol', matching the guard in agentDone() that checks created_by === 'patrol' for the gt:triage fast-path.

jrf0110 force-pushed the 442-witness-deacon branch from ec2c8af to 546d2d4 on March 9, 2026 18:30
// check, any rig agent (polecat, refinery) could call the endpoint and
// trigger restart/close/escalate side effects on other agents.
const hookedBead = await town.getHookedBead(agentId);
if (!hookedBead || !hookedBead.labels.includes('gt:triage')) {

Contributor

CRITICAL: Label-only gating still allows non-triage agents to resolve requests

gt:triage is not reserved to patrol-created batch beads here: handleCreateBead() accepts arbitrary labels, and agentDone() explicitly treats user-created gt:triage beads as a supported case. Any agent hooked to one of those beads can still call /triage/resolve and trigger restart/close side effects on other agents. This authorization check needs a system-owned marker as well (for example created_by === 'patrol').

Contributor Author

Fixed. The handler now also checks hookedBead.created_by === 'patrol' in addition to the label check. User-created beads with the gt:triage label can no longer satisfy the authorization gate.
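
The combined gate can be sketched as one predicate that requires both the label and the system-owned marker. The bead shape and the function name are hypothetical; the `'gt:triage'` label and the `'patrol'` creator value are taken from the thread above.

```typescript
// Hypothetical, reduced bead shape.
interface HookedBeadSketch {
  labels: string[];
  created_by: string | null;
}

// The label alone is not trusted because users can create beads with arbitrary
// labels; the bead must also have been created by the patrol code.
function isPatrolTriageBead(bead: HookedBeadSketch | null | undefined): boolean {
  return !!bead && bead.labels.includes('gt:triage') && bead.created_by === 'patrol';
}
```

A single shared predicate of this kind would keep the handler authorization, the `agentDone()` fast-path, and the duplicate guard from drifting apart, though whether the PR factors it that way is not shown here.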

jrf0110 force-pushed the 442-witness-deacon branch 2 times, most recently from 84b1b21 to 133cff0 on March 9, 2026 18:44
jrf0110 changed the base branch from 901-feature-flags to main on March 9, 2026 18:45
jrf0110 enabled auto-merge (squash) on March 9, 2026 18:45
jrf0110 disabled auto-merge on March 9, 2026 18:46
jrf0110 changed the base branch from main to 901-feature-flags on March 9, 2026 18:46
… improvements (#442)

Alarm-driven patrol system (witness & deacon):
- Tiered GUPP violation handling (30min warn, 1h escalate+triage, 2h force-stop)
- Orphaned work detection, stale hook recovery, agent GC, crash loop detection
- Per-bead timeout enforcement with agent container termination
- On-demand LLM triage agent for ambiguous situations
- Triage action validation, access control, and snapshot-based resolution
- Stranded convoy feeding with immediate dispatch eligibility

Mayor codebase browsing:
- Browse worktrees at /workspace/rigs/<rigId>/browse/ for read-only access
- POST /repos/setup container endpoint for proactive repo cloning
- System prompt written to AGENTS.md so mayor and sub-agents share context
- Git credential race fix: refreshGitCredentials runs before configureRig
- GIT_TERMINAL_PROMPT=0 to prevent credential prompt hangs

Agent dispatch improvements:
- startPoint parameter for convoy agents to branch from feature branch
- platformIntegrationId and KILOCODE_TOKEN plumbed through repo setup
- Existing users arm watchdog on DO init
- RESTART_WITH_BACKOFF uses dispatch cooldown delay

Rig deletion fix:
- tRPC deleteRig now calls TownDO.removeRig (was missing)
- addRig handles stale name conflicts via catch-and-retry

Real-time alarm status UI:
- Hibernatable WebSocket for live alarm status push
- Status tab in terminal bar with agent/bead/patrol cards

Other UI:
- Convoy title and branch use flex-based truncation instead of fixed max-width
- Status pane card padding normalized to p-2
- Legacy agent roles accepted in Zod schemas for backward compat
- PostHog feature flag integration for gastown access gating
@jrf0110 jrf0110 force-pushed the 442-witness-deacon branch from 133cff0 to 8c92846 Compare March 9, 2026 18:50
@jrf0110 jrf0110 merged this pull request into 901-feature-flags Mar 9, 2026
1 of 2 checks passed
@jrf0110 jrf0110 deleted the 442-witness-deacon branch March 9, 2026 18:50
jrf0110 added a commit that referenced this pull request Mar 9, 2026
… improvements (#442) (#924)

jrf0110 added a commit that referenced this pull request Mar 9, 2026
#904)

* feat(gastown): replace binary is_admin gate with PostHog feature flags (#901)

Replace the binary is_admin check from #537 with PostHog feature flags
for progressive rollout. Flag management (allowlists, percentage rollout,
kill-switch) is handled entirely through the PostHog dashboard — no
custom DB tables or admin UI needed.

Gate points updated:
- 9 Next.js pages use isFeatureFlagEnabled('gastown-access', user.id)
- Sidebar uses useFeatureFlagEnabled('gastown-access')
- Token endpoint evaluates the flag and embeds gastownAccess in the JWT
- Worker checks gastownAccess JWT claim (isAdmin fallback for compat)

Sub-feature flag names defined: gastown-convoys, gastown-pr-merge,
gastown-multi-rig (to be created in PostHog when needed).

Closes #901

* fix: address PR review comments

- Switch from isFeatureFlagEnabled to isReleaseToggleEnabled for strict
  boolean auth checks (prevents multivariate variants from granting access)
- Remove dev-mode bypass — gate in dev too via PostHog
- Abstract requireGastownAccess into gastownProcedure composable tRPC
  middleware in init.ts, replacing manual requireGastownAccess(ctx) calls
- Remove sub-feature flags (convoys, pr_merge, multi_rig) — only
  gastown-access remains

* fix: use strict boolean check for client-side gastown flag

Use useFeatureFlagVariantKey === true instead of useFeatureFlagEnabled
to align the sidebar with the server-side isReleaseToggleEnabled check.
This prevents multivariate string variants from showing the nav item
when server-side access would be denied.
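
To illustrate why the strict comparison matters, here is a small sketch that does not use the PostHog SDK. It assumes a flag value can be a boolean, a variant key string, or undefined; the helper names are hypothetical.

```typescript
// Assumed range of flag values: boolean flags, multivariate variant keys,
// or undefined when the flag is unknown.
type FlagValue = boolean | string | undefined;

// Loose check: any truthy value passes, including a variant key string.
function isEnabledLoose(value: FlagValue): boolean {
  return Boolean(value);
}

// Strict check: only the literal boolean true passes.
function isEnabledStrict(value: FlagValue): boolean {
  return value === true;
}

const sampleFlagValues: FlagValue[] = [true, false, 'variant-a', undefined];
for (const value of sampleFlagValues) {
  console.log(String(value), isEnabledLoose(value), isEnabledStrict(value));
}
```

The only row where the two checks disagree is the string variant, which is the case this commit closes off on the client.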

* fix: use isFeatureFlagEnabled with dev override for gastown-access

Switch all gate points from isReleaseToggleEnabled back to
isFeatureFlagEnabled. Add a DEV_ENABLED_FLAGS set to
posthog-feature-flags.ts that returns true for gastown-access in
non-production environments so local dev works without PostHog
configuration. Sidebar reverts to useFeatureFlagEnabled.

* refactor: move dev override into isGastownEnabled, revert posthog-feature-flags.ts

Move the dev-mode override out of the shared posthog-feature-flags.ts
module and into isGastownEnabled in src/lib/gastown/feature-flags.ts.
All pages and the token endpoint now call isGastownEnabled(user.id)
which returns true in non-production and delegates to
isFeatureFlagEnabled in production. The sidebar uses
useFeatureFlagEnabled || isDevelopment for the same effect client-side.
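
A rough sketch of the shape this refactor describes. The real isGastownEnabled takes only the user id; the extra `nodeEnv` and `lookup` parameters here are assumptions added so the example runs on its own, and the synchronous lookup stands in for the project's isFeatureFlagEnabled, which may well be asynchronous.

```typescript
// Stand-in signature for the shared flag helper (hypothetical, synchronous).
type FlagLookup = (flag: string, userId: string) => boolean;

// Dev override lives here rather than in the shared flag module:
// non-production always passes, production delegates to the flag lookup.
function isGastownEnabled(userId: string, nodeEnv: string, lookup: FlagLookup): boolean {
  if (nodeEnv !== 'production') return true;
  return lookup('gastown-access', userId);
}

// Fake lookup that enables the flag for a single user id.
const fakeLookup: FlagLookup = (flag, userId) =>
  flag === 'gastown-access' && userId === 'allowlisted-user';

console.log(isGastownEnabled('anyone', 'development', fakeLookup)); // true
console.log(isGastownEnabled('anyone', 'production', fakeLookup)); // false
console.log(isGastownEnabled('allowlisted-user', 'production', fakeLookup)); // true
```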

* feat(gastown): witness/deacon patrol, mayor codebase browsing, and UI improvements (#442) (#924)

* fix(gastown): only stop agent on triage resolve if still hooked to snapshot bead

CLOSE_BEAD and REASSIGN_BEAD now check that the agent's current hook
matches the snapshot bead from the triage request before calling
stopAgentInContainer. If the agent has moved on to different work,
stopping it would abort unrelated sessions.
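
The guard described here can be sketched as a single predicate. The record shapes and field names below are hypothetical; only the rule itself (stop the agent only when its current hook still matches the snapshot bead) comes from the commit message.

```typescript
// Hypothetical shapes; field names are illustrative only.
interface AgentRecord {
  id: string;
  current_hook_bead_id: string | null;
}

interface TriageSnapshot {
  agent_id: string;
  bead_id: string;
}

// True only if the agent is still working on the bead captured when the
// triage request was created. If it has moved on, stopping it would abort
// an unrelated session.
function shouldStopAgent(agent: AgentRecord, snapshot: TriageSnapshot): boolean {
  return agent.id === snapshot.agent_id && agent.current_hook_bead_id === snapshot.bead_id;
}

const triageSnapshot: TriageSnapshot = { agent_id: 'agent-1', bead_id: 'bead-A' };
const stillHooked: AgentRecord = { id: 'agent-1', current_hook_bead_id: 'bead-A' };
const movedOn: AgentRecord = { id: 'agent-1', current_hook_bead_id: 'bead-B' };

console.log(shouldStopAgent(stillHooked, triageSnapshot)); // true
console.log(shouldStopAgent(movedOn, triageSnapshot)); // false
```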

* style: fix prettier formatting in gastown type declarations

* fix: resolve lint errors (no-base-to-string, unused vars)

* fix(gastown): triage agent should call gt_done not gt_bead_close

gt_bead_close only marks the bead closed without unhooking the agent
or resetting it to idle, leaking agent records. gt_done triggers the
agentDone path which has the patrol-created triage fast-path that
properly closes the batch, unhooks, and returns the agent to idle.

* fix(gastown): refinery singleton, container eviction recovery, review queue safety

- Treat refinery as per-rig singleton in getOrCreateAgent to prevent
  UNIQUE constraint on identity when a refinery already exists
- Re-queue review entry (reset to open) when refinery is busy instead
  of leaving it stuck in in_progress
- Return 'not_found' (not 'unknown') from checkAgentContainerStatus on
  404, so witnessPatrol immediately resets and redispatches agents after
  container eviction instead of waiting for the 2-hour GUPP timeout
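
The third bullet can be illustrated with a small sketch. The status and decision names other than 'not_found' and 'unknown' are assumptions, and the real checkAgentContainerStatus presumably performs an HTTP call rather than receiving a status code directly.

```typescript
// 'not_found' and 'unknown' come from the commit message; 'running' is assumed.
type ContainerStatus = 'running' | 'not_found' | 'unknown';

// Hypothetical mapping from an HTTP status code to a container status.
// A 404 is a definite answer (the container is gone), not an unknown.
function mapContainerStatus(httpStatus: number): ContainerStatus {
  if (httpStatus === 404) return 'not_found';
  if (httpStatus >= 200 && httpStatus < 300) return 'running';
  return 'unknown';
}

// Hypothetical patrol decisions, named for illustration.
type PatrolDecision = 'leave_running' | 'reset_and_redispatch' | 'wait_for_timeout';

function decideForContainer(containerStatus: ContainerStatus): PatrolDecision {
  switch (containerStatus) {
    case 'running':
      return 'leave_running';
    case 'not_found':
      return 'reset_and_redispatch'; // container evicted: act immediately
    case 'unknown':
      return 'wait_for_timeout'; // ambiguous: fall back to the tiered timeouts
  }
}

console.log(decideForContainer(mapContainerStatus(404))); // reset_and_redispatch
console.log(decideForContainer(mapContainerStatus(503))); // wait_for_timeout
```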

* fix(gastown): triage prompt, timestamp format, per-rig creds, browse refresh

- Remove remaining gt_bead_close reference in triage prompt (line 72)
  that contradicted the gt_done instruction on line 49
- Use strftime with ISO format in orphanedHooks SQL query to match
  the toISOString() format stored in last_activity_at
- Resolve git credentials per-rig in mayor browse setup instead of
  sharing one credential set across all rigs
- Browse worktree refresh uses fetch+reset instead of checkout to
  avoid wrong-branch errors (worktree is on synthetic browse branch)
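
The timestamp bullet is easier to see with concrete strings. The sketch below assumes the earlier query compared against a space-separated 'YYYY-MM-DD HH:MM:SS' value (the format SQLite's datetime() produces); the commit message does not say what the previous format was, so treat that as an assumption.

```typescript
// Stored value, as produced by toISOString(): 17:00 UTC.
const storedIso = new Date(Date.UTC(2026, 2, 9, 17, 0, 0)).toISOString();
console.log(storedIso); // 2026-03-09T17:00:00.000Z

// Two renderings of the same 18:00 UTC cutoff.
const spaceSeparatedCutoff = '2026-03-09 18:00:00';
const isoCutoff = '2026-03-09T18:00:00.000Z';

// Text comparison against the space-separated form is wrong: at index 10,
// 'T' sorts after ' ', so the 17:00 value looks later than the 18:00 cutoff.
console.log(storedIso < spaceSeparatedCutoff); // false

// With matching ISO formats, lexicographic order equals chronological order.
console.log(storedIso < isoCutoff); // true
```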

* fix(gastown): escalations now create triage requests for automated follow-up

Previously, gt_escalate created an escalation bead and optionally
notified the mayor, but nothing automated acted on it. Escalation
beads sat open with no assignee indefinitely.

Now routeEscalation creates a triage request alongside the escalation
bead, feeding the escalation into the patrol→triage→resolve loop.
The triage agent can then RESTART, REASSIGN, CLOSE, or ESCALATE_TO_MAYOR
with the full context of the original escalation.

When a triage request linked to an escalation is resolved, the
escalation bead is also closed automatically.

Also adds 'escalation' to the TriageType union and enriches the
ESCALATE_TO_MAYOR mayor message with agent and bead context.

* feat(gastown): store convoy_id and source_bead_id in escalation metadata

When an agent escalates from within a convoy, the escalation bead and
its triage request now carry convoy_id and source_bead_id in their
metadata. This associates escalations with their convoy for display
purposes and lays groundwork for Phase 4 convoy-aware triage handling.

* fix(gastown): escalation metadata path, polling fallback, refinery rollback, regen types

- Fix escalation_bead_id lookup in resolveTriage to read from
  metadata.context (matching createTriageRequest's structure)
- Add polling fallback to AlarmStatusPane via tRPC getAlarmStatus
  query when WebSocket fails, with 5s refetch interval
- Reset refinery to idle when container start fails in processReviewQueue
- Regenerate gastown type declarations to include getAlarmStatus

Development

Successfully merging this pull request may close these issues.

Witness & Deacon: Alarm-driven orchestration with on-demand LLM triage
