feat(gastown): alarm-driven witness/deacon orchestration with on-demand LLM triage (#442) #924
Conversation
Code Review Summary
Status: 1 Issue Found | Recommendation: Address before merge
Issue Details: 1 CRITICAL
Other Observations (not in diff): issues found in unchanged code that cannot receive inline comments: N/A
Files Reviewed: 44 files
We should remove Remaining
Done. Removed
Force-pushed c57493a to c59e803
  repo
).catch(() => {
  // --force --track may fail on very old git; fall back to create-or-reset
  exec('git', ['branch', '-f', trackingBranch, `origin/${defaultBranch}`], repo).catch(() => {});
WARNING: Fallback branch reset is not awaited
If git branch --force --track ... fails, this catch launches git branch -f ... but never returns that promise. The outer await resolves immediately, so git worktree add ... trackingBranch can run before the fallback branch exists or before it has been reset to origin/<defaultBranch>, which makes browse-worktree setup flaky on the exact recovery path this block is trying to handle.
Fixed. The catch block now uses try/catch with await instead of a fire-and-forget .catch() chain.
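The awaited fallback described in this fix can be sketched as follows. This is a minimal illustration, not the PR's actual code: `Exec` and `ensureTrackingBranch` are hypothetical names standing in for the repo's promise-based git runner and the surrounding setup function.

```typescript
type Exec = (cmd: string, args: string[], cwd: string) => Promise<void>;

async function ensureTrackingBranch(
  exec: Exec,
  trackingBranch: string,
  defaultBranch: string,
  repo: string,
): Promise<void> {
  try {
    // --force --track may fail on very old git
    await exec(
      'git',
      ['branch', '--force', '--track', trackingBranch, `origin/${defaultBranch}`],
      repo,
    );
  } catch {
    // Awaited fallback: `git worktree add` can no longer race ahead of the
    // branch being created or reset to origin/<defaultBranch>.
    await exec('git', ['branch', '-f', trackingBranch, `origin/${defaultBranch}`], repo);
  }
}
```

Because both paths are awaited, the caller's `await` only resolves once the tracking branch actually exists.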
@@ -378,7 +379,7 @@ export async function runAgent(request: StartAgentRequest): Promise<ManagedAgent
  // Resolve git credentials if missing. When the town config doesn't have
  // a token (common on first dispatch after rig creation), fetch one from
  // the Next.js server using the platform_integration_id.
- const envVars = await resolveGitCredentialsIfMissing(request);
+ const envVars = await resolveGitCredentials(request);
WARNING: Resolved git credentials never reach the spawned agent
resolveGitCredentials() enriches envVars here, but runAgent() still builds the child process environment from the original request later on. For rigs that rely on platformIntegrationId, startup can clone and verify the repo successfully while the agent session itself still launches without GIT_TOKEN/GH_TOKEN, so in-session git push and gh commands fail.
Fixed. After resolveGitCredentials enriches envVars, the request is now reassigned with the resolved envVars (request = { ...request, envVars }) so buildAgentEnv picks up GIT_TOKEN/GH_TOKEN for the spawned process.
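The reassignment in this fix amounts to the following sketch, with `StartAgentRequest` reduced to the one field that matters here; `applyResolvedEnv` is an illustrative helper, not a function from the PR.

```typescript
interface StartAgentRequest {
  envVars: Record<string, string>;
}

function applyResolvedEnv(
  request: StartAgentRequest,
  envVars: Record<string, string>,
): StartAgentRequest {
  // Reassign rather than mutate, so everything downstream that reads
  // `request` (e.g. the env builder for the spawned process) sees the
  // resolved GIT_TOKEN/GH_TOKEN values.
  return { ...request, envVars };
}
```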
} else {
  // startPoint ref may not exist (e.g. first convoy bead before
  // the feature branch is created) — fall back to HEAD
  await exec('git', ['branch', options.branch], repo);
WARNING: Falling back to the repo's local HEAD can branch from stale code
When startPoint is missing on the remote, this path creates the branch with plain git branch <name>, which uses whatever commit the shared repo's local HEAD currently points at. That repo is only fetched, not checked out or reset, so its local branch can lag behind origin/<defaultBranch>; the first convoy bead can start from an old base and miss already-merged work.
Fixed. The fallback now uses origin/<defaultBranch> instead of bare HEAD. Added defaultBranch to WorktreeOptions and pipe it through from runAgent.
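The corrected fallback can be sketched like this; `createBranch` and `Exec` are hypothetical names, and only the branching decision from the fix is shown.

```typescript
type Exec = (cmd: string, args: string[], cwd: string) => Promise<void>;

async function createBranch(
  exec: Exec,
  branch: string,
  repo: string,
  defaultBranch: string,
  startPoint?: string,
): Promise<void> {
  if (startPoint) {
    await exec('git', ['branch', branch, startPoint], repo);
  } else {
    // was: exec('git', ['branch', branch], repo), which bases the branch
    // on the shared repo's possibly-stale local HEAD. Basing it on
    // origin/<defaultBranch> keeps the fetched remote state as the base.
    await exec('git', ['branch', branch, `origin/${defaultBranch}`], repo);
  }
}
```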
// deleted rig may still exist if removeRig wasn't called (e.g. the
// tRPC deleteRig path before the fix). Remove the orphan and retry.
if (err instanceof Error && err.message.includes('UNIQUE constraint failed: rigs.name')) {
  query(sql, /* sql */ `DELETE FROM rigs WHERE name = ? AND id != ?`, [
WARNING: This retry still deletes live rigs on a name collision
GastownUserDO.createRig() does not enforce unique rig names, so a legitimate second rig with the same name will also land here. Deleting every row whose name matches before retrying silently removes the existing rig from the TownDO registry instead of surfacing the conflict, which can orphan a real rig's state and break later dispatches.
Fixed. The retry path now checks if the conflicting rig has active beads (open or in_progress) before deleting. If it does, the error is surfaced instead of silently orphaning a live rig.
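The guard at the heart of this fix is a simple predicate; the sketch below assumes the fix's "open or in_progress" definition of active and uses an illustrative function name.

```typescript
type BeadStatus = 'open' | 'in_progress' | 'closed' | 'failed';

// A conflicting rig row may only be deleted when none of its beads are
// active; otherwise the name collision must be surfaced to the caller.
function canDeleteConflictingRig(beadStatuses: BeadStatus[]): boolean {
  return !beadStatuses.some((s) => s === 'open' || s === 'in_progress');
}
```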
defaultBranch: rigConfig.defaultBranch,
kilocodeToken,
townConfig,
systemPromptOverride: systemPrompt,
WARNING: Triage prompt is only applied on the first launch
This batch is still started as a normal polecat and the triage behavior only exists in this one-off systemPromptOverride. If the process dies later, witnessPatrol() resets the hooked agent back to idle and schedulePendingWork() will redispatch the same gt:triage bead through the generic polecat path, which does not rebuild the triage prompt. That means recovery/retry can reopen a triage batch as ordinary coding work instead of a triage session.
Acknowledged. The triage batch bead carries the gt:triage label, and the triage prompt is built dynamically in dispatchTriageBatch. If the process dies and the bead is redispatched via schedulePendingWork, it would go through the generic polecat path. The fix would be to detect gt:triage beads in schedulePendingWork and route them back through dispatchTriageBatch. This is a real gap but requires changes to the scheduler's dispatch routing — tracking as a follow-up.
this.sql,
/* sql */ `
  SELECT COUNT(*) AS cnt FROM ${agent_metadata}
  WHERE ${agent_metadata.status} = 'idle'
SUGGESTION: orphanedHooks currently counts healthy queued work too
This query treats every idle agent with a hook as orphaned, but several normal flows intentionally create that state while waiting for the next scheduler tick (for example feedStrandedConvoys() and backoff restarts). The new Status tab will therefore report false patrol issues during healthy operation unless this metric reuses the stale-hook/orphaned-work criteria instead of raw idle + hooked.
Fixed. The orphanedHooks query now only counts idle+hooked agents that have been idle for >5 minutes, filtering out agents that were just hooked by feedStrandedConvoys or restarted with backoff and are legitimately waiting for the next scheduler tick.
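The refined criterion can be expressed as a predicate like the one below. This is a sketch under the fix's stated assumptions (a 5-minute grace period, `last_activity_at` stored as a `toISOString()` timestamp); the function name is illustrative, not the PR's.

```typescript
const ORPHAN_GRACE_MS = 5 * 60 * 1000;

// idle + hooked only counts as orphaned once the agent has been inactive
// past the grace period, so agents just hooked by feedStrandedConvoys()
// or restarted with backoff are not flagged as patrol issues.
function isOrphanedHook(
  status: string,
  hookBeadId: string | null,
  lastActivityAt: string, // ISO string, matching toISOString() in storage
  now: number = Date.now(),
): boolean {
  return (
    status === 'idle' &&
    hookBeadId !== null &&
    now - Date.parse(lastActivityAt) > ORPHAN_GRACE_MS
  );
}
```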
// Proactively clone the rig's repo and create a browse worktree so
// the mayor has immediate access to the codebase without waiting for
// the first agent dispatch.
this.setupRigRepoInContainer(rigConfig).catch(err =>
WARNING: Failed repo setup is silently one-shot
setupRigRepoInContainer() is only kicked off here during configureRig(), but /repos/setup returns 202 before the clone/worktree actually succeeds and later failures are only logged in the container. If that background setup fails once (credentials not ready, transient git error, etc.), the mayor never gets a browse worktree for this rig until some unrelated future agent dispatch happens to clone the repo.
Acknowledged. The browse worktree is best-effort and will be created on the next agent dispatch (which clones the repo as a side effect). A retry mechanism (e.g. re-triggering setup on the next alarm cycle if the browse dir doesn't exist) would be a good follow-up but is out of scope for this PR.
      WHERE ${agent_metadata.bead_id} = ?
    `,
    [agent.id]
  );
  }
  return started;
} catch (err) {
WARNING: Exceptions leave the agent and bead stuck in an active state
dispatchAgent() now flips the bead to in_progress and the agent to working before the container call, but this catch path only logs and returns false. If startAgentInContainer() throws before it can return false (for example on a container fetch/DO error), the scheduler never rolls either row back, so that work can stay wedged indefinitely and never get redispatched.
Fixed. The catch block in dispatchAgent now rolls back the agent to idle and the bead to open so the scheduler can retry on the next alarm tick.
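The rollback shape can be sketched as follows; the function parameters are stand-ins for the real container call and SQL status updates, and the names are illustrative.

```typescript
// Statuses are flipped optimistically before the container call; any
// throw undoes both so the scheduler can redispatch on the next alarm tick.
async function dispatchWithRollback(
  startInContainer: () => Promise<boolean>,
  setAgentStatus: (s: 'working' | 'idle') => void,
  setBeadStatus: (s: 'in_progress' | 'open') => void,
): Promise<boolean> {
  setAgentStatus('working');
  setBeadStatus('in_progress');
  try {
    return await startInContainer();
  } catch (err) {
    console.error('dispatch failed, rolling back agent and bead', err);
    setAgentStatus('idle');
    setBeadStatus('open');
    return false;
  }
}
```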
await exec('git', ['checkout', defaultBranch], browseDir).catch(() => {});
await exec('git', ['pull', '--rebase', '--autostash'], browseDir).catch(() => {});
console.log(`Updated browse worktree for rig ${rigId} at ${browseDir}`);
return browseDir;
WARNING: Browse setup reports success after a failed refresh
Both the checkout and pull calls above are swallowed, but this path still returns browseDir and logs the worktree as updated. An auth failure or transient git error would therefore leave the mayor browsing stale code while /repos/setup looks successful. This should either recreate the browse worktree or surface the failure so setup can retry.
Fixed. The checkout/pull calls are now in a try/catch that logs a warning with the failure message instead of silently swallowing errors. The browseDir is still returned (it exists on disk) but the log makes it clear the code may be stale.
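A minimal sketch of that behavior, assuming the promise-based `exec` runner from the surrounding code; `refreshBrowseWorktree` is an illustrative name.

```typescript
type Exec = (cmd: string, args: string[], cwd: string) => Promise<void>;

// Refresh failures are logged with context instead of being swallowed
// per-call. The worktree path is still returned because the directory
// exists on disk, but the warning makes a stale result visible.
async function refreshBrowseWorktree(
  exec: Exec,
  defaultBranch: string,
  browseDir: string,
  warn: (msg: string) => void = console.warn,
): Promise<string> {
  try {
    await exec('git', ['checkout', defaultBranch], browseDir);
    await exec('git', ['pull', '--rebase', '--autostash'], browseDir);
  } catch (err) {
    warn(`browse worktree refresh failed, contents may be stale: ${String(err)}`);
  }
  return browseDir;
}
```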
Force-pushed d5afd5b to a1d1858
const triageBead = beadOps.getBead(this.sql, input.triage_request_bead_id);
if (!triageBead)
  throw new Error(`Triage request bead ${input.triage_request_bead_id} not found`);
if (!triageBead.labels.includes(patrol.TRIAGE_REQUEST_LABEL)) {
WARNING: Closed triage requests can be resolved again
resolveTriage() only validates the triage label here. Because the handler accepts an arbitrary triage_request_bead_id, a retried or stale call against an already-closed request will replay side effects like RESTART, CLOSE_BEAD, or REASSIGN_BEAD. Rejecting anything except status === 'open' would make this endpoint idempotent and prevent duplicate actions.
Fixed. resolveTriage now rejects requests where the triage bead status is not 'open', preventing duplicate side effects from retried or stale calls.
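The idempotency guard can be sketched as a small validation helper; `assertResolvable` and the reduced `TriageBead` shape are illustrative, not the PR's types.

```typescript
interface TriageBead {
  status: string;
  labels: string[];
}

// Anything but an open triage request is rejected, so retried or stale
// resolve calls cannot replay RESTART/CLOSE_BEAD/REASSIGN_BEAD side effects.
function assertResolvable(bead: TriageBead, triageLabel: string): void {
  if (!bead.labels.includes(triageLabel)) {
    throw new Error('bead is not a triage request');
  }
  if (bead.status !== 'open') {
    throw new Error('triage request already resolved');
  }
}
```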
.input(z.object({ rigId: z.string().uuid() }))
.mutation(async ({ ctx, input }) => {
  requireGastownAccess(ctx);
- await verifyRigOwnership(ctx.env, ctx.userId, input.rigId);
+ const rig = await verifyRigOwnership(ctx.env, ctx.userId, input.rigId);
  const userStub = getGastownUserStub(ctx.env, ctx.userId);
  await userStub.deleteRig(input.rigId);
WARNING: Rig deletion still succeeds on partial cleanup failure
This deletes the user-facing rig record before the Town DO cleanup is guaranteed. If removeRig() then throws, the mutation still returns success with the user row gone but the Town DO registry still holding the stale rig entry, so recreating the same name can keep hitting the same UNIQUE conflict this change is trying to avoid. This should either clean up the Town DO first or fail/retry instead of swallowing the error.
Fixed. The deletion order is now reversed: TownDO.removeRig runs first (freeing the name), then userStub.deleteRig. If the Town DO cleanup fails, the mutation throws and the user record stays intact so they can retry.
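The reversed ordering reduces to a sequencing sketch; the two function parameters are hypothetical stand-ins for the real `TownDO.removeRig` and `userStub.deleteRig` calls.

```typescript
// Town DO cleanup runs first, so a failure there aborts the mutation and
// leaves the user-facing rig record intact for a retry.
async function deleteRigSafely(
  removeFromTownDO: () => Promise<void>,
  deleteUserRecord: () => Promise<void>,
): Promise<void> {
  await removeFromTownDO(); // frees the rig name; throws on failure
  await deleteUserRecord(); // only reached after Town DO cleanup succeeded
}
```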
Force-pushed a1d1858 to 6c1581a
  unhookBead(sql, row.assignee_agent_bead_id);
}

logBeadEvent(sql, {
WARNING: Timeout path logs the same failure twice
updateBeadStatus() already writes a status_changed event for the transition to failed. Emitting another status_changed event here means each timeout contributes two failure records, so detectCrashLoops() can trip after only two timed-out beads instead of the intended three, and the new status feed will show duplicate failure transitions.
Fixed. Removed the redundant logBeadEvent call since updateBeadStatus already writes a status_changed event internally.
  }
  break;
}
case 'CLOSE_BEAD': {
WARNING: Destructive triage actions leave the old polecat running
Unlike RESTART, the CLOSE_BEAD and REASSIGN_BEAD branches below never call stopAgentInContainer() when the target agent is still working or stalled. That lets the old process keep running after the bead has been failed/reopened in SQL, so the scheduler can redispatch the same work while the original polecat is still pushing stale changes.
Fixed. Both CLOSE_BEAD and REASSIGN_BEAD now call stopAgentInContainer when the target agent is working or stalled, matching the RESTART path.
// submitting to the review queue. Detected by the gt:triage label
// on the hooked bead.
const hookedBead = getBead(sql, agent.current_hook_bead_id);
if (hookedBead?.labels.includes('gt:triage')) {
WARNING: Any user-created gt:triage bead now bypasses review
This fast-path keys entirely off labels.includes('gt:triage'), but handleCreateBead() accepts arbitrary labels from the API/UI. A normal issue accidentally or intentionally created with that reserved label will be silently closed on gt_done() instead of entering the review queue, which drops the agent's output on the floor.
Fixed. The triage fast-path now also checks created_by === 'patrol' so only system-created triage beads bypass review. User-created beads that happen to carry the gt:triage label go through the normal review flow.
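The tightened check is a two-part predicate; the sketch below uses an illustrative function name and a reduced bead shape, following the fix's `created_by === 'patrol'` criterion.

```typescript
interface Bead {
  labels: string[];
  created_by?: string;
}

// The label alone is not trusted: the bead must also be system-created
// ('patrol') before the gt_done fast-path may bypass the review queue.
function isSystemTriageBead(bead: Bead): boolean {
  return bead.labels.includes('gt:triage') && bead.created_by === 'patrol';
}
```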
Force-pushed 6c1581a to 5c2858c
const townId = statusMatch[1];
console.log(`[gastown-worker] WS upgrade (status): townId=${townId}`);
const stub = getTownDOStub(env, townId);
return stub.fetch(request);
WARNING: This status feed bypasses the town auth/ownership checks
WebSocket upgrades are handled here before the request reaches Hono, so neither townIdMiddleware nor the /api/towns/:townId/rigs/:rigId/* auth middleware runs. That means anyone who knows a townId can subscribe to /api/towns/:townId/status/ws and read recent events plus agent/bead state without going through the same authorization path as getAlarmStatus.
Acknowledged. This is the same pattern as the stream and PTY WebSocket endpoints (lines 466–486) which also bypass Hono middleware. All three rely on UUID-only knowledge and capability-token auth obtained via authenticated POSTs. Adding ticket-based auth for all WS paths is a holistic follow-up.
  reconnectTimerRef.current = setTimeout(connect, 3_000);
};

ws.onerror = () => {
WARNING: The status pane never falls back if WebSockets are unavailable
useAlarmStatusWs() only reconnects on close/error; it never switches to the existing getAlarmStatus query path. In environments where WS upgrades are blocked or unsupported, data stays null forever and the new Status tab never renders a snapshot.
Acknowledged. The WS-only pattern is consistent with the existing stream/PTY terminal connections. Adding a polling fallback via the existing getAlarmStatus tRPC query is a valid improvement but is outside the scope of this PR.
Force-pushed 5c2858c to ec2c8af
/* sql */ `
  SELECT ${beads.bead_id} FROM ${beads}
  WHERE ${beads.type} = 'issue'
  AND ${beads.labels} LIKE ?
WARNING: Reserved-label issues can block triage dispatch
handleCreateBead() accepts arbitrary labels, so any normal open issue tagged gt:triage will satisfy this duplicate guard. Once that happens, maybeDispatchTriageAgent() returns early forever and real gt:triage-request beads never get a worker. Filtering on a system-owned marker (for example created_by = 'patrol') would keep user-created labels from starving the triage queue.
Fixed. The duplicate guard query now also filters on created_by = 'patrol' so user-created beads with the gt:triage label can't starve the triage queue.
title: `Triage batch: ${pendingCount} request(s)`,
body: 'Process all pending triage request beads and resolve each one.',
priority: 'high',
labels: [patrol.TRIAGE_BATCH_LABEL],
WARNING: The gt_done safety net never matches this batch bead
reviewQueue.agentDone() now only short-circuits for gt:triage beads created by patrol, but this synthetic batch is inserted without created_by. If the model ever calls gt_done instead of gt_bead_close, the batch still goes into the review queue. Setting created_by: 'patrol' here keeps the new guard working for the system-created triage batch.
Fixed. The synthetic triage batch bead now sets created_by: 'patrol', matching the guard in agentDone() that checks created_by === 'patrol' for the gt:triage fast-path.
Force-pushed ec2c8af to 546d2d4
// check, any rig agent (polecat, refinery) could call the endpoint and
// trigger restart/close/escalate side effects on other agents.
const hookedBead = await town.getHookedBead(agentId);
if (!hookedBead || !hookedBead.labels.includes('gt:triage')) {
CRITICAL: Label-only gating still allows non-triage agents to resolve requests
gt:triage is not reserved to patrol-created batch beads here: handleCreateBead() accepts arbitrary labels, and agentDone() explicitly treats user-created gt:triage beads as a supported case. Any agent hooked to one of those beads can still call /triage/resolve and trigger restart/close side effects on other agents. This authorization check needs a system-owned marker as well (for example created_by === 'patrol').
Fixed. The handler now also checks hookedBead.created_by === 'patrol' in addition to the label check. User-created beads with the gt:triage label can no longer satisfy the authorization gate.
Force-pushed 84b1b21 to 133cff0
… improvements (#442)

Alarm-driven patrol system (witness & deacon):
- Tiered GUPP violation handling (30min warn, 1h escalate+triage, 2h force-stop)
- Orphaned work detection, stale hook recovery, agent GC, crash loop detection
- Per-bead timeout enforcement with agent container termination
- On-demand LLM triage agent for ambiguous situations
- Triage action validation, access control, and snapshot-based resolution
- Stranded convoy feeding with immediate dispatch eligibility

Mayor codebase browsing:
- Browse worktrees at /workspace/rigs/<rigId>/browse/ for read-only access
- POST /repos/setup container endpoint for proactive repo cloning
- System prompt written to AGENTS.md so mayor and sub-agents share context
- Git credential race fix: refreshGitCredentials runs before configureRig
- GIT_TERMINAL_PROMPT=0 to prevent credential prompt hangs

Agent dispatch improvements:
- startPoint parameter for convoy agents to branch from feature branch
- platformIntegrationId and KILOCODE_TOKEN plumbed through repo setup
- Existing users arm watchdog on DO init
- RESTART_WITH_BACKOFF uses dispatch cooldown delay

Rig deletion fix:
- tRPC deleteRig now calls TownDO.removeRig (was missing)
- addRig handles stale name conflicts via catch-and-retry

Real-time alarm status UI:
- Hibernatable WebSocket for live alarm status push
- Status tab in terminal bar with agent/bead/patrol cards

Other UI:
- Convoy title and branch use flex-based truncation instead of fixed max-width
- Status pane card padding normalized to p-2
- Legacy agent roles accepted in Zod schemas for backward compat
- PostHog feature flag integration for gastown access gating
Force-pushed 133cff0 to 8c92846
… improvements (#442) (#924)
#904)

* feat(gastown): replace binary is_admin gate with PostHog feature flags (#901)

  Replace the binary is_admin check from #537 with PostHog feature flags for progressive rollout. Flag management (allowlists, percentage rollout, kill-switch) is handled entirely through the PostHog dashboard — no custom DB tables or admin UI needed.

  Gate points updated:
  - 9 Next.js pages use isFeatureFlagEnabled('gastown-access', user.id)
  - Sidebar uses useFeatureFlagEnabled('gastown-access')
  - Token endpoint evaluates the flag and embeds gastownAccess in the JWT
  - Worker checks gastownAccess JWT claim (isAdmin fallback for compat)

  Sub-feature flag names defined: gastown-convoys, gastown-pr-merge, gastown-multi-rig (to be created in PostHog when needed). Closes #901

* fix: address PR review comments

  - Switch from isFeatureFlagEnabled to isReleaseToggleEnabled for strict boolean auth checks (prevents multivariate variants from granting access)
  - Remove dev-mode bypass — gate in dev too via PostHog
  - Abstract requireGastownAccess into gastownProcedure composable tRPC middleware in init.ts, replacing manual requireGastownAccess(ctx) calls
  - Remove sub-feature flags (convoys, pr_merge, multi_rig) — only gastown-access remains

* fix: use strict boolean check for client-side gastown flag

  Use useFeatureFlagVariantKey === true instead of useFeatureFlagEnabled to align the sidebar with the server-side isReleaseToggleEnabled check. This prevents multivariate string variants from showing the nav item when server-side access would be denied.

* fix: use isFeatureFlagEnabled with dev override for gastown-access

  Switch all gate points from isReleaseToggleEnabled back to isFeatureFlagEnabled. Add a DEV_ENABLED_FLAGS set to posthog-feature-flags.ts that returns true for gastown-access in non-production environments so local dev works without PostHog configuration. Sidebar reverts to useFeatureFlagEnabled.

* refactor: move dev override into isGastownEnabled, revert posthog-feature-flags.ts

  Move the dev-mode override out of the shared posthog-feature-flags.ts module and into isGastownEnabled in src/lib/gastown/feature-flags.ts. All pages and the token endpoint now call isGastownEnabled(user.id) which returns true in non-production and delegates to isFeatureFlagEnabled in production. The sidebar uses useFeatureFlagEnabled || isDevelopment for the same effect client-side.

* feat(gastown): witness/deacon patrol, mayor codebase browsing, and UI improvements (#442) (#924)

  Alarm-driven patrol system (witness & deacon): - Tiered GUPP violation handling (30min warn, 1h escalate+triage, 2h force-stop) - Orphaned work detection, stale hook recovery, agent GC, crash loop detection - Per-bead timeout enforcement with agent container termination - On-demand LLM triage agent for ambiguous situations - Triage action validation, access control, and snapshot-based resolution - Stranded convoy feeding with immediate dispatch eligibility Mayor codebase browsing: - Browse worktrees at /workspace/rigs/<rigId>/browse/ for read-only access - POST /repos/setup container endpoint for proactive repo cloning - System prompt written to AGENTS.md so mayor and sub-agents share context - Git credential race fix: refreshGitCredentials runs before configureRig - GIT_TERMINAL_PROMPT=0 to prevent credential prompt hangs Agent dispatch improvements: - startPoint parameter for convoy agents to branch from feature branch - platformIntegrationId and KILOCODE_TOKEN plumbed through repo setup - Existing users arm watchdog on DO init - RESTART_WITH_BACKOFF uses dispatch cooldown delay Rig deletion fix: - tRPC deleteRig now calls TownDO.removeRig (was missing) - addRig handles stale name conflicts via catch-and-retry Real-time alarm status UI: - Hibernatable WebSocket for live alarm status push - Status tab in terminal bar with agent/bead/patrol cards Other UI: - Convoy title and branch use flex-based truncation instead of fixed max-width - Status pane card padding normalized to p-2 - Legacy agent roles accepted in Zod schemas for backward compat - PostHog feature flag integration for gastown access gating

* fix(gastown): only stop agent on triage resolve if still hooked to snapshot bead

  CLOSE_BEAD and REASSIGN_BEAD now check that the agent's current hook matches the snapshot bead from the triage request before calling stopAgentInContainer. If the agent has moved on to different work, stopping it would abort unrelated sessions.

* style: fix prettier formatting in gastown type declarations

* fix: resolve lint errors (no-base-to-string, unused vars)

* fix(gastown): triage agent should call gt_done not gt_bead_close

  gt_bead_close only marks the bead closed without unhooking the agent or resetting it to idle, leaking agent records. gt_done triggers the agentDone path which has the patrol-created triage fast-path that properly closes the batch, unhooks, and returns the agent to idle.

* fix(gastown): refinery singleton, container eviction recovery, review queue safety

  - Treat refinery as per-rig singleton in getOrCreateAgent to prevent UNIQUE constraint on identity when a refinery already exists
  - Re-queue review entry (reset to open) when refinery is busy instead of leaving it stuck in in_progress
  - Return 'not_found' (not 'unknown') from checkAgentContainerStatus on 404, so witnessPatrol immediately resets and redispatches agents after container eviction instead of waiting for the 2-hour GUPP timeout

* fix(gastown): triage prompt, timestamp format, per-rig creds, browse refresh

  - Remove remaining gt_bead_close reference in triage prompt (line 72) that contradicted the gt_done instruction on line 49
  - Use strftime with ISO format in orphanedHooks SQL query to match the toISOString() format stored in last_activity_at
  - Resolve git credentials per-rig in mayor browse setup instead of sharing one credential set across all rigs
  - Browse worktree refresh uses fetch+reset instead of checkout to avoid wrong-branch errors (worktree is on synthetic browse branch)

* fix(gastown): escalations now create triage requests for automated follow-up

  Previously, gt_escalate created an escalation bead and optionally notified the mayor, but nothing automated acted on it. Escalation beads sat open with no assignee indefinitely. Now routeEscalation creates a triage request alongside the escalation bead, feeding the escalation into the patrol→triage→resolve loop. The triage agent can then RESTART, REASSIGN, CLOSE, or ESCALATE_TO_MAYOR with the full context of the original escalation. When a triage request linked to an escalation is resolved, the escalation bead is also closed automatically. Also adds 'escalation' to the TriageType union and enriches the ESCALATE_TO_MAYOR mayor message with agent and bead context.

* feat(gastown): store convoy_id and source_bead_id in escalation metadata

  When an agent escalates from within a convoy, the escalation bead and its triage request now carry convoy_id and source_bead_id in their metadata. This associates escalations with their convoy for display purposes and lays groundwork for Phase 4 convoy-aware triage handling.

* fix(gastown): escalation metadata path, polling fallback, refinery rollback, regen types

  - Fix escalation_bead_id lookup in resolveTriage to read from metadata.context (matching createTriageRequest's structure)
  - Add polling fallback to AlarmStatusPane via tRPC getAlarmStatus query when WebSocket fails, with 5s refetch interval
  - Reset refinery to idle when container start fails in processReviewQueue
  - Regenerate gastown type declarations to include getAlarmStatus
Summary
Replace persistent AI patrol loops (Witness + Deacon) with deterministic alarm-driven checks and short-lived on-demand LLM triage agents. This is the cloud equivalent of local Gastown's three-tier watchdog chain (Boot → Deacon → Witness), collapsed into: DO alarm (always fires) → mechanical checks → triage agent (when needed).
Key changes:
- witnessPatrol(): Tiered GUPP violation handling (30min warning → 1h escalation → 2h force-stop with triage request), orphaned work detection, agent GC with 24h retention for polecats/refinery (singletons never GC'd), per-bead timeout enforcement via metadata.timeout_ms
- deaconPatrol(): Stale hook nudging (resets last_activity_at to re-enter dispatch queue), stranded convoy auto-assignment (finds unassigned open convoy beads and hooks idle polecats), crash loop detection (3+ failures in 30min window → triage request)
- triage_request bead type with structured context (triage type, agent reference, available actions). Ambiguous situations produce triage beads instead of taking immediate action.
- maybeDispatchTriageAgent() spawns a short-lived LLM session only when triage requests are queued. The agent gets a focused system prompt listing all pending situations, processes each via gt_triage_resolve, and exits. No persistent LLM sessions.
- GastownUserDO alarm (5min interval) pings each town's TownDO.healthCheck() to verify the alarm is set and re-arms it if missing. Replaces Boot's external observer role.
- triage agent role added to the AgentRole enum across worker, container plugin, and DB schemas.

Verification
- tsgo --noEmit passes for cloudflare-gastown (worker)
- tsc --noEmit passes for cloudflare-gastown/container (plugin)
- vitest run — 96 tests passing, 9 pre-existing failures in client.test.ts (from Gastown Feature Flags & Progressive Rollout #901 branch URL pattern changes, unrelated to this PR)

Visual Changes
N/A
Reviewer Notes
- The new patrol module (src/dos/town/patrol.ts) is structured as pure functions consistent with the existing sub-module pattern (beads.ts, agents.ts, mail.ts). All functions take SqlStorage as the first argument and are stateless.
- The feedStrandedConvoys() function directly imports from ./agents — no circular dependency since patrol.ts never imports Town.do.ts.
- The triage agent hooks onto an issue bead (not a triage_request bead) since it needs a bead to hook onto for the standard agent lifecycle.
- The GastownUserDO watchdog only arms when a town is created. Existing users with towns won't have the watchdog until they create a new town — a migration or manual trigger may be needed for production.
- The triage_request CHECK constraint on the beads table requires a migration for existing TownDO instances (the initBeadTables DDL will handle this on next initialization since it uses CREATE TABLE IF NOT EXISTS, which won't update CHECK constraints). New towns get the correct schema automatically.

Closes #442