feat(gastown): replace binary is_admin gate with PostHog feature flags #904

Merged
jrf0110 merged 15 commits into main from 901-feature-flags
Mar 9, 2026

Conversation

Contributor

@jrf0110 jrf0110 commented Mar 6, 2026

Summary

Replace the binary is_admin gate from #537 with PostHog feature flags for progressive rollout. All flag management — allowlists, percentage rollout, kill-switch — is handled through the PostHog dashboard. No custom DB tables, admin UI, or flag evaluation infrastructure needed.

Gate points updated (all 4 layers):

  1. 9 Next.js server pages: isFeatureFlagEnabled('gastown-access', user.id) with dev-mode bypass
  2. Sidebar: useFeatureFlagEnabled('gastown-access') (client-side PostHog hook, same pattern as auto-triage-feature)
  3. Token endpoint (/api/gastown/token): evaluates the PostHog flag, embeds gastownAccess: true in the JWT, returns 403 if denied
  4. Gastown worker tRPC: requireGastownAccess() checks the gastownAccess JWT claim, falls back to isAdmin for backward compatibility
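
The token endpoint behavior in item 3 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `buildTokenResponse`, `TokenResult`, and the `sub` field are hypothetical names, JWT signing is omitted, and the flag result is passed in as a plain boolean. Only the `gastownAccess` claim and the 403 response come from the description above.

```typescript
// Hypothetical result shape for the token endpoint (not taken from the PR).
type TokenResult =
  | { status: 200; claims: { sub: string; gastownAccess: true } }
  | { status: 403; error: string };

// Decide what the endpoint returns once the 'gastown-access' flag has been evaluated.
// flagEnabled stands in for the PostHog evaluation result.
function buildTokenResponse(userId: string, flagEnabled: boolean): TokenResult {
  if (!flagEnabled) {
    // Denied users never receive a token.
    return { status: 403, error: 'Gastown access denied' };
  }
  // Allowed users get the gastownAccess claim embedded in the payload.
  return { status: 200, claims: { sub: userId, gastownAccess: true } };
}
```

Embedding the decision in the token means the worker only has to inspect a claim and does not need its own PostHog call.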

Sub-feature flag names defined (in src/lib/gastown/feature-flags.ts):

  • gastown-access — top-level gate
  • gastown-convoys — convoy creation
  • gastown-pr-merge — PR merge strategy
  • gastown-multi-rig — second rig per town

These follow the existing PostHog flag pattern used by auto-triage-feature, code-review-cloud-agent-next, etc.

Backward compatibility: When the gastown-access flag doesn't exist in PostHog (or PostHog is unavailable in non-production), the dev-mode bypass allows local development. In production, isFeatureFlagEnabled returns false when the flag is missing, so access is denied by default (same as the old is_admin gate for non-admins). The worker accepts both gastownAccess and isAdmin JWT claims, so old tokens continue to work.
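
A minimal sketch of the dual-claim check described above, assuming a decoded JWT payload with two optional boolean claims. `GastownClaims` and `hasGastownAccess` are hypothetical names used for illustration; the PR's actual function is `requireGastownAccess()`, whose body is not shown here.

```typescript
// Hypothetical shape of the decoded JWT payload; only the two claim names come from the PR.
interface GastownClaims {
  gastownAccess?: boolean;
  isAdmin?: boolean;
}

// Returns true when the token should pass the worker gate.
function hasGastownAccess(claims: GastownClaims): boolean {
  // New tokens carry gastownAccess; tokens minted before this change only carry isAdmin.
  return claims.gastownAccess === true || claims.isAdmin === true;
}
```

With this shape, a token that carries neither claim is denied, which matches the deny-by-default behavior described for production.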

Closes #901

Verification

  • pnpm typecheck — all 29 workspace packages pass
  • Verified ENABLE_GASTOWN_FEATURE has zero remaining references after removal
  • Verified requireAdmin has zero remaining references in gastown worker after rename

Visual Changes

N/A

Reviewer Notes

  • The ENABLE_GASTOWN_FEATURE compile-time constant has been removed — it had no consumers after this change. The PostHog kill-switch is strictly better (no redeploy needed).
  • The cohorts column on kilocode_users was investigated and found to be completely dead code (never populated, read, or used). It is not relevant to this feature.
  • The gastown-access flag must be created in PostHog before this can be deployed. Initially configure it to target users where is_admin is true to preserve the current behavior, then progressively widen access.
  • Dev-mode bypass (IS_DEVELOPMENT) follows the same pattern as auto-fix/page.tsx to ensure local development works without PostHog.

Contributor

kilo-code-bot bot commented Mar 6, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity     Count
CRITICAL     0
WARNING      1
SUGGESTION   0

Issue Details

No new inline-commentable issues in the current diff.

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

File: cloudflare-gastown/src/dos/town/review-queue.ts
Line: 430
Issue: closeOrphanedReviewBeads() only looks for MR beads with status = 'open', but PR-backed reviews are moved to in_progress by markReviewInReview() and polled from that state. As written, the orphan cleanup path never matches the stuck PR-review beads it is meant to recover, so those reviews can remain blocked indefinitely after their agent/container disappears.
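
One way to read the observation above is that the cleanup filter should match both states. The sketch below is an assumption about what such a filter could look like, not code from the repository: `ReviewBead`, `agentAlive`, and `findOrphanedReviewBeads` are hypothetical, and the real function runs SQL against the Durable Object's storage.

```typescript
// Hypothetical in-memory model of a review bead; 'closed' is an assumed extra status.
type ReviewStatus = 'open' | 'in_progress' | 'closed';

interface ReviewBead {
  id: string;
  status: ReviewStatus;
  agentAlive: boolean; // assumed flag: whether the owning agent/container still exists
}

// Select review beads that are still pending but whose agent is gone.
function findOrphanedReviewBeads(beads: ReviewBead[]): ReviewBead[] {
  return beads.filter(
    (b) => (b.status === 'open' || b.status === 'in_progress') && !b.agentAlive,
  );
}
```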

Files Reviewed (50 files)
  • cloudflare-gastown/container/plugin/client.ts
  • cloudflare-gastown/container/plugin/tools.ts
  • cloudflare-gastown/container/plugin/types.ts
  • cloudflare-gastown/container/src/agent-runner.ts
  • cloudflare-gastown/container/src/control-server.ts
  • cloudflare-gastown/container/src/git-manager.ts
  • cloudflare-gastown/container/src/types.ts
  • cloudflare-gastown/src/db/tables/agent-metadata.table.ts
  • cloudflare-gastown/src/db/tables/beads.table.ts
  • cloudflare-gastown/src/db/tables/rig-agents.table.ts
  • cloudflare-gastown/src/dos/GastownUser.do.ts
  • cloudflare-gastown/src/dos/Town.do.ts
  • cloudflare-gastown/src/dos/town/agents.ts
  • cloudflare-gastown/src/dos/town/container-dispatch.ts
  • cloudflare-gastown/src/dos/town/patrol.ts
  • cloudflare-gastown/src/dos/town/review-queue.ts
  • cloudflare-gastown/src/dos/town/rigs.ts
  • cloudflare-gastown/src/gastown.worker.ts
  • cloudflare-gastown/src/handlers/rig-triage.handler.ts
  • cloudflare-gastown/src/middleware/auth.middleware.ts
  • cloudflare-gastown/src/middleware/kilo-auth.middleware.ts
  • cloudflare-gastown/src/prompts/mayor-system.prompt.ts
  • cloudflare-gastown/src/prompts/refinery-system.prompt.ts
  • cloudflare-gastown/src/prompts/triage-system.prompt.ts
  • cloudflare-gastown/src/trpc/init.ts
  • cloudflare-gastown/src/trpc/router.ts
  • cloudflare-gastown/src/trpc/schemas.ts
  • cloudflare-gastown/src/types.ts
  • cloudflare-gastown/src/ui/dashboard.ui.ts
  • packages/worker-utils/src/kilo-token.ts
  • src/app/(app)/components/PersonalAppSidebar.tsx
  • src/app/(app)/gastown/[townId]/agents/page.tsx
  • src/app/(app)/gastown/[townId]/beads/page.tsx
  • src/app/(app)/gastown/[townId]/mail/page.tsx
  • src/app/(app)/gastown/[townId]/merges/page.tsx
  • src/app/(app)/gastown/[townId]/observability/page.tsx
  • src/app/(app)/gastown/[townId]/page.tsx
  • src/app/(app)/gastown/[townId]/rigs/[rigId]/page.tsx
  • src/app/(app)/gastown/[townId]/settings/page.tsx
  • src/app/(app)/gastown/page.tsx
  • src/app/api/gastown/token/route.ts
  • src/components/gastown/ConvoyTimeline.tsx
  • src/components/gastown/TerminalBar.tsx
  • src/components/gastown/TerminalBarContext.tsx
  • src/components/gastown/useXtermPty.ts
  • src/lib/constants.ts
  • src/lib/gastown/feature-flags.ts
  • src/lib/gastown/types/router.d.ts
  • src/lib/gastown/types/schemas.d.ts
  • src/lib/tokens.ts

@jrf0110 jrf0110 force-pushed the 901-feature-flags branch from feea622 to ebe2373 Compare March 7, 2026 00:35
@jrf0110 jrf0110 changed the title from "feat(gastown): replace binary is_admin gate with progressive feature flag rollout" to "feat(gastown): replace binary is_admin gate with PostHog feature flags" Mar 7, 2026
@jrf0110 jrf0110 force-pushed the 901-feature-flags branch from b6584cc to bdb3a53 Compare March 8, 2026 15:42
@jrf0110 jrf0110 force-pushed the 901-feature-flags branch from 0a435ce to e511ea6 Compare March 9, 2026 19:04
agentBeadId: input.source_agent_id ?? null,
title: `Escalation (${input.severity}): ${input.message.slice(0, 80)}`,
context: {
escalation_bead_id: beadId,
Contributor

WARNING: Linked escalation ID is stored under a different metadata path than the resolver reads

createTriageRequest() wraps this object under metadata.context, but resolveTriage() later checks triageBead.metadata?.escalation_bead_id before deciding whether to close the original escalation bead. Because this value is nested here, every resolved escalation triage request leaves its escalation bead open indefinitely.

Contributor Author

Fixed. The lookup now reads from triageBead.metadata.context.escalation_bead_id (matching the TriageRequestMetadata structure where context is nested) instead of triageBead.metadata.escalation_bead_id.
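
The path mismatch and its fix can be illustrated with a small sketch. The `TriageBead` interface and `linkedEscalationId` helper are hypothetical; only the two metadata paths come from the discussion above.

```typescript
// Hypothetical bead shape showing both the old and the corrected lookup location.
interface TriageBead {
  metadata?: {
    escalation_bead_id?: string; // path the resolver used to read (never populated)
    context?: { escalation_bead_id?: string }; // path createTriageRequest writes to
  };
}

// Read the linked escalation bead id from the nested context object.
function linkedEscalationId(bead: TriageBead): string | null {
  return bead.metadata?.context?.escalation_bead_id ?? null;
}
```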

);
}

if (error && !data) {
Contributor

WARNING: WebSocket failures never reach the documented polling fallback

If the status socket fails before the first message, this branch renders only the error and the pane never calls the authenticated getAlarmStatus query. That leaves the entire Status tab unusable on browsers or networks where WebSockets are blocked, even though the backend added a polling endpoint for the same data.

Contributor Author

Fixed. AlarmStatusPane now uses the tRPC getAlarmStatus query as a polling fallback (5s interval) when the WebSocket fails. The connection indicator shows 'Live' (green) for WS, 'Polling' (blue) for fallback, or 'Reconnecting...' (yellow) when neither is working. Also regenerated the type declarations to include getAlarmStatus.
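
The three indicator states described in the reply can be sketched as a pure function. This is an illustrative reduction, assuming two boolean inputs; `connectionMode` and `POLL_INTERVAL_MS` are hypothetical names, and the real component presumably derives these values from its WebSocket and tRPC query state.

```typescript
type ConnectionMode = 'Live' | 'Polling' | 'Reconnecting...';

// 5s refetch interval, as stated in the reply above.
const POLL_INTERVAL_MS = 5000;

// Pick the indicator label: WebSocket wins, polling is the fallback.
function connectionMode(wsConnected: boolean, pollingHasData: boolean): ConnectionMode {
  if (wsConnected) return 'Live';
  if (pollingHasData) return 'Polling';
  return 'Reconnecting...';
}
```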

@@ -2009,6 +2665,10 @@ export class TownDO extends DurableObject<Env> {
// polecat assignee preserved.
agents.hookBead(this.sql, refineryAgent.id, entry.id);

// Mark as working before the async container start (same I/O gate
// rationale as dispatchAgent — see comment there).
agents.updateAgentStatus(this.sql, refineryAgent.id, 'working');
Contributor

WARNING: Failed refinery launches now leave the per-rig refinery stuck as working

This new pre-start status flip needs a matching rollback in the !started path below. Right now a single container-start failure unhooks the bead and fails the review, but the refinery row stays working, so every later review for that rig hits the status !== 'idle' guard above and gets re-queued forever.

Contributor Author

Fixed. The !started path in processReviewQueue now calls updateAgentStatus(refineryAgent.id, 'idle') alongside the existing unhook and review failure, so the per-rig refinery singleton is available for future reviews.
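
The status flip and its rollback can be sketched as below. This is a simplified, synchronous stand-in for the real async flow: `Agent`, `dispatchRefinery`, and the `startContainer` callback are hypothetical, and the unhook and review-failure steps are omitted.

```typescript
type AgentStatus = 'idle' | 'working';

interface Agent {
  id: string;
  status: AgentStatus;
}

// Mark the refinery as working before the start attempt, and roll back on failure
// so later reviews for the same rig are not re-queued forever.
function dispatchRefinery(agent: Agent, startContainer: () => boolean): boolean {
  agent.status = 'working';
  const started = startContainer();
  if (!started) {
    agent.status = 'idle';
  }
  return started;
}
```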

@jrf0110 jrf0110 enabled auto-merge (squash) March 9, 2026 21:22
@jrf0110 jrf0110 disabled auto-merge March 9, 2026 21:25
jrf0110 and others added 7 commits March 9, 2026 16:26
#901)

Replace the binary is_admin check from #537 with PostHog feature flags
for progressive rollout. Flag management (allowlists, percentage rollout,
kill-switch) is handled entirely through the PostHog dashboard — no
custom DB tables or admin UI needed.

Gate points updated:
- 9 Next.js pages use isFeatureFlagEnabled('gastown-access', user.id)
- Sidebar uses useFeatureFlagEnabled('gastown-access')
- Token endpoint evaluates the flag and embeds gastownAccess in the JWT
- Worker checks gastownAccess JWT claim (isAdmin fallback for compat)

Sub-feature flag names defined: gastown-convoys, gastown-pr-merge,
gastown-multi-rig (to be created in PostHog when needed).

Closes #901

- Switch from isFeatureFlagEnabled to isReleaseToggleEnabled for strict
  boolean auth checks (prevents multivariate variants from granting access)
- Remove dev-mode bypass — gate in dev too via PostHog
- Abstract requireGastownAccess into gastownProcedure composable tRPC
  middleware in init.ts, replacing manual requireGastownAccess(ctx) calls
- Remove sub-feature flags (convoys, pr_merge, multi_rig) — only
  gastown-access remains

Use useFeatureFlagVariantKey === true instead of useFeatureFlagEnabled
to align the sidebar with the server-side isReleaseToggleEnabled check.
This prevents multivariate string variants from showing the nav item
when server-side access would be denied.
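
The difference between the loose and strict checks in the two commits above can be sketched as follows, assuming a flag value that may be a boolean, a variant key string, or undefined. `FlagValue`, `isEnabledLoose`, and `isEnabledStrict` are hypothetical names, not PostHog or repository APIs.

```typescript
// Assumed shape of a raw flag value: boolean flag, multivariate variant key, or not loaded.
type FlagValue = boolean | string | undefined;

// Loose check: any truthy value, including a variant key such as 'control', counts as enabled.
function isEnabledLoose(value: FlagValue): boolean {
  return Boolean(value);
}

// Strict check: only a literal boolean true grants access.
function isEnabledStrict(value: FlagValue): boolean {
  return value === true;
}
```

Under the strict check, reconfiguring the flag as multivariate cannot grant access by accident.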
Switch all gate points from isReleaseToggleEnabled back to
isFeatureFlagEnabled. Add a DEV_ENABLED_FLAGS set to
posthog-feature-flags.ts that returns true for gastown-access in
non-production environments so local dev works without PostHog
configuration. Sidebar reverts to useFeatureFlagEnabled.

…ture-flags.ts

Move the dev-mode override out of the shared posthog-feature-flags.ts
module and into isGastownEnabled in src/lib/gastown/feature-flags.ts.
All pages and the token endpoint now call isGastownEnabled(user.id)
which returns true in non-production and delegates to
isFeatureFlagEnabled in production. The sidebar uses
useFeatureFlagEnabled || isDevelopment for the same effect client-side.
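
A minimal sketch of the gating logic this commit describes, under stated assumptions: the flag lookup is injected and synchronous here for brevity, and `isGastownEnabledSketch`, `FlagCheck`, and `sidebarShowsGastown` are hypothetical names. The real isGastownEnabled takes only user.id and is not reproduced here.

```typescript
// Hypothetical synchronous flag lookup standing in for isFeatureFlagEnabled.
type FlagCheck = (flag: string, userId: string) => boolean;

// Server side: always on outside production, delegated to the flag in production.
function isGastownEnabledSketch(
  userId: string,
  isProduction: boolean,
  isFeatureFlagEnabled: FlagCheck,
): boolean {
  if (!isProduction) return true;
  return isFeatureFlagEnabled('gastown-access', userId);
}

// Client side: the sidebar ORs the hook result with the development check.
function sidebarShowsGastown(flagEnabled: boolean, isDevelopment: boolean): boolean {
  return flagEnabled || isDevelopment;
}
```

Keeping the override in the Gastown module, as the commit does, means other features that share the PostHog helper are not affected by it.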
… improvements (#442) (#924)

Alarm-driven patrol system (witness & deacon):
- Tiered GUPP violation handling (30min warn, 1h escalate+triage, 2h force-stop)
- Orphaned work detection, stale hook recovery, agent GC, crash loop detection
- Per-bead timeout enforcement with agent container termination
- On-demand LLM triage agent for ambiguous situations
- Triage action validation, access control, and snapshot-based resolution
- Stranded convoy feeding with immediate dispatch eligibility
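
The tiered thresholds listed above (30 minutes, 1 hour, 2 hours) could be expressed as a simple lookup. This sketch assumes the tiers are keyed on elapsed time since some reference point, such as last activity, and that each threshold is inclusive; `guppActionFor` and `GuppAction` are hypothetical names.

```typescript
type GuppAction = 'none' | 'warn' | 'escalate_and_triage' | 'force_stop';

const MINUTE_MS = 60 * 1000;

// Map elapsed time to the tier named in the commit message.
function guppActionFor(elapsedMs: number): GuppAction {
  if (elapsedMs >= 120 * MINUTE_MS) return 'force_stop';
  if (elapsedMs >= 60 * MINUTE_MS) return 'escalate_and_triage';
  if (elapsedMs >= 30 * MINUTE_MS) return 'warn';
  return 'none';
}
```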

Mayor codebase browsing:
- Browse worktrees at /workspace/rigs/<rigId>/browse/ for read-only access
- POST /repos/setup container endpoint for proactive repo cloning
- System prompt written to AGENTS.md so mayor and sub-agents share context
- Git credential race fix: refreshGitCredentials runs before configureRig
- GIT_TERMINAL_PROMPT=0 to prevent credential prompt hangs

Agent dispatch improvements:
- startPoint parameter for convoy agents to branch from feature branch
- platformIntegrationId and KILOCODE_TOKEN plumbed through repo setup
- Existing users arm watchdog on DO init
- RESTART_WITH_BACKOFF uses dispatch cooldown delay

Rig deletion fix:
- tRPC deleteRig now calls TownDO.removeRig (was missing)
- addRig handles stale name conflicts via catch-and-retry

Real-time alarm status UI:
- Hibernatable WebSocket for live alarm status push
- Status tab in terminal bar with agent/bead/patrol cards

Other UI:
- Convoy title and branch use flex-based truncation instead of fixed max-width
- Status pane card padding normalized to p-2
- Legacy agent roles accepted in Zod schemas for backward compat
- PostHog feature flag integration for gastown access gating

…apshot bead

CLOSE_BEAD and REASSIGN_BEAD now check that the agent's current hook
matches the snapshot bead from the triage request before calling
stopAgentInContainer. If the agent has moved on to different work,
stopping it would abort unrelated sessions.
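
The guard this commit describes amounts to comparing the agent's current hook with the bead captured when the triage request was created. A minimal sketch, with hypothetical `AgentRecord` and `shouldStopAgent` names:

```typescript
interface AgentRecord {
  id: string;
  currentHookBeadId: string | null; // bead the agent is working on right now
}

// Only stop the agent if it is still hooked to the bead the triage request was about.
function shouldStopAgent(agent: AgentRecord, snapshotBeadId: string): boolean {
  return agent.currentHookBeadId === snapshotBeadId;
}
```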
jrf0110 added 8 commits March 9, 2026 16:26

gt_bead_close only marks the bead closed without unhooking the agent
or resetting it to idle, leaking agent records. gt_done triggers the
agentDone path which has the patrol-created triage fast-path that
properly closes the batch, unhooks, and returns the agent to idle.

… queue safety

- Treat refinery as per-rig singleton in getOrCreateAgent to prevent
  UNIQUE constraint on identity when a refinery already exists
- Re-queue review entry (reset to open) when refinery is busy instead
  of leaving it stuck in in_progress
- Return 'not_found' (not 'unknown') from checkAgentContainerStatus on
  404, so witnessPatrol immediately resets and redispatches agents after
  container eviction instead of waiting for the 2-hour GUPP timeout
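
The 404 handling in the last bullet can be sketched as a status mapping. This is an assumption about the shape of the logic, not the body of checkAgentContainerStatus: `statusFromHttp` is hypothetical, and 'running' is an assumed name for the healthy state.

```typescript
type ContainerStatus = 'running' | 'not_found' | 'unknown';

// Distinguish "container is definitely gone" (404) from "could not tell" (other errors).
function statusFromHttp(httpStatus: number): ContainerStatus {
  if (httpStatus === 404) return 'not_found';
  if (httpStatus >= 200 && httpStatus < 300) return 'running';
  return 'unknown';
}
```

Returning a definite 'not_found' lets the patrol act immediately, while 'unknown' remains reserved for transient failures where waiting is the safer choice.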
…refresh

- Remove remaining gt_bead_close reference in triage prompt (line 72)
  that contradicted the gt_done instruction on line 49
- Use strftime with ISO format in orphanedHooks SQL query to match
  the toISOString() format stored in last_activity_at
- Resolve git credentials per-rig in mayor browse setup instead of
  sharing one credential set across all rigs
- Browse worktree refresh uses fetch+reset instead of checkout to
  avoid wrong-branch errors (worktree is on synthetic browse branch)
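
The timestamp format bullet matters because the comparison is a string comparison. The sketch below shows why a space-separated SQLite timestamp compares incorrectly against a value written with toISOString(); the sample timestamps and the `isAfter` helper are made up for illustration.

```typescript
// Plain lexicographic comparison, which is what comparing TEXT columns amounts to.
function isAfter(a: string, b: string): boolean {
  return a > b;
}

// Value as stored by toISOString(): 11:00 UTC.
const storedActivity = new Date(Date.UTC(2026, 2, 9, 11, 0, 0)).toISOString();

// 12:00 UTC in SQLite's default datetime() layout (space separator, no millis, no Z).
const sqliteDefaultFormat = '2026-03-09 12:00:00';

// 12:00 UTC in the ISO layout produced by strftime('%Y-%m-%dT%H:%M:%fZ', ...).
const sqliteIsoFormat = '2026-03-09T12:00:00.000Z';

// The space sorts before 'T', so the later time looks earlier in the default layout.
const defaultComparesCorrectly = isAfter(sqliteDefaultFormat, storedActivity);
const isoComparesCorrectly = isAfter(sqliteIsoFormat, storedActivity);
```

Here `defaultComparesCorrectly` is false even though 12:00 is later than 11:00, while `isoComparesCorrectly` is true.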
…llow-up

Previously, gt_escalate created an escalation bead and optionally
notified the mayor, but nothing automated acted on it. Escalation
beads sat open with no assignee indefinitely.

Now routeEscalation creates a triage request alongside the escalation
bead, feeding the escalation into the patrol→triage→resolve loop.
The triage agent can then RESTART, REASSIGN, CLOSE, or ESCALATE_TO_MAYOR
with the full context of the original escalation.

When a triage request linked to an escalation is resolved, the
escalation bead is also closed automatically.

Also adds 'escalation' to the TriageType union and enriches the
ESCALATE_TO_MAYOR mayor message with agent and bead context.

When an agent escalates from within a convoy, the escalation bead and
its triage request now carry convoy_id and source_bead_id in their
metadata. This associates escalations with their convoy for display
purposes and lays groundwork for Phase 4 convoy-aware triage handling.

…llback, regen types

- Fix escalation_bead_id lookup in resolveTriage to read from
  metadata.context (matching createTriageRequest's structure)
- Add polling fallback to AlarmStatusPane via tRPC getAlarmStatus
  query when WebSocket fails, with 5s refetch interval
- Reset refinery to idle when container start fails in processReviewQueue
- Regenerate gastown type declarations to include getAlarmStatus

@jrf0110 jrf0110 force-pushed the 901-feature-flags branch from 572a193 to fbe0bc3 Compare March 9, 2026 21:26
@jrf0110 jrf0110 merged commit 49aa8cc into main Mar 9, 2026
18 checks passed
@jrf0110 jrf0110 deleted the 901-feature-flags branch March 9, 2026 21:34

Development

Successfully merging this pull request may close these issues.

Gastown Feature Flags & Progressive Rollout

2 participants