Gastown User-Scoped Admin Panel — Inspect & Intervene via GastownUserDO → TownDO #897
Description
Parent
Part of #204 (Phase 4: Hardening)
Problem
When a Gastown user's town breaks — agents stall, beads get stuck, containers die, merges fail, credentials expire — there is currently no way for a Kilo admin to diagnose or fix it without SSH-level access to Cloudflare infrastructure. All state is locked inside Durable Object SQLite databases and KV storage, errors are scattered across console.error logs with no correlation, and intervention requires writing ad-hoc scripts against internal APIs.
This issue covers a user-scoped admin panel: given a specific user, look up their GastownUserDO, see their towns, and drill into any TownDO to inspect and intervene. Fleet-wide views (all towns across all users) are out of scope here and will be addressed alongside #228 (Observability) once a secondary index exists.
User Failure Scenarios & Required Admin Capabilities
1. "My agent isn't doing anything" — Stuck Agent
Root causes: Container died mid-session, git clone failed (bad credentials, private repo), dispatch budget exhausted (5 attempts), polecat name pool exhausted, GUPP timeout, agent process crashed inside container.
Admin needs:
- See agent status timeline: idle → working → (stalled? dead?)
- See dispatch attempt history with error messages (currently only in console.error)
- See the agent's last SDK events (from AgentDO): did it receive its prompt? did it call any tools? did the session error out?
- See container health: is the container running? when did it last restart? what's its process list?
- Intervention: Force-reset agent to idle, force-unhook from bead, force-kill agent process in container, retry dispatch
2. "My bead has been open forever" — Stuck Bead
Root causes: No idle agent available, all polecats stalled, dispatch keeps failing, bead assigned to dead agent, convoy dependency blocking it, review queue backed up.
Admin needs:
- See bead state history (bead_events timeline)
- See which agent is hooked (if any) and that agent's status
- See dependency graph: is this bead blocked by another? part of a convoy? waiting for review?
- See dispatch attempts against this bead specifically
- Intervention: Force-close bead, force-fail bead, reassign to different agent, manually unhook current agent, change bead priority
3. "Code won't merge" — Refinery Failure
Root causes: Merge conflict, git push credentials expired, refinery agent hallucinated bad PR URL, PR-strategy entry stuck waiting for webhook, review queue single-item bottleneck.
Admin needs:
- See review queue depth and processing rate
- See per-MR-bead status: who's reviewing, how long in in_progress, branch names, PR URL
- See refinery agent's reasoning (conversation/tool calls)
- See git operation logs: what merge commands ran, what failed, stderr output
- Intervention: Force-retry a failed review, force-close an MR bead, skip review and direct-merge, clear the review queue, change merge strategy on the fly
4. "Container keeps dying" — Container Instability
Root causes: OOM from too many concurrent agents, Cloudflare container platform issues, sleep/wake cycle problems, environment variable misconfiguration.
Admin needs:
- Container lifecycle timeline: start, stop (with exit code + reason), error events
- Container resource usage (memory, CPU if available from CF)
- Current process list inside container
- Environment variable inspection (redacted secrets)
- Sleep/wake history
- Intervention: Force-restart container, force-stop container, update environment variables, change sleep timeout
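The "environment variable inspection (redacted secrets)" requirement above could be met with a small filter applied before values leave the container. This is a minimal sketch, not the project's implementation: the function name `redactEnv`, the `[REDACTED]` placeholder, and the name-matching heuristic are all assumptions made for illustration.

```typescript
// Assumed heuristic: any variable whose NAME looks secret-bearing is masked.
// A real implementation might instead use an explicit allowlist of safe names.
const SECRET_NAME_PATTERN = /(token|secret|password|key|credential)/i;

// Returns a copy of the environment that is safe to show in an admin UI.
function redactEnv(env: Record<string, string>): Record<string, string> {
  const redacted: Record<string, string> = {};
  for (const name of Object.keys(env)) {
    redacted[name] = SECRET_NAME_PATTERN.test(name) ? "[REDACTED]" : env[name];
  }
  return redacted;
}

// Usage: only the variable names and non-secret values survive.
const shown = redactEnv({ GIT_TOKEN: "ghs_abc123", NODE_ENV: "production" });
```

Matching on names rather than values is deliberately conservative: it can over-redact (for example a variable called `KEYBOARD_LAYOUT`), which is usually the safer failure mode for an admin view.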
5. "GitHub says permission denied" — Credential Issues
Root causes: GitHub App token expired, integration revoked, wrong integration linked to rig, credential store in container has stale token, dynamic credential resolution endpoint returning errors.
Admin needs:
- See current credential state: which integration is linked, when was the token last resolved, is the token valid?
- See git credential verification results (currently only a console.warn)
- See all git push/pull failures across agents in the town
- Intervention: Force-refresh credentials from integration, manually set a git token, re-link integration
6. "My convoy never finished" — Convoy Progress Stall
Root causes: A tracked bead was deleted instead of closed (counter never catches up), a tracked bead's merge failed, intermediate merge to feature branch conflicted, landing MR failed.
Admin needs:
- Convoy bead graph: all tracked beads with their statuses
- Progress counters: closed_beads / total_beads, with specific identification of which beads are not yet closed
- Landing MR status
- Intervention: Force-land convoy, add/remove tracked beads, force-close stuck child beads
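One way to surface the "deleted instead of closed" case is to recompute progress from the tracked bead IDs instead of trusting a stored counter. The sketch below assumes the admin view can obtain the list of tracked IDs and a status lookup for beads that still exist; the names `convoyProgress` and `ConvoyProgress` are hypothetical, and only the `closed` and `in_progress` status strings are taken from this issue.

```typescript
interface ConvoyProgress {
  closed: number;       // tracked beads whose status is "closed"
  total: number;        // all tracked beads, including missing ones
  notClosed: string[];  // beads that exist but are not closed yet
  missing: string[];    // tracked IDs with no bead row (e.g. deleted)
}

// Recomputes convoy progress from first principles.
function convoyProgress(
  trackedIds: string[],
  statusById: Map<string, string>,
): ConvoyProgress {
  const notClosed: string[] = [];
  const missing: string[] = [];
  let closed = 0;
  for (const id of trackedIds) {
    const status = statusById.get(id);
    if (status === undefined) {
      missing.push(id); // the counter can never catch up for these
    } else if (status === "closed") {
      closed += 1;
    } else {
      notClosed.push(id);
    }
  }
  return { closed, total: trackedIds.length, notClosed, missing };
}
```

A non-empty `missing` list would point an admin directly at the beads to remove from the convoy, while `notClosed` lists the ones that still need work or a force-close.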
7. "The Mayor isn't responding" — Mayor Dysfunction
Root causes: Mayor agent crashed, container sleeping, Mayor stuck in a tool call loop, Mayor's system prompt is stale/wrong, Mayor's session errored.
Admin needs:
- Mayor agent status and last activity timestamp
- Mayor's current session state (active? idle? errored?)
- Mayor's recent conversation (messages + tool calls)
- Mayor's system prompt (rendered, with all dynamic context)
- Intervention: Force-restart Mayor, clear Mayor conversation, resend last user message
8. "I'm being charged but nothing is happening" — Silent Resource Consumption
Root causes: Zombie agents consuming LLM tokens, agents in retry loops, refinery re-reviewing the same MR, container running with no active work.
Admin needs:
- LLM token usage per agent per bead (requires #228 (Observability) metrics)
- Active agent count vs. actual work being done
- Container uptime with no dispatched agents
- Review queue churn: how many times has the same MR been retried?
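The review queue churn question above reduces to counting review attempts per MR bead. The following is an illustrative sketch only: it assumes each review attempt is available as a record with an MR bead ID, which this issue does not specify, and the names `ReviewAttempt`, `countReviewAttempts`, and `churningMrs` are hypothetical.

```typescript
interface ReviewAttempt {
  mrBeadId: string; // the MR bead being reviewed
  at: number;       // epoch milliseconds (assumed)
}

// Counts how many times each MR bead has been picked up for review.
function countReviewAttempts(attempts: ReviewAttempt[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const a of attempts) {
    counts.set(a.mrBeadId, (counts.get(a.mrBeadId) ?? 0) + 1);
  }
  return counts;
}

// Returns the MR beads reviewed at least `minAttempts` times.
function churningMrs(attempts: ReviewAttempt[], minAttempts: number): string[] {
  const result: string[] = [];
  countReviewAttempts(attempts).forEach((count, mrBeadId) => {
    if (count >= minAttempts) result.push(mrBeadId);
  });
  return result;
}
```

The threshold is left as a parameter because the issue does not say how many retries should count as churn.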
Data Access Model — No Central Database
Gastown has no central Postgres table of towns or agents. All state is distributed across Durable Objects:
- GastownUserDO (keyed by userId): owns the user_towns and user_rigs tables. This is the only way to discover which towns a user has.
- TownDO (keyed by townId): owns all beads, agents, rigs, events, config. This is where all the operational data lives.
- AgentDO (keyed by agentId): owns high-volume SDK event streams per agent.
This means the admin panel cannot start from a global town list — there is no table to query. The entry point must be user-scoped: look up a user, query their GastownUserDO for their towns, then drill into a specific TownDO.
Navigation flow
Admin searches for user (by email, userId, or name from kilocode_users in Postgres)
→ Hits GastownUserDO for that userId
→ Gets list of towns (user_towns) and rigs (user_rigs)
→ Selects a town → enters Town Inspector (queries TownDO)
→ Drills into beads, agents, container, config, events
This user-scoped lookup should live inside the existing Kilo admin dashboard's user detail page — not in a separate Gastown-only admin route. When an admin is already looking at a user (e.g., for a support case), they should see a "Gastown" section showing that user's towns with health indicators, and be able to drill straight into the Town Inspector from there.
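The navigation flow above can be sketched as a two-step fan-out: ask the user's GastownUserDO for its towns, then query each TownDO for a summary. Everything below is an assumption made for illustration. The `AdminEnv` interface stands in for Durable Object namespace lookups, and the `listTowns` and `getSummary` methods are hypothetical; the real DOs may expose a different surface.

```typescript
// Hypothetical row and summary shapes.
interface UserTownRow { townId: string; name: string }
interface TownSummary { townId: string; activeAgents: number; openBeads: number }

// Hypothetical stubs for the two Durable Objects involved.
interface GastownUserStub { listTowns(): Promise<UserTownRow[]> }
interface TownStub { getSummary(): Promise<TownSummary> }

// Stands in for namespace lookups keyed by userId and townId.
interface AdminEnv {
  userDO(userId: string): GastownUserStub;
  townDO(townId: string): TownStub;
}

// User-scoped entry point: there is no global town list to start from,
// so discovery always begins at the user's own Durable Object.
async function listTownsForUser(env: AdminEnv, userId: string): Promise<TownSummary[]> {
  const towns = await env.userDO(userId).listTowns();
  return Promise.all(towns.map((t) => env.townDO(t.townId).getSummary()));
}
```

Because the function only depends on the `AdminEnv` interface, it can be exercised with in-memory fakes before any real Durable Object bindings are wired in.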
Admin Panel Sections
User → Gastown Section (entry point, on existing admin user detail page)
- List of user's towns (from GastownUserDO) with health indicators (green/yellow/red)
- Per-town summary: active agents, open beads, container status, last activity
- Quick actions: jump to Town Inspector, force-restart container
- List of user's rigs with linked repo and integration status
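The issue does not define what makes a town green, yellow, or red, so the rules below are purely an assumed starting point for discussion: red when the container is down while work is open, yellow when agents are stalled or open work has seen no activity for a while, green otherwise. The `TownSnapshot` shape, the `townHealth` name, and the 30-minute default are hypothetical.

```typescript
type Health = "green" | "yellow" | "red";

// Hypothetical per-town snapshot assembled from TownDO data.
interface TownSnapshot {
  containerRunning: boolean;
  stalledAgents: number;
  openBeads: number;
  lastActivityAt: number; // epoch milliseconds
}

// Assumed rules, evaluated from most to least severe.
function townHealth(
  s: TownSnapshot,
  now: number,
  staleAfterMs: number = 30 * 60 * 1000,
): Health {
  if (!s.containerRunning && s.openBeads > 0) return "red";
  if (s.stalledAgents > 0) return "yellow";
  if (s.openBeads > 0 && now - s.lastActivityAt > staleAfterMs) return "yellow";
  return "green";
}
```

Passing `now` explicitly keeps the function deterministic, which makes the thresholds easy to test and tune.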
Town Inspector (single town deep-dive)
- State tab: All beads with current status, assigned agent, last event timestamp. Filterable by type/status. Clickable to bead detail view.
- Agents tab: All agents with role, status, hooked bead, dispatch attempts, last activity. Clickable to agent detail view with SDK event stream.
- Review Queue tab: Pending/in-progress MR beads, refinery assignment, PR URLs, time-in-queue.
- Container tab: Health status, process list, environment variables (redacted), lifecycle timeline, resource usage.
- Config tab: Town config, rig configs, integration links, credential status. Editable by admin.
- Events tab: Unified timeline merging bead_events + agent events + container events, correlated by bead/agent ID.
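The unified timeline in the Events tab amounts to merging several event streams by time and then filtering by a shared bead or agent ID. The sketch below assumes every event can be normalized to one shape with an epoch-millisecond timestamp; the `TimelineEvent` fields and both function names are hypothetical, not the existing bead_events schema.

```typescript
type EventSource = "bead" | "agent" | "container";

// Assumed normalized event shape for the merged view.
interface TimelineEvent {
  source: EventSource;
  timestamp: number; // epoch milliseconds
  beadId?: string;
  agentId?: string;
  message: string;
}

// Merges any number of streams into one chronologically ordered list.
function mergeTimelines(...streams: TimelineEvent[][]): TimelineEvent[] {
  const all = ([] as TimelineEvent[]).concat(...streams);
  return all.sort((a, b) => a.timestamp - b.timestamp);
}

// Keeps events that reference the given bead or agent.
function filterByCorrelation(
  events: TimelineEvent[],
  key: { beadId?: string; agentId?: string },
): TimelineEvent[] {
  return events.filter(
    (e) =>
      (key.beadId !== undefined && e.beadId === key.beadId) ||
      (key.agentId !== undefined && e.agentId === key.agentId),
  );
}
```

Container events typically carry neither ID, so a correlated view would need either a time-window join or container events tagged with the agents they affected; that choice is left open here.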
Bead Inspector (single bead deep-dive)
- Full state history (bead_events)
- Dependency graph visualization (what blocks this, what this blocks, convoy membership)
- Assigned agent history (who worked on this, for how long)
- Related MR beads (review submissions)
- Agent conversation for each assignment (pulled from AgentDO)
- Admin actions: force-close, force-fail, reassign, change priority, unhook agent
Agent Inspector (single agent deep-dive)
- Status timeline
- SDK event stream (from AgentDO, with search)
- Current/past hooked beads
- Dispatch attempt history with error details
- Git operations performed
- LLM token usage (when #228 (Observability) metrics are available)
- Admin actions: force-reset, force-kill, delete agent
Intervention Log
- All admin actions taken (who, when, what, on which town/bead/agent)
- Immutable audit trail — admins must not be able to intervene without a record
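One way to guarantee that no intervention happens without a record is to route every admin action through a wrapper that writes a "started" entry before the action runs and an outcome entry afterwards. This is a minimal in-memory sketch under stated assumptions: the `AuditEntry` fields, the `InterventionLog` class, and `runIntervention` are hypothetical, and a real log would live in durable, append-only storage with asynchronous calls.

```typescript
interface AuditEntry {
  adminId: string;
  action: string; // e.g. "force_reset_agent" (hypothetical name)
  target: { townId: string; beadId?: string; agentId?: string };
  at: number;     // epoch milliseconds
  phase: "started" | "ok" | "error";
  error?: string;
}

// Append-only log: entries are copied and shallowly frozen on write.
class InterventionLog {
  private readonly entries: Readonly<AuditEntry>[] = [];
  append(entry: AuditEntry): void {
    this.entries.push(Object.freeze({ ...entry }));
  }
  list(): readonly Readonly<AuditEntry>[] {
    return this.entries.slice();
  }
}

// Runs `fn` only after the "started" entry has been written.
function runIntervention<T>(
  log: InterventionLog,
  adminId: string,
  action: string,
  target: AuditEntry["target"],
  fn: () => T,
): T {
  log.append({ adminId, action, target, at: Date.now(), phase: "started" });
  try {
    const result = fn();
    log.append({ adminId, action, target, at: Date.now(), phase: "ok" });
    return result;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    log.append({ adminId, action, target, at: Date.now(), phase: "error", error: message });
    throw err;
  }
}
```

Writing the "started" entry first means that if the log write fails, the action never runs; a crash mid-action leaves a dangling "started" entry, which is itself useful evidence.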
Data Requirements
Several pieces of data that admins need are currently not persisted anywhere:
- Dispatch attempt details: currently console.error only. Need to persist dispatch errors (container response, error message, timestamp) in a dispatch_attempts table or as bead events.
- Container lifecycle events: currently console.log only. Need to emit bead events or a dedicated container event table for start/stop/error/sleep/wake.
- Git operation results: currently container-side console.log only. Need to report clone/push/merge outcomes back to TownDO as events.
- Admin intervention audit log: does not exist. Need a new table.
- Credential resolution results: currently console.warn only. Need to persist token refresh success/failure.
These data gaps should be addressed in #228 or as a prerequisite to this issue.
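As an illustration of the first gap, a persisted dispatch attempt might carry the fields named above (container response, error message, timestamp) plus the agent and bead involved. The sketch below keeps rows in memory to show the shape and the queries an admin view would need. The record fields, class name, and the reading of the 5-attempt budget as "consecutive failures per agent" are assumptions, not the existing behavior.

```typescript
// From this issue: "dispatch budget exhausted (5 attempts)".
const DISPATCH_BUDGET = 5;

// Hypothetical row shape for a dispatch_attempts table.
interface DispatchAttempt {
  agentId: string;
  beadId: string;
  at: number; // epoch milliseconds
  ok: boolean;
  errorMessage?: string;
  containerResponse?: string;
}

// In-memory stand-in for the table; a TownDO would use its SQLite storage.
class DispatchAttemptLog {
  private readonly rows: DispatchAttempt[] = [];

  record(row: DispatchAttempt): void {
    this.rows.push({ ...row });
  }

  forAgent(agentId: string): DispatchAttempt[] {
    return this.rows.filter((r) => r.agentId === agentId);
  }

  // Assumption: the budget counts failures since the last success.
  failuresSinceLastSuccess(agentId: string): number {
    let failures = 0;
    for (const r of this.forAgent(agentId)) {
      failures = r.ok ? 0 : failures + 1;
    }
    return failures;
  }

  budgetExhausted(agentId: string): boolean {
    return this.failuresSinceLastSuccess(agentId) >= DISPATCH_BUDGET;
  }
}
```

With rows like these, the Agent Inspector's "dispatch attempt history with error details" becomes a simple query instead of a search through console.error output.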
Acceptance Criteria
- User → Gastown section on existing admin user detail page (GastownUserDO lookup, town list with health indicators)
- Town inspector with state, agents, review queue, container, config, and events tabs
- Bead inspector with full state history, dependency graph, and agent conversation replay
- Agent inspector with SDK event stream, dispatch history, and status timeline
- Admin interventions: force-reset agent, force-close/fail bead, force-restart container, force-retry review, credential refresh
- All admin interventions recorded in an immutable audit log
- Unified event timeline correlating bead events, agent events, and container events
- Admin panel gated to is_admin users (extends #537, Gate Gastown UI to Kilo admins only)
- Data persistence gaps addressed (dispatch attempts, container lifecycle, git operations, credential resolution)
Notes
- No data migration needed — cloud Gastown hasn't deployed to production
- This is the admin panel for Kilo operators, not the user-facing dashboard (which is [Gastown] Dashboard UI Overhaul — Town Home, Mayor Chat, Rig Workbench #346 / [Gastown] Dashboard Deep Drill-Down — Visualization, Conversation History, Cost Tracking #225)
- The entry point is the existing admin user detail page: add a Gastown section there, not a separate /gastown/admin/ route tree
- Town Inspector and deeper views can be standalone admin pages, linked from the user detail page
- Fleet-wide overview (all towns across all users) is out of scope and will be addressed alongside #228 (Observability) once a secondary index exists