Gastown User-Scoped Admin Panel — Inspect & Intervene via GastownUserDO → TownDO #897

@jrf0110

Parent

Part of #204 (Phase 4: Hardening)

Problem

When a Gastown user's town breaks — agents stall, beads get stuck, containers die, merges fail, credentials expire — there is currently no way for a Kilo admin to diagnose or fix it without SSH-level access to Cloudflare infrastructure. All state is locked inside Durable Object SQLite databases and KV storage, errors are scattered across console.error logs with no correlation, and intervention requires writing ad-hoc scripts against internal APIs.

This issue covers a user-scoped admin panel: given a specific user, look up their GastownUserDO, see their towns, and drill into any TownDO to inspect and intervene. Fleet-wide views (all towns across all users) are out of scope here and will be addressed alongside #228 (Observability) once a secondary index exists.


User Failure Scenarios & Required Admin Capabilities

1. "My agent isn't doing anything" — Stuck Agent

Root causes: Container died mid-session, git clone failed (bad credentials, private repo), dispatch budget exhausted (5 attempts), polecat name pool exhausted, GUPP timeout, agent process crashed inside container.

Admin needs:

  • See agent status timeline: idle → working → (stalled? dead?)
  • See dispatch attempt history with error messages (currently only in console.error)
  • See the agent's last SDK events (from AgentDO) — did it receive its prompt? did it call any tools? did the session error out?
  • See container health: is the container running? when did it last restart? what's its process list?
  • Intervention: Force-reset agent to idle, force-unhook from bead, force-kill agent process in container, retry dispatch

2. "My bead has been open forever" — Stuck Bead

Root causes: No idle agent available, all polecats stalled, dispatch keeps failing, bead assigned to dead agent, convoy dependency blocking it, review queue backed up.

Admin needs:

  • See bead state history (bead_events timeline)
  • See which agent is hooked (if any) and that agent's status
  • See dependency graph: is this bead blocked by another? part of a convoy? waiting for review?
  • See dispatch attempts against this bead specifically
  • Intervention: Force-close bead, force-fail bead, reassign to different agent, manually unhook current agent, change bead priority

3. "Code won't merge" — Refinery Failure

Root causes: Merge conflict, git push credentials expired, refinery agent hallucinated bad PR URL, PR-strategy entry stuck waiting for webhook, review queue single-item bottleneck.

Admin needs:

  • See review queue depth and processing rate
  • See per-MR-bead status: who's reviewing, how long in in_progress, branch names, PR URL
  • See refinery agent's reasoning (conversation/tool calls)
  • See git operation logs: what merge commands ran, what failed, stderr output
  • Intervention: Force-retry a failed review, force-close an MR bead, skip review and direct-merge, clear the review queue, change merge strategy on the fly

4. "Container keeps dying" — Container Instability

Root causes: OOM from too many concurrent agents, Cloudflare container platform issues, sleep/wake cycle problems, environment variable misconfiguration.

Admin needs:

  • Container lifecycle timeline: start, stop (with exit code + reason), error events
  • Container resource usage (memory, CPU if available from CF)
  • Current process list inside container
  • Environment variable inspection (redacted secrets)
  • Sleep/wake history
  • Intervention: Force-restart container, force-stop container, update environment variables, change sleep timeout
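
The "redacted secrets" requirement for environment variable inspection could look roughly like the sketch below. The key pattern and the `redactEnv` name are assumptions for illustration; the issue does not define which variables count as secrets.

```typescript
// Hypothetical pattern: treat any variable whose NAME looks secret-like as sensitive.
const SECRET_KEY_PATTERN = /(token|secret|password|key|credential)/i;

// Returns a copy of the environment that is safe to render in the admin UI.
function redactEnv(env: Record<string, string>): Record<string, string> {
  const redacted: Record<string, string> = {};
  for (const name of Object.keys(env)) {
    const value = env[name] ?? "";
    redacted[name] = SECRET_KEY_PATTERN.test(name) ? "[REDACTED]" : value;
  }
  return redacted;
}
```

Matching on names is a deny-list, so a secret stored under an unexpected name would leak; an allow-list of known-safe variables may be the safer default.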

5. "GitHub says permission denied" — Credential Issues

Root causes: GitHub App token expired, integration revoked, wrong integration linked to rig, credential store in container has stale token, dynamic credential resolution endpoint returning errors.

Admin needs:

  • See current credential state: which integration is linked, when was the token last resolved, is the token valid?
  • See git credential verification results (currently only a console.warn)
  • See all git push/pull failures across agents in the town
  • Intervention: Force-refresh credentials from integration, manually set a git token, re-link integration

6. "My convoy never finished" — Convoy Progress Stall

Root causes: A tracked bead was deleted instead of closed (counter never catches up), a tracked bead's merge failed, intermediate merge to feature branch conflicted, landing MR failed.

Admin needs:

  • Convoy bead graph: all tracked beads with their statuses
  • Progress counters: closed_beads / total_beads, with specific identification of which beads are not yet closed
  • Landing MR status
  • Intervention: Force-land convoy, add/remove tracked beads, force-close stuck child beads
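
As a sketch of the progress view, the counters can be recomputed from the tracked beads themselves instead of trusting a stored counter, which also yields the list of beads that are not yet closed. The `TrackedBead` shape and the `"closed"` status string are assumptions, not the actual TownDO schema.

```typescript
// Hypothetical shape of a bead tracked by a convoy.
interface TrackedBead {
  beadId: string;
  status: string;
}

interface ConvoyProgress {
  closedBeads: number;
  totalBeads: number;
  notClosed: TrackedBead[];
}

// Derives closed_beads / total_beads and names the beads still holding the convoy open.
function convoyProgress(tracked: TrackedBead[]): ConvoyProgress {
  const notClosed = tracked.filter((bead) => bead.status !== "closed");
  return {
    closedBeads: tracked.length - notClosed.length,
    totalBeads: tracked.length,
    notClosed,
  };
}
```

Comparing this derived value with the stored counter would surface the "deleted instead of closed" case as a mismatch.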

7. "The Mayor isn't responding" — Mayor Dysfunction

Root causes: Mayor agent crashed, container sleeping, Mayor stuck in a tool call loop, Mayor's system prompt is stale/wrong, Mayor's session errored.

Admin needs:

  • Mayor agent status and last activity timestamp
  • Mayor's current session state (active? idle? errored?)
  • Mayor's recent conversation (messages + tool calls)
  • Mayor's system prompt (rendered, with all dynamic context)
  • Intervention: Force-restart Mayor, clear Mayor conversation, resend last user message

8. "I'm being charged but nothing is happening" — Silent Resource Consumption

Root causes: Zombie agents consuming LLM tokens, agents in retry loops, refinery re-reviewing the same MR, container running with no active work.

Admin needs:

  • LLM token usage per agent per bead (requires metrics from #228, "[Gastown] PR 22: Observability")
  • Active agent count vs. actual work being done
  • Container uptime with no dispatched agents
  • Review queue churn: how many times has the same MR been retried?
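
Review queue churn could be derived by counting how often each MR bead enters review. This is a minimal sketch; the `ReviewEvent` shape and the `"review_started"` event type are hypothetical names, not existing bead_events values.

```typescript
// Hypothetical event shape; only the fields needed for the count.
interface ReviewEvent {
  mrBeadId: string;
  type: string;
}

// Counts review starts per MR bead so repeatedly retried MRs stand out.
function reviewRetryCounts(events: ReviewEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const event of events) {
    if (event.type !== "review_started") continue;
    counts.set(event.mrBeadId, (counts.get(event.mrBeadId) ?? 0) + 1);
  }
  return counts;
}
```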

Data Access Model — No Central Database

Gastown has no central Postgres table of towns or agents. All state is distributed across Durable Objects:

  • GastownUserDO (keyed by userId) — owns user_towns and user_rigs tables. This is the only way to discover which towns a user has.
  • TownDO (keyed by townId) — owns all beads, agents, rigs, events, config. This is where all the operational data lives.
  • AgentDO (keyed by agentId) — owns high-volume SDK event streams per agent.

This means the admin panel cannot start from a global town list — there is no table to query. The entry point must be user-scoped: look up a user, query their GastownUserDO for their towns, then drill into a specific TownDO.

Navigation flow

Admin searches for user (by email, userId, or name from kilocode_users in Postgres)
  → Hits GastownUserDO for that userId
    → Gets list of towns (user_towns) and rigs (user_rigs)
      → Selects a town → enters Town Inspector (queries TownDO)
        → Drills into beads, agents, container, config, events
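
The flow above can be sketched as follows. The interfaces are simplified stand-ins for Durable Object namespaces and stubs, and `listTowns`, `getSummary`, and the binding names `GASTOWN_USER` and `TOWN` are assumptions made for illustration, not the existing Gastown API.

```typescript
// Simplified stand-ins for the Durable Object bindings (hypothetical names).
interface Namespace<Stub> {
  idFromName(name: string): string;
  get(id: string): Stub;
}
interface TownRef { townId: string; name: string; }
interface TownOverview { openBeads: number; activeAgents: number; }
interface UserStub { listTowns(): Promise<TownRef[]>; }
interface TownStub { getSummary(): Promise<TownOverview>; }
interface Env { GASTOWN_USER: Namespace<UserStub>; TOWN: Namespace<TownStub>; }

// userId -> GastownUserDO -> town list -> one TownDO query per town.
async function listTownsWithSummary(env: Env, userId: string) {
  const userStub = env.GASTOWN_USER.get(env.GASTOWN_USER.idFromName(userId));
  const towns = await userStub.listTowns();
  return Promise.all(
    towns.map(async (town) => {
      const townStub = env.TOWN.get(env.TOWN.idFromName(town.townId));
      return { ...town, summary: await townStub.getSummary() };
    }),
  );
}
```

Because each town is a separate Durable Object, the summary costs one call per town. That is acceptable for a single user's towns but is the reason fleet-wide views need a secondary index first.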

This user-scoped lookup should live inside the existing Kilo admin dashboard's user detail page — not in a separate Gastown-only admin route. When an admin is already looking at a user (e.g., for a support case), they should see a "Gastown" section showing that user's towns with health indicators, and be able to drill straight into the Town Inspector from there.


Admin Panel Sections

User → Gastown Section (entry point, on existing admin user detail page)

  • List of user's towns (from GastownUserDO) with health indicators (green/yellow/red)
  • Per-town summary: active agents, open beads, container status, last activity
  • Quick actions: jump to Town Inspector, force-restart container
  • List of user's rigs with linked repo and integration status
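
One possible way to derive the green/yellow/red indicator from the per-town summary is sketched below. The input fields and every rule are assumptions; the issue does not define what each color means.

```typescript
type Health = "green" | "yellow" | "red";

// Hypothetical per-town summary fields used for the indicator.
interface TownHealthInput {
  containerRunning: boolean;
  activeAgents: number;
  stalledAgents: number;
  openBeads: number;
}

function townHealth(town: TownHealthInput): Health {
  // Assumed rule: work is waiting but nothing can run it.
  if (!town.containerRunning && town.openBeads > 0) return "red";
  // Assumed rule: something is degraded but the town may still make progress.
  if (town.stalledAgents > 0) return "yellow";
  if (town.openBeads > 0 && town.activeAgents === 0) return "yellow";
  return "green";
}
```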

Town Inspector (single town deep-dive)

  • State tab: All beads with current status, assigned agent, last event timestamp. Filterable by type/status. Clickable to bead detail view.
  • Agents tab: All agents with role, status, hooked bead, dispatch attempts, last activity. Clickable to agent detail view with SDK event stream.
  • Review Queue tab: Pending/in-progress MR beads, refinery assignment, PR URLs, time-in-queue.
  • Container tab: Health status, process list, environment variables (redacted), lifecycle timeline, resource usage.
  • Config tab: Town config, rig configs, integration links, credential status. Editable by admin.
  • Events tab: Unified timeline merging bead_events + agent events + container events, correlated by bead/agent ID.
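
The unified timeline in the Events tab amounts to merging several event streams by timestamp and filtering by a correlation ID. A minimal sketch, assuming a common `TimelineEvent` shape that does not exist in the codebase yet:

```typescript
// Hypothetical normalized event; each source would be mapped into this shape.
interface TimelineEvent {
  source: "bead" | "agent" | "container";
  at: number; // epoch milliseconds
  message: string;
  beadId?: string;
  agentId?: string;
}

// Merges any number of event streams into one chronological timeline.
function mergeTimeline(...streams: TimelineEvent[][]): TimelineEvent[] {
  const merged: TimelineEvent[] = [];
  for (const stream of streams) merged.push(...stream);
  return merged.sort((a, b) => a.at - b.at);
}

// Correlates the timeline down to everything that touched one bead.
function eventsForBead(timeline: TimelineEvent[], beadId: string): TimelineEvent[] {
  return timeline.filter((event) => event.beadId === beadId);
}
```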

Bead Inspector (single bead deep-dive)

  • Full state history (bead_events)
  • Dependency graph visualization (what blocks this, what this blocks, convoy membership)
  • Assigned agent history (who worked on this, for how long)
  • Related MR beads (review submissions)
  • Agent conversation for each assignment (pulled from AgentDO)
  • Admin actions: force-close, force-fail, reassign, change priority, unhook agent
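
For the "what blocks this" side of the dependency graph, a traversal like the sketch below would list every unclosed bead that transitively blocks a given bead. The `BeadNode` shape and the `blockedBy` field are assumptions about how dependencies might be exposed.

```typescript
// Hypothetical node shape; blockedBy holds the IDs of direct blockers.
interface BeadNode {
  beadId: string;
  status: string;
  blockedBy: string[];
}

// Walks blockers transitively and returns the IDs of those not yet closed.
function openBlockers(beads: Map<string, BeadNode>, beadId: string): string[] {
  const seen = new Set<string>([beadId]); // guards against dependency cycles
  const stack = [...(beads.get(beadId)?.blockedBy ?? [])];
  const open: string[] = [];
  while (stack.length > 0) {
    const id = stack.pop() as string;
    if (seen.has(id)) continue;
    seen.add(id);
    const node = beads.get(id);
    if (!node) continue; // blocker no longer exists; skipped in this sketch
    if (node.status !== "closed") open.push(id);
    stack.push(...node.blockedBy);
  }
  return open.sort();
}
```

Blockers that no longer exist are skipped here; the inspector may prefer to flag them, since a deleted bead is one of the stall causes listed above.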

Agent Inspector (single agent deep-dive)

  • Status timeline
  • SDK event stream (from AgentDO, with search)
  • Current/past hooked beads
  • Dispatch attempt history with error details
  • Git operations performed
  • LLM token usage (once metrics from #228, "[Gastown] PR 22: Observability", are available)
  • Admin actions: force-reset, force-kill, delete agent

Intervention Log

  • All admin actions taken (who, when, what, on which town/bead/agent)
  • Immutable audit trail — admins must not be able to intervene without a record
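
One way to enforce "no intervention without a record" is to route every admin action through a wrapper that writes the audit entry first and only then runs the action. The sketch below uses an in-memory log and synchronous calls for brevity; `InterventionLog`, `withAudit`, and the record fields are hypothetical, and a real version would write to an append-only table.

```typescript
interface InterventionRecord {
  adminId: string;
  at: string; // ISO timestamp
  action: string;
  townId: string;
  targetId?: string; // bead or agent the action applies to
}

// Append-only: the class exposes no update or delete.
class InterventionLog {
  private readonly records: InterventionRecord[] = [];
  append(record: InterventionRecord): void {
    this.records.push(Object.freeze({ ...record }));
  }
  all(): readonly InterventionRecord[] {
    return this.records.slice();
  }
}

// Writes the record before running the action; if append throws, run() never executes.
function withAudit<T>(log: InterventionLog, entry: Omit<InterventionRecord, "at">, run: () => T): T {
  log.append({ ...entry, at: new Date().toISOString() });
  return run();
}
```

Recording before acting means a failed action still leaves a trace; the outcome could be appended as a second record.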

Data Requirements

Several pieces of data that admins need are currently not persisted anywhere:

  1. Dispatch attempt details — currently console.error only. Need to persist dispatch errors (container response, error message, timestamp) in a dispatch_attempts table or as bead events.
  2. Container lifecycle events — currently console.log only. Need to emit bead events or a dedicated container event table for start/stop/error/sleep/wake.
  3. Git operation results — currently container-side console.log only. Need to report clone/push/merge outcomes back to TownDO as events.
  4. Admin intervention audit log — does not exist. Need a new table.
  5. Credential resolution results — currently console.warn only. Need to persist token refresh success/failure.

These data gaps should be addressed in #228 or as a prerequisite to this issue.
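
As an illustration of the first gap, a persisted dispatch attempt could carry the fields listed in item 1. The record shape and helper names below are assumptions, not an existing schema; the budget of 5 comes from the stuck-agent scenario above.

```typescript
// Budget taken from the issue text: "dispatch budget exhausted (5 attempts)".
const MAX_DISPATCH_ATTEMPTS = 5;

// Hypothetical row for a dispatch_attempts table.
interface DispatchAttempt {
  beadId: string;
  agentId: string;
  attempt: number; // 1-based, counted per bead
  at: string; // ISO timestamp
  errorMessage: string;
  containerResponse?: string;
}

// Appends a failure record instead of only calling console.error.
function recordDispatchFailure(
  history: DispatchAttempt[], beadId: string, agentId: string,
  error: unknown, at: Date, containerResponse?: string,
): DispatchAttempt {
  const previous = history.filter((a) => a.beadId === beadId).length;
  const entry: DispatchAttempt = {
    beadId,
    agentId,
    attempt: previous + 1,
    at: at.toISOString(),
    errorMessage: error instanceof Error ? error.message : String(error),
  };
  if (containerResponse !== undefined) entry.containerResponse = containerResponse;
  history.push(entry);
  return entry;
}

function dispatchBudgetExhausted(history: DispatchAttempt[], beadId: string): boolean {
  return history.filter((a) => a.beadId === beadId).length >= MAX_DISPATCH_ATTEMPTS;
}
```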

Acceptance Criteria

  • User → Gastown section on existing admin user detail page (GastownUserDO lookup, town list with health indicators)
  • Town inspector with state, agents, review queue, container, config, and events tabs
  • Bead inspector with full state history, dependency graph, and agent conversation replay
  • Agent inspector with SDK event stream, dispatch history, and status timeline
  • Admin interventions: force-reset agent, force-close/fail bead, force-restart container, force-retry review, credential refresh
  • All admin interventions recorded in an immutable audit log
  • Unified event timeline correlating bead events, agent events, and container events
  • Admin panel gated to is_admin users (extends #537, "Gate Gastown UI to Kilo admins only")
  • Data persistence gaps addressed (dispatch attempts, container lifecycle, git operations, credential resolution)
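
The is_admin gate could be a single guard that every admin route and intervention calls first. This is a minimal sketch; `SessionUser` and `requireAdmin` are hypothetical names, and only the `is_admin` flag comes from the issue.

```typescript
// Hypothetical session shape; only is_admin is taken from the issue text.
interface SessionUser {
  userId: string;
  is_admin: boolean;
}

// Throws unless the caller is an authenticated Kilo admin.
function requireAdmin(user: SessionUser | null): SessionUser {
  if (user === null || !user.is_admin) {
    throw new Error("Forbidden: Gastown admin panel requires is_admin");
  }
  return user;
}
```

Enforcing this on the server side of each intervention, not just when rendering the panel, keeps the gate effective against direct API calls.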
