
fix(gastown): route MR bead failures through full review lifecycle to unblock convoys #1244

Closed
jrf0110 wants to merge 47 commits into main from 1234-town-issues

Conversation

Contributor

@jrf0110 jrf0110 commented Mar 18, 2026

Summary

  • Three failure paths in processReviewQueue() (no rig_id, no rig config, container fails to start) were calling reviewQueue.completeReview() directly — a raw SQL UPDATE that bypasses updateBeadStatus(), skipping convoy progress updates and leaving source beads stuck in in_review permanently. Replaced all three with a new failReviewWithRework() helper that routes through completeReviewWithResult(), which properly logs events, updates convoy progress, returns the source bead to in_progress, and dispatches a polecat for rework.
  • Added recoverOrphanedSourceBeads() — a safety-net recovery function (called each alarm cycle) that detects source beads stuck in in_review whose MR bead has already reached a terminal state, and returns them to in_progress.
  • Fixed bead status rollback in dispatchAgent() — the !started (container failed) path was rolling back the agent to idle but leaving the bead in in_progress. Now rolls back the bead to open (consistent with the throw path), making the retry via schedulePendingWork more predictable.
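The routing change above can be sketched with hypothetical, simplified shapes (the real helpers take SQL handles and also log events and dispatch a rework polecat; everything here other than the two helper names is invented for illustration):

```typescript
// Minimal sketch, assuming simplified in-memory bead shapes.
type BeadStatus = 'open' | 'in_progress' | 'in_review' | 'closed' | 'failed';
interface Bead { id: string; status: BeadStatus; }

// Stand-in for completeReviewWithResult(): finishes the MR bead AND moves the
// source bead through the full lifecycle instead of leaving it in in_review.
function completeReviewWithResult(source: Bead, mr: Bead, merged: boolean): void {
  mr.status = merged ? 'closed' : 'failed';
  source.status = merged ? 'closed' : 'in_progress'; // back to rework, never stuck
}

// The new failure helper: every failure path (no rig_id, no rig config,
// container start failure) funnels through the same lifecycle.
function failReviewWithRework(source: Bead, mr: Bead): void {
  completeReviewWithResult(source, mr, false);
}
```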

Closes #1243

Verification

  • pnpm typecheck — passes
  • pnpm lint (gastown) — passes
  • pnpm test:integration — all new tests pass (5/5); 1 pre-existing failure in convoy-dag.test.ts unrelated to this change
  • Verified the pre-existing test failure (should include merge_mode in review queue metadata for convoy beads) reproduces on main

Visual Changes

N/A

Reviewer Notes

  • The completeReview() function in review-queue.ts still exists and is still called internally by completeReviewWithResult() — it's the low-level SQL update. The fix ensures no caller in processReviewQueue uses it directly for failure paths.
  • recoverOrphanedSourceBeads uses the same 5-minute timeout as recoverStuckReviews. It only recovers beads whose MR bead is already in a terminal state (closed/failed), so it won't interfere with in-flight reviews.
  • The dispatchAgent bead rollback fix may be a no-op in fire-and-forget contexts (the I/O gate may already be closed per the existing comment at line 2982), but it ensures correct behavior when dispatchAgent is awaited directly (as in processReviewQueue).
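The orphan condition in the second note reduces to a pure predicate (hypothetical shapes; the real check is a SQL query over bead rows):

```typescript
// Sketch of the recoverOrphanedSourceBeads() condition: a source bead is
// orphaned when it is still in_review but its MR bead already reached a
// terminal state, so no completion event will ever move it forward.
type Status = 'open' | 'in_progress' | 'in_review' | 'closed' | 'failed';
const TERMINAL: readonly Status[] = ['closed', 'failed'];

function isOrphanedSource(sourceStatus: Status, mrStatus: Status): boolean {
  return sourceStatus === 'in_review' && TERMINAL.includes(mrStatus);
}
```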

… unblock convoys

Replace direct completeReview() calls in processReviewQueue() failure
paths with completeReviewWithResult(), which properly updates convoy
progress and returns source beads to in_progress for rework dispatch.
Also adds recovery for orphaned source beads stuck in in_review and
fixes bead status rollback consistency in dispatchAgent.
Contributor

kilo-code-bot bot commented Mar 18, 2026

Code Review Summary

Status: 24 Issues Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 24
SUGGESTION 0
Issue Details

WARNING

File Line Issue
cloudflare-gastown/src/dos/Town.do.ts 3201 The terminal-MR cleanup still unhooks the refinery whenever the status probe is anything except running, so starting/unknown can still let a late gt_done land on the next MR.
cloudflare-gastown/src/dos/Town.do.ts 3225 The idle-refinery retry still restarts reviews that already finished but are only waiting on delayed gt_done.
cloudflare-gastown/src/dos/Town.do.ts 3233 The town-global retry still returns before popReviewQueue(), so one idle hooked refinery can starve queued reviews for every other rig.
Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

File Line Issue
cloudflare-gastown/src/dos/Town.do.ts 3330 Unhooking the previous MR before reusing the refinery still lets a late gt_done land on the new review.
cloudflare-gastown/src/dos/Town.do.ts 3368 Setting an uncertain refinery start back to idle still lets processReviewQueue() retry /agents/start while the original launch may still be coming up.
cloudflare-gastown/src/dos/town/review-queue.ts 206 The rig-wide guard still doesn't close stale open MR siblings, so outdated reviews can re-enter between rework cycles once no MR is in_progress.
cloudflare-gastown/src/dos/town/review-queue.ts 209 Treating every in_progress MR on the rig as active refinery work still blocks new reviews behind long-lived PR-strategy beads that are only waiting on humans.
cloudflare-gastown/src/gastown.worker.ts 197 The unauthenticated /debug/towns/:townId/status route still exposes live town state to anyone who can guess a townId.
cloudflare-gastown/container/src/git-manager.ts 134 configureRepoCredentials() still drops custom HTTPS ports by using url.hostname.
cloudflare-gastown/src/dos/town/patrol.ts 631 rehookOrphanedBeads() still matches gt:triage-request beads, so the alarm loop retries an intentionally rejected hookBead() path.
cloudflare-gastown/src/dos/town/container-dispatch.ts 111 AbortSignal.timeout() can still abort startup after the token is persisted, so slow /refresh-token cold starts fail dispatch spuriously.
cloudflare-gastown/src/dos/town/review-queue.ts 626 PR-backed MR beads still look orphaned once the refinery unhooks, so healthy human-review flows can be failed spuriously.
cloudflare-gastown/src/dos/town/review-queue.ts 724 Clearing assignee_agent_bead_id in the recovery path still strands standalone rework beads, because only convoy beads have another reassignment path.
cloudflare-gastown/src/dos/town/scheduling.ts N/A Startup rollback can still reopen beads that finished while launch was in flight.
cloudflare-gastown/src/dos/town/review-queue.ts 765 Recovering the most recent MR bead by assignee_agent_bead_id can still attach a late gt_done to the wrong review.
cloudflare-gastown/src/dos/Town.do.ts 2883 witnessPatrol() still treats not_found as a confirmed exit, so transient status probes can reset live agents.
cloudflare-gastown/src/dos/Town.do.ts 2919 Zombie recovery still leaves the dispatch cooldown in place, delaying the next retry after a restart.
cloudflare-gastown/src/dos/town/beads.ts 265 updateBeadStatus() can still no-op on terminal beads while callers emit terminal-state side effects from the requested status.
cloudflare-gastown/src/trpc/router.ts 1226 debugAgentMetadata still ships a temporary debugging hook as a permanent admin RPC.
cloudflare-gastown/src/dos/town/container-dispatch.ts 17 Module-level lastStartError still shares diagnostics across concurrent dispatch attempts.
cloudflare-gastown/src/dos/town/review-queue.ts 326 Closing every non-closed sibling MR bead still rewrites earlier failed review attempts to closed, so history/API consumers report them as merged.
cloudflare-gastown/src/dos/town/scheduling.ts 165 Uncertain starts still get redispatched after the cooldown.
cloudflare-gastown/src/dos/town/review-queue.ts 399 Clearing assignee_agent_bead_id on failed reviews still strands standalone rework beads because only convoy beads have a reassignment path.
cloudflare-gastown/src/dos/town/container-dispatch.ts 443 Treating every already running response as success also accepts agents that are still only starting, so a startup that later fails can leave the bead stuck in in_progress.
Files Reviewed (1 file)
  • cloudflare-gastown/src/dos/Town.do.ts - 3 prior warnings still present in the incremental diff


Reviewed by gpt-5.4-20260305 · 682,768 tokens

…e open status

- Add NOT EXISTS subquery to recoverOrphanedSourceBeads to skip source
  beads that still have a pending (open/in_progress) MR child, preventing
  false recovery during rework cycles.
- Change recovery target status from in_progress to open so the
  scheduler can assign and dispatch a fresh polecat (the original agent
  was already unhooked by agentDone).
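The guard can be sketched as a predicate (hypothetical shapes; the real change is a SQL NOT EXISTS subquery):

```typescript
// Skip recovery while any MR child of the source bead is still pending
// (open/in_progress): that means a rework cycle is in flight, not an orphan.
interface MrChild { parentId: string; status: 'open' | 'in_progress' | 'closed' | 'failed'; }

function eligibleForRecovery(sourceId: string, mrBeads: MrChild[]): boolean {
  const hasPendingChild = mrBeads.some(
    (mr) => mr.parentId === sourceId && (mr.status === 'open' || mr.status === 'in_progress')
  );
  return !hasPendingChild; // mirrors NOT EXISTS (pending MR child)
}
```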
jrf0110 added 3 commits March 18, 2026 20:24
…wn DO

blockConcurrencyWhile in the constructor already guarantees
initialization completes before any RPC method executes. The 64
redundant await this.ensureInitialized() calls in public methods were
all no-ops (just checking an already-resolved promise).

The town ID is persisted to DO storage on creation and loaded during
initializeDatabase(). Every subsequent RPC call was redundantly calling
setTownId, writing the same value to storage on every request. Removed
21 redundant calls from router.ts, mayor.handler.ts, org-towns.handler.ts,
and configureRig — keeping only the 3 creation-time calls.

Also removed a misleading comment in alarm() claiming processReviewQueue
must run before schedulePendingWork. The ordering is irrelevant: the
refinery and polecats operate on different bead types, and the DAG
blocker check at dispatch time prevents premature downstream dispatch
regardless of scheduling order.
… fix refinery recovery

Extract dispatchAgent, dispatchUnblockedBeads, schedulePendingWork, and
hasActiveWork into town/scheduling.ts (~360 lines out of Town.do.ts).

Improve refinery recovery after container restart:
- Reorder alarm: patrols run BEFORE scheduling so zombie agents
  detected by witnessPatrol are dispatched in the same tick instead
  of waiting for the next one.
- Exclude refineries from schedulePendingWork so they always go
  through processReviewQueue with the full system prompt (branch,
  strategy, gates). recoverStuckReviews resets the MR bead to open
  after the timeout, and processReviewQueue re-pops it correctly.
- Reduce REVIEW_RUNNING_TIMEOUT_MS from 5 min to 2.5 min so the
  review queue recovers faster after a container restart.

Parallelize the alarm loop for lower latency:
- Phase 1: witnessPatrol + deaconPatrol run in parallel (patrols)
- Phase 2: review pipeline (sequential) runs in parallel with
  schedulePendingWork (disjoint agent types)
- Phase 3: mail delivery, nudge expiry, re-escalation, and triage
  dispatch all run in parallel (fully independent)
- Zombie detection container status checks run in parallel via
  Promise.allSettled instead of sequential for-loop.
// Set slightly above the dispatch cooldown (2 min) so that zombie
// detection + cooldown expiry have a chance to recover the agent
// before we force-reset the MR bead.
const REVIEW_RUNNING_TIMEOUT_MS = 2.5 * 60 * 1000;

WARNING: This timeout can reopen healthy reviews after 150 seconds

recoverStuckReviews() uses this same constant for every in-progress MR, but the MR bead's updated_at is only touched when the review starts and finishes. Any direct-merge refinery run that takes longer than 2.5 minutes will be reset to open and can be popped a second time while the original refinery is still working, which risks duplicate review/merge attempts.
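A toy calculation shows the false positive (constant copied from the diff; the staleness check itself is a simplification of recoverStuckReviews()):

```typescript
// updated_at is only touched at review start/finish, so a healthy 3-minute
// direct-merge run exceeds the 2.5-minute threshold and gets reset to open.
const REVIEW_RUNNING_TIMEOUT_MS = 2.5 * 60 * 1000;

function looksStuck(updatedAtMs: number, nowMs: number): boolean {
  return nowMs - updatedAtMs > REVIEW_RUNNING_TIMEOUT_MS;
}
```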

git-lfs smudge filters triggered by worktree creation (git worktree add,
git reset --hard) need credentials to download LFS objects from the
batch API. The token is embedded in the remote URL, but some git-lfs
versions resolve credentials through the credential helper chain for
the LFS batch endpoint (which uses a different URL path).

Configure a credential-store helper on the bare repo right after clone
and on every fetch (in case the token rotated), so all worktrees —
including the browse worktree created by /repos/setup — inherit
working credentials for LFS operations.
const url = new URL(gitUrl);
const credentialLine =
gitlabToken && (url.hostname.includes('gitlab') || envVars.GITLAB_INSTANCE_URL)
? `https://oauth2:${gitlabToken}@${url.hostname}`

WARNING: hostname drops custom HTTPS ports

URL.hostname strips :port, so a repo URL like https://git.example.com:8443/org/repo.git writes https://oauth2:...@git.example.com / https://x-access-token:...@git.example.com into the credential store. Git credential matching treats git.example.com:8443 as a different host, so LFS batch requests on GitHub/GitLab Enterprise with non-default ports will still miss the helper and fail. Use url.host (or url.origin) when composing the credential line.
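The difference is easy to demonstrate with the standard WHATWG URL API (the token value is a placeholder, not a real credential):

```typescript
// URL.hostname drops a non-default port; URL.host preserves it. Git credential
// matching is host:port sensitive, so the credential line must use url.host.
const repo = new URL('https://git.example.com:8443/org/repo.git');

const lossy = repo.hostname; // 'git.example.com'      — port silently dropped
const exact = repo.host;     // 'git.example.com:8443' — matches the real remote

const token = 'TOKEN'; // placeholder
const credentialLine = `https://oauth2:${token}@${exact}`;
```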

After container restarts, review failures, or rework cycles, beads can
end up in 'open' with a stale assignee_agent_bead_id — the agent was
unhooked but the bead still references it. Neither feedStrandedConvoys
(requires assignee IS NULL) nor schedulePendingWork (requires agent
hooked) picks these up, leaving them permanently stuck.

New rehookOrphanedBeads() patrol function finds open issue beads where
the assigned agent's current_hook_bead_id doesn't point back to the
bead, and re-hooks a polecat so schedulePendingWork dispatches it on
the next tick.
INNER JOIN ${agent_metadata}
ON ${agent_metadata.bead_id} = ${beads.assignee_agent_bead_id}
WHERE ${beads.status} = 'open'
AND ${beads.type} = 'issue'

WARNING: Triage request beads match this recovery query

createTriageRequest() intentionally creates open issue beads with assignee_agent_bead_id set, but it never hooks an agent to those beads. That means any open gt:triage-request whose assigned agent is unhooked or working a different bead will satisfy this predicate, and the code below will keep trying to hookBead() it every alarm tick. hookBead() rejects triage-request labels, so this turns every such request into a permanent warning/log-spam loop instead of recovering real orphaned work. Exclude triage-request labels from this scan.

jrf0110 added 2 commits March 18, 2026 22:09
… agent status as not_found

Container fetch calls through the Container class auto-start the
container and wait for port readiness, which can block the alarm loop
during container restarts/deploys. Add AbortSignal.timeout to:
- ensureContainerReady health check (5s)
- checkAgentContainerStatus (5s)
- startAgentInContainer (60s)
- refreshContainerToken (10s)

Also treat non-OK responses and errors in checkAgentContainerStatus as
not_found instead of unknown, so zombie detection resets the agent
immediately rather than leaving it stuck in working state.
…e re-dispatch

witnessPatrol was setting last_activity_at = now() when resetting dead
agents to idle, which triggered the 2-minute dispatch cooldown in
schedulePendingWork. The cooldown exists to prevent double-dispatch of
live agents with an in-flight container start — but a confirmed-dead
agent has no in-flight dispatch to collide with.

Set last_activity_at = NULL instead so schedulePendingWork picks up the
agent on the very next alarm tick. Reduces non-review agent recovery
from ~2+ min to ~5-10s after container restart.
try {
const resp = await container.fetch('http://container/refresh-token', {
method: 'POST',
signal: AbortSignal.timeout(10_000),

WARNING: Refresh-token timeouts now abort agent startup

ensureContainerToken() only suppresses network/"fetch" failures. AbortSignal.timeout() rejects with an abort/timeout error instead, so a slow cold-start on /refresh-token now bubbles out of this helper. startAgentInContainer() catches that and returns false, even though the token was already persisted via setEnvVar(), so dispatch can fail spuriously whenever the container needs more than 10s to come up.

// instead of waiting for the 2-hour GUPP timeout.
if (response.status === 404) return { status: 'not_found' };
if (!response.ok) return { status: 'unknown' };
if (!response.ok) return { status: 'not_found' };

WARNING: Transient status probe failures now look like missing agents

witnessPatrol() treats not_found as a confirmed dead agent and immediately flips it back to idle with no cooldown. Returning not_found for 5xx/timeout/parse failures means a live agent can be redispatched just because the status endpoint was slow during a deploy or container restart, which risks duplicate work on the same hooked bead. Keeping a distinct retry/unknown state for non-404 failures avoids that false-positive recovery path.

jrf0110 added 3 commits March 18, 2026 22:15
… with unhooked agents

rehookOrphanedBeads previously only recovered 'open' beads. After
container restarts, beads can also get stuck in 'in_progress' with
a stale assignee — the agent was unhooked by witnessPatrol but the
bead status was never rolled back. Neither schedulePendingWork nor
any other recovery path picks these up.

Now also matches in_progress beads where the assigned agent is not
working and not hooked. Resets in_progress beads to open before
re-hooking so the dispatch flow starts cleanly. Filters out agents
in 'working' status to avoid interfering with transient races during
dispatch.
… source beads

Gap I: MR beads in in_progress with pr_url and a dead refinery were
stuck permanently — recoverStuckReviews excludes pr_url, schedulePendingWork
excludes refineries, and closeOrphanedReviewBeads only checked open beads.
Expanded closeOrphanedReviewBeads to also match in_progress MR beads.

Gap G: recoverOrphanedSourceBeads returned source beads to open but
left the stale assignee_agent_bead_id, preventing feedStrandedConvoys
(requires assignee IS NULL) from re-hooking convoy beads. Now clears
the assignee so both feedStrandedConvoys and rehookOrphanedBeads can
assign a fresh agent.
LEFT JOIN ${agent_metadata} ON ${beads.assignee_agent_bead_id} = ${agent_metadata.bead_id}
WHERE ${beads.type} = 'merge_request'
AND ${beads.status} = 'open'
AND ${beads.status} IN ('open', 'in_progress')

WARNING: Normal PR-strategy reviews now look orphaned after 30 minutes

markReviewInReview() leaves PR-backed MR beads in in_progress, and agentDone() immediately unhooks the refinery so its status becomes idle. With this wider predicate, any healthy PR review that is simply waiting on a human reviewer for more than 30 minutes now satisfies the orphan scan and gets closed even though nothing is actually wrong.

sql,
/* sql */ `
UPDATE ${beads}
SET ${beads.columns.assignee_agent_bead_id} = NULL

WARNING: Clearing the assignee strands standalone rework beads

Before this change, rehookOrphanedBeads() could recover a standalone source bead because it still had its stale assignee_agent_bead_id. Setting that column to NULL means only feedStrandedConvoys() can pick it back up, but that patrol only scans convoy-tracked beads. A non-convoy issue that hits this recovery path will now stay open forever with no agent re-dispatched.

…patchAgent rollback

dispatchAgent's failure rollback paths used agent.current_hook_bead_id
to roll back the bead status. But when called from rework dispatch
(completeReviewWithResult, failReviewWithRework, agentCompleted), the
agent snapshot is fetched by getOrCreateAgent BEFORE hookBead is called.
So agent.current_hook_bead_id is null in the snapshot, and the bead
rollback is silently skipped — leaving the bead stuck in in_progress
with an idle unhooked agent.

Use bead.bead_id (the actual bead being dispatched) for rollback,
which is always correct regardless of when the agent snapshot was taken.
`,
[agent.id]
);
beadOps.updateBeadStatus(ctx.sql, bead.bead_id, 'open', agent.id);

WARNING: Rollback can reopen beads that finished while startup was in flight

dispatchAgent() awaits container startup, so another RPC/alarm cycle can still move bead.bead_id to closed or failed before this rollback runs. updateBeadStatus() overwrites whatever status is there now, so both rollback branches will reopen that terminal bead to open even though the agent never actually started. Re-read the bead and only roll it back when it is still in_progress (and ideally still assigned to this agent).
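The guard the reviewer suggests could look like this (hypothetical row shape; the real fix would re-read the row inside the same transaction before writing):

```typescript
// Only roll a bead back to 'open' if it is still in_progress AND still
// assigned to the agent whose startup failed; a bead that reached a terminal
// state while startup was in flight is left alone.
interface BeadRow { status: string; assigneeAgentId: string | null; }

function shouldRollback(current: BeadRow, dispatchingAgentId: string): boolean {
  return current.status === 'in_progress' && current.assigneeAgentId === dispatchingAgentId;
}
```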

jrf0110 added 2 commits March 18, 2026 23:05
…h active refinery

recoverStuckReviews was resetting in_progress MR beads to open every
alarm tick when they exceeded the timeout, even when the refinery agent
was actively working on the review. This caused infinite reset loops
where the refinery's work was repeatedly interrupted.

Add NOT EXISTS check to exclude MR beads whose assigned refinery agent
is currently in 'working' status. Also restore the timeout to 5 min
since the working-agent guard now prevents false positives — the timeout
only fires when the agent is truly dead/idle.
…TCH_ATTEMPTS

After a deploy + container eviction, dispatch failures burn through
MAX_DISPATCH_ATTEMPTS (was 5) within 25 seconds, permanently failing
beads before the container has time to start. Two changes:

1. Increase MAX_DISPATCH_ATTEMPTS from 5 to 20 to tolerate longer
   container cold starts.

2. Set last_activity_at = now() on dispatch failure to trigger the
   2-min dispatch cooldown, giving the container time to start up
   before the next retry. Combined with 20 max attempts, this allows
   up to 40 min of retries.
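The retry budget in point 2 follows from simple arithmetic (values copied from the commit message):

```typescript
// 20 attempts, each gated by the 2-minute dispatch cooldown, gives the
// container up to ~40 minutes to come up before the bead is permanently failed.
const MAX_DISPATCH_ATTEMPTS = 20;
const DISPATCH_COOLDOWN_MS = 2 * 60 * 1000;

const retryWindowMs = MAX_DISPATCH_ATTEMPTS * DISPATCH_COOLDOWN_MS;
const retryWindowMin = retryWindowMs / 60_000;
```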
jrf0110 added 5 commits March 19, 2026 08:35
…f throwing 500

When the container restarts during a review, witnessPatrol unhooks the
refinery agent. If the refinery process resumes and calls gt_done after
being unhooked, agentDone threw 'Agent has no hooked bead' → 500 error.
This triggered triage requests and left the bead in a stuck state.

Return gracefully instead of throwing — the recovery paths
(recoverStuckReviews, rehookOrphanedBeads) will handle the bead
lifecycle.
…back

processReviewQueue passed rigConfig.kilocodeToken directly to
startAgentInContainer, which is undefined when the token was stored
in the town config rather than the rig config. Polecats work because
dispatchAgent calls resolveKilocodeToken() which falls back to the
town config. The refinery got no token → container rejected the start
→ every review failed with 'Refinery container failed to start'.

Use the same resolveKilocodeToken() fallback chain.
…ainer

Worktree creation fails when repos contain LFS-tracked files (e.g. .mp4)
and the LFS batch endpoint can't resolve credentials. This blocks agent
start entirely — the container returns a non-200 for /agents/start and
processReviewQueue fails the review with 'Refinery container failed to
start'.

Set GIT_LFS_SKIP_SMUDGE=1 globally in the git exec helper. Agents
don't need binary assets — LFS files are checked out as pointer files
instead of downloading the actual content.
Belt-and-suspenders alongside the GIT_LFS_SKIP_SMUDGE env var: install
a .gitconfig for the agent user that configures the LFS filter to skip
smudge and excludes all files from LFS fetch. This persists in the
Docker image layer and covers any code path that runs git outside the
exec() helper.
…finery

Three related fixes to stop recoverStuckReviews from resetting MR beads
while the refinery is actively reviewing:

1. checkAgentContainerStatus: restore 'unknown' for non-404 errors and
   timeouts instead of aggressive 'not_found'. Only 404 (container
   confirms agent doesn't exist) triggers zombie reset. Timeout/errors
   return 'unknown' which witnessPatrol ignores — the GUPP system
   handles truly dead agents after 2 hours.

2. recoverStuckReviews: exclude MR beads where ANY agent is hooked
   (not just 'working' agents). After witnessPatrol resets a refinery
   to idle but keeps the hook, the old check saw 'no working agent'
   and reset the MR bead while the refinery was about to be
   re-dispatched.

3. Increase REVIEW_RUNNING_TIMEOUT_MS from 5 min to 15 min. Reviews
   legitimately take 5-10 min (clone + review + test + merge).
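The restored mapping in point 1 can be sketched as follows (simplified; the real checkAgentContainerStatus also parses the response body):

```typescript
// Only a 404 — the container positively reporting the agent is gone — maps to
// not_found; any other failure stays 'unknown' so witnessPatrol ignores it
// and live agents are not reset on transient probe errors or timeouts.
type ProbeResult = 'running' | 'not_found' | 'unknown';

function mapProbe(ok: boolean, httpStatus: number): ProbeResult {
  if (httpStatus === 404) return 'not_found';
  if (!ok) return 'unknown';
  return 'running';
}
```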
… agent type

dispatchAgent was rolling back beads from in_progress to open when
startAgentInContainer returned false. But the container may have
actually started the agent (timeout race — the fetch timed out but
the container continued setup). The agent starts working, but the
DO already rolled the bead back to open.

Remove the bead rollback for both the !started and exception paths.
Leave the bead in in_progress with the agent idle+hooked. Two outcomes:
- Agent actually started → works normally → gt_done closes bead
- Agent truly failed → rehookOrphanedBeads recovers after 2 min

This is the same pattern applied to processReviewQueue in the previous
commit, now extended to the general dispatch path.
ctx.sql,
/* sql */ `
UPDATE ${agent_metadata}
SET ${agent_metadata.columns.status} = 'idle',

WARNING: Uncertain starts still get redispatched after the cooldown

This branch marks the agent idle but keeps its hook on the bead. schedulePendingWork() only checks for status = 'idle', a non-null current_hook_bead_id, and an expired last_activity_at, so a /agents/start timeout race that actually launched the process will be started a second time about two minutes later if the first run is still active. That duplicates work instead of waiting for patrol/witness logic to prove the original start really failed.
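The eligibility check the warning describes reduces to roughly this (hypothetical row shape and field names):

```typescript
// schedulePendingWork() redispatches on: idle status + a non-null hook +
// expired (or null) cooldown — nothing distinguishes "start definitely
// failed" from "start timed out but may actually have succeeded".
interface AgentRow { status: string; hookBeadId: string | null; lastActivityAt: number | null; }
const DISPATCH_COOLDOWN_MS = 2 * 60 * 1000;

function eligibleForRedispatch(agent: AgentRow, nowMs: number): boolean {
  return (
    agent.status === 'idle' &&
    agent.hookBeadId !== null &&
    (agent.lastActivityAt === null || nowMs - agent.lastActivityAt > DISPATCH_COOLDOWN_MS)
  );
}
```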

Comprehensive fix addressing the remaining review agent findings:

1. agentCompleted: don't unhook refinery. Leave hook intact so
   recoverStuckReviews' guard works and gt_done can arrive after
   agentCompleted without a race window.

2. witnessPatrol: only act on 'exited' for refineries, skip 'not_found'.
   not_found is ambiguous (container restarting, status check timeout)
   and falsely setting the refinery to idle enables premature recovery.

3. completeReviewWithResult failure/conflict path: set source bead to
   'open' (not in_progress) and clear assignee. This lets the normal
   scheduling path (feedStrandedConvoys → hookBead → schedulePendingWork)
   handle rework instead of fire-and-forget dispatch that races with
   patrol recovery.

4. Remove all fire-and-forget rework dispatch code from
   failReviewWithRework, TownDO.completeReviewWithResult, and
   TownDO.agentCompleted. Rework is now exclusively handled by the
   alarm loop's scheduling/patrol functions, eliminating the race
   between dispatch and recovery.
sql,
/* sql */ `
UPDATE ${beads}
SET ${beads.columns.assignee_agent_bead_id} = NULL

WARNING: Clearing the assignee here strands standalone rework

feedStrandedConvoys() only re-hooks convoy issues, and rehookOrphanedBeads() only repairs beads whose assignee_agent_bead_id is still set. When a standalone review fails through this path, the source bead is reopened with no hook and no assignee, so nothing ever picks it back up.

// merge succeeded. recoverStuckReviews (which checks for
// status='working') handles the case where gt_done never arrives.
//
// No-op for the bead — just fall through to mark agent idle.

WARNING: Falling through to idle while the refinery stays hooked makes the singleton reusable too early

getOrCreateAgent() returns the existing refinery even when current_hook_bead_id is still set, and processReviewQueue() only gates on status !== 'idle'. Once this path marks the agent idle, the next alarm can pop another MR and then hit hookBead()'s "already hooked" error, or let a late gt_done attach to the wrong review.

jrf0110 added 2 commits March 19, 2026 14:17
…rge path logging

witnessPatrol was setting non-refinery agents to idle on transient
not_found container status, causing working agents to appear idle.
Extend the not_found skip to all agent types — only act on confirmed
exited status.

Add diagnostic logging to completeReviewWithResult merged path to
trace why source beads are not being closed after successful merges.
… for abandoned MR beads

Two fixes for MR beads stuck in in_progress:

1. processReviewQueue: unhook the refinery from its previous MR bead
   before hooking to the new one. agentCompleted preserves the refinery
   hook, so when processReviewQueue runs next, hookBead throws 'already
   hooked'. This crashed processReviewQueue silently (caught by
   Promise.allSettled), leaving the MR bead stuck in in_progress with
   no agent.

2. recoverStuckReviews: add fast 2-min recovery for MR beads that are
   in_progress with NO agent hooked at all. These are clearly abandoned
   (hookBead threw or the refinery was unhooked by another path).
   The 30-min timeout remains for MR beads with an idle refinery
   hooked (waiting for gt_done to arrive).
// If the refinery is still hooked to a previous MR bead (agentCompleted
// preserves hooks for refineries), unhook first.
if (refineryAgent.current_hook_bead_id && refineryAgent.current_hook_bead_id !== entry.id) {
agents.unhookBead(this.sql, refineryAgent.id);

WARNING: This repoints late refinery completion onto the wrong MR

reviewQueue.agentCompleted() intentionally leaves refineries idle but still hooked so a delayed gt_done can recover the MR they were actually reviewing. Clearing that hook here and then reusing the same singleton for entry.id means any late completion from the previous run will now be applied to the new review instead.

WHERE ${beads.type} = 'merge_request'
AND ${beads.status} = 'in_progress'
AND ${beads.updated_at} < ?
AND NOT EXISTS (

WARNING: This fast-recovery scan reopens healthy PR-backed reviews

After agentDone() stores a pr_url, PR-strategy MR beads intentionally stay in_progress with no hooked refinery while a human reviews the PR. Because this new query does not exclude rows with review_metadata.pr_url, any healthy PR review older than two minutes will be reset to open and can dispatch a second refinery run against the same branch.
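The exclusion the reviewer asks for can be sketched as a predicate (hypothetical shapes; the real scan is SQL joining review_metadata):

```typescript
// A fast-recovery candidate must have NO stored pr_url: PR-strategy MR beads
// legitimately sit in_progress with no hooked refinery while a human reviews.
interface MrScanRow { status: string; hooked: boolean; ageMs: number; prUrl: string | null; }
const FAST_RECOVERY_MS = 2 * 60 * 1000;

function isAbandonedMr(mr: MrScanRow): boolean {
  return (
    mr.status === 'in_progress' &&
    !mr.hooked &&
    mr.ageMs > FAST_RECOVERY_MS &&
    mr.prUrl === null
  );
}
```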

Skipping not_found for refineries caused them to stay 'working' in the
DB permanently when the container process died. witnessPatrol never
reset them, and recoverStuckReviews' guard (no WORKING agent) meant
30-min recovery also didn't fire.

Now set the refinery to idle on both not_found and exited (same as
before but without skipping). Keep the hook intact so gt_done can
still arrive and close the MR. If the refinery is idle+hooked:
- processReviewQueue unhooks it before hooking to the new MR (prev fix)
- recoverStuckReviews recovers after 30 min (guard: no working agent)
- 2-min fast recovery handles MR beads left with no hook at all

Non-refineries: reset to idle on both not_found and exited (unchanged).
if (result.status !== 'fulfilled') continue;
const { agentId, containerInfo } = result.value;

if (containerInfo.status === 'not_found' || containerInfo.status === 'exited') {

WARNING: not_found is still too ambiguous for zombie cleanup

checkAgentContainerStatus() can still report not_found during container restarts or transient probe failures, so handling it the same as a confirmed exited process flips live agents to idle. For refineries that reopens the reuse path at processReviewQueue(), and for polecats it makes the scheduler eligible to launch a duplicate retry before the original process reports gt_done.

After a container restart, the first startAgentInContainer call may
fail (container not ready). The refinery is left idle+hooked to the
MR bead (in_progress). Without retries, the MR sits in in_progress
for 30 min until recoverStuckReviews resets it.

Now processReviewQueue checks for this state at the start of each
tick: if the refinery is idle+hooked to an in_progress MR bead,
re-dispatch it via dispatchAgent. This retries every 5 seconds
(alarm interval) until the container accepts the start.
console.log(
  `${TOWN_LOG} processReviewQueue: retrying refinery dispatch for MR bead=${hookedMr.bead_id}`
);
await this.dispatchAgent(refineryForRetry, hookedMr);
return;

WARNING: One stuck refinery can block every other rig's review queue

This retry runs before popReviewQueue() and always returns afterward. Because it grabs the first refinery in the town rather than a refinery tied to the entry being processed, an idle hooked refinery in rig A will monopolize processReviewQueue() until it stops matching this branch, so queued MRs for rig B never start.

…n any MR in-progress

Three fixes for the cascading MR bead problem:

1. processReviewQueue start failure: keep the refinery hook intact
   instead of unhooking. The retry block at the top of
   processReviewQueue handles idle+hooked refineries. By preserving
   the hook, we prevent popping new MR beads while the container is
   unavailable — the retry block retries the same MR each tick.

2. popReviewQueue: don't pop ANY open MR if there's already an
   in_progress MR for the same rig. Since there's only one refinery
   per rig, popping a second MR while the first is in_progress is
   always wrong.

3. Clear refinery checkpoint when unhooking from a previous MR bead
   to prevent stale conversation context leaking into new reviews.
AND NOT EXISTS (
  SELECT 1 FROM ${beads} AS active_mr
  WHERE active_mr.${beads.columns.type} = 'merge_request'
    AND active_mr.${beads.columns.status} = 'in_progress'

WARNING: PR-backed reviews now block every later MR on the same rig

markReviewInReview() leaves PR-strategy beads in in_progress while humans review the external PR, but the refinery is already free once that PR is created. This rig-wide NOT EXISTS treats those waiting PRs as active refinery work, so one long-lived PR can freeze every new review on the rig until a human merges or closes it.
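The refinement this warning points at, counting an in_progress MR as occupying the refinery only when it is not parked on an external PR, can be sketched as a predicate. The `prUrl` field is an assumption standing in for `review_metadata.pr_url`.

```typescript
// Sketch of the suggested refinement; shapes are illustrative assumptions.
interface RigMr { rigId: string; status: string; prUrl: string | null; }

function rigHasActiveRefineryWork(rigId: string, mrs: RigMr[]): boolean {
  return mrs.some(
    (mr) =>
      mr.rigId === rigId &&
      mr.status === 'in_progress' &&
      // PR-strategy beads wait on a human reviewer; the refinery is already
      // free once the PR exists, so they must not block the queue.
      mr.prUrl === null,
  );
}
```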


if (!started) {
agents.unhookBead(this.sql, refineryAgent.id);
// Keep hook intact — the retry block at the top of processReviewQueue

WARNING: Uncertain refinery starts now get retried while the first launch may still be alive

This false branch is explicitly the timeout-race case where /agents/start may still finish later. Setting the refinery back to idle makes the retry block at the top of processReviewQueue() call dispatchAgent() again on the next 5s tick, so the same MR can get a second launch before the original process reports a heartbeat or gt_done.

…l start

When startAgentInContainer gets a 500 with 'already running', the
agent process IS alive from a previous dispatch attempt. The previous
dispatch returned false (timeout or transient error) but the container
actually started the agent. Subsequent retry attempts get 'already
running' and return false, causing the DO to keep the refinery idle
while the agent is actually running.

Detect 'already running' in the error response and return true so the
DO correctly marks the agent as working.
// "Already running" means a previous dispatch succeeded — the agent
// IS alive in the container. Treat as success so the DO marks the
// agent as working and stops retrying.
if (response.status === 500 && text.includes('already running')) {
  return true;
}

WARNING: This also treats agents that are still only starting as a successful launch

process-manager.startAgent() throws the same "already running" error for both running and starting agents. After the 60s client timeout, the first /agents/start can still be in the middle of clone/worktree setup, the retry hits this branch, and the DO flips the agent back to working even though startup has not finished yet. If that original startup later fails, the container reports failed, but witnessPatrol() only recovers not_found and exited, so the bead can stay stuck in in_progress indefinitely.

The retry block was calling dispatchAgent without checking if the
agent is already running in the container. When the original dispatch
succeeded (but we got a false negative from timeout), the retry sends
another /agents/start which gets 'already running'. While
startAgentInContainer now handles this, it's cleaner to check first.

Now: check container status → if running, restore to 'working' (no
dispatch needed). If not running, dispatch normally.
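The decision described above reduces to a small branch on the container probe. The status values and action names here are illustrative assumptions, not the real API.

```typescript
// Hedged sketch of the check-before-dispatch flow; names are assumptions.
type Probe = 'running' | 'exited' | 'not_found';
type RetryAction = 'restore_working' | 'dispatch';

function chooseRetryAction(containerStatus: Probe): RetryAction {
  // If the agent process is already alive, the earlier dispatch succeeded
  // (the false negative came from a timeout), so restore 'working' without
  // sending a second /agents/start. Otherwise dispatch normally.
  return containerStatus === 'running' ? 'restore_working' : 'dispatch';
}
```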
console.log(
`${TOWN_LOG} processReviewQueue: retrying refinery dispatch for MR bead=${hookedMr.bead_id}`
);
await this.dispatchAgent(refineryForRetry, hookedMr);

WARNING: This retries refineries that already finished normally

reviewQueue.agentCompleted() intentionally leaves a refinery idle but still hooked until its delayed gt_done closes the MR. In that window checkAgentContainerStatus() reports the process as exited, so this branch re-dispatches the same MR on the next alarm tick even though the original run already completed. That can replay the review/merge or create a second PR before the late gt_done arrives.

jrf0110 added 2 commits March 19, 2026 16:20
Three fixes for PR-strategy review beads:

1. recoverStuckReviews fast-recovery: exclude MR beads with pr_url.
   The fast 2-min recovery was resetting PR-strategy beads to open,
   making them invisible to pollPendingPRs (which only queries
   in_progress beads). This caused an infinite cycle: pop → dispatch
   → refinery sees existing PR → gt_done with pr_url → unhook → 2min
   → fast recovery resets to open → pop again.

2. pollPendingPRs: use this.completeReviewWithResult (the TownDO
   wrapper) instead of reviewQueue.completeReviewWithResult directly.
   The wrapper emits events and calls dispatchUnblockedBeads, which
   is critical for unblocking downstream convoy beads after a PR merge.

3. closeOrphanedReviewBeads: use completeReviewWithResult('failed')
   instead of closeBead. closeBead only closes the MR bead without
   transitioning the source bead, leaving it stuck in in_review.
…essReviewQueue

When gt_done or pollPendingPRs closes an MR bead while the refinery is
still running in the container, the refinery stays 'working' and hooked
to the closed MR. processReviewQueue sees the refinery as non-idle and
re-queues new MR beads instead of dispatching them.

Add cleanup at the top of processReviewQueue: if the refinery is hooked
to a closed/failed MR bead, unhook it, set to idle, and clear the
checkpoint. This frees the refinery for new work immediately.
if (existingRefinery?.current_hook_bead_id) {
  const hookedMr = beadOps.getBead(this.sql, existingRefinery.current_hook_bead_id);
  if (hookedMr && (hookedMr.status === 'closed' || hookedMr.status === 'failed')) {
    agents.unhookBead(this.sql, existingRefinery.id);

WARNING: Unhooking the refinery here can misapply a late gt_done

This cleanup runs specifically when the MR is already terminal but the previous refinery process may still be running. Clearing the hook and marking the singleton idle at this point lets processReviewQueue() reuse the same refinery for a different MR before the old container exits, so a delayed gt_done from the first review can be attributed to the new one.

…minal MR

After gt_done closes an MR bead, the refinery process is still running
in the container (finishing its LLM turn). The cleanup block in
processReviewQueue was immediately unhooking the refinery and making
it available for new work, causing it to be dispatched for a new MR
while the old session was still active in the container.

Now check container status first: if the refinery is still running,
skip cleanup and return — don't pop any new MR this tick. Wait for
agentCompleted to fire (session ends), then clean up on the next tick.
This prevents dispatching a new review to a refinery whose container
session is still active from the previous review.
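The gate described in this commit can be sketched as a predicate over the MR status and the container probe. The shapes are assumptions for illustration.

```typescript
// Sketch of the gated cleanup: only reclaim a refinery hooked to a terminal
// MR once its container session has actually ended. Shapes are assumptions.
type Session = 'running' | 'exited' | 'not_found';

function canReclaimRefinery(mrStatus: string, session: Session): boolean {
  const terminal = mrStatus === 'closed' || mrStatus === 'failed';
  // While the old session is still running, skip cleanup and pop nothing
  // this tick; agentCompleted fires when the session ends and a later tick
  // cleans up, preventing a new review from reusing a busy refinery.
  return terminal && session !== 'running';
}
```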
@jrf0110

jrf0110 commented Mar 20, 2026

Superseded by #1336 which includes all changes from this branch plus the full reconciler implementation.

@jrf0110 jrf0110 closed this Mar 20, 2026
jrf0110 added a commit that referenced this pull request Mar 20, 2026
…gement

Replace the imperative patrol/scheduling/review-queue alarm phases with a
declarative reconciler that drains events, computes desired state, and
applies corrective actions.

Reconciler architecture (Phases 1-5):
- Event table (town_events) with 8 event types, dual-write in RPC handlers
- 6 reconciler functions: agents, beads, review queue, convoys, GUPP, GC
- applyEvent/applyAction for all event and action types
- Event-only agentDone/agentCompleted (no direct mutations)
- Lazy assignment for slingConvoy/startConvoy (#1249)
- Container status observation pre-phase (replaces witnessPatrol)
- Agent activity watermark via enriched heartbeat
- Invariant checker + reconciler metrics in getAlarmStatus
- Rework flow: gt_request_changes tool for refinery change requests
- Refinery system prompt wired in (buildRefinerySystemPrompt)

Bug fixes:
- Convoy landing MR cycling (reconciler created duplicates)
- Multi-agent hook mutual exclusion (hookBead unhooks stale agents)
- agentCompleted no longer closes beads when gt_done was not called
- GUPP NaN bug for agents with null last_activity_at
- Container status event flood (upsert instead of insert per tick)
- PR-strategy MR beads no longer block the refinery queue
- poll_pr refreshes updated_at to prevent false orphan kills
- Failed events retry on next tick instead of being dropped

Includes changes from #1244 (town issues rewrite):
- MR bead failure lifecycle fixes
- Review queue recovery improvements
- Convoy progress tracking fixes
- Post-deploy monitoring guide and scripts

Dead code removed: ~1,578 lines from patrol.ts, scheduling.ts,
review-queue.ts, and Town.do.ts.
Development

Successfully merging this pull request may close these issues.

fix: stuck refineries and polecats — MR bead failure paths don't update convoy progress
