fix(gastown): route MR bead failures through full review lifecycle to unblock convoys #1244
Conversation
… unblock convoys
Replace direct completeReview() calls in processReviewQueue() failure paths with completeReviewWithResult(), which properly updates convoy progress and returns source beads to in_progress for rework dispatch. Also adds recovery for orphaned source beads stuck in in_review and fixes bead status rollback consistency in dispatchAgent.
Code Review Summary
Status: 24 Issues Found | Recommendation: Address before merge
Issue Details
Other Observations (not in diff)
Issues found in unchanged code that cannot receive inline comments:
Files Reviewed (1 file)
Reviewed by gpt-5.4-20260305 · 682,768 tokens
…e open status
- Add NOT EXISTS subquery to recoverOrphanedSourceBeads to skip source beads that still have a pending (open/in_progress) MR child, preventing false recovery during rework cycles.
- Change recovery target status from in_progress to open so the scheduler can assign and dispatch a fresh polecat (the original agent was already unhooked by agentDone).
…wn DO
blockConcurrencyWhile in the constructor already guarantees initialization completes before any RPC method executes. The 64 redundant await this.ensureInitialized() calls in public methods were all no-ops (just checking an already-resolved promise).
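The init-gate pattern at issue can be modeled without the Cloudflare runtime. The sketch below uses hypothetical names (TownLike, someRpc; not the real Town DO) to show why awaiting an already-resolved init promise is a no-op:

```typescript
// Simplified model of the DO init pattern (NOT the real Cloudflare API).
// blockConcurrencyWhile() runs init before any method is delivered, so a
// per-method `await this.ready` only awaits an already-resolved promise.
class TownLike {
  private ready: Promise<void>;
  private initialized = false;
  public order: string[] = [];

  constructor() {
    // Stand-in for ctx.blockConcurrencyWhile(() => this.init()).
    this.ready = this.init();
  }

  private async init(): Promise<void> {
    // Runs synchronously up to its first await (here: no awaits at all),
    // so init is recorded before any RPC can execute.
    this.order.push('init');
    this.initialized = true;
  }

  // The redundant guard: by the time any RPC runs, `ready` is resolved,
  // so this await adds only a microtask hop.
  async someRpc(): Promise<boolean> {
    await this.ready;
    this.order.push('rpc');
    return this.initialized;
  }
}
```

In this model, removing the `await this.ready` line changes nothing observable, which is the commit's argument for deleting the 64 guards.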
The town ID is persisted to DO storage on creation and loaded during initializeDatabase(). Every subsequent RPC call was redundantly calling setTownId, writing the same value to storage on every request. Removed 21 redundant calls from router.ts, mayor.handler.ts, org-towns.handler.ts, and configureRig — keeping only the 3 creation-time calls.

Also removed a misleading comment in alarm() claiming processReviewQueue must run before schedulePendingWork. The ordering is irrelevant: the refinery and polecats operate on different bead types, and the DAG blocker check at dispatch time prevents premature downstream dispatch regardless of scheduling order.
… fix refinery recovery
Extract dispatchAgent, dispatchUnblockedBeads, schedulePendingWork, and hasActiveWork into town/scheduling.ts (~360 lines out of Town.do.ts).

Improve refinery recovery after container restart:
- Reorder alarm: patrols run BEFORE scheduling so zombie agents detected by witnessPatrol are dispatched in the same tick instead of waiting for the next one.
- Exclude refineries from schedulePendingWork so they always go through processReviewQueue with the full system prompt (branch, strategy, gates). recoverStuckReviews resets the MR bead to open after the timeout, and processReviewQueue re-pops it correctly.
- Reduce REVIEW_RUNNING_TIMEOUT_MS from 5 min to 2.5 min so the review queue recovers faster after a container restart.

Parallelize the alarm loop for lower latency:
- Phase 1: witnessPatrol + deaconPatrol run in parallel (patrols)
- Phase 2: review pipeline (sequential) runs in parallel with schedulePendingWork (disjoint agent types)
- Phase 3: mail delivery, nudge expiry, re-escalation, and triage dispatch all run in parallel (fully independent)
- Zombie detection container status checks run in parallel via Promise.allSettled instead of a sequential for-loop.
// Set slightly above the dispatch cooldown (2 min) so that zombie
// detection + cooldown expiry have a chance to recover the agent
// before we force-reset the MR bead.
const REVIEW_RUNNING_TIMEOUT_MS = 2.5 * 60 * 1000;
WARNING: This timeout can reopen healthy reviews after 150 seconds
recoverStuckReviews() uses this same constant for every in-progress MR, but the MR bead's updated_at is only touched when the review starts and finishes. Any direct-merge refinery run that takes longer than 2.5 minutes will be reset to open and can be popped a second time while the original refinery is still working, which risks duplicate review/merge attempts.
git-lfs smudge filters triggered by worktree creation (git worktree add, git reset --hard) need credentials to download LFS objects from the batch API. The token is embedded in the remote URL, but some git-lfs versions resolve credentials through the credential helper chain for the LFS batch endpoint (which uses a different URL path). Configure a credential-store helper on the bare repo right after clone and on every fetch (in case the token rotated), so all worktrees — including the browse worktree created by /repos/setup — inherit working credentials for LFS operations.
const url = new URL(gitUrl);
const credentialLine =
  gitlabToken && (url.hostname.includes('gitlab') || envVars.GITLAB_INSTANCE_URL)
    ? `https://oauth2:${gitlabToken}@${url.hostname}`
WARNING: hostname drops custom HTTPS ports
URL.hostname strips :port, so a repo URL like https://git.example.com:8443/org/repo.git writes https://oauth2:...@git.example.com / https://x-access-token:...@git.example.com into the credential store. Git credential matching treats git.example.com:8443 as a different host, so LFS batch requests on GitHub/GitLab Enterprise with non-default ports will still miss the helper and fail. Use url.host (or url.origin) when composing the credential line.
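The difference is easy to confirm with the WHATWG URL class. A minimal sketch (TOKEN and the host are placeholders, not values from the PR):

```typescript
// WHATWG URL: `hostname` drops the port, `host` keeps it.
const repoUrl = new URL('https://git.example.com:8443/org/repo.git');

// What the reviewed code composes (port lost, so credential lookup misses):
const broken = `https://oauth2:TOKEN@${repoUrl.hostname}`;

// Port-preserving variant the review suggests:
const fixed = `https://oauth2:TOKEN@${repoUrl.host}`;

console.log(broken); // https://oauth2:TOKEN@git.example.com
console.log(fixed);  // https://oauth2:TOKEN@git.example.com:8443
```

Git's credential matching keys on host including port, which is why only the second line matches `git.example.com:8443`.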
After container restarts, review failures, or rework cycles, beads can end up in 'open' with a stale assignee_agent_bead_id — the agent was unhooked but the bead still references it. Neither feedStrandedConvoys (requires assignee IS NULL) nor schedulePendingWork (requires agent hooked) picks these up, leaving them permanently stuck. New rehookOrphanedBeads() patrol function finds open issue beads where the assigned agent's current_hook_bead_id doesn't point back to the bead, and re-hooks a polecat so schedulePendingWork dispatches it on the next tick.
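The orphan predicate described above can be sketched in memory (hypothetical row shapes and function name, not the real schema or SQL):

```typescript
// In-memory sketch of the rehookOrphanedBeads predicate: an open issue bead
// whose assigned agent no longer hooks back to it. Row shapes are assumed.
interface BeadRow {
  bead_id: string;
  status: string;
  type: string;
  assignee_agent_bead_id: string | null;
}

interface AgentRow {
  bead_id: string;
  current_hook_bead_id: string | null;
}

function findOrphanedBeads(beads: BeadRow[], agents: Map<string, AgentRow>): BeadRow[] {
  return beads.filter(
    (b) =>
      b.status === 'open' &&
      b.type === 'issue' &&
      b.assignee_agent_bead_id !== null &&
      // Orphaned: the assigned agent does not point back at this bead.
      agents.get(b.assignee_agent_bead_id)?.current_hook_bead_id !== b.bead_id
  );
}
```

A healthy bead (agent hooks back) is excluded; a bead whose agent was unhooked or re-hooked elsewhere matches and would get a fresh polecat.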
INNER JOIN ${agent_metadata}
  ON ${agent_metadata.bead_id} = ${beads.assignee_agent_bead_id}
WHERE ${beads.status} = 'open'
  AND ${beads.type} = 'issue'
WARNING: Triage request beads match this recovery query
createTriageRequest() intentionally creates open issue beads with assignee_agent_bead_id set, but it never hooks an agent to those beads. That means any open gt:triage-request whose assigned agent is unhooked or working a different bead will satisfy this predicate, and the code below will keep trying to hookBead() it every alarm tick. hookBead() rejects triage-request labels, so this turns every such request into a permanent warning/log-spam loop instead of recovering real orphaned work. Exclude triage-request labels from this scan.
… agent status as not_found
Container fetch calls through the Container class auto-start the container and wait for port readiness, which can block the alarm loop during container restarts/deploys. Add AbortSignal.timeout to:
- ensureContainerReady health check (5s)
- checkAgentContainerStatus (5s)
- startAgentInContainer (60s)
- refreshContainerToken (10s)

Also treat non-OK responses and errors in checkAgentContainerStatus as not_found instead of unknown, so zombie detection resets the agent immediately rather than leaving it stuck in working state.
…e re-dispatch
witnessPatrol was setting last_activity_at = now() when resetting dead agents to idle, which triggered the 2-minute dispatch cooldown in schedulePendingWork. The cooldown exists to prevent double-dispatch of live agents with an in-flight container start — but a confirmed-dead agent has no in-flight dispatch to collide with. Set last_activity_at = NULL instead so schedulePendingWork picks up the agent on the very next alarm tick. Reduces non-review agent recovery from ~2+ min to ~5-10 s after container restart.
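The eligibility rule this commit relies on can be sketched as a small predicate (the constant and function name are assumed from the description, not taken from the code):

```typescript
// Sketch of the dispatch-cooldown check implied above. A NULL
// last_activity_at means "no in-flight dispatch to collide with", so the
// agent is eligible immediately; otherwise the 2-minute cooldown applies.
const DISPATCH_COOLDOWN_MS = 2 * 60 * 1000;

function cooldownExpired(lastActivityAt: number | null, now: number): boolean {
  if (lastActivityAt === null) return true; // confirmed-dead reset: dispatch next tick
  return now - lastActivityAt >= DISPATCH_COOLDOWN_MS;
}
```

Setting the column to NULL rather than now() is what collapses recovery latency from one full cooldown window to a single alarm tick.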
try {
  const resp = await container.fetch('http://container/refresh-token', {
    method: 'POST',
    signal: AbortSignal.timeout(10_000),
WARNING: Refresh-token timeouts now abort agent startup
ensureContainerToken() only suppresses network/"fetch" failures. AbortSignal.timeout() rejects with an abort/timeout error instead, so a slow cold-start on /refresh-token now bubbles out of this helper. startAgentInContainer() catches that and returns false, even though the token was already persisted via setEnvVar(), so dispatch can fail spuriously whenever the container needs more than 10s to come up.
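AbortSignal.timeout() rejects with a DOMException named 'TimeoutError', while fetch network failures surface as TypeError, so a guard that only matches the latter lets timeouts escape. A sketch of a broader classifier (isSuppressibleFetchFailure is a hypothetical name, not from the codebase):

```typescript
// Classify errors from a fetch guarded by AbortSignal.timeout().
// - AbortSignal.timeout() rejections carry name 'TimeoutError' (or
//   'AbortError' for plain aborts).
// - Network-level fetch failures surface as TypeError ("fetch failed").
// A catch block that only suppresses the TypeError shape will let
// timeouts bubble out, which is the bug the warning describes.
function isSuppressibleFetchFailure(err: unknown): boolean {
  const name = (err as { name?: string } | null)?.name;
  if (name === 'TimeoutError' || name === 'AbortError') return true;
  return err instanceof TypeError;
}
```

With a classifier like this, a slow /refresh-token cold start would be swallowed the same way a connection refusal is, instead of failing the dispatch.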
  // instead of waiting for the 2-hour GUPP timeout.
  if (response.status === 404) return { status: 'not_found' };
- if (!response.ok) return { status: 'unknown' };
+ if (!response.ok) return { status: 'not_found' };
WARNING: Transient status probe failures now look like missing agents
witnessPatrol() treats not_found as a confirmed dead agent and immediately flips it back to idle with no cooldown. Returning not_found for 5xx/timeout/parse failures means a live agent can be redispatched just because the status endpoint was slow during a deploy or container restart, which risks duplicate work on the same hooked bead. Keeping a distinct retry/unknown state for non-404 failures avoids that false-positive recovery path.
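The mapping the review recommends can be sketched as a small classifier (names assumed; the real function also inspects the response body):

```typescript
type AgentContainerStatus = 'running' | 'not_found' | 'exited' | 'unknown';

// Only a definitive 404 means "the container confirms this agent is gone".
// Every other non-OK outcome (5xx, transient probe failure) stays 'unknown'
// so patrols retry instead of flipping a live agent back to idle.
function classifyStatusResponse(httpStatus: number): AgentContainerStatus {
  if (httpStatus === 404) return 'not_found';
  if (httpStatus >= 200 && httpStatus < 300) return 'running'; // body would refine this
  return 'unknown';
}
```

Keeping 'unknown' distinct is what prevents a slow status endpoint during a deploy from being indistinguishable from a dead agent.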
… with unhooked agents
rehookOrphanedBeads previously only recovered 'open' beads. After container restarts, beads can also get stuck in 'in_progress' with a stale assignee — the agent was unhooked by witnessPatrol but the bead status was never rolled back. Neither schedulePendingWork nor any other recovery path picks these up.

Now also matches in_progress beads where the assigned agent is not working and not hooked. Resets in_progress beads to open before re-hooking so the dispatch flow starts cleanly. Filters out agents in 'working' status to avoid interfering with transient races during dispatch.
… source beads
Gap I: MR beads in in_progress with pr_url and a dead refinery were stuck permanently — recoverStuckReviews excludes pr_url, schedulePendingWork excludes refineries, and closeOrphanedReviewBeads only checked open beads. Expanded closeOrphanedReviewBeads to also match in_progress MR beads.

Gap G: recoverOrphanedSourceBeads returned source beads to open but left the stale assignee_agent_bead_id, preventing feedStrandedConvoys (requires assignee IS NULL) from re-hooking convoy beads. Now clears the assignee so both feedStrandedConvoys and rehookOrphanedBeads can assign a fresh agent.
  LEFT JOIN ${agent_metadata} ON ${beads.assignee_agent_bead_id} = ${agent_metadata.bead_id}
  WHERE ${beads.type} = 'merge_request'
-   AND ${beads.status} = 'open'
+   AND ${beads.status} IN ('open', 'in_progress')
WARNING: Normal PR-strategy reviews now look orphaned after 30 minutes
markReviewInReview() leaves PR-backed MR beads in in_progress, and agentDone() immediately unhooks the refinery so its status becomes idle. With this wider predicate, any healthy PR review that is simply waiting on a human reviewer for more than 30 minutes now satisfies the orphan scan and gets closed even though nothing is actually wrong.
sql,
/* sql */ `
  UPDATE ${beads}
  SET ${beads.columns.assignee_agent_bead_id} = NULL
WARNING: Clearing the assignee strands standalone rework beads
Before this change, rehookOrphanedBeads() could recover a standalone source bead because it still had its stale assignee_agent_bead_id. Setting that column to NULL means only feedStrandedConvoys() can pick it back up, but that patrol only scans convoy-tracked beads. A non-convoy issue that hits this recovery path will now stay open forever with no agent re-dispatched.
…patchAgent rollback
dispatchAgent's failure rollback paths used agent.current_hook_bead_id to roll back the bead status. But when called from rework dispatch (completeReviewWithResult, failReviewWithRework, agentCompleted), the agent snapshot is fetched by getOrCreateAgent BEFORE hookBead is called. So agent.current_hook_bead_id is null in the snapshot, and the bead rollback is silently skipped — leaving the bead stuck in in_progress with an idle unhooked agent.

Use bead.bead_id (the actual bead being dispatched) for rollback, which is always correct regardless of when the agent snapshot was taken.
  `,
  [agent.id]
);
beadOps.updateBeadStatus(ctx.sql, bead.bead_id, 'open', agent.id);
WARNING: Rollback can reopen beads that finished while startup was in flight
dispatchAgent() awaits container startup, so another RPC/alarm cycle can still move bead.bead_id to closed or failed before this rollback runs. updateBeadStatus() overwrites whatever status is there now, so both rollback branches will reopen that terminal bead to open even though the agent never actually started. Re-read the bead and only roll it back when it is still in_progress (and ideally still assigned to this agent).
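The guarded rollback the review suggests is a compare-and-set: re-read the bead and only reopen it if it is still in flight for this agent. An in-memory sketch (hypothetical store and names, not the real SQL layer):

```typescript
// Compare-and-set rollback: only reopen the bead if it is still
// in_progress AND still assigned to the agent whose dispatch failed.
// A bead that reached a terminal state while container startup was in
// flight is left untouched.
interface Bead {
  bead_id: string;
  status: string;
  assignee_agent_bead_id: string | null;
}

function rollbackIfStillDispatched(
  store: Map<string, Bead>,
  beadId: string,
  agentId: string
): boolean {
  const bead = store.get(beadId);
  if (!bead) return false;
  if (bead.status !== 'in_progress') return false; // finished/closed meanwhile
  if (bead.assignee_agent_bead_id !== agentId) return false; // reassigned elsewhere
  bead.status = 'open';
  return true;
}
```

In SQL this would be a WHERE clause on status and assignee in the UPDATE itself, so the check and the write are atomic.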
…h active refinery
recoverStuckReviews was resetting in_progress MR beads to open every alarm tick when they exceeded the timeout, even when the refinery agent was actively working on the review. This caused infinite reset loops where the refinery's work was repeatedly interrupted. Add a NOT EXISTS check to exclude MR beads whose assigned refinery agent is currently in 'working' status. Also restore the timeout to 5 min since the working-agent guard now prevents false positives — the timeout only fires when the agent is truly dead/idle.
…TCH_ATTEMPTS
After a deploy + container eviction, dispatch failures burn through MAX_DISPATCH_ATTEMPTS (was 5) within 25 seconds, permanently failing beads before the container has time to start. Two changes:
1. Increase MAX_DISPATCH_ATTEMPTS from 5 to 20 to tolerate longer container cold starts.
2. Set last_activity_at = now() on dispatch failure to trigger the 2-min dispatch cooldown, giving the container time to start up before the next retry.
Combined with 20 max attempts, this allows up to 40 min of retries.
…f throwing 500 When the container restarts during a review, witnessPatrol unhooks the refinery agent. If the refinery process resumes and calls gt_done after being unhooked, agentDone threw 'Agent has no hooked bead' → 500 error. This triggered triage requests and left the bead in a stuck state. Return gracefully instead of throwing — the recovery paths (recoverStuckReviews, rehookOrphanedBeads) will handle the bead lifecycle.
…back processReviewQueue passed rigConfig.kilocodeToken directly to startAgentInContainer, which is undefined when the token was stored in the town config rather than the rig config. Polecats work because dispatchAgent calls resolveKilocodeToken() which falls back to the town config. The refinery got no token → container rejected the start → every review failed with 'Refinery container failed to start'. Use the same resolveKilocodeToken() fallback chain.
…ainer Worktree creation fails when repos contain LFS-tracked files (e.g. .mp4) and the LFS batch endpoint can't resolve credentials. This blocks agent start entirely — the container returns a non-200 for /agents/start and processReviewQueue fails the review with 'Refinery container failed to start'. Set GIT_LFS_SKIP_SMUDGE=1 globally in the git exec helper. Agents don't need binary assets — LFS files are checked out as pointer files instead of downloading the actual content.
Belt-and-suspenders alongside the GIT_LFS_SKIP_SMUDGE env var: install a .gitconfig for the agent user that configures the LFS filter to skip smudge and excludes all files from LFS fetch. This persists in the Docker image layer and covers any code path that runs git outside the exec() helper.
…finery
Three related fixes to stop recoverStuckReviews from resetting MR beads while the refinery is actively reviewing:
1. checkAgentContainerStatus: restore 'unknown' for non-404 errors and timeouts instead of aggressive 'not_found'. Only 404 (container confirms agent doesn't exist) triggers zombie reset. Timeout/errors return 'unknown' which witnessPatrol ignores — the GUPP system handles truly dead agents after 2 hours.
2. recoverStuckReviews: exclude MR beads where ANY agent is hooked (not just 'working' agents). After witnessPatrol resets a refinery to idle but keeps the hook, the old check saw 'no working agent' and reset the MR bead while the refinery was about to be re-dispatched.
3. Increase REVIEW_RUNNING_TIMEOUT_MS from 5 min to 15 min. Reviews legitimately take 5-10 min (clone + review + test + merge).
… agent type
dispatchAgent was rolling back beads from in_progress to open when startAgentInContainer returned false. But the container may have actually started the agent (timeout race — the fetch timed out but the container continued setup). The agent starts working, but the DO already rolled the bead back to open.

Remove the bead rollback for both the !started and exception paths. Leave the bead in in_progress with the agent idle+hooked. Two outcomes:
- Agent actually started → works normally → gt_done closes bead
- Agent truly failed → rehookOrphanedBeads recovers after 2 min

This is the same pattern applied to processReviewQueue in the previous commit, now extended to the general dispatch path.
ctx.sql,
/* sql */ `
  UPDATE ${agent_metadata}
  SET ${agent_metadata.columns.status} = 'idle',
WARNING: Uncertain starts still get redispatched after the cooldown
This branch marks the agent idle but keeps its hook on the bead. schedulePendingWork() only checks for status = 'idle', a non-null current_hook_bead_id, and an expired last_activity_at, so a /agents/start timeout race that actually launched the process will be started a second time about two minutes later if the first run is still active. That duplicates work instead of waiting for patrol/witness logic to prove the original start really failed.
Comprehensive fix addressing the remaining review agent findings:
1. agentCompleted: don't unhook refinery. Leave hook intact so recoverStuckReviews' guard works and gt_done can arrive after agentCompleted without a race window.
2. witnessPatrol: only act on 'exited' for refineries, skip 'not_found'. not_found is ambiguous (container restarting, status check timeout) and falsely setting the refinery to idle enables premature recovery.
3. completeReviewWithResult failure/conflict path: set source bead to 'open' (not in_progress) and clear assignee. This lets the normal scheduling path (feedStrandedConvoys → hookBead → schedulePendingWork) handle rework instead of fire-and-forget dispatch that races with patrol recovery.
4. Remove all fire-and-forget rework dispatch code from failReviewWithRework, TownDO.completeReviewWithResult, and TownDO.agentCompleted. Rework is now exclusively handled by the alarm loop's scheduling/patrol functions, eliminating the race between dispatch and recovery.
sql,
/* sql */ `
  UPDATE ${beads}
  SET ${beads.columns.assignee_agent_bead_id} = NULL
WARNING: Clearing the assignee here strands standalone rework
feedStrandedConvoys() only re-hooks convoy issues, and rehookOrphanedBeads() only repairs beads whose assignee_agent_bead_id is still set. When a standalone review fails through this path, the source bead is reopened with no hook and no assignee, so nothing ever picks it back up.
// merge succeeded. recoverStuckReviews (which checks for
// status='working') handles the case where gt_done never arrives.
//
// No-op for the bead — just fall through to mark agent idle.
WARNING: Falling through to idle while the refinery stays hooked makes the singleton reusable too early
getOrCreateAgent() returns the existing refinery even when current_hook_bead_id is still set, and processReviewQueue() only gates on status !== 'idle'. Once this path marks the agent idle, the next alarm can pop another MR and then hit hookBead()'s "already hooked" error, or let a late gt_done attach to the wrong review.
…rge path logging
witnessPatrol was setting non-refinery agents to idle on transient not_found container status, causing working agents to appear idle. Extend the not_found skip to all agent types — only act on confirmed exited status. Add diagnostic logging to the completeReviewWithResult merged path to trace why source beads are not being closed after successful merges.
… for abandoned MR beads
Two fixes for MR beads stuck in in_progress:
1. processReviewQueue: unhook the refinery from its previous MR bead before hooking to the new one. agentCompleted preserves the refinery hook, so when processReviewQueue runs next, hookBead throws 'already hooked'. This crashed processReviewQueue silently (caught by Promise.allSettled), leaving the MR bead stuck in in_progress with no agent.
2. recoverStuckReviews: add fast 2-min recovery for MR beads that are in_progress with NO agent hooked at all. These are clearly abandoned (hookBead threw or the refinery was unhooked by another path). The 30-min timeout remains for MR beads with an idle refinery hooked (waiting for gt_done to arrive).
// If the refinery is still hooked to a previous MR bead (agentCompleted
// preserves hooks for refineries), unhook first.
if (refineryAgent.current_hook_bead_id && refineryAgent.current_hook_bead_id !== entry.id) {
  agents.unhookBead(this.sql, refineryAgent.id);
WARNING: This repoints late refinery completion onto the wrong MR
reviewQueue.agentCompleted() intentionally leaves refineries idle but still hooked so a delayed gt_done can recover the MR they were actually reviewing. Clearing that hook here and then reusing the same singleton for entry.id means any late completion from the previous run will now be applied to the new review instead.
WHERE ${beads.type} = 'merge_request'
  AND ${beads.status} = 'in_progress'
  AND ${beads.updated_at} < ?
  AND NOT EXISTS (
WARNING: This fast-recovery scan reopens healthy PR-backed reviews
After agentDone() stores a pr_url, PR-strategy MR beads intentionally stay in_progress with no hooked refinery while a human reviews the PR. Because this new query does not exclude rows with review_metadata.pr_url, any healthy PR review older than two minutes will be reset to open and can dispatch a second refinery run against the same branch.
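The exclusion the review calls for can be sketched as a predicate over MR rows (hypothetical shapes and names; the real code is SQL over beads joined with review metadata):

```typescript
// Fast recovery should only fire for MRs that are truly abandoned:
// in_progress, no refinery hooked, stale, AND not PR-backed. An MR with a
// pr_url is a healthy PR-strategy review waiting on a human and must be
// left alone.
interface MrBead {
  bead_id: string;
  status: string;
  updated_at: number;
  pr_url: string | null;
  hooked: boolean;
}

function isAbandonedMr(b: MrBead, now: number, timeoutMs: number): boolean {
  return (
    b.status === 'in_progress' &&
    !b.hooked && // no refinery attached at all
    b.pr_url === null && // exclude PR-backed reviews waiting on a human
    now - b.updated_at > timeoutMs
  );
}
```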
Skipping not_found for refineries caused them to stay 'working' in the DB permanently when the container process died. witnessPatrol never reset them, and recoverStuckReviews' guard (no WORKING agent) meant 30-min recovery also didn't fire.

Now set the refinery to idle on both not_found and exited (same as before but without skipping). Keep the hook intact so gt_done can still arrive and close the MR. If the refinery is idle+hooked:
- processReviewQueue unhooks it before hooking to the new MR (prev fix)
- recoverStuckReviews recovers after 30 min (guard: no working agent)
- 2-min fast recovery handles MR beads left with no hook at all

Non-refineries: reset to idle on both not_found and exited (unchanged).
if (result.status !== 'fulfilled') continue;
const { agentId, containerInfo } = result.value;

if (containerInfo.status === 'not_found' || containerInfo.status === 'exited') {
WARNING: not_found is still too ambiguous for zombie cleanup
checkAgentContainerStatus() can still report not_found during container restarts or transient probe failures, so handling it the same as a confirmed exited process flips live agents to idle. For refineries that reopens the reuse path at processReviewQueue(), and for polecats it makes the scheduler eligible to launch a duplicate retry before the original process reports gt_done.
After a container restart, the first startAgentInContainer call may fail (container not ready). The refinery is left idle+hooked to the MR bead (in_progress). Without retries, the MR sits in in_progress for 30 min until recoverStuckReviews resets it. Now processReviewQueue checks for this state at the start of each tick: if the refinery is idle+hooked to an in_progress MR bead, re-dispatch it via dispatchAgent. This retries every 5 seconds (alarm interval) until the container accepts the start.
  `${TOWN_LOG} processReviewQueue: retrying refinery dispatch for MR bead=${hookedMr.bead_id}`
);
await this.dispatchAgent(refineryForRetry, hookedMr);
return;
WARNING: One stuck refinery can block every other rig's review queue
This retry runs before popReviewQueue() and always returns afterward. Because it grabs the first refinery in the town rather than a refinery tied to the entry being processed, an idle hooked refinery in rig A will monopolize processReviewQueue() until it stops matching this branch, so queued MRs for rig B never start.
…n any MR in-progress
Three fixes for the cascading MR bead problem:
1. processReviewQueue start failure: keep the refinery hook intact instead of unhooking. The retry block at the top of processReviewQueue handles idle+hooked refineries. By preserving the hook, we prevent popping new MR beads while the container is unavailable — the retry block retries the same MR each tick.
2. popReviewQueue: don't pop ANY open MR if there's already an in_progress MR for the same rig. Since there's only one refinery per rig, popping a second MR while the first is in_progress is always wrong.
3. Clear refinery checkpoint when unhooking from a previous MR bead to prevent stale conversation context leaking into new reviews.
AND NOT EXISTS (
  SELECT 1 FROM ${beads} AS active_mr
  WHERE active_mr.${beads.columns.type} = 'merge_request'
    AND active_mr.${beads.columns.status} = 'in_progress'
WARNING: PR-backed reviews now block every later MR on the same rig
markReviewInReview() leaves PR-strategy beads in in_progress while humans review the external PR, but the refinery is already free once that PR is created. This rig-wide NOT EXISTS treats those waiting PRs as active refinery work, so one long-lived PR can freeze every new review on the rig until a human merges or closes it.
  if (!started) {
-   agents.unhookBead(this.sql, refineryAgent.id);
+   // Keep hook intact — the retry block at the top of processReviewQueue
WARNING: Uncertain refinery starts now get retried while the first launch may still be alive
This false branch is explicitly the timeout-race case where /agents/start may still finish later. Setting the refinery back to idle makes the retry block at the top of processReviewQueue() call dispatchAgent() again on the next 5s tick, so the same MR can get a second launch before the original process reports a heartbeat or gt_done.
…l start
When startAgentInContainer gets a 500 with 'already running', the agent process IS alive from a previous dispatch attempt. The previous dispatch returned false (timeout or transient error) but the container actually started the agent. Subsequent retry attempts get 'already running' and return false, causing the DO to keep the refinery idle while the agent is actually running. Detect 'already running' in the error response and return true so the DO correctly marks the agent as working.
// "Already running" means a previous dispatch succeeded — the agent
// IS alive in the container. Treat as success so the DO marks the
// agent as working and stops retrying.
if (response.status === 500 && text.includes('already running')) {
WARNING: This also treats agents that are still only starting as a successful launch
process-manager.startAgent() throws the same "already running" error for both running and starting agents. After the 60s client timeout, the first /agents/start can still be in the middle of clone/worktree setup, the retry hits this branch, and the DO flips the agent back to working even though startup has not finished yet. If that original startup later fails, the container reports failed, but witnessPatrol() only recovers not_found and exited, so the bead can stay stuck in in_progress indefinitely.
The retry block was calling dispatchAgent without checking if the agent is already running in the container. When the original dispatch succeeded (but we got a false negative from timeout), the retry sends another /agents/start which gets 'already running'. While startAgentInContainer now handles this, it's cleaner to check first. Now: check container status → if running, restore to 'working' (no dispatch needed). If not running, dispatch normally.
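The check-then-dispatch order can be sketched as a small decision function. Names are hypothetical, and the explicit 'wait' branch for ambiguous probes is an assumption consistent with the earlier warnings; the commit itself only distinguishes running vs not running:

```typescript
// Decide what the retry block should do after probing container status,
// instead of blindly re-sending /agents/start.
type ProbeStatus = 'running' | 'not_found' | 'exited' | 'unknown';
type RetryDecision = 'restore_working' | 'dispatch' | 'wait';

function retryDecision(probe: ProbeStatus): RetryDecision {
  if (probe === 'running') return 'restore_working'; // original dispatch actually succeeded
  if (probe === 'not_found' || probe === 'exited') return 'dispatch'; // safe to start again
  return 'wait'; // ambiguous probe: don't risk a duplicate launch
}
```

Probing first turns the "already running" 500 from an error path into a cheap status read, and the 'wait' branch avoids re-dispatching on a probe that merely timed out.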
console.log(
  `${TOWN_LOG} processReviewQueue: retrying refinery dispatch for MR bead=${hookedMr.bead_id}`
);
await this.dispatchAgent(refineryForRetry, hookedMr);
WARNING: This retries refineries that already finished normally
reviewQueue.agentCompleted() intentionally leaves a refinery idle but still hooked until its delayed gt_done closes the MR. In that window checkAgentContainerStatus() reports the process as exited, so this branch re-dispatches the same MR on the next alarm tick even though the original run already completed. That can replay the review/merge or create a second PR before the late gt_done arrives.
Three fixes for PR-strategy review beads:
1. recoverStuckReviews fast-recovery: exclude MR beads with pr_url.
The fast 2-min recovery was resetting PR-strategy beads to open,
making them invisible to pollPendingPRs (which only queries
in_progress beads). This caused an infinite cycle: pop → dispatch
→ refinery sees existing PR → gt_done with pr_url → unhook → 2min
→ fast recovery resets to open → pop again.
2. pollPendingPRs: use this.completeReviewWithResult (the TownDO
wrapper) instead of reviewQueue.completeReviewWithResult directly.
The wrapper emits events and calls dispatchUnblockedBeads, which
is critical for unblocking downstream convoy beads after a PR merge.
3. closeOrphanedReviewBeads: use completeReviewWithResult('failed')
instead of closeBead. closeBead only closes the MR bead without
transitioning the source bead, leaving it stuck in in_review.
…essReviewQueue
When gt_done or pollPendingPRs closes an MR bead while the refinery is still running in the container, the refinery stays 'working' and hooked to the closed MR. processReviewQueue sees the refinery as non-idle and re-queues new MR beads instead of dispatching them. Add cleanup at the top of processReviewQueue: if the refinery is hooked to a closed/failed MR bead, unhook it, set it to idle, and clear the checkpoint. This frees the refinery for new work immediately.
Force-pushed from 0ac009c to 8b62a7e
if (existingRefinery?.current_hook_bead_id) {
  const hookedMr = beadOps.getBead(this.sql, existingRefinery.current_hook_bead_id);
  if (hookedMr && (hookedMr.status === 'closed' || hookedMr.status === 'failed')) {
    agents.unhookBead(this.sql, existingRefinery.id);
WARNING: Unhooking the refinery here can misapply a late gt_done
This cleanup runs specifically when the MR is already terminal but the previous refinery process may still be running. Clearing the hook and marking the singleton idle at this point lets processReviewQueue() reuse the same refinery for a different MR before the old container exits, so a delayed gt_done from the first review can be attributed to the new one.
…minal MR After gt_done closes an MR bead, the refinery process is still running in the container (finishing its LLM turn). The cleanup block in processReviewQueue was immediately unhooking the refinery and making it available for new work, causing it to be dispatched for a new MR while the old session was still active in the container. Now check container status first: if the refinery is still running, skip cleanup and return — don't pop any new MR this tick. Wait for agentCompleted to fire (session ends), then clean up on the next tick. This prevents dispatching a new review to a refinery whose container session is still active from the previous review.
Superseded by #1336, which includes all changes from this branch plus the full reconciler implementation.
…gement

Replace the imperative patrol/scheduling/review-queue alarm phases with a declarative reconciler that drains events, computes desired state, and applies corrective actions.

Reconciler architecture (Phases 1-5):
- Event table (town_events) with 8 event types, dual-write in RPC handlers
- 6 reconciler functions: agents, beads, review queue, convoys, GUPP, GC
- applyEvent/applyAction for all event and action types
- Event-only agentDone/agentCompleted (no direct mutations)
- Lazy assignment for slingConvoy/startConvoy (#1249)
- Container status observation pre-phase (replaces witnessPatrol)
- Agent activity watermark via enriched heartbeat
- Invariant checker + reconciler metrics in getAlarmStatus
- Rework flow: gt_request_changes tool for refinery change requests
- Refinery system prompt wired in (buildRefinerySystemPrompt)

Bug fixes:
- Convoy landing MR cycling (reconciler created duplicates)
- Multi-agent hook mutual exclusion (hookBead unhooks stale agents)
- agentCompleted no longer closes beads when gt_done was not called
- GUPP NaN bug for agents with null last_activity_at
- Container status event flood (upsert instead of insert per tick)
- PR-strategy MR beads no longer block the refinery queue
- poll_pr refreshes updated_at to prevent false orphan kills
- Failed events retry on next tick instead of being dropped

Includes changes from #1244 (town issues rewrite):
- MR bead failure lifecycle fixes
- Review queue recovery improvements
- Convoy progress tracking fixes
- Post-deploy monitoring guide and scripts

Dead code removed: ~1,578 lines from patrol.ts, scheduling.ts, review-queue.ts, and Town.do.ts.
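The drain-events → compute-desired-state → apply-actions shape this commit describes can be sketched in miniature. The event and action shapes below (`bead_closed`, `dispatch`, `tick`) are hypothetical stand-ins, not the town_events schema or the real reconciler functions:

```typescript
interface Event { type: "bead_closed" | "bead_started"; beadId: string; }
interface Action { kind: "dispatch"; beadId: string; }

interface TownState {
  beads: Map<string, "open" | "in_progress" | "closed">;
}

// Fold one drained event into the observed state (no side effects elsewhere).
function applyEvent(state: TownState, ev: Event): void {
  if (ev.type === "bead_closed") state.beads.set(ev.beadId, "closed");
  if (ev.type === "bead_started") state.beads.set(ev.beadId, "in_progress");
}

// Declarative core: compare observed state to desired state and emit the
// corrective actions; here, every open bead should get a dispatch.
function reconcile(state: TownState): Action[] {
  const actions: Action[] = [];
  for (const [id, status] of state.beads) {
    if (status === "open") actions.push({ kind: "dispatch", beadId: id });
  }
  return actions;
}

// One alarm tick: drain the event queue, then reconcile.
function tick(state: TownState, events: Event[]): Action[] {
  for (const ev of events) applyEvent(state, ev);
  return reconcile(state);
}
```

The appeal of this structure over imperative phases is that failed actions are naturally retried: if a dispatch does not land, the bead is still `open` next tick and the same corrective action is recomputed.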
Summary
- Three failure paths in `processReviewQueue()` (no rig_id, no rig config, container fails to start) were calling `reviewQueue.completeReview()` directly — a raw SQL UPDATE that bypasses `updateBeadStatus()`, skipping convoy progress updates and leaving source beads stuck in `in_review` permanently. Replaced all three with a new `failReviewWithRework()` helper that routes through `completeReviewWithResult()`, which properly logs events, updates convoy progress, returns the source bead to `in_progress`, and dispatches a polecat for rework.
- `recoverOrphanedSourceBeads()` — a safety-net recovery function (called each alarm cycle) that detects source beads stuck in `in_review` whose MR bead has already reached a terminal state, and returns them to `in_progress`.
- `dispatchAgent()` — the `!started` (container failed) path was rolling back the agent to `idle` but leaving the bead in `in_progress`. Now it rolls the bead back to `open` (consistent with the `throw` path), making the retry via `schedulePendingWork` more predictable.

Closes #1243
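The core of the first fix is the difference between mutating only the MR bead and also transitioning its source bead. A minimal model of that behavior — `completeReviewRaw` and `failReviewWithRework` below are simplified stand-ins for the real helpers, and the `sourceBeadId` field is an assumed shape, not the actual schema:

```typescript
type Status = "open" | "in_progress" | "in_review" | "closed" | "failed";

interface Bead { id: string; status: Status; sourceBeadId?: string; }

// The buggy path: only the MR bead changes; the source bead stays in_review.
function completeReviewRaw(beads: Map<string, Bead>, mrId: string): void {
  const mr = beads.get(mrId);
  if (mr) mr.status = "failed";
}

// The fixed path: also return the source bead to in_progress so convoy
// progress advances and a rework agent can be dispatched.
function failReviewWithRework(beads: Map<string, Bead>, mrId: string): void {
  const mr = beads.get(mrId);
  if (!mr) return;
  mr.status = "failed";
  const source = mr.sourceBeadId ? beads.get(mr.sourceBeadId) : undefined;
  if (source && source.status === "in_review") {
    source.status = "in_progress"; // unblocks the convoy for rework dispatch
  }
}
```

In the buggy path nothing ever moves the source bead out of `in_review`, which is exactly the permanently-stuck state this PR fixes.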
Verification
- `pnpm typecheck` — passes
- `pnpm lint` (gastown) — passes
- `pnpm test:integration` — all new tests pass (5/5); 1 pre-existing failure in `convoy-dag.test.ts` (`should include merge_mode in review queue metadata for convoy beads`) is unrelated to this change and reproduces on `main`

Visual Changes
N/A
Reviewer Notes
- The `completeReview()` function in `review-queue.ts` still exists and is still called internally by `completeReviewWithResult()` — it's the low-level SQL update. The fix ensures no caller in `processReviewQueue` uses it directly for failure paths.
- `recoverOrphanedSourceBeads` uses the same 5-minute timeout as `recoverStuckReviews`. It only recovers beads whose MR bead is already in a terminal state (closed/failed), so it won't interfere with in-flight reviews.
- The `dispatchAgent` bead rollback fix may be a no-op in fire-and-forget contexts (the I/O gate may already be closed per the existing comment at line 2982), but it ensures correct behavior when `dispatchAgent` is awaited directly (as in `processReviewQueue`).
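Combining the notes above with the later follow-up (skip source beads that still have a pending MR child), the orphan-recovery predicate amounts to the checks below. Field names (`updatedAtMs`, `mrChildStatuses`) are assumptions for the sketch, not the real schema; the child-status check mirrors the NOT EXISTS subquery added in the follow-up commit:

```typescript
interface SourceBead {
  status: string;
  updatedAtMs: number;
  mrChildStatuses: string[]; // statuses of MR beads spawned from this source
}

// Same 5-minute window as recoverStuckReviews, per the reviewer notes.
const ORPHAN_TIMEOUT_MS = 5 * 60 * 1000;

function shouldRecover(bead: SourceBead, nowMs: number): boolean {
  if (bead.status !== "in_review") return false;
  if (nowMs - bead.updatedAtMs < ORPHAN_TIMEOUT_MS) return false;
  const terminal = (s: string) => s === "closed" || s === "failed";
  // Any open/in_progress MR child means a rework cycle is in flight, so
  // recovering now would be a false positive (the NOT EXISTS condition).
  if (!bead.mrChildStatuses.every(terminal)) return false;
  // Require at least one terminal MR child: a source with no MR yet isn't orphaned.
  return bead.mrChildStatuses.length > 0;
}
```

Recovered beads go back to `open` (not `in_progress`) per the follow-up commit, so the scheduler can assign a fresh polecat.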