bug: investigate hanging sessions reported in CLI and VS Code (v7.1.20 → v7.2.0) #8677

@kilo-code-bot

Summary

Users report "hanging sessions" in both the CLI and VS Code: the session appears idle, yet sending a message has no effect. This issue tracks a systematic investigation of all PRs merged between v7.1.20 and v7.2.0 to identify potential causes.

Investigation Methodology

Each of the 39 PRs merged between v7.1.20 and v7.2.0 was analyzed against the following known hanging risk vectors in the codebase:

  1. Queued callbacks in prompt.ts never being resolved — If the session loop exits without finding an assistant message and without aborting, queued promise callbacks are never resolved
  2. Snapshot git lock contention — All snapshot operations use Lock.write(git), serializing concurrent git operations
  3. Storage lock contention — RWLock on filesystem storage paths
  4. Blocking child sessions — Parent tool execution blocks on SessionPrompt.prompt() for child; if child hangs, parent hangs
  5. SSE stream write backpressure — await stream.writeSSE() blocks if the client can't consume fast enough
  6. Abort signal not propagating — If AbortController is never aborted on cleanup, retry sleeps and LLM streams run indefinitely
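
To illustrate risk vector 6, here is a minimal sketch of a retry delay that stops as soon as an abort signal fires. The names `sleep` and `retry` are hypothetical and are not taken from the codebase; the sketch only assumes the standard `AbortController`/`AbortSignal` API.

```typescript
// Hypothetical helper, not taken from the Kilo Code codebase.
// Resolves after `ms`, or rejects as soon as `signal` aborts.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer); // stop the pending delay immediately
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Bounded retry whose backoff delay ends as soon as the caller aborts.
async function retry<T>(
  fn: () => Promise<T>,
  attempts: number,
  delayMs: number,
  signal?: AbortSignal,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    if (signal?.aborted) throw new Error("aborted");
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only wait if another attempt will follow.
      if (i < attempts - 1) await sleep(delayMs, signal);
    }
  }
  throw lastError;
}
```

If the controller that owns `signal` is never aborted on cleanup, neither function can end early, which is the failure mode described in item 6.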

PR-by-PR Analysis

| PR | Title | Probability | Notes |
| --- | --- | --- | --- |
| #8524 | feat(agent-manager): add GitHub PR status badge to worktree sidebar | None | Pure UI in extension layer, no session/SSE interaction |
| #8525 | fix(vscode): retry transient fetch failures in session.command and promptAsync | Low | Wraps read-only SDK calls with bounded retry (3 attempts). Does NOT touch promptAsync/session.command despite title. Minor: retry delay not abortable (~1s max) |
| #8431 | fix: remove local storage from ignored folder | None | Changes file ignore patterns only, no session/lock interaction |
| #8445 | perf(snapshot): add mutex lock, incremental add, and batched revert | Medium | ⚠️ TOP SUSPECT. Adds Lock.write(git) to ALL snapshot functions. diffFullUncached can hold the write lock for extended periods while streaming per-file git processes, blocking all other snapshot operations (track(), patch()) on the same worktree. Could cause sessions to appear stuck after tool calls complete (waiting for track() blocked behind a long diffFull). |
| #8479 | fix(core): make follow-up execution aware of the saved plan file | None | Pure prompt text change, no control flow modification |
| #8508 | fix(vscode): scope cycleAgentMode keybinding to Kilo Code panels | None | Keybinding scoping only |
| #8506 | feat(vscode): pre-release publishing support | None | CI/CD change only, no runtime code |
| #8496 | fix: local review git | None | Prompt text + one bounded read-only git command with .nothrow() |
| #8480 | feat(vscode): reimplement task timeline graph header | None | Pure UI visualization, read-only session data |
| #8484 | feature: glm/kimi/qwen reasoning support | None | Declarative config mapping, no stream handling changes |
| #8190 | fix(agent-manager): suppress interactive prompts during background git fetch | None | Actually reduces hang risk by using GIT_TERMINAL_PROMPT=0 and BatchMode=yes |
| #8464 | fix(cli): update simple-git to fix critical RCE | None | Security hardening only; CLI uses raw git commands, not simple-git for core ops |
| #8465 | fix(cli): update hono to fix auth bypass and server vulnerabilities | Low | Hono SSE injection fix touched streaming internals. Canonical usage patterns (writeSSE/onAbort) are stable, but hono serves the SSE endpoints |
| #8466 | fix(cli): update minimatch, @modelcontextprotocol/sdk, and @aws-sdk | None | MCP SDK 1.24.3 actually fixed a hanging connection bug (body.cancel() / text()) |
| #8467 | fix: add safe overrides for transitive dependency vulnerabilities | None | Semver-compatible overrides, security fixes only |
| #8426 | fix(vscode): mode picker sync | None | UI-layer mode picker synchronization, proper error recovery |
| #8417 | fix(vscode): question recovery + sub-agent permission propagation | Low | Designed to FIX hangs. Adds question recovery on SSE reconnect and auto-adoption of child sessions. Minor: fetchAndSendPendingQuestions is awaited in the SSE reconnection handler; if the backend is unresponsive, this could delay reconnection completion |
| #8386 | fix(vscode): recover missed child-session prompts | None | Explicitly fixes hanging sessions from missed child prompts. Defensive error handling (try/catch, fire-and-forget) |
| #8400 | fix(cli): cache diffFull and ignore legacy local storage | None | Promise-based caching reduces lock contention. Correct error eviction. |
| #8367 | Session migration improvements | None | VS Code extension migration wizard only, not runtime sessions |
| #8230 | feat(vscode): add Claude Code compatibility toggle | None | Env var at spawn time, no async flow changes |
| Plan mode commits (6 commits) | Permission propagation to sub-agents | Low | Sub-agents inherit restrictive permissions. edit: "ask" creates blocking permission requests that surface to UI. Theoretical: a deeply nested sub-agent permission request might be less visible to the user, making the parent appear hung |
| #8218 | feat(cli): add org support for /kiloclaw command | None | Single parallel HTTP request with .catch(() => null) guard |
| CLI fix commits (5 commits) | Guard prompt injection, review scope, health logging, Docker MCP --rm, variant config | None | Prompt text changes, log filtering, Docker cleanup fix |
| UI/stability commits | Dialog escape, popover stability, JetBrains plugin | None | UI-only or isolated new package |
| Docs PRs (10 PRs) | Various documentation updates | None | No runtime code changes |
| Security PRs (#8468, #8469) | diff, dompurify, yaml, solid-js, vite, electron | None | Patch-level dependency updates, build-time tools |
| #8192, #8211, #8258 | Agent session PRs (User-Agent header, docs link, health logging) | None | Header construction, static UI, log filtering |

Conclusion: Likely Suspects

Primary Suspect: PR #8445 — Snapshot mutex lock (Medium probability)

This PR adds Lock.write(git) to every public function in the Snapshot namespace. While this correctly prevents concurrent git corruption, it introduces significant lock contention that could cause perceived hanging:

  • diffFullUncached() holds the write lock while streaming git output line-by-line and spawning up to 4 additional git processes per file. For large diffs, this could hold the lock for seconds to minutes.
  • During this time, all other snapshot operations on the same worktree are blocked: track() (called at every step-start/step-finish), patch(), revert(), diff(), cleanup().
  • A session that completes a tool call will call track() to checkpoint — if track() is blocked behind a long diffFullUncached(), the session appears stuck between steps.
  • With Agent Manager running multiple sessions on the same worktree, this serialization is amplified.

Recommended investigation: Add timing instrumentation to Lock.write(git) acquisitions and log when wait time exceeds 5 seconds. Check if diffFullUncached is the primary lock holder during perceived hangs.
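
The recommended instrumentation could look roughly like the sketch below. It does not use the real Lock.write() API, whose signature is not shown in this issue. Instead it assumes a lock exposes an async acquire function that resolves to a release callback; `timedAcquire`, `Release`, and the small `Mutex` are hypothetical stand-ins.

```typescript
// Hypothetical shape of a lock handle; the real Lock.write() return type may differ.
type Release = () => void;

const SLOW_WAIT_MS = 5_000; // threshold from the recommendation above

// Wraps any async acquire function and reports how long the caller waited.
async function timedAcquire(
  label: string,
  acquire: () => Promise<Release>,
  log: (msg: string) => void = console.warn,
  thresholdMs: number = SLOW_WAIT_MS,
): Promise<{ release: Release; waitedMs: number }> {
  const start = Date.now();
  const release = await acquire();
  const waitedMs = Date.now() - start;
  if (waitedMs > thresholdMs) {
    log(`[lock] ${label} waited ${waitedMs}ms`);
  }
  return { release, waitedMs };
}

// Minimal promise-chain mutex, included only so the sketch is self-contained.
class Mutex {
  private tail: Promise<void> = Promise.resolve();

  acquire(): Promise<Release> {
    let release: Release = () => {};
    const next = new Promise<void>((resolve) => {
      release = () => resolve();
    });
    // The caller gets the lock once the previous holder has released it.
    const ready = this.tail.then(() => release);
    this.tail = next;
    return ready;
  }
}
```

Tagging each acquisition with the calling function (for example "track" or "diffFullUncached") as the label would show which caller waited; recording hold time at release would additionally be needed to confirm which holder caused the wait.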

Secondary Suspects

  1. Hono update (#8465) — Low probability, but hono serves the SSE endpoints. The SSE injection fix in hono 4.12 touched streaming internals. If edge cases in SSE stream termination/cleanup changed, it could affect how clients perceive connection state.

  2. Plan mode permission propagation — Low probability. Sub-agents that inherit restrictive edit: "ask" permissions create blocking permission requests. In deeply nested delegation chains, the permission UI might not be prominently visible, making the session appear hung while actually waiting for user input.

  3. Question recovery on SSE reconnect (#8417) — Low probability. fetchAndSendPendingQuestions is awaited in the SSE "connected" handler. If the backend is slow, this delays SSE reconnection completion. However, SDK calls have timeouts.
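
For the third item, one way to keep a slow backend from delaying reconnection is to bound the recovery call and run it after the connection is marked ready. This is a sketch only: `withTimeout`, `onConnected`, `markReady`, and `fetchPending` are hypothetical names, and the actual handler in the extension may be structured differently.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms`.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Hypothetical reconnect handler: marks the connection ready first, then
// recovers pending questions in the background with a bounded wait.
function onConnected(
  markReady: () => void,
  fetchPending: () => Promise<void>,
  timeoutMs: number = 5_000,
  log: (msg: string) => void = console.warn,
): void {
  markReady();
  void withTimeout(fetchPending(), timeoutMs, "pending question recovery").catch((err) =>
    log(String(err)),
  );
}
```

The trade-off is that questions may arrive slightly after the UI reports a live connection, so the UI has to tolerate late delivery.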

Pre-existing Risk (Not introduced in this range)

The session prompt loop in prompt.ts has a known risk: queued callbacks (lines 304-309, 806-816) may never be resolved if the loop exits without finding an assistant message and without the abort signal firing. This is a pre-existing architectural issue that could manifest as permanent hangs regardless of these PRs.
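
The general pattern for avoiding this class of hang is to settle every queued callback on every exit path. The sketch below shows that pattern in isolation. `SessionQueue` and `Callback` are hypothetical and do not reflect the actual structures in prompt.ts, which are not shown in this issue.

```typescript
// Hypothetical types; the real queue in prompt.ts may look different.
type Callback<T> = { resolve: (value: T) => void; reject: (err: Error) => void };

class SessionQueue<T> {
  private callbacks: Callback<T>[] = [];

  // Callers that arrive while the loop is busy wait on the returned promise.
  enqueue(): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.callbacks.push({ resolve, reject });
    });
  }

  // Runs the loop body and settles every queued callback on every exit path.
  async run(loop: () => Promise<T | undefined>): Promise<T | undefined> {
    let result: T | undefined;
    let failure: Error | undefined;
    try {
      result = await loop();
      return result;
    } catch (err) {
      failure = err instanceof Error ? err : new Error(String(err));
      throw failure;
    } finally {
      const pending = this.callbacks;
      this.callbacks = [];
      for (const cb of pending) {
        if (result !== undefined) cb.resolve(result);
        // No result (normal exit or error): reject instead of leaving callers waiting.
        else cb.reject(failure ?? new Error("session loop exited without a result"));
      }
    }
  }
}
```

With this shape, a loop that exits without an assistant message produces a visible rejection for each waiting caller instead of a promise that never settles.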
