bug: investigate hanging sessions reported in CLI and VS Code (v7.1.20 → v7.2.0)

## Summary

Users report "hanging sessions" in both the CLI and VS Code: the session appears idle, yet sending a message has no effect. This issue tracks a systematic investigation of all PRs merged between v7.1.20 and v7.2.0 to identify potential causes.

## Investigation Methodology

Each of the 39 PRs merged between v7.1.20 and v7.2.0 was analyzed against the following known hanging risk vectors in the codebase:

1. **Queued callbacks in `prompt.ts` never being resolved** — If the session loop exits without finding an assistant message and without aborting, queued promise callbacks are never resolved
2. **Snapshot git lock contention** — All snapshot operations use `Lock.write(git)`, serializing concurrent git operations
3. **Storage lock contention** — RWLock on filesystem storage paths
4. **Blocking child sessions** — Parent tool execution blocks on `SessionPrompt.prompt()` for child; if child hangs, parent hangs
5. **SSE stream write backpressure** — `await stream.writeSSE()` blocks if client can't consume fast enough
6. **Abort signal not propagating** — If `AbortController` is never aborted on cleanup, retry sleeps and LLM streams run indefinitely
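
To illustrate risk vector 6, here is a minimal sketch of a retry delay that stops waiting as soon as the abort signal fires. `abortableSleep` and `retryWithBackoff` are hypothetical names for illustration only and are not taken from the codebase; the actual retry code may be structured differently.

```typescript
// Hypothetical helper: a sleep that rejects as soon as `signal` aborts,
// instead of keeping the caller parked until the timer fires.
function abortableSleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

// Hypothetical bounded retry loop: at most `attempts` calls to `fn`,
// with an abortable delay between failed attempts.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  attempts: number,
  delayMs: number,
  signal: AbortSignal,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only sleep if another attempt follows.
      if (i < attempts - 1) await abortableSleep(delayMs, signal);
    }
  }
  throw lastError;
}
```

If the owning `AbortController` is never aborted during cleanup, even a sleep written this way runs to completion, which is why the cleanup path matters as much as the helper.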

## PR-by-PR Analysis

| PR | Title | Probability | Notes |
|---|---|---|---|
| #8524 | feat(agent-manager): add GitHub PR status badge to worktree sidebar | **None** | Pure UI in extension layer, no session/SSE interaction |
| #8525 | fix(vscode): retry transient fetch failures in session.command and promptAsync | **Low** | Wraps read-only SDK calls with bounded retry (3 attempts). Does NOT touch `promptAsync`/`session.command` despite the title. Minor: retry delay is not abortable (~1s max) |
| #8431 | fix: remove local storage from ignored folder | **None** | Changes file ignore patterns only, no session/lock interaction |
| #8445 | perf(snapshot): add mutex lock, incremental add, and batched revert | **Medium** | ⚠️ **TOP SUSPECT.** Adds `Lock.write(git)` to ALL snapshot functions. `diffFullUncached` can hold the write lock for extended periods while streaming per-file git processes, blocking all other snapshot operations (`track()`, `patch()`) on the same worktree. Could cause sessions to appear stuck after tool calls complete (waiting for `track()` blocked behind a long `diffFull`). |
| #8479 | fix(core): make follow-up execution aware of the saved plan file | **None** | Pure prompt text change, no control flow modification |
| #8508 | fix(vscode): scope cycleAgentMode keybinding to Kilo Code panels | **None** | Keybinding scoping only |
| #8506 | feat(vscode): pre-release publishing support | **None** | CI/CD change only, no runtime code |
| #8496 | fix: local review git | **None** | Prompt text + one bounded read-only git command with `.nothrow()` |
| #8480 | feat(vscode): reimplement task timeline graph header | **None** | Pure UI visualization, read-only session data |
| #8484 | feature: glm/kimi/qwen reasoning support | **None** | Declarative config mapping, no stream handling changes |
| #8190 | fix(agent-manager): suppress interactive prompts during background git fetch | **None** | Actually reduces hang risk by using `GIT_TERMINAL_PROMPT=0` and `BatchMode=yes` |
| #8464 | fix(cli): update simple-git to fix critical RCE | **None** | Security hardening only; CLI uses raw git commands, not simple-git for core ops |
| #8465 | fix(cli): update hono to fix auth bypass and server vulnerabilities | **Low** | Hono SSE injection fix touched streaming internals. Canonical usage patterns (`writeSSE`/`onAbort`) are stable, but Hono serves the SSE endpoints |
| #8466 | fix(cli): update minimatch, @modelcontextprotocol/sdk, and @aws-sdk | **None** | MCP SDK 1.24.3 actually **fixed** a hanging connection bug (`body.cancel()` → `text()`) |
| #8467 | fix: add safe overrides for transitive dependency vulnerabilities | **None** | Semver-compatible overrides, security fixes only |
| #8426 | fix(vscode): mode picker sync | **None** | UI-layer mode picker synchronization, proper error recovery |
| #8417 | fix(vscode): question recovery + sub-agent permission propagation | **Low** | Designed to FIX hangs. Adds question recovery on SSE reconnect and auto-adoption of child sessions. Minor: `fetchAndSendPendingQuestions` is `await`ed during SSE reconnection handler — if backend is unresponsive, could delay reconnection completion |
| #8386 | fix(vscode): recover missed child-session prompts | **None** | Explicitly fixes hanging sessions from missed child prompts. Defensive error handling (try/catch, fire-and-forget) |
| #8400 | fix(cli): cache diffFull and ignore legacy local storage | **None** | Promise-based caching reduces lock contention. Correct error eviction. |
| #8367 | Session migration improvements | **None** | VS Code extension migration wizard only, not runtime sessions |
| #8230 | feat(vscode): add Claude Code compatibility toggle | **None** | Env var at spawn time, no async flow changes |
| Plan mode commits (6 commits) | Permission propagation to sub-agents | **Low** | Sub-agents inherit restrictive permissions. `edit: "ask"` creates blocking permission requests that surface to UI. Theoretical: deeply nested sub-agent permission request might be less visible to user, making parent appear hung |
| #8218 | feat(cli): add org support for /kiloclaw command | **None** | Single parallel HTTP request with `.catch(() => null)` guard |
| CLI fix commits (5 commits) | Guard prompt injection, review scope, health logging, Docker MCP --rm, variant config | **None** | Prompt text changes, log filtering, Docker cleanup fix |
| UI/stability commits | Dialog escape, popover stability, JetBrains plugin | **None** | UI-only or isolated new package |
| Docs PRs (10 PRs) | Various documentation updates | **None** | No runtime code changes |
| Security PRs (#8468, #8469) | diff, dompurify, yaml, solid-js, vite, electron | **None** | Patch-level dependency updates, build-time tools |
| #8192, #8211, #8258 | Agent session PRs (User-Agent header, docs link, health logging) | **None** | Header construction, static UI, log filtering |

## Conclusion: Likely Suspects

### Primary Suspect: PR #8445 — Snapshot mutex lock (Medium probability)

This PR adds `Lock.write(git)` to **every** public function in the `Snapshot` namespace. While this correctly prevents concurrent git corruption, it introduces significant **lock contention** that could cause **perceived hanging**:

- **`diffFullUncached()`** holds the write lock while streaming git output line-by-line and spawning up to 4 additional git processes per file. For large diffs, this could hold the lock for **seconds to minutes**.
- During this time, **all other snapshot operations** on the same worktree are blocked: `track()` (called at every step-start/step-finish), `patch()`, `revert()`, `diff()`, `cleanup()`.
- A session that completes a tool call will call `track()` to checkpoint — if `track()` is blocked behind a long `diffFullUncached()`, the session appears stuck between steps.
- With Agent Manager running multiple sessions on the same worktree, this serialization is amplified.

**Recommended investigation:** Add timing instrumentation to `Lock.write(git)` acquisitions and log whenever the wait time exceeds 5 seconds. Check whether `diffFullUncached` is the primary lock holder during perceived hangs.
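
The instrumentation recommended above could look roughly like the following sketch. It deliberately does not assume the real signature of `Lock.write`: `timedAcquire` takes any acquire function, and `ToyMutex` is a stand-in lock used only to make the example runnable. Both names are hypothetical.

```typescript
type Release = () => void;

// Toy FIFO mutex standing in for the real lock (an assumption for this sketch).
class ToyMutex {
  private tail: Promise<void> = Promise.resolve();

  acquire(): Promise<Release> {
    let release!: Release;
    const held = new Promise<void>((resolve) => {
      release = resolve;
    });
    // Becomes ready once every earlier holder has released.
    const ready = this.tail.then(() => release);
    // Later acquirers queue behind this holder.
    this.tail = this.tail.then(() => held);
    return ready;
  }
}

// Hypothetical wrapper: measures how long the caller waited for the lock
// and reports waits longer than `thresholdMs`.
async function timedAcquire<T>(
  label: string,
  acquire: () => Promise<T>,
  thresholdMs: number,
  warn: (message: string) => void,
): Promise<T> {
  const start = Date.now();
  const handle = await acquire();
  const waitedMs = Date.now() - start;
  if (waitedMs > thresholdMs) {
    warn(`${label}: waited ${waitedMs}ms for lock`);
  }
  return handle;
}
```

Passing the caller name as `label` (for example `track` or `diffFull`) makes it possible to see which operations wait, and logging hold time on release in the same way would show which operations they wait behind.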

### Secondary Suspects

1. **Hono update (#8465)** — Low probability, but hono serves the SSE endpoints. The SSE injection fix in hono 4.12 touched streaming internals. If edge cases in SSE stream termination/cleanup changed, it could affect how clients perceive connection state.

2. **Plan mode permission propagation** — Low probability. Sub-agents that inherit restrictive `edit: "ask"` permissions create blocking permission requests. In deeply nested delegation chains, the permission UI might not be prominently visible, making the session appear hung while actually waiting for user input.

3. **Question recovery on SSE reconnect (#8417)** — Low probability. `fetchAndSendPendingQuestions` is `await`ed in the SSE "connected" handler. If the backend is slow, this delays SSE reconnection completion. However, SDK calls have timeouts.
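
For the third suspect, one way to keep an awaited recovery step from stalling reconnection is to give it a time budget. The sketch below is an assumption about how that could be done, not a description of the extension's code: `withTimeout` and `onConnected` are hypothetical names, and `recoverPending` stands in for a call such as `fetchAndSendPendingQuestions`.

```typescript
// Hypothetical helper: resolves with the work's result, or with "timeout"
// if the work has not settled within `ms`.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T | "timeout"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Hypothetical "connected" handler: recovery is bounded by `budgetMs`,
// and the connection is marked ready on every path.
async function onConnected(
  recoverPending: () => Promise<void>,
  markReady: () => void,
  budgetMs: number,
): Promise<"recovered" | "timeout" | "failed"> {
  let outcome: "recovered" | "timeout" | "failed";
  try {
    const result = await withTimeout(recoverPending(), budgetMs);
    outcome = result === "timeout" ? "timeout" : "recovered";
  } catch {
    outcome = "failed";
  }
  markReady();
  return outcome;
}
```

Since the SDK calls reportedly already have timeouts, an extra budget like this would only matter if those timeouts are longer than the delay users are willing to tolerate on reconnect.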

### Pre-existing Risk (Not introduced in this range)

The session prompt loop in `prompt.ts` has a known risk where queued callbacks (lines 304-309, 806-816) could **never be resolved** if the loop exits without finding an assistant message and without the abort signal firing. This is a pre-existing architectural issue that could manifest as permanent hangs regardless of these PRs.
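
A minimal sketch of the failure mode and one possible guard follows. It is not the code in `prompt.ts`: `CallbackQueue` and `runLoop` are hypothetical, and the single `step` call stands in for the whole session loop. The point is only that settling queued callbacks in a `finally` block covers the exit path where no assistant message is found.

```typescript
type Waiter<T> = {
  resolve: (value: T) => void;
  reject: (reason: Error) => void;
};

// Hypothetical queue of callers waiting for the loop's result.
class CallbackQueue<T> {
  private waiters: Waiter<T>[] = [];

  wait(): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.waiters.push({ resolve, reject });
    });
  }

  // Settles every queued waiter exactly once and empties the queue.
  settle(result: { ok: true; value: T } | { ok: false; error: Error }): void {
    const pending = this.waiters;
    this.waiters = [];
    for (const waiter of pending) {
      if (result.ok) waiter.resolve(result.value);
      else waiter.reject(result.error);
    }
  }

  get size(): number {
    return this.waiters.length;
  }
}

// Hypothetical loop body: `step` returns the assistant message, or undefined
// when the loop exits without one. The finally block guarantees that waiters
// are settled on success, on the "no message" exit, and when `step` throws.
async function runLoop(
  queue: CallbackQueue<string>,
  step: () => Promise<string | undefined>,
): Promise<void> {
  let message: string | undefined;
  try {
    message = await step();
  } finally {
    queue.settle(
      message !== undefined
        ? { ok: true, value: message }
        : { ok: false, error: new Error("loop exited without an assistant message") },
    );
  }
}
```

Rejecting rather than silently dropping the waiters turns a permanent hang into a visible error that the caller can report or retry.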

bug: investigate hanging sessions reported in CLI and VS Code (v7.1.20 → v7.2.0) #8677