Skip to content

fix(vscode): surface backend crash recovery#10783

Merged
marius-kilocode merged 2 commits into
mainfrom
gray-jet
Jun 1, 2026
Merged

fix(vscode): surface backend crash recovery#10783
marius-kilocode merged 2 commits into
mainfrom
gray-jet

Conversation

@marius-kilocode

@marius-kilocode marius-kilocode commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

The VS Code extension can keep a stale SDK client after its spawned kilo serve process exits. Subsequent requests then fail against a dead localhost endpoint while the chat surface remains in a state that looks usable, which makes prompts appear to disappear without a clear recovery path.

This change treats an unexpected CLI child exit as a connection failure, clears the stale client and server config immediately, and forwards a retryable error to the existing chat banner. The prompt input also rechecks connectivity after asynchronous attachment resolution so a crash during submit preparation preserves the draft instead of clearing it. Together these changes make backend death visible and leave the extension in a recoverable state rather than silently sending requests to a dead process.

image

@marius-kilocode marius-kilocode enabled auto-merge June 1, 2026 08:45
@kilo-code-bot

kilo-code-bot Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Code Review Summary

Status: 2 Issues Found | Recommendation: Low severity — safe to merge

Overview

Severity Count
CRITICAL 0
WARNING 1
SUGGESTION 1
Issue Details (click to expand)

WARNING

File Line Issue
packages/kilo-vscode/src/services/cli-backend/connection-service.ts 613–614 Narrow race: doConnect can re-assign this.client and this.sseClient after handleServerExit cleared them

SUGGESTION

File Line Issue
packages/kilo-vscode/tests/unit/prompt-input-connection-guard.test.ts 10–13 Source-text scanning test is fragile to refactors
Race condition detail (WARNING)

Scenario: doConnect is awaiting this.serverManager.getServer() (line 594) when the server process exits. handleServerExit fires synchronously (via the Node exit event), calls resetConnection() then setState("error"). JS then resumes the doConnect async function — if getServer() happened to resolve before the crash was detected, doConnect continues and assigns:

this.client = client   // overwrites the null set by resetConnection()
this.sseClient = sse   // overwrites the null set by resetConnection()

At this point this.state is "error" but this.client is a fresh object pointing to the now-dead server. The SSE stale-guard (if (this.sseClient !== sse) return) does not fire because this.sseClient was just set to sse. The SSE will eventually fail and cycle back to error/reconnect, but there's a window where this.client is non-null while this.state === "error" — which the fixed getClient() check (this.state !== "connected") correctly guards against.

The impact is limited because:

  • getClient() now requires state === "connected" before returning the client
  • The health poll and SSE error handlers will re-enter the error state quickly
  • This window requires the server to exit between doConnect's resetConnection() call and the line 613 assignment

This is inherent to the single-process architecture and the PR already substantially improves the status quo. Worth noting for a future hardening pass (e.g. a generation counter that doConnect checks before writing back to this.client).

Source-text scanning test (SUGGESTION)

prompt-input-connection-guard.test.ts verifies the ordering of isDisabled() by scanning source-code strings with indexOf. This will silently break if:

  • isDisabled is renamed
  • The git attachment block is extracted into a helper
  • The function is reformatted

A comment near the if (isDisabled()) return line explaining why it must appear here would make the intent clear and reduce the chance of a future refactor accidentally removing it. The test itself is an acceptable workaround for a webview component that can't be unit-tested in isolation, but its fragility should be acknowledged.

Incremental update (commit d3f8a65)

The new commit adds getConnectionError: () => null to three test connection stubs (kilo-provider-followup.test.ts, kilo-provider-load-messages.test.ts, kilo-provider-session-refresh.test.ts) to keep them in sync with the updated KiloConnectionService interface. The changes are correct and no new issues were introduced.

Files Reviewed (9 files)
  • .changeset/calm-kilos-reconnect.md — changeset looks correct, user-facing description is clear
  • packages/kilo-vscode/src/KiloProvider.tspostConnectionState() extraction is clean; error forwarding to the UI banner is correct
  • packages/kilo-vscode/src/services/cli-backend/connection-service.ts — core crash-recovery logic; stale-SSE guards, getClient() state check, handleServerExit all look correct
  • packages/kilo-vscode/src/services/cli-backend/server-manager.tsonExit callback wiring is minimal and correct
  • packages/kilo-vscode/tests/unit/sdk-sse-adapter.test.ts — new KiloConnectionService backend crash tests exercise real implementation paths; test quality is good
  • packages/kilo-vscode/tests/unit/prompt-input-connection-guard.test.ts — source-text ordering test (see suggestion above); unchanged in latest commit
  • packages/kilo-vscode/webview-ui/src/components/chat/PromptInput.tsx — single-line guard addition is correct
  • packages/kilo-vscode/tests/unit/kilo-provider-followup.test.ts — stub updated correctly with getConnectionError
  • packages/kilo-vscode/tests/unit/kilo-provider-load-messages.test.ts — stub updated correctly with getConnectionError
  • packages/kilo-vscode/tests/unit/kilo-provider-session-refresh.test.ts — stub updated correctly with getConnectionError

Fix these issues in Kilo Cloud


Reviewed by claude-4.6-sonnet-20260217 · 295,567 tokens

Review guidance: REVIEW.md from base branch main

@marius-kilocode marius-kilocode merged commit 33249d7 into main Jun 1, 2026
18 checks passed
@marius-kilocode marius-kilocode deleted the gray-jet branch June 1, 2026 09:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants