fix(indexing): prevent LanceDB metadata type coercion causing full re-index on restart#10703

Merged
marius-kilocode merged 2 commits into
Kilo-Org:mainfrom
barzhomi:fix/lancedb-metadata-type-corruption
May 29, 2026

@barzhomi

Contributor

Fix: LanceDB metadata table type corruption causes full re-index on every restart

Summary

Fix a bug where the LanceDB metadata table's value column was inferred as number by LanceDB because the first row contained a numeric value (vector_size: 1024). All subsequent string values (embedding_provider, embedding_model_id) and boolean values (indexing_complete) were silently coerced to NaN/0/1, corrupting the stored metadata.

On every VS Code restart, _getStoredEmbeddingProfile() received NaN for the provider and model fields. Because NaN is not a string, the typeof !== "string" check rejected both fields and the function returned undefined, which set needsRecreation = true. As a result, the entire LanceDB database was dropped and recreated, the file-hash cache was cleared, and a full re-index started from scratch.
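The check described above can be sketched in isolation. This is a minimal illustration, not the code from lancedb-vector-store.ts: the `MetadataRow` and `StoredProfile` types and the `readStoredProfile` helper are hypothetical names, and only the `typeof !== "string"` test and the resulting `needsRecreation` decision are taken from the description.

```typescript
// Hypothetical row shape for the key/value metadata table.
type MetadataRow = { key: string; value: unknown }

// Hypothetical profile type; the fields mirror the metadata keys named in this PR.
type StoredProfile = { provider: string; modelId: string }

// Sketch of the validation: any non-string provider or model yields undefined.
function readStoredProfile(rows: MetadataRow[]): StoredProfile | undefined {
  const lookup = new Map(rows.map((r) => [r.key, r.value] as const))
  const provider = lookup.get("embedding_provider")
  const modelId = lookup.get("embedding_model_id")
  if (typeof provider !== "string" || typeof modelId !== "string") return undefined
  return { provider, modelId }
}

// Rows as they looked after coercion: NaN where strings were expected.
const corruptedRows: MetadataRow[] = [
  { key: "embedding_provider", value: NaN },
  { key: "embedding_model_id", value: NaN },
]

// Rows as they look when the values are stored as strings.
const healthyRows: MetadataRow[] = [
  { key: "embedding_provider", value: "ollama" },
  { key: "embedding_model_id", value: "qwen3-embedding:0.6b" },
]

// A missing profile is what forces the drop-and-recreate path.
const needsRecreationCorrupted = readStoredProfile(corruptedRows) === undefined
const needsRecreationHealthy = readStoredProfile(healthyRows) === undefined
```

With the corrupted rows the profile is missing, so recreation is requested on every start; with string values the stored profile is returned and the existing database can be reused.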

Root Cause

In packages/kilo-indexing/src/indexing/vector-store/lancedb-vector-store.ts:

// BEFORE: mixed types cause LanceDB to infer value column as number
_createMetadataData() {
  return [
    { key: "vector_size",         value: 1024 },              // number — column type set here
    { key: "embedding_provider",  value: "ollama" },           // string → coerced to NaN
    { key: "embedding_model_id",  value: "qwen3-embedding:0.6b" }, // string → coerced to NaN
    { key: "embedding_dimension", value: 1024 },                // number → OK
    { key: "indexing_complete",   value: false },               // boolean → coerced to 0/1
  ]
}

LanceDB infers the value column type from the first row's value (1024 = number). All subsequent rows with different types are silently coerced, losing their values.

Actual metadata stored on disk (confirmed via LanceDB query):

embedding_provider  = NaN     ← should be "ollama"
embedding_model_id  = NaN     ← should be "qwen3-embedding:0.6b"
indexing_complete   = 1       ← should be true/false
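The effect can be reproduced without LanceDB. The sketch below is a simulation under stated assumptions: it uses JavaScript's `Number()` to stand in for the coercion and a hypothetical `inferTypeFromFirstRow` helper to stand in for the schema inference, so it shows the reported outcome rather than how LanceDB implements it.

```typescript
type RawValue = string | number | boolean
type RawRow = { key: string; value: RawValue }

// Assumption for illustration: the column type is taken from the first row only.
function inferTypeFromFirstRow(rows: RawRow[]): string {
  const first = rows[0]
  if (first === undefined) throw new Error("no rows to infer from")
  return typeof first.value
}

// Simulated coercion of every value into a numeric column.
function coerceRows(rows: RawRow[]): Map<string, number> {
  return new Map(rows.map((r) => [r.key, Number(r.value)] as const))
}

// Same row order as the original _createMetadataData() output.
const mixedRows: RawRow[] = [
  { key: "vector_size", value: 1024 },
  { key: "embedding_provider", value: "ollama" },
  { key: "embedding_model_id", value: "qwen3-embedding:0.6b" },
  { key: "embedding_dimension", value: 1024 },
  { key: "indexing_complete", value: false },
]

const inferredType = inferTypeFromFirstRow(mixedRows)
const coerced = coerceRows(mixedRows)

for (const [key, value] of coerced) {
  console.log(`${key} = ${value}`)
}
```

In this simulation the numeric rows survive, the two string rows become NaN, and the boolean becomes 0 or 1, which matches the values observed on disk.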

Fix

All metadata values are now explicitly converted to strings before being passed to LanceDB, ensuring a consistent string column type:

  • _createMetadataData() — wrap numeric/boolean values with String()
  • _upsertMetadata() — convert value to String(value) before storing
  • markIndexingComplete() — store "true" instead of true
  • markIndexingIncomplete() — store "false" instead of false
  • hasIndexedData() — compare with "true" string instead of truthy check

The same issue was present in _persistEmbeddingProfile(), which calls _upsertMetadata() with raw numeric values; those values are now also converted with String().
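The shape of the fix can be sketched with an in-memory stand-in. The `MetadataSketch` class below is hypothetical and backed by a `Map` instead of a LanceDB table; its method names echo the ones listed above, but the bodies are an assumption about the approach, not the merged code.

```typescript
type MetadataValue = string | number | boolean

// Hypothetical in-memory stand-in for the metadata table; stored values are always strings.
class MetadataSketch {
  private readonly rows = new Map<string, string>()

  // Convert once at the storage boundary so every caller writes a string.
  upsert(key: string, value: MetadataValue): void {
    this.rows.set(key, String(value))
  }

  markIndexingComplete(): void {
    this.upsert("indexing_complete", "true")
  }

  markIndexingIncomplete(): void {
    this.upsert("indexing_complete", "false")
  }

  // Compare against the literal "true": the string "false" is truthy in JavaScript.
  hasIndexedData(): boolean {
    return this.rows.get("indexing_complete") === "true"
  }

  get(key: string): string | undefined {
    return this.rows.get(key)
  }
}

const sketch = new MetadataSketch()
sketch.upsert("vector_size", 1024)
sketch.upsert("embedding_provider", "ollama")
sketch.markIndexingIncomplete()
const incompleteResult = sketch.hasIndexedData()
sketch.markIndexingComplete()
const completeResult = sketch.hasIndexedData()
```

Converting in one place keeps call sites free of repeated String() wrappers, and the explicit comparison in hasIndexedData() avoids treating the stored string "false" as truthy.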

Verification

Tested end-to-end with the CLI binary on a small project:

| Run | Message | Files | Mode |
| --- | --- | --- | --- |
| First (fresh DB) | "Starting workspace scan..." | 7/7 | Full scan |
| Second (restart) | "Checking for new or modified files..." | 0/0 | Incremental |

Metadata after fix (all strings):

embedding_provider  = "ollama"                    ✓
embedding_model_id  = "qwen3-embedding:0.6b"      ✓
embedding_dimension = "1024"                      ✓
vector_size         = "1024"                      ✓
indexing_complete   = "true"                      ✓

File-hash cache persists across restarts (1123 bytes, not cleared).

Backward Compatibility

Existing databases with corrupted metadata (NaN values) will be auto-repaired on the first restart after this fix: the corrupted profile check triggers a one-time table recreation, which writes correct string values. Subsequent restarts use the persistent database normally.
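The one-time repair can be illustrated by simulating two consecutive restarts against a table that starts out corrupted. Everything here is a hypothetical sketch: `MetadataTable`, `profileIsValid`, and `simulateRestart` are illustrative names, a `Map` stands in for the on-disk table, and the comparison of the stored profile against the configured one is left out.

```typescript
type StoredValue = string | number

// Hypothetical stand-in for the on-disk metadata table.
type MetadataTable = Map<string, StoredValue>

const PROFILE_KEYS = ["embedding_provider", "embedding_model_id"] as const

// The profile is usable only when both fields are strings.
function profileIsValid(table: MetadataTable): boolean {
  return PROFILE_KEYS.every((k) => typeof table.get(k) === "string")
}

// Simulates one startup and returns true when the table had to be recreated.
function simulateRestart(table: MetadataTable, provider: string, modelId: string): boolean {
  if (profileIsValid(table)) return false
  table.clear() // stands in for dropping and recreating the database
  table.set("embedding_provider", String(provider))
  table.set("embedding_model_id", String(modelId))
  return true
}

// A table written before the fix: NaN where strings were expected.
const legacyTable: MetadataTable = new Map<string, StoredValue>([
  ["embedding_provider", NaN],
  ["embedding_model_id", NaN],
])

const firstRestartRecreated = simulateRestart(legacyTable, "ollama", "qwen3-embedding:0.6b")
const secondRestartRecreated = simulateRestart(legacyTable, "ollama", "qwen3-embedding:0.6b")
```

The first simulated restart recreates the table and writes string values; the second finds a valid profile and leaves the table alone.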

…-index on restart

LanceDB infers the value column type from the first row in _createMetadataData().
Since vector_size (1024) was stored as a number, all subsequent string values
(embedding_provider, embedding_model_id) and booleans (indexing_complete) were
silently coerced to NaN/0/1, corrupting metadata.

On restart, _getStoredEmbeddingProfile() received NaN values, causing
needsRecreation=true every time — the database was dropped and recreated,
the file-hash cache cleared, and a full re-index triggered on every VS Code reopen.

Store all metadata values as strings to ensure consistent column type.
@kilo-code-bot

kilo-code-bot Bot commented May 28, 2026

Contributor

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Previously Reported Issues — All Resolved

| Severity | Issue | Status |
| --- | --- | --- |
| WARNING | Missing changeset file | ✅ Fixed: .changeset/fix-lancedb-metadata-type-coercion.md added with user-facing description |
| SUGGESTION | Redundant double String() conversion at call sites | ✅ Fixed: outer String() wrappers removed; _upsertMetadata handles conversion internally |

Files Reviewed (2 files)
  • .changeset/fix-lancedb-metadata-type-coercion.md — new file, correct patch bump and user-facing description
  • packages/kilo-indexing/src/indexing/vector-store/lancedb-vector-store.ts — redundant String() calls removed at lines 601–602

The fix is correct and complete. LanceDB metadata type coercion is resolved by storing all values as strings, the changeset is present with an appropriate user-facing message, and the minor style inconsistency has been cleaned up.


Reviewed by claude-sonnet-4.6 · 109,926 tokens

Review guidance: REVIEW.md from base branch main

@marius-kilocode marius-kilocode self-requested a review May 29, 2026 10:41
@marius-kilocode

Collaborator

Verified this against the docs; this looks correct. Thanks @barzhomi

@marius-kilocode marius-kilocode enabled auto-merge May 29, 2026 10:56
@marius-kilocode marius-kilocode merged commit 4d1439e into Kilo-Org:main May 29, 2026
13 checks passed
@barzhomi barzhomi deleted the fix/lancedb-metadata-type-corruption branch May 29, 2026 11:15