fix(indexing): prevent LanceDB metadata type coercion causing full re-index on restart #10703
Merged
marius-kilocode merged 2 commits on May 29, 2026
…-index on restart

LanceDB infers the `value` column type from the first row in `_createMetadataData()`. Since `vector_size` (1024) was stored as a number, all subsequent string values (`embedding_provider`, `embedding_model_id`) and booleans (`indexing_complete`) were silently coerced to NaN/0/1, corrupting metadata. On restart, `_getStoredEmbeddingProfile()` received NaN values, causing `needsRecreation=true` every time: the database was dropped and recreated, the file-hash cache cleared, and a full re-index triggered on every VS Code reopen.

Store all metadata values as strings to ensure consistent column type.
Contributor
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Previously Reported Issues: All Resolved
Files Reviewed (2 files)
The fix is correct and complete. LanceDB metadata type coercion is resolved by storing all values as strings, the changeset is present with an appropriate user-facing message, and the minor style inconsistency has been cleaned up.
Reviewed by claude-sonnet-4.6 · 109,926 tokens
Review guidance: REVIEW.md from base branch
Collaborator
Verified this against the docs, this looks correct. Thanks @barzhomi
marius-kilocode approved these changes on May 29, 2026
Fix: LanceDB metadata table type corruption causes full re-index on every restart
Summary
Fix a bug where the LanceDB metadata table's `value` column was inferred as `number` because the first row contained a numeric value (`vector_size: 1024`). All subsequent string values (`embedding_provider`, `embedding_model_id`) and boolean values (`indexing_complete`) were silently coerced to `NaN`/`0`/`1`, corrupting the stored metadata.

On every VS Code restart, `_getStoredEmbeddingProfile()` received `NaN` for the provider/model fields, failed the `typeof !== "string"` check, and returned `undefined`. That set `needsRecreation = true`, so the entire LanceDB database was dropped and recreated, the file-hash cache was cleared, and a full re-index started from scratch.

Root Cause

The affected code is in `packages/kilo-indexing/src/indexing/vector-store/lancedb-vector-store.ts`. LanceDB infers the `value` column type from the first row's value (`1024` = number). All subsequent rows with different types are silently coerced, losing their values. A LanceDB query against the metadata stored on disk confirmed the corruption.
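The coercion can be reproduced without LanceDB. The following is a minimal simulation in plain TypeScript, not LanceDB's actual implementation: `inferColumnType` and `coerceRows` are hypothetical helpers that fix a column's type from the first row and then force every later value into it, which is the behavior described in this PR. The provider and model strings are placeholder values.

```typescript
type MetadataValue = string | number | boolean

interface MixedMetadataRow {
	key: string
	value: MetadataValue
}

// Hypothetical stand-in for first-row type inference.
function inferColumnType(rows: MixedMetadataRow[]): "number" | "string" {
	return typeof rows[0]?.value === "number" ? "number" : "string"
}

// Force every value into the inferred column type, as a typed column would.
function coerceRows(rows: MixedMetadataRow[]): MixedMetadataRow[] {
	const columnType = inferColumnType(rows)
	return rows.map((row) => ({
		key: row.key,
		value: columnType === "number" ? Number(row.value) : String(row.value),
	}))
}

// The numeric row comes first, so the whole column becomes numeric.
const mixedRows: MixedMetadataRow[] = [
	{ key: "vector_size", value: 1024 },
	{ key: "embedding_provider", value: "example-provider" },
	{ key: "embedding_model_id", value: "example-model" },
	{ key: "indexing_complete", value: true },
]

const coerced = new Map(coerceRows(mixedRows).map((row) => [row.key, row.value]))
```

In this simulation the provider and model entries end up as `NaN` and `indexing_complete` ends up as `1`, matching the `NaN`/`0`/`1` pattern reported for the corrupted table.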
Fix
All metadata values are now explicitly converted to strings before being passed to LanceDB, ensuring a consistent `string` column type:

- `_createMetadataData()`: wrap numeric/boolean values with `String()`
- `_upsertMetadata()`: convert `value` to `String(value)` before storing
- `markIndexingComplete()`: store `"true"` instead of `true`
- `markIndexingIncomplete()`: store `"false"` instead of `false`
- `hasIndexedData()`: compare with `"true"` string instead of truthy check

The same issue was present in `_persistEmbeddingProfile()`, which calls `_upsertMetadata()` with raw numeric values; it now also uses `String()`.

Verification
Tested end-to-end with the CLI binary on a small project. Log messages recorded during the test:

- "Starting workspace scan..."
- "Checking for new or modified files..."

After the fix, all metadata values are stored as strings, and the file-hash cache persists across restarts (1123 bytes, not cleared).
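The "all strings" property can be illustrated with a small sketch of the conversion listed under Fix. This is an assumption-laden outline rather than the code from the PR: `createMetadataRows`, `hasIndexedDataFromRows`, and the `ProfileInput` shape are hypothetical names chosen for the example, and only the metadata keys come from the description.

```typescript
interface StringMetadataRow {
	key: string
	value: string
}

// Hypothetical input shape, used only for this example.
interface ProfileInput {
	provider: string
	modelId: string
	vectorSize: number
}

// Every value is stringified before it reaches the table, so the first row
// can no longer pin the column to a numeric type.
function createMetadataRows(profile: ProfileInput, indexingComplete: boolean): StringMetadataRow[] {
	return [
		{ key: "vector_size", value: String(profile.vectorSize) },
		{ key: "embedding_provider", value: profile.provider },
		{ key: "embedding_model_id", value: profile.modelId },
		{ key: "indexing_complete", value: String(indexingComplete) },
	]
}

// Compare against the literal "true": the string "false" is truthy in
// JavaScript, so a plain truthy check would report an incomplete index as complete.
function hasIndexedDataFromRows(rows: StringMetadataRow[]): boolean {
	return rows.find((row) => row.key === "indexing_complete")?.value === "true"
}

const exampleProfile: ProfileInput = {
	provider: "example-provider",
	modelId: "example-model",
	vectorSize: 1024,
}
const incompleteRows = createMetadataRows(exampleProfile, false)
const completeRows = createMetadataRows(exampleProfile, true)
```

Storing booleans as `"true"` and `"false"` is why the read side has to switch from a truthy check to an exact string comparison.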
Backward Compatibility
Existing databases with corrupted metadata (NaN values) will be auto-repaired on the first restart after this fix: the corrupted profile check triggers a one-time table recreation, which writes correct string values. Subsequent restarts use the persistent database normally.
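The auto-repair path follows from how the stored profile is validated. Below is a rough sketch of that check under stated assumptions: `readStoredProfile` and `profileNeedsRecreation` are hypothetical names, the `Map` stands in for rows read from the metadata table, and the comparison against the current profile is an assumption about how a mismatch would be detected. Only the `typeof !== "string"` check and the `undefined` result leading to recreation come from the description.

```typescript
interface EmbeddingProfile {
	provider: string
	modelId: string
	vectorSize: number
}

// Returns undefined when any stored field is unusable, e.g. NaN from a coerced column.
function readStoredProfile(stored: Map<string, unknown>): EmbeddingProfile | undefined {
	const provider = stored.get("embedding_provider")
	const modelId = stored.get("embedding_model_id")
	// vector_size is stored as a string after the fix, so parse it back.
	const vectorSize = Number(stored.get("vector_size"))
	if (typeof provider !== "string" || typeof modelId !== "string") return undefined
	if (!Number.isFinite(vectorSize)) return undefined
	return { provider, modelId, vectorSize }
}

// A missing or mismatched profile means the table has to be rebuilt.
function profileNeedsRecreation(stored: Map<string, unknown>, current: EmbeddingProfile): boolean {
	const profile = readStoredProfile(stored)
	if (profile === undefined) return true
	return (
		profile.provider !== current.provider ||
		profile.modelId !== current.modelId ||
		profile.vectorSize !== current.vectorSize
	)
}

const currentProfile: EmbeddingProfile = {
	provider: "example-provider",
	modelId: "example-model",
	vectorSize: 1024,
}

// Metadata as written before the fix: strings already coerced to NaN.
const corruptedMetadata = new Map<string, unknown>([
	["vector_size", 1024],
	["embedding_provider", NaN],
	["embedding_model_id", NaN],
])

// Metadata as written after the fix: every value is a string.
const repairedMetadata = new Map<string, unknown>([
	["vector_size", "1024"],
	["embedding_provider", "example-provider"],
	["embedding_model_id", "example-model"],
])
```

With the corrupted map the check asks for recreation once; after the table is rewritten with string values, the same check passes and the existing database is reused.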