Skip to content

Add scalar index lifecycle and scalar filter support#10

Merged
PhantomInTheWire merged 9 commits into
masterfrom
feat/scalar-index-lifecycle
Mar 23, 2026
Merged

Add scalar index lifecycle and scalar filter support#10
PhantomInTheWire merged 9 commits into
masterfrom
feat/scalar-index-lifecycle

Conversation

@PhantomInTheWire

@PhantomInTheWire PhantomInTheWire commented Mar 23, 2026

Copy link
Copy Markdown
Owner

Summary

  • add scalar index planning, storage, and engine lifecycle wiring
  • add scalar sidecars and scalar prefilter-aware segment search
  • add LIKE, CONTAINS, and IS NULL filter support

Testing

  • cargo test --manifest-path /Users/ghost/mysearch/GarudaDB/Cargo.toml

Open with Devin

Summary by CodeRabbit

  • New Features
    • Added scalar field indexing for accelerated filtering on non-vector data
    • Introduced string filter operators: LIKE (pattern matching), CONTAINS (substring search), and IS NULL (null checks)
    • Enabled query optimization through scalar index pushdown for improved filter performance

@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@coderabbitai

coderabbitai Bot commented Mar 23, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@PhantomInTheWire has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 26 minutes and 17 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 237c61c8-8a2d-46cc-9623-158d2cc83066

📥 Commits

Reviewing files that changed from the base of the PR and between a154fce and 71e4422.

📒 Files selected for processing (37)
  • crates/garuda-engine/Cargo.toml
  • crates/garuda-engine/src/checkpoint_service.rs
  • crates/garuda-engine/src/filter.rs
  • crates/garuda-engine/src/filter_parser.rs
  • crates/garuda-engine/src/lib.rs
  • crates/garuda-engine/src/query.rs
  • crates/garuda-engine/src/recovery_service.rs
  • crates/garuda-engine/src/schema.rs
  • crates/garuda-engine/src/schema_ddl.rs
  • crates/garuda-engine/src/segment_manager.rs
  • crates/garuda-engine/src/state.rs
  • crates/garuda-engine/tests/common.rs
  • crates/garuda-engine/tests/test_collection_ddl.rs
  • crates/garuda-engine/tests/test_collection_dql.rs
  • crates/garuda-engine/tests/test_collection_exception.rs
  • crates/garuda-engine/tests/test_collection_filter_behavior.rs
  • crates/garuda-engine/tests/test_collection_scalar_index.rs
  • crates/garuda-index-scalar/Cargo.toml
  • crates/garuda-index-scalar/src/lib.rs
  • crates/garuda-meta/src/lib.rs
  • crates/garuda-planner/src/lib.rs
  • crates/garuda-planner/tests/test_query_plan.rs
  • crates/garuda-segment/Cargo.toml
  • crates/garuda-segment/src/codec.rs
  • crates/garuda-segment/src/index.rs
  • crates/garuda-segment/src/lib.rs
  • crates/garuda-segment/src/search.rs
  • crates/garuda-segment/src/storage.rs
  • crates/garuda-segment/src/types.rs
  • crates/garuda-segment/tests/test_flat_sidecar.rs
  • crates/garuda-segment/tests/test_hnsw_sidecar.rs
  • crates/garuda-storage/src/codec.rs
  • crates/garuda-storage/src/layout.rs
  • crates/garuda-storage/src/lib.rs
  • crates/garuda-types/src/query.rs
  • crates/garuda-types/src/schema.rs
  • garuda/src/main.rs
📝 Walkthrough

Walkthrough

Introduces comprehensive scalar field indexing to Garuda DB via a new garuda-index-scalar crate, refactors core APIs to pass CollectionSchema instead of fragmented VectorFieldSchema across segment/storage layers, extends filter parsing to support string matching (LIKE, CONTAINS) and null checks, implements filter plan decomposition for scalar prefiltering, and replaces checkpoint byte-only backups with path-aware capture/restore supporting directory structures.

Changes

Cohort / File(s) Summary
New Scalar Index Infrastructure
crates/garuda-index-scalar/..., crates/garuda-engine/Cargo.toml, crates/garuda-segment/Cargo.toml
New crate for scalar field indexing with ScalarIndex variants (Bool, Int64, Float64, String), insertion/matching operations, and serialization; added as path dependency to engine and segment crates.
Type System & Schema Extensions
crates/garuda-types/src/query.rs, crates/garuda-types/src/schema.rs, crates/garuda-storage/src/codec.rs, crates/garuda-storage/src/layout.rs
Added FilterExpr variants (StringMatch, IsNull), scalar filter types (ScalarPredicate, ScalarPrefilter, LikePattern), IndexKind::Scalar, ScalarIndexState, and persistence paths/codec for scalar index serialization.
Filter & Query Parsing
crates/garuda-engine/src/filter.rs, crates/garuda-engine/src/filter_parser.rs, crates/garuda-meta/src/lib.rs
Extended filter validation for StringMatch/IsNull, added parser support for LIKE/CONTAINS/IS NULL keywords and pattern matching, and updated filter evaluation with string matching and null checking logic.
Query Planning & Decomposition
crates/garuda-planner/src/lib.rs, crates/garuda-planner/tests/test_query_plan.rs
Introduced FilterPlan separating scalar_prefilter from residual filtering, implemented filter decomposition to push indexed scalar predicates down into prefilter phase, updated planning to accept CollectionSchema and derive vector index defaults from schema.
Schema DDL & Validation
crates/garuda-engine/src/schema.rs, crates/garuda-engine/src/schema_ddl.rs
Refactored index management from vector-only to unified APIs: create_index(schema, field_name, params) and drop_index(schema, field_name, kind) supporting both vector and scalar index operations; added validation rejecting nullable indexed scalar fields.
Segment Type & Resource System
crates/garuda-segment/src/types.rs, crates/garuda-segment/src/index.rs
Added scalar_indexes field to WritingSegment/PersistedSegment, refactored signatures to accept CollectionSchema, introduced WritingSearchResources/PersistedSearchResources structs consolidating vector+scalar indexes, and added SegmentSearchRequest enum unifying flat/hnsw requests.
Segment Search & Storage APIs
crates/garuda-segment/src/search.rs, crates/garuda-segment/src/storage.rs, crates/garuda-segment/src/lib.rs, crates/garuda-storage/src/lib.rs
Removed flat/hnsw-specific search functions and SearchVisibility abstraction, replacing with unified search_writing/search_persisted(SegmentSearchRequest, allowed_doc_ids) with scalar prefiltering support; updated persistence layer to accept CollectionSchema and write/read scalar index sidecars; refactored exports.
Core Engine APIs
crates/garuda-engine/src/lib.rs, crates/garuda-engine/src/segment_manager.rs, crates/garuda-engine/src/state.rs
Updated all segment operations to accept CollectionSchema instead of VectorFieldSchema; added scalar index maintenance in append_new_record; integrated schema rebuild after column modifications; adjusted optimize/index management to work with full schema context.
Query Execution
crates/garuda-engine/src/query.rs
Unified persisted/writing segment querying via QuerySegment enum, introduced scalar prefiltering via prefilter_doc_ids(...), refactored search request construction to use SegmentSearchRequest, updated build_query_plan invocation to pass CollectionSchema.
Checkpoint & Recovery
crates/garuda-engine/src/checkpoint_service.rs, crates/garuda-engine/src/recovery_service.rs
Replaced byte-only file backups (capture_file/FileBackup) with path-aware backups (capture_path/PathBackup) supporting missing/directory paths with recursive capture/restore; added scalar index directory capture to checkpoint; refactored recovery to use CollectionSchema.
Test Coverage Additions
crates/garuda-engine/tests/test_collection_ddl.rs, crates/garuda-engine/tests/test_collection_dql.rs, crates/garuda-engine/tests/test_collection_exception.rs, crates/garuda-engine/tests/test_collection_filter_behavior.rs, crates/garuda-engine/tests/test_collection_scalar_index.rs, crates/garuda-engine/tests/common.rs, crates/garuda-segment/tests/test_flat_sidecar.rs, crates/garuda-segment/tests/test_hnsw_sidecar.rs
Updated test schemas to explicitly set ScalarIndexState; added DDL tests for scalar index create/drop; added DQL test for LIKE/CONTAINS/IS NULL filtering; added exception test for nullable indexed fields rejection; added comprehensive scalar index persistence and prefiltering tests.
CLI Updates
garuda/src/main.rs
Extended CLI to recognize "scalar" index kind and map to IndexParams::Scalar; updated schema construction to explicitly set scalar index state.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant QueryEngine as Query Engine
    participant Planner as Query Planner
    participant Segment as Segment Search
    participant ScalarIdx as Scalar Index
    participant Records as Record Store

    Client->>QueryEngine: VectorQuery with filter
    QueryEngine->>Planner: build_query_plan(schema, filter)
    Planner->>Planner: decompose_filter()
    Planner-->>QueryEngine: FilterPlan { scalar_prefilter, residual }
    
    QueryEngine->>Segment: search(SegmentSearchRequest, scalar_prefilter)
    Segment->>ScalarIdx: prefilter_doc_ids(scalar_prefilter)
    ScalarIdx->>ScalarIdx: eval_predicates()
    ScalarIdx-->>Segment: allowed_doc_ids: HashSet
    
    Segment->>Records: vector_search(allowed_doc_ids)
    Records-->>Segment: candidate_docs
    Segment->>Segment: apply_residual_filter()
    Segment-->>QueryEngine: final_results
    
    QueryEngine-->>Client: filtered_search_results
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • Feat/index #8 — Implements identical scalar-index refactor across the same core files and crates (new garuda-index-scalar, schema/segment/storage/query overhaul), representing parallel or competing implementation.
  • Feat/storage #4 — Modifies overlapping filter/query infrastructure (parse_required_filter signature, filter evaluation in garuda-meta, filter expression handling), extending or conflicting with this PR's filter enhancements.
  • hnsw #7 — Updates segment_manager signatures, index create/drop APIs, planner query plan input, and storage read-write APIs that this PR also refactors, suggesting this PR builds upon or revises those changes.

Poem

🐰 Scalar indexes dance in the moonlight,
Prefiltering dreams with paths held tight,
Schemas unified, no more divide,
From filter to segment, a smoother glide!
Checkpoints remember each twist and turn—
Garuда blooms as the databases learn. 🌙✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: introducing scalar index lifecycle management and scalar filter support (LIKE, CONTAINS, IS NULL).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/scalar-index-lifecycle

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 potential issues.

View 6 additional findings in Devin Review.

Open in Devin Review

Comment on lines +157 to +159
StringMatchExpr::Like(garuda_types::LikePattern::PrefixSuffix { prefix, suffix }) => {
value.starts_with(prefix) && value.ends_with(suffix)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 LIKE PrefixSuffix pattern matching allows false positives when prefix+suffix overlap in short strings

The string_matches function at crates/garuda-meta/src/lib.rs:157-159 checks value.starts_with(prefix) && value.ends_with(suffix) without verifying that the value is at least prefix.len() + suffix.len() characters long. This causes false positives when the prefix and suffix overlap in a short value.

For example, LIKE 'ab%b' (prefix="ab", suffix="b") would incorrectly match value "ab" because "ab".starts_with("ab") and "ab".ends_with("b") are both true, even though there aren't enough characters to satisfy both the prefix and suffix without overlap. In correct SQL LIKE semantics, after consuming prefix "ab", the % matches zero or more characters, then suffix "b" must still be matched — but nothing remains. The minimum matching value should be "abb" (length 3 = prefix.len() + suffix.len()).

Suggested change
StringMatchExpr::Like(garuda_types::LikePattern::PrefixSuffix { prefix, suffix }) => {
value.starts_with(prefix) && value.ends_with(suffix)
}
StringMatchExpr::Like(garuda_types::LikePattern::PrefixSuffix { prefix, suffix }) => {
value.len() >= prefix.len() + suffix.len()
&& value.starts_with(prefix)
&& value.ends_with(suffix)
}
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment thread crates/garuda-engine/src/checkpoint_service.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/garuda-storage/src/codec.rs (2)

200-216: ⚠️ Potential issue | 🔴 Critical

Version this manifest layout change before shipping it.

These lines add ScalarIndexState to every persisted ScalarFieldSchema, but the manifest is still treated as the same on-disk format. Reopening a collection written before this PR will misparse or fail while crates/garuda-engine/src/recovery_service.rs Lines 50-54 load the latest manifest, and older binaries cannot detect the new layout either.

🛠️ Suggested compatibility approach
-const FORMAT_VERSION: u16 = 1;
+const MANIFEST_FORMAT_VERSION: u16 = 2;
+const SNAPSHOT_FORMAT_VERSION: u16 = 1;

 pub fn encode_manifest(manifest: &Manifest) -> Result<Vec<u8>, Status> {
     let mut writer = BinaryWriter::new(MANIFEST_MAGIC);
-    writer.write_u16(FORMAT_VERSION);
+    writer.write_u16(MANIFEST_FORMAT_VERSION);
     write_collection_schema(&mut writer, &manifest.schema)?;
     ...
 }

 pub fn decode_manifest(bytes: &[u8]) -> Result<Manifest, Status> {
     let mut reader = BinaryReader::new(bytes, MANIFEST_MAGIC, "storage")?;
-    reader.expect_u16(FORMAT_VERSION)?;
-    let schema = read_collection_schema(&mut reader)?;
+    let manifest_version = reader.read_u16()?;
+    if manifest_version != 1 && manifest_version != MANIFEST_FORMAT_VERSION {
+        return Err(Status::err(
+            StatusCode::Internal,
+            "unsupported manifest format version",
+        ));
+    }
+    let schema = read_collection_schema(&mut reader, manifest_version)?;
-fn read_scalar_field_schema(reader: &mut BinaryReader<'_>) -> Result<ScalarFieldSchema, Status> {
+fn read_scalar_field_schema(
+    reader: &mut BinaryReader<'_>,
+    manifest_version: u16,
+) -> Result<ScalarFieldSchema, Status> {
     let name = FieldName::parse(reader.read_string()?)?;
     let field_type = ScalarType::from_tag(reader.read_u8()?)?;
-    let index = read_scalar_index_state(reader.read_u8()?)?;
+    let index = if manifest_version >= MANIFEST_FORMAT_VERSION {
+        read_scalar_index_state(reader.read_u8()?)?
+    } else {
+        ScalarIndexState::None
+    };
     let nullability = Nullability::from_tag(reader.read_u8()?)?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-storage/src/codec.rs` around lines 200 - 216, The manifest
layout change that adds ScalarIndexState must be versioned and handled for
backward compatibility: introduce a manifest version tag written by the writer
(the routine that writes ScalarFieldSchema) and update the reader in
read_scalar_field_schema to inspect that manifest/version header (or fall back
if no version present) so older on-disk manifests without the extra u8 are
parsed by treating missing ScalarIndexState as the old default; ensure the
writer writes the version before writing fields and the reader branches based on
that version when populating field.index (use
write_scalar_index_state/read_scalar_index_state and
Nullability::from_tag/ScalarType::from_tag as anchors), and update recovery
loading code where manifests are loaded to rely on the versioned format.

279-295: ⚠️ Potential issue | 🟠 Major

Keep Scalar out of VectorIndexState::FlatAndHnsw.default.

read_vector_index_state() uses read_index_kind() for the default vector index kind. After this change, a manually constructed schema can persist IndexKind::Scalar inside a vector field, and crates/garuda-engine/src/schema.rs Lines 48-55 do not reject it today.

🧭 Narrow the vector default codec
         VectorIndexState::FlatAndHnsw { default, hnsw } => {
             writer.write_u8(2);
-            writer.write_u8(write_index_kind(*default));
+            writer.write_u8(write_default_vector_index_kind(*default)?);
             write_hnsw_index_params(writer, hnsw);
         }
         2 => Ok(VectorIndexState::FlatAndHnsw {
-            default: read_index_kind(reader.read_u8()?)?,
+            default: read_default_vector_index_kind(reader.read_u8()?)?,
             hnsw: read_hnsw_index_params(reader)?,
         }),
+fn write_default_vector_index_kind(kind: IndexKind) -> Result<u8, Status> {
+    match kind {
+        IndexKind::Flat | IndexKind::Hnsw => Ok(write_index_kind(kind)),
+        IndexKind::Scalar => Err(Status::err(
+            StatusCode::InvalidArgument,
+            "scalar is not a valid default vector index kind",
+        )),
+    }
+}
+
+fn read_default_vector_index_kind(tag: u8) -> Result<IndexKind, Status> {
+    match read_index_kind(tag)? {
+        IndexKind::Flat => Ok(IndexKind::Flat),
+        IndexKind::Hnsw => Ok(IndexKind::Hnsw),
+        IndexKind::Scalar => Err(Status::err(
+            StatusCode::InvalidArgument,
+            "scalar is not a valid default vector index kind",
+        )),
+    }
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-storage/src/codec.rs` around lines 279 - 295,
read_vector_index_state() currently accepts IndexKind::Scalar via
read_index_kind(), allowing a scalar index to be stored as a vector field
default; change the validation so Scalar is rejected for vector defaults. After
calling read_index_kind() (or when constructing
VectorIndexState::FlatAndHnsw::default), check if the returned kind ==
IndexKind::Scalar and return an Err(Status::err(...)) indicating an
unrecognized/invalid index kind for vector fields; keep read_index_kind()
unchanged and perform this contextual validation in read_vector_index_state()
(referencing read_vector_index_state, read_index_kind,
VectorIndexState::FlatAndHnsw, and IndexKind::Scalar).
🧹 Nitpick comments (2)
crates/garuda-engine/tests/test_collection_exception.rs (1)

139-143: Make this regression test check the intended failure mode.

schema.fields[2] hard-codes the default schema order, and the is_err() assertions will still pass on unrelated failures. Looking up the field by name and asserting StatusCode::InvalidArgument will keep this test focused on the nullable-indexed rule.

🧪 Tighten the regression test
     let mut schema = default_schema("docs");
-    schema.fields[2].index = garuda_types::ScalarIndexState::Indexed;
-    schema.fields[2].nullability = garuda_types::Nullability::Nullable;
+    let category = schema
+        .fields
+        .iter_mut()
+        .find(|field| field.name == field_name("category"))
+        .expect("category field present");
+    category.index = garuda_types::ScalarIndexState::Indexed;
+    category.nullability = garuda_types::Nullability::Nullable;
     let create_result = db.create_collection(schema, default_options());
-    assert!(create_result.is_err());
+    assert_eq!(
+        create_result
+            .expect_err("nullable indexed field should be rejected")
+            .code,
+        garuda_types::StatusCode::InvalidArgument
+    );

Apply the same expect_err(...).code == StatusCode::InvalidArgument pattern to add_result and create_index_result.

Also applies to: 148-155, 166-170

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/tests/test_collection_exception.rs` around lines 139 -
143, The test currently hard-codes schema.fields[2] and uses
assert!(...is_err()), which can pass for unrelated errors; change it to find the
field by name (e.g., locate the field with the expected name in schema.fields
rather than indexing at 2) and then replace bare is_err() checks on
create_result, add_result, and create_index_result with assertions that unwrap
the error and assert its code equals StatusCode::InvalidArgument (for example
use expect_err(...) or unwrap_err().code() == StatusCode::InvalidArgument) so
the test specifically verifies the nullable+indexed rule failure from
db.create_collection/schema modification, the add_* operations, and the
create_index_* call.
crates/garuda-engine/tests/test_collection_scalar_index.rs (1)

49-54: This setup makes the test hit persisted segments, not the writing path.

create_index() checkpoints the collection, so the docs inserted by seed_more_collection_docs() are sealed before the query runs. If the goal is to cover writing-segment scalar index maintenance, insert those docs after index creation, or rename the test to match what it actually exercises.

♻️ Minimal change to actually cover the writing path
     seed_collection(&collection);
-    seed_more_collection_docs(&collection);

     collection
         .create_index(&field_name("rank"), IndexParams::Scalar(ScalarIndexParams))
         .expect("create scalar index on rank");
+    seed_more_collection_docs(&collection);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/tests/test_collection_scalar_index.rs` around lines 49 -
54, The test currently calls seed_more_collection_docs() before
collection.create_index(...), which causes create_index to checkpoint and seal
those docs so the test exercises persisted segments instead of the
writing-segment index path; to fix, either move the
seed_more_collection_docs(&collection) call to after
collection.create_index(&field_name("rank"),
IndexParams::Scalar(ScalarIndexParams)) so inserted docs are written into the
active (writing) segments and exercise scalar index maintenance, or if you
intend to test persisted segments, rename the test to reflect that behavior;
adjust only the order of seed_collection/seed_more_collection_docs relative to
create_index or rename the test function accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/garuda-engine/src/checkpoint_service.rs`:
- Around line 228-257: capture_dir_entries currently computes relative_path
using the changing `path`, which causes nested files to be stripped relative to
their parent dir instead of the original root; change capture_dir_entries to
accept an additional `root: &Path` parameter (or capture the original root in a
closure) so recursive calls pass the same original root, compute `relative_path`
with `child_path.strip_prefix(root)`, and adjust callers to pass the initial
root when first invoking capture_dir_entries; update references to
`capture_dir_entries` and `DirEntryBackup` creation accordingly.

In `@crates/garuda-engine/src/query.rs`:
- Around line 177-180: The code forces FlatSearchRequest.top_k to
segment.doc_count() in segment_request, causing unfiltered flat searches to
fetch entire segments; instead preserve the query's requested top_k and only
widen it when necessary (filters, deletes, or allowed-doc-id scopes are
present). Update the logic around prefilter_doc_ids, segment_request, and where
FlatSearchRequest.top_k is set so that if allowed_doc_ids is None and there are
no active filters/deletes (as already handled by
crates/garuda-segment/src/search.rs), you pass the original plan/query top_k
through; only set top_k = segment.doc_count() when prefilter_doc_ids indicates
restriction or when segment.search requires widening due to
filters/deletes/allowed-doc-ids. Ensure references: prefilter_doc_ids,
segment_request, FlatSearchRequest.top_k, and segment.doc_count() are the places
you change.

In `@crates/garuda-engine/src/segment_manager.rs`:
- Around line 147-162: The write-delete path currently flips record.state to
tombstone but leaves entries in writing_segment.scalar_indexes and the writing
segment's search structures (flat/HNSW) stale; after marking the record as
tombstoned (the code that flips record.state in the write-delete path), remove
the doc_id from writing_segment.scalar_indexes (by calling the same
index.remove(doc_id) logic used on inserts) and ensure the writing_segment's
vector/search resources are rebuilt or updated (e.g., call the existing
rebuild/update routine for HNSW/flat or invoke the same cleanup used at
seal/reload) so tombstoned docs are not persisted via write_writing_segment;
locate and update the code paths around record.state flipping and in functions
manipulating writing_segment to apply this removal/rebuild.

In `@crates/garuda-engine/tests/test_collection_dql.rs`:
- Around line 111-113: The test currently assumes a deterministic tie-break
order for the IS NULL query by asserting ids == vec![doc_id("doc-2"),
doc_id("doc-3"), doc_id("doc-4")]; change this to be order-agnostic: compute the
set (or sort) of ids from null_results and assert it contains exactly
doc_id("doc-2"), doc_id("doc-3"), doc_id("doc-4") regardless of order (e.g.,
compare HashSet of ids or sort the ids vector before asserting), so references
to null_results, ids and doc_id(...) are used to locate and update the
assertion.

In `@crates/garuda-index-scalar/src/lib.rs`:
- Around line 38-39: Remove the derived PartialEq from FloatKey and instead
implement PartialEq (and Eq) manually so equality matches your Ord
implementation that uses total_cmp; specifically, keep #[derive(Clone, Copy,
Debug)] on pub struct FloatKey(f64), add impl PartialEq for FloatKey { fn
eq(&self, other: &Self) -> bool { self.0.total_cmp(&other.0) ==
std::cmp::Ordering::Equal } } and add impl Eq for FloatKey {}, ensuring the
Eq/Ord contract is satisfied with the existing Ord implementation that uses
total_cmp().
- Around line 126-151: The function prefilter_doc_ids treats an empty
ScalarPrefilter::And as "match nothing" by returning Some(empty set); instead
guard against empty predicates by checking if predicates.is_empty() and return
None (i.e., no restriction / match-everything) before the loop. Update
prefilter_doc_ids (ScalarPrefilter::And arm handling) to early-return None for
empty predicates so callers see no prefilter rather than an empty HashSet.

In `@crates/garuda-meta/src/lib.rs`:
- Around line 157-159: The current StringMatchExpr::Like arm for
LikePattern::PrefixSuffix only checks starts_with(prefix) && ends_with(suffix)
which incorrectly accepts inputs where prefix and suffix overlap; change the
condition in that match arm (the StringMatchExpr::Like ->
LikePattern::PrefixSuffix { prefix, suffix } branch) to also require that the
value has enough length to accommodate both pieces without overlap, e.g. require
value.len() >= prefix.len() + suffix.len() (or equivalently ensure the suffix
start index is >= prefix.len()), so the match becomes starts_with(prefix) &&
ends_with(suffix) && value.len() >= prefix.len() + suffix.len().

In `@crates/garuda-segment/src/storage.rs`:
- Around line 168-185: The scalar index replacement currently removes
segment_scalar_index_dir(root, segment_id) up front and writes each field file
directly, which can leave the segment in a partially-updated state; instead,
create a temporary staging directory (e.g., segment_scalar_index_dir(root,
segment_id) with a suffix), write all encoded scalar index files into that temp
dir using write_file_atomically and segment_scalar_index_path-like names, then
atomically rename the temp dir into place (replacing the old scalar dir) once
all writes succeed; ensure create_dir_all and encode_scalar_index calls target
the temp dir and preserve the existing behavior
read_persisted_segment/load_scalar_indexes expect (so load_scalar_indexes will
always see either the old complete set or the new complete set).

In `@crates/garuda-types/src/schema.rs`:
- Around line 51-60: The binary codec now deserializes the new index u8 into the
wrong slot for older manifests because ScalarFieldSchema's layout changed;
instead of serde defaults, add explicit binary-versioning or migration in the
BinaryReader/BinaryWriter path so old payloads are detected and fields are
remapped: update the binary read/write logic (the code handling
ScalarFieldSchema deserialization) to read a manifest version header (or try
old-layout fallback), and when an older version is detected, read the previous
layout (reading nullability where index would be and then set index to a
sensible default) or perform a field-by-field migration, ensuring
ScalarIndexState/index is reconstructed correctly without treating nullability
as index. Ensure the writer records the new version and tests cover reading both
old and new manifest blobs.

---

Outside diff comments:
In `@crates/garuda-storage/src/codec.rs`:
- Around line 200-216: The manifest layout change that adds ScalarIndexState
must be versioned and handled for backward compatibility: introduce a manifest
version tag written by the writer (the routine that writes ScalarFieldSchema)
and update the reader in read_scalar_field_schema to inspect that
manifest/version header (or fall back if no version present) so older on-disk
manifests without the extra u8 are parsed by treating missing ScalarIndexState
as the old default; ensure the writer writes the version before writing fields
and the reader branches based on that version when populating field.index (use
write_scalar_index_state/read_scalar_index_state and
Nullability::from_tag/ScalarType::from_tag as anchors), and update recovery
loading code where manifests are loaded to rely on the versioned format.
- Around line 279-295: read_vector_index_state() currently accepts
IndexKind::Scalar via read_index_kind(), allowing a scalar index to be stored as
a vector field default; change the validation so Scalar is rejected for vector
defaults. After calling read_index_kind() (or when constructing
VectorIndexState::FlatAndHnsw::default), check if the returned kind ==
IndexKind::Scalar and return an Err(Status::err(...)) indicating an
unrecognized/invalid index kind for vector fields; keep read_index_kind()
unchanged and perform this contextual validation in read_vector_index_state()
(referencing read_vector_index_state, read_index_kind,
VectorIndexState::FlatAndHnsw, and IndexKind::Scalar).

---

Nitpick comments:
In `@crates/garuda-engine/tests/test_collection_exception.rs`:
- Around line 139-143: The test currently hard-codes schema.fields[2] and uses
assert!(...is_err()), which can pass for unrelated errors; change it to find the
field by name (e.g., locate the field with the expected name in schema.fields
rather than indexing at 2) and then replace bare is_err() checks on
create_result, add_result, and create_index_result with assertions that unwrap
the error and assert its code equals StatusCode::InvalidArgument (for example
use expect_err(...) or unwrap_err().code() == StatusCode::InvalidArgument) so
the test specifically verifies the nullable+indexed rule failure from
db.create_collection/schema modification, the add_* operations, and the
create_index_* call.

In `@crates/garuda-engine/tests/test_collection_scalar_index.rs`:
- Around line 49-54: The test currently calls seed_more_collection_docs() before
collection.create_index(...), which causes create_index to checkpoint and seal
those docs so the test exercises persisted segments instead of the
writing-segment index path; to fix, either move the
seed_more_collection_docs(&collection) call to after
collection.create_index(&field_name("rank"),
IndexParams::Scalar(ScalarIndexParams)) so inserted docs are written into the
active (writing) segments and exercise scalar index maintenance, or if you
intend to test persisted segments, rename the test to reflect that behavior;
adjust only the order of seed_collection/seed_more_collection_docs relative to
create_index or rename the test function accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 27658a26-acf5-477a-8457-ca31d2ec1f99

📥 Commits

Reviewing files that changed from the base of the PR and between 6a81e9d and a154fce.

📒 Files selected for processing (37)
  • crates/garuda-engine/Cargo.toml
  • crates/garuda-engine/src/checkpoint_service.rs
  • crates/garuda-engine/src/filter.rs
  • crates/garuda-engine/src/filter_parser.rs
  • crates/garuda-engine/src/lib.rs
  • crates/garuda-engine/src/query.rs
  • crates/garuda-engine/src/recovery_service.rs
  • crates/garuda-engine/src/schema.rs
  • crates/garuda-engine/src/schema_ddl.rs
  • crates/garuda-engine/src/segment_manager.rs
  • crates/garuda-engine/src/state.rs
  • crates/garuda-engine/tests/common.rs
  • crates/garuda-engine/tests/test_collection_ddl.rs
  • crates/garuda-engine/tests/test_collection_dql.rs
  • crates/garuda-engine/tests/test_collection_exception.rs
  • crates/garuda-engine/tests/test_collection_filter_behavior.rs
  • crates/garuda-engine/tests/test_collection_scalar_index.rs
  • crates/garuda-index-scalar/Cargo.toml
  • crates/garuda-index-scalar/src/lib.rs
  • crates/garuda-meta/src/lib.rs
  • crates/garuda-planner/src/lib.rs
  • crates/garuda-planner/tests/test_query_plan.rs
  • crates/garuda-segment/Cargo.toml
  • crates/garuda-segment/src/codec.rs
  • crates/garuda-segment/src/index.rs
  • crates/garuda-segment/src/lib.rs
  • crates/garuda-segment/src/search.rs
  • crates/garuda-segment/src/storage.rs
  • crates/garuda-segment/src/types.rs
  • crates/garuda-segment/tests/test_flat_sidecar.rs
  • crates/garuda-segment/tests/test_hnsw_sidecar.rs
  • crates/garuda-storage/src/codec.rs
  • crates/garuda-storage/src/layout.rs
  • crates/garuda-storage/src/lib.rs
  • crates/garuda-types/src/query.rs
  • crates/garuda-types/src/schema.rs
  • garuda/src/main.rs

Comment thread crates/garuda-engine/src/checkpoint_service.rs Outdated
Comment thread crates/garuda-engine/src/query.rs
Comment on lines +147 to +162
for field in &schema.fields {
if !field.is_indexed() {
continue;
}

let value = doc
.fields
.get(field.name.as_str())
.expect("validated indexed scalar field should exist");
let index = self
.writing_segment
.scalar_indexes
.get_mut(&field.name)
.expect("enabled writing scalar index should exist");
index.insert(doc_id, value);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Rebuild the writing segment's search resources after tombstones.

Lines 147-162 now maintain writing_segment.scalar_indexes incrementally, but the write-delete path still returns on Lines 184-186 after only flipping record.state. Tombstoned doc IDs therefore stay in the writing segment's flat/HNSW/scalar state until seal/reload, and write_writing_segment() persists segment.scalar_indexes verbatim in crates/garuda-segment/src/storage.rs:105-111.

♻️ Suggested fix
-    pub(crate) fn mark_writing_deleted(&mut self, doc_id: InternalDocId) -> bool {
+    pub(crate) fn mark_writing_deleted(
+        &mut self,
+        doc_id: InternalDocId,
+        schema: &CollectionSchema,
+    ) -> bool {
         let Some(record) = live_record_by_internal_id(&mut self.writing_segment.records, doc_id)
         else {
             return false;
         };

         record.state = RecordState::Deleted;
-        self.writing_segment.sync_meta();
+        self.writing_segment = WritingSegment::new(
+            self.writing_segment.meta.clone(),
+            self.writing_segment.records.clone(),
+            schema,
+        );
         true
     }

-        if self.mark_writing_deleted(doc_id) {
+        if self.mark_writing_deleted(doc_id, schema) {
             return true;
         }

Also applies to: 179-200

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/src/segment_manager.rs` around lines 147 - 162, The
write-delete path currently flips record.state to tombstone but leaves entries
in writing_segment.scalar_indexes and the writing segment's search structures
(flat/HNSW) stale; after marking the record as tombstoned (the code that flips
record.state in the write-delete path), remove the doc_id from
writing_segment.scalar_indexes (by calling the same index.remove(doc_id) logic
used on inserts) and ensure the writing_segment's vector/search resources are
rebuilt or updated (e.g., call the existing rebuild/update routine for HNSW/flat
or invoke the same cleanup used at seal/reload) so tombstoned docs are not
persisted via write_writing_segment; locate and update the code paths around
record.state flipping and in functions manipulating writing_segment to apply
this removal/rebuild.

Comment on lines +111 to +113
let null_results = collection.query(null_query).expect("is null query");
let ids: Vec<_> = null_results.into_iter().map(|doc| doc.id).collect();
assert_eq!(ids, vec![doc_id("doc-2"), doc_id("doc-3"), doc_id("doc-4")]);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Don't assert the tie order for IS NULL results.

doc-3 and doc-4 both have zero similarity to this probe vector, so their relative order is an implementation detail of the merge/tie-break path.

🧪 Make the assertion robust to score ties
     let null_results = collection.query(null_query).expect("is null query");
     let ids: Vec<_> = null_results.into_iter().map(|doc| doc.id).collect();
-    assert_eq!(ids, vec![doc_id("doc-2"), doc_id("doc-3"), doc_id("doc-4")]);
+    assert_eq!(ids.len(), 3);
+    assert_eq!(ids[0], doc_id("doc-2"));
+    assert!(ids[1..].contains(&doc_id("doc-3")));
+    assert!(ids[1..].contains(&doc_id("doc-4")));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let null_results = collection.query(null_query).expect("is null query");
let ids: Vec<_> = null_results.into_iter().map(|doc| doc.id).collect();
assert_eq!(ids, vec![doc_id("doc-2"), doc_id("doc-3"), doc_id("doc-4")]);
let null_results = collection.query(null_query).expect("is null query");
let ids: Vec<_> = null_results.into_iter().map(|doc| doc.id).collect();
assert_eq!(ids.len(), 3);
assert_eq!(ids[0], doc_id("doc-2"));
assert!(ids[1..].contains(&doc_id("doc-3")));
assert!(ids[1..].contains(&doc_id("doc-4")));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/tests/test_collection_dql.rs` around lines 111 - 113,
The test currently assumes a deterministic tie-break order for the IS NULL query
by asserting ids == vec![doc_id("doc-2"), doc_id("doc-3"), doc_id("doc-4")];
change this to be order-agnostic: compute the set (or sort) of ids from
null_results and assert it contains exactly doc_id("doc-2"), doc_id("doc-3"),
doc_id("doc-4") regardless of order (e.g., compare HashSet of ids or sort the
ids vector before asserting), so references to null_results, ids and doc_id(...)
are used to locate and update the assertion.

Comment thread crates/garuda-index-scalar/src/lib.rs Outdated
Comment on lines +126 to +151
pub fn prefilter_doc_ids(
prefilter: &ScalarPrefilter,
scalar_indexes: &BTreeMap<FieldName, ScalarIndex>,
) -> Option<HashSet<InternalDocId>> {
let ScalarPrefilter::And(predicates) = prefilter else {
return None;
};

let mut doc_ids: Option<HashSet<InternalDocId>> = None;

for predicate in predicates {
let index = scalar_indexes
.get(&predicate.field)
.expect("planned scalar prefilter should have an index");
let matching_doc_ids = index.matching_doc_ids(predicate);

doc_ids = Some(match doc_ids {
None => matching_doc_ids,
Some(doc_ids) => doc_ids
.into_iter()
.filter(|doc_id| matching_doc_ids.contains(doc_id))
.collect(),
});
}

Some(doc_ids.unwrap_or_default())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C3 'ScalarPrefilter::And' crates

Repository: PhantomInTheWire/garudadb

Length of output: 3894


🏁 Script executed:

#!/bin/bash
# Find all places where ScalarPrefilter::And is constructed
rg -n 'ScalarPrefilter::And\s*\(' crates --type rust

echo "---"

# Find all callers of prefilter_doc_ids
rg -n 'prefilter_doc_ids' crates --type rust -B2 -A2

Repository: PhantomInTheWire/garudadb

Length of output: 4155


🏁 Script executed:

#!/bin/bash
# Find ScalarPrefilter definition
rg -n 'enum ScalarPrefilter\|pub enum ScalarPrefilter' crates --type rust -A 10

echo "---"

# Check if ScalarPrefilter has Serialize/Deserialize or is used in serialization contexts
rg -n 'Serialize\|Deserialize' crates/garuda-index-scalar --type rust | head -20

Repository: PhantomInTheWire/garudadb

Length of output: 72


🏁 Script executed:

#!/bin/bash
# Find ScalarPrefilter enum definition across all crates
rg -n 'pub enum ScalarPrefilter' crates --type rust -A 15

echo "---"

# Also check struct definitions in case it's a different pattern
rg -n 'ScalarPrefilter\s*=' crates --type rust | head -20

Repository: PhantomInTheWire/garudadb

Length of output: 998


🏁 Script executed:

#!/bin/bash
# Get ScalarPrefilter definition with full derives
sed -n '120,135p' crates/garuda-types/src/query.rs

echo "---"

# Check if there's any deserialization from external sources (JSON, etc.)
rg -n 'from_json\|from_str\|serde_json' crates/garuda-engine --type rust | head -15

Repository: PhantomInTheWire/garudadb

Length of output: 452


🏁 Script executed:

#!/bin/bash
# Check if queries are deserialized directly or always go through planner
rg -n 'ScalarPrefilter::' crates --type rust | grep -E 'from_|parse|json'

echo "---"

# Look for any direct query deserialization or API endpoints that might accept ScalarPrefilter
rg -n 'serde_json::from' crates --type rust | head -10

Repository: PhantomInTheWire/garudadb

Length of output: 383


Add a guard for empty And predicates.

The function currently returns Some(HashSet::new()) when predicates is empty, meaning an empty conjunction becomes "match nothing" rather than "match everything" (the correct identity for AND). While the planner prevents this by returning ScalarPrefilter::All for empty predicates, the function should still defend against it since ScalarPrefilter is Serialize/Deserialize-enabled and could be constructed from external sources.

Add the suggested check:

Fix
 let ScalarPrefilter::And(predicates) = prefilter else {
     return None;
 };
+if predicates.is_empty() {
+    return None;
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-index-scalar/src/lib.rs` around lines 126 - 151, The function
prefilter_doc_ids treats an empty ScalarPrefilter::And as "match nothing" by
returning Some(empty set); instead guard against empty predicates by checking if
predicates.is_empty() and return None (i.e., no restriction / match-everything)
before the loop. Update prefilter_doc_ids (ScalarPrefilter::And arm handling) to
early-return None for empty predicates so callers see no prefilter rather than
an empty HashSet.

Comment on lines +157 to +159
StringMatchExpr::Like(garuda_types::LikePattern::PrefixSuffix { prefix, suffix }) => {
value.starts_with(prefix) && value.ends_with(suffix)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Reject overlapping prefix/suffix matches in LIKE.

This arm over-matches when the prefix and suffix overlap. For example, ab%bc currently matches abc because both boundary checks succeed even though a single wildcard cannot cover overlapping segments. That can return or delete the wrong documents for LIKE filters.

🐛 Proposed fix
         StringMatchExpr::Like(garuda_types::LikePattern::PrefixSuffix { prefix, suffix }) => {
-            value.starts_with(prefix) && value.ends_with(suffix)
+            value.len() >= prefix.len() + suffix.len()
+                && value.starts_with(prefix)
+                && value.ends_with(suffix)
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
StringMatchExpr::Like(garuda_types::LikePattern::PrefixSuffix { prefix, suffix }) => {
value.starts_with(prefix) && value.ends_with(suffix)
}
StringMatchExpr::Like(garuda_types::LikePattern::PrefixSuffix { prefix, suffix }) => {
value.len() >= prefix.len() + suffix.len()
&& value.starts_with(prefix)
&& value.ends_with(suffix)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-meta/src/lib.rs` around lines 157 - 159, The current
StringMatchExpr::Like arm for LikePattern::PrefixSuffix only checks
starts_with(prefix) && ends_with(suffix) which incorrectly accepts inputs where
prefix and suffix overlap; change the condition in that match arm (the
StringMatchExpr::Like -> LikePattern::PrefixSuffix { prefix, suffix } branch) to
also require that the value has enough length to accommodate both pieces without
overlap, e.g. require value.len() >= prefix.len() + suffix.len() (or
equivalently ensure the suffix start index is >= prefix.len()), so the match
becomes starts_with(prefix) && ends_with(suffix) && value.len() >= prefix.len()
+ suffix.len().

Comment on lines +168 to +185
let scalar_dir = segment_scalar_index_dir(root, segment_id);
remove_path_if_exists(&scalar_dir)?;

if live_doc_count == 0 {
return Ok(());
}

create_dir_all(&scalar_dir, "failed to create scalar index directory")?;

for field in indexed_scalar_fields(schema) {
let index = scalar_indexes
.get(&field.name)
.expect("enabled scalar index should exist");
let bytes = encode_scalar_index(index, live_doc_count)?;
write_file_atomically(
&segment_scalar_index_path(root, segment_id, &field.name),
&bytes,
)?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Replace scalar sidecars atomically.

Line 169 deletes the live scalar-index directory before any replacement exists. If one of the writes on Lines 177-185 fails or the process stops midway, read_persisted_segment() will reopen the new segment data, but load_scalar_indexes() in crates/garuda-segment/src/index.rs:219-243 will hard-fail on the missing field file. Please stage these files in a temp directory and rename it into place only after the full set is written.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-segment/src/storage.rs` around lines 168 - 185, The scalar
index replacement currently removes segment_scalar_index_dir(root, segment_id)
up front and writes each field file directly, which can leave the segment in a
partially-updated state; instead, create a temporary staging directory (e.g.,
segment_scalar_index_dir(root, segment_id) with a suffix), write all encoded
scalar index files into that temp dir using write_file_atomically and
segment_scalar_index_path-like names, then atomically rename the temp dir into
place (replacing the old scalar dir) once all writes succeed; ensure
create_dir_all and encode_scalar_index calls target the temp dir and preserve
the existing behavior read_persisted_segment/load_scalar_indexes expect (so
load_scalar_indexes will always see either the old complete set or the new
complete set).

Comment on lines +51 to 60
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum ScalarIndexState {
None,
Indexed,
}

impl IndexParams {
pub fn kind(&self) -> IndexKind {
match self {
Self::Flat(_) => IndexKind::Flat,
Self::Hnsw(_) => IndexKind::Hnsw,
}
impl ScalarIndexState {
pub fn is_indexed(self) -> bool {
matches!(self, Self::Indexed)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C2 'CollectionSchema|ScalarFieldSchema' crates
rg -n -C2 'checkpoint|capture|restore|catalog\.schema|read_catalog|write_catalog' crates

Repository: PhantomInTheWire/garudadb

Length of output: 50381


🏁 Script executed:

sed -n '206,225p' crates/garuda-storage/src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 722


🏁 Script executed:

sed -n '194,204p' crates/garuda-storage/src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 475


🏁 Script executed:

rg -n 'codec.*version|format.*version|migration|CODEC_VERSION' crates/garuda-storage/src/

Repository: PhantomInTheWire/garudadb

Length of output: 292


🏁 Script executed:

# Check for any version handling in the manifest or persistence layer
rg -n 'Manifest|read_collection_schema|write_collection_schema' crates/garuda-storage/src/codec.rs | head -20

Repository: PhantomInTheWire/garudadb

Length of output: 661


🏁 Script executed:

# Check for recent changes to ScalarFieldSchema and index field addition
git log --oneline -n 20 -- crates/garuda-types/src/schema.rs 2>/dev/null || echo "Git history unavailable"

Repository: PhantomInTheWire/garudadb

Length of output: 127


🏁 Script executed:

# Check ScalarIndexState enum and any helpers for it
sed -n '51,65p' crates/garuda-types/src/schema.rs
# Also check if there's a read_scalar_index_state function and how it handles values
rg -n 'read_scalar_index_state|write_scalar_index_state' crates/garuda-storage/src/codec.rs -A 5

Repository: PhantomInTheWire/garudadb

Length of output: 1341


🏁 Script executed:

# View the full ScalarFieldSchema struct definition
sed -n '178,186p' crates/garuda-types/src/schema.rs

Repository: PhantomInTheWire/garudadb

Length of output: 336


🏁 Script executed:

# Check the git diff to see if index field is new
git diff HEAD~1 crates/garuda-types/src/schema.rs 2>/dev/null | head -50
# Or check the previous commit
git show HEAD:crates/garuda-types/src/schema.rs 2>/dev/null | sed -n '178,186p'

Repository: PhantomInTheWire/garudadb

Length of output: 336


🏁 Script executed:

# Check for any migration or compat layer handling
rg -n 'old|legacy|migrate|compat|backward' crates/garuda-storage/src/codec.rs
# Also look for tests that read old payloads
rg -n 'test.*decode.*manifest|test.*read.*schema' crates/garuda-storage/tests/ -l

Repository: PhantomInTheWire/garudadb

Length of output: 137


🏁 Script executed:

# Find test files
find crates -name '*test*.rs' -type f | grep -E '(codec|manifest|schema)' | head -10
# Also check if there are any backward compat considerations or version checks in the codebase
rg -n 'backward|compat|version' crates/garuda-storage/src/ | grep -v 'manifest_version\|version_id' | head -20

Repository: PhantomInTheWire/garudadb

Length of output: 301


🏁 Script executed:

# Check the version module
cat crates/garuda-storage/src/version.rs | head -100

Repository: PhantomInTheWire/garudadb

Length of output: 2157


🏁 Script executed:

# Check what changed in this commit
git show --stat HEAD 2>/dev/null | head -30
# Try to see the actual changes
git diff HEAD~1..HEAD crates/garuda-types/src/schema.rs 2>/dev/null || git show HEAD:crates/garuda-types/src/schema.rs | head -100

Repository: PhantomInTheWire/garudadb

Length of output: 4276


🏁 Script executed:

# Check when ScalarFieldSchema was first introduced and if it had the index field originally
git log --all --follow -p -- crates/garuda-types/src/schema.rs 2>/dev/null | grep -A 10 "struct ScalarFieldSchema" | head -30
# Or check if there are any manifest files with old format in tests
find crates -name '*.rs' -type f | xargs rg -l 'manifest.*test|encode.*decode' | head -5

Repository: PhantomInTheWire/garudadb

Length of output: 798


🏁 Script executed:

# Verify the old ScalarFieldSchema definition didn't have index field
git show HEAD~1:crates/garuda-types/src/schema.rs 2>/dev/null | sed -n '170,185p'

Repository: PhantomInTheWire/garudadb

Length of output: 51


🏁 Script executed:

# Try to look at git log more carefully
git log --oneline --all -- crates/garuda-types/src/schema.rs | head -5
# Get the previous commit
git rev-parse HEAD~1

Repository: PhantomInTheWire/garudadb

Length of output: 468


🏁 Script executed:

# Check for integration tests that might save and reload manifests
rg -n 'encode_manifest|decode_manifest|write_manifest|read_.*_manifest' crates/garuda-engine/tests/ --type rs | head -20
# Also check test_storage_version_manager.rs which was mentioned
cat crates/garuda-engine/tests/test_storage_version_manager.rs 2>/dev/null | head -100

Repository: PhantomInTheWire/garudadb

Length of output: 3523


Binary codec field misalignment breaks deserialization of older persisted manifests.

ScalarFieldSchema now reads index as a binary u8 field in strict position. If existing manifests were persisted before the index field was added, the binary codec will read the old nullability u8 value as the new index field, causing field misalignment and deserialization failure on reopening collections.

This is not a serde JSON issue—the codec uses BinaryWriter/BinaryReader for persistence. The proposed #[serde(default)] fix is incorrect. A proper fix requires binary codec versioning or field migration logic to handle old manifest payloads gracefully.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-types/src/schema.rs` around lines 51 - 60, The binary codec now
deserializes the new index u8 into the wrong slot for older manifests because
ScalarFieldSchema's layout changed; instead of serde defaults, add explicit
binary-versioning or migration in the BinaryReader/BinaryWriter path so old
payloads are detected and fields are remapped: update the binary read/write
logic (the code handling ScalarFieldSchema deserialization) to read a manifest
version header (or try old-layout fallback), and when an older version is
detected, read the previous layout (reading nullability where index would be and
then set index to a sensible default) or perform a field-by-field migration,
ensuring ScalarIndexState/index is reconstructed correctly without treating
nullability as index. Ensure the writer records the new version and tests cover
reading both old and new manifest blobs.

@PhantomInTheWire PhantomInTheWire force-pushed the feat/scalar-index-lifecycle branch from a154fce to 571dad0 Compare March 23, 2026 12:41
@PhantomInTheWire PhantomInTheWire force-pushed the feat/scalar-index-lifecycle branch from aef47ef to cc6ce9c Compare March 23, 2026 18:44

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 15 additional findings in Devin Review.

Open in Devin Review

) -> Result<(), Status> {
writer.write_string(field.name.as_str())?;
writer.write_u8(field.field_type.to_tag());
writer.write_u8(write_scalar_index_state(field.index));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Manifest FORMAT_VERSION not bumped despite breaking binary format change

A new ScalarIndexState byte is inserted into the scalar field schema serialization (at crates/garuda-storage/src/codec.rs:200) between field_type and nullability, but the manifest FORMAT_VERSION remains at 1. When reading an old manifest (which lacks this byte), the old nullability tag is misinterpreted as the new index state, and all subsequent bytes are shifted — causing cascading corruption of the deserialized CollectionSchema. The checksum still passes (same bytes), and expect_u16(FORMAT_VERSION) also passes since both old and new versions use 1, so the corruption is silent. Bumping FORMAT_VERSION to 2 would cause old manifests to fail cleanly with "unsupported format version" instead of being silently misread.

Prompt for agents
In crates/garuda-storage/src/codec.rs, the FORMAT_VERSION constant (currently 1) must be bumped to 2 because the scalar field schema binary format has changed: a new ScalarIndexState byte is now written between field_type and nullability (line 200). The constant is defined near the top of the file (around line 16). Change it from 1 to 2. Additionally, consider whether you want to support reading old version-1 manifests by adding a migration path in read_scalar_field_schema (around line 206) that detects version 1 and reads the old format (without the index byte, defaulting to ScalarIndexState::None). This would require threading the format version through to the schema reader functions.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@PhantomInTheWire PhantomInTheWire merged commit b2bcbe9 into master Mar 23, 2026
5 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request Apr 2, 2026
@PhantomInTheWire PhantomInTheWire deleted the feat/scalar-index-lifecycle branch April 2, 2026 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant