Skip to content

Add zvec-style binary storage runtime for GarudaDB#3

Merged
PhantomInTheWire merged 5 commits into
masterfrom
feat/storage
Mar 18, 2026
Merged

Add zvec-style binary storage runtime for GarudaDB#3
PhantomInTheWire merged 5 commits into
masterfrom
feat/storage

Conversation

@PhantomInTheWire

@PhantomInTheWire PhantomInTheWire commented Mar 18, 2026

Copy link
Copy Markdown
Owner

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Write-Ahead Logging (WAL) for improved crash recovery; uncommitted writes are now preserved and replayed on collection reopening.
    • Introduced checkpoint-based persistence workflow for enhanced data consistency during failures.
  • Refactor

    • Migrated from JSON-based to binary storage format for more efficient on-disk representation.
    • Implemented manifest versioning for better state management and recovery workflows.

@coderabbitai

coderabbitai Bot commented Mar 18, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Engine storage was reworked: binary storage, snapshots, WAL, and manifest versioning were moved into garuda-storage and garuda-segment; engine now uses VersionManager and snapshot-backed manifest loading, performs WAL replay for the writing segment, and implements a two-phase checkpoint workflow with capture/restore semantics.

Changes

Cohort / File(s) Summary
Bootstrap & Engine Core
crates/garuda-engine/src/bootstrap.rs, crates/garuda-engine/src/lib.rs, crates/garuda-engine/src/state.rs
Switch to VersionManager/read_latest_manifest, add INITIAL_* IDs on manifest, load id-map/delete snapshots, update signatures, and replace in-memory index rebuild with WAL replay (replay_writing_segment_wal) and apply_wal_op semantics.
Persistence / Checkpointing (engine)
crates/garuda-engine/src/persistence.rs, crates/garuda-engine/src/persistence.rs
Add checkpoint_state two‑phase flow: capture current files, write segments/idmap/delete snapshots, persist manifest via VersionManager, rollback restore on failure, cleanup stale segments/snapshots, and reset WAL.
Removed engine JSON helpers
crates/garuda-engine/src/storage_io.rs (deleted), crates/garuda-engine/src/id_map.rs (deleted), crates/garuda-engine/src/delete_store.rs (deleted), crates/garuda-engine/src/version.rs (deleted), crates/garuda-engine/src/segment.rs (deleted)
Removed legacy JSON I/O, idmap/delete-store, version store, and in-engine segment persistence (responsibilities moved to garuda-storage / garuda-segment).
Garuda Storage (new crate)
crates/garuda-storage/src/lib.rs, crates/garuda-storage/src/codec.rs, crates/garuda-storage/src/io.rs, crates/garuda-storage/src/layout.rs, crates/garuda-storage/src/snapshots.rs, crates/garuda-storage/src/version.rs
New binary codecs for manifest/idmap/delete, atomic file I/O, canonical layout helpers, snapshot read/write/remove, and VersionManager for versioned manifests and pruning.
Garuda Segment (new crate)
crates/garuda-segment/src/lib.rs, crates/garuda-segment/src/codec.rs, crates/garuda-segment/src/wal.rs
New SegmentFile/StoredRecord, binary segment codec, WAL implementation (WalOp), append_wal_ops/read_wal_ops/reset_wal, and segment lifecycle helpers.
State / WAL integration
crates/garuda-engine/src/state.rs
Add apply_wal_op and is_redundant_wal_op with helpers to detect no-op inserts/updates/deletes; writing segment meta uses WRITING_SEGMENT_ID.
Storage re-exports
crates/garuda-engine/src/storage.rs
Replaced local storage helpers with crate-private re-exports from garuda_storage (crate visibility), shrinking engine-local public surface.
Types & Manifest fields
crates/garuda-types/src/lib.rs, crates/garuda-engine/src/bootstrap.rs
Add ManifestVersionId, SnapshotId, StorageFormatVersion; extend Manifest with id_map_snapshot_id, delete_snapshot_id, manifest_version_id; initialize with new INITIAL_* constants.
Tests & Test Utilities
crates/garuda-engine/tests/common.rs, crates/garuda-engine/tests/*.rs (multiple)
Add utilities for manifest/snapshot discovery and options; update/add tests for manifest versions, snapshot recovery, WAL semantics, checkpoint rollback, on-disk layout, and recovery edge cases.
Refactors / Imports & Build
crates/garuda-engine/src/ddl.rs, crates/garuda-engine/src/optimize.rs, crates/garuda-engine/Cargo.toml, crates/garuda-storage/Cargo.toml
Updated imports to garuda_segment/garuda_storage, removed serde_json workspace flags from Cargo manifests.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Lib as garuda-engine::lib
    participant State as CollectionState
    participant Seg as garuda-segment
    participant Store as garuda-storage
    participant VM as VersionManager
    participant Persist as persistence

    Client->>Lib: insert_one(doc)
    Lib->>State: apply_doc_write_batch([Insert(doc)])
    State->>Seg: append_wal_ops(WRITING_SEGMENT_ID, [Insert])
    Note over State: WAL appended to writing segment

    Client->>Lib: flush()
    Lib->>Persist: checkpoint_state(state)
    Persist->>Persist: capture_checkpoint_files()
    Persist->>Seg: write_segment(persisted segments & writing segment)
    Persist->>Store: write_id_map_snapshot(), write_delete_snapshot()
    Persist->>Seg: reset_wal(WRITING_SEGMENT_ID)
    Persist->>VM: write_manifest(manifest)
    alt Success
        Persist-->>Lib: Ok
    else Failure
        Persist->>Persist: restore_captured_files()
        Persist-->>Lib: Err
    end
    Lib-->>Client: Result
Loading
sequenceDiagram
    participant Reopen as Reopen
    participant Bootstrap as bootstrap.rs
    participant VM as VersionManager
    participant Store as garuda-storage
    participant Seg as garuda-segment
    participant State as CollectionState

    Reopen->>Bootstrap: load_collection_state(path)
    Bootstrap->>VM: read_latest_manifest()
    VM->>Store: decode_manifest
    VM-->>Bootstrap: Manifest
    Bootstrap->>Store: read_id_map_snapshot(manifest.id_map_snapshot_id)
    Store-->>Bootstrap: IdMap
    Bootstrap->>Store: read_delete_snapshot(manifest.delete_snapshot_id)
    Store-->>Bootstrap: DeleteSet
    Bootstrap->>Seg: load_persisted_segments(manifest)
    loop persisted segments
        Seg-->>Bootstrap: SegmentFile
    end
    Bootstrap->>Seg: read_wal_ops(WRITING_SEGMENT_ID)
    Seg-->>Bootstrap: Vec<WalOp>
    Bootstrap->>State: apply_wal_op() per WalOp
    Bootstrap-->>Reopen: CollectionState
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs

  • refactor: version #1: related—also modifies manifest/version handling and shifts manifest loading toward a versioned manager approach.

Poem

🐰 I nibble checksums and hop through the log,

snapshots in pockets, no more JSON fog.
WALs drum the heartbeat, manifests march in line,
segments tucked safely, recovery feels fine.
Hooray — storage hums, garuda grows bright!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 10.15% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: introducing zvec-style binary storage runtime for GarudaDB. The changeset extensively implements binary encoding/decoding across multiple modules (codec, snapshots, VersionManager), persistence checkpointing, and WAL-based recovery—all central to a binary storage runtime.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/storage
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@PhantomInTheWire

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@PhantomInTheWire

Copy link
Copy Markdown
Owner Author

@codex review

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0bea2e5f2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/garuda-engine/src/persistence.rs Outdated
state.manifest.delete_snapshot_id,
&state.deleted_doc_ids,
)?;
reset_wal(&state.path, WRITING_SEGMENT_ID)?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Publish the new manifest before clearing the writing WAL

If the process dies after reset_wal() succeeds but before write_manifest() does, recovery will reopen the previous manifest and only load the segments listed there (bootstrap::load_collection_state). Any writes that were flushed into a newly sealed segment during this checkpoint are then lost: the segment data is on disk, but the old manifest does not reference it and the WAL that could replay those inserts has already been erased. This affects collections that rolled the writing segment before flush().

Useful? React with 👍 / 👎.

Comment thread crates/garuda-engine/src/state.rs Outdated

fn is_redundant_wal_op(&self, op: &WalOp) -> bool {
match op {
WalOp::Insert(doc) => self.live_doc_matches(doc),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Make stale insert WAL entries idempotent during recovery

Recovery replays the WAL on top of whatever data.seg was just loaded. If a checkpoint crashes after write_all_segments() but before the WAL is reset, the segment file may already contain the final state while the WAL still contains earlier ops. In a sequence like insert then update before flush(), replaying the stale Insert hits AlreadyExists, and apply_wal_op() turns that into a hard reopen failure even though the on-disk segment is already correct. The current redundancy check only skips inserts when the full document matches exactly, which is too narrow for partially checkpointed state.

Useful? React with 👍 / 👎.

Comment on lines +66 to +67
let payload_size =
u64::from_le_bytes(bytes[cursor..cursor + 8].try_into().expect("fixed length"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate each WAL entry header before slicing its length

read_wal_ops() assumes there are always 8 bytes available for the payload length once it enters the loop. A truncated or otherwise corrupt WAL with fewer than 8 bytes remaining after the tag will panic on bytes[cursor..cursor + 8] instead of returning a Status, which can take down collection open/recovery on bad disk data. Since WAL files are untrusted persistence inputs, this path should fail gracefully rather than indexing past the buffer.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (2)
crates/garuda-engine/tests/common.rs (1)

120-131: Narrow helper matching to numeric versioned files only.

At Line 130 and Line 148, starts_with(...) is broad and can match temp artifacts like manifest.1.tmp, which makes count/name assertions brittle across failure-path tests.

♻️ Proposed helper hardening
+fn has_numeric_suffix(file_name: &str, prefix: &str) -> bool {
+    file_name
+        .strip_prefix(prefix)
+        .and_then(|suffix| suffix.parse::<u64>().ok())
+        .is_some()
+}
+
 pub fn manifest_version_paths(root: &std::path::Path, name: &str) -> Vec<PathBuf> {
@@
-            file_name.starts_with("manifest.")
+            has_numeric_suffix(file_name, "manifest.")
         })
         .collect::<Vec<_>>();
@@
 pub fn storage_snapshot_paths(root: &std::path::Path, name: &str, prefix: &str) -> Vec<PathBuf> {
@@
-            file_name.starts_with(prefix)
+            has_numeric_suffix(file_name, prefix)
         })
         .collect::<Vec<_>>();

Also applies to: 138-149

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/tests/common.rs` around lines 120 - 131, The helper
manifest_version_paths is currently using file_name.starts_with("manifest.")
which also matches temporary/artifact names like manifest.1.tmp; change the
filter to accept only names where the suffix after "manifest." is purely numeric
(e.g., parse or check that all remaining chars are digits) so files like
manifest.1.tmp are excluded; update the closure in manifest_version_paths to
extract the suffix (after "manifest.") and ensure suffix.chars().all(|c|
c.is_ascii_digit()) (or attempt to parse to an integer) before including the
path.
crates/garuda-segment/src/wal.rs (1)

175-184: Consider extracting shared checksum function.

This FNV-1a checksum implementation is duplicated in codec.rs. Consider moving it to a shared location (e.g., a hash module or garuda-types) to avoid duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-segment/src/wal.rs` around lines 175 - 184, The checksum
function (fn checksum) is duplicated between wal.rs and codec.rs; extract this
FNV-1a implementation into a shared module (e.g., create a new hash module or
place it in garuda-types) and replace the local implementations with a single
public function (e.g., hash::checksum or garuda_types::checksum); update wal.rs
and codec.rs to import and call that shared checksum function and remove the
duplicate definitions so both files reference the one centralized
implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/garuda-engine/src/state.rs`:
- Around line 21-37: The WAL replay treats WalOp::Insert as redundant only when
full doc matches, causing apply_wal_op to return Err if insert_doc returns
AlreadyExists for the same DocId; update apply_wal_op so that when handling
WalOp::Insert (in function apply_wal_op calling insert_doc) you convert an
AlreadyExists result.status into Ok (i.e., treat AlreadyExists as
harmless/idempotent during replay) instead of propagating Err; you can implement
this by checking the op variant (WalOp::Insert) and the result.status.code and
returning Ok() for AlreadyExists, otherwise preserve existing error handling.

In `@crates/garuda-engine/src/storage.rs`:
- Around line 1-4: segment.rs expects the constant SEGMENTS_DIR_NAME to be
available from crate::storage but storage.rs currently re-exports only
LOCK_FILE_NAME, collection_dir, ensure_database_root,
ensure_existing_collection_dir, and ensure_new_collection_dir; add
SEGMENTS_DIR_NAME to the re-exports (or add it into garuda_storage::layout if
missing) so segment.rs can import it—specifically, ensure the SEGMENTS_DIR_NAME
constant is defined in garuda_storage (e.g., garuda_storage::layout) and then
include SEGMENTS_DIR_NAME in the pub(crate) use list in storage.rs so segment.rs
can reference SEGMENTS_DIR_NAME on lines where it is used.

In `@crates/garuda-storage/src/codec.rs`:
- Around line 173-176: The inline numeric casts (e.g., segment_max_docs =
reader.read_u64()? as usize) must be changed to use TryFrom/usize::try_from(...)
and return a suitable Err(Status) when the conversion fails to avoid silent
truncation on 32-bit targets; update the four sites mentioned (segment_max_docs
in read_collection_options, m, ef_construction, ef_search, and doc_count) to
mirror the existing read_len() pattern: call reader.read_u64()?, then
usize::try_from(value).map_err(|_| Status::invalid_argument("... out of range"))
(or the same error construction used by read_len()) so out-of-range values are
rejected rather than truncated.
- Around line 44-45: The code currently preallocates vectors directly from
on-disk counts (e.g., using reader.read_len() to set persisted_segment_count and
calling Vec::with_capacity(persisted_segment_count)), which can panic or OOM on
corrupt input; update the logic to validate the count against a safe upper bound
(or refuse excessively large counts) before allocating and avoid unbounded
with_capacity calls—either allocate an empty Vec and push items in the loop or
perform a guarded reserve after checking count; apply the same fix to the other
occurrences referenced (the counts at the other read_len() sites around
persisted_segments and similar variables) and return a structured error when the
count is above the allowed limit.

In `@crates/garuda-storage/src/io.rs`:
- Around line 27-31: The File::open(path) error is always mapped to
StatusCode::NotFound; change this to map only io::ErrorKind::NotFound to
StatusCode::NotFound and map all other errors to StatusCode::Internal. Locate
the File::open(path) call and its map_err closure and inspect the underlying
std::io::Error (use error.kind()), returning Status::err(StatusCode::NotFound,
...) for ErrorKind::NotFound and Status::err(StatusCode::Internal, ...)
(including the original error message) for all other kinds; keep using the same
Status::err and StatusCode symbols for consistency.

In `@crates/garuda-storage/src/layout.rs`:
- Around line 21-31: The ensure_new_collection_dir function currently checks
path.exists() then calls create_dir_all, which is racy and fails to persist the
new entry in the parent; change it to call std::fs::create_dir(path) atomically
and map a AlreadyExists IO error to StatusCode::AlreadyExists, then create the
lock file (LOCK_FILE_NAME) inside the new directory and call
sync_directory(path) and also sync_directory on the parent directory
(path.parent().unwrap()) to persist the directory entry; keep existing error
wrapping/messages consistent when converting IO errors to Status.

In `@crates/garuda-storage/src/snapshots.rs`:
- Around line 37-41: When building id_map from entries, don't silently overwrite
duplicates: after parsing each key with DocId::parse and before inserting into
id_map, check whether that DocId already exists (e.g., via id_map.contains_key
or checking the return value of id_map.insert). If a duplicate is found return
an Err (propagate the existing function's error type) with a clear message
referencing the duplicate DocId and the conflicting internal_doc_id so snapshot
loading fails fast; update the loop that iterates over entries to perform this
check instead of blindly inserting.

---

Nitpick comments:
In `@crates/garuda-engine/tests/common.rs`:
- Around line 120-131: The helper manifest_version_paths is currently using
file_name.starts_with("manifest.") which also matches temporary/artifact names
like manifest.1.tmp; change the filter to accept only names where the suffix
after "manifest." is purely numeric (e.g., parse or check that all remaining
chars are digits) so files like manifest.1.tmp are excluded; update the closure
in manifest_version_paths to extract the suffix (after "manifest.") and ensure
suffix.chars().all(|c| c.is_ascii_digit()) (or attempt to parse to an integer)
before including the path.

In `@crates/garuda-segment/src/wal.rs`:
- Around line 175-184: The checksum function (fn checksum) is duplicated between
wal.rs and codec.rs; extract this FNV-1a implementation into a shared module
(e.g., create a new hash module or place it in garuda-types) and replace the
local implementations with a single public function (e.g., hash::checksum or
garuda_types::checksum); update wal.rs and codec.rs to import and call that
shared checksum function and remove the duplicate definitions so both files
reference the one centralized implementation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 53d66891-79da-4b56-85a0-42bada0a9807

📥 Commits

Reviewing files that changed from the base of the PR and between b4d276d and a0bea2e.

📒 Files selected for processing (25)
  • crates/garuda-engine/src/bootstrap.rs
  • crates/garuda-engine/src/ddl.rs
  • crates/garuda-engine/src/lib.rs
  • crates/garuda-engine/src/optimize.rs
  • crates/garuda-engine/src/persistence.rs
  • crates/garuda-engine/src/state.rs
  • crates/garuda-engine/src/storage.rs
  • crates/garuda-engine/tests/common.rs
  • crates/garuda-engine/tests/test_collection_exception.rs
  • crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs
  • crates/garuda-engine/tests/test_storage_atomic_writes.rs
  • crates/garuda-engine/tests/test_storage_disk_format.rs
  • crates/garuda-engine/tests/test_storage_id_map.rs
  • crates/garuda-engine/tests/test_storage_segment_layout.rs
  • crates/garuda-engine/tests/test_storage_version_manager.rs
  • crates/garuda-segment/src/codec.rs
  • crates/garuda-segment/src/lib.rs
  • crates/garuda-segment/src/wal.rs
  • crates/garuda-storage/src/codec.rs
  • crates/garuda-storage/src/io.rs
  • crates/garuda-storage/src/layout.rs
  • crates/garuda-storage/src/lib.rs
  • crates/garuda-storage/src/snapshots.rs
  • crates/garuda-storage/src/version.rs
  • crates/garuda-types/src/lib.rs

Comment thread crates/garuda-engine/src/state.rs
Comment thread crates/garuda-engine/src/storage.rs
Comment on lines +44 to +45
let persisted_segment_count = reader.read_len()?;
let mut persisted_segments = Vec::with_capacity(persisted_segment_count);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd crates/garuda-storage && wc -l src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 85


🏁 Script executed:

cd crates/garuda-storage && head -150 src/codec.rs | tail -120

Repository: PhantomInTheWire/garudadb

Length of output: 3742


🏁 Script executed:

cd crates/garuda-storage && sed -n '40,90p' src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 1619


🏁 Script executed:

cd crates/garuda-storage && sed -n '100,150p' src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 1565


🏁 Script executed:

cd crates/garuda-storage && sed -n '140,200p' src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 2056


🏁 Script executed:

cd crates/garuda-storage && sed -n '220,260p' src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 1296


🏁 Script executed:

cd crates/garuda-storage && find . -name "*.rs" | xargs grep -l "struct BinaryReader"

Repository: PhantomInTheWire/garudadb

Length of output: 83


🏁 Script executed:

cd crates/garuda-storage && find . -name "*.rs" | xargs grep -l "fn read_len"

Repository: PhantomInTheWire/garudadb

Length of output: 83


🏁 Script executed:

cd crates/garuda-storage && sed -n '170,180p' src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 431


🏁 Script executed:

cd crates/garuda-storage && sed -n '225,260p' src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 1266


🏁 Script executed:

cd crates/garuda-storage && grep -n "impl BinaryReader" src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 51


🏁 Script executed:

cd crates/garuda-storage && grep -n "fn read_len" src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 126


🏁 Script executed:

cd crates/garuda-storage && sed -n '350,450p' src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 2675


🏁 Script executed:

cd crates/garuda-storage && sed -n '520,540p' src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 850


🏁 Script executed:

cd crates/garuda-storage && sed -n '500,575p' src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 2289


Avoid unbounded preallocation from on-disk counts.

Even though read_len() validates values fit in usize, a corrupt file can still claim huge counts that cause Vec::with_capacity() to panic or OOM before the loop returns a structured error. For example, a file claiming 2 billion entries on a 1GB system will immediately fail on allocation.

Suggested fix
-    let mut persisted_segments = Vec::with_capacity(persisted_segment_count);
+    let mut persisted_segments = Vec::new();
@@
-    let mut entries = Vec::with_capacity(count);
+    let mut entries = Vec::new();
@@
-    let mut ids = Vec::with_capacity(count);
+    let mut ids = Vec::new();
@@
-    let mut fields = Vec::with_capacity(field_count);
+    let mut fields = Vec::new();

Also applies to: 82-83, 108-109, 142-143

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-storage/src/codec.rs` around lines 44 - 45, The code currently
preallocates vectors directly from on-disk counts (e.g., using reader.read_len()
to set persisted_segment_count and calling
Vec::with_capacity(persisted_segment_count)), which can panic or OOM on corrupt
input; update the logic to validate the count against a safe upper bound (or
refuse excessively large counts) before allocating and avoid unbounded
with_capacity calls—either allocate an empty Vec and push items in the loop or
perform a guarded reserve after checking count; apply the same fix to the other
occurrences referenced (the counts at the other read_len() sites around
persisted_segments and similar variables) and return a structured error when the
count is above the allowed limit.

Comment on lines +173 to +176
fn read_collection_options(reader: &mut BinaryReader<'_>) -> Result<CollectionOptions, Status> {
let read_only = reader.read_bool()?;
let enable_mmap = reader.read_bool()?;
let segment_max_docs = reader.read_u64()? as usize;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n crates/garuda-storage/src/codec.rs | sed -n '170,260p'

Repository: PhantomInTheWire/garudadb

Length of output: 3528


🏁 Script executed:

rg -n "try_from" crates/garuda-storage/src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 148


🏁 Script executed:

cat -n crates/garuda-storage/src/codec.rs | sed -n '400,415p; 520,535p'

Repository: PhantomInTheWire/garudadb

Length of output: 1395


🏁 Script executed:

rg -n "as usize" crates/garuda-storage/src/codec.rs

Repository: PhantomInTheWire/garudadb

Length of output: 346


Replace as usize casts with try_from to reject out-of-range values.

These casts silently truncate on 32-bit targets. The decoder already uses this pattern elsewhere with proper error handling (see read_len()). Apply to all four locations: line 176 (segment_max_docs), lines 230-232 (m, ef_construction, ef_search), and line 256 (doc_count).

Suggested fix
-    let segment_max_docs = reader.read_u64()? as usize;
+    let segment_max_docs = usize::try_from(reader.read_u64()?)
+        .map_err(|_| Status::err(StatusCode::Internal, "value exceeds supported range"))?;
@@
-            m: reader.read_u64()? as usize,
-            ef_construction: reader.read_u64()? as usize,
-            ef_search: reader.read_u64()? as usize,
+            m: usize::try_from(reader.read_u64()?)
+                .map_err(|_| Status::err(StatusCode::Internal, "value exceeds supported range"))?,
+            ef_construction: usize::try_from(reader.read_u64()?)
+                .map_err(|_| Status::err(StatusCode::Internal, "value exceeds supported range"))?,
+            ef_search: usize::try_from(reader.read_u64()?)
+                .map_err(|_| Status::err(StatusCode::Internal, "value exceeds supported range"))?,
@@
-        doc_count: reader.read_u64()? as usize,
+        doc_count: usize::try_from(reader.read_u64()?)
+            .map_err(|_| Status::err(StatusCode::Internal, "value exceeds supported range"))?,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-storage/src/codec.rs` around lines 173 - 176, The inline
numeric casts (e.g., segment_max_docs = reader.read_u64()? as usize) must be
changed to use TryFrom/usize::try_from(...) and return a suitable Err(Status)
when the conversion fails to avoid silent truncation on 32-bit targets; update
the four sites mentioned (segment_max_docs in read_collection_options, m,
ef_construction, ef_search, and doc_count) to mirror the existing read_len()
pattern: call reader.read_u64()?, then usize::try_from(value).map_err(|_|
Status::invalid_argument("... out of range")) (or the same error construction
used by read_len()) so out-of-range values are rejected rather than truncated.

Comment on lines +27 to +31
let mut file = File::open(path).map_err(|error| {
Status::err(
StatusCode::NotFound,
format!("failed to open file {}: {error}", path.display()),
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

head -50 crates/garuda-storage/src/io.rs

Repository: PhantomInTheWire/garudadb

Length of output: 1635


🏁 Script executed:

rg "error\.kind\(\)" crates/garuda-storage/src/ --type rust -A 2 -B 2

Repository: PhantomInTheWire/garudadb

Length of output: 51


🏁 Script executed:

rg "StatusCode::" crates/garuda-storage/src/ --type rust | head -20

Repository: PhantomInTheWire/garudadb

Length of output: 1915


🏁 Script executed:

grep -n "pub enum StatusCode\|enum StatusCode" crates/garuda-types/src/*.rs

Repository: PhantomInTheWire/garudadb

Length of output: 94


🏁 Script executed:

sed -n '401,430p' crates/garuda-types/src/lib.rs

Repository: PhantomInTheWire/garudadb

Length of output: 626


🏁 Script executed:

sed -n '26,70p' crates/garuda-storage/src/io.rs

Repository: PhantomInTheWire/garudadb

Length of output: 1362


Only map actual ENOENT cases to NotFound.

All File::open failures here are mapped to StatusCode::NotFound, but permission errors, directory errors, and other IO faults should use StatusCode::Internal instead. Collapsing these masks real storage problems and can send recovery down the wrong path. Other functions in this file (create_dir_all, create_empty_file) already follow this pattern with StatusCode::Internal.

Suggested fix
-    let mut file = File::open(path).map_err(|error| {
-        Status::err(
-            StatusCode::NotFound,
-            format!("failed to open file {}: {error}", path.display()),
-        )
-    })?;
+    let mut file = File::open(path).map_err(|error| {
+        let code = if error.kind() == std::io::ErrorKind::NotFound {
+            StatusCode::NotFound
+        } else {
+            StatusCode::Internal
+        };
+
+        Status::err(code, format!("failed to open file {}: {error}", path.display()))
+    })?;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let mut file = File::open(path).map_err(|error| {
Status::err(
StatusCode::NotFound,
format!("failed to open file {}: {error}", path.display()),
)
let mut file = File::open(path).map_err(|error| {
let code = if error.kind() == std::io::ErrorKind::NotFound {
StatusCode::NotFound
} else {
StatusCode::Internal
};
Status::err(code, format!("failed to open file {}: {error}", path.display()))
})?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-storage/src/io.rs` around lines 27 - 31, The File::open(path)
error is always mapped to StatusCode::NotFound; change this to map only
io::ErrorKind::NotFound to StatusCode::NotFound and map all other errors to
StatusCode::Internal. Locate the File::open(path) call and its map_err closure
and inspect the underlying std::io::Error (use error.kind()), returning
Status::err(StatusCode::NotFound, ...) for ErrorKind::NotFound and
Status::err(StatusCode::Internal, ...) (including the original error message)
for all other kinds; keep using the same Status::err and StatusCode symbols for
consistency.

Comment on lines +21 to +31
pub fn ensure_new_collection_dir(path: &Path) -> Result<(), Status> {
if path.exists() {
return Err(Status::err(
StatusCode::AlreadyExists,
"collection directory already exists",
));
}

create_dir_all(path, "failed to create collection directory")?;
create_empty_file(&path.join(LOCK_FILE_NAME), "failed to create lock file")?;
sync_directory(path)?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "layout.rs" | head -20

Repository: PhantomInTheWire/garudadb

Length of output: 106


🏁 Script executed:

cat -n crates/garuda-storage/src/layout.rs | sed -n '1,50p'

Repository: PhantomInTheWire/garudadb

Length of output: 2021


🏁 Script executed:

cat -n crates/garuda-storage/src/io.rs | head -100

Repository: PhantomInTheWire/garudadb

Length of output: 3628


🏁 Script executed:

rg "ensure_new_collection_dir" -A 5 -B 5

Repository: PhantomInTheWire/garudadb

Length of output: 3522


🏁 Script executed:

rg "create_collection" -A 20 crates/garuda-engine/src/lib.rs | head -60

Repository: PhantomInTheWire/garudadb

Length of output: 1966


🏁 Script executed:

rg "checkpoint" crates/garuda-engine/src/lib.rs -B 2 -A 2

Repository: PhantomInTheWire/garudadb

Length of output: 2177


🏁 Script executed:

rg "ensure_new_collection_dir\|create_collection" crates/ --type rust | grep -E "test|fn" | head -30

Repository: PhantomInTheWire/garudadb

Length of output: 51


🏁 Script executed:

find . -type f -name "*.rs" -path "*/tests/*" | head -20

Repository: PhantomInTheWire/garudadb

Length of output: 1264


🏁 Script executed:

cat -n crates/garuda-engine/tests/test_collection_concurrency.rs | head -80

Repository: PhantomInTheWire/garudadb

Length of output: 1686


🏁 Script executed:

cat -n crates/garuda-engine/tests/test_collection_create_and_open.rs | head -60

Repository: PhantomInTheWire/garudadb

Length of output: 1120


🏁 Script executed:

cat -n crates/garuda-engine/src/lib.rs | sed -n '1,40p'

Repository: PhantomInTheWire/garudadb

Length of output: 1645


🏁 Script executed:

rg "checkpoint_state" crates/garuda-engine/src/persistence.rs -A 10 | head -40

Repository: PhantomInTheWire/garudadb

Length of output: 581


🏁 Script executed:

cat -n crates/garuda-storage/src/io.rs | sed -n '85,99p'

Repository: PhantomInTheWire/garudadb

Length of output: 645


🏁 Script executed:

rg "create_collection\|ensure_new_collection_dir" crates/garuda-engine/src/lib.rs -B 5 -A 10 | head -50

Repository: PhantomInTheWire/garudadb

Length of output: 51


🏁 Script executed:

cat -n crates/garuda-engine/src/lib.rs | sed -n '60,100p'

Repository: PhantomInTheWire/garudadb

Length of output: 1496


🏁 Script executed:

cat -n crates/garuda-engine/src/lib.rs | sed -n '40,60p'

Repository: PhantomInTheWire/garudadb

Length of output: 682


Replace path.exists() + create_dir_all() with atomic create_dir() and sync the parent directory.

The current approach has two problems. First, the existence check and directory creation form a race window: two concurrent callers could both pass the path.exists() check before either calls create_dir_all(), allowing both to think they uniquely created the collection. Second, syncing only the collection directory itself does not persist its entry in the parent directory's inode—a crash between creation and the next checkpoint could lose the entire collection directory from the filesystem.

Using std::fs::create_dir() eliminates the race by making creation atomic. Follow it with sync_directory() on the parent to ensure the directory entry is durable.

Suggested fix
 pub fn ensure_new_collection_dir(path: &Path) -> Result<(), Status> {
-    if path.exists() {
-        return Err(Status::err(
-            StatusCode::AlreadyExists,
-            "collection directory already exists",
-        ));
-    }
-
-    create_dir_all(path, "failed to create collection directory")?;
+    std::fs::create_dir(path).map_err(|error| {
+        let code = if error.kind() == std::io::ErrorKind::AlreadyExists {
+            StatusCode::AlreadyExists
+        } else {
+            StatusCode::Internal
+        };
+
+        Status::err(code, format!("failed to create collection directory: {error}"))
+    })?;
     create_empty_file(&path.join(LOCK_FILE_NAME), "failed to create lock file")?;
     sync_directory(path)?;
+    let parent = path.parent().ok_or_else(|| {
+        Status::err(
+            StatusCode::Internal,
+            "cannot determine parent directory for collection creation",
+        )
+    })?;
+    sync_directory(parent)?;
 
     Ok(())
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-storage/src/layout.rs` around lines 21 - 31, The
ensure_new_collection_dir function currently checks path.exists() then calls
create_dir_all, which is racy and fails to persist the new entry in the parent;
change it to call std::fs::create_dir(path) atomically and map a AlreadyExists
IO error to StatusCode::AlreadyExists, then create the lock file
(LOCK_FILE_NAME) inside the new directory and call sync_directory(path) and also
sync_directory on the parent directory (path.parent().unwrap()) to persist the
directory entry; keep existing error wrapping/messages consistent when
converting IO errors to Status.

Comment on lines +37 to +41
let mut id_map = HashMap::new();

for (doc_id, internal_doc_id) in entries {
id_map.insert(DocId::parse(doc_id)?, internal_doc_id);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd crates/garuda-storage && wc -l src/snapshots.rs

Repository: PhantomInTheWire/garudadb

Length of output: 89


🏁 Script executed:

head -60 crates/garuda-storage/src/snapshots.rs

Repository: PhantomInTheWire/garudadb

Length of output: 2041


🏁 Script executed:

rg -A 15 "fn read_id_map_snapshot" crates/garuda-storage/src/snapshots.rs

Repository: PhantomInTheWire/garudadb

Length of output: 501


🏁 Script executed:

rg "Status::" crates/garuda-storage/src/snapshots.rs | head -20

Repository: PhantomInTheWire/garudadb

Length of output: 114


🏁 Script executed:

rg "HashMap::insert|contains_key" crates/garuda-storage/src/ -B 2 -A 2

Repository: PhantomInTheWire/garudadb

Length of output: 51


🏁 Script executed:

rg "Status::err" crates/garuda-storage/src/ -B 1 -A 3 | head -40

Repository: PhantomInTheWire/garudadb

Length of output: 2455


🏁 Script executed:

rg "StatusCode::" crates/garuda-storage/src/snapshots.rs

Repository: PhantomInTheWire/garudadb

Length of output: 140


🏁 Script executed:

rg "duplicate|corruption|invalid" crates/garuda-storage/src/snapshots.rs -i

Repository: PhantomInTheWire/garudadb

Length of output: 51


🏁 Script executed:

git log --oneline -n 20 -- crates/garuda-storage/src/snapshots.rs

Repository: PhantomInTheWire/garudadb

Length of output: 127


🏁 Script executed:

rg "fn parse" crates/garuda-storage/src/ -A 3

Repository: PhantomInTheWire/garudadb

Length of output: 51


🏁 Script executed:

rg "decode_id_map" crates/garuda-storage/src/codec.rs -A 10

Repository: PhantomInTheWire/garudadb

Length of output: 451


🏁 Script executed:

cat -n crates/garuda-storage/src/codec.rs | sed -n '1,50p'

Repository: PhantomInTheWire/garudadb

Length of output: 2557


🏁 Script executed:

rg "write_id_map_snapshot\|read_id_map_snapshot" crates/ --type rust -B 2 -A 2

Repository: PhantomInTheWire/garudadb

Length of output: 51


🏁 Script executed:

rg "write_id_map_snapshot" crates/ --type rust -B 3 -A 3

Repository: PhantomInTheWire/garudadb

Length of output: 2012


🏁 Script executed:

rg "DocId::parse" crates/garuda-storage/src/ -B 2 -A 2

Repository: PhantomInTheWire/garudadb

Length of output: 381


🏁 Script executed:

rg "fn as_str\|impl.*DocId" crates/garuda-types/src/ --type rust -A 5 | head -50

Repository: PhantomInTheWire/garudadb

Length of output: 51


🏁 Script executed:

rg -A 10 "fn read_delete_snapshot" crates/garuda-storage/src/snapshots.rs

Repository: PhantomInTheWire/garudadb

Length of output: 440


🏁 Script executed:

rg "decode_delete_snapshot" crates/garuda-storage/src/codec.rs -A 10

Repository: PhantomInTheWire/garudadb

Length of output: 417


🏁 Script executed:

cat -n crates/garuda-storage/src/snapshots.rs | sed -n '45,75p'

Repository: PhantomInTheWire/garudadb

Length of output: 1268


Reject duplicate doc IDs when loading an id-map snapshot.

HashMap::insert silently overwrites earlier entries when the snapshot contains duplicate document IDs. A corrupted or hand-edited snapshot can thus change reopen state without failing, instead of failing fast. Note that the delete snapshot loader uses HashSet which naturally deduplicates, creating an inconsistency.

Suggested fix
-    let mut id_map = HashMap::new();
+    let mut id_map = HashMap::with_capacity(entries.len());
@@
-        id_map.insert(DocId::parse(doc_id)?, internal_doc_id);
+        let doc_id = DocId::parse(doc_id)?;
+        if id_map.contains_key(&doc_id) {
+            return Err(Status::err(
+                StatusCode::Internal,
+                format!("duplicate doc id in id-map snapshot: {}", doc_id.as_str()),
+            ));
+        }
+        id_map.insert(doc_id, internal_doc_id);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-storage/src/snapshots.rs` around lines 37 - 41, When building
id_map from entries, don't silently overwrite duplicates: after parsing each key
with DocId::parse and before inserting into id_map, check whether that DocId
already exists (e.g., via id_map.contains_key or checking the return value of
id_map.insert). If a duplicate is found return an Err (propagate the existing
function's error type) with a clear message referencing the duplicate DocId and
the conflicting internal_doc_id so snapshot loading fails fast; update the loop
that iterates over entries to perform this check instead of blindly inserting.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3255a9fc5d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

state: &CollectionState,
version_manager: &VersionManager,
) -> Result<(), Status> {
write_all_segments(state)?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Delay in-place segment rewrites until manifest commit

write_checkpoint_files() overwrites data.seg before publishing the new manifest. For checkpoints that rewrite existing records in place (for example alter_column/drop_column, which call rename_column_in_state and drop_column_from_state), a crash after write_all_segments() but before version_manager.write_manifest() leaves the previous manifest on disk but some segment files already in the new shape. On reopen, read_latest_manifest() loads the old schema against those rewritten docs, so filters/fetches can miss renamed or dropped fields and the collection no longer reflects any committed state.

Useful? React with 👍 / 👎.

match op {
WalOp::Insert(doc) => self.live_doc_matches(doc),
WalOp::Upsert(doc) => self.live_doc_matches(doc),
WalOp::Update(doc) => self.update_already_applied(doc),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Treat stale update WAL entries as redundant after deletes

The new redundancy check only skips Update when there is still a live record to compare against. If a flush crashes after checkpointing a later delete into data.seg but before the WAL is cleared, reopening loads the tombstoned state and then replays the earlier Update; update_already_applied() returns false, update_doc() raises NotFound, and collection open fails even though the on-disk state is already correct. A simple update followed by delete before flush() is enough to trigger this recovery failure.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
crates/garuda-engine/tests/test_storage_disk_format.rs (2)

45-46: Remove redundant legacy-path assertion.

At Line 46, the check is redundant once Line 45 already asserts segments/ does not exist; it adds no extra signal.

Cleanup
     assert!(!collection_dir.join("segments").exists());
-    assert!(!collection_dir.join("segments").join("segment-00000000000000000001.json").exists());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/tests/test_storage_disk_format.rs` around lines 45 - 46,
The second assertion is redundant because
collection_dir.join("segments").exists() already covers the nonexistence of any
files under that directory; remove the specific file check assertion that
references
collection_dir.join("segments").join("segment-00000000000000000001.json") so
only assert!(!collection_dir.join("segments").exists()) remains (i.e., delete
the assertion that checks for "segment-00000000000000000001.json").

33-40: Refactor test assertions to validate snapshot filename structure rather than exact IDs.

Lines 33-40 assert exact filenames (idmap.1, del.1), which couples the test to the current snapshot ID sequencing (starting from 0, incrementing on first flush). While these exact names are guaranteed by the current design, this makes the test brittle if ID management evolves. Instead, validate that both files use matching numeric suffixes and follow the expected prefix pattern, which is the actual contract being tested.

Suggested refactor
-    assert_eq!(
-        idmap_paths[0].file_name().and_then(|name| name.to_str()),
-        Some("idmap.1")
-    );
-    assert_eq!(
-        delete_paths[0].file_name().and_then(|name| name.to_str()),
-        Some("del.1")
-    );
+    let idmap_name = idmap_paths[0]
+        .file_name()
+        .and_then(|name| name.to_str())
+        .expect("idmap snapshot filename");
+    let delete_name = delete_paths[0]
+        .file_name()
+        .and_then(|name| name.to_str())
+        .expect("delete snapshot filename");
+
+    assert!(idmap_name.starts_with("idmap."));
+    assert!(delete_name.starts_with("del."));
+
+    let idmap_suffix = idmap_name.trim_start_matches("idmap.");
+    let delete_suffix = delete_name.trim_start_matches("del.");
+    assert_eq!(idmap_suffix, delete_suffix);
+    assert!(idmap_suffix.parse::<u64>().is_ok(), "snapshot suffix must be numeric");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/tests/test_storage_disk_format.rs` around lines 33 - 40,
The assertions currently check exact filenames ("idmap.1" and "del.1") which
couples the test to a specific snapshot ID; change the assertions in the test
that inspect idmap_paths and delete_paths to instead verify the filename
structure: ensure each path's file_name() string begins with the expected
prefixes ("idmap." and "del."), extract the suffix after the dot for both
idmap_paths[0] and delete_paths[0], assert that both suffixes are non-empty
numeric strings (all digits) and assert that the two numeric suffixes are equal;
update the assertions around idmap_paths and delete_paths accordingly (use the
existing idmap_paths and delete_paths variables and their file_name() / to_str()
results to locate the code).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/garuda-engine/tests/test_storage_disk_format.rs`:
- Around line 45-46: The second assertion is redundant because
collection_dir.join("segments").exists() already covers the nonexistence of any
files under that directory; remove the specific file check assertion that
references
collection_dir.join("segments").join("segment-00000000000000000001.json") so
only assert!(!collection_dir.join("segments").exists()) remains (i.e., delete
the assertion that checks for "segment-00000000000000000001.json").
- Around line 33-40: The assertions currently check exact filenames ("idmap.1"
and "del.1") which couples the test to a specific snapshot ID; change the
assertions in the test that inspect idmap_paths and delete_paths to instead
verify the filename structure: ensure each path's file_name() string begins with
the expected prefixes ("idmap." and "del."), extract the suffix after the dot
for both idmap_paths[0] and delete_paths[0], assert that both suffixes are
non-empty numeric strings (all digits) and assert that the two numeric suffixes
are equal; update the assertions around idmap_paths and delete_paths accordingly
(use the existing idmap_paths and delete_paths variables and their file_name() /
to_str() results to locate the code).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3da155c9-a944-47c4-acb0-8dc8f1c6a0bd

📥 Commits

Reviewing files that changed from the base of the PR and between a0bea2e and 3255a9f.

📒 Files selected for processing (8)
  • crates/garuda-engine/Cargo.toml
  • crates/garuda-engine/src/delete_store.rs
  • crates/garuda-engine/src/id_map.rs
  • crates/garuda-engine/src/segment.rs
  • crates/garuda-engine/src/storage_io.rs
  • crates/garuda-engine/src/version.rs
  • crates/garuda-engine/tests/test_storage_disk_format.rs
  • crates/garuda-storage/Cargo.toml
💤 Files with no reviewable changes (7)
  • crates/garuda-engine/Cargo.toml
  • crates/garuda-engine/src/delete_store.rs
  • crates/garuda-engine/src/storage_io.rs
  • crates/garuda-engine/src/id_map.rs
  • crates/garuda-engine/src/version.rs
  • crates/garuda-storage/Cargo.toml
  • crates/garuda-engine/src/segment.rs

@PhantomInTheWire PhantomInTheWire force-pushed the feat/storage branch 2 times, most recently from faf999a to 7a194ac Compare March 18, 2026 17:07

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
crates/garuda-engine/src/persistence.rs (1)

66-68: Avoid silently ignoring WAL reset failures.

At Line 66-68, checkpoint returns success even when WAL reset fails. This is recoverable, but the silent swallow hides persistent I/O issues and makes checkpoint health opaque.

Proposed fix
-    if reset_wal(&state.path, WRITING_SEGMENT_ID).is_err() {
-        return Ok(());
-    }
+    if let Err(status) = reset_wal(&state.path, WRITING_SEGMENT_ID) {
+        return Err(Status::err(
+            StatusCode::Internal,
+            format!("failed to reset writing WAL after checkpoint: {}", status.message),
+        ));
+    }
 
     Ok(())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/src/persistence.rs` around lines 66 - 68, The code
currently swallows errors from reset_wal(&state.path, WRITING_SEGMENT_ID) by
returning Ok(()) on failure; instead propagate or surface the failure so callers
know checkpointing failed. Replace the current if-check with a propagated error
(e.g., use reset_wal(&state.path, WRITING_SEGMENT_ID)? or map_err to the
function's error type) or at minimum log the error and return Err(...) from the
enclosing checkpoint function; reference reset_wal, state.path,
WRITING_SEGMENT_ID and the checkpoint function when making the change so the WAL
reset failure is not silently ignored.
crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs (1)

162-162: Prefer constant-driven WAL path construction.

At Line 162, the hardcoded "0" can drift from WRITING_SEGMENT_ID. Using the existing constant here will keep the test aligned with layout changes.

Proposed fix
-    let wal_path = collection.path().join("0").join("data.wal");
+    let wal_path = collection
+        .path()
+        .join(WRITING_SEGMENT_ID.to_string())
+        .join("data.wal");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs` at
line 162, The test currently hardcodes the segment folder "0" when building
wal_path (collection.path().join("0").join("data.wal")), which can drift from
the actual WRITING_SEGMENT_ID; change the path construction to use the
WRITING_SEGMENT_ID constant instead of the literal "0" (convert the constant to
a string/OsString as needed, e.g. via to_string() or format! when joining) so
the test stays in sync with layout changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/garuda-engine/src/persistence.rs`:
- Around line 167-177: capture_file currently calls read_file for any existing
path, which treats directories as files and causes pre-capture failures; change
capture_file to only attempt read_file when the path exists and is a regular
file (e.g., path.exists() && path.is_file() or metadata().is_file()), otherwise
set original_bytes to None (or handle non-file types explicitly) so directories
and other non-regular-file entries are not misclassified; update the function
around the call to read_file and the FileBackup population to use this file-only
check.

---

Nitpick comments:
In `@crates/garuda-engine/src/persistence.rs`:
- Around line 66-68: The code currently swallows errors from
reset_wal(&state.path, WRITING_SEGMENT_ID) by returning Ok(()) on failure;
instead propagate or surface the failure so callers know checkpointing failed.
Replace the current if-check with a propagated error (e.g., use
reset_wal(&state.path, WRITING_SEGMENT_ID)? or map_err to the function's error
type) or at minimum log the error and return Err(...) from the enclosing
checkpoint function; reference reset_wal, state.path, WRITING_SEGMENT_ID and the
checkpoint function when making the change so the WAL reset failure is not
silently ignored.

In `@crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs`:
- Line 162: The test currently hardcodes the segment folder "0" when building
wal_path (collection.path().join("0").join("data.wal")), which can drift from
the actual WRITING_SEGMENT_ID; change the path construction to use the
WRITING_SEGMENT_ID constant instead of the literal "0" (convert the constant to
a string/OsString as needed, e.g. via to_string() or format! when joining) so
the test stays in sync with layout changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8aee7e8d-8368-43f6-b66b-39372305c15a

📥 Commits

Reviewing files that changed from the base of the PR and between 3255a9f and faf999a.

📒 Files selected for processing (6)
  • crates/garuda-engine/src/persistence.rs
  • crates/garuda-engine/src/state.rs
  • crates/garuda-engine/tests/test_collection_exception.rs
  • crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs
  • crates/garuda-segment/src/wal.rs
  • crates/garuda-storage/src/layout.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/garuda-segment/src/wal.rs
  • crates/garuda-storage/src/layout.rs

Comment on lines +167 to +177
fn capture_file(path: &Path) -> Result<FileBackup, Status> {
let original_bytes = if path.exists() {
Some(read_file(path)?)
} else {
None
};

Ok(FileBackup {
path: path.to_path_buf(),
original_bytes,
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

capture_file misclassifies existing directories as readable files.

At Line 168, any existing path is passed to read_file. If the path is a directory, checkpoint fails during pre-capture instead of write/rollback flow. This makes failure mode handling brittle for path-collision cases.

Proposed fix
+enum OriginalPathState {
+    Missing,
+    File(Vec<u8>),
+    Directory,
+}
+
 struct FileBackup {
     path: PathBuf,
-    original_bytes: Option<Vec<u8>>,
+    original: OriginalPathState,
 }
 
 fn capture_file(path: &Path) -> Result<FileBackup, Status> {
-    let original_bytes = if path.exists() {
-        Some(read_file(path)?)
-    } else {
-        None
-    };
+    let original = if !path.exists() {
+        OriginalPathState::Missing
+    } else if path.is_file() {
+        OriginalPathState::File(read_file(path)?)
+    } else if path.is_dir() {
+        OriginalPathState::Directory
+    } else {
+        OriginalPathState::Missing
+    };
 
     Ok(FileBackup {
         path: path.to_path_buf(),
-        original_bytes,
+        original,
     })
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fn capture_file(path: &Path) -> Result<FileBackup, Status> {
let original_bytes = if path.exists() {
Some(read_file(path)?)
} else {
None
};
Ok(FileBackup {
path: path.to_path_buf(),
original_bytes,
})
enum OriginalPathState {
Missing,
File(Vec<u8>),
Directory,
}
struct FileBackup {
path: PathBuf,
original: OriginalPathState,
}
fn capture_file(path: &Path) -> Result<FileBackup, Status> {
let original = if !path.exists() {
OriginalPathState::Missing
} else if path.is_file() {
OriginalPathState::File(read_file(path)?)
} else if path.is_dir() {
OriginalPathState::Directory
} else {
OriginalPathState::Missing
};
Ok(FileBackup {
path: path.to_path_buf(),
original,
})
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/src/persistence.rs` around lines 167 - 177, capture_file
currently calls read_file for any existing path, which treats directories as
files and causes pre-capture failures; change capture_file to only attempt
read_file when the path exists and is a regular file (e.g., path.exists() &&
path.is_file() or metadata().is_file()), otherwise set original_bytes to None
(or handle non-file types explicitly) so directories and other non-regular-file
entries are not misclassified; update the function around the call to read_file
and the FileBackup population to use this file-only check.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a194ac076

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +36 to +38
pub fn ensure_existing_collection_dir(path: &Path) -> Result<(), Status> {
if path.exists() && !manifest_paths(path)?.is_empty() {
return Ok(());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Add an upgrade path for collections written before this commit

If an installation upgrades with existing on-disk collections, open_collection() now rejects them immediately: ensure_existing_collection_dir() only accepts directories that already contain manifest.*, and bootstrap::load_collection_state() only reads the new binary manifest/snapshot files. The previous storage path wrote VERSION.json, IDMAP.json, DELETE_STORE.json, and JSON segment files, so without a fallback or migration step every collection created by earlier releases becomes unreadable after upgrade.

Useful? React with 👍 / 👎.

Comment thread crates/garuda-segment/src/wal.rs Outdated
Comment on lines +75 to +79
u64::from_le_bytes(bytes[cursor..cursor + 8].try_into().expect("fixed length"))
as usize;
cursor += 8;

let end = cursor + payload_size;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use checked_add for WAL payload boundaries

A corrupt data.wal can encode payload_size near u64::MAX. In release builds, cursor + payload_size wraps before the end > payload_len guard, so the next slice can panic instead of returning a Status. Because this parser runs during reopen/recovery, one malformed WAL entry is enough to crash collection open on an otherwise recoverable directory.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9983c6924

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +55 to +58
let temp_path = temp_path(path);
write_temp_file(&temp_path, bytes)?;
rename_path(&temp_path, path)?;
sync_directory(parent)?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Sync the collection root after creating segment directories

When write_segment() targets a brand-new path like <collection>/<segment_id>/data.seg, this helper may create <segment_id>/ and then only fsync that new directory. The parent collection directory is never synced, so after a power loss the manifest can be durable while the new segment directory entry is lost, and reopen then fails because the committed manifest references a segment that never reached disk. This affects the initial 0/ directory on collection creation and every rollover/optimize that introduces a new segment id.

Useful? React with 👍 / 👎.

Comment on lines +64 to +67
version_manager.write_manifest(&state.manifest)?;

if reset_wal(&state.path, WRITING_SEGMENT_ID).is_err() {
return Ok(());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Propagate WAL reset failures after schema checkpoints

Returning success when reset_wal() fails is especially unsafe for add_column/alter_column/drop_column. Those paths rewrite the checkpointed docs and publish the new schema, but any pre-checkpoint Update/Upsert entries are still left in data.wal; on reopen, recovery replays them against the new schema and can silently resurrect dropped/renamed fields, so the collection no longer reflects the committed DDL even though the checkpoint reported success. This requires a schema-changing checkpoint with earlier unflushed writes and a WAL reset failure.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
crates/garuda-storage/src/layout.rs (1)

21-31: ⚠️ Potential issue | 🟠 Major

ensure_new_collection_dir is still racy and misses parent-dir durability.

Line 22 does a separate exists() check before Line 29 creates the directory, so concurrent creators can both pass the precheck. After that, Line 31 only syncs the child directory, which does not persist the new entry in its parent. Please switch to atomic std::fs::create_dir(path) and sync path.parent() as well.

Suggested fix
 pub fn ensure_new_collection_dir(path: &Path) -> Result<(), Status> {
-    if path.exists() {
-        return Err(Status::err(
-            StatusCode::AlreadyExists,
-            "collection directory already exists",
-        ));
-    }
-
-    create_dir_all(path, "failed to create collection directory")?;
+    std::fs::create_dir(path).map_err(|error| {
+        let code = if error.kind() == std::io::ErrorKind::AlreadyExists {
+            StatusCode::AlreadyExists
+        } else {
+            StatusCode::Internal
+        };
+
+        Status::err(code, format!("failed to create collection directory: {error}"))
+    })?;
     create_empty_file(&path.join(LOCK_FILE_NAME), "failed to create lock file")?;
     sync_directory(path)?;
+    let parent = path.parent().ok_or_else(|| {
+        Status::err(StatusCode::Internal, "cannot determine collection parent directory")
+    })?;
+    sync_directory(parent)?;
 
     Ok(())
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-storage/src/layout.rs` around lines 21 - 31, The function
ensure_new_collection_dir is racy because it uses path.exists() and
create_dir_all; change it to call std::fs::create_dir(path) so directory
creation is atomic and handle the AlreadyExists error by returning
Status::err(StatusCode::AlreadyExists, ...). After successful creation, create
the lock file with create_empty_file(&path.join(LOCK_FILE_NAME), ...) as before,
then call sync_directory(path) and also sync the parent directory
(path.parent()) to ensure the new entry is durably persisted; if path.parent()
is None, handle it gracefully. Ensure error mapping and messages remain
consistent when propagating failures from create_dir, create_empty_file, and
sync_directory.
crates/garuda-engine/src/persistence.rs (1)

167-177: ⚠️ Potential issue | 🟠 Major

capture_file still collapses “directory exists” into a read error.

Line 168 treats any existing path as a regular file. If manifest.N or data.seg is blocked by a directory, pre-capture fails before the write/rollback flow runs, and the current Option<Vec<u8>> model would also make rollback delete that directory instead of restoring it. Preserve Directory as a distinct original state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/src/persistence.rs` around lines 167 - 177, capture_file
currently treats any existing path as a regular file and stores its bytes in
FileBackup::original_bytes, which collapses directories into a read error and
would cause rollback to delete or overwrite a directory (e.g., for manifest.N or
data.seg). Modify capture_file to distinguish three original states (absent,
file-with-bytes, directory) instead of Option<Vec<u8>>: update the FileBackup
type to record an enum/state (e.g., Absent | File(Vec<u8>) | Directory), change
capture_file to check path.exists() then path.is_dir() vs path.is_file() and
only read bytes for files, and ensure downstream rollback logic uses the new
enum to restore or preserve directories correctly (referencing capture_file,
FileBackup, and original_bytes in the change).
🧹 Nitpick comments (1)
crates/garuda-segment/src/wal.rs (1)

25-46: Avoid rewriting the whole WAL for each append.

This appends by loading the full WAL into memory, dropping the trailer, and rewriting the entire file. On a hot write path, WAL growth turns this into O(n²) read/write amplification and whole-file allocations on every call. An append-only layout, or per-record checksums, would keep append cost proportional to the new ops instead of the existing WAL size.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-segment/src/wal.rs` around lines 25 - 46, The current
implementation reads the entire WAL into memory, drops the trailer and rewrites
the whole file (functions: segment_wal_path, read_file, new_wal_bytes,
validate_header, append_op, checksum, write_file_atomically), which causes O(n²)
amplification; instead make appends streaming: if the WAL file exists open it
with OpenOptions::write() and if needed seek to file_len - size_of::<u32>() to
overwrite the existing trailer, serialize only the new ops (create an
append_op_bytes or change append_op to return bytes) and write those bytes
directly, update the checksum incrementally (use or add a
checksum_update/streaming checksum API that folds in the new ops) and then write
the new 4-byte trailer; when the WAL does not exist create it with header + ops
+ trailer in one write.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/garuda-engine/src/state.rs`:
- Around line 21-45: apply_wal_op currently aborts recovery when update_doc
returns NotFound; make replay idempotent for stale Update ops by treating
missing-doc updates as redundant: in apply_wal_op, after matching WalOp::Update
-> self.update_doc(...), if the result indicates a NotFound status return Ok(())
instead of propagating an error; additionally, update is_redundant_wal_op's
WalOp::Update arm to consider a missing live record redundant (e.g. return
self.update_already_applied(doc) || self.find_live_record(&doc.id).is_none()),
so Update on a deleted/missing document is swallowed during replay.

In `@crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs`:
- Around line 162-167: The test currently appends a zero checksum so
validate_header() fails before read_wal_ops() can hit the truncated-entry-header
branch; after truncating the WAL payload (manipulating wal_bytes and making
header_remaining < 8), recompute and overwrite the correct checksum bytes
instead of appending 0 so validate_header() succeeds and execution reaches
read_wal_ops()'s header_remaining < 8 guard; update the bytes written to
wal_path accordingly (use the same checksum algorithm used by validate_header())
so the header checksum matches the modified payload and the test exercises the
truncated-entry-header branch.

---

Duplicate comments:
In `@crates/garuda-engine/src/persistence.rs`:
- Around line 167-177: capture_file currently treats any existing path as a
regular file and stores its bytes in FileBackup::original_bytes, which collapses
directories into a read error and would cause rollback to delete or overwrite a
directory (e.g., for manifest.N or data.seg). Modify capture_file to distinguish
three original states (absent, file-with-bytes, directory) instead of
Option<Vec<u8>>: update the FileBackup type to record an enum/state (e.g.,
Absent | File(Vec<u8>) | Directory), change capture_file to check path.exists()
then path.is_dir() vs path.is_file() and only read bytes for files, and ensure
downstream rollback logic uses the new enum to restore or preserve directories
correctly (referencing capture_file, FileBackup, and original_bytes in the
change).

In `@crates/garuda-storage/src/layout.rs`:
- Around line 21-31: The function ensure_new_collection_dir is racy because it
uses path.exists() and create_dir_all; change it to call
std::fs::create_dir(path) so directory creation is atomic and handle the
AlreadyExists error by returning Status::err(StatusCode::AlreadyExists, ...).
After successful creation, create the lock file with
create_empty_file(&path.join(LOCK_FILE_NAME), ...) as before, then call
sync_directory(path) and also sync the parent directory (path.parent()) to
ensure the new entry is durably persisted; if path.parent() is None, handle it
gracefully. Ensure error mapping and messages remain consistent when propagating
failures from create_dir, create_empty_file, and sync_directory.

---

Nitpick comments:
In `@crates/garuda-segment/src/wal.rs`:
- Around line 25-46: The current implementation reads the entire WAL into
memory, drops the trailer and rewrites the whole file (functions:
segment_wal_path, read_file, new_wal_bytes, validate_header, append_op,
checksum, write_file_atomically), which causes O(n²) amplification; instead make
appends streaming: if the WAL file exists open it with OpenOptions::write() and
if needed seek to file_len - size_of::<u32>() to overwrite the existing trailer,
serialize only the new ops (create an append_op_bytes or change append_op to
return bytes) and write those bytes directly, update the checksum incrementally
(use or add a checksum_update/streaming checksum API that folds in the new ops)
and then write the new 4-byte trailer; when the WAL does not exist create it
with header + ops + trailer in one write.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 06f14da0-aff4-4bbf-9b96-d0eb86e19c8c

📥 Commits

Reviewing files that changed from the base of the PR and between faf999a and a9983c6.

📒 Files selected for processing (7)
  • crates/garuda-engine/src/persistence.rs
  • crates/garuda-engine/src/state.rs
  • crates/garuda-engine/tests/test_collection_exception.rs
  • crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs
  • crates/garuda-engine/tests/test_storage_disk_format.rs
  • crates/garuda-segment/src/wal.rs
  • crates/garuda-storage/src/layout.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/garuda-engine/tests/test_storage_disk_format.rs

Comment on lines +21 to +45
pub(crate) fn apply_wal_op(&mut self, op: &WalOp) -> Result<(), Status> {
if self.is_redundant_wal_op(op) {
return Ok(());
}

let result = match op {
WalOp::Insert(doc) => self.insert_doc(doc.clone(), false),
WalOp::Upsert(doc) => self.insert_doc(doc.clone(), true),
WalOp::Update(doc) => self.update_doc(doc.clone()),
WalOp::Delete(doc_id) => self.delete_doc(doc_id),
};

if result.status.is_ok() {
return Ok(());
}

Err(Status::err(result.status.code, result.status.message))
}

fn is_redundant_wal_op(&self, op: &WalOp) -> bool {
match op {
WalOp::Insert(doc) => self.insert_already_applied(doc),
WalOp::Upsert(doc) => self.live_doc_matches(doc),
WalOp::Update(doc) => self.update_already_applied(doc),
WalOp::Delete(doc_id) => self.find_live_record(doc_id).is_none(),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Treat stale Update replay on a missing doc as idempotent.

Because checkpointing can currently succeed with an uncleared WAL, a replay stream like Update(doc-1), Delete(doc-1) can survive into reopen. If the checkpointed state already reflects the delete, Line 29 returns NotFound and Line 37 aborts recovery before the later delete is consumed. Please swallow NotFound for WalOp::Update during replay, or classify missing-doc updates as redundant.

Suggested fix
         if result.status.is_ok() {
             return Ok(());
         }
+
+        if matches!(op, WalOp::Update(_)) && result.status.code == StatusCode::NotFound {
+            return Ok(());
+        }
 
         Err(Status::err(result.status.code, result.status.message))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pub(crate) fn apply_wal_op(&mut self, op: &WalOp) -> Result<(), Status> {
if self.is_redundant_wal_op(op) {
return Ok(());
}
let result = match op {
WalOp::Insert(doc) => self.insert_doc(doc.clone(), false),
WalOp::Upsert(doc) => self.insert_doc(doc.clone(), true),
WalOp::Update(doc) => self.update_doc(doc.clone()),
WalOp::Delete(doc_id) => self.delete_doc(doc_id),
};
if result.status.is_ok() {
return Ok(());
}
Err(Status::err(result.status.code, result.status.message))
}
fn is_redundant_wal_op(&self, op: &WalOp) -> bool {
match op {
WalOp::Insert(doc) => self.insert_already_applied(doc),
WalOp::Upsert(doc) => self.live_doc_matches(doc),
WalOp::Update(doc) => self.update_already_applied(doc),
WalOp::Delete(doc_id) => self.find_live_record(doc_id).is_none(),
pub(crate) fn apply_wal_op(&mut self, op: &WalOp) -> Result<(), Status> {
if self.is_redundant_wal_op(op) {
return Ok(());
}
let result = match op {
WalOp::Insert(doc) => self.insert_doc(doc.clone(), false),
WalOp::Upsert(doc) => self.insert_doc(doc.clone(), true),
WalOp::Update(doc) => self.update_doc(doc.clone()),
WalOp::Delete(doc_id) => self.delete_doc(doc_id),
};
if result.status.is_ok() {
return Ok(());
}
if matches!(op, WalOp::Update(_)) && result.status.code == StatusCode::NotFound {
return Ok(());
}
Err(Status::err(result.status.code, result.status.message))
}
fn is_redundant_wal_op(&self, op: &WalOp) -> bool {
match op {
WalOp::Insert(doc) => self.insert_already_applied(doc),
WalOp::Upsert(doc) => self.live_doc_matches(doc),
WalOp::Update(doc) => self.update_already_applied(doc),
WalOp::Delete(doc_id) => self.find_live_record(doc_id).is_none(),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/src/state.rs` around lines 21 - 45, apply_wal_op
currently aborts recovery when update_doc returns NotFound; make replay
idempotent for stale Update ops by treating missing-doc updates as redundant: in
apply_wal_op, after matching WalOp::Update -> self.update_doc(...), if the
result indicates a NotFound status return Ok(()) instead of propagating an
error; additionally, update is_redundant_wal_op's WalOp::Update arm to consider
a missing live record redundant (e.g. return self.update_already_applied(doc) ||
self.find_live_record(&doc.id).is_none()), so Update on a deleted/missing
document is swallowed during replay.

Comment on lines +162 to +167
let wal_path = collection.path().join("0").join("data.wal");
let mut wal_bytes = fs::read(&wal_path).expect("read wal");
wal_bytes.truncate(wal_bytes.len() - std::mem::size_of::<u32>());
wal_bytes.push(0);
wal_bytes.extend_from_slice(&0u32.to_le_bytes());
fs::write(&wal_path, wal_bytes).expect("write truncated wal");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

This corruption only exercises checksum validation.

Line 166 appends a zero checksum, so validate_header() rejects the file before read_wal_ops() ever reaches the truncated-entry-header branch. If this test is meant to cover the new header_remaining < 8 guard, recompute the checksum after corrupting the payload.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs`
around lines 162 - 167, The test currently appends a zero checksum so
validate_header() fails before read_wal_ops() can hit the truncated-entry-header
branch; after truncating the WAL payload (manipulating wal_bytes and making
header_remaining < 8), recompute and overwrite the correct checksum bytes
instead of appending 0 so validate_header() succeeds and execution reaches
read_wal_ops()'s header_remaining < 8 guard; update the bytes written to
wal_path accordingly (use the same checksum algorithm used by validate_header())
so the header checksum matches the modified payload and the test exercises the
truncated-entry-header branch.

@PhantomInTheWire PhantomInTheWire merged commit f8040e8 into master Mar 18, 2026
2 checks passed
This was referenced Mar 19, 2026
Merged
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant