
perf: reduce memory and network cost for checkpoint metadata#680

Draft
pjbgf wants to merge 3 commits into main from pjbgf/jsonl-data2

Conversation


@pjbgf pjbgf commented Mar 11, 2026

Summary

This PR focuses on two key optimisations:

1. Reduce memory allocations when reading checkpoint metadata

  • Streaming json.Decoder replaces json.Unmarshal for reading metadata.json files, so the full JSON document is no longer loaded into memory as an intermediate []byte
  • Lite structs (sessionMetadataLite, checkpointSummaryLite) replace full CommittedMetadata/CheckpointSummary when only a few fields are needed — avoids allocating large unused nested objects (Summary, InitialAttribution, TokenUsage, etc.)
  • Shared helpers (decodeSummaryLiteFromTree, decodeSessionMetadataLite) deduplicate the metadata reading logic between ListCheckpoints and ReadCheckpointMetadata

2. Reduce network cost for pulling checkpoint data from remote

  • Shallow treeless fetch (--depth=1 --filter=blob:none) downloads only the tip commit and tree objects for the metadata branch — no blobs, no history
  • On-demand blob resolution via BlobResolver uses go-git's HasEncodedObject to check the local object store (loose objects + packfile indices) before fetching anything
  • Targeted blob fetch via git fetch-pack requests only the specific full.jsonl blob hashes that are missing locally, with
    automatic fallback to a full branch fetch if the server doesn't support allow-reachable-sha1-in-want
  • CollectTranscriptBlobHashes walks tree entries to discover full.jsonl blob hashes (including chunked transcripts) without reading blob content — works after a treeless fetch where blobs aren't local

Resume flow (before → after)

Before: git fetch origin entire/checkpoints/v1 → downloads entire branch (all commits, trees, and every blob across all
checkpoints)

After:

  1. git fetch --depth=1 --filter=blob:none → tip commit + trees only
  2. go-git tree walk → find needed full.jsonl blob hashes
  3. HasEncodedObject → skip blobs already in local store
  4. git fetch-pack <url> <hash>... → fetch only the missing blobs
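Step 3 of this flow reduces to a simple filter. In this sketch the predicate stands in for go-git's HasEncodedObject and plain strings stand in for plumbing.Hash values (both simplifications, not the PR's actual types):

```go
package main

import "fmt"

// missingBlobs keeps only the hashes a checkpoint needs that are not already
// in the local object store; the remainder is what step 4 passes to
// `git fetch-pack`. hasLocal plays the role of go-git's HasEncodedObject.
func missingBlobs(needed []string, hasLocal func(hash string) bool) []string {
	var missing []string
	for _, h := range needed {
		if !hasLocal(h) {
			missing = append(missing, h)
		}
	}
	return missing
}

func main() {
	local := map[string]bool{"aaa": true}
	needed := []string{"aaa", "bbb", "ccc"}
	// Only "bbb" and "ccc" still need fetching.
	fmt.Println(missingBlobs(needed, func(h string) bool { return local[h] }))
}
```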

Test plan

  • BlobResolver unit tests: HasBlob present/missing, ReadBlob present/missing
  • CollectTranscriptBlobHashes: single session, multi-session, chunked, nonexistent checkpoint
  • Existing ListCheckpoints and ReadCheckpointMetadata tests still pass with lite structs
  • Full mise run lint clean
  • Manual E2E: entire resume on a branch with remote-only checkpoint data

pjbgf and others added 3 commits March 11, 2026 10:27
…data

Avoid allocating large unused fields (Summary, InitialAttribution,
TokenUsage) when reading checkpoint metadata by introducing minimal
sessionMetadataLite and checkpointSummaryLite structs with streaming
json.Decoder instead of json.Unmarshal into full types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: f64fee85be5f
Add treeless fetch (--filter=blob:none) for the metadata branch so only
commit and tree objects are downloaded initially. Transcript blobs
(full.jsonl) are then fetched individually via git fetch-pack using their
SHA-1 hashes, skipping any that already exist in the local object store.

- BlobResolver: go-git HasEncodedObject checks for local blob existence
- CollectTranscriptBlobHashes: walks tree entries to find full.jsonl blob
  hashes without reading blob content (works after treeless fetch)
- FetchMetadataTreeOnly: git CLI fetch with --filter=blob:none
- FetchBlobsByHash: git fetch-pack for specific objects, with full-fetch
  fallback
- resume.go: ensureTranscriptBlobs wired before RestoreLogsOnly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 48e91bc15f41
Add --depth=1 to the filtered fetch so only the tip commit and its tree
objects are downloaded, skipping the entire branch history.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 564aea6f7ca5
Copilot AI review requested due to automatic review settings March 11, 2026 12:19

Copilot AI left a comment


Pull request overview

This PR optimizes entire resume and checkpoint metadata handling by reducing memory allocations when decoding checkpoint metadata and by reducing network transfer when resuming from remote-only checkpoint data.

Changes:

  • Switch metadata reads to streaming json.Decoder + “lite” structs for ListCheckpoints / ReadCheckpointMetadata.
  • Add a “treeless” metadata-branch fetch (--depth=1 --filter=blob:none) and best-effort prefetching of only required transcript blobs.
  • Introduce BlobResolver + tree-walk collection of transcript blob hashes (including chunked transcripts) and associated unit tests.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
cmd/entire/cli/strategy/common_helpers_test.go Adds a preserved “full deserialize” test helper for checkpoint metadata parsing.
cmd/entire/cli/strategy/common.go Refactors checkpoint listing/reading to stream JSON decoding and use minimal structs.
cmd/entire/cli/resume.go Attempts treeless fetch first and adds best-effort transcript blob prefetch before restore.
cmd/entire/cli/git_operations.go Adds treeless metadata fetch and a targeted blob fetch helper.
cmd/entire/cli/checkpoint/blob_resolver.go Adds local-object-store blob existence/read + transcript blob hash collection via tree entries.
cmd/entire/cli/checkpoint/blob_resolver_test.go Adds unit tests for blob resolution and transcript hash collection.

Comment on lines +261 to +268
		// If we can't get the tree, blobs can't be checked — caller will handle the error
		return nil //nolint:nilerr // Tree unavailable means no sparse fetch possible
	}

	refs, err := checkpoint.CollectTranscriptBlobHashes(metadataTree, checkpointID)
	if err != nil {
		// Tree navigation failed — not fatal, ReadSessionContent will fail with a clear error
		return nil //nolint:nilerr // Best-effort blob prefetch

Copilot AI Mar 11, 2026


ensureTranscriptBlobs returns nil on failures to load the metadata tree or to collect transcript blob hashes, but the caller only emits a warning when a non-nil error is returned. This means important failures are silently swallowed and the "restore may fail" warning will never trigger for these cases. Consider returning a wrapped error for these failure paths (still treated as best-effort by the caller), or logging inside ensureTranscriptBlobs so the reason for skipping the sparse fetch is visible.

Suggested change

-		// If we can't get the tree, blobs can't be checked — caller will handle the error
-		return nil //nolint:nilerr // Tree unavailable means no sparse fetch possible
-	}
-	refs, err := checkpoint.CollectTranscriptBlobHashes(metadataTree, checkpointID)
-	if err != nil {
-		// Tree navigation failed — not fatal, ReadSessionContent will fail with a clear error
-		return nil //nolint:nilerr // Best-effort blob prefetch
+		// If we can't get the tree, blobs can't be checked — treat as best-effort and let caller warn.
+		return fmt.Errorf("unable to load metadata tree for sparse transcript blob fetch: %w", err)
+	}
+	refs, err := checkpoint.CollectTranscriptBlobHashes(metadataTree, checkpointID)
+	if err != nil {
+		// Tree navigation failed — not fatal; this only disables sparse blob prefetch for this checkpoint.
+		return fmt.Errorf("unable to collect transcript blob hashes for checkpoint %s: %w", checkpointID, err)
Comment on lines +101 to +129
	// Collect transcript blob hashes from tree entries.
	// tree.Entries contains the direct children — no blob reads needed.
	var hasBaseFile bool
	var chunkEntries []object.TreeEntry

	for _, entry := range sessionTree.Entries {
		if entry.Name == paths.TranscriptFileName || entry.Name == paths.TranscriptFileNameLegacy {
			hasBaseFile = true
			refs = append(refs, TranscriptBlobRef{
				SessionIndex: i,
				Hash:         entry.Hash,
				Path:         sessionDir + "/" + entry.Name,
			})
		}
		// Check for chunk files (full.jsonl.001, full.jsonl.002, etc.)
		if strings.HasPrefix(entry.Name, paths.TranscriptFileName+".") {
			idx := agent.ParseChunkIndex(entry.Name, paths.TranscriptFileName)
			if idx > 0 {
				chunkEntries = append(chunkEntries, entry)
			}
		}
	}

	// If chunks exist and the base file is also a chunk (chunk 0), it's already collected.
	// Add the numbered chunk entries.
	if len(chunkEntries) > 0 {
		// If there was no base file, that's OK — chunks stand alone
		_ = hasBaseFile
		for _, entry := range chunkEntries {

Copilot AI Mar 11, 2026


CollectTranscriptBlobHashes tracks hasBaseFile but then discards it (_ = hasBaseFile). This adds noise and makes the chunk-handling logic harder to read. Consider removing hasBaseFile entirely here, or using it to make the chunk collection rules explicit (e.g., include base file as chunk 0 only when present).
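For illustration, a minimal stand-in for agent.ParseChunkIndex (the real signature and rules may differ) makes the chunk-detection rule in the excerpt concrete: only names of the form base + "." + positive number count as numbered chunks, so the base file and unrelated suffixes are never mistaken for chunks.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseChunkIndex is a hypothetical sketch of agent.ParseChunkIndex:
// "full.jsonl.002" with base "full.jsonl" yields 2; the base file itself,
// non-numeric suffixes, and zero/negative indices all yield 0.
func parseChunkIndex(name, base string) int {
	suffix, ok := strings.CutPrefix(name, base+".")
	if !ok {
		return 0
	}
	n, err := strconv.Atoi(suffix)
	if err != nil || n <= 0 {
		return 0
	}
	return n
}

func main() {
	for _, name := range []string{"full.jsonl", "full.jsonl.001", "full.jsonl.010", "full.jsonl.bak"} {
		fmt.Println(name, "->", parseChunkIndex(name, "full.jsonl"))
	}
}
```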

Comment on lines +440 to +475
// FetchBlobsByHash fetches specific blob objects from the remote by their SHA-1 hashes.
// Uses git fetch-pack to request individual objects, which works when the server
// supports allow-reachable-sha1-in-want (GitHub, GitLab, Bitbucket all do).
//
// If fetch-pack fails (e.g., old server without SHA-in-want support), falls back
// to a full metadata branch fetch.
func FetchBlobsByHash(ctx context.Context, hashes []plumbing.Hash) error {
	if len(hashes) == 0 {
		return nil
	}

	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()

	// Get remote URL for fetch-pack
	urlCmd := exec.CommandContext(ctx, "git", "remote", "get-url", "origin")
	urlOutput, err := urlCmd.Output()
	if err != nil {
		return fmt.Errorf("failed to get remote URL: %w", err)
	}
	remoteURL := strings.TrimSpace(string(urlOutput))

	// Build fetch-pack args with blob hashes
	args := []string{"fetch-pack", "--thin", "--no-progress", remoteURL}
	for _, h := range hashes {
		args = append(args, h.String())
	}

	fetchCmd := exec.CommandContext(ctx, "git", args...)
	if output, fetchErr := fetchCmd.CombinedOutput(); fetchErr != nil {
		// Fallback: full metadata branch fetch (pack negotiation skips already-local objects)
		if fallbackErr := FetchMetadataBranch(ctx); fallbackErr != nil {
			return fmt.Errorf("fetch-pack failed (%s: %w) and fallback fetch also failed: %w",
				strings.TrimSpace(string(output)), fetchErr, fallbackErr)
		}
	}

Copilot AI Mar 11, 2026


FetchBlobsByHash runs git fetch-pack and captures CombinedOutput(). git fetch-pack produces a pack stream on stdout and does not install the objects into the local object database by itself; capturing stdout also risks buffering large pack data in memory. As written, this can both fail to make the requested blobs available locally and cause a large memory spike. Consider wiring fetch-pack to git index-pack (streaming stdout directly, avoiding buffering) or switching to a higher-level git fetch mechanism that actually writes the objects into .git/objects.
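A sketch of the suggested fix, building only the two argv lists (the helper name and exact flag combination are assumptions, not code from this PR): `git fetch-pack` stdout would be piped straight into `git index-pack --stdin --fix-thin`, so the pack is written into .git/objects without buffering it in memory; --fix-thin completes thin packs that reference already-local objects.

```go
package main

import "fmt"

// blobFetchPipeline returns the argv for the producer and consumer halves of
// the pipeline the review comment describes. Actually running it would use
// exec.Command for both, with the fetch command's StdoutPipe() wired to the
// index-pack command's Stdin, so pack data streams instead of accumulating
// in a CombinedOutput buffer.
func blobFetchPipeline(remoteURL string, hashes []string) (fetchArgs, indexArgs []string) {
	fetchArgs = []string{"git", "fetch-pack", "--thin", "--no-progress", remoteURL}
	fetchArgs = append(fetchArgs, hashes...)
	indexArgs = []string{"git", "index-pack", "--stdin", "--fix-thin"}
	return fetchArgs, indexArgs
}

func main() {
	f, i := blobFetchPipeline("https://example.com/repo.git", []string{"deadbeef"})
	fmt.Println(f)
	fmt.Println(i)
}
```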
