perf: reduce memory and network cost for checkpoint metadata #680
Conversation
…data

Avoid allocating large unused fields (Summary, InitialAttribution, TokenUsage) when reading checkpoint metadata by introducing minimal sessionMetadataLite and checkpointSummaryLite structs, decoded with a streaming json.Decoder instead of json.Unmarshal into the full types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: f64fee85be5f
Add a treeless fetch (--filter=blob:none) for the metadata branch so only commit and tree objects are downloaded initially. Transcript blobs (full.jsonl) are then fetched individually via git fetch-pack using their SHA-1 hashes, skipping any that already exist in the local object store.

- BlobResolver: go-git HasEncodedObject checks for local blob existence
- CollectTranscriptBlobHashes: walks tree entries to find full.jsonl blob hashes without reading blob content (works after a treeless fetch)
- FetchMetadataTreeOnly: git CLI fetch with --filter=blob:none
- FetchBlobsByHash: git fetch-pack for specific objects, with full-fetch fallback
- resume.go: ensureTranscriptBlobs wired in before RestoreLogsOnly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 48e91bc15f41
Add --depth=1 to the filtered fetch so only the tip commit and its tree objects are downloaded, skipping the entire branch history.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 564aea6f7ca5
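The "skip blobs that already exist in the local object store" step these commits describe can be sketched as follows. This is a minimal illustration, not the PR's code: `missingBlobs` and the string hashes are stand-ins for go-git's `HasEncodedObject` check against real `plumbing.Hash` values.

```go
package main

import "fmt"

// missingBlobs filters the wanted blob hashes down to those not present
// locally. localHas models go-git's HasEncodedObject lookup (loose objects
// plus packfile indices); hashes are simplified to strings.
func missingBlobs(wanted []string, localHas func(string) bool) []string {
	var missing []string
	for _, h := range wanted {
		if !localHas(h) {
			missing = append(missing, h)
		}
	}
	return missing
}

func main() {
	local := map[string]bool{"aaa111": true}
	has := func(h string) bool { return local[h] }
	// Only "bbb222" would need to be requested via git fetch-pack.
	fmt.Println(missingBlobs([]string{"aaa111", "bbb222"}, has))
}
```

Only the hashes returned here would then be passed to the targeted `git fetch-pack` call, with everything else served from the local store.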
Pull request overview
This PR optimizes `entire resume` and checkpoint metadata handling by reducing memory allocations when decoding checkpoint metadata and by reducing network transfer when resuming from remote-only checkpoint data.
Changes:
- Switch metadata reads to a streaming `json.Decoder` plus "lite" structs for `ListCheckpoints`/`ReadCheckpointMetadata`.
- Add a "treeless" metadata-branch fetch (`--depth=1 --filter=blob:none`) and best-effort prefetching of only the required transcript blobs.
- Introduce `BlobResolver` plus tree-walk collection of transcript blob hashes (including chunked transcripts) and associated unit tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| cmd/entire/cli/strategy/common_helpers_test.go | Adds a preserved “full deserialize” test helper for checkpoint metadata parsing. |
| cmd/entire/cli/strategy/common.go | Refactors checkpoint listing/reading to stream JSON decoding and use minimal structs. |
| cmd/entire/cli/resume.go | Attempts treeless fetch first and adds best-effort transcript blob prefetch before restore. |
| cmd/entire/cli/git_operations.go | Adds treeless metadata fetch and a targeted blob fetch helper. |
| cmd/entire/cli/checkpoint/blob_resolver.go | Adds local-object-store blob existence/read + transcript blob hash collection via tree entries. |
| cmd/entire/cli/checkpoint/blob_resolver_test.go | Adds unit tests for blob resolution and transcript hash collection. |
```go
	// If we can't get the tree, blobs can't be checked — caller will handle the error
	return nil //nolint:nilerr // Tree unavailable means no sparse fetch possible
}

refs, err := checkpoint.CollectTranscriptBlobHashes(metadataTree, checkpointID)
if err != nil {
	// Tree navigation failed — not fatal, ReadSessionContent will fail with a clear error
	return nil //nolint:nilerr // Best-effort blob prefetch
```
ensureTranscriptBlobs returns nil on failures to load the metadata tree or to collect transcript blob hashes, but the caller only emits a warning when a non-nil error is returned. This means important failures are silently swallowed and the "restore may fail" warning will never trigger for these cases. Consider returning a wrapped error for these failure paths (still treated as best-effort by the caller), or logging inside ensureTranscriptBlobs so the reason for skipping the sparse fetch is visible.
Suggested change:

```diff
-	// If we can't get the tree, blobs can't be checked — caller will handle the error
-	return nil //nolint:nilerr // Tree unavailable means no sparse fetch possible
+	// If we can't get the tree, blobs can't be checked — treat as best-effort and let caller warn.
+	return fmt.Errorf("unable to load metadata tree for sparse transcript blob fetch: %w", err)
 }
 refs, err := checkpoint.CollectTranscriptBlobHashes(metadataTree, checkpointID)
 if err != nil {
-	// Tree navigation failed — not fatal, ReadSessionContent will fail with a clear error
-	return nil //nolint:nilerr // Best-effort blob prefetch
+	// Tree navigation failed — not fatal; this only disables sparse blob prefetch for this checkpoint.
+	return fmt.Errorf("unable to collect transcript blob hashes for checkpoint %s: %w", checkpointID, err)
```
```go
// Collect transcript blob hashes from tree entries.
// tree.Entries contains the direct children — no blob reads needed.
var hasBaseFile bool
var chunkEntries []object.TreeEntry

for _, entry := range sessionTree.Entries {
	if entry.Name == paths.TranscriptFileName || entry.Name == paths.TranscriptFileNameLegacy {
		hasBaseFile = true
		refs = append(refs, TranscriptBlobRef{
			SessionIndex: i,
			Hash:         entry.Hash,
			Path:         sessionDir + "/" + entry.Name,
		})
	}
	// Check for chunk files (full.jsonl.001, full.jsonl.002, etc.)
	if strings.HasPrefix(entry.Name, paths.TranscriptFileName+".") {
		idx := agent.ParseChunkIndex(entry.Name, paths.TranscriptFileName)
		if idx > 0 {
			chunkEntries = append(chunkEntries, entry)
		}
	}
}

// If chunks exist and the base file is also a chunk (chunk 0), it's already collected.
// Add the numbered chunk entries.
if len(chunkEntries) > 0 {
	// If there was no base file, that's OK — chunks stand alone
	_ = hasBaseFile
	for _, entry := range chunkEntries {
```
CollectTranscriptBlobHashes tracks hasBaseFile but then discards it (_ = hasBaseFile). This adds noise and makes the chunk-handling logic harder to read. Consider removing hasBaseFile entirely here, or using it to make the chunk collection rules explicit (e.g., include base file as chunk 0 only when present).
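One way to drop the `hasBaseFile` bookkeeping, per the comment above, is to treat base file and numbered chunks uniformly. This is a simplified sketch over plain file names: `parseChunkIndex` is a hypothetical stand-in for `agent.ParseChunkIndex`, and real code would collect `TreeEntry` values, not strings.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseChunkIndex returns the numeric suffix of "<base>.NNN"
// (e.g. "full.jsonl.002" -> 2), or -1 if name is not a numbered chunk.
func parseChunkIndex(name, base string) int {
	suffix, ok := strings.CutPrefix(name, base+".")
	if !ok {
		return -1
	}
	idx, err := strconv.Atoi(suffix)
	if err != nil {
		return -1
	}
	return idx
}

// collectChunks keeps the base transcript file (if present) and any numbered
// chunks, with no separate hasBaseFile flag to track and then discard.
func collectChunks(entries []string, base string) []string {
	var out []string
	for _, name := range entries {
		if name == base || parseChunkIndex(name, base) > 0 {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	entries := []string{"full.jsonl", "full.jsonl.001", "full.jsonl.002", "metadata.json"}
	fmt.Println(collectChunks(entries, "full.jsonl"))
}
```

Here a missing base file simply contributes nothing to the result, which makes the "chunks stand alone" case explicit rather than a dead variable.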
```go
// FetchBlobsByHash fetches specific blob objects from the remote by their SHA-1 hashes.
// Uses git fetch-pack to request individual objects, which works when the server
// supports allow-reachable-sha1-in-want (GitHub, GitLab, Bitbucket all do).
//
// If fetch-pack fails (e.g., old server without SHA-in-want support), falls back
// to a full metadata branch fetch.
func FetchBlobsByHash(ctx context.Context, hashes []plumbing.Hash) error {
	if len(hashes) == 0 {
		return nil
	}

	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()

	// Get remote URL for fetch-pack
	urlCmd := exec.CommandContext(ctx, "git", "remote", "get-url", "origin")
	urlOutput, err := urlCmd.Output()
	if err != nil {
		return fmt.Errorf("failed to get remote URL: %w", err)
	}
	remoteURL := strings.TrimSpace(string(urlOutput))

	// Build fetch-pack args with blob hashes
	args := []string{"fetch-pack", "--thin", "--no-progress", remoteURL}
	for _, h := range hashes {
		args = append(args, h.String())
	}

	fetchCmd := exec.CommandContext(ctx, "git", args...)
	if output, fetchErr := fetchCmd.CombinedOutput(); fetchErr != nil {
		// Fallback: full metadata branch fetch (pack negotiation skips already-local objects)
		if fallbackErr := FetchMetadataBranch(ctx); fallbackErr != nil {
			return fmt.Errorf("fetch-pack failed (%s: %w) and fallback fetch also failed: %w",
				strings.TrimSpace(string(output)), fetchErr, fallbackErr)
		}
	}
```
FetchBlobsByHash runs git fetch-pack and captures CombinedOutput(). git fetch-pack produces a pack stream on stdout and does not install the objects into the local object database by itself; capturing stdout also risks buffering large pack data in memory. As written, this can both fail to make the requested blobs available locally and cause a large memory spike. Consider wiring fetch-pack to git index-pack (streaming stdout directly, avoiding buffering) or switching to a higher-level git fetch mechanism that actually writes the objects into .git/objects.
Summary
This PR focuses on two key optimisations:
1. Reduce memory allocations when reading checkpoint metadata
- `json.Decoder` replaces `json.Unmarshal` for reading `metadata.json` files, avoiding loading entire JSON documents into memory as intermediate `[]byte`
- "Lite" structs (`sessionMetadataLite`, `checkpointSummaryLite`) replace the full `CommittedMetadata`/`CheckpointSummary` when only a few fields are needed — avoids allocating large unused nested objects (`Summary`, `InitialAttribution`, `TokenUsage`, etc.)
- Shared helpers (`decodeSummaryLiteFromTree`, `decodeSessionMetadataLite`) deduplicate the metadata-reading logic between `ListCheckpoints` and `ReadCheckpointMetadata`

2. Reduce network cost for pulling checkpoint data from remote
- Treeless, shallow fetch (`--depth=1 --filter=blob:none`) downloads only the tip commit and tree objects for the metadata branch — no blobs, no history
- `BlobResolver` uses go-git's `HasEncodedObject` to check the local object store (loose objects + packfile indices) before fetching anything
- `git fetch-pack` requests only the specific `full.jsonl` blob hashes that are missing locally, with automatic fallback to a full branch fetch if the server doesn't support `allow-reachable-sha1-in-want`
- `CollectTranscriptBlobHashes` walks tree entries to discover `full.jsonl` blob hashes (including chunked transcripts) without reading blob content — works after a treeless fetch where blobs aren't local

Resume flow (before → after)
Before:
- `git fetch origin entire/checkpoints/v1` → downloads the entire branch (all commits, trees, and every blob across all checkpoints)

After:
- `git fetch --depth=1 --filter=blob:none` → tip commit + trees only
- Tree walk collects `full.jsonl` blob hashes
- `HasEncodedObject` → skip blobs already in the local store
- `git fetch-pack <url> <hash>...` → fetch only the missing blobs

Test plan
- `BlobResolver` unit tests: `HasBlob` present/missing, `ReadBlob` present/missing
- `CollectTranscriptBlobHashes`: single session, multi-session, chunked, nonexistent checkpoint
- Existing `ListCheckpoints` and `ReadCheckpointMetadata` tests still pass with the lite structs
- `mise run lint` clean
- Manual: `entire resume` on a branch with remote-only checkpoint data