Add per-version zip source archives #13978
Open
Turbo87 wants to merge 9 commits into
This lets tests construct tarballs with explicit directory entries without reaching through `as_mut()` to the raw `tar::Builder`.
This introduces a self-contained crate that turns a `.crate` tarball into a deterministic, seekable zip plus a JSON manifest. `build_zip()` reads the gzipped tarball twice: once to count paths and buffer `Cargo.toml`, then again to write every other file. It writes `Cargo.toml` first so a consumer can fetch it from the start of the zip, compresses each entry with DEFLATE level 9, and stamps an explicit `last_modified_time`, so the output is byte-identical across re-runs. It mirrors cargo's extraction rules for unusual tarballs: `Cargo.toml` is matched case-insensitively, and when several entries share a path only the last occurrence is kept, since the zip cannot hold duplicate names. The returned `Manifest` lists one `FileEntry` per file, sorted case-insensitively by path, with the compressed-payload offset, compressed and uncompressed sizes, compression method, and the sha256 of the uncompressed contents. A consumer can use these to range-fetch and verify a single file without parsing the zip central directory.
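The ordering rules described above can be sketched with the standard library alone. This is only an illustration, not the crate's actual `build_zip()`: the `Entry` type and the `plan_zip_order` and `sort_manifest_paths` names are hypothetical, paths are assumed to already have the `{name}-{version}/` prefix stripped, and the tie-break between several case-insensitive `Cargo.toml` matches is an assumption.

```rust
use std::collections::HashMap;

/// One tarball entry (hypothetical type): archive-relative path plus raw bytes.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub path: String,
    pub data: Vec<u8>,
}

/// Plan the zip write order: the last occurrence of a duplicate path wins,
/// `Cargo.toml` (matched case-insensitively) goes first, and every other
/// surviving entry keeps its tarball order.
pub fn plan_zip_order(entries: &[Entry]) -> Vec<Entry> {
    // Pass 1: remember the index of the last occurrence of every path.
    let mut last_index: HashMap<&str, usize> = HashMap::new();
    for (i, entry) in entries.iter().enumerate() {
        last_index.insert(entry.path.as_str(), i);
    }

    // Pass 2: keep only surviving entries and split off the manifest.
    let mut manifest: Option<Entry> = None;
    let mut rest: Vec<Entry> = Vec::new();
    for (i, entry) in entries.iter().enumerate() {
        if last_index[entry.path.as_str()] != i {
            continue; // shadowed by a later entry with the same path
        }
        // Assumption: the first surviving match is treated as the manifest.
        if manifest.is_none() && entry.path.eq_ignore_ascii_case("Cargo.toml") {
            manifest = Some(entry.clone());
        } else {
            rest.push(entry.clone());
        }
    }
    manifest.into_iter().chain(rest).collect()
}

/// Sort manifest paths case-insensitively, breaking ties by exact bytes so
/// the order is fully deterministic.
pub fn sort_manifest_paths(paths: &mut [String]) {
    paths.sort_by(|a, b| {
        a.to_lowercase()
            .cmp(&b.to_lowercase())
            .then_with(|| a.cmp(b))
    });
}
```

Splitting the work into two passes means the first pass only needs to track paths, which matches the description of counting paths before any file other than `Cargo.toml` is written.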
These nullable columns hold the SHA256 checksums of a per-version source archive (`.zip`) and its JSON manifest (`.zip.json`). They are populated asynchronously after publish rather than synchronously like `checksum`, so they stay permanently nullable. A `CHECK (octet_length(...) = 32)` on each enforces the raw 32-byte digest length.
These mirror the existing `upload_crate_file()` helper: path helpers and
upload methods for the per-version zip source archive (`.zip`) and its JSON
manifest (`.zip.json`). Both objects get the same immutable cache-control
treatment as the `.crate`.
Both live under the shared `crates/{name}/` prefix, so
`delete_all_crate_files()` already removes them on crate deletion with no
new code, as the extended deletion test documents.
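As a rough sketch of why prefix deletion already covers the new objects, the path helpers might look like the following. The function names and the exact `{name}-{version}` file naming for the zip and manifest are assumptions made for illustration; only the shared `crates/{name}/` prefix and the `.zip` / `.zip.json` suffixes come from the description above.

```rust
/// Shared per-crate prefix that prefix-based deletion operates on.
pub fn crate_prefix(name: &str) -> String {
    format!("crates/{name}/")
}

/// Object key of the `.crate` tarball.
pub fn crate_file_path(name: &str, version: &str) -> String {
    format!("{}{name}-{version}.crate", crate_prefix(name))
}

/// Object key of the zip source archive (hypothetical naming).
pub fn crate_zip_path(name: &str, version: &str) -> String {
    format!("{}{name}-{version}.zip", crate_prefix(name))
}

/// Object key of the JSON manifest (hypothetical naming).
pub fn crate_zip_manifest_path(name: &str, version: &str) -> String {
    format!("{}.json", crate_zip_path(name, version))
}
```

Because every key starts with the same prefix, a deletion that lists and removes everything under `crates/{name}/` picks up the new objects without knowing their names.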
This wires the `crates_io_crate_zip` builder, the storage upload methods, and the `versions.zip_sha256`/`zip_json_sha256` columns into a deduplicated, idempotent job. For one version it downloads the `.crate` to a tempfile, builds the deterministic zip and manifest synchronously inside `spawn_blocking` (the `zip` writer needs `Write + Seek`), then streams the zip straight from disk to object storage and records both checksums in a single `UPDATE`. The uploads run before the database write, so a crash before the write leaves both columns NULL and a retry rebuilds byte-identical artifacts.
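The upload-then-record ordering can be modelled in memory to show why a retry is safe. This is a simplified sketch, not the job itself: `Artifacts`, `VersionRow`, `run_job`, and the `crash_before_db_write` switch are all hypothetical, a `HashMap` stands in for object storage, and the checksums are passed in rather than computed.

```rust
use std::collections::HashMap;

/// Deterministic build output (hypothetical type).
pub struct Artifacts {
    pub zip: Vec<u8>,
    pub manifest: Vec<u8>,
    pub zip_sha256: [u8; 32],
    pub manifest_sha256: [u8; 32],
}

/// Stand-in for the two nullable checksum columns on `versions`.
#[derive(Default)]
pub struct VersionRow {
    pub zip_sha256: Option<[u8; 32]>,
    pub zip_json_sha256: Option<[u8; 32]>,
}

/// Upload both objects first, then record both checksums together.
pub fn run_job(
    store: &mut HashMap<String, Vec<u8>>,
    row: &mut VersionRow,
    zip_key: &str,
    artifacts: &Artifacts,
    crash_before_db_write: bool,
) -> Result<(), &'static str> {
    // Uploads overwrite by key, so repeating them with identical bytes is harmless.
    store.insert(zip_key.to_string(), artifacts.zip.clone());
    store.insert(format!("{zip_key}.json"), artifacts.manifest.clone());

    if crash_before_db_write {
        // Both columns stay NULL, so the version still counts as missing.
        return Err("simulated crash before the database write");
    }

    // Models the single UPDATE: both columns change together or not at all.
    row.zip_sha256 = Some(artifacts.zip_sha256);
    row.zip_json_sha256 = Some(artifacts.manifest_sha256);
    Ok(())
}
```

Writing the columns last means a non-NULL checksum implies the object was uploaded, while the reverse failure mode only costs a redundant rebuild.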
Mirrors the `analyze-crates` command: `--backfill` enqueues a `BuildCrateZip` job for every version still missing an artifact (`zip_sha256 IS NULL OR zip_json_sha256 IS NULL`), and a crate-spec mode (`crate@version` / `crate`) targets specific rebuilds. Jobs are inserted in chunks of 100, with each crate's default version queued at priority -20 and the rest at -50 so live publishes always take precedence.
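A minimal sketch of the batching and priority rule, assuming hypothetical `VersionRef` and `Job` types and a `plan_backfill` helper that are not taken from the command itself. Only the chunk size of 100 and the -20 / -50 priorities come from the description above.

```rust
/// A version selected for backfill (hypothetical type).
pub struct VersionRef {
    pub version_id: i32,
    pub is_default: bool,
}

/// A job row to insert (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Job {
    pub version_id: i32,
    pub priority: i16,
}

pub const DEFAULT_VERSION_PRIORITY: i16 = -20;
pub const OTHER_VERSION_PRIORITY: i16 = -50;
pub const CHUNK_SIZE: usize = 100;

/// Group the jobs into insert batches of at most `CHUNK_SIZE`, giving each
/// crate's default version the higher of the two backfill priorities.
pub fn plan_backfill(versions: &[VersionRef]) -> Vec<Vec<Job>> {
    versions
        .chunks(CHUNK_SIZE)
        .map(|chunk| {
            chunk
                .iter()
                .map(|v| Job {
                    version_id: v.version_id,
                    priority: if v.is_default {
                        DEFAULT_VERSION_PRIORITY
                    } else {
                        OTHER_VERSION_PRIORITY
                    },
                })
                .collect()
        })
        .collect()
}
```

With 250 versions this yields batches of 100, 100, and 50 jobs.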
New publishes now build their zip source archive automatically: `publish.rs` enqueues a `BuildCrateZip` job in the same `tokio::try_join!` as the index sync, next to the existing `AnalyzeCrateFile` job. The publish and delete tests' `stored_files` snapshots gain the `.zip`/`.zip.json` objects now produced during publish.
Notes the seekable `.zip` source archive and its `.zip.json` manifest in the object-storage overview, and adds them to the publish walkthrough's list of background follow-up work.
The `delete-version` command removed only the `.crate` and readme for a
version, leaving the new `.zip` source archive and its `.zip.json` manifest
orphaned in storage and cached on both CDNs. Those objects are served with
`immutable` caching, so a stale copy could linger for up to a year, and
neither CDN supports wildcard invalidation.
This adds `delete_crate_zip()` / `delete_crate_zip_manifest()` and the matching
`*_location` helpers to `Storage`, and deletes both objects (queueing their
CDN invalidation) next to the `.crate`. A missing object is ignored, since a
version may not have been built yet.
Crate deletion already covers these objects: it deletes by the shared
`crates/{name}/` prefix and invalidates every path it removed.
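The "missing object is ignored" behaviour could look roughly like this. `FakeStorage`, `DeleteError`, and `delete_version_artifacts` are hypothetical stand-ins for the real `Storage` API, and queueing invalidation only for keys that were actually removed is an assumption of this sketch.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum DeleteError {
    NotFound,
    Other(String),
}

/// In-memory stand-in for object storage.
pub struct FakeStorage {
    pub objects: HashSet<String>,
}

impl FakeStorage {
    pub fn delete(&mut self, key: &str) -> Result<(), DeleteError> {
        if self.objects.remove(key) {
            Ok(())
        } else {
            Err(DeleteError::NotFound)
        }
    }
}

/// Delete each key, tolerate objects that were never built, and return the
/// keys whose CDN copies should be invalidated.
pub fn delete_version_artifacts(
    storage: &mut FakeStorage,
    keys: &[String],
) -> Result<Vec<String>, DeleteError> {
    let mut to_invalidate = Vec::new();
    for key in keys {
        match storage.delete(key) {
            Ok(()) => to_invalidate.push(key.clone()),
            // The zip job may not have run for this version yet.
            Err(DeleteError::NotFound) => {}
            // Any other failure still aborts the command.
            Err(other) => return Err(other),
        }
    }
    Ok(to_invalidate)
}
```

Only `NotFound` is swallowed, so a permissions or network error is still surfaced to the operator.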
This adds a seekable per-version source archive next to each published `.crate`, to support a future source-code viewer and version-to-version diff viewer on crates.io.

A `.crate` is a single solid gzip stream, so one file cannot be read without decompressing everything before it, and it carries no per-file hash for detecting changes between versions. For each version the backend now also produces a `.zip` of the source files, where every entry is compressed independently and can be range-fetched (with `Cargo.toml` written first), plus a `.zip.json` manifest recording each file's offset, sizes, compression, and the sha256 of its uncompressed contents.

Both objects are deterministic, so a rebuild is byte-identical. They live under the existing `crates/{name}/` prefix and are served behind the CDNs with the same immutable caching as the `.crate`. A background job builds and uploads them, with the zip streamed straight from disk so it is never held in memory.

An admin command (`build-crate-zips`) backfills existing versions, and publishing enqueues the job for new ones. The publish enqueue is fatal and runs inside the publish transaction, next to the existing index-sync jobs, so a failed enqueue rolls back the publish rather than silently skipping the archive. That matches how those index jobs already behave.

Serving (HTTP endpoints and the browser viewer) and CDN configuration are out of scope. The artifact is designed to support either serving model later.
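To illustrate how a future consumer might use the manifest, the sketch below turns one entry into an HTTP `Range` header value for the compressed payload. The `FileEntry` field names and the `range_header` helper are assumptions; the real manifest schema may differ, and fetching, inflating, and sha256 verification are left out.

```rust
/// One manifest entry (field names are hypothetical).
pub struct FileEntry {
    pub path: String,
    /// Byte offset of the compressed payload inside the zip.
    pub offset: u64,
    pub compressed_size: u64,
    pub uncompressed_size: u64,
}

/// Build the value of an HTTP `Range` header covering exactly the
/// compressed payload. HTTP byte ranges are inclusive on both ends, so the
/// last byte is `offset + compressed_size - 1`. Returns `None` for an empty
/// payload or on arithmetic overflow.
pub fn range_header(entry: &FileEntry) -> Option<String> {
    if entry.compressed_size == 0 {
        return None;
    }
    let last = entry.offset.checked_add(entry.compressed_size - 1)?;
    Some(format!("bytes={}-{}", entry.offset, last))
}
```

For an entry at offset 100 with 50 compressed bytes this produces `bytes=100-149`. The consumer would then decompress the response according to the recorded compression method and compare its sha256 with the manifest.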