Add per-version zip source archives #13978

Open
Turbo87 wants to merge 9 commits into rust-lang:main from Turbo87:zip-source-archives

@Turbo87 Turbo87 commented Jun 17, 2026

Member

This adds a seekable per-version source archive next to each published .crate, to support a future source-code viewer and version-to-version diff viewer on crates.io.

A .crate is a single solid gzip stream, so one file cannot be read without decompressing everything before it, and it carries no per-file hash for detecting changes between versions. For each version the backend now also produces a .zip of the source files, where every entry is compressed independently and can be range-fetched (with Cargo.toml written first), plus a .zip.json manifest recording each file's offset, sizes, compression, and the sha256 of its uncompressed contents.

Both objects are deterministic, so a rebuild is byte-identical. They live under the existing crates/{name}/ prefix and are served behind the CDNs with the same immutable caching as the .crate. A background job builds and uploads them, with the zip streamed straight from disk so it is never held in memory.

An admin command (build-crate-zips) backfills existing versions, and publishing enqueues the job for new ones. The publish enqueue is fatal and runs inside the publish transaction, next to the existing index-sync jobs, so a failed enqueue rolls back the publish rather than silently skipping the archive. That matches how those index jobs already behave.

Serving (HTTP endpoints and the browser viewer) and CDN configuration are out of scope. The artifact is designed to support either serving model later.

@Turbo87 Turbo87 force-pushed the zip-source-archives branch from 892ffe8 to a549c72 June 17, 2026 20:13
@Turbo87 Turbo87 requested a review from a team June 17, 2026 20:13
@Turbo87 Turbo87 added the C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works label Jun 17, 2026
Turbo87 added 8 commits June 17, 2026 22:28
This lets tests construct tarballs with explicit directory entries
without reaching through `as_mut()` to the raw `tar::Builder`.
This introduces a self-contained crate that turns a `.crate` tarball into
a deterministic, seekable zip plus a JSON manifest. `build_zip()` reads the
gzipped tarball twice: once to count paths and buffer `Cargo.toml`, then
again to write every other file. It writes `Cargo.toml` first so a consumer
can fetch it from the start of the zip, compresses each entry with DEFLATE
level 9, and stamps an explicit `last_modified_time`, so the output is
byte-identical across re-runs.

It mirrors cargo's extraction rules for unusual tarballs: `Cargo.toml` is
matched case-insensitively, and when several entries share a path only the
last occurrence is kept, since the zip cannot hold duplicate names.
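The two-pass handling described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `plan_entries` is a hypothetical helper, paths are assumed to be already relative to the package root, the remaining files are assumed to keep tarball order, and only the first case-insensitive `Cargo.toml` match is moved to the front.

```rust
use std::collections::HashMap;

/// Hypothetical helper: given entry paths in tarball order, returns the
/// indices to write into the zip, manifest first, duplicates resolved to
/// their last occurrence.
pub fn plan_entries(paths: &[&str]) -> Vec<usize> {
    // First pass: count how often each exact path occurs.
    let mut remaining: HashMap<&str, usize> = HashMap::new();
    for &p in paths {
        *remaining.entry(p).or_insert(0) += 1;
    }

    // Second pass: keep an index only once no later duplicate remains.
    let mut manifest: Option<usize> = None;
    let mut others: Vec<usize> = Vec::new();
    for (i, &p) in paths.iter().enumerate() {
        let left = remaining.get_mut(p).expect("counted in first pass");
        *left -= 1;
        if *left > 0 {
            continue; // a later entry with the same path wins
        }
        // Assumption: only the first case-insensitive match is treated as
        // the manifest; any further match stays with the regular files.
        if manifest.is_none() && p.eq_ignore_ascii_case("Cargo.toml") {
            manifest = Some(i);
        } else {
            others.push(i);
        }
    }

    manifest.into_iter().chain(others).collect()
}
```

Counting in the first pass is what lets the second pass skip earlier duplicates without buffering file contents.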

The returned `Manifest` lists one `FileEntry` per file, sorted
case-insensitively by path, with the compressed-payload offset, compressed
and uncompressed sizes, compression method, and the sha256 of the
uncompressed contents. A consumer can use these to range-fetch and verify
a single file without parsing the zip central directory.
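As a rough consumer-side sketch of the ordering and range-fetch described above: the struct below is a hypothetical stand-in for `FileEntry` (field names are assumptions, and the compression method and sha256 fields are omitted), and the exact-path tie-break in the sort is an assumption added so the order is total.

```rust
/// Hypothetical manifest entry; field names are assumptions, not the
/// crate's real definition.
#[derive(Debug, Clone, PartialEq)]
pub struct FileEntry {
    pub path: String,
    /// Byte offset of the compressed payload inside the `.zip`.
    pub offset: u64,
    pub compressed_size: u64,
    pub uncompressed_size: u64,
}

/// Sorts entries case-insensitively by path. The exact-path comparison is
/// a tie-break (an assumption) so that the result is deterministic.
pub fn sort_entries(entries: &mut [FileEntry]) {
    entries.sort_by(|a, b| {
        a.path
            .to_lowercase()
            .cmp(&b.path.to_lowercase())
            .then_with(|| a.path.cmp(&b.path))
    });
}

/// Builds an HTTP `Range` header value covering one entry's compressed
/// payload. HTTP byte ranges are inclusive, hence the `- 1`. Returns
/// `None` for an empty payload, which needs no fetch.
pub fn range_header(entry: &FileEntry) -> Option<String> {
    if entry.compressed_size == 0 {
        return None;
    }
    let last = entry.offset.checked_add(entry.compressed_size - 1)?;
    Some(format!("bytes={}-{}", entry.offset, last))
}
```

A consumer would then decompress the fetched bytes according to the recorded compression method and compare the sha256 of the result with the manifest value.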
These nullable columns hold the SHA256 checksums of a per-version source
archive (`.zip`) and its JSON manifest (`.zip.json`). They are populated
asynchronously after publish rather than synchronously like `checksum`,
so they stay permanently nullable. A `CHECK (octet_length(...) = 32)` on
each enforces the raw 32-byte digest length.
These mirror the existing `upload_crate_file()` helper: path helpers and
upload methods for the per-version zip source archive (`.zip`) and its JSON
manifest (`.zip.json`). Both objects get the same immutable cache-control
treatment as the `.crate`.

Both live under the shared `crates/{name}/` prefix, so
`delete_all_crate_files()` already removes them on crate deletion with no
new code, as the extended deletion test documents.
This wires the `crates_io_crate_zip` builder, the storage upload methods,
and the `versions.zip_sha256`/`zip_json_sha256` columns into a
deduplicated, idempotent job. For one version it downloads the `.crate` to
a tempfile, builds the deterministic zip and manifest synchronously inside
`spawn_blocking` (the `zip` writer needs `Write + Seek`), then streams the
zip straight from disk to object storage and records both checksums in a
single `UPDATE`.

The uploads run before the database write, so a crash before the write
leaves both columns NULL and a retry rebuilds byte-identical artifacts.
Mirrors the `analyze-crates` command: `--backfill` enqueues a
`BuildCrateZip` job for every version still missing an artifact
(`zip_sha256 IS NULL OR zip_json_sha256 IS NULL`), and a crate-spec mode
(`crate@version` / `crate`) targets specific rebuilds. Jobs are inserted in
chunks of 100, with each crate's default version queued at priority -20 and
the rest at -50 so live publishes always take precedence.
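The chunking and priority split can be pictured with a sketch like the one below. The types and the `plan_backfill` function are hypothetical; only the chunk size of 100 and the priorities -20 and -50 come from the description above, and the real command inserts jobs into the database rather than returning them.

```rust
/// Hypothetical row describing a version that still lacks an artifact.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct VersionRow {
    pub version_id: i32,
    /// Whether this is the crate's default version.
    pub is_default: bool,
}

/// Hypothetical job record; the real job type is `BuildCrateZip`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct QueuedJob {
    pub version_id: i32,
    pub priority: i16,
}

pub const CHUNK_SIZE: usize = 100;
pub const DEFAULT_VERSION_PRIORITY: i16 = -20;
pub const OTHER_VERSION_PRIORITY: i16 = -50;

/// Groups the jobs into insert batches of at most `CHUNK_SIZE`, giving
/// default versions the higher of the two backfill priorities.
pub fn plan_backfill(rows: &[VersionRow]) -> Vec<Vec<QueuedJob>> {
    rows.chunks(CHUNK_SIZE)
        .map(|chunk| {
            chunk
                .iter()
                .map(|row| QueuedJob {
                    version_id: row.version_id,
                    priority: if row.is_default {
                        DEFAULT_VERSION_PRIORITY
                    } else {
                        OTHER_VERSION_PRIORITY
                    },
                })
                .collect()
        })
        .collect()
}
```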
New publishes now build their zip source archive automatically: `publish.rs`
enqueues a `BuildCrateZip` job in the same `tokio::try_join!` as the index
sync, next to the existing `AnalyzeCrateFile` job.

The publish and delete tests' `stored_files` snapshots gain the
`.zip`/`.zip.json` objects now produced during publish.
Notes the seekable `.zip` source archive and its `.zip.json` manifest in
the object-storage overview, and adds them to the publish walkthrough's
list of background follow-up work.
@Turbo87 Turbo87 force-pushed the zip-source-archives branch from a549c72 to db880f2 June 17, 2026 20:28
@rustbot

rustbot commented Jun 17, 2026

Collaborator

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

The `delete-version` command removed only the `.crate` and readme for a
version, leaving the new `.zip` source archive and its `.zip.json` manifest
orphaned in storage and cached on both CDNs. Those objects are served with
`immutable` caching, so a stale copy could linger for up to a year, and
neither CDN supports wildcard invalidation.

This adds `delete_crate_zip()` / `delete_crate_zip_manifest()` and the matching
`*_location` helpers to `Storage`, and deletes both objects (queueing their
CDN invalidation) next to the `.crate`. A missing object is ignored, since a
version may not have been built yet.

Crate deletion already covers these objects: it deletes by the shared
`crates/{name}/` prefix and invalidates every path it removed.

Labels

A-backend ⚙️ C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works
