fix(docker): use system Node in componentized builders + retry apk add #28888
Two failure modes in the componentized image builds (backend, migrations, gateway) on project-releaser, with the same root cause:

1. The builder-stage `apk add` was missing `libatomic`. `prisma generate` triggers prisma-client-py's `nodeenv`, which downloads the latest stable Node.js at build time. Node 26.1.0 (last passing build on 2026-05-20) did not dynamically link `libatomic.so.1`. Node 26.2.0 (current latest) does, and the Wolfi builder doesn't ship libatomic, so `npm install prisma@…` fails with `node: error while loading shared libraries: libatomic.so.1` and exit 127. Retrying or pinning the Node version is a treadmill; the root issue is that nodeenv decides the Node version at build time.

   Fix: add `nodejs npm` to the builder-stage `apk add` so prisma-client-py uses Wolfi's own Node via its default `PRISMA_USE_GLOBAL_NODE=true`. The legacy `docker/Dockerfile.non_root` already does this; the componentized Dockerfiles regressed it. Setting `PRISMA_USE_GLOBAL_NODE=true` in ENV redundantly nails the intent so a future env override can't silently re-enable nodeenv's download.

2. Transient `apk.cgr.dev` mirror flakes during the arm64 leg of multi-arch builds cause individual package fetches to fail mid-install (we saw `nss-db-2.43-r7: remote server returned error (try 'apk update')` and similar for libzstd1, libogg, binutils in this run). None of the componentized Dockerfiles wrap `apk add` in a retry loop.

   Fix: wrap every `apk add` (builder + runtime, all three files) in the same `for i in 1 2 3; do … && break || sleep 5; done` loop that the legacy `docker/Dockerfile.non_root` already uses.

Affected files all have the same shape (backend, migrations, gateway) because they're three near-identical componentizations of the original monolithic proxy Dockerfile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
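The first fix above hinges on a `node` binary being on `PATH` when `prisma generate` runs. As a minimal sketch, assuming one wanted the builder stage to fail fast rather than quietly fall back to nodeenv, a preflight check could look like the following. The function name `require_global_node` and its messages are hypothetical and not part of this PR.

```shell
#!/bin/sh
# Hypothetical preflight (not part of this PR): succeed only when a `node`
# binary is resolvable on PATH, which is the condition under which the
# commit message says prisma-client-py uses the system Node.
require_global_node() {
  node_path=$(command -v node) || {
    echo "node not found on PATH; prisma generate would fall back to nodeenv" >&2
    return 1
  }
  echo "using system node at ${node_path}"
}
```

In a Dockerfile this would most likely be inlined ahead of `prisma generate` in the same `RUN` step, since a shell function does not persist across layers.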
Yassin Kortam seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Same fix, leaner comments. The apk-add note is 3 lines now (was 8), and the `PRISMA_USE_GLOBAL_NODE` bullet matches the existing `UV_*` comment style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile Summary

This PR fixes two build failures in the componentized proxy Dockerfiles (

Confidence Score: 5/5

Build-system-only change that only adds packages and wraps existing `apk add` calls in retry loops; no runtime behaviour, application logic, or dependencies are altered. All three Dockerfiles receive identical, conservative changes: the new retry loop correctly calls `exit 1` after three exhausted attempts (no silent-success path), `nodejs`/`npm` are builder-stage-only and don't reach the runtime image, and `PRISMA_USE_GLOBAL_NODE=true` matches the prisma-client-py default while preventing accidental override. Nothing is removed; the worst-case outcome is a slightly larger builder layer. No files require special attention.

| Filename | Overview |
|---|---|
| backend/Dockerfile | Adds nodejs npm to builder apk add, sets PRISMA_USE_GLOBAL_NODE=true, and wraps both builder and runtime apk add invocations in a 3-retry loop with correct exit 1 on exhaustion. |
| gateway/Dockerfile | Identical fix to backend/Dockerfile — same nodejs npm addition, PRISMA_USE_GLOBAL_NODE=true, and retry-loop hardening on both stages. |
| migrations/Dockerfile | Same retry-loop and nodejs npm / PRISMA_USE_GLOBAL_NODE changes as the other two; PRISMA_USE_GLOBAL_NODE comment present in the apk add block but omitted from the ENV block (minor inconsistency vs. backend/gateway, no functional impact). |
Reviews (2): Last reviewed commit: "fix(docker): make apk-add retry loop fai..."
Greptile flagged that the retry pattern `apk add ... && break || sleep 5` exits 0 when all three attempts fail, because `sleep 5` is the last executed command. A persistent apk.cgr.dev outage would produce a silently "successful" RUN layer with no packages installed, followed by cryptic "command not found" errors in downstream RUN steps.

Fix: explicitly fail on the third miss before sleeping. Same pattern in all six retry loops (3 files × builder + runtime).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
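The exit-code behaviour described in this commit can be reproduced outside Docker. The sketch below is illustrative only: `always_fail` is a hypothetical stand-in for an `apk add` that never succeeds, `sleep 0` replaces `sleep 5` to keep it fast, and the fixed loop uses `return 1` where the Dockerfiles use `exit 1` so that the script can keep running.

```shell
#!/bin/sh
# Hypothetical stand-in for a package fetch that fails on every attempt.
always_fail() { return 1; }

# Original pattern: when every attempt fails, the last command executed is
# `sleep`, so the loop as a whole reports success.
buggy_retry() {
  for i in 1 2 3; do
    always_fail && break || sleep 0
  done
}

# Fixed pattern: fail explicitly once the third attempt has missed.
fixed_retry() {
  for i in 1 2 3; do
    always_fail && break
    [ "$i" = 3 ] && { echo "failed after 3 attempts" >&2; return 1; }
    sleep 0
  done
}

if buggy_retry; then
  echo "buggy loop: exit 0 (failure hidden)"
fi
fixed_retry || echo "fixed loop: exit $? (failure surfaced)"
```

Run with `sh`, this reports exit 0 for the original loop and exit 1 for the fixed one, which is the difference between a silently empty layer and a build that stops at the `apk add` step.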
Addressed Greptile's retry-loop exit-code finding in 572ea9e. The retry loop now reads:

    RUN for i in 1 2 3; do \
            apk add --no-cache <packages> && break; \
            [ $i = 3 ] && { echo "apk add failed after 3 retries" >&2; exit 1; }; \
            sleep 5; \
        done

Same change in all six retry loops (3 files × builder + runtime).

A persistent …

Verified the failure path locally: …

Note: …
Merged commit a645d46 into litellm_internal_staging.
Mirror the retry-loop pattern from #28888 (which fixed backend/Dockerfile, gateway/Dockerfile, and migrations/Dockerfile) into docker/Dockerfile.database. The build_docker_database_image CI job has been intermittently failing with "remote server returned error (try 'apk update')" when apk.cgr.dev flakes mid-fetch; bumping the wolfi-base SHA doesn't address the mirror, only a retry does. Same explicit-failure form as #28888: exit non-zero on the 3rd miss instead of silently succeeding because `sleep 5` was the last command in the `&& break || sleep 5` chain.
…/JWT PR

These Docker changes are out of scope for the MCP OAuth passthrough + JWT auth work and duplicate the build-reliability fix already merged to litellm_internal_staging in #28888, which adds the same apk retry loop on the componentized backend/gateway/migrations Dockerfiles and also fixes the underlying nodeenv/libatomic root cause. Restoring docker/Dockerfile.database and docker/Dockerfile.non_root to the base so this PR is purely the MCP/JWT change.
What

In the three componentized proxy Dockerfiles (`backend/`, `migrations/`, `gateway/`):

- Add `nodejs` and `npm` to the builder-stage `apk add`.
- Set `PRISMA_USE_GLOBAL_NODE=true` explicitly in the builder ENV.
- Wrap every `apk add` (builder + runtime) in the same 3-try retry loop already used by `docker/Dockerfile.non_root`.

Why
The project-releaser componentized image build (`Build and Publish Componentized Images + Chart`) has been failing since 2026-05-24. Two distinct failure modes, same root cause class: the builders were too coupled to "whatever the network gave us at build time."

Failure 1: `libatomic` missing, surfaced by upstream Node bump

`prisma generate` (run in the builder stage) triggers prisma-client-py's `nodeenv`, which downloads the latest stable Node.js from nodejs.org at build time. Two snapshots from the project-releaser logs:

…

Node 26.2.0's prebuilt Linux binary added a runtime dependency on `libatomic.so.1`. Wolfi doesn't include `libatomic` in the base image, and the componentized builder stages don't `apk add` it, so the first build that pulled 26.2.0 broke.

Pinning the Node version (e.g. via `PRISMA_NODEENV_EXTRA_ARGS=--node=26.1.0`) would unblock today, but it's a treadmill: the next time prisma-client-py changes its nodeenv default, or 26.1.0 gets pulled from the mirror, the build breaks again.

Adding `libatomic` to the builder would unblock today, but doesn't solve "the next Node release silently adds a new dynamic dep" either.

The real fix is to stop letting nodeenv decide the Node version at build time. prisma-client-py respects `PRISMA_USE_GLOBAL_NODE` (default `true`): if a `node` binary is on `PATH`, it skips nodeenv entirely. The legacy `docker/Dockerfile.non_root` already does this; its builder `apk add` includes `nodejs` and `npm`. The componentized Dockerfiles regressed it.

Using Wolfi's own `nodejs` package means:

- … `libatomic.so.1` gap.

Setting `PRISMA_USE_GLOBAL_NODE=true` in ENV is redundant with the default but documents intent and prevents a future env override from silently re-enabling nodeenv's download.

Failure 2: transient Chainguard mirror flakes
The same run also showed mid-`apk add` failures on the arm64 leg:

…

These are HTTP errors from `apk.cgr.dev` on individual package fetches: classic mirror flakiness during multi-arch builds. None of the componentized Dockerfiles wrap their `apk add` in a retry loop. The legacy `docker/Dockerfile.non_root` does (lines 17–29 and 91–96). This PR mirrors that pattern across builder + runtime in all three files.

Why these three files have the same diff

`backend/`, `migrations/`, and `gateway/Dockerfile` are three near-identical componentizations of the original monolithic proxy Dockerfile. They share the same `apk add` lists and the same lifecycle. The fix is structurally identical in each.

Testing
This change is build-system-only and can't be unit-tested. The verification is a green run of the project-releaser workflow; please trigger one against this branch before merge. The diff is conservative: it only adds packages and retries; nothing existing is removed. Worst case, the builder image grows slightly to include the Wolfi `nodejs` + `npm` packages (these layers are not in the final runtime image; they live in the builder stage only).

Follow-ups (not in this PR)

- `docker/Dockerfile.database` has the same shape (builder + runtime `apk add` with no retry). It doesn't have the Node problem because it already includes `nodejs npm` in the builder, but it could use the same retry-loop hardening.
- `libatomic` in the base image as a belt-and-suspenders guard, but the `nodejs`-from-Wolfi approach in this PR makes that unnecessary.

🤖 Generated with Claude Code