fix(docker): use system Node in componentized builders + retry apk add #28888

Merged: yassin-berriai merged 3 commits into litellm_internal_staging from litellm_fix/componentized-builder-system-node on May 26, 2026

@yassin-berriai

Contributor

What

In the three componentized proxy Dockerfiles (backend/, migrations/, gateway/):

  • Add nodejs and npm to the builder-stage apk add.
  • Set PRISMA_USE_GLOBAL_NODE=true explicitly in the builder ENV.
  • Wrap every apk add (builder + runtime) in the same 3-try retry loop already used by docker/Dockerfile.non_root.

Why

The project-releaser componentized image build (Build and Publish Componentized Images + Chart) has been failing since 2026-05-24. There are two distinct failure modes with the same class of root cause: the builders were too coupled to "whatever the network gave us at build time."

Failure 1 — libatomic missing, surfaced by upstream Node bump

prisma generate (run in the builder stage) triggers prisma-client-py's nodeenv, which downloads the latest stable Node.js from nodejs.org at build time. Two snapshots from the project-releaser logs:

# Last passing run (2026-05-20, https://github.com/BerriAI/project-releaser/actions/runs/26134099768)
#28 0.622  * Install prebuilt node (26.1.0)

# Today's failing run
#29 0.675  * Install prebuilt node (26.2.0)
…
node: error while loading shared libraries: libatomic.so.1: cannot open shared object file: No such file or directory
…
subprocess.CalledProcessError: Command '['/home/nonroot/.cache/prisma-python/nodeenv/bin/npm', 'install', 'prisma@5.4.2']' returned non-zero exit status 127.

Node 26.2.0's prebuilt Linux binary added a runtime dependency on libatomic.so.1. Wolfi doesn't include libatomic in the base image, and the componentized builder stages don't apk add it — so the first build that pulled 26.2.0 broke.
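
One way to confirm this class of failure inside a builder image is to ask the dynamic loader which shared libraries a binary cannot resolve. The following is a minimal sketch, not part of this PR. It assumes a glibc-style `ldd` is installed in the image and that nodeenv places `node` next to the `npm` path shown in the log above; the helper name `missing_libs` is hypothetical.

```shell
# Hypothetical helper (not part of this PR): print the shared libraries
# that ldd reports as unresolved. Reads ldd output on stdin.
missing_libs() {
  # glibc's ldd prints unresolved entries as "<name> => not found".
  awk '/=> not found/ { print $1 }'
}

# Assumed location of the nodeenv-installed binary, inferred from the npm
# path in the failing log. The check is skipped if the file or ldd is absent.
node_bin="/home/nonroot/.cache/prisma-python/nodeenv/bin/node"
if [ -x "$node_bin" ] && command -v ldd >/dev/null 2>&1; then
  ldd "$node_bin" | missing_libs
fi
```

On an image affected by the failure described above, this would be expected to list `libatomic.so.1`; on a healthy image it prints nothing.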

Pinning the Node version (e.g. via PRISMA_NODEENV_EXTRA_ARGS=--node=26.1.0) would unblock today, but it's a treadmill: the next time prisma-client-py changes its nodeenv default, or 26.1.0 gets pulled from the mirror, the build breaks again.

Adding libatomic to the builder would unblock today, but doesn't solve "the next Node release silently adds a new dynamic dep" either.

The real fix is to stop letting nodeenv decide the Node version at build time. prisma-client-py respects PRISMA_USE_GLOBAL_NODE (default true) — if a node binary is on PATH, it skips nodeenv entirely. The legacy docker/Dockerfile.non_root already does this — its builder apk add includes nodejs and npm. The componentized Dockerfiles regressed it.

Using Wolfi's own nodejs package means:

  • The Node binary is built and packaged for this image, and apk installs the shared libraries the package depends on; no libatomic.so.1 gap.
  • The version moves only when we bump the Wolfi base digest — a deliberate, reviewable change instead of a daily roll.
  • No network call to nodejs.org during build.

Setting PRISMA_USE_GLOBAL_NODE=true in ENV is redundant with the default but documents intent and prevents a future env override from silently re-enabling nodeenv's download.
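
A build could also assert this intent just before `prisma generate`, so that a missing system Node fails loudly instead of falling back to a download. This is a minimal sketch and not part of this PR. The function name `require_system_node` is hypothetical, and the strict comparison against the literal string `true` is an assumption made for illustration; prisma-client-py may accept other truthy spellings.

```shell
# Hypothetical pre-flight check (not part of this PR).
require_system_node() {
  # Treat an unset variable as the default described above (true).
  if [ "${PRISMA_USE_GLOBAL_NODE:-true}" != "true" ]; then
    echo "PRISMA_USE_GLOBAL_NODE is not 'true'; nodeenv may download Node" >&2
    return 1
  fi
  # A node binary must be discoverable on PATH for nodeenv to be skipped.
  if ! command -v node >/dev/null 2>&1; then
    echo "no node binary on PATH; nodeenv would download Node" >&2
    return 1
  fi
  echo "using system node: $(command -v node)"
}
```

In a builder stage this could run as `require_system_node && prisma generate`, so the step stops before any network fetch is attempted.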

Failure 2 — transient Chainguard mirror flakes

The same run also showed mid-apk add failures on the arm64 leg:

ERROR: nss-db-2.43-r7: remote server returned error (try 'apk update')
ERROR: libzstd1-1.5.7-r7: remote server returned error (try 'apk update')
ERROR: libogg-1.3.6-r5: remote server returned error (try 'apk update')
ERROR: binutils-2.46-r1: remote server returned error (try 'apk update')

These are HTTP errors from apk.cgr.dev on individual package fetches — classic mirror flakiness during multi-arch builds. None of the componentized Dockerfiles wrap their apk add in a retry loop. The legacy docker/Dockerfile.non_root does (lines 17–29 and 91–96). This PR mirrors that pattern across builder + runtime in all three files.
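
The retry idea can be sketched as a small shell function. This illustrates the pattern rather than reproducing the exact text in the Dockerfiles; the name `retry3` and the `RETRY_DELAY` variable are hypothetical. Unlike a bare `&& break || sleep 5` chain, this version returns non-zero when every attempt fails.

```shell
# Run a command up to three times, sleeping between failed attempts.
# Returns 0 on the first success and 1 if all three attempts fail.
retry3() {
  for attempt in 1 2 3; do
    "$@" && return 0
    if [ "$attempt" -eq 3 ]; then
      echo "command failed after 3 attempts: $*" >&2
      return 1
    fi
    # Delay defaults to 5 seconds; RETRY_DELAY is a hypothetical override.
    sleep "${RETRY_DELAY:-5}"
  done
}

# Example (inside a builder stage): retry3 apk add --no-cache nodejs npm
```

Three attempts with a short delay are enough for brief per-package fetch errors, while a persistent outage still fails the build.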

Why these three files have the same diff

backend/, migrations/, and gateway/Dockerfile are three near-identical componentizations of the original monolithic proxy Dockerfile. They share the same apk add lists and the same lifecycle. The fix is structurally identical in each.

Testing

This change is build-system-only and can't be unit-tested. The verification is a green run of the project-releaser workflow — please trigger one against this branch before merge. The diff is conservative: it only adds packages and retries; nothing existing is removed. Worst case, the builder image grows slightly to include the Wolfi nodejs + npm packages (these layers are not in the final runtime image — they live in the builder stage only).

Follow-ups (not in this PR)

  • docker/Dockerfile.database has the same shape (builder + runtime apk add with no retry) — it doesn't have the Node problem because it already includes nodejs npm in the builder, but it could use the same retry-loop hardening.
  • The project-releaser repo could pin a Wolfi base digest known to ship libatomic in the base image as a belt-and-suspenders guard — but the nodejs-from-Wolfi approach in this PR makes that unnecessary.

🤖 Generated with Claude Code

Two failure modes in the componentized image builds (backend, migrations,
gateway) on project-releaser, with the same root cause:

1. The builder-stage `apk add` was missing `libatomic`. `prisma generate`
   triggers prisma-client-py's `nodeenv`, which downloads the latest stable
   Node.js at build time. Node 26.1.0 (last passing build on 2026-05-20) did
   not dynamically link `libatomic.so.1`. Node 26.2.0 (current latest) does,
   and the Wolfi builder doesn't ship libatomic — so `npm install prisma@…`
   fails with `node: error while loading shared libraries: libatomic.so.1`
   and exit 127. Retrying or pinning the Node version is a treadmill; the
   root issue is that nodeenv decides the Node version at build time.

   Fix: add `nodejs npm` to the builder-stage `apk add` so prisma-client-py
   uses Wolfi's own Node via its default `PRISMA_USE_GLOBAL_NODE=true`. The
   legacy `docker/Dockerfile.non_root` already does this; the componentized
   Dockerfiles regressed it. Setting `PRISMA_USE_GLOBAL_NODE=true` in ENV
   redundantly nails the intent so a future env override can't silently
   re-enable nodeenv's download.

2. Transient `apk.cgr.dev` mirror flakes during the arm64 leg of multi-arch
   builds cause individual package fetches to fail mid-install (we saw
   `nss-db-2.43-r7: remote server returned error (try 'apk update')` and
   similar for libzstd1, libogg, binutils in this run). None of the
   componentized Dockerfiles wrap `apk add` in a retry loop.

   Fix: wrap every `apk add` (builder + runtime, all three files) in the
   same `for i in 1 2 3; do … && break || sleep 5; done` loop that the
   legacy `docker/Dockerfile.non_root` already uses.

Affected files all have the same shape — backend, migrations, gateway —
because they're three near-identical componentizations of the original
monolithic proxy Dockerfile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@CLAassistant

CLAassistant commented May 26, 2026


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ Yassin Kortam
❌ yassin-berriai


Yassin Kortam seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

Same fix, leaner comments. The apk-add note is 3 lines now (was 8), and the
PRISMA_USE_GLOBAL_NODE bullet matches the existing UV_* comment style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented May 26, 2026

Contributor

Greptile Summary

This PR fixes two build failures in the componentized proxy Dockerfiles (backend/, gateway/, migrations/) that have been breaking the project-releaser workflow since 2026-05-24: a missing libatomic dependency pulled in by Node 26.2.0's prebuilt binary, and transient apk.cgr.dev mirror errors during multi-arch builds.

  • Node from Wolfi instead of nodeenv: nodejs and npm are added to the builder-stage apk add and PRISMA_USE_GLOBAL_NODE=true is set in ENV, so prisma generate uses the Wolfi-managed Node binary (already linked against the image's own libc) and never reaches out to nodejs.org at build time.
  • Retry hardening on all apk add calls: both builder and runtime apk add invocations in all three Dockerfiles are wrapped in the same 3-attempt loop already present in docker/Dockerfile.non_root; the loop correctly calls exit 1 on the third failure, so persistent mirror outages produce a clear build error rather than silently continuing.

Confidence Score: 5/5

Build-system-only change that only adds packages and wraps existing apk add calls in retry loops; no runtime behaviour, application logic, or dependencies are altered.

All three Dockerfiles receive identical, conservative changes: the new retry loop correctly calls exit 1 after three exhausted attempts (no silent-success path), nodejs/npm are builder-stage-only and don't reach the runtime image, and PRISMA_USE_GLOBAL_NODE=true matches the prisma-client-py default while preventing accidental override. Nothing is removed; the worst-case outcome is a slightly larger builder layer.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/Dockerfile | Adds nodejs npm to builder apk add, sets PRISMA_USE_GLOBAL_NODE=true, and wraps both builder and runtime apk add invocations in a 3-retry loop with correct exit 1 on exhaustion. |
| gateway/Dockerfile | Identical fix to backend/Dockerfile: same nodejs npm addition, PRISMA_USE_GLOBAL_NODE=true, and retry-loop hardening on both stages. |
| migrations/Dockerfile | Same retry-loop and nodejs npm / PRISMA_USE_GLOBAL_NODE changes as the other two; PRISMA_USE_GLOBAL_NODE comment present in the apk add block but omitted from the ENV block (minor inconsistency vs. backend/gateway, no functional impact). |

Reviews (2): Last reviewed commit: "fix(docker): make apk-add retry loop fai..."

Comment thread backend/Dockerfile
Greptile flagged that the retry pattern `apk add ... && break || sleep 5`
exits 0 when all three attempts fail, because `sleep 5` is the last
executed command. A persistent apk.cgr.dev outage would produce a silently
"successful" RUN layer with no packages installed, followed by cryptic
"command not found" errors in downstream RUN steps.

Fix: explicitly fail on the third miss before sleeping. Same pattern in
all six retry loops (3 files × builder + runtime).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yassin-berriai

Contributor Author

Addressed Greptile's retry-loop exit-code finding in 572ea9e — the apk add ... && break || sleep 5 pattern is replaced with:

RUN for i in 1 2 3; do \
      apk add --no-cache <packages> && break; \
      [ $i = 3 ] && { echo "apk add failed after 3 retries" >&2; exit 1; }; \
      sleep 5; \
    done

Same change in all six retry loops (3 files × builder + runtime). A persistent apk.cgr.dev outage now produces a clear failure on the third miss instead of a silently-successful RUN layer.

Verified the failure path locally:

$ sh -c 'for i in 1 2 3; do false && break; [ $i = 3 ] && { echo "apk add failed after 3 retries" >&2; exit 1; }; sleep 0; done; echo REACHED_END' ; echo "exit=$?"
apk add failed after 3 retries
exit=1
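
For comparison, the original `&& break || sleep 5` form can be checked the same way. This is a minimal sketch in which `false` stands in for a failing `apk add` and `sleep 0` replaces the 5-second delay; both substitutions are for illustration only. The function names are hypothetical, and `return` is used in place of `exit` because the loops are wrapped in functions.

```shell
# Old pattern: when every attempt fails, the last command run is `sleep`,
# so the loop, and therefore the RUN step, reports success.
old_pattern() {
  for i in 1 2 3; do false && break || sleep 0; done
}

# New pattern from this PR: fail explicitly on the third miss.
new_pattern() {
  for i in 1 2 3; do
    false && break
    [ "$i" = 3 ] && { echo "apk add failed after 3 retries" >&2; return 1; }
    sleep 0
  done
}

old_pattern; echo "old exit=$?"                        # prints: old exit=0
new_pattern 2>/dev/null || echo "new exit=$?"          # prints: new exit=1
```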

Note: docker/Dockerfile.non_root has the same pre-existing pattern Greptile mentioned — leaving that out of this PR to keep the scope tight; can do a follow-up.

@codecov

codecov Bot commented May 26, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

@yassin-berriai

Contributor Author

@greptileai

@yassin-berriai yassin-berriai enabled auto-merge (squash) May 26, 2026 22:40
@yassin-berriai yassin-berriai merged commit a645d46 into litellm_internal_staging May 26, 2026
116 of 118 checks passed
mateo-berri added a commit that referenced this pull request May 26, 2026
Mirror the retry-loop pattern from #28888 (which fixed backend/Dockerfile,
gateway/Dockerfile, and migrations/Dockerfile) into docker/Dockerfile.database.
The build_docker_database_image CI job has been intermittently failing with
"remote server returned error (try 'apk update')" when apk.cgr.dev flakes
mid-fetch; bumping the wolfi-base SHA doesn't address the mirror, only a
retry does.

Same explicit-failure form as #28888: exit non-zero on the 3rd miss instead
of silently succeeding because `sleep 5` was the last command in the
`&& break || sleep 5` chain.
mateo-berri added a commit that referenced this pull request May 30, 2026
…/JWT PR

These Docker changes are out of scope for the MCP OAuth passthrough + JWT
auth work and duplicate the build-reliability fix already merged to
litellm_internal_staging in #28888, which adds the same apk retry loop on
the componentized backend/gateway/migrations Dockerfiles and also fixes the
underlying nodeenv/libatomic root cause. Restoring docker/Dockerfile.database
and docker/Dockerfile.non_root to the base so this PR is purely the MCP/JWT
change.