[Feat] Skip model download when model size > node GPU VRAM#630

Open
pallasathena92 wants to merge 1 commit into main from yifeliu/early-gate-model-download-on-node

@pallasathena92 pallasathena92 commented Jun 15, 2026

Collaborator

Why we need it

The model-agent currently downloads any BaseModel/ClusterBaseModel that
satisfies its existing PVC / NodeSelector / NodeAffinity gates, even when
the model will never fit in the node's GPU VRAM. This wastes disk and
bandwidth and guarantees a pod-load failure later.

What this PR does

This PR adds a VRAM precheck in Gopher that runs between the storage
listing step and the bulk download. When the estimated weight bytes (with a
configurable safety factor) exceed the node's aggregate VRAM, the download is
skipped, the metric model_agent_downloads_skipped_total{reason="vram_insufficient"}
fires, and a structured StatusDetail is persisted to the per-node ConfigMap
so the skip is observable end-to-end.

The gate can be bypassed with the annotation ome.io/multi-node-sharded: "true",
intended for models that are deliberately split across nodes at serve time.
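
The gate decision described above can be sketched as follows. This is a minimal illustration, not the PR's vramPrecheck: the helper names are hypothetical, truncating the scaled value toward zero is an assumption (it reproduces the required= figures in the test logs of this PR), and treating an unknown VRAM figure as fail-open is also an assumption.

```go
package main

import "fmt"

// requiredBytes scales the estimated weight size by the safety factor.
// Truncation toward zero is an assumption, not confirmed by the source.
func requiredBytes(estimated uint64, safety float64) uint64 {
	return uint64(float64(estimated) * safety)
}

// shouldSkipDownload is a hypothetical helper. It fails open (never skips)
// when the model is multi-node sharded or when either input is unknown (0).
func shouldSkipDownload(estimated, availableVRAM uint64, safety float64, multiNodeSharded bool) bool {
	if multiNodeSharded || estimated == 0 || availableVRAM == 0 {
		return false
	}
	return requiredBytes(estimated, safety) > availableVRAM
}

func main() {
	// Figures taken from the OCI test log in this PR.
	est := uint64(416789666345)
	vram := uint64(343597383680)
	fmt.Println(requiredBytes(est, 1.20), shouldSkipDownload(est, vram, 1.20, false))
}
```

Failing open on missing inputs keeps the gate from blocking downloads it cannot reason about, which matches the fail-open behaviour described for backends without a size signal.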

How the size estimate works

Per-backend RemoteSizeEstimator interface, dispatched by storage type:

| Backend | Strategy |
| --- | --- |
| OCI | Inline: sum sizes returned by the existing ListObjects call, applying the same TensorRTLLM shape filter the bulk-download path uses (so per-shape subdirs don't over-count by N). |
| HuggingFace | One /api/models/{id}?blobs=true call → two-stage classifier (format × method → strategy) → byte estimate. See below. |
| Local | filepath.WalkDir + sum info.Size(). |
| S3 | Stub (fail-open) with TODO; activated when bulk S3 downloads land. |
| Vendor / PVC | (0, nil) — no useful size signal, fail open. |

HF classifier (Format × Method → Strategy)

Format is one of diffusion / safetensors / gguf / bin / unknown, picked
from library_name, safetensors, gguf, and sibling extensions. Method
is one of awq / gptq / bitsandbytes / hqq / compressed-tensors / nvfp4 / mxfp4 / fp8 / bf16 / fp32 / …, picked from config.quantization_config.{quant_method,format},
tags/name regex, GGUF filename quant tag, or dtype dominance.

Strategies:

  • safetensors_dtype_count — Σ count × bytes_per_dtype(d) over the
    flat safetensors.parameters map, capped by usedStorage so AWQ's
    packed-int4-as-I32 over-count is absorbed without method-specific math.
  • diffusion_component_sum — sum the best weight file per component
    subdir (transformer/, unet/, vae/, text_encoder*/,
    image_encoder/), preferring .fp16.safetensors so we estimate
    serve-time VRAM rather than disk.
  • gguf_variant_max — group .gguf siblings by quant variant
    (Q4_K_M, Q8_0, …) and return the largest variant's total, so the gate
    doesn't under-block a user who picks the heaviest quant. Falls back to
    gguf.total_file_size.
  • siblings_byte_sum — last-resort sum of weight-extension siblings.
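
To make gguf_variant_max concrete, here is a rough sketch of the grouping step. The sibling type, the function name, and the filename regex are all assumptions made for illustration; the actual quant-tag parsing in the PR may differ, and the fallback to gguf.total_file_size is omitted.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sibling is a hypothetical stand-in for one file entry in a repo listing.
type sibling struct {
	Name string
	Size uint64
}

// quantTag is an assumed pattern for quant variants such as Q4_K_M or Q8_0.
var quantTag = regexp.MustCompile(`(?i)[-._]((?:IQ|Q)\d+(?:_[A-Z0-9]+)*|BF16|F16|F32)[-.]`)

// ggufVariantMax sums .gguf sizes per quant variant and returns the heaviest
// variant. Ties are broken by variant name so the result is deterministic.
func ggufVariantMax(files []sibling) (variant string, total uint64) {
	totals := map[string]uint64{}
	for _, f := range files {
		if !strings.HasSuffix(strings.ToLower(f.Name), ".gguf") {
			continue
		}
		m := quantTag.FindStringSubmatch(f.Name)
		if m == nil {
			continue // no recognizable quant tag in the filename
		}
		totals[strings.ToUpper(m[1])] += f.Size
	}
	for v, t := range totals {
		if t > total || (t == total && v < variant) {
			variant, total = v, t
		}
	}
	return variant, total
}

func main() {
	// Made-up sizes, for illustration only.
	files := []sibling{
		{"model-Q4_K_M.gguf", 4},
		{"model-Q8_0-00001-of-00002.gguf", 5},
		{"model-Q8_0-00002-of-00002.gguf", 4},
		{"README.md", 1},
	}
	fmt.Println(ggufVariantMax(files))
}
```

Summing shards per variant before taking the maximum matters for split files: a single Q8_0 shard can be smaller than a whole Q4_K_M file even though the full Q8_0 variant is larger.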

Dtype byte table is validated against a top-300 + top-1000 HF survey:

| dtype | bytes/param | Notes |
| --- | --- | --- |
| F32, I32, U32 | 4 | I32 is always 4 (AWQ packed-int4-as-I32 caught by min cap; HQQ stores real fp32 metadata as I32). |
| BF16, F16 | 2 | |
| F8_E4M3, F8_E5M2, F8_E8M0 | 1 | |
| I8, U8 | 1 | U8 is 1, not 0.5 — BnB / MXFP4 pre-pack two int4 into a uint8 and report the byte count. |
| I64, F64 | 8 | |
| BOOL | 1 | |
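
The dtype table and the usedStorage cap combine roughly as in the sketch below. The function name and the handling of an unrecognized dtype (report failure so the caller can fall back to another strategy) are assumptions; the byte widths are the ones listed in the table.

```go
package main

import "fmt"

// bytesPerDtype mirrors the dtype table in this description.
var bytesPerDtype = map[string]uint64{
	"F32": 4, "I32": 4, "U32": 4,
	"BF16": 2, "F16": 2,
	"F8_E4M3": 1, "F8_E5M2": 1, "F8_E8M0": 1,
	"I8": 1, "U8": 1,
	"I64": 8, "F64": 8,
	"BOOL": 1,
}

// dtypeCountBytes sums count × bytes_per_dtype over a safetensors parameter
// map and caps the result at usedStorage when that value is known (> 0).
// The boolean is false if any dtype is missing from the table.
func dtypeCountBytes(params map[string]uint64, usedStorage uint64) (uint64, bool) {
	var total uint64
	for dtype, count := range params {
		width, ok := bytesPerDtype[dtype]
		if !ok {
			return 0, false // assumption: unknown dtype means "use a fallback"
		}
		total += count * width
	}
	if usedStorage > 0 && total > usedStorage {
		total = usedStorage
	}
	return total, true
}

func main() {
	// Made-up parameter counts, for illustration only.
	params := map[string]uint64{"BF16": 1000, "F32": 10}
	fmt.Println(dtypeCountBytes(params, 0))
	fmt.Println(dtypeCountBytes(params, 1500))
}
```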

EstimateDetail{Format, Method, Strategy, UsedFallback} is logged on every
gate decision and persisted in the skip StatusDetail so operators can see
exactly why a number came out where it did.

Multi-node sharded models

A model annotated ome.io/multi-node-sharded: "true" bypasses the
VRAM gate entirely (Gopher logs the bypass and increments
model_agent_precheck_bypass_total{reason="multi_node_sharded"}). This is
the documented opt-out for models that are split across nodes at serve time
(e.g. a 405 B FP8 deployed 1/N across N nodes).

Observability

New Prometheus counters:

  • model_agent_downloads_skipped_total{model_type, namespace, name, reason} — gate refused the download.
  • model_agent_precheck_bypass_total{model_type, namespace, name, reason} — gate intentionally skipped (multi_node_sharded, size_unknown).

Per-node StatusDetail written to the model-agent ConfigMap on skip:

{
  "name": "llama-3-1-70b-instruct",
  "status": "Failed",
  "statusDetail": {
    "reason": "VRAMInsufficient",
    "message": "estimated weight bytes 141G (× safety 1.20 → required 169G) exceed available VRAM 96G on this node",
    "requiredBytes": 169000000000,
    "availableVRAMBytes": 96000000000,
    "estimatedWeightBytes": 141000000000,
    "safetyFactor": 1.2,
    "format": "safetensors",
    "method": "bf16",
    "strategy": "safetensors_dtype_count"
  }
}
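
For readers consuming this ConfigMap entry, the payload above could be decoded with types along the following lines. The field layout is inferred from the JSON sample only; the actual StatusDetail and ModelEntry definitions in pkg/modelagent/model_data.go may differ, and parseEntry is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StatusDetail is inferred from the JSON sample, not copied from the PR.
type StatusDetail struct {
	Reason               string  `json:"reason"`
	Message              string  `json:"message"`
	RequiredBytes        uint64  `json:"requiredBytes"`
	AvailableVRAMBytes   uint64  `json:"availableVRAMBytes"`
	EstimatedWeightBytes uint64  `json:"estimatedWeightBytes"`
	SafetyFactor         float64 `json:"safetyFactor"`
	Format               string  `json:"format"`
	Method               string  `json:"method"`
	Strategy             string  `json:"strategy"`
}

// ModelEntry shows only the fields visible in the sample.
type ModelEntry struct {
	Name         string        `json:"name"`
	Status       string        `json:"status"`
	StatusDetail *StatusDetail `json:"statusDetail,omitempty"`
}

// parseEntry decodes one per-node entry.
func parseEntry(data []byte) (ModelEntry, error) {
	var e ModelEntry
	err := json.Unmarshal(data, &e)
	return e, err
}

func main() {
	sample := []byte(`{"name":"llama-3-1-70b-instruct","status":"Failed",
		"statusDetail":{"reason":"VRAMInsufficient","requiredBytes":169000000000,
		"availableVRAMBytes":96000000000,"estimatedWeightBytes":141000000000,
		"safetyFactor":1.2,"format":"safetensors","method":"bf16",
		"strategy":"safetensors_dtype_count"}}`)
	e, err := parseEntry(sample)
	if err != nil {
		panic(err)
	}
	d := e.StatusDetail
	fmt.Printf("%s: %s (required %d > available %d)\n",
		e.Name, d.Reason, d.RequiredBytes, d.AvailableVRAMBytes)
}
```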

Top-level CR-status aggregation of these per-node reasons is out of scope
for this PR — it's a model-controller change tracked separately. This PR
only writes the detail.

Changes

  • New pkg/modelagent/gpu_memory.go + test — env-var loader, AvailableNodeVRAMBytes, vramSafetyFactor.
  • New pkg/modelagent/remote_size.go + test — RemoteSizeEstimator interface, HF classifier, per-backend implementations.
  • Modified pkg/modelagent/scout.go — caches availableVRAMBytes, stamps AvailableVRAMBytes + MultiNodeSharded on every download/override task.
  • Modified pkg/modelagent/gopher.go — GopherTask fields, vramPrecheck / applyVRAMGate / markModelOnNodeSkipped helpers; inline OCI gate after ListObjects; TRT-LLM shape filter extracted into a helper.
  • Modified pkg/modelagent/model_data.go — MultiNodeShardedAnnotation, isMultiNodeSharded, StatusDetail type, ModelEntry.StatusDetail field.
  • Modified pkg/modelagent/node_label_reconciler.go + configmap_reconciler.go — StatusDetail plumbed through NodeLabelOp → ConfigMapStatusOp → cache → ConfigMap writes.
  • Modified pkg/modelagent/metrics.go — RecordSkippedDownload, RecordPrecheckBypass.
  • Modified Helm + config/model-agent ConfigMap and DaemonSet manifests.

Fixes #

How to test

On an A100-40G node, downloading the large model llama-4-maverick-17b-128e-instruct-fp8 from OCI Object Storage:

Skipping download for ClusterBaseModel llama-4-maverick-17b-128e-instruct-fp8: required=500147599614 bytes > available VRAM=343597383680 bytes (estimated_weights=416789666345, safety=1.20, format=unknown, method=none, strategy=oci_list_sum)

On an A100-40G node, downloading the large model llama-4-maverick-17b-128e-instruct-fp8 from HF:

Skipping download for ClusterBaseModel llama-4-maverick-17b-128e-instruct-fp8: required=500103152025 bytes > available VRAM=343597383680 bytes (estimated_weights=416752626688, safety=1.20, format=safetensors, method=fp8, strategy=safetensors_dtype_count)

The estimated_weights values differ because the two backends use different estimators (oci_list_sum vs. safetensors_dtype_count, described above).

On an A100-40G node, downloading the large model llama-4-maverick-17b-128e-instruct-fp8 from OCI Object Storage with the "ome.io/multi-node-sharded" annotation:

VRAM precheck skipped for ClusterBaseModel llama-4-maverick-17b-128e-instruct-fp8: ome.io/multi-node-sharded annotation set; multi-node sharding active

On an A100-40G node, downloading the large model llama-4-maverick-17b-128e-instruct-fp8 from HF with the "ome.io/multi-node-sharded" annotation:

VRAM precheck skipped for ClusterBaseModel llama-4-maverick-17b-128e-instruct-fp8: ome.io/multi-node-sharded annotation set; multi-node sharding active

Checklist

  • Tests added/updated (if applicable)
  • Docs updated (if applicable)
  • make test passes locally

@github-actions (bot) added the helm (Helm chart changes), model-agent (Model agent changes), tests (Test changes), config (Configuration changes), and dependencies (Dependency updates) labels on Jun 15, 2026
@github-actions


⚠️ Pre-commit checks failed

Please run the following locally and commit the fixes:

pre-commit run --all-files
git add -u && git commit

See CONTRIBUTING.md for setup instructions.

@pallasathena92 force-pushed the yifeliu/early-gate-model-download-on-node branch from 97472f1 to b4aad59 on June 15, 2026 21:50