[Feat] Skip model download when model size > node GPU VRAM by pallasathena92 · Pull Request #630 · ome-projects/ome

pallasathena92 · 2026-06-15T20:59:06Z

What this PR does

The model-agent currently downloads any BaseModel/ClusterBaseModel that
satisfies its existing PVC / NodeSelector / NodeAffinity gates, even when
the model will never fit in the node's GPU VRAM. This wastes disk and
bandwidth and guarantees a pod-load failure later.

Why we need it

This PR adds a VRAM precheck in Gopher that runs between the storage
listing step and the bulk download. When the estimated weight bytes (with a
configurable safety factor) exceed the node's aggregate VRAM, the download is
skipped, the metric model_agent_downloads_skipped_total{reason="vram_insufficient"}
fires, and a structured StatusDetail is persisted to the per-node ConfigMap
so the skip is observable end-to-end.

Models that are intentionally split across nodes at serve time can opt out of
the gate with the annotation ome.io/multi-node-sharded: "true".
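The gate decision described above can be sketched as follows. This is a minimal illustration, not the PR's actual `vramPrecheck` / `applyVRAMGate` code: the `GateInput` and `GateDecision` types and their field names are hypothetical, and truncating the required byte count to an integer is an assumption chosen because it matches the `required=` values in the test logs under "How to test". Nodes that report no VRAM at all are not handled here.

```go
package main

import "fmt"

// GateInput carries what the precheck needs. The type and field names are
// illustrative; they are not the PR's GopherTask fields.
type GateInput struct {
	EstimatedWeightBytes int64   // 0 means the size could not be estimated
	AvailableVRAMBytes   int64   // aggregate VRAM across the node's GPUs
	SafetyFactor         float64 // e.g. 1.20
	MultiNodeSharded     bool    // true when the opt-out annotation is set
}

// GateDecision is the outcome of the precheck (hypothetical type).
type GateDecision struct {
	Skip          bool
	Reason        string
	RequiredBytes int64
}

// vramGate returns Skip=true only when a known size, scaled by the safety
// factor, exceeds the node's VRAM. Every other case lets the download proceed.
func vramGate(in GateInput) GateDecision {
	if in.MultiNodeSharded {
		return GateDecision{Reason: "multi_node_sharded"} // explicit opt-out
	}
	if in.EstimatedWeightBytes <= 0 {
		return GateDecision{Reason: "size_unknown"} // fail open
	}
	// Assumption: the product is truncated toward zero.
	required := int64(float64(in.EstimatedWeightBytes) * in.SafetyFactor)
	if required > in.AvailableVRAMBytes {
		return GateDecision{Skip: true, Reason: "vram_insufficient", RequiredBytes: required}
	}
	return GateDecision{RequiredBytes: required}
}

func main() {
	// Numbers taken from the HF test log in "How to test".
	d := vramGate(GateInput{
		EstimatedWeightBytes: 416752626688,
		AvailableVRAMBytes:   343597383680,
		SafetyFactor:         1.20,
	})
	fmt.Printf("skip=%v reason=%s required=%d\n", d.Skip, d.Reason, d.RequiredBytes)
}
```

With the HF log's inputs this sketch arrives at the same `required=500103152025` that the log reports, and it skips the download because that exceeds the 343597383680 bytes available.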

How the size estimate works

Per-backend RemoteSizeEstimator interface, dispatched by storage type:

| Backend | Strategy |
| --- | --- |
| OCI | Inline: sum sizes returned by the existing `ListObjects` call, applying the same TensorRTLLM shape filter the bulk-download path uses (so per-shape subdirs don't over-count by N). |
| HuggingFace | One `/api/models/{id}?blobs=true` call → two-stage classifier (`format × method → strategy`) → byte estimate. See below. |
| Local | `filepath.WalkDir` + sum `info.Size()`. |
| S3 | Stub (fail-open) with TODO; activated when bulk S3 downloads land. |
| Vendor / PVC | `(0, nil)`: no useful size signal, fail open. |
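As a rough illustration of the per-backend dispatch, the sketch below implements only the Local and fail-open rows of the table. The interface name `RemoteSizeEstimator` comes from this PR, but the `EstimateBytes` method signature, the `localEstimator` / `failOpenEstimator` types, the `estimatorFor` helper, and the storage-type strings are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

// RemoteSizeEstimator uses the interface name from this PR; the method
// signature is an assumption made for this sketch.
type RemoteSizeEstimator interface {
	EstimateBytes() (int64, error)
}

// localEstimator walks a directory tree and sums regular-file sizes.
type localEstimator struct{ root string }

func (l localEstimator) EstimateBytes() (int64, error) {
	var total int64
	err := filepath.WalkDir(l.root, func(_ string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // includes the case where root itself is missing
		}
		if !d.Type().IsRegular() {
			return nil // skip directories, symlinks, devices
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		total += info.Size()
		return nil
	})
	if err != nil {
		return 0, err
	}
	return total, nil
}

// failOpenEstimator reports (0, nil), i.e. "no size signal", so the caller
// lets the download proceed.
type failOpenEstimator struct{}

func (failOpenEstimator) EstimateBytes() (int64, error) { return 0, nil }

// estimatorFor dispatches on a storage-type string (hypothetical values).
func estimatorFor(storageType, location string) RemoteSizeEstimator {
	switch storageType {
	case "local":
		return localEstimator{root: location}
	default: // vendor, pvc, and the S3 stub all fail open in this sketch
		return failOpenEstimator{}
	}
}

func main() {
	size, err := estimatorFor("pvc", "").EstimateBytes()
	fmt.Println(size, err) // a zero size with a nil error means "unknown"
}
```

Returning `(0, nil)` rather than an error keeps "no signal" distinct from "estimation failed", which lets the gate fail open in the first case and decide separately what to do in the second.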

HF classifier (Format × Method → Strategy)

Format is one of diffusion / safetensors / gguf / bin / unknown, picked
from library_name, safetensors, gguf, and sibling extensions. Method
is one of awq / gptq / bitsandbytes / hqq / compressed-tensors / nvfp4 / mxfp4 / fp8 / bf16 / fp32 / …, picked from config.quantization_config.{quant_method,format},
tags/name regex, GGUF filename quant tag, or dtype dominance.

Strategies:

- safetensors_dtype_count: Σ count × bytes_per_dtype(d) over the
  flat safetensors.parameters map, capped by usedStorage so AWQ's
  packed-int4-as-I32 over-count is absorbed without method-specific math.
- diffusion_component_sum: sum the best weight file per component
  subdir (transformer/, unet/, vae/, text_encoder*/,
  image_encoder/), preferring .fp16.safetensors so we estimate
  serve-time VRAM rather than disk.
- gguf_variant_max: group .gguf siblings by quant variant
  (Q4_K_M, Q8_0, …) and return the largest variant's total, so the gate
  doesn't under-block a user who picks the heaviest quant. Falls back to
  gguf.total_file_size.
- siblings_byte_sum: last-resort sum of weight-extension siblings.
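To make the gguf_variant_max strategy concrete, here is a small sketch under stated assumptions. The `Sibling` type, the `ggufVariantMax` function, and in particular the `quantTag` regular expression are hypothetical; the PR's real filename parsing may recognize a different set of quant tags.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Sibling is one repository file with its size in bytes (hypothetical type).
type Sibling struct {
	Name string
	Size int64
}

// quantTag is an illustrative pattern for tags such as Q4_K_M, Q8_0, IQ4_XS,
// F16, BF16. It is an assumption, not the PR's actual expression.
var quantTag = regexp.MustCompile(`(?i)(?:^|[-._])(I?Q\d[A-Z0-9_]*|BF16|F16|F32)(?:[-.]|$)`)

// ggufVariantMax groups .gguf files by quant tag, sums each group (so split
// files such as -00001-of-00002 count together), and returns the largest
// group total. It falls back to totalFileSize when no tag is recognized.
func ggufVariantMax(siblings []Sibling, totalFileSize int64) int64 {
	perVariant := map[string]int64{}
	for _, s := range siblings {
		if !strings.HasSuffix(strings.ToLower(s.Name), ".gguf") {
			continue
		}
		stem := s.Name[:len(s.Name)-len(".gguf")]
		m := quantTag.FindStringSubmatch(stem)
		if m == nil {
			continue
		}
		perVariant[strings.ToUpper(m[1])] += s.Size
	}
	var best int64
	for _, total := range perVariant {
		if total > best {
			best = total
		}
	}
	if best == 0 {
		return totalFileSize
	}
	return best
}

func main() {
	files := []Sibling{
		{"model-Q4_K_M.gguf", 4_000_000_000},
		{"model-Q8_0-00001-of-00002.gguf", 5_000_000_000},
		{"model-Q8_0-00002-of-00002.gguf", 3_000_000_000},
		{"README.md", 1_000},
	}
	fmt.Println(ggufVariantMax(files, 12_000_000_000))
}
```

In this example the two Q8_0 parts add up to 8,000,000,000 bytes, which beats the single Q4_K_M file, so the gate would size the model by its heaviest quant rather than by the repository total.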

Dtype byte table is validated against a top-300 + top-1000 HF survey:

| dtype | bytes/param | Notes |
| --- | --- | --- |
| `F32`, `I32`, `U32` | 4 | I32 is always 4 (AWQ packed-int4-as-I32 caught by min cap; HQQ stores real fp32 metadata as I32). |
| `BF16`, `F16` | 2 | |
| `F8_E4M3`, `F8_E5M2`, `F8_E8M0` | 1 | |
| `I8`, `U8` | 1 | U8 is 1, not 0.5: BnB / MXFP4 pre-pack two int4 into a uint8 and report the byte count. |
| `I64`, `F64` | 8 | |
| `BOOL` | 1 | |

EstimateDetail{Format, Method, Strategy, UsedFallback} is logged on every
gate decision and persisted in the skip StatusDetail so operators can see
exactly why a number came out where it did.
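The safetensors_dtype_count strategy and the dtype byte table combine roughly as in the sketch below. The byte widths are copied from the table; the function name, its signature, and the choice to report an unknown dtype through a boolean (so a caller could fall back to another strategy) are assumptions.

```go
package main

import "fmt"

// bytesPerDtype mirrors the dtype byte table in this PR description.
var bytesPerDtype = map[string]int64{
	"F32": 4, "I32": 4, "U32": 4,
	"BF16": 2, "F16": 2,
	"F8_E4M3": 1, "F8_E5M2": 1, "F8_E8M0": 1,
	"I8": 1, "U8": 1,
	"I64": 8, "F64": 8,
	"BOOL": 1,
}

// safetensorsDtypeCount sums count × bytes-per-dtype over the parameter map
// and caps the result at usedStorage when that value is known (> 0).
// The second return value is false if any dtype is missing from the table.
func safetensorsDtypeCount(params map[string]int64, usedStorage int64) (int64, bool) {
	var total int64
	for dtype, count := range params {
		width, ok := bytesPerDtype[dtype]
		if !ok {
			return 0, false
		}
		total += count * width
	}
	// The cap absorbs over-counts such as AWQ's packed int4 reported as I32.
	if usedStorage > 0 && total > usedStorage {
		total = usedStorage
	}
	return total, true
}

func main() {
	// A plain BF16 model: 70e9 parameters at 2 bytes each.
	plain, _ := safetensorsDtypeCount(map[string]int64{"BF16": 70_000_000_000}, 0)
	fmt.Println(plain)

	// An AWQ-like layout where I32 entries over-count; usedStorage caps it.
	capped, _ := safetensorsDtypeCount(map[string]int64{"I32": 1000, "F16": 100}, 3000)
	fmt.Println(capped)
}
```

The first call yields 140000000000 bytes and the second is capped at 3000, which shows how the usedStorage bound handles packed formats without per-method arithmetic.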

Multi-node sharded models

A model annotated ome.io/multi-node-sharded: "true" bypasses the
VRAM gate entirely (Gopher logs the bypass and increments
model_agent_precheck_bypass_total{reason="multi_node_sharded"}). This is
the documented opt-out for models that are split across nodes at serve time
(e.g. a 405 B FP8 deployed 1/N across N nodes).

Observability

New Prometheus counters:

- model_agent_downloads_skipped_total{model_type, namespace, name, reason}: the gate refused the download.
- model_agent_precheck_bypass_total{model_type, namespace, name, reason}: the gate was intentionally bypassed (multi_node_sharded, size_unknown).

Per-node StatusDetail written to the model-agent ConfigMap on skip:

{
  "name": "llama-3-1-70b-instruct",
  "status": "Failed",
  "statusDetail": {
    "reason": "VRAMInsufficient",
    "message": "estimated weight bytes 141G (× safety 1.20 → required 169G) exceed available VRAM 96G on this node",
    "requiredBytes": 169000000000,
    "availableVRAMBytes": 96000000000,
    "estimatedWeightBytes": 141000000000,
    "safetyFactor": 1.2,
    "format": "safetensors",
    "method": "bf16",
    "strategy": "safetensors_dtype_count"
  }
}
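For operators or tooling that read this entry back out of the ConfigMap, the JSON above maps naturally onto Go structs. The sketch below is an assumption about shape only: the type names `ModelEntry` and `StatusDetail` appear in this PR, but the exact field names and tags in pkg/modelagent/model_data.go may differ, and `decodeEntry` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StatusDetail mirrors the keys of the JSON example above. The Go field
// names are assumptions; only the JSON keys come from the PR description.
type StatusDetail struct {
	Reason               string  `json:"reason"`
	Message              string  `json:"message"`
	RequiredBytes        int64   `json:"requiredBytes"`
	AvailableVRAMBytes   int64   `json:"availableVRAMBytes"`
	EstimatedWeightBytes int64   `json:"estimatedWeightBytes"`
	SafetyFactor         float64 `json:"safetyFactor"`
	Format               string  `json:"format"`
	Method               string  `json:"method"`
	Strategy             string  `json:"strategy"`
}

// ModelEntry is the per-model record; StatusDetail is optional.
type ModelEntry struct {
	Name         string        `json:"name"`
	Status       string        `json:"status"`
	StatusDetail *StatusDetail `json:"statusDetail,omitempty"`
}

// decodeEntry parses one ConfigMap value into a ModelEntry (hypothetical helper).
func decodeEntry(raw string) (ModelEntry, error) {
	var e ModelEntry
	err := json.Unmarshal([]byte(raw), &e)
	return e, err
}

func main() {
	raw := `{"name":"llama-3-1-70b-instruct","status":"Failed",
	  "statusDetail":{"reason":"VRAMInsufficient","requiredBytes":169000000000,
	  "availableVRAMBytes":96000000000,"safetyFactor":1.2,
	  "strategy":"safetensors_dtype_count"}}`
	e, err := decodeEntry(raw)
	if err != nil {
		panic(err)
	}
	d := e.StatusDetail
	fmt.Println(e.Name, d.Reason, d.RequiredBytes-d.AvailableVRAMBytes)
}
```

Keeping the numeric fields as raw byte counts, separate from the human-readable message, lets downstream tooling compute the shortfall (73000000000 bytes in the example) without parsing the message string.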

Top-level CR-status aggregation of these per-node reasons is out of scope
for this PR — it's a model-controller change tracked separately. This PR
only writes the detail.

Changes

- New pkg/modelagent/gpu_memory.go + test: env-var loader, AvailableNodeVRAMBytes, vramSafetyFactor.
- New pkg/modelagent/remote_size.go + test: RemoteSizeEstimator interface, HF classifier, per-backend implementations.
- Modified pkg/modelagent/scout.go: caches availableVRAMBytes, stamps AvailableVRAMBytes + MultiNodeSharded on every download/override task.
- Modified pkg/modelagent/gopher.go: GopherTask fields, vramPrecheck / applyVRAMGate / markModelOnNodeSkipped helpers; inline OCI gate after ListObjects; TRT-LLM shape filter extracted into a helper.
- Modified pkg/modelagent/model_data.go: MultiNodeShardedAnnotation, isMultiNodeSharded, StatusDetail type, ModelEntry.StatusDetail field.
- Modified pkg/modelagent/node_label_reconciler.go + configmap_reconciler.go: StatusDetail plumbed through NodeLabelOp → ConfigMapStatusOp → cache → ConfigMap writes.
- Modified pkg/modelagent/metrics.go: RecordSkippedDownload, RecordPrecheckBypass.
- Modified Helm + config/model-agent ConfigMap and DaemonSet manifests.

Fixes #

How to test

Test on an A100-40G node: large model llama-4-maverick-17b-128e-instruct-fp8 downloaded from OCI Object Storage:

Skipping download for ClusterBaseModel llama-4-maverick-17b-128e-instruct-fp8: required=500147599614 bytes > available VRAM=343597383680 bytes (estimated_weights=416789666345, safety=1.20, format=unknown, method=none, strategy=oci_list_sum)

Test on an A100-40G node: large model llama-4-maverick-17b-128e-instruct-fp8 downloaded from HF:

Skipping download for ClusterBaseModel llama-4-maverick-17b-128e-instruct-fp8: required=500103152025 bytes > available VRAM=343597383680 bytes (estimated_weights=416752626688, safety=1.20, format=safetensors, method=fp8, strategy=safetensors_dtype_count)

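The two estimated_weights values differ because each backend uses a different estimator: oci_list_sum sums the listed object sizes, while safetensors_dtype_count counts parameters per dtype.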

Test on an A100-40G node: large model llama-4-maverick-17b-128e-instruct-fp8 downloaded from OCI Object Storage with the "ome.io/multi-node-sharded" annotation:

VRAM precheck skipped for ClusterBaseModel llama-4-maverick-17b-128e-instruct-fp8: ome.io/multi-node-sharded annotation set; multi-node sharding active

Test on an A100-40G node: large model llama-4-maverick-17b-128e-instruct-fp8 downloaded from HF with the "ome.io/multi-node-sharded" annotation:

VRAM precheck skipped for ClusterBaseModel llama-4-maverick-17b-128e-instruct-fp8: ome.io/multi-node-sharded annotation set; multi-node sharding active

Checklist

Tests added/updated (if applicable)
Docs updated (if applicable)
make test passes locally

github-actions · 2026-06-15T21:03:02Z

⚠️ Pre-commit checks failed

Please run the following locally and commit the fixes:

pre-commit run --all-files
git add -u && git commit

See CONTRIBUTING.md for setup instructions.

github-actions · 2026-06-15T21:15:52Z

⚠️ Pre-commit checks failed

Please run the following locally and commit the fixes:

pre-commit run --all-files
git add -u && git commit

See CONTRIBUTING.md for setup instructions.

pallasathena92 requested review from CatherineSue, XinyueZhang369, beiguo218, slin1237 and truddy0 as code owners June 15, 2026 20:59

github-actions bot added the helm, model-agent, tests, config, and dependencies labels on Jun 15, 2026

[Feat] Skip model download when model size > node GPU VRAM

b4aad59

pallasathena92 force-pushed the yifeliu/early-gate-model-download-on-node branch from 97472f1 to b4aad59 Compare June 15, 2026 21:50

pallasathena92 wants to merge 1 commit into main from yifeliu/early-gate-model-download-on-node