[Feat] Skip model download when model size > node GPU VRAM #630
Open
pallasathena92 wants to merge 1 commit into
Please run the following locally and commit the fixes:
pre-commit run --all-files
git add -u && git commit
See CONTRIBUTING.md for setup instructions.
97472f1 to
b4aad59
What this PR does
The model-agent currently downloads any `BaseModel` / `ClusterBaseModel` that
satisfies its existing PVC / `NodeSelector` / `NodeAffinity` gates, even when
the model will never fit in the node's GPU VRAM. This wastes disk and
bandwidth and guarantees a pod-load failure later.
Why we need it
This PR adds a VRAM precheck in Gopher that runs between the storage
listing step and the bulk download. When the estimated weight bytes (with a
configurable safety factor) exceed the node's aggregate VRAM, the download is
skipped, the metric
`model_agent_downloads_skipped_total{reason="vram_insufficient"}` fires, and a
structured `StatusDetail` is persisted to the per-node ConfigMap so the skip is
observable end-to-end.
The gate is opt-out via the annotation
`ome.oracle.com/multi-node-sharded: "true"` for models that are intentionally
split across nodes at serve time.
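The comparison described above can be sketched roughly as follows. This is an illustration only, not the PR's code: `GateDecision` and `vramGate` are hypothetical names, and it is an assumption that the safety factor multiplies the estimate (rather than discounting VRAM) and that an unknown size or unknown VRAM fails open.

```go
package main

// GateDecision is a hypothetical result type used only in this sketch.
type GateDecision struct {
	Skip   bool   // true when the download should be skipped
	Reason string // label recorded alongside the decision
}

// vramGate compares the estimated weight bytes, scaled by a safety factor,
// against the node's aggregate VRAM.
// Assumption: non-positive inputs mean "no useful signal", so the gate fails
// open and lets the download proceed.
func vramGate(estimatedBytes, availableVRAMBytes int64, safetyFactor float64) GateDecision {
	if estimatedBytes <= 0 || availableVRAMBytes <= 0 {
		return GateDecision{Skip: false, Reason: "size_unknown"}
	}
	required := float64(estimatedBytes) * safetyFactor
	if required > float64(availableVRAMBytes) {
		return GateDecision{Skip: true, Reason: "vram_insufficient"}
	}
	return GateDecision{}
}
```

With a safety factor of 1.2, an estimate of 70 units against 80 units of VRAM would be skipped (84 > 80), while an estimate of 60 would pass (72 <= 80).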
How the size estimate works
Per-backend `RemoteSizeEstimator` interface, dispatched by storage type:
- `ListObjects` call, applying the same TensorRTLLM shape filter the bulk-download path uses (so per-shape subdirs don't over-count by N).
- `/api/models/{id}?blobs=true` call → two-stage classifier (format × method → strategy) → byte estimate. See below.
- `filepath.WalkDir` + sum `info.Size()`.
- `(0, nil)`: no useful size signal, fail open.

HF classifier (Format × Method → Strategy)
`Format` is one of `diffusion / safetensors / gguf / bin / unknown`, picked
from `library_name`, `safetensors`, `gguf`, and sibling extensions.
`Method` is one of
`awq / gptq / bitsandbytes / hqq / compressed-tensors / nvfp4 / mxfp4 / fp8 / bf16 / fp32 / …`,
picked from `config.quantization_config.{quant_method,format}`, tags/name
regex, GGUF filename quant tag, or dtype dominance.
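As a rough illustration of the first classifier stage, the sketch below picks a `Format` from the signals listed above. The `modelInfo` struct and `classifyFormat` function are hypothetical, and the precedence between signals is an assumption; the PR may order or combine them differently.

```go
package main

import "strings"

// modelInfo is a hypothetical, trimmed-down view of the model-info response.
type modelInfo struct {
	LibraryName    string   // e.g. "diffusers" or "transformers"
	HasSafetensors bool     // response carries a safetensors metadata block
	HasGGUF        bool     // response carries a gguf metadata block
	Siblings       []string // repository file names
}

// classifyFormat returns one of diffusion / safetensors / gguf / bin / unknown.
// Assumed precedence: library name, then metadata blocks, then file extensions.
func classifyFormat(m modelInfo) string {
	if m.LibraryName == "diffusers" {
		return "diffusion"
	}
	if m.HasGGUF {
		return "gguf"
	}
	if m.HasSafetensors {
		return "safetensors"
	}
	sawBin := false
	for _, name := range m.Siblings {
		lower := strings.ToLower(name)
		switch {
		case strings.HasSuffix(lower, ".gguf"):
			return "gguf"
		case strings.HasSuffix(lower, ".safetensors"):
			return "safetensors"
		case strings.HasSuffix(lower, ".bin"):
			sawBin = true // keep scanning: a stronger signal may follow
		}
	}
	if sawBin {
		return "bin"
	}
	return "unknown"
}
```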
Strategies:
- `safetensors_dtype_count`: `Σ count × bytes_per_dtype(d)` over the flat `safetensors.parameters` map, capped by `usedStorage` so AWQ's packed-int4-as-I32 over-count is absorbed without method-specific math.
- `diffusion_component_sum`: sum the best weight file per component subdir (`transformer/`, `unet/`, `vae/`, `text_encoder*/`, `image_encoder/`), preferring `.fp16.safetensors` so we estimate serve-time VRAM rather than disk.
- `gguf_variant_max`: group `.gguf` siblings by quant variant (Q4_K_M, Q8_0, …) and return the largest variant's total, so the gate doesn't under-block a user who picks the heaviest quant. Falls back to `gguf.total_file_size`.
- `siblings_byte_sum`: last-resort sum of weight-extension siblings.

Dtype byte table is validated against a top-300 + top-1000 HF survey:
- `F32`, `I32`, `U32`
- `BF16`, `F16`
- `F8_E4M3`, `F8_E5M2`, `F8_E8M0`
- `I8`, `U8`
- `I64`, `F64`
- `BOOL`

`EstimateDetail{Format, Method, Strategy, UsedFallback}` is logged on every
gate decision and persisted in the skip `StatusDetail` so operators can see
exactly why a number came out where it did.
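The `safetensors_dtype_count` strategy can be sketched as below. This is a minimal illustration, not the PR's implementation: the function name and signature are hypothetical, and the byte widths are the standard element sizes for each dtype, which are assumed (not confirmed) to match the PR's table. Returning `false` on an unrecognized dtype is also an assumption about how a fallback would be signalled.

```go
package main

// bytesPerDtype holds standard element widths in bytes.
// Assumption: these match the dtype byte table referenced in the PR.
var bytesPerDtype = map[string]int64{
	"F64": 8, "I64": 8,
	"F32": 4, "I32": 4, "U32": 4,
	"BF16": 2, "F16": 2,
	"F8_E4M3": 1, "F8_E5M2": 1, "F8_E8M0": 1,
	"I8": 1, "U8": 1, "BOOL": 1,
}

// safetensorsDtypeCount sums count × bytes_per_dtype over the flat parameters
// map and caps the result at usedStorage when that value is known (> 0).
// The bool is false when a dtype is not in the table, so a caller could fall
// back to another strategy.
func safetensorsDtypeCount(params map[string]int64, usedStorage int64) (int64, bool) {
	var total int64
	for dtype, count := range params {
		width, ok := bytesPerDtype[dtype]
		if !ok {
			return 0, false
		}
		total += count * width
	}
	if usedStorage > 0 && total > usedStorage {
		total = usedStorage // absorbs over-counts such as packed int4 stored as I32
	}
	return total, true
}
```

The cap matters for packed formats: parameters reported as `I32` are counted at 4 bytes each, which can exceed what the repository actually stores, so the stored size acts as an upper bound.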
Multi-node sharded models
A model annotated ome.oracle.com/multi-node-sharded: "true" bypasses the
VRAM gate entirely (Gopher logs the bypass and increments
model_agent_precheck_bypass_total{reason="multi_node_sharded"}). This is
the documented opt-out for models that are split across nodes at serve time
(e.g. a 405 B FP8 deployed 1/N across N nodes).
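A minimal sketch of the annotation check follows. The names `MultiNodeShardedAnnotation` and `isMultiNodeSharded` appear in the change list for `model_data.go`, but the body here is an assumption: it treats only the exact string `"true"` as opting out, and the real helper may parse the value differently.

```go
package main

// MultiNodeShardedAnnotation is the opt-out annotation key described above.
const MultiNodeShardedAnnotation = "ome.oracle.com/multi-node-sharded"

// isMultiNodeSharded reports whether a model's annotations opt it out of the
// VRAM gate. Assumption: only the literal value "true" counts.
// Indexing a nil map is safe in Go and yields the empty string.
func isMultiNodeSharded(annotations map[string]string) bool {
	return annotations[MultiNodeShardedAnnotation] == "true"
}
```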
Observability
New Prometheus counters:
model_agent_downloads_skipped_total{model_type, namespace, name, reason} — gate refused the download.
model_agent_precheck_bypass_total{model_type, namespace, name, reason} — gate intentionally skipped (multi_node_sharded, size_unknown).
Per-node StatusDetail written to the model-agent ConfigMap on skip:
Top-level CR-status aggregation of these per-node reasons is out of scope
for this PR — it's a model-controller change tracked separately. This PR
only writes the detail.
Changes
- `pkg/modelagent/gpu_memory.go` + test: env-var loader, `AvailableNodeVRAMBytes`, `vramSafetyFactor`.
- `pkg/modelagent/remote_size.go` + test: `RemoteSizeEstimator` interface, HF classifier, per-backend implementations.
- `pkg/modelagent/scout.go`: caches `availableVRAMBytes`, stamps `AvailableVRAMBytes` + `MultiNodeSharded` on every download/override task.
- `pkg/modelagent/gopher.go`: `GopherTask` fields, `vramPrecheck` / `applyVRAMGate` / `markModelOnNodeSkipped` helpers; inline OCI gate after `ListObjects`; TRT-LLM shape filter extracted into a helper.
- `pkg/modelagent/model_data.go`: `MultiNodeShardedAnnotation`, `isMultiNodeSharded`, `StatusDetail` type, `ModelEntry.StatusDetail` field.
- `pkg/modelagent/node_label_reconciler.go` + `configmap_reconciler.go`: `StatusDetail` plumbed through `NodeLabelOp` → `ConfigMapStatusOp` → cache → ConfigMap writes.
- `pkg/modelagent/metrics.go`: `RecordSkippedDownload`, `RecordPrecheckBypass`.
- Helm + `config/model-agent` ConfigMap and DaemonSet manifests.

Fixes #
How to test
Test on an A100-40G node with the large model llama-4-maverick-17b-128e-instruct-fp8, downloaded from OCI Object Storage:
Test on an A100-40G node with the large model llama-4-maverick-17b-128e-instruct-fp8, downloaded from HF:
The two runs report different estimated_weights because each backend uses a different estimate function (described above).
Test on an A100-40G node with the large model llama-4-maverick-17b-128e-instruct-fp8, downloaded from OCI Object Storage with the "ome.io/multi-node-sharded" annotation:
Test on an A100-40G node with the large model llama-4-maverick-17b-128e-instruct-fp8, downloaded from HF with the "ome.io/multi-node-sharded" annotation:
Checklist
`make test` passes locally