Helm charts for automated GPU node health validation on Kubernetes. The system automatically runs GPU diagnostics (DCGM, NVBandwidth) on idle nodes, reports results as Prometheus metrics and Kubernetes node conditions, and enables automated remediation of unhealthy nodes.
Tests only run on GPU nodes with no active customer workloads. Test pods use the scavenger PriorityClass (value -1000), so customer workloads always take scheduling priority and will preempt any running tests.
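The preemption behavior relies on a standard Kubernetes PriorityClass. A minimal sketch of what such a class looks like — the class name, description, and `preemptionPolicy` here are illustrative assumptions; only the value `-1000` comes from this document:

```yaml
# Illustrative sketch of a scavenger PriorityClass. The name and other
# fields are assumptions; only value: -1000 is documented above.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: scavenger          # hypothetical name
value: -1000               # below the default of 0, so customer pods outrank test pods
globalDefault: false       # only pods that set priorityClassName use this class
preemptionPolicy: Never    # test pods never evict other workloads to schedule (assumed)
description: "Lowest-priority class for GPU health test pods"
```

Because the value is negative, the scheduler evicts test pods whenever a higher-priority (customer) pod needs the node's resources.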
| Chart | Description | Version |
|---|---|---|
| `autoclusters-argo-workflows` | Argo Workflows controller + GPU test WorkflowTemplates + scavenger PriorityClass | 1.1.0 |
| `autoclusters-controller` | Controller that schedules and submits GPU health test workflows on idle nodes | 1.0.0 |
| `crusoe-npd` | Node Problem Detector DaemonSet with GPU health monitoring plugins | 1.0.0 |
Runs `dcgmi diag -r 2` on each idle GPU node, covering:
- GPU compute correctness
- HBM memory and ECC checks
- PCIe bandwidth measurement
- NVLink pairwise bandwidth
Results are reported per-GPU with subtest-level granularity.
Runs `nvbandwidth -t device_to_device_memcpy_read_ce` to measure NVLink bandwidth between every GPU pair on a node. Results are compared against per-SKU thresholds:
| SKU Pattern | Threshold (GB/s) |
|---|---|
| `*a100*sxm*` | 220 |
| `*h100*sxm*` | 316 |
| `*h200*sxm*` | 316 |
| `*gb200*nvl*` | 613 |
| Default (unknown SKU) | 50 |
Warning: B200 SKUs (`b200*`) are currently skipped — neither DCGM nor NVBandwidth tests will run on these nodes.
Note: PCIe-only GPUs (e.g., L40S) automatically skip the NVBandwidth test. These SKUs have no NVLink interconnect, so device-to-device bandwidth testing is not applicable — DCGM Level 2 already validates PCIe bandwidth. The `GPUNVBandwidthUnhealthy` node condition will still be present on these nodes with status `False` (healthy), but it reflects a skipped test rather than an actual NVBandwidth measurement.
Thresholds are configurable via the `nvlinkThresholds` Helm value.
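For example, thresholds can be overridden with a custom values file at install or upgrade time. The file below is hypothetical and its numbers are illustrative; note that Helm replaces list values wholesale, so an override must restate every pattern you want to keep:

```yaml
# custom-values.yaml (illustrative override; restate ALL patterns you need,
# because Helm replaces the nvlinkThresholds list rather than merging it)
nvlinkThresholds:
  - pattern: "*h100*sxm*"
    threshold: 330          # example of a stricter site-specific threshold
  - pattern: "*a100*sxm*"
    threshold: 220
defaultNvlinkThreshold: 50
```

Apply it with `helm upgrade autoclusters-argo-workflows autoclusters/autoclusters-argo-workflows -n crusoe-system -f custom-values.yaml --set argo-workflows.crds.install=false`.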
- Kubernetes cluster with NVIDIA GPU Operator installed
- Helm 3.x
- `kubectl` configured for your cluster
Note: If your cluster was created with `--add-ons=autocluster_active_testing`, these charts are already installed. The steps below are for manual installation only.
```shell
helm repo add autoclusters https://crusoecloud.github.io/autoclusters-controller-helm-charts
helm repo update
```

Argo Workflows CRDs must be installed before the chart because Helm validates WorkflowTemplate resources during install. The CRDs are too large for client-side apply (262KB annotation limit), so server-side apply is required.
```shell
helm pull autoclusters/autoclusters-argo-workflows --destination /tmp/
tar xzf /tmp/autoclusters-argo-workflows-*.tgz -C /tmp/ \
  autoclusters-argo-workflows/charts/argo-workflows/files/crds/minimal/
kubectl apply --server-side \
  -f /tmp/autoclusters-argo-workflows/charts/argo-workflows/files/crds/minimal/
rm -rf /tmp/autoclusters-argo-workflows /tmp/autoclusters-argo-workflows-*.tgz
```

```shell
helm install autoclusters-argo-workflows autoclusters/autoclusters-argo-workflows \
  -n crusoe-system --create-namespace \
  --set argo-workflows.crds.install=false
```

The `argo-workflows.crds.install=false` flag disables CRD rendering in the chart, since the CRDs were pre-installed in Step 2.
```shell
helm install crusoe-npd autoclusters/crusoe-npd \
  --version 1.0.0 \
  -n crusoe-system
```

```shell
helm install autoclusters-controller autoclusters/autoclusters-controller \
  --version 1.0.0 \
  -n crusoe-system
```

| Value | Default | Description |
|---|---|---|
| `replicaCount` | `2` | Number of controller replicas (leader election required for HA) |
| `image.repository` | `ghcr.io/crusoecloud/autoclusters-controller` | Controller container image |
| `image.tag` | `""` | Image tag (defaults to chart `appVersion`: 1.0.0) |
| `image.digest` | `""` | Image digest for immutable pinning (overrides tag when set) |
| `controller.schedule` | `"0 * * * *"` | Cron schedule for GPU health tests (default: hourly) |
| `controller.leaderElect` | `true` | Enable leader election (required for multi-replica) |
| `controller.metrics.enabled` | `true` | Enable Prometheus metrics endpoint |
| `controller.metrics.bindAddress` | `":8443"` | Metrics server bind address |
| `debugPort` | `8082` | Debug server port |
| `resources.requests.cpu` | `100m` | CPU request |
| `resources.requests.memory` | `128Mi` | Memory request |
| `resources.limits.cpu` | `500m` | CPU limit |
| `resources.limits.memory` | `512Mi` | Memory limit |
| `nodeSelector` | `{}` | Node selector for controller pods |
| `tolerations` | `[]` | Tolerations for controller pods |
| Value | Default | Description |
|---|---|---|
| `workflowTemplates.enabled` | `true` | Deploy built-in GPU test WorkflowTemplates |
| `images.dcgm.repository` | `ghcr.io/crusoecloud/crusoe-dcgm` | DCGM diagnostic image |
| `images.dcgm.tag` | `"1.0.0"` | DCGM image tag (statically set, not derived from chart `appVersion`) |
| `images.nvbandwidth.repository` | `ghcr.io/crusoecloud/nvbandwidth` | NVBandwidth test image |
| `images.nvbandwidth.tag` | `"1.0.0"` | NVBandwidth image tag (statically set, not derived from chart `appVersion`) |
| `nvlinkThresholds` | (see below) | Per-SKU NVLink bandwidth thresholds (GB/s) |
| `defaultNvlinkThreshold` | `50` | Fallback threshold for unrecognized SKUs (GB/s) |
| `skippedSKUPatterns` | `["b200*"]` | SKU glob patterns that skip all tests |
| `nvbandwidthSkippedSKUPatterns` | `["*l40s*"]` | SKU glob patterns that skip only the NVBandwidth test (PCIe-only GPUs) |
| `priorityClass.value` | `-1000` | Priority value for test pods (lower = more preemptible) |
NVLink thresholds default configuration:

```yaml
nvlinkThresholds:
  - pattern: "*a100*sxm*"
    threshold: 220
  - pattern: "*h100*sxm*"
    threshold: 316
  - pattern: "*h200*sxm*"
    threshold: 316
  - pattern: "*gb200*nvl*"
    threshold: 613
```

Patterns are matched case-insensitively against the node's `node.kubernetes.io/instance-type` label. PCIe variants (not matching `*sxm*`) use the `defaultNvlinkThreshold`.
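The lookup described above can be sketched in bash. This mirrors the documented behavior (lowercase the label, glob-match patterns in order, fall back to the default) but is not the controller's actual code, and the sample instance-type strings are made up:

```shell
# Sketch of per-SKU threshold lookup (illustrative, not the controller's code).
lookup_threshold() {
  # Lowercase the instance-type label so matching is case-insensitive.
  local sku
  sku=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$sku" in
    *a100*sxm*)  echo 220 ;;
    *h100*sxm*)  echo 316 ;;
    *h200*sxm*)  echo 316 ;;
    *gb200*nvl*) echo 613 ;;
    *)           echo 50 ;;   # defaultNvlinkThreshold for unrecognized SKUs
  esac
}

lookup_threshold "A100-80GB-SXM.8x"   # hypothetical label; matches *a100*sxm*
lookup_threshold "l40s.1x"            # PCIe SKU; falls back to the default
```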
| Value | Default | Description |
|---|---|---|
| `image.repository` | `ghcr.io/crusoecloud/crusoe-npd` | NPD container image |
| `image.tag` | `""` | Image tag (defaults to chart `appVersion`: 1.0.0) |
| `port` | `20258` | Metrics/health port (avoids conflict with standard NPD on 20256) |
| `workflowNamespace` | `crusoe-system` | Namespace where Argo Workflows run |
| `pluginDefaults.invokeInterval` | `"60s"` | How often NPD checks workflow results |
| `tolerations` | `[{operator: Exists}]` | Runs on all nodes by default |
| `nodeSelector` | `{}` | Optional node selector to restrict NPD placement |
| `monitors` | (2 entries) | GPU health monitor definitions (see below) |
Default monitors:

| Monitor | Condition Type | Healthy Reason | Unhealthy Reason |
|---|---|---|---|
| `gpu-dcgm-diag` | `GPUDCGMUnhealthy` | `GPUDCGMTestPassed` | `GPUDCGMTestFailed` |
| `nvbandwidth` | `GPUNVBandwidthUnhealthy` | `NVBandwidthTestPassed` | `NVBandwidthTestFailed` |
The controller exposes metrics on port 8443. Node-level metrics are available on all replicas; test metrics are only emitted by the leader.
Node metrics:

| Metric | Type | Description |
|---|---|---|
| `autoclusters_node_gpu_total` | Gauge | Total number of GPU nodes in the cluster |
| `autoclusters_node_idle_gpu_count` | Gauge | Number of GPU nodes currently idle |
Test metrics (labels: `node`, `node_pool_id`, `instance_id`, `instance_type`, `test_type`, `test_name`):

| Metric | Type | Description |
|---|---|---|
| `autoclusters_test_runs_total` | Counter | Total test runs (additional `status` label: `success`/`failure`) |
| `autoclusters_test_duration_seconds` | Histogram | Test run duration in seconds |
| `autoclusters_test_result` | Gauge | Latest test result (1=pass, 0=fail) |
| `autoclusters_test_failure_alert` | Gauge | Set to 1 on failure, 0 on pass (for alerting) |
| `autoclusters_test_node_health_status` | Gauge | Overall node health (1=healthy, 0=unhealthy) |
| `autoclusters_test_last_test_timestamp_seconds` | Gauge | Unix timestamp of the last test run |
| `autoclusters_test_created_total` | Counter | Total number of tests created |
| `autoclusters_test_running_current` | Gauge | Number of currently running tests |
| `autoclusters_test_completed_total` | Counter | Completed tests (additional `result` label) |
| `autoclusters_test_failures_total` | Counter | Total test failure count |
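The `autoclusters_test_failure_alert` gauge is intended for alerting. A sketch of a Prometheus alerting rule built on it — the group name, alert name, `for` duration, and severity label are choices of this example, not something shipped by the charts:

```yaml
# Illustrative Prometheus alerting rule; not included in the charts.
groups:
  - name: autoclusters-gpu-health   # hypothetical group name
    rules:
      - alert: GPUNodeHealthTestFailed
        expr: autoclusters_test_failure_alert == 1
        for: 5m                     # require the failure to persist before firing
        labels:
          severity: warning
        annotations:
          summary: "GPU health test {{ $labels.test_name }} failed on node {{ $labels.node }}"
```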
NVBandwidth metrics (labels: `node`, `node_pool_id`, `instance_id`, `instance_type`):

| Metric | Type | Description |
|---|---|---|
| `autoclusters_test_nvbandwidth_min_gbps` | Gauge | Minimum NVLink bandwidth across all GPU pairs (GB/s) |
| `autoclusters_test_nvbandwidth_threshold_gbps` | Gauge | Threshold used for pass/fail comparison (GB/s) |
| `autoclusters_test_nvbandwidth_failed_pairs` | Gauge | Number of GPU pairs below threshold |
DCGM metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `autoclusters_test_dcgm_failed_gpu_count` | Gauge | `node`, `node_pool_id`, `instance_id`, `instance_type` | Number of GPUs that failed diagnostics |
| `autoclusters_test_dcgm_gpu_status` | Gauge | + `gpu_id` | Per-GPU result (1=pass, 0=fail) |
| `autoclusters_test_dcgm_subtest_status` | Gauge | + `gpu_id`, `subtest` | Per-GPU per-subtest result |
| `autoclusters_test_dcgm_pcie_bandwidth_gbps` | Gauge | + `gpu_id`, `direction` | Per-GPU PCIe bandwidth (`gpu_to_host`, `host_to_gpu`, `bidirectional`) |
NPD translates test results into Kubernetes node conditions:

| Condition | Status: True | Status: False |
|---|---|---|
| `GPUDCGMUnhealthy` | DCGM diagnostics failed | DCGM diagnostics passed |
| `GPUNVBandwidthUnhealthy` | NVBandwidth test failed | NVBandwidth test passed (or skipped on PCIe-only SKUs) |
Check node conditions:

```shell
kubectl get nodes -o custom-columns=\
'NODE:.metadata.name,'\
'DCGM:.status.conditions[?(@.type=="GPUDCGMUnhealthy")].status,'\
'NVBW:.status.conditions[?(@.type=="GPUNVBandwidthUnhealthy")].status'
```

```shell
helm repo update
helm upgrade autoclusters-argo-workflows autoclusters/autoclusters-argo-workflows \
  -n crusoe-system --set argo-workflows.crds.install=false
helm upgrade crusoe-npd autoclusters/crusoe-npd \
  -n crusoe-system
helm upgrade autoclusters-controller autoclusters/autoclusters-controller \
  -n crusoe-system
```

Upgrade in the same order as installation: Argo Workflows first, then NPD, then the controller.
Remove charts in reverse order:

```shell
helm uninstall autoclusters-controller -n crusoe-system
helm uninstall crusoe-npd -n crusoe-system
helm uninstall autoclusters-argo-workflows -n crusoe-system
```

Argo Workflows CRDs are not removed by `helm uninstall` (they are configured with `keep: true`). To remove them manually:

```shell
kubectl delete crd workflows.argoproj.io workflowtemplates.argoproj.io \
  cronworkflows.argoproj.io clusterworkflowtemplates.argoproj.io \
  workflowtasksets.argoproj.io workflowtaskresults.argoproj.io \
  workflowartifactgctasks.argoproj.io workfloweventbindings.argoproj.io
```

Container images are published to GitHub Container Registry:
| Image | Description |
|---|---|
| `ghcr.io/crusoecloud/autoclusters-controller` | Controller binary |
| `ghcr.io/crusoecloud/crusoe-npd` | NPD with GPU health check plugin |
| `ghcr.io/crusoecloud/crusoe-dcgm` | DCGM diagnostic image |
| `ghcr.io/crusoecloud/nvbandwidth` | NVBandwidth test image |
This project is licensed under the MIT License - see the LICENSE file for details.