improvement(helm): update GPU device plugin and add cert-manager issuers #3036

waleedlatif1 · 2026-01-28T01:43:54Z

Summary

Update NVIDIA device plugin to v0.18.2 with ConfigMap-based configuration (best practice)
Add support for MIG and time-slicing GPU sharing strategies
Add cert-manager issuer resources with proper CA bootstrap pattern
Remove deprecated hostNetwork/hostPID settings from GPU plugin

Type of Change

Improvement (enhancement to existing feature)

Testing

Tested with helm lint and helm template - all templates render correctly

Checklist

Code follows project style guidelines
Self-reviewed my changes
Tests added/updated and passing
No new warnings introduced
I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

vercel · 2026-01-28T01:43:59Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Review	Updated (UTC)
docs	Ready	Preview, Comment	Jan 28, 2026 2:14am

greptile-apps · 2026-01-28T01:47:08Z

Greptile Overview

Greptile Summary

This PR improves GPU support and certificate management infrastructure for the Sim Helm chart.

Key Changes:

Updated NVIDIA GPU device plugin from v0.14.5 to v0.18.2 with ConfigMap-based configuration (best practice)
Added support for both MIG (Multi-Instance GPU) and time-slicing GPU sharing strategies
Removed deprecated hostNetwork, hostPID, and runtimeClassName settings from GPU plugin
Added cert-manager issuer resources implementing the recommended CA bootstrap pattern
Removed orphaned RuntimeClass resource that was no longer referenced

Improvements:

ConfigMap-based GPU plugin configuration is more maintainable and follows current NVIDIA best practices
GPU sharing strategies enable better GPU resource utilization
cert-manager bootstrap pattern provides proper CA hierarchy for internal TLS certificates
Cleaner GPU plugin configuration with more headroom in its memory settings (20Mi-50Mi, previously 10Mi-20Mi)
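
To make the sharing strategies listed above more concrete, here is a minimal sketch of a device plugin config file with time-slicing enabled. It follows the config file format described in the NVIDIA k8s-device-plugin README. The replica count is an illustrative assumption and is not taken from this chart's values.

```yaml
# Sketch of a device plugin config.yaml. Values are illustrative assumptions,
# not the defaults shipped by this chart.
version: v1
flags:
  # "none" leaves MIG partitioning unused; "single" and "mixed" are the MIG modes.
  migStrategy: none
sharing:
  timeSlicing:
    resources:
      # Advertise each physical GPU as 4 schedulable nvidia.com/gpu replicas.
      - name: nvidia.com/gpu
        replicas: 4
```

Time-slicing oversubscribes a GPU without memory or fault isolation between workloads, while MIG partitions supported GPUs in hardware, so the two strategies suit different workloads.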

Confidence Score: 4.5/5

This PR is safe to merge with minimal risk - all changes follow best practices
The changes follow vendor best practices (NVIDIA and cert-manager documentation), previous review comments have been addressed (duplicate nodeSelector removed, default values aligned), and the PR has been tested with helm lint and template. Minor confidence reduction only due to runtime dependencies (cert-manager must be pre-installed) which are well-documented.
No files require special attention - all changes are well-structured with clear documentation

Important Files Changed

Filename	Overview
helm/sim/templates/cert-manager-issuers.yaml	Added cert-manager issuer resources with proper CA bootstrap pattern
helm/sim/templates/gpu-device-plugin.yaml	Updated GPU plugin to v0.18.2 with ConfigMap-based config and removed deprecated settings
helm/sim/values.yaml	Added GPU sharing strategies and cert-manager configuration with clear documentation

Sequence Diagram

sequenceDiagram
    participant Helm as Helm Install
    participant K8s as Kubernetes API
    participant CM as cert-manager
    participant GPU as GPU Device Plugin
    participant Node as GPU Node

    Note over Helm,Node: cert-manager Issuer Bootstrap (if certManager.enabled=true)
    Helm->>K8s: Create SelfSigned ClusterIssuer
    K8s->>CM: Register bootstrap issuer
    Helm->>K8s: Create Root CA Certificate
    K8s->>CM: Request certificate from bootstrap issuer
    CM->>CM: Generate self-signed root CA
    CM->>K8s: Store CA cert in secret (cert-manager namespace)
    Helm->>K8s: Create CA ClusterIssuer
    K8s->>CM: Register CA issuer (references root CA secret)
    Note over CM: CA issuer auto-reconciles when secret ready

    Note over Helm,Node: GPU Device Plugin Setup (if ollama.gpu.enabled=true)
    Helm->>K8s: Create ConfigMap with GPU strategy config
    Note over K8s: Config includes MIG or time-slicing settings
    Helm->>K8s: Deploy DaemonSet (v0.18.2)
    K8s->>Node: Schedule pod on nodes with accelerator=nvidia
    Node->>GPU: Mount ConfigMap at /etc/device-plugin/
    GPU->>GPU: Parse config.yaml (MIG or time-slicing)
    GPU->>Node: Register GPU resources with kubelet
    Note over GPU,Node: GPUs now available as nvidia.com/gpu
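
The issuer bootstrap half of the diagram corresponds to the SelfSigned-to-CA pattern from the cert-manager documentation. The manifests below are a minimal sketch of that pattern only. All resource names (`selfsigned-bootstrap`, `root-ca`, `root-ca-secret`, `ca-issuer`) are hypothetical and may differ from what `cert-manager-issuers.yaml` actually renders.

```yaml
# Step 1: a self-signed issuer, used only to mint the root CA.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-bootstrap   # hypothetical name
spec:
  selfSigned: {}
---
# Step 2: the root CA certificate, signed by the bootstrap issuer.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: root-ca                # hypothetical name
  namespace: cert-manager      # a ClusterIssuer reads its CA secret from cert-manager's namespace
spec:
  isCA: true
  commonName: root-ca
  secretName: root-ca-secret   # hypothetical name
  issuerRef:
    name: selfsigned-bootstrap
    kind: ClusterIssuer
    group: cert-manager.io
---
# Step 3: the CA issuer that signs workload certificates from the root CA secret.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ca-issuer              # hypothetical name
spec:
  ca:
    secretName: root-ca-secret
```

Applying all three resources at once is fine in this pattern: the CA issuer stays not-ready until the secret exists and then reconciles on its own, which matches the note in the diagram.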

greptile-apps

2 files reviewed, 2 comments

helm/sim/templates/gpu-device-plugin.yaml

helm/sim/templates/cert-manager-issuers.yaml

helm/sim/templates/gpu-device-plugin.yaml

waleedlatif1 · 2026-01-28T02:05:23Z

@greptile

greptile-apps

2 files reviewed, 2 comments

helm/sim/templates/gpu-device-plugin.yaml

helm/sim/templates/cert-manager-issuers.yaml

waleedlatif1 · 2026-01-28T02:16:18Z

@greptile

greptile-apps

No files reviewed, no comments

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-01-28T02:28:16Z

helm/sim/templates/gpu-device-plugin.yaml

+      failOnInitError: false
+    plugin:
+      passDeviceSpecs: true
+      deviceListStrategy: envvar


Invalid config structure for NVIDIA device plugin settings

Medium Severity

The ConfigMap places passDeviceSpecs and deviceListStrategy under a plugin: section that sits beside flags:, but the NVIDIA k8s-device-plugin config schema expects these settings inside the flags: section (nested as flags.plugin). The original code passed them as CLI arguments (--pass-device-specs=true, --device-list-strategy=envvar), which map to the flags: section in the config file format. With the current structure, the device plugin may ignore these settings and fall back to its defaults, potentially causing GPU device passthrough and enumeration issues.
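
For reference, a minimal sketch of the structure this comment is asking for, assuming the config file layout shown in the NVIDIA k8s-device-plugin README, where plugin-specific options are nested under `flags.plugin`. The values mirror the diff above. Whether the merged chart uses exactly this shape is not confirmed here.

```yaml
# Sketch only: assumes the config layout from the NVIDIA k8s-device-plugin README.
version: v1
flags:
  failOnInitError: false
  plugin:
    # Config-file equivalent of --pass-device-specs=true
    passDeviceSpecs: true
    # Config-file equivalent of --device-list-strategy=envvar
    deviceListStrategy: envvar
```

The only change relative to the diff is indentation: `plugin:` moves one level deeper so that it becomes a child of `flags:` instead of a sibling.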

improvement(helm): update GPU device plugin and add cert-manager issuers

3c6f4e8

vercel bot deployed to Preview January 28, 2026 01:45

greptile-apps bot reviewed Jan 28, 2026

helm/sim/templates/gpu-device-plugin.yaml (outdated, resolved)

helm/sim/templates/cert-manager-issuers.yaml (outdated, resolved)

cursor bot reviewed Jan 28, 2026

helm/sim/templates/gpu-device-plugin.yaml (outdated, resolved)

helm/sim/templates/gpu-device-plugin.yaml (resolved)

fix(helm): address code review feedback for GPU plugin and cert-manager

549edba

vercel bot deployed to Preview January 28, 2026 02:06

greptile-apps bot reviewed Jan 28, 2026

helm/sim/templates/gpu-device-plugin.yaml (outdated, resolved)

helm/sim/templates/cert-manager-issuers.yaml (resolved)

fix(helm): remove duplicate nodeSelector, add hook for CA issuer ordering

43a8785

vercel bot deployed to Preview January 28, 2026 02:12

fix(helm): remove incorrect hook, CA issuer auto-reconciles

be8abee

vercel bot deployed to Preview January 28, 2026 02:14

greptile-apps bot reviewed Jan 28, 2026

waleedlatif1 mentioned this pull request Jan 28, 2026

v0.5.74: autolayout improvements, clerk integration, auth enforcements #3034

Merged

waleedlatif1 merged commit b4a389a into staging Jan 28, 2026
12 checks passed

waleedlatif1 deleted the improvement/helm branch January 28, 2026 02:25

cursor bot reviewed Jan 28, 2026
