Skip to content

Conversation

@waleedlatif1
Copy link
Collaborator

Summary

  • Update NVIDIA device plugin to v0.18.2 with ConfigMap-based configuration (best practice)
  • Add support for MIG and time-slicing GPU sharing strategies
  • Add cert-manager issuer resources with proper CA bootstrap pattern
  • Remove deprecated hostNetwork/hostPID settings from GPU plugin

Type of Change

  • Improvement (enhancement to existing feature)

Testing

Tested with helm lint and helm template - all templates render correctly

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link

vercel bot commented Jan 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
docs Ready Ready Preview, Comment Jan 28, 2026 2:14am

Request Review

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 28, 2026

Greptile Overview

Greptile Summary

This PR improves GPU support and certificate management infrastructure for the Sim Helm chart.

Key Changes:

  • Updated NVIDIA GPU device plugin from v0.14.5 to v0.18.2 with ConfigMap-based configuration (best practice)
  • Added support for both MIG (Multi-Instance GPU) and time-slicing GPU sharing strategies
  • Removed deprecated hostNetwork, hostPID, and runtimeClassName settings from GPU plugin
  • Added cert-manager issuer resources implementing the recommended CA bootstrap pattern
  • Removed orphaned RuntimeClass resource that was no longer referenced

Improvements:

  • ConfigMap-based GPU plugin configuration is more maintainable and follows current NVIDIA best practices
  • GPU sharing strategies enable better GPU resource utilization
  • cert-manager bootstrap pattern provides proper CA hierarchy for internal TLS certificates
  • Cleaner GPU plugin configuration with better resource limits (20Mi-50Mi memory vs 10Mi-20Mi)

Confidence Score: 4.5/5

  • This PR is safe to merge with minimal risk - all changes follow best practices
  • The changes follow vendor best practices (NVIDIA and cert-manager documentation), previous review comments have been addressed (duplicate nodeSelector removed, default values aligned), and the PR has been tested with helm lint and template. Minor confidence reduction only due to runtime dependencies (cert-manager must be pre-installed) which are well-documented.
  • No files require special attention - all changes are well-structured with clear documentation

Important Files Changed

Filename Overview
helm/sim/templates/cert-manager-issuers.yaml Added cert-manager issuer resources with proper CA bootstrap pattern
helm/sim/templates/gpu-device-plugin.yaml Updated GPU plugin to v0.18.2 with ConfigMap-based config and removed deprecated settings
helm/sim/values.yaml Added GPU sharing strategies and cert-manager configuration with clear documentation

Sequence Diagram

sequenceDiagram
    participant Helm as Helm Install
    participant K8s as Kubernetes API
    participant CM as cert-manager
    participant GPU as GPU Device Plugin
    participant Node as GPU Node

    Note over Helm,Node: cert-manager Issuer Bootstrap (if certManager.enabled=true)
    Helm->>K8s: Create SelfSigned ClusterIssuer
    K8s->>CM: Register bootstrap issuer
    Helm->>K8s: Create Root CA Certificate
    K8s->>CM: Request certificate from bootstrap issuer
    CM->>CM: Generate self-signed root CA
    CM->>K8s: Store CA cert in secret (cert-manager namespace)
    Helm->>K8s: Create CA ClusterIssuer
    K8s->>CM: Register CA issuer (references root CA secret)
    Note over CM: CA issuer auto-reconciles when secret ready

    Note over Helm,Node: GPU Device Plugin Setup (if ollama.gpu.enabled=true)
    Helm->>K8s: Create ConfigMap with GPU strategy config
    Note over K8s: Config includes MIG or time-slicing settings
    Helm->>K8s: Deploy DaemonSet (v0.18.2)
    K8s->>Node: Schedule pod on nodes with accelerator=nvidia
    Node->>GPU: Mount ConfigMap at /etc/device-plugin/
    GPU->>GPU: Parse config.yaml (MIG or time-slicing)
    GPU->>Node: Register GPU resources with kubelet
    Note over GPU,Node: GPUs now available as nvidia.com/gpu
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@waleedlatif1
Copy link
Collaborator Author

@greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@waleedlatif1
Copy link
Collaborator Author

@greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@waleedlatif1 waleedlatif1 merged commit b4a389a into staging Jan 28, 2026
12 checks passed
@waleedlatif1 waleedlatif1 deleted the improvement/helm branch January 28, 2026 02:25
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

failOnInitError: false
plugin:
passDeviceSpecs: true
deviceListStrategy: envvar
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid config structure for NVIDIA device plugin settings

Medium Severity

The ConfigMap places passDeviceSpecs and deviceListStrategy under a plugin: section, but the NVIDIA k8s-device-plugin config schema expects these settings under the flags: section. The original code passed these as CLI arguments (--pass-device-specs=true, --device-list-strategy=envvar), which map to the flags: section in config file format. With the current structure, the device plugin may ignore these settings and use default values instead, potentially causing GPU device passthrough and enumeration issues.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants