feat(janitor): implement ExternalRemediationRequest reconciler#1392
feat(janitor): implement ExternalRemediationRequest reconciler#1392jtschelling wants to merge 3 commits into
Conversation
|
🌿 Fern Docs Preview: https://nvidia-preview-pull-request-1392.docs.buildwithfern.com/nvsentinel |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an ExternalRemediationRequest controller, condition conversion helpers and tests, Prometheus metrics for ERRs (lifecycle/open/age), a six-branch reconciliation state machine (apply, cleanup, no-op, deletion), comprehensive controller tests, and startup wiring to register the controller and metrics. ChangesExternalRemediationRequest Controller Implementation
Sequence Diagram(s)sequenceDiagram
participant Reconciler
participant APIServer
participant Node
participant Metrics
Reconciler->>APIServer: GET ExternalRemediationRequest
Reconciler->>APIServer: GET Node (spec.healthEvent.nodeName)
Reconciler->>APIServer: PATCH Node (apply/remove taint & managed label)
Reconciler->>APIServer: PATCH ExternalRemediationRequest/status (conditions)
Reconciler->>Metrics: Inc/Adjust/Observe ExtRR metrics
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (1)
janitor/pkg/controller/externalremediationrequest_controller.go (1)
117-141: 💤 Low valueConsider extracting "unknown" as a package constant.
The string literal
"unknown"appears multiple times in this file (lines 123, 127, 130). Extract it as a package-level constant to improve maintainability and reduce the risk of typos.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@janitor/pkg/controller/externalremediationrequest_controller.go` around lines 117 - 141, Extract the repeated literal "unknown" into a package-level constant (e.g., const unknownLabel = "unknown") and replace all occurrences inside recommendedActionLabel and errNodeLabel (and any other uses in this file) with that constant; update the functions recommendedActionLabel(extrrObj *nvsentinelv1.ExternalRemediationRequest) and errNodeLabel(extrrObj *nvsentinelv1.ExternalRemediationRequest) to return unknownLabel instead of the raw string to centralize the value and avoid typos.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@janitor/pkg/condition/condition.go`:
- Around line 26-31: The import block in condition.go is misformatted; run gofmt
-s -w or goimports -w to reorder and format the imports so they conform to
gofmt/goimports rules (group standard library, external, then internal packages)
and remove/adjust the blank line between metav1, timestamppb and protos; check
the imports that reference metav1, timestamppb and protos remain unchanged after
formatting.
In `@janitor/pkg/controller/externalremediationrequest_controller.go`:
- Around line 262-302: In setInitialConditions, add a blank line immediately
above the final return statement so there is whitespace separating the call to
r.patchStatusConditions(...) and the subsequent "return patched, err" line; this
improves readability around the end of the function where conditions are patched
and returned (references: setInitialConditions, r.patchStatusConditions,
patched, err).
- Around line 202-222: The function needsInitialization in
ExternalRemediationRequestReconciler has a line that exceeds the 120-character
limit; locate the long conditional check (the meta.FindStatusCondition call that
checks ConditionNVSentinelOwnershipReleased on the conds slice inside
needsInitialization) and break it into two shorter lines (e.g., assign
meta.FindStatusCondition(conds, ConditionNVSentinelOwnershipReleased) to a local
variable or wrap the if condition across lines) so no source line exceeds 120
characters while preserving the same logic and returning true when the condition
is nil.
- Around line 167-200: In Reconcile (method
ExternalRemediationRequestReconciler.Reconcile) insert a blank line immediately
above the defer span.End() to separate the
tracing.StartSpanWithLinkFromTraceContext(...) call/assignment from the defer;
locate the tracing.StartSpanWithLinkFromTraceContext invocation that returns
ctx, span and the subsequent defer span.End() and add one empty line for
improved readability and to satisfy the static analysis style rule.
- Around line 360-441: reconcileApply is over the cyclomatic complexity
threshold; extract three helpers to simplify control flow: (1) move node
retrieval and missing-node requeue logic out of reconcileApply into a helper
like getNodeForApply(ctx, extrrObj) that returns (*corev1.Node, ctrl.Result,
error) to encapsulate the IsNotFound requeue behavior; (2) extract the taint
ownership/drift checks into checkTaintOwnership(node *corev1.Node, extrrObj
*nvsentinelv1.ExternalRemediationRequest) which returns an enum/struct or
(status, message) indicating "owned-by-us", "label-missing", "owned-by-other" so
reconcileApply can early-return to transitionToReleaseSuccess or
transitionToReleaseFailure without nested branches; and (3) pull the Patch error
handling into handlePatchError(err, nodeName, extrrObj) which returns
(ctrl.Result, error) and encapsulates the apierrors.IsForbidden branch and the
generic fmt.Errorf branch; update reconcileApply to call these helpers to reduce
branching and meet the complexity threshold.
- Around line 586-659: reconcileCleanup is too complex—extract two helpers: (1)
a node fetch helper on ExternalRemediationRequestReconciler (e.g.
fetchNodeForCleanup(ctx, nodeName) returning (*corev1.Node, bool, error)) that
wraps r.Get and implements the IsNotFound short-circuit used today, and (2) a
taint decision helper (e.g. shouldRemoveReleaseTaint(taints []corev1.Taint,
extrrName string) returning (newTaints []corev1.Taint, removed bool, ownerDrift
bool)) that uses findTaintByKey, ReleaseTaintKey and removeTaintByKey to decide
whether to remove the taint (match on extrrObj.Name) or leave it and signal
drift; then simplify reconcileCleanup to call these two helpers, adjust changed
logic and keep the existing Patch and logging paths intact.
- Around line 808-827: The reconciler currently uses the deprecated
mgr.GetEventRecorderFor and client-go record.EventRecorder; update
ExternalRemediationRequestReconciler.SetupWithManager to call
mgr.GetEventRecorder(...) and change the reconciler's Recorder field type and
import from k8s.io/client-go/tools/record.EventRecorder to
k8s.io/client-go/tools/events.EventRecorder (update imports and any uses of
Recorder accordingly), update tests to replace record.NewFakeRecorder usages
with the appropriate events-based fake/constructor, and add RBAC markers for
events.k8s.io if controller-runtime requires them.
In `@janitor/pkg/metrics/err_metrics.go`:
- Around line 48-58: The comment above the ExtRRTotal declaration has a line
exceeding the 120-char limit; edit the block comment describing phases (the long
sentence that starts with "phase=created ..." and continues through
"operator_deleted (kubectl delete err)") to wrap or split into multiple shorter
lines (or shorten wording) so no single comment line exceeds 120 characters
while preserving the same phase/result descriptions and labels for ExtRRTotal.
---
Nitpick comments:
In `@janitor/pkg/controller/externalremediationrequest_controller.go`:
- Around line 117-141: Extract the repeated literal "unknown" into a
package-level constant (e.g., const unknownLabel = "unknown") and replace all
occurrences inside recommendedActionLabel and errNodeLabel (and any other uses
in this file) with that constant; update the functions
recommendedActionLabel(extrrObj *nvsentinelv1.ExternalRemediationRequest) and
errNodeLabel(extrrObj *nvsentinelv1.ExternalRemediationRequest) to return
unknownLabel instead of the raw string to centralize the value and avoid typos.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f37c4f65-f22b-4fb2-a71c-b6ce4b3db6d8
📒 Files selected for processing (8)
janitor/main.gojanitor/pkg/condition/condition.gojanitor/pkg/condition/condition_test.gojanitor/pkg/controller/externalremediationrequest_controller.gojanitor/pkg/controller/externalremediationrequest_controller_test.gojanitor/pkg/controller/suite_test.gojanitor/pkg/metrics/err_metrics.gojanitor/pkg/metrics/metrics.go
0ea43b0 to
94158b1
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
janitor/pkg/controller/externalremediationrequest_controller_test.go (1)
1064-1084: 💤 Low valueConsider removing or documenting unused helper.
snapshotConditionsis defined but not called anywhere in this test file. If it's intended for debugging or future use, consider adding a comment indicating that purpose, or remove it to avoid dead code.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@janitor/pkg/controller/externalremediationrequest_controller_test.go` around lines 1064 - 1084, The helper function snapshotConditions (func snapshotConditions(extrrObj *nvsentinelv1.ExternalRemediationRequest) string) is currently unused; either delete it to remove dead code or keep it but add a short comment explaining its purpose (e.g., "helper for debugging/test snapshots of ExternalRemediationRequest conditions") and/or call it from a relevant test so it is exercised; update or add a linter suppression only if you intentionally keep it unused.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@janitor/pkg/controller/externalremediationrequest_controller_test.go`:
- Around line 1064-1084: The helper function snapshotConditions (func
snapshotConditions(extrrObj *nvsentinelv1.ExternalRemediationRequest) string) is
currently unused; either delete it to remove dead code or keep it but add a
short comment explaining its purpose (e.g., "helper for debugging/test snapshots
of ExternalRemediationRequest conditions") and/or call it from a relevant test
so it is exercised; update or add a linter suppression only if you intentionally
keep it unused.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: dfb20a53-f3ff-40ea-8f2b-71953cbd0142
📒 Files selected for processing (8)
janitor/main.gojanitor/pkg/condition/condition.gojanitor/pkg/condition/condition_test.gojanitor/pkg/controller/externalremediationrequest_controller.gojanitor/pkg/controller/externalremediationrequest_controller_test.gojanitor/pkg/controller/suite_test.gojanitor/pkg/metrics/err_metrics.gojanitor/pkg/metrics/metrics.go
🚧 Files skipped from review as they are similar to previous changes (6)
- janitor/main.go
- janitor/pkg/controller/suite_test.go
- janitor/pkg/metrics/err_metrics.go
- janitor/pkg/condition/condition.go
- janitor/pkg/condition/condition_test.go
- janitor/pkg/controller/externalremediationrequest_controller.go
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
|
@coderabbitai review |
✅ Action performedReview finished.
|
Merging this branch changes the coverage (2 decrease, 1 increase)
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
janitor/pkg/controller/externalremediationrequest_controller.go (1)
15-15: 💤 Low valueAdd package-level godoc for the controller package.
Per coding guidelines, all Go packages should include package-level godoc. Consider adding a comment above the
package controllerline documenting the package's purpose.📝 Suggested addition
+// Package controller implements Kubernetes reconcilers for NVSentinel custom +// resources including RebootNode, TerminateNode, GPUReset, and +// ExternalRemediationRequest controllers. package controller🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@janitor/pkg/controller/externalremediationrequest_controller.go` at line 15, Add a package-level godoc comment immediately above the "package controller" declaration describing the purpose and responsibilities of the controller package (e.g., that it implements reconciliation logic for ExternalRemediationRequest resources and related helpers), mention the main abstractions it contains (controllers, reconcilers, and helpers for ExternalRemediationRequest), and keep it concise and formatted as a proper Go comment block to satisfy godoc requirements.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@janitor/pkg/controller/externalremediationrequest_controller.go`:
- Around line 148-152: Rename the local variable currently named err (type
nvsentinelv1.ExternalRemediationRequest) to extrrObj in the Reconcile method of
ExternalRemediationRequestReconciler to avoid shadowing the conventional error
name; update the Get call and all subsequent uses (currently lines referencing
err) to use extrrObj, and rename the temporary error variable e (from the Get
call) to the conventional err so the function reads: declare extrrObj :=
nvsentinelv1.ExternalRemediationRequest (or var extrrObj
nvsentinelv1.ExternalRemediationRequest), call r.Get(..., &extrrObj) and handle
the returned error as err := r.Get(...) / if err != nil { return ctrl.Result{},
client.IgnoreNotFound(err) } ensuring all references to the ExtRR object are
updated to extrrObj.
---
Nitpick comments:
In `@janitor/pkg/controller/externalremediationrequest_controller.go`:
- Line 15: Add a package-level godoc comment immediately above the "package
controller" declaration describing the purpose and responsibilities of the
controller package (e.g., that it implements reconciliation logic for
ExternalRemediationRequest resources and related helpers), mention the main
abstractions it contains (controllers, reconcilers, and helpers for
ExternalRemediationRequest), and keep it concise and formatted as a proper Go
comment block to satisfy godoc requirements.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 67c3c8d4-d01d-440f-ab5c-f7af09f94d3c
📒 Files selected for processing (8)
janitor/main.gojanitor/pkg/condition/condition.gojanitor/pkg/condition/condition_test.gojanitor/pkg/controller/externalremediationrequest_controller.gojanitor/pkg/controller/externalremediationrequest_controller_test.gojanitor/pkg/controller/suite_test.gojanitor/pkg/metrics/err_metrics.gojanitor/pkg/metrics/metrics.go
🚧 Files skipped from review as they are similar to previous changes (7)
- janitor/pkg/controller/suite_test.go
- janitor/pkg/metrics/metrics.go
- janitor/main.go
- janitor/pkg/condition/condition.go
- janitor/pkg/condition/condition_test.go
- janitor/pkg/metrics/err_metrics.go
- janitor/pkg/controller/externalremediationrequest_controller_test.go
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
Adds the controller-runtime reconciler that drives the ADR-040 ExtRR lifecycle. Owns the six-branch state machine: init (finalizer + initial Unknown conditions), deletion-cleanup, apply (release taint + managed=false in a single strategic-merge PATCH), True-cleanup, asymmetric False no-op, and the released steady-state. Includes: * Proto Condition ↔ metav1.Condition adapter (pkg/condition) so the reconciler uses standard meta.SetStatusCondition helpers while the proto stays authoritative on the apiserver. * Field indexer on spec.healthEvent.nodeName so the Node watch maps to ExtRRs in O(1) instead of scanning every ExtRR per Node event. * Predicate that drops Node events outside the release-state surface (managed label + release taint), so kubelet heartbeats don't re-enqueue every ExtRR. * Belt-and-suspenders nil-spec guard in Reconcile — the validating webhook enforces the contract, but a webhook outage shouldn't crashloop the controller. * Prometheus metrics in the nvsentinel_external_remediation_ namespace (total / open / age_seconds) with phase + result labels per ADR-040. * Ginkgo envtest suite covering init, apply, drift, RBAC denial, branch 2 + 4 cleanup paths, asymmetric False, and observability assertions. * Focused unit tests for the field indexer, node-watch predicate, and mapper using a fake client with the index pre-registered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…quest Adds a typed validator that enforces the contract the reconciler depends on: spec, spec.healthEvent, and spec.healthEvent.nodeName must all be populated, and nodeName is immutable after creation (the release taint carries the ExtRR name so flipping nodeName would orphan a node). The proto-backed CRD wrapper has pointer Spec/Status fields (forced by the proto's embedded sync.Mutex), so the apiserver's OpenAPI schema accepts spec: null and similar incomplete objects. The webhook closes that gap with failurePolicy=Fail so the reconciler can rely on a well-formed Spec. ExtRR intentionally has no Config.Enabled gate: per ADR-040 it is an unconditional capability — disabling it would strand any external system mid-remediation with no way to release affected nodes. Includes: * extrrValidator in pkg/webhook/v1alpha1 modelled on the existing RebootNode / TerminateNode / GPUReset validators. * kubebuilder marker emitting the standard webhook path. * Helm chart entry that wires the same cert/service infra as the sibling webhooks. * Ginkgo unit tests covering create rejection paths and update immutability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds five end-to-end tests against a chart-deployed janitor + webhook that prove the wiring works in a real cluster (cert provisioning, RBAC, manifest correctness, reconciler boot): * TestExtRRWebhookRejectsInvalidSpec — exercises the four webhook rejection paths (nil spec, nil healthEvent, empty nodeName, immutable nodeName on update). Proves the kubebuilder marker, helm chart entry, and cert wiring are all aligned. * TestExtRRLifecycleHappyPath — full ADR-040 happy path: apply → release taint + managed=false → Complete=True → cleanup → garbage collection. * TestExtRRAsymmetricFalse — Complete=False is a no-op; node remains released. Complete=True retry from False then closes the ExtRR. * TestExtRROperatorDeleteEscape — kubectl delete extrr while stuck at Complete=False drives the finalizer-based cleanup. * TestExtRRForeignTaintDrift — drift-safety: a fresh ExtRR for a node already tainted by another ExtRR transitions to NVSentinelOwnershipReleased=False, leaving the foreign taint alone. Each test's Teardown does a belt-and-suspenders scrub of the test node's taint+label via ScrubExtRRStateFromNode, so a mid-test failure doesn't leak state into subsequent tests sharing the same Node. Helpers added in tests/helpers/kube.go: CreateExtRRCR, CreateMalformedExtRR, SetExtRRComplete, WaitForExtRRCondition, WaitForExtRRGone, ScrubExtRRStateFromNode, plus the ExternalRemediationRequestGVK constant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
74f8f7b to
e3ae6d4
Compare
Summary
Second in the ExternalRemediationRequest (ExtRR) series. Builds on #1376 (foundation: CRD + types + RBAC + scheme registration) by adding the controller that drives the node-coordination state machine. Until something starts creating ExtRR objects (a follow-up change wiring fault-remediation), this controller only acts when an operator hand-applies one.
What's included
State machine (6 dispatch branches)
NVSentinelOwnershipReleased=Unknown+ExternalRemediationComplete=Unknownon a fresh ExtRR.managed=falselabel, then transitionNVSentinelOwnershipReleased=True.ExternalRemediationComplete=True, remove the taint + label, then drop the finalizer so the apiserver can garbage-collect the object.ExternalRemediationComplete=Falsedoes NOT close the ExtRR. The node stays released until an operator deletes the object or the external system retries withTrue. This asymmetry is the core of the ADR-040 contract.kubectl delete extrrtriggers the same node cleanup before the finalizer drops, so operators can reclaim stalled nodes.Supporting pieces
janitor/pkg/condition— adapter between the protoConditionmessage andmeta.Conditionso existing controller-runtime helpers (meta.SetStatusCondition,meta.IsStatusConditionTrue) work against ExtRR status.janitor/pkg/metrics/err_metrics.go— three Prometheus series in thenvsentinel_external_remediation_namespace: lifecycle transitions counter, currently-open gauge, age-at-close histogram.janitor/main.go— manager registers the new reconciler alongside the existing RebootNode / TerminateNode / GPUReset controllers.Things deferred to follow-ups
EXTERNAL_REMEDIATIONrecommended action (a separate change).managed=false.Test Plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Observability
Tests