feat(janitor): implement ExternalRemediationRequest reconciler #1392

Open
jtschelling wants to merge 3 commits into NVIDIA:main from jtschelling:feature/jsc-103-err-reconciler

jtschelling commented Jun 11, 2026

Contributor

Summary

Second in the ExternalRemediationRequest (ExtRR) series. Builds on #1376 (foundation: CRD + types + RBAC + scheme registration) by adding the controller that drives the node-coordination state machine. Until something starts creating ExtRR objects (a follow-up change wiring fault-remediation), this controller only acts when an operator hand-applies one.

What's included

State machine (6 dispatch branches)

  1. Initialise — add the cleanup finalizer; seed NVSentinelOwnershipReleased=Unknown + ExternalRemediationComplete=Unknown on a fresh ExtRR.
  2. Apply path — strategic-merge-patch the target node with the release taint (taint value = the ExtRR's own metadata.name, so the cleanup path can find only its own taint) and the managed=false label, then transition NVSentinelOwnershipReleased=True.
  3. True cleanup — once an external system reports ExternalRemediationComplete=True, remove the taint + label, then drop the finalizer so the apiserver can garbage-collect the object.
  4. False no-op (asymmetric): ExternalRemediationComplete=False does NOT close the ExtRR. The node stays released until an operator deletes the object or the external system retries with True. This asymmetry is the core of the ADR-040 contract.
  5. Deletion-with-finalizer cleanup: kubectl delete extrr triggers the same node cleanup before the finalizer drops, so operators can reclaim stalled nodes.
  6. Drift handling — defends against foreign taints at the same key (left in place, never stomped), missing nodes (waits for apiserver), RBAC-forbidden patches (surfaces a failure condition), and taint already at the correct value (no-op).
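The dispatch order described above can be sketched as a pure function over the object's state. This is a minimal illustration, not the PR's code: the `ExtRR` struct, the `Branch` names, and the exact precedence between branches are assumptions, and the branch set follows the commit message (which lists a released steady state as the sixth branch).

```go
package main

import "fmt"

// Status mirrors the tri-state value of a Kubernetes condition.
type Status string

const (
	Unknown Status = "Unknown"
	True    Status = "True"
	False   Status = "False"
)

// ExtRR is a simplified, hypothetical stand-in for the real object.
type ExtRR struct {
	Deleting     bool   // metadata.deletionTimestamp is set
	HasFinalizer bool   // cleanup finalizer is present
	Seeded       bool   // both conditions have been seeded
	Released     Status // NVSentinelOwnershipReleased
	Complete     Status // ExternalRemediationComplete
}

// Branch names one dispatch outcome; the identifiers are illustrative.
type Branch string

const (
	Initialise      Branch = "initialise"
	DeletionCleanup Branch = "deletion-cleanup"
	Apply           Branch = "apply"
	TrueCleanup     Branch = "true-cleanup"
	FalseNoOp       Branch = "false-no-op"
	Steady          Branch = "steady"
)

// dispatch picks a branch. The precedence shown here is an assumption.
func dispatch(e ExtRR) Branch {
	switch {
	case e.Deleting && e.HasFinalizer:
		return DeletionCleanup // operator delete: clean the node, then drop the finalizer
	case e.Deleting:
		return Steady // finalizer already gone, nothing left to do
	case !e.HasFinalizer || !e.Seeded:
		return Initialise
	case e.Complete == True:
		return TrueCleanup
	case e.Released != True:
		return Apply
	case e.Complete == False:
		return FalseNoOp // asymmetric: the node stays released
	default:
		return Steady
	}
}

func main() {
	fresh := ExtRR{}
	stalled := ExtRR{HasFinalizer: true, Seeded: true, Released: True, Complete: False}
	fmt.Println(dispatch(fresh), dispatch(stalled))
}
```

Keeping the decision in a side-effect-free function like this would make each branch testable without an apiserver; the envtest suite would then only need to cover the side effects.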

Supporting pieces

  • janitor/pkg/condition — adapter between the proto Condition message and meta.Condition so existing controller-runtime helpers (meta.SetStatusCondition, meta.IsStatusConditionTrue) work against ExtRR status.
  • janitor/pkg/metrics/err_metrics.go — three Prometheus series in the nvsentinel_external_remediation_ namespace: lifecycle transitions counter, currently-open gauge, age-at-close histogram.
  • janitor/main.go — manager registers the new reconciler alongside the existing RebootNode / TerminateNode / GPUReset controllers.
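To illustrate the adapter idea in janitor/pkg/condition, here is a rough round-trip sketch. It uses local stand-in types rather than the real `protos.Condition`, `metav1.Condition`, and `timestamppb.Timestamp`, and the function names `toMeta` / `fromMeta` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// protoTimestamp stands in for a protobuf timestamp (seconds + nanos).
type protoTimestamp struct {
	Seconds int64
	Nanos   int32
}

// protoCondition stands in for the proto Condition message.
type protoCondition struct {
	Type, Status, Reason, Message string
	LastTransitionTime            *protoTimestamp
}

// metaCondition stands in for the Kubernetes condition type.
type metaCondition struct {
	Type, Status, Reason, Message string
	LastTransitionTime            time.Time
}

// toMeta converts a proto condition; nil input yields the zero value.
func toMeta(p *protoCondition) metaCondition {
	if p == nil {
		return metaCondition{}
	}
	m := metaCondition{Type: p.Type, Status: p.Status, Reason: p.Reason, Message: p.Message}
	if ts := p.LastTransitionTime; ts != nil {
		m.LastTransitionTime = time.Unix(ts.Seconds, int64(ts.Nanos)).UTC()
	}
	return m
}

// fromMeta converts back; a zero time maps to a nil timestamp.
func fromMeta(m metaCondition) *protoCondition {
	p := &protoCondition{Type: m.Type, Status: m.Status, Reason: m.Reason, Message: m.Message}
	if !m.LastTransitionTime.IsZero() {
		p.LastTransitionTime = &protoTimestamp{
			Seconds: m.LastTransitionTime.Unix(),
			Nanos:   int32(m.LastTransitionTime.Nanosecond()),
		}
	}
	return p
}

func main() {
	in := &protoCondition{
		Type:               "ExternalRemediationComplete",
		Status:             "Unknown",
		LastTransitionTime: &protoTimestamp{Seconds: 1700000000},
	}
	out := fromMeta(toMeta(in))
	fmt.Println(out.Type, out.Status, out.LastTransitionTime.Seconds)
}
```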

Things deferred to follow-ups

  • fault-remediation producing ExtRRs on the EXTERNAL_REMEDIATION recommended action (a separate change).
  • node-labeler and cluster-scope monitor gating on managed=false.
  • Demo GIF + ADR cross-refs.

Test Plan

  • 143 janitor tests pass (124 prior + 19 new), including a full ginkgo envtest suite that exercises every dispatch branch end to end against a real apiserver.
  • `pkg/controller` coverage rises from 43.7% to 50.0%.
  • Foreign-taint test: apply a taint at our release key but with a different value, verify cleanup leaves it in place.
  • Asymmetric semantics: `Complete=False` must not trigger cleanup; verified in envtest.
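The foreign-taint rule in the test plan could look roughly like the sketch below: a taint is removed only when both the key and the value (the ExtRR's own name) match. The `Taint` type, the `releaseTaintKey` value, and the function name are placeholders for illustration, not identifiers from the PR.

```go
package main

import "fmt"

// Taint is a minimal stand-in for corev1.Taint.
type Taint struct {
	Key, Value, Effect string
}

// releaseTaintKey is a hypothetical key; the real one lives in the controller.
const releaseTaintKey = "example.com/ownership-released"

// removeOwnTaint drops the release taint only when its value equals extrrName.
// A taint at the same key with another value is kept and reported as foreign.
func removeOwnTaint(taints []Taint, extrrName string) (kept []Taint, removed, foreign bool) {
	kept = make([]Taint, 0, len(taints))
	for _, t := range taints {
		switch {
		case t.Key != releaseTaintKey:
			kept = append(kept, t) // unrelated taint, untouched
		case t.Value == extrrName:
			removed = true // our own taint, dropped
		default:
			foreign = true // same key, different owner: leave in place
			kept = append(kept, t)
		}
	}
	return kept, removed, foreign
}

func main() {
	taints := []Taint{
		{Key: "node.kubernetes.io/unschedulable", Effect: "NoSchedule"},
		{Key: releaseTaintKey, Value: "extrr-other", Effect: "NoSchedule"},
	}
	kept, removed, foreign := removeOwnTaint(taints, "extrr-mine")
	fmt.Println(len(kept), removed, foreign)
}
```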

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added an External Remediation Request controller to manage node remediation workflows (taint/label handling, finalizers, status conditions); controller is registered on startup.
    • Added condition conversion utilities to translate between protobuf and Kubernetes condition formats.
  • Observability

    • Added Prometheus metrics for External Remediation Request lifecycle counts, open-state gauges, and closed-age histograms; metrics are registered at startup.
  • Tests

    • Comprehensive unit and controller tests for the new controller and condition conversion helpers.

coderabbitai Bot commented Jun 11, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds an ExternalRemediationRequest controller, condition conversion helpers and tests, Prometheus metrics for ERRs (lifecycle/open/age), a six-branch reconciliation state machine (apply, cleanup, no-op, deletion), comprehensive controller tests, and startup wiring to register the controller and metrics.

Changes

ExternalRemediationRequest Controller Implementation

| Layer / File(s) | Summary |
|---|---|
| Condition type conversions (janitor/pkg/condition/condition.go, janitor/pkg/condition/condition_test.go) | Bidirectional converters (ToMetav1, FromMetav1) and slice helpers translate between protobuf protos.Condition and Kubernetes metav1.Condition, with nil/zero handling and timestamp translation. Tests verify field preservation, round-trip integrity, and edge cases. |
| Observability: ERR metrics (janitor/pkg/metrics/err_metrics.go, janitor/pkg/metrics/metrics.go) | Prometheus metrics for ERR lifecycle: ExtRRTotal counter (phase/result), ExtRROpen gauge (active requests by node/action/state), ExtRRAgeSeconds histogram (remediation duration). Methods on ActionMetrics record updates; metrics registered during NewActionMetrics(). |
| Controller core, entry & initialization (janitor/pkg/controller/externalremediationrequest_controller.go) | Introduces ExternalRemediationRequestReconciler, Reconcile entry (OTEL span), helpers/constants, and initialization that ensures cleanup finalizer plus two initial Unknown conditions. |
| Dispatcher & apply path (janitor/pkg/controller/externalremediationrequest_controller.go) | ADR-040 dispatcher implementing six branches and the release-apply path: validates spec.healthEvent.nodeName, requeues on missing Node, patches Node with release taint + managed=false, detects drift/RBAC, and transitions NVSentinelOwnershipReleased accordingly. |
| Cleanup, close, and deletion (janitor/pkg/controller/externalremediationrequest_controller.go) | Cleanup-on-complete and operator-deletion flows remove taint/label idempotently, record close metrics/events, and remove the finalizer to allow GC. |
| No-op handling, taint utilities, status patching (janitor/pkg/controller/externalremediationrequest_controller.go) | Asymmetric no-op for ExternalRemediationComplete=false, taint slice helpers, and status-condition patching that only updates status when it actually changes. |
| Controller tests (janitor/pkg/controller/externalremediationrequest_controller_test.go, janitor/pkg/controller/suite_test.go) | Comprehensive Ginkgo/Gomega suite exercising initialization idempotency, apply/cleanup branches, drift/RBAC/missing-node edge cases, operator deletion, metrics/event assertions, and small unit/static tests; test scheme registration updated. |
| Integration wiring & startup (janitor/main.go, lines 242–251) | Registers ExternalRemediationRequestReconciler in main startup with SetupWithManager error logging and updates the startup slog.Info to list the new controller. |

Sequence Diagram(s)

sequenceDiagram
  participant Reconciler
  participant APIServer
  participant Node
  participant Metrics
  Reconciler->>APIServer: GET ExternalRemediationRequest
  Reconciler->>APIServer: GET Node (spec.healthEvent.nodeName)
  Reconciler->>APIServer: PATCH Node (apply/remove taint & managed label)
  Reconciler->>APIServer: PATCH ExternalRemediationRequest/status (conditions)
  Reconciler->>Metrics: Inc/Adjust/Observe ExtRR metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/NVSentinel#1376: Adds the ExternalRemediationRequest CRD/scheme/RBAC/ERR identity + JSON handling that this controller builds on.

Suggested labels

area/core, area/api, area/tests

Suggested reviewers

  • lalitadithya
  • natherz97

Poem

A rabbit hops where signals mend,
Converting conditions end to end,
Metrics hum and reconcilers tread,
Taints applied, then softly shed,
Tests applaud the lifecycle's friend. 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and concisely summarizes the main change: implementing an ExternalRemediationRequest reconciler controller, which matches the core objective of this changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai Bot left a comment

Contributor

Actionable comments posted: 8

🧹 Nitpick comments (1)
janitor/pkg/controller/externalremediationrequest_controller.go (1)

117-141: 💤 Low value

Consider extracting "unknown" as a package constant.

The string literal "unknown" appears multiple times in this file (lines 123, 127, 130). Extract it as a package-level constant to improve maintainability and reduce the risk of typos.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@janitor/pkg/controller/externalremediationrequest_controller.go` around lines
117 - 141, Extract the repeated literal "unknown" into a package-level constant
(e.g., const unknownLabel = "unknown") and replace all occurrences inside
recommendedActionLabel and errNodeLabel (and any other uses in this file) with
that constant; update the functions recommendedActionLabel(extrrObj
*nvsentinelv1.ExternalRemediationRequest) and errNodeLabel(extrrObj
*nvsentinelv1.ExternalRemediationRequest) to return unknownLabel instead of the
raw string to centralize the value and avoid typos.

Source: Linters/SAST tools
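For reference, the suggested extraction might look like the following. The struct shapes are simplified stand-ins for `nvsentinelv1.ExternalRemediationRequest`, and the `RecommendedAction` field is an assumption made for illustration; only the `unknownLabel` constant and the two function names come from the review comment.

```go
package main

import "fmt"

// unknownLabel centralises the fallback metric label value.
const unknownLabel = "unknown"

// Simplified, hypothetical shapes of the ExtRR object.
type healthEvent struct {
	NodeName          string
	RecommendedAction string
}

type spec struct {
	HealthEvent *healthEvent
}

type extRR struct {
	Spec *spec
}

// errNodeLabel returns the node name for metric labels, or unknownLabel.
func errNodeLabel(e *extRR) string {
	if e == nil || e.Spec == nil || e.Spec.HealthEvent == nil || e.Spec.HealthEvent.NodeName == "" {
		return unknownLabel
	}
	return e.Spec.HealthEvent.NodeName
}

// recommendedActionLabel returns the action for metric labels, or unknownLabel.
func recommendedActionLabel(e *extRR) string {
	if e == nil || e.Spec == nil || e.Spec.HealthEvent == nil || e.Spec.HealthEvent.RecommendedAction == "" {
		return unknownLabel
	}
	return e.Spec.HealthEvent.RecommendedAction
}

func main() {
	obj := &extRR{Spec: &spec{HealthEvent: &healthEvent{NodeName: "node-a"}}}
	fmt.Println(errNodeLabel(obj), recommendedActionLabel(obj))
}
```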

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@janitor/pkg/condition/condition.go`:
- Around line 26-31: The import block in condition.go is misformatted; run gofmt
-s -w or goimports -w to reorder and format the imports so they conform to
gofmt/goimports rules (group standard library, external, then internal packages)
and remove/adjust the blank line between metav1, timestamppb and protos; check
the imports that reference metav1, timestamppb and protos remain unchanged after
formatting.

In `@janitor/pkg/controller/externalremediationrequest_controller.go`:
- Around line 262-302: In setInitialConditions, add a blank line immediately
above the final return statement so there is whitespace separating the call to
r.patchStatusConditions(...) and the subsequent "return patched, err" line; this
improves readability around the end of the function where conditions are patched
and returned (references: setInitialConditions, r.patchStatusConditions,
patched, err).
- Around line 202-222: The function needsInitialization in
ExternalRemediationRequestReconciler has a line that exceeds the 120-character
limit; locate the long conditional check (the meta.FindStatusCondition call that
checks ConditionNVSentinelOwnershipReleased on the conds slice inside
needsInitialization) and break it into two shorter lines (e.g., assign
meta.FindStatusCondition(conds, ConditionNVSentinelOwnershipReleased) to a local
variable or wrap the if condition across lines) so no source line exceeds 120
characters while preserving the same logic and returning true when the condition
is nil.
- Around line 167-200: In Reconcile (method
ExternalRemediationRequestReconciler.Reconcile) insert a blank line immediately
above the defer span.End() to separate the
tracing.StartSpanWithLinkFromTraceContext(...) call/assignment from the defer;
locate the tracing.StartSpanWithLinkFromTraceContext invocation that returns
ctx, span and the subsequent defer span.End() and add one empty line for
improved readability and to satisfy the static analysis style rule.
- Around line 360-441: reconcileApply is over the cyclomatic complexity
threshold; extract three helpers to simplify control flow: (1) move node
retrieval and missing-node requeue logic out of reconcileApply into a helper
like getNodeForApply(ctx, extrrObj) that returns (*corev1.Node, ctrl.Result,
error) to encapsulate the IsNotFound requeue behavior; (2) extract the taint
ownership/drift checks into checkTaintOwnership(node *corev1.Node, extrrObj
*nvsentinelv1.ExternalRemediationRequest) which returns an enum/struct or
(status, message) indicating "owned-by-us", "label-missing", "owned-by-other" so
reconcileApply can early-return to transitionToReleaseSuccess or
transitionToReleaseFailure without nested branches; and (3) pull the Patch error
handling into handlePatchError(err, nodeName, extrrObj) which returns
(ctrl.Result, error) and encapsulates the apierrors.IsForbidden branch and the
generic fmt.Errorf branch; update reconcileApply to call these helpers to reduce
branching and meet the complexity threshold.
- Around line 586-659: reconcileCleanup is too complex—extract two helpers: (1)
a node fetch helper on ExternalRemediationRequestReconciler (e.g.
fetchNodeForCleanup(ctx, nodeName) returning (*corev1.Node, bool, error)) that
wraps r.Get and implements the IsNotFound short-circuit used today, and (2) a
taint decision helper (e.g. shouldRemoveReleaseTaint(taints []corev1.Taint,
extrrName string) returning (newTaints []corev1.Taint, removed bool, ownerDrift
bool)) that uses findTaintByKey, ReleaseTaintKey and removeTaintByKey to decide
whether to remove the taint (match on extrrObj.Name) or leave it and signal
drift; then simplify reconcileCleanup to call these two helpers, adjust changed
logic and keep the existing Patch and logging paths intact.
- Around line 808-827: The reconciler currently uses the deprecated
mgr.GetEventRecorderFor and client-go record.EventRecorder; update
ExternalRemediationRequestReconciler.SetupWithManager to call
mgr.GetEventRecorder(...) and change the reconciler's Recorder field type and
import from k8s.io/client-go/tools/record.EventRecorder to
k8s.io/client-go/tools/events.EventRecorder (update imports and any uses of
Recorder accordingly), update tests to replace record.NewFakeRecorder usages
with the appropriate events-based fake/constructor, and add RBAC markers for
events.k8s.io if controller-runtime requires them.

In `@janitor/pkg/metrics/err_metrics.go`:
- Around line 48-58: The comment above the ExtRRTotal declaration has a line
exceeding the 120-char limit; edit the block comment describing phases (the long
sentence that starts with "phase=created ..." and continues through
"operator_deleted (kubectl delete err)") to wrap or split into multiple shorter
lines (or shorten wording) so no single comment line exceeds 120 characters
while preserving the same phase/result descriptions and labels for ExtRRTotal.
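As one possible shape for the `checkTaintOwnership` helper suggested above for reconcileApply, the sketch below returns a small enum so the caller can switch on it instead of nesting branches. The types, key names, and the extra `notTainted` state are assumptions for illustration; the real helper would take `*corev1.Node` and the ExtRR object.

```go
package main

import "fmt"

// Hypothetical keys; the real constants live in the controller package.
const (
	releaseTaintKey = "example.com/ownership-released"
	managedLabelKey = "example.com/managed"
)

type taint struct{ Key, Value string }

// node is a minimal stand-in for corev1.Node.
type node struct {
	Labels map[string]string
	Taints []taint
}

// ownership enumerates what the apply path found on the node.
type ownership int

const (
	notTainted   ownership = iota // no release taint yet: patch the node
	ownedByUs                     // taint and label already correct: no-op
	labelMissing                  // our taint exists but managed=false is absent
	ownedByOther                  // taint at our key with a foreign value
)

// checkTaintOwnership inspects the release taint and the managed label.
func checkTaintOwnership(n node, extrrName string) ownership {
	for _, t := range n.Taints {
		if t.Key != releaseTaintKey {
			continue
		}
		if t.Value != extrrName {
			return ownedByOther
		}
		if n.Labels[managedLabelKey] != "false" {
			return labelMissing
		}
		return ownedByUs
	}
	return notTainted
}

func main() {
	n := node{Taints: []taint{{Key: releaseTaintKey, Value: "extrr-b"}}}
	fmt.Println(checkTaintOwnership(n, "extrr-a") == ownedByOther)
}
```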

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f37c4f65-f22b-4fb2-a71c-b6ce4b3db6d8

📥 Commits

Reviewing files that changed from the base of the PR and between 29e14d9 and 14bf562.

📒 Files selected for processing (8)
  • janitor/main.go
  • janitor/pkg/condition/condition.go
  • janitor/pkg/condition/condition_test.go
  • janitor/pkg/controller/externalremediationrequest_controller.go
  • janitor/pkg/controller/externalremediationrequest_controller_test.go
  • janitor/pkg/controller/suite_test.go
  • janitor/pkg/metrics/err_metrics.go
  • janitor/pkg/metrics/metrics.go

@jtschelling jtschelling force-pushed the feature/jsc-103-err-reconciler branch from 0ea43b0 to 94158b1 Compare June 11, 2026 20:20

coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
janitor/pkg/controller/externalremediationrequest_controller_test.go (1)

1064-1084: 💤 Low value

Consider removing or documenting unused helper.

snapshotConditions is defined but not called anywhere in this test file. If it's intended for debugging or future use, consider adding a comment indicating that purpose, or remove it to avoid dead code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@janitor/pkg/controller/externalremediationrequest_controller_test.go` around
lines 1064 - 1084, The helper function snapshotConditions (func
snapshotConditions(extrrObj *nvsentinelv1.ExternalRemediationRequest) string) is
currently unused; either delete it to remove dead code or keep it but add a
short comment explaining its purpose (e.g., "helper for debugging/test snapshots
of ExternalRemediationRequest conditions") and/or call it from a relevant test
so it is exercised; update or add a linter suppression only if you intentionally
keep it unused.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dfb20a53-f3ff-40ea-8f2b-71953cbd0142

📥 Commits

Reviewing files that changed from the base of the PR and between 0ea43b0 and 94158b1.

📒 Files selected for processing (8)
  • janitor/main.go
  • janitor/pkg/condition/condition.go
  • janitor/pkg/condition/condition_test.go
  • janitor/pkg/controller/externalremediationrequest_controller.go
  • janitor/pkg/controller/externalremediationrequest_controller_test.go
  • janitor/pkg/controller/suite_test.go
  • janitor/pkg/metrics/err_metrics.go
  • janitor/pkg/metrics/metrics.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • janitor/main.go
  • janitor/pkg/controller/suite_test.go
  • janitor/pkg/metrics/err_metrics.go
  • janitor/pkg/condition/condition.go
  • janitor/pkg/condition/condition_test.go
  • janitor/pkg/controller/externalremediationrequest_controller.go

@github-actions

Contributor

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/janitor/pkg/controller | 12.96% (-0.65%) | 👎 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/janitor/pkg/controller/externalremediationrequest_controller.go | 15.73% (+15.73%) | 2314 (+2314) | 364 (+364) | 1950 (+1950) | 🎉 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@jtschelling

Contributor Author

@coderabbitai review

coderabbitai Bot commented Jun 11, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@github-actions

Contributor

Merging this branch changes the coverage (2 decrease, 1 increase)

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/janitor/pkg/condition | 26.64% (+26.64%) | 🌟 |
| github.com/nvidia/nvsentinel/janitor/pkg/controller | 12.95% (-0.66%) | 👎 |
| github.com/nvidia/nvsentinel/janitor/pkg/metrics | 23.81% (-3.22%) | 👎 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/janitor/pkg/condition/condition.go | 26.64% (+26.64%) | 244 (+244) | 65 (+65) | 179 (+179) | 🌟 |
| github.com/nvidia/nvsentinel/janitor/pkg/controller/externalremediationrequest_controller.go | 15.61% (+15.61%) | 2377 (+2377) | 371 (+371) | 2006 (+2006) | 🎉 |
| github.com/nvidia/nvsentinel/janitor/pkg/metrics/err_metrics.go | 22.50% (+22.50%) | 40 (+40) | 9 (+9) | 31 (+31) | 🌟 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/pkg/controller/externalremediationrequest_controller_test.go

coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
janitor/pkg/controller/externalremediationrequest_controller.go (1)

15-15: 💤 Low value

Add package-level godoc for the controller package.

Per coding guidelines, all Go packages should include package-level godoc. Consider adding a comment above the package controller line documenting the package's purpose.

📝 Suggested addition
+// Package controller implements Kubernetes reconcilers for NVSentinel custom
+// resources including RebootNode, TerminateNode, GPUReset, and
+// ExternalRemediationRequest controllers.
 package controller
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@janitor/pkg/controller/externalremediationrequest_controller.go` at line 15,
Add a package-level godoc comment immediately above the "package controller"
declaration describing the purpose and responsibilities of the controller
package (e.g., that it implements reconciliation logic for
ExternalRemediationRequest resources and related helpers), mention the main
abstractions it contains (controllers, reconcilers, and helpers for
ExternalRemediationRequest), and keep it concise and formatted as a proper Go
comment block to satisfy godoc requirements.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@janitor/pkg/controller/externalremediationrequest_controller.go`:
- Around line 148-152: Rename the local variable currently named err (type
nvsentinelv1.ExternalRemediationRequest) to extrrObj in the Reconcile method of
ExternalRemediationRequestReconciler to avoid shadowing the conventional error
name; update the Get call and all subsequent uses (currently lines referencing
err) to use extrrObj, and rename the temporary error variable e (from the Get
call) to the conventional err so the function reads: declare extrrObj :=
nvsentinelv1.ExternalRemediationRequest (or var extrrObj
nvsentinelv1.ExternalRemediationRequest), call r.Get(..., &extrrObj) and handle
the returned error as err := r.Get(...) / if err != nil { return ctrl.Result{},
client.IgnoreNotFound(err) } ensuring all references to the ExtRR object are
updated to extrrObj.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 67c3c8d4-d01d-440f-ab5c-f7af09f94d3c

📥 Commits

Reviewing files that changed from the base of the PR and between 0ea43b0 and 859e5bc.

📒 Files selected for processing (8)
  • janitor/main.go
  • janitor/pkg/condition/condition.go
  • janitor/pkg/condition/condition_test.go
  • janitor/pkg/controller/externalremediationrequest_controller.go
  • janitor/pkg/controller/externalremediationrequest_controller_test.go
  • janitor/pkg/controller/suite_test.go
  • janitor/pkg/metrics/err_metrics.go
  • janitor/pkg/metrics/metrics.go
🚧 Files skipped from review as they are similar to previous changes (7)
  • janitor/pkg/controller/suite_test.go
  • janitor/pkg/metrics/metrics.go
  • janitor/main.go
  • janitor/pkg/condition/condition.go
  • janitor/pkg/condition/condition_test.go
  • janitor/pkg/metrics/err_metrics.go
  • janitor/pkg/controller/externalremediationrequest_controller_test.go

@github-actions

Contributor

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/janitor/pkg/controller | 12.95% (-0.66%) | 👎 |
| github.com/nvidia/nvsentinel/janitor/pkg/metrics | 23.81% (-3.22%) | 👎 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/janitor/pkg/controller/externalremediationrequest_controller.go | 15.61% (+15.61%) | 2377 (+2377) | 371 (+371) | 2006 (+2006) | 🎉 |
| github.com/nvidia/nvsentinel/janitor/pkg/metrics/err_metrics.go | 22.50% (+22.50%) | 40 (+40) | 9 (+9) | 31 (+31) | 🌟 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@github-actions

Contributor

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/janitor/pkg/controller | 12.88% (-0.71%) | 👎 |
| github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1 | 13.72% (-0.90%) | 👎 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/janitor/pkg/controller/externalremediationrequest_controller.go | 15.63% (+15.63%) | 2124 (+2124) | 332 (+332) | 1792 (+1792) | 🎉 |
| github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/janitor_webhook.go | 13.72% (-0.90%) | 1873 (+368) | 257 (+37) | 1616 (+331) | 👎 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/pkg/controller/externalremediationrequest_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/janitor_webhook_test.go

jtschelling and others added 3 commits June 19, 2026 13:36
Adds the controller-runtime reconciler that drives the ADR-040 ExtRR
lifecycle. Owns the six-branch state machine: init (finalizer + initial
Unknown conditions), deletion-cleanup, apply (release taint +
managed=false in a single strategic-merge PATCH), True-cleanup, asymmetric
False no-op, and the released steady-state.

Includes:

* Proto Condition ↔ metav1.Condition adapter (pkg/condition) so the
  reconciler uses standard meta.SetStatusCondition helpers while the proto
  stays authoritative on the apiserver.
* Field indexer on spec.healthEvent.nodeName so the Node watch maps to
  ExtRRs in O(1) instead of scanning every ExtRR per Node event.
* Predicate that drops Node events outside the release-state surface
  (managed label + release taint), so kubelet heartbeats don't re-enqueue
  every ExtRR.
* Belt-and-suspenders nil-spec guard in Reconcile — the validating webhook
  enforces the contract, but a webhook outage shouldn't crashloop the
  controller.
* Prometheus metrics in the nvsentinel_external_remediation_ namespace
  (total / open / age_seconds) with phase + result labels per ADR-040.
* Ginkgo envtest suite covering init, apply, drift, RBAC denial, branch 2
  + 4 cleanup paths, asymmetric False, and observability assertions.
* Focused unit tests for the field indexer, node-watch predicate, and
  mapper using a fake client with the index pre-registered.
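
A rough sketch of the node-watch predicate idea described above: compare only the release-state surface (managed label plus release taint) of the old and new Node, so unrelated updates such as kubelet heartbeats do not enqueue anything. The types, key names, and function names are hypothetical; the real predicate would plug into controller-runtime's update events.

```go
package main

import "fmt"

// Hypothetical keys; the real constants live in the controller package.
const (
	managedLabelKey = "example.com/managed"
	releaseTaintKey = "example.com/ownership-released"
)

type taint struct{ Key, Value string }

// node is a minimal stand-in for corev1.Node.
type node struct {
	Labels map[string]string
	Taints []taint
}

// surface captures the only Node fields the reconciler cares about.
type surface struct {
	managed    string
	hasManaged bool
	taintValue string
	tainted    bool
}

// surfaceOf extracts the release-state surface from a node.
func surfaceOf(n node) surface {
	var s surface
	s.managed, s.hasManaged = n.Labels[managedLabelKey]
	for _, t := range n.Taints {
		if t.Key == releaseTaintKey {
			s.taintValue, s.tainted = t.Value, true
			break
		}
	}
	return s
}

// releaseSurfaceChanged reports whether an update should re-enqueue ExtRRs.
func releaseSurfaceChanged(oldNode, newNode node) bool {
	return surfaceOf(oldNode) != surfaceOf(newNode)
}

func main() {
	before := node{Labels: map[string]string{"heartbeat": "1"}}
	after := node{Labels: map[string]string{"heartbeat": "2"}}
	fmt.Println(releaseSurfaceChanged(before, after))
}
```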

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…quest

Adds a typed validator that enforces the contract the reconciler depends
on: spec, spec.healthEvent, and spec.healthEvent.nodeName must all be
populated, and nodeName is immutable after creation (the release taint
carries the ExtRR name so flipping nodeName would orphan a node).

The proto-backed CRD wrapper has pointer Spec/Status fields (forced by the
proto's embedded sync.Mutex), so the apiserver's OpenAPI schema accepts
spec: null and similar incomplete objects. The webhook closes that gap
with failurePolicy=Fail so the reconciler can rely on a well-formed Spec.

ExtRR intentionally has no Config.Enabled gate: per ADR-040 it is an
unconditional capability — disabling it would strand any external system
mid-remediation with no way to release affected nodes.

Includes:

* extrrValidator in pkg/webhook/v1alpha1 modelled on the existing
  RebootNode / TerminateNode / GPUReset validators.
* kubebuilder marker emitting the standard webhook path.
* Helm chart entry that wires the same cert/service infra as the sibling
  webhooks.
* Ginkgo unit tests covering create rejection paths and update
  immutability.
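
The contract described above (spec, spec.healthEvent, and nodeName required; nodeName immutable) could be checked along these lines. This is a sketch with local stand-in types and hypothetical function names, not the `extrrValidator` from the PR, and the error wording is invented.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified, hypothetical shapes of the ExtRR object.
type healthEvent struct{ NodeName string }

type spec struct{ HealthEvent *healthEvent }

type extRR struct{ Spec *spec }

// validateCreate rejects objects the reconciler could not act on.
func validateCreate(e *extRR) error {
	switch {
	case e == nil || e.Spec == nil:
		return errors.New("spec is required")
	case e.Spec.HealthEvent == nil:
		return errors.New("spec.healthEvent is required")
	case e.Spec.HealthEvent.NodeName == "":
		return errors.New("spec.healthEvent.nodeName is required")
	}
	return nil
}

// validateUpdate re-checks the new object and enforces nodeName immutability.
func validateUpdate(oldObj, newObj *extRR) error {
	if err := validateCreate(newObj); err != nil {
		return err
	}
	// Only compare when the old object is well formed (assumption).
	if oldObj != nil && oldObj.Spec != nil && oldObj.Spec.HealthEvent != nil &&
		oldObj.Spec.HealthEvent.NodeName != newObj.Spec.HealthEvent.NodeName {
		return errors.New("spec.healthEvent.nodeName is immutable")
	}
	return nil
}

func main() {
	a := &extRR{Spec: &spec{HealthEvent: &healthEvent{NodeName: "node-a"}}}
	b := &extRR{Spec: &spec{HealthEvent: &healthEvent{NodeName: "node-b"}}}
	fmt.Println(validateCreate(a), validateUpdate(a, b))
}
```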

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds five end-to-end tests against a chart-deployed janitor + webhook
that prove the wiring works in a real cluster (cert provisioning,
RBAC, manifest correctness, reconciler boot):

* TestExtRRWebhookRejectsInvalidSpec — exercises the four webhook
  rejection paths (nil spec, nil healthEvent, empty nodeName, immutable
  nodeName on update). Proves the kubebuilder marker, helm chart entry,
  and cert wiring are all aligned.
* TestExtRRLifecycleHappyPath — full ADR-040 happy path: apply →
  release taint + managed=false → Complete=True → cleanup → garbage
  collection.
* TestExtRRAsymmetricFalse — Complete=False is a no-op; node remains
  released. Complete=True retry from False then closes the ExtRR.
* TestExtRROperatorDeleteEscape — kubectl delete extrr while stuck at
  Complete=False drives the finalizer-based cleanup.
* TestExtRRForeignTaintDrift — drift-safety: a fresh ExtRR for a node
  already tainted by another ExtRR transitions to
  NVSentinelOwnershipReleased=False, leaving the foreign taint alone.

Each test's Teardown does a belt-and-suspenders scrub of the test
node's taint+label via ScrubExtRRStateFromNode, so a mid-test failure
doesn't leak state into subsequent tests sharing the same Node.

Helpers added in tests/helpers/kube.go: CreateExtRRCR,
CreateMalformedExtRR, SetExtRRComplete, WaitForExtRRCondition,
WaitForExtRRGone, ScrubExtRRStateFromNode, plus the
ExternalRemediationRequestGVK constant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jtschelling jtschelling force-pushed the feature/jsc-103-err-reconciler branch from 74f8f7b to e3ae6d4 Compare June 19, 2026 20:37

1 participant