[BUG] workqueue_* metrics missing from /metrics endpoint after controller-runtime v0.20.0 upgrade

### Pre-submission Checklist

- [x] I have searched [existing issues](https://github.com/DataDog/datadog-operator/issues) and this is not a duplicate
- [x] This is a Datadog Operator issue (CRDs, reconciliation, etc.), not a Datadog Agent or Datadog service problem (dashboards, monitors, etc.)

### Operator version

1.23.1

### Operator Helm chart version

2.18.1

### Bug Report

**TL;DR:** The 7 standard `workqueue_*` metrics (including `workqueue_depth`) are missing from the operator's `/metrics` endpoint. These metrics were available transitively via controller-runtime v0.19.x but disappeared when the operator upgraded to controller-runtime v0.20.x (operator v1.16.0+). Without `workqueue_depth`, there is no way to detect reconciliation queue saturation under load. We're willing to contribute a PR with the fix (~130 LOC).

**What happened:**

The operator's `/metrics` endpoint does not expose any `workqueue_*` metrics, making it impossible to monitor reconciliation queue health when managing large numbers of `DatadogMonitor` or `DatadogSLO` CRs. This affects all controllers in the operator (DatadogAgent, DatadogMonitor, DatadogSLO, etc.) since workqueue metrics are per-controller-queue.

Without these metrics we cannot:

- **Detect queue saturation** — `workqueue_depth` (current backlog size) is the only way to know if the operator is falling behind. The existing `controller_runtime_active_workers` shows workers are busy but not whether 10 or 4,000 items are waiting.
- **Isolate latency sources** — `workqueue_queue_duration_seconds` separates queue wait time from processing time. A CR might take 10s end-to-end but only 200ms to reconcile — the other 9.8s is queue wait, invisible without this metric.
- **Monitor retry storms** — `workqueue_retries_total` reveals if failed reconciliations are creating churn.
- **Catch straggler reconciliations** — `workqueue_longest_running_processor_seconds` surfaces individual CRs that are stuck.

We discovered this while load testing the operator with thousands of DatadogMonitor and DatadogSLO CRs, where queue visibility is essential for capacity planning and performance analysis.

**Verification:**

```bash
$ kubectl port-forward deploy/datadog-operator 8383:8383
$ curl -s http://localhost:8383/metrics | grep -c workqueue
0
```

**What I expected:**

The operator should expose the standard controller-runtime workqueue metrics on its `/metrics` endpoint, consistent with how other `controller_runtime_*` metrics are available.

**Root Cause:**

This regression was introduced in **operator v1.16.0** ([PR #1924](https://github.com/DataDog/datadog-operator/pull/1924)), which upgraded controller-runtime from v0.19.0 to v0.20.4.

In controller-runtime v0.20.0 ([PR #3014](https://github.com/kubernetes-sigs/controller-runtime/pull/3014)), workqueue metric registration was moved from the public `pkg/metrics/workqueue.go` — which had an `init()` that auto-registered metrics and called `workqueue.SetProvider()` — to the internal `pkg/internal/metrics/workqueue.go`, which is only imported by the priority queue implementation (`pkg/controller/priorityqueue/`). Since the operator uses standard workqueues, the internal package is never imported, the `init()` never executes, and the 7 workqueue metrics are never registered. The public file was reduced to just metric name constants.

This was a side effect of the upstream priority queue feature ([issue #2374](https://github.com/kubernetes-sigs/controller-runtime/issues/2374)), which moved metric registration to an internal package. Operators that depended on the transitive registration need to explicitly opt back in.

**Proposed Solution:**

We're willing to contribute a PR if maintainers agree this is valuable. The implementation would:

1. Register the 7 workqueue Prometheus metrics with `metrics.Registry`
2. Implement a `WorkqueueMetricsProvider` satisfying `workqueue.MetricsProvider`
3. Call `workqueue.SetProvider()` during operator initialization

Reference implementation: [controller-runtime v0.19.0 pkg/metrics/workqueue.go](https://github.com/kubernetes-sigs/controller-runtime/blob/v0.19.0/pkg/metrics/workqueue.go) (~130 lines)

**Alternative:** If registering all 7 metrics is too invasive, even just `workqueue_depth` would provide the most critical queue saturation visibility.

### Steps to Reproduce

1. Deploy the operator with `collectOperatorMetrics: true` (the default)
2. Port-forward to the operator pod: `kubectl port-forward deploy/datadog-operator 8383:8383`
3. Query the metrics endpoint: `curl -s http://localhost:8383/metrics | grep workqueue`
4. Observe zero results

### Environment

**Kubernetes version:** 1.34 (EKS)
**Helm version:** 3.x
**Cloud provider:** AWS
**controller-runtime version (in operator):** v0.20.4

### Additional Context

The 7 missing workqueue metrics are:

| Metric | Type | What It Measures |
|---|---|---|
| `workqueue_depth` | gauge | Current number of items waiting in the reconciliation queue |
| `workqueue_adds_total` | counter | Total items enqueued |
| `workqueue_queue_duration_seconds` | histogram | Time items wait in queue before processing begins |
| `workqueue_work_duration_seconds` | histogram | Time spent in the `Reconcile()` call per item |
| `workqueue_retries_total` | counter | Number of items re-enqueued after failed reconciliation |
| `workqueue_unfinished_work_seconds` | gauge | Sum of in-progress work durations |
| `workqueue_longest_running_processor_seconds` | gauge | Duration of the longest-running reconciliation |

- The `datadog.operator.*` namespace metrics (e.g., `datadog.operator.reconcile.success`, `datadog.operator.*.custom_resource.count`) sent via the `-operatorMetricsEnabled=true` path do not include workqueue metrics.
- Confirmed the same issue persists on the operator's `main` branch (controller-runtime v0.22.4) — `pkg/metrics/workqueue.go` still contains only constants, no registration.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] workqueue_* metrics missing from /metrics endpoint after controller-runtime v0.20.0 upgrade #2771

Pre-submission Checklist

Operator version

Operator Helm chart version

Bug Report

Steps to Reproduce

Environment

Additional Context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Metric	Type	What It Measures
`workqueue_depth`	gauge	Current number of items waiting in the reconciliation queue
`workqueue_adds_total`	counter	Total items enqueued
`workqueue_queue_duration_seconds`	histogram	Time items wait in queue before processing begins
`workqueue_work_duration_seconds`	histogram	Time spent in the `Reconcile()` call per item
`workqueue_retries_total`	counter	Number of items re-enqueued after failed reconciliation
`workqueue_unfinished_work_seconds`	gauge	Sum of in-progress work durations
`workqueue_longest_running_processor_seconds`	gauge	Duration of the longest-running reconciliation

[BUG] workqueue_* metrics missing from /metrics endpoint after controller-runtime v0.20.0 upgrade #2771

Description

Pre-submission Checklist

Operator version

Operator Helm chart version

Bug Report

Steps to Reproduce

Environment

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions