Pre-submission Checklist
Operator version
1.23.1
Operator Helm chart version
2.18.1
Bug Report
TL;DR: The 7 standard workqueue_* metrics (including workqueue_depth) are missing from the operator's /metrics endpoint. These metrics were available transitively via controller-runtime v0.19.x but disappeared when the operator upgraded to controller-runtime v0.20.x (operator v1.16.0+). Without workqueue_depth, there is no way to detect reconciliation queue saturation under load. We're willing to contribute a PR with the fix (~130 LOC).
What happened:
The operator's /metrics endpoint does not expose any workqueue_* metrics, making it impossible to monitor reconciliation queue health when managing large numbers of DatadogMonitor or DatadogSLO CRs. This affects all controllers in the operator (DatadogAgent, DatadogMonitor, DatadogSLO, etc.) since workqueue metrics are per-controller-queue.
Without these metrics we cannot:
- Detect queue saturation —
workqueue_depth (current backlog size) is the only way to know if the operator is falling behind. The existing controller_runtime_active_workers shows workers are busy but not whether 10 or 4,000 items are waiting.
- Isolate latency sources —
workqueue_queue_duration_seconds separates queue wait time from processing time. A CR might take 10s end-to-end but only 200ms to reconcile — the other 9.8s is queue wait, invisible without this metric.
- Monitor retry storms —
workqueue_retries_total reveals if failed reconciliations are creating churn.
- Catch straggler reconciliations —
workqueue_longest_running_processor_seconds surfaces individual CRs that are stuck.
We discovered this while load testing the operator with thousands of DatadogMonitor and DatadogSLO CRs, where queue visibility is essential for capacity planning and performance analysis.
Verification:
$ kubectl port-forward deploy/datadog-operator 8383:8383
$ curl -s http://localhost:8383/metrics | grep -c workqueue
0
What I expected:
The operator should expose the standard controller-runtime workqueue metrics on its /metrics endpoint, consistent with how other controller_runtime_* metrics are available.
Root Cause:
This regression was introduced in operator v1.16.0 (PR #1924), which upgraded controller-runtime from v0.19.0 to v0.20.4.
In controller-runtime v0.20.0 (PR #3014), workqueue metric registration was moved from the public pkg/metrics/workqueue.go — which had an init() that auto-registered metrics and called workqueue.SetProvider() — to the internal pkg/internal/metrics/workqueue.go, which is only imported by the priority queue implementation (pkg/controller/priorityqueue/). Since the operator uses standard workqueues, the internal package is never imported, the init() never executes, and the 7 workqueue metrics are never registered. The public file was reduced to just metric name constants.
This was a side effect of the upstream priority queue feature (issue #2374), which moved metric registration to an internal package. Operators that depended on the transitive registration need to explicitly opt back in.
Proposed Solution:
We're willing to contribute a PR if maintainers agree this is valuable. The implementation would:
- Register the 7 workqueue Prometheus metrics with
metrics.Registry
- Implement a
WorkqueueMetricsProvider satisfying workqueue.MetricsProvider
- Call
workqueue.SetProvider() during operator initialization
Reference implementation: controller-runtime v0.19.0 pkg/metrics/workqueue.go (~130 lines)
Alternative: If registering all 7 metrics is too invasive, even just workqueue_depth would provide the most critical queue saturation visibility.
Steps to Reproduce
- Deploy the operator with
collectOperatorMetrics: true (the default)
- Port-forward to the operator pod:
kubectl port-forward deploy/datadog-operator 8383:8383
- Query the metrics endpoint:
curl -s http://localhost:8383/metrics | grep workqueue
- Observe zero results
Environment
Kubernetes version: 1.34 (EKS)
Helm version: 3.x
Cloud provider: AWS
controller-runtime version (in operator): v0.20.4
Additional Context
The 7 missing workqueue metrics are:
| Metric |
Type |
What It Measures |
workqueue_depth |
gauge |
Current number of items waiting in the reconciliation queue |
workqueue_adds_total |
counter |
Total items enqueued |
workqueue_queue_duration_seconds |
histogram |
Time items wait in queue before processing begins |
workqueue_work_duration_seconds |
histogram |
Time spent in the Reconcile() call per item |
workqueue_retries_total |
counter |
Number of items re-enqueued after failed reconciliation |
workqueue_unfinished_work_seconds |
gauge |
Sum of in-progress work durations |
workqueue_longest_running_processor_seconds |
gauge |
Duration of the longest-running reconciliation |
- The
datadog.operator.* namespace metrics (e.g., datadog.operator.reconcile.success, datadog.operator.*.custom_resource.count) sent via the -operatorMetricsEnabled=true path do not include workqueue metrics.
- Confirmed the same issue persists on the operator's
main branch (controller-runtime v0.22.4) — pkg/metrics/workqueue.go still contains only constants, no registration.
Pre-submission Checklist
Operator version
1.23.1
Operator Helm chart version
2.18.1
Bug Report
TL;DR: The 7 standard
workqueue_*metrics (includingworkqueue_depth) are missing from the operator's/metricsendpoint. These metrics were available transitively via controller-runtime v0.19.x but disappeared when the operator upgraded to controller-runtime v0.20.x (operator v1.16.0+). Withoutworkqueue_depth, there is no way to detect reconciliation queue saturation under load. We're willing to contribute a PR with the fix (~130 LOC).What happened:
The operator's
/metricsendpoint does not expose anyworkqueue_*metrics, making it impossible to monitor reconciliation queue health when managing large numbers ofDatadogMonitororDatadogSLOCRs. This affects all controllers in the operator (DatadogAgent, DatadogMonitor, DatadogSLO, etc.) since workqueue metrics are per-controller-queue.Without these metrics we cannot:
workqueue_depth(current backlog size) is the only way to know if the operator is falling behind. The existingcontroller_runtime_active_workersshows workers are busy but not whether 10 or 4,000 items are waiting.workqueue_queue_duration_secondsseparates queue wait time from processing time. A CR might take 10s end-to-end but only 200ms to reconcile — the other 9.8s is queue wait, invisible without this metric.workqueue_retries_totalreveals if failed reconciliations are creating churn.workqueue_longest_running_processor_secondssurfaces individual CRs that are stuck.We discovered this while load testing the operator with thousands of DatadogMonitor and DatadogSLO CRs, where queue visibility is essential for capacity planning and performance analysis.
Verification:
$ kubectl port-forward deploy/datadog-operator 8383:8383 $ curl -s http://localhost:8383/metrics | grep -c workqueue 0What I expected:
The operator should expose the standard controller-runtime workqueue metrics on its
/metricsendpoint, consistent with how othercontroller_runtime_*metrics are available.Root Cause:
This regression was introduced in operator v1.16.0 (PR #1924), which upgraded controller-runtime from v0.19.0 to v0.20.4.
In controller-runtime v0.20.0 (PR #3014), workqueue metric registration was moved from the public
pkg/metrics/workqueue.go— which had aninit()that auto-registered metrics and calledworkqueue.SetProvider()— to the internalpkg/internal/metrics/workqueue.go, which is only imported by the priority queue implementation (pkg/controller/priorityqueue/). Since the operator uses standard workqueues, the internal package is never imported, theinit()never executes, and the 7 workqueue metrics are never registered. The public file was reduced to just metric name constants.This was a side effect of the upstream priority queue feature (issue #2374), which moved metric registration to an internal package. Operators that depended on the transitive registration need to explicitly opt back in.
Proposed Solution:
We're willing to contribute a PR if maintainers agree this is valuable. The implementation would:
metrics.RegistryWorkqueueMetricsProvidersatisfyingworkqueue.MetricsProviderworkqueue.SetProvider()during operator initializationReference implementation: controller-runtime v0.19.0 pkg/metrics/workqueue.go (~130 lines)
Alternative: If registering all 7 metrics is too invasive, even just
workqueue_depthwould provide the most critical queue saturation visibility.Steps to Reproduce
collectOperatorMetrics: true(the default)kubectl port-forward deploy/datadog-operator 8383:8383curl -s http://localhost:8383/metrics | grep workqueueEnvironment
Kubernetes version: 1.34 (EKS)
Helm version: 3.x
Cloud provider: AWS
controller-runtime version (in operator): v0.20.4
Additional Context
The 7 missing workqueue metrics are:
workqueue_depthworkqueue_adds_totalworkqueue_queue_duration_secondsworkqueue_work_duration_secondsReconcile()call per itemworkqueue_retries_totalworkqueue_unfinished_work_secondsworkqueue_longest_running_processor_secondsdatadog.operator.*namespace metrics (e.g.,datadog.operator.reconcile.success,datadog.operator.*.custom_resource.count) sent via the-operatorMetricsEnabled=truepath do not include workqueue metrics.mainbranch (controller-runtime v0.22.4) —pkg/metrics/workqueue.gostill contains only constants, no registration.