Skip to content

[BUG] workqueue_* metrics missing from /metrics endpoint after controller-runtime v0.20.0 upgrade #2771

@sebradloff

Description

@sebradloff

Pre-submission Checklist

  • I have searched existing issues and this is not a duplicate
  • This is a Datadog Operator issue (CRDs, reconciliation, etc.), not a Datadog Agent or Datadog service problem (dashboards, monitors, etc.)

Operator version

1.23.1

Operator Helm chart version

2.18.1

Bug Report

TL;DR: The 7 standard workqueue_* metrics (including workqueue_depth) are missing from the operator's /metrics endpoint. These metrics were available transitively via controller-runtime v0.19.x but disappeared when the operator upgraded to controller-runtime v0.20.x (operator v1.16.0+). Without workqueue_depth, there is no way to detect reconciliation queue saturation under load. We're willing to contribute a PR with the fix (~130 LOC).

What happened:

The operator's /metrics endpoint does not expose any workqueue_* metrics, making it impossible to monitor reconciliation queue health when managing large numbers of DatadogMonitor or DatadogSLO CRs. This affects all controllers in the operator (DatadogAgent, DatadogMonitor, DatadogSLO, etc.) since workqueue metrics are per-controller-queue.

Without these metrics we cannot:

  • Detect queue saturationworkqueue_depth (current backlog size) is the only way to know if the operator is falling behind. The existing controller_runtime_active_workers shows workers are busy but not whether 10 or 4,000 items are waiting.
  • Isolate latency sourcesworkqueue_queue_duration_seconds separates queue wait time from processing time. A CR might take 10s end-to-end but only 200ms to reconcile — the other 9.8s is queue wait, invisible without this metric.
  • Monitor retry stormsworkqueue_retries_total reveals if failed reconciliations are creating churn.
  • Catch straggler reconciliationsworkqueue_longest_running_processor_seconds surfaces individual CRs that are stuck.

We discovered this while load testing the operator with thousands of DatadogMonitor and DatadogSLO CRs, where queue visibility is essential for capacity planning and performance analysis.

Verification:

$ kubectl port-forward deploy/datadog-operator 8383:8383
$ curl -s http://localhost:8383/metrics | grep -c workqueue
0

What I expected:

The operator should expose the standard controller-runtime workqueue metrics on its /metrics endpoint, consistent with how other controller_runtime_* metrics are available.

Root Cause:

This regression was introduced in operator v1.16.0 (PR #1924), which upgraded controller-runtime from v0.19.0 to v0.20.4.

In controller-runtime v0.20.0 (PR #3014), workqueue metric registration was moved from the public pkg/metrics/workqueue.go — which had an init() that auto-registered metrics and called workqueue.SetProvider() — to the internal pkg/internal/metrics/workqueue.go, which is only imported by the priority queue implementation (pkg/controller/priorityqueue/). Since the operator uses standard workqueues, the internal package is never imported, the init() never executes, and the 7 workqueue metrics are never registered. The public file was reduced to just metric name constants.

This was a side effect of the upstream priority queue feature (issue #2374), which moved metric registration to an internal package. Operators that depended on the transitive registration need to explicitly opt back in.

Proposed Solution:

We're willing to contribute a PR if maintainers agree this is valuable. The implementation would:

  1. Register the 7 workqueue Prometheus metrics with metrics.Registry
  2. Implement a WorkqueueMetricsProvider satisfying workqueue.MetricsProvider
  3. Call workqueue.SetProvider() during operator initialization

Reference implementation: controller-runtime v0.19.0 pkg/metrics/workqueue.go (~130 lines)

Alternative: If registering all 7 metrics is too invasive, even just workqueue_depth would provide the most critical queue saturation visibility.

Steps to Reproduce

  1. Deploy the operator with collectOperatorMetrics: true (the default)
  2. Port-forward to the operator pod: kubectl port-forward deploy/datadog-operator 8383:8383
  3. Query the metrics endpoint: curl -s http://localhost:8383/metrics | grep workqueue
  4. Observe zero results

Environment

Kubernetes version: 1.34 (EKS)
Helm version: 3.x
Cloud provider: AWS
controller-runtime version (in operator): v0.20.4

Additional Context

The 7 missing workqueue metrics are:

Metric Type What It Measures
workqueue_depth gauge Current number of items waiting in the reconciliation queue
workqueue_adds_total counter Total items enqueued
workqueue_queue_duration_seconds histogram Time items wait in queue before processing begins
workqueue_work_duration_seconds histogram Time spent in the Reconcile() call per item
workqueue_retries_total counter Number of items re-enqueued after failed reconciliation
workqueue_unfinished_work_seconds gauge Sum of in-progress work durations
workqueue_longest_running_processor_seconds gauge Duration of the longest-running reconciliation
  • The datadog.operator.* namespace metrics (e.g., datadog.operator.reconcile.success, datadog.operator.*.custom_resource.count) sent via the -operatorMetricsEnabled=true path do not include workqueue metrics.
  • Confirmed the same issue persists on the operator's main branch (controller-runtime v0.22.4) — pkg/metrics/workqueue.go still contains only constants, no registration.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions