Fix Prometheus end-user metric cardinality tracking#27316
Conversation
Co-authored-by: ishaan-berri <ishaan-berri@users.noreply.github.com>
|
|
Co-authored-by: ishaan-berri <ishaan-berri@users.noreply.github.com>
Greptile SummaryThis PR adds a
Confidence Score: 4/5Safe to merge; the cardinality fix is correct for the common case and the new code is well-tested. The eviction logic works correctly under all the scenarios exercised by the tests. The three issues flagged — global lock contention, silent stall on persistent removal failures, and the max_series=0 ghost child — would only manifest at high concurrency, in rare Prometheus error conditions, or with an explicitly unusual configuration value. None affect the default code path. bounded_prometheus_series_tracker.py deserves a second look on the removal-failure path and the max_series=0 branch; prometheus.py is worth a glance on the shared-lock scope if high-throughput deployments are a concern.
|
| Filename | Overview |
|---|---|
| litellm/integrations/prometheus_helpers/bounded_prometheus_series_tracker.py | New BoundedPrometheusSeriesTracker class using OrderedDict LRU eviction and TTL cleanup; logic is sound but minor edge cases around persistent remove failures and max_series=0 semantics. |
| litellm/integrations/prometheus.py | Counters routed through _inc_labeled_counter → _apply_labeled_metric; four histograms with end_user labels updated similarly; coverage of all end_user-labeled metrics appears complete. Outer lock in _apply_labeled_metric serializes all metric operations through a single per-logger RLock. |
| litellm/init.py | Three new module-level config knobs for end-user cardinality tracking added with sensible defaults (10000 series, 3600s TTL, 60s cleanup interval). |
| tests/test_litellm/integrations/test_prometheus_end_user_cardinality.py | New test file covers max-series eviction, TTL expiry, label-agnostic cleanup, failed-removal state, filtered labels, default end-user behavior, and test-double support; no real network calls. |
Reviews (1): Last reviewed commit: "Handle Prometheus test-double metric upd..." | Re-trigger Greptile
| with tracker.lock: | ||
| labeled_metric = metric.labels(**labels) | ||
| self._track_bounded_prometheus_metric_series(metric, metric_name, labels) | ||
| apply_metric_update(labeled_metric) |
There was a problem hiding this comment.
Single RLock serializes all metric writes across all metrics
_bounded_prometheus_series_tracker.lock is one RLock instance shared by every metric in the logger. The outer with tracker.lock: block in _apply_labeled_metric holds that lock while executing metric.labels(**labels) and apply_metric_update(labeled_metric) — both of which are Prometheus-internal operations that are already thread-safe on their own. As a result, a call to litellm_spend_metric.inc() and a concurrent call to litellm_llm_api_latency_metric.observe() are fully serialized even though they update independent metrics.
In high-throughput deployments this becomes a single global write bottleneck. The outer lock is needed to guarantee that the current series is moved to the "newest" slot in the tracker before any other thread could evict it, but a per-metric lock (or storing the tracker's per-metric series dict lock separately) would scope the contention to just the metric being updated rather than the whole logger.
| if max_series is not None and max_series > 0: | ||
| while len(series) > max_series: | ||
| tracked_label_values = next(iter(series)) | ||
| if not self._remove_metric_child(metric, tracked_label_values): | ||
| break | ||
| del series[tracked_label_values] | ||
| elif max_series is not None: | ||
| while series: | ||
| tracked_label_values = next(iter(series)) | ||
| if not self._remove_metric_child(metric, tracked_label_values): | ||
| break | ||
| del series[tracked_label_values] |
There was a problem hiding this comment.
Persistent
ValueError from metric.remove() stalls eviction permanently
When _remove_metric_child catches (AttributeError, ValueError) and returns False, the eviction while loop breaks immediately, leaving more entries than max_series in the tracker. On the next request the same oldest entry is retried — this is the intended transient-retry path and the test covers it. However, if metric.remove(*label_values) raises ValueError consistently (e.g., because the label-value tuple was built with a different ordering than the metric's registration), every subsequent request breaks out of the loop after one failed attempt and the series dict grows without bound. There is no log or counter to surface that eviction has stalled, making the condition silent.
| except (AttributeError, ValueError): | ||
| return False |
There was a problem hiding this comment.
When eviction is stalled by a persistent removal failure, the condition is currently silent. Adding a warning log here would make it visible in production logs before the series dict grows too large.
| except (AttributeError, ValueError): | |
| return False | |
| except (AttributeError, ValueError): | |
| import logging | |
| logging.getLogger(__name__).warning( | |
| "BoundedPrometheusSeriesTracker: failed to remove Prometheus child " | |
| "for label_values=%s; eviction stalled until next successful removal.", | |
| label_values, | |
| ) | |
| return False |
| elif max_series is not None: | ||
| while series: | ||
| tracked_label_values = next(iter(series)) | ||
| if not self._remove_metric_child(metric, tracked_label_values): | ||
| break | ||
| del series[tracked_label_values] |
There was a problem hiding this comment.
max_series=0 leaves a ghost child in Prometheus
When max_series is set to 0, the elif max_series is not None: branch clears all entries from the tracker dict (including the one just inserted for the current request). The eviction calls metric.remove() for the current series, removing it from Prometheus. But then apply_metric_update(labeled_metric) in _apply_labeled_metric calls .inc()/.observe() on the already-removed child object, which re-creates that child in the Prometheus registry. On the next request for a different end user the same sequence repeats, and the previous call's child is never tracked again — it lingers in Prometheus indefinitely.
Setting max_series=0 is admittedly unusual, but Optional[int] with 0 is a valid Python value and is not explicitly rejected. A guard if max_series is not None and max_series <= 0: return (or documentation stating 0 is unsupported) would prevent the ghost-child scenario.
Summary
Prometheus metrics that include resolved
end_userlabels could grow one child series per distinct end user without a cleanup path. This rebuilds the cardinality cap fix on the current staging branch, keeps the existing custom-metadata label handling intact, and applies Bugbot feedback by sharing the bounded tracking path across counters/histograms with guarded tracker state updates.Repro
Emit
litellm_spend_metricsix times withend_useras the only enabled label and a configured cap of three series. Before the fix, the Prometheus child map retained all six label tuples; after the fix, it retains only the newest three.Evidence
Terminal transcript from the repro after the fix:
Targeted regression suite after addressing the CI-reported test-double path:
Tests
tests/test_litellm/integrations/test_prometheus_end_user_cardinality.pycovering max-series eviction, TTL expiry, label-agnostic cleanup, failed child removal state, filtered failure metrics, default end-user label behavior, and uninitialized logger test doubles.uv run --no-sync pytest tests/enterprise/litellm_enterprise/enterprise_callbacks/test_prometheus_logging_callbacks.py::test_async_log_failure_event tests/enterprise/litellm_enterprise/enterprise_callbacks/test_prometheus_logging_callbacks.py::test_async_log_failure_event_litellm_side_rate_limit tests/test_litellm/integrations/test_prometheus_end_user_cardinality.py tests/test_litellm/integrations/test_prometheus_labels.py tests/test_litellm/integrations/test_prometheus_custom_metadata_label_counts.py tests/test_litellm/integrations/test_prometheus_cache_metrics.py::TestPrometheusCacheMetrics::test_increment_cache_metrics_on_cache_hit tests/test_litellm/integrations/test_prometheus_cache_metrics.py::TestPrometheusCacheMetrics::test_increment_cache_metrics_on_cache_miss tests/test_litellm/integrations/test_prometheus_client_ip_user_agent.py::test_async_post_call_success_hook_includes_client_ip_user_agent -q-> 28 passed.CI
Review
Relevant issues
Addresses follow-up feedback on #27272.
Linear ticket
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
tests/test_litellm/directory, Adding at least 1 test is a hard requirement - see detailsmake test-unit@greptileaiand received a Confidence Score of at least 4/5 before requesting a maintainer reviewDelays in PR merge?
If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack.
CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Screenshots / Proof of Fix
See
## Evidence.Type
🐛 Bug Fix
✅ Test
Changes