Skip to content

Fix Prometheus end-user metric cardinality tracking#27316

Open
ishaan-berri wants to merge 2 commits into
litellm_internal_stagingfrom
litellm_fix-prometheus-cardinality-conflicts-1f27
Open

Fix Prometheus end-user metric cardinality tracking#27316
ishaan-berri wants to merge 2 commits into
litellm_internal_stagingfrom
litellm_fix-prometheus-cardinality-conflicts-1f27

Conversation

@ishaan-berri

@ishaan-berri ishaan-berri commented May 6, 2026

Copy link
Copy Markdown
Contributor

Summary

Prometheus metrics that include resolved end_user labels could grow one child series per distinct end user without a cleanup path. This rebuilds the cardinality cap fix on the current staging branch, keeps the existing custom-metadata label handling intact, and applies Bugbot feedback by sharing the bounded tracking path across counters/histograms with guarded tracker state updates.

Repro

Emit litellm_spend_metric six times with end_user as the only enabled label and a configured cap of three series. Before the fix, the Prometheus child map retained all six label tuples; after the fix, it retains only the newest three.

Evidence

Terminal transcript from the repro after the fix:

children 3
series [('end-user-3',), ('end-user-4',), ('end-user-5',)]

Targeted regression suite after addressing the CI-reported test-double path:

28 passed in 0.58s

Tests

  • Added tests/test_litellm/integrations/test_prometheus_end_user_cardinality.py covering max-series eviction, TTL expiry, label-agnostic cleanup, failed child removal state, filtered failure metrics, default end-user label behavior, and uninitialized logger test doubles.
  • Ran uv run --no-sync pytest tests/enterprise/litellm_enterprise/enterprise_callbacks/test_prometheus_logging_callbacks.py::test_async_log_failure_event tests/enterprise/litellm_enterprise/enterprise_callbacks/test_prometheus_logging_callbacks.py::test_async_log_failure_event_litellm_side_rate_limit tests/test_litellm/integrations/test_prometheus_end_user_cardinality.py tests/test_litellm/integrations/test_prometheus_labels.py tests/test_litellm/integrations/test_prometheus_custom_metadata_label_counts.py tests/test_litellm/integrations/test_prometheus_cache_metrics.py::TestPrometheusCacheMetrics::test_increment_cache_metrics_on_cache_hit tests/test_litellm/integrations/test_prometheus_cache_metrics.py::TestPrometheusCacheMetrics::test_increment_cache_metrics_on_cache_miss tests/test_litellm/integrations/test_prometheus_client_ip_user_agent.py::test_async_post_call_success_hook_includes_client_ip_user_agent -q -> 28 passed.
  • Ran CI lint gates locally: Black check, Ruff, MyPy, circular import check, and import safety -> passed.

CI

  • GitHub Actions lint, UI build, and integrations workflows are green on the latest head.
  • CircleCI currently reports failures in provider/router/local contexts that are outside this Prometheus diff, and its public API/log pages were not accessible from this environment for deeper triage. The directly related GitHub integrations workflow passed after the follow-up fix.
  • Some remote checks are still pending at the end of the bounded poll.

Review

  • Automated review has not posted a review on the latest pushed head during the bounded poll.

Relevant issues

Addresses follow-up feedback on #27272.

Linear ticket

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack.

CI (LiteLLM team)

  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

See ## Evidence.

Type

🐛 Bug Fix
✅ Test

Changes

  • Added a reusable bounded Prometheus series tracker with max-series and TTL cleanup.
  • Applied bounded tracking to end-user-labeled counter and histogram updates while holding the tracker lock through the metric update.
  • Preserved existing custom metadata label support on the current staging branch.
  • Added regression tests for the merge-conflict and Bugbot feedback paths.
Open in Web Open in Cursor 

Co-authored-by: ishaan-berri <ishaan-berri@users.noreply.github.com>
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Co-authored-by: ishaan-berri <ishaan-berri@users.noreply.github.com>
@ishaan-berri ishaan-berri marked this pull request as ready for review May 6, 2026 19:18
@greptile-apps

greptile-apps Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a BoundedPrometheusSeriesTracker that caps the number of Prometheus child series per metric by evicting the oldest labelsets when a configurable max_series limit is exceeded and by running TTL-based cleanup at a configurable interval. All counter updates (via _inc_labeled_counter) and the four histograms that carry the end_user label are routed through the new _apply_labeled_metric helper, and three module-level knobs are added to litellm/__init__.py for runtime tuning.

  • The LRU eviction logic (OrderedDict + move_to_end) and TTL cleanup are correct; the test suite covers the main eviction, TTL, failed-removal, and label-agnostic paths.
  • _apply_labeled_metric holds a single per-logger RLock across metric.labels(), track_series(), and the actual inc()/observe() call, which serializes all metric writes for the entire logger under one lock rather than per-metric.
  • _remove_metric_child silently returns False on ValueError/AttributeError without any log, so a persistent removal failure stalls eviction invisibly; and max_series=0 creates an untracked ghost child in Prometheus on every request.

Confidence Score: 4/5

Safe to merge; the cardinality fix is correct for the common case and the new code is well-tested.

The eviction logic works correctly under all the scenarios exercised by the tests. The three issues flagged — global lock contention, silent stall on persistent removal failures, and the max_series=0 ghost child — would only manifest at high concurrency, in rare Prometheus error conditions, or with an explicitly unusual configuration value. None affect the default code path.

bounded_prometheus_series_tracker.py deserves a second look on the removal-failure path and the max_series=0 branch; prometheus.py is worth a glance on the shared-lock scope if high-throughput deployments are a concern.

Important Files Changed

Filename Overview
litellm/integrations/prometheus_helpers/bounded_prometheus_series_tracker.py New BoundedPrometheusSeriesTracker class using OrderedDict LRU eviction and TTL cleanup; logic is sound but minor edge cases around persistent remove failures and max_series=0 semantics.
litellm/integrations/prometheus.py Counters routed through _inc_labeled_counter → _apply_labeled_metric; four histograms with end_user labels updated similarly; coverage of all end_user-labeled metrics appears complete. Outer lock in _apply_labeled_metric serializes all metric operations through a single per-logger RLock.
litellm/init.py Three new module-level config knobs for end-user cardinality tracking added with sensible defaults (10000 series, 3600s TTL, 60s cleanup interval).
tests/test_litellm/integrations/test_prometheus_end_user_cardinality.py New test file covers max-series eviction, TTL expiry, label-agnostic cleanup, failed-removal state, filtered labels, default end-user behavior, and test-double support; no real network calls.

Reviews (1): Last reviewed commit: "Handle Prometheus test-double metric upd..." | Re-trigger Greptile

Comment on lines +1003 to +1006
with tracker.lock:
labeled_metric = metric.labels(**labels)
self._track_bounded_prometheus_metric_series(metric, metric_name, labels)
apply_metric_update(labeled_metric)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Single RLock serializes all metric writes across all metrics

_bounded_prometheus_series_tracker.lock is one RLock instance shared by every metric in the logger. The outer with tracker.lock: block in _apply_labeled_metric holds that lock while executing metric.labels(**labels) and apply_metric_update(labeled_metric) — both of which are Prometheus-internal operations that are already thread-safe on their own. As a result, a call to litellm_spend_metric.inc() and a concurrent call to litellm_llm_api_latency_metric.observe() are fully serialized even though they update independent metrics.

In high-throughput deployments this becomes a single global write bottleneck. The outer lock is needed to guarantee that the current series is moved to the "newest" slot in the tracker before any other thread could evict it, but a per-metric lock (or storing the tracker's per-metric series dict lock separately) would scope the contention to just the metric being updated rather than the whole logger.

Comment on lines +54 to +65
if max_series is not None and max_series > 0:
while len(series) > max_series:
tracked_label_values = next(iter(series))
if not self._remove_metric_child(metric, tracked_label_values):
break
del series[tracked_label_values]
elif max_series is not None:
while series:
tracked_label_values = next(iter(series))
if not self._remove_metric_child(metric, tracked_label_values):
break
del series[tracked_label_values]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Persistent ValueError from metric.remove() stalls eviction permanently

When _remove_metric_child catches (AttributeError, ValueError) and returns False, the eviction while loop breaks immediately, leaving more entries than max_series in the tracker. On the next request the same oldest entry is retried — this is the intended transient-retry path and the test covers it. However, if metric.remove(*label_values) raises ValueError consistently (e.g., because the label-value tuple was built with a different ordering than the metric's registration), every subsequent request breaks out of the loop after one failed attempt and the series dict grows without bound. There is no log or counter to surface that eviction has stalled, making the condition silent.

Comment on lines +110 to +111
except (AttributeError, ValueError):
return False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 When eviction is stalled by a persistent removal failure, the condition is currently silent. Adding a warning log here would make it visible in production logs before the series dict grows too large.

Suggested change
except (AttributeError, ValueError):
return False
except (AttributeError, ValueError):
import logging
logging.getLogger(__name__).warning(
"BoundedPrometheusSeriesTracker: failed to remove Prometheus child "
"for label_values=%s; eviction stalled until next successful removal.",
label_values,
)
return False

Comment on lines +60 to +65
elif max_series is not None:
while series:
tracked_label_values = next(iter(series))
if not self._remove_metric_child(metric, tracked_label_values):
break
del series[tracked_label_values]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 max_series=0 leaves a ghost child in Prometheus

When max_series is set to 0, the elif max_series is not None: branch clears all entries from the tracker dict (including the one just inserted for the current request). The eviction calls metric.remove() for the current series, removing it from Prometheus. But then apply_metric_update(labeled_metric) in _apply_labeled_metric calls .inc()/.observe() on the already-removed child object, which re-creates that child in the Prometheus registry. On the next request for a different end user the same sequence repeats, and the previous call's child is never tracked again — it lingers in Prometheus indefinitely.

Setting max_series=0 is admittedly unusual, but Optional[int] with 0 is a valid Python value and is not explicitly rejected. A guard if max_series is not None and max_series <= 0: return (or documentation stating 0 is unsupported) would prevent the ghost-child scenario.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants