Fix Prometheus end-user metric cardinality tracking by ishaan-berri · Pull Request #27316 · BerriAI/litellm

ishaan-berri · 2026-05-06T18:44:27Z

Summary

Prometheus metrics that include resolved end_user labels could grow one child series per distinct end user without a cleanup path. This rebuilds the cardinality cap fix on the current staging branch, keeps the existing custom-metadata label handling intact, and applies Bugbot feedback by sharing the bounded tracking path across counters/histograms with guarded tracker state updates.

Repro

Emit litellm_spend_metric six times with end_user as the only enabled label and a configured cap of three series. Before the fix, the Prometheus child map retained all six label tuples; after the fix, it retains only the newest three.

Evidence

Terminal transcript from the repro after the fix:

children 3
series [('end-user-3',), ('end-user-4',), ('end-user-5',)]

Targeted regression suite after addressing the CI-reported test-double path:

28 passed in 0.58s

Tests

Added tests/test_litellm/integrations/test_prometheus_end_user_cardinality.py covering max-series eviction, TTL expiry, label-agnostic cleanup, failed child removal state, filtered failure metrics, default end-user label behavior, and uninitialized logger test doubles.
Ran uv run --no-sync pytest tests/enterprise/litellm_enterprise/enterprise_callbacks/test_prometheus_logging_callbacks.py::test_async_log_failure_event tests/enterprise/litellm_enterprise/enterprise_callbacks/test_prometheus_logging_callbacks.py::test_async_log_failure_event_litellm_side_rate_limit tests/test_litellm/integrations/test_prometheus_end_user_cardinality.py tests/test_litellm/integrations/test_prometheus_labels.py tests/test_litellm/integrations/test_prometheus_custom_metadata_label_counts.py tests/test_litellm/integrations/test_prometheus_cache_metrics.py::TestPrometheusCacheMetrics::test_increment_cache_metrics_on_cache_hit tests/test_litellm/integrations/test_prometheus_cache_metrics.py::TestPrometheusCacheMetrics::test_increment_cache_metrics_on_cache_miss tests/test_litellm/integrations/test_prometheus_client_ip_user_agent.py::test_async_post_call_success_hook_includes_client_ip_user_agent -q -> 28 passed.
Ran CI lint gates locally: Black check, Ruff, MyPy, circular import check, and import safety -> passed.

CI

GitHub Actions lint, UI build, and integrations workflows are green on the latest head.
CircleCI currently reports failures in provider/router/local contexts that are outside this Prometheus diff, and its public API/log pages were not accessible from this environment for deeper triage. The directly related GitHub integrations workflow passed after the follow-up fix.
Some remote checks are still pending at the end of the bounded poll.

Review

Automated review has not posted a review on the latest pushed head during the bounded poll.

Relevant issues

Addresses follow-up feedback on #27272.

Linear ticket

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible, it only solves 1 specific problem
I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack.

CI (LiteLLM team)

Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:

Screenshots / Proof of Fix

See ## Evidence.

Type

🐛 Bug Fix
✅ Test

Changes

Added a reusable bounded Prometheus series tracker with max-series and TTL cleanup.
Applied bounded tracking to end-user-labeled counter and histogram updates while holding the tracker lock through the metric update.
Preserved existing custom metadata label support on the current staging branch.
Added regression tests for the merge-conflict and Bugbot feedback paths.

Co-authored-by: ishaan-berri <ishaan-berri@users.noreply.github.com>

CLAassistant · 2026-05-06T18:44:39Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

Co-authored-by: ishaan-berri <ishaan-berri@users.noreply.github.com>

greptile-apps · 2026-05-06T19:27:09Z

Greptile Summary

This PR adds a BoundedPrometheusSeriesTracker that caps the number of Prometheus child series per metric by evicting the oldest labelsets when a configurable max_series limit is exceeded and by running TTL-based cleanup at a configurable interval. All counter updates (via _inc_labeled_counter) and the four histograms that carry the end_user label are routed through the new _apply_labeled_metric helper, and three module-level knobs are added to litellm/__init__.py for runtime tuning.

The LRU eviction logic (OrderedDict + move_to_end) and TTL cleanup are correct; the test suite covers the main eviction, TTL, failed-removal, and label-agnostic paths.
_apply_labeled_metric holds a single per-logger RLock across metric.labels(), track_series(), and the actual inc()/observe() call, which serializes all metric writes for the entire logger under one lock rather than per-metric.
_remove_metric_child silently returns False on ValueError/AttributeError without any log, so a persistent removal failure stalls eviction invisibly; and max_series=0 creates an untracked ghost child in Prometheus on every request.

Confidence Score: 4/5

Safe to merge; the cardinality fix is correct for the common case and the new code is well-tested.

The eviction logic works correctly under all the scenarios exercised by the tests. The three issues flagged — global lock contention, silent stall on persistent removal failures, and the max_series=0 ghost child — would only manifest at high concurrency, in rare Prometheus error conditions, or with an explicitly unusual configuration value. None affect the default code path.

bounded_prometheus_series_tracker.py deserves a second look on the removal-failure path and the max_series=0 branch; prometheus.py is worth a glance on the shared-lock scope if high-throughput deployments are a concern.

Important Files Changed

Filename	Overview
litellm/integrations/prometheus_helpers/bounded_prometheus_series_tracker.py	New BoundedPrometheusSeriesTracker class using OrderedDict LRU eviction and TTL cleanup; logic is sound but minor edge cases around persistent remove failures and max_series=0 semantics.
litellm/integrations/prometheus.py	Counters routed through _inc_labeled_counter → _apply_labeled_metric; four histograms with end_user labels updated similarly; coverage of all end_user-labeled metrics appears complete. Outer lock in _apply_labeled_metric serializes all metric operations through a single per-logger RLock.
litellm/init.py	Three new module-level config knobs for end-user cardinality tracking added with sensible defaults (10000 series, 3600s TTL, 60s cleanup interval).
tests/test_litellm/integrations/test_prometheus_end_user_cardinality.py	New test file covers max-series eviction, TTL expiry, label-agnostic cleanup, failed-removal state, filtered labels, default end-user behavior, and test-double support; no real network calls.

_{Reviews (1): Last reviewed commit: "Handle Prometheus test-double metric upd..." | Re-trigger Greptile}

greptile-apps · 2026-05-06T19:27:13Z

+        with tracker.lock:
+            labeled_metric = metric.labels(**labels)
+            self._track_bounded_prometheus_metric_series(metric, metric_name, labels)
+            apply_metric_update(labeled_metric)


Single RLock serializes all metric writes across all metrics

_bounded_prometheus_series_tracker.lock is one RLock instance shared by every metric in the logger. The outer with tracker.lock: block in _apply_labeled_metric holds that lock while executing metric.labels(**labels) and apply_metric_update(labeled_metric) — both of which are Prometheus-internal operations that are already thread-safe on their own. As a result, a call to litellm_spend_metric.inc() and a concurrent call to litellm_llm_api_latency_metric.observe() are fully serialized even though they update independent metrics.

In high-throughput deployments this becomes a single global write bottleneck. The outer lock is needed to guarantee that the current series is moved to the "newest" slot in the tracker before any other thread could evict it, but a per-metric lock (or storing the tracker's per-metric series dict lock separately) would scope the contention to just the metric being updated rather than the whole logger.

greptile-apps · 2026-05-06T19:27:14Z

+            if max_series is not None and max_series > 0:
+                while len(series) > max_series:
+                    tracked_label_values = next(iter(series))
+                    if not self._remove_metric_child(metric, tracked_label_values):
+                        break
+                    del series[tracked_label_values]
+            elif max_series is not None:
+                while series:
+                    tracked_label_values = next(iter(series))
+                    if not self._remove_metric_child(metric, tracked_label_values):
+                        break
+                    del series[tracked_label_values]


Persistent ValueError from metric.remove() stalls eviction permanently

When _remove_metric_child catches (AttributeError, ValueError) and returns False, the eviction while loop breaks immediately, leaving more entries than max_series in the tracker. On the next request the same oldest entry is retried — this is the intended transient-retry path and the test covers it. However, if metric.remove(*label_values) raises ValueError consistently (e.g., because the label-value tuple was built with a different ordering than the metric's registration), every subsequent request breaks out of the loop after one failed attempt and the series dict grows without bound. There is no log or counter to surface that eviction has stalled, making the condition silent.

greptile-apps · 2026-05-06T19:27:15Z

+        except (AttributeError, ValueError):
+            return False


When eviction is stalled by a persistent removal failure, the condition is currently silent. Adding a warning log here would make it visible in production logs before the series dict grows too large.

Suggested change

except (AttributeError, ValueError):

return False

except (AttributeError, ValueError):

import logging

logging.getLogger(__name__).warning(

"BoundedPrometheusSeriesTracker: failed to remove Prometheus child "

"for label_values=%s; eviction stalled until next successful removal.",

label_values,

)

return False

greptile-apps · 2026-05-06T19:27:16Z

+            elif max_series is not None:
+                while series:
+                    tracked_label_values = next(iter(series))
+                    if not self._remove_metric_child(metric, tracked_label_values):
+                        break
+                    del series[tracked_label_values]


max_series=0 leaves a ghost child in Prometheus

When max_series is set to 0, the elif max_series is not None: branch clears all entries from the tracker dict (including the one just inserted for the current request). The eviction calls metric.remove() for the current series, removing it from Prometheus. But then apply_metric_update(labeled_metric) in _apply_labeled_metric calls .inc()/.observe() on the already-removed child object, which re-creates that child in the Prometheus registry. On the next request for a different end user the same sequence repeats, and the previous call's child is never tracked again — it lingers in Prometheus indefinitely.

Setting max_series=0 is admittedly unusual, but Optional[int] with 0 is a valid Python value and is not explicitly rejected. A guard if max_series is not None and max_series <= 0: return (or documentation stating 0 is unsupported) would prevent the ghost-child scenario.

Fix Prometheus end-user metric cardinality tracking

848e1a4

Co-authored-by: ishaan-berri <ishaan-berri@users.noreply.github.com>

Handle Prometheus test-double metric updates

159557f

Co-authored-by: ishaan-berri <ishaan-berri@users.noreply.github.com>

ishaan-berri marked this pull request as ready for review May 6, 2026 19:18

greptile-apps Bot reviewed May 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix Prometheus end-user metric cardinality tracking#27316

Fix Prometheus end-user metric cardinality tracking#27316
ishaan-berri wants to merge 2 commits into
litellm_internal_stagingfrom
litellm_fix-prometheus-cardinality-conflicts-1f27

ishaan-berri commented May 6, 2026 •

edited by cursor Bot

Loading

Uh oh!

CLAassistant commented May 6, 2026

Uh oh!

greptile-apps Bot commented May 6, 2026

Important Files Changed

Uh oh!

greptile-apps Bot May 6, 2026

Uh oh!

greptile-apps Bot May 6, 2026

Uh oh!

greptile-apps Bot May 6, 2026

Uh oh!

greptile-apps Bot May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-        except (AttributeError, ValueError):
-            return False
+        except (AttributeError, ValueError):
+            import logging
+            logging.getLogger(__name__).warning(
+                "BoundedPrometheusSeriesTracker: failed to remove Prometheus child "
+                "for label_values=%s; eviction stalled until next successful removal.",
+                label_values,
+            )
+            return False

Uh oh!

Conversation

ishaan-berri commented May 6, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Repro

Evidence

Tests

CI

Review

Relevant issues

Linear ticket

Pre-Submission checklist

Delays in PR merge?

CI (LiteLLM team)

Screenshots / Proof of Fix

Type

Changes

Uh oh!

CLAassistant commented May 6, 2026

Uh oh!

greptile-apps Bot commented May 6, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

greptile-apps Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ishaan-berri commented May 6, 2026 •

edited by cursor Bot

Loading