fix(v3 limiter): cap no-max_tokens TPM floor at smallest configured limit #28805

Merged
mateo-berri merged 2 commits into litellm_internal_staging from litellm_fix_v3_tpm_floor_small_caps
May 31, 2026

Conversation

@michelligabriele

Collaborator

Relevant issues

N/A

Linear ticket

Pre-Submission checklist

  • I have added testing in the tests/test_litellm/ directory (adding at least 1 test is a hard requirement; see details)
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

CI (LiteLLM team)

  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

Setup: project_metadata.model_tpm_limit = {"gpt-4o-2024-05-13": 1000}, plus a request to that model that omits max_tokens.

Before:

HTTP 429
{
  "error": {
    "message": "Rate limit exceeded for model_per_project: <project-id>:gpt-4o-2024-05-13. Limit type: tokens. Current limit: 1000, Remaining: 1000. Limit resets at: ...",
    "code": "429"
  }
}

Debug log:

TPM reservation estimate: input=1, max_tokens=1024 (explicit=False), total=1025

The same request body fails on every retry: a rejected reservation never increments the counter, so Remaining keeps showing the full limit.
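The retry behaviour described above can be illustrated with a toy token window. This is a minimal sketch under assumptions, not the limiter's actual code: `ToyTpmWindow` and `try_reserve` are hypothetical names, and the real limiter's storage, windowing, and post-call reconciliation are not modelled. Only the numbers (limit 1000, estimates 1025 and 251) come from the debug logs in this PR.

```python
from dataclasses import dataclass


@dataclass
class ToyTpmWindow:
    """Hypothetical all-or-nothing token reservation for a single window."""

    limit: int
    used: int = 0

    @property
    def remaining(self) -> int:
        return self.limit - self.used

    def try_reserve(self, tokens: int) -> bool:
        # A rejected reservation leaves the counter untouched, so the
        # reported remaining budget never changes between retries.
        if tokens > self.remaining:
            return False
        self.used += tokens
        return True


window = ToyTpmWindow(limit=1000)

# Before the fix: input=1 plus the 1024 floor gives 1025, which never fits.
attempts = [window.try_reserve(1025) for _ in range(3)]
print(attempts, window.remaining)  # prints [False, False, False] 1000

# After the fix: input=1 plus the capped 250 floor gives 251, which fits.
admitted = window.try_reserve(251)
print(admitted, window.remaining)  # prints True 749
```

The sketch only covers the reservation step; the Remaining value in a real response also reflects post-call reconciliation against actual usage, which this toy does not attempt to reproduce.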

After:

HTTP 200
{"choices":[{"message":{"content":"Hello! How can I assist you today?"}}],"usage":{"total_tokens":17}}

Debug log:

TPM reservation estimate: input=1, max_tokens=250 (explicit=False), total=251

The counter advances on successful reservations, so a follow-up oversized-input request correctly returns 429 with Remaining: 972 (legitimate enforcement, no longer the paradox).

Unit tests:

tests/test_litellm/proxy/hooks/test_tpm_concurrent.py ............ 22 passed
tests/test_litellm/proxy/hooks/test_parallel_request_limiter_v3.py 49 passed, 1 skipped

Type

🐛 Bug Fix

Changes

  • litellm/proxy/hooks/parallel_request_limiter_v3.py:
    • _estimate_tokens_for_request gains an optional min_configured_tpm_limit: Optional[int] = None kwarg. When provided, the no-max_tokens output-budget floor is capped at min(DEFAULT_MAX_TOKENS_ESTIMATE // 4, max(1, min_configured_tpm_limit // 4)) so a small per-tenant TPM cap can't be tripped by the floor alone. Default None preserves the existing 1024 floor for callers that don't opt in.
    • Production call site in async_pre_call_hook now materializes the configured tokens_per_unit values across this request's descriptors into a list, reuses it for the has_tpm_limits check, and passes min(...) of that list into the estimator. No extra pass over descriptors.
  • tests/test_litellm/proxy/hooks/test_tpm_concurrent.py — 4 regression tests:

Out of scope (deferred follow-ups): the misleading Remaining wording in the 429 body, the DEFAULT_MAX_TOKENS_ESTIMATE constant itself, and the dead UserAPIKeyAuth.tpm_limit_per_model / rpm_limit_per_model constructor fields.
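As a worked example of the capped floor described under Changes, the formula can be written as a standalone function. This is a sketch, not the code in parallel_request_limiter_v3.py: the function name is a local stand-in, and `DEFAULT_MAX_TOKENS_ESTIMATE = 4096` is an assumed value, inferred only from the fact that `DEFAULT_MAX_TOKENS_ESTIMATE // 4` has to equal the 1024 floor quoted in this PR.

```python
from typing import Optional

# Assumed value: picked so that DEFAULT_MAX_TOKENS_ESTIMATE // 4 matches the
# 1024 floor quoted in this PR. The real constant is defined by litellm.
DEFAULT_MAX_TOKENS_ESTIMATE = 4096
FLOOR_FRACTION = 4


def no_max_tokens_output_floor(
    min_configured_tpm_limit: Optional[int] = None,
) -> int:
    """Output-token floor for requests that do not set max_tokens."""
    baseline = DEFAULT_MAX_TOKENS_ESTIMATE // FLOOR_FRACTION
    if min_configured_tpm_limit is None:
        # Callers that do not opt in keep the previous floor.
        return baseline
    # Never drop below 1 token, and never exceed the baseline.
    capped = max(1, min_configured_tpm_limit // FLOOR_FRACTION)
    return min(baseline, capped)


print(no_max_tokens_output_floor())           # prints 1024
print(no_max_tokens_output_floor(1000))       # prints 250
print(no_max_tokens_output_floor(1_000_000))  # prints 1024
print(no_max_tokens_output_floor(2))          # prints 1
```

With the 1000-token cap from the proof above, the floor becomes 250, which matches the max_tokens=250 value in the After debug log.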

@codecov

codecov Bot commented May 25, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

@greptile-apps

greptile-apps Bot commented May 25, 2026

Contributor

Greptile Summary

This PR fixes a paradox where no-max_tokens requests to a model with a small per-tenant TPM cap always returned 429 because the static 1024-token output-budget floor exceeded the cap before the request even ran. The fix introduces _no_max_tokens_output_floor, which caps the floor at min(baseline, configured_limit // 4), and additionally injects data["max_tokens"] when the capped floor is smaller than the baseline so concurrent unbounded generations can't overspend the limit between pre-call reservation and post-call reconciliation.

  • Core logic change: _estimate_tokens_for_request gains a min_configured_tpm_limit parameter, and a new static helper _no_max_tokens_output_floor centralises the floor calculation using the extracted _TPM_FLOOR_FRACTION = 4 constant.
  • Secondary guard: async_pre_call_hook now materialises the configured TPM limits from descriptors into a list, reuses it for has_tpm_limits, computes min(), and injects data["max_tokens"] = capped_floor to bound actual model output when the capped floor is below baseline.
  • Test coverage: 7 new mock-only regression tests cover the floor-cap path, large-cap no-op, kwarg-omitted legacy path, end-to-end non-429 admission, max_tokens injection, injection suppression for large caps, and explicit-max_tokens preservation.
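The injection guard summarised in the bullets above could look roughly like the following. This is a hedged sketch, not the hook's real implementation: `maybe_inject_max_tokens`, its `is_embedding` flag, and the `BASELINE_FLOOR` constant are hypothetical names for illustration, and how the real hook detects explicit output limits or embedding requests is an assumption here.

```python
from typing import Any, Dict, List, Optional

BASELINE_FLOOR = 1024  # the pre-fix floor quoted in this PR
FLOOR_FRACTION = 4


def maybe_inject_max_tokens(
    data: Dict[str, Any],
    configured_tpm_limits: List[float],
    is_embedding: bool = False,
) -> Optional[int]:
    """Set data["max_tokens"] when a small TPM cap shrinks the floor.

    Returns the injected value, or None when nothing was injected.
    """
    if is_embedding or not configured_tpm_limits:
        return None
    if data.get("max_tokens") is not None:
        # An explicit caller-supplied value is preserved as-is.
        return None
    # int() guards against float limits coming from JSON deserialisation.
    smallest_limit = int(min(configured_tpm_limits))
    capped_floor = min(BASELINE_FLOOR, max(1, smallest_limit // FLOOR_FRACTION))
    if capped_floor < BASELINE_FLOOR:
        # Bound real model output to the amount that was reserved.
        data["max_tokens"] = capped_floor
        return capped_floor
    return None


small_cap_request: Dict[str, Any] = {"model": "gpt-4o-2024-05-13"}
print(maybe_inject_max_tokens(small_cap_request, [1000.0, 50_000]))  # prints 250
print(small_cap_request["max_tokens"])  # prints 250
```

Injecting the same value that was reserved keeps the reservation and the real output bound aligned, which is the concern the summary raises about concurrent unbounded generations.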

Confidence Score: 5/5

The change is narrowly scoped to the TPM reservation pre-call path and preserves all existing behaviour for callers that don't pass the new kwarg.

The fix is targeted, backward-compatible (default None preserves the previous 1024 floor), and backed by 7 purpose-built regression tests. The int() cast on tokens_per_unit guards against float leakage from JSON deserialisation, and the _TPM_FLOOR_FRACTION constant keeps both the baseline and capped-floor calculations in sync. No pre-existing tests are modified, no new network calls are introduced, and the injection of data["max_tokens"] is skipped for requests with an explicit max_tokens and for embedding requests.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/hooks/parallel_request_limiter_v3.py | Adds _TPM_FLOOR_FRACTION constant, _no_max_tokens_output_floor static method, and capped-floor logic in async_pre_call_hook; also injects data["max_tokens"] to bound concurrent output for small TPM caps |
| tests/test_litellm/proxy/hooks/test_tpm_concurrent.py | Adds 7 new regression tests covering floor capping, max_tokens injection, large-cap no-op, and explicit-max_tokens preservation; no real network calls, all mocked via the existing rate_limiter fixture |

Reviews (2): Last reviewed commit: "fix(v3 limiter): inject matching max_tok..."

Comment thread litellm/proxy/hooks/parallel_request_limiter_v3.py Outdated
Comment thread litellm/proxy/hooks/parallel_request_limiter_v3.py Outdated
Comment thread litellm/proxy/hooks/parallel_request_limiter_v3.py Outdated
@veria-ai

veria-ai Bot commented May 25, 2026

Contributor

PR overview

All previously flagged issues have been addressed. No open security concerns remain on this pull request.

Security review

No open security issues remain on this pull request.

Fixed/addressed: 1 · PR risk: 0/10

@shivamrawat1

Collaborator

@greptile review

@mateo-berri mateo-berri left a comment

Collaborator

LGTM; thanks!

@mateo-berri mateo-berri merged commit 80cf50d into litellm_internal_staging May 31, 2026
116 of 118 checks passed
3 participants