fix(v3 limiter): cap no-max_tokens TPM floor at smallest configured limit by michelligabriele · Pull Request #28805 · BerriAI/litellm

michelligabriele · 2026-05-25T20:27:19Z

Relevant issues

N/A

Linear ticket

Pre-Submission checklist

I have added testing in the tests/test_litellm/ directory (adding at least 1 test is a hard requirement)
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible, it only solves 1 specific problem
I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

CI (LiteLLM team)

Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:

Screenshots / Proof of Fix

Setup: project_metadata.model_tpm_limit = {"gpt-4o-2024-05-13": 1000}, plus a request to that model that omits max_tokens.

Before:

HTTP 429
{
  "error": {
    "message": "Rate limit exceeded for model_per_project: <project-id>:gpt-4o-2024-05-13. Limit type: tokens. Current limit: 1000, Remaining: 1000. Limit resets at: ...",
    "code": "429"
  }
}

Debug log:

TPM reservation estimate: input=1, max_tokens=1024 (explicit=False), total=1025

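The estimated reservation (1025) exceeds the configured limit (1000), so the request is rejected before it runs. The same request body fails on every retry because a rejected reservation never increments the counter, so Remaining keeps showing the full limit.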

After:

HTTP 200
{"choices":[{"message":{"content":"Hello! How can I assist you today?"}}],"usage":{"total_tokens":17}}

Debug log:

TPM reservation estimate: input=1, max_tokens=250 (explicit=False), total=251

The counter advances on successful reservations. A follow-up oversized-input request correctly returns 429 with Remaining: 972, which is legitimate enforcement rather than the paradox above.
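The before/after behaviour can be illustrated with a toy in-memory window. This is only a sketch: `TokenWindow` and `try_reserve` are hypothetical names, not LiteLLM's limiter, and post-call reconciliation to actual usage is left out.

```python
from dataclasses import dataclass


@dataclass
class TokenWindow:
    """Toy TPM window (hypothetical stand-in, not the real v3 limiter)."""

    limit: int
    used: int = 0

    @property
    def remaining(self) -> int:
        return self.limit - self.used

    def try_reserve(self, tokens: int) -> bool:
        # A reservation that does not fit is rejected and leaves `used` untouched,
        # which is why retries of the same request keep seeing the full limit.
        if tokens > self.remaining:
            return False
        self.used += tokens
        return True


# Before: input=1 plus the static 1024 floor is 1025, which never fits in 1000.
before = TokenWindow(limit=1000)
before_results = [before.try_reserve(1 + 1024) for _ in range(3)]

# After: input=1 plus the capped floor of 250 is 251, which fits.
after = TokenWindow(limit=1000)
after_admitted = after.try_reserve(1 + 250)
```

The numbers 1025 and 251 are taken from the debug logs above; everything else is illustrative.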

Unit tests:

tests/test_litellm/proxy/hooks/test_tpm_concurrent.py ............ 22 passed
tests/test_litellm/proxy/hooks/test_parallel_request_limiter_v3.py 49 passed, 1 skipped

Type

🐛 Bug Fix

Changes

litellm/proxy/hooks/parallel_request_limiter_v3.py:
- _estimate_tokens_for_request gains an optional min_configured_tpm_limit: Optional[int] = None kwarg. When provided, the no-max_tokens output-budget floor is capped at min(DEFAULT_MAX_TOKENS_ESTIMATE // 4, max(1, min_configured_tpm_limit // 4)) so a small per-tenant TPM cap can't be tripped by the floor alone. Default None preserves the existing 1024 floor for callers that don't opt in.
- Production call site in async_pre_call_hook now materializes the configured tokens_per_unit values across this request's descriptors into a list, reuses it for the has_tpm_limits check, and passes min(...) of that list into the estimator. No extra pass over descriptors.
tests/test_litellm/proxy/hooks/test_tpm_concurrent.py — 4 regression tests:
- Floor caps at configured_limit // 4 when the configured TPM is small.
- Floor stays at DEFAULT_MAX_TOKENS_ESTIMATE // 4 = 1024 when the configured TPM is large, which preserves the anti-bypass intent of #27001 (fix: atomic TPM rate limit) and #27509 (fix(proxy): bound budget reservation per request instead of pinning to headroom) for large budgets.
- Floor unchanged when the kwarg is omitted (legacy callers / test stubs).
- End-to-end via async_pre_call_hook with project_metadata.model_tpm_limit = {model: 1000} and a no-max_tokens request — must not 429 on the first call.
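A minimal sketch of the capped-floor arithmetic described above. `output_floor` is a hypothetical standalone function, not the actual method in parallel_request_limiter_v3.py, and the value 4096 for DEFAULT_MAX_TOKENS_ESTIMATE is an assumption inferred from "DEFAULT_MAX_TOKENS_ESTIMATE // 4 = 1024".

```python
from typing import Optional

# Assumed value, inferred from "DEFAULT_MAX_TOKENS_ESTIMATE // 4 = 1024".
DEFAULT_MAX_TOKENS_ESTIMATE = 4096
FLOOR_FRACTION = 4


def output_floor(min_configured_tpm_limit: Optional[int] = None) -> int:
    """Output-budget floor for a request that omits max_tokens (illustrative)."""
    baseline = DEFAULT_MAX_TOKENS_ESTIMATE // FLOOR_FRACTION
    if min_configured_tpm_limit is None:
        # Callers that do not opt in keep the existing floor.
        return baseline
    # Cap the floor at a quarter of the smallest configured limit, never below 1.
    return min(baseline, max(1, min_configured_tpm_limit // FLOOR_FRACTION))
```

With a configured limit of 1000 this gives 250, matching the max_tokens=250 in the "After" debug log; a large limit or an omitted kwarg leaves the floor at 1024.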

Out of scope (deferred follow-ups): the misleading Remaining wording in the 429 body, the DEFAULT_MAX_TOKENS_ESTIMATE constant itself, and the dead UserAPIKeyAuth.tpm_limit_per_model / rpm_limit_per_model constructor fields.

codecov · 2026-05-25T20:29:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

greptile-apps · 2026-05-25T20:32:08Z

Greptile Summary

This PR fixes a paradox where no-max_tokens requests to a model with a small per-tenant TPM cap always returned 429 because the static 1024-token output-budget floor exceeded the cap before the request even ran. The fix introduces _no_max_tokens_output_floor, which caps the floor at min(baseline, configured_limit // 4), and additionally injects data["max_tokens"] when the capped floor is smaller than the baseline so concurrent unbounded generations can't overspend the limit between pre-call reservation and post-call reconciliation.

Core logic change — _estimate_tokens_for_request gains a min_configured_tpm_limit parameter and a new static helper _no_max_tokens_output_floor centralises the floor calculation using the extracted _TPM_FLOOR_FRACTION = 4 constant.
Secondary guard — async_pre_call_hook now materialises the configured TPM limits from descriptors into a list, reuses it for has_tpm_limits, computes min(), and injects data["max_tokens"] = capped_floor to bound actual model output when the capped floor is below baseline.
Test coverage — 7 new mock-only regression tests cover the floor-cap path, large-cap no-op, kwarg-omitted legacy path, end-to-end non-429 admission, max_tokens injection, injection suppression for large caps, and explicit-max_tokens preservation.

Confidence Score: 5/5

The change is narrowly scoped to the TPM reservation pre-call path and preserves all existing behaviour for callers that don't pass the new kwarg.

The fix is targeted, backward-compatible (default None preserves the previous 1024 floor), and backed by 7 purpose-built regression tests. The int() cast on tokens_per_unit guards against float leakage from JSON deserialisation, and the _TPM_FLOOR_FRACTION constant keeps both the baseline and capped-floor calculations in sync. No pre-existing tests are modified, no new network calls are introduced, and the injection of data["max_tokens"] is guarded against explicit-max_tokens and embedding requests.
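The injection guard described in this review might look roughly like the sketch below. `maybe_inject_max_tokens`, its parameters, and the `is_embedding` flag are hypothetical and chosen for illustration; the real logic lives inside async_pre_call_hook and may differ in detail.

```python
from typing import Any, Dict, List, Optional

BASELINE_FLOOR = 1024  # uncapped floor quoted in this PR
FLOOR_FRACTION = 4     # mirrors _TPM_FLOOR_FRACTION = 4 from the summary


def maybe_inject_max_tokens(
    data: Dict[str, Any],
    tokens_per_unit_values: List[Optional[float]],
    is_embedding: bool = False,
) -> Dict[str, Any]:
    """Bound model output when a small TPM limit caps the floor (illustrative)."""
    # int() guards against floats arriving from JSON-deserialised config.
    limits = [int(v) for v in tokens_per_unit_values if v is not None]
    # Skip when there is no TPM limit, for embeddings, or when the caller
    # already set max_tokens explicitly.
    if not limits or is_embedding or data.get("max_tokens") is not None:
        return data
    capped = min(BASELINE_FLOOR, max(1, min(limits) // FLOOR_FRACTION))
    if capped < BASELINE_FLOOR:
        # Make the actual generation match the reservation that was admitted.
        data["max_tokens"] = capped
    return data
```

The point of injecting the value is that the reservation and the real output budget agree, so concurrent unbounded generations cannot overspend a small limit between the pre-call reservation and the post-call reconciliation.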

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/hooks/parallel_request_limiter_v3.py | Adds _TPM_FLOOR_FRACTION constant, _no_max_tokens_output_floor static method, and capped-floor logic in async_pre_call_hook; also injects data["max_tokens"] to bound concurrent output for small TPM caps |
| tests/test_litellm/proxy/hooks/test_tpm_concurrent.py | Adds 7 new regression tests covering floor capping, max_tokens injection, large-cap no-op, and explicit-max_tokens preservation; no real network calls, all mocked via the existing rate_limiter fixture |

Reviews (2): Last reviewed commit: "fix(v3 limiter): inject matching max_tok..."

veria-ai · 2026-05-25T20:36:42Z

PR overview

All previously flagged issues have been addressed. No open security concerns remain on this pull request.

Security review

No open security issues remain on this pull request.

Fixed/addressed: 1 · PR risk: 0/10

shivamrawat1 · 2026-05-30T01:21:37Z

@greptile review

mateo-berri

LGTM; thanks!

fix(v3 limiter): cap no-max_tokens TPM floor at smallest configured limit

d3b8df8

greptile-apps Bot reviewed May 25, 2026

Comment thread litellm/proxy/hooks/parallel_request_limiter_v3.py Outdated

Comment thread litellm/proxy/hooks/parallel_request_limiter_v3.py Outdated

veria-ai Bot reviewed May 25, 2026

Comment thread litellm/proxy/hooks/parallel_request_limiter_v3.py Outdated

fix(v3 limiter): inject matching max_tokens cap when small TPM limit constrains no-max_tokens floor

aa2e0a8

mateo-berri approved these changes May 31, 2026

mateo-berri merged commit 80cf50d into litellm_internal_staging May 31, 2026
116 of 118 checks passed

mateo-berri merged 2 commits into litellm_internal_staging from litellm_fix_v3_tpm_floor_small_caps
