fix(v3 limiter): cap no-max_tokens TPM floor at smallest configured limit#28805
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Greptile SummaryThis PR fixes a paradox where no-
Confidence Score: 5/5The change is narrowly scoped to the TPM reservation pre-call path and preserves all existing behaviour for callers that don't pass the new kwarg. The fix is targeted, backward-compatible (default None preserves the previous 1024 floor), and backed by 7 purpose-built regression tests. The int() cast on tokens_per_unit guards against float leakage from JSON deserialisation, and the _TPM_FLOOR_FRACTION constant keeps both the baseline and capped-floor calculations in sync. No pre-existing tests are modified, no new network calls are introduced, and the injection of data[max_tokens] is guarded against explicit-max-tokens and embedding requests. No files require special attention.
|
| Filename | Overview |
|---|---|
| litellm/proxy/hooks/parallel_request_limiter_v3.py | Adds _TPM_FLOOR_FRACTION constant, _no_max_tokens_output_floor static method, and capped-floor logic in async_pre_call_hook; also injects data[max_tokens] to bound concurrent output for small TPM caps |
| tests/test_litellm/proxy/hooks/test_tpm_concurrent.py | Adds 7 new regression tests covering floor capping, max_tokens injection, large-cap no-op, and explicit-max_tokens preservation — no real network calls, all mocked via the existing rate_limiter fixture |
Reviews (2): Last reviewed commit: "fix(v3 limiter): inject matching max_tok..." | Re-trigger Greptile
PR overviewAll previously flagged issues have been addressed. No open security concerns remain on this pull request. Security reviewNo open security issues remain on this pull request. Fixed/addressed: 1 · PR risk: 0/10 |
…constrains no-max_tokens floor
|
@greptile review |
80cf50d
into
litellm_internal_staging
Relevant issues
N/A
Linear ticket
Pre-Submission checklist
tests/test_litellm/directory, Adding at least 1 test is a hard requirement - see detailsmake test-unit@greptileaiand received a Confidence Score of at least 4/5 before requesting a maintainer reviewCI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Screenshots / Proof of Fix
With
project_metadata.model_tpm_limit = {"gpt-4o-2024-05-13": 1000}and a no-max_tokensrequest to that model.Before:
Debug log:
Same request body fails on every retry because the rejected reservation never increments the counter —
Remainingkeeps showing the full limit.After:
Debug log:
Counter advances on successful reservations — a follow-up oversized-input request correctly returns 429 with
Remaining: 972(legitimate enforcement, no longer the paradox).Unit tests:
Type
🐛 Bug Fix
Changes
litellm/proxy/hooks/parallel_request_limiter_v3.py:_estimate_tokens_for_requestgains an optionalmin_configured_tpm_limit: Optional[int] = Nonekwarg. When provided, the no-max_tokensoutput-budget floor is capped atmin(DEFAULT_MAX_TOKENS_ESTIMATE // 4, max(1, min_configured_tpm_limit // 4))so a small per-tenant TPM cap can't be tripped by the floor alone. DefaultNonepreserves the existing 1024 floor for callers that don't opt in.async_pre_call_hooknow materializes the configuredtokens_per_unitvalues across this request's descriptors into a list, reuses it for thehas_tpm_limitscheck, and passesmin(...)of that list into the estimator. No extra pass overdescriptors.tests/test_litellm/proxy/hooks/test_tpm_concurrent.py— 4 regression tests:configured_limit // 4when the configured TPM is small.DEFAULT_MAX_TOKENS_ESTIMATE // 4 = 1024when the configured TPM is large (preserves the PR fix: atomic TPM rate limit #27001 / fix(proxy): bound budget reservation per request instead of pinning to headroom #27509 anti-bypass intent for large budgets).async_pre_call_hookwithproject_metadata.model_tpm_limit = {model: 1000}and a no-max_tokensrequest — must not 429 on the first call.Out of scope (deferred follow-ups): the misleading
Remainingwording in the 429 body, theDEFAULT_MAX_TOKENS_ESTIMATEconstant itself, and the deadUserAPIKeyAuth.tpm_limit_per_model/rpm_limit_per_modelconstructor fields.