fix(proxy): bound budget reservation per request (backport of #27509 to 1.84.0rc2) #27539
…o remaining headroom

reserve_budget_for_request fell back to reserving the entire remaining team/key/user headroom whenever a request omitted max_tokens. That pinned the spend counter at max_budget for the duration of the in-flight request and falsely blocked every concurrent or back-to-back request until the success callback reconciled. It surfaced as an integration-test team being budget-blocked at its $2000 cap while DB spend was $0.144.

Switch the missing-max_tokens path to a fixed default of 16384 output tokens (mirroring the DEFAULT_MAX_TOKENS_ESTIMATE precedent in parallel_request_limiter_v3), and clamp explicit max_tokens at the model's max_output_tokens for reservation accounting only. The outbound request body is unchanged, so providers see whatever the caller actually sent; only the local integer used to compute reservation cost is bounded. This also prevents a hostile max_tokens=999999999 from inflating one request's reservation up to the entire team headroom.

For Opus 4.7 (output $25/M, max_output 128K) on a $2000 budget, the worst-case per-request reservation drops from "everything left" to $3.20 (128,000 tokens × $25 per million tokens), raising admittable concurrency from 1 to ~625 ($2000 / $3.20).
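The bounding rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in budget_reservation.py: the function names are hypothetical, input-token cost is ignored, and only the 16384 default and the clamp at max_output_tokens come from the description.

```python
from typing import Optional

# Default named in the description above; every other name here is hypothetical.
DEFAULT_MAX_TOKENS_ESTIMATE = 16384


def bounded_output_tokens(
    max_tokens: Optional[int], model_max_output_tokens: Optional[int]
) -> int:
    """Return the output-token count used for reservation accounting only."""
    if max_tokens is None:
        # Caller omitted max_tokens: use the fixed default, not all headroom.
        return DEFAULT_MAX_TOKENS_ESTIMATE
    if model_max_output_tokens is not None:
        # Explicit max_tokens is clamped at the model's output ceiling.
        return min(max_tokens, model_max_output_tokens)
    return max_tokens


def output_reservation_cost(
    max_tokens: Optional[int],
    model_max_output_tokens: Optional[int],
    output_cost_per_token: float,
) -> float:
    """Worst-case output cost to reserve (input-token cost omitted here)."""
    tokens = bounded_output_tokens(max_tokens, model_max_output_tokens)
    return tokens * output_cost_per_token


# Figures from the description: $25 per million output tokens, 128K max output.
opus_output_cost_per_token = 25 / 1_000_000
worst_case = output_reservation_cost(999_999_999, 128_000, opus_output_cost_per_token)
default_case = output_reservation_cost(None, 128_000, opus_output_cost_per_token)

print(f"worst case: ${worst_case:.2f}, default: ${default_case:.4f}")
print(f"admittable concurrency on $2000: ~{round(2000 / worst_case)}")
```

The outbound request is never touched in this sketch; the clamp only affects the local number used to price the reservation.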
Image-generation routes (dall-e-3, flux, etc.) have no per-token output cost, so they fell through to the no-reservation, read-time-only path. Concurrent image requests against a depleted budget could all pass common_checks (a counter exactly at max_budget passes the strict-`>` gate) and reach the provider before reconciliation caught up.

Add per-image reservation in _estimate_request_max_cost_for_model: when the model has a per-image cost field, reserve `n × cost_per_image` upfront. The atomic counter increment serializes concurrent admissions, so the second request sees the post-first-reservation counter and raises BudgetExceededError instead of silently leaking through.

Both `output_cost_per_image` and `input_cost_per_image` are honored because naming is inconsistent across providers (OpenAI dall-e-3 uses input_cost_per_image, aiml/dall-e-3 uses output_cost_per_image for the same per-generated-image price).

Per-pixel pricing (DALL-E 2 size variants) and TTS/STT routes still fall through to read-time enforcement; those are follow-ups.
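To illustrate why an upfront reservation closes the race, here is a small sketch under stated assumptions: SpendCounter, BudgetExceeded and image_reservation_cost are hypothetical stand-ins (the proxy's real counter and BudgetExceededError are not shown in this thread), and a threading.Lock stands in for the atomic increment.

```python
import threading


class BudgetExceeded(Exception):
    """Hypothetical stand-in for the proxy's BudgetExceededError."""


class SpendCounter:
    """In-memory counter; the lock plays the role of the atomic increment."""

    def __init__(self, max_budget: float, spend: float = 0.0) -> None:
        self.max_budget = max_budget
        self.spend = spend
        self._lock = threading.Lock()

    def reserve(self, amount: float) -> None:
        # Check and increment happen as one step, so concurrent admissions
        # are serialized and each sees the previous reservation.
        with self._lock:
            if self.spend + amount > self.max_budget:  # strict ">" gate
                raise BudgetExceeded("reservation would exceed max_budget")
            self.spend += amount


def image_reservation_cost(n: int, cost_per_image: float) -> float:
    """Reserve n x cost_per_image upfront, as the commit message describes."""
    return n * cost_per_image


counter = SpendCounter(max_budget=1.0, spend=0.5)
cost = image_reservation_cost(n=2, cost_per_image=0.25)
results = []


def admit() -> None:
    try:
        counter.reserve(cost)
        results.append("admitted")
    except BudgetExceeded:
        results.append("rejected")


threads = [threading.Thread(target=admit) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results), counter.spend)
```

With a read-time check alone, both requests would observe spend 0.5 and both would be admitted; with the reservation, whichever request runs second sees the counter already at max_budget plus its own cost and is rejected.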
The previous detection treated any model with input_cost_per_image
or output_cost_per_image as image generation. Several chat and
embedding models carry those fields to price multimodal vision input,
not generated images:
- gemini-3.1-pro-preview (mode=chat) has output_cost_per_image=0.00012
alongside input/output token pricing.
- azure/gpt-realtime-* (mode=chat) has input_cost_per_image=5e-6.
- amazon.titan-embed-image-v1 (mode=embedding) has
input_cost_per_image=6e-5.
For these models the image-gen branch fired first and reserved a
fraction of a cent per request, short-circuiting the token-priced
path entirely. Long Gemini chats reserved 1 × $0.00012 instead of
the true token cost.
Gate strictly on mode in {"image_generation", "image_edit"}. All 197
real image_generation entries and all 31 image_edit entries
(Flux Kontext, Stability inpaint/outpaint, etc.) carry the right mode,
so the field-presence fallback was unnecessary.
Adds regression tests for the chat-model-with-image-cost-field case
and for image_edit reservation.
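The mode gate described in this commit message might look roughly like the sketch below. The helper name and the sample entries are illustrative only; the gemini entry mirrors just the two fields quoted above, and the image-generation entry's price is a made-up example value.

```python
# Modes that should take the per-image reservation path, per the commit message.
IMAGE_RESERVATION_MODES = frozenset({"image_generation", "image_edit"})


def uses_image_reservation(model_info: dict) -> bool:
    """Gate on mode only; ignore the presence of per-image cost fields."""
    return model_info.get("mode") in IMAGE_RESERVATION_MODES


# Chat model that carries a per-image field for multimodal pricing.
gemini_like = {"mode": "chat", "output_cost_per_image": 0.00012}
# Hypothetical image-generation entry with an example per-image price.
image_gen_like = {"mode": "image_generation", "input_cost_per_image": 0.04}

print(uses_image_reservation(gemini_like))     # chat model stays on the token path
print(uses_image_reservation(image_gen_like))  # image model takes the per-image path
```

Under the earlier field-presence check, the first entry would have been treated as image generation and short-circuited the token-priced path.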
Greptile Summary
This PR (backport of #27509) fixes a budget-reservation bug where any request without an explicit
Confidence Score: 4/5

Safe to merge; the fix correctly removes the headroom-pinning behavior and replaces it with a bounded per-request estimation. The budget-reservation logic is a well-supported change with 32 passing tests and sound core arithmetic. The summation of both per-image cost fields is a latent over-reservation risk for models that happen to have both fields set in the price table. The removal of the pre-flight budget-exceeded check for unknown-model routes is an intentional, documented trade-off and does not affect models that are in the cost map.

budget_reservation.py lines 896-898 (_estimate_image_generation_cost cost summation) deserve a second look before any new image-model entry is added to model_prices_and_context_window.json with both input_cost_per_image and output_cost_per_image set.
| Filename | Overview |
|---|---|
| litellm/proxy/spend_tracking/budget_reservation.py | Core reservation logic overhauled: removes the headroom-pinning fallback, adds a bounded 16K-token default for uncapped requests, and adds image-gen/edit per-image cost reservation gated on model mode. |
| tests/test_litellm/proxy/test_budget_reservation.py | Three old tests covering the now-removed headroom-pinning path are correctly replaced; five new tests cover the 16K clamp, adversarial over-request, image-gen reservation, and the integration-team regression scenario. |
Reviews (1): Last reviewed commit: "fix(proxy): gate image-gen reservation s..."
    output_cost_per_image = _to_float(model_info.get("output_cost_per_image"))
    input_cost_per_image = _to_float(model_info.get("input_cost_per_image"))
    cost_per_image = (output_cost_per_image or 0.0) + (input_cost_per_image or 0.0)
Summing both input_cost_per_image and output_cost_per_image handles the naming inconsistency between OpenAI (input_cost_per_image) and aiml/dall-e-3 (output_cost_per_image), but would double-count for any model entry that has both fields populated with non-zero values. If a future provider entry sets both (even unintentionally), the reservation would be 2× the real price. A safer approach would be to prefer one over the other rather than add them.
Suggested change
-   output_cost_per_image = _to_float(model_info.get("output_cost_per_image"))
-   input_cost_per_image = _to_float(model_info.get("input_cost_per_image"))
-   cost_per_image = (output_cost_per_image or 0.0) + (input_cost_per_image or 0.0)
+   output_cost_per_image = _to_float(model_info.get("output_cost_per_image"))
+   input_cost_per_image = _to_float(model_info.get("input_cost_per_image"))
+   # Prefer output_cost_per_image; fall back to input_cost_per_image to handle
+   # the OpenAI dall-e-3 naming quirk. Never sum both: that would double-count
+   # a model entry that sets both fields.
+   cost_per_image = output_cost_per_image or input_cost_per_image or 0.0
        llm_router=llm_router,
    )
    if reservation_cost is None:
        reservation_cost = await _get_smallest_remaining_budget(
            counters=counters,
            current_spend_by_counter_key=current_spend_by_counter_key,
        )
    # estimate_request_max_cost still returns None when the model is unknown
    # to the cost map (no token-priced cost fields, e.g. image/audio routes).
    # In that case we fall back to read-time enforcement only.
    if reservation_cost is None or reservation_cost <= 0:
        return None
Unknown-model requests skip pre-flight budget check
The old _get_smallest_remaining_budget path would raise BudgetExceededError when a counter was already at or over budget, even for models whose cost couldn't be estimated. With that removed, a request to an unknown model whose budget is already exhausted now skips the reservation layer entirely and falls through to read-time enforcement. For single-threaded workloads this is fine, but concurrent requests to cost-unknown routes against a just-exhausted budget could all pass admission simultaneously before the first one's actual cost is reconciled. This is documented as intentional in the PR, but it is a narrower admission-control guarantee for those models.
    # estimate_request_max_cost still returns None when the model is unknown
    # to the cost map (no token-priced cost fields, e.g. image/audio routes).
    # In that case we fall back to read-time enforcement only.
    if reservation_cost is None or reservation_cost <= 0:
High: Budget reservation bypass for unestimated requests
None estimates now skip reservation entirely, so a user with a nearly depleted key/team budget can fan out concurrent requests for unestimated models or routes, such as the per-pixel image path covered by the new tests, and all of them pass the read-time budget check before any response cost is recorded. Keep an admission-side reservation for these requests, even if it is a conservative bounded fallback or a route-specific estimate, rather than falling back to read-time enforcement only.
High: Unestimated requests can bypass in-flight budget reservation
This PR changes budget reservation to skip requests whose cost cannot be estimated. For priced-but-unhandled routes like per-pixel image generation, an authenticated user can send many concurrent requests after the read-time budget check and before spend callbacks update counters.
Status: 1 new · 1 open
Backport of #27509 to `litellm_1.84.0rc2`.
Cherry-picks the three fix commits cleanly with no conflicts:
Summary
Test plan