
fix(proxy): bound budget reservation per request (backport of #27509 to 1.84.0rc2)#27539

Merged
yuneng-berri merged 3 commits into `litellm_1.84.0rc2` from `litellm_/budget-reservation-rc2-backport` on May 9, 2026


Conversation

@yuneng-berri

Collaborator

Backport of #27509 to `litellm_1.84.0rc2`.

Cherry-picks the three fix commits cleanly with no conflicts:

  • `adc41ade8c` fix(proxy): bound budget reservation per request instead of pinning to remaining headroom
  • `0d551ac4f0` fix(proxy): reserve per-image cost for image-generation requests
  • `963cb4694d` fix(proxy): gate image-gen reservation strictly on model mode

Summary

  • `reserve_budget_for_request` previously fell back to reserving the entire remaining team/key/user headroom whenever a request omitted `max_tokens`. This pinned the spend counter at `max_budget` for the duration of the in-flight request and false-positive-blocked every concurrent or back-to-back request until the success callback reconciled. An internal integration-test team was being budget-blocked at its $2000 cap while DB spend was $0.144.
  • `_estimate_output_tokens` now returns `min(requested or 16384, model_info.max_output_tokens or 16384)`. The 16K default mirrors `parallel_request_limiter_v3`'s `DEFAULT_MAX_TOKENS_ESTIMATE` precedent. The model-ceiling clamp also applies to explicit `max_tokens` so a hostile `max_tokens=999999999` cannot inflate one request's reservation up to the entire team headroom.
  • `_estimate_image_generation_cost`: image-generation and image-edit routes (`dall-e-3`, `flux`, Stability inpaint, etc.) reserve `n × per-image cost` upfront. Strictly gated on `mode in {"image_generation", "image_edit"}` so chat and embedding models that carry `*_cost_per_image` fields for vision input pricing (`gemini-3.1-pro-preview`, `azure/gpt-realtime-*`, `amazon.titan-embed-image-v1`) still go through the token-priced path.
  • Reservation accounting only — the outbound request body is unchanged.
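
The estimation rule in the second bullet can be sketched as follows. This is a minimal illustration, not the code in `budget_reservation.py`: the function name `estimate_output_tokens`, its parameters, and the constant name are hypothetical, and only the `min(...)` expression and the 16384 default come from the description above.

```python
from typing import Optional

# Default taken from the PR description; the constant name is illustrative.
DEFAULT_OUTPUT_TOKENS = 16384


def estimate_output_tokens(
    requested_max_tokens: Optional[int],
    model_max_output_tokens: Optional[int],
) -> int:
    """Return a bounded output-token estimate used only for reservation accounting."""
    requested = requested_max_tokens or DEFAULT_OUTPUT_TOKENS
    ceiling = model_max_output_tokens or DEFAULT_OUTPUT_TOKENS
    return min(requested, ceiling)


print(estimate_output_tokens(None, 128_000))         # 16384: max_tokens omitted
print(estimate_output_tokens(999_999_999, 128_000))  # 128000: clamped to the model ceiling
print(estimate_output_tokens(500, 128_000))          # 500: small explicit value is kept
print(estimate_output_tokens(None, None))            # 16384: unknown ceiling falls back to the default
```

Because the result feeds the reservation only, the caller's original `max_tokens` would still be forwarded to the provider unchanged.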

Test plan

  • `uv run pytest tests/test_litellm/proxy/test_budget_reservation.py` — 32/32 pass on the rc2 base
  • Cherry-picks applied cleanly with no conflicts

…o remaining headroom

reserve_budget_for_request fell back to reserving the entire remaining
team/key/user headroom whenever a request omitted max_tokens, which
pinned the spend counter at max_budget for the duration of the
in-flight request and false-positive-blocked every concurrent or
back-to-back request until the success callback reconciled. Surfaced
as an integration-test team being budget-blocked at its $2000 cap
while DB spend was $0.144.

Switch the missing-max_tokens path to a fixed default of 16384 output
tokens (mirrors parallel_request_limiter_v3's DEFAULT_MAX_TOKENS_ESTIMATE
precedent), and clamp explicit max_tokens at the model's
max_output_tokens for reservation accounting only. The outbound request
body is unchanged, so providers see whatever the caller actually sent;
only the local integer used to compute reservation cost is bounded.
This also prevents a hostile max_tokens=999999999 from inflating one
request's reservation up to the entire team headroom.

For Opus 4.7 (output $25/M, max_output 128K) on a $2000 budget the
worst-case per-request reservation drops from "everything left" to
$3.20, raising admittable concurrency from 1 to ~625.
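
The figures above can be reproduced with a few lines of arithmetic. This sketch counts output tokens only, as the commit message does, and ignores input-token cost; the variable names are illustrative, and amounts are held as integer micro-dollars to avoid floating-point noise.

```python
MICRO = 1_000_000  # micro-dollars per dollar

output_cost_per_token = 25          # $25 per 1M tokens == 25 micro-dollars per token
model_max_output_tokens = 128_000   # model ceiling quoted in the commit message
default_output_tokens = 16_384      # default used when max_tokens is omitted
budget = 2_000 * MICRO              # $2000 cap

# Explicit max_tokens at or above the ceiling is clamped to the ceiling.
worst_case = model_max_output_tokens * output_cost_per_token
# Requests that omit max_tokens reserve the 16384-token default.
default_case = default_output_tokens * output_cost_per_token

print(f"worst-case reservation: ${worst_case / MICRO:.2f}")    # $3.20
print(f"admittable concurrency: {budget // worst_case}")       # 625
print(f"default reservation:    ${default_case / MICRO:.4f}")  # $0.4096
```
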

fix(proxy): reserve per-image cost for image-generation requests

Image-generation routes (dall-e-3, flux, etc.) have no per-token output
cost so they fell through to the no-reservation read-time-only path.
Concurrent image requests against a depleted budget could all pass
common_checks (counter exactly at max_budget passes the strict-`>`
gate) and reach the provider before reconciliation caught up.

Add per-image reservation in _estimate_request_max_cost_for_model:
when the model has a per-image cost field, reserve `n × cost_per_image`
upfront. The atomic counter increment serializes concurrent admissions,
so the second request sees the post-first-reservation counter and
raises BudgetExceededError instead of silently leaking through.
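
A minimal sketch of the admission behaviour described above, assuming an in-process counter guarded by a lock. The `SpendCounter` class and the locally defined `BudgetExceededError` are stand-ins written for this example; the proxy's real counter and error type are not shown in this PR page and may work differently.

```python
import threading


class BudgetExceededError(Exception):
    """Stand-in for the proxy's error type, defined here so the sketch runs alone."""


class SpendCounter:
    def __init__(self, max_budget: float, spend: float = 0.0) -> None:
        self.max_budget = max_budget
        self.spend = spend
        self._lock = threading.Lock()

    def reserve(self, cost: float) -> None:
        """Add `cost` to the counter atomically, refusing if it would exceed the budget."""
        with self._lock:
            new_spend = self.spend + cost
            if new_spend > self.max_budget:
                raise BudgetExceededError(
                    f"reservation of {cost} would raise spend to {new_spend}"
                )
            self.spend = new_spend


counter = SpendCounter(max_budget=10.0, spend=9.95)
counter.reserve(0.04)  # first image request is admitted
try:
    counter.reserve(0.04)  # second request sees the post-first-reservation counter
except BudgetExceededError as exc:
    print("blocked:", exc)
```

The lock makes the check and the increment one step, so two requests cannot both observe the pre-reservation spend.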

Both `output_cost_per_image` and `input_cost_per_image` are honored —
naming is inconsistent across providers (OpenAI dall-e-3 uses
input_cost_per_image, aiml/dall-e-3 uses output_cost_per_image for
the same per-generated-image price).

Per-pixel pricing (DALL-E 2 size variants) and TTS/STT routes still
fall through to read-time enforcement; those are follow-ups.

fix(proxy): gate image-gen reservation strictly on model mode

The previous detection treated any model with input_cost_per_image
or output_cost_per_image as image generation. Several chat and
embedding models carry those fields to price multimodal vision input,
not generated images:

- gemini-3.1-pro-preview (mode=chat) has output_cost_per_image=0.00012
  alongside input/output token pricing.
- azure/gpt-realtime-* (mode=chat) has input_cost_per_image=5e-6.
- amazon.titan-embed-image-v1 (mode=embedding) has
  input_cost_per_image=6e-5.

For these models the image-gen branch fired first and reserved a
fraction of a cent per request, short-circuiting the token-priced
path entirely. Long Gemini chats reserved 1 × $0.00012 instead of
the true token cost.

Gate strictly on mode in {"image_generation", "image_edit"}. All 197
real image_generation entries and all 31 image_edit entries
(Flux Kontext, Stability inpaint/outpaint, etc.) carry the right mode,
so the field-presence fallback was unnecessary.

Adds regression tests for the chat-model-with-image-cost-field case
and for image_edit reservation.
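
The mode gate can be illustrated with a small sketch. The function name `estimate_image_cost` and the dictionary shape are assumptions made for this example; the summation of the two cost fields mirrors the lines quoted in the review below, the `gemini` price is the one cited above, and the `0.04` image price is purely illustrative.

```python
from typing import Any, Mapping, Optional

IMAGE_MODES = {"image_generation", "image_edit"}


def estimate_image_cost(model_info: Mapping[str, Any], n: int = 1) -> Optional[float]:
    """Return n x per-image cost for image routes, or None for every other mode."""
    if model_info.get("mode") not in IMAGE_MODES:
        return None  # chat and embedding models stay on the token-priced path
    cost_per_image = (model_info.get("output_cost_per_image") or 0.0) + (
        model_info.get("input_cost_per_image") or 0.0
    )
    if cost_per_image <= 0:
        return None  # nothing to reserve; read-time enforcement still applies
    return n * cost_per_image


chat_model = {"mode": "chat", "output_cost_per_image": 0.00012}
image_model = {"mode": "image_generation", "input_cost_per_image": 0.04}

print(estimate_image_cost(chat_model))        # None: not treated as image generation
print(estimate_image_cost(image_model, n=2))  # 0.08
```
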
@greptile-apps

greptile-apps Bot commented May 9, 2026

Contributor

Greptile Summary

This PR (backport of #27509) fixes a budget-reservation bug where any request without an explicit max_tokens caused the spend counter to be pinned at the full team/key budget for the duration of the in-flight request, false-positive-blocking all concurrent or back-to-back requests until the success callback reconciled. The fix replaces the old headroom-reservation fallback with a bounded 16K-token default, adds a model-ceiling clamp to prevent adversarial max_tokens inflation, and introduces per-image cost reservation for image_generation/image_edit mode models.

  • Core fix (reserve_budget_for_request): _get_smallest_remaining_budget is removed; when estimate_request_max_cost returns None (unknown model) the request skips reservation entirely, falling back to read-time enforcement rather than pinning the counter at max_budget.
  • Output-token estimation (_estimate_output_tokens): now always returns a finite value — min(requested or 16384, model_ceiling or 16384) — so uncapped chat requests get a bounded reservation.
  • Image-generation reservation (_estimate_image_generation_cost): strictly gated on mode in {"image_generation","image_edit"} to avoid misclassifying multimodal chat models that carry *_cost_per_image fields for vision-input pricing.

Confidence Score: 4/5

Safe to merge; the fix correctly removes the headroom-pinning behavior and replaces it with a bounded per-request estimation.

The budget-reservation logic is a well-supported change with 32 passing tests and sound core arithmetic. The summation of both per-image cost fields is a latent over-reservation risk for models that happen to have both fields set in the price table. The removal of the pre-flight budget-exceeded check for unknown-model routes is an intentional, documented trade-off and does not affect models that are in the cost map.

budget_reservation.py lines 896-898 (_estimate_image_generation_cost cost summation) deserve a second look before any new image-model entry is added to model_prices_and_context_window.json with both input_cost_per_image and output_cost_per_image set.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `litellm/proxy/spend_tracking/budget_reservation.py` | Core reservation logic overhauled: removes the headroom-pinning fallback, adds a bounded 16K-token default for uncapped requests, and adds image-gen/edit per-image cost reservation gated on model mode. |
| `tests/test_litellm/proxy/test_budget_reservation.py` | Three old tests covering the now-removed headroom-pinning path are correctly replaced; five new tests cover the 16K clamp, adversarial over-request, image-gen reservation, and the integration-team regression scenario. |

Reviews (1): Last reviewed commit: "fix(proxy): gate image-gen reservation s..."

Comment on lines +896 to +898
output_cost_per_image = _to_float(model_info.get("output_cost_per_image"))
input_cost_per_image = _to_float(model_info.get("input_cost_per_image"))
cost_per_image = (output_cost_per_image or 0.0) + (input_cost_per_image or 0.0)


P2 Summing both input_cost_per_image and output_cost_per_image handles the naming inconsistency between OpenAI (input_cost_per_image) and aiml/dall-e-3 (output_cost_per_image), but would double-count for any model entry that has both fields populated with non-zero values. If a future provider entry sets both (even unintentionally), the reservation would be 2× the real price. A safer approach would be to prefer one over the other rather than add them.

Suggested change
- output_cost_per_image = _to_float(model_info.get("output_cost_per_image"))
- input_cost_per_image = _to_float(model_info.get("input_cost_per_image"))
- cost_per_image = (output_cost_per_image or 0.0) + (input_cost_per_image or 0.0)
+ output_cost_per_image = _to_float(model_info.get("output_cost_per_image"))
+ input_cost_per_image = _to_float(model_info.get("input_cost_per_image"))
+ # Prefer output_cost_per_image; fall back to input_cost_per_image to handle
+ # the OpenAI dall-e-3 naming quirk. Never sum both — that would double-count
+ # a model entry that sets both fields.
+ cost_per_image = output_cost_per_image or input_cost_per_image or 0.0
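
To make the difference concrete, the two expressions can be compared side by side. The helper names `summed` and `preferred` are hypothetical and the `0.04` price is illustrative; only the two expressions themselves come from the diff above.

```python
from typing import Optional


def summed(output_cost: Optional[float], input_cost: Optional[float]) -> float:
    """Expression from the merged code: add whichever fields are present."""
    return (output_cost or 0.0) + (input_cost or 0.0)


def preferred(output_cost: Optional[float], input_cost: Optional[float]) -> float:
    """Expression from the suggestion: take the first non-zero field."""
    return output_cost or input_cost or 0.0


# Only one field populated: both expressions agree.
print(summed(None, 0.04), preferred(None, 0.04))
# Both fields populated: the sum is twice the preferred value.
print(summed(0.04, 0.04), preferred(0.04, 0.04))
```

Note that `or` treats an explicit `0.0` the same as a missing field, so an entry with `output_cost_per_image=0.0` would fall back to `input_cost_per_image` under the suggested form.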

Comment on lines 96 to 102
llm_router=llm_router,
)
if reservation_cost is None:
reservation_cost = await _get_smallest_remaining_budget(
counters=counters,
current_spend_by_counter_key=current_spend_by_counter_key,
)
# estimate_request_max_cost still returns None when the model is unknown
# to the cost map (no token-priced cost fields, e.g. image/audio routes).
# In that case we fall back to read-time enforcement only.
if reservation_cost is None or reservation_cost <= 0:
return None


P2 Unknown-model requests skip pre-flight budget check

The old _get_smallest_remaining_budget path would raise BudgetExceededError when a counter was already at or over budget, even for models whose cost couldn't be estimated. With that removed, a request to an unknown model whose budget is already exhausted now skips the reservation layer entirely and falls through to read-time enforcement. For single-threaded workloads this is fine, but concurrent requests to cost-unknown routes against a just-exhausted budget could all pass admission simultaneously before the first one's actual cost is reconciled. This is documented as intentional in the PR, but it is a narrower admission-control guarantee for those models.
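
The concurrency gap described here can be modelled with a simplified, deterministic simulation. The budget and per-request cost are hypothetical values in integer cents, and `read_time_gate` is an illustrative stand-in for the strict `>` check, not the proxy's actual function.

```python
MAX_BUDGET_CENTS = 1_000  # hypothetical $10.00 cap
COST_CENTS = 4            # hypothetical real cost of each request


def read_time_gate(spend_cents: int) -> bool:
    """Admit unless recorded spend is strictly greater than the budget."""
    return not spend_cents > MAX_BUDGET_CENTS


spend_cents = 1_000  # counter sits exactly at the cap

# Five concurrent requests are all checked before any cost is recorded.
admitted = [read_time_gate(spend_cents) for _ in range(5)]

# Success callbacks then reconcile the real cost of every admitted request.
spend_cents += COST_CENTS * sum(admitted)

print(admitted)                        # [True, True, True, True, True]
print(spend_cents - MAX_BUDGET_CENTS)  # 20 cents over budget
```

With no reservation taken at admission, nothing changes the counter between checks, so every request in the burst sees the same pre-burst spend.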

# estimate_request_max_cost still returns None when the model is unknown
# to the cost map (no token-priced cost fields, e.g. image/audio routes).
# In that case we fall back to read-time enforcement only.
if reservation_cost is None or reservation_cost <= 0:


High: Budget reservation bypass for unestimated requests

None estimates now skip reservation entirely, so a user with a nearly depleted key/team budget can fan out concurrent requests for unestimated models or routes, such as the per-pixel image path covered by the new tests, and all of them pass the read-time budget check before any response cost is recorded. Keep an admission-side reservation for these requests, even if it is a conservative bounded fallback or a route-specific estimate, rather than falling back to read-time enforcement only.

@veria-ai

veria-ai Bot commented May 9, 2026

Contributor

High: Unestimated requests can bypass in-flight budget reservation

This PR changes budget reservation to skip requests whose cost cannot be estimated. For priced-but-unhandled routes like per-pixel image generation, an authenticated user can send many concurrent requests after the read-time budget check and before spend callbacks update counters.


Status: 1 new · 1 open
Risk: 7/10

yuneng-berri merged commit 18c14d9 into `litellm_1.84.0rc2` on May 9, 2026
111 of 113 checks passed
yuneng-berri deleted the `litellm_/budget-reservation-rc2-backport` branch on May 9, 2026, 17:49