fix(proxy): bound budget reservation per request (backport of #27509 to 1.84.0rc2) by yuneng-berri · Pull Request #27539 · BerriAI/litellm

yuneng-berri · 2026-05-09T17:31:12Z

Backport of #27509 to `litellm_1.84.0rc2`.

Cherry-picks the three fix commits cleanly with no conflicts:

`adc41ade8c` fix(proxy): bound budget reservation per request instead of pinning to remaining headroom
`0d551ac4f0` fix(proxy): reserve per-image cost for image-generation requests
`963cb4694d` fix(proxy): gate image-gen reservation strictly on model mode

Summary

`reserve_budget_for_request` previously fell back to reserving the entire remaining team/key/user headroom whenever a request omitted `max_tokens`. This pinned the spend counter at `max_budget` for the duration of the in-flight request and false-positive-blocked every concurrent or back-to-back request until the success callback reconciled. An internal integration-test team was being budget-blocked at its $2000 cap while DB spend was $0.144.
`_estimate_output_tokens` now returns `min(requested or 16384, model_info.max_output_tokens or 16384)`. The 16K default mirrors `parallel_request_limiter_v3`'s `DEFAULT_MAX_TOKENS_ESTIMATE` precedent. The model-ceiling clamp also applies to explicit `max_tokens` so a hostile `max_tokens=999999999` cannot inflate one request's reservation up to the entire team headroom.
`_estimate_image_generation_cost`: image-generation and image-edit routes (`dall-e-3`, `flux`, Stability inpaint, etc.) reserve `n × per-image cost` upfront. Strictly gated on `mode in {"image_generation", "image_edit"}` so chat models that carry `_cost_per_image` for vision input pricing (`gemini-3.1-pro-preview`, `azure/gpt-realtime-*`, `amazon.titan-embed-image-v1`) still go through the token-priced path.
Reservation accounting only — the outbound request body is unchanged.
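The token estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `budget_reservation.py`: the function signature and the constant name are hypothetical, and only the `min(requested or 16384, model_info.max_output_tokens or 16384)` rule comes from the description.

```python
from typing import Optional

# 16K default quoted in the PR description; the constant name here is illustrative.
DEFAULT_MAX_TOKENS_ESTIMATE = 16384


def estimate_output_tokens(
    requested_max_tokens: Optional[int],
    model_max_output_tokens: Optional[int],
) -> int:
    """Return a finite output-token count used for reservation accounting only.

    Hypothetical signature: the real helper reads these values from the
    request body and the model info, which this sketch does not model.
    """
    requested = requested_max_tokens or DEFAULT_MAX_TOKENS_ESTIMATE
    ceiling = model_max_output_tokens or DEFAULT_MAX_TOKENS_ESTIMATE
    # Clamp to the model ceiling so an oversized max_tokens cannot inflate
    # the reservation; the outbound request is not modified.
    return min(requested, ceiling)
```

Under this rule a request with no `max_tokens` reserves for at most 16384 output tokens, and an explicit `max_tokens=999999999` reserves for no more than the model's own ceiling.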

Test plan

`uv run pytest tests/test_litellm/proxy/test_budget_reservation.py` — 32/32 pass on the rc2 base
Cherry-picks applied cleanly with no conflicts

…o remaining headroom

reserve_budget_for_request fell back to reserving the entire remaining team/key/user headroom whenever a request omitted max_tokens, which pinned the spend counter at max_budget for the duration of the in-flight request and false-positive-blocked every concurrent or back-to-back request until the success callback reconciled. Surfaced as an integration-test team being budget-blocked at its $2000 cap while DB spend was $0.144.

Switch the missing-max_tokens path to a fixed default of 16384 output tokens (mirrors parallel_request_limiter_v3's DEFAULT_MAX_TOKENS_ESTIMATE precedent), and clamp explicit max_tokens at the model's max_output_tokens for reservation accounting only. The outbound request body is unchanged, so providers see whatever the caller actually sent; only the local integer used to compute reservation cost is bounded. This also prevents a hostile max_tokens=999999999 from inflating one request's reservation up to the entire team headroom.

For Opus 4.7 (output $25/M, max_output 128K) on a $2000 budget the worst-case per-request reservation drops from "everything left" to $3.20, raising admittable concurrency from 1 to ~625.
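The Opus 4.7 figures can be checked with a few lines of arithmetic. This sketch assumes "128K" means 128,000 tokens (the value that yields the quoted $3.20) and counts output-token cost only, as the commit message does; the variable names are illustrative.

```python
from decimal import Decimal

# Figures quoted in the commit message.
output_cost_per_token = Decimal("25") / Decimal("1000000")  # $25 per million output tokens
max_output_tokens = 128_000                                  # assumption: "128K" = 128,000
team_budget = Decimal("2000")

# Worst case: a request clamped at the model's output ceiling.
worst_case_reservation = output_cost_per_token * max_output_tokens

# How many such worst-case reservations fit in the budget at once.
admittable_concurrency = team_budget / worst_case_reservation

print(worst_case_reservation, admittable_concurrency)
```

Both quoted numbers follow directly: 128,000 tokens at $0.000025 each is $3.20, and $2000 divided by $3.20 is 625.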

Image-generation routes (dall-e-3, flux, etc.) have no per-token output cost so they fell through to the no-reservation read-time-only path. Concurrent image requests against a depleted budget could all pass common_checks (counter exactly at max_budget passes the strict-`>` gate) and reach the provider before reconciliation caught up. Add per-image reservation in _estimate_request_max_cost_for_model: when the model has a per-image cost field, reserve `n × cost_per_image` upfront. The atomic counter increment serializes concurrent admissions, so the second request sees the post-first-reservation counter and raises BudgetExceededError instead of silently leaking through. Both `output_cost_per_image` and `input_cost_per_image` are honored — naming is inconsistent across providers (OpenAI dall-e-3 uses input_cost_per_image, aiml/dall-e-3 uses output_cost_per_image for the same per-generated-image price). Per-pixel pricing (DALL-E 2 size variants) and TTS/STT routes still fall through to read-time enforcement; those are follow-ups.
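The serialization argument can be illustrated with a small counter that checks and increments under one lock. This is a sketch under assumptions, not the proxy's counter: the class, the lock-based implementation, and the `reserve_images` helper are all hypothetical, and `BudgetExceededError` is redefined locally so the example runs on its own.

```python
import threading


class BudgetExceededError(Exception):
    """Local stand-in for the proxy's budget error (assumption)."""


class SpendCounter:
    """Hypothetical counter whose check and increment happen as one atomic step."""

    def __init__(self, max_budget: float, spend: float = 0.0) -> None:
        self.max_budget = max_budget
        self.spend = spend
        self._lock = threading.Lock()

    def reserve(self, cost: float) -> float:
        # Holding the lock across check and increment means a second caller
        # always sees the counter after the first caller's reservation.
        with self._lock:
            new_spend = self.spend + cost
            if new_spend > self.max_budget:
                raise BudgetExceededError(
                    f"reservation of {cost} would exceed budget {self.max_budget}"
                )
            self.spend = new_spend
            return new_spend


def reserve_images(counter: SpendCounter, n: int, cost_per_image: float) -> float:
    """Reserve n x cost_per_image upfront, before the provider is called."""
    return counter.reserve(n * cost_per_image)
```

With a budget of 1.00, recorded spend of 0.95, and a per-image cost of 0.04 (illustrative values), the first `reserve_images` call succeeds and a second identical call raises `BudgetExceededError` instead of reaching the provider.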

The previous detection treated any model with input_cost_per_image or output_cost_per_image as image generation. Several chat and embedding models carry those fields to price multimodal vision input, not generated images:

- gemini-3.1-pro-preview (mode=chat) has output_cost_per_image=0.00012 alongside input/output token pricing.
- azure/gpt-realtime-* (mode=chat) has input_cost_per_image=5e-6.
- amazon.titan-embed-image-v1 (mode=embedding) has input_cost_per_image=6e-5.

For these models the image-gen branch fired first and reserved a fraction of a cent per request, short-circuiting the token-priced path entirely. Long Gemini chats reserved 1 × $0.00012 instead of the true token cost.

Gate strictly on mode in {"image_generation", "image_edit"}. All 197 real image_generation entries and all 31 image_edit entries (Flux Kontext, Stability inpaint/outpaint, etc.) carry the right mode, so the field-presence fallback was unnecessary.

Adds regression tests for the chat-model-with-image-cost-field case and for image_edit reservation.
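The mode gate can be sketched as below. This is an illustration under assumptions rather than the merged `_estimate_image_generation_cost`: the signature, the set name, and the sample model entries are hypothetical. The two cost fields are summed, as in the merged diff, and the only price taken from the source is the 0.00012 Gemini figure.

```python
from typing import Any, Mapping, Optional

# Modes that qualify for per-image reservation, per the commit message.
IMAGE_RESERVATION_MODES = {"image_generation", "image_edit"}


def estimate_image_generation_cost(
    model_info: Mapping[str, Any], n: int = 1
) -> Optional[float]:
    """Return n x per-image cost for image routes, or None for anything else."""
    # Gate on mode first: cost fields alone do not imply an image route.
    if model_info.get("mode") not in IMAGE_RESERVATION_MODES:
        return None
    output_cost = float(model_info.get("output_cost_per_image") or 0.0)
    input_cost = float(model_info.get("input_cost_per_image") or 0.0)
    cost_per_image = output_cost + input_cost  # summed, as in the merged diff
    if cost_per_image <= 0:
        return None
    return max(n, 1) * cost_per_image


# A chat model with a vision-input image price stays on the token-priced path.
chat_model = {"mode": "chat", "output_cost_per_image": 0.00012}
# Hypothetical image model entry; the 0.04 price is made up for illustration.
image_model = {"mode": "image_generation", "input_cost_per_image": 0.04}
```

Here `estimate_image_generation_cost(chat_model)` returns `None`, so the caller would continue to the token-based estimate, while the image model with `n=2` reserves twice its per-image price.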

greptile-apps · 2026-05-09T17:34:41Z

Greptile Summary

This PR (backport of #27509) fixes a budget-reservation bug where any request without an explicit max_tokens caused the spend counter to be pinned at the full team/key budget for the duration of the in-flight request, false-positive-blocking all concurrent or back-to-back requests until the success callback reconciled. The fix replaces the old headroom-reservation fallback with a bounded 16K-token default, adds a model-ceiling clamp to prevent adversarial max_tokens inflation, and introduces per-image cost reservation for image_generation/image_edit mode models.

Core fix (reserve_budget_for_request): _get_smallest_remaining_budget is removed; when estimate_request_max_cost returns None (unknown model) the request skips reservation entirely, falling back to read-time enforcement rather than pinning the counter at max_budget.
Output-token estimation (_estimate_output_tokens): now always returns a finite value — min(requested or 16384, model_ceiling or 16384) — so uncapped chat requests get a bounded reservation.
Image-generation reservation (_estimate_image_generation_cost): strictly gated on mode in {"image_generation","image_edit"} to avoid misclassifying multimodal chat models that carry *_cost_per_image fields for vision-input pricing.

Confidence Score: 4/5

Safe to merge; the fix correctly removes the headroom-pinning behavior and replaces it with a bounded per-request estimation.

The budget-reservation logic is a well-supported change with 32 passing tests and sound core arithmetic. The summation of both per-image cost fields is a latent over-reservation risk for models that happen to have both fields set in the price table. The removal of the pre-flight budget-exceeded check for unknown-model routes is an intentional, documented trade-off and does not affect models that are in the cost map.

budget_reservation.py lines 896-898 (_estimate_image_generation_cost cost summation) deserve a second look before any new image-model entry is added to model_prices_and_context_window.json with both input_cost_per_image and output_cost_per_image set.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/spend_tracking/budget_reservation.py | Core reservation logic overhauled: removes the headroom-pinning fallback, adds a bounded 16K-token default for uncapped requests, and adds image-gen/edit per-image cost reservation gated on model mode. |
| tests/test_litellm/proxy/test_budget_reservation.py | Three old tests covering the now-removed headroom-pinning path are correctly replaced; five new tests cover the 16K clamp, adversarial over-request, image-gen reservation, and the integration-team regression scenario. |

Reviews (1): Last reviewed commit: "fix(proxy): gate image-gen reservation s..."

greptile-apps · 2026-05-09T17:34:45Z

+    output_cost_per_image = _to_float(model_info.get("output_cost_per_image"))
+    input_cost_per_image = _to_float(model_info.get("input_cost_per_image"))
+    cost_per_image = (output_cost_per_image or 0.0) + (input_cost_per_image or 0.0)


Summing both input_cost_per_image and output_cost_per_image handles the naming inconsistency between OpenAI (input_cost_per_image) and aiml/dall-e-3 (output_cost_per_image), but would double-count for any model entry that has both fields populated with non-zero values. If a future provider entry sets both (even unintentionally), the reservation would be 2× the real price. A safer approach would be to prefer one over the other rather than add them.

Suggested change

-    output_cost_per_image = _to_float(model_info.get("output_cost_per_image"))
-    input_cost_per_image = _to_float(model_info.get("input_cost_per_image"))
-    cost_per_image = (output_cost_per_image or 0.0) + (input_cost_per_image or 0.0)
+    output_cost_per_image = _to_float(model_info.get("output_cost_per_image"))
+    input_cost_per_image = _to_float(model_info.get("input_cost_per_image"))
+    # Prefer output_cost_per_image; fall back to input_cost_per_image to handle
+    # the OpenAI dall-e-3 naming quirk. Never sum both — that would double-count
+    # a model entry that sets both fields.
+    cost_per_image = output_cost_per_image or input_cost_per_image or 0.0
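The difference between the two strategies only shows up when an entry sets both fields. The sketch below compares them on made-up entries; the function names and prices are hypothetical and stand in for the lines quoted in the review.

```python
from typing import Any, Mapping


def summed_cost(model_info: Mapping[str, Any]) -> float:
    """Merged behavior: add both per-image fields."""
    return (model_info.get("output_cost_per_image") or 0.0) + (
        model_info.get("input_cost_per_image") or 0.0
    )


def preferred_cost(model_info: Mapping[str, Any]) -> float:
    """Suggested behavior: take the output field, else the input field."""
    return (
        model_info.get("output_cost_per_image")
        or model_info.get("input_cost_per_image")
        or 0.0
    )


# Illustrative entries; the prices are invented for the comparison.
only_input = {"input_cost_per_image": 0.04}
both_set = {"input_cost_per_image": 0.04, "output_cost_per_image": 0.04}
```

For `only_input` the two functions agree. For `both_set` the summed version reserves double what the preferred version does, which is the over-reservation the review warns about.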

greptile-apps · 2026-05-09T17:34:45Z

        llm_router=llm_router,
    )
-    if reservation_cost is None:
-        reservation_cost = await _get_smallest_remaining_budget(
-            counters=counters,
-            current_spend_by_counter_key=current_spend_by_counter_key,
-        )
+    # estimate_request_max_cost still returns None when the model is unknown
+    # to the cost map (no token-priced cost fields, e.g. image/audio routes).
+    # In that case we fall back to read-time enforcement only.
    if reservation_cost is None or reservation_cost <= 0:
        return None


Unknown-model requests skip pre-flight budget check

The old _get_smallest_remaining_budget path would raise BudgetExceededError when a counter was already at or over budget, even for models whose cost couldn't be estimated. With that removed, a request to an unknown model whose budget is already exhausted now skips the reservation layer entirely and falls through to read-time enforcement. For single-threaded workloads this is fine, but concurrent requests to cost-unknown routes against a just-exhausted budget could all pass admission simultaneously before the first one's actual cost is reconciled. This is documented as intentional in the PR, but it is a narrower admission-control guarantee for those models.
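The window described here can be shown with a deterministic two-phase simulation. It is a simplified model under assumptions, not proxy code: the function is hypothetical, the check mirrors the strict `>` gate mentioned in the commit message, and "concurrent" is modeled as every request checking before any cost is recorded.

```python
from typing import Tuple


def simulate_read_time_only(
    n_requests: int, cost_each: float, spend: float, max_budget: float
) -> Tuple[int, float]:
    """Model n concurrent requests with no admission-side reservation."""
    # Phase 1: every request runs the read-time check against the same,
    # not yet updated counter. The gate blocks only when spend > max_budget.
    admitted = sum(1 for _ in range(n_requests) if not spend > max_budget)
    # Phase 2: success callbacks reconcile the actual cost afterwards.
    final_spend = spend + admitted * cost_each
    return admitted, final_spend
```

With the counter exactly at a budget of 10.0, `simulate_read_time_only(5, 1.0, 10.0, 10.0)` admits all five requests and ends at a spend of 15.0. Once the counter is already above the budget, the same check admits none.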

veria-ai · 2026-05-09T17:40:45Z

+    # estimate_request_max_cost still returns None when the model is unknown
+    # to the cost map (no token-priced cost fields, e.g. image/audio routes).
+    # In that case we fall back to read-time enforcement only.
    if reservation_cost is None or reservation_cost <= 0:


High: Budget reservation bypass for unestimated requests

None estimates now skip reservation entirely, so a user with a nearly depleted key/team budget can fan out concurrent requests for unestimated models or routes, such as the per-pixel image path covered by the new tests, and all of them pass the read-time budget check before any response cost is recorded. Keep an admission-side reservation for these requests, even if it is a conservative bounded fallback or a route-specific estimate, rather than falling back to read-time enforcement only.

veria-ai · 2026-05-09T17:40:47Z

High: Unestimated requests can bypass in-flight budget reservation

This PR changes budget reservation to skip requests whose cost cannot be estimated. For priced-but-unhandled routes like per-pixel image generation, an authenticated user can send many concurrent requests after the read-time budget check and before spend callbacks update counters.

Status: 1 new · 1 open
Risk: 7/10

yuneng-berri added 3 commits May 9, 2026 10:29

greptile-apps Bot reviewed May 9, 2026


veria-ai Bot reviewed May 9, 2026


yuneng-berri merged commit 18c14d9 into litellm_1.84.0rc2 May 9, 2026
111 of 113 checks passed

yuneng-berri deleted the litellm_/budget-reservation-rc2-backport branch May 9, 2026 17:49
