feat(bedrock_mantle): add Responses API support (/openai/v1/responses) #29490

Merged
Sameerlite merged 9 commits into BerriAI:litellm_oss_staging_040626 from kingdoooo:litellm_bedrock_mantle_responses
Jun 4, 2026

@kingdoooo

@kingdoooo kingdoooo commented Jun 2, 2026

Contributor

Relevant issues

Linear ticket

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added meaningful tests
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible; it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable but needs attention
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

Path-level behavior, direct against Bedrock Mantle

Tested against the live Bedrock Mantle endpoint in us-east-2 using a real Bedrock API key (Bearer token). The three paths behave as follows, which is exactly why this PR is needed and why the routing is gated by model family.

gpt-oss-120b (legacy family, also speaks Chat Completions):

POST /v1/chat/completions   -> 200  {"choices":[{"message":{"content":"Hello there!"...}}]}
POST /v1/responses          -> 200  {"object":"response","output":[...]}
POST /openai/v1/responses   -> 400  {"error":{"message":"The model 'openai.gpt-oss-120b' does not support the '/openai/v1/responses' API"}}

gpt-5.5 (frontier family, Responses-only):

POST /v1/chat/completions   -> 400  {"error":{"message":"The model 'openai.gpt-5.5' does not support the '/v1/chat/completions' API"}}
POST /v1/responses          -> 400  {"error":{"message":"The model 'openai.gpt-5.5' does not support the '/v1/responses' API"}}
POST /openai/v1/responses   -> 200  {"object":"response","model":"openai.gpt-5.5","output":[{"content":[{"text":"Hi there"...}]}]}

The two families accept disjoint path sets, so there is no single path that serves both. The frontier models are only reachable on /openai/v1/responses, which LiteLLM did not target before this change.
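The disjointness claim can be checked mechanically from the status codes listed above. The following is a minimal sketch that only tabulates those observed results; the `OBSERVED` dictionary and the `accepted_paths` helper are illustrative names, not part of LiteLLM.

```python
# Status codes copied from the path-level results above (live us-east-2 endpoint).
OBSERVED = {
    "openai.gpt-oss-120b": {
        "/v1/chat/completions": 200,
        "/v1/responses": 200,
        "/openai/v1/responses": 400,
    },
    "openai.gpt-5.5": {
        "/v1/chat/completions": 400,
        "/v1/responses": 400,
        "/openai/v1/responses": 200,
    },
}


def accepted_paths(model: str) -> set:
    """Return the paths on which the given model answered HTTP 200."""
    return {path for path, status in OBSERVED[model].items() if status == 200}


legacy = accepted_paths("openai.gpt-oss-120b")
frontier = accepted_paths("openai.gpt-5.5")

# No path is accepted by both families, so routing has to depend on the model.
print("shared paths:", legacy & frontier)
print("frontier paths:", frontier)
```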

End-to-end through the LiteLLM proxy, three code states

Run on a live proxy (Docker on EC2) against gpt-5.5 in us-east-2. The same requests were issued across three code states: (A) before this PR, (B) this PR without the file_search fix, and (C) this PR with the file_search fix. Screenshots are attached below the table.

| Shot | State | Request | Outcome |
| --- | --- | --- | --- |
| A-1 | before this PR | basic /v1/responses | HTTP 500. With no Responses backend, the call falls through to the chat-completions bridge and crashes (functools.partial() got multiple values for keyword argument 'acompletion'). gpt-5.5 Responses is unusable. |
| B-1 | this PR, no fix | basic /v1/responses | HTTP 200, replies PONG; x-litellm-model-api-base: https://bedrock-mantle.us-east-2.api.aws/v1. The PR makes gpt-5.5 Responses work. |
| B-2 | this PR, no fix | same call plus a file_search tool | HTTP 500 carrying the upstream validation_error: Tool type 'file_search' is not supported. Supported tool types are: function, mcp, custom, namespace, tool_search. This is the bug raised in review. |
| C-1 | this PR plus fix | the same file_search request | HTTP 200; output contains a file_search_call item with queries:["leave policy"]. The tool is now routed through LiteLLM's file_search emulation instead of being forwarded to Mantle. |

In C-1 the model answers "I couldn't find any information about the leave policy in the available documents" because the demo vector_store_id is not a real Bedrock Knowledge Base, so the emulated search returns no rows and the model answers from that. The proof is the HTTP 200 plus the file_search_call routing, not the retrieval content; with a real vector store the same path performs the actual retrieval.

A-1 (before PR, basic Responses → 500):
[Screenshots: A1_before_pr, A1_before_pr_2]

B-1 (PR, basic Responses → 200):
[Screenshots: B1_responses_call_200_1, B1_responses_call_200_2]

B-2 (PR, file_search → 500, the bug):
[Screenshot: B2_with_file_search_initial_fail]

C-1 (PR + fix, file_search → 200):
[Screenshots: C1_with_file_search_after_patch_1, C1_with_file_search_after_patch_2]

Type

🆕 New Feature

Changes

OpenAI frontier models on Amazon Bedrock Mantle (openai.gpt-5.5, openai.gpt-5.4) are served only through the Responses API and only on the non-standard /openai/v1/responses path (the AWS model card states this differs from the /v1/responses path used by other models). LiteLLM's bedrock_mantle provider previously implemented only Chat Completions, so any /v1/responses call for these models was emulated as a chat completion and sent to /v1/chat/completions, which the frontier models reject with a validation error.

This PR adds a Responses backend for the provider. BedrockMantleResponsesAPIConfig subclasses OpenAIResponsesAPIConfig (Mantle speaks the OpenAI Responses spec, so the request, response, and streaming transforms are inherited) and overrides only the endpoint URL and Bearer authentication. The URL builder normalizes any base a user might supply (a chat-style /v1 base, an already-correct /openai/v1 base, a bare host, or the full endpoint AWS tells users to copy) down to the host before appending /openai/v1/responses, so none of those shapes produce a doubled path. Authentication reads litellm_params.api_key, then BEDROCK_MANTLE_API_KEY, then AWS_BEARER_TOKEN_BEDROCK, and raises a clear error if none is set rather than sending an empty bearer token.
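The URL normalization described above can be sketched roughly as follows. This is not the PR's `get_complete_url` implementation; the function name `build_responses_url` and the suffix tuple are assumptions chosen to illustrate the idea of stripping a known suffix (longest first) before appending the Responses path.

```python
RESPONSES_PATH = "/openai/v1/responses"

# Suffixes a user-supplied base might already carry. Order matters: the
# longest candidate is tried first so the full endpoint is stripped whole.
_KNOWN_SUFFIXES = ("/openai/v1/responses", "/openai/v1", "/v1")


def build_responses_url(api_base: str) -> str:
    """Reduce api_base to its host, then append the Responses path once."""
    base = api_base.rstrip("/")
    for suffix in _KNOWN_SUFFIXES:
        if base.endswith(suffix):
            base = base[: -len(suffix)]
            break
    return base.rstrip("/") + RESPONSES_PATH


host = "https://bedrock-mantle.us-east-2.api.aws"
for candidate in (
    host,                            # bare host
    host + "/v1",                    # chat-style base
    host + "/openai/v1/",            # already-correct base with trailing slash
    host + "/openai/v1/responses",   # full endpoint copied from AWS
):
    print(build_responses_url(candidate))
```

All four inputs resolve to the same URL, which is the property the PR's URL tests are described as pinning.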

Routing is gated by model family in the responses-config registry. Only OpenAI gpt frontier models (the openai.gpt- family, excluding gpt-oss) get the native Responses config; future names such as gpt-6 match automatically without a code change. Everything else falls through to the existing chat-completions emulation. This matters because Mantle hosts many non-OpenAI models (nvidia, mistral, google, zai, and others) alongside gpt-oss, and all of them are chat-completions only and return 400 on /openai/v1/responses (verified live). An earlier revision of this PR gated the other way (exclude gpt-oss, route everything else to Responses), which incorrectly pushed those chat-only models to the Responses path; the allow-list form fixes that.
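As a rough illustration of the allow-list gate, the sketch below returns a config object only for `openai.gpt-` models that are not `gpt-oss`. The stub class and the function name `responses_config_for` are hypothetical stand-ins; the real branch lives in `ProviderConfigManager.get_provider_responses_api_config` and may differ in detail.

```python
from typing import Optional


class MantleResponsesConfigStub:
    """Hypothetical stand-in for BedrockMantleResponsesAPIConfig."""


def responses_config_for(model: Optional[str]) -> Optional[MantleResponsesConfigStub]:
    """Return a Responses config for frontier OpenAI models, else None.

    None means "no native Responses backend", which leaves the caller on
    the existing chat-completions emulation.
    """
    if model and "openai.gpt-" in model and "gpt-oss" not in model:
        return MantleResponsesConfigStub()
    return None


for name in ("openai.gpt-5.5", "openai.gpt-6", "openai.gpt-oss-120b",
             "nvidia.nemotron-nano-9b-v2", None):
    routed = "responses" if responses_config_for(name) else "chat-completions emulation"
    print(name, "->", routed)
```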

Because the config subclasses OpenAIResponsesAPIConfig, it also inherited supports_native_file_search() -> True, which is wrong for Mantle. That flag tells LiteLLM the provider hosts OpenAI's vector stores and can handle the file_search tool natively, so LiteLLM skips its file_search emulation and forwards the tool to the backend. Mantle has no OpenAI vector stores; a forwarded file_search tool is rejected upstream with 400 Tool type 'file_search' is not supported (verified against the live endpoint, see B-2 above). The config now overrides supports_native_file_search() to return False, mirroring the existing supports_native_websocket() opt-out, so a Responses request carrying a file_search tool is routed through LiteLLM's emulation (vector search plus a function-tool loop) instead of failing (see C-1 above). A regression test asserts both the flag value and that the emulation router selects emulation for this config.
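The inheritance problem and its fix can be shown with two stub classes. This is a simplified sketch: the stub classes and the `should_emulate_file_search` helper are hypothetical and only mirror the behavior described above, not LiteLLM's actual emulation router.

```python
from typing import Any, Dict, List


class OpenAIResponsesConfigStub:
    """Stand-in for the OpenAI base config, which reports native support."""

    def supports_native_file_search(self) -> bool:
        return True


class MantleResponsesConfigStub(OpenAIResponsesConfigStub):
    """Stand-in for the Mantle config with the opt-out applied."""

    def supports_native_file_search(self) -> bool:
        # Mantle hosts no OpenAI vector stores, so the tool must be emulated.
        return False


def should_emulate_file_search(config: OpenAIResponsesConfigStub,
                               tools: List[Dict[str, Any]]) -> bool:
    """Emulate only when a file_search tool is present and not natively supported."""
    has_file_search = any(tool.get("type") == "file_search" for tool in tools)
    return has_file_search and not config.supports_native_file_search()


# "vs_demo" is a placeholder vector store id used only for this example.
request_tools = [{"type": "file_search", "vector_store_ids": ["vs_demo"]}]
print(should_emulate_file_search(OpenAIResponsesConfigStub(), request_tools))
print(should_emulate_file_search(MantleResponsesConfigStub(), request_tools))
```

Without the override, the subclass would inherit `True` and the tool would be forwarded upstream, which corresponds to the B-2 failure.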

Price-map entries for bedrock_mantle/openai.gpt-5.5 and openai.gpt-5.4 are added to both model_prices_and_context_window.json and the bundled backup, using the in-region on-demand pricing from the AWS Bedrock pricing page and the 272K context window from the AWS model card. This makes spend tracking accurate for these models.

Out of scope for this PR: no change to Mantle Chat Completions behavior, no Anthropic Messages to Mantle conversion, and no Responses state-management subroutes (/compact, /cancel, /{id}/input_items). Cross-region (Geo and Global) pricing is deferred until AWS publishes it; only in-region pricing exists today. A possible follow-up is routing gpt-oss Responses calls to its own native /v1/responses path (which it does support) instead of the chat-completions emulation, but that is an independent behavior change and is left out here. SigV4 / IAM credential auth on the Mantle endpoint is also a possible follow-up; the endpoint accepts it (verified), but this PR implements only the Bearer token path that AWS documents as the primary method for the OpenAI-compatible surface.

Tests cover URL construction across all base shapes including trailing slashes and the full-endpoint copy-paste case, the authentication priority chain and the missing-key error, the registry gating (gpt-5.5, gpt-5.4, and a hypothetical gpt-6 get the config, while gpt-oss variants, the non-OpenAI families nvidia/mistral/google/zai, and model=None do not), the file_search emulation opt-out, and the price-map values and mode: responses. The existing chat tests are unchanged and still pass.
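For readers who want a concrete picture of the authentication priority chain and the missing-key error that these tests pin, here is a minimal sketch. The function names, the `env` parameter, and the use of `ValueError` are assumptions for illustration; the PR's `validate_environment` may raise a different exception type and take different arguments.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_bearer_token(api_key: Optional[str],
                         env: Mapping[str, str] = os.environ) -> str:
    """Pick the first available credential: explicit key, then the two env vars."""
    token = (
        api_key
        or env.get("BEDROCK_MANTLE_API_KEY")
        or env.get("AWS_BEARER_TOKEN_BEDROCK")
    )
    if not token:
        # Fail loudly instead of sending "Authorization: Bearer " with no token.
        raise ValueError(
            "No Bedrock Mantle API key found: pass api_key or set "
            "BEDROCK_MANTLE_API_KEY or AWS_BEARER_TOKEN_BEDROCK."
        )
    return token


def auth_headers(api_key: Optional[str],
                 env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the Bearer header from the resolved token."""
    return {"Authorization": f"Bearer {resolve_bearer_token(api_key, env)}"}


# Explicit key wins over both environment variables.
print(auth_headers("explicit-key", {"BEDROCK_MANTLE_API_KEY": "env-key"}))
```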

kingdoooo added 6 commits June 2, 2026 18:53
…t-5 for Responses routing

Frontier OpenAI models on Bedrock Mantle are Responses-only on /openai/v1/responses;
gpt-oss is the legacy family that also speaks chat-completions. Gate by excluding
gpt-oss (which keeps its chat-completions emulation) and defaulting everything else
to the native Responses config, so future frontier models (gpt-6, etc.) route
correctly without a code change. Verified against the live us-east-2 Mantle endpoint:
gpt-oss 400s on /openai/v1/responses while gpt-5.5 400s on both standard paths.
@codecov

codecov Bot commented Jun 2, 2026


Codecov Report

❌ Patch coverage is 95.00000% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| litellm/utils.py | 77.77% | 2 Missing ⚠️ |


@greptile-apps

greptile-apps Bot commented Jun 2, 2026

Contributor

Greptile Summary

This PR adds a Responses API backend for Amazon Bedrock Mantle, enabling openai.gpt-5.5 and openai.gpt-5.4 frontier models that are only reachable on the non-standard /openai/v1/responses path. BedrockMantleResponsesAPIConfig subclasses OpenAIResponsesAPIConfig, inheriting all request/response transforms and overriding only the URL builder and Bearer auth.

  • URL normalization: get_complete_url strips common path suffixes (longest-first) so any of the four input shapes AWS documents (bare host, /v1, /openai/v1, or full endpoint URL) all resolve to the correct …/openai/v1/responses without doubling.
  • Model-family routing: The registry gates the new config on "openai.gpt-" in the model name while excluding gpt-oss, letting future frontier models (e.g. gpt-6) match automatically; all other Mantle families (nvidia, mistral, google, zai, gpt-oss) fall through to the existing chat-completions emulation.
  • file_search opt-out: Overrides supports_native_file_search() to return False so LiteLLM's emulation handles file_search tools instead of forwarding them to Mantle (which rejects them with 400).

Confidence Score: 5/5

Safe to merge — adds an isolated new Responses backend with no changes to existing Mantle chat-completions behavior, all existing tests pass, and the new paths are covered by unit tests.

The change is well-scoped: new files for the new config, a single elif branch in the provider registry, and price-map additions. URL normalization is tested across all documented input shapes, auth priority is locked by tests, model-family routing is asserted for both the allow-list and exclusion cases, and the file_search emulation opt-out is verified end-to-end. No existing code paths are modified.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/llms/bedrock_mantle/responses/transformation.py | New BedrockMantleResponsesAPIConfig that subclasses OpenAIResponsesAPIConfig; overrides get_complete_url (with robust suffix-stripping), validate_environment (Bearer auth priority chain), and opts out of native file_search and websocket. |
| litellm/utils.py | Adds elif branch for BEDROCK_MANTLE in ProviderConfigManager.get_provider_responses_api_config, routing openai.gpt-* (excluding gpt-oss) to BedrockMantleResponsesAPIConfig and everything else to None (chat-completions emulation). |
| tests/test_litellm/llms/bedrock_mantle/test_bedrock_mantle_responses_transformation.py | New unit tests covering URL construction, auth priority, registry routing (gpt-5.5, gpt-5.4, future gpt-6, gpt-oss exclusion, non-OpenAI families), file_search emulation opt-out, and price-map values; all mock-only, no network calls. |
| model_prices_and_context_window.json | Adds bedrock_mantle/openai.gpt-5.5 and bedrock_mantle/openai.gpt-5.4 price entries with mode=responses, AWS in-region on-demand pricing, and 272K context window. |
| litellm/model_prices_and_context_window_backup.json | Mirrors the primary price map additions for gpt-5.5 and gpt-5.4 in the bundled backup. |
| litellm/__init__.py | Adds TYPE_CHECKING import for BedrockMantleResponsesAPIConfig alongside other provider response configs. |
| litellm/_lazy_imports_registry.py | Registers BedrockMantleResponsesAPIConfig in LLM_CONFIG_NAMES and the import map, following the established lazy-import pattern. |

Reviews (4): Last reviewed commit: "fix(bedrock_mantle): only route openai.g..."

Closes the one uncovered line flagged by codecov on the Responses config.
The assertion documents that Mantle Responses has no realtime/websocket
transport, so realtime routing must not attempt a socket it cannot serve.
@kingdoooo

Contributor Author

@greptileai

@Sameerlite

Collaborator

@kingdoooo nit: The transformation file inherits supports_native_file_search() -> True from OpenAIResponsesAPIConfig without overriding it. This causes LiteLLM's file search emulation to be skipped entirely for Bedrock Mantle, so any Responses API call that includes a file_search tool is forwarded directly to Mantle, which has no access to OpenAI's file storage and will return a 400.

Also, please add screenshots instead of pasted code as proof. It helps in reviewing.

Thanks!

…orwarding to Mantle

BedrockMantleResponsesAPIConfig inherited supports_native_file_search()
-> True from OpenAIResponsesAPIConfig but never overrode it. Mantle has no
OpenAI vector stores, so a forwarded file_search tool is rejected with a
400 (verified upstream: Tool type 'file_search' is not supported). Opting
out, like the existing supports_native_websocket override, routes the tool
through LiteLLM's file_search emulation instead.
@Nasa62

Nasa62 commented Jun 3, 2026


Does this break other models like nvidia.nemotron-nano-9b-v2 that only support chat completions? After pulling this into my build, those models appear to be routed to Responses. https://docs.aws.amazon.com/bedrock/latest/userguide/model-card-nvidia-nvidia-nemotron-nano-9b-v2.html

@kingdoooo

Contributor Author

@greptileai

The previous gate excluded gpt-oss and routed every other model to the
native Responses config. But on Mantle only the OpenAI gpt frontier models
(gpt-5.x) are served on /openai/v1/responses; gpt-oss and the non-OpenAI
families (nvidia, mistral, google, zai, ...) are chat-completions only and
400 on that path. Allow-list the openai.gpt- family (excluding gpt-oss)
instead, so chat-only models fall through to the chat-completions emulation.
Verified against the live us-east-2 endpoint: nvidia.nemotron-nano-9b-v2
returns 400 on /openai/v1/responses and 200 on /v1/chat/completions.
@kingdoooo

Contributor Author

@Nasa62 Confirmed, and you're right: this was a real regression in the routing gate. Thanks for catching it before it hit more builds.

The gate I had written excluded gpt-oss and routed every other model to the native Responses config. That assumed "not gpt-oss" meant "frontier Responses model", but on Mantle only the OpenAI gpt frontier models (openai.gpt-5.x) are served on /openai/v1/responses; gpt-oss and the non-OpenAI families are chat-completions only and 400 on that path. So nvidia.nemotron-nano-9b-v2, and likewise mistral/google/zai models, were being pushed to Responses and would 400.

Verified directly against the live us-east-2 endpoint:

nvidia.nemotron-nano-9b-v2  POST /openai/v1/responses    -> 400  "The model 'nvidia.nemotron-nano-9b-v2' does not support the '/openai/v1/responses' API"
nvidia.nemotron-nano-9b-v2  POST /v1/chat/completions     -> 200

Fixed in df201f8202 by flipping the gate to an allow-list: only openai.gpt- models (excluding gpt-oss) get the Responses config, everything else falls through to the chat-completions emulation. Future OpenAI frontier names like gpt-6 still match automatically. Added a regression test that asserts nvidia, mistral, google, and zai models route to None (chat-completions) rather than the Responses config.
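The difference between the two gates can be reproduced in a few lines. The two predicate functions below are illustrative reconstructions from the description above, not the code in df201f8202.

```python
def deny_list_gate(model: str) -> bool:
    """Earlier revision: everything except gpt-oss goes to Responses."""
    return "gpt-oss" not in model


def allow_list_gate(model: str) -> bool:
    """Fixed revision: only openai.gpt- models (minus gpt-oss) go to Responses."""
    return "openai.gpt-" in model and "gpt-oss" not in model


models = [
    "openai.gpt-5.5",
    "openai.gpt-oss-120b",
    "nvidia.nemotron-nano-9b-v2",
]

for model in models:
    # The nvidia row is where the gates disagree: the deny-list form sends a
    # chat-only model to /openai/v1/responses, where it was observed to 400.
    print(f"{model}: deny-list={deny_list_gate(model)} allow-list={allow_list_gate(model)}")
```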

@Nasa62

Nasa62 commented Jun 3, 2026


Thank you!

Also here's the full list I compiled if you wanted to add some sort of more specific automatic detection.
However, I still think there should be a configuration-driven way to override it for the day one of these other labs inevitably releases a Responses-compatible model on Bedrock Mantle.

@kingdoooo

Contributor Author

@greptileai

@kingdoooo

Contributor Author

That list is useful, appreciate you compiling it.

I'm going to keep this PR's gate as the openai.gpt- allow-list rather than enumerating the chat-only models. An explicit deny-list of the families above would need editing every time a new chat-only model ships, which is the same staleness problem in reverse; the allow-list only grows when OpenAI adds a frontier name, which is rarer and is exactly the gpt-6-style case the gate already handles.

On the config-driven override, I agree with the need and it shouldn't live in this PR. The clean way is to make the gate also honor a model marked mode: "responses", which get_model_info already resolves from both the price map and a proxy model_info: block (verified: register_model({... "mode": "responses"}) flips it). That lets you opt a future non-OpenAI Responses model in from config with no code change. I kept it out of here because routing on get_model_info().mode couples the decision to the cost-map load state, and the gpt-5.x entries this PR adds aren't in the fetched model_cost until this merges, so a pure mode gate would misroute the very models this PR is about until then. Once this lands, a small follow-up can add the mode == "responses" override on top of the allow-list so the allow-list covers the merge-time bootstrap and config covers everything after. I'll open that follow-up and link it here.
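One possible shape for that follow-up, sketched under heavy assumptions: the allow-list stays as the first check, and a `mode` of `"responses"` in the resolved model info acts as an opt-in on top of it. The function name `use_native_responses`, the plain-dict `model_info` argument, and the model name `examplelab.future-model` are all hypothetical; the real follow-up would presumably read the value from `get_model_info`.

```python
from typing import Mapping, Optional


def use_native_responses(model: str,
                         model_info: Optional[Mapping[str, object]] = None) -> bool:
    """Allow-list first, then a config-driven mode override."""
    # Bootstrap path: works even before the price map knows the model.
    if "openai.gpt-" in model and "gpt-oss" not in model:
        return True
    # Opt-in path: a model explicitly marked mode="responses" in config.
    return bool(model_info) and model_info.get("mode") == "responses"


# A frontier model routes natively with no model info at all.
print(use_native_responses("openai.gpt-5.5"))
# A hypothetical future non-OpenAI model opts in through config.
print(use_native_responses("examplelab.future-model", {"mode": "responses"}))
# A chat-only model stays on the chat-completions emulation.
print(use_native_responses("nvidia.nemotron-nano-9b-v2", {"mode": "chat"}))
```

Keeping the allow-list ahead of the mode check reflects the concern raised above: a pure mode gate would depend on the cost map being loaded with the new entries.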

Scope of this PR stays: add the Responses backend, route only openai.gpt- frontier models to it, and keep every chat-only model on the chat-completions path.

@krrish-berri-2

Contributor

@kingdoooo — could you add a screenshot or short video showing that this change works as expected (e.g. a sample /openai/v1/responses request hitting Bedrock and returning a valid response)? It really helps reviewers verify the feature quickly. Thanks!

@kingdoooo

Contributor Author

@krrish-berri-2 Added, they're in the PR description under "Screenshots / Proof of Fix". The run is on a live proxy (Docker on EC2) hitting Bedrock Mantle gpt-5.5 in us-east-2 with a real Bedrock API key, captured across three code states so the before/after is explicit:

  • A-1: before this PR, a basic /v1/responses call to gpt-5.5 fails (HTTP 500, no Responses backend).
  • B-1: with this PR, the same call returns HTTP 200 (x-litellm-model-api-base: https://bedrock-mantle.us-east-2.api.aws/v1), so the feature works end to end through the proxy.
  • B-2 / C-1: a /v1/responses call carrying a file_search tool, before and after the file_search fix (500 -> 200).

The direct-to-Bedrock path table above the screenshots also shows gpt-5.5 returning 200 only on /openai/v1/responses. Happy to grab any other specific request if it's useful.

@Nasa62

Nasa62 commented Jun 3, 2026


@kingdoooo This might not be directly related but have you seen if prompt caching is working on Bedrock GPT-5.5? At least the cache_control_injection_points doesn't seem to cause prompt caching to be added to the mantle route?

Never mind, I finally got it to show up.

@Cerrix

Cerrix commented Jun 4, 2026


Hi @kingdoooo, thank you for this PR. It's needed. Upvoting it

This is the piece we need for cost tracking, thanks for adding the bedrock_mantle/openai.gpt-5.5 and gpt-5.4 price-map entries. Would be great to get these in (or folded into #29476), since #29476 covers routing but not pricing.

@Sameerlite Sameerlite left a comment

Collaborator

LGTM

@Sameerlite Sameerlite changed the base branch from litellm_internal_staging to litellm_oss_staging_040626 June 4, 2026 11:42
@Sameerlite Sameerlite merged commit 066f978 into BerriAI:litellm_oss_staging_040626 Jun 4, 2026
46 checks passed
@kingdoooo

Contributor Author

@Cerrix Thanks. Good news: this is already merged, so the bedrock_mantle/openai.gpt-5.5 and gpt-5.4 price-map entries are in (both the primary model_prices_and_context_window.json and the bundled backup), along with the full native Responses backend and the routing gate. So the pricing you need landed here directly and doesn't have to be folded into #29476.

The only overlap with #29476 is the routing branch in utils.py; the price-map entries are independent of that, so whichever order things settle in, cost tracking for these two models is covered.

mateo-berri added a commit that referenced this pull request Jun 4, 2026
* fix(azure): apply api_version fallback chain to image edit URL

`AzureImageEditConfig.get_complete_url` only read `api_version` from
`litellm_params`. When callers configured it via `litellm.api_version`
or `AZURE_API_VERSION`, the constructed URL had no `?api-version=` and
Azure responded `404 Resource not found`.

Apply the same fallback chain the Azure chat path already uses in
`common_utils.py`:

    litellm_params > litellm.api_version > AZURE_API_VERSION env >
    litellm.AZURE_DEFAULT_API_VERSION

Adds 5 unit tests pinning each layer of the chain plus a regression
guard for `api_base` that already carries `?api-version=`.

* feat(mcp): core sampling and elicitation flow with security hardening

- Add sampling_handler.py: full MCP sampling/createMessage flow with
  model selection (hint-based + priority-based), auth enforcement,
  budget checks, route restriction gates, and tag policy pre-auth
- Add elicitation_handler.py: MCP elicitation/create relay with
  downstream client capability detection
- Wire sampling/elicitation callbacks in mcp_server_manager.py
  gated behind allow_sampling/allow_elicitation config flags
- Add allow_sampling/allow_elicitation fields to MCPServer type
- Fix session lock deadlock: skip lock for JSON-RPC response POSTs
  (elicitation/sampling replies) with truncated-body heuristic
- Extend client.py with sampling_callback and elicitation_callback
- Security: RouteChecks gate, tag-budget bypass fix, x-forwarded-for
  spoofing fix, Latin-1 header encoding guard
- Add 4 new test modules (model access, priority selection, request
  builder, tool conversion) + update existing MCP tests

* fix(security): run pre-call guardrails before MCP sampling acompletion

Without this, an upstream MCP server with allow_sampling enabled could
send prompts that bypass every guardrail (content filtering, PII
redaction, prompt-injection detection) configured on /chat/completions.

- Call proxy_logging_obj.pre_call_hook(call_type='acompletion') before
  llm_router.acompletion so guardrails fire for sampling sub-calls
- Add HTTPException to the re-raise list so guardrail rejections
  propagate correctly instead of being swallowed as generic errors

* feat(bedrock_mantle): add Responses API support (/openai/v1/responses) (#29490)

* feat(bedrock_mantle): add Responses API transformation config

* test(bedrock_mantle): cover trailing-slash api_base normalization

* feat(bedrock_mantle): export BedrockMantleResponsesAPIConfig

* feat(bedrock_mantle): register gpt-5.x Responses config (gpt-oss unchanged)

* feat(bedrock_mantle): add gpt-5.5/gpt-5.4 Responses price-map entries

* refactor(bedrock_mantle): exclude gpt-oss instead of allow-listing gpt-5 for Responses routing

Frontier OpenAI models on Bedrock Mantle are Responses-only on /openai/v1/responses;
gpt-oss is the legacy family that also speaks chat-completions. Gate by excluding
gpt-oss (which keeps its chat-completions emulation) and defaulting everything else
to the native Responses config, so future frontier models (gpt-6, etc.) route
correctly without a code change. Verified against the live us-east-2 Mantle endpoint:
gpt-oss 400s on /openai/v1/responses while gpt-5.5 400s on both standard paths.

* test(bedrock_mantle): cover supports_native_websocket opt-out

Closes the one uncovered line flagged by codecov on the Responses config.
The assertion documents that Mantle Responses has no realtime/websocket
transport, so realtime routing must not attempt a socket it cannot serve.

* fix(bedrock_mantle): route file_search through emulation instead of forwarding to Mantle

BedrockMantleResponsesAPIConfig inherited supports_native_file_search()
-> True from OpenAIResponsesAPIConfig but never overrode it. Mantle has no
OpenAI vector stores, so a forwarded file_search tool is rejected with a
400 (verified upstream: Tool type 'file_search' is not supported). Opting
out, like the existing supports_native_websocket override, routes the tool
through LiteLLM's file_search emulation instead.

* fix(bedrock_mantle): only route openai.gpt frontier models to Responses

The previous gate excluded gpt-oss and routed every other model to the
native Responses config. But on Mantle only the OpenAI gpt frontier models
(gpt-5.x) are served on /openai/v1/responses; gpt-oss and the non-OpenAI
families (nvidia, mistral, google, zai, ...) are chat-completions only and
400 on that path. Allow-list the openai.gpt- family (excluding gpt-oss)
instead, so chat-only models fall through to the chat-completions emulation.
Verified against the live us-east-2 endpoint: nvidia.nemotron-nano-9b-v2
returns 400 on /openai/v1/responses and 200 on /v1/chat/completions.

* feat(custom_llm): allow streaming/astreaming to yield ModelResponseStream (#27580)

* fix(custom_llm): allow streaming/astreaming to yield ModelResponseStream directly

* fix(streaming): enhance ModelResponseStream handling for custom LLM providers

* fix(streaming): strip finish_reason from content chunks and ensure tool_calls are preserved

* fix(streaming): add type ignore for finish_reason assignment in CustomStreamWrapper

* fix(proxy): strip stack trace from HTTP 503 responses (CWE-209) (#28330)

* fix(proxy/cwe-209): strip Python traceback from HTTP 503 error responses

The /cache/ping endpoint included a full Python traceback in its 503 error
response body (inside the ProxyException message), leaking internal file
paths, line numbers, and call stacks to any caller. Two MCP route handlers
in proxy_server.py similarly interpolated str(e) into "Internal server
error" detail strings.

Fix: log the traceback server-side via verbose_proxy_logger.exception()
and omit it from the ProxyException payload / HTTPException detail returned
to clients. Tests updated to assert no "traceback" keyword or frame paths
appear in the 503 body, with a new dedicated regression test.

CWE-209: Generation of Error Message Containing Sensitive Information.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(proxy/cwe-209): apply Greptile P2 fixes and add MCP exception-path tests

Greptile 4/5 review identified two remaining gaps and Codecov reported
0% coverage on the two MCP handler exception branches:

1. caching_routes.py — str(e) in "Service Unhealthy ({str(e)})" could
   still leak Redis hostnames/IPs; replaced with static "Service Unhealthy".
   HTTPException is now re-raised before the generic handler so the
   "cache not initialized" 503 still reaches callers with its detail.
   Removed the redundant str(e) arg from verbose_proxy_logger.exception()
   (exception() already appends the traceback automatically).

2. tests — two new unit tests cover the exception paths in
   dynamic_mcp_route and toolset_mcp_route that were previously at 0%:
   - test_dynamic_mcp_route_unexpected_exception_returns_500_without_traceback
   - test_toolset_mcp_route_unexpected_exception_returns_500_without_traceback

All 25 tests pass (9 caching + 16 MCP).

CWE-209: Generation of Error Message Containing Sensitive Information.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(caching_routes): restore precise assertion in test_cache_ping_no_cache_initialized

The assertion was weakened to `"Cache not initialized" in str(data)`, which
matches the raw string of the entire response dict and would pass even if the
error moved to an unexpected field or changed structure.

Restore a targeted check on the parsed response: assert the exact string in
the correct field `data["detail"]`, matching FastAPI's HTTPException
serialisation format {"detail": "<message>"}.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(caching_routes): restore precise assertion and add CWE-209 no-cache path test

The assertion in test_cache_ping_no_cache_initialized was weakened to
`"Cache not initialized" in str(data)`, which matched against the raw string
representation of the entire response dict. This would pass silently even if
the error message moved to an unexpected field or the structure changed.

Restore a targeted assertion on the parsed field:
  assert data["detail"] == "Cache not initialized. litellm.cache is None"
matching FastAPI's HTTPException serialisation format exactly.

Add test_cache_ping_no_cache_does_not_expose_internals to show the code path
is still working correctly after the CWE-209 fix: verifies that the HTTPException
is re-raised as-is (no traceback, no source paths), and asserts the complete
response structure is exactly {"detail": "Cache not initialized. litellm.cache is None"}.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(caching_routes): restore ProxyException envelope for null-cache 503

The except HTTPException: raise guard (added in the CWE-209 fix) caused
the null-cache HTTPException to escape as FastAPI's {"detail": "..."} shape
instead of the {"error": {...}} ProxyException envelope that callers expect.

Move the null-cache guard before the try block and raise ProxyException
directly so the response structure is consistent with all other /cache/ping
503s, and the except HTTPException: raise guard is only reachable by
unexpected downstream HTTPExceptions.

Update the two no-cache tests to assert the correct ProxyException envelope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update utils.py (#26609)

* feat(pricing): add Snowflake Cortex REST API model pricing (#26612)

* feat(pricing): add Snowflake Cortex REST API model pricing

## Summary

Adds pricing and context window information for 20+ Snowflake Cortex REST API models to `model_prices_and_context_window.json`.

## What's included

- **7 Claude models** (sonnet-4-5, sonnet-4-6, 4-sonnet, 4-opus, haiku-4-5, 3-7-sonnet, 3-5-sonnet) — with prompt caching rates
- **4 OpenAI models** (gpt-4.1, gpt-5, gpt-5-mini, gpt-5-nano) — with prompt caching rates
- **5 Llama models** (3.1-8b, 3.1-70b, 3.1-405b, 3.3-70b, 4-maverick)
- **1 DeepSeek model** (deepseek-r1)
- **1 Mistral model** (mistral-large2)
- **1 Snowflake model** (snowflake-llama-3.3-70b)
- **2 Embedding models** (arctic-embed-l-v2.0, arctic-embed-m-v2.0)

Each entry includes `input_cost_per_token`, `output_cost_per_token`, `cache_read_input_token_cost` (where applicable), `max_input_tokens`, `max_output_tokens`, and capability flags (`supports_function_calling`, `supports_vision`, `supports_prompt_caching`, `supports_reasoning`).

## Pricing source

All prices are in USD per token, sourced from the official [Snowflake Service Consumption Table](https://www.snowflake.com/legal-files/CreditConsumptionTable.pdf) — Tables 6(b) (REST API with Prompt Caching) and 6(c) (REST API).

## Context

The existing `snowflake/` provider has zero model entries in the pricing JSON, which means LiteLLM cannot track costs for Snowflake Cortex calls. This PR fills that gap.

## Related

- Existing provider: `litellm/llms/snowflake/`
- Cortex REST API docs: https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-rest-api

* Update model_prices_and_context_window.json

Fix the JSON parsing error

* Update model_prices_and_context_window.json

Removed the duplicate entry

* fix(utils): copy extra_body before adding unknown params to prevent model config mutation (#29620)

Fixes #29615. In add_provider_specific_params_to_optional_params, the line:

    extra_body = passed_params.pop("extra_body", None) or {}

returns the original dict reference when extra_body is non-empty (truthy).
Subsequent writes like extra_body[k] = passed_params[k] then mutate the
shared model config object held by the router, poisoning /model/info and
all subsequent requests for that deployment.

The or {} short-circuit creates a new dict only when extra_body is falsy
(None or {}), which is why the bug does not reproduce with extra_body: {}.

Fix: wrap in dict() so we always work on a fresh shallow copy.
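
The aliasing can be reproduced in isolation. This is a minimal sketch, not the LiteLLM function itself: `model_config`, `collect_buggy`, and `collect_fixed` are hypothetical names used only to show the `or {}` behaviour.

```python
# Hypothetical stand-in for a deployment's stored config (not LiteLLM's real structure).
model_config = {"extra_body": {"top_k": 40}}


def collect_buggy(passed_params: dict) -> dict:
    # `x or {}` returns x itself when x is a non-empty dict,
    # so extra_body aliases the dict owned by the caller.
    extra_body = passed_params.pop("extra_body", None) or {}
    extra_body["unknown_param"] = 1
    return extra_body


def collect_fixed(passed_params: dict) -> dict:
    # dict(...) always builds a fresh shallow copy, even for non-empty input.
    extra_body = dict(passed_params.pop("extra_body", None) or {})
    extra_body["unknown_param"] = 1
    return extra_body


# dict(model_config) copies only the outer dict; the inner extra_body is shared.
fixed_result = collect_fixed(dict(model_config))
config_after_fixed = dict(model_config["extra_body"])  # snapshot after the fixed call

buggy_result = collect_buggy(dict(model_config))
config_after_buggy = dict(model_config["extra_body"])  # snapshot after the buggy call
```

After the fixed call the stored config is unchanged. After the buggy call it has gained `unknown_param`, which is the mutation described above.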

* fix(vertex_ai): Bake tool_choice into Gemini CachedContent body to prevent silent drop (#29097)

* fix(vertex_ai): bake tool_choice into Gemini CachedContent body to prevent silent drop

* address greptile feedback on tool_choice cache test

* add a test that uses ToolConfig(functionCallingConfig=FunctionCallingConfig(mode=ANY)) instead of a dict literal, mirroring what map_tool_choice_values actually produces

* fix(gemini/veo): move image from parameters into instances[0] (#29501)

* fix(gemini/veo): move image from parameters into instances[0]

Veo's predictLongRunning schema puts image (and prompt) on the
instances element; parameters is for aspectRatio/durationSeconds/etc.
The Gemini path was leaving image in params_copy, so it ended up
nested under parameters and the API silently ignored it.

The Vertex path already builds the instance dict explicitly, so this
just aligns the Gemini path with it.

Fixes #29498

* address greptile: unconditional pop + BytesIO test

- Pop `image` from params_copy unconditionally so it never reaches
  GeminiVideoGenerationParameters even when None, removing implicit
  reliance on Pydantic's extra-field-ignore.
- Add test_transform_video_create_request_image_filelike_goes_to_instance
  covering the BytesIO path (_convert_image_to_gemini_format) — round-trips
  the base64 to confirm encoding.
- Add test_transform_video_create_request_image_none_is_dropped covering
  the new None branch.

* fix(huggingface): handle special token text in embedding usage (#29660)

* fix(guardrails): recompile ToolPermissionGuardrail rules on update_in_memory_litellm_params (#29655)

* fix(guardrails): recompile ToolPermissionGuardrail rules on update_in_memory_litellm_params

ToolPermissionGuardrail builds self.rules and the compiled target/pattern
maps only in __init__. The base update_in_memory_litellm_params re-sets raw
attributes via setattr but never rebuilds those maps, so a guardrail updated
in place (PUT /guardrails, or the immediate in-memory sync) keeps enforcing
the construction-time rules until it is reinitialized (PATCH path, periodic
DB poll, or restart).

Extract the compile step into _load_rules and override
update_in_memory_litellm_params to rebuild from it (dict- and model-safe),
re-normalizing default_action / on_disallowed_action. Mirrors the existing
PresidioGuardrail override of the same method. Adds regression tests.

Fixes #29592.

* fix(guardrails): handle dict params in ToolPermissionGuardrail in-memory update

Delegate to super() only for LitellmParams input (the base setattr loop is
model-only); apply the raw-dict case inline. Fixes the mypy arg-type error
and makes the recompile work when the proxy passes the raw DB dict.

* fix(guardrails): preserve tool-permission rules on a partial in-memory update

A partial update (e.g. a LitellmParams whose rules field is None) ran through
the generic setattr, which set self.rules to None, and the recompile was
skipped, leaving the guardrail with no rules. Snapshot the previous rules and
restore them when the update carries no rules; an explicit empty list still
clears them. Adds a regression test for the rules-absent case.

Addresses the Greptile review note on #29655.

* fix(bedrock): stop base_model label from stripping tools/tool_choice (#29621)

* fix(bedrock): stop base_model label from stripping tools/tool_choice

A Router/proxy Bedrock deployment whose model_info.base_model is a friendly
label (e.g. claude-haiku-4-5) silently lost tools/tool_choice: the outgoing
Converse request was built without toolConfig, so the model behaved as if no
tools were provided. This worked in v1.84.0 and regressed in v1.85.0; with
drop_params=true the failure was silent.

Two changes compound into the bug. completion() passed model_info.base_model
as the model argument to get_optional_params, so the real Bedrock model id
never reached supported-param resolution; and get_supported_openai_params
resolved the provider config's params from base_model or model, letting the
label fully replace the real model. For Bedrock the label resolves to no tool
support, so tools/tool_choice were dropped before transformation.

completion() now keeps model as the real deployment model and threads the
resolved base_model (kwarg or model_info) through separately, and
get_supported_openai_params treats base_model as additive: it returns the
union of the params supported by model and by base_model. A hint can only add
capabilities, never strip ones the real model already exposes, which also
preserves the original base_model behavior from #27717 and Azure's base_model
driven model-type detection.
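
The additive rule can be sketched as a plain union. The lookup table, model names, and `supported_params` helper below are hypothetical; the real resolution goes through the provider config.

```python
from typing import Dict, List, Optional

# Hypothetical capability table, for illustration only.
SUPPORTED: Dict[str, List[str]] = {
    "bedrock/real-model-id": ["temperature", "tools", "tool_choice"],
    "friendly-label": ["temperature"],  # the label resolves to no tool support
}


def supported_params(model: str, base_model: Optional[str] = None) -> List[str]:
    # Start from what the real deployment model supports.
    params = list(SUPPORTED.get(model, []))
    # Treat base_model as a hint that may only add entries, never remove them.
    if base_model:
        for name in SUPPORTED.get(base_model, []):
            if name not in params:
                params.append(name)
    return params


with_label = supported_params("bedrock/real-model-id", base_model="friendly-label")
without_label = supported_params("bedrock/real-model-id")
```

With the union, a label that knows nothing about tools leaves `tools` and `tool_choice` in place.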

Fixes #29618

* test(main): make base_model param test robust to new parametrize cases

Restore an explicit per-case expected_model_param literal instead of
hardcoding the gemini id, so a future case with a different model can't
produce a misleading assertion failure.

* fix(fireworks_ai): pass response_format json_schema through unchanged (#29606)

FireworksAIConfig.map_openai_params was rewriting the OpenAI strict
`{type: json_schema, json_schema: {name, strict, schema}}` shape into
`{type: json_object, schema: ...}` before sending to Fireworks, dropping
`strict` and `name` and changing the `type`. Per Fireworks' docs json_object
means "force any valid JSON output (no specific schema)", so the schema
constraint was effectively dropped and grammar-guided decoding never ran;
model output silently violated the schema.

The rewrite landed in #7085 (Dec 2024) when Fireworks did not yet accept
native json_schema. Fireworks accepts the OpenAI strict shape natively now,
so the rewrite has become a regression.

Removes the rewrite. Passes response_format through unchanged. Updates the
existing test_map_response_format to assert pass-through. Adds focused
regression tests in tests/test_litellm/ covering preservation of type,
strict, name, and schema body, plus that json_object alone still works.
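
To make the difference concrete, here is a small sketch of the two behaviours. `legacy_rewrite` and `pass_through` are hypothetical helpers that only mirror the shapes described above; they are not the FireworksAIConfig code.

```python
import copy

# An OpenAI strict response_format, in the shape quoted above.
openai_strict = {
    "type": "json_schema",
    "json_schema": {
        "name": "city",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
    },
}


def legacy_rewrite(response_format: dict) -> dict:
    # Old behaviour as described: collapse json_schema into json_object,
    # which loses `name` and `strict` and changes `type`.
    if response_format.get("type") == "json_schema":
        return {
            "type": "json_object",
            "schema": response_format["json_schema"]["schema"],
        }
    return response_format


def pass_through(response_format: dict) -> dict:
    # New behaviour as described: forward the payload unchanged.
    return copy.deepcopy(response_format)


rewritten = legacy_rewrite(openai_strict)
forwarded = pass_through(openai_strict)
```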

* fix(types): import Required from typing_extensions in gemini types

* style: reformat sampling_handler.py for py312 black compat

* refactor(mcp-sampling): extract helpers to fix PLR0915 too-many-statements in handle_sampling_create_message

* fix(proxy-server): add explicit ProxyLogging type annotation to proxy_logging_obj to fix mypy inference

* fix(mcp-sampling): suppress mypy assignment error on ImportError fallback for proxy_logging_obj

* fix(test): use .value when comparing LlmProviders enum against string in test_default_api_base

* fix(test): iterate LlmProviders enum in test_default_api_base to avoid str pollution from custom provider registration

litellm.provider_list is a mutable global initialized to list(LlmProviders) but custom_llm_setup() appends plain provider strings to it. When a test_custom_llm.py test runs first in the same xdist worker, provider_list contains a str and calling .value on it raises AttributeError. Iterate the immutable LlmProviders enum instead, which is deterministic and what the check intends.
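
The failure mode can be reproduced with a stand-in enum. `Providers`, `provider_list`, and `values_from` below are hypothetical and only imitate the situation described above.

```python
from enum import Enum


class Providers(str, Enum):
    # Stand-in for LlmProviders; the members here are illustrative.
    OPENAI = "openai"
    BEDROCK = "bedrock"


# Mutable global seeded from the enum, then polluted with a plain string,
# as custom provider registration is described as doing.
provider_list = list(Providers)
provider_list.append("my-custom-llm")


def values_from(items):
    # Works for enum members; a plain str has no `.value` attribute.
    return [p.value for p in items]


enum_values = values_from(Providers)

try:
    values_from(provider_list)
    list_failed = False
except AttributeError:
    list_failed = True
```

Iterating the enum class itself is independent of test ordering, because nothing can append to it.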

* fix(mcp): depth-aware JSON-RPC response detection and neutral speed-priority fallback

Replace the flat substring check in the truncated-body routing path with a
top-level-key scan so a JSON-RPC response whose result payload nests a
"method" field is still detected as a response and skips the session lock,
removing a deadlock against the in-flight tool call awaiting it.
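
One way to do a top-level-key scan on a body that may be truncated is to track nesting depth and string boundaries by hand. This is a sketch under that assumption, not the code from the commit; `top_level_keys` and `looks_like_response` are hypothetical names.

```python
def top_level_keys(text: str) -> set:
    """Collect keys of the outermost JSON object, tolerating a truncated tail."""
    keys = set()
    depth = 0
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == '"':
            # Consume a whole string literal, honouring backslash escapes.
            j = i + 1
            while j < n and text[j] != '"':
                j += 2 if text[j] == "\\" else 1
            if j >= n:
                break  # body was cut off inside a string
            # A string followed by ':' at depth 1 is a top-level key.
            k = j + 1
            while k < n and text[k] in " \t\r\n":
                k += 1
            if depth == 1 and k < n and text[k] == ":":
                keys.add(text[i + 1 : j])
            i = j + 1
            continue
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        i += 1
    return keys


def looks_like_response(text: str) -> bool:
    # JSON-RPC 2.0: requests carry "method"; responses carry "result" or "error".
    keys = top_level_keys(text)
    return "method" not in keys and ("result" in keys or "error" in keys)


# Truncated response whose result payload nests a "method" field.
truncated = '{"jsonrpc":"2.0","id":1,"result":{"content":[{"method":"GET","text":"aaaa'
request = '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{}}'
```

A flat check such as `'"method"' in truncated` is true for the truncated response, which is the misclassification the depth-aware scan avoids.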

Drop the inverse max_output_tokens speed proxy when no model exposes
output_tokens_per_second; context-window size does not track latency, so a
neutral score avoids biasing speedPriority toward the smallest-context model.

* fix(guardrails): make ToolPermission rule reload atomic on invalid regex

_load_rules appended each rule to self.rules before compiling its regex, so an
invalid pattern raised mid-loop after the bad rule was already live but without
a _compiled_rule_targets entry. _matches_regex reads a missing compiled target
as a None pattern and returns True, turning the bad rule into a match-all that
silently applies its decision to every tool. Via update_in_memory_litellm_params
(PUT /guardrails) this corrupted the live guardrail.

Build the parsed rules and compiled maps into locals and swap them in only after
every regex compiles, and restore the previous ruleset if a live update is
rejected, so an invalid regex now fails the update without leaving the guardrail
enforcing a broken policy.
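
The build-then-swap pattern can be sketched as follows. `RuleSet` is a hypothetical, simplified class and not ToolPermissionGuardrail; it only shows that nothing is published until every pattern compiles.

```python
import re
from typing import Dict, List


class RuleSet:
    def __init__(self, patterns: List[str]):
        self.patterns: List[str] = []
        self.compiled: Dict[str, "re.Pattern[str]"] = {}
        self.load(patterns)

    def load(self, patterns: List[str]) -> None:
        # Build into locals first; re.compile raises re.error on a bad pattern.
        new_patterns: List[str] = []
        new_compiled: Dict[str, "re.Pattern[str]"] = {}
        for pattern in patterns:
            new_compiled[pattern] = re.compile(pattern)
            new_patterns.append(pattern)
        # Swap only after every pattern compiled, so a failure leaves
        # the previous ruleset untouched.
        self.patterns, self.compiled = new_patterns, new_compiled


rules = RuleSet([r"^read_.*$"])
try:
    rules.load([r"^ok$", "("])  # "(" is an invalid regex
    update_rejected = False
except re.error:
    update_rejected = True
```

Because the instance attributes are assigned in a single step at the end, a rejected update needs no explicit rollback in this simplified version.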

* test(mcp): cover sampling conversion, model resolution, and elicitation relay paths

The MCP sampling and elicitation handlers shipped with partial test
coverage, leaving the response-to-MCP conversion, the model resolution
fallback chain, completion-kwargs assembly, guardrail routing, and the
entire elicitation relay untested. That pulled the PR's diff (patch)
coverage below the codecov threshold even though overall project
coverage rose.

Add focused unit tests for _convert_openai_response_to_mcp_result,
_convert_mcp_tools_to_openai, _convert_mcp_tool_choice_to_openai, image
and audio content conversion, the hint-matching and fallback branches of
_resolve_model_from_preferences, _build_completion_kwargs, the router and
guardrail-rejection paths of _run_guardrails_and_call_llm, the
handle_sampling_create_message success and error-propagation flows, the
marker-hoisting fallback for tool content on unexpected roles, and the
elicitation form/url/generic relay together with its decline paths.

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: lengkejun <lengkejun@xd.com>
Co-authored-by: Yug <yugborana000@gmail.com>
Co-authored-by: Kent <72616338+kingdoooo@users.noreply.github.com>
Co-authored-by: tanmay958 <53569547+tanmay958@users.noreply.github.com>
Co-authored-by: DrishnaTrivedi <142084770+DrishnaTrivedi@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Navnit Shukla <Navnit.shukla25@gmail.com>
Co-authored-by: PRABHU KIRAN VANDRANKI <72809214+VANDRANKI@users.noreply.github.com>
Co-authored-by: Adrian Lopez <109683617+adriangomez24@users.noreply.github.com>
Co-authored-by: hcl <chenglunhu@gmail.com>
Co-authored-by: JooHo Lee <96564470+BWAAEEEK@users.noreply.github.com>
Co-authored-by: Dinesh Girbide <85330597+Dinesh-Girbide@users.noreply.github.com>
Co-authored-by: cloudwiz <22098246+andrey-dubnik@users.noreply.github.com>
Co-authored-by: Ahmad Khan <ahmadkhan2508@gmail.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>
jeremymcgee73 pushed a commit to jeremymcgee73/litellm that referenced this pull request Jun 4, 2026
Follow-up to BerriAI#29490, which landed routing and price-map entries for
gpt-5.5/gpt-5.4 on the bedrock-mantle /openai/v1/responses path but
implemented only the Bearer-token auth path. IAM-only deployments
(EKS/ECS with IRSA, no long-lived static secrets) had no working path.

BedrockMantleResponsesAPIConfig now multi-inherits BaseAWSLLM and selects
auth automatically: a bearer key (api_key, BEDROCK_MANTLE_API_KEY, or
AWS_BEARER_TOKEN_BEDROCK) keeps the existing behavior, otherwise it signs
with SigV4 from the standard AWS credential chain, reusing the same
signing the sibling bedrock/mantle Claude route already performs.

Signing is done through the provider sign_request hook, which the
Responses HTTP handler now invokes just before the request goes out and
whose returned body bytes it sends verbatim. This mirrors the chat and
embedding handlers and is required for SigV4 correctness: signing has to
happen after every body mutation (normalize, extra_body, and the
fake-stream "stream" strip) and the exact signed bytes must reach the
wire, otherwise the body hash in the signature would not match what is
sent. The default hook is a no-op, so other Responses providers are
unaffected.
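
The ordering constraint follows from SigV4 covering a SHA-256 hash of the payload. The sketch below shows only that hash, not a full signature; the request dict and `payload_hash` helper are hypothetical.

```python
import hashlib
import json


def payload_hash(body: bytes) -> str:
    # SigV4 includes the hex SHA-256 digest of the request payload.
    return hashlib.sha256(body).hexdigest()


# Hypothetical request body, for illustration only.
request = {"model": "example-model", "input": "hi", "stream": True}

# Hash computed too early, before the fake-stream "stream" strip.
early_hash = payload_hash(json.dumps(request).encode())

# Body mutation that happens later in the pipeline.
request.pop("stream")
wire_body = json.dumps(request).encode()

# Hash computed last, over the exact bytes that go on the wire.
late_hash = payload_hash(wire_body)
```

Only `late_hash` matches the bytes actually sent, which is why the hook runs after every mutation and its returned bytes are sent verbatim.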

get_llm_provider injects a default Mantle host whose region is resolved
without aws_region_name, so the signing region could diverge from the
host region and the endpoint rejected the request. The host region is
now pinned to the resolved signing region, and signing uses the region
in the URL actually posted to. The default region is corrected to
us-east-2, where these models are currently available.

No routing or price-map changes; BerriAI#29490 covers those.
mateo-berri pushed a commit that referenced this pull request Jun 11, 2026
…29490)

Backport prerequisite for #29788. Applied as the squash diff of PR #29490
(head 50ab150^..), which landed upstream inside the litellm_oss_staging_040626
sync (cb04196, #29671) and has no standalone commit to cherry-pick.
mateo-berri added a commit that referenced this pull request Jun 11, 2026
…AIDR, Mantle SigV4, NetApp streaming-cost fix, and team-scoped Datadog toward v1.89.0-rc.3 (#30179)

* fix(proxy): authorize batch files using upload target_model_names (LIT-3593) (#30009)

* fix(proxy): authorize batch files using upload target_model_names (LIT-3593)

After replace_model_in_jsonl, body.model is a stripped provider id. Reverse-mapping it via resolve_model_name_from_model_id is first-match on model_list and caused false 403s when multiple deployments share the same stripped name. Use target_model_names from the unified file id instead.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593)

Restores the reverse-lookup for the JSONL body.model fallback path so that
legacy/pre-target_model_names managed files still map stripped provider IDs
back to proxy aliases before auth. Also cleans up redundant `or None`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593)"

This reverts commit 30d2e96.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 2cd7e87)

* feat(guardrails): capture user and model metadata in CrowdStrike AIDR

(cherry picked from commit 6fc715c)

* fix(guardrails): read CrowdStrike AIDR identity from both metadata bags (#29991)

Capture user_id and extra_info from metadata or litellm_metadata. The single-bag read dropped identity whenever a request carried a present litellm_metadata field (null or a user-supplied dict), since /chat/completions routes the authenticated identity into metadata while the guardrail read litellm_metadata first.

(cherry picked from commit 1bbaf1c)

* feat(bedrock_mantle): add SigV4/IAM auth to Responses API route (#29788)

Applied as the squash diff of PR #29788 (head 9800b2f), which landed
upstream inside the litellm_oss_staging_080626 sync (32c88ca, #29932)
and has no standalone commit to cherry-pick. The rc line already carries
the prerequisite #29490 Responses route via the 040626 sync.

* fix: completion_cost AttributeError on streaming Anthropic web_search responses (#26153) (#27346)

Cherry-picked from staging squash 4a3860d.

The rc line predates the Usage.__init__ server_tool_use dict->ServerToolUse
coercion that staging carries (it landed via the squashed OSS sync #29932 /
32c88ca, not as a standalone commit). The calculate_usage
Usage(**returned_usage.model_dump()) round-trip re-serializes server_tool_use
to a plain dict, so without that coercion the rebuilt usage holds a dict and the
regression test asserting a ServerToolUse type fails. Restored the coercion in
litellm/types/utils.py to satisfy the prerequisite -- it matches #27346's own
first commit (coerce server_tool_use dict to ServerToolUse in Usage.__init__),
which was dropped from the squash only because staging already carried it.

* feat(datadog): add team-scoped Datadog callback support (#29947)

Cherry-picked from the PR head 9c049da (single-commit PR, merged to
litellm_oss_branch). Applied cleanly; no conflicts.

Note: black --check in this worktree flags pre-existing multi-line string
formatting in litellm_core_utils/litellm_logging.py (lines ~1006-1050) that is
already present on the patch/v1.89.0-rc.1 base and is untouched by this pick --
left as-is to avoid reformatting unrelated lines.

---------

Co-authored-by: Sameer Kankute <sameer@berri.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Kenan Yildirim <kenan@kenany.me>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Kent <kingdooo@gmail.com>
Co-authored-by: ishaan-berri <155045088+ishaan-berri@users.noreply.github.com>
Co-authored-by: aanchal22 <12680748+aanchal22@users.noreply.github.com>
Sameerlite added a commit that referenced this pull request Jun 18, 2026
* fix(key_generate): allow team members to create keys on org-scoped teams (#29310)

* fix(key_generate): allow team members to create keys on org-scoped teams

When a virtual key is created for a team, enterprise logic inherits the
team's organization_id onto the key (add_team_organization_id). Since the
VERIA-55 org-IDOR fix, /key/generate then required the caller to be an
explicit LiteLLM_OrganizationMembership member of that org, returning
403 "Caller is not a member of organization_id=<uuid>". Admins normally
only add users to teams (not orgs), so self-serve key creation regressed
for any user on an org-scoped team (regression since v1.84.0-rc.1).

Skip the org-membership check when organization_id was inherited from the
key's team (organization_id == team_table.organization_id). Team-level
authorization already gates this path, so team membership is sufficient.
The membership check still runs when a caller assigns an organization_id
that did not come from the key's team, preserving the IDOR protection.
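
The decision can be sketched as a single predicate. `must_check_org_membership` is a hypothetical helper, and returning False when no organization is requested is an assumption of this sketch rather than something stated above.

```python
from typing import Optional


def must_check_org_membership(
    requested_org_id: Optional[str], team_org_id: Optional[str]
) -> bool:
    if requested_org_id is None:
        # Assumption: with no organization on the key there is nothing to verify.
        return False
    if team_org_id is not None and requested_org_id == team_org_id:
        # Inherited from the key's team: team-level authorization already gates this.
        return False
    # Any other organization_id still requires explicit org membership.
    return True


inherited = must_check_org_membership("org-1", "org-1")
foreign_no_team = must_check_org_membership("org-2", None)
foreign_with_team = must_check_org_membership("org-2", "org-1")
```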

Adds regression tests covering both the team-inherited (allowed) and
foreign-org (still blocked) cases.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(key_generate): cover mismatched team org IDOR path on generate

Add test_generate_key_foreign_org_with_mismatched_team_still_enforces_membership
for the case where a team is present but request organization_id differs from
team_table.organization_id. Enterprise inheritance is no-op'd in the test so
the guard is exercised directly; membership validation must still run.

Addresses Greptile review on #29310.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(pass-through): move Gemini pass-through tests to gemini-3.1-flash-lite (#29595)

* test(pass-through): move Gemini pass-through tests to gemini-3.1-flash-lite

gemini-2.5-flash-lite is a generation behind and is slated for discontinuation on Vertex AI no earlier than October 16, 2026, so the pass-through suite was exercising an aging model. Every reference now points at gemini-3.1-flash-lite, which is GA and already priced in the cost map, so the spend-logging assertions still compute a real cost.

test_vertex.test.js also gains jest.retryTimes(3) to match the sibling spend tests. The CI failures were intermittent 429 RESOURCE_EXHAUSTED from Vertex quota pressure, and that file was the only one without a retry, so a single rate-limited request was failing the whole job.

* test(pass-through): point Vertex tests at the global endpoint for gemini-3.1-flash-lite

gemini-3.1-flash-lite is not served on the Vertex us-central1 regional endpoint for the CI project, so the Vertex pass-through tests were returning a deterministic 404 "Publisher Model ... was not found or your project does not have access to it" while the Gemini API tests passed. Move the Vertex clients to the global location, which the pass-through router maps to aiplatform.googleapis.com, where the 3.1 family is served.

* Litellm oss staging 030626 (#29578)

* Fix incorrect agent API request example payload structure (#29556)

* fix(otel): add litellm_metadata fallback in _get_span_context and _end_proxy_span_from_kwargs (#29427)

* fix(otel): add litellm_metadata fallback in _get_span_context and _end_proxy_span_from_kwargs

On /v1/messages and other LITELLM_METADATA_ROUTES, the parent OTel span
is stored in litellm_params['litellm_metadata'] instead of
litellm_params['metadata']. When the request body contains a native
'metadata' field (e.g. Anthropic's {"user_id": "..."}),
litellm_params['metadata'] gets overwritten and the parent span is lost,
producing orphan root spans with a different trace_id.

Add fallback checks to litellm_metadata in:
- _get_span_context(): so child spans find the correct parent
- _end_proxy_span_from_kwargs(): so the proxy span gets closed
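
The fallback order can be sketched as a lookup over both bags. `find_parent_span` is a hypothetical helper, and the `litellm_parent_otel_span` key name is an assumption made for illustration.

```python
from typing import Any, Optional

# Assumed key name for the stored parent span; illustrative only.
SPAN_KEY = "litellm_parent_otel_span"


def find_parent_span(litellm_params: dict) -> Optional[Any]:
    # Check metadata first so it keeps priority, then fall back to litellm_metadata.
    for bag_name in ("metadata", "litellm_metadata"):
        bag = litellm_params.get(bag_name) or {}
        span = bag.get(SPAN_KEY)
        if span is not None:
            return span
    return None


# /v1/messages-style request: native metadata overwrote the bag holding the span.
overwritten = {
    "metadata": {"user_id": "abc"},
    "litellm_metadata": {SPAN_KEY: "span-A"},
}
both = {
    "metadata": {SPAN_KEY: "span-M"},
    "litellm_metadata": {SPAN_KEY: "span-L"},
}
```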

Fixes: https://github.com/BerriAI/litellm/issues/27934

* test(otel): tighten assertions per Greptile review

- test_span_context_metadata_takes_priority: assert litellm_metadata
  span is never accessed, proving metadata takes priority
- test_span_context_no_parent_when_neither_has_span: assert both ctx
  and detected_span are None

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Aneesh-Fiddler <aneeshfiddler@gmail.com>
Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix: remove premature end-user budget check from get_end_user_object (#29420)

* fix(proxy): remove premature end-user budget check from get_end_user_object

Problem:
- `_check_end_user_budget()` was called inside `get_end_user_object()`
- This caused budget checks to run BEFORE `skip_budget_checks` could be evaluated
- Zero-cost models (e.g., local vLLM) were incorrectly blocked when
  end-users exceeded their budget, even though they should bypass budget checks

Solution:
- Remove `_check_end_user_budget()` calls from `get_end_user_object()`
- Budget enforcement now happens exclusively in `common_checks()` where
  `skip_budget_checks` context is available
- `get_end_user_object()` keeps `route` as an optional parameter for backwards compatibility and possible future use.
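
A rough sketch of the separation described above, assuming a plain dict as the end-user store. The names `get_end_user`, `common_checks`, and `BudgetExceeded` here are simplified stand-ins, not the proxy's real signatures.

```python
from typing import Optional


class BudgetExceeded(Exception):
    """Stand-in for the proxy's budget error."""


def get_end_user(store: dict, end_user_id: str) -> Optional[dict]:
    """Fetch only; no budget enforcement happens here."""
    return store.get(end_user_id)


def common_checks(end_user: Optional[dict], skip_budget_checks: bool) -> None:
    """Enforce the budget in one place, where the skip flag is known."""
    if skip_budget_checks or end_user is None:
        return
    max_budget = end_user.get("max_budget")
    if max_budget is not None and end_user.get("spend", 0.0) > max_budget:
        raise BudgetExceeded("end user is over budget")
```

Because the fetch never raises, a zero-cost model can pass `skip_budget_checks=True` and proceed even when the end user is over budget.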

* refactor(tests): update budget enforcement tests to reflect changes in get_end_user_object

- test_get_end_user_object() verifies data fetching
- test_check_end_user_budget() verifies enforcement
- test_budget_enforcement_blocks_over_budget_users() integrates _check_end_user_budget()
- test_resolve_end_user_reraises_budget_exceeded() is now test_resolve_end_user, since get_end_user_object() no longer raises a budget-exceeded error

* Gemini /images/generate and /images/edits billing fixes + add support for size and aspect ratio params (#29534)

* Fix Gemini image config mapping

* Address Gemini image config review

* Format Gemini image generation transform

* Fix Gemini image token usage logging

* Share Gemini image request helpers

* Fix Gemini Imagen model routing

* Fixes as per self code review

* Fixes per internal code review

* Stop gating Imagen imageSize forwarding

* Document Gemini image size mapping source

* chore: retrigger lint

* Clarify Gemini candidate count precedence

* Add Inception provider (#29522)

* add inception as provider (chat, fim)

* linting

* separate test suite for chat and fim

* fix test coverage

* fix: model hub custom pricing model info (#29293)

* Opik user auth key metadata extractors (#28397)

* fix: enhance Opik metadata extraction to include user API key auth context (fixed after the refactor to extractor logic)

* test: add unit tests for Opik metadata extraction logic

* fix: enhance extract_opik_metadata function to prioritize metadata sources for improved accuracy

* fix(ci): clarified comments and edited unit tests

* test: add unit tests for Opik metadata extraction with auth and requester overrides

* fix(ui): replace fixed favicon.ico with current api get /get_favicon (#29532)

Signed-off-by: José Luis Di Biase <josx@interorganic.com.ar>

* fix(vertex/gemini): keep tool_call reference when a text-only assistant message follows (#29561)

`_gemini_convert_messages_with_history` tracks `last_message_with_tool_calls`
so a following tool result can be matched back to its tool call. The assignment
was inside a branch guarded by
`assistant_msg.get("tool_calls", []) is not None`, which is also True for a
text-only assistant message (an empty list is not None). As a result, an
assistant message with no tool calls that appears between a tool call and its
tool result overwrote the reference, and conversion failed with:

    Exception: Missing corresponding tool call for tool response message.

This shape is common: a model emits a short narration/assistant message after a
tool call before the tool result is appended.

Only update `last_message_with_tool_calls` when the assistant message actually
carries tool_calls (or a function_call). Adds a regression test.
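
The guard change can be illustrated with a small, self-contained sketch. The function name `track_last_tool_call_message` is hypothetical and the real conversion code does much more; only the truthiness check is the point here.

```python
from typing import Optional


def track_last_tool_call_message(messages: list) -> Optional[dict]:
    """Return the most recent assistant message that actually carries tool calls."""
    last_with_tool_calls = None
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        # The buggy guard, msg.get("tool_calls", []) is not None, is True
        # for a text-only message because an empty list is not None.
        if msg.get("tool_calls") or msg.get("function_call"):
            last_with_tool_calls = msg
    return last_with_tool_calls
```

A truthiness check skips both a missing key and an empty list, so a narration message between the tool call and its result no longer overwrites the reference.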

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* Add 1-hour cache write pricing for EU/AU/JP Bedrock Anthropic models (#28572)

* fix(thinking): handle None thinking param in is_thinking_enabled (#28598)

Squash-merged by litellm-agent from Terrajlz's PR.

* feat(helm): support tpl rendering in podAnnotations (#28609)

Squash-merged by litellm-agent from devauxbr's PR.

* Forward custom_llm_provider through the Responses API bridge (Fixes #28505) (#28575)

* Forward custom_llm_provider through the Responses API bridge (Fixes #28505)

When a Chat Completions request to a GPT-5.4+ model contains both
`tools` and `reasoning_effort`, `completion()` auto-routes through
`responses_api_bridge`. The bridge handler called
`litellm.responses()` / `litellm.aresponses()` without forwarding the
already-resolved `custom_llm_provider`, so the downstream call
re-invoked `get_llm_provider()` with `custom_llm_provider=None` and
stripped a second provider prefix from a `provider/provider/model`
deployment string.

For a deployment configured as `openai/openai/openai/gpt-5.5`,
the bridge flow sent `openai/gpt-5.5` to the upstream API instead of
the correct `openai/openai/gpt-5.5`. Upstream APIs that enforce
model-name allow-lists rejected this as `key_model_access_denied`.

Fix: pass the locally-resolved `custom_llm_provider` into both the
sync `responses()` and async `aresponses()` calls so the downstream
`_resolve_model_provider_for_responses` sees an explicit provider
and skips the second prefix-strip.

New regression test
`tests/test_litellm/completion_extras/test_responses_bridge_provider_propagation.py`
pins both call sites: each must forward `custom_llm_provider`.
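
The double strip can be reproduced with a toy resolver. `strip_provider_prefix` below is a hypothetical stand-in for one round of provider resolution, not the real `get_llm_provider()`.

```python
def strip_provider_prefix(model: str, provider: str) -> str:
    """Hypothetical stand-in: remove one leading `provider/` segment."""
    prefix = provider + "/"
    return model[len(prefix):] if model.startswith(prefix) else model


deployment = "openai/openai/openai/gpt-5.5"

# completion() resolves the provider once.
after_completion = strip_provider_prefix(deployment, "openai")

# Without an explicit provider, the bridge's downstream call resolves again.
after_bridge_bug = strip_provider_prefix(after_completion, "openai")

# With the provider forwarded, the second strip is skipped.
after_bridge_fix = after_completion
```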

* fix(28505): set custom_llm_provider on request_data instead of as duplicate kwarg

Greptile flagged that the previous patch passed custom_llm_provider as an
explicit kwarg to responses()/aresponses() while request_data already
carried it via the spread of sanitized_litellm_params, which would raise
TypeError: got multiple values for keyword argument on every real bridge
call.

Switches to assigning request_data['custom_llm_provider'] before the call
so the resolved provider wins over whatever sanitized_litellm_params spread
in, without duplicating the kwarg.

Updates the regression test to seed request_data with a sentinel
custom_llm_provider so it actually exercises the overwrite path (the
previous test mocked transform_request with a minimal dict and never hit
the conflict).
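
The duplicate-kwarg failure and the overwrite approach can be shown in isolation. The `responses` function below is a hypothetical stand-in for `litellm.responses()`; only the keyword-argument mechanics are illustrated.

```python
def responses(model, custom_llm_provider=None, **kwargs):
    """Hypothetical stand-in for litellm.responses()."""
    return custom_llm_provider


request_data = {"model": "gpt-5.5", "custom_llm_provider": "sentinel"}

# First attempt: explicit kwarg plus the same key inside the spread dict.
try:
    responses(custom_llm_provider="openai", **request_data)
    duplicate_error = None
except TypeError as exc:
    duplicate_error = str(exc)

# Revised approach: overwrite the key, then spread the dict once.
request_data["custom_llm_provider"] = "openai"
resolved = responses(**request_data)
```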

* chore: trigger shin-agent re-eval on retargeted staging base

* chore: trigger shin-agent re-eval against updated Greptile state

* Add 1-hour cache write pricing for EU/AU/JP Bedrock Anthropic models

The 1-hour prompt-cache write tier
(`cache_creation_input_token_cost_above_1hr`) was added to the
us./global. variants of the Claude 4.5/4.6/4.7 family on Bedrock, but
the eu./au./jp. cross-region inference profiles were left without it.
AWS Bedrock pricing applies the same +10% regional premium across all
geo profiles, so eu./au./jp. should carry the same 1-hour rates as
us. (1.6x the 5-minute regional rate).

Without these fields, cost tracking on EU/AU/JP Bedrock 1-hour-TTL
prompt caching falls back to the 5-minute write rate, so recorded
cache-write spend comes out ~37.5% below the true cost (the true rate is
1.6x, i.e. 60% higher) for European, Australian, and Japanese tenants.

Adds the 1-hour tier (and Sonnet 4.5's long-context >200K tier where
AWS publishes one) to 13 regional Bedrock entries in both
`model_prices_and_context_window.json` and the bundled
`model_prices_and_context_window_backup.json`:

  - eu./au.   Opus 4.6     ($11.00 / MTok)
  - eu./au.   Opus 4.7     ($11.00 / MTok)
  - eu./au./jp. Sonnet 4.6 ($6.60 / MTok)
  - eu./au./jp. Sonnet 4.5 ($6.60 / MTok regular, $13.20 / MTok LC)
  - eu./au./jp. Haiku 4.5  ($2.20 / MTok)

Also extends `tests/test_litellm/test_bedrock_anthropic_1hr_cache_pricing.py`
with a `REGIONAL_EXPECTED` parametrized block covering all 13 new
entries plus the existing 1.6x ratio invariant.

Note: `eu.anthropic.claude-opus-4-5-20251101-v1:0` carries the
wrong 5m rate today (base 6.25e-06 instead of regional 6.875e-06),
which would break the 1.6x ratio check. It is intentionally left out
of this PR so the scope stays "1-hour cache tier addition" — a
separate follow-up should correct the EU 5m rates for Opus 4.5.
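
The listed 1-hour rates follow from the two multipliers named above. A small worked check, assuming base 5-minute write rates of $6.25 (Opus), $3.75 (Sonnet), and $1.25 (Haiku) per MTok as quoted elsewhere in this log; the helper name is hypothetical.

```python
BASE_5M_WRITE_PER_MTOK = {"opus": 6.25, "sonnet": 3.75, "haiku": 1.25}
REGIONAL_PREMIUM = 1.10      # +10% for geo cross-region profiles
ONE_HOUR_MULTIPLIER = 1.6    # 1-hour write tier relative to the 5-minute tier


def regional_1h_rate(family: str) -> float:
    """Regional 1-hour cache write rate in $/MTok, rounded to cents."""
    regional_5m = BASE_5M_WRITE_PER_MTOK[family] * REGIONAL_PREMIUM
    return round(regional_5m * ONE_HOUR_MULTIPLIER, 2)
```

For Opus this gives 6.25 x 1.10 = 6.875 for the regional 5-minute rate, and 6.875 x 1.6 = 11.00 for the 1-hour rate.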

---------

Co-authored-by: Terrajlz <info@jouleselectrictech.com>
Co-authored-by: Bruno Devaux <devaux.br@gmail.com>
Co-authored-by: Sameer Kankute <sameer@berri.ai>

* Add 1-hour cache write pricing tier for Vertex AI Anthropic models (#28569)

* Add 1-hour cache write pricing tier for Vertex AI Anthropic models

GCP Vertex AI publishes a separate 1-hour cache write column for the
Claude family (1.6x the 5-minute write rate, matching the documented
Bedrock ratio). LiteLLM's Vertex AI Anthropic entries only carry the
5-minute tier, so any request that uses `cache_control: {"ttl": "1h"}`
on Vertex AI Claude has its cache-write cost recorded ~37.5% too low (the
true rate is 60% higher).

The runtime side already supports the 1-hour tier — `VertexAIAnthropicConfig`
extends `AnthropicConfig`, populating `ephemeral_1h_input_tokens`, and
`_calculate_cache_creation_cost` reads `cache_creation_input_token_cost_above_1hr`.
Only the price registry was missing data.

Adds the field to 19 vertex_ai/claude-* entries across both
`model_prices_and_context_window.json` and the bundled
`model_prices_and_context_window_backup.json`:

  - Haiku 4.5 ($1.25 -> $2.00 / MTok)
  - Sonnet 3.7 / 4 / 4.5 / 4.6 ($3.75 -> $6.00 / MTok)
  - Opus 4.5 / 4.6 / 4.7 ($6.25 -> $10.00 / MTok)
  - Opus 4 / 4.1 ($18.75 -> $30.00 / MTok)

Adds `tests/test_litellm/test_vertex_anthropic_1hr_cache_pricing.py`
mirroring the Bedrock equivalent — pins each (5m, 1h) pair per model
and asserts the 1.6x ratio across the family.

Fixes #27781.

---------

Co-authored-by: Terrajlz <info@jouleselectrictech.com>
Co-authored-by: Bruno Devaux <devaux.br@gmail.com>
Co-authored-by: Sameer Kankute <sameer@berri.ai>

* Fix Gemini multimodal function responses (#29325)

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>

* address greptile review: add _transform_image_usage method and model-map supports_image_size flag

- Add _transform_image_usage instance method to GoogleImageGenConfig that
  delegates to transform_gemini_image_usage, fixing the regression test
- Replace hardcoded "2.5-flash" string check in supports_gemini_image_size
  with a get_model_info lookup on supports_image_size (default true)
- Add supports_image_size: false to all gemini-2.5-flash model entries in
  model_prices_and_context_window.json so capability is controlled via the
  model map rather than embedded in code

* fix test failures: schema validation, mypy type, model info plumbing, pricing test

- Add supports_image_size to ModelInfoBase TypedDict so get_model_info surfaces it
- Pass supports_image_size through _get_model_info_helper constructor call
- Fix supports_gemini_image_size to use value is not False (None means unset, defaults to True)
- Add supports_image_size to JSON schema in test_aaamodel_prices_and_context_window_json_is_valid
- Correct gemini-3.1-flash-lite pricing assertions in test to match JSON values

* Add Azure AI Kimi K2.6 metadata (#27052)

* Add Azure AI Kimi K2.6 metadata

* Scope Kimi metadata test cost map setup

* fall back to substring check for models not in model_prices_and_context_window.json

Models like gemini-2.5-flash-image-preview are not in the pricing JSON,
so get_model_info raises. Fall back to "2.5-flash" not in model when the
JSON has no explicit supports_image_size entry for the model.
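
Taken together, the capability lookup and its fallback might look like the sketch below. `MODEL_MAP` is a hypothetical two-entry excerpt and `supports_image_size` is a simplified stand-in for `supports_gemini_image_size`; the real code goes through `get_model_info`.

```python
from typing import Optional

# Hypothetical excerpt of the model map; real entries live in the pricing JSON.
MODEL_MAP = {
    "gemini-2.5-flash": {"supports_image_size": False},
    "some-other-image-model": {},
}


def supports_image_size(model: str) -> bool:
    """Prefer an explicit model-map flag; otherwise use the substring check."""
    info = MODEL_MAP.get(model)
    if info is not None:
        flag: Optional[bool] = info.get("supports_image_size")
        if flag is not None:
            return flag
    # No explicit entry (or model not in the map): substring heuristic.
    return "2.5-flash" not in model
```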

* fix(inception): don't forward global litellm.api_key to Inception FIM

Match the Inception chat config: resolve only an Inception-specific key
(param, litellm.inception_key, or INCEPTION_API_KEY) for the text-completion
FIM path. The global litellm.api_key (often an OpenAI key) was both leaking
to api.inceptionlabs.ai and taking precedence over the configured Inception
key when set.
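
A minimal sketch of the resolution order, assuming a hypothetical helper `resolve_inception_key`. The `global_api_key` argument stands in for `litellm.api_key` and is accepted only to show that it is ignored.

```python
import os
from typing import Optional


def resolve_inception_key(
    api_key: Optional[str],
    inception_key: Optional[str],
    global_api_key: Optional[str] = None,
) -> Optional[str]:
    """Resolve only Inception-specific credentials.

    `global_api_key` is deliberately unused: it must never reach Inception.
    """
    return api_key or inception_key or os.environ.get("INCEPTION_API_KEY")
```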

* fix(auth): enforce end-user budget on custom-auth path that skips common_checks

get_end_user_object() no longer raises BudgetExceededError, so custom-auth
deployments with custom_auth_run_common_checks unset (which skip the
centralized common_checks gate) stopped enforcing the end-user budget,
letting an over-budget end user keep making requests. Re-enforce the
budget in _run_post_custom_auth_checks on that path.

---------

Signed-off-by: José Luis Di Biase <josx@interorganic.com.ar>
Co-authored-by: Isha <72744901+IshaMeera@users.noreply.github.com>
Co-authored-by: aneeshsangvikar <aneeshsangvikar@fiddler.ai>
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Aneesh-Fiddler <aneeshfiddler@gmail.com>
Co-authored-by: Suleiman Elkhoury <108065141+suleimanelkhoury@users.noreply.github.com>
Co-authored-by: Dmitriy Alergant <93501479+DmitriyAlergant@users.noreply.github.com>
Co-authored-by: Yanis Miraoui <yanis.miraoui19@imperial.ac.uk>
Co-authored-by: Lovro Seder <vrovro@gmail.com>
Co-authored-by: Thomas Mildner <12685945+Thomas-Mildner@users.noreply.github.com>
Co-authored-by: José Luis Di Biase <josx@interorganic.com.ar>
Co-authored-by: Lai Quang Huy <64073540+1qh@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>
Co-authored-by: Terrajlz <info@jouleselectrictech.com>
Co-authored-by: Bruno Devaux <devaux.br@gmail.com>
Co-authored-by: ZHONG Ziwen <67355585+zzw-math@users.noreply.github.com>
Co-authored-by: Emerson Gomes <emerson.gomes@thalesgroup.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>

* Fix : a2a bugs 030626 (#29566)

* Fix error code and context id injection bug

* Add support for all A2A methods

* Add logging

* address greptile review: relay upstream JSON-RPC errors, move _PASCAL_TO_WIRE to module level, add error path tests

* fix(a2a): run pre_call_hook for tasks/resubscribe SSE path to enforce guardrails

tasks/resubscribe was returning the raw SSE stream without calling proxy_logging_obj.pre_call_hook, silently bypassing any guardrails configured on the agent. This patch calls pre_call_hook before streaming begins and wires post_call_failure_hook into the SSE generator so errors are logged. Adds a regression test verifying the hook is called.

* fix(a2a): use get_async_httpx_client instead of creating httpx clients per request

Creating httpx.AsyncClient instances per-request adds ~500ms latency. Switch _forward_jsonrpc and _forward_jsonrpc_sse to use the shared client from get_async_httpx_client(httpxSpecialProvider.A2A).

* fix(a2a): forward caller identity headers on task ops; validate push notification URL

Two security fixes for task management methods:

1. All task operations (tasks/get, tasks/list, tasks/cancel, tasks/resubscribe, push notification config methods) now forward X-LiteLLM-User-Id and X-LiteLLM-Team-Id headers to the upstream agent, so the agent can scope task access to the authenticated caller.

2. tasks/pushNotificationConfig/set validates the callback URL before forwarding: requires HTTPS and rejects private/loopback/reserved IP ranges and localhost hostnames to prevent SSRF.
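
The callback URL check in item 2 could be approximated as below. This is a sketch under stated assumptions, not the shipped validator: `is_safe_callback_url` is a hypothetical name, and it only inspects the URL itself without resolving DNS.

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_callback_url(url: str) -> bool:
    """Require HTTPS and reject localhost or non-public IP literals."""
    parsed = urlparse(url)
    host = parsed.hostname
    if parsed.scheme != "https" or not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; a fuller check would also vet resolved addresses.
        return True
    return not (
        ip.is_private or ip.is_loopback or ip.is_reserved or ip.is_link_local
    )
```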

* Fix A2A task hook and push URL handling

* fix(a2a): fix mypy type errors for request_id and header_name dict key types

* Fix A2A request id and params forwarding

* Forward trace IDs for A2A task calls

* fix(a2a): strip client-forwarded X-LiteLLM-* headers before applying authenticated identity

A client could send x-a2a-<agent>-x-litellm-user-id in their request and have it forwarded to the upstream agent as an authenticated identity header. Fix: sanitize any X-LiteLLM-* headers from agent_extra_headers before merging, then apply the authenticated identity headers last so they always override client-supplied values.
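
The sanitize-then-override order can be sketched as follows; `build_upstream_headers` is a hypothetical helper name, and the header dicts are simplified.

```python
def build_upstream_headers(client_headers: dict, identity_headers: dict) -> dict:
    """Drop client-supplied X-LiteLLM-* headers, then apply trusted identity."""
    sanitized = {
        name: value
        for name, value in client_headers.items()
        if not name.lower().startswith("x-litellm-")
    }
    # Identity headers are merged last so they always win.
    sanitized.update(identity_headers)
    return sanitized
```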

* Fix A2A SSE fallback JSON-RPC error code

* Fix A2A SSE error id backfill

* fix(a2a): validate both push notification url fields to close SSRF bypass

* fix(a2a): widen request_id annotation to match JSON-RPC id call sites

* fix(a2a): run post-call streaming hook for tasks/resubscribe so agent guardrails apply

tasks/resubscribe returned the raw upstream SSE stream without routing events
through the post-call streaming hook, so output guardrails configured on the
agent were silently skipped for streaming task subscriptions while every other
task method and message/stream applied them. Parse upstream JSON-RPC SSE events
and feed them through async_streaming_data_generator, matching message/stream,
so guardrails inspect the streamed task content. Adds a regression test that
fails when the streamed events bypass the guardrail hook.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>

* fix(anthropic/adapter): emit thinking block for reasoning_content-only streaming chunks (#29600)

* fix(anthropic/adapter): open thinking block for reasoning_content-only streaming chunks

The /v1/messages streaming content-block classifier (_translate_streaming_openai_chunk_to_anthropic_content_block) only recognized thinking_blocks. OpenAI-compatible reasoning backends (vLLM/SGLang reasoning parsers: DeepSeek-R1, Qwen3, gpt-oss, ...) populate reasoning_content with thinking_blocks=None, so the classifier fell through to a text block. The delta translator already emits thinking_delta for reasoning_content, so those deltas landed inside a text block and Anthropic streaming clients (Claude Code, SDK .stream()) silently dropped the chain-of-thought.

Mirror the reasoning_content fallback already present in the non-stream translator and the streaming delta translator so the classifier opens a thinking block. Adds a focused regression test.

* fix(anthropic/adapter): reach reasoning_content branch when thinking_blocks attr is absent

Delta deletes the thinking_blocks attribute when unset, so the prior nested check was unreachable for reasoning-only chunks (vLLM/SGLang). Make it a sibling elif so the content block is classified as thinking.
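
A simplified sketch of the sibling-branch classification, using `SimpleNamespace` as a stand-in for the streaming delta. The real classifier also handles other block types; `classify_content_block` is a hypothetical name.

```python
from types import SimpleNamespace


def classify_content_block(delta) -> str:
    """Return "thinking" or "text" for a streaming delta (simplified)."""
    if getattr(delta, "thinking_blocks", None):
        return "thinking"
    # Sibling branch: reached even when the thinking_blocks attribute is absent.
    elif getattr(delta, "reasoning_content", None):
        return "thinking"
    return "text"


reasoning_only = SimpleNamespace(reasoning_content="Let me think...")
plain_text = SimpleNamespace(content="Hello")
```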

* test(proxy): stop component-allowlist test leaking DATABASE_URL into xdist peers

The component-allowlist test pins throwaway DATABASE_URL/LITELLM_MASTER_KEY
values at import time via os.environ so importing proxy_server doesn't need a
live database. Those values persisted for the whole pytest-xdist worker, so a
sibling test sharing the worker (test_key_rotation_e2e's DB-backed E2E case)
saw the leaked sqlite DATABASE_URL, treated it as an available database instead
of skipping, and the Prisma engine rejected the non-postgres URL (P1012 ->
httpx.ConnectError). Restore the prior environment after the import so the
throwaway values never escape the module.
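
One way to keep throwaway values from escaping is a save-and-restore context manager. This is an illustrative sketch, not the test's actual code; `temporary_env` is a hypothetical helper.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(**overrides: str):
    """Set environment variables for the block, then restore the prior state."""
    saved = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, previous in saved.items():
            if previous is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = previous
```

Wrapping the `proxy_server` import in such a block means sibling tests on the same xdist worker see the environment as it was before the import.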

---------

Co-authored-by: Tai An <antai12232931@outlook.com>

* ci: reproduce default-Windows wheel install to guard MAX_PATH (#29597)

* ci: reproduce default-Windows wheel install to guard MAX_PATH

The existing using_litellm_on_windows job installs the project with
`uv sync`, an editable source install that never copies package files
into a deep site-packages path, so it cannot see the 260-char MAX_PATH
overflow that breaks `pip install litellm` on default Windows. The
content-filter benchmark fixtures have hit that limit three times
(#21941, #22039, #29536), each caught only after release.

This adds a guard to the same job that builds the wheel and installs it
the way an end user would: into a venv whose site-packages prefix is
padded to a realistic worst-case Windows length (~100 chars), then
asserts the install completes and litellm imports. Any packaged path
long enough to bust MAX_PATH at that prefix is reported up front, so the
check is deterministic regardless of the runner's long-path setting,
while the real install also covers failure modes a length heuristic
cannot (half-unpacked packages, reserved names, case collisions).

This commit is the guard only; on the current tree it correctly fails
because nine fixtures still exceed the limit. The rename that brings
them back under it follows on this branch.
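
The up-front length report could be approximated with the standard `zipfile` module. This sketch assumes a 100-character site-packages prefix; `paths_over_limit` is a hypothetical name and the real CI guard also performs an actual install.

```python
import io
import zipfile

# MAX_PATH counts the terminating NUL, so 260 characters already overflow.
MAX_PATH = 260


def paths_over_limit(wheel_bytes: bytes, prefix_len: int = 100) -> list:
    """List wheel entries whose installed path would reach MAX_PATH or beyond."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        return [
            name
            for name in wheel.namelist()
            if prefix_len + len(name) >= MAX_PATH
        ]
```

Under these assumptions a 176-character entry lands at 276 characters and is reported, while a 128-character entry lands at 228 and passes.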

* fix(packaging): shorten content-filter benchmark fixtures under MAX_PATH

The 10 content-filter benchmark result fixtures used the legacy
block_{topic}_-_contentfilter_({yaml}).json naming, up to 176 chars
inside the wheel, which busts the Windows 260-char MAX_PATH limit once
extracted under a realistic site-packages prefix and aborts
`pip install litellm` on default Windows.

Rename them to the short {topic}_cf.json scheme that
_save_confusion_results already emits today (it splits the label on the
em-dash and writes f"{topic}_cf"), matching the insults_cf.json and
investment_cf.json files fixed earlier. Re-running the eval suite now
regenerates these same short names rather than recreating the long ones.

This drops the longest packaged path from 176 to 128, so the guard added
in the previous commit goes from red to green with a 32-char margin.

* test(windows): tidy MAX_PATH guard per review

Close the wheel zip via a context manager rather than leaning on
refcount collection, and select the wheel under dist/ by newest mtime so
a stale artifact from an earlier build cannot be tested instead of the
one just produced. Also pin down the venv-depth formula with a short
note: the +2 is the separator joining the venv root to "Lib" plus the
trailing separator before the entry, which lands the simulated
site-packages prefix at exactly 100 chars.

* fix(vertex): strip output_config.effort for Vertex Claude models that reject it (Haiku 4.5) (#29585)

* fix(vertex): strip output_config.effort for models that reject it

Haiku 4.5 on Vertex AI does not support output_config.effort and 400s with
"output_config.effort: Extra inputs are not permitted". PR #27074 emptied
VERTEX_UNSUPPORTED_OUTPUT_CONFIG_KEYS so effort would forward for Opus/Sonnet
4.6+, but that made the forwarding unconditional across every Vertex Anthropic
model, including ones that don't support it. Claude Code injects effort into
its default Messages payload, so `claude --model claude-haiku-4.5` started
failing.

Make the sanitizer model-aware: drop output_config.effort for models that
don't advertise output_config support (or any reasoning effort level) while
forwarding it for those that do. The fix covers both the chat-completion and
Messages pass-through transformation paths since they share the helper.
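
A model-aware sanitizer along these lines might look like the sketch below. The `SUPPORTS_EFFORT` map and its model names are hypothetical; the real helper reads capability data from model metadata, and whether it drops an emptied `output_config` is an assumption here.

```python
import copy

# Hypothetical capability map; the real lookup reads model metadata.
SUPPORTS_EFFORT = {"claude-opus-4-6": True, "claude-haiku-4-5": False}


def sanitize_output_config(model: str, payload: dict) -> dict:
    """Drop output_config.effort unless the model advertises support for it."""
    cleaned = copy.deepcopy(payload)
    output_config = cleaned.get("output_config")
    if isinstance(output_config, dict) and not SUPPORTS_EFFORT.get(model, False):
        output_config.pop("effort", None)
        if not output_config:
            # Assumption: an emptied output_config is removed entirely.
            cleaned.pop("output_config")
    return cleaned
```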

* chore(vertex): log at debug when dropping unsupported output_config.effort

Operators pointing at an unregistered Vertex Claude alias that does support
effort would otherwise see it stripped with no signal. Debug level keeps it
out of normal logs since Claude Code sends effort on every request.

* Litellm websocket improvements (#29563)

* Add support for websocket via codex

* Add model alias and creds support

* fix: skip cost tracking for WS session wrapper call types

The @client decorator on _aresponses_websocket fires async_success_handler
with result=None after the session ends. This triggered cost tracking errors
because standard_logging_object is never built for None results.

Per-turn costs are correctly tracked by individual litellm.aresponses calls
inside the session. The outer session-level logging obj should not attempt
cost tracking.

Fix: skip _aresponses_websocket and _arealtime call types in deployment_callback_on_success,
RouterBudgetLimiting.async_log_success_event, and _PROXY_track_cost_callback.

* fix: address Greptile review comments

Fix JSON injection: use json.dumps instead of f-string interpolation for model name in WS body.

Add 30s timeout for first WS frame to prevent unbounded connection resource tie-up.

Restore per-event model override in streaming_iterator; fall back to connection-level model when event omits it.

Strengthen regression test: inject alias into kwargs via _update_kwargs_with_deployment mock so the test would fail on un-fixed code.
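
The JSON-injection fix above comes down to the difference between string interpolation and `json.dumps`. The payload shape here is illustrative, not the exact WebSocket body.

```python
import json

# A hostile model name that closes the string and adds its own keys.
model = 'gpt-5", "admin": true, "x": "'

# Unsafe: interpolating the raw string lets the value inject extra keys.
unsafe_body = f'{{"type": "response.create", "model": "{model}"}}'

# Safe: json.dumps escapes the quotes, so the value stays a single string.
safe_body = json.dumps({"type": "response.create", "model": model})
```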

* fix: handle nested response.create format in first-frame model extraction

When ?model= is omitted, the first WS frame can carry the model in either flat
format (first_event["model"]) or nested format (first_event["response"]["model"]).
The flat-only check would silently reject clients using the nested wire format.

Mirrors the same two-format logic in _build_base_call_kwargs.
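
The two wire formats can be handled with a short lookup. `extract_model` below is a hypothetical sketch of the idea, not the proxy's actual extraction code.

```python
from typing import Optional


def extract_model(first_event: dict) -> Optional[str]:
    """Read the model from a flat or nested response.create frame."""
    model = first_event.get("model")
    if not model:
        nested = first_event.get("response")
        if isinstance(nested, dict):
            model = nested.get("model")
    return model if isinstance(model, str) and model else None
```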

* fix: don't force connection-level custom_llm_provider on per-event model overrides

If a client sends a different model per response.create turn, litellm needs to
re-resolve the provider from that model string. Forcing the connection-level
custom_llm_provider would silently route the request to the wrong backend.

Only inject custom_llm_provider when the per-event model matches the
connection-level model.

* refactor: extract WS model extraction into testable function

Pull the flat/nested model extraction into _extract_model_from_first_ws_event
so tests import and exercise the real function rather than a copy.

* fix: compare providers not full model strings in _inject_credentials

The model == self.model guard was too strict: same-provider model variants
(e.g., vertex_ai/gemini-2.0 -> vertex_ai/gemini-1.5 on one connection) would
lose custom_llm_provider, breaking routing when a custom api_base is in use.

Compare the provider extracted by get_llm_provider instead, so same-provider
variants still inherit the connection-level provider while cross-provider
overrides let litellm re-resolve.
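
The provider comparison can be sketched with a toy resolver. `provider_of` is a simplified stand-in for `get_llm_provider` that only reads the leading path segment, and `same_provider` is a hypothetical free function.

```python
from typing import Optional


def provider_of(model: str) -> Optional[str]:
    """Simplified stand-in for get_llm_provider: take the leading path segment."""
    return model.split("/", 1)[0] if "/" in model else None


def same_provider(connection_model: str, event_model: str) -> bool:
    """True when both models resolve to the same, known provider."""
    connection_provider = provider_of(connection_model)
    return (
        connection_provider is not None
        and connection_provider == provider_of(event_model)
    )
```

Comparing full model strings would treat `vertex_ai/gemini-2.0` and `vertex_ai/gemini-1.5` as different and drop the connection-level provider; comparing providers keeps it.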

* style: black formatting

* refactor: extract first-frame model resolution to fix PLR0915 (too many statements)

* Fix responses WebSocket first-frame validation

* fix: classify WS first-frame read errors and clarify cost-skip log

Distinguish client disconnects from server errors when reading the
responses WebSocket first frame, make the cost-tracking skip log message
accurate for session wrappers (which do carry a model), and resolve the
connection-level provider once per session instead of on every
response.create event.

* test: cover WS first-frame read errors and same-provider credential injection

Adds regression tests for the still-uncovered responses WebSocket paths:
the timeout, invalid-JSON and missing-model branches of
_read_ws_model_from_first_frame, plus the provider comparison in
ManagedResponsesWebSocketHandler._same_provider and _inject_credentials
(same-provider model variants keep the connection provider; cross-provider
models re-resolve).

* fix(responses-ws): fall back to explicit custom_llm_provider when connection model is unresolvable

When a WebSocket session is opened with a custom deployment alias that litellm
cannot resolve to a provider, _connection_provider was None, so _same_provider
returned False for every resolvable per-event model and the connection-level
custom_llm_provider was dropped. Use the explicitly-set custom_llm_provider as
the connection provider in that case so same-provider per-event models still
inherit it while genuinely cross-provider models continue to re-resolve.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>

* feat(arize/phoenix): OpenInference rendering parity — tool_calls, cost, passthrough I/O, session/user, multimodal, cache tokens (#28800)

* feat(arize): enrich OpenInference attributes for better span rendering

Pure rendering enhancements to the Arize / Arize Phoenix integration. No
existing attribute keys or values are removed or overwritten; every new
emit is independently try/except-wrapped and fires only when its source
data is present so existing behavior is preserved.

What this adds
- Coerce non-dict response objects (e.g. httpx.Response from passthrough
  routes) via JSON decode so id/model/usage extraction stops crashing
  with "'Response' object has no attribute 'get'". Dicts and Pydantic
  objects with .get pass through unchanged.
- Set OPENINFERENCE_SPAN_KIND defensively early so a downstream failure
  can't blank the kind; the original late write (incl. TOOL upgrade) is
  preserved.
- Add "passthrough" keyword to _infer_open_inference_span_kind so
  allm_passthrough_route / llm_passthrough_route resolve to LLM instead
  of UNKNOWN.
- Emit cache token breakdown: LLM_TOKEN_COUNT_PROMPT_DETAILS_CACHE_READ /
  _CACHE_WRITE / _AUDIO. Sources covered: OpenAI prompt_tokens_details
  and Anthropic / Bedrock cache_{read,creation}_input_tokens.
- Render assistant tool_calls on both input and output messages via
  MESSAGE_TOOL_CALLS.* (Pydantic-aware, handles ModelResponse choices).
  Tool-result input messages also get MESSAGE_TOOL_CALL_ID and
  MESSAGE_NAME.
- Render multimodal list-shaped content via MESSAGE_CONTENTS.* (OpenAI
  image_url, Anthropic source.{media_type,data} as data: URI). Legacy
  MESSAGE_CONTENT write is unchanged.
- Emit SESSION_ID (end_user_id / trace_id), USER_ID (only when not
  already set by optional_params.user or model_params.user), and
  litellm.{team_id,team_alias,key_alias} from StandardLoggingPayload
  metadata.
- Emit llm.response.cost as float from StandardLoggingPayload.response_cost.
- Bedrock / Anthropic passthrough normalization: extract input from
  additional_args.complete_input_dict and output from the coerced
  provider response so INPUT_VALUE / OUTPUT_VALUE / LLM_INPUT_MESSAGES /
  LLM_OUTPUT_MESSAGES are populated. Only runs when call_type contains
  "passthrough" / "pass_through".

Tests
- 15 new unit tests covering each addition plus explicit regression
  guards (USER_ID overwrite protection, passthrough normalizer scope,
  coerce identity for dicts/.get-bearing objects, no spurious cache
  emits).
- Existing test_arize_set_attributes count bumped from 26 to 27 to
  account for the additional defensive span.kind write (same value,
  written twice).
- tests/test_litellm/integrations/arize/: 70 passed (55 baseline + 15
  new). tests/test_litellm/integrations/test_opentelemetry.py: 221
  passed.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(arize): collapse additive try/except blocks into _safe_emit helper

The additive attribute emitters all share the same shape: run a callable,
swallow any exception to debug log so it cannot blank the span. Hoisting
that pattern into a single _safe_emit(label, fn, *args, **kwargs) helper
removes 5 repeated try/except blocks. Behavior unchanged; arize test
suite still passes (70/70).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(arize): emit cost under canonical llm.cost.total key

Arize's "Total Cost" column reads the OpenInference-standard
`llm.cost.total` attribute. The previous custom `llm.response.cost`
key never surfaced in the trace list. Now emits both keys (canonical and
legacy) so Arize's renderer and any existing consumers both keep working.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(arize): keep span.kind=LLM for tool-using completions + render tool_calls in Output

A chat completion that passes `tools=[...]` or returns `tool_calls` is still
an LLM call per the OpenInference spec — TOOL is reserved for actual tool
execution. The previous override demoted these to TOOL, breaking Arize's
LLM-scoped dashboards/evals and skewing token/cost analytics for any
tool-using traffic.

Additionally, when an assistant response had no text content but did
request tool calls, `output.value` was set to the empty string so Arize's
"Output" pane rendered blank. Now serializes the tool_calls into a compact
JSON summary in `output.value` (the structured `MESSAGE_TOOL_CALLS.*`
attributes are still emitted unchanged).
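
A rough sketch of that fallback, assuming OpenAI-style dict messages. The function names and the exact fields kept in the summary are hypothetical; the PR's own helper is `_summarize_tool_calls_for_output`, whose real output shape is not shown here.

```python
import json
from typing import Any, Dict, List


def summarize_tool_calls(tool_calls: List[Dict[str, Any]]) -> str:
    """Serialize tool calls into a compact JSON string (assumed field set)."""
    summary = []
    for call in tool_calls:
        function = call.get("function") or {}
        summary.append(
            {
                "id": call.get("id"),
                "name": function.get("name"),
                "arguments": function.get("arguments"),
            }
        )
    return json.dumps(summary, separators=(",", ":"))


def output_value(message: Dict[str, Any]) -> str:
    """Prefer text content; fall back to a tool-call summary, then to ''."""
    content = message.get("content")
    if content:
        return content
    tool_calls = message.get("tool_calls") or []
    return summarize_tool_calls(tool_calls) if tool_calls else ""
```

A message with `content=None` and one tool call therefore yields a non-empty JSON string instead of a blank "Output" pane.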

Cleanups:
  - extract `_get_tool_calls` and `_normalize_tool_call` helpers,
    deduplicating the dict-vs-Pydantic + function-dict logic across
    `_set_choice_outputs`, `_emit_message_tool_calls`, and the new
    `_summarize_tool_calls_for_output`.
  - drop redundant late `OPENINFERENCE_SPAN_KIND` write — the defensive
    early write is now the single source of truth.
  - remove a dead local re-import of `MessageAttributes`/`SpanAttributes`.

Tests: 73 pass (added regression guard asserting span.kind stays LLM for
completions that pass tools AND return tool_calls; existing call_count
assertion restored to 26).

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(arize): tighten cleanup — fold _get_tool_calls into _safe_get

Two tiny cleanups, no behavior change:
- collapse `_get_tool_calls` to use `_safe_get`, removing a 7-line
  hand-rolled dict-vs-attribute fallback that duplicated existing logic.
- trim the `_set_choice_outputs` tool-call summary comment from 4 lines
  to 2 (was over-explaining).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(arize): address Greptile review — drop session_id=trace_id fallback, remove dead code, fix Black

Three Greptile-flagged issues + the Black formatting CI failure.

1. SESSION_ID no longer falls back to trace_id. Previously every span
   without an explicit `user_api_key_end_user_id` would have its
   session.id set to the per-request trace_id, which creates one
   distinct "session" per request and breaks Arize's Session-grouping
   analytics. Now SESSION_ID is emitted only when an explicit end-user
   identifier exists, and the trace_id is emitted under its own
   `litellm.trace_id` key so spans remain filterable by trace.

2. Removed dead `ArizeOTELAttributes.set_response_output_messages`
   override. Confirmed zero callers in the entire repo (the live path
   is `_set_choice_outputs` via `_set_response_attributes`). The
   override was preexisting dead code, but the expansion of
   `_set_choice_outputs` in this PR made the divergence misleading.

3. Removed permanently-dead first branch in cache_write detection.
   `_safe_get(prompt_token_details, "cache_creation_tokens")` looks
   for a key that neither OpenAI's `prompt_tokens_details` nor
   Anthropic's payload ever exposes. Now reads straight off `usage`
   for `cache_creation_input_tokens`.

4. Reformatted both files under Black 26.3.1 (the version CI uses
   via `uv sync --frozen`). The local environment previously used 24.10.0.

Tests: 74/74 pass in the arize suite (added
`test_arize_does_not_use_trace_id_as_session_id_fallback`).
Combined arize + opentelemetry suite: 295/295 pass.

End-to-end verified live: tool-call still emits `span.kind=LLM` and
JSON tool_calls in `output.value`; `session.id` is now correctly
unset when no end_user_id is provided; `litellm.trace_id` is
populated; Bedrock passthrough input/output unchanged.
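
The session/trace split verified above could look roughly like this. It is a sketch under assumptions: the attribute keys `session.id` and `litellm.trace_id` and the `user_api_key_end_user_id` field come from this commit message, but the function name and the flat `metadata` dict with a `trace_id` key are hypothetical.

```python
from typing import Any, Dict


def session_attributes(metadata: Dict[str, Any]) -> Dict[str, str]:
    """Emit session.id only for an explicit end user; keep trace_id separate."""
    attrs: Dict[str, str] = {}
    end_user_id = metadata.get("user_api_key_end_user_id")
    if end_user_id:
        attrs["session.id"] = str(end_user_id)
    trace_id = metadata.get("trace_id")  # assumed location of the trace id
    if trace_id:
        # Never reused as session.id: that would create one session per request.
        attrs["litellm.trace_id"] = str(trace_id)
    return attrs
```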

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(arize): gate passthrough prompt export on message redaction

- Skip the complete_input_dict bridge in _maybe_normalize_passthrough when
  should_redact_message_logging() is true, so enabling redaction no longer
  leaks raw passthrough prompts into Arize (Veria security finding).
- Split passthrough input/output rendering into helpers to satisfy PLR0915.
- Remove dead call_type assignment (F841).

Validated live against a Bedrock passthrough proxy exporting to Arize:
non-redacted renders the real prompt on litellm_request; global
turn_off_message_logging yields input.value=redacted-by-litellm with the
raw_gen_ai_request child span suppressed and no SSN/marker leakage.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: passthrough endpoints duplicate logs (#29598)

* fix duplicate cost callbacks for anthropic streaming pass-through

Two bugs caused _PROXY_track_cost_callback to see stream=True +
complete_streaming_response=None on every streaming pass-through request,
making the dedup guard in dispatch_success_handlers permanently inactive:

1. pass_through_endpoints.py created the Logging object with stream=False
   for all requests. _is_assembled_stream_success short-circuits on
   self.stream is not True, so has_dispatched_final_stream_success was
   never set and any second dispatch went through unchecked.
   Fix: set logging_obj.stream = True after stream detection.

2. _create_anthropic_response_logging_payload set complete_streaming_response
   inside the try block after litellm.completion_cost(), so a pricing error
   caused an early return without setting it on model_call_details.
   Fix: set complete_streaming_response before the try block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix stream

* add stream to logging obj

* test(pass_through): give mock logging object a real model_call_details dict

The anthropic passthrough logging payload now records the assembled
response on model_call_details before cost calculation, which requires
model_call_details to support item assignment. In production it is always
a dict; the existing unit test stubbed the logging object with a bare Mock
whose attribute is not subscriptable, so the new assignment raised
TypeError. Use a real dict to match the production logging object.

* test(pass_through): cover streaming logging-obj stream flag

The streaming branch of pass_through_request that marks the logging object
as streaming (logging_obj.stream and model_call_details["stream"]) had no
unit coverage, so the patch coverage gate flagged it. Add a regression test
that drives a streaming pass-through request through pass_through_request and
asserts the logging object is flagged as a stream before dispatch.

* test(pass_through): cover SSE-response stream flag fallback branch

The auto-detected streaming branch of pass_through_request (when a request
that was not flagged as streaming returns a text/event-stream response) sets
logging_obj.stream and model_call_details["stream"] but had no unit coverage,
so the codecov patch gate failed at 60%. Drive a non-streaming pass-through
request whose upstream response is SSE through pass_through_request and assert
the logging object is flagged as a stream before dispatch.

* fix(pass_through): gate complete_streaming_response on stream flag

perform_redaction only scrubs complete_streaming_response when
model_call_details["stream"] is True. Setting it unconditionally for
non-streaming Anthropic pass-through responses left the assembled
response unredacted in model_call_details, which is handed to logging
callbacks as kwargs when message logging is disabled. Only record it for
actual streaming responses so redaction always applies.

---------

Co-authored-by: mubashir1osmani <mubashir.osmani777@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(ci): keep coverage rename green when a parallel node runs no tests (#29608)

* fix(ci): keep coverage rename green when a parallel node runs no tests

local_testing_part1 and local_testing_part2 run with parallelism 4. When
CircleCI reruns only the failed tests, the failed test lands on a single
node and the other nodes receive an empty bucket, so pytest never writes
coverage.xml or .coverage. The unguarded "mv coverage.xml ..." then exits
1 and turns the whole job red even though the rerun passed; the next
persist_to_workspace step would fail the same way on the missing paths.

Guard the rename so a node with no coverage emits empty placeholders
instead. coverage combine tolerates the empty files, so the downstream
upload-coverage job keeps the real nodes' data intact.

* fix(ci): pre-create test-results in litellm_router_testing for empty-bucket reruns

litellm_router_testing also runs with parallelism 4. On a rerun of only the
failed tests, a node can receive no tests, so the test command never creates
test-results and the final store_test_results step can fail on the missing
path. Pre-create the directory up front, matching what local_testing_part1
and part2 already do and CircleCI's own guidance for parallel reruns.

* test(openai): retry wildcard chat completion on transient OpenAI 500

build_and_test reddened on test_openai_wildcard_chat_completion when the
real gpt-3.5-turbo-0125 call returned an OpenAI 500 ("The server had an
error while processing your request"). The base branch passed the same
call concurrently, so the 500 is an intermittent OpenAI server error, not
a regression. Add the same pytest-retry marker the sibling real-call tests
in this file already use so a transient upstream 500 no longer fails CI.

* test(vcr): close out the remaining VCR live-call leaks (#29603)

* Fix remaining VCR live-call leaks

* test(vcr): dedupe live-test helpers and drop spurious kwargs

Extract the duplicated isVertexQuotaError/runVertexRequestOrSkip Vertex
quota-skip helpers into tests/pass_through_tests/vertex_test_helpers.js and the
duplicated _skip_live_prompt_caching_test guard into tests/_live_test_helpers.py
so each lives in one place. In test_aarun_thread_litellm, build a separate
message_data carrying role/content for add_message and a thread_data without
them for run_thread/run_thread_stream/get_messages, which no longer receive the
spurious message fields.

* test(overhead): assert mock transport is exercised in non-streaming and stream tests

* fix(key_generate): exempt UI/CLI session tokens from the budget ceiling for team keys (#29612)

Non-admin users creating a team key through the UI were rejected with
"max_budget cannot exceed the caller's own max_budget (0.25)". The request is
authenticated by a UI/CLI session token whose max_budget is the per-session chat
spend cap (max_ui_session_budget, default $0.25), and the delegated-authority
budget ceiling (GHSA-q775-qw9r-2r4g) treated that cap as a delegation limit.

Skip the ceiling only when a session token creates a team key (data.team_id set);
that key's spend is bounded by the team budget at request time. Personal keys and
every other non-admin caller keep the ceiling, so a session token cannot mint an
arbitrary-budget personal key.

* fix(realtime): allow null transcripts in stream logging payloads (#29625)

Allow realtime event transcript fields to be nullable so GA conversation.item payloads with transcript=null don't fail logging normalization and suppress success callbacks.

Co-authored-by: Cursor <cursoragent@cursor.com>

* build(ui): migrate eslint to flat config and bump eslint-config-next to 16 (#29626)

ESLint 9 defaults to flat config and eslint-config-next was pinned at 15
while Next is on 16, so eslint only ran with ESLINT_USE_FLAT_CONFIG=false
and next lint is gone on Next 16. Replace .eslintrc.json with a native
flat eslint.config.mjs (config-next 16 ships flat configs, so no
FlatCompat shim is needed), bump eslint-config-next to 16.2.6, add
@eslint/js and typescript-eslint as explicit devDeps for the recommended
rule sets, and point the lint script at eslint directly.

This only makes eslint runnable on modern tooling; it does not wire it
into CI. The same rules carry over (next/core-web-vitals, eslint and
typescript-eslint recommended, prettier, unused-imports)

* fix(key_generate): scope session-token team-key budget exemption to caller-supplied team_id (#29641)

#29612 exempts UI/CLI session tokens from the key budget ceiling when they
create a team key, keyed on data.team_id. That value is read after the
default_key_generate_params loop can populate team_id, so on deployments that
set default_key_generate_params.team_id a request the caller did not scope to a
team is treated as a team key and skips the ceiling. Capture _requested_team_id
before defaults run and key the exemption off it, mirroring how
_requested_max_budget is already captured. Requests the caller did not scope to a
team keep the ceiling.
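
The ordering issue can be illustrated with a small sketch. Everything here is hypothetical except the idea taken from the commit message: capture the caller-supplied `team_id` before defaults are applied and key the exemption off that captured value.

```python
from typing import Any, Dict, Optional


def skips_budget_ceiling(is_session_token: bool, requested_team_id: Optional[str]) -> bool:
    """Exempt only session tokens that explicitly scoped the key to a team."""
    return is_session_token and requested_team_id is not None


def plan_key_request(
    data: Dict[str, Any], defaults: Dict[str, Any], is_session_token: bool
) -> Dict[str, Any]:
    # Capture what the caller asked for BEFORE defaults can fill it in.
    requested_team_id = data.get("team_id")

    merged = dict(data)
    for key, value in defaults.items():
        if merged.get(key) is None:
            merged[key] = value

    return {
        "team_id": merged.get("team_id"),
        "skip_ceiling": skips_budget_ceiling(is_session_token, requested_team_id),
    }
```

With a default `team_id` configured, a request that did not name a team still ends up with the default team on the key but keeps the ceiling.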

* fix(proxy): disable proxy buffering on streaming SSE responses (#29557)

Streaming responses from the proxy (/chat/completions, /v1/messages,
/v1/responses, assistants) all return through create_response() but never
sent the headers that tell an intermediary reverse proxy not to buffer the
SSE stream. nginx with the default proxy_buffering, k8s ingress-nginx, and
Envoy/Istio sidecars therefore hold the whole stream and release it in one
batch, which looks like a broken/buffered stream to the client even though
litellm is yielding chunks incrementally.

Add Cache-Control: no-cache and X-Accel-Buffering: no to every
StreamingResponse create_response() returns, matching what the proxy already
does for its own usage/policy SSE endpoints. Fixes #28384.
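
A minimal sketch of the header merge, assuming a plain dict of response headers. The two header names and values come from this commit message; the function name and the choice to let caller-supplied headers take precedence are assumptions.

```python
from typing import Dict, Optional

# Headers that ask intermediaries (nginx, ingress-nginx, etc.) not to buffer SSE.
NO_BUFFERING_HEADERS: Dict[str, str] = {
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",
}


def streaming_headers(custom_headers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return headers for a streaming response with buffering disabled."""
    headers = dict(NO_BUFFERING_HEADERS)
    headers.update(custom_headers or {})  # assumed: caller headers win on conflict
    return headers
```

The resulting dict can be passed as the `headers` argument of a streaming response object.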

* fix(mcp): gate /public/mcp_hub strictly on litellm.public_mcp_servers (#27764)

* fix(mcp): gate /public/mcp_hub strictly on litellm.public_mcp_servers

* fix(mcp): add public_mcp_hub_strict_whitelist flag (default True) for migration

* ci(ui): frontend-lint job enforcing prettier + eslint on changed files (#29633)

* ci(ui): add frontend-lint job enforcing prettier and eslint on changed files

Lints only the files a PR adds or modifies under ui/litellm-dashboard,
so new and touched code must be prettier-clean and eslint-clean while the
existing tree is grandfathered. Skips cleanly when a PR touches no
lintable UI files. This lets us adopt the formatters incrementally
without a repo-wide reformat

* ci(ui): write frontend-lint file lists to $RUNNER_TEMP

Keep the prettier/eslint changed-file lists out of the checkout dir so
they cannot collide with a future source file of the same name

* lint(ui): baseline existing eslint findings so only new ones block

Capture the current error-level eslint findings (318 across 183 files)
in a committed suppressions baseline via eslint --suppress-all. Every
rule stays at its error severity, so any newly introduced violation
fails the frontend-lint gate, while the existing tree is grandfathered;
touching a legacy file never forces fixing its pre-existing issues. CI
runs eslint with --pass-on-unpruned-suppressions so that fixing a
baselined issue does not fail on a now-stale suppression, and the
generated baseline is prettier-ignored since eslint owns its format.
Burn the baseline down over time with eslint --prune-suppressions

* lint(ui): enforce a count budget for explicit any

Make @typescript-eslint/no-explicit-any a warning and cap the total
instead of hard-blocking each new one. A frontend-lint step counts the
repo-wide explicit any and fails only when it exceeds the committed
budget in eslint-any-budget.json. max starts at 2031, ten above the
current 2021, so the next ten land as warnings and the build fails once
that headroom is gone. Lower max over time toward target to ratchet the
count down. New anys still surface as warnings on changed files via the
normal eslint step

* lint(ui): enable zero-cost rules no-var, no-self-assign, react/no-danger

These have no existing violations, so they need no baseline; turning them
on purely blocks new instances. react/no-danger guards against new
dangerouslySetInnerHTML (XSS), no-var enforces let/const, and
no-self-assign catches self-assignment typos. no-debugger is already
enforced by the recommended preset

* lint(ui): add baselined complexity rules

Enable complexity:20, max-depth:4, max-params:4, max-nested-callbacks:4,
with thresholds set near the codebase p99 so only genuine outliers are
flagged. The 272 existing over-threshold functions are grandfathered in
the suppressions baseline; new over-threshold functions block. Lower the
thresholds over time to ratchet complexity down. max-lines-per-function
is intentionally left off since React components are legitimately long

* lint(ui): ban new raw fetch, standardize on React Query

Add a no-restricted-syntax rule flagging bare fetch() calls, pointing
contributors at React Query (@tanstack/react-query). The rule is not
exempted anywhere, including the already-bloated networking.tsx, so all
331 existing fetch calls are grandfathered but no new ones can be added
there or elsewhere. New data access goes through React Query, and the
networking layer can be migrated out and pruned from the baseline over
time

* lint(ui): ban new @tremor/react imports

Add a no-restricted-imports rule flagging imports from @tremor/react so
tremor is phased out rather than spread further. The 232 existing tremor
imports are grandfathered in the baseline; new ones block and point at
antd. Migrate components off tremor and prune the baseline over time

* lint(ui): widen explicit-any budget headroom to 2040

Raise max from 2031 to 2040, giving 19 of slack over the current 2021
instead of 10

* style(ui): prettier-format eslint.config.mjs

The frontend-lint gate flagged its own config file. Format it so the
prettier check on this PR's changed files passes

* lint(ui): soften complexity and max-depth to warnings

These two are smell metrics with arbitrary thresholds where a legit new
function can trip them, so make them advisory rather than hard-blocking.
They drop out of the baseline (now 963). max-params, max-nested-callbacks,
and the react-hooks rules stay strict since those are clear-cut

* lint(ui): move complexity and max-depth to the count-budget pattern

Generalize the explicit-any budget into a shared lint-budget mechanism:
eslint-budgets.json maps a rule to {max, target} and check-lint-budgets.mjs
counts each across the repo and fails when a count exceeds its max.
complexity (129, max 140) and max-depth (61, max 70) now use the same
slack-plus-counter model as explicit-any (2021, max 2040): they warn
per-file and the build only fails if the repo-wide total crosses the
ceiling. Lower each max toward its target over time
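
The budget check reduces to a per-rule comparison, sketched here in Python for illustration. The actual script is `check-lint-budgets.mjs`, whose code is not shown in this log. The counts and `max` values are the ones quoted above; `target` values are not given in the commit message, so they are omitted, and the function name is hypothetical.

```python
from typing import Dict, List

BUDGET_MAX: Dict[str, int] = {
    "@typescript-eslint/no-explicit-any": 2040,
    "complexity": 140,
    "max-depth": 70,
}

CURRENT_COUNTS: Dict[str, int] = {
    "@typescript-eslint/no-explicit-any": 2021,
    "complexity": 129,
    "max-depth": 61,
}


def rules_over_budget(counts: Dict[str, int], budget_max: Dict[str, int]) -> List[str]:
    """Return the rules whose repo-wide count exceeds the committed max."""
    return sorted(
        rule for rule, limit in budget_max.items() if counts.get(rule, 0) > limit
    )
```

A build would fail only when the returned list is non-empty; a count equal to its max still passes.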

* docs(ui): note pruning the eslint suppressions baseline when fixing lint debt

* fix(gemini): googleSearch + server-side tools and googleMaps JSON schema (#29582)

* fix(gemini): keep googleSearch with server-side tools and googleMaps JSON schema

Wire include_server_side_tool_invocations through completion() so mixed
google_search and function tools are not dropped on Gemini 3+. Rewrite
generationConfig to responseFormat when googleMaps is used with JSON schema.

Fixes #27479
Fixes #29451

Co-authored-by: Cursor <cursoragent@cursor.com>

* address greptile review feedback (greploop iteration 1)

* style: fix black formatting in main.py for py312 compat

* Fix Gemini Google Maps extra_body JSON rewrite

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(proxy): passthrough 404 when SERVER_ROOT_PATH is set (#29658)

* fix(proxy): match passthrough registry routes bare-to-bare with SERVER_ROOT_PATH

After #28547, get_request_route strips the deployment prefix while registry
lookup still re-inflated stored paths via SERVER_ROOT_PATH, causing 404s
under paths like /llmproxy/ml. Compare normalized bare routes in both
is_registered_pass_through_route and get_registered_pass_through_route.
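
The bare-to-bare comparison can be sketched as below. This is a simplified illustration with hypothetical helper names and exact-match lookup only; the real registry functions may handle subpaths and other cases not modeled here.

```python
from typing import Iterable


def to_bare_route(route: str, server_root_path: str) -> str:
    """Strip the deployment prefix from a route, if it is present."""
    root = server_root_path.rstrip("/")
    if not root:
        return route
    if route == root:
        return "/"
    if route.startswith(root + "/"):
        return route[len(root):]
    return route


def is_registered(route: str, registered_bare_paths: Iterable[str], server_root_path: str) -> bool:
    """Compare the normalized (bare) route against bare stored paths."""
    return to_bare_route(route, server_root_path) in set(registered_bare_paths)
```

Because stored paths are never re-inflated with the prefix, a route matches whether it arrives with the prefix or already stripped.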

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(proxy): patch utils.get_server_root_path in passthrough auth tests

After removing get_server_root_path from pass_through_endpoints, route
and JWT tests must mock litellm.proxy.utils where normalization reads it.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(gemini-realtime): use GA event names for Pipecat 1.3.x compatibility (#29662)

* fix(gemini-realtime): use GA event names for Pipecat 1.3.x compatibility

Pipecat v1.3.0 adopted the OpenAI Realtime API GA event naming:
  response.audio.delta          -> response.output_audio.delta
  response.text.delta           -> response.output_text.delta
  response.audio.done           -> response.output_audio.done
  response.text.done            -> response.output_text.done

The proxy was still emitting the old beta names; Pipecat's
`parse_server_event` raises "Unimplemented server event type" for any
unknown type, which killed the receive task handler and broke audio
playback and tool-call delivery.

Also:
- conversation.item.created -> conversation.item.added (already handled)
- client audio is buffered until backend setupComplete in deferred mode
- call_id fallback UUID when Gemini returns empty id
- status_details / token detail fields added to Pydantic-strict events

The _GA_TO_BETA_EVENT_TYPES map in RealTimeStreaming already translates
GA names back to beta for clients that opt in with the openai-beta
header, so legacy clients are unaffected.
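
The translation step can be pictured as a lookup table built from the renames listed above. This is a sketch only: the real `_GA_TO_BETA_EVENT_TYPES` map may contain more or different entries, and the function name and `wants_beta` flag are hypothetical.

```python
from typing import Any, Dict

# GA name -> beta name, built only from the pairs quoted in this commit message.
GA_TO_BETA: Dict[str, str] = {
    "response.output_audio.delta": "response.audio.delta",
    "response.output_text.delta": "response.text.delta",
    "response.output_audio.done": "response.audio.done",
    "response.output_text.done": "response.text.done",
    "conversation.item.added": "conversation.item.created",
}


def translate_for_client(event: Dict[str, Any], wants_beta: bool) -> Dict[str, Any]:
    """Rename the event type for clients that opted into beta names."""
    if not wants_beta:
        return event
    beta_type = GA_TO_BETA.get(event.get("type", ""))
    if beta_type is None:
        return event  # unknown or unchanged event types pass through
    return {**event, "type": beta_type}
```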

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(gemini-realtime): address greptile review comments

- emit outputTranscription as response.output_audio_transcript.delta
  instead of suppressing it; GA_TO_BETA map handles translation for
  legacy clients
- cap pre-setup audio buffer at 200 frames to prevent memory exhaustion;
  log a warning when the limit is hit and additional frames are dropped
- log remaining dropped message count on flush error

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(gemini-realtime): address veria review comments

- remove unused OpenAIRealtimeConversationItemCreated import
- fix guardrail bypass: semantic_vad early-return now preserves
  create_response when set so a guardrail-injected create_response:false
  is not silently dropped
- add per-connection 10 MB byte cap alongside the 200-frame count cap
  for the pre-setup audio buffer to prevent memory exhaustion
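
The two caps can be sketched with a small bounded buffer. The 200-frame and 10 MB limits come from these commits; the class name, the interpretation of 10 MB as 10 * 1024 * 1024 bytes, and the drop-newest policy are assumptions.

```python
from typing import List


class PendingAudioBuffer:
    """Holds client frames until backend setup completes, within two caps."""

    def __init__(self, max_frames: int = 200, max_bytes: int = 10 * 1024 * 1024) -> None:
        self.max_frames = max_frames
        self.max_bytes = max_bytes
        self.dropped = 0
        self._frames: List[bytes] = []
        self._size = 0

    def append(self, frame: bytes) -> bool:
        """Buffer the frame, or drop it if either cap would be exceeded."""
        too_many = len(self._frames) >= self.max_frames
        too_big = self._size + len(frame) > self.max_bytes
        if too_many or too_big:
            self.dropped += 1
            return False
        self._frames.append(frame)
        self._size += len(frame)
        return True

    def drain(self) -> List[bytes]:
        """Return buffered frames in arrival order and reset the buffer."""
        frames, self._frames, self._size = self._frames, [], 0
        return frames
```

A caller would check the return value of `append` to decide when to log the dropped-frame warning.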

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(gemini-realtime): fix mypy arg-type on _finalize_gemini_live_setup

setup parameter typed as BidiGenerateContentSetup to match the TypedDict
passed at both call sites; was dict which mypy rejected.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(gemini-realtime): widen _finalize_gemini_live_setup to Dict[str, Any]

BidiGenerateContentSetup (TypedDict) is a subtype of Dict[str,Any] so
both call sites (one passing a plain dict, one passing the TypedDict)
satisfy mypy.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(gemini-realtime): cast BidiGenerateContentSetup to Dict at _finalize call site

mypy rejects TypedDict as dict[str, Any] argument; cast at the call site
where follow_up_setup is BidiGenerateContentSetup to satisfy the checker.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix Gemini realtime beta compatibility

* Fix deferred Gemini setup audio ordering

* fix: preserve Gemini audio transcript ids

* fix(realtime): cap pre-setup client buffer on all append paths

Route every append to the deferred-setup pending buffer through the
per-connection message/byte caps. Previously only the audio-buffer
fast path enforced the caps; once one frame was buffered, a client
that withheld session.update could stream arbitrary frames into
_pending_messages_until_setup unbounded and exhaust proxy memory.

* style(gemini-realtime): apply black formatting to transformation.py

* fix(gemini-realtime): log beta-translation fallback and name native-audio marker

Surface the previously swallowed exception in _send_event_to_client so a
failed GA->beta translation is observable instead of silently forwarding the
untranslated event. Extract the native-audio model substring used by
_finalize_gemini_live_setup into a named constant documenting why speechConfig
is dropped on those setups.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>

* Litellm oss staging 040626 (#29671)

* fix(azure): apply api_version fallback chain to image edit URL

`AzureImageEditConfig.get_complete_url` only read `api_version` from
`litellm_params`. When callers configured it via `litellm.api_version`
or `AZURE_API_VERSION`, the constructed URL had no `?api-version=` and
Azure responded `404 Resource not found`.

Apply the same fallback chain the Azure chat path already uses in
`common_utils.py`:

    litellm_params > litellm.api_version > AZURE_API_VERSION env >
    litellm.AZURE_DEFAULT_API_VERSION
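
As a sketch, the chain amounts to taking the first non-empty value in priority order. The function name and parameters are hypothetical, and `DEFAULT_API_VERSION` is a placeholder standing in for `litellm.AZURE_DEFAULT_API_VERSION`, not its real value.

```python
import os
from typing import Any, Dict, Mapping, Optional

DEFAULT_API_VERSION = "placeholder-default"  # stand-in, not the real default


def resolve_api_version(
    litellm_params: Dict[str, Any],
    global_api_version: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick api_version: params > global setting > env var > default."""
    env = os.environ if env is None else env
    return (
        litellm_params.get("api_version")
        or global_api_version
        or env.get("AZURE_API_VERSION")
        or DEFAULT_API_VERSION
    )
```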

Adds 5 unit tests pinning each layer of the chain plus a regression
guard for `api_base` that already carries `?api-version=`.

* feat(mcp): core sampling and elicitation flow with security hardening

- Add sampling_handler.py: full MCP sampling/createMessage flow with
  model selection (hint-based + priority-based), auth enforcement,
  budget checks, route restriction gates, and tag policy pre-auth
- Add elicitation_handler.py: MCP elicitation/create relay with
  downstream client capability detection
- Wire sampling/elicitation callbacks in mcp_server_manager.py
  gated behind allow_sampling/allow_elicitation config flags
- Add allow_sampling/allow_elicitation fields to MCPServer type
- Fix session lock deadlock: skip lock for JSON-RPC response POSTs
  (elicitation/sampling replies) with truncated-body heuristic
- Extend client.py with sampling_callback and elicitation_callback
- Security: RouteChecks gate, tag-budget bypass fix, x-forwarded-for
  spoofing fix, Latin-1 header encoding guard
- Add 4 new test modules (model access, priority selection, request
  builder, tool conversion) + update existing MCP tests

* fix(security): run pre-call guardrails before MCP sampling acompletion

Without this, an upstream MCP server with allow_sampling enabled could
send prompts that bypass every guardrail (content filtering, PII
redaction, prompt-injection detection) configured on /chat/completions.

- Call proxy_logging_obj.pre_call_hook(call_type='acompletion') before
  llm_router.acompletion so guardrails fire for sampling sub-calls
- Add HTTPException to the re-raise list so guardrail rejections
  propagate correctly instead of being swallowed as generic errors

* feat(bedrock_mantle): add Responses API support (/openai/v1/responses) (#29490)

* feat(bedrock_mantle): add Responses API transformation config

* test(bedrock_mantle): cover trailing-slash api_base normalization

* feat(bedrock_mantle): export BedrockMantleResponsesAPIConfig

* feat(bedrock_mantle): register gpt-5.x Responses config (gpt-oss unchanged)

* feat(bedrock_mantle): add gpt-5.5/gpt-5.4 Responses price-map entries

* refactor(bedrock_mantle): exclude gpt-oss instead of allow-listing gpt-5 for Responses routing

Frontier OpenAI models on Bedrock Mantle are Responses-only on /openai/v1/responses;
gpt-oss is the legacy family that also speaks chat-completions. Gate by excluding
gpt-oss (which keeps its chat-completions emulation) and defaulting everything else
to the native Responses config, so future frontier models (gpt-6, etc.) route
correctly without a code change. Verified against the live us-east-2 Mantle endpoint:
gpt-oss 400s on /openai/v1/responses while gpt-5.5 400s on both standard paths.
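
The gate described above can be sketched as an exclusion check. The function name and the substring match on the model id are assumptions; the commit message only states that gpt-oss is excluded and everything else defaults to the native Responses config.

```python
def uses_native_responses(model: str) -> bool:
    """Route every model except the gpt-oss family to the native Responses config."""
    # Assumed matching rule: any model id containing "gpt-oss" is legacy.
    return "gpt-oss" not in model.lower()
```

An exclusion keeps the default on the native path, so a future model id that the code has never seen is routed to Responses without a code change.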

* test(bedrock_mantle): cover supports_native_websocket opt-out

Closes the one uncovered line flagged by codecov on the Responses config.
The assertion documents that Mantle Responses has no realtime/websocket
transport, so realtime routing must not attempt a socket it cannot serve.

* fix(bedrock_mantle): route file_search through emulation instead of forwarding to Mantle

BedrockMantleResponsesAPIConfig inherited supports_native_file_search()
-> True from OpenAIResponsesAPIConfig but never overrode it. Mantle has no
OpenAI vector stores, so a forwarded file_search tool is rejected with a
400 (verified upstream: Tool type 'file_search' is not supported). Opting
out, like the existing supports_native_websocket override, routes the tool
through LiteLLM's file_search emulation instead.

* fix(bedrock_mantle): only route ope…

5 participants