
[Infra] Promote internal staging to main #27906

Merged
shin-berri merged 43 commits into main from litellm_internal_staging
May 14, 2026

Conversation

@yuneng-berri

Collaborator

Relevant issues

Linear ticket

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/test_litellm/ directory (adding at least 1 test is a hard requirement; see details)
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable, but needs attention.
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

Type

🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test

Changes

stuxf and others added 30 commits May 13, 2026 01:05
… path

``get_instance_fn`` previously routed any ``s3://`` / ``gcs://``
value into ``_load_instance_from_remote_storage`` regardless of how
the value got there. The function ultimately calls
``spec.loader.exec_module(module)`` — Python in the proxy process. On
admin-callable endpoints that accept a ``target`` / ``custom_handler``
field from the request body (e.g. ``/config/pass_through_endpoint``,
custom-callback registration), that is a one-step admin-to-RCE
primitive: any future privilege-escalation bug becomes immediate
code execution.

The documented operator flow for remote-module loading is
``litellm_settings.callbacks: ["s3://bucket/module.instance"]`` in
``config.yaml``. That path always carries the YAML's
``config_file_path`` through to ``get_instance_fn``. Use the presence
of ``config_file_path`` as the discriminator: refuse remote URLs
when it is absent (the request-body path) unless the operator
explicitly opts back in via
``LITELLM_ALLOW_REMOTE_INSTANCE_FN_FROM_API=true``.

The three success/failure/audit-log callback-loop call sites in
``proxy_server.py:load_config`` were already running inside the
startup config-file load but had stopped threading
``config_file_path`` through. Pass it through so the documented
``s3://`` callback flow continues to work unchanged.

Tests cover: remote URL without ``config_file_path`` raises;
remote URL with the opt-in env reaches the loader; remote URL
with ``config_file_path`` passes (documented startup flow); local
dotted-name imports unaffected.
The pre-existing s3:// / gcs:// custom-logger tests called
``get_instance_fn`` without ``config_file_path``, which means the
new runtime gate (refuse remote URLs unless invoked from a
config-file load) now raises ``ValueError`` before reaching the
mocked download paths. Each test was exercising the documented
startup config-file load scenario; pass ``config_file_path="/any/path"``
to make that intent explicit and route past the gate.

Affected: test_s3_download_success, test_gcs_download_success,
test_invalid_url_format, test_download_failure_handling,
test_file_cleanup.
The runtime gate on s3:// / gcs:// loading in get_instance_fn previously
allowed an opt-in via LITELLM_ALLOW_REMOTE_INSTANCE_FN_FROM_API. That
env var is admin-flippable at runtime (DB-overlay environment_variables
flow into os.environ), which defeats the gate's purpose, and it isn't
needed for the documented operator flow: config.yaml callbacks always
pass config_file_path through to the loader.

Remove the helper, raise unconditionally when config_file_path is None,
and drop the corresponding test for the opt-in branch.
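A minimal sketch of the final gate as described in these commits, assuming it can be expressed as a standalone check; `check_remote_instance_source` is a hypothetical name, not LiteLLM's actual helper, and only the `s3://` / `gcs://` prefixes named above are treated as remote:

```python
from typing import Optional

# Remote schemes named in the commit message; anything else is a local dotted import.
_REMOTE_PREFIXES = ("s3://", "gcs://")


def check_remote_instance_source(value: str, config_file_path: Optional[str]) -> None:
    """Hypothetical gate: refuse remote module URLs unless loaded from config.yaml.

    The YAML startup path always supplies config_file_path; request-body paths
    (pass-through endpoints, callback registration) leave it as None. There is
    deliberately no environment-variable opt-in.
    """
    if value.startswith(_REMOTE_PREFIXES) and config_file_path is None:
        raise ValueError(
            f"Refusing to load remote module {value!r}: remote URLs are only "
            "allowed from a config-file load"
        )


# Documented startup flow: passes.
check_remote_instance_source("s3://bucket/module.instance", config_file_path="/app/config.yaml")
# Local dotted-name imports are unaffected.
check_remote_instance_source("my_pkg.handlers.instance", config_file_path=None)
```

Keying the decision on an argument only the YAML loader passes, rather than on an environment variable, keeps the gate out of reach of the DB-overlay path that can write to os.environ.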
…ML loaders

The previous commit's gate broke two legitimate startup paths for
operators using s3:// / gcs:// remote module loading from their config.yaml:

- general_settings.pass_through_endpoints[].custom_handler
- mcp_tools[].handler

Both call sites invoked get_instance_fn without a config_file_path, so
the new gate rejected them at startup. Thread config_file_path through:

- create_pass_through_route accepts config_file_path and forwards it to
  get_instance_fn. add_exact_path_route, add_subpath_route,
  _register_pass_through_endpoint, and initialize_pass_through_endpoints
  accept and propagate it.
- The YAML-load call site in proxy_server.load_config now passes
  config_file_path; the DB-overlay call site in _update_general_settings
  leaves it as the default None so the gate still fires on admin-written
  s3:// values.
- MCPToolRegistry.load_tools_from_config accepts config_file_path and
  threads it into get_instance_fn; _init_non_llm_configs forwards it
  from load_config.

Adds two regression tests verifying that the YAML-source callers thread
the path through to get_instance_fn.

* fix(gemini): normalize response_schema on native generateContent

The /v1beta/models/{model}:generateContent passthrough forwarded
generationConfig.response_schema verbatim, so schemas containing $defs,
$ref, anyOf-with-null, default, or title were rejected by Gemini even
though /chat/completions already handles them.

GoogleGenAIConfig.transform_generate_content_request now calls a new
_normalize_response_schema helper that mirrors the chat/completions
path: Gemini 2.0+ models get the schema promoted to responseJsonSchema
via _build_json_schema (preserving $defs/$ref natively), older models
keep responseSchema but the schema is flattened with
_build_vertex_schema. VertexAIGoogleGenAIConfig (which overrides the
transform entirely) calls the same helper before building the request.

* fix(gemini): preserve caller-supplied responseJsonSchema when responseSchema co-present

Previously, when both responseJsonSchema and responseSchema were present
on Gemini 2.0+, _normalize_response_schema processed responseJsonSchema
first (no-op normalization) then unconditionally promoted responseSchema
to responseJsonSchema, clobbering the caller-supplied value.

Now skip the promotion (and drop the redundant responseSchema) when the
caller already supplied responseJsonSchema.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* chore: strip restating comments from response-schema normalize

Drop the docstring on _normalize_response_schema and the two inline
comments that just restated what the surrounding code/asserts already
say. Function name + variable names carry the intent; PR description
covers the why-it-exists context.

* perf(gemini): drop redundant deepcopy on responseJsonSchema normalize

_build_json_schema is a no-op (returns its argument unchanged), so the
deepcopy + round-trip on the responseJsonSchema branch allocated a full
schema copy on every request with no observable effect. Forward the
caller's value as-is, and just move the popped responseSchema value when
promoting on Gemini 2.0+.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* style: remove unneeded comment

* fix(gemini): drop unsupported responseJsonSchema for older models

* test(gemini): add parity test between native and chat schema normalization

Per @Sameerlite review: lock the two Gemini schema-normalization paths
together. If either GoogleGenAIConfig._normalize_response_schema (native
generateContent) or VertexGeminiConfig.apply_response_schema_transformation
(/chat/completions) drifts, the parity test fails — forcing both to be
updated together.

* fix(google_genai): preserve key naming convention in _normalize_response_schema

When the input schema key is snake_case (response_schema), the promoted
JSON schema key should also be snake_case (response_json_schema) instead
of mixing in camelCase (responseJsonSchema). This matters for the Vertex
AI google_genai path which converts all keys to snake_case before
calling _normalize_response_schema.
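A rough sketch of the key-selection rules these commits converge on, assuming a plain dict for generationConfig and a `supports_json_schema` flag standing in for the "Gemini 2.0+" check; `normalize_response_schema` and the `flatten` callback (a stand-in for `_build_vertex_schema`) are hypothetical names, not the actual LiteLLM signatures:

```python
from typing import Any, Callable, Dict


def normalize_response_schema(
    generation_config: Dict[str, Any],
    supports_json_schema: bool,
    flatten: Callable[[Any], Any] = lambda schema: schema,
) -> Dict[str, Any]:
    """Hypothetical sketch of the response-schema key rules described above."""
    cfg = dict(generation_config)  # shallow copy; schema values are moved, not deep-copied
    # Keep the caller's naming convention (snake_case on the Vertex google_genai path).
    snake = "response_schema" in cfg or "response_json_schema" in cfg
    schema_key = "response_schema" if snake else "responseSchema"
    json_key = "response_json_schema" if snake else "responseJsonSchema"

    if supports_json_schema:  # Gemini 2.0+
        if json_key in cfg:
            # Caller-supplied JSON schema wins; drop the redundant legacy key.
            cfg.pop(schema_key, None)
        elif schema_key in cfg:
            # Promote as-is so $defs / $ref survive natively.
            cfg[json_key] = cfg.pop(schema_key)
    else:
        # Older models: the JSON-schema key is unsupported; flatten the legacy schema.
        cfg.pop(json_key, None)
        if schema_key in cfg:
            cfg[schema_key] = flatten(cfg[schema_key])
    return cfg
```

For example, `normalize_response_schema({"response_schema": s}, True)` yields `{"response_json_schema": s}` with the same `s` object, which also reflects the "drop redundant deepcopy" change.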

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
…th when flag set (#27716)

* feat(proxy): skip disable_background_health_check models on GET /health when flag set

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix comment

* fix greptile comments

* Fix health check fallback kwargs

* Format health endpoint

* Harden direct health check kwargs compatibility for monkeypatched perform_health_check

Replace substring-based TypeError detection with unexpected-keyword checks
and a short retry chain (full kwargs, instrumentation only, filter only,
minimal) so partial stubs work regardless of which optional kwarg fails first.
Add proxy unit tests for legacy three-arg stubs and single-kwarg variants.
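A synchronous sketch of the retry chain described above, assuming the two optional kwargs are called `instrumentation` and `model_filter` (hypothetical names chosen for illustration; the real perform_health_check kwargs and its async signature may differ):

```python
from typing import Any, Callable, Dict, List, Optional


def call_with_kwarg_fallback(
    fn: Callable[..., Any],
    *args: Any,
    instrumentation: Any = None,
    model_filter: Any = None,
) -> Any:
    """Hypothetical sketch: retry with fewer optional kwargs when fn rejects one."""
    attempts: List[Dict[str, Any]] = [
        {"instrumentation": instrumentation, "model_filter": model_filter},  # full kwargs
        {"instrumentation": instrumentation},  # instrumentation only
        {"model_filter": model_filter},  # filter only
        {},  # minimal, e.g. a legacy three-arg stub
    ]
    last_error: Optional[TypeError] = None
    for kwargs in attempts:
        try:
            return fn(*args, **kwargs)
        except TypeError as err:
            # Only fall back on signature mismatches; re-raise any other TypeError.
            if "unexpected keyword argument" not in str(err):
                raise
            last_error = err
    assert last_error is not None
    raise last_error
```

Because the chain tries each single-kwarg variant, a stub that accepts only one optional kwarg succeeds regardless of which of the two it rejects first.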

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix black

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>
…ocks (#27850)

* fix(bedrock-converse): drop blank-text fallback for empty thinking blocks

Claude Code with extended thinking replays prior assistant turns that
include an empty thinking block (`thinking=""`, `signature=""`) alongside
tool_use blocks. The unsigned-reasoning fallback in
`add_thinking_blocks_to_assistant_content` was emitting
`BedrockContentBlock(text="")`, which Bedrock Converse rejects with:

  "The text field in the ContentBlock object at messages.X.content.0
   is blank."

Guard the fallback with a strip() check, matching the existing
empty-text guards elsewhere in `_bedrock_converse_messages_pt`.
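A minimal sketch of the guarded fallback, assuming thinking blocks arrive as dicts with optional `thinking` and `signature` keys and that text blocks are plain `{"text": ...}` dicts; `unsigned_thinking_to_text_blocks` is a hypothetical helper, not the real `add_thinking_blocks_to_assistant_content`:

```python
from typing import Any, Dict, List


def unsigned_thinking_to_text_blocks(thinking_blocks: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Hypothetical sketch of the unsigned-reasoning fallback with a blank-text guard."""
    content: List[Dict[str, str]] = []
    for block in thinking_blocks:
        if block.get("signature"):
            continue  # signed blocks take a separate reasoning path (not shown here)
        text = block.get("thinking") or ""
        # Bedrock Converse rejects blank text blocks, so skip empty/whitespace-only ones.
        if text.strip():
            content.append({"text": text})
    return content
```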

* style: remove unneeded comments
…lidate

LiteLLM_JWTAuth.__init__ calls get_instance_fn(custom_validate) without
config_file_path, so an operator who configures custom_validate:
s3://bucket/module.fn in their YAML JWT auth section would hit the
runtime gate on startup and break their deployment.

Accept config_file_path as a non-field kwarg (popped before the
invalid-keys check), thread it into get_instance_fn, and pass it from
the startup-load callsite via the existing user_config_file_path
module-level path. Admin-API JWT config writes leave the kwarg at None
and still hit the gate.
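The "pop before the invalid-keys check" pattern can be sketched as follows, assuming a settings class that rejects unknown constructor kwargs; `JWTAuthSettingsSketch` and `example_field` are hypothetical, and only `custom_validate` and `config_file_path` come from the commit:

```python
from typing import Any, Optional


class JWTAuthSettingsSketch:
    """Hypothetical stand-in for a settings object that rejects unknown keys."""

    _FIELDS = {"custom_validate", "example_field"}  # illustrative field set

    def __init__(self, **kwargs: Any) -> None:
        # Pop the non-field kwarg first so it never trips the invalid-keys check.
        self.config_file_path: Optional[str] = kwargs.pop("config_file_path", None)
        invalid = set(kwargs) - self._FIELDS
        if invalid:
            raise ValueError(f"Invalid keys: {sorted(invalid)}")
        for name in self._FIELDS:
            setattr(self, name, kwargs.get(name))
```

The startup loader would pass `config_file_path`, while an admin-API write would omit it and leave it as None, so the downstream remote-URL gate still applies.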
* fix(mcp): surface upstream 401 for token-forwarding MCP servers

For MCP servers configured with extra_headers: [Authorization], the gateway
forwards the client token directly to the upstream. When that token is rejected
(expired or invalid) the upstream returns 401, but the MCP SDK starts the SSE
stream with 200 OK before calling handlers, so the 401 can't be returned
mid-stream.

Fix: add a pre-flight httpx probe in handle_streamable_http_mcp — before the
SDK opens the session — so the gateway can still return HTTP 401 with
WWW-Authenticate: Bearer authorization_uri=<gateway-discovery-url> when the
upstream rejects the token. The probe fails open (returns 200) on network
errors so a transient hiccup does not block valid requests.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): parallelize pre-flight auth probes and use HEAD to avoid side effects

- Extract forwarded_auth outside the pass-through server loop (was called N times for the same scope value)
- Gather all upstream auth probes concurrently with asyncio.gather instead of sequentially, eliminating the N×5 s worst-case latency
- Switch probe from POST+initialize JSON-RPC body to HEAD request; HEAD carries the Authorization header so the upstream rejects invalid tokens with 401 but never allocates a session or writes an audit entry

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): use get_async_httpx_client in _probe_upstream_auth

Replaces bare httpx.AsyncClient with the project-standard
get_async_httpx_client(httpxSpecialProvider.MCP) to satisfy the
ensure_async_clients_test code coverage check and avoid the +500 ms
per-request overhead of creating a new client on every probe call.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(mcp): extract pre-flight probe into _check_passthrough_upstream_auth

Moves the parallel upstream auth probe logic out of
handle_streamable_http_mcp into a dedicated helper to satisfy
Ruff PLR0915 (Too many statements > 50).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): gate pre-flight probes on authorized server set to prevent bypass

_check_passthrough_upstream_auth was resolving user-supplied server names
directly before authorization ran, letting any permitted LiteLLM key
trigger an upstream HEAD probe to a server it was not allowed to use.

Changes:
- Call _get_allowed_mcp_servers inside the helper so only servers the
  caller's key is authorized for are probed.
- Move the call site to after toolset scoping so the auth context is
  fully resolved before the probe list is built.
- Thread user_api_key_auth into the helper signature (replaces the raw
  mcp_servers name list).

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add async HTTP HEAD support

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): use Scope type annotation in _get_forwarded_auth_from_scope

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix MCP upstream auth probe method

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* Remove unused AsyncHTTPHandler head method

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): exclude has_client_credentials servers from pre-flight auth probe

_prepare_mcp_server_headers skips caller Authorization when the server
uses OAuth client-credentials (M2M), but the pre-flight probe was still
selecting those servers and forwarding the caller's raw token in the HEAD
request. Exclude servers with has_client_credentials from the probe list
to match the actual downstream header-preparation logic.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): propagate upstream 403 as 403, not 401 with WWW-Authenticate

Per RFC 9110, 401 means "go get new credentials." Mapping an upstream 403
to a gateway 401 causes OAuth clients to restart the authorization flow,
obtain a fresh token with identical scopes, hit 403 again, and loop
indefinitely.

401 from upstream → gateway 401 + WWW-Authenticate (re-authorize)
403 from upstream → gateway 403 (no WWW-Authenticate hint)

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): skip auth probe when Authorization may be the LiteLLM proxy key

The pre-flight upstream probe must not forward the caller's Authorization
header when it could itself be the LiteLLM proxy API key. Restrict the
probe to requests that supply x-litellm-api-key explicitly — only then is
the Authorization header unambiguously the upstream OAuth token the
caller wants forwarded.
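Taken together, the probe-target rules from the commits above (authorized servers only, no client-credentials servers, explicit x-litellm-api-key required) can be sketched as a single filter. `ServerSketch` and `select_probe_targets` are hypothetical names, and `forwards_authorization` stands in for "extra_headers includes Authorization":

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping, Set


@dataclass
class ServerSketch:
    """Hypothetical subset of MCP server config relevant to the probe."""

    name: str
    forwards_authorization: bool  # extra_headers includes Authorization
    has_client_credentials: bool = False


def select_probe_targets(
    servers: Iterable[ServerSketch],
    allowed_names: Set[str],
    headers: Mapping[str, str],
) -> List[ServerSketch]:
    lowered = {k.lower() for k in headers}
    # Without an explicit x-litellm-api-key, Authorization may be the proxy key itself.
    if "x-litellm-api-key" not in lowered or "authorization" not in lowered:
        return []
    return [
        s
        for s in servers
        if s.name in allowed_names  # only servers this key is authorized for
        and s.forwards_authorization
        and not s.has_client_credentials  # M2M servers never receive the caller's token
    ]
```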

* Fix MCP ASGI HTTPException propagation

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): use public AsyncHTTPHandler.post() in auth probe

Use AsyncHTTPHandler.post() and catch httpx.HTTPStatusError explicitly so
the 401/403 we want to surface is not silently swallowed by the broad
fail-open except Exception block. Avoids reaching into the handler's
private client attribute, which would silently regress to fail-open if
AsyncHTTPHandler is ever refactored.

* Fix MCP auth probe tests

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(mcp): add coverage for httpx.HTTPStatusError path in auth probe

AsyncHTTPHandler.post() calls raise_for_status() internally, so a real
upstream 401/403 lands as httpx.HTTPStatusError. Add a test that exercises
that specific exception path so a regression that swallows the error in
the broad fail-open except Exception would be caught.
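A self-contained asyncio sketch of the probe flow these commits describe: concurrent probes, an explicit status-error catch placed before the broad fail-open handler, and 401 vs 403 mapping. `UpstreamStatusError` is a hypothetical stand-in for `httpx.HTTPStatusError`, the `probe` callable replaces the real HTTP request, the first rejection in server order wins, and quoting the `authorization_uri` value is an assumption:

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple


class UpstreamStatusError(Exception):
    """Hypothetical stand-in for httpx.HTTPStatusError raised by raise_for_status()."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"upstream returned {status_code}")
        self.status_code = status_code


async def _probe_one(probe: Callable[[str], Awaitable[None]], server: str) -> Optional[int]:
    try:
        await asyncio.wait_for(probe(server), timeout=5.0)
    except UpstreamStatusError as err:
        # Caught before the broad handler so 401/403 are not swallowed.
        return err.status_code if err.status_code in (401, 403) else None
    except Exception:
        return None  # fail open on network errors and timeouts
    return None


async def check_upstream_auth(
    servers: List[str],
    probe: Callable[[str], Awaitable[None]],
    discovery_url: str,
) -> Optional[Tuple[int, Dict[str, str]]]:
    # Probe all servers concurrently instead of one after another.
    statuses = await asyncio.gather(*(_probe_one(probe, s) for s in servers))
    for status in statuses:
        if status == 401:  # re-authorize via the gateway's discovery URL
            return 401, {"WWW-Authenticate": f'Bearer authorization_uri="{discovery_url}"'}
        if status == 403:  # a fresh token would fail the same way; no re-auth hint
            return 403, {}
    return None  # proceed with the MCP session
```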

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: claude-bot <claude-bot@anthropic.com>
…timodal pricing (#27848)

* fix(cost): align vertex_ai/gemini-embedding-2-preview with Vertex multimodal pricing

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(cost): align vertex_ai/gemini-embedding-2 GA source URL with preview

Per Greptile review on #27848: GA entry referenced ai.google.dev while
the preview entry was updated to the canonical Vertex AI pricing page.
Both share identical pricing values; sync the source URL for consistency.

https://claude.ai/code/session_01W8jRwstnmduadGw8Z8egxe

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude <noreply@anthropic.com>
…27834)

* feat(mcp): add delegate_auth_to_upstream flag for PKCE passthrough

Adds an opt-in per-server flag that lets clients (e.g. VS Code) complete
PKCE directly with an upstream OAuth2 MCP server, instead of LiteLLM
double-gating with its own API-key/SSO check. Only honored when
auth_type=oauth2 and the operator explicitly sets the flag; mixed-target
or non-oauth2 requests fail closed.

- Adds the field to Pydantic models, Prisma schema, and a migration
- New MCPRequestHandler._target_servers_delegate_auth_to_upstream gate
  that runs only when no x-litellm-api-key is present, so authenticated
  users still get user_id resolution + stored-credential lookup
- Anonymous callers now see delegate servers in get_allowed_mcp_servers
  (scoped to delegate servers only; the upstream still enforces auth)
- mcp_management_endpoints: allow anonymous /authorize and /token for
  delegate servers so VS Code can complete PKCE without a LiteLLM session
- UI toggle (shown only for oauth2) + payload/view wiring
- Tests covering: oauth2 on/off, non-oauth2 with flag, mixed targets,
  no resolvable target, explicit key precedence, and 401 emission

Co-authored-by: Cursor <cursoragent@cursor.com>

* Enforce oauth2 for delegated MCP auth bypass

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): close secondary Authorization bypass for delegate servers

The delegate-auth bypass gated only on the primary `x-litellm-api-key`
header, so a LiteLLM key sent via `Authorization: Bearer sk-...` (the
secondary header) was silently dropped — skipping spend tracking and
rate limiting. Gate on the resolved litellm_api_key (which considers
both headers) so the bypass fires only when neither is present.

Also update the existing "Authorization header present" test to reflect
that an upstream OAuth token now flows through the existing oauth2
fallback (LiteLLM auth attempt → fail → anonymous), not via the
delegate branch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Avoid duplicate MCP OAuth credential lookup

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): block delegate bypass for M2M and internal-only servers

Two security issues flagged in code review:

1. High – client_credentials (M2M) servers must not be delegatable:
   LiteLLM auto-fetches the upstream token using stored credentials, so
   allowing anonymous bypass would let any external caller invoke tools
   authenticated as LiteLLM's service account.
   Fix: check `server.has_client_credentials` in
   `_target_servers_delegate_auth_to_upstream`, the anonymous
   allow-list in `get_allowed_mcp_servers`, and `_mcp_oauth_user_api_key_auth`.

2. Medium – internal-only servers exposed to public internet:
   The anonymous delegate allow-list was not filtering by
   `available_on_public_internet`, so external callers with an upstream
   OAuth token could invoke tools on servers marked internal-only.
   Fix: add `available_on_public_internet` guard to the anonymous
   delegate server list in `get_allowed_mcp_servers`.

Tests added for both cases.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Require public MCP delegate auth servers

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): align delegate auth path parsing with downstream routing

`_extract_target_server_names_from_path` used a naive segments-based
split while `server.py::_get_mcp_servers_in_path` uses a regex that
allows server names with one embedded slash and comma-separated lists.
With the old parser, a request to `/mcp/<delegated>/<garbage>` was
parsed as targeting `<delegated>` by the auth gate (bypassing LiteLLM
auth) while the routing layer parsed it as `<delegated>/<garbage>` —
when that name did not resolve, the request fell back to the anonymous
allow-list, which can include `allow_all_keys` servers that normally
require a LiteLLM key.

Replace the parser with the same regex logic as
`_get_mcp_servers_in_path` so auth gating sees the exact target name(s)
downstream routing sees. Add regression tests covering parser parity
and the specific extra-path-segment bypass attempt.

https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9

* fix(mcp): close header/path TOCTOU in MCP delegate auth gate

`_target_servers_delegate_auth_to_upstream` and
`_target_servers_use_oauth2` trusted the `x-mcp-servers` header when
present, but `server.py::extract_mcp_auth_context` overrides that
header with the path-derived list for `/mcp/...` routes. An attacker
could set `x-mcp-servers: <delegated>` while pointing the URL path at
a non-delegate server, flipping the auth gate without changing the
target downstream routing actually uses.

Extract a shared `_resolve_target_server_names` helper that mirrors
the downstream override (path-derived names for `/mcp/...` routes,
header value otherwise). Add regression tests covering the TOCTOU
attempt and the helper's path-vs-header precedence.

https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9

* Fix delegated MCP OAuth test mock

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): drop unreachable /{server}/mcp branch in auth path parser

`_extract_target_server_names_from_path` also matched the
``/{server_name}/mcp`` form, but the downstream parser
``_get_mcp_servers_in_path`` only handles ``/mcp/...`` — and
``dynamic_mcp_route`` in ``proxy_server`` rewrites ``/{name}/mcp``
to ``/mcp/{name}`` on the scope before the MCP handler runs. Parsing
the un-rewritten form on the auth side was therefore unreachable in
production, and contradicted the docstring's claim of mirroring the
downstream parser — exactly the kind of mismatch that risks a future
header/path TOCTOU if any new entry point skips the rewrite.

Drop the branch; the canonical ``/mcp/...`` path matches both
parsers. Update the regression test to assert the new behavior.

https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9

* Fix MCP path auth target resolution

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): require auth for refresh_token grants on delegate-auth servers

`_mcp_oauth_user_api_key_auth` gates the unauthenticated PKCE flow for
``delegate_auth_to_upstream`` servers, but the bypass applied to BOTH
``/authorize`` and ``/token`` regardless of grant type. ``mcp_token``
accepts ``grant_type=refresh_token`` as well as ``authorization_code``,
and ``exchange_token_with_server`` attaches the server's stored
``client_secret`` to whatever is forwarded upstream. An unauthenticated
caller holding a refresh token issued to that OAuth client could mint
fresh upstream access tokens through LiteLLM.

Limit the anonymous bypass on ``/token`` to ``grant_type=authorization_code``
(the only grant PKCE actually protects via ``code_verifier``); fall
through to normal LiteLLM auth for ``refresh_token`` and any other grant.
``/authorize`` continues to allow anonymous PKCE redirects.

https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9
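The conditions accumulated across this PR for the anonymous `/authorize` and `/token` bypass can be sketched as one predicate. `DelegateServerSketch` and `allows_anonymous_oauth` are hypothetical names; the field names follow the commit messages, but this is not the actual `_mcp_oauth_user_api_key_auth` logic:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DelegateServerSketch:
    """Hypothetical subset of the server fields named in these commits."""

    auth_type: str
    delegate_auth_to_upstream: bool
    has_client_credentials: bool
    available_on_public_internet: bool


def _is_delegatable(server: DelegateServerSketch) -> bool:
    return (
        server.auth_type == "oauth2"
        and server.delegate_auth_to_upstream
        and not server.has_client_credentials  # M2M: LiteLLM holds the credentials
        and server.available_on_public_internet
    )


def allows_anonymous_oauth(
    targets: List[DelegateServerSketch],
    endpoint: str,
    grant_type: Optional[str],
    litellm_api_key: Optional[str],
) -> bool:
    # Any resolved LiteLLM key (primary or secondary header) means normal auth.
    if litellm_api_key:
        return False
    # Fail closed on no resolvable target or mixed targets.
    if not targets or not all(_is_delegatable(s) for s in targets):
        return False
    if endpoint == "/authorize":
        return True
    if endpoint == "/token":
        # Only the PKCE-protected grant; refresh_token falls through to LiteLLM auth.
        return grant_type == "authorization_code"
    return False
```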

* fix(ui): clear delegate_auth_to_upstream when switching off oauth2

The ``delegate_auth_to_upstream`` form field is rendered inside an
``isOAuth2 && (...)`` conditional, so the Form.Item unmounts when the
user changes ``auth_type`` away from ``oauth2``. The follow-up
``form.setFieldValue("delegate_auth_to_upstream", false)`` runs after
the field has already deregistered, so ``onFinish`` receives
``undefined`` and the fallback ``?? mcpServer.delegate_auth_to_upstream``
preserved the old ``true``. The flag then persisted in the database for
a non-oauth2 server and silently re-activated if ``auth_type`` was later
switched back to ``oauth2``.

In the edit payload, force the flag to ``false`` whenever
``auth_type !== oauth2``; only trust the form value (and the existing
DB fallback) when the server is actually oauth2. Backend defense-in-depth
already ignores the flag for non-oauth2 servers, but the DB state should
stay clean too.

https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9

* Fix MCP delegate auth reset on edit

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Claude <claude@anthropic.com>
…etion transformation (#27727)

* fix(responses): preserve cache_control in Responses API -> Chat Completion transformation

cache_control injected by AnthropicCacheControlHook was silently dropped when
_transform_responses_api_content_to_chat_completion_content rebuilt content blocks
with only {type, text}. Now copies cache_control through so Anthropic prompt caching
works correctly when using client.responses.create with cache_control_injection_points.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(responses): preserve cache_control for input_image and input_file blocks

Extends the cache_control fix to image and file content blocks, which were
also silently dropping cache_control during the Responses API -> Chat Completion
transformation. Adds tests for all three content block types.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Babysitter <claude@anthropic.com>
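A simplified sketch of the fix, assuming Responses API input blocks map to the standard Chat Completions `text` / `image_url` / `file` shapes; `responses_content_to_chat` is a hypothetical helper that ignores other block types, not the real `_transform_responses_api_content_to_chat_completion_content`:

```python
from typing import Any, Dict, List


def responses_content_to_chat(content: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical sketch: rebuild content blocks, carrying cache_control through."""
    converted: List[Dict[str, Any]] = []
    for block in content:
        kind = block.get("type")
        if kind == "input_text":
            new_block: Dict[str, Any] = {"type": "text", "text": block.get("text", "")}
        elif kind == "input_image":
            new_block = {"type": "image_url", "image_url": {"url": block.get("image_url", "")}}
        elif kind == "input_file":
            file_fields = ("file_id", "file_data", "filename")
            new_block = {"type": "file", "file": {k: block[k] for k in file_fields if k in block}}
        else:
            continue  # other block types are omitted from this sketch
        # The fix: keep the marker injected by the cache-control hook.
        if "cache_control" in block:
            new_block["cache_control"] = block["cache_control"]
        converted.append(new_block)
    return converted
```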
External readiness probes consumed the legacy detailed payload's `db`
field to drive alerting and pod-rotation decisions. Stripping the body
to `{"status": "healthy"}` broke those probes silently — the HTTP code
still flipped to 503, but probes checking `body.db == "connected"`
treated the response as healthy.

Add `db` back to the unauthenticated payload. Keep the rest of the
diagnostic fields (litellm_version, callbacks, cache, log_level) gated
behind /health/readiness/details so the recon-leak gate from #26912
holds. Values match the legacy contract: "connected", "disconnected",
"Not connected".
fix(proxy): expose db status on public /health/readiness
Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>
Document the purpose of the daemon thread that backs the sync
branch of the timeout decorator.

Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>
#26302)

* fix: Fix Redis Sentinel client handling to solve authentication error with password protected sentinel (#25625)

* fix Redis Sentinel authentication handling

* test: cover Redis Sentinel auth routing

* refactor: align Redis Sentinel kwargs threading

* fix: avoid duplicate Redis Sentinel socket timeouts

* Address review comments

* refactor(_redis): return set from _get_redis_kwargs for O(1) lookup

Align _get_redis_kwargs() with the cluster helper by returning a set
instead of a list, so the sentinel connection-kwargs filter uses O(1)
membership tests. Addresses Greptile review feedback on PR #26302.

* fix(_redis): restore Azure-specific kwargs in cluster kwargs set

The set-literal refactor of _get_redis_cluster_kwargs dropped four
LiteLLM-custom Azure keys (azure_redis_ad_token, azure_client_id,
azure_tenant_id, azure_client_secret) that the prior list form had
explicitly appended. Because they are not in RedisCluster's argspec,
they were silently stripped, breaking Azure IAM auth on cluster
clients. Re-add them to the explicit include set.
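A sketch of signature-derived kwargs filtering with an explicit include set, assuming `inspect.signature` as the argspec source; `allowed_client_kwargs` and `filter_client_kwargs` are hypothetical helpers, and only the four Azure key names come from the commit:

```python
import inspect
from typing import Any, Dict, Iterable, Set

# LiteLLM-custom Azure keys listed in the commit; absent from the client's signature.
AZURE_EXTRA_KWARGS = frozenset(
    {"azure_redis_ad_token", "azure_client_id", "azure_tenant_id", "azure_client_secret"}
)


def allowed_client_kwargs(client_cls: type, extra: Iterable[str] = ()) -> Set[str]:
    """Named parameters of client_cls.__init__, plus explicitly included extras."""
    params = inspect.signature(client_cls.__init__).parameters
    named = {
        name
        for name, p in params.items()
        if name != "self"
        and p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    return named | set(extra)


def filter_client_kwargs(kwargs: Dict[str, Any], allowed: Set[str]) -> Dict[str, Any]:
    # Set membership keeps each lookup O(1).
    return {k: v for k, v in kwargs.items() if k in allowed}
```

Without the explicit `extra` union, the Azure keys would be stripped silently, which is the regression described above.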

---------

Co-authored-by: Kristin Cowalcijk <kristincowalcijk@gmail.com>
Co-authored-by: Sameer Kankute <sameer@berri.ai>
Co-authored-by: krrish-berri-2 <krrish-berri-2@users.noreply.github.com>
Co-authored-by: claude <claude@anthropic.com>
* fix(ollama): Include provider in model list for ollama (#26135)

* Include provider in model names for ollama

* Fix unit tests

* fix(ollama): process both thinking and content in same streaming chunk (#26098)

* fix(health_check): skip max_tokens for image_generation mode (#26417)

* fix(health_check): skip max_tokens for image_generation mode

`_update_litellm_params_for_health_check` injected `max_tokens` for
every deployment. OpenAI `/v1/images/generations` strictly rejects
unknown fields, so health checks for dall-e-* and gpt-image-1 always
failed with `400 "Unknown parameter: 'max_tokens'"` even though the
actual image endpoint calls succeed. Skip the `max_tokens` injection
when `model_info.mode == "image_generation"`. `messages` still gets
injected (downstream `_filter_model_params` already strips it for
non-chat handlers).

* Switch to allow-list with per-deployment override

Per @krrishdholakia review: deny-listing image_generation only re-introduces
the same bug for every other non-chat mode (embedding, audio_*, rerank,
video_generation, ocr, search, moderation, ...).

Replace the single image_generation skip with `_MAX_TOKEN_SUPPORT_MODES =
{chat, completion, responses}`. Missing `mode` is treated as chat for
backward compatibility. New modes are safe by default.

Add `model_info.health_check_supports_max_tokens` as an operator escape
hatch — True forces injection on a non-listed deployment (operator wants
to bound probe tokens), False suppresses it on a chat-style deployment
behind a strict-schema provider.

Tests: parametrize over 3 chat-style + 10 non-chat modes, plus override
on/off and the no-mode legacy path.
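The allow-list plus per-deployment override reduces to a small decision function. This sketch assumes `model_info` is a plain dict; `should_inject_max_tokens` is a hypothetical name, while `_MAX_TOKEN_SUPPORT_MODES` and `health_check_supports_max_tokens` are taken from the commit:

```python
from typing import Any, Dict

_MAX_TOKEN_SUPPORT_MODES = {"chat", "completion", "responses"}


def should_inject_max_tokens(model_info: Dict[str, Any]) -> bool:
    """Hypothetical sketch of the max_tokens decision for health-check probes."""
    override = model_info.get("health_check_supports_max_tokens")
    if override is not None:
        return bool(override)  # operator escape hatch wins in both directions
    mode = model_info.get("mode") or "chat"  # missing mode: treat as chat (legacy)
    return mode in _MAX_TOKEN_SUPPORT_MODES  # new modes are excluded by default
```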

* fix(http_handler): handle RequestNotRead in MaskedHTTPStatusError for multipart uploads (#26718)

Squash-merged by litellm-agent from dawidkulpa's PR.

* fix(ollama): guard against double 'ollama/' prefix in live model listing

Greptile flagged that Ollama servers can return names that already start
with 'ollama/'. Check the prefix before prepending so we don't produce
'ollama/ollama/...'. Adds a regression test.

* Fix Ollama empty reasoning stream chunks

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: James Myatt <james@jamesmyatt.co.uk>
Co-authored-by: VHash <225398745+vhash0@users.noreply.github.com>
Co-authored-by: hayden <sewhan.kim+@a-bly.com>
Co-authored-by: dawidkulpa <84176950+dawidkulpa@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
* fix: strip Gemini thought-signature from tool_use.id in non-streaming path; example websearch config (#27873)

- adapters/transformation.py: mirror the streaming path and strip the
  `__thought__<b64>` suffix off `tool_call.id` before building the
  AnthropicResponseContentBlockToolUse. Base64's `+ / =` characters
  violate Anthropic's `^[a-zA-Z0-9_-]+$` tool_use.id pattern, so when a
  conversation that flowed through Gemini is later replayed to an
  Anthropic-native provider (Bedrock or Anthropic API) the request 400s.
- example_config_yaml/websearch_interception_config.yaml: register the
  interceptor under `callbacks:` not `success_callback:`. `success_callback`
  does not run pre-request hooks, so the tool-conversion step never fires
  on `/v1/messages` and the raw `web_search_20250305` tool is forwarded
  to Bedrock, which 400s.
- adds a unit test pinning the non-streaming strip behavior and the
  surviving `^[a-zA-Z0-9_-]+$` shape of the resulting id.
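A minimal sketch of the strip, assuming the signature is always appended after the first `__thought__` marker; `strip_thought_signature` is a hypothetical helper, not the adapter's actual code:

```python
import re

THOUGHT_SEPARATOR = "__thought__"  # suffix marker named in the commit
ANTHROPIC_TOOL_USE_ID = re.compile(r"^[a-zA-Z0-9_-]+$")


def strip_thought_signature(tool_call_id: str) -> str:
    """Drop the `__thought__<b64>` suffix, if present, from a tool-call id."""
    base, separator, _signature = tool_call_id.partition(THOUGHT_SEPARATOR)
    return base if separator else tool_call_id
```

For example, `"call_abc123__thought__c2ln+/=="` fails the Anthropic pattern because of `+ / =`, while the stripped `"call_abc123"` matches it.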

Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>

* Fix/azure image edit auth header (#27863)

* fix(azure/image_edit): use api-key header instead of Authorization Bearer

Delegate `AzureImageEditConfig.validate_environment` to
`BaseAzureLLM._base_validate_azure_environment` so the image-edit route
follows the same auth resolution as every other Azure provider:

- prefer the Azure-native `api-key` header when an API key is available
- fall back to `Authorization: Bearer <azure_ad_token>` only for AAD auth

The previous implementation unconditionally set
`Authorization: Bearer <api_key>`, which is the OpenAI-direct convention
and is rejected by Azure OpenAI / APIM-fronted deployments with
`401 Access denied due to missing subscription key`.

Adds regression tests covering api_key kwarg, litellm_params.api_key, and
the AAD-token fallback path.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(azure/image_edit): pin api-key precedence semantics + add regression test

Address review feedback that the move to
``BaseAzureLLM._base_validate_azure_environment`` changed the relative
priority of the positional ``api_key`` kwarg vs. ``litellm_params["api_key"]``.

The new behavior — ``litellm_params["api_key"]`` wins, positional only fills
in when ``litellm_params["api_key"]`` is empty — is intentional and matches
every other Azure ``validate_environment``: ``AzureVideosConfig`` uses the
exact same merge logic, while ``AzureVectorStoresConfig`` and
``AzureResponsesAPIConfig`` don't accept a positional ``api_key`` at all.
The old ``or`` chain (positional wins) was the outlier and was part of the
same OpenAI-vs-Azure convention drift that produced the original
``Authorization: Bearer`` bug.

The only production caller (``llm_http_handler.image_edit``) sources both
values from the same ``litellm_params.api_key``, so this change is
behaviorally a no-op there. Document the precedence in the docstring and
lock it in with an explicit test so future refactors can't quietly
re-invert it.

Co-authored-by: Cursor <cursoragent@cursor.com>
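The header choice and key precedence described above can be sketched as follows. `azure_auth_headers` is a hypothetical helper for illustration only; it does not reproduce the signature of `BaseAzureLLM._base_validate_azure_environment`.

```python
from typing import Dict, Optional


def azure_auth_headers(
    api_key: Optional[str],
    litellm_params: Dict[str, Optional[str]],
    azure_ad_token: Optional[str] = None,
) -> Dict[str, str]:
    """Sketch of Azure auth-header resolution (hypothetical helper)."""
    # litellm_params["api_key"] wins; the positional api_key only fills in when it is empty.
    resolved_key = litellm_params.get("api_key") or api_key
    if resolved_key:
        # Azure-native header accepted by Azure OpenAI and APIM-fronted deployments.
        return {"api-key": resolved_key}
    if azure_ad_token:
        # Bearer auth is reserved for Azure AD tokens.
        return {"Authorization": f"Bearer {azure_ad_token}"}
    return {}
```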

---------

Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: Adam Kirstein <adam.kirstein@disney.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* test(azure/image_edit): expect api-key header instead of Authorization Bearer

PR #27863 fixed Azure image edit to use the Azure-native api-key header
instead of OpenAI's Authorization: Bearer convention, but did not update
test_azure_image_edit_litellm_sdk to match. The test still asserted
'Authorization' in headers, which now fails since the new code routes
through BaseAzureLLM._base_validate_azure_environment and emits
api-key when an api_key is provided.

Update the assertion to pin the correct Azure behavior: api-key header
present with the resolved key, and no Authorization header.

---------

Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>
Co-authored-by: Adam Kirstein <107421694+justalittleadam@users.noreply.github.com>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: Adam Kirstein <adam.kirstein@disney.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>
…Fireworks API call (#27881)

* fix(fireworks_ai): strip thinking_blocks from chat messages before API call

Fireworks OpenAI-compatible ChatMessage schema uses additionalProperties:false
and rejects Anthropic-style messages[].thinking_blocks (e.g. Claude Code replays),
returning invalid_request_error. Remove the field in _transform_messages_helper
alongside provider_specific_fields.

Adds unit test test_transform_messages_helper_strips_thinking_blocks.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(fireworks_ai): drop inline comments from message sanitization

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(fireworks_ai): explain why provider_specific_fields and thinking_blocks are stripped

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
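A rough sketch of the sanitization step, assuming messages are plain dicts. `sanitize_fireworks_messages` is a hypothetical name; the actual change lives in `_transform_messages_helper`.

```python
from typing import Any, Dict, List

# Message fields the commit reports Fireworks' strict ChatMessage schema rejecting.
FIREWORKS_UNSUPPORTED_FIELDS = ("provider_specific_fields", "thinking_blocks")


def sanitize_fireworks_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return shallow copies of the messages with unsupported fields removed."""
    cleaned = []
    for message in messages:
        # Build a new dict so the caller's messages are not mutated.
        cleaned.append(
            {k: v for k, v in message.items() if k not in FIREWORKS_UNSUPPORTED_FIELDS}
        )
    return cleaned
```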
Authenticated clients could supply CustomPricingLiteLLMParams fields
(input_cost_per_token, output_cost_per_token, etc.) in the request body.
These were forwarded to register_model() in main.py, permanently mutating
the shared global litellm.model_cost dict for all users on the instance.

Adds all CustomPricingLiteLLMParams fields to _BANNED_REQUEST_BODY_PARAMS
so is_request_body_safe() rejects them before they reach completion().
New pricing fields added to CustomPricingLiteLLMParams are auto-covered.

Admin opt-in via allow_client_side_credentials or
configurable_clientside_auth_params still works as before.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
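The "auto-covered" property can be sketched by deriving the banned list from the pricing model's field names. `PricingParams` is a hypothetical dataclass stand-in for `CustomPricingLiteLLMParams`, and the `admin_allowed` argument is a simplified stand-in for the admin opt-in settings.

```python
from dataclasses import dataclass, fields
from typing import Iterable, Optional


@dataclass
class PricingParams:
    """Hypothetical stand-in for CustomPricingLiteLLMParams."""
    input_cost_per_token: Optional[float] = None
    output_cost_per_token: Optional[float] = None


# Illustrative subset of pre-existing banned params.
_BASE_BANNED = ("api_base", "aws_sts_endpoint")
# Derived from the model's fields, so newly added pricing fields are banned automatically.
BANNED_REQUEST_BODY_PARAMS = _BASE_BANNED + tuple(f.name for f in fields(PricingParams))


def check_request_body(body: dict, admin_allowed: Iterable[str] = ()) -> None:
    """Raise ValueError if the body carries a banned param the admin has not opted into."""
    allowed = set(admin_allowed)
    for name in BANNED_REQUEST_BODY_PARAMS:
        if name in body and name not in allowed:
            raise ValueError(f"{name} is not allowed in the request body")
```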
When ``ProxyConfig`` merges DB-persisted ``litellm_settings`` /
``general_settings`` on top of the YAML config, the merged dict is
later iterated by ``load_config`` which threads ``config_file_path``
(the YAML path) into ``get_instance_fn``. The runtime gate that
refuses ``s3://`` / ``gcs://`` modules when ``config_file_path`` is
``None`` therefore can't distinguish a YAML-sourced value from a
DB-sourced one: both look the same to ``get_instance_fn``.

Strip ``s3://`` / ``gcs://`` entries from the DB-overlay value for
every field whose contents reach ``get_instance_fn`` during config
load:

- litellm_settings: ``callbacks``, ``success_callback``,
  ``failure_callback``, ``audit_log_callbacks``, ``post_call_rules``,
  ``custom_provider_map[].custom_handler``
- general_settings: ``custom_auth``, ``custom_key_generate``,
  ``custom_key_update``, ``custom_sso``,
  ``custom_ui_sso_sign_in_handler``,
  ``litellm_jwtauth.custom_validate``

The YAML config-file load path is unchanged — the documented operator
flow (``callbacks: ["s3://bucket/module.instance"]`` in ``config.yaml``)
still works. Only DB-overlay writes (e.g. via ``/config/update``) are
stripped.

Adds 16 regression tests covering the scrub matrix.
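A simplified sketch of the DB-overlay scrub for the flat fields listed above. It assumes list fields hold plain strings and, as an illustrative choice, drops a scalar `general_settings` key outright when it holds a remote URL. Nested fields (`custom_provider_map[].custom_handler`, `litellm_jwtauth.custom_validate`) are omitted, and all function names are hypothetical.

```python
from typing import Any, Dict

REMOTE_PREFIXES = ("s3://", "gcs://")
LIST_FIELDS = ("callbacks", "success_callback", "failure_callback",
               "audit_log_callbacks", "post_call_rules")
SCALAR_FIELDS = ("custom_auth", "custom_key_generate", "custom_key_update",
                 "custom_sso", "custom_ui_sso_sign_in_handler")


def is_remote(value: Any) -> bool:
    return isinstance(value, str) and value.startswith(REMOTE_PREFIXES)


def scrub_litellm_settings(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Remove remote-URL entries from callback-style list fields of the DB overlay."""
    out = dict(settings)
    for key in LIST_FIELDS:
        if isinstance(out.get(key), list):
            out[key] = [v for v in out[key] if not is_remote(v)]
    return out


def scrub_general_settings(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Drop scalar hook fields whose DB-overlay value is a remote URL."""
    return {k: v for k, v in settings.items() if not (k in SCALAR_FIELDS and is_remote(v))}
```

The YAML path would simply never call these helpers, which keeps the documented `callbacks: ["s3://..."]` flow intact.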
A pass-through endpoint's ``target`` field is passed through
``create_pass_through_route`` into ``get_instance_fn`` during config
load. A PROXY_ADMIN persisting ``target: "s3://attacker/m.i"`` via
the DB-overlay ``pass_through_endpoints`` write path was not covered
by the previous scrub matrix, so the remote module load would still
reach the loader because the YAML-load chain has ``config_file_path``
set.

Walk each entry in ``general_settings.pass_through_endpoints`` and
null out any ``target`` that starts with ``s3://`` or ``gcs://``. The
entry itself is preserved so the path-registration helper can choose
how to handle a missing target (the existing code skips the route
when ``target is None``).

Adds two regression tests.
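A minimal sketch of the `pass_through_endpoints` walk, assuming each entry is a dict with an optional `target` key. `scrub_pass_through_targets` is a hypothetical name.

```python
from typing import Any, Dict

REMOTE_PREFIXES = ("s3://", "gcs://")


def scrub_pass_through_targets(general_settings: Dict[str, Any]) -> Dict[str, Any]:
    """Null out remote-URL targets while keeping each endpoint entry in place."""
    out = dict(general_settings)
    endpoints = out.get("pass_through_endpoints")
    if isinstance(endpoints, list):
        cleaned = []
        for entry in endpoints:
            target = entry.get("target") if isinstance(entry, dict) else None
            if isinstance(target, str) and target.startswith(REMOTE_PREFIXES):
                # Route registration skips entries whose target is None.
                entry = {**entry, "target": None}
            cleaned.append(entry)
        out["pass_through_endpoints"] = cleaned
    return out
```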
…nd Vertex (#27705)

* fix(prometheus): emit remaining_tokens/requests gauges for bedrock + vertex (LIT-2719)

Bedrock and Vertex AI never return x-ratelimit-remaining-* response headers,
so litellm_remaining_tokens_metric / litellm_remaining_requests_metric only
fired for OpenAI / Azure / Anthropic deployments even when tpm/rpm was
configured on the router.

Add a provider-agnostic fallback in PrometheusLogger.async_log_success_event
that asks Router.get_remaining_model_group_usage() for the same model_group
and emits the gauges with configured_limit - current_usage when the upstream
provider didn't populate the headers itself. Existing OpenAI / Azure /
Anthropic flows are unchanged because the fallback short-circuits when both
header values are already present.

Tests: 8 new tests covering bedrock + vertex emission, header short-circuit,
partial-header fill, llm_router=None, missing model_group, empty router
result, and router exception swallowing.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(prometheus): narrow except to ImportError, log router lookup failures via verbose_logger.exception

Address greptile review:
- The optional 'from litellm.proxy.proxy_server import llm_router' should
  guard against ImportError specifically, not all exceptions, so that
  unexpected errors (e.g. AttributeError from partially-initialized state)
  stay visible.
- get_remaining_model_group_usage failures are now logged via
  verbose_logger.exception (with traceback) instead of debug, matching the
  PR description's intent and avoiding silent loss of router-cache errors
  in production.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(prometheus): subtract in-flight delta in router-remaining fallback

The router's TPM/RPM counter is incremented by
Router.deployment_callback_on_success, which fires alongside this
prometheus callback in the success-log fan-out. Prometheus wins the
race, so get_remaining_model_group_usage returns the pre-decrement
counter for the current request — while vendor headers
(OpenAI/Anthropic/Azure) are already post-decrement.

That broke parity between providers on the same gauge: dashboards
plotting litellm_remaining_requests_metric showed Bedrock/Vertex
perpetually one request behind Anthropic for the same throughput.

Replay the in-flight increment before emit: subtract total_tokens
from remaining_tokens and 1 from remaining_requests.

* Revert "fix(prometheus): subtract in-flight delta in router-remaining fallback"

This reverts commit 001ce95ecdd952b4b5a23dd2b1e62c4562c932bc.

* fix(router): post-decrement router-derived ratelimit headers

Router.set_response_headers injects x-ratelimit-remaining-{tokens,
requests} for providers that don't return them natively (Bedrock,
Vertex). The values come from get_remaining_model_group_usage, which
reads the router's TPM/RPM counter — incremented post-response by
deployment_callback_on_success. So the headers reflected the counter
state before the current request was counted: pre-decrement.

Vendor headers from OpenAI/Anthropic/Azure are post-decrement (the
vendor counted the request before responding). Same metric name, two
semantics — dashboards plotting litellm_remaining_requests_metric
showed Bedrock/Vertex perpetually one request behind for the same
throughput, and the HTTP response headers exposed the same skew to
clients.

Subtract the in-flight delta before writing: 1 from
remaining-requests, response.usage.total_tokens from remaining-tokens.
Fixes both the response headers and (transitively) the prometheus
gauges that read from standard_logging_payload.additional_headers.

---------

Co-authored-by: cursor <cursor@example.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
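The post-decrement fill can be sketched as a pure function over header dicts. The function name is hypothetical, and clamping at zero is an assumption made for illustration; the commit only describes subtracting 1 and `response.usage.total_tokens`.

```python
from typing import Dict, Optional

REQ = "x-ratelimit-remaining-requests"
TOK = "x-ratelimit-remaining-tokens"


def fill_router_ratelimit_headers(
    provider_headers: Dict[str, int],
    router_remaining_requests: Optional[int],
    router_remaining_tokens: Optional[int],
    total_tokens: int,
) -> Dict[str, int]:
    """Fill missing remaining-* headers from router counters, post-decrement."""
    headers = dict(provider_headers)
    # Vendor-supplied values are already post-decrement, so they are never overwritten.
    if REQ not in headers and router_remaining_requests is not None:
        # Account for the in-flight request the router counter has not recorded yet.
        headers[REQ] = max(router_remaining_requests - 1, 0)
    if TOK not in headers and router_remaining_tokens is not None:
        headers[TOK] = max(router_remaining_tokens - total_tokens, 0)
    return headers


print(fill_router_ratelimit_headers({}, 100, 10000, 250))
# {'x-ratelimit-remaining-requests': 99, 'x-ratelimit-remaining-tokens': 9750}
```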
* Update gpt-4o-transcribe price

* Update test for gpt-4o-transcribe pricing fix

* Update gpt-4o-mini-transcribe price
aws_sts_endpoint, aws_web_identity_token, and aws_bedrock_runtime_endpoint
in ingest_options.vector_store were passed directly to the Bedrock ingestion
class, which reads them into boto3 STS client construction. Any authenticated
caller could redirect AssumeRole calls to an attacker-controlled server,
leaking the proxy's instance profile credentials.

Calls is_request_body_safe() on ingest_options["vector_store"] before
forwarding to litellm.aingest(). Same banned-params list and admin opt-in
escape hatch (allow_client_side_credentials) as the /chat/completions path.
ValueError from the safety check is caught and re-raised as HTTP 400.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…verlay

A guardrail entry's ``callbacks`` list (v1: ``{name: {callbacks:[...]}}``,
v2: ``{guardrail_name, litellm_params: {callbacks: [...], guardrail:
"module.path"}}``) is iterated during config load and threaded through
``get_instance_fn``. A PROXY_ADMIN persisting
``litellm_settings.guardrails[*].callbacks: ["s3://..."]`` or
``litellm_settings.guardrails[*].litellm_params.guardrail: "s3://..."``
via ``/config/update`` was not covered by the previous scrub matrix.

Walk both v1 and v2 entry shapes and null out remote-URL callbacks /
module-path values before the merge. Adds four regression tests.
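A sketch of walking both guardrail shapes, assuming v2 entries are recognized by a dict-valued `litellm_params` key. Removing remote callbacks from the list (rather than replacing them with `None`) is an illustrative choice, and the helper names are hypothetical.

```python
from typing import Any, List

REMOTE_PREFIXES = ("s3://", "gcs://")


def is_remote(value: Any) -> bool:
    return isinstance(value, str) and value.startswith(REMOTE_PREFIXES)


def _drop_remote(values: Any) -> Any:
    return [v for v in values if not is_remote(v)] if isinstance(values, list) else values


def _scrub_v1_body(body: Any) -> Any:
    if isinstance(body, dict) and "callbacks" in body:
        return {**body, "callbacks": _drop_remote(body["callbacks"])}
    return body


def scrub_guardrails(guardrails: Any) -> Any:
    if not isinstance(guardrails, list):
        return guardrails
    cleaned: List[Any] = []
    for entry in guardrails:
        if isinstance(entry, dict) and isinstance(entry.get("litellm_params"), dict):
            # v2 shape: {guardrail_name, litellm_params: {callbacks, guardrail}}
            params = dict(entry["litellm_params"])
            if "callbacks" in params:
                params["callbacks"] = _drop_remote(params["callbacks"])
            if is_remote(params.get("guardrail")):
                params["guardrail"] = None
            entry = {**entry, "litellm_params": params}
        elif isinstance(entry, dict):
            # v1 shape: {name: {callbacks: [...]}}
            entry = {name: _scrub_v1_body(body) for name, body in entry.items()}
        cleaned.append(entry)
    return cleaned
```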
…27726)

* feat(mcp): support MCP access group names in URL-based namespacing

Extends dynamic_mcp_route to resolve /{name}/mcp requests where {name}
is an MCP access group tag or a comma-separated list of servers/groups,
matching what the documentation promised but the handler did not implement.

Resolution order: registered server alias → toolset → comma-separated
list → single access group tag (404 if none match).

Adds unit tests covering all four resolution paths plus 404 cases.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): address Greptile review comments on dynamic_mcp_route

- Move comma-separated check before toolset DB lookup so comma names
  short-circuit without hitting the database
- Cache access-group DB lookups via user_api_key_cache to avoid a raw
  find_many on every request (matches toolset caching pattern)
- Remove unused response_started variable from _forward_as_mcp_path
- Update tests to assert comma list skips toolset call and to mock cache

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(mcp): extract helpers to fix PLR0915 too-many-statements in dynamic_mcp_route

Extract _mcp_forward_as_path and _is_mcp_access_group_cached as
module-level helpers so dynamic_mcp_route stays under the 50-statement
limit. Update tests to patch the new module-level symbols directly.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Avoid caching missing MCP access groups

* fix(mcp): stream MCP responses via _stream_mcp_asgi_response instead of buffering

_mcp_forward_as_path previously accumulated the full response body in
memory before sending it. Replace the buffering custom_send pattern with
_stream_mcp_asgi_response, which uses an asyncio.Queue bridge so chunks
are yielded to the client as they arrive, preventing unbounded memory
growth on large or long-lived MCP responses.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): short-TTL negative cache for access-group existence lookup

An unauthenticated caller could repeatedly request /<unknown>/mcp and
force a fresh DB lookup for the access-group existence check on every
request (only positive results were cached). Cache negative results
for a short DEFAULT_MCP_ACCESS_GROUP_NEGATIVE_CACHE_TTL window (10s by
default) so the DB is shielded from flooding while a transient DB error
(which surfaces as an empty list) cannot hide a real group for long.

https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9

* fix(mcp): use plain int for access-group negative cache TTL

Drop the os.getenv wrapper around DEFAULT_MCP_ACCESS_GROUP_NEGATIVE_CACHE_TTL
to avoid the documentation_test_env_keys check failing on the new variable.
The negative-cache window is a small internal tuning constant, not a
user-facing knob, so a plain integer is clearer than an env override.

https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9

* fix(mcp): validate, dedupe, and cap CSV tokens in dynamic MCP route

For /{name1,name2,...}/mcp, validate every token resolves to a known
server alias or access group, dedupe case-insensitively, and cap at
DEFAULT_MCP_NAMESPACE_CSV_MAX_TOKENS=16 before forwarding.

- Bounds the per-request DB / cache fan-out an authenticated caller can
  trigger by stuffing the path with tokens (raised by veria-ai).
- Returns 404 instead of forwarding when no token resolves, so the
  downstream server filter cannot silently fall back to the full
  allowed_mcp_servers list (raised by Cursor agentic security review).
- Forwards only the resolved subset, so unknown tokens cannot ride along
  into the downstream filter.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(mcp): exact-match CSV token dedupe to preserve case-sensitive distinct tokens

Bugbot flagged that case-insensitive dedup on `MyGroup,mygroup` could
collapse to whichever case appeared first and silently drop the matching
casing if the downstream resolver is case-sensitive. Switch to exact-match
dedup so distinct casings survive; whitespace-only differences still
collapse via the .strip() before comparison.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: mateo-berri <mateo@berri.ai>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
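The CSV handling for `/{name1,name2,...}/mcp` can be sketched as below. The `is_known` callback stands in for the alias and access-group lookups, and truncating to the cap (rather than rejecting the request) is an assumption. `resolve_csv_namespace` is a hypothetical name.

```python
from typing import Callable, List

# Value the commit gives for DEFAULT_MCP_NAMESPACE_CSV_MAX_TOKENS.
MAX_CSV_TOKENS = 16


def resolve_csv_namespace(
    name: str,
    is_known: Callable[[str], bool],
    max_tokens: int = MAX_CSV_TOKENS,
) -> List[str]:
    """Resolve CSV tokens to known servers/groups, or raise LookupError (maps to 404)."""
    seen = set()
    tokens: List[str] = []
    for raw in name.split(","):
        token = raw.strip()  # whitespace-only differences collapse here
        if token and token not in seen:  # exact-match dedupe keeps distinct casings
            seen.add(token)
            tokens.append(token)
    # Cap before any lookups to bound the per-request DB/cache fan-out.
    tokens = tokens[:max_tokens]
    resolved = [t for t in tokens if is_known(t)]
    if not resolved:
        raise LookupError(f"no MCP server or access group matches {name!r}")
    # Only the resolved subset is forwarded; unknown tokens never reach the filter.
    return resolved
```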
``extra_body`` is the OpenAI-SDK passthrough container. Provider
modules read provider-auth fields out of it directly (Azure's
``extra_body.azure_ad_token``, Bedrock's
``extra_body.aws_web_identity_token``, etc.) without re-validating, so
the boundary check has to walk it the same way it walks
``litellm_embedding_config``. Adding it to ``_NESTED_CONFIG_KEYS``
extends single-level banned-key descent into the container — top-level
admin opt-ins (``allow_client_side_credentials`` /
``configurable_clientside_auth_params``) still apply.

``azure_ad_token`` was not in ``_BANNED_REQUEST_BODY_PARAMS`` despite
being the bearer-token field the Azure transformer resolves through
``get_secret`` (same shape as ``aws_web_identity_token`` on the
Bedrock STS path). Added so it can't be supplied per-request without
an admin opt-in.
yuneng-berri and others added 7 commits May 13, 2026 20:33
…27896)

* fix(ui): fetch version + debug flag from /health/readiness/details

The proxy moved `litellm_version`, `is_detailed_debug`, and other
diagnostic fields off the public `/health/readiness` payload behind
an auth-gated `/health/readiness/details` endpoint. The navbar
version tag and the detailed-debug-mode banner stopped working
because they were still reading those fields from the unauthed
response, which no longer contains them.

Replace `useHealthReadiness` with a `useHealthReadinessDetails`
hook that takes an `accessToken` argument and sends a Bearer header
to the auth-gated endpoint. The hook stays disabled while
`accessToken` is falsy, so the navbar can keep rendering on the
public model hub (where the token is null) without triggering an
auth redirect or a 401-loop.

* fix(ui): disable retries on readiness/details + cover token forwarding

Two small follow-ups on the readiness/details migration:

- Set `retry: false` on the query. The payload feeds a passive
  navbar tag and a debug banner; a 401 from an expired token
  shouldn't fan out into three retries against the proxy.
- Add navbar specs that assert the `accessToken` prop is forwarded
  into the hook (matches the DebugWarningBanner spec). Without
  this, the navbar could silently regress to passing `undefined`
  and the existing tests wouldn't catch it.
``_NESTED_CONFIG_KEYS`` descent used ``isinstance(nested, dict)``, so a
caller sending ``extra_body`` as a JSON-encoded string instead of an
object (the same shape multipart/form-data clients use for
``litellm_metadata``) skipped the banned-key check entirely. Switched to
``_coerce_metadata_to_dict`` so the JSON-string path is parsed before
descent — mirrors the existing handling on ``_NESTED_METADATA_KEYS``.
``test_azure_ad_token_is_in_banned_list`` only asserted tuple
membership of a name the parametrized test already exercises end-to-end
through ``is_request_body_safe``. Removed.

Tightened the admin-opt-in test comment.
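A sketch of single-level descent into nested config containers, with JSON-string coercion applied first. The banned list here is an illustrative subset, `admin_allowed` simplifies the admin opt-in settings, and the function names are hypothetical stand-ins for `_coerce_metadata_to_dict` and `is_request_body_safe`.

```python
import json
from typing import Any, Iterable, Optional

NESTED_CONFIG_KEYS = ("litellm_embedding_config", "extra_body")
# Illustrative subset of the banned list.
BANNED = ("api_base", "aws_sts_endpoint", "aws_web_identity_token", "azure_ad_token")


def coerce_to_dict(value: Any) -> Optional[dict]:
    """Accept a dict, or a JSON-encoded string as sent by multipart/form-data clients."""
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            return None
        return parsed if isinstance(parsed, dict) else None
    return None


def find_banned_param(body: dict, admin_allowed: Iterable[str] = ()) -> Optional[str]:
    """Return the first banned key found at the top level or one level into a nested config."""
    allowed = set(admin_allowed)
    for key in BANNED:
        if key in body and key not in allowed:
            return key
    for container in NESTED_CONFIG_KEYS:
        nested = coerce_to_dict(body.get(container))
        if nested is None:
            continue
        for key in BANNED:
            if key in nested and key not in allowed:
                return f"{container}.{key}"
    return None
```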
…over

chore(proxy): cover extra_body + azure_ad_token in banned-params check
…-gate

chore(proxy): refuse remote-URL instance-fn loads outside config-file path
* fix: patch Host-header auth bypass in get_request_route

Starlette reconstructs request.url from the Host header. A malformed
Host like `localhost/?x=1` causes Starlette to build the full URL as
`http://localhost/?x=1/health`, which url-parses to path="/". Since "/"
is in LiteLLMRoutes.public_routes, all protected routes became reachable
without authentication.

Fix: read scope["path"] (set by uvicorn from the HTTP request line,
not derivable from headers) instead of request.url.path. Sub-path
deployments are handled via scope["app_root_path"] / scope["root_path"],
mirroring Starlette's own base_url construction logic.

Affected variants confirmed fixed:
  Host: localhost/?x=1
  Host: localhost:4000/?x=1
  Host: localhost/#test
  Host: localhost:4000/#test

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
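The bypass and the fix can be illustrated with the standard library. `urlparse` shows how the reconstructed URL collapses to `/`. `route_from_scope` is a hypothetical sketch that assumes `scope["path"]` includes the root path (which depends on the server and Starlette version), and it also skips a bare `/` root path, as a later commit in this PR does.

```python
from typing import Any, Mapping
from urllib.parse import urlparse

# A Host of "localhost/?x=1" pushes the real path into the query string.
print(urlparse("http://localhost/?x=1/health").path)  # /


def route_from_scope(scope: Mapping[str, Any]) -> str:
    """Derive the route from the ASGI request line rather than the Host header."""
    path = str(scope.get("path") or "/")
    root_path = str(scope.get("app_root_path") or scope.get("root_path") or "")
    # A bare "/" root path must not be stripped, or "/team/new" would become "team/new".
    if root_path and root_path != "/":
        if path == root_path:
            return "/"
        if path.startswith(root_path + "/"):
            return path[len(root_path):]
    return path
```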

* style: reduce comments in route fix

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: block credential fields in RAG ingest vector_store options

Credential fields (vertex_credentials, aws_access_key_id, api_key, etc.)
in ingest_options.vector_store are now rejected at the API boundary with
a 400 error. Credentials must be configured server-side.

Previously any authenticated user could supply a vertex_credentials dict
with type=external_account pointing credential_source.file at an
arbitrary path (e.g. /proc/1/environ) and token_url at an
attacker-controlled server. google-auth's identity_pool.Credentials
refresh() would read the file and POST its contents to the attacker.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: block /key/update self-escalation by assigned users

Non-admin users who were assigned a key (created_by != caller) could
update any non-budget field — models, rpm_limit, guardrails, etc. —
without admin authorization, allowing privilege self-escalation.

Gate: only the key creator (created_by == caller) may edit their own
key without admin check; budget changes always require admin regardless
of creator status. All other callers must pass _check_key_admin_access.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: block user-controlled api_base in RAG ingest vector_store options

A user-supplied api_base in ingest_options.vector_store caused the server
to forward its configured provider credentials (Gemini, OpenAI) to an
attacker-controlled endpoint via SSRF.

Add api_base to the blocked credential params set alongside api_key and
the existing credential fields.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: restrict /utils/transform_request to PROXY_ADMIN and apply body safety check

Any authenticated internal_user could POST arbitrary provider config
(aws_sts_endpoint, api_base, etc.) to /utils/transform_request and have
the server forward its credentials to an attacker-controlled endpoint.

- Gate the endpoint on PROXY_ADMIN role (403 for all other roles)
- Call is_request_body_safe() to reject banned params even for admins
- Convert ValueError from safety check to HTTP 400

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: apply banned-param check to /utils/transform_request

Without is_request_body_safe(), any authenticated user could pass
aws_sts_endpoint, api_base, or aws_web_identity_token to
/utils/transform_request and have the server forward its configured
provider credentials to an attacker-controlled endpoint during SDK
credential resolution.

Applies the same banned-param blocklist already used by LLM endpoints.
Endpoint remains accessible to all authenticated users.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: block SSRF via api_base in /prompts/test dotprompt YAML frontmatter

Any frontmatter key not in ["model","input","output"] flowed into
optional_params and was merged into the LLM call data dict, bypassing
is_request_body_safe. An attacker with any bearer key could set
api_base in YAML to redirect the outbound LLM request — including the
provider API key — to an attacker-controlled host.

Fix: call is_request_body_safe on the constructed data dict after
optional_params are merged, before invoking ProxyBaseLLMRequestProcessing.
ValueError from the banned-param check is surfaced as HTTP 400.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Update litellm/proxy/rag_endpoints/endpoints.py

Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>

* fix: coerce nested config strings before banned-param check

_NESTED_CONFIG_KEYS descent used isinstance(nested, dict) which silently
skipped litellm_embedding_config when delivered as a JSON string via
multipart/form-data. Banned params (api_base, aws_sts_endpoint, etc.)
nested inside the stringified value were invisible to is_request_body_safe.

_NESTED_METADATA_KEYS already used _coerce_metadata_to_dict which parses
JSON strings before checking. Apply the same coercion to _NESTED_CONFIG_KEYS.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: replace substring match with prefix match in is_llm_api_route

mapped_pass_through_routes used `_llm_passthrough_route in route` (substring)
so any admin-only path whose URL contained a provider name (openai, anthropic,
azure, bedrock, etc.) was misclassified as an LLM API route and bypassed the
admin gate in non_proxy_admin_allowed_routes_check.

Confirmed live: non-admin key could GET /credentials/by_name/openai (read
masked provider API key) and DELETE /credentials/openai (delete credential).

Fix: use exact match or startswith(prefix + "/") — the same pattern used
everywhere else in RouteChecks — so only routes that actually start with a
passthrough prefix are allowed through.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
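The difference between the two checks is easy to reproduce. The prefix tuple below is a hypothetical subset chosen for illustration, not LiteLLM's actual `mapped_pass_through_routes` list.

```python
from typing import Iterable

# Hypothetical subset of passthrough prefixes.
PASSTHROUGH_PREFIXES = ("/openai", "/anthropic", "/azure", "/bedrock")


def is_passthrough_route_substring(route: str, prefixes: Iterable[str] = PASSTHROUGH_PREFIXES) -> bool:
    """Old behavior: any route merely containing a prefix matches."""
    return any(prefix in route for prefix in prefixes)


def is_passthrough_route(route: str, prefixes: Iterable[str] = PASSTHROUGH_PREFIXES) -> bool:
    """Fixed behavior: exact match, or the prefix followed by a path separator."""
    return any(route == p or route.startswith(p + "/") for p in prefixes)
```

Requiring `prefix + "/"` also keeps look-alike paths such as `/openaiX` from matching.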

* fix: stabilize PR #27878 test failures

- key_management_endpoints: extend can_skip_admin_check to team keys so
  team members with /key/update permission can update non-budget fields.
  can_team_member_execute_key_management_endpoint already validates team
  membership + permission and raises if unauthorized; reaching the admin
  check on a team key means the caller was authorized.

- test: set created_by on mock key in
  test_update_key_non_budget_fields_allowed_for_internal_user so
  caller_is_creator resolves correctly (MagicMock default ≠ user_id).

- auth_utils.get_request_route: guard against non-dict request.scope
  (e.g. MagicMock in unit tests) to prevent a MagicMock leaking into
  UserAPIKeyAuth.request_route and failing Pydantic validation.

- ci: assign test_multipart_bypass_repro.py to the proxy-runtime shard
  in test-unit-proxy-db.yml to satisfy the shard-coverage check.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix(lint): add explicit str() cast in get_request_route for MyPy

scope.get() returns Any|None which MyPy cannot coerce to str implicitly.
Wrap both scope.get() calls in str() to satisfy the type checker.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: guard bare-/ root_path strip + make total_spend migration idempotent

auth_utils.get_request_route: when Starlette sets scope["app_root_path"]
to "/" (e.g. behind some middleware), the old stripping logic would
remove the leading slash from every path ("/team/new" → "team/new"),
breaking route matching and causing auth to misclassify protected routes.
Skip stripping when root_path is bare "/".

migration: add IF NOT EXISTS to total_spend ALTER TABLE so the migration
is safe to replay when a prior partial run already created the column.
Without this guard, prisma migrate deploy fails on CI DBs that were
partially migrated, causing all subsequent DB operations (including
/team/new) to 500.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: require creator still owns key for personal-key bypass in /key/update

caller_is_creator now requires both created_by == caller AND user_id ==
caller. Previously checking only created_by let a demoted admin who
originally created a key for another user continue editing non-budget
fields on it after reassignment, bypassing _check_key_admin_access.

Adds regression test: creator whose key was reassigned is blocked (403).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
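The resulting `/key/update` gate, combining the creator, budget, and team-key rules from the commits above, might look roughly like this. `KeyRow` and `can_skip_admin_check` are hypothetical, and `team_member_authorized` stands in for the upstream `can_team_member_execute_key_management_endpoint` check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeyRow:
    """Hypothetical minimal view of a key row."""
    created_by: Optional[str]
    user_id: Optional[str]
    team_id: Optional[str] = None


def can_skip_admin_check(
    key: KeyRow,
    caller_id: str,
    changes_budget: bool,
    team_member_authorized: bool = False,
) -> bool:
    """True when /key/update may proceed without the key-admin check."""
    if changes_budget:
        return False  # budget edits always require admin
    # The creator must still own the key; a reassigned key loses the bypass.
    if key.created_by == caller_id and key.user_id == caller_id:
        return True
    # Team keys: membership and /key/update permission were validated upstream.
    return key.team_id is not None and team_member_authorized
```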

* fix: extract auth checks to fix PLR0915 + broaden max_budget assertion

internal_user_endpoints._update_single_user_helper exceeded 50 statements
(PLR0915). Extract authorization checks into _check_user_update_authz helper
to bring statement count under the limit.

test_validate_max_budget: assert "negative" (substring of both the local
"cannot be negative" and the CI "non-negative finite number" messages) so
the test is stable regardless of which exact wording the function uses.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>
@greptile-apps

greptile-apps Bot commented May 14, 2026

Too many files changed for review. (105 files found, 100 file limit)

@CLAassistant

CLAassistant commented May 14, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
6 out of 10 committers have signed the CLA.

✅ stuxf
✅ mateo-berri
✅ Sameerlite
✅ yuneng-berri
✅ lmcdonald-godaddy
✅ milan-berri
❌ oss-agent-shin
❌ ishaan-berri
❌ krrish-berri-2
❌ yassin-berriai
You have signed the CLA already but the status is still pending? Let us recheck it.

@mateo-berri mateo-berri left a comment

💯

@codspeed-hq

codspeed-hq Bot commented May 14, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing litellm_internal_staging (de1747d) with main (7af0f05)

@codecov

codecov Bot commented May 14, 2026

yuneng-berri and others added 5 commits May 13, 2026 21:51
#27689)

Provider validation errors (e.g. OpenAI RateLimitError carrying 178
pydantic errors each with their own 'input': [...]) were stored verbatim
in LiteLLM_SpendLogs.metadata.error_information.error_message via
str(original_exception), producing rows >12 MB.

Sanitize before metadata is serialized:
- redact 'input'/'messages' values in both error_message and traceback
  when store_prompts_in_spend_logs is False (back-door leak paths)
- always apply the MAX_STRING_LENGTH_PROMPT_IN_DB size cap to
  error_message and traceback (DB-storage safeguard)

Value scanning uses a parser-based balanced-bracket walk that respects
string quoting, so multi-modal payloads ('messages': [{'content': [...]}])
and user text containing literal brackets ("secret[123") are handled
correctly instead of leaking past a depth-1 regex.

Scoped to the spend-log path so OTEL/Datadog/etc. callbacks still
receive the untruncated error per LITELLM_TRUNCATION_DB_SAFEGUARD_NOTE.

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
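A sketch of the quote-aware, balanced-bracket redaction described above, assuming the error text is a Python-repr-like string. Redaction runs before truncation. All names, the `'[REDACTED]'` placeholder, and the plain slice used as the size cap are illustrative assumptions, not LiteLLM's exact output format. For simplicity, the walker does not check that bracket types match.

```python
import re

# Matches 'input': / "messages": style keys in str(exception) output.
_KEY = re.compile(r"""(['"])(?:input|messages)\1\s*:\s*""")
_PLACEHOLDER = "'[REDACTED]'"


def _value_end(text: str, start: int) -> int:
    """Index just past the value starting at text[start], honoring quotes and nesting."""
    depth, quote, i = 0, None, start
    while i < len(text):
        ch = text[i]
        if quote:
            if ch == "\\":
                i += 2  # skip the escaped character
                continue
            if ch == quote:
                quote = None
                if depth == 0:
                    return i + 1
        elif ch in "'\"":
            quote = ch
        elif ch in "[{(":
            depth += 1
        elif ch in "]})":
            if depth == 0:
                return i  # closing bracket of the enclosing container
            depth -= 1
            if depth == 0:
                return i + 1
        elif ch == "," and depth == 0:
            return i
        i += 1
    return len(text)


def redact_payload_values(text: str) -> str:
    parts, pos = [], 0
    while True:
        match = _KEY.search(text, pos)
        if match is None:
            parts.append(text[pos:])
            return "".join(parts)
        start = match.end()
        parts.append(text[pos:start])
        parts.append(_PLACEHOLDER)
        pos = _value_end(text, start)


def sanitize_spend_log_error(message: str, store_prompts: bool, max_len: int) -> str:
    if not store_prompts:
        message = redact_payload_values(message)
    return message[:max_len]  # the DB-storage cap applies either way
```

Because brackets inside quoted strings are ignored, user text such as `"secret[123"` cannot end the redacted span early.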
@shin-berri shin-berri merged commit e58a561 into main May 14, 2026
126 of 130 checks passed