
fix(mcp): surface upstream 401 for token-forwarding MCP servers #27847

Merged
mateo-berri merged 16 commits into litellm_internal_staging from litellm_mcp_passthrough_upstream_401 on May 13, 2026

Conversation

Sameerlite commented May 13, 2026
Collaborator

What

For MCP servers configured with extra_headers: [Authorization], the gateway forwards the client's Bearer token directly to the upstream (OAuth pass-through). When that token is rejected by the upstream (expired/invalid), the upstream returns HTTP 401 — but LiteLLM was swallowing it and returning 200 {"tools":[]} instead.

Root cause: the MCP SDK (StreamableHTTPSessionManager) sends 200 OK + SSE headers before dispatching to handlers, so by the time list_tools detected the failure the HTTP response was already committed. Exception propagation through the SDK's internals isn't viable because except Exception guards at multiple layers catch it first.
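The root cause can be illustrated with a toy ASGI app. This is a minimal sketch, not the MCP SDK's code: `sdk_like_app` and `run_app` are hypothetical names, and the app only mimics the ordering described above (headers first, handler second).

```python
import asyncio


async def sdk_like_app(scope, receive, send):
    """Toy ASGI app that commits 200 + SSE headers before running its handler."""
    await send(
        {
            "type": "http.response.start",
            "status": 200,
            "headers": [(b"content-type", b"text/event-stream")],
        }
    )
    try:
        # Stand-in for list_tools discovering that the upstream rejected the token.
        raise PermissionError("upstream returned 401")
    except Exception:
        # The status line is already sent; only body bytes can follow.
        await send({"type": "http.response.body", "body": b'{"tools":[]}'})


async def run_app(app):
    """Drive the app once and collect every ASGI message it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "POST", "path": "/mcp", "headers": []}
    await app(scope, receive, send)
    return sent


messages = asyncio.run(run_app(sdk_like_app))
print([m["type"] for m in messages], messages[0]["status"])
```

Once `http.response.start` has been sent with status 200, the handler has no way to turn the response into a 401, which is why the check has to run earlier.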

Fix

Add a pre-flight auth probe in handle_streamable_http_mcp, before session_manager.handle_request is called:

  1. For servers whose extra_headers include Authorization (token-forwarding servers), extract the client's Authorization header from the ASGI scope.
  2. Send a minimal JSON-RPC initialize probe to the upstream with that token (5 s timeout).
  3. If the upstream returns 401/403 → raise HTTPException(401) with WWW-Authenticate: Bearer authorization_uri=<gateway-discovery-url> (same format as the existing per-user OAuth2 server handling).
  4. On a network error or timeout, the probe fails open (returns (200, None)) so that a transient hiccup does not block valid requests.
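The probe and its fail-open behavior from the steps above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the `send_initialize` callable is a hypothetical stand-in for the HTTP client call, so the example runs without a network.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple

ProbeResult = Tuple[int, Optional[str]]
# Hypothetical transport hook: sends the initialize POST and returns
# (status_code, www_authenticate). The PR uses an HTTP client for this.
SendInitialize = Callable[[str, str], Awaitable[ProbeResult]]


async def probe_upstream_auth(
    send_initialize: SendInitialize,
    url: str,
    auth_header: str,
    timeout: float = 5.0,
) -> ProbeResult:
    """Return the upstream's status, or (200, None) if the probe itself fails."""
    try:
        return await asyncio.wait_for(send_initialize(url, auth_header), timeout)
    except Exception:
        # Fail open: timeouts and network errors must not block valid requests.
        return 200, None


async def _rejecting(url: str, auth_header: str) -> ProbeResult:
    return 401, 'Bearer realm="upstream"'


async def _unreachable(url: str, auth_header: str) -> ProbeResult:
    raise ConnectionError("connection refused")


async def _hanging(url: str, auth_header: str) -> ProbeResult:
    await asyncio.sleep(1)  # slower than the timeout used below
    return 401, None


URL, TOKEN = "http://upstream/mcp", "Bearer some-token"
rejected = asyncio.run(probe_upstream_auth(_rejecting, URL, TOKEN))
unreachable = asyncio.run(probe_upstream_auth(_unreachable, URL, TOKEN))
timed_out = asyncio.run(probe_upstream_auth(_hanging, URL, TOKEN, timeout=0.01))
```

Only a definite answer from the upstream is treated as a rejection; a probe that errors or times out is indistinguishable from success by design.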

Note

Medium Risk
Touches MCP authentication/authorization flow and adds outbound preflight requests; mistakes could cause unexpected 401/403s or added latency for MCP connections.

Overview
Adds a pre-flight auth probe for StreamableHTTP MCP requests when a server is configured to pass through caller Authorization via extra_headers, so upstream 401/403 is detected before the MCP SDK commits 200 SSE headers.

Introduces helpers to safely extract a forwardable Authorization header only when x-litellm-api-key is present (to avoid leaking proxy keys), probes all authorized pass-through servers in parallel via a JSON-RPC initialize POST, and maps upstream 401 to a gateway 401 with WWW-Authenticate (and 403 to Forbidden) while failing open on network errors.
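The "only when x-litellm-api-key is present" guard might look roughly like this. It is a sketch based on the description above, not the merged helper; `get_forwardable_auth` is a hypothetical name.

```python
from typing import Any, Dict, Optional


def get_forwardable_auth(scope: Dict[str, Any]) -> Optional[str]:
    """Return the Authorization value only when it cannot be the proxy key.

    Assumption (from the overview): if x-litellm-api-key is absent, the
    Authorization header may itself carry the LiteLLM key, so it is withheld.
    """
    headers = {key.lower(): value for key, value in scope.get("headers", [])}
    if b"x-litellm-api-key" not in headers:
        return None
    auth = headers.get(b"authorization")
    return auth.decode("latin-1") if auth else None


with_proxy_key = {
    "headers": [
        (b"x-litellm-api-key", b"sk-proxy"),
        (b"authorization", b"Bearer upstream-token"),
    ]
}
without_proxy_key = {"headers": [(b"authorization", b"Bearer sk-proxy")]}

print(get_forwardable_auth(with_proxy_key))
print(get_forwardable_auth(without_proxy_key))
```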

Updates _stream_mcp_asgi_response to propagate handler exceptions occurring before response headers (including HTTPException), and adds unit tests covering the probe behavior and exception propagation.

Reviewed by Cursor Bugbot for commit 28eda2a.

For MCP servers configured with extra_headers: [Authorization], the gateway
forwards the client token directly to the upstream. When that token is rejected
(expired or invalid) the upstream returns 401, but the MCP SDK starts the SSE
stream with 200 OK before calling handlers, so the 401 can't be returned
mid-stream.

Fix: add a pre-flight httpx probe in handle_streamable_http_mcp — before the
SDK opens the session — so the gateway can still return HTTP 401 with
WWW-Authenticate: Bearer authorization_uri=<gateway-discovery-url> when the
upstream rejects the token. The probe fails-open (returns 200) on network
errors so a transient hiccup does not block valid requests.

Co-authored-by: Cursor <cursoragent@cursor.com>
codecov Bot commented May 13, 2026

Codecov Report

❌ Patch coverage is 71.11111% with 13 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| litellm/proxy/_experimental/mcp_server/server.py | 71.11% | 13 Missing ⚠️ |


greptile-apps Bot commented May 13, 2026
Contributor

Greptile Summary

This PR surfaces upstream 401/403 errors for token-forwarding MCP servers by adding a pre-flight auth probe before StreamableHTTPSessionManager.handle_request commits 200 SSE headers. It also updates _stream_mcp_asgi_response to propagate exceptions that occur before response headers are sent.

  • Introduces _get_forwarded_auth_from_scope, _probe_upstream_auth, and _check_passthrough_upstream_auth helpers; all authorized pass-through servers are probed in parallel via asyncio.gather.
  • Updates the _ensure_eof done-callback in proxy_server.py to set the exception on headers_ready when the handler raises before committing headers, so callers receive the original HTTPException rather than a 30 s timeout.
  • Adds unit tests for the probe and the new exception-propagation path.
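The pre-header exception propagation described in the second bullet can be sketched with a future and a done-callback. This is an assumption-laden illustration, not the code in proxy_server.py: `wait_for_headers` and the toy handlers are hypothetical, and only the `headers_ready` idea is taken from the summary.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Send = Callable[[Dict[str, Any]], Awaitable[None]]


async def wait_for_headers(
    handler: Callable[[Send], Awaitable[None]], timeout: float = 30.0
) -> Dict[str, Any]:
    """Run handler; return its response-start message or re-raise its early error."""
    loop = asyncio.get_running_loop()
    headers_ready: "asyncio.Future[Dict[str, Any]]" = loop.create_future()

    async def send(message: Dict[str, Any]) -> None:
        if message["type"] == "http.response.start" and not headers_ready.done():
            headers_ready.set_result(message)

    task = asyncio.ensure_future(handler(send))

    def _on_done(finished: "asyncio.Task[None]") -> None:
        if finished.cancelled():
            if not headers_ready.done():
                headers_ready.cancel()
            return
        exc = finished.exception()
        if headers_ready.done():
            return  # headers were already committed; nothing to hand over
        if exc is None:
            exc = RuntimeError("handler finished without sending headers")
        # The handler ended before headers: wake the waiter with the real error
        # instead of letting it sit until the timeout.
        headers_ready.set_exception(exc)

    task.add_done_callback(_on_done)
    return await asyncio.wait_for(headers_ready, timeout)


async def _fails_early(send: Send) -> None:
    raise PermissionError("upstream rejected token")


async def _commits_headers(send: Send) -> None:
    await send({"type": "http.response.start", "status": 200, "headers": []})


async def _demo():
    try:
        await wait_for_headers(_fails_early)
        early_error = None
    except PermissionError as exc:
        early_error = exc
    started = await wait_for_headers(_commits_headers)
    return early_error, started


early_error, started = asyncio.run(_demo())
```

Without the callback, the waiter in the failing case would learn nothing until the timeout expired, which matches the 30 s delay mentioned in the summary.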

Confidence Score: 4/5

The core auth-probe logic is sound and the exception-propagation fix is correct; the main concern is an extra permission-lookup chain now running on every qualifying MCP request.

The probe correctly parallelizes calls with asyncio.gather and properly catches MaskedHTTPStatusError (an httpx.HTTPStatusError subclass) raised by AsyncHTTPHandler.post(); the _ensure_eof exception-propagation change is validated by a new test. The only substantive concern is that _check_passthrough_upstream_auth triggers a multi-step permission chain (key, team, end_user, agent, org DB lookups) that already runs later in the same request inside list_mcp_tools, doubling the work for every pass-through auth request.

litellm/proxy/_experimental/mcp_server/server.py — specifically the _get_allowed_mcp_servers call inside _check_passthrough_upstream_auth.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/_experimental/mcp_server/server.py | Adds pre-flight auth probe helpers and wires them in before the MCP session manager; _get_allowed_mcp_servers (potential multi-step DB chain) is now called on every request with x-litellm-api-key + Authorization targeting pass-through servers. |
| litellm/proxy/proxy_server.py | Updates _ensure_eof done-callback to propagate pre-header exceptions through headers_ready; change is minimal and correct. |
| tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py | New probe tests are present; test_probe_upstream_auth_returns_upstream_status mocks client.post returning a 401 response object directly, a path that is unreachable in production since AsyncHTTPHandler.post() raises MaskedHTTPStatusError for any 4xx. |
| tests/test_litellm/proxy/test_mcp_asgi_response.py | New test validating that a pre-header HTTPException propagates through _stream_mcp_asgi_response; covers the _ensure_eof change correctly. |
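The test-file note implies that, when the client raises on 4xx, the probe has to read the status from the raised error rather than from a return value. A minimal sketch of that shape follows; `FakeResponse` and `FakeStatusError` are hypothetical stand-ins, not the real httpx or LiteLLM classes.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Tuple


class FakeResponse:
    """Hypothetical stand-in for an HTTP response object."""

    def __init__(self, status_code: int, headers: Optional[Dict[str, str]] = None):
        self.status_code = status_code
        self.headers = headers or {}


class FakeStatusError(Exception):
    """Hypothetical stand-in for a client error that carries the response."""

    def __init__(self, response: FakeResponse):
        super().__init__(f"HTTP {response.status_code}")
        self.response = response


Post = Callable[[str, str], Awaitable[FakeResponse]]


async def probe(post: Post, url: str, auth_header: str) -> Tuple[int, Optional[str]]:
    try:
        resp = await post(url, auth_header)
    except FakeStatusError as exc:
        # A raising client still tells us the upstream's verdict.
        resp = exc.response
    except Exception:
        return 200, None  # fail open on anything that is not a status error
    return resp.status_code, resp.headers.get("www-authenticate")


async def _raising_post(url: str, auth_header: str) -> FakeResponse:
    raise FakeStatusError(FakeResponse(401, {"www-authenticate": 'Bearer realm="test"'}))


async def _ok_post(url: str, auth_header: str) -> FakeResponse:
    return FakeResponse(200)


raised_result = asyncio.run(probe(_raising_post, "http://upstream/mcp", "Bearer t"))
ok_result = asyncio.run(probe(_ok_post, "http://upstream/mcp", "Bearer t"))
```

A test that exercises the raising path, as in `_raising_post`, would cover the behavior the reviewer says production actually hits.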

Reviews (7): Last reviewed commit: "test(mcp): add coverage for httpx.HTTPSt..."

Comment thread litellm/proxy/_experimental/mcp_server/server.py Outdated
Comment thread litellm/proxy/_experimental/mcp_server/server.py Outdated
Comment thread litellm/proxy/_experimental/mcp_server/server.py Outdated
Comment thread litellm/proxy/_experimental/mcp_server/server.py
Sameerlite and others added 2 commits May 13, 2026 19:04
…de effects

- Extract forwarded_auth outside the pass-through server loop (was called N times for the same scope value)
- Gather all upstream auth probes concurrently with asyncio.gather instead of sequentially; eliminates N×5 s worst-case latency
- Switch probe from POST+initialize JSON-RPC body to HEAD request; HEAD carries the Authorization header so the upstream rejects invalid tokens with 401 but never allocates a session or writes an audit entry
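The concurrency change in the second bullet can be sketched as below. `CountingProbe` and `probe_all` are hypothetical helpers for illustration; the counter exists only to make the overlap observable without timing assertions.

```python
import asyncio
from typing import List, Optional, Tuple

ProbeResult = Tuple[int, Optional[str]]


class CountingProbe:
    """Fake probe that records how many calls overlap in time."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.max_in_flight = 0

    async def __call__(self, url: str, auth_header: str) -> ProbeResult:
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # stand-in for the network round trip
        self.in_flight -= 1
        return 200, None


async def probe_all(
    probe: CountingProbe, urls: List[str], auth_header: str
) -> List[ProbeResult]:
    # gather schedules every probe at once and returns results in input order.
    return list(await asyncio.gather(*[probe(url, auth_header) for url in urls]))


counting_probe = CountingProbe()
urls = ["http://a/mcp", "http://b/mcp", "http://c/mcp"]
results = asyncio.run(probe_all(counting_probe, urls, "Bearer some-token"))
print(results, counting_probe.max_in_flight)
```

With sequential awaits the worst case grows with the number of servers; with `asyncio.gather` it is bounded by the slowest single probe.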

Co-authored-by: Cursor <cursoragent@cursor.com>
Replaces bare httpx.AsyncClient with the project-standard
get_async_httpx_client(httpxSpecialProvider.MCP) to satisfy the
ensure_async_clients_test code coverage check and avoid the +500 ms
per-request overhead of creating a new client on every probe call.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Sameerlite

Collaborator Author

@greptile re review

…eam_auth

Moves the parallel upstream auth probe logic out of
handle_streamable_http_mcp into a dedicated helper to satisfy
Ruff PLR0915 (Too many statements > 50).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread litellm/proxy/_experimental/mcp_server/server.py Outdated
veria-ai Bot commented May 13, 2026
Contributor

MCP upstream auth preflight added

This PR adds a pre-header probe for token-forwarding MCP servers so upstream 401/403 responses can be surfaced before the streaming response starts, and updates the ASGI bridge to propagate pre-header exceptions. I reviewed the MCP server selection, forwarded-header handling, and exception propagation path and did not find a security issue introduced by these changes.


Status: 1 open
Risk: 2/10

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Probe always fails open: AsyncHTTPHandler lacks head method
    • Added an async HEAD helper to AsyncHTTPHandler so the upstream auth probe reaches the server and can surface 401/403 responses.
Preview (b211144a33)
diff --git a/litellm/llms/custom_httpx/http_handler.py b/litellm/llms/custom_httpx/http_handler.py
--- a/litellm/llms/custom_httpx/http_handler.py
+++ b/litellm/llms/custom_httpx/http_handler.py
@@ -598,6 +598,26 @@
         )
         return response
 
+    async def head(
+        self,
+        url: str,
+        params: Optional[dict] = None,
+        headers: Optional[dict] = None,
+        follow_redirects: Optional[bool] = None,
+    ):
+        # Set follow_redirects to UseClientDefault if None
+        _follow_redirects = (
+            follow_redirects if follow_redirects is not None else USE_CLIENT_DEFAULT
+        )
+
+        params = params or {}
+        params.update(HTTPHandler.extract_query_params(url))
+
+        response = await self.client.head(
+            url, params=params, headers=headers, follow_redirects=_follow_redirects  # type: ignore
+        )
+        return response
+
     @track_llm_api_timing()
     async def post(
         self,

diff --git a/litellm/proxy/_experimental/mcp_server/server.py b/litellm/proxy/_experimental/mcp_server/server.py
--- a/litellm/proxy/_experimental/mcp_server/server.py
+++ b/litellm/proxy/_experimental/mcp_server/server.py
@@ -51,6 +51,10 @@
     get_server_prefix,
     iter_known_server_prefixes,
 )
+from litellm.llms.custom_httpx.http_handler import (
+    get_async_httpx_client,
+    httpxSpecialProvider,
+)
 from litellm.proxy._types import UserAPIKeyAuth
 from litellm.proxy.auth.ip_address_utils import IPAddressUtils
 from litellm.proxy.litellm_pre_call_utils import (
@@ -2754,6 +2758,98 @@
             )
         return user_api_key_auth.model_copy(update={"object_permission": updated_op})
 
+    def _get_forwarded_auth_from_scope(scope: dict) -> Optional[str]:
+        """Return the raw Authorization header value from the ASGI scope, or None."""
+        for key, value in scope.get("headers", []):
+            if key.lower() == b"authorization":
+                return value.decode("latin-1")
+        return None
+
+    async def _probe_upstream_auth(
+        url: str,
+        auth_header: str,
+        timeout: float = 5.0,
+    ) -> tuple:
+        """HEAD-probe the upstream URL to check whether the token is accepted.
+
+        Uses HEAD so the upstream receives no request body and allocates no
+        session or audit state. Returns (status_code, www_authenticate).
+        Fails-open with (200, None) on network errors so a transient hiccup
+        does not block valid requests.
+        """
+        try:
+            client = get_async_httpx_client(
+                llm_provider=httpxSpecialProvider.MCP,
+                params={"timeout": timeout},
+            )
+            resp = await client.head(
+                url,
+                headers={"Authorization": auth_header},
+            )
+            return resp.status_code, resp.headers.get("www-authenticate")
+        except Exception as exc:
+            verbose_logger.debug(
+                f"_probe_upstream_auth: probe to {url} failed ({exc}), allowing request through"
+            )
+            return 200, None
+
+    async def _check_passthrough_upstream_auth(
+        scope: Scope,
+        user_api_key_auth: Optional[UserAPIKeyAuth],
+        mcp_servers: Optional[List[str]],
+        client_ip: Optional[str],
+    ) -> None:
+        """Probe pass-through upstream servers in parallel before the MCP session starts.
+
+        Only servers the caller's key is already authorized to reach are probed —
+        the list is derived from _get_allowed_mcp_servers so that a user cannot
+        trigger an upstream probe against a server their key is not permitted for.
+
+        The MCP SDK commits HTTP 200 headers before invoking handlers, so a 401
+        can only be returned before that point. This function raises HTTPException(401)
+        with a WWW-Authenticate header if any upstream rejects the client token.
+        Fails-open: network errors are logged and the request is allowed through.
+        """
+        forwarded_auth = _get_forwarded_auth_from_scope(scope)
+        if not forwarded_auth:
+            return
+
+        # Use the authorized server set, not the raw user-supplied names, so that
+        # a caller cannot force a probe to a server their key is not allowed to use.
+        allowed_servers = await _get_allowed_mcp_servers(
+            user_api_key_auth=user_api_key_auth,
+            mcp_servers=mcp_servers,
+            client_ip=client_ip,
+        )
+        passthrough_servers = [
+            srv
+            for srv in allowed_servers
+            if srv.extra_headers
+            and any(h.lower() == "authorization" for h in srv.extra_headers)
+        ]
+        if not passthrough_servers:
+            return
+
+        probe_results = await asyncio.gather(
+            *[
+                _probe_upstream_auth(srv.url or "", forwarded_auth)
+                for srv in passthrough_servers
+            ]
+        )
+        request = StarletteRequest(scope)
+        base_url = get_request_base_url(request)
+        for srv, (probe_status, _) in zip(passthrough_servers, probe_results):
+            if probe_status in (401, 403):
+                authorization_uri = (
+                    f"Bearer authorization_uri="
+                    f"{base_url}/.well-known/oauth-authorization-server/{srv.name}"
+                )
+                raise HTTPException(
+                    status_code=401,
+                    detail="Unauthorized",
+                    headers={"WWW-Authenticate": authorization_uri},
+                )
+
     async def handle_streamable_http_mcp(
         scope: Scope, receive: Receive, send: Send
     ) -> None:
@@ -2827,6 +2923,13 @@
                     user_api_key_auth, active_toolset_id
                 )
 
+            # Pre-flight auth check for pass-through servers.  Must run after
+            # toolset scoping so the probe list is derived from the fully-authorized
+            # server set, not the raw user-supplied names.
+            await _check_passthrough_upstream_auth(
+                scope, user_api_key_auth, mcp_servers, _client_ip
+            )
+
             # Inject masked debug headers when client sends x-litellm-mcp-debug: true
             _debug_headers = MCPDebug.maybe_build_debug_headers(
                 raw_headers=raw_headers,

diff --git a/tests/test_litellm/llms/custom_httpx/test_http_handler.py b/tests/test_litellm/llms/custom_httpx/test_http_handler.py
--- a/tests/test_litellm/llms/custom_httpx/test_http_handler.py
+++ b/tests/test_litellm/llms/custom_httpx/test_http_handler.py
@@ -27,6 +27,39 @@
 
 
 @pytest.mark.asyncio
+async def test_async_head_returns_response_without_raise_for_status():
+    captured_request = None
+
+    async def mock_handler(request: httpx.Request) -> httpx.Response:
+        nonlocal captured_request
+        captured_request = request
+        return httpx.Response(
+            401,
+            request=request,
+            headers={"www-authenticate": 'Bearer realm="test"'},
+        )
+
+    litellm_handler = AsyncHTTPHandler()
+    await litellm_handler.client.aclose()
+    litellm_handler.client = httpx.AsyncClient(
+        transport=httpx.MockTransport(mock_handler)
+    )
+    try:
+        response = await litellm_handler.head(
+            "https://upstream.example/mcp",
+            headers={"Authorization": "Bearer some-token"},
+        )
+
+        assert response.status_code == 401
+        assert response.headers["www-authenticate"] == 'Bearer realm="test"'
+        assert captured_request is not None
+        assert captured_request.method == "HEAD"
+        assert captured_request.headers["Authorization"] == "Bearer some-token"
+    finally:
+        await litellm_handler.close()
+
+
+@pytest.mark.asyncio
 async def test_async_post_streaming_status_error_should_not_wait_forever_for_body(
     monkeypatch,
 ):

diff --git a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
--- a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
+++ b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
@@ -3256,3 +3256,75 @@
     ), "P2 API consistency issue: expected None for empty extra_headers, got: " + str(
         captured_extra_headers
     )
+
+
+# ---------------------------------------------------------------------------
+# Pre-flight upstream auth check tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_returns_upstream_status():
+    """_probe_upstream_auth forwards the status code from the upstream server."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_response = MagicMock()
+    mock_response.status_code = 401
+    mock_response.headers = {"www-authenticate": 'Bearer realm="test"'}
+
+    mock_client = AsyncMock()
+    mock_client.head = AsyncMock(return_value=mock_response)
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 401
+    assert www_auth == 'Bearer realm="test"'
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_fails_open_on_network_error():
+    """_probe_upstream_auth returns (200, None) when the network call fails."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_client = AsyncMock()
+    mock_client.head = AsyncMock(side_effect=Exception("connection refused"))
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 200
+    assert www_auth is None
+
+
+def test_get_forwarded_auth_from_scope_extracts_header():
+    """_get_forwarded_auth_from_scope returns the Authorization value."""
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    scope = {
+        "headers": [
+            (b"content-type", b"application/json"),
+            (b"authorization", b"Bearer my-token"),
+        ]
+    }
+    assert _get_forwarded_auth_from_scope(scope) == "Bearer my-token"
+
+
+def test_get_forwarded_auth_from_scope_returns_none_when_missing():
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    assert _get_forwarded_auth_from_scope({"headers": []}) is None


Comment thread litellm/proxy/_experimental/mcp_server/server.py
…bypass

_check_passthrough_upstream_auth was resolving user-supplied server names
directly before authorization ran, letting any permitted LiteLLM key
trigger an upstream HEAD probe to a server it was not allowed to use.

Changes:
- Call _get_allowed_mcp_servers inside the helper so only servers the
  caller's key is authorized for are probed.
- Move the call site to after toolset scoping so the auth context is
  fully resolved before the probe list is built.
- Thread user_api_key_auth into the helper signature (replaces the raw
  mcp_servers name list).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread litellm/proxy/_experimental/mcp_server/server.py Outdated
Co-authored-by: Yassin Kortam <yassin@berri.ai>
CLAassistant commented May 13, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 3 committers have signed the CLA.

✅ Sameerlite
❌ cursoragent
❌ claude-bot


claude-bot seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread litellm/proxy/_experimental/mcp_server/server.py
Comment thread litellm/proxy/_experimental/mcp_server/server.py
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Comment thread litellm/proxy/_experimental/mcp_server/server.py

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Unused head method added to AsyncHTTPHandler
    • Removed the unused AsyncHTTPHandler.head method and its production-unexercised test.
Preview (ade8ca1f0e)
diff --git a/litellm/proxy/_experimental/mcp_server/server.py b/litellm/proxy/_experimental/mcp_server/server.py
--- a/litellm/proxy/_experimental/mcp_server/server.py
+++ b/litellm/proxy/_experimental/mcp_server/server.py
@@ -51,13 +51,17 @@
     get_server_prefix,
     iter_known_server_prefixes,
 )
+from litellm.llms.custom_httpx.http_handler import (
+    get_async_httpx_client,
+    httpxSpecialProvider,
+)
 from litellm.proxy._types import UserAPIKeyAuth
 from litellm.proxy.auth.ip_address_utils import IPAddressUtils
 from litellm.proxy.litellm_pre_call_utils import (
     LiteLLMProxyRequestSetup,
     get_chain_id_from_headers,
 )
-from litellm.types.mcp import MCPAuth
+from litellm.types.mcp import MCPAuth, MCPSpecVersion
 from litellm.types.mcp_server.mcp_server_manager import MCPInfo, MCPServer
 from litellm.types.utils import CallTypes, StandardLoggingMCPToolCall
 from litellm.utils import Rules, client, function_setup
@@ -2754,6 +2758,115 @@
             )
         return user_api_key_auth.model_copy(update={"object_permission": updated_op})
 
+    def _get_forwarded_auth_from_scope(scope: Scope) -> Optional[str]:
+        """Return the raw Authorization header value from the ASGI scope, or None."""
+        for key, value in scope.get("headers", []):
+            if key.lower() == b"authorization":
+                return value.decode("latin-1")
+        return None
+
+    async def _probe_upstream_auth(
+        url: str,
+        auth_header: str,
+        timeout: float = 5.0,
+    ) -> tuple:
+        """JSON-RPC initialize-probe the upstream URL to check whether the token is accepted.
+
+        Uses POST so StreamableHTTP MCP servers run the same auth path as a
+        real client request. Returns (status_code, www_authenticate).
+        Fails-open with (200, None) on network errors so a transient hiccup
+        does not block valid requests.
+        """
+        try:
+            client = get_async_httpx_client(
+                llm_provider=httpxSpecialProvider.MCP,
+                params={"timeout": timeout},
+            )
+            probe_payload = {
+                "jsonrpc": "2.0",
+                "id": "litellm-mcp-auth-probe",
+                "method": "initialize",
+                "params": {
+                    "protocolVersion": MCPSpecVersion.jun_2025.value,
+                    "capabilities": {},
+                    "clientInfo": {
+                        "name": "litellm-mcp-auth-probe",
+                        "version": "1.0.0",
+                    },
+                },
+            }
+            resp = await client.client.post(  # type: ignore[attr-defined]
+                url,
+                headers={
+                    "Authorization": auth_header,
+                    "Accept": "application/json, text/event-stream",
+                },
+                json=probe_payload,
+            )
+            return resp.status_code, resp.headers.get("www-authenticate")
+        except Exception as exc:
+            verbose_logger.debug(
+                f"_probe_upstream_auth: probe to {url} failed ({exc}), allowing request through"
+            )
+            return 200, None
+
+    async def _check_passthrough_upstream_auth(
+        scope: Scope,
+        user_api_key_auth: Optional[UserAPIKeyAuth],
+        mcp_servers: Optional[List[str]],
+        client_ip: Optional[str],
+    ) -> None:
+        """Probe pass-through upstream servers in parallel before the MCP session starts.
+
+        Only servers the caller's key is already authorized to reach are probed —
+        the list is derived from _get_allowed_mcp_servers so that a user cannot
+        trigger an upstream probe against a server their key is not permitted for.
+
+        The MCP SDK commits HTTP 200 headers before invoking handlers, so a 401
+        can only be returned before that point. This function raises HTTPException(401)
+        with a WWW-Authenticate header if any upstream rejects the client token.
+        Fails-open: network errors are logged and the request is allowed through.
+        """
+        forwarded_auth = _get_forwarded_auth_from_scope(scope)
+        if not forwarded_auth:
+            return
+
+        # Use the authorized server set, not the raw user-supplied names, so that
+        # a caller cannot force a probe to a server their key is not allowed to use.
+        allowed_servers = await _get_allowed_mcp_servers(
+            user_api_key_auth=user_api_key_auth,
+            mcp_servers=mcp_servers,
+            client_ip=client_ip,
+        )
+        passthrough_servers = [
+            srv
+            for srv in allowed_servers
+            if srv.extra_headers
+            and any(h.lower() == "authorization" for h in srv.extra_headers)
+        ]
+        if not passthrough_servers:
+            return
+
+        probe_results = await asyncio.gather(
+            *[
+                _probe_upstream_auth(srv.url or "", forwarded_auth)
+                for srv in passthrough_servers
+            ]
+        )
+        request = StarletteRequest(scope)
+        base_url = get_request_base_url(request)
+        for srv, (probe_status, _) in zip(passthrough_servers, probe_results):
+            if probe_status in (401, 403):
+                authorization_uri = (
+                    f"Bearer authorization_uri="
+                    f"{base_url}/.well-known/oauth-authorization-server/{srv.name}"
+                )
+                raise HTTPException(
+                    status_code=401,
+                    detail="Unauthorized",
+                    headers={"WWW-Authenticate": authorization_uri},
+                )
+
     async def handle_streamable_http_mcp(
         scope: Scope, receive: Receive, send: Send
     ) -> None:
@@ -2827,6 +2940,13 @@
                     user_api_key_auth, active_toolset_id
                 )
 
+            # Pre-flight auth check for pass-through servers.  Must run after
+            # toolset scoping so the probe list is derived from the fully-authorized
+            # server set, not the raw user-supplied names.
+            await _check_passthrough_upstream_auth(
+                scope, user_api_key_auth, mcp_servers, _client_ip
+            )
+
             # Inject masked debug headers when client sends x-litellm-mcp-debug: true
             _debug_headers = MCPDebug.maybe_build_debug_headers(
                 raw_headers=raw_headers,

diff --git a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
--- a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
+++ b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
@@ -3256,3 +3256,79 @@
     ), "P2 API consistency issue: expected None for empty extra_headers, got: " + str(
         captured_extra_headers
     )
+
+
+# ---------------------------------------------------------------------------
+# Pre-flight upstream auth check tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_returns_upstream_status():
+    """_probe_upstream_auth forwards the status code from the upstream server."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_response = MagicMock()
+    mock_response.status_code = 401
+    mock_response.headers = {"www-authenticate": 'Bearer realm="test"'}
+
+    mock_client = AsyncMock()
+    mock_client.client.post = AsyncMock(return_value=mock_response)
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 401
+    assert www_auth == 'Bearer realm="test"'
+    mock_client.client.post.assert_awaited_once()
+    _, kwargs = mock_client.client.post.call_args
+    assert kwargs["headers"]["Authorization"] == "Bearer some-token"
+    assert kwargs["json"]["method"] == "initialize"
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_fails_open_on_network_error():
+    """_probe_upstream_auth returns (200, None) when the network call fails."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_client = AsyncMock()
+    mock_client.client.post = AsyncMock(side_effect=Exception("connection refused"))
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 200
+    assert www_auth is None
+
+
+def test_get_forwarded_auth_from_scope_extracts_header():
+    """_get_forwarded_auth_from_scope returns the Authorization value."""
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    scope = {
+        "headers": [
+            (b"content-type", b"application/json"),
+            (b"authorization", b"Bearer my-token"),
+        ]
+    }
+    assert _get_forwarded_auth_from_scope(scope) == "Bearer my-token"
+
+
+def test_get_forwarded_auth_from_scope_returns_none_when_missing():
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    assert _get_forwarded_auth_from_scope({"headers": []}) is None

Comment thread litellm/llms/custom_httpx/http_handler.py Outdated
cursoragent and others added 2 commits May 13, 2026 14:17
Co-authored-by: Yassin Kortam <yassin@berri.ai>
… probe

_prepare_mcp_server_headers skips caller Authorization when the server
uses OAuth client-credentials (M2M), but the pre-flight probe was still
selecting those servers and forwarding the caller's raw token in the HEAD
request. Exclude servers with has_client_credentials from the probe list
to match the actual downstream header-preparation logic.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Sameerlite

Collaborator Author

@greptile re review

Comment thread litellm/proxy/_experimental/mcp_server/server.py Outdated
Per RFC 9110, 401 means "go get new credentials." Mapping an upstream 403
to a gateway 401 causes OAuth clients to restart the authorization flow,
obtain a fresh token with identical scopes, hit 403 again, and loop
indefinitely.

401 from upstream → gateway 401 + WWW-Authenticate (re-authorize)
403 from upstream → gateway 403 (no WWW-Authenticate hint)

Co-authored-by: Cursor <cursoragent@cursor.com>
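
The status mapping described in this commit message can be sketched as a small pure function. This is only an illustration: `map_probe_status` is a hypothetical helper, not a function in the PR, and the discovery path simply mirrors the `WWW-Authenticate` format quoted in the PR description.

```python
from typing import Dict, Optional, Tuple


def map_probe_status(
    probe_status: int, base_url: str, server_name: str
) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (gateway_status, headers) for a rejected probe, or None to proceed.

    Hypothetical helper for illustration only; not part of the PR.
    """
    if probe_status == 401:
        # Expired or invalid token: point the client at the discovery URL.
        challenge = (
            "Bearer authorization_uri="
            f"{base_url}/.well-known/oauth-authorization-server/{server_name}"
        )
        return 401, {"WWW-Authenticate": challenge}
    if probe_status == 403:
        # Valid token, insufficient permission: no re-authorization hint,
        # otherwise an OAuth client would loop on the same scopes.
        return 403, {}
    # Any other status lets the request continue to the MCP session.
    return None
```

Keeping the 403 branch free of `WWW-Authenticate` is what stops the re-authorization loop described above.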
@Sameerlite

Collaborator Author

@greptile re review


        probe_results = await asyncio.gather(
            *[
                _probe_upstream_auth(srv.url or "", forwarded_auth)

Contributor

High: Proxy API key forwarded upstream

forwarded_auth is taken from the raw request Authorization header, but that same header is also accepted as the LiteLLM API key when x-litellm-api-key is not present. A configured token-forwarding MCP server can now receive and reuse a user's LiteLLM proxy key during this preflight probe; only run the probe when the proxy was authenticated with a separate credential, or pass an upstream token value that process_mcp_request has confirmed is not the proxy API key.

… key

The pre-flight upstream probe must not forward the caller's Authorization
header when it could itself be the LiteLLM proxy API key. Restrict the
probe to requests that supply x-litellm-api-key explicitly — only then is
the Authorization header unambiguously the upstream OAuth token the
caller wants forwarded.
@mateo-berri

Collaborator

@greptileai please re-review

Responses to outstanding feedback:

  • Greptile P1 (15:59) — 403 mapped to 401: addressed in 1925335 — upstream 403 now propagates as 403 without WWW-Authenticate.
  • Veria High (16:12) — proxy API key forwarded upstream: addressed in this push. _get_forwarded_auth_from_scope now requires x-litellm-api-key to be present before returning the Authorization header. When x-litellm-api-key is absent, Authorization may itself be the LiteLLM proxy key (backward-compat path in MCPRequestHandler.process_mcp_request), so the probe is skipped to prevent leaking the proxy key upstream.
  • Greptile P1 (13:30) — backwards-compat without feature flag: the prior behavior (returning 200 {"tools":[]} on upstream 401) was a bug masking auth failures. The probe also fails-open on network errors, so the only behavior change is converting silent-fail-with-empty-tools into a proper 401/403 — there is no benign behavior to preserve.
  • Greptile P1/P2 (13:30) — probe latency and redundant forwarded_auth: already addressed; probes run in parallel via asyncio.gather and forwarded_auth is computed once before the loop.
  • Greptile P2 (13:30) — initialize RPC side-effects: tried HEAD first but most MCP StreamableHTTP servers return 405 on HEAD and never invoke their auth middleware, so the probe missed real 401s. Falling back to POST initialize is the smallest payload that reliably exercises upstream auth; the upstream allocates a transient session at worst, which is acceptable for catching expired tokens before the SDK commits 200 OK.
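
A rough sketch of the parallel, fail-open probing mentioned in the responses above. `fake_probe`, `probe_all`, and the `.test` URLs are hypothetical stand-ins so the example runs without a network; the real probe issues an HTTP POST.

```python
import asyncio
from typing import List, Optional, Tuple

ProbeResult = Tuple[int, Optional[str]]


async def fake_probe(url: str, auth_header: str) -> ProbeResult:
    """Stand-in for an upstream probe; hypothetical URLs decide the outcome."""
    await asyncio.sleep(0.01)  # simulate network latency
    try:
        if "down" in url:
            raise ConnectionError("connection refused")
        if "expired" in url:
            return 401, 'Bearer realm="test"'
        return 200, None
    except Exception:
        # Fail open: a transient network error must not block the request.
        return 200, None


async def probe_all(urls: List[str], auth_header: str) -> List[ProbeResult]:
    # gather() runs the probes concurrently and preserves input order,
    # so results can be zipped back onto the server list.
    return await asyncio.gather(*(fake_probe(u, auth_header) for u in urls))


results = asyncio.run(
    probe_all(
        ["http://ok.test/mcp", "http://expired.test/mcp", "http://down.test/mcp"],
        "Bearer some-token",
    )
)
```

Because the probes run concurrently, the added latency is bounded by the slowest probe (or its timeout) rather than the sum of all probes.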

Comment on lines +2814 to +2827
            resp = await client.client.post(  # type: ignore[attr-defined]
                url,
                headers={
                    "Authorization": auth_header,
                    "Accept": "application/json, text/event-stream",
                },
                json=probe_payload,
            )
            return resp.status_code, resp.headers.get("www-authenticate")
        except Exception as exc:
            verbose_logger.debug(
                f"_probe_upstream_auth: probe to {url} failed ({exc}), allowing request through"
            )
            return 200, None

Contributor

P1 Probe bypasses AsyncHTTPHandler.post() and will silently fail-open if refactored

AsyncHTTPHandler.post() calls response.raise_for_status() internally, so calling client.post() for a 401/403 upstream would raise httpx.HTTPStatusError, which would then be caught by the broad except Exception block and cause the probe to return (200, None) — silently defeating the entire feature. The workaround (client.client.post) accesses the internal httpx client directly (confirmed by the # type: ignore[attr-defined]) to avoid this. If AsyncHTTPHandler is ever refactored (e.g., the self.client attribute renamed), the probe fails-open on every request with no warning.

The correct fix is to use the public AsyncHTTPHandler.post() and handle the status error before the catch-all — add import httpx and restructure as:

  • resp = await client.post(url, headers=..., json=..., timeout=timeout)
  • Add except httpx.HTTPStatusError as exc: return exc.response.status_code, exc.response.headers.get("www-authenticate") before the general except Exception
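
The ordering concern in this comment can be shown with a self-contained sketch. `StatusError` is a hypothetical stand-in for `httpx.HTTPStatusError`, and `post_that_raises` stands in for a handler whose `post()` calls `raise_for_status()`; neither is LiteLLM code.

```python
from typing import Dict, Optional, Tuple


class StatusError(Exception):
    """Hypothetical stand-in for httpx.HTTPStatusError."""

    def __init__(self, status_code: int, headers: Dict[str, str]) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code
        self.headers = headers


def post_that_raises(status_code: int) -> None:
    """Stand-in for a handler whose post() calls raise_for_status()."""
    if status_code >= 400:
        raise StatusError(status_code, {"www-authenticate": 'Bearer realm="test"'})


def probe(status_code: int, network_down: bool = False) -> Tuple[int, Optional[str]]:
    try:
        if network_down:
            raise ConnectionError("connection refused")
        post_that_raises(status_code)
        return status_code, None
    except StatusError as exc:
        # Must come first: otherwise the catch-all below would turn 401 into 200.
        return exc.status_code, exc.headers.get("www-authenticate")
    except Exception:
        # Fail open on everything else (timeouts, DNS, refused connections).
        return 200, None
```

If the `except StatusError` clause were removed, `probe(401)` would return `(200, None)` and the pre-flight check would never fire.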

Collaborator

Addressed in 560c22b — the probe now uses the public AsyncHTTPHandler.post() and catches httpx.HTTPStatusError explicitly before the broad fail-open except Exception, so a 401/403 from upstream is no longer at risk of being silently swallowed if the handler refactors its internal client.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Pre-flight 401 becomes 504 timeout on toolset/dynamic routes
    • Propagated pre-header handler exceptions through the ASGI bridge so upstream 401/403 responses preserve their status and headers instead of timing out.
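
A minimal sketch of the propagation idea, assuming a future that the bridge awaits until response headers arrive. `run_bridge`, the two handlers, and `outcome` are hypothetical names; `PermissionError` stands in for the `HTTPException` raised by the pre-flight check.

```python
import asyncio


async def run_bridge(handler) -> str:
    loop = asyncio.get_running_loop()
    headers_ready: asyncio.Future = loop.create_future()
    task = asyncio.ensure_future(handler(headers_ready))

    def on_done(t: asyncio.Task) -> None:
        if t.cancelled():
            return
        exc = t.exception()
        # Handler died before sending headers: hand its exception to the waiter.
        if exc is not None and not headers_ready.done():
            headers_ready.set_exception(exc)

    task.add_done_callback(on_done)
    # Without the callback, a failed handler would leave this wait to time out.
    return await asyncio.wait_for(headers_ready, timeout=1.0)


async def failing_handler(headers_ready: asyncio.Future) -> None:
    raise PermissionError("401 from pre-flight probe")


async def ok_handler(headers_ready: asyncio.Future) -> None:
    headers_ready.set_result("200 OK")


def outcome(handler) -> str:
    try:
        return asyncio.run(run_bridge(handler))
    except PermissionError as exc:
        return f"propagated: {exc}"
```

Here `outcome(failing_handler)` surfaces the original exception immediately instead of waiting for the header timeout.
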
Preview (851d9f6628)
diff --git a/litellm/proxy/_experimental/mcp_server/server.py b/litellm/proxy/_experimental/mcp_server/server.py
--- a/litellm/proxy/_experimental/mcp_server/server.py
+++ b/litellm/proxy/_experimental/mcp_server/server.py
@@ -51,13 +51,17 @@
     get_server_prefix,
     iter_known_server_prefixes,
 )
+from litellm.llms.custom_httpx.http_handler import (
+    get_async_httpx_client,
+    httpxSpecialProvider,
+)
 from litellm.proxy._types import UserAPIKeyAuth
 from litellm.proxy.auth.ip_address_utils import IPAddressUtils
 from litellm.proxy.litellm_pre_call_utils import (
     LiteLLMProxyRequestSetup,
     get_chain_id_from_headers,
 )
-from litellm.types.mcp import MCPAuth
+from litellm.types.mcp import MCPAuth, MCPSpecVersion
 from litellm.types.mcp_server.mcp_server_manager import MCPInfo, MCPServer
 from litellm.types.utils import CallTypes, StandardLoggingMCPToolCall
 from litellm.utils import Rules, client, function_setup
@@ -2754,6 +2758,144 @@
             )
         return user_api_key_auth.model_copy(update={"object_permission": updated_op})
 
+    def _get_forwarded_auth_from_scope(scope: Scope) -> Optional[str]:
+        """Return the upstream-bound ``Authorization`` header value, or None.
+
+        Only returns the ``Authorization`` header when ``x-litellm-api-key`` is
+        also present. In that case ``Authorization`` is unambiguously the
+        upstream token the caller wants forwarded to the MCP server. When
+        ``x-litellm-api-key`` is absent the ``Authorization`` header may itself
+        be the LiteLLM proxy API key (backward-compat path in
+        ``MCPRequestHandler.process_mcp_request``), and forwarding it upstream
+        would leak the proxy key to a third-party MCP server.
+        """
+        authorization = None
+        has_litellm_key_header = False
+        for key, value in scope.get("headers", []):
+            key_lower = key.lower()
+            if key_lower == b"authorization":
+                authorization = value.decode("latin-1")
+            elif key_lower == b"x-litellm-api-key":
+                has_litellm_key_header = True
+        if not has_litellm_key_header:
+            return None
+        return authorization
+
+    async def _probe_upstream_auth(
+        url: str,
+        auth_header: str,
+        timeout: float = 5.0,
+    ) -> tuple:
+        """JSON-RPC initialize-probe the upstream URL to check whether the token is accepted.
+
+        Uses POST so StreamableHTTP MCP servers run the same auth path as a
+        real client request. Returns (status_code, www_authenticate).
+        Fails-open with (200, None) on network errors so a transient hiccup
+        does not block valid requests.
+        """
+        try:
+            client = get_async_httpx_client(
+                llm_provider=httpxSpecialProvider.MCP,
+                params={"timeout": timeout},
+            )
+            probe_payload = {
+                "jsonrpc": "2.0",
+                "id": "litellm-mcp-auth-probe",
+                "method": "initialize",
+                "params": {
+                    "protocolVersion": MCPSpecVersion.jun_2025.value,
+                    "capabilities": {},
+                    "clientInfo": {
+                        "name": "litellm-mcp-auth-probe",
+                        "version": "1.0.0",
+                    },
+                },
+            }
+            resp = await client.client.post(  # type: ignore[attr-defined]
+                url,
+                headers={
+                    "Authorization": auth_header,
+                    "Accept": "application/json, text/event-stream",
+                },
+                json=probe_payload,
+            )
+            return resp.status_code, resp.headers.get("www-authenticate")
+        except Exception as exc:
+            verbose_logger.debug(
+                f"_probe_upstream_auth: probe to {url} failed ({exc}), allowing request through"
+            )
+            return 200, None
+
+    async def _check_passthrough_upstream_auth(
+        scope: Scope,
+        user_api_key_auth: Optional[UserAPIKeyAuth],
+        mcp_servers: Optional[List[str]],
+        client_ip: Optional[str],
+    ) -> None:
+        """Probe pass-through upstream servers in parallel before the MCP session starts.
+
+        Only servers the caller's key is already authorized to reach are probed —
+        the list is derived from _get_allowed_mcp_servers so that a user cannot
+        trigger an upstream probe against a server their key is not permitted for.
+
+        The MCP SDK commits HTTP 200 headers before invoking handlers, so a 401
+        can only be returned before that point. This function raises HTTPException(401)
+        with a WWW-Authenticate header if any upstream rejects the client token.
+        Fails-open: network errors are logged and the request is allowed through.
+        """
+        forwarded_auth = _get_forwarded_auth_from_scope(scope)
+        if not forwarded_auth:
+            return
+
+        # Use the authorized server set, not the raw user-supplied names, so that
+        # a caller cannot force a probe to a server their key is not allowed to use.
+        allowed_servers = await _get_allowed_mcp_servers(
+            user_api_key_auth=user_api_key_auth,
+            mcp_servers=mcp_servers,
+            client_ip=client_ip,
+        )
+        passthrough_servers = [
+            srv
+            for srv in allowed_servers
+            if srv.extra_headers
+            and any(h.lower() == "authorization" for h in srv.extra_headers)
+            # Exclude M2M servers: _prepare_mcp_server_headers skips caller
+            # Authorization when has_client_credentials is set, so probing
+            # those with the caller's token would send the wrong credential.
+            and not srv.has_client_credentials
+        ]
+        if not passthrough_servers:
+            return
+
+        probe_results = await asyncio.gather(
+            *[
+                _probe_upstream_auth(srv.url or "", forwarded_auth)
+                for srv in passthrough_servers
+            ]
+        )
+        request = StarletteRequest(scope)
+        base_url = get_request_base_url(request)
+        for srv, (probe_status, _) in zip(passthrough_servers, probe_results):
+            if probe_status == 401:
+                # Token is missing or expired — direct the client to re-authorize.
+                authorization_uri = (
+                    f"Bearer authorization_uri="
+                    f"{base_url}/.well-known/oauth-authorization-server/{srv.name}"
+                )
+                raise HTTPException(
+                    status_code=401,
+                    detail="Unauthorized",
+                    headers={"WWW-Authenticate": authorization_uri},
+                )
+            if probe_status == 403:
+                # Token is valid but the caller lacks permission — do not hint
+                # at re-authorization (RFC 9110: a fresh token with the same
+                # scopes would just hit 403 again and loop indefinitely).
+                raise HTTPException(
+                    status_code=403,
+                    detail="Forbidden",
+                )
+
     async def handle_streamable_http_mcp(
         scope: Scope, receive: Receive, send: Send
     ) -> None:
@@ -2827,6 +2969,13 @@
                     user_api_key_auth, active_toolset_id
                 )
 
+            # Pre-flight auth check for pass-through servers.  Must run after
+            # toolset scoping so the probe list is derived from the fully-authorized
+            # server set, not the raw user-supplied names.
+            await _check_passthrough_upstream_auth(
+                scope, user_api_key_auth, mcp_servers, _client_ip
+            )
+
             # Inject masked debug headers when client sends x-litellm-mcp-debug: true
             _debug_headers = MCPDebug.maybe_build_debug_headers(
                 raw_headers=raw_headers,

diff --git a/litellm/proxy/proxy_server.py b/litellm/proxy/proxy_server.py
--- a/litellm/proxy/proxy_server.py
+++ b/litellm/proxy/proxy_server.py
@@ -15006,10 +15006,19 @@
     # If the handler task dies (exception or cancellation) without sending the EOF
     # sentinel, body_iter() would block forever on body_queue.get().  The callback
     # below guarantees the queue gets unblocked regardless of how the task ends.
+    # When this happens before response headers, propagate the original exception
+    # instead of waiting for the header timeout.
     def _ensure_eof(task: asyncio.Task) -> None:
-        if task.cancelled() or task.exception() is not None:
+        if task.cancelled():
             body_queue.put_nowait(None)
+            return
 
+        task_exception = task.exception()
+        if task_exception is not None:
+            if not headers_ready.done():
+                headers_ready.set_exception(task_exception)
+            body_queue.put_nowait(None)
+
     handler_task.add_done_callback(_ensure_eof)
 
     try:

diff --git a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
--- a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
+++ b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
@@ -3256,3 +3256,101 @@
     ), "P2 API consistency issue: expected None for empty extra_headers, got: " + str(
         captured_extra_headers
     )
+
+
+# ---------------------------------------------------------------------------
+# Pre-flight upstream auth check tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_returns_upstream_status():
+    """_probe_upstream_auth forwards the status code from the upstream server."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_response = MagicMock()
+    mock_response.status_code = 401
+    mock_response.headers = {"www-authenticate": 'Bearer realm="test"'}
+
+    mock_client = AsyncMock()
+    mock_client.client.post = AsyncMock(return_value=mock_response)
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 401
+    assert www_auth == 'Bearer realm="test"'
+    mock_client.client.post.assert_awaited_once()
+    _, kwargs = mock_client.client.post.call_args
+    assert kwargs["headers"]["Authorization"] == "Bearer some-token"
+    assert kwargs["json"]["method"] == "initialize"
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_fails_open_on_network_error():
+    """_probe_upstream_auth returns (200, None) when the network call fails."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_client = AsyncMock()
+    mock_client.client.post = AsyncMock(side_effect=Exception("connection refused"))
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 200
+    assert www_auth is None
+
+
+def test_get_forwarded_auth_from_scope_extracts_header():
+    """Returns Authorization value when x-litellm-api-key is also present."""
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    scope = {
+        "headers": [
+            (b"content-type", b"application/json"),
+            (b"x-litellm-api-key", b"sk-litellm-proxy-key"),
+            (b"authorization", b"Bearer my-token"),
+        ]
+    }
+    assert _get_forwarded_auth_from_scope(scope) == "Bearer my-token"
+
+
+def test_get_forwarded_auth_from_scope_returns_none_when_missing():
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    assert _get_forwarded_auth_from_scope({"headers": []}) is None
+
+
+def test_get_forwarded_auth_from_scope_skips_when_no_litellm_key_header():
+    """Skip when ``x-litellm-api-key`` is absent.
+
+    Without ``x-litellm-api-key``, the ``Authorization`` header may itself be
+    the LiteLLM proxy API key (backward-compat). Forwarding it upstream would
+    leak the proxy key, so the helper must return None and the probe must
+    not fire.
+    """
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    scope = {
+        "headers": [
+            (b"content-type", b"application/json"),
+            (b"authorization", b"Bearer ambiguous-token"),
+        ]
+    }
+    assert _get_forwarded_auth_from_scope(scope) is None

diff --git a/tests/test_litellm/proxy/test_mcp_asgi_response.py b/tests/test_litellm/proxy/test_mcp_asgi_response.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/proxy/test_mcp_asgi_response.py
@@ -1,0 +1,36 @@
+import asyncio
+
+import pytest
+from fastapi import HTTPException
+
+from litellm.proxy.proxy_server import _stream_mcp_asgi_response
+
+
+@pytest.mark.asyncio
+async def test_stream_mcp_asgi_response_propagates_pre_header_http_exception():
+    async def handle_fn(_scope, _receive, _send):
+        raise HTTPException(
+            status_code=401,
+            detail="Unauthorized",
+            headers={
+                "WWW-Authenticate": "Bearer authorization_uri=https://example.test/auth"
+            },
+        )
+
+    async def receive():
+        return {"type": "http.request", "body": b"", "more_body": False}
+
+    with pytest.raises(HTTPException) as exc_info:
+        await asyncio.wait_for(
+            _stream_mcp_asgi_response(
+                handle_fn,
+                {"type": "http", "method": "POST", "path": "/mcp", "headers": []},
+                receive,
+            ),
+            timeout=1.0,
+        )
+
+    assert exc_info.value.status_code == 401
+    assert exc_info.value.headers == {
+        "WWW-Authenticate": "Bearer authorization_uri=https://example.test/auth"
+    }

Comment thread litellm/proxy/_experimental/mcp_server/server.py
cursoragent and others added 2 commits May 13, 2026 17:09
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Use AsyncHTTPHandler.post() and catch httpx.HTTPStatusError explicitly so
the 401/403 we want to surface is not silently swallowed by the broad
fail-open except Exception block. Avoids reaching into the handler's
private client attribute, which would silently regress to fail-open if
AsyncHTTPHandler is ever refactored.
@mateo-berri

Collaborator

@greptileai

Comment on lines +3275 to +3311
    mock_client = AsyncMock()
    mock_client.client.post = AsyncMock(return_value=mock_response)

    with patch(
        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
        return_value=mock_client,
    ):
        status, www_auth = await _probe_upstream_auth(
            "http://upstream/mcp", "Bearer some-token"
        )

    assert status == 401
    assert www_auth == 'Bearer realm="test"'
    mock_client.client.post.assert_awaited_once()
    _, kwargs = mock_client.client.post.call_args
    assert kwargs["headers"]["Authorization"] == "Bearer some-token"
    assert kwargs["json"]["method"] == "initialize"


@pytest.mark.asyncio
async def test_probe_upstream_auth_fails_open_on_network_error():
    """_probe_upstream_auth returns (200, None) when the network call fails."""
    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth

    mock_client = AsyncMock()
    mock_client.client.post = AsyncMock(side_effect=Exception("connection refused"))

    with patch(
        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
        return_value=mock_client,
    ):
        status, www_auth = await _probe_upstream_auth(
            "http://upstream/mcp", "Bearer some-token"
        )

    assert status == 200
    assert www_auth is None

Contributor

P1 Probe tests mock the internal client instead of the public method

Both probe tests set up mock_client.client.post, but _probe_upstream_auth now calls the public client.post(...) (AsyncHTTPHandler.post()), which is a completely separate mock attribute. AsyncHTTPHandler.post() calls self.client.send() then raise_for_status(), meaning a real 401 would arrive as an httpx.HTTPStatusError, not as a response object.

For test_probe_upstream_auth_returns_upstream_status: await mock_client.post(...) returns mock_client.post.return_value (an AsyncMock), so resp.status_code is a MagicMock, not 401; the assertion fails.

For test_probe_upstream_auth_fails_open_on_network_error: mock_client.client.post is wired to raise but mock_client.post is not — no exception is ever thrown, so the fail-open path is never exercised and status is again a MagicMock rather than 200.

To correctly test the 401 case, mock_client.post should be configured to raise httpx.HTTPStatusError with a 401 response (since AsyncHTTPHandler.post() calls raise_for_status()). For the fail-open test, mock_client.post itself should be wired with side_effect=Exception(...).
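
The attribute mismatch described above can be reproduced with `unittest.mock` alone. This is a sketch under the assumption that production code awaits the public `post()`; `call_public_post` and `probe_with` are hypothetical helpers, not the PR's test code.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock

mock_response = MagicMock()
mock_response.status_code = 401

mock_client = AsyncMock()
# Configures only the nested attribute; mock_client.post stays an auto-created mock.
mock_client.client.post = AsyncMock(return_value=mock_response)


async def call_public_post():
    # Mirrors production code that awaits the public post() method.
    return await mock_client.post("http://upstream/mcp")


resp = asyncio.run(call_public_post())
status_is_401 = resp.status_code == 401  # compares an auto-created mock to 401
nested_mock_awaited = mock_client.client.post.await_count > 0

# Wiring the public attribute instead exercises the fail-open branch.
fixed_client = AsyncMock()
fixed_client.post = AsyncMock(side_effect=ConnectionError("connection refused"))


async def probe_with(client) -> int:
    try:
        await client.post("http://upstream/mcp")
        return -1  # not reached in this sketch
    except Exception:
        return 200  # fail open


fail_open_status = asyncio.run(probe_with(fixed_client))
```

In this sketch the configured 401 response is never seen through `mock_client.post`, and the nested mock is never awaited, which is why the original assertions could not exercise the production path.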

Collaborator

Addressed in c07b62a + 28eda2a — tests now mock client.post (matching the production call) and there is a new test_probe_upstream_auth_surfaces_httpx_status_error that exercises the httpx.HTTPStatusError path raised by raise_for_status().

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Tests mock wrong attribute path, never exercising production code
    • Updated the auth-probe tests to mock and assert AsyncHTTPHandler.post directly, so the production call path and network-error branch are exercised.
Preview (28eda2a662)
diff --git a/litellm/proxy/_experimental/mcp_server/server.py b/litellm/proxy/_experimental/mcp_server/server.py
--- a/litellm/proxy/_experimental/mcp_server/server.py
+++ b/litellm/proxy/_experimental/mcp_server/server.py
@@ -23,6 +23,7 @@
     cast,
 )
 
+import httpx
 from fastapi import FastAPI, HTTPException
 from pydantic import AnyUrl, ConfigDict
 from starlette.requests import Request as StarletteRequest
@@ -51,13 +52,17 @@
     get_server_prefix,
     iter_known_server_prefixes,
 )
+from litellm.llms.custom_httpx.http_handler import (
+    get_async_httpx_client,
+    httpxSpecialProvider,
+)
 from litellm.proxy._types import UserAPIKeyAuth
 from litellm.proxy.auth.ip_address_utils import IPAddressUtils
 from litellm.proxy.litellm_pre_call_utils import (
     LiteLLMProxyRequestSetup,
     get_chain_id_from_headers,
 )
-from litellm.types.mcp import MCPAuth
+from litellm.types.mcp import MCPAuth, MCPSpecVersion
 from litellm.types.mcp_server.mcp_server_manager import MCPInfo, MCPServer
 from litellm.types.utils import CallTypes, StandardLoggingMCPToolCall
 from litellm.utils import Rules, client, function_setup
@@ -2754,6 +2759,157 @@
             )
         return user_api_key_auth.model_copy(update={"object_permission": updated_op})
 
+    def _get_forwarded_auth_from_scope(scope: Scope) -> Optional[str]:
+        """Return the upstream-bound ``Authorization`` header value, or None.
+
+        Only returns the ``Authorization`` header when ``x-litellm-api-key`` is
+        also present. In that case ``Authorization`` is unambiguously the
+        upstream token the caller wants forwarded to the MCP server. When
+        ``x-litellm-api-key`` is absent the ``Authorization`` header may itself
+        be the LiteLLM proxy API key (backward-compat path in
+        ``MCPRequestHandler.process_mcp_request``), and forwarding it upstream
+        would leak the proxy key to a third-party MCP server.
+        """
+        authorization = None
+        has_litellm_key_header = False
+        for key, value in scope.get("headers", []):
+            key_lower = key.lower()
+            if key_lower == b"authorization":
+                authorization = value.decode("latin-1")
+            elif key_lower == b"x-litellm-api-key":
+                has_litellm_key_header = True
+        if not has_litellm_key_header:
+            return None
+        return authorization
+
+    async def _probe_upstream_auth(
+        url: str,
+        auth_header: str,
+        timeout: float = 5.0,
+    ) -> tuple:
+        """JSON-RPC initialize-probe the upstream URL to check whether the token is accepted.
+
+        Uses POST so StreamableHTTP MCP servers run the same auth path as a
+        real client request. Returns (status_code, www_authenticate).
+        Fails-open with (200, None) on network errors so a transient hiccup
+        does not block valid requests.
+
+        Uses the public ``AsyncHTTPHandler.post()`` interface and catches
+        ``httpx.HTTPStatusError`` separately so the 401/403 we want to surface
+        is not swallowed by the broad fail-open ``except Exception`` below.
+        """
+        client = get_async_httpx_client(
+            llm_provider=httpxSpecialProvider.MCP,
+            params={"timeout": timeout},
+        )
+        probe_payload = {
+            "jsonrpc": "2.0",
+            "id": "litellm-mcp-auth-probe",
+            "method": "initialize",
+            "params": {
+                "protocolVersion": MCPSpecVersion.jun_2025.value,
+                "capabilities": {},
+                "clientInfo": {
+                    "name": "litellm-mcp-auth-probe",
+                    "version": "1.0.0",
+                },
+            },
+        }
+        probe_headers = {
+            "Authorization": auth_header,
+            "Accept": "application/json, text/event-stream",
+        }
+        try:
+            resp = await client.post(
+                url=url,
+                headers=probe_headers,
+                json=probe_payload,
+                timeout=timeout,
+            )
+            return resp.status_code, resp.headers.get("www-authenticate")
+        except httpx.HTTPStatusError as exc:
+            # AsyncHTTPHandler.post() calls raise_for_status(); a 401/403 from
+            # upstream lands here. Return its status so the caller can map it
+            # to the appropriate response.
+            return exc.response.status_code, exc.response.headers.get(
+                "www-authenticate"
+            )
+        except Exception as exc:
+            verbose_logger.debug(
+                f"_probe_upstream_auth: probe to {url} failed ({exc}), allowing request through"
+            )
+            return 200, None
+
+    async def _check_passthrough_upstream_auth(
+        scope: Scope,
+        user_api_key_auth: Optional[UserAPIKeyAuth],
+        mcp_servers: Optional[List[str]],
+        client_ip: Optional[str],
+    ) -> None:
+        """Probe pass-through upstream servers in parallel before the MCP session starts.
+
+        Only servers the caller's key is already authorized to reach are probed —
+        the list is derived from _get_allowed_mcp_servers so that a user cannot
+        trigger an upstream probe against a server their key is not permitted for.
+
+        The MCP SDK commits HTTP 200 headers before invoking handlers, so a 401
+        can only be returned before that point. This function raises HTTPException(401)
+        with a WWW-Authenticate header if any upstream rejects the client token.
+        Fails-open: network errors are logged and the request is allowed through.
+        """
+        forwarded_auth = _get_forwarded_auth_from_scope(scope)
+        if not forwarded_auth:
+            return
+
+        # Use the authorized server set, not the raw user-supplied names, so that
+        # a caller cannot force a probe to a server their key is not allowed to use.
+        allowed_servers = await _get_allowed_mcp_servers(
+            user_api_key_auth=user_api_key_auth,
+            mcp_servers=mcp_servers,
+            client_ip=client_ip,
+        )
+        passthrough_servers = [
+            srv
+            for srv in allowed_servers
+            if srv.extra_headers
+            and any(h.lower() == "authorization" for h in srv.extra_headers)
+            # Exclude M2M servers: _prepare_mcp_server_headers skips caller
+            # Authorization when has_client_credentials is set, so probing
+            # those with the caller's token would send the wrong credential.
+            and not srv.has_client_credentials
+        ]
+        if not passthrough_servers:
+            return
+
+        probe_results = await asyncio.gather(
+            *[
+                _probe_upstream_auth(srv.url or "", forwarded_auth)
+                for srv in passthrough_servers
+            ]
+        )
+        request = StarletteRequest(scope)
+        base_url = get_request_base_url(request)
+        for srv, (probe_status, _) in zip(passthrough_servers, probe_results):
+            if probe_status == 401:
+                # Token was rejected as invalid or expired; direct the client to re-authorize.
+                authorization_uri = (
+                    f"Bearer authorization_uri="
+                    f"{base_url}/.well-known/oauth-authorization-server/{srv.name}"
+                )
+                raise HTTPException(
+                    status_code=401,
+                    detail="Unauthorized",
+                    headers={"WWW-Authenticate": authorization_uri},
+                )
+            if probe_status == 403:
+                # Token is valid but the caller lacks permission — do not hint
+                # at re-authorization (RFC 9110: a fresh token with the same
+                # scopes would just hit 403 again and loop indefinitely).
+                raise HTTPException(
+                    status_code=403,
+                    detail="Forbidden",
+                )
+
     async def handle_streamable_http_mcp(
         scope: Scope, receive: Receive, send: Send
     ) -> None:
@@ -2827,6 +2983,13 @@
                     user_api_key_auth, active_toolset_id
                 )
 
+            # Pre-flight auth check for pass-through servers.  Must run after
+            # toolset scoping so the probe list is derived from the fully-authorized
+            # server set, not the raw user-supplied names.
+            await _check_passthrough_upstream_auth(
+                scope, user_api_key_auth, mcp_servers, _client_ip
+            )
+
             # Inject masked debug headers when client sends x-litellm-mcp-debug: true
             _debug_headers = MCPDebug.maybe_build_debug_headers(
                 raw_headers=raw_headers,

diff --git a/litellm/proxy/proxy_server.py b/litellm/proxy/proxy_server.py
--- a/litellm/proxy/proxy_server.py
+++ b/litellm/proxy/proxy_server.py
@@ -15006,10 +15006,19 @@
     # If the handler task dies (exception or cancellation) without sending the EOF
     # sentinel, body_iter() would block forever on body_queue.get().  The callback
     # below guarantees the queue gets unblocked regardless of how the task ends.
+    # When this happens before response headers, propagate the original exception
+    # instead of waiting for the header timeout.
     def _ensure_eof(task: asyncio.Task) -> None:
-        if task.cancelled() or task.exception() is not None:
+        if task.cancelled():
             body_queue.put_nowait(None)
+            return
 
+        task_exception = task.exception()
+        if task_exception is not None:
+            if not headers_ready.done():
+                headers_ready.set_exception(task_exception)
+            body_queue.put_nowait(None)
+
     handler_task.add_done_callback(_ensure_eof)
 
     try:

diff --git a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
--- a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
+++ b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
@@ -3256,3 +3256,137 @@
     ), "P2 API consistency issue: expected None for empty extra_headers, got: " + str(
         captured_extra_headers
     )
+
+
+# ---------------------------------------------------------------------------
+# Pre-flight upstream auth check tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_returns_upstream_status():
+    """_probe_upstream_auth forwards the status code from the upstream server."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_response = MagicMock()
+    mock_response.status_code = 401
+    mock_response.headers = {"www-authenticate": 'Bearer realm="test"'}
+
+    mock_client = MagicMock()
+    mock_client.post = AsyncMock(return_value=mock_response)
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 401
+    assert www_auth == 'Bearer realm="test"'
+    mock_client.post.assert_awaited_once()
+    _, kwargs = mock_client.post.call_args
+    assert kwargs["headers"]["Authorization"] == "Bearer some-token"
+    assert kwargs["json"]["method"] == "initialize"
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_surfaces_httpx_status_error():
+    """Probe extracts status + WWW-Authenticate from httpx.HTTPStatusError.
+
+    AsyncHTTPHandler.post() calls raise_for_status() internally, so when the
+    upstream returns 401/403 the call raises httpx.HTTPStatusError rather than
+    returning the response. The probe must catch that specifically (before the
+    fail-open `except Exception`) so the auth check is not silently defeated.
+    """
+    import httpx
+
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_response = MagicMock()
+    mock_response.status_code = 401
+    mock_response.headers = {"www-authenticate": 'Bearer realm="test"'}
+    request = httpx.Request("POST", "http://upstream/mcp")
+    error = httpx.HTTPStatusError(
+        message="401 Unauthorized", request=request, response=mock_response
+    )
+
+    mock_client = MagicMock()
+    mock_client.post = AsyncMock(side_effect=error)
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 401
+    assert www_auth == 'Bearer realm="test"'
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_fails_open_on_network_error():
+    """_probe_upstream_auth returns (200, None) when the network call fails."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_client = MagicMock()
+    mock_client.post = AsyncMock(side_effect=Exception("connection refused"))
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 200
+    assert www_auth is None
+
+
+def test_get_forwarded_auth_from_scope_extracts_header():
+    """Returns Authorization value when x-litellm-api-key is also present."""
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    scope = {
+        "headers": [
+            (b"content-type", b"application/json"),
+            (b"x-litellm-api-key", b"sk-litellm-proxy-key"),
+            (b"authorization", b"Bearer my-token"),
+        ]
+    }
+    assert _get_forwarded_auth_from_scope(scope) == "Bearer my-token"
+
+
+def test_get_forwarded_auth_from_scope_returns_none_when_missing():
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    assert _get_forwarded_auth_from_scope({"headers": []}) is None
+
+
+def test_get_forwarded_auth_from_scope_skips_when_no_litellm_key_header():
+    """Skip when ``x-litellm-api-key`` is absent.
+
+    Without ``x-litellm-api-key``, the ``Authorization`` header may itself be
+    the LiteLLM proxy API key (backward-compat). Forwarding it upstream would
+    leak the proxy key, so the helper must return None and the probe must
+    not fire.
+    """
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    scope = {
+        "headers": [
+            (b"content-type", b"application/json"),
+            (b"authorization", b"Bearer ambiguous-token"),
+        ]
+    }
+    assert _get_forwarded_auth_from_scope(scope) is None
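
The three `_get_forwarded_auth_from_scope` tests above pin down a contract: the `Authorization` value is returned only when `x-litellm-api-key` is also present, because otherwise `Authorization` may be the proxy key itself. A minimal sketch of a helper that would satisfy those tests follows. It is inferred from the tests alone and is not the code merged in this PR; the name `forwarded_auth_from_scope` and the `latin-1` decoding are assumptions.

```python
from typing import Any, Dict, Optional


def forwarded_auth_from_scope(scope: Dict[str, Any]) -> Optional[str]:
    """Hypothetical stand-in for _get_forwarded_auth_from_scope."""
    # ASGI delivers headers as (name, value) byte pairs.
    headers = {name.lower(): value for name, value in scope.get("headers", [])}
    # Without x-litellm-api-key, Authorization may be the proxy key: never forward it.
    if b"x-litellm-api-key" not in headers:
        return None
    auth = headers.get(b"authorization")
    return auth.decode("latin-1") if auth else None
```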

diff --git a/tests/test_litellm/proxy/test_mcp_asgi_response.py b/tests/test_litellm/proxy/test_mcp_asgi_response.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/proxy/test_mcp_asgi_response.py
@@ -1,0 +1,36 @@
+import asyncio
+
+import pytest
+from fastapi import HTTPException
+
+from litellm.proxy.proxy_server import _stream_mcp_asgi_response
+
+
+@pytest.mark.asyncio
+async def test_stream_mcp_asgi_response_propagates_pre_header_http_exception():
+    async def handle_fn(_scope, _receive, _send):
+        raise HTTPException(
+            status_code=401,
+            detail="Unauthorized",
+            headers={
+                "WWW-Authenticate": "Bearer authorization_uri=https://example.test/auth"
+            },
+        )
+
+    async def receive():
+        return {"type": "http.request", "body": b"", "more_body": False}
+
+    with pytest.raises(HTTPException) as exc_info:
+        await asyncio.wait_for(
+            _stream_mcp_asgi_response(
+                handle_fn,
+                {"type": "http", "method": "POST", "path": "/mcp", "headers": []},
+                receive,
+            ),
+            timeout=1.0,
+        )
+
+    assert exc_info.value.status_code == 401
+    assert exc_info.value.headers == {
+        "WWW-Authenticate": "Bearer authorization_uri=https://example.test/auth"
+    }
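
The `_ensure_eof` change in `proxy_server.py` forwards a handler exception to whoever is waiting on the response headers, instead of letting the waiter run into its timeout. The pattern can be sketched in isolation as follows. This is an illustration only: `run_handler`, `failing_handler`, and `ok_handler` are hypothetical names, and the real function also manages a body queue that is omitted here.

```python
import asyncio


async def run_handler(handler):
    loop = asyncio.get_running_loop()
    headers_ready = loop.create_future()
    task = asyncio.ensure_future(handler(headers_ready))

    def on_done(t: asyncio.Task) -> None:
        if t.cancelled():
            return
        exc = t.exception()
        # Forward a pre-header failure to the waiter instead of letting it time out.
        if exc is not None and not headers_ready.done():
            headers_ready.set_exception(exc)

    task.add_done_callback(on_done)
    return await asyncio.wait_for(headers_ready, timeout=1.0)


async def failing_handler(headers_ready):
    # Fails before any response headers are produced.
    raise PermissionError("upstream rejected token")


async def ok_handler(headers_ready):
    headers_ready.set_result({"status": 200})
```

With `failing_handler`, the caller sees `PermissionError` immediately rather than an `asyncio.TimeoutError` one second later.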

cursoragent and others added 2 commits May 13, 2026 17:38
Co-authored-by: Yassin Kortam <yassin@berri.ai>
AsyncHTTPHandler.post() calls raise_for_status() internally, so a real
upstream 401/403 lands as httpx.HTTPStatusError. Add a test that exercises
that specific exception path so a regression that swallows the error in
the broad fail-open except Exception would be caught.
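
The ordering concern in this commit message can be sketched without httpx. In the sketch below, `UpstreamStatusError` is a hypothetical stand-in for `httpx.HTTPStatusError`, and `probe` is an illustrative function, not the merged `_probe_upstream_auth`. The point is only that the specific handler must precede the broad fail-open handler.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


class UpstreamStatusError(Exception):
    """Hypothetical stand-in for httpx.HTTPStatusError."""

    def __init__(self, status_code: int, www_authenticate: Optional[str] = None):
        super().__init__(f"upstream returned {status_code}")
        self.status_code = status_code
        self.www_authenticate = www_authenticate


async def probe(post: Callable[[], Awaitable[int]]) -> Tuple[int, Optional[str]]:
    try:
        return await post(), None
    except UpstreamStatusError as exc:
        # Specific handler first: a real 401/403 must surface to the caller.
        return exc.status_code, exc.www_authenticate
    except Exception:
        # Fail open: a transient network error must not block valid requests.
        return 200, None


async def rejects() -> int:
    raise UpstreamStatusError(401, 'Bearer realm="test"')


async def unreachable() -> int:
    raise ConnectionError("connection refused")
```

If the two `except` clauses were swapped, the status error would be caught by `except Exception` and reported as `(200, None)`, which is the regression the new test guards against.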
@mateo-berri

Collaborator

@greptileai please re-review.

Latest fixes:

  • 560c22b — probe now uses public AsyncHTTPHandler.post() and catches httpx.HTTPStatusError explicitly so a real 401/403 is not swallowed by the broad fail-open block.
  • c07b62a + 28eda2a — probe tests now mock client.post directly (matching the production call) and new test_probe_upstream_auth_surfaces_httpx_status_error covers the HTTPStatusError path.

@mateo-berri mateo-berri enabled auto-merge (squash) May 13, 2026 19:03

@mateo-berri mateo-berri left a comment

Collaborator

LGTM; thanks!

@mateo-berri mateo-berri merged commit 466f06d into litellm_internal_staging May 13, 2026
117 checks passed
@mateo-berri mateo-berri deleted the litellm_mcp_passthrough_upstream_401 branch May 13, 2026 19:03
Sameerlite added a commit that referenced this pull request May 26, 2026
* fix(proxy): always merge caller-supplied tags into request metadata

Caller-supplied tags (`x-litellm-tags` header, body `tags`, `metadata.tags`)
were silently dropped unless the key/team had
`metadata.allow_client_tags: true` set. Restore the documented behavior:
tags from the request always flow into `metadata.tags` and union with any
admin-configured static tags from key/team/project metadata.

Removes the `allow_client_tags` opt-in flag from the pre-call pipeline.
The flag was only ever read here; it has no schema or endpoint footprint,
so leftover values in existing key metadata are inert.

Test cleanup mirrors the simplification: drop the three tests that
verified the strip-when-not-opted-in path, drop the `allow_client_tags`
fixture lines from the merge/union tests.

* docs(proxy): refresh stale comments referencing removed tag strip

The tag-strip block was removed in the parent commit but two surrounding
comments still referenced "tags without opt-in" and "runs AFTER the
strip". Update them to describe the remaining user_api_key_* and
_pipeline_managed_guardrails strip that the snapshot/merge ordering
actually protects against.

* fix(tests): swap dall-e to gpt-image-1 after openai deprecation

DALL-E 2 and DALL-E 3 were removed from the OpenAI API on 2026-05-12,
causing e2e image-generation tests to fail with "model does not exist".
Swap all live-API DALL-E references in proxy-backed tests to gpt-image-1
and update the dall-e-2 alias in proxy_server_config.yaml to point at
openai/gpt-image-1 (preserves any historical dall-e-2 callers).

* fix(tests): drop dall-e-only test classes; route live image tests via gpt-image-1

Second wave of failures from the 2026-05-12 DALL-E shutdown:
- tests/image_gen_tests/test_image_edits.py::TestOpenAIImageEditDallE2
  and tests/image_gen_tests/test_image_generation.py::TestOpenAIDalle3
  are explicitly named for the deprecated models and can't pass; remove.
  gpt-image-1 coverage already exists in sibling classes.
- tests/local_testing/test_router.py image gen tests use dall-e-3 only
  as a routing example; swap to gpt-image-1.
- tests/local_testing/test_custom_callback_input.py image_generation
  success/failure paths swapped to gpt-image-1.

* chore: reject bare str at file-input sinks to prevent local-file read (#27762)

* chore: reject bare str at file-input sinks to prevent local-file read (#27667)

Squash-merged by litellm-agent from stuxf's PR.

* fix: use os.PathLike in ocr sink and check truthy reasoningSummary for bridge

- ocr/main.py: widen Path check to os.PathLike for consistency with other sinks
- main.py: bridge condition checks truthiness of reasoning_summary, not just None

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: remove unused pathlib.Path import in ocr/main.py

---------

Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* feat(ui): add Vertex AI Search as vector store provider (#27790)

* feat(ui): add Vertex AI Search as vector store provider

Adds a "Vertex AI Search" entry to the provider dropdown
(custom_llm_provider=vertex_ai/search_api) with fields for project,
location (global/us/eu select), and optional collection ID. Extends
VectorStoreFieldConfig with `options` so select fields can be
data-driven instead of falling through to the embedding-model list.

* fix(ui): clarify vertex_collection_id placeholder copy

Placeholder previously displayed "default_collection" — the literal
fallback value — which invited users to type it instead of leaving the
field blank. Switch to an example placeholder and tighten the tooltip.

* Litellm key rotation bug (#27756)

* fix(proxy): resolve cache handling issues in _lookup_deprecated_key

- Updated the in-memory cache for deprecated key lookups to store a 3-tuple (active_token_id, cache_expires_at_ts, revoke_at_ts) instead of a 2-tuple, ensuring proper unpacking and backward compatibility.
- Removed duplicate cache reads and added logic to handle legacy cache entries gracefully.
- Enhanced unit tests to cover scenarios for cache hits, DB misses, and respect for revoke_at timestamps, ensuring robust handling of the grace-period key-rotation feature.

* refactor(proxy): streamline cache handling in _lookup_deprecated_key

- Simplified the cache retrieval logic by directly unpacking the 3-tuple cache entries, removing the need for backward compatibility checks for 2-tuple entries.
- Updated unit tests to ensure that pre-warmed 3-tuple cache entries are served correctly without unnecessary database lookups.

* chore(ci): add new unit test for deprecated key grace period

- Included `test_deprecated_key_grace_period.py` in the CI workflow to enhance coverage for deprecated key handling scenarios.

* fix(proxy): remove unnecessary check for revoke_at in _lookup_deprecated_key

- Eliminated the redundant check for None on revoke_at, streamlining the logic for handling deprecated keys in the cache. This change enhances the efficiency of the key lookup process.

* test(proxy): add end-to-end tests for deprecated key lookup behavior

- Introduced a new test class `TestDeprecatedKeyLookupDbE2E` to validate the behavior of deprecated key lookups against a real Prisma-backed database.
- The test ensures that old key hashes resolve correctly and that repeated lookups utilize the in-memory cache without errors.
- Cleaned up the `_lookup_deprecated_key` function by removing an unnecessary check for `revoke_at`, enhancing the efficiency of the key lookup process.

* chore(proxy): close /key/regenerate ownership-rebind + premium-gate bypass

A non-admin caller could rebind their own key's ``user_id`` via
``/key/regenerate``. ``_execute_virtual_key_regeneration`` had org/team
guards but no ``user_id`` guard, and ``prepare_key_update_data`` did not
strip the field — it survived ``model_dump(exclude_unset=True)`` into
the Prisma update. On the next request,
``_return_user_api_key_auth_obj`` resolved the rebound ``user_id``
against ``litellm_usertable`` and returned ``PROXY_ADMIN`` whenever
the target row's ``user_role`` was admin (e.g. the default
``user_id="default_user_id"`` created on first password-UI login).

``/key/update`` had the equivalent guard inline at
``_validate_update_key_data``; extract it to a shared helper
``_validate_caller_can_change_key_ownership`` and call from both
``/key/update`` and ``_execute_virtual_key_regeneration``. Future
regenerate-style endpoints inherit the guard for free.

Also tighten the premium gate that allowed the master-key rotation
branch to skip the enterprise check. The previous predicate was
``data.new_master_key is not None`` — a field-presence test, not an
identity check. Any non-premium caller could send any value in that
field and the premium check would no-op. Verify the caller actually
holds the master key via ``_is_master_key`` before allowing the
non-premium path.
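
The difference between a field-presence test and an identity check can be sketched as follows. The function names are hypothetical, and the constant-time comparison is an assumption of this sketch, not a statement about how `_is_master_key` is implemented.

```python
import hmac
from typing import Optional


def holds_master_key(presented_key: str, master_key: str) -> bool:
    # Assumption: compare in constant time; the real helper may differ.
    return hmac.compare_digest(presented_key.encode(), master_key.encode())


def may_skip_premium_check(
    new_master_key: Optional[str], presented_key: str, master_key: str
) -> bool:
    # Old predicate (bypassable): `new_master_key is not None`.
    # Tightened predicate: the caller must also present the master key itself.
    return new_master_key is not None and holds_master_key(presented_key, master_key)
```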

Tests:
- ``test_regenerate_user_id_rebind_guard`` — parametrized table over
  cross-user rebind (blocked), empty-string removal (blocked), and
  same-user no-op rebind (allowed).
- ``test_regenerate_premium_gate_requires_actual_master_key`` /
  ``test_regenerate_premium_gate_allows_actual_master_key_holder`` —
  ensure the premium check requires the caller actually present the
  master key, and that legitimate master-key rotation still works.

* test(vcr): classify cache verdicts, detect live calls, surface cost leaks

Convert the per-test VCR verdict line from a single 'NOOP / HIT / MISS /
PARTIAL' tag into a classified outcome that distinguishes the cases that
silently bill the live API on every CI run from the ones that don't:

  HIT                         pure replay
  PARTIAL                     mixed replay + new recordings
  MISS:RECORDED               new cassette saved to Redis (cached next run)
  MISS:OVERFLOW               cassette > MAX_EPISODES_PER_CASSETTE; persister
                              refused to save; re-bills every run
  MISS:NOT_PERSISTED          test failed; save_cassette skipped; re-bills
  NOOP                        VCR-marked but no HTTP traffic (mocked elsewhere)
  UNMARKED:LIVE_CALL          test bypassed VCR AND opened a TCP connection
                              to a known LLM provider host -> wasted spend
  UNMARKED:NO_TRAFFIC         test bypassed VCR but didn't call out

The UNMARKED:LIVE_CALL signal is what converts 'this test probably hits
live' into 'this test connected to api.openai.com'. We install a
socket.connect / socket.create_connection wrapper for the duration of
each non-VCR-marked test and record any outbound TCP to a known LLM
provider hostname. The probe sits below the httpx layer so vcrpy and
respx (which both patch above the socket) are unaffected.

Replace the file-level _RESPX_CONFLICTING_FILES blacklists in the
llm_translation and local_testing conftests with per-item respx
detection in apply_vcr_auto_marker_to_items. A test now skips VCR when
it actually carries @pytest.mark.respx or has respx_mock in its fixture
chain - not just because some other test in the same file imports
MockRouter. Items skipped by skip_files are split into respx_conflict
(real conflict, the module wires up respx) vs file_opt_out (dead skip-
list entry whose module never touches respx) so the session summary
makes pruning obvious.

Stabilize the AWS SigV4 fingerprint: the Authorization header on
Bedrock requests rotates its Credential date and Signature on every
call, which previously pushed every Bedrock test past the 50-episode
overflow threshold. Extract the access-key id only
('aws-sigv4:AKIA...') so two requests with the same identity match.
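
A minimal sketch of that fingerprinting idea, assuming the standard SigV4 `Authorization` header layout (`AWS4-HMAC-SHA256 Credential=<key-id>/<date>/<region>/<service>/aws4_request, ...`). The function name and regex are illustrative and are not taken from the test helpers.

```python
import re
from typing import Optional

# Capture only the access-key id that precedes the first "/" in Credential=...
_SIGV4_CREDENTIAL = re.compile(r"^AWS4-HMAC-SHA256\s+Credential=([^/,\s]+)/")


def sigv4_fingerprint(authorization: str) -> Optional[str]:
    """Return a stable identity fingerprint, or None for non-SigV4 headers."""
    match = _SIGV4_CREDENTIAL.match(authorization)
    return f"aws-sigv4:{match.group(1)}" if match else None
```

Two requests signed on different days with the same access key then share one fingerprint, since the date scope and signature are discarded.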

Always emit verdict logging when VCR is active (set
LITELLM_VCR_VERBOSE=0 to opt back into the legacy quiet mode). Add a
session-end classification summary that lists overflow tests, unmarked
live-call tests, and the skip-reason breakdown.

Wire the live-call probe + summary hook into every test directory that
already uses the Redis-backed VCR cache (audio_tests, guardrails_tests,
image_gen_tests, litellm_utils_tests, llm_responses_api_testing,
llm_translation, local_testing, logging_callback_tests, ocr_tests,
pass_through_unit_tests, router_unit_tests, search_tests,
unified_google_tests).

Add tests/llm_translation/test_vcr_classification.py covering the
verdict classifier, skip-reason tagging, AWS SigV4 fingerprint stability,
live-host classification, and session summary rendering.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): drop dead 'from respx import MockRouter' imports

These seven test files were on _RESPX_CONFLICTING_FILES, which made the
auto-marker skip them entirely. Inspecting the source shows the only
respx artifact is a top-level 'from respx import MockRouter' that no
test ever uses - no @pytest.mark.respx, no respx_mock fixture, no
respx.mock context manager. The import is dead code left over from a
previous mocking pattern.

Now that apply_vcr_auto_marker_to_items detects respx per-item via the
marker / fixture chain (b637d9f64a), the file-level skip is no longer
needed for these files - they were the reason the OpenAI tests
(test_o3_reasoning_effort, test_streaming_response[o1/o3-mini],
TestOpenAIO1::test_streaming, TestOpenAIChatCompletion::test_web_search,
TestOpenAIO3::test_web_search, etc.) ran live every CI build despite
the cassette cache being healthy.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(image_edits): regenerate fixtures per call instead of holding open module-level file handles

Module-level

    TEST_IMAGES = [
        open(os.path.join(pwd, 'ishaan_github.png'), 'rb'),
        open(os.path.join(pwd, 'litellm_site.png'), 'rb'),
    ]
    SINGLE_TEST_IMAGE = open(...)

opens the file once at import. After the first multipart upload, the
file pointer is at EOF, so every subsequent test in the same xdist
worker sends an empty multipart body. That non-determinism (a) blows
the recorded cassette past MAX_EPISODES_PER_CASSETTE (50) so
_RedisPersister.save_cassette refuses to save it, and (b) re-bills the
live image edit endpoint on every CI run.

Recent CI runs confirm the leak: tests/image_gen_tests/test_image_edits.py
shows six tests parking at 51-52 cassette entries
(TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False],
TestOpenAIImageEditDallE2::..., test_openai_image_edit_with_bytesio,
test_openai_image_edit_litellm_router, test_multiple_vs_single_image_edit[False],
test_multiple_image_edit_with_different_formats).

Replace the module-level file handles with _make_test_images() /
_make_single_test_image() factories that return fresh _RewindableImage
(BytesIO subclass) objects whose pointer always starts at 0. The image
bytes are read once at import into module-level constants
(_ISHAAN_GITHUB_BYTES, _LITELLM_SITE_BYTES), so disk I/O cost is
unchanged.
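
The failure mode and the factory fix can be reproduced with in-memory streams. This is a sketch: `IMAGE_BYTES` is placeholder data and `make_test_image` is a hypothetical name modelled on the factories described above.

```python
import io

IMAGE_BYTES = b"\x89PNG placeholder payload"  # stand-in for the real fixture bytes

# Shared handle: the first read consumes it, the second sees EOF.
shared = io.BytesIO(IMAGE_BYTES)
first_upload = shared.read()
second_upload = shared.read()  # empty, because the pointer is already at EOF


def make_test_image() -> io.BytesIO:
    # Fresh stream per call, so the pointer always starts at 0.
    return io.BytesIO(IMAGE_BYTES)
```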

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* chore(proxy): clarify ownership-rebind error message (actor vs target)

Previous wording read "User=<new_owner> is not allowed to update the
key to belong to user=<current_owner>" — easy to misread as "caller
wants to keep the key on its current owner". Reframe as
"Non-admin caller is not allowed to rebind the key from
user=<existing> to user=<incoming>" so the direction of the failed
operation is unambiguous.

Same shape preserved (HTTPException 403); only the ``detail`` string
changes. Regression test substring updated.

* fix(vcr): match real Bedrock hostnames in live-call probe

The suffix '.bedrock-runtime.amazonaws.com' never matched real Bedrock
endpoints, which use the format 'bedrock-runtime[-fips].{region}.amazonaws.com'
(region between 'bedrock-runtime' and 'amazonaws.com'). Add an explicit
host check for that pattern so Bedrock live calls are visible to the
probe, and update the unit test accordingly. Also drop the unused
'_LIVE_CALL_PROBE_INSTALLED' module variable.
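
A sketch of a host check for the `bedrock-runtime[-fips].{region}.amazonaws.com` pattern follows. The regex and function name are illustrative assumptions, not the code added by this commit.

```python
import re

# Region sits between "bedrock-runtime" (optionally "-fips") and "amazonaws.com".
_BEDROCK_RUNTIME_HOST = re.compile(
    r"^bedrock-runtime(-fips)?\.[a-z0-9-]+\.amazonaws\.com$"
)


def is_bedrock_runtime_host(host: str) -> bool:
    return bool(_BEDROCK_RUNTIME_HOST.match(host.lower()))
```

By contrast, `host.endswith(".bedrock-runtime.amazonaws.com")` is false for `bedrock-runtime.us-east-1.amazonaws.com`, which is why the old suffix never matched.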

* test(proxy): drop allow_client_tags opt-in gate and add credential rename cascade tests

Removes the allow_client_tags metadata check from apply_client_tag_policy_pre_auth so
x-litellm-tags headers are always merged into request metadata, matching the post-auth
behavior in add_litellm_data_to_request. Updates pre-call tests accordingly and adds a
new test suite covering cascading credential renames into model rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(proxy): block explicit-null user_id in ownership rebind guard

``model_dump(exclude_unset=True)`` in ``prepare_key_update_data``
includes any field the caller explicitly set, even when the value is
``None``. The previous guard short-circuited on ``getattr(data,
'user_id', None) is None``, which conflated "field omitted" (safe)
with "field explicitly set to null" (writes NULL to the token row,
detaching the key from its user and bypassing user-row role
checks).

Switch the omitted-vs-set distinction to ``data.model_fields_set``;
treat explicit-null and explicit-empty-string identically as a
removal attempt, both 403-rejected for non-admin callers.

Parametrized regression adds ``explicit_null_blocked`` alongside the
existing ``rebind_blocked`` / ``empty_blocked`` / ``same_user_id_allowed``
cases.
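
The omitted-versus-explicit-null distinction can be illustrated with a plain dict, where key membership plays the role that `model_fields_set` plays for the pydantic model in the commit. The function below is a hypothetical sketch, not the guard itself.

```python
from typing import Any, Dict


def is_ownership_change(payload: Dict[str, Any], current_user_id: str) -> bool:
    # Field omitted: nothing is written, so there is nothing to guard.
    if "user_id" not in payload:
        return False
    # Explicit None and "" are removal attempts; any other different value is a rebind.
    return payload["user_id"] != current_user_id
```

A check of the form `payload.get("user_id") is None` would return the same answer for the omitted and the explicit-null case, which is the conflation the commit removes.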

* fix(vcr): cover full RFC1918 172.16.0.0/12 range in local prefixes
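
For reference, the 172.16.0.0/12 block spans 172.16.0.0 through 172.31.255.255, so a prefix list that stops at `172.16.` misses most of it. The sketch below uses the standard `ipaddress` module rather than string prefixes; that is a substitute technique chosen for illustration, and `is_rfc1918` is a hypothetical name.

```python
import ipaddress

RFC1918_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(host: str) -> bool:
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Hostnames (not IP literals) are never treated as private here.
        return False
    return any(addr in net for net in RFC1918_NETWORKS)
```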

* fix(image_edits): drop _RewindableImage to prevent infinite multipart upload

The _RewindableImage(BytesIO) wrapper auto-rewound on every read after
EOF, which made the OpenAI SDK's multipart upload writer read the same
bytes forever instead of seeing EOF. Workers OOM'd / SIGKILL'd:

    [gw0] node down: Not properly terminated
    replacing crashed worker gw0
    ...
    worker 'gw1' crashed while running
        'tests/image_gen_tests/test_image_edits.py::TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False]'

The auto-rewind was added defensively for parametrized + flaky-retried
tests, but BaseLLMImageEditTest::test_openai_image_edit_litellm_sdk
already calls get_base_image_edit_call_args() once per invocation and
that helper now constructs fresh streams via _make_test_images(), so
rewinding inside the stream is unnecessary. Replace with plain BytesIO
seeded with the cached image bytes.
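
Why an auto-rewinding stream never terminates can be shown with a bounded read loop. `AutoRewind` is a hypothetical reconstruction of the behavior described above, not the removed `_RewindableImage` class, and `count_chunks` stands in for a multipart writer that reads until it sees an empty chunk.

```python
import io


class AutoRewind(io.BytesIO):
    """Hypothetical reconstruction: rewinds instead of reporting EOF."""

    def read(self, size=-1):
        data = super().read(size)
        if not data:
            self.seek(0)
            data = super().read(size)
        return data


def count_chunks(stream, chunk_size: int, limit: int) -> int:
    # Reads until EOF (an empty chunk) or until `limit` chunks have been read.
    chunks = 0
    while chunks < limit:
        if not stream.read(chunk_size):
            break
        chunks += 1
    return chunks
```

A plain `BytesIO` ends the loop after its data is exhausted; the auto-rewinding stream only stops because of the artificial `limit`.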

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* chore(proxy): refuse remote-URL instance-fn loads outside config-file path

``get_instance_fn`` previously routed any ``s3://`` / ``gcs://``
value into ``_load_instance_from_remote_storage`` regardless of how
the value got there. The function ultimately calls
``spec.loader.exec_module(module)`` — Python in the proxy process. On
admin-callable endpoints that accept a ``target`` / ``custom_handler``
field from the request body (e.g. ``/config/pass_through_endpoint``,
custom-callback registration), that is a one-step admin-to-RCE
primitive: any future privilege-escalation bug becomes immediate
code execution.

The documented operator flow for remote-module loading is
``litellm_settings.callbacks: ["s3://bucket/module.instance"]`` in
``config.yaml``. That path always carries the YAML's
``config_file_path`` through to ``get_instance_fn``. Use the presence
of ``config_file_path`` as the discriminator: refuse remote URLs
when it is absent (the request-body path) unless the operator
explicitly opts back in via
``LITELLM_ALLOW_REMOTE_INSTANCE_FN_FROM_API=true``.

The three success/failure/audit-log callback-loop call sites in
``proxy_server.py:load_config`` were already running inside the
startup config-file load but had stopped threading
``config_file_path`` through. Pass it through so the documented
``s3://`` callback flow continues to work unchanged.

Tests cover: remote URL without ``config_file_path`` raises;
remote URL with the opt-in env reaches the loader; remote URL
with ``config_file_path`` passes (documented startup flow); local
dotted-name imports unaffected.

* fix(proxy): parse string metadata before pre-auth tag merge

`apply_client_tag_policy_pre_auth` overwrote string-typed metadata
with `{}` before merging header tags, dropping any tags inside. A
caller could send `metadata='{"tags":["over-budget"]}'` plus
`x-litellm-tags: within-budget` and bypass `_tag_max_budget_check`
on the body tag. Parse the string via `safe_json_loads` first so
existing tags survive the merge.

Also drop the empty `tests/test_litellm/proxy/credential_endpoints/`
directory — the cascade-rename tests it held imported a function
that was never implemented (out of scope for this PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(tests): thread config_file_path through s3/gcs custom-logger tests

The pre-existing s3:// / gcs:// custom-logger tests called
``get_instance_fn`` without ``config_file_path``, which means the
new runtime gate (refuse remote URLs unless invoked from a
config-file load) now raises ``ValueError`` before reaching the
mocked download paths. Each test was exercising the documented
startup config-file load scenario; pass ``config_file_path="/any/path"``
to make that intent explicit and route past the gate.

Affected: test_s3_download_success, test_gcs_download_success,
test_invalid_url_format, test_download_failure_handling,
test_file_cleanup.

* test(vcr): mark Bedrock prompt-caching cross-call tests VCR-incompatible

The pass_through prompt-caching tests
(test_prompt_caching_returns_cache_read_tokens_on_second_call,
test_prompt_caching_streaming_second_call_returns_cache_read) make a
warm-up call and then assert the *second* call sees a non-zero
cache_read_input_tokens count from the upstream's prompt-cache. VCR
replay can't model cross-call provider state — both calls match the
same cassette episode, so the second call returns the first call's
pre-warmup response and the assertion fails:

    AssertionError: Expected cache_read_input_tokens > 0 on second call,
    but got 0. Full usage: {'input_tokens': 4986,
    'cache_creation_input_tokens': 4974, 'cache_read_input_tokens': 0}

This started biting after the AWS SigV4 fingerprint stabilization
(b637d9f64a): Bedrock requests now produce a stable per-access-key
fingerprint instead of a per-request signature, so cassettes
successfully replay where they previously always missed and re-recorded
live. Opt these tests out via skip_nodeid_suffixes so they run live and
match the existing pattern in tests/llm_translation/conftest.py
(::test_prompt_caching).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* Fix 3 OpenTelemetry tracing bugs in proxy integration (#27757)

1. Missing litellm_request child span when proxy parent in metadata:
   _get_span_context now returns (ctx, None) for the metadata-injected
   proxy parent so the primary span is always emitted as a child of ctx.
   Proxy span lifecycle managed by new _end_proxy_span_from_kwargs.

2. open_telemetry_logger overwrite by later handlers:
   _init_otel_logger_on_litellm_proxy now uses first-registered-wins —
   only assigns proxy_server.open_telemetry_logger when currently None.

3. Duplicate litellm_request success spans in streaming paths:
   Added _mark_success_span_once with per-handler dedupe key stored in
   kwargs metadata, suppressing the second span when both sync and async
   success callbacks fire for the same request.

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: update Next.js build artifacts (2026-05-13 01:42 UTC, node v20.20.2)

* test(vcr): tighten OVERFLOW classification and switch respx detection to AST

Address two greptile P2 review concerns on PR #27795:

1. MISS:OVERFLOW was firing whenever total > MAX_EPISODES_PER_CASSETTE
   regardless of cassette state. A cassette that grew past the cap
   historically but this run only *replayed* (dirty=False) is
   healthy — the persister never tries to save, so the cache state is
   stable and the next run will replay too. Only flag OVERFLOW when
   dirty=True (new episodes were recorded that the persister would
   refuse to save). Add a regression test covering the
   dirty=False + large-total case.

2. _module_uses_respx did substring matching on the module source,
   which false-positives on comments / docstrings / string literals.
   A comment like # Previously tried respx.mock but switched to
   vcrpy would keep a file pinned on the opt-out list, defeating the
   dead-import pruning goal of this PR. Replace the substring scan
   with an ast.NodeVisitor (_RespxUsageVisitor) that only
   counts:

     - @pytest.mark.respx / @respx.mock decorators
     - with respx.mock(): ... (sync + async) context managers
     - respx.mock(...) calls outside a with/decorator
     - function parameters / fixture names equal to respx_mock

   Add tests for the comment / docstring / string-literal cases plus
   each real-usage pattern.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(types_utils): drop opt-in env from remote-module runtime gate

The runtime gate on s3:// / gcs:// loading in get_instance_fn previously
allowed an opt-in via LITELLM_ALLOW_REMOTE_INSTANCE_FN_FROM_API. That
env var is admin-flippable at runtime (DB-overlay environment_variables
flow into os.environ), which defeats the gate's purpose, and it isn't
needed for the documented operator flow: config.yaml callbacks always
pass config_file_path through to the loader.

Remove the helper, raise unconditionally when config_file_path is None,
and drop the corresponding test for the opt-in branch.
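
The resulting gate can be sketched roughly as follows. `check_remote_instance_fn_allowed` is a hypothetical helper name and the error message is invented for illustration; only the s3:// / gcs:// prefixes and the `config_file_path is None` condition come from the description above.

```python
from typing import Optional

REMOTE_PREFIXES = ("s3://", "gcs://")


def check_remote_instance_fn_allowed(
    value: str, config_file_path: Optional[str] = None
) -> None:
    """Hypothetical gate: refuse remote URLs outside a config-file load."""
    if value.startswith(REMOTE_PREFIXES) and config_file_path is None:
        # No opt-in escape hatch: the only supported source is config.yaml.
        raise ValueError(
            "remote module URLs are only loaded from a config file"
        )
```

Local dotted-name imports never match the prefixes, so they pass through unchanged.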

* fix(proxy): thread config_file_path into pass-through and MCP-tool YAML loaders

The previous commit's gate broke two legitimate startup paths for
operators using s3:// / gcs:// remote module loading from their config.yaml:

- general_settings.pass_through_endpoints[].custom_handler
- mcp_tools[].handler

Both call sites called get_instance_fn without a config_file_path, so
the new gate rejected them at startup. Thread config_file_path through:

- create_pass_through_route accepts config_file_path and forwards it to
  get_instance_fn. add_exact_path_route, add_subpath_route,
  _register_pass_through_endpoint, and initialize_pass_through_endpoints
  accept and propagate it.
- The YAML-load call site in proxy_server.load_config now passes
  config_file_path; the DB-overlay call site in _update_general_settings
  leaves it as the default None so the gate still fires on admin-written
  s3:// values.
- MCPToolRegistry.load_tools_from_config accepts config_file_path and
  threads it into get_instance_fn; _init_non_llm_configs forwards it
  from load_config.

Adds two regression tests verifying that the YAML-source callers thread
the path through to get_instance_fn.

* Strip SERVER_ROOT_PATH before lazy-feature prefix match

LazyFeatureMiddleware compared the raw scope path against registered
prefixes (e.g. /policies), so requests under a server root path like
/api/v1/policies/... never matched, the feature never loaded, and the
endpoint returned 404. Strip the configured root path before matching,
normalizing trailing slashes and enforcing a component boundary so
/api does not falsely match /apiv2.
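
A small sketch of the stripping rule, assuming a standalone function. `strip_root_path` is a hypothetical name; the middleware's actual method is not shown in this message.

```python
def strip_root_path(path: str, root_path: str) -> str:
    """Hypothetical helper: remove a configured root path before prefix matching."""
    root = root_path.rstrip("/")  # normalize trailing slashes
    if not root:
        return path
    if path == root:
        return "/"
    # Require a component boundary so "/api" never matches "/apiv2".
    if path.startswith(root + "/"):
        return path[len(root):]
    return path
```

For example, with a root path of `/api/v1/`, a request to `/api/v1/policies/x` is matched against registered prefixes as `/policies/x`.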

* Cache normalized SERVER_ROOT_PATH at middleware init

SERVER_ROOT_PATH is a process-startup env var. Read it once in
__init__ instead of calling get_server_root_path() + rstrip on every
request that arrives before all lazy features have loaded.

* test: replace dall-e-3 with gpt-image-1 in health check and router tests (#27813)

OpenAI returns 'The model dall-e-3 does not exist' for the test account,
breaking test_openai_img_gen_health_check and test_image_generation.
Switch to gpt-image-1, matching the existing TestOpenAIGPTImage1 pattern.

* fix(gemini): normalize response_schema on native generateContent (#27775)

* fix(gemini): normalize response_schema on native generateContent

The /v1beta/models/{model}:generateContent passthrough forwarded
generationConfig.response_schema verbatim, so schemas containing $defs,
$ref, anyOf-with-null, default, or title were rejected by Gemini even
though /chat/completions already handles them.

GoogleGenAIConfig.transform_generate_content_request now calls a new
_normalize_response_schema helper that mirrors the chat/completions
path: Gemini 2.0+ models get the schema promoted to responseJsonSchema
via _build_json_schema (preserving $defs/$ref natively), older models
keep responseSchema but the schema is flattened with
_build_vertex_schema. VertexAIGoogleGenAIConfig (which overrides the
transform entirely) calls the same helper before building the request.

* fix(gemini): preserve caller-supplied responseJsonSchema when responseSchema co-present

Previously, when both responseJsonSchema and responseSchema were present
on Gemini 2.0+, _normalize_response_schema processed responseJsonSchema
first (no-op normalization) then unconditionally promoted responseSchema
to responseJsonSchema, clobbering the caller-supplied value.

Now skip the promotion (and drop the redundant responseSchema) when the
caller already supplied responseJsonSchema.
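
As an illustration only, the Gemini 2.0+ branch might look like the sketch below. `promote_response_schema` is a hypothetical function operating on a plain dict; it is not the `_normalize_response_schema` implementation and covers only the camelCase keys.

```python
from typing import Any, Dict


def promote_response_schema(generation_config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical sketch of the promotion rule for Gemini 2.0+ models."""
    cfg = dict(generation_config)  # shallow copy, caller's dict is untouched
    if "responseSchema" not in cfg:
        return cfg
    schema = cfg.pop("responseSchema")  # always drop the older key
    if "responseJsonSchema" not in cfg:
        # Promote only when the caller did not supply responseJsonSchema.
        cfg["responseJsonSchema"] = schema
    return cfg
```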

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* chore: strip restating comments from response-schema normalize

Drop the docstring on _normalize_response_schema and the two inline
comments that just restated what the surrounding code/asserts already
say. Function name + variable names carry the intent; PR description
covers the why-it-exists context.

* perf(gemini): drop redundant deepcopy on responseJsonSchema normalize

_build_json_schema is a no-op (returns its argument unchanged), so the
deepcopy + round-trip on the responseJsonSchema branch allocated a full
schema copy on every request with no observable effect. Forward the
caller's value as-is, and just move the popped responseSchema value when
promoting on Gemini 2.0+.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* style: remove unneeded comment

* fix(gemini): drop unsupported responseJsonSchema for older models

* test(gemini): add parity test between native and chat schema normalization

Per @Sameerlite review: lock the two Gemini schema-normalization paths
together. If either GoogleGenAIConfig._normalize_response_schema (native
generateContent) or VertexGeminiConfig.apply_response_schema_transformation
(/chat/completions) drifts, the parity test fails — forcing both to be
updated together.

* fix(google_genai): preserve key naming convention in _normalize_response_schema

When the input schema key is snake_case (response_schema), the promoted
JSON schema key should also be snake_case (response_json_schema) instead
of mixing in camelCase (responseJsonSchema). This matters for the Vertex
AI google_genai path which converts all keys to snake_case before
calling _normalize_response_schema.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>

* fix(vcr): aggregate worker stats on the controller so the session summary actually renders under xdist

`_session_stats` is a module-level dict mutated inside `_vcr_outcome_gate`
— which runs in each xdist worker process. The controller's
`pytest_terminal_summary` then reads its own empty `_session_stats` and
bails on `if not counts: return`, so the OVERFLOW / LIVE_CALL sections
the rest of this PR adds never make it into CI logs in the dist mode CI
actually uses.

Ship a structured `vcr_outcome` payload via `user_properties` (which
xdist round-trips) and add `aggregate_report_outcome` on the controller
to fold worker outcomes into `_session_stats`. The recording process
tags `vcr_recorded_by` with `PYTEST_XDIST_WORKER` so the controller can
tell "single-process — already counted locally" apart from "produced by
a worker — needs aggregation here", and not double-count when there's
no xdist.

Covered by 9 new unit tests in test_vcr_classification.py including the
end-to-end summary render path.

* fix(responses): register cooldowns on failure + fail fast on stale encrypted_content (#27820)

* feat(proxy): skip disable_background_health_check models on GET /health when flag set (#27716)

* feat(proxy): skip disable_background_health_check models on GET /health when flag set

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix comment

* fix greptile comments

* Fix health check fallback kwargs

* Format health endpoint

* Harden direct health check kwargs compatibility for monkeypatched perform_health_check

Replace substring-based TypeError detection with unexpected-keyword checks
and a short retry chain (full kwargs, instrumentation only, filter only,
minimal) so partial stubs work regardless of which optional kwarg fails first.
Add proxy unit tests for legacy three-arg stubs and single-kwarg variants.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix black

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(bedrock-converse): drop blank-text fallback for empty thinking blocks (#27850)

* fix(bedrock-converse): drop blank-text fallback for empty thinking blocks

Claude Code with extended thinking replays prior assistant turns that
include an empty thinking block (`thinking=""`, `signature=""`) alongside
tool_use blocks. The unsigned-reasoning fallback in
`add_thinking_blocks_to_assistant_content` was emitting
`BedrockContentBlock(text="")`, which Bedrock Converse rejects with:

  "The text field in the ContentBlock object at messages.X.content.0
   is blank."

Guard the fallback with a strip() check, matching the existing
empty-text guards elsewhere in `_bedrock_converse_messages_pt`.
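
The guard amounts to something like the following sketch. The function name and the plain-dict block shape are assumptions made for illustration; the real code builds `BedrockContentBlock` values.

```python
from typing import Dict, List, Optional


def unsigned_reasoning_fallback(thinking_text: Optional[str]) -> List[Dict[str, str]]:
    """Hypothetical sketch: emit a text block only for non-blank thinking text."""
    if thinking_text and thinking_text.strip():
        return [{"text": thinking_text}]
    # Empty or whitespace-only thinking yields no block at all.
    return []
```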

* style: remove unneeded comments

* fix(proxy): thread config_file_path through LiteLLM_JWTAuth.custom_validate

LiteLLM_JWTAuth.__init__ calls get_instance_fn(custom_validate) without
config_file_path, so an operator who configures custom_validate:
s3://bucket/module.fn in their YAML JWT auth section would hit the
runtime gate on startup and break their deployment.

Accept config_file_path as a non-field kwarg (popped before the
invalid-keys check), thread it into get_instance_fn, and pass it from
the startup-load callsite via the existing user_config_file_path
module-level path. Admin-API JWT config writes leave the kwarg at None
and still hit the gate.

* fix(mcp): surface upstream 401 for token-forwarding MCP servers (#27847)

* fix(mcp): surface upstream 401 for token-forwarding MCP servers

For MCP servers configured with extra_headers: [Authorization], the gateway
forwards the client token directly to the upstream. When that token is rejected
(expired or invalid) the upstream returns 401, but the MCP SDK starts the SSE
stream with 200 OK before calling handlers, so the 401 can't be returned
mid-stream.

Fix: add a pre-flight httpx probe in handle_streamable_http_mcp — before the
SDK opens the session — so the gateway can still return HTTP 401 with
WWW-Authenticate: Bearer authorization_uri=<gateway-discovery-url> when the
upstream rejects the token. The probe fails open (returns 200) on network
errors so a transient hiccup does not block valid requests.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): parallelize pre-flight auth probes and use HEAD to avoid side effects

- Extract forwarded_auth outside the pass-through server loop (was called N times for the same scope value)
- Gather all upstream auth probes concurrently with asyncio.gather instead of sequentially; eliminates N×5 s worst-case latency
- Switch probe from POST+initialize JSON-RPC body to HEAD request; HEAD carries the Authorization header so the upstream rejects invalid tokens with 401 but never allocates a session or writes an audit entry
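
A minimal sketch of concurrent, fail-open probing, assuming an injected `probe` coroutine that returns an HTTP status code. `probe_all` and its parameters are hypothetical names; the actual helper and HTTP client calls are not reproduced here.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def probe_all(
    urls: List[str],
    probe: Callable[[str], Awaitable[int]],
    timeout_s: float = 5.0,
) -> Dict[str, int]:
    """Hypothetical sketch: probe every upstream at once, failing open on errors."""

    async def guarded(url: str) -> int:
        try:
            return await asyncio.wait_for(probe(url), timeout=timeout_s)
        except Exception:
            # Network error or timeout: treat as 200 so valid requests proceed.
            return 200

    statuses = await asyncio.gather(*(guarded(url) for url in urls))
    return dict(zip(urls, statuses))
```

Because the probes run together, the worst case is bounded by a single timeout rather than one timeout per server.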

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): use get_async_httpx_client in _probe_upstream_auth

Replaces bare httpx.AsyncClient with the project-standard
get_async_httpx_client(httpxSpecialProvider.MCP) to satisfy the
ensure_async_clients_test code coverage check and avoid the +500 ms
per-request overhead of creating a new client on every probe call.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(mcp): extract pre-flight probe into _check_passthrough_upstream_auth

Moves the parallel upstream auth probe logic out of
handle_streamable_http_mcp into a dedicated helper to satisfy
Ruff PLR0915 (Too many statements > 50).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): gate pre-flight probes on authorized server set to prevent bypass

_check_passthrough_upstream_auth was resolving user-supplied server names
directly before authorization ran, letting any permitted LiteLLM key
trigger an upstream HEAD probe to a server it was not allowed to use.

Changes:
- Call _get_allowed_mcp_servers inside the helper so only servers the
  caller's key is authorized for are probed.
- Move the call site to after toolset scoping so the auth context is
  fully resolved before the probe list is built.
- Thread user_api_key_auth into the helper signature (replaces the raw
  mcp_servers name list).

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add async HTTP HEAD support

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): use Scope type annotation in _get_forwarded_auth_from_scope

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix MCP upstream auth probe method

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* Remove unused AsyncHTTPHandler head method

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): exclude has_client_credentials servers from pre-flight auth probe

_prepare_mcp_server_headers skips caller Authorization when the server
uses OAuth client-credentials (M2M), but the pre-flight probe was still
selecting those servers and forwarding the caller's raw token in the HEAD
request. Exclude servers with has_client_credentials from the probe list
to match the actual downstream header-preparation logic.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): propagate upstream 403 as 403, not 401 with WWW-Authenticate

Per RFC 9110, 401 means "go get new credentials." Mapping an upstream 403
to a gateway 401 causes OAuth clients to restart the authorization flow,
obtain a fresh token with identical scopes, hit 403 again, and loop
indefinitely.

401 from upstream → gateway 401 + WWW-Authenticate (re-authorize)
403 from upstream → gateway 403 (no WWW-Authenticate hint)
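
The mapping can be sketched as a pure function. `map_upstream_auth_status` is a hypothetical name that returns a status and headers pair instead of raising an HTTPException, so that the example stays self-contained.

```python
from typing import Dict, Optional, Tuple


def map_upstream_auth_status(
    upstream_status: int, discovery_url: str
) -> Optional[Tuple[int, Dict[str, str]]]:
    """Hypothetical sketch: translate an upstream probe status for the client."""
    if upstream_status == 401:
        # Invite the client to re-authorize against the gateway discovery URL.
        return 401, {"WWW-Authenticate": f"Bearer authorization_uri={discovery_url}"}
    if upstream_status == 403:
        # Forbidden: new credentials would not help, so send no hint.
        return 403, {}
    return None  # any other status lets the request continue
```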

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): skip auth probe when Authorization may be the LiteLLM proxy key

The pre-flight upstream probe must not forward the caller's Authorization
header when it could itself be the LiteLLM proxy API key. Restrict the
probe to requests that supply x-litellm-api-key explicitly — only then is
the Authorization header unambiguously the upstream OAuth token the
caller wants forwarded.

* Fix MCP ASGI HTTPException propagation

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): use public AsyncHTTPHandler.post() in auth probe

Use AsyncHTTPHandler.post() and catch httpx.HTTPStatusError explicitly so
the 401/403 we want to surface is not silently swallowed by the broad
fail-open except Exception block. Avoids reaching into the handler's
private client attribute, which would silently regress to fail-open if
AsyncHTTPHandler is ever refactored.

* Fix MCP auth probe tests

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(mcp): add coverage for httpx.HTTPStatusError path in auth probe

AsyncHTTPHandler.post() calls raise_for_status() internally, so a real
upstream 401/403 lands as httpx.HTTPStatusError. Add a test that exercises
that specific exception path so a regression that swallows the error in
the broad fail-open except Exception would be caught.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: claude-bot <claude-bot@anthropic.com>

* fix(cost): align vertex_ai/gemini-embedding-2-preview with Vertex multimodal pricing (#27848)

* fix(cost): align vertex_ai/gemini-embedding-2-preview with Vertex multimodal pricing

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(cost): align vertex_ai/gemini-embedding-2 GA source URL with preview

Per Greptile review on #27848: GA entry referenced ai.google.dev while
the preview entry was updated to the canonical Vertex AI pricing page.
Both share identical pricing values; sync the source URL for consistency.

https://claude.ai/code/session_01W8jRwstnmduadGw8Z8egxe

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude <noreply@anthropic.com>

* feat(mcp): add delegate_auth_to_upstream flag for PKCE passthrough (#27834)

* feat(mcp): add delegate_auth_to_upstream flag for PKCE passthrough

Adds an opt-in per-server flag that lets clients (e.g. VS Code) complete
PKCE directly with an upstream OAuth2 MCP server, instead of LiteLLM
double-gating with its own API-key/SSO check. Only honored when
auth_type=oauth2 and the operator explicitly sets the flag; mixed-target
or non-oauth2 requests fail closed.

- Adds the field to Pydantic models, Prisma schema, and a migration
- New MCPRequestHandler._target_servers_delegate_auth_to_upstream gate
  that runs only when no x-litellm-api-key is present, so authenticated
  users still get user_id resolution + stored-credential lookup
- Anonymous callers now see delegate servers in get_allowed_mcp_servers
  (scoped to delegate servers only; the upstream still enforces auth)
- mcp_management_endpoints: allow anonymous /authorize and /token for
  delegate servers so VS Code can complete PKCE without a LiteLLM session
- UI toggle (shown only for oauth2) + payload/view wiring
- Tests covering: oauth2 on/off, non-oauth2 with flag, mixed targets,
  no resolvable target, explicit key precedence, and 401 emission

Co-authored-by: Cursor <cursoragent@cursor.com>

* Enforce oauth2 for delegated MCP auth bypass

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): close secondary Authorization bypass for delegate servers

The delegate-auth bypass gated only on the primary `x-litellm-api-key`
header, so a LiteLLM key sent via `Authorization: Bearer sk-...` (the
secondary header) was silently dropped — skipping spend tracking and
rate limiting. Gate on the resolved litellm_api_key (which considers
both headers) so the bypass fires only when neither is present.

Also update the existing "Authorization header present" test to reflect
that an upstream OAuth token now flows through the existing oauth2
fallback (LiteLLM auth attempt → fail → anonymous), not via the
delegate branch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Avoid duplicate MCP OAuth credential lookup

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): block delegate bypass for M2M and internal-only servers

Two security issues flagged in code review:

1. High – client_credentials (M2M) servers must not be delegatable:
   LiteLLM auto-fetches the upstream token using stored credentials, so
   allowing anonymous bypass would let any external caller invoke tools
   authenticated as LiteLLM's service account.
   Fix: check `server.has_client_credentials` in
   `_target_servers_delegate_auth_to_upstream`, the anonymous
   allow-list in `get_allowed_mcp_servers`, and `_mcp_oauth_user_api_key_auth`.

2. Medium – internal-only servers exposed to public internet:
   The anonymous delegate allow-list was not filtering by
   `available_on_public_internet`, so external callers with an upstream
   OAuth token could invoke tools on servers marked internal-only.
   Fix: add `available_on_public_internet` guard to the anonymous
   delegate server list in `get_allowed_mcp_servers`.

Tests added for both cases.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Require public MCP delegate auth servers

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): align delegate auth path parsing with downstream routing

`_extract_target_server_names_from_path` used a naive segments-based
split while `server.py::_get_mcp_servers_in_path` uses a regex that
allows server names with one embedded slash and comma-separated lists.
With the old parser, a request to `/mcp/<delegated>/<garbage>` was
parsed as targeting `<delegated>` by the auth gate (bypassing LiteLLM
auth) while the routing layer parsed it as `<delegated>/<garbage>` —
when that name did not resolve, the request fell back to the anonymous
allow-list, which can include `allow_all_keys` servers that normally
require a LiteLLM key.

Replace the parser with the same regex logic as
`_get_mcp_servers_in_path` so auth gating sees the exact target name(s)
downstream routing sees. Add regression tests covering parser parity
and the specific extra-path-segment bypass attempt.

https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9

* fix(mcp): close header/path TOCTOU in MCP delegate auth gate

`_target_servers_delegate_auth_to_upstream` and
`_target_servers_use_oauth2` trusted the `x-mcp-servers` header when
present, but `server.py::extract_mcp_auth_context` overrides that
header with the path-derived list for `/mcp/...` routes. An attacker
could set `x-mcp-servers: <delegated>` while pointing the URL path at
a non-delegate server, flipping the auth gate without changing the
target downstream routing actually uses.

Extract a shared `_resolve_target_server_names` helper that mirrors
the downstream override (path-derived names for `/mcp/...` routes,
header value otherwise). Add regression tests covering the TOCTOU
attempt and the helper's path-vs-header precedence.

https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9

* Fix delegated MCP OAuth test mock

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): drop unreachable /{server}/mcp branch in auth path parser

`_extract_target_server_names_from_path` also matched the
``/{server_name}/mcp`` form, but the downstream parser
``_get_mcp_servers_in_path`` only handles ``/mcp/...`` — and
``dynamic_mcp_route`` in ``proxy_server`` rewrites ``/{name}/mcp``
to ``/mcp/{name}`` on the scope before the MCP handler runs. Parsing
the un-rewritten form on the auth side was therefore unreachable in
production, and contradicted the docstring's claim of mirroring the
downstream parser — exactly the kind of mismatch that risks a future
header/path TOCTOU if any new entry point skips the rewrite.

Drop the branch; the canonical ``/mcp/...`` path matches both
parsers. Update the regression test to assert the new behavior.

https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9

* Fix MCP path auth target resolution

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp): require auth for refresh_token grants on delegate-auth servers

`_mcp_oauth_user_api_key_auth` gates the unauthenticated PKCE flow for
``delegate_auth_to_upstream`` servers, but the bypass applied to BOTH
``/authorize`` and ``/token`` regardless of grant type. ``mcp_token``
accepts ``grant_type=refresh_token`` as well as ``authorization_code``,
and ``exchange_token_with_server`` attaches the server's stored
``client_secret`` to whatever is forwarded upstream. An unauthenticated
caller holding a refresh token issued to that OAuth client could mint
fresh upstream access tokens through LiteLLM.

Limit the anonymous bypass on ``/token`` to ``grant_type=authorization_code``
(the only grant PKCE actually protects via ``code_verifier``); fall
through to normal LiteLLM auth for ``refresh_token`` and any other grant.
``/authorize`` continues to allow anonymous PKCE redirects.
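
Reduced to a predicate, the rule might read as below. `anonymous_bypass_allowed` and its arguments are hypothetical; the real check lives inside `_mcp_oauth_user_api_key_auth` and depends on state not shown here.

```python
from typing import Optional


def anonymous_bypass_allowed(
    endpoint: str, grant_type: Optional[str], server_delegates_auth: bool
) -> bool:
    """Hypothetical sketch of when an unauthenticated caller may proceed."""
    if not server_delegates_auth:
        return False
    if endpoint == "authorize":
        return True  # anonymous PKCE redirects stay allowed
    if endpoint == "token":
        # Only the grant that PKCE protects via code_verifier is exempt.
        return grant_type == "authorization_code"
    return False
```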

https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9

* fix(ui): clear delegate_auth_to_upstream when switching off oauth2

The ``delegate_auth_to_upstream`` form field is rendered inside an
``isOAuth2 && (...)`` conditional, so the Form.Item unmounts when the
user changes ``auth_type`` away from ``oauth2``. The follow-up
``form.setFieldValue("delegate_auth_to_upstream", false)`` runs after
the field has already deregistered, so ``onFinish`` receives
``undefined`` and the fallback ``?? mcpServer.delegate_auth_to_upstream``
preserved the old ``true``. The flag then persisted in the database for
a non-oauth2 server and silently re-activated if ``auth_type`` was later
switched back to ``oauth2``.

In the edit payload, force the flag to ``false`` whenever
``auth_type !== oauth2``; only trust the form value (and the existing
DB fallback) when the server is actually oauth2. Backend defense-in-depth
already ignores the flag for non-oauth2 servers, but the DB state should
stay clean too.

https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9

* Fix MCP delegate auth reset on edit

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Claude <claude@anthropic.com>

* fix(responses): preserve cache_control in Responses API -> Chat Completion transformation (#27727)

* fix(responses): preserve cache_control in Responses API -> Chat Completion transformation

cache_control injected by AnthropicCacheControlHook was silently dropped when
_transform_responses_api_content_to_chat_completion_content rebuilt content blocks
with only {type, text}. Now copies cache_control through so Anthropic prompt caching
works correctly when using client.responses.create with cache_control_injection_points.
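
A simplified sketch of the fix for a single text block. `to_chat_text_block` is a hypothetical helper, and the `input_text` source type is an assumption about the incoming block shape.

```python
from typing import Any, Dict


def to_chat_text_block(block: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical sketch: rebuild a text block while keeping cache_control."""
    out: Dict[str, Any] = {"type": "text", "text": block.get("text", "")}
    if "cache_control" in block:
        # Previously dropped here, which disabled prompt caching downstream.
        out["cache_control"] = block["cache_control"]
    return out
```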

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(responses): preserve cache_control for input_image and input_file blocks

Extends the cache_control fix to image and file content blocks, which were
also silently dropping cache_control during the Responses API -> Chat Completion
transformation. Adds tests for all three content block types.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Babysitter <claude@anthropic.com>

* fix(proxy): expose db status on public /health/readiness

External readiness probes consumed the legacy detailed payload's `db`
field to drive alerting and pod-rotation decisions. Stripping the body
to `{"status": "healthy"}` broke those probes silently — the HTTP code
still flipped to 503, but probes checking `body.db == "connected"`
treated the response as healthy.

Add `db` back to the unauthenticated payload. Keep the rest of the
diagnostic fields (litellm_version, callbacks, cache, log_level) gated
behind /health/readiness/details so the recon-leak gate from #26912
holds. Values match the legacy contract: "connected", "disconnected",
"Not connected".

* docs(budget_manager): add docstring to BudgetManager.reset_cost (#27867)

Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>

* docs: add class docstring to _LoopWrapper (#27870)

Document the purpose of the daemon thread that backs the sync
branch of the timeout decorator.

Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>

* fix: Fix Redis Sentinel client handling to solve authentication error… (#26302)

* fix: Fix Redis Sentinel client handling to solve authentication error with password protected sentinel (#25625)

* fix Redis Sentinel authentication handling

* test: cover Redis Sentinel auth routing

* refactor: align Redis Sentinel kwargs threading

* fix: avoid duplicate Redis Sentinel socket timeouts

* Address review comments

* refactor(_redis): return set from _get_redis_kwargs for O(1) lookup

Align _get_redis_kwargs() with the cluster helper by returning a set
instead of a list, so the sentinel connection-kwargs filter uses O(1)
membership tests. Addresses Greptile review feedback on PR #26302.

* fix(_redis): restore Azure-specific kwargs in cluster kwargs set

The set-literal refactor of _get_redis_cluster_kwargs dropped four
LiteLLM-custom Azure keys (azure_redis_ad_token, azure_client_id,
azure_tenant_id, azure_client_secret) that the prior list form had
explicitly appended. Because they are not in RedisCluster's argspec,
they were silently stripped, breaking Azure IAM auth on cluster
clients. Re-add them to the explicit include set.

---------

Co-authored-by: Kristin Cowalcijk <kristincowalcijk@gmail.com>
Co-authored-by: Sameer Kankute <sameer@berri.ai>
Co-authored-by: krrish-berri-2 <krrish-berri-2@users.noreply.github.com>
Co-authored-by: claude <claude@anthropic.com>

* Litellm agent oss staging 05 11 2026 (#27733)

* fix(ollama): Include provider in model list for ollama (#26135)

* Include provider in model names for ollama

* Fix unit tests

* fix(ollama): process both thinking and content in same streaming chunk (#26098)

* fix(health_check): skip max_tokens for image_generation mode (#26417)

* fix(health_check): skip max_tokens for image_generation mode

`_update_litellm_params_for_health_check` injected `max_tokens` for
every deployment. OpenAI `/v1/images/generations` strictly rejects
unknown fields, so health checks for dall-e-* and gpt-image-1 always
failed with `400 "Unknown parameter: 'max_tokens'"` even though the
actual image endpoint calls succeed. Skip the `max_tokens` injection
when `model_info.mode == "image_generation"`. `messages` still gets
injected (downstream `_filter_model_params` already strips it for
non-chat handlers).

* Switch to allow-list with per-deployment override

Per @krrishdholakia review: deny-listing only image_generation leaves the
same bug in place for every other non-chat mode (embedding, audio_*, rerank,
video_generation, ocr, search, moderation, ...).

Replace the single image_generation skip with `_MAX_TOKEN_SUPPORT_MODES =
{chat, completion, responses}`. Missing `mode` is treated as chat for
backward compatibility. New modes are safe by default.

Add `model_info.health_check_supports_max_tokens` as an operator escape
hatch — True forces injection on a non-listed deployment (operator wants
to bound probe tokens), False suppresses it on a chat-style deployment
behind a strict-schema provider.

Tests: parametrize over 3 chat-style + 10 non-chat modes, plus override
on/off and the no-mode legacy path.
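
The allow-list decision described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `_update_litellm_params_for_health_check`; the function name `should_inject_max_tokens` is hypothetical, while the mode set and the `health_check_supports_max_tokens` key come from the commit message.

```python
from typing import Any, Dict, Optional

# Modes for which a max_tokens bound is assumed to be accepted.
MAX_TOKEN_SUPPORT_MODES = frozenset({"chat", "completion", "responses"})


def should_inject_max_tokens(model_info: Optional[Dict[str, Any]]) -> bool:
    """Decide whether a health check should add max_tokens."""
    info = model_info or {}
    # An explicit per-deployment override wins in either direction.
    override = info.get("health_check_supports_max_tokens")
    if override is not None:
        return bool(override)
    # A missing mode is treated as chat for backward compatibility.
    mode = info.get("mode") or "chat"
    return mode in MAX_TOKEN_SUPPORT_MODES
```

Because the default is an allow-list, a mode that is added later is skipped unless an operator opts in.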

* fix(http_handler): handle RequestNotRead in MaskedHTTPStatusError for multipart uploads (#26718)

Squash-merged by litellm-agent from dawidkulpa's PR.

* fix(ollama): guard against double 'ollama/' prefix in live model listing

Greptile flagged that Ollama servers can return names that already start
with 'ollama/'. Check the prefix before prepending so we don't produce
'ollama/ollama/...'. Adds a regression test.
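
A minimal sketch of the prefix guard, assuming model names are plain strings; `with_ollama_prefix` is a hypothetical helper name, not the function in the patch.

```python
def with_ollama_prefix(name: str) -> str:
    """Prepend 'ollama/' unless the server already returned a prefixed name."""
    prefix = "ollama/"
    return name if name.startswith(prefix) else prefix + name
```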

* Fix Ollama empty reasoning stream chunks

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: James Myatt <james@jamesmyatt.co.uk>
Co-authored-by: VHash <225398745+vhash0@users.noreply.github.com>
Co-authored-by: hayden <sewhan.kim+@a-bly.com>
Co-authored-by: dawidkulpa <84176950+dawidkulpa@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* Ishaan - May 13th Staging LiteLLM (#27877)

* fix: strip Gemini thought-signature from tool_use.id in non-streaming path; example websearch config (#27873)

- adapters/transformation.py: mirror the streaming path and strip the
  `__thought__<b64>` suffix off `tool_call.id` before building the
  AnthropicResponseContentBlockToolUse. Base64's `+ / =` characters
  violate Anthropic's `^[a-zA-Z0-9_-]+$` tool_use.id pattern, so when a
  conversation that flowed through Gemini is later replayed to an
  Anthropic-native provider (Bedrock or Anthropic API) the request 400s.
- example_config_yaml/websearch_interception_config.yaml: register the
  interceptor under `callbacks:` not `success_callback:`. `success_callback`
  does not run pre-request hooks, so the tool-conversion step never fires
  on `/v1/messages` and the raw `web_search_20250305` tool is forwarded
  to Bedrock, which 400s.
- adds a unit test pinning the non-streaming strip behavior and the
  surviving `^[a-zA-Z0-9_-]+$` shape of the resulting id.
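
The id cleanup in the first item can be sketched like this. It assumes the signature is appended after a literal `__thought__` delimiter, as the commit message describes; `strip_thought_signature` is a hypothetical name, and the sample id in the usage note is made up.

```python
import re

THOUGHT_DELIMITER = "__thought__"
# Pattern quoted in the commit message for Anthropic tool_use ids.
TOOL_USE_ID_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")


def strip_thought_signature(tool_call_id: str) -> str:
    """Drop everything from the first '__thought__' delimiter onward."""
    return tool_call_id.split(THOUGHT_DELIMITER, 1)[0]
```

For a made-up id such as `call_abc123__thought__QUJD+/8=`, the raw value fails the pattern because of `+`, `/`, and `=`, while the stripped value `call_abc123` passes.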

Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>

* Fix/azure image edit auth header (#27863)

* fix(azure/image_edit): use api-key header instead of Authorization Bearer

Delegate `AzureImageEditConfig.validate_environment` to
`BaseAzureLLM._base_validate_azure_environment` so the image-edit route
follows the same auth resolution as every other Azure provider:

- prefer the Azure-native `api-key` header when an API key is available
- fall back to `Authorization: Bearer <azure_ad_token>` only for AAD auth

The previous implementation unconditionally set
`Authorization: Bearer <api_key>`, which is the OpenAI-direct convention
and is rejected by Azure OpenAI / APIM-fronted deployments with
`401 Access denied due to missing subscription key`.

Adds regression tests covering api_key kwarg, litellm_params.api_key, and
the AAD-token fallback path.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(azure/image_edit): pin api-key precedence semantics + add regression test

Address review feedback that the move to
``BaseAzureLLM._base_validate_azure_environment`` changed the relative
priority of the positional ``api_key`` kwarg vs. ``litellm_params["api_key"]``.

The new behavior — ``litellm_params["api_key"]`` wins, positional only fills
in when ``litellm_params["api_key"]`` is empty — is intentional and matches
every other Azure ``validate_environment``: ``AzureVideosConfig`` uses the
exact same merge logic, while ``AzureVectorStoresConfig`` and
``AzureResponsesAPIConfig`` don't accept a positional ``api_key`` at all.
The old ``or`` chain (positional wins) was the outlier and was part of the
same OpenAI-vs-Azure convention drift that produced the original
``Authorization: Bearer`` bug.

The only production caller (``llm_http_handler.image_edit``) sources both
values from the same ``litellm_params.api_key``, so this change is
behaviorally a no-op there. Document the precedence in the docstring and
lock it in with an explicit test so future refactors can't quietly
re-invert it.
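
The header choice and key precedence described in these two commits can be sketched as below. This is not the body of ``BaseAzureLLM._base_validate_azure_environment``; ``resolve_azure_headers`` is a hypothetical helper, and reading the AAD token from an ``azure_ad_token`` key is an assumption made for illustration.

```python
from typing import Any, Dict, Optional


def resolve_azure_headers(
    litellm_params: Dict[str, Any], api_key: Optional[str] = None
) -> Dict[str, str]:
    """Pick the auth header for an Azure request."""
    # litellm_params["api_key"] wins; the positional key only fills a gap.
    resolved_key = litellm_params.get("api_key") or api_key
    if resolved_key:
        return {"api-key": resolved_key}
    # Fall back to a Bearer token only for AAD auth (key name assumed).
    ad_token = litellm_params.get("azure_ad_token")
    if ad_token:
        return {"Authorization": f"Bearer {ad_token}"}
    return {}
```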

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: Adam Kirstein <adam.kirstein@disney.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* test(azure/image_edit): expect api-key header instead of Authorization Bearer

PR #27863 fixed Azure image edit to use the Azure-native api-key header
instead of OpenAI's Authorization: Bearer convention, but did not update
test_azure_image_edit_litellm_sdk to match. The test still asserted
'Authorization' in headers, which now fails since the new code routes
through BaseAzureLLM._base_validate_azure_environment and emits
api-key when an api_key is provided.

Update the assertion to pin the correct Azure behavior: api-key header
present with the resolved key, and no Authorization header.

---------

Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>
Co-authored-by: Adam Kirstein <107421694+justalittleadam@users.noreply.github.com>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: Adam Kirstein <adam.kirstein@disney.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>

* fix(fireworks_ai): strip `thinking_blocks` from chat messages before Fireworks API call (#27881)

* fix(fireworks_ai): strip thinking_blocks from chat messages before API call

Fireworks OpenAI-compatible ChatMessage schema uses additionalProperties:false
and rejects Anthropic-style messages[].thinking_blocks (e.g. Claude Code replays),
returning invalid_request_error. Remove the field in _transform_messages_helper
alongside provider_specific_fields.
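
A minimal sketch of the sanitization step, assuming messages are plain dicts; `sanitize_messages` is a hypothetical name and not the body of `_transform_messages_helper`.

```python
from typing import Any, Dict, List

# Fields the commit message says are removed before the request is sent.
STRIPPED_FIELDS = ("provider_specific_fields", "thinking_blocks")


def sanitize_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the messages without fields a strict schema rejects."""
    return [
        {key: value for key, value in message.items() if key not in STRIPPED_FIELDS}
        for message in messages
    ]
```

The sketch copies each message rather than mutating it, so the caller's history keeps its original fields.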

Adds unit test test_transform_messages_helper_strips_thinking_blocks.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(fireworks_ai): drop inline comments from message sanitization

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(fireworks_ai): explain why provider_specific_fields and thinking_blocks are stripped

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: block client-side pricing injection via request body

Authenticated clients could supply CustomPricingLiteLLMParams fields
(input_cost_per_token, output_cost_per_token, etc.) in the request body.
These were forwarded to register_model() in main.py, permanently mutating
the shared global litellm.model_cost dict for all users on the instance.

Adds all CustomPricingLiteLLMParams fields to _BANNED_REQUEST_BODY_PARAMS
so is_request_body_safe() rejects them before they reach completion().
New pricing fields added to CustomPricingLiteLLMParams are auto-covered.

Admin opt-in via allow_client_side_credentials or
configurable_clientside_auth_params still works as before.
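
One way to get the auto-coverage described above is to derive the banned names from the pricing model's fields. The sketch below uses a hypothetical dataclass `PricingParams` in place of CustomPricingLiteLLMParams and a hypothetical `is_body_safe` in place of is_request_body_safe(); the admin opt-in is reduced to a single boolean.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class PricingParams:
    """Hypothetical stand-in for the real pricing params model."""

    input_cost_per_token: Optional[float] = None
    output_cost_per_token: Optional[float] = None


# Derived from the model, so newly added pricing fields are banned automatically.
BANNED_PRICING_PARAMS = frozenset(f.name for f in fields(PricingParams))


def is_body_safe(body: Dict[str, Any], admin_opt_in: bool = False) -> bool:
    """Reject client-supplied pricing fields unless an admin opted in."""
    if admin_opt_in:
        return True
    return BANNED_PRICING_PARAMS.isdisjoint(body)
```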

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* chore(proxy): scrub remote-URL module loads from DB-overlay config

When ``ProxyConfig`` merges DB-persisted ``litellm_settings`` /
``general_settings`` on top of the YAML config, the merged dict is
later iterated by ``load_config`` which threads ``config_file_path``
(the YAML path) into ``get_instance_fn``. The runtime gate that
refuses ``s3://`` / ``gcs://`` modules when ``config_file_path`` is
``None`` therefore can't distinguish a YAML-sourced value from a
DB-sourced one: both look the same to ``get_instance_fn``.

Strip ``s3://`` / ``gcs://`` entries from the DB-overlay value for
every field whose contents reach ``get_instance_fn`` during config
load:

- litellm_settings: ``callbacks``, ``success_callback``,
  ``failure_callback``, ``audit_log_callbacks``, ``post_call_rules``,
  ``custom_provider_map[].custom_handler``
- general_settings: ``custom_auth``, ``custom_key_generate``,
  ``custom_key_update``, ``custom_sso``,
  ``custom_ui_sso_sign_in_handler``,
  ``litellm_jwtauth.custom_validate``

The YAML config-file load path is unchanged — the documented operator
flow (``callbacks: ["s3://bucket/module.instance"]`` in ``config.yaml``)
still works. Only DB-overlay writes (e.g. via ``/config/update``) are
stripped.
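
A minimal sketch of the scrub for list-valued fields such as ``callbacks``, assuming the DB overlay is a plain dict. ``scrub_remote_modules`` and its default field tuple are hypothetical and cover only part of the matrix listed above.

```python
from typing import Any, Dict, Tuple

REMOTE_PREFIXES = ("s3://", "gcs://")


def is_remote_module(value: Any) -> bool:
    """True for string entries that point at a remote module URL."""
    return isinstance(value, str) and value.startswith(REMOTE_PREFIXES)


def scrub_remote_modules(
    db_settings: Dict[str, Any],
    list_fields: Tuple[str, ...] = ("callbacks", "success_callback", "failure_callback"),
) -> Dict[str, Any]:
    """Return a copy of the DB overlay with remote-URL entries removed."""
    scrubbed = dict(db_settings)
    for field_name in list_fields:
        values = scrubbed.get(field_name)
        if isinstance(values, list):
            scrubbed[field_name] = [v for v in values if not is_remote_module(v)]
    return scrubbed
```

The function is meant to run only on the DB-sourced dict before the merge, which is why YAML-sourced values are unaffected.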

Adds 16 regression tests covering the scrub matrix.

* chore(proxy): also scrub pass_through_endpoints[].target from DB overlay

A pass-through endpoint's ``target`` field flows through
``create_pass_through_route`` into ``get_instance_fn`` during config
load. The previous scrub matrix did not cover the DB-overlay
``pass_through_endpoints`` write path, so a PROXY_ADMIN persisting
``target: "s3://attacker/m.i"`` there could still trigger a remote
module load: the YAML-load chain has ``config_file_path`` set, which
satisfies the loader's runtime gate.

Walk each entry in ``general_settings.pass_through_endpoints`` and
null out any ``target`` that starts with ``s3://`` or ``gcs://``. The
entry itself is preserved so the path-registration helper can choose
how to handle a missing target (the existing code skips the route
when ``target is None``).
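
The walk described above can be sketched as follows, assuming each endpoint entry is a dict; ``scrub_pass_through_targets`` is a hypothetical name.

```python
from typing import Any, Dict, List


def scrub_pass_through_targets(endpoints: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Null out remote-URL targets while preserving each entry."""
    scrubbed = []
    for endpoint in endpoints:
        entry = dict(endpoint)  # copy so the input is not mutated
        target = entry.get("target")
        if isinstance(target, str) and target.startswith(("s3://", "gcs://")):
            entry["target"] = None
        scrubbed.append(entry)
    return scrubbed
```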

Adds two regression tests.

* fix(prometheus): emit `litellm_remaining_tokens_metric` for Bedrock and Vertex (#27705)

* fix(prometheus): emit remaining_tokens/requests gauges for bedrock + vertex (LIT-2719)

Bedrock and Vertex AI never return x-ratelimit-remaining-* response headers,
so litellm_remaining_tokens_metric / litellm_remaining_requests_metric only
fired for OpenAI / Azure / Anthropic deployments even when tpm/rpm was
configured on the router.

Add a provider-agnostic fallback in PrometheusLogger.async_log_success_event
that asks Router.get_remaining_model_group_usage() for the same model_group
and emits the gauges with configured_limit - current_usage when the upstream
provider didn't populate the headers itself. Existing OpenAI / Azure /
Anthropic flows are unchanged because the fallback short-circuits when both
header values are already present.
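
The fallback arithmetic can be sketched as below. The dictionary keys (`tpm_limit`, `tpm_used`, `rpm_limit`, `rpm_used`) and the function name `resolve_remaining` are assumptions for illustration; they are not the shape returned by Router.get_remaining_model_group_usage().

```python
from typing import Dict, Optional, Tuple


def resolve_remaining(
    header_tokens: Optional[int],
    header_requests: Optional[int],
    router_usage: Optional[Dict[str, int]],
) -> Tuple[Optional[int], Optional[int]]:
    """Prefer provider headers; fill gaps with configured_limit - current_usage."""
    # Short-circuit: both headers present, so the router is not consulted.
    if header_tokens is not None and header_requests is not None:
        return header_tokens, header_requests
    usage = router_usage or {}

    def from_router(limit_key: str, used_key: str) -> Optional[int]:
        limit = usage.get(limit_key)
        if limit is None:
            return None
        return limit - usage.get(used_key, 0)

    tokens = header_tokens if header_tokens is not None else from_router("tpm_limit", "tpm_used")
    requests = header_requests if header_requests is not None else from_router("rpm_limit", "rpm_used")
    return tokens, requests
```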

Tests: 8 new tests covering bedrock + vertex emission, header short-circuit,
partial-header fill, llm_router=None, missing model_group, empty router
result, and router exception swallowing.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(prometheus): narrow except to ImportError, log router lookup failures via verbose_logger.exception

Address greptile review:
- The optional 'from litellm.proxy.proxy_server import llm_router' should
  guard against ImportError specifically, not all exceptions, so that
  unexpected errors (e.g. AttributeError from partially-initialized state)
  stay visible.
- get_remaining_model_group_usage failures are now logged via
  verbose_logger.exception (with traceback) instead of debug, matching the
  PR description's intent and avoiding silent loss of router-cache errors
  in production.
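
Both review points can be illustrated with a small sketch. `optional_import` and `safe_lookup` are hypothetical helpers, and the standard `logging` module stands in for verbose_logger.

```python
import importlib
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("prometheus_sketch")


def optional_import(module_name: str) -> Optional[Any]:
    """Return the module, or None only when it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Other errors (e.g. AttributeError) are deliberately not caught here.
        return None


def safe_lookup(lookup: Callable[[], Any]) -> Optional[Any]:
    """Run a lookup; log failures with a traceback instead of hiding them."""
    try:
        return lookup()
    except Exception:
        logger.exception("router usage lookup failed")
        return None
```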

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(prometheus): subtract in-flight delta in router-remaining fallback

The router's TPM/RPM counter is incremented by
Router.deployment_callback_on_success, which f…

4 participants