fix(mcp): surface upstream 401 for token-forwarding MCP servers by Sameerlite · Pull Request #27847 · BerriAI/litellm

Sameerlite · 2026-05-13T13:26:16Z

What

For MCP servers configured with extra_headers: [Authorization], the gateway forwards the client's Bearer token directly to the upstream (OAuth pass-through). When that token is rejected by the upstream (expired/invalid), the upstream returns HTTP 401 — but LiteLLM was swallowing it and returning 200 {"tools":[]} instead.

Root cause: the MCP SDK (StreamableHTTPSessionManager) sends 200 OK + SSE headers before dispatching to handlers, so by the time list_tools detected the failure the HTTP response was already committed. Exception propagation through the SDK's internals isn't viable because except Exception guards at multiple layers catch it first.

Fix

Add a pre-flight auth probe in handle_streamable_http_mcp, before session_manager.handle_request is called:

For servers whose extra_headers include Authorization (token-forwarding servers), extract the client's Authorization header from the ASGI scope.
Send a minimal JSON-RPC initialize probe to the upstream with that token (5 s timeout).
If the upstream returns 401/403 → raise HTTPException(401) with WWW-Authenticate: Bearer authorization_uri=<gateway-discovery-url> (same format as the existing per-user OAuth2 server handling).
On network error / timeout the probe fails-open (returns 200, None) so a transient hiccup does not block valid requests.

Note

Medium Risk
Touches MCP authentication/authorization flow and adds outbound preflight requests; mistakes could cause unexpected 401/403s or added latency for MCP connections.

Overview
Adds a pre-flight auth probe for StreamableHTTP MCP requests when a server is configured to pass through caller Authorization via extra_headers, so upstream 401/403 is detected before the MCP SDK commits 200 SSE headers.

Introduces helpers to safely extract a forwardable Authorization header only when x-litellm-api-key is present (to avoid leaking proxy keys), probes all authorized pass-through servers in parallel via a JSON-RPC initialize POST, and maps upstream 401 to a gateway 401 with WWW-Authenticate (and 403 to Forbidden) while failing open on network errors.

Updates _stream_mcp_asgi_response to propagate handler exceptions occurring before response headers (including HTTPException), and adds unit tests covering the probe behavior and exception propagation.

^{Reviewed by Cursor Bugbot for commit 28eda2a. Bugbot is set up for automated code reviews on this repo. Configure here.}

For MCP servers configured with extra_headers: [Authorization], the gateway forwards the client token directly to the upstream. When that token is rejected (expired or invalid) the upstream returns 401, but the MCP SDK starts the SSE stream with 200 OK before calling handlers, so the 401 can't be returned mid-stream. Fix: add a pre-flight httpx probe in handle_streamable_http_mcp — before the SDK opens the session — so the gateway can still return HTTP 401 with WWW-Authenticate: Bearer authorization_uri=<gateway-discovery-url> when the upstream rejects the token. The probe fails-open (returns 200) on network errors so a transient hiccup does not block valid requests. Co-authored-by: Cursor <cursoragent@cursor.com>

codecov · 2026-05-13T13:29:58Z

Codecov Report

❌ Patch coverage is 71.11111% with 13 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
litellm/proxy/_experimental/mcp_server/server.py	71.11%	13 Missing ⚠️

📢 Thoughts on this report? Let us know!

greptile-apps · 2026-05-13T13:30:19Z

Greptile Summary

This PR surfaces upstream 401/403 errors for token-forwarding MCP servers by adding a pre-flight auth probe before StreamableHTTPSessionManager.handle_request commits 200 SSE headers. It also updates _stream_mcp_asgi_response to propagate exceptions that occur before response headers are sent.

Introduces _get_forwarded_auth_from_scope, _probe_upstream_auth, and _check_passthrough_upstream_auth helpers; all authorized pass-through servers are probed in parallel via asyncio.gather.
Updates the _ensure_eof done-callback in proxy_server.py to set the exception on headers_ready when the handler raises before committing headers, so callers receive the original HTTPException rather than a 30 s timeout.
Adds unit tests for the probe and the new exception-propagation path.

Confidence Score: 4/5

The core auth-probe logic is sound and the exception-propagation fix is correct; the main concern is an extra permission-lookup chain now running on every qualifying MCP request.

The probe correctly parallels calls with asyncio.gather, properly catches MaskedHTTPStatusError (a httpx.HTTPStatusError subclass) from AsyncHTTPHandler.post(), and the _ensure_eof exception-propagation change is validated by a new test. The only substantive concern is _check_passthrough_upstream_auth triggering a multi-step permission chain (key, team, end_user, agent, org DB lookups) that already runs later in the same request inside list_mcp_tools, doubling the work for every pass-through auth request.

litellm/proxy/_experimental/mcp_server/server.py — specifically the _get_allowed_mcp_servers call inside _check_passthrough_upstream_auth.

Important Files Changed

Filename	Overview
litellm/proxy/_experimental/mcp_server/server.py	Adds pre-flight auth probe helpers and wires them in before the MCP session manager; `_get_allowed_mcp_servers` (potential multi-step DB chain) is now called on every request with `x-litellm-api-key` + `Authorization` targeting pass-through servers.
litellm/proxy/proxy_server.py	Updates `_ensure_eof` done-callback to propagate pre-header exceptions through `headers_ready`; change is minimal and correct.
tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py	New probe tests are present; `test_probe_upstream_auth_returns_upstream_status` mocks `client.post` returning a 401 response object directly, a path that is unreachable in production since `AsyncHTTPHandler.post()` raises `MaskedHTTPStatusError` for any 4xx.
tests/test_litellm/proxy/test_mcp_asgi_response.py	New test validating that a pre-header `HTTPException` propagates through `_stream_mcp_asgi_response`; covers the `_ensure_eof` change correctly.

_{Reviews (7): Last reviewed commit: "test(mcp): add coverage for httpx.HTTPSt..." | Re-trigger Greptile}

…de effects - Extract forwarded_auth outside the pass-through server loop (was called N times for the same scope value) - Gather all upstream auth probes concurrently with asyncio.gather instead of sequentially; eliminates N×5 s worst-case latency - Switch probe from POST+initialize JSON-RPC body to HEAD request; HEAD carries the Authorization header so the upstream rejects invalid tokens with 401 but never allocates a session or writes an audit entry Co-authored-by: Cursor <cursoragent@cursor.com>

Replaces bare httpx.AsyncClient with the project-standard get_async_httpx_client(httpxSpecialProvider.MCP) to satisfy the ensure_async_clients_test code coverage check and avoid the +500 ms per-request overhead of creating a new client on every probe call. Co-authored-by: Cursor <cursoragent@cursor.com>

Sameerlite · 2026-05-13T13:37:56Z

@greptile re review

…eam_auth Moves the parallel upstream auth probe logic out of handle_streamable_http_mcp into a dedicated helper to satisfy Ruff PLR0915 (Too many statements > 50). Co-authored-by: Cursor <cursoragent@cursor.com>

veria-ai · 2026-05-13T13:39:43Z

MCP upstream auth preflight added

This PR adds a pre-header probe for token-forwarding MCP servers so upstream 401/403 responses can be surfaced before the streaming response starts, and updates the ASGI bridge to propagate pre-header exceptions. I reviewed the MCP server selection, forwarded-header handling, and exception propagation path and did not find a security issue introduced by these changes.

Status: 1 open
Risk: 2/10

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Probe always fails open: AsyncHTTPHandler lacks head method
- Added an async HEAD helper to AsyncHTTPHandler so the upstream auth probe reaches the server and can surface 401/403 responses.

Preview (b211144a33)

diff --git a/litellm/llms/custom_httpx/http_handler.py b/litellm/llms/custom_httpx/http_handler.py
--- a/litellm/llms/custom_httpx/http_handler.py
+++ b/litellm/llms/custom_httpx/http_handler.py
@@ -598,6 +598,26 @@
         )
         return response
 
+    async def head(
+        self,
+        url: str,
+        params: Optional[dict] = None,
+        headers: Optional[dict] = None,
+        follow_redirects: Optional[bool] = None,
+    ):
+        # Set follow_redirects to UseClientDefault if None
+        _follow_redirects = (
+            follow_redirects if follow_redirects is not None else USE_CLIENT_DEFAULT
+        )
+
+        params = params or {}
+        params.update(HTTPHandler.extract_query_params(url))
+
+        response = await self.client.head(
+            url, params=params, headers=headers, follow_redirects=_follow_redirects  # type: ignore
+        )
+        return response
+
     @track_llm_api_timing()
     async def post(
         self,

diff --git a/litellm/proxy/_experimental/mcp_server/server.py b/litellm/proxy/_experimental/mcp_server/server.py
--- a/litellm/proxy/_experimental/mcp_server/server.py
+++ b/litellm/proxy/_experimental/mcp_server/server.py
@@ -51,6 +51,10 @@
     get_server_prefix,
     iter_known_server_prefixes,
 )
+from litellm.llms.custom_httpx.http_handler import (
+    get_async_httpx_client,
+    httpxSpecialProvider,
+)
 from litellm.proxy._types import UserAPIKeyAuth
 from litellm.proxy.auth.ip_address_utils import IPAddressUtils
 from litellm.proxy.litellm_pre_call_utils import (
@@ -2754,6 +2758,98 @@
             )
         return user_api_key_auth.model_copy(update={"object_permission": updated_op})
 
+    def _get_forwarded_auth_from_scope(scope: dict) -> Optional[str]:
+        """Return the raw Authorization header value from the ASGI scope, or None."""
+        for key, value in scope.get("headers", []):
+            if key.lower() == b"authorization":
+                return value.decode("latin-1")
+        return None
+
+    async def _probe_upstream_auth(
+        url: str,
+        auth_header: str,
+        timeout: float = 5.0,
+    ) -> tuple:
+        """HEAD-probe the upstream URL to check whether the token is accepted.
+
+        Uses HEAD so the upstream receives no request body and allocates no
+        session or audit state. Returns (status_code, www_authenticate).
+        Fails-open with (200, None) on network errors so a transient hiccup
+        does not block valid requests.
+        """
+        try:
+            client = get_async_httpx_client(
+                llm_provider=httpxSpecialProvider.MCP,
+                params={"timeout": timeout},
+            )
+            resp = await client.head(
+                url,
+                headers={"Authorization": auth_header},
+            )
+            return resp.status_code, resp.headers.get("www-authenticate")
+        except Exception as exc:
+            verbose_logger.debug(
+                f"_probe_upstream_auth: probe to {url} failed ({exc}), allowing request through"
+            )
+            return 200, None
+
+    async def _check_passthrough_upstream_auth(
+        scope: Scope,
+        user_api_key_auth: Optional[UserAPIKeyAuth],
+        mcp_servers: Optional[List[str]],
+        client_ip: Optional[str],
+    ) -> None:
+        """Probe pass-through upstream servers in parallel before the MCP session starts.
+
+        Only servers the caller's key is already authorized to reach are probed —
+        the list is derived from _get_allowed_mcp_servers so that a user cannot
+        trigger an upstream probe against a server their key is not permitted for.
+
+        The MCP SDK commits HTTP 200 headers before invoking handlers, so a 401
+        can only be returned before that point. This function raises HTTPException(401)
+        with a WWW-Authenticate header if any upstream rejects the client token.
+        Fails-open: network errors are logged and the request is allowed through.
+        """
+        forwarded_auth = _get_forwarded_auth_from_scope(scope)
+        if not forwarded_auth:
+            return
+
+        # Use the authorized server set, not the raw user-supplied names, so that
+        # a caller cannot force a probe to a server their key is not allowed to use.
+        allowed_servers = await _get_allowed_mcp_servers(
+            user_api_key_auth=user_api_key_auth,
+            mcp_servers=mcp_servers,
+            client_ip=client_ip,
+        )
+        passthrough_servers = [
+            srv
+            for srv in allowed_servers
+            if srv.extra_headers
+            and any(h.lower() == "authorization" for h in srv.extra_headers)
+        ]
+        if not passthrough_servers:
+            return
+
+        probe_results = await asyncio.gather(
+            *[
+                _probe_upstream_auth(srv.url or "", forwarded_auth)
+                for srv in passthrough_servers
+            ]
+        )
+        request = StarletteRequest(scope)
+        base_url = get_request_base_url(request)
+        for srv, (probe_status, _) in zip(passthrough_servers, probe_results):
+            if probe_status in (401, 403):
+                authorization_uri = (
+                    f"Bearer authorization_uri="
+                    f"{base_url}/.well-known/oauth-authorization-server/{srv.name}"
+                )
+                raise HTTPException(
+                    status_code=401,
+                    detail="Unauthorized",
+                    headers={"WWW-Authenticate": authorization_uri},
+                )
+
     async def handle_streamable_http_mcp(
         scope: Scope, receive: Receive, send: Send
     ) -> None:
@@ -2827,6 +2923,13 @@
                     user_api_key_auth, active_toolset_id
                 )
 
+            # Pre-flight auth check for pass-through servers.  Must run after
+            # toolset scoping so the probe list is derived from the fully-authorized
+            # server set, not the raw user-supplied names.
+            await _check_passthrough_upstream_auth(
+                scope, user_api_key_auth, mcp_servers, _client_ip
+            )
+
             # Inject masked debug headers when client sends x-litellm-mcp-debug: true
             _debug_headers = MCPDebug.maybe_build_debug_headers(
                 raw_headers=raw_headers,

diff --git a/tests/test_litellm/llms/custom_httpx/test_http_handler.py b/tests/test_litellm/llms/custom_httpx/test_http_handler.py
--- a/tests/test_litellm/llms/custom_httpx/test_http_handler.py
+++ b/tests/test_litellm/llms/custom_httpx/test_http_handler.py
@@ -27,6 +27,39 @@
 
 
 @pytest.mark.asyncio
+async def test_async_head_returns_response_without_raise_for_status():
+    captured_request = None
+
+    async def mock_handler(request: httpx.Request) -> httpx.Response:
+        nonlocal captured_request
+        captured_request = request
+        return httpx.Response(
+            401,
+            request=request,
+            headers={"www-authenticate": 'Bearer realm="test"'},
+        )
+
+    litellm_handler = AsyncHTTPHandler()
+    await litellm_handler.client.aclose()
+    litellm_handler.client = httpx.AsyncClient(
+        transport=httpx.MockTransport(mock_handler)
+    )
+    try:
+        response = await litellm_handler.head(
+            "https://upstream.example/mcp",
+            headers={"Authorization": "Bearer some-token"},
+        )
+
+        assert response.status_code == 401
+        assert response.headers["www-authenticate"] == 'Bearer realm="test"'
+        assert captured_request is not None
+        assert captured_request.method == "HEAD"
+        assert captured_request.headers["Authorization"] == "Bearer some-token"
+    finally:
+        await litellm_handler.close()
+
+
+@pytest.mark.asyncio
 async def test_async_post_streaming_status_error_should_not_wait_forever_for_body(
     monkeypatch,
 ):

diff --git a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
--- a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
+++ b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
@@ -3256,3 +3256,75 @@
     ), "P2 API consistency issue: expected None for empty extra_headers, got: " + str(
         captured_extra_headers
     )
+
+
+# ---------------------------------------------------------------------------
+# Pre-flight upstream auth check tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_returns_upstream_status():
+    """_probe_upstream_auth forwards the status code from the upstream server."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_response = MagicMock()
+    mock_response.status_code = 401
+    mock_response.headers = {"www-authenticate": 'Bearer realm="test"'}
+
+    mock_client = AsyncMock()
+    mock_client.head = AsyncMock(return_value=mock_response)
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 401
+    assert www_auth == 'Bearer realm="test"'
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_fails_open_on_network_error():
+    """_probe_upstream_auth returns (200, None) when the network call fails."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_client = AsyncMock()
+    mock_client.head = AsyncMock(side_effect=Exception("connection refused"))
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 200
+    assert www_auth is None
+
+
+def test_get_forwarded_auth_from_scope_extracts_header():
+    """_get_forwarded_auth_from_scope returns the Authorization value."""
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    scope = {
+        "headers": [
+            (b"content-type", b"application/json"),
+            (b"authorization", b"Bearer my-token"),
+        ]
+    }
+    assert _get_forwarded_auth_from_scope(scope) == "Bearer my-token"
+
+
+def test_get_forwarded_auth_from_scope_returns_none_when_missing():
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    assert _get_forwarded_auth_from_scope({"headers": []}) is None

_{You can send follow-ups to the cloud agent here.}

…bypass _check_passthrough_upstream_auth was resolving user-supplied server names directly before authorization ran, letting any permitted LiteLLM key trigger an upstream HEAD probe to a server it was not allowed to use. Changes: - Call _get_allowed_mcp_servers inside the helper so only servers the caller's key is authorized for are probed. - Move the call site to after toolset scoping so the auth context is fully resolved before the probe list is built. - Thread user_api_key_auth into the helper signature (replaces the raw mcp_servers name list). Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Yassin Kortam <yassin@berri.ai>

CLAassistant · 2026-05-13T13:48:04Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 3 committers have signed the CLA.

✅ Sameerlite
❌ cursoragent
❌ claude-bot

claude-bot seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Yassin Kortam <yassin@berri.ai>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Unused head method added to AsyncHTTPHandler
- Removed the unused AsyncHTTPHandler.head method and its production-unexercised test.

Preview (ade8ca1f0e)

diff --git a/litellm/proxy/_experimental/mcp_server/server.py b/litellm/proxy/_experimental/mcp_server/server.py
--- a/litellm/proxy/_experimental/mcp_server/server.py
+++ b/litellm/proxy/_experimental/mcp_server/server.py
@@ -51,13 +51,17 @@
     get_server_prefix,
     iter_known_server_prefixes,
 )
+from litellm.llms.custom_httpx.http_handler import (
+    get_async_httpx_client,
+    httpxSpecialProvider,
+)
 from litellm.proxy._types import UserAPIKeyAuth
 from litellm.proxy.auth.ip_address_utils import IPAddressUtils
 from litellm.proxy.litellm_pre_call_utils import (
     LiteLLMProxyRequestSetup,
     get_chain_id_from_headers,
 )
-from litellm.types.mcp import MCPAuth
+from litellm.types.mcp import MCPAuth, MCPSpecVersion
 from litellm.types.mcp_server.mcp_server_manager import MCPInfo, MCPServer
 from litellm.types.utils import CallTypes, StandardLoggingMCPToolCall
 from litellm.utils import Rules, client, function_setup
@@ -2754,6 +2758,115 @@
             )
         return user_api_key_auth.model_copy(update={"object_permission": updated_op})
 
+    def _get_forwarded_auth_from_scope(scope: Scope) -> Optional[str]:
+        """Return the raw Authorization header value from the ASGI scope, or None."""
+        for key, value in scope.get("headers", []):
+            if key.lower() == b"authorization":
+                return value.decode("latin-1")
+        return None
+
+    async def _probe_upstream_auth(
+        url: str,
+        auth_header: str,
+        timeout: float = 5.0,
+    ) -> tuple:
+        """JSON-RPC initialize-probe the upstream URL to check whether the token is accepted.
+
+        Uses POST so StreamableHTTP MCP servers run the same auth path as a
+        real client request. Returns (status_code, www_authenticate).
+        Fails-open with (200, None) on network errors so a transient hiccup
+        does not block valid requests.
+        """
+        try:
+            client = get_async_httpx_client(
+                llm_provider=httpxSpecialProvider.MCP,
+                params={"timeout": timeout},
+            )
+            probe_payload = {
+                "jsonrpc": "2.0",
+                "id": "litellm-mcp-auth-probe",
+                "method": "initialize",
+                "params": {
+                    "protocolVersion": MCPSpecVersion.jun_2025.value,
+                    "capabilities": {},
+                    "clientInfo": {
+                        "name": "litellm-mcp-auth-probe",
+                        "version": "1.0.0",
+                    },
+                },
+            }
+            resp = await client.client.post(  # type: ignore[attr-defined]
+                url,
+                headers={
+                    "Authorization": auth_header,
+                    "Accept": "application/json, text/event-stream",
+                },
+                json=probe_payload,
+            )
+            return resp.status_code, resp.headers.get("www-authenticate")
+        except Exception as exc:
+            verbose_logger.debug(
+                f"_probe_upstream_auth: probe to {url} failed ({exc}), allowing request through"
+            )
+            return 200, None
+
+    async def _check_passthrough_upstream_auth(
+        scope: Scope,
+        user_api_key_auth: Optional[UserAPIKeyAuth],
+        mcp_servers: Optional[List[str]],
+        client_ip: Optional[str],
+    ) -> None:
+        """Probe pass-through upstream servers in parallel before the MCP session starts.
+
+        Only servers the caller's key is already authorized to reach are probed —
+        the list is derived from _get_allowed_mcp_servers so that a user cannot
+        trigger an upstream probe against a server their key is not permitted for.
+
+        The MCP SDK commits HTTP 200 headers before invoking handlers, so a 401
+        can only be returned before that point. This function raises HTTPException(401)
+        with a WWW-Authenticate header if any upstream rejects the client token.
+        Fails-open: network errors are logged and the request is allowed through.
+        """
+        forwarded_auth = _get_forwarded_auth_from_scope(scope)
+        if not forwarded_auth:
+            return
+
+        # Use the authorized server set, not the raw user-supplied names, so that
+        # a caller cannot force a probe to a server their key is not allowed to use.
+        allowed_servers = await _get_allowed_mcp_servers(
+            user_api_key_auth=user_api_key_auth,
+            mcp_servers=mcp_servers,
+            client_ip=client_ip,
+        )
+        passthrough_servers = [
+            srv
+            for srv in allowed_servers
+            if srv.extra_headers
+            and any(h.lower() == "authorization" for h in srv.extra_headers)
+        ]
+        if not passthrough_servers:
+            return
+
+        probe_results = await asyncio.gather(
+            *[
+                _probe_upstream_auth(srv.url or "", forwarded_auth)
+                for srv in passthrough_servers
+            ]
+        )
+        request = StarletteRequest(scope)
+        base_url = get_request_base_url(request)
+        for srv, (probe_status, _) in zip(passthrough_servers, probe_results):
+            if probe_status in (401, 403):
+                authorization_uri = (
+                    f"Bearer authorization_uri="
+                    f"{base_url}/.well-known/oauth-authorization-server/{srv.name}"
+                )
+                raise HTTPException(
+                    status_code=401,
+                    detail="Unauthorized",
+                    headers={"WWW-Authenticate": authorization_uri},
+                )
+
     async def handle_streamable_http_mcp(
         scope: Scope, receive: Receive, send: Send
     ) -> None:
@@ -2827,6 +2940,13 @@
                     user_api_key_auth, active_toolset_id
                 )
 
+            # Pre-flight auth check for pass-through servers.  Must run after
+            # toolset scoping so the probe list is derived from the fully-authorized
+            # server set, not the raw user-supplied names.
+            await _check_passthrough_upstream_auth(
+                scope, user_api_key_auth, mcp_servers, _client_ip
+            )
+
             # Inject masked debug headers when client sends x-litellm-mcp-debug: true
             _debug_headers = MCPDebug.maybe_build_debug_headers(
                 raw_headers=raw_headers,

diff --git a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
--- a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
+++ b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
@@ -3256,3 +3256,79 @@
     ), "P2 API consistency issue: expected None for empty extra_headers, got: " + str(
         captured_extra_headers
     )
+
+
+# ---------------------------------------------------------------------------
+# Pre-flight upstream auth check tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_returns_upstream_status():
+    """_probe_upstream_auth forwards the status code from the upstream server."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_response = MagicMock()
+    mock_response.status_code = 401
+    mock_response.headers = {"www-authenticate": 'Bearer realm="test"'}
+
+    mock_client = AsyncMock()
+    mock_client.client.post = AsyncMock(return_value=mock_response)
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 401
+    assert www_auth == 'Bearer realm="test"'
+    mock_client.client.post.assert_awaited_once()
+    _, kwargs = mock_client.client.post.call_args
+    assert kwargs["headers"]["Authorization"] == "Bearer some-token"
+    assert kwargs["json"]["method"] == "initialize"
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_fails_open_on_network_error():
+    """_probe_upstream_auth returns (200, None) when the network call fails."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_client = AsyncMock()
+    mock_client.client.post = AsyncMock(side_effect=Exception("connection refused"))
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 200
+    assert www_auth is None
+
+
+def test_get_forwarded_auth_from_scope_extracts_header():
+    """_get_forwarded_auth_from_scope returns the Authorization value."""
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    scope = {
+        "headers": [
+            (b"content-type", b"application/json"),
+            (b"authorization", b"Bearer my-token"),
+        ]
+    }
+    assert _get_forwarded_auth_from_scope(scope) == "Bearer my-token"
+
+
+def test_get_forwarded_auth_from_scope_returns_none_when_missing():
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    assert _get_forwarded_auth_from_scope({"headers": []}) is None

_{You can send follow-ups to the cloud agent here.}

Co-authored-by: Yassin Kortam <yassin@berri.ai>

… probe _prepare_mcp_server_headers skips caller Authorization when the server uses OAuth client-credentials (M2M), but the pre-flight probe was still selecting those servers and forwarding the caller's raw token in the HEAD request. Exclude servers with has_client_credentials from the probe list to match the actual downstream header-preparation logic. Co-authored-by: Cursor <cursoragent@cursor.com>

Sameerlite · 2026-05-13T15:55:10Z

@greptile re review

Per RFC 9110, 401 means "go get new credentials." Mapping an upstream 403 to a gateway 401 causes OAuth clients to restart the authorization flow, obtain a fresh token with identical scopes, hit 403 again, and loop indefinitely. 401 from upstream → gateway 401 + WWW-Authenticate (re-authorize) 403 from upstream → gateway 403 (no WWW-Authenticate hint) Co-authored-by: Cursor <cursoragent@cursor.com>

Sameerlite · 2026-05-13T16:05:21Z

@greptile re review

veria-ai · 2026-05-13T16:12:10Z

+
+        probe_results = await asyncio.gather(
+            *[
+                _probe_upstream_auth(srv.url or "", forwarded_auth)


High: Proxy API key forwarded upstream

forwarded_auth is taken from the raw request Authorization header, but that same header is also accepted as the LiteLLM API key when x-litellm-api-key is not present. A configured token-forwarding MCP server can now receive and reuse a user's LiteLLM proxy key during this preflight probe; only run the probe when the proxy was authenticated with a separate credential, or pass an upstream token value that process_mcp_request has confirmed is not the proxy API key.

… key The pre-flight upstream probe must not forward the caller's Authorization header when it could itself be the LiteLLM proxy API key. Restrict the probe to requests that supply x-litellm-api-key explicitly — only then is the Authorization header unambiguously the upstream OAuth token the caller wants forwarded.

mateo-berri · 2026-05-13T16:54:42Z

@greptileai please re-review

Responses to outstanding feedback:

Greptile P1 (15:59) — 403 mapped to 401: addressed in 1925335 — upstream 403 now propagates as 403 without WWW-Authenticate.
Veria High (16:12) — proxy API key forwarded upstream: addressed in this push. _get_forwarded_auth_from_scope now requires x-litellm-api-key to be present before returning the Authorization header. When x-litellm-api-key is absent, Authorization may itself be the LiteLLM proxy key (backward-compat path in MCPRequestHandler.process_mcp_request), so the probe is skipped to prevent leaking the proxy key upstream.
Greptile P1 (13:30) — backwards-compat without feature flag: the prior behavior (returning 200 {"tools":[]} on upstream 401) was a bug masking auth failures. The probe also fails-open on network errors, so the only behavior change is converting silent-fail-with-empty-tools into a proper 401/403 — there is no benign behavior to preserve.
Greptile P1/P2 (13:30) — probe latency and redundant forwarded_auth: already addressed; probes run in parallel via asyncio.gather and forwarded_auth is computed once before the loop.
Greptile P2 (13:30) — initialize RPC side-effects: tried HEAD first but most MCP StreamableHTTP servers return 405 on HEAD and never invoke their auth middleware, so the probe missed real 401s. Falling back to POST initialize is the smallest payload that reliably exercises upstream auth; the upstream allocates a transient session at worst, which is acceptable for catching expired tokens before the SDK commits 200 OK.

greptile-apps · 2026-05-13T17:02:55Z

+            resp = await client.client.post(  # type: ignore[attr-defined]
+                url,
+                headers={
+                    "Authorization": auth_header,
+                    "Accept": "application/json, text/event-stream",
+                },
+                json=probe_payload,
+            )
+            return resp.status_code, resp.headers.get("www-authenticate")
+        except Exception as exc:
+            verbose_logger.debug(
+                f"_probe_upstream_auth: probe to {url} failed ({exc}), allowing request through"
+            )
+            return 200, None


Probe bypasses AsyncHTTPHandler.post() and will silently fail-open if refactored

AsyncHTTPHandler.post() calls response.raise_for_status() internally, so calling client.post() for a 401/403 upstream would raise httpx.HTTPStatusError, which would then be caught by the broad except Exception block and cause the probe to return (200, None) — silently defeating the entire feature. The workaround (client.client.post) accesses the internal httpx client directly (confirmed by the # type: ignore[attr-defined]) to avoid this. If AsyncHTTPHandler is ever refactored (e.g., the self.client attribute renamed), the probe fails-open on every request with no warning.

The correct fix is to use the public AsyncHTTPHandler.post() and handle the status error before the catch-all — add import httpx and restructure as:

resp = await client.post(url, headers=..., json=..., timeout=timeout)

Add except httpx.HTTPStatusError as exc: return exc.response.status_code, exc.response.headers.get("www-authenticate") before the general except Exception

Addressed in 560c22b — the probe now uses the public AsyncHTTPHandler.post() and catches httpx.HTTPStatusError explicitly before the broad fail-open except Exception, so a 401/403 from upstream is no longer at risk of being silently swallowed if the handler refactors its internal client.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Pre-flight 401 becomes 504 timeout on toolset/dynamic routes
- Propagated pre-header handler exceptions through the ASGI bridge so upstream 401/403 responses preserve their status and headers instead of timing out.

Preview (851d9f6628)

diff --git a/litellm/proxy/_experimental/mcp_server/server.py b/litellm/proxy/_experimental/mcp_server/server.py
--- a/litellm/proxy/_experimental/mcp_server/server.py
+++ b/litellm/proxy/_experimental/mcp_server/server.py
@@ -51,13 +51,17 @@
     get_server_prefix,
     iter_known_server_prefixes,
 )
+from litellm.llms.custom_httpx.http_handler import (
+    get_async_httpx_client,
+    httpxSpecialProvider,
+)
 from litellm.proxy._types import UserAPIKeyAuth
 from litellm.proxy.auth.ip_address_utils import IPAddressUtils
 from litellm.proxy.litellm_pre_call_utils import (
     LiteLLMProxyRequestSetup,
     get_chain_id_from_headers,
 )
-from litellm.types.mcp import MCPAuth
+from litellm.types.mcp import MCPAuth, MCPSpecVersion
 from litellm.types.mcp_server.mcp_server_manager import MCPInfo, MCPServer
 from litellm.types.utils import CallTypes, StandardLoggingMCPToolCall
 from litellm.utils import Rules, client, function_setup
@@ -2754,6 +2758,144 @@
             )
         return user_api_key_auth.model_copy(update={"object_permission": updated_op})
 
+    def _get_forwarded_auth_from_scope(scope: Scope) -> Optional[str]:
+        """Return the upstream-bound ``Authorization`` header value, or None.
+
+        Only returns the ``Authorization`` header when ``x-litellm-api-key`` is
+        also present. In that case ``Authorization`` is unambiguously the
+        upstream token the caller wants forwarded to the MCP server. When
+        ``x-litellm-api-key`` is absent the ``Authorization`` header may itself
+        be the LiteLLM proxy API key (backward-compat path in
+        ``MCPRequestHandler.process_mcp_request``), and forwarding it upstream
+        would leak the proxy key to a third-party MCP server.
+        """
+        authorization = None
+        has_litellm_key_header = False
+        for key, value in scope.get("headers", []):
+            key_lower = key.lower()
+            if key_lower == b"authorization":
+                authorization = value.decode("latin-1")
+            elif key_lower == b"x-litellm-api-key":
+                has_litellm_key_header = True
+        if not has_litellm_key_header:
+            return None
+        return authorization
+
+    async def _probe_upstream_auth(
+        url: str,
+        auth_header: str,
+        timeout: float = 5.0,
+    ) -> tuple:
+        """JSON-RPC initialize-probe the upstream URL to check whether the token is accepted.
+
+        Uses POST so StreamableHTTP MCP servers run the same auth path as a
+        real client request. Returns (status_code, www_authenticate).
+        Fails-open with (200, None) on network errors so a transient hiccup
+        does not block valid requests.
+        """
+        try:
+            client = get_async_httpx_client(
+                llm_provider=httpxSpecialProvider.MCP,
+                params={"timeout": timeout},
+            )
+            probe_payload = {
+                "jsonrpc": "2.0",
+                "id": "litellm-mcp-auth-probe",
+                "method": "initialize",
+                "params": {
+                    "protocolVersion": MCPSpecVersion.jun_2025.value,
+                    "capabilities": {},
+                    "clientInfo": {
+                        "name": "litellm-mcp-auth-probe",
+                        "version": "1.0.0",
+                    },
+                },
+            }
+            resp = await client.client.post(  # type: ignore[attr-defined]
+                url,
+                headers={
+                    "Authorization": auth_header,
+                    "Accept": "application/json, text/event-stream",
+                },
+                json=probe_payload,
+            )
+            return resp.status_code, resp.headers.get("www-authenticate")
+        except Exception as exc:
+            verbose_logger.debug(
+                f"_probe_upstream_auth: probe to {url} failed ({exc}), allowing request through"
+            )
+            return 200, None
+
+    async def _check_passthrough_upstream_auth(
+        scope: Scope,
+        user_api_key_auth: Optional[UserAPIKeyAuth],
+        mcp_servers: Optional[List[str]],
+        client_ip: Optional[str],
+    ) -> None:
+        """Probe pass-through upstream servers in parallel before the MCP session starts.
+
+        Only servers the caller's key is already authorized to reach are probed —
+        the list is derived from _get_allowed_mcp_servers so that a user cannot
+        trigger an upstream probe against a server their key is not permitted for.
+
+        The MCP SDK commits HTTP 200 headers before invoking handlers, so a 401
+        can only be returned before that point. This function raises HTTPException(401)
+        with a WWW-Authenticate header if any upstream rejects the client token.
+        Fails-open: network errors are logged and the request is allowed through.
+        """
+        forwarded_auth = _get_forwarded_auth_from_scope(scope)
+        if not forwarded_auth:
+            return
+
+        # Use the authorized server set, not the raw user-supplied names, so that
+        # a caller cannot force a probe to a server their key is not allowed to use.
+        allowed_servers = await _get_allowed_mcp_servers(
+            user_api_key_auth=user_api_key_auth,
+            mcp_servers=mcp_servers,
+            client_ip=client_ip,
+        )
+        passthrough_servers = [
+            srv
+            for srv in allowed_servers
+            if srv.extra_headers
+            and any(h.lower() == "authorization" for h in srv.extra_headers)
+            # Exclude M2M servers: _prepare_mcp_server_headers skips caller
+            # Authorization when has_client_credentials is set, so probing
+            # those with the caller's token would send the wrong credential.
+            and not srv.has_client_credentials
+        ]
+        if not passthrough_servers:
+            return
+
+        probe_results = await asyncio.gather(
+            *[
+                _probe_upstream_auth(srv.url or "", forwarded_auth)
+                for srv in passthrough_servers
+            ]
+        )
+        request = StarletteRequest(scope)
+        base_url = get_request_base_url(request)
+        for srv, (probe_status, _) in zip(passthrough_servers, probe_results):
+            if probe_status == 401:
+                # Token is missing or expired — direct the client to re-authorize.
+                authorization_uri = (
+                    f"Bearer authorization_uri="
+                    f"{base_url}/.well-known/oauth-authorization-server/{srv.name}"
+                )
+                raise HTTPException(
+                    status_code=401,
+                    detail="Unauthorized",
+                    headers={"WWW-Authenticate": authorization_uri},
+                )
+            if probe_status == 403:
+                # Token is valid but the caller lacks permission — do not hint
+                # at re-authorization (RFC 9110: a fresh token with the same
+                # scopes would just hit 403 again and loop indefinitely).
+                raise HTTPException(
+                    status_code=403,
+                    detail="Forbidden",
+                )
+
     async def handle_streamable_http_mcp(
         scope: Scope, receive: Receive, send: Send
     ) -> None:
@@ -2827,6 +2969,13 @@
                     user_api_key_auth, active_toolset_id
                 )
 
+            # Pre-flight auth check for pass-through servers.  Must run after
+            # toolset scoping so the probe list is derived from the fully-authorized
+            # server set, not the raw user-supplied names.
+            await _check_passthrough_upstream_auth(
+                scope, user_api_key_auth, mcp_servers, _client_ip
+            )
+
             # Inject masked debug headers when client sends x-litellm-mcp-debug: true
             _debug_headers = MCPDebug.maybe_build_debug_headers(
                 raw_headers=raw_headers,

diff --git a/litellm/proxy/proxy_server.py b/litellm/proxy/proxy_server.py
--- a/litellm/proxy/proxy_server.py
+++ b/litellm/proxy/proxy_server.py
@@ -15006,10 +15006,19 @@
     # If the handler task dies (exception or cancellation) without sending the EOF
     # sentinel, body_iter() would block forever on body_queue.get().  The callback
     # below guarantees the queue gets unblocked regardless of how the task ends.
+    # When this happens before response headers, propagate the original exception
+    # instead of waiting for the header timeout.
     def _ensure_eof(task: asyncio.Task) -> None:
-        if task.cancelled() or task.exception() is not None:
+        if task.cancelled():
             body_queue.put_nowait(None)
+            return
 
+        task_exception = task.exception()
+        if task_exception is not None:
+            if not headers_ready.done():
+                headers_ready.set_exception(task_exception)
+            body_queue.put_nowait(None)
+
     handler_task.add_done_callback(_ensure_eof)
 
     try:

diff --git a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
--- a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
+++ b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
@@ -3256,3 +3256,101 @@
     ), "P2 API consistency issue: expected None for empty extra_headers, got: " + str(
         captured_extra_headers
     )
+
+
+# ---------------------------------------------------------------------------
+# Pre-flight upstream auth check tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_returns_upstream_status():
+    """_probe_upstream_auth forwards the status code from the upstream server."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_response = MagicMock()
+    mock_response.status_code = 401
+    mock_response.headers = {"www-authenticate": 'Bearer realm="test"'}
+
+    mock_client = AsyncMock()
+    mock_client.client.post = AsyncMock(return_value=mock_response)
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 401
+    assert www_auth == 'Bearer realm="test"'
+    mock_client.client.post.assert_awaited_once()
+    _, kwargs = mock_client.client.post.call_args
+    assert kwargs["headers"]["Authorization"] == "Bearer some-token"
+    assert kwargs["json"]["method"] == "initialize"
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_fails_open_on_network_error():
+    """_probe_upstream_auth returns (200, None) when the network call fails."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_client = AsyncMock()
+    mock_client.client.post = AsyncMock(side_effect=Exception("connection refused"))
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 200
+    assert www_auth is None
+
+
+def test_get_forwarded_auth_from_scope_extracts_header():
+    """Returns Authorization value when x-litellm-api-key is also present."""
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    scope = {
+        "headers": [
+            (b"content-type", b"application/json"),
+            (b"x-litellm-api-key", b"sk-litellm-proxy-key"),
+            (b"authorization", b"Bearer my-token"),
+        ]
+    }
+    assert _get_forwarded_auth_from_scope(scope) == "Bearer my-token"
+
+
+def test_get_forwarded_auth_from_scope_returns_none_when_missing():
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    assert _get_forwarded_auth_from_scope({"headers": []}) is None
+
+
+def test_get_forwarded_auth_from_scope_skips_when_no_litellm_key_header():
+    """Skip when ``x-litellm-api-key`` is absent.
+
+    Without ``x-litellm-api-key``, the ``Authorization`` header may itself be
+    the LiteLLM proxy API key (backward-compat). Forwarding it upstream would
+    leak the proxy key, so the helper must return None and the probe must
+    not fire.
+    """
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    scope = {
+        "headers": [
+            (b"content-type", b"application/json"),
+            (b"authorization", b"Bearer ambiguous-token"),
+        ]
+    }
+    assert _get_forwarded_auth_from_scope(scope) is None

diff --git a/tests/test_litellm/proxy/test_mcp_asgi_response.py b/tests/test_litellm/proxy/test_mcp_asgi_response.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/proxy/test_mcp_asgi_response.py
@@ -1,0 +1,36 @@
+import asyncio
+
+import pytest
+from fastapi import HTTPException
+
+from litellm.proxy.proxy_server import _stream_mcp_asgi_response
+
+
+@pytest.mark.asyncio
+async def test_stream_mcp_asgi_response_propagates_pre_header_http_exception():
+    async def handle_fn(_scope, _receive, _send):
+        raise HTTPException(
+            status_code=401,
+            detail="Unauthorized",
+            headers={
+                "WWW-Authenticate": "Bearer authorization_uri=https://example.test/auth"
+            },
+        )
+
+    async def receive():
+        return {"type": "http.request", "body": b"", "more_body": False}
+
+    with pytest.raises(HTTPException) as exc_info:
+        await asyncio.wait_for(
+            _stream_mcp_asgi_response(
+                handle_fn,
+                {"type": "http", "method": "POST", "path": "/mcp", "headers": []},
+                receive,
+            ),
+            timeout=1.0,
+        )
+
+    assert exc_info.value.status_code == 401
+    assert exc_info.value.headers == {
+        "WWW-Authenticate": "Bearer authorization_uri=https://example.test/auth"
+    }

_{You can send follow-ups to the cloud agent here.}

Co-authored-by: Yassin Kortam <yassin@berri.ai>

Use AsyncHTTPHandler.post() and catch httpx.HTTPStatusError explicitly so the 401/403 we want to surface is not silently swallowed by the broad fail-open except Exception block. Avoids reaching into the handler's private client attribute, which would silently regress to fail-open if AsyncHTTPHandler is ever refactored.

mateo-berri · 2026-05-13T17:27:51Z

@greptileai

greptile-apps · 2026-05-13T17:32:12Z

+    mock_client = AsyncMock()
+    mock_client.client.post = AsyncMock(return_value=mock_response)
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 401
+    assert www_auth == 'Bearer realm="test"'
+    mock_client.client.post.assert_awaited_once()
+    _, kwargs = mock_client.client.post.call_args
+    assert kwargs["headers"]["Authorization"] == "Bearer some-token"
+    assert kwargs["json"]["method"] == "initialize"
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_fails_open_on_network_error():
+    """_probe_upstream_auth returns (200, None) when the network call fails."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_client = AsyncMock()
+    mock_client.client.post = AsyncMock(side_effect=Exception("connection refused"))
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 200
+    assert www_auth is None


Probe tests mock the internal client instead of the public method

Both probe tests set up mock_client.client.post, but _probe_upstream_auth now calls the public client.post(...) (AsyncHTTPHandler.post()), which is a completely separate mock attribute. AsyncHTTPHandler.post() calls self.client.send() then raise_for_status(), meaning a real 401 would arrive as an httpx.HTTPStatusError, not as a response object.

For test_probe_upstream_auth_returns_upstream_status: await mock_client.post(...) returns mock_client.post.return_value (an AsyncMock), so resp.status_code is a MagicMock, not 401; the assertion fails.

For test_probe_upstream_auth_fails_open_on_network_error: mock_client.client.post is wired to raise but mock_client.post is not — no exception is ever thrown, so the fail-open path is never exercised and status is again a MagicMock rather than 200.

To correctly test the 401 case, mock_client.post should be configured to raise httpx.HTTPStatusError with a 401 response (since AsyncHTTPHandler.post() calls raise_for_status()). For the fail-open test, mock_client.post itself should be wired with side_effect=Exception(...).

Addressed in c07b62a + 28eda2a — tests now mock client.post (matching the production call) and there is a new test_probe_upstream_auth_surfaces_httpx_status_error that exercises the httpx.HTTPStatusError path raised by raise_for_status().

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Tests mock wrong attribute path, never exercising production code
- Updated the auth-probe tests to mock and assert AsyncHTTPHandler.post directly, so the production call path and network-error branch are exercised.

Preview (28eda2a662)

diff --git a/litellm/proxy/_experimental/mcp_server/server.py b/litellm/proxy/_experimental/mcp_server/server.py
--- a/litellm/proxy/_experimental/mcp_server/server.py
+++ b/litellm/proxy/_experimental/mcp_server/server.py
@@ -23,6 +23,7 @@
     cast,
 )
 
+import httpx
 from fastapi import FastAPI, HTTPException
 from pydantic import AnyUrl, ConfigDict
 from starlette.requests import Request as StarletteRequest
@@ -51,13 +52,17 @@
     get_server_prefix,
     iter_known_server_prefixes,
 )
+from litellm.llms.custom_httpx.http_handler import (
+    get_async_httpx_client,
+    httpxSpecialProvider,
+)
 from litellm.proxy._types import UserAPIKeyAuth
 from litellm.proxy.auth.ip_address_utils import IPAddressUtils
 from litellm.proxy.litellm_pre_call_utils import (
     LiteLLMProxyRequestSetup,
     get_chain_id_from_headers,
 )
-from litellm.types.mcp import MCPAuth
+from litellm.types.mcp import MCPAuth, MCPSpecVersion
 from litellm.types.mcp_server.mcp_server_manager import MCPInfo, MCPServer
 from litellm.types.utils import CallTypes, StandardLoggingMCPToolCall
 from litellm.utils import Rules, client, function_setup
@@ -2754,6 +2759,157 @@
             )
         return user_api_key_auth.model_copy(update={"object_permission": updated_op})
 
+    def _get_forwarded_auth_from_scope(scope: Scope) -> Optional[str]:
+        """Return the upstream-bound ``Authorization`` header value, or None.
+
+        Only returns the ``Authorization`` header when ``x-litellm-api-key`` is
+        also present. In that case ``Authorization`` is unambiguously the
+        upstream token the caller wants forwarded to the MCP server. When
+        ``x-litellm-api-key`` is absent the ``Authorization`` header may itself
+        be the LiteLLM proxy API key (backward-compat path in
+        ``MCPRequestHandler.process_mcp_request``), and forwarding it upstream
+        would leak the proxy key to a third-party MCP server.
+        """
+        authorization = None
+        has_litellm_key_header = False
+        for key, value in scope.get("headers", []):
+            key_lower = key.lower()
+            if key_lower == b"authorization":
+                authorization = value.decode("latin-1")
+            elif key_lower == b"x-litellm-api-key":
+                has_litellm_key_header = True
+        if not has_litellm_key_header:
+            return None
+        return authorization
+
+    async def _probe_upstream_auth(
+        url: str,
+        auth_header: str,
+        timeout: float = 5.0,
+    ) -> tuple:
+        """JSON-RPC initialize-probe the upstream URL to check whether the token is accepted.
+
+        Uses POST so StreamableHTTP MCP servers run the same auth path as a
+        real client request. Returns (status_code, www_authenticate).
+        Fails-open with (200, None) on network errors so a transient hiccup
+        does not block valid requests.
+
+        Uses the public ``AsyncHTTPHandler.post()`` interface and catches
+        ``httpx.HTTPStatusError`` separately so the 401/403 we want to surface
+        is not swallowed by the broad fail-open ``except Exception`` below.
+        """
+        client = get_async_httpx_client(
+            llm_provider=httpxSpecialProvider.MCP,
+            params={"timeout": timeout},
+        )
+        probe_payload = {
+            "jsonrpc": "2.0",
+            "id": "litellm-mcp-auth-probe",
+            "method": "initialize",
+            "params": {
+                "protocolVersion": MCPSpecVersion.jun_2025.value,
+                "capabilities": {},
+                "clientInfo": {
+                    "name": "litellm-mcp-auth-probe",
+                    "version": "1.0.0",
+                },
+            },
+        }
+        probe_headers = {
+            "Authorization": auth_header,
+            "Accept": "application/json, text/event-stream",
+        }
+        try:
+            resp = await client.post(
+                url=url,
+                headers=probe_headers,
+                json=probe_payload,
+                timeout=timeout,
+            )
+            return resp.status_code, resp.headers.get("www-authenticate")
+        except httpx.HTTPStatusError as exc:
+            # AsyncHTTPHandler.post() calls raise_for_status(); a 401/403 from
+            # upstream lands here. Return its status so the caller can map it
+            # to the appropriate response.
+            return exc.response.status_code, exc.response.headers.get(
+                "www-authenticate"
+            )
+        except Exception as exc:
+            verbose_logger.debug(
+                f"_probe_upstream_auth: probe to {url} failed ({exc}), allowing request through"
+            )
+            return 200, None
+
+    async def _check_passthrough_upstream_auth(
+        scope: Scope,
+        user_api_key_auth: Optional[UserAPIKeyAuth],
+        mcp_servers: Optional[List[str]],
+        client_ip: Optional[str],
+    ) -> None:
+        """Probe pass-through upstream servers in parallel before the MCP session starts.
+
+        Only servers the caller's key is already authorized to reach are probed —
+        the list is derived from _get_allowed_mcp_servers so that a user cannot
+        trigger an upstream probe against a server their key is not permitted for.
+
+        The MCP SDK commits HTTP 200 headers before invoking handlers, so a 401
+        can only be returned before that point. This function raises HTTPException(401)
+        with a WWW-Authenticate header if any upstream rejects the client token.
+        Fails-open: network errors are logged and the request is allowed through.
+        """
+        forwarded_auth = _get_forwarded_auth_from_scope(scope)
+        if not forwarded_auth:
+            return
+
+        # Use the authorized server set, not the raw user-supplied names, so that
+        # a caller cannot force a probe to a server their key is not allowed to use.
+        allowed_servers = await _get_allowed_mcp_servers(
+            user_api_key_auth=user_api_key_auth,
+            mcp_servers=mcp_servers,
+            client_ip=client_ip,
+        )
+        passthrough_servers = [
+            srv
+            for srv in allowed_servers
+            if srv.extra_headers
+            and any(h.lower() == "authorization" for h in srv.extra_headers)
+            # Exclude M2M servers: _prepare_mcp_server_headers skips caller
+            # Authorization when has_client_credentials is set, so probing
+            # those with the caller's token would send the wrong credential.
+            and not srv.has_client_credentials
+        ]
+        if not passthrough_servers:
+            return
+
+        probe_results = await asyncio.gather(
+            *[
+                _probe_upstream_auth(srv.url or "", forwarded_auth)
+                for srv in passthrough_servers
+            ]
+        )
+        request = StarletteRequest(scope)
+        base_url = get_request_base_url(request)
+        for srv, (probe_status, _) in zip(passthrough_servers, probe_results):
+            if probe_status == 401:
+                # Token is missing or expired — direct the client to re-authorize.
+                authorization_uri = (
+                    f"Bearer authorization_uri="
+                    f"{base_url}/.well-known/oauth-authorization-server/{srv.name}"
+                )
+                raise HTTPException(
+                    status_code=401,
+                    detail="Unauthorized",
+                    headers={"WWW-Authenticate": authorization_uri},
+                )
+            if probe_status == 403:
+                # Token is valid but the caller lacks permission — do not hint
+                # at re-authorization (RFC 9110: a fresh token with the same
+                # scopes would just hit 403 again and loop indefinitely).
+                raise HTTPException(
+                    status_code=403,
+                    detail="Forbidden",
+                )
+
     async def handle_streamable_http_mcp(
         scope: Scope, receive: Receive, send: Send
     ) -> None:
@@ -2827,6 +2983,13 @@
                     user_api_key_auth, active_toolset_id
                 )
 
+            # Pre-flight auth check for pass-through servers.  Must run after
+            # toolset scoping so the probe list is derived from the fully-authorized
+            # server set, not the raw user-supplied names.
+            await _check_passthrough_upstream_auth(
+                scope, user_api_key_auth, mcp_servers, _client_ip
+            )
+
             # Inject masked debug headers when client sends x-litellm-mcp-debug: true
             _debug_headers = MCPDebug.maybe_build_debug_headers(
                 raw_headers=raw_headers,

diff --git a/litellm/proxy/proxy_server.py b/litellm/proxy/proxy_server.py
--- a/litellm/proxy/proxy_server.py
+++ b/litellm/proxy/proxy_server.py
@@ -15006,10 +15006,19 @@
     # If the handler task dies (exception or cancellation) without sending the EOF
     # sentinel, body_iter() would block forever on body_queue.get().  The callback
     # below guarantees the queue gets unblocked regardless of how the task ends.
+    # When this happens before response headers, propagate the original exception
+    # instead of waiting for the header timeout.
     def _ensure_eof(task: asyncio.Task) -> None:
-        if task.cancelled() or task.exception() is not None:
+        if task.cancelled():
             body_queue.put_nowait(None)
+            return
 
+        task_exception = task.exception()
+        if task_exception is not None:
+            if not headers_ready.done():
+                headers_ready.set_exception(task_exception)
+            body_queue.put_nowait(None)
+
     handler_task.add_done_callback(_ensure_eof)
 
     try:

diff --git a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
--- a/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
+++ b/tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py
@@ -3256,3 +3256,137 @@
     ), "P2 API consistency issue: expected None for empty extra_headers, got: " + str(
         captured_extra_headers
     )
+
+
+# ---------------------------------------------------------------------------
+# Pre-flight upstream auth check tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_returns_upstream_status():
+    """_probe_upstream_auth forwards the status code from the upstream server."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_response = MagicMock()
+    mock_response.status_code = 401
+    mock_response.headers = {"www-authenticate": 'Bearer realm="test"'}
+
+    mock_client = MagicMock()
+    mock_client.post = AsyncMock(return_value=mock_response)
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 401
+    assert www_auth == 'Bearer realm="test"'
+    mock_client.post.assert_awaited_once()
+    _, kwargs = mock_client.post.call_args
+    assert kwargs["headers"]["Authorization"] == "Bearer some-token"
+    assert kwargs["json"]["method"] == "initialize"
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_surfaces_httpx_status_error():
+    """Probe extracts status + WWW-Authenticate from httpx.HTTPStatusError.
+
+    AsyncHTTPHandler.post() calls raise_for_status() internally, so when the
+    upstream returns 401/403 the call raises httpx.HTTPStatusError rather than
+    returning the response. The probe must catch that specifically (before the
+    fail-open `except Exception`) so the auth check is not silently defeated.
+    """
+    import httpx
+
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_response = MagicMock()
+    mock_response.status_code = 401
+    mock_response.headers = {"www-authenticate": 'Bearer realm="test"'}
+    request = httpx.Request("POST", "http://upstream/mcp")
+    error = httpx.HTTPStatusError(
+        message="401 Unauthorized", request=request, response=mock_response
+    )
+
+    mock_client = MagicMock()
+    mock_client.post = AsyncMock(side_effect=error)
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 401
+    assert www_auth == 'Bearer realm="test"'
+
+
+@pytest.mark.asyncio
+async def test_probe_upstream_auth_fails_open_on_network_error():
+    """_probe_upstream_auth returns (200, None) when the network call fails."""
+    from litellm.proxy._experimental.mcp_server.server import _probe_upstream_auth
+
+    mock_client = MagicMock()
+    mock_client.post = AsyncMock(side_effect=Exception("connection refused"))
+
+    with patch(
+        "litellm.proxy._experimental.mcp_server.server.get_async_httpx_client",
+        return_value=mock_client,
+    ):
+        status, www_auth = await _probe_upstream_auth(
+            "http://upstream/mcp", "Bearer some-token"
+        )
+
+    assert status == 200
+    assert www_auth is None
+
+
+def test_get_forwarded_auth_from_scope_extracts_header():
+    """Returns Authorization value when x-litellm-api-key is also present."""
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    scope = {
+        "headers": [
+            (b"content-type", b"application/json"),
+            (b"x-litellm-api-key", b"sk-litellm-proxy-key"),
+            (b"authorization", b"Bearer my-token"),
+        ]
+    }
+    assert _get_forwarded_auth_from_scope(scope) == "Bearer my-token"
+
+
+def test_get_forwarded_auth_from_scope_returns_none_when_missing():
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    assert _get_forwarded_auth_from_scope({"headers": []}) is None
+
+
+def test_get_forwarded_auth_from_scope_skips_when_no_litellm_key_header():
+    """Skip when ``x-litellm-api-key`` is absent.
+
+    Without ``x-litellm-api-key``, the ``Authorization`` header may itself be
+    the LiteLLM proxy API key (backward-compat). Forwarding it upstream would
+    leak the proxy key, so the helper must return None and the probe must
+    not fire.
+    """
+    from litellm.proxy._experimental.mcp_server.server import (
+        _get_forwarded_auth_from_scope,
+    )
+
+    scope = {
+        "headers": [
+            (b"content-type", b"application/json"),
+            (b"authorization", b"Bearer ambiguous-token"),
+        ]
+    }
+    assert _get_forwarded_auth_from_scope(scope) is None

diff --git a/tests/test_litellm/proxy/test_mcp_asgi_response.py b/tests/test_litellm/proxy/test_mcp_asgi_response.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/proxy/test_mcp_asgi_response.py
@@ -1,0 +1,36 @@
+import asyncio
+
+import pytest
+from fastapi import HTTPException
+
+from litellm.proxy.proxy_server import _stream_mcp_asgi_response
+
+
+@pytest.mark.asyncio
+async def test_stream_mcp_asgi_response_propagates_pre_header_http_exception():
+    async def handle_fn(_scope, _receive, _send):
+        raise HTTPException(
+            status_code=401,
+            detail="Unauthorized",
+            headers={
+                "WWW-Authenticate": "Bearer authorization_uri=https://example.test/auth"
+            },
+        )
+
+    async def receive():
+        return {"type": "http.request", "body": b"", "more_body": False}
+
+    with pytest.raises(HTTPException) as exc_info:
+        await asyncio.wait_for(
+            _stream_mcp_asgi_response(
+                handle_fn,
+                {"type": "http", "method": "POST", "path": "/mcp", "headers": []},
+                receive,
+            ),
+            timeout=1.0,
+        )
+
+    assert exc_info.value.status_code == 401
+    assert exc_info.value.headers == {
+        "WWW-Authenticate": "Bearer authorization_uri=https://example.test/auth"
+    }

_{You can send follow-ups to the cloud agent here.}

^{Reviewed by Cursor Bugbot for commit 560c22b. Configure here.}

Co-authored-by: Yassin Kortam <yassin@berri.ai>

AsyncHTTPHandler.post() calls raise_for_status() internally, so a real upstream 401/403 lands as httpx.HTTPStatusError. Add a test that exercises that specific exception path so a regression that swallows the error in the broad fail-open except Exception would be caught.

mateo-berri · 2026-05-13T17:55:46Z

@greptileai please re-review.

Latest fixes:

560c22b — probe now uses public AsyncHTTPHandler.post() and catches httpx.HTTPStatusError explicitly so a real 401/403 is not swallowed by the broad fail-open block.
c07b62a + 28eda2a — probe tests now mock client.post directly (matching the production call) and new test_probe_upstream_auth_surfaces_httpx_status_error covers the HTTPStatusError path.

mateo-berri

LGTM; thanks!

* fix(proxy): always merge caller-supplied tags into request metadata Caller-supplied tags (`x-litellm-tags` header, body `tags`, `metadata.tags`) were silently dropped unless the key/team had `metadata.allow_client_tags: true` set. Restore the documented behavior: tags from the request always flow into `metadata.tags` and union with any admin-configured static tags from key/team/project metadata. Removes the `allow_client_tags` opt-in flag from the pre-call pipeline. The flag was only ever read here; it has no schema or endpoint footprint, so leftover values in existing key metadata are inert. Test cleanup mirrors the simplification: drop the three tests that verified the strip-when-not-opted-in path, drop the `allow_client_tags` fixture lines from the merge/union tests. * docs(proxy): refresh stale comments referencing removed tag strip The tag-strip block was removed in the parent commit but two surrounding comments still referenced "tags without opt-in" and "runs AFTER the strip". Update them to describe the remaining user_api_key_* and _pipeline_managed_guardrails strip that the snapshot/merge ordering actually protects against. * fix(tests): swap dall-e to gpt-image-1 after openai deprecation DALL-E 2 and DALL-E 3 were removed from the OpenAI API on 2026-05-12, causing e2e image-generation tests to fail with "model does not exist". Swap all live-API DALL-E references in proxy-backed tests to gpt-image-1 and update the dall-e-2 alias in proxy_server_config.yaml to point at openai/gpt-image-1 (preserves any historical dall-e-2 callers). * fix(tests): drop dall-e-only test classes; route live image tests via gpt-image-1 Second wave of failures from the 2026-05-12 DALL-E shutdown: - tests/image_gen_tests/test_image_edits.py::TestOpenAIImageEditDallE2 and tests/image_gen_tests/test_image_generation.py::TestOpenAIDalle3 are explicitly named for the deprecated models and can't pass; remove. gpt-image-1 coverage already exists in sibling classes. - tests/local_testing/test_router.py image gen tests use dall-e-3 only as a routing example; swap to gpt-image-1. - tests/local_testing/test_custom_callback_input.py image_generation success/failure paths swapped to gpt-image-1. * chore: reject bare str at file-input sinks to prevent local-file read (#27762) * chore: reject bare str at file-input sinks to prevent local-file read (#27667) Squash-merged by litellm-agent from stuxf's PR. * fix: use os.PathLike in ocr sink and check truthy reasoningSummary for bridge - ocr/main.py: widen Path check to os.PathLike for consistency with other sinks - main.py: bridge condition checks truthiness of reasoning_summary, not just None Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * fix: remove unused pathlib.Path import in ocr/main.py --------- Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: ryan-crabbe-berri <ryan@berri.ai> Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * fix(tests): swap dall-e to gpt-image-1 after openai deprecation DALL-E 2 and DALL-E 3 were removed from the OpenAI API on 2026-05-12, causing e2e image-generation tests to fail with "model does not exist". Swap all live-API DALL-E references in proxy-backed tests to gpt-image-1 and update the dall-e-2 alias in proxy_server_config.yaml to point at openai/gpt-image-1 (preserves any historical dall-e-2 callers). * fix(tests): drop dall-e-only test classes; route live image tests via gpt-image-1 Second wave of failures from the 2026-05-12 DALL-E shutdown: - tests/image_gen_tests/test_image_edits.py::TestOpenAIImageEditDallE2 and tests/image_gen_tests/test_image_generation.py::TestOpenAIDalle3 are explicitly named for the deprecated models and can't pass; remove. gpt-image-1 coverage already exists in sibling classes. - tests/local_testing/test_router.py image gen tests use dall-e-3 only as a routing example; swap to gpt-image-1. - tests/local_testing/test_custom_callback_input.py image_generation success/failure paths swapped to gpt-image-1. * fix(proxy): always merge caller-supplied tags into request metadata Caller-supplied tags (`x-litellm-tags` header, body `tags`, `metadata.tags`) were silently dropped unless the key/team had `metadata.allow_client_tags: true` set. Restore the documented behavior: tags from the request always flow into `metadata.tags` and union with any admin-configured static tags from key/team/project metadata. Removes the `allow_client_tags` opt-in flag from the pre-call pipeline. The flag was only ever read here; it has no schema or endpoint footprint, so leftover values in existing key metadata are inert. Test cleanup mirrors the simplification: drop the three tests that verified the strip-when-not-opted-in path, drop the `allow_client_tags` fixture lines from the merge/union tests. * docs(proxy): refresh stale comments referencing removed tag strip The tag-strip block was removed in the parent commit but two surrounding comments still referenced "tags without opt-in" and "runs AFTER the strip". Update them to describe the remaining user_api_key_* and _pipeline_managed_guardrails strip that the snapshot/merge ordering actually protects against. * feat(ui): add Vertex AI Search as vector store provider (#27790) * feat(ui): add Vertex AI Search as vector store provider Adds a "Vertex AI Search" entry to the provider dropdown (custom_llm_provider=vertex_ai/search_api) with fields for project, location (global/us/eu select), and optional collection ID. Extends VectorStoreFieldConfig with `options` so select fields can be data-driven instead of falling through to the embedding-model list. * fix(ui): clarify vertex_collection_id placeholder copy Placeholder previously displayed "default_collection" — the literal fallback value — which invited users to type it instead of leaving the field blank. Switch to an example placeholder and tighten the tooltip. * Litellm key rotation bug (#27756) * fix(proxy): resolve cache handling issues in _lookup_deprecated_key - Updated the in-memory cache for deprecated key lookups to store a 3-tuple (active_token_id, cache_expires_at_ts, revoke_at_ts) instead of a 2-tuple, ensuring proper unpacking and backward compatibility. - Removed duplicate cache reads and added logic to handle legacy cache entries gracefully. - Enhanced unit tests to cover scenarios for cache hits, DB misses, and respect for revoke_at timestamps, ensuring robust handling of the grace-period key-rotation feature. * refactor(proxy): streamline cache handling in _lookup_deprecated_key - Simplified the cache retrieval logic by directly unpacking the 3-tuple cache entries, removing the need for backward compatibility checks for 2-tuple entries. - Updated unit tests to ensure that pre-warmed 3-tuple cache entries are served correctly without unnecessary database lookups. * chore(ci): add new unit test for deprecated key grace period - Included `test_deprecated_key_grace_period.py` in the CI workflow to enhance coverage for deprecated key handling scenarios. * fix(proxy): remove unnecessary check for revoke_at in _lookup_deprecated_key - Eliminated the redundant check for None on revoke_at, streamlining the logic for handling deprecated keys in the cache. This change enhances the efficiency of the key lookup process. * test(proxy): add end-to-end tests for deprecated key lookup behavior - Introduced a new test class `TestDeprecatedKeyLookupDbE2E` to validate the behavior of deprecated key lookups against a real Prisma-backed database. - The test ensures that old key hashes resolve correctly and that repeated lookups utilize the in-memory cache without errors. - Cleaned up the `_lookup_deprecated_key` function by removing an unnecessary check for `revoke_at`, enhancing the efficiency of the key lookup process. * chore(proxy): close /key/regenerate ownership-rebind + premium-gate bypass A non-admin caller could rebind their own key's ``user_id`` via ``/key/regenerate``. ``_execute_virtual_key_regeneration`` had org/team guards but no ``user_id`` guard, and ``prepare_key_update_data`` did not strip the field — it survived ``model_dump(exclude_unset=True)`` into the Prisma update. On the next request, ``_return_user_api_key_auth_obj`` resolved the rebound ``user_id`` against ``litellm_usertable`` and returned ``PROXY_ADMIN`` whenever the target row's ``user_role`` was admin (e.g. the default ``user_id="default_user_id"`` created on first password-UI login). ``/key/update`` had the equivalent guard inline at ``_validate_update_key_data``; extract it to a shared helper ``_validate_caller_can_change_key_ownership`` and call from both ``/key/update`` and ``_execute_virtual_key_regeneration``. Future regenerate-style endpoints inherit the guard for free. Also tighten the premium gate that allowed the master-key rotation branch to skip the enterprise check. The previous predicate was ``data.new_master_key is not None`` — a field-presence test, not an identity check. Any non-premium caller could send any value in that field and the premium check would no-op. Verify the caller actually holds the master key via ``_is_master_key`` before allowing the non-premium path. Tests: - ``test_regenerate_user_id_rebind_guard`` — parametrized table over cross-user rebind (blocked), empty-string removal (blocked), and same-user no-op rebind (allowed). - ``test_regenerate_premium_gate_requires_actual_master_key`` / ``test_regenerate_premium_gate_allows_actual_master_key_holder`` — ensure the premium check requires the caller actually present the master key, and that legitimate master-key rotation still works. * test(vcr): classify cache verdicts, detect live calls, surface cost leaks Convert the per-test VCR verdict line from a single 'NOOP / HIT / MISS / PARTIAL' tag into a classified outcome that distinguishes the cases that silently bill the live API on every CI run from the ones that don't: HIT pure replay PARTIAL mixed replay + new recordings MISS:RECORDED new cassette saved to Redis (cached next run) MISS:OVERFLOW cassette > MAX_EPISODES_PER_CASSETTE; persister refused to save; re-bills every run MISS:NOT_PERSISTED test failed; save_cassette skipped; re-bills NOOP VCR-marked but no HTTP traffic (mocked elsewhere) UNMARKED:LIVE_CALL test bypassed VCR AND opened a TCP connection to a known LLM provider host -> wasted spend UNMARKED:NO_TRAFFIC test bypassed VCR but didn't call out The UNMARKED:LIVE_CALL signal is what converts 'this test probably hits live' into 'this test connected to api.openai.com'. We install a socket.connect / socket.create_connection wrapper for the duration of each non-VCR-marked test and record any outbound TCP to a known LLM provider hostname. The probe sits below the httpx layer so vcrpy and respx (which both patch above the socket) are unaffected. Replace the file-level _RESPX_CONFLICTING_FILES blacklists in the llm_translation and local_testing conftests with per-item respx detection in apply_vcr_auto_marker_to_items. A test now skips VCR when it actually carries @pytest.mark.respx or has respx_mock in its fixture chain - not just because some other test in the same file imports MockRouter. Items skipped by skip_files are split into respx_conflict (real conflict, the module wires up respx) vs file_opt_out (dead skip- list entry whose module never touches respx) so the session summary makes pruning obvious. Stabilize the AWS SigV4 fingerprint: the Authorization header on Bedrock requests rotates its Credential date and Signature on every call, which previously pushed every Bedrock test past the 50-episode overflow threshold. Extract the access-key id only ('aws-sigv4:AKIA...') so two requests with the same identity match. Always emit verdict logging when VCR is active (set LITELLM_VCR_VERBOSE=0 to opt back into the legacy quiet mode). Add a session-end classification summary that lists overflow tests, unmarked live-call tests, and the skip-reason breakdown. Wire the live-call probe + summary hook into every test directory that already uses the Redis-backed VCR cache (audio_tests, guardrails_tests, image_gen_tests, litellm_utils_tests, llm_responses_api_testing, llm_translation, local_testing, logging_callback_tests, ocr_tests, pass_through_unit_tests, router_unit_tests, search_tests, unified_google_tests). Add tests/llm_translation/test_vcr_classification.py covering the verdict classifier, skip-reason tagging, AWS SigV4 fingerprint stability, live-host classification, and session summary rendering. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(vcr): drop dead 'from respx import MockRouter' imports These seven test files were on _RESPX_CONFLICTING_FILES, which made the auto-marker skip them entirely. Inspecting the source shows the only respx artifact is a top-level 'from respx import MockRouter' that no test ever uses - no @pytest.mark.respx, no respx_mock fixture, no respx.mock context manager. The import is dead code left over from a previous mocking pattern. Now that apply_vcr_auto_marker_to_items detects respx per-item via the marker / fixture chain (b637d9f64a), the file-level skip is no longer needed for these files - they were the reason the OpenAI tests (test_o3_reasoning_effort, test_streaming_response[o1/o3-mini], TestOpenAIO1::test_streaming, TestOpenAIChatCompletion::test_web_search, TestOpenAIO3::test_web_search, etc.) ran live every CI build despite the cassette cache being healthy. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(image_edits): regenerate fixtures per call instead of holding open module-level file handles Module-level TEST_IMAGES = [ open(os.path.join(pwd, 'ishaan_github.png'), 'rb'), open(os.path.join(pwd, 'litellm_site.png'), 'rb'), ] SINGLE_TEST_IMAGE = open(...) opens the file once at import. After the first multipart upload, the file pointer is at EOF, so every subsequent test in the same xdist worker sends an empty multipart body. That non-determinism (a) blows the recorded cassette past MAX_EPISODES_PER_CASSETTE (50) so _RedisPersister.save_cassette refuses to save it, and (b) re-bills the live image edit endpoint on every CI run. Recent CI runs confirm the leak: tests/image_gen_tests/test_image_edits.py shows six tests parking at 51-52 cassette entries (TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False], TestOpenAIImageEditDallE2::..., test_openai_image_edit_with_bytesio, test_openai_image_edit_litellm_router, test_multiple_vs_single_image_edit[False], test_multiple_image_edit_with_different_formats). Replace the module-level file handles with _make_test_images() / _make_single_test_image() factories that return fresh _RewindableImage (BytesIO subclass) objects whose pointer always starts at 0. The image bytes are read once at import into module-level constants (_ISHAAN_GITHUB_BYTES, _LITELLM_SITE_BYTES), so disk I/O cost is unchanged. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * chore(proxy): clarify ownership-rebind error message (actor vs target) Previous wording read "User=<new_owner> is not allowed to update the key to belong to user=<current_owner>" — easy to misread as "caller wants to keep the key on its current owner". Reframe as "Non-admin caller is not allowed to rebind the key from user=<existing> to user=<incoming>" so the direction of the failed operation is unambiguous. Same shape preserved (HTTPException 403); only the ``detail`` string changes. Regression test substring updated. * fix(vcr): match real Bedrock hostnames in live-call probe The suffix '.bedrock-runtime.amazonaws.com' never matched real Bedrock endpoints, which use the format 'bedrock-runtime[-fips].{region}.amazonaws.com' (region between 'bedrock-runtime' and 'amazonaws.com'). Add an explicit host check for that pattern so Bedrock live calls are visible to the probe, and update the unit test accordingly. Also drop the unused '_LIVE_CALL_PROBE_INSTALLED' module variable. * test(proxy): drop allow_client_tags opt-in gate and add credential rename cascade tests Removes the allow_client_tags metadata check from apply_client_tag_policy_pre_auth so x-litellm-tags headers are always merged into request metadata, matching the post-auth behavior in add_litellm_data_to_request. Updates pre-call tests accordingly and adds a new test suite covering cascading credential renames into model rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(proxy): block explicit-null user_id in ownership rebind guard ``model_dump(exclude_unset=True)`` in ``prepare_key_update_data`` includes any field the caller explicitly set, even when the value is ``None``. The previous guard short-circuited on ``getattr(data, 'user_id', None) is None``, which conflated "field omitted" (safe) with "field explicitly set to null" (writes NULL to the token row, detaching the key from its user and bypassing user-row role checks). Switch the omitted-vs-set distinction to ``data.model_fields_set``; treat explicit-null and explicit-empty-string identically as a removal attempt, both 403-rejected for non-admin callers. Parametrized regression adds ``explicit_null_blocked`` alongside the existing ``rebind_blocked`` / ``empty_blocked`` / ``same_user_id_allowed`` cases. * fix(vcr): cover full RFC1918 172.16.0.0/12 range in local prefixes * fix(image_edits): drop _RewindableImage to prevent infinite multipart upload The _RewindableImage(BytesIO) wrapper auto-rewound on every read after EOF, which made the OpenAI SDK's multipart upload writer read the same bytes forever instead of seeing EOF. Workers OOM'd / SIGKILL'd: [gw0] node down: Not properly terminated replacing crashed worker gw0 ... worker 'gw1' crashed while running 'tests/image_gen_tests/test_image_edits.py::TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False]' The auto-rewind was added defensively for parametrized + flaky-retried tests, but BaseLLMImageEditTest::test_openai_image_edit_litellm_sdk already calls get_base_image_edit_call_args() once per invocation and that helper now constructs fresh streams via _make_test_images(), so rewinding inside the stream is unnecessary. Replace with plain BytesIO seeded with the cached image bytes. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * chore(proxy): refuse remote-URL instance-fn loads outside config-file path ``get_instance_fn`` previously routed any ``s3://`` / ``gcs://`` value into ``_load_instance_from_remote_storage`` regardless of how the value got there. The function ultimately calls ``spec.loader.exec_module(module)`` — Python in the proxy process. On admin-callable endpoints that accept a ``target`` / ``custom_handler`` field from the request body (e.g. ``/config/pass_through_endpoint``, custom-callback registration), that is a one-step admin-to-RCE primitive: any future privilege-escalation bug becomes immediate code execution. The documented operator flow for remote-module loading is ``litellm_settings.callbacks: ["s3://bucket/module.instance"]`` in ``config.yaml``. That path always carries the YAML's ``config_file_path`` through to ``get_instance_fn``. Use the presence of ``config_file_path`` as the discriminator: refuse remote URLs when it is absent (the request-body path) unless the operator explicitly opts back in via ``LITELLM_ALLOW_REMOTE_INSTANCE_FN_FROM_API=true``. The three success/failure/audit-log callback-loop call sites in ``proxy_server.py:load_config`` were already running inside the startup config-file load but had stopped threading ``config_file_path`` through. Pass it through so the documented ``s3://`` callback flow continues to work unchanged. Tests cover: remote URL without ``config_file_path`` raises; remote URL with the opt-in env reaches the loader; remote URL with ``config_file_path`` passes (documented startup flow); local dotted-name imports unaffected. * fix(proxy): parse string metadata before pre-auth tag merge `apply_client_tag_policy_pre_auth` overwrote string-typed metadata with `{}` before merging header tags, dropping any tags inside. A caller could send `metadata='{"tags":["over-budget"]}'` plus `x-litellm-tags: within-budget` and bypass `_tag_max_budget_check` on the body tag. Parse the string via `safe_json_loads` first so existing tags survive the merge. Also drop the empty `tests/test_litellm/proxy/credential_endpoints/` directory — the cascade-rename tests it held imported a function that was never implemented (out of scope for this PR). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(tests): thread config_file_path through s3/gcs custom-logger tests The pre-existing s3:// / gcs:// custom-logger tests called ``get_instance_fn`` without ``config_file_path``, which means the new runtime gate (refuse remote URLs unless invoked from a config-file load) now raises ``ValueError`` before reaching the mocked download paths. Each test was exercising the documented startup config-file load scenario; pass ``config_file_path="/any/path"`` to make that intent explicit and route past the gate. Affected: test_s3_download_success, test_gcs_download_success, test_invalid_url_format, test_download_failure_handling, test_file_cleanup. * test(vcr): mark Bedrock prompt-caching cross-call tests VCR-incompatible The pass_through prompt-caching tests (test_prompt_caching_returns_cache_read_tokens_on_second_call, test_prompt_caching_streaming_second_call_returns_cache_read) make a warm-up call and then assert the *second* call sees a non-zero cache_read_input_tokens count from the upstream's prompt-cache. VCR replay can't model cross-call provider state — both calls match the same cassette episode, so the second call returns the first call's pre-warmup response and the assertion fails: AssertionError: Expected cache_read_input_tokens > 0 on second call, but got 0. Full usage: {'input_tokens': 4986, 'cache_creation_input_tokens': 4974, 'cache_read_input_tokens': 0} This started biting after the AWS SigV4 fingerprint stabilization (b637d9f64a): Bedrock requests now produce a stable per-access-key fingerprint instead of a per-request signature, so cassettes successfully replay where they previously always missed and re-recorded live. Opt these tests out via skip_nodeid_suffixes so they run live and match the existing pattern in tests/llm_translation/conftest.py (::test_prompt_caching). Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * Fix 3 OpenTelemetry tracing bugs in proxy integration (#27757) 1. Missing litellm_request child span when proxy parent in metadata: _get_span_context now returns (ctx, None) for the metadata-injected proxy parent so the primary span is always emitted as a child of ctx. Proxy span lifecycle managed by new _end_proxy_span_from_kwargs. 2. open_telemetry_logger overwrite by later handlers: _init_otel_logger_on_litellm_proxy now uses first-registered-wins — only assigns proxy_server.open_telemetry_logger when currently None. 3. Duplicate litellm_request success spans in streaming paths: Added _mark_success_span_once with per-handler dedupe key stored in kwargs metadata, suppressing the second span when both sync and async success callbacks fire for the same request. Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: update Next.js build artifacts (2026-05-13 01:42 UTC, node v20.20.2) * test(vcr): tighten OVERFLOW classification and switch respx detection to AST Address two greptile P2 review concerns on PR #27795: 1. MISS:OVERFLOW was firing whenever total > MAX_EPISODES_PER_CASSETTE regardless of cassette state. A cassette that grew past the cap historically but this run only *replayed* (dirty=False) is healthy — the persister never tries to save, so the cache state is stable and the next run will replay too. Only flag OVERFLOW when dirty=True (new episodes were recorded that the persister would refuse to save). Add a regression test covering the dirty=False + large-total case. 2. _module_uses_respx did substring matching on the module source, which false-positives on comments / docstrings / string literals. A comment like # Previously tried respx.mock but switched to vcrpy would keep a file pinned on the opt-out list, defeating the dead-import pruning goal of this PR. Replace the substring scan with an ast.NodeVisitor (_RespxUsageVisitor) that only counts: - @pytest.mark.respx / @respx.mock decorators - with respx.mock(): ... (sync + async) context managers - respx.mock(...) calls outside a with/decorator - function parameters / fixture names equal to respx_mock Add tests for the comment / docstring / string-literal cases plus each real-usage pattern. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(types_utils): drop opt-in env from remote-module runtime gate The runtime gate on s3://gcs:// loading in get_instance_fn previously allowed an opt-in via LITELLM_ALLOW_REMOTE_INSTANCE_FN_FROM_API. That env var is admin-flippable at runtime (DB-overlay environment_variables flow into os.environ), which defeats the gate's purpose, and it isn't needed for the documented operator flow: config.yaml callbacks always pass config_file_path through to the loader. Remove the helper, raise unconditionally when config_file_path is None, and drop the corresponding test for the opt-in branch. * fix(proxy): thread config_file_path into pass-through and MCP-tool YAML loaders The previous commit's gate broke two legitimate startup paths for operators using s3://gcs:// remote module loading from their config.yaml: - general_settings.pass_through_endpoints[].custom_handler - mcp_tools[].handler Both call sites called get_instance_fn without a config_file_path, so the new gate rejected them at startup. Thread config_file_path through: - create_pass_through_route accepts config_file_path and forwards it to get_instance_fn. add_exact_path_route, add_subpath_route, _register_pass_through_endpoint, and initialize_pass_through_endpoints accept and propagate it. - The YAML-load call site in proxy_server.load_config now passes config_file_path; the DB-overlay call site in _update_general_settings leaves it as the default None so the gate still fires on admin-written s3:// values. - MCPToolRegistry.load_tools_from_config accepts config_file_path and threads it into get_instance_fn; _init_non_llm_configs forwards it from load_config. Adds two regression tests verifying that the YAML-source callers thread the path through to get_instance_fn. * Strip SERVER_ROOT_PATH before lazy-feature prefix match LazyFeatureMiddleware compared the raw scope path against registered prefixes (e.g. /policies), so requests under a server root path like /api/v1/policies/... never matched, the feature never loaded, and the endpoint returned 404. Strip the configured root path before matching, normalizing trailing slashes and enforcing a component boundary so /api does not falsely match /apiv2. * Cache normalized SERVER_ROOT_PATH at middleware init SERVER_ROOT_PATH is a process-startup env var. Read it once in __init__ instead of calling get_server_root_path() + rstrip on every request that arrives before all lazy features have loaded. * test: replace dall-e-3 with gpt-image-1 in health check and router tests (#27813) OpenAI returns 'The model dall-e-3 does not exist' for the test account, breaking test_openai_img_gen_health_check and test_image_generation. Switch to gpt-image-1, matching the existing TestOpenAIGPTImage1 pattern. * fix(gemini): normalize response_schema on native generateContent (#27775) * fix(gemini): normalize response_schema on native generateContent The /v1beta/models/{model}:generateContent passthrough forwarded generationConfig.response_schema verbatim, so schemas containing $defs, $ref, anyOf-with-null, default, or title were rejected by Gemini even though /chat/completions already handles them. GoogleGenAIConfig.transform_generate_content_request now calls a new _normalize_response_schema helper that mirrors the chat/completions path: Gemini 2.0+ models get the schema promoted to responseJsonSchema via _build_json_schema (preserving $defs/$ref natively), older models keep responseSchema but the schema is flattened with _build_vertex_schema. VertexAIGoogleGenAIConfig (which overrides the transform entirely) calls the same helper before building the request. * fix(gemini): preserve caller-supplied responseJsonSchema when responseSchema co-present Previously, when both responseJsonSchema and responseSchema were present on Gemini 2.0+, _normalize_response_schema processed responseJsonSchema first (no-op normalization) then unconditionally promoted responseSchema to responseJsonSchema, clobbering the caller-supplied value. Now skip the promotion (and drop the redundant responseSchema) when the caller already supplied responseJsonSchema. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * chore: strip restating comments from response-schema normalize Drop the docstring on _normalize_response_schema and the two inline comments that just restated what the surrounding code/asserts already say. Function name + variable names carry the intent; PR description covers the why-it-exists context. * perf(gemini): drop redundant deepcopy on responseJsonSchema normalize _build_json_schema is a no-op (returns its argument unchanged), so the deepcopy + round-trip on the responseJsonSchema branch allocated a full schema copy on every request with no observable effect. Forward the caller's value as-is, and just move the popped responseSchema value when promoting on Gemini 2.0+. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * style: remove unneeded comment * fix(gemini): drop unsupported responseJsonSchema for older models * test(gemini): add parity test between native and chat schema normalization Per @Sameerlite review: lock the two Gemini schema-normalization paths together. If either GoogleGenAIConfig._normalize_response_schema (native generateContent) or VertexGeminiConfig.apply_response_schema_transformation (/chat/completions) drifts, the parity test fails — forcing both to be updated together. * fix(google_genai): preserve key naming convention in _normalize_response_schema When the input schema key is snake_case (response_schema), the promoted JSON schema key should also be snake_case (response_json_schema) instead of mixing in camelCase (responseJsonSchema). This matters for the Vertex AI google_genai path which converts all keys to snake_case before calling _normalize_response_schema. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com> * fix(vcr): aggregate worker stats on the controller so the session summary actually renders under xdist `_session_stats` is a module-level dict mutated inside `_vcr_outcome_gate` — which runs in each xdist worker process. The controller's `pytest_terminal_summary` then reads its own empty `_session_stats` and bails on `if not counts: return`, so the OVERFLOW / LIVE_CALL sections the rest of this PR adds never make it into CI logs in the dist mode CI actually uses. Ship a structured `vcr_outcome` payload via `user_properties` (which xdist round-trips) and add `aggregate_report_outcome` on the controller to fold worker outcomes into `_session_stats`. The recording process tags `vcr_recorded_by` with `PYTEST_XDIST_WORKER` so the controller can tell "single-process — already counted locally" apart from "produced by a worker — needs aggregation here", and not double-count when there's no xdist. Covered by 9 new unit tests in test_vcr_classification.py including the end-to-end summary render path. * fix(responses): register cooldowns on failure + fail fast on stale encrypted_content (#27820) * feat(proxy): skip disable_background_health_check models on GET /health when flag set (#27716) * feat(proxy): skip disable_background_health_check models on GET /health when flag set Co-authored-by: Cursor <cursoragent@cursor.com> * fix comment * fix greptile comments * Fix health check fallback kwargs * Format health endpoint * Harden direct health check kwargs compatibility for monkeypatched perform_health_check Replace substring-based TypeError detection with unexpected-keyword checks and a short retry chain (full kwargs, instrumentation only, filter only, minimal) so partial stubs work regardless of which optional kwarg fails first. Add proxy unit tests for legacy three-arg stubs and single-kwarg variants. Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix black --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix(bedrock-converse): drop blank-text fallback for empty thinking blocks (#27850) * fix(bedrock-converse): drop blank-text fallback for empty thinking blocks Claude Code with extended thinking replays prior assistant turns that include an empty thinking block (`thinking=""`, `signature=""`) alongside tool_use blocks. The unsigned-reasoning fallback in `add_thinking_blocks_to_assistant_content` was emitting `BedrockContentBlock(text="")`, which Bedrock Converse rejects with: "The text field in the ContentBlock object at messages.X.content.0 is blank." Guard the fallback with a strip() check, matching the existing empty-text guards elsewhere in `_bedrock_converse_messages_pt`. * style: remove unneeded comments * fix(proxy): thread config_file_path through LiteLLM_JWTAuth.custom_validate LiteLLM_JWTAuth.__init__ calls get_instance_fn(custom_validate) without config_file_path, so an operator who configures custom_validate: s3://bucket/module.fn in their YAML JWT auth section would hit the runtime gate on startup and break their deployment. Accept config_file_path as a non-field kwarg (popped before the invalid-keys check), thread it into get_instance_fn, and pass it from the startup-load callsite via the existing user_config_file_path module-level path. Admin-API JWT config writes leave the kwarg at None and still hit the gate. * fix(mcp): surface upstream 401 for token-forwarding MCP servers (#27847) * fix(mcp): surface upstream 401 for token-forwarding MCP servers For MCP servers configured with extra_headers: [Authorization], the gateway forwards the client token directly to the upstream. When that token is rejected (expired or invalid) the upstream returns 401, but the MCP SDK starts the SSE stream with 200 OK before calling handlers, so the 401 can't be returned mid-stream. Fix: add a pre-flight httpx probe in handle_streamable_http_mcp — before the SDK opens the session — so the gateway can still return HTTP 401 with WWW-Authenticate: Bearer authorization_uri=<gateway-discovery-url> when the upstream rejects the token. The probe fails-open (returns 200) on network errors so a transient hiccup does not block valid requests. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): parallelize pre-flight auth probes and use HEAD to avoid side effects - Extract forwarded_auth outside the pass-through server loop (was called N times for the same scope value) - Gather all upstream auth probes concurrently with asyncio.gather instead of sequentially; eliminates N×5 s worst-case latency - Switch probe from POST+initialize JSON-RPC body to HEAD request; HEAD carries the Authorization header so the upstream rejects invalid tokens with 401 but never allocates a session or writes an audit entry Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): use get_async_httpx_client in _probe_upstream_auth Replaces bare httpx.AsyncClient with the project-standard get_async_httpx_client(httpxSpecialProvider.MCP) to satisfy the ensure_async_clients_test code coverage check and avoid the +500 ms per-request overhead of creating a new client on every probe call. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(mcp): extract pre-flight probe into _check_passthrough_upstream_auth Moves the parallel upstream auth probe logic out of handle_streamable_http_mcp into a dedicated helper to satisfy Ruff PLR0915 (Too many statements > 50). Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): gate pre-flight probes on authorized server set to prevent bypass _check_passthrough_upstream_auth was resolving user-supplied server names directly before authorization ran, letting any permitted LiteLLM key trigger an upstream HEAD probe to a server it was not allowed to use. Changes: - Call _get_allowed_mcp_servers inside the helper so only servers the caller's key is authorized for are probed. - Move the call site to after toolset scoping so the auth context is fully resolved before the probe list is built. - Thread user_api_key_auth into the helper signature (replaces the raw mcp_servers name list). Co-authored-by: Cursor <cursoragent@cursor.com> * Add async HTTP HEAD support Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): use Scope type annotation in _get_forwarded_auth_from_scope Co-authored-by: Cursor <cursoragent@cursor.com> * Fix MCP upstream auth probe method Co-authored-by: Yassin Kortam <yassin@berri.ai> * Remove unused AsyncHTTPHandler head method Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): exclude has_client_credentials servers from pre-flight auth probe _prepare_mcp_server_headers skips caller Authorization when the server uses OAuth client-credentials (M2M), but the pre-flight probe was still selecting those servers and forwarding the caller's raw token in the HEAD request. Exclude servers with has_client_credentials from the probe list to match the actual downstream header-preparation logic. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): propagate upstream 403 as 403, not 401 with WWW-Authenticate Per RFC 9110, 401 means "go get new credentials." Mapping an upstream 403 to a gateway 401 causes OAuth clients to restart the authorization flow, obtain a fresh token with identical scopes, hit 403 again, and loop indefinitely. 401 from upstream → gateway 401 + WWW-Authenticate (re-authorize) 403 from upstream → gateway 403 (no WWW-Authenticate hint) Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): skip auth probe when Authorization may be the LiteLLM proxy key The pre-flight upstream probe must not forward the caller's Authorization header when it could itself be the LiteLLM proxy API key. Restrict the probe to requests that supply x-litellm-api-key explicitly — only then is the Authorization header unambiguously the upstream OAuth token the caller wants forwarded. * Fix MCP ASGI HTTPException propagation Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): use public AsyncHTTPHandler.post() in auth probe Use AsyncHTTPHandler.post() and catch httpx.HTTPStatusError explicitly so the 401/403 we want to surface is not silently swallowed by the broad fail-open except Exception block. Avoids reaching into the handler's private client attribute, which would silently regress to fail-open if AsyncHTTPHandler is ever refactored. * Fix MCP auth probe tests Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(mcp): add coverage for httpx.HTTPStatusError path in auth probe AsyncHTTPHandler.post() calls raise_for_status() internally, so a real upstream 401/403 lands as httpx.HTTPStatusError. Add a test that exercises that specific exception path so a regression that swallows the error in the broad fail-open except Exception would be caught. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: claude-bot <claude-bot@anthropic.com> * fix(cost): align vertex_ai/gemini-embedding-2-preview with Vertex multimodal pricing (#27848) * fix(cost): align vertex_ai/gemini-embedding-2-preview with Vertex multimodal pricing Co-authored-by: Cursor <cursoragent@cursor.com> * fix(cost): align vertex_ai/gemini-embedding-2 GA source URL with preview Per Greptile review on #27848: GA entry referenced ai.google.dev while the preview entry was updated to the canonical Vertex AI pricing page. Both share identical pricing values; sync the source URL for consistency. https://claude.ai/code/session_01W8jRwstnmduadGw8Z8egxe --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude <noreply@anthropic.com> * feat(mcp): add delegate_auth_to_upstream flag for PKCE passthrough (#27834) * feat(mcp): add delegate_auth_to_upstream flag for PKCE passthrough Adds an opt-in per-server flag that lets clients (e.g. VS Code) complete PKCE directly with an upstream OAuth2 MCP server, instead of LiteLLM double-gating with its own API-key/SSO check. Only honored when auth_type=oauth2 and the operator explicitly sets the flag; mixed-target or non-oauth2 requests fail closed. - Adds the field to Pydantic models, Prisma schema, and a migration - New MCPRequestHandler._target_servers_delegate_auth_to_upstream gate that runs only when no x-litellm-api-key is present, so authenticated users still get user_id resolution + stored-credential lookup - Anonymous callers now see delegate servers in get_allowed_mcp_servers (scoped to delegate servers only; the upstream still enforces auth) - mcp_management_endpoints: allow anonymous /authorize and /token for delegate servers so VS Code can complete PKCE without a LiteLLM session - UI toggle (shown only for oauth2) + payload/view wiring - Tests covering: oauth2 on/off, non-oauth2 with flag, mixed targets, no resolvable target, explicit key precedence, and 401 emission Co-authored-by: Cursor <cursoragent@cursor.com> * Enforce oauth2 for delegated MCP auth bypass Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): close secondary Authorization bypass for delegate servers The delegate-auth bypass gated only on the primary `x-litellm-api-key` header, so a LiteLLM key sent via `Authorization: Bearer sk-...` (the secondary header) was silently dropped — skipping spend tracking and rate limiting. Gate on the resolved litellm_api_key (which considers both headers) so the bypass fires only when neither is present. Also update the existing "Authorization header present" test to reflect that an upstream OAuth token now flows through the existing oauth2 fallback (LiteLLM auth attempt → fail → anonymous), not via the delegate branch. Co-authored-by: Cursor <cursoragent@cursor.com> * Avoid duplicate MCP OAuth credential lookup Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): block delegate bypass for M2M and internal-only servers Two security issues flagged in code review: 1. High – client_credentials (M2M) servers must not be delegatable: LiteLLM auto-fetches the upstream token using stored credentials, so allowing anonymous bypass would let any external caller invoke tools authenticated as LiteLLM's service account. Fix: check `server.has_client_credentials` in `_target_servers_delegate_auth_to_upstream`, the anonymous allow-list in `get_allowed_mcp_servers`, and `_mcp_oauth_user_api_key_auth`. 2. Medium – internal-only servers exposed to public internet: The anonymous delegate allow-list was not filtering by `available_on_public_internet`, so external callers with an upstream OAuth token could invoke tools on servers marked internal-only. Fix: add `available_on_public_internet` guard to the anonymous delegate server list in `get_allowed_mcp_servers`. Tests added for both cases. Co-authored-by: Cursor <cursoragent@cursor.com> * Require public MCP delegate auth servers Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): align delegate auth path parsing with downstream routing `_extract_target_server_names_from_path` used a naive segments-based split while `server.py::_get_mcp_servers_in_path` uses a regex that allows server names with one embedded slash and comma-separated lists. With the old parser, a request to `/mcp/<delegated>/<garbage>` was parsed as targeting `<delegated>` by the auth gate (bypassing LiteLLM auth) while the routing layer parsed it as `<delegated>/<garbage>` — when that name did not resolve, the request fell back to the anonymous allow-list, which can include `allow_all_keys` servers that normally require a LiteLLM key. Replace the parser with the same regex logic as `_get_mcp_servers_in_path` so auth gating sees the exact target name(s) downstream routing sees. Add regression tests covering parser parity and the specific extra-path-segment bypass attempt. https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9 * fix(mcp): close header/path TOCTOU in MCP delegate auth gate `_target_servers_delegate_auth_to_upstream` and `_target_servers_use_oauth2` trusted the `x-mcp-servers` header when present, but `server.py::extract_mcp_auth_context` overrides that header with the path-derived list for `/mcp/...` routes. An attacker could set `x-mcp-servers: <delegated>` while pointing the URL path at a non-delegate server, flipping the auth gate without changing the target downstream routing actually uses. Extract a shared `_resolve_target_server_names` helper that mirrors the downstream override (path-derived names for `/mcp/...` routes, header value otherwise). Add regression tests covering the TOCTOU attempt and the helper's path-vs-header precedence. https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9 * Fix delegated MCP OAuth test mock Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): drop unreachable /{server}/mcp branch in auth path parser `_extract_target_server_names_from_path` also matched the ``/{server_name}/mcp`` form, but the downstream parser ``_get_mcp_servers_in_path`` only handles ``/mcp/...`` — and ``dynamic_mcp_route`` in ``proxy_server`` rewrites ``/{name}/mcp`` to ``/mcp/{name}`` on the scope before the MCP handler runs. Parsing the un-rewritten form on the auth side was therefore unreachable in production, and contradicted the docstring's claim of mirroring the downstream parser — exactly the kind of mismatch that risks a future header/path TOCTOU if any new entry point skips the rewrite. Drop the branch; the canonical ``/mcp/...`` path matches both parsers. Update the regression test to assert the new behavior. https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9 * Fix MCP path auth target resolution Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): require auth for refresh_token grants on delegate-auth servers `_mcp_oauth_user_api_key_auth` gates the unauthenticated PKCE flow for ``delegate_auth_to_upstream`` servers, but the bypass applied to BOTH ``/authorize`` and ``/token`` regardless of grant type. ``mcp_token`` accepts ``grant_type=refresh_token`` as well as ``authorization_code``, and ``exchange_token_with_server`` attaches the server's stored ``client_secret`` to whatever is forwarded upstream. An unauthenticated caller holding a refresh token issued to that OAuth client could mint fresh upstream access tokens through LiteLLM. Limit the anonymous bypass on ``/token`` to ``grant_type=authorization_code`` (the only grant PKCE actually protects via ``code_verifier``); fall through to normal LiteLLM auth for ``refresh_token`` and any other grant. ``/authorize`` continues to allow anonymous PKCE redirects. https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9 * fix(ui): clear delegate_auth_to_upstream when switching off oauth2 The ``delegate_auth_to_upstream`` form field is rendered inside an ``isOAuth2 && (...)`` conditional, so the Form.Item unmounts when the user changes ``auth_type`` away from ``oauth2``. The follow-up ``form.setFieldValue("delegate_auth_to_upstream", false)`` runs after the field has already deregistered, so ``onFinish`` receives ``undefined`` and the fallback ``?? mcpServer.delegate_auth_to_upstream`` preserved the old ``true``. The flag then persisted in the database for a non-oauth2 server and silently re-activated if ``auth_type`` was later switched back to ``oauth2``. In the edit payload, force the flag to ``false`` whenever ``auth_type !== oauth2``; only trust the form value (and the existing DB fallback) when the server is actually oauth2. Backend defense-in-depth already ignores the flag for non-oauth2 servers, but the DB state should stay clean too. https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9 * Fix MCP delegate auth reset on edit Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: Claude <claude@anthropic.com> * fix(responses): preserve cache_control in Responses API -> Chat Completion transformation (#27727) * fix(responses): preserve cache_control in Responses API -> Chat Completion transformation cache_control injected by AnthropicCacheControlHook was silently dropped when _transform_responses_api_content_to_chat_completion_content rebuilt content blocks with only {type, text}. Now copies cache_control through so Anthropic prompt caching works correctly when using client.responses.create with cache_control_injection_points. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(responses): preserve cache_control for input_image and input_file blocks Extends the cache_control fix to image and file content blocks, which were also silently dropping cache_control during the Responses API -> Chat Completion transformation. Adds tests for all three content block types. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude Babysitter <claude@anthropic.com> * fix(proxy): expose db status on public /health/readiness External readiness probes consumed the legacy detailed payload's `db` field to drive alerting and pod-rotation decisions. Stripping the body to `{"status": "healthy"}` broke those probes silently — the HTTP code still flipped to 503, but probes checking `body.db == "connected"` treated the response as healthy. Add `db` back to the unauthenticated payload. Keep the rest of the diagnostic fields (litellm_version, callbacks, cache, log_level) gated behind /health/readiness/details so the recon-leak gate from #26912 holds. Values match the legacy contract: "connected", "disconnected", "Not connected". * docs(budget_manager): add docstring to BudgetManager.reset_cost (#27867) Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com> * docs: add class docstring to _LoopWrapper (#27870) Document the purpose of the daemon thread that backs the sync branch of the timeout decorator. Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com> * fix: Fix Redis Sentinel client handling to solve authentication error… (#26302) * fix: Fix Redis Sentinel client handling to solve authentication error with password protected sentinel (#25625) * fix Redis Sentinel authentication handling * test: cover Redis Sentinel auth routing * refactor: align Redis Sentinel kwargs threading * fix: avoid duplicate Redis Sentinel socket timeouts * Address review comments * refactor(_redis): return set from _get_redis_kwargs for O(1) lookup Align _get_redis_kwargs() with the cluster helper by returning a set instead of a list, so the sentinel connection-kwargs filter uses O(1) membership tests. Addresses Greptile review feedback on PR #26302. * fix(_redis): restore Azure-specific kwargs in cluster kwargs set The set-literal refactor of _get_redis_cluster_kwargs dropped four LiteLLM-custom Azure keys (azure_redis_ad_token, azure_client_id, azure_tenant_id, azure_client_secret) that the prior list form had explicitly appended. Because they are not in RedisCluster's argspec, they were silently stripped, breaking Azure IAM auth on cluster clients. Re-add them to the explicit include set. --------- Co-authored-by: Kristin Cowalcijk <kristincowalcijk@gmail.com> Co-authored-by: Sameer Kankute <sameer@berri.ai> Co-authored-by: krrish-berri-2 <krrish-berri-2@users.noreply.github.com> Co-authored-by: claude <claude@anthropic.com> * Litellm agent oss staging 05 11 2026 (#27733) * fix(ollama): Include provider in model list for ollama (#26135) * Include provider in model names for ollama * Fix unit tests * fix(ollama): process both thinking and content in same streaming chunk (#26098) * fix(health_check): skip max_tokens for image_generation mode (#26417) * fix(health_check): skip max_tokens for image_generation mode `_update_litellm_params_for_health_check` injected `max_tokens` for every deployment. OpenAI `/v1/images/generations` strictly rejects unknown fields, so health checks for dall-e-* and gpt-image-1 always failed with `400 "Unknown parameter: 'max_tokens'"` even though the actual image endpoint calls succeed. Skip the `max_tokens` injection when `model_info.mode == "image_generation"`. `messages` still gets injected (downstream `_filter_model_params` already strips it for non-chat handlers). * Switch to allow-list with per-deployment override Per @krrishdholakia review: deny-listing image_generation only re-introduces the same bug for every other non-chat mode (embedding, audio_*, rerank, video_generation, ocr, search, moderation, ...). Replace the single image_generation skip with `_MAX_TOKEN_SUPPORT_MODES = {chat, completion, responses}`. Missing `mode` is treated as chat for backward compatibility. New modes are safe by default. Add `model_info.health_check_supports_max_tokens` as an operator escape hatch — True forces injection on a non-listed deployment (operator wants to bound probe tokens), False suppresses it on a chat-style deployment behind a strict-schema provider. Tests: parametrize over 3 chat-style + 10 non-chat modes, plus override on/off and the no-mode legacy path. * fix(http_handler): handle RequestNotRead in MaskedHTTPStatusError for multipart uploads (#26718) Squash-merged by litellm-agent from dawidkulpa's PR. * fix(ollama): guard against double 'ollama/' prefix in live model listing Greptile flagged that Ollama servers can return names that already start with 'ollama/'. Check the prefix before prepending so we don't produce 'ollama/ollama/...'. Adds a regression test. * Fix Ollama empty reasoning stream chunks Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: James Myatt <james@jamesmyatt.co.uk> Co-authored-by: VHash <225398745+vhash0@users.noreply.github.com> Co-authored-by: hayden <sewhan.kim+@a-bly.com> Co-authored-by: dawidkulpa <84176950+dawidkulpa@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> * Ishaan - May 13th Staging LiteLLM (#27877) * fix: strip Gemini thought-signature from tool_use.id in non-streaming path; example websearch config (#27873) - adapters/transformation.py: mirror the streaming path and strip the `__thought__<b64>` suffix off `tool_call.id` before building the AnthropicResponseContentBlockToolUse. Base64's `+ / =` characters violate Anthropic's `^[a-zA-Z0-9_-]+$` tool_use.id pattern, so when a conversation that flowed through Gemini is later replayed to an Anthropic-native provider (Bedrock or Anthropic API) the request 400s. - example_config_yaml/websearch_interception_config.yaml: register the interceptor under `callbacks:` not `success_callback:`. `success_callback` does not run pre-request hooks, so the tool-conversion step never fires on `/v1/messages` and the raw `web_search_20250305` tool is forwarded to Bedrock, which 400s. - adds a unit test pinning the non-streaming strip behavior and the surviving `^[a-zA-Z0-9_-]+$` shape of the resulting id. Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com> * Fix/azure image edit auth header (#27863) * fix(azure/image_edit): use api-key header instead of Authorization Bearer Delegate `AzureImageEditConfig.validate_environment` to `BaseAzureLLM._base_validate_azure_environment` so the image-edit route follows the same auth resolution as every other Azure provider: - prefer the Azure-native `api-key` header when an API key is available - fall back to `Authorization: Bearer <azure_ad_token>` only for AAD auth The previous implementation unconditionally set `Authorization: Bearer <api_key>`, which is the OpenAI-direct convention and is rejected by Azure OpenAI / APIM-fronted deployments with `401 Access denied due to missing subscription key`. Adds regression tests covering api_key kwarg, litellm_params.api_key, and the AAD-token fallback path. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(azure/image_edit): pin api-key precedence semantics + add regression test Address review feedback that the move to ``BaseAzureLLM._base_validate_azure_environment`` changed the relative priority of the positional ``api_key`` kwarg vs. ``litellm_params["api_key"]``. The new behavior — ``litellm_params["api_key"]`` wins, positional only fills in when ``litellm_params["api_key"]`` is empty — is intentional and matches every other Azure ``validate_environment``: ``AzureVideosConfig`` uses the exact same merge logic, while ``AzureVectorStoresConfig`` and ``AzureResponsesAPIConfig`` don't accept a positional ``api_key`` at all. The old ``or`` chain (positional wins) was the outlier and was part of the same OpenAI-vs-Azure convention drift that produced the original ``Authorization: Bearer`` bug. The only production caller (``llm_http_handler.image_edit``) sources both values from the same ``litellm_params.api_key``, so this change is behaviorally a no-op there. Document the precedence in the docstring and lock it in with an explicit test so future refactors can't quietly re-invert it. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: ryan-crabbe-berri <ryan@berri.ai> Co-authored-by: Adam Kirstein <adam.kirstein@disney.com> Co-authored-by: Cursor <cursoragent@cursor.com> * test(azure/image_edit): expect api-key header instead of Authorization Bearer PR #27863 fixed Azure image edit to use the Azure-native api-key header instead of OpenAI's Authorization: Bearer convention, but did not update test_azure_image_edit_litellm_sdk to match. The test still asserted 'Authorization' in headers, which now fails since the new code routes through BaseAzureLLM._base_validate_azure_environment and emits api-key when an api_key is provided. Update the assertion to pin the correct Azure behavior: api-key header present with the resolved key, and no Authorization header. --------- Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai> Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com> Co-authored-by: Adam Kirstein <107421694+justalittleadam@users.noreply.github.com> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: ryan-crabbe-berri <ryan@berri.ai> Co-authored-by: Adam Kirstein <adam.kirstein@disney.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com> * fix(fireworks_ai): strip `thinking_blocks` from chat messages before Fireworks API call (#27881) * fix(fireworks_ai): strip thinking_blocks from chat messages before API call Fireworks OpenAI-compatible ChatMessage schema uses additionalProperties:false and rejects Anthropic-style messages[].thinking_blocks (e.g. Claude Code replays), returning invalid_request_error. Remove the field in _transform_messages_helper alongside provider_specific_fields. Adds unit test test_transform_messages_helper_strips_thinking_blocks. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(fireworks_ai): drop inline comments from message sanitization Co-authored-by: Cursor <cursoragent@cursor.com> * docs(fireworks_ai): explain why provider_specific_fields and thinking_blocks are stripped Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * fix: block client-side pricing injection via request body Authenticated clients could supply CustomPricingLiteLLMParams fields (input_cost_per_token, output_cost_per_token, etc.) in the request body. These were forwarded to register_model() in main.py, permanently mutating the shared global litellm.model_cost dict for all users on the instance. Adds all CustomPricingLiteLLMParams fields to _BANNED_REQUEST_BODY_PARAMS so is_request_body_safe() rejects them before they reach completion(). New pricing fields added to CustomPricingLiteLLMParams are auto-covered. Admin opt-in via allow_client_side_credentials or configurable_clientside_auth_params still works as before. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * chore(proxy): scrub remote-URL module loads from DB-overlay config When ``ProxyConfig`` merges DB-persisted ``litellm_settings`` / ``general_settings`` on top of the YAML config, the merged dict is later iterated by ``load_config`` which threads ``config_file_path`` (the YAML path) into ``get_instance_fn``. The runtime gate that refuses ``s3://`` / ``gcs://`` modules when ``config_file_path`` is ``None`` therefore can't distinguish a YAML-sourced value from a DB-sourced one: both look the same to ``get_instance_fn``. Strip ``s3://`` / ``gcs://`` entries from the DB-overlay value for every field whose contents reach ``get_instance_fn`` during config load: - litellm_settings: ``callbacks``, ``success_callback``, ``failure_callback``, ``audit_log_callbacks``, ``post_call_rules``, ``custom_provider_map[].custom_handler`` - general_settings: ``custom_auth``, ``custom_key_generate``, ``custom_key_update``, ``custom_sso``, ``custom_ui_sso_sign_in_handler``, ``litellm_jwtauth.custom_validate`` The YAML config-file load path is unchanged — the documented operator flow (``callbacks: ["s3://bucket/module.instance"]`` in ``config.yaml``) still works. Only DB-overlay writes (e.g. via ``/config/update``) are stripped. Adds 16 regression tests covering the scrub matrix. * chore(proxy): also scrub pass_through_endpoints[].target from DB overlay A pass-through endpoint's ``target`` field is passed through ``create_pass_through_route`` into ``get_instance_fn`` during config load. A PROXY_ADMIN persisting ``target: "s3://attacker/m.i"`` via the DB-overlay ``pass_through_endpoints`` write path was not covered by the previous scrub matrix, so the remote module load would still reach the loader because the YAML-load chain has ``config_file_path`` set. Walk each entry in ``general_settings.pass_through_endpoints`` and null out any ``target`` that starts with ``s3://`` or ``gcs://``. The entry itself is preserved so the path-registration helper can choose how to handle a missing target (the existing code skips the route when ``target is None``). Adds two regression tests. * fix(prometheus): emit `litellm_remaining_tokens_metric` for Bedrock and Vertex (#27705) * fix(prometheus): emit remaining_tokens/requests gauges for bedrock + vertex (LIT-2719) Bedrock and Vertex AI never return x-ratelimit-remaining-* response headers, so litellm_remaining_tokens_metric / litellm_remaining_requests_metric only fired for OpenAI / Azure / Anthropic deployments even when tpm/rpm was configured on the router. Add a provider-agnostic fallback in PrometheusLogger.async_log_success_event that asks Router.get_remaining_model_group_usage() for the same model_group and emits the gauges with configured_limit - current_usage when the upstream provider didn't populate the headers itself. Existing OpenAI / Azure / Anthropic flows are unchanged because the fallback short-circuits when both header values are already present. Tests: 8 new tests covering bedrock + vertex emission, header short-circuit, partial-header fill, llm_router=None, missing model_group, empty router result, and router exception swallowing. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(prometheus): narrow except to ImportError, log router lookup failures via verbose_logger.exception Address greptile review: - The optional 'from litellm.proxy.proxy_server import llm_router' should guard against ImportError specifically, not all exceptions, so that unexpected errors (e.g. AttributeError from partially-initialized state) stay visible. - get_remaining_model_group_usage failures are now logged via verbose_logger.exception (with traceback) instead of debug, matching the PR description's intent and avoiding silent loss of router-cache errors in production. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(prometheus): subtract in-flight delta in router-remaining fallback The router's TPM/RPM counter is incremented by Router.deployment_callback_on_success, which f…

greptile-apps Bot reviewed May 13, 2026

View reviewed changes

Comment thread litellm/proxy/_experimental/mcp_server/server.py Outdated

Comment thread litellm/proxy/_experimental/mcp_server/server.py Outdated

Comment thread litellm/proxy/_experimental/mcp_server/server.py Outdated

Comment thread litellm/proxy/_experimental/mcp_server/server.py

Sameerlite and others added 2 commits May 13, 2026 19:04

veria-ai Bot reviewed May 13, 2026

View reviewed changes

Comment thread litellm/proxy/_experimental/mcp_server/server.py Outdated

cursor Bot reviewed May 13, 2026

View reviewed changes

Comment thread litellm/proxy/_experimental/mcp_server/server.py

veria-ai Bot reviewed May 13, 2026

View reviewed changes

Comment thread litellm/proxy/_experimental/mcp_server/server.py Outdated

Add async HTTP HEAD support

b211144

Co-authored-by: Yassin Kortam <yassin@berri.ai>

fix(mcp): use Scope type annotation in _get_forwarded_auth_from_scope

d4d6155

Co-authored-by: Cursor <cursoragent@cursor.com>

veria-ai Bot reviewed May 13, 2026

View reviewed changes

Comment thread litellm/proxy/_experimental/mcp_server/server.py

cursor Bot reviewed May 13, 2026

View reviewed changes

Comment thread litellm/proxy/_experimental/mcp_server/server.py

Fix MCP upstream auth probe method

73e41f6

Co-authored-by: Yassin Kortam <yassin@berri.ai>

veria-ai Bot reviewed May 13, 2026

View reviewed changes

Comment thread litellm/proxy/_experimental/mcp_server/server.py

cursor Bot reviewed May 13, 2026

View reviewed changes

Comment thread litellm/llms/custom_httpx/http_handler.py Outdated

cursoragent and others added 2 commits May 13, 2026 14:17

Remove unused AsyncHTTPHandler head method

ade8ca1

Co-authored-by: Yassin Kortam <yassin@berri.ai>

greptile-apps Bot reviewed May 13, 2026

View reviewed changes

Comment thread litellm/proxy/_experimental/mcp_server/server.py Outdated

veria-ai Bot reviewed May 13, 2026

View reviewed changes

greptile-apps Bot reviewed May 13, 2026

View reviewed changes

cursor Bot reviewed May 13, 2026

View reviewed changes

Comment thread litellm/proxy/_experimental/mcp_server/server.py

cursoragent and others added 2 commits May 13, 2026 17:09

Fix MCP ASGI HTTPException propagation

851d9f6

Co-authored-by: Yassin Kortam <yassin@berri.ai>

greptile-apps Bot reviewed May 13, 2026

View reviewed changes

cursor Bot reviewed May 13, 2026

View reviewed changes

Comment thread tests/test_litellm/proxy/_experimental/mcp_server/test_mcp_server.py

cursoragent and others added 2 commits May 13, 2026 17:38

Fix MCP auth probe tests

c07b62a

Co-authored-by: Yassin Kortam <yassin@berri.ai>

mateo-berri enabled auto-merge (squash) May 13, 2026 19:03

mateo-berri approved these changes May 13, 2026

View reviewed changes

mateo-berri merged commit 466f06d into litellm_internal_staging May 13, 2026
117 checks passed

mateo-berri deleted the litellm_mcp_passthrough_upstream_401 branch May 13, 2026 19:03

mateo-berri mentioned this pull request May 13, 2026

perf(mcp): cache _get_allowed_mcp_servers per request (PR #27847 follow-up) #27862

Closed

joelstucki-taulia mentioned this pull request May 29, 2026

[Bug]: Interactive OAuth2 MCP server returns 500 instead of 401+WWW-Authenticate when x-litellm-api-key is present but no Google OAuth token exists yet #29261

Open

Uh oh!

Conversation

Sameerlite commented May 13, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Fix

Uh oh!

codecov Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

greptile-apps Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Sameerlite commented May 13, 2026

Uh oh!

Uh oh!

veria-ai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

MCP upstream auth preflight added

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

CLAassistant commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Sameerlite commented May 13, 2026

Uh oh!

Uh oh!

Sameerlite commented May 13, 2026

Uh oh!

veria-ai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

mateo-berri commented May 13, 2026

Uh oh!

greptile-apps Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

mateo-berri May 13, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mateo-berri commented May 13, 2026

Uh oh!

greptile-apps Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

mateo-berri May 13, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mateo-berri commented May 13, 2026

Sameerlite commented May 13, 2026 •

edited by cursor Bot

Loading

codecov Bot commented May 13, 2026 •

edited

Loading

greptile-apps Bot commented May 13, 2026 •

edited

Loading

veria-ai Bot commented May 13, 2026 •

edited

Loading

cursor Bot left a comment •

edited

Loading

CLAassistant commented May 13, 2026 •

edited

Loading

cursor Bot left a comment •

edited

Loading

cursor Bot left a comment •

edited

Loading

cursor Bot left a comment •

edited

Loading