chore(ci): promote internal staging to main by yuneng-berri · Pull Request #28680 · BerriAI/litellm

yuneng-berri · 2026-05-23T00:34:31Z

Relevant issues

Linear ticket

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible, it only solves 1 specific problem
I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

50-55 passing tests: main is stable with minor issues.

45-49 passing tests: acceptable but needs attention

<= 40 passing tests: unstable; be careful with your merges and assess the risk.

Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:

Screenshots / Proof of Fix

Type

🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test

Changes

* feat(gemini): add gemini-3.1-flash-lite model cost map entries Co-authored-by: Cursor <cursoragent@cursor.com> * Update model_prices_and_context_window.json * Update source URL for model pricing information * Sync source URL for gemini-3.1-flash-lite in backup JSON * fix(model_cost_map): add mistral/ministral-8b-2512 entry Mistral rotated the 'mistral/mistral-tiny' alias to return 'ministral-8b-2512' as the response model, which is not in the cost map. This caused test_completion_mistral_api and test_completion_mistral_api_modified_input to fail in completion_cost lookup. Add the entry mirroring the existing openrouter/mistralai/ministral-8b-2512 pricing. * test(cost_calculator): assert output_cost_per_reasoning_token for gemini-3.1-flash-lite * fix(tests): backfill local backup entries into runtime model_cost litellm.model_cost is loaded from LITELLM_MODEL_COST_MAP_URL (pinned to main) at import time, so any pricing entries added to the in-tree backup on this branch aren't visible at test runtime until they also land on main. The Mistral cassette currently returns model=ministral-8b-2512 and the cost-calculator lookup in test_completion_mistral_api / test_completion_mistral_api_modified_input fails despite the entry existing in the local backup. Backfill missing backup entries into litellm.model_cost in the local_testing conftest so these lookups succeed against the cassette state the branch is being tested with. * fix(tests): guard conftest backfill against empty local cost map --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>

…d double-seed (#27854) * fix(spend_counter): seed Redis counter via SET NX to prevent cross-pod double-seed Symptom ------- Customers on multi-pod deployments see team `spend` jump to ~2x (or N x the pod count) shortly after a Redis cache miss / TTL expiry, triggering spurious "Budget Crossed" alerts and blocked requests until the value is manually reset. Root cause ---------- `SpendCounterReseed.coalesced` warmed the primary spend counter by calling `redis.async_increment(key, value=db_spend, refresh_ttl=True)`, which lowers to Redis `INCRBYFLOAT`. That is additive, not idempotent. The per-counter `asyncio.Lock` only coalesces seeders inside one process. With N pods sharing one Redis, on a cold key (cold start, TTL expiry, manual delete) every pod independently passes its lock + Redis re-check, reads the same `db_spend`, and issues `INCRBYFLOAT db_spend`. Final value: N x db_spend. Fix --- Use `redis.async_set_cache(key, value=db_spend, nx=True)` for the seed. SET NX is atomic across pods: exactly one writer initializes the key; losers read the winner's value via `async_get_cache`. This is the same idiom already used by `coalesced_window` in the same file, so the two seed paths are now consistent. Per-request deltas continue to use `INCRBYFLOAT` (correct - additive behaviour is what we want for increments, not for initial seed). Verification ------------ Live two-process repro against the same Postgres + Redis (DB spend = 506): Unpatched: 4/4 runs -> Redis counter = ~1012 (~2 x db_spend) Patched: 12/12 runs -> Redis counter = ~506 Unit tests (`test_proxy_server.py`): - New `test_primary_spend_counter_redis_concurrent_seed_does_not_double_seed` patches `_get_lock` to return a fresh lock per caller (otherwise the per-process lock masks the race), races two `coalesced` calls, and asserts final = 506 with exactly one of two SET NX attempts winning. - 4 existing tests updated for the new seed contract (SET NX for the seed, INCRBYFLOAT only for the per-request delta). - Full `spend_counter or reseed or budget` slice: 22 passed. Co-authored-by: Cursor <cursoragent@cursor.com> * test(spend_counter): make SET NX mock atomic so loser branch is exercised Greptile flagged that `redis_set_cache` in test_primary_spend_counter_redis_concurrent_seed_does_not_double_seed placed `await asyncio.sleep(0)` AFTER the NX membership check. Both concurrent tasks observed an empty `redis_store`, passed the guard, and both returned True - so the loser branch (else: read back winner's value) was never exercised. Fix the mock to model real atomic Redis SET NX: - Yield BEFORE the membership check so two concurrent callers interleave the way real SET NX does (first to resume runs check + write atomically and wins; second resumes after the key exists and loses). - Track set_cache return values; assert sorted([loser, winner]) so we know exactly one task wins and one loses. - Track async_get_cache calls that happen AFTER at least one SET NX has completed; assert at least one such read - that is the loser-path fallback (`current_value = float(cached)` when seeded is False). Verified by temporarily reverting the mock to the old order: the test now fails with `expected exactly one SET NX winner and one loser, got [True, True]`, exactly the failure mode Greptile described. No production code change. Co-authored-by: Cursor <cursoragent@cursor.com> * test(spend_counter): mock async_set_cache to populate redis_store in concurrent read+write test `test_concurrent_read_and_write_paths_share_one_db_query` mocks `async_increment` to populate the in-memory `redis_store`, but did not mock `async_set_cache`. After the SET-NX seed change in `coalesced()`, the seed step writes via `async_set_cache(nx=True)` (default AsyncMock, no `redis_store` write), so the simulated Redis stays empty after the first reseed. The second `get_current_spend` then sees a clean Redis miss, re-enters the DB read path, and the test fails with `expected 1 DB query, got 2`. Fix: add a `redis_set_cache` side_effect that updates `redis_store` on `nx=True` (and rejects when the key already exists), matching the pattern used by the four sibling tests fixed in this branch's first commit. Pre-existing assertions are unchanged. Full `tests/test_litellm/proxy/test_proxy_server.py`: 158 passed. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>

…28339) * fix(proxy): normalize batch file IDs before ManagedObjectTable write Run post_call_success_hook before update_batch_in_database on retrieve/cancel, and ensure_batch_response_managed_file_ids so file_object never stores raw provider output_file_id or error_file_id. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(proxy): address Greptile review on batch file ID normalization Remove redundant resolve_* calls after update_batch_in_database and rename loop variable to avoid shadowing hidden_params unified_file_id. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(tests): add mistral/ministral-8b-2512 to cost map and backfill in conftest Mistral rotated the 'mistral/mistral-tiny' alias to return 'ministral-8b-2512' as the response model, which was missing from the cost map. This caused test_completion_mistral_api and test_completion_mistral_api_modified_input to fail in litellm.completion_cost lookup. - Add mistral/ministral-8b-2512 entry to both the in-tree model_prices_and_context_window.json and the bundled litellm/model_prices_and_context_window_backup.json (mirrors the existing openrouter/mistralai/ministral-8b-2512 pricing). - litellm.model_cost is loaded at import time from the URL pinned to main, so the new backup entry isn't visible at test runtime until it also lands on main. Backfill any entries missing from the remote-fetched map into litellm.model_cost in the local_testing conftest so cost-calculator lookups succeed on this branch. * fix(tests): drop unnecessary del of conftest backfill loop vars * fix: resolve batch response file IDs even when status unchanged The status-unchanged early return in update_batch_in_database was skipping ensure_batch_response_managed_file_ids, leaving raw provider input_file_id (and other raw IDs) in the user-facing response when polling an in-progress batch. Move the in-place file ID normalization above the early return so the response always carries unified managed IDs while still skipping the DB write when nothing changed. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(batches): cover ensure_batch_response_managed_file_ids branches Add tests for the previously-uncovered paths in ensure_batch_response_managed_file_ids: error_file_id normalization, swallowed conversion errors, UserAPIKeyAuth fallback from db_batch_object, model_name resolution from unified_file_id, and early returns when managed_files_obj, model_id, or auth context are missing. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: Claude <noreply@anthropic.com>

…27921) * fix(router): use forwarded model_id for native Azure container IDs in _init_containers_api_endpoints Azure code-interpreter containers return provider-native IDs (cntr_ + hex) that carry no LiteLLM routing payload, so _decode_container_id returns model_id=None. The router was falling through to call the handler directly, bypassing _ageneric_api_call_with_fallbacks and leaving api_base=None for Azure deployments. Fall back to the model_id forwarded from the proxy ownership check so deployment credentials are always applied. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(azure-containers): strip /openai/responses path from api_base in AzureContainerConfig.get_complete_url When a deployment's api_base is the responses endpoint URL (e.g. .../openai/responses?api-version=...), AzureContainerConfig was appending /openai/containers on top of it, producing the broken path .../openai/responses/openai/containers. Azure returns 404 for that URL while the correct path is .../openai/containers. Strip any /openai/responses suffix from api_base before constructing the containers URL so the resource root is always used as the starting point. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(azure-containers): prefer api-version from api_base URL over deployment's api_version The deployment's api_version (e.g. 2024-08-01-preview) targets the chat/responses API and is too old for the containers API, which requires 2025-04-01-preview. The responses endpoint api_base already carries the correct api-version in its query string. Extract it and use it for the containers URL, overriding the stale deployment-level version. Fixes DELETE and file-upload operations returning 404 due to wrong api-version. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(containers): pass params=None instead of params={} to httpx to preserve api-version httpx erases a URL's query-string when params={} (empty dict) is passed, silently stripping ?api-version=2025-04-01-preview from every container POST/DELETE request. Azure's GET endpoints tolerate a missing api-version; POST (upload) and DELETE are strict, so those returned 404. Fix: use `params or None` in container_handler._async_handle and llm_http_handler.async_container_delete_handler (and all sibling container handlers) so that an empty params dict falls back to None, leaving httpx to preserve the URL's existing query string intact. Adds a regression test that directly documents the httpx behaviour. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(router): remove elif model_id branch from _init_containers_api_endpoints Two reviewer findings addressed: 1. Truncated comment on the model_id fallback line — now complete. 2. Security: the elif branch that fired when container_id was absent allowed any authenticated caller to supply model_id in a POST /v1/containers body and route the request through an arbitrary deployment UUID, bypassing the model-level access checks that only validate `model`. Removed the elif branch; operations without container_id (create, list) route by the caller-supplied `model` field as before. model_id forwarding is kept only inside the container_id block, where the proxy ownership check has already validated the container before forwarding the deployment ID. Adds a regression test pinning the security boundary: no-container-id path calls original_function directly even when model_id is in kwargs. Co-authored-by: Cursor <cursoragent@cursor.com> * test(containers): validate proxy-to-router model_id forwarding for managed IDs Add test_regression_get_container_forwarding_params_sets_model_id_for_managed_id to verify that get_container_forwarding_params (the proxy-side half of the Azure routing fix) correctly extracts and forwards model_id from a LiteLLM-managed encoded container ID. This closes the gap identified by Greptile P1: the previous regression test only injected model_id as a direct kwarg, validating the router in isolation. The new test exercises the actual proxy-to-router data flow through ownership.get_container_forwarding_params, confirming that kwargs["model_id"] is populated before _init_containers_api_endpoints is reached. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(azure-containers): tighten endpoint-path strip to endswith match Use path.endswith() instead of path.find() for _AZURE_ENDPOINT_PATHS so the suffix strip only fires when api_base actually ends with one of the endpoint-specific path suffixes. This is the more precise check greptile flagged on the original find()-based implementation. * Fix sync container handler to preserve URL query string Mirror the async path fix: pass None instead of an empty params dict so httpx does not strip the URL's existing query string (e.g. ?api-version=...), which is required for Azure container routing. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(azure-containers): strip trailing slash before endpoint suffix match Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(containers): recover model_id from stored encoded id for native Azure container IDs get_container_forwarding_params previously only set model_id when the user-supplied container_id was a LiteLLM-managed encoded id. For native upstream IDs (e.g. Azure 'cntr_<hex>') the decode fails and model_id was never forwarded — making the router-side fallback in _init_containers_api_endpoints unreachable in production. Fall back to the stored 'unified_object_id' on the ownership row, which is the encoded form captured at create time when the router selected a specific deployment. Decoding that yields the deployment model_id and restores router-based credential application (api_base, api_key) for retrieve/delete and container-file operations on native IDs. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Yassin Kortam <yassin@berri.ai>

When a new filter is applied to spend logs, React Query's keepPreviousData left stale rows on screen for 10–15s with no indication that a fetch was in progress. The previous custom isFilteringResults flag was removed in the #25847 toolbar refactor and only partially restored on the Fetch button. Use React Query's isPlaceholderData to discriminate a real filter change (queryKey changed, data not yet arrived) from a same-key live-tail refetch, and feed it into the existing isLoading prop on the toolbar pagination text and the table body. Live-tail polls still keep previous rows without flicker. Co-authored-by: Ryan <ryan@Ryans-MBP.localdomain>

* chore(e2e): migrate runner to uv, add All Proxy Models key test Switches the local e2e runner (run_e2e.sh) from poetry to uv to match the rest of the repo and CI. Adds a Playwright test for creating an admin key with no team selected (all-proxy-models flow), a SLOWMO env hook for headed debugging, and a MIGRATION_TRACKING.md doc that maps the manual UI QA checklist to e2e tests so future migration work has a single source of truth. * chore(e2e): address greptile feedback - Remove MIGRATION_TRACKING.md (docs belong in litellm-docs repo) - playwright.config.ts: fall back to 0 when SLOWMO is non-numeric (parseInt returns NaN, which Playwright accepts silently) - run_e2e.sh: add --frozen to uv sync for CI determinism

* feat(ui): team allowed_passthrough_routes create parity + edit load fix Add the Allowed Pass Through Routes selector to the create-team modal (previously only on the edit form), and fix the edit form silently dropping the field: it lives under team metadata, so initialValues must read info.metadata.allowed_passthrough_routes — otherwise the selector renders empty and saving wipes admin-set routes. Both selectors are gated to premium proxy admins, mirroring the server-side gate. Resolves LIT-3019 * fix(ui): persist team allowed_passthrough_routes edits on save The edit form loaded the selector but the save path never wrote it back: allowed_passthrough_routes stayed in the raw metadata JSON textarea and parsedMetadata (from that textarea) always won, so selector edits were silently discarded. Strip it from the textarea initialValues and overlay values.allowed_passthrough_routes into updateData.metadata, mirroring how guardrails is handled. Resolves LIT-3019 * fix(ui): preserve team passthrough routes for non-proxy-admins on save Only proxy admins may set allowed_passthrough_routes (server-side gate). For non-proxy-admins, write the team's stored value back into metadata instead of the form value, so saving an unrelated setting can't silently wipe routes; omit the key entirely when the team never had any. Resolves LIT-3019

…8227) * fix(mcp): JWT on tools/list, REST server_id resolution, tool_server_mismatch Sign outbound MCP JWTs for list_mcp_tools and inject headers on the tools/list path. Resolve server_id on /mcp-rest/tools/call and return 403 tool_server_mismatch when the tool does not belong to the requested server. Default missing arguments to {}. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): restrict list JWTs to mcp:tools/list and default REST arguments to {} - List-only JWTs (call_type=list_mcp_tools) no longer carry the broad mcp:tools/call scope. _build_scope() now emits only mcp:tools/list when no tool name is provided, mirroring the existing least-privilege rule that tool-call JWTs omit mcp:tools/list. - REST /tools/call now defaults a missing 'arguments' field to {} so execute_mcp_tool() and downstream **arguments / .keys() calls don't receive None and crash with TypeError/AttributeError. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): validate tool/server in call_tool; skip JWT signer when not configured or static auth present Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): align tests and mypy with user_api_key_auth on tools/list Update mocks for the new _get_tools_from_server parameter, mock server registry in REST access-denied test, and narrow static_headers for mypy. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(test): accept user_api_key_auth in get_tools_from_mcp_servers mock The side_effect for the all-servers case did not accept the new kwarg, so tools/list returned an empty list. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): fail fast for unknown tools when server mapping exists Server-name fallback in call_tool must not open an upstream session when the tool is absent from a populated mapping. Update the HTTP transport test to register a known tool before asserting not-found behavior. Co-authored-by: Cursor <cursoragent@cursor.com> * fix mypy * Fix mypy * fix(mcp): preserve tools/call scope on missing tool name; pass user_api_key_auth in list_tools Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): match alias/server_name in _resolve_mcp_server_for_tool_call The registry lookup in _resolve_mcp_server_for_tool_call previously only compared candidate.name against the provided server_name, but tool name prefixes can be derived from a server's alias or server_name (see get_server_prefix). When the tool→server mapping is empty/stale (cold start, dynamic tools), the lookup would fail for alias-configured servers even though get_mcp_server_by_name (used by the REST path) matches alias, server_name, and name. Match the same priority of identifiers in both the registry pass and the unprefixed fallback so the MCP protocol call_tool path is consistent with the REST path. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): reuse proxy_logging DualCache in inject_mcp_jwt_headers_for_upstream Instead of allocating a fresh DualCache() on every tools/list invocation, prefer the shared proxy_logging_obj.internal_usage_cache.dual_cache when available. The cache argument is currently unused by MCPJWTSigner, but sharing the proxy's cache avoids per-call allocation overhead and matches the cache identity used elsewhere in the proxy hook plumbing — so any future per-request state stored in cache will survive across list calls. Co-authored-by: Claude <noreply@anthropic.com> * fix(mcp): return 403 ip_filtering for IP-restricted servers in tools/call name lookup Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(test): accept user_api_key_auth kwarg in list_tools mocks The proxy-infra job was failing on four TestMCPServerManager tests because the mock_get_tools_from_server stubs did not accept the new user_api_key_auth keyword argument that list_tools now forwards to _get_tools_from_server. Add the kwarg to each stub so list_tools can call through cleanly. Co-authored-by: Claude <claude@anthropic.com> * fix(mcp): skip JWT injection when per-user mcp_auth_header is set MCPClient._get_auth_headers() applies extra_headers AFTER writing Authorization from auth_value, so an injected JWT silently overwrites the user's per-server OAuth token. Guard the JWT signer with 'not mcp_auth_header' so per-user OAuth (and any dict-form per-user auth) takes precedence, mirroring the existing static_headers guard. Adds a regression test that the signer's inject helper is not called when mcp_auth_header is supplied. * fix(mcp): skip JWT injection when extra_headers already has Authorization When a server uses per-user OAuth tokens, the resolved token is passed into _get_tools_from_server via extra_headers. The JWT injection guard only checked mcp_auth_header and the server's static headers, so the signer would silently overwrite the user's OAuth Authorization header. Add a check for an existing Authorization entry in extra_headers so caller-supplied per-user OAuth tokens take precedence over JWT signing. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(mcp): cover JWT signer + tool-call resolution branches Adds unit tests for the new MCPServerManager helpers (_resolve_mcp_server_for_tool_call, _resolve_oauth2_headers_for_tool_call) and the new MCPJWTSigner paths (_build_scope call_type branches and inject_mcp_jwt_headers_for_upstream). Brings patch coverage above the auto target without changing behavior. Co-authored-by: Claude <claude@anthropic.com> * fix(mcp): retry tool-server lookup with prefixed name in REST mismatch check When the REST /mcp-rest/tools/call path sends a raw tool name plus requested_server_id, _get_mcp_server_from_tool_name(name) can return None if the mapping only stores the prefixed form. That bypassed the tool_server_mismatch 403 guard and let the call fall through to trusting requested_server. Retry the lookup with every known prefix of the requested server so the mismatch check fires whenever the tool is actually registered. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): always reject unknown tools in server-name fallback Defense-in-depth: _resolve_mcp_server_for_tool_call previously skipped the unknown-tool check whenever the per-server mapping had no entries yet (cold start, OAuth2 lazy listing, or upstream listing failure), allowing arbitrary tool names to reach upstream servers. Tighten the check so the server-name fallback always rejects tool names not present in the mapping. Callers must call list_tools first (standard MCP flow) before tools/call can resolve. Removes the now-unused _mapping_has_tools_for_server helper and adds an explicit empty-mapping rejection test alongside the existing populated-mapping rejection test. Co-authored-by: Sameer Kankute <sameer@berri.ai> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Claude (greptile subagent) <claude-greptile-bot@anthropic.com>

…May 2026) (#28153) * feat(interactions): migrate to Google Interactions API steps schema (May 2026) Default to Api-Revision: 2026-05-20 (new `steps` schema). Add `litellm.use_legacy_interactions_schema` global flag that sends Api-Revision: 2026-05-07 for operators who need the legacy `outputs` schema until June 8, 2026. - Inject Api-Revision header in GoogleAIStudioInteractionsConfig.validate_environment() - Auto-coalesce response_mime_type → response_format and image_config migration on new schema - Add steps field to InteractionsAPIResponse and InteractionsAPIStreamingResponse - Add StepStart/StepDelta/StepStop/InteractionCreated/etc. SSE event types - Update streaming completion detection to handle interaction.completed event - Bridge transformer populates both outputs and steps fields - Bridge streaming iterator emits new-schema events by default Co-authored-by: Cursor <cursoragent@cursor.com> * fix(interactions): address greptile review feedback - Avoid mutating caller's generation_config dict by shallow-copying before popping image_config, preventing silent failures on retries - Skip schema key in response_format when response_format is None to avoid sending schema: null to the Google Interactions API - Remove delta field from step.stop events (new schema only); the StepStop model has no delta field and sending it duplicates already- streamed text and breaks spec-conformant clients Co-authored-by: Cursor <cursoragent@cursor.com> * fix(proxy): parse use_legacy_interactions_schema string values safely bool("false") returns True in Python, so quoted YAML values like "false" or "False" silently activated the legacy Interactions API schema. Match the env-var parsing pattern in litellm/__init__.py by treating string inputs as true only when they equal "true" (case insensitive). Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(interactions): only set object/id/delta on step.stop for legacy schema StepStop (new schema) has no object, id, or delta fields. Setting them unconditionally caused spec-breaking extra fields on new-schema step.stop events in all four construction sites (sync/async × main-loop/StopIteration). Legacy content.stop still receives id, object, and delta unchanged. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(interactions): stabilize streaming bridge schema, dict aliasing, and lost first delta - Capture use_legacy_interactions_schema once at iterator construction so all events emitted by a single stream use a consistent schema, even if the global flag is mutated mid-stream. - Check for the buffered interaction.complete/completed event before the finished check in __next__/__anext__ so the final completion event (which carries the full collected text in steps) is not dropped after self.finished is set. - Copy text content entries before appending to both outputs and the steps content list to avoid shared mutable dict aliasing between the two response fields. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix tests * fix greptile review * fix(interactions): address Greptile P1 review on schema coalescing and legacy deltas Skip response_mime_type merge when response_format is already a list, avoid in-place list mutation on image_config append, and restore delta.type on legacy content.delta events. Co-authored-by: Cursor <cursoragent@cursor.com> * style(interactions): black-format gemini transformation.py Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: Claude <noreply@anthropic.com>

* test(ui-e2e): add admin key creation with a specific proxy model Adds Playwright coverage for creating a key (no team) scoped to a single proxy model, complementing the existing All-Proxy-Models test. Uses a DOM-dispatched click on the antd dropdown option since the popup animation can render the option outside the viewport. * test(ui-e2e): verify scoped key works against mock /chat/completions Extend the "Create a key with a specific proxy model" test to extract the new key from the success modal and POST to /chat/completions for the scoped model, asserting 200 and the mock response body. Without this the test could pass even if the model selection failed to register.

#28324) * fix(vertex_ai): omit function_call id on Vertex Gemini 3.5+ tool turns Vertex AI rejects `id` on function_call/function_response parts; only Google AI Studio accepts it for Gemini 3.5+ strict tool matching. Co-authored-by: Cursor <cursoragent@cursor.com> * Update litellm/llms/vertex_ai/gemini/vertex_and_google_ai_studio_gemini.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix(vertex_ai): forward custom_llm_provider in context caching Pass custom_llm_provider through to _gemini_convert_messages_with_history in the context caching path so Gemini 3.5+ tool-call `id` forwarding behaves consistently between cached and non-cached completions on Google AI Studio. Co-authored-by: Claude <claude@anthropic.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Claude <claude@anthropic.com>

* feat(mcp): allow native MCP OAuth redirect URIs (cursor://) Discoverable OAuth /authorize rejected cursor:// callbacks because validate_trusted_redirect_uri only accepted http/https. Add an allowlisted native path with a built-in Cursor default and optional MCP_TRUSTED_NATIVE_REDIRECT_URIS env for other clients. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): address Greptile native redirect URI review Lowercase paths in normalizer so env allowlist entries match case- insensitively. Tighten wildcard prefix matching to reject sibling paths (e.g. callback-2) unless the prefix ends with /. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): reject query params on native OAuth redirect URIs Greptile: normalization stripped query strings before allowlist compare, so cursor://.../callback?injected=... could pass validation. Reject any native redirect_uri with a query component (same as fragments). Co-authored-by: Cursor <cursoragent@cursor.com> * fix(model_cost_map): add mistral/ministral-8b-2512 entry Mistral rotated the 'mistral/mistral-tiny' alias to return 'ministral-8b-2512' as the response model, which is not in the cost map. This caused test_completion_mistral_api and test_completion_mistral_api_modified_input to fail in completion_cost lookup. Add the entry mirroring the existing openrouter/mistralai/ministral-8b-2512 pricing. * fix(mcp): lowercase default native redirect URIs Make _parse_trusted_native_redirect_uris apply the same lowercasing to built-in defaults as it does to env-var entries. * fix(tests): backfill local model_cost into remote-fetched map litellm.model_cost is loaded at import time from the URL pinned to main, so pricing entries that exist only in this branch (e.g. mistral/ministral-8b-2512, freshly added because Mistral now returns this id from mistral-tiny) are absent at test time and completion_cost lookups raise. Backfill the in-tree backup so cassette-driven cost calculations resolve against the entries that ship with the branch under test. Fixes the local_testing_part1 failures on test_completion_mistral_api and test_completion_mistral_api_modified_input. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> Co-authored-by: Claude <claude@anthropic.com>

…nal completion (#28394) * fix(interactions): never drop streamed text deltas; always emit terminal completion The interactions streaming bridge had two bugs flagged by Greptile on PR #28153: 1. The first OutputTextDeltaEvent (and the second, when no ResponseCreatedEvent precedes the deltas) was consumed to emit a synthetic interaction.created / step.start event, but the chunk's text payload was never forwarded as a step.delta. The text only reappeared in the terminal step.stop, which defeats the purpose of incremental streaming. 2. When the upstream Responses API stream ended via StopIteration without a ResponseCompletedEvent, the iterator emitted step.stop but never the terminal interaction.completed event carrying the full collected text. This refactors the iterator to translate each upstream chunk into a list of events (instead of a single event) and buffers them in a deque. A text delta now expands into [interaction.created, step.start, step.delta] on the first chunk so no token is dropped, and the StopIteration / StopAsyncIteration fallback always flushes a terminal interaction.completed event when one hasn't already been sent. Both behaviors are covered by new unit tests: - test_no_text_token_is_dropped_during_streaming - test_response_created_then_text_delta_emits_step_start_and_delta - test_stop_iteration_fallback_emits_completion_event - test_response_completed_emits_stop_then_completion (no double-emit) Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(interactions): correlate EOF terminal events with stream's interaction id The StopIteration fallback path previously built the terminal step.stop / interaction.completed events with id=None (legacy content.stop) and a memory-address fallback string (interaction.completed), neither of which matched the item_id used by the earlier interaction.created / step.start / step.delta events in the same stream. Downstream consumers correlating events by id would see a mismatch. Persist the interaction id derived from the first upstream chunk (item_id on an OutputTextDeltaEvent, or response.id on a ResponseCreatedEvent) and reuse it when flushing the terminal events on EOF. Author: mateo-berri <277851410+mateo-berri@users.noreply.github.com> * ci(windows): raise UV_HTTP_TIMEOUT to 300s for uv sync The using_litellm_on_windows job has been hitting flaky PyPI download timeouts during 'uv sync --frozen --group dev' — different packages on each rerun (six, pydantic-core), all surfacing the same uv error: Failed to download distribution due to network timeout. Try increasing UV_HTTP_TIMEOUT (current value: 30s). uv's default 30s per-request timeout is too tight for the Windows runner on this project (50+ deps, several multi-MB wheels), so bump it to 300s to let slow individual downloads complete instead of failing the build. * fix(interactions): correlate ResponseCompletedEvent terminal events with stream's interaction id When a stream starts directly with OutputTextDeltaEvent (no preceding ResponseCreatedEvent), interaction.created carries item_id while interaction.completed previously carried response.id from ResponseCompletedEvent. The two ids can differ, leaving consumers that correlate events by id unable to match the start and completion events. Fall back to self._interaction_id (set on the first chunk that derives an id) before response.id, mirroring the EOF terminal path. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

…28395) * fix(proxy): expose Prisma idle/connect timeout + extra DB URL params Operators have reported large numbers of idle Prisma connections that never get closed. The proxy already forwards `connection_limit` and `pool_timeout` to the DATABASE_URL, but had no knob for capping idle or slow connections. Add three new `general_settings` keys that thread through to the DATABASE_URL / DIRECT_URL query string: - `database_connect_timeout` -> Prisma `connect_timeout` - `database_socket_timeout` -> Prisma `socket_timeout` (the main knob for closing idle connections from the LiteLLM side) - `database_extra_connection_params` -> untyped passthrough dict for any other Prisma URL param (`pgbouncer`, `statement_cache_size`, `sslmode`, ...); keys here override LiteLLM defaults. Refactors the duplicated DATABASE_URL/DIRECT_URL param dicts into a single `_build_db_connection_url_params` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Update litellm/proxy/proxy_cli.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> --------- Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

@greptile-apps

* feat: add Xiaomi MiMo-V2.5-Pro and MiMo-V2.5 OpenRouter model entries (#27700) Squash-merged by litellm-agent from TorvaldUtne's PR. * fix(ui): trim whitespace from MCP inspector tool call inputs (#28203) Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * gemini-3.1-flash-lite pricing (#27933) * feat(model_prices): add gemini-3.1-flash-lite pricing with standard/batch/flex/priority tiers * fix pricing * add service tier --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> * fix: incorrect /v1/agents request example (#28131) * fix(anthropic): accept dict-shape reasoning_effort from Responses bridge (#28201) * fix(anthropic): accept dict-shape reasoning_effort from Responses bridge Issue #28196 — the Responses->Chat parser (transformation.py:184-200) keeps the full dict as reasoning_effort when summary is set; that branch was added in #25359. But the Anthropic transformation here still guarded on isinstance(value, str), silently dropping the param. Result: callers using the standard Reasoning(effort, summary) OpenAI-shaped object on Anthropic lose thinking entirely (0 reasoning_tokens, no thinking_blocks). Coerce dict -> string before mapping. Same shape tolerance that gpt_5_transformation._normalize_reasoning_effort_for_chat_completion already implements. summary is irrelevant for Anthropic's thinking_blocks. Adds two regression tests: one parametrized over string + dict shapes (with and without summary), one covering unparseable dict inputs (drops silently, no crash). * test(anthropic): add non-adaptive model coverage for dict-shape reasoning_effort Per Greptile feedback on PR #28198: the original regression test only exercised the adaptive (4.6+) path. Add a parametrized test for the non-adaptive branch (claude-sonnet-4-5) verifying that dict-shape reasoning_effort still maps to thinking.type='enabled' + budget_tokens, and that output_config is NOT set on pre-4.6 models. * test(anthropic): convert unparseable-dict test to @pytest.mark.parametrize Per @greptile-apps inline review on PR #28201 — matches the parametrize style of the two adjacent dict-shape tests and produces clearer failure messages (test ID per case instead of one collapsing for-loop). * feat: add pricing entry for openrouter/google/gemini-3.1-flash-lite (#28280) Squash-merged by litellm-agent from ro31337's PR. * fix(router): wrap aresponses streaming iterator for mid-stream fallbacks (#28215) Squash-merged by litellm-agent from cwang-otto's PR. * fix(router): unblock staging — mypy + coverage for aresponses streaming fallback (#28318) Squash-merged by litellm-agent from cwang-otto's PR. * fix(responses): forward timeout on completion transformation path (Anthropic, Bedrock, Vertex) (#28133) Squash-merged by litellm-agent from cwang-otto's PR. * feat(ui): add pause/resume Switch to the models table (#28151) Squash-merged by litellm-agent from Cyberfilo's PR. * fix(responses): merge sync completion kwargs to avoid duplicate keys Double-splatting litellm_completion_request and kwargs raised TypeError when metadata or service_tier were set. Match the async merge pattern. Co-authored-by: Cursor <cursoragent@cursor.com> * Use proxy base URL for CLI SSO form action (#28271) Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * fix(tests): add mistral/ministral-8b-2512 to cost map and backfill in conftest Mistral rotated the 'mistral/mistral-tiny' alias to return 'ministral-8b-2512' as the response model, which was missing from the cost map. This caused test_completion_mistral_api and test_completion_mistral_api_modified_input to fail in litellm.completion_cost lookup. - Add mistral/ministral-8b-2512 entry to both the in-tree model_prices_and_context_window.json and the bundled litellm/model_prices_and_context_window_backup.json (mirrors the existing openrouter/mistralai/ministral-8b-2512 pricing). - litellm.model_cost is loaded at import time from the URL pinned to main, so the new backup entry isn't visible at test runtime until it also lands on main. Backfill any entries missing from the remote-fetched map into litellm.model_cost in the local_testing conftest so cost-calculator lookups succeed on this branch. * fix(tests): drop unnecessary del of conftest backfill loop vars * fix(router): harden streaming fallback wrapper for bridge iterators - FallbackResponsesStreamWrapper now uses getattr fallbacks when copying attributes from the source iterator. The bridge path (LiteLLMCompletionStreamingIterator used by Anthropic/Bedrock/Vertex) does not call super().__init__ and is missing response, logging_obj (it uses litellm_logging_obj), responses_api_provider_config, start_time, request_data, call_type, and _hidden_params. Previously, wrapper construction raised AttributeError for any streaming fallback on the bridge path. - _aresponses_with_streaming_fallbacks now deep-copies the litellm_metadata (and metadata) dicts into fallback_kwargs. The primary attempt mutates this dict in place via _update_kwargs_with_deployment, so a shallow copy of kwargs was leaking primary-deployment fields (deployment, model_info, api_base) into the mid-stream fallback request. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(router): use safe_deep_copy for fallback metadata snapshot The ban_copy_deepcopy_kwargs CI check rejects copy.deepcopy() on any variable whose name contains 'kwargs' (incl. fallback_kwargs). Swap the two copy.deepcopy(fallback_kwargs[...]) calls for safe_deep_copy, which handles non-picklable values (OTEL spans, etc.) by per-key deepcopy with fallback to the original reference. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(ci): skip chronically flaky build_and_test integration tests Both tests have been failing on every recent run of build_and_test against this PR's HEAD (1686967, 1688402, 1689993, 1690877), and the same two tests also fail intermittently on unrelated commits and other branches, independent of any code change in this PR (which only touches router fallback wrappers, the Anthropic Responses bridge, and unrelated UI/cost-map files). - tests.test_spend_logs.test_spend_logs: /spend/logs?request_id=... returns 500 even after a 20s wait for the spend log to be written. Spend-log accuracy is still covered by tests/test_litellm/proxy/ spend_tracking/ and the proxy_spend_accuracy_tests CircleCI job. - tests.test_team_members.test_add_multiple_members: /team/info?team_id= ... intermittently returns 404/400 mid-loop after add_team_member calls in the same fixture-created team. Single-member coverage in test_add_single_member already exercises the same endpoints, and team-member CRUD has dedicated unit coverage under tests/test_litellm/proxy/management_endpoints/. Skipping unblocks the build_and_test job until the underlying race in the dockerized integration setup is root-caused. * fix: preserve explicit timeout=0 in responses API handler Use 'timeout if timeout is not None else request_timeout' instead of 'timeout or request_timeout' so an explicit timeout=0/0.0 isn't silently replaced by the default request_timeout. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(ui): guard model_info access in pause Switch with optional chaining * fix(ui): guard model_info access in pause Switch onChange handler Mirror the optional-chaining guard already applied to the isPausing check so a config-model row with a missing model_info cannot throw when the toggle's onChange fires. --------- Co-authored-by: TorvaldUtne <78661304+TorvaldUtne@users.noreply.github.com> Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai> Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: mubashir1osmani <mubashir.osmani777@gmail.com> Co-authored-by: Isha <72744901+IshaMeera@users.noreply.github.com> Co-authored-by: cwang-otto <chengxuan.wang@ottotheagent.com> Co-authored-by: Roman Pushkin <roman.pushkin@gmail.com> Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: boarder7395 <37314943+boarder7395@users.noreply.github.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix: serialize guardrail_response to JSON in OTEL traces Guardrail spans previously set the `guardrail_response` attribute via `safe_set_attribute`, which let dict payloads reach the OTEL exporter as Python repr strings. Downstream log pipelines could not parse those as JSON, breaking metric creation from guardrail traces. Serialize `guardrail_response` with `safe_dumps` before setting the attribute, matching how `masked_entity_count` is already handled. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: cover dict-serialization and None-skip for guardrail_response Address Greptile feedback on #28362 — add explicit coverage for the two behavioral guarantees of this fix: - Dict payloads (the OpenAI moderation case in the report) reach the span as a JSON string, not a Python repr. - ``None`` guardrail_response skips the attribute entirely, so no ``"null"`` leaks into traces. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(proxy): strict media-type match for form bodies (#27939) * chore(proxy): strict media-type match for form bodies ``_read_request_body`` and ``get_request_body`` routed on ``"form" in content_type`` / ``"multipart/form-data" in content_type``, which match any header containing the literal — ``application/form-json``, ``multiform/anything``, ``application/json; xform=1``. Starlette's ``request.form()`` returns an empty ``FormData`` for any non-canonical type without consuming the body, so the auth-time pre-read saw ``{}`` and skipped the banned-param check while the handler's later ``request.body()`` saw the original JSON payload. Parse the media type per RFC 7231 (substring before ``;``, trimmed, lowercased) and accept only ``application/x-www-form-urlencoded`` and ``multipart/form-data``. Replace both substring sites with the shared ``_is_form_content_type`` helper. Tests pin: case/whitespace/charset variants of the two real types match; ``application/form-json`` and similar substring-match traps fall through to the JSON parse path; real form POSTs continue to route through ``request.form()``. * chore(proxy): extract _is_json_content_type symmetric helper Mirror ``_is_form_content_type`` for the JSON branch of ``get_request_body`` so both classifications share the same media-type normalisation (strip params, trim, lowercase) and any future change to the parsing rules has one place to update. Adds tests for ``_is_json_content_type`` and for ``get_request_body`` covering the canonical JSON / form / unsupported / non-POST paths. * chore(proxy): surface form-parse failures instead of caching empty body Starlette's ``request.form()`` raises ``MultiPartException`` / ``ValueError`` / ``AssertionError`` on malformed multipart input (missing boundary, malformed chunk encoding, etc.). The outer ``except Exception: return {}`` swallowed every form-parse failure and cached an empty parsed body — auth-time pre-reads saw ``{}`` and skipped every banned-param check while a later raw-body re-read in the handler still saw the original payload. Same TOCTOU shape as the substring-match bypass: the auth gate and the handler don't agree on what the body is. Wrap ``request.form()`` in a narrow ``try`` that converts any parse failure to a 400 ``ProxyException``. The outer broad ``except`` is retained for unrelated unexpected errors but no longer covers form-parse-side bypass shapes. Adds a regression test parametrised over the exception classes Starlette can raise from ``request.form()``. * chore(proxy): drop redundant _is_json_content_type test class ``_is_json_content_type`` is a 3-line wrapper around the shared ``_normalize_media_type`` helper. Positive coverage lives in ``TestGetRequestBody.test_json_with_charset_param_parses_as_json``; negative coverage is covered transitively by ``TestIsFormContentType``'s non-form parametrize matrix (anything that isn't a form type falls through to the JSON branch). * chore(proxy): carry ASGI path into WebSocket auth synthetic Request (#27940) ``user_api_key_auth_websocket`` built a synthetic ``Request`` with a two-key scope (``type`` + ``headers``) and set ``request._url = websocket.url``. ``get_request_route`` reads ``scope.get("path", ...)`` and falls back to ``request.url.path`` only when ``path`` is absent. For the WebSocket flow that fallback fires and resolves to the Host-header-derived value (Starlette reconstructs ``websocket.url`` from the Host header), so a malformed Host collapses the resolved route and lets the auth gate compare against the wrong value. Carry the ASGI scope's ``path``, ``root_path``, and ``app_root_path`` into the synthetic scope so the lookup never reaches the fallback on the legitimate path. Regression test pins that the request handed to ``user_api_key_auth`` has ``scope["path"]`` equal to the ASGI scope's path. --------- Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>

…28424) xAI's Grok Voice Agent API now sends session.created as its first realtime event (matching OpenAI), followed by conversation.created. The E2E canary pinned the old conversation.created value and failed. LiteLLM's xAI realtime path is a verbatim passthrough (provider_config is None, raw forwarding), so the event ordering is xAI's own — no transformation on our side. Update the pinned expected value and the now-stale comments to match the current API behavior.

* test(proxy_behavior): scaffold session-scoped async ASGI client + liveness smoke Slice 2 of the management-endpoints behavior-pinning effort. New top-level dir tests/proxy_behavior/management/ outside every existing pytest glob. conftest.py initialises the proxy app once per session against the DATABASE_URL the harness boots Postgres at, wraps it in httpx.AsyncClient via in-process ASGITransport. The one smoke test asserts /health/liveliness returns 200, which exercises the full FastAPI middleware stack against a real app — no mocks. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): connect prisma via real lifespan; key/generate de-risk Slice 3 of the management-endpoints behavior-pinning effort. The fixture now enters the real FastAPI lifespan (proxy_startup_event) instead of just calling initialize() — that is where prisma_client is connected, password migration is kicked off, and the rest of the startup wiring runs. Tests pin the loop to the session scope so the AsyncClient created in the session fixture and the prisma connection opened in the lifespan share the same loop as the test bodies. New de-risk smoke: POST /key/generate with the master key returns 200, the returned sk- token resolves to a hashed row in LiteLLM_VerificationToken, and the cleartext token is never stored. Proves auth + handler + helper + prisma all wire together end-to-end against a real Postgres. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): seed 8-actor read-world for the authz matrix Slice 4 of the management-endpoints behavior-pinning effort. New ``actors.py`` defines the actor enum + seeds an immutable world (2 orgs, 2 teams, 8 users, 8 verification tokens) under the ``behavior-pin-`` prefix so the rows are identifiable in psql and ``_wipe_world`` is targeted. Each actor key is created with its cleartext form generated locally and its hashed form (via ``litellm.proxy.utils.hash_token``) stored in ``LiteLLM_VerificationToken`` — so the real ``user_api_key_auth`` accepts the cleartext bearer token. Roles, ``team_id``, ``organization_id``, and the service-account metadata flag are all set on the seeded rows so the auth layer resolves the same scopes a real proxy would. The session-scoped ``world`` fixture re-seeds at session start (idempotent via wipe-then-create), and the smoke test confirms each of the 8 actor keys can call ``/key/info`` on itself and receive its own row back. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): per-test scratch namespace + targeted delete_many teardown Slice 5 of the management-endpoints behavior-pinning effort. Adds the ``scratch`` function-scoped fixture: each test gets a uuid4-derived namespace prefix, tags writes with it (``key_alias``, ``team_alias``, ``user_id``, ``budget_id``), and the fixture teardown ``delete_many``-s any row whose namespace column starts with that prefix. Cleanup uses Prisma model methods only (no raw SQL, per CLAUDE.md) and orders deletes children-before-parents to avoid FK conflicts. The Slice 3 de-risk smoke is migrated onto the same fixture so it stops accumulating untagged tokens across repeated local runs. Smoke proves both halves of the contract: one test writes a scratch-tagged key and asserts it lands; a second test runs after the first's teardown and asserts no rows in the scratch namespace survived. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): codify G3 (strict-import grep) as a pytest item Slice 6 of the management-endpoints behavior-pinning effort. Two new tests walk every .py file under tests/proxy_behavior/ and assert: * no ``from litellm.proxy.management_endpoints`` import — the suite is deliberately constrained to the HTTP boundary so it survives handler refactors; * no ``mock``/``patch`` on ``user_api_key_auth`` — mocking auth is the structural failure mode of the existing 11k-line mock suite, and the point of this harness is that the real auth layer runs. Codifying G3 as a CI test removes the "did someone forget to check the PR-description checklist" failure mode. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * style(proxy_behavior): apply black to G3 grep test Follow-up to 6f588c7 — line-length fixes only, no behavior change. * test(proxy_behavior): pin /key/generate authz matrix (18 scenarios) Slice 7 of the management-endpoints behavior-pinning effort. Parametrized matrix across two axes: actor (8 seeded) × target scope (self, team_alpha in org_a, team_beta in org_b). 18 scenarios after dropping non-applicable combos. Whole-suite wall-time stays at ~4.7s (well under the 10-min G2 budget for the eventual CI job). While pinning, the test surfaced one seed gap: ``_get_user_in_team`` reads ``members_with_roles`` (a JSON list of ``{user_id, role}``), not the plain ``members`` String[]. Both columns are now populated in the seed to match what the real ``/team/new`` handler would produce. Expected status codes are intentionally heterogeneous (200, 400, 401) because the current handler emits different statuses depending on which check fails first (role gate, team-member-perm gate, "not assigned" check). Pinning the *observed* codes — not what they "should" be — is exactly the regression signal we want. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): pin /key/info authz matrix (24 scenarios) Slice 8 of the management-endpoints behavior-pinning effort. 8 actors × 3 target keys (own, OWNER's key in org_a, CROSS_ORG_USER's key in org_b) covering self-read, same-team-peer read, and cross-org read. Notable pinned behaviors (intentionally surfaced for review, not "fixed"): * ORG_ADMIN gets 403 on individual key info even within their own org — visibility is scoped to "your own keys" + "your team's keys", not "your org's keys". * Same-team peers (INTERNAL_USER, UNRELATED_SAME_ORG, SERVICE_ACCOUNT) DO see each other's keys. Whether that is desired is for the team to decide; this PR only pins the existing behavior so unintentional changes flip the matrix red. Wall-time is unchanged (~4.3s for the slice on its own). Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): pin /key/list default-visibility matrix (8 scenarios) Slice 9 of the management-endpoints behavior-pinning effort. For /key/list the response IS the matrix: each of the 8 seeded actors calls the endpoint with default filters and the test asserts set-equality between the returned visible-token set (filtered to seeded tokens only, so unrelated rows can't flap the assertion) and a pinned expected actor-set. Pinned default visibility: * PROXY_ADMIN sees all 8 actors' keys. * Every other actor sees only their own key — including ORG_ADMIN (which had broader expectations going in but currently behaves same-as-internal-user for /key/list defaults) and TEAM_ADMIN (no team-aggregation without include_team_keys=true). Future changes that broaden or narrow any single actor's default visibility will turn this matrix red — exactly the regression signal we want. Parameter-driven views (include_team_keys, filters) are deferred to Slice 13 / PR2 follow-up. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): pin /key/update authz matrix + mutation re-read (21 scenarios) Slice 10 of the management-endpoints behavior-pinning effort. 8 actors × 3 target shapes (self-owned, OWNER-scoped in org_a/team_alpha, CROSS_ORG_USER-scoped in org_b/team_beta) = 21 applicable scenarios. Each test: 1. Master-key-seeds a fresh scratch key with the target's (user_id, team_id) scope (so the read-world stays untouched). 2. Has the actor under test POST /key/update flipping ``models`` to a known marker list. 3. Asserts the status code AND the DB row's ``models`` field — present when 200, unchanged otherwise — so a handler that silently mutates on a denied response surfaces red. Observed gating (pinned, not endorsed): * PROXY_ADMIN bypasses every check. * ORG_ADMIN is blocked by an early role gate, always 401. * Every other (INTERNAL_USER-rolesed) actor hits one of three failure modes — 403 "user can only create keys for themselves", 403 "only proxy admins, team admins, or org admins", or 401 "team_member_permission_error" — depending on whether they own the target and whether they're a team admin / member of its team. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): pin /key/regenerate authz matrix + rotation contract (22 scenarios) Slice 11 of the management-endpoints behavior-pinning effort. 21 matrix scenarios (8 actors × 3 target shapes, minus the cross_org/owner combo that exists in the seed but isn't applicable) plus one smoke for the ``/key/{key:path}/regenerate`` route registration. On 200 outcomes the test verifies the full rotation contract: * the regenerate response key differs from the old cleartext, * the OLD cleartext returns 401 on a follow-up ``/key/info``, * the NEW cleartext returns 200 on a follow-up ``/key/info``. On denied outcomes the test verifies the OLD cleartext still works — catching any handler that mutates the token row on a failed call. Pinned authz divergence vs /key/update: regenerate routes most denials through the team-member-perm 401 path rather than the role-gate 403 path. The matrices for both endpoints are now in tree side-by-side, so any future refactor that "harmonises" the codes will turn one of the two red. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): pin /key/delete authz matrix + post-delete contract (21 scenarios) Slice 12 of the management-endpoints behavior-pinning effort. Mirrors slices 10/11. On success: cleartext can no longer authenticate (handles both hard-delete and soft-delete to LiteLLM_DeletedVerificationToken). On denial: row survives and cleartext still authenticates. Notable behavior gap with /key/update: same-team peers (internal_user, unrelated_same_org, etc.) get 403 on /key/delete for OWNER's key — i.e. cannot delete each other's keys — whereas they CAN read each other's keys (Slice 8). Delete is stricter than read. Pinned as-is. Cumulative whole-suite wall-time is 5.9s for all 128 tests on the local runner — well under the 10-min G2 budget for the CI job in Slice 13. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * ci(proxy-mgmt-behavior): add PR-triggered workflow for the behavior suite Slice 13 of the management-endpoints behavior-pinning effort. New workflow ``test-unit-proxy-mgmt-behavior.yml`` fires ``on: pull_request`` for the same branch set every other proxy unit-test workflow watches (main, litellm_internal_staging, litellm_oss_branch, litellm_**). It delegates to the existing reusable ``_test-unit-services-base.yml`` with ``enable-postgres: true``, which already provisions a postgres:14 service container and runs ``prisma db push`` against it before pytest collects. ``reruns: 0`` because a behavior-pinning matrix that needs reruns is itself a regression — flakes are signal. ``timeout-minutes: 15`` gives generous headroom over the local 5.9s whole-suite wall-time; the binding G2 budget is 10 min. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * docs(proxy_behavior): G4 regression-replay table for Key Tier-1 Slice 14 of the management-endpoints behavior-pinning effort. Documents the regression-replay verification methodology + a 12-row table mapping recent fix-PRs touching key_management_endpoints.py to the catching scenarios in the PR1 matrix. One canonical RED→GREEN cycle is captured verbatim — c7c3df2 "extend /key/update admin check to non-budget fields". Under the parent-of-fix code, 6 scenarios in test_key_update.py flip from 200 to 403; under HEAD code, all 21 pass. The handler swap is the only change between the two runs, confirming the matrix catches the behavior shift the fix introduced. The table also calls out 4 genuine coverage gaps deferred to PR2/PR3: 404-on-missing-key, budget-limit counter assertions, /key/regenerate upperbound enforcement, and /key/list filter-param views. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * chore(mutmut): include the behavior suite in tests_dir + G5 triage stub Slice 15 of the management-endpoints behavior-pinning effort. Appends ``tests/proxy_behavior/management/`` to ``[tool.mutmut].tests_dir`` so the existing mutation-test workflow runs against both the legacy mock suite AND the new behavior suite — the latter is where the regression signal will actually surface. Adds a stub at ``tests/proxy_behavior/management/mutmut_triage/pr1.md`` documenting the G5 triage protocol (zero unreviewed survivors in the 6 Tier-1 handler functions) and a placeholder baseline-metrics table to fill in after the first manually-triggered mutmut run completes — runs take hours and run on a manual cadence, so PR1 ships with the wiring + protocol, not the numbers. The actual baseline is recorded in a follow-up once ``gh workflow run mutation-test.yml`` finishes. The kill rate stays telemetry-only, never a gate. G5 (per-survivor classification) is the binding mutation gate. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * docs(proxy_behavior): suite README with local-repro + conventions + gates Slice 16 of the management-endpoints behavior-pinning effort. The README documents: * The same three commands the CI workflow runs locally (BYO-DATABASE_URL, no new tooling). * Suite layout — what each test file covers, which slice it lands. * The asyncio loop_scope convention required for session fixtures (httpx AsyncClient + prisma connection) to share a loop with each test body. * G3 strict-import convention + the test that enforces it. * Read-world vs scratch-world fixture conventions. * Behavior-pinning philosophy: pin observed codes; flag, don't judge. * Where each G1–G5 + PR1.M1–M3 gate's evidence lives. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * ci(proxy-mgmt-behavior): drop xdist (workers=0) to fix seed race First run on PR #28321 failed with UniqueViolation on ``behavior-pin-budget`` plus cascading missing-membership FK errors. Both xdist workers entered ``seed_world()`` concurrently against the shared Postgres service container; whichever lost the race left the world in a half-seeded state and downstream tests ran against missing team_membership rows. Whole-suite wall-time is ~7s sequentially, so disabling xdist here costs nothing — and the seed itself is the wrong place to add per-worker isolation (the world is intentionally shared so set-equality assertions in /key/list have a deterministic expected set). * ci(proxy-mgmt-behavior): seed scratch keys via proxy_admin actor, not master Second CI run failed: ``/key/generate`` with explicit ``user_id`` returned 403 "User can only create keys for themselves. Got user_id=X, Your ID=None" in every test that called ``_create_scratch_key`` with a per-actor user_id. The bare master key's auth path was producing ``user_id=None`` in the fresh CI Postgres, which doesn't trigger the PROXY_ADMIN bypass in ``_user_can_only_create_keys_for_themselves`` reliably. Locally the same master key path worked, masking the issue. Fix: every ``_create_scratch_key`` helper now takes a seeder cleartext and the test bodies pass ``world.keys[Actor.PROXY_ADMIN].cleartext``. That actor was seeded with ``user_role=PROXY_ADMIN`` AND a concrete ``user_id``, so the bypass fires deterministically in both environments. No behavior shift in the matrices themselves — all 128 scenarios still pass locally; only the setup helper's auth identity changed. The bare-master smoke (test_smoke + test_scratch_teardown) is intentionally left on the master key path: those tests don't pass ``user_id`` in the body so they don't hit the user_id-mismatch gate. * ci(proxy-mgmt-behavior): diag — run world-seed test first + bump max-failures Third CI run failed identically: seeded PROXY_ADMIN actor's auth resolves to ``user_id=None`` even though the DB row has the right ``user_id``. The suite was aborting at maxfail=10 inside test_key_delete, so test_world_seed (which would tell us whether the seed itself is reachable) never ran in CI. Two diagnostic moves on this push, no behavior change: * Rename ``test_world_seed.py`` → ``test_aaa_world_seed.py`` so it's the first collected file. If it passes in CI we know the seed is fine and the bug lives downstream; if it fails the same way the bug is in the auth resolution path. * Bump ``max-failures`` to 200 for this workflow so we see the full failure surface instead of stopping at the first cascading setup error. Will tighten back down once the suite is green. Adds one new test ``test_proxy_admin_actor_can_create_keys_for_others`` that explicitly exercises the PROXY_ADMIN bypass via /key/generate with an explicit user_id — the same shape the matrix setup helper uses but without the matrix machinery muddying the diagnostic. * ci(proxy-mgmt-behavior): await LiteLLM_VerificationTokenView creation in fixture Fourth CI run still failed because the proxy's lifespan kicks off ``prisma_client.check_view_exists()`` as a fire-and-forget background task — that task is what creates ``LiteLLM_VerificationTokenView``, the SQL view ``user_api_key_auth`` queries to resolve a token to its user_id / user_role / team. On a fresh Postgres (CI), the first test races the background task. The view doesn't exist when the first auth call runs, the resolver falls through to a degraded path that returns ``user_id=None``, and every matrix test that depends on the seeded actor's identity then fails confusingly with "Got user_id=X, Your ID=None" 403s. Locally the view persists across pytest runs so the race is invisible. Fix: await ``prisma_client.check_view_exists()`` explicitly inside the session ``proxy_app`` fixture, after the lifespan enters but before the fixture yields. Deterministic regardless of whether the underlying DB is fresh (CI) or warm (local). * ci(proxy-mgmt-behavior): widen diagnostic to dump token / user / view shape The fifth CI run isolated the failure to ``/key/generate`` with explicit user_id while ``/key/info`` works for the same seeded PROXY_ADMIN actor. The auth context's user_id is None even though the DB row has it set. This commit widens the diagnostic test: on failure, dump the raw token row's user_id, the user row's user_role, and what ``LiteLLM_VerificationTokenView`` actually returns for the seeded token. If the view returns user_id=None we know the view shape is the problem; if the view returns the right user_id we know it's a downstream code path stripping it. * ci(proxy-mgmt-behavior): unambiguous diagnostic view query Previous diagnostic's raw SQL had an ambiguous user_id column from joining the view with the user table, so the diagnostic itself crashed before printing useful state. Simplified to query just the view's columns. * ci(proxy-mgmt-behavior): add auth-resolver chain diagnostic Six runs and the underlying data (token row, user row, view row) all verified correct in CI, but auth still returns user_id=None. This diagnostic calls the resolver primitives directly: 1. ``prisma.get_data(table_name="combined_view")`` → raw view object 2. ``get_key_object(...)`` → cached/DB UserAPIKeyAuth 3. ``get_user_object(...)`` → LiteLLM_UserTable row 4. ``_is_user_proxy_admin`` / ``_get_user_role`` and prints each intermediate via captured stdout (-s). Whichever step returns None/False in CI is where the chain breaks. Imports come from ``litellm.proxy.auth`` (not management_endpoints), so G3 still passes. * ci(proxy-mgmt-behavior): set LITELLM_MASTER_KEY env so lifespan doesn't wipe it Real root cause of every CI run that returned ``Your ID=None`` for the seeded actors: * In ``initialize()``, ``master_key`` is set from the config YAML's ``general_settings.master_key`` (load_config code path at proxy_server.py:4174). * Then the FastAPI lifespan (``proxy_startup_event``) runs and at line 776 does ``master_key = get_secret_str("LITELLM_MASTER_KEY")``, which UNCONDITIONALLY overwrites the global. * In CI the env var is unset, so the post-lifespan ``master_key`` is None. Downstream every auth path degrades: master-key requests don't bypass because ``secrets.compare_digest(api_key, None)`` raises and is caught to ``is_master_key_valid=False``; seeded-actor requests cache a ``UserAPIKeyAuth`` whose ``user_role`` never resolves through the PROXY_ADMIN bypass; ``_is_allowed_to_make_key_request`` then hits the ``user_id`` mismatch path with ``Your ID=None``. Locally my shell happened to have ``LITELLM_MASTER_KEY`` set from a prior session, which is why every local run was green and CI red — exactly the "don't generalize from your environment to CI" memory. Fix: ``os.environ.setdefault("LITELLM_MASTER_KEY", MASTER_KEY)`` and ``os.environ.setdefault("CONFIG_FILE_PATH", config_path)`` before entering the lifespan, so its re-read produces the same value as ``initialize()``. Whole-suite still green locally (130 tests, ~6.4s). * ci(proxy-mgmt-behavior): force premium_user=True so /key/regenerate isn't gated Ninth CI run cleared every ``Your ID=None`` failure (the master_key env fix worked end-to-end) and exposed the next thin layer of failures: ``/key/regenerate`` returns 500 "Regenerating Virtual Keys is an Enterprise feature" in CI because the proxy can't see a ``LITELLM_LICENSE``. Locally my license is set, so the matrix passes. The behavior matrix is supposed to pin authz, not licensing — so flip ``proxy_server.premium_user = True`` directly, both before and after the lifespan (the lifespan re-runs ``_license_check.is_premium()`` and would otherwise reset it). With premium gating disabled, the regenerate matrix exercises the same authz path /key/update does. Whole-suite still green locally (130 tests, ~6.3s). * test(proxy_behavior): trim debug diagnostics, restore default max-failures Followup to the CI-bring-up sequence: now that the suite is green in CI (130 → 129 tests after this trim; 156s wall-time on ubuntu-latest), drop the diagnostic noise left over from debugging the master_key wipe: * Rename ``test_aaa_world_seed.py`` back to ``test_world_seed.py`` — no longer needs to run first. * Remove ``test_auth_resolver_returns_correct_user_id_and_role`` — that test reached into private auth helpers to localize the bug between the DB and ``UserAPIKeyAuth``; it has served its purpose and isn't HTTP-boundary. * Keep ``test_proxy_admin_actor_can_create_keys_for_others`` (without the failure-time dump) — it's a real authz contract that pins the PROXY_ADMIN bypass on /key/generate, and would catch a regression of the same conftest interaction this sequence revealed. * Drop the workflow's ``max-failures: 200`` override — that was a debug aid for seeing the full failure surface in CI. Default of 10 is right for a stable suite. * chore(proxy_behavior): drop empty mutmut triage stub, fold protocol into README The mutmut_triage/pr1.md file was a placeholder for numbers and classifications that don't exist yet — the first mutmut run is a manual follow-up. Empty stubs aren't evidence; deleting it. The G5 protocol (run the workflow, triage survivors in the six Tier-1 handler functions, kill-or-accept-with-reason, zero unreviewed) moves into the suite README's "Gate evidence" block. The real triage file will land alongside the first mutmut follow-up. pyproject.toml's [tool.mutmut].tests_dir entry stays — that's the one-line wiring that makes the existing (manual-trigger) mutation-test workflow include our suite next time someone runs it. Comment updated to drop the dead file reference. * chore(proxy_behavior): drop README + trim comments Removes the suite README — its contents (local repro, layout, conventions) were either restated by the file structure or already covered by the workflow YAML and pyproject.toml. Trims docstrings and inline comments across every test file to keep only non-obvious WHY (the masking ``_get_user_in_team`` reads, the LiteLLM_VerificationTokenView models-can't- be-NULL gotcha, the org_admin/peer-visibility surprise, the rotation contract). Suite still 129 green locally. * test(proxy_behavior): address Greptile review — env force, pagination, dedup - conftest: force LITELLM_MASTER_KEY / CONFIG_FILE_PATH unconditionally instead of setdefault. An ambient LITELLM_MASTER_KEY with a different value would make the proxy authenticate on that key while the tests still send MASTER_KEY → silent 401s. - test_key_list: paginate /key/list instead of a single size=100 request. size is capped at 100 by the endpoint, so on a non-fresh DB a single page could truncate PROXY_ADMIN's view and a seeded key could fall off the page. Walk total_pages. - conftest: hoist the duplicated _create_scratch_key helper (copy-pasted and already diverged across test_key_{update,regenerate,delete}.py) into a single shared create_scratch_key. - Delete regression_replay/README.md — G4 regression-replay evidence belongs in the PR description, not a committed doc file (repo docs policy + the effort's own plan both say so). Content moved to the PR.

* fix(proxy): hydrate wildcard discovery credentials * fix(proxy): constrain wildcard credential hydration Co-authored-by: Dibyo Mukherjee <dibyo@adobe.com>

* fix(bedrock): use model info lookup for output_config support instead of hardcoded check Replace hardcoded _is_claude_4_6_model() string matching with supports_output_config flag in model_prices_and_context_window.json, accessed via _supports_factory(). This follows the project's established pattern for model capability checks (per AGENTS.md rule #8). Bedrock Invoke now conditionally preserves output_config for models that declare supports_output_config=true (currently Claude 4.6 models), while stripping it for older models to avoid request rejection. Ref: #22797 * fix(vertex_ai): single-flight credential refresh to prevent thundering herd (#26024) * fix(vertex_ai): single-flight credential refresh to prevent thundering herd When GCP credentials expire under high concurrency, all requests simultaneously call credentials.refresh() via asyncify, saturating the 40-thread anyio pool and blocking the proxy for 20+ seconds. This adds: - Per-credential asyncio.Lock in get_access_token_async for single-flight refresh (1 coroutine refreshes, others wait on the lock) - Background refresh when token_state is STALE (usable but near expiry), returning the current token immediately with zero added latency - threading.Lock on the sync get_access_token path - Uses google-auth's TokenState enum (FRESH/STALE/INVALID) instead of reimplementing expiry logic Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address PR review comments - Use asyncio.create_task() instead of deprecated get_event_loop().create_task() - Track in-flight background refresh tasks to prevent duplicate refreshes when multiple STALE-path callers pass through the lock before the first background task completes - Add token validation in the STALE branch (consistent with FRESH/INVALID) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: lazy-import TokenState to avoid breaking when google-auth is not installed Also extract helper methods to bring get_access_token_async under the PLR0915 statement limit (50). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: apply Black formatting to test file and update uv.lock Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove user-provided project_id from log messages (CodeQL log injection) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: avoid leaking token value in error message, log type instead Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: restore uv.lock to match litellm_oss_branch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove project_id from remaining log message (CodeQL log injection) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove remaining project_id from log and error messages Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: reuse cached credentials in VertexAIPartnerModels (#26065) * fix: reuse cached credentials in VertexAIPartnerModels instead of creating new VertexLLM per request VertexAIPartnerModels.completion() was creating a throwaway VertexLLM() instance on every call to get an access token, bypassing the credential cache inherited from VertexBase. This caused a fresh token fetch for every single request, adding significant latency overhead. Fix: call super().__init__() to initialize VertexBase's credential cache, and use self._ensure_access_token() instead of a new VertexLLM instance. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: apply same credential caching fix to VertexAIGemmaModels and VertexAIModelGardenModels Same bug as VertexAIPartnerModels: both classes had `pass` in __init__ instead of `super().__init__()`, and created throwaway VertexLLM() instances per request instead of using self._ensure_access_token(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(fireworks): add glm-5p1 metadata and parallel_tool_calls (#26069) * fix(chatgpt): preserve responses routing and recover empty output (#25403) (#26219) - preserve existing shared backend `mode` when router deployment registration reuses a provider/model key already in `litellm.model_cost` (prevents alias with `mode: chat` from downgrading shared `chatgpt/gpt-5.4` from `responses` to `chat` and triggering 403s on /v1/chat/completions) - teach the ChatGPT Responses parser to recover `response.output_item.done` entries when `response.completed.output` is empty - add defensive /responses -> /chat/completions bridge fallback that reconstructs output items from raw SSE when `raw_response.output` is empty - regression coverage for shared alias routing, empty completed.output parsing, and SSE bridge recovery Closes #25403 Co-authored-by: afoninsky <andrey.afoninsky@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(deps): relax core runtime dependency pins from exact == to ranges When litellm migrated from Poetry to uv (PR #24905, v1.83.1), the core dependency specifications in pyproject.toml changed from Poetry bare-version strings (e.g. openai = "2.30.0") to PEP 621 exact pins (openai==2.24.0). Poetry bare-version strings are actually caret ranges (^X.Y.Z == >=X.Y.Z,<X+1), but PEP 621 == is exact. This means every downstream package that installs litellm as a library dependency is now forced to downgrade aiohttp, pydantic, openai, click, and 8 other common packages to exact old versions. Fix: restore range specifiers for the 12 core runtime dependencies. The optional extras (proxy, proxy-runtime, etc.) are consumed primarily by Docker images where exact pins are appropriate and are left unchanged. The uv.lock file continues to provide exact reproducibility for Docker builds and CI. Fixes: #26154 * Add Rubrik as officially-supported guardrail plugin (#25305) * Add Rubrik as officially-supported guardrail plugin Adds tool blocking and batch logging integration with an external Rubrik webhook service. The plugin validates LLM tool calls against a policy service (fail-open on errors) and batch-logs all requests/responses. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Update Rubrik docs: config.yaml as primary, env vars as fallback Restructures the Quick Start to present config.yaml as the recommended approach with tabbed UI, and environment variables as an alternative fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add Rubrik env vars to config_settings reference Fixes documentation validation by adding RUBRIK_API_KEY, RUBRIK_BATCH_SIZE, RUBRIK_SAMPLING_RATE, and RUBRIK_WEBHOOK_URL to the environment settings reference table. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add fallback message when blocking service returns empty explanation Prevents whitespace-only violation message when the tool blocking service blocks tools but returns an empty content field. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(ocr): add Reducto parse OCR support (#26068) * feat(ocr): add Reducto parse OCR support * fix(reducto): address OCR review feedback * chore: refresh uv lockfile * Revert "chore: refresh uv lockfile" This reverts commit 47200c0. * Fix failing tests * Fix code qa * Replaced the async client violation * Replaced black formatting * Fix failing tests * Fix failing tests * Fix failing tests * Fix failing tests * Fix tests * Fix vertex ai cred test * Fix test * fix(xai): normalize usage total_tokens for prompt caching xAI can return total_tokens inconsistent with prompt_tokens + completion_tokens when caching is enabled. Align with OpenAI-style usage so shared LLM tests and downstream consumers see coherent totals. Apply to non-streaming responses and streaming usage chunks. Made-with: Cursor * Fix stale Vertex token refresh fallback * Fix OCR zero credit and Bedrock support checks * Fix OCR and Fireworks capability handling * fix: evict completed background refresh tasks from _background_refresh_tasks Completed asyncio.Task objects were never removed from _background_refresh_tasks. In long-running proxies with many distinct credential keys the dict grows indefinitely, retaining references to finished tasks and their results. Fix: - Pop the existing (done) entry before creating a replacement task. - Attach a done_callback to each new task that removes its entry from the dict once the task finishes (success or failure). Tests: - test_background_refresh_task_removed_after_completion: verifies the done-callback cleans up a single entry after the task completes. - test_background_refresh_tasks_no_accumulation_across_many_keys: drives 20 distinct credential keys and confirms the dict is empty after all background refreshes finish. Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix: guard asyncio.create_task in RubrikLogger.__init__ against missing event loop asyncio.create_task() raises RuntimeError when called outside a running event loop. Wrap the call in a try/except RuntimeError so that RubrikLogger can be instantiated in synchronous contexts (e.g. during startup, testing) without crashing. The periodic_flush background task simply won't start in those cases; it starts normally when the constructor is called inside an event loop. Add a test that verifies instantiation outside an event loop does not raise (does not patch asyncio.create_task). Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix: preserve async batch and reauth coordination * Fix mypy * Fix xAI usage and Fireworks parallel tool params * Fix Rubrik batch drain and SSE recovery mutation * Fix router mode preservation and Rubrik batch flushing * fix(responses): merge text-only items with output items in SSE recovery When recovering output from raw SSE, OUTPUT_ITEM_DONE and OUTPUT_TEXT_DONE events were treated as mutually exclusive fallbacks. If a stream emitted OUTPUT_ITEM_DONE for some output indices and only OUTPUT_TEXT_DONE for others, the text-only items at the missing indices were silently dropped. Merge both dicts before returning, with OUTPUT_ITEM_DONE entries taking precedence at any shared index (preserving the existing behavior covered by test_transform_response_preserves_output_item_when_text_done_arrives_later). Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(rubrik): preserve events on batch send failure Previously, _log_batch_to_rubrik swallowed all HTTP errors and exceptions, and the parent flush_queue unconditionally drained the queue afterwards. On Rubrik 5xx responses, network errors, or timeouts the in-flight events were silently dropped without ever being delivered. - Re-raise from _log_batch_to_rubrik so failures surface to the caller. - In CustomBatchLogger.flush_queue, catch exceptions from async_send_batch and leave the queue intact for retry on the next flush. Existing loggers that override flush_queue (e.g. Datadog) or that swallow their own errors inside async_send_batch (e.g. Langsmith, GCS, Argilla) are unaffected. - Tests now assert events are preserved on HTTP errors, network errors, and that mid-flush appended events are also preserved on failure. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(chatgpt/responses): strip whitespace before parsing SSE chunks _parse_sse_json_chunk in ChatGPTResponsesAPIConfig passed the raw chunk directly to _strip_sse_data_from_chunk, which only matches the 'data:' prefix at position 0. Chunks with leading whitespace (e.g. ' data: {...}') were returned unchanged and silently failed JSON parsing, dropping the contained event. Mirror the existing fix in LiteLLMResponsesTransformationHandler._parse_raw_sse_chunk by calling chunk.strip() before stripping the SSE prefix. Adds a regression test using whitespace-padded data: lines and verifies that the response.output_item.done payload is recovered into the final ResponsesAPIResponse output. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(rubrik): override flush_queue so a single snapshot drives send and drain Previously RubrikLogger relied on CustomBatchLogger.flush_queue, which captured len(self.log_queue) separately from the snapshot taken inside async_send_batch. Although both happen without an intervening await today (so they agree in practice), they are semantically disconnected: a future refactor that adds an await between the two captures, or that changes the async_send_batch contract, could cause the parent to delete a different number of items than were actually sent and trigger duplicate deliveries to Rubrik. Override flush_queue on RubrikLogger so a single snapshot drives both the HTTP POST and the queue truncation. async_send_batch is preserved for direct callers/tests but no longer participates in the canonical flush path. Existing tests (including the one that explicitly invokes the base CustomBatchLogger.flush_queue path) still pass. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix: register reducto/parse-v3 and reducto/parse-legacy in active model pricing file Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(bedrock): restore output_config forwarding and black formatting Use model-map lookup with _model_supports_effort_param fallback so Bedrock Invoke keeps output_config for Claude 4.6/4.7 when pricing flags are missing. Revert custom_llm_provider=bedrock for supports_output_config checks, fix allowlist test model, and apply black to xai/vertex files failing lint CI. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(greptile): address remaining review concerns - fireworks: resolve supports_reasoning lookup for short model names by also trying the full accounts/fireworks/models/ path in model_cost - ocr_cost: drop reducto-specific guard in shared utility; treat missing pages_processed as zero cost when no per-page pricing is configured - docs: remove reducto/rubrik markdown stubs from this repo (canonical docs live in litellm-docs) * fix(model_prices): register mistral/ministral-8b-2512 Mistral's API now returns model='ministral-8b-2512' when 'mistral-tiny' is requested. Adding the entry so completion_cost can resolve the cost for that response. * fix(greptile): prune async refresh locks and lazy-start rubrik flush - vertex: back `_async_refresh_locks` with a WeakValueDictionary so a per-key Lock is auto-evicted once no coroutine holds it, preventing unbounded growth in deployments with many credential combinations while keeping single-flight semantics intact. - rubrik: defer the periodic flush task to the first log event when the logger is constructed without a running event loop, so low-traffic batches still get drained instead of being silently stranded by a swallowed RuntimeError. * Remove duplicate supports_max_reasoning_effort key in claude-opus-4-7 entries Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(vertex_ai): stabilize background refresh task tracking - Guard background refresh done_callback with an identity check so a stale callback cannot remove a newer task that already replaced it in the tracking dict (done_callbacks are scheduled via call_soon, so a fresh task can be stored for the same credential key before the old callback fires). - Replace WeakValueDictionary with a regular dict for _async_refresh_locks so the per-key asyncio.Lock identity is stable across concurrent callers; otherwise a lock can be GC'd between two coroutines arriving for the same key, breaking single-flight. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix: surface OCR pricing gaps and recover OUTPUT_TEXT_DONE in ChatGPT SSE - cost_calculator.ocr_cost: log a warning when pages_processed is reported but no ocr_cost_per_page is configured, instead of silently billing zero via an implicit '(... or 0.0) * pages_processed' fallback. Behavior is preserved (zero cost) so free-tier / unpriced models still work, but configuration gaps are now visible in logs. - ChatGPTResponsesAPIConfig._extract_completed_response_from_sse: also collect response.output_text.done events into a text-only items map and merge them into the recovered output (OUTPUT_ITEM_DONE wins on duplicate output_index), mirroring the LiteLLMResponses handler. This recovers text content when a provider only emits OUTPUT_TEXT_DONE and the final response.completed event has an empty output list. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(cicd): drop obsolete async refresh locks auto-prune test Commit dfb2524 intentionally reverted _async_refresh_locks from a WeakValueDictionary back to a regular Dict so the per-key asyncio.Lock identity is stable across concurrent callers — preserving single-flight semantics. The test asserting that the dict shrinks back to 0 after refreshes was added when the WeakValueDictionary backing was still in place; it now contradicts the deliberate design and is failing CI. * fix(rubrik): sanitize proxy_server_request and harden tool_calls parsing Address bugbot review concerns: - Sanitize proxy_server_request before forwarding to the Rubrik webhook. The previous code passed the entire inbound HTTP context (Authorization, Cookie, x-api-key, and the raw request body) through to a third-party endpoint, which exfiltrates proxy credentials and upstream secrets. The new _sanitize_proxy_server_request allowlists only url and method. (Cursor Bugbot HIGH severity #3192354895) - Treat a null choices[0].message.tool_calls as 'all blocked' rather than letting iteration raise and silently fall through the outer except in apply_guardrail (which would fail open). Iterate over a defensive fallback list instead of relying on the dict default. (Cursor Bugbot MEDIUM severity #3192349538) Co-authored-by: Cursor Bugbot <bugbot@cursor.com> * fix: restore Fireworks substring matching and use RLock for Vertex sync refresh - Fireworks _get_model_cost_capability: after exact-key lookups, fall back to substring matching against fireworks_ai/* entries in model_cost so model name variants (e.g. fine-tuned suffixes) continue to inherit capability flags like supports_reasoning. - Vertex vertex_llm_base: replace non-reentrant threading.Lock with RLock on the sync refresh path so the reauthentication retry, which recurses into get_access_token while still holding the lock, does not deadlock when reloaded credentials are also expired. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(rubrik): collapse BlockedToolsResult dead-code into Optional[str] The `allowed_tools` field on `BlockedToolsResult` was computed in `_extract_blocked_tools` but never read by the only caller — when any tool was blocked the integration unconditionally raised `ModifyResponseException` to reject the full response, never doing partial filtering. Drop the dataclass and return the blocking explanation directly as `Optional[str]` so there's no misleading shape hinting at unused partial-filter capability. Co-authored-by: Greptile <greptile-apps[bot]@users.noreply.github.com> * fix(greptile): prune vertex async refresh lock dict after release Address greptile's open thread on _async_refresh_locks growing unboundedly in high-cardinality deployments. - Add _maybe_prune_async_refresh_lock: drops the per-key Lock from the registry once no coroutine holds it and no coroutine is queued in lock._waiters. The check-then-pop sequence is safe under asyncio's cooperative scheduler — a waiter that arrives after the pop simply creates a fresh lock under the same key, which is fine because the previous batch is already done. - Wrap the slow-path async with lock in a try/finally so the prune runs on every exit (return, exception, reauth retry). - Extract the existing background-refresh task scheduling into _schedule_background_refresh so get_access_token_async stays under ruff's PLR0915 ("Too many statements") limit. No behaviour change. - Regression tests cover both pruning after release (the dict shrinks back to zero after each call) and the safeguard that keeps the lock alive while a waiter is still queued. * fix(greptile): pass explicit bedrock provider to _supports_factory Bedrock Invoke transformation files (chat and messages) called _supports_factory(custom_llm_provider=None, ...) which relies on auto-detection. For short Bedrock model names (e.g. 'anthropic.claude-opus-4-6' without the version suffix) auto-detection fails and the lookup falls back through the exception path. Passing the known 'bedrock' provider explicitly makes the lookup deterministic for all Bedrock model variants, including cross-region inference profile IDs. Co-authored-by: Claude <noreply@anthropic.com> * fix(greptile): warn when OCR cost silently returns 0.0 Address greptile's P2 thread (#3144753707) about ocr_cost silently under-reporting billing when response.usage_info.pages_processed is missing. The credit-priced and unpriced fallback still has to return 0.0 (we don't know how to bill without usage), but emit a warning so the missing-data case is visible in logs instead of disappearing. The per-page-priced branch still raises, preserving the original ValueError signal callers may catch. * fix(greptile): reorder bedrock output_config strip comment labels Swap the # 5a / # 5b step labels so they appear in numerical order within the file. The new output_config-strip block was added with label # 5b above the pre-existing # 5a 'remove custom field from tools' block; rename the new block to # 5a and the pre-existing block to # 5b so the labels match the order of the steps in the file. No behavior change. Co-authored-by: Greptile Reviewer <greptile-apps@users.noreply.github.com> * Fix substring matching specificity and remove mutable Reducto OCR config state - Fireworks: _get_model_cost_capability fallback now picks the longest substring match in model_cost so more specific entries win over less specific ones (instead of returning the first match by insertion order). - Reducto OCR: drop per-request _api_key/_api_base instance attributes on _BaseReductoOCRConfig and instead thread api_key/api_base through transform_ocr_request/async_transform_ocr_request kwargs from the shared OCR HTTP handler. Makes the config safe to share/cache across concurrent requests with different credentials. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(greptile): drain background refresh + warn on router mode override Address the two new findings from greptile's 19:45 review of the vertex+router surfaces. - vertex_llm_base: when the slow path sees TokenState.INVALID, await any in-flight background refresh task before invoking refresh_auth ourselves. google-auth's Credentials.refresh() is not safe to call concurrently on the same credentials object, and the background task runs outside the per-key lock. After the wait, re-check the cached token so we can short-circuit if the background refresh already restored it. Extracted the helper into _await_in_flight_background_refresh so get_access_token_async stays under ruff's PLR0915 statement budget. - router.py: when alias registration would overwrite the deployment's declared `mode` to keep the shared backend mode stable, emit a verbose_router_logger.warning so the override is visible to operators instead of silently winning. The existing fix (preventing alias registration from downgrading a shared `mode: responses` to chat) is preserved; the warning just surfaces it. * fix(cicd): apply black formatting to vertex_llm_base.py * fix(greptile): guard Reducto upload helpers against missing file_id Raise a clear ValueError when Reducto /upload returns 200 without a file_id key (or with a non-JSON body), instead of letting downstream callers see a confusing KeyError. * fireworks_ai: cache fireworks model_cost index and use hyphen-boundary matching - Build a memoized index of fireworks_ai/* entries from litellm.model_cost, invalidated by (id, len) of the model_cost dict. Avoids re-scanning the full ~30k-entry model_cost dictionary on every get_provider_info call. - Replace plain substring containment with hyphen-aligned boundary matching so a known short model name (e.g. 'some-model') cannot falsely match an unrelated longer query (e.g. 'awesome-model'). Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(greptile): refcount vertex async refresh lock pruning Replace the asyncio.Lock._waiters inspection in _maybe_prune_async_refresh_lock with an explicit refcount so the entry is pruned exactly when no coroutine is holding or waiting on the lock, without depending on any private asyncio internals. * fix(vertex): serialize credentials.refresh() across threads via _sync_refresh_lock refresh_auth is invoked from three call sites that can run on different threads (sync get_access_token, async slow path via asyncify, and the background proactive refresh task). Only the sync path was protected by _sync_refresh_lock, so a concurrent sync + async/background call could invoke google-auth's Credentials.refresh() on the same object from two threads simultaneously, mutating internal credential state. Move the lock acquisition into refresh_auth itself; the lock is an RLock so reentrant acquisition from the sync path remains safe. Co-authored-by: Yassin Kortam <yassin@berri.ai> * refactor(responses): extract shared SSE output-item recovery helpers Both ChatGPTResponsesAPIConfig and LiteLLMResponsesTransformationHandler duplicated the same OUTPUT_ITEM_DONE / OUTPUT_TEXT_DONE recovery algorithm. Move that logic into litellm.responses.sse_output_recovery and have both call sites use the shared helpers, so future fixes apply in one place. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(greptile): tie fireworks index cache to model_cost mutation generation * fix: address three bug detection findings - rubrik: use 'is not None' check for tool call IDs to allow empty-string IDs - router: indent mode preservation mutation to match warning conditional - responses transformation: add missing 'continue' after OUTPUT_TEXT_DONE handler Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(router): always preserve existing shared backend mode when deployment mode is None Previously the inner guard 'if _deployment_mode is not None' prevented _shared_model_info['mode'] from being set back to the existing shared mode when the deployment mode was None, which then overwrote the shared backend's mode with None via register_model. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix: address three bug detection findings - vertex_llm_base: guard background refresh's cache write with an identity check so a stale write cannot overwrite a credentials reference replaced by a concurrent reauthentication path. - router: make shared backend mode preservation directional - only preserve when an existing 'responses' mode would be downgraded to 'chat', or when the deployment mode is None (which would otherwise clear the existing mode). Legitimate upgrades now apply. - rubrik: remove unused preserve_events_added_during_flush attribute; RubrikLogger overrides flush_queue, so the base-class flag never applied. Drop the test that exercised the parent path on a Rubrik instance since it does not reflect real flush behavior. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(veria): scope reducto file IDs to current request + register pricing - Reject reducto:// file IDs sent through the proxy /v1/ocr JSON API. The IDs are not bound to a LiteLLM key, so an authenticated user could submit another user's file ID and receive OCR text via the proxy's shared Reducto credentials. Force fresh uploads (multipart form or inline base64 data URI) so every OCR call is server-mediated and implicitly bound to the originating request. - Add ocr_cost_per_credit=0.015 to reducto/parse-v3 and reducto/parse-legacy in both pricing JSONs so successful Reducto OCR calls debit key/team spend instead of recording zero. * fix(vertex): always overwrite resolved cache key with fresh credentials After reauthentication or fresh load, the resolved (cache_credentials, project_id) cache key may point to stale credentials from a prior load. Skipping the write when the key existed forced the next request to go through a redundant refresh/reauth cycle. Always overwrite so callers using the resolved project_id hit the fresh credentials object. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(xai): fold reasoning tokens before normalizing usage in streaming chunks The non-streaming transform_response folds xAI's reasoning_tokens into completion_tokens before calling _normalize_openai_compatible_usage_totals, preserving the OpenAI invariant total = prompt + completion. The streaming chunk_parser only ran the normalization, so when xAI streamed usage with reasoning tokens (total = prompt + completion + reasoning), the normalize check (total < prompt + completion) was a no-op and the invariant remained violated. Refactor _fold_reasoning_tokens_into_completion to also accept a raw usage dict (in addition to ModelResponse / Usage) and call it from the streaming chunk_parser before normalization, so streaming and non-streaming paths report usage consistently for reasoning models. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(greptile): cap SSE content_index padding and use multiset tool-id check * fix(rubrik): apply event_hook default when caller passes None initialize_guardrail always passes event_hook=litellm_params.mode, so setdefault never applied its default. When mode is omitted from the guardrail config, event_hook ended up as None instead of post_call. Use 'or' to fall back to the intended default when the value is None. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(rubrik): cover event_hook default coercion Regression tests for the case where the upstream caller (initialize_guardrail) passes event_hook=None and the logger should still fall back to post_call, and the sanity case where an explicitly-set non-None event_hook is preserved. * fix: address autofix bugs in chatgpt SSE, vertex token cache, rubrik aclose - chatgpt responses: don't overwrite a meaningful error_message with None when a later RESPONSE_FAILED/ERROR event lacks an error object. - vertex_ai: serve STALE tokens from the lock-free fast path and only schedule a deduplicated background refresh, eliminating per-key lock contention near token expiry. - rubrik: aclose() now closes both async_httpx_client and tool_blocking_client to avoid leaking connections from the dedicated client when the logger shuts down. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(vertex): drop redundant resolved_project rebind in slow path Reusing resolved_project (typed str from the fast path's tuple unpack) for an Optional[str] assignment tripped mypy. Use project_id directly after the None check. * test(team_members): skip flaky test_add_multiple_members The test creates a team via /team/new, adds a member via /team/member_add, then queries /team/info — and intermittently gets a 404 for a team that was just successfully created and mutated. The basic happy path is already covered by test_add_single_member; we only lose the 10-iteration stress loop. * fix(rubrik): cancel periodic flush task on aclose The aclose() method closed both HTTP clients but did not cancel the periodic flush task. After close, the task would wake up every flush_interval seconds and try to POST via the now-closed async_httpx_client, generating recurring errors. Cancel the task and await its termination before closing the clients. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(rubrik): coerce None default_on to True at init * fix: tighten SSE done parser + rubrik /v1/messages match Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(bedrock): warn when invoke transformation strips output_config The Bedrock Invoke chat and messages transformations strip output_config when neither supports_output_config nor any supports_*_reasoning_effort flag is set in the model JSON. This was silent; emit a verbose_logger warning when the strip actually removes a present output_config so newly released models (where the JSON entry hasn't caught up yet) surface a clear log line instead of dropping the effort parameter without notice. * fix(rubrik): drop tool_call repr from normalize error to avoid leaking args The TypeError raised in _normalize_tool_calls is caught by apply_guardrail's broad except, which logs the message plus exc_info. Including repr(tc) in the message could expose function arguments (potentially sensitive user data) in the proxy log stream. Type name alone is enough for debugging. * fix: dedupe SSE chunk parser and warn on Fireworks tool drop - Centralize SSE 'data:' chunk parsing in litellm.responses.sse_output_recovery so the ChatGPT Responses transformer and the Responses->Chat-Completions bridge share a single implementation. - Log a warning when get_supported_openai_params drops 'tools' for a fireworks_ai model whose JSON entry sets supports_function_calling=false, so users notice the behavioral change instead of silently losing tools. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(fireworks_ai): demote per-request tool drop warning to debug Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(veria): cap Rubrik retry queue at 10k events with drop-oldest A persistent Rubrik webhook outage previously let authenticated traffic accumulate prompt/response payloads in the in-memory retry queue without bound. The PR-introduced retry-on-failure behavior in flush_queue() never trims the queue, so under sustained outage and high request volume the proxy can run out of memory. Cap the queue at RUBRIK_MAX_QUEUE_SIZE events (default 10_000) and drop the oldest events when the cap is exceeded. Emit a throttled verbose_logger warning so operators can detect a stuck webhook. * fix(tests): accept either initial event type from xAI realtime xAI's Grok Voice Agent API used to emit 'conversation.created' as the first event over the WebSocket. It has since shipped a fully OpenAI-compatible 'session.created' event (and may still emit the legacy 'conversation.created' on some routes), which breaks the strict-equality assertion in the realtime e2e test: AssertionError: Expected conversation.created, got session.created This is an upstream behavior change, not a regression in our code. Loosen the base realtime test so get_initial_event_type() may return a tuple of acceptable event types, and have the xAI subclass accept both 'conversation.created' and 'session.created'. The OpenAI subclasses keep their single-string contract unchanged. * fix(rubrik): drop RUBRIK_MAX_QUEUE_SIZE env knob, hardcode 10k cap The doc-validation CI scans for os.getenv() calls and requires each key to appear in litellm-docs config_settings.md. Adding the env var here without a matching docs PR fails the docs and code-quality checks, and the extra env-parsing block in __init__ also tripped ruff PLR0915. The hard cap at 10k still bounds memory on a Rubrik webhook outage, which is the actual bug being fixed -- operators don't need to tune this knob to get the safety guarantee. * test(team_members): skip flaky test_duplicate_user_addition Same /team/info 404-after-add_team_member race that already led to test_add_multiple_members being skipped in dedc402. Duplicate-prevention behavior is covered by test_update_team_members_list_duplicate_prevention in tests/test_litellm/proxy/management_endpoints/test_team_endpoints.py, so the e2e proxy variant doesn't add coverage. * fix: bound CustomBatchLogger queue and call super().__init__ in ContextCachingEndpoints Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(rubrik): distinguish malformed tool-blocking response from transient errors Raise a dedicated _MalformedToolBlockingResponseError when the tool blocking service returns an empty 'choices' list, instead of a bare Exception. Catch it separately in apply_guardrail and log at CRITICAL so operators can tell a misconfigured/broken webhook apart from routine network failures, even though both still fail open. Co-authored-by: Yassin Kortam <yassin@berri.ai> * router: clarify shared backend mode preservation flow Add a blank line and a brief comment before the _backend_alias_cost assignment to make it clear that registration runs unconditionally after the optional mode-preservation mutation. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(ci): skip chronically flaky test_spend_logs_with_org_id Same write-then-read race against the spend logs DB as test_spend_logs (already skipped above). /spend/logs?request_id=... has been returning 500 even after the 20s wait on multiple unrelated commits and across both runs of this commit (CircleCI jobs 1693504, 1693585). The PR itself does not touch spend logs. Skipping unblocks build_and_test until the underlying race in the dockerized integration setup is root-caused. Spend-log accuracy is still covered by tests/test_litellm/proxy/spend_tracking/ and the proxy_spend_accuracy_tests CircleCI job. --------- Co-authored-by: Kevin Zhao <zkm8093@gmail.com> Co-authored-by: Matthew Lapointe <lapointe683@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Elon Azoulay <elon.azoulay@gmail.com> Co-authored-by: Krrish Dholakia <krrish+github@berri.ai> Co-authored-by: afoninsky <andrey.afoninsky@gmail.com> Co-authored-by: Tai An <antai12232931@outlook.com> Co-authored-by: Joseph Barker <156112794+seph-barker@users.noreply.github.com> Co-authored-by: Maruti Agarwal <88403147+marutilai@users.noreply.github.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: Cursor Bugbot <bugbot@cursor.com> Co-authored-by: Greptile <greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Greptile Reviewer <greptile-apps@users.noreply.github.com>

* fix: end user logs * fix(auth): address PR review feedback on end-user id validation - Gate DB validation behind litellm.validate_end_user_id_in_db (default False) so arbitrary client-supplied identifiers still pass through. - Reuse get_end_user_object / get_user_object / _get_fuzzy_user_object instead of issuing raw Prisma queries in the auth hot path. - Consolidate: builder does the resolution once and stores it on the auth obj; centralized checks reuse it, the outer user_api_key_auth copy is removed. - Preserve end_user_id when litellm.max_end_user_budget_id is set so the default end-user budget can still apply to new customers. * fix(auth): gate JSON-blob user-id rejection behind validate_end_user_id_in_db Addresses PR review feedback: the JSON-encoded dict/list rejection in _coerce_user_id_to_str was unconditionally applied, which would silently stop tracking spend for deployments passing JSON-encoded user identifiers on upgrade. Per the backwards-compatibility rule, default-path behavior changes must be opt-in. Now only strings that decode to a JSON object/array are dropped when litellm.validate_end_user_id_in_db is True. Non-string dict/list/tuple values are still always dropped, since stringifying them produces unusable "{'device_id': ...}"-shaped spend-log rows. * fix(auth): route email end-user lookup through get_user_object cache The email-shaped end-user id branch called _get_fuzzy_user_object directly, bypassing get_user_object's _should_check_db throttle and user_api_key_cache. Every unique email would hit an unbudgeted raw Prisma query on the critical auth path. Collapsing the two calls into one get_user_object invocation with user_email=end_user_id routes through the cached helper per PR review feedback. * fix(auth): keep end-user safety net at user_api_key_auth tail Krrish flagged that removing the tail-of-user_api_key_auth assignment was a regression risk: ``_user_api_key_auth_builder`` has multiple early-return paths (master_key=None, /user/auth, JWT short-circuits) that bypass the end-user resolution block, so dropping the safety net silently strips end-user attribution from those paths. Restore the assignment but route it through resolve_and_validate_end_user_id so the same validation rules apply. Skip the second pass when the builder already set an id. Adds two tests pinning the behaviour: one for the early-return safety net and one verifying we don't double-resolve when the builder set the id. Co-authored-by: Dennis Henry <dennis.henry@okta.com>

Vertex AI Gemma's chatCompletions wrapper does not understand the context_management parameter (an Anthropic / OpenAI Responses API concept). When callers route this field to a Gemma deployment (e.g. through allowed_openai_params or proxy passthrough), the upstream endpoint would reject the request with an unknown-field error. Drop context_management in VertexGemmaConfig.transform_request, matching the existing pattern used for stream and stream_options. Adds a direct transform_request unit test plus an acompletion-level test that exercises the realistic allowed_openai_params path. Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(logging): recalculate cost after router retry failures Do not preserve response_cost=0 from failure_handler when processing a successful response; only keep pre-calculated costs > 0 (pass-through). Co-authored-by: Cursor <cursoragent@cursor.com> * test(logging): guard pass-through zero cost; use != 0 preserve check Use != 0 for pre-calculated cost preservation (Greptile feedback). Add tests for zero cost in _hidden_params and for hidden_params overriding failure 0. Co-authored-by: Cursor <cursoragent@cursor.com> * test(vertex): skip google maps tool test on transient upstream 500 The test test_gemini_google_maps_tool_simple calls real Vertex AI with the googleMaps tool, which depends on Google Maps Platform. CI has been failing on local_testing_part1 across many unrelated PRs (including this one and the litellm_internal_staging base) with an InternalServerError 500 from Maps Platform ('Internal server error. Please retry. ...maps- platform-support'), which is an external upstream flake unrelated to the change under test. Catch litellm.InternalServerError and skip (mirroring the existing RateLimitError handler) so transient upstream outages don't block CI. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>

…n pre-call blocks (#28364) - Fix missing guardrail child spans when a pre-call guardrail blocks the request before reaching the LLM provider; `async_post_call_failure_hook` now calls `_emit_guardrail_spans_from_request_data` to emit spans from `request_data["metadata"]` regardless of whether `_handle_failure` already fired - Add `guardrail_status`, `guardrail_action`, and `guardrail_violation_categories` as queryable top-level OTEL span attributes so trace backends can filter/group by violation type without parsing the redacted `guardrail_response` blob - Introduce `_emit_guardrail_spans_from_request_data` helper that constructs minimal kwargs from `request_data["metadata"]` and routes through `_create_guardrail_span`, sharing the same dedupe state to prevent double-emitting when both failure hooks fire - Extend `BedrockGuardrail` with `_build_tracing_detail` and `_extract_violation_category_names` which flatten BLOCKED assessments into human-readable category labels (topic names, content-filter types, PII entity types, named regex names) before redaction, and surface Bedrock's raw `action` field via `tracing_detail` - Security: violation category extraction deliberately omits `customWords.match` and unnamed regex `match` values because those fields carry the user-submitted content that triggered the rule; only operator-defined `name`/`type` labels are emitted - Add `violation_categories` and `guardrail_action` fields to `StandardLoggingGuardrailInformation` and `GuardrailTracingDetail` TypedDicts to carry the pre-redaction metadata through the logging pipeline - Add comprehensive test suite covering: guardrail span creation on failure, dedupe between `_handle_failure` and `async_post_call_failure_hook`, per-span status attributes for multi-guardrail sequences, Bedrock category extraction for all policy types, security leak prevention, and end-to-end `CustomGuardrail` violation path Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>

…28441) * test(proxy): behavior-pinning matrix for team management endpoints PR2 (Team Tier-1) of the management-endpoint behavior-pinning effort. Extends the tests/proxy_behavior/management/ harness PR1 built and adds the actor x target-resource authz matrix for the 7 team endpoints: /team/new, /team/info, /team/list, /team/update, /team/member_add, /team/member_delete, /team/member_update. Tests-only, no production code changes. Harness extensions: - actors.py: ORG_B_ADMIN actor (org admin of ORG_B) and TEAM_GAMMA (an ORG_A team with no actor members), so team-targeting endpoints get a clean own / same-org-other / cross-org target axis. - conftest.py: create_scratch_team() raw-seeds target teams without /team/new side effects; the scratch teardown now also strips dangling scratch-team refs from LiteLLM_UserTable.teams. 156 new scenarios; status codes pinned to observed handler behavior. * test(proxy): record mutmut run blockers in PR2 triage doc Attempted a scoped local mutmut run for G5; it did not complete. Record the three concrete blockers in mutmut_triage/pr2-team-tier1.md so the next attempt has a head start: 1. mutmut's mutants/ sandbox is import-shadowed by the worktree source. 2. the legacy mock suite and the real-DB behavior suite cannot share a pytest session (mock suite globally patches prisma_client). 3. the CI mutation-test.yml workflow starts no Postgres, so its stats phase now aborts on the behavior-suite tests PR1 added to tests_dir. mutmut stays a deferred follow-up (as in PR1); the binding pre-merge signal remains the behavior matrix (G1) and the G4 regression-replay. * test(proxy): drop suite README + triage doc, trim test comments Remove the two prose docs from the behavior suite (README.md and mutmut_triage/pr2-team-tier1.md) and tighten the comment blocks on the team test files + harness down to the load-bearing parts (the gate each matrix pins, plus genuinely surprising results). No behavior change — all 286 scenarios still pass. * test(proxy): remove mutmut tests_dir comment

…#28503) test_gemini_google_maps_tool_simple makes live calls to Vertex AI's Google Maps grounding backend, which intermittently returns 500 INTERNAL ("Please retry") — a transient Google-side failure, not a LiteLLM bug. The request LiteLLM emits matches Google's published googleMaps grounding spec field-for-field, and the maps-platform 500 only occurs after Vertex accepts the request. The test already passes on RateLimitError; treat InternalServerError the same way so transient Vertex-side failures don't fail CI.

The non_root builder stage installs `nodejs` but not `npm`. Without `npm` on PATH, prisma-python falls back to downloading a Node runtime via nodeenv from nodejs.org, and that downloaded binary fails to load `libatomic.so.1` — breaking `prisma generate` and the image build. `npm` was dropped from this apk list in ca52e34. Restoring it lets prisma-python use the system Node + npm, matching docker/Dockerfile which already installs `npm` for the same reason.

…#27665) (#28524) Bumps [next](https://github.com/vercel/next.js) from 16.2.4 to 16.2.6. - [Release notes](https://github.com/vercel/next.js/releases) - [Changelog](https://github.com/vercel/next.js/blob/canary/release.js) - [Commits](vercel/next.js@v16.2.4...v16.2.6) --- updated-dependencies: - dependency-name: next dependency-version: 16.2.6 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump black 24.10.0 -> 26.3.1 * style: apply black 26.3.1 formatting * chore: authorize black 26.3.1 license in liccheck.ini

* build(deps): bump next from 16.2.4 to 16.2.6 in /ui/litellm-dashboard (#27665) Bumps [next](https://github.com/vercel/next.js) from 16.2.4 to 16.2.6. - [Release notes](https://github.com/vercel/next.js/releases) - [Changelog](https://github.com/vercel/next.js/blob/canary/release.js) - [Commits](vercel/next.js@v16.2.4...v16.2.6) --- updated-dependencies: - dependency-name: next dependency-version: 16.2.6 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): bump protobufjs in /tests/pass_through_tests (#28296) Bumps [protobufjs](https://github.com/protobufjs/protobuf.js) from 7.5.6 to 7.6.0. - [Release notes](https://github.com/protobufjs/protobuf.js/releases) - [Changelog](https://github.com/protobufjs/protobuf.js/blob/protobufjs-v7.6.0/CHANGELOG.md) - [Commits](protobufjs/protobuf.js@protobufjs-v7.5.6...protobufjs-v7.6.0) --- updated-dependencies: - dependency-name: protobufjs dependency-version: 7.6.0 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): bump ws from 8.20.0 to 8.20.1 in /tests/pass_through_tests (#28303) Bumps [ws](https://github.com/websockets/ws) from 8.20.0 to 8.20.1. - [Release notes](https://github.com/websockets/ws/releases) - [Commits](websockets/ws@8.20.0...8.20.1) --- updated-dependencies: - dependency-name: ws dependency-version: 8.20.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* test(e2e): forward LITELLM_LICENSE to UI e2e proxy The UI e2e job ran without LITELLM_LICENSE, so premium_user was always false in the issued login JWT and premium-gated UI surfaces (Team-BYOK Model switch, etc.) couldn't be driven through the UI. Forward the env var from run_e2e.sh and the CircleCI e2e_ui_testing job, and add a sanity test that decodes the admin storage state token and asserts premium_user=true so the wiring fails loudly if it ever regresses. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Update ui/litellm-dashboard/e2e_tests/tests/proxy-admin/license.spec.ts Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

…t stability, (#26027) * Add granian as a ASGI compliant web server. Provides better stability, 10-20 RPS improvement under standard LT conditions. TODO: Verify poetry lock details and add locust numbers to PR * Update granian version in license_cache.json and pyproject.toml to 2.5.7 * Enhance proxy CLI tests by adding SSL initialization checks for Granian server. Remove Python version skip conditions and implement tests to ensure SSL certificate and key are required for server initialization. * update uv lock to fix granian import error

* Add error_description and hint for oauth flows * Fix tests * fix(mcp-oauth): improve redirect_uri errors without leaking internal config Use NoReturn on _oauth_invalid_request, structured errors for BYOK loopback validation, and refactor validate_trusted_redirect_uri to satisfy PLR0915. Keep PROXY_BASE_URL and raw proxy_base_url in server logs only, not in the HTTP 400 body returned to unauthenticated callers. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp-oauth): stop leaking internal proxy origin in redirect_uri 400 body The trusted-redirect-uri rejection helper included the proxy's resolved scheme/host/port (e.g. http://litellm-internal:4000) in both the error_description and as a top-level proxy_origin field. Since the OAuth /authorize endpoint is unauthenticated, any caller could probe with a crafted redirect_uri and enumerate the internal network topology behind a reverse proxy. Keep full diagnostic detail in the server-side warning log (including the computed proxy base) but omit proxy-side values from the HTTP 400 body. Also drop the duplicated origin computation in _raise_trusted_redirect_uri_rejected now that those values are no longer needed by the response. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp-oauth): remove dead userinfo check in redirect_uri validation The first check combined missing netloc with userinfo presence, making the second userinfo-only check unreachable. Split into two distinct checks so each error message reflects the actual failure mode. Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai>

…28454) * feat(mcp): cache OAuth token client-side so Tools tab loads without re-auth After a user creates an OAuth MCP server and completes the authorization flow, the resulting access token is now stored in sessionStorage keyed by server_id. The MCP Tools tab reads this cached token and includes it as an MCP auth header when listing and invoking tools, so the user never sees an empty tool list. When the session ends (tab close / new browser) an Authorize button re-triggers the flow without leaving the Tools screen. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * fix(ui/mcp): surface listMCPTools 401 errors so auth gate reappears listMCPTools previously swallowed all errors (including HTTP 401) by returning a synthetic { tools: [], error: 'network_error', ... } payload. That made the useQuery retry-on-401 guard and mcpToolsError dead code, so expired OAuth tokens never re-triggered the auth gate. - Throw an enhanced Error with .status attached on non-2xx responses (still preserves the legacy shape for true network failures so the caller can render a generic message without crashing). - Clear the cached OAuth session token when the tools query fails with 401, mirroring callMCPTool's onError handler so the Authorize button is shown again. - Surface mcpToolsError in the existing error banner. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp-tools): stable onSuccess + reuse parsed flow state - Pass stable setOauthToken setter directly as onSuccess to avoid recreating useToolsOAuthFlow's resumeOAuthFlow on every render. - Reuse the already-parsed FLOW_STATE_KEY value (peeked) instead of re-reading and re-parsing sessionStorage in resumeOAuthFlow. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(ui/mcp): restore listMCPTools never-throws contract The previous fix made listMCPTools throw on HTTP errors while still returning a synthetic object on network errors. This inconsistent contract broke existing callers (MCPToolPermissions, MCPAppsPanel, MCPConnectPicker) which inspect result.error / result.message and expect the function to never throw. - Return a normalized { tools: [], error, message, status, ... } object on HTTP errors (instead of throwing) so all callers see a consistent shape and the user-visible error text from result.message is preserved. - Convert the returned error object into a thrown Error inside the one caller that needs it — the useQuery in mcp_tools.tsx — so the 401 retry/onError handlers still trigger and clear the cached OAuth token. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix greptile * fix(mcp): align OAuth header alias lookup with dashboard sanitization Backend auth header resolution now matches x-mcp-{alias} keys produced by the dashboard sanitizer, and the Tools tab re-syncs OAuth tokens when serverId changes. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): widen auth header lookup types for list_tools Accept legacy str | dict server auth maps and annotate list_tools server_auth_header as Union[str, dict] for mypy. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(ui): extract shared buildCallbackUrl/clearStorage for MCP OAuth hooks Hoist the duplicate buildCallbackUrl and clearStorage helpers out of useToolsOAuthFlow and useUserMcpOAuthFlow into a new shared module src/hooks/mcpOAuthUtils.ts so the two hooks cannot drift if the URL construction or storage cleanup logic needs to change. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(ui): don't gate M2M OAuth MCP servers behind interactive authorize M2M (client_credentials) OAuth servers share auth_type="oauth2" with interactive PKCE servers, but the backend fetches their token internally and they typically lack a user authorization endpoint. Gating tool listing on them rendered an Authorize button that would fail or redirect incorrectly. Detect M2M via the presence of token_url (matching the existing heuristic in mcp_server_edit.tsx) and skip the auth gate. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(ui/mcp): return error shape when listMCPTools JSON parse fails Restore the never-throws contract when response.json() fails on a 2xx body so callers do not receive null and crash on result.tools. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai>

* feat(proxy): persist allowlisted OIDC claims in CLI SSO poll Map CLI_SSO_CLAIM_MAP sources into user metadata and return scalar attribution_metadata from /sso/cli/poll. Build SSOUserDefinedValues in cli_sso_callback so first-time CLI logins can upsert users. Add mock OIDC scripts and tests for claim extraction and poll exposure. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(proxy): document CLI SSO attribution_metadata in client README Co-authored-by: Cursor <cursoragent@cursor.com> * Delete scripts/mock_oidc_server_for_cli_sso.py * Delete scripts/test_cli_sso_claims_e2e.py * fix(ui_sso): preserve claim types and avoid metadata. prefix stripping - Replace _update_dictionary with a local recursive merge so string OIDC claim values that happen to look numeric are not silently coerced to int/float when persisting CLI SSO attribution metadata. - Use a local dot-path resolver in _extract_sso_claim_value so that source claim paths beginning with 'metadata.' are not silently stripped by get_nested_value (which is designed for LiteLLM JWT metadata, not arbitrary OIDC claims). Co-authored-by: Yassin Kortam <yassin@berri.ai> * Remove redundant metadata. prefix strip in _set_nested_metadata_value The _parse_cli_sso_claim_map already strips the metadata. prefix from dest keys before reaching the setter. The duplicate strip in _set_nested_metadata_value was a no-op in normal flow but could mis-place values for dest keys like metadata.metadata.foo. Co-authored-by: Yassin Kortam <yassin@berri.ai> * Fix greptile * Fix ruff * Move CLI SSO user defined values build inside try/except for consistent error handling Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(proxy): enforce restricted SSO group on CLI SSO callback Apply verify_user_in_restricted_sso_group before CLI session completion and user upsert, matching the UI SSO path. Re-raise ProxyException so restricted-group denials return 403 instead of 500. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(proxy): replace recursive CLI SSO metadata helpers with iterative merge Use stack-based flatten/merge to satisfy recursive_detector CI. Fix mypy types for UserApiKeyCache and user_id on CLI SSO session completion. Co-authored-by: Cursor <cursoragent@cursor.com> * fix: resolve nested CustomOpenID extra_fields in CLI SSO claim extraction When GENERIC_USER_EXTRA_ATTRIBUTES captures a parent object (e.g. org_info), extra_fields stores it as {"org_info": {"department": "..."}}. A CLI claim map entry using a dotted path like org_info.department would silently fail because the lookup only checked the exact flat key. Fall back to dotted-path resolution on extra_fields before model_dump(). Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(sso): update CLI SSO test for new received_response kwarg and remove redundant 'token' secret fragment Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai>

…8566) * fix(responses): use OpenAI SSEDecoder for Responses API streaming httpx aiter_lines() uses str.splitlines(), which splits on U+2028 inside JSON payloads and silently drops response.completed (no spend log). Use openai._streaming.SSEDecoder (bytes.splitlines before decode) instead. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(responses): drop redundant SSE prefix strip after SSEDecoder switch SSEDecoder already strips the 'data:' field prefix from each event, so the extra call to _strip_sse_data_from_chunk on sse.data was redundant and could incorrectly mangle payloads whose actual content starts with 'data:'. Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(anthropic): handle empty streaming tool calls (#28549) Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * [Feature][Bug Fix] Decouple Azure OpenAI Deployment ID from model name via base_model to fix gpt5 model routing (#28490) * feat(azure): decouple deployment ID from model name via base_model Azure OpenAI deployments have arbitrary names (deployment IDs) that may not match the underlying model. Previously, model-type detection (o-series, gpt-5, etc.) relied on substring matching against the deployment name, causing misrouted configs and rejected params when deployment names were non-standard (e.g. 'my-deployment-id' for gpt-5.2). This change extends the existing base_model field to drive model-type detection, config selection, supported param resolution, and param mapping throughout the Azure call path: - _get_azure_config() uses base_model for is_o_series/is_gpt_5 checks - get_provider_chat_config() threads base_model for Azure - get_supported_openai_params() accepts and uses base_model - get_optional_params() accepts base_model and passes it to all Azure config method calls (get_supported_openai_params, map_openai_params) - azure.py completion handler uses base_model for GPT-5 detection - Config internal methods (e.g. is_model_gpt_5_2_model) now receive base_model so features like logprobs are correctly enabled Fully backward compatible - when base_model is unset, behavior is identical. Existing o_series/ and gpt5_series/ prefix workarounds continue to work. Usage in proxy config: model_list: - model_name: my-gpt5 litellm_params: model: azure/my-deployment-id model_info: base_model: azure/gpt-5.2 Fixes: non-standard deployment names like 'prefix-gpt-5.2' rejecting logprobs/top_logprobs despite the underlying model supporting them. * Addressing Greptile comments. * gemini-3.1-flash-lite pricing (#27933) * feat(model_prices): add gemini-3.1-flash-lite pricing with standard/batch/flex/priority tiers * fix pricing * add service tier --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> * fix(openai-responses): strip Anthropic cache_control from Responses API requests (#28431) Squash-merged by litellm-agent from cwang-otto's PR. * Treat None litellm_provider as wildcard in _check_provider_match (#28523) Squash-merged by litellm-agent from adityasingh2400's PR. * fix greptile * fix: use _azure_detection_model in default Azure branch of get_supported_openai_params Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(openai-responses): strip cache_control on compact endpoint as well Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: Felipe Garé <90070734+FelipeRodriguesGare@users.noreply.github.com> Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: withomasmicrosoft <withomas@microsoft.com> Co-authored-by: mubashir1osmani <mubashir.osmani777@gmail.com> Co-authored-by: cwang-otto <chengxuan.wang@ottotheagent.com> Co-authored-by: Aditya Singh <60082699+adityasingh2400@users.noreply.github.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai>

The dependency license checker only read the legacy free-text `info.license` field from PyPI. Packages that adopt PEP 639 publish their license as an SPDX expression in `info.license_expression` and leave the legacy field null, so the checker reported "Unknown license" and failed CI for every newly-bumped PEP 639 dependency. `get_package_license_from_pypi` now resolves the license in order: `license_expression`, then legacy `license`, then the `License :: OSI Approved :: ...` trove classifiers. `is_license_acceptable` splits compound SPDX expressions on the uppercase OR/AND operators (case-sensitive, so the lowercase `-or-later` inside an identifier is not mistaken for an operator) and strips `WITH <exception>` suffixes, requiring every component to be acceptable. Free-text license blobs are detected and fall back to the original whole-string matching. The `black` and `pydantic-settings` entries in liccheck.ini that existed solely to work around this now resolve correctly on their own and have been removed.

@router

…nt endpoints (#28620) * test(proxy): add create_scratch_actor harness helper Adds create_scratch_actor() to the management behavior-suite conftest and extends create_scratch_team() with team_member_permissions / models kwargs, needed by the PR3 team-key-permission and team-model matrices. The new helper mints a scratch-prefixed user + verification token (+ org memberships), all reclaimed by the existing scratch-prefix teardown. * test(proxy): pin /key block, unblock, health, aliases behavior Adds behavior-pinning matrices for POST /key/block, POST /key/unblock, POST /key/health, and GET /key/aliases. Pins that the management-route gate 401s ORG_ADMIN-role callers before _check_key_admin_access runs, the block/unblock round-trip on the blocked column, missing-key 404, and the _apply_non_admin_alias_scope visibility rules for /key/aliases. * test(proxy): pin /key/bulk_update + /team/key/bulk_update behavior Adds behavior-pinning matrices for POST /key/bulk_update (PROXY_ADMIN-only; ORG_ADMIN stopped 401 at the route gate, INTERNAL_USER-role 403 at the handler) and POST /team/key/bulk_update (team-member-permission gate keyed on KEY_UPDATE). Pins batch semantics: empty/over-cap 400, per-key failure isolation into failed_updates, all_keys_in_team broadcast, and no-keys 404. Adds an optional key_alias arg to create_scratch_key for multi-key scenarios. * test(proxy): pin /key SA-generate, v2-info, reset-spend behavior Adds behavior-pinning matrices for POST /key/service-account/generate (team-membership + team-member-permission gating; SA keys carry no user_id), POST /v2/key/info (per-key _can_user_query_key_info silently drops invisible keys), and POST /key/{key}/reset_spend (PROXY_ADMIN or team admin only; missing key 404, reset-value 400). Pins that ORG_ADMIN-role callers are stopped 401 at the management-route gate on the two non-info routes. * test(proxy): close PR1/PR2 key-side deferred coverage gaps Closes the four key-side gaps deferred from PR1/PR2: - 404 on missing key for /key/update and /key/delete (not 401/403) - denied /key/update leaves max_budget/tpm_limit/rpm_limit untouched - /key/regenerate enforces litellm.upperbound_key_generate_params (#26340) - /key/list key_alias substring vs exact (admin-only) + team_id filter, and a non-admin filtering a foreign team is 403 * test(proxy): pin /team block, unblock, available, filter/ui, members/me Adds behavior-pinning matrices for POST /team/block + /team/unblock (management-route gate fronts _verify_team_access; reachable only by PROXY_ADMIN and an org admin of the team's own org), GET /team/available (default empty path), GET /team/filter/ui (route-gated PROXY-ADMIN-only despite the handler having no gate), and GET /team/{team_id}/members/me (caller resolves its own membership; non-member 404, no-user_id key 400). * test(proxy): pin /team model add/delete + permissions endpoints Adds behavior-pinning matrices for POST /team/model/add + /team/model/delete (route-gated PROXY-ADMIN-only; missing team 404), GET /team/permissions_list + POST /team/permissions_update (self-managed; proxy/team/org admin pass), and POST /team/permissions_bulk_update (PROXY_ADMIN-only). Pins the deliberate divergence that the available-team self-join grants read access via permissions_list but never write access via permissions_update. * test(proxy): pin /team delete, bulk_member_add, v2/list, daily/activity Adds behavior-pinning matrices for POST /team/delete (per-team _verify_team_access; batch aborts whole on a missing id), POST /team/bulk_member_add (route-gated PROXY-ADMIN-only; empty/over-cap 400), GET /v2/team/list (_enforce_list_team_v2_access — bare query 401s regular users, org-scoped for org admins) and GET /team/daily/activity (non-member team_ids filter 404, the VERIA-43 fix). * test(proxy): add route-coverage gate + close team org-relocation gap Adds test_route_coverage.py (PR3.M1): parses every @router route literal from the two management-endpoint source files and asserts each is exercised by >=1 behavior-suite scenario — a permanent regression guard for future routes. Closes the last PR1/PR2 deferred gap: the /team/update org-relocation allowed branch, exercised by a dual-org-admin minted via create_scratch_actor. test_team_model uses literal route URLs so the coverage parser resolves them. * test(proxy): bound plain route params to one path segment in coverage gate Plain path params ({team_id}) now compile to [^/?]+ instead of [^?]+, so a parameter cannot span '/'. Starlette ':path' params still match across '/'. Keeps the route-coverage guard from falsely reporting a future multi-segment route as covered. All 37 routes remain covered.

The Playwright suite under tests/proxy_admin_ui_tests/e2e_ui_tests/ is no longer wired into CI (only test_*.py is globbed) and every active spec is duplicated by ui/litellm-dashboard/e2e_tests/tests/ (login, auth redirect, search users, internal user list). team_admin.spec.ts was entirely commented out. Removing the directory plus its only-used-here playwright config, package.json/lock, and utils/login.ts keeps the canonical suite under ui/litellm-dashboard/e2e_tests/ as the single source of truth.

…endpoints (#28613) * fix(sagemaker): use Cohere embed payload for Marketplace endpoints SageMaker embedding only special-cased Voyage; every other endpoint received HuggingFace TGI `{"inputs": [...]}`. AWS Marketplace Cohere containers expect the native Cohere embed payload (`texts`, `input_type`) and reject the HF shape with `422 EmbedReqV2.inputs is of type string but should be of type Object`. Add `SagemakerCohereEmbeddingConfig` that reuses Bedrock/Cohere request and response transforms, and route SageMaker endpoint names containing `cohere` or a Cohere embed model fragment (`embed-multilingual`, `embed-english`, `embed-v3`, `embed-v4`) to it. Supports `input_type`, `dimensions`, and `encoding_format`. Voyage and HuggingFace SageMaker endpoints are unchanged. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(sagemaker): simplify cohere detection and align with file conventions - Detect Cohere SageMaker endpoints with a single `"cohere" in model.lower()` check, mirroring the existing Voyage branch instead of a separate helper function and marker constant. - Drop instance caches of sub-configs; instantiate `BedrockCohereEmbeddingConfig` / `CohereEmbeddingConfig` per call to match the existing pattern in `BedrockCohereEmbeddingConfig._transform_request`. - Match `SagemakerEmbeddingConfig`'s signatures, defaults, and `Any` typing for `logging_obj`; collapse the input-normalization helper inline. - Inline `transform_embedding_response` input lookup; no behavior change. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(sagemaker): restore provider-supported embedding params after map Cohere input_type is advertised in get_supported_openai_params but was filtered out of non_default_params by OPENAI_EMBEDDING_PARAMS before map_openai_params ran. Merge supported params from passed_params after map (same path Greptile flagged). Handle input_type explicitly in SagemakerCohereEmbeddingConfig.map_openai_params and add an integration test through get_optional_params_embeddings. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(embeddings): only restore non-OpenAI supported params after map The post-map restore loop must skip OPENAI_EMBEDDING_PARAMS so mapped fields (e.g. dimensions -> output_dimension) are not duplicated under their OpenAI names. Align SageMaker embedding import order with sibling files and add a regression test for dimensions mapping. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(sagemaker): avoid double post_call on Cohere embedding response Greptile review on #28613 caught that `CohereEmbeddingConfig._transform_response` calls `logging_obj.post_call` internally. The SageMaker embedding handler already calls `post_call` once before invoking the transform, so the Cohere SageMaker path fired callbacks, cost calculators, and log handlers twice per request. Extract the parsing body of `_transform_response` into `_populate_embedding_response` (pure extract-method, no behavior change for existing Cohere direct or Bedrock Cohere paths, which keep calling `_transform_response`). Have `SagemakerCohereEmbeddingConfig` call the new helper directly so it parses the response without re-logging. Add a regression test asserting `logging_obj.post_call` is not invoked by the SageMaker Cohere transform. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>

) * fix(bedrock): strip bedrock/ prefix and URL-encode ARNs in get_bedrock_model_id for invoke path The invoke path (used by /v1/messages → Anthropic SDK / Claude Code) called get_bedrock_model_id() which, when falling back to the raw model string, did not strip the 'bedrock/' routing prefix and did not URL-encode ARNs. For a model like: bedrock/arn:aws:bedrock:us-east-1:<ACCOUNT>:inference-profile/global.anthropic... the URL built was: /model/bedrock/arn:aws:bedrock:…/invoke-with-response-stream ❌ Bedrock returned a JSON error body. LiteLLM's AWSEventStreamDecoder passed those bytes into botocore's EventStreamBuffer which expects binary event-stream framing. Checksum validation failed on the JSON prelude (0x223a7b22 == ':{"') producing a misleading botocore.eventstream.ChecksumMismatch instead of the actual Bedrock error. Fix: strip 'bedrock/' (and 'invoke/') routing prefix from model string, then URL-encode if the result is an ARN — matching what the converse path already does in converse_handler.py. Fixes: LIT-3274 * fix(bedrock): use strip_bedrock_routing_prefix to handle compound prefixes Address greptile review: the original fix used a loop with break, so bedrock/invoke/arn:... only stripped bedrock/ leaving invoke/arn:... which is not an ARN → fell through to .replace('invoke/','',1) → bare unencoded ARN → same malformed-URL bug. strip_bedrock_routing_prefix() iterates without break, correctly stripping bedrock/ then invoke/ in sequence. Also adds test case for the compound-prefix scenario. * style: apply black formatting to fix lint CI (LIT-3274) --------- Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai> Co-authored-by: LiteLLM Bot <bot@berri.ai>

* fix(bedrock): decouple STS region from Bedrock aws_region_name STS AssumeRole now resolves signing region from aws_sts_endpoint (parsed host) or AWS_REGION/AWS_DEFAULT_REGION instead of aws_region_name, fixing air-gapped cross-region Bedrock setups and endpoint/signature mismatches. Co-authored-by: Cursor <cursoragent@cursor.com> * test(bedrock): add regression coverage for _build_sts_client_kwargs Parametrize _resolve_sts_region and _build_sts_client_kwargs matrix cases, and assert IRSA/web-identity paths use aligned STS endpoint and region_name. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(bedrock): tighten STS region helpers and drop redundant web-identity endpoint synthesis Co-authored-by: Cursor <cursoragent@cursor.com> * test(bedrock): cover FIPS, GovCloud, and China STS endpoints Addresses greptile P2: regex sts(?:-fips)? supported sts-fips hosts but was not exercised by the parametrized parse test. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>

#28669) Streaming 429s are wrapped in MidStreamFallbackError so the Router can fall back; the existing 'except litellm.RateLimitError: pass' in test_vertex_ai_stream no longer matches, causing the generic pytest.fail branch to fire when upstream Vertex returns 429. Add a sibling except for MidStreamFallbackError that only swallows it when e.original_exception is a RateLimitError, so unrelated streaming failures still fail the test.

* feat(guardrails): add Microsoft Purview DLP guardrail * fix(guardrails/purview): raise_for_status on HTTP errors, cap scope cache, reuse executor * fix(guardrails/purview): propagate litellm_call_id as correlation_id to Purview * chore: fixes * refactor(guardrails): delegate get_user_prompt to get_last_user_message PurviewGuardrailBase duplicated AzureGuardrailBase (and OpenAIGuardrailBase) user-prompt extraction. The same logic already lived in common_utils.get_last_user_message; wire guardrail bases to that helper, fix the helper docstring, and drop its redundant self-import of convert_content_list_to_str. Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix(purview): make protection scope cache true LRU on hits OrderedDict.get() does not update insertion order; call move_to_end on TTL-valid cache hits so popitem(last=False) evicts least-recently-used users instead of FIFO by first insert. Add a regression test with a small max cache size. Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * Fix mypy * fix(guardrails/purview): harden user-id resolution and broaden DLP text Prefer API key and proxy-injected metadata over client metadata for Entra identity. Scan full message transcript pre-call and all completion choices post-call. Align logging-only hook with the same user-id rules. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(guardrails/purview): scan /v1/completions prompt and TextChoices Normalize text-completion prompts (string or list of strings); skip token-id-only prompts. Run post-call DLP on TextCompletionResponse choices. Extend logging_only hook for text_completion. Add tests and completion_prompt_to_str helper. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(purview-dlp): return data after DLP pass; per-call executor; dedupe text extraction async_pre_call_hook now returns the request dict after a successful check so callers match skip-path behavior. logging_hook uses a fresh ThreadPoolExecutor per invocation like Presidio to avoid single-worker starvation. Response text extraction is centralized in _completion_response_text_parts. Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix(purview): fix LRU cache refresh position and add Responses API scanning Two fixes to the Microsoft Purview DLP guardrail: 1. LRU cache bug (base.py): When a stale scope cache entry was re-fetched, the assignment updated the value but Python's OrderedDict.__setitem__ preserves the original insertion order for existing keys. This left the refreshed entry near the front of the dict, making it the first candidate for LRU eviction via popitem(last=False). Fix: call move_to_end(user_id) after every write to an existing key. 2. Responses API coverage gap (purview_dlp.py): Requests to /v1/responses use an 'input' field instead of 'messages' or 'prompt', so the pre-call hook returned without scanning the content. Similarly, post-call hook did not handle ResponsesAPIResponse.output. Fix: add _responses_api_input_to_str() helper and handle 'responses'/'aresponses' call types in async_pre_call_hook, async_post_call_success_hook (via _completion_response_text_parts), and async_logging_hook. Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix(purview): message separator, non-blocking logging_hook, TextChoices type error Three bugs fixed in the Microsoft Purview DLP guardrail: 1. get_prompt_text_for_dlp message separator (base.py) - Previously called get_str_from_messages() which concatenated all message texts with NO separator, so 'end of msg1' + 'start of msg2' became 'end of msg1start of msg2'. - Now joins per-message text with '\n\n' via convert_content_list_to_str(), preserving DLP pattern detection accuracy across message boundaries. 2. logging_hook blocking the event loop thread (purview_dlp.py) - Previously called future.result() which blocked the calling thread (often the event loop thread) for the entire round-trip of two sequential Microsoft Graph API calls (_compute_protection_scopes + _process_content). - Now fires and forgets: when called inside a running loop, schedules the coroutine with loop.create_task(); otherwise spawns a daemon thread. Returns (kwargs, result) immediately in both cases. - Removes unused concurrent.futures.ThreadPoolExecutor import; adds threading. 3. Incompatible assignment type error (purview_dlp.py:180) - mypy inferred 'choice' as TextChoices from the first loop body, then flagged the assignment in the second loop as incompatible with Choices. - Fixed by using distinct loop variable names: text_choice (TextChoices) and chat_choice (Choices). Tests: 7 new tests added covering the separator fix (TestGetPromptTextForDlp) and the non-blocking logging_hook (TestLoggingHookNonBlocking). Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix(purview): suppress API errors in logging-only mode and scan tool-call arguments Three issues fixed: 1. _check_content except block re-raised unconditionally even when block_on_violation=False. The docstring promised 'log only - do not raise' but network/API errors always propagated. Fixed by checking block_on_violation before re-raising; when False, log a warning and continue. 2. async_logging_hook used a single try/except wrapping both the prompt and response audit calls. When the first _check_content (uploadText) raised due to an API error the second call (downloadText) was silently skipped. Fixed by giving each audit call its own try/except so both always run independently. 3. convert_content_list_to_str() only reads message.content, so tool_calls[].function.arguments and function_call.arguments were invisible to the Purview pre-call and post-call scans. An authenticated caller could embed sensitive text in tool-call arguments and bypass DLP. Fixed by: - Adding PurviewGuardrailBase._extract_tool_call_args_from_message() which handles both dict and object-style messages, covering both tool_calls[] arrays and the legacy function_call field. - Updating get_prompt_text_for_dlp() to include those arguments alongside message content (request/prompt path). - Changing _completion_response_text_parts() from @staticmethod to an instance method and adding tool-call argument extraction for ModelResponse choices (response path). Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * chore(ui): restructure pre-built Next.js output to directory-based routing Flat page files (e.g. guardrails.html) replaced by directory-based index.html equivalents (e.g. guardrails/index.html) matching the Next.js App Router output format. Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix(purview): comprehensive security hardening — identity spoofing, streaming bypass, token-id gap Four security issues addressed: 1. end_user_id kwargs fallback missing in _resolve_user_id_from_logging_kwargs user_id already fell back to kwargs.get("user_api_key_user_id") when absent from metadata, but end_user_id only checked md.get("user_api_key_end_user_id") with no kwargs-level fallback. Added or kwargs.get("user_api_key_end_user_id"). 2. Streaming responses bypassed post_call blocking async_post_call_success_hook only runs on assembled non-streaming responses. For streaming requests the proxy already delivered all content before the hook ran, so raising HTTPException there had no effect. Added async_post_call_streaming_iterator_hook which buffers the entire stream, assembles it via stream_chunk_builder, runs the Purview DLP check, and only then re-yields chunks via MockResponseIterator. If a violation is detected the exception is raised before any bytes reach the client. The proxy automatically skips async_post_call_success_hook for guardrails that define this method, preventing duplicate scans. 3. Caller-controlled Purview user identity in blocking modes When a LiteLLM API key has no bound user_id the guardrail fell back to metadata[user_id_field], which is supplied by the caller. A caller could set this to any Entra object ID whose Purview policies are more permissive and bypass DLP. Added _resolve_trusted_user_id() that only returns identities from the proxy auth system (user_api_key_dict.user_id, end_user_id, or proxy-injected metadata["user_api_key_user_id"]). Added _resolve_user_id_for_blocking() used by all blocking-mode hooks: tries trusted sources first; if only caller-supplied is available, logs a SECURITY WARNING and still proceeds (backward compat); if nothing resolves, skips with a warning. 4. Token-id prompt DLP bypass When /v1/completions received a pure token-id array prompt, completion_prompt_to_str() returned None and the pre_call hook silently skipped the Purview scan. An authenticated caller could tokenize blocked text and send it without DLP evaluation. The hook now detects this case (raw_prompt present but prompt_text None) and logs a WARNING while letting the request pass through — token-id payloads are opaque at the text layer and cannot be scanned. This makes the gap explicit rather than silent. Tests: 94 total, all passing. Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * Revert "chore(ui): restructure pre-built Next.js output to directory-based routing" This reverts commit c70c4303b735bb3885732bd4a0e01997e9571f56. * fix(purview): fail closed on identity spoofing, token prompts, and path encoding Encode Entra user IDs in Graph paths, guard caches with asyncio.Lock, scan Responses API instructions with string input, reject caller-only metadata and token-id completion prompts in blocking mode, and revert unrelated UI HTML restructure from the PR branch. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(purview): use threading.Lock and getattr for LitellmParams - Replace asyncio.Lock with threading.Lock in PurviewGuardrailBase. The cache lock is acquired both from the proxy's main event loop and from short-lived event loops created by the logging_hook thread fallback. In Python 3.10+ an asyncio.Lock is bound to the first event loop that acquires it, so the second loop would silently break audit logging with RuntimeError. All critical sections are in-memory dict ops with no awaits, so a synchronous lock is safe. - Use getattr() on LitellmParams in initialize_guardrail() instead of .get(), which does not exist on Pydantic BaseModel instances and would raise AttributeError at runtime. Tests updated to construct Mock objects with spec= so they reflect the real interface. Co-authored-by: Yassin Kortam <yassin@berri.ai> * refactor(purview): dedupe trust-level user resolution and drop dead code - _resolve_user_id now delegates levels 1-3 to _resolve_trusted_user_id so blocking and non-blocking paths share a single source of truth. - Drop redundant event_hook override in MicrosoftPurviewDLPGuardrail.__init__ (initialize_guardrail already forwards event_hook=litellm_params.mode). - Drop unused self._logging_only attribute; blocking is controlled by the block_on_violation argument passed to _check_content. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(purview): fail-closed on responses API transform error; avoid duplicate audit calls Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(purview): fail-closed blocking DLP; revert directory-based UI HTML Blocking hooks now require UserAPIKeyAuth user_id/end_user_id only (no spoofable metadata), re-raise Responses API transform errors, scan streamed text completions, and reject requests with no bound identity. Reverts the accidental directory-based Next.js output from cc47081 (c70c4303b7). Co-authored-by: Cursor <cursoragent@cursor.com> * Remove dead code in purview_dlp: _resolve_user_id_for_blocking never returns falsy The method either returns a non-empty trusted user id or raises HTTPException, so the 'if not user_id' guards in async_pre_call_hook and async_post_call_success_hook were unreachable. Tighten the return type to str and drop the dead checks to make the fail-closed behavior explicit. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(purview): exclude caller-controlled end_user_id from blocking DLP Blocking Purview checks now use only API-key/JWT-bound user_id, not end_user_id populated from request user/metadata/safety_identifier. Co-authored-by: Cursor <cursoragent@cursor.com> * style(purview): apply Black formatting to base.py Co-authored-by: Cursor <cursoragent@cursor.com> * fix(purview): use post-await timestamp for cache TTL Capture the timestamp after the network call completes when storing it as the cache freshness marker, so the effective TTL reflects when the response was actually received rather than when the request started. Under high network latency the previous behavior shortened the effective cache lifetime. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(purview_dlp): fail closed when stream_chunk_builder returns None stream_chunk_builder can return None (e.g., when ChunkProcessor filters all chunks), causing both isinstance checks to fail and the buffered chunks to be released without DLP scanning. Explicitly fail closed in that case by raising an HTTPException so the streaming DLP guardrail does not bypass policy enforcement. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(purview_dlp): resolve user_id before buffering stream Co-authored-by: Yassin Kortam <yassin@berri.ai> * merge main (#28629) * test(vcr): classify cache verdicts, detect live calls, surface cost leaks Convert the per-test VCR verdict line from a single 'NOOP / HIT / MISS / PARTIAL' tag into a classified outcome that distinguishes the cases that silently bill the live API on every CI run from the ones that don't: HIT pure replay PARTIAL mixed replay + new recordings MISS:RECORDED new cassette saved to Redis (cached next run) MISS:OVERFLOW cassette > MAX_EPISODES_PER_CASSETTE; persister refused to save; re-bills every run MISS:NOT_PERSISTED test failed; save_cassette skipped; re-bills NOOP VCR-marked but no HTTP traffic (mocked elsewhere) UNMARKED:LIVE_CALL test bypassed VCR AND opened a TCP connection to a known LLM provider host -> wasted spend UNMARKED:NO_TRAFFIC test bypassed VCR but didn't call out The UNMARKED:LIVE_CALL signal is what converts 'this test probably hits live' into 'this test connected to api.openai.com'. We install a socket.connect / socket.create_connection wrapper for the duration of each non-VCR-marked test and record any outbound TCP to a known LLM provider hostname. The probe sits below the httpx layer so vcrpy and respx (which both patch above the socket) are unaffected. Replace the file-level _RESPX_CONFLICTING_FILES blacklists in the llm_translation and local_testing conftests with per-item respx detection in apply_vcr_auto_marker_to_items. A test now skips VCR when it actually carries @pytest.mark.respx or has respx_mock in its fixture chain - not just because some other test in the same file imports MockRouter. Items skipped by skip_files are split into respx_conflict (real conflict, the module wires up respx) vs file_opt_out (dead skip- list entry whose module never touches respx) so the session summary makes pruning obvious. Stabilize the AWS SigV4 fingerprint: the Authorization header on Bedrock requests rotates its Credential date and Signature on every call, which previously pushed every Bedrock test past the 50-episode overflow threshold. Extract the access-key id only ('aws-sigv4:AKIA...') so two requests with the same identity match. Always emit verdict logging when VCR is active (set LITELLM_VCR_VERBOSE=0 to opt back into the legacy quiet mode). Add a session-end classification summary that lists overflow tests, unmarked live-call tests, and the skip-reason breakdown. Wire the live-call probe + summary hook into every test directory that already uses the Redis-backed VCR cache (audio_tests, guardrails_tests, image_gen_tests, litellm_utils_tests, llm_responses_api_testing, llm_translation, local_testing, logging_callback_tests, ocr_tests, pass_through_unit_tests, router_unit_tests, search_tests, unified_google_tests). Add tests/llm_translation/test_vcr_classification.py covering the verdict classifier, skip-reason tagging, AWS SigV4 fingerprint stability, live-host classification, and session summary rendering. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(vcr): drop dead 'from respx import MockRouter' imports These seven test files were on _RESPX_CONFLICTING_FILES, which made the auto-marker skip them entirely. Inspecting the source shows the only respx artifact is a top-level 'from respx import MockRouter' that no test ever uses - no @pytest.mark.respx, no respx_mock fixture, no respx.mock context manager. The import is dead code left over from a previous mocking pattern. Now that apply_vcr_auto_marker_to_items detects respx per-item via the marker / fixture chain (b637d9f64a), the file-level skip is no longer needed for these files - they were the reason the OpenAI tests (test_o3_reasoning_effort, test_streaming_response[o1/o3-mini], TestOpenAIO1::test_streaming, TestOpenAIChatCompletion::test_web_search, TestOpenAIO3::test_web_search, etc.) ran live every CI build despite the cassette cache being healthy. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(image_edits): regenerate fixtures per call instead of holding open module-level file handles Module-level TEST_IMAGES = [ open(os.path.join(pwd, 'ishaan_github.png'), 'rb'), open(os.path.join(pwd, 'litellm_site.png'), 'rb'), ] SINGLE_TEST_IMAGE = open(...) opens the file once at import. After the first multipart upload, the file pointer is at EOF, so every subsequent test in the same xdist worker sends an empty multipart body. That non-determinism (a) blows the recorded cassette past MAX_EPISODES_PER_CASSETTE (50) so _RedisPersister.save_cassette refuses to save it, and (b) re-bills the live image edit endpoint on every CI run. Recent CI runs confirm the leak: tests/image_gen_tests/test_image_edits.py shows six tests parking at 51-52 cassette entries (TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False], TestOpenAIImageEditDallE2::..., test_openai_image_edit_with_bytesio, test_openai_image_edit_litellm_router, test_multiple_vs_single_image_edit[False], test_multiple_image_edit_with_different_formats). Replace the module-level file handles with _make_test_images() / _make_single_test_image() factories that return fresh _RewindableImage (BytesIO subclass) objects whose pointer always starts at 0. The image bytes are read once at import into module-level constants (_ISHAAN_GITHUB_BYTES, _LITELLM_SITE_BYTES), so disk I/O cost is unchanged. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(vcr): match real Bedrock hostnames in live-call probe The suffix '.bedrock-runtime.amazonaws.com' never matched real Bedrock endpoints, which use the format 'bedrock-runtime[-fips].{region}.amazonaws.com' (region between 'bedrock-runtime' and 'amazonaws.com'). Add an explicit host check for that pattern so Bedrock live calls are visible to the probe, and update the unit test accordingly. Also drop the unused '_LIVE_CALL_PROBE_INSTALLED' module variable. * fix(vcr): cover full RFC1918 172.16.0.0/12 range in local prefixes * fix(image_edits): drop _RewindableImage to prevent infinite multipart upload The _RewindableImage(BytesIO) wrapper auto-rewound on every read after EOF, which made the OpenAI SDK's multipart upload writer read the same bytes forever instead of seeing EOF. Workers OOM'd / SIGKILL'd: [gw0] node down: Not properly terminated replacing crashed worker gw0 ... worker 'gw1' crashed while running 'tests/image_gen_tests/test_image_edits.py::TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False]' The auto-rewind was added defensively for parametrized + flaky-retried tests, but BaseLLMImageEditTest::test_openai_image_edit_litellm_sdk already calls get_base_image_edit_call_args() once per invocation and that helper now constructs fresh streams via _make_test_images(), so rewinding inside the stream is unnecessary. Replace with plain BytesIO seeded with the cached image bytes. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(vcr): mark Bedrock prompt-caching cross-call tests VCR-incompatible The pass_through prompt-caching tests (test_prompt_caching_returns_cache_read_tokens_on_second_call, test_prompt_caching_streaming_second_call_returns_cache_read) make a warm-up call and then assert the *second* call sees a non-zero cache_read_input_tokens count from the upstream's prompt-cache. VCR replay can't model cross-call provider state — both calls match the same cassette episode, so the second call returns the first call's pre-warmup response and the assertion fails: AssertionError: Expected cache_read_input_tokens > 0 on second call, but got 0. Full usage: {'input_tokens': 4986, 'cache_creation_input_tokens': 4974, 'cache_read_input_tokens': 0} This started biting after the AWS SigV4 fingerprint stabilization (b637d9f64a): Bedrock requests now produce a stable per-access-key fingerprint instead of a per-request signature, so cassettes successfully replay where they previously always missed and re-recorded live. Opt these tests out via skip_nodeid_suffixes so they run live and match the existing pattern in tests/llm_translation/conftest.py (::test_prompt_caching). Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(vcr): tighten OVERFLOW classification and switch respx detection to AST Address two greptile P2 review concerns on PR #27795: 1. MISS:OVERFLOW was firing whenever total > MAX_EPISODES_PER_CASSETTE regardless of cassette state. A cassette that grew past the cap historically but this run only *replayed* (dirty=False) is healthy — the persister never tries to save, so the cache state is stable and the next run will replay too. Only flag OVERFLOW when dirty=True (new episodes were recorded that the persister would refuse to save). Add a regression test covering the dirty=False + large-total case. 2. _module_uses_respx did substring matching on the module source, which false-positives on comments / docstrings / string literals. A comment like # Previously tried respx.mock but switched to vcrpy would keep a file pinned on the opt-out list, defeating the dead-import pruning goal of this PR. Replace the substring scan with an ast.NodeVisitor (_RespxUsageVisitor) that only counts: - @pytest.mark.respx / @respx.mock decorators - with respx.mock(): ... (sync + async) context managers - respx.mock(...) calls outside a with/decorator - function parameters / fixture names equal to respx_mock Add tests for the comment / docstring / string-literal cases plus each real-usage pattern. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(vcr): aggregate worker stats on the controller so the session summary actually renders under xdist `_session_stats` is a module-level dict mutated inside `_vcr_outcome_gate` — which runs in each xdist worker process. The controller's `pytest_terminal_summary` then reads its own empty `_session_stats` and bails on `if not counts: return`, so the OVERFLOW / LIVE_CALL sections the rest of this PR adds never make it into CI logs in the dist mode CI actually uses. Ship a structured `vcr_outcome` payload via `user_properties` (which xdist round-trips) and add `aggregate_report_outcome` on the controller to fold worker outcomes into `_session_stats`. The recording process tags `vcr_recorded_by` with `PYTEST_XDIST_WORKER` so the controller can tell "single-process — already counted locally" apart from "produced by a worker — needs aggregation here", and not double-count when there's no xdist. Covered by 9 new unit tests in test_vcr_classification.py including the end-to-end summary render path. * fix(guardrails): improve CrowdStrike AIDR input handling (#26658) * feat(lasso): add tool-calling support to LassoGuardrail (#27648) * feat(lasso): extend LassoGuardrail to support tool calling (RND-5748) * fix(lasso): PR review followups for tool-calling guardrail (RND-5748) * fix(lasso): handle object-style tool_calls in _update_tool_calls_from_masked (RND-5748) * fix(lasso): use model role for tool_use blocks (RND-5748) * test(lasso): add round-trip tests for message transformation (RND-5748) * fix(lasso): remove unused imports, handle Responses-API input masking, flatten multimodal content (RND-5748) * fix(lasso): inspect Responses-API input field (RND-5748) * fix(lasso): guard text-cursor remap against Lasso count mismatch (RND-5748) * fix(lasso): flatten list content in tool_result.content (RND-5748) * fix(lasso): remap multimodal list content during masking (RND-5748) Bug: _map_masked_messages_back counted list-content messages in original_text_count but the remap loop only handled isinstance(str). The positional text_cursor never advanced for list messages, causing all subsequent masked texts to be written onto the wrong messages. Fix: added elif isinstance(content, list) branch that replaces the list with the masked text string and advances the cursor — mirrors the existing string-content branch. Also handles the assistant + tool_calls combo for list-content messages. Test: test_map_masked_messages_back_list_content verifies a user message with [text + image_url] followed by an assistant message gets correct masked content on both (cursor stays aligned). * refactor(lasso): extract _get_field and _extract_tool_call_fields helpers (RND-5748) The dict-vs-object access pattern (x.get('y') if isinstance(x, dict) else getattr(x, 'y', None)) was duplicated 14 times across 5 methods. _get_field(obj, field) — single-point dict/Pydantic field access. _extract_tool_call_fields(call) — returns (call_id, name, parsed_input) with JSON argument parsing, replacing ~30 duplicate lines in both async_post_call_success_hook and _expand_messages_for_classification. Also simplified _update_tool_calls_from_masked, _prepare_payload tool mapping, and _apply_masking_to_model_response call_id extraction. Net ~60 lines removed. No behavior change — all 32 tests pass. * fix(lasso): add count guard to _apply_masking_to_model_response (RND-5748) _apply_masking_to_model_response used a bare text_cursor without verifying 1:1 correspondence between text-bearing choices and masked text entries. If Lasso returned a different number of text messages than choices with content, masked text would be applied to the wrong choice or silently skip choices. Added the same count-mismatch guard pattern already used in _map_masked_messages_back: count original text-bearing choices, compare to masked_text length, skip text remap on mismatch with a warning log. Tool_call masking via id-based lookup is unaffected. Tests: - test_apply_masking_to_model_response_multiple_choices: verifies correct per-choice masked text with 2 choices - test_apply_masking_to_model_response_count_mismatch: verifies content is left unchanged when counts disagree * fix(lasso): close two guardrail-bypass paths flagged in review (RND-5748) * tool-call args: when function.arguments is malformed JSON or parses to a non-object, preserve the raw string as {"arguments": <raw>} so Lasso still inspects it instead of receiving input=None. Covers both pre-call and post-call extraction (shared helper). Also resolves the CodeQL empty-except warning since the except body now assigns parsed=None. * Responses-API input: when a request carries both "messages" and "input", inspect both. Previously a benign messages array let the guardrail skip data["input"] entirely. The masking write-back is split via a count boundary so masked messages flow back to data["messages"] and masked input flows back to data["input"] without cross-contamination. Tests: malformed/non-object args round-trip, dual-field classification, dual-field masking write-back split. * chore(lasso): black formatting + comment on expand skip branch (RND-5748) * black: wrap two long expressions in lasso.py and reformat dict literals in test_lasso.py to satisfy CI lint. * add a short comment in _expand_messages_for_classification explaining why empty string and None content are intentionally skipped (None is the OpenAI shape for a pure tool-call turn). * fix(lasso): satisfy mypy in _handle_masking, _update_tool_calls_from_masked, _apply_masking_to_model_response (RND-5748) * Narrow `response.get("messages")` into a local before slicing so mypy doesn't see `Optional[List[Dict[str, str]]]` as non-indexable. * Rename the two write-side `func` bindings in `_update_tool_calls_from_masked` to `func_dict` / `func_obj` so mypy doesn't unify the dict and Any|None branches. * Rename the inner loop variable in `_apply_masking_to_model_response` from `msg` to `masked_msg` to avoid clashing with the `msg = choice.message` rebinding below. No behavior change; resolves the 7 mypy errors from the CI lint job. * perf: eliminate per-request callback scanning on proxy hot path (#27858) - Introduce `_CallbackCapabilities` dataclass and `ProxyLogging._callback_capabilities()` static method that inspects `litellm.callbacks` once and caches capability flags keyed on (list length, member ids); invalidates automatically when the callback list mutates without per-request iteration overhead - Replace O(n) `litellm.callbacks` walks in `async_pre_call_hook`, `during_call_hook`, `async_post_call_streaming_iterator_hook`, `async_post_call_streaming_hook`, and `post_call_response_headers_hook` with fast-path exits when no relevant callbacks are registered - Add `needs_iterator_wrap()` and `needs_per_chunk_streaming_hook()` instance methods to decouple iterator-level wrapping from per-chunk hook execution; avoids `get_response_string` materialization per chunk when no guardrail or chunk-hook callback is active - Introduce `_fast_serialize_simple_model_response_stream()` using `orjson` for common single-choice text streaming chunks, bypassing the full Pydantic serializer; falls back to `model_dump_json` for tool calls, logprobs, usage, and provider-specific fields - Add early-return in `_restamp_streaming_chunk_model` when downstream model already matches the requested model, avoiding unnecessary string comparisons on every chunk - Fix stale zero-cost cache bug in `_is_model_cost_zero`: move the per-router `_zero_cost_cache` dict onto the `Router` instance and clear it in `_invalidate_model_group_info_cache` so in-place pricing updates via `upsert_deployment` immediately resume budget enforcement - Add `scripts/benchmark_chat_completions_perf.py`: standalone async benchmarking tool with a mock OpenAI provider, LiteLLM proxy process management, non-streaming RPS, streaming TTFT, and full-stream latency measurements with repeat/median run support - Add comprehensive unit tests covering capability detection, cache invalidation, fast-path correctness, zero-cost cache regression, and the no-callback streaming fast path Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu> * ci(mutmut): enable mutate_only_covered_lines to fit in CI budget (#27910) The mutation-test workflow timed out at the 350-minute job cap when running whole-folder mutation against litellm/proxy/management_endpoints/ (~30 files, ~1.5 MB of source). Every mutant was running the full test suite, and mutants were generated for lines no test covers — which would survive regardless, just wasting compute. mutmut 3.x's mutate_only_covered_lines setting runs the suite once up front to compute coverage, then skips mutating uncovered lines. This cuts the mutant count dramatically and is the right semantic for the score (no test → no kill possible → uncountable). Per-mutant test filtering by function name is already automatic in mutmut 3.x; no external coverage step is needed. * fix(rate-limit): stop v3 limiter from leaking internal stash to provider body (#27913) * fix(rate-limit): stop v3 limiter from leaking internal stash to provider body PR #27001 (atomic TPM rate limit) introduced a reservation flow that writes four LiteLLM-internal keys onto the request data dict: _litellm_rate_limit_descriptors _litellm_tpm_reserved_tokens _litellm_tpm_reserved_model _litellm_tpm_reserved_scopes _litellm_tpm_reservation_released These keys are forwarded as request body params to the upstream provider, which rejects them as unknown fields: OpenAI -> 400 'Unknown parameter: _litellm_rate_limit_descriptors' (mapped by litellm to RateLimitError / 429, hiding the bug behind a misleading 'throttling_error' code) Anthropic -> 400 '_litellm_rate_limit_descriptors: Extra inputs are not permitted' Net effect: every chat completion against any real provider fails the moment a virtual key has any tpm_limit / rpm_limit set — i.e. v3-enforced key-level TPM/RPM limits are broken end-to-end. The v3 RPM/TPM check itself still runs (raises 429 on over-limit), but the success path poisons the upstream body. Reproduced on litellm_internal_staging HEAD (410ce761dc) against gpt-4o-mini and claude-haiku-4-5 with a 1-RPM/1-TPM key — first request fails with the provider's unknown-field error. Fix: the stash is metadata only. - Add RATE_LIMIT_DESCRIPTORS_KEY constant and a _LITELLM_STASH_KEYS registry so we have a single source of truth for stash keys. - New helper _stash_value_in_metadata_channels writes to data['metadata'] / data['litellm_metadata'] without touching the top level. - _stash_reservation_in_data and the descriptor stash now route through that helper. _mark_reservation_released stops writing top-level. - _lookup_stashed_value also checks kwargs['metadata'] / kwargs['litellm_metadata'] (raw request_data shape) in addition to kwargs['litellm_params']['metadata'] (completion kwargs shape). - async_post_call_failure_hook now reads descriptors via the unified metadata lookup instead of request_data.get(top-level). - Defense in depth: async_pre_call_hook strips any stash key that somehow surfaced at the top level (stale cache, future refactor, test fixture) before returning. Tests: - New regression test asserts no _litellm_* stash key is present at the top level of data after async_pre_call_hook, and that the metadata channel still carries the reservation + descriptors so success / failure reconciliation works. - Existing test_tpm_concurrent.py tests that asserted top-level presence are updated to read from data['metadata'] — the location is an implementation detail; the spec is that post-call callbacks can resolve the stash. Verified end-to-end against OpenAI gpt-4o-mini and Anthropic claude-haiku-4-5 via /v1/chat/completions on a low-rpm key: - With limits not exceeded: HTTP 200, valid completion response, no leaked fields in body. - With RPM exceeded: HTTP 429 from v3 enforcement ('Rate limit exceeded ... Limit type: requests'). - With TPM exceeded: HTTP 429 from v3 enforcement ('Rate limit exceeded ... Limit type: tokens'). Full v3 hook test suite passes (171 tests). Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * chore(rate-limit): use RATE_LIMIT_DESCRIPTORS_KEY constant in test, trim noisy comments Address greptile P2: test fixture now uses the imported constant. Drop comments that re-explain what well-named identifiers already convey. * fix(rate-limit): reject caller-supplied stash values to prevent TPM-refund abuse Strip _LITELLM_STASH_KEYS from data top-level and both metadata channels at the start of async_pre_call_hook. Without this, an authenticated caller can inject _litellm_rate_limit_descriptors plus _litellm_tpm_reserved_tokens in body metadata, trigger a proxy-side rejection, and cause async_post_call_failure_hook to refund TPM counters against attacker-named scopes (e.g. another tenant's api_key). --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix: allow for allowlisted redirect URIs (#27761) * fix: allow for allowlisted redirect URIs * github comment addressing * Update litellm/proxy/_experimental/mcp_server/oauth_utils.py Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com> * harden oauth wildcard further * test: cover wildcard entry with dot-leading suffix rejection --------- Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com> * Emit native web_search_tool_result blocks for Anthropic clients (Claude Desktop / Cowork citations) (#27886) * feat(custom_logger): add async_post_agentic_loop_response_hook Lets a CustomLogger shape the response returned by the agentic-loop follow-up call without bypassing the loop's safety / observability machinery (depth tracking, fingerprinting, etc.). Default returns the response unchanged. Used by websearch_interception to inject Anthropic-native web_search_tool_result blocks when the originating client requested a native web_search_* tool. * feat(llm_http_handler): call post-agentic-loop hook on the originating callback In _execute_anthropic_agentic_plan, after anthropic_messages.acreate returns, call the originating callback's async_post_agentic_loop_response_hook so it can mutate the final response (e.g. inject native tool_result blocks). Pass the callback through from _call_agentic_completion_hooks. Exceptions in the post-hook are caught and logged so a buggy callback can't kill the request. * feat(websearch_interception): add is_anthropic_native_web_search_tool Identifies tools the Anthropic-native clients (Claude Desktop, the Anthropic SDK, the Anthropic Console) use to request native search: type starts with "web_search_" (e.g. web_search_20250305). Rejects the LiteLLM standard tool, the OpenAI-function variant, the bare "WebSearch" legacy name, and the bare "web_search" Claude Code shape. This lets us decide per-request whether the client expects web_search_tool_result content blocks in the response, without renaming any existing constants or touching native-provider skip logic. * feat(websearch_interception): add build_web_search_tool_result_block Produces the Anthropic-native web_search_tool_result content block from a structured SearchResponse. Anthropic-native clients use this block to populate citations / source links — the existing text-blob flatten path only feeds readable evidence to the model and discards the structure, so this builder gives us the missing piece. Shape matches https://docs.anthropic.com/en/api/web-search-tool — web_search_result items carry url, title, page_age, encrypted_content (empty string when the search provider doesn't supply one). * feat(websearch_interception): emit native web_search_tool_result blocks When the originating client request carried a native Anthropic web_search_* tool, the final response now also carries web_search_tool_result content blocks alongside the model's text answer — so Claude Desktop / Anthropic SDK clients can populate the citations panel and replay conversation history with structured search evidence. Wiring: - Pre-request hooks (both deployment + Anthropic path) set a flag on kwargs when they see a native web_search_* tool, so the signal survives the conversion-to-litellm_web_search step regardless of which hook fires first. - _execute_search now returns (text, SearchResponse) so the structured results aren't lost when the text is flattened for the follow-up model call. - _build_anthropic_request_patch returns the parallel list of SearchResponse objects. - async_build_agentic_loop_plan pre-builds the web_search_tool_result blocks (one per tool_use_id) and stashes them on plan.metadata when the flag is set. - async_post_agentic_loop_response_hook reads the metadata and prepends the blocks to response.content. - _execute_agentic_loop mirrors the injection for the legacy path so both paths behave identically. Clients that send the LiteLLM standard tool keep the existing text-only behavior — no regression. * test(websearch_interception): cover native web_search_tool_result emission 18 tests across: - detector branches (native vs litellm-standard, OpenAI-function shape, Claude Desktop builtin WebSearch, bare web_search, missing type) - block-builder shape (results, none, empty) - pre-request hook flag-setting (native sets, standard does not) - async_build_agentic_loop_plan attaches blocks to plan.metadata when the flag is present, leaves metadata untouched when absent - post-hook injection into dict and object responses - legacy _execute_agentic_loop mirrors the injection so both paths return the same shape * test(websearch_short_circuit): keep _execute_search mocks in sync with new tuple return * test(websearch_thinking_constraint): keep _execute_search mocks in sync with new tuple return * feat(websearch_interception): emit native blocks from try_short_circuit_search The agentic-loop post-hook only fires when the model returns a tool_use block. Cowork / Claude Desktop on Bedrock actually make TWO requests per user turn: the main /v1/messages with their builtin tool, and a separate standalone /v1/messages whose only tool is web_search_20250305. That second request hits try_short_circuit_search — no agentic loop, no post-hook — and was returning text-only, leaving the citations panel empty. When the short-circuit input carries a native web_search_* tool, build a synthetic server_tool_use + web_search_tool_result pair (using the structured SearchResponse already returned by _execute_search) so the client gets the native shape it expects. The legacy text block is preserved so non-native short-circuit callers (Claude Code, github_copilot, etc.) see the same payload as before. Failure path still emits the native block pair (with empty results) plus the text-error block, so the client gets a well-formed response rather than a malformed half-shape. * test(websearch_native_blocks): cover short-circuit native-block emission Three new cases on top of the existing 18: - native web_search_20250305 short-circuit → [server_tool_use, web_search_tool_result, text], ids paired, urls/titles carried. - litellm_web_search short-circuit → text-only (no regression). - native short-circuit on search failure → still emits the native block pair (empty results) plus the text-error block, so the client never sees a malformed half-shape. * test(websearch_short_circuit): index assertions by block type, not by position Native short-circuit responses now have [server_tool_use, web_search_tool_result, text] when the input carries web_search_20250305 — find the text block by type rather than relying on content[0]. * fix(websearch_interception): gate legacy WebSearch name on schema absence Clients like Cowork / Claude Desktop ship a client-side tool named "WebSearch" with a full input_schema — they handle it themselves and expect to make a separate native web_search_20250305 sub-request for the actual search. Today is_web_search_tool matches the bare name regardless of other fields, which hijacks the client's tool server-side. The agentic loop fires on the main request, the model never gets to emit the client-side tool_use, and the separate native sub-request (where citation data flows) is never made. Net: citations panel empty. Real Anthropic client tools always carry input_schema (the API rejects them otherwise), so a bare {name: "WebSearch"} with no schema is the only thing that could be a legacy interception marker. Gate the match on schema absence: legacy callers (if any) keep working, real client-side WebSearch tools pass through untouched. * fix(websearch_interception): drop "WebSearch" from response-detection lists Post-conversion the model always sees ``litellm_web_search``, so the "WebSearch" entry in the response-side tool_use detection lists was dead at best. If a model ever did return ``tool_use(name="WebSearch")`` it would now (incorrectly) hijack the client's own ``WebSearch`` tool again — same Cowork problem we just fixed on the input side. Drop it. * test(websearch_native_blocks): cover the WebSearch legacy-name schema gate Three new cases: - {name: "WebSearch"} (bare interception marker) → still matched - {name: "WebSearch", input_schema: {...}} (Cowork client tool) → passes through untouched - {name: "WebSearch", description: "..."} (no schema) → still matched on the assumption it's a legacy marker rather than a malformed real client tool. --------- Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com> * ci(codecov): restore litellm/ prefix on uploaded coverage paths pytest-cov runs with --cov=litellm, which makes coverage.xml store paths relative to the package root (e.g. `proxy/proxy_server.py` instead of `litellm/proxy/proxy_server.py`). Codecov auto-resolves these only when the basename is unique in the repo. Files like proxy_server.py, router.py, utils.py, main.py, and constants.py — which have duplicates under enterprise/ or other subpackages — get silently dropped during ingest. The `fixes: ["::litellm/"]` rule prepends `litellm/` to every uploaded path so they resolve unambiguously. Confirmed against multiple recent coverage.xml artifacts that no uploader currently emits paths already prefixed with `litellm/`, so the rule is safe to apply universally. This restores Codecov visibility for the highest-fix-rate hotspots: proxy_server.py, router.py, proxy/utils.py, litellm_logging.py, constants.py, key_management_endpoints.py, utils.py, main.py, user_api_key_auth.py, team_endpoints.py, and litellm_pre_call_utils.py. * chore(ci): remove unused GitHub Actions workflows and orphan files Audit of .github/workflows/ via gh run history shows the following have either never run or have been dormant for 10+ weeks. CI coverage that still matters is preserved on CircleCI (e.g. llm_translation_testing). Removed workflows: - test-litellm.yml — workflow_dispatch only, last run 2026-02-12 (cancelled); CCI local_testing_part1/2 covers the same tests - llm-translation-testing.yml — last run 2025-07-10; replaced by CCI llm_translation_testing job (run_llm_translation_tests.py kept for the make test-llm-translation target) - run_observatory_tests.yml — last run 2026-03-03 (cancelled) - scan_duplicate_issues.yml — last run 2026-03-02 (failure) - publish_to_pypi.yml — never run - read_pyproject_version.yml — fires on every push to main but its echoed version output is not consumed by any downstream step Removed orphan files (no callers in workflows, CCI, or Makefile): - .github/workflows/README.md — documented only publish_to_pypi.yml - .github/workflows/update_release.py + results_stats.csv - .github/actions/helm-oci-chart-releaser/ * Revert "ci(codecov): restore litellm/ prefix on uploaded coverage paths" This reverts commit e25a988a3feb4a31843a67274a3a64fea2fed805. The `fixes: ["::litellm/"]` rule turned out to be applied *after* Codecov's auto-resolution, not before. Files with unique basenames (which were auto-resolving correctly to `litellm/<path>`) got an extra `litellm/` prepended, producing `litellm/litellm/<path>` storage. Files with ambiguous basenames (the actual target of the fix) continued to be dropped because the auto-resolution still failed for them. Net result on the verification run: 1375 files now stored under unresolvable `litellm/litellm/...` paths, and the 11 originally-missing hotspots are still missing. Reverting before piling on further changes. * test(ui): preserve global Button/Tooltip mocks in per-file @tremor/react vi.mock Per-file `vi.mock("@tremor/react", ...)` factories fully replace the setup-level mock from `tests/setupTests.ts`, so the global Button/Tooltip overrides are lost in any file that re-mocks `@tremor/react`. Without them, the real Tremor `<Button>` leaks through and its internal `useTooltip(300)` schedules a native 300ms `setTimeout` on pointer events. When the test environment is torn down before the timer fires, the trailing `setState` calls `getCurrentEventPriority`, which reads `window.event` against a destroyed jsdom -> "window is not defined" flake observed on CI. Patches the 7 leaky test files to re-supply `Button` (bare `<button>`) and `Tooltip` (Fragment) overrides matching `setupTests.ts`. Also drops a dead `afterEach` workaround in `user_edit_view.test.tsx` (the fake-timer dance it ran could not drain a real timer scheduled before the swap) and corrects a misleading comment in `MakeMCPPublicForm.test.tsx`. * ci: use --cov=./litellm so coverage paths resolve unambiguously in Codecov pytest-cov treats --cov=<module-name> as a Python package and emits XML paths relative to the package root, stripping the litellm/ prefix (`proxy/proxy_server.py` instead of `litellm/proxy/proxy_server.py`). Codecov's auto-prefix heuristic then drops every file whose basename is ambiguous in the repo — `proxy_server.py` (3 copies under enterprise/), `router.py` (2 copies), `utils.py` (20+), `main.py` (20+), `constants.py` (2). The 11 highest-fix-rate hotspots have never appeared in Codecov. Switching to --cov=./litellm treats the argument as a path, which makes coverage.xml emit repo-relative paths (`litellm/proxy/proxy_server.py`). Each path is unambiguous, so Codecov resolves all files correctly. Verified locally: rerunning a single proxy_unit_tests test with --cov=./litellm produced `filename="litellm/proxy/proxy_server.py"`, `filename="litellm/router.py"`, and `filename="litellm/types/router.py"` as distinct entries — exactly the disambiguation Codecov needs. Touches every workflow that uploads coverage: the two reusable GHA workflows (_test-unit-base.yml, _test-unit-services-base.yml), test-mcp.yml, and all 14 invocations in .circleci/config.yml. * fix(mcp): allow delegate PKCE bypass for internal MCP servers Remove available_on_public_internet gating from delegate-auth-to-upstream paths so oauth2 + delegate_auth_to_upstream interactive servers behave the same when marked internal. Keeps M2M exclusion. Updates tests. * chore(mcp): warn on internal + upstream PKCE delegate Log verbose_logger.warning when loading oauth2 interactive servers with available_on_public_internet=false and delegate_auth_to_upstream=true (config + DB). Dashboard Alert for the same combo. CLAUDE note for operators. Tests for log and M2M skip. * fix(mcp): dedupe load_servers_from_config alias block Removes accidental duplicate alias/mcp_aliases and get_server_prefix logic (fixes PLR0915 and avoids resetting alias after mapping). * fix(mcp): expose delegate_auth_to_upstream in MCP server list rows (#27936) _build_mcp_server_table omitted delegate_auth_to_upstream, so GET /v1/mcp/server always returned the default false while the registry kept the DB value. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(proxy): fix vector store retrieve/list/update/delete without model (#27929) * feat(proxy): fix vector store retrieve/list/update/delete routing without model Co-authored-by: Cursor <cursoragent@cursor.com> * fix(proxy): remove unchecked query-param injection in vector store management endpoints Co-authored-by: Cursor <cursoragent@cursor.com> * test(proxy): use subset assertion for vector store route test to allow extra kwargs like shared_session Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * fix(managed_batches): convert raw output_file_id to managed ID in CheckBatchCost poller (#27984) * fix(managed_batches): convert raw output_file_id to managed ID in CheckBatchCost poller CheckBatchCost bypasses async_post_call_success_hook, causing raw provider output_file_ids to be persisted in LiteLLM_ManagedObjectTable. This fix converts output_file_id and error_file_id to managed base64 IDs before the DB write. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(check_batch_cost): persist managed file before mutating response and propagate team_id - Move setattr after store_unified_file_id so the response only receives the managed ID once the DB record is successfully written. Avoids serializing an orphaned managed ID into file_object when the store call fails. - Populate team_id on the minimal UserAPIKeyAuth from job.team_id so the managed file record is created with the correct team ownership, allowing other team members to access the batch output file via /files/{id}/content. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(managed_batches): extend test to cover error_file_id conversion Co-authored-by: Cursor <cursoragent@cursor.com> * fix managed file test --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(vertex-ai): fix zero cost/usage on completed Vertex AI batch jobs (#27912) * fix(vertex-ai): fix zero cost/usage on completed Vertex AI batch jobs Vertex batch jobs recorded 0 spend and 0 tokens after PR #25627 added automatic transformation of GCS predictions.jsonl to OpenAI format. Two bugs fixed: 1. batch_utils.py: the Vertex-specific cost/usage reader (calculate_vertex_ai_batch_cost_and_usage) was always invoked and reads raw usageMetadata fields that no longer exist in the OpenAI-shaped output. Now the reader is only used when disable_vertex_batch_output_transformation=True; otherwise the generic path handles the already-transformed OpenAI-shaped content. 2. cost_calculator.py: batch_cost_calculator skipped the global litellm.get_model_info() lookup when a model_info dict was passed in, even when that dict had no pricing fields (e.g. deployment metadata with only id/db_model). It now falls back to the global pricing table when the provided model_info has no pricing data. Co-authored-by: Cursor <cursoragent@cursor.com> * Update litellm/cost_calculator.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix(cost-calculator): use not-any guard for pricing fallback in batch_cost_calculator Co-authored-by: Cursor <cursoragent@cursor.com> * fix(cost-calculator): treat explicit zero batch pricing as set in model_info The fallback to litellm.get_model_info() used truthy checks on pricing fields, so 0.0 was treated as missing and replaced by global rates. Use `is not None` like elsewhere in cost calculation. Add regression test. Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * feat: add weighted-routing failover (#27980) * Feat: Add Weighted-Routing Failover * test(router): cover weighted failover helper functions Co-authored-by: Cursor <cursoragent@cursor.com> * fix(router): align weighted failover deployment list type with mypy Co-authored-by: Cursor <cursoragent@cursor.com> * fix(router): address greptile review on weighted failover - Narrow exception swallowing in `_maybe_run_weighted_failover` to `openai.APIError` so model failures defer to the regular fallback while programming bugs (AttributeError/KeyError/TypeError) surface. - Note async-only limitation of `enable_weighted_failover` in the Router constructor docstring. - Make the weighted distribution test less flaky (1000 iterations, looser bound) and make the non-simple-shuffle test deterministic by failing both deployments instead of relying on the latency strategy's first pick. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(router): ensure weighted failover metadata persists in kwargs The previous `kwargs.setdefault(metadata_variable_name, {}) or {}` returned a brand-new dict whenever the existing metadata was falsy (empty dict or None), so writes to `_failover_excluded_ids` never made it back into `kwargs`. Multi-hop weighted failover then re-selected previously failed deployments and exhausted `max_fallbacks` prematurely. Explicitly assign a fresh dict into kwargs when metadata is missing so mutations are visible to subsequent failover hops. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(router): regression for weighted failover metadata persistence Asserts kwargs["metadata"]["_failover_excluded_ids"] is populated after _maybe_run_weighted_failover, proving the metadata dict written by the helper is the same object that lives in kwargs (no disconnected copy). Pairs with the prior fix that replaced `setdefault(..., {}) or {}` with an explicit get/assign so writes survive across hops. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(router): harden weighted failover error/state handling - Catch RouterRateLimitError (ValueError) alongside openai.APIError in _maybe_run_weighted_failover so an exhausted intra-group retry falls through to the regular cross-group fallback path instead of bubbling out and bypassing configured fallbacks. - Stop mutating the shared input_kwargs dict; build a local copy with the weighted-failover keys so the entry (with _excluded_deployment_ids) cannot leak into later fallback paths reading the same dict. - _get_excluded_filtered_deployments now returns an empty list when the exclusion filter removes every healthy deployment, instead of falling back to the original list. The original-list behavior risked re-picking the just-failed deployment; callers already handle the empty case by raising their no-deployments error, which weighted failover now catches and converts into a normal cross-group fallback. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(router): fall through to rpm/tpm when total weight is zero When the weight metric's total is zero (e.g. after weighted-failover exclusion leaves only zero-weight backups), continue to the next metric (rpm/tpm) instead of returning a uniform random pick immediately. This lets rpm/tpm still drive routing when present, and only falls back to the uniform random pick at the end if no metric provides a positive total weight. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(router): skip weighted failover when remaining deployments are all in cooldown _maybe_run_weighted_failover was computing 'remaining' from all_deployments (every deployment in the model group, including those in cooldown). This meant that when all non-excluded deployments were in cooldown the method still invoked run_async_fallback unnecessarily, which propagated into async_get_healthy_deployments, found no eligible deployments, and raised RouterRateLimitError — only safely caught thanks to the earlier exception-broadening fix. The fix: before computing 'remaining', fetch the current cooldown set via _async_get_cooldown_deployments and subtract it from all_ids. This allows _maybe_run_weighted_failover to return None immediately (skipping the run_async_fallback call entirely) when every non-failed deployment is in cooldown, letting the caller fall through to the correct cross-group fallback path without the wasteful extra round-trip. Tests added: - unit: _maybe_run_weighted_failover returns None without calling run_async_fallback when all remaining deployments are in cooldown - unit: _maybe_run_weighted_failover still calls run_async_fallback when at least one healthy (non-cooldown) deployment is available - integration: end-to-end fallthrough to cross-group fallback when remaining deployments are in cooldown Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpo… (#27976) * fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpoint (#27943) * docs: add one-line docstring to _disable_debugging (#27894) Squash-merged by litellm-agent from oss-agent-shin's PR. * Add jp. Bedrock cross-region inference profile for claude-sonnet-4-6 (#27831) Squash-merged by litellm-agent from Cyberfilo's PR. * Sanitize empty text content blocks on /v1/messages (#27832) Squash-merged by litellm-agent from Cyberfilo's PR. * fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpoint The bedrock-mantle gateway (Claude Mythos Preview) serves the Anthropic Messages API at /anthropic/v1/messages; /v1/messages returns 404 Not Found. Both AmazonMantleConfig (chat/completions caller route) and AmazonMantleMessagesConfig (anthropic-messages caller route) hardcoded the wrong path, so every Mantle request 404'd before reaching the model. Per the Anthropic docs: "[Claude in Amazon Bedrock] uses the Messages API at /anthropic/v1/messages with SSE streaming." https://platform.claude.com/docs/en/api/claude-on-amazon-bedrock Confirmed independently against the live endpoint: /v1/chat/completions -> 200 OK /v1/messages -> 404 Not Found (what litellm used) /anthropic/v1/messages -> 200 OK (Claude only) Adds a regression test asserting both Mantle configs build the /anthropic/v1/messages path, and updates the existing assertions that encoded the wrong path. --------- Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai> Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com> * fix: sanitize empty text blocks in sync anthropic_messages_handler path Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: João Costa <13508071+jpv-costa@users.noreply.github.com> Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai> Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(utils): import get_secret at runtime (#28014) * fix(proxy): make /config/update env-var encryption idempotent A single decrypt-then-encrypt chokepoint (_encrypt_env_variables_for_db) now backs both update_config and save_config. Re-submitting a value the Admin UI read back from /get/config/callbacks as ciphertext no longer stacks a second encryption layer, which previously decrypted to garbage and silently broke the callback. The chokepoint decrypts with the pure _decrypt_db_variables (no os.environ mutation on the write path) and encrypts exactly once; update_config merges only the sent keys so untouched env vars keep their stored ciphertext byte-for-byte. * test(proxy): add endpoint-level regression for /config/update double-encryption Adds test_update_config_env_var_round_trip_not_double_encrypted, which drives the real /config/update handler: first write plaintext, then re-POST the stored ciphertext (the Admin UI round-trip) and assert the value is not stacked with a second encryption layer and untouched keys stay byte-identical. Verified to fail against the pre-fix handler and pass after. Also tightens the unit test to exactly three ciphertext re-feeds. * chore(ci): modernize model references in tests and configs (#27856) * test: modernize models used in CircleCI e2e test suites Replaces obsolete models (gpt-4o, gpt-4o-mini, gpt-3.5-turbo, claude-3-5-sonnet-20240620, claude-sonnet-4-20250514) with current equivalents across the e2e_openai_endpoints and proxy_e2e_anthropic_messages_tests CircleCI jobs. - gpt-4o -> gpt-5.5 (responses API e2e tests) - gpt-4o-mini -> gpt-5-mini (websocket responses, oai_misc_config) - gpt-4o-mini-2024-07-18 -> gpt-4.1-mini-2025-04-14 (fine-tuning, still actively fine-tunable) - gpt-4 / gpt-3.5-turbo target_model_names example -> gpt-5.5 / gpt-5-mini - bedrock claude-3-5-sonnet-20240620 batch entry -> haiku-4-5-20251001 (also aligning oai_misc_config model_name with what test_bedrock_batches_api.py actually requests) - bedrock claude-sonnet-4-20250514 (deprecated, retires 2026-06-15) -> claude-sonnet-4-5-20250929 * test: point bedrock-claude-sonnet-4 alias at Sonnet 4.6, not 4.5 Greptile/Cursor flagged that after the previous commit, the bedrock-claude-sonnet-4 alias collided with bedrock-claude-sonnet-4.5 (both pointed to claude-sonnet-4-5-20250929). Rename to bedrock-claude-sonnet-4.6 and point it at the Sonnet 4.6 Bedrock ID (us.anthropic.claude-sonnet-4-6, already in the litellm model registry) so the alias name matches the underlying model version. * test: modernize models across remaining CI-mounted configs & tests Expands the modernization sweep to all CircleCI-mounted proxy configs and to test directories where the model literal is a fixture/route key (not the test's subject). Config changes: - pro…

…it (#28231) Prefetch upstream InitializeResult.instructions before merging gateway initialize options when YAML/DB do not set instructions, so clients receive upstream server text on the first MCP initialize without list_tools. Co-authored-by: Cursor <cursoragent@cursor.com>

greptile-apps · 2026-05-23T00:34:36Z

Too many files changed for review. (367 files found, 100 file limit)

CLAassistant · 2026-05-23T00:34:43Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
6 out of 8 committers have signed the CLA.

✅ Sameerlite
✅ milan-berri
✅ ryan-crabbe-berri
✅ yuneng-berri
✅ harish-berri
✅ mateo-berri
❌ yassin-berriai
❌ krrish-berri-2
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

codspeed-hq · 2026-05-23T00:36:22Z

Merging this PR will not alter performance

✅ 16 untouched benchmarks

_{Comparing litellm_internal_staging (7270f72) with main (79b4578)}

+    ...(data.refresh_token ? { refresh_token: data.refresh_token } : {}),
+  };
+  try {
+    window.sessionStorage.setItem(storageKey(serverId, userId), JSON.stringify(stored));


codecov · 2026-05-23T00:38:51Z

Codecov Report

❌ Patch coverage is 87.97935% with 326 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
litellm/proxy/management_endpoints/ui_sso.py	80.09%	41 Missing ⚠️
litellm/llms/vertex_ai/vertex_llm_base.py	81.56%	33 Missing ⚠️
litellm/llms/xai/chat/transformation.py	44.18%	24 Missing ⚠️
litellm/integrations/rubrik.py	92.46%	19 Missing ⚠️
...s/guardrail_hooks/microsoft_purview/purview_dlp.py	92.14%	19 Missing ⚠️
litellm/responses/sse_output_recovery.py	73.52%	18 Missing ⚠️
...llm_responses_transformation/streaming_iterator.py	81.39%	16 Missing ⚠️
litellm/router.py	89.13%	15 Missing ⚠️
litellm/llms/reducto/ocr/transformation.py	83.33%	13 Missing ⚠️
litellm/llms/reducto/common.py	84.41%	12 Missing ⚠️
... and 30 more

📢 Thoughts on this report? Let us know!

Sameerlite and others added 30 commits May 20, 2026 10:03

fix(proxy): hydrate wildcard discovery credentials (#28284) (#28419)

37ef8d9

* fix(proxy): hydrate wildcard discovery credentials * fix(proxy): constrain wildcard credential hydration Co-authored-by: Dibyo Mukherjee <dibyo@adobe.com>

build(deps-dev): bump black to 26.3.1 and apply formatting (#28525)

2a5dfcd

* build(deps-dev): bump black 24.10.0 -> 26.3.1 * style: apply black 26.3.1 formatting * chore: authorize black 26.3.1 license in liccheck.ini

yuneng-berri and others added 19 commits May 22, 2026 00:42

Fix conflicts and UI (#28477)

d96e260

Include team alias in CLI JWT token (#28621)

b0b25ae

ryan-crabbe-berri approved these changes May 23, 2026

View reviewed changes

yuneng-berri enabled auto-merge May 23, 2026 00:36

github-advanced-security AI found potential problems May 23, 2026

View reviewed changes

Comment thread ui/litellm-dashboard/src/utils/mcpTokenStore.ts

...(data.refresh_token ? { refresh_token: data.refresh_token } : {}),

};

try {

window.sessionStorage.setItem(storageKey(serverId, userId), JSON.stringify(stored));

shin-berri approved these changes May 23, 2026

View reviewed changes

yuneng-berri merged commit 35f6961 into main May 23, 2026
129 of 131 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

chore(ci): promote internal staging to main#28680

chore(ci): promote internal staging to main#28680
yuneng-berri merged 49 commits into
mainfrom
litellm_internal_staging

yuneng-berri commented May 23, 2026

Uh oh!

greptile-apps Bot commented May 23, 2026

Uh oh!

CLAassistant commented May 23, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented May 23, 2026

Uh oh!

codecov Bot commented May 23, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

11 participants

Uh oh!

Conversation

yuneng-berri commented May 23, 2026

Relevant issues

Linear ticket

Pre-Submission checklist

Delays in PR merge?

CI (LiteLLM team)

Screenshots / Proof of Fix

Type

Changes

Uh oh!

greptile-apps Bot commented May 23, 2026

Uh oh!

CLAassistant commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codspeed-hq Bot commented May 23, 2026

Merging this PR will not alter performance

Uh oh!

codecov Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

11 participants

CLAassistant commented May 23, 2026 •

edited

Loading

codecov Bot commented May 23, 2026 •

edited

Loading