feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider #30203
Greptile Summary
Adds LibertAI as a JSON-configured OpenAI-compatible provider, following the established pattern used by parasail, chutes, and others (PR #29063/#29842). The integration covers provider registration, 12 model entries (11 chat including reasoning
Confidence Score: 4/5
Safe to merge after updating litellm/provider_endpoints_support_backup.json and fixing the embeddings flag. The provider wiring (constants, enum, providers.json) and model pricing data are correctly implemented. The only runtime-visible gap is that provider_endpoints_support.json (root) and litellm/provider_endpoints_support_backup.json both need attention: the root has the wrong embeddings flag and the backup was not updated at all.
| Filename | Overview |
|---|---|
| litellm/llms/openai_like/providers.json | Adds LibertAI entry with correct base URL, env var names, and max_completion_tokens→max_tokens param mapping; follows the established pattern. |
| litellm/constants.py | Registers libertai in both openai_compatible_endpoints (base URL) and openai_compatible_providers lists; consistent with other JSON-configured providers. |
| litellm/types/utils.py | Adds LlmProviders.LIBERTAI enum value; placed in alphabetical order among similar entries. |
| model_prices_and_context_window.json | Adds 12 LibertAI models (11 chat + 1 embedding); thinking variants correctly carry supports_reasoning:true, but bge-m3 embedding model contradicts the embeddings:false flag in provider_endpoints_support.json. |
| litellm/model_prices_and_context_window_backup.json | Correctly updated as a mirror of model_prices_and_context_window.json; same 12 model entries added consistently. |
| provider_endpoints_support.json | Has two issues: embeddings is marked false despite bge-m3 being a registered embedding model, and the corresponding litellm/provider_endpoints_support_backup.json (the file actually served by the proxy) was not updated. |
| tests/test_litellm/llms/openai_like/test_libertai_provider.py | Six offline configuration tests covering provider registration, JSON config loading, URL resolution, cost map presence, and Router initialization; no real network calls made. |
Reviews (1): Last reviewed commit: "feat(provider): add LibertAI as a JSON-c..."
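The `max_completion_tokens`→`max_tokens` parameter mapping noted for `providers.json` can be pictured with a minimal sketch. `PARAM_MAPPINGS` and `apply_param_mappings` are hypothetical names chosen for illustration; they are not LiteLLM's actual loader code, and the real JSON schema may differ.

```python
# Hypothetical illustration of a JSON-configured parameter rename.
# PARAM_MAPPINGS and apply_param_mappings are assumed names, not LiteLLM APIs.
PARAM_MAPPINGS = {"max_completion_tokens": "max_tokens"}


def apply_param_mappings(params: dict, mappings: dict = PARAM_MAPPINGS) -> dict:
    """Return a copy of params with mapped keys renamed; other keys pass through."""
    renamed = {}
    for key, value in params.items():
        renamed[mappings.get(key, key)] = value
    return renamed


request = apply_param_mappings({"max_completion_tokens": 20, "temperature": 0.2})
print(request)
```

Keeping the rename in data rather than code is presumably what lets a provider be added without a dedicated transformation class.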
```json
        "interactions": true
      }
    },
    "libertai": {
      "display_name": "LibertAI (`libertai`)",
      "url": "https://docs.litellm.ai/docs/providers/libertai",
      "endpoints": {
        "chat_completions": true,
        "messages": true,
        "responses": false,
        "embeddings": false,
        "image_generations": false,
        "audio_transcriptions": false,
        "audio_speech": false,
        "moderations": false,
        "batches": false,
        "rerank": false,
        "a2a": false
      }
    },
    "litellm_proxy": {
      "display_name": "LiteLLM Proxy (`litellm_proxy`)",
      "url": "https://docs.litellm.ai/docs/providers/litellm_proxy",
```
Missing backup file update: LibertAI absent from /public/supported_endpoints
The proxy's GET /public/supported_endpoints endpoint reads exclusively from litellm/provider_endpoints_support_backup.json (via files("litellm").joinpath("provider_endpoints_support_backup.json") in public_endpoints.py), not from the root provider_endpoints_support.json edited here. Since the backup was not updated, LibertAI will be absent from that public API response at runtime, even though the root file is correct.
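A drift check between the two files could look like the sketch below. It assumes each file reduces to a dict keyed by provider slug (the exact top-level layout of the JSON is not shown in this hunk), and `missing_from_backup` is a hypothetical helper, not something in the repository.

```python
# Hypothetical drift check between the root matrix and the served backup.
# Both arguments are assumed to be dicts keyed by provider slug.
def missing_from_backup(root_providers: dict, backup_providers: dict) -> list:
    """Return provider slugs present in the root file but absent from the backup."""
    return sorted(set(root_providers) - set(backup_providers))


# Inline stand-ins for the two JSON files; real data would come from json.load().
root = {
    "libertai": {"endpoints": {"chat_completions": True}},
    "litellm_proxy": {"endpoints": {"chat_completions": True}},
}
backup = {
    "litellm_proxy": {"endpoints": {"chat_completions": True}},
}

missing = missing_from_backup(root, backup)
print(missing)
```

A test built on this idea would fail as soon as a provider is added to one file only, which is the situation the comment describes.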
```json
    "libertai": {
      "display_name": "LibertAI (`libertai`)",
      "url": "https://docs.litellm.ai/docs/providers/libertai",
      "endpoints": {
        "chat_completions": true,
        "messages": true,
        "responses": false,
        "embeddings": false,
```
embeddings should be true — libertai/bge-m3 is a registered embedding model
The LibertAI cost map includes libertai/bge-m3 with "mode": "embedding", so the provider demonstrably supports embeddings. Marking it false here means the /public/supported_endpoints matrix will tell users embeddings aren't available for LibertAI, which contradicts the actual model catalogue. The OpenAI-like embedding handler routes these calls correctly for JSON-configured providers.
```diff
     "libertai": {
       "display_name": "LibertAI (`libertai`)",
       "url": "https://docs.litellm.ai/docs/providers/libertai",
       "endpoints": {
         "chat_completions": true,
-        "messages": true,
+        "messages": false,
         "responses": false,
-        "embeddings": false,
+        "embeddings": true,
```
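The mismatch the reviewer points at (an embedding-mode model registered for a provider whose matrix entry says `"embeddings": false`) can be detected mechanically. The sketch below is an assumption-laden illustration: `embedding_flag_conflicts` is a hypothetical helper, and the two dicts are trimmed stand-ins for the cost map and the endpoint matrix. It only reports the disagreement; whether the flag or the catalogue should change is a separate decision.

```python
# Hypothetical consistency check: embedding-mode models vs. the endpoint matrix.
def embedding_flag_conflicts(cost_map: dict, endpoint_matrix: dict) -> list:
    """List (provider, model) pairs where an embedding model exists but the
    provider's matrix entry does not advertise embeddings."""
    conflicts = []
    for model, info in cost_map.items():
        if info.get("mode") != "embedding":
            continue
        provider = info.get("litellm_provider")
        endpoints = endpoint_matrix.get(provider, {}).get("endpoints", {})
        if not endpoints.get("embeddings", False):
            conflicts.append((provider, model))
    return conflicts


# Trimmed stand-ins for the two JSON files discussed in this review.
cost_map = {
    "libertai/bge-m3": {"mode": "embedding", "litellm_provider": "libertai"},
    "libertai/qwen3.6-27b": {"mode": "chat", "litellm_provider": "libertai"},
}
endpoint_matrix = {
    "libertai": {"endpoints": {"chat_completions": True, "embeddings": False}},
}

conflicts = embedding_flag_conflicts(cost_map, endpoint_matrix)
print(conflicts)
```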
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Thanks for adding LibertAI support, @moshemalawach! Before this is ready:
Almost there!
Addresses review feedback:
- Add libertai to litellm/provider_endpoints_support_backup.json, the file actually served by GET /public/supported_endpoints (the root provider_endpoints_support.json already had it).
- Add tests asserting bge-m3 normalizes to mode='embedding' and that the served matrix lists libertai.

embeddings stays false: the JSON-configured provider path only wires chat routing (OpenAILike embedding handler is reached only for literal openai_like/llamafile/lm_studio), matching the llamagate precedent; bge-m3 remains in the cost map for metadata.
Thanks @Sameerlite, addressed in the latest push:

1. Proof of working (live, through the LiteLLM provider). Ran a real completion and embedding via the

```python
import litellm

resp = litellm.completion(
    model="libertai/qwen3.6-27b",
    messages=[{"role": "user", "content": "Reply with exactly: LibertAI via LiteLLM works"}],
    max_tokens=20,
)
print(resp.choices[0].message.content)
print(resp.usage)
print("cost USD", litellm.completion_cost(completion_response=resp))
```

Cost tracking is correct: 23 × $0.15/M + 9 × $0.50/M = $7.95e-06. ✅

2. 3. The The Tests. Added CLA: signing now.
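The cost figure quoted above (23 prompt tokens at $0.15 per million plus 9 completion tokens at $0.50 per million) can be re-derived offline. This is a plain arithmetic sketch using `Decimal` to avoid float rounding; `completion_cost_usd` is a hypothetical helper, not `litellm.completion_cost`.

```python
from decimal import Decimal

# Per-token prices taken from the figures quoted in the comment.
INPUT_COST_PER_TOKEN = Decimal("0.15") / Decimal(1_000_000)
OUTPUT_COST_PER_TOKEN = Decimal("0.50") / Decimal(1_000_000)


def completion_cost_usd(prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Cost in USD for one call: input tokens plus output tokens, each at its rate."""
    return (
        prompt_tokens * INPUT_COST_PER_TOKEN
        + completion_tokens * OUTPUT_COST_PER_TOKEN
    )


cost = completion_cost_usd(prompt_tokens=23, completion_tokens=9)
print(cost)
```

The input share is 3.45e-06 and the output share is 4.50e-06, which sum to the 7.95e-06 reported.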
… add-libertai-provider # Conflicts: # litellm/llms/openai_like/providers.json
809e89c
into
BerriAI:litellm_oss_staging_150626
* fix(pricing): add GitHub Copilot MAI Code Flash pricing (#30415) * fix(pricing): add GitHub Copilot MAI Code Flash pricing Add GitHub Copilot pricing entries for MAI-Code-1-Flash and the internal Copilot CLI model name so cost calculation can price input, cached input, and output tokens. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * test(pricing): cover GitHub Copilot MAI Code Flash pricing Add regression coverage for both GitHub Copilot MAI-Code-1-Flash model names, including cached input pricing, chat endpoint metadata, and cost_per_token arithmetic. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(router/proxy): propagate completed_response through FallbackResponsesStreamWrapper for streaming /v1/responses container ownership (#30210) (#30213) * fix(router/proxy): propagate completed_response through FallbackResponsesStreamWrapper for streaming /v1/responses container ownership (#30210) #28990 added ownership recording for streaming /v1/responses via _wrap_responses_stream_for_container_ownership, which reads `getattr(stream_response, 'completed_response', None)` to extract the ResponsesAPIResponse. The unit test bypassed the Router, so it never exercised the production wrapping path. Through the Router (every proxy deployment), the stream is wrapped by FallbackResponsesStreamWrapper (router.py:2527). Its __init__ set `self.completed_response = None` and __anext__ only forwarded chunks — the inner source iterator's terminal event never bubbled up to the attribute the ownership hook reads, so the hook silently recorded nothing and every follow-up /v1/containers/<id>/files call returned 403 for non-admin keys. 
This commit: - router.py: pre-resolves the responses-API terminal event tuple (response.completed / .incomplete / .failed) once per _aresponses_streaming_iterator call, and has the wrapper's __anext__ sniff each forwarded chunk's .type. First terminal event hit gets stored on the wrapper's completed_response. Iterator-agnostic — works for source_iterator AND any future wrapper. - common_request_processing.py: when _extract_completed_responses_response returns None we now warn instead of silently skipping. Reporter on #30210 lost a day to this exact silent skip; the warning surfaces future regressions of the same shape directly in operator logs. Fixes #30210 * fix(router): type-ignore wrapper getattr-defaults; broaden ownership-skip warning CI lint (mypy) flagged the three pre-existing getattr(..., None) assignments in FallbackResponsesStreamWrapper.__init__: router.py:2564 self.response = getattr(source_iterator, 'response', None) router.py:2565 self.model = getattr(source_iterator, 'model', None) router.py:2566 self.logging_obj = getattr(..., None) Those lines also exist on litellm_internal_staging and pass mypy there. Adding the typed terminal-event tuple above the class made the function body more narrowable, which surfaced the pre-existing mismatch — base class declares non-Optional types but the bridge path (LiteLLMCompletionStreamingIterator) legitimately omits these. Keep the None fallback and silence with type: ignore[assignment]. Greptile 4/5 note: the ownership-skip warning hard-named code_interpreter which misleads operators when a non-code_interpreter stream aborts. Generalize to 'any tool container (e.g. code_interpreter)'. 
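The terminal-event sniffing described for the stream wrapper can be sketched in isolation. `TerminalSniffingWrapper` and `fake_stream` below are hypothetical stand-ins, not the real `FallbackResponsesStreamWrapper`; the sketch only assumes that each chunk exposes a `.type` attribute and that the first terminal event should be kept on `completed_response`.

```python
import asyncio
from types import SimpleNamespace

# Terminal event types named in the commit message.
TERMINAL_EVENT_TYPES = (
    "response.completed",
    "response.incomplete",
    "response.failed",
)


class TerminalSniffingWrapper:
    """Forwards every chunk unchanged and remembers the first terminal event."""

    def __init__(self, source):
        self._source = source
        self.completed_response = None

    def __aiter__(self):
        return self

    async def __anext__(self):
        # StopAsyncIteration from the source propagates and ends iteration.
        chunk = await self._source.__anext__()
        is_terminal = getattr(chunk, "type", None) in TERMINAL_EVENT_TYPES
        if self.completed_response is None and is_terminal:
            self.completed_response = chunk
        return chunk


async def fake_stream():
    # Stand-in for an upstream responses-API stream.
    yield SimpleNamespace(type="response.output_text.delta", delta="hi")
    yield SimpleNamespace(type="response.completed", response={"id": "resp_1"})


async def main():
    wrapper = TerminalSniffingWrapper(fake_stream())
    forwarded = [chunk.type async for chunk in wrapper]
    return wrapper, forwarded


wrapper, forwarded = asyncio.run(main())
print(forwarded, wrapper.completed_response.type)
```

Because the check looks only at the forwarded chunk, it does not depend on which inner iterator produced it, which matches the "iterator-agnostic" claim.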
* fix(register_model): drop synthesized zero costs to preserve sparse entries (#30198) (#30201) * fix(register_model): drop synthesized zero costs to preserve sparse entries (#30198) get_model_info synthesizes input_cost_per_token / output_cost_per_token = 0 when they are absent from the raw entry (the price-unknown and free cases share the same representation). register_model then merges that result back into litellm.model_cost, which flips a sparse entry from 'no cost keys' (priced via model name) to 'cost keys = 0' (free). That defeats _is_cost_explicitly_configured (#24949) on re-registration: _is_model_cost_zero returns True, common_checks skips every tag / key / team / user / org budget check for the group, and over-budget traffic keeps returning 200. Spend keeps recording because cost calc still resolves by model name, so the symptom is silent and only triggers on the second register_model pass (router rebuild, /model/update, config sync). Mirror the existing litellm_provider-None guard one block above and pop the cost fields from the synthesized result when they are absent from the raw entry and not in the caller's value. Caller-provided zeros (genuinely free models, BYOK overrides) are preserved. Fixes #30198 * fix(register_model): switch _raw_entry to is-None checks + drop dead test assertion Greptile #30201 review notes: - the `or`-chain in the raw-entry lookup treated an empty dict (a key with no fields) as falsy and fell through to the second arm — replace with explicit `is None` checks so a present-but-empty entry is still taken at face value. - the first assertion in `test_router_double_init_keeps_db_model_entry_sparse` used `in (None, 0)` which passes under the bug condition (cost = 0 matches the tuple); the strong follow-up assertion already covers every shape, so drop the dead branch. 
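The `or`-chain pitfall from the review note is easy to reproduce. The two lookup helpers and the key names below are hypothetical; they only illustrate why an empty dict needs an explicit `is None` check to be taken at face value.

```python
# Hypothetical raw-entry lookups: an `or` chain versus explicit is-None checks.
def lookup_with_or(model_cost: dict, key: str, fallback_key: str):
    # An empty dict is falsy, so this silently falls through to the fallback.
    return model_cost.get(key) or model_cost.get(fallback_key)


def lookup_with_is_none(model_cost: dict, key: str, fallback_key: str):
    entry = model_cost.get(key)
    if entry is None:  # only a missing key triggers the fallback
        entry = model_cost.get(fallback_key)
    return entry


model_cost = {
    "my-model": {},  # present but empty (sparse) entry
    "provider/my-model": {"input_cost_per_token": 0.0},
}

or_result = lookup_with_or(model_cost, "my-model", "provider/my-model")
none_result = lookup_with_is_none(model_cost, "my-model", "provider/my-model")
print(or_result, none_result)
```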
* fix(bedrock mantle): use unique function-call id for responses->chat tool calls (#30426) * fix(bedrock mantle): use unique function-call id for responses->chat tool calls ... * fix(bedrock mantle): scope unique tool-call id fallback to degenerate call_id The previous revision preferred the Responses item id for every tool call, which broke providers (and existing tests) where call_id is a unique, canonical correlation key. Restrict the fallback to the degenerate index-based call_id that Bedrock Mantle returns (call_0, call_1, ... resetting per response) and keep call_id otherwise. Revert the change to the OUTPUT_ITEM_DONE streaming handler, whose tool_call_chunk is never emitted (dead code, per review). Extend the regression tests to assert a normal call_id is preserved. * fix(router): preserve azure_ad_token through CredentialLiteLLMParams for /v1/files + batches (#30235) (#30241) * fix(router): preserve azure_ad_token through CredentialLiteLLMParams for /v1/files + batches (#30235) Router.get_deployment_credentials_with_provider re-validates a deployment's litellm_params through CredentialLiteLLMParams before handing them to file/batch/passthrough callers: return CredentialLiteLLMParams( **deployment.litellm_params.model_dump(exclude_none=True) ).model_dump(exclude_none=True) Any field NOT declared on CredentialLiteLLMParams gets silently dropped on the way through. azure_ad_token was undeclared, so Azure deployments using OAuth/M2M (azure_ad_token instead of a static api_key) silently lost their token at the files endpoint and the proxy returned: Missing credentials. Please pass one of api_key, azure_ad_token, azure_ad_token_provider, ... Declare azure_ad_token on CredentialLiteLLMParams alongside api_key / api_base / api_version so it rides through the round-trip. Static-key deployments stay unaffected (Optional, default None, dropped by exclude_none=True). Provider-callable (azure_ad_token_provider) is a separate concern and out of scope here. 
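The scoped fallback from the Bedrock Mantle entry above can be sketched as follows. The regex for a "degenerate" id (`call_` followed only by digits) and the `pick_tool_call_id` helper are assumptions made for illustration, inferred from the `call_0, call_1, ...` description; the actual implementation may classify ids differently.

```python
import re
from typing import Optional

# Assumed shape of the index-based ids described above: call_0, call_1, ...
DEGENERATE_CALL_ID = re.compile(r"^call_\d+$")


def pick_tool_call_id(call_id: Optional[str], item_id: Optional[str]) -> Optional[str]:
    """Keep a normal call_id; fall back to the item id only for index-based ones."""
    if call_id and not DEGENERATE_CALL_ID.match(call_id):
        return call_id
    return item_id or call_id


degenerate = pick_tool_call_id("call_0", "fc_abc123")
normal = pick_tool_call_id("call_9f3kQ2", "fc_abc123")
print(degenerate, normal)
```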
Fixes #30235 * fix(ui-types): regenerate schema.d.ts for new azure_ad_token field CI's 'Verify schema.d.ts matches the proxy OpenAPI spec' check auto-detected the new field and emitted the exact diff to apply. Two schemas had `aws_secret_access_key` from CredentialLiteLLMParams, both get the new azure_ad_token marker next to it. * fix(proxy): org_admin with own user_id now sees all org teams on /v2/team/list (#30247) When the UI sends the callers own user_id (as it does for non-Admin global roles), _enforce_list_team_v2_access now nulls it out for org admins so _build_team_list_where_conditions scopes by organization_id only -- matching the legacy /team/list behavior and the documented intent. Fixes #30215 Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * test(vertex_ai): multi-region regression coverage for cachedContents host (#29571) (#29707) litellm_internal_staging already routes the cachedContents URL through get_vertex_base_url, fixing the multi-region 404 reported in #29571 — but carries no test coverage for the actual regression scenario (eu/us must resolve to the REP host aiplatform.{geo}.rep.googleapis.com). Add TestContextCachingMultiRegionUrls: parametrized eu/us REP-host assertions (including absence of the old broken {geo}-aiplatform host), plus regional (us-central1) and global no-regression checks. * fix(proxy): close upstream LLM stream when client disconnects mid-stream (#30245) * fix(proxy): close upstream LLM stream when client disconnects mid-stream When a streaming client disconnects, Starlette abandons the response body iterator without calling aclose(), so the proxy's connection to the upstream backend stays open until garbage collection, which may never come. The backend (e.g. 
vLLM) keeps generating into a dead pipe: small responses drain invisibly into TCP buffers while large ones block the backend on a full send buffer indefinitely (observed via lsof as an ESTABLISHED proxy->backend connection minutes after the client left) create_response now returns a StreamingResponse subclass that closes both its body iterator and the wrapped upstream-facing generator in a shielded finally. The upstream generator is closed directly rather than through a cascade because aclose() on a never-started generator skips its body, which would make the cascade a no-op when the client disconnects before the first chunk is sent. async_streaming_data_generator also gains the same shielded finally-aclose that async_data_generator in proxy_server.py already had, covering the Anthropic and Google SSE paths With this, killing a streaming client causes the backend to observe the abort within about a second and free its slot, while completed streams are unaffected. No flag is needed, unlike the non-streaming opt-in cancel in #30223: this only releases resources after the client is already gone and does not change any response a client can observe Fixes #30244 * fix(proxy): close upstream even when body iterator aclose raises BaseException Addresses the Greptile finding on #30245: the cleanup loop caught only Exception while the generator-level cleanup catches BaseException, so a CancelledError or GeneratorExit escaping body_iterator.aclose() would skip closing the upstream generator. Both sites now use the same scope and a regression test pins that the upstream is closed even when the body iterator explodes with a BaseException * fix(llms): expose aclose on BaseModelResponseIterator so stream close reaches the provider connection The response-level close added for #30244 only worked for SDK-based providers (e.g. openai), whose streams expose aclose all the way down. 
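The claim that `aclose()` on a never-started generator skips its body can be checked with plain asyncio. The sketch below uses a toy `upstream` generator, which is an illustrative stand-in rather than proxy code: the `finally` block runs only when the generator had already been advanced before being closed.

```python
import asyncio

events = []


async def upstream():
    # Toy stand-in for an upstream-facing generator with cleanup in finally.
    try:
        yield "chunk"
    finally:
        events.append("upstream-finally")


async def main():
    never_started = upstream()
    await never_started.aclose()  # body never ran, so finally is skipped
    after_unstarted = list(events)

    started = upstream()
    await started.__anext__()  # advance to the first yield
    await started.aclose()  # GeneratorExit is raised at the yield; finally runs
    return after_unstarted, list(events)


after_unstarted, after_started = asyncio.run(main())
print(after_unstarted, after_started)
```

This is why cleanup that lives only inside the generator body cannot be relied on when the client disconnects before the first chunk, and why the commit closes the upstream generator directly.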
Providers served by base_llm_http_handler (hosted_vllm and most modern transformation-based providers) wrap a bare response.aiter_lines() generator in BaseModelResponseIterator, which had no aclose or close at all, and nothing retained the httpx response object; so CustomStreamWrapper.aclose() silently did nothing and the upstream connection stayed open. Verified with a vLLM-style mock: with hosted_vllm/ the backend streamed all 100 chunks to completion after the client disconnected, while openai/ aborted at chunk 6 BaseModelResponseIterator now carries an optional http_response and an aclose() that closes it; make_async_call_stream_helper attaches the response after building the iterator. With this, hosted_vllm aborts the backend within ~1.6s of the client dropping, and completed streams are unaffected --------- Co-authored-by: kursad <kursad.lacin@brado.net> * feat(anthropic): surface compaction usage iterations data (#27065) * feat(anthropic): surface compaction usage iterations data * style: apply black formatting to fix lint checks * fix(usage): correct calculate usage with cached tokens when use ChatCompletionUsageBlock (#30422) * fix(usage): correct calculate usage with cached tokens when use ChatCompletionUsageBlock * fix(usage): optimize test imports * feat: add fastCRW search provider (#30434) * feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider (#30203) * feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider * libertai: update served endpoints backup + add mode/matrix tests Addresses review feedback: - Add libertai to litellm/provider_endpoints_support_backup.json, the file actually served by GET /public/supported_endpoints (the root provider_endpoints_support.json already had it). - Add tests asserting bge-m3 normalizes to mode='embedding' and that the served matrix lists libertai. 
embeddings stays false: the JSON-configured provider path only wires chat routing (OpenAILike embedding handler is reached only for literal openai_like/llamafile/lm_studio), matching the llamagate precedent; bge-m3 remains in the cost map for metadata. --------- Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com> * feat(provider): add ModelScope as an OpenAI-compatible provider (#28460) * add ModelScope API support * add modelscope api support * update modelscope model list * add image-genetation support * update test and multimodal * fix: address PR review feedback for modelscope provider * update README * fix(customer_endpoints): restrict /customer/daily/activity to admin-only (#28849) * fix(customer_endpoints): restrict /customer/daily/activity to admin-only * fix(customer_endpoints): check role before prisma_client guard * fix(custom_guardrail): key disable_global_guardrails takes precedence over team guardrail list (#28563) * fix(fallbacks): preserve fallback model in SDK fallback responses (#28260) * fix(fallbacks): preserve fallback model in response when using SDK-level fallbacks * fix(fallbacks): gate x-litellm-* passthrough to trusted callers only The previous patch unconditionally let `x-litellm-*` keys bypass the `llm_provider-` prefix in `process_response_headers`. That function is also called on raw upstream-provider response headers (e.g. from `llm_http_handler.py`), so a malicious provider could return `x-litellm-attempted-fallbacks` and spoof a LiteLLM-internal marker, bypassing the proxy model-override guard. Add a `preserve_litellm_internal_headers` flag (default False). Only `response_metadata.py`, which re-processes the already-built `_hidden_params["additional_headers"]` dict (LiteLLM-owned), passes True. Raw provider header callsites keep the default False, so upstream `x-litellm-*` still gets the `llm_provider-` prefix. 
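The trusted-caller gate for `x-litellm-*` headers can be sketched as below. `prefix_provider_headers` is a hypothetical simplification and not the real `process_response_headers`, which handles more cases; only the `preserve_litellm_internal_headers` flag name and the `llm_provider-` prefix come from the commit message.

```python
# Hypothetical simplification of the header-prefixing gate described above.
def prefix_provider_headers(
    headers: dict, preserve_litellm_internal_headers: bool = False
) -> dict:
    """Prefix every header with llm_provider-, except x-litellm-* headers
    when the caller is trusted (flag set to True)."""
    processed = {}
    for key, value in headers.items():
        is_internal = key.lower().startswith("x-litellm-")
        if preserve_litellm_internal_headers and is_internal:
            processed[key] = value
        else:
            processed[f"llm_provider-{key}"] = value
    return processed


raw = {"x-litellm-attempted-fallbacks": "1", "x-request-id": "abc"}

# Raw upstream headers: default False, so a spoofed marker gets prefixed.
untrusted = prefix_provider_headers(raw)
# LiteLLM-owned headers being re-processed: internal markers survive.
trusted = prefix_provider_headers(raw, preserve_litellm_internal_headers=True)
print(untrusted)
print(trusted)
```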
Adds a regression test for the spoofing case and renames the existing preserve test to make the trusted-path semantics explicit. * fix(fallbacks): ignore preserve_litellm_internal_headers for raw httpx.Headers inputs * style(core_helpers): apply black formatting * fix(lint): remove banned typing.List/Dict/Any imports and suppress PLR0913 on interface overrides Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(lint): apply black formatting to modelscope chat transformation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(lint): replace noqa with proper fixes — use **kwargs and Awaitable instead of Any/List Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(lint): remove unused AllMessageValues import Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * revert: restore base_model_iterator.py to original PR state Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(lint): restore full method signatures for MyPy compatibility; bump PLR0913 budget for new provider files Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(lint): use @override to suppress PLR0913 on inherited signatures instead of bumping budget The overrides keep their full base-class signatures for MyPy compatibility, but those signatures carry more than five parameters, which tripped PLR0913 on each subclass redeclaration. Since the arity is dictated by the base class and cannot be reduced, decorate the overrides with typing_extensions.override; ruff treats that as the intended signal that the parameter count is not under the author's control and skips PLR0913. This restores the PLR0913 baseline to 1813. * fix(lint): add @override to modelscope image generation overrides Apply the same typing_extensions.override treatment to the image generation config so its inherited-signature overrides do not count against PLR0913.
--------- Co-authored-by: Joel Tony <github@jaytau.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: hcl <chenglunhu@gmail.com> Co-authored-by: ztko <96878659+koztkozt@users.noreply.github.com> Co-authored-by: Nahrin <nahrin@nahrinoda.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Humphrey <a739376838@gmail.com> Co-authored-by: kursadlacin <kursadlacin@gmail.com> Co-authored-by: kursad <kursad.lacin@brado.net> Co-authored-by: Dushyant Acharya <dushyantacharya873@gmail.com> Co-authored-by: Yuriy <yuriy.shuyskiy@gmail.com> Co-authored-by: Recep S <22618852+us@users.noreply.github.com> Co-authored-by: Moshe Malawach <moshe.malawach@protonmail.com> Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com> Co-authored-by: Rongkun Yan <2493404415@qq.com> Co-authored-by: Varshith <kvarshithgowda@gmail.com> Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>
…30554) * chore(codecov): add Batches, Videos, and Realtime components (#30517) * chore(codecov): add Batches, Videos, and Realtime components Define per-feature Codecov components so PR comments track coverage for batch API, video generation, and realtime streaming paths. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(codecov): use wildcard path for Batches proxy component Align batches_endpoints glob with Videos, Realtime, and Proxy_Authentication. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * test(batches): move orphan tests into tests/test_litellm for CI coverage (#30510) Four batch-related tests lived under tests/litellm/ and were never picked up by GitHub Actions. Relocate them and fix gemini multimodal e2e to use the batchEmbedContents path expected for gemini/ provider. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(guardrails): run pre_call hook once for model-level guardrails (#30543) * fix(guardrails): run pre_call hook once for model-level guardrails A CustomGuardrail attached to a deployment via litellm_params.guardrails gets its async_pre_call_hook invoked twice per request: once by the proxy pre-call loop and again by async_pre_call_deployment_hook after the router spreads the model-level guardrails into the top-level request kwargs. Record in request metadata that the proxy pre-call loop already ran a given guardrail, and have the deployment hook skip it when the marker is present. Direct-SDK usage never runs the proxy loop, so the deployment hook stays the sole invocation there and still fires exactly once. The marker key is stripped from untrusted caller metadata so a request body cannot suppress a model-only guardrail by pre-seeding it. * fix(guardrails): mark pre_call dedup on the post-hook request data Record the exactly-once marker after async_pre_call_hook runs, on the data object that flows downstream, rather than before it. 
A guardrail whose hook returns a brand-new request dict (instead of mutating or spreading the one it received) would otherwise discard the marker, letting the deployment hook re-run the guardrail a second time. * fix(guardrails): stop re-initializing DB guardrails on every poll (#30542) * fix(guardrails): stop re-initializing DB guardrails on every poll InMemoryGuardrailHandler._has_guardrail_params_changed compared the in-memory LitellmParams against the raw dict loaded from the DB. The in-memory side carries every field default and coerces enums via model_dump(), while the DB side only holds the keys originally stored, so the two shapes never compared equal and the guardrail was rebuilt on every poll cycle. Each rebuild created a fresh instance, but delete_in_memory_guardrail only removed the old callback from litellm.callbacks. Request handling promotes guardrail callbacks into the success/failure/async lists, so the previous instance stayed referenced there and instances accumulated. Normalize both sides through LitellmParams(...).model_dump() before diffing, and purge the callback from every callback list on delete. * refactor(guardrails): narrow params-normalization fallback to ValidationError The comparison normalizer caught a bare Exception and silently fell back to the raw dict, which hid the cause and quietly degraded the affected guardrail back to re-initializing on every poll. Catch only the ValidationError that LitellmParams construction can raise, log a warning so the offending row is diagnosable, and let any other error surface instead of being swallowed. * refactor(callbacks): add remove_callback_from_all_lists helper to manager Move the knowledge of which callback lists a callback can be promoted into out of the guardrail registry and into LoggingCallbackManager, where the rest of the callback-list bookkeeping already lives. delete_in_memory_guardrail now delegates to the new helper instead of iterating the lists itself. 
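The shape mismatch behind the re-initialization loop can be reproduced with a small stand-in. `GuardrailParams` below is a hypothetical dataclass used in place of the pydantic `LitellmParams` model, and its fields are illustrative; the point is only that a fully defaulted in-memory object never equals a sparse stored dict unless both sides are normalized first.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GuardrailParams:
    # Hypothetical stand-in for LitellmParams; field names are illustrative.
    guardrail: str
    mode: str = "pre_call"
    default_on: bool = False
    api_base: Optional[str] = None


def params_changed_raw(in_memory: GuardrailParams, db_row: dict) -> bool:
    # Buggy comparison: defaults exist on one side only, so this is always True.
    return asdict(in_memory) != db_row


def params_changed_normalized(in_memory: GuardrailParams, db_row: dict) -> bool:
    # Fixed comparison: push the stored row through the same model first.
    return asdict(in_memory) != asdict(GuardrailParams(**db_row))


in_memory = GuardrailParams(guardrail="my_guard")
db_row = {"guardrail": "my_guard"}  # only the keys originally stored

raw_changed = params_changed_raw(in_memory, db_row)
normalized_changed = params_changed_normalized(in_memory, db_row)
print(raw_changed, normalized_changed)
```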
* chore(oss): litellm oss staging 150626 (#30463) * fix(pricing): add GitHub Copilot MAI Code Flash pricing (#30415) * fix(pricing): add GitHub Copilot MAI Code Flash pricing Add GitHub Copilot pricing entries for MAI-Code-1-Flash and the internal Copilot CLI model name so cost calculation can price input, cached input, and output tokens. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * test(pricing): cover GitHub Copilot MAI Code Flash pricing Add regression coverage for both GitHub Copilot MAI-Code-1-Flash model names, including cached input pricing, chat endpoint metadata, and cost_per_token arithmetic. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(router/proxy): propagate completed_response through FallbackResponsesStreamWrapper for streaming /v1/responses container ownership (#30210) (#30213) * fix(router/proxy): propagate completed_response through FallbackResponsesStreamWrapper for streaming /v1/responses container ownership (#30210) #28990 added ownership recording for streaming /v1/responses via _wrap_responses_stream_for_container_ownership, which reads `getattr(stream_response, 'completed_response', None)` to extract the ResponsesAPIResponse. The unit test bypassed the Router, so it never exercised the production wrapping path. Through the Router (every proxy deployment), the stream is wrapped by FallbackResponsesStreamWrapper (router.py:2527). Its __init__ set `self.completed_response = None` and __anext__ only forwarded chunks — the inner source iterator's terminal event never bubbled up to the attribute the ownership hook reads, so the hook silently recorded nothing and every follow-up /v1/containers/<id>/files call returned 403 for non-admin keys. 
This commit: - router.py: pre-resolves the responses-API terminal event tuple (response.completed / .incomplete / .failed) once per _aresponses_streaming_iterator call, and has the wrapper's __anext__ sniff each forwarded chunk's .type. First terminal event hit gets stored on the wrapper's completed_response. Iterator-agnostic — works for source_iterator AND any future wrapper. - common_request_processing.py: when _extract_completed_responses_response returns None we now warn instead of silently skipping. Reporter on #30210 lost a day to this exact silent skip; the warning surfaces future regressions of the same shape directly in operator logs. Fixes #30210 * fix(router): type-ignore wrapper getattr-defaults; broaden ownership-skip warning CI lint (mypy) flagged the three pre-existing getattr(..., None) assignments in FallbackResponsesStreamWrapper.__init__: router.py:2564 self.response = getattr(source_iterator, 'response', None) router.py:2565 self.model = getattr(source_iterator, 'model', None) router.py:2566 self.logging_obj = getattr(..., None) Those lines also exist on litellm_internal_staging and pass mypy there. Adding the typed terminal-event tuple above the class made the function body more narrowable, which surfaced the pre-existing mismatch — base class declares non-Optional types but the bridge path (LiteLLMCompletionStreamingIterator) legitimately omits these. Keep the None fallback and silence with type: ignore[assignment]. Greptile 4/5 note: the ownership-skip warning hard-named code_interpreter which misleads operators when a non-code_interpreter stream aborts. Generalize to 'any tool container (e.g. code_interpreter)'. 
* fix(register_model): drop synthesized zero costs to preserve sparse entries (#30198) (#30201) * fix(register_model): drop synthesized zero costs to preserve sparse entries (#30198) get_model_info synthesizes input_cost_per_token / output_cost_per_token = 0 when they are absent from the raw entry (the price-unknown and free cases share the same representation). register_model then merges that result back into litellm.model_cost, which flips a sparse entry from 'no cost keys' (priced via model name) to 'cost keys = 0' (free). That defeats _is_cost_explicitly_configured (#24949) on re-registration: _is_model_cost_zero returns True, common_checks skips every tag / key / team / user / org budget check for the group, and over-budget traffic keeps returning 200. Spend keeps recording because cost calc still resolves by model name, so the symptom is silent and only triggers on the second register_model pass (router rebuild, /model/update, config sync). Mirror the existing litellm_provider-None guard one block above and pop the cost fields from the synthesized result when they are absent from the raw entry and not in the caller's value. Caller-provided zeros (genuinely free models, BYOK overrides) are preserved. Fixes #30198 * fix(register_model): switch _raw_entry to is-None checks + drop dead test assertion Greptile #30201 review notes: - the `or`-chain in the raw-entry lookup treated an empty dict (a key with no fields) as falsy and fell through to the second arm — replace with explicit `is None` checks so a present-but-empty entry is still taken at face value. - the first assertion in `test_router_double_init_keeps_db_model_entry_sparse` used `in (None, 0)` which passes under the bug condition (cost = 0 matches the tuple); the strong follow-up assertion already covers every shape, so drop the dead branch. 
* fix(bedrock mantle): use unique function-call id for responses->chat tool calls (#30426) * fix(bedrock mantle): use unique function-call id for responses->chat tool calls ... * fix(bedrock mantle): scope unique tool-call id fallback to degenerate call_id The previous revision preferred the Responses item id for every tool call, which broke providers (and existing tests) where call_id is a unique, canonical correlation key. Restrict the fallback to the degenerate index-based call_id that Bedrock Mantle returns (call_0, call_1, ... resetting per response) and keep call_id otherwise. Revert the change to the OUTPUT_ITEM_DONE streaming handler, whose tool_call_chunk is never emitted (dead code, per review). Extend the regression tests to assert a normal call_id is preserved. * fix(router): preserve azure_ad_token through CredentialLiteLLMParams for /v1/files + batches (#30235) (#30241) * fix(router): preserve azure_ad_token through CredentialLiteLLMParams for /v1/files + batches (#30235) Router.get_deployment_credentials_with_provider re-validates a deployment's litellm_params through CredentialLiteLLMParams before handing them to file/batch/passthrough callers: return CredentialLiteLLMParams( **deployment.litellm_params.model_dump(exclude_none=True) ).model_dump(exclude_none=True) Any field NOT declared on CredentialLiteLLMParams gets silently dropped on the way through. azure_ad_token was undeclared, so Azure deployments using OAuth/M2M (azure_ad_token instead of a static api_key) silently lost their token at the files endpoint and the proxy returned: Missing credentials. Please pass one of api_key, azure_ad_token, azure_ad_token_provider, ... Declare azure_ad_token on CredentialLiteLLMParams alongside api_key / api_base / api_version so it rides through the round-trip. Static-key deployments stay unaffected (Optional, default None, dropped by exclude_none=True). Provider-callable (azure_ad_token_provider) is a separate concern and out of scope here. 
Fixes #30235 * fix(ui-types): regenerate schema.d.ts for new azure_ad_token field CI's 'Verify schema.d.ts matches the proxy OpenAPI spec' check auto-detected the new field and emitted the exact diff to apply. Two schemas had `aws_secret_access_key` from CredentialLiteLLMParams, both get the new azure_ad_token marker next to it. * fix(proxy): org_admin with own user_id now sees all org teams on /v2/team/list (#30247) When the UI sends the callers own user_id (as it does for non-Admin global roles), _enforce_list_team_v2_access now nulls it out for org admins so _build_team_list_where_conditions scopes by organization_id only -- matching the legacy /team/list behavior and the documented intent. Fixes #30215 Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * test(vertex_ai): multi-region regression coverage for cachedContents host (#29571) (#29707) litellm_internal_staging already routes the cachedContents URL through get_vertex_base_url, fixing the multi-region 404 reported in #29571 — but carries no test coverage for the actual regression scenario (eu/us must resolve to the REP host aiplatform.{geo}.rep.googleapis.com). Add TestContextCachingMultiRegionUrls: parametrized eu/us REP-host assertions (including absence of the old broken {geo}-aiplatform host), plus regional (us-central1) and global no-regression checks. * fix(proxy): close upstream LLM stream when client disconnects mid-stream (#30245) * fix(proxy): close upstream LLM stream when client disconnects mid-stream When a streaming client disconnects, Starlette abandons the response body iterator without calling aclose(), so the proxy's connection to the upstream backend stays open until garbage collection, which may never come. The backend (e.g. 
vLLM) keeps generating into a dead pipe: small responses drain invisibly
into TCP buffers while large ones block the backend on a full send buffer
indefinitely (observed via lsof as an ESTABLISHED proxy->backend connection
minutes after the client left).
create_response now returns a StreamingResponse subclass that closes both
its body iterator and the wrapped upstream-facing generator in a shielded
finally. The upstream generator is closed directly rather than through a
cascade because aclose() on a never-started generator skips its body, which
would make the cascade a no-op when the client disconnects before the first
chunk is sent. async_streaming_data_generator also gains the same shielded
finally-aclose that async_data_generator in proxy_server.py already had,
covering the Anthropic and Google SSE paths.
With this, killing a streaming client causes the backend to observe the
abort within about a second and free its slot, while completed streams are
unaffected. No flag is needed, unlike the non-streaming opt-in cancel in
#30223: this only releases resources after the client is already gone and
does not change any response a client can observe.
Fixes #30244
* fix(proxy): close upstream even when body iterator aclose raises BaseException
Addresses the Greptile finding on #30245: the cleanup loop caught only
Exception while the generator-level cleanup catches BaseException, so a
CancelledError or GeneratorExit escaping body_iterator.aclose() would skip
closing the upstream generator. Both sites now use the same scope and a
regression test pins that the upstream is closed even when the body iterator
explodes with a BaseException.
* fix(llms): expose aclose on BaseModelResponseIterator so stream close reaches the provider connection
The response-level close added for #30244 only worked for SDK-based
providers (e.g. openai), whose streams expose aclose all the way down.
Providers served by base_llm_http_handler (hosted_vllm and most modern transformation-based providers) wrap a bare response.aiter_lines() generator in BaseModelResponseIterator, which had no aclose or close at all, and nothing retained the httpx response object; so CustomStreamWrapper.aclose() silently did nothing and the upstream connection stayed open. Verified with a vLLM-style mock: with hosted_vllm/ the backend streamed all 100 chunks to completion after the client disconnected, while openai/ aborted at chunk 6 BaseModelResponseIterator now carries an optional http_response and an aclose() that closes it; make_async_call_stream_helper attaches the response after building the iterator. With this, hosted_vllm aborts the backend within ~1.6s of the client dropping, and completed streams are unaffected --------- Co-authored-by: kursad <kursad.lacin@brado.net> * feat(anthropic): surface compaction usage iterations data (#27065) * feat(anthropic): surface compaction usage iterations data * style: apply black formatting to fix lint checks * fix(usage): correct calculate usage with cached tokens when use ChatCompletionUsageBlock (#30422) * fix(usage): correct calculate usage with cached tokens when use ChatCompletionUsageBlock * fix(usage): optimize test imports * feat: add fastCRW search provider (#30434) * feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider (#30203) * feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider * libertai: update served endpoints backup + add mode/matrix tests Addresses review feedback: - Add libertai to litellm/provider_endpoints_support_backup.json, the file actually served by GET /public/supported_endpoints (the root provider_endpoints_support.json already had it). - Add tests asserting bge-m3 normalizes to mode='embedding' and that the served matrix lists libertai. 
embeddings stays false: the JSON-configured provider path only wires chat
routing (OpenAILike embedding handler is reached only for literal
openai_like/llamafile/lm_studio), matching the llamagate precedent; bge-m3
remains in the cost map for metadata.
---------
Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com>
* feat(provider): add ModelScope as an OpenAI-compatible provider (#28460)
* add ModelScope API support
* add modelscope api support
* update modelscope model list
* add image-generation support
* update test and multimodal
* fix: address PR review feedback for modelscope provider
* update README
* fix(customer_endpoints): restrict /customer/daily/activity to admin-only (#28849)
* fix(customer_endpoints): restrict /customer/daily/activity to admin-only
* fix(customer_endpoints): check role before prisma_client guard
* fix(custom_guardrail): key disable_global_guardrails takes precedence over team guardrail list (#28563)
* fix(fallbacks): preserve fallback model in SDK fallback responses (#28260)
* fix(fallbacks): preserve fallback model in response when using SDK-level fallbacks
* fix(fallbacks): gate x-litellm-* passthrough to trusted callers only
The previous patch unconditionally let `x-litellm-*` keys bypass the
`llm_provider-` prefix in `process_response_headers`. That function is also
called on raw upstream-provider response headers (e.g. from
`llm_http_handler.py`), so a malicious provider could return
`x-litellm-attempted-fallbacks` and spoof a LiteLLM-internal marker,
bypassing the proxy model-override guard.
Add a `preserve_litellm_internal_headers` flag (default False). Only
`response_metadata.py`, which re-processes the already-built
`_hidden_params["additional_headers"]` dict (LiteLLM-owned), passes True.
Raw provider header callsites keep the default False, so upstream
`x-litellm-*` still gets the `llm_provider-` prefix.
Adds a regression test for the spoofing case and renames the existing
preserve test to make the trusted-path semantics explicit.
* fix(fallbacks): ignore preserve_litellm_internal_headers for raw httpx.Headers inputs
* style(core_helpers): apply black formatting
* fix(lint): remove banned typing.List/Dict/Any imports and suppress PLR0913 on interface overrides
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): apply black formatting to modelscope chat transformation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): replace noqa with proper fixes: use **kwargs and Awaitable instead of Any/List
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): remove unused AllMessageValues import
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* revert: restore base_model_iterator.py to original PR state
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): restore full method signatures for MyPy compatibility; bump PLR0913 budget for new provider files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): use @override to suppress PLR0913 on inherited signatures instead of bumping budget
The overrides keep their full base-class signatures for MyPy compatibility,
but those signatures carry more than five parameters, which tripped PLR0913
on each subclass redeclaration. Since the arity is dictated by the base
class and cannot be reduced, decorate the overrides with
typing_extensions.override; ruff treats that as the intended signal that the
parameter count is not under the author's control and skips PLR0913. This
restores the PLR0913 baseline to 1813.
* fix(lint): add @override to modelscope image generation overrides
Apply the same typing_extensions.override treatment to the image generation
config so its inherited-signature overrides do not count against PLR0913.
--------- Co-authored-by: Joel Tony <github@jaytau.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: hcl <chenglunhu@gmail.com> Co-authored-by: ztko <96878659+koztkozt@users.noreply.github.com> Co-authored-by: Nahrin <nahrin@nahrinoda.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Humphrey <a739376838@gmail.com> Co-authored-by: kursadlacin <kursadlacin@gmail.com> Co-authored-by: kursad <kursad.lacin@brado.net> Co-authored-by: Dushyant Acharya <dushyantacharya873@gmail.com> Co-authored-by: Yuriy <yuriy.shuyskiy@gmail.com> Co-authored-by: Recep S <22618852+us@users.noreply.github.com> Co-authored-by: Moshe Malawach <moshe.malawach@protonmail.com> Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com> Co-authored-by: Rongkun Yan <2493404415@qq.com> Co-authored-by: Varshith <kvarshithgowda@gmail.com> Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com> * ci(lint): add blanket-noqa, dataclass-default, and unused-noqa Ruff rules (#30516) * ci(lint): enforce blanket-noqa, dataclass-default, and unused-noqa rules Enable PGH004 (blanket-noqa), RUF008 (mutable-dataclass-default), RUF009 (function-call-in-dataclass-default-argument), and RUF100 (unused-noqa) in ruff.toml, and clean up every resulting violation. RUF008/RUF009 were already clean. PGH004/RUF100 surfaced ~335 stale or blanket noqas: blanket `# noqa` are now scoped to the rule they actually suppress (mostly T201), dead directives are removed, and inapplicable codes are trimmed (e.g. F401 dropped from `import *`). lint.external lists rules enforced outside this config (the strict-rule gate via ruff-strict.toml and upstream litellm's own ruff config) so RUF100 keeps the noqa directives that protect them instead of stripping coverage this config can't see. 
* ci(lint): trim RUF100 external list to load-bearing codes only Drop the 9 precautionary strict-gate codes (ANN001/002/003/401, B006, PLR0913, PLW0603, RUF012, TID251) that have zero `# noqa` references in the gated source. Keep only the 11 codes with live suppressions so RUF100 doesn't flag them as unused. Future strict-gate suppressions can re-add codes here (or fix the underlying issue) as needed. * ci: ratchet lint and type-check gates (ruff preview, ANN, mypy, basedpyright) (#30379) * ci: enable ruff preview rules under the budgeted strict gate Turn on ruff preview in the strict-budget lane (ruff-strict.toml) only, leaving the clean gate (ruff.toml) untouched so make lint-ruff stays at zero. Enumerate the 118 firing codes explicitly with explicit-preview-rules so the gate is deterministic and stable across ruff upgrades rather than depending on preview auto-selecting the broad catalog. Grandfather the existing 58438 violations into ruff-strict-budget.json as per-rule baselines with headroom, so only net-new violations fail CI. The existing ten rules keep their hand-tuned slack; the new rules get slack 10 when the baseline is 50 or more and 3 otherwise. * ci: add ANN return-type rules to the budgeted strict gate Add ANN201/202/204/205/206 (missing return annotations) to the strict lane and grandfather the existing counts into ruff-strict-budget.json so the codebase ratchets toward explicit return types without breaking CI. * ci: add mypy (disallow_untyped_defs) and basedpyright strict gates with baselines Add two type-check gates, each grandfathering the current tree so only net-new violations fail CI, matching the ruff strict-budget ratchet. mypy gains disallow_untyped_defs in litellm/mypy.ini (the config the CI invocation actually reads; the root [tool.mypy] is not picked up from the litellm/ working dir). 
The 4885 existing missing-annotation errors are captured in litellm/.mypy-baseline.txt and the run is piped through mypy-baseline filter so new untyped defs are rejected. basedpyright runs in strict mode over litellm/, with enableTypeIgnoreComments disabled so it only honors '# pyright: ignore' and never polices mypy's '# type: ignore'. The existing strict diagnostics are grandfathered into .basedpyright/baseline.json. Both tools are pinned in the dev group and uv.lock; the lint workflow and Makefile run them filtered through their baselines, with lint-mypy-baseline-update and lint-basedpyright-baseline-update to ratchet. * ci: raise lint job timeout to 15m for the basedpyright strict pass * ci: pin pythonVersion 3.12 and regenerate baselines against merged base Merge litellm_internal_staging so the baselines cover code the CI merge includes (e.g. the cisco_ai_defense guardrail), which otherwise tripped the mypy gate with 3 ungrandfathered no-untyped-def errors. Pin pythonVersion 3.12 in pyrightconfig so basedpyright's strict analysis is reproducible across interpreter versions (CI runs 3.12). * ci: regenerate basedpyright baseline against the frozen lint env The previous baseline was generated with optional provider deps (azure, google, anthropic, mcp, numpydoc, google-genai) installed locally, so CI's dev-only env surfaced ~3500 reportUnknown*/reportMissingTypeStubs errors not in the baseline. Regenerate after uv sync --frozen so the baseline reflects the same dependency set the lint job sees. * ci: regenerate basedpyright baseline on python 3.12 frozen env The prior baseline still carried proxy-dev packages (e.g. prisma) that the lint job's dev-only, python 3.12 env lacks, leaving 2 unresolved-import errors ungrandfathered. Regenerate in a python 3.12 venv synced to the frozen lock with default groups only, so the baseline matches exactly what CI sees. 
* ci: replace type-check baselines with per-file count budgets The mypy and basedpyright baselines were position-sensitive (and the basedpyright one was a 27MB file), so ordinary line shifts churned them. Replace both with a per-file count gate: scripts/type_check_gate.py reduces each tool's output to errors-per-file and checks it against a committed {file: max} budget, ignoring line and column numbers. A file fails only when it gains more errors than its ceiling; debt can't be shuffled between files because each file has its own cap and new files default to zero. Budgets (mypy-file-budget.json 48K, basedpyright-file-budget.json 96K) are generated in the python 3.12 frozen lint env so they match CI. Drops the mypy-baseline dependency; basedpyright runs without its native baseline. ratchet via make lint-mypy-budget-update / lint-basedpyright-budget-update. * ci: add a small per-file slack to the type-check gate Allow each file to drift PER_FILE_SLACK (5) errors past its recorded count before failing, so a basedpyright inference ripple in an unrelated file doesn't break the build over a couple of errors. Budgets still record exact counts; the tolerance is applied at check time. * ci: move type-check slack into the budget json and trim lint timeout Make slack declarative: the budget is now {"slack": N, "files": {path: count}} so the tolerance is tuned in JSON without editing the script, mirroring how ruff-strict-budget.json carries its slack. --update preserves the existing slack. Also drop the lint job timeout from 15m to 10m; the mypy and basedpyright passes add ~2m, leaving the job around 4-5m, so 10m is a comfortable margin. * ci: collapse fully-adopted ruff categories and drop inert preview flag ANN (all nine non-removed rules) and BLE (its only rule) were spelled out code-by-code; replace each with its category selector, which is exactly equivalent in 0.15.3 (the removed ANN101/ANN102 are skipped by a category selector and error when named explicitly). 
explicit-preview-rules was inert: every selected rule is stable and nothing is selected by category, so the flag had nothing to gate. Verified the strict-rule counts are identical before and after (62379 each, zero per-rule drift), so no budget change. * ci: drop redundant pyright dev dependency Nothing invokes bare pyright in the Makefile, the linting workflow, or scripts; the basedpyright gate added on this branch is the only type checker that runs. basedpyright is a superset fork that reads the same pyrightconfig.json and honors the same "# pyright: ignore" comments, so pyright==1.1.408 in the ci group was dead weight. Regenerated uv.lock under the same exclude-newer cutoff so the only change is removing pyright and its package stanza * ci: un-weaken mypy and error on Any in basedpyright mypy: enable warn_return_any, drop the valid-type silencer, and stop globally ignoring missing first-party imports via [mypy-litellm.*] ignore_missing_imports = False, which surfaced eight real broken litellm.* imports the blanket ignore was hiding; third-party imports stay ignored. The per-file budget moves 4888 -> 5799 (902 no-any-return, 1 valid-type, 8 import-not-found), all grandfathered so only net-new errors fail and the ceilings ratchet down basedpyright: error on reportExplicitAny and reportAny. The per-file budget moves 117033 -> 148946 (6931 explicit-Any, 24954 Any-typed expressions), grandfathered the same way * ci: add Any-discipline gate on changed lines under litellm/ Add scripts/check_any_discipline.py, a type-aware gate that fails when a changed line holds a value typed Any -- including the X | Any unions that mypy --strict / basedpyright accept (e.g. re.Match.group() -> str | Any, json.loads() -> Any, bare dict -> dict[Any, Any]). 
It reuses the repo's mypyc-compiled mypy 1.19 via a custom generic AST walker (mypyc precludes subclassing TraverserVisitor), loads litellm/mypy.ini for parity with lint-mypy, and uses a dedicated incremental cache (.mypy_cache_any) with mtime+hash invalidation to force re-checks. Scope is changed-lines-only so editing a legacy file never forces cleaning its existing Any debt; suppress a genuine typed/untyped boundary with # any-ok: <reason> (ANY002 requires the reason). Wire it into the Makefile (lint-any, lint, lint-dev), a parallel any-discipline CI job with its own actions/cache, .gitignore, and the CLAUDE.md / CONTRIBUTING.md docs. * ci: move Any-gate codes into the shared LIT namespace Renumber the Any-discipline checker into the LIT*** scheme owned by scripts/check_type_discipline.py (PR #30500) so the two checkers share one rule namespace and suppression convention: ANY001 -> LIT002 (Any-typed value; LIT002 was the retired/free slot) ANY002 -> LIT005 (any-ok without a reason; the shared suppression-reason code) ANY000 -> LIT000 (setup/build/read error; the shared error code) Messages and behavior are unchanged; LIT005's text already matches the "<token> requires a reason" shape used for cast-ok/guard-ok. * ci: gate mypy and basedpyright per error rule, not per file Switch the mypy/basedpyright budget gate from per-file error counts to per-rule-code totals, mirroring the {rule: {baseline, slack}} shape of ruff-strict-budget.json. A rule fails when its codebase-wide error count exceeds baseline + slack, so violations are tracked by category rather than by file location. scripts/type_check_gate.py now parses mypy from its text output (trailing [code]) and basedpyright from --outputjson (the JSON `rule` field), since basedpyright's wrapped text diagnostics mis-attribute the rule on continuation lines. Replace the *-file-budget.json files with freshly captured *-code-budget.json baselines and update the Makefile, CI, and CLAUDE.md accordingly. 
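The per-rule gate described above (a rule fails when its codebase-wide count exceeds baseline + slack) could look roughly like this. It is a sketch under assumptions, not the contents of scripts/type_check_gate.py: the function names, the regular expression, and the zero ceiling for rules missing from the budget are all illustrative choices. Only the `{rule: {baseline, slack}}` budget shape and the trailing `[code]` parsing come from the commit message.

```python
import re
from collections import Counter

# mypy prints the error code in trailing brackets, e.g. "... [no-any-return]".
MYPY_CODE = re.compile(r"\berror:.*\[([a-z0-9-]+)\]\s*$")


def count_by_rule(mypy_output):
    """Reduce mypy text output to {rule_code: count}, ignoring file and line."""
    counts = Counter()
    for line in mypy_output.splitlines():
        match = MYPY_CODE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def breached_rules(counts, budget):
    """Return {rule: (count, ceiling)} for rules over baseline + slack.

    Rules missing from the budget get a ceiling of zero (an assumption here).
    """
    breaches = {}
    for rule, count in counts.items():
        entry = budget.get(rule, {"baseline": 0, "slack": 0})
        ceiling = entry["baseline"] + entry["slack"]
        if count > ceiling:
            breaches[rule] = (count, ceiling)
    return breaches


sample_output = """\
litellm/a.py:10: error: Returning Any from function declared to return "str"  [no-any-return]
litellm/b.py:22: error: Returning Any from function declared to return "int"  [no-any-return]
litellm/c.py:5: error: Function is missing a return type annotation  [no-untyped-def]
"""
sample_budget = {
    "no-any-return": {"baseline": 1, "slack": 0},
    "no-untyped-def": {"baseline": 1, "slack": 0},
}
sample_counts = count_by_rule(sample_output)
sample_breaches = breached_rules(sample_counts, sample_budget)
```

Counting by rule code means an error that moves between files or lines leaves the totals unchanged, so ordinary line shifts do not churn the budget.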
* docs: prefer Pydantic validation over any-ok suppression Point the Any-discipline guidance at validating Any with Pydantic (a model or TypeAdapter that returns a typed value or raises) and frame # any-ok as a last resort that should ideally never be used. * chore: remove extraneous comment * chore: make the CLAUDE.md more concise * chore: clean up bloated CONTRIBUTING.md additions * chore: make Makefile more concise * ci: add the lint-budget-update target CLAUDE.md references CLAUDE.md tells contributors to run make lint-budget-update, but the target was never defined. Add it as an aggregate that re-captures the ruff, mypy, and basedpyright budgets in one shot. * ci: recapture mypy and basedpyright budgets in the lint env The per-rule baselines were captured in a richer dependency env than the CI lint job's uv sync --frozen, so CI resolved fewer types and reported more errors than the budgets allowed (no-any-return 902 over cap 900, plus several basedpyright reportUnknown* rules). Regenerate both in the frozen env so they grandfather the true CI debt: mypy 5786 -> 5799 (no-any-return 890 -> 902, valid-type 1 restored), basedpyright 146213 -> 148942. * ci: check out PR head sha in lint and any-discipline jobs The default pull_request checkout uses refs/pull/N/merge, which folds the latest base commits into HEAD. The diff-based gates (ruff delta, Any discipline) then diff against the event's older base.sha and blame base's own new commits on this branch; staging's otel-v2 and streaming changes (#30326, #30485) tripped the Any gate on files this branch never touched. Checking out the PR head sha makes the gates diff the real branch tip against base, and pins the tree the mypy/basedpyright budgets were captured against so their counts stay deterministic as the base advances. 
* ci(lint): renumber Any-typed-value rule LIT002 -> LIT009 Free up LIT002 for the sibling type-discipline gate (check_type_discipline.py, #30500), which groups its mutable-collection family at LIT001 (annotation) and LIT002 (construction). This gate's Any-typed-value rule moves to LIT009 so the shared LIT namespace stays contiguous with no holes; LIT000 and LIT005 are unchanged. * style: rename lint-strict-budget -> lint-ruff-budget * ci: harden type-check gates against silent passes (greptile review) type_check_gate.py: refuse to certify a vacuous run. The CI pipe swallows the tool's exit code ('tool || true'), so a crashed mypy/basedpyright that emits nothing would parse to zero errors, breach no ceiling, and pass. is_vacuous_run() now fails when nothing was parsed but the budget expects errors. Also wrap basedpyright's json.loads in a JSONDecodeError handler that prints the offending output instead of dumping a raw traceback. check_any_discipline.py: ALL_LINES was None, which dict.get() also returns for a path absent from the line map, so a path-normalisation mismatch could let a violation on an unchanged file pass the scope filter. Make ALL_LINES a distinct sentinel object so 'whole file' and 'path missing' are unambiguous. Adds tests for all three. 
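The two hardening ideas in the last commit above (a distinct sentinel for "whole file" and a refusal to certify an empty run) can be sketched as follows. `ALL_LINES` and `is_vacuous_run` are names from the commit message, but the bodies shown here, the `in_scope` helper, and the data shapes are assumptions for illustration only.

```python
# A unique sentinel: dict.get() can never return it for a missing key.
ALL_LINES = object()


def in_scope(line_map, path, line_number):
    """True when a diagnostic at path:line_number falls on a changed line."""
    changed = line_map.get(path)  # None now only ever means "path absent"
    if changed is None:
        return False
    if changed is ALL_LINES:
        return True
    return line_number in changed


def is_vacuous_run(parsed_counts, budget):
    """Nothing parsed although the budget expects errors: treat as a crashed tool."""
    expects_errors = any(entry["baseline"] > 0 for entry in budget.values())
    return not parsed_counts and expects_errors


demo_line_map = {"litellm/new.py": ALL_LINES, "litellm/old.py": {3, 4}}
demo_budget = {"no-any-return": {"baseline": 902, "slack": 0}}
```

With `None` as the "whole file" marker, a path-normalisation mismatch and a newly added file looked identical; the sentinel object makes the two cases distinguishable by identity.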
--------- Co-authored-by: Sameer Kankute <sameer@berri.ai> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: Joel Tony <github@jaytau.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: hcl <chenglunhu@gmail.com> Co-authored-by: ztko <96878659+koztkozt@users.noreply.github.com> Co-authored-by: Nahrin <nahrin@nahrinoda.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Humphrey <a739376838@gmail.com> Co-authored-by: kursadlacin <kursadlacin@gmail.com> Co-authored-by: kursad <kursad.lacin@brado.net> Co-authored-by: Dushyant Acharya <dushyantacharya873@gmail.com> Co-authored-by: Yuriy <yuriy.shuyskiy@gmail.com> Co-authored-by: Recep S <22618852+us@users.noreply.github.com> Co-authored-by: Moshe Malawach <moshe.malawach@protonmail.com> Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com> Co-authored-by: Rongkun Yan <2493404415@qq.com> Co-authored-by: Varshith <kvarshithgowda@gmail.com>
* feat(ui): gate "Default Credentials" hint on /ui/login behind env flag (#30234)
Adds LITELLM_HIDE_DEFAULT_CREDENTIALS_HINT (and an equivalent
general_settings.hide_default_credentials_hint) that suppresses the
"By default, Username is admin and Password is your set LiteLLM Proxy
MASTER_KEY" info card rendered on /ui/login and /fallback/login.
Motivation: in production deployments operators set UI_USERNAME /
UI_PASSWORD (or SSO), and the hardcoded hint becomes factually
incorrect and is flagged by security scanners (Tenable WAS plugin
114625) as information disclosure. There is currently no way to
suppress it without forking the dashboard.
Behaviour:
- Default is unchanged (hint shown), so existing deployments are
unaffected.
- New field hide_default_credentials_hint on the well-known UI config
endpoint, populated from the env var or general_settings.
- LoginPage.tsx conditionally renders the Alert based on the flag.
Refs: BerriAI/litellm#30232
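One way the flag resolution could work is sketched below. The env var and settings key are the ones named above; the function name, the set of accepted truthy strings, and the choice to let the env var take precedence over general_settings are assumptions, since the commit message does not specify them.

```python
import os

# Accepted truthy spellings are an assumption, not taken from the PR.
TRUTHY = {"1", "true", "yes", "on"}


def resolve_hide_default_credentials_hint(general_settings, environ=None):
    """Return True when the login hint should be hidden (default: shown)."""
    environ = os.environ if environ is None else environ
    env_value = environ.get("LITELLM_HIDE_DEFAULT_CREDENTIALS_HINT")
    if env_value is not None:
        # Assumed precedence: an explicit env var wins over general_settings.
        return env_value.strip().lower() in TRUTHY
    return bool(general_settings.get("hide_default_credentials_hint", False))


hint_default = resolve_hide_default_credentials_hint({}, environ={})
hint_from_env = resolve_hide_default_credentials_hint(
    {}, environ={"LITELLM_HIDE_DEFAULT_CREDENTIALS_HINT": "true"}
)
hint_from_settings = resolve_hide_default_credentials_hint(
    {"hide_default_credentials_hint": True}, environ={}
)
```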
* fix(router): clean pattern_router state on upsert/delete (#29601)
* fix(router): clean pattern_router state on upsert/delete
PatternMatchRouter.add_pattern was append-only, and neither Router.upsert_deployment nor Router.delete_deployment removed the existing entry. Rotated-out api_keys stayed in the routing rotation for wildcard deployments (model_name with `*`) until proxy restart, silently defeating key rotation as an admin operation. The same leak applied to provider_default_deployment_ids and per-team pattern routers, and the patterns list grew unboundedly on every edit.
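To illustrate the leak and the fix, here is a deliberately tiny model of an append-only pattern table with a removal step added before re-insertion. `TinyPatternRouter` is hypothetical and far simpler than the real PatternMatchRouter; the deployment dict shape is also an assumption.

```python
from collections import defaultdict


class TinyPatternRouter:
    """Hypothetical miniature; not the real PatternMatchRouter."""

    def __init__(self):
        self.patterns = defaultdict(list)  # pattern -> list of deployment dicts

    def add_pattern(self, pattern, deployment):
        # Append-only on its own: repeated calls accumulate entries.
        self.patterns[pattern].append(deployment)

    def remove_deployment(self, model_id):
        if not model_id:  # falsy guard: None or "" is a no-op
            return
        for pattern in list(self.patterns):
            kept = [d for d in self.patterns[pattern] if d["model_id"] != model_id]
            if kept:
                self.patterns[pattern] = kept
            else:
                del self.patterns[pattern]

    def upsert(self, pattern, deployment):
        # Remove the stale entry first so a rotated key replaces the old one.
        self.remove_deployment(deployment["model_id"])
        self.add_pattern(pattern, deployment)


router = TinyPatternRouter()
router.upsert("openai/*", {"model_id": "dep-1", "api_key": "old-key"})
router.upsert("openai/*", {"model_id": "dep-1", "api_key": "new-key"})
active_keys = [d["api_key"] for d in router.patterns["openai/*"]]
```

Without the `remove_deployment` call in `upsert`, both keys would remain in the list after the second call, which is the rotation bug described above. Removal is idempotent, so calling it for an id that was never added is harmless.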
* test(router): direct unit tests for _remove_deployment_from_wildcard_state
router_code_coverage.py greps test files for AST Call nodes and flagged
the helper as untested because the existing coverage only exercised it
transitively through upsert/delete. Adds two direct tests that pin the
helper's contract (cleans across global pattern router, per-team
routers with empty-router pop, and provider_default_deployment_ids;
no-op on falsy model_id).
* fix(router): address Greptile review on pattern_router cleanup
Widen PatternMatchRouter.remove_deployment annotation to Optional[str];
the implementation already handles None via the falsy guard and the
unit test exercises it directly.
Move _remove_deployment_from_wildcard_state up one level in
upsert_deployment so it runs whenever the prior deployment is on the
router, not only when the model_id is present in the fast-mapping
index. The scenario is currently unreachable (get_deployment shares
the same index), but the cleanup is idempotent so this is defensive
against any future divergence between those code paths.
* fix(router): widen _remove_deployment_from_wildcard_state to Optional[str]
Moving the call out of the inner `deployment_id in deployment_fast_mapping`
block in the previous commit lost mypy's narrowing of `deployment_id`
from Optional[str] to str, tripping the lint CI. The helper already
handles None via its falsy guard, so widening the annotation matches
the actual contract.
* fix(router): make delete_deployment wildcard cleanup symmetric with upsert
After the previous commit moved _remove_deployment_from_wildcard_state out
of the inner index-map guard in upsert_deployment, delete_deployment was
still calling it only inside `if deployment_idx is not None`. Greptile
flagged the asymmetry: under a desynced index_map, delete would silently
leave the stale wildcard credential in pattern_router.
Moves the cleanup call to the top of the try block, mirroring the upsert
path. Cleanup is idempotent so the change is a no-op on the happy path.
Adds a regression test that simulates the desync by removing the entry
from model_id_to_deployment_index_map and asserts delete still clears
pattern_router.
* fix(pricing): add 1h cache-write cost for Anthropic Sonnet 4.5/4.6 (#30474)
The native anthropic claude-sonnet-4-5/4-6 price-map entries were missing
cache_creation_input_token_cost_above_1hr (and the >200K long-context
sub-tier for 4.5), so 1-hour-TTL cache writes were costed at the 5-minute
rate. Adds 6e-06 regular (and 1.2e-05 long-context) = 2x base input,
matching the vertex_ai/azure_ai/bedrock siblings and the older
claude-sonnet-4-20250514 entry. Adds a regression test.
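As a worked example of the arithmetic, the sketch below derives the two 1-hour cache-write rates from the "2x base input" relationship stated above. The base input rates (3e-06 regular, 6e-06 long-context) are inferred by halving the figures in the commit message rather than quoted from the price map, and the function name is hypothetical.

```python
# Inferred from "6e-06 regular (and 1.2e-05 long-context) = 2x base input".
BASE_INPUT_COST = 3e-06
LONG_CONTEXT_INPUT_COST = 6e-06
ONE_HOUR_MULTIPLIER = 2


def one_hour_cache_write_cost(tokens, long_context=False):
    """Cost in USD of writing `tokens` to the 1-hour-TTL cache."""
    per_token = LONG_CONTEXT_INPUT_COST if long_context else BASE_INPUT_COST
    return tokens * per_token * ONE_HOUR_MULTIPLIER


regular_per_token = BASE_INPUT_COST * ONE_HOUR_MULTIPLIER
long_context_per_token = LONG_CONTEXT_INPUT_COST * ONE_HOUR_MULTIPLIER
regular_million = one_hour_cache_write_cost(1_000_000)
long_context_million = one_hour_cache_write_cost(1_000_000, long_context=True)
```

Under these inferred rates, one million tokens written with a 1-hour TTL come to roughly 6 USD at the regular tier and roughly 12 USD at the long-context tier.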
* fix(proxy): cancel upstream gemini request and release httpx connection on client disconnect (#30075)
* fix(proxy): cancel upstream gemini request and release httpx connection on client disconnect
- add _check_request_disconnection to common_request_processing; wrap llm_call
as asyncio.Task so it can be cancelled; catch CancelledError and raise
HTTPException(499) when client disconnects before LLM responds (non-streaming path)
- pass raw httpx.Response into ModelResponseIterator in make_call/make_sync_call
so the iterator holds a reference to the underlying connection
- implement ModelResponseIterator.aclose() and .close(): close the line iterator
then explicitly call response.aclose()/response.close() to release the httpx
connection when the client drops mid-stream; errors are debug-logged, not raised
- add tests for _check_request_disconnection (cancels task, graceful on exception,
does not cancel when client stays connected) and base_process_llm_request 499
behavior; add TestModelResponseIteratorCleanup verifying aclose/close propagation
through CustomStreamWrapper
* fix(proxy): record 499 on streaming disconnect and cancel orphaned gather tasks
Wire streaming generator cleanup to log client_disconnected with error_code 499
in spend logs, cancel pending during_call_hook tasks when the LLM call is
cancelled on disconnect, and align the 600s poll limit comment with proxy_server.
* fix: extract client disconnect logging helper to satisfy PLR0915
* fix: resolve mypy and code-quality CI failures for client disconnect logging
Cast client disconnect error_information for mypy, only await pending gather tasks to avoid masking LLM errors, and add tests for the new logging helper and gather cleanup.
* fix(proxy): harden gather cleanup so finally cannot mask LLM errors
* fix(proxy): shield streaming disconnect logging and strip spoofable metadata
Move streaming disconnect recording into a shielded cancel scope, add gather cleanup regression coverage for guardrail-converted cancels, and strip client_disconnected/error_information from user metadata at the proxy boundary.
* fix(proxy): only map CancelledError to 499 for client disconnect
Track when the disconnect poller cancels the LLM task and re-raise other CancelledError paths so graceful shutdown is not reported as HTTP 499.
* fix(proxy): remove dead _check_request_disconnection helper
Non-streaming client disconnect is handled by staging's cancel_on_disconnect path via _await_llm_call_cancelling_on_disconnect. Drop the unused is_disconnected poller and its unit tests; rename the remaining integration tests to TestDisconnectGatherCleanup.
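The cancel-on-disconnect idea in this series can be sketched roughly as follows. This is an illustration only, not the proxy's code: `await_cancelling_on_disconnect`, `ClientDisconnected`, and the polling interval are hypothetical stand-ins (the real path raises an HTTP 499 via the framework). Only a cancel initiated by the disconnect poller is mapped to 499; any other CancelledError propagates unchanged.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class ClientDisconnected(Exception):
    """Hypothetical stand-in for the proxy's HTTPException(499)."""

    status_code = 499


async def await_cancelling_on_disconnect(
    llm_call: Awaitable[T],
    is_disconnected: Callable[[], Awaitable[bool]],
    poll_interval: float = 0.01,
) -> T:
    # Wrap the call in a Task so it can be cancelled independently.
    task = asyncio.ensure_future(llm_call)
    try:
        while True:
            done, _ = await asyncio.wait({task}, timeout=poll_interval)
            if done:
                # Re-raises the call's own exception (or CancelledError) unchanged.
                return task.result()
            if await is_disconnected():
                # The poller itself cancels the task, so this path maps to 499.
                task.cancel()
                await asyncio.wait({task})
                raise ClientDisconnected("client disconnected before the LLM responded")
    finally:
        # If this coroutine is cancelled from outside (e.g. shutdown),
        # stop the upstream call but let the CancelledError propagate.
        if not task.done():
            task.cancel()
```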
* feat(mistral): add mistral-medium-3-5 to model_prices_and_context_wind.. (#29303)
* feat(mistral): add mistral-medium-3-5 to model_prices_and_context_window.json
Mistral's docs page lists mistral-medium-3-5 as a new model offering.
Pricing/specs sourced from Mistral's published model metadata:
- input: $1.50 / 1M tokens
- output: $7.50 / 1M tokens
- context: 262,144 tokens
- capabilities: vision, function calling, structured outputs, assistant
prefill
Adds entry: `mistral/mistral-medium-3-5`, mirroring the pattern used for
the rest of the Mistral family.
test(mistral): add model_info test for mistral-medium-3-5 + sync backup cost map
- Mirror mistral/mistral-medium-3-5 entries into
litellm/model_prices_and_context_window_backup.json so the bundled
model cost map matches the canonical
model_prices_and_context_window.json.
- Add tests/test_litellm/test_mistral_medium_3_5_model_metadata.py
covering pricing tiers, capability flags, context window, provider
routing, and parity between the main and backup cost maps.
- Point 'source' at the live Mistral models documentation page.
* fix(ui): three small UI fixes — Gemini api_base + credential form reset + Mode badge (#30419)
* fix(ui): three small UI fixes — Gemini api_base field + credential form reset + Mode badge
Three independent fixes; bundled because they all touch the
credential-form / logging-callbacks area.
1. expose api_base field on Google AI Studio credential form
The runtime gemini provider supports custom api_base via
`vertex_llm_base._check_custom_proxy`; the UI just needs to expose
the field. Adds api_base to the Google_AI_Studio credential form
ordered before api_key (matching OpenAI/Anthropic conventions).
Default value matches the canonical Google AI Studio endpoint that
LiteLLM's gemini provider talks to when api_base is unset, so
leaving the default in the form behaves identically to leaving it
blank.
2. reset credential form state when switching providers
Switching the Provider select in AddCredentialModal / EditCredentialModal
left the previous provider's field values populated. The form then
submitted a mixed payload (e.g. Azure deployment fields under an
OpenAI credential), producing confusing failures.
Extract `getProviderFieldDefaults` helper and reset the form to it
on provider change. Unit-tested via the extracted helper because
Antd Select's portal/dropdown behaviour is unreliable in jsdom.
3. logging callbacks table reads backend `type` for Mode badge (#35)
The `/get_callbacks` proxy endpoint returns each callback as
`{name, type, variables}` where `type` is `"success"` or
`"failure"`. The same callback name can appear twice (one per event
class) and the two entries fire on disjoint events.
`LoggingCallbacksTable` ignored `type` and read `record.mode`
(always undefined), so every row fell back to the "Success" badge.
A `generic_api` callback registered for both classes showed up as
two identical "Success" rows + React duplicate-key warning.
Read `record.type` first (fall back to `record.mode` for newly-
added not-yet-server-acknowledged rows). Composite rowKey
`${name}-${type ?? mode ?? 'success'}`. Removed leftover debug
`console.log`.
* fix(ui): drop api_base default_value to preserve Gemini v1alpha auto-routing
Greptile P2 (PR #30419, threads on lines 1255-1256 of
provider_create_fields.json): the api_base field's `default_value` was
hard-coded to "https://generativelanguage.googleapis.com/v1beta". This:
1. Bakes v1beta into every credential record saved through the form,
even when the user never touched the field. If LiteLLM's internal
gemini default URL ever changes, those persisted credentials keep
hitting the stale path.
2. Bypasses `_get_gemini_url`'s automatic version routing for Gemini 3+
models. That helper picks v1alpha for Gemini 3+ and v1beta for older
models when api_base is unset. With the default pre-filled (and
`_check_custom_proxy` then taking over because api_base is non-empty),
Gemini 3+ requests get pinned to v1beta and may fail or behave
unexpectedly — purely because the user accepted the visible default.
Fix: set `default_value` to `null` and move the canonical URL guidance
into the `placeholder` (visible to the user, never persisted) and an
expanded tooltip. UX is unchanged — the URL is still shown in the
greyed-out input — but the auto-version-routing path stays default.
Updated test_google_ai_studio_provider_fields_expose_api_base to assert
the new contract (`default_value is None`, `placeholder` carries the
canonical URL), with a comment pointing at the Greptile threads as the
rationale so future contributors don't accidentally re-introduce the
default.
26/26 tests in the file pass. JSON validates (`json.load` clean).
* feat(azure_ai): add gpt-5.5 to model cost map (#30428)
* feat(azure_ai): add gpt-5.5 to model cost map
Adds azure_ai/gpt-5.5 and its dated snapshot azure_ai/gpt-5.5-2026-04-23 to
both the canonical and bundled cost maps. gpt-5.5 is generally available on
Azure AI Foundry; pricing mirrors the openai gpt-5.5 entry, matching the
established azure_ai convention (verified identical for gpt-5.4), in the
azure tier structure (base / above-272k / priority).
supports_minimal_reasoning_effort is false, the capability that changed from gpt-5.4.
Fixes #30306
* Update tests/test_litellm/test_gpt_5_5_model_metadata.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
---------
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* fix: guard check_and_fix_namespace against None key (#30435)
* fix: guard check_and_fix_namespace against None key
When user_id is None, the cache key can be None, causing
AttributeError: 'NoneType' object has no attribute 'startswith'
in check_and_fix_namespace.
Add an early return for None key to prevent the error and the
ERROR-level log noise it produces on every unauthenticated request.
Fixes #30424
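A rough standalone sketch of the early-return guard. The real helper is a method with its own namespace handling; the signature, the `namespace:key` format, and the Optional annotations here are assumptions made for illustration.

```python
from typing import Optional


def check_and_fix_namespace(key: Optional[str], namespace: Optional[str]) -> Optional[str]:
    """Prefix `key` with `namespace` unless it is already prefixed (sketch)."""
    # Early return: a None key would otherwise hit .startswith and raise
    # AttributeError: 'NoneType' object has no attribute 'startswith'.
    if key is None:
        return None
    if namespace is not None and not key.startswith(namespace):
        return f"{namespace}:{key}"
    return key
```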
* fix: update type annotations for check_and_fix_namespace
- key: str -> Optional[str] (now handles None input)
- return: str -> Optional[str] (returns None when input is None)
Addresses Greptile review concern about type signature mismatch.
* fix: revert check_and_fix_namespace type signature to str to fix MyPy downstream errors
* fix: update type annotations for check_and_fix_namespace
- Change signature from str -> str to Optional[str] -> Optional[str]
- Remove type: ignore comment on None return
- Add None guard in async_set_cache_sadd before passing to helper
Addresses review feedback from Sameerlite on type mismatch.
* Revert "fix: update type annotations for check_and_fix_namespace"
This reverts commit 5272920fa0daab676f5ad46dcadd8cd537cfc96f.
---------
Co-authored-by: michaelxer <michaelxer@users.noreply.github.com>
* fix(cost): apply service_tier suffix to above-threshold cache rates and expose priority+threshold keys in ModelInfo (#30450)
* fix(cost): apply service_tier suffix to above-threshold cache rates and expose priority+threshold keys in ModelInfo
Models that publish both a service_tier (e.g. priority) rate and an above-threshold tier (e.g. _above_200k_tokens) currently bill cached tokens at the standard above-threshold rate rather than the priority above-threshold rate. Affected entries in the live pricing JSON include gemini-3-pro-preview, gemini-3.1-pro-preview and their vertex_ai/ and gemini/ variants, plus azure/gpt-5.4 and azure_ai/gpt-5.4. For a 250K-token priority request with 200K cached tokens against gemini-3-pro-preview, the leak is about 44 percent of the prompt cost.
Two stacked defects caused this. First, ModelInfoBase (and the ModelInfo pydantic class) and the get_model_info construction in litellm/utils.py omit the priority+above-threshold cost keys, so even if the calculator asked for them they would never reach it. Second, in _get_token_base_cost the cache_creation/cache_read tiered keys never get wrapped with _get_service_tier_cost_key, while the input/output tiered keys above and below do. The change here surfaces six new keys (input, output and cache_read at both 200k and 272k priority variants) and wraps the three cache tiered keys in _get_token_base_cost the same way input/output already are. _get_cost_per_unit's existing service_tier-to-base fallback covers models that ship the standard above-threshold rate without a priority variant.
Adds one regression test in tests/test_litellm/litellm_core_utils/llm_cost_calc/test_llm_cost_calc_utils.py that drives the actual generic_cost_per_token path for gemini-3-pro-preview at 200K cached + 50K text under priority and asserts the priority above_200k rates are picked. Verified the test fails on litellm_internal_staging without these changes and passes with them.
* fix(cost): drop guard on cache tiered keys so service_tier fallback can reach standard above-threshold rate
Addresses Greptile P1 on PR 30450. The previous commit wrapped cache_creation_tiered_key, cache_creation_1hr_tiered_key, and cache_read_tiered_key with _get_service_tier_cost_key (matching how the sibling input and output tiered keys are wrapped) but kept the surrounding 'if key in model_info' guards. For models that publish a standard above-threshold cache rate but no priority variant (gpt-5.4-pro, gpt-5.5-pro and their dated siblings, plus vertex_ai/claude-sonnet-4-5 for cache_creation), the guard short-circuits before _get_cost_per_unit's existing service_tier-to-base fallback can strip _priority and find the standard above-threshold key. The result on priority requests over the threshold was that those models silently dropped from the above-threshold rate back to the priority-base rate. Dropping the guard and calling _get_cost_per_unit unconditionally (mirroring how tiered_input_key and tiered_output_key are already handled) restores correct billing for that class of models while keeping the new priority+above-threshold behaviour for gemini-3-pro-preview and friends.
Adds a second regression test that pins generic_cost_per_token for vertex_ai/claude-sonnet-4-5 priority + above_200k with cached and cache_creation tokens to the expected standard above-threshold rates, so the guard cannot be silently reintroduced for either the cache_read or cache_creation path.
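To illustrate why the guard defeated the fallback, here is a simplified sketch. The helper names, the key spelling, and the rates are illustrative assumptions, not the real calculator; only the shape of the logic (suffix the tiered key with the service tier, then fall back to the unsuffixed key) follows the description above.

```python
from typing import Optional

TIERED_KEY = "cache_read_input_token_cost_above_200k_tokens"  # assumed spelling


def get_cost_per_unit(model_info: dict, cost_key: str) -> Optional[float]:
    """Look up cost_key; if missing, strip a service-tier suffix and retry."""
    value = model_info.get(cost_key)
    if value is not None:
        return float(value)
    for suffix in ("_priority", "_flex"):
        if cost_key.endswith(suffix):
            base = model_info.get(cost_key[: -len(suffix)])
            if base is not None:
                return float(base)
    return None


def tiered_cache_read_rate(
    model_info: dict, service_tier: Optional[str], current_rate: float, guarded: bool
) -> float:
    key = f"{TIERED_KEY}_{service_tier}" if service_tier else TIERED_KEY
    # The guard returns early when the priority variant is absent,
    # so the suffix-stripping fallback is never reached.
    if guarded and key not in model_info:
        return current_rate
    rate = get_cost_per_unit(model_info, key)
    return current_rate if rate is None else rate
```

With a model that publishes only the standard above-threshold rate, the guarded variant stays on the priority-base rate while the unguarded variant reaches the standard above-threshold rate.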
* fix(presidio): skip pre-call masking when guardrail is logging_only (#30461)
The Presidio pre-call hook masked the live request unconditionally, ignoring
the configured event hook. With mode: logging_only the masked request reached
the model, so its response echoed anonymization tokens (e.g. <PERSON>) instead
of the real output. Gate async_pre_call_hook on should_run_guardrail, matching
every other guardrail; logging_only masking still happens via async_logging_hook.
* fix(router): resolve list unhashable crash on model alias (#30464)
* fix(router): resolve list unhashable crash on model alias
Fixes the fallback parsing logic that mistakenly categorized standard array fallback definitions as override dictionaries when a deployment alias matches the literal string 'model'.
Closes https://github.com/BerriAI/litellm/issues/30459
* fix(router): address greptile review for fallback parsing edge cases
- Resolves ambiguity in standard vs override fallback dictionaries by iterating over all items and validating that no mapped litellm param resolves to a non-list type.
- Adds regression tests in test_router_order_fallback.py to prevent unhashable type crash from silently re-entering the codebase.
* chore(router): format code with black to pass CI
* fix(hosted_vllm): remove thinking_blocks and convert list content to strings (#30475)
* fix: hosted_vllm remove thinking_blocks and convert list content to strings
vLLM endpoints reject assistant messages with thinking_blocks converted
to content list blocks. This change removes thinking_blocks entirely
and converts any list content back to strings.
This fixes BadRequestError when using Claude Code with hosted_vllm
models that pass thinking_blocks in messages.
* fix(hosted_vllm): address Greptile review feedback
- Join multiple text blocks with newline instead of empty string
- Always set content to string (never None) to avoid vLLM validation errors
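A minimal sketch of the cleaning step as described at this point in the series (later commits in this log adjust how structured content is handled). The function name and the exact block shape are assumptions.

```python
def clean_assistant_message(message: dict) -> dict:
    """Drop thinking_blocks and flatten list content to a string (sketch)."""
    if message.get("role") != "assistant":
        return message
    cleaned = {k: v for k, v in message.items() if k != "thinking_blocks"}
    content = cleaned.get("content")
    if isinstance(content, list):
        texts = [
            block.get("text", "")
            for block in content
            if isinstance(block, dict) and block.get("type") == "text"
        ]
        # Join multiple text blocks with a newline, not an empty string.
        cleaned["content"] = "\n".join(texts)
    elif content is None:
        # Always send a string, never None.
        cleaned["content"] = ""
    return cleaned
```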
* fix(hosted_vllm): update chat transformation to clean assistant messages
* fix: re-raise exception instead of silently dropping MCP team permissions (#30477)
* fix: re-raise exception instead of silently dropping MCP team permissions
When MCPRequestHandler.get_allowed_mcp_servers raises, the broad
except was swallowing the error and returning only allow_all_server_ids,
silently discarding all team-level object_permission grants.
Fixes #30476
* fix: log full traceback when MCP permission lookup fails
Uses verbose_logger.exception() instead of warning() so operators
can see the full traceback when team-level object_permission grants
are dropped due to an internal error in get_allowed_mcp_servers.
Fixes #30476
* fix: remove timezone date expansion in daily-activity aggregation (#29569)
* fix: remove timezone date expansion in daily-activity aggregation
Single-day spend queries from non-UTC timezones over-counted by ~2x
because the previous implementation widened the SQL date range by a
full UTC day on whichever side the offset pointed. Spend is bucketed
in whole-UTC-day rows in LiteLLM_DailyUserSpend, so the expansion
pulled an extra 24h of unrelated bucket data per boundary.
Concretely on IST (UTC+5:30, offset -330): a single-day query for
2026-05-29 was rewritten to date >= 2026-05-28 AND date <= 2026-05-29
and returned spend across both UTC days. Sums of single-day queries
across a 5-day window then exceeded the equivalent multi-day aggregate
by ~50%, which is mathematically impossible.
Treat the local date range as the UTC date range. The aggregation
table has no hour-level granularity, so any conversion using only
date arithmetic must round to whole UTC days; the previous fix turned
that boundary slop into systematic over-counting. Pass-through trades
a small one-time slop at each end of the range for correct, monotonic,
additive results across single-day and multi-day queries.
Repro from production: bedrock/global.anthropic.claude-opus-4-8 over
2026-05-29 to 2026-06-02, IST timezone:
- 5-day aggregate: $701.39 / 1,831 reqs
- Sum of 5 single-day queries: $1,070.94 / 2,755 reqs
- Ratio was 1.527x; the two now match within boundary slop
Adds regression tests in TestAdjustDatesForTimezone and
TestBuildAggregatedSqlQuery that pin the pass-through behavior and
the additivity invariant for any future implementation.
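The additivity argument can be checked with a toy model of whole-UTC-day buckets. This is a sketch under assumptions: the function names are hypothetical, the bucket values are made up, and the widening direction for positive offsets is inferred from "whichever side the offset pointed" rather than stated explicitly.

```python
from datetime import date, timedelta
from typing import Callable, Dict, Tuple

Adjust = Callable[[date, date, int], Tuple[date, date]]


def widened_range(start: date, end: date, offset_minutes: int) -> Tuple[date, date]:
    """Old behaviour as described: widen by one UTC day on the offset's side."""
    if offset_minutes < 0:
        return start - timedelta(days=1), end
    if offset_minutes > 0:
        return start, end + timedelta(days=1)
    return start, end


def passthrough_range(start: date, end: date, offset_minutes: int) -> Tuple[date, date]:
    """New behaviour: treat the local date range as the UTC date range."""
    return start, end


def spend(buckets: Dict[date, float], start: date, end: date) -> float:
    return sum(amount for day, amount in buckets.items() if start <= day <= end)


def sum_of_single_days(
    buckets: Dict[date, float], start: date, end: date, offset: int, adjust: Adjust
) -> float:
    total, day = 0.0, start
    while day <= end:
        lo, hi = adjust(day, day, offset)
        total += spend(buckets, lo, hi)
        day += timedelta(days=1)
    return total
```

With a flat 100 per day and offset -330, the widened single-day queries each pull two buckets, so their sum doubles, while the pass-through sum equals the multi-day aggregate.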
* ci: rerun checks on litellm_oss_branch base
---------
Co-authored-by: Sameer Kankute <sameer@berri.ai>
* fix: buffer native gemini sse frames (#30225)
* fix: buffer native gemini sse frames
* fix: scope native gemini sse buffering
* fix: check raw sse residual buffer size
* feat: updated openrouter provider to map max level to xhigh (#28881)
* feat(proxy): allow use_redis_transaction_buffer without redis cache (#28764)
* feat(proxy): allow use_redis_transaction_buffer without redis cache
* fix(proxy): require host or url for standalone buffer redis
* fix(mcp): fail closed when scope filter resolves to no servers (#30353)
`_get_allowed_mcp_servers_from_mcp_server_names` returned the caller's full
allowed-server set when the requested `mcp_servers` list (path- or
header-derived) resolved to nothing. URL/header namespacing therefore
appeared to work even when the requested name was unknown or the caller had
no grant — `/mcp/<typo>/` silently exposed every server the key could reach.
Fail closed instead: when `mcp_servers` is explicitly provided but nothing
resolves, return an empty list. The `mcp_servers=None` path (no scope
requested) keeps its existing behavior.
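A simplified sketch of the fail-closed rule. The real helper resolves names to server ids; here both sides are plain name lists and the function name is hypothetical.

```python
from typing import List, Optional


def filter_allowed_servers(allowed: List[str], requested: Optional[List[str]]) -> List[str]:
    """Return the servers a caller may reach under an optional scope filter."""
    if requested is None:
        # No scope requested: existing behaviour, the full allowed set.
        return list(allowed)
    wanted = set(requested)
    # Scope requested: only the intersection. An unknown or ungranted name
    # yields an empty list rather than falling back to everything.
    return [server for server in allowed if server in wanted]
```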
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix(token-counter): handle Anthropic tool_reference blocks to stop dropped spend logs (#30302)
* fix(token-counter): handle Anthropic tool_reference blocks to stop dropped spend logs
`token_counter` did not know about Anthropic tool-search `tool_reference`
content blocks, a lightweight pointer to a deferred tool that shows up as
`{"type": "tool_reference", "tool_name": ...}`. When such a block appeared in
message content, `_count_content_list` fell through to its catch-all branch and
raised `Invalid content item type: tool_reference`.
On the streaming `anthropic_messages` proxy path that exception nulls
`response_cost`, which makes the proxy drop the entire SpendLogs row. The result
is a silent cost undercount on any tool-search traffic; the request succeeds for
the caller but the spend is never recorded.
This adds a `tool_reference` branch that counts the referenced `tool_name` (the
full tool definition is already counted via the `tools` param, so only the name
is added here) and handles an empty/missing name gracefully. The catch-all error
message is updated to list `tool_reference` among the expected types.
A regression test asserts that a message containing a `tool_reference` block no
longer raises and returns a positive token count, and that an empty `tool_name`
is handled without error.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(token-counter): collapse explicit None tool_name to empty string
In _count_content_list, c.get("tool_name", "") returns None when the
key is present with an explicit None value, and str(None) == "None"
which is truthy, causing a spurious token to be counted. Use
c.get("tool_name") or "" so both a missing key and an explicit None
collapse to an empty string and are skipped.
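The difference between the two lookups is plain Python dict behaviour and can be shown directly (the block contents are illustrative):

```python
block = {"type": "tool_reference", "tool_name": None}

# The default only applies when the key is missing, so an explicit None
# survives and str() turns it into the truthy string "None".
naive = str(block.get("tool_name", ""))

# `or ""` collapses both a missing key and an explicit None to "".
fixed = block.get("tool_name") or ""
missing = {"type": "tool_reference"}.get("tool_name") or ""
```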
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(token-counter): cover catch-all for unknown content block type
Adds a regression test that calls `_count_content_list` with an unrecognized
content block type and asserts it raises `ValueError` whose message names the
offending type and lists `tool_reference` among the supported types. This
exercises the previously uncovered catch-all branch (codecov patch gap) and
pins the error contract.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(token-counter): cover tool_reference on the spend/cost and streaming paths
Adds end-to-end regression tests that exercise the real public entry points
(`completion_cost` and `stream_chunk_builder`), not just the private
`_count_content_list` helper, for Anthropic tool-search `tool_reference`
content blocks.
These pin the actual bug the fix addresses: before the fix the `tool_reference`
block raised out of `completion_cost` -> the proxy logging layer nulled
`response_cost` and the spend callback dropped the SpendLogs row (silent cost
undercount on all tool-search traffic); and `stream_chunk_builder` swallowed the
same raise and collapsed prompt_tokens to 0. With the fix, cost is positive and
prompt_tokens are counted. Verified: 3 fail without the fix, 3 pass with it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(cost): add cost mapping for deepseek-v4-flash and deepseek-v4-pro (#27056)
* feat(cost): add cost mapping for deepseek-v4-flash and deepseek-v4-pro
Adds pricing entries for the two new DeepSeek V4 models released on
2026-04-24, for both bare model names and the deepseek/ provider prefix.
Prices sourced from https://api-docs.deepseek.com/quick_start/pricing:
- deepseek-v4-flash: $0.14/M input, $0.28/M output
- deepseek-v4-pro: $1.74/M input, $3.48/M output
Cache hit price set to 1/10 of input (per DeepSeek docs).
Context window: 1M tokens for both models.
Closes #26709
* fix(cost): update backup registry for deepseek-v4
* style: remove print statement from deepseek-v4 test
* fix: update deepseek-v4 prices to active discounted rates
* test: update deepseek-v4 prices in tests to match active discounted rates
* fix(deepseek): remove duplicate entries and update backup registry to active discounted rates
* fix: update max_output_tokens to 384K for deepseek-v4
* fix: correctly restore upstream models accidentally dropped during merge
* fix(tests): resolve failing claude-fable-5 and reasoning tests by safely updating cost map
- Pulled the latest cost map from upstream staging
- Safely appended deepseek-v4 mapping without deleting duplicate keys or formatting via json.dump
* fix(tests): correct deepseek model cache prices and update JSON schema
- Appended both prefixed and bare deepseek-v4 models to satisfy test assertions
- Corrected deepseek-v4-pro expected cache hit and token prices based on latest review updates
- Added missing realtime endpoint to test_utils.py INTENDED_SCHEMA
* fix: remove accidental azure/gpt-realtime-whisper addition
---------
Co-authored-by: Dushyant Acharya <dushyantacharya@Dushyants-MacBook-Pro.local>
* feat(key/info): expose per-model budget usage in /key/info response (#30394)
* feat(key/info): expose per-model budget usage in /key/info response
Add model_max_budget_usage to /key/info and /v2/key/info responses.
For each model in model_max_budget, reads current-period spend from
the same DualCache used by the budget enforcer and returns it alongside
the limit and time period so callers can see how much of each model
budget has been consumed in the active window.
* test(key/info): add coverage for model_max_budget_usage in v1 and v2 endpoints
Add tests for the model_max_budget_usage enrichment in both info_key_fn
and info_key_fn_v2, covering the budget-present path, the empty-budget
path, and the v2 batch endpoint.
* fix(key/info): source model_max_budget current_spend from SpendLogs instead of DualCache
The DualCache used for enforcement is ephemeral and only populated when budget metadata
is present at request time. Fall back to a direct LiteLLM_SpendLogs DB aggregation
using the budget period window (budget_reset_at - budget_duration) for accurate reporting.
Also fall back to litellm_budget_table.model_max_budget when the key's top-level field
is empty, and round current_spend to 4 decimal places.
* test(key/info): cover remaining branches in model_max_budget_usage helpers
Add unit tests for: prisma_client=None early return, DB query exception swallowing,
invalid budget_duration handled by _compute_budget_period_start, budget_reset_at
received as a datetime object (Prisma native type), max_seconds=0 early return, and
skipping models that lack a budget_duration. Also remove an unreachable except branch
where fromisoformat would fail after _compute_budget_period_start already validated the
same value.
* test(key/info): cover except path for unparseable per-model budget_duration
* fix(key/info): compute per-model rolling windows in model_max_budget_usage
Each model in model_max_budget now gets its own time window derived from
its own budget_duration, rather than sharing a single window computed as
the max (or the budget table's reset_at). This matches what the DualCache
enforcer actually tracks and prevents current_spend from being inflated
for models with shorter windows.
_query_model_spend_for_period is refactored to accept a model filter
(handling provider-prefix variants in SQL) and return a float directly.
_compute_budget_period_start and the budget_table window path are removed
as they are no longer needed.
* refactor(model_max_budget_limiter): remove dead get_current_period_spend method
* refactor(key/info): strip synthetic formatter noise from PR diff
Restore key_management_endpoints.py and test_key_management_endpoints.py
to origin/litellm_internal_staging, then re-apply only the intentional
additions: _query_model_spend_for_period, _build_model_max_budget_usage,
the two endpoint patches (info_key_fn / info_key_fn_v2), and the new
test suite. The previous commits had reformatted ~300 pre-existing lines
across both files, making the functional diff unreadable.
* test(key/info): cover empty-rows path in _query_model_spend_for_period
* fix(model_max_budget_limiter): guard BudgetConfig construction inside try/except
A malformed model entry in the DB (e.g. non-numeric max_budget from a
manually edited or migrated row) caused BudgetConfig(**budget_info) to
raise a Pydantic ValidationError outside any exception guard, surfacing
as a 500 for the entire /key/info or /v2/key/info call. Merging both
try/except blocks into one ensures bad entries are silently skipped,
consistent with the existing duration_in_seconds guard.
* fix: don't stack provider prefix on wildcard models with a custom prefix (#30360)
* fix: don't stack provider prefix on wildcard models with a custom prefix
get_known_models_from_wildcard expanded provider-prefixed model ids (e.g.
"ollama/gemma3:1b" from get_provider_models) by prepending the wildcard's
prefix whenever the id did not already start with it. With a custom wildcard
prefix such as "ollama_server1/*" (used to distinguish multiple Ollama
instances), this produced "ollama_server1/ollama/gemma3:1b", which is
uncallable and breaks /v1/models.
When the expanded id already carries a provider prefix, replace it with the
wildcard's prefix instead of stacking both. Matching-prefix and bare-model
cases are unchanged.
Fixes #30358
* fix: only strip a known provider prefix when expanding custom wildcard prefixes
The wildcard expansion replaced the leading slash segment of every expanded id with the wildcard prefix whenever the id did not already start with it. For ids whose first segment is an org rather than a litellm provider (for example a provider returning "meta-llama/Llama-3-8B" with no outer provider prefix), that dropped the org and produced an uncallable id.
Only strip the leading segment when it is a recognized provider (membership in LlmProviders); otherwise keep it and just prepend the wildcard prefix. Provider-prefixed ids like "ollama/gemma3:1b" still have their prefix replaced, so the original fix is unchanged for known providers.
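The resulting rule can be sketched as below. `KNOWN_PROVIDERS` is a small hypothetical stand-in for membership in LlmProviders, and the function name is an assumption; this is not the body of get_known_models_from_wildcard.

```python
KNOWN_PROVIDERS = {"ollama", "openai", "anthropic"}  # stand-in for LlmProviders


def expand_wildcard_model(wildcard_prefix: str, model_id: str) -> str:
    """Map an expanded model id onto a wildcard's prefix (sketch)."""
    if model_id.startswith(f"{wildcard_prefix}/"):
        return model_id  # already carries the wildcard's prefix
    head, sep, rest = model_id.partition("/")
    if sep and head in KNOWN_PROVIDERS:
        # Replace a recognized provider prefix instead of stacking both.
        return f"{wildcard_prefix}/{rest}"
    # Bare model, or first segment is an org: keep it and prepend the prefix.
    return f"{wildcard_prefix}/{model_id}"
```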
* address greptile review feedback: log dropped non-text vLLM assistant content blocks (greploop iteration 1)
* fix(ci): format credential_form_helpers test + regenerate dashboard schema.d.ts
* fix(proxy): raise litellm.BadRequestError for missing model param
When no model is passed, route_request now raises a litellm.BadRequestError
('Missing model parameter') instead of falling through to ProxyModelNotFoundError.
This keeps the missing-param error clear and independent of router wildcard
state. Unknown (non-empty) model names still raise ProxyModelNotFoundError.
* Revert "fix(proxy): raise litellm.BadRequestError for missing model param"
This reverts commit 9240da403c0432a80473d6c4677ddb7e2bad7420.
* Revert "fix(router): clean pattern_router state on upsert/delete (#29601)"
This reverts commit ad4e6e2395620ea6d2fe38089a54cde160720de2.
* fix: correct streaming and key budget usage reporting
* fix(hosted_vllm): type assistant tool_calls to satisfy mypy
* feat: aws secret manager cross region replication (#30368)
* feat(aws-secret-manager): add replica_regions cross-region replication after CreateSecret
When store_virtual_keys is enabled, async_write_secret() only wrote secrets
to the primary AWS region. Multi-region proxy deployments had no built-in
way to synchronize virtual key secrets across regions through LiteLLM,
requiring external replication mechanisms.
Add replica_regions support to AWSSecretsManagerV2:
- New replica_regions field in KeyManagementSettings (types/secret_managers/main.py)
- New async_replicate_secret() method that calls ReplicateSecretToRegions API
- async_write_secret() calls replication after successful CreateSecret
- Replication failure is logged as a warning but does NOT fail key creation
- load_aws_secret_manager() forwards replica_regions from key_management_settings
Configuration example:
key_management_settings:
  store_virtual_keys: true
  replica_regions:
    - us-west-2
    - eu-west-1
When replica_regions is omitted or empty, behavior is unchanged.
* test(aws-secret-manager): restore litellm.secret_manager_client after test to prevent state pollution
* test(aws-secret-manager): add coverage for HTTP error and replication exception paths
* fix: restore litellm.secret_manager_client global state in test; add replication log proof
- Global state in test_load_aws_secret_manager_passes_replica_regions was
already guarded with try/finally (committed in previous pass); no further
change needed for Fix 1.
- Fix 2: add verbose_logger.info("ReplicateSecretToRegions called …") inside
async_replicate_secret so callers get an observable INFO log line whenever
replication fires.
- Add test_replication_fires_on_create: calls async_replicate_secret directly
with caplog.at_level(INFO, logger="LiteLLM") and asserts "ReplicateSecretToRegions"
appears in the captured log output, proving the code path executes.
* fix: pass request to streaming generators
* fix(hosted-vllm): preserve assistant structured content
* fix(hosted_vllm): satisfy mypy on preserved structured content assignment
* chore: resolve litellm_internal_staging merge conflicts for #30527 (#30554)
* chore(codecov): add Batches, Videos, and Realtime components (#30517)
* chore(codecov): add Batches, Videos, and Realtime components
Define per-feature Codecov components so PR comments track coverage
for batch API, video generation, and realtime streaming paths.
Co-authored-by: Cursor <cursoragent@cursor.com>
* chore(codecov): use wildcard path for Batches proxy component
Align batches_endpoints glob with Videos, Realtime, and Proxy_Authentication.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(batches): move orphan tests into tests/test_litellm for CI coverage (#30510)
Four batch-related tests lived under tests/litellm/ and were never picked
up by GitHub Actions. Relocate them and fix gemini multimodal e2e to use
the batchEmbedContents path expected for gemini/ provider.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(guardrails): run pre_call hook once for model-level guardrails (#30543)
* fix(guardrails): run pre_call hook once for model-level guardrails
A CustomGuardrail attached to a deployment via litellm_params.guardrails
gets its async_pre_call_hook invoked twice per request: once by the proxy
pre-call loop and again by async_pre_call_deployment_hook after the router
spreads the model-level guardrails into the top-level request kwargs.
Record in request metadata that the proxy pre-call loop already ran a given
guardrail, and have the deployment hook skip it when the marker is present.
Direct-SDK usage never runs the proxy loop, so the deployment hook stays the
sole invocation there and still fires exactly once.
The marker key is stripped from untrusted caller metadata so a request body
cannot suppress a model-only guardrail by pre-seeding it.
* fix(guardrails): mark pre_call dedup on the post-hook request data
Record the exactly-once marker after async_pre_call_hook runs, on the data
object that flows downstream, rather than before it. A guardrail whose hook
returns a brand-new request dict (instead of mutating or spreading the one it
received) would otherwise discard the marker, letting the deployment hook
re-run the guardrail a second time.
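The exactly-once logic from the two commits above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: `MARKER_KEY` and all function names are hypothetical, and the stand-in hook deliberately returns a brand-new dict to show why the marker is written after the hook runs.

```python
from typing import Any

# Hypothetical metadata key; the real marker name is not given in the commit.
MARKER_KEY = "_pre_call_guardrails_already_run"


def strip_untrusted_marker(data: dict[str, Any]) -> dict[str, Any]:
    """Drop a caller-supplied marker so a request body cannot pre-seed it."""
    metadata = dict(data.get("metadata") or {})
    metadata.pop(MARKER_KEY, None)
    return {**data, "metadata": metadata}


def run_guardrail(name: str, data: dict[str, Any], calls: list[str]) -> dict[str, Any]:
    """Stand-in hook that returns a new dict instead of mutating its input."""
    calls.append(name)
    return {"model": data["model"], "metadata": dict(data.get("metadata") or {})}


def proxy_pre_call(name: str, data: dict[str, Any], calls: list[str]) -> dict[str, Any]:
    data = run_guardrail(name, strip_untrusted_marker(data), calls)
    # Record the marker AFTER the hook, on the object that flows downstream.
    ran = set(data["metadata"].get(MARKER_KEY, []))
    ran.add(name)
    data["metadata"][MARKER_KEY] = sorted(ran)
    return data


def deployment_pre_call(name: str, data: dict[str, Any], calls: list[str]) -> dict[str, Any]:
    if name in (data.get("metadata") or {}).get(MARKER_KEY, []):
        return data  # the proxy loop already ran this guardrail
    return run_guardrail(name, data, calls)
```

In the direct-SDK case no marker is ever written, so `deployment_pre_call` remains the single invocation.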
* fix(guardrails): stop re-initializing DB guardrails on every poll (#30542)
* fix(guardrails): stop re-initializing DB guardrails on every poll
InMemoryGuardrailHandler._has_guardrail_params_changed compared the
in-memory LitellmParams against the raw dict loaded from the DB. The
in-memory side carries every field default and coerces enums via
model_dump(), while the DB side only holds the keys originally stored,
so the two shapes never compared equal and the guardrail was rebuilt on
every poll cycle.
Each rebuild created a fresh instance, but delete_in_memory_guardrail
only removed the old callback from litellm.callbacks. Request handling
promotes guardrail callbacks into the success/failure/async lists, so
the previous instance stayed referenced there and instances accumulated.
Normalize both sides through LitellmParams(...).model_dump() before
diffing, and purge the callback from every callback list on delete.
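To illustrate why the raw comparison never matched, here is a small sketch that uses a stdlib dataclass as a hypothetical stand-in for `LitellmParams` (the real class is a Pydantic model; `Params`, `Mode`, and the field names are assumptions made for the example).

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any, Optional


class Mode(str, Enum):
    PRE_CALL = "pre_call"
    POST_CALL = "post_call"


@dataclass
class Params:
    """Hypothetical stand-in for LitellmParams: defaults plus an enum field."""

    guardrail: str
    mode: Mode = Mode.PRE_CALL
    default_on: bool = False
    api_base: Optional[str] = None

    def __post_init__(self) -> None:
        # Coerce raw strings to the enum, as a validating model would.
        self.mode = Mode(self.mode)


def normalize(raw: dict[str, Any]) -> dict[str, Any]:
    """Push a sparse DB row through the model so defaults are filled in."""
    return asdict(Params(**raw))


def has_changed(in_memory: Params, db_row: dict[str, Any]) -> bool:
    # Both sides now have the same shape, so equality is meaningful.
    return asdict(in_memory) != normalize(db_row)
```

Comparing `asdict(in_memory)` directly with the sparse row always reports a change, because the row lacks the defaulted keys.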
* refactor(guardrails): narrow params-normalization fallback to ValidationError
The comparison normalizer caught a bare Exception and silently fell back
to the raw dict, which hid the cause and quietly degraded the affected
guardrail back to re-initializing on every poll. Catch only the
ValidationError that LitellmParams construction can raise, log a warning
so the offending row is diagnosable, and let any other error surface
instead of being swallowed.
* refactor(callbacks): add remove_callback_from_all_lists helper to manager
Move the knowledge of which callback lists a callback can be promoted
into out of the guardrail registry and into LoggingCallbackManager, where
the rest of the callback-list bookkeeping already lives. delete_in_memory_guardrail
now delegates to the new helper instead of iterating the lists itself.
* chore(oss): litellm oss staging 150626 (#30463)
* fix(pricing): add GitHub Copilot MAI Code Flash pricing (#30415)
* fix(pricing): add GitHub Copilot MAI Code Flash pricing
Add GitHub Copilot pricing entries for MAI-Code-1-Flash and the internal Copilot CLI model name so cost calculation can price input, cached input, and output tokens.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* test(pricing): cover GitHub Copilot MAI Code Flash pricing
Add regression coverage for both GitHub Copilot MAI-Code-1-Flash model names, including cached input pricing, chat endpoint metadata, and cost_per_token arithmetic.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix(router/proxy): propagate completed_response through FallbackResponsesStreamWrapper for streaming /v1/responses container ownership (#30210) (#30213)
* fix(router/proxy): propagate completed_response through FallbackResponsesStreamWrapper for streaming /v1/responses container ownership (#30210)
#28990 added ownership recording for streaming /v1/responses via
_wrap_responses_stream_for_container_ownership, which reads
`getattr(stream_response, 'completed_response', None)` to extract the
ResponsesAPIResponse. The unit test bypassed the Router, so it never
exercised the production wrapping path.
Through the Router (every proxy deployment), the stream is wrapped by
FallbackResponsesStreamWrapper (router.py:2527). Its __init__ set
`self.completed_response = None` and __anext__ only forwarded chunks
— the inner source iterator's terminal event never bubbled up to the
attribute the ownership hook reads, so the hook silently recorded
nothing and every follow-up /v1/containers/<id>/files call returned
403 for non-admin keys.
This commit:
- router.py: pre-resolves the responses-API terminal event tuple
(response.completed / .incomplete / .failed) once per
_aresponses_streaming_iterator call, and has the wrapper's __anext__
sniff each forwarded chunk's .type. First terminal event hit gets
stored on the wrapper's completed_response. Iterator-agnostic — works
for source_iterator AND any future wrapper.
- common_request_processing.py: when _extract_completed_responses_response
returns None we now warn instead of silently skipping. Reporter on
#30210 lost a day to this exact silent skip; the warning surfaces
future regressions of the same shape directly in operator logs.
Fixes #30210
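A rough sketch of the chunk-sniffing idea, assuming a simplified event object with a `.type` attribute. `SniffingStreamWrapper`, `Event`, and `fake_stream` are hypothetical names; only the three terminal event types are taken from the commit message, and the delta event type is illustrative.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, AsyncIterator, Optional

# Terminal event types named in the commit message.
TERMINAL_TYPES = ("response.completed", "response.incomplete", "response.failed")


@dataclass
class Event:
    type: str
    response: Optional[dict[str, Any]] = None


class SniffingStreamWrapper:
    """Forwards every chunk and remembers the first terminal event it sees."""

    def __init__(self, source: AsyncIterator[Event]) -> None:
        self._source = source
        self.completed_response: Optional[Event] = None

    def __aiter__(self) -> "SniffingStreamWrapper":
        return self

    async def __anext__(self) -> Event:
        chunk = await self._source.__anext__()
        is_terminal = getattr(chunk, "type", None) in TERMINAL_TYPES
        if self.completed_response is None and is_terminal:
            self.completed_response = chunk
        return chunk


async def fake_stream() -> AsyncIterator[Event]:
    yield Event("response.output_text.delta")
    yield Event("response.completed", {"id": "resp_1"})


async def consume() -> SniffingStreamWrapper:
    wrapper = SniffingStreamWrapper(fake_stream())
    async for _ in wrapper:
        pass
    return wrapper
```

Because the wrapper inspects what it forwards rather than reaching into the source iterator, it does not depend on which iterator type sits underneath.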
* fix(router): type-ignore wrapper getattr-defaults; broaden ownership-skip warning
CI lint (mypy) flagged the three pre-existing getattr(..., None) assignments
in FallbackResponsesStreamWrapper.__init__:
router.py:2564 self.response = getattr(source_iterator, 'response', None)
router.py:2565 self.model = getattr(source_iterator, 'model', None)
router.py:2566 self.logging_obj = getattr(..., None)
Those lines also exist on litellm_internal_staging and pass mypy there.
Adding the typed terminal-event tuple above the class made the function
body more narrowable, which surfaced the pre-existing mismatch — base
class declares non-Optional types but the bridge path
(LiteLLMCompletionStreamingIterator) legitimately omits these. Keep
the None fallback and silence with type: ignore[assignment].
Greptile 4/5 note: the ownership-skip warning hard-named code_interpreter
which misleads operators when a non-code_interpreter stream aborts.
Generalize to 'any tool container (e.g. code_interpreter)'.
* fix(register_model): drop synthesized zero costs to preserve sparse entries (#30198) (#30201)
* fix(register_model): drop synthesized zero costs to preserve sparse entries (#30198)
get_model_info synthesizes input_cost_per_token / output_cost_per_token = 0
when they are absent from the raw entry (the price-unknown and free cases
share the same representation). register_model then merges that result back
into litellm.model_cost, which flips a sparse entry from 'no cost keys'
(priced via model name) to 'cost keys = 0' (free).
That defeats _is_cost_explicitly_configured (#24949) on re-registration:
_is_model_cost_zero returns True, common_checks skips every tag / key /
team / user / org budget check for the group, and over-budget traffic
keeps returning 200. Spend keeps recording because cost calc still resolves
by model name, so the symptom is silent and only triggers on the second
register_model pass (router rebuild, /model/update, config sync).
Mirror the existing litellm_provider-None guard one block above and pop
the cost fields from the synthesized result when they are absent from the
raw entry and not in the caller's value. Caller-provided zeros (genuinely
free models, BYOK overrides) are preserved.
Fixes #30198
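The guard described above can be approximated like this. It is a simplified model, not the real `register_model`: `synthesize_info` and `merge_registration` are hypothetical helpers that only reproduce the cost-key behaviour.

```python
from typing import Any

COST_KEYS = ("input_cost_per_token", "output_cost_per_token")


def synthesize_info(raw_entry: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for get_model_info: absent cost keys come back as 0."""
    info = dict(raw_entry)
    for key in COST_KEYS:
        info.setdefault(key, 0)
    return info


def merge_registration(
    raw_entry: dict[str, Any], caller_value: dict[str, Any]
) -> dict[str, Any]:
    merged = {**synthesize_info(raw_entry), **caller_value}
    for key in COST_KEYS:
        # Drop zeros that were synthesized rather than configured.
        if key not in raw_entry and key not in caller_value:
            merged.pop(key, None)
    return merged
```

A sparse entry stays sparse across re-registration, while a caller who explicitly passes zero costs still gets a free model.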
* fix(register_model): switch _raw_entry to is-None checks + drop dead test assertion
Greptile #30201 review notes:
- the `or`-chain in the raw-entry lookup treated an empty dict (a key
with no fields) as falsy and fell through to the second arm — replace
with explicit `is None` checks so a present-but-empty entry is still
taken at face value.
- the first assertion in `test_router_double_init_keeps_db_model_entry_sparse`
used `in (None, 0)` which passes under the bug condition (cost = 0
matches the tuple); the strong follow-up assertion already covers
every shape, so drop the dead branch.
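The difference between the `or`-chain and the explicit `is None` check is easy to see in isolation. The helper names and the two-key lookup below are hypothetical and only illustrate the falsy-empty-dict pitfall.

```python
from typing import Any, Optional


def lookup_with_or(
    costs: dict[str, dict[str, Any]], key: str, alias: str
) -> Optional[dict[str, Any]]:
    # An empty dict is falsy, so this falls through to the alias.
    return costs.get(key) or costs.get(alias)


def lookup_with_is_none(
    costs: dict[str, dict[str, Any]], key: str, alias: str
) -> Optional[dict[str, Any]]:
    entry = costs.get(key)
    if entry is None:  # only a missing key falls through
        entry = costs.get(alias)
    return entry
```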
* fix(bedrock mantle): use unique function-call id for responses->chat tool calls (#30426)
* fix(bedrock mantle): use unique function-call id for responses->chat tool calls
...
* fix(bedrock mantle): scope unique tool-call id fallback to degenerate call_id
The previous revision preferred the Responses item id for every tool call, which broke providers (and existing tests) where call_id is a unique, canonical correlation key. Restrict the fallback to the degenerate index-based call_id that Bedrock Mantle returns (call_0, call_1, ... resetting per response) and keep call_id otherwise. Revert the change to the OUTPUT_ITEM_DONE streaming handler, whose tool_call_chunk is never emitted (dead code, per review). Extend the regression tests to assert a normal call_id is preserved.
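A sketch of the scoped fallback, assuming the degenerate ids look exactly like `call_<index>`. The regular expression and `pick_tool_call_id` are assumptions for illustration; the commit does not show the actual matching rule.

```python
import re
from typing import Optional

# Assumption: degenerate ids are "call_" followed only by digits.
_DEGENERATE_CALL_ID = re.compile(r"^call_\d+$")


def pick_tool_call_id(call_id: Optional[str], item_id: Optional[str]) -> Optional[str]:
    """Keep a canonical call_id; fall back to the item id only for call_0, call_1, ..."""
    if call_id and not _DEGENERATE_CALL_ID.match(call_id):
        return call_id
    return item_id or call_id
```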
* fix(router): preserve azure_ad_token through CredentialLiteLLMParams for /v1/files + batches (#30235) (#30241)
* fix(router): preserve azure_ad_token through CredentialLiteLLMParams for /v1/files + batches (#30235)
Router.get_deployment_credentials_with_provider re-validates a
deployment's litellm_params through CredentialLiteLLMParams before
handing them to file/batch/passthrough callers:
    return CredentialLiteLLMParams(
        **deployment.litellm_params.model_dump(exclude_none=True)
    ).model_dump(exclude_none=True)
Any field NOT declared on CredentialLiteLLMParams gets silently dropped
on the way through. azure_ad_token was undeclared, so Azure deployments
using OAuth/M2M (azure_ad_token instead of a static api_key) silently
lost their token at the files endpoint and the proxy returned:
Missing credentials. Please pass one of api_key, azure_ad_token,
azure_ad_token_provider, ...
Declare azure_ad_token on CredentialLiteLLMParams alongside api_key /
api_base / api_version so it rides through the round-trip. Static-key
deployments stay unaffected (Optional, default None, dropped by
exclude_none=True). Provider-callable (azure_ad_token_provider) is a
separate concern and out of scope here.
Fixes #30235
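The silent drop can be reproduced with a stdlib-only model of the round-trip. The two dataclasses are hypothetical stand-ins for `CredentialLiteLLMParams` before and after the fix, and `round_trip` imitates "ignore undeclared fields, then dump with `exclude_none=True`".

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Optional


@dataclass
class CredentialParamsBefore:
    api_key: Optional[str] = None
    api_base: Optional[str] = None
    api_version: Optional[str] = None


@dataclass
class CredentialParamsAfter(CredentialParamsBefore):
    azure_ad_token: Optional[str] = None  # the newly declared field


def round_trip(model_cls: type, params: dict[str, Any]) -> dict[str, Any]:
    """Keep only declared fields, then drop None values."""
    declared = {f.name for f in fields(model_cls)}
    validated = model_cls(**{k: v for k, v in params.items() if k in declared})
    return {k: v for k, v in asdict(validated).items() if v is not None}
```

With the field declared, an OAuth deployment keeps its token, and a static-key deployment produces the same output as before because the `None` default is dropped.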
* fix(ui-types): regenerate schema.d.ts for new azure_ad_token field
CI's 'Verify schema.d.ts matches the proxy OpenAPI spec' check
auto-detected the new field and emitted the exact diff to apply.
Two schemas had `aws_secret_access_key` from CredentialLiteLLMParams,
both get the new azure_ad_token marker next to it.
* fix(proxy): org_admin with own user_id now sees all org teams on /v2/team/list (#30247)
When the UI sends the caller's own user_id (as it does for non-Admin
global roles), _enforce_list_team_v2_access now nulls it out for org
admins so _build_team_list_where_conditions scopes by organization_id
only -- matching the legacy /team/list behavior and the documented intent.
Fixes #30215
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* test(vertex_ai): multi-region regression coverage for cachedContents host (#29571) (#29707)
litellm_internal_staging already routes the cachedContents URL through
get_vertex_base_url, fixing the multi-region 404 reported in #29571 —
but carries no test coverage for the actual regression scenario (eu/us
must resolve to the REP host aiplatform.{geo}.rep.googleapis.com).
Add TestContextCachingMultiRegionUrls: parametrized eu/us REP-host
assertions (including absence of the old broken {geo}-aiplatform host),
plus regional (us-central1) and global no-regression checks.
* fix(proxy): close upstream LLM stream when client disconnects mid-stream (#30245)
* fix(proxy): close upstream LLM stream when client disconnects mid-stream
When a streaming client disconnects, Starlette abandons the response
body iterator without calling aclose(), so the proxy's connection to
the upstream backend stays open until garbage collection, which may
never come. The backend (e.g. vLLM) keeps generating into a dead pipe:
small responses drain invisibly into TCP buffers while large ones block
the backend on a full send buffer indefinitely (observed via lsof as an
ESTABLISHED proxy->backend connection minutes after the client left).
create_response now returns a StreamingResponse subclass that closes
both its body iterator and the wrapped upstream-facing generator in a
shielded finally. The upstream generator is closed directly rather than
through a cascade because aclose() on a never-started generator skips
its body, which would make the cascade a no-op when the client
disconnects before the first chunk is sent.
async_streaming_data_generator also gains the same shielded
finally-aclose that async_data_generator in proxy_server.py already
had, covering the Anthropic and Google SSE paths.
With this, killing a streaming client causes the backend to observe the
abort within about a second and free its slot, while completed streams
are unaffected. No flag is needed, unlike the non-streaming opt-in
cancel in #30223: this only releases resources after the client is
already gone and does not change any response a client can observe.
Fixes #30244
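The reason for closing the upstream directly can be shown with plain asyncio. `Upstream`, `body`, and the two helper coroutines are hypothetical; the sketch only demonstrates that `aclose()` on a never-started async generator skips its `finally`, so a cascade alone would not release the connection.

```python
import asyncio
from typing import AsyncIterator


class Upstream:
    """Stand-in for the upstream-facing stream; aclose() releases the connection."""

    def __init__(self) -> None:
        self.closed = False

    def __aiter__(self) -> "Upstream":
        return self

    async def __anext__(self) -> str:
        return "chunk"

    async def aclose(self) -> None:
        self.closed = True


async def body(upstream: Upstream) -> AsyncIterator[str]:
    try:
        async for chunk in upstream:
            yield chunk
    finally:
        # Runs only if the generator was started at least once.
        await upstream.aclose()


async def cascade_only() -> bool:
    upstream = Upstream()
    gen = body(upstream)
    await gen.aclose()  # never started: the finally above is skipped
    return upstream.closed


async def close_both() -> bool:
    upstream = Upstream()
    gen = body(upstream)
    try:
        await gen.aclose()
    finally:
        # Close the upstream directly, shielded from cancellation.
        await asyncio.shield(upstream.aclose())
    return upstream.closed
```

`cascade_only` leaves the upstream open when the client disconnects before the first chunk, while `close_both` closes it in either case.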
* fix(proxy): close upstream even when body iterator aclose raises BaseException
Addresses the Greptile finding on #30245: the cleanup loop caught only
Exception while the generator-level cleanup catches BaseException, so a
CancelledError or GeneratorExit escaping body_iterator.aclose() would
skip closing the upstream generator. Both sites now use the same scope
and a regression test pins that the upstream is closed even when the
body iterator explodes with a BaseException.
* fix(llms): expose aclose on BaseModelResponseIterator so stream close reaches the provider connection
The response-level close added for #30244 only worked for SDK-based
providers (e.g. openai), whose streams expose aclose all the way down.
Providers served by base_llm_http_handler (hosted_vllm and most modern
transformation-based providers) wrap a bare response.aiter_lines()
generator in BaseModelResponseIterator, which had no aclose or close at
all, and nothing retained the httpx response object; so
CustomStreamWrapper.aclose() silently did nothing and the upstream
connection stayed open. Verified with a vLLM-style mock: with
hosted_vllm/ the backend streamed all 100 chunks to completion after
the client disconnected, while openai/ aborted at chunk 6.
BaseModelResponseIterator now carries an optional http_response and an
aclose() that closes it; make_async_call_stream_helper attaches the
response after building the iterator. With this, hosted_vllm aborts the
backend within ~1.6s of the client dropping, and completed streams are
unaffected.
---------
Co-authored-by: kursad <kursad.lacin@brado.net>
* feat(anthropic): surface compaction usage iterations data (#27065)
* feat(anthropic): surface compaction usage iterations data
* style: apply black formatting to fix lint checks
* fix(usage): correctly calculate usage with cached tokens when using ChatCompletionUsageBlock (#30422)
* fix(usage): correctly calculate usage with cached tokens when using ChatCompletionUsageBlock
* fix(usage): optimize test imports
* feat: add fastCRW search provider (#30434)
* feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider (#30203)
* feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider
* libertai: update served endpoints backup + add mode/matrix tests
Addresses review feedback:
- Add libertai to litellm/provider_endpoints_support_backup.json, the file
actually served by GET /public/supported_endpoints (the root
provider_endpoints_support.json already had it).
- Add tests asserting bge-m3 normalizes to mode='embedding' and that the
served matrix lists libertai. embeddings stays false: the JSON-configured
provider path only wires chat routing (OpenAILike embedding handler is
reached only for literal openai_like/llamafile/lm_studio), matching the
llamagate precedent; bge-m3 remains in the cost map for metadata.
---------
Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com>
* feat(provider): add ModelScope as an OpenAI-compatible provider (#28460)
* add ModelScope API support
* add modelscope api support
* update modelscope model list
* add image-generation support
* update test and multimodal
* fix: address PR review feedback for modelscope provider
* update README
* fix(customer_endpoints): restrict /customer/daily/activity to admin-only (#28849)
* fix(customer_endpoints): restrict /customer/daily/activity to admin-only
* fix(customer_endpoints): check role before prisma_client guard
* fix(custom_guardrail): key disable_global_guardrails takes precedence over team guardrail list (#28563)
* fix(fallbacks): preserve fallback model in SDK fallback responses (#28260)
* fix(fallbacks): preserve fallback model in response when using SDK-level fallbacks
* fix(fallbacks): gate x-litellm-* passthrough to trusted callers only
The previous patch unconditionally let `x-litellm-*` keys bypass the
`llm_provider-` prefix in `process_response_headers`. That function is
also called on raw upstream-provider response headers (e.g. from
`llm_http_handler.py`), so a malicious provider could return
`x-litellm-attempted-fallbacks` and spoof a LiteLLM-internal marker,
bypassing the proxy model-override guard.
Add a `preserve_litellm_internal_headers` flag (default False). Only
`response_metadata.py`, which re-processes the already-built
`_hidden_params["additional_headers"]` dict (LiteLLM-owned), passes
True. Raw provider header callsites keep the default False, so upstream
`x-litellm-*` still gets the `llm_provider-` prefix.
Adds a regression test for the spoofing case and renames the existing
preserve test to make the trusted-path semantics explicit.
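The trust gate can be modelled in a few lines. `prefix_headers` is a hypothetical, simplified stand-in for `process_response_headers` that prefixes every header; only the flag name, the `llm_provider-` prefix, and the `x-litellm-*` convention come from the commit message.

```python
def prefix_headers(
    headers: dict[str, str],
    preserve_litellm_internal_headers: bool = False,
) -> dict[str, str]:
    """Prefix headers with llm_provider- unless the caller is trusted."""
    processed: dict[str, str] = {}
    for key, value in headers.items():
        is_internal = key.lower().startswith("x-litellm-")
        if preserve_litellm_internal_headers and is_internal:
            processed[key] = value  # LiteLLM-owned dict: keep as-is
        else:
            processed[f"llm_provider-{key}"] = value
    return processed
```

With the default `False`, a provider that returns `x-litellm-attempted-fallbacks` only ever produces `llm_provider-x-litellm-attempted-fallbacks`, so it cannot spoof the internal marker.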
* fix(fallbacks): ignore preserve_litellm_internal_headers for raw httpx.Headers inputs
* style(core_helpers): apply black formatting
* fix(lint): remove banned typing.List/Dict/Any imports and suppress PLR0913 on interface overrides
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): apply black formatting to modelscope chat transformation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): replace noqa with proper fixes — use **kwargs and Awaitable instead of Any/List
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): remove unused AllMessageValues import
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* revert: restore base_model_iterator.py to original PR state
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): restore full method signatures for MyPy compatibility; bump PLR0913 budget for new provider files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): use @override to suppress PLR0913 on inherited signatures instead of bumping budget
The overrides keep their full base-class signatures for MyPy compatibility, but those signatures carry more than five parameters, which tripped PLR0913 on each subclass redeclaration. Since the arity is dictated by the base class and cannot be reduced, decorate the overrides with typing_extensions.override; ruff treats that as the intended signal that the parameter count is not under the author's control and skips PLR0913. This restores the PLR0913 baseline to 1813.
* fix(lint): add @override to modelscope image generation overrides
Apply the same typing_extensions.override treatment to the image generation config so its inherited-signature overrides do not count against PLR0913.
---------
Co-authored-by: Joel Tony <github@jaytau.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: hcl <chenglunhu@gmail.com>
Co-authored-by: ztko <96878659+koztkozt@users.noreply.github.com>
Co-authored-by: Nahrin <nahrin@nahrinoda.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Humphrey <a739376838@gmail.com>
Co-authored-by: kursadlacin <kursadlacin@gmail.com>
Co-authored-by: kursad <kursad.lacin@brado.net>
Co-authored-by: Dushyant Acharya <dushyantacharya873@gmail.com>
Co-authored-by: Yuriy <yuriy.shuyskiy@gmail.com>
Co-authored-by: Recep S <22618852+us@users.noreply.github.com>
Co-authored-by: Moshe Malawach <moshe.malawach@protonmail.com>
Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com>
Co-authored-by: Rongkun Yan <2493404415@qq.com>
Co-authored-by: Varshith <kvarshithgowda@gmail.com>
Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>
* ci(lint): add blanket-noqa, dataclass-default, and unused-noqa Ruff rules (#30516)
* ci(lint): enforce blanket-noqa, dataclass-default, and unused-noqa rules
Enable PGH004 (blanket-noqa), RUF008 (mutable-dataclass-default),
RUF009 (function-call-in-dataclass-default-argument), and RUF100
(unused-noqa) in ruff.toml, and clean up every resulting violation.
RUF008/RUF009 were already clean. PGH004/RUF100 surfaced ~335 stale or
blanket noqas: blanket `# noqa` are now scoped to the rule they actually
suppress (mostly T201), dead directives are removed, and inapplicable
codes are trimmed (e.g. F401 dropped from `import *`).
lint.external lists rules enforced outside this config (the strict-rule
gate via ruff-strict.toml and upstream litellm's own ruff config) so
RUF100 keeps the noqa directives that protect them instead of stripping
coverage this config can't see.
* ci(lint): trim RUF100 external list to load-bearing codes only
Drop the 9 precautionary strict-gate codes (ANN001/002/003/401, B006,
PLR0913, PLW0603, RUF012, TID251) that have zero `# noqa` references in
the gated source. Keep only the 11 codes with live suppressions so
RUF100 doesn't flag them as unused. Future strict-gate suppressions can
re-add codes here (or fix the underlying issue) as needed.
* ci: ratchet lint and type-check gates (ruff preview, ANN, mypy, basedpyright) (#30379)
* ci: enable ruff preview rules under the budgeted strict gate
Turn on ruff preview in the strict-budget lane (ruff-strict.toml) only,
leaving the clean gate (ruff.toml) untouched so make lint-ruff stays at
zero. Enumerate the 118 firing codes explicitly with
explicit-preview-rules so the gate is deterministic and stable across
ruff upgrades rather than depending on preview auto-selecting the broad
catalog.
Grandfather the existing 58438 violations into ruff-strict-budget.json
as per-rule baselines with headroom, so only net-new violations fail CI.
The existing ten rules keep their hand-tuned slack; the new rules get
slack 10 when the baseline is 50 or more and 3 otherwise.
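The slack rule for newly budgeted rules is simple enough to state as code. `default_slack` follows the thresholds above; `exceeds_budget` assumes a rule fails once its count passes baseline plus slack, which the commit message implies but does not spell out.

```python
def default_slack(baseline: int) -> int:
    """Slack for a newly budgeted rule: 10 at baseline >= 50, otherwise 3."""
    return 10 if baseline >= 50 else 3


def exceeds_budget(current: int, baseline: int) -> bool:
    # Assumption: the gate fails when the count passes baseline + slack.
    return current > baseline + default_slack(baseline)
```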
* ci: add ANN return-type rules to the budgeted strict gate
Add ANN201/202/204/205/206 (missing return annotations) to the strict
lane and grandfather the existing counts into ruff-strict-budget.json so
the codebase ratchets toward explicit return types without breaking CI.
* ci: add mypy (disallow_untyped_defs) and basedpyright strict gates with baselines
Add two type-check gates, each grandfathering the current tree so only
net-new violations fail CI, matching the ruff strict-budget ratchet.
mypy gains disallow_untyped_defs in litellm/mypy.ini (the config the CI
invocation actually reads; the root [tool.mypy] is not picked up from the
litellm/ working dir). The 4885 existing missing-annotation errors are
captured in litellm/.mypy-baseline.txt and the run is piped through
mypy-baseline filter so new untyped defs are rejected.
basedpyright runs in strict mode over litellm/, with
enableTypeIgnoreComments disabled so it only honors '# pyright: ignore'
and never polices mypy's '# type: ignore'. The existing strict diagnostics
are grandfathered into .basedpyright/baseline.json.
Both tools are pinned in the dev group and uv.lock; the lint workflow and
Makefile run them filtered through their baselines, with
lint-mypy-baseline-update and lint-basedpyright-baseline-update to ratchet.
* ci: raise lint job timeout to 15m for the basedpyright strict pass
* ci: pin pythonVersion 3.12 and regenerate baselines against merged base
Merge litellm_internal_staging so the baselines cover code the CI merge
includes (e.g. the cisco_ai_defense guardrail), which otherwise tripped
the mypy gate with 3 ungrandfathered no-untyped-def errors. Pin
pythonVersion 3.12 in pyrightconfig so basedpyright's strict analysis is
reproducible across interpreter versions (CI runs 3.12).
* ci: regenerate basedpyright baseline against the frozen lint env
The previous baseline was generated with optional provider deps (azure,
google, anthropic, mcp, numpydoc, google-genai) installed locally, so CI's
dev-only env surfaced ~3500 reportUnknown*/reportMissingTypeStubs errors
not in the baseline. Regenerate after uv sync --frozen so the baseline
reflects the same dependency set the lint job sees.
* ci: regenerate basedpyright baseline on python 3.12 frozen env
The prior baseline still carried proxy-dev packages (e.g. prisma) that the
lint job's dev-only, python 3.12 env lacks, leaving 2 unresolved-import
errors ungrandfathered. Regenerate in a python 3.12 venv synced to the
frozen lock with default groups only, so the baseline matches exactly what
CI sees.
* ci: replace type-check baselines with per-file count budgets
The mypy and basedpyright baselines were position-sensitive (and the
basedpyright one was a 27MB file), so ordinary line shifts churned them.
Replace both with a per-file count gate: scripts/type_check_gate.py reduces
each tool's output to errors-per-file and checks it against a committed
{file: max} budget, ignoring line and column numbers. A file fails only
when it gains more errors than its ceiling; debt can't be shuffled between
files because each file has its own cap and new files default to zero.
Budgets (mypy-file-budget.json 48K, basedpyright-file-budget.json 96K) are
generated in the python 3.12 frozen lint env so they match CI. Drops the
mypy-baseline dependency; basedpyright runs without its native baseline.
Ratchet via make lint-mypy-budget-update / lint-basedpyright-budget-update.
* ci: add a small per-file slack to the type-check gate
Allow each file to drift PER_FILE_SLACK (5) errors past its recorded count
before failing, so a basedpyright inference ripple in an unrelated file
doesn't break the build over a couple of errors. Budgets still record exact
counts; the tolerance is applied at check time.
* ci: move type-check slack into the budget json and trim lint timeout
Make slack declarative: the budget is now {"slack": N, "files": {path: count}}
so the tolerance is tuned in JSON without editing the script, mirroring how
ruff-strict-budget.json carries its slack. --update preserves the existing
slack. Also drop the lint job timeout from 15m to 10m; the mypy and
basedpyright passes add ~2m, leaving the job around 4-5m, so 10m is a
comfortable margin.
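A minimal sketch of a per-file count gate with declarative slack, assuming mypy-style `path:line: error: ...` output. This is not the contents of scripts/type_check_gate.py: the regex, function names, and the choice to apply slack to new files as well are assumptions; only the `{"slack": N, "files": {path: count}}` shape comes from the commit message.

```python
import re
from collections import Counter
from typing import Any

# Assumption: errors look like "pkg/mod.py:12: error: ..." (column optional).
_ERROR_LINE = re.compile(r"^(?P<path>[^:\s]+\.py):\d+(?::\d+)?: error")


def count_errors_per_file(output: str) -> Counter:
    """Reduce checker output to errors-per-file, ignoring line and column."""
    counts: Counter = Counter()
    for line in output.splitlines():
        match = _ERROR_LINE.match(line.strip())
        if match:
            counts[match.group("path")] += 1
    return counts


def check_budget(counts: Counter, budget: dict[str, Any]) -> list[str]:
    """Return one failure message per file that exceeds count + slack."""
    slack = budget.get("slack", 0)
    recorded = budget.get("files", {})
    failures = []
    for path, found in sorted(counts.items()):
        ceiling = recorded.get(path, 0) + slack  # unlisted files start at zero
        if found > ceiling:
            failures.append(f"{path}: {found} errors > ceiling {ceiling}")
    return failures
```

Because only counts are compared, moving code within a file does not churn the budget, and each file's ceiling is independent of every other file's.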
* ci: collapse fully-adopted ruff categories and drop inert preview flag
ANN (all nine non-removed rules) and BLE (its only rule) were spelled out
code-by-code; replace each with its category selector, which is exactly
equivalent in 0.15.3 (the removed ANN101/ANN102 are skipped by a category
selector and error when named explicitly). explicit-preview-rules was inert:
every selected rule is stable and nothing is selected by category, so the flag
had nothing to gate. Verified the strict-rule counts are identical before and
after (62379 each, zero per-rule drift), so no budget change.
* ci: drop redundant pyright dev dependency
Nothing invokes bare pyright in the Makefile, the linting workflow, or
scripts; the basedpyright gate added on this branch is the only type
checker that runs. based…
vLLM) keeps generating into a dead pipe: small responses drain invisibly into TCP buffers while large ones block the backend on a full send buffer indefinitely (observed via lsof as an ESTABLISHED proxy->backend connection minutes after the client left).
create_response now returns a StreamingResponse subclass that closes both its body iterator and the wrapped upstream-facing generator in a shielded finally. The upstream generator is closed directly rather than through a cascade because aclose() on a never-started generator skips its body, which would make the cascade a no-op when the client disconnects before the first chunk is sent. async_streaming_data_generator also gains the same shielded finally-aclose that async_data_generator in proxy_server.py already had, covering the Anthropic and Google SSE paths.
With this, killing a streaming client causes the backend to observe the abort within about a second and free its slot, while completed streams are unaffected. No flag is needed, unlike the non-streaming opt-in cancel in BerriAI#30223: this only releases resources after the client is already gone and does not change any response a client can observe.
Fixes BerriAI#30244
* fix(proxy): close upstream even when body iterator aclose raises BaseException
Addresses the Greptile finding on BerriAI#30245: the cleanup loop caught only Exception while the generator-level cleanup catches BaseException, so a CancelledError or GeneratorExit escaping body_iterator.aclose() would skip closing the upstream generator. Both sites now use the same scope and a regression test pins that the upstream is closed even when the body iterator explodes with a BaseException.
* fix(llms): expose aclose on BaseModelResponseIterator so stream close reaches the provider connection
The response-level close added for BerriAI#30244 only worked for SDK-based providers (e.g. openai), whose streams expose aclose all the way down.
Providers served by base_llm_http_handler (hosted_vllm and most modern transformation-based providers) wrap a bare response.aiter_lines() generator in BaseModelResponseIterator, which had no aclose or close at all, and nothing retained the httpx response object; so CustomStreamWrapper.aclose() silently did nothing and the upstream connection stayed open. Verified with a vLLM-style mock: with hosted_vllm/ the backend streamed all 100 chunks to completion after the client disconnected, while openai/ aborted at chunk 6 BaseModelResponseIterator now carries an optional http_response and an aclose() that closes it; make_async_call_stream_helper attaches the response after building the iterator. With this, hosted_vllm aborts the backend within ~1.6s of the client dropping, and completed streams are unaffected --------- Co-authored-by: kursad <kursad.lacin@brado.net> * feat(anthropic): surface compaction usage iterations data (BerriAI#27065) * feat(anthropic): surface compaction usage iterations data * style: apply black formatting to fix lint checks * fix(usage): correct calculate usage with cached tokens when use ChatCompletionUsageBlock (BerriAI#30422) * fix(usage): correct calculate usage with cached tokens when use ChatCompletionUsageBlock * fix(usage): optimize test imports * feat: add fastCRW search provider (BerriAI#30434) * feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider (BerriAI#30203) * feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider * libertai: update served endpoints backup + add mode/matrix tests Addresses review feedback: - Add libertai to litellm/provider_endpoints_support_backup.json, the file actually served by GET /public/supported_endpoints (the root provider_endpoints_support.json already had it). - Add tests asserting bge-m3 normalizes to mode='embedding' and that the served matrix lists libertai. 
embeddings stays false: the JSON-configured provider path only wires chat routing (OpenAILike embedding handler is reached only for literal openai_like/llamafile/lm_studio), matching the llamagate precedent; bge-m3 remains in the cost map for metadata. --------- Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com> * feat(provider): add ModelScope as an OpenAI-compatible provider (BerriAI#28460) * add ModelScope API support * add modelscope api support * update modelscope model list * add image-genetation support * update test and multimodal * fix: address PR review feedback for modelscope provider * update README * fix(customer_endpoints): restrict /customer/daily/activity to admin-only (BerriAI#28849) * fix(customer_endpoints): restrict /customer/daily/activity to admin-only * fix(customer_endpoints): check role before prisma_client guard * fix(custom_guardrail): key disable_global_guardrails takes precedence over team guardrail list (BerriAI#28563) * fix(fallbacks): preserve fallback model in SDK fallback responses (BerriAI#28260) * fix(fallbacks): preserve fallback model in response when using SDK-level fallbacks * fix(fallbacks): gate x-litellm-* passthrough to trusted callers only The previous patch unconditionally let `x-litellm-*` keys bypass the `llm_provider-` prefix in `process_response_headers`. That function is also called on raw upstream-provider response headers (e.g. from `llm_http_handler.py`), so a malicious provider could return `x-litellm-attempted-fallbacks` and spoof a LiteLLM-internal marker, bypassing the proxy model-override guard. Add a `preserve_litellm_internal_headers` flag (default False). Only `response_metadata.py`, which re-processes the already-built `_hidden_params["additional_headers"]` dict (LiteLLM-owned), passes True. Raw provider header callsites keep the default False, so upstream `x-litellm-*` still gets the `llm_provider-` prefix. 
Adds a regression test for the spoofing case and renames the existing preserve test to make the trusted-path semantics explicit.
* fix(fallbacks): ignore preserve_litellm_internal_headers for raw httpx.Headers inputs
* style(core_helpers): apply black formatting
* fix(lint): remove banned typing.List/Dict/Any imports and suppress PLR0913 on interface overrides
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): apply black formatting to modelscope chat transformation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): replace noqa with proper fixes: use **kwargs and Awaitable instead of Any/List
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): remove unused AllMessageValues import
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* revert: restore base_model_iterator.py to original PR state
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): restore full method signatures for MyPy compatibility; bump PLR0913 budget for new provider files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): use @override to suppress PLR0913 on inherited signatures instead of bumping budget
The overrides keep their full base-class signatures for MyPy compatibility, but those signatures carry more than five parameters, which tripped PLR0913 on each subclass redeclaration. Since the arity is dictated by the base class and cannot be reduced, decorate the overrides with typing_extensions.override; ruff treats that as the intended signal that the parameter count is not under the author's control and skips PLR0913. This restores the PLR0913 baseline to 1813.
* fix(lint): add @override to modelscope image generation overrides
Apply the same typing_extensions.override treatment to the image generation config so its inherited-signature overrides do not count against PLR0913.
--------- Co-authored-by: Joel Tony <github@jaytau.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: hcl <chenglunhu@gmail.com> Co-authored-by: ztko <96878659+koztkozt@users.noreply.github.com> Co-authored-by: Nahrin <nahrin@nahrinoda.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Humphrey <a739376838@gmail.com> Co-authored-by: kursadlacin <kursadlacin@gmail.com> Co-authored-by: kursad <kursad.lacin@brado.net> Co-authored-by: Dushyant Acharya <dushyantacharya873@gmail.com> Co-authored-by: Yuriy <yuriy.shuyskiy@gmail.com> Co-authored-by: Recep S <22618852+us@users.noreply.github.com> Co-authored-by: Moshe Malawach <moshe.malawach@protonmail.com> Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com> Co-authored-by: Rongkun Yan <2493404415@qq.com> Co-authored-by: Varshith <kvarshithgowda@gmail.com> Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>
* feat(ui): gate "Default Credentials" hint on /ui/login behind env flag (#30234)
Adds LITELLM_HIDE_DEFAULT_CREDENTIALS_HINT (and an equivalent
general_settings.hide_default_credentials_hint) that suppresses the
"By default, Username is admin and Password is your set LiteLLM Proxy
MASTER_KEY" info card rendered on /ui/login and /fallback/login.
Motivation: in production deployments operators set UI_USERNAME /
UI_PASSWORD (or SSO), and the hardcoded hint becomes factually
incorrect and is flagged by security scanners (Tenable WAS plugin
114625) as information disclosure. There is currently no way to
suppress it without forking the dashboard.
Behaviour:
- Default is unchanged (hint shown), so existing deployments are
unaffected.
- New field hide_default_credentials_hint on the well-known UI config
endpoint, populated from the env var or general_settings.
- LoginPage.tsx conditionally renders the Alert based on the flag.
Refs: BerriAI/litellm#30232
* fix(router): clean pattern_router state on upsert/delete (#29601)
* fix(router): clean pattern_router state on upsert/delete
PatternMatchRouter.add_pattern was append-only, and neither Router.upsert_deployment nor Router.delete_deployment removed the existing entry. Rotated-out api_keys stayed in the routing rotation for wildcard deployments (model_name with `*`) until proxy restart, silently defeating key rotation as an admin operation. The same leak applied to provider_default_deployment_ids and per-team pattern routers, and the patterns list grew unboundedly on every edit.
* test(router): direct unit tests for _remove_deployment_from_wildcard_state
router_code_coverage.py greps test files for AST Call nodes and flagged
the helper as untested because the existing coverage only exercised it
transitively through upsert/delete. Adds two direct tests that pin the
helper's contract (cleans across global pattern router, per-team
routers with empty-router pop, and provider_default_deployment_ids;
no-op on falsy model_id).
* fix(router): address Greptile review on pattern_router cleanup
Widen PatternMatchRouter.remove_deployment annotation to Optional[str];
the implementation already handles None via the falsy guard and the
unit test exercises it directly.
Move _remove_deployment_from_wildcard_state up one level in
upsert_deployment so it runs whenever the prior deployment is on the
router, not only when the model_id is present in the fast-mapping
index. The scenario is currently unreachable (get_deployment shares
the same index), but the cleanup is idempotent so this is defensive
against any future divergence between those code paths.
* fix(router): widen _remove_deployment_from_wildcard_state to Optional[str]
Moving the call out of the inner `deployment_id in deployment_fast_mapping`
block in the previous commit lost mypy's narrowing of `deployment_id`
from Optional[str] to str, tripping the lint CI. The helper already
handles None via its falsy guard, so widening the annotation matches
the actual contract.
* fix(router): make delete_deployment wildcard cleanup symmetric with upsert
After the previous commit moved _remove_deployment_from_wildcard_state out
of the inner index-map guard in upsert_deployment, delete_deployment was
still calling it only inside `if deployment_idx is not None`. Greptile
flagged the asymmetry: under a desynced index_map, delete would silently
leave the stale wildcard credential in pattern_router.
Moves the cleanup call to the top of the try block, mirroring the upsert
path. Cleanup is idempotent so the change is a no-op on the happy path.
Adds a regression test that simulates the desync by removing the entry
from model_id_to_deployment_index_map and asserts delete still clears
pattern_router.
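The cleanup contract described in these commits can be sketched as follows. `TinyPatternRouter` and `upsert` are hypothetical illustrations of the idea (remove first, then add; falsy id is a no-op), not the real PatternMatchRouter or Router code:

```python
from typing import Dict, List, Optional


class TinyPatternRouter:
    """Hypothetical stand-in for a wildcard pattern router."""

    def __init__(self) -> None:
        # pattern -> deployments registered under that pattern
        self.patterns: Dict[str, List[dict]] = {}

    def add_pattern(self, pattern: str, deployment: dict) -> None:
        # Append-only by itself: repeated upserts would accumulate entries.
        self.patterns.setdefault(pattern, []).append(deployment)

    def remove_deployment(self, model_id: Optional[str]) -> None:
        # Falsy guard: None or "" is a no-op, so Optional[str] is the honest type.
        if not model_id:
            return
        for pattern in list(self.patterns):
            kept = [d for d in self.patterns[pattern] if d["id"] != model_id]
            if kept:
                self.patterns[pattern] = kept
            else:
                # Drop emptied patterns so the mapping cannot grow without bound.
                del self.patterns[pattern]


def upsert(router: TinyPatternRouter, pattern: str, deployment: dict) -> None:
    # Idempotent cleanup first, then re-add, so a rotated-out key is gone.
    router.remove_deployment(deployment["id"])
    router.add_pattern(pattern, deployment)
```

Because the removal is idempotent, calling it unconditionally on both the upsert and delete paths costs nothing on the happy path.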
* fix(pricing): add 1h cache-write cost for Anthropic Sonnet 4.5/4.6 (#30474)
The native anthropic claude-sonnet-4-5/4-6 price-map entries were missing
cache_creation_input_token_cost_above_1hr (and the >200K long-context
sub-tier for 4.5), so 1-hour-TTL cache writes were costed at the 5-minute
rate. Adds 6e-06 regular (and 1.2e-05 long-context) = 2x base input,
matching the vertex_ai/azure_ai/bedrock siblings and the older
claude-sonnet-4-20250514 entry. Adds a regression test.
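The arithmetic behind the new rates, as a rough sketch. The base input rates below are inferred from the "2x base input" statement above rather than read from the price map, and the helper names are illustrative only:

```python
# Inferred from the commit message (6e-06 and 1.2e-05 are each 2x base input);
# check the live price map before relying on these numbers.
BASE_INPUT_COST = 3e-06
LONG_CONTEXT_INPUT_COST = 6e-06
ONE_HOUR_WRITE_MULTIPLIER = 2  # 1h-TTL cache write = 2x base input


def one_hour_cache_write_rate(base_input_cost: float) -> float:
    # Per-token rate for a 1-hour-TTL cache write.
    return base_input_cost * ONE_HOUR_WRITE_MULTIPLIER


def cache_write_cost(tokens: int, base_input_cost: float) -> float:
    # Total cost of writing `tokens` tokens to the 1-hour cache.
    return tokens * one_hour_cache_write_rate(base_input_cost)
```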
* fix(proxy): cancel upstream gemini request and release httpx connection on client disconnect (#30075)
* fix(proxy): cancel upstream gemini request and release httpx connection on client disconnect
- add _check_request_disconnection to common_request_processing; wrap llm_call
as asyncio.Task so it can be cancelled; catch CancelledError and raise
HTTPException(499) when client disconnects before LLM responds (non-streaming path)
- pass raw httpx.Response into ModelResponseIterator in make_call/make_sync_call
so the iterator holds a reference to the underlying connection
- implement ModelResponseIterator.aclose() and .close(): close the line iterator
then explicitly call response.aclose()/response.close() to release the httpx
connection when the client drops mid-stream; errors are debug-logged, not raised
- add tests for _check_request_disconnection (cancels task, graceful on exception,
does not cancel when client stays connected) and base_process_llm_request 499
behavior; add TestModelResponseIteratorCleanup verifying aclose/close propagation
through CustomStreamWrapper
* fix(proxy): record 499 on streaming disconnect and cancel orphaned gather tasks
Wire streaming generator cleanup to log client_disconnected with error_code 499
in spend logs, cancel pending during_call_hook tasks when the LLM call is
cancelled on disconnect, and align the 600s poll limit comment with proxy_server.
* fix: extract client disconnect logging helper to satisfy PLR0915
* fix: resolve mypy and code-quality CI failures for client disconnect logging
Cast client disconnect error_information for mypy, only await pending gather tasks to avoid masking LLM errors, and add tests for the new logging helper and gather cleanup.
* fix(proxy): harden gather cleanup so finally cannot mask LLM errors
* fix(proxy): shield streaming disconnect logging and strip spoofable metadata
Move streaming disconnect recording into a shielded cancel scope, add gather cleanup regression coverage for guardrail-converted cancels, and strip client_disconnected/error_information from user metadata at the proxy boundary.
* fix(proxy): only map CancelledError to 499 for client disconnect
Track when the disconnect poller cancels the LLM task and re-raise other CancelledError paths so graceful shutdown is not reported as HTTP 499.
* fix(proxy): remove dead _check_request_disconnection helper
Non-streaming client disconnect is handled by staging's cancel_on_disconnect path via _await_llm_call_cancelling_on_disconnect. Drop the unused is_disconnected poller and its unit tests; rename the remaining integration tests to TestDisconnectGatherCleanup.
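A minimal asyncio sketch of the "only map CancelledError to 499 when the poller caused it" rule. `ClientDisconnected` and `await_cancelling_on_disconnect` are hypothetical names for illustration and do not mirror the real helper's signature:

```python
import asyncio
from typing import Any, Awaitable, Callable


class ClientDisconnected(Exception):
    status_code = 499  # "client closed request"


async def await_cancelling_on_disconnect(
    llm_call: Awaitable[Any],
    is_disconnected: Callable[[], Awaitable[bool]],
    poll_interval: float = 0.01,
) -> Any:
    task = asyncio.ensure_future(llm_call)
    cancelled_by_poller = False

    async def poll() -> None:
        nonlocal cancelled_by_poller
        while not task.done():
            if await is_disconnected():
                # Remember that *we* cancelled, so shutdown is not reported as 499.
                cancelled_by_poller = True
                task.cancel()
                return
            await asyncio.sleep(poll_interval)

    poller = asyncio.ensure_future(poll())
    try:
        return await task
    except asyncio.CancelledError:
        if cancelled_by_poller:
            raise ClientDisconnected() from None
        raise  # some other cancellation, e.g. graceful shutdown
    finally:
        poller.cancel()
        if not task.done():
            task.cancel()
```

Tracking the cancellation source in a flag is what keeps an unrelated CancelledError from being misreported as a client disconnect.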
* feat(mistral): add mistral-medium-3-5 to model_prices_and_context_wind.. (#29303)
* feat(mistral): add mistral-medium-3-5 to model_prices_and_context_window.json
Mistral's docs page lists mistral-medium-3-5 as a new model offering.
Pricing/specs sourced from Mistral's published model metadata:
- input: $1.50 / 1M tokens
- output: $7.50 / 1M tokens
- context: 262,144 tokens
- capabilities: vision, function calling, structured outputs, assistant
prefill
Adds entry: `mistral/mistral-medium-3-5`, mirroring the pattern used for
the rest of the Mistral family.
test(mistral): add model_info test for mistral-medium-3-5 + sync backup
cost map
- Mirror mistral/mistral-medium-3-5 entries into
litellm/model_prices_and_context_window_backup.json so the bundled
model cost map matches the canonical
model_prices_and_context_window.json.
- Add tests/test_litellm/test_mistral_medium_3_5_model_metadata.py
covering pricing tiers, capability flags, context window, provider
routing, and parity between the main and backup cost maps.
- Point 'source' at the live Mistral models documentation page.
* fix(ui): three small UI fixes — Gemini api_base + credential form reset + Mode badge (#30419)
* fix(ui): three small UI fixes — Gemini api_base field + credential form reset + Mode badge
Three independent fixes; bundled because they all touch the
credential-form / logging-callbacks area.
1. expose api_base field on Google AI Studio credential form
The runtime gemini provider supports custom api_base via
`vertex_llm_base._check_custom_proxy`; the UI just needs to expose
the field. Adds api_base to the Google_AI_Studio credential form
ordered before api_key (matching OpenAI/Anthropic conventions).
Default value matches the canonical Google AI Studio endpoint that
LiteLLM's gemini provider talks to when api_base is unset, so
leaving the default in the form behaves identically to leaving it
blank.
2. reset credential form state when switching providers
Switching the Provider select in AddCredentialModal / EditCredentialModal
left the previous provider's field values populated. The form then
submitted a mixed payload (e.g. Azure deployment fields under an
OpenAI credential), producing confusing failures.
Extract `getProviderFieldDefaults` helper and reset the form to it
on provider change. Unit-tested via the extracted helper because
Antd Select's portal/dropdown behaviour is unreliable in jsdom.
3. logging callbacks table reads backend `type` for Mode badge (#35)
The `/get_callbacks` proxy endpoint returns each callback as
`{name, type, variables}` where `type` is `"success"` or
`"failure"`. The same callback name can appear twice (one per event
class) and the two entries fire on disjoint events.
`LoggingCallbacksTable` ignored `type` and read `record.mode`
(always undefined), so every row fell back to the "Success" badge.
A `generic_api` callback registered for both classes showed up as
two identical "Success" rows + React duplicate-key warning.
Read `record.type` first (fall back to `record.mode` for newly-added
not-yet-server-acknowledged rows). Composite rowKey
`${name}-${type ?? mode ?? 'success'}`. Removed leftover debug
`console.log`.
* fix(ui): drop api_base default_value to preserve Gemini v1alpha auto-routing
Greptile P2 (PR #30419, threads on lines 1255-1256 of
provider_create_fields.json): the api_base field's `default_value` was
hard-coded to "https://generativelanguage.googleapis.com/v1beta". This:
1. Bakes v1beta into every credential record saved through the form,
even when the user never touched the field. If LiteLLM's internal
gemini default URL ever changes, those persisted credentials keep
hitting the stale path.
2. Bypasses `_get_gemini_url`'s automatic version routing for Gemini 3+
models. That helper picks v1alpha for Gemini 3+ and v1beta for older
models when api_base is unset. With the default pre-filled (and
`_check_custom_proxy` then taking over because api_base is non-empty),
Gemini 3+ requests get pinned to v1beta and may fail or behave
unexpectedly — purely because the user accepted the visible default.
Fix: set `default_value` to `null` and move the canonical URL guidance
into the `placeholder` (visible to the user, never persisted) and an
expanded tooltip. UX is unchanged — the URL is still shown in the
greyed-out input — but the auto-version-routing path stays default.
Updated test_google_ai_studio_provider_fields_expose_api_base to assert
the new contract (`default_value is None`, `placeholder` carries the
canonical URL), with a comment pointing at the Greptile threads as the
rationale so future contributors don't accidentally re-introduce the
default.
26/26 tests in the file pass. JSON validates (`json.load` clean).
* feat(azure_ai): add gpt-5.5 to model cost map (#30428)
* feat(azure_ai): add gpt-5.5 to model cost map
Adds azure_ai/gpt-5.5 and its dated snapshot azure_ai/gpt-5.5-2026-04-23 to
both the canonical and bundled cost maps. gpt-5.5 is generally available on
Azure AI Foundry; pricing mirrors the openai gpt-5.5 entry, matching the
established azure_ai convention (verified identical for gpt-5.4), in the
azure tier structure (base / above-272k / priority). supports_minimal_reasoning_effort
is false, the capability that changed from gpt-5.4.
Fixes #30306
* Update tests/test_litellm/test_gpt_5_5_model_metadata.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
---------
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* fix: guard check_and_fix_namespace against None key (#30435)
* fix: guard check_and_fix_namespace against None key
When user_id is None, the cache key can be None, causing
AttributeError: 'NoneType' object has no attribute 'startswith'
in check_and_fix_namespace.
Add an early return for None key to prevent the error and the
ERROR-level log noise it produces on every unauthenticated request.
Fixes #30424
* fix: update type annotations for check_and_fix_namespace
- key: str -> Optional[str] (now handles None input)
- return: str -> Optional[str] (returns None when input is None)
Addresses Greptile review concern about type signature mismatch.
* fix: revert check_and_fix_namespace type signature to str to fix MyPy downstream errors
* fix: update type annotations for check_and_fix_namespace
- Change signature from str -> str to Optional[str] -> Optional[str]
- Remove type: ignore comment on None return
- Add None guard in async_set_cache_sadd before passing to helper
Addresses review feedback from Sameerlite on type mismatch.
* Revert "fix: update type annotations for check_and_fix_namespace"
This reverts commit 5272920fa0daab676f5ad46dcadd8cd537cfc96f.
---------
Co-authored-by: michaelxer <michaelxer@users.noreply.github.com>
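A rough sketch of the None guard, written as a free function. The real check_and_fix_namespace is a cache method whose exact signature and prefix format are not shown here, so `fix_namespace`, its parameters, and the `namespace:key` format are assumptions:

```python
from typing import Optional


def fix_namespace(key: Optional[str], namespace: Optional[str]) -> Optional[str]:
    # Early return: a None key (e.g. user_id is None) would otherwise hit
    # AttributeError: 'NoneType' object has no attribute 'startswith'.
    if key is None:
        return None
    # Assumed prefix format, for illustration only.
    if namespace is not None and not key.startswith(namespace):
        return f"{namespace}:{key}"
    return key
```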
* fix(cost): apply service_tier suffix to above-threshold cache rates and expose priority+threshold keys in ModelInfo (#30450)
* fix(cost): apply service_tier suffix to above-threshold cache rates and expose priority+threshold keys in ModelInfo
Models that publish both a service_tier (e.g. priority) rate and an above-threshold tier (e.g. _above_200k_tokens) currently bill cached tokens at the standard above-threshold rate rather than the priority above-threshold rate. Affected entries in the live pricing JSON include gemini-3-pro-preview, gemini-3.1-pro-preview and their vertex_ai/ and gemini/ variants, plus azure/gpt-5.4 and azure_ai/gpt-5.4. For a 250K-token priority request with 200K cached tokens against gemini-3-pro-preview, the leak is about 44 percent of the prompt cost.
Two stacked defects caused this. First, ModelInfoBase (and the ModelInfo pydantic class) and the get_model_info construction in litellm/utils.py omit the priority+above-threshold cost keys, so even if the calculator asked for them they would never reach it. Second, in _get_token_base_cost the cache_creation/cache_read tiered keys never get wrapped with _get_service_tier_cost_key, while the input/output tiered keys above and below do. The change here surfaces six new keys (input, output and cache_read at both 200k and 272k priority variants) and wraps the three cache tiered keys in _get_token_base_cost the same way input/output already are. _get_cost_per_unit's existing service_tier-to-base fallback covers models that ship the standard above-threshold rate without a priority variant.
Adds one regression test in tests/test_litellm/litellm_core_utils/llm_cost_calc/test_llm_cost_calc_utils.py that drives the actual generic_cost_per_token path for gemini-3-pro-preview at 200K cached + 50K text under priority and asserts the priority above_200k rates are picked. Verified the test fails on litellm_internal_staging without these changes and passes with them.
* fix(cost): drop guard on cache tiered keys so service_tier fallback can reach standard above-threshold rate
Addresses Greptile P1 on PR 30450. The previous commit wrapped cache_creation_tiered_key, cache_creation_1hr_tiered_key, and cache_read_tiered_key with _get_service_tier_cost_key (matching how the sibling input and output tiered keys are wrapped) but kept the surrounding 'if key in model_info' guards. For models that publish a standard above-threshold cache rate but no priority variant (gpt-5.4-pro, gpt-5.5-pro and their dated siblings, plus vertex_ai/claude-sonnet-4-5 for cache_creation), the guard short-circuits before _get_cost_per_unit's existing service_tier-to-base fallback can strip _priority and find the standard above-threshold key. The result on priority requests over the threshold was that those models silently dropped from the above-threshold rate back to the priority-base rate. Dropping the guard and calling _get_cost_per_unit unconditionally (mirroring how tiered_input_key and tiered_output_key are already handled) restores correct billing for that class of models while keeping the new priority+above-threshold behaviour for gemini-3-pro-preview and friends.
Adds a second regression test that pins generic_cost_per_token for vertex_ai/claude-sonnet-4-5 priority + above_200k with cached and cache_creation tokens to the expected standard above-threshold rates, so the guard cannot be silently reintroduced for either the cache_read or cache_creation path.
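Why the `if key in model_info` guard defeated the fallback can be seen in a small sketch. The helper names and the `_priority` key suffix below are assumptions for illustration, not the actual _get_service_tier_cost_key / _get_cost_per_unit code:

```python
from typing import Dict, Optional


def tier_key(base_key: str, service_tier: Optional[str]) -> str:
    # Assumed naming: the service tier is appended as a suffix.
    return f"{base_key}_{service_tier}" if service_tier else base_key


def cost_per_unit(
    model_info: Dict[str, float], base_key: str, service_tier: Optional[str]
) -> Optional[float]:
    # Unconditional lookup with service_tier-to-base fallback.
    key = tier_key(base_key, service_tier)
    if key in model_info:
        return model_info[key]
    return model_info.get(base_key)


def guarded_cost_per_unit(
    model_info: Dict[str, float], base_key: str, service_tier: Optional[str]
) -> Optional[float]:
    # The guard short-circuits before the fallback can strip the suffix.
    key = tier_key(base_key, service_tier)
    if key in model_info:
        return cost_per_unit(model_info, base_key, service_tier)
    return None
```

For a model that ships only the standard above-threshold rate, the guarded variant returns nothing for a priority request, while the unconditional variant falls back to the standard above-threshold rate.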
* fix(presidio): skip pre-call masking when guardrail is logging_only (#30461)
The Presidio pre-call hook masked the live request unconditionally, ignoring
the configured event hook. With mode: logging_only the masked request reached
the model, so its response echoed anonymization tokens (e.g. <PERSON>) instead
of the real output. Gate async_pre_call_hook on should_run_guardrail, matching
every other guardrail; logging_only masking still happens via async_logging_hook.
* fix(router): resolve list unhashable crash on model alias (#30464)
* fix(router): resolve list unhashable crash on model alias
Fixes the fallback parsing logic that mistakenly categorized standard array fallback definitions as override dictionaries when a deployment alias matches the literal string 'model'.
Closes https://github.com/BerriAI/litellm/issues/30459
* fix(router): address greptile review for fallback parsing edge cases
- Resolves ambiguity in standard vs override fallback dictionaries by iterating over all items and validating that no mapped litellm param resolves to a non-list type.
- Adds regression tests in test_router_order_fallback.py to prevent unhashable type crash from silently re-entering the codebase.
* chore(router): format code with black to pass CI
* fix(hosted_vllm): remove thinking_blocks and convert list content to strings (#30475)
* fix: hosted_vllm remove thinking_blocks and convert list content to strings
vLLM endpoints reject assistant messages with thinking_blocks converted
to content list blocks. This change removes thinking_blocks entirely
and converts any list content back to strings.
This fixes BadRequestError when using Claude Code with hosted_vllm
models that pass thinking_blocks in messages.
* fix(hosted_vllm): address Greptile review feedback
- Join multiple text blocks with newline instead of empty string
- Always set content to string (never None) to avoid vLLM validation errors
* fix(hosted_vllm): update chat transformation to clean assistant messages
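The message cleanup described above can be sketched like this. `clean_assistant_message` is a hypothetical helper, and the block shape (`{"type": "text", "text": ...}`) is assumed from the usual OpenAI-style content list rather than taken from the transformation code:

```python
from typing import Any, Dict


def clean_assistant_message(message: Dict[str, Any]) -> Dict[str, Any]:
    # Copy without thinking_blocks; the input message is left untouched.
    cleaned = {k: v for k, v in message.items() if k != "thinking_blocks"}
    content = cleaned.get("content")
    if isinstance(content, list):
        texts = [
            block.get("text", "")
            for block in content
            if isinstance(block, dict) and block.get("type") == "text"
        ]
        # Join multiple text blocks with a newline, not an empty string.
        cleaned["content"] = "\n".join(texts)
    elif content is None:
        # Always a string, never None, to avoid validation errors downstream.
        cleaned["content"] = ""
    return cleaned
```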
* fix: re-raise exception instead of silently dropping MCP team permissions (#30477)
* fix: re-raise exception instead of silently dropping MCP team permissions
When MCPRequestHandler.get_allowed_mcp_servers raises, the broad except
was swallowing the error and returning only allow_all_server_ids,
silently discarding all team-level object_permission grants.
Fixes #30476
* fix: log full traceback when MCP permission lookup fails
Uses verbose_logger.exception() instead of warning() so operators
can see the full traceback when team-level object_permission grants
are dropped due to an internal error in get_allowed_mcp_servers.
Fixes #30476
* fix: remove timezone date expansion in daily-activity aggregation (#29569)
* fix: remove timezone date expansion in daily-activity aggregation
Single-day spend queries from non-UTC timezones over-counted by ~2x
because the previous implementation widened the SQL date range by a
full UTC day on whichever side the offset pointed. Spend is bucketed
in whole-UTC-day rows in LiteLLM_DailyUserSpend, so the expansion
pulled an extra 24h of unrelated bucket data per boundary.
Concretely on IST (UTC+5:30, offset -330): a single-day query for
2026-05-29 was rewritten to date >= 2026-05-28 AND date <= 2026-05-29
and returned spend across both UTC days. Sums of single-day queries
across a 5-day window then exceeded the equivalent multi-day aggregate
by ~50%, which is mathematically impossible.
Treat the local date range as the UTC date range. The aggregation
table has no hour-level granularity, so any conversion using only
date arithmetic must round to whole UTC days; the previous fix turned
that boundary slop into systematic over-counting. Pass-through trades
a small one-time slop at each end of the range for correct, monotonic,
additive results across single-day and multi-day queries.
Repro from production: bedrock/global.anthropic.claude-opus-4-8 over
2026-05-29 to 2026-06-02, IST timezone:
- 5-day aggregate: $701.39 / 1,831 reqs
- Sum of 5 single-day queries: $1,070.94 / 2,755 reqs
- Excess (was 1.527x): now matches within boundary slop
Adds regression tests in TestAdjustDatesForTimezone and
TestBuildAggregatedSqlQuery that pin the pass-through behavior and
the additivity invariant for any future implementation.
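A toy illustration of the over-counting and the additivity invariant, assuming spend is stored as one bucket per UTC day. The bucket values and the helper names (`spend_between`, `expanded_range`, `pass_through_range`) are made up for the example, and the positive-offset branch of `expanded_range` is an assumption about the old behavior.

```python
from datetime import date, timedelta
from typing import Dict, Tuple

# Hypothetical per-UTC-day spend buckets (stand-in for the daily spend table).
DAILY_SPEND: Dict[date, float] = {
    date(2026, 5, 28): 10.0,
    date(2026, 5, 29): 20.0,
    date(2026, 5, 30): 30.0,
}


def spend_between(start: date, end: date) -> float:
    """Sum whole-day buckets for an inclusive date range."""
    return sum(v for d, v in DAILY_SPEND.items() if start <= d <= end)


def expanded_range(start: date, end: date, offset_minutes: int) -> Tuple[date, date]:
    """Old behavior as described: widen by one UTC day on the offset's side."""
    if offset_minutes < 0:
        return start - timedelta(days=1), end
    if offset_minutes > 0:  # assumed mirror case
        return start, end + timedelta(days=1)
    return start, end


def pass_through_range(start: date, end: date, offset_minutes: int) -> Tuple[date, date]:
    """New behavior: treat the local date range as the UTC date range."""
    return start, end
```

With an offset of -330, the expanded single-day query for 2026-05-29 also pulls in the 2026-05-28 bucket, so single-day results no longer sum to the multi-day aggregate. The pass-through version keeps them additive.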
* ci: rerun checks on litellm_oss_branch base
---------
Co-authored-by: Sameer Kankute <sameer@berri.ai>
* fix: buffer native gemini sse frames (#30225)
* fix: buffer native gemini sse frames
* fix: scope native gemini sse buffering
* fix: check raw sse residual buffer size
* feat: updated openrouter provider to map max level to xhigh (#28881)
* feat(proxy): allow use_redis_transaction_buffer without redis cache (#28764)
* feat(proxy): allow use_redis_transaction_buffer without redis cache
* fix(proxy): require host or url for standalone buffer redis
* fix(mcp): fail closed when scope filter resolves to no servers (#30353)
`_get_allowed_mcp_servers_from_mcp_server_names` returned the caller's full
allowed-server set when the requested `mcp_servers` list (path- or
header-derived) resolved to nothing. URL/header namespacing therefore
appeared to work even when the requested name was unknown or the caller had
no grant — `/mcp/<typo>/` silently exposed every server the key could reach.
Fail closed instead: when `mcp_servers` is explicitly provided but nothing
resolves, return an empty list. The `mcp_servers=None` path (no scope
requested) keeps its existing behavior.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
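A minimal sketch of the fail-closed rule, assuming server names are plain strings. `filter_allowed_servers` is a hypothetical stand-in, not the body of `_get_allowed_mcp_servers_from_mcp_server_names`.

```python
from typing import List, Optional


def filter_allowed_servers(
    allowed: List[str], requested: Optional[List[str]]
) -> List[str]:
    """Scope the caller's allowed servers to the requested names."""
    if requested is None:
        # No scope requested: keep the existing behavior.
        return list(allowed)
    wanted = set(requested)
    # Explicit scope: an unknown or ungranted name yields an empty list
    # rather than falling back to the full allowed set.
    return [server for server in allowed if server in wanted]
```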
* fix(token-counter): handle Anthropic tool_reference blocks to stop dropped spend logs (#30302)
* fix(token-counter): handle Anthropic tool_reference blocks to stop dropped spend logs
`token_counter` did not know about Anthropic tool-search `tool_reference`
content blocks, a lightweight pointer to a deferred tool that shows up as
`{"type": "tool_reference", "tool_name": ...}`. When such a block appeared in
message content, `_count_content_list` fell through to its catch-all branch and
raised `Invalid content item type: tool_reference`.
On the streaming `anthropic_messages` proxy path that exception nulls
`response_cost`, which makes the proxy drop the entire SpendLogs row. The result
is a silent cost undercount on any tool-search traffic; the request succeeds for
the caller but the spend is never recorded.
This adds a `tool_reference` branch that counts the referenced `tool_name` (the
full tool definition is already counted via the `tools` param, so only the name
is added here) and handles an empty/missing name gracefully. The catch-all error
message is updated to list `tool_reference` among the expected types.
A regression test asserts that a message containing a `tool_reference` block no
longer raises and returns a positive token count, and that an empty `tool_name`
is handled without error.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(token-counter): collapse explicit None tool_name to empty string
In _count_content_list, c.get("tool_name", "") returns None when the
key is present with an explicit None value, and str(None) == "None"
which is truthy, causing a spurious token to be counted. Use
c.get("tool_name") or "" so both a missing key and an explicit None
collapse to an empty string and are skipped.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
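The difference between the two lookups can be shown in isolation. The function names below are hypothetical; only the `dict.get` expressions come from the commit message.

```python
from typing import Any, Dict


def tool_name_old(block: Dict[str, Any]) -> str:
    # dict.get only falls back to the default when the key is missing,
    # so an explicit None survives and str() turns it into "None".
    return str(block.get("tool_name", ""))


def tool_name_new(block: Dict[str, Any]) -> str:
    # "or" collapses both a missing key and an explicit None to "".
    return block.get("tool_name") or ""
```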
* test(token-counter): cover catch-all for unknown content block type
Adds a regression test that calls `_count_content_list` with an unrecognized
content block type and asserts it raises `ValueError` whose message names the
offending type and lists `tool_reference` among the supported types. This
exercises the previously uncovered catch-all branch (codecov patch gap) and
pins the error contract.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(token-counter): cover tool_reference on the spend/cost and streaming paths
Adds end-to-end regression tests that exercise the real public entry points
(`completion_cost` and `stream_chunk_builder`), not just the private
`_count_content_list` helper, for Anthropic tool-search `tool_reference`
content blocks.
These pin the actual bug the fix addresses: before the fix the `tool_reference`
block raised out of `completion_cost` -> the proxy logging layer nulled
`response_cost` and the spend callback dropped the SpendLogs row (silent cost
undercount on all tool-search traffic); and `stream_chunk_builder` swallowed the
same raise and collapsed prompt_tokens to 0. With the fix, cost is positive and
prompt_tokens are counted. Verified: 3 fail without the fix, 3 pass with it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(cost): add cost mapping for deepseek-v4-flash and deepseek-v4-pro (#27056)
* feat(cost): add cost mapping for deepseek-v4-flash and deepseek-v4-pro
Adds pricing entries for the two new DeepSeek V4 models released on
2026-04-24, for both bare model names and the deepseek/ provider prefix.
Prices sourced from https://api-docs.deepseek.com/quick_start/pricing:
- deepseek-v4-flash: $0.14/M input, $0.28/M output
- deepseek-v4-pro: $1.74/M input, $3.48/M output
Cache hit price set to 1/10 of input (per DeepSeek docs).
Context window: 1M tokens for both models.
Closes #26709
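A worked example of the arithmetic behind these entries, using only the per-million prices quoted in this commit message and the stated 1/10 cache-hit ratio. `request_cost` and `PRICES_PER_MILLION` are illustrative names, not part of the cost map or of litellm's cost calculator.

```python
from typing import Dict

# USD per million tokens, as quoted in this commit message.
PRICES_PER_MILLION: Dict[str, Dict[str, float]] = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
}


def request_cost(
    model: str, prompt_tokens: int, cached_tokens: int, completion_tokens: int
) -> float:
    """Cost in USD, pricing cache hits at 1/10 of the input rate."""
    prices = PRICES_PER_MILLION[model]
    input_per_token = prices["input"] / 1_000_000
    output_per_token = prices["output"] / 1_000_000
    cache_hit_per_token = input_per_token / 10
    uncached_tokens = prompt_tokens - cached_tokens
    return (
        uncached_tokens * input_per_token
        + cached_tokens * cache_hit_per_token
        + completion_tokens * output_per_token
    )
```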
* fix(cost): update backup registry for deepseek-v4
* style: remove print statement from deepseek-v4 test
* fix: update deepseek-v4 prices to active discounted rates
* test: update deepseek-v4 prices in tests to match active discounted rates
* fix(deepseek): remove duplicate entries and update backup registry to active discounted rates
* fix: update max_output_tokens to 384K for deepseek-v4
* fix: correctly restore upstream models accidentally dropped during merge
* fix(tests): resolve failing claude-fable-5 and reasoning tests by safely updating cost map
- Pulled the latest cost map from upstream staging
- Safely appended deepseek-v4 mapping without deleting duplicate keys or formatting via json.dump
* fix(tests): correct deepseek model cache prices and update JSON schema
- Appended both prefixed and bare deepseek-v4 models to satisfy test assertions
- Corrected deepseek-v4-pro expected cache hit and token prices based on latest review updates
- Added missing realtime endpoint to test_utils.py INTENDED_SCHEMA
* fix: remove accidental azure/gpt-realtime-whisper addition
---------
Co-authored-by: Dushyant Acharya <dushyantacharya@Dushyants-MacBook-Pro.local>
* feat(key/info): expose per-model budget usage in /key/info response (#30394)
* feat(key/info): expose per-model budget usage in /key/info response
Add model_max_budget_usage to /key/info and /v2/key/info responses.
For each model in model_max_budget, reads current-period spend from
the same DualCache used by the budget enforcer and returns it alongside
the limit and time period so callers can see how much of each model
budget has been consumed in the active window.
* test(key/info): add coverage for model_max_budget_usage in v1 and v2 endpoints
Add tests for the model_max_budget_usage enrichment in both info_key_fn
and info_key_fn_v2, covering the budget-present path, the empty-budget
path, and the v2 batch endpoint.
* fix(key/info): source model_max_budget current_spend from SpendLogs instead of DualCache
The DualCache used for enforcement is ephemeral and only populated when budget metadata
is present at request time. Fall back to a direct LiteLLM_SpendLogs DB aggregation
using the budget period window (budget_reset_at - budget_duration) for accurate reporting.
Also fall back to litellm_budget_table.model_max_budget when the key's top-level field
is empty, and round current_spend to 4 decimal places.
* test(key/info): cover remaining branches in model_max_budget_usage helpers
Add unit tests for: prisma_client=None early return, DB query exception swallowing,
invalid budget_duration handled by _compute_budget_period_start, budget_reset_at
received as a datetime object (Prisma native type), max_seconds=0 early return, and
skipping models that lack a budget_duration. Also remove an unreachable except branch
where fromisoformat would fail after _compute_budget_period_start already validated the
same value.
* test(key/info): cover except path for unparseable per-model budget_duration
* fix(key/info): compute per-model rolling windows in model_max_budget_usage
Each model in model_max_budget now gets its own time window derived from
its own budget_duration, rather than sharing a single window computed as
the max (or the budget table's reset_at). This matches what the DualCache
enforcer actually tracks and prevents current_spend from being inflated
for models with shorter windows.
_query_model_spend_for_period is refactored to accept a model filter
(handling provider-prefix variants in SQL) and return a float directly.
_compute_budget_period_start and the budget_table window path are removed
as they are no longer needed.
* refactor(model_max_budget_limiter): remove dead get_current_period_spend method
* refactor(key/info): strip synthetic formatter noise from PR diff
Restore key_management_endpoints.py and test_key_management_endpoints.py
to origin/litellm_internal_staging, then re-apply only the intentional
additions: _query_model_spend_for_period, _build_model_max_budget_usage,
the two endpoint patches (info_key_fn / info_key_fn_v2), and the new
test suite. The previous commits had reformatted ~300 pre-existing lines
across both files, making the functional diff unreadable.
* test(key/info): cover empty-rows path in _query_model_spend_for_period
* fix(model_max_budget_limiter): guard BudgetConfig construction inside try/except
A malformed model entry in the DB (e.g. non-numeric max_budget from a
manually edited or migrated row) caused BudgetConfig(**budget_info) to
raise a Pydantic ValidationError outside any exception guard, surfacing
as a 500 for the entire /key/info or /v2/key/info call. Merging both
try/except blocks into one ensures bad entries are silently skipped,
consistent with the existing duration_in_seconds guard.
* fix: don't stack provider prefix on wildcard models with a custom prefix (#30360)
* fix: don't stack provider prefix on wildcard models with a custom prefix
get_known_models_from_wildcard expanded provider-prefixed model ids (e.g.
"ollama/gemma3:1b" from get_provider_models) by prepending the wildcard's
prefix whenever the id did not already start with it. With a custom wildcard
prefix such as "ollama_server1/*" (used to distinguish multiple Ollama
instances), this produced "ollama_server1/ollama/gemma3:1b", which is
uncallable and breaks /v1/models.
When the expanded id already carries a provider prefix, replace it with the
wildcard's prefix instead of stacking both. Matching-prefix and bare-model
cases are unchanged.
Fixes #30358
* fix: only strip a known provider prefix when expanding custom wildcard prefixes
The wildcard expansion replaced the leading slash segment of every expanded id with the wildcard prefix whenever the id did not already start with it. For ids whose first segment is an org rather than a litellm provider (for example a provider returning "meta-llama/Llama-3-8B" with no outer provider prefix), that dropped the org and produced an uncallable id.
Only strip the leading segment when it is a recognized provider (membership in LlmProviders); otherwise keep it and just prepend the wildcard prefix. Provider-prefixed ids like "ollama/gemma3:1b" still have their prefix replaced, so the original fix is unchanged for known providers.
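A minimal sketch of the expansion rule, assuming a small hard-coded provider set in place of LlmProviders. `expand_wildcard_model` and `KNOWN_PROVIDERS` are hypothetical names, not the code in get_known_models_from_wildcard.

```python
from typing import Set

# Stand-in for membership in LlmProviders (assumption for this sketch).
KNOWN_PROVIDERS: Set[str] = {"ollama", "openai", "anthropic"}


def expand_wildcard_model(wildcard_prefix: str, model_id: str) -> str:
    """Map an expanded model id onto the wildcard's prefix."""
    if model_id.startswith(wildcard_prefix + "/"):
        # Matching prefix: unchanged.
        return model_id
    head, sep, tail = model_id.partition("/")
    if sep and head in KNOWN_PROVIDERS:
        # Known provider prefix: replace it instead of stacking both.
        return f"{wildcard_prefix}/{tail}"
    # Bare model or org-style first segment: keep it and prepend the prefix.
    return f"{wildcard_prefix}/{model_id}"
```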
* address greptile review feedback: log dropped non-text vLLM assistant content blocks (greploop iteration 1)
* fix(ci): format credential_form_helpers test + regenerate dashboard schema.d.ts
* fix(proxy): raise litellm.BadRequestError for missing model param
When no model is passed, route_request now raises a litellm.BadRequestError
('Missing model parameter') instead of falling through to ProxyModelNotFoundError.
This keeps the missing-param error clear and independent of router wildcard
state. Unknown (non-empty) model names still raise ProxyModelNotFoundError.
* Revert "fix(proxy): raise litellm.BadRequestError for missing model param"
This reverts commit 9240da403c0432a80473d6c4677ddb7e2bad7420.
* Revert "fix(router): clean pattern_router state on upsert/delete (#29601)"
This reverts commit ad4e6e2395620ea6d2fe38089a54cde160720de2.
* fix: correct streaming and key budget usage reporting
* fix(hosted_vllm): type assistant tool_calls to satisfy mypy
* feat: aws secret manager cross region replication (#30368)
* feat(aws-secret-manager): add replica_regions cross-region replication after CreateSecret
When store_virtual_keys is enabled, async_write_secret() only wrote secrets
to the primary AWS region. Multi-region proxy deployments had no built-in
way to synchronize virtual key secrets across regions through LiteLLM,
requiring external replication mechanisms.
Add replica_regions support to AWSSecretsManagerV2:
- New replica_regions field in KeyManagementSettings (types/secret_managers/main.py)
- New async_replicate_secret() method that calls ReplicateSecretToRegions API
- async_write_secret() calls replication after successful CreateSecret
- Replication failure is logged as a warning but does NOT fail key creation
- load_aws_secret_manager() forwards replica_regions from key_management_settings
Configuration example:
key_management_settings:
store_virtual_keys: true
replica_regions:
- us-west-2
- eu-west-1
When replica_regions is omitted or empty, behavior is unchanged.
* test(aws-secret-manager): restore litellm.secret_manager_client after test to prevent state pollution
* test(aws-secret-manager): add coverage for HTTP error and replication exception paths
* fix: restore litellm.secret_manager_client global state in test; add replication log proof
- Global state in test_load_aws_secret_manager_passes_replica_regions was
already guarded with try/finally (committed in previous pass); no further
change needed for Fix 1.
- Fix 2: add verbose_logger.info("ReplicateSecretToRegions called …") inside
async_replicate_secret so callers get an observable INFO log line whenever
replication fires.
- Add test_replication_fires_on_create: calls async_replicate_secret directly
with caplog.at_level(INFO, logger="LiteLLM") and asserts "ReplicateSecretToRegions"
appears in the captured log output, proving the code path executes.
* fix: pass request to streaming generators
* fix(hosted-vllm): preserve assistant structured content
* fix(hosted_vllm): satisfy mypy on preserved structured content assignment
* chore: resolve litellm_internal_staging merge conflicts for #30527 (#30554)
* chore(codecov): add Batches, Videos, and Realtime components (#30517)
* chore(codecov): add Batches, Videos, and Realtime components
Define per-feature Codecov components so PR comments track coverage
for batch API, video generation, and realtime streaming paths.
Co-authored-by: Cursor <cursoragent@cursor.com>
* chore(codecov): use wildcard path for Batches proxy component
Align batches_endpoints glob with Videos, Realtime, and Proxy_Authentication.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(batches): move orphan tests into tests/test_litellm for CI coverage (#30510)
Four batch-related tests lived under tests/litellm/ and were never picked
up by GitHub Actions. Relocate them and fix gemini multimodal e2e to use
the batchEmbedContents path expected for gemini/ provider.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(guardrails): run pre_call hook once for model-level guardrails (#30543)
* fix(guardrails): run pre_call hook once for model-level guardrails
A CustomGuardrail attached to a deployment via litellm_params.guardrails
gets its async_pre_call_hook invoked twice per request: once by the proxy
pre-call loop and again by async_pre_call_deployment_hook after the router
spreads the model-level guardrails into the top-level request kwargs.
Record in request metadata that the proxy pre-call loop already ran a given
guardrail, and have the deployment hook skip it when the marker is present.
Direct-SDK usage never runs the proxy loop, so the deployment hook stays the
sole invocation there and still fires exactly once.
The marker key is stripped from untrusted caller metadata so a request body
cannot suppress a model-only guardrail by pre-seeding it.
* fix(guardrails): mark pre_call dedup on the post-hook request data
Record the exactly-once marker after async_pre_call_hook runs, on the data
object that flows downstream, rather than before it. A guardrail whose hook
returns a brand-new request dict (instead of mutating or spreading the one it
received) would otherwise discard the marker, letting the deployment hook
re-run the guardrail a second time.
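A rough sketch of the exactly-once marker, assuming request data is a dict with a `metadata` dict. The key name `_guardrails_already_run` and all three helpers are hypothetical; the real marker key and hook wiring are not shown in this log.

```python
from typing import Any, Dict

MARKER_KEY = "_guardrails_already_run"  # hypothetical key name


def sanitize_caller_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Strip the marker from untrusted caller metadata."""
    return {k: v for k, v in metadata.items() if k != MARKER_KEY}


def mark_guardrail_ran(data: Dict[str, Any], guardrail_name: str) -> None:
    """Record the marker on the data returned by the pre-call hook."""
    ran = data.setdefault("metadata", {}).setdefault(MARKER_KEY, [])
    if guardrail_name not in ran:
        ran.append(guardrail_name)


def deployment_hook_should_run(data: Dict[str, Any], guardrail_name: str) -> bool:
    """The deployment hook skips guardrails the proxy loop already ran."""
    return guardrail_name not in data.get("metadata", {}).get(MARKER_KEY, [])
```

Marking after the hook returns means a hook that hands back a brand-new dict still ends up carrying the marker.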
* fix(guardrails): stop re-initializing DB guardrails on every poll (#30542)
* fix(guardrails): stop re-initializing DB guardrails on every poll
InMemoryGuardrailHandler._has_guardrail_params_changed compared the
in-memory LitellmParams against the raw dict loaded from the DB. The
in-memory side carries every field default and coerces enums via
model_dump(), while the DB side only holds the keys originally stored,
so the two shapes never compared equal and the guardrail was rebuilt on
every poll cycle.
Each rebuild created a fresh instance, but delete_in_memory_guardrail
only removed the old callback from litellm.callbacks. Request handling
promotes guardrail callbacks into the success/failure/async lists, so
the previous instance stayed referenced there and instances accumulated.
Normalize both sides through LitellmParams(...).model_dump() before
diffing, and purge the callback from every callback list on delete.
* refactor(guardrails): narrow params-normalization fallback to ValidationError
The comparison normalizer caught a bare Exception and silently fell back
to the raw dict, which hid the cause and quietly degraded the affected
guardrail back to re-initializing on every poll. Catch only the
ValidationError that LitellmParams construction can raise, log a warning
so the offending row is diagnosable, and let any other error surface
instead of being swallowed.
* refactor(callbacks): add remove_callback_from_all_lists helper to manager
Move the knowledge of which callback lists a callback can be promoted
into out of the guardrail registry and into LoggingCallbackManager, where
the rest of the callback-list bookkeeping already lives. delete_in_memory_guardrail
now delegates to the new helper instead of iterating the lists itself.
* chore(oss): litellm oss staging 150626 (#30463)
* fix(pricing): add GitHub Copilot MAI Code Flash pricing (#30415)
* fix(pricing): add GitHub Copilot MAI Code Flash pricing
Add GitHub Copilot pricing entries for MAI-Code-1-Flash and the internal Copilot CLI model name so cost calculation can price input, cached input, and output tokens.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* test(pricing): cover GitHub Copilot MAI Code Flash pricing
Add regression coverage for both GitHub Copilot MAI-Code-1-Flash model names, including cached input pricing, chat endpoint metadata, and cost_per_token arithmetic.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix(router/proxy): propagate completed_response through FallbackResponsesStreamWrapper for streaming /v1/responses container ownership (#30210) (#30213)
* fix(router/proxy): propagate completed_response through FallbackResponsesStreamWrapper for streaming /v1/responses container ownership (#30210)
#28990 added ownership recording for streaming /v1/responses via
_wrap_responses_stream_for_container_ownership, which reads
`getattr(stream_response, 'completed_response', None)` to extract the
ResponsesAPIResponse. The unit test bypassed the Router, so it never
exercised the production wrapping path.
Through the Router (every proxy deployment), the stream is wrapped by
FallbackResponsesStreamWrapper (router.py:2527). Its __init__ set
`self.completed_response = None` and __anext__ only forwarded chunks
— the inner source iterator's terminal event never bubbled up to the
attribute the ownership hook reads, so the hook silently recorded
nothing and every follow-up /v1/containers/<id>/files call returned
403 for non-admin keys.
This commit:
- router.py: pre-resolves the responses-API terminal event tuple
(response.completed / .incomplete / .failed) once per
_aresponses_streaming_iterator call, and has the wrapper's __anext__
sniff each forwarded chunk's .type. First terminal event hit gets
stored on the wrapper's completed_response. Iterator-agnostic — works
for source_iterator AND any future wrapper.
- common_request_processing.py: when _extract_completed_responses_response
returns None we now warn instead of silently skipping. Reporter on
#30210 lost a day to this exact silent skip; the warning surfaces
future regressions of the same shape directly in operator logs.
Fixes #30210
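A minimal sketch of the chunk-sniffing idea, assuming each streamed event exposes a `.type` attribute. `SniffingStreamWrapper`, `demo_source`, and `consume` are hypothetical names; this is not the FallbackResponsesStreamWrapper implementation.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, AsyncIterator, List, Optional

TERMINAL_EVENT_TYPES = (
    "response.completed",
    "response.incomplete",
    "response.failed",
)


class SniffingStreamWrapper:
    """Forward chunks unchanged and remember the first terminal event."""

    def __init__(self, source: AsyncIterator[Any]) -> None:
        self._source = source
        self.completed_response: Optional[Any] = None

    def __aiter__(self) -> "SniffingStreamWrapper":
        return self

    async def __anext__(self) -> Any:
        chunk = await self._source.__anext__()
        is_terminal = getattr(chunk, "type", None) in TERMINAL_EVENT_TYPES
        if self.completed_response is None and is_terminal:
            self.completed_response = chunk
        return chunk


async def demo_source() -> AsyncIterator[Any]:
    # Fake events standing in for a real responses-API stream.
    yield SimpleNamespace(type="response.output_text.delta", delta="hi")
    yield SimpleNamespace(type="response.completed", response={"id": "resp_1"})


async def consume(wrapper: SniffingStreamWrapper) -> List[str]:
    return [chunk.type async for chunk in wrapper]
```

Because the wrapper only inspects what it forwards, it does not depend on which inner iterator produced the chunks.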
* fix(router): type-ignore wrapper getattr-defaults; broaden ownership-skip warning
CI lint (mypy) flagged the three pre-existing getattr(..., None) assignments
in FallbackResponsesStreamWrapper.__init__:
router.py:2564 self.response = getattr(source_iterator, 'response', None)
router.py:2565 self.model = getattr(source_iterator, 'model', None)
router.py:2566 self.logging_obj = getattr(..., None)
Those lines also exist on litellm_internal_staging and pass mypy there.
Adding the typed terminal-event tuple above the class made the function
body more narrowable, which surfaced the pre-existing mismatch — base
class declares non-Optional types but the bridge path
(LiteLLMCompletionStreamingIterator) legitimately omits these. Keep
the None fallback and silence with type: ignore[assignment].
Greptile 4/5 note: the ownership-skip warning hard-named code_interpreter
which misleads operators when a non-code_interpreter stream aborts.
Generalize to 'any tool container (e.g. code_interpreter)'.
* fix(register_model): drop synthesized zero costs to preserve sparse entries (#30198) (#30201)
* fix(register_model): drop synthesized zero costs to preserve sparse entries (#30198)
get_model_info synthesizes input_cost_per_token / output_cost_per_token = 0
when they are absent from the raw entry (the price-unknown and free cases
share the same representation). register_model then merges that result back
into litellm.model_cost, which flips a sparse entry from 'no cost keys'
(priced via model name) to 'cost keys = 0' (free).
That defeats _is_cost_explicitly_configured (#24949) on re-registration:
_is_model_cost_zero returns True, common_checks skips every tag / key /
team / user / org budget check for the group, and over-budget traffic
keeps returning 200. Spend keeps recording because cost calc still resolves
by model name, so the symptom is silent and only triggers on the second
register_model pass (router rebuild, /model/update, config sync).
Mirror the existing litellm_provider-None guard one block above and pop
the cost fields from the synthesized result when they are absent from the
raw entry and not in the caller's value. Caller-provided zeros (genuinely
free models, BYOK overrides) are preserved.
Fixes #30198
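A simplified sketch of the guard, assuming entries are plain dicts. `merge_registered_entry` is a hypothetical helper that mirrors the described rule; it is not the register_model code.

```python
from typing import Any, Dict

COST_KEYS = ("input_cost_per_token", "output_cost_per_token")


def merge_registered_entry(
    raw_entry: Dict[str, Any],
    synthesized: Dict[str, Any],
    caller_value: Dict[str, Any],
) -> Dict[str, Any]:
    """Drop synthesized cost fields unless the raw entry or caller set them."""
    merged = dict(synthesized)
    for key in COST_KEYS:
        if key not in raw_entry and key not in caller_value:
            # The zero was synthesized, not configured: keep the entry sparse.
            merged.pop(key, None)
    # Caller-provided values, including explicit zeros, are preserved.
    merged.update(caller_value)
    return merged
```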
* fix(register_model): switch _raw_entry to is-None checks + drop dead test assertion
Greptile #30201 review notes:
- the `or`-chain in the raw-entry lookup treated an empty dict (a key
with no fields) as falsy and fell through to the second arm — replace
with explicit `is None` checks so a present-but-empty entry is still
taken at face value.
- the first assertion in `test_router_double_init_keeps_db_model_entry_sparse`
used `in (None, 0)` which passes under the bug condition (cost = 0
matches the tuple); the strong follow-up assertion already covers
every shape, so drop the dead branch.
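The `or`-chain pitfall from the first review note can be reproduced on its own. Both lookup helpers and the registry keys below are hypothetical; they only illustrate why a present-but-empty entry needs an explicit `is None` check.

```python
from typing import Any, Dict, Optional

Registry = Dict[str, Dict[str, Any]]


def lookup_with_or(registry: Registry, key: str, fallback_key: str) -> Optional[Dict[str, Any]]:
    # An empty dict is falsy, so this falls through to the fallback entry.
    return registry.get(key) or registry.get(fallback_key)


def lookup_with_is_none(registry: Registry, key: str, fallback_key: str) -> Optional[Dict[str, Any]]:
    entry = registry.get(key)
    if entry is None:
        # Only a genuinely missing key triggers the fallback.
        entry = registry.get(fallback_key)
    return entry
```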
* fix(bedrock mantle): use unique function-call id for responses->chat tool calls (#30426)
* fix(bedrock mantle): use unique function-call id for responses->chat tool calls
...
* fix(bedrock mantle): scope unique tool-call id fallback to degenerate call_id
The previous revision preferred the Responses item id for every tool call, which broke providers (and existing tests) where call_id is a unique, canonical correlation key. Restrict the fallback to the degenerate index-based call_id that Bedrock Mantle returns (call_0, call_1, ... resetting per response) and keep call_id otherwise. Revert the change to the OUTPUT_ITEM_DONE streaming handler, whose tool_call_chunk is never emitted (dead code, per review). Extend the regression tests to assert a normal call_id is preserved.
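A small sketch of the scoped fallback, assuming the degenerate form is exactly `call_` followed by digits. `pick_tool_call_id` is a hypothetical helper, and its handling of a missing call_id is an assumption that the commit message does not state.

```python
import re
from typing import Optional

# Index-based ids such as call_0, call_1 that reset per response.
_DEGENERATE_CALL_ID = re.compile(r"call_\d+")


def pick_tool_call_id(call_id: Optional[str], item_id: Optional[str]) -> Optional[str]:
    """Prefer call_id unless it is missing or degenerate."""
    if call_id and not _DEGENERATE_CALL_ID.fullmatch(call_id):
        return call_id
    # Fall back to the Responses item id; keep call_id if no item id exists.
    return item_id or call_id
```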
* fix(router): preserve azure_ad_token through CredentialLiteLLMParams for /v1/files + batches (#30235) (#30241)
* fix(router): preserve azure_ad_token through CredentialLiteLLMParams for /v1/files + batches (#30235)
Router.get_deployment_credentials_with_provider re-validates a
deployment's litellm_params through CredentialLiteLLMParams before
handing them to file/batch/passthrough callers:
return CredentialLiteLLMParams(
**deployment.litellm_params.model_dump(exclude_none=True)
).model_dump(exclude_none=True)
Any field NOT declared on CredentialLiteLLMParams gets silently dropped
on the way through. azure_ad_token was undeclared, so Azure deployments
using OAuth/M2M (azure_ad_token instead of a static api_key) silently
lost their token at the files endpoint and the proxy returned:
Missing credentials. Please pass one of api_key, azure_ad_token,
azure_ad_token_provider, ...
Declare azure_ad_token on CredentialLiteLLMParams alongside api_key /
api_base / api_version so it rides through the round-trip. Static-key
deployments stay unaffected (Optional, default None, dropped by
exclude_none=True). Provider-callable (azure_ad_token_provider) is a
separate concern and out of scope here.
Fixes #30235
* fix(ui-types): regenerate schema.d.ts for new azure_ad_token field
CI's 'Verify schema.d.ts matches the proxy OpenAPI spec' check
auto-detected the new field and emitted the exact diff to apply.
Two schemas had `aws_secret_access_key` from CredentialLiteLLMParams,
both get the new azure_ad_token marker next to it.
* fix(proxy): org_admin with own user_id now sees all org teams on /v2/team/list (#30247)
When the UI sends the caller's own user_id (as it does for non-Admin
global roles), _enforce_list_team_v2_access now nulls it out for org
admins so _build_team_list_where_conditions scopes by organization_id
only -- matching the legacy /team/list behavior and the documented intent.
Fixes #30215
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* test(vertex_ai): multi-region regression coverage for cachedContents host (#29571) (#29707)
litellm_internal_staging already routes the cachedContents URL through
get_vertex_base_url, fixing the multi-region 404 reported in #29571 —
but carries no test coverage for the actual regression scenario (eu/us
must resolve to the REP host aiplatform.{geo}.rep.googleapis.com).
Add TestContextCachingMultiRegionUrls: parametrized eu/us REP-host
assertions (including absence of the old broken {geo}-aiplatform host),
plus regional (us-central1) and global no-regression checks.
* fix(proxy): close upstream LLM stream when client disconnects mid-stream (#30245)
* fix(proxy): close upstream LLM stream when client disconnects mid-stream
When a streaming client disconnects, Starlette abandons the response
body iterator without calling aclose(), so the proxy's connection to
the upstream backend stays open until garbage collection, which may
never come. The backend (e.g. vLLM) keeps generating into a dead pipe:
small responses drain invisibly into TCP buffers while large ones block
the backend on a full send buffer indefinitely (observed via lsof as an
ESTABLISHED proxy->backend connection minutes after the client left).
create_response now returns a StreamingResponse subclass that closes
both its body iterator and the wrapped upstream-facing generator in a
shielded finally. The upstream generator is closed directly rather than
through a cascade because aclose() on a never-started generator skips
its body, which would make the cascade a no-op when the client
disconnects before the first chunk is sent.
async_streaming_data_generator also gains the same shielded
finally-aclose that async_data_generator in proxy_server.py already
had, covering the Anthropic and Google SSE paths.
With this, killing a streaming client causes the backend to observe the
abort within about a second and free its slot, while completed streams
are unaffected. No flag is needed, unlike the non-streaming opt-in
cancel in #30223: this only releases resources after the client is
already gone and does not change any response a client can observe.
Fixes #30244
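The never-started generator behavior can be checked with the standard library alone. `FakeUpstream`, `body`, `cascade_close`, and `direct_close` are hypothetical names for this demonstration and do not appear in the proxy code.

```python
import asyncio
from typing import Any, AsyncIterator


class FakeUpstream:
    """Stand-in for an upstream stream that owns a connection."""

    def __init__(self) -> None:
        self.closed = False

    def __aiter__(self) -> "FakeUpstream":
        return self

    async def __anext__(self) -> Any:
        raise StopAsyncIteration

    async def aclose(self) -> None:
        self.closed = True


async def body(upstream: FakeUpstream) -> AsyncIterator[Any]:
    try:
        async for chunk in upstream:
            yield chunk
    finally:
        # Only runs if the generator body was entered at least once.
        await upstream.aclose()


async def cascade_close() -> bool:
    upstream = FakeUpstream()
    gen = body(upstream)
    await gen.aclose()  # never started, so the finally block is skipped
    return upstream.closed


async def direct_close() -> bool:
    upstream = FakeUpstream()
    gen = body(upstream)
    await gen.aclose()
    await upstream.aclose()  # close the upstream directly as well
    return upstream.closed
```

Running both coroutines shows that relying on the cascade leaves the upstream open when the client disconnects before the first chunk, while the direct close does not.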
* fix(proxy): close upstream even when body iterator aclose raises BaseException
Addresses the Greptile finding on #30245: the cleanup loop caught only
Exception while the generator-level cleanup catches BaseException, so a
CancelledError or GeneratorExit escaping body_iterator.aclose() would
skip closing the upstream generator. Both sites now use the same scope
and a regression test pins that the upstream is closed even when the
body iterator explodes with a BaseException.
* fix(llms): expose aclose on BaseModelResponseIterator so stream close reaches the provider connection
The response-level close added for #30244 only worked for SDK-based
providers (e.g. openai), whose streams expose aclose all the way down.
Providers served by base_llm_http_handler (hosted_vllm and most modern
transformation-based providers) wrap a bare response.aiter_lines()
generator in BaseModelResponseIterator, which had no aclose or close at
all, and nothing retained the httpx response object; so
CustomStreamWrapper.aclose() silently did nothing and the upstream
connection stayed open. Verified with a vLLM-style mock: with
hosted_vllm/ the backend streamed all 100 chunks to completion after
the client disconnected, while openai/ aborted at chunk 6.
BaseModelResponseIterator now carries an optional http_response and an
aclose() that closes it; make_async_call_stream_helper attaches the
response after building the iterator. With this, hosted_vllm aborts the
backend within ~1.6s of the client dropping, and completed streams are
unaffected.
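The shape of that change can be sketched as follows. This is an assumption-laden illustration rather than the actual BaseModelResponseIterator: ResponseIterator and FakeResponse are hypothetical, with FakeResponse standing in for an httpx response that exposes aiter_lines() and aclose().

```python
import asyncio
from typing import AsyncIterator, Optional


class FakeResponse:
    """Hypothetical stand-in for an httpx streaming response."""

    def __init__(self, lines: list[str]) -> None:
        self._lines = lines
        self.is_closed = False

    async def aiter_lines(self) -> AsyncIterator[str]:
        for line in self._lines:
            yield line

    async def aclose(self) -> None:
        self.is_closed = True


class ResponseIterator:
    """Wraps a bare line generator and optionally remembers the response."""

    def __init__(
        self, stream: AsyncIterator[str], http_response: Optional[FakeResponse] = None
    ) -> None:
        self._stream = stream
        self.http_response = http_response

    def __aiter__(self) -> "ResponseIterator":
        return self

    async def __anext__(self) -> str:
        return await self._stream.__anext__()

    async def aclose(self) -> None:
        # Without a retained response there is nothing to close.
        if self.http_response is not None:
            await self.http_response.aclose()


async def demo() -> tuple[int, bool]:
    response = FakeResponse([f"data: {i}" for i in range(100)])
    iterator = ResponseIterator(response.aiter_lines())
    iterator.http_response = response  # attached after building the iterator

    received = 0
    async for _ in iterator:
        received += 1
        if received == 3:  # simulate the client dropping mid-stream
            break
    await iterator.aclose()
    return received, response.is_closed


received_count, response_closed = asyncio.run(demo())
```

Keeping http_response optional means iterators built without a response still behave as before; aclose() simply becomes a no-op for them.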
---------
Co-authored-by: kursad <kursad.lacin@brado.net>
* feat(anthropic): surface compaction usage iterations data (#27065)
* feat(anthropic): surface compaction usage iterations data
* style: apply black formatting to fix lint checks
* fix(usage): correct usage calculation with cached tokens when using ChatCompletionUsageBlock (#30422)
* fix(usage): correct usage calculation with cached tokens when using ChatCompletionUsageBlock
* fix(usage): optimize test imports
* feat: add fastCRW search provider (#30434)
* feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider (#30203)
* feat(provider): add LibertAI as a JSON-configured OpenAI-compatible provider
* libertai: update served endpoints backup + add mode/matrix tests
Addresses review feedback:
- Add libertai to litellm/provider_endpoints_support_backup.json, the file
actually served by GET /public/supported_endpoints (the root
provider_endpoints_support.json already had it).
- Add tests asserting bge-m3 normalizes to mode='embedding' and that the
served matrix lists libertai. embeddings stays false: the JSON-configured
provider path only wires chat routing (OpenAILike embedding handler is
reached only for literal openai_like/llamafile/lm_studio), matching the
llamagate precedent; bge-m3 remains in the cost map for metadata.
---------
Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com>
* feat(provider): add ModelScope as an OpenAI-compatible provider (#28460)
* add ModelScope API support
* add modelscope api support
* update modelscope model list
* add image-generation support
* update test and multimodal
* fix: address PR review feedback for modelscope provider
* update README
* fix(customer_endpoints): restrict /customer/daily/activity to admin-only (#28849)
* fix(customer_endpoints): restrict /customer/daily/activity to admin-only
* fix(customer_endpoints): check role before prisma_client guard
* fix(custom_guardrail): key disable_global_guardrails takes precedence over team guardrail list (#28563)
* fix(fallbacks): preserve fallback model in SDK fallback responses (#28260)
* fix(fallbacks): preserve fallback model in response when using SDK-level fallbacks
* fix(fallbacks): gate x-litellm-* passthrough to trusted callers only
The previous patch unconditionally let `x-litellm-*` keys bypass the
`llm_provider-` prefix in `process_response_headers`. That function is
also called on raw upstream-provider response headers (e.g. from
`llm_http_handler.py`), so a malicious provider could return
`x-litellm-attempted-fallbacks` and spoof a LiteLLM-internal marker,
bypassing the proxy model-override guard.
Add a `preserve_litellm_internal_headers` flag (default False). Only
`response_metadata.py`, which re-processes the already-built
`_hidden_params["additional_headers"]` dict (LiteLLM-owned), passes
True. Raw provider header callsites keep the default False, so upstream
`x-litellm-*` still gets the `llm_provider-` prefix.
Adds a regression test for the spoofing case and renames the existing
preserve test to make the trusted-path semantics explicit.
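A simplified sketch of the gating described above. prefix_headers is a hypothetical function, not the real process_response_headers, which handles more cases; only the flag name and the two prefixes come from this commit message.

```python
def prefix_headers(
    headers: dict[str, str], preserve_litellm_internal_headers: bool = False
) -> dict[str, str]:
    """Prefix response headers, keeping x-litellm-* only for trusted callers."""
    processed: dict[str, str] = {}
    for name, value in headers.items():
        is_internal = name.lower().startswith("x-litellm-")
        if is_internal and preserve_litellm_internal_headers:
            # Trusted path: the dict was built by LiteLLM itself.
            processed[name] = value
        else:
            # Raw provider headers are always namespaced, including spoofed ones.
            processed[f"llm_provider-{name}"] = value
    return processed


spoofed = {"x-litellm-attempted-fallbacks": "1", "x-request-id": "abc"}

# Raw upstream headers use the default (False), so the marker is neutralized.
untrusted = prefix_headers(spoofed)

# Only the re-processing of LiteLLM-owned headers passes True.
trusted = prefix_headers(spoofed, preserve_litellm_internal_headers=True)
```

Defaulting the flag to False means any new callsite is safe unless it explicitly opts in.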
* fix(fallbacks): ignore preserve_litellm_internal_headers for raw httpx.Headers inputs
* style(core_helpers): apply black formatting
* fix(lint): remove banned typing.List/Dict/Any imports and suppress PLR0913 on interface overrides
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): apply black formatting to modelscope chat transformation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): replace noqa with proper fixes — use **kwargs and Awaitable instead of Any/List
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): remove unused AllMessageValues import
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* revert: restore base_model_iterator.py to original PR state
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): restore full method signatures for MyPy compatibility; bump PLR0913 budget for new provider files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): use @override to suppress PLR0913 on inherited signatures instead of bumping budget
The overrides keep their full base-class signatures for MyPy compatibility, but those signatures carry more than five parameters, which tripped PLR0913 on each subclass redeclaration. Since the arity is dictated by the base class and cannot be reduced, decorate the overrides with typing_extensions.override; ruff treats that as the intended signal that the parameter count is not under the author's control and skips PLR0913. This restores the PLR0913 baseline to 1813.
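For illustration, a minimal sketch of the pattern with hypothetical classes (BaseConfig and ExampleProviderConfig are not the real LiteLLM types). The import falls back to typing_extensions on interpreters older than 3.12.

```python
from typing import Any

try:
    from typing import override  # available from Python 3.12
except ImportError:  # older interpreters; assumes typing_extensions is installed
    from typing_extensions import override


class BaseConfig:
    """Hypothetical base class whose signature has more than five parameters."""

    def transform_request(
        self,
        model: str,
        messages: list[dict[str, Any]],
        optional_params: dict[str, Any],
        litellm_params: dict[str, Any],
        headers: dict[str, str],
        timeout: float,
    ) -> dict[str, Any]:
        raise NotImplementedError


class ExampleProviderConfig(BaseConfig):
    @override  # marks the arity as inherited rather than chosen by this subclass
    def transform_request(
        self,
        model: str,
        messages: list[dict[str, Any]],
        optional_params: dict[str, Any],
        litellm_params: dict[str, Any],
        headers: dict[str, str],
        timeout: float,
    ) -> dict[str, Any]:
        return {"model": model, "messages": messages, **optional_params}


request_body = ExampleProviderConfig().transform_request(
    "example-model", [{"role": "user", "content": "hi"}], {"max_tokens": 8}, {}, {}, 30.0
)
```

The decorator has no runtime effect on the call; it also lets type checkers flag an override whose base method has been renamed or removed.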
* fix(lint): add @override to modelscope image generation overrides
Apply the same typing_extensions.override treatment to the image generation config so its inherited-signature overrides do not count against PLR0913.
---------
Co-authored-by: Joel Tony <github@jaytau.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: hcl <chenglunhu@gmail.com>
Co-authored-by: ztko <96878659+koztkozt@users.noreply.github.com>
Co-authored-by: Nahrin <nahrin@nahrinoda.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Humphrey <a739376838@gmail.com>
Co-authored-by: kursadlacin <kursadlacin@gmail.com>
Co-authored-by: kursad <kursad.lacin@brado.net>
Co-authored-by: Dushyant Acharya <dushyantacharya873@gmail.com>
Co-authored-by: Yuriy <yuriy.shuyskiy@gmail.com>
Co-authored-by: Recep S <22618852+us@users.noreply.github.com>
Co-authored-by: Moshe Malawach <moshe.malawach@protonmail.com>
Co-authored-by: Moshe Malawach <moshemalawach@users.noreply.github.com>
Co-authored-by: Rongkun Yan <2493404415@qq.com>
Co-authored-by: Varshith <kvarshithgowda@gmail.com>
Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>
* ci(lint): add blanket-noqa, dataclass-default, and unused-noqa Ruff rules (#30516)
* ci(lint): enforce blanket-noqa, dataclass-default, and unused-noqa rules
Enable PGH004 (blanket-noqa), RUF008 (mutable-dataclass-default),
RUF009 (function-call-in-dataclass-default-argument), and RUF100
(unused-noqa) in ruff.toml, and clean up every resulting violation.
RUF008/RUF009 were already clean. PGH004/RUF100 surfaced ~335 stale or
blanket noqas: blanket `# noqa` are now scoped to the rule they actually
suppress (mostly T201), dead directives are removed, and inapplicable
codes are trimmed (e.g. F401 dropped from `import *`).
lint.external lists rules enforced outside this config (the strict-rule
gate via ruff-strict.toml and upstream litellm's own ruff config) so
RUF100 keeps the noqa directives that protect them instead of stripping
coverage this config can't see.
* ci(lint): trim RUF100 external list to load-bearing codes only
Drop the 9 precautionary strict-gate codes (ANN001/002/003/401, B006,
PLR0913, PLW0603, RUF012, TID251) that have zero `# noqa` references in
the gated source. Keep only the 11 codes with live suppressions so
RUF100 doesn't flag them as unused. Future strict-gate suppressions can
re-add codes here (or fix the underlying issue) as needed.
* ci: ratchet lint and type-check gates (ruff preview, ANN, mypy, basedpyright) (#30379)
* ci: enable ruff preview rules under the budgeted strict gate
Turn on ruff preview in the strict-budget lane (ruff-strict.toml) only,
leaving the clean gate (ruff.toml) untouched so make lint-ruff stays at
zero. Enumerate the 118 firing codes explicitly with
explicit-preview-rules so the gate is deterministic and stable across
ruff upgrades rather than depending on preview auto-selecting the broad
catalog.
Grandfather the existing 58438 violations into ruff-strict-budget.json
as per-rule baselines with headroom, so only net-new violations fail CI.
The existing ten rules keep their hand-tuned slack; the new rules get
slack 10 when the baseline is 50 or more and 3 otherwise.
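The slack rule for new rules can be sketched as below. The budget layout and the counts are made up for illustration; the actual schema of ruff-strict-budget.json is not shown in this message.

```python
def slack_for(baseline: int) -> int:
    """Headroom for a newly added rule: 10 for baselines of 50 or more, else 3."""
    return 10 if baseline >= 50 else 3


def build_budget(baselines: dict[str, int]) -> dict[str, dict[str, int]]:
    """Grandfather current counts as per-rule baselines with headroom."""
    return {
        rule: {"baseline": count, "slack": slack_for(count)}
        for rule, count in baselines.items()
    }


def rules_over_budget(
    budget: dict[str, dict[str, int]], current: dict[str, int]
) -> list[str]:
    """Return the rules whose current count exceeds baseline plus slack."""
    over = []
    for rule, count in current.items():
        entry = budget.get(rule, {"baseline": 0, "slack": 0})
        if count > entry["baseline"] + entry["slack"]:
            over.append(rule)
    return sorted(over)


# Illustrative counts only, not taken from the repository.
budget = build_budget({"ANN201": 120, "BLE001": 7})
over = rules_over_budget(budget, {"ANN201": 130, "BLE001": 11})
```

Here ANN201 stays within 120 + 10 while BLE001 exceeds 7 + 3, so only the latter would fail the gate.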
* ci: add ANN return-type rules to the budgeted strict gate
Add ANN201/202/204/205/206 (missing return annotations) to the strict
lane and grandfather the existing counts into ruff-strict-budget.json so
the codebase ratchets toward explicit return types without breaking CI.
* ci: add mypy (disallow_untyped_defs) and basedpyright strict gates with baselines
Add two type-check gates, each grandfathering the current tree so only
net-new violations fail CI, matching the ruff strict-budget ratchet.
mypy gains disallow_untyped_defs in litellm/mypy.ini (the config the CI
invocation actually reads; the root [tool.mypy] is not picked up from the
litellm/ working dir). The 4885 existing missing-annotation errors are
captured in litellm/.mypy-baseline.txt and the run is piped through
mypy-baseline filter so new untyped defs are rejected.
basedpyright runs in strict mode over litellm/, with
enableTypeIgnoreComments disabled so it only honors '# pyright: ignore'
and never polices mypy's '# type: ignore'. The existing strict diagnostics
are grandfathered into .basedpyright/baseline.json.
Both tools are pinned in the dev group and uv.lock; the lint workflow and
Makefile run them filtered through their baselines, with
lint-mypy-baseline-update and lint-basedpyright-baseline-update to ratchet.
* ci: raise lint job timeout to 15m for the basedpyright strict pass
* ci: pin pythonVersion 3.12 and regenerate baselines against merged base
Merge litellm_internal_staging so the baselines cover code the CI merge
includes (e.g. the cisco_ai_defense guardrail), which otherwise tripped
the mypy gate with 3 ungrandfathered no-untyped-def errors. Pin
pythonVersion 3.12 in pyrightconfig so basedpyright's strict analysis is
reproducible across interpreter versions (CI runs 3.12).
* ci: regenerate basedpyright baseline against the frozen lint env
The previous baseline was generated with optional provider deps (azure,
google, anthropic, mcp, numpydoc, google-genai) installed locally, so CI's
dev-only env surfaced ~3500 reportUnknown*/reportMissingTypeStubs errors
not in the baseline. Regenerate after uv sync --frozen so the baseline
reflects the same dependency set the lint job sees.
* ci: regenerate basedpyright baseline on python 3.12 frozen env
The prior baseline still carried proxy-dev packages (e.g. prisma) that the
lint job's dev-only, python 3.12 env lacks, leaving 2 unresolved-import
errors ungrandfathered. Regenerate in a python 3.12 venv synced to the
frozen lock with default groups only, so the baseline matches exactly what
CI sees.
* ci: replace type-check baselines with per-file count budgets
The mypy and basedpyright baselines were position-sensitive (and the
basedpyright one was a 27MB file), so ordinary line shifts churned them.
Replace both with a per-file count gate: scripts/type_check_gate.py reduces
each tool's output to errors-per-file and checks it against a committed
{file: max} budget, ignoring line and column numbers. A file fails only
when it gains more errors than its ceiling; debt can't be shuffled between
files because each file has its own cap and new files default to zero.
Budgets (mypy-file-budget.json 48K, basedpyright-file-budget.json 96K) are
generated in the python 3.12 frozen lint env so they match CI. Drops the
mypy-baseline dependency; basedpyright runs without its native baseline.
Ratchet via make lint-mypy-budget-update / lint-basedpyright-budget-update.
* ci: add a small per-file slack to the type-check gate
Allow each file to drift PER_FILE_SLACK (5) errors past its recorded count
before failing, so a basedpyright inference ripple in an unrelated file
doesn't break the build over a couple of errors. Budgets still record exact
counts; the tolerance is applied at check time.
* ci: move type-check slack into the budget json and trim lint timeout
Make slack declarative: the budget is now {"slack": N, "files": {path: count}}
so the tolerance is tuned in JSON without editing the script, mirroring how
ruff-strict-budget.json carries its slack. --update preserves the existing
slack. Also drop the lint job timeout from 15m to 10m; the mypy and
basedpyright passes add ~2m, leaving the job around 4-5m, so 10m is a
comfortable margin.
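A rough sketch of a per-file count gate with declarative slack, in the spirit of the description above. It is not scripts/type_check_gate.py: the regex, the function names, and the choice to apply slack to files missing from the budget are all assumptions.

```python
import re
from collections import Counter

# Matches mypy-style lines such as "pkg/mod.py:10:5: error: ..."; the line and
# column are matched but deliberately not captured.
ERROR_LINE = re.compile(r"^(?P<path>[^:\s][^:]*):\d+(?::\d+)?: error:")


def count_errors_per_file(tool_output: str) -> Counter:
    """Reduce checker output to errors-per-file, ignoring positions."""
    counts: Counter = Counter()
    for line in tool_output.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            counts[match.group("path")] += 1
    return counts


def check_budget(counts: Counter, budget: dict) -> dict[str, tuple[int, int]]:
    """Return {path: (count, ceiling)} for files over their ceiling."""
    slack = budget.get("slack", 0)
    recorded = budget.get("files", {})
    failures = {}
    for path, count in counts.items():
        ceiling = recorded.get(path, 0) + slack  # new files start from zero
        if count > ceiling:
            failures[path] = (count, ceiling)
    return failures


sample_output = "\n".join(
    [
        "litellm/a.py:10:5: error: Function is missing a return type annotation",
        "litellm/a.py:42:1: error: Function is missing a type annotation",
        "litellm/new_module.py:3:1: error: Function is missing a type annotation",
        "litellm/new_module.py:9:1: error: Function is missing a type annotation",
        "Found 4 errors in 2 files (checked 2 source files)",
    ]
)
sample_budget = {"slack": 1, "files": {"litellm/a.py": 1}}

counts = count_errors_per_file(sample_output)
failures = check_budget(counts, sample_budget)
```

Because only counts are compared, shifting the errors in litellm/a.py to different lines would not change the result.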
* ci: collapse fully-adopted ruff categories and drop inert preview flag
ANN (all nine non-removed rules) and BLE (its only rule) were spelled out
code-by-code; replace each with its category selector, which is exactly
equivalent in 0.15.3 (the removed ANN101/ANN102 are skipped by a category
selector and error when named explicitly). explicit-preview-rules was inert:
every selected rule is stable and nothing is selected by category, so the flag
had nothing to gate. Verified the strict-rule counts are identical before and
after (62379 each, zero per-rule drift), so no budget change.
* ci: drop redundant pyright dev dependency
Nothing invokes bare pyright in the Makefile, the linting workflow, or
scripts; the basedpyright gate added on this branch is the only type
checker that runs. based…
Adds LibertAI as a JSON-configured OpenAI-compatible provider.
LibertAI is a confidential AI inference platform: open-weight models (Qwen 3.5/3.6, Gemma 4, DeepSeek V4 Flash, Hermes 3) served from Trusted Execution Environments on a decentralized cloud, behind an OpenAI-compatible API.
- Base URL: https://api.libertai.io/v1 (override: LIBERTAI_API_BASE)
- API key: LIBERTAI_API_KEY
Changes (follows the pattern of #29063 / #29842):
- litellm/llms/openai_like/providers.json: libertai entry
- litellm/constants.py: endpoint + provider registration
- litellm/types/utils.py: LlmProviders.LIBERTAI
- provider_endpoints_support.json: endpoint matrix
- model_prices_and_context_window.json + backup: 12 model entries (11 chat incl. -thinking reasoning variants, 1 embedding) with live pricing and context limits
- tests/test_litellm/llms/openai_like/test_libertai_provider.py (6 tests, all passing)
Verified with a live completion, a reasoning-variant completion, and cost tracking against the production API.
Affiliation: I'm the founder of LibertAI.
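As a rough illustration of what the JSON-configured entry amounts to, the sketch below resolves the base URL and key from the environment and applies the max_completion_tokens to max_tokens mapping noted in the review summary. The helper names are hypothetical and this is not how LiteLLM loads providers.json.

```python
from typing import Any, Mapping, Optional

DEFAULT_API_BASE = "https://api.libertai.io/v1"


def resolve_libertai_settings(env: Mapping[str, str]) -> dict[str, Optional[str]]:
    """Pick the API base (with override) and API key from an environment mapping."""
    return {
        "api_base": env.get("LIBERTAI_API_BASE", DEFAULT_API_BASE),
        "api_key": env.get("LIBERTAI_API_KEY"),
    }


def map_request_params(params: Mapping[str, Any]) -> dict[str, Any]:
    """Rename max_completion_tokens to max_tokens, leaving other params alone."""
    mapped = dict(params)
    if "max_completion_tokens" in mapped:
        mapped["max_tokens"] = mapped.pop("max_completion_tokens")
    return mapped


# Placeholder values for illustration; pass os.environ in real use.
default_settings = resolve_libertai_settings({"LIBERTAI_API_KEY": "sk-example"})
overridden = resolve_libertai_settings(
    {"LIBERTAI_API_KEY": "sk-example", "LIBERTAI_API_BASE": "http://localhost:8000/v1"}
)
mapped_params = map_request_params({"max_completion_tokens": 256, "temperature": 0.2})
```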